Fundamentals of Speech Signal Processing

1.0 Speech Signals
Waveform plots of typical vowel sounds - Voiced (濁音)

[Waveform plots over time t for vowels carrying tone 1, tone 2, and tone 4]
Speech Production and Source Model

• Human vocal mechanism
• Speech source model

[Figure: an excitation u(t) drives the vocal tract to produce the speech signal x(t)]
Voiced and Unvoiced Speech

[Figure: excitation u(t) and output x(t); the pitch period is visible in voiced segments and absent in unvoiced segments]
Waveform plots of typical consonant sounds - Unvoiced (清音) and Voiced (濁音)
Waveform plot of a sentence
Frequency domain spectra of speech signals
Voiced Unvoiced
Frequency Domain - Voiced

[Spectrum of a voiced segment; the spectral envelope peaks mark the formant frequencies]
Frequency Domain - Unvoiced

[Spectrum of an unvoiced segment, with formant frequencies indicated]
Spectrogram
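A spectrogram like the ones shown here is computed by taking DFT magnitudes of short, overlapping, windowed frames. A minimal numpy sketch (frame length, hop size, and the test tone are illustrative choices, not taken from the slides):

```python
import numpy as np

def spectrogram(x, frame_len=256, hop=128):
    """Magnitude spectrogram: split the signal into overlapping
    Hamming-windowed frames and take the DFT magnitude of each frame."""
    win = np.hamming(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))  # (num_frames, frame_len//2 + 1)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 500 * t)   # a 500 Hz tone (illustrative)
S = spectrogram(x)
peak_bin = S[0].argmax()          # DFT bin of the strongest component
```

For the 500 Hz tone, the energy concentrates at bin 500·256/8000 = 16 in every frame; for real speech the same plot reveals the formant structure varying over time.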
Formant Frequencies
Formant frequency contours for the sentence "He will allow a rare lie."

Reference: 6.1 of Huang, or 2.2, 2.3 of Rabiner and Juang
2.0 Speech Signal Processing
Speech Signal Processing

• Major Application Areas
  1. Speech Coding: digitization and compression
     Considerations: 1) bit rate (bps); 2) recovered quality; 3) computational complexity/feasibility
  2. Voice-based Network Access:
     user interface, content analysis, user-content interaction

[Block diagram: x(t) → LPF → sampling → x[n] → processing algorithms → codes xk (110101…) → storage/transmission → inverse processing → x̂[n]]

• Speech Signals
  – carry linguistic knowledge and human information: characters, words, phrases, sentences, concepts, etc.
  – double levels of information: acoustic signal level / symbolic (linguistic) level
  – processing and interaction of the double-level information
Sampling of Signals

[Figure: a continuous signal x(t) over t is sampled to the discrete sequence x[n] over n]
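Sampling simply evaluates x(t) at the instants t = n/fs. A short sketch (the tone frequency and sampling rate are illustrative; 8 kHz is a typical telephone-speech rate):

```python
import numpy as np

f0 = 100.0     # tone frequency in Hz (illustrative)
fs = 8000.0    # sampling rate in Hz; must exceed twice the highest
               # signal frequency for the samples to represent x(t)
n = np.arange(16)
x = np.sin(2 * np.pi * f0 * n / fs)   # x[n] = x(n / fs)
```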
Double Levels of Information

字 (character) → 詞 (word) → 句 (sentence)

[Example: the characters 人, 用, 電, 腦 group into the words 人 (person), 用 (use), 電腦 (computer), forming the sentence 人用電腦, "people use computers"]
Speech Signal Processing – Processing of Double-Level Information

• Speech signal • Sampling • Processing (algorithms, on chips or computers)
• Linguistic structure • Linguistic knowledge (lexicon, grammar)

[Example: the recognized characters 今 天 的 天 氣 非 常 好 are grouped into the words 今天 的 天氣 非常 好, "the weather today is very good"]
Voice-based Network Access

[Diagram: user interface ↔ Internet ↔ content analysis, with user-content interaction spanning both]

User Interface: needed when keyboards/mice are inadequate
Content Analysis: helps in browsing/retrieval of multimedia content
User-Content Interaction: all text-based interaction can be accomplished by spoken language
User Interface: Wireless Communications Technologies are Creating a Whole Variety of User Terminals

• Usable at any time, from anywhere: smart phones, hand-held devices, notebooks, vehicular electronics, hands-free interfaces, home appliances, wearable devices…
• Small in size, light in weight, ubiquitous, invisible… the post-PC era
• Keyboard/mouse, most convenient for PCs, are no longer convenient: human fingers never shrink, and the application environment has changed
• Service requirements are growing exponentially
• Voice is the only interface convenient for all user terminals at any time, from anywhere, and to the point in one utterance
• Speech processing is the only less mature part in the technology chain

[Diagram: user terminals ↔ Internet/networks ↔ text content and multimedia content]
Content Analysis: Multimedia Technologies are Creating a New World of Multimedia Content

• The most attractive form of network content will be multimedia, which usually includes speech information (but probably not text)
• Multimedia content is difficult to summarize and show on the screen, and thus difficult to browse
• The speech information, if included, usually tells the subjects, topics and concepts of the multimedia content, and thus becomes the key for browsing and retrieval
• Multimedia content analysis can therefore be based on speech information

Future Integrated Networks
• Real-time information: weather, traffic, flight schedules, stock prices, sports scores
• Special services: Google, Facebook, YouTube, Amazon
• Knowledge archives: digital libraries, virtual museums
• Intelligent working environment: e-mail processors, intelligent agents, teleconferencing, distance learning, electronic commerce
• Private services: personal notebooks, business databases, home appliances, network entertainment
User-Content Interaction: Wireless and Multimedia Technologies are Creating an Era of Network Access by Spoken Language Processing

• Network access is primarily text-based today, but almost all roles of text can be accomplished by speech
• User-content interaction can be accomplished by spoken and multi-modal dialogues
• Hand-held devices with multimedia functionalities are commonly used today
• Speech instructions can access multimedia content whose key concepts are specified by speech information

[Diagram: voice input/output exchanges voice and text information with the Internet and multimedia content; supporting technologies include multimedia content analysis, text information retrieval, voice-based information retrieval, text-to-speech synthesis, and spoken and multi-modal dialogue]
3.0 Speech Coding
Waveform-based Approaches

• Pulse-Coded Modulation (PCM)
  – binary representation for each sample x[n] by quantization
• Differential PCM (DPCM)
  – encoding the differences
    d[n] = x[n] − x[n−1]
    or, with a higher-order predictor, d[n] = x[n] − Σ_{k=1}^{P} ak x[n−k]
• Adaptive DPCM (ADPCM)
  – with adaptive algorithms

Ref: Haykin, "Communication Systems", 4th ed., Sections 3.7, 3.13, 3.14, 3.15
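The DPCM idea can be sketched in a few lines. This is a minimal first-order sketch (step size and test signal are illustrative); the encoder tracks the decoder's reconstruction so that quantization error does not accumulate:

```python
import numpy as np

def dpcm_encode(x, step=0.05):
    """First-order DPCM: quantize the difference between each sample
    and the decoder's reconstruction of the previous sample."""
    codes = []
    prev = 0.0
    for sample in x:
        d = sample - prev          # prediction residual (predictor = previous sample)
        q = int(round(d / step))   # uniform quantizer index
        codes.append(q)
        prev = prev + q * step     # track the decoder's reconstruction
    return codes

def dpcm_decode(codes, step=0.05):
    """Invert the encoder: accumulate the dequantized differences."""
    out, prev = [], 0.0
    for q in codes:
        prev = prev + q * step
        out.append(prev)
    return out

# A slowly varying signal has small residuals, so the difference codes
# need fewer bits than the raw samples would.
t = np.linspace(0, 1, 200)
x = np.sin(2 * np.pi * 3 * t)
codes = dpcm_encode(x)
x_hat = dpcm_decode(codes)
err = max(abs(a - b) for a, b in zip(x, x_hat))
```

Because the quantizer is in the loop, the reconstruction error stays bounded by half a quantizer step per sample.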
Speech Source Model and Source Coding

• Speech Source Model
  – digitization and transmission of the model parameters is adequate
  – at the receiver, the parameters can reproduce x[n] with the model
  – far fewer parameters, varying much more slowly in time, lead to far fewer bits required
  – this is the key to low-bit-rate speech coding

[Model: an excitation generator (excitation parameters) produces u[n], with transforms U(ω), U(z), which drives a vocal tract model (vocal tract parameters) G(ω), G(z), g[n] to produce x[n]]

  x[n] = u[n] ∗ g[n]
  X(ω) = U(ω) G(ω)
  X(z) = U(z) G(z)
Speech Source Model

[Figure: the speech waveform x(t) over t varies quickly, while the model parameters a[n] over n vary slowly]
Speech Source Model and Source Coding

Analysis and Synthesis
– high computation requirements are the price paid for the low bit rate
Simplified Speech Source Model

– Excitation parameters:
  v/u : voiced/unvoiced decision
  N : pitch period for voiced speech
  G : signal gain
  together they define the excitation signal u[n]

[Diagram: a periodic pulse train generator (voiced, period N) or a random sequence generator (unvoiced) is selected by the v/u switch and scaled by the gain G to form u[n], which drives the vocal tract model G(z), G(ω), g[n] to produce x[n]]

  G(z) = 1 / (1 − Σ_{k=1}^{P} ak z^−k)

– Vocal tract parameters {ak} : LPC coefficients, modeling the formant structure of speech signals
– A good approximation, though not precise enough

Reference: 3.3.1–3.3.6 of Rabiner and Juang, or 6.3 of Huang
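The all-pole model can be exercised directly. A numpy sketch with a hypothetical 2nd-order coefficient pair and a pulse-train (voiced) excitation; the coefficients and pitch period are illustrative, not values from the slides:

```python
import numpy as np

def lpc_synthesize(a, u):
    """All-pole synthesis filter: x[n] = u[n] + sum_k a[k] * x[n-k],
    i.e. the difference equation of G(z) = 1 / (1 - sum_k a_k z^-k)."""
    P = len(a)
    x = np.zeros(len(u))
    for n in range(len(u)):
        x[n] = u[n] + sum(a[k] * x[n - 1 - k]
                          for k in range(P) if n - 1 - k >= 0)
    return x

# Voiced excitation: periodic pulse train with pitch period N = 80 samples
N = 80
u = np.zeros(400)
u[::N] = 1.0
a = [1.3, -0.9]   # stable complex-pole pair (pole radius ~0.95) -> one resonance
x = lpc_synthesize(a, u)
```

Each pitch pulse excites a decaying resonance of the filter; the resonance frequencies of G(z) play the role of the formants.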
LPC Vocoder (Voice Coder)

– N obtained by pitch detection; v/u by voicing detection
– {ak} can be non-uniformly quantized or vector quantized to reduce the bit rate further

Ref: 3.3 (3.3.1 up to 3.3.9) of Rabiner and Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993
Multipulse LPC

– poor modeling of u[n] is the main source of quality degradation in the LPC vocoder
– u[n] is replaced by a sequence of pulses:

  u[n] = Σk bk δ[n − nk]

– roughly 8 pulses per pitch period
– u[n] close to periodic for voiced speech
– u[n] close to random for unvoiced speech
– estimating (bk, nk) is a difficult problem
Multipulse LPC

Estimating (bk, nk) is a difficult problem
– analysis by synthesis
– a large amount of computation is the price paid for better speech quality
Multipulse LPC

Perceptual Weighting

  W(z) = (1 − Σ_{k=1}^{P} ak z^−k) / (1 − Σ_{k=1}^{P} ak c^k z^−k), 0 < c < 1

chosen for perceptual sensitivity:
  W(z) = 1, if c = 1
  W(z) = 1 − Σ_{k=1}^{P} ak z^−k, if c = 0
  practically c ≈ 0.8

Error Evaluation

  E = ∫₀^{2π} |X(ω) − X̂(ω)|² W(ω) dω
Multipulse LPC

Error Minimization and Pulse Search

  u[n] = Σk bk δ[n − nk]
  x̂[n] = Σk bk g[n − nk]
  E = E(b1, n1, b2, n2, …)

– sub-optimal solution: finding one pulse at a time
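The one-pulse-at-a-time search can be sketched as follows (a minimal sketch using the plain squared error rather than the perceptually weighted E, for brevity; the impulse response and target are toy values):

```python
import numpy as np

def greedy_multipulse(x, g, n_pulses):
    """Sub-optimal pulse search: add one pulse at a time, each time picking
    the position n_k and amplitude b_k that most reduce the residual
    energy || x - sum_k b_k g[n - n_k] ||^2."""
    residual = np.asarray(x, dtype=float).copy()
    pulses = []
    for _ in range(n_pulses):
        best = None
        for n0 in range(len(x)):
            contrib = np.zeros(len(x))
            L = min(len(g), len(x) - n0)
            contrib[n0:n0 + L] = g[:L]     # g shifted to position n0
            denom = contrib @ contrib
            if denom == 0.0:
                continue
            b = (residual @ contrib) / denom   # least-squares amplitude here
            err = float(np.sum((residual - b * contrib) ** 2))
            if best is None or err < best[0]:
                best = (err, n0, b, contrib)
        _, n0, b, contrib = best
        residual = residual - b * contrib
        pulses.append((n0, b))
    return pulses, residual

# If the target is exactly one scaled, shifted copy of g, a single pulse recovers it.
g = np.array([1.0, 0.5, 0.25])
x = np.zeros(10)
x[2:5] = 2.0 * g
pulses, residual = greedy_multipulse(x, g, 1)
```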
Code-Excited Linear Prediction (CELP)

Use of VQ to Construct a Codebook of Excitation Sequences
– a sequence consists of roughly 40 samples
– a codebook of 512 ~ 1024 patterns is constructed with vector quantization (VQ)
– roughly 512 ~ 1024 excitation patterns are perceptually adequate

Excitation Search: analysis by synthesis
– 9 ~ 10 bits are needed for the excitation of 40 samples, while the {ak} parameters in G(z) are also vector quantized
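The analysis-by-synthesis codebook search in miniature (a toy 8-sample frame and 2-entry codebook, and no perceptual weighting; real CELP uses ~40-sample frames and 512 ~ 1024 codewords):

```python
import numpy as np

def celp_search(x, codebook, g):
    """Analysis-by-synthesis: filter each candidate excitation codeword
    through the synthesis impulse response g, choose the index and gain
    minimizing the squared error against the target frame x."""
    best = None
    for idx, c in enumerate(codebook):
        y = np.convolve(c, g)[:len(x)]   # synthesized frame for this codeword
        denom = y @ y
        if denom == 0.0:
            continue
        gain = (x @ y) / denom           # optimal gain for this codeword
        err = float(np.sum((x - gain * y) ** 2))
        if best is None or err < best[0]:
            best = (err, idx, gain)
    _, idx, gain = best
    return idx, gain

g = np.array([1.0, 0.6, 0.36])           # toy synthesis impulse response
codebook = [np.array([1.0, 0, 0, 0, 1.0, 0, 0, 0]),
            np.array([0, 1.0, 0, -1.0, 0, 1.0, 0, -1.0])]
x = 0.5 * np.convolve(codebook[0], g)[:8]   # target built from codeword 0, gain 0.5
idx, gain = celp_search(x, codebook, g)
```

Only the codeword index and gain need to be transmitted, which is where the 9 ~ 10 bits per 40-sample excitation come from.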
Code-Excited Linear Prediction (CELP)

Receiver
– the {ak} codewords can be transmitted less frequently than the excitation codewords

Ref: Gold and Morgan, "Speech and Audio Signal Processing", John Wiley & Sons, 2000, Chap. 33
4.0 Speech Recognition and Voice-based Network Access
Speech Recognition as a Pattern Recognition Problem

[Diagram: an unknown speech signal x(t) passes through feature extraction to give the feature vector sequence X; pattern matching compares X against reference patterns, and decision making produces the output word W. Training speech y(t) passes through the same feature extraction to give Y, from which the reference patterns are built]
Basic Approach for Large Vocabulary Speech Recognition

• A Simplified Block Diagram
• Example input sentence: "this is speech"
• Acoustic Models (聲學模型): (th-ih-s-ih-z-s-p-iy-ch)
• Lexicon: (th-ih-s) → this, (ih-z) → is, (s-p-iy-ch) → speech
• Language Model (語言模型): (this) – (is) – (speech)
  P(this) P(is | this) P(speech | this is)
  P(wi | wi-1) : bi-gram language model
  P(wi | wi-1, wi-2) : tri-gram language model, etc.

[Diagram: input speech → front-end signal processing → feature vectors → linguistic decoding and search algorithm → output sentence. Acoustic models are trained from speech corpora; the language model is constructed from text corpora; the lexicon links the two]
2.0 Fundamentals of Speech Recognition

Hidden Markov Models (HMM)

˙ Formulation
  Ot = [x1, x2, … xD]^T : feature vector for the frame at time t
  qt = 1, 2, 3, … N : state number for feature vector Ot
  A = [aij], aij = Prob[qt = j | qt−1 = i] : state transition probabilities
  B = [bj(o), j = 1, 2, … N] : observation probabilities
    bj(o) = Σ_{k=1}^{M} cjk bjk(o)
    bjk(o) : multivariate Gaussian distribution for the k-th mixture of the j-th state
    M : total number of mixtures, with Σ_{k=1}^{M} cjk = 1
  π = [π1, π2, … πN] : initial probabilities, πi = Prob[q1 = i]
  HMM : λ = (A, B, π)

[Diagram: a left-to-right HMM with self-loops a11, a22 and transitions a12, a23, a13, a24 between states, each state with an output distribution b1(o), b2(o), b3(o); the observation sequence o1 o2 o3 o4 … is aligned with the state sequence q1 q2 q3 q4 …]
Observation Sequences
1-dim Gaussian Mixtures
State Transition Probabilities
Hidden Markov Models (HMM)

˙ Double Layers of Stochastic Processes
  – hidden states with random transitions, for time warping
  – random output given the state, for random acoustic characteristics

˙ Three Basic Problems
  (1) Evaluation Problem:
      Given O = (o1, o2, … ot … oT) and λ = (A, B, π), find Prob[O | λ]
  (2) Decoding Problem:
      Given O = (o1, o2, … ot … oT) and λ = (A, B, π), find the best state sequence q = (q1, q2, … qt, … qT)
  (3) Learning Problem:
      Given O, find the best values for the parameters in λ such that Prob[O | λ] is maximized
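The evaluation problem is classically solved by the forward algorithm. A sketch for a discrete-observation HMM (the slides use Gaussian-mixture observations; discrete symbols keep the example short, and the toy parameters are illustrative):

```python
import numpy as np

def forward_prob(A, B, pi, obs):
    """Forward algorithm for the evaluation problem: Prob[O | lambda].
    A[i, j] : state transition probabilities a_ij
    B[j, o] : discrete observation probabilities b_j(o)
    pi[i]   : initial state probabilities
    obs     : observation symbol index sequence (o1 ... oT)"""
    alpha = pi * B[:, obs[0]]           # alpha_1(i) = pi_i * b_i(o1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] b_j(o_t)
    return alpha.sum()                  # Prob[O | lambda] = sum_i alpha_T(i)

# A toy 2-state, 2-symbol HMM (numbers are illustrative)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # state 0 mostly emits symbol 0
              [0.2, 0.8]])   # state 1 mostly emits symbol 1
pi = np.array([0.5, 0.5])
p = forward_prob(A, B, pi, [0, 1, 0])
```

The recursion sums over all state sequences in O(N²T) time instead of enumerating the Nᵀ sequences explicitly.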
Simplified HMM
RGBGGBBGRRR……
Peripheral Processing for Human Perception (P.34 of 7.0 )
Feature Extraction (Front-end Signal Processing)

˙ Mel-Frequency Cepstral Coefficients (MFCC)
  – Mel-scale filter bank: triangular, overlapped filters, uniformly spaced below 1 kHz and logarithmically spaced above 1 kHz

˙ Delta Coefficients
  – 1st/2nd-order differences

[Pipeline: windowed speech samples → Discrete Fourier Transform → | · |² → Mel-scale filter bank → log(·) → Inverse Discrete Fourier Transform → MFCC]
Mel-scale Filter Bank
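One common construction of the triangular mel-scale filter bank, as a sketch; the 2595·log10(1 + f/700) mel formula and the parameter values are one widespread convention, not necessarily the one behind this figure:

```python
import numpy as np

def mel(f):
    """Hz -> mel, roughly linear below 1 kHz and logarithmic above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    """mel -> Hz (inverse of mel)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=26, n_fft=512, fs=8000):
    """Triangular, overlapped filters whose center frequencies are
    uniformly spaced on the mel scale between 0 and fs/2."""
    edges = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * edges / fs).astype(int)  # edge DFT bins
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, ctr, hi = bins[i], bins[i + 1], bins[i + 2]
        for b in range(lo, ctr):             # rising edge of triangle i
            fb[i, b] = (b - lo) / max(ctr - lo, 1)
        for b in range(ctr, hi):             # falling edge of triangle i
            fb[i, b] = (hi - b) / max(hi - ctr, 1)
    return fb

fb = mel_filter_bank()
```

Multiplying a frame's power spectrum by `fb.T` gives the per-filter energies that are then logged and inverse-transformed to MFCCs.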
Language Modeling: N-gram

W = (w1, w2, w3, … wi, … wR) : a word sequence

– Evaluation of P(W):
  P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, … wi−1)

– Assumption:
  P(wi | w1, w2, … wi−1) = P(wi | wi−N+1, wi−N+2, … wi−1)
  i.e., the occurrence of a word depends on the previous N − 1 words only

N-gram language models:
  N = 1 : unigram P(wi)
  N = 2 : bigram P(wi | wi−1)
  N = 3 : tri-gram P(wi | wi−2, wi−1)
  N = 4 : four-gram P(wi | wi−3, wi−2, wi−1)

Probabilities are estimated from a training text database.

Example: tri-gram model
  P(W) = P(w1) P(w2 | w1) Π_{i=3}^{R} P(wi | wi−2, wi−1)
N-gram

[Illustration: for the word sequence w1 w2 w3 w4 w5 w6 … wR, the general model conditions each wi on all preceding words, while the tri-gram model conditions each wi on only the two preceding words]

  P(W) = P(w1) Π_{i=2}^{R} P(wi | w1, w2, … wi−1)
  tri-gram: P(W) = P(w1) P(w2 | w1) Π_{i=3}^{R} P(wi | wi−2, wi−1)
Language Modeling: Evaluation of N-gram Model Parameters

unigram:
  P(wi) = N(wi) / Σ_{j=1}^{V} N(wj)
  wi : a word in the vocabulary
  V : total number of different words in the vocabulary
  N(·) : number of counts in the training text database

bigram:
  P(wj | wk) = N(<wk, wj>) / N(wk)
  <wk, wj> : a word pair

trigram:
  P(wj | wk, wm) = N(<wk, wm, wj>) / N(<wk, wm>)

smoothing: estimation of the probabilities of rare events by statistical approaches
Example counts in a training text database:

  … this …        50000
  … this is …       500
  … this is a …       5

  Prob[is | this] = 500/50000 = 0.01
  Prob[a | this is] = 5/500 = 0.01
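The count-based estimates above fit in a few lines (toy corpus; note this plain relative-frequency estimate has no smoothing, so unseen n-grams simply get no entry):

```python
from collections import Counter

def ngram_probs(tokens, n):
    """Maximum-likelihood n-gram estimates: P(w | history) =
    N(history + w) / N(history), counted from a training token list.
    As on the slide, the history count is the raw count of the
    (n-1)-gram, including its occurrence at the end of the corpus."""
    grams = Counter(tuple(tokens[i:i + n])
                    for i in range(len(tokens) - n + 1))
    hist = Counter(tuple(tokens[i:i + n - 1])
                   for i in range(len(tokens) - n + 2))
    return {g: c / hist[g[:-1]] for g, c in grams.items()}

tokens = "this is speech this is a test this is speech".split()
bigram = ngram_probs(tokens, 2)
p_is_given_this = bigram[("this", "is")]   # N(this is) / N(this)
```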
Large Vocabulary Continuous Speech Recognition

W = (w1, w2, … wR) : a word sequence
X = (x1, x2, … xT) : feature vectors for a speech utterance

  W* = arg max_W Prob(W | X)   (MAP principle)
  Prob(W | X) = Prob(X | W) P(W) / Prob(X)

Since Prob(X) does not depend on W, this is equivalent to maximizing
Prob(X | W) P(W), with Prob(X | W) given by the HMMs and P(W) by the language model.

˙ A Search Process Based on Three Knowledge Sources
  – Acoustic Models : HMMs for basic voice units (e.g. phonemes)
  – Lexicon : a database of all possible words in the vocabulary, each word including its pronunciation in terms of component basic voice units
  – Language Models : based on words in the lexicon

[Diagram: the search for W* draws on the acoustic models, the lexicon, and the language models]
Text-to-Speech Synthesis

[Pipeline: input text → text analysis and letter-to-sound conversion (lexicon and rules) → prosody generation (prosodic model) → signal processing and concatenation (voice unit database) → output speech signal]

• Transforming any input text into corresponding speech signals
• E-mail/Web page reading
• Prosodic modeling
• Basic voice units/rule-based; non-uniform units/corpus-based; model-based
Text-to-Speech Synthesis

· Three Major Steps
  – text analysis
  – prosody generation
  – speech synthesis

Example sentence: 假如今天胡適先生還活著，又會怎樣呢？ ("If Mr. Hu Shih were still alive today, what would things be like?")

text analysis (word segmentation and part-of-speech tagging):
  假如 | 今天 | 胡適 | 先生 | 還 | 活 | 著 | ， | 又 | 會 | 怎樣 | 呢 | ？
  Cbb, Nd, Nb, Na, Dfa, VH, Di, D, D, D, T
  break indices: 5 1 3 1 2 1 1 4 1 2 1 5

prosody generation: energy, pause, intonation, tone, duration

speech synthesis
Automatic Prosodic Analysis for an Arbitrary Text Sentence

· Predict break indices B1, B2, B3 (B4, B5 are determined by punctuation marks)
· Minor phrase patterns: (Cbb+Nd, Nb+Na, Dfa+VH+Di, D+D, D+T, etc.)

[Lattice: from start to end over the tag sequence Cbb Nd Nb Na Dfa VH Di, candidate minor-phrase groupings include Cbb+Nd, Cbb+Nd+Nb, Nd+Nb, Nb+Na, Na+Dfa, Dfa+VH, Dfa+VH+Di, VH+Di]

Example:
  假如 | 今天 胡適 | 先生 還 | 活 | 著 | ， 又 | 會 怎樣 | 呢 | ？
  Cbb, Nd, Nb, Na, Dfa, VH, Di, D, D, D, T
  break indices: 5 1 3 1 2 1 1 4 1 2 1 5
  prosodic hierarchy: words → minor phrases → major phrases → prosodic groups → breath groups
Speech Understanding

• Understanding the speaker's intention rather than transcribing into word strings
• Limited domains/finite tasks

[Pipeline: input utterance → syllable recognition (acoustic models) → syllable lattice → key phrase matching (phrase lexicon) → phrase graph → semantic decoding (concept set, phrase/concept language model) → concept graph → understanding results]

• An Example
  utterance: 請幫我查一下 台灣銀行 的 電話號碼 是幾號？ ("Please look up for me: what is the phone number of the Bank of Taiwan?")
  key phrases: (查一下) – (台灣銀行) – (電話號碼)
  concepts: (inquiry) – (target) – (phone number)
Speaker Verification

[Pipeline: input speech → feature extraction → verification against speaker models → yes/no]

• Verifying that the speaker is who they claim to be
• Applications requiring verification
• Text-dependent/independent operation
• Can be integrated with other verification schemes
Voice-based Information Retrieval

• Speech instructions
• Speech documents (or multimedia documents including speech information)

[Illustration: an instruction in speech or text, e.g. 我想找有關新政府組成的新聞？ ("I want to find news about the formation of the new government"), is matched against text documents d1, d2, d3 and against speech documents, e.g. 總統當選人陳水扁今天早上… ("President-elect Chen Shui-bian this morning…")]
Spoken Dialogue Systems

• Almost all human-network interactions can be accomplished by spoken dialogue
• Speech understanding, speech synthesis, dialogue management
• System/user/mixed initiatives
• Reliability/efficiency, dialogue modeling/flow control
• Transaction success rate/average number of dialogue turns

[Architecture: users reach the dialogue server over the Internet/networks; input speech → speech recognition and understanding → user's intention → dialogue manager (maintaining discourse context, consulting databases) → response to the user → sentence generation and speech synthesis → output speech]
References

"Speech and Language Processing over the Web", IEEE Signal Processing Magazine, May 2008