TRANSCRIPT
Time-Domain Speech Analysis
Motivation for short-time processing
Fundamental assumptions about speech signals
Short-time speech features: energy, amplitude, amplitude difference, zero-crossing (ZC) rate and autocorrelation
Windowing techniques: rectangular, Hamming, Hanning, raised cosine, etc.
Applications: end-point detection (speech vs. silence), pitch and formant detection
Speech Signal Basics
Time-varying properties: excitation goes from voiced to unvoiced and vice versa; peak amplitude varies with the sound being produced; pitch varies within and across voiced sounds; there are periods of silence where background signals dominate.
Key issue in time-domain processing: create simple methods that enable us to reliably and accurately measure/estimate speech representations (e.g., start and end of utterance, pitch, and so on).
Fundamental Assumptions
Properties of the speech signal change relatively slowly with time (5-10 sounds per second).
Over very short (5-20 msec) intervals: uncertainty due to the small amount of data, varying pitch and amplitude.
Over medium-length (20-100 msec) intervals: uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech.
Over long (100-500 msec) intervals: uncertainty due to the large amount of sound changes.
There is always uncertainty in short-time measurements and estimates from speech signals.
Compromised Solution
Basic idea of short-time processing: short segments of the speech signal are isolated and processed as if they were short segments from a sustained sound with fixed properties.
The short segments are called analysis frames, or simply frames; they may be overlapping (for analysis) or non-overlapping (for coding).
Typical frame size: 80-240 samples (or 10-30 ms at 8K samples/second).
The end result of short-time processing is a new time-varying sequence that serves as a new representation of the speech signal.
Frame-Based Short-Time Processing
Input x(n): 8000 samples per second (human speech is approximately bandlimited at 4 kHz).
Output f(m): a single scalar (e.g., pitch) or vector (e.g., vocal tract parameters) per frame.
The goal of short-time processing is to facilitate the analysis:
speech waveform x(n) → short-time processing → speech representation f(m)
List of Short-Time Features
By selecting different operators T(·), we can obtain:
short-time energy, short-time amplitude, short-time average zero-crossing (ZC) rate, short-time amplitude difference, and short-time autocorrelation.
The issue of selecting the window W(·) will be addressed later (temporarily we assume it is a rectangular window).
We will assume an overlapped window here (a non-overlapped window will be useful in speech coding applications).
Short-Time Energy
Recall the energy of an infinite sequence {x(n)}:

E = \sum_{m=-\infty}^{\infty} x^2(m)

It has little use if the sequence is time-varying. Short-time energy is the summation of squares over a short period:

E_n = \sum_{m=n-N+1}^{n} x^2(m)

It is a time-varying sequence itself. Operator T: T(x) = x^2.
Long-term energy is a single value; short-time energy is a function of n.
Graphical Illustration
As the temporal index n evolves, the window w(n-m) moves as well. Short-time energy can be recursively computed based on the inclusion-and-exclusion principle: E_n = E_{n-1} + x^2(n) - x^2(n-N).
Short-Time Magnitude
Recall: energy uses the square term; magnitude uses the absolute value. Short-time magnitude is the summation of absolute values over a short period; it is time-varying too:

M_n = \sum_{m=n-N+1}^{n} |x(m)|

Operator T: T(x) = |x|.
Average Magnitude Difference Function
Instead of taking the average over the magnitude, we do it over the magnitude difference:

\Delta M_n(p) = \sum_{m=n-N+1}^{n} |x(m) - x(m-p)|

Operator T: T(x) is the concatenation of a linear filter H(z) = 1 - z^{-p} and taking the absolute value |x|. It will be used in a computationally efficient solution to pitch estimation.
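A sketch of the AMDF over a single analysis frame (function name and lag range are my own illustrative choices). For a periodic signal, the AMDF dips to near zero at lags equal to the period, which is what pitch estimators exploit:

```python
import numpy as np

def amdf_frame(frame, max_lag):
    """Delta M(p) = sum |x(m) - x(m-p)| over one frame, p = 1 .. max_lag.

    Dips (near-zero values) occur at lags equal to the signal's
    period, making the AMDF useful for pitch estimation."""
    x = np.asarray(frame, dtype=float)
    return np.array([np.abs(x[p:] - x[:-p]).sum()
                     for p in range(1, max_lag + 1)])
```

Only differences and absolute values are needed (no multiplications), which is why the AMDF is cheaper than autocorrelation on simple hardware.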
Short-Time Average ZC Rate
Zero-crossing rate is a simple measure of the frequency content of a signal. ZC is especially useful for narrowband signals (e.g., sinusoids).
ZC Rate Definition
Operator T: concatenation of sgn, difference, and magnitude. It is simply the total number of sign flips times two:

Z_n = \sum_{m=n-N+1}^{n} |\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]|

where

\mathrm{sgn}(x) = \begin{cases} 1 & x \ge 0 \\ -1 & x < 0 \end{cases}
Issues in ZC Rate Computation
Offset: the signal needs to have zero mean (no DC component) because an offset biases the ZC rate.
Quantization: the signal can be quantized to 1 bit for fast computation.
Sensitivity to noise: ZC does not distinguish 0.00001 → -0.00001 from 1000000 → -1000000, as long as the sign flips.
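A minimal sketch of the ZC count for one frame, handling the offset issue above by removing the mean first (function name is my own; sgn(0) is taken as +1, matching the definition):

```python
import numpy as np

def zc_count(frame):
    """Z_n = sum |sgn[x(m)] - sgn[x(m-1)]| over one frame.

    The DC offset is removed first, since a constant offset would
    bias the zero-crossing rate; sgn(0) is taken as +1."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                      # remove DC component
    s = np.where(x >= 0, 1, -1)           # 1-bit quantization: keep only the sign
    return int(np.abs(np.diff(s)).sum())  # each sign flip contributes 2
```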
Autocorrelation Function
For a deterministic signal x(n), its autocorrelation function is defined by

\phi(k) = \sum_{m=-\infty}^{\infty} x(m)\, x(m+k)

Properties of the autocorrelation function:
\phi(k) = \phi(k+P) for periodic functions (period = P)
\phi(k) = \phi(-k) (proof will be given in class)
|\phi(k)| \le \phi(0), \forall k; \phi(0) is the energy of x(n) (follows from the Cauchy-Schwarz inequality)
Short-Time Autocorrelation Function
Basic idea: the correlation is calculated for a pair of windowed signals:

R_n(k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\; x(m+k)\, w(n-k-m)

where w(n-m) is the window for x(m) and w(n-k-m) is the window for x(m+k). It is easy to show that

R_n(k) = R_n(-k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\; x(m-k)\, w(n+k-m)
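A sketch of the short-time autocorrelation for a single frame with a rectangular window (names are my own; other window shapes would multiply the frame by w before correlating):

```python
import numpy as np

def short_time_autocorr(frame, max_lag):
    """R(k) = sum_m x(m) x(m+k) over one rectangularly windowed frame,
    for k = 0 .. max_lag.

    R(0) is the frame energy; for a periodic signal, R(k) shows a
    strong peak at the period."""
    x = np.asarray(frame, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:])
                     for k in range(max_lag + 1)])
```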
Summary
Short-time energy: E_n = \sum_{m=n-N+1}^{n} x^2(m)
Short-time amplitude: M_n = \sum_{m=n-N+1}^{n} |x(m)|
Short-time average zero-crossing (ZC) rate: Z_n = \sum_{m=n-N+1}^{n} |\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]|
Short-time amplitude difference: \Delta M_n(p) = \sum_{m=n-N+1}^{n} |x(m) - x(m-p)|
Short-time autocorrelation: R_n(k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\; x(m+k)\, w(n-k-m)
Fundamental Assumptions (Encore)
Properties of the speech signal change relatively slowly with time (5-10 sounds per second).
Over very short (5-20 msec) intervals: uncertainty due to the small amount of data, varying pitch and amplitude.
Over medium-length (20-100 msec) intervals: uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech.
Over long (100-500 msec) intervals: uncertainty due to the large amount of sound changes.
There is always uncertainty in short-time measurements and estimates from speech signals.
Why Windowing?
Obvious advantage: localizing a part of speech for analysis.
Less obvious disadvantage: how will the windowing affect the analysis results?
Compare the localization property of two extreme cases, N=1 vs. N=∞: N=1 is best in time but worst in frequency; N=∞ is best in frequency but worst in time.
Fundamentally speaking, windowing is the pursuit of a better tradeoff between time and frequency localization.
Two Degrees of Freedom
Window length: it cannot be too short (too few samples are not enough to resolve the uncertainty), and it cannot be too long (too many samples would introduce more uncertainty).
Window shape: the rectangular window is conceptually the simplest, but suffers from the spectral leakage problem. It is always a tradeoff: there is no universally optimal window; which window is more effective depends on the specific application. By varying the shape of the window, it is possible to significantly reduce spectral leakage.
Midterm Essay Topic #1: Why is spectral leakage a bad thing? In particular, what kind of difficulty does it cause for speech processing?
Advanced Windowing Techniques
Hamming/Hanning window: also known as the raised cosine window.
Blackman window: based on two cosine terms.
Bartlett window: also known as the triangular window.
Kaiser-Bessel window: based on Bessel functions.
Hamming/Hann(ing) Window
N-point Hann window: w(n) = 0.5 - 0.5 \cos(2\pi n / N)
N-point Hamming window: w(n) = 0.54 - 0.46 \cos(2\pi n / N)
N-point raised cosine window: w(n) = a - (1-a) \cos(2\pi n / N)
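A sketch of the raised cosine family exactly as written above (function name is my own; a = 0.5 yields the Hann window, a = 0.54 the Hamming window):

```python
import numpy as np

def raised_cosine_window(N, a):
    """N-point raised cosine window w(n) = a - (1-a) cos(2*pi*n/N).

    a = 0.5 gives the Hann window; a = 0.54 gives the Hamming window."""
    n = np.arange(N)
    return a - (1 - a) * np.cos(2 * np.pi * n / N)

hann = raised_cosine_window(256, 0.5)
hamming = raised_cosine_window(256, 0.54)
```

Note that this follows the 2πn/N convention used in the slide; NumPy's built-in `np.hanning`/`np.hamming` use 2πn/(N-1), so the samples differ slightly near the edges.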
Kaiser-Bessel Window
Based on the zero-order modified Bessel function of the first kind. The parameter B determines the tradeoff between the main lobe and the sidelobes, and is often specified as a half-integer multiple of π.
Speech Detection
Why study it? Save computation (we don't have to process background noise); save bits (we don't have to code background noise); facilitate recognition (we don't mistake background noise for speech).
What are the challenges? The complexity of speech signals and the diversity of background noise.
Difficult Scenarios
Speech signal: weak fricatives (/f/, /th/, /h/) at the beginning or end of an utterance; weak plosive bursts for /p/, /t/ or /k/; nasals at the end of an utterance (often devoiced and at reduced levels); voiced fricatives which become devoiced at the end of an utterance; trailing off of vowel sounds at the end of an utterance.
Background noise: high-level noise (low SNR), non-Gaussian noise, etc.
Algorithm Design
Come up with an attack: which feature should we use? Energy, amplitude, ZC rate or something else?
Implement the idea: collect speech data for experimental use; realize the algorithm in MATLAB (or any other language you are familiar with); analyze your result (if energy does not work, why? if ZC works better than amplitude, why?).
Refine the attack: can we do it recursively? Can we combine two approaches together?
Ad-hoc Algorithm #1
Three steps: calculate the short-time energy E_n of a given speech signal; obtain a threshold T from the energy of the background noise; threshold E_n by T.
Two problems: a speech segment does not necessarily have high energy (especially around the start and end points), and background noise does not necessarily have low energy (some assumption about the SNR is needed).
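A minimal sketch of the three steps above (the function name, non-overlapping framing, and the way the threshold is supplied are my own illustrative choices; a real system would estimate T from a leading stretch of background noise):

```python
import numpy as np

def energy_endpoints(x, frame_len, threshold):
    """Ad-hoc endpoint detection by energy thresholding.

    Computes short-time energy per non-overlapping frame, compares it
    against a noise-derived threshold, and returns the (first, last)
    speech frame indices, or None if no frame exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    energy = np.array([np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    speech = np.flatnonzero(energy > threshold)
    if speech.size == 0:
        return None
    return int(speech[0]), int(speech[-1])
```

This exhibits exactly the two problems noted above: a quiet onset frame can fall below T, and noisy frames can exceed it.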
Ad-hoc Algorithm #2
Three steps: calculate the short-time ZC rate Z_n of a given speech signal; obtain a threshold T from the ZC rate of the background noise; threshold Z_n by T (note that a low ZC rate corresponds to voiced segments).
It is not highly reliable either, because the distributions of ZC rate for speech and noise are not completely separate.
Pitch Period Estimation
Why do we want to estimate the pitch period P? Pitch is one of the most fundamental attributes of speech signals; knowing P facilitates speech coding, speech synthesis, speech enhancement and speaker recognition.
Why is it nontrivial? Speech signals are much more complex than sinusoids.
Possible Attacks
How do we estimate the period of a sinusoid? With the Fourier transform (to be covered in the next chapter). In the time domain, we might consider:
the autocorrelation function (recall its periodicity property), and
the Average Magnitude Difference Function (AMDF), which follows essentially the same motivation but is computationally more efficient.
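A sketch of the autocorrelation attack (the function name and lag bounds are illustrative; real systems derive the lag range from the expected F0 range and the sampling rate, and the AMDF variant would search for a dip instead of a peak):

```python
import numpy as np

def pitch_period_autocorr(frame, min_lag, max_lag):
    """Estimate the pitch period P as the lag that maximizes the
    short-time autocorrelation within [min_lag, max_lag].

    The mean is removed first so a DC offset does not create a
    spurious peak."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    r = np.array([np.dot(x[: len(x) - k], x[k:])
                  for k in range(max_lag + 1)])
    return min_lag + int(np.argmax(r[min_lag: max_lag + 1]))
```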