TRANSCRIPT
Time-Domain Speech Analysis
Motivation for short-time processing
Fundamental assumptions about speech signals
Short-time speech features: energy, amplitude, amplitude difference, zero-crossing (ZC) rate and autocorrelation
Windowing techniques: rectangular, Hamming, Hanning, raised cosine, etc.
Applications: end-point detection (speech vs. silence), pitch and formant detection
Speech Signal Basics
Time-varying properties: excitation goes from voiced to unvoiced and vice versa; peak amplitude varies with the sound being produced; pitch varies within and across voiced sounds; there are periods of silence where background signals dominate.
Key issue in time-domain processing: create simple methods that enable us to reliably and accurately measure/estimate speech representations (e.g., start and end of utterance, pitch, and so on).
Fundamental Assumptions
Properties of the speech signal change relatively slowly with time (5-10 sounds per second).
Over very short (5-20 msec) intervals: uncertainty due to the small amount of data, varying pitch and amplitude.
Over medium-length (20-100 msec) intervals: uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech.
Over long (100-500 msec) intervals: uncertainty due to the large amount of sound changes.
There is always uncertainty in short-time measurements and estimates from speech signals.
Compromised Solution
Basic idea of short-time processing: short segments of the speech signal are isolated and processed as if they were short segments from a sustained sound with fixed properties.
The short segments are called analysis frames, or simply frames; they may be overlapping (for analysis) or non-overlapping (for coding).
Typical frame size: 80-240 samples (or 10-30 ms at 8K samples/second).
The end result of short-time processing is a new time-varying sequence that serves as a new representation of the speech signal.
Frame-Based Short-Time Processing
Input x(n): 8000 samples per second (human speech is approximately bandlimited at 4 kHz).
Output f(m): a single scalar (e.g., pitch) or vector (e.g., vocal tract parameters) per frame.
The goal of short-time processing is to facilitate the analysis:
speech waveform x(n) → short-time processing → speech representation f(m)
List of Short-Time Features
By selecting different operators T(·), we can obtain:
short-time energy, short-time amplitude, short-time average zero-crossing (ZC) rate, short-time amplitude difference, and short-time autocorrelation.
The issue of selecting the window W(·) will be addressed later (temporarily we assume it is a rectangular window).
We will assume an overlapped window here (a non-overlapped window will be useful in speech coding applications).
Short-Time Energy
Recall the energy of an infinite sequence {x(n)}:

E = \sum_{m=-\infty}^{\infty} x^2(m)

It has little use if the sequence is time-varying. Short-time energy is the summation of squares over a short period:

E_n = \sum_{m=n-N+1}^{n} x^2(m)

It is a time-varying sequence itself. Operator T: T(x) = x^2.
Long-term energy is a single value; short-time energy is a function of n.
Graphical Illustration
As the temporal index n evolves, the window w(n-m) moves as well. Short-time energy can be recursively computed based on the inclusion-and-exclusion principle: E_n = E_{n-1} + x^2(n) - x^2(n-N).
Short-Time Magnitude
Recall: energy uses the square term; magnitude uses the absolute value. Short-time magnitude is the summation of absolute values over a short period; it is time-varying too:

M_n = \sum_{m=n-N+1}^{n} |x(m)|

Operator T: T(x) = |x|.
Average Magnitude Difference Function
Instead of taking the average over the magnitude, we do it over the magnitude difference:

\Delta M_n(p) = \sum_{m=n-N+1}^{n} |x(m) - x(m-p)|

Operator T: T(x) is the concatenation of a linear filter H(z) = 1 - z^{-p} and taking the absolute value |x|. It will be used in a computationally efficient solution to pitch estimation.
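A sketch of the AMDF over a single analysis frame (function name and lag range are my own illustrative choices). For a periodic signal, the AMDF dips to near zero at lags equal to the period, which is what pitch estimators exploit:

```python
import numpy as np

def amdf_frame(frame, max_lag):
    """Delta M(p) = sum |x(m) - x(m-p)| over one frame, p = 1 .. max_lag.

    Dips (near-zero values) occur at lags equal to the signal's
    period, making the AMDF useful for pitch estimation."""
    x = np.asarray(frame, dtype=float)
    return np.array([np.abs(x[p:] - x[:-p]).sum()
                     for p in range(1, max_lag + 1)])
```

Only differences and absolute values are needed (no multiplications), which is why the AMDF is cheaper than autocorrelation on simple hardware.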
Short-Time Average ZC Rate
Zero-crossing rate is a simple measure of the frequency content of a signal. ZC is especially useful for narrowband signals (e.g., sinusoids).
ZC Rate Definition
Operator T: concatenation of sgn, difference, and magnitude. It is simply the total number of sign flips times two:

Z_n = \sum_{m=n-N+1}^{n} |\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]|

where

\mathrm{sgn}(x) = \begin{cases} 1 & x \ge 0 \\ -1 & x < 0 \end{cases}
Issues in ZC Rate Computation
Offset: the signal needs to have zero mean (no DC component) because an offset biases the ZC rate.
Quantization: the signal can be quantized to 1 bit for fast computation.
Sensitivity to noise: ZC does not distinguish 0.00001 → -0.00001 from 1000000 → -1000000, as long as the sign flips.
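A minimal sketch of the ZC count for one frame, handling the offset issue above by removing the mean first (function name is my own; sgn(0) is taken as +1, matching the definition):

```python
import numpy as np

def zc_count(frame):
    """Z_n = sum |sgn[x(m)] - sgn[x(m-1)]| over one frame.

    The DC offset is removed first, since a constant offset would
    bias the zero-crossing rate; sgn(0) is taken as +1."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                      # remove DC component
    s = np.where(x >= 0, 1, -1)           # 1-bit quantization: keep only the sign
    return int(np.abs(np.diff(s)).sum())  # each sign flip contributes 2
```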
Autocorrelation Function
For a deterministic signal x(n), its autocorrelation function is defined by

\phi(k) = \sum_{m=-\infty}^{\infty} x(m)\, x(m+k)

Properties of the autocorrelation function:
\phi(k) = \phi(k+P) for periodic functions (period = P)
\phi(k) = \phi(-k) (proof will be given in class)
|\phi(k)| \le \phi(0), \forall k; \phi(0) is the energy of x(n) (follows from the Cauchy-Schwarz inequality)
Short-Time Autocorrelation Function
Basic idea: the correlation is calculated for a pair of windowed signals:

R_n(k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\; x(m+k)\, w(n-k-m)

where w(n-m) is the window for x(m) and w(n-k-m) is the window for x(m+k). It is easy to show that

R_n(k) = R_n(-k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\; x(m-k)\, w(n+k-m)
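A sketch of the short-time autocorrelation for a single frame with a rectangular window (names are my own; other window shapes would multiply the frame by w before correlating):

```python
import numpy as np

def short_time_autocorr(frame, max_lag):
    """R(k) = sum_m x(m) x(m+k) over one rectangularly windowed frame,
    for k = 0 .. max_lag.

    R(0) is the frame energy; for a periodic signal, R(k) shows a
    strong peak at the period."""
    x = np.asarray(frame, dtype=float)
    return np.array([np.dot(x[: len(x) - k], x[k:])
                     for k in range(max_lag + 1)])
```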
Summary
Short-time energy: E_n = \sum_{m=n-N+1}^{n} x^2(m)
Short-time amplitude: M_n = \sum_{m=n-N+1}^{n} |x(m)|
Short-time average zero-crossing (ZC) rate: Z_n = \sum_{m=n-N+1}^{n} |\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]|
Short-time amplitude difference: \Delta M_n(p) = \sum_{m=n-N+1}^{n} |x(m) - x(m-p)|
Short-time autocorrelation: R_n(k) = \sum_{m=-\infty}^{\infty} x(m)\, w(n-m)\; x(m+k)\, w(n-k-m)
Fundamental Assumptions (Encore)
Properties of the speech signal change relatively slowly with time (5-10 sounds per second).
Over very short (5-20 msec) intervals: uncertainty due to the small amount of data, varying pitch and amplitude.
Over medium-length (20-100 msec) intervals: uncertainty due to changes in sound quality, transitions between sounds, rapid transients in speech.
Over long (100-500 msec) intervals: uncertainty due to the large amount of sound changes.
There is always uncertainty in short-time measurements and estimates from speech signals.
Why Windowing?
Obvious advantage: localizing a part of speech for analysis.
Less obvious disadvantage: how will the windowing affect the analysis results?
Compare the localization property of two extreme cases, N=1 vs. N=∞: N=1 is best in time but worst in frequency; N=∞ is best in frequency but worst in time.
Fundamentally speaking, windowing is the pursuit of a better tradeoff between time and frequency localization.
Two Degrees of Freedom
Window length: it cannot be too short (too few samples are not enough to resolve the uncertainty), and it cannot be too long (too many samples would introduce more uncertainty).
Window shape: the rectangular window is conceptually the simplest, but suffers from the spectral leakage problem. It is always a tradeoff: there is no universally optimal window; which window is more effective depends on the specific application. By varying the shape of the window, it is possible to significantly reduce spectral leakage.
Midterm Essay Topic #1: Why is spectral leakage a bad thing? In particular, what kind of difficulty does it cause for speech processing?
Advanced Windowing Techniques
Hamming/Hanning window: also known as the raised cosine window.
Blackman window: based on two cosine terms.
Bartlett window: also known as the triangular window.
Kaiser-Bessel window: based on Bessel functions.
Hamming/Hann(ing) Window
N-point Hann window: w(n) = 0.5 - 0.5 \cos(2\pi n / N)
N-point Hamming window: w(n) = 0.54 - 0.46 \cos(2\pi n / N)
N-point raised cosine window: w(n) = a - (1-a) \cos(2\pi n / N)
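A sketch of the raised cosine family exactly as written above (function name is my own; a = 0.5 yields the Hann window, a = 0.54 the Hamming window):

```python
import numpy as np

def raised_cosine_window(N, a):
    """N-point raised cosine window w(n) = a - (1-a) cos(2*pi*n/N).

    a = 0.5 gives the Hann window; a = 0.54 gives the Hamming window."""
    n = np.arange(N)
    return a - (1 - a) * np.cos(2 * np.pi * n / N)

hann = raised_cosine_window(256, 0.5)
hamming = raised_cosine_window(256, 0.54)
```

Note that this follows the 2πn/N convention used in the slide; NumPy's built-in `np.hanning`/`np.hamming` use 2πn/(N-1), so the samples differ slightly near the edges.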
Kaiser-Bessel Window
Based on the zero-order modified Bessel function of the first kind. The parameter B determines the tradeoff between the main lobe and the sidelobes, and is often specified as a half-integer multiple of π.
Speech Detection
Why study it? Save computation (we don't have to process background noise); save bits (we don't have to code background noise); facilitate recognition (we don't mistake background noise for speech).
What are the challenges? The complexity of speech signals and the diversity of background noise.
Difficult Scenarios
Speech signal: weak fricatives (/f/, /th/, /h/) at the beginning or end of an utterance; weak plosive bursts for /p/, /t/ or /k/; nasals at the end of an utterance (often devoiced and at reduced levels); voiced fricatives which become devoiced at the end of an utterance; trailing off of vowel sounds at the end of an utterance.
Background noise: high-level noise (low SNR), non-Gaussian noise, etc.
Algorithm Design
Come up with an attack: which feature should we use? Energy, amplitude, ZC rate or something else?
Implement the idea: collect speech data for experimental use; realize the algorithm in MATLAB (or any other language you are familiar with); analyze your result (if energy does not work, why? if ZC works better than amplitude, why?).
Refine the attack: can we do it recursively? Can we combine two approaches together?
Ad-hoc Algorithm #1
Three steps: calculate the short-time energy E_n of a given speech signal; obtain a threshold T from the energy of the background noise; threshold E_n by T.
Two problems: a speech segment does not necessarily have high energy (especially around the start and end points), and background noise does not necessarily have low energy (some assumption about the SNR is needed).
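A minimal sketch of the three steps above (the function name, non-overlapping framing, and the way the threshold is supplied are my own illustrative choices; a real system would estimate T from a leading stretch of background noise):

```python
import numpy as np

def energy_endpoints(x, frame_len, threshold):
    """Ad-hoc endpoint detection by energy thresholding.

    Computes short-time energy per non-overlapping frame, compares it
    against a noise-derived threshold, and returns the (first, last)
    speech frame indices, or None if no frame exceeds the threshold."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_len
    energy = np.array([np.sum(x[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    speech = np.flatnonzero(energy > threshold)
    if speech.size == 0:
        return None
    return int(speech[0]), int(speech[-1])
```

This exhibits exactly the two problems noted above: a quiet onset frame can fall below T, and noisy frames can exceed it.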
Ad-hoc Algorithm #2
Three steps: calculate the short-time ZC rate Z_n of a given speech signal; obtain a threshold T from the ZC rate of the background noise; threshold Z_n by T (note that a low ZC rate corresponds to voiced segments).
It is not highly reliable either, because the distributions of ZC rate for speech and noise are not completely separate.
Pitch Period Estimation
Why do we want to estimate the pitch period P? Pitch is one of the most fundamental attributes of speech signals; knowing P facilitates speech coding, speech synthesis, speech enhancement and speaker recognition.
Why is it nontrivial? Speech signals are much more complex than sinusoids.
Possible Attacks
How do we estimate the period of a sinusoid? With the Fourier transform (to be covered in the next chapter). In the time domain, we might consider:
the autocorrelation function (recall its periodicity property), and
the Average Magnitude Difference Function (AMDF), which follows essentially the same motivation but is computationally more efficient.
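A sketch of the autocorrelation attack (the function name and lag bounds are illustrative; real systems derive the lag range from the expected F0 range and the sampling rate, and the AMDF variant would search for a dip instead of a peak):

```python
import numpy as np

def pitch_period_autocorr(frame, min_lag, max_lag):
    """Estimate the pitch period P as the lag that maximizes the
    short-time autocorrelation within [min_lag, max_lag].

    The mean is removed first so a DC offset does not create a
    spurious peak."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    r = np.array([np.dot(x[: len(x) - k], x[k:])
                  for k in range(max_lag + 1)])
    return min_lag + int(np.argmax(r[min_lag: max_lag + 1]))
```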