le 460 l acoustics and experimental phonetics l-13 anu khosla drdo, delhi [email protected]...

LE 460 L Acoustics and Experimental Phonetics

L-13

Anu Khosla, DRDO, Delhi

[email protected]

Introduction

• Most analysis methods are not designed to analyse sounds whose characteristics change over time.

• A practical solution is to model the speech signal as a slowly varying function of time.

• During intervals of 5 to 25 ms the speech characteristics don’t change too much and are considered to be constant.

• The signal is therefore analysed in small segments, called analysis intervals.

• The optimal analysis interval length depends on the kind of information you want to extract from the speech signal.

• The analysis results therefore always represent some kind of average over the analysis interval.

Parameters for Analysis

Three parameters have to be decided for an analysis: window length, time step and window shape.

Window Length

• There is no one optimal window length that fits all circumstances

• It depends on the type of analysis and the type of signal

• For example:

– To make spectrograms one often chooses either 5 ms for a wideband spectrogram or 40 ms for a narrowband spectrogram

– For pitch analysis a window length of 40 ms is more appropriate

Time step

• This parameter determines the amount of overlap between successive segments.

• If the time step is much smaller than the window length we have much overlap.

• If the time step is larger than the window length we have no overlap at all.

• In general we like to have at least 50% overlap between two succeeding frames, so we will choose a time step no larger than half the window length.
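The relation between window length, time step and overlap can be sketched in a few lines of Python. This is only an illustration: the function names `frame_starts` and `overlap_fraction`, the 16 kHz sampling rate and the 40 ms / 10 ms values are assumptions chosen for the example, not values prescribed by the slides.

```python
def frame_starts(n_samples, fs, window_ms, step_ms):
    """Return (start indices of all full frames, window size, step size) in samples."""
    win = round(fs * window_ms / 1000)    # window length in samples
    step = round(fs * step_ms / 1000)     # time step in samples
    starts = list(range(0, n_samples - win + 1, step))
    return starts, win, step


def overlap_fraction(window_ms, step_ms):
    """Fraction of a frame shared with the next frame (0.0 means no overlap)."""
    return max(0.0, 1.0 - step_ms / window_ms)


# Assumed example: one second of sound at 16 kHz, 40 ms window, 10 ms time step.
starts, win, step = frame_starts(n_samples=16000, fs=16000, window_ms=40, step_ms=10)
print(win, step, len(starts), overlap_fraction(40, 10))
```

With a time step of one quarter of the window length the overlap is 75%, which satisfies the 50% guideline; a time step larger than the window gives an overlap of zero.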

Window shape

• In general we want the sound segment’s amplitudes to start and end smoothly.

• Many different window shapes are popular in speech analysis:

– square window (or rectangular window)

– Hamming window

– Hanning window

– Bartlett window.

• In Praat the default windowing function is the Gaussian window.
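The window shapes listed above can be written down directly from their textbook formulas. The sketch below uses only the Python standard library; the Gaussian width parameter `sigma` is an assumption made for illustration and is not claimed to match Praat's own definition of its Gaussian window.

```python
import math


def rectangular(n):
    """Square (rectangular) window: all samples weighted equally."""
    return [1.0] * n


def hamming(n):
    """Hamming window: raised cosine that ends at 0.08 instead of 0."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def hanning(n):
    """Hanning (Hann) window: raised cosine that starts and ends at 0."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def bartlett(n):
    """Bartlett window: triangle that starts and ends at 0."""
    return [1.0 - abs(2 * i / (n - 1) - 1.0) for i in range(n)]


def gaussian(n, sigma=0.4):
    """Gaussian window; sigma is relative to half the window length (assumed value)."""
    centre = (n - 1) / 2
    return [math.exp(-0.5 * ((i - centre) / (sigma * centre)) ** 2) for i in range(n)]


def apply_window(frame, window):
    """Multiply a sound segment sample by sample with a window."""
    return [s * w for s, w in zip(frame, window)]
```

Apart from the rectangular window, all of these taper towards the edges, which is what makes the segment's amplitudes start and end smoothly.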

Speech Analysis

Short Time Analysis

• In the time domain:

– Short-time energy: used to segment speech into smaller units

– Short-time zero crossing rate (ZCR): used to help in making voicing decisions (high ZCR indicates unvoiced speech)

– Short-time autocorrelation: used for pitch determination

• In the frequency domain:

– Fourier analysis: spectrogram, formants
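The three time-domain measures can be sketched for a single analysis frame as follows. This is a minimal illustration, not a production pitch tracker: the function names, the pitch search range of 75 to 500 Hz and the synthetic 200 Hz test frame are all assumptions made for the example.

```python
import math


def short_time_energy(frame):
    """Sum of squared sample values in the frame."""
    return sum(x * x for x in frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def autocorrelation(frame, lag):
    """Unnormalised short-time autocorrelation at the given lag (in samples)."""
    return sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))


def estimate_pitch(frame, fs, f_min=75.0, f_max=500.0):
    """Pick the lag with the largest autocorrelation inside the assumed pitch range."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: autocorrelation(frame, lag))
    return fs / best_lag


# Assumed test signal: a 40 ms frame of a 200 Hz sine sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 200 * i / fs) for i in range(320)]
pitch = estimate_pitch(frame, fs)   # expected to come out close to 200 Hz
```

A 40 ms frame is used here because it contains several pitch periods, in line with the window length suggested earlier for pitch analysis.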

Computerized Speech

Precautions

• Try to avoid making recordings in reverberant rooms (a church is very reverberant).

• Try to avoid making recordings at places where the environment is noisy and uncontrollable.

• To avoid large intensity variations in the recording, the distance from the speaker’s mouth to the microphone should remain as constant as possible.

• Avoid simultaneous speaking.

[Figure: the speech signal level varies with time]

Computerized Speech

• Speech (sound) is analog

• Computers are digital

• We need to convert the analog signal to a digital one

• Sampling is the reduction of a continuous signal to a discrete signal

• The sampling frequency or sampling rate fs is defined as the number of samples obtained in one second (samples per second).

• fs = 1/T, where T is the time between two successive samples.

• Nyquist (in 1928) and Shannon (in 1949) showed that, for the digital signal to be a faithful representation of the analog signal, a relation between the sampling frequency and the bandwidth of the signal has to be maintained.

• The Nyquist-Shannon sampling theorem: A sound s(t) that contains no frequencies higher than F hertz is completely determined by giving its sample values at a series of points spaced 1/(2F) seconds apart.

• The number of sample values per second corresponds to the term sampling frequency.

• Sample values at intervals of 1/(2F) seconds translate to a sampling frequency of 2F hertz.
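As a small worked example of these relations, the sketch below converts between the highest frequency F, the minimum sampling frequency 2F and the sampling interval T. The function names and the 4 kHz example value are assumptions used only for illustration.

```python
def min_sampling_rate(f_max_hz):
    """Smallest sampling frequency (Hz) allowed by the sampling theorem: 2F."""
    return 2.0 * f_max_hz


def sampling_interval(fs_hz):
    """Time between two successive samples in seconds: T = 1 / fs."""
    return 1.0 / fs_hz


# Assumed example: a signal with no frequencies above 4 kHz.
fs = min_sampling_rate(4000.0)   # 8000.0 Hz
T = sampling_interval(fs)        # 0.000125 s, i.e. 125 microseconds
```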

Poor Sampling

[Figure: sampled sine wave, sampling frequency = 1/2 × wave frequency (sampling period = 2 × wave period); vertical axis -1.5 to 1.5, horizontal axis 0 to 12]

Even Worse

[Figure: sampled sine wave, sampling frequency = 1/3 × wave frequency; vertical axis -1.5 to 1.5, horizontal axis 0 to 12]

Higher Sampling Frequency

[Figure: sampled sine wave, sampling frequency = 2/3 × wave frequency; vertical axis -1.5 to 1.5, horizontal axis 0 to 12]

Getting Better

[Figure: sampled sine wave, sampling frequency = wave frequency; vertical axis -1.5 to 1.5, horizontal axis 0 to 12]

Good Sampling

[Figure: sampled sine wave, sampling frequency = 2 × wave frequency; vertical axis -1.5 to 1.5, horizontal axis 0 to 12]

Shannon-Nyquist's Sampling Theorem

• A sampled time signal must not contain components at frequencies above half the sampling rate (The so-called Nyquist frequency)

• The highest frequency which can be accurately represented is one-half of the sampling rate

Range of Human Hearing

• 20 – 20,000 Hz

• We lose high frequency response with age

• Women generally have better response than men

• Reproducing 20 kHz requires a sampling rate of at least 40 kHz

– Sampling below this rate introduces aliasing

Effect of Aliasing

• The Fourier theorem states that any waveform can be reproduced as a sum of sine waves.

• Improperly sampled signals will contain other, spurious sine wave components (aliases).
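To make the effect concrete, the sketch below computes the frequency at which a too-high sine component reappears after sampling, and builds two sample sequences that turn out to be indistinguishable. The helper name `alias_frequency` and the 8 kHz / 5 kHz / 3 kHz values are assumptions chosen for the example.

```python
import math


def alias_frequency(f_hz, fs_hz):
    """Apparent frequency of a sine at f_hz after sampling at fs_hz."""
    folded = f_hz % fs_hz                 # sampling cannot distinguish f from f + k*fs
    return min(folded, fs_hz - folded)    # fold back into the range 0 .. fs/2


fs = 8000
# A 5 kHz cosine sampled at 8 kHz gives the same sample values as a 3 kHz cosine.
high = [math.cos(2 * math.pi * 5000 * n / fs) for n in range(16)]
low = [math.cos(2 * math.pi * 3000 * n / fs) for n in range(16)]
identical = all(abs(a - b) < 1e-9 for a, b in zip(high, low))
```

Components at or below half the sampling rate are returned unchanged, which is the sense in which they are "accurately represented".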

Half the Nyquist Frequency

[Figure: sampled sine wave; vertical axis -1.5 to 1.5, horizontal axis 0 to 25]

Nyquist Frequency

[Figure: sampled sine wave; vertical axis -1.5 to 1.5, horizontal axis 0 to 12]

Recovery of a sampled sine wave for different sampling rates

Sampling

[Figure: panels titled "Signal waveform", "Impulse sampler" and "Sampled waveform" (three panels); horizontal axis 1 to 201]

Quantization and encoding of a sampled signal

Quantization Error

• When a signal is quantized, we introduce an error: the coded signal is an approximation of the actual amplitude value.

• The difference between the actual and the coded value (the midpoint of the zone) is referred to as the quantization error.

• The more zones, the smaller each zone, which results in smaller errors.

• BUT, the more zones, the more bits are required to encode the samples -> higher bit rate
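The trade-off between the number of zones and the quantization error can be illustrated with a simple uniform quantizer. This is a sketch under stated assumptions: the input range of -1 to 1, the function name `quantize` and the choice of equally wide zones are illustrative and not taken from the slides.

```python
def quantize(x, n_bits, x_min=-1.0, x_max=1.0):
    """Return (zone index, zone midpoint) for x using 2**n_bits equal zones."""
    zones = 2 ** n_bits
    step = (x_max - x_min) / zones              # width of one zone
    index = int((x - x_min) / step)             # zone that x falls in
    index = min(max(index, 0), zones - 1)       # clip values at the range edges
    midpoint = x_min + (index + 0.5) * step     # coded value
    return index, midpoint


# Assumed example: 3 bits per sample gives 8 zones of width 0.25.
index, coded = quantize(0.3, n_bits=3)
error = abs(0.3 - coded)                        # at most half a zone width (0.125)
```

Each extra bit doubles the number of zones and halves the maximum error, at the cost of one more bit per sample.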

Digitization of Analog Signal

• Sample the analog signal in time and amplitude

• Find the closest approximation

[Figure: original signal, sample values and their approximations, coded at 3 bits/sample]

Rs = Bit rate = # bits/sample × # samples/second
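The bit rate formula is a single multiplication, sketched below. The `channels` parameter is an addition for illustration (the slide formula covers a single channel), and the 8 kHz sampling rate in the examples is an assumed value.

```python
def bit_rate(bits_per_sample, samples_per_second, channels=1):
    """Bit rate in bits per second: bits/sample x samples/second (x channels)."""
    return bits_per_sample * samples_per_second * channels


# Assumed examples at a sampling rate of 8 kHz:
coarse = bit_rate(3, 8000)    # 3 bits/sample, as in the figure: 24000 bit/s
fine = bit_rate(16, 8000)     # 16 bits/sample: 128000 bit/s
```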

• All ADCs (analog-to-digital converters) have a fixed highest sampling frequency, and to guarantee that the input contains no frequencies higher than half this frequency we have to filter them out before sampling.

• If we don’t filter out these frequencies, they get aliased and would also contribute to the digitized representation.

For most phonemes, almost all of the energy is contained in the 5 Hz to 4 kHz range, allowing a sampling rate of 8 kHz. This is the sampling rate used by nearly all telephony systems.

CD-quality audio is recorded with 16 bits per sample.

http://en.wikipedia.org/wiki/Phoneme

http://en.wikipedia.org/wiki/Telephony
