
Characteristics of Speech

Long-term (sentence level, several seconds)
  Drastic/irregular changes

Short-term (frame level, 20 ms or so)
  Regular periodic changes for voiced sounds
  Noise-like for unvoiced sounds

Hard to recognize without context information

Spectrum in Frequency Domain

Three basic characteristics in a spectrum:

Timbre: Spectrum after smoothing
Pitch: Distance between harmonics
Intensity: Magnitude of spectrum
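As a rough illustration of how pitch and intensity show up in a spectrum, the MATLAB sketch below computes the magnitude spectrum of one synthetic frame. The signal, the sampling rate fs, the frame size N, and the 10% peak threshold are all assumptions chosen for this example; none of them come from the slides. Timbre (the smoothed spectrum) is not computed here.

```matlab
fs = 8000;                      % assumed sampling rate (Hz)
N = 256;                        % assumed frame size (samples)
f0 = 250;                       % assumed pitch frequency (Hz); falls exactly on FFT bin 8
n = (0:N-1)';

% Synthetic "voiced" frame: three harmonics with decreasing amplitude
frame = 1.0*sin(2*pi*f0*n/fs) + 0.5*sin(2*pi*2*f0*n/fs) + 0.25*sin(2*pi*3*f0*n/fs);

X = fft(frame);
mag = abs(X(1:N/2+1));          % intensity: magnitude of the spectrum from 0 to fs/2
freq = (0:N/2)'*fs/N;           % frequency of each bin (Hz)

% Harmonic peaks: bins whose magnitude is well above numerical noise
peakFreq = freq(mag > 0.1*max(mag));
harmonicSpacing = diff(peakFreq);   % pitch: distance between harmonics (Hz)
```

For real speech the harmonics rarely fall on exact FFT bins, so a windowed frame and a proper peak picker would be needed; the threshold test here only works because the example signal is idealized.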

[Figure: spectrum annotated with the first formant F1, second formant F2, pitch frequency, intensity, and timbre]

Demo: Real-time Spectrogram

Simulink model for real-time display of spectrogram:
dspstfft_audio (before MATLAB R2011a)
dspstfft_audioInput (R2012a or later)

[Figures: spectrogram and spectrum displays]

Audio Feature Extraction & Recognition

Frame blocking: frame duration of 20 ms

Feature extraction: volume, pitch, MFCC, LPC, etc.

Endpoint detection: based on volume and ZCR

Recognition: DTW, HMM

Example: Audio Feature Extraction

256 points/frame
84 points overlap
11025/(256-84) ≈ 64 feature vectors per second

[Figure: waveform with a zoomed-in view showing one frame and the overlap between adjacent frames]
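The frame arithmetic above can be sketched in MATLAB as follows. The frame size, overlap, and sampling rate are the ones from the example; the test signal y and the variable names are assumptions made for illustration.

```matlab
fs = 11025;                 % sampling rate from the example (Hz)
frameSize = 256;            % points per frame
overlap = 84;               % points shared by adjacent frames
hop = frameSize - overlap;  % 172 points between successive frame starts
frameRate = fs / hop;       % about 64.1 feature vectors per second

y = sin(2*pi*440*(0:fs-1)'/fs);   % assumed test signal: 1 second of a 440 Hz tone

% Cut the signal into overlapping frames, one frame per column
frameNum = floor((length(y) - overlap) / hop);   % number of complete frames
frameMat = zeros(frameSize, frameNum);
for i = 1:frameNum
    startIndex = (i-1)*hop + 1;
    frameMat(:, i) = y(startIndex:startIndex+frameSize-1);
end
```

Each column of frameMat can then be passed to a feature function (volume, ZCR, pitch, and so on) to obtain one feature vector per frame. Samples left over after the last complete frame are simply dropped in this sketch.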

Three Basic Acoustic Features

Volume/Energy/Intensity: Vibration amplitude

Pitch: Fundamental frequency (the reciprocal of the fundamental period)

Timbre: The waveform within a fundamental period

These features are perceived subjectively by humans. However, we can use some mathematics to emulate human perception and capture these features.

Acoustic Feature: Energy

Energy is the sum of the squared samples in a frame, also known as intensity or volume.

Characteristics:
Noise and fricatives usually have low energy.
Energy is strongly influenced by the microphone setup.
Taking the log (base 10) of the sum of squares and multiplying by 10 gives energy in decibels (dB).
Energy is commonly used in endpoint detection.
In embedded system implementations, volume can be computed as the sum of absolute values of a frame to reduce computation.
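A minimal MATLAB sketch of these energy measures is shown below. The test frame, its 20 ms length at an assumed 16 kHz sampling rate, and the variable names are assumptions for illustration only.

```matlab
fs = 16000;                               % assumed sampling rate (Hz)
frame = 0.5*sin(2*pi*440*(0:319)'/fs);    % assumed 20 ms test frame (320 samples)

energy = sum(frame.^2);                   % square sum of the frame
energyDb = 10*log10(energy);              % energy in decibels
volumeAbs = sum(abs(frame));              % cheaper absolute-sum volume

% Halving the amplitude (for example, a lower microphone gain) lowers the
% energy by 10*log10(4), about 6.02 dB
energyDbHalf = 10*log10(sum((0.5*frame).^2));
dropDb = energyDb - energyDbHalf;
```

The last two lines illustrate why energy depends so much on the microphone setup: a pure gain change shifts the decibel value of every frame by the same amount. For an all-zero frame the logarithm is undefined, so practical code usually guards against that case.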

Acoustic Feature: Zero Crossing Rate

Zero crossing rate (ZCR): the number of zero crossings in a frame.

Characteristics:
Noise and unvoiced sounds have high ZCR.
ZCR is commonly used in endpoint detection, especially for detecting the start and end of unvoiced sounds.
To distinguish noise/silence from unvoiced sounds, a bias is usually added before computing ZCR.
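The effect of the bias can be sketched in MATLAB as follows. This is one possible reading of "adding a bias": the frame is shifted by a small constant before counting sign changes. The helper zcrOf, the bias value 0.05, and the two idealized test signals are assumptions for illustration.

```matlab
% Count sign changes between consecutive samples after subtracting a bias
zcrOf = @(frame, bias) sum((frame(1:end-1) - bias) .* (frame(2:end) - bias) < 0);

n = (0:319)';
lowNoise = 0.01 * (-1).^n;   % assumed stand-in for low-level noise: tiny, flips sign every sample
unvoiced = 0.2 * (-1).^n;    % assumed stand-in for an unvoiced sound: same pattern, larger amplitude

zcrNoiseNoBias  = zcrOf(lowNoise, 0);      % every sample pair crosses zero
zcrNoiseBias    = zcrOf(lowNoise, 0.05);   % the shifted noise never changes sign
zcrUnvoicedBias = zcrOf(unvoiced, 0.05);   % the larger signal still crosses
```

Without the bias, the two signals have the same ZCR and cannot be told apart. With a bias larger than the noise amplitude but smaller than the unvoiced amplitude, only the unvoiced signal keeps a high ZCR. In practice the bias would have to be chosen from the measured noise level.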

Pitch

Computation
Pitch frequency is the reciprocal of the fundamental period.
Pitch in terms of semitones:

$$\text{semitone} = 69 + 12\log_2\left(\frac{\text{freq}}{440}\right)$$
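The semitone conversion, semitone = 69 + 12*log2(freq/440), can be tried out with the short MATLAB sketch below. The function handle names are assumptions, and the inverse mapping is not in the slides; it is obtained by solving the same formula for freq.

```matlab
% Frequency (Hz) to semitone, with 440 Hz mapped to semitone 69
freq2semitone = @(freq) 69 + 12*log2(freq/440);

semitoneA4 = freq2semitone(440);    % reference tone
semitoneA5 = freq2semitone(880);    % one octave up: 12 semitones higher

% Inverse mapping (derived here, not from the slides)
semitone2freq = @(semitone) 440 * 2.^((semitone - 69)/12);
freqC4 = semitone2freq(60);         % about 261.63 Hz
```

Doubling the frequency always adds 12 semitones, which is why the semitone scale is convenient for comparing pitch contours of speakers or singers in different registers.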

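Basic Process of Sound Production and Reception

Vibration of the sound source -> vibration of the air -> vibration of the eardrum -> reception by the inner-ear nerves -> recognition by the brain

Sound production mechanisms:
Natural vibration frequency caused by striking (e.g., a tuning fork)
Resonant frequency caused by air friction (e.g., a flute)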

Human Speech Production

The Vocal Tract

Glottal Volume Velocity & Resulting Sound Pressure (Voiced)

Speech Production

Glottal Pulses -> Vocal Tract -> Speech Signal

[Figure: (a) Source Spectrum + (b) Filter Function = (c) Output Energy Spectrum]

Acoustical Analysis (speech signal of “七”, the Mandarin word for “seven”)

Speech Production Modeling

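[Figure: block diagram. Labels for sound production: phonation, whispering, frication, compression, vibration. An impulse train generator (controlled by the pitch period) or a noise generator supplies the excitation u(n), which is multiplied by the gain G and passed through a time-varying digital filter (controlled by the vocal tract parameters) to produce the speech signal s(n).]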

Parametric Representation

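[Figure: u(n) multiplied by the gain G, passed through the filter block labeled A(z), producing s(n)]

Model:

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\,u(n)$$

Z-Transform:

$$S(z) = \sum_{k=1}^{p} a_k z^{-k} S(z) + G\,U(z)$$

Write in A(z):

$$H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$

G = gain of excitation
u(n) = excitation source (quasi-periodic pulse train or random noise)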
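To make the model concrete, the MATLAB sketch below synthesizes a signal from the difference equation s(n) = sum over k of a_k s(n-k) + G u(n) and checks it against the equivalent all-pole filter H(z) = 1/A(z), with A(z) = 1 - sum over k of a_k z^(-k). The model order, coefficients, gain, and pitch period are made-up values for illustration, not parameters from the slides.

```matlab
p = 2;                    % assumed model order
a = [1.3; -0.8];          % assumed coefficients a_1, a_2 (poles inside the unit circle)
G = 0.5;                  % assumed gain of excitation
N = 200;                  % number of samples to synthesize
pitchPeriod = 50;         % assumed pitch period (samples)

u = zeros(N, 1);
u(1:pitchPeriod:N) = 1;   % voiced excitation: periodic impulse train

% Direct recursion of the model, taking s(n) = 0 for n < 1
s = zeros(N, 1);
for n = 1:N
    for k = 1:p
        if n-k >= 1
            s(n) = s(n) + a(k)*s(n-k);
        end
    end
    s(n) = s(n) + G*u(n);
end

% Same signal via the all-pole filter G/A(z); denominator is [1, -a_1, ..., -a_p]
sFilt = filter(G, [1; -a], u);
maxDiff = max(abs(s - sFilt));   % should be at rounding-error level
```

Replacing the impulse train with random noise (for example randn(N, 1)) would give the unvoiced case of the same model.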

The Speech Model : A Summary

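The model parameters are:
Voiced/unvoiced classification,
Pitch period for voiced sounds,
The gain parameter, and
The coefficients of the digital filter, {a_k}.

Speech model:

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\,u(n)$$

Linear prediction of s(n) from the previous p samples:

$$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$$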
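The relation between the two equations can be checked numerically: if s(n) follows the model and the true coefficients a_k are used in the predictor, the prediction error s(n) minus the predicted value equals the scaled excitation G u(n). The MATLAB sketch below uses made-up coefficients, gain, and pitch period for illustration; in practice the a_k are not known and must be estimated from the signal.

```matlab
a = [1.3; -0.8];          % assumed coefficients a_1, a_2 (stable)
G = 0.5;                  % assumed gain
N = 200;                  % number of samples
u = zeros(N, 1);
u(1:50:N) = 1;            % assumed excitation: impulse train with period 50 samples

s = filter(G, [1; -a], u);            % s(n) = sum_k a_k s(n-k) + G u(n)

% Predictor: sHat(n) = a_1 s(n-1) + a_2 s(n-2), implemented as an FIR filter
sHat = filter([0; a], 1, s);
predError = s - sHat;                 % prediction error (residual)
maxDiff = max(abs(predError - G*u));  % should be at rounding-error level
```

The residual recovers the impulse positions of the excitation, which is one reason the prediction error is often used as a starting point for pitch analysis.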

Glossary (English: Chinese)

Cochlea: 耳蝸
Phoneme: 音素、音位
Phonics: 聲學；聲音基礎教學法（以聲音為基礎進而教拼字的教學法）
Phonetics: 語音學
Phonology: 音系學、語音體系
Prosody: 韻律學；作詩法
Syllable: 音節
Tone: 音調
Alveolar: 齒槽音
Silence: 靜音
Noise: 雜訊
Glottis: 聲門
Larynx: 喉頭
Pharynx: 咽頭
Pharyngeal: 咽部的，喉音的
Velum: 軟顎
Vocal cords: 聲帶
Esophagus: 食管
Diaphragm: 橫隔膜
Trachea: 氣管

Hints for Exercises

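How to generate a sine wave signal:

Math formula:

$$y = a \sin(2\pi f t)$$

MATLAB code:

duration=3;                  % duration in seconds
f=440;                       % frequency in Hz
fs=16000;                    % sampling rate in Hz
time=(0:duration*fs-1)/fs;   % time vector in seconds
y=0.8*sin(2*pi*f*time);      % amplitude a=0.8
plot(time, y);
sound(y, fs);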
