Characteristics of Speech
Long-term (sentence level, several seconds): drastic, irregular changes
Short-term (frame level, about 20 ms):
  Regular periodic changes for voiced sounds
  Noise-like for unvoiced sounds
  Hard to recognize without context information
Spectrum in Frequency Domain
Three basic characteristics in a spectrum:
  Timbre: spectrum after smoothing
  Pitch: distance between harmonics
  Intensity: magnitude of spectrum
[Figure: spectrum annotated with first formant F1, second formant F2, pitch frequency, intensity, and timbre]
Demo: Real-time Spectrogram
Simulink model for real-time display of spectrogram:
  dspstfft_audio (before MATLAB R2011a)
  dspstfft_audioInput (R2012a or later)
[Figure: spectrogram and spectrum displays]
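For readers without Simulink, the idea behind a spectrogram can be sketched in plain MATLAB: cut the signal into frames, window each frame, and stack the magnitude spectra as columns. This is only an illustrative sketch, not the Simulink demo itself. The test tone, sample rate, frame size, and hop size below are all assumptions.

```matlab
% Minimal short-time spectrum sketch (all signal and frame settings are assumptions).
fs = 8000;                           % Sample rate in Hz (assumption)
t = (0:fs-1)/fs;                     % One second of time stamps
y = sin(2*pi*1000*t);                % 1 kHz test tone (assumption)
frameSize = 256;                     % Samples per frame (assumption)
hop = 128;                           % Frame advance in samples (assumption)
frameNum = floor((length(y)-frameSize)/hop) + 1;
win = 0.54 - 0.46*cos(2*pi*(0:frameSize-1)/(frameSize-1));  % Hamming window
S = zeros(frameSize/2+1, frameNum);  % One column of magnitudes per frame
for i = 1:frameNum
    idx = (i-1)*hop + (1:frameSize); % Sample indices of frame i
    X = fft(y(idx) .* win);          % Spectrum of the windowed frame
    S(:, i) = abs(X(1:frameSize/2+1)).';  % Keep the non-negative frequencies
end
[~, peakBin] = max(S(:, 1));         % Strongest frequency bin in frame 1
peakFreq = (peakBin-1)*fs/frameSize; % Convert bin index to Hz
% imagesc(20*log10(S + eps)); axis xy;  % Uncomment to display the spectrogram
```

Each column of S is one spectrum, and the whole matrix S, shown as an image over time, is the spectrogram.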
Audio Feature Extraction & Recognition
Frame blocking: frame duration of 20 ms
Feature extraction: volume, pitch, MFCC, LPC, etc.
Endpoint detection: based on volume and ZCR
Recognition: DTW, HMM
Example: Audio Feature Extraction
256 points/frame
84 points overlap
11025/(256-84) ≈ 64 feature vectors per second
[Figure: waveform plots (amplitude -0.4 to 0.3; sample index 0 to 300 and 0 to 2500) with labels "Zoom in", "Frame", and "Overlap"]
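The frame arithmetic in this example can be checked with a short sketch. It assumes a sample rate of 11025 Hz, which is implied by the numerator of the frame-rate formula, and uses a placeholder sine wave in place of real audio.

```matlab
% Frame blocking sketch: 256-point frames with 84 points of overlap.
fs = 11025;                      % Sample rate in Hz (implied by the example)
frameSize = 256;                 % Points per frame
overlap = 84;                    % Points shared by neighboring frames
hop = frameSize - overlap;       % Frame advance: 172 points
frameRate = fs/hop;              % Feature vectors per second (about 64)
y = sin(2*pi*440*(0:fs-1)/fs);   % One second of placeholder audio (assumption)
frameNum = floor((length(y)-frameSize)/hop) + 1;
frames = zeros(frameSize, frameNum);   % One frame per column
for i = 1:frameNum
    startIdx = (i-1)*hop + 1;          % First sample of frame i
    frames(:, i) = y(startIdx:startIdx+frameSize-1).';
end
```

The last 84 rows of one column equal the first 84 rows of the next, which is the overlap region between neighboring frames.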
Three Basic Acoustic Features
Three basic speech features:
  Volume/Energy/Intensity: vibration amplitude
  Pitch: fundamental frequency (the reciprocal of the fundamental period)
  Timbre: the waveform within a fundamental period
These features are perceived subjectively by humans. However, we can use mathematics to emulate human perception and capture these features.
Acoustic Feature: Energy
Energy is the sum of squared samples in a frame, also known as intensity or volume.
Characteristics:
  Noise and fricatives usually have low energy.
  Energy is strongly influenced by the microphone setup.
  Taking the base-10 logarithm of the square sum and multiplying by 10 gives energy in decibels (dB).
  Energy is commonly used in endpoint detection.
  In embedded system implementations, volume can be computed as the sum of absolute values in a frame to reduce computation.
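As a rough illustration of these definitions, the sketch below computes the square-sum energy, its decibel value, and the cheaper absolute-sum volume for one frame. The frame is a placeholder sine wave, and the factor of 2 standing in for a different microphone gain is an assumption chosen for illustration.

```matlab
% Energy and volume of a single frame (placeholder data, not real speech).
frame = 0.1*sin(2*pi*(0:255)/32);     % Placeholder frame: 8 full sine periods
energy = sum(frame.^2);               % Square sum of the frame
energyDb = 10*log10(energy + eps);    % Energy in dB; eps guards against log10(0)
volumeAbs = sum(abs(frame));          % Absolute-sum volume (no multiplications)
% A microphone with twice the gain (assumption) shifts the energy by about 6 dB.
louder = 2*frame;
gainDb = 10*log10(sum(louder.^2) + eps) - energyDb;
```

The constant offset in gainDb is one reason thresholds for endpoint detection usually need to be tuned per recording setup.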
Acoustic Feature: Zero Crossing Rate
Zero crossing rate (ZCR): the number of zero crossings in a frame.
Characteristics:
  Noise and unvoiced sounds have high ZCR.
  ZCR is commonly used in endpoint detection, especially for detecting the start and end of unvoiced sounds.
  To distinguish noise/silence from unvoiced sounds, we usually add a bias to the signal before computing ZCR.
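The effect of the bias can be seen in a small sketch. Both frames below are synthetic placeholders, and the shift value is a hypothetical choice set just above the assumed noise amplitude. In practice it would have to be tuned to the recording.

```matlab
% ZCR with and without a bias (synthetic frames; the shift value is hypothetical).
n = 0:255;
unvoiced = 0.5*(-1).^n;          % Placeholder: strong, rapidly alternating frame
silence = 0.01*(-1).^n;          % Placeholder: weak background noise
shift = 0.05;                    % Hypothetical bias, above the noise amplitude
zcr = @(x) sum(x(1:end-1).*x(2:end) < 0);   % Count sign changes between neighbors
zcrUnvoiced = zcr(unvoiced);                 % High ZCR
zcrSilence = zcr(silence);                   % Equally high without a bias
zcrUnvoicedShifted = zcr(unvoiced - shift);  % Still high after the shift
zcrSilenceShifted = zcr(silence - shift);    % Drops to zero after the shift
```

Without the bias the two frames are indistinguishable by ZCR. After the shift, only the frame whose amplitude exceeds the bias keeps crossing zero.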
Pitch
Computation:
  Pitch frequency is the reciprocal of the fundamental period.
  Pitch in terms of semitones:
  $$\text{semitone} = 69 + 12 \log_2\left(\frac{\text{freq}}{440}\right)$$
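A short sketch of both conversions follows: from a fundamental period to a pitch frequency, and from frequency to semitones using semitone = 69 + 12*log2(freq/440). The sample rate and period length are assumed values for illustration.

```matlab
% Pitch frequency from the fundamental period, then conversion to semitones.
fs = 16000;                          % Sample rate in Hz (assumption)
periodSamples = 64;                  % Fundamental period in samples (assumption)
pitchFreq = fs/periodSamples;        % Reciprocal of the period in seconds: 250 Hz
freq = [220 440 880];                % Example frequencies, one octave apart
semitone = 69 + 12*log2(freq/440);   % 440 Hz maps to semitone 69
freqBack = 440*2.^((semitone-69)/12);   % Inverse mapping back to Hz
```

Each doubling of frequency adds 12 semitones, so the three example frequencies map to 57, 69, and 81.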
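Basic Process of Sound Production and Reception
Vibration of the sounding body → pressure waves in the air → vibration of the eardrum → reception by the inner-ear nerves → recognition by the brain
Sound production mechanisms:
  Natural vibration frequency excited by striking (e.g., a tuning fork)
  Resonance frequency excited by air friction (e.g., a flute)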
Human Speech Production
The Vocal Tract
Glottal Volume Velocity & Resulting Sound Pressure (Voiced)
Speech Production
[Figure: glottal pulses → vocal tract → speech signal; (a) source spectrum + (b) filter function = (c) output energy spectrum]
Acoustical Analysis (speech signal of “七”, Mandarin for “seven”)
Speech Production Modeling
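[Block diagram: an impulse train generator (controlled by the pitch period) or a noise generator supplies the excitation u(n), which is multiplied by the gain G and passed through a time-varying digital filter (controlled by the vocal tract parameters) to produce s(n). Other labels: phonation, whispering, frication, compression, vibration]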
Parametric Representation
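[Block diagram: excitation u(n) multiplied by gain G, then passed through the filter block labeled A(z) to give s(n)]
G = gain of excitation
u(n) = excitation source (quasi-periodic pulse train or random noise)
Model:
$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\,u(n)$$
Z-transform:
$$S(z) = \sum_{k=1}^{p} a_k z^{-k} S(z) + G\,U(z)$$
Written in terms of A(z):
$$H(z) = \frac{S(z)}{G\,U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$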
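The all-pole model s(n) = sum_{k=1..p} a_k s(n-k) + G u(n) can be simulated with MATLAB's filter function, as a rough sketch. The coefficients, gain, pitch period, and sample rate below are hypothetical values chosen only so that the filter is stable. They are not taken from real speech.

```matlab
% Source-filter synthesis sketch with an all-pole filter 1/A(z).
fs = 8000;                       % Sample rate in Hz (assumption)
pitchPeriod = 80;                % Pitch period in samples, i.e. 100 Hz (assumption)
N = 800;                         % Number of samples to synthesize
u = zeros(1, N);
u(1:pitchPeriod:end) = 1;        % Quasi-periodic impulse train (voiced excitation)
G = 0.5;                         % Gain of excitation (assumption)
a = [1.3 -0.8];                  % Hypothetical coefficients a_1, a_2 (p = 2)
s = filter(G, [1 -a], u);        % Denominator 1 - a_1 z^-1 - a_2 z^-2 = A(z)
% Check the difference equation directly at one sample.
n = 100;
sDirect = a(1)*s(n-1) + a(2)*s(n-2) + G*u(n);
```

Replacing the impulse train with random noise, for example u = randn(1, N), gives the unvoiced branch of the same model.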
The Speech Model : A Summary
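Parameters of the model:
  Voiced/unvoiced classification,
  Pitch period for voiced sounds,
  The gain parameter, and
  The coefficients of the digital filter, {a_k}.
$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\,u(n)$$
Without the excitation term (linear prediction):
$$s(n) \approx \sum_{k=1}^{p} a_k\, s(n-k)$$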
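One way to see how the coefficients {a_k} relate to the signal is to recover them from samples by least squares, predicting s(n) from its previous p samples. This is a minimal sketch on a synthetic signal with hypothetical coefficients and p = 2. It is not the estimation method of any particular toolbox, and real speech would need windowing and a higher order.

```matlab
% Least-squares estimate of {a_k} from a synthetic all-pole signal (p = 2).
aTrue = [1.3 -0.8];                  % Hypothetical true coefficients (assumption)
u = [1 zeros(1, 199)];               % Single impulse as excitation, G = 1
s = filter(1, [1 -aTrue], u).';      % Column vector of samples s(n)
X = [s(2:end-1), s(1:end-2)];        % Columns hold s(n-1) and s(n-2)
target = s(3:end);                   % s(n) for n = 3..N
aEst = (X \ target).';               % Least-squares solution for [a_1 a_2]
predErr = target - X*aEst.';         % Prediction error, near zero here
```

Because the excitation is zero after the first sample, the prediction is exact and aEst matches aTrue up to rounding. With a periodic or noisy excitation, the error would instead carry the G u(n) term.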
Glossary (English: Chinese)
Cochlea: 耳蝸
Phoneme: 音素、音位
Phonics: 聲學;聲音基礎教學法 (a teaching method that builds spelling instruction on sounds)
Phonetics: 語音學
Phonology: 音系學、語音體系
Prosody: 韻律學;作詩法
Syllable: 音節
Tone: 音調
Alveolar: 齒槽音
Silence: 靜音
Noise: 雜訊
Glottis: 聲門
Larynx: 喉頭
Pharynx: 咽頭
Pharyngeal: 咽部的,喉音的
Velum: 軟顎
Vocal cords: 聲帶
Esophagus: 食管
Diaphragm: 橫隔膜
Trachea: 氣管
Hints for Exercises
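How to generate a sine wave signal:
Math formula: $y = a \sin(2\pi f t)$
MATLAB code:
duration=3;                  % Duration in seconds
f=440;                       % Frequency in Hz
fs=16000;                    % Sample rate in Hz
time=(0:duration*fs-1)/fs;   % Time stamps of the samples in seconds
y=0.8*sin(2*pi*f*time);      % Sine wave with amplitude a=0.8
plot(time, y);
sound(y, fs);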