Characteristics of Speech

Long-term (sentence level, several seconds): drastic/irregular changes

Short-term (frame level, 20 ms or so): regular periodic changes for voiced sounds, noise-like for unvoiced sounds

Hard to recognize without context information

Spectrum in Frequency-Domain

Three basic characteristics in a spectrum:

Timbre: Spectrum after smoothing

Pitch: Distance between harmonics

Intensity: Magnitude of spectrum

[Figure: spectrum annotated with the first formant F1, the second formant F2, pitch frequency, intensity, and timbre]

Demo: Real-time Spectrogram

Simulink model for real-time display of spectrogram:

dspstfft_audio (before MATLAB R2011a)

dspstfft_audioInput (R2012a or later)

[Figure: two panels labeled Spectrum and Spectrogram]

Audio Feature Extraction & Recognition

Frame blocking: frame duration of 20 ms

Feature extraction: volume, pitch, MFCC, LPC, etc.

Endpoint detection: based on volume & ZCR

Recognition: DTW, HMM
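The endpoint detection step can be illustrated with a minimal volume-only sketch. Everything specific here is an assumption made for illustration: the 16 kHz sampling rate, the synthetic test signal, the non-overlapping 20 ms frames, and the threshold of 10% of the maximum volume. A practical detector would typically also use ZCR, as noted above.

```matlab
% Minimal volume-based endpoint detection (illustrative assumptions throughout)
fs = 16000;                         % assumed sampling rate (Hz)
frameSize = 320;                    % 20 ms at 16 kHz
n = (0:fs-1)';
y = zeros(fs, 1);                   % one second of silence ...
y(4801:11200) = 0.5*sin(2*pi*200*n(4801:11200)/fs);   % ... with a tone from 0.3 s to 0.7 s

% Frame blocking without overlap, one frame per column
frameCount = floor(length(y)/frameSize);
frames = reshape(y(1:frameCount*frameSize), frameSize, frameCount);

% Volume per frame as the absolute sum, then a simple threshold
volume = sum(abs(frames), 1);
threshold = 0.1*max(volume);        % hypothetical threshold choice
active = find(volume > threshold);

startFrame = active(1);
endFrame = active(end);
startTime = (startFrame-1)*frameSize/fs;   % 0.3 s
endTime = endFrame*frameSize/fs;           % 0.7 s
```

With this test signal the detected segment spans frames 16 to 35, which matches the 0.3 s to 0.7 s interval where the tone was placed.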

Example: Audio Feature Extraction

256 points/frame

84 points overlap

11025/(256-84) ≈ 64 feature vectors per second

[Figure: waveform divided into frames, with a zoomed-in view showing the overlap between adjacent frames]
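The frame arithmetic in this example can be checked with a short sketch. The frame size, overlap, and sampling rate come from the example; the one-second 440 Hz test tone is an assumption used only to have something to cut into frames.

```matlab
% Frame blocking with overlap (test signal is an assumption)
fs = 11025;                         % sampling rate from the example (Hz)
frameSize = 256;                    % points per frame
overlap = 84;                       % points shared by adjacent frames
hop = frameSize - overlap;          % 172 points between frame starts

y = sin(2*pi*440*(0:fs-1)'/fs);     % one second of a 440 Hz tone

% Number of complete frames that fit in the signal
frameCount = floor((length(y) - overlap)/hop);

% One frame per column
frames = zeros(frameSize, frameCount);
for i = 1:frameCount
    startIndex = (i-1)*hop + 1;
    frames(:, i) = y(startIndex : startIndex + frameSize - 1);
end

frameRate = fs/hop;                 % about 64.1 feature vectors per second
```

One second of audio yields 63 complete frames here, because the last partial frame is dropped; the rate of about 64 per second is the long-run average.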

Three Basic Acoustic Features

Three basic speech features:

Volume/Energy/Intensity: Vibration amplitude

Pitch: Fundamental frequency (which is equal to the reciprocal of the fundamental period)

Timbre: The waveform within a fundamental period

These features are perceived subjectively by humans. However, we can use mathematics to "emulate" human perception and capture these features.

Acoustic Feature: Energy

Energy is the sum of the squared samples in a frame, also known as intensity or volume.

Characteristics:

Noise and fricatives usually have low energy.

Energy is strongly influenced by the microphone setup.

Taking the base-10 log of the sum of squares and multiplying by 10 gives energy in decibels (dB).

Energy is commonly used in endpoint detection.

In embedded system implementations, volume can be computed as the sum of absolute values of a frame in order to reduce computation.
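The three quantities described above can be computed in a few lines. This is a minimal sketch; the 16 kHz sampling rate, the 20 ms frame, and the 200 Hz test tone are assumptions chosen so that the numbers are easy to verify.

```matlab
% Energy, energy in dB, and absolute-sum volume of one frame
fs = 16000;                         % assumed sampling rate (Hz)
t = (0:319)'/fs;                    % one 20 ms frame (320 samples)
frame = 0.5*sin(2*pi*200*t);        % test frame: exactly 4 periods of a 200 Hz tone

energy = sum(frame.^2);             % sum of squares, equals 40 for this frame
energyDb = 10*log10(energy);        % energy in decibels
volumeAbs = sum(abs(frame));        % cheaper absolute-sum alternative

% The same frame recorded at one tenth of the amplitude
quietDb = 10*log10(sum((0.1*frame).^2));
dbDrop = energyDb - quietDb;        % 20 dB lower
```

The last two lines show why the microphone setup matters: scaling the input by 0.1 shifts the energy by 20 dB without changing the content of the signal.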

Acoustic Feature: Zero Crossing Rate

Zero crossing rate (ZCR): The number of zero crossings in a frame.

Characteristics:

Noise and unvoiced sounds have high ZCR.

ZCR is commonly used in endpoint detection, especially in detecting the start and end of unvoiced sounds.

To distinguish noise/silence from unvoiced sounds, we usually add a bias before computing ZCR.
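The effect of the bias can be seen in a minimal sketch. The signals are synthetic stand-ins (assumptions, not real recordings): a 200 Hz tone for voiced sound, a large alternating signal for unvoiced sound, and a tiny alternating signal for the noise floor during silence. The bias is interpreted here as a small constant shift of the waveform, and its value of 0.01 is hypothetical.

```matlab
% Zero crossing rate with and without a bias (synthetic stand-in signals)
fs = 16000;                              % assumed sampling rate (Hz)
n = (0:319)';                            % one 20 ms frame
voiced   = 0.5*sin(2*pi*200*n/fs + 0.5); % low-frequency tone
unvoiced = 0.1*(-1).^n;                  % changes sign at every sample
silence  = 0.001*(-1).^n;                % same pattern at a very low level

% Count sign changes between neighboring samples
countZc = @(x) sum(x(1:end-1).*x(2:end) < 0);

zcrVoiced   = countZc(voiced);           % 8
zcrUnvoiced = countZc(unvoiced);         % 319
zcrSilence  = countZc(silence);          % 319, indistinguishable from unvoiced

bias = 0.01;                             % hypothetical shift, larger than the noise floor
zcrUnvoicedBiased = countZc(unvoiced - bias);   % still 319
zcrSilenceBiased  = countZc(silence - bias);    % drops to 0
```

Without the bias, the silence and unvoiced stand-ins have the same ZCR; after the shift, only the low-level signal loses its crossings.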

Computation:

Pitch frequency is the reciprocal of the fundamental period.

Pitch in terms of semitones:

$$\text{semitone} = 69 + 12\,\log_2\!\left(\frac{\text{freq}}{440}\right)$$
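As a worked example of both relations, the sketch below converts a fundamental period to a pitch frequency and then to semitones using semitone = 69 + 12*log2(freq/440). The sampling rate and the period of 100 samples are assumptions for illustration.

```matlab
% Pitch frequency from the fundamental period, and frequency to semitone
fs = 16000;                          % assumed sampling rate (Hz)
periodSamples = 100;                 % hypothetical fundamental period in samples
pitchFreq = fs/periodSamples;        % reciprocal of the period: 160 Hz

freq2semitone = @(freq) 69 + 12*log2(freq/440);

semitoneA4 = freq2semitone(440);     % 69
semitoneA5 = freq2semitone(880);     % 81, one octave (12 semitones) higher
semitoneC4 = freq2semitone(261.63);  % about 60 (middle C)
semitonePitch = freq2semitone(pitchFreq);
```

Doubling the frequency always adds 12 semitones, which is why the semitone scale is more convenient than Hz for comparing melodies.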

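Basic Process of Sound Production and Reception

Vibration of the sound source → wave motion of the air → vibration of the eardrum → reception by the inner-ear nerves → recognition by the brain

Sound production mechanisms:

Natural vibration frequency induced by striking (e.g., a tuning fork)

Resonance frequency induced by air friction (e.g., a flute)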

Human Speech Production

The Vocal Tract

Glottal Volume Velocity & Resulting Sound Pressure (Voiced)

Speech Production

[Figure: glottal pulses pass through the vocal tract to produce the speech signal; panels show (a) source spectrum, (b) filter function, (c) output energy spectrum]

Acoustical Analysis (speech signal of "七", the Mandarin word for "seven")

Speech Production Modeling

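[Figure: excitation types (phonation, whispering, frication, compression, vibration). Block diagram: an impulse train generator, controlled by the pitch period, or a noise generator supplies the excitation u(n); the excitation is multiplied by a gain and fed to a time-varying digital filter controlled by the vocal tract parameters.]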

Parametric Representation

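[Figure: the excitation u(n), multiplied by the gain G, passes through the filter block A(z) to give s(n)]

G = gain of excitation

u(n) = excitation source (quasi-periodic pulse train or random noise)

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n)$$

Z-Transform:

$$S(z) = \sum_{k=1}^{p} a_k\, z^{-k}\, S(z) + G\, U(z)$$

Written in terms of A(z):

$$H(z) = \frac{S(z)}{G\, U(z)} = \frac{1}{1 - \sum_{k=1}^{p} a_k z^{-k}} = \frac{1}{A(z)}$$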
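The all-pole model can be sketched with the built-in `filter` function, taking the recursion as s(n) = a_1 s(n-1) + ... + a_p s(n-p) + G u(n). The values below are assumptions chosen for illustration: p = 2, hypothetical coefficients a = [1.3 -0.8] (which give a stable filter), a gain of 0.5, and an impulse train with a period of 80 samples as the voiced excitation.

```matlab
% Synthesis with the all-pole model (all parameter values are hypothetical)
pitchPeriod = 80;                   % samples between excitation pulses
N = 800;                            % number of output samples
u = zeros(N, 1);
u(1:pitchPeriod:end) = 1;           % quasi-periodic pulse train (voiced excitation)

G = 0.5;                            % gain of excitation
a = [1.3, -0.8];                    % predictor coefficients a_1, a_2 (p = 2)

% filter(b, [1 -a], u) implements s(n) = a_1 s(n-1) + a_2 s(n-2) + G u(n)
s = filter(G, [1, -a], u);

% Direct evaluation of the same recursion, for comparison
sDirect = zeros(N, 1);
for n = 1:N
    acc = G*u(n);
    for k = 1:numel(a)
        if n-k >= 1
            acc = acc + a(k)*sDirect(n-k);
        end
    end
    sDirect(n) = acc;
end
maxDiff = max(abs(s - sDirect));
```

Replacing `u` with `randn(N, 1)` would give the unvoiced (random noise) case with the same filter.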

The Speech Model : A Summary

Voiced/unvoiced classification,

Pitch period for voiced sounds,

The gain parameter, and

The coefficients of the digital filters, {a_k}.

$$s(n) = \sum_{k=1}^{p} a_k\, s(n-k) + G\, u(n)$$

$$\tilde{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k)$$
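To make the role of the coefficients {a_k} concrete, the sketch below generates a signal from known coefficients and then recovers them by predicting each sample from its p previous samples. Ordinary least squares is used here as a simple stand-in; the slides do not say which estimation method is intended, and the coefficient values, gain, and single-impulse excitation are assumptions.

```matlab
% Recovering {a_k} by least-squares linear prediction (illustrative only)
N = 200;
u = zeros(N, 1);
u(1) = 1;                           % a single excitation pulse
G = 0.5;                            % hypothetical gain
aTrue = [1.3, -0.8];                % hypothetical coefficients (p = 2)
p = numel(aTrue);

s = filter(G, [1, -aTrue], u);      % s(n) = a_1 s(n-1) + a_2 s(n-2) + G u(n)

% For n > p the excitation is zero, so s(n) = a_1 s(n-1) + a_2 s(n-2)
target = s(p+1:N);
past = [s(p:N-1), s(p-1:N-2)];      % columns: s(n-1), s(n-2)
aHat = past \ target;               % least-squares solution

prediction = past*aHat;             % predicted samples
predictionError = max(abs(target - prediction));
```

Because the excitation is zero after the first sample, the prediction is exact here and `aHat` matches `aTrue`. With a real pulse train or noise excitation the fit is only approximate.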

Glossary (English: Chinese)

Cochlea: 耳蝸

Phoneme: 音素、音位

Phonics: 聲學；聲音基礎教學法 (acoustics; a sound-based method of teaching spelling)

Phonetics: 語音學

Phonology: 音系學、語音體系

Prosody: 韻律學；作詩法 (prosody; versification)

Syllable: 音節

Tone: 音調

Alveolar: 齒槽音

Silence: 靜音

Noise: 雜訊

Glottis: 聲門

Larynx: 喉頭

Pharynx: 咽頭

Pharyngeal: 咽部的，喉音的

Velum: 軟顎

Vocal cords: 聲帶

Esophagus: 食管

Diaphragm: 橫隔膜

Trachea: 氣管

Hints for Exercises

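How to generate a sine wave signal:

Math formula: $y = a \sin(2\pi f t)$

MATLAB code:

duration=3;                    % signal length in seconds
f=440;                         % frequency in Hz
fs=16000;                      % sampling rate in Hz
time=(0:duration*fs-1)/fs;     % time axis in seconds
y=0.8*sin(2*pi*f*time);        % sine wave with amplitude a=0.8
plot(time, y);                 % plot the waveform
sound(y, fs);                  % play it back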
