Estimation of Postoperative Vowel
of Benign Vocal Fold Lesions using
Nonlinear Speech Production Modeling
Seung Jin Jang
The Graduate School
Yonsei University
Department of Biomedical Engineering
Estimation of Postoperative Vowel
of Benign Vocal Fold Lesions using
Nonlinear Speech Production Modeling
A Dissertation
Submitted to the Department of Biomedical Engineering
and the Graduate School of Yonsei University
in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Seung Jin Jang
August 2007
This certifies that the dissertation of Seung Jin Jang is approved.
The Graduate School
Yonsei University
August 2007
ACKNOWLEDGMENTS
First and foremost, I would like to thank my parents for their continuous concern,
generous support and endless love. They have encouraged me to keep working on the
fundamental problems. This dissertation is dedicated to them.
I would like to thank my dissertation advisor, Professor Young-Ro Yoon, for his
insightful advice. His abundant scientific experience and approach have been an inspiration
to me throughout my years of graduate study and research. He is not only an academic advisor
who guides my research, but also a great mentor whose enthusiasm and patience have enlightened
me throughout this dissertation work.
I would also like to gratefully acknowledge the valuable assistance of the other
members of my dissertation committee, Professor Kyoung-Joung Lee, Professor Kyoung-Hwan
Kim, Professor Young-Cheol Park, and Professor Hong-Sik Choi. I want to sincerely thank
Professor Hong-Sik Choi and Professor Young-Cheol Park for their scholarly comments and
encouragement on my dissertation. In addition, Professor Hyung-Ro Yoon, Professor Yoon-Sun
Lee, Professor Dong-Yoon Kim, Professor Young-Ho Kim, Professor Tae-Min Shin,
Professor Hyo-Sung Jo, Professor Bup-Min Kim, and Professor Han-Sung Kim have followed
my dissertation work closely during the past five years of my graduate study. Their insight and
knowledge in biomedical engineering have greatly influenced many aspects of my research.
They have always inspired me and motivated me to work diligently.
I also want to thank Professor Seung-Hoon Park of Kyunghee University, Professor
Sung-Oh Hwang of Wonju Medical College, Professor Dong-Yul Na of the Department of
Computer & Telecommunication Engineering, and Chul-Gyu Lee of Konkuk University. They
have always been an academic guiding light to me. I also appreciate all the professors in the
Department of Biomedical Engineering, even those with whom I have had no direct contact.
I sincerely want to express my gratitude to my office mates, Hyo-min Kim and Sang-Ha Song,
my old colleagues, Young-Gu Yoon and Jung-Woo Lee, and my benefactor, Hironori Suzaki in Japan.
Without their generous support and collaboration, this dissertation would never have been
finished.
I also appreciate Dr. Sung-Hee Choi, Dr. Jae-Name Choi, Dr. Sung-Eun Im, Dr. Jae-Ok
Kim, and Dr. Hae-Suk Park of the Institute of Logopedics & Phoniatrics, Yong-Dong
Severance Hospital, for kindly sharing useful information and helping me acquire the speech
database. In particular, I would like to mention Dr. Sung-Hee Choi for her practical advice,
and Dr. Jae-Name Choi for her unsparing help.
I would like to thank my former officemates and other students, Dr. Won-Sik Kim, Dr.
Dong-Ik Cha, Dr. Hong-Mo Sung, Dr. Jae-Woo Shin, Dr. Ah-ram Sul, Woo-Hee Lee, Won-
Suk Jang, Sung-Yoon Kim, Suk-Gyun Hong, Hae-Won Choi, Seung-Ha Lee, Byung-Yoon
Kang, Min-Suk Cha, Joo-Sung Lee, Sae-Lim Park, Gyu-Suk Hong, Jung-Hoon Lee, Hun
Shim, Yong-Ju Yang, Zip-Min Jung, Joo-Hwan Lee, and Yong-Gu Jang for creating a
pleasant working environment for me. The administrative support of our system officer, Jong-
Su Ahn, Myung-Bae Yang, Gyung-Ja Kim, Mi-Hyung Lee and Byung-Wook Kim is also
highly appreciated.
I would like to thank my close old friends, Ji-Eun Lee, Sang-Woo Kim, Won-Wu Lee,
Byung-Geun Hong, Jun-young Lee, Seung-Hoon Son, Byung-Joo Lee, Jin-Won Kang, Jun-
Hee Yoon, Hee-Joong Lee, Sang-Hoon Han, Dae-Young Kim, Ki-Tae Park, Phil-Sung Oh,
Seung-Hyun Son, Chan-Ho Lee, Hyun Heo, Sang-Hoon Han, Dong-Gyu Shin, Dong-Sun Kim,
Dae-Geun Jeon, Dong-Won Kang, Ki-Won Lee, Nam-Hoon Kim, Jong-Gu Lee, Seung-Hoon
Kim and Gi-Sik Tae. Their support has always encouraged me, even though they are
far away from me.
“Patience is truly hard, yet it is the only thing worth learning.
Nature and growth, peace and prosperity and beauty all rest on patience,
and require time, stillness, and trust.”
Hermann Hesse
August 2007
from Seung-Jin Jang
CONTENTS
FIGURE LEGENDS
TABLE LEGENDS
ABBREVIATIONS
ABSTRACT
1. Introduction
1.1 General Backgrounds
1.1.1 Mechanism of the Voice Production
1.1.2 A Brief View of the Voice Disorders: Benign Vocal Fold Lesions Focus
1.1.3 Speech Production Model
1.2 Problem Definition
1.3 Organization of the Thesis
2. Robust Pitch Detection Algorithm for Pathological Voice
2.1 Introduction
2.1.1 Introduction to Pitch (Fundamental Frequency; F0) Perception
2.1.2 Difficulties of Pitch Estimation
2.1.3 Characteristics of Pathological Voice
2.2 Review of the Several Established PDAs
2.2.1 Time Domain Approaches
2.2.1.1 Autocorrelation (AC)
2.2.1.2 Average Magnitude Difference Function (AMDF)
2.2.1.3 YIN
2.2.2 Frequency Domain Approaches
2.2.2.1 CEPSTRUM
2.2.2.2 Simplified Inverse Filtering Techniques (SIFT)
2.2.3 Alternative Approaches
2.2.3.1 Wavelet
2.2.3.2 State-Space Embedding
2.3 Robust Pitch Detection Algorithm for Pathological Voice Based on Fast Orthogonal Search
2.3.1 Introduction of Fast Orthogonal Search Algorithm
2.3.2 Pitch Selection
2.4 Experimental Procedure
2.4.1 Speech Database
2.4.2 Preconditions of Performance Evaluation
2.4.3 Error Types of PDA
2.4.4 Optimum Window Selection
2.5 Experimental Results
2.5.1 Evaluating Performance of PDAs in Normal versus BVFL Voices
2.5.2 Evaluating Performance of PDAs in Aperiodicity Level of Voices
2.6 Summary
3. Comparison of Acoustic and Electroglottographic Parameters of BVFL before and after Laryngeal Surgery
3.1 Introduction to Acoustic and Electroglottographic Analysis on Vowel
3.1.1 Acoustic Analysis
3.1.2 Electroglottographic Analysis
3.1.3 Measurement of Pathological Voice
3.2 Methods and Experiments
3.2.1 Experimental Data and Protocol
3.2.2 Analysis and Results of Formant Frequencies
3.2.2.1 Estimation of Formant Frequencies
3.2.2.2 Comparative Results
3.3 Analysis and Results of Fundamental Frequency Perturbation (Jitter)
3.3.1 Various Jitter Measures
3.3.2 Comparative Results
3.3.3 Analysis and Results of Intensity Perturbation (Shimmer)
3.3.3.1 Various Shimmer Measures
3.3.3.2 Comparative Results
3.3.4 Analysis and Results of Noise Components
3.3.4.1 Estimation of the noise in the spectral domain
3.3.4.2 Estimation of Harmonic-to-Noise Ratio (HNR)
3.3.5 Estimation of Degree of Hoarse (DH) and Normalized Noise Energy (NNE)
3.3.6 Estimation of the normalized first harmonic energy (NFHE)
3.3.6.1 Comparative Results
3.3.7 Analysis and Results of Electroglottographic Parameters
3.3.7.1 Estimation of Open Quotient and Speed Quotient
3.3.7.2 Comparative Results
3.4 Summary
4. Modification of Preoperative Vowel Sounds based on Acoustic and Electroglottographic Analysis
4.1 Introduction to Perception of Aperiodicity in Pathological Voices
4.2 Synthesized Vowel Modeling
4.2.1 Glottal Waveform Modeling
4.2.1.1 Rosenberg’s Model
4.2.1.2 Titze’s model
4.2.2 Aperiodicity of Glottal Waveform
4.3 Modifications of Preoperative Vowel
4.3.1 Design of Modification of Fundamental Frequency
4.3.1.1 Pitch Scale Modification and Jitter using PSOLA
4.3.1.2 Modification of Intensity
4.3.1.3 Short term Postfilter
4.4 Design of Enhancement of Noise Components
4.4.1 Introduction to Wavelet Transform Threshold Shrinkage
4.4.2 Determination of Adaptive Threshold
4.5 Modification of Baseline Wander of EGG Signal
4.5.1 Introduction to Empirical Mode Decomposition
4.6 Summary
5. Nonlinear Speech Production Modeling using Nonlinear Autoregressive Exogenous based on Support Vector Regression
5.1 Introduction of Speech Production Modeling
5.1.1 Overview of Linear Speech Production Modeling
5.1.2 Limitations of Linear Speech Production Modeling
5.2 Overview of Nonlinear Speech Production Modeling on Support Vector Regression
5.2.1 Review of Former Research in Nonlinear Speech Production Modeling
5.2.2 Introduction of Support Vector Machine for nonlinear regression
5.3 Nonlinear Speech Production Modeling based on Support Vector Regression
5.3.1 NARX using SVR Model
5.3.2 Optimum parameter Selection
5.4 Evaluation of NARX using SVR Model
5.4.1 Multi-band Model
5.5 Experimental Results
5.6 Summary
6. Conclusion
Appendix A
Appendix B
References
Abstract (in Korean)
FIGURE LEGENDS
Figure 1-1. The subsystems of voice
Figure 1-2. Benign vocal fold lesions pictured by stroboscope (KayLab). (A) Vocal fold nodules, (B) Vocal fold cyst, (C) Vocal fold polyp, (D) Normal vocal fold
Figure 1-3. Diagram of the source-filter theory
Figure 2-1. Episodes of unvoiced sound |t| and voiced sound |a|
Figure 2-2. Vibratory pattern of vocal folds with a single opening and closing & airflow between the vocal folds as change of vocal folds vibration
Figure 2-3. Episodes of pitch doubling and halving error: pitch estimated by autocorrelation method
Figure 2-4. Pitch detection standard, domain and boundary of AC: (left) analyzed speech signal, (right) pitch selection in time domain after autocorrelation
Figure 2-5. Pitch detection standard, domain and boundary of AMDF: (left) analyzed speech signal, (right) pitch selection in time domain after AMDF
Figure 2-6. Pitch detection standard, domain and boundary of YIN: (left) analyzed speech signal, (right) pitch selection in time domain after YIN
Figure 2-7. Pitch detection standard, domain and boundary of Cepstrum: (left) analyzed speech signal, (right) pitch selection in quefrency domain after Cepstrum
Figure 2-8. Overall process of SIFT (top) and Pitch detection standard, domain and boundary of SIFT: (bottom left) analyzed speech signal, (bottom right) pitch selection in time domain after SIFT (Hybrid method)
Figure 2-9. Diagram of fast lifting wavelet transform
Figure 2-10. Pitch detection standard, domain and boundary of Wavelet: (left) analyzed speech signal, (right) pitch selection in time domain after FLWT
Figure 2-11. Pitch detection standard, domain and boundary of State-Space Embedding: pitch selection in periodicity histogram (time domain) after singular value decomposition
Figure 2-12. Two episodes of pitch selection in FOS
Figure 2-13. Pitch selection of boundary of a third of global maximum peak
Figure 2-14. Gross error rates as a function of window size in normal database
Figure 2-15. Gross error rates as a function of window size in BVFL database
Figure 3-1. Episodes of speech, spectrum, and EGG of normal and pathological voices
Figure 3-2. Formant frequencies tracking based on phase spectrum of LPC
Figure 3-3. Box plots of F1, F2, and F3 formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-4. Box plots of F1, F2, and F3 formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 3-5. Loci of the mean and 1 S.D. of F1 and F2 formant frequencies of male before and after surgery
Figure 3-6. Loci of the mean and 1 S.D. of F1 and F2 formant frequencies of female before and after surgery
Figure 3-7. Mean F0 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-8. Jitter (%) of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-9. Pitch perturbation factor of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-10. RAPP15 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-11. Shimmer (%) of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-12. RAAP15 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-13. HNR ratio calculation using Cepstral smoothing in Spectrum
Figure 3-14. Influence of cepstral smoothing due to liftered long-term temporal window (57 ms)
Figure 3-15. Influence of cepstral smoothing due to liftered short-term temporal window (1 ms)
Figure 3-16. Plot of harmonic and noise phase segment in spectral domain
Figure 3-17. Box plots of HNR, NNE, and DH of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-18. Box plots of HNR, NNE, and DH of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 3-19. Box plots of NFHE of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-20. Box plots of NFHE of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 3-21. Determination of start point of opening phase and closing phase in EGG, 16-smoothed EGG, and differentiated EGG waveform
Figure 3-22. Detail definition of opening and closing phase period in EGG waveform
Figure 3-23. Box plots of OQ and SQ of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-24. Box plots of OQ and SQ of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 4-1. Glottal waveform generated by Rosenberg model
Figure 4-2. Glottal waveform generated by Titze’s model
Figure 4-3. Pitch period modification by PSOLA
Figure 4-4. Episodes of pitch scale modification by PSOLA
Figure 4-5. Intensity modification by Shimmer (%) of 2.5 %
Figure 4-6. Plots of short term postfiltered voiced sound in time and spectral domain
Figure 4-7. Plots of HNR, NNE, NFHE, and DH as a function of (a) jitter (S.D. 40%), (b) shimmer (40%), and (c) noise (6%), and (d) magnification of (c) in NFHE
Figure 4-8. Examples of synthetic voiced sound |a| (a) with shimmer and noise of 5 % and jitter of 0.75 %, (b) with shimmer and noise of 40 % and jitter of 6 %
Figure 4-9. Plots of HNR as a function of Jitter (S.D. 0.75-6.0 %) & noise and shimmer (S.D. 5-40 %) for phonation |a|, |e|, |i|, |o|, |u| for female group
Figure 4-10. Plots of HNR as a function of Jitter (S.D. 0.75-6.0 %) & noise and shimmer (S.D. 5-40 %) for phonation |a|, |e|, |i|, |o|, |u| for male group
Figure 4-11. Episode of denoising with Wavelet threshold shrinkage
Figure 4-12. Process diagram of EMD
Figure 4-13. Reduction of baseline wander in EGG waveform by EMD and high pass filter with FIR 500-order (Pass band: over 40 Hz); voiced sound |u| with sampling rate of 22050 Hz
Figure 4-14. Plots of IMFs and residue of EGG waveform in Figure 4-7.
Figure 5-1. Buffered MLP structure with input TDL
Figure 5-2. A scheme of NARX using SVR
Figure 5-3. 2D-plot of selection of optimum σ² and γ for phonation |i| of male group
Figure 5-4. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |a| of male
Figure 5-5. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |i| of male
Figure 5-6. Multiband SVR Model with wavelet filterbank
Figure 5-7. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |i| of male
Figure 5-8. Comparison of spectrogram between original speech signal and synthesized speech signal
TABLE LEGENDS
Table 1-1. Classification of voice disorder
Table 1-2. Phonatory function examinations
Table 1-3. Detail characteristics of benign vocal fold lesions: Nodules, Polyps, and Cysts
Table 2-1. Results of performance of the available PDAs in database of normal and BVFL
Table 2-2. Results of performance of the available PDAs in database of polyp (Npolyp = 195)
Table 2-3. Results of performance of the available PDAs in database of cyst (Ncyst = 85)
Table 2-4. Results of performance of the available PDAs in nodule database (Nnodule = 140)
Table 2-5. Results of performance of the available PDAs in database of normal and BVFL
Table 2-6. Results of performance of the available PDAs in database of normal and BVFL
Table 3-1. Mean and S.D. of formant frequencies from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 3-2. Various jitter measures
Table 3-3. Various shimmer measures
Table 3-4. Mean and S.D. of formant frequencies from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 3-5. Mean and S.D. of NFHE from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 3-6. Mean and S.D. of open quotient and speed quotient from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 3-7. Mean and S.D. of speed quotient from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 4-1. Correlation coefficients of each noise estimation as a change of jitter, shimmer, and noise
Table 5-1. Results of optimized σ², γ and mean square error of RBF kernel in phonation |a|, |e|, |i|, |o|, |u| for both sexes
Table 5-2. Results of jitter (%) between synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes
Table 5-3. Results of Lyapunov exponents between synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes
ABBREVIATIONS
AC Autocorrelation
AMDF Average Magnitude Difference Function
APF Amplitude Perturbation Factor
AR Autoregressive
BVFL Benign Vocal Fold Lesions
DAPF Directional Amplitude Perturbation Factor
DEGG Derivative Electroglottography
DH Degree of Hoarse
DPPF Directional Pitch Perturbation Factor
EGG Electroglottography
EMD Empirical Mode Decomposition
FOS Fast Orthogonal Search
GCI Glottal Closure Instant
HHT Hilbert-Huang Transform
HNR Harmonic-to-Noise Ratio
IMF Intrinsic Mode Function
LS-SVM Least-Squares SVM
LP Linear Prediction
MAJ Mean Absolute Jitter
MAS Mean Absolute Shimmer
NFHE Normalized First Harmonic Energy
NN Neural Network
NNE Normalized Noise Energy
OQ Open Quotient
PDA Pitch Detection Algorithm
PPF Pitch Perturbation Factor
PSOLA Pitch Synchronous Overlap and Add
RAAP Relative Average Amplitude Perturbation
RAPP Relative Amplitude Pitch Perturbation
SIFT Simplified Inverse Filtering Techniques
SQ Speed Quotient
SVM Support Vector Machine
SVR Support Vector Regression
ABSTRACT
Estimation of Postoperative Vowel of Benign Vocal Fold Lesions
using Nonlinear Speech Modeling
Seung-Jin Jang
Department of Biomedical Engineering
The Graduate College
Yonsei University
In pathological voices, perceptual aperiodicity is mainly caused by perturbation factors
such as jitter, shimmer, and noise. These factors arise mainly from poor control of
vocal fold vibration, mass lesions of the vocal cords, and noise at emission and
breathiness. Our hypothesis is that reducing these perturbation factors in a pathological voice
can enhance it so that it resembles the postoperative voice.
For benign vocal fold lesions, this study designs and implements an estimator of the postoperative
vowel using nonlinear speech modeling based on a nonlinear autoregressive model with
exogenous input (NARX), guided by acoustic and electroglottographic analysis of
preoperative and postoperative sustained vowels.
First, a robust pitch detection algorithm (PDA) for pathological voices is proposed to enable
accurate acoustic analysis. Compared with other established PDAs, the proposed PDA, based on
fast orthogonal search, considerably reduces gross pitch errors, especially pitch halving
errors.
Next, various acoustic and electroglottographic measurements were taken twice, before and
after laryngeal surgery, for 42 subjects with benign vocal fold lesions. The mean pitch of
the male group decreased by about 12-15 % of the preoperative pitch, whereas that of the
female group did not change significantly. Formant frequencies remained constant before and
after surgery. Most jitter measures changed significantly, but only some shimmer measures
differed after surgery. Among the noise-related measures, namely harmonic-to-noise ratio (HNR),
normalized noise energy (NNE), degree of hoarse (DH), and normalized first harmonic energy
(NFHE), some phonations showed significant differences depending on sex. No changes
were found in the open quotient (OQ) and speed quotient (SQ) of the electroglottography (EGG)
measures; however, when the subjects were divided into two groups by the mean SQ value,
each SQ group showed a particular characteristic of regressing toward the normal range.
Based on these results, we modify the preoperative voiced sounds to enhance their
perceptual quality toward that of a normal voice. Enhancement rates are set from statistics
of the differences between preoperative and postoperative speech sounds.
Pitch period, intensity, and aspiration noise are modified by pitch
synchronous overlap and add (PSOLA), an intensity modifier, and wavelet threshold shrinkage,
respectively, and the baseline wander of the EGG signal is reduced using empirical mode
decomposition (EMD). The modified speech and EGG signals are used as input signals to the
nonlinear speech model, a NARX model based on least-squares support vector regression.
Finally, the preoperative vowel modified on the basis of acoustic and electroglottographic
analysis can closely resemble the postoperative vowel in the spectral and dynamic domains.
Nonlinear speech modeling using NARX-based SVR also performed better than
LPC in the perceptual quality of voiced sounds. We attribute this result to the preservation
of natural jitter, shimmer, and noise, whereas LPC produces artificial sounds that lack
naturalness.
Key Words: pitch detection algorithm, benign vocal fold lesions, nonlinear speech modeling,
nonlinear autoregressive exogenous, acoustic analysis, electroglottographic analysis
CHAPTER 1
Introduction
1.1 General Backgrounds
1.1.1 Mechanism of the Voice Production
Voice is produced by a complex, multi-organ system [1]. The system consists mainly
of three subsystems, as shown in Fig. 1-1. First, voice production begins with the
respiratory system. The respiratory organs supply airflow, which is inhaled as
the diaphragm lowers, and part of the aerodynamic energy of the air is converted to acoustic energy
by the larynx, sometimes called the voice box. The larynx is positioned between the base of the
tongue and the top of the trachea, and is a cylindrical framework of cartilage that serves to
anchor the vocal folds. The vocal folds, also called vocal cords, are two bands of smooth
muscle tissue that lie opposite each other, housed within the larynx. The vocal folds play an
essential role in the production of glottal sound. The glottal sound is resonated and filtered as
it travels through the vocal tract. The vocal tract length varies between 13 cm and 20 cm across
speakers and may change according to the sound produced. For an average adult male, the
vocal tract is considered to be about 17 cm long in its rest position. Finally, the filtered glottal
sound is radiated through the throat, nose and mouth (resonating cavities). The size and shape
of these cavities, along with the size and shape of the vocal folds, help to determine voice
quality. Furthermore, variety within an individual voice is the result of lengthening or
shortening, tensing or relaxing the vocal folds.
1.1.2 A Brief View of the Voice Disorders: Benign Vocal Fold Lesions Focus
A voice disorder, or dysphonia, is one of a group of problems involving abnormal pitch,
loudness, or quality of the sound produced by a defective larynx [2]. Voice disorders are usually
divided into three main categories: organic, functional, and a combination of the two. Organic
voice disorders are divided into two groups [3]: structural and neurogenic. Structural disorders
involve something physically wrong with the mechanism, often involving the tissue or
fluids of the vocal folds. Neurogenic disorders are caused by a problem in the nervous system
as it interacts with the larynx. Functional disorders are caused by poor muscle functioning. All
functional disorders fall under the category of muscle tension dysphonia. Psychogenic
disorders also exist, because the voice can be disturbed for psychological reasons.
In this case, there is no structural reason for the voice disorder, and there may or may not be
some pattern of muscle tension. A detailed classification of voice disorders is given in Table
1-1.
Figure 1-1. The subsystems of voice
In clinical settings, phonatory function examinations aim to determine the diagnosis of
the lesion and its size, vibratory mode, and degree of dysphonia, which lead to the
establishment of a treatment strategy. Aside from the subjective impressions of the patient and
voice therapist, objective measures are available to aid in the assessment of laryngeal
function before and after surgery. Acoustic, phonatory airflow, qualitative stroboscopic,
and other measurements (summarized in Table 1-2) have been used to analyze the results of
microlaryngeal phonosurgery [4].
This dissertation focuses on only three structural voice disorders: nodules,
polyps, and cysts (usually called benign vocal fold lesions; BVFL). There are three reasons: 1)
these lesions are commonly treated by surgery, although surgery is not the modal treatment;
2) excision is therefore the only factor that affects the voice quality between preoperative
and postoperative voiced sounds; and 3) they are usually reversible, i.e. they resolve, and
recurrence rates are low. Benign vocal fold lesions are non-cancerous growths of abnormal
tissue on the vocal folds, so they are not life-threatening. However, these
lesions are important because of their influence on voice quality, and excessive growth of
these lesions may affect breathing patterns.
Figure 1-2. Benign vocal fold lesions pictured by stroboscope (KayLab). (A) Vocal fold nodules, (B) Vocal Fold cyst, (C) Vocal fold Polyp, (D) normal vocal fold.
Table 1-1. Classification of voice disorder

Type of voice disorder: Organic
  Division: Structural
    • Contact ulcers
    • Nodules (nodes)
    • Cysts
    • Polyps
    • Granuloma
    • Hemorrhage
    • Hyperkeratosis
    • Laryngitis
    • Leukoplakia
    • Trauma
    • Miscellaneous growths
    • Papilloma
  Division: Neurogenic
    • Paralysis/Paresis
    • Spasmodic dysphonia (laryngeal dystonia)
    • Tremor (benign essential tremor)
    • Voice problem caused by another neurological disorder
      (e.g. Parkinson's disease, myasthenia gravis, ALS/Lou Gehrig's disease)

Type of voice disorder: Functional
    • Muscle tension dysphonia
    • Anterior-posterior constriction
    • Hyperabduction
    • Hyperadduction
    • Pharyngeal constriction
    • Ventricular phonation
    • Vocal fold bowing

Type of voice disorder: Psychogenic
    • Conversion dysphonia (aphonia)
    • Puberphonia (mutational falsetto)
Table 1-2. Phonatory function examinations

Examination: Aerodynamics
  Parameters:
    • Subglottic pressure
    • Supraglottic pressure
    • Glottal impedance
    • Volume velocity of the airflow at the glottis (mean airflow rate)

Examination: Stroboscope (vocal fold vibration)
  Parameters:
    • Regularity or periodicity
    • Symmetry between the vocal folds
    • Glottal closure (glottal area waveform)
    • Amplitude
    • Mucosal wave
    • Non-vibrating portion

Examination: Acoustic analysis
  Parameters:
    • Fundamental frequency (F0; pitch)
    • Intensity
    • Perturbations of pitch (various jitter and shimmer measures)
    • Amount of noise

Examination: Glottis analysis
  Parameters:
    • EGG (ex: speed quotient, open quotient)
    • PGG

Examination: Psychophysical measurement
  Parameters:
    • GRBAS scale
    • Vocal Profile Analysis (VPA)
    • Buffalo III Voice Profile

Examination: Phonatory ability
  Parameters:
    • Various physical phonatory parameters (ex: maximum phonation time)

Examination: Voice profile
  Parameters:
    • Frequency range
    • Intensity range
In more detail, the vocal fold nodule, polyp, and cyst are defined as separate
entities by the otolaryngologist and voice pathologist based on their anatomic location and
gross appearance. A polyp is defined as a lesion on the anterior third of the vocal fold. It may
be sessile or pedunculated and, if pedunculated, very mobile. A nodule is defined as a small
lesion occurring on both sides of the vocal fold, strictly symmetric on the border of the
anterior and middle third of the vocal fold and usually immobile during phonation. The lesion
is confined to the superficial layer and is associated with trauma or irritation of the vocal
fold [5]. Cysts are divided into two types: mucous retention cysts and squamous inclusion
cysts. Mucous retention cysts usually arise below the free margin of the glottis and are
translucent collections of mucus that likely arise from a plugged mucous gland duct.
Squamous inclusion cysts appear as yellow fusiform masses within the lamina propria. The
vocal fold mucosal wave is reduced to absent, and the amplitude of vibration is moderately
to severely decreased [6]. The detailed characteristics of each BVFL are given in Table 1-3 [7].
Table 1-3. Detailed characteristics of benign vocal fold lesions: Nodules, Polyps, and Cysts

Appearance
  Nodules: blister-like or callous-like; symmetric, firm, full
  Polyps: solid or fluid filled, thin surface (can be quite large)
  Cysts: undulation or fullness
Color
  Nodules: white to opaque
  Polyps: translucent to red
  Cysts: translucent to yellow
Location
  Nodules: anterior-middle third junction
  Polyps: free edge of anterior third
  Cysts: free edge of superior surface of middle third
Closure configuration
  Nodules: posterior chink, hourglass
  Polyps: irregular
  Cysts: likely complete
Free edge roughness
  Nodules: slight
  Polyps: moderate to severe
  Cysts: smooth to slight
Amplitude
  Nodules: slight to moderate decrease
  Polyps: slight to severe decrease
  Cysts: moderate to severe decrease
Wave
  Nodules: any
  Polyps: present or increased
  Cysts: diminished or absent
Vibration
  Nodules: majority present
  Polyps: majority with complete or partial absence
  Cysts: majority with partial absence
Voice symptom
  Nodules: variable (from normal to breathy, very hoarse and strained); unable to sing high and soft notes (delay in the onset of the sound with an audible air escape)
  Polyps: variable (from normal to severely dysphonic)
  Cysts: variable (from normal to breathy, very rough and hoarse)
1.1.3 Speech Production Model
There are broadly two approaches to developing a speech production model: articulatory
modeling and acoustic modeling. Articulatory modeling models the positions and
movements of the articulatory organs, on the premise that a similar underlying system
produces a similar result. Articulatory models offer good reproduction with
simple control and can reproduce all the perceptually relevant effects of real speech, such as
co-articulation [8]. However, they require abundant information about the values and
dimensions of the vocal tract and a detailed analysis of the movement of the articulators. In
contrast, the acoustic modeling approach is widely adopted in speech production applications
because it requires only the speech waveform, which is easily obtained with a recorder.
Acoustic modeling models the speech waveform directly in either the time or
frequency domain.
Figure 1-3. Diagram of the source-filter theory
One of the most popular acoustic models is the source-filter speech
production model shown in Fig. 1-3 [9, 10]. This theory models speech as a combination of
a sound source (representing the vocal folds), a filter (representing the vocal tract), and a
radiator (representing the lips). The model usually distinguishes two classes of phonemes by
the properties of their sources and spectral shapes: voiced sounds and unvoiced sounds. For
voiced sounds, periodic glottal excitation is regarded as the source. For unvoiced sounds,
turbulent noise produced at a constriction in the vocal tract itself is regarded as the source.
In this model, the transfer function of equation (1.1) characterizes the vocal tract system in
the frequency domain.
$$H(f)\,R(f) = \frac{Y(f)}{X(f)} \tag{1.1}$$
The transfer function gives the ratio of the spectrum of the sound pressure wave,
$Y(f)$, at some fixed distance from the lips, to that of the volume velocity wave,
$X(f)$, at the source. $H(f)$ and $R(f)$ are the vocal tract transfer function and the lip
radiation characteristic, respectively. Finally, the spectrum of speech is obtained by
combining these transfer functions as in equation (1.2).
$$Y(f) = X(f)\,H(f)\,R(f) \tag{1.2}$$
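As an illustration of equation (1.2), the following minimal sketch applies the three stages in the time domain. It is not the model used in this dissertation: the function name `synthesize_vowel`, the impulse-train source, the formant frequencies and bandwidths, and the first-difference radiation filter are all assumptions chosen only to make the source, filter, and radiator roles concrete.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_vowel(f0=120.0,
                     formants=((700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)),
                     fs=16000, duration=0.5):
    """Crude vowel-like signal: source X -> vocal tract H -> lip radiation R.

    All parameter values are illustrative assumptions, not measured data.
    """
    n = int(fs * duration)
    # Source X(f): impulse train, an idealized periodic glottal excitation.
    source = np.zeros(n)
    period = int(round(fs / f0))
    source[::period] = 1.0
    # Vocal tract H(f): cascade of second-order all-pole resonators,
    # one per (center frequency, bandwidth) pair in Hz.
    tract = source
    for freq, bw in formants:
        r = np.exp(-np.pi * bw / fs)
        theta = 2.0 * np.pi * freq / fs
        tract = lfilter([1.0], [1.0, -2.0 * r * np.cos(theta), r * r], tract)
    # Lip radiation R(f): first-difference (high-pass) approximation.
    return lfilter([1.0, -1.0], [1.0], tract)
```

Because every stage is a linear time-invariant filter, the cascade in the time domain corresponds to the product $X(f)H(f)R(f)$ in the frequency domain, and the output of a periodic source becomes periodic once the filter transients have decayed.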
Linear prediction (LP) analysis [11] is generally adopted in the source-filter model for
speech processing tasks such as synthesis, and forms the basis of most speech coding
systems, such as vocoders, code excited linear prediction (CELP) coders, and multi-pulse
coders. The popularity of linear prediction is due to its ease of analysis and implementation
and its low computational requirements. An alternative approach is to model the transfer
function of the vocal tract system directly, known as vocal tract modeling. Autoregressive
with exogenous input (ARX), output error (OE), and state-space parameterizations of the vocal
tract filter have been used; they differ in the underlying structure of the model and in the
nature of the error that is minimized in the parameter estimation procedure.
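To make the linear prediction idea concrete, the sketch below estimates LP coefficients with the autocorrelation method and reports the prediction gain in dB, the quantity in which linear and nonlinear predictors are usually compared. The function names `lpc_autocorrelation` and `prediction_gain_db` are hypothetical, and solving the normal equations directly (rather than with the Levinson-Durbin recursion) is a simplification assumed here for brevity.

```python
import numpy as np

def lpc_autocorrelation(x, order):
    """Estimate a[1..p] such that x[n] is approximated by sum_k a[k] * x[n-k]."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation at lags 0..order.
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    # Normal (Yule-Walker) equations: R a = r[1:], with R Toeplitz.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

def prediction_gain_db(x, a):
    """Signal power divided by prediction-error power, in dB."""
    x = np.asarray(x, dtype=float)
    p = len(a)
    pred = np.zeros_like(x)
    for k in range(1, p + 1):
        pred[k:] += a[k - 1] * x[:-k]
    err = x[p:] - pred[p:]
    return 10.0 * np.log10(np.sum(x[p:] ** 2) / np.sum(err ** 2))
```

A higher prediction gain means that less of the signal energy is left in the residual, so a gain difference between two predictors on the same data is a direct measure of how much more structure one of them captures.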
Experimental and theoretical evidence shows that the speech production mechanism is
nonlinear [12]. Nonlinearities in speech data are caused by rapid transitions
between and during phones, especially plosives, where the vocal tract is occluded, and
by turbulent excitation during unvoiced segments. Glottal opening and closure during the pitch
periods of voiced speech cause coupling at the back of the throat, which introduces additional
energy loss. Linear models of the vocal tract system have limited performance because they
may not capture the structure of the data or the underlying system dynamics. Applying
nonlinear models to speech prediction has shown a 2-3 dB improvement in
prediction gain over linear models [13-15].
1.2 Problem Definition
In the field of voice pathology, speech-language pathologists have usually classified and
described the nature of a pathological voice using scales or terms that denote its perceptual
impression, such as hoarseness, breathiness, and roughness. Three formal
perceptual protocols are commonly used worldwide: the Vocal Profile Analysis
(VPA) [16], the GRBAS scale, and the Buffalo III Voice Profile [17]. Perceptual voice quality
evaluation provides a baseline of the extent and type of the presenting problem and therefore
allows the process of therapy to be monitored. This provides the clinician with a valuable
clinical outcome tool. Perceptual evaluation of voice quality may be the most valid clinical
outcome measure, because patients with voice disorders seek treatment when their voices do not
sound normal and often judge whether treatment has been successful by whether or
not they sound better.
However, perceptual evaluation of voice quality has some problems. Potential problems
include the lack of a standard set of well-defined terms, the variability of the human voice,
the limited reliability of perceptual voice quality ratings between and within raters, and
considerable disparity in the design of the available perceptual rating scales [18-20].
There is a further problem. Many patients who must undergo surgery to restore their voices
worry about the result of the excision. Presenting an estimated postoperative phonation can
therefore relieve patients of the fear that phonation will not improve or will become worse
than before. Moreover, the estimated postoperative phonation informs patients how well the
surgery worked, by comparing the postoperative sounds with ideal simulated sounds that have
a normal range of personalized acoustic characteristics.
Some research has analyzed benign vocal fold lesions before and after
surgery using acoustic, aerodynamic, and stroboscopic measures [21,22], glottal area
waveform analysis [23], analysis of the characteristic features of muscle tension [24,25], and
correlation analysis between acoustic examination and images of the vocal folds [26,27]. These
studies are valuable, but give little objective information on the influence of the operation.
There are few studies on the estimation of postoperative phonation in benign vocal fold
lesions. Kim [28] and Baek [29] studied the prediction of postoperative voice by speech
synthesis in benign laryngeal diseases. Although these studies were the first research on the
prediction of postoperative phonation, the approach appeared to be impractical, because
prediction based on linear prediction analysis adjusted to the normal range of jitter and
shimmer lacks naturalness in phonation and does not preserve the uniqueness of individual
human voices.
This dissertation addresses the problem of predicting postoperative phonation based on
acoustic and electroglottographic analysis of preoperative and postoperative voiced
sounds and on a natural speech production model. The overall strategy of our investigation is
to build a refined nonlinear speech production model from speech modified using
parameters extracted from diverse analyses. The primary contributions of this dissertation can
be summarized as follows. First, we develop a robust pitch detection algorithm for pathological
voices, in which pitch doubling and halving errors often occur, because established pitch
detection algorithms perform poorly on sounds with aperiodic pitch such as
pathological voices. Second, we investigate quantitative measures of preoperative and
postoperative data by acoustic and electroglottographic analysis in order to find the
differences before and after laryngeal surgery. Third, we propose a nonlinear speech
production model based on nonlinear autoregressive with exogenous input (NARX) using
least-squares support vector regression. Experimental results show that this production model
is able to synthesize natural-quality sounds. Finally, we compare simulation results for the
postoperative vowel.
1.3 Organization of the Thesis
This dissertation is organized as follows. Chapter 2 introduces a robust pitch detection
algorithm for pathological voice using Fast Orthogonal Search (FOS) analysis. After
established pitch detection algorithms (PDAs) are reviewed, our proposed PDA for
pathological voice is described and compared with those PDAs under various conditions such as
normal/pathological, male/female, severity of voice pattern, age of subject, and phonation
type. Chapter 3 presents the diverse acoustic and electroglottographic characteristics of
pathological voice in benign vocal fold lesions (BVFL) before and after laryngeal surgery.
Pre- and post-treatment pathological voices in BVFL are analyzed with various
measurements: perturbation of pitch period and intensity, noise due to physical changes of the
vocal folds, and the pattern of vocal fold vibration. Chapter 4 presents a more precisely
predictive model of the postoperative vowel. Pitch synchronous overlap and add (PSOLA),
used for pitch and formant frequency modification, and wavelet transform threshold
shrinkage, used to estimate the noise-eliminated pathological voice, are described and evaluated.
Reduction of the baseline wander of electroglottography (EGG) using empirical mode
decomposition (EMD) is also presented. Chapter 5 introduces existing linear and nonlinear
speech production models and then develops and tests the proposed nonlinear speech production
model using least-squares support vector regression. The voice quality of the postoperative
vowel and the synthesized vowel is also evaluated, and the experimental results are discussed.
Finally, conclusions and comments on some interesting future research topics are given in
Chapter 6.
CHAPTER 2
Robust Pitch Detection Algorithm for Pathological Voice
2.1 Introduction
This chapter explores some of the issues, problems, and solutions involved in
estimating the fundamental frequency of pathological voice, especially in relation to benign
vocal fold lesions. First, the term fundamental frequency, often called pitch, is introduced.
After some characteristics of pathological voices and the difficulties
of pitch estimation are presented, several established pitch detection
algorithms are briefly reviewed. Finally, the proposed robust pitch detection algorithm is
introduced and evaluated under various conditions.
2.1.1 Introduction to Pitch (Fundamental Frequency; F0) Perception
Pitch is a fundamental auditory attribute for the perception of speech and music; it is
related to the repetition rate of the sound waveform and co-varies with its fundamental
frequency. Perception of pitch is a very complex sensory phenomenon that involves many
sciences, such as physics, psychology, psychophysics, psychoacoustics, physiology, and
neurological science. Therefore, no single theory or model can explain all
the processes by which humans perceive pitch. Nevertheless, two long-standing rival theories
of pitch perception exist, and most experts agree with one or the other, with
their own variations [30-32]. One is the place theory [33], and the other is the temporal
theory [34]. The place theory postulates that the rate-place patterning of neural firings
(place-rate coding) is used to transmit frequency information to the central nervous
system, while the temporal theory holds that, at the peripheral level of the auditory nerve,
the temporal patterning of neural firings (temporal coding) is used. The inability to explain
all of the experimental data on pitch perception with a single theory or model has led to the
view that there might be two separate pitch perception mechanisms: a place or spectral
mechanism for low, resolvable harmonics, and a temporal mechanism for high, unresolvable
harmonics.
Pitch is loosely related to the logarithm of frequency: perceived pitch increases by about an
octave with every doubling in frequency. However, frequency doubling below 1000 Hz
corresponds to a pitch interval slightly less than an octave, while frequency doubling above
5000 Hz corresponds to an interval slightly more than an octave [35-37]. In the time domain, the
pitch information in voiced speech is present as quasi-periodic signal excursions, as shown in
Fig. 2-1. The voiced sound |a| is produced by excitation from the vocal cords, whereas the
unvoiced sound |t| is produced by the shape of the resonant cavity of the vocal tract. The
periods of such a voiced sound can generally be labeled by eye, a method sometimes employed
to obtain a reference pitch signal.
Figure 2-1. Episodes of unvoiced sound |t| and voiced sound |a|
2.1.2 Difficulties of Pitch Estimation
Pitch estimation is widely used in speech processing applications, but it is still
considered one of the most difficult tasks. Estimating the pitch period is difficult for the
several reasons stated below.
First, there is a domain-specific modeling problem. Pitch information is of great use in many
applications, such as automatic music transcription, speech recognition,
and applications related to pathological voice. In musical applications such as automatic music
transcription, pitch is indispensable since it directly corresponds to the height of the musical
notes. In speech, pitch can greatly improve speech intelligibility and thus can be very useful in
speech recognition systems. Sound source separation is another application where pitch
information is critical, especially when there are concurrent sound sources. In voice pathology,
jitter (perturbation of frequency) and shimmer (perturbation of amplitude), calculated from
pitch, are commonly used to test for the existence of dysphonia, to measure voice quality such as
hoarseness, and to assess the severity of pathological voice [38, 39]. It has been difficult to
develop a pitch estimator that works across wide domains, and many applications depend only
on the specific domain of their data. Therefore, a pitch detector designed for one domain is
less accurate when applied to a different domain.
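The dependence of jitter and shimmer on pitch detection can be seen from how such measures are computed. The sketch below uses one common "local" definition, the mean absolute cycle-to-cycle difference relative to the mean, expressed in percent. This definition and the function names are assumptions for illustration; the specific jitter and shimmer measures analyzed in this dissertation are those listed in Tables 3-2 and 3-3.

```python
import numpy as np

def local_jitter_percent(periods):
    """Mean absolute difference of consecutive pitch periods,
    relative to the mean period, in percent."""
    periods = np.asarray(periods, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer_percent(amplitudes):
    """Mean absolute difference of consecutive cycle peak amplitudes,
    relative to the mean amplitude, in percent."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
```

Both functions take per-cycle values, so a single pitch halving or doubling error in the period sequence produces a large spurious cycle-to-cycle difference and inflates the measure.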
Second, the complexity of human voices exceeds the ability of current pitch detectors.
For example, the pitch period changes with time, often with each glottal
period. Such a jittered pitch period violates the assumption of pitch detectors that
voiced speech is stationary and periodic within an analysis segment of about 15-20 ms, and may
cause major failures of pitch estimation. Sub-harmonics of the fundamental frequency, which
are sub-multiples of the "true" frequency, often appear. In many cases, when strong
sub-harmonics and high-intensity aperiodic components are present, the most reasonable
objective pitch estimate is clearly at odds with the auditory percept.
Finally, the dynamic range of the voice fundamental frequency also poses a problem.
Fant [40] determined that the average F0 in conversational speech in European
languages is approximately 120 Hz for men, 220 Hz for women, and 330 Hz for children,
and that the typical range exploited by a single speaker within one utterance is normally
within one octave. The maximum overall range of fundamental frequency in ordinary
conversation is about 50-250 Hz for men and about 120-480 Hz for women [41]. However, the
pitch of some male voices can be as low as 50 Hz, whereas the pitch of children's voices can
be as high as 600 Hz.
2.1.3 Characteristics of Pathological Voice
The general idea of fundamental frequency estimation or detection is to obtain the period
of the glottal excitation waveform. When the electroglottographic signal is compared with the
acoustic signal, it is evident that the vibrations in the vocal tract have their greatest
amplitude at the moment the glottis closes [42]. Moreover, the closing of the glottis is much
more abrupt than its opening, so the moment of closing can be determined more precisely.
Figure 2-2. Vibratory pattern of vocal folds with a single opening and closing & airflow
between the vocal folds as change of vocal folds vibration
This waveform is the result of the periodic opening and closure of the vocal cords in the
glottis while air is forced through from the lungs. To understand in depth how voice is
produced, it is therefore necessary to understand a vibratory cycle of the vocal folds, that
is, a single opening and closing of the vocal folds. As shown in Fig. 2-2, the vibratory
pattern of the vocal folds starts at the moment when subglottal pressure overpowers fold
resistance just enough for the vocal folds to first start to blow open. They continue to blow
apart during the open phase until the escape of air reduces subglottal pressure enough for fold
resistance to overpower the air flow (moment d). At that point, the closing phase begins as the
folds move toward each other. It ends as soon as the glottis is closed (moment f). After that,
the closed phase continues until the opening phase starts the entire cycle over again.
Figure 2-3. Episodes of pitch doubling and halving error: pitch estimated by autocorrelation method
A variety of methods exist for determining T0 (1/F0), but their application to
pathological voices gives rise to the following objective difficulties, potentially leading to
severe errors:
- Significant variations of amplitude and pitch are present.
- The vocal folds do not make contact in the right position for voicing.
- F0 estimates appear at exactly half or double the correct fundamental
frequency, caused by physical changes of the vocal folds.
In particular, structural modification of the vocal folds due to BVFL tends to produce an
abnormal cycle or pattern in the glottal excitation waveform. This abnormal pattern usually
results in the undesirable pitch doubling or halving errors shown in Fig. 2-3.
2.2 Review of the Several Established PDAs
Pitch detection algorithms (PDAs) fall mainly into three categories:
time domain approaches, frequency domain approaches, and alternative
approaches. Most pitch detection methods are based on the assumption that the speech signal is
stationary over a short time, but in reality the speech signal is non-stationary and
quasi-periodic, which sometimes induces detection errors. Usually, PDAs are not designed to
work with the whole range of pathological voices. They depend on a minimum of signal
periodicity to give accurate results. Titze et al. [43] reported an upper limit of 6% jitter for
accurate period detection. In the following, the time domain, frequency domain, and
alternative methods are addressed before being compared with our
proposed PDA for pathological voice.
2.2.1 Time Domain Approaches
2.2.1.1 Autocorrelation (AC)
The most basic approach to the problem of pitch estimation is to look at the waveform
that represents the change in air pressure over time, and attempt to detect the pitch from that
waveform. The goal of the autocorrelation [44, 45] routine is to find the similarity between
the signal and a shifted version of itself. The result of a correlation is a measure of similarity
as a function of the time lag between the beginnings of the two waveforms. Equation (2.1) shows
the mathematical definition of the autocorrelation of a finite discrete function $x(m)$ of size $N$.
Figure 2-4. Pitch detection standard, domain and boundary of AC: (left) analyzed speech signal, (right) pitch selection in time domain after autocorrelation
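$$R(k) = \sum_{m=-N}^{N} x(m)\,x(m+k), \qquad 0 \le |k| \le N, \quad \text{if input: } 1, \ldots, N-1 \tag{2.1}$$

$$\begin{aligned}
\text{i)}\;& R(k) = R(-k) \\
\text{ii)}\;& R(0) \ge |R(k)| \\
\text{iii)}\;& R(k) = R(k + N_p) \quad \text{if } N_p \text{ is periodic}
\end{aligned} \tag{2.2}$$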
Given the properties of the autocorrelation in equation (2.2), if the signal is periodic,
the autocorrelation function R(k) is also periodic, and if the signal is harmonic, the
autocorrelation function has peaks at lags that are multiples of the fundamental period. The
maximum peak occurs at a lag of 0. As the time lag increases to half of the period of the waveform,
the correlation decreases to a minimum as shown in Fig. 2-4. This is because the waveform is
out of phase with its time-delayed copy. As the time lag increases again to the length of one
period, the autocorrelation again increases back to a maximum, because the waveform and its
time-delayed copy are in phase. The first peak in the autocorrelation indicates the period of the
waveform.
This technique is the most efficient at mid to low frequencies. Thus it has been popular in
speech recognition applications where the pitch range is limited. The weakness of the
autocorrelation approach is that it is prone to sub-harmonic errors, that is, it occasionally
generates pitch estimates that differ from human pitch judgments, most often by an octave.
Broad peaks at the period and multiple non-periodic peaks also produce wrong pitch estimates.
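The autocorrelation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the function names, the 50-350 Hz search range, and the choice of the highest peak inside that range (rather than the first peak) are assumptions made for the example.

```python
import math

def autocorrelation(x, max_lag):
    """R(k) = sum of x(m) * x(m + k) over the overlapping samples."""
    n = len(x)
    return [sum(x[m] * x[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]

def ac_pitch(x, fs, fmin=50.0, fmax=350.0):
    """Return F0 in Hz from the highest autocorrelation peak.

    fmin and fmax bound the lag search range (assumed values).
    """
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    r = autocorrelation(x, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda k: r[k])
    return fs / best_lag
```

Because the number of overlapping samples shrinks as the lag grows, the peak at one period is slightly higher than the peaks at two or more periods, which is what lets the maximum search land on the true period for a clean periodic frame.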
2.2.1.2 Average Magnitude Difference Function (AMDF)
The Average Magnitude Difference Function (AMDF) [46] analysis is a variation of AC
analysis where, instead of correlating the input speech at various delays, a difference signal is
formed between the delayed speech and the original, and at each delay value the absolute
magnitude is taken. Hence, when considering DSP-chip or hardware implementation, the AMDF is
more favorable, since its calculation requires no multiplications, a desirable
property for real-time applications. The mathematical definition of the AMDF is given in
equation (2.3).

$$F(k) = \frac{1}{N}\sum_{m=1}^{N} \left| x(m) - x(m+k) \right|, \qquad 0 \le |k| \le N-1, \quad \text{if input: } 1, \ldots, N \tag{2.3}$$
where x(m) are the samples of the input speech and x(m+k) are the samples shifted by a
lag of k. The vertical bars denote the magnitude of the difference between x(m)
and x(m+k). Thus a difference signal F(k) is formed by delaying the input speech by various
amounts, subtracting the delayed waveform from the original, and summing the magnitudes of
the differences between sample values. As stated by the properties of the AMDF in equation
(2.4), the difference signal is always zero at a lag of 0, and is particularly small at delays
corresponding to the pitch period of a voiced sound having a quasi-periodic structure, as shown
in Fig. 2-5.
$$\begin{aligned}
\text{i)}\;& F(0) = 0 \quad \text{if perfectly periodic} \\
\text{ii)}\;& F(0) = F(N_p) \quad \text{if } N_p \text{ is periodic}
\end{aligned} \tag{2.4}$$
An advantage of this method is that the relative sizes of the nulls tend to remain constant
as a function of lags, because there is always full overlap of data between the two segments
being cross differenced.
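A minimal Python sketch of an AMDF-based estimator follows, assuming a fixed number of terms per lag so that the two segments always overlap fully. The function names, the 50-350 Hz search range, and the `depth` rule (accept the first local null that falls within 10 % of the observed AMDF range above the deepest null) are illustrative assumptions, not the settings of this study.

```python
import math

def amdf(x, max_lag):
    """F(k): mean absolute difference between x(m) and x(m + k)."""
    n = len(x) - max_lag  # same number of terms for every lag
    return [sum(abs(x[m] - x[m + k]) for m in range(n)) / n
            for k in range(max_lag + 1)]

def amdf_pitch(x, fs, fmin=50.0, fmax=350.0, depth=0.1):
    """Return F0 in Hz from the first sufficiently deep AMDF null."""
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    f = amdf(x, lag_max + 1)
    window = f[lag_min:lag_max + 1]
    threshold = min(window) + depth * (max(window) - min(window))
    for k in range(lag_min, lag_max + 1):
        is_local_null = f[k] <= f[k - 1] and f[k] <= f[k + 1]
        if f[k] <= threshold and is_local_null:
            return fs / k
    return None
```

Taking the first deep null instead of the global minimum is one simple way to avoid choosing a null at two or three periods, since for a periodic frame all of those nulls are nearly equal in depth.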
Figure 2-5. Pitch detection standard, domain and boundary of AMDF: (left) analyzed speech signal, (right) pitch selection in time domain after AMDF
2.2.1.3 YIN
The name YIN [47] comes from the oriental yin-yang principle of philosophical balance,
representing the authors' attempt to balance between autocorrelation and cancellation in the
algorithm. The difficulty with the autocorrelation approach is that it sometimes selects a wrong
peak as the fundamental frequency, causing sub-harmonic errors. YIN attempts to solve these
problems in several ways. YIN is based on the difference function, which minimizes the
difference between the waveform and its delayed duplicate instead of maximizing their product
as in autocorrelation. The mathematical definition is presented in equation (2.5).

$$d(k) = \sum_{m=1}^{N} \left( x(m) - x(m+k) \right)^2, \qquad 0 \le |k| \le N, \quad \text{if input: } 1, \ldots, N-1 \tag{2.5}$$
Modeling the signal x(m) as a periodic function with period N_p, by definition invariant
for a time shift of N_p, the difference between x(m) and x(m+N_p) is zero. The same
is true after taking the square and averaging over a window, as shown in equation (2.6).
Conversely, an unknown period may be found by forming the difference function of
equation (2.5) and searching for the values of k for which the function is zero or minimal.

$$\begin{aligned}
\text{i)}\;& \sum_{m=1}^{N} \left( x(m) - x(m+N_p) \right)^2 = 0 \quad \text{if } x(m) - x(m+N_p) = 0,\ \forall m \\
\text{ii)}\;& d'(k) = \begin{cases} 1, & \text{if } k = 0 \\[4pt] \dfrac{d(k)}{\dfrac{1}{k}\sum_{m=1}^{k} d(m)}, & \text{otherwise} \end{cases}
\end{aligned} \tag{2.6}$$
Additionally, YIN employs a cumulative mean function which de-emphasizes higher-
period dips in the difference function in order to reduce the occurrence of sub-harmonic errors.
Other improvements in the YIN analysis include a parabolic interpolation of the local minima,
which has the effect of reducing the errors when the period estimation is not a factor of the
window length used. Fig. 2-6 shows the pitch selection in the time domain after the above
steps.
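The three YIN steps described above (difference function, cumulative mean normalization, parabolic interpolation) can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the absolute threshold of 0.1, the 50-350 Hz search range, and the function names are choices made for the example and are not taken from this study.

```python
import math

def difference_function(x, max_lag):
    """d(k) of equation (2.5), with a fixed number of terms per lag."""
    n = len(x) - max_lag
    return [sum((x[m] - x[m + k]) ** 2 for m in range(n))
            for k in range(max_lag + 1)]

def cumulative_mean_normalized(d):
    """d'(k) of equation (2.6): d(k) divided by its running mean."""
    out = [1.0]
    running = 0.0
    for k in range(1, len(d)):
        running += d[k]
        out.append(d[k] * k / running if running > 0 else 1.0)
    return out

def yin_pitch(x, fs, fmin=50.0, fmax=350.0, threshold=0.1):
    """Return F0 in Hz, or None if no dip falls below the threshold."""
    lag_min = int(fs / fmax)
    lag_max = int(fs / fmin)
    dp = cumulative_mean_normalized(difference_function(x, lag_max + 1))
    k = lag_min
    while k <= lag_max:
        if dp[k] < threshold:
            # walk down to the bottom of this dip
            while k + 1 <= lag_max and dp[k + 1] < dp[k]:
                k += 1
            # parabolic interpolation around the local minimum
            a, b, c = dp[k - 1], dp[k], dp[k + 1]
            denom = a - 2 * b + c
            shift = 0.5 * (a - c) / denom if denom != 0 else 0.0
            return fs / (k + shift)
        k += 1
    return None
```

Accepting the first dip below the threshold, rather than the global minimum, is what biases the estimator against higher-period dips.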
Figure 2-6. Pitch detection standard, domain and boundary of YIN: (left) analyzed speech signal, (right) pitch selection in time domain after YIN
2.2.2 Frequency Domain Approaches
2.2.2.1 Cepstrum
The cepstrum is a common transform used to gain information from a person's speech
signal [48]. It can be used to separate the excitation signal, which contains the words and the
pitch, from the transfer function, which contains the voice quality. Assume that a sequence
of voiced speech is the result of convolving the glottal excitation sequence e[n] with the
vocal tract's discrete impulse response q[n]. In the frequency domain, the convolution
relationship becomes a multiplication relationship. Next, using the property of the log function
log AB = log A + log B, the multiplication relationship can be transformed into an additive
relationship. Finally, the real cepstrum of a signal s[n] = e[n] * q[n], where * denotes
convolution, is defined as equation (2.7).
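$$c(m) = F_{FFT}^{-1}\left( \log \left| F_{FFT}(x(m)) \right| \right) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log \left| S(\omega) \right| e^{j\omega m}\, d\omega \tag{2.7}$$

where

$$S(\omega) = \sum_{m=-\infty}^{\infty} x(m)\, e^{-j\omega m}, \quad \text{if input: } 1, \ldots, N-1 \tag{2.8}$$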
Figure 2-7. Pitch detection standard, domain and boundary of Cepstrum: (left) analyzed speech signal, (right) pitch selection in quefrency domain after Cepstrum
Cepstral coefficients decrease as 1/q, where q is the index of the cepstrum sequence.
Hence, for very noisy voices, the retrieval of F0 could be difficult when the pitch pulses are
hardly distinguished in the middle of a flat noise sequence. Moreover, the implementation of
the real cepstrum analysis depends heavily on the computation of the Short Time Fourier
Transform (STFT), and a proper choice of the window applied to the signal is of great
importance. As shown in Fig. 2-7, for voiced sounds the pitch estimate is taken at the maximum
value within the allowed pitch range.
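The real cepstrum of equation (2.7) and the peak search in the quefrency range can be sketched in Python as below. This is only an illustration under assumptions: a direct O(N^2) DFT is used to keep the example self-contained (an FFT would be used in practice), no analysis window is applied, a small constant is added before the logarithm to avoid log(0), and the function names and default search range are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct discrete Fourier transform (O(N^2)), for illustration only."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                for m in range(n))
        out.append(s / n if inverse else s)
    return out

def real_cepstrum(x):
    """Inverse transform of the log-magnitude spectrum."""
    log_mag = [math.log(abs(s) + 1e-12) for s in dft(x)]
    return [c.real for c in dft(log_mag, inverse=True)]

def cepstrum_pitch(x, fs, fmin=50.0, fmax=350.0):
    """Return F0 in Hz from the largest cepstral peak in the pitch range."""
    c = real_cepstrum(x)
    q_min = int(fs / fmax)
    q_max = int(fs / fmin)
    best = max(range(q_min, q_max + 1), key=lambda q: c[q])
    return fs / best
```

For a strongly periodic frame the cepstrum also shows peaks at integer multiples of the pitch period, so the width of the search range directly affects how often a multiple is picked instead of the period itself.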
2.2.2.2 Simplified Inverse Filtering Techniques (SIFT)
Figure 2-8. Overall process of SIFT (top) and pitch detection standard, domain and boundary of SIFT: (bottom left) analyzed speech signal, (bottom right) pitch selection in time domain after SIFT (hybrid method)
The Simplified Inverse Filtering Technique (SIFT) algorithm is based on an LP analysis of the
data [49], which gives the Inverse Filter (IF) to be used in this approach [50]. Commonly, a low
filter order (ρ = 4) is selected, corresponding to no more than two formants characterizing the
vocal tract [51]. However, while this suffices for healthy voices, pathological voices require a
higher and possibly varying filter order, due to the strong noise component, such as aspiration
utterance, corrupting the signal. In this study, the model order ρ was selected and the
parameters were estimated, followed by an autocorrelation maximization of the inverse filter
residuals, which gives the estimated pitch value. The optimum model order ρ of 17 was
obtained by the minimum description length order selection method, the cutoff frequency of the
low-pass filter was set to 1200 Hz, and the pre-emphasis coefficient was 0.975. The overall
process of SIFT is illustrated in Fig. 2-8.
2.2.3 Alternative Approaches
2.2.3.1 Wavelet
Figure 2-9. Diagram of fast lifting wavelet transform
The Wavelet Transform is a flexible tool for analyzing time-frequency behavior of
signals embedded in noise, and is well-suited to handle non-stationary data [52, 53]. In
particular, dyadic wavelets are characterized by an exponential sampling of the plane, given
by a power-of-two sampling of the scale parameter. This amounts to considerable savings in
the computational cost of the algorithm, which makes this approach suitable for fast signal
processing. Other useful properties of dyadic wavelets are linearity and shift invariance, as
speech signals are often modeled as a linear combination of shifted and damped sinusoids.
Thus this transform is successfully applied in the analysis of speech signals [54].
The discrete wavelet transform (DWT) separates the high-frequency components (the detail
signal) and the low-frequency components (the approximation signal) of a signal on successive
scales. Scaling functions are associated with low-pass filters while wavelets are associated with
high-pass filters, which perform the signal decomposition on subsequent levels. Most meaningful frequency
values will have the highest intensity in the level where such values are included. The main
advantage of this transform over the FFT is that the temporal location of the frequency content is preserved.
It has good time resolution in the high frequency range and good frequency resolution in the
low frequency range. The DWT has been successfully applied in detecting the pitch period of
speech signals [55].
Figure 2-10. Pitch detection standard, domain and boundary of Wavelet: (left) analyzed speech signal, (right) pitch selection in time domain after FLWT
In this study, the Fast Lifting Wavelet Transform (FLWT) is used to implement a PDA based
on the DWT (Fig. 2-9). A wavelet transform using the Haar wavelet splits a signal into an
approximation and a detail. The FLWT using a Haar wavelet is mathematically equivalent to
running a low-pass filter and down-sampling to produce the approximation component, and
running a high-pass filter and down-sampling to produce the detail component. After performing
the FLWT repeatedly, limited to 4 levels, the maxima and minima of each analyzed level are
located, and the pitch is obtained by averaging the distance between successive maxima, as
shown in Fig. 2-10.
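One lifting step of the Haar transform and the averaging of peak distances can be sketched in Python as follows. This is a sketch under assumptions rather than the FLWT code of this study: the unnormalized Haar lifting (detail = odd - even, approximation = pairwise mean), the relative peak height of 0.5, and the function names are illustrative choices.

```python
import math

def haar_lifting_step(x):
    """Split into even/odd samples, then predict (detail) and update (approx)."""
    even = x[0::2]
    odd = x[1::2]
    n = min(len(even), len(odd))
    detail = [odd[i] - even[i] for i in range(n)]
    approx = [even[i] + detail[i] / 2 for i in range(n)]  # pairwise mean
    return approx, detail

def haar_lifting(x, levels=4):
    """Apply the lifting step repeatedly to the approximation."""
    approx = list(x)
    details = []
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, d = haar_lifting_step(approx)
        details.append(d)
    return approx, details

def mean_peak_distance(sig, rel_height=0.5):
    """Average spacing (in samples) between prominent local maxima."""
    top = max(sig)
    peaks = [i for i in range(1, len(sig) - 1)
             if sig[i] > sig[i - 1] and sig[i] >= sig[i + 1]
             and sig[i] >= rel_height * top]
    if len(peaks) < 2:
        return None
    return (peaks[-1] - peaks[0]) / (len(peaks) - 1)
```

After L levels the approximation is sampled at fs / 2^L, so a mean peak distance of D samples corresponds to a pitch of fs / (2^L * D) Hz.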
2.2.3.2 State-Space Embedding
Figure 2-11. Pitch detection standard, domain and boundary of State-Space Embedding: pitch selection in periodicity histogram (time domain) after singular value decomposition
The state-space embedding signal representation is a method of observing the short-time
history of a waveform in a way that makes repetitive cycles clear. The basic state-space
embedding representation is to plot the value of the waveform at time t versus the slope of
the waveform at the same point [56]. A periodic signal should produce a repeating cycle in
state-space embedding, returning to a point with the same value and slope. Higher dimension
state-space embedding representations plot the value and n − 1 derivatives of the signal in
n dimensions.
Pseudo-phase space, also called embedded representation, is a simpler form of phase
space. The value of the incoming waveform is plotted against a time-delayed version of itself.
The representation plots the points (x, y) = (f(t), f(t − k)), and in the n-dimensional
case, (x_0, x_1, ..., x_{n−1}) = (f(t), f(t − k_1), ..., f(t − k_{n−1})). Often, for simplicity, k_1 = k. In this
study, an embedding dimension of 7 and a time delay of 19 were fixed heuristically for
pathological voices (Fig. 2-11). More detailed theoretical treatments of state-space embedding
pitch estimators are given in [57-59].
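The pseudo-phase space construction can be illustrated with a short Python sketch. The defaults below reuse the embedding dimension of 7 and delay of 19 quoted in the text; equally spaced delays (k_j = j * k) and the helper that measures how closely the trajectory returns to itself after a given lag are assumptions made for the example, and the periodicity histogram and singular value decomposition steps are not reproduced here.

```python
import math

def delay_embed(x, dim=7, delay=19):
    """Return points (x(t), x(t - delay), ..., x(t - (dim - 1) * delay))."""
    span = (dim - 1) * delay
    return [tuple(x[t - j * delay] for j in range(dim))
            for t in range(span, len(x))]

def mean_return_distance(points, lag):
    """Mean Euclidean distance between each point and the point lag steps later."""
    dists = [math.dist(a, b) for a, b in zip(points, points[lag:])]
    return sum(dists) / len(dists)
```

For a periodic signal the trajectory closes on itself, so the return distance is close to zero when the lag equals the period and clearly larger at other lags.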
2.3 Robust Pitch Detection Algorithm for Pathological Voice Based on Fast Orthogonal Search
Pathological voices differ greatly from normal voices. Noise produced by glottal distortion,
pitch period perturbation (called jitter), and pitch amplitude perturbation (called shimmer)
are embedded in the voiced sounds. The autocorrelation method, a widely used tool for pitch
detection, does not eliminate these error factors, because octave problems still remain and
peak finding is difficult in the first frame. Therefore, another solution is suggested to
overcome these problems. The proposed PDA based on the Fast Orthogonal Search (FOS)
algorithm is introduced, implemented, and compared with the previously presented PDAs
in diverse analyses.
2.3.1 Introduction of Fast Orthogonal Search Algorithm
The pitch estimation is based on the FOS analysis. The FOS analysis can take a set of
non-orthogonal functions and fit them to the sampled signal. Sine and cosine pairs at the same
frequency can be fitted to estimate the amplitude and phase of the spectrum at the frequency of
the sinusoidal pair. Frequencies with a fractional number of periods in the window can also
be searched, giving FOS a greater resolution than the FFT [60, 61].
One of the most efficient and most frequently used model structure detection techniques
is the orthogonal algorithm [62]. The advantage of using the orthogonal algorithm is that the
contributions of candidate terms are decoupled and consequently the significance of model
terms can be measured based on the corresponding error reduction ratios. Consider a dynamic
nonlinear polynomial NARMA model as defined by equation (2.9).
$$y(n) = F[\,y(n-1), \ldots, y(n-k), x(n), \ldots, x(n-l)\,] + e(n) \tag{2.9}$$

where F[·] denotes a nonlinear polynomial. Equation (2.9) can be written more concisely as
equation (2.10) for any sampled y(n) of length N.

$$y(n) = \sum_{m=0}^{M} a_m\, p_m(n) + e(n) \tag{2.10}$$

Here a_m and p_m denote the unknown parameters and the candidate functions (such as nonlinear
regressors), respectively, and e(n) is a white noise sequence with zero mean and finite variance.
The candidate functions are of length N and are chosen to represent sine and cosine functions
having particular frequencies of interest. The candidates are given by equation (2.11).

$$p_{2m}(n) = \sin\left( \frac{2\pi f_m n}{N} \right), \qquad p_{2m+1}(n) = \cos\left( \frac{2\pi f_m n}{N} \right) \tag{2.11}$$
The frequency of each candidate is given by f_m. These sine and cosine functions are not
necessarily orthogonal, as they may have a fractional number of periods over the length N. The
sampled signal y(n) can also be expressed as a functional expansion of M orthogonal
functions w_m(n), their coefficients g_m, and some error e(n), as defined by equation (2.12).

$$y(n) = \sum_{m=0}^{M} g_m\, w_m(n) + e(n) \tag{2.12}$$
The set of orthogonal functions are derived from the candidate functions using the Gram-
Schmidt orthogonalization algorithm [63]. In the Gram-Schmidt algorithm, an orthogonal
function can be calculated as the corresponding candidate minus the weighted sum of previous
orthogonal functions, as given by equation (2.13) with coefficients of equation (2.14).

$$w_m(n) = p_m(n) - \sum_{r=0}^{m-1} \alpha_{mr}\, w_r(n) \tag{2.13}$$

where

$$g_m = \frac{\sum_{n=1}^{N} y(n)\, w_m(n)}{\sum_{n=1}^{N} \left( w_m(n) \right)^2} \tag{2.14}$$
However, the construction of the orthogonal functions w_m(n) in equation (2.13) is time
and memory consuming. To avoid this, the FOS algorithm computes the orthogonal
expansion coefficients g_m directly (using an algorithm based on a modified Cholesky
decomposition technique) without explicitly creating the orthogonal functions w_m(n). As a
consequence, computing time is significantly reduced. The coefficients g_m are obtained by
equation (2.15).

$$g_m = \frac{C(m)}{D(m,m)}, \qquad m = 0, \ldots, M \tag{2.15}$$
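where

$$\begin{aligned}
D(0,0) &= 1 \\
D(m,0) &= \overline{p_m(n)}, && m = 1, \ldots, M \\
D(m,r) &= \overline{p_m(n)\, p_r(n)} - \sum_{i=0}^{r-1} \alpha_{ri}\, D(m,i), && m = 1, \ldots, M,\; r = 1, \ldots, m
\end{aligned} \tag{2.16}$$

and

$$\alpha_{mr} = \frac{D(m,r)}{D(r,r)}, \qquad m = 1, \ldots, M,\; r = 0, \ldots, m-1 \tag{2.17}$$

with

$$\begin{aligned}
C(0) &= \overline{y(n)} \\
C(m) &= \overline{y(n)\, p_m(n)} - \sum_{r=0}^{m-1} \alpha_{mr}\, C(r)
\end{aligned} \tag{2.18}$$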
Additionally, the mean square error is calculated from equation (2.19).

$$mse = \overline{y^2(n)} - \sum_{m=0}^{M} g_m^2\, D(m,m) \tag{2.19}$$
The overbar in equations (2.16), (2.18), and (2.19) denotes a time-average computed
over the portion of the data record of length N. The spectral density at a given frequency f_m
is a combination of the magnitudes of the two corresponding candidate functions (sine and
cosine at frequency f_m); the magnitude and phase spectra are given by equation (2.20).

$$\begin{aligned}
F(f_m) &= \sqrt{a_{2m}^2 + a_{2m+1}^2} \\
\phi(f_m) &= \tan^{-1}\left( \frac{a_{2m+1}}{a_{2m}} \right)
\end{aligned} \tag{2.20}$$
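The recursions of equations (2.15) to (2.19) can be sketched in Python as below. This is a minimal illustration, not the implementation of this study. Several points are assumptions: a constant candidate p_0(n) = 1 is prepended (consistent with D(0,0) = 1), candidate frequencies are given in cycles per window, the final back-substitution that recovers the coefficients a_m of equation (2.10) from g_m is added for the example because it is not written out in the text, and all function names are hypothetical.

```python
import math

def time_avg(u, v):
    """Time-average (overbar) of the product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def sinusoid_candidates(freqs, n_samples):
    """Sine/cosine pairs as in equation (2.11); freqs in cycles per window."""
    cands = []
    for f in freqs:
        cands.append([math.sin(2 * math.pi * f * n / n_samples) for n in range(n_samples)])
        cands.append([math.cos(2 * math.pi * f * n / n_samples) for n in range(n_samples)])
    return cands

def fos_fit(y, candidates):
    """Return (a, g, mse) without forming the orthogonal functions w_m(n)."""
    ones = [1.0] * len(y)
    p = [ones] + candidates
    M = len(p) - 1
    D = [[0.0] * (M + 1) for _ in range(M + 1)]
    alpha = [[0.0] * (M + 1) for _ in range(M + 1)]
    C = [0.0] * (M + 1)
    D[0][0] = 1.0
    C[0] = time_avg(y, ones)
    for m in range(1, M + 1):
        D[m][0] = time_avg(p[m], ones)
        alpha[m][0] = D[m][0] / D[0][0]
        for r in range(1, m + 1):
            D[m][r] = time_avg(p[m], p[r]) - sum(alpha[r][i] * D[m][i] for i in range(r))
            if r < m:
                alpha[m][r] = D[m][r] / D[r][r]
        C[m] = time_avg(y, p[m]) - sum(alpha[m][r] * C[r] for r in range(m))
    g = [C[m] / D[m][m] for m in range(M + 1)]
    mse = time_avg(y, y) - sum(g[m] ** 2 * D[m][m] for m in range(M + 1))
    # Back-substitution (assumed step): a_m = g_m - sum_{i>m} alpha_im * a_i
    a = [0.0] * (M + 1)
    for m in range(M, -1, -1):
        a[m] = g[m] - sum(alpha[i][m] * a[i] for i in range(m + 1, M + 1))
    return a, g, mse
```

With the constant term at index 0, the sine and cosine coefficients of the first frequency are `a[1]` and `a[2]`, so its magnitude can be taken as `math.hypot(a[1], a[2])` and its phase as `math.atan2(a[2], a[1])`. The sketch assumes linearly independent candidates; a zero `D[r][r]` would need to be handled separately.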
2.3.2 Pitch Selection
Figure 2-12. Two episodes of pitch selection in FOS
In the FOS algorithm, the candidate functions are fitted in the order presented (0, 1, …,
m-1), and an order M of 640 was selected empirically by minimizing the mean square error of
equation (2.19). Once the spectrum of the sampled signal is determined over the
frequency range from 0 to 6000 Hz, the pitch information must be extracted. As shown in Fig.
2-12, pitch candidates are selected from the spectrum normalized by its maximum peak
located within 50-350 Hz.
Figure 2-13. Pitch selection of boundary of a third of global Maximum peak
In detail, the pitch selection proceeds as follows. First, the global maximum peak is regarded
as the first pitch estimate. If the first pitch estimate is higher than 100 Hz and exceeds a
threshold of 0.2, the neighborhood of half of the first pitch estimate is searched. The boundary
scale is 2 % of the size of M (13 Hz on either side). Then, the local maximum peak in the valid
search range (the boundary around half of the first pitch estimate) whose magnitude exceeds
a prescribed fraction (e.g. 20 %) of the global maximum peak is found, and its position is stored
as the second pitch candidate. This process continues until the lowest pitch candidate is found
that satisfies the pitch condition, that is, it is located within the valid pitch range and exceeds
20 % of the global maximum peak. If there is no peak at half of the first pitch estimate,
the local pitch estimate is searched again within a new boundary around a third of the first
pitch estimate (e.g. Fig. 2-13). Finally, the frequency position of the last pitch estimate found
is selected as the fundamental frequency.
Pitch candidates obtained as described above for steady periodic speech frames usually
include only the true pitch and its integer multiples. Selecting the lowest multiple
gives a reliable local pitch estimate for such frames, because the lowest multiple is usually
regarded as the real pitch, given that the harmonics in the voice spectrum are integer
multiples of the pitch. For our PDA implementation we have
developed an algorithm based on FOS.
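The selection rule above can be sketched in Python as follows. This is one possible reading of the procedure, given under stated assumptions: the spectrum is supplied as parallel lists of frequencies (Hz) and magnitudes, the 0.2 threshold is applied relative to the global maximum peak, a candidate must be a local maximum, and the search at one half and then one third of the current estimate is repeated while the estimate stays above 100 Hz. The function and parameter names are hypothetical.

```python
def is_local_peak(mags, i):
    """True if mags[i] is not smaller than both of its neighbours."""
    return 0 < i < len(mags) - 1 and mags[i] >= mags[i - 1] and mags[i] >= mags[i + 1]

def select_pitch(freqs, mags, fmin=50.0, fmax=350.0, split_above=100.0,
                 rel_threshold=0.2, half_width=13.0):
    """Return the lowest plausible pitch (Hz) found below the global peak."""
    in_range = [i for i, f in enumerate(freqs) if fmin <= f <= fmax]
    top = max(in_range, key=lambda i: mags[i])
    peak = mags[top]
    f0 = freqs[top]
    while f0 > split_above:
        found = None
        for divisor in (2, 3):  # try one half first, then one third
            target = f0 / divisor
            cands = [i for i in in_range
                     if abs(freqs[i] - target) <= half_width
                     and mags[i] >= rel_threshold * peak
                     and is_local_peak(mags, i)]
            if cands:
                found = max(cands, key=lambda i: mags[i])
                break
        if found is None:
            break
        f0 = freqs[found]
    return f0
```

For example, a spectrum whose strongest peak lies at 200 Hz but which also has a peak at 100 Hz with at least 20 % of that magnitude yields 100 Hz, whereas a weaker 100 Hz component leaves the estimate at 200 Hz.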
2.4 Experimental Procedure
2.4.1 Speech Database
A total of 99 pathological voices and 30 normal voices were recorded at the Institute of
Logopedics & Phoniatrics, Yong-Dong Severance Hospital, Yonsei University, Seoul, Korea.
The pathological voices were restricted to voices with benign vocal lesions: polyps,
nodules, and cysts. The mean age at diagnosis of the pathological group, adults aged 18-69
years, was 43.7 years, and the mean age of the normal group was 48.4 years, ranging from 21
to 68, irrespective of sex.
Acoustic data were recorded with the Computerized Speech Lab (CSL), Kay Elemetrics,
in a quiet environment, and electroglottographic (EGG) data were obtained simultaneously.
All acoustic and EGG data were sampled at a rate of 22 kHz with a resolution
of 16 bits. Sustained vowel samples of |a|, |e|, |i|, |o|, and |u| lasting 2-3 s were obtained
from two utterances.
2.4.2 Preconditions of Performance Evaluation
Some preconditions are required to evaluate performance of pitch detection. It is difficult
to empirically measure the performance of a pitch estimator for several reasons. First,
performance depends on domain, as discussed above. A pitch estimator will almost certainly
behave better in the context for which it was developed. Second, it is difficult to automatically
rate the result of a pitch estimator against expected outcomes, precisely because it is difficult
to measure pitch in the first place. We humans are good at it, and so we can listen to a file and
judge the accuracy of a pitch estimation engine, but to lend credibility to this measure, we
must have many people, both expert and lay, judge the pitch estimation result on a large
number of sound files. Once a measure like this is taken, however, it can be used to evaluate
the results of other pitch estimation methods. Another way to evaluate pitch estimators is to
compare the results of multiple detectors on a common corpus. This third method of
comparison is what will be used in this work.
The reference pitch is obtained semi-automatically from an accurate glottal closure instant
(GCI) measure. GCIs can be obtained from the differentiated EGG signal, and the successive GCI
positions correlate with the pitch periods of the speech signal [64]. Automated pitch estimation
based on these GCIs was applied to produce the reference pitch estimate. However, this scheme
relies on the assumption that the length of the pitch period does not change drastically, that is,
pitch periods do not double or halve within the limited length of the segment. Therefore, we
used the automated reference pitch estimator only for type 1 (periodic) signals, while type 2
and 3 signals were analyzed manually. Speech analysis tools such as the spectrogram and
Lyapunov exponents [65] were adopted to discriminate type 2 and 3 signals from type 1
signals. If a period-doubling or halving bifurcation was exhibited by the excised larynx, the
signal was regarded as type 2 or 3. After analysis of the Lyapunov exponents, a signal with at
least one positive Lyapunov exponent was defined as a chaotic, or type 3, signal.
2.4.3 Error Types of PDA
Five types of error measures of pitch detection performance were defined as follows:
i) X2 (doubling) error: the percentage of voiced frames (correctly classified as voiced) whose
estimate satisfies the condition below, relative to the total number of frames verified to be
voiced.

$$\frac{\text{computed pitch estimate} - \text{reference pitch}}{\text{reference pitch}} > 0.2$$
ii) /2 (halving) error: the percentage of voiced frames (correctly classified as voiced) whose
estimate satisfies the condition below, relative to the total number of frames verified to be
voiced.

$$\frac{\text{computed pitch estimate} - \text{reference pitch}}{\text{reference pitch}} < -0.2$$

iii) G (gross) error: the sum of the X2 error and the /2 error.
iv) F (fine) error: the mean of the absolute difference between the computed pitch estimate and
the reference pitch estimate over the frames verified to be voiced, excluding frames with a
G error.
v) S (standard deviation) error: the standard deviation of the absolute difference over the
frames defined for the F error above.
Error measures related to the voicing decision were not considered in this study.
Although the segmented signals included unvoiced periods, some of the PDAs cannot make a
voicing decision, so all PDAs were equally combined with the same voicing decision
algorithm, a silence detector based on an energy threshold and the zero crossing
rate.
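The five error measures can be computed from per-frame estimates as in the Python sketch below. It is an illustration under assumptions: the inputs are voiced frames only, the fine error is expressed in the same units as the inputs, the population standard deviation is used for the S error, and the function name is hypothetical.

```python
import statistics

def pitch_errors(estimates, references, tolerance=0.2):
    """Return the X2, /2, G (percent of frames) and F, S error measures."""
    rel = [(e - r) / r for e, r in zip(estimates, references)]
    n = len(rel)
    doubling = 100.0 * sum(1 for d in rel if d > tolerance) / n
    halving = 100.0 * sum(1 for d in rel if d < -tolerance) / n
    # Fine errors are taken only over frames without a gross error
    fine = [abs(e - r) for e, r, d in zip(estimates, references, rel)
            if abs(d) <= tolerance]
    return {
        "X2": doubling,
        "/2": halving,
        "G": doubling + halving,
        "F": statistics.mean(fine) if fine else 0.0,
        "S": statistics.pstdev(fine) if fine else 0.0,
    }
```

Calling `pitch_errors(estimates, references)` with two equal-length lists of frame-wise pitch values returns a dictionary keyed by the error names used in this section.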
2.4.4 Optimum Window Selection
The autocorrelation algorithm is relatively impervious to noise, but is sensitive to the
sampling rate, because it calculates the fundamental frequency directly from a shift in samples.
The performance of the PDAs was also sensitive to window size, so an optimized window size
had to be determined for each PDA in order to operate all PDAs under their best conditions.
Gross error rates were analyzed empirically as a function of window size, as shown in Fig. 2-14
and 2-15. These rates were computed over our entire speech database. We tested optimum
window selection for the 8 PDAs on two separate speech databases: normal and BVFL voices.
In the normal database (Fig. 2-14), window lengths of 22, 68, 20, 34, 46, 70, 23, and 46 ms
were selected as the optimized window sizes of AC, AMDF, YIN, CEP, SIFT, WAV, PS, and
FOS, respectively. In the BVFL database (Fig. 2-15), window lengths of 21, 20, 24, 31, 30, 59,
25, and 54 ms were selected as the optimized window sizes of AC, AMDF, YIN, CEP, SIFT,
WAV, PS, and FOS, respectively. The pattern of the window size plot of each PDA in the BVFL
database is similar to its result in the normal database, and the gross error increases
considerably for all PDAs except FOS. There is little difference in window size
between the two groups except for AMDF. We selected the above optimum window sizes to
compare the PDAs at their best performance.
Figure 2-14. Gross error rates as a function of window size in normal database
Figure 2-15. Gross error rates as a function of window size in BVFL database
2.5 Experimental Results
2.5.1 Evaluating Performance of PDAs in Normal versus BVFL Voices
The five types of pitch estimation error for the eight PDAs are presented in Table 2-1. These
results show that the gross error of the FOS analysis is lower than that of the other PDAs,
irrespective of the type of speech signal (normal or BVFL), and is remarkably lower than that
of the other PDAs for BVFL voices. This is perhaps to be expected, since the halving error of
the FOS analysis differs substantially from that of the other PDAs. Table 2-1 states the
performance of the available PDAs in the normal and BVFL databases; the p value (95%
confidence interval) is a statistic of Welch's t test between the normal and BVFL groups
(N_Normal = 150, N_BVFL = 495). Shaded blocks denote the best pitch estimation performance
in each column. Our proposed FOS algorithm shows the best performance in gross and halving
error, while the PDA based on the wavelet is best in doubling, fine, and standard deviation
error. In particular, every approach to pitch estimation faces the problem of trading off
too-high against too-low errors. This is usually addressed by applying some form of bias. In
general, PDAs based on the spectrum do not perform well with respect to too-low errors.
However, these PDAs perform well with respect to too-high errors, which suits pathological
voices containing aperiodic components such as jitter, shimmer, and noise, because these
components usually prevent a PDA from finding the correct pitch.
There is no difference in performance among the types of BVFL, as shown in Tables 2-2, 2-3,
and 2-4, and the PDA based on FOS performs well on too-high errors and averagely on too-low
errors. Detailed performance results of the PDAs are presented in APPENDIX A: performance by
age in intervals of two decades (excluding children below 10 years), performance by sex,
and performance by phonation type (|a|, |e|, |i|, |o|, |u|).
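For reference, the Welch's t statistic used for the group comparison can be computed from the reported means, standard deviations, and group sizes. The Python sketch below is only an illustration with a hypothetical function name; it returns the t statistic and the Welch-Satterthwaite degrees of freedom, and leaves out the p value because the Student t distribution is not available in the Python standard library.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom."""
    v1 = sd1 ** 2 / n1
    v2 = sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof

# Gross error of AC from Table 2-1: BVFL (N = 495) versus normal (N = 150)
t, dof = welch_t(3.20, 5.99, 495, 1.07, 1.31, 150)
```

Unlike Student's t test, this form does not assume equal variances in the two groups, which matters here because the BVFL group shows much larger standard deviations than the normal group for most PDAs.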
Table 2-1. Results of performance of the available PDAs in database of normal and BVFL (unit: percentage %)

G
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     1.07   1.31   3.20   5.99   <.001
AMDF   7.69   10.72  8.66   12.88  .008
YIN    1.43   1.59   4.12   6.97   <.001
CEP    2.97   3.67   5.98   9.08   <.001
SIFT   2.44   2.73   5.72   9.42   <.001
WAV    4.89   2.10   6.27   1.91   .015
PS     1.68   1.84   4.69   7.46   <.001
FOS    0.95   2.27   0.83   2.04   .285

X2
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     0.13   0.45   0.41   2.27   .009
AMDF   0.15   0.81   1.06   6.14   .005
YIN    0.53   0.82   0.66   2.02   .085
CEP    1.30   3.58   2.48   6.98   .017
SIFT   0.66   2.06   0.97   3.14   .181
WAV    0.01   0.07   0.01   0.09   .276
PS     0.63   1.50   1.27   4.64   .031
FOS    0.03   0.33   0.34   1.38   <.001

/2
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     0.94   10.22  2.80   5.45   <.001
AMDF   7.55   10.77  7.60   11.85  .017
YIN    0.90   1.37   3.46   6.29   <.001
CEP    1.67   1.83   3.51   4.51   <.001
SIFT   1.78   1.98   4.75   8.10   <.001
WAV    4.88   2.10   6.25   1.91   .012
PS     1.05   1.40   3.42   5.47   <.001
FOS    0.92   2.26   0.49   1.48   <.001

F
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     1.31   0.82   2.14   3.69   <.001
AMDF   6.35   7.63   7.53   9.06   .002
YIN    14.01  0.54   14.75  3.52   <.001
CEP    3.49   5.56   5.56   11.89  .005
SIFT   4.83   2.40   6.44   8.69   <.001
WAV    0.39   0.76   0.83   1.40   <.001
PS     1.90   10.88  3.93   6.57   <.001
FOS    2.95   1.61   2.72   2.46   .521

S
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     2.61   3.91   4.48   7.22   <.001
AMDF   11.17  10.58  11.35  10.16  .668
YIN    2.63   2.32   4.56   5.88   <.001
CEP    7.90   10.93  10.73  13.17  .012
SIFT   6.34   9.07   8.87   11.85  .071
WAV    0.44   0.82   0.58   0.97   <.001
PS     5.45   6.27   8.15   8.73   .002
FOS    4.00   4.65   4.05   6.95   .355
Table 2-2. Results of performance of the available PDAs in database of polyp (Npolyp = 195) (unit: percentage %)
PDAs G X2 /2 F S
Mean S.D. Mean S.D. Mean S.D. Mean S.D. Mean S.D.
AC 4.28 8.28 0.84 4.50 3.44 6.78 3.00 5.85 5.14 6.98
AMDF 7.95 12.70 0.68 2.41 7.27 12.78 6.56 8.47 10.20 9.57
YIN 5.77 9.95 0.99 3.13 4.78 9.40 15.39 4.40 5.53 5.81
CEP 8.34 13.44 3.82 9.67 4.53 6.34 6.31 12.12 9.22 10.37
SIFT 9.50 16.49 0.91 2.62 8.59 15.27 8.53 12.05 9.47 11.73
WAV 7.32 1.73 0.01 0.04 7.31 1.74 0.91 1.66 0.36 0.80
PS 8.01 12.76 3.22 10.20 4.87 7.38 6.20 11.01 10.15 9.21
FOS 0.42 1.37 0.33 1.28 2.21 1.65 2.21 1.65 2.54 4.32
Table 2-3. Results of performance of the available PDAs in database of cyst (Ncyst = 85) (unit: percentage %)
PDAs G X2 /2 F S
Mean S.D. Mean S.D. Mean S.D. Mean S.D. Mean S.D.
AC 3.18 6.12 0.33 1.64 2.84 5.90 1.71 2.39 3.67 5.52
AMDF 7.11 12.70 1.55 8.16 5.56 10.54 6.58 8.72 10.15 9.93
YIN 3.55 5.74 0.63 1.92 2.92 4.90 14.30 2.56 3.91 4.00
CEP 6.52 9.04 2.94 7.53 3.58 4.29 6.53 14.23 12.96 14.88
SIFT 4.81 7.42 1.31 3.90 3.50 5.03 5.79 8.96 9.05 13.73
WAV 5.97 1.94 0.02 0.12 5.95 1.93 0.67 1.17 0.58 0.95
PS 4.15 6.17 1.08 2.23 3.07 5.47 3.52 5.31 7.28 6.90
FOS 0.51 1.61 0.25 1.23 0.25 1.10 2.68 2.30 3.37 5.83
Table 2-4. Results of performance of the available PDAs in nodule database (Nnodule = 140) (unit: percentage %)
PDAs G X2 /2 F S
Mean S.D. Mean S.D. Mean S.D. Mean S.D. Mean S.D.
AC 2.61 3.54 0.29 0.82 2.32 3.17 2.45 3.97 5.65 9.69
AMDF 12.07 12.77 0.33 0.88 11.74 12.66 9.94 9.64 14.37 10.39
YIN 4.22 6.86 0.51 1.18 3.70 6.25 15.25 4.35 5.34 8.34
CEP 3.53 3.92 0.77 1.57 2.76 3.32 3.22 3.71 7.33 10.02
SIFT 5.17 5.82 0.34 0.92 4.83 5.58 6.30 4.65 8.16 7.10
WAV 6.20 1.74 0.00 0.00 6.20 1.74 1.09 1.60 0.72 1.09
PS 3.66 4.13 0.44 1.00 3.22 3.74 3.34 4.66 8.59 11.11
FOS 1.71 2.74 0.51 1.67 1.19 2.16 3.09 3.07 6.27 9.35
2.5.2 Evaluating Performance of PDAs in Aperiodicity Level of Voices
Another test was performed on two groups, Type-1 and Type-2 signals, classified according to
Titze [66]. This test evaluates robustness against pitch doubling and halving errors. As shown in
Tables 2-5 and 2-6, the gap in gross error between BVFL and normal voices increases
considerably in the Type-2 group, except for the PDA based on FOS, whereas it increases only
moderately in the Type-1 group. In the Type-1 group, average gaps of +1.04 %, +0.43 %,
+1.14 %, +1.24 %, +1.16 %, +1.27 %, +1.48 % and -0.38 % are obtained for AC, AMDF, YIN,
CEP, SIFT, WAV, PS, and FOS, respectively. In the Type-2 group, average gaps of +5.73 %,
+5.03 %, +7.32 %, +7.74 %, +10.41 %, +2.52 %, +8.82 % and +1.95 % are obtained for AC,
AMDF, YIN, CEP, SIFT, WAV, PS, and FOS, respectively. This indicates that the PDA based on
FOS is robust against pitch doubling and halving errors.
Table 2-5. Results of performance of the available PDAs in database of normal and BVFL (Type-1 signals)

G
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     1.12   1.34   2.16   3.98   <.001
AMDF   8.20   10.98  8.63   12.38  .706
YIN    1.44   1.61   2.58   4.29   <.001
CEP    2.56   2.48   3.80   4.46   <.001
SIFT   2.37   2.77   3.53   3.79   <.001
WAV    4.99   2.08   6.26   1.80   <.001
PS     1.58   1.55   3.06   4.07   <.001
FOS    1.03   2.35   0.61   1.57   .054

X2
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     0.11   0.40   0.21   1.27   .168
AMDF   0.09   0.63   0.73   5.63   .029
YIN    0.49   0.79   0.35   0.78   .074
CEP    0.80   2.14   1.15   2.57   .124
SIFT   0.60   2.09   0.58   1.62   .913
WAV    0.00   0.05   0.01   0.08   .440
PS     0.47   0.94   0.55   0.99   .402
FOS    0.03   0.34   0.14   0.79   .026

/2
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     1.01   1.24   1.96   3.77   <.001
AMDF   8.11   11.00  7.90   11.45  .847
YIN    0.95   1.40   2.23   4.26   <.001
CEP    1.76   1.85   2.66   3.39   <.001
SIFT   1.77   2.00   2.95   3.44   <.001
WAV    4.98   2.08   6.25   1.80   <.001
PS     1.11   1.44   2.51   4.02   <.001
FOS    1.01   2.34   0.47   1.39   .013

F
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     1.32   0.81   1.55   2.34   .099
AMDF   6.68   7.79   7.42   8.61   .351
YIN    14.01  0.55   14.10  2.07   .470
CEP    2.73   3.22   3.33   4.95   .110
SIFT   4.75   2.25   4.66   2.75   .702
WAV    0.41   0.79   0.73   1.29   .001
PS     1.70   1.55   2.59   3.72   <.001
FOS    2.90   1.66   2.35   1.43   .001

S
PDAs   Normal-Mean  Normal-S.D.  BVFL-Mean  BVFL-S.D.  p
AC     2.56   3.91   3.34   5.57   .075
AMDF   11.54  10.41  11.29  10.22  .804
YIN    2.60   2.37   3.38   3.63   .005
CEP    6.51   8.93   7.99   10.66  .113
SIFT   5.57   7.87   6.39   7.12   .284
WAV    0.44   0.85   0.57   0.99   .147
PS     5.01   5.73   6.59   7.04   .010
FOS    4.15   4.76   3.26   4.58   .060
Table 2-6. Results of performance of the available PDAs in database of normal and BVFL

                Normal            BVFL
Error   PDA     Mean     S.D.     Mean     S.D.     p
G       AC      0.56     0.84     6.29     9.28     <.001
G       AMDF    1.87     3.97     6.90     10.38    .006
G       YIN     1.23     1.25     8.75     11.34    <.001
G       CEP     7.69     8.91     15.43    18.27    .032
G       SIFT    3.28     2.21     13.69    17.48    <.001
G       WAV     3.70     1.98     6.22     2.16     .001
G       PS      2.80     3.82     10.24    12.64    <.001
G       FOS     0.00     0.00     1.95     3.17     <.001
X2      AC      0.45     0.81     1.60     5.16     .098
X2      AMDF    0.79     1.85     2.25     3.54     .043
X2      YIN     0.98     1.08     2.55     4.85     .024
X2      CEP     7.00     8.84     9.54     16.16    .443
X2      SIFT    1.34     1.71     3.05     7.17     .099
X2      WAV     0.06     0.19     0.05     0.14     .878
X2      PS      2.39     3.97     5.02     10.27    .135
X2      FOS     0.00     0.00     1.20     2.24     <.001
/2      AC      0.11     0.40     4.69     7.48     <.001
/2      AMDF    1.09     3.77     4.64     9.95     .037
/2      YIN     0.25     0.60     6.20     8.70     <.001
/2      CEP     0.69     1.27     5.90     5.81     <.001
/2      SIFT    1.94     1.85     10.64    14.05    <.001
/2      WAV     3.65     2.04     6.17     2.15     .001
/2      PS      0.41     0.75     5.22     6.13     <.001
/2      FOS     0.00     0.00     0.75     2.10     .006
F       AC      1.22     0.93     4.23     6.91     .001
F       AMDF    2.64     3.97     6.85     8.84     .012
F       YIN     14.00    0.32     16.78    6.60     .001
F       CEP     12.16    14.15    16.61    27.53    .412
F       SIFT    5.82     3.68     12.77    18.24    .007
F       WAV     0.23     0.16     1.27     1.64     <.001
F       PS      4.16     3.44     8.62     10.92    .011
F       FOS     3.43     0.68     4.40     4.31     .093
S       AC      3.18     4.02     8.51     11.69    .006
S       AMDF    6.96     12.17    12.09    10.07    .192
S       YIN     3.07     1.67     8.84     11.47    <.001
S       CEP     23.94    17.92    22.83    19.50    .848
S       SIFT    15.21    15.86    18.27    22.26    .575
S       WAV     0.33     0.30     0.81     1.08     .004
S       PS      10.46    9.66     14.07    13.41    .280
S       FOS     2.33     1.33     8.14     12.23    <.001
2.6 Summary
An algorithm is presented for estimating the fundamental frequency (F0) of pathological
voiced sounds. It is based on the well-known FOS algorithm, combined with a pitch selection
method that prevents pitch estimation errors, especially octave errors. The algorithm has
several desirable features. Its gross error rates are lower than those of the best competing
methods, as evaluated over a database of speech recorded together with an
electroglottographic signal. The algorithm is relatively simple, can be implemented
efficiently and with low latency, and involves few parameters that must be tuned. It is based
on a signal model that handles various forms of aperiodicity (such as white and colored
noise) that occur in particular applications. However, there is a trade-off: errors small
enough to fall within the tolerance boundary are ignored.
Chapter 3
Comparison of Acoustic and Electroglottographic Parameters of BVFL
before and after Laryngeal Surgery
3.1 Introduction to Acoustic and Electroglottographic Analysis on Vowel
3.1.1 Acoustic Analysis
A substantial amount of research has been devoted to determining the influence of
pathological changes of the larynx on the voice signal [67-69]. Acoustic analysis is a common
and traditional method for the evaluation and detection of laryngeal pathologies [70, 71].
Most of these investigations were devoted to the computation of pathological voice parameters
for use in clinical practice. Several studies [72, 73] have also been devoted to the
classification and screening of laryngeal pathology. The following changes in the pathological
voice signal have been established in most cases:
1. Significant variations in voice pitch period and pitch peak amplitude.
2. Breaks in pitch generation during sustained vowel phonation.
3. Distortion of the pitch-pulses shape and increased degree of hoarseness due to the
high-frequency noisy components of the voiced speech.
4. Presence of a loud turbulent and additive noise.
5. Presence of sub-harmonic components in the vowel spectra.
6. Dominating first harmonic and decrease or loss of the high-frequency harmonics in
the signal spectrum resulting in breathy phonation.
7. Interruptions in the pitch period generation.
These changes are not always observed simultaneously, and only some of them may be
present, depending on the disease and its stage. In cases of weak pathology, the voice signal
retains a normal periodicity; only the noise level rises slightly, while the amplitude of the
high-frequency harmonics in the signal spectrum decreases. For this reason, a precise analysis
of periodicity and noise is required to detect the disease in its early stage. By precise
analysis we mean a high-precision evaluation of the parameters that describe the
cycle-to-cycle variations in pitch and voice amplitude. In order to visualize some of the
above differences between normal and pathological voices, Fig. 3-1 shows waveforms and
spectra of normal and pathological voices.
Figure 3-1. Episodes of speech, spectrum, and EGG of normal and pathological voices
3.1.2 Electroglottographic Analysis
Electroglottographic (EGG) analysis is also used effectively to detect and evaluate
laryngeal disorders, because the EGG waveform is less complex than the speech signal and is
relatively unaffected by the acoustic resonances of the vocal tract. It has thus been
considered more advantageous than the speech signal for perturbation analysis [74]. The EGG
signal reflects the degree of vocal fold contact during the vibratory cycle of the vocal folds
[75]. Irregularities in the EGG signal correspond to irregularities in the vibratory pattern
of the vocal folds [76, 77]. The EGG features we measure account for such factors as the rise
and fall time of the EGG signal. Thus, we conjectured that the EGG waveform features would
provide a nearly direct measure of the irregularities in the vibratory motion of the vocal
folds and thus provide an excellent classification of normal subjects versus those with vocal
disorders.
3.1.3 Measurement of Pathological Voice
Voice changes are measured during phonation of a sustained vowel. The measured parameters
define the degree of cycle-to-cycle instability of amplitude and pitch, and indicate the level
of aperiodic components (noise) in the voice signal, which arise from turbulent noise and from
frequency and amplitude modulation of the voice. By modulation we mean the presence of
unintentional variations of voice amplitude and pitch, due both to neurological reasons and to
the biomechanical properties of the vocal folds.
For laryngeal diagnostics, variations of pitch period and voice amplitude are usually
tested. These perturbations are called pitch perturbation (jitter) and amplitude perturbation
(shimmer), respectively. They are random by nature and are present in both normal and
pathological voices [78–80]. Slow fundamental frequency and amplitude variations are
defined as frequency and amplitude tremors, respectively, and are due to physical over-tension
rather than to pathological changes in the larynx. However, the more severe the voice
disorder, the higher the levels of jitter and shimmer. Therefore, the parameters extracted
from the speech and electroglottographic signals must be analyzed statistically in order to
determine the difference between preoperative and postoperative voiced sounds.
3.2 Methods and Experiments
3.2.1 Experimental Data and Protocol
Pathological voices were collected from 42 subjects with BVFL from June 2003 to
December 2006. All subjects had normal hearing, were native speakers of Korean, and had
undergone laryngeal surgery for treatment of BVFL. The experimental instructions and
stimuli were controlled by an expert clinician at the Institute of Logopedics & Phoniatrics,
Yong-Dong Severance Hospital, Yonsei University, Seoul, Korea. Acoustic data were recorded in
a quiet environment with the Computerized Speech Lab (CSL, Kay Elemetrics), and
electroglottographic (EGG) data were obtained simultaneously. All acoustic and EGG data
were sampled at a rate of 22 kHz with a resolution of 16 bits. The patients phonated into a
microphone for less than 3 seconds, and samples of the sustained vowels |a|, |e|, |i|, |o|, |u|
were obtained twice. After recording, all voice data (including EGG data) were transferred in
WAV format to a PC running Microsoft Windows and were analyzed with Matlab and with
Visual Studio .NET C++ and C# software.
3.2.2 Analysis and Results of Formant Frequencies
3.2.2.1 Estimation of Formant Frequencies
The speech signal is produced by the action of the vocal tract on the excitation coming
from the glottis. Different configurations of the vocal tract produce different resonances
that amplify frequency components of the excitation, resulting in different sounds. These
resonance frequencies are called formant frequencies. The estimation of the formant
frequencies, mainly the first two formants, F1 and F2, has many practical applications. In
linguistics [81, 82], they are used to characterize the different sounds found in speech.
Formant frequency measures can be obtained directly by visual inspection of the spectrogram of
the speech signal or automatically by means of a computational algorithm. A useful technique
for formant estimation is linear predictive coding (LPC) analysis. In LPC analysis, an
all-pole prediction filter models the vocal tract, and the angular positions of the poles of
the filter give the formant frequencies.
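The relation between pole angle and formant frequency can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names, the order-2 predictor, and the synthetic single-resonance test signal (700 Hz at an assumed 22 kHz sampling rate) are illustrative assumptions, chosen so that the poles can be found in closed form. A pole at angle theta (in radians) corresponds to a frequency of theta * fs / (2 * pi).

```python
import cmath
import math


def autocorrelation(x, max_lag):
    """Autocorrelation r[0..max_lag] of a real sequence (autocorrelation method)."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns A(z) coefficients [1, a1, ..., a_p]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k                  # updated prediction error
    return a


def resonance_from_order2(a, fs):
    """Frequency (Hz) given by the pole angle of 1 / (1 + a1 z^-1 + a2 z^-2)."""
    _, a1, a2 = a
    pole = (-a1 + cmath.sqrt(a1 * a1 - 4.0 * a2)) / 2.0
    return abs(cmath.phase(pole)) * fs / (2.0 * math.pi)


def resonator_impulse_response(f_res, fs, radius, n):
    """Hypothetical test signal: impulse response of a two-pole resonator."""
    theta = 2.0 * math.pi * f_res / fs
    a1, a2 = -2.0 * radius * math.cos(theta), radius * radius
    y = []
    for i in range(n):
        y1 = y[i - 1] if i >= 1 else 0.0
        y2 = y[i - 2] if i >= 2 else 0.0
        y.append((1.0 if i == 0 else 0.0) - a1 * y1 - a2 * y2)
    return y


FS = 22000.0  # assumed sampling rate in Hz
signal = resonator_impulse_response(700.0, FS, 0.97, 1024)
lpc = levinson_durbin(autocorrelation(signal, 2), 2)
f1_estimate = resonance_from_order2(lpc, FS)  # close to the 700 Hz resonance
```

For real vowels a higher predictor order and a numerical root finder would be needed; the order-2 case is used here only because its two poles follow from the quadratic formula.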
Apart from a variety of formant tracking approaches [83, 84], considerable attention has
been paid to methods based on linear prediction analysis [85, 86]. However, capturing and
tracking formants accurately from noisy speech is not easy, largely because the accuracy of
root-finding algorithms based on LPC is sensitive to the noise level. Therefore we use a
formant tracking method based on detecting changes in the phase spectrum of the LPC filter, as
shown in Fig. 3-2.
Figure 3-2. Formant frequencies tracking based on phase spectrum of LPC
3.2.2.2 Comparative Results
Table 3-1 below presents the mean and S.D. of the formant frequencies F1, F2, and F3 for the
male group, the female group, and all subjects irrespective of sex, for the sustained vowels
|a|, |e|, |i|, |o|, |u| before and after laryngeal surgery. A paired t-test (Wilcoxon rank sum
test) is used for the comparison between preoperative and postoperative vowels, and a p-value
lower than 0.05 is considered significant. According to the results, there is no overall
significant difference between the two groups (before and after surgery) of vowel data,
although F1 changed significantly for |o| in the male group, for |a| and |u| in the female
group, and for |a| and |o| irrespective of sex. In order to assess the accuracy of the
original frequency measurements, each vowel was re-measured twice, and the mean F1, F2, and F3
frequencies were within 15.52±18.03 Hz, 21±19.74 Hz, and 19.08±19.21 Hz of the original
values.
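As an illustration of the paired comparison, the sketch below computes the paired t statistic for pre- and postoperative values of one formant. It is only a sketch under stated assumptions: the function name and the four pre/post F1 values are hypothetical and are not taken from the study data, and it covers only the paired t-test, not the rank-based test. The resulting statistic would be compared with the t distribution with n - 1 degrees of freedom at the 0.05 level.

```python
import math
import statistics


def paired_t_statistic(pre, post):
    """Return (t, degrees of freedom) for paired samples pre[i], post[i]."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample S.D. of the differences
    return mean_d / (sd_d / math.sqrt(n)), n - 1


# Hypothetical F1 values (Hz) of four subjects before and after surgery.
pre_f1 = [650.0, 700.0, 620.0, 680.0]
post_f1 = [660.0, 720.0, 640.0, 690.0]
t_value, dof = paired_t_statistic(pre_f1, post_f1)
```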
Table 3-1. Mean and S.D. of formant frequencies (Hz) from sustained vowel |a|, |e|, |i|, |o|, |u|
before and after laryngeal surgery

                                       pre                post
Formant  Group                Vowel    Mean     S.D.      Mean     S.D.      p-value
F1       Male                 a        659.6    67.0      677.2    55.5      .266
F1       Male                 e        477.6    67.4      497.0    47.2      .158
F1       Male                 i        283.7    32.2      275.4    32.2      .277
F1       Male                 o        340.2    58.5      372.0    45.7      .018
F1       Male                 u        302.0    55.2      306.3    30.7      .706
F1       Female               a        686.1    205.9     785.8    109.7     .037
F1       Female               e        453.4    151.1     485.9    124.1     .326
F1       Female               i        310.4    40.8      333.5    52.8      .103
F1       Female               o        410.1    88.2      429.5    62.3      .156
F1       Female               u        340.1    42.4      368.8    41.9      .025
F1       Irrespective of sex  a        672.2    148.8     728.9    100.8     .019
F1       Irrespective of sex  e        466.1    114.2     491.7    91.1      .132
F1       Irrespective of sex  i        296.4    38.6      303.1    51.8      .398
F1       Irrespective of sex  o        373.5    81.3      399.4    60.9      .006
F1       Irrespective of sex  u        320.2    52.6      336.1    47.9      .062
F2       Male                 a        1446.8   451.1     1255.1   389.5     .170
F2       Male                 e        1744.2   275.5     1714.5   241.9     .679
F2       Male                 i        2065.3   354.1     2183.0   207.0     .189
F2       Male                 o        860.7    399.6     889.1    380.5     .781
F2       Male                 u        920.0    433.8     722.9    153.1     .067
F2       Female               a        1412.3   359.3     1359.2   165.6     .554
F2       Female               e        1517.7   627.3     1626.9   649.6     .517
F2       Female               i        1854.1   666.0     1680.9   829.7     .477
F2       Female               o        854.4    249.5     888.6    308.3     .712
F2       Female               u        803.5    206.6     952.2    391.5     .124
F2       Irrespective of sex  a        1430.4   405.4     1304.6   305.3     .133
F2       Irrespective of sex  e        1636.3   484.1     1672.8   477.0     .675
F2       Irrespective of sex  i        1964.7   530.2     1943.9   636.7     .866
F2       Irrespective of sex  o        857.7    332.7     888.9    343.8     .647
F2       Irrespective of sex  u        864.5    345.9     832.1    310.6     .661
F3       Male                 a        2898.5   530.5     2730.5   420.9     .254
F3       Male                 e        2885.6   348.3     2838.9   359.7     .633
F3       Male                 i        3255.9   362.8     3320.6   327.0     .385
F3       Male                 o        2215.4   476.9     2192.8   458.9     .862
F3       Male                 u        2394.3   445.6     2264.9   259.3     .232
F3       Female               a        2823.8   594.7     2937.7   235.9     .397
F3       Female               e        2771.6   374.6     2781.3   335.9     .927
F3       Female               i        3232.9   371.4     3103.9   547.4     .399
F3       Female               o        2120.4   286.8     2160.5   303.4     .663
F3       Female               u        2142.3   307.6     2299.6   520.3     .275
F3       Irrespective of sex  a        2862.9   556.3     2829.1   357.1     .734
F3       Irrespective of sex  e        2831.3   361.2     2811.5   345.6     .779
F3       Irrespective of sex  i        3244.9   362.6     3217.5   453.5     .737
F3       Irrespective of sex  o        2170.2   396.6     2177.4   388.3     .928
F3       Irrespective of sex  u        2274.3   402.2     2281.4   400.3     .936
Figure 3-3. Box plots of F1, F2, and F3 formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-4. Box plots of F1, F2, and F3 formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Welch's t-test for comparison between the two groups was also performed, and its results
are shown in the box plots of Fig. 3-3 and 3-4. In the comparison before and after surgery,
none of the formant frequencies of the vowels |a|, |e|, |i|, |o|, |u| of either sex differs
significantly (all p-values exceed 0.05), except F1 of phonation |u| in the female group. These
results also show no significant difference between the two groups. From these results we
infer that the postoperative change of the vowel is not related to the formant frequencies.
However, we also analyzed the loci of the F1 and F2 formant frequencies, with one S.D., to
examine changes in their distribution in detail (Fig. 3-5 and 3-6). The spread of F1 and F2 in
men is smaller than that in women, and the |a| sound of both groups is more clearly separated
than the other phonations.
Figure 3-5. Loci of the mean and S.D. of 1 of F1 and F2 formant frequencies of male before and after surgery
Figure 3-6. Loci of the mean and S.D. of 1 of F1 and F2 formant frequencies of female before and after surgery
3.3 Analysis and Results of Fundamental Frequency Perturbation (Jitter)
3.3.1 Various Jitter Measures
Mean F0, Max F0, Min F0 and S.D. F0 denote the mean, maximum, minimum, and standard
deviation of F0 in the analyzed segment, respectively. Phonatory frequency range is a
parameter indicating the range of tension of the vocal fold [87]. Mean absolute jitter
(MAJ) is the mean absolute difference between sequential vocal periods measured during
sustained phonation. However, absolute jitter is influenced by the mean fundamental
frequency of the speaker. For this reason, relative jitter measures such as Jitter (%) have
been proposed. Jitter (%) is the mean absolute difference between sequential vocal
fundamental frequencies divided by the mean frequency of the phonation.
Table 3-2. Various jitter measures
Description                                    Formula
Mean F0                                        $\frac{1}{n}\sum_{i=1}^{n} F_i$
Max F0                                         $\max(F_i)$
Min F0                                         $\min(F_i)$
S.D. F0                                        $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(F_i-\bar{F})^2}$
Phonatory Frequency Range                      $12 \times \frac{\log(F0\_hi / F0\_lo)}{\log 2}$
Mean Absolute Jitter (MAJ)                     $\frac{1}{n-1}\sum_{i=1}^{n-1}|F_{i+1}-F_i|$
Jitter (%)                                     $\frac{MAJ}{F0\_av}$
Pitch Perturbation Factor                      $\frac{N_{p \geq threshold}}{N_{voice}} \times 100$
Directional Pitch Perturbation Factor          $\frac{N_{\Delta\pm}}{N_{voice}} \times 100$
RAPP3                                          $\frac{\frac{1}{n-2}\sum_{i=2}^{n-1}\left|\frac{F_{i-1}+F_i+F_{i+1}}{3}-F_i\right|}{F0\_av} \times 100$
RAPP5                                          $\frac{\frac{1}{n-4}\sum_{i=3}^{n-2}\left|\frac{1}{5}\sum_{k=i-2}^{i+2}F_k-F_i\right|}{F0\_av} \times 100$
RAPP15                                         $\frac{\frac{1}{n-14}\sum_{i=8}^{n-7}\left|\frac{1}{15}\sum_{k=i-7}^{i+7}F_k-F_i\right|}{F0\_av} \times 100$
Other effective pitch perturbation features are the pitch perturbation factor (PPF) and the
directional pitch perturbation factor (DPPF). PPF, formulated by Lieberman, is defined as the
percentage of waveform periods exceeding a given threshold relative to the total number of
voiced pitch periods, and has proved to be sensitive to the presence of masses on the vocal
folds [88]. In this study, the threshold is 10 percent of positioning error of pitch periods.
DPPF, proposed by Hecker & Kreul, is the percentage of the total number of differences between
adjacent pitch periods for which there is a change in the algebraic sign [89, 90]. Finally,
the relative average pitch perturbations (RAPP) introduced by Koike [91] are the average
absolute difference between a period and the average of it and its closest neighbors,
divided by the average period. In this study, 3, 5 and 15 pitch periods are selected. These
measures have been extensively used in the last decade, since they are less sensitive to pitch
extraction errors due to the smoothing in their calculation.
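The definitions above can be sketched in a few lines of Python. This is an illustrative sketch rather than the analysis code of this study: the function names and the example F0 sequence are hypothetical, Jitter (%) and RAPP are scaled by 100 to express them in percent, and DPPF is normalized by the number of values in the sequence as an assumption about the voiced-period count.

```python
def mean_absolute_jitter(f):
    """MAJ: mean absolute difference between consecutive values."""
    return sum(abs(f[i + 1] - f[i]) for i in range(len(f) - 1)) / (len(f) - 1)


def jitter_percent(f):
    """MAJ divided by the mean of the sequence, expressed in percent."""
    return 100.0 * mean_absolute_jitter(f) / (sum(f) / len(f))


def rapp(f, width):
    """Relative average perturbation with an odd smoothing width (3, 5, 15)."""
    half = width // 2
    mean_f = sum(f) / len(f)
    # Deviation of each value from the average of itself and its neighbors.
    devs = [abs(sum(f[i - half:i + half + 1]) / width - f[i])
            for i in range(half, len(f) - half)]
    return 100.0 * (sum(devs) / len(devs)) / mean_f


def dppf(f):
    """Percentage of sign changes between consecutive differences."""
    diffs = [f[i + 1] - f[i] for i in range(len(f) - 1)]
    changes = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    return 100.0 * changes / len(f)


# Hypothetical F0 sequence (Hz) alternating by 2 Hz from cycle to cycle.
f0_track = [100.0, 102.0, 100.0, 102.0, 100.0]
maj = mean_absolute_jitter(f0_track)
jit = jitter_percent(f0_track)
rapp3 = rapp(f0_track, 3)
sign_changes = dppf(f0_track)
```

A longer smoothing width averages over more neighbors, which is why RAPP15 reacts less to isolated pitch extraction errors than RAPP3.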
3.3.2 Comparative Results
Tables B-1, B-2, B-3, and B-4 in APPENDIX B present the various jitter-related measures
before and after surgery, analyzed by the Wilcoxon rank sum test. Among the 12 measures,
significant changes are found in the means of all measures except Max F0 and Min F0. In
particular, the mean F0 values of the male group differ significantly between preoperative
and postoperative vowels, whereas those of the female group do not show remarkable changes.
This result is plotted in Fig. 3-7. In the male group, the average differences between before
and after surgery for the vowels |a|, |e|, |i|, |o|, |u| are -17.3 Hz (-12.8 %), -16.9 Hz
(-12.4 %), -17.1 Hz (-12.4 %), -21.5 Hz (-15.2 %), and -22.3 Hz (-15.6 %), respectively. Min
F0 is also significantly different in the male group, but that of the female group shows no
significant change. However, Max F0 of both groups is significantly different. The S.D. of F0
also shows significant changes for the vowels |e|, |i|, |u| of the male group and |a|, |e|, |i|
of the female group, according to Table B-2 in APPENDIX B. Phonatory frequency range, mean
absolute jitter (MAJ), and Jitter (%) are good surrogates for discriminating between the two
groups, before and after laryngeal surgery: nearly all of their p-values indicate a
significant difference from the pre-treatment data. Phonatory frequency range decreases to
about half of its value before surgery. MAJ and Jitter (%) also decrease to between about a
half and a seventh of their values before surgery. The detailed decreases are given in Tables
B-2 and B-3 in APPENDIX B. The pitch perturbation factor and the directional perturbation
factor also show significant differences, but some vowels show no change: phonation |a| and
|o| of the male group and |a|, |i|, |o|, and |u| of the female group for the pitch perturbation
factor, and |i|, |o|, and |u| of the female group for the directional perturbation factor. The
RAPP-related measures are also good criteria for classification between the preoperative and
postoperative groups. As shown in Fig. 3-10, these measures are significantly different for
all vowels of both sexes, excluding phonation |i| of the female group in RAPP3 and RAPP15. A
possible explanation is that the |i| sound has more high-frequency components than the other
phonations.
Figure 3-7. Mean F0 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-8. Jitter (%) of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-9. Pitch perturbation factor of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-10. RAPP15 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after
laryngeal surgery
3.3.3 Analysis and Results of Intensity Perturbation (Shimmer)
3.3.3.1 Various Shimmer Measures
Mean Amplitude (Amp), Max Amp, Min Amp and S.D. Amp denote the mean, maximum, minimum,
and standard deviation of the amplitude in the analyzed segment, respectively.
Table 3-3. Various shimmer measures
Description                                    Formula
Mean Amp                                       $\frac{1}{n}\sum_{i=1}^{n} A_i$
Max Amp                                        $\max(A_i)$
Min Amp                                        $\min(A_i)$
S.D. Amp                                       $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(A_i-\bar{A})^2}$
Shimmer (dB)                                   $\frac{1}{n-1}\sum_{i=1}^{n-1}\left|20 \times \log\left(\frac{A_i}{A_{i+1}}\right)\right|$
Mean Absolute Shimmer (MAS)                    $\frac{1}{n-1}\sum_{i=1}^{n-1}|A_{i+1}-A_i|$
Shimmer (%)                                    $\frac{MAS}{Amp\_av}$
Amplitude Perturbation Factor                  $\frac{N_{p \geq threshold}}{N_{voice}} \times 100$
Amplitude Directional Perturbation Factor      $\frac{N_{\Delta\pm}}{N_{voice}} \times 100$
RAAP3                                          $\frac{\frac{1}{n-2}\sum_{i=2}^{n-1}\left|\frac{A_{i-1}+A_i+A_{i+1}}{3}-A_i\right|}{Amp\_av} \times 100$
RAAP5                                          $\frac{\frac{1}{n-4}\sum_{i=3}^{n-2}\left|\frac{1}{5}\sum_{k=i-2}^{i+2}A_k-A_i\right|}{Amp\_av} \times 100$
RAAP15                                         $\frac{\frac{1}{n-14}\sum_{i=8}^{n-7}\left|\frac{1}{15}\sum_{k=i-7}^{i+7}A_k-A_i\right|}{Amp\_av} \times 100$
Shimmer (dB) is the average absolute base-10 logarithm of the ratio between the
amplitudes of consecutive periods, multiplied by 20. Mean absolute shimmer (MAS) is the
average absolute difference between the amplitudes of consecutive periods, and Shimmer (%)
is the MAS value divided by the average amplitude in order to reduce the influence of the
mean amplitude of the speaker. Other effective amplitude perturbation features are the
amplitude perturbation factor (APF) and the directional amplitude perturbation factor (DAPF).
APF is defined as the percentage of waveform periods exceeding a given threshold relative to
the total number of voiced pitch periods; the threshold is 10 percent of intensity error of
amplitudes. DAPF is the percentage of the total number of differences between adjacent
amplitudes for which there is a change in the algebraic sign. Finally, the relative average
amplitude perturbations (RAAP) are the average absolute difference between an amplitude and
the average of it and its closest neighbors, divided by the average amplitude. In this study,
3, 5 and 15 amplitude points are selected as smoothing factors.
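A minimal Python sketch of Shimmer (dB), MAS, and Shimmer (%) follows. It is illustrative only: the function names and the example amplitude sequence are hypothetical, the logarithm is taken of the ratio of consecutive amplitudes, and Shimmer (%) is scaled by 100 to express it in percent.

```python
import math


def shimmer_db(a):
    """Mean absolute 20*log10 ratio of consecutive peak amplitudes."""
    terms = [abs(20.0 * math.log10(a[i] / a[i + 1])) for i in range(len(a) - 1)]
    return sum(terms) / len(terms)


def mean_absolute_shimmer(a):
    """MAS: mean absolute difference between consecutive amplitudes."""
    return sum(abs(a[i + 1] - a[i]) for i in range(len(a) - 1)) / (len(a) - 1)


def shimmer_percent(a):
    """MAS divided by the mean amplitude, expressed in percent."""
    return 100.0 * mean_absolute_shimmer(a) / (sum(a) / len(a))


# Hypothetical peak amplitudes of three consecutive periods.
amplitudes = [1.0, 0.5, 1.0]
shim_db = shimmer_db(amplitudes)
shim_pct = shimmer_percent(amplitudes)
```

In this example every consecutive pair differs by a factor of two, so Shimmer (dB) equals 20 * log10(2), about 6.02 dB.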
Figure 3-11. Shimmer (%) of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-12. RAAP15 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
3.3.3.2 Comparative Results
Tables B-5, B-6, B-7, and B-8 in APPENDIX B present the various shimmer-related measures
before and after surgery, analyzed by the Wilcoxon rank sum test. Among the 12 measures,
significant changes are found only in Shimmer (dB), Shimmer (%), and the RAAP-related
measures, in contrast to the significant changes found in most of the jitter-related
measures. Mean Amp, Max Amp, Min Amp, S.D. Amp, MAS, the amplitude perturbation factor, and
the amplitude directional perturbation factor show no significant changes. Shimmer (dB) and
Shimmer (%) of both sexes appear to be good discriminators between the preoperative and
postoperative groups. However, judging by its p-values, phonation |i| is a poor classifier
for these measures. These results are plotted in Fig. 3-11. For the RAAP-related measures,
postoperative voiced sounds show significantly lower values in all phonations of both sexes,
except phonation |i|, |o|, |u| of the female group in RAAP3 and RAAP5, and phonation |i| of the
female group in RAAP15. In particular, the larger the smoothing factor (from 3 to 15), the
bigger the gap between the two groups (before and after surgery). This suggests that local
shimmer fluctuations are smoothed out by a long smoothing factor, so that the analyzed
segment is evaluated more precisely.
3.3.4 Analysis and Results of Noise Components
3.3.4.1 Estimation of the noise in the spectral domain
Besides the presence of noise, one should also take into account factors that affect the
value of the parameter but do not have a pathological origin (for example, slow amplitude and
pitch variations caused by the person’s emotional state). This is necessary to distinguish
normal from slightly pathological voices. For this reason, the popular method of Yumoto [92]
for determining the harmonic-to-noise ratio, which is known to have insufficient precision,
has been modified with the aim of decreasing the influence of non-pathological factors [93].
Spectral methods for noise estimation in a voice signal are based on Parseval’s theorem,
which states that the mean power of the signal (taken in the time domain) equals the sum of
the powers of its spectral components. Spectral methods have some significant advantages over
methods based on the time domain. Firstly, they are better able to separate the harmonic
component from the noise component (additive and modulation noise) of the signal. Secondly,
they are less sensitive to slow changes in pitch and amplitude that are due to the patient’s
failure to hold a constant tone and power during phonation, or to emotional reasons.
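Parseval’s theorem can be checked numerically with a short Python sketch. The naive DFT, the function names, and the 16-point test signal are illustrative assumptions; for a length-N signal the mean power in the time domain, (1/N) times the sum of x[n]^2, equals (1/N^2) times the sum of |X[k]|^2.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(N^2)), adequate for short frames."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def mean_power_time(x):
    """Mean power computed from the time-domain samples."""
    return sum(v * v for v in x) / len(x)


def mean_power_spectrum(x):
    """Mean power computed from the spectral components."""
    n = len(x)
    return sum(abs(c) ** 2 for c in dft(x)) / (n * n)


# Hypothetical test signal: two sinusoidal components in a 16-point frame.
frame = [math.sin(2 * math.pi * 3 * n / 16) + 0.5 * math.cos(2 * math.pi * 5 * n / 16)
         for n in range(16)]
p_time = mean_power_time(frame)
p_freq = mean_power_spectrum(frame)
```

Because the two values agree, the power found in harmonic bins and the power found in the remaining bins together account for the whole signal power, which is the property the spectral noise measures rely on.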
3.3.4.2 Estimation of Harmonic-to-Noise Ratio (HNR)
HNR estimation in speech signals based on cepstral liftering was first developed by de Krom
[94]; it is implemented by removing specific regions in the quefrency domain. The essential
principle of the de Krom approach is that the harmonics can be removed from the spectrum of
voiced speech using cepstral processing, so that the noise can be estimated at all
frequencies in the spectrum. The rahmonics, which are the prominent peaks at integer
multiples of the fundamental period (T) in the cepstrum of voiced speech, are removed
through comb-liftering. The resulting comb-liftered cepstrum is Fourier transformed to
obtain a noise spectrum (log power spectrum in dB), $N_{ap}(f)$, which is subtracted from the
log power spectrum, $O(f)$, of the original signal. This gives a source-related log power
spectrum, $H_{ap}(f)$. A baseline correction factor $B_d(f)$, defined as the deviance of the
harmonic peaks from the 0 dB line, is determined. This factor is subtracted from the
estimated noise spectrum to yield a modified noise spectrum. The modified noise
spectrum $N(f)$ is then subtracted from the original log spectrum in order to estimate the
harmonics-to-noise ratio (HNR).
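The location of the rahmonics and the effect of comb-liftering can be illustrated with the following Python sketch. It is a simplified illustration and not de Krom's full procedure: the naive DFT, the function names, the spectral floor, and the synthetic impulse train (period of 32 samples in a 256-sample frame) are assumptions, and the lifter assumes that the frame length is an integer multiple of the period.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive DFT or inverse DFT (O(N^2)), adequate for a short frame."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-6):
    """Inverse DFT of the log power spectrum (floor avoids log of zero)."""
    log_power = [math.log(abs(v) ** 2 + floor) for v in dft(frame)]
    return [c.real for c in dft(log_power, inverse=True)]


def first_rahmonic(cepstrum, q_min, q_max):
    """Quefrency index of the largest cepstral value in [q_min, q_max)."""
    return max(range(q_min, q_max), key=lambda q: cepstrum[q])


def comb_lifter(cepstrum, period):
    """Zero the cepstral bins at non-zero integer multiples of the period."""
    return [0.0 if q > 0 and q % period == 0 else c for q, c in enumerate(cepstrum)]


N, PERIOD = 256, 32  # hypothetical frame length and pitch period in samples
frame = [1.0 if n % PERIOD == 0 else 0.0 for n in range(N)]
cep = real_cepstrum(frame)
peak_quefrency = first_rahmonic(cep, 20, 50)
liftered = comb_lifter(cep, PERIOD)
```

Transforming the liftered cepstrum back to the frequency domain would give the smooth baseline from which the noise level is estimated.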
Figure 3-13. HNR ratio calculation using Cepstral smoothing in Spectrum
A voiced speech waveform, $S_{en}(t)$, including aspiration noise [95], $n(t)$, at the glottal
source, can be approximated by equation (2.1):
$S_{en}(t) = [e(t) * g(t) + n(t)] * v(t) * r(t)$    (2.1)
where $e(t)$ is a periodic impulse train, $g(t)$ is a single glottal pulse, $v(t)$ is the
impulse response of the vocal tract, $r(t)$ represents the radiation load, and $*$ indicates
convolution. Applying a Hanning window $w(t)$ gives equation (2.2):
$S^{w}_{en}(t) = \{[e(t) * g(t) + n(t)] * v(t) * r(t)\} \times w(t)$    (2.2)
Provided the window length is sufficiently long, the window function can be moved inside the
convolution [96] to give equation (2.3):
$S^{w}_{en}(t) = [e_w(t) * g(t) + n_w(t)] * v(t) * r(t)$    (2.3)
Taking the FFT gives equation (2.4):
$S^{w}_{en}(f) = [E_w(f) \times G(f) + N_w(f)] \times V(f) \times R(f)$    (2.4)
Taking the logarithm of the magnitude squared values and approximating the signal energy at
harmonic locations, $\log|S^{w}_{en}(f)|^2_h$, and at between-harmonic locations,
$\log|S^{w}_{en}(f)|^2_{bh}$, gives equations (2.5) and (2.6):
$\log|S^{w}_{en}(f)|^2_h = \log|E_w(f) \times G(f)|^2 + \log|V_R(f)|^2$    (2.5)
$\log|S^{w}_{en}(f)|^2_{bh} = \log|N_w(f)|^2 + \log|V_R(f)|^2$    (2.6)
where $V_R(f)$ is the FFT of $v(t)$ and $r(t)$ combined.
Figure 3-14. Influence of cepstral smoothing due to liftered long-term temporal window (57ms)
Figure 3-15. Influence of cepstral smoothing due to liftered short-term temporal window (1ms)
Applying the cepstrum to this signal and obtaining the liftered baseline shows that the
baseline is influenced by the noise source and by the size of the temporal analysis window.
As the temporal window length increases, the baseline approximates the noise floor more
accurately, because more estimates are available between the harmonics than at the harmonics,
and because the Fourier transform of the liftered cepstrum behaves like a moving-average
filter applied to the logarithmic spectrum.
3.3.5 Estimation of Degree of Hoarseness (DH) and Normalized Noise Energy (NNE)
Degree of Hoarseness (DH) and Normalized Noise Energy (NNE) are similar to HNR in that they
also estimate the noise embedded in the speech signal. First, the signal spectrum is
calculated for a limited segment of the signal, for which a Hanning window is used in this
study. Since noise in the harmonics range should also be taken into account in order to find
the total noise energy precisely enough, the harmonics range width must be less than half the
fundamental frequency. The analyzed signal is divided into segments of width $T_W$ and
displacement $T_W/2$, and the harmonic and noise component energies are determined for each
segment. The final values of these energies are obtained by averaging over the whole signal.
With spectral methods, the energies of the periodic and noise components are calculated on the
basis of the power spectrum of the voice signal, obtained through the FFT. In order to reduce
the influence of slow signal variations as much as possible, the Hanning window length is
taken to be 14 periods, i.e. $W = 14S/F_0$, where W is the window length expressed in points
and S is the sampling rate. All spectrum components within the harmonics bands are summed to
obtain the harmonics energy $E_H$ (Energy of Harmonic, EH), while the sum of the remaining
components is taken to be the noise energy $E_N$ (Energy of Noise, EN):
$E_H = \sum_{k=1}^{k_{max}} \left( \sum_{i=f'_k}^{f''_k} X_i - a_k \right)$    (2.7)
$E_N = \sum_{k=1}^{k_{max}} \left( \sum_{i=f''_{k-1}+1}^{f'_k-1} X_i + a_k \right)$    (2.8)
where $X_i$ is the power spectrum of the voice signal, $k_{max}$ is the number of harmonics,
$a_k = \frac{2b}{f_0 - 2b} \sum_{i=f''_{k-1}+1}^{f'_k-1} X_i$    (2.9)
is the noise energy within the harmonics band, based on the assumption that the noise has a
random character and is uniformly distributed over the whole signal range, and
$f'_k = (k f_0 - b)$ and $f''_k = (k f_0 + b)$    (2.10)
are the lower and upper frequency bounds of the k-th harmonic band, respectively. $f_0 = F_0 N/S$
and $b = 1.5N/W$ are the fundamental frequency ($F_0$ is in Hz, $f_0$ is in number of samples)
and half of the harmonics band width, respectively. In order to avoid the influence of the
$f_0$ evaluation error, the values of NNE and DH are calculated for all possible fundamental
frequencies:
$f'_0 = \frac{f_i}{H_{max}}$    (2.11)
where $f_i$ takes an integer value in the interval
$f_i \in [(f_0 \pm \delta) H_{max}]$    (2.12)
$H_{max}$ is the number of the highest-frequency harmonic and $\delta = 0.25$ is the
experimentally determined maximum error of the $f_0$ calculation. The final value of NNE and
DH is taken to be the smallest value obtained. The values of DH and NNE obtained for each
segment are averaged over the whole phonation. DH is the ratio of $E_N$ to $E_H$, averaged
over the whole phonation:
$DH = \frac{1}{n} \sum_{i=1}^{n} \frac{E_N}{E_H}$    (2.13)
NNE [97] is the ratio of $E_N$ to $E_H + E_N$, calculated for a frequency range of 4 kHz,
averaged over the whole phonation, and converted into [dB]:
$NNE = 10 \log_{10} \left( \frac{1}{n} \sum_{i=1}^{n} \frac{E_N}{E_H + E_N} \right)$ [dB]    (2.14)
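The window length rule and equations (2.13) and (2.14) can be sketched in Python as follows. The function names and the example segment energies are hypothetical; the per-segment harmonic and noise energies are taken as given here and are not computed from a spectrum.

```python
import math


def window_length(sampling_rate, f0_hz, periods=14):
    """Window length in points for a window spanning the given number of periods."""
    return round(periods * sampling_rate / f0_hz)


def degree_of_hoarseness(segments):
    """DH: noise-to-harmonic energy ratio averaged over all segments."""
    return sum(e_n / e_h for e_h, e_n in segments) / len(segments)


def normalized_noise_energy_db(segments):
    """NNE: mean of E_N / (E_H + E_N) over all segments, converted to dB."""
    mean_ratio = sum(e_n / (e_h + e_n) for e_h, e_n in segments) / len(segments)
    return 10.0 * math.log10(mean_ratio)


# Hypothetical (E_H, E_N) pairs for two analysis segments.
segment_energies = [(9.0, 1.0), (9.0, 1.0)]
dh = degree_of_hoarseness(segment_energies)
nne = normalized_noise_energy_db(segment_energies)
w_points = window_length(22000, 110)
```

With one tenth of the energy in the noise component, NNE is -10 dB and DH is 1/9; a 110 Hz voice sampled at 22 kHz gives a 14-period window of 2800 points.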
3.3.6 Estimation of the normalized first harmonic energy (NFHE)
Breathy phonation is reflected in the acoustic signal as a relatively strong first harmonic
and weak high-frequency harmonics. The ratio between the amplitudes of the first and second
harmonics is usually used to evaluate the vocal quality of these types of voices [98]. A newer
parameter, called the normalized first harmonic energy (NFHE) [98], is more informative than
the ratio between the first and second harmonics for discriminating between normal and breathy
phonations [93]. The choice of the frequency band used for this parameter is due to the
absence of strong harmonics beyond 4 kHz for both normal and pathological voices.
Figure 3-16. Plot of harmonic and noise phase segment in spectral domain
To calculate the parameter NFHE, the voice signal is divided into segments containing an
equal number of periods, and the FFT is calculated for each segment using a Hanning window.
The length of the segments should be equal to that stated above, i.e. 14 periods. The value of
NFHE for every segment i can be obtained by equation (2.15):
$NFHE_i = \frac{\sum_{f=f_0-b}^{f_0+b} P(f)}{\sum_{k=2}^{k_{max}} \sum_{f=kf_0-b}^{kf_0+b} P(f)}$    (2.15)
where P(f) is the power spectrum of the signal, f0 is the fundamental frequency, b=1.5N/W is
half of the harmonics band, k is the harmonic's ordinal number, and kmax is the number of
harmonics within the 4 kHz range. N and W are the number of points used to calculate the FFT
and the length of the time window, respectively.
3.3.6.1 Comparative Results
All of the noise estimation measures of the male group change significantly after laryngeal
surgery according to Tables 3-4 and 3-5, whereas the NNE and DH measures of the female group
show no remarkable difference. Tables 3-4 and 3-5 show the results of the noise-estimation
measures analyzed by the Wilcoxon rank sum test between preoperative and postoperative
vowels. The pattern of unchanged phonations differs somewhat between the male and female
groups. Phonation |o|, |u| for HNR, |i| for DH, and |i|, |o|, |u| for NFHE are unchanged before
and after surgery in the male group, whereas phonation |a|, |u| for HNR, |a|, |i|, |o|, |u| for
NNE, and |a|, |i|, |o|, |u| for DH are unchanged in the female group. In particular, phonation
|a| of the female group shows no significant difference. In both sexes, phonation |e| is a
clear indicator for separating the preoperative and postoperative groups. The HNR of
postoperative vowels is higher than that of preoperative vowels by an average of 2.46 dB for
|a|, 2.38 dB for |e|, 2.35 dB for |i|, 1.11 dB for |o|, and 0.77 dB for |u| in the male group,
and by 1.16 dB for |a|, 1.17 dB for |e|, 1.14 dB for |i|, 1.4 dB for |o|, and 1.37 dB for |u| in
the female group. For the NNE measure, all phonations increase, by 3.90 dB for |a|, 5.31 dB for
|e|, 2.30 dB for |i|, 1.74 dB for |o|, and 1.33 dB for |u| in the male group, and by 2.01 dB for
|a|, 2.88 dB for |e|, 0.71 dB for |i|, 1.41 dB for |o|, and 0.79 dB for |u| in the female group.
DH of the male group also increases significantly, whereas that of the female group increases
somewhat but not significantly. However, NFHE, a representative measure of breathiness,
decreases significantly except for |i|, |o|, |u| of the male group.
Figure 3-17. Box plots of HNR, NNE, and DH of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-18. Box plots of HNR, NNE, and DH of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 3-19. Box plots of NFHE of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-20. Box plots of NFHE of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Table 3-4. Mean and S.D. of HNR, NNE, and DH from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
HNR
                          pre              post
                      Mean    S.D.     Mean    S.D.    p-value
Irrespective of sex
  a                  15.90    3.53    17.74    2.79    .001
  e                  15.85    3.20    17.65    2.60    .000
  i                  14.86    3.20    16.63    2.57    .001
  o                  14.07    2.60    15.32    2.66    .005
  u                  13.46    2.68    14.52    2.08    .023
Male
  a                  15.10    2.67    17.56    2.13    .001
  e                  15.09    2.75    17.47    2.57    .000
  i                  14.06    2.79    16.41    2.61    .007
  o                  13.93    2.55    15.04    2.95    .082
  u                  13.84    2.27    14.61    1.77    .147
Female
  a                  16.78    4.18    17.94    3.43    .155
  e                  16.68    3.52    17.85    2.68    .059
  i                  15.74    3.45    16.88    2.57    .046
  o                  14.23    2.72    15.63    2.35    .033
  u                  13.05    3.07    14.42    2.42    .086
NNE
                          pre              post
                      Mean    S.D.     Mean    S.D.    p-value
Irrespective of sex
  a                   9.97    4.53    13.00    2.70    .000
  e                  10.23    4.50    14.39    3.56    .000
  i                  10.49    4.49    12.04    3.44    .006
  o                   9.04    2.97    10.63    3.04    .001
  u                   8.42    2.91     9.49    2.47    .016
Male
  a                   7.91    3.06    11.81    2.13    .000
  e                   7.87    2.80    13.18    3.50    .000
  i                   8.48    3.82    10.79    2.74    .008
  o                   7.83    2.59     9.58    2.81    .005
  u                   7.59    2.95     8.92    2.53    .028
Female
  a                  12.28    4.83    14.30    2.70    .069
  e                  12.83    4.64    15.71    3.22    .009
  i                  12.70    4.20    13.41    3.66    .323
  o                  10.37    2.84    11.78    2.92    .073
  u                   9.32    2.64    10.12    2.29    .247
DH
                          pre              post
                      Mean    S.D.     Mean    S.D.    p-value
Irrespective of sex
  a                   5.78    2.74     7.24    1.99    .000
  e                   5.73    2.81     8.24    2.54    .000
  i                   6.58    3.03     7.26    2.75    .048
  o                   5.45    1.91     6.42    2.41    .003
  u                   4.83    1.64     5.69    1.47    .002
Male
  a                   4.49    1.36     6.34    1.36    .000
  e                   4.40    1.14     7.22    2.25    .000
  i                   5.19    2.05     6.12    1.80    .050
  o                   4.61    1.37     5.60    1.99    .033
  u                   4.26    1.29     5.31    1.26    .001
Female
  a                   7.20    3.19     8.22    2.13    .137
  e                   7.20    3.37     9.36    2.40    .003
  i                   8.11    3.25     8.52    3.09    .429
  o                   6.38    2.03     7.32    2.55    .052
  u                   5.45    1.78     6.10    1.59    .184
Table 3-5. Mean and S.D. of NFHE from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
NFHE
                          pre              post
                      Mean    S.D.     Mean    S.D.    p-value
Irrespective of sex
  a                  11.36    4.41     7.75    2.26    <.001
  e                  11.90    5.18     7.50    2.68    <.001
  i                  12.45    5.40     9.00    3.11    <.001
  o                  13.24    5.24     9.73    2.95    <.001
  u                  14.74    6.45    10.95    3.25    <.001
Male
  a                  10.56    3.99     6.60    1.69    <.001
  e                  11.66    5.77     6.59    2.40    <.001
  i                   9.83    4.77    10.85    2.78    0.35
  o                  10.84    4.10     8.83    2.60    .033
  u                  12.93    6.92     9.37    2.55    .033
Female
  a                  12.24    4.77     9.01    2.15    .004
  e                  12.18    4.51     8.50    2.67    .001
  i                  13.42    3.92    10.40    2.81    .002
  o                  15.89    5.16    10.72    3.05    <.001
  u                  16.74    5.38    12.69    3.10    .004
3.3.7 Analysis and Results of Electroglottographic Parameters
3.3.7.1 Estimation of Open Quotient and Speed Quotient
Two well-known parameters are used for the analysis of the EGG waveform: the Open Quotient
(OQ) and the Speed Quotient (SQ) [99]. OQ can be used to evaluate the EGG duty cycle, and SQ
can be used to determine the spectral properties of the generated voiced sound. However, how
to locate the glottal opening and closing instants needed to obtain OQ and SQ remains
controversial. The instant of glottal closure is well detectable as a positive peak in the time
derivative of the EGG signal (DEGG), but there is no agreement on how to define the instant of
glottal opening. In the past, the minimum of the DEGG has been used as a marker for glottal
opening [100], but often no clear minimum exists during a glottal cycle. Even when there is a
clear minimum, there is no agreement on whether it actually marks the instant of glottal
opening. An alternative approach to analyzing the vibrational pattern of a glottal cycle in an
EGG signal is to define the time instant as the point of intersection between the (falling edge
of the) EGG signal and a threshold line. With this threshold intersection criterion, various
thresholds have been used, such as 25%, 30%, 40%, 50% and 75% of the peak-to-peak amplitude
of the signal [101]. It is generally agreed that the results within a study do not depend
strongly on the threshold value as long as a constant value is used consistently. In this study,
OQ is defined as the ratio between the duration of the phase in which the glottis is open and
the whole duration of the glottal cycle, multiplied by 100 to express the value in percent, as
shown in Fig. 3-21 and Fig. 3-22.
Figure 3-21. Determination of start point of opening phase and closing phase in EGG, 16-smoothed EGG, and differentiated EGG waveform
Figure 3-22. Detail definition of opening and closing phase period in EGG waveform
The hybrid method suggested by Howard, which detects closing peaks on the DEGG signal and
estimates opening instants on the EGG signal using a threshold of 3/7, is adopted in this study
(Equation 2.16). The Speed Quotient (SQ) is defined as the ratio of rise time (increasing
contact) to fall time (decreasing contact), which yields an inverted measure compared with the
definition of SQ for the acoustic signal. This means that higher values of SQ indicate more
symmetrical EGG pulses (Equation 2.17).
$$\mathrm{Open\ Quotient\ (OQ)} = \frac{T_{open}}{T_{cycle}} \times 100\ (\%) \tag{2.16}$$
$$\mathrm{Speed\ Quotient\ (SQ)} = \frac{t_A}{t_B} \times 100\ (\%) \tag{2.17}$$
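The two quotients can be sketched for a single EGG cycle as follows. This is a simplified, threshold-only variant rather than the hybrid DEGG method adopted in the study: it assumes that larger EGG values mean more vocal fold contact, that the cycle contains a single contact pulse, and that both the rising and falling crossings are taken at 3/7 of the peak-to-peak amplitude. The function name and the triangular test cycle are hypothetical.

```python
def open_and_speed_quotient(cycle, level=3.0 / 7.0):
    """Return (OQ, SQ) in percent for one EGG cycle given as a list of samples."""
    lo, hi = min(cycle), max(cycle)
    if hi == lo:
        raise ValueError("flat cycle: no contact pulse found")
    threshold = lo + level * (hi - lo)
    above = [i for i, v in enumerate(cycle) if v > threshold]
    first, last = above[0], above[-1]      # start and end of the contact phase
    peak = cycle.index(hi)                 # instant of maximum contact
    if last == peak:
        raise ValueError("no falling edge after the peak")
    contact = last - first + 1
    t_open = len(cycle) - contact          # samples with the glottis open
    oq = 100.0 * t_open / len(cycle)       # T_open / T_cycle
    sq = 100.0 * (peak - first) / (last - peak)   # t_A (rise) / t_B (fall)
    return oq, sq


# Hypothetical 70-sample cycle: fast rise (10 samples), slower fall (20 samples).
cycle = ([float(i) for i in range(11)]
         + [10.0 - 0.5 * j for j in range(1, 21)]
         + [0.0] * 39)
oq, sq = open_and_speed_quotient(cycle)
```

In this toy cycle the rise is about half as long as the fall, so SQ is well below 100 %; a symmetrical pulse would give a value near 100 %.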
3.3.7.2 Comparative Results
Table 3-6 shows no evident difference between the preoperative and postoperative vowels.
The mean OQ of all phonations in the male group changes slightly but not significantly, and
the same holds for the female group. The mean SQ of both sexes also shows no significant
difference. However, when each group is divided into two subgroups by the mean SQ value, a
particular characteristic appears: SQ regresses toward the normal range, as shown in
Table 3-7. The two subgroups (Low-SQ and High-SQ) are divided by the mean value of the
postoperative vowel. In the Low-SQ group, whose values are lower than the mean of each
postoperative phonation, SQ tends to increase toward the mean, and in the High-SQ group,
whose values are higher than that mean, SQ tends to decrease toward the mean. This finding
suggests that the hypertensive vibration pattern of the vocal cords is normalized after
laryngeal surgery.
Table 3-6. Mean and S.D. of open quotient and speed quotient from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
                              Open Quotient                              Speed Quotient
                        pre             post                       pre              post
                    Mean    S.D.    Mean    S.D.   p-value     Mean    S.D.     Mean    S.D.   p-value
Irrespective of sex
  a                51.27   10.80   48.14    8.77   .196       75.29   40.46    82.20   34.25   .433
  e                50.44    9.63   46.85    8.44   .079       74.47   36.21    86.25   36.37   .124
  i                47.79   10.04   46.33   10.35   .495       86.53   45.04    88.36   38.92   .842
  o                47.91    9.67   47.60    9.87   .881       85.58   42.06    84.89   38.35   .933
  u                46.09   11.49   46.29    9.59   .925       98.37   68.82    92.47   60.67   .653
Male
  a                52.92   12.58   45.63    7.47   .050       80.59   46.51    97.75   33.14   .224
  e                51.03   12.23   45.23    8.19   .088       83.66   46.07    99.20   34.29   .220
  i                47.12   11.87   46.26   11.57   .808      100.87   51.87    98.9    42.53   .898
  o                48.64   10.21   46.05   11.78   .440       91.53   42.12   101.31   41.25   .447
  u                45.34   12.73   43.95   10.01   .703      116.77   82.59   116.26   71.97   .983
Female
  a                49.46    8.38   50.91    9.43   .628       69.45   32.76    65.09   27.01   .672
  e                49.78    5.82   64.35   16.64   .598       64.35   16.64    71.99   33.86   .598
  i                48.52    7.80   46.40    9.11   .370       70.77   30.08    76.77   31.61   .536
  o                47.11    9.23   49.31    7.15   .337       79.03   42.07    66.83   25.25   .224
  u                46.91   10.22   48.86    8.64   .300       78.13   43.08    66.30   29.00   .131
Table 3-7. Mean and S.D. of speed quotient from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
                           Low-SQ                                  High-SQ
                  pre              post                     pre               post
              Mean    S.D.     Mean    S.D.    p        Mean    S.D.      Mean    S.D.     p
Male
  a          50.18   20.37   101.87   32.53   .001     133.81   25.15    90.54    35.15   .030
  e          52.33   20.48    96.80   31.51   .003     138.48   13.89   103.41    40.64   .033
  i          59.91   19.94   100.72   42.64   .013     150.01   29.53    96.70    44.60   .023
  o          65.44   22.85   102.03   38.12   .010     137.18   24.69   100.06    49.01   .085
  u          72.05   27.16   110.04   36.91   .014     195.04   89.98   127.13   113.12   .270
Female
  a          48.29   10.60    71.27   30.09   .067      90.61   34.04    58.91    23.45   .028
  e          55.43    9.29    74.22   39.70   .088      85.16    9.27    66.80    14.55   .062
  i          59.88   10.47    74.99   33.05   .104     132.46   31.69    86.85    24.02   .228
  o          54.36    7.64    63.21   29.94   .284     124.86   41.46    73.56    12.07   .015
  u          54.71    6.96    58.65   18.01   .431     132.78   42.54    84.16    42.51   .016
Figure 3-23. Box plots of OQ and SQ formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-24. Box plots of OQ and SQ formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
3.4 Summary
In this chapter, our results show that the acoustic and electroglottographic characteristics
of vowels change after laryngeal surgery. The mean pitch of the male group decreased by about
12-15% of the preoperative pitch, whereas that of the female group did not change
significantly. Formant frequencies remain constant before and after surgery. Most jitter
measures change significantly, but only some shimmer measures differ after surgery. Among
the noise estimation measures (HNR, NNE, DH, and NFHE), some phonations show significant
differences depending on sex. Finally, OQ and SQ of the EGG measures show no overall change,
but when each group is divided into two subgroups by the mean SQ value, SQ regresses toward
the normal range.
Chapter 4
Modification of Preoperative Vowel Sounds based on Acoustic and
Electroglottographic Analysis
4.1 Introduction to Perception of Aperiodicity in Pathological Voices
Jitter, shimmer, and noise are surrogates for the acoustic measurement of voice signals,
and are often considered indices of the perceived quality of both normal and pathological
voices. Many applications of acoustic measures to the assessment of vocal quality derive their
validity from the relevance of specific acoustic properties of the signal to auditory
perceptions of voice.
Researchers typically use correlation or regression techniques to demonstrate the extent to
which such measures explain or predict listeners’ scalar quality judgments. However,
observed associations between acoustic and perceptual measures have varied considerably
across studies. Although hundreds of studies describing, evaluating, and applying measures of
noise and acoustic signal perturbation have been published [102], the perceptual salience of
these attributes remains poorly understood.
Synthesized voiced sounds are sometimes needed to evaluate the performance of the
measures stated above, and various models for generating them have long been developed. A
discrepancy exists between the results of early synthesis studies and findings from later
investigations examining this association in naturally produced voices [80]. Early synthesis
studies [103, 104] used sawtooth waves with added jitter (1–50 Hz around a mean F0 of 100 or
200 Hz) or shimmer (alternate periods reduced in amplitude by 1–6 dB). Complete correlations
were observed between the amount of jitter or shimmer and judgments of relative roughness for
these non-speech stimuli. Hillenbrand [105] also studied synthetic vowels using univariate
analysis between jitter, shimmer, and noise and ratings of breathiness and roughness.
In this chapter, we model the postoperative vowel based on the results of Chapter 3. To
predict postoperative voiced sounds, various methods for modifying the preoperative vowel
are tested and evaluated, especially with respect to the jitter, shimmer, and NSR of speech
sounds.
4.2 Synthesized Vowel Modeling
The input energy to the vocal tract comes from the vibrating glottis, which is driven by the
air pressure released from the lungs and causes sound waves to propagate through the tube.
The vocal cords give rise to a periodic waveform, known as the glottal waveform. Various
studies have concentrated on modeling the glottal waveform [106, 107], because it is the
source of voiced sounds, which resonate through the vocal tract. Modeling a synthesized vowel
is also necessary to evaluate the performance and accuracy of measures such as the
harmonic-to-noise ratio.
4.2.1 Glottal Waveform Modeling
4.2.1.1 Rosenberg’s Model
Figure 4-1. Glottal waveform generated by Rosenberg’s model
Rosenberg [108], Titze [109] and several other researchers investigated an alternative
approach to inverse filtering of the speech waveform to generate the excitation signal. Their
findings show that this approach is capable of giving a good estimate of the glottal volume
velocity. Two well-known parametric models for speech production are Rosenberg's model and
Titze's model.
Rosenberg used inverse filtering to extract the glottal waveform from speech, and applied
a pitch-synchronous re-synthesis method to produce speech utterances with various source
waveforms [108]. In his perceptual tests, the most natural excitation signal involves
specification of several parameters. In order to better explain Rosenberg’s model, we need to
first introduce the general waveform of the glottal area during the vocal fold excitation.
As shown in Fig. 4-1, T denotes the pitch period, TP denotes the opening phase during the
glottal excitation, and TN denotes the closing phase during the glottal excitation. In
Rosenberg's model, the glottal waveform is specified by three parameters: the pitch period T;
the open quotient (OQ) (TP+TN)/T, which denotes the ratio of pulse duration to pitch period;
and the speed quotient (SQ) TP/TN, which denotes the ratio of the rising to the falling pulse
duration. That is, OQ is the ratio of the duration of the glottal open phase to the duration of
the complete glottal cycle, and SQ is the ratio of the duration of the glottal opening phase to
the duration of the glottal closing phase. Generally, OQ ranges from 0.1 to 0.9, and SQ ranges
from 0.5 to 5.0. Based on Rosenberg's experimental results, the glottal area function is given
by:
$$G_r(n) = \begin{cases} \frac{1}{2}\left[1 - \cos(\pi n / T_P)\right], & 0 \le n \le T_P \\ \cos\left(\pi (n - T_P) / (2 T_N)\right), & T_P \le n \le T_P + T_N \\ 0, & \text{otherwise} \end{cases} \tag{4.1}$$
Fig. 4-1 shows a glottal waveform computed from Rosenberg’s model, where the
parameters are set as T=10ms, paired TP1/TP2/TP3 =2/3/4 ms, TN1/TN2/TN3 = 4/3/2 ms,
respectively.
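The Rosenberg pulse of Equation (4.1) can be generated with a few lines of code. The sketch below is only an illustration: it assumes that T, TP, and TN are given in samples, and the 10 kHz sampling rate used to convert the millisecond values is an assumption, not a value stated in the text.

```python
import math


def rosenberg_pulse(T, TP, TN):
    """Return one pitch period (T samples) of the Rosenberg glottal pulse."""
    pulse = []
    for n in range(T):
        if n <= TP:
            # Opening phase: raised-cosine rise from 0 to 1.
            g = 0.5 * (1.0 - math.cos(math.pi * n / TP))
        elif n <= TP + TN:
            # Closing phase: quarter-cosine fall from 1 to 0.
            g = math.cos(math.pi * (n - TP) / (2.0 * TN))
        else:
            # Closed phase.
            g = 0.0
        pulse.append(g)
    return pulse


# Assumed 10 kHz sampling: T = 10 ms, TP = 3 ms, TN = 3 ms.
pulse = rosenberg_pulse(T=100, TP=30, TN=30)
open_quotient = (30 + 30) / 100    # (TP + TN) / T
speed_quotient = 30 / 30           # TP / TN
```

Changing TP and TN while keeping their sum fixed skews the pulse without changing the open quotient, which is how the three curves of Fig. 4-1 differ.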
4.2.1.2 Titze’s Model
Titze proposed a parametric model to represent the glottal area [109]. The glottal
waveform in Titze's model is essentially similar to the Rosenberg pulse $G_r(n)$ described in
Equation (4.1), with an extra parameter beta that determines the residual decay of the falling
slope. The glottal area function in this model is given by:
$$G_r(n) = \begin{cases} \left(\dfrac{\theta}{\theta_m}\right)^{-\beta\,\theta_m \cot\theta_m} \dfrac{\sin\theta}{\sin\theta_m}, & \theta \le \pi \\ 0, & \theta > \pi \end{cases} \tag{4.2}$$
where
$$\theta \equiv \frac{\pi n}{T_P + T_N} \quad \text{and} \quad \theta_m \equiv \frac{\pi T_P}{T_P + T_N} \tag{4.3}$$
Figure 4-2. Glottal waveform generated by Titze’s model
Fig. 4-2 shows glottal waveforms computed from Titze's model, where the parameters are set
as T = 10 ms, TP = 3 ms, TN1/TN2/TN3 = 1.5/3.0/4.5 ms (so that TP + TN1/TN2/TN3 = 4.5/6.0/7.5 ms),
and beta = 1.2. Beta is the slope factor, typically ranging from 0.7 to 3.0, which is the time
constant of the residual decay expressed as a percentage of the pitch period.
4.2.2 Aperiodicity of Glottal Waveform
In this study, the vowels |a|, |e|, |i|, |o|, |u| are synthesized using an implementation of
the discrete-time model for speech production, with Titze's glottal area function used as the
source function. A sequence of glottal pulses of the normalized Titze model is used as input to
a delay-line digital filter, whose coefficients are obtained from area function data for
normal Korean vowels, with jitter (%) of 0.03, shimmer (%) of 0.5, and a reflection coefficient
at the lip end of 0.65. Radiation at the lips is modeled by the first-order difference equation
$R(z) = 1 - z^{-1}$. To create a sequence of such wave shapes, an impulse train generator
produces a sequence of unit impulses spaced by the desired fundamental period. This sequence
is then convolved with the glottal pulse shape to produce the desired repetitive waveform.
Since the goal is to study abnormalities of the voicing source, perturbation is introduced at
the source: aperiodicity is introduced into the waveform by altering the source function. As
in Equation (4.4), random shimmer is introduced by applying a random variable gain factor to
the amplitude of the pitch period impulse train prior to convolution with the glottal pulse.
$$A' = A\,[100 + shimmer \cdot random(n)] / 100 \tag{4.4}$$
where A is the impulse train amplitude and n is the cyclic impulse number.
Random jitter is introduced in the same way, by applying a random variable gain factor to
the period of the pitch period impulse train.
$$Pitch\ Period' = (N1 + N2)\,[100 + jitter \cdot random(n)] / 100 \tag{4.5}$$
The introduction of random noise follows a similar strategy. Random additive noise is
introduced by adding the product of the average of the glottal pulse wavelet and a
user-specified variance to the glottal waveform $G_r(n)$. The noise is added according to the
following equation:
$$G_r'(n) = G_r(n) \cdot [100 + noise \cdot random(n)] / 100 \tag{4.6}$$
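The perturbation of the source by Equations (4.4) and (4.5) can be sketched as below. The text does not specify the distribution of random(n); this sketch assumes a uniform distribution on [-1, 1]. The function names, the seed argument, and the three-sample pulse in the usage lines are hypothetical and serve only as illustration.

```python
import random


def perturbed_impulses(n_cycles, period, amp=1.0, jitter=0.0, shimmer=0.0, seed=0):
    """Impulse positions (in samples) and amplitudes with random jitter and shimmer.

    jitter and shimmer are given in percent, as in Equations (4.4) and (4.5).
    """
    rng = random.Random(seed)
    positions, amplitudes = [], []
    pos = 0.0
    for _ in range(n_cycles):
        positions.append(pos)
        # Equation (4.4): A' = A [100 + shimmer * random(n)] / 100
        amplitudes.append(amp * (100.0 + shimmer * rng.uniform(-1.0, 1.0)) / 100.0)
        # Equation (4.5): period' = period [100 + jitter * random(n)] / 100
        pos += period * (100.0 + jitter * rng.uniform(-1.0, 1.0)) / 100.0
    return positions, amplitudes


def excite(positions, amplitudes, pulse, length):
    """Convolve the perturbed impulse train with one glottal pulse shape."""
    out = [0.0] * length
    for pos, a in zip(positions, amplitudes):
        start = int(round(pos))
        for k, g in enumerate(pulse):
            if start + k < length:
                out[start + k] += a * g
    return out


clean_pos, clean_amp = perturbed_impulses(3, 100)
rough_pos, rough_amp = perturbed_impulses(50, 100, jitter=3.0, shimmer=5.0, seed=1)
source = excite(clean_pos, clean_amp, [0.0, 1.0, 0.5], 300)
```

With jitter and shimmer set to zero the result is a strictly periodic excitation, so the same routine also produces the unperturbed reference signal.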
4.3 Modifications of Preoperative Vowel
In this section, we modify the preoperative vowel according to the findings of Chapter 3
in order to predict an enhanced vowel resembling the postoperative vowel (a successful return
to normal voice). This study focuses on three factors: jitter, shimmer, and aspiration noise
in speech, because these factors mainly affect the quality of voiced sounds.
4.3.1 Design of Modification of Fundamental Frequency
4.3.1.1 Pitch Scale Modification and Jitter using PSOLA
Figure 4-3. Pitch period modification by PSOLA
The PSOLA (Pitch Synchronous Overlap Add) method was originally developed at France
Telecom. It is not a synthesis method in itself, but it allows prerecorded speech samples to be
concatenated smoothly and provides good control of pitch and duration, so it is used in some
commercial applications [110].
There are several versions of the PSOLA algorithm, and all of them work in essentially the
same way. The time-domain version, TD-PSOLA, is the most commonly used because of its
computational efficiency [111]. The basic algorithm consists of three steps [112]: the analysis
step, in which the original speech signal is divided into separate but often overlapping
short-term analysis signals (ST); the modification of each analysis signal into a synthesis
signal; and the synthesis step, in which these segments are recombined by overlap-adding. The
short-term signals $x_m(n)$ are obtained from the digital speech waveform $x(n)$ by
multiplying the signal by a sequence of pitch-synchronous analysis windows $h_m(n)$:
$$x_m(n) = h_m(t_m - n)\,x(n) \tag{4.7}$$
where m is the index of the short-term signal. The windows, which are usually Hanning
windows, are centered on the successive instants $t_m$, called pitch marks. These marks are
set at a pitch-synchronous rate on the voiced parts of the signal and at a constant rate on the
unvoiced parts. The window length is proportional to the local pitch period, and the window
factor is usually from 2 to 4 [113]. The pitch marks are determined either by manual
inspection of the speech signal or automatically by a pitch estimation method [114]. The
segment recombination in the synthesis step is performed after a new pitch-mark sequence has
been defined.
Figure 4-4. Episodes of pitch scale modification by PSOLA
Manipulation of the fundamental frequency is achieved by changing the time intervals (pitch
periods) between pitch marks. Modification of duration is achieved by either repeating or
omitting speech segments. In principle, modification of the fundamental frequency also
implies a modification of duration [114]. Jitter can also be modified by PSOLA. A pitch period
longer than the maximum threshold of pitch period difference is compressed to within the
maximum threshold, whereas a pitch period shorter than the minimum threshold is expanded to
within the minimum threshold (Fig. 4-4). Following the results of Chapter 3, pitch scale
modification is applied only to the voiced sounds of the male group. The average pitch
modifications for the phonations |a|, |e|, |i|, |o|, |u| are -12.8 %, -12.4 %, -12.4 %, -15.2 %,
and -15.6 %, respectively.
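The definition of the new pitch-mark sequence can be sketched as follows. This covers only the computation of the synthesis pitch marks, not the windowing and overlap-add steps of PSOLA. It assumes that a pitch change of x % scales every period by 1/(1 + x/100), and that jitter is limited by clamping each period to a band around the mean period; the text does not state the exact thresholding rule, so the clamping rule and all names here are assumptions.

```python
def synthesis_pitch_marks(marks, f0_change_percent=0.0, max_deviation_percent=None):
    """Return new pitch-mark positions (in samples) for the synthesis step.

    marks                 : analysis pitch marks, in samples, increasing
    f0_change_percent     : desired change of F0, e.g. -12.8 lowers the pitch
    max_deviation_percent : if given, each period is clamped to the mean
                            period +/- this percentage (assumed jitter limit)
    """
    periods = [b - a for a, b in zip(marks, marks[1:])]
    # Lowering F0 by x percent lengthens every period by 1 / (1 + x / 100).
    scale = 1.0 / (1.0 + f0_change_percent / 100.0)
    periods = [p * scale for p in periods]
    if max_deviation_percent is not None:
        mean = sum(periods) / len(periods)
        lo = mean * (1.0 - max_deviation_percent / 100.0)
        hi = mean * (1.0 + max_deviation_percent / 100.0)
        # Compress periods that are too long, expand those that are too short.
        periods = [min(max(p, lo), hi) for p in periods]
    new_marks = [float(marks[0])]
    for p in periods:
        new_marks.append(new_marks[-1] + p)
    return new_marks


smoothed = synthesis_pitch_marks([0, 100, 210, 300, 400], max_deviation_percent=5.0)
lowered = synthesis_pitch_marks([0, 100, 200], f0_change_percent=-20.0)
```

In the first call the periods 110 and 90 are pulled back to 105 and 95 samples; in the second, lowering F0 by 20 % stretches each 100-sample period to 125 samples.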
4.3.1.2. Modification of Intensity
According to Behlau and Pontes [115, 116], vocal intensity is directly related with
subglottic pressure of the air column. Subglottic pressure, in turn, depends on factors such as
amplitude of vibration and tension of vocal folds, more specifically the glottic resistance.
Variations of intensity, however, also depend on frequency [117]. According to Behlau
and Pontes [115, 116], high voices tend to be more intense, because the increase in laryngeal
tonus generates higher glottic resistance and, consequently, more intensity. Jitter is affected
mainly because of lack of control of vocal fold vibration and shimmer with reduction of glottic
resistance and mass lesions in the vocal folds, which are related with presence of noise at
emission and breathiness [118].
According to the results of Chapter 3, shimmer measures do not differ significantly
between the preoperative and postoperative vowels. However, a Shimmer (%) above 6 % may
affect the perceptual quality of speech, so some modification of intensity is needed to
improve speech quality.
Our proposed method offers simple control of shimmer. First, all pitch points are found
with the pitch marker described above for PSOLA. Then, if the current amplitude differs from
the previous amplitude by more than a threshold, all values of that segment are increased or
decreased to fall within the permitted range. The threshold level is set to 2.5 % (the
permitted variation of amplitude). This process is illustrated in Fig. 4-5.
Figure 4-5. Intensity modification by Shimmer (%) of 2.5 %
4.3.1.3 Short term Postfilter
The short-term postfilter [119] consists of a 16th-order pole-zero filter in cascade with a
first-order all-zero filter. The 16th-order pole-zero filter attenuates the frequency
components between formant peaks, while the first-order all-zero filter attempts to
compensate for the spectral tilt in the frequency response of the 16th-order pole-zero filter.
The transfer function of the short-term postfilter is
$$P_f(z) = \frac{1}{g_f}\,\frac{\hat{A}(z/\gamma_n)}{\hat{A}(z/\gamma_d)} \tag{4.8}$$
where $\hat{A}(z)$ is the received quantized LP inverse filter, and the factors $\gamma_n$ and
$\gamma_d$ control the amount of short-term postfiltering; they are set to $\gamma_n = 0.55$
and $\gamma_d = 0.7$, as shown in Fig. 4-6. The role of this postfilter is similar to that of
the perceptual weighting filter in the choice of the excitation vectors in the encoding
scheme. It lowers the valleys and enhances the peaks of the decoded spectrum so that the noise
is shaped with the signal. If any white Gaussian noise is introduced during the encoding
process or by transmission errors, the resulting noise is small in the spectral regions of low
signal energy, and vice versa for the formant regions. In general, after the decoded speech is
passed through the long-term postfilter and the short-term postfilter, the filtered speech
does not have the same power level as the decoded (unfiltered) speech. To avoid occasional
large gain excursions, automatic gain control is used to force the postfiltered speech to have
roughly the same power as the unfiltered speech. This is taken care of by the gain factor
$g_f$.
Figure 4-6. Plots of short term postfiltered voiced sound in time and spectral domain
4.4 Design of Enhancement of Noise Components
4.4.1 Introduction to Wavelet Transform Threshold Shrinkage
Wavelet transform threshold shrinkage (WTS) [120, 121] exploits the difference between the
statistical properties of the signal and of the noise in the wavelet domain. WTS involves
shrinking in the wavelet transform domain and consists of three steps []: a linear forward
wavelet transform, a nonlinear shrinkage denoising, and a linear inverse wavelet transform.
Suppose we want to estimate the target signal from the noisy observation $X_n$,
$$X_n = S_n + E_n, \quad n = 1, \ldots, N \tag{4.9}$$
where $S_n$ denotes the target signal, and $E_n$ is independent and uniformly distributed.
Let W(·) and W−1(·) denote the forward and inverse wavelet transform operators, and let
D(·, λ) denote the denoising operator with soft threshold λ. We intend to wavelet-shrinkage
denoise $X(t)$ in order to recover $\hat{S}(t)$ as an estimate of $S(t)$. The following three
steps summarize the procedure:
$$Y = W(X), \qquad Z = D(Y, \lambda), \qquad \hat{S} = W^{-1}(Z) \tag{4.10}$$
Of course, this summary of principles does not reveal the details of implementing the
operators W or D, or of selecting the threshold λ. Let us focus on λ and D. Given a threshold λ
for data U (in any arbitrary domain: signal, transform, or otherwise), the rule
$$D(U, \lambda) \equiv \mathrm{sgn}(U)\,\max(0, |U| - \lambda) \tag{4.11}$$
defines nonlinear soft thresholding. The operator D nulls all values of U for which |U| ≤ λ and
shrinks toward the origin by an amount λ all values of U for which |U| > λ. It is the latter
aspect that has led to D being called the shrinkage operator in addition to the soft threshold
operator.
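The three steps of Equation (4.10) and the soft threshold of Equation (4.11) can be sketched with a one-level Haar transform. The Haar wavelet, the single decomposition level, and the choice to threshold only the detail coefficients are assumptions made to keep the example short; the text does not name the wavelet used in the study.

```python
import math


def soft_threshold(u, lam):
    """D(U, lambda) = sgn(U) * max(0, |U| - lambda), applied element-wise."""
    return [math.copysign(max(0.0, abs(x) - lam), x) for x in u]


def haar_forward(x):
    """One-level orthonormal Haar transform of an even-length list."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x


def wts_denoise(x, lam):
    approx, detail = haar_forward(x)        # Y = W(X)
    detail = soft_threshold(detail, lam)    # Z = D(Y, lambda)
    return haar_inverse(approx, detail)     # S_hat = W^-1(Z)


shrunk = soft_threshold([3.0, -3.0, 0.5, -0.5], 1.0)
smoothed = wts_denoise([1.0, 3.0, 2.0, 2.0], lam=10.0)
```

With a threshold larger than every detail coefficient, each pair of samples collapses to its average; with a threshold of zero, the input is reconstructed unchanged.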
4.4.2 Determination of Adaptive Threshold
When the spectrum is calculated for short signal segments, the segment duration is chosen
to be ten periods as a compromise. A short segment leads to an increase in the bandwidth of
the harmonics, which hinders the discrimination between harmonic and noise components. When
a long segment is used, the non-stationarity of the voice signal increases the noise component
and decreases the harmonic component, and again the discrimination between them is difficult.
A compromise must therefore be found. A number of experiments with simulated signals were
carried out. White noise (40% deviation), slow amplitude oscillations (40% deviation), and
slow pitch oscillations (6% deviation) were added to these signals. The experiments showed
that calculating HNR with a segment length of ten periods gives a more precise separation of
the harmonic from the noise component, and that the increased segment length does not lead to
a significant growth of the error produced by slow amplitude and pitch oscillations (which are
equivalent to non-stationarity). HNR, NNE, NFHE, and DH as functions of jitter, shimmer, and
noise are plotted in Fig. 4-7. The correlation coefficients of each noise estimation measure
are given in Table 4-1. NNE, NNE, and NFHE are the best parameters for jitter, shimmer, and
noise, respectively. However, another test is required because real speech does not contain
only one perturbation factor. A synthetic speech signal combining the perturbation factors
(Fig. 4-8) is therefore generated and tested to determine how well HNR correlates with
variations of noise, shimmer, and jitter, as shown in Fig. 4-9 and 4-10.
Table 4-1. Correlation coefficients of each noise estimation as a change of jitter, shimmer, and noise
            HNR      NNE      NFHE     DH
Jitter     -0.89    -0.94     0.90    -0.91
Shimmer    -0.86    -0.90     0.70    -0.90
Noise      -0.61    -0.85     0.98    -0.59
Figure 4-7. Plots of HNR, NNE, NFHE, and DH as a function of (a) jitter (S.D. 40%), (b) shimmer (40%), and (c) noise (6%), and (d) magnification of (c) in NFHE
Figure 4-8. Examples of synthetic voiced sound |a| (a) with shimmer and noise of 5 % and jitter of 0.75 %, (b) with shimmer and noise of 40 % and jitter of 6 %
Figure 4-9. Plots of HNR as a function of Jitter (S.D. 0.75- 6.0 %) & noise and shimmer (S.D. 5-40 %) for phonation |a|,|e|,|i|,|o|,|u| for female group
Figure 4-10. Plots of HNR as a function of Jitter (S.D. 0.75- 6.0 %) & noise and shimmer (S.D. 5-40 %) for phonation |a|,|e|,|i|,|o|,|u| for male group
Figure 4-11. Episode of denoising with Wavelet threshold shrinkage
As a result, HNR is highly correlated with changes of noise, shimmer, and jitter (Fig. 4-9,
4-10). It is therefore possible to infer inversely the threshold value for wavelet threshold
shrinkage, and the adaptive denoising threshold is calculated from the equation below, called
the modified rigorous SURE soft threshold:
$$Threshold = \alpha\,\sigma\,\sqrt{w_a} \tag{4.12}$$
Assume $W = [w_1, w_2, \ldots, w_N]$, where $w_n = w_{j,k}^2$ ($n = 1, 2, \ldots, N$) and
$w_1 \le w_2 \le \cdots \le w_N$. The risk is
$$r_i = \frac{1}{N}\left[N - 2i + (N - i)\,w_i + \sum_{k=1}^{i} w_k\right], \quad (i = 1, 2, \ldots, N) \tag{4.13}$$
Then let $r_a = \min_i r_i$, from which $w_a$ is obtained. Here
$\sigma = \mathrm{median}(|w_{1,k}|) / 0.6745$, where $w_{1,k}$ are the wavelet coefficients at
the first scale. α is our constant for the increased HNR value from the heuristic analysis
above, and α of 0.37 is used in this study. As a result, wavelet threshold shrinkage reduces
the noise while preserving the stress noise (Fig. 4-11).
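The threshold selection of Equations (4.12) and (4.13) can be sketched as below. The sketch follows the formulas literally: the squared coefficients are sorted, the risk is evaluated for every index, and the threshold is the scaled square root of the coefficient with minimum risk. It assumes that σ is the median absolute first-scale coefficient divided by 0.6745; the function name and the example coefficients are hypothetical.

```python
import math
import statistics


def modified_rigrsure_threshold(coeffs, finest_detail, alpha=0.37):
    """Adaptive soft threshold alpha * sigma * sqrt(w_a).

    coeffs        : wavelet coefficients whose threshold is wanted
    finest_detail : coefficients at the first scale, used to estimate sigma
    alpha         : scaling constant (0.37 in this study)
    """
    n = len(coeffs)
    w = sorted(c * c for c in coeffs)          # w_1 <= w_2 <= ... <= w_N
    best_risk, w_a, cumulative = None, None, 0.0
    for i, w_i in enumerate(w, start=1):
        cumulative += w_i                      # sum of w_k for k = 1 .. i
        risk = (n - 2 * i + (n - i) * w_i + cumulative) / n
        if best_risk is None or risk < best_risk:
            best_risk, w_a = risk, w_i
    # Assumed noise level estimate: median absolute deviation scaled by 0.6745.
    sigma = statistics.median(abs(c) for c in finest_detail) / 0.6745
    return alpha * sigma * math.sqrt(w_a)


example = [0.1, -0.2, 0.05, 3.0]
threshold = modified_rigrsure_threshold(example, example, alpha=1.0)
```

For the example coefficients the minimum risk occurs at the third sorted value, so the large coefficient 3.0 lies above the threshold and survives shrinkage while the three small ones are suppressed.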
4.5 Modification of Baseline Wander of EGG Signal
Baseline wander in the EGG waveform must be reduced, because the EGG waveform is used as
the input signal of the nonlinear speech synthesizer in the next chapter. In the vocal cords,
the abduction and adduction of the glottis are mainly controlled by the posterior
cricoarytenoid (abductor) and interarytenoid (adductor) muscles, respectively.
Electroglottography (EGG) is a technique used to register laryngeal behavior indirectly by
measuring the change in electrical impedance across the throat during speaking. However, the
EGG waveform is affected by the laryngeal muscles, which move the vocal cords and thus
produce baseline wander.
4.5.1 Introduction to Empirical Mode Decomposition
The EMD technique decomposes the signal into a number of Intrinsic Mode Functions
(IMFs), each of which is a mono-component function. This method was developed by Dr.
Norden E. Huang at the NASA Goddard Space Flight Center. The procedure for extracting the
IMFs from a signal $x(t)$ is illustrated in Fig. 4-12. After all the local maxima and minima
of the signal are identified, the upper and lower envelopes are generated by curve fitting.
Research has shown that more complex curve fitting functions yield only marginal improvement
while increasing the computational load significantly [122].
Therefore, the cubic spline function was employed in the present study. The mean of the
upper and lower envelopes of the signal, $m_{11}(t)$, is calculated as
$$m_{11}(t) = \left(x_{maxima}(t) + x_{minima}(t)\right) / 2 \tag{4.14}$$
where $x_{maxima}(t)$ and $x_{minima}(t)$ are the upper and lower envelopes of the signal,
respectively. Accordingly, the difference between the signal $x(t)$ and the envelope mean
$m_{11}(t)$, denoted $h_{11}(t)$, is given by
$$h_{11}(t) = x(t) - m_{11}(t) \tag{4.15}$$
Figure 4-12. Process diagram of EMD
Due to the approximate nature of the curve-fitting method, $h_{11}(t)$ has to be further processed
(by treating it as the signal itself and repeating the process) until it satisfies the
following two conditions.
1) The number of extrema and the number of zero-crossings are either equal or
differ by at most one.
2) At any point, the mean value of the envelope defined by the local maxima and the
envelope defined by the local minima is zero.
Through the iteration process (for a total of $j$ times), the difference between the signal and
the mean envelope values, which is denoted as $h_{1j}(t)$, is obtained as
$$h_{1j}(t) = h_{1(j-1)}(t) - m_{1j}(t) \tag{4.16}$$
where $m_{1j}(t)$ is the mean envelope value at the $j$th iteration, and $h_{1(j-1)}(t)$ is the
difference between the signal and the mean envelope values at the $(j-1)$th iteration. The function $h_{1j}(t)$ is then defined as the first IMF component and expressed as
$$\mathrm{IMF}_1(t) = h_{1j}(t) \tag{4.17}$$
After separating $\mathrm{IMF}_1(t)$ from the original signal $x(t)$, the residue is obtained as
$$r_1(t) = x(t) - \mathrm{IMF}_1(t) \tag{4.18}$$
Subsequently, the residue $r_1(t)$ can be treated as the new signal, and the above-illustrated
iteration process is repeated to extract the rest of the IMFs inherent to the signal $x(t)$ as
$$\begin{aligned}
r_1(t) - \mathrm{IMF}_2(t) &= r_2(t) \\
&\;\;\vdots \\
r_{n-1}(t) - \mathrm{IMF}_n(t) &= r_n(t)
\end{aligned} \tag{4.19}$$
The signal decomposition process is terminated when $r_n(t)$ becomes a monotonic
function, from which no further IMFs can be extracted. By substituting (4.19) into (4.18), the
signal $x(t)$ is decomposed into a number of intrinsic mode functions that are the constituent
components of the signal. As a result, the signal $x(t)$ can be expressed as
$$x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t) \tag{4.20}$$
where $c_i(t) = \mathrm{IMF}_i(t)$ represents the $i$th intrinsic mode function, and $r_n(t)$ is the residue of the
signal decomposition. Equation (4.20) provides a complete description of the empirical mode
decomposition process [123], which can be evaluated by checking the amplitude error
between the reconstructed and the original signal. As an example, Fig. 4-14 illustrates both the
IMFs and the residue of a multi-component signal. Finally, the wandering baseline estimate
(fluctuating red line in Fig. 4-13) is obtained by adding the last IMF (in this case,
$\mathrm{IMF}_8$ of Fig. 4-14) and the residue signal; the original signal minus this baseline
estimate is taken as the EGG signal with the baseline wander cancelled. For
comparison, a high-pass filtered EGG signal (order of 500 and cut-off frequency
of 50 Hz) is also plotted at the bottom of Fig. 4-13, and Fig. 4-14 shows each
IMF and the residue.
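The sifting procedure of Eqs. (4.14) to (4.20) and the baseline estimate (last IMF plus residue) can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function names, the end-point handling of the spline envelopes, the Cauchy-type stopping rule with its tolerance, and the synthetic test signal are all assumptions introduced here.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def local_extrema(x):
    """Return indices of interior local maxima and minima of a 1-D array."""
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] >= 0))[0] + 1
    return maxima, minima


def mean_envelope(x, maxima, minima):
    """Mean of the cubic-spline upper and lower envelopes, Eq. (4.14)."""
    t = np.arange(len(x))
    last = len(x) - 1
    # Assumption: both envelopes are pinned to the end samples to avoid extrapolation.
    up = np.concatenate(([0], maxima, [last]))
    lo = np.concatenate(([0], minima, [last]))
    upper = CubicSpline(up, x[up])(t)
    lower = CubicSpline(lo, x[lo])(t)
    return 0.5 * (upper + lower)


def sift(x, max_iter=50, tol=0.05):
    """Extract one IMF by repeated sifting, Eqs. (4.15) to (4.17)."""
    h = x.copy()
    for _ in range(max_iter):
        maxima, minima = local_extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        h_new = h - mean_envelope(h, maxima, minima)
        # Assumption: a Cauchy-type criterion stands in for the two IMF conditions.
        sd = np.sum((h - h_new) ** 2) / (np.sum(h ** 2) + 1e-12)
        h = h_new
        if sd < tol:
            break
    return h


def emd(x, max_imfs=10):
    """Decompose x into IMFs and a residue, Eqs. (4.18) to (4.20)."""
    imfs = []
    r = np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        maxima, minima = local_extrema(r)
        if len(maxima) < 2 or len(minima) < 2:   # residue is (nearly) monotonic
            break
        imf = sift(r)
        imfs.append(imf)
        r = r - imf
    return imfs, r


# Synthetic stand-in for an EGG waveform with a slow drift (hypothetical data).
t = np.linspace(0.0, 1.0, 2000)
x = np.sin(2 * np.pi * 40 * t) + 0.3 * np.sin(2 * np.pi * 3 * t) + 0.5 * t
imfs, residue = emd(x)
baseline = imfs[-1] + residue        # wandering baseline estimate
detrended = x - baseline             # signal with the baseline removed
```

By construction the IMFs and the residue sum back to the input, which is the reconstruction check mentioned in the text.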
Figure 4-13. Reduction of baseline wander in EGG waveform by EMD and high pass filter with FIR 500-order (Pass band: over 40 Hz); voiced sound |u| with sampling rate of 22050 Hz
Figure 4-14. Plots of IMFs and residue of EGG waveform in Figure 4-13.
4.6 Summary
In this chapter, we modified the preoperative voiced sounds in order to bring their
perceptual quality closer to that of a normal voice. Our hypothesis is that reducing the aperiodicity of
preoperative voiced sounds makes them resemble postoperative voiced sounds, and the main components
of aperiodicity are considered to be jitter, shimmer, and noise in the speech signal. Enhancement
rates are adjusted according to statistical results based on the difference between preoperative and
postoperative speech sounds. Pitch period, intensity, and aspiration noise
are modified by PSOLA, an intensity modifier, and wavelet threshold shrinkage, respectively.
Baseline wander of the EGG signal is also reduced by the empirical mode decomposition method,
which is part of the Hilbert-Huang transform. The modified speech and EGG signals are used as
input signals to the nonlinear speech model in the next chapter for estimation of the postoperative
vowel.
Chapter 5
Nonlinear Speech Production Modeling using Nonlinear Autoregressive
Exogenous based on Support Vector Regression
5.1 Introduction of Speech Production Modeling
Speech waveforms are rich in information but are highly redundant in structure. Storage
of acoustic data is therefore inefficient, and a more compact parametric representation of the
information conveyed by the signal is desirable. An ideal model should exploit the redundancy
in the speech signal to give data compression while capturing the distinguishing features. For
coding and synthesis applications, the ability to regenerate the original speech waveform from
the model is also necessary.
The acoustic speech waveform varies slowly with time as different sounds are produced,
so the frequency properties of the signal are constantly changing. A time-varying model of the
waveform is needed, for which the model parameters are continuously updated at a suitable
rate. Typically, a short-time analysis is used, in which the speech waveform is divided into a
sequence of overlapping segments of about 20 ms in duration, and a new set of model
parameters is calculated for each segment. Since the articulators move relatively slowly, the signal
can be treated as stationary over intervals of about 10 ms [124], which
permits an update rate (frame rate) of 10 ms. Even the fastest transitions in plosives can be
captured relatively well by an update rate of 5 ms.
Two approaches to developing a model are articulatory modeling and acoustic modeling.
The articulatory modeling approach aims to represent the vocal tract and movement of
articulators in as much physiological detail as possible and assumes that a similar underlying
system will generate a similar output. Articulatory models have the potential for good
reproduction from simple control signals and can reproduce all the perceptually relevant
effects of the real speech, such as co-articulation [125]. However, the dimensions of the vocal
tract and a detailed analysis of the movement of the articulators are needed. Such information
is difficult to obtain and often requires intrusive measurement techniques. The acoustic
modeling approach models the speech waveform directly in either the time or frequency
domain. The models are easy to construct because only the speech waveform is required,
which is easily obtained using a microphone. An exact match of the waveform or spectrum is
not needed for perceptually good synthesis and events which are not perceptually relevant
need not be modeled. The most popular technique for speech modeling applications, such as
speech coding and speech synthesis, is the time-domain acoustic modeling method known as
linear prediction (LP).
In this chapter, we review the well-known linear predictive coding (LPC) and nonlinear speech modeling based on
neural networks, and introduce our proposed nonlinear speech model based on the Support
Vector Machine (SVM). The results of applying the SVM-based model to
predict the postoperative vowel are also presented.
5.1.1 Overview of Linear Speech Production Modeling
Linear prediction techniques use a source-filter arrangement to model the vocal tract
system, which assumes that the source is located at the glottis and that a linear filter is
adequate to model the frequency properties of the vocal tract. At the analysis stage, it is
assumed that no information about the excitation of the vocal tract is known and that the
speech waveform can only be modeled from its previous values. The linear vocal tract filter
defines an autoregressive (AR) model of the speech, in which the current speech sample, $y(t)$,
is predicted from a linear combination of a finite number of past samples
$$\hat{y}_p(t) = -\sum_{k=1}^{n_a} a_k\, y(t-k) \tag{5.1}$$
where $\hat{y}_p(t)$ is the predicted speech sample. The prediction error or residual,
$e(t) = y(t) - \hat{y}_p(t)$, represents structure in the speech which is not captured by the model. For
a good model, the residual has no predictable structure and appears as white noise. For voiced
speech, the residual has significant peaks at the pitch period, which coincide with the instants
of excitation of the vocal tract, that is, with the rapid closure of the vocal cords [126].
When the LP model is excited by the residual signal, the speech waveform is reproduced
exactly. This is not practical for most applications, and one approach is to use a model of the
residual signal. A source-filter arrangement is used in which the residual is represented by an
impulse train at the pitch frequency for voiced sounds or a random, white noise generator for
unvoiced sounds.
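The predictor of Eq. (5.1) and its residual can be illustrated with a short least-squares sketch. The estimator below fits the coefficients directly over the available samples (a covariance-style fit chosen here for brevity); the function names and the synthetic second-order test signal are assumptions for illustration only.

```python
import numpy as np


def past_matrix(y, order):
    """Rows of past samples y(t-1), ..., y(t-order) for t = order, ..., len(y)-1."""
    return np.column_stack([y[order - k:len(y) - k] for k in range(1, order + 1)])


def lpc_least_squares(y, order):
    """Estimate a_k in y_hat(t) = -sum_k a_k y(t-k), Eq. (5.1), by least squares."""
    y = np.asarray(y, dtype=float)
    past = past_matrix(y, order)
    # Minimise ||y(t) + past @ a||^2, i.e. solve past @ a = -y(t).
    a, *_ = np.linalg.lstsq(past, -y[order:], rcond=None)
    return a


def residual(y, a):
    """Prediction error e(t) = y(t) - y_hat(t) for t >= order."""
    y = np.asarray(y, dtype=float)
    return y[len(a):] + past_matrix(y, len(a)) @ a


# Synthetic AR(2) signal with known coefficients (hypothetical test data).
rng = np.random.default_rng(0)
true_a = np.array([-1.5, 0.7])
e = rng.standard_normal(5000)
y = np.zeros(5000)
for t in range(2, 5000):
    y[t] = -true_a[0] * y[t - 1] - true_a[1] * y[t - 2] + e[t]

a_hat = lpc_least_squares(y, order=2)
err = residual(y, a_hat)
```

For this synthetic signal the residual is close to the white driving noise, which is the behaviour the text describes for a good model.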
The spectral match between the estimated transfer function $\hat{H}(z)$ and the spectral
envelope of the speech is shown by applying Parseval's theorem to the prediction error energy in equation (5.2). Substituting (5.3) gives (5.4), which
shows that minimizing $E$ is equivalent to minimizing the integral of the ratio of the energy
spectrum of the speech segment to the magnitude squared of the frequency response of the
system model.
$$E = \sum_{t=0}^{N+n_a-1} e^2(t) = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|Y(e^{j\omega})\right|^2 \left|A(e^{j\omega})\right|^2 d\omega \tag{5.2}$$
$$\hat{H}(e^{j\omega}) = \frac{G}{A(e^{j\omega})} \tag{5.3}$$
$$E = \frac{G^2}{2\pi}\int_{-\pi}^{\pi} \frac{\left|Y(e^{j\omega})\right|^2}{\left|\hat{H}(e^{j\omega})\right|^2}\, d\omega \tag{5.4}$$
For a predictor of order $n_a$, the first $n_a + 1$ values of the autocorrelation of the speech
segment and of the autocorrelation function of the system impulse response are equal. Thus, as the
predictor order tends to infinity, the magnitude spectra $|\hat{H}(e^{j\omega})|$ and $|Y(e^{j\omega})|$ will match.
$$\lim_{n_a \to \infty} \left|\hat{H}(e^{j\omega})\right|^2 = \left|Y(e^{j\omega})\right|^2 \tag{5.5}$$
However, the spectra $\hat{H}(e^{j\omega})$ and $Y(e^{j\omega})$ may not be equivalent because $\hat{H}(e^{j\omega})$ is
constrained to be minimum-phase (all zeros of $A(z)$ inside the unit circle). In general, the speech
spectrum is not minimum-phase when radiation occurs from more than one point and there
are multiple sound pathways. Due to the spectral matching property of the mean-squared error
criterion, linear prediction analysis can be used to obtain a smoothed estimate of the short-time
spectral envelope of the speech. Since $E$ depends on the ratio of $|Y(e^{j\omega})|^2$ and $|\hat{H}(e^{j\omega})|^2$,
the matching process performs uniformly over the frequency range of interest, regardless of
the shape of the spectral envelope of the speech. However, the formants of the spectrum are
more closely modeled because regions where $|Y(e^{j\omega})| > |\hat{H}(e^{j\omega})|$ contribute more to $E$ than
regions where $|Y(e^{j\omega})| < |\hat{H}(e^{j\omega})|$. Estimates of the speech formants can be obtained by
locating peaks in the smoothed spectral envelope or by factorizing $A(z)$ into its constituent
poles. Each formant is approximated by a complex-conjugate pole pair which forms a second-order
filter with transfer function, $A_i(z)$, given by
$$A_i(z) = 1 + a_1 z^{-1} + a_2 z^{-2} \tag{5.6}$$
The frequency of the formant is determined from the pole angle and the bandwidth from
the radius. The spectrum is unique in the range $-\omega_s/2 < \omega < \omega_s/2$ and repeats at multiples
of the sampling frequency, $\omega_s$. The transfer function, $\hat{H}(z)$, is stable when all the poles lie
inside the unit circle. This holds if all the sections $1/A_i(z)$ are stable. Analogous to the cascade and parallel
realization of the resonant network used in formant synthesizers, $\hat{H}(z)$ can be implemented
in cascade or parallel form. In cascade form, $\hat{H}(z)$ is expanded as a product of formant
factors
$$\hat{H}(z) = \frac{G}{\prod_{i=1}^{n_a/2} A_i(z)} \tag{5.7}$$
In parallel form, $\hat{H}(z)$ is expanded as a sum of formant factors
$$\hat{H}(z) = \sum_{i=1}^{n_a/2} \frac{B_i(z)}{A_i(z)} \tag{5.8}$$
where $B_i(z)$ are the residues of the partial fraction expansion of $\hat{H}(z)$.
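The relation between a second-order section of the form (5.6) and its formant can be made concrete with a small sketch. It uses the usual approximations: frequency from the pole angle and 3 dB bandwidth from the pole radius via $B = -(f_s/\pi)\ln r$. The function name and the example values (8 kHz sampling rate, 500 Hz formant, 80 Hz bandwidth) are assumptions chosen for illustration.

```python
import cmath
import math


def formant_from_section(a1, a2, fs):
    """Formant frequency and bandwidth (Hz) of 1 / (1 + a1 z^-1 + a2 z^-2)."""
    # The poles of the section are the roots of z^2 + a1 z + a2 = 0.
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    pole = (-a1 + disc) / 2.0            # one member of the conjugate pair
    radius = abs(pole)
    angle = abs(cmath.phase(pole))
    frequency = angle * fs / (2.0 * math.pi)
    bandwidth = -math.log(radius) * fs / math.pi
    stable = radius < 1.0                # pole inside the unit circle
    return frequency, bandwidth, stable


# Build a section from a known pole pair, then recover the formant from it.
fs = 8000.0
r = math.exp(-math.pi * 80.0 / fs)
theta = 2.0 * math.pi * 500.0 / fs
a1, a2 = -2.0 * r * math.cos(theta), r * r
freq, bw, stable = formant_from_section(a1, a2, fs)
```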
5.1.2 Limitations of Linear Speech Production Modeling
The advantages of linear prediction for speech analysis are ease of implementation, a
closed-form solution, complete separation of the source and vocal tract filter in synthesis, and
a direct interpretation in terms of loss-less acoustic tube model of the vocal tract [10].
However, linear prediction has several disadvantages. Unvoiced sounds are poorly
modeled by a minimum-phase, all-pole linear prediction model because the vocal tract
transfer function for these sounds contains zeros. Although it has been argued that the spectral notches produced
by zeros are hard to detect [127], synthesis of unvoiced sounds by a linear prediction model is
poor. In linear prediction models, zeros have to be approximated by a collection of poles,
which requires a higher prediction order. The linear prediction parameters are not optimal for
synthesis, since they are developed to minimize the mean-squared prediction error, rather than
the actual error obtained at the output of the model when used for synthesis (the synthesis
error). The main disadvantage of linear prediction is that the source and vocal tract filter are
not decoupled in analysis and the linear prediction filter thus models the combined effect of
source, vocal tract and lip radiation. As a result, the quality of synthetic speech generated from
linear prediction models degrades rapidly as the pitch of the excitation is altered from that of
the original speech.
5.2 Overview of Nonlinear Speech Production Modeling
5.2.1 Review of Former Research in Nonlinear Speech Production Modeling
Owing to their universal approximation capabilities, neural networks (NNs) are able to
approximate unknown systems based on sparse sets of noisy data [128, 129]. Although many
NN applications concern classification problems, growing interest has been devoted to
nonlinear time series prediction and complex nonlinear dynamic modeling [130]. However,
one of the main drawbacks that can hinder practical NN applications in multimedia is
their computational and structural complexity.
Classical approaches to nonlinear DSP are based either on specific and efficient architectures,
e.g. median and bilinear filters and some spectral analysis techniques, or on generic nonlinear
architectures that suit a large class of problems but are usually complex, e.g. Volterra filters,
nonlinear state equations, polynomial filters, functional links, etc. [131, 132]. In other words,
typical nonlinear DSP approaches consist of designing specific algorithms for specific problems.
Neural networks (the multi-layer perceptron (MLP) [133], the time delay neural networks
(TDNN) [134], and recurrent neural networks (RNN) [135]), have been used extensively in
the past for functional approximation of continuous nonlinear mappings [136]. Successful
functional approximation depends on appropriate selection of the parameter values.
The MLP and RNNs represent adaptive circuits which extend and generalize the simple
adaptive linear filter to the nonlinear domain. By adding delay lines, NN filters can
be viewed as an extension of linear adaptive filters to deal with nonlinear modeling tasks [137].
It is well known that a large number of DSP techniques are based on linear models, but in
some cases the nature of the problem is nonlinear, and in these cases nonlinear
general-purpose architectures are needed.
Despite the formal elegance of the neural model, several problems must be solved. First
of all is model selection. Given an input-output relation, the problems are: (1) the
determination of the number of inputs, (2) the number of neurons in the hidden layers needed
for a correct approximation, and (3), in the case of dynamic processes, how to put memory
(delay lines) in the model.
Even though there are several papers dealing with the problem of network topology
determination, the numbers of layers and neurons are usually specified by heuristic procedures.
Although linear adaptive filter theory is well known and consolidated, its extension to the
nonlinear domain is a field of great interest and in continuous expansion. In this section, some
neural architectures suitable for adaptive nonlinear filtering are presented. The formulation of
transversal and recursive filters can be easily extended to the nonlinear domain: in the case of
discrete-time sequences, the filter can be described through a relationship between the input
sequence $x[t], x[t-1], \ldots$ and the output sequence $y[t], y[t-1], \ldots$. The general forms are
expressed as
$$y[t] = \Phi\big(x[t], x[t-1], \ldots, x[t-M+1]\big) \tag{5.9}$$
$$y[t] = \Phi\big(x[t], x[t-1], \ldots, x[t-M+1], y[t-1], \ldots, y[t-N]\big) \tag{5.10}$$
In the first expression the output is a nonlinear function of the inputs (present and past
samples): in other words, equation (5.9) represents a nonlinear generalization of the linear finite
impulse response (FIR) filter. The output signal $y[t]$ in equation (5.10) is also a function of
past output samples, so it represents a nonlinear generalization of the linear infinite impulse
response (IIR) filter. Equation (5.10) represents a general form usually called the nonlinear
autoregressive moving average (NARMA) model. The indexes $M$ and $N$ represent the filter
memory lengths, and the pair $(N, M)$ is defined as the filter order.
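The argument list of $\Phi$ in Eq. (5.10) is simply the content of two tapped delay lines, one on the input and one on the output. A minimal sketch, assuming zero initial conditions and 0-based sample indices; the function names and the example mappings are hypothetical.

```python
import math


def narma_regressor(x, y, t, M, N):
    """Arguments of Phi in Eq. (5.10) at time t (0-based indexing)."""
    inputs = [x[t - i] for i in range(M)]            # x[t], ..., x[t-M+1]
    outputs = [y[t - k] for k in range(1, N + 1)]    # y[t-1], ..., y[t-N]
    return inputs + outputs


def run_filter(phi, x, M, N):
    """Run y[t] = phi(regressor) over x, assuming zero initial conditions."""
    start = max(M - 1, N)                            # first t with a full delay line
    y = [0.0] * len(x)
    for t in range(start, len(x)):
        y[t] = phi(narma_regressor(x, y, t, M, N))
    return y


# Linear special case (an IIR filter): y[t] = x[t] + 0.5 * y[t-1].
y_linear = run_filter(lambda v: v[0] + 0.5 * v[1], [1.0, 1.0, 1.0, 1.0], M=1, N=1)

# Nonlinear case: the same regressor passed through a saturating map.
y_nonlinear = run_filter(lambda v: math.tanh(v[0] + 0.5 * v[1]),
                         [1.0, 1.0, 1.0, 1.0], M=1, N=1)
```

Setting `N = 0` reduces the regressor to the input samples alone, which corresponds to the FIR-like form of Eq. (5.9).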
Figure 5-1. Buffered MLP structure with input TDL
The easiest way to get dynamics from an MLP network is the use of external tapped delay
lines (TDL) [138], as shown in Fig. 5-1. This structure subsumes many traditional signal processing
structures, including FIR and IIR filters, and the gamma memory NN [139], in which the delay
operator used in conventional TDLs is replaced by a single-pole discrete-time filter. These
networks are universal approximators for dynamic systems [140], just as feedforward MLPs are
universal approximators for static mappings [141].
Concerning the previous general structure we can assert: 1) the
determination of the optimum filter order $(N, M)$ requires some a priori knowledge of the
statistics of the input signal; 2) filtering of highly non-stationary signals requires that the filter
free parameters ($w \in \mathbb{R}$) can vary fast enough to track the variation of the input
statistics. Moreover, if $\Phi$ in equations (5.9) and (5.10) is a linear function, there exists a large
number of methods for determining the free filter parameters (filter synthesis). A
family of adaptive algorithms, suitable for transversal filters, is derived from least-squares
error minimization [142].
5.2.2 Introduction of Support Vector Machine for nonlinear regression
The support vector machine (SVM), originally introduced by Vapnik (1985, 1988) in the framework of statistical
learning theory and structural risk minimization, overcomes
weak points of neural networks such as the existence of local minima. SVM solutions are characterized by convex
optimization problems. Despite many successful applications of SVM to classification and
regression problems, SVM requires solving a quadratic programming (QP) problem, that is,
optimizing a quadratic function over a polyhedron defined by linear equations and/or
inequalities, which is expensive in time and memory.
A modified version of SVM in a least squares (LS) sense was proposed for
classification by Suykens and Vandewalle (1999). In LS-SVM, the solution is given by a linear
system instead of a QP problem. Although the computational complexity
increases strongly with the number of training data, LS-SVM can be efficiently estimated
using iterative methods. The fact that LS-SVM has explicit primal-dual formulations has a
number of advantages.
SVM can be applied to both linear and nonlinear regression. In this study, we describe
SVM for nonlinear regression. Let the training data set $D$ consist of inputs $x_i$
and outputs $y_i$:
$$D = \{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in \mathbb{R}^{d},\; y_i \in \mathbb{R} \tag{5.11}$$
For nonlinear regression, the model takes the form
$$f(x) = w'\varphi(x) + \eta \tag{5.12}$$
where $\eta$ is a bias term. Here the feature mapping function $\varphi(\cdot): \mathbb{R}^{d} \to \mathbb{R}^{d_f}$ maps the
input space to a higher-dimensional feature space, where the dimension $d_f$ is defined in an
implicit way.
The optimization problem is defined with a regularization parameter $\gamma$ as
$$\min \; \frac{1}{2} w'w + \frac{\gamma}{2} \sum_{i=1}^{n} \xi_i^2 \tag{5.13}$$
over $(w, \eta, \xi)$, subject to the equality constraints
$$y_i = w'\varphi(x_i) + \eta + \xi_i, \quad i = 1, \ldots, n. \tag{5.14}$$
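The Lagrangian function can be constructed as
$$L(w, \eta, \xi; a) = \frac{1}{2} w'w + \frac{\gamma}{2} \sum_{i=1}^{n} \xi_i^2 - \sum_{i=1}^{n} a_i \big( w'\varphi(x_i) + \eta + \xi_i - y_i \big) \tag{5.15}$$
where the $a_i$ are the Lagrange multipliers. The conditions for optimality are given by
$$\begin{aligned}
\frac{\partial L}{\partial w} = 0 &\;\rightarrow\; w = \sum_{i=1}^{n} a_i \varphi(x_i) \\
\frac{\partial L}{\partial \eta} = 0 &\;\rightarrow\; \sum_{i=1}^{n} a_i = 0 \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\rightarrow\; a_i = \gamma \xi_i, \quad i = 1, \ldots, n \\
\frac{\partial L}{\partial a_i} = 0 &\;\rightarrow\; w'\varphi(x_i) + \eta + \xi_i - y_i = 0, \quad i = 1, \ldots, n
\end{aligned} \tag{5.16}$$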
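with solution
$$\begin{bmatrix} 0 & \mathbf{1}' \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} \eta \\ \mathbf{a} \end{bmatrix} = \begin{bmatrix} 0 \\ \mathbf{y} \end{bmatrix} \tag{5.17}$$
with
$$\begin{aligned}
\mathbf{y} &= (y_1, \ldots, y_n)' \\
\mathbf{1} &= (1, \ldots, 1)' \\
\mathbf{a} &= (a_1, \ldots, a_n)' \\
\Omega &= \{\Omega_{kl}\}, \quad \Omega_{kl} = \varphi(x_k)'\varphi(x_l) = K(x_k, x_l), \quad k, l = 1, \ldots, n
\end{aligned} \tag{5.18}$$
The kernel function $K(x_k, x_l)$ is obtained from the application of Mercer's conditions [143].
Several choices of the kernel function are possible. Solving the linear system (5.17) yields the optimal
bias $\hat{\eta}$ and Lagrange multipliers $\hat{a}_i$, and the optimal regression function for a
given $x$ is obtained as
$$\hat{f}(x) = \sum_{i=1}^{n} \hat{a}_i K(x_i, x) + \hat{\eta} \tag{5.19}$$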
Note that in the nonlinear setting, the optimization problem corresponds to finding the
flattest function in the feature space, not in the input space. A strong advantage of
SVM is that it performs particularly well for nonlinear regression models with several input
variables.
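Because LS-SVM training reduces to one linear system, the whole procedure of this section fits in a few lines. The sketch below builds the system (5.17) with an RBF kernel and evaluates the regression function (5.19). It is an illustration under stated assumptions: the function names, the toy sine data, and the values of the regularization parameter and kernel width are chosen here and are not taken from the study.

```python
import numpy as np


def rbf_kernel(A, B, sigma2):
    """K(a, b) = exp(-||a - b||^2 / sigma2) for all row pairs of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma2)


def lssvm_fit(X, y, gamma, sigma2):
    """Solve the linear system (5.17) for the bias eta and the multipliers a."""
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0                                        # row [0, 1']
    M[1:, 0] = 1.0                                        # column of ones
    M[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / gamma
    sol = np.linalg.solve(M, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]


def lssvm_predict(X_train, eta, a, X_new, sigma2):
    """Evaluate f(x) = sum_i a_i K(x_i, x) + eta, Eq. (5.19)."""
    return rbf_kernel(X_new, X_train, sigma2) @ a + eta


# Toy one-dimensional regression problem (hypothetical data).
X = np.linspace(-3.0, 3.0, 40)[:, None]
y = np.sin(X[:, 0])
eta, a = lssvm_fit(X, y, gamma=1000.0, sigma2=1.0)
X_test = np.array([[0.5], [-1.2]])
y_pred = lssvm_predict(X, eta, a, X_test, sigma2=1.0)
```

The first row of the system enforces the condition that the multipliers sum to zero, which is the second optimality condition in (5.16).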
5.3 Nonlinear Speech Production Modeling based on Support Vector Regression
Once the speech signal is embedded into a reconstructed phase space, the task of building a
nonlinear model turns into a function estimation problem to which least squares (LS) SVR can
be applied. Compared with other methods, SVR has several advantages. First, in a
nonlinear model for speech synthesis the system runs in autonomous mode, so good
generalization performance is needed; otherwise the system is often unstable, i.e. its
output is entirely different from what we want. A regularization term is included in SVR to
control the capacity of the function class and improve the generalization ability of the model.
Second, SVR is easy to use: only a few parameters need to be tuned, and no local minima
exist.
Taking the defects of source-filter theory into consideration, the speech signal itself,
instead of the excitation signal, is modeled. The structure of the model is as follows. The input
vector of the SVM is generated from the time series through a delay line. During the training
phase, the given signal is used as input, and in autonomous running mode the output is fed back (Fig. 5-2).
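The difference between the two operating modes can be shown with any one-step predictor. In the sketch below the predictor is a hypothetical linear stand-in for the trained SVR, and the function names are introduced here for illustration only.

```python
def one_step_predictions(model, signal, delay):
    """Training-style mode: every input vector is taken from the given signal."""
    return [model(signal[t - delay:t]) for t in range(delay, len(signal))]


def free_run(model, seed, n_steps):
    """Autonomous mode: each prediction is fed back into the delay line."""
    line = list(seed)
    out = []
    for _ in range(n_steps):
        y = model(line)
        out.append(y)
        line = line[1:] + [y]        # shift the delay line and append the new output
    return out


def toy_model(v):
    """Hypothetical stand-in for a trained predictor with a delay line of length 2."""
    return 0.5 * v[-1] + 0.25 * v[-2]


teacher_forced = one_step_predictions(toy_model, [1.0, 1.0, 2.0, 4.0], delay=2)
autonomous = free_run(toy_model, seed=[1.0, 1.0], n_steps=3)
```

In autonomous mode any prediction error re-enters the delay line, which is why a low one-step error does not by itself guarantee a stable free-running output.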
5.3.1 NARX using SVR Model
For the linear dynamical part, we will assume a model structure of the form:
$$y_k = \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j u_{k-j} + \xi_k \tag{5.20}$$
with $u_k, y_k \in \mathbb{R}$, $k \in \mathbb{N}$, input $u_k$ and output $y_k$ at discrete time index $k$, and
$\xi_k$ the so-called equation error, which is assumed to be white Gaussian noise. This
model structure is one of the best known model structures in linear identification. Adding a
static nonlinearity $f: \mathbb{R} \to \mathbb{R}: x \mapsto f(x)$ to equation (5.20) leads to:
$$y_k = \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j f(u_{k-j}) + \xi_k \tag{5.21}$$
Applying the LS-SVM function estimation outlined in the former section, we assume the
following structure for the static nonlinearity $f$:
$$f(u) = w'\varphi(u) + \eta \tag{5.22}$$
with
$$\Omega_{kl} = \varphi(u_k)'\varphi(u_l) = K(u_k, u_l), \quad k, l = 1, \ldots, n \tag{5.23}$$
a kernel of choice. Hence, equation (5.21) can be rewritten as follows:
$$y_k = \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j \big( w'\varphi(u_{k-j}) + \eta \big) + \xi_k \tag{5.24}$$
We focus on finding estimates for the linear parameters $a_i$, $i = 1, \ldots, n$, and $b_j$, $j = 1, \ldots, m$,
and the static nonlinearity $f$. The Lagrangian of the resulting estimation problem is given by
$$L(w, \eta, b, \xi, a; \alpha) = \frac{1}{2} w'w + \frac{\gamma}{2} \sum_{k=r}^{N} \xi_k^2 - \sum_{k=r}^{N} \alpha_k \Big( \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j \big( w'\varphi(u_{k-j}) + \eta \big) + \xi_k - y_k \Big) \tag{5.25}$$
with $r = \max(m, n) + 1$. The conditions for optimality are as follows:
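$$\begin{aligned}
\frac{\partial L}{\partial w} = 0 &\;\rightarrow\; w = \sum_{k=r}^{N} \sum_{j=1}^{m} \alpha_k b_j \varphi(u_{k-j}) \\
\frac{\partial L}{\partial \eta} = 0 &\;\rightarrow\; \sum_{k=r}^{N} \sum_{j=1}^{m} \alpha_k b_j = 0 \\
\frac{\partial L}{\partial a_i} = 0 &\;\rightarrow\; \sum_{k=r}^{N} \alpha_k y_{k-i} = 0, \quad i = 1, \ldots, n \\
\frac{\partial L}{\partial b_j} = 0 &\;\rightarrow\; \sum_{k=r}^{N} \alpha_k \big( w'\varphi(u_{k-j}) + \eta \big) = 0, \quad j = 1, \ldots, m \\
\frac{\partial L}{\partial \xi_k} = 0 &\;\rightarrow\; \alpha_k = \gamma \xi_k, \quad k = r, \ldots, N \\
\frac{\partial L}{\partial \alpha_k} = 0 &\;\rightarrow\; \sum_{i=1}^{n} a_i y_{k-i} + \xi_k - y_k + \sum_{j=1}^{m} b_j \big( w'\varphi(u_{k-j}) + \eta \big) = 0, \quad k = r, \ldots, N
\end{aligned} \tag{5.26}$$
Substituting the conditions $\partial L/\partial w = 0$ and $\partial L/\partial \xi_k = 0$ into the last condition of (5.26) leads to:
$$\sum_{j=1}^{m} b_j \Big( \sum_{q=r}^{N} \sum_{p=1}^{m} b_p \alpha_q \varphi(u_{q-p})'\varphi(u_{k-j}) + \eta \Big) + \sum_{i=1}^{n} a_i y_{k-i} + \xi_k - y_k = 0, \quad k = r, \ldots, N \tag{5.27}$$
where $\xi_k = \alpha_k / \gamma$.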
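If the $b_j$ values were known, the resulting problem would be linear in the unknowns and
easy to solve through:
$$\begin{bmatrix} 0 & 0 & \tilde{b}\,\mathbf{1}_{N-r+1}' \\ 0 & 0 & Y \\ \tilde{b}\,\mathbf{1}_{N-r+1} & Y' & \tilde{K} + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} \eta \\ \mathbf{a} \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \mathbf{y} \end{bmatrix} \tag{5.28}$$
with
$$\begin{aligned}
\alpha &= [\alpha_r \;\ldots\; \alpha_N]' \\
\tilde{b} &= \sum_{j=1}^{m} b_j \\
\mathbf{a} &= [a_1 \;\ldots\; a_n]' \\
Y &= \begin{bmatrix} y_{r-1} & y_r & \cdots & y_{N-1} \\ y_{r-2} & y_{r-1} & \cdots & y_{N-2} \\ \vdots & \vdots & & \vdots \\ y_{r-n} & y_{r-n+1} & \cdots & y_{N-n} \end{bmatrix} \\
\tilde{K}_{p,q} &= \sum_{j=1}^{m} \sum_{l=1}^{m} b_j b_l\, \Omega_{p+r-j,\; q+r-l} \\
\mathbf{y} &= [y_r \;\ldots\; y_N]'
\end{aligned} \tag{5.29}$$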
Since the $b_j$ are in general not known and the solution to the resulting third-order
estimation problem (5.27) is by no means trivial, we will use an approximate method to obtain
models of the form (5.21).
To avoid having to solve the problem (5.27), we propose to rewrite (5.24) as follows:
$$y_k = \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} w_j'\varphi(u_{k-j}) + d + \xi_k \tag{5.30}$$
which can conveniently be solved using LS-SVMs. Note, however, that the resulting model
class is wider than (5.24) due to the replacement of one $w$ by several $w_j$, $j = 1, \ldots, m$.
Taking all of the above into account, the optimization problem that is ultimately solved is the
following, with $\xi = (\xi_r, \ldots, \xi_N)$:
$$\min_{w_j, a, d, \xi} \; J(w_j, \xi) = \frac{1}{2} \sum_{j=1}^{m} w_j'w_j + \frac{\gamma}{2} \sum_{k=r}^{N} \xi_k^2 \tag{5.31}$$
subject to
$$\sum_{j=1}^{m} w_j'\varphi(u_{k-j}) + \sum_{i=1}^{n} a_i y_{k-i} + d + \xi_k - y_k = 0, \quad k = r, \ldots, N \tag{5.32}$$
$$\sum_{k=1}^{N} w_j'\varphi(u_k) = 0, \quad j = 1, \ldots, m \tag{5.33}$$
Note the extra constraints (5.33), which center the nonlinear functions $w_j'\varphi(\cdot)$, $j = 1, \ldots, m$, around
their average over the training set. This removes the uncertainty resulting from the fact that
any set of constants can be added to the terms of an additive nonlinear function, as long as the
sum of the constants is zero. Removing this uncertainty will facilitate the extraction of the
parameters $b_j$ in (5.21) later. Furthermore, this constraint enables us to give a clear meaning
to the bias parameter $d$, namely
$$d = \sum_{j=1}^{m} b_j \Big( \frac{1}{N} \sum_{k=1}^{N} f(u_k) \Big) \tag{5.34}$$
The resulting Lagrangian is:
$$L(w_j, d, a, \xi; \alpha, \beta) = J(w_j, \xi) - \sum_{k=r}^{N} \alpha_k \Big( \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} w_j'\varphi(u_{k-j}) + d + \xi_k - y_k \Big) - \sum_{j=1}^{m} \beta_j \sum_{l=1}^{N} w_j'\varphi(u_l) \tag{5.35}$$
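The conditions for optimality are:
$$\begin{aligned}
\frac{\partial L}{\partial w_j} = 0 &\;\rightarrow\; w_j = \sum_{k=r}^{N} \alpha_k \varphi(u_{k-j}) + \beta_j \sum_{k=1}^{N} \varphi(u_k), \quad j = 1, \ldots, m \\
\frac{\partial L}{\partial a_i} = 0 &\;\rightarrow\; \sum_{k=r}^{N} \alpha_k y_{k-i} = 0, \quad i = 1, \ldots, n \\
\frac{\partial L}{\partial d} = 0 &\;\rightarrow\; \sum_{k=r}^{N} \alpha_k = 0 \\
\frac{\partial L}{\partial \xi_k} = 0 &\;\rightarrow\; \alpha_k = \gamma \xi_k, \quad k = r, \ldots, N \\
\frac{\partial L}{\partial \alpha_k} = 0 &\;\rightarrow\; \sum_{j=1}^{m} w_j'\varphi(u_{k-j}) + \sum_{i=1}^{n} a_i y_{k-i} + d + \xi_k - y_k = 0, \quad k = r, \ldots, N \\
\frac{\partial L}{\partial \beta_j} = 0 &\;\rightarrow\; \sum_{k=1}^{N} w_j'\varphi(u_k) = 0, \quad j = 1, \ldots, m
\end{aligned} \tag{5.36}$$
with solution:
$$\begin{bmatrix} 0 & 0 & \mathbf{1}' & 0 \\ 0 & 0 & Y & 0 \\ \mathbf{1} & Y' & K + \gamma^{-1} I & K^{0} \\ 0 & 0 & K^{0\prime} & (\mathbf{1}_N' \Omega \mathbf{1}_N) \cdot I_m \end{bmatrix} \begin{bmatrix} d \\ \mathbf{a} \\ \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \mathbf{y} \\ 0 \end{bmatrix} \tag{5.37}$$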
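where
$$K_{p,q} = \sum_{j=1}^{m} \Omega_{p+r-j,\; q+r-j} \tag{5.38}$$
$$K^{0}_{p,q} = \sum_{k=1}^{N} \Omega_{k,\; p+r-q} \tag{5.39}$$
The projection of the obtained model onto (5.21) goes as follows. Estimates for the
autoregressive parameters $a_i$, $i = 1, \ldots, n$, are directly obtained from (5.37). Furthermore, for
the training input sequence $[u_1 \;\ldots\; u_N]$, we have:
$$\begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \begin{bmatrix} \hat{\bar{f}}(u_1) & \cdots & \hat{\bar{f}}(u_N) \end{bmatrix} = \begin{bmatrix} \alpha_N & \cdots & \alpha_r & & 0 \\ & \ddots & & \ddots & \\ 0 & & \alpha_N & \cdots & \alpha_r \end{bmatrix} \times \begin{bmatrix} \Omega_{N-1,1} & \Omega_{N-1,2} & \cdots & \Omega_{N-1,N} \\ \Omega_{N-2,1} & \Omega_{N-2,2} & \cdots & \Omega_{N-2,N} \\ \vdots & \vdots & & \vdots \\ \Omega_{r-m,1} & \Omega_{r-m,2} & \cdots & \Omega_{r-m,N} \end{bmatrix} + \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_m \end{bmatrix} \sum_{k=1}^{N} \begin{bmatrix} \Omega_{k,1} & \cdots & \Omega_{k,N} \end{bmatrix} \tag{5.40}$$
with $\hat{\bar{f}}(u)$ an estimate for
$$\bar{f}(u) = f(u) - \frac{1}{N} \sum_{k=1}^{N} f(u_k) \tag{5.41}$$
Hence, estimates for $b_j$ and the static nonlinearity $f$ can be obtained from a rank-1
decomposition of the right-hand side of (5.40), for instance using a singular value
decomposition. Once all the $b_j$ are known, $\sum_{k=1}^{N} f(u_k)$ can be obtained as
$$\sum_{k=1}^{N} f(u_k) = \frac{N d}{\sum_{j=1}^{m} b_j}$$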
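The rank-1 decomposition mentioned above can be sketched with a singular value decomposition. A rank-1 matrix only determines its two factors up to a common scalar, so the sketch fixes the scale by setting the first coefficient to one; that normalization, the function name, and the toy factors are assumptions made here for illustration.

```python
import numpy as np


def rank1_factors(M):
    """Best rank-1 factorisation M ~ outer(b, f) from the leading singular triplet."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    b = U[:, 0] * s[0]
    f = Vt[0, :]
    return b, f


# Toy right-hand side built from known factors (hypothetical values).
b_true = np.array([1.0, -0.5, 0.25])
f_true = np.tanh(np.linspace(-2.0, 2.0, 9))
M = np.outer(b_true, f_true)

b_est, f_est = rank1_factors(M)
# Remove the scalar ambiguity by normalising the first coefficient to 1 (an assumption).
scale = b_est[0]
b_norm, f_norm = b_est / scale, f_est * scale
```

With noisy estimates the matrix is only approximately rank 1, and the leading singular triplet then gives the closest rank-1 fit in the least-squares sense.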
Figure 5-2. A scheme of NARX using SVR
5.3.2 Optimum parameter Selection
A kernel function is a function that corresponds to an inner product in some feature
space. There are many possibilities for kernel selection beyond the basic choices.
However, even though many kernel functions satisfy the positive semi-definite,
symmetric condition of Mercer's theorem, the general radial basis function (RBF) kernel
is a reasonable choice. Roughly speaking, kernels with a larger number of
hyperparameters give a richer representation but more opportunities for overfitting. Therefore,
the RBF kernel with parameters $(\sigma^2, \Upsilon)$ was used in this study, as shown in equation (5.42),
without kernel model selection. It is important to select good $(\sigma^2, \Upsilon)$ to achieve high training
accuracy, because $(\sigma^2, \Upsilon)$ are the only parameters of the RBF kernel model; furthermore,
poorly chosen $(\sigma^2, \Upsilon)$ may lead to an overfitted model, even though it is generally believed that
the SVM optimizes the generalization error and outperforms other learning machines.
A grid search finds the pair $(\sigma^2, \Upsilon)$ within a limited range;
exponentially growing sequences of $\sigma^2 > 0$ and $\Upsilon > 0$ were tried because this is simple and
effective, even though more advanced search methods exist (Fig. 5-3).
$$K(x_i, x_j) = \exp\!\left( -\frac{\|x_i - x_j\|^2}{\sigma^2} \right) \tag{5.42}$$
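A grid search over exponentially growing sequences can be sketched as below. Several points are assumptions made for this illustration rather than details from the study: $\Upsilon$ is taken to be the LS-SVM regularization parameter, a simple hold-out split is used to score each pair, and the grid ranges, function names, and noisy sine data are hypothetical.

```python
import numpy as np


def rbf(A, B, sigma2):
    """RBF kernel of Eq. (5.42) for all row pairs of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma2)


def fit_predict(Xtr, ytr, Xva, reg, sigma2):
    """Train an LS-SVM regressor on (Xtr, ytr) and predict on Xva."""
    n = len(ytr)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = rbf(Xtr, Xtr, sigma2) + np.eye(n) / reg
    sol = np.linalg.solve(M, np.concatenate(([0.0], ytr)))
    return rbf(Xva, Xtr, sigma2) @ sol[1:] + sol[0]


def grid_search(Xtr, ytr, Xva, yva, sigma2_grid, reg_grid):
    """Return (best_mse, best_sigma2, best_reg) over the Cartesian grid."""
    best = (np.inf, None, None)
    for s2 in sigma2_grid:
        for reg in reg_grid:
            mse = np.mean((fit_predict(Xtr, ytr, Xva, reg, s2) - yva) ** 2)
            if mse < best[0]:
                best = (mse, s2, reg)
    return best


# Noisy sine data split into training (even) and validation (odd) samples.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 60)[:, None]
y = np.sin(x[:, 0]) + 0.05 * rng.standard_normal(60)
sigma2_grid = 2.0 ** np.arange(-4, 5)      # 1/16, ..., 16
reg_grid = 2.0 ** np.arange(0, 11)         # 1, ..., 1024
best_mse, best_sigma2, best_reg = grid_search(
    x[0::2], y[0::2], x[1::2], y[1::2], sigma2_grid, reg_grid)
```

A coarse exponential grid like this can be followed by a finer grid around the best pair if more precision is needed.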
Figure 5-3. 2D-plot of selection of optimum $\sigma^2$ and $\Upsilon$ for phonation |i| of male group
Table 5-1. Results of optimized $\sigma^2$, $\Upsilon$ and mean square error of RBF kernel in phonation |a|, |e|, |i|, |o|, |u| for both sexes

Sex     Phonation   σ² Mean   σ² S.D.   Υ Mean   Υ S.D.   MSE Mean       MSE S.D.
male    |a|         17.31     2.26      791.40   21.91    1.26 × 10^-3   0.14 × 10^-3
male    |e|         22.08     3.37      802.07   18.04    1.28 × 10^-3   0.13 × 10^-3
male    |i|         26.08     4.70      814.64   24.99    2.69 × 10^-3   0.15 × 10^-3
male    |o|         23.53     4.08      814.69   27.61    3.17 × 10^-3   0.17 × 10^-3
male    |u|         28.41     4.18      810.09   21.25    3.62 × 10^-3   0.20 × 10^-3
female  |a|         27.04     4.81      827.66   21.86    3.83 × 10^-3   0.24 × 10^-3
female  |e|         28.18     4.63      829.35   21.16    4.23 × 10^-3   0.19 × 10^-3
female  |i|         37.65     4.19      828.65   27.75    6.11 × 10^-3   0.36 × 10^-3
female  |o|         30.49     4.19      837.39   35.61    5.51 × 10^-3   0.40 × 10^-3
female  |u|         30.19     4.84      841.39   37.00    5.06 × 10^-3   0.48 × 10^-3
5.4 Evaluation of NARX using SVR Model
As shown in Fig. 5-4, the NARX model using SVR successfully predicts the modified voiced
sound. The reconstructed speech is similar to the original in the time domain and in the low-frequency part of the
frequency domain. Moreover, shimmer and jitter are preserved in the reconstructed signal. Unlike /a:/,
reconstruction of /i:/ fails: the model predicts slightly wrong values. Models trained under 50 sets of
parameters cannot output a stable /i:/ in autonomous mode, although the MSE of one-step prediction
is quite low. The original speech and a typical output for |i| phonation are shown in Fig. 5-5.
Figure 5-4. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |a| of male
Figure 5-5. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |i| of male
5.4.1 Multi-band Model
It is well known from linear and adaptive filter theory that subband techniques offer
several advantages over the full-band approach [144-146]. First of all, they achieve
computational efficiency by decimating the signal before the adaptive processing. In fact, the
subband linear adaptive filters have impulse responses that are shorter than that of the full-band
adaptive filter, although the total number of free parameters remains the same.
A second interesting property is due to the splitting of the input signal: the eigenvalue
spread of the subband signals' autocorrelation matrix is reduced, and consequently
least-squares-like adaptation algorithms show better convergence performance [147].
It is well known, in fact, that the long training required can hinder many real-time SVM
applications. Since smaller models are needed for each subband, the multirate approach
has been used in an on-line (or continuous learning) mode as a simple nonlinear adaptive
filter [148].
An important topic in multirate signal processing is the choice of the filter banks.
Filter banks decompose the full-band signal spectrum into a number of directly adjacent
frequency subbands and recombine the signal spectra by the use of low-pass, band-pass, and
high-pass filters. In the last two decades several techniques and topologies for the
design of filter banks have been proposed. Key design choices concern perfect vs.
almost perfect reconstruction and uniform vs. non-uniform bands [145, 146]. Our
proposed multiband SVR model with a wavelet filterbank is illustrated in Fig. 5-6. The performance
of this model is presented in the next section. The multiband model successfully
predicts voiced sounds, as shown in Fig. 5-7.
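The band-splitting and recombination stages of a multiband model can be sketched with a three-level filterbank producing the bands D1, D2, D3 and A3. The wavelet family used in the study is not stated in this section, so the Haar wavelet is an assumption made here for brevity, as are the function names and the test signal; the per-band predictors are omitted and the bands are passed through unchanged.

```python
import numpy as np


def haar_step(x):
    """One analysis stage: approximation and detail at half the rate."""
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)


def haar_inverse_step(approx, detail):
    """One synthesis stage, the exact inverse of haar_step."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x


def analyse(x, levels=3):
    """Split x into detail bands D1..D_levels and the approximation A_levels."""
    details, approx = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return details, approx


def synthesise(details, approx):
    """Recombine the bands; each band could first be replaced by a model output."""
    for d in reversed(details):
        approx = haar_inverse_step(approx, d)
    return approx


# The signal length must be divisible by 2**levels for this simple sketch.
n = np.arange(64)
x = np.sin(2 * np.pi * n / 16.0) + 0.2 * np.sin(2 * np.pi * n / 3.0)
details, a3 = analyse(x, levels=3)
x_rec = synthesise(details, a3)
```

Because the filterbank is perfectly reconstructing, any error in the recombined output comes from the per-band models rather than from the decomposition itself.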
Figure 5-6. Multiband SVR Model with wavelet filterbank
- 118 -
Figure 5-7. Plots of synthesized versus original signal (time delay = 50): (a) original speech signal, (b) EGG signal, (c) D1 original (left) and synthesized (right) signal, (d) D2 original (left) and synthesized (right) signal, (e) D3 original (left) and synthesized (right) signal, (f) A3 original (left) and synthesized (right) signal, synthesized + original speech
signal; phonation |i| of male
Figure 5-8. Comparison of spectrogram between original speech signal and synthesized speech signal
5.5 Experimental Results
Table 5-2 presents the jitter(%) results for synthesized and postoperative sounds in phonation
|a|, |e|, |i|, |o|, |u| for both sexes. The excessive jitter(%) values of the preoperative
vowels (see Table B-3 in APPENDIX B) decrease to below 1% in the synthesized vowels for both
sexes, and show no statistically significant difference from the postoperative vowels (except
|a| of the male group, and |i|, |o| and |u| of the female group). Some discrepancy is to be
expected, because the jitter(%) of |e| and |i| in the male group, and of |a| and |i| in the
female group, is still over 1% after surgery. An LPC synthesizer, however, cannot reproduce
jitter(%) as this model does.
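For reference, the sketch below computes jitter(%) from a sequence of pitch periods using the commonly used definition: the mean absolute difference between consecutive periods, divided by the mean period, times 100. This definition and the function name are assumptions; the exact formula used for the tables is given in Chapter 3 and may differ in detail.

```python
def jitter_percent(periods):
    """Jitter(%) = 100 * mean(|T[i+1] - T[i]|) / mean(T).

    `periods` is a list of consecutive pitch periods (any time unit).
    """
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(periods[i + 1] - periods[i]) for i in range(len(periods) - 1)]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return 100.0 * mean_abs_diff / mean_period

# Hypothetical periods in milliseconds: a steady voice and a perturbed one.
steady = [8.0, 8.0, 8.0, 8.0]
perturbed = [10.0, 10.2, 10.0, 10.2]
steady_jitter = jitter_percent(steady)
perturbed_jitter = jitter_percent(perturbed)
```

A perfectly periodic source, such as the impulse-train excitation of a basic LPC synthesizer, gives a value of zero under this definition.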
Table 5-3 shows the results of the Lyapunov exponents for synthesized and postoperative
voiced sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes, compared using the Wilcoxon
rank-sum test. There is no difference between synthesized and postoperative voiced sounds in
any phonation for either sex. This result suggests that the synthesized vowel resembles the
postoperative vowel in terms of dynamic aperiodicity.
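The group comparisons above rely on a rank-sum test. As a minimal sketch of how such a test works, the code below pools two samples, assigns average ranks to ties, and converts the rank sum of the first sample into a two-sided p-value with the normal approximation. The approximation, the absence of a tie correction in the variance, and the function name are assumptions; the exact variant used for Tables 5-2 and 5-3 is not stated here.

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (W, z, p): the rank sum of sample x, its z-score, and the
    two-sided p-value.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return w, z, p

# Hypothetical samples: fully separated groups and identical groups.
w_sep, z_sep, p_sep = rank_sum_test([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
w_same, z_same, p_same = rank_sum_test([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

For small groups an exact distribution of W would normally be preferred over the normal approximation shown here.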
Table 5-2. Results of jitter(%) between synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes
Sex     Phonation  Synthesized Mean  Synthesized S.D.  Postoperative Mean  Postoperative S.D.  p-value
male    |a|        0.76              0.21              0.71                0.54                .021
male    |e|        0.78              0.34              1.15                0.84                .047
male    |i|        0.86              0.37              1.48                1.11                .106
male    |o|        0.66              0.20              0.54                0.23                .032
male    |u|        0.78              0.24              0.71                0.40                <.001
female  |a|        0.83              0.41              1.97                3.22                .044
female  |e|        0.74              0.37              0.99                0.60                .021
female  |i|        0.89              0.31              1.88                0.77                .157
female  |o|        0.49              0.35              0.74                0.22                .076
female  |u|        0.61              0.38              1.08                0.44                .224
Table 5-3. Results of Lyapunov exponents between synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes
Sex     Phonation  Synthesized Mean  Synthesized S.D.  Postoperative Mean  Postoperative S.D.  p-value
male    |a|        0.435             0.058             0.474               0.049               .001
male    |e|        0.475             0.091             0.493               0.089               <.001
male    |i|        0.520             0.098             0.517               0.118               .008
male    |o|        0.559             0.148             0.523               0.133               .004
male    |u|        0.565             0.153             0.534               0.183               .003
female  |a|        0.594             0.140             0.574               0.129               .003
female  |e|        0.620             0.144             0.600               0.149               .007
female  |i|        0.661             0.173             0.604               0.218               .005
female  |o|        0.659             0.140             0.634               0.117               .014
female  |u|        0.614             0.151             0.651               0.140               .004
5.6 Summary
In this chapter, our proposed NARX model based on LS-SVR was introduced and tested on
enhanced preoperative voiced sounds in order to produce natural sounds. This nonlinear
synthesizer faithfully reproduces voiced sounds and also preserves naturalness cues such as
jitter and shimmer, whereas LPC does not preserve them. However, the results for some
phonations differ considerably from the original sounds. We attribute this to the fact that
the single-band model cannot adequately control and decompose the high-frequency components.
Therefore, a multiband model with a wavelet filterbank was adopted in place of the single-band
model. As a result, the multiband model shows improved stability. Finally, nonlinear speech
modeling using NARX based on LS-SVR can successfully reconstruct the modified preoperative
sounds so that they are close to the postoperative voiced sounds, according to the jitter,
shimmer, and Lyapunov exponent analyses.
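To make the NARX/LS-SVR combination concrete, the sketch below builds NARX regressors from past outputs and past exogenous inputs and fits a least-squares support vector regressor by solving the standard LS-SVM linear system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]. This is the generic textbook formulation, not necessarily the configuration used in this chapter: the lags, the RBF kernel, the values of gamma and sigma, the toy signals, and every function name are assumptions.

```python
import math

def rbf_kernel(u, v, sigma):
    """Gaussian (RBF) kernel between two regressor vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def narx_regressors(y, u, ny=2, nu=1):
    """Pairs ([y[n-1..n-ny], u[n-1..n-nu]], y[n]) for a NARX model."""
    start = max(ny, nu)
    xs, ts = [], []
    for n in range(start, len(y)):
        past_y = [y[n - k] for k in range(1, ny + 1)]
        past_u = [u[n - k] for k in range(1, nu + 1)]
        xs.append(past_y + past_u)
        ts.append(y[n])
    return xs, ts

def fit_ls_svr(xs, ts, gamma, sigma):
    """Return (bias, alpha) of the LS-SVR dual solution."""
    n = len(ts)
    a = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0]
        for j in range(n):
            k_ij = rbf_kernel(xs[i], xs[j], sigma)
            row.append(k_ij + (1.0 / gamma if i == j else 0.0))
        a.append(row)
    solution = solve_linear(a, [0.0] + list(ts))
    return solution[0], solution[1:]

def predict_ls_svr(x, xs, bias, alpha, sigma):
    """Evaluate f(x) = b + sum_i alpha_i K(x, x_i)."""
    return bias + sum(a * rbf_kernel(x, xi, sigma) for a, xi in zip(alpha, xs))

# Toy stand-ins for an exogenous input u and a speech-like output y.
GAMMA, SIGMA = 100.0, 1.0
u = [math.sin(0.3 * n) for n in range(40)]
y = [0.8 * math.sin(0.3 * n + 0.5) for n in range(40)]
xs, ts = narx_regressors(y, u, ny=2, nu=1)
bias, alpha = fit_ls_svr(xs, ts, GAMMA, SIGMA)
```

Because training reduces to one linear solve, its cost grows quickly with the number of samples, which is one reason shorter decimated subband signals are attractive in the multiband variant.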
CHAPTER 6
Conclusion
In this dissertation, the design and implementation of a method for estimating the
postoperative vowel is presented, using nonlinear speech modeling based on NARX and guided by
the acoustic and electroglottographic analysis of preoperative and postoperative vowels. This
approach has not yet been validated widely, but it suggests a reasonable solution for the
estimation of the postoperative vowel. Preoperative vowels of patients with benign vocal fold
lesions (BVFL) usually exhibit perceptual aperiodicity due to physical changes of the vocal
cords caused by the lesions. Jitter, shimmer, and aspiration noise are mainly caused by loss
of control of the vocal cords and by their irregular vibration pattern.
Therefore, we started from the hypothesis that reducing the jitter, shimmer, and noise of the
preoperative vowel can bring its perceptual quality close to that of the postoperative vowel.
Our findings are summarized below:
1. Established PDAs perform poorly in pitch detection for pathological voices because the
increase of sub-harmonics and high-frequency components causes pitch halving
and doubling errors; a PDA based on the FOS algorithm may resolve
these errors, especially for the Type-2 signals suggested by Titze [].
2. Clear differences in the acoustic and electroglottographic statistics between preoperative
and postoperative vowels were found, even though some measures remain nearly constant
for certain combinations of sex, phonation, and measure.
3. Modification of the preoperative vowel based on the acoustic and electroglottographic
analysis can make it resemble the postoperative vowel in the spectral and dynamic
domains.
4. Nonlinear speech modeling using NARX-based SVR performed better
than LPC in the perceptual quality of voiced sounds. We attribute this result to the fact
that natural jitter, shimmer, and noise are preserved, whereas LPC produces artificial
sounds that lack naturalness.
Over the past several decades, speech signal processing has undergone significant
improvements along with progress in the areas of digital signal processing, pattern
recognition, and artificial intelligence. However, modeling the natural characteristics of
speech production and measuring its quality are far from being completely solved. Of
course, it is difficult to identify and resolve the factors related to voice, especially
pathological voice. Although research on pathological voice has had limited interaction with
other areas compared to other speech processing techniques, this study remains valuable
because this area may be a key to solving some bottleneck problems encountered
in current speech signal processing. Finally, we recognize that much research remains to be
conducted to address the unsolved problems, such as real-time implementation and
performance techniques.
Some interesting research topics are as follows: neo-PSOLA for controlling jitter and
shimmer, perceptual multiband decomposition and reconstruction, and finding perceptual
parameters directly relevant to pathological voices.
APPENDIX A
Detailed results of the performance of the PDAs in Chapter 2 are reported below, grouped by the following criteria:
Age, in intervals of two decades (infants excluded): Tables A-1 to A-5
Sex: Tables A-6 and A-7
Phonation type (|a|, |e|, |i|, |o|, |u|): Tables A-8 to A-12
Table A-1. Results of performance of the available PDAs in normal & age range of 10's and 20's database (N = 120 for Normal & Age 10~29) (unit: percentage %)
PDAs  G Mean   G S.D.   X2 Mean  X2 S.D.  /2 Mean  /2 S.D.  F Mean   F S.D.   S Mean   S S.D.
AC    1.06     1.29     0.09     0.37     0.97     1.25     1.24     0.71     2.24     2.85
AMDF  7.85     10.64    0.14     0.77     7.71     10.67    6.54     7.82     11.41    10.79
YIN   1.40     1.59     0.53     0.82     0.87     1.38     13.99    0.47     2.52     1.53
CEP   3.10     3.94     1.50     3.93     1.60     1.78     3.76     6.13     8.33     11.63
SIFT  2.37     2.83     0.69     2.16     1.69     1.95     4.96     2.51     6.79     9.53
WAV   4.85     2.11     0.00     0.05     4.85     2.11     0.40     0.76     0.44     0.83
PS    1.71     1.90     0.68     1.60     1.03     1.48     1.91     1.84     5.53     6.00
FOS   0.99     2.42     0.00     0.00     0.99     2.42     2.97     1.67     3.94     4.68
Table A-2. Results of performance of the available PDAs in normal & age range of 30's and 40's database (N = 30 for Normal & Age 30~40) (unit: percentage %)
PDAs  G Mean   G S.D.   X2 Mean  X2 S.D.  /2 Mean  /2 S.D.  F Mean   F S.D.   S Mean   S S.D.
AC    1.13     1.44     0.31     0.65     0.82     1.12     1.59     1.13     4.07     6.51
AMDF  7.09     11.24    0.18     0.96     6.91     11.31    5.59     6.89     10.24    9.83
YIN   1.54     1.60     0.54     0.84     1.00     1.35     14.12    0.75     3.08     4.23
CEP   2.46     2.24     0.50     1.24     1.96     2.02     2.40     1.74     6.19     7.42
SIFT  2.71     2.34     0.55     1.67     2.16     2.09     4.32     1.82     4.57     6.83
WAV   5.03     2.07     0.02     0.12     5.00     2.11     0.36     0.77     0.42     0.79
PS    1.55     1.60     0.41     1.02     1.13     1.09     1.85     2.05     5.13     7.35
FOS   0.79     1.55     0.13     0.73     0.65     1.43     2.84     1.34     4.26     4.37
* Results of performance of the available PDAs in normal age range of 50's and 60's database
(N/A)
Table A-3. Results of performance of the available PDAs in BVFL & age range of 10's and 20's database (N = 65 for BVFL & Age 10~29) (unit: percentage %)
PDAs  G Mean   G S.D.   X2 Mean  X2 S.D.  /2 Mean  /2 S.D.  F Mean   F S.D.   S Mean   S S.D.
AC    2.00     3.41     0.57     2.81     1.43     1.71     1.91     2.86     4.47     9.27
AMDF  13.98    18.31    3.95     16.08    10.03    12.26    11.23    10.93    13.73    10.26
YIN   2.51     3.56     0.35     0.79     2.15     3.49     14.23    1.78     3.34     3.57
CEP   3.36     3.51     1.29     2.32     2.07     2.53     3.39     3.44     9.68     13.02
SIFT  3.82     4.23     0.76     2.26     3.06     3.67     4.96     3.38     6.80     7.61
WAV   6.17     1.99     0.01     0.06     6.16     1.98     0.63     1.25     0.57     1.12
PS    3.07     3.59     0.79     1.32     2.28     3.26     3.49     4.01     9.50     11.50
FOS   1.17     2.42     0.57     1.82     0.60     1.38     3.03     3.32     5.39     9.85
Table A-4. Results of performance of the available PDAs in BVFL & age range of 30's and 40's database (N = 280 for BVFL & Age 30~49) (unit: percentage %)
PDAs  G Mean   G S.D.   X2 Mean  X2 S.D.  /2 Mean  /2 S.D.  F Mean   F S.D.   S Mean   S S.D.
AC    2.90     5.27     0.41     2.55     2.49     4.40     2.24     4.27     4.55     7.36
AMDF  8.36     12.08    0.55     1.76     7.81     12.14    7.27     8.75     11.37    10.25
YIN   4.06     7.06     0.70     2.03     3.36     6.48     14.79    3.85     4.57     6.66
CEP   6.24     9.12     2.62     6.47     3.63     4.64     5.61     9.34     10.97    12.82
SIFT  6.01     10.28    0.87     2.46     5.14     9.31     6.56     7.96     8.89     10.14
WAV   6.24     1.93     0.01     0.09     6.23     1.94     0.84     1.46     0.53     0.92
PS    4.64     7.76     1.54     5.87     3.11     4.67     3.97     7.04     8.05     8.78
FOS   0.77     1.96     0.26     1.26     0.50     1.47     2.63     2.30     3.89     6.83
Table A-5. Results of performance of the available PDAs in BVFL & age range of 50's and 60's database (N = 150 for BVFL & Age 50~69) (unit: percentage %)
PDAs  G Mean   G S.D.   X2 Mean  X2 S.D.  /2 Mean  /2 S.D.  F Mean   F S.D.   S Mean   S S.D.
AC    4.29     7.75     0.34     1.20     3.96     7.66     2.07     2.75     4.36     5.87
AMDF  6.89     10.82    0.74     2.01     6.15     10.96    6.39     8.36     10.29    9.82
YIN   4.94     7.80     0.73     2.35     4.21     6.77     14.90    3.45     5.18     5.00
CEP   6.64     10.43    2.73     8.95     3.91     4.81     6.39     17.24    10.73    13.92
SIFT  5.99     9.33     1.24     4.37     4.75     6.91     6.74     11.22    9.73     15.68
WAV   6.37     1.84     0.02     0.09     6.35     1.83     0.90     1.35     0.68     1.00
PS    5.48     8.03     0.97     2.42     4.50     7.20     4.05     6.58     7.74     7.11
FOS   0.80     2.00     0.38     1.36     0.42     1.56     2.73     2.32     3.76     5.48
Table A-6. Results of performance of the available PDAs in database of normal and BVFL (Male and without Age)
Measure  PDAs  Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC    1.03         1.40         2.62       4.89       <.001
G        AMDF  0.91         2.70         3.08       10.35      .008
G        YIN   1.30         1.44         3.48       5.98       <.001
G        CEP   4.41         5.25         8.69       12.39      <.001
G        SIFT  3.09         3.51         6.36       11.65      .001
G        WAV   4.54         2.38         5.97       2.37       <.001
G        PS    2.05         2.33         4.80       8.67       <.001
G        FOS   0.07         0.54         0.48       1.63       .003
X2       AC    0.31         0.65         0.73       3.44       .105
X2       AMDF  0.29         1.07         2.27       9.51       .004
X2       YIN   0.67         0.82         1.01       2.86       .142
X2       CEP   3.27         5.33         5.13       10.15      .069
X2       SIFT  1.53         3.12         1.87       4.63       .527
X2       WAV   0.02         0.12         0.03       0.14       .663
X2       PS    1.25         2.21         2.44       7.04       .041
X2       FOS   0.07         0.54         0.40       1.57       .014
/2       AC    0.72         1.17         1.89       3.12       <.001
/2       AMDF  0.62         2.48         0.81       4.40       .685
/2       YIN   0.63         1.16         2.47       4.50       <.001
/2       CEP   1.14         1.60         3.56       4.33       <.001
/2       SIFT  1.56         1.81         4.49       8.99       <.001
/2       WAV   4.52         2.39         5.94       2.36       <.001
/2       PS    0.81         1.19         2.36       3.54       <.001
/2       FOS   0.00         0.00         0.07       0.45       .019
F        AC    1.08         1.03         1.91       3.43       .003
F        AMDF  1.58         2.88         3.65       6.10       <.001
F        YIN   13.96        0.59         14.28      2.49       .098
F        CEP   5.75         8.62         8.99       17.42      .058
F        SIFT  4.97         3.61         6.90       12.05      .050
F        WAV   0.24         0.33         0.61       1.16       <.001
F        PS    2.92         2.57         4.48       7.65       .016
F        FOS   3.23         0.73         2.89       2.33       .084
S        AC    3.11         5.39         3.90       6.44       .354
S        AMDF  4.24         8.98         7.07       9.03       .042
S        YIN   2.77         3.18         4.29       4.71       .006
S        CEP   13.92        15.46        17.47      16.12      .139
S        SIFT  9.99         13.29        11.30      16.45      .540
S        WAV   0.24         0.33         0.48       0.84       .001
S        PS    8.47         8.76         8.38       8.40       .944
S        FOS   2.27         0.98         3.40       6.50       .019
Table A-7. Results of performance of the available PDAs in database of normal and BVFL (Female and without Age)
Measure  PDAs  Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC    1.10         1.27         3.60       6.61       <.001
G        AMDF  11.62        11.65        12.44      13.07      .566
G        YIN   1.50         1.67         4.56       7.56       <.001
G        CEP   2.14         1.89         4.15       5.11       <.001
G        SIFT  2.07         2.09         5.28       7.54       <.001
G        WAV   5.08         1.90         6.47       1.49       <.001
G        PS    1.46         1.45         4.62       6.54       <.001
G        FOS   1.46         2.70         1.07       2.25       .206
X2       AC    0.03         0.22         0.19       0.72       .001
X2       AMDF  0.06         0.60         0.23       0.69       .024
X2       YIN   0.45         0.82         0.43       1.09       .815
X2       CEP   0.16         0.61         0.68       2.03       <.001
X2       SIFT  0.16         0.68         0.35       1.06       .035
X2       WAV   0.00         0.00         0.00       0.01       .318
X2       PS    0.27         0.62         0.48       1.06       .019
X2       FOS   0.00         0.00         0.30       1.23       <.001
/2       AC    1.06         1.24         3.41       6.51       <.001
/2       AMDF  11.56        11.66        12.21      13.05      .648
/2       YIN   1.05         1.46         4.13       7.18       <.001
/2       CEP   1.98         1.89         3.48       4.63       <.001
/2       SIFT  1.91         2.07         4.93       7.44       <.001
/2       WAV   5.08         1.90         6.47       1.49       <.001
/2       PS    1.19         1.50         4.14       6.37       <.001
/2       FOS   1.46         2.70         0.77       1.84       .022
F        AC    1.44         0.63         2.30       3.86       <.001
F        AMDF  9.12         8.15         10.15      9.77       .308
F        YIN   14.04        0.51         15.07      4.05       <.001
F        CEP   2.17         1.26         3.23       4.31       <.001
F        SIFT  4.76         1.27         6.07       5.32       <.001
F        WAV   0.48         0.91         0.98       1.53       <.001
F        PS    1.30         0.90         3.56       5.70       <.001
F        FOS   2.78         1.93         2.60       2.55       .451
S        AC    2.32         2.70         4.87       7.69       <.001
S        AMDF  15.19        9.32         14.26      9.86       .405
S        YIN   2.56         1.65         4.80       6.56       <.001
S        CEP   4.42         4.33         6.16       7.97       .007
S        SIFT  4.23         4.09         7.22       6.79       <.001
S        WAV   0.55         0.99         0.65       1.05       .401
S        PS    3.70         3.11         7.99       8.96       <.001
S        FOS   5.00         5.50         4.49       7.22       .468
Table A-8. Results of performance of the available PDAs in database of normal and BVFL (Phonation |a|)
Measure  PDAs  Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC    1.68         1.84         4.17       6.67       .001
G        AMDF  7.13         9.81         7.62       12.70      .825
G        YIN   1.93         2.33         5.94       8.69       <.001
G        CEP   2.74         2.20         5.74       7.14       <.001
G        SIFT  3.34         3.68         6.51       8.51       .004
G        WAV   5.74         1.62         6.39       1.65       .063
G        PS    1.90         2.09         5.38       7.31       <.001
G        FOS   1.20         2.89         0.65       1.45       .322
X2       AC    0.08         0.42         0.17       0.59       .343
X2       AMDF  0.00         0.00         0.30       0.86       .001
X2       YIN   0.39         0.74         0.46       1.03       .702
X2       CEP   0.33         1.27         1.34       3.84       .028
X2       SIFT  0.88         2.93         0.71       1.68       .767
X2       WAV   0.02         0.10         0.00       0.03       .420
X2       PS    0.40         1.05         1.06       3.97       .140
X2       FOS   0.00         0.00         0.15       0.65       .026
/2       AC    1.60         1.65         4.00       6.69       .001
/2       AMDF  7.13         9.81         7.32       12.81      .932
/2       YIN   1.54         2.07         5.48       8.70       <.001
/2       CEP   2.41         2.13         4.41       5.39       .003
/2       SIFT  2.46         2.20         5.79       8.28       <.001
/2       WAV   5.72         1.64         6.38       1.65       .060
/2       PS    1.50         1.78         4.32       6.28       <.001
/2       FOS   1.20         2.89         0.50       1.28       .208
F        AC    1.69         1.30         2.68       4.23       .043
F        AMDF  5.87         7.13         7.44       9.69       .336
F        YIN   14.07        0.86         15.05      3.73       .017
F        CEP   2.41         2.92         3.99       6.22       .057
F        SIFT  4.80         2.01         6.09       5.97       .068
F        WAV   0.58         1.03         1.06       1.59       .054
F        PS    1.69         1.92         3.96       5.85       .001
F        FOS   3.12         1.85         2.40       1.08       .051
S        AC    3.84         5.78         5.76       7.96       .152
S        AMDF  10.79        9.94         10.41      9.58       .855
S        YIN   3.35         4.29         4.85       4.42       .102
S        CEP   5.65         8.68         9.09       11.41      .083
S        SIFT  6.02         7.10         8.74       9.11       .092
S        WAV   0.53         0.81         0.67       0.96       .417
S        PS    4.41         6.00         8.46       9.84       .007
S        FOS   4.38         5.00         3.38       4.09       .326
Table A-9. Results of performance of the available PDAs in database of normal and BVFL (Phonation |e|)
Measure  PDAs  Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC    0.91         0.86         2.58       4.42       .001
G        AMDF  9.25         14.07        7.78       11.10      .604
G        YIN   1.19         1.13         3.89       6.52       <.001
G        CEP   2.67         2.69         5.10       9.19       .022
G        SIFT  2.04         2.39         5.27       8.04       .001
G        WAV   4.96         2.10         6.10       2.15       .012
G        PS    1.33         1.34         4.33       7.12       <.001
G        FOS   1.11         2.98         0.70       1.62       .475
X2       AC    0.08         0.31         0.17       0.63       .323
X2       AMDF  0.00         0.00         0.35       1.02       .001
X2       YIN   0.49         0.68         0.41       1.01       .602
X2       CEP   0.89         2.81         1.85       6.77       .263
X2       SIFT  0.64         1.83         0.89       3.47       .599
X2       WAV   0.00         0.00         0.01       0.09       .182
X2       PS    0.75         1.22         1.42       4.90       .214
X2       FOS   0.13         0.73         0.16       0.72       .858
/2       AC    0.83         0.83         2.41       4.27       .001
/2       AMDF  9.25         14.07        7.43       11.28      .521
/2       YIN   0.70         1.08         3.49       5.98       <.001
/2       CEP   1.78         1.61         3.25       4.35       .006
/2       SIFT  1.40         1.84         4.38       7.40       <.001
/2       WAV   4.96         2.10         6.09       2.14       .013
/2       PS    0.58         0.92         2.91       4.73       <.001
/2       FOS   0.98         2.93         0.54       1.51       .438
F        AC    1.18         0.54         1.81       2.09       .008
F        AMDF  7.13         9.72         7.22       8.48       .963
F        YIN   13.94        0.34         14.74      3.53       .028
F        CEP   3.18         4.36         4.47       10.24      .323
F        SIFT  4.51         1.76         6.15       7.53       .048
F        WAV   0.24         0.48         0.65       1.19       .007
F        PS    1.46         1.58         3.72       6.44       .002
F        FOS   2.98         2.01         2.51       1.57       .254
S        AC    2.17         3.47         3.87       5.24       .042
S        AMDF  10.64        10.28        10.77      9.42       .949
S        YIN   2.34         1.43         4.03       4.20       .001
S        CEP   7.72         10.06        9.09       10.32      .519
S        SIFT  4.40         5.08         8.64       10.82      .004
S        WAV   0.34         0.73         0.47       0.94       .433
S        PS    4.07         4.56         7.37       6.84       .003
S        FOS   3.52         4.32         3.34       4.26       .840
Table A-10. Results of performance of the available PDAs in database of normal and BVFL (Phonation |i|)
Measure  PDAs  Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC    0.86         1.06         2.05       4.29       .013
G        AMDF  7.76         9.92         9.17       14.08      .540
G        YIN   1.09         1.32         2.74       5.25       .005
G        CEP   2.44         2.67         4.71       6.43       .006
G        SIFT  1.79         1.80         4.30       8.32       .006
G        WAV   3.83         2.02         6.31       1.72       <.001
G        PS    1.08         1.29         3.49       6.58       .001
G        FOS   0.49         1.32         0.78       2.13       .369
X2       AC    0.15         0.48         0.30       1.01       .268
X2       AMDF  0.33         1.29         1.25       7.62       .255
X2       YIN   0.41         0.71         0.68       1.19       .136
X2       CEP   1.03         2.35         1.76       3.80       .206
X2       SIFT  0.30         0.94         0.67       1.86       .154
X2       WAV   0.02         0.12         0.03       0.15       .904
X2       PS    0.54         1.23         1.47       5.34       .110
X2       FOS   0.00         0.00         0.52       1.84       .006
/2       AC    0.70         0.98         1.75       4.01       .020
/2       AMDF  7.43         9.98         7.92       12.47      .823
/2       YIN   0.68         0.94         2.06       4.69       .007
/2       CEP   1.41         1.58         2.94       3.97       .002
/2       SIFT  1.49         1.81         3.63       7.03       .007
/2       WAV   3.81         2.04         6.28       1.72       <.001
/2       PS    0.54         0.79         2.02       2.92       <.001
/2       FOS   0.49         1.32         0.26       1.00       .395
F        AC    1.24         0.61         1.64       1.83       .066
F        AMDF  6.82         7.29         7.76       9.77       .571
F        YIN   13.95        0.45         14.29      2.58       .221
F        CEP   2.98         3.59         4.55       5.96       .080
F        SIFT  4.85         2.36         5.77       8.08       .315
F        WAV   0.26         0.44         0.83       1.40       .001
F        PS    2.42         1.99         4.15       7.09       .032
F        FOS   2.53         1.16         2.85       3.11       .409
S        AC    1.97         2.31         3.64       5.72       .021
S        AMDF  12.18        11.74        11.45      10.22      .760
S        YIN   2.51         1.36         4.15       4.15       .001
S        CEP   7.62         9.05         10.35      11.54      .183
S        SIFT  6.34         9.01         7.55       10.14      .535
S        WAV   0.43         0.78         0.60       0.95       .322
S        PS    5.09         6.26         8.37       8.57       .025
S        FOS   2.97         3.64         4.45       8.61       .178
Table A-11. Results of performance of the available PDAs in database of normal and BVFL (Phonation |o|)
Measure  PDAs  Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC    1.24         1.39         3.88       7.34       .001
G        AMDF  7.93         11.19        9.36       13.27      .560
G        YIN   1.34         1.33         4.12       6.68       <.001
G        CEP   3.25         4.16         6.90       11.17      .008
G        SIFT  2.97         3.06         6.39       11.10      .007
G        WAV   5.29         2.13         6.32       1.93       .022
G        PS    2.02         1.45         4.99       7.24       <.001
G        FOS   1.02         1.88         0.85       2.11       .682
X2       AC    0.19         0.49         0.65       4.02       .262
X2       AMDF  0.11         0.63         1.59       7.91       .069
X2       YIN   0.40         0.65         0.77       2.84       .236
X2       CEP   1.67         3.68         3.36       9.78       .159
X2       SIFT  1.12         2.78         1.45       4.50       .634
X2       WAV   0.00         0.00         0.02       0.07       .035
X2       PS    0.54         0.87         1.10       4.59       .250
X2       FOS   0.00         0.00         0.30       1.22       .016
/2       AC    1.05         1.33         3.23       5.96       .001
/2       AMDF  7.82         11.25        7.77       11.52      .984
/2       YIN   0.94         1.21         3.35       5.63       <.001
/2       CEP   1.58         1.98         3.54       3.72       <.001
/2       SIFT  1.85         2.07         4.94       8.16       .001
/2       WAV   5.29         2.13         6.31       1.92       .024
/2       PS    1.49         1.65         3.89       5.08       <.001
/2       FOS   1.02         1.88         0.55       1.73       .233
F        AC    1.19         0.67         2.09       4.14       .039
F        AMDF  6.70         8.03         7.53       8.60       .626
F        YIN   14.03        0.45         14.73      3.36       .045
F        CEP   3.79         5.70         6.97       19.24      .151
F        SIFT  5.41         3.23         6.96       11.15      .223
F        WAV   0.48         0.88         0.79       1.43       .164
F        PS    1.69         1.37         3.55       6.28       .007
F        FOS   2.89         1.37         2.56       2.53       .364
S        AC    2.77         4.20         3.98       5.94       .215
S        AMDF  11.70        12.18        12.10      10.87      .874
S        YIN   2.19         1.53         4.46       5.91       .001
S        CEP   8.20         12.82        10.94      14.36      .323
S        SIFT  8.13         10.74        9.72       14.30      .516
S        WAV   0.58         1.02         0.54       1.03       .836
S        PS    6.55         6.74         7.90       8.45       .371
S        FOS   4.34         4.78         3.38       5.96       .370
Table A-12. Results of performance of the available PDAs in database of normal and BVFL (Phonation |u|)
Measure  PDAs  Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC    0.68         1.05         3.35       6.43       <.001
G        AMDF  6.40         8.27         9.35       13.20      .147
G        YIN   1.58         1.49         3.91       7.05       .003
G        CEP   3.76         5.61         7.47       10.41      .013
G        SIFT  2.06         2.14         6.13       10.70      .001
G        WAV   4.60         2.18         6.22       2.08       .001
G        PS    2.04         2.57         5.26       8.84       .002
G        FOS   0.95         1.94         1.18       2.67       .605
X2       AC    0.17         0.54         0.75       2.78       .052
X2       AMDF  0.28         1.09         1.79       8.12       .075
X2       YIN   0.96         1.14         1.00       2.96       .915
X2       CEP   2.57         5.85         4.07       8.32       .273
X2       SIFT  0.36         0.93         1.11       3.28       .045
X2       WAV   0.00         0.00         0.01       0.05       .170
X2       PS    0.90         2.56         1.28       4.37       .557
X2       FOS   0.00         0.00         0.58       1.91       .003
/2       AC    0.51         0.90         2.60       5.67       .001
/2       AMDF  6.13         8.37         7.57       11.33      .453
/2       YIN   0.62         1.12         2.91       5.26       <.001
/2       CEP   1.18         1.69         3.39       4.86       <.001
/2       SIFT  1.70         1.90         5.02       9.41       .001
/2       WAV   4.60         2.18         6.21       2.08       .001
/2       PS    1.4          1.36         3.97       7.15       <.001
/2       FOS   0.95         1.94         0.60       1.77       .384
F        AC    1.25         0.67         2.48       5.02       .019
F        AMDF  5.26         5.80         7.68       8.84       .084
F        YIN   14.07        0.46         14.95      4.24       .046
F        CEP   5.07         9.05         7.80       12.30      .190
F        SIFT  4.61         2.42         7.05       9.86       .027
F        WAV   0.40         0.78         0.82       1.37       .036
F        PS    2.23         2.32         4.27       7.19       .016
F        FOS   3.21         1.52         3.26       3.23       .916
S        AC    2.31         2.77         5.16       10.02      .012
S        AMDF  10.56        9.11         12.03      10.71      .461
S        YIN   2.78         1.49         5.48       9.14       .006
S        CEP   10.32        13.42        14.17      16.76      .201
S        SIFT  6.83         11.95        9.70       14.05      .273
S        WAV   0.29         0.75         0.62       0.99       .056
S        PS    7.13         7.28         8.63       9.74       .367
S        FOS   4.79         5.20         5.69       9.73       .513
APPENDIX B
Detailed results of the jitter and shimmer measures in Chapter 3 are reported in the following tables:
Table B-1. Mean and S.D. of mean, max, and min of F0 for phonation |a|,|e|,|i|,|o|,|u| for
both sexes
Table B-2. Mean and S.D. of S.D., phonatory frequency range, and mean absolute jitter
of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Table B-3. Mean and S.D. of jitter(%), pitch perturbation factor, directional pitch
perturbation factor of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Table B-4. Mean and S.D. of relative average pitch perturbation 3, 5, and 15 of F0 for
phonation |a|,|e|,|i|,|o|,|u| for both sexes
Table B-5. Mean and S.D. of mean, max, and min of amplitude for phonation
|a|,|e|,|i|,|o|,|u| for both sexes
Table B-6. Mean and S.D. of S.D., Shimmer(dB), and mean absolute shimmer of
amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Table B-7. Mean and S.D. of shimmer(%), amplitude perturbation factor(%) and
amplitude directional perturbation factor of amplitude for phonation |a|,|e|,|i|,|o|,|u| for
both sexes
Table B-8. Mean and S.D. of relative average amplitude perturbation of amplitude for
phonation |a|,|e|,|i|,|o|,|u| for both sexes
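As a reading aid for the smoothed perturbation measures (RAPP/RAAP with windows of 3, 5, and 15) and for Shimmer(dB), the sketch below uses the definitions commonly found in the voice-analysis literature: the relative average perturbation compares each value with the mean of a centered window around it, and shimmer in dB averages the absolute log amplitude ratio of consecutive cycles. These formulas and function names are assumptions; the definitions actually used for the tables are those of Chapter 3.

```python
import math

def relative_average_perturbation(values, window=3):
    """100 * mean(|centered window mean - value|) / mean(value).

    `values` holds per-cycle pitch periods (RAPP) or peak amplitudes (RAAP);
    `window` must be odd and no longer than the sequence.
    """
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd integer >= 3")
    n = len(values)
    if n < window:
        raise ValueError("sequence shorter than window")
    half = window // 2
    deviations = []
    for i in range(half, n - half):
        local_mean = sum(values[i - half:i + half + 1]) / window
        deviations.append(abs(local_mean - values[i]))
    return 100.0 * (sum(deviations) / len(deviations)) / (sum(values) / n)

def shimmer_db(amplitudes):
    """Mean of |20 * log10(A[i+1] / A[i])| over consecutive cycles."""
    if len(amplitudes) < 2:
        raise ValueError("need at least two amplitudes")
    ratios = [abs(20.0 * math.log10(amplitudes[i + 1] / amplitudes[i]))
              for i in range(len(amplitudes) - 1)]
    return sum(ratios) / len(ratios)

# Hypothetical per-cycle values.
rap_flat = relative_average_perturbation([10.0] * 7, window=3)
rap_bump = relative_average_perturbation([10.0, 10.0, 13.0, 10.0, 10.0], window=3)
shim = shimmer_db([1.0, 10.0, 1.0])
```

A longer window smooths over slow drifts in pitch or amplitude, so the 15-point variants respond mainly to cycle-to-cycle irregularity rather than to intonation.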
Table B-1. Mean and S.D. of mean, max, and min of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Mean F0
Irrespective of sex  |a|        165.0     39.3      153.6      44.1       .000
Irrespective of sex  |e|        166.9     40.6      155.4      44.7       .000
Irrespective of sex  |i|        170.4     42.6      157.6      45.2       .000
Irrespective of sex  |o|        171.3     43.9      156.1      45.1       .001
Irrespective of sex  |u|        172.7     43.7      157.0      44.6       .001
Male                 |a|        134.9     18.6      117.6      18.3       .001
Male                 |e|        135.7     18.8      118.8      17.9       .001
Male                 |i|        137.7     19.9      120.6      17.7       .002
Male                 |o|        140.9     28.7      119.4      18.6       .004
Male                 |u|        143.0     30.2      120.7      18.1       .004
Female               |a|        198.1     27.4      193.1      26.0       .141
Female               |e|        201.3     28.1      195.7      26.5       .134
Female               |i|        206.4     29.8      198.3      27.0       .087
Female               |o|        204.7     31.7      196.6      27.2       .094
Female               |u|        205.3     31.2      197.0      27.0       .059
Max F0
Irrespective of sex  |a|        202.0     93.5      165.2      54.2       .015
Irrespective of sex  |e|        183.7     51.4      161.1      46.6       .000
Irrespective of sex  |i|        186.6     48.6      167.2      50.3       .000
Irrespective of sex  |o|        182.8     53.3      160.4      46.1       .002
Irrespective of sex  |u|        186.4     42.4      161.8      46.1       .001
Male                 |a|        172.2     114.2     122.4      22.7       .073
Male                 |e|        146.8     23.2      123.2      18.8       .000
Male                 |i|        150.9     25.4      126.5      19.5       .001
Male                 |o|        153.8     48.0      123.0      19.3       .011
Male                 |u|        159.4     51.1      124.4      18.2       .006
Female               |a|        234.8     47.9      212.3      36.5       .031
Female               |e|        224.4     42.4      202.7      28.4       .001
Female               |i|        225.9     35.8      212.1      31.6       .023
Female               |o|        214.7     39.3      201.6      27.9       .061
Female               |u|        216.2     35.4      203.0      28.5       .013
Min F0
Irrespective of sex  |a|        144.2     33.7      144.5      40.9       .938
Irrespective of sex  |e|        153.2     39.1      149.4      43.6       .271
Irrespective of sex  |i|        155.5     38.4      148.8      40.9       .036
Irrespective of sex  |o|        159.3     43.0      151.6      44.9       .015
Irrespective of sex  |u|        157.8     43.0      152.1      43.7       .083
Male                 |a|        121.4     22.6      112.7      17.7       .067
Male                 |e|        124.3     21.9      113.8      18.6       .025
Male                 |i|        125.9     20.0      115.2      17.3       .020
Male                 |o|        126.4     22.1      115.2      19.9       .013
Male                 |u|        125.3     24.6      116.6      19.4       .086
Female               |a|        169.2     25.2      179.4      28.6       .082
Female               |e|        185.0     27.2      188.6      25.2       .479
Female               |i|        188.1     24.6      186.0      22.6       .630
Female               |o|        195.6     28.4      191.6      26.5       .383
Female               |u|        193.5     27.8      191.1      25.6       .572
Table B-2. Mean and S.D. of S.D., phonatory frequency range, and mean absolute jitter of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
S.D. of F0
Irrespective of sex  |a|        8.97      10.6      3.28       4.76       .002
Irrespective of sex  |e|        4.73      4.39      2.18       1.07       .000
Irrespective of sex  |i|        4.76      2.85      3.10       1.90       .001
Irrespective of sex  |o|        3.82      5.34      1.85       1.46       .024
Irrespective of sex  |u|        4.82      6.24      2.04       1.60       .006
Male                 |a|        6.71      11.6      1.85       1.75       .072
Male                 |e|        4.02      4.26      1.98       1.15       .027
Male                 |i|        4.40      3.29      2.37       1.79       .014
Male                 |o|        4.47      7.05      1.75       1.71       .092
Male                 |u|        5.46      8.09      1.78       2.03       .045
Female               |a|        11.46     9.16      4.86       6.36       .008
Female               |e|        5.51      4.49      2.40       0.95       .004
Female               |i|        5.16      2.29      3.92       1.73       .031
Female               |o|        3.11      2.34      1.96       0.74       .046
Female               |u|        4.12      3.27      2.33       0.88       .019
Phonatory frequency Range
Irrespective of sex  |a|        5.07      5.61      2.10       2.51       .003
Irrespective of sex  |e|        3.07      3.32      1.32       0.76       .001
Irrespective of sex  |i|        3.11      1.93      1.91       0.89       .000
Irrespective of sex  |o|        2.33      3.58      1.04       0.96       .027
Irrespective of sex  |u|        2.90      4.38      1.10       0.92       .012
Male                 |a|        4.69      6.81      1.38       1.41       .042
Male                 |e|        2.94      3.73      1.40       0.84       .060
Male                 |i|        3.11      2.27      1.64       0.98       .007
Male                 |o|        3.06      4.70      1.19       1.31       .083
Male                 |u|        3.83      5.82      1.17       1.25       .047
Female               |a|        5.48      4.03      2.89       3.19       .029
Female               |e|        3.21      2.89      1.23       0.67       .003
Female               |i|        3.10      1.53      2.20       0.68       .016
Female               |o|        1.53      1.38      0.87       0.23       .057
Female               |u|        1.88      1.34      1.03       0.30       .007
Mean Absolute Jitter(MAJ)
Irrespective of sex  |a|        6.96      8.64      2.35       4.99       .001
Irrespective of sex  |e|        4.52      4.47      1.62       1.11       .000
Irrespective of sex  |i|        4.90      3.95      2.80       1.96       .004
Irrespective of sex  |o|        3.01      5.13      1.05       0.61       .020
Irrespective of sex  |u|        4.20      5.88      1.48       1.05       .006
Male                 |a|        3.52      4.06      0.89       0.90       .009
Male                 |e|        3.89      4.83      1.34       1.02       .015
Male                 |i|        4.97      5.15      1.89       1.70       .018
Male                 |o|        3.86      7.02      0.65       0.30       .045
Male                 |u|        5.00      7.81      0.86       0.47       .022
Female               |a|        10.75     10.6      3.95       6.91       .013
Female               |e|        5.21      4.05      1.92       1.15       .001
Female               |i|        4.82      2.07      3.81       1.75       .070
Female               |o|        2.07      0.84      1.49       0.56       .008
Female               |u|        3.32      2.38      2.17       1.09       .036
Table B-3. Mean and S.D. of jitter(%), pitch perturbation factor, directional pitch perturbation factor of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Jitter(%)
Irrespective of sex  |a|        3.86      4.26      1.31       2.31       .000
Irrespective of sex  |e|        2.71      2.89      1.05       0.73       .000
Irrespective of sex  |i|        2.95      2.64      1.67       0.97       .007
Irrespective of sex  |o|        1.73      2.90      0.64       0.24       .020
Irrespective of sex  |u|        2.39      3.40      0.89       0.45       .007
Male                 |a|        2.56      2.87      0.71       0.54       .008
Male                 |e|        2.85      3.61      1.15       0.84       .022
Male                 |i|        3.53      3.49      1.48       1.11       .019
Male                 |o|        2.40      3.91      0.54       0.23       .038
Male                 |u|        3.12      4.51      0.71       0.40       .020
Female               |a|        5.29      5.08      1.97       3.22       .011
Female               |e|        2.55      1.90      0.99       0.60       .001
Female               |i|        2.30      0.89      1.88       0.77       .084
Female               |o|        1.00      0.34      0.74       0.22       .007
Female               |u|        1.59      1.11      1.08       0.44       .046
Pitch Perturbation Factor(%)
Irrespective of sex  |a|        6.05      9.01      2.69       9.00       .054
Irrespective of sex  |e|        4.94      6.85      0.70       2.64       .000
Irrespective of sex  |i|        7.34      8.74      4.99       7.34       .185
Irrespective of sex  |o|        2.01      6.76      0.09       0.52       .075
Irrespective of sex  |u|        4.23      8.44      0.87       2.69       .016
Male                 |a|        3.22      6.31      0.48       2.27       .079
Male                 |e|        3.84      7.42      0.00       0.00       .024
Male                 |i|        7.70      11.31     0.883      4.14       .018
Male                 |o|        3.49      9.15      0.00       0.00       .087
Male                 |u|        5.07      10.6      0.00       0.00       .037
Female               |a|        9.16      10.58     5.13       12.5       .224
Female               |e|        6.15      6.12      1.48       3.72       .003
Female               |i|        6.95      4.80      9.51       7.50       .136
Female               |o|        0.37      0.76      0.19       0.76       .490
Female               |u|        3.31      5.12      1.84       3.71       .246
Directional Perturbation Factor(%)
Irrespective of sex  |a|        58.70     10.3      50.36      8.69       .000
Irrespective of sex  |e|        62.56     8.9       53.01      10.9       .000
Irrespective of sex  |i|        60.81     10.6      56.12      10.3       .034
Irrespective of sex  |o|        57.68     8.52      48.51      10.4       .000
Irrespective of sex  |u|        59.98     10.6      53.01      10.8       .002
Male                 |a|        58.39     11.5      48.33      8.73       .000
Male                 |e|        62.85     9.76      51.20      12.0       .000
Male                 |i|        63.16     12.9      52.01      11.4       .003
Male                 |o|        59.10     9.06      44.26      11.5       .000
Male                 |u|        60.9      12.3      48.44      12.3       .001
Female               |a|        59.05     9.06      52.58      8.30       .010
Female               |e|        62.23     8.20      54.99      9.50       .000
Female               |i|        58.22     6.81      60.63      6.88       .140
Female               |o|        56.11     7.80      53.19      6.60       .207
Female               |u|        58.92     8.63      58.03      5.91       .569
Table B-4. Mean and S.D. of relative average pitch perturbation 3, 5, and 15 of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
RAPP 3(%)
Irrespective of sex  |a|        2.35      2.57      0.79       1.39       .000
Irrespective of sex  |e|        1.66      1.76      0.64       0.46       .000
Irrespective of sex  |i|        1.82      1.73      1.02       0.59       .009
Irrespective of sex  |o|        1.06      1.75      0.38       0.15       .018
Irrespective of sex  |u|        1.47      2.06      0.54       0.28       .006
Male                 |a|        1.57      1.78      0.41       0.35       .008
Male                 |e|        1.74      2.20      0.67       0.54       .021
Male                 |i|        2.21      2.29      0.89       0.67       .019
Male                 |o|        1.46      2.36      0.31       0.12       .033
Male                 |u|        1.89      2.73      0.42       0.24       .019
Female               |a|        3.22      3.04      1.21       1.93       .010
Female               |e|        1.58      1.15      0.62       0.37       .001
Female               |i|        1.40      0.52      1.17       0.47       .114
Female               |o|        0.62      0.20      0.47       0.13       .006
Female               |u|        0.99      0.70      0.67       0.26       .050
RAPP 5(%)
Irrespective of sex  |a|        2.30      2.56      0.78       1.36       .000
Irrespective of sex  |e|        1.53      1.62      0.61       0.37       .000
Irrespective of sex  |i|        1.61      1.22      0.96       0.56       .004
Irrespective of sex  |o|        1.03      1.79      0.39       0.13       .026
Irrespective of sex  |u|        1.40      2.08      0.52       0.24       .010
Male                 |a|        1.56      1.84      0.426      0.30       .010
Male                 |e|        1.62      2.03      0.63       0.44       .023
Male                 |i|        1.85      1.59      0.85       0.63       .015
Male                 |o|        1.43      2.42      0.32       0.11       .043
Male                 |u|        1.84      2.79      0.41       0.19       .026
Female               |a|        3.13      3.00      1.17       1.89       .012
Female               |e|        1.44      1.04      0.59       0.29       .001
Female               |i|        1.35      0.52      1.09       0.44       .066
Female               |o|        0.59      0.19      0.46       0.12       .013
Female               |u|        0.91      0.57      0.64       0.24       .037
RAPP 15(%)
Irrespective of sex  |a|        2.31      2.5       0.82       1.23       .000
Irrespective of sex  |e|        1.57      1.63      0.68       0.37       .000
Irrespective of sex  |i|        1.70      1.49      0.99       0.52       .007
Irrespective of sex  |o|        1.06      1.70      0.47       0.13       .028
Irrespective of sex  |u|        1.45      2.05      0.59       0.23       .010
Male                 |a|        1.72      2.06      0.54       0.25       .016
Male                 |e|        1.68      2.04      0.75       0.44       .028
Male                 |i|        2.06      1.96      0.91       0.60       .019
Male                 |o|        1.46      2.30      0.44       0.14       .048
Male                 |u|        1.90      2.73      0.53       0.22       .029
Female               |a|        2.97      2.81      1.12       1.73       .011
Female               |e|        1.45      1.07      0.61       0.27       .002
Female               |i|        1.31      0.50      1.08       0.42       .087
Female               |o|        0.62      0.19      0.49       0.11       .016
Female               |u|        0.96      0.60      0.65       0.24       .028
Table B-5. Mean and S.D. of mean, max, and min of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Mean Amp
Irrespective of sex  |a|        0.70      0.40      0.78       0.39       .240
Irrespective of sex  |e|        0.57      0.27      0.66       0.33       .173
Irrespective of sex  |i|        0.52      0.27      0.46       0.19       .224
Irrespective of sex  |o|        0.68      0.34      0.73       0.36       .397
Irrespective of sex  |u|        0.52      0.24      0.51       0.21       .787
Male                 |a|        0.73      0.36      0.90       0.43       .125
Male                 |e|        0.61      0.29      0.71       0.35       .298
Male                 |i|        0.55      0.30      0.53       0.22       .704
Male                 |o|        0.67      0.35      0.80       0.34       .185
Male                 |u|        0.53      0.26      0.57       0.24       .589
Female               |a|        0.66      0.44      0.65       0.300      0.92
Female               |e|        0.53      0.25      0.60       0.31       .400
Female               |i|        0.47      0.23      0.39       0.14       .089
Female               |o|        0.69      0.35      0.66       0.37       .689
Female               |u|        0.51      0.21      0.45       0.15       .229
Max Amp
Irrespective of sex  |a|        0.99      0.46      1.06       0.49       .429
Irrespective of sex  |e|        0.77      0.35      0.83       0.40       .403
Irrespective of sex  |i|        0.63      0.31      0.54       0.21       .080
Irrespective of sex  |o|        0.89      0.41      0.89       0.40       .997
Irrespective of sex  |u|        0.66      0.29      0.60       0.23       .165
Male                 |a|        1.02      0.40      1.22       0.52       .132
Male                 |e|        0.83      0.38      0.90       0.42       .553
Male                 |i|        0.68      0.35      0.62       0.24       .495
Male                 |o|        0.89      0.41      0.97       0.41       .524
Male                 |u|        0.69      0.32      0.65       0.26       0.60
Female               |a|        0.94      0.52      0.88       0.38       0.57
Female               |e|        0.70      0.31      0.76       0.37       .567
Female               |i|        0.59      0.27      0.46       0.15       .029
Female               |o|        0.89      0.42      0.80       0.39       .318
Female               |u|        0.63      0.25      0.54       0.17       .115
Min Amp
Irrespective of sex  |a|        0.471     0.34      0.55       0.30       .177
Irrespective of sex  |e|        0.40      0.19      0.50       0.28       .031
Irrespective of sex  |i|        0.40      0.22      0.38       0.17       .688
Irrespective of sex  |o|        0.47      0.28      0.59       0.32       .042
Irrespective of sex  |u|        0.37      0.21      0.41       0.19       .175
Male                 |a|        0.51      0.30      0.63       0.33       .199
Male                 |e|        0.43      0.21      0.53       0.27       .135
Male                 |i|        0.42      0.24      0.44       0.19       .757
Male                 |o|        0.48      0.30      0.64       0.32       .063
Male                 |u|        0.39      0.24      0.46       0.22       .185
Female               |a|        0.42      0.38      0.46       0.24       .606
Female               |e|        0.36      0.17      0.47       0.29       .135
Female               |i|        0.37      0.19      0.32       0.13       .199
Female               |o|        0.47      0.26      0.53       0.31       .394
Female               |u|        0.35      0.17      0.36       0.15       .714
Table B-6. Mean and S.D. of S.D., Shimmer(dB), and mean absolute shimmer of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
S.D. of Amp
Irrespective of sex  |a|        0.12      0.07      0.12       0.06       .964
Irrespective of sex  |e|        0.08      0.06      0.07       0.06       .636
Irrespective of sex  |i|        0.05      0.03      0.03       0.02       .012
Irrespective of sex  |o|        0.09      0.06      0.07       0.05       .111
Irrespective of sex  |u|        0.07      0.05      0.04       0.03       .002
Male                 |a|        0.11      0.05      0.14       0.06       .079
Male                 |e|        0.09      0.07      0.08       0.07       .827
Male                 |i|        0.05      0.03      0.04       0.03       .151
Male                 |o|        0.09      0.05      0.08       0.06       .593
Male                 |u|        0.07      0.06      0.04       0.04       .046
Female               |a|        0.12      0.09      0.09       0.05       .215
Female               |e|        0.07      0.04      0.07       0.04       .560
Female               |i|        0.04      0.02      0.03       0.01       .018
Female               |o|        0.09      0.06      0.07       0.05       .072
Female               |u|        0.07      0.04      0.04       0.03       .014
Shimmer(dB)
Irrespective of sex  |a|        6.16      2.91      4.22       2.19       .000
Irrespective of sex  |e|        5.48      4.20      2.93       1.03       .000
Irrespective of sex  |i|        4.09      2.81      2.62       0.71       .002
Irrespective of sex  |o|        4.73      6.43      2.24       1.18       .005
Irrespective of sex  |u|        4.84      5.63      2.43       1.18       .005
Male                 |a|        6.73      3.27      4.70       2.43       .002
Male                 |e|        6.85      5.36      3.18       1.05       .005
Male                 |i|        5.19      3.45      2.57       0.70       .002
Male                 |o|        6.54      8.47      5.37       1.00       .030
Male                 |u|        6.44      7.29      2.56       1.40       .014
Female               |a|        5.54      2.38      3.68       1.80       .001
Female               |e|        3.99      1.37      2.66       0.95       .000
Female               |i|        2.88      0.99      2.67       0.74       .427
Female               |o|        2.73      1.38      2.10       0.87       .077
Female               |u|        3.09      1.81      2.30       0.89       .056
Mean Absolute Shimmer(MAS)
Irrespective of sex  |a|        0.06      0.06      0.05       0.04       .588
Irrespective of sex  |e|        0.04      0.05      0.03       0.04       .490
Irrespective of sex  |i|        0.02      0.03      0.01       0.03       .095
Irrespective of sex  |o|        0.02      0.06      0.02       0.05       .841
Irrespective of sex  |u|        0.03      0.07      0.02       0.05       .703
Male                 |a|        0.05      0.06      0.06       0.05       .564
Male                 |e|        0.04      0.06      0.03       0.06       .934
Male                 |i|        0.02      0.03      0.01       0.04       .376
Male                 |o|        0.02      0.07      0.01       0.06       .365
Male                 |u|        0.04      0.08      0.03       0.06       .649
Female               |a|        0.07      0.07      0.04       0.02       .085
Female               |e|        0.04      0.04      0.02       0.02       .072
Female               |i|        0.02      0.02      0.01       0.02       .046
Female               |o|        0.02      0.05      0.03       0.02       .409
Female               |u|        0.01      0.05      0.02       0.02       .923
Table B-7. Mean and S.D. of shimmer(%), amplitude perturbation factor(%) and amplitude directional perturbation factor of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes
Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Shimmer (%)
Irrespective of sex  |a|        0.03      0.02      0.03       0.02       .009
Irrespective of sex  |e|        0.03      0.02      0.01       0.00       .005
Irrespective of sex  |i|        0.02      0.01      0.01       0.00       .002
Irrespective of sex  |o|        0.02      0.02      0.01       0.01       .009
Irrespective of sex  |u|        0.02      0.03      0.01       0.00       .067
Male                 |a|        0.04      0.02      0.04       0.02       .246
Male                 |e|        0.04      0.03      0.02       0.01       .020
Male                 |i|        0.02      0.01      0.01       0.00       .002
Male                 |o|        0.03      0.02      0.01       0.01       0.03
Male                 |u|        0.03      0.04      0.01       0.00       .108
Female               |a|        0.03      0.01      0.02       0.01       .005
Female               |e|        0.02      0.01      0.01       0.00       .024
Female               |i|        0.01      0.00      0.01       0.00       .541
Female               |o|        0.01      0.00      0.01       0.00       .028
Female               |u|        0.01      0.01      0.01       0.00       .285
Amplitude Perturbation Factor(%)
Irrespective of sex  |a|        11.88     9.57      22.48      25.24      .013
Irrespective of sex  |e|        7.47      10.61     17.16      16.93      .003
Irrespective of sex  |i|        3.95      7.83      22.45      17.44      .000
Irrespective of sex  |o|        5.99      8.77      9.01       12.36      .225
Irrespective of sex  |u|        3.89      9.57      15.99      18.03      .000
Male                 |a|        14.52     10.28     23.06      23.97      .110
Male                 |e|        11.11     12.99     17.46      16.79      .186
Male                 |i|        7.00      9.88      17.93      16.55      .023
Male                 |o|        9.15      10.73     7.08       9.89       .509
Male                 |u|        5.88      11.64     14.53      18.29      .048
Female               |a|        8.98      7.99      21.84      27.17      .063
Female               |e|        3.47      4.90      16.82      17.51      .004
Female               |i|        0.61      1.37      27.43      17.44      .000
Female               |o|        2.52      3.79      11.13      14.58      .025
Female               |u|        1.70      6.18      17.59      18.07      .002
Amplitude Directional Perturbation Factor (%)
Irrespective of sex  |a|        60.38     5.02      57.94      9.50       .114
Irrespective of sex  |e|        61.09     8.75      59.33      6.62       .248
Irrespective of sex  |i|        62.33     7.56      62.52      5.70       .888
Irrespective of sex  |o|        57.33     7.96      55.42      9.37       .358
Irrespective of sex  |u|        61.25     10.71     60.17      8.82       .582
Male                 |a|        59.77     3.87      55.34      11.15      .088
Male                 |e|        62.42     10.36     57.25      7.14       .035
Male                 |i|        62.46     9.49      59.48      5.39       .150
Male                 |o|        58.36     6.04      52.02      9.09       .016
Male                 |u|        61.02     10.73     58.02      10.95      .340
Female               |a|        61.06     6.07      60.81      6.40       .876
Female               |e|        59.64     6.50      61.62      5.27       .223
Female               |i|        62.19     4.86      65.87      3.94       .020
Female               |o|        56.19     9.70      59.16      8.38       .348
Female               |u|        61.50     10.96     62.53      4.92       .661
RAAP 3 (%)
Sex                  Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Female               u          1.75      1.23       1.29       0.55     .093
Female               o          1.46      0.87       1.16       0.54     .172
Female               i          1.60      0.62       1.53       0.45     .665
Female               e          2.19      0.82       1.48       0.59     .000
Female               a          3.02      1.28       2.05       1.08     .001
Male                 u          3.60      4.33       1.36       0.96     .016
Male                 o          3.67      5.08       1.14       0.45     .029
Male                 i          2.86      1.74       1.40       0.35     .001
Male                 e          3.80      2.81       1.67       0.61     .002
Male                 a          3.69      1.89       2.55       1.53     .003
Irrespective of sex  u          2.72      3.34       1.33       0.78     .006
Irrespective of sex  o          2.62      3.85       1.15       0.49     .018
Irrespective of sex  i          2.26      1.46       1.46       0.40     .002
Irrespective of sex  e          3.04      2.24       1.58       0.60     .000
Irrespective of sex  a          3.37      1.65       2.31       1.34     .000

RAAP 5 (%)
Sex                  Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Female               u          1.81      0.89       1.43       0.51     .057
Female               o          1.65      0.78       1.27       0.53     .052
Female               i          1.82      0.56       1.67       0.42     .277
Female               e          2.51      0.83       1.65       0.58     .000
Female               a          3.45      1.65       2.28       1.15     .004
Male                 u          3.96      4.70       1.46       0.60     .017
Male                 o          4.15      5.75       1.38       0.58     .034
Male                 i          3.26      2.54       1.58       0.40     .006
Male                 e          4.32      3.75       1.95       0.68     .009
Male                 a          4.24      2.06       2.81       1.40     .001
Irrespective of sex  u          2.93      3.58       1.45       0.55     .008
Irrespective of sex  o          2.96      4.34       1.33       0.55     .018
Irrespective of sex  i          2.57      1.99       1.62       0.41     .004
Irrespective of sex  e          3.46      2.89       1.81       0.65     .001
Irrespective of sex  a          3.86      1.89       2.56       1.30     .000

RAAP 15 (%)
Sex                  Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Female               u          2.41      0.92       1.72       0.55     .002
Female               o          2.36      0.85       1.79       0.63     .008
Female               i          2.19      0.63       1.94       0.50     .127
Female               e          3.17      0.94       2.10       0.55     .000
Female               a          4.58      1.97       3.03       1.10     .002
Male                 u          4.95      5.06       2.23       0.82     .015
Male                 o          5.23      6.02       2.46       1.07     .037
Male                 i          3.99      2.85       2.15       0.58     .006
Male                 e          5.25      4.03       2.98       0.93     .015
Male                 a          5.42      2.22       3.99       1.43     .002
Irrespective of sex  u          3.74      3.89       1.99       0.74     .003
Irrespective of sex  o          3.86      4.58       2.14       0.94     .014
Irrespective of sex  i          3.13      2.27       2.05       0.55     .003
Irrespective of sex  e          4.26      3.13       2.56       0.88     .001
Irrespective of sex  a          5.02      2.12       3.53       1.36     .000

Table B-8. Mean and S.D. of relative average amplitude perturbation (RAAP) for phonation of |a|, |e|, |i|, |o|, |u| for both
sexes
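To make the role of the smoothing window in RAAP 3, RAAP 5 and RAAP 15 concrete, the following Python sketch
computes a relative average amplitude perturbation for an odd window length k. It assumes the widely used
formulation in which each cycle amplitude is compared with the mean of the k amplitudes centred on it, and the
average deviation is normalized by the mean amplitude. This formulation is an assumption and may differ in detail
from the definition used in the main text; the function name raap_percent is hypothetical.

```python
from typing import Sequence


def raap_percent(amps: Sequence[float], k: int) -> float:
    """Relative average amplitude perturbation (%) with an odd window of k cycles.

    Assumed formulation: mean over all full windows of
    |mean(window of k amplitudes) - centre amplitude|, divided by the
    mean amplitude of the whole sequence, times 100.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    n = len(amps)
    if n < k:
        raise ValueError("need at least k cycle amplitudes")
    half = k // 2
    deviations = []
    # Slide the window over every cycle that has `half` neighbours on each side.
    for centre in range(half, n - half):
        window = amps[centre - half:centre + half + 1]
        deviations.append(abs(sum(window) / k - amps[centre]))
    mean_deviation = sum(deviations) / len(deviations)
    mean_amplitude = sum(amps) / n
    return 100.0 * mean_deviation / mean_amplitude
```

Under this formulation a slow linear drift in amplitude contributes nothing to the measure, because the window mean
equals the centre value; only cycle-to-cycle irregularity within the window is counted. Calling the function with
k = 3, 5 and 15 on the same amplitude sequence yields three values analogous to the three measures in Table B-8.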
- 143 -
REFERENCES
[1] Robert TS, Voice Science, 1st ed. NY, Plural P., 2005.
[2] Ainsworth S, Disorders of voice, 1st ed. Philadelphia, Harper & Row P., 1980.
[3] Robert TS, Clinical Assessment of Voice, 1st ed. NY, Plural P., 2005.
[4] Woo P, Casper J, Colton R, Brewer D, Aerodynamic and stroboscopic findings before and
after microlaryngeal phonosurgery, J. Voice, 1994, vol. 8, pp. 186-194.
[5] Lesly W, Cristina JM, Wayne H, and Alvaro G, Vocal fold nodule vs. vocal fold polyp:
Answer from surgical pathologist and voice pathologist point of view, J. Voice, 2003,
vol. 18, pp. 125-129.
[6] Michael MJ, Update on the etiology, diagnosis, and treatment of vocal fold nodules, polyps,
and cysts, 2003, Otolaryngology & Head and Neck Surgery, vol. 11, pp. 456-461.
[7] Ossoff RH et al., Cysts, nodules, and polyps. In The Larynx, 1st ed. Philadelphia,
Lippincott Williams & Willams P., 2003.
[8] Rubin P, Baer T, An articulatory synthesizer for perceptual research, J. of Acoust Soc of
Am, 1981, vol. 70, pp. 321-328.
[9] Fant G, Acoustic theory of speech production, Mouton, The Hague.
[10] Rabiner LR, Schafer RW, Digital processing of speech signals, NJ, Prentice Hall, 1978.
[11] Makhoul J, Spectral linear prediction: Properties and applications, IEEE Transactions on
Acoustics, Speech and Signal Processing, 1975, vol. 23, pp. 283-296.
[12] Teager, H. & Teager, S., Evidence for nonlinear sound production mechanisms in the
vocal tract, Vol. D 55 of NATO ASI Series, Kluwer Academic Publishers, pp. 241-261.
[13] Tishby, N, A dynamical systems approach to speech processing, Proc. IEEE international
Conference on Acoustics, Speech, and Signal Processing, vol 1, pp. 365-368.
[14] Townshend, B. Nonlinear prediction of speech, Proc. IEEE international Conference on
Acoustics, Speech, and Signal Processing, vol 1, pp. 425-428
[15] Wu, L. & Fallside, F., Fully vector quantized neural network-based code-excited non-
linear predictive speech coding, Technical Report CUED/F-INFENG/TR.94, Cambrige
Univ. Eng. Dep, England.
[16] Mackenzie Beck, J. M., Organic Variation and Voice Quality, PhD Dissertation,
University of Edinburgh, 1988.
- 144 -
[17] Paul C., Eva C., Ruth, E. et al., Formal perceptual evaluation of voice quality in the
united kingdom, Logopedics Phoniatrics vocology, vol. 25, pp. 133-138
[18] M.P. Karnell, R.S. Scherer, L. Fischer, Comparison of acoustic voice perturbation
measures among three independent voice laboratories, J. Speech hear. Res., 1991, vol.
34, pp. 781-790.
[19] M.P. Karnell, K.D. Hall, K. Landahl, Comparison of fundamental frequency and
perturbation measurements among three analysis systems, J. Voice, 1995, vol. 4, pp.
383-393.
[20] S. Bielamowicz, J. Kreiman, B.R. Gerratt, M.S. Dauer, and G.S. Berke, Comparison of
voice analysis systems for perturbation measurements, J. Speech Hear. Res., 1996, vol.
39, pp. 126-134.
[21] Giulia Bertino et al, Acoustic Analysis of Voice Quality with or without False Vocal Fold
Displacement After Cordectomy, Journal of Voice, 2001,
[23] FUJITA, Reginaldo, FERREIRA, Ana Elisa and SARKOVAS, Caroline.,
Videokymography assessment of vocal fold vibration before and after hydration., Rev.
Bras. Otorrinolaringol., 2004, vol. 70, pp. 742-746.
[23] Pieter Noordzij, Peak Woo., Glottal area waveform analysis of benign vocal fold lesions
before and after surgery, The annals of otology, Rhinology & Laryngology, vol. 109, pp.
441-446.
[24] Ming-Wang H., Yu-Che H., The characteristic features of muscle tension dysphonia
before and after surgery in benign lesions of the vocal fold, ORL J Otorhinolaryngol
Relat Spec. vol. 66, pp. 246-254.
[25] Ming-Wang H., Videolaryngostroboscopic observation of mucus layer during vocal cord
vibration in patients with vocal nodules before and after surgery., Acta Otolaryngol, vol.
124, pp. 186-191.
[26] Alison B., Lucian S., Tina H., Factors Predicting Patient Perception of Dysphonia Caused
by Benign Vocal Fold Lesions, Laryngoscope, vol. 114, pp. 1693-1670.
[27] Shi Chan Kim, Comparative study of pre and postoperative voice and image analysis in
unilateral vocal cord paralysis and vocal polyp, Yonsei University, Korea, 2000
[28] Joo Hwan Lee, Prediction of post-treatment outcome of pathologic voice using voice
synthesis, Yonsei University, Korea, 2003
- 145 -
[29] Moo-jin Baek, A comparative study of pre and postoperative voice and prediction of
postoperative voice by speech synthesis in benign laryngeal diseases, Pusan National
University, Korea, 2000
[30] de Cheveign´e, A., Separation of concurrent harmonics sounds: Fundamental frequency
estimation and a timedomain cancellation model of auditory processing. J. Acoust Soc
of Am, vol. 93, pp. 3271–3290.
[31] Klapuri, A. P., Multiple fundamental frequency estimation based on harmonicity and
spectral smoothness, IEEE Transactions on Speech and Audio Processing, vol. 11, pp.
804-816.
[32] de Cheveign´e, A., Pitch perception models in Pitch. Springer-Verlag. Edited by C. Plack
and A. Oxenham, 2004.
[33] von Helmholtz, H. L. F., On the Sensations of Tone as a Physiological Basis for the
Theory of Music. New York: Dover. English translation of 1863 edition by A. J. Ellis.
[34] Schouten, J. F., The perception of subjective tones in Psychological Acoustics. Edited by
E.D. Schubert, 1979.
[35] John M. Eargle. Music, Sound and Technology. Van Nostrand Reinhold, Toronto, 1995.
[36] Stephen Handel. Listening. MIT Press, Cambridge, 1989.
[37] Stanley Coren, Lawrence M. Ward, and James T. Enns. Sensation and Perception.
Harcourt Brace
[38] F. Klingholz, F. Martin, Quantitative spectral evaluation of shimmer and jitter, J. Speech
Hear. Res., 1985, vol. 28, pp. 169-174.
[39] S. Feijoo, C. Hernandez, Short-term stability measures for the evaluation of vocal quality,
J. Speech Hear. Res., 1990, vol. 33, pp. 324-334.
[40] Fant, C., On the predictability of formant levels and spectrum envelopes from formant
frequencies. In M. Halle. H. Lunt, and H. MacLean (eds.), For Roman Jakobson. The
Hague: Mouton, 1956.
[41] Laver. John., Principles of Phonetics, Cambridge University Press, 1994.
[42] Christine M. S., Elaine T. S., Christopher D., Approximations of open quotient and speed
quotient from glottal airflow and EGG waveforms: Effects of Measurement Criteria and
Sound Pressure Level, J of Voice., Vol. 12, pp. 31-43.
[43] Ingo R. Titze, Acoustic Interpretation of Resonant Voice, J. of Voice, 2001, vol.15, pp.
519-528.
- 146 -
[44] M.M. Sondhi, New methods of pitch determination, IEEE Trans. Audio Electroacoust.,
1968, vol. 16, pp. 262-266.
[45] L.R. Rabiner, On the use of autocorrelation analysis for pitch detection, IEEE Trans.
Acoust. Speech Signal Process., 1977, vol. 25, pp. 24-33.
[46] J.R. Deller, Jr., J.G. Proakis, J.H.L. Hansen, Example short-term features and applications,
Discrete-Time Processing of Speech Signals, Macmillan, New York, 1993.
[47] A. D. Cheveigne and H. Kawahara, Yin: A fundamental frequency estimator for speech
and music, Journal of the Acoustical Society of America, 2002, vol. 111, pp. 1917-1930.
[48] Jindong Chen, Kuldip K. Paliwal, Satoshi Nakamura, Cepstrum derived from
differentiated power spectrum for robust speech recognition, Speech Communication,
2003, vol. 41, pp. 469-484.
[49] J. D. Market, The SIFT algorithm for fundamental frequency estimation, IEEE Trans.
Audio Electroacourt., 1972, vol. 20, pp.367-377.
[50] T. Engin Tuncer, Deconvolution and preequalization with best delay LS inverse filters,
Signal Processing, 2004, vol. 84, pp. 2207-2219.
[51] Jan Skoglund, Analysis and quantization of glottal pulse shapes, Speech Communication,
1998, vol.24, pp. 133-152.
[52] Mallat SG. A theory for multiresolution signal decomposition: the wavelet representation.
IEEE Trans Patt Anal Mach Intell, 1989, vol. 11, pp. 674–93.
[53] Yisong Dai, The time-frequency analysis approach of electric noise based on the wavelet
transform, Solid-State Electronics, 2000, vol.44, pp. 2147-2153.
[54] O. Farooq, S. Datta, Phoneme recognition using wavelet based features, Information
Science, 2003, vol.150, pp. 5-15.
[55] Kadambe S, Bourdeaux-Bartels GF. Application of the Wavelet transform for pitch
detection of speech signals. IEEE Trans Inf Theory, 1992, vol. 38, pp. 917–24.
[56] Vincent Gibiat. Phase space representations of acoustical musical signals., Journal of
Sound and Vibration, 1988, vol. 123, pp. 537–572.
[57] David Gerhard. Audio visualization in phase space. In Bridges, Mathematical
Connections in Art, Music and Science, pp. 137–144, 1999.
[58] Dmitry Terez. Fundamental frequency estimation using signal embedding in state space.,
Journal of the Acoustical Society of America, 2002, vol. 112, pp. 2279.
- 147 -
[59] Dmitry Terez. Robust pitch determination using nonlinear state-space embedding. In
International Conference on Acoustics, Speech and Signal Processing, vol. I, 2002, pp.
345–348.
[60] McGaughey D., Spectral Modelling and stimulation of Atmospherically distorted
wavefront data, PhD Thesis, Queen’s’ University, Ontario, 1999
[61] Korenberg M. and Adeney K., Iterative Fast Orthogonal Search for Modelling by a sum
of exponentials or sinusoids, Annals of Biomedical Eng. 1998, vol. 26, pp. 315-327.
[62] KORENBERG, M.J., Fast orthogonal identification of nonlinear difference equation and
functional expansion models., Proceedings of the 30th Midwest Symposium on Circuits
and systems, 1987, pp. 270-276
[63] Wahid, A, Fast Orthogonal Search for Training Radial Basis Function Neural Networks,
M.S Thesis, Unviersity of Main, 1994.
[64] T. V. Ananthapadmanabha and B. Yegnanarayana. Epoch extraction from linear
prediction residual for identification of closed glottis interval. IEEE Trans. Acoust.,
Spch., and Sag. Proc., 1979, vol. 27.
[65] A. Kumar and S. K. Mullick, Nonlinear dynamical analysis of speech, J. Acoust. Soc. Am.
vol. 100, pp. 615–629.
[66] Titze, I. R. Workshop on Acoustic Voice Analysis, Summary Statement, National Center
for Voice and Speech, Denver, 1995.
[67] Titze LR. Principles of Voice Production. Englewood Cliffs, NJ: Prentice Hall: 1994.
[68] Davis SB. Acoustic characteristics of normal and pathological voices. In: Lass NK, ed,
Speech and Language: Advances in Basic Research and Practice, vol. 1. New York:
Academic; pp. 271-235, 1979.
[69] Hadjitodorov S, Mitev P. A computer system for acoustic analysis of pathological voices
and laryngeal diseases screening. Med Eng Phys. 2002, vol. 24, pp. 419–429.
[70] Giovanni A, Robert D, Estubier N, Teston B: Objective evaluation of dysphonia:
Preliminary results of a device allowing simultaneous acoustics and aerodynamics
measurements. Folia, Phon. Logop.
[71] Banci G, Monini S, Falaschi A, Sario N: Vocal fold disorder evaluation by digital speech
analysis, J. Phonetics, 1986, vol. 14, pp. 495-499.
- 148 -
[72] Gavidia-Ceballos L, Hansen L: Direct speech feature estimation using an iterative EM
algorithm for vocal fold pathology detection., IEEE Tr. on Biomedical Eng., 1996, vol.
43, pp. 373-383.
[73] Laver J, Hiller S, Mackenzie J, Rooney E: An acoustic screening system for the detection
of laryngeal pathology. J. Phonetics, vol. 14, pp. 517-524.
[74] D. G. Childers, A. M. Smith, and A. K. Krishnamurthy, A critical review of
electroglottography, CRC Crit., Rev, Bioeng., vol. 12, pp. 131-161.
[75] Jack J. J, Shuangyi T., Michel D., Chi-haur W., and David G. H., Integrated Analyzer and
Classifier of Glottographic Signals, IEEE Trans. Rehab. Eng. vol. 6, pp. 227-234.
[76] Childers DG, Hicks DM, Moore GP, AlsakaYA. A model for vocal fold vibratory motion,
contact area, and the electroglottogram. J Acoust Soc Am, 1986, vol. 80, pp. 1309-1320.
[77] Matsushita H. The vibratory mode of the vocal folds in the excised larynx. Folia Phoniatr.
1975, vol. 27, pp. 7-18.
[78] Askenfelt, A. G., and Hammarberg, B., Speech waveform perturbation analysis: A
perceptual-acoustical comparison of seven measures, J. Speech Hear. Res. vol. 29, pp.
50–64.
[79] Gauffin, J., Granqvist, S., Hammarberg, B., and Hertegård, S. , Irregularities in the voice:
A perceptual experiment using synthetic voices with subharmonics, in Vocal Fold
Physiology: Controlling Complexity and Chaos
[80] Heiberger, V. L., and Horii, Y., Jitter and shimmer in sustained phonation,’’ in Speech
and Language: Advances in Basic Research and Practice, Lass Academic, New York,
vol. 7, pp. 299–332.
[81] Kent, R.D. and C. Read. 1992. The Acoustic Analysis of Speech. Sandiego: Singular
Publishing.
[82] McCarthy, J. 1994. The Phonetics and Phonology of Semitic Pharyngeals. In Keating, P.,
Phonological Structure and Phonetic Form: Papers in Laboratory Phonetics III.
Cambridge, Mass: Cambridge.
[83] A. Crowe and M.A.Jack, Globally optimizing formant tracker using generalized centroids,
Electron. Lett., 1987, vol. 23, pp. 1019-1020.
[84] G. E. Kopec, Formant tracking using hidden Markov models and vector quantization,
IEEE Trans. Acoust., Speech, Signal Processing, 1986, vol. ASSP-34, pp. 709-729.
- 149 -
[85] S. McCandless, An algorithm for automatic formant extraction using linear prediction
spectra, IEEE Trans. Acoust., Speech, Signal Processing, 1974, vol. ASSP-22, pp. 135-
141.
[86] R. C. Snell and F. Milinazzo, Formant location from LPC analysis data, IEEE Trans.
Speech Audio Processing, 1993, vol. 1, pp. 129-134.
[87] Roger W. Chan, Measurements of vocal fold tissue viscoelasticity: Approaching the male
phonatory frequency range, J. of Acoust Soc of Am, 2004, vol. 115, pp. 3161-3170.
[88] Lieberman P., Perturbations in Vocal Pitch, J. Acoust Soc of Am, 1961, vol. 33, pp. 597-
603.
[89] M. H. L. Hecker and E. J. Kreul, Description of the speech of patients. with cancer of the
vocal folds. Part I: Directional perturbation factors for jitter and for shimmer, J.
Commun. Disorders, 1984, vol. 17, pp. 143–151
[90] Hecker M. & Kreul E., Description of the speech of patients with cancer of the vocal
folds. Part I: Measures of fundamental frequncy, 1971, J. Acoust. Soc. Am., vol. 44,
pp.1275-1282.
[91] Koike, Y., Application of some acoustic measures for evaluation of laryngeal dysfunction,
J. Acoust. Soc. Am, 1973, vol. 45, pp. 839–844.
[92] Yumoto E, Gould W, Baer T., The harmonics-to-noise ratio as an index of the degree of
hoarseness. J. Acoust. Soc. Am., 1982, vol. 71, pp. 1544-1550.
[93] Mitev P. System for acoustic analysis of the pathological voices and screening of the
laryngeal diseases. Ph.D. thesis, Center on Biomedical Engineering, Bulgarian
Academy of Sciences, 2000.
[94] De Krom G: A cepstmm-based technique for determining a harmonics to noise ratio in
speech signals. J. of Speech &Hearing Research, 1993, vol. 36, pp. 254-266.
[95] Imaizumi, S., A preliminary study on the generation of pathological voice types, in Vocal
Fold Physiology: Voice Production, Mechanisms and Functions, New York, pp. 249-
258.
[96] Schafer, R. W., and Rabiner, L. R.., System for automatic analysis of voiced speech, J.
Acoust. Soc. Am. 1970, vol. 47, pp. 634–648.
[97] Kasuya H, Ogawa S, Mashima K, Ebihara S. Normalized noise energy as an acoustic
measure to evaluate pathologic voice., J. Acoust. Soc. Am, 1986, vol. 80, pp. 1329-1334.
- 150 -
[98] Hillenbrand J, Houde R. Acoustic correlates of breathy vocal quality: Dysphonic voices
and continuous speech., J. Speech & Hearing Research, 1996, vol. 39, pp. 311-321.
[99] Rosenberg A.E, Effect of Glottal Pulse Shape on the Quality of Natural Vowels , 1971, J.
Acous. Soc. Am., vol. 49, pp. 583-590.
[100] Henrich, N., d’Alessandro, C., Doval, B. & Castellengo, M., On the use of the derivative
of electroglottographic signals for characterization of nonpathological phonation.
Journal of the Acoustical Society of America, 2004, vol. 115, pp. 1321-1332.
[101] Higgins, M. & Saxman, J., A comparison of selected phonatory behaviours of healthy
aged and young adults. Journal of Speech and Hearing Research, 1991, vol. 34, pp.
1000-1010.
[102] Buder, E. H., Acoustic analysis of voice quality: A tabulation of algorithms 1902–1990,
in Voice Quality Measurement, Singular, San Diego, pp. 119–244
[103] Wendahl, R. W., Some parameters of auditory roughness, Folia Phoniatr., vol. 18, pp.
26–32.
[104] Wendahl, R. W., Laryngeal analog synthesis of jitter and shimmer auditory parameters
of harshness, Folia Phoniatr. vol. 18, pp. 98–108.
[105] Hillenbrand, J., A methodological study of perturbation and additive noise in
synthetically generated voice signals, J. Speech Hear. Res. vol. 30, pp. 448–461.
[106] Seiichi T. and Tatsuya H., A glottal waveform model for high-quality speech synthesis,
J. Acoust Soc of Am, 1990, vol. 88, pp.152-160
[107] Klatt, D. H., Software for a cascade/parallel. formant synthesizer, J. Acoust. Soc. Am.
1980, vol. 67, pp. 971–995.
[108] A.E.Rosenberg, Effects of pulse shape on the quality of natural vowels, J. Acoust. Soc.
Am., 1973, vol. 49, pp.583-591.
[109] I.R.Titze, Synthesis of sung vowels using a time-domain approach, in Transcripts of the
11th Symp.: Care of the Prof. Voice, V.L.Lawrence Ed.New York: The Voice
Foundation, pp. 90-98, 1982.
[110] Donovan R., Trainable Speech Synthesis. PhD. Thesis. Cambridge University
Engineering Department, England. 1996.
[111] Valbret H., Moulines E., Tubach J., Voice Transformation Using PSOLA Techique.
Proceedings of Eurospeech 91, 1991, vol. 1, pp. 345-348.
- 151 -
[112] Charpentier F., Moulines E., Pitch-Synchronous Waveform Prosessing Techniques for
Text-to-Speech Synthesis Using Diphones. Proceedings of Eurospeech, 1989, vol. 89,
pp. 13-19.
[113] Kleijn K., Paliwal K. (Editors., Speech Coding and Synthesis. Elsevier Science B.V.,
The Netherlands. 1998
[114] Kortekaas R., Kohlrausch A., Psychoacoustical Evaluation of the Pitch-Synchronous
Overlap-and-Add Speech-Waveform Manipulation Technique Using Single-Formant
Stimuli. JASA, 1994, vol. 101, pp. 2202-2213.
[115] Behlau M, Pontes P. Avaliação e Tratamento das Disfonias. São Paulo: Lovise; 1995.
[116] BEHLAU, M.; PONTES, P. As chamadas disfonias espasmódicas: dificuldades de
diagnóstico e tratamento. R. Bras. Otorrinolaringol., São Paulo, 1997, vol. 63, supl. 1,
p. 4-27.
[117] Coleman RF, Mabis J, Hinson J. Fundamental frequency-sound pressure level profiles
of adult male and female voices. Journal of Speech and Hearing Research, 1977, vol.
20, pp. 197-204.
[118] Behlau M, Madazio G, Feijó D, Pontes P. Avaliação da Voz. In: Behlau M (org.) Voz -
O Livro do Especialista. Vol. I. Rio de Janeiro: Revinter; 2001. Cap. 3, 86-180.
[119] CCITT, CODING OF SPEECH AT 16 kbit/s USING LOW-DELAY CODE EXCITED
LINEAR PREDICTION, 1992
[120] P. Moulin, Wavelet thresholding techniques for power spectrum estimation, IEEE Trans.
Signal Processing, 1994, vol. 42, pp. 3126–3136.
[121] A. T. Walden, D. B. Percival, and E. J. McCoy, Spectrum estimation by wavelet
thresholding of multitaper estimators, IEEE Trans. Signal Processing, 1998, vol. 46, pp.
3153–3165.
[122] Flandrin, P., Rilling, G., and Goncalves, P., Empirical Mode Decomposition as a Filter
Bank, IEEE Signal Processing Letters, 2004, pp. 112 – 114.
[123] Huang, N., Attoh-Okine N. O., The Hilbert-Huang Transform in Engineering,
Taylor&Francis, CRC, 2005.
[124] Rabiner L. and Juang B. H., Fundamentals of speech recognition, Prentence Hall, NJ,
1993.
[125] C.S. Blackburn, Articulatory Methods for Speech Production and Recognition, PhD
Thesis, Cambridge University Engineering Department, 1996.
- 152 -
[126] Baer, T., Lofqvist, A& McGarr, N., Laryngeal vibrations: A comparison between high-
speed filming and glottographic techniques, J. of Acoust Soc of Am., 1983, vol. 73, pp.
1304-1308.
[127] Klatt, D, Review of text-to-speech conversion for english, J. of Acoust Soc of Am., 1987,
vol. 82, pp. 737-793.
[128] S. Haykin, Neural Networks (A comprehensive Foundation), 2nd Edition, Prentice-Hall,
Englewood CliLs, NJ, 1999.
[129] S. Haykin, Neural Networks Expand SP’s Horizons, I . Mag., 1996, vol. 13, pp. 24–49.
[130] J.C. Principe, A. Rathie, J.M. Kuo, Prediction of chaotic time series with neural
networks and the issue of dynamic modeling, Int. J. Bifurcation Chaos, 1992, vol. 2, pp.
989–996.
[131] V.J. Mathews, G.L. Sicuranza, Polynomial Signal Processing, Wiley Publishers, New
York, 2000.
[132] Y.H. Pao, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley,
Reading, MA, 1989.
[133] K. Hornik, M. Stinchombe, H. White, Multilayer feedforward networks are universal
approximators, Neural Networks, pp. 359–366, 1989.
[134] K.J. Lang, G.E. Hinton, The development of the time-delayneural networks architecture
for speech recognition, Technical Report CMU-CS-88-152, Carnegie Mellon
University, Pittsburgh, PA.
[135] R.J. Williams, J. Peng, An e,cient gradient-based algorithm for on-line training of
recurrent network trajectories, Neural Comput., 1990, vol. 2, pp. 490–501.
[136] A. Cichocki, R. Unbehauen, Neural Networks for Optimization and Signal Processing,
Wiley, B.G. Teubner, Stuttgart, 1993.
[137] B. Widrow, M. Lehr, 30 years of adaptive neural networks: perceptron, adaline and
backpropagation, Proc. IEEE, 1990, vol. 78.
[138] K.S. Narendra, K. Parthasarathy, Identi-cation and control of dynamical systems
containing neural networks, IEEE Trans. Neural Networks, 1990, vol. 1, pp. 4–27.
[139] B. De Vries, J.C. Principe, The gamma model-A new neural model for temporal
processing, Neural Networks, 1992, vol. 5, pp. 565–576.
- 153 -
[140] T. Chen, H. Chen, Approximation of continuous functionals byneural networks with
application to dynamic systems, IEEE Trans. Neural Networks, 1993, vol. 4, pp. 910–
918.
[141] G. Cybenko, Approximation by superposition of a sigmoidal function, in: Mathematical
Control Signals Systems, Vol. 2, Springer, New York, 1989.
[142] S. Haykin, Adaptive Filter Theory, 3th Edition, Prentice-Hall, Englewood CliLs, NJ,
1996.
[143] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, New York, 1995.
[144] H. Yasukawa, Signal restoration of broad band speech using nonlinear processing,
Proceedings of EUSIPCO’96, Trieste, Italy, Sept. 1996.
[145] R.E. Crochiere, L.R. Rabiner, Multirate Digital Signal Processing, Prentice-Hall,
Englewood CliLs, NJ, 1983.
[146] N.J. Fleige, Multirate Digital Signal Processing (Multirate systems, Filter Banks,
Wavelet), Wiley, New York, 1994.
[147] M.R. Petraglia, S.K. Mitra, Performance analysis of adaptive -lter structures based on
subband decomposition, Proceedings of the IEEE International Symposium on Circuit
and Systems, Chicago, IL, 1993, pp. 60–63.
[148] G. Cocchi, A. Uncini, Subband neural networks prediction for on-line audio signal
recovery, IEEE Trans. Neural Network, 2002, vol. 13, pp. 867–876.
- 154 -
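ABSTRACT IN KOREAN
Estimation of Postoperative Vowel of Benign Vocal Fold Lesions using
Nonlinear Speech Production Modeling
The Graduate School, Yonsei University
Department of Biomedical Engineering
Seung Jin Jang
In pathological voices, perceptual aperiodicity arises mainly from perturbation factors such as
variation of the fundamental period (jitter), fluctuation of amplitude (shimmer), and noise. These
factors are chiefly influenced by loss of control over glottal vibration, by masses on the glottis, and
by noise generated during radiation and breathing. The hypothesis of this study is that removing
these perturbation factors from a pathological voice can yield an improvement similar to that of the
postoperative voice.
In this study, based on the results of voice and electroglottographic (EGG) examinations of
preoperative and postoperative vowels, a model for estimating the postoperative vowel of benign
vocal fold lesions was designed and implemented through nonlinear speech modeling based on the
nonlinear autoregressive with exogenous inputs (NARX) method.
First, for accurate voice analysis, a robust pitch detection algorithm for pathological voices was
proposed. Unlike other existing pitch detection algorithms, the proposed algorithm, which is based
on fast orthogonal search, substantially reduces gross pitch errors, in particular pitch halving
errors.
Next, various measures related to the voice and EGG examinations were obtained twice, before and
after surgery, from 42 patients with benign vocal fold lesions. The mean pitch of the male group
decreased by about 12-15 %, whereas that of the female group did not change significantly.
Formant frequencies remained constant before and after surgery. Most jitter measures changed
with statistical significance, whereas only some of the shimmer measures changed after surgery.
Noise estimation measures such as harmonic-to-noise ratio (HNR), normalized noise energy
(NNE), degree of hoarseness (DH), and normalized first harmonic energy (NFHE) differed
significantly only for some phonations, depending on sex. Among the EGG-related measures, the
open quotient (OQ) and speed quotient (SQ) showed no change; notably, however, the two groups
separated by mean SQ value were both found to return to within the normal range.
Based on these results, the preoperative vowels were modified so as to improve them to a
perceptual level comparable to that of normal voice. The degree of modification was adjusted
according to statistical results based on the differences between preoperative and postoperative
voices. Pitch period, amplitude, and breathy noise were modified by pitch synchronous overlap and
add (PSOLA), an amplitude adjuster, and wavelet thresholding, together with removal of the
baseline wander of the EGG signal. The modified voice and glottal signals were then used as input
signals to NARX nonlinear speech modeling based on least-squares support vector regression
(SVR).
Finally, the preoperative vowels modified on the basis of the voice and EGG examinations were
shown to be considerably similar to the postoperative vowels in the frequency and dynamic
domains. In addition, nonlinear speech modeling using SVR-based NARX outperformed LPC in the
perceptual quality of the vowels. This result is presumably because LPC generates artificial speech
that lacks naturalness, whereas the proposed model preserves natural jitter, shimmer, and noise.
Keywords: pitch detection algorithm, benign vocal fold lesions, nonlinear speech modeling,
nonlinear autoregressive exogenous model, voice analysis, electroglottographic analysis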