Estimation of Postoperative Vowel

of Benign Vocal Fold Lesions using

Nonlinear Speech Production Modeling

Seung Jin Jang

The Graduate School

Yonsei University

Department of Biomedical Engineering

Estimation of Postoperative Vowel

of Benign Vocal Fold Lesions using

Nonlinear Speech Production Modeling

A Dissertation

Submitted to the Department of Biomedical Engineering

and the Graduate School of Yonsei University

in partial fulfillment of the

requirements for the degree of

Doctor of Philosophy

Seung Jin Jang

August 2007

This certifies that the dissertation of Seung Jin Jang is approved.

The Graduate School

Yonsei University

August 2007

ACKNOWLEDGMENTS

First and foremost, I would like to thank my parents for their continuous concern, generous support, and endless love. They have encouraged me to keep working on fundamental problems. This dissertation is dedicated to them.

I would like to thank my dissertation advisor, Professor Young-Ro Yoon, for his insightful advice. His abundant scientific experience and approach have been an inspiration to me throughout my graduate study and research. He has been not only an academic advisor who guided my research, but also a great mentor whose enthusiasm and patience enlightened me throughout this dissertation work.

I would also like to gratefully acknowledge the valuable assistance of the other members of my dissertation committee, Professor Kyoung-Joung Lee, Professor Kyoung-Hwan Kim, Professor Young-Cheol Park, and Professor Hong-Sik Choi. I sincerely thank Professor Hong-Sik Choi and Professor Young-Cheol Park for their scholarly comments and encouragement on my dissertation. In addition, Professor Hyung-Ro Yoon, Professor Yoon-Sun Lee, Professor Dong-Yoon Kim, Professor Young-Ho Kim, Professor Tae-Min Shin, Professor Hyo-Sung Jo, Professor Bup-Min Kim, and Professor Han-Sung Kim have followed my dissertation work closely during the past five years of my graduate study. Their insight and knowledge in biomedical engineering have greatly influenced many aspects of my research. They have always inspired me and encouraged my diligence.

I also want to thank Professor Seung-Hoon Park at Kyunghee University, Professor Sung-Oh Hwang at Wonju Medical College, Professor Dong-Yul Na in the Department of Computer & Telecommunication Engineering, and Chul-Gyu Lee at Konkuk University. They have always been an academic guiding light to me. I also appreciate every professor in the Department of Biomedical Engineering, including those with whom I had no direct contact.

I am deeply grateful to my office mates, Hyo-min Kim and Sang-Ha Song; my old colleagues, Young-Gu Yoon and Jung-Woo Lee; and my benefactor, Hironori Suzaki in Japan. Without their generous support and collaboration, this dissertation would never have been finished.

I also thank Dr. Sung-Hee Choi, Dr. Jae-Name Choi, Dr. Sung-Eun Im, Dr. Jae-Ok Kim, and Dr. Hae-Suk Park at the Institute of Logopedics & Phoniatrics, Yong-Dong Severance Hospital, for their friendly suggestions, useful information, and help in acquiring the speech database. In particular, I would like to mention Dr. Sung-Hee Choi for her practical advice and Dr. Jae-Name Choi for her unsparing help.

I would like to thank my former officemates and other students, Dr. Won-Sik Kim, Dr. Dong-Ik Cha, Dr. Hong-Mo Sung, Dr. Jae-Woo Shin, Dr. Ah-ram Sul, Woo-Hee Lee, Won-Suk Jang, Sung-Yoon Kim, Suk-Gyun Hong, Hae-Won Choi, Seung-Ha Lee, Byung-Yoon Kang, Min-Suk Cha, Joo-Sung Lee, Sae-Lim Park, Gyu-Suk Hong, Jung-Hoon Lee, Hun Shim, Yong-Ju Yang, Zip-Min Jung, Joo-Hwan Lee, and Yong-Gu Jang, for creating a pleasant working environment for me. The administrative support of our system officer, Jong-Su Ahn, Myung-Bae Yang, Gyung-Ja Kim, Mi-Hyung Lee, and Byung-Wook Kim is also highly appreciated.

I would like to thank my old and close friends, Ji-Eun Lee, Sang-Woo Kim, Won-Wu Lee, Byung-Geun Hong, Jun-young Lee, Seung-Hoon Son, Byung-Joo Lee, Jin-Won Kang, Jun-Hee Yoon, Hee-Joong Lee, Sang-Hoon Han, Dae-Young Kim, Ki-Tae Park, Phil-Sung Oh, Seung-Hyun Son, Chan-Ho Lee, Hyun Heo, Sang-Hoon Han, Dong-Gyu Shin, Dong-Sun Kim, Dae-Geun Jeon, Dong-Won Kang, Ki-Won Lee, Nam-Hoon Kim, Jong-Gu Lee, Seung-Hoon Kim, and Gi-Sik Tae. Their encouragement has always sustained me, even though they are far away.

“Patience is truly hard, yet it is the only thing worth learning. Nature and growth, peace and prosperity and beauty all rest on patience, and they require time, stillness, and trust.”

Hermann Hesse

August 2007

from Seung-Jin Jang

CONTENTS

FIGURE LEGENDS
TABLE LEGENDS
ABBREVIATIONS
ABSTRACT

1. Introduction
  1.1 General Backgrounds
    1.1.1 Mechanism of the Voice Production
    1.1.2 A Brief View of the Voice Disorders: Benign Vocal Fold Lesions Focus
    1.1.3 Speech Production Model
  1.2 Problem Definition
  1.3 Organization of the Thesis

2. Robust Pitch Detection Algorithm for Pathological Voice
  2.1 Introduction
    2.1.1 Introduction to Pitch (Fundamental Frequency; F0) Perception
    2.1.2 Difficulties of Pitch Estimation
    2.1.3 Characteristics of Pathological Voice
  2.2 Review of the Several Established PDAs
    2.2.1 Time Domain Approaches
      2.2.1.1 Autocorrelation (AC)
      2.2.1.2 Average Magnitude Difference Function (AMDF)
      2.2.1.3 YIN
    2.2.2 Frequency Domain Approaches
      2.2.2.1 CEPSTRUM
      2.2.2.2 Simplified Inverse Filtering Techniques (SIFT)
    2.2.3 Alternative Approaches
      2.2.3.1 Wavelet
      2.2.3.2 State-Space Embedding
  2.3 Robust Pitch Detection Algorithm for Pathological Voice Based on Fast Orthogonal Search
    2.3.1 Introduction of Fast Orthogonal Search Algorithm
    2.3.2 Pitch Selection
  2.4 Experimental Procedure
    2.4.1 Speech Database
    2.4.2 Preconditions of Performance Evaluation
    2.4.3 Error Types of PDA
    2.4.4 Optimum Window Selection
  2.5 Experimental Results
    2.5.1 Evaluating Performance of PDAs in Normal versus BVFL Voices
    2.5.2 Evaluating Performance of PDAs in Aperiodicity Level of Voices
  2.6 Summary

3. Comparison of Acoustic and Electroglottographic Parameters of BVFL before and after Laryngeal Surgery
  3.1 Introduction to Acoustic and Electroglottographic Analysis on Vowel
    3.1.1 Acoustic Analysis
    3.1.2 Electroglottographic Analysis
    3.1.3 Measurement of Pathological Voice
  3.2 Methods and Experiments
    3.2.1 Experimental Data and Protocol
    3.2.2 Analysis and Results of Formant Frequencies
      3.2.2.1 Estimation of Formant Frequencies
      3.2.2.2 Comparative Results
  3.3 Analysis and Results of Fundamental Frequency Perturbation (Jitter)
    3.3.1 Various Jitter Measures
    3.3.2 Comparative Results
    3.3.3 Analysis and Results of Intensity Perturbation (Shimmer)
      3.3.3.1 Various Shimmer Measures
      3.3.3.2 Comparative Results
    3.3.4 Analysis and Results of Noise Components
      3.3.4.1 Estimation of the Noise in the Spectral Domain
      3.3.4.2 Estimation of Harmonic-to-Noise Ratio (HNR)
    3.3.5 Estimation of Degree of Hoarse (DH) and Normalized Noise Energy (NNE)
    3.3.6 Estimation of the Normalized First Harmonic Energy (NFHE)
      3.3.6.1 Comparative Results
    3.3.7 Analysis and Results of Electroglottographic Parameters
      3.3.7.1 Estimation of Open Quotient and Speed Quotient
      3.3.7.2 Comparative Results
  3.4 Summary

4. Modification of Preoperative Vowel Sounds based on Acoustic and Electroglottographic Analysis
  4.1 Introduction to Perception of Aperiodicity in Pathological Voices
  4.2 Synthesized Vowel Modeling
    4.2.1 Glottal Waveform Modeling
      4.2.1.1 Rosenberg’s Model
      4.2.1.2 Titze’s Model
    4.2.2 Aperiodicity of Glottal Waveform
  4.3 Modifications of Preoperative Vowel
    4.3.1 Design of Modification of Fundamental Frequency
      4.3.1.1 Pitch Scale Modification and Jitter using PSOLA
      4.3.1.2 Modification of Intensity
      4.3.1.3 Short term Postfilter
  4.4 Design of Enhancement of Noise Components
    4.4.1 Introduction to Wavelet Transform Threshold Shrinkage
    4.4.2 Determination of Adaptive Threshold
  4.5 Modification of Baseline Wander of EGG Signal
    4.5.1 Introduction to Empirical Mode Decomposition
  4.6 Summary

5. Nonlinear Speech Production Modeling using Nonlinear Autoregressive Exogenous based on Support Vector Regression
  5.1 Introduction of Speech Production Modeling
    5.1.1 Overview of Linear Speech Production Modeling
    5.1.2 Limitations of Linear Speech Production Modeling
  5.2 Overview of Nonlinear Speech Production Modeling on Support Vector Regression
    5.2.1 Review of Former Research in Nonlinear Speech Production Modeling
    5.2.2 Introduction of Support Vector Machine for Nonlinear Regression
  5.3 Nonlinear Speech Production Modeling based on Support Vector Regression
    5.3.1 NARX using SVR Model
    5.3.2 Optimum Parameter Selection
  5.4 Evaluation of NARX using SVR Model
    5.4.1 Multi-band Model
  5.5 Experimental Results
  5.6 Summary

6. Conclusion

Appendix A
Appendix B
References
Abstract in Korean

FIGURE LEGENDS

Figure 1-1. The subsystems of voice
Figure 1-2. Benign vocal fold lesions pictured by stroboscope (KayLab). (A) Vocal fold nodules, (B) Vocal fold cyst, (C) Vocal fold polyp, (D) Normal vocal fold
Figure 1-3. Diagram of the source-filter theory
Figure 2-1. Episodes of unvoiced sound |t| and voiced sound |a|
Figure 2-2. Vibratory pattern of vocal folds with a single opening and closing & airflow between the vocal folds as change of vocal folds vibration
Figure 2-3. Episodes of pitch doubling and halving error: pitch estimated by autocorrelation method
Figure 2-4. Pitch detection standard, domain and boundary of AC: (left) analyzed speech signal, (right) pitch selection in time domain after autocorrelation
Figure 2-5. Pitch detection standard, domain and boundary of AMDF: (left) analyzed speech signal, (right) pitch selection in time domain after AMDF
Figure 2-6. Pitch detection standard, domain and boundary of YIN: (left) analyzed speech signal, (right) pitch selection in time domain after YIN
Figure 2-7. Pitch detection standard, domain and boundary of Cepstrum: (left) analyzed speech signal, (right) pitch selection in quefrency domain after Cepstrum
Figure 2-8. Overall process of SIFT (top) and Pitch detection standard, domain and boundary of SIFT: (bottom left) analyzed speech signal, (bottom right) pitch selection in time domain after SIFT (Hybrid method)
Figure 2-9. Diagram of fast lifting wavelet transform
Figure 2-10. Pitch detection standard, domain and boundary of Wavelet: (left) analyzed speech signal, (right) pitch selection in time domain after FLWT
Figure 2-11. Pitch detection standard, domain and boundary of State-Space Embedding: pitch selection in periodicity histogram (time domain) after singular value decomposition
Figure 2-12. Two episodes of pitch selection in FOS
Figure 2-13. Pitch selection of boundary of a third of global maximum peak
Figure 2-14. Gross error rates as a function of window size in normal database
Figure 2-15. Gross error rates as a function of window size in BVFL database
Figure 3-1. Episodes of speech, spectrum, and EGG of normal and pathological voices
Figure 3-2. Formant frequencies tracking based on phase spectrum of LPC
Figure 3-3. Box plots of F1, F2, and F3 formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-4. Box plots of F1, F2, and F3 formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 3-5. Loci of the mean and 1 S.D. of F1 and F2 formant frequencies of male before and after surgery
Figure 3-6. Loci of the mean and 1 S.D. of F1 and F2 formant frequencies of female before and after surgery
Figure 3-7. Mean F0 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-8. Jitter (%) of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-9. Pitch perturbation factor of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-10. RAPP15 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-11. Shimmer (%) of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-12. RAAP15 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery
Figure 3-13. HNR ratio calculation using cepstral smoothing in spectrum
Figure 3-14. Influence of cepstral smoothing due to liftered long-term temporal window (57 ms)
Figure 3-15. Influence of cepstral smoothing due to liftered short-term temporal window (1 ms)
Figure 3-16. Plot of harmonic and noise phase segment in spectral domain
Figure 3-17. Box plots of HNR, NNE, and DH of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-18. Box plots of HNR, NNE, and DH of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 3-19. Box plots of NFHE of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-20. Box plots of NFHE of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 3-21. Determination of start point of opening phase and closing phase in EGG, 16-smoothed EGG, and differentiated EGG waveform
Figure 3-22. Detail definition of opening and closing phase period in EGG waveform
Figure 3-23. Box plots of OQ and SQ of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery
Figure 3-24. Box plots of OQ and SQ of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery
Figure 4-1. Glottal waveform generated by Rosenberg’s model
Figure 4-2. Glottal waveform generated by Titze’s model
Figure 4-3. Pitch period modification by PSOLA
Figure 4-4. Episodes of pitch scale modification by PSOLA
Figure 4-5. Intensity modification by Shimmer (%) of 2.5 %
Figure 4-6. Plots of short term postfiltered voiced sound in time and spectral domain
Figure 4-7. Plots of HNR, NNE, NFHE, and DH as a function of (a) jitter (S.D. 40 %), (b) shimmer (40 %), and (c) noise (6 %), and (d) magnification of (c) in NFHE
Figure 4-8. Examples of synthetic voiced sound |a| (a) with shimmer and noise of 5 % and jitter of 0.75 %, (b) with shimmer and noise of 40 % and jitter of 6 %
Figure 4-9. Plots of HNR as a function of jitter (S.D. 0.75-6.0 %) & noise and shimmer (S.D. 5-40 %) for phonation |a|, |e|, |i|, |o|, |u| for female group
Figure 4-10. Plots of HNR as a function of jitter (S.D. 0.75-6.0 %) & noise and shimmer (S.D. 5-40 %) for phonation |a|, |e|, |i|, |o|, |u| for male group
Figure 4-11. Episode of denoising with wavelet threshold shrinkage
Figure 4-12. Process diagram of EMD
Figure 4-13. Reduction of baseline wander in EGG waveform by EMD and high pass filter with FIR 500-order (Pass band: over 40 Hz); voiced sound |u| with sampling rate of 22050 Hz
Figure 4-14. Plots of IMFs and residue of EGG waveform in Figure 4-7
Figure 5-1. Buffered MLP structure with input TDL
Figure 5-2. A scheme of NARX using SVR
Figure 5-3. 2D-plot of selection of optimum σ² and γ for phonation |i| of male group
Figure 5-4. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |a| of male
Figure 5-5. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |i| of male
Figure 5-6. Multiband SVR Model with wavelet filterbank
Figure 5-7. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |i| of male
Figure 5-8. Comparison of spectrogram between original speech signal and synthesized speech signal

TABLE LEGENDS

Table 1-1. Classification of voice disorder
Table 1-2. Phonatory function examinations
Table 1-3. Detail characteristics of benign vocal fold lesions: Nodules, Polyps, and Cysts
Table 2-1. Results of performance of the available PDAs in database of normal and BVFL
Table 2-2. Results of performance of the available PDAs in database of polyp (Npolyp = 195)
Table 2-3. Results of performance of the available PDAs in database of cyst (Ncyst = 85)
Table 2-4. Results of performance of the available PDAs in nodule database (Nnodule = 140)
Table 2-5. Results of performance of the available PDAs in database of normal and BVFL
Table 2-6. Results of performance of the available PDAs in database of normal and BVFL
Table 3-1. Mean and S.D. of formant frequencies from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 3-2. Various jitter measures
Table 3-3. Various shimmer measures
Table 3-4. Mean and S.D. of formant frequencies from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 3-5. Mean and S.D. of NFHE from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 3-6. Mean and S.D. of open quotient and speed quotient from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 3-7. Mean and S.D. of speed quotient from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery
Table 4-1. Correlation coefficients of each noise estimation as a change of jitter, shimmer, and noise
Table 5-1. Results of optimized σ², γ and mean square error of RBF kernel in phonation |a|, |e|, |i|, |o|, |u| for both sexes
Table 5-2. Results of jitter (%) between synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes
Table 5-3. Results of Lyapunov exponents between synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes


ABBREVIATIONS

AC Autocorrelation

AMDF Average Magnitude Difference Function

APF Amplitude Perturbation Factor

AR Autoregressive

BVFL Benign Vocal Fold Lesions

DAPF Directional Amplitude Perturbation Factor

DEGG Derivative Electroglottography

DH Degree of Hoarse

DPPF Directional Pitch Perturbation Factor

EGG Electroglottography

EMD Empirical Mode Decomposition

FOS Fast Orthogonal Search

GCI Glottal Closure Instant

HHT Hilbert-Huang Transform

HNR Harmonic-to-Noise Ratio

IMF Intrinsic Mode Function

LS-SVM Least-Squares SVM

LP Linear Prediction

MAJ Mean Absolute Jitter

MAS Mean Absolute Shimmer

NFHE Normalized First Harmonic Energy

NN Neural Network

NNE Normalized Noise Energy

OQ Open Quotient

PDA Pitch Detection Algorithm

PPF Pitch Perturbation Factor

PSOLA Pitch Synchronous Overlap and Add

RAAP Relative Average Amplitude Perturbation

RAPP Relative Amplitude Pitch Perturbation

SIFT Simplified Inverse Filter Tracking

SQ Speed Quotient

SVM Support Vector Machine

SVR Support Vector Regression


ABSTRACT

Estimation of Postoperative Vowel of Benign Vocal Fold Lesions

using Nonlinear Speech Modeling

Seung-Jin Jang

Department of Biomedical Engineering

The Graduate School

Yonsei University


In pathological voices, perceptual aperiodicity is mainly caused by perturbation factors such as jitter, shimmer, and noise. These factors arise mainly from poor control of vocal fold vibration, mass lesions of the vocal cords, and noise at emission and breathiness. Our hypothesis is that reducing these perturbation factors in a pathological voice can enhance it so that it resembles the postoperative voice.

For benign vocal fold lesions, this work designs and implements a method for estimating the postoperative vowel using nonlinear speech modeling based on a nonlinear autoregressive model with exogenous input (NARX), guided by acoustic and electroglottographic analysis of preoperative and postoperative sustained vowels.

First, a robust pitch detection algorithm (PDA) for pathological voice is proposed for accurate acoustic analysis. Compared with other established PDAs, the proposed PDA based on fast orthogonal search considerably reduces gross pitch errors, especially pitch halving errors.

Next, various acoustic and electroglottographic measurements were taken twice, before and after laryngeal surgery, for 42 subjects with benign vocal fold lesions. The mean pitch of the male group decreased by about 12-15% of the preoperative pitch, whereas that of the female group did not change significantly. Formant frequencies remained constant before and after surgery. Most jitter measures changed significantly, but only some shimmer measures differed after surgery. Among the noise-related measures, namely harmonic-to-noise ratio (HNR), normalized noise energy (NNE), degree of hoarse (DH), and normalized first harmonic energy (NFHE), some phonations showed significant differences depending on sex. No changes were found in the open quotient (OQ) and speed quotient (SQ) of the electroglottography (EGG) measures; however, when the subjects were divided into two groups by the mean SQ value, each SQ group showed a characteristic regression toward the normal range.

Based on these results, we modify the preoperative voiced sounds to bring their perceptual quality closer to that of a normal voice. Enhancement rates are set from statistical results on the difference between preoperative and postoperative speech sounds. Pitch period, intensity, and aspiration noise are modified by pitch synchronous overlap and add (PSOLA), an intensity modifier, and wavelet threshold shrinkage, respectively, and the baseline wander of the EGG signal is reduced using empirical mode decomposition (EMD). The modified speech and EGG signals are used as input signals to the nonlinear speech model, NARX based on least-squares support vector regression.

Finally, the preoperative vowel modified on the basis of acoustic and electroglottographic analysis resembles the postoperative vowel in the spectral and dynamic domains. Nonlinear speech modeling using NARX-based SVR also performed better than LPC in the perceptual quality of voiced sounds. We attribute this result to the conservation of natural jitter, shimmer, and noise, whereas LPC produces artificial sounds that lack naturalness.

Key Words: pitch detection algorithm, benign vocal fold lesions, nonlinear speech modeling,

nonlinear autoregressive exogenous, acoustic analysis, electroglottographic analysis


CHAPTER 1

Introduction

1.1 General Backgrounds

1.1.1 Mechanism of the Voice Production

Voice is produced by a complex, multi-organ system [1]. The system consists mainly of three subsystems, as shown in Fig. 1-1. First, voice production begins with the respiratory system. The respiratory organs supply the airflow, which is inhaled as the diaphragm lowers, and part of the aerodynamic energy of the air is converted to acoustic energy by the larynx, sometimes called the voice box. The larynx is positioned between the base of the tongue and the top of the trachea; it is a cylindrical framework of cartilage that anchors the vocal folds. The vocal folds, also called vocal cords, are two bands of smooth muscle tissue that lie opposite each other, housed within the larynx. The vocal folds play an essential role in producing the glottal sound. The glottal sound is resonated and filtered as it travels through the vocal tract. The vocal tract length varies between 13 cm and 20 cm across speakers and may change according to the sound produced. For an average adult male, the vocal tract is considered to be about 17 cm long in its rest position. Finally, the filtered glottal sound is radiated through the throat, nose, and mouth (resonating cavities). The size and shape of these cavities, along with the size and shape of the vocal folds, help to determine voice quality. Furthermore, variety within an individual voice results from lengthening or shortening, and tensing or relaxing, the vocal folds.

1.1.2 A Brief View of the Voice Disorders: Benign Vocal Fold Lesions Focus

Voice disorder, or dysphonia, is one of a group of problems involving abnormal pitch, loudness, or quality of the sound produced by a defective larynx [2]. Voice disorders are usually divided into three main categories: organic, functional, and a combination of the two. Organic voice disorders are divided into two groups [3]: structural and neurogenic. Structural disorders involve something physically wrong with the mechanism, often involving the tissue or fluids of the vocal folds. Neurogenic disorders are caused by a problem in the nervous system as it interacts with the larynx. Functional disorders are caused by poor muscle functioning. All functional disorders fall under the category of muscle tension dysphonia. Psychogenic disorders also exist, because the voice can be disturbed for psychological reasons. In this case, there is no structural reason for the voice disorder, and there may or may not be some pattern of muscle tension. A detailed classification of voice disorders is given in Table 1-1.

Figure 1-1. The subsystems of voice

In clinical settings, phonatory function examinations aim to determine the diagnosis of the lesion and its size, vibratory mode, and degree of dysphonia, which lead to the establishment of a treatment strategy. Aside from the subjective impressions of the patient and voice therapist, objective measures are available to aid in assessing laryngeal function before and after surgery. Acoustic, phonatory airflow, and qualitative stroboscopic measurements, among others (summarized in Table 1-2), have been used to analyze the results of microlaryngeal phonosurgery [4].


This dissertation focuses on only three structural voice disorders: nodules, polyps, and cysts, usually called benign vocal fold lesions (BVFL). There are three reasons: 1) these lesions are commonly treated surgically, although surgery is not the modal treatment; 2) excision is therefore the only factor that affects the voice quality between preoperative and postoperative voiced sounds; and 3) they are usually reversible, i.e., they resolve, and recurrence rates are low. Benign vocal fold lesions are non-cancerous growths of abnormal tissue on the vocal folds, so they are not life-threatening pathologies. However, they are important because of their influence on voice quality, and their excessive growth may affect breathing patterns.

Figure 1-2. Benign vocal fold lesions pictured by stroboscope (KayLab). (A) Vocal fold nodules, (B) vocal fold cyst, (C) vocal fold polyp, (D) normal vocal fold.

Table 1-1. Classification of voice disorder

Type of voice disorder: Organic. Division: Structural.

• Contact ulcers

• Nodules (nodes)

• Cysts

• Polyps

• Granuloma

• Hemorrhage

• Hyperkeratosis

• Laryngitis

• Leukoplakia

• Trauma

• Miscellaneous growths

• Papilloma

Type of voice disorder: Organic. Division: Neurogenic.

• Paralysis/Paresis

• Spasmodic dysphonia (laryngeal dystonia)

• Tremor (benign essential tremor)

• Voice problem caused by another neurological disorder (e.g., Parkinson's disease, myasthenia gravis, ALS/Lou Gehrig's disease)

Type of voice disorder: Functional.

• Muscle tension dysphonia

• Anterior-posterior constriction

• Hyperabduction

• Hyperadduction

• Pharyngeal constriction

• Ventricular phonation

• Vocal fold bowing

Type of voice disorder: Psychogenic.

• Conversion dysphonia (aphonia)

• Puberphonia (mutational falsetto)


Table 1-2. Phonatory function examinations

Examination: Aerodynamics

• Subglottic pressure

• Supraglottic pressure

• Glottal impedance

• Volume velocity of the airflow at the glottis (mean airflow rate)

Examination: Stroboscope (vocal fold vibration)

• Regularity or periodicity

• Symmetry between the vocal folds

• Glottal closure (glottal area waveform)

• Amplitude

• Mucosal wave

• Non-vibrating portion

Examination: Acoustic analysis

• Fundamental frequency (F0; pitch)

• Intensity

• Perturbations of pitch (various jitter and shimmer measures)

• Amount of noise

Examination: Glottis analysis

• EGG (e.g., speed quotient, open quotient)

• PGG

Examination: Psychophysical measurement

• GRBAS scale

• Vocal Profile Analysis (VPA)

• Buffalo III Voice Profile

Examination: Phonatory ability

• Various physical phonatory parameters (e.g., maximum phonation time)

Examination: Voice profile

• Frequency range

• Intensity range

In more detail, vocal fold nodules, polyps, and cysts are defined as separate entities by otolaryngologists and voice pathologists based on their anatomic location and gross appearance. A polyp is defined as a lesion on the anterior third of the vocal fold. It may be sessile or pedunculated and, if pedunculated, very mobile. A nodule is defined as a small lesion occurring on both sides of the vocal fold, strictly symmetric, on the border of the anterior and middle third of the vocal fold, and usually immobile during phonation. The lesion is confined to the superficial layer and arises from trauma or irritation to the vocal fold [5]. Cysts are divided into two types: mucous retention cysts and squamous inclusion cysts. Mucous retention cysts usually arise below the free margin of the glottis and are translucent collections of mucus, likely arising from a plugged mucous gland duct. Squamous inclusion cysts appear as yellow fusiform masses within the lamina propria. The vocal fold mucosal wave is reduced or absent, and the amplitude of vibration is moderately to severely decreased [6]. The detailed characteristics of each BVFL are given in Table 1-3 [7].

Table 1-3. Detail characteristics of benign vocal fold lesions: Nodules, Polyps, and Cysts

Appearance

• Nodules: blister-like or callous-like; symmetric, firm, full

• Polyps: solid or fluid filled, thin surface (can be quite large)

• Cysts: undulation or fullness

Color

• Nodules: white to opaque

• Polyps: translucent to red

• Cysts: translucent to yellow

Location

• Nodules: anterior-middle third junction

• Polyps: free edge of anterior third

• Cysts: free edge of superior surface of middle third

Closure configuration

• Nodules: posterior chink, hourglass

• Polyps: irregular

• Cysts: likely complete

Free edge roughness

• Nodules: slight

• Polyps: moderate to severe

• Cysts: smooth to slight

Amplitude

• Nodules: slight to moderate decrease

• Polyps: slight to severe decrease

• Cysts: moderate to severe decrease

Wave

• Nodules: any

• Polyps: present or increased

• Cysts: diminished or absent

Vibration

• Nodules: majority present

• Polyps: majority with complete or partial absence

• Cysts: majority with partial absence

Voice symptom

• Nodules: variable (from normal to breathy, very hoarse and strained); unable to sing high and soft notes (occurrence of delay in the onset of the sound with an audible air escape)

• Polyps: variable (from normal to severely dysphonic)

• Cysts: variable (from normal to breathy, very rough and hoarse)


1.1.3 Speech Production Model

There are broadly two approaches to developing a speech production model: articulatory modeling and acoustic modeling. Articulatory modeling models the positions and movements of the articulatory organs, on the premise that a similar output results from a similar underlying system. Articulatory models offer good reproduction with simple control and can reproduce all the perceptually relevant effects of real speech, such as co-articulation [8]. However, they require abundant information on the values and dimensions of the vocal tract and a detailed analysis of the movement of the articulators. In contrast, the acoustic modeling approach is widely adopted in speech production applications because it requires only the speech waveform, which is easily obtained with a recorder. The acoustic modeling approach models the speech waveform directly in either the time or the frequency domain.

Figure 1-3. Diagram of the source-filter theory

One of the most popular acoustic models is the source-filter speech production model, shown in Fig. 1-3 [9, 10]. This theory models speech as a combination of a sound source (representing the vocal folds), a filter (representing the vocal tract), and a radiator (representing the lips). The model usually distinguishes two kinds of phonemes by the properties of their sources and spectral shapes: voiced sounds and unvoiced sounds. For voiced sounds, the periodic glottal excitation is regarded as the source. For unvoiced sounds, the turbulent noise produced at a constriction in the vocal tract itself is regarded as the source. In this model, the transfer function of equation (1.1) characterizes the vocal tract system in the frequency domain.

$$H(f)\,R(f) = \frac{Y(f)}{X(f)} \tag{1.1}$$

The transfer function gives the ratio of the spectrum of the sound pressure wave, $Y(f)$, at some fixed distance from the lips, to that of the volume velocity wave, $X(f)$, at the source. $H(f)$ and $R(f)$ are the vocal tract transfer function and the lip radiation characteristic, respectively. Finally, the spectrum of speech is obtained by combining these transfer functions, as in equation (1.2).

$$Y(f) = X(f)\,H(f)\,R(f) \tag{1.2}$$
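As a rough numerical illustration of equation (1.2), the following Python sketch multiplies a source spectrum, a vocal tract response, and a lip radiation term on a common frequency grid. The impulse-train source, the three formant frequencies and bandwidths, the first-difference radiation term, and the function name `source_filter_spectrum` are illustrative assumptions and are not taken from this dissertation.

```python
import numpy as np

def source_filter_spectrum(f0=120.0,
                           formants=((700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)),
                           fs=16000, n=2048):
    """Return (freqs, X, H, R, Y) with Y(f) = X(f) H(f) R(f).

    All parameter values are hypothetical defaults chosen for illustration.
    """
    # Source X(f): spectrum of an impulse train with a period of about fs/f0 samples.
    x = np.zeros(n)
    period = int(round(fs / f0))
    x[::period] = 1.0
    X = np.fft.rfft(x)

    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    z_inv = np.exp(-2j * np.pi * freqs / fs)  # z^-1 evaluated on the unit circle

    # Vocal tract H(f): cascade of two-pole resonators, one per assumed formant.
    H = np.ones_like(z_inv)
    for fc, bw in formants:
        r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
        theta = 2.0 * np.pi * fc / fs         # pole angle from center frequency
        H = H / (1.0 - 2.0 * r * np.cos(theta) * z_inv + (r ** 2) * z_inv ** 2)

    # Lip radiation R(f): first difference, 1 - z^-1 (a common simplification).
    R = 1.0 - z_inv

    # Equation (1.2): the speech spectrum is the product of the three terms.
    Y = X * H * R
    return freqs, X, H, R, Y
```

Calling `source_filter_spectrum()` and plotting `20 * np.log10(np.abs(Y) + 1e-12)` against `freqs` should show harmonics of the source shaped by the assumed formant peaks.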

Linear prediction (LP) analysis [11] is generally adopted in the source-filter model for speech processing tasks such as synthesis, and forms the basis of most speech coding systems, such as vocoders, code excited linear prediction (CELP) coders, and multi-pulse coders. The popularity of linear prediction is due to its ease of analysis and implementation and its low computational requirements. An alternative approach is to model the transfer function of the vocal tract system, known as vocal tract modeling. Autoregressive with exogenous input (ARX), output error (OE), and state-space parameterizations of the vocal tract filter have been used; they differ in the underlying structure of the model and in the nature of the error that is minimized in the parameter estimation procedure.
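To make the idea of linear prediction concrete, here is a minimal Python sketch of the textbook autocorrelation method solved with the Levinson-Durbin recursion. It is a generic illustration, not the analysis procedure of this dissertation; the function name `lpc` and the absence of windowing or pre-emphasis are simplifying assumptions.

```python
import numpy as np

def lpc(signal, order):
    """Estimate predictor coefficients c[1..order] so that
    s[n] is approximated by sum_k c[k] * s[n-k].

    Returns (c, err), where err is the final prediction error energy.
    """
    s = np.asarray(signal, dtype=float)
    # Autocorrelation values r[0..order] of the analysis frame.
    r = np.array([np.dot(s[:len(s) - k], s[k:]) for k in range(order + 1)])

    # a holds the error-filter coefficients of A(z) = 1 + sum_k a[k] z^-k.
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)

    # Predictor coefficients have the opposite sign of the error-filter ones.
    return -a[1:], err
```

For example, applying `lpc(h, 2)` to the impulse response `h` of a stable second-order all-pole filter should approximately recover that filter's two feedback coefficients.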

Experimental and theoretical evidence shows that the speech production mechanism is nonlinear [12]. Nonlinearities in speech data are caused by rapid transitions between and during phones, especially plosives, where the vocal tract is occluded, and by turbulent excitation during unvoiced segments. Glottal opening and closure during the pitch periods of voiced speech cause coupling at the back of the throat, which introduces additional energy loss. Linear models of the vocal tract system have limited performance because they may not capture the structure of the data or the underlying system dynamics. Applying nonlinear models to speech prediction has shown a 2-3 dB improvement in prediction gain over linear models [13-15].

1.2 Problem Definition

In the field of voice pathology, speech-language pathologists have usually classified and described the nature of the pathological voice using scales or terms denoting its perceptual impression, such as hoarseness, breathiness, and roughness. Three formal perceptual protocols are commonly used worldwide: the Vocal Profile Analysis (VPA) [16], the GRBAS scale, and the Buffalo III Voice Profile [17]. Perceptual voice quality evaluation provides a baseline of the extent and type of the presenting problem and therefore allows the process of therapy to be monitored. This provides the clinician with a valuable clinical outcome tool. Perceptual evaluation of voice quality may be the most valid clinical outcome measure, because patients with voice disorders seek treatment when their voices do not sound normal and often judge whether treatment has been successful by whether or not they sound better.

However, perceptual evaluation of voice quality has some problems. Potential problems include the lack of a standard set of well-defined terms, the variability of the human voice, the limited reliability of perceptual voice quality ratings between and within raters, and considerable disparity in the design of the available perceptual rating scales [18-20]. Another problem also exists. Many patients who must undergo surgery to restore their voices worry about the result of the excision surgery. Presenting an estimated postoperative phonation can therefore relieve patients of the fear that phonation will not improve or will be worse than before. Moreover, the estimate shows patients how well the surgery worked, by comparing the postoperative sounds with ideal simulated sounds that have a normal range of personalized acoustic characteristics.

Some research has advanced the analysis of benign vocal fold lesions before and after surgery using acoustic, aerodynamic, and stroboscopic measures [21,22], glottal area waveform analysis [23], analysis of the characteristic features of muscle tension [24,25], and correlation analysis between acoustic examination and images of the vocal folds [26,27]. These studies are valuable but give little objective information on the influence of the operation. Few studies have addressed the estimation of postoperative phonation in benign vocal fold lesions. Kim [28] and Baek [29] studied the prediction of postoperative voice by speech synthesis in benign laryngeal diseases. Although these studies were the first on the prediction of postoperative phonation, they appeared to be impractical, because prediction based on linear prediction analysis adjusted to the normal range of jitter and shimmer lacks naturalness in phonation and the uniqueness that distinguishes individual human voices.

This dissertation addresses the problem of predicting postoperative phonation based on acoustic and electroglottographic analysis of preoperative and postoperative voiced sounds and on a natural speech production model. The overall strategy of our investigation is to build a refined nonlinear speech production model using speech modified with parameters extracted from diverse analyses. The primary contributions of this dissertation can be summarized as follows. First, we develop a robust pitch detection algorithm for pathological voices, in which pitch doubling and halving errors often occur, because established pitch detection algorithms perform poorly on aperiodic sounds such as pathological voices. Second, we investigate quantitative measures of preoperative and postoperative data by acoustic and electroglottographic analysis in order to find the differences before and after laryngeal surgery. Third, we propose a nonlinear speech production model based on nonlinear autoregressive with exogenous input (NARX) using least-squares support vector regression. Experimental results show that this production model is able to synthesize natural-quality speech. Finally, we compare some simulation results on the postoperative vowel.


1.3 Organization of the Thesis

This dissertation is organized as follows. Chapter 2 introduces a robust pitch detection algorithm for pathological voice using Fast Orthogonal Search (FOS) analysis. After established pitch detection algorithms (PDAs) are reviewed, the proposed PDA for pathological voice is presented and compared with those PDAs under various conditions such as normal/pathological, male/female, severity of voice pattern, age of subject, and phonation type. Chapter 3 presents the diverse acoustic and electroglottographic characteristics of the pathological voice of benign vocal fold lesions (BVFL) before and after laryngeal surgery. Pre- and post-treatment pathological voices in BVFL are analyzed by various measurements: perturbation of pitch period and intensity, noise due to physical changes of the vocal folds, and pattern of vocal fold vibration. Chapter 4 presents a more precisely predictive modeling of the postoperative vowel. Pitch synchronous overlap and add (PSOLA), used for pitch and formant frequency modification, and wavelet transform threshold shrinkage, used to estimate the noise-eliminated pathological voice, are described and evaluated. Reduction of the baseline wander of electroglottography (EGG) using empirical mode decomposition (EMD) is also presented. Chapter 5 introduces existing linear and nonlinear speech production models and then develops and tests the proposed nonlinear speech production model using least-squares support vector regression. Voice quality is also compared between the postoperative vowel and the synthesized vowel, and the experimental results are discussed. Finally, Chapter 6 gives a conclusion and comments on some interesting future research topics.


Chapter 2

Robust Pitch Detection Algorithm for Pathological Voice

2.1 Introduction

This chapter explores some of the issues, problems, and solutions involved in estimating the fundamental frequency of pathological voice, especially in relation to benign vocal fold lesions. First, the term fundamental frequency, often called pitch, is introduced. After some characteristics of pathological voices and the difficulties of pitch estimation are presented, several established pitch detection algorithms are briefly reviewed. Finally, the proposed robust pitch detection algorithm is introduced and evaluated under various conditions.

2.1.1 Introduction to Pitch (Fundamental Frequency; F0) Perception

Pitch is a fundamental auditory attribute for the perception of speech and music; it is related to the repetition rate of the sound waveform and co-varies with its fundamental frequency. Perception of pitch is a very complex sensory phenomenon that involves many sciences, such as physics, psychology, psychophysics, psychoacoustics, physiology, and neurological science. Therefore, no single theory or model can explain all the processes by which humans perceive pitch. Nevertheless, two long-standing rival theories of pitch perception exist, and most experts agree with one or the other, with their own variations [30-32]. One is the place theory [33], and the other is the temporal theory [34]. The place theory postulates that the rate-place patterning of neural firings (place-rate coding) is used to transmit frequency information to the central nervous system, while the temporal theory holds that, at the peripheral level of the auditory nerve, the temporal patterning of neural firings (temporal coding) is used. The inability to explain all the experimental data on pitch perception with a single theory or model led to the view that there might be two separate pitch perception mechanisms: a place or spectral mechanism for low, resolvable harmonics, and a temporal mechanism for high, unresolvable harmonics.


Pitch is loosely related to the logarithm of the frequency, with perceived pitch increasing by about an octave with every doubling in frequency. However, a frequency doubling below 1000 Hz corresponds to a pitch interval slightly less than an octave, while a frequency doubling above 5000 Hz corresponds to an interval slightly more than an octave [35-37]. In the time domain, the pitch information in voiced speech is present as quasi-periodic signal excursions, as shown in Fig. 2-1. The voiced sound |a| is caused by excitation of the vocal cords, whereas the unvoiced sound |t| is caused by the shape of the resonant cavity of the vocal tract. This periodic voiced sound can generally be labeled by eye, a method sometimes employed to obtain a reference pitch signal.

Figure 2-1. Episodes of unvoiced sound |t| and voiced sound |a|


2.1.2 Difficulties of Pitch Estimation

Pitch estimation is widely adopted in speech processing applications, but it is still considered one of the most difficult tasks. Estimating the pitch period is difficult for the several reasons stated below.

First, there is a domain-specific modeling problem. Pitch information is of great use in many applications, such as automatic music transcription, speech recognition, and applications related to pathological voice. In musical applications such as automatic music transcription, pitch is indispensable since it directly corresponds to the height of the musical notes. In speech, pitch can greatly improve speech intelligibility and thus can be very useful in speech recognition systems. Sound source separation is another application where pitch information is critical, especially when there are concurrent sound sources. In voice pathology, jitter (perturbation of frequency) and shimmer (perturbation of amplitude) calculated from pitch are commonly used to test for the existence of dysphonia, to measure voice quality such as hoarseness, and to assess the severity of pathological voice [38, 39]. It has been difficult to develop a pitch estimator for wide domains, and many applications depend only on the specific domain of their data. Therefore, a pitch detector designed for one domain is less accurate when applied to a different domain.
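To show how jitter and shimmer depend on the detected pitch periods, the sketch below computes one common "local" definition of each: the mean absolute difference between consecutive cycles, relative to the mean, in percent. This particular definition and the function names are assumptions made for illustration; the specific jitter and shimmer variants analyzed in this dissertation are listed in Tables 3-2 and 3-3.

```python
import numpy as np

def local_perturbation_percent(values):
    """Mean absolute cycle-to-cycle difference divided by the mean, in percent."""
    v = np.asarray(values, dtype=float)
    if v.size < 2:
        raise ValueError("at least two cycles are needed")
    return 100.0 * np.mean(np.abs(np.diff(v))) / np.mean(v)

def jitter_percent(periods):
    """Local jitter from a sequence of pitch periods (any consistent time unit)."""
    return local_perturbation_percent(periods)

def shimmer_percent(peak_amplitudes):
    """Local shimmer from a sequence of per-cycle peak amplitudes."""
    return local_perturbation_percent(peak_amplitudes)
```

Because both measures are computed from per-cycle values, a single octave error in the pitch detector changes one period by a factor of two and can inflate the jitter estimate considerably, which is one reason a robust PDA matters for pathological voice.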

Second, the complexity of human voices exceeds the ability of current pitch detectors. For example, the pitch period changes with time, often with each glottal period. This jittered pitch period can violate a basic assumption of pitch detectors, namely that voiced speech is stationary and periodic within an analysis segment of about 15-20 ms, and may result in major failures of pitch estimation. Sub-harmonics of the fundamental frequency, which are sub-multiples of the "true" frequency, often appear. In many cases where strong sub-harmonics and high-intensity aperiodic components are present, the most reasonable objective pitch estimate is clearly at odds with the auditory percept.

Finally, the dynamic range of the voice fundamental frequency also poses a problem. Fant [40] determined that the average F0 in conversational speech in European languages is approximately 120 Hz for men, 220 Hz for women, and 330 Hz for children, and the typical range exploited by a single speaker within one utterance is normally within one octave. The maximum overall range of fundamental frequency in ordinary conversation is about 50-250 Hz for men and about 120-480 Hz for women [41]. However, the pitch of some male voices can be as low as 50 Hz, whereas the pitch of children's voices can be as high as 600 Hz.

2.1.3 Characteristics of Pathological Voice

The general idea of fundamental frequency estimation or detection is to obtain the period of the glottal excitation waveform. When the electroglottographic signal is compared with the acoustic signal, it is evident that the vibrations in the vocal tract have the greatest amplitude at the moment the glottis closes [42]. Moreover, glottal closing is much more abrupt than glottal opening, so the moment of closing can be determined more precisely.

Figure 2-2. Vibratory pattern of vocal folds with a single opening and closing & airflow

between the vocal folds as change of vocal folds vibration


This waveform is the result of the periodic opening and closure of the vocal cords in the glottis while air is forced through from the lungs. To understand in depth how voice is produced, it is therefore necessary to understand a vibratory cycle of the vocal folds, that is, a single opening and closing of the vocal folds. As shown in Fig. 2-2, the vibratory pattern of the vocal folds starts at the moment when subglottal pressure overpowers fold resistance just enough for the vocal folds to start to blow open. They continue to blow apart during the open phase until the escape of air reduces subglottal pressure enough for fold resistance to overpower the air flow (moment d). At that point, the closing phase begins as the folds move toward each other. It ends as soon as the glottis is closed (moment f). After that, the closed phase continues until the opening phase starts the entire cycle over again.

Figure 2-3. Episodes of pitch doubling and halving error: pitch estimated by autocorrelation method


A variety of methods exist for determining T0 (1/F0), but their application to pathological voices gives rise to the following objective difficulties, which can lead to severe errors:

- The amplitude and pitch vary significantly.

- The vocal folds do not make contact in the right position for voicing.

- The F0 appears at exactly half or double the correct fundamental frequency because of the physical changes of the vocal folds.

In particular, structural modification of the vocal cords due to BVFL tends to produce an abnormal cycle or pattern of the glottal excitation waveform. This abnormal pattern usually results in the undesirable effect of a pitch doubling or halving error, as shown in Fig. 2-3.
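The following Python sketch illustrates one way doubling and halving errors could be flagged when a reference F0 is available. The function name `classify_pitch_error`, the labels, and the 20% relative tolerance are hypothetical choices for illustration and are not the evaluation criteria used in this dissertation.

```python
def classify_pitch_error(f0_estimated, f0_reference, tol=0.2):
    """Label an F0 estimate relative to a reference F0.

    tol is an assumed relative tolerance (20% by default).
    Returns "ok", "halving", "doubling", or "gross".
    """
    if f0_estimated <= 0 or f0_reference <= 0:
        raise ValueError("frequencies must be positive")
    ratio = f0_estimated / f0_reference
    if abs(ratio - 1.0) <= tol:
        return "ok"            # within tolerance of the reference
    if abs(ratio - 0.5) <= tol * 0.5:
        return "halving"       # estimate near half the reference F0
    if abs(ratio - 2.0) <= tol * 2.0:
        return "doubling"      # estimate near twice the reference F0
    return "gross"             # any other large deviation
```

Counting the "halving" and "doubling" labels over all analysis frames gives a simple per-recording tally of octave errors for a given detector.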

2.2 Review of the Several Established PDAs

Pitch detection algorithms (PDAs) fall mainly into three categories: time domain approaches, frequency domain approaches, and alternative approaches. Most pitch detection methods are based on the assumption that the speech signal is stationary over a short time, but in reality the speech signal is non-stationary and quasi-periodic, which sometimes induces detection errors. PDAs are usually not designed to work with the whole range of pathological voices. They depend on a minimum of signal periodicity to give accurate results. Titze et al. [43] reported an upper limit of 6% jitter for accurate period detection. The following sections review the time domain, frequency domain, and alternative methods before they are compared with our proposed PDA for pathological voice.


2.2.1 Time Domain Approaches

2.2.1.1 Autocorrelation (AC)

The most basic approach to the problem of pitch estimation is to look at the waveform

that represents the change in air pressure over time, and attempt to detect the pitch from that

waveform. The goal of the autocorrelation [44, 45] routines is to find the similarity between

the signal and a shifted version of itself. The result of a correlation is a measure of similarity

as a function of the time lag between the beginnings of the two waveforms. Equation (2.1) gives the mathematical definition of the autocorrelation of a finite discrete function $x(m)$ of size $N$.

Figure 2-4. Pitch detection standard, domain and boundary of AC: (left) analyzed speech signal, (right) pitch selection in time domain after autocorrelation

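$$R(k) = \sum_{m=-N}^{N} x(m)\,x(m+k), \quad 0 \le |k| \le N, \quad \text{if input}: 1, \ldots, N-1 \tag{2.1}$$

$$\begin{aligned}
&\text{i)}\ R(k) = R(-k) \\
&\text{ii)}\ R(0) \ge |R(k)| \\
&\text{iii)}\ R(k) = R(k + N_p) \quad \text{if } N_p \text{ is the period}
\end{aligned} \tag{2.2}$$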

Given the properties of the autocorrelation in equation (2.2), if the signal is periodic, the autocorrelation function $R(k)$ is periodic as well, and if the signal is harmonic, the autocorrelation function has peaks at lags that are multiples of the fundamental period. The maximum peak occurs at a lag of 0. As the time lag increases to half of the period of the waveform, the correlation decreases to a minimum, as shown in Fig. 2-4, because the waveform is out of phase with its time-delayed copy. As the time lag increases further to the length of one period, the autocorrelation rises back to a maximum, because the waveform and its time-delayed copy are in phase. The first peak after lag 0 therefore indicates the period of the waveform.

This technique is the most efficient at mid to low frequencies. Thus it has been popular in

speech recognition applications where the pitch range is limited. The weakness of the

autocorrelation approach is that it is prone to sub-harmonic errors, that is, it occasionally

generates pitch estimates that differ from human pitch judgments, most often by an octave.

Wider period peaks and multiplicity of non-periodic peaks also produce wrong pitch estimates.
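The peak picking described above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names, the 8 kHz sampling rate, and the synthetic 100 Hz test frame are assumptions made for illustration, and the sum is taken only over the overlapping part of a single frame.

```python
import math

def autocorrelation(x, max_lag):
    """R(k) = sum over m of x(m) * x(m + k), for k = 0 .. max_lag (overlap only)."""
    n = len(x)
    return [sum(x[m] * x[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]

def ac_pitch(x, fs, f_min=50.0, f_max=350.0):
    """Estimate F0 in Hz from the largest autocorrelation value in the lag range."""
    lag_min = int(fs / f_max)   # shortest admissible period, in samples
    lag_max = int(fs / f_min)   # longest admissible period, in samples
    r = autocorrelation(x, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda k: r[k])
    return fs / best_lag

# Hypothetical test frame: 100 ms of a pure 100 Hz sinusoid sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
f0 = ac_pitch(frame, fs)
print(f0)
```

Because the overlap shrinks as the lag grows, the peak at one period is slightly larger than the peak at two periods for a clean sinusoid. For pathological voices this margin can vanish, which is one way the octave errors discussed above arise.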

2.2.1.2 Average Magnitude Difference Function (AMDF)

The Average Magnitude Difference Function (AMDF) [46] analysis is a variation of AC analysis in which, instead of correlating the input speech at various delays, a difference signal is formed between the delayed speech and the original, and the absolute magnitude is taken at each delay value. For DSP-chip or hardware implementation, AMDF is more favorable because its calculation requires no multiplications, a desirable property for real-time applications. The mathematical definition of the AMDF is given in equation (2.3).

$$F(k) = \frac{1}{N} \sum_{m=1}^{N} |x(m) - x(m+k)|, \quad 0 \le |k| \le N-1, \quad \text{if input}: 1, \ldots, N \tag{2.3}$$

where $x(m)$ are the samples of the input speech and $x(m+k)$ are the samples shifted by a lag of $k$. The vertical bars denote the magnitude of the difference between $x(m)$ and $x(m+k)$. Thus a difference signal $F(k)$ is formed by delaying the input speech by various amounts, subtracting the delayed waveform from the original, and summing the magnitudes of the differences between sample values. As stated by the properties in equation (2.4), the difference signal is always zero at a lag of 0, and is particularly small at delays corresponding to the pitch period of a voiced sound having a quasi-periodic structure, as shown in Fig. 2-5.


$$\begin{aligned}
&\text{i)}\ F(0) = 0 \\
&\text{ii)}\ F(0) = F(N_p) \quad \text{if perfectly periodic, where } N_p \text{ is the period}
\end{aligned} \tag{2.4}$$

An advantage of this method is that the relative sizes of the nulls tend to remain constant

as a function of lags, because there is always full overlap of data between the two segments

being cross differenced.

Figure 2-5. Pitch detection standard, domain and boundary of AMDF: (left) analyzed speech signal, (right) pitch selection in time domain after AMDF
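The AMDF null picking can be sketched in the same way. The code below is an illustrative sketch under stated assumptions, not the routine used in this study: the names, the fixed comparison window (chosen so that the two segments always overlap fully), and the 70 Hz lower search bound are hypothetical choices.

```python
import math

def amdf(x, window, max_lag):
    """F(k) = (1/window) * sum |x(m) - x(m + k)| with full overlap at every lag."""
    if len(x) < window + max_lag:
        raise ValueError("frame must hold window + max_lag samples")
    return [sum(abs(x[m] - x[m + k]) for m in range(window)) / window
            for k in range(max_lag + 1)]

def amdf_pitch(x, fs, window, f_min=70.0, f_max=350.0):
    """Estimate F0 in Hz from the deepest AMDF null inside the lag range."""
    lag_min = int(fs / f_max)
    lag_max = int(fs / f_min)
    f = amdf(x, window, lag_max)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda k: f[k])
    return fs / best_lag

# Hypothetical test frame: a pure 100 Hz sinusoid sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
f0 = amdf_pitch(frame, fs, window=400)
print(f0)
```

The search range in this sketch deliberately excludes twice the period. With full overlap the nulls at one and two periods are equally deep for a periodic signal, so a wider range would need an additional rule to prefer the shorter lag.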

2.2.1.3 YIN

The name YIN [47] derives from the yin-yang of oriental philosophy and alludes to the authors' attempt to balance autocorrelation and cancellation in the algorithm. The autocorrelation approach sometimes selects a wrong peak as the fundamental frequency, which produces sub-harmonic errors. YIN attempts to solve this problem in several procedural steps. It is based on the difference function, which minimizes the difference between the waveform and its delayed duplicate instead of maximizing their product as in autocorrelation. The mathematical definition is presented in equation (2.5).

$$d(k) = \sum_{m=1}^{N} \bigl(x(m) - x(m+k)\bigr)^2, \quad 0 \le |k| \le N, \quad \text{if input}: 1, \ldots, N-1 \tag{2.5}$$


Modeling the signal $x(m)$ as a periodic function with period $N_p$, which is by definition invariant under a time shift of $N_p$, the difference between $x(m)$ and $x(m+N_p)$ is zero. The same is true after taking the square and averaging over a window, as shown in equation (2.6). Conversely, an unknown period may be found by forming the difference function of equation (2.5) and searching for the values of $k$ for which the function is zero or at a minimum.

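$$\begin{aligned}
&\text{i)}\ \sum_{m=1}^{N} \bigl(x(m) - x(m+N_p)\bigr)^2 = 0 \quad \text{if } x(m) - x(m+N_p) = 0,\ \forall m \\
&\text{ii)}\ d'(k) = \begin{cases} 1, & \text{if } k = 0 \\ d(k) \Big/ \left[ \dfrac{1}{k} \displaystyle\sum_{m=1}^{k} d(m) \right], & \text{otherwise} \end{cases}
\end{aligned} \tag{2.6}$$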

Additionally, YIN employs a cumulative mean function, which de-emphasizes higher-period dips in the difference function in order to reduce the occurrence of sub-harmonic errors. Other improvements in the YIN analysis include a parabolic interpolation of the local minima, which reduces the error that arises when the period is not an integer multiple of the sampling period. Fig. 2-6 shows the pitch selection in the time domain after these steps.

Figure 2-6. Pitch detection standard, domain and boundary of YIN: (left) analyzed speech signal, (right) pitch selection in time domain after YIN
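The three YIN steps named above (difference function, cumulative mean normalization, parabolic interpolation) can be sketched as follows. This is a simplified illustration, not the reference YIN implementation and not the code used in this study. The absolute threshold of 0.1 used to pick the first sufficiently deep dip follows the original YIN publication; the function names and the test signal are assumptions.

```python
import math

def difference(x, window, max_lag):
    """d(k) = sum over the window of (x(m) - x(m + k))^2, as in Eq. (2.5)."""
    return [sum((x[m] - x[m + k]) ** 2 for m in range(window))
            for k in range(max_lag + 1)]

def cumulative_mean_normalized(d):
    """d'(0) = 1, d'(k) = d(k) / ((1/k) * sum_{m=1..k} d(m)), as in Eq. (2.6)."""
    out, running = [1.0], 0.0
    for k in range(1, len(d)):
        running += d[k]
        out.append(d[k] * k / running if running > 0 else 1.0)
    return out

def yin_pitch(x, fs, window, f_min=50.0, f_max=350.0, threshold=0.1):
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    dn = cumulative_mean_normalized(difference(x, window, lag_max))
    lag = None
    for k in range(lag_min, lag_max):
        if dn[k] < threshold:
            # Walk down to the bottom of the first dip below the threshold.
            while k + 1 < lag_max and dn[k + 1] < dn[k]:
                k += 1
            lag = k
            break
    if lag is None:  # no dip is deep enough: fall back to the global minimum
        lag = min(range(lag_min, lag_max), key=lambda k: dn[k])
    # Parabolic interpolation through the minimum and its two neighbours.
    a, b, c = dn[lag - 1], dn[lag], dn[lag + 1]
    denom = a - 2 * b + c
    shift = 0.5 * (a - c) / denom if denom != 0 else 0.0
    return fs / (lag + shift)

# Hypothetical test frame: a pure 100 Hz sinusoid sampled at 8 kHz.
fs = 8000
frame = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
f0 = yin_pitch(frame, fs, window=400)
print(round(f0, 2))
```

Taking the first dip below the threshold, rather than the global minimum, is what lets the estimator prefer one period over two periods even though both dips are equally deep for a periodic signal.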


2.2.2 Frequency Domain Approaches

2.2.2.1 Cepstrum

The cepstrum is a common transform used to extract information from a speech signal [48]. It can be used to separate the excitation signal, which carries the pitch, from the vocal tract transfer function, which shapes the spectral envelope. We assume that a sequence of voiced speech is the result of convolving the glottal excitation sequence $e[n]$ with the vocal tract's discrete impulse response $q[n]$. In the frequency domain, the convolution becomes a multiplication. Using the property of the logarithm, $\log AB = \log A + \log B$, the multiplication can then be transformed into an addition. Finally, the real cepstrum of a signal $s[n] = e[n] * q[n]$ is defined by equation (2.7).

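$$c(m) = F_{FFT}^{-1}\bigl(\log |F_{FFT}(x(m))|\bigr) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log |S(\omega)|\, e^{j\omega m}\, d\omega \tag{2.7}$$

where

$$S(\omega) = \sum_{m=-\infty}^{\infty} x(m)\, e^{-j\omega m}, \quad \text{if input}: 1, \ldots, N-1 \tag{2.8}$$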

Figure 2-7. Pitch detection standard, domain and boundary of Cepstrum: (left) analyzed speech signal, (right) pitch selection in quefrency domain after Cepstrum

Cepstral coefficients decrease as 1/q, where q is the index of the cepstrum sequence. Hence, for very noisy voices, the retrieval of F0 can be difficult when the pitch pulses are hardly distinguishable in the middle of a flat noise sequence. Moreover, the implementation of the real cepstrum analysis depends heavily on the computation of the Short Time Fourier Transform (STFT), and a proper choice of the window applied to the signal is of great importance. As shown in Fig. 2-7, the pitch estimate for voiced sounds is taken at the maximum value within the admissible pitch range.
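The cepstral peak picking can be illustrated with a small sketch. It is an illustration under stated assumptions rather than the implementation used in this study: a direct DFT stands in for the FFT so that the example needs only the standard library, no analysis window is applied, a small floor is added before the logarithm, and the synthetic decaying pulse train and all names are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Direct O(N^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    tw = [cmath.exp(sign * 2j * math.pi * i / n) for i in range(n)]
    out = [sum(x[m] * tw[(k * m) % n] for m in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def real_cepstrum(x, floor=1e-12):
    """Inverse transform of the log magnitude spectrum, as in Eq. (2.7)."""
    log_mag = [math.log(abs(v) + floor) for v in dft(x)]
    return [v.real for v in dft(log_mag, inverse=True)]

def cepstrum_pitch(x, fs, f_min=80.0, f_max=350.0):
    """Pick the largest cepstral value inside the admissible quefrency range."""
    c = real_cepstrum(x)
    q_min, q_max = int(fs / f_max), int(fs / f_min)
    best_q = max(range(q_min, q_max + 1), key=lambda q: c[q])
    return fs / best_q

# Hypothetical voiced frame: decaying pulses every 64 samples at 8 kHz (125 Hz).
fs = 8000
frame = [0.9 ** (n % 64) for n in range(512)]
f0 = cepstrum_pitch(frame, fs)
print(f0)
```

The decaying pulse shape plays the role of the vocal tract response and contributes only at low quefrencies, while the pulse spacing produces the peak at 64 samples that the search range isolates.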

2.2.2.2 Simplified Inverse Filtering Techniques (SIFT)

Figure 2-8. Overall process of SIFT (top) and pitch detection standard, domain and boundary of SIFT: (bottom left) analyzed speech signal, (bottom right) pitch selection in time domain after SIFT (Hybrid method)

The Simplified Inverse Filtering Technique (SIFT) algorithm is based on an LP analysis of the data [49], which gives the Inverse Filter (IF) used in this approach [50]. Commonly, a low filter order (ρ ≤ 4) is selected, corresponding to no more than two formants characterizing the vocal tract [51]. While this suffices for healthy voices, pathological voices require a higher and possibly varying filter order because of the strong noise component, such as aspiration, that corrupts the signal. In this study, the model order ρ was selected and the parameters were estimated, followed by an autocorrelation maximization of the inverse filter residual, which gives the estimated pitch value. The optimum model order of ρ = 17 was obtained with the minimum description length order selection method; the cutoff frequency of the low-pass filter was set to 1200 Hz, and the pre-emphasis coefficient to 0.975. The overall process of SIFT is illustrated in Fig. 2-8.
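The core of this procedure (LP analysis, inverse filtering, autocorrelation of the residual) can be sketched as below. The sketch is a simplified illustration and makes several assumptions: the low-pass filtering, pre-emphasis, and minimum description length order selection described above are omitted, a fixed order of 4 is used, the Levinson-Durbin recursion is one common way to solve the LP normal equations, and the test signal and names are hypothetical.

```python
def autocorr(x, max_lag):
    n = len(x)
    return [sum(x[m] * x[m + k] for m in range(n - k)) for k in range(max_lag + 1)]

def levinson(r, order):
    """Solve the LP normal equations; returns A(z) = 1 + a1 z^-1 + ... + ap z^-p."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k                  # remaining prediction error energy
    return a

def lp_residual(x, order):
    """Pass x through the inverse filter A(z) to obtain the excitation estimate."""
    a = levinson(autocorr(x, order), order)
    return [sum(a[j] * x[n - j] for j in range(order + 1))
            for n in range(order, len(x))]

def sift_pitch(x, fs, order=4, f_min=80.0, f_max=350.0):
    """Largest autocorrelation value of the LP residual inside the lag range."""
    e = lp_residual(x, order)
    lag_min, lag_max = int(fs / f_max), int(fs / f_min)
    r = autocorr(e, lag_max)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda k: r[k])
    return fs / best_lag

# Hypothetical frame: one-pole resonator driven by a pulse every 64 samples.
fs = 8000
frame, prev = [], 0.0
for n in range(512):
    prev = 0.9 * prev + (1.0 if n % 64 == 0 else 0.0)
    frame.append(prev)
f0 = sift_pitch(frame, fs)
print(f0)
```

Inverse filtering flattens the spectral envelope, so the residual is close to the pulse train itself and its autocorrelation peaks sharply at the pulse spacing.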

2.2.3 Alternative Approaches

2.2.3.1 Wavelet

Figure 2-9. Diagram of fast lifting wavelet transform

The Wavelet Transform is a flexible tool for analyzing time-frequency behavior of

signals embedded in noise, and is well-suited to handle non-stationary data [52, 53]. In

particular, dyadic wavelets are characterized by an exponential sampling of the plane, given

by a power-of-two sampling of the scale parameter. This amounts to considerable savings in

the computational cost of the algorithm, which makes this approach suitable for fast signal

processing. Other useful properties of dyadic wavelets are linearity and shift invariance, as

speech signals are often modeled as a linear combination of shifted and damped sinusoids.

Thus this transform is successfully applied in the analysis of speech signals [54].

The discrete wavelet transform (DWT) separates the high-frequency components (the detail signal) and the low-frequency components (the approximation signal) of a signal on successive scales. Scaling functions are associated with low-pass filters and wavelets with high-pass filters, which perform the signal decomposition on subsequent levels. The most meaningful frequency values have the highest intensity in the level that contains them. The main advantage of this transform over the FFT is that the temporal location of the frequency content is preserved. It has good time resolution in the high frequency range and good frequency resolution in the low frequency range. The DWT has been successfully applied to detecting the pitch period of speech signals [55].

Figure 2-10. Pitch detection standard, domain and boundary of Wavelet: (left) analyzed speech signal, (right) pitch selection in time domain after FLWT

In this study, the Fast Lifting Wavelet Transform (FLWT) is used to implement a PDA based on the DWT (Fig. 2-9). The wavelet transform splits a signal into an approximation and a detail using the Haar wavelet. The FLWT with a Haar wavelet is mathematically equivalent to running a low-pass filter followed by down-sampling to produce the approximation component, and a high-pass filter followed by down-sampling to produce the detail component. After the FLWT has been applied repeatedly, up to a limit of 4 steps, the maxima and minima of each analyzed level are located, and the pitch is obtained by averaging the distances between successive maxima, as shown in Fig. 2-10.
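A single Haar lifting step and its repetition over several levels can be sketched as follows. This is a minimal illustration of the split, predict, and update structure, assuming an unnormalized Haar wavelet and an even number of samples at each level; the function names are hypothetical, and the peak-distance analysis that follows the decomposition is not shown.

```python
def haar_lift(x):
    """One lifting step: split into even/odd samples, predict, then update."""
    even, odd = x[0::2], x[1::2]
    detail = [o - e for e, o in zip(even, odd)]         # predict: high-pass part
    approx = [e + d / 2 for e, d in zip(even, detail)]  # update: pairwise mean
    return approx, detail

def haar_unlift(approx, detail):
    """Invert one lifting step by undoing the update and then the predict."""
    even = [s - d / 2 for s, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

def decompose(x, levels=4):
    """Apply the lifting step repeatedly, limited to the given number of levels."""
    approx, details = list(x), []
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, detail = haar_lift(approx)
        details.append(detail)
    return approx, details

approx, detail = haar_lift([2, 4, 6, 8])
print(approx, detail)                       # [3.0, 7.0] [2, 2]
print(haar_unlift(approx, detail))          # [2.0, 4.0, 6.0, 8.0]
print(decompose(list(range(16)))[0])        # [7.5]
```

Each level halves the number of samples, so after four levels the approximation of a 16-sample frame is a single value, the mean of the frame.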


2.2.3.2 State-Space Embedding

Figure 2-11. Pitch detection standard, domain and boundary of State-Space Embedding: pitch selection in periodicity histogram (time domain) after singular value decomposition

The state-space embedding signal representation is a method of observing the short-time history of a waveform in a way that makes repetitive cycles clear. The basic state-space embedding representation plots the value of the waveform at time $t$ against the slope of the waveform at the same point [56]. A periodic signal should produce a repeating cycle in state-space embedding, returning to a point with the same value and slope. Higher-dimensional state-space embedding representations plot the value and $n-1$ derivatives of the signal in $n$ dimensions.


Pseudo-phase space, also called embedded representation, is a simpler form of phase space. The value of the incoming waveform is plotted against a time-delayed version of itself. The representation plots the points $(x, y) = (f(t), f(t-k))$, and in the $n$-dimensional case, $(x_0, x_1, \ldots, x_{n-1}) = (f(t), f(t-k_1), \ldots, f(t-k_{n-1}))$. Often, for simplicity, $k_1 = k$. In this study, an embedding dimension of 7 and a time delay of 19 were fixed heuristically for pathological voices (Fig. 2-11). The theory of the state-space embedding pitch estimator is discussed in more detail in [57-59].
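The delay embedding itself is simple to state in code. The sketch below is an illustration only: it assumes equally spaced delays (the i-th coordinate is delayed by i times the basic delay), the function names are hypothetical, and the recurrence search is a plain stand-in for the periodicity histogram and singular value decomposition used in this study.

```python
def delay_embed(x, dim, delay):
    """Points (f(t), f(t - delay), ..., f(t - (dim - 1) * delay)) for each valid t."""
    start = (dim - 1) * delay
    return [tuple(x[t - j * delay] for j in range(dim))
            for t in range(start, len(x))]

def recurrence_lag(points, max_lag):
    """Smallest shift at which the trajectory comes closest to its own past."""
    count = len(points) - max_lag

    def mean_sq_dist(shift):
        return sum(sum((a - b) ** 2 for a, b in zip(points[i], points[i + shift]))
                   for i in range(count)) / count

    return min(range(1, max_lag + 1), key=mean_sq_dist)

# Hypothetical periodic signal with a period of 5 samples.
signal = [n % 5 for n in range(40)]
points = delay_embed(signal, dim=3, delay=2)
print(points[0], points[5])            # identical points one period apart
print(recurrence_lag(points, 12))      # 5
```

With the parameters reported above, the call would be `delay_embed(frame, dim=7, delay=19)`; a periodic frame then traces a closed orbit that returns to the same point once per pitch period.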

2.3 Robust Pitch Detection Algorithm for Pathological Voice Based on Fast Orthogonal Search

Pathological voices differ markedly from normal voices. Noise produced by glottal distortion, pitch period perturbation (jitter), and pitch amplitude perturbation (shimmer) are embedded in the voiced sounds. The autocorrelation method, a widely used tool for pitch detection, does not remove these error factors, because octave problems remain and peak finding is difficult in the first frame. Another solution is therefore suggested to overcome these problems. The proposed PDA, based on the Fast Orthogonal Search (FOS) algorithm, is introduced, implemented, and compared with the previously presented PDAs in several analyses.

2.3.1 Introduction of Fast Orthogonal Search Algorithm

The pitch estimation is based on the FOS analysis. The FOS analysis can take a set of

non-orthogonal functions and fit them to the sampled signal. Sine and cosine pairs at the same

frequency can be fitted to estimate the amplitude and phase of the spectrum at the frequency of

the sinusoidal pair. Frequencies with a fractional number of periods in the window can also be searched, which gives FOS a greater resolution than the FFT [60, 61].

One of the most efficient and most frequently used techniques for model structure detection is the orthogonal algorithm [62]. Its advantage is that the contributions of the candidate terms are decoupled, so that the significance of each model term can be measured by its error reduction ratio. Consider a dynamic nonlinear polynomial NARMA model as defined by equation (2.9).

$$y(n) = F[y(n-1), \ldots, y(n-k), \ldots, x(n), \ldots, x(n-l)] + e(n) \tag{2.9}$$

where $F[\cdot]$ denotes a nonlinear polynomial. Equation (2.9) can be written more concisely as equation (2.10) for any sampled $y(n)$ of length $N$.

$$y(n) = \sum_{m=0}^{M} a_m\, p_m(n) + e(n) \tag{2.10}$$

$a_m$ and $p_m$ denote the unknown parameters and the candidate functions (nonlinear regressors), respectively, and $e(n)$ is a white noise sequence with zero mean and finite variance. The candidate functions are of length $N$ and are chosen to be sine and cosine functions at particular frequencies of interest. The candidates are given by equation (2.11).

$$\begin{aligned}
p_{2m}(n) &= \sin\frac{2\pi f_m n}{N} \\
p_{2m+1}(n) &= \cos\frac{2\pi f_m n}{N}
\end{aligned} \tag{2.11}$$

The frequency of each candidate is given by $f_m$. These sine and cosine functions are not necessarily orthogonal, as they may have a fractional number of periods over the length $N$. The sampled signal $y(n)$ can also be expressed as a functional expansion of $M$ orthogonal functions $w_m(n)$ with coefficients $g_m$ and some error $e(n)$, as defined by equation (2.12).

$$y(n) = \sum_{m=0}^{M} g_m\, w_m(n) + e(n) \tag{2.12}$$

The set of orthogonal functions is derived from the candidate functions using the Gram-Schmidt orthogonalization algorithm [63]. In the Gram-Schmidt algorithm, an orthogonal function is calculated as the corresponding candidate minus the weighted sum of the previous orthogonal functions, as given by equation (2.13), with the coefficients of equation (2.14).

$$w_m(n) = p_m(n) - \sum_{r=0}^{m-1} \alpha_{mr}\, w_r(n) \tag{2.13}$$

where

$$g_m = \frac{\sum_{n=1}^{N} y(n)\, w_m(n)}{\sum_{n=1}^{N} \bigl(w_m(n)\bigr)^2} \tag{2.14}$$


However, the construction of the orthogonal functions $w_m(n)$ in equation (2.13) is time and memory consuming. To avoid this, the FOS algorithm computes the orthogonal expansion coefficients $g_m$ directly (using an algorithm based on a modified Cholesky decomposition technique) without explicitly creating the orthogonal functions $w_m(n)$. As a consequence, computing time is significantly reduced. The coefficients $g_m$ are obtained by equation (2.15).

$$g_m = \frac{C(m)}{D(m,m)}, \quad m = 0, \ldots, M \tag{2.15}$$

where

$$\begin{aligned}
D(0,0) &= 1 \\
D(m,0) &= \overline{p_m(n)}, \quad m = 1, \ldots, M \\
D(m,r) &= \overline{p_m(n)\, p_r(n)} - \sum_{i=0}^{r-1} \alpha_{ri}\, D(m,i), \quad m = 1, \ldots, M,\ r = 1, \ldots, m
\end{aligned} \tag{2.16}$$

and

$$\alpha_{mr} = \frac{D(m,r)}{D(r,r)}, \quad m = 1, \ldots, M,\ r = 0, \ldots, m-1 \tag{2.17}$$

with

$$\begin{aligned}
C(0) &= \overline{y(n)} \\
C(m) &= \overline{y(n)\, p_m(n)} - \sum_{r=0}^{m-1} \alpha_{mr}\, C(r)
\end{aligned} \tag{2.18}$$

Additionally, the mean square error is calculated from equation (2.19).

$$mse = \overline{y^2(n)} - \sum_{m=0}^{M} g_m^2\, D(m,m) \tag{2.19}$$

The overbar in equations (2.16), (2.18), and (2.19) denotes a time average computed over the portion of the data record of length $N$. The spectral density at a given frequency $f_m$ is a combination of the magnitudes of the two corresponding candidate functions (sine and cosine at frequency $f_m$); together with the phase spectrum, it is given by equation (2.20).

$$\begin{aligned}
F(f_m) &= a_{2m}^2 + a_{2m+1}^2 \\
\phi(f_m) &= \tan^{-1}\frac{a_{2m+1}}{a_{2m}}
\end{aligned} \tag{2.20}$$
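The orthogonal expansion can be illustrated with a small sketch. It is important to note what the sketch is not: it builds the orthogonal functions explicitly by classical Gram-Schmidt, following equations (2.12) to (2.14), and therefore does not use the Cholesky-based shortcut of equations (2.15) to (2.18) that makes FOS fast. It also reports the mean square error reduction of each sine and cosine pair, the summand of equation (2.19), rather than the amplitudes of equation (2.20). The candidate frequencies are given in Hz, and all names and the test signal are assumptions.

```python
import math

def mean_dot(u, v):
    """Time average of the product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def orthogonal_fit(y, candidates):
    """MSE reduction g_m^2 * mean(w_m^2) of each candidate, fitted in order."""
    basis, reductions = [], []
    for p in candidates:
        w = list(p)
        for wr in basis:
            alpha = mean_dot(p, wr) / mean_dot(wr, wr)
            w = [wi - alpha * wri for wi, wri in zip(w, wr)]
        energy = mean_dot(w, w)
        if energy < 1e-12:            # candidate adds nothing new: skip it
            reductions.append(0.0)
            continue
        g = mean_dot(y, w) / energy   # coefficient g_m of Eq. (2.14)
        reductions.append(g * g * energy)
        basis.append(w)
    return reductions

def fos_spectrum(y, fs, freqs):
    """Summed MSE reduction of the sine/cosine pair at each candidate frequency."""
    n = len(y)
    candidates = [[1.0] * n]          # constant term p_0(n) = 1
    for f in freqs:
        candidates.append([math.sin(2 * math.pi * f * i / fs) for i in range(n)])
        candidates.append([math.cos(2 * math.pi * f * i / fs) for i in range(n)])
    red = orthogonal_fit(y, candidates)
    return {f: red[1 + 2 * j] + red[2 + 2 * j] for j, f in enumerate(freqs)}

# Hypothetical frame: 115 Hz tone, i.e. 11.5 periods in a 100 ms window at 8 kHz.
fs = 8000
y = [2.0 * math.sin(2 * math.pi * 115.0 * i / fs + 0.3) for i in range(800)]
spectrum = fos_spectrum(y, fs, [90.0, 100.0, 115.0, 130.0])
best = max(spectrum, key=spectrum.get)
print(best)
```

The test tone has a fractional number of periods in the window, so it does not fall on an FFT bin of this frame, yet the candidate at 115 Hz still accounts for most of the signal energy.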


2.3.2 Pitch Selection

Figure 2-12. Two episodes of pitch selection in FOS

In the FOS algorithm, the candidate functions are fitted in the order presented (0, 1, …, m-1), and an order M of 640 was selected by empirical analysis of the minimum mean square error of equation (2.19). Once the spectrum of the sampled signal has been determined over the frequency range from 0 to 6000 Hz, the pitch information must be extracted. As shown in Fig. 2-12, pitch candidates are selected from the frequency domain, normalized by the maximum located within 50-350 Hz.


Figure 2-13. Pitch selection of boundary of a third of global Maximum peak

The pitch selection proceeds as follows. First, the global maximum peak is taken as the first pitch estimate. If the first pitch estimate lies above 100 Hz and exceeds the threshold of 0.2, the boundary around half of the first pitch estimate is searched. The boundary width is 2 % of the size M (13 Hz on either side). The local maximum peak in the valid search range (the boundary derived from the first pitch estimate) whose magnitude exceeds a prescribed fraction (e.g. 20 %) of the global maximum peak is then located, and its position is stored as the second pitch candidate. This process continues until the lowest pitch candidate is found that satisfies the pitch condition, namely that it lies within the valid pitch range and exceeds 20 % of the global maximum peak. If there is no peak at half of the first pitch estimate, the local pitch estimate is searched for again within a new boundary around a third of the first pitch estimate (e.g. Fig. 2-13). Finally, the position in the frequency domain of the last pitch estimate found is selected as the fundamental frequency.

Pitch candidates obtained as described above for steady periodic speech frames usually include only the true pitch and its integer multiples. Selecting the lowest of these gives a reliable local pitch estimate for such frames, because the harmonics in the spectrum of voice lie at integer multiples of the pitch, so that the lowest one is usually regarded as the real pitch. For our PDA implementation we have developed an algorithm based on FOS.
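The selection rule can be sketched as a short routine. This is one possible reading of the procedure above, not the code used in this study: the strongest bin inside each boundary stands in for a true local peak search, the 0.2 threshold is applied only to the candidates relative to the global maximum, and the function name, parameter names, and the synthetic spectra are assumptions.

```python
def select_pitch(freqs, mags, f_lo=50.0, f_hi=350.0,
                 half_width=13.0, rel_threshold=0.2, min_split_hz=100.0):
    """Walk down from the global peak to the lowest supported sub-multiple."""
    valid = [i for i, f in enumerate(freqs) if f_lo <= f <= f_hi]
    estimate = max(valid, key=lambda i: mags[i])      # first pitch estimate
    top = mags[estimate]

    def strongest_near(center):
        window = [i for i in valid if abs(freqs[i] - center) <= half_width]
        if not window:
            return None
        best = max(window, key=lambda i: mags[i])
        return best if mags[best] >= rel_threshold * top else None

    while freqs[estimate] > min_split_hz:
        for divisor in (2, 3):                        # try a half, then a third
            candidate = strongest_near(freqs[estimate] / divisor)
            if candidate is not None:
                break
        else:
            break                                     # nothing lower is supported
        estimate = candidate
    return freqs[estimate]

# Hypothetical spectra on a 1 Hz grid.
freqs = [float(f) for f in range(401)]
with_subharmonic = [0.0] * 401
with_subharmonic[220], with_subharmonic[110] = 1.0, 0.5
only_peak = [0.0] * 401
only_peak[220] = 1.0
print(select_pitch(freqs, with_subharmonic))   # 110.0
print(select_pitch(freqs, only_peak))          # 220.0
```

In the first spectrum the second harmonic is stronger than the fundamental, and the rule recovers 110 Hz; in the second there is no support at a half or a third of the peak, so the peak itself is kept.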

2.4 Experimental Procedure

2.4.1 Speech Database

A total of 99 pathological voices and 30 normal voices were recorded at the Institute of Logopedics & Phoniatrics, Yong-Dong Severance Hospital, Yonsei University, Seoul, Korea. Only voices affected by benign vocal fold lesions (polyps, nodules, and cysts) were included in the pathological group. The mean age at diagnosis of the pathological group, adults aged 18-69 years, was 43.7 years, and that of the normal group was 48.4 years (range 21 to 68), irrespective of sex.

Acoustic data were recorded in a quiet environment with the Computerized Speech Lab (CSL), Kay Elemetrics, and electroglottographic (EGG) data were obtained simultaneously. All acoustic and EGG data were sampled at a rate of 22 kHz with a resolution of 16 bits. Sustained vowels |a|, |e|, |i|, |o|, |u| of 2-3 s duration were each uttered twice.

2.4.2 Preconditions of Performance Evaluation

Some preconditions are required to evaluate performance of pitch detection. It is difficult

to empirically measure the performance of a pitch estimator for several reasons. First,


performance depends on domain, as discussed above. A pitch estimator will almost certainly

behave better in the context for which it was developed. Second, it is difficult to automatically

rate the result of a pitch estimator against expected outcomes, precisely because it is difficult

to measure pitch in the first place. We humans are good at it, and so we can listen to a file and

judge the accuracy of a pitch estimation engine, but to lend credibility to this measure, we

must have many people, both expert and lay, judge the pitch estimation result on a large

number of sound files. Once a measure like this is taken, however, it can be used to evaluate

the results of other pitch estimation methods. Another way to evaluate pitch estimators is to

compare the results of multiple detectors on a common corpus. This third method of

comparison is what will be used in this work.

The reference pitch is obtained semi-automatically from an accurate measurement of the glottal closure instants (GCIs). GCIs can be obtained from the differentiated EGG signal, and the successive GCI positions correlate with the pitch of the speech signal [64]. Automated pitch estimation based on these positions was applied to produce the reference pitch estimate. However, this scheme relies on the assumption that the pitch period does not change drastically, that is, that pitch periods do not double or halve within the limited length of the segment. Therefore, the automated reference pitch estimator was used only for type 1 (periodic) signals, while type 2 and 3 signals were analyzed manually. Speech analysis tools such as the spectrogram and Lyapunov exponents [65] were adopted to discriminate type 2 and 3 signals from type 1 signals. If a period-doubling or halving bifurcation was exhibited by the excised larynx, the signal was regarded as type 2 or 3. After the analysis of Lyapunov exponents, a result containing at least one positive Lyapunov exponent was defined as a chaotic, or type 3, signal.

2.4.3 Error Types of PDA

Five types of error measures for the performance of pitch detection were defined as follows:

i) X2 (doubling) error: the percentage of voiced frames (correctly classified as voiced, or unvoiced frames also correctly classified) that satisfy the condition below, relative to the total number of frames confirmed to be voiced.

(computed pitch estimate - reference pitch) / reference pitch > 0.2

ii) /2 (halving) error: the percentage of voiced frames (correctly classified as voiced, or unvoiced frames also correctly classified) that satisfy the condition below, relative to the total number of frames confirmed to be voiced.

(computed pitch estimate - reference pitch) / reference pitch < -0.2

iii) G (gross) error: the sum of the X2 error and the /2 error.

iv) F (fine) error: the mean of the absolute difference between the computed pitch estimate and the reference pitch estimate over the frames confirmed to be voiced, excluding frames with a G error.

v) S (standard deviation) error: the standard deviation of the absolute difference over the frames defined for the F error above.

Error measures related to the voicing decision were not considered in this study. Although the segmented signals included unvoiced periods, some of the PDAs cannot make a voicing decision, so all PDAs were given the same voicing decision algorithms, namely a silence detector based on an energy threshold and the zero crossing rate.
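The five measures can be computed from paired pitch tracks as in the sketch below. It is an illustration under assumptions: all frames passed in are taken to be voiced, the fine error is expressed as a relative difference in percent (an assumption made so that all five measures share the unit used in the result tables), the population standard deviation is used, and the function name and the four-frame example are hypothetical.

```python
import statistics

def pitch_errors(estimates, references, tolerance=0.2):
    """Doubling, halving, gross, fine and standard deviation errors in percent."""
    rel = [(est - ref) / ref for est, ref in zip(estimates, references)]
    n = len(rel)
    doubling = 100.0 * sum(d > tolerance for d in rel) / n    # too-high errors
    halving = 100.0 * sum(d < -tolerance for d in rel) / n    # too-low errors
    # Fine error statistics use only the frames without a gross error.
    fine = [100.0 * abs(d) for d in rel if abs(d) <= tolerance]
    return {
        "X2": doubling,
        "/2": halving,
        "G": doubling + halving,
        "F": statistics.mean(fine) if fine else 0.0,
        "S": statistics.pstdev(fine) if fine else 0.0,
    }

# Hypothetical four-frame example against a constant 100 Hz reference.
errors = pitch_errors([100.0, 210.0, 48.0, 102.0], [100.0] * 4)
print(errors)
```

In this example one frame is an octave too high and one an octave too low, so the gross error is 50 %, while the two remaining frames give a fine error of about 1 %.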

2.4.4 Optimum Window Selection

The autocorrelation algorithm is relatively impervious to noise but is sensitive to the sampling rate, because it calculates the fundamental frequency directly from a shift in samples. An optimized window size therefore had to be determined, because all PDAs must be operated under their best conditions and their performance was sensitive to the window size. The gross error rates were analyzed empirically as a function of window size in Fig. 2-14 and 2-15. These rates were computed over our entire speech database. We tested the optimum window selection for the 8 PDAs on two separate speech databases, normal and BVFL voices. In the normal database (Fig. 2-14), lengths of 22, 68, 20, 34, 46, 70, 23, and 46 ms were selected as the optimized window sizes of AC, AMDF, YIN, CEP, SIFT, WAV, PS, and FOS, respectively. In the BVFL database (Fig. 2-15), lengths of 21, 20, 24, 31, 30, 59, 25, and 54 ms were selected as the optimized window sizes of AC, AMDF, YIN, CEP, SIFT, WAV, PS, and FOS, respectively. The pattern of the window size plot of each PDA in the BVFL database is similar to that in the normal database, and the gross error increases considerably for all PDAs except FOS. There is no difference in window size between the two groups except for AMDF. We selected the above optimum window sizes in order to compare the PDAs at their best performance.

Figure 2-14. Gross error rates as a function of window size in normal database

Figure 2-15. Gross error rates as a function of window size in BVFL database


2.5 Experimental Results

2.5.1 Evaluating Performance of PDAs in Normal versus BVFL Voices

Table 2-1 presents the five types of pitch estimation error for the eight PDAs. These results show that the gross error of the FOS analysis is lower than that of the other PDAs irrespective of the type of speech signal (normal or BVFL), and remarkably lower in the case of BVFL. This is perhaps to be expected, since the halving error of the FOS analysis is substantially lower than that of the other PDAs. Table 2-1 lists the performance of the available PDAs on the normal and BVFL databases; the p value (at the 95% confidence level) is the statistic of Welch's t test between the normal and BVFL groups (NNormal = 150, NBVFL = 495). Shaded blocks denote the best performance of pitch estimation in each column. Our proposed FOS algorithm shows the best performance mainly in the gross and halving errors, whereas the PDA based on the wavelet is best in the doubling, fine, and standard deviation errors. Every approach to pitch estimation faces the problem of trading off too-high against too-low errors, which is usually addressed by applying some form of bias. PDAs based on the spectrum generally do not perform well on too-low errors. However, these PDAs perform well on too-high errors, which suits pathological voices with aperiodic components such as jitter, shimmer, and noise, because these components usually prevent a PDA from finding the correct pitch.

There is no difference in performance among the types of BVFL, as shown in Tables 2-2, 2-3, and 2-4; the PDA based on FOS performs well on too-high errors and averagely on too-low errors. Detailed results of the performance of the PDAs are presented in APPENDIX A: performance by age in intervals of two decades (except infants below 10 years), by sex, and by phonation type (|a|, |e|, |i|, |o|, |u|).


Table 2-1. Results of performance of the available PDAs in database of normal and BVFL (unit: percentage %)

G
PDAs   Normal Mean   Normal S.D.   BVFL Mean   BVFL S.D.   p
AC     1.07          1.31          3.20        5.99        <.001
AMDF   7.69          10.72         8.66        12.88       .008
YIN    1.43          1.59          4.12        6.97        <.001
CEP    2.97          3.67          5.98        9.08        <.001
SIFT   2.44          2.73          5.72        9.42        <.001
WAV    4.89          2.10          6.27        1.91        .015
PS     1.68          1.84          4.69        7.46        <.001
FOS    0.95          2.27          0.83        2.04        .285

X2
PDAs   Normal Mean   Normal S.D.   BVFL Mean   BVFL S.D.   p
AC     0.13          0.45          0.41        2.27        .009
AMDF   0.15          0.81          1.06        6.14        .005
YIN    0.53          0.82          0.66        2.02        .085
CEP    1.30          3.58          2.48        6.98        .017
SIFT   0.66          2.06          0.97        3.14        .181
WAV    0.01          0.07          0.01        0.09        .276
PS     0.63          1.50          1.27        4.64        .031
FOS    0.03          0.33          0.34        1.38        <.001

/2
PDAs   Normal Mean   Normal S.D.   BVFL Mean   BVFL S.D.   p
AC     0.94          10.22         2.80        5.45        <.001
AMDF   7.55          10.77         7.60        11.85       .017
YIN    0.90          1.37          3.46        6.29        <.001
CEP    1.67          1.83          3.51        4.51        <.001
SIFT   1.78          1.98          4.75        8.10        <.001
WAV    4.88          2.10          6.25        1.91        .012
PS     1.05          1.40          3.42        5.47        <.001
FOS    0.92          2.26          0.49        1.48        <.001

F
PDAs   Normal Mean   Normal S.D.   BVFL Mean   BVFL S.D.   p
AC     1.31          0.82          2.14        3.69        <.001
AMDF   6.35          7.63          7.53        9.06        .002
YIN    14.01         0.54          14.75       3.52        <.001
CEP    3.49          5.56          5.56        11.89       .005
SIFT   4.83          2.40          6.44        8.69        <.001
WAV    0.39          0.76          0.83        1.40        <.001
PS     1.90          10.88         3.93        6.57        <.001
FOS    2.95          1.61          2.72        2.46        .521

S
PDAs   Normal Mean   Normal S.D.   BVFL Mean   BVFL S.D.   p
AC     2.61          3.91          4.48        7.22        <.001
AMDF   11.17         10.58         11.35       10.16       .668
YIN    2.63          2.32          4.56        5.88        <.001
CEP    7.90          10.93         10.73       13.17       .012
SIFT   6.34          9.07          8.87        11.85       .071
WAV    0.44          0.82          0.58        0.97        <.001
PS     5.45          6.27          8.15        8.73        .002
FOS    4.00          4.65          4.05        6.95        .355


Table 2-2. Results of performance of the available PDAs in database of polyp (Npolyp = 195) (unit: percentage %)

PDAs   G               X2              /2              F               S
       Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.
AC     4.28    8.28    0.84    4.50    3.44    6.78    3.00    5.85    5.14    6.98
AMDF   7.95    12.70   0.68    2.41    7.27    12.78   6.56    8.47    10.20   9.57
YIN    5.77    9.95    0.99    3.13    4.78    9.40    15.39   4.40    5.53    5.81
CEP    8.34    13.44   3.82    9.67    4.53    6.34    6.31    12.12   9.22    10.37
SIFT   9.50    16.49   0.91    2.62    8.59    15.27   8.53    12.05   9.47    11.73
WAV    7.32    1.73    0.01    0.04    7.31    1.74    0.91    1.66    0.36    0.80
PS     8.01    12.76   3.22    10.20   4.87    7.38    6.20    11.01   10.15   9.21
FOS    0.42    1.37    0.33    1.28    2.21    1.65    2.21    1.65    2.54    4.32

Table 2-3. Results of performance of the available PDAs in database of cyst (Ncyst = 85) (unit: percentage %)

PDAs   G               X2              /2              F               S
       Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.
AC     3.18    6.12    0.33    1.64    2.84    5.90    1.71    2.39    3.67    5.52
AMDF   7.11    12.70   1.55    8.16    5.56    10.54   6.58    8.72    10.15   9.93
YIN    3.55    5.74    0.63    1.92    2.92    4.90    14.30   2.56    3.91    4.00
CEP    6.52    9.04    2.94    7.53    3.58    4.29    6.53    14.23   12.96   14.88
SIFT   4.81    7.42    1.31    3.90    3.50    5.03    5.79    8.96    9.05    13.73
WAV    5.97    1.94    0.02    0.12    5.95    1.93    0.67    1.17    0.58    0.95
PS     4.15    6.17    1.08    2.23    3.07    5.47    3.52    5.31    7.28    6.90
FOS    0.51    1.61    0.25    1.23    0.25    1.10    2.68    2.30    3.37    5.83


Table 2-4. Results of performance of the available PDAs in nodule database (Nnodule = 140) (unit: percentage %)

PDAs   G               X2              /2              F               S
       Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.
AC     2.61    3.54    0.29    0.82    2.32    3.17    2.45    3.97    5.65    9.69
AMDF   12.07   12.77   0.33    0.88    11.74   12.66   9.94    9.64    14.37   10.39
YIN    4.22    6.86    0.51    1.18    3.70    6.25    15.25   4.35    5.34    8.34
CEP    3.53    3.92    0.77    1.57    2.76    3.32    3.22    3.71    7.33    10.02
SIFT   5.17    5.82    0.34    0.92    4.83    5.58    6.30    4.65    8.16    7.10
WAV    6.20    1.74    0.00    0.00    6.20    1.74    1.09    1.60    0.72    1.09
PS     3.66    4.13    0.44    1.00    3.22    3.74    3.34    4.66    8.59    11.11
FOS    1.71    2.74    0.51    1.67    1.19    2.16    3.09    3.07    6.27    9.35

2.5.2 Evaluating Performance of PDAs in Aperiodicity Level of Voices

A further test was carried out on two groups, Type-1 and Type-2, classified according to Titze [66]. This test was performed to evaluate the robustness against pitch doubling and halving errors. As shown in Table 2-5, the gross error in the Type-2 group increases considerably for all PDAs except the one based on FOS, whereas the gross error in the Type-1 group increases only moderately. In the Type-1 group, average gaps of +1.04 %, +0.43 %, +1.14 %, +1.24 %, +1.16 %, +1.27 %, +1.48 % and -0.38 % are obtained for AC, AMDF, YIN, CEP, SIFT, WAV, PS, and FOS, respectively. In the Type-2 group, average gaps of +5.73 %, +5.03 %, +7.32 %, +7.74 %, +10.41 %, +2.52 %, +8.82 % and +1.95 % are obtained for AC, AMDF, YIN, CEP, SIFT, WAV, PS, and FOS, respectively. These results indicate that the PDA based on FOS is robust against pitch doubling and halving errors.

Table 2-5. Results of performance of the available PDAs in database of normal and BVFL

G
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      1.12          1.34          2.16        3.98        <.001
AMDF    8.20          10.98         8.63        12.38       .706
YIN     1.44          1.61          2.58        4.29        <.001
CEP     2.56          2.48          3.80        4.46        <.001
SIFT    2.37          2.77          3.53        3.79        <.001
WAV     4.99          2.08          6.26        1.80        <.001
PS      1.58          1.55          3.06        4.07        <.001
FOS     1.03          2.35          0.61        1.57        .054

X2
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      0.11          0.40          0.21        1.27        .168
AMDF    0.09          0.63          0.73        5.63        .029
YIN     0.49          0.79          0.35        0.78        .074
CEP     0.80          2.14          1.15        2.57        .124
SIFT    0.60          2.09          0.58        1.62        .913
WAV     0.00          0.05          0.01        0.08        .440
PS      0.47          0.94          0.55        0.99        .402
FOS     0.03          0.34          0.14        0.79        .026

/2
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      1.01          1.24          1.96        3.77        <.001
AMDF    8.11          11.00         7.90        11.45       .847
YIN     0.95          1.40          2.23        4.26        <.001
CEP     1.76          1.85          2.66        3.39        <.001
SIFT    1.77          2.00          2.95        3.44        <.001
WAV     4.98          2.08          6.25        1.80        <.001
PS      1.11          1.44          2.51        4.02        <.001
FOS     1.01          2.34          0.47        1.39        .013

F
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      1.32          0.81          1.55        2.34        .099
AMDF    6.68          7.79          7.42        8.61        .351
YIN     14.01         0.55          14.10       2.07        .470
CEP     2.73          3.22          3.33        4.95        .110
SIFT    4.75          2.25          4.66        2.75        .702
WAV     0.41          0.79          0.73        1.29        .001
PS      1.70          1.55          2.59        3.72        <.001
FOS     2.90          1.66          2.35        1.43        .001

S
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      2.56          3.91          3.34        5.57        .075
AMDF    11.54         10.41         11.29       10.22       .804
YIN     2.60          2.37          3.38        3.63        .005
CEP     6.51          8.93          7.99        10.66       .113
SIFT    5.57          7.87          6.39        7.12        .284
WAV     0.44          0.85          0.57        0.99        .147
PS      5.01          5.73          6.59        7.04        .010
FOS     4.15          4.76          3.26        4.58        .060

Table 2-6. Results of performance of the available PDAs in database of normal and BVFL

G
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      0.56          0.84          6.29        9.28        <.001
AMDF    1.87          3.97          6.90        10.38       .006
YIN     1.23          1.25          8.75        11.34       <.001
CEP     7.69          8.91          15.43       18.27       .032
SIFT    3.28          2.21          13.69       17.48       <.001
WAV     3.70          1.98          6.22        2.16        .001
PS      2.80          3.82          10.24       12.64       <.001
FOS     0.00          0.00          1.95        3.17        <.001

X2
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      0.45          0.81          1.60        5.16        .098
AMDF    0.79          1.85          2.25        3.54        .043
YIN     0.98          1.08          2.55        4.85        .024
CEP     7.00          8.84          9.54        16.16       .443
SIFT    1.34          1.71          3.05        7.17        .099
WAV     0.06          0.19          0.05        0.14        .878
PS      2.39          3.97          5.02        10.27       .135
FOS     0.00          0.00          1.20        2.24        <.001

/2
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      0.11          0.40          4.69        7.48        <.001
AMDF    1.09          3.77          4.64        9.95        .037
YIN     0.25          0.60          6.20        8.70        <.001
CEP     0.69          1.27          5.90        5.81        <.001
SIFT    1.94          1.85          10.64       14.05       <.001
WAV     3.65          2.04          6.17        2.15        .001
PS      0.41          0.75          5.22        6.13        <.001
FOS     0.00          0.00          0.75        2.10        .006

F
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      1.22          0.93          4.23        6.91        .001
AMDF    2.64          3.97          6.85        8.84        .012
YIN     14.00         0.32          16.78       6.60        .001
CEP     12.16         14.15         16.61       27.53       .412
SIFT    5.82          3.68          12.77       18.24       .007
WAV     0.23          0.16          1.27        1.64        <.001
PS      4.16          3.44          8.62        10.92       .011
FOS     3.43          0.68          4.40        4.31        .093

S
PDAs    Normal mean   Normal S.D.   BVFL mean   BVFL S.D.   p
AC      3.18          4.02          8.51        11.69       .006
AMDF    6.96          12.17         12.09       10.07       .192
YIN     3.07          1.67          8.84        11.47       <.001
CEP     23.94         17.92         22.83       19.50       .848
SIFT    15.21         15.86         18.27       22.26       .575
WAV     0.33          0.30          0.81        1.08        .004
PS      10.46         9.66          14.07       13.41       .280
FOS     2.33          1.33          8.14        12.23       <.001


2.6 Summary

An algorithm is presented for the estimation of the fundamental frequency (F0) of

pathological voiced sounds. It combines the well-known FOS algorithm with a pitch

selection method that prevents pitch estimation errors, especially octave errors.

The algorithm has several desirable features. Gross error rates are lower than the best

competing methods, as evaluated over a database of speech recorded together with an

electroglottographic signal. The algorithm is relatively simple and may be implemented

efficiently and with low latency, and it involves few parameters that must be tuned. It is based

on a signal model that handles various forms of aperiodicity (such as white and colored

noises) that occur in particular applications. However, there is a trade-off, such as very low

errors that fall within the ignored boundary.


Chapter 3

Comparison of Acoustic and Electroglottographic Parameters of BVFL

before and after Laryngeal Surgery

3.1 Introduction to Acoustic and Electroglottographic Analysis on Vowel

3.1.1 Acoustic Analysis

Up to now a substantial amount of research has been devoted to the determination of the

influence of pathological changes of the larynx upon the voice signal [67-69]. Acoustic

analysis is a modal and traditional method for evaluation and detection of laryngeal

pathologies [70, 71]. Most of these investigations were devoted to computation of pathological

voice parameters for use in clinical practice. Also, several studies [72, 73] have been devoted

to classification and screening of laryngeal pathology. The following results have been

established concerning changes in the pathological voice signal in most cases:

1. Significant variations in voice pitch period and pitch peak amplitude.

2. Breaks in pitch generation during sustained vowel phonation.

3. Distortion of the pitch-pulses shape and increased degree of hoarseness due to the

high-frequency noisy components of the voiced speech.

4. Presence of a loud turbulent and additive noise.

5. Presence of sub-harmonic components in the vowel spectra.

6. Dominating first harmonic and decrease or loss of the high-frequency harmonics in

the signal spectrum resulting in breathy phonation.

7. Interruptions in the pitch period generation.

These changes are not always observed simultaneously, and only part of them could be

present, depending on the disease and its stage. In cases of weak pathology, the voice signal

retains a normal periodicity; only the noise level slightly rises, while the amplitude of the

high-frequency harmonics in the signal spectrum decreases. For this reason, a precise analysis

of periodicity and noise is required to detect the disease in its early stage. By

precise analysis we mean high-precision evaluation of the parameters that describe

the cycle-to-cycle variations in pitch and voice amplitude. In order to visualize some of the

above differences between normal and pathological voices, Fig. 3-1 shows waveforms and

spectra of normal and pathological voices.

Figure 3-1. Episodes of speech, spectrum, and EGG of normal and pathological voices


3.1.2 Electroglottographic Analysis

Electroglottographic (EGG) Analysis is also effectively used to detect and evaluate the

laryngeal disorders, because the EGG waveform is less complex than the speech signal, and is

relatively unaffected by the acoustic resonance of the vocal tract. It was thus considered to be

more advantageous than the speech signal for perturbation analysis [74]. The EGG signal

reflects the degree of vocal fold contact during the vibratory cycle of the vocal folds [75].

Irregularities in the EGG signal correspond to irregularities in the vibratory pattern of the

vocal folds [76, 77]. The EGG features we measure account for such factors as the rise and

fall time of the EGG signal. Thus, we conjectured that the EGG waveform features would

provide a nearly direct measure of the irregularities in the vibratory motion of the vocal folds

and thus provide an excellent classification of normal subjects versus those with vocal

disorders.

3.1.3 Measurement of Pathological Voice

Voice changes are measured during phonation of the sustained vowel. These parameters

define the degree of cycle-to-cycle instability of amplitude and pitch, and indicate the level of

aperiodic components (noises) in the voice signal, predicted by the presence of turbulent noise

and frequency and amplitude modulation of the voice. By modulation we mean a presence of

unintentional variations of voice amplitude and pitch, due to both neurological reasons and to

the biomechanical properties of the vocal folds.

For the laryngeal diagnostics, variations of voice amplitude and pitch period are usually

tested. These perturbations are called pitch perturbation (jitter) and amplitude perturbation

(shimmer), respectively. These perturbations are random by nature and persist in both normal

and pathological voices [78–80]. Slow fundamental frequency and amplitude variations are

defined as frequency and amplitude tremors, respectively, and are due to physical over-tension

rather than to pathological changes in the larynx. However, the more severe the voice

disorder, the higher the levels of jitter and shimmer. Therefore, the parameters extracted

from the speech and electroglottographic signals must be analyzed statistically in order to identify the

differences between preoperative and postoperative voiced sounds.


3.2 Methods and Experiments

3.2.1 Experimental Data and Protocol

Pathological voices were collected from 42 subjects with BVFL from June 2003 to

December 2006. All subjects had normal hearing and were native speakers of Korean who had

undergone laryngeal surgery for treatment of BVFL. The experimental instructions and

stimuli were controlled by an expert clinician at the Institute of Logopedics & Phoniatrics,

Yong-Dong Severance Hospital, Yonsei University, Seoul, Korea. Acoustic data were recorded with

the Computerized Speech Lab (CSL), Kay Elemetrics, in a quiet environment, and

electroglottographic (EGG) data were obtained simultaneously. All acoustic and EGG data

were sampled at a sampling rate of 22 kHz with a resolution of 16 bits. The patients phonated into a

microphone for less than 3 seconds, and samples of the sustained vowels

|a|, |e|, |i|, |o|, |u| were obtained twice. After recording, all voice data in

WAV format (including EGG data) were transferred to a PC running a Microsoft Windows

OS and were analyzed with Matlab and with Visual Studio .NET C++ and C# software.

3.2.2 Analysis and Results of Formant Frequencies

3.2.2.1 Estimation of Formant Frequencies

The speech signal is produced by the action of the vocal tract over the excitation coming

from the glottis. Different conformations of the vocal tract produce different resonances that

amplify frequency components of the excitation, resulting in the different sounds. These

resonance frequencies are called formant frequencies. The estimation of the formant

frequencies, mainly the first two formants, F1 and F2, has many practical applications. In

linguistics [81, 82], they are used for the characterization of the different sounds found in the

speech. Formant frequency measures can be obtained directly by visual inspection of the

spectrogram of the speech signal or automatically by means of a computational algorithm. A

useful technique for formant estimation is linear predictive coefficient (LPC) analysis.

In LPC analysis, an all-pole prediction filter models the vocal tract, and the angular positions

of the poles of the filter give the formant frequencies.
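
As an illustration of LPC-based formant estimation, the following Python sketch fits an all-pole model with the autocorrelation method (Levinson-Durbin recursion) and reads formant candidates from the peaks of the LPC envelope. Peak picking on the envelope is used here as a simple stand-in for computing the pole angles. The function names, the model order, and the synthetic two-resonance test signal (700 Hz and 1400 Hz at an 8 kHz sampling rate) are assumptions made for the example, not the settings used in this study.

```python
import cmath
import math


def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return A(z) = [1, a1, ..., ap] that minimizes the prediction error."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def lpc_formants(x, fs, order, step_hz=5.0):
    """Frequencies (Hz) of the local maxima of the LPC envelope 1/|A(f)|."""
    a = levinson_durbin(autocorrelation(x, order), order)
    n_points = int(fs / 2 / step_hz) + 1
    freqs = [i * step_hz for i in range(n_points)]
    mags = []
    for f in freqs:
        z = cmath.exp(-2j * math.pi * f / fs)
        mags.append(1.0 / abs(sum(c * z ** k for k, c in enumerate(a))))
    return [freqs[i] for i in range(1, n_points - 1)
            if mags[i - 1] < mags[i] > mags[i + 1]]


def resonator(x, f, fs, radius=0.97):
    """Two-pole resonance at f Hz, used only to build a test signal."""
    c1 = 2.0 * radius * math.cos(2.0 * math.pi * f / fs)
    c2 = -radius * radius
    y = []
    for n, xn in enumerate(x):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y.append(xn + c1 * y1 + c2 * y2)
    return y


# Hypothetical test frame: impulse response of two cascaded resonances.
fs = 8000
impulse = [1.0] + [0.0] * 1023
frame = resonator(resonator(impulse, 700.0, fs), 1400.0, fs)
formants = lpc_formants(frame, fs, order=4)
print(formants)   # two peaks, close to 700 Hz and 1400 Hz
```

For real vowels a higher model order and a windowed, pre-emphasized frame would normally be needed; those choices are left open here.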


Apart from a variety of formant tracking approaches [83, 84], considerable attention has

been paid to methods based on linear prediction analysis [85, 86]. However, capturing and

tracking formants accurately from noisy speech is not easy, largely because the accuracy of

root-finding algorithms based on LPC is sensitive to the noise level. Therefore we use a

formant tracking method based on detecting the change of the phase spectrum of LPC, as shown in

Fig. 3-2.

Figure 3-2. Formant frequencies tracking based on phase spectrum of LPC

3.2.2.2 Comparative Results

Table 3-1 presents the mean and S.D. of the formant frequencies F1, F2, and F3 for the

male group, the female group, and all subjects irrespective of sex, for the sustained vowels |a|, |e|, |i|, |o|, |u| before and

after laryngeal surgery. A paired t-test (Wilcoxon rank sum test) is used for the comparison

between preoperative and postoperative vowels. A p-value lower than 0.05 is considered significant.

According to the results, there is no significant difference overall between the two

groups (before and after surgery) of vowel data, although F1 of the |o| sound of males, the |a| and |u| sounds of

females, and the |a| and |o| sounds irrespective of sex changed significantly. In order to assess the

accuracy of the original frequency measurements, each vowel was re-measured twice, and the

mean F1, F2, and F3 frequencies were within 15.52±18.03 Hz, 21±19.74 Hz, and 19.08±19.21

Hz of the original values.

Table 3-1. Mean and S.D. of formant frequencies (Hz) from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery

F1
Group                 Vowel   Pre mean   Pre S.D.   Post mean   Post S.D.   p-value
Irrespective of sex   a       672.2      148.8      728.9       100.8       .019
Irrespective of sex   e       466.1      114.2      491.7       91.1        .132
Irrespective of sex   i       296.4      38.6       303.1       51.8        .398
Irrespective of sex   o       373.5      81.3       399.4       60.9        .006
Irrespective of sex   u       320.2      52.6       336.1       47.9        .062
Male                  a       659.6      67.0       677.2       55.5        .266
Male                  e       477.6      67.4       497.0       47.2        .158
Male                  i       283.7      32.2       275.4       32.2        .277
Male                  o       340.2      58.5       372.0       45.7        .018
Male                  u       302.0      55.2       306.3       30.7        .706
Female                a       686.1      205.9      785.8       109.7       .037
Female                e       453.4      151.1      485.9       124.1       .326
Female                i       310.4      40.8       333.5       52.8        .103
Female                o       410.1      88.2       429.5       62.3        .156
Female                u       340.1      42.4       368.8       41.9        .025

F2
Group                 Vowel   Pre mean   Pre S.D.   Post mean   Post S.D.   p-value
Irrespective of sex   a       1430.4     405.4      1304.6      305.3       .133
Irrespective of sex   e       1636.3     484.1      1672.8      477.0       .675
Irrespective of sex   i       1964.7     530.2      1943.9      636.7       .866
Irrespective of sex   o       857.7      332.7      888.9       343.8       .647
Irrespective of sex   u       864.5      345.9      832.1       310.6       .661
Male                  a       1446.8     451.1      1255.1      389.5       .170
Male                  e       1744.2     275.5      1714.5      241.9       .679
Male                  i       2065.3     354.1      2183.0      207.0       .189
Male                  o       860.7      399.6      889.1       380.5       .781
Male                  u       920.0      433.8      722.9       153.1       .067
Female                a       1412.3     359.3      1359.2      165.6       .554
Female                e       1517.7     627.3      1626.9      649.6       .517
Female                i       1854.1     666.0      1680.9      829.7       .477
Female                o       854.4      249.5      888.6       308.3       .712
Female                u       803.5      206.6      952.2       391.5       .124

F3
Group                 Vowel   Pre mean   Pre S.D.   Post mean   Post S.D.   p-value
Irrespective of sex   a       2862.9     556.3      2829.1      357.1       .734
Irrespective of sex   e       2831.3     361.2      2811.5      345.6       .779
Irrespective of sex   i       3244.9     362.6      3217.5      453.5       .737
Irrespective of sex   o       2170.2     396.6      2177.4      388.3       .928
Irrespective of sex   u       2274.3     402.2      2281.4      400.3       .936
Male                  a       2898.5     530.5      2730.5      420.9       .254
Male                  e       2885.6     348.3      2838.9      359.7       .633
Male                  i       3255.9     362.8      3320.6      327.0       .385
Male                  o       2215.4     476.9      2192.8      458.9       .862
Male                  u       2394.3     445.6      2264.9      259.3       .232
Female                a       2823.8     594.7      2937.7      235.9       .397
Female                e       2771.6     374.6      2781.3      335.9       .927
Female                i       3232.9     371.4      3103.9      547.4       .399
Female                o       2120.4     286.8      2160.5      303.4       .663
Female                u       2142.3     307.6      2299.6      520.3       .275


Figure 3-3. Box plots of F1, F2, and F3 formant frequencies of voiced sounds |a|, |e|,

|i|, |o|, |u| of male group before and after surgery

Figure 3-4. Box plots of F1, F2, and F3 formant frequencies of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery


Welch’s t-test for comparison between the two groups was also performed, and the results

are plotted in the box plots of Fig. 3-3 and 3-4. In the comparison before and after surgery, no

formant frequency of the vowels |a|, |e|, |i|, |o|, |u| of either sex differs significantly (all p-values are above 0.05)

except F1 of phonation |u| of females. These results also show no significant

difference between the two groups. From these results, we infer that the change in the

postoperative vowel is not related to the formant frequencies. However, we

analyzed the loci of the F1 and F2 formant frequencies (mean with 1 S.D.) to examine changes in their

distribution in detail (e.g. Fig. 3-5, 3-6). The distribution of F1 and F2 of men is smaller than that of women,

and the |a| sound of both groups is more clearly separated than the other phonations.

Figure 3-5. Loci of the mean and S.D. of 1 of F1 and F2 formant frequencies of male before and after surgery


Figure 3-6. Loci of the mean and S.D. of 1 of F1 and F2 formant frequencies of female before and after surgery

3.3 Analysis and Results of Fundamental Frequency Perturbation (Jitter)

3.3.1 Various Jitter Measures

Mean F0, Max F0, Min F0 and S.D. F0 denote the mean, maximum, minimum

and standard deviation of F0 in the analyzed segment, respectively. Phonatory frequency range

is a parameter indicating the range of tension of the vocal fold [87]. Mean absolute jitter

(MAJ) is the mean absolute difference between sequential vocal periods measured during

sustained phonations. However, absolute jitter is influenced by the mean fundamental

frequency of the speaker. For this reason, relative jitter measures such as Jitter (%) have been

proposed. Jitter (%) is the mean absolute difference between sequential vocal fundamental frequencies

divided by the mean frequency of the phonation.

Table 3-2. Various jitter measures

Description                                     Formula
Mean F0                                         $\frac{1}{n}\sum_{i=1}^{n} F_i$
Max F0                                          $\max(F_i)$
Min F0                                          $\min(F_i)$
S.D. F0                                         $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(F_i-\bar{F})^2}$
Phonatory Frequency Range                       $12 \times \frac{\log(F0_{hi}/F0_{lo})}{\log 2}$
Mean Absolute Jitter (MAJ)                      $\frac{1}{n-1}\sum_{i=1}^{n-1}\left|F_{i+1}-F_i\right|$
Jitter (%)                                      $\frac{MAJ}{F0_{av}}$
Pitch Perturbation Factor                       $\frac{N_{p \ge threshold}}{N_{voice}} \times 100$
Directional Pitch Perturbation Factor           $\frac{N_{\Delta\pm}}{N_{voice}} \times 100$
RAPP3                                           $\frac{1}{n-2}\sum_{i=2}^{n-1}\left|\frac{F_{i-1}+F_i+F_{i+1}}{3}-F_i\right| \times \frac{100}{F0_{av}}$
RAPP5                                           $\frac{1}{n-4}\sum_{i=3}^{n-2}\left|\frac{1}{5}\sum_{k=i-2}^{i+2}F_k-F_i\right| \times \frac{100}{F0_{av}}$
RAPP15                                          $\frac{1}{n-14}\sum_{i=8}^{n-7}\left|\frac{1}{15}\sum_{k=i-7}^{i+7}F_k-F_i\right| \times \frac{100}{F0_{av}}$

Other effective pitch perturbation features are the pitch perturbation factor (PPF) and the
directional pitch perturbation factor (DPPF). PPF, formulated by Lieberman, is defined as the
percentage of waveform periods exceeding a given threshold relative to the total number of
voiced pitch periods, and has proved to be sensitive to the presence of masses on the vocal
folds [88]. In this study, the threshold is 10 percent of the positioning error of pitch periods.
DPPF, proposed by Hecker & Kreul, is the percentage of the differences between adjacent pitch
periods for which there is a change in the algebraic sign [89, 90]. Finally, the relative average
pitch perturbations (RAPP) introduced by Koike [91] are the average absolute difference between
a period and the average of it and its closest neighbors, divided by the average period. In this
study, smoothing over 3, 5 and 15 pitch periods is used. These measures have been extensively
used in the last decade, since they are less sensitive to pitch extraction errors due to
smoothing in their calculation.
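
The jitter measures of Table 3-2 can be sketched in a few lines of Python. This is a minimal illustration that assumes a list of per-cycle F0 values in Hz has already been extracted. The function names are hypothetical, Jitter (%) is scaled by 100 as an assumption implied by its unit, and the directional factor uses the number of F0 values as its denominator, following the N_voice term of the table.

```python
import math


def mean_absolute_jitter(f0):
    """MAJ: mean absolute difference between consecutive F0 values."""
    return sum(abs(f0[i + 1] - f0[i]) for i in range(len(f0) - 1)) / (len(f0) - 1)


def jitter_percent(f0):
    """Jitter (%): MAJ relative to the mean F0 (x100 is an assumption)."""
    f0_av = sum(f0) / len(f0)
    return 100.0 * mean_absolute_jitter(f0) / f0_av


def phonatory_range_semitones(f0):
    """Phonatory frequency range: 12 * log(F0_hi / F0_lo) / log 2."""
    return 12.0 * math.log(max(f0) / min(f0)) / math.log(2.0)


def rapp(f0, width):
    """RAPP with an odd smoothing width (3, 5 or 15 values)."""
    half = width // 2
    f0_av = sum(f0) / len(f0)
    devs = [abs(sum(f0[i - half:i + half + 1]) / width - f0[i])
            for i in range(half, len(f0) - half)]
    return 100.0 * sum(devs) / len(devs) / f0_av


def dppf(f0):
    """Percentage of adjacent differences whose algebraic sign changes."""
    diffs = [f0[i + 1] - f0[i] for i in range(len(f0) - 1)]
    changes = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    return 100.0 * changes / len(f0)


# Hypothetical F0 track (Hz) alternating by 2 Hz from cycle to cycle.
track = [100.0, 102.0, 100.0, 102.0, 100.0]
print(mean_absolute_jitter(track), jitter_percent(track), rapp(track, 3), dppf(track))
```

Because RAPP compares each value with a local average, a slow drift of F0 contributes far less to it than to MAJ, which is the smoothing effect mentioned above.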

3.3.2 Comparative Results

Tables B-1, B-2, B-3, and B-4 in APPENDIX B present the various jitter-related measures
before and after surgery, analyzed by the Wilcoxon rank sum test. Among the 12
measures, significant changes are found in the means of these measures except Max F0 and
Min F0. In particular, the mean F0 values of the male group differ significantly between the
preoperative and postoperative vowels, whereas those of the female group do not
show remarkable changes. This result is plotted in Fig. 3-7. In the male group, the average differences
between the vowels |a|, |e|, |i|, |o|, |u| before and after surgery are -17.3 Hz (-12.8 %), -16.9 Hz (-12.4 %), -17.1 Hz
(-12.4 %), -21.5 Hz (-15.2 %), and -22.3 Hz (-15.6 %), respectively. Min F0 is also
significantly different in the male group, but not in the female group.
However, Max F0 of both groups is significantly different. The S.D. of F0 also shows significant
changes in the vowels |e|, |i|, |u| of males and |a|, |e|, |i| of females according to Table B-2 in
APPENDIX B. Phonatory frequency range, mean absolute jitter (MAJ), and Jitter (%) are
good surrogates for discriminating between the two groups, before and after laryngeal surgery.
Nearly all of their values differ significantly from the pre-treatment data. Phonatory
frequency range decreases to about half of its value before surgery. MAJ and Jitter (%)
also decrease to about a half or a seventh of their values before surgery. The detailed decreases are
given in Tables B-2 and B-3 in APPENDIX B. The pitch perturbation factor and the directional
perturbation factor also show significant differences, but some vowels do not show changes:
phonations |a| and |o| of males and |a|, |i|, |o|, and |u| of females in the pitch perturbation factor,
and |i|, |o|, and |u| of females in the directional perturbation factor. The RAPP-related measures are
also good criteria for classification between the preoperative and postoperative groups. As
shown in Fig. 3-10, all of these measures differ significantly for all vowels of both sexes,
excluding phonation |i| of females in RAPP3 and RAPP15. These results are presumed to arise because the |i|
sound has more high-frequency components than the other phonations.


Figure 3-7. Mean F0 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery

Figure 3-8. Jitter (%) of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery


Figure 3-9. Pitch perturbation factor of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery

Figure 3-10. RAPP15 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after

laryngeal surgery


3.3.3 Analysis and Results of Intensity Perturbation (Shimmer)

3.3.3.1 Various Shimmer Measures

Mean Amplitude (Amp), Max Amp, Min Amp and S.D. Amp denote the mean,

maximum, minimum and standard deviation of the amplitude in the analyzed

segment, respectively.

Table 3-3. Various shimmer measures

Description                                     Formula
Mean Amp                                        $\frac{1}{n}\sum_{i=1}^{n} A_i$
Max Amp                                         $\max(A_i)$
Min Amp                                         $\min(A_i)$
S.D. Amp                                        $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(A_i-\bar{A})^2}$
Shimmer (dB)                                    $\frac{1}{n-1}\sum_{i=1}^{n-1}\left|20\log\frac{A_i}{A_{i+1}}\right|$
Mean Absolute Shimmer (MAS)                     $\frac{1}{n-1}\sum_{i=1}^{n-1}\left|A_{i+1}-A_i\right|$
Shimmer (%)                                     $\frac{MAS}{Amp_{av}}$
Amplitude Perturbation Factor                   $\frac{N_{p \ge threshold}}{N_{voice}} \times 100$
Amplitude Directional Perturbation Factor       $\frac{N_{\Delta\pm}}{N_{voice}} \times 100$
RAAP3                                           $\frac{1}{n-2}\sum_{i=2}^{n-1}\left|\frac{A_{i-1}+A_i+A_{i+1}}{3}-A_i\right| \times \frac{100}{Amp_{av}}$
RAAP5                                           $\frac{1}{n-4}\sum_{i=3}^{n-2}\left|\frac{1}{5}\sum_{k=i-2}^{i+2}A_k-A_i\right| \times \frac{100}{Amp_{av}}$
RAAP15                                          $\frac{1}{n-14}\sum_{i=8}^{n-7}\left|\frac{1}{15}\sum_{k=i-7}^{i+7}A_k-A_i\right| \times \frac{100}{Amp_{av}}$

Shimmer (dB) is the average absolute base-10 logarithm of the ratio between the
amplitudes of consecutive periods, multiplied by 20. Mean absolute shimmer (MAS) is the
average absolute difference between the amplitudes of consecutive periods, and
Shimmer (%) is the MAS value divided by the average amplitude in order to reduce the influence of
the mean amplitude of the speaker. Other effective amplitude perturbation features are the
amplitude perturbation factor (APF) and the directional amplitude perturbation factor (DAPF).
APF is defined as the percentage of waveform periods exceeding a given
threshold relative to the total number of voiced pitch periods; the threshold is 10
percent of the intensity error of amplitudes. DAPF is the percentage of the
differences between adjacent amplitudes for which there is a change in the algebraic sign.
Finally, the relative average amplitude perturbations (RAAP) are the average absolute difference
between an amplitude and the average of it and its closest neighbors, divided by the average
amplitude. In this study, 3, 5 and 15 amplitude points are selected as smoothing factors.
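
As a small worked illustration of the three basic shimmer measures, the sketch below computes MAS, Shimmer (%) and Shimmer (dB) from a list of per-cycle peak amplitudes. The function names and the example amplitudes are hypothetical, Shimmer (%) is scaled by 100 as an assumption implied by its unit, and the dB measure is taken on the ratio of consecutive amplitudes.

```python
import math


def mean_absolute_shimmer(amp):
    """MAS: mean absolute difference between consecutive amplitudes."""
    return sum(abs(amp[i + 1] - amp[i]) for i in range(len(amp) - 1)) / (len(amp) - 1)


def shimmer_percent(amp):
    """Shimmer (%): MAS relative to the mean amplitude (x100 is an assumption)."""
    amp_av = sum(amp) / len(amp)
    return 100.0 * mean_absolute_shimmer(amp) / amp_av


def shimmer_db(amp):
    """Shimmer (dB): mean of |20 log10| of the ratio of consecutive amplitudes."""
    ratios = [abs(20.0 * math.log10(amp[i + 1] / amp[i])) for i in range(len(amp) - 1)]
    return sum(ratios) / len(ratios)


# Hypothetical amplitude track in which every other cycle is twice as strong.
amps = [1.0, 2.0, 1.0, 2.0]
print(mean_absolute_shimmer(amps), shimmer_percent(amps), shimmer_db(amps))
```

Doubling the amplitude between cycles corresponds to about 6 dB, so Shimmer (dB) stays the same if the whole recording is rescaled, whereas MAS does not.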

Figure 3-11. Shimmer (%) of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery


Figure 3-12. RAAP15 of vowel |a|, |e|, |i|, |o|, |u| of male and female group before and after laryngeal surgery


Tables B-5, B-6, B-7, and B-8 in APPENDIX B present the various shimmer-related measures
before and after surgery, analyzed by the Wilcoxon rank sum test. Among the 12
measures, significant changes are found only in Shimmer (dB), Shimmer (%), and the
RAAP-related measures, in contrast to the significant changes in most of the jitter-related
measures. Mean Amp, Max Amp, Min Amp, S.D. Amp, MAS, the amplitude
perturbation factor, and the amplitude directional perturbation factor show no significant
changes. Shimmer (dB) and Shimmer (%) of both sexes are assumed to be good
discriminators between the preoperative and postoperative groups. However, the p-values of these
measures for phonation |i| indicate that they are poor classifiers for that vowel. These results are plotted in Fig. 3-11.
Postoperative voiced sounds show significantly lower values of the RAAP-related measures in all
phonations of both sexes except phonations |i|, |o|, |u| of the female group in RAAP3 and RAAP5,
and phonation |i| of the female group in RAAP15. In particular, the larger the smoothing factor (from 3
to 15), the bigger the gap between the two groups (before and after surgery). This suggests that
local fluctuations of shimmer are smoothed by a long smoothing factor, so that the
analyzed segment is evaluated more precisely.

3.3.4 Analysis and Results of Noise Components

3.3.4.1 Estimation of the noise in the spectral domain

Besides the presence of noise, one should also take into account factors that affect the
value of the parameter but do not have a pathological origin (such as slow amplitude and
pitch variations caused by the person’s emotional state). This is necessary to distinguish normal from
slightly pathological voices. In this sense, some changes have been made to the popular method of Yumoto [92] for
determining the harmonic-to-noise ratio, which is known to have insufficient precision,
aimed at decreasing the influence of non-pathological factors [93].

Spectral methods for noise estimation in a voice signal are based on Parseval’s theorem,
which states that the mean power of the signal (taken in the time domain) equals the sum of the
powers of its spectral components. Spectral methods have some significant advantages compared to
methods based on the time domain. First, they are better able to separate the
harmonic from the noise component (additive and modulation noise) of the signal. Second, they
are less sensitive to slow changes in pitch and amplitude caused by the patient’s failure to hold a
constant tone and power during phonation, or by emotional factors.
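
Parseval's theorem, on which the spectral methods rely, can be checked numerically on a short frame. The sketch below is only an illustration: it uses a naive discrete Fourier transform written for clarity rather than speed, and the two-component test frame is a hypothetical signal, not data from this study.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform, O(n^2), adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def mean_power_time(x):
    """Mean power computed directly from the samples."""
    return sum(v * v for v in x) / len(x)


def mean_power_spectral(x):
    """Sum of the powers of the normalized spectral components."""
    n = len(x)
    return sum(abs(c / n) ** 2 for c in dft(x))


# Hypothetical frame: two sinusoids with amplitudes 1.0 and 0.3.
n = 64
frame = [math.sin(2 * math.pi * 5 * t / n)
         + 0.3 * math.sin(2 * math.pi * 12 * t / n + 0.7)
         for t in range(n)]
print(mean_power_time(frame), mean_power_spectral(frame))
```

Both values agree up to rounding, which is what allows the signal energy to be split into harmonic and noise parts bin by bin in the spectral domain.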

3.3.4.2 Estimation of Harmonic-to-Noise Ratio (HNR)

HNR estimation in speech signals based on cepstral liftering was first developed by de Krom [94],
and it is implemented by cutting out a specific region in the quefrency domain. The
essential principle of the de Krom approach is that the harmonics can be removed from the
spectrum of voiced speech using cepstral processing, and hence the noise can be estimated at
all frequencies in the spectrum. The rahmonics, representing the prominent peaks at integer
multiples of the fundamental period (T) in the cepstrum of voiced speech, are removed
through comb-liftering. The resulting comb-liftered cepstrum is Fourier transformed to
obtain a noise spectrum (log power spectrum in dB), $N_{ap}(f)$, which is subtracted from the
log power spectrum, $O(f)$, of the original signal. This gives a source-related log power
spectrum, $H_{ap}(f)$. A baseline correction factor $B_d(f)$, defined as the deviance of
harmonic peaks from the 0 dB line, is determined. This factor is subtracted from the
estimated noise spectrum to yield a modified noise spectrum. The modified noise
spectrum $N(f)$ is then subtracted from the original log-spectrum in order to estimate the
harmonics-to-noise ratio (HNR).

Figure 3-13. HNR ratio calculation using Cepstral smoothing in Spectrum

A voiced speech waveform, $S_{en}(t)$, including aspiration noise [95], $n(t)$, at the glottal
source, can be approximated as in equation (2.1):

$$S_{en}(t) = [e(t) * g(t) + n(t)] * v(t) * r(t) \qquad (2.1)$$

where $e(t)$ is a periodic impulse train, $g(t)$ is a single glottal pulse, $v(t)$ is the impulse response of
the vocal tract, $r(t)$ represents the radiation load and $*$ indicates convolution. Applying a
Hanning window $w(t)$ gives equation (2.2):

$$S^{w}_{en}(t) = \{[e(t) * g(t) + n(t)] * v(t) * r(t)\} \times w(t) \qquad (2.2)$$

Provided the window length is sufficiently long, the window function can be moved inside the
convolution [96] to give equation (2.3):

$$S^{w}_{en}(t) = [e_w(t) * g(t) + n_w(t)] * v(t) * r(t) \qquad (2.3)$$

Taking the FFT gives equation (2.4):

$$S^{w}_{en}(f) = [E_w(f) \times G(f) + N_w(f)] \times V(f) \times R(f) \qquad (2.4)$$

Taking the logarithm of the magnitude squared values and approximating the signal energy at
harmonic locations, $\log|S^{w}_{en}(f)|^2_h$ (equation (2.5)), and at between-harmonic locations,
$\log|S^{w}_{en}(f)|^2_{bh}$ (equation (2.6)), gives

$$\log|S^{w}_{en}(f)|^2_h = \log|E_w(f) \times G(f)|^2 + \log|V_R(f)|^2 \qquad (2.5)$$

$$\log|S^{w}_{en}(f)|^2_{bh} = \log|N_w(f)|^2 + \log|V_R(f)|^2 \qquad (2.6)$$

where $V_R(f)$ is the FFT of $v(t)$ and $r(t)$ combined.

Figure 3-14. Influence of cepstral smoothing due to liftered long-term temporal window (57ms)


Figure 3-15. Influence of cepstral smoothing due to liftered short-term temporal window (1ms)

Applying the cepstrum to this signal and obtaining the liftered baseline, it can be seen

that the baseline is influenced by the noise source and temporal analysis window size. As the

temporal window length increases the baseline approximates the noise floor more accurately,

because more estimates are available for the between harmonics as opposed to the harmonics

and the Fourier transform of the liftered cepstrum behaves like a moving average filter applied

to the logarithmic spectrum.


3.3.5 Estimation of Degree of Hoarse (DH) and Normalized Noise Energy (NNE)

Degree of Hoarse (DH) and Normalized Noise Energy are similar to HNR, and they also

estimate noise embedded in speech signal. First, the signal spectrum is calculated for a limited

segment of the signal which requires the use of a Hanning window in this study. Since noise in

the harmonics range should also be taken into account in order to find the total noise energy

precisely enough, the harmonics range width must be less than half the fundamental frequency.

The analyzed signal is divided into segments of width TW and displacement TW /2, and the

harmonic and noise component energies are determined for each segment. The ultimate values

of each of these energies are determined after averaging over the whole signal. With spectral

methods, energies of the periodic and noise components are calculated on the basis of the

power spectrum of the voice signal, obtained through FFT. In order to reduce the influence of

slow signal variations as much as possible, the Hanning window length is taken to be 14

periods, i.e. W = 14S/F0 where W is the window length expressed in points and S is the

sampling rate. All spectrum components within the harmonics band (Energy of Harmonic:

EH), HE , are summed up to obtain the harmonics energy, while the sum of the rest of the

components (Energy of Noise: EN), NE , is taken to be the noise energy:

max

1

k

k

k f

H i kk i f

E X a= =

= − ∑ ∑ (2.7)

max 1

11 1

k

k

k f

N i kk i f

E X a−

−= = +

= + ∑ ∑ (2.8)

where: iX is the power spectrum of the voice signal, maxk is the number of harmonics,

1

1

10

2

2

k

k

f

k ii f

ba X

f b−

−

= +

=− ∑ (2.9)

is the noise energy within the harmonics band and is based on the assumption that noise has a

random character and is uniformly distributed through the whole signal range,

"'

0 0( ) ( )k

f kf b and f kf b= − = +∫ (2.10)

are the lower and upper frequency bounds of the k-th harmonic band, respectively, f0=F0N/S

and b=1.5N/W are the fundamental frequency (F0 is in Hz, f0 is in number of samples) and

- 65 -

half of the harmonics band width, respectively. In order to avoid the influence of the f0

evaluation error, the values of NNE and DH are calculated for all possible fundamental

frequencies:

'0

max

iffH

= (2.11)

where if take an integer value in the interval:

0 max[ ( ) ]if f Hδ∈ +∫ (2.12)

$H_{max}$ is the number of the highest-frequency harmonic and δ = 0.25 is the experimentally determined maximum error of the f0 calculation. The final values of NNE and DH are taken to be the smallest values obtained. The values of DH and NNE obtained for each segment are averaged over the whole phonation. DH is the ratio of $E_N$ to $E_H$, averaged over the whole phonation:

$$DH = \frac{1}{n} \sum_{i=1}^{n} \frac{E_N}{E_H} \qquad (2.13)$$

NNE [97] is the ratio of $E_N$ to $E_H + E_N$, calculated for the frequency range up to 4 kHz, averaged over the whole phonation, and converted into dB:

$$NNE = 10 \log_{10} \left( \frac{1}{n} \sum_{i=1}^{n} \frac{E_N}{E_H + E_N} \right) \ [dB] \qquad (2.14)$$
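The band-splitting of Equations (2.7) to (2.10) and the averaging of Equations (2.13) and (2.14) can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names are hypothetical, band edges are rounded to the nearest FFT bin (an assumption), and the search over candidate fundamental frequencies of Equations (2.11) and (2.12) is omitted.

```python
import math

def harmonic_noise_energy(X, f0, b, k_max):
    """Split a power spectrum X (list of bin powers) into E_H and E_N.

    f0 and b are expressed in FFT bins. Rounding the band edges to the
    nearest bin is an assumption of this sketch.
    """
    E_H = 0.0
    E_N = 0.0
    prev_upper = int(round(b))             # f''_0 = 0*f0 + b, from Eq. (2.10)
    for k in range(1, k_max + 1):
        lo = int(round(k * f0 - b))        # f'_k
        hi = int(round(k * f0 + b))        # f''_k
        band = sum(X[lo:hi + 1])           # bins inside the k-th harmonic band
        between = sum(X[prev_upper + 1:lo])  # bins between bands k-1 and k
        a_k = 2.0 * b / (f0 - 2.0 * b) * between  # Eq. (2.9)
        E_H += band - a_k                  # Eq. (2.7)
        E_N += between + a_k               # Eq. (2.8)
        prev_upper = hi
    return E_H, E_N

def dh_and_nne(segments):
    """segments: list of (E_H, E_N) pairs, one per analysis segment."""
    n = len(segments)
    dh = sum(en / eh for eh, en in segments) / n                         # Eq. (2.13)
    nne_db = 10.0 * math.log10(sum(en / (eh + en) for eh, en in segments) / n)  # Eq. (2.14)
    return dh, nne_db
```

For a flat noise floor with three strong harmonic peaks, most of the energy is assigned to $E_H$ and the resulting NNE is negative in dB.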

3.3.6 Estimation of the normalized first harmonic energy (NFHE)

Breathy phonation is reflected in the acoustic signal as a relatively strong first harmonic and weak high-frequency harmonics. The ratio between the amplitudes of the first and second harmonics is usually used to evaluate the vocal quality of these types of voices [98]. A newer parameter, called the normalized first harmonic energy (NFHE) [98], is more informative than the ratio between the first and second harmonics for discriminating between normal and breathy phonations [93]. The frequency band up to 4 kHz used for this parameter is chosen because strong harmonics are absent beyond 4 kHz for both normal and pathological voices.

Figure 3-16. Plot of harmonic and noise phase segment in spectral domain

To calculate the parameter NFHE, the voice signal is divided into segments containing an equal number of periods, and the FFT is calculated for each segment using a Hanning window. The segment length should equal that stated above, i.e. 14 periods. The value of NFHE for every segment i is obtained from Equation (2.15):

$$NFHE_i = \frac{\sum_{f=f_0-b}^{f_0+b} P(f)}{\sum_{k=2}^{k_{max}} \sum_{f=kf_0-b}^{kf_0+b} P(f)} \qquad (2.15)$$

where P(f) is the power spectrum of the signal, f0 is the fundamental frequency, b = 1.5N/W is half of the harmonic band width, k is the ordinal number of the harmonic, and kmax is the number of harmonics within the 4 kHz range. N and W are the number of points used to calculate the FFT and the length of the time window, respectively.
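As an illustration of the NFHE ratio, the following Python sketch sums the power in the band around the first harmonic and divides it by the power in the bands around the higher harmonics. The function name is hypothetical, band edges are rounded to the nearest bin, and the assumption that the denominator starts at the second harmonic is one reading of Equation (2.15), not a confirmed detail.

```python
def nfhe(P, f0, b, k_max):
    """Normalized first harmonic energy for one segment.

    P is the power spectrum (list of bin powers); f0 and b are in FFT bins.
    Assumption: the denominator covers harmonics 2..k_max.
    """
    def band_energy(k):
        lo = int(round(k * f0 - b))
        hi = int(round(k * f0 + b))
        return sum(P[lo:hi + 1])

    first = band_energy(1)
    rest = sum(band_energy(k) for k in range(2, k_max + 1))
    return first / rest
```

A breathy voice, with a strong first harmonic and weak upper harmonics, yields a larger value than a voice with evenly strong harmonics.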


According to Tables 3-4 and 3-5, all noise estimation measures of the male group change significantly after laryngeal surgery, whereas the NNE and DH measures of the female group show no remarkable difference. Tables 3-4 and 3-5 show the results of the noise estimation measures, compared between preoperative and postoperative vowels with the Wilcoxon rank-sum test. The pattern of unchanged phonations differs somewhat between the male and female groups. In the male group, phonations |o| and |u| for HNR, |i| for DH, and |i|, |o|, and |u| for NFHE remain constant before and after surgery, whereas in the female group, phonations |a| and |u| for HNR, |a|, |i|, |o|, and |u| for NNE, and |a|, |i|, |o|, and |u| for DH remain constant. In particular, phonation |a| of the female group presents no significant difference. In both sexes, phonation |e| is a clear surrogate for separating the preoperative and postoperative groups. The HNR of the postoperative vowel is higher than that of the preoperative vowel, on average by 2.46 dB for |a|, 2.38 dB for |e|, 2.35 dB for |i|, 1.11 dB for |o|, and 0.77 dB for |u| in the male group, and by 1.16 dB for |a|, 1.17 dB for |e|, 1.14 dB for |i|, 1.4 dB for |o|, and 1.37 dB for |u| in the female group. In the NNE measure, all phonations increase, by 3.90 dB for |a|, 5.31 dB for |e|, 2.30 dB for |i|, 1.74 dB for |o|, and 1.33 dB for |u| in the male group, and by 2.01 dB for |a|, 2.88 dB for |e|, 0.71 dB for |i|, 1.41 dB for |o|, and 0.79 dB for |u| in the female group. The DH of the male group also increases significantly, whereas that of the female group increases somewhat but not significantly. However, NFHE, a representative measure of breathiness, decreases significantly except for |i|, |o|, and |u| of the male group.

Figure 3-17. Box plots of HNR, NNE, and DH of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery

Figure 3-18. Box plots of HNR, NNE, and DH of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery


Figure 3-19. Box plots of NFHE of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery

Figure 3-20. Box plots of NFHE of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery

HNR
Group                Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Irrespective of sex  a      15.90     3.53      17.74      2.79       .001
Irrespective of sex  e      15.85     3.20      17.65      2.60       .000
Irrespective of sex  i      14.86     3.20      16.63      2.57       .001
Irrespective of sex  o      14.07     2.60      15.32      2.66       .005
Irrespective of sex  u      13.46     2.68      14.52      2.08       .023
Male                 a      15.10     2.67      17.56      2.13       .001
Male                 e      15.09     2.75      17.47      2.57       .000
Male                 i      14.06     2.79      16.41      2.61       .007
Male                 o      13.93     2.55      15.04      2.95       .082
Male                 u      13.84     2.27      14.61      1.77       .147
Female               a      16.78     4.18      17.94      3.43       .155
Female               e      16.68     3.52      17.85      2.68       .059
Female               i      15.74     3.45      16.88      2.57       .046
Female               o      14.23     2.72      15.63      2.35       .033
Female               u      13.05     3.07      14.42      2.42       .086

NNE
Group                Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Irrespective of sex  a      9.97      4.53      13.00      2.70       .000
Irrespective of sex  e      10.23     4.50      14.39      3.56       .000
Irrespective of sex  i      10.49     4.49      12.04      3.44       .006
Irrespective of sex  o      9.04      2.97      10.63      3.04       .001
Irrespective of sex  u      8.42      2.91      9.49       2.47       .016
Male                 a      7.91      3.06      11.81      2.13       .000
Male                 e      7.87      2.80      13.18      3.50       .000
Male                 i      8.48      3.82      10.79      2.74       .008
Male                 o      7.83      2.59      9.58       2.81       .005
Male                 u      7.59      2.95      8.92       2.53       .028
Female               a      12.28     4.83      14.30      2.70       .069
Female               e      12.83     4.64      15.71      3.22       .009
Female               i      12.70     4.20      13.41      3.66       .323
Female               o      10.37     2.84      11.78      2.92       .073
Female               u      9.32      2.64      10.12      2.29       .247

DH
Group                Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Irrespective of sex  a      5.78      2.74      7.24       1.99       .000
Irrespective of sex  e      5.73      2.81      8.24       2.54       .000
Irrespective of sex  i      6.58      3.03      7.26       2.75       .048
Irrespective of sex  o      5.45      1.91      6.42       2.41       .003
Irrespective of sex  u      4.83      1.64      5.69       1.47       .002
Male                 a      4.49      1.36      6.34       1.36       .000
Male                 e      4.40      1.14      7.22       2.25       .000
Male                 i      5.19      2.05      6.12       1.80       .050
Male                 o      4.61      1.37      5.60       1.99       .033
Male                 u      4.26      1.29      5.31       1.26       .001
Female               a      7.20      3.19      8.22       2.13       .137
Female               e      7.20      3.37      9.36       2.40       .003
Female               i      8.11      3.25      8.52       3.09       .429
Female               o      6.38      2.03      7.32       2.55       .052
Female               u      5.45      1.78      6.10       1.59       .184

Table 3-4. Mean and S.D. of HNR, NNE, and DH from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery

Table 3-5. Mean and S.D. of NFHE from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery

NFHE
Group                Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Irrespective of sex  a      11.36     4.41      7.75       2.26       <.001
Irrespective of sex  e      11.90     5.18      7.50       2.68       <.001
Irrespective of sex  i      12.45     5.40      9.00       3.11       <.001
Irrespective of sex  o      13.24     5.24      9.73       2.95       <.001
Irrespective of sex  u      14.74     6.45      10.95      3.25       <.001
Male                 a      10.56     3.99      6.60       1.69       <.001
Male                 e      11.66     5.77      6.59       2.40       <.001
Male                 i      9.83      4.77      10.85      2.78       0.35
Male                 o      10.84     4.10      8.83       2.60       .033
Male                 u      12.93     6.92      9.37       2.55       .033
Female               a      12.24     4.77      9.01       2.15       .004
Female               e      12.18     4.51      8.50       2.67       .001
Female               i      13.42     3.92      10.40      2.81       .002
Female               o      15.89     5.16      10.72      3.05       <.001
Female               u      16.74     5.38      12.69      3.10       .004

3.3.7 Analysis and Results of Electroglottographic Parameters

3.3.7.1 Estimation of Open Quotient and Speed Quotient

Two well-known parameters are used for analysis of the EGG waveform: Open Quotient (OQ) and Speed Quotient (SQ) [99]. OQ can be used to evaluate the EGG duty cycle, and SQ can be used to determine the spectral properties of the generated voiced sound. However, how to find the glottal opening and closing points needed to obtain OQ and SQ remains controversial. The time instant of glottal closure is readily detectable as a positive peak in the time derivative of the EGG signal (DEGG), but there is no agreement on how to define the time instant of glottal opening. In the past, the minimum of the DEGG has been used as a marker for the glottal opening [100], but often no clear minimum exists during a glottal cycle. Even if there is a clear minimum, there is no agreement on whether it actually marks the instant of glottal opening. An alternative approach to analyzing the vibrational patterns of a glottal cycle in an EGG signal is to define the time instant as the point of intersection between the (falling edge of the) EGG signal and a threshold line. With the threshold intersection criterion, various thresholds have been used, such as 25%, 30%, 40%, 50% and 75% of the signal peak-to-peak amplitude [101]. It is generally agreed that the results within a study do not strongly depend on the threshold value as long as a constant value is used consistently. In this study, the OQ is defined as the ratio between the duration of the phase in which the glottis is open and the whole duration of the glottal cycle, multiplied by 100 to express the values in percent, as shown in Fig. 3-21 and Fig. 3-22.

Figure 3-21. Determination of start point of opening phase and closing phase in EGG, 16-smoothed EGG, and differentiated EGG waveform

Figure 3-22. Detailed definition of the opening and closing phase periods in the EGG waveform

The hybrid method suggested by Howard, which detects closing instants as peaks of the DEGG signal and estimates opening instants on the EGG signal using a threshold of 3/7, is adopted in this study (Equation 2.16). Speed Quotient (SQ) is defined as the ratio of rise time (increased contact) to fall time (decreased contact), which leads to an inverted measure compared to the SQ's definition within the acoustic signal. This means that higher values of the SQ indicate more symmetrical EGG pulses (Equation 2.17).

$$\text{Open Quotient (OQ)} = \frac{T_{open}}{T_{cycle}} \times 100 \ (\%) \qquad (2.16)$$

$$\text{Speed Quotient (SQ)} = \frac{t_A}{t_B} \times 100 \ (\%) \qquad (2.17)$$
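A minimal Python sketch of the two quotients is given below. It assumes that one glottal cycle has already been cut from the EGG signal, starting at the closure instant (located, for instance, as the positive DEGG peak), and that the opening instant is the first sample after the contact peak that falls below 3/7 of the peak-to-peak amplitude. The function names and this exact reading of the threshold criterion are assumptions made for illustration.

```python
def open_quotient_hybrid(cycle, threshold=3.0 / 7.0):
    """OQ in percent for one EGG cycle that starts at the closure instant.

    Assumption: the glottis is considered open from the threshold crossing
    on the falling edge until the end of the cycle (the next closure).
    """
    lo, hi = min(cycle), max(cycle)
    level = lo + threshold * (hi - lo)
    peak = cycle.index(hi)
    # first sample after the contact peak that drops below the threshold level
    opening = next((i for i in range(peak, len(cycle)) if cycle[i] < level),
                   len(cycle))
    t_open = len(cycle) - opening
    return 100.0 * t_open / len(cycle)

def speed_quotient(t_a, t_b):
    """SQ in percent: rise time t_a over fall time t_b, as in Eq. (2.17)."""
    return 100.0 * t_a / t_b
```

With this definition, equal rise and fall times give SQ = 100 %, which corresponds to a symmetrical EGG pulse.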


As shown in Table 3-6, there is no evident difference between preoperative and postoperative vowels. The mean OQ of all phonations in the male group changes slightly but not significantly, and the same holds for the female group. The mean SQ of both sexes also shows no significant difference. However, when the subjects are divided into two groups by the mean SQ value, a particular characteristic emerges: SQ regresses toward the normal range, as shown in Table 3-7. The two groups (Low-SQ and High-SQ) are divided by the mean value of the postoperative vowel. In the Low-SQ group, with SQ lower than the postoperative mean of each phonation, SQ tends to increase toward the mean value, and in the High-SQ group, with SQ higher than the postoperative mean of each phonation, SQ tends to decrease toward the mean value. This finding is taken to indicate that the hypertensive vibration pattern of the vocal cords is normalized after laryngeal surgery.

Table 3-6. Mean and S.D. of open quotient and speed quotient from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery

                            Open Quotient                                  Speed Quotient
Group                Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Irrespective of sex  a      51.27     10.80     48.14      8.77       .196     75.29     40.46     82.20      34.25      .433
Irrespective of sex  e      50.44     9.63      46.85      8.44       .079     74.47     36.21     86.25      36.37      .124
Irrespective of sex  i      47.79     10.04     46.33      10.35      .495     86.53     45.04     88.36      38.92      .842
Irrespective of sex  o      47.91     9.67      47.60      9.87       .881     85.58     42.06     84.89      38.35      .933
Irrespective of sex  u      46.09     11.49     46.29      9.59       .925     98.37     68.82     92.47      60.67      .653
Male                 a      52.92     12.58     45.63      7.47       .050     80.59     46.51     97.75      33.14      .224
Male                 e      51.03     12.23     45.23      8.19       .088     83.66     46.07     99.20      34.29      .220
Male                 i      47.12     11.87     46.26      11.57      .808     100.87    51.87     98.9       42.53      .898
Male                 o      48.64     10.21     46.05      11.78      .440     91.53     42.12     101.31     41.25      .447
Male                 u      45.34     12.73     43.95      10.01      .703     116.77    82.59     116.26     71.97      .983
Female               a      49.46     8.38      50.91      9.43       .628     69.45     32.76     65.09      27.01      .672
Female               e      49.78     5.82      64.35      16.64      .598     64.35     16.64     71.99      33.86      .598
Female               i      48.52     7.80      46.40      9.11       .370     70.77     30.08     76.77      31.61      .536
Female               o      47.11     9.23      49.31      7.15       .337     79.03     42.07     66.83      25.25      .224
Female               u      46.91     10.22     48.86      8.64       .300     78.13     43.08     66.30      29.00      .131

Table 3-7. Mean and S.D. of speed quotient from sustained vowel |a|, |e|, |i|, |o|, |u| before and after laryngeal surgery

               Low-SQ                                         High-SQ
Group   Vowel  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p     Pre Mean  Pre S.D.  Post Mean  Post S.D.  p
Male    a      50.18     20.37     101.87     32.53      .001  133.81    25.15     90.54      35.15      .030
Male    e      52.33     20.48     96.80      31.51      .003  138.48    13.89     103.41     40.64      .033
Male    i      59.91     19.94     100.72     42.64      .013  150.01    29.53     96.70      44.60      .023
Male    o      65.44     22.85     102.03     38.12      .010  137.18    24.69     100.06     49.01      .085
Male    u      72.05     27.16     110.04     36.91      .014  195.04    89.98     127.13     113.12     .270
Female  a      48.29     10.60     71.27      30.09      .067  90.61     34.04     58.91      23.45      .028
Female  e      55.43     9.29      74.22      39.70      .088  85.16     9.27      66.80      14.55      .062
Female  i      59.88     10.47     74.99      33.05      .104  132.46    31.69     86.85      24.02      .228
Female  o      54.36     7.64      63.21      29.94      .284  124.86    41.46     73.56      12.07      .015
Female  u      54.71     6.96      58.65      18.01      .431  132.78    42.54     84.16      42.51      .016

Figure 3-23. Box plots of OQ and SQ of voiced sounds |a|, |e|, |i|, |o|, |u| of male group before and after surgery

Figure 3-24. Box plots of OQ and SQ of voiced sounds |a|, |e|, |i|, |o|, |u| of female group before and after surgery

3.4 Summary

In this chapter, our results show that the acoustic and electroglottographic characteristics of vowels change after laryngeal surgery. The mean pitch of the male group decreased by about 12-15% of the preoperative pitch, whereas that of the female group did not change significantly. Formant frequencies remained constant before and after surgery. Most jitter measures changed significantly, but only some shimmer measures differed after surgery. Among the noise estimation measures (HNR, NNE, DH, and NFHE), some phonations showed significant differences depending on sex. Finally, no changes were found in the OQ and SQ measures of the EGG, but when the subjects were divided into two groups by the mean SQ value, SQ regressed toward the normal range.

Chapter 4

Modification of Preoperative Vowel Sounds based on Acoustic and

Electroglottographic Analysis

4.1 Introduction to Perception of Aperiodicity in Pathological Voices

Jitter, shimmer, and noise are surrogates for acoustic measurement of voice signals and are often considered indices of the perceived quality of both normal and pathological voices. Many applications of acoustic measures to the assessment of vocal quality derive their validity from the relevance of specific acoustic properties of the signal to auditory perceptions of voice. Researchers typically use correlation or regression techniques to demonstrate the extent to which such measures explain or predict listeners’ scalar quality judgments. However, observed associations between acoustic and perceptual measures have varied considerably across studies. Although hundreds of studies describing, evaluating, and applying measures of noise and acoustic signal perturbation have been published [102], the perceptual salience of these attributes remains poorly understood.

Synthesized voiced sounds are sometimes needed to evaluate the performance of the measures stated above, and have long been produced with various models. A discrepancy exists between the results of early synthesis studies and findings from later investigations examining this association in naturally produced voices [80]. Early synthesis studies [103, 104] used sawtooth waves with added jitter (1–50 Hz around a mean F0 of 100 or 200 Hz) or shimmer (alternate periods reduced in amplitude by 1–6 dB). Complete correlations were observed between the amount of jitter or shimmer and judgments of relative roughness for these non-speech stimuli. Hillenbrand [105] also studied synthetic vowels using univariate analysis between jitter, shimmer, and noise and ratings of breathiness and roughness.

In this chapter, we model the postoperative vowel based on the results of Chapter 3. To predict postoperative voiced sounds, various methods for modifying the preoperative vowel are tested and evaluated, especially with respect to the jitter, shimmer, and NSR of speech sounds.

4.2 Synthesized Vowel Modeling

The input energy to the vocal tract comes from the vibrating glottis, which is driven by the air pressure released from the lungs and causes sound waves to propagate through the tube. The vocal cords give rise to a periodic characteristic; this periodic waveform is known as the glottal waveform. Various studies have concentrated on modeling the glottal waveform [106, 107], because it is the source of voiced sounds, which resonate through the vocal tract. In addition, a model of the synthesized vowel is necessary to evaluate the performance and accuracy of measures such as the harmonic-to-noise ratio.

4.2.1 Glottal Waveform Modeling

4.2.1.1 Rosenberg’s Model

Figure 4-1. Glottal waveform generated by Rosenberg’s model


Rosenberg [108], Titze [109] and several other researchers investigated an alternative approach to inverse filtering of the speech waveform to generate the excitation signal. Their findings show that this approach is capable of giving a good estimate of the glottal volume velocity. Two well-known parametric models for speech production are Rosenberg’s model and Titze’s model.

Rosenberg used inverse filtering to extract the glottal waveform from speech, and applied a pitch-synchronous re-synthesis method to produce speech utterances with various source waveforms [108]. In his perceptual tests, the most natural excitation signal involves the specification of several parameters. To better explain Rosenberg’s model, we first introduce the general waveform of the glottal area during vocal fold excitation.

As shown in Fig. 4-1, T denotes the pitch period, TP denotes the opening phase during the glottal excitation, and TN denotes the closing phase during the glottal excitation. In Rosenberg’s model, the glottal waveform is specified by three parameters: the pitch period T; the open quotient (OQ), (TP+TN)/T, which is the ratio of the pulse duration to the pitch period; and the speed quotient (SQ), TP/TN, which is the ratio of the rising to the falling pulse duration. That is, OQ is the ratio of the duration of the glottal open phase to the duration of the complete glottal cycle, and SQ is the ratio of the duration of the glottal opening phase to the duration of the glottal closing phase. Generally, OQ ranges from 0.1 to 0.9, and SQ ranges from 0.5 to 5.0. Based on Rosenberg’s experimental results, the glottal area function is given by:

$$G_r(n) = \begin{cases} \frac{1}{2}\left[1 - \cos(\pi n / T_P)\right], & 0 \le n \le T_P \\ \cos\left(\pi (n - T_P) / (2 T_N)\right), & T_P \le n \le T_P + T_N \\ 0, & \text{otherwise} \end{cases} \qquad (4.1)$$

Fig. 4-1 shows glottal waveforms computed from Rosenberg’s model, where the parameters are set as T = 10 ms, with paired values TP1/TP2/TP3 = 2/3/4 ms and TN1/TN2/TN3 = 4/3/2 ms, respectively.
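The piecewise definition of Equation (4.1) translates directly into code. The Python sketch below generates one pitch period of the Rosenberg pulse with all durations expressed in samples; the function name and the 10 kHz sampling rate mentioned in the usage note are assumptions made for illustration.

```python
import math

def rosenberg_pulse(T, TP, TN):
    """One pitch period of G_r(n) from Eq. (4.1); T, TP, TN are in samples."""
    g = []
    for n in range(T):
        if n <= TP:
            # opening phase: raised-cosine rise from 0 to 1
            g.append(0.5 * (1.0 - math.cos(math.pi * n / TP)))
        elif n <= TP + TN:
            # closing phase: quarter-cosine fall from 1 to 0
            g.append(math.cos(math.pi * (n - TP) / (2.0 * TN)))
        else:
            # closed phase
            g.append(0.0)
    return g
```

At an assumed sampling rate of 10 kHz, `rosenberg_pulse(100, 30, 30)` corresponds to T = 10 ms, TP = 3 ms, and TN = 3 ms, i.e. OQ = (TP+TN)/T = 0.6 and SQ = TP/TN = 1.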

4.2.1.2 Titze’s Model

Titze proposed a parametric model to represent the glottal area [109]. The glottal waveform in Titze’s model is essentially similar to the Rosenberg pulse $G_r(n)$ described in Equation (4.1), with an extra parameter β that determines the residual decay of the falling slope. The glottal area function in this model is given by:

cot

( )

0

sinsinr

m m

G nm m

βθ θ

θ π

θ π

θ θθ θ

− ≤=>

(4.2)

where

$$\theta \equiv \frac{\pi n}{T_P + T_N} \quad \text{and} \quad \theta_m \equiv \frac{\pi T_P}{T_P + T_N} \qquad (4.3)$$

Figure 4-2. Glottal waveform generated by Titze’s model

Fig. 4-2 shows glottal waveforms computed from Titze’s model, where the parameters are set as T = 10 ms, TP = 3 ms, TN1/TN2/TN3 = 1.5/3.0/4.5 ms (so that TP + TN = 4.5/6.0/7.5 ms, respectively), and beta = 1.2. Beta is the slope factor, typically ranging from 0.7 to 3.0, which is the time constant of the residual decay expressed as a percentage of the pitch period.

4.2.2 Aperiodicity of Glottal Waveform

In this study, the vowels |a|, |e|, |i|, |o|, and |u| are synthesized using an implementation of the discrete-time model for speech production, with Titze’s glottal area function used as the source function. A sequence of glottal pulses of the normalized Titze model is used as input to a delay-line digital filter, where the filter coefficients are obtained from area function data for normal Korean vowels, with a jitter (%) of 0.03, a shimmer (%) of 0.5, and a reflection coefficient at the lip end of 0.65. Radiation at the lips is modeled by the first-order difference equation $R(z) = 1 - z^{-1}$. To create a sequence of such wave shapes, an impulse train generator produces a sequence of unit impulses spaced by the desired fundamental period. This sequence is then convolved with the glottal-pulse shape to produce the desired repetitive waveform. Since the goal is to study abnormalities of the voicing source, it is at the source that perturbation is introduced: aperiodicity is introduced into the waveform by altering the source function. As in Equation (4.4), random shimmer is introduced by adding a random variable gain factor to the amplitude of the pitch period impulse train prior to convolution with the glottal pulse.

$$A' = A \, [100 + \mathrm{shimmer} \cdot \mathrm{random}(n)] / 100 \qquad (4.4)$$

where A is the impulse train amplitude and n is the cyclic impulse number.

Random jitter is likewise introduced by adding a random variable gain factor to the pitch period of the impulse train.

$$\mathrm{Pitch\ Period}' = (N1 + N2) \, [100 + \mathrm{jitter} \cdot \mathrm{random}(n)] / 100 \qquad (4.5)$$

The introduction of random noise follows a similar strategy. Random additive noise is introduced by adding to the glottal waveform $G_r(n)$ the product of the average of the glottal-pulse wavelet and a user-specified variance, denoted per. The noise is added according to the following equation:

$$G_r'(n) = G_r(n) \, [100 + \mathrm{noise} \cdot \mathrm{random}(n)] / 100 \qquad (4.6)$$
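The source perturbation of Equations (4.4) and (4.5) can be sketched in Python as follows. The sketch assumes that random(n) is drawn uniformly from [-1, 1], which the text does not specify, and uses hypothetical function names; it generates the jittered impulse positions and shimmered amplitudes, and then places a scaled copy of the glottal pulse at each impulse, which is equivalent to convolving the impulse train with the pulse shape.

```python
import random

def perturbed_impulse_train(n_cycles, period, amplitude, jitter, shimmer, seed=0):
    """Impulse positions (samples) and amplitudes with random jitter and shimmer.

    jitter and shimmer are given in percent. Assumption: random(n) is
    uniform on [-1, 1].
    """
    rng = random.Random(seed)
    positions, amplitudes = [], []
    t = 0.0
    for _ in range(n_cycles):
        p = period * (100.0 + jitter * rng.uniform(-1.0, 1.0)) / 100.0      # Eq. (4.5)
        a = amplitude * (100.0 + shimmer * rng.uniform(-1.0, 1.0)) / 100.0  # Eq. (4.4)
        positions.append(int(round(t)))
        amplitudes.append(a)
        t += p
    return positions, amplitudes

def excite(positions, amplitudes, pulse, length):
    """Place a scaled glottal pulse at every impulse position."""
    out = [0.0] * length
    for pos, a in zip(positions, amplitudes):
        for i, g in enumerate(pulse):
            if pos + i < length:
                out[pos + i] += a * g
    return out
```

Setting jitter and shimmer to zero reproduces a strictly periodic source, which is a convenient baseline when testing perturbation measures.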


4.3 Modifications of Preoperative Vowel

In this section, we propose modifications of the preoperative vowel, based on the findings of Chapter 3, in order to predict an enhanced vowel resembling the postoperative vowel (successful return to normal voice). This study focuses on three factors, namely jitter, shimmer, and aspiration noise in speech, because these factors mainly affect the quality of voiced sounds.

4.3.1 Design of Modification of Fundamental Frequency

4.3.1.1 Pitch Scale Modification and Jitter using PSOLA

Figure 4-3. Pitch period modification by PSOLA

The PSOLA (Pitch Synchronous Overlap Add) method was originally developed at France Telecom. It is not actually a synthesis method itself, but it allows prerecorded speech samples to be concatenated smoothly and provides good control of pitch and duration, so it is used in some commercial applications [110].

There are several versions of the PSOLA algorithm, and all of them work in essentially the same way. The time-domain version, TD-PSOLA, is the most commonly used because of its computational efficiency [111]. The basic algorithm consists of three steps [112]: the analysis step, in which the original speech signal is divided into separate but often overlapping short-term analysis signals (ST); the modification of each analysis signal into a synthesis signal; and the synthesis step, in which these segments are recombined by means of overlap-adding. The short-term signals $x_m(n)$ are obtained from the digital speech waveform $x(n)$ by multiplying the signal by a sequence of pitch-synchronous analysis windows $h_m(n)$:

$$x_m(n) = h_m(t_m - n) \, x(n) \qquad (4.7)$$

where m is an index for the short-time signal. The windows, which are usually Hanning windows, are centered on the successive instants $t_m$, called pitch-marks. These marks are set at a pitch-synchronous rate on the voiced parts of the signal and at a constant rate on the unvoiced parts. The window length used is proportional to the local pitch period, with a window factor usually from 2 to 4 [113]. The pitch markers are determined either by manual inspection of the speech signal or automatically by pitch estimation methods [114]. The segment recombination in the synthesis step is performed after defining a new pitch-mark sequence.
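The analysis and synthesis steps can be sketched in a few lines of Python. This is a simplified illustration rather than a complete TD-PSOLA implementation: the function names are hypothetical, a window factor of 2 is assumed (each window spans from the previous to the next pitch mark), the pitch marks are assumed to be given, and segment repetition or omission for duration control is left out.

```python
import math

def hann(n_len):
    """Symmetric Hanning window of n_len points."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n_len - 1))
            for i in range(n_len)]

def extract_segments(x, marks):
    """Short-term signals x_m(n): windowed two-period segments (Eq. 4.7).

    Returns (segment, offset) pairs, where offset is the position of the
    pitch mark inside the segment.
    """
    segs = []
    for m in range(1, len(marks) - 1):
        left, centre, right = marks[m - 1], marks[m], marks[m + 1]
        w = hann(right - left + 1)
        seg = [x[left + i] * w[i] for i in range(right - left + 1)]
        segs.append((seg, centre - left))
    return segs

def overlap_add(segs, new_marks, length):
    """Recombine the segments so that each one is centred on its new pitch mark."""
    y = [0.0] * length
    for (seg, offset), mark in zip(segs, new_marks):
        start = mark - offset
        for i, v in enumerate(seg):
            if 0 <= start + i < length:
                y[start + i] += v
    return y
```

Re-synthesizing with the original marks leaves the interior of the signal unchanged, because Hanning windows shifted by half their length sum to one. Spacing the new marks farther apart lowers the pitch, and spacing them closer raises it.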

Figure 4-4. Episodes of pitch scale modification by PSOLA

Manipulation of the fundamental frequency is achieved by changing the time intervals (pitch periods) between pitch markers. Modification of duration is achieved by either repeating or omitting speech segments. In principle, modification of the fundamental frequency also implies a modification of duration [114]. Jitter can also be modified by PSOLA: a pitch period longer than the maximum threshold of the pitch period difference is compressed to within the maximum threshold, whereas a pitch period shorter than the minimum threshold is expanded to within the minimum threshold (Fig. 4-4). According to the results of Chapter 3, pitch scale modification is applied only to male voices. The average pitch modifications for the phonations |a|, |e|, |i|, |o|, and |u| are -12.8 %, -12.4 %, -12.4 %, -15.2 %, and -15.6 %, respectively.

4.3.1.2. Modification of Intensity

According to Behlau and Pontes [115, 116], vocal intensity is directly related to the subglottic pressure of the air column. Subglottic pressure, in turn, depends on factors such as the amplitude of vibration and the tension of the vocal folds, more specifically the glottic resistance.

Variations of intensity, however, also depend on frequency [117]. According to Behlau and Pontes [115, 116], high voices tend to be more intense, because the increase in laryngeal tonus generates higher glottic resistance and, consequently, more intensity. Jitter is affected mainly by a lack of control of vocal fold vibration, and shimmer by a reduction of glottic resistance and by mass lesions in the vocal folds, which are related to the presence of noise at emission and breathiness [118].

According to the results of Chapter 3, shimmer measures do not differ significantly between preoperative and postoperative vowels. However, a Shimmer (%) over 6 % may affect the perceptual quality of speech, so some modification of intensity is needed to improve the quality of speech.

Our proposed method gives simple and easy control of shimmer. First, all pitch points are found using the pitch markers described above for PSOLA. Then, if the amplitude of the current segment differs from the amplitude of the previous segment by more than a threshold, all values of the segment are increased or decreased so that the amplitude falls within the permitted range. The threshold level is set to 2.5 % (the permitted variation of amplitude). This process is illustrated in Fig. 4-5.
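One possible reading of this procedure is sketched below in Python. It works on the peak amplitude of each pitch-synchronous segment and returns a gain per segment, to be applied to all samples of that segment. The function name, the use of the peak amplitude, and the comparison against the already adjusted previous segment are assumptions of this sketch.

```python
def limit_shimmer(peaks, max_change=0.025):
    """Per-segment gains that keep cycle-to-cycle amplitude change within max_change.

    peaks: peak amplitude of each pitch-synchronous segment (positive values).
    Assumption: each segment is compared with the adjusted previous segment.
    """
    gains = [1.0]
    prev = peaks[0]
    for a in peaks[1:]:
        lo = prev * (1.0 - max_change)
        hi = prev * (1.0 + max_change)
        target = min(max(a, lo), hi)   # clamp into the permitted range
        gains.append(target / a)
        prev = target
    return gains
```

Segments whose amplitude already lies within 2.5 % of the previous segment receive a gain of exactly one and are left untouched.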


Figure 4-5. Intensity modification by Shimmer (%) of 2.5 %

4.3.1.3 Short term Postfilter

The short-term postfilter [119] consists of a 16th-order pole-zero filter in cascade with a first-order all-zero filter. The 16th-order pole-zero filter attenuates the frequency components between formant peaks, while the first-order all-zero filter attempts to compensate for the spectral tilt in the frequency response of the 16th-order pole-zero filter. The transfer function of the short-term postfilter is the following,

$$P_f(z) = \frac{1}{g_f} \, \frac{\hat{A}(z/\gamma_n)}{\hat{A}(z/\gamma_d)} \qquad (4.8)$$

where $\hat{A}(z)$ is the received quantized LP inverse filter and the factors $\gamma_n$ and $\gamma_d$ control the amount of short-term postfiltering; they are set to $\gamma_n = 0.55$ and $\gamma_d = 0.7$, as shown in Fig. 4-6. The role of this postfilter is similar to the one played by the perceptual weighting filter in the choice of the excitation vectors in the encoding scheme. It lowers the valleys and enhances the peaks of the resulting decoded spectrum so that the noise is shaped with the signal. If any white Gaussian noise had been introduced during the encoding process or due to transmission error, the resulting noise would be small in the spectral regions of low signal energy and vice versa for the formant regions. In general, after the decoded speech is passed through the long-term postfilter and the short-term postfilter, the filtered speech will not have the same power level as the decoded (unfiltered) speech. To avoid occasional large gain excursions, it is necessary to use automatic gain control to force the postfiltered speech to have roughly the same power as the unfiltered speech. This is taken care of by the gain factor $g_f$.
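The pole-zero part of Equation (4.8) can be sketched in Python as follows. The sketch assumes the convention $\hat{A}(z) = 1 + \sum_i a_i z^{-i}$ with the coefficients passed as a list starting with 1, replaces $z$ by $z/\gamma$ by scaling the i-th coefficient with $\gamma^i$, and implements the gain control as a single energy-matching scale factor over the whole input. The function names and the block-wise gain control are assumptions; the first-order tilt compensation is omitted.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): the i-th coefficient is scaled by gamma**i."""
    return [c * gamma ** i for i, c in enumerate(a)]

def pole_zero_filter(x, num, den):
    """y[n] = sum_i num[i] x[n-i] - sum_{i>=1} den[i] y[n-i], with den[0] == 1."""
    y = []
    for n in range(len(x)):
        acc = sum(num[i] * x[n - i] for i in range(len(num)) if n - i >= 0)
        acc -= sum(den[i] * y[n - i] for i in range(1, len(den)) if n - i >= 0)
        y.append(acc)
    return y

def short_term_postfilter(x, a, gamma_n=0.55, gamma_d=0.7):
    """Apply A(z/gamma_n) / A(z/gamma_d) and rescale to the input energy."""
    y = pole_zero_filter(x, bandwidth_expand(a, gamma_n), bandwidth_expand(a, gamma_d))
    ex = sum(v * v for v in x)
    ey = sum(v * v for v in y)
    g = (ex / ey) ** 0.5 if ey > 0 else 1.0   # plays the role of 1/g_f
    return [g * v for v in y]
```

Because gamma_d is larger than gamma_n, the poles lie closer to the unit circle than the zeros, so the formant peaks are emphasized relative to the valleys between them.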

Figure 4-6. Plots of short term postfiltered voiced sound in time and spectral domain


4.4 Design of Enhancement of Noise Components

4.4.1 Introduction to Wavelet Transform Threshold Shrinkage

Wavelet transform threshold shrinkage (WTS) [120, 121] is implemented using the difference between the statistical properties of the signal and those of the noise in the wavelet domain. WTS involves shrinking in the wavelet transform domain, and consists of three steps: a linear forward wavelet transform, a nonlinear shrinkage denoising, and a linear inverse wavelet transform. Suppose we want to estimate $S_n$ from the noisy observation signal $X_n$,

$$X_n = S_n + E_n, \quad n = 1, \ldots, N \qquad (4.9)$$

where $S_n$ denotes the target signal, and $E_n$ is independent and uniformly distributed. Let W(·) and W−1(·) denote the forward and inverse wavelet transform operators, and let D(· , ·) denote the denoising operator with soft threshold. We intend to wavelet-shrinkage denoise $X(t)$ in order to recover $\hat{S}(t)$ as an estimate of $S(t)$. The procedure is then summarized by three steps:

$$Y = W(X), \qquad Z = D(Y, \lambda), \qquad \hat{S} = W^{-1}(Z) \qquad (4.10)$$

Of course, this summary of principles does not reveal the details of implementing the operators W and D, or of selecting the threshold λ. Let us focus on λ and D. Given a threshold λ for data U (in any arbitrary domain: signal, transform, or otherwise), the following rule defines nonlinear soft thresholding:

$$D(U, \lambda) \equiv \mathrm{sgn}(U) \, \max(0, |U| - \lambda) \qquad (4.11)$$

The operator D nulls all values of U for which |U| ≤ λ and shrinks toward the origin, by an amount λ, all values of U for which |U| > λ. It is the latter aspect that has led to D being called the shrinkage operator in addition to the soft threshold operator.
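The three steps of Equation (4.10) and the soft threshold of Equation (4.11) can be illustrated with a minimal Python sketch. For brevity it uses a single-level Haar transform as the wavelet operator W and thresholds only the detail coefficients; both choices are assumptions made for illustration and are not the wavelet or decomposition depth used in this study.

```python
import math

def soft_threshold(u, lam):
    """D(U, lambda) = sgn(U) * max(0, |U| - lambda), Eq. (4.11)."""
    return math.copysign(max(0.0, abs(u) - lam), u)

def haar_forward(x):
    """Single-level orthonormal Haar transform of an even-length list."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Inverse of haar_forward."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) * s, (a - d) * s])
    return x

def wts_denoise(x, lam):
    approx, detail = haar_forward(x)                    # Y = W(X)
    detail = [soft_threshold(d, lam) for d in detail]   # Z = D(Y, lambda)
    return haar_inverse(approx, detail)                 # S_hat = W^{-1}(Z)
```

With a threshold of zero the signal is reconstructed exactly, and with a very large threshold every pair of samples is replaced by its mean.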


4.4.2 Determination of Adaptive Threshold

If the spectrum is calculated for short signal segments, the segment duration is chosen to be ten periods as a compromise. A short segment leads to an increase in the bandwidth of the harmonics, which hinders the discrimination between harmonic and noise components. When a long segment duration is used, the non-stationarity of the voice signal increases the noise component and decreases the harmonic component, and again the discrimination between them is difficult. Obviously a compromise should be found. A number of experiments with simulated signals were carried out. White noise (40% deviation) was added to these signals, as were slow amplitude oscillations (40% deviation) and slow pitch oscillations (6% deviation). The experiments showed that calculating HNR with a segment length of ten periods gives a more precise separation of the harmonic from the noise component, and that the increased segment length does not lead to significant growth of the error produced by slow amplitude and pitch oscillations (which are equivalent to non-stationarity). HNR, NNE, NFHE, and DH as functions of jitter, shimmer, and noise are plotted in Fig. 4-7. The best correlation coefficients of each noise estimation measure are given in Table 4-1. NNE, NNE, and NFHE are the best parameters for jitter, shimmer, and noise, respectively. However, another test is required because real speech does not contain only one perturbation factor. A synthetic speech signal combining the perturbation factors (Fig. 4-8) is therefore generated and tested in order to determine how well HNR correlates with variations of noise, shimmer, and jitter, as shown in Figs. 4-9 and 4-10.

Table 4-1. Correlation coefficients of each noise estimation as a change of jitter, shimmer, and noise

          HNR     NNE     NFHE    DH
Jitter    -0.89   -0.94   0.90    -0.91
Shimmer   -0.86   -0.90   0.70    -0.90
Noise     -0.61   -0.85   0.98    -0.59

Figure 4-7. Plots of HNR, NNE, NFHE, and DH as a function of (a) jitter (S.D. 40%), (b) shimmer (40%), and (c) noise (6%), and (d) magnification of (c) in NFHE

Figure 4-8. Examples of synthetic voiced sound |a| (a) with shimmer and noise of 5 % and jitter of 0.75 %, (b) with shimmer and noise of 40 % and jitter of 6 %


Figure 4-9. Plots of HNR as a function of Jitter (S.D. 0.75- 6.0 %) & noise and shimmer (S.D. 5-40 %) for phonation |a|,|e|,|i|,|o|,|u| for female group

Figure 4-10. Plots of HNR as a function of Jitter (S.D. 0.75- 6.0 %) & noise and shimmer (S.D. 5-40 %) for phonation |a|,|e|,|i|,|o|,|u| for male group


Figure 4-11. Episode of denoising with Wavelet threshold shrinkage

As a result, HNR is highly correlated with changes in noise, shimmer, and jitter (Figs. 4-9 and 4-10). It is therefore possible to infer the threshold value for wavelet threshold shrinkage inversely from HNR. The adaptive denoising threshold is calculated from the following equation, called the modified rigorous SURE soft threshold:

$$ \mathrm{Threshold}_a = \alpha \, \sigma \sqrt{W_a} $$ (4.12)

Let $W = [W_1, W_2, \ldots, W_N]$, where $W_n = w_{j,k}^2$ $(n = 1, 2, \ldots, N)$ are the squared wavelet coefficients sorted so that $W_1 \le W_2 \le \cdots \le W_N$, and define

$$ r_i = \frac{1}{N}\left[ N - 2i + (N - i)\,W_i + \sum_{k=1}^{i} W_k \right], \quad i = 1, 2, \ldots, N $$ (4.13)

Then let $r_a = \min_i r_i$, which gives $W_a$. The noise level is $\sigma = \mathrm{median}(|w_{1,k}|)/0.6745$, where $w_{1,k}$ are the wavelet coefficients at the first scale. $\alpha$ is a constant chosen from the heuristic analysis above to increase the HNR value, and $\alpha = 0.37$ is used in this study. As a result, wavelet threshold shrinkage reduces the noise while preserving the stress noise (Fig. 4-11).
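The threshold of Eqs. (4.12) and (4.13) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names `modified_sure_threshold` and `soft_threshold` are hypothetical, the wavelet coefficients are assumed to be already available as plain lists, and the absolute value in the median follows the usual median-absolute-deviation noise estimate.

```python
import math
import statistics


def modified_sure_threshold(detail_coeffs, first_scale_coeffs, alpha=0.37):
    """Return alpha * sigma * sqrt(W_a), a literal reading of Eqs. (4.12)-(4.13).

    detail_coeffs      -- wavelet coefficients to be thresholded
    first_scale_coeffs -- first-scale coefficients used to estimate sigma
    alpha              -- scaling constant (0.37 in the text)
    """
    n = len(detail_coeffs)
    # Squared coefficients in ascending order: W_1 <= W_2 <= ... <= W_N.
    w_sorted = sorted(c * c for c in detail_coeffs)

    best_risk, w_a, running_sum = math.inf, 0.0, 0.0
    for i, w_i in enumerate(w_sorted, start=1):
        running_sum += w_i
        # r_i = [N - 2i + (N - i) W_i + sum_{k<=i} W_k] / N   (Eq. 4.13)
        risk = (n - 2 * i + (n - i) * w_i + running_sum) / n
        if risk < best_risk:
            best_risk, w_a = risk, w_i

    # Noise level from the median absolute first-scale coefficient.
    sigma = statistics.median(abs(c) for c in first_scale_coeffs) / 0.6745
    return alpha * sigma * math.sqrt(w_a)


def soft_threshold(coeffs, threshold):
    """Shrink every coefficient toward zero by `threshold` (soft thresholding)."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]
```

In use, the threshold would be computed per decomposition level and passed to `soft_threshold` before the inverse wavelet transform; the choice of wavelet and number of levels is outside this sketch.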

4.5 Modification of Baseline Wander of EGG Signal

The baseline wander in the EGG waveform must be reduced, because the EGG waveform is used as the input signal of the nonlinear speech synthesizer in the next chapter. In the larynx, abduction and adduction of the glottis are mainly controlled by the posterior cricoarytenoid (abductor) and interarytenoid (adductor) muscles, respectively. Electroglottography (EGG) is a technique that registers laryngeal behavior indirectly by measuring the change in electrical impedance across the throat during speech. However, the EGG waveform is also affected by the laryngeal muscles that move the vocal cords, which results in baseline wander.

4.5.1 Introduction to Empirical Mode Decomposition

The EMD technique decomposes a signal into a number of intrinsic mode functions (IMFs), each of which is a mono-component function. The method was developed by Dr. Norden E. Huang at the NASA Goddard Space Flight Center. The procedure for extracting the IMFs from a signal $x(t)$ is illustrated in Fig. 4-12. After all the local maxima and minima of the signal are identified, the upper and lower envelopes are generated through curve fitting. Research has shown that more complex curve-fitting functions yield only marginal improvement while increasing the computational load significantly [122].

Therefore, the cubic spline function was employed in the present study. The mean of the upper and lower envelopes of the signal, $m_{11}(t)$, is calculated as

$$ m_{11}(t) = \frac{x_{\max}(t) + x_{\min}(t)}{2} $$ (4.14)

where $x_{\max}(t)$ and $x_{\min}(t)$ are the upper and lower envelopes of the signal, respectively. Accordingly, the difference between the signal $x(t)$ and the envelope mean $m_{11}(t)$, denoted $h_{11}(t)$, is given by

$$ h_{11}(t) = x(t) - m_{11}(t) $$ (4.15)

Figure 4-12. Process diagram of EMD

Owing to the approximate nature of the curve-fitting method, $h_{11}(t)$ has to be processed further (by treating $h_{11}(t)$ as the signal itself and repeating the process) until it satisfies the following two conditions.

1) The number of extrema and the number of zero-crossings are either equal to each other or differ by at most one.

2) At any point, the mean value between the envelope defined by local maxima and the

envelope defined by the local minima is zero.
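The first of the two conditions above is easy to check on a sampled signal. The sketch below is illustrative only: the helper names are hypothetical, extrema are taken as strict local maxima or minima of the samples, and the second condition (zero envelope mean) is not checked because it requires the spline envelopes.

```python
def count_extrema(x):
    """Count strict local maxima and minima of a sampled signal."""
    count = 0
    for prev, cur, nxt in zip(x, x[1:], x[2:]):
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            count += 1
    return count


def count_zero_crossings(x):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(x, x[1:]) if a * b < 0)


def satisfies_first_imf_condition(x):
    """Condition 1: numbers of extrema and zero-crossings differ by at most one."""
    return abs(count_extrema(x) - count_zero_crossings(x)) <= 1
```

A candidate with a small wave riding on a larger one has extra extrema without extra zero-crossings, so the check fails and sifting would continue.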

Through the iteration process (repeated a total of $j$ times), the difference between the signal and the mean envelope values, denoted $h_{1j}(t)$, is obtained as

$$ h_{1j}(t) = h_{1(j-1)}(t) - m_{1j}(t) $$ (4.16)

where $m_{1j}(t)$ is the mean envelope value at the $j$-th iteration, and $h_{1(j-1)}(t)$ is the difference between the signal and the mean envelope values at the $(j-1)$-th iteration. The function $h_{1j}(t)$ is then defined as the first IMF component and expressed as

$$ \mathrm{IMF}_1(t) = h_{1j}(t) $$ (4.17)

After separating $\mathrm{IMF}_1(t)$ from the original signal $x(t)$, the residue is obtained as

$$ r_1(t) = x(t) - \mathrm{IMF}_1(t) $$ (4.18)

Subsequently, the residue $r_1(t)$ is treated as the new signal, and the iteration process described above is repeated to extract the rest of the IMFs inherent to the signal $x(t)$:

$$ r_1(t) - \mathrm{IMF}_2(t) = r_2(t), \quad \ldots, \quad r_{n-1}(t) - \mathrm{IMF}_n(t) = r_n(t) $$ (4.19)

The signal decomposition process terminates when $r_n(t)$ becomes a monotonic function, from which no further IMFs can be extracted. By substituting (4.19) into (4.18), the signal $x(t)$ is decomposed into a number of intrinsic mode functions that are the constituent components of the signal. As a result, the signal $x(t)$ can be expressed as

$$ x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t) $$ (4.20)

where $c_i(t) = \mathrm{IMF}_i(t)$ represents the $i$-th intrinsic mode function, and $r_n(t)$ is the residue of the signal decomposition. Equation (4.20) provides a complete description of the empirical mode decomposition process [123], which can be evaluated by checking the amplitude error between the reconstructed and the original signal. As an example, Fig. 4-14 illustrates both the IMFs and the residue of the multi-component signal. Finally, the wandering-baseline estimate (fluctuating red line in Fig. 4-13) is obtained by adding the last IMF (in this case, $\mathrm{IMF}_8$ in Fig. 4-14) to the residue signal; the original signal minus this baseline estimate is taken as the EGG signal with the baseline wander cancelled. For comparison, the high-pass filtered EGG signal (order 500, cut-off frequency 50 Hz) is also plotted at the bottom of Fig. 4-13, and Fig. 4-14 shows each IMF and the residue.

Figure 4-13. Reduction of baseline wander in EGG waveform by EMD and high pass filter with FIR 500-order (Pass band: over 40 Hz); voiced sound |u| with sampling rate of 22050 Hz

Figure 4-14. Plots of IMFs and residue of EGG waveform in Figure 4-13.

4.6 Summary

In this chapter, the preoperative voiced sounds were modified to bring their perceptual quality closer to that of a normal voice. Our hypothesis is that reducing the aperiodicity of preoperative voiced sounds makes them resemble postoperative voiced sounds, where the main components of aperiodicity are taken to be jitter, shimmer, and noise in the speech signal. The enhancement rates are set from statistical results on the difference between preoperative and postoperative speech sounds. The pitch period, intensity, and aspiration noise are modified by PSOLA, an intensity modifier, and wavelet threshold shrinkage, respectively. The baseline wander of the EGG signal is reduced by empirical mode decomposition, the decomposition stage of the Hilbert-Huang transform. The modified speech and EGG signals are used as input signals for the nonlinear speech modeling in the next chapter to estimate the postoperative vowel.

Chapter 5

Nonlinear Speech Production Modeling using Nonlinear Autoregressive

Exogenous based on Support Vector Regression

5.1 Introduction of Speech Production Modeling

Speech waveforms are rich in information but highly redundant in structure. Storing the acoustic data directly is therefore inefficient, and a more compact parametric representation of the information conveyed by the signal is desirable. An ideal model should exploit the redundancy in the speech signal to give data compression while capturing the distinguishing features of the signal. For coding and synthesis applications, the ability to regenerate the original speech waveform from the model is also necessary.

The acoustic speech waveform varies slowly with time as different sounds are produced, so the frequency properties of the signal are constantly changing. A time-varying model of the waveform is therefore needed, whose parameters are updated continuously at a suitable rate. Typically, a short-time analysis is used, in which the speech waveform is divided into a sequence of overlapping segments of about 20 ms in duration and a new set of model parameters is calculated for each segment. Because the articulators move slowly, the speech signal can be treated as stationary over intervals of about 10 ms [124], which permits an update rate (frame rate) of 10 ms. Even the fastest transitions in plosives can be captured relatively well with an update interval of 5 ms.

Two approaches to developing a model are articulatory modeling and acoustic modeling.

The articulatory modeling approach aims to represent the vocal tract and movement of

articulators in as much physiological detail as possible and assumes that a similar underlying

system will generate a similar output. Articulatory models have the potential for good

reproduction from simple control signals and can reproduce all the perceptually relevant

effects of the real speech, such as co-articulation [125]. However, the dimensions of the vocal

tract and a detailed analysis of the movement of the articulators are needed. Such information

is difficult to obtain and often requires intrusive measurement techniques. The acoustic

modeling approach models the speech waveform directly in either the time or frequency

domain. The models are easy to construct because only the speech waveform is required,

which is easily obtained using a microphone. An exact match of the waveform or spectrum is

not needed for perceptually good synthesis and events which are not perceptually relevant

need not be modeled. The most popular technique for speech modeling applications, such as speech coding and speech synthesis, is the time-domain acoustic modeling method known as linear prediction (LP).

In this chapter, we review the well-known linear predictive coding (LPC) model and nonlinear speech modeling based on neural networks, and then introduce our proposed nonlinear speech modeling based on the support vector machine (SVM). Results of predicting the postoperative vowel with the SVM-based nonlinear speech model are also presented.

5.1.1 Overview of Linear Speech Production Modeling

Linear prediction techniques use a source-filter arrangement to model the vocal tract system, assuming that the source is located at the glottis and that a linear filter is adequate to model the frequency properties of the vocal tract. At the analysis stage, it is assumed that nothing is known about the excitation of the vocal tract and that the speech waveform can only be modeled from its previous values. The linear vocal tract filter defines an autoregressive (AR) model of the speech, in which the current speech sample, $y(t)$, is predicted from a linear combination of a finite number of past samples

$$ \hat{y}_p(t) = -\sum_{k=1}^{n_a} a_k \, y(t-k) $$ (5.1)

where $\hat{y}_p(t)$ is the predicted speech sample. The prediction error or residual, $e(t) = y(t) - \hat{y}_p(t)$, represents structure in the speech that is not captured by the model. For a good model, the residual has no predictable structure and appears as white noise. For voiced speech, the residual has significant peaks at the pitch period; these coincide with the instants of excitation of the vocal tract, that is, with the rapid closure of the vocal cords [126]. When the LP model is excited by the residual signal, the speech waveform is reproduced exactly. This is not practical for most applications, and one approach is to use a model of the residual signal instead. A source-filter arrangement is used in which the residual is represented by an impulse train at the pitch frequency for voiced sounds or by a random white-noise generator for unvoiced sounds.
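As an illustration of Eq. (5.1), the predictor coefficients can be estimated with the autocorrelation method and the Levinson-Durbin recursion. The sketch below is a generic textbook procedure under stated assumptions, not necessarily the routine used in this study; the function names are hypothetical and no windowing or pre-emphasis is applied.

```python
def autocorrelation(y, max_lag):
    """Autocorrelation r[0..max_lag] of a finite segment y."""
    n = len(y)
    return [sum(y[t] * y[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations.

    Returns (a, err), where a = [a_1, ..., a_order] follows the sign
    convention of Eq. (5.1), y_hat(t) = -sum_k a_k y(t-k), and err is
    the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err                      # reflection coefficient
        new_a = a[:]
        for k in range(1, i):
            new_a[k] = a[k] + k_i * a[i - k]
        new_a[i] = k_i
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:], err


def predict(y, a, t):
    """One-step prediction of y[t] from past samples, Eq. (5.1)."""
    return -sum(a_k * y[t - k] for k, a_k in enumerate(a, start=1))


def residual(y, a):
    """Prediction error e(t) = y(t) - y_hat(t) for t >= len(a)."""
    return [y[t] - predict(y, a, t) for t in range(len(a), len(y))]
```

For a voiced frame, plotting `residual(frame, a)` would be expected to show the peaks at the pitch period described above.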

The spectral match between the estimated transfer function $\hat{H}(z)$ and the spectral envelope of the speech follows from Parseval's theorem. Applying it to the total squared prediction error gives equation (5.2); with the model frequency response of equation (5.3), equation (5.4) shows that minimizing $E$ is equivalent to minimizing the integral of the ratio of the energy spectrum of the speech segment to the squared magnitude of the frequency response of the system model.

$$ E = \sum_{t=0}^{N+n_a-1} e^2(t) = \frac{1}{2\pi}\int_{-\pi}^{\pi} |Y(e^{j\omega})|^2 \, |A(e^{j\omega})|^2 \, d\omega $$ (5.2)

$$ \hat{H}(e^{j\omega}) = \frac{G}{A(e^{j\omega})} $$ (5.3)

$$ E = \frac{G^2}{2\pi}\int_{-\pi}^{\pi} \frac{|Y(e^{j\omega})|^2}{|\hat{H}(e^{j\omega})|^2} \, d\omega $$ (5.4)

For a predictor of order $n_a$, the first $n_a + 1$ values of the autocorrelation of the speech segment and of the autocorrelation of the system impulse response are equal. Thus, as the predictor order tends to infinity, the magnitude spectra $|\hat{H}(e^{j\omega})|$ and $|Y(e^{j\omega})|$ match:

$$ \lim_{n_a \to \infty} |\hat{H}(e^{j\omega})|^2 = |Y(e^{j\omega})|^2 $$ (5.5)

However, the spectra $\hat{H}(e^{j\omega})$ and $Y(e^{j\omega})$ themselves may not be equivalent, because $\hat{H}(e^{j\omega})$ is constrained to be minimum-phase (all zeros of $A(z)$, that is, all poles of $\hat{H}(z)$, inside the unit circle). In general, the speech spectrum is not minimum-phase when radiation occurs from more than one point and there are multiple sound pathways. Owing to the spectral matching property of the mean-squared error criterion, linear prediction analysis can be used to obtain a smoothed estimate of the short-time spectral envelope of the speech. Since $E$ depends on the ratio of $|Y(e^{j\omega})|$ to $|\hat{H}(e^{j\omega})|$, the matching process performs uniformly over the frequency range of interest, regardless of the shape of the spectral envelope of the speech. However, the formants of the spectrum are modeled more closely, because regions where $|Y(e^{j\omega})| > |\hat{H}(e^{j\omega})|$ contribute more to $E$ than regions where $|Y(e^{j\omega})| < |\hat{H}(e^{j\omega})|$. Estimates of the speech formants can be obtained by locating peaks in the smoothed spectral envelope or by factorizing $A(z)$ to find the poles of $\hat{H}(z)$. Each formant is approximated by a complex-conjugate pole pair, which forms a second-order section $A_i(z)$ given by

$$ A_i(z) = 1 + a_1 z^{-1} + a_2 z^{-2} $$ (5.6)

The frequency of the formant is determined from the pole angle and the bandwidth from the pole radius. The spectrum is unique in the range $-\omega_s/2 < \omega < \omega_s/2$ and repeats at multiples of the sampling frequency $\omega_s$. The transfer function $\hat{H}(z)$ is stable when all its poles lie inside the unit circle, which holds if every section $1/A_i(z)$ is stable. Analogous to the cascade and parallel realizations of the resonant network used in formant synthesizers, $\hat{H}(z)$ can be implemented in cascade or parallel form. In cascade form, $\hat{H}(z)$ is expanded as a product of formant factors

$$ \hat{H}(z) = \frac{G}{\prod_{i=1}^{n_a/2} A_i(z)} $$ (5.7)

In parallel form, $\hat{H}(z)$ is expanded as a sum of formant factors

$$ \hat{H}(z) = \sum_{i=1}^{n_a/2} \frac{B_i(z)}{A_i(z)} $$ (5.8)

where $B_i(z)$ are the residues of the partial fraction expansion of $\hat{H}(z)$.
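The relation between a second-order section and its formant, stated above in words, can be made concrete. The sketch below uses the standard conversions $F = \theta f_s / 2\pi$ and $B = -\ln(\rho)\, f_s / \pi$ for a pole at $\rho e^{\pm j\theta}$; these formulas and the function name `formant_from_section` are assumptions introduced for illustration and are not quoted from this study.

```python
import cmath
import math


def formant_from_section(a1, a2, fs):
    """Formant frequency and bandwidth (Hz) of 1 / (1 + a1 z^-1 + a2 z^-2).

    Returns (frequency, bandwidth, is_stable). Raises ValueError when the
    section has real poles and therefore does not describe a resonance.
    """
    disc = a1 * a1 - 4.0 * a2
    if disc >= 0.0:
        raise ValueError("section has real poles, not a complex-conjugate pair")
    # Poles are the roots of z^2 + a1 z + a2 = 0.
    pole = (-a1 + cmath.sqrt(disc)) / 2.0
    radius = abs(pole)
    angle = abs(cmath.phase(pole))
    frequency = angle * fs / (2.0 * math.pi)       # from the pole angle
    bandwidth = -math.log(radius) * fs / math.pi   # from the pole radius
    return frequency, bandwidth, radius < 1.0
```

The stability flag mirrors the condition in the text: the section is stable only when the pole radius is below one.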

5.1.2 Limitations of Linear Speech Production Modeling

The advantages of linear prediction for speech analysis are ease of implementation, a closed-form solution, complete separation of the source and the vocal tract filter in synthesis, and a direct interpretation in terms of a lossless acoustic tube model of the vocal tract [10].

However, linear prediction has several disadvantages. Unvoiced sounds are poorly modeled by a minimum-phase, all-pole linear prediction model because the vocal tract transfer function for these sounds contains zeros. Although it has been argued that the spectral notches produced by zeros are hard to detect [127], the synthesis of unvoiced sounds by a linear prediction model is poor. In linear prediction models, zeros have to be approximated by a collection of poles, which requires a higher prediction order. The linear prediction parameters are also not optimal for synthesis, since they are derived by minimizing the mean-squared prediction error rather than the actual error obtained at the output of the model when it is used for synthesis (the synthesis error). The main disadvantage of linear prediction is that the source and vocal tract filter are not decoupled in analysis, so the linear prediction filter models the combined effect of source, vocal tract, and lip radiation. As a result, the quality of synthetic speech generated from linear prediction models degrades rapidly as the pitch of the excitation is altered from that of the original speech.

5.2 Overview of Nonlinear Speech Production Modeling

5.2.1 Review of Former Research in Nonlinear Speech Production Modeling

Owing to their universal approximation capabilities, neural networks (NNs) are able to approximate unknown systems from sparse sets of noisy data [128, 129]. Although many NN applications concern classification problems, growing interest has been devoted to nonlinear time series prediction and to complex nonlinear dynamic modeling [130]. However, one of the main drawbacks that can hinder practical NN applications in multimedia is their computational and structural complexity.

Classical approaches to nonlinear DSP are based either on specific and efficient architectures (e.g. median and bilinear filters, some spectral analysis techniques) or on generic nonlinear architectures that suit a large class of problems but are usually complex (e.g. Volterra filters, nonlinear state equations, polynomial filters, functional links) [131, 132]. In other words, typical nonlinear DSP approaches consist of designing specific algorithms for specific problems.

Neural networks such as the multi-layer perceptron (MLP) [133], the time delay neural network (TDNN) [134], and the recurrent neural network (RNN) [135] have been used extensively for functional approximation of continuous nonlinear mappings [136]. Successful functional approximation depends on an appropriate selection of the parameter values.

The MLP and the RNN are adaptive circuits that extend and generalize the simple adaptive linear filter to the nonlinear domain. With delay lines added, NN filters can be viewed as an extension of linear adaptive filters for nonlinear modeling tasks [137]. A large number of DSP techniques are based on linear models, but in some cases the nature of the problem is nonlinear, and general-purpose nonlinear architectures are then needed.

Despite the formal elegance of the neural model, several problems remain to be solved. The first is model selection. Given an input-output relation, the problems are (1) determining the number of inputs, (2) determining the number of neurons in the hidden layers needed for a correct approximation, and, in the case of dynamic processes, (3) deciding how to place memory (delay lines) in the model.

Even though several papers deal with the problem of network topology determination, the numbers of layers and neurons are usually specified by heuristic procedures.

Although linear adaptive filter theory is well known and consolidated, its extension to the nonlinear domain is a field of great interest and continuous expansion. In this section, some neural architectures suitable for adaptive nonlinear filtering are presented. The formulation of transversal and recursive filters can easily be extended to the nonlinear domain: for discrete-time sequences, the filter can be described through a relationship between the input sequence $x[t], x[t-1], \ldots$ and the output sequence $y[t], y[t-1], \ldots$. The general forms are expressed as

$$ y[t] = \Phi\big(x[t], x[t-1], \ldots, x[t-M+1]\big) $$ (5.9)

$$ y[t] = \Phi\big(x[t], x[t-1], \ldots, x[t-M+1], y[t-1], \ldots, y[t-N]\big) $$ (5.10)

In the first expression the output is a nonlinear function of the present and past input samples; in other words, equation (5.9) is a nonlinear generalization of the linear finite impulse response (FIR) filter. The output signal $y[t]$ in equation (5.10) is also a function of past output samples, so it is a nonlinear generalization of the linear infinite impulse response (IIR) filter. Equation (5.10) is a general form usually called the nonlinear autoregressive moving average (NARMA) model. The indices $M$ and $N$ represent the filter memory lengths, and the pair $(N, M)$ is defined as the filter order.
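Equations (5.9) and (5.10) can be read as a loop over two tapped delay lines. The following sketch is only an illustration of that reading: `run_nonlinear_filter` and the callable `phi` are hypothetical names, zero initial conditions are assumed, and any function of the two delay-line contents (for example a trained network) could be passed as `phi`.

```python
from collections import deque


def run_nonlinear_filter(phi, x, M, N=0):
    """Run y[t] = phi(x[t], ..., x[t-M+1], y[t-1], ..., y[t-N]) over a sequence.

    phi -- callable taking (input_taps, output_taps), both lists, newest first
    x   -- input sequence
    M   -- input memory length; N -- output memory length (N = 0 gives Eq. 5.9)
    """
    x_line = deque([0.0] * M, maxlen=M)   # x[t], x[t-1], ..., x[t-M+1]
    y_line = deque([0.0] * N, maxlen=N)   # y[t-1], ..., y[t-N]
    outputs = []
    for sample in x:
        x_line.appendleft(sample)         # oldest input tap drops off the end
        y = phi(list(x_line), list(y_line))
        if N:
            y_line.appendleft(y)          # feed the output back (Eq. 5.10)
        outputs.append(y)
    return outputs
```

With a linear `phi` the loop reduces to an ordinary FIR or IIR filter, which is the sense in which the two equations generalize them.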

Figure 5-1. Buffered MLP structure with input TDL

The easiest way to obtain dynamics from an MLP network is to use external tapped delay lines (TDL) [138], as shown in Fig. 5-1. This structure subsumes many traditional signal processing structures, including FIR and IIR filters, as well as the gamma memory NN [139], in which the delay operator used in a conventional TDL is replaced by a single-pole discrete-time filter. These networks are universal approximators for dynamic systems [140], just as feedforward MLPs are universal approximators for static mappings [141].

Concerning the general structure above, we can assert that 1) determining the optimum filter order $(N, M)$ requires some a priori knowledge of the statistics of the input signal, and 2) filtering highly non-stationary signals requires that the free parameters of the filter ($w \in R$) can vary quickly, so that the variation in the input statistics can be tracked. Moreover, if $\Phi$ in equations (5.9) and (5.10) is a linear function, a large number of methods exist for determining the free filter parameters (filter synthesis). A family of adaptive algorithms suitable for transversal filters is derived from least-squares error minimization [142].

5.2.2 Introduction of Support Vector Machine for nonlinear regression

The support vector machine (SVM), originally introduced by Vapnik (1985, 1988) within statistical learning theory and structural risk minimization, overcomes weak points of neural networks such as the existence of local minima. SVM solutions are characterized by convex optimization problems. Despite many successful applications of the SVM to classification and regression problems, training an SVM requires solving a quadratic programming (QP) problem, that is, optimizing a quadratic function over a polyhedron defined by linear equations and/or inequalities, which is expensive in time and memory.

A modified version of the SVM in a least squares (LS) sense was proposed for classification by Suykens and Vandewalle (1999). In the LS-SVM, the solution is given by a linear system instead of a QP problem. Because the computational complexity increases strongly with the number of training data, it matters that the LS-SVM can be estimated efficiently using iterative methods. The fact that the LS-SVM has explicit primal-dual formulations also has a number of advantages.

The SVM can be applied to both linear and nonlinear regression; in this study, we explain the SVM for nonlinear regression. Let the training data set $D$ be denoted by

$$ D = \{(x_i, y_i)\}_{i=1}^{n}, \quad x_i \in R^{d} \ \text{and} \ y_i \in R $$ (5.11)

with inputs $x_i$ and outputs $y_i$. For nonlinear regression, the model takes the form

$$ f(x) = w'\phi(x) + \eta $$ (5.12)

where $\eta$ is a bias term. The feature mapping function $\phi(\cdot): R^{d} \to R^{d_f}$ maps the input space to a higher-dimensional feature space whose dimension $d_f$ is defined in an implicit way.

The optimization problem is defined with a regularization parameter $\gamma$ as

$$ \min \ \frac{1}{2} w'w + \frac{\gamma}{2} \sum_{i=1}^{n} \xi_i^2 $$ (5.13)

over $\{w, \eta, \xi\}$ subject to the equality constraints

$$ y_i = w'\phi(x_i) + \eta + \xi_i, \quad i = 1, \ldots, n. $$ (5.14)

The Lagrangian function can be constructed as

$$ L(w, \eta, \xi; a) = \frac{1}{2} w'w + \frac{\gamma}{2}\sum_{i=1}^{n} \xi_i^2 - \sum_{i=1}^{n} a_i \big( w'\phi(x_i) + \eta + \xi_i - y_i \big) $$ (5.15)

where the $a_i$ are the Lagrange multipliers. The conditions for optimality are given by

$$ \begin{aligned}
\frac{\partial L}{\partial w} = 0 \ &\to\ w = \sum_{i=1}^{n} a_i \phi(x_i) \\
\frac{\partial L}{\partial \eta} = 0 \ &\to\ \sum_{i=1}^{n} a_i = 0 \\
\frac{\partial L}{\partial \xi_i} = 0 \ &\to\ a_i = \gamma \xi_i, \quad i = 1, \ldots, n \\
\frac{\partial L}{\partial a_i} = 0 \ &\to\ w'\phi(x_i) + \eta + \xi_i - y_i = 0, \quad i = 1, \ldots, n
\end{aligned} $$ (5.16)

with solution

$$ \begin{bmatrix} 0 & \mathbf{1}' \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} \eta \\ a \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} $$ (5.17)

with

$$ y = (y_1, \ldots, y_n)', \quad \mathbf{1} = (1, \ldots, 1)', \quad a = (a_1, \ldots, a_n)', \quad \Omega = \{\Omega_{kl}\}, \ \text{where} \ \Omega_{kl} = \phi(x_k)'\phi(x_l) = K(x_k, x_l), \ k, l = 1, \ldots, n $$ (5.18)

The kernel function $K(x_k, x_l)$ is obtained from the application of Mercer's conditions [143]. Several choices of kernel function are possible. Solving the linear system (5.17) gives the optimal bias $\hat{\eta}$ and Lagrange multipliers $\hat{a}_i$, and the optimal regression function for a given $x$ is then

$$ \hat{f}(x) = \sum_{i=1}^{n} \hat{a}_i K(x_i, x) + \hat{\eta} $$ (5.19)

Note that in the nonlinear setting, the optimization problem corresponds to finding the flattest function in the feature space, not in the input space. A strong advantage of the SVM is that it performs particularly well for nonlinear regression models with several input variables.
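The linear system (5.17) and the regression function (5.19) translate almost directly into code. The sketch below is a minimal illustration, not the software used in this study: all function names are hypothetical, an RBF kernel of the form $\exp(-\|x - z\|^2/\sigma^2)$ is assumed, and a plain Gaussian elimination stands in for whatever solver a production implementation would use.

```python
import math


def rbf_kernel(x, z, sigma2):
    """Assumed kernel: exp(-||x - z||^2 / sigma2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma2)


def solve_linear_system(A, b):
    """Solve A s = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (M[r][n] - tail) / M[r][r]
    return sol


def lssvm_fit(X, y, gamma, sigma2):
    """Build and solve the system of Eq. (5.17); returns (eta, a)."""
    n = len(X)
    A = [[0.0] + [1.0] * n]                       # first row: [0, 1']
    for k in range(n):
        row = [1.0] + [rbf_kernel(X[k], X[l], sigma2) for l in range(n)]
        row[k + 1] += 1.0 / gamma                 # Omega + I / gamma
        A.append(row)
    sol = solve_linear_system(A, [0.0] + list(y))
    return sol[0], sol[1:]


def lssvm_predict(x, X, eta, a, sigma2):
    """Regression function of Eq. (5.19)."""
    return sum(a_i * rbf_kernel(x_i, x, sigma2) for a_i, x_i in zip(a, X)) + eta
```

Two properties follow from (5.16) and give a quick sanity check on any implementation: the multipliers sum to zero, and the training residual at $x_i$ equals $a_i/\gamma$.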

5.3 Nonlinear Speech Production Modeling based on Support Vector Regression

Once the speech signal is embedded into a reconstructed phase space, building the nonlinear model becomes a function estimation problem to which least squares (LS) support vector regression (SVR) can be applied. Compared with other methods, SVR offers several advantages. First, in a nonlinear model for speech synthesis the system runs in autonomous mode, so good generalization performance is needed; otherwise the system is often unstable, i.e. its output is entirely different from what is wanted. SVR includes a regularization term that controls the capacity of the function class and improves the generalization ability of the model. Second, SVR is easy to use: only a few parameters need to be tuned, and no local solutions exist.

Taking the defects of source-filter theory into consideration, the speech signal itself is modeled instead of the excitation signal. The structure of the model is as follows. The input vector of the SVM is generated from the time series through a delay line. During the training phase the given signal is used as the input, and in autonomous running mode the output is fed back (Fig. 5-2).

5.3.1 NARX using SVR Model

For the linear dynamical part, we assume a model structure of the form

$$ y_k = \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j u_{k-j} + \xi_k $$ (5.20)

with $u_k, y_k \in R$ and $k \in N$, where $u_k$ is the input and $y_k$ the output at discrete time index $k$, and $\xi_k$ is the so-called equation error, which is assumed to be white Gaussian noise. This model structure is one of the best-known model structures in linear identification. Adding a static nonlinearity $f: R \to R: x \to f(x)$ to equation (5.20) leads to

$$ y_k = \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j f(u_{k-j}) + \xi_k $$ (5.21)

Applying the LS-SVM function estimation outlined in the previous section, we assume the following structure for the static nonlinearity $f$:

$$ f(u) = w'\phi(u) + \eta $$ (5.22)

with

$$ \Omega_{kl} = \phi(u_k)'\phi(u_l) = K(u_k, u_l), \quad k, l = 1, \ldots, N $$ (5.23)

a kernel of choice. Hence, equation (5.21) can be rewritten as

$$ y_k = \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j \big( w'\phi(u_{k-j}) + \eta \big) + \xi_k $$ (5.24)

We focus on finding estimates for the linear parameters $a_i$, $i = 1, \ldots, n$, and $b_j$, $j = 1, \ldots, m$, and for the static nonlinearity $f$. The Lagrangian of the resulting estimation problem is given by

$$ L(w, \eta, b, \xi, a; \alpha) = \frac{1}{2} w'w + \frac{\gamma}{2} \sum_{k=r}^{N} \xi_k^2 - \sum_{k=r}^{N} \alpha_k \Big( \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j \big( w'\phi(u_{k-j}) + \eta \big) + \xi_k - y_k \Big) $$ (5.25)

with $r = \max(m, n) + 1$. The conditions for optimality are as follows:

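$$ \begin{aligned}
\frac{\partial L}{\partial w} = 0 \ &\to\ w = \sum_{k=r}^{N} \sum_{j=1}^{m} \alpha_k b_j \phi(u_{k-j}) \\
\frac{\partial L}{\partial \eta} = 0 \ &\to\ \sum_{k=r}^{N} \sum_{j=1}^{m} \alpha_k b_j = 0 \\
\frac{\partial L}{\partial a_i} = 0 \ &\to\ \sum_{k=r}^{N} \alpha_k y_{k-i} = 0, \quad i = 1, \ldots, n \\
\frac{\partial L}{\partial b_j} = 0 \ &\to\ \sum_{k=r}^{N} \alpha_k \big( w'\phi(u_{k-j}) + \eta \big) = 0, \quad j = 1, \ldots, m \\
\frac{\partial L}{\partial \xi_k} = 0 \ &\to\ \alpha_k = \gamma \xi_k, \quad k = r, \ldots, N \\
\frac{\partial L}{\partial \alpha_k} = 0 \ &\to\ \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} b_j \big( w'\phi(u_{k-j}) + \eta \big) + \xi_k - y_k = 0, \quad k = r, \ldots, N
\end{aligned} $$ (5.26)

Substituting the conditions $\partial L / \partial w = 0$ and $\partial L / \partial \xi_k = 0$ into the last condition of (5.26) leads to

$$ \sum_{j=1}^{m} b_j \Big( \sum_{q=r}^{N} \sum_{p=1}^{m} b_p \alpha_q \, \phi(u_{q-p})'\phi(u_{k-j}) + \eta \Big) + \sum_{i=1}^{n} a_i y_{k-i} + \xi_k - y_k = 0, \quad k = r, \ldots, N $$ (5.27)

where $\xi_k = \alpha_k / \gamma$. If the $b_j$ values were known, the resulting problem would be linear in the unknowns and easy to solve through

$$ \begin{bmatrix} 0 & 0 & \hat{b}\,\mathbf{1}_{N-r+1}' \\ 0 & 0 & Y \\ \hat{b}\,\mathbf{1}_{N-r+1} & Y' & K + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} \eta \\ a \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ y \end{bmatrix} $$ (5.28)

with

$$ \begin{aligned}
\alpha &= [\alpha_r \ \cdots \ \alpha_N]', \quad \hat{b} = \sum_{j=1}^{m} b_j, \quad a = [a_1 \ \cdots \ a_n]', \quad y = [y_r \ \cdots \ y_N]', \\
Y &= \begin{bmatrix} y_{r-1} & y_r & \cdots & y_{N-1} \\ y_{r-2} & y_{r-1} & \cdots & y_{N-2} \\ \vdots & \vdots & & \vdots \\ y_{r-n} & y_{r-n+1} & \cdots & y_{N-n} \end{bmatrix}, \\
K_{p,q} &= \sum_{j=1}^{m} \sum_{l=1}^{m} b_j b_l \, \Omega_{p+r-j-1,\ q+r-l-1}, \quad p, q = 1, \ldots, N-r+1
\end{aligned} $$ (5.29)

Since the $b_j$ are in general not known and the solution of the resulting third-order estimation problem (5.27) is by no means trivial, we use an approximate method to obtain models of the form (5.21).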

To avoid having to solve problem (5.27), we propose to rewrite (5.24) as follows:

$$ y_k = \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} w_j'\phi(u_{k-j}) + d + \xi_k $$ (5.30)

which can conveniently be solved using LS-SVMs. Note, however, that the resulting model class is wider than (5.24) because the single $w$ is replaced by several $w_j$, $j = 1, \ldots, m$. Taking all of the above into account, the optimization problem that is ultimately solved is the following, with $\xi = (\xi_r, \ldots, \xi_N)$:

$$ \min_{w_j, a, d, \xi} F(w_j, \xi) = \frac{1}{2}\sum_{j=1}^{m} w_j'w_j + \frac{\gamma}{2}\sum_{k=r}^{N} \xi_k^2 $$ (5.31)

subject to

$$ \sum_{j=1}^{m} w_j'\phi(u_{k-j}) + \sum_{i=1}^{n} a_i y_{k-i} + d + \xi_k - y_k = 0, \quad k = r, \ldots, N $$ (5.32)

$$ \sum_{k=1}^{N} w_j'\phi(u_k) = 0, \quad j = 1, \ldots, m $$ (5.33)

Note the extra constraints (5.33), which center the nonlinear functions $w_j'\phi(\cdot)$, $j = 1, \ldots, m$, around their average over the training set. This removes the uncertainty resulting from the fact that any set of constants can be added to the terms of an additive nonlinear function, as long as the sum of the constants is zero. Removing this uncertainty facilitates the extraction of the parameters $b_j$ in (5.21) later. Furthermore, this constraint gives a clear meaning to the bias parameter $d$, namely

$$ d = \sum_{j=1}^{m} b_j \left( \frac{1}{N} \sum_{k=1}^{N} f(u_k) \right) $$ (5.34)

The resulting Lagrangian is

$$ L(w_j, d, a, \xi; \alpha, \beta) = F(w_j, \xi) - \sum_{k=r}^{N} \alpha_k \Big( \sum_{i=1}^{n} a_i y_{k-i} + \sum_{j=1}^{m} w_j'\phi(u_{k-j}) + d + \xi_k - y_k \Big) - \sum_{j=1}^{m} \beta_j \sum_{l=1}^{N} w_j'\phi(u_l) $$ (5.35)

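The conditions for optimality are

$$ \begin{aligned}
\frac{\partial L}{\partial w_j} = 0 \ &\to\ w_j = \sum_{k=r}^{N} \alpha_k \phi(u_{k-j}) + \beta_j \sum_{k=1}^{N} \phi(u_k), \quad j = 1, \ldots, m \\
\frac{\partial L}{\partial a_i} = 0 \ &\to\ \sum_{k=r}^{N} \alpha_k y_{k-i} = 0, \quad i = 1, \ldots, n \\
\frac{\partial L}{\partial d} = 0 \ &\to\ \sum_{k=r}^{N} \alpha_k = 0 \\
\frac{\partial L}{\partial \xi_k} = 0 \ &\to\ \alpha_k = \gamma \xi_k, \quad k = r, \ldots, N \\
\frac{\partial L}{\partial \alpha_k} = 0 \ &\to\ \sum_{j=1}^{m} w_j'\phi(u_{k-j}) + \sum_{i=1}^{n} a_i y_{k-i} + d + \xi_k - y_k = 0, \quad k = r, \ldots, N \\
\frac{\partial L}{\partial \beta_j} = 0 \ &\to\ \sum_{k=1}^{N} w_j'\phi(u_k) = 0, \quad j = 1, \ldots, m
\end{aligned} $$ (5.36)

with solution

$$ \begin{bmatrix} 0 & 0 & \mathbf{1}' & 0 \\ 0 & 0 & Y & 0 \\ \mathbf{1} & Y' & K + \gamma^{-1} I & K^{0} \\ 0 & 0 & K^{0\prime} & (\mathbf{1}_N' \, \Omega \, \mathbf{1}_N)\, I_m \end{bmatrix} \begin{bmatrix} d \\ a \\ \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ y \\ 0 \end{bmatrix} $$ (5.37)

where, for $p, q = 1, \ldots, N-r+1$,

$$ K_{p,q} = \sum_{j=1}^{m} \Omega_{p+r-j-1,\ q+r-j-1} $$ (5.38)

and, for $p = 1, \ldots, N-r+1$ and $q = 1, \ldots, m$,

$$ K^{0}_{p,q} = \sum_{k=1}^{N} \Omega_{k,\ p+r-q-1} $$ (5.39)

The projection of the obtained model onto (5.21) goes as follows. Estimates for the autoregressive parameters $a_i$, $i = 1, \ldots, n$, are obtained directly from (5.37). Furthermore, for the training input sequence $[u_1 \ \cdots \ u_N]$, we have

$$ \begin{bmatrix} b_1 \\ \vdots \\ b_m \end{bmatrix} \begin{bmatrix} \hat{f}(u_1) & \cdots & \hat{f}(u_N) \end{bmatrix} = \begin{bmatrix} \alpha_N & \cdots & \alpha_r & & 0 \\ & \ddots & & \ddots & \\ 0 & & \alpha_N & \cdots & \alpha_r \end{bmatrix} \times \begin{bmatrix} \Omega_{N-1,1} & \Omega_{N-1,2} & \cdots & \Omega_{N-1,N} \\ \Omega_{N-2,1} & \Omega_{N-2,2} & \cdots & \Omega_{N-2,N} \\ \vdots & \vdots & & \vdots \\ \Omega_{r-m,1} & \Omega_{r-m,2} & \cdots & \Omega_{r-m,N} \end{bmatrix} + \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_m \end{bmatrix} \sum_{k=1}^{N} \begin{bmatrix} \Omega_{k,1} & \cdots & \Omega_{k,N} \end{bmatrix} $$ (5.40)

with $\hat{f}(u)$ an estimate for

$$ \bar{f}(u) = f(u) - \frac{1}{N}\sum_{k=1}^{N} f(u_k) $$ (5.41)

Hence, estimates for the $b_j$ and the static nonlinearity $f$ can be obtained from a rank-1 decomposition of the right-hand side of (5.40), for instance using a singular value decomposition. Once all the $b_j$ are known, $\sum_{k=1}^{N} f(u_k)$ can be obtained as

$$ \sum_{k=1}^{N} f(u_k) = \frac{N d}{\sum_{j=1}^{m} b_j} $$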

Figure 5-2. A scheme of NARX using SVR

5.3.2 Optimum parameter Selection

A kernel function is a function that corresponds to an inner product in some feature space. Many kernel functions satisfy the symmetric positive semi-definite condition of Mercer's theorem, and there are many opportunities to go beyond the basic choices of kernel. Nevertheless, the general radial basis function (RBF) kernel is a reasonable choice. Roughly speaking, more basis functions, and hence more hyperparameters, imply a richer representation but also more opportunities for overfitting. Therefore, the RBF kernel with parameters (σ², ϒ) was used in this study, as shown in equation (5.42), without further model selection. It is important to select good (σ², ϒ) to achieve high training accuracy, because (σ², ϒ) are the only parameters of the RBF kernel model; furthermore, a poor choice of (σ², ϒ) may produce an overfitted model, even though it is generally believed that the SVM optimizes the generalization error and outperforms other learning machines.

Grid search finds the best (σ², ϒ) pair within a limited range. Exponentially growing sequences of σ² > 0 and ϒ > 0 were tried because this approach is simple and effective, even though more advanced search methods exist (Fig. 5-3).


$$K(x_i, x_j) = e^{-\frac{\| x_i - x_j \|^2}{\sigma^2}} \qquad (5.42)$$
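The grid search over exponentially growing parameter sequences can be sketched as follows. The code implements the RBF kernel of (5.42), a standard LS-SVM regression solve, and a validation-based search. Treating ϒ as the LS-SVM regularization constant (`reg`) is an assumption, as are the function names, the grid ranges, and the toy sine data; the sketch illustrates the procedure and does not reproduce the settings behind Table 5-1.

```python
import numpy as np

def rbf_kernel(A, B, sigma2):
    """K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma2), as in (5.42)."""
    sq = (np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / sigma2)

def lssvr_fit(X, y, sigma2, reg):
    """Solve the standard LS-SVM regression linear system for (bias, alpha)."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma2) + np.eye(n) / reg
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

def lssvr_predict(X_train, bias, alpha, sigma2, X_new):
    return rbf_kernel(X_new, X_train, sigma2) @ alpha + bias

def grid_search(X_tr, y_tr, X_va, y_va, sigma2_grid, reg_grid):
    """Return (best_mse, best_sigma2, best_reg) measured on the validation split."""
    best = (np.inf, None, None)
    for sigma2 in sigma2_grid:
        for reg in reg_grid:
            bias, alpha = lssvr_fit(X_tr, y_tr, sigma2, reg)
            pred = lssvr_predict(X_tr, bias, alpha, sigma2, X_va)
            mse = float(np.mean((pred - y_va) ** 2))
            if mse < best[0]:
                best = (mse, sigma2, reg)
    return best

# Toy data (hypothetical): noisy sine, alternating train/validation samples.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 120)[:, None]
y = np.sin(x[:, 0]) + 0.05 * rng.standard_normal(120)
X_tr, y_tr, X_va, y_va = x[::2], y[::2], x[1::2], y[1::2]

sigma2_grid = 2.0 ** np.arange(-4, 7, 2)   # exponentially growing sequences
reg_grid = 2.0 ** np.arange(0, 11, 2)
best_mse, best_sigma2, best_reg = grid_search(
    X_tr, y_tr, X_va, y_va, sigma2_grid, reg_grid)
print(best_mse, best_sigma2, best_reg)
```

Selecting the pair on held-out samples rather than on the training error is the design choice that guards against the overfitting discussed above; a cross-validated score could be substituted for the single split.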

Figure 5-3. 2D-plot of selection of optimum σ² and ϒ for phonation |i| of male group

Table 5-1. Results of optimized σ², ϒ and mean square error of RBF kernel in phonation |a|, |e|, |i|, |o|, |u| for both sexes

Sex      Phonation   σ²               ϒ                  Mean square error (× 10^-3)
                     Mean    S.D.     Mean     S.D.      Mean    S.D.
male     |a|         17.31   2.26     791.40   21.91     1.26    0.14
male     |e|         22.08   3.37     802.07   18.04     1.28    0.13
male     |i|         26.08   4.70     814.64   24.99     2.69    0.15
male     |o|         23.53   4.08     814.69   27.61     3.17    0.17
male     |u|         28.41   4.18     810.09   21.25     3.62    0.20
female   |a|         27.04   4.81     827.66   21.86     3.83    0.24
female   |e|         28.18   4.63     829.35   21.16     4.23    0.19
female   |i|         37.65   4.19     828.65   27.75     6.11    0.36
female   |o|         30.49   4.19     837.39   35.61     5.51    0.40
female   |u|         30.19   4.84     841.39   37.00     5.06    0.48


5.4 Evaluation of NARX using SVR Model

As shown in Fig. 5-4, the NARX model using SVR successfully predicts the modified voiced sound. The reconstructed speech is similar to the original in the time domain and in the lower part of the frequency domain. Moreover, shimmer and jitter are preserved in the reconstructed signal. Unlike /a:/, the reconstruction of /i:/ fails: the model predicts slightly wrong values. Models trained with 50 sets of parameters cannot produce a stable /i:/ in autonomous mode, although the MSE of one-step prediction is quite low. The original speech and some typical outputs for the |i| phonation are shown in Fig. 5-5.
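The gap between a low one-step MSE and an unstable autonomous output can be illustrated with a minimal sketch. In one-step mode the predictor always sees measured samples, whereas in autonomous mode its own outputs are fed back, so a small systematic error accumulates. The stand-in predictor below is a linear two-tap oscillator, not the trained SVR; the function names and the 1% perturbation are assumptions chosen only to make the effect visible.

```python
import numpy as np

def one_step_predictions(predict, signal, order):
    """Predict each sample from the previous `order` measured samples."""
    return np.array([predict(signal[t - order:t])
                     for t in range(order, len(signal))])

def autonomous_run(predict, seed, n_steps):
    """Feed predictions back as inputs, starting from the `seed` samples."""
    history = list(seed)
    order = len(seed)
    for _ in range(n_steps):
        history.append(predict(np.asarray(history[-order:])))
    return np.array(history[order:])

# Stand-in predictor (assumption): x[t] = 2 r cos(w) x[t-1] - r^2 x[t-2].
# With r = 1 it reproduces cos(w t) exactly; r = 1.01 is a slightly wrong model.
omega = 2.0 * np.pi / 50.0

def make_predictor(radius):
    coeffs = np.array([-radius**2, 2.0 * radius * np.cos(omega)])
    return lambda past: float(coeffs @ past)

signal = np.cos(omega * np.arange(400))
exact = make_predictor(1.0)
perturbed = make_predictor(1.01)

one_step = one_step_predictions(perturbed, signal, order=2)
one_step_mse = float(np.mean((one_step - signal[2:]) ** 2))
free_exact = autonomous_run(exact, signal[:2], n_steps=398)
free_perturbed = autonomous_run(perturbed, signal[:2], n_steps=398)

print("one-step MSE of perturbed model:", one_step_mse)
print("peak amplitude, exact model:", np.max(np.abs(free_exact)))
print("peak amplitude, perturbed model:", np.max(np.abs(free_perturbed)))
```

The perturbed model has a very small one-step error, yet its free-running output grows steadily, which mirrors the behaviour reported for /i:/.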

Figure 5-4. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |a| of male

Figure 5-5. Synthesized versus original signal (time delay = 50): (top) modified speech signal, (middle) EGG signal, (bottom) synthesized + modified speech signal; phonation |i| of male

5.4.1 Multi-band Model

It is well known from linear and adaptive filter theory that subband techniques present several advantages with respect to the full-band approach [144-146]. First of all, they achieve computational efficiency by decimating the signal before the adaptive processing. In fact, the subband linear adaptive filters have impulse responses that are shorter than that of the full-band adaptive filter, although the total number of free parameters remains the same.

A second interesting property is due to the splitting of the input signal: the eigenvalue spread of the subband signals' autocorrelation matrix is reduced, and consequently least-squares-like adaptation algorithms show better convergence performance [147].


It is well known, in fact, that the long training time required can hinder many real-time SVM applications. Since smaller schemes are needed for each subband, the multirate approach has been used in an on-line (or continuous learning) mode as a simple nonlinear adaptive filter [148].

An important topic in multirate signal processing is the choice of the filter banks. Filter banks decompose the full-band signal spectrum into a number of directly adjacent frequency subbands and recombine the signal spectra by the use of low-pass, band-pass, and high-pass filters. Moreover, in the last two decades several techniques and topologies for the design of filter banks have been proposed. A key part of the design concerns perfect vs. almost perfect reconstruction, or uniform vs. non-uniform bands [145, 146]. Our proposed multiband SVR model with a wavelet filterbank is illustrated in Fig. 5-6. The performance of this model will be presented in the next section. The multiband model successfully predicts voiced sounds, as shown in Fig. 5-7.
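A three-level decomposition into detail bands D1, D2, D3 and an approximation band A3, followed by perfect reconstruction, can be sketched with a Haar filter bank. The Haar wavelet is an assumption made here because it keeps the code self-contained; the wavelet actually used for the model in Fig. 5-6 is not specified in this section. In a multiband model each band would be passed to its own predictor before synthesis; the sketch leaves the bands unchanged so that the reconstruction property is visible.

```python
import numpy as np

def haar_analysis(x):
    """One analysis stage: low-pass/high-pass filtering plus decimation by 2."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_synthesis(approx, detail):
    """Inverse of haar_analysis: upsample and recombine the two bands."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def decompose(x, levels=3):
    """Return [D1, ..., D_levels, A_levels]; len(x) must be divisible by 2**levels."""
    bands, approx = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = haar_analysis(approx)
        bands.append(detail)
    bands.append(approx)
    return bands

def reconstruct(bands):
    """Rebuild the full-band signal from [D1, ..., D_levels, A_levels]."""
    approx = bands[-1]
    for detail in reversed(bands[:-1]):
        approx = haar_synthesis(approx, detail)
    return approx

# Toy signal (hypothetical): a slow and a fast sinusoid, 512 samples.
t = np.arange(512)
x = np.sin(2.0 * np.pi * t / 64.0) + 0.3 * np.sin(2.0 * np.pi * t / 5.0)
d1, d2, d3, a3 = decompose(x, levels=3)
x_rec = reconstruct([d1, d2, d3, a3])
print([len(b) for b in (d1, d2, d3, a3)], np.allclose(x, x_rec))
```

Each successive band is half as long as the previous one, which is the source of the computational saving attributed to subband processing above.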

Figure 5-6. Multiband SVR Model with wavelet filterbank


Figure 5-7. Plots of synthesized versus original signal (time delay = 50): (a) original speech signal, (b) EGG signal, (c) D1 original (left) and synthesized (right) signal, (d) D2 original (left) and synthesized (right) signal, (e) D3 original (left) and synthesized (right) signal, (f) A3 original (left) and synthesized (right) signal, synthesized + original speech signal; phonation |i| of male


Figure 5-8. Comparison of spectrogram between original speech signal and synthesized speech signal

5.5 Experimental Results

Table 5-2 presents the jitter(%) of synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes. The excessive jitter(%) of the preoperative vowels (see Table B-3 in APPENDIX B) decreases to below 1% for both sexes in the synthesized vowels, and shows no statistically significant difference from the postoperative vowels (except |a| of the male group, and |i|, |o| and |u| of the female group). Some discrepancy is expected, because the jitter(%) of |e| and |i| in the male group, and of |a| and |i| in the female group, is still over 1% after surgery. However, an LPC synthesizer cannot represent jitter(%) as this model does.
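As a reminder of what the jitter(%) figures measure, the sketch below computes cycle-to-cycle perturbation as the mean absolute difference between consecutive values divided by their mean, expressed in percent. This is the common "local" definition and is assumed here; the exact definitions used in this work are given in Chapter 3. The function name and the period and amplitude values are hypothetical.

```python
import numpy as np

def perturbation_percent(values):
    """Mean absolute difference of consecutive values, relative to their mean, in %."""
    values = np.asarray(values, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(values))) / np.mean(values)

# Hypothetical pitch periods (ms) and peak amplitudes for ten consecutive cycles.
periods = np.array([8.00, 8.06, 7.97, 8.03, 7.99, 8.05, 7.96, 8.02, 8.00, 8.04])
amplitudes = np.array([0.80, 0.82, 0.79, 0.81, 0.80, 0.83, 0.78, 0.81, 0.80, 0.82])

jitter = perturbation_percent(periods)        # applied to periods: jitter(%)
shimmer = perturbation_percent(amplitudes)    # applied to amplitudes: shimmer(%)
print(f"jitter = {jitter:.2f} %, shimmer = {shimmer:.2f} %")
```

A perfectly periodic signal gives 0%, which is why a synthesizer driven by a strictly periodic excitation cannot reproduce this measure.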

Table 5-3 shows the Lyapunov exponents of synthesized and postoperative voiced sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes, compared using the Wilcoxon rank sum test. There is no difference between synthesized and postoperative voiced sounds in any phonation for either sex. This result suggests that the synthesized vowel may resemble the postoperative vowel in dynamic aperiodicity analysis.
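The group comparison can be sketched as a two-sided Wilcoxon rank sum test with the usual normal approximation. The implementation below is a plain NumPy illustration without tie correction; the function names and the synthetic samples are assumptions, and the statistical software actually used for Tables 5-2 and 5-3 is not stated in this section.

```python
import math
import numpy as np

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    sorted_vals = values[order]
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and sorted_vals[j + 1] == sorted_vals[i]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test, normal approximation (no tie correction)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(np.concatenate([x, y]))
    w = ranks[:n1].sum()                              # rank sum of the first sample
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2.0))            # two-sided p-value
    return z, p

# Synthetic example (hypothetical values, not data from this study).
rng = np.random.default_rng(1)
synthesized = rng.normal(0.52, 0.10, size=30)
postoperative = rng.normal(0.52, 0.12, size=30)
z, p = rank_sum_test(synthesized, postoperative)
print(f"z = {z:.3f}, p = {p:.3f}")
```

A large p-value means the test finds no evidence that the two groups differ in location; it does not by itself establish that they are equal.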


Table 5-2. Results of jitter(%) between synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes

Sex      Phonation   Synthesized       Postoperative     p-value
                     Mean    S.D.      Mean    S.D.
male     |a|         0.76    0.21      0.71    0.54      .021
male     |e|         0.78    0.34      1.15    0.84      .047
male     |i|         0.86    0.37      1.48    1.11      .106
male     |o|         0.66    0.20      0.54    0.23      .032
male     |u|         0.78    0.24      0.71    0.40      <.001
female   |a|         0.83    0.41      1.97    3.22      .044
female   |e|         0.74    0.37      0.99    0.60      .021
female   |i|         0.89    0.31      1.88    0.77      .157
female   |o|         0.49    0.35      0.74    0.22      .076
female   |u|         0.61    0.38      1.08    0.44      .224

Table 5-3. Results of Lyapunov exponents between synthesized and postoperative sounds in phonation |a|, |e|, |i|, |o|, |u| for both sexes

Sex      Phonation   Synthesized       Postoperative     p-value
                     Mean    S.D.      Mean    S.D.
male     |a|         0.435   0.058     0.474   0.049     .001
male     |e|         0.475   0.091     0.493   0.089     <.001
male     |i|         0.520   0.098     0.517   0.118     .008
male     |o|         0.559   0.148     0.523   0.133     .004
male     |u|         0.565   0.153     0.534   0.183     .003
female   |a|         0.594   0.140     0.574   0.129     .003
female   |e|         0.620   0.144     0.600   0.149     .007
female   |i|         0.661   0.173     0.604   0.218     .005
female   |o|         0.659   0.140     0.634   0.117     .014
female   |u|         0.614   0.151     0.651   0.140     .004


5.6 Summary

In this chapter, our proposed NARX model based on LS-SVR was introduced and tested on enhanced preoperative voiced sounds in order to produce natural sounds. This nonlinear synthesizer reproduces voiced sounds closely and also conserves naturalness such as jitter and shimmer, whereas LPC does not. However, the results for some phonations are quite different from the original sounds. It is assumed that the single-band model cannot control and decompose the high-frequency components. Therefore, a multiband model with a wavelet filterbank was adopted in place of the single-band model. As a result, the multiband model shows improved stability. Finally, nonlinear speech modeling using NARX based on LS-SVR can successfully reconstruct modified preoperative sounds that are similar to postoperative voiced sounds, according to jitter, shimmer, and Lyapunov exponent analysis.


CHAPTER 6

Conclusion

In this dissertation, the design and implementation of a method for estimating the postoperative vowel is presented, using nonlinear speech modeling based on NARX and guided by acoustic and electroglottographic analysis of preoperative and postoperative vowels. This approach has not yet been widely validated, but it suggests a reasonable solution for the estimation of the postoperative vowel. Preoperative vowels of patients with benign vocal fold lesions (BVFL) usually have perceptual aperiodicity due to physical changes of the vocal cords caused by the lesions. Jitter, shimmer, and aspiration noise are mainly caused by loss of control of the vocal cords and an irregular pattern of vocal cord vibration. Therefore, we started from the hypothesis that reducing the jitter, shimmer, and noise of the preoperative vowel can bring the perceptual quality of speech closer to that of the postoperative vowel.

The findings of this dissertation are summarized below:

1. Established PDAs perform poorly on pathological voices because the increase of sub-harmonics and high-frequency components causes pitch halving and doubling errors. A PDA based on the FOS algorithm may solve these errors, especially for Type-2 signals as suggested by Titze [].

2. Clear differences in acoustic and electroglottographic statistics between preoperative and postoperative vowels were found, even though some measures remain roughly constant depending on sex, phonation, and measure.

3. Modification of the preoperative vowel based on acoustic and electroglottographic analysis can approximate the postoperative vowel in the spectral and dynamic domains.

4. Nonlinear speech modeling using NARX based on SVR performed better than LPC in the perceptual quality of voiced sounds. This result is attributed to the conservation of natural jitter, shimmer, and noise, whereas LPC produces artificial sounds due to a lack of naturalness.


Over the past several decades, speech signal processing has undergone significant improvements along with progress in digital signal processing, pattern recognition, artificial intelligence, and related areas. However, modeling the natural characteristics of speech production and measuring its quality are far from being completely solved. Of course, it is difficult to find and resolve the factors related to voice, especially pathological voice. Although research on pathological voice has had limited interaction with other areas compared to other speech processing techniques, this study remains valuable because this area may be a key to solving some bottleneck problems encountered in current speech signal processing. Finally, we recognize that much research remains to be conducted to address unsolved problems, such as real-time implementation and performance improvement.

Some interesting research topics are as follows: a neo-PSOLA that controls jitter and shimmer, perceptual multiband decomposition and reconstruction, and finding perceptual parameters directly relevant to pathological voices.


APPENDIX A

Detailed results of the performance of the PDAs in Chapter 2 are recorded below according to the following criteria:

Age, in intervals of two decades, excluding infants: Table A-1~5

Sex: Table A-6, 7

Phonation type (|a|, |e|, |i|, |o|, |u|): Table A-8~12

Table A-1. Results of performance of the available PDAs in normal & age range of 10's and 20's database (N_Normal & Age10~29 = 120) (unit: percentage %)

PDAs    G                X2               /2               F                S
        Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.
AC      1.06    1.29     0.09    0.37     0.97    1.25     1.24    0.71     2.24    2.85
AMDF    7.85    10.64    0.14    0.77     7.71    10.67    6.54    7.82     11.41   10.79
YIN     1.40    1.59     0.53    0.82     0.87    1.38     13.99   0.47     2.52    1.53
CEP     3.10    3.94     1.50    3.93     1.60    1.78     3.76    6.13     8.33    11.63
SIFT    2.37    2.83     0.69    2.16     1.69    1.95     4.96    2.51     6.79    9.53
WAV     4.85    2.11     0.00    0.05     4.85    2.11     0.40    0.76     0.44    0.83
PS      1.71    1.90     0.68    1.60     1.03    1.48     1.91    1.84     5.53    6.00
FOS     0.99    2.42     0.00    0.00     0.99    2.42     2.97    1.67     3.94    4.68


Table A-2. Results of performance of the available PDAs in normal & age range of 30's and 40's database (N_Normal & Age30~40 = 30) (unit: percentage %)

PDAs    G                X2               /2               F                S
        Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.
AC      1.13    1.44     0.31    0.65     0.82    1.12     1.59    1.13     4.07    6.51
AMDF    7.09    11.24    0.18    0.96     6.91    11.31    5.59    6.89     10.24   9.83
YIN     1.54    1.60     0.54    0.84     1.00    1.35     14.12   0.75     3.08    4.23
CEP     2.46    2.24     0.50    1.24     1.96    2.02     2.40    1.74     6.19    7.42
SIFT    2.71    2.34     0.55    1.67     2.16    2.09     4.32    1.82     4.57    6.83
WAV     5.03    2.07     0.02    0.12     5.00    2.11     0.36    0.77     0.42    0.79
PS      1.55    1.60     0.41    1.02     1.13    1.09     1.85    2.05     5.13    7.35
FOS     0.79    1.55     0.13    0.73     0.65    1.43     2.84    1.34     4.26    4.37

* Results of performance of the available PDAs in normal & age range of 50's and 60's database: N/A

Table A-3. Results of performance of the available PDAs in BVFL & age range of 10's and 20's database (N_BVFL & Age10~29 = 65) (unit: percentage %)

PDAs    G                X2               /2               F                S
        Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.
AC      2.00    3.41     0.57    2.81     1.43    1.71     1.91    2.86     4.47    9.27
AMDF    13.98   18.31    3.95    16.08    10.03   12.26    11.23   10.93    13.73   10.26
YIN     2.51    3.56     0.35    0.79     2.15    3.49     14.23   1.78     3.34    3.57
CEP     3.36    3.51     1.29    2.32     2.07    2.53     3.39    3.44     9.68    13.02
SIFT    3.82    4.23     0.76    2.26     3.06    3.67     4.96    3.38     6.80    7.61
WAV     6.17    1.99     0.01    0.06     6.16    1.98     0.63    1.25     0.57    1.12
PS      3.07    3.59     0.79    1.32     2.28    3.26     3.49    4.01     9.50    11.50
FOS     1.17    2.42     0.57    1.82     0.60    1.38     3.03    3.32     5.39    9.85


Table A-4. Results of performance of the available PDAs in BVFL & age range of 30's and 40's database (N_BVFL & Age30~49 = 280) (unit: percentage %)

PDAs    G                X2               /2               F                S
        Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.
AC      2.90    5.27     0.41    2.55     2.49    4.40     2.24    4.27     4.55    7.36
AMDF    8.36    12.08    0.55    1.76     7.81    12.14    7.27    8.75     11.37   10.25
YIN     4.06    7.06     0.70    2.03     3.36    6.48     14.79   3.85     4.57    6.66
CEP     6.24    9.12     2.62    6.47     3.63    4.64     5.61    9.34     10.97   12.82
SIFT    6.01    10.28    0.87    2.46     5.14    9.31     6.56    7.96     8.89    10.14
WAV     6.24    1.93     0.01    0.09     6.23    1.94     0.84    1.46     0.53    0.92
PS      4.64    7.76     1.54    5.87     3.11    4.67     3.97    7.04     8.05    8.78
FOS     0.77    1.96     0.26    1.26     0.50    1.47     2.63    2.30     3.89    6.83

Table A-5. Results of performance of the available PDAs in BVFL & age range of 50's and 60's database (N_BVFL & Age50~69 = 150) (unit: percentage %)

PDAs    G                X2               /2               F                S
        Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.     Mean    S.D.
AC      4.29    7.75     0.34    1.20     3.96    7.66     2.07    2.75     4.36    5.87
AMDF    6.89    10.82    0.74    2.01     6.15    10.96    6.39    8.36     10.29   9.82
YIN     4.94    7.80     0.73    2.35     4.21    6.77     14.90   3.45     5.18    5.00
CEP     6.64    10.43    2.73    8.95     3.91    4.81     6.39    17.24    10.73   13.92
SIFT    5.99    9.33     1.24    4.37     4.75    6.91     6.74    11.22    9.73    15.68
WAV     6.37    1.84     0.02    0.09     6.35    1.83     0.90    1.35     0.68    1.00
PS      5.48    8.03     0.97    2.42     4.50    7.20     4.05    6.58     7.74    7.11
FOS     0.80    2.00     0.38    1.36     0.42    1.56     2.73    2.32     3.76    5.48

Table A-6. Results of performance of the available PDAs in the normal and BVFL database (male, irrespective of age)

Measure  PDAs    Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC      1.03         1.40         2.62       4.89       <.001
G        AMDF    0.91         2.70         3.08       10.35      .008
G        YIN     1.30         1.44         3.48       5.98       <.001
G        CEP     4.41         5.25         8.69       12.39      <.001
G        SIFT    3.09         3.51         6.36       11.65      .001
G        WAV     4.54         2.38         5.97       2.37       <.001
G        PS      2.05         2.33         4.80       8.67       <.001
G        FOS     0.07         0.54         0.48       1.63       .003
X2       AC      0.31         0.65         0.73       3.44       .105
X2       AMDF    0.29         1.07         2.27       9.51       .004
X2       YIN     0.67         0.82         1.01       2.86       .142
X2       CEP     3.27         5.33         5.13       10.15      .069
X2       SIFT    1.53         3.12         1.87       4.63       .527
X2       WAV     0.02         0.12         0.03       0.14       .663
X2       PS      1.25         2.21         2.44       7.04       .041
X2       FOS     0.07         0.54         0.40       1.57       .014
/2       AC      0.72         1.17         1.89       3.12       <.001
/2       AMDF    0.62         2.48         0.81       4.40       .685
/2       YIN     0.63         1.16         2.47       4.50       <.001
/2       CEP     1.14         1.60         3.56       4.33       <.001
/2       SIFT    1.56         1.81         4.49       8.99       <.001
/2       WAV     4.52         2.39         5.94       2.36       <.001
/2       PS      0.81         1.19         2.36       3.54       <.001
/2       FOS     0.00         0.00         0.07       0.45       .019
F        AC      1.08         1.03         1.91       3.43       .003
F        AMDF    1.58         2.88         3.65       6.10       <.001
F        YIN     13.96        0.59         14.28      2.49       .098
F        CEP     5.75         8.62         8.99       17.42      .058
F        SIFT    4.97         3.61         6.90       12.05      .050
F        WAV     0.24         0.33         0.61       1.16       <.001
F        PS      2.92         2.57         4.48       7.65       .016
F        FOS     3.23         0.73         2.89       2.33       .084
S        AC      3.11         5.39         3.90       6.44       .354
S        AMDF    4.24         8.98         7.07       9.03       .042
S        YIN     2.77         3.18         4.29       4.71       .006
S        CEP     13.92        15.46        17.47      16.12      .139
S        SIFT    9.99         13.29        11.30      16.45      .540
S        WAV     0.24         0.33         0.48       0.84       .001
S        PS      8.47         8.76         8.38       8.40       .944
S        FOS     2.27         0.98         3.40       6.50       .019

Table A-7. Results of performance of the available PDAs in the normal and BVFL database (female, irrespective of age)

Measure  PDAs    Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC      1.10         1.27         3.60       6.61       <.001
G        AMDF    11.62        11.65        12.44      13.07      .566
G        YIN     1.50         1.67         4.56       7.56       <.001
G        CEP     2.14         1.89         4.15       5.11       <.001
G        SIFT    2.07         2.09         5.28       7.54       <.001
G        WAV     5.08         1.90         6.47       1.49       <.001
G        PS      1.46         1.45         4.62       6.54       <.001
G        FOS     1.46         2.70         1.07       2.25       .206
X2       AC      0.03         0.22         0.19       0.72       .001
X2       AMDF    0.06         0.60         0.23       0.69       .024
X2       YIN     0.45         0.82         0.43       1.09       .815
X2       CEP     0.16         0.61         0.68       2.03       <.001
X2       SIFT    0.16         0.68         0.35       1.06       .035
X2       WAV     0.00         0.00         0.00       0.01       .318
X2       PS      0.27         0.62         0.48       1.06       .019
X2       FOS     0.00         0.00         0.30       1.23       <.001
/2       AC      1.06         1.24         3.41       6.51       <.001
/2       AMDF    11.56        11.66        12.21      13.05      .648
/2       YIN     1.05         1.46         4.13       7.18       <.001
/2       CEP     1.98         1.89         3.48       4.63       <.001
/2       SIFT    1.91         2.07         4.93       7.44       <.001
/2       WAV     5.08         1.90         6.47       1.49       <.001
/2       PS      1.19         1.50         4.14       6.37       <.001
/2       FOS     1.46         2.70         0.77       1.84       .022
F        AC      1.44         0.63         2.30       3.86       <.001
F        AMDF    9.12         8.15         10.15      9.77       .308
F        YIN     14.04        0.51         15.07      4.05       <.001
F        CEP     2.17         1.26         3.23       4.31       <.001
F        SIFT    4.76         1.27         6.07       5.32       <.001
F        WAV     0.48         0.91         0.98       1.53       <.001
F        PS      1.30         0.90         3.56       5.70       <.001
F        FOS     2.78         1.93         2.60       2.55       .451
S        AC      2.32         2.70         4.87       7.69       <.001
S        AMDF    15.19        9.32         14.26      9.86       .405
S        YIN     2.56         1.65         4.80       6.56       <.001
S        CEP     4.42         4.33         6.16       7.97       .007
S        SIFT    4.23         4.09         7.22       6.79       <.001
S        WAV     0.55         0.99         0.65       1.05       .401
S        PS      3.70         3.11         7.99       8.96       <.001
S        FOS     5.00         5.50         4.49       7.22       .468

Table A-8. Results of performance of the available PDAs in the normal and BVFL database (phonation |a|)

Measure  PDAs    Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC      1.68         1.84         4.17       6.67       .001
G        AMDF    7.13         9.81         7.62       12.70      .825
G        YIN     1.93         2.33         5.94       8.69       <.001
G        CEP     2.74         2.20         5.74       7.14       <.001
G        SIFT    3.34         3.68         6.51       8.51       .004
G        WAV     5.74         1.62         6.39       1.65       .063
G        PS      1.90         2.09         5.38       7.31       <.001
G        FOS     1.20         2.89         0.65       1.45       .322
X2       AC      0.08         0.42         0.17       0.59       .343
X2       AMDF    0.00         0.00         0.30       0.86       .001
X2       YIN     0.39         0.74         0.46       1.03       .702
X2       CEP     0.33         1.27         1.34       3.84       .028
X2       SIFT    0.88         2.93         0.71       1.68       .767
X2       WAV     0.02         0.10         0.00       0.03       .420
X2       PS      0.40         1.05         1.06       3.97       .140
X2       FOS     0.00         0.00         0.15       0.65       .026
/2       AC      1.60         1.65         4.00       6.69       .001
/2       AMDF    7.13         9.81         7.32       12.81      .932
/2       YIN     1.54         2.07         5.48       8.70       <.001
/2       CEP     2.41         2.13         4.41       5.39       .003
/2       SIFT    2.46         2.20         5.79       8.28       <.001
/2       WAV     5.72         1.64         6.38       1.65       .060
/2       PS      1.50         1.78         4.32       6.28       <.001
/2       FOS     1.20         2.89         0.50       1.28       .208
F        AC      1.69         1.30         2.68       4.23       .043
F        AMDF    5.87         7.13         7.44       9.69       .336
F        YIN     14.07        0.86         15.05      3.73       .017
F        CEP     2.41         2.92         3.99       6.22       .057
F        SIFT    4.80         2.01         6.09       5.97       .068
F        WAV     0.58         1.03         1.06       1.59       .054
F        PS      1.69         1.92         3.96       5.85       .001
F        FOS     3.12         1.85         2.40       1.08       .051
S        AC      3.84         5.78         5.76       7.96       .152
S        AMDF    10.79        9.94         10.41      9.58       .855
S        YIN     3.35         4.29         4.85       4.42       .102
S        CEP     5.65         8.68         9.09       11.41      .083
S        SIFT    6.02         7.10         8.74       9.11       .092
S        WAV     0.53         0.81         0.67       0.96       .417
S        PS      4.41         6.00         8.46       9.84       .007
S        FOS     4.38         5.00         3.38       4.09       .326

Table A-9. Results of performance of the available PDAs in the normal and BVFL database (phonation |e|)

Measure  PDAs    Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC      0.91         0.86         2.58       4.42       .001
G        AMDF    9.25         14.07        7.78       11.10      .604
G        YIN     1.19         1.13         3.89       6.52       <.001
G        CEP     2.67         2.69         5.10       9.19       .022
G        SIFT    2.04         2.39         5.27       8.04       .001
G        WAV     4.96         2.10         6.10       2.15       .012
G        PS      1.33         1.34         4.33       7.12       <.001
G        FOS     1.11         2.98         0.70       1.62       .475
X2       AC      0.08         0.31         0.17       0.63       .323
X2       AMDF    0.00         0.00         0.35       1.02       .001
X2       YIN     0.49         0.68         0.41       1.01       .602
X2       CEP     0.89         2.81         1.85       6.77       .263
X2       SIFT    0.64         1.83         0.89       3.47       .599
X2       WAV     0.00         0.00         0.01       0.09       .182
X2       PS      0.75         1.22         1.42       4.90       .214
X2       FOS     0.13         0.73         0.16       0.72       .858
/2       AC      0.83         0.83         2.41       4.27       .001
/2       AMDF    9.25         14.07        7.43       11.28      .521
/2       YIN     0.70         1.08         3.49       5.98       <.001
/2       CEP     1.78         1.61         3.25       4.35       .006
/2       SIFT    1.40         1.84         4.38       7.40       <.001
/2       WAV     4.96         2.10         6.09       2.14       .013
/2       PS      0.58         0.92         2.91       4.73       <.001
/2       FOS     0.98         2.93         0.54       1.51       .438
F        AC      1.18         0.54         1.81       2.09       .008
F        AMDF    7.13         9.72         7.22       8.48       .963
F        YIN     13.94        0.34         14.74      3.53       .028
F        CEP     3.18         4.36         4.47       10.24      .323
F        SIFT    4.51         1.76         6.15       7.53       .048
F        WAV     0.24         0.48         0.65       1.19       .007
F        PS      1.46         1.58         3.72       6.44       .002
F        FOS     2.98         2.01         2.51       1.57       .254
S        AC      2.17         3.47         3.87       5.24       .042
S        AMDF    10.64        10.28        10.77      9.42       .949
S        YIN     2.34         1.43         4.03       4.20       .001
S        CEP     7.72         10.06        9.09       10.32      .519
S        SIFT    4.40         5.08         8.64       10.82      .004
S        WAV     0.34         0.73         0.47       0.94       .433
S        PS      4.07         4.56         7.37       6.84       .003
S        FOS     3.52         4.32         3.34       4.26       .840

Table A-10. Results of performance of the available PDAs in the normal and BVFL database (phonation |i|)

Measure  PDAs    Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC      0.86         1.06         2.05       4.29       .013
G        AMDF    7.76         9.92         9.17       14.08      .540
G        YIN     1.09         1.32         2.74       5.25       .005
G        CEP     2.44         2.67         4.71       6.43       .006
G        SIFT    1.79         1.80         4.30       8.32       .006
G        WAV     3.83         2.02         6.31       1.72       <.001
G        PS      1.08         1.29         3.49       6.58       .001
G        FOS     0.49         1.32         0.78       2.13       .369
X2       AC      0.15         0.48         0.30       1.01       .268
X2       AMDF    0.33         1.29         1.25       7.62       .255
X2       YIN     0.41         0.71         0.68       1.19       .136
X2       CEP     1.03         2.35         1.76       3.80       .206
X2       SIFT    0.30         0.94         0.67       1.86       .154
X2       WAV     0.02         0.12         0.03       0.15       .904
X2       PS      0.54         1.23         1.47       5.34       .110
X2       FOS     0.00         0.00         0.52       1.84       .006
/2       AC      0.70         0.98         1.75       4.01       .020
/2       AMDF    7.43         9.98         7.92       12.47      .823
/2       YIN     0.68         0.94         2.06       4.69       .007
/2       CEP     1.41         1.58         2.94       3.97       .002
/2       SIFT    1.49         1.81         3.63       7.03       .007
/2       WAV     3.81         2.04         6.28       1.72       <.001
/2       PS      0.54         0.79         2.02       2.92       <.001
/2       FOS     0.49         1.32         0.26       1.00       .395
F        AC      1.24         0.61         1.64       1.83       .066
F        AMDF    6.82         7.29         7.76       9.77       .571
F        YIN     13.95        0.45         14.29      2.58       .221
F        CEP     2.98         3.59         4.55       5.96       .080
F        SIFT    4.85         2.36         5.77       8.08       .315
F        WAV     0.26         0.44         0.83       1.40       .001
F        PS      2.42         1.99         4.15       7.09       .032
F        FOS     2.53         1.16         2.85       3.11       .409
S        AC      1.97         2.31         3.64       5.72       .021
S        AMDF    12.18        11.74        11.45      10.22      .760
S        YIN     2.51         1.36         4.15       4.15       .001
S        CEP     7.62         9.05         10.35      11.54      .183
S        SIFT    6.34         9.01         7.55       10.14      .535
S        WAV     0.43         0.78         0.60       0.95       .322
S        PS      5.09         6.26         8.37       8.57       .025
S        FOS     2.97         3.64         4.45       8.61       .178

Table A-11. Results of performance of the available PDAs in the normal and BVFL database (phonation |o|)

Measure  PDAs    Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC      1.24         1.39         3.88       7.34       .001
G        AMDF    7.93         11.19        9.36       13.27      .560
G        YIN     1.34         1.33         4.12       6.68       <.001
G        CEP     3.25         4.16         6.90       11.17      .008
G        SIFT    2.97         3.06         6.39       11.10      .007
G        WAV     5.29         2.13         6.32       1.93       .022
G        PS      2.02         1.45         4.99       7.24       <.001
G        FOS     1.02         1.88         0.85       2.11       .682
X2       AC      0.19         0.49         0.65       4.02       .262
X2       AMDF    0.11         0.63         1.59       7.91       .069
X2       YIN     0.40         0.65         0.77       2.84       .236
X2       CEP     1.67         3.68         3.36       9.78       .159
X2       SIFT    1.12         2.78         1.45       4.50       .634
X2       WAV     0.00         0.00         0.02       0.07       .035
X2       PS      0.54         0.87         1.10       4.59       .250
X2       FOS     0.00         0.00         0.30       1.22       .016
/2       AC      1.05         1.33         3.23       5.96       .001
/2       AMDF    7.82         11.25        7.77       11.52      .984
/2       YIN     0.94         1.21         3.35       5.63       <.001
/2       CEP     1.58         1.98         3.54       3.72       <.001
/2       SIFT    1.85         2.07         4.94       8.16       .001
/2       WAV     5.29         2.13         6.31       1.92       .024
/2       PS      1.49         1.65         3.89       5.08       <.001
/2       FOS     1.02         1.88         0.55       1.73       .233
F        AC      1.19         0.67         2.09       4.14       .039
F        AMDF    6.70         8.03         7.53       8.60       .626
F        YIN     14.03        0.45         14.73      3.36       .045
F        CEP     3.79         5.70         6.97       19.24      .151
F        SIFT    5.41         3.23         6.96       11.15      .223
F        WAV     0.48         0.88         0.79       1.43       .164
F        PS      1.69         1.37         3.55       6.28       .007
F        FOS     2.89         1.37         2.56       2.53       .364
S        AC      2.77         4.20         3.98       5.94       .215
S        AMDF    11.70        12.18        12.10      10.87      .874
S        YIN     2.19         1.53         4.46       5.91       .001
S        CEP     8.20         12.82        10.94      14.36      .323
S        SIFT    8.13         10.74        9.72       14.30      .516
S        WAV     0.58         1.02         0.54       1.03       .836
S        PS      6.55         6.74         7.90       8.45       .371
S        FOS     4.34         4.78         3.38       5.96       .370

Table A-12. Results of performance of the available PDAs in the normal and BVFL database (phonation |u|)

Measure  PDAs    Normal Mean  Normal S.D.  BVFL Mean  BVFL S.D.  p
G        AC      0.68         1.05         3.35       6.43       <.001
G        AMDF    6.40         8.27         9.35       13.20      .147
G        YIN     1.58         1.49         3.91       7.05       .003
G        CEP     3.76         5.61         7.47       10.41      .013
G        SIFT    2.06         2.14         6.13       10.70      .001
G        WAV     4.60         2.18         6.22       2.08       .001
G        PS      2.04         2.57         5.26       8.84       .002
G        FOS     0.95         1.94         1.18       2.67       .605
X2       AC      0.17         0.54         0.75       2.78       .052
X2       AMDF    0.28         1.09         1.79       8.12       .075
X2       YIN     0.96         1.14         1.00       2.96       .915
X2       CEP     2.57         5.85         4.07       8.32       .273
X2       SIFT    0.36         0.93         1.11       3.28       .045
X2       WAV     0.00         0.00         0.01       0.05       .170
X2       PS      0.90         2.56         1.28       4.37       .557
X2       FOS     0.00         0.00         0.58       1.91       .003
/2       AC      0.51         0.90         2.60       5.67       .001
/2       AMDF    6.13         8.37         7.57       11.33      .453
/2       YIN     0.62         1.12         2.91       5.26       <.001
/2       CEP     1.18         1.69         3.39       4.86       <.001
/2       SIFT    1.70         1.90         5.02       9.41       .001
/2       WAV     4.60         2.18         6.21       2.08       .001
/2       PS      1.4          1.36         3.97       7.15       <.001
/2       FOS     0.95         1.94         0.60       1.77       .384
F        AC      1.25         0.67         2.48       5.02       .019
F        AMDF    5.26         5.80         7.68       8.84       .084
F        YIN     14.07        0.46         14.95      4.24       .046
F        CEP     5.07         9.05         7.80       12.30      .190
F        SIFT    4.61         2.42         7.05       9.86       .027
F        WAV     0.40         0.78         0.82       1.37       .036
F        PS      2.23         2.32         4.27       7.19       .016
F        FOS     3.21         1.52         3.26       3.23       .916
S        AC      2.31         2.77         5.16       10.02      .012
S        AMDF    10.56        9.11         12.03      10.71      .461
S        YIN     2.78         1.49         5.48       9.14       .006
S        CEP     10.32        13.42        14.17      16.76      .201
S        SIFT    6.83         11.95        9.70       14.05      .273
S        WAV     0.29         0.75         0.62       0.99       .056
S        PS      7.13         7.28         8.63       9.74       .367
S        FOS     4.79         5.20         5.69       9.73       .513


APPENDIX B

Detailed results of the measures of jitter and shimmer in Chapter 3 are recorded in the following tables:

Table B-1. Mean and S.D. of mean, max, and min of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Table B-2. Mean and S.D. of S.D., phonatory frequency range, and mean absolute jitter of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Table B-3. Mean and S.D. of jitter(%), pitch perturbation factor, directional pitch perturbation factor of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Table B-4. Mean and S.D. of relative average pitch perturbation 3, 5, and 15 of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Table B-5. Mean and S.D. of mean, max, and min of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Table B-6. Mean and S.D. of S.D., Shimmer(dB), and mean absolute shimmer of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Table B-7. Mean and S.D. of shimmer(%), amplitude perturbation factor(%) and amplitude directional perturbation factor of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Table B-8. Mean and S.D. of relative average amplitude perturbation of amplitude for


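Table B-1. Mean and S.D. of mean, max, and min of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Measure   Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Mean F0   Irrespective of sex  a          165.0     39.3      153.6      44.1       .000
Mean F0   Irrespective of sex  e          166.9     40.6      155.4      44.7       .000
Mean F0   Irrespective of sex  i          170.4     42.6      157.6      45.2       .000
Mean F0   Irrespective of sex  o          171.3     43.9      156.1      45.1       .001
Mean F0   Irrespective of sex  u          172.7     43.7      157.0      44.6       .001
Mean F0   Male                 a          134.9     18.6      117.6      18.3       .001
Mean F0   Male                 e          135.7     18.8      118.8      17.9       .001
Mean F0   Male                 i          137.7     19.9      120.6      17.7       .002
Mean F0   Male                 o          140.9     28.7      119.4      18.6       .004
Mean F0   Male                 u          143.0     30.2      120.7      18.1       .004
Mean F0   Female               a          198.1     27.4      193.1      26.0       .141
Mean F0   Female               e          201.3     28.1      195.7      26.5       .134
Mean F0   Female               i          206.4     29.8      198.3      27.0       .087
Mean F0   Female               o          204.7     31.7      196.6      27.2       .094
Mean F0   Female               u          205.3     31.2      197.0      27.0       .059
Max F0    Irrespective of sex  a          202.0     93.5      165.2      54.2       .015
Max F0    Irrespective of sex  e          183.7     51.4      161.1      46.6       .000
Max F0    Irrespective of sex  i          186.6     48.6      167.2      50.3       .000
Max F0    Irrespective of sex  o          182.8     53.3      160.4      46.1       .002
Max F0    Irrespective of sex  u          186.4     42.4      161.8      46.1       .001
Max F0    Male                 a          172.2     114.2     122.4      22.7       .073
Max F0    Male                 e          146.8     23.2      123.2      18.8       .000
Max F0    Male                 i          150.9     25.4      126.5      19.5       .001
Max F0    Male                 o          153.8     48.0      123.0      19.3       .011
Max F0    Male                 u          159.4     51.1      124.4      18.2       .006
Max F0    Female               a          234.8     47.9      212.3      36.5       .031
Max F0    Female               e          224.4     42.4      202.7      28.4       .001
Max F0    Female               i          225.9     35.8      212.1      31.6       .023
Max F0    Female               o          214.7     39.3      201.6      27.9       .061
Max F0    Female               u          216.2     35.4      203.0      28.5       .013
Min F0    Irrespective of sex  a          144.2     33.7      144.5      40.9       .938
Min F0    Irrespective of sex  e          153.2     39.1      149.4      43.6       .271
Min F0    Irrespective of sex  i          155.5     38.4      148.8      40.9       .036
Min F0    Irrespective of sex  o          159.3     43.0      151.6      44.9       .015
Min F0    Irrespective of sex  u          157.8     43.0      152.1      43.7       .083
Min F0    Male                 a          121.4     22.6      112.7      17.7       .067
Min F0    Male                 e          124.3     21.9      113.8      18.6       .025
Min F0    Male                 i          125.9     20.0      115.2      17.3       .020
Min F0    Male                 o          126.4     22.1      115.2      19.9       .013
Min F0    Male                 u          125.3     24.6      116.6      19.4       .086
Min F0    Female               a          169.2     25.2      179.4      28.6       .082
Min F0    Female               e          185.0     27.2      188.6      25.2       .479
Min F0    Female               i          188.1     24.6      186.0      22.6       .630
Min F0    Female               o          195.6     28.4      191.6      26.5       .383
Min F0    Female               u          193.5     27.8      191.1      25.6       .572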

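Table B-2. Mean and S.D. of S.D., phonatory frequency range, and mean absolute jitter of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Measure                     Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
S.D. of F0                  Irrespective of sex  a          8.97      10.6      3.28       4.76       .002
S.D. of F0                  Irrespective of sex  e          4.73      4.39      2.18       1.07       .000
S.D. of F0                  Irrespective of sex  i          4.76      2.85      3.10       1.90       .001
S.D. of F0                  Irrespective of sex  o          3.82      5.34      1.85       1.46       .024
S.D. of F0                  Irrespective of sex  u          4.82      6.24      2.04       1.60       .006
S.D. of F0                  Male                 a          6.71      11.6      1.85       1.75       .072
S.D. of F0                  Male                 e          4.02      4.26      1.98       1.15       .027
S.D. of F0                  Male                 i          4.40      3.29      2.37       1.79       .014
S.D. of F0                  Male                 o          4.47      7.05      1.75       1.71       .092
S.D. of F0                  Male                 u          5.46      8.09      1.78       2.03       .045
S.D. of F0                  Female               a          11.46     9.16      4.86       6.36       .008
S.D. of F0                  Female               e          5.51      4.49      2.40       0.95       .004
S.D. of F0                  Female               i          5.16      2.29      3.92       1.73       .031
S.D. of F0                  Female               o          3.11      2.34      1.96       0.74       .046
S.D. of F0                  Female               u          4.12      3.27      2.33       0.88       .019
Phonatory frequency range   Irrespective of sex  a          5.07      5.61      2.10       2.51       .003
Phonatory frequency range   Irrespective of sex  e          3.07      3.32      1.32       0.76       .001
Phonatory frequency range   Irrespective of sex  i          3.11      1.93      1.91       0.89       .000
Phonatory frequency range   Irrespective of sex  o          2.33      3.58      1.04       0.96       .027
Phonatory frequency range   Irrespective of sex  u          2.90      4.38      1.10       0.92       .012
Phonatory frequency range   Male                 a          4.69      6.81      1.38       1.41       .042
Phonatory frequency range   Male                 e          2.94      3.73      1.40       0.84       .060
Phonatory frequency range   Male                 i          3.11      2.27      1.64       0.98       .007
Phonatory frequency range   Male                 o          3.06      4.70      1.19       1.31       .083
Phonatory frequency range   Male                 u          3.83      5.82      1.17       1.25       .047
Phonatory frequency range   Female               a          5.48      4.03      2.89       3.19       .029
Phonatory frequency range   Female               e          3.21      2.89      1.23       0.67       .003
Phonatory frequency range   Female               i          3.10      1.53      2.20       0.68       .016
Phonatory frequency range   Female               o          1.53      1.38      0.87       0.23       .057
Phonatory frequency range   Female               u          1.88      1.34      1.03       0.30       .007
Mean absolute jitter (MAJ)  Irrespective of sex  a          6.96      8.64      2.35       4.99       .001
Mean absolute jitter (MAJ)  Irrespective of sex  e          4.52      4.47      1.62       1.11       .000
Mean absolute jitter (MAJ)  Irrespective of sex  i          4.90      3.95      2.80       1.96       .004
Mean absolute jitter (MAJ)  Irrespective of sex  o          3.01      5.13      1.05       0.61       .020
Mean absolute jitter (MAJ)  Irrespective of sex  u          4.20      5.88      1.48       1.05       .006
Mean absolute jitter (MAJ)  Male                 a          3.52      4.06      0.89       0.90       .009
Mean absolute jitter (MAJ)  Male                 e          3.89      4.83      1.34       1.02       .015
Mean absolute jitter (MAJ)  Male                 i          4.97      5.15      1.89       1.70       .018
Mean absolute jitter (MAJ)  Male                 o          3.86      7.02      0.65       0.30       .045
Mean absolute jitter (MAJ)  Male                 u          5.00      7.81      0.86       0.47       .022
Mean absolute jitter (MAJ)  Female               a          10.75     10.6      3.95       6.91       .013
Mean absolute jitter (MAJ)  Female               e          5.21      4.05      1.92       1.15       .001
Mean absolute jitter (MAJ)  Female               i          4.82      2.07      3.81       1.75       .070
Mean absolute jitter (MAJ)  Female               o          2.07      0.84      1.49       0.56       .008
Mean absolute jitter (MAJ)  Female               u          3.32      2.38      2.17       1.09       .036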

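Table B-3. Mean and S.D. of jitter(%), pitch perturbation factor, directional pitch perturbation factor of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes

Measure                             Sex                  Phonation  Pre Mean  Pre S.D.  Post Mean  Post S.D.  p-value
Jitter(%)                           Irrespective of sex  a          3.86      4.26      1.31       2.31       .000
Jitter(%)                           Irrespective of sex  e          2.71      2.89      1.05       0.73       .000
Jitter(%)                           Irrespective of sex  i          2.95      2.64      1.67       0.97       .007
Jitter(%)                           Irrespective of sex  o          1.73      2.90      0.64       0.24       .020
Jitter(%)                           Irrespective of sex  u          2.39      3.40      0.89       0.45       .007
Jitter(%)                           Male                 a          2.56      2.87      0.71       0.54       .008
Jitter(%)                           Male                 e          2.85      3.61      1.15       0.84       .022
Jitter(%)                           Male                 i          3.53      3.49      1.48       1.11       .019
Jitter(%)                           Male                 o          2.40      3.91      0.54       0.23       .038
Jitter(%)                           Male                 u          3.12      4.51      0.71       0.40       .020
Jitter(%)                           Female               a          5.29      5.08      1.97       3.22       .011
Jitter(%)                           Female               e          2.55      1.90      0.99       0.60       .001
Jitter(%)                           Female               i          2.30      0.89      1.88       0.77       .084
Jitter(%)                           Female               o          1.00      0.34      0.74       0.22       .007
Jitter(%)                           Female               u          1.59      1.11      1.08       0.44       .046
Pitch Perturbation Factor(%)        Irrespective of sex  a          6.05      9.01      2.69       9.00       .054
Pitch Perturbation Factor(%)        Irrespective of sex  e          4.94      6.85      0.70       2.64       .000
Pitch Perturbation Factor(%)        Irrespective of sex  i          7.34      8.74      4.99       7.34       .185
Pitch Perturbation Factor(%)        Irrespective of sex  o          2.01      6.76      0.09       0.52       .075
Pitch Perturbation Factor(%)        Irrespective of sex  u          4.23      8.44      0.87       2.69       .016
Pitch Perturbation Factor(%)        Male                 a          3.22      6.31      0.48       2.27       .079
Pitch Perturbation Factor(%)        Male                 e          3.84      7.42      0.00       0.00       .024
Pitch Perturbation Factor(%)        Male                 i          7.70      11.31     0.883      4.14       .018
Pitch Perturbation Factor(%)        Male                 o          3.49      9.15      0.00       0.00       .087
Pitch Perturbation Factor(%)        Male                 u          5.07      10.6      0.00       0.00       .037
Pitch Perturbation Factor(%)        Female               a          9.16      10.58     5.13       12.5       .224
Pitch Perturbation Factor(%)        Female               e          6.15      6.12      1.48       3.72       .003
Pitch Perturbation Factor(%)        Female               i          6.95      4.80      9.51       7.50       .136
Pitch Perturbation Factor(%)        Female               o          0.37      0.76      0.19       0.76       .490
Pitch Perturbation Factor(%)        Female               u          3.31      5.12      1.84       3.71       .246
Directional Perturbation Factor(%)  Irrespective of sex  a          58.70     10.3      50.36      8.69       .000
Directional Perturbation Factor(%)  Irrespective of sex  e          62.56     8.9       53.01      10.9       .000
Directional Perturbation Factor(%)  Irrespective of sex  i          60.81     10.6      56.12      10.3       .034
Directional Perturbation Factor(%)  Irrespective of sex  o          57.68     8.52      48.51      10.4       .000
Directional Perturbation Factor(%)  Irrespective of sex  u          59.98     10.6      53.01      10.8       .002
Directional Perturbation Factor(%)  Male                 a          58.39     11.5      48.33      8.73       .000
Directional Perturbation Factor(%)  Male                 e          62.85     9.76      51.20      12.0       .000
Directional Perturbation Factor(%)  Male                 i          63.16     12.9      52.01      11.4       .003
Directional Perturbation Factor(%)  Male                 o          59.10     9.06      44.26      11.5       .000
Directional Perturbation Factor(%)  Male                 u          60.9      12.3      48.44      12.3       .001
Directional Perturbation Factor(%)  Female               a          59.05     9.06      52.58      8.30       .010
Directional Perturbation Factor(%)  Female               e          62.23     8.20      54.99      9.50       .000
Directional Perturbation Factor(%)  Female               i          58.22     6.81      60.63      6.88       .140
Directional Perturbation Factor(%)  Female               o          56.11     7.80      53.19      6.60       .207
Directional Perturbation Factor(%)  Female               u          58.92     8.63      58.03      5.91       .569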


            Female                             | Male                               | Irrespective of sex
            u      o      i      e      a      | u      o      i      e      a      | u      o      i      e      a

RAPP 3 (%)
pre Mean    0.99   0.62   1.40   1.58   3.22   | 1.89   1.46   2.21   1.74   1.57   | 1.47   1.06   1.82   1.66   2.35
pre S.D.    0.70   0.20   0.52   1.15   3.04   | 2.73   2.36   2.29   2.20   1.78   | 2.06   1.75   1.73   1.76   2.57
post Mean   0.67   0.47   1.17   0.62   1.21   | 0.42   0.31   0.89   0.67   0.41   | 0.54   0.38   1.02   0.64   0.79
post S.D.   0.26   0.13   0.47   0.37   1.93   | 0.24   0.12   0.67   0.54   0.35   | 0.28   0.15   0.59   0.46   1.39
p-value     .050   .006   .114   .001   .010   | .019   .033   .019   .021   .008   | .006   .018   .009   .000   .000

RAPP 5 (%)
pre Mean    0.91   0.59   1.35   1.44   3.13   | 1.84   1.43   1.85   1.62   1.56   | 1.40   1.03   1.61   1.53   2.30
pre S.D.    0.57   0.19   0.52   1.04   3.00   | 2.79   2.42   1.59   2.03   1.84   | 2.08   1.79   1.22   1.62   2.56
post Mean   0.64   0.46   1.09   0.59   1.17   | 0.41   0.32   0.85   0.63   0.426  | 0.52   0.39   0.96   0.61   0.78
post S.D.   0.24   0.12   0.44   0.29   1.89   | 0.19   0.11   0.63   0.44   0.30   | 0.24   0.13   0.56   0.37   1.36
p-value     .037   .013   .066   .001   .012   | .026   .043   .015   .023   .010   | .010   .026   .004   .000   .000

RAPP 15 (%)
pre Mean    0.96   0.62   1.31   1.45   2.97   | 1.90   1.46   2.06   1.68   1.72   | 1.45   1.06   1.70   1.57   2.31
pre S.D.    0.60   0.19   0.50   1.07   2.81   | 2.73   2.30   1.96   2.04   2.06   | 2.05   1.70   1.49   1.63   2.5
post Mean   0.65   0.49   1.08   0.61   1.12   | 0.53   0.44   0.91   0.75   0.54   | 0.59   0.47   0.99   0.68   0.82
post S.D.   0.24   0.11   0.42   0.27   1.73   | 0.22   0.14   0.60   0.44   0.25   | 0.23   0.13   0.52   0.37   1.23
p-value     .028   .016   .087   .002   .011   | .029   .048   .019   .028   .016   | .010   .028   .007   .000   .000

Table B-4. Mean and S.D. of relative average pitch perturbation 3, 5, and 15 of F0 for phonation |a|,|e|,|i|,|o|,|u| for both sexes

            Female                             | Male                               | Irrespective of sex
            u      o      i      e      a      | u      o      i      e      a      | u      o      i      e      a

Mean Amp
pre Mean    0.51   0.69   0.47   0.53   0.66   | 0.53   0.67   0.55   0.61   0.73   | 0.52   0.68   0.52   0.57   0.70
pre S.D.    0.21   0.35   0.23   0.25   0.44   | 0.26   0.35   0.30   0.29   0.36   | 0.24   0.34   0.27   0.27   0.40
post Mean   0.45   0.66   0.39   0.60   0.65   | 0.57   0.80   0.53   0.71   0.90   | 0.51   0.73   0.46   0.66   0.78
post S.D.   0.15   0.37   0.14   0.31   0.300  | 0.24   0.34   0.22   0.35   0.43   | 0.21   0.36   0.19   0.33   0.39
p-value     .229   .689   .089   .400   0.92   | .589   .185   .704   .298   .125   | .787   .397   .224   .173   .240

Max Amp
pre Mean    0.63   0.89   0.59   0.70   0.94   | 0.69   0.89   0.68   0.83   1.02   | 0.66   0.89   0.63   0.77   0.99
pre S.D.    0.25   0.42   0.27   0.31   0.52   | 0.32   0.41   0.35   0.38   0.40   | 0.29   0.41   0.31   0.35   0.46
post Mean   0.54   0.80   0.46   0.76   0.88   | 0.65   0.97   0.62   0.90   1.22   | 0.60   0.89   0.54   0.83   1.06
post S.D.   0.17   0.39   0.15   0.37   0.38   | 0.26   0.41   0.24   0.42   0.52   | 0.23   0.40   0.21   0.40   0.49
p-value     .115   .318   .029   .567   0.57   | 0.60   .524   .495   .553   .132   | .165   .997   .080   .403   .429

Min Amp
pre Mean    0.35   0.47   0.37   0.36   0.42   | 0.39   0.48   0.42   0.43   0.51   | 0.37   0.47   0.40   0.40   0.471
pre S.D.    0.17   0.26   0.19   0.17   0.38   | 0.24   0.30   0.24   0.21   0.30   | 0.21   0.28   0.22   0.19   0.34
post Mean   0.36   0.53   0.32   0.47   0.46   | 0.46   0.64   0.44   0.53   0.63   | 0.41   0.59   0.38   0.50   0.55
post S.D.   0.15   0.31   0.13   0.29   0.24   | 0.22   0.32   0.19   0.27   0.33   | 0.19   0.32   0.17   0.28   0.30
p-value     .714   .394   .199   .135   .606   | .185   .063   .757   .135   .199   | .175   .042   .688   .031   .177

Table B-5. Mean and S.D. of mean, max, and min of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes

            Female                             | Male                               | Irrespective of sex
            u      o      i      e      a      | u      o      i      e      a      | u      o      i      e      a

S.D. of Amp
pre Mean    0.07   0.09   0.04   0.07   0.12   | 0.07   0.09   0.05   0.09   0.11   | 0.07   0.09   0.05   0.08   0.12
pre S.D.    0.04   0.06   0.02   0.04   0.09   | 0.06   0.05   0.03   0.07   0.05   | 0.05   0.06   0.03   0.06   0.07
post Mean   0.04   0.07   0.03   0.07   0.09   | 0.04   0.08   0.04   0.08   0.14   | 0.04   0.07   0.03   0.07   0.12
post S.D.   0.03   0.05   0.01   0.04   0.05   | 0.04   0.06   0.03   0.07   0.06   | 0.03   0.05   0.02   0.06   0.06
p-value     .014   .072   .018   .560   .215   | .046   .593   .151   .827   .079   | .002   .111   .012   .636   .964

Shimmer (dB)
pre Mean    3.09   2.73   2.88   3.99   5.54   | 6.44   6.54   5.19   6.85   6.73   | 4.84   4.73   4.09   5.48   6.16
pre S.D.    1.81   1.38   0.99   1.37   2.38   | 7.29   8.47   3.45   5.36   3.27   | 5.63   6.43   2.81   4.20   2.91
post Mean   2.30   2.10   2.67   2.66   3.68   | 2.56   5.37   2.57   3.18   4.70   | 2.43   2.24   2.62   2.93   4.22
post S.D.   0.89   0.87   0.74   0.95   1.80   | 1.40   1.00   0.70   1.05   2.43   | 1.18   1.18   0.71   1.03   2.19
p-value     .056   .077   .427   .000   .001   | .014   .030   .002   .005   .002   | .005   .005   .002   .000   .000

Mean Absolute Shimmer (MAS)
pre Mean    0.01   0.02   0.02   0.04   0.07   | 0.04   0.02   0.02   0.04   0.05   | 0.03   0.02   0.02   0.04   0.06
pre S.D.    0.05   0.05   0.02   0.04   0.07   | 0.08   0.07   0.03   0.06   0.06   | 0.07   0.06   0.03   0.05   0.06
post Mean   0.02   0.03   0.01   0.02   0.04   | 0.03   0.01   0.01   0.03   0.06   | 0.02   0.02   0.01   0.03   0.05
post S.D.   0.02   0.02   0.02   0.02   0.02   | 0.06   0.06   0.04   0.06   0.05   | 0.05   0.05   0.03   0.04   0.04
p-value     .923   .409   .046   .072   .085   | .649   .365   .376   .934   .564   | .703   .841   .095   .490   .588

Table B-6. Mean and S.D. of S.D., shimmer (dB), and mean absolute shimmer of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes

            Female                             | Male                               | Irrespective of sex
            u      o      i      e      a      | u      o      i      e      a      | u      o      i      e      a

Shimmer (%)
pre Mean    0.01   0.01   0.01   0.02   0.03   | 0.03   0.03   0.02   0.04   0.04   | 0.02   0.02   0.02   0.03   0.03
pre S.D.    0.01   0.00   0.00   0.01   0.01   | 0.04   0.02   0.01   0.03   0.02   | 0.03   0.02   0.01   0.02   0.02
post Mean   0.01   0.01   0.01   0.01   0.02   | 0.01   0.01   0.01   0.02   0.04   | 0.01   0.01   0.01   0.01   0.03
post S.D.   0.00   0.00   0.00   0.00   0.01   | 0.00   0.01   0.00   0.01   0.02   | 0.00   0.01   0.00   0.00   0.02
p-value     .285   .028   .541   .024   .005   | .108   0.03   .002   .020   .246   | .067   .009   .002   .005   .009

Amplitude Perturbation Factor (%)
pre Mean    1.70   2.52   0.61   3.47   8.98   | 5.88   9.15   7.00   11.11  14.52  | 3.89   5.99   3.95   7.47   11.88
pre S.D.    6.18   3.79   1.37   4.90   7.99   | 11.64  10.73  9.88   12.99  10.28  | 9.57   8.77   7.83   10.61  9.57
post Mean   17.59  11.13  27.43  16.82  21.84  | 14.53  7.08   17.93  17.46  23.06  | 15.99  9.01   22.45  17.16  22.48
post S.D.   18.07  14.58  17.44  17.51  27.17  | 18.29  9.89   16.55  16.79  23.97  | 18.03  12.36  17.44  16.93  25.24
p-value     .002   .025   .000   .004   .063   | .048   .509   .023   .186   .110   | .000   .225   .000   .003   .013

Amplitude Directional Perturbation Factor (%)
pre Mean    61.50  56.19  62.19  59.64  61.06  | 61.02  58.36  62.46  62.42  59.77  | 61.25  57.33  62.33  61.09  60.38
pre S.D.    10.96  9.70   4.86   6.50   6.07   | 10.73  6.04   9.49   10.36  3.87   | 10.71  7.96   7.56   8.75   5.02
post Mean   62.53  59.16  65.87  61.62  60.81  | 58.02  52.02  59.48  57.25  55.34  | 60.17  55.42  62.52  59.33  57.94
post S.D.   4.92   8.38   3.94   5.27   6.40   | 10.95  9.09   5.39   7.14   11.15  | 8.82   9.37   5.70   6.62   9.50
p-value     .661   .348   .020   .223   .876   | .340   .016   .150   .035   .088   | .582   .358   .888   .248   .114

Table B-7. Mean and S.D. of shimmer (%), amplitude perturbation factor (%), and amplitude directional perturbation factor of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes

            Female                             | Male                               | Irrespective of sex
            u      o      i      e      a      | u      o      i      e      a      | u      o      i      e      a

RAAP 3 (%)
pre Mean    1.75   1.46   1.60   2.19   3.02   | 3.60   3.67   2.86   3.80   3.69   | 2.72   2.62   2.26   3.04   3.37
pre S.D.    1.23   0.87   0.62   0.82   1.28   | 4.33   5.08   1.74   2.81   1.89   | 3.34   3.85   1.46   2.24   1.65
post Mean   1.29   1.16   1.53   1.48   2.05   | 1.36   1.14   1.40   1.67   2.55   | 1.33   1.15   1.46   1.58   2.31
post S.D.   0.55   0.54   0.45   0.59   1.08   | 0.96   0.45   0.35   0.61   1.53   | 0.78   0.49   0.40   0.60   1.34
p-value     .093   .172   .665   .000   .001   | .016   .029   .001   .002   .003   | .006   .018   .002   .000   .000

RAAP 5 (%)
pre Mean    1.81   1.65   1.82   2.51   3.45   | 3.96   4.15   3.26   4.32   4.24   | 2.93   2.96   2.57   3.46   3.86
pre S.D.    0.89   0.78   0.56   0.83   1.65   | 4.70   5.75   2.54   3.75   2.06   | 3.58   4.34   1.99   2.89   1.89
post Mean   1.43   1.27   1.67   1.65   2.28   | 1.46   1.38   1.58   1.95   2.81   | 1.45   1.33   1.62   1.81   2.56
post S.D.   0.51   0.53   0.42   0.58   1.15   | 0.60   0.58   0.40   0.68   1.40   | 0.55   0.55   0.41   0.65   1.30
p-value     .057   .052   .277   .000   .004   | .017   .034   .006   .009   .001   | .008   .018   .004   .001   .000

RAAP 15 (%)
pre Mean    2.41   2.36   2.19   3.17   4.58   | 4.95   5.23   3.99   5.25   5.42   | 3.74   3.86   3.13   4.26   5.02
pre S.D.    0.92   0.85   0.63   0.94   1.97   | 5.06   6.02   2.85   4.03   2.22   | 3.89   4.58   2.27   3.13   2.12
post Mean   1.72   1.79   1.94   2.10   3.03   | 2.23   2.46   2.15   2.98   3.99   | 1.99   2.14   2.05   2.56   3.53
post S.D.   0.55   0.63   0.50   0.55   1.10   | 0.82   1.07   0.58   0.93   1.43   | 0.74   0.94   0.55   0.88   1.36
p-value     .002   .008   .127   .000   .002   | .015   .037   .006   .015   .002   | .003   .014   .003   .001   .000

Table B-8. Mean and S.D. of relative average amplitude perturbation of amplitude for phonation |a|,|e|,|i|,|o|,|u| for both sexes


Abstract (Korean)

Estimation of Postoperative Vowel of Benign Vocal Fold Lesions using Nonlinear Speech Production Modeling

The Graduate School, Yonsei University

Department of Biomedical Engineering

Seung Jin Jang

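In pathological voices, perceptual aperiodicity is caused mainly by perturbation factors such as
variation of the fundamental period (jitter), fluctuation of the amplitude (shimmer), and noise.
These factors are chiefly influenced by loss of control over glottal vibration, by tumors arising
on the glottis, and by the presence of noise generated during radiation and respiration. The
hypothesis of this study is that removing these perturbation factors from a pathological voice
can yield an improvement similar to that of the postoperative voice.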
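As a minimal illustration of two of the perturbation factors named above, the following Python sketch computes a jitter percentage from a sequence of pitch periods and a shimmer value in dB from a sequence of per-cycle peak amplitudes. The function names and the formulas (mean absolute cycle-to-cycle period difference over the mean period; mean absolute 20·log10 ratio of consecutive amplitudes) are common textbook definitions assumed here for illustration; they may differ in detail from the definitions used in the body of this dissertation.

```python
import math


def jitter_percent(periods):
    """Mean absolute difference of consecutive pitch periods,
    relative to the mean period, in percent (assumed definition)."""
    if len(periods) < 2:
        raise ValueError("need at least two pitch periods")
    diffs = [abs(b - a) for a, b in zip(periods[:-1], periods[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return 100.0 * mean_abs_diff / mean_period


def shimmer_db(amplitudes):
    """Mean absolute level difference between consecutive cycle
    peak amplitudes, in dB (assumed definition)."""
    if len(amplitudes) < 2:
        raise ValueError("need at least two cycle amplitudes")
    if any(a <= 0 for a in amplitudes):
        raise ValueError("amplitudes must be positive")
    steps = [abs(20.0 * math.log10(b / a))
             for a, b in zip(amplitudes[:-1], amplitudes[1:])]
    return sum(steps) / len(steps)


if __name__ == "__main__":
    # Hypothetical pitch periods (seconds) and peak amplitudes.
    periods = [0.0100, 0.0102, 0.0098, 0.0101]
    amps = [0.50, 0.55, 0.48, 0.52]
    print("jitter (%):", jitter_percent(periods))
    print("shimmer (dB):", shimmer_db(amps))
```

A perfectly periodic, constant-amplitude signal gives zero for both measures, which is the sense in which reducing these perturbations moves a voice toward regular phonation.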

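In this study, based on the results of acoustic and electroglottographic (EGG) examinations of
pre- and postoperative vowels, a model for estimating the postoperative vowel in benign vocal
fold lesions was designed and implemented through nonlinear speech modeling based on a
nonlinear autoregressive method with exogenous input (NARX).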
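The structure of a NARX model can be sketched as follows: the current output sample is predicted from past output samples and past samples of an exogenous input, y[n] = f(y[n-1..n-p], u[n-1..n-q]). The Python sketch below only builds the regressor vectors and targets that such a model would be trained on. Treating the speech signal as y and the EGG signal as u, the lag orders `p` and `q`, and the function name `narx_regressors` are assumptions for illustration; the nonlinear map f itself (a support vector regressor in this work) is not implemented here.

```python
def narx_regressors(y, u, p, q):
    """Build training pairs for y[n] = f(y[n-1..n-p], u[n-1..n-q]).

    y: output sequence (assumed here to be the speech signal)
    u: exogenous input sequence (assumed here to be the EGG signal)
    p, q: numbers of past output and input samples (hypothetical orders)
    Returns (X, targets), where each row of X is
    [y[n-1], ..., y[n-p], u[n-1], ..., u[n-q]] and the target is y[n].
    """
    if len(y) != len(u):
        raise ValueError("y and u must have the same length")
    if p < 1 or q < 1:
        raise ValueError("p and q must be at least 1")
    start = max(p, q)  # first index with enough history for both signals
    X, targets = [], []
    for n in range(start, len(y)):
        past_y = [y[n - k] for k in range(1, p + 1)]
        past_u = [u[n - k] for k in range(1, q + 1)]
        X.append(past_y + past_u)
        targets.append(y[n])
    return X, targets


if __name__ == "__main__":
    # Toy sequences, only to show the layout of the regressors.
    y = [1, 2, 3, 4, 5]
    u = [10, 20, 30, 40, 50]
    X, t = narx_regressors(y, u, p=2, q=1)
    for row, target in zip(X, t):
        print(row, "->", target)
```

Any regression method that maps the rows of `X` to `targets` then plays the role of f; a linear least-squares fit would give an ordinary ARX model, while a kernel regressor gives a nonlinear one.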

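First, for accurate voice analysis, a pitch detection algorithm that is robust for pathological
voices was proposed. Unlike other existing pitch detection algorithms, the proposed algorithm,
which is based on fast orthogonal search, can considerably reduce gross pitch errors, in
particular pitch halving errors.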

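Next, various measures related to acoustic and electroglottographic examinations were obtained
twice, before and after surgery, from 42 patients with benign vocal fold lesions. The mean pitch
of the male group decreased by about 12-15%, whereas the values of the female group did not
change significantly. Formant frequencies remained constant before and after surgery. Most of
the jitter measures changed with statistical significance, whereas only some of the shimmer
measures differed after surgery. Noise-related measures such as harmonic-to-noise ratio (HNR),
normalized noise energy (NNE), degree of hoarseness (DH), and normalized first harmonic energy
(NFHE) showed significant differences only for some phonations, depending on sex. Among the
electroglottographic measures, open quotient (OQ) and speed quotient (SQ) showed no change;
notably, however, the two groups separated by mean SQ value were found to return to within the
normal range.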

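Based on these examination results, the preoperative vowel was modified so as to improve it
toward the perceptual level of a normal voice. The degree of modification was adjusted according
to statistical results based on the differences between pre- and postoperative voices. The
modifications of pitch interval, amplitude, and aspiration noise were performed by pitch
synchronous overlap and add (PSOLA), an amplitude scaler, and wavelet threshold shrinkage
methods, together with removal of baseline wander from the electroglottographic signal. The
modified speech and glottal signals are then used as input signals to the NARX nonlinear speech
model based on least-squares support vector regression (SVR).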
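Wavelet threshold shrinkage, one of the modification steps listed above, can be illustrated with a deliberately small sketch: a one-level Haar transform, soft thresholding of the detail coefficients, and the inverse transform. The choice of the Haar wavelet, the single decomposition level, the threshold value, and all function names are assumptions made only to keep the example short; the wavelet family and threshold rule actually used for the aspiration-noise reduction are not specified in this abstract.

```python
import math


def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length sequence."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def haar_denoise(signal, t):
    """Soft-threshold the Haar detail coefficients with threshold t."""
    approx, detail = haar_forward(signal)
    shrunk = [soft_threshold(d, t) for d in detail]
    return haar_inverse(approx, shrunk)


if __name__ == "__main__":
    noisy = [1.0, 1.2, 5.0, 4.8, 2.0, 2.1, 0.0, 0.3]  # hypothetical samples
    print(haar_denoise(noisy, t=0.2))
```

With a threshold of zero the signal is reconstructed unchanged, and with a very large threshold each pair of samples collapses to its average, so the threshold controls how much fine-scale (noise-like) detail is removed.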

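Finally, the preoperative vowel modified on the basis of the acoustic and electroglottographic
examinations was shown to be substantially similar to the postoperative vowel in the frequency
and dynamical domains. In addition, nonlinear speech modeling using SVR-based NARX outperformed
LPC in the perceptual quality of the vowels. This result is presumed to arise because LPC
produces artificial speech lacking naturalness, whereas the proposed model preserves natural
jitter, shimmer, and noise.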

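Keywords: pitch detection algorithm, benign vocal fold lesions, nonlinear speech modeling,
nonlinear autoregressive exogenous model, voice analysis, electroglottographic analysis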
