APPENDIX 1

BASICS OF SPEECH SIGNAL

A.1.1 ANATOMY AND PHYSIOLOGY OF SPEECH PRODUCTION

Vocal organs produce human speech as shown in Figure A.1.1. The

main energy source is the lungs with the diaphragm. When speaking, the airflow

is forced through the glottis between the vocal cords and the larynx to the three

main cavities of the vocal tract: the pharynx, oral and nasal cavities. From the

oral and nasal cavities the airflow exits through the mouth and nose, respectively.

The V-shaped opening between the vocal cords, called the glottis, is the most

important sound source in the vocal system. The vocal cords may act in several

different ways during speech. The most important function is to modulate the air

flow by rapidly opening and closing, causing a buzzing sound from which vowels

and voiced consonants are produced. The fundamental frequency of vibration

depends on the mass and tension of the vocal cords and is about 110 Hz, 200 Hz, and 300 Hz for

men, women, and children, respectively. With stop consonants the vocal cords

may act suddenly from a completely closed position, in which they cut the

airflow, to a totally open position, producing a glottal stop. On the other hand, with

unvoiced consonants, such as /s/ or /f/, they may be completely open. An

intermediate position may also occur with phonemes like /h/ (Gianfranco Denes

2012).

Figure A.1.1 Pictorial view of human vocal organs (Gianfranco Denes 2012)

The pharynx connects the larynx to the oral cavity. It has almost fixed

dimensions, but its length may be changed slightly by raising or lowering the

larynx at one end and the soft palate at the other end. The soft palate also isolates

or connects the route from the nasal cavity to the pharynx. At the bottom of the

pharynx are the epiglottis and the false vocal cords, which prevent food from reaching the

larynx and isolate the esophagus acoustically from the vocal tract. The

epiglottis, the false vocal cords and the vocal cords are closed during swallowing

and open during normal breathing (Thomas et al. 2002).

The oral cavity is one of the most important parts of the vocal tract.

The movements of the palate, the tongue, the lips, the cheeks and the teeth can

vary its size, shape and acoustics. The tongue in particular is very flexible: the tip

and the edges can be moved independently and the entire tongue can move

forward, backward, up and down. The lips control the size and shape of the

mouth opening through which speech sound is radiated. Unlike the oral cavity,

the nasal cavity has fixed dimensions and shape. Its length is about 12 cm and

its volume about 60 cm³. The air stream to the nasal cavity is controlled by the soft palate

(Thomas 2008).

A simplified view of speech production is given in Figure A.1.2 where

the speech organs are divided into three main groups: the lungs, larynx, and vocal

tract. The lungs act as a power supply and provide airflow to the larynx stage of

the speech production mechanism. The larynx modulates airflow from the lungs

and provides either a periodic puff-like or a noisy airflow source to the vocal

tract. The vocal tract consists of oral, nasal, and pharynx cavities, giving the

modulated airflow its “color” by spectrally shaping the source. Sound sources

can also be generated by constrictions and boundaries that are made within the

vocal tract yielding an impulsive airflow source in addition to noisy and periodic

sources. The variation of air pressure at the lips results in a traveling sound wave

that the listener perceives as speech.

Figure A.1.2 Simple view of speech production (Thomas 2008)

There are three general categories of the source for speech sounds:

periodic, noisy, and impulsive, although combinations of these sources are often

present. Examples of speech sounds generated with each of these source

categories are seen in the word “shop,” where the “sh”, “o”, and “p” are

generated from a noisy, periodic, and impulsive source respectively (Thomas

2008).

Figure A.1.3 shows a more realistic view of the anatomy of speech

production. We now look in detail at this anatomy, as well as at the associated

physiology and its importance in speech production.

Figure A.1.3 Cross-sectional view of the anatomy of speech production (Thomas 2008)

A.1.1.1 Lungs

One purpose of the lungs is the inhalation and exhalation of air. When

we inhale, we enlarge the chest cavity by expanding the rib cage surrounding the

lungs and by lowering the diaphragm that sits at the bottom of the lungs and

separates the lungs from the abdomen; this action lowers the air pressure in the

lungs, thus causing air to rush in through the vocal tract and down the trachea

into the lungs. The trachea, sometimes referred to as the “windpipe,” is a pipe

about 12 cm long and 1.5–2 cm in diameter that runs from the lungs to the

epiglottis. The epiglottis is a small mass, or “switch,” which, during swallowing

and eating, deflects food away from entering the trachea. When we eat, the

epiglottis falls, allowing food to pass through a tube called the esophagus and

into the stomach. When we exhale, we reduce the volume of the chest cavity by

contracting the muscles in the rib cage, thus increasing the lung air pressure. This

increase in pressure then causes air to flow through the trachea into the larynx. In

breathing, we rhythmically inhale to take in oxygen, and exhale to release carbon

dioxide.

During the process of speaking, on the other hand, we take in short

spurts of air and release them steadily by controlling the muscles around the rib

cage. We override our rhythmic breathing by making the duration of exhaling

roughly equal to the length of a sentence or phrase. During this timed exhalation,

the lung air pressure is maintained at approximately a constant level, slightly

above atmospheric pressure, by steady slow contraction of the rib cage, although

the air pressure varies around this level due to the time-varying properties of the

larynx and vocal tract (Thomas 2008).

A.1.1.2 Vocal Tract

The vocal tract is comprised of the oral cavity from the larynx to

the lips and the nasal passage that is coupled to the oral tract by way of the

velum. The oral tract takes on many different lengths and cross-sections by

moving the tongue, teeth, lips, and jaw and has an average length of 17 cm in a

typical adult male (shorter in females) and a spatially varying cross section

of up to 20 cm². If we were to listen to the pressure wave at the output of the

vocal folds during voicing, we would hear simply a time-varying buzz-like

sound, which is not very interesting. One purpose of the vocal tract is to

spectrally “color” the source, which is important for making perceptually distinct

speech sounds. A second purpose is to generate new sources for sound

production (Thomas 2008).

Spectral Shaping — under certain conditions, the relation between a

glottal airflow velocity input and vocal tract airflow velocity output can be

approximated by a linear filter with resonances, much like resonances of organ

pipes and wind instruments. The resonance frequencies of the vocal tract are, in a

speech science context, called formant frequencies or simply formants. The word

“formant” also refers to the entire spectral contribution of a resonance so we

often use the phrases “formant bandwidth” and “formant amplitude” (at the

formant frequency). Formants change with different vocal tract configurations.

With different vowels, for example, the jaw, teeth, lips, and tongue are generally

in different positions. Panel (a) of Figure A.1.4 shows the tongue hump high in

the front and back of the palate (upper wall of mouth), each position

corresponding to different resonant cavities and thus different vowels (Thomas

2008).
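
As a rough illustration of vocal tract resonances, the following sketch treats the oral tract as a uniform lossless tube that is closed at the glottis and open at the lips. This idealization is an assumption made here for illustration and is not part of the description above. Such a tube resonates at odd multiples of its quarter-wavelength frequency, F_n = (2n - 1)c / (4L). The tube length of 17 cm is the average quoted above; the speed of sound of 350 m/s and the function name tube_formants are assumptions.

```python
def tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Assumes a lossless tube of constant cross section, which a real vocal
    tract is not; the result is only a first-order estimate.
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, n_formants + 1)]


# 17 cm tract length, as quoted for a typical adult male.
formants = tube_formants(0.17)
print([round(f) for f in formants])  # [515, 1544, 2574]
```

Under these assumptions the resonances fall near 500 Hz, 1500 Hz, and 2500 Hz. Changing the tongue, jaw, or lip position departs from the uniform tube and moves the formants, which is what distinguishes one vowel from another.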

Figure A.1.4 Illustration of changing vocal tract shapes for (a) vowels (having a periodic source), (b) plosives (having an impulsive source), and (c) fricatives (having a noise source) (Thomas 2008)

A.1.2 CATEGORIZATION OF SOUND BY SOURCE

There are various ways to categorize speech sounds. For example,

we can categorize speech sounds based on different sources to the vocal tract; we

have seen that different sources are due to the vocal fold state, but are also

formed at various constrictions in the oral tract. Speech sounds generated with a

periodic glottal source are termed voiced; likewise, sounds not so generated are

called unvoiced. There are a variety of unvoiced sounds, including those created

with a noise source at an oral tract constriction. Because the noise of such sounds

comes from the friction of the moving air against the constriction, these sounds

are sometimes referred to as fricatives.

Examples of some of these sound classes are shown in Figure A.1.5 in the

sentence, “Which tea party did Baker go to?” (Thomas 2008).

Figure A.1.5 Examples of voiced, fricative, and plosive sounds in the sentence, “Which tea party did Baker go to?” (a) Speech waveform; (b)–(d) magnified voiced, fricative, and plosive sounds from (a). (Thomas 2008)

An example of frication is in the sound “th” in the word “thin” where

turbulence is generated between the tongue and the upper teeth. The reader

should hold the “th” sound and feel the turbulence. A second unvoiced sound

class is plosives created with an impulsive source within the oral tract (Figure

A.1.4b). An example of a plosive is the “t” in the word “top.” The location of the

closed or partial constriction corresponds to different plosive or fricative sounds,

respectively. We noted earlier that a barrier could also be made at the vocal folds

by partially closing the vocal folds, but without oscillation, as in the sound “h” in

“he.” These are whispered unvoiced speech sounds. These voiced and unvoiced

sound categories, however, do not relate exclusively to the source state because a

combination of these states can also be made whereby vocal fold vibration occurs

simultaneously with impulsive or noisy sources. For example, with “z” in the

word “zebra,” the vocal folds are vibrating and, at the same time, noise is created

at a vocal tract constriction behind the teeth against the palate. Such sounds are

referred to as voiced fricatives in contrast to unvoiced fricatives where the vocal

folds do not vibrate simultaneously with frication. There also exist voiced

plosives as counterparts to unvoiced plosives as with the “b” in the word “boat.”

(Thomas 2008).
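
The source categories discussed above can be summarized as a small data structure. The sketch below is only illustrative: the Source flag type, the EXAMPLES mapping, and the describe function are hypothetical names introduced here, and the example sounds are the ones mentioned in this appendix, keyed by their orthographic spelling rather than by phonetic symbols.

```python
from enum import Flag, auto


class Source(Flag):
    """Excitation sources; a sound may combine more than one."""
    PERIODIC = auto()   # vocal fold vibration
    NOISY = auto()      # turbulence at a constriction
    IMPULSIVE = auto()  # release of a complete closure


# Example sounds taken from the text of this appendix.
EXAMPLES = {
    "o (shop)": Source.PERIODIC,
    "sh (shop)": Source.NOISY,
    "th (thin)": Source.NOISY,
    "t (top)": Source.IMPULSIVE,
    "z (zebra)": Source.PERIODIC | Source.NOISY,
    "b (boat)": Source.PERIODIC | Source.IMPULSIVE,
}


def is_voiced(source):
    """A sound is voiced when its source includes vocal fold vibration."""
    return Source.PERIODIC in source


def describe(source):
    """Name the sound class implied by a combination of sources."""
    voicing = "voiced" if is_voiced(source) else "unvoiced"
    if Source.NOISY in source:
        return voicing + " fricative"
    if Source.IMPULSIVE in source:
        return voicing + " plosive"
    return "voiced sound with a purely periodic source"


for sound, source in EXAMPLES.items():
    print(sound, "->", describe(source))
```

A flag type is used rather than a plain enumeration because, as noted above, vocal fold vibration can occur simultaneously with a noisy or impulsive source.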

A.1.3 PITCH AND FREQUENCY

A sound wave, like any other wave, is introduced into a medium by a

vibrating object. The vibrating object is the source of the disturbance, which

moves through the medium. The vibrating object, which creates the disturbance,

could be the vocal cords of a person, the vibrating string and soundboard of a

guitar or violin, the vibrating tines of a tuning fork, or the vibrating diaphragm of

a radio speaker. Regardless of what vibrating object is creating the sound wave,

the particles of the medium through which the sound moves vibrate in a back

and forth motion at a given frequency. The frequency of a wave refers to how

often the particles of the medium vibrate when a wave passes through the

medium. The frequency of a wave is measured as the number of complete back-

and-forth vibrations of a particle of the medium per unit of time. If a particle of

air undergoes 1000 longitudinal vibrations in 2 seconds, then the frequency of the

wave would be 500 vibrations per second. A commonly used unit for frequency

is the Hertz (abbreviated Hz), where

1 Hertz = 1 vibration/second (Thomas 2008)

As a sound wave moves through a medium, each particle of the

medium vibrates at the same frequency. This is sensible since each particle

vibrates due to the motion of its nearest neighbor. The first particle of the

medium begins vibrating, at say 500 Hz, and begins to set the second particle

into vibrational motion at the same frequency of 500 Hz. The second particle

begins vibrating at 500 Hz and thus sets the third particle of the medium into

vibrational motion at 500 Hz. The process continues throughout the medium;

each particle vibrates at the same frequency. And of course the frequency at

which each particle vibrates is the same as the frequency of the original source of

the sound wave. Consequently, a guitar string vibrating at 500 Hz will set the air

particles in the room vibrating at the same frequency of 500 Hz, which carries a

sound signal to the ear of a listener, where it is detected as a 500 Hz sound wave.

The back-and-forth vibrational motion of the particles of the medium

would not be the only observable phenomenon occurring at a given frequency.

Since a sound wave is a pressure wave, a detector could be used to detect

oscillations in pressure from a high pressure to a low pressure and back to a high

pressure. As the compressions (high pressure) and rarefactions (low pressure)

move through the medium, they would reach the detector at a given frequency.

For example, a compression would reach the detector 500 times per second if the

frequency of the wave were 500 Hz. Similarly, a rarefaction would reach the

detector 500 times per second if the frequency of the wave were 500 Hz. The

frequency of a sound wave not only refers to the number of back-and-forth

vibrations of the particles per unit of time, but also refers to the number of

compressions or rarefactions, which pass a given point per unit of time. A

detector could be used to detect the frequency of these pressure oscillations over

a given period of time. The typical output provided by such a detector is a

pressure-time plot, as shown in Figure A.1.6.

Figure A.1.6 Frequency of pressure oscillations (adapted from: Music in the digital world)

Since a pressure-time plot shows the fluctuations in pressure over time,

the period of the sound wave can be found by measuring the time between

successive high pressure points (corresponding to the compressions) or the time

between successive low pressure points (corresponding to the rarefactions). The

frequency is simply the reciprocal of the period. For this reason, a sound wave

with a high frequency would correspond to a pressure time plot with a small

period - that is, a plot corresponding to a small amount of time between

successive high pressure points. Conversely, a sound wave with a low frequency

would correspond to a pressure time plot with a large period - that is, a plot

corresponding to a large amount of time between successive high pressure points.

Figure A.1.7 shows two pressure-time plots, one corresponding to a high

frequency and the other to a low frequency.
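
The relation between period and frequency described above can be sketched numerically. The example below generates a sampled pressure-time record of a pure 500 Hz tone, measures the average time between successive pressure maxima (the compressions), and takes the reciprocal. The sampling rate of 48000 samples per second, the pure sinusoid, and the function name estimate_frequency are assumptions made for illustration; a real recording would contain noise and would need a more robust peak detector.

```python
import math


def estimate_frequency(pressure, sample_rate):
    """Estimate frequency (Hz) from the mean spacing of successive pressure peaks."""
    # A peak is a sample larger than its left neighbour and not smaller
    # than its right neighbour.
    peaks = [i for i in range(1, len(pressure) - 1)
             if pressure[i - 1] < pressure[i] >= pressure[i + 1]]
    if len(peaks) < 2:
        raise ValueError("need at least two compressions to measure a period")
    period_s = (peaks[-1] - peaks[0]) / (len(peaks) - 1) / sample_rate
    return 1.0 / period_s


sample_rate = 48000  # samples per second (assumed)
tone = [math.sin(2 * math.pi * 500 * n / sample_rate) for n in range(4800)]
print(round(estimate_frequency(tone, sample_rate)))  # 500
```

Here successive compressions are 96 samples apart, that is 0.002 s, so the frequency is 1 / 0.002 s = 500 Hz.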

Figure A.1.7 Pressure-time plots corresponding to high and low frequencies (adapted from: Music in the digital world)

The ears of a human (and other animals) are sensitive detectors

capable of detecting the fluctuations in air pressure, which impinge upon the

eardrum. The human ear is capable of detecting sound waves with a wide range

of frequencies, ranging from approximately 20 Hz to 20000 Hz. Any sound

with a frequency below the audible range of hearing (i.e., less than 20 Hz) is

known as infrasound and any sound with a frequency above the audible range

of hearing (i.e., more than 20000 Hz) is known as ultrasound.

Humans are not alone in their ability to detect a wide range of

frequencies. Dogs can detect frequencies as low as approximately 50 Hz and as

high as 45000 Hz. Cats can detect frequencies as low as approximately 45 Hz

and as high as 85000 Hz. Bats, being nocturnal creatures, must rely on

echolocation for navigation and hunting. Bats can detect frequencies as high as

120000 Hz. Dolphins can detect frequencies as high as 200000 Hz. While dogs,

cats, bats, and dolphins have an unusual ability to detect ultrasound, an elephant

possesses the unusual ability to detect infrasound, having an audible range from

approximately 5 Hz to approximately 10000 Hz.

The sensation of frequencies is commonly referred to as the pitch of a

sound. A high pitch sound corresponds to a high frequency sound wave and a

low pitch sound corresponds to a low frequency sound wave. Amazingly, many

people, especially those who have been musically trained, are capable of

detecting a difference in frequency between two separate sounds which is as little

as 2 Hz. When two sounds with a frequency difference of greater than 7 Hz are

played simultaneously, most people are capable of detecting the presence of a

complex wave pattern resulting from the interference and superposition of the

two sound waves. Certain sound waves, when played (and heard) simultaneously,

produce a particularly pleasant sensation and are said to

be consonant. Such sound waves form the basis of intervals in music. For

example, any two sounds whose frequencies make a 2:1 ratio are said to be

separated by an octave and result in a particularly pleasing sensation when heard.

That is, two sound waves sound good when played together if one sound has

twice the frequency of the other. Similarly, two sounds with a frequency ratio of

5:4 are said to be separated by an interval of a third; such sound waves also sound

good when played together. Examples of other sound wave intervals and their

respective frequency ratios are listed in Table A.1.1.

Table A.1.1 Sound wave intervals and their respective frequency ratios (adapted from:Music in the digital world)

Interval   Frequency Ratio   Examples
Octave     2:1               512 Hz and 256 Hz
Third      5:4               320 Hz and 256 Hz
Fourth     4:3               342 Hz and 256 Hz
Fifth      3:2               384 Hz and 256 Hz
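
The example frequencies in Table A.1.1 follow directly from the ratios. The short sketch below recomputes the upper note of each interval above a 256 Hz base tone; the INTERVALS mapping and the function name upper_note are hypothetical names used only for this illustration.

```python
from fractions import Fraction

# Frequency ratios as listed in Table A.1.1.
INTERVALS = {
    "octave": Fraction(2, 1),
    "third": Fraction(5, 4),
    "fourth": Fraction(4, 3),
    "fifth": Fraction(3, 2),
}


def upper_note(base_hz, interval):
    """Frequency (Hz) of the note lying the given interval above base_hz."""
    return float(base_hz * INTERVALS[interval])


for name in INTERVALS:
    print(name, round(upper_note(256, name), 1))
```

The octave, third, and fifth reproduce the tabulated 512 Hz, 320 Hz, and 384 Hz exactly. The fourth evaluates to about 341.3 Hz, close to the 342 Hz given in the table.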

The ability of humans to perceive pitch is associated with the

frequency of the sound wave, which impinges upon the ear. Because sound

waves traveling through air are longitudinal waves, which produce high and low

pressure disturbances of the particles of the air at a given frequency, the ear has

an ability to detect such frequencies and associate them with the pitch of the

sound. But pitch is not the only property of a sound wave detectable by the

human ear (available from: Music in the digital world).

A.1.4 ELEMENTS OF A LANGUAGE

A fundamental distinctive unit of a language is the phoneme; the

phoneme is distinctive in the sense that it is a speech sound class that

differentiates words of a language. For example, the words “cat,” “bat,” and

“hat” consist of three speech sounds, the first of which gives each word its

distinctive meaning, being from different phoneme classes. Many sounds provide

this distinctive meaning, and such sounds represent a particular phoneme. To

emphasize the distinction between the concept of a phoneme and sounds that

convey a phoneme, the speech scientist uses the term phone to mean a particular

instantiation of a phoneme. This distinction is also seen in the different studies of

phonemics and phonetics. Different languages contain different phoneme sets.

Syllables contain one or more phonemes, while words are formed with one or

more syllables, concatenated to form phrases and sentences. Linguistics is the

study of the arrangement of speech sounds, i.e., phonemes and the larger speech

units built from phonemes, according to the rules of a language. Phonemes can

differ across languages, but certain properties of the grammatical rules

combining phonemes and larger units of a language may be common and

instinctual. There are various ways to study speech sounds that make up

phoneme classes; the use of the above first two descriptors in this study is

sometimes referred to as articulatory phonetics, while using the last two is

referred to as acoustic phonetics. One broad phoneme classification for English is

in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. Figure

A.1.8 shows this classification, along with various subgroups, where each

phoneme symbol is written within slashes according to both the International

Phonetic Alphabet and an orthographic (alphabetic spelling) representation. An

insightful history of the various phoneme symbol representations is described. In

the remainder of this text, we use the orthographic symbols.

Figure A.1.8 Phonemes in American English. Orthographic symbols are given in parentheses to the left of the International Phonetic Alphabet symbols. (adapted from: Thomas 2008)

Phonemes arise from a combination of vocal fold and vocal tract articulatory

features. Articulatory features, corresponding to the first two descriptors above,

include the vocal fold state, i.e., whether the vocal folds are vibrating or open;

the tongue position and height, i.e., whether it is in the front, central, or back

along the palate and whether its constriction is partial or complete; and the velum

state, i.e., whether a sound is nasal or not. It has been hypothesized that the first

step in the production of a phone is to conceive in the brain the set of articulatory

features that correspond to a phoneme. A particular set of speech muscles is

responsible for “activating” each feature with certain relative timing. It is these

features that we may store in our brain for the representation of a phoneme. In

English, the combinations of features are such as to give 40 phonemes, while in

other languages the features can yield a smaller set (Thomas 2008).

A.1.5 REPRESENTATION AND ANALYSIS OF SPEECH SIGNALS

Continuous speech is a set of complicated audio signals, which

makes producing them artificially difficult. Speech signals are usually considered

as voiced or unvoiced, but in some cases they are something between these two.

Voiced sounds consist of a fundamental frequency (F0) and its harmonic

components produced by the vocal cords (vocal folds). The vocal tract modifies this

excitation signal, causing formant (pole) or anti-formant (zero) frequencies. Each

formant frequency also has an amplitude and a bandwidth, and it may sometimes be

difficult to define some of these parameters correctly. The fundamental

frequency and formant frequencies are probably the most important concepts in

speech synthesis and also in speech processing in general.

With purely unvoiced sounds, there is no fundamental

frequency in the excitation signal and therefore no harmonic structure either, and the

excitation can be considered as white noise. The airflow is forced through a vocal

tract constriction, which can occur in several places between glottis and mouth.

Some sounds are produced with complete stoppage of airflow followed by a

sudden release, producing an impulsive turbulent excitation often followed by a

more protracted turbulent excitation. Unvoiced sounds are also usually

quieter and less steady than voiced ones. Whispering is a special case of speech.

When whispering a voiced sound there is no fundamental frequency in the

excitation and the first formant frequencies produced by the vocal tract are perceived

(Available from: Phonetics and Theory of Speech Production).

Speech signals of the three vowels (/a/, /i/, and /u/) are presented in the

time and frequency domains in Figure A.1.9. The fundamental frequency is about

100 Hz in all cases and the formant frequencies F1, F2, and F3 with vowel /a/ are

approximately 600 Hz, 1000 Hz, and 2500 Hz respectively. With vowel /i/ the

first three formants are 200 Hz, 2300 Hz, and 3000 Hz, and with /u/ 300 Hz, 600

Hz, and 2300 Hz. The harmonic structure of the excitation is also easy to

perceive from the frequency-domain presentation.
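
The interplay of excitation and formants can be illustrated with a very simple source-filter sketch: an impulse train at the fundamental frequency is passed through a cascade of two-pole resonators, one per formant. This is not the procedure used to obtain Figure A.1.9. The impulse-train excitation, the 80 Hz formant bandwidth, the 16 kHz sampling rate, and the function names resonator and synthesize_vowel are all assumptions made for illustration; only the 100 Hz fundamental and the /a/ formant values come from the text.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Two-pole digital resonator modelling a single formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius
    theta = 2.0 * math.pi * freq_hz / sample_rate        # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2  # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesize_vowel(formants_hz, f0_hz=100.0, duration_s=0.5,
                     sample_rate=16000, bandwidth_hz=80.0):
    """Impulse train at f0 filtered by cascaded formant resonators."""
    period = int(round(sample_rate / f0_hz))  # samples per pitch period
    n_samples = int(duration_s * sample_rate)
    signal = [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    for formant in formants_hz:
        signal = resonator(signal, formant, bandwidth_hz, sample_rate)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalize to the range [-1, 1]


# Formants of /a/ as quoted above: 600 Hz, 1000 Hz, and 2500 Hz.
vowel_a = synthesize_vowel([600.0, 1000.0, 2500.0])
```

The output repeats every 160 samples, that is every 10 ms, which corresponds to the 100 Hz fundamental, while the resonators emphasize the harmonics lying near the formant frequencies.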

It can be seen that the first three formants are inside the normal

telephone channel (from 300 Hz to 3400 Hz) so the needed bandwidth for

intelligible speech is not very wide. For higher quality, up to 10 kHz bandwidth

may be used, which leads to a 20 kHz sampling frequency. Although the fundamental

frequency is outside the telephone channel, the human hearing system is capable

of reconstructing it from its harmonic components (Available from: Phonetics and

Theory of Speech Production).
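
Two pieces of arithmetic from the paragraph above can be sketched in a few lines. The sampling frequency follows from the Nyquist criterion, which requires sampling at no less than twice the highest frequency of interest. For the missing fundamental, the sketch lists the harmonics of a 100 Hz voice that fall inside the 300 Hz to 3400 Hz telephone channel and recovers their common spacing as a greatest common divisor. The GCD is only a crude stand-in, assumed here for illustration, for whatever the auditory system actually does, and all function names are hypothetical.

```python
from functools import reduce
from math import gcd


def min_sampling_rate(bandwidth_hz):
    """Nyquist criterion: sample at least twice the highest frequency."""
    return 2 * bandwidth_hz


def harmonics_in_band(f0_hz, low_hz, high_hz):
    """Harmonics of f0 (integer Hz) lying inside [low_hz, high_hz]."""
    first = -(-low_hz // f0_hz)  # ceiling division
    last = high_hz // f0_hz
    return [k * f0_hz for k in range(first, last + 1)]


def implied_fundamental(harmonics_hz):
    """Greatest common divisor of integer harmonic frequencies (Hz)."""
    return reduce(gcd, harmonics_hz)


print(min_sampling_rate(10_000))           # 20000
kept = harmonics_in_band(100, 300, 3400)   # the 100 Hz and 200 Hz components are cut off
print(implied_fundamental(kept))           # 100
```

Although the 100 Hz component itself is removed by the channel, the 100 Hz spacing of the surviving harmonics still identifies the fundamental.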

Figure A.1.9 The time- and frequency-domain presentation of vowels /a/, /i/, and /u/ (adapted from: Phonetics and Theory of Speech Production)

APPENDIX 2

DATASET

A.2 LIST OF PHONETICALLY BALANCED WORDS

Table A.2.1 Phonetically balanced words

List 1    List 2    List 3    List 4
are       awe       ache      bath
bad       bait      air       beast
bar       bean      bald      bee
bask      blush     barb      blonde
box       bought    bead      budge
cane      bounce    cape      bus
cleanse   bud       cast      bush
clove     charge    check     cloak
crash     cloud     class     course
creed     corpse    crave     court
death     dab       crime     dodge
deed      earl      deck      dupe
dike      else      dig       earn
dish      fate      dill      eel
end       five      drop      fin
feast     frog      fame      float
fern      gill      far       frown
folk      gloss     fig       hatch
ford      hire      flush     heed
fraud     hit       gnaw      hiss
fuss      hock      hurl      hot
grove     job       jam       how
heap      log       law       kite
hid       moose     leave     merge
hive      mute      lush      lush
hunt      nab       muck      neat
is        need      neck      new
mange     niece     nest      oils
no        nut       oak       or
nook      our       path      peck
not       perk      please    pert
pan       pick      pulse     pinch
pants     pit       rate      pod
pest      quart     rouse     race
pile      rap       shout     rack
plush     rib       sit       rave
rag       scythe    size      raw
rat       shoe      sob       rut
ride      sludge    sped      sage
rise      snuff     stag      scab
rub       start     take      shed
slip      suck      thrash    shin
smile     tan       toil      sketch
strife    tang      trip      slap
such      them      turf      sour
then      trash     vow       starve
there     vamp      wedge     strap
toe       vast      wharf     test
use       ways      who       tick
wheat     wish      why       touch
