APPENDIX 1
BASICS OF SPEECH SIGNAL
A.1.1 ANATOMY AND PHYSIOLOGY OF SPEECH PRODUCTION
Vocal organs produce human speech as shown in Figure A.1.1. The
main energy source is the lungs with the diaphragm. When speaking, the airflow
is forced through the glottis between the vocal cords and the larynx to the three
main cavities of the vocal tract: the pharynx, oral and nasal cavities. From the
oral and nasal cavities the airflow exits through the mouth and nose, respectively.
The V-shaped opening between the vocal cords, called the glottis, is the most
important sound source in the vocal system. The vocal cords may act in several
different ways during speech. The most important function is to modulate the
airflow by rapidly opening and closing, causing a buzzing sound from which vowels
and voiced consonants are produced. The fundamental frequency of vibration
depends on the mass and tension of the vocal cords and is about 110 Hz, 200 Hz, and 300 Hz for
men, women, and children, respectively. With stop consonants the vocal cords
may move suddenly from a completely closed position, in which they cut off the
airflow, to a totally open position, producing a glottal stop. On the other hand, with
unvoiced consonants, such as /s/ or /f/, they may be completely open. An
intermediate position may also occur with phonemes like /h/ (Gianfranco Denes
2012).
Figure A.1.1 Pictorial view of human vocal organs (Gianfranco Denes 2012)
The pharynx connects the larynx to the oral cavity. It has almost fixed
dimensions, but its length may be changed slightly by raising or lowering the
larynx at one end and the soft palate at the other end. The soft palate also isolates
or connects the route from the nasal cavity to the pharynx. At the bottom of the
pharynx are the epiglottis and false vocal cords, which prevent food from reaching the
larynx and isolate the esophagus acoustically from the vocal tract. The
epiglottis, the false vocal cords and the vocal cords are closed during swallowing
and open during normal breathing (Thomas et al. 2002).
The oral cavity is one of the most important parts of the vocal tract.
The movements of the palate, the tongue, the lips, the cheeks and the teeth can
vary its size, shape and acoustics. The tongue in particular is very flexible: the tip
and the edges can be moved independently, and the entire tongue can move
forward, backward, up and down. The lips control the size and shape of the
mouth opening through which speech sound is radiated. Unlike the oral cavity,
the nasal cavity has fixed dimensions and shape. Its length is about 12 cm and
its volume about 60 cm³. The air stream to the nasal cavity is controlled by the soft palate
(Thomas 2008).
A simplified view of speech production is given in Figure A.1.2 where
the speech organs are divided into three main groups: the lungs, larynx, and vocal
tract. The lungs act as a power supply and provide airflow to the larynx stage of
the speech production mechanism. The larynx modulates airflow from the lungs
and provides either a periodic puff-like or a noisy airflow source to the vocal
tract. The vocal tract consists of oral, nasal, and pharynx cavities, giving the
modulated airflow its “color” by spectrally shaping the source. Sound sources
can also be generated by constrictions and boundaries that are made within the
vocal tract yielding an impulsive airflow source in addition to noisy and periodic
sources. The variation of air pressure at the lips results in a traveling sound wave
that the listener perceives as speech.
Figure A.1.2 Simple view of speech production (Thomas 2008)
There are three general categories of the source for speech sounds:
periodic, noisy, and impulsive, although combinations of these sources are often
present. Examples of speech sounds generated with each of these source
categories are seen in the word “shop,” where the “sh”, “o”, and “p” are
generated from a noisy, periodic, and impulsive source respectively (Thomas
2008).
Figure A.1.3 shows a more realistic view of the anatomy of speech
production. We now look in detail at this anatomy, as well as at the associated
physiology and its importance in speech production.
Figure A.1.3 Cross-sectional view of the anatomy of speech production (Thomas 2008)
A.1.1.1 Lungs
One purpose of the lungs is the inhalation and exhalation of air. When
we inhale, we enlarge the chest cavity by expanding the rib cage surrounding the
lungs and by lowering the diaphragm that sits at the bottom of the lungs and
separates the lungs from the abdomen; this action lowers the air pressure in the
lungs, thus causing air to rush in through the vocal tract and down the trachea
into the lungs. The trachea, sometimes referred to as the “windpipe,” is a pipe about
12 cm long and 1.5–2 cm in diameter, which goes from the lungs to the
epiglottis. The epiglottis is a small mass, or “switch,” which, during swallowing
and eating, deflects food away from entering the trachea. When we eat, the
epiglottis falls, allowing food to pass through a tube called the esophagus and
into the stomach. When we exhale, we reduce the volume of the chest cavity by
contracting the muscles in the rib cage, thus increasing the lung air pressure. This
increase in pressure then causes air to flow through the trachea into the larynx. In
breathing, we rhythmically inhale to take in oxygen, and exhale to release carbon
dioxide.
During the process of speaking, on the other hand, we take in short
spurts of air and release them steadily by controlling the muscles around the rib
cage. We override our rhythmic breathing by making the duration of exhaling
roughly equal to the length of a sentence or phrase. During this timed exhalation,
the lung air pressure is maintained at approximately a constant level, slightly
above atmospheric pressure, by steady slow contraction of the rib cage, although
the air pressure varies around this level due to the time-varying properties of the
larynx and vocal tract (Thomas 2008).
A.1.1.2 Vocal Tract
The vocal tract is comprised of the oral cavity from the larynx to
the lips and the nasal passage that is coupled to the oral tract by way of the
velum. The oral tract takes on many different lengths and cross-sections by
moving the tongue, teeth, lips, and jaw. It has an average length of about 17 cm in a
typical adult male (shorter in females) and a spatially varying cross section
of up to 20 cm². If we were to listen to the pressure wave at the output of the
vocal folds during voicing, we would hear simply a time-varying buzz-like
sound, which is not very interesting. One purpose of the vocal tract is to
spectrally “color” the source, which is important for making perceptually distinct
speech sounds. A second purpose is to generate new sources for sound
production (Thomas 2008).
Spectral Shaping — under certain conditions, the relation between a
glottal airflow velocity input and vocal tract airflow velocity output can be
approximated by a linear filter with resonances, much like resonances of organ
pipes and wind instruments. The resonance frequencies of the vocal tract are, in a
speech science context, called formant frequencies or simply formants. The word
“formant” also refers to the entire spectral contribution of a resonance so we
often use the phrases “formant bandwidth” and “formant amplitude” (at the
formant frequency). Formants change with different vocal tract configurations.
With different vowels, for example, the jaw, teeth, lips, and tongue are generally
in different positions. Panel (a) of Figure A.1.4 shows the tongue hump high in
the front and in the back of the palate (upper wall of the mouth), each position
corresponding to different resonant cavities and thus different vowels (Thomas
2008).
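The organ-pipe analogy can be made concrete with a rough calculation. The sketch below treats the vocal tract as a uniform tube closed at the glottis and open at the lips, which is a simplifying assumption and not a model taken from the text; the speed of sound and the function name `tube_resonances` are likewise illustrative. Such a tube resonates at odd multiples of c / (4L), so a 17 cm tract gives resonances near 500 Hz, 1500 Hz, and 2500 Hz.

```python
# Sketch: resonances of a uniform tube closed at one end (glottis) and open
# at the other (lips). The uniform tube is an assumption for illustration;
# a real vocal tract has a cross section that varies along its length.

SPEED_OF_SOUND_CM_PER_S = 34300.0  # assumed value for air at about 20 degrees C


def tube_resonances(length_cm, count=3, speed=SPEED_OF_SOUND_CM_PER_S):
    """Return the first `count` resonance frequencies (Hz) of the tube.

    A quarter-wavelength resonator resonates at odd multiples of c / (4 L).
    """
    return [(2 * n - 1) * speed / (4.0 * length_cm) for n in range(1, count + 1)]


formants_17cm = tube_resonances(17.0)   # average adult male length from the text
formants_14cm = tube_resonances(14.0)   # a shorter tract, hypothetical length

print([round(f) for f in formants_17cm])
print([round(f) for f in formants_14cm])
```

A shorter tube shifts every resonance upward, which is consistent with shorter vocal tracts having higher formants. Moving the tongue or lips breaks the uniform-tube assumption and moves the resonances away from these values.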
Figure A.1.4 Illustration of changing vocal tract shapes for (a) vowels (having a periodic source), (b) plosives (having an impulsive source), and (c) fricatives (having a noise source) (Thomas 2008)
A.1.2 CATEGORIZATION OF SOUND BY SOURCE
There are various ways to categorize speech sounds. For example,
we can categorize speech sounds based on different sources to the vocal tract; we
have seen that different sources are due to the vocal fold state, but are also
formed at various constrictions in the oral tract. Speech sounds generated with a
periodic glottal source are termed voiced; likewise, sounds not so generated are
called unvoiced. There are a variety of unvoiced sounds, including those created
with a noise source at an oral tract constriction. Because the noise of such sounds
comes from the friction of the moving air against the constriction, these sounds
are sometimes referred to as fricatives.
Examples of some of these sound classes are shown in Figure A.1.5 in the
sentence, “Which tea party did Baker go to?” (Thomas 2008).
Figure A.1.5 Examples of voiced, fricative, and plosive sounds in the sentence, “Which tea party did Baker go to?” (a) Speech waveform; (b)–(d) magnified voiced, fricative, and plosive sounds from (a). (Thomas 2008)
An example of frication is in the sound “th” in the word “thin” where
turbulence is generated between the tongue and the upper teeth. The reader
should hold the “th” sound and feel the turbulence. A second unvoiced sound
class is plosives, created with an impulsive source within the oral tract (Figure
A.1.4b). An example of a plosive is the “t” in the word “top.” The location of the
closed or partial constriction corresponds to different plosive or fricative sounds,
respectively. We noted earlier that a barrier could also be made at the vocal folds
by partially closing the vocal folds, but without oscillation, as in the sound “h” in
“he.” These are whispered unvoiced speech sounds. These voiced and unvoiced
sound categories, however, do not relate exclusively to the source state because a
combination of these states can also be made whereby vocal fold vibration occurs
simultaneously with impulsive or noisy sources. For example, with “z” in the
word “zebra,” the vocal folds are vibrating and, at the same time, noise is created
at a vocal tract constriction behind the teeth against the palate. Such sounds are
referred to as voiced fricatives in contrast to unvoiced fricatives where the vocal
folds do not vibrate simultaneously with frication. There also exist voiced
plosives as counterparts to unvoiced plosives, as with the “b” in the word “boat”
(Thomas 2008).
A.1.3 PITCH AND FREQUENCY
A sound wave, like any other wave, is introduced into a medium by a
vibrating object. The vibrating object is the source of the disturbance, which
moves through the medium. The vibrating object that creates the disturbance
could be the vocal cords of a person, the vibrating string and soundboard of a
guitar or violin, the vibrating tines of a tuning fork, or the vibrating diaphragm of
a radio speaker. Regardless of which vibrating object creates the sound wave,
the particles of the medium through which the sound moves vibrate in a
back-and-forth motion at a given frequency. The frequency of a wave refers to how
often the particles of the medium vibrate when a wave passes through the
medium. The frequency of a wave is measured as the number of complete back-
and-forth vibrations of a particle of the medium per unit of time. If a particle of
air undergoes 1000 longitudinal vibrations in 2 seconds, then the frequency of the
wave would be 500 vibrations per second. A commonly used unit for frequency
is the Hertz (abbreviated Hz), where
1 Hertz = 1 vibration/second (Thomas 2008)
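The definition above amounts to a single division. The following minimal sketch reproduces the example from the text; the function name `frequency_hz` is illustrative only.

```python
def frequency_hz(vibrations, seconds):
    """Frequency in hertz: complete back-and-forth vibrations per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return vibrations / seconds


# The example from the text: 1000 longitudinal vibrations in 2 seconds.
example = frequency_hz(1000, 2)
print(example)  # 500.0
```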
As a sound wave moves through a medium, each particle of the
medium vibrates at the same frequency. This is sensible since each particle
vibrates due to the motion of its nearest neighbor. The first particle of the
medium begins vibrating, at say 500 Hz, and begins to set the second particle
into vibrational motion at the same frequency of 500 Hz. The second particle
begins vibrating at 500 Hz and thus sets the third particle of the medium into
vibrational motion at 500 Hz. The process continues throughout the medium;
each particle vibrates at the same frequency. And of course the frequency at
which each particle vibrates is the same as the frequency of the original source of
the sound wave. Consequently, a guitar string vibrating at 500 Hz will set the air
particles in the room vibrating at the same frequency of 500 Hz, which carries a
sound signal to the ear of a listener, where it is detected as a 500 Hz sound wave.
The back-and-forth vibrational motion of the particles of the medium
would not be the only observable phenomenon occurring at a given frequency.
Since a sound wave is a pressure wave, a detector could be used to detect
oscillations in pressure from a high pressure to a low pressure and back to a high
pressure. As the compressions (high pressure) and rarefactions (low pressure)
move through the medium, they would reach the detector at a given frequency.
For example, a compression would reach the detector 500 times per second if the
frequency of the wave were 500 Hz. Similarly, a rarefaction would reach the
detector 500 times per second if the frequency of the wave were 500 Hz. The
frequency of a sound wave not only refers to the number of back-and-forth
vibrations of the particles per unit of time, but also refers to the number of
compressions or rarefactions, which pass a given point per unit of time. A
detector could be used to detect the frequency of these pressure oscillations over
a given period of time. The typical output provided by such a detector is a
pressure-time plot, as shown in Figure A.1.6.
Figure A.1.6 Frequency of pressure oscillations (adapted from: Music in the digital world)
Since a pressure-time plot shows the fluctuations in pressure over time,
the period of the sound wave can be found by measuring the time between
successive high pressure points (corresponding to the compressions) or the time
between successive low pressure points (corresponding to the rarefactions). The
frequency is simply the reciprocal of the period. For this reason, a sound wave
with a high frequency would correspond to a pressure-time plot with a small
period, that is, a plot corresponding to a small amount of time between
successive high pressure points. Conversely, a sound wave with a low frequency
would correspond to a pressure-time plot with a large period, that is, a plot
corresponding to a large amount of time between successive high pressure points.
Figure A.1.7 shows two pressure-time plots, one corresponding to a high
frequency and the other to a low frequency.
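Reading the period off a pressure-time plot can be sketched in code. The example below is only an illustration under stated assumptions: the pressure variation is an idealized sinusoid, the 48 kHz sampling rate is arbitrary, and the helper names are hypothetical. It measures the average time between successive pressure peaks and takes the reciprocal to obtain the frequency.

```python
import math


def sample_pressure(freq_hz, sample_rate_hz, duration_s):
    """Sample an idealized sinusoidal pressure variation (arbitrary units)."""
    n = round(sample_rate_hz * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate_hz) for i in range(n)]


def estimate_period(samples, sample_rate_hz):
    """Average time in seconds between successive pressure peaks."""
    peaks = [i for i in range(1, len(samples) - 1)
             if samples[i - 1] < samples[i] >= samples[i + 1]]
    if len(peaks) < 2:
        raise ValueError("need at least two peaks to measure a period")
    return (peaks[-1] - peaks[0]) / (len(peaks) - 1) / sample_rate_hz


SAMPLE_RATE_HZ = 48000  # assumed sampling rate of the hypothetical detector

period_high = estimate_period(sample_pressure(500, SAMPLE_RATE_HZ, 0.02), SAMPLE_RATE_HZ)
period_low = estimate_period(sample_pressure(100, SAMPLE_RATE_HZ, 0.05), SAMPLE_RATE_HZ)

# Frequency is the reciprocal of the period.
print(1.0 / period_high, 1.0 / period_low)
```

The high-frequency wave yields the smaller period, as the text describes. Real recordings are noisy, so a practical detector would need something more robust than simple peak picking.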
Figure A.1.7 Pressure-time plots corresponding to high and low frequencies (adapted from: Music in the digital world)
The ears of a human (and other animals) are sensitive detectors
capable of detecting the fluctuations in air pressure, which impinge upon the
eardrum. The human ear is capable of detecting sound waves with a wide range
of frequencies, from approximately 20 Hz to 20000 Hz. Any sound
with a frequency below the audible range of hearing (i.e., less than 20 Hz) is
known as infrasound and any sound with a frequency above the audible range
of hearing (i.e., more than 20000 Hz) is known as ultrasound.
Humans are not alone in their ability to detect a wide range of
frequencies. Dogs can detect frequencies as low as approximately 50 Hz and as
high as 45000 Hz. Cats can detect frequencies as low as approximately 45 Hz
and as high as 85000 Hz. Bats, being nocturnal creatures, must rely on
echolocation for navigation and hunting. Bats can detect frequencies as high as
120000 Hz. Dolphins can detect frequencies as high as 200000 Hz. While dogs,
cats, bats, and dolphins have an unusual ability to detect ultrasound, an elephant
possesses the unusual ability to detect infrasound, having an audible range from
approximately 5 Hz to approximately 10000 Hz.
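The ranges quoted above can be collected in a small lookup table. The sketch below uses only the species for which the text gives both a lower and an upper limit; the dictionary and function names are illustrative.

```python
# Audible ranges in Hz as quoted in the text. Bats and dolphins are left out
# because only their upper limits are given.
HEARING_RANGE_HZ = {
    "dog": (50, 45000),
    "cat": (45, 85000),
    "elephant": (5, 10000),
}


def listeners(freq_hz, ranges=HEARING_RANGE_HZ):
    """Return the species (sorted by name) whose range includes freq_hz."""
    return sorted(name for name, (low, high) in ranges.items()
                  if low <= freq_hz <= high)


print(listeners(10))      # ['elephant']
print(listeners(1000))    # ['cat', 'dog', 'elephant']
print(listeners(60000))   # ['cat']
```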
The sensation of frequencies is commonly referred to as the pitch of a
sound. A high pitch sound corresponds to a high frequency sound wave and a
low pitch sound corresponds to a low frequency sound wave. Amazingly, many
people, especially those who have been musically trained, are capable of
detecting a difference in frequency between two separate sounds which is as little
as 2 Hz. When two sounds with a frequency difference of greater than 7 Hz are
played simultaneously, most people are capable of detecting the presence of a
complex wave pattern resulting from the interference and superposition of the
two sound waves. Certain sound waves, when played (and heard) simultaneously,
produce a particularly pleasant sensation and are said to
be consonant. Such sound waves form the basis of intervals in music. For
example, any two sounds whose frequencies make a 2:1 ratio are said to be
separated by an octave and result in a particularly pleasing sensation when heard.
That is, two sound waves sound good when played together if one sound has
twice the frequency of the other. Similarly, two sounds with a frequency ratio of
5:4 are said to be separated by an interval of a third; such sound waves also sound
good when played together. Examples of other sound wave intervals and their
respective frequency ratios are listed in Table A.1.1.
Table A.1.1 Sound wave intervals and their respective frequency ratios (adapted from: Music in the digital world)
Interval    Frequency Ratio    Examples
Octave      2:1                512 Hz and 256 Hz
Third       5:4                320 Hz and 256 Hz
Fourth      4:3                342 Hz and 256 Hz
Fifth       3:2                384 Hz and 256 Hz
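The example frequencies in Table A.1.1 follow from multiplying the lower frequency by the interval ratio. The sketch below recomputes them for a 256 Hz lower note; the names `INTERVAL_RATIOS` and `upper_note` are illustrative. Exact fractions are used so that the ratios are not rounded before the multiplication.

```python
from fractions import Fraction

# Interval ratios from Table A.1.1.
INTERVAL_RATIOS = {
    "octave": Fraction(2, 1),
    "third": Fraction(5, 4),
    "fourth": Fraction(4, 3),
    "fifth": Fraction(3, 2),
}


def upper_note(base_hz, interval):
    """Frequency (Hz) of the note lying the given interval above base_hz."""
    return float(base_hz * INTERVAL_RATIOS[interval])


for name in INTERVAL_RATIOS:
    print(name, round(upper_note(256, name), 1))
```

The octave, third, and fifth above 256 Hz come out as exactly 512 Hz, 320 Hz, and 384 Hz. The fourth is about 341.3 Hz, so the 342 Hz listed in the table is a rounded value.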
The ability of humans to perceive pitch is associated with the
frequency of the sound wave, which impinges upon the ear. Because sound
waves traveling through air are longitudinal waves, which produce high and low
pressure disturbances of the particles of the air at a given frequency, the ear has
an ability to detect such frequencies and associate them with the pitch of the
sound. But pitch is not the only property of a sound wave detectable by the
human ear (available from: Music in the digital world).
A.1.4 ELEMENTS OF A LANGUAGE
A fundamental distinctive unit of a language is the phoneme; the
phoneme is distinctive in the sense that it is a speech sound class that
differentiates words of a language. For example, the words “cat,” “bat,” and
“hat” consist of three speech sounds, the first of which gives each word its
distinctive meaning, being from different phoneme classes. Many sounds provide
this distinctive meaning, and such sounds represent a particular phoneme. To
emphasize the distinction between the concept of a phoneme and sounds that
convey a phoneme, the speech scientist uses the term phone to mean a particular
instantiation of a phoneme. This distinction is also seen in the different studies of
phonemics and phonetics. Different languages contain different phoneme sets.
Syllables contain one or more phonemes, while words are formed with one or
more syllables, concatenated to form phrases and sentences. Linguistics is the
study of the arrangement of speech sounds, i.e., phonemes and the larger speech
units built from phonemes, according to the rules of a language. Phonemes can
differ across languages, but certain properties of the grammatical rules
combining phonemes and larger units of a language may be common and
instinctual. There are various ways to study speech sounds that make up
phoneme classes; the use of the above first two descriptors in this study is
sometimes referred to as articulatory phonetics, while using the last two is
referred to as acoustic phonetics. One broad phoneme classification for English is
in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. Figure
A.1.8 shows this classification, along with various subgroups, where each
phoneme symbol is written within slashes according to both the International
Phonetic Alphabet and an orthographic (alphabetic spelling) representation. An
insightful history of the various phoneme symbol representations is described. In
the remainder of this text, we use the orthographic symbols.
Figure A.1.8 Phonemes in American English. Orthographic symbols are given in parentheses to the left of the International Phonetic Alphabet symbols. (adapted from: Thomas 2008)
Phonemes arise from a combination of vocal fold and vocal tract articulatory
features. Articulatory features, corresponding to the first two descriptors above,
include the vocal fold state, i.e., whether the vocal folds are vibrating or open;
the tongue position and height, i.e., whether it is in the front, central, or back
along the palate and whether its constriction is partial or complete; and the velum
state, i.e., whether a sound is nasal or not. It has been hypothesized that the first
step in the production of a phone is to conceive in the brain the set of articulatory
features that correspond to a phoneme. A particular set of speech muscles is
responsible for “activating” each feature with certain relative timing. It is these
features that we may store in our brain for the representation of a phoneme. In
English, the combinations of features are such as to give 40 phonemes, while in
other languages the features can yield a smaller set (Thomas 2008).
A.1.5 REPRESENTATION AND ANALYSIS OF SPEECH SIGNALS
Continuous speech is a set of complicated audio signals, which
makes producing them artificially difficult. Speech signals are usually considered
as voiced or unvoiced, but in some cases they are something between these two.
Voiced sounds consist of a fundamental frequency (F0) and its harmonic
components produced by the vocal cords (vocal folds). The vocal tract modifies this
excitation signal, causing formant (pole) or anti-formant (zero) frequencies. Each
formant frequency also has an amplitude and a bandwidth, and it may sometimes be
difficult to define some of these parameters correctly. The fundamental
frequency and formant frequencies are probably the most important concepts in
speech synthesis and also in speech processing in general.
With purely unvoiced sounds, there is no fundamental
frequency in the excitation signal and therefore no harmonic structure either, and the
excitation can be considered as white noise. The airflow is forced through a vocal
tract constriction, which can occur in several places between the glottis and the mouth.
Some sounds are produced with complete stoppage of airflow followed by a
sudden release, producing an impulsive turbulent excitation often followed by a
more protracted turbulent excitation. Unvoiced sounds are also usually
quieter and less steady than voiced ones. Whispering is a special case of speech.
When whispering a voiced sound there is no fundamental frequency in the
excitation and the first formant frequencies produced by the vocal tract are perceived
(Available from: Phonetics and Theory of Speech Production).
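The contrast between a harmonic excitation and a noise-like one can be illustrated numerically. The sketch below is not a method described in the text: it uses the zero-crossing rate, a common heuristic, and compares a synthetic 100 Hz tone with two harmonics against pseudo-random white noise. The sampling rate, signal length, harmonic amplitudes, and names are all assumptions chosen for illustration.

```python
import math
import random


def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(samples) - 1)


SAMPLE_RATE_HZ = 8000   # assumed sampling rate
NUM_SAMPLES = 800       # 0.1 s of signal

# Voiced-like excitation: a 100 Hz fundamental plus two weaker harmonics.
voiced = [math.sin(2 * math.pi * 100 * i / SAMPLE_RATE_HZ)
          + 0.5 * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE_HZ)
          + 0.25 * math.sin(2 * math.pi * 300 * i / SAMPLE_RATE_HZ)
          for i in range(NUM_SAMPLES)]

# Unvoiced-like excitation: Gaussian white noise from a seeded generator.
rng = random.Random(0)
unvoiced = [rng.gauss(0.0, 1.0) for _ in range(NUM_SAMPLES)]

print(zero_crossing_rate(voiced), zero_crossing_rate(unvoiced))
```

The periodic signal changes sign only about twice per 10 ms period, while white noise changes sign at roughly every second sample. Real speech falls between these extremes, which is consistent with the remark that some sounds lie between voiced and unvoiced.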
Speech signals of the three vowels (/a/, /i/, and /u/) are presented in
the time and frequency domains in Figure A.1.9. The fundamental frequency is about
100 Hz in all cases and the formant frequencies F1, F2, and F3 of vowel /a/ are
approximately 600 Hz, 1000 Hz, and 2500 Hz, respectively. With vowel /i/ the
first three formants are 200 Hz, 2300 Hz, and 3000 Hz, and with /u/ 300 Hz, 600
Hz, and 2300 Hz. The harmonic structure of the excitation is also easy to
perceive from the frequency-domain presentation.
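How formants shape the harmonic spectrum can be sketched with a simple filter model. The example below is an illustration under stated assumptions, not the analysis behind Figure A.1.9: each formant is modeled as a digital two-pole resonator, the three resonators are cascaded, and the gain is evaluated at each harmonic of a 100 Hz fundamental. The 8 kHz sampling rate and the 80 Hz bandwidth are assumed values, since the text gives no bandwidths.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000.0  # assumed sampling rate
BANDWIDTH_HZ = 80.0      # assumed formant bandwidth; not given in the text
F0_HZ = 100.0            # fundamental frequency from the text

# Formant frequencies (Hz) quoted in the text.
FORMANTS = {
    "a": (600.0, 1000.0, 2500.0),
    "i": (200.0, 2300.0, 3000.0),
    "u": (300.0, 600.0, 2300.0),
}


def resonator_gain(freq_hz, center_hz):
    """Magnitude response of a two-pole resonator at freq_hz."""
    r = math.exp(-math.pi * BANDWIDTH_HZ / SAMPLE_RATE_HZ)   # pole radius
    theta = 2 * math.pi * center_hz / SAMPLE_RATE_HZ         # pole angle
    z_inv = cmath.exp(-2j * math.pi * freq_hz / SAMPLE_RATE_HZ)
    denominator = 1 - 2 * r * math.cos(theta) * z_inv + r * r * z_inv * z_inv
    return 1.0 / abs(denominator)


def harmonic_envelope(vowel, num_harmonics=35):
    """Gain of the cascaded formant filter at each harmonic of F0."""
    envelope = {}
    for k in range(1, num_harmonics + 1):
        freq = k * F0_HZ
        gain = 1.0
        for center in FORMANTS[vowel]:
            gain *= resonator_gain(freq, center)
        envelope[freq] = gain
    return envelope


envelope_a = harmonic_envelope("a")
strongest_a = max(envelope_a, key=envelope_a.get)
print(strongest_a)
```

With these assumed settings, the harmonics that coincide with a formant frequency receive more gain than their neighbors, which is the pattern visible in a frequency-domain plot of a vowel.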
It can be seen that the first three formants lie within, or close to, the normal
telephone channel (from 300 Hz to 3400 Hz), so the bandwidth needed for
intelligible speech is not very wide. For higher quality, a bandwidth of up to 10 kHz
may be used, which leads to a 20 kHz sampling frequency. Although the fundamental
frequency is outside the telephone channel, the human hearing system is capable
of reconstructing it from its harmonic components (Available from: Phonetics and
Theory of Speech Production).
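The two numerical points in this paragraph can be checked with a short sketch. The sampling frequency follows from the sampling theorem, which requires at least twice the highest frequency to be represented. The second part is only a crude illustration and not a model of hearing: it lists the harmonics of a 100 Hz fundamental that survive the telephone band and shows that their common spacing still implies 100 Hz. All function names are hypothetical.

```python
from functools import reduce
from math import gcd

TELEPHONE_BAND_HZ = (300, 3400)  # channel limits from the text


def sampling_rate_for(bandwidth_hz):
    """Minimum sampling frequency (Hz) for a signal of the given bandwidth."""
    return 2 * bandwidth_hz


def harmonics_in_band(f0_hz, band=TELEPHONE_BAND_HZ):
    """Harmonics of f0_hz (integer Hz) that fall inside the band."""
    low, high = band
    return [k * f0_hz for k in range(1, high // f0_hz + 1)
            if low <= k * f0_hz <= high]


def implied_fundamental(harmonics_hz):
    """Greatest common divisor of the harmonic frequencies."""
    return reduce(gcd, harmonics_hz)


print(sampling_rate_for(10000))            # 20000
surviving = harmonics_in_band(100)
print(surviving[0], surviving[-1])         # 300 3400
print(implied_fundamental(surviving))      # 100
```

The 100 Hz component itself is removed by the channel, yet every surviving harmonic is a multiple of 100 Hz, so the spacing of the harmonics still carries the pitch.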
Figure A.1.9. The time- and frequency-domain presentation of vowels /a/, /i/, and /u/ (adapted from: Phonetics and Theory of Speech Production)
APPENDIX 2
DATASET
A.2 LIST OF PHONETICALLY BALANCED WORDS
Table A.2.1 Phonetically balanced words
List 1     List 2     List 3     List 4
are        awe        ache       bath
bad        bait       air        beast
bar        bean       bald       bee
bask       blush      barb       blonde
box        bought     bead       budge
cane       bounce     cape       bus
cleanse    bud        cast       bush
clove      charge     check      cloak
crash      cloud      class      course
creed      corpse     crave      court
death      dab        crime      dodge
deed       earl       deck       dupe
dike       else       dig        earn
dish       fate       dill       eel
end        five       drop       fin
feast      frog       fame       float
fern       gill       far        frown
folk       gloss      fig        hatch
ford       hire       flush      heed
fraud      hit        gnaw       hiss
fuss       hock       hurl       hot
grove      job        jam        how
heap       log        law        kite
hid        moose      leave      merge
hive       mute       lush       lush
hunt       nab        muck       neat
is         need       neck       new
mange      niece      nest       oils
no         nut        oak        or
nook       our        path       peck
not        perk       please     pert
pan        pick       pulse      pinch
pants      pit        rate       pod
pest       quart      rouse      race
pile       rap        shout      rack
plush      rib        sit        rave
rag        scythe     size       raw
rat        shoe       sob        rut
ride       sludge     sped       sage
rise       snuff      stag       scab
rub        start      take       shed
slip       suck       thrash     shin
smile      tan        toil       sketch
strife     tang       trip       slap
such       them       turf       sour
then       trash      vow        starve
there      vamp       wedge      strap
toe        vast       wharf      test
use        ways       who        tick
wheat      wish       why        touch