Queen Mary, University of London
Master’s Project
Real-Time Vowel Synthesis:
A Magnetic Resonator Piano Based Project
Author:
Vasileios Valavanis
Supervisor:
Andrew McPherson
A thesis submitted in fulfilment of the requirements
for the degree of Master of Science
in the
School of Electronic Engineering and Computer Science
Queen Mary, University of London
August 2014
“If a picture paints a thousand words, then a
naked picture paints a thousand words without
any vowels”
Josh Stern
Abstract
Speech synthesis has been an important field of research since the beginning of the
digital signal processing era. Vowels make words intelligible and, as Josh Stern so
eloquently put it, without them words are naked. This project explores the
development of a real-time vowel synthesis system based on a medium that
no conventional system uses: Dr McPherson's magnetic resonator piano, whose
strings are made to vibrate in such a way that they generate vowels. This
paper walks the reader through a thorough investigation of the properties of the
human voice, the spectral analysis of the magnetic resonator piano's structure and the
implementation of this vowel synthesis system, which includes a software synthesiser
developed by the author. Results, potential improvements and expansions are
discussed.
Acknowledgements
I would like to thank my project supervisor, Dr Andrew McPherson, for giving
me the opportunity to work on one of the most fascinating subjects in the field
of audio synthesis and computer science, for allowing me to use the magnetic
resonator piano and also for his consistent support and guidance throughout the
project and for putting up with my constant emailing and pestering.
Contents
List of Figures

1 Introduction
1.1 History
1.2 Background
1.3 Paper Structure

2 Literature Review
2.1 The Magnetic Resonator Piano (MRP)
2.1.1 MRP Signal Flow
2.2 The Human Voice
2.2.1 Anatomy of the Human Voice
2.2.2 Mechanics of the Human Voice
2.3 Speech Synthesis Models

3 Implementation
3.1 Methodology
3.2 Preparation
3.3 Programming
3.4 Spectral Analysis
3.5 Plugin Parameters Description
3.6 Results

4 Conclusion
4.1 Discussion
4.2 Evaluation
4.3 Improvements
4.4 Summary and Final Thoughts

Bibliography
List of Figures
2.1 MRP
2.2 Key Sensors
2.3 U.R.S. Diagram
2.4 Vocal Cords
2.5 Glottal Pulses
2.6 Glottal Source
2.7 Formant Filtering
2.8 Source-Filter Model
2.9 Articulators
2.10 Concatenative Synthesis
3.1 PRAAT
3.2 Impulse Train
3.3 Magnitude Response
3.4 Vowel A Spectrum
3.5 C4 Frequency Response
3.6 G3 Frequency Response
3.7 G3 Average Frequency Response
3.8 Plugin Layout
3.9 1st Result
3.10 2nd Result
3.11 3rd Result
3.12 4th Result
3.13 5th Result
3.14 6th Result
Chapter 1
Introduction
1.1 History
The artificial recreation of the human voice has been a subject of study since long
before the digital age. Papers from the late 1700s exploring the generation of vowels,
the synthesis of consonants and their implementation indicate just how early
scientists took an interest in the subject. [1] More than a century later, in 1930,
Bell Labs created the "vocoder", which stripped speech down into its fundamental
frequency and its harmonics, and within the next decade Homer Dudley developed
a keyboard-based voice synthesiser. [2]
With the evolution of electronics and digital signal processing came more advanced
systems. In 1961 Bell Labs once more created an electronic speech synthesis system,
this time using an IBM 704 computer, and in the early 1970s the TSI Speech+ was
developed by Handheld Electronics, a breakthrough in portable speech calculators
for the blind. [2] One of the most brilliant minds of the 21st century, Stephen Hawking,
also uses a Speech+ series system to communicate, his severe medical condition
having rendered him unable to speak.

Nowadays the majority of these technologies are computer-based, but there is still
significant demand for mechanical implementations in the market. [2]
1.2 Background
Digital imitation of speech, both as a concept and as an area of study, has the
potential to drive forward the advancement of a variety of industries. Some of
the industries experimenting with this technology have seen significant growth
since its successful development and have pushed its methodology towards
new boundaries. Medical science, education, music, gaming platforms and many
other fields have produced a substantial number of speech synthesis techniques, all
of which rely on an understanding of how the human voice works and the correct
use of the tools available.

The functionality of an electronic or digital audio synthesiser is quite simple: it
involves the generation of electric or digital signals representing waveforms
and their conversion to audible sound through speakers. A few of the most popular
sound synthesis methods are additive synthesis, subtractive synthesis and
wavetable synthesis. [3] All of these methods, including the ones not mentioned,
describe the generation of non-audible signals prior to their transmission; what
most waveform synthesis systems fail to elaborate on, however, is the actual
medium of sound reproduction. Modern synthesisers are in a way restricted to
transmitting their output through speakers. This project takes a different
angle and explores vowel synthesis transmitted via piano strings.

The quality of any artificial system depends on how closely it approximates the
system it is modelling. In the search for a perfect man-made speech processor,
the research on vowel generation through piano strings described in this paper
has resulted in a new speech synthesis method. The question raised by the author
is whether it is possible to develop a vowel synthesis system and transmit its
output through piano strings so that intelligible vowels are generated.
1.3 Paper Structure
The remainder of this paper is structured as follows:
• Chapter 2 contains a literature review covering the mechanics of the magnetic
resonator piano developed by Dr Andrew McPherson, the physics of the human
voice and existing speech synthesis methods.
• Chapter 3 presents the proposed method and its implementation (including
the design process) for the real-time vowel synthesis system, together with a
description of its parameters.
• Chapter 4 concludes the project with a discussion of the results in the context
of what is examined in the literature review, an evaluation and discussion of
future work, and finally a summary of the project along with the author's final
thoughts.
Chapter 2
Literature Review
2.1 The Magnetic Resonator Piano (MRP)
The magnetic resonator piano can be considered an acoustic instrument with
electronic prosthetics. As a project in its own right, the MRP takes the traditional
grand piano and extends its capabilities. Its success rests on the electromagnetic
actuators mounted above each piano string, which create an electromagnetic field
strong enough to force the strings to vibrate. This vibration allows the indefinite
sustain of any note while giving the performer control over amplitude, frequency
and timbre in real time. What is remarkable about the MRP is that it is perfectly
audible through its existing acoustic structure, without any amplification or use of
loudspeakers. All 88 keys of the piano are usable, and normal performance of the
piano remains intact despite the installed hardware. [4]
Figure 2.1: Magnetic Resonator Piano Block Diagram [4]
2.1.1 MRP Signal Flow
When in operation, the MRP workflow can be divided into three main processes.
1. The first task of the system is to receive an audio signal from a computer, as
seen in figure 2.1. The most recent version of the MRP, the one used
for this project, omits the sensing of string vibration by the pickup as
well as the feedback mechanism formed by the band-pass filters and the PLL.
The input feeds the audio amplifiers that drive each actuator directly. [5]
2. Triggering the actuators is the next stage of operation. It is essential to
mention here that the computer uses a Max/MSP patch to route outgoing
signals towards the string of the user's choice. The detection of which key
is active is performed by a continuous key-sensing mechanism: a modified Moog
Piano Bar, which uses optical interruption sensors above each piano key to
keep track of its motion. The actuators are triggered by a slow pressing
movement of any key and remain active as long as the hammer does not touch
the string and the key has not returned to its default position. [5]
3. The last process of the machine is the actuation itself. The electromagnets above
each string generate a field in which the piano strings vibrate. Driven by the
audio amplifiers, the actuators force a vibration that is in phase with the
actual audio input, a process that results in a better spectral presence of the
output. [5]
Figure 2.2: Key sensing [5]
To sum up and simplify: the magnetic resonator piano receives an audio signal
from a computer and transmits it as accurately as possible, keeping the string
vibrations in harmony. Limitations and non-linearities are of course inevitable
and will be discussed in later chapters.
2.2 The Human Voice
2.2.1 Anatomy of the Human Voice
Most of us are oblivious to how many parts of our body are required to complete
certain tasks, such as generating voice; the same is true of most of our bodily
functions. [6] An examination of the anatomy of the upper respiratory system
reveals the complexity of the process while clarifying notions essential to its
correct physical modelling.

Our body is such an efficient machine that it uses the same organs, muscles and
bones for a vast variety of different functions. This paper will focus on how these
body parts relate to the production of sound rather than on their general use.
Figures 2.3 and 2.4 show where these key body parts are located.
• The pharynx is part of the throat area located just behind the oral
and nasal cavities. It plays a part in producing sound and splits into two
muscular tubes. [8]
• The larynx is an organ that aids breathing and is essential to sound creation
because of its ability to manipulate pitch and volume. [9]
Figure 2.3: Upper Respiratory System Diagram [7]
• The trachea sends air into the lungs. It is open almost all the time and is
located right below the larynx. [9]
• The epiglottis is an elastic cartilage flap attached to the entrance of the
larynx. It covers the larynx and works as a valve when we eat or drink. [10]
• The oesophagus allows food to pass from the pharynx into the stomach. It
is only open when you swallow or vomit. [8]
• The vocal cords, or vocal folds, make phonation possible. They are twin
membranes of muscle and ligament covered in mucus, located where the
pharynx splits between the trachea and the oesophagus, stretching horizontally
across the larynx. They are no longer than 22 mm and are open during
inhalation, closed when you hold your breath, and vibrating when you speak
or sing. [10]

Figure 2.4: Vocal Cords [11]
• The glottis is the combination of the vocal folds and the space in between
the folds. [12]
Together, the aforementioned body parts form the most sophisticated musical
instrument in existence. To the disappointment of some readers, the vocal cords
are evidently not a set of strings vibrating like those of a guitar or a piano to
produce sound; their function is to allow, block or partially block air travelling
from the lungs through the trachea. The anatomy of the vocal system reveals the
key activities of certain organs, which makes voice processing research more accurate.
2.2.2 Mechanics of the Human Voice
Viewing the human vocal system as a musical instrument has helped the author
examine the mechanics behind voice production in a technical way. This section
looks into how the generation of vowels depends on the interplay between key
body parts.

The human respiratory system works like a string instrument and a wind instrument
simultaneously. [8] This complex apparatus can be broken down into three major
subsystems that together explain its function. The major active processes responsible
for the generation of vowels are respiration, phonation and resonation.
1. Respiration
The first component of the voice "instrument" is the lungs. We can consider
the lungs the source of the kinetic energy responsible for sound generation,
since air is the medium in which sound propagates. When we inhale, air is
stored temporarily in the lungs in order to oxygenate the blood. To initiate
speech, air from the lungs is forced through the trachea and the other vocal
mechanisms before it exits the body. While speaking, breathing becomes faster
and inhalation occurs mostly through the mouth. [8]

One control parameter that this component provides is the volume of the
produced sound. The force with which we contract our lungs while we speak
controls the pressure within them: when the pressure in the lungs is either
higher or lower than atmospheric pressure, air starts flowing. To exhale, we
simply use certain muscles to decrease the lungs' capacity, thus increasing the
air pressure, which results in expiration. The higher the velocity of the airflow
through the vocal tract, the greater the amplitude of the sound we produce. [9]
2. Phonation
The next segment of the vocal system concerns the actual generation of sound.
Phonation is the process of converting air into audible sound waves. As air
travels through the trachea it inevitably meets the larynx and the vocal cords
located at its base. The vocal cords work as a gate regulating the airflow to
and from the oral and nasal cavities. The ability of this gate to remain
partially open while interrupting the air coming from the lungs is what makes
it function as a vibrator. [8]
Certain aerodynamic and myoelastic phenomena drive the vibration process of
the vocal cords. Under the pressure of the pulmonary airflow the vocal cords
separate, and due to a combination of factors, including elasticity, laryngeal
muscle tension and the Bernoulli effect, the vocal folds close again rapidly. [13]
As the top of the folds is opening, the bottom is in the process of closing, and
as soon as the top is closed, the pressure build-up below the glottis begins to
open the bottom. If the process is maintained by a steady supply of pressurised
air, the vocal cords continue to open and close in a quasi-periodic fashion.
Each vibration allows a small amount of air to escape, producing an audible
sound at the frequency of this movement; this process generates voice. [13]
Figure 2.5: Pulses created by the vocal cord vibrations [14]
Figure 2.6: Glottal source spectrum [15]
The frequency of this vibration sets the fundamental frequency of the glottal
source and determines the perceived pitch of the voice. [16] The resulting
waveform of this process is a periodic pulsating signal with high energy at the
fundamental frequency and its harmonics, and with amplitude gradually decreasing
across the spectrum, as shown in figure 2.6. [6]
Like all string instruments, the pitch of the sound generated by the folds
depends on their mass, length and tension:

f_n = \frac{n v}{2L}, \qquad v = \sqrt{\frac{T}{\mu}} \qquad (2.1)

(T is tension, L is string length and \mu is mass per unit length.)
However, being organic and airflow-dependent, the vocal folds distinguish
themselves in terms of functionality. The length of the vocal cords varies
between 17 and 22 mm for an adult male and between 12 and 17 mm for an adult
female. These numbers mathematically explain why women most commonly have
higher-pitched voices than men. The fundamental frequency range of the glottal
source is commonly between 50 and 500 Hz for all human beings. [17]
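Equation 2.1 can be checked numerically. The sketch below is an illustrative C++ helper (not part of the thesis plugin; the function name and parameters are the author's of this edit, chosen for clarity) that evaluates the mode frequencies of an ideal string:

```cpp
#include <cmath>

// Frequency of the n-th mode of an ideal string, following equation 2.1:
// f_n = n * v / (2 * L), with transverse wave speed v = sqrt(T / mu).
double stringModeFrequency(int n, double tension, double length, double massPerUnitLength)
{
    double v = std::sqrt(tension / massPerUnitLength); // wave speed on the string
    return n * v / (2.0 * length);
}
```

Shortening the string (or, analogously, having shorter vocal folds) raises every mode frequency, which matches the observation above about male and female voices.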
Controlling the pitch of our voice is one of the most important tools we have
for communicating with one another. While the pressure of the pulmonary airflow
is one method of controlling voice pitch, the primary mechanism resides in the
larynx. The muscles within the larynx give us control over the elasticity and
tension of the vocal folds; by manipulating these characteristics we effectively
adjust the fundamental frequency of the glottal source being generated. [16]
3. Resonation
In the final stage of the voice generation process, the glottal source is shaped
into intelligible vowels and consonants that make up words. The first step of
this transformation occurs in the cavity of the pharynx, which communicates
with the nasal cavity, the oral cavity and the larynx. The pulses of air that
escape the vocal cords are diffused into all of these cavities, which together
play the role of the resonator of the whole vocal system. The moving parts of
this resonator give us the ability to shape the transmitted waveforms: the lips,
the tongue, the velum and essentially all of our facial muscles give us dynamic
control over the real-time filtering of the glottal source. The resonator's task
is to attenuate some frequencies of the band-limited pulse produced by the vocal
folds while amplifying others. Although without the vocal folds we could have no
voice whatsoever, it is this section of the mechanism that makes our voice
versatile, interesting and identifiable. The nature of human DNA ensures that
each person has different dimensions of the aforementioned cavities and muscles;
therefore the filtering of the glottal source that occurs in the resonation
stage is as unique as one's fingerprints. [18]
At this point it is important to mention that speech is made up of more than
one type of sound. The previous section examined the generation of voiced
sound, but speech contains unvoiced and plosive sounds as well. Unvoiced
sounds result when air is forced through certain blockades in the oral cavity,
whereas plosive sounds are sudden bursts of air coming from either the abrupt
movement of the vocal tract or the mouth. [17]
To describe the resonation process in terms of digital signal processing, the
notion of formants must be introduced. Formants have more than one definition
across research areas; the most common definition, and the one used by the
author, describes formants as the spectral peaks of a given sound spectrum.
As the resonator amplifies and attenuates frequencies of the source, formants
appear in its spectral envelope, and they are unique to each individual. [19]
Figure 2.7: Graphic representation of formant creation [20]
Two to three formants are enough to represent a person's voice, despite the fact
that our resonator produces more. [8] As shown at the bottom of figure 2.7,
formants look like the result of band-pass filters applied to the source. This
parallel is not far from the truth, as formants are characterised by the same
features as band-pass filters: centre frequency, gain and bandwidth.
2.3 Speech Synthesis Models
Traditionally, when we talk about speech synthesisers we refer to TTS
(text-to-speech) systems, due to their intuitive input method and vast
popularity. Such systems can be broken down into multiple components that
constitute their functionality. One of the most important modules of a TTS
system is the waveform generator. Based on the model used for sound generation,
speech synthesis systems can be classified into three types: formant synthesis,
articulatory synthesis and concatenative synthesis. These systems can also be
classified into two sub-categories according to the extent of human intervention
in the creation and execution process: synthesis by rule uses a collection of
supervised rules to perform synthesis, while data-driven synthesis derives its
parameters from actual speech data. [21]
1. Formant Synthesis
Formant synthesis uses a source-filter model to generate intelligible sounds.
The source-filter model can be characterised as a simplified version of the
real-life voice generation process: a quasi-periodic pulsating signal (the
glottal source) is generated and then filtered by multiple variable band-pass
filters with the appropriate formant parameters in order to produce intelligible
vowels. [21]
Figure 2.8: Source-filter model [20]
Modelling multiple formant resonances in the digital domain requires the
implementation of 2nd-order IIR (infinite impulse response) filters. Equation
2.2 shows the transfer function of such a filter. [21] The derivation of a
filter's frequency response from its transfer function is examined in chapter 3.

H_i(z) = \frac{1}{1 - 2 e^{-\pi b_i} \cos(2\pi f_i) z^{-1} + e^{-2\pi b_i} z^{-2}} \qquad (2.2)

(2nd-order IIR all-pole filter transfer function with f_i = F_i/F_s and
b_i = B_i/F_s, where F_i, B_i and F_s are the formant's centre frequency,
bandwidth and the sampling frequency, respectively.)
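As a sketch of how equation 2.2 translates into code (an illustrative helper, not the thesis plugin's actual implementation), the function below derives the denominator coefficients of the all-pole resonator from a formant's centre frequency and bandwidth. The pole radius e^{-\pi b_i} is strictly below 1 for any positive bandwidth, which is why such a resonator is stable:

```cpp
#include <cmath>

struct ResonatorCoeffs { double a1, a2; }; // denominator: 1 + a1*z^-1 + a2*z^-2

// Denominator coefficients of the 2nd-order all-pole resonator of equation 2.2:
// a1 = -2 * e^(-pi*b) * cos(2*pi*f), a2 = e^(-2*pi*b),
// where f and b are the centre frequency and bandwidth normalised by the sample rate.
ResonatorCoeffs formantResonator(double centreHz, double bandwidthHz, double sampleRateHz)
{
    const double pi = 3.14159265358979323846;
    double f = centreHz / sampleRateHz;    // normalised centre frequency f_i
    double b = bandwidthHz / sampleRateHz; // normalised bandwidth b_i
    double r = std::exp(-pi * b);          // pole radius, < 1 for b > 0
    return { -2.0 * r * std::cos(2.0 * pi * f), r * r };
}
```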
The choice of IIR filters is not random. IIR filters are far more computationally
efficient than FIR filters: requiring only a few filter coefficients (a_i, b_i)
makes them faster and less memory-hungry. On the other hand, FIR filters are
always stable, whereas IIR filters can have poles outside the unit circle, which
renders them unstable. [21]
There are two ways to combine a number of IIR filters: in cascade or in
parallel. The parallel method is considerably more complex and is mainly used
for the production of fricative sounds. The cascade method is ideal for vowel
sounds and is much easier to implement. One other important difference between
the two is that the cascade technique results in an all-pole filter, whereas
the parallel method results in a filter that has zeros in addition to poles.
Poles and zeros disclose the frequency response characteristics of a filter and
are often used as the basis for digital filter design. [21]
H_1(z) = \sum_{k=0}^{M} b_k z^{-k} \qquad (2.3)

H_2(z) = \frac{1}{1 - \sum_{k=1}^{N} a_k z^{-k}} \qquad (2.4)

(2.3 is the transfer function of an all-zero filter and 2.4 of an all-pole filter.)
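Transfer functions 2.3 and 2.4 correspond to simple difference equations. The hypothetical helpers below (names and signatures are this edit's, for illustration only) show the key structural difference: the all-zero filter computes each output from past inputs only, while the all-pole filter of 2.4 feeds back past outputs:

```cpp
#include <cstddef>
#include <vector>

// One output sample of the all-zero (FIR) filter of equation 2.3:
// y[n] = sum_{k=0}^{M} b[k] * x[n-k]; here x[k] holds the delayed input x[n-k].
double allZeroStep(const std::vector<double>& b, const std::vector<double>& x)
{
    double y = 0.0;
    for (std::size_t k = 0; k < b.size(); ++k)
        y += b[k] * x[k];
    return y;
}

// One output sample of the all-pole (IIR) filter of equation 2.4:
// y[n] = x[n] + sum_{k=1}^{N} a_k * y[n-k]; yPast[k-1] holds the delayed output y[n-k].
double allPoleStep(double x, const std::vector<double>& a, const std::vector<double>& yPast)
{
    double y = x;
    for (std::size_t k = 0; k < a.size(); ++k)
        y += a[k] * yPast[k]; // feedback from previous outputs
    return y;
}
```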
In reality, voice signals are not stationary. Formant synthesis by rule takes
into account the physical limitations of the vocal tract, so that the change
between formants does not occur abruptly, while giving the user the ability
to manipulate pitch and formant sweeps in real time; however, it leaves out
important reflections and nuances that make the output sound realistic. [21]
2. Articulatory Synthesis
Articulatory synthesis is close to formant synthesis in that it also synthesises
voice by rule. It models the motion of our articulators and the resulting
distribution of waveforms in the lungs, the oral and nasal cavities and the
larynx. The model drives a formant synthesiser and uses five articulatory
parameters: the area of the lip opening, the constriction formed by the tongue
blade, the opening of the nasal cavities, the average glottal area, and the rate
of active expansion/contraction of the vocal tract tube behind a blockade. [21]
Figure 2.9: A list of the human articulators [22]
The nature of the human speech articulators does not allow them to perform
large movements; conversely, because they are so restricted, they are easier
to model. The five articulatory parameters are interlaced with the fundamental
frequency and the first four formant frequencies. Though this model may be the
most promising in terms of speech quality, the methods for collecting the
aforementioned area parameters are not very advanced, making articulatory
synthesis the least accurate of the three speech synthesis models. [21]
3. Concatenative Synthesis
Concatenative synthesis attempts to imitate speech in such a way that it
captures all of the small details and secondary reflections, in order to sound
as realistic as possible. The principle behind this model involves concatenating
several speech excerpts from recordings so that a natural sequence of speech is
formed. Unlike synthesis by rule, this data-driven synthesis model does not
require any manual adjustment. In addition, the selected speech segments are
real, so the output is expected to be of high quality. [21]
In reality, the concatenated segments often differ in spectral and prosodic
continuity. If the formants of one segment are not exactly the same as those of
its neighbour, or the perceived pitch differs from one clip to another, then
discontinuities occur at the point of concatenation. The speech excerpts may be
perfectly normal; it is their sequence that sounds unnatural. Under ideal
conditions the concatenative synthesis model produces the most natural output,
but its design has to address many issues to avoid discontinuities. The more
issues solved during the design process, the better the outcome and, naturally,
the more complicated the system. Furthermore, high-quality data-driven synthesis
models require large amounts of stored data, are very computationally expensive
and consume far more memory than rule-based models. [21]
Figure 2.10: A simple concatenative synthesis diagram [23]
Chapter 3
Implementation
The task of the project described in this paper is to build a real-time,
computer-based system that generates vowels in such a way that Dr McPherson's
magnetic resonator piano transmits them intelligibly. The author has created a
user-friendly digital audio synthesiser that serves as the input, in the form
of an AU plugin. Moreover, the spectral behaviour of the piano has been analysed
and incorporated into the system so that clearer results are achieved. The
scientific process behind this attempt is examined in the remainder of this chapter.
3.1 Methodology
The method proposed for the successful production of this system is the creation
of a vowel synthesis AU plugin based on the formant-synthesis-by-rule model
described in chapter 2. Two criteria were taken into consideration before making
this decision: the computational requirements of the plugin, and the ability to
modify certain parameters in real time. The formant synthesis model offers low
computational cost and full real-time adjustment capabilities, as well as decent
output quality.

The implementation of the real-time vowel synthesis system consists of two
primary phases: programming, and spectral analysis of the piano. The JUCE
framework was used for the programming stage, which was carried out in C++
using Xcode 5. For the spectral analysis of the piano, two DPA 2006A microphones
and a TASCAM US-122MKII audio interface were used to record the samples, and
MATLAB was used to analyse them.
3.2 Preparation
Correct programming of the plugin required some pre-calculated data.
Specifically, the formant centre frequency, bandwidth and gain values needed
for the model's filtering were captured using PRAAT. PRAAT is an open-source
speech feature extraction program that can detect voice formants from a single
recording. [24] The first four formants of each vowel were extracted from
recordings of the author's voice.

As shown in figure 3.1, the red dots represent the formants of the vowels. The
Y axis of the spectrogram is frequency in Hz and the X axis is time in seconds.
PRAAT clearly provides comprehensive formant data across time in 5 ms segments.
Figure 3.1: PRAAT spectrogram showing the first 4 formants of 3 vowels [24]
3.3 Programming
The clearest way to describe the creation of a source-filter audio plugin is to
divide it into its two main processes: the generation of the source, and its
filtering.
1. Source
In chapter 2 we thoroughly investigated the nature of the glottal source and
concluded that it is a periodic pulsating signal, rich in harmonics, with
gradually decreasing amplitude from the fundamental frequency across the
spectrum. Its digital representation is a band-limited pulse, which is
essentially a series of harmonics, or sinusoids:

x(t) = \sum_{k=1}^{N} \sin(2\pi k f_0 t) \qquad (3.1)

with the number of harmonics given by

N = \frac{f_s}{2 f_0} \qquad (3.2)

so that aliasing is avoided.
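Equations 3.1 and 3.2 can be rendered directly, as in the illustrative sketch below (a naive sample-by-sample rendering for clarity, not the plugin's wavetable implementation; the function name is hypothetical):

```cpp
#include <cmath>
#include <vector>

// Renders a band-limited impulse train per equation 3.1: the sum of all
// harmonics of f0 up to Nyquist, with N = fs / (2 * f0) as in equation 3.2.
std::vector<double> bandLimitedPulse(double f0, double fs, int numSamples)
{
    const double pi = 3.14159265358979323846;
    int N = static_cast<int>(fs / (2.0 * f0)); // highest harmonic below Nyquist
    std::vector<double> out(numSamples, 0.0);
    for (int n = 0; n < numSamples; ++n)
        for (int k = 1; k <= N; ++k)
            out[n] += std::sin(2.0 * pi * k * f0 * n / fs); // accumulate harmonic k
    return out;
}
```

Capping the harmonic count at f_s / (2 f_0) is what keeps every partial below the Nyquist frequency, avoiding aliasing.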
A C++ oscillator class taken from Will Pirkle’s book ”Designing Audio
Plug-ins in C++” was used for the generation of multiple sine waves. This
clever approach uses a 1024 sample buffer to store sine wave values which
are constantly updated. To avoid any fractional data point values within the
buffer, a linear interpolation using weighted sums function is being used[25]
y =
((x− x1)(x2 − x1)
)y2 +
(1−
((x− x1)(x2 − x1)
))y1 (3.3)
for any y between data points (x1, y1) and (x2, y2).
This procedure ensures phase coherence across the large number of oscillator instances created simultaneously.
In practice, 34 instances of the oscillator class were created, the first representing the fundamental frequency and the remaining 33 its harmonics, creating a band-limited pulse of steady magnitude across its spectrum, as shown in figure 3.2.
2. Filtering
Figure 3.2: Impulse train in time and frequency domain
The filtering stage has been designed in a way that extends the traditional method. The obtained formant filter values for 5 vowels (phonetically [a], [e], [i], [o], [ou]) have been inserted into 5 lookup tables. Each lookup table contains centre frequency (in Hz), gain (in dB) and Q (bandwidth) values for 4 filters. To adequately describe the performance of the plugin, this section is divided into the two main functions of the code.
• Calculation of coefficients: The aforementioned formant values are used to derive the necessary coefficients for the 20 filters in total. Given:

G = 10^{g/20}, \qquad F = 2\pi \frac{f_c}{f_s}

H = \frac{F}{2Q} \ \text{for } g \ge 1, \qquad H = \frac{F}{2gQ} \ \text{for } g < 1

the filter coefficients a_i, b_i are given by:

a_2 = -\frac{1}{2} \cdot \frac{1 - H}{1 + H}
a_1 = (0.5 - a_2)\cos(F)
b_0 = (G - 1)(0.25 + 0.5 a_2) + 0.5
b_1 = -a_1
b_2 = -(G - 1)(0.25 + 0.5 a_2) - a_2

(f_c is the centre frequency, g the gain and Q the bandwidth of each filter.)
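A direct transcription of the coefficient formulas above might look like this. The struct and function names are illustrative, and the boost/cut condition is interpreted on the linear gain G, since a negative-dB g would otherwise give a negative bandwidth; that reading is an assumption.

```cpp
#include <cassert>
#include <cmath>

struct Coeffs { double a1, a2, b0, b1, b2; };

// Transcription of the formant-filter coefficient formulas above.
// fc: centre frequency (Hz), gDb: gain (dB), Q: bandwidth, fs: sample rate.
Coeffs formantCoeffs(double fc, double gDb, double Q, double fs) {
    const double pi = std::acos(-1.0);
    double G = std::pow(10.0, gDb / 20.0);       // linear gain
    double F = 2.0 * pi * fc / fs;               // centre frequency, radians
    // Bandwidth parameter: the cut case widens with 1/G, as in the text.
    // Reading the boost/cut condition on the linear gain is an assumption.
    double H = (G >= 1.0) ? F / (2.0 * Q) : F / (2.0 * G * Q);
    Coeffs c;
    c.a2 = -0.5 * (1.0 - H) / (1.0 + H);
    c.a1 = (0.5 - c.a2) * std::cos(F);
    c.b0 = (G - 1.0) * (0.25 + 0.5 * c.a2) + 0.5;
    c.b1 = -c.a1;
    c.b2 = -(G - 1.0) * (0.25 + 0.5 * c.a2) - c.a2;
    return c;
}
```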
• Calculation of frequency response: Normally, a source/filter model would use the coefficients to design filters that shape the generated band-limited pulse; however, this plugin incorporates something beyond just filtering. In a separate function, each filter's frequency response is derived from its transfer function. From the 2nd-order filter transfer function:

H(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{a_0 + a_1 z^{-1} + a_2 z^{-2}}

we calculate the frequency response by substituting z = e^{j\omega}:

H(\omega) = \frac{b_0 + b_1\cos\omega - j b_1\sin\omega + b_2\cos 2\omega - j b_2\sin 2\omega}{a_0 + a_1\cos\omega - j a_1\sin\omega + a_2\cos 2\omega - j a_2\sin 2\omega}

H(\omega) = \frac{[b_0 + b_1\cos\omega + b_2\cos 2\omega] + j[-b_1\sin\omega - b_2\sin 2\omega]}{[a_0 + a_1\cos\omega + a_2\cos 2\omega] + j[-a_1\sin\omega - a_2\sin 2\omega]}
For:

A = b_0 + b_1\cos\omega + b_2\cos 2\omega
B = -b_1\sin\omega - b_2\sin 2\omega
C = a_0 + a_1\cos\omega + a_2\cos 2\omega
D = -a_1\sin\omega - a_2\sin 2\omega

multiplying numerator and denominator by the complex conjugate C - jD gives:

H(\omega) = \frac{A + jB}{C + jD}\cdot\frac{C - jD}{C - jD}

H(\omega) = \frac{[AC + BD] + j[BC - AD]}{C^2 + D^2}

and the magnitude response of the filter is:

|H(\omega)| = \frac{1}{C^2 + D^2}\sqrt{(AC + BD)^2 + (BC - AD)^2} \qquad (3.4)
Note here that \omega is frequency in radians (0 to 2\pi) and can be written as \omega = 2\pi f_i / f_s.
The magnitude response of a filter encloses the gain information for each frequency within its bandwidth.
Figure 3.3: Magnitude response of the cascaded filter for vowel A
For any generated sine wave with frequency f_i which is part of the source waveform, the above function calculates its amplitude according to the 4 formant filters of each vowel. Essentially, the plugin generates 34 pre-filtered signals, giving far more flexibility in terms of transmission. The resulting synthesis method is additive synthesis, because the source is constructed rather than carved into shape [3].
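The per-harmonic amplitude lookup described above can be sketched as a direct evaluation of equation 3.4; the function name is illustrative.

```cpp
#include <cassert>
#include <cmath>

// Evaluate |H(w)| of one 2nd-order section from its coefficients,
// using the A, B, C, D grouping of Eq. 3.4. w is in radians (0 to 2*pi).
double magnitudeAt(double b0, double b1, double b2,
                   double a0, double a1, double a2, double w) {
    double A = b0 + b1 * std::cos(w) + b2 * std::cos(2.0 * w);
    double B = -b1 * std::sin(w) - b2 * std::sin(2.0 * w);
    double C = a0 + a1 * std::cos(w) + a2 * std::cos(2.0 * w);
    double D = -a1 * std::sin(w) - a2 * std::sin(2.0 * w);
    return std::sqrt((A * C + B * D) * (A * C + B * D) +
                     (B * C - A * D) * (B * C - A * D)) / (C * C + D * D);
}
```

For a harmonic at f_i, w = 2π f_i / f_s, and multiplying the magnitudes of the 4 formant sections gives the gain that harmonic is generated with.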
Figure 3.4 demonstrates how the frequency response function has shaped the band-limited pulse according to the 4 resonances of the vowel A.
Figure 3.4: The vowel A transmitted from the plugin
It is clearly shown how different frequencies are generated with different magnitudes.
The 4 filters are combined in a cascaded array at the end of the C++
function. As mentioned in chapter 2, the cascaded method is ideal for
vowel synthesis and very easy to implement.
Expressing the filter transfer function in factorised form:

H(z) = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 - \sum_{k=1}^{N} a_k z^{-k}} = G\,\frac{(1 - z_1 z^{-1})(1 - z_1^* z^{-1})(1 - z_2 z^{-1})(1 - z_2^* z^{-1})\cdots}{(1 - p_1 z^{-1})(1 - p_1^* z^{-1})(1 - p_2 z^{-1})(1 - p_2^* z^{-1})\cdots} \qquad (3.5)
where G is the cascaded filter gain and the poles p_i and zeros z_i are either complex conjugate pairs or real-valued. By grouping the factorised equation in terms of the complex conjugate and real-valued pairs we get:

H(z) = G\left(\frac{(1 - z_1 z^{-1})(1 - z_1^* z^{-1})}{(1 - p_1 z^{-1})(1 - p_1^* z^{-1})}\right)\left(\frac{(1 - z_2 z^{-1})(1 - z_2^* z^{-1})}{(1 - p_2 z^{-1})(1 - p_2^* z^{-1})}\right)\cdots \qquad (3.6)
Equation 3.6 represents the cascaded expression of 2nd-order IIR filters, each enclosed in brackets, which can also be written as:

H(z) = G\prod_{k=1}^{K} H_k(z) \qquad (3.7)

Finally, a low pass filter with cutoff frequency at 2.5 kHz has been added to the cascaded array so that unwanted harmonics are attenuated.
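As a sketch, equation 3.7 amounts to running the sections in series and applying the overall gain G. The names are illustrative, the feedback signs follow the transfer function convention as written above, and the plugin itself applies the filters as pre-computed per-harmonic gains rather than running IIR sections at audio rate.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One 2nd-order IIR section (a0 normalised to 1), Direct Form I.
struct Section {
    double b0, b1, b2, a1, a2;
    double x1 = 0, x2 = 0, y1 = 0, y2 = 0;   // delayed samples
    double process(double x) {
        double y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;                     // shift input history
        y2 = y1; y1 = y;                     // shift output history
        return y;
    }
};

// Eq. 3.7: H(z) = G * product of H_k(z), i.e. the sections run in series.
double cascade(std::vector<Section>& sections, double G, double x) {
    for (auto& s : sections) x = s.process(x);
    return G * x;
}
```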
3.4 Spectral Analysis
The magnetic resonator piano, like most structures, has specific acoustic properties. When sound waves travel through its body, their spectrum is inevitably shaped, in a similar way to how the vocal tract shapes the glottal source. In attempting to transmit the output of the plugin through the MRP, the piano's formants have to be taken into consideration. The accurate reproduction of the vowels generated by the AU plugin requires a medium with a flat frequency response. Given that the piano certainly does not have a flat frequency response, a compensation method was necessary.
The strings used for this project were chosen so that their resonant frequencies cover the vocal range (50–500 Hz). C2, E2, G2, D3, E3, G3, C4, E4 and G4 were tested and their output was recorded for analysis. Their resonant frequencies are:
• C2 - 65Hz
• E2 - 82Hz
• G2 - 97Hz
• D3 - 146Hz
• E3 - 164Hz
• G3 - 197Hz
• C4 - 260Hz
• E4 - 327Hz
• G4 - 389Hz
The idea is to feed the strings with all the vowels, a version of all the vowels filtered at 1 kHz, and white noise, all at different dB levels; to record all the outputs; and to derive an average frequency response for each string and consequently for the whole body of the piano. Knowing how each string responds is very important for the implementation of an inverse filter, which will flatten that response and make vowel transmission more intelligible.
Figure 3.5: The frequency response of C4 under all test conditions for vowel A
Figure 3.6: The frequency response of G3 for all vowels
The response of a system h(n), knowing its input x(n) and output y(n), is a simple comparison of y(n) with respect to x(n), and in theory it should be the same no matter what the input is. Figures 3.5 and 3.6 show the responses of strings C4 and G3 under a variety of test conditions. The responses of each string for all vowels under every test condition were gathered in a large number of arrays. Averaging these arrays reveals a clear view of how each string behaves for a number of harmonic inputs, and taking the inverse of the average response yields a quantifiable pattern for the creation of the inverse filter.
Figure 3.7: Average frequency response and its inverse for string G3
The amplitude differences between the red line and the blue line shown in figure 3.7, at the frequencies corresponding to the peaks of the red line, are the gain and frequency values to be imported into the inverse filter.
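The averaging-and-inversion step can be sketched as follows, assuming each measurement has been reduced to a magnitude response sampled on a common frequency grid; names are illustrative, not the project's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Average several measured magnitude responses for one string (one per
// test condition, all on the same frequency grid) and invert the result
// to obtain per-frequency gains for the inverse filter.
std::vector<double> inverseResponse(const std::vector<std::vector<double>>& runs) {
    size_t bins = runs.front().size();
    std::vector<double> inv(bins, 0.0);
    for (const auto& r : runs)                     // accumulate all runs
        for (size_t i = 0; i < bins; ++i) inv[i] += r[i];
    for (size_t i = 0; i < bins; ++i)
        inv[i] = runs.size() / inv[i];             // reciprocal of the average
    return inv;
}
```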
3.5 Plugin Parameters Description
The vowel generator AU plugin is designed so that it offers a variety of adjustable parameters to the user. This section includes a description of its architecture and an examination of its features.
Figure 3.8: A picture of the plugin in operation
Figure 3.8 shows the layout of the plugin. Its parameters from top to bottom are:
• A master ON/OFF button enables and disables the output. The button turns red to indicate when it is in OFF mode.
• A volume knob ranging from 0 to 10 dB with a step of 0.1.
• The wave type drop-down list offers 4 choices of waveform to be generated. The user has the option to create vowels with sine waves, triangle waves, sawtooth waves or rectangular waves.
• The vowel type drop-down list is the parameter with which the user can switch between the 5 different vowels.
• A frequency knob that sets the fundamental frequency of the vowel. This
parameter ranges from 50 to 500 Hz with a step of 1.
• 9 buttons, one for each string. These buttons configure the settings of the inverse filter according to the analysis conducted on each string separately. The user may switch between the 9 filters depending on which piano key is being used. Each button turns white when active and red when not.
• 34 volume sliders, one for the fundamental frequency f0 and one for each harmonic. These sliders function as a graphic equaliser and range from −50 to +50 dB with a step of 1. When an inverse filter button is pressed, the sliders automatically take the values of the inverse filter for the corresponding string.
3.6 Results
This project set out to create a vowel synthesis system that receives intelligible vowels from an AU plugin and plays them back through the MRP strings while retaining their intelligibility. The results of this attempt are difficult to show on paper: evaluating whether a vowel is intelligible means one has to listen to it. Being able to identify whether a sound is a vowel, natural or not, is the goal here. Comparing the spectral content of the piano's output with the output of the plugin is the best way to view this project's results. Due to the sheer volume of results, this section shows the three most intelligible vowels transmitted from the piano and the three least intelligible.
Figure 3.9: Spectrum of piano output vs plugin output of G3 playing the vowel O
Figure 3.10: Spectrum of piano output vs plugin output of E4 playing the vowel A
Figure 3.11: Spectrum of piano output vs plugin output of C4 playing the vowel O
Figure 3.12: Spectrum of piano output vs plugin output of G2 playing the vowel A
Figure 3.13: Spectrum of piano output vs plugin output of C2 playing the vowel I
Figure 3.14: Spectrum of piano output vs plugin output of D3 playing the vowel E
The figures in this section show the spectral behaviour of 6 piano strings, represented by the blue lines, superimposed on the spectrum of the plugin output (the piano input), represented by the dashed red lines. The piano output is intelligible when the peaks of the blue line follow the pattern created by the peaks of the red line. In other words, when the spectral shape of the blue line is close to (or in some cases the same as) that of the red line, regardless of the overall difference in dB level, the plugin output has retained the magnitude relationship of its spectral peaks through the piano strings. Where the formants of the two lines do not follow the same pattern, it is clear that the acoustical structure of the piano has distorted the input's frequency content, and thus the inverse filtering was ineffective.
It is clear that figures 3.9, 3.10 and 3.11 are graphs of intelligible vowels, whereas figures 3.12, 3.13 and 3.14 are plots of vowels that are quite difficult to identify. Further discussion of the results takes place in the conclusion.
Chapter 4
Conclusion
This final chapter concludes the examination of the project. A discussion on
the method chosen, the implementation and the results is included as well
as a final evaluation. The last two sections of the chapter concern possible
improvements in terms of programming and analysis and a summary of the
report.
4.1 Discussion
A thorough investigation was carried out into the science behind the human voice, the most popular speech synthesis systems on the market, and the magnetic resonator piano. In the literature review, this paper examined the mechanics of how voice is produced, deduced measurable parameters of voicing, and explained the major differences between three speech synthesis models.
The formant synthesis model was chosen, which according to the investigation was the appropriate method in terms of quality for vowel generation and low computational cost. The JUCE framework was used within Xcode 5 on a Macintosh operating system for the creation of an audio plugin. The implementation of this model required the generation of a source waveform and its filtering by a cascaded array of 2nd-order IIR filters. Chapter 3 includes a detailed analysis of the source/filter model, with a description of the filtering stage which takes this method a step further.
After intelligible vowels were produced by the AU plugin, an analysis on
how the structure of the MRP affects the frequency content of the input was
conducted. The analysis revealed a pattern in the frequency response of the
9 strings of the piano that were used in this project. Nine inverse filters
were incorporated into the audio plugin in order to obtain a flat frequency
response.
Finally, 6 plots were shown representing the most intelligible vowels and the
least intelligible vowels generated by this complex system.
4.2 Evaluation
The project's aims were to create a vowel generation system using a digital audio synthesiser and to transmit its output accurately via the strings of the MRP. Considering that the AU plugin produced vowels that are intelligible but in no way realistic, an understanding emerges of how the evaluation process should be carried out.
In terms of the spectral shape relationship between the input and the output (a close relationship meaning success and a distant one meaning failure), this project has been a success. Most of the 45 vowels transmitted via the piano strings at 9 different fundamental frequencies had a very close spectral shape relationship with the input.
The vowels generated by the digital audio synthesiser omit a significant number of characteristics that would make them sound real. Early reflections in the nasal and oral cavities, unvoiced sounds, small pitch variations and many other parameters are not included in the source/filter model. The resulting sound lacks the transients of the natural voice; its closest approximation to reality is as if we took the steady-state part of a real vowel and put it in an indefinite loop. Although derived from something real, it would not sound natural or intelligible. Within a context of successively changing vowels, though, the perception of intelligibility changes dramatically. For example, when the piano plays the vowel A at 65 Hz through C2, it sounds nothing like an A; but when it plays all the vowels in succession we start to recognise the different types. Even the vowels whose spectrum plots did not meet the success criteria become intelligible in a context of successive transmission of different vowels. And vice versa: the most intelligible vowels in terms of spectral shape relationship are not intelligible when played out of context.
To conclude, the author's evaluation is that the project is a successful first step towards a useable vowel generation system, but in its current state it is incomplete.
4.3 Improvements
Possible improvements to this vowel synthesis system concern programming and spectral analysis.
– Algorithm improvements
The AU plugin, although very successful, could incorporate some more parameters to generate a more natural output. Small pitch variations of the fundamental frequency of the source and of the centre frequency of each formant filter could be implemented in order to model the human voice more accurately. White noise could model the unvoiced air escaping the vocal folds, resulting in a harmonically richer output. Finally, a parallel array of 2nd-order IIR filters could make consonant generation possible and would model vowel reflections in the mouth and nasal cavities.
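The proposed pitch-variation improvement could be sketched as a slow sinusoidal perturbation of the fundamental; the function name, parameters and the 1% depth are illustrative assumptions, not part of the plugin.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical pitch-variation sketch: perturb f0 with a slow sinusoid.
// depth = 0.01 means at most +/- 1% deviation (an assumed value).
double jitteredF0(double f0, double lfoPhase, double depth = 0.01) {
    return f0 * (1.0 + depth * std::sin(lfoPhase));
}
```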
– Spectral analysis improvements
The inverse filtering applied in this project assumes a linear relationship between input and output. The MRP is a physical structure and, like all physical structures, it is non-linear. One of the main reasons this project had only partial success in matching the frequency content of input and output was intermodulation distortion. This non-linear distortion, produced by the acoustic architecture of the piano, resulted in the appearance of some very unpredictable harmonics in the output spectrum.
A major improvement for this project would be to analyse the non-linearity of the piano and derive a mathematical model that predicts the distortion and designs its inverse filter more accurately. This could allow the creation of a function performing vowel transitions across their corresponding spectral targets. Such a feature was attempted during this project, but distortion coming from the piano excited a vast variety of harmonics during the vowel transition; the result was very noisy and contradicted the goals of the project.
4.4 Summary and Final Thoughts
The goals of the bold attempt described in this paper have been both achieved and not achieved. The research and implementation by the author resulted in a robust vowel synthesis system which successfully transmits the majority of the vowels from a digital audio synthesiser via the MRP strings; however, it is not yet a useable instrument.
Improvements have been proposed and room for further research by other
scientists has been left by the author. Overall, it is hoped that this project
has been a positive step towards the fascinating field of speech processing.
Bibliography
[1] History and development of speech synthesis. 2006. URL
http://www.acoustics.hut.fi/publications/files/theses/
lemmetty_mst/chap2.html.
[2] Richard W. Sproat. Multilingual Text-to-Speech Synthesis: The Bell
Labs Approach, volume 4. Springer, 1997.
[3] Sam O’Sullivan. Understanding the basics of sound synthesis. The
Pro Audio Files, February 2012. URL http://theproaudiofiles.
com/sound-synthesis-basics/.
[4] Andrew McPherson. The magnetic resonator piano: Electronic augmentation of an acoustic grand piano. Journal of New Music Research, 39(3):189–202, 2010.
[5] Andrew McPherson and Youngmoo Kim. Augmenting the acoustic piano with electromagnetic string actuation and continuous key position sensing. NIME, 1:217–222, 2010. URL http://www.educ.dab.uts.edu.au/nime/PROCEEDINGS/papers/Paper%20K1-K5/P217_McPherson.pdf.
[6] Glen Lee. Voice synthesis. The Encyclopaedia of Virtual Environments, 1, 1993. URL http://www.hitl.washington.edu/projects/knowledge_base/virtual-worlds/EVE/I.B.2.VoiceSynthesis.html.
[7] Upper respiratory system diagram. URL http://quizlet.com/12905648/module-1-the-respiratory-system-anatomy-and-physiology-flash-cards/.
[8] Johan Sundberg. The acoustics of the singing voice. 1997. URL http:
//www.zainea.com/voices.htm.
[9] Jackie R. Haynes and Ronald Netsell. The mechanics of speech breathing: a tutorial. Department of Communication Sciences and Disorders, Southwest Missouri State University, 2001.
[10] Deirdre D. Michael. About the voice. Lions Voice Clinic, 2014. URL
http://www.lionsvoiceclinic.umn.edu/page2.htm.
[11] Larynx. URL http://learnhumananatomy.com/larynx/.
[12] The Voice Foundation. Voice anatomy physiology. 2014. URL
http://voicefoundation.org/health-science/voice-disorders/
anatomy-physiology-of-voice-production/.
[13] Janwillem van den Berg. Myoelastic aerodynamic theory of voice
production. Journal of Speech, Language, and Hearing Research,
September 1958. URL http://jslhr.pubs.asha.org/article.aspx?
articleid=1749406.
[14] C. Julian Chen. Physics of human voice: A new theory with application. Research Conference Columbia University, 1(1):1–19, November 2012. URL http://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCMQFjAA&url=http.
[15] Glottal source spectrum. URL http://www.ncvs.org/ncvs/tutorials/voiceprod/images/5.5.jpg.
[16] Dinesh K. Chhetri. Neuromuscular control of fundamental frequency
and glottal posture at phonation onset. Acoustical Society of America,
November 2011. URL http://headandnecksurgery.ucla.edu/
workfiles/Academics/Articles/neuromusc_control_chhetri_et_
al.pdf.
[17] Maeva Garnier, Joe Wolfe, and John Smith. Voice acoustics: an introduction. UNSW, 2, May 2009. URL http://www.phys.unsw.edu.au/jw/voice.html.
[18] Eric Armstrong. Journey of the voice: Anatomy, physiology and the
care of the voice. Voice and Speech source, 2008. URL http://www.
yorku.ca/earmstro/journey/resonation.html.
[19] Gunnar Fant. Acoustic Theory of Speech Production. Mouton & Co., 1960.
[20] Source/filter model. URL http://www.phys.unsw.edu.au/jw/speechmodel.html.
[21] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, first edition, 2001.
[22] Articulators. URL http://educationcing.blogspot.co.uk/2012/
08/articulatory-phonetics-vocal-organs.html.
[23] Concatenative synthesis. URL http://www.acoustics.hut.fi/
publications/files/theses/lemmetty_mst/chap9.html.
[24] Praat. URL http://www.fon.hum.uva.nl/praat/.
[25] Will Pirkle. Designing Audio Effect Plug-Ins in C++. Focal Press, first
edition, 2012.