Queen Mary, University of London
Master’s Project
Real-Time Vowel Synthesis:
A Magnetic Resonator Piano Based Project
Author:
Vasileios Valavanis
Supervisor:
Andrew McPherson
A thesis submitted in fulfilment of the requirements
for the degree of Master of Science
in the
School of Electronic Engineering and Computer Science
Queen Mary, University of London
August 2014
“If a picture paints a thousand words, then a
naked picture paints a thousand words without
any vowels”
Josh Stern
Abstract
Speech synthesis has been an important field of research since the beginning of the
digital signal processing era. Vowels make words intelligible and, as Josh Stern so
eloquently put it, without them words are naked. This project explores the
development of a real-time vowel synthesis system based on a medium that
no conventional system uses: Dr McPherson's magnetic resonator piano, whose
strings are made to vibrate in such a way that they generate vowels. This
paper walks the reader through a thorough investigation of the properties of the
human voice, the spectral analysis of the magnetic resonator piano's structure and the
implementation of this vowel synthesis system, which includes a software synthesiser
developed by the author. Results, potential improvements and expansions are
discussed.
Acknowledgements
I would like to thank my project supervisor, Dr Andrew McPherson, for giving
me the opportunity to work on one of the most fascinating subjects in the field
of audio synthesis and computer science, for allowing me to use the magnetic
resonator piano and also for his consistent support and guidance throughout the
project and for putting up with my constant emailing and pestering.
Contents
List of Figures

1 Introduction
1.1 History
1.2 Background
1.3 Paper Structure

2 Literature Review
2.1 The Magnetic Resonator Piano (MRP)
2.1.1 MRP Signal Flow
2.2 The Human Voice
2.2.1 Anatomy of the Human Voice
2.2.2 Mechanics of the Human Voice
2.3 Speech Synthesis Models

3 Implementation
3.1 Methodology
3.2 Preparation
3.3 Programming
3.4 Spectral Analysis
3.5 Plugin Parameters Description
3.6 Results

4 Conclusion
4.1 Discussion
4.2 Evaluation
4.3 Improvements
4.4 Summary and Final Thoughts

Bibliography
List of Figures
2.1 MRP
2.2 Key Sensors
2.3 U.R.S. Diagram
2.4 Vocal Cords
2.5 Glottal Pulses
2.6 Glottal Source
2.7 Formant Filtering
2.8 Source-Filter Model
2.9 Articulators
2.10 Concatenative Synthesis
3.1 PRAAT
3.2 Impulse Train
3.3 Magnitude Response
3.4 Vowel A Spectrum
3.5 C4 Frequency Response
3.6 G3 Frequency Response
3.7 G3 Average Frequency Response
3.8 Plugin Layout
3.9 1st Result
3.10 2nd Result
3.11 3rd Result
3.12 4th Result
3.13 5th Result
3.14 6th Result
Chapter 1
Introduction
1.1 History
The artificial recreation of the human voice has been a subject of study since long
before the digital age. Papers from the late 1700s exploring the generation of vowels,
the synthesis of consonants and their implementation indicate just how early
scientists took an interest in the subject. [1] More than a century later, in 1930,
Bell Labs created the "vocoder", which stripped speech down into its fundamental
frequency and its harmonics, and within the next decade Homer Dudley developed
a keyboard-based voice synthesiser. [2]
With the evolution of electronics and digital signal processing came more advanced
systems. In 1961 Bell Labs once more created an electronic speech synthesis system,
this time using an IBM 704 computer, and in the early 1970s the TSI Speech+ was
developed by Handheld Electronics, a breakthrough in portable speech calculators
for the blind. [2] One of the most brilliant minds of the 21st century, Stephen Hawking,
also uses a Speech+ series system to communicate, his severe medical condition
having rendered him unable to speak.

Nowadays the majority of these technologies are computer-based, but there is still
significant demand for mechanical implementations in the market. [2]
1.2 Background
Digital imitation of speech, both as a concept and as an area of study, has the
potential to drive forward the advancement of a variety of industries. Some of
the industries experimenting with this technology have seen significant growth
since its successful development and have pushed its methodology towards
new boundaries. Medical science, education, music, gaming platforms and many
other fields have produced a substantial number of speech synthesis techniques, all
of which rely on an understanding of how the human voice works and the correct
use of the tools available.

The functionality of an electronic or digital audio synthesiser is quite simple: it
involves the generation of electric or digital signals representing waveforms
and their conversion to audible sound through speakers. A few of the most popular
sound synthesis methods are additive synthesis, subtractive synthesis and
wavetable synthesis. [3] All of these methods, including the ones not mentioned,
describe the generation of non-audible signals prior to their transmission; what
most waveform synthesis systems fail to elaborate on, however, is the actual
medium of sound reproduction. Modern synthesisers are in a way restricted to
transmitting their output through speakers. This project takes a different
angle and explores vowel synthesis transmitted via piano strings.

The quality of any artificial system depends on how closely it approximates the
system it is modelling. In the search for a perfect man-made speech processor,
the research on vowel generation through piano strings described in this paper
has resulted in a new speech synthesis method. The question raised by the author
is whether it is possible to develop a vowel synthesis system and transmit its
output through piano strings so that intelligible vowels are generated.
1.3 Paper Structure
The remainder of this paper is structured as follows:
• Chapter 2 contains a literature review covering the mechanics of the magnetic
resonator piano developed by Dr Andrew McPherson, the physics of the human
voice and existing speech synthesis methods.
• Chapter 3 presents the proposed method and its implementation (including
the design process) for the real-time vowel synthesis system, together with a
description of its parameters.
• Chapter 4 concludes the project with a discussion of the results in the context
of what is examined in the literature review, an evaluation and discussion of
future work, and finally a summary of the project along with the author's final
thoughts.
Chapter 2
Literature Review
2.1 The Magnetic Resonator Piano (MRP)
The magnetic resonator piano can be considered an acoustic instrument with
electronic prosthetics. As a project in its own right, the MRP takes the traditional
grand piano and extends its capabilities. Its success rests on the electromagnetic
actuators mounted above each piano string, which create an electromagnetic field
strong enough to force the strings to vibrate. This vibration allows the indefinite
sustain of any note while giving the performer control over amplitude, frequency
and timbre in real time. What is remarkable about the MRP is that it is perfectly
audible through its existing acoustic structure, without any amplification or use of
loudspeakers. All 88 keys of the piano are usable, and normal performance of the
piano remains intact despite the installed hardware. [4]
Figure 2.1: Magnetic Resonator Piano Block Diagram [4]
2.1.1 MRP Signal Flow
When in operation, the MRP workflow can be divided into three main processes.
1. The first task of the system is to receive an audio signal from a computer, as
seen in figure 2.1. The most recent version of the MRP, the one used
for this project, omits the sensing of string vibration by the pickup as
well as the feedback mechanism formed by the band-pass filters and the PLL.
The input feeds the audio amplifiers that drive each actuator directly. [5]
2. Triggering the actuators is the next stage of operation. It is essential to
mention here that the computer uses a Max/MSP patch to route outgoing
signals towards the string of the user's choice. The detection of which key
is active is performed by a continuous key-sensing mechanism: a modified Moog
Piano Bar, which uses optical interruption sensors above each piano key to
keep track of its motion. The actuators are triggered by a slow pressing
movement of any key and remain active as long as the hammer does not touch
the string and the key has not returned to its default position. [5]
3. The last process of the machine is the actuation itself. The electromagnets above
each string generate a field in which the piano strings vibrate. Driven by the
audio amplifiers, the actuators force a vibration that is in phase with the
actual audio input, a process that results in a better spectral presence of the
output. [5]
Figure 2.2: Key sensing [5]
To sum up and simplify: the magnetic resonator piano receives an audio signal
from a computer and transmits it as accurately as possible, keeping the string
vibrations in harmony. Limitations and non-linearities are of course inevitable
and will be discussed in later chapters.
2.2 The Human Voice
2.2.1 Anatomy of the Human Voice
Most of us are oblivious to how many parts of our body are required to complete
certain tasks, such as generating voice; the same is true of most of our bodily
functions. [6] An examination of the anatomy of the upper respiratory system
reveals the complexity of the process while clarifying notions essential to its
correct physical modelling.

Our body is such an efficient machine that it uses the same organs, muscles and
bones for a vast variety of different functions. This paper will focus on how these
body parts relate to the production of sound rather than on their general use.
Figures 2.3 and 2.4 show where these key body parts are located.
• The pharynx is part of the throat area located just behind the oral
and nasal cavities. It plays a part in producing sound and splits into two
muscular tubes. [8]
• The larynx is an organ that aids breathing and is essential to sound creation
because of its ability to manipulate pitch and volume. [9]
Figure 2.3: Upper Respiratory System Diagram [7]
• The trachea sends air into the lungs. It is open almost all the time and is
located right below the larynx. [9]
• The epiglottis is an elastic cartilage flap attached to the entrance of the
larynx. It covers the larynx and works as a valve when we eat or drink. [10]
• The oesophagus allows food to pass from the pharynx into the stomach. It
is only open when you swallow or vomit. [8]
• The vocal cords, or vocal folds, make phonation possible. They are twin
membranes of muscle and ligament covered in mucus, located where the
pharynx splits between the trachea and the oesophagus, stretching horizontally
across the larynx. They are no longer than 22 mm and are open during
inhalation, closed when you hold your breath, and vibrating when you speak
or sing. [10]

Figure 2.4: Vocal Cords [11]
• The glottis is the combination of the vocal folds and the space in between
the folds. [12]
Together, the aforementioned body parts form the most sophisticated musical
instrument in existence. To the disappointment of some readers, the vocal cords
are evidently not a set of strings vibrating like those of a guitar or a piano to
produce sound; their function is to allow, block or partially block air travelling
from the lungs through the trachea. The anatomy of the vocal system reveals the
key activities of certain organs, which makes voice processing research more accurate.
2.2.2 Mechanics of the Human Voice
Viewing the human vocal system as a musical instrument has helped the author
examine the mechanics behind voice production in a technical way. This section
looks into how the generation of vowels depends on the interplay between key
body parts.

The human respiratory system works like a string instrument and a wind instrument
simultaneously. [8] This complex apparatus can be broken down into three major
subsystems that together explain its function. The major active processes responsible
for the generation of vowels are respiration, phonation and resonation.
1. Respiration
The first component of the voice "instrument" is the lungs. We can consider
the lungs the source of the kinetic energy responsible for sound generation,
since air is the medium in which sound propagates. When we inhale, air is
stored temporarily in the lungs in order to oxygenate the blood. To initiate
speech, air from the lungs is forced through the trachea and the other vocal
mechanisms before it exits the body. While speaking, breathing becomes faster
and inhalation occurs mostly through the mouth. [8]

One control parameter that this component provides is the volume of the
produced sound. The force with which we contract our lungs while we speak
controls the pressure within them: when the pressure in the lungs is either
higher or lower than atmospheric pressure, air starts flowing. To exhale, we
simply use certain muscles to decrease the lungs' capacity, thus increasing the
air pressure, which results in expiration. The higher the velocity of the airflow
through the vocal tract, the greater the amplitude of the sound we produce. [9]
2. Phonation
The next segment of the vocal system concerns the actual generation of sound.
Phonation is the process of converting air into audible sound waves. As air
travels through the trachea it inevitably meets the larynx and the vocal cords
located at its base. The vocal cords work as a gate regulating the airflow to
and from the oral and nasal cavities. The ability of this gate to remain
partially open while interrupting the air coming from the lungs is what makes
it function as a vibrator. [8]
Certain aerodynamic and myoelastic phenomena drive the vibration process of
the vocal cords. Under the pressure of the pulmonary airflow the vocal cords
separate, and due to a combination of factors, including elasticity, laryngeal
muscle tension and the Bernoulli effect, the vocal folds close again rapidly. [13]
As the top of the folds is opening, the bottom is in the process of closing, and
as soon as the top is closed, the pressure build-up below the glottis begins to
open the bottom. If the process is maintained by a steady supply of pressurised
air, the vocal cords continue to open and close in a quasi-periodic fashion.
Each vibration allows a small amount of air to escape, producing an audible
sound at the frequency of this movement; this process generates voice. [13]
Figure 2.5: Pulses created by the vocal cord vibrations [14]
Figure 2.6: Glottal source spectrum [15]
The frequency of this vibration sets the fundamental frequency of the glottal
source and determines the perceived pitch of the voice. [16] The resulting
waveform of this process is a periodic pulsating signal with high energy at the
fundamental frequency and its harmonics, and with amplitude gradually decreasing
across the spectrum, as shown in figure 2.6. [6]
Like all string instruments, the pitch of the sound generated by the folds
depends on their mass, length and tension:

f_n = \frac{n v}{2L}, \qquad v = \sqrt{\frac{T}{\mu}} \qquad (2.1)

(T is tension, L is string length and \mu is mass per unit length.)
However, being organic and airflow-dependent, the vocal folds distinguish
themselves in terms of functionality. The length of the vocal cords varies
between 17 and 22 mm for an adult male and between 12 and 17 mm for an adult
female. These numbers mathematically explain why women most commonly have
higher-pitched voices than men. The fundamental frequency range of the glottal
source is commonly between 50 and 500 Hz for all human beings. [17]
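Equation 2.1 can be checked numerically. The sketch below is an illustrative C++ helper (not part of the thesis plugin; the function name and parameters are the author's of this edit, chosen for clarity) that evaluates the mode frequencies of an ideal string:

```cpp
#include <cmath>

// Frequency of the n-th mode of an ideal string, following equation 2.1:
// f_n = n * v / (2 * L), with transverse wave speed v = sqrt(T / mu).
double stringModeFrequency(int n, double tension, double length, double massPerUnitLength)
{
    double v = std::sqrt(tension / massPerUnitLength); // wave speed on the string
    return n * v / (2.0 * length);
}
```

Shortening the string (or, analogously, having shorter vocal folds) raises every mode frequency, which matches the observation above about male and female voices.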
Controlling the pitch of our voice is one of the most important tools we have
for communicating with one another. While the pressure of the pulmonary airflow
is one method of controlling voice pitch, the primary mechanism resides in the
larynx. The muscles within the larynx give us control over the elasticity and
tension of the vocal folds; by manipulating these characteristics we effectively
adjust the fundamental frequency of the glottal source being generated. [16]
3. Resonation
In the final stage of the voice generation process, the glottal source is shaped
into intelligible vowels and consonants that make up words. The first step of
this transformation occurs in the cavity of the pharynx, which communicates
with the nasal cavity, the oral cavity and the larynx. The pulses of air that
escape the vocal cords are diffused into all of these cavities, which together
play the role of the resonator of the whole vocal system. The moving parts of
this resonator give us the ability to shape the transmitted waveforms: the lips,
the tongue, the velum and essentially all of our facial muscles give us dynamic
control over the real-time filtering of the glottal source. The resonator's task
is to attenuate some frequencies of the band-limited pulse produced by the vocal
folds while amplifying others. Although without the vocal folds we could have no
voice whatsoever, it is this section of the mechanism that makes our voice
versatile, interesting and identifiable. The nature of human DNA ensures that
each person has different dimensions of the aforementioned cavities and muscles;
therefore the filtering of the glottal source that occurs in the resonation
stage is as unique as one's fingerprints. [18]
At this point it is important to mention that speech is made up of more than
one type of sound. The previous section examined the generation of voiced
sound, but speech contains unvoiced and plosive sounds as well. Unvoiced
sounds result when air is forced through certain blockades in the oral cavity,
whereas plosive sounds are sudden bursts of air coming from either the abrupt
movement of the vocal tract or the mouth. [17]
To describe the resonation process in terms of digital signal processing, the
notion of formants must be introduced. Formants have more than one definition
across research areas; the most common definition, and the one used by the
author, describes formants as the spectral peaks of a given sound spectrum.
As the resonator amplifies and attenuates frequencies of the source, formants
appear in its spectral envelope, and they are unique to each individual. [19]
Figure 2.7: Graphic representation of formant creation [20]
Two to three formants are enough to represent a person's voice, despite the fact
that our resonator produces more. [8] As shown at the bottom of figure 2.7,
formants look like the result of band-pass filters applied to the source. This
parallel is not far from the truth, as formants are characterised by the same
features as band-pass filters: centre frequency, gain and bandwidth.
2.3 Speech Synthesis Models
Traditionally, when we talk about speech synthesisers we refer to TTS
(text-to-speech) systems, due to their intuitive input method and vast
popularity. Such systems can be broken down into multiple components that
constitute their functionality. One of the most important modules of a TTS
system is the waveform generator. Based on the model used for sound generation,
speech synthesis systems can be classified into three types: formant synthesis,
articulatory synthesis and concatenative synthesis. These systems can also be
classified into two sub-categories according to the extent of human intervention
in the creation and execution process: synthesis by rule uses a collection of
supervised rules to perform synthesis, while data-driven synthesis derives its
parameters from actual speech data. [21]
1. Formant Synthesis
Formant synthesis uses a source-filter model to generate intelligible sounds.
The source-filter model can be characterised as a simplified version of the
real-life voice generation process: a quasi-periodic pulsating signal (the
glottal source) is generated and then filtered by multiple variable band-pass
filters with the appropriate formant parameters in order to produce intelligible
vowels. [21]
Figure 2.8: Source-filter model [20]
Modelling multiple formant resonances in the digital domain requires the
implementation of 2nd-order IIR (infinite impulse response) filters. Equation
2.2 shows the transfer function of such a filter. [21] The derivation of a
filter's frequency response from its transfer function is examined in chapter 3.

H_i(z) = \frac{1}{1 - 2 e^{-\pi b_i} \cos(2\pi f_i) z^{-1} + e^{-2\pi b_i} z^{-2}} \qquad (2.2)

(2nd-order IIR all-pole filter transfer function with f_i = F_i/F_s and
b_i = B_i/F_s, where F_i, B_i and F_s are the formant's centre frequency,
bandwidth and the sampling frequency, respectively.)
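As a sketch of how equation 2.2 translates into code (an illustrative helper, not the thesis plugin's actual implementation), the function below derives the denominator coefficients of the all-pole resonator from a formant's centre frequency and bandwidth. The pole radius e^{-\pi b_i} is strictly below 1 for any positive bandwidth, which is why such a resonator is stable:

```cpp
#include <cmath>

struct ResonatorCoeffs { double a1, a2; }; // denominator: 1 + a1*z^-1 + a2*z^-2

// Denominator coefficients of the 2nd-order all-pole resonator of equation 2.2:
// a1 = -2 * e^(-pi*b) * cos(2*pi*f), a2 = e^(-2*pi*b),
// where f and b are the centre frequency and bandwidth normalised by the sample rate.
ResonatorCoeffs formantResonator(double centreHz, double bandwidthHz, double sampleRateHz)
{
    const double pi = 3.14159265358979323846;
    double f = centreHz / sampleRateHz;    // normalised centre frequency f_i
    double b = bandwidthHz / sampleRateHz; // normalised bandwidth b_i
    double r = std::exp(-pi * b);          // pole radius, < 1 for b > 0
    return { -2.0 * r * std::cos(2.0 * pi * f), r * r };
}
```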
The choice of IIR filters is not random. IIR filters are far more computationally
efficient than FIR filters: requiring only a few filter coefficients (a_i, b_i)
makes them faster and less memory-hungry. On the other hand, FIR filters are
always stable, whereas IIR filters can have poles outside the unit circle, which
renders them unstable. [21]
There are two ways to combine a number of IIR filters: in cascade or in
parallel. The parallel method is considerably more complex and is mainly used
for the production of fricative sounds. The cascade method is ideal for vowel
sounds and is much easier to implement. One other important difference between
the two is that the cascade technique results in an all-pole filter, whereas
the parallel method results in a filter that has zeros in addition to poles.
Poles and zeros disclose the frequency response characteristics of a filter and
are often used as the basis for digital filter design. [21]
H_1(z) = \sum_{k=0}^{M} b_k z^{-k} \qquad (2.3)

H_2(z) = \frac{1}{1 - \sum_{k=1}^{N} a_k z^{-k}} \qquad (2.4)

(2.3 is the transfer function of an all-zero filter and 2.4 of an all-pole filter.)
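Transfer functions 2.3 and 2.4 correspond to simple difference equations. The hypothetical helpers below (names and signatures are this edit's, for illustration only) show the key structural difference: the all-zero filter computes each output from past inputs only, while the all-pole filter of 2.4 feeds back past outputs:

```cpp
#include <cstddef>
#include <vector>

// One output sample of the all-zero (FIR) filter of equation 2.3:
// y[n] = sum_{k=0}^{M} b[k] * x[n-k]; here x[k] holds the delayed input x[n-k].
double allZeroStep(const std::vector<double>& b, const std::vector<double>& x)
{
    double y = 0.0;
    for (std::size_t k = 0; k < b.size(); ++k)
        y += b[k] * x[k];
    return y;
}

// One output sample of the all-pole (IIR) filter of equation 2.4:
// y[n] = x[n] + sum_{k=1}^{N} a_k * y[n-k]; yPast[k-1] holds the delayed output y[n-k].
double allPoleStep(double x, const std::vector<double>& a, const std::vector<double>& yPast)
{
    double y = x;
    for (std::size_t k = 0; k < a.size(); ++k)
        y += a[k] * yPast[k]; // feedback from previous outputs
    return y;
}
```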
In reality, voice signals are not stationary. Formant synthesis by rule takes
into account the physical limitations of the vocal tract, so that the change
between formants does not occur abruptly, while giving the user the ability
to manipulate pitch and formant sweeps in real time; however, it leaves out
important reflections and nuances that make the output sound realistic. [21]
2. Articulatory Synthesis
Articulatory synthesis is close to formant synthesis in that it also synthesises
voice by rule. It models the motion of our articulators and the resulting
distribution of waveforms in the lungs, the oral and nasal cavities and the
larynx. The model drives a formant synthesiser and uses five articulatory
parameters: the area of the lip opening, the constriction formed by the tongue
blade, the opening of the nasal cavities, the average glottal area, and the rate
of active expansion/contraction of the vocal tract tube behind a blockade. [21]
Figure 2.9: A list of the human articulators [22]
The nature of the human speech articulators does not allow them to perform
large movements; conversely, because they are so restricted, they are easier
to model. The five articulatory parameters are interlaced with the fundamental
frequency and the first four formant frequencies. Though this model may be the
most promising in terms of speech quality, the methods for collecting the
aforementioned area parameters are not very advanced, making articulatory
synthesis the least accurate of the three speech synthesis models. [21]
3. Concatenative Synthesis
Concatenative synthesis attempts to imitate speech in such a way that it
captures all of the small details and secondary reflections, in order to sound
as realistic as possible. The principle behind this model involves concatenating
several speech excerpts from recordings so that a natural sequence of speech is
formed. Unlike synthesis by rule, this data-driven synthesis model does not
require any manual adjustment. In addition, the selected speech segments are
real, so the output is expected to be of high quality. [21]
In reality, the concatenated segments often differ in spectral and prosodic
continuity. If the formants of one segment are not exactly the same as those of
its neighbour, or the perceived pitch differs from one clip to another, then
discontinuities occur at the point of concatenation. The speech excerpts may be
perfectly normal; it is their sequence that sounds unnatural. Under ideal
conditions the concatenative synthesis model produces the most natural output,
but its design has to address many issues to avoid discontinuities. The more
issues solved during the design process, the better the outcome and, naturally,
the more complicated the system. Furthermore, high-quality data-driven synthesis
models require large amounts of stored data, are very computationally expensive
and consume far more memory than rule-based models. [21]
Figure 2.10: A simple concatenative synthesis diagram [23]
Chapter 3
Implementation
The task of the project described in this paper is to build a real-time,
computer-based system that generates vowels in such a way that Dr McPherson's
magnetic resonator piano transmits them intelligibly. The author has created a
user-friendly digital audio synthesiser that serves as the input, in the form
of an AU plugin. Moreover, the spectral behaviour of the piano has been analysed
and incorporated into the system so that clearer results are achieved. The
scientific process behind this attempt is examined in the remainder of this chapter.
3.1 Methodology
The method proposed for the successful production of this system is the creation
of a vowel synthesis AU plugin based on the formant-synthesis-by-rule model
described in chapter 2. Two criteria were taken into consideration before making
this decision: the computational requirements of the plugin, and the ability to
modify certain parameters in real time. The formant synthesis model offers low
computational cost and full real-time adjustment capabilities, as well as decent
output quality.

The implementation of the real-time vowel synthesis system consists of two
primary phases: programming, and spectral analysis of the piano. The JUCE
framework was used for the programming stage, which was carried out in C++
using Xcode 5. For the spectral analysis of the piano, two DPA 2006A microphones
and a TASCAM US-122MKII audio interface were used to record the samples, and
MATLAB was used to analyse them.
3.2 Preparation
Correct programming of the plugin required some pre-calculated data.
Specifically, the formant centre frequency, bandwidth and gain values needed
for the model's filtering were captured using PRAAT. PRAAT is an open-source
speech feature extraction program that can detect voice formants from a single
recording. [24] The first four formants of each vowel were extracted from
recordings of the author's voice.

As shown in figure 3.1, the red dots represent the formants of the vowels. The
Y axis of the spectrogram is frequency in Hz and the X axis is time in seconds.
PRAAT clearly provides comprehensive formant data across time in 5 ms segments.
Figure 3.1: PRAAT spectrogram showing the first 4 formants of 3 vowels [24]
3.3 Programming
The clearest way to describe the creation of a source-filter audio plugin is to
divide it into its two main processes: the generation of the source, and its
filtering.
1. Source
In chapter 2 we thoroughly investigated the nature of the glottal source and
concluded that it is a periodic pulsating signal, rich in harmonics, with
gradually decreasing amplitude from the fundamental frequency across the
spectrum. Its digital representation is a band-limited pulse, which is
essentially a series of harmonics, or sinusoids:

x(t) = \sum_{k=1}^{N} \sin(2\pi k f_0 t) \qquad (3.1)

with the number of harmonics given by

N = \frac{f_s}{2 f_0} \qquad (3.2)

so that aliasing is avoided.
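Equations 3.1 and 3.2 can be rendered directly, as in the illustrative sketch below (a naive sample-by-sample rendering for clarity, not the plugin's wavetable implementation; the function name is hypothetical):

```cpp
#include <cmath>
#include <vector>

// Renders a band-limited impulse train per equation 3.1: the sum of all
// harmonics of f0 up to Nyquist, with N = fs / (2 * f0) as in equation 3.2.
std::vector<double> bandLimitedPulse(double f0, double fs, int numSamples)
{
    const double pi = 3.14159265358979323846;
    int N = static_cast<int>(fs / (2.0 * f0)); // highest harmonic below Nyquist
    std::vector<double> out(numSamples, 0.0);
    for (int n = 0; n < numSamples; ++n)
        for (int k = 1; k <= N; ++k)
            out[n] += std::sin(2.0 * pi * k * f0 * n / fs); // accumulate harmonic k
    return out;
}
```

Capping the harmonic count at f_s / (2 f_0) is what keeps every partial below the Nyquist frequency, avoiding aliasing.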
A C++ oscillator class taken from Will Pirkle’s book ”Designing Audio
Plug-ins in C++” was used for the generation of multiple sine waves. This
clever approach uses a 1024 sample buffer to store sine wave values which
are constantly updated. To avoid any fractional data point values within the
buffer, a linear interpolation using weighted sums function is being used[25]
y =
((x− x1)(x2 − x1)
)y2 +
(1−
((x− x1)(x2 − x1)
))y1 (3.3)
for any y between data points (x1, y1) and (x2, y2).
This procedure ensures phase coherence across the large number of oscillator instances created simultaneously.
In practice, 34 instances of the oscillator class were created, the first representing the fundamental frequency and the remaining 33 its harmonics, creating a band-limited pulse of steady magnitude across its spectrum, as shown in figure 3.2.
2. Filtering
Figure 3.2: Impulse train in time and frequency domain
The filtering stage has been designed in a way that extends the traditional method. The obtained formant filter values for 5 vowels (phonetically [a], [e], [i], [o], [ou]) have been inserted into 5 lookup tables. Each lookup table contains centre frequency (in Hz), gain (in dB) and Q (bandwidth) values for 4 filters. To adequately describe the performance of the plugin, this section is divided into the two main functions of the code.
• Calculation of coefficients: The aforementioned formant values are used to derive the necessary coefficients for the 20 filters in total. Given:

G = 10^{g/20}, \qquad F = 2\pi \frac{f_c}{f_s}

H = \frac{F}{2Q} \ \text{for } g \ge 1, \qquad H = \frac{F}{2gQ} \ \text{for } g < 1

the filter coefficients a_i, b_i are given by:

a_2 = -\frac{1}{2} \cdot \frac{1 - H}{1 + H}
a_1 = (0.5 - a_2)\cos(F)
b_0 = (G - 1)(0.25 + 0.5 a_2) + 0.5
b_1 = -a_1
b_2 = -(G - 1)(0.25 + 0.5 a_2) - a_2

(f_c is the centre frequency, g the gain and Q the bandwidth of each filter.)
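A direct transcription of the coefficient formulas above might look like this. The struct and function names are illustrative, and the boost/cut condition is interpreted on the linear gain G, since a negative-dB g would otherwise give a negative bandwidth; that reading is an assumption.

```cpp
#include <cassert>
#include <cmath>

struct Coeffs { double a1, a2, b0, b1, b2; };

// Transcription of the formant-filter coefficient formulas above.
// fc: centre frequency (Hz), gDb: gain (dB), Q: bandwidth, fs: sample rate.
Coeffs formantCoeffs(double fc, double gDb, double Q, double fs) {
    const double pi = std::acos(-1.0);
    double G = std::pow(10.0, gDb / 20.0);       // linear gain
    double F = 2.0 * pi * fc / fs;               // centre frequency, radians
    // Bandwidth parameter: the cut case widens with 1/G, as in the text.
    // Reading the boost/cut condition on the linear gain is an assumption.
    double H = (G >= 1.0) ? F / (2.0 * Q) : F / (2.0 * G * Q);
    Coeffs c;
    c.a2 = -0.5 * (1.0 - H) / (1.0 + H);
    c.a1 = (0.5 - c.a2) * std::cos(F);
    c.b0 = (G - 1.0) * (0.25 + 0.5 * c.a2) + 0.5;
    c.b1 = -c.a1;
    c.b2 = -(G - 1.0) * (0.25 + 0.5 * c.a2) - c.a2;
    return c;
}
```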
• Calculation of frequency response: Normally, a source/filter model would use the coefficients to design filters that shape the generated band-limited pulse; however, this plugin incorporates something beyond just filtering. In a separate function, each filter's frequency response is derived from its transfer function. From the 2nd-order filter transfer function:

H(z) = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{a_0 + a_1 z^{-1} + a_2 z^{-2}}

we calculate the frequency response by substituting z = e^{j\omega}:

H(\omega) = \frac{b_0 + b_1\cos\omega - j b_1\sin\omega + b_2\cos 2\omega - j b_2\sin 2\omega}{a_0 + a_1\cos\omega - j a_1\sin\omega + a_2\cos 2\omega - j a_2\sin 2\omega}

H(\omega) = \frac{[b_0 + b_1\cos\omega + b_2\cos 2\omega] + j[-b_1\sin\omega - b_2\sin 2\omega]}{[a_0 + a_1\cos\omega + a_2\cos 2\omega] + j[-a_1\sin\omega - a_2\sin 2\omega]}
For:

A = b_0 + b_1\cos\omega + b_2\cos 2\omega
B = -b_1\sin\omega - b_2\sin 2\omega
C = a_0 + a_1\cos\omega + a_2\cos 2\omega
D = -a_1\sin\omega - a_2\sin 2\omega

multiplying numerator and denominator by the complex conjugate C - jD gives:

H(\omega) = \frac{A + jB}{C + jD}\cdot\frac{C - jD}{C - jD}

H(\omega) = \frac{[AC + BD] + j[BC - AD]}{C^2 + D^2}

and the magnitude response of the filter is:

|H(\omega)| = \frac{1}{C^2 + D^2}\sqrt{(AC + BD)^2 + (BC - AD)^2} \qquad (3.4)
Note here that \omega is frequency in radians (0 to 2\pi) and can be written as \omega = 2\pi f_i / f_s.
The magnitude response of a filter encloses the gain information for each frequency within its bandwidth.
Figure 3.3: Magnitude response of the cascaded filter for vowel A
For any generated sine wave with frequency f_i which is part of the source waveform, the above function calculates its amplitude according to the 4 formant filters of each vowel. Essentially, the plugin generates 34 pre-filtered signals, giving far more flexibility in terms of transmission. The resulting synthesis method is additive synthesis, because the source is constructed rather than carved into shape [3].
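The per-harmonic amplitude lookup described above can be sketched as a direct evaluation of equation 3.4; the function name is illustrative.

```cpp
#include <cassert>
#include <cmath>

// Evaluate |H(w)| of one 2nd-order section from its coefficients,
// using the A, B, C, D grouping of Eq. 3.4. w is in radians (0 to 2*pi).
double magnitudeAt(double b0, double b1, double b2,
                   double a0, double a1, double a2, double w) {
    double A = b0 + b1 * std::cos(w) + b2 * std::cos(2.0 * w);
    double B = -b1 * std::sin(w) - b2 * std::sin(2.0 * w);
    double C = a0 + a1 * std::cos(w) + a2 * std::cos(2.0 * w);
    double D = -a1 * std::sin(w) - a2 * std::sin(2.0 * w);
    return std::sqrt((A * C + B * D) * (A * C + B * D) +
                     (B * C - A * D) * (B * C - A * D)) / (C * C + D * D);
}
```

For a harmonic at f_i, w = 2π f_i / f_s, and multiplying the magnitudes of the 4 formant sections gives the gain that harmonic is generated with.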
Figure 3.4 demonstrates how the frequency response function has shaped the band-limited pulse according to the 4 resonances of the vowel A.
Figure 3.4: The vowel A transmitted from the plugin
It is clearly shown how different frequencies are generated with different magnitudes.
The 4 filters are combined in a cascaded array at the end of the C++
function. As mentioned in chapter 2, the cascaded method is ideal for
vowel synthesis and very easy to implement.
Expressing the filter transfer function in factorised form:

H(z) = \frac{\sum_{k=0}^{M} b_k z^{-k}}{1 - \sum_{k=1}^{N} a_k z^{-k}} = G\,\frac{(1 - z_1 z^{-1})(1 - z_1^* z^{-1})(1 - z_2 z^{-1})(1 - z_2^* z^{-1})\cdots}{(1 - p_1 z^{-1})(1 - p_1^* z^{-1})(1 - p_2 z^{-1})(1 - p_2^* z^{-1})\cdots} \qquad (3.5)
where G is the cascaded filter gain and the poles p_i and zeros z_i are either complex conjugate pairs or real-valued. By grouping the factorised equation in terms of the complex conjugate and real-valued pairs we get:

H(z) = G\left(\frac{(1 - z_1 z^{-1})(1 - z_1^* z^{-1})}{(1 - p_1 z^{-1})(1 - p_1^* z^{-1})}\right)\left(\frac{(1 - z_2 z^{-1})(1 - z_2^* z^{-1})}{(1 - p_2 z^{-1})(1 - p_2^* z^{-1})}\right)\cdots \qquad (3.6)
Equation 3.6 represents the cascaded expression of 2nd-order IIR filters, each enclosed in brackets, which can also be written as:

H(z) = G\prod_{k=1}^{K} H_k(z) \qquad (3.7)

Finally, a low pass filter with cutoff frequency at 2.5 kHz has been added to the cascaded array so that unwanted harmonics are attenuated.
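As a sketch, equation 3.7 amounts to running the sections in series and applying the overall gain G. The names are illustrative, the feedback signs follow the transfer function convention as written above, and the plugin itself applies the filters as pre-computed per-harmonic gains rather than running IIR sections at audio rate.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One 2nd-order IIR section (a0 normalised to 1), Direct Form I.
struct Section {
    double b0, b1, b2, a1, a2;
    double x1 = 0, x2 = 0, y1 = 0, y2 = 0;   // delayed samples
    double process(double x) {
        double y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;                     // shift input history
        y2 = y1; y1 = y;                     // shift output history
        return y;
    }
};

// Eq. 3.7: H(z) = G * product of H_k(z), i.e. the sections run in series.
double cascade(std::vector<Section>& sections, double G, double x) {
    for (auto& s : sections) x = s.process(x);
    return G * x;
}
```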
3.4 Spectral Analysis
The magnetic resonator piano, like most structures, has specific acoustic properties. When sound waves travel through its body, their spectrum is inevitably shaped, in a similar way to how the vocal tract shapes the glottal source. In attempting to transmit the output of the plugin through the MRP, the piano's formants have to be taken into consideration. The accurate reproduction of the vowels generated by the AU plugin requires a medium with a flat frequency response. Given that the piano certainly does not have a flat frequency response, a compensation method was necessary.
The strings used for this project were chosen so that their resonant frequencies cover the vocal range (50–500 Hz). C2, E2, G2, D3, E3, G3, C4, E4 and G4 were tested and their output was recorded for analysis. Their resonant frequencies are:
• C2 - 65Hz
• E2 - 82Hz
• G2 - 97Hz
• D3 - 146Hz
• E3 - 164Hz
• G3 - 197Hz
• C4 - 260Hz
• E4 - 327Hz
• G4 - 389Hz
The idea is to feed the strings with all the vowels, a version of all the vowels filtered at 1 kHz, and white noise, all at different dB levels; to record all the outputs; and to derive an average frequency response for each string and consequently for the whole body of the piano. Knowing how each string responds is very important for the implementation of an inverse filter, which will flatten that response and make vowel transmission more intelligible.
Figure 3.5: The frequency response of C4 under all test conditions for vowel A
Figure 3.6: The frequency response of G3 for all vowels
The response of a system h(n), knowing its input x(n) and output y(n), is a simple comparison of y(n) with respect to x(n), and in theory it should be the same no matter what the input is. Figures 3.5 and 3.6 show the responses of strings C4 and G3 under a variety of test conditions. The responses of each string for all vowels under every test condition were gathered in a large number of arrays. Averaging these arrays reveals a clear view of how each string behaves for a number of harmonic inputs, and taking the inverse of the average response yields a quantifiable pattern for the creation of the inverse filter.
Figure 3.7: Average frequency response and its inverse for string G3
The amplitude differences between the red line and the blue line shown in figure 3.7, at the frequencies corresponding to the peaks of the red line, are the gain and frequency values to be imported into the inverse filter.
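The averaging-and-inversion step can be sketched as follows, assuming each measurement has been reduced to a magnitude response sampled on a common frequency grid; names are illustrative, not the project's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Average several measured magnitude responses for one string (one per
// test condition, all on the same frequency grid) and invert the result
// to obtain per-frequency gains for the inverse filter.
std::vector<double> inverseResponse(const std::vector<std::vector<double>>& runs) {
    size_t bins = runs.front().size();
    std::vector<double> inv(bins, 0.0);
    for (const auto& r : runs)                     // accumulate all runs
        for (size_t i = 0; i < bins; ++i) inv[i] += r[i];
    for (size_t i = 0; i < bins; ++i)
        inv[i] = runs.size() / inv[i];             // reciprocal of the average
    return inv;
}
```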
3.5 Plugin Parameters Description
The vowel generator AU plugin is designed so that it offers a variety of adjustable parameters to the user. This section includes a description of its architecture and an examination of its features.
Figure 3.8: A picture of the plugin in operation
Figure 3.8 shows the layout of the plugin. Its parameters from top to bottom are:
• A master ON/OFF button enables and disables the output. The button turns red to indicate when it is in OFF mode.
• A volume knob ranging from 0 to 10 dB with a step of 0.1.
• The wave type drop-down list offers 4 choices of waveform to be generated. The user has the option to create vowels with sine waves, triangle waves, sawtooth waves or rectangular waves.
• The vowel type drop-down list is the parameter with which the user can switch between the 5 different vowels.
• A frequency knob that sets the fundamental frequency of the vowel. This
parameter ranges from 50 to 500 Hz with a step of 1.
• 9 buttons, one for each string. These buttons configure the settings of the inverse filter according to the analysis conducted on each string separately. The user may switch between the 9 filters depending on which piano key is being used. Each button turns white when active and red when not.
• 34 volume sliders, one for the fundamental frequency f0 and one for each harmonic. These sliders function as a graphic equaliser and range from −50 to +50 dB with a step of 1. When an inverse filter button is pressed, the sliders automatically take the values of the inverse filter for the corresponding string.
3.6 Results
This project set out to create a vowel synthesis system that receives intelligible vowels from an AU plugin and plays them back through the MRP strings while retaining their intelligibility. The results of this attempt are difficult to show on paper: evaluating whether a vowel is intelligible means one has to listen to it. Being able to identify whether a sound is a vowel, natural or not, is the goal here. Comparing the spectral content of the piano's output with the output of the plugin is the best way to view this project's results. Due to the sheer volume of results, this section shows the three most intelligible vowels transmitted from the piano and the three least intelligible.
Figure 3.9: Spectrum of piano output vs plugin output of G3 playing the vowel O
Figure 3.10: Spectrum of piano output vs plugin output of E4 playing the vowel A
Figure 3.11: Spectrum of piano output vs plugin output of C4 playing the vowel O
Figure 3.12: Spectrum of piano output vs plugin output of G2 playing the vowel A
Figure 3.13: Spectrum of piano output vs plugin output of C2 playing the vowel I
Figure 3.14: Spectrum of piano output vs plugin output of D3 playing the vowel E
The figures in this section show the spectral behaviour of 6 piano strings, represented by the blue lines, superimposed on the spectrum of the plugin output (the piano input), represented by the dashed red lines. The piano output is intelligible when the peaks of the blue line follow the pattern created by the peaks of the red line. In other words, when the spectral shape of the blue line is close to (or in some cases the same as) that of the red line, regardless of the overall difference in dB level, the plugin output has retained the magnitude relationship of its spectral peaks through the piano strings. Where the formants of the two lines do not follow the same pattern, it is clear that the acoustical structure of the piano has distorted the input's frequency content, and thus the inverse filtering was ineffective.
It is clear that figures 3.9, 3.10 and 3.11 are graphs of intelligible vowels, whereas figures 3.12, 3.13 and 3.14 are plots of vowels that are quite difficult to identify. Further discussion of the results takes place in the conclusion.
Chapter 4
Conclusion
This final chapter concludes the examination of the project. A discussion on
the method chosen, the implementation and the results is included as well
as a final evaluation. The last two sections of the chapter concern possible
improvements in terms of programming and analysis and a summary of the
report.
4.1 Discussion
A thorough investigation was carried out into the science behind the human voice, the most popular speech synthesis systems on the market, and the magnetic resonator piano. In the literature review, this paper examined the mechanics of how voice is produced, deduced measurable parameters of voicing, and explained the major differences between three speech synthesis models.
The formant synthesis model was chosen, which according to the investigation was the appropriate method in terms of quality for vowel generation and low computational cost. The JUCE framework was used within Xcode 5 on a Macintosh operating system for the creation of an audio plugin. The implementation of this model required the generation of a source waveform and its filtering by a cascaded array of 2nd-order IIR filters. Chapter 3 includes a detailed analysis of the source/filter model, with a description of the filtering stage which takes this method a step further.
After intelligible vowels were produced by the AU plugin, an analysis on
how the structure of the MRP affects the frequency content of the input was
conducted. The analysis revealed a pattern in the frequency response of the
9 strings of the piano that were used in this project. Nine inverse filters
were incorporated into the audio plugin in order to obtain a flat frequency
response.
Finally, 6 plots were shown representing the most intelligible vowels and the
least intelligible vowels generated by this complex system.
4.2 Evaluation
The project's aims were to create a vowel generation system using a digital audio synthesiser and to transmit its output accurately via the strings of the MRP. Considering that the AU plugin produced vowels that are intelligible but in no way realistic, an understanding emerges of how the evaluation process should be carried out.
In terms of the spectral shape relationship between the input and the output (a close relationship meaning success and a distant one meaning failure), this project has been a success. Most of the 45 vowels transmitted via the piano strings at 9 different fundamental frequencies had a very close spectral shape relationship with the input.
The vowels generated by the digital audio synthesiser omit a significant number of characteristics that would make them sound real. Early reflections in the nasal and oral cavities, unvoiced sounds, small pitch variations and many other parameters are not included in the source/filter model. The resulting sound lacks the transients of the natural voice; its closest approximation to reality is as if we took the steady-state part of a real vowel and put it in an indefinite loop. Although derived from something real, it would not sound natural or intelligible. Within a context of successively changing vowels, though, the perception of intelligibility changes dramatically. For example, when the piano plays the vowel A at 65 Hz through C2, it sounds nothing like an A; but when it plays all the vowels in succession we start to recognise the different types. Even the vowels whose spectrum plots did not meet the success criteria become intelligible in a context of successive transmission of different vowels. And vice versa: the most intelligible vowels in terms of spectral shape relationship are not intelligible when played out of context.
To conclude, the author's evaluation is that the project is a successful first step towards a useable vowel generation system, but in its current state it is incomplete.
4.3 Improvements
Possible improvements to this vowel synthesis system concern programming and spectral analysis.
– Algorithm improvements
The AU plugin, although very successful, could incorporate some more parameters to generate a more natural output. Small pitch variations of the fundamental frequency of the source and of the centre frequency of each formant filter could be implemented in order to model the human voice more accurately. White noise could model the unvoiced air escaping the vocal folds, resulting in a harmonically richer output. Finally, a parallel array of 2nd-order IIR filters could make consonant generation possible and would model vowel reflections in the mouth and nasal cavities.
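The proposed pitch-variation improvement could be sketched as a slow sinusoidal perturbation of the fundamental; the function name, parameters and the 1% depth are illustrative assumptions, not part of the plugin.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical pitch-variation sketch: perturb f0 with a slow sinusoid.
// depth = 0.01 means at most +/- 1% deviation (an assumed value).
double jitteredF0(double f0, double lfoPhase, double depth = 0.01) {
    return f0 * (1.0 + depth * std::sin(lfoPhase));
}
```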
– Spectral analysis improvements
The inverse filtering applied in this project assumes a linear relationship between input and output. The MRP is a physical structure and, like all physical structures, it is non-linear. One of the main reasons this project had only partial success in matching the frequency content of input and output was intermodulation distortion. This non-linear distortion, produced by the acoustic architecture of the piano, resulted in the appearance of some very unpredictable harmonics in the output spectrum.
A major improvement for this project would be to analyse the non-linearity of the piano and derive a mathematical model that predicts the distortion and designs its inverse filter more accurately. This could allow the creation of a function performing vowel transitions across their corresponding spectral targets. Such a feature was attempted during this project, but distortion coming from the piano excited a vast variety of harmonics during the vowel transition; the result was very noisy and contradicted the goals of the project.
4.4 Summary and Final Thoughts
The goals of the bold attempt described in this paper have been both achieved and not achieved. The research and implementation by the author resulted in a robust vowel synthesis system which successfully transmits the majority of the vowels from a digital audio synthesiser via the MRP strings; however, it is not yet a useable instrument.
Improvements have been proposed and room for further research by other
scientists has been left by the author. Overall, it is hoped that this project
has been a positive step towards the fascinating field of speech processing.
Bibliography
[1] History and development of speech synthesis. 2006. URL
http://www.acoustics.hut.fi/publications/files/theses/
lemmetty_mst/chap2.html.
[2] Richard W. Sproat. Multilingual Text-to-Speech Synthesis: The Bell
Labs Approach, volume 4. Springer, 1997.
[3] Sam O’Sullivan. Understanding the basics of sound synthesis. The
Pro Audio Files, February 2012. URL http://theproaudiofiles.
com/sound-synthesis-basics/.
[4] Andrew McPherson. The magnetic resonator piano: Electronic augmentation of an acoustic grand piano. Journal of New Music Research, 39(3):189–202, 2010.
[5] Andrew McPherson and Youngmoo Kim. Augmenting the acoustic piano with electromagnetic string actuation and continuous key position sensing. NIME, 1:217–222, 2010. URL http://www.educ.dab.uts.edu.au/nime/PROCEEDINGS/papers/Paper%20K1-K5/P217_McPherson.pdf.
[6] Glen Lee. Voice synthesis. The Encyclopaedia of Virtual Environments, 1, 1993. URL http://www.hitl.washington.edu/projects/knowledge_base/virtual-worlds/EVE/I.B.2.VoiceSynthesis.html.
[7] Upper respiratory system diagram. URL http://quizlet.com/12905648/module-1-the-respiratory-system-anatomy-and-physiology-flash-cards/.
[8] Johan Sundberg. The acoustics of the singing voice. 1997. URL http:
//www.zainea.com/voices.htm.
[9] Jackie R. Haynes and Ronald Netsell. The mechanics of speech breathing: a tutorial. Department of Communication Sciences and Disorders, Southwest Missouri State University, 2001.
[10] Deirdre D. Michael. About the voice. Lions Voice Clinic, 2014. URL
http://www.lionsvoiceclinic.umn.edu/page2.htm.
[11] Larynx. URL http://learnhumananatomy.com/larynx/.
[12] The Voice Foundation. Voice anatomy physiology. 2014. URL
http://voicefoundation.org/health-science/voice-disorders/
anatomy-physiology-of-voice-production/.
[13] Janwillem van den Berg. Myoelastic aerodynamic theory of voice
production. Journal of Speech, Language, and Hearing Research,
September 1958. URL http://jslhr.pubs.asha.org/article.aspx?
articleid=1749406.
[14] C. Julian Chen. Physics of human voice: A new theory with application. Research Conference Columbia University, 1(1):1–19, November 2012. URL http://www.google.co.uk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CCMQFjAA&url=http.
[15] Glottal source spectrum. URL http://www.ncvs.org/ncvs/tutorials/voiceprod/images/5.5.jpg.
[16] Dinesh K. Chhetri. Neuromuscular control of fundamental frequency
and glottal posture at phonation onset. Acoustical Society of America,
November 2011. URL http://headandnecksurgery.ucla.edu/
workfiles/Academics/Articles/neuromusc_control_chhetri_et_
al.pdf.
[17] Maeva Garnier, Joe Wolfe, and John Smith. Voice acoustics: an introduction. UNSW, 2, May 2009. URL http://www.phys.unsw.edu.au/jw/voice.html.
[18] Eric Armstrong. Journey of the voice: Anatomy, physiology and the
care of the voice. Voice and Speech source, 2008. URL http://www.
yorku.ca/earmstro/journey/resonation.html.
[19] Gunnar Fant. Acoustic Theory of Speech Production. Mouton & Co., 1960.
[20] Source/filter model. URL http://www.phys.unsw.edu.au/jw/speechmodel.html.
[21] Xuedong Huang, Alex Acero, and Hsiao-Wuen Hon. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, first edition, 2001.
[22] Articulators. URL http://educationcing.blogspot.co.uk/2012/
08/articulatory-phonetics-vocal-organs.html.
[23] Concatenative synthesis. URL http://www.acoustics.hut.fi/
publications/files/theses/lemmetty_mst/chap9.html.
[24] Praat. URL http://www.fon.hum.uva.nl/praat/.
[25] Will Pirkle. Designing Audio Effect Plug-Ins in C++. Focal Press, first
edition, 2012.