
3P – Portuguese Pronunciation Professor

Mariana Sofia Pimenta Lopes

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Supervisor: Prof. Isabel Maria Martins Trancoso

Examination Committee

Chairperson: Professor João Fernando Cardoso Silva Sequeira

Supervisor: Professor Isabel Maria Martins Trancoso

Members of the Committee: Professor Hugo Daniel dos Santos Meinedo

October 2014


To my parents,


Acknowledgements


I would like to take this opportunity to express my gratitude to everyone who supported me throughout the course of this project. I am sincerely grateful for their guidance, constructive criticism and friendly advice during the project work.

I would like to express my deepest appreciation to my advisor, Professor Isabel Trancoso, for encouraging my research and providing priceless support and encouragement when I most needed it.

I would also like to thank the L2F staff, especially Professor Hugo Meinedo, Professor Alberto Gareta, Professor Thomas Pelligrini and PhD student Anna Pompili, for their immense assistance and for providing the source materials essential to completing this project.

Furthermore, I would like to extend my thanks to all the people mentioned in the references for making their work available, so that others can understand and adapt their research.

Finally, special thanks go to my family. Words cannot express how grateful I am to my mother and father for all the sacrifices they have made on my behalf and for their encouragement to strive towards my goal.


Abstract


Oral proficiency is an important part of learning a foreign language. Yet students frequently find it hard to obtain a reliable resource with which they can practise their pronunciation intensively. An automatic assessment system can reduce the cost and workload associated with this task. Tools of this type are available for students of widely spoken languages such as American or British English; however, few exist for students of European Portuguese (EP).

The research presented in this thesis investigates a solution for creating a computer assisted language learning (CALL) system for EP, using the work of Witt (1999) [1] as its basis.

This thesis begins by outlining important aspects of computer assisted language learning, followed by a brief analysis of the EP phonemes and a comparison with the two other languages present in the corpus, Spanish and Bulgarian. The steps of the method are then explained: first, the speech audio is digitized; then, using Audimus, posterior probabilities over 20 ms frames are calculated from the extracted features. Subsequently, a GOP score is calculated for each frame and for each phoneme. The GOP is then normalized and, starting from a pre-established threshold obtained from native speakers' data, the threshold is adapted in order to improve the efficiency of classifying phonemes as correctly or incorrectly pronounced. Finally, since the threshold is subjective to whoever set it, it is compared with the judgements of three human judges in order to guarantee its quality.

Keywords

CAPT, GOP, normalization, pronunciation, natives, non-natives, European Portuguese


Resumo

Oral proficiency is an important part of learning a foreign language. However, students often have difficulty finding a reliable resource with which they can work intensively on their pronunciation. An automatic assessment system can reduce the cost and workload associated with this task. Tools of this kind are available for students of more widely spoken languages, such as American or British English; however, there is largely no research aimed at students of European Portuguese (EP).

The research presented in this thesis investigates a solution for creating a computer assisted language learning (CALL) system for EP, based on the work of Witt (1999) [1].

This thesis begins by describing important aspects of computer assisted language learning, and also gives a brief analysis of the EP phonemes and a comparison with the other two languages present in the corpus, Spanish and Bulgarian. The steps of the process are then explained: first, the audio recording of the speech is digitized and, using Audimus, posterior probabilities over 20 ms intervals are calculated from the extracted features. Subsequently, a GOP score is calculated for each interval and for each phoneme. The GOP is then normalized and, starting from a pre-established threshold obtained from native speakers' data, the threshold is adapted in order to improve the efficiency of classifying phonemes as correctly or incorrectly pronounced. Finally, since the threshold is subjective to whoever set it, it is compared with the judgement of three human judges in order to guarantee its quality.

Keywords

CALL, GOP, normalization, pronunciation, natives, non-natives, European Portuguese.


Table of Contents

Acknowledgements
Abstract
Resumo
List of Figures
List of Tables
List of Acronyms
List of Software
1 Introduction
  1.1 Overview
  1.2 Motivation and problem specification
  1.3 Innovations of the work
  1.4 Thesis contents
2 Pronunciation
  2.1 Learning word pronunciation
  2.2 Automatic Speech Recognition
3 Phonology
  3.1 European Portuguese
    3.1.1 Brief description of EP
    3.1.2 Phonology of EP
  3.2 EP and foreign languages
    3.2.1 Brief comparison with Spanish
    3.2.2 Brief comparison with Bulgarian
4 System Design
  4.1 State of the art
    4.1.1 Scientific research
    4.1.2 Existing tools
  4.2 Method
    4.2.1 Audimus
    4.2.2 GOP
    4.2.3 NGOP
    4.2.4 Threshold
    4.2.5 GOP for fluent speech
    4.2.1 Overall score
    4.2.2 Performance measure
  4.3 Other classification methods
    4.3.1 Likelihood Ratio
    4.3.2 MFCC and DTW based evaluation
5 Experiments and results
  5.1 Corpus
  5.2 Implementation
  5.3 Results
    5.3.1 Threshold for Native speakers
    5.3.1 SA results for all non-native
    5.3.2 SA results for Spanish
    5.3.3 SA results for Bulgarian
    5.3.4 Comparison, Critic and Evaluation
6 Interface
  6.1 VITHEA project
  6.2 3P Interface
7 Conclusion
  7.1 Conclusions
  7.2 Future work
Annex 1 - Extra Tables
References


List of Figures

Fig. 1- Lexical Distance among the Languages of Europe
Fig. 2- Places of articulation for consonants [22]
Fig. 3- Vowel Triangle in relation to tongue position [22]
Fig. 4- Model's Schematic.
Fig. 5- Audimus schematic.
Fig. 6- Visualization of the phoneme division in the waveform.
Fig. 7- GOP and NGOP relation.
Fig. 8- GOP for fluent speech [1].
Fig. 9- Example of exact and approximated GOP.
Fig. 10- NGOP's and GOP's SA comparison for each phoneme.
Fig. 11- NGOP for each phoneme for native (blue) and non-native (red) with native mean and std above.
Fig. 12- SA for each Phoneme in Spanish
Fig. 13- SA for each Phoneme in Bulgarian
Fig. 14- 3P Interface


List of Tables

Table 1- Consonants in EP
Table 2- Vowels of EP
Table 3- Spanish consonants and vowels in comparison with EP
Table 4- Bulgarian consonants and vowels in comparison with EP
Table 5- Mean and number of occurrences by phoneme
Table 6- Threshold of each phoneme for
Table 7- Threshold of each phoneme
Table 8- Average performance measures
Table 9- Judges vs GOP scoring
Table 10- Example of words using the EP phoneme list
Table 11- Confusion matrices for Portuguese, Spanish and Bulgarian


List of Acronyms

EP      European Portuguese
3P      Portuguese Pronunciation Professor
GOP     Goodness of Pronunciation
NGOP    Normalized Goodness of Pronunciation
SA      Score accuracy
PP      Posterior Probability
std     Standard deviation
CALL    Computer assisted language learning
CAPT    Computer assisted pronunciation training


List of Software

Audimus      AUDIMUS is a speech recognition system that works offline with any audio/speech/video file, transcribing its content with the possibility of segmentation.

Matlab       MATLAB is a high-level language and interactive environment for numerical computation, visualization, and programming.

WaveSurfer   WaveSurfer is an open source tool for sound visualization and manipulation. Typical applications are speech/sound analysis and sound annotation/transcription.

3P           Program developed in this thesis for EP pronunciation training.


Chapter 1

Introduction


This chapter gives a brief overview of the work. It establishes the goals, original contributions and motivation of the work, and closes with the structure of the thesis.


1.1 Overview

While pronunciation assessment is well established for English, it has received little study for EP. 3P (Portuguese Pronunciation Professor) guides learners of EP through several exercises, giving instructions and feedback by means of Text-To-Speech synthesis. The research presented investigates a solution for creating a computer assisted language learning (CALL) system for European Portuguese (EP), using the work of Witt (1999) [1] as its basis. The algorithm is based on Goodness of Pronunciation (GOP), a measure that uses confidence scores drawn from automatic recognition and alignment results at phone level. The GOP is computed for native data in order to select a threshold that separates good from bad pronunciation. The method is then tested with a non-native corpus, and the results are analyzed and adjusted using several performance measures. The GOP module has been integrated into an existing web interface.
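The thresholding idea outlined in this overview can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in 3P: the frame score follows the general log-ratio form of Witt's GOP, the phone score is taken as the mean over the frames aligned to the phone, and the threshold is assumed to be the native mean plus k standard deviations. The function names, the dictionary layout of the posteriors, the probability floor and the value of k are all hypothetical, and the normalization step (NGOP) is not covered.

```python
import math
from statistics import mean, stdev

FLOOR = 1e-10  # assumed floor to avoid log(0) on zero posteriors


def frame_score(posteriors, target):
    """Absolute log-ratio between the target phone's posterior and the best
    posterior in the frame; 0.0 means the target is the top candidate."""
    p_target = max(posteriors.get(target, 0.0), FLOOR)
    p_best = max(max(posteriors.values()), FLOOR)
    return abs(math.log(p_target / p_best))


def phone_gop(frames, target):
    """Mean frame score over the frames aligned to one phone (assumed form)."""
    return mean(frame_score(f, target) for f in frames)


def native_threshold(native_scores, k=1.0):
    """Hypothetical rule: native mean plus k sample standard deviations."""
    return mean(native_scores) + k * stdev(native_scores)


def is_correct(gop, threshold):
    """A phone is accepted when its score does not exceed the threshold."""
    return gop <= threshold


# Two 20 ms frames aligned to the SAMPA phone /a/ (made-up posteriors).
frames = [{"a": 0.8, "6": 0.2}, {"a": 0.4, "6": 0.6}]
score = phone_gop(frames, "a")
threshold = native_threshold([0.1, 0.2, 0.3], k=1.0)
print(score, threshold, is_correct(score, threshold))
```

With this convention a lower score means a better match, so the native data define an upper bound on acceptable scores; a per-phoneme threshold would simply call `native_threshold` once for each phoneme's native scores.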

1.2 Motivation and problem specification

This thesis is motivated by the vision of an interactive system that helps people learn a new language. One of the biggest difficulties in learning EP is the lack of material available to the general public [2]. An interactive system that assesses individual phonemes and words in European Portuguese gives students an opportunity to study and reach pronunciation proficiency without the constant support of native speakers, and helps them understand the phonetic subtleties of the language.

1.3 Innovations of the work

This research has two main innovations: the implementation of an adaptation of the GOP method for European Portuguese phonemes, and the creation of a simple web interface where students can practise their pronunciation and which can be extended with new sentences, games and better thresholds.

1.4 Thesis contents

This thesis is composed of 7 chapters.

Chapter 1 - Introduction - This chapter gives a brief overview of the work, establishing its goals, original contributions and motivation.

Chapter 2 - Pronunciation - This chapter provides an overview of the difficulties and foundations of learning pronunciation. It also briefly explains what CALL and CAPT systems are and why they are used.

Chapter 3 - Phonology - This chapter provides a brief explanation of the phonology of EP and its comparison with Spanish and Bulgarian.

Chapter 4 - System design - This chapter provides an overview of the existing tools and current research, and explains the steps needed to obtain the classification of the pronunciation.

Chapter 5 - Experiments and results - This chapter presents the validation of the implementation steps and the results obtained.

Chapter 6 - Interface - This chapter provides an overview of the interface implemented.

Chapter 7 - Conclusions - This chapter concludes the work, summarising the conclusions and pointing out aspects to be developed in future work.


Chapter 2

Pronunciation

This chapter provides an overview of the difficulties and foundations of learning pronunciation. It also briefly explains what CALL and CAPT systems are and why they are used.


2.1 Learning word pronunciation

Learning a language can be described as the process of acquiring language competence through planned, conscious study. However, acquiring a language as a child and learning a new language as an adult are two different processes. As Krashen and Terrell wrote, “[Acquiring] is the ‘natural’ way, paralleling first language development in children. Acquisition refers to an unconscious process that involves the naturalistic development of language proficiency through understanding language and through using language for meaningful communication” (Krashen and Terrell in Richards, 1987: 131) [3]. Learning a second language is a different process: “Learning, by contrast, refers to a process in which conscious rules about a language are developed. It results in explicit knowledge about the forms of a language and the ability to verbalize this knowledge” (Richards, 1987: 131) [3]. Nevertheless, most adult learners of a foreign language, and even learners as young as 6 years old, retain some artifacts in their pronunciation that identify them as non-native speakers [3].

Native-like intonation can be learned; however, this is extremely difficult even for advanced language learners. In addition to requiring a great deal of feedback to improve pronunciation, students cannot attend to all aspects of pronunciation at the same time; for example, attending to phonetic accuracy takes processing time away from attending to intonation [4]. On the other hand, a person who listens to a song on repeat tends to learn it more easily, which makes repetition a learning aid shared around the world. As described by the mere exposure effect, repetition works not only with songs but also with advertising, shapes and pattern-based elements, such as new words. Repeating a word many times shifts attention from what the word means to how it sounds, making for a smoother transition in learning a language. Simply repeating a sentence a number of times shifts the listener's attention to the pitch and duration of the sounds, so that the repeated language begins to sound like a repeated song. Therefore, one of the foundations of acquiring oral proficiency is repetition [5]. The other is identifying the learner's difficulties in perceiving the difference between phonemes and correcting them. Thus, to reach pronunciation proficiency a student needs two elements: repetition, and evaluation of his or her performance in order to correct systematic errors. This makes learning a new language a process of training, repetition and memorization [6][7].

2.2 Automatic Speech Recognition

Learning to pronounce written words means learning the intricate relations between a language's writing system and its speech sounds [8]. Children face this task when they learn to read and write in primary school, and students face an analogous one when mastering the writing system, the speech sounds and the vocabulary of a language different from their mother tongue. Learning to pronounce words can also be modeled on computers. In contrast with humans, machines can be configured in very specific ways: for instance, a machine can be set up to hold a database of representations of word-pronunciation knowledge without having learned any of those representations by itself, since they are hardwired in memory by the system's designer [8] [9] [10] [11].

Computer Aided Language Learning (CALL) is a cross-disciplinary field that includes the

subfields Foreign Language Learning (FLL), Foreign Language Teaching (FLT), Linguistics, and

Human Language Technologies (HLT). FLL research typically focuses on topics such as learning

strategies employed by students and effectiveness of environments designed to support learning. FLT

focuses on discovering and employing effective pedagogies to facilitate learning as well as meaningful

performance measurements. Linguistics, specifically the subfield of Second Language Acquisition (SLA),

focuses on the process of learning a second language by investigating common patterns of mistakes

and progression in competence. Finally, Human Language Technologies encompasses the full-range

of technologies, from audio recordings to dialogue systems, used to facilitate learning [12].

Researchers have investigated the use of computers for language learning since the 1960s.

The field of CALL has seen an explosion of research over the past decade. One of the biggest

challenges in designing computer assisted language learning (CALL) applications that provide

automatic feedback on pronunciation errors consists in reliably detecting the pronunciation errors at

such a detailed level that the information provided can be useful to learners [13] [14] [15].

CALL systems are numerous with diverse system configurations. On the simple end of the

spectrum, the systems can take the form of web pages with fill-in forms, online chat rooms, static

multimedia programs, modifications to popular games, or even simply a set of digital music files for

playback purposes. On the complex end, systems can have automatic speech recognition, voice

synthesis, and highly interactive 3D environments that teach cultural norms as well as language [12].

Modern systems tend to be much richer language learning environments that incorporate high

quality audio, graphics, and automated feedback. The content of the lessons is usually not static, and

is generated randomly or adaptively, in response to student actions. Many systems use some form of

Automatic Speech Recognition (ASR), speech synthesis, natural language understanding, or natural

language generation [12].

In addition, computer assisted language learning (CALL) applications and, more specifically,

computer assisted pronunciation training (CAPT) applications for language learning that make use of

automatic speech recognition (ASR) have been focused on pronunciation grading (or scoring), while

less attention has been paid to error detection (or localization).

Before CALL methods can be devised, however, it is important to recognize the specific difficulties encountered in pronunciation teaching. First and foremost, explicit pronunciation teaching requires the teacher's sole attention on a single student in order to analyse his or her speech and give notes on how to improve, which poses a problem in a normal classroom environment. Second, learning a new language involves extensive repetition of words, which is not only a mental task but also demands coordination and control over many muscles to achieve proficiency. This may have social implications for students who are afraid to perform in the presence of others. The approach is costly and time-consuming, and its automation is therefore highly desirable for self-study.


On the other hand these technologies do not take into account the variations due to speaker

accent, demanding a strict distinction among the different sounds unlike what would happen with

human teachers. So it can be said that there are two strands in the area of pronunciation learning:

teaching correct pronunciation of a foreign language to students, which requires a precise phoneme

recognition and is more objective and easily computed, and assessing the pronunciation quality of a

speaker speaking a foreign language that can tolerate more mispronunciations, but is also more

correlated with what human teachers perceive as the correct pronunciation [8] [9] [10] [11] [16].


Chapter 3

Phonology

This chapter provides a brief explanation of the phonology of EP and its comparison with Spanish and

Bulgarian.


3.1 European Portuguese

3.1.1 Brief description of EP

Portuguese is a Romance language, i.e. it descends from Vulgar Latin (with influences from Celtic, Germanic and Arabic languages), and is the official language of Portugal, Brazil, Angola, Mozambique, Cape Verde, Guinea-Bissau, São Tomé and Príncipe, Macao, Equatorial Guinea and East Timor. It has approximately 215-220 million native speakers and 260 million total speakers (as a first language (L1) or as a foreign language (L2)), of whom over 10 million speak EP [17].

Portugal has three official languages: European Portuguese, Mirandese and Portuguese Sign Language. The dialects of Portugal can be divided into two major groups:

The southern and central dialects are broadly characterized by preserving the distinction between /b/ and /v/, and by the tendency to replace /ei/ and /ou/ with /6j/ and /o/. This group includes the dialect of the capital, Lisbon, which nevertheless has some peculiarities of its own. Although the dialects of the Atlantic archipelagos of the Azores and Madeira also have unique characteristics, they can be grouped with the southern dialects as well.

The northern dialects are characterized by preserving the pronunciation of /ei/ and /ou/ as diphthongs and by sometimes merging /v/ with /b/ (as in Spanish). This group includes the dialect of Porto, Portugal's second largest city.

In addition, in the Portuguese town of Barrancos (on the border between Extremadura, Andalucia and Portugal), a dialect of Portuguese heavily influenced by southern Spanish dialects is spoken, known as barranquenho.

As for the dialects spoken outside Portugal, in Brazil, Africa and Asia, it is usually believed that they derive mostly from those of central and southern Portugal.

The Galician language, spoken in the region of Galicia, Spain, is considered by some of its speakers to be a dialect of Portuguese or, more precisely, of the Galician-Portuguese (Galego-Português) language, while others believe it to be a different, if closely related, language. It is mainly characterized by the lack of opposition between /b/ and /v/, the preservation of the “ei” and “ou” diphthongs and, perhaps more characteristically, the devoicing of the consonant /Z/ into /S/ and the use of /o~/ instead of /6~w~/. In addition, EP has a sister language, Spanish, with which it shares a lexical similarity (a measure of the degree to which the word sets of two given languages are similar) of 89%; apart from the other Romance languages, its similarity with other languages is not substantial [18] [19] [20].

The figure below schematizes the lexical distance among the languages of Europe. As an initial hypothesis for the two languages in the non-native corpus (Bulgarian and Spanish), Spanish speakers may obtain good results owing to the proximity of Spanish to Portuguese, although interference between two similar languages may create difficulties, while Bulgarian speakers will probably obtain worse results, since Bulgarian belongs to a different group [18].

Fig. 1- Lexical Distance among the Languages of Europe

3.1.2 Phonology of EP

European Portuguese is a West Iberian Indo-European language whose phonetic inventory comprises 39 phonemes, including the pause, described in the tables below using the SAMPA (Speech Assessment Methods Phonetic Alphabet) script [20].

3.1.2.1 Consonants

Consonants

                               Labial   Coronal             Dorsal
                                        (Alveolar/Dental)   (Velar/Palatal)
Plosive/Occlusive   voiced     b        d                   g
                    unvoiced   p        t                   k
Fricatives          voiced     v        z                   Z
                    unvoiced   f        s                   S
Nasals                         m        n                   J
Lateral                                 l/l~                L
Trill                                   r                   R
Semi-vowels                    w                            j
                               w~                           j~

Table 1- Consonants in EP

The consonants in EP can be classified by the manner of articulation (the configuration and

interaction of the articulator) and place of articulation (the point of contact where an obstruction occurs

in the vocal tract between an articulatory gesture, an active articulator and a passive location (typically

some part of the roof of the mouth)) [21].

As for the place of articulation, a consonant can be labial, either bilabial, articulated with both lips, or labiodental, articulated with the lower lip and the upper teeth; coronal, either dental, articulated with the tongue against the upper teeth, or alveolar, articulated with the tongue against or close to the superior alveolar ridge; or dorsal, either palatal, articulated with the body of the tongue raised against the hard palate (the middle part of the roof of the mouth), or velar, articulated with the back part of the tongue (the dorsum) against the soft palate, the back part of the roof of the mouth (also known as the velum).

As for the manner of articulation, a consonant can be a nasal, produced with a lowered velum, allowing air to escape freely through the nose; a plosive, in which the vocal tract is blocked so that all airflow ceases; a fricative, produced by forcing air through a narrow channel made by placing two articulators close together; a lateral, in which the airstream proceeds along the sides of the tongue but is blocked by the tongue from going through the middle of the mouth; or a trill, produced by vibrations between the articulator and the place of articulation.

There are also semi-vowels, which are phonetically similar to a vowel sound but function as a syllable boundary [21].

Fig. 2- Places of articulation for consonants [22]


3.1.2.2 Vowels

Vowels (SAMPA)

             Oral                       Nasal
             Front  Central  Back       Front  Central  Back
close        i               u          i~              u~
close-mid    e               o          e~              o~
mid                 @
open-mid     E      6 (1)    O                 6~ (2)
open                a

Table 2- Vowels of EP

Vowels can be classified according to the position of the tongue along two dimensions: front, central or back, from the front of the mouth to the back; and close, close-mid, mid, open-mid or open, from as close as possible to the roof of the mouth to as far from it as possible [21].

Fig. 3- Vowel Triangle in relation to tongue position [22]

1 The phoneme /6/ may appear as /A/ in some parts due to the Audimus configurations

2 Idem for the phoneme /6~/


3.2 EP and foreign languages

3.2.1 Brief comparison with Spanish

Spanish, or Castilian, is also part of the West Iberian branch of the Romance languages, but it belongs to the Castilian subdivision, whereas EP belongs to the Galician-Portuguese one. Both languages derive from Latin and have a lexical similarity (a measure of the degree to which the vocabularies of two given languages are similar) of 0.89. The following table lists the EP phonemes, with the EP phonemes non-existent in Spanish marked with an asterisk [23].

Consonants (SAMPA)

Places of articulation: Labial; Coronal (Alveolar, Dental); Dorsal (Velar, Palatal)

Manner               Voicing     Phonemes
Plosive/Occlusive    voiced      b  d  g
Plosive/Occlusive    unvoiced    p  t  k
Fricative            voiced      v*  z*  Z*
Fricative            unvoiced    f  s  S*
Nasal                            m  n  J
Lateral                          l/l~  L
Trill                            r  R
Semi-vowel                       w  j
Semi-vowel                       w~*  j~*

Vowels (SAMPA)

             Oral                       Nasal
             Front  Central  Back       Front  Central  Back
close        i               u          i~*             u~*
close-mid    e               o          e~*             o~*
mid                 @*
open-mid     E*     6*       O*                6~*
open                a

* EP phoneme non-existent in Spanish, according to the description that follows the table.

Table 3- Spanish consonants and vowels in comparison with EP

The major difference between EP and Spanish is that the latter has no nasal vowels or nasal semi-vowels, and no open-mid or mid vowels. The voiced fricatives /v/, /z/ and /Z/ and the unvoiced fricative /S/ are also nonexistent. However, Spanish has the additional fricatives /T/, as in cinco, and /x/, as in mujer, and the affricates /tS/, as in mucho, and /jj/, as in hielo.


3.2.2 Brief comparison with Bulgarian

Bulgarian is also an Indo-European language, but it belongs to the Slavic subgroup, whereas EP belongs to the Italic subdivision. The following table lists the EP phonemes, with the EP phonemes non-existent in Bulgarian marked with an asterisk [24].

Consonants (SAMPA)

Places of articulation: Labial; Coronal (Alveolar, Dental); Dorsal (Velar, Palatal)

Manner               Voicing     Phonemes
Plosive/Occlusive    voiced      b  d  g
Plosive/Occlusive    unvoiced    p  t  k
Fricative            voiced      v  z  Z
Fricative            unvoiced    f  s  S
Nasal                            m  n  J*
Lateral                          l/l~*  L
Trill                            r  R*
Semi-vowel                       w*  j
Semi-vowel                       w~*  j~*

Vowels (SAMPA)

             Oral                       Nasal
             Front  Central  Back       Front  Central  Back
close        i               u          i~*             u~*
close-mid    e               o*         e~*             o~*
mid                 @
open-mid     E*     6*       O                 6~*
open                a

* EP phoneme non-existent in Bulgarian, according to the description that follows the table.

Table 4- Bulgarian consonants and vowels in comparison with EP

In phonetic terms, EP and Bulgarian are similar. Most consonants of EP are present in Bulgarian, with the exception of /J/, /l~/ and /R/. The only semi-vowel is /j/ and there are no nasalized vowels. As for the oral vowels, only /E/, /6/ and /o/ are not present. In addition to these phonemes, Bulgarian has 17 other palatalized consonants and 5 non-palatalized ones.

4 System Design


This chapter provides an overview of the existing tools and of the research being done, and explains the steps required to obtain the classification of the pronunciation.

4.1 State of the art

4.1.1 Scientific research

Over the last years, several groups have developed interactive language teaching systems based on speech recognition techniques, known as computer-assisted pronunciation training (CAPT) systems.

One of the first functioning projects was the SPELL project [1], which concentrated on specific phonemes. Other projects focus on scoring complete sentences rather than phonemes. The standard method, which is also the method used in this research, was first analyzed by Witt (1999) in “Use of Speech Recognition in computer assisted language training”. It scores the pronunciation with a measure called Goodness of Pronunciation (GOP). Other measures are possible, such as scores based on MFCC [25] or on the likelihood [26] [27], but to date there is no record of a method more effective than the one described by Witt or some of its modifications [14] [28] [29].

Another project worth mentioning is PLASER (Pronunciation Learning via Automatic SpEech Recognition), a multimedia tool created by the Hong Kong University of Science and Technology to teach American English pronunciation to high school students. In word exercises, PLASER computes a score based on the confidence of each phoneme in a word and paints the phoneme with a 3-color scheme according to the accuracy of its pronunciation [29]. Besides an overall pronunciation score, it also gives an explanation, with a schematic, of how to pronounce the phonemes.

4.1.2 Existing tools

Free software or applications available on the market mostly focus on listening to and repeating words or sentences, without giving any feedback on how well they were pronounced. One of the most complete on the web is forvo, which contains over 1,749,117 words and 1,856,029 pronunciations in 299 languages [29], including pronunciations in EP with translations. Several pronunciation exercises are also available for the English language, for instance in learnersdictionary [31], where the user practices how to pronounce several paronomasias, sentences and syllable stress patterns, but it is also devoid of any interaction. As for EP, besides the type of application described above, there are only sites with written explanations of how to pronounce the different phonemes, such as learningportuguese [32].

However, the most interesting one is bonjourdefrance [33], where users read a limited list of sentences in French into a microphone and receive feedback on how well they were pronounced. Yet it only gives an overall score and does not indicate precisely which phonemes are poorly pronounced and which are well pronounced.

We have not found any paid interactive applications for EP, but there are several for more widely practiced languages such as English. One example is englishlearning [34], which cost, at the time of the survey, between US$ 77.95 and US$ 125.95. It has several thousand words and sentences divided into different levels and classifies the pronunciation.

Overall, most of these systems only provide a general score for a word or utterance and do not indicate where to improve or how to correct the mispronunciations.

4.2 Method

Measuring the quality of the pronunciation requires an audio file and its transcription. The audio speech is digitized; then, using Audimus, the in-house recognizer, posterior probabilities on 20 ms frames are calculated from the extracted features. Subsequently, a GOP score is calculated for each frame, and a score is given to each phoneme and word by averaging the GOP of its frames. Afterwards the GOP is normalized. A preliminary threshold, established from native speakers’ data, is then adapted to obtain the maximum efficiency in classifying each phoneme as a correct or an incorrect utterance, taking care not to raise it so much that all phonemes are considered correct. Finally, given the subjective nature of this threshold, the scores of the system are compared with those of three human judges in order to compute their correlation.

Fig. 4- Model’s Schematic.


4.2.1 Audimus

Audimus is an Automatic Speech Recognition system customized to the European Portuguese language and developed by the Spoken Language Systems Laboratory (L2F) of INESC-ID [35]. The system is based on a hybrid automatic speech recognizer that combines the temporal modeling capabilities of Hidden Markov Models (HMMs) with the discriminative pattern classification capabilities of Multi-Layer Perceptrons (MLPs) [36]. As output, Audimus gives, for every frame, the posterior probability of each of the SAMPA phonemes and the identification of the phoneme.

The system starts by dividing the audio file into 20 ms frames, and for each frame it extracts three types of features, which are processed in three different branches. The first branch extracts 26 PLP (Perceptual Linear Prediction) features, the second 26 Log-RASTA (log-RelAtive SpecTrAl) features, and the third 28 MSG (Modulation Spectrogram) coefficients. Each branch then incorporates an MLP classifier that estimates the phoneme probabilities from its own extracted features. Each MLP has the same basic structure: an input layer with 9 context frames, a non-linear hidden layer with over 1000 sigmoidal units, and 40 softmax outputs. Lastly, the MLP/HMM acoustic model combines the posterior phone probabilities generated by the three phonetic classification branches using an average in the log-probability domain [35] [36].
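The combination of the three branches can be illustrated with a minimal Python sketch. It is not the Audimus implementation: the function name `combine_streams`, the small constant `eps` and the final renormalization step are assumptions made only for illustration.

```python
import math

def combine_streams(streams):
    """Combine per-phoneme posteriors of one frame across several branches.

    streams: list of lists; streams[b][k] is the posterior of phoneme k
             given by branch b (e.g. PLP, Log-RASTA, MSG).
    The posteriors are averaged in the log domain. Renormalizing the result
    so that it sums to 1 is an assumption of this sketch.
    """
    eps = 1e-12  # avoids log(0); illustrative value
    n_phones = len(streams[0])
    log_avg = [
        sum(math.log(s[k] + eps) for s in streams) / len(streams)
        for k in range(n_phones)
    ]
    unnorm = [math.exp(v) for v in log_avg]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical posteriors of one frame for three phonemes.
plp = [0.7, 0.2, 0.1]
rasta = [0.6, 0.3, 0.1]
msg = [0.8, 0.1, 0.1]
combined = combine_streams([plp, rasta, msg])
```

Averaging log-probabilities corresponds to a geometric mean of the branch posteriors, so a phoneme that one branch considers very unlikely is penalized more strongly than with an arithmetic mean.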

The accuracy presented by Audimus resulted from the phonetic transcription, by a system of rules monitored by a linguistic specialist, of 27833 different words, producing a set of pronunciations. It uses the 39 phonemes defined previously and is independent of the speaker. The language models used in training were obtained using the models from CMU_Cambridge. The database used corresponds to approximately 46 million words, of which 321 thousand are different, that appeared in the on-line Portuguese newspaper Público. Of all words, 80% were used in training, 10% in development and the remaining 10% in evaluation.

Fig. 5- Audimus schematic.

Fig. 6- Visualization of the phoneme division in Wavesurfer.

This tool was employed in this project not only to calculate the posterior probabilities, i.e. the probability that a frame of a given audio file corresponds to each of the phonemes described previously, but also as a means of identifying when each phoneme is uttered, given the transcription of the audio.

4.2.2 GOP

The GOP (Goodness of Pronunciation) method was introduced by Witt and Young (1999) and is one of the most used methods to score the articulation of words. Its popularity is due to its reduced computational complexity and its independence from the language to which it is applied. This means that the same method can be used for different languages and dialects, as long as the posterior probabilities and the segmentation of the utterance into phonemes are available. Although it has been shown that the method can yield satisfactory results [37], it requires the determination of a threshold to define the boundary between a good and a bad pronunciation. Thus the quality of the GOP scoring depends on the models used and on the native speakers employed. Nonetheless, the GOP is calculated in the same way for both accurate and inaccurate utterances.

For each phoneme in an utterance, the GOP algorithm calculates the likelihood ratio that the recognized phoneme corresponds to the phoneme that should have been spoken. The GOP score of phoneme p is defined as the frame-normalized logarithm of the posterior probability P(p|O(p)), where O(p) refers to the acoustic segment uttered by the speaker and NF(p) is the number of frames in the acoustic segment O(p) [1].

\[
PP(p) = \frac{\left|\log P(p \mid O(p))\right|}{NF(p)} = \frac{1}{NF(p)} \left| \log \frac{P(O(p) \mid p)\, P(p)}{\sum_{q \in Q} P(O(p) \mid q)\, P(q)} \right| \tag{4.1}
\]

The posterior probability can thus be decomposed into the ratio between the probability of the observation vector sequence O(p) given the phoneme p, times the prior of p, and the sum, over every phoneme q in the set Q of all phonemes, of the probability of O(p) given q times the prior of q.

Assuming that all phonemes are equally likely, so that the priors P(p) and P(q) are the same, and that the sum can be approximated by its maximum term, the GOP can be approximated as in Eq. 4.2.

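\[
\begin{aligned}
GOP(p) &= \frac{\left|\log P(p \mid O(p))\right|}{NF(p)} = \frac{1}{NF(p)} \left| \log \frac{P(O(p) \mid p)\, P(p)}{\sum_{q \in Q} P(O(p) \mid q)\, P(q)} \right| \\
&\approx \frac{1}{NF(p)} \left| \log \frac{P(O(p) \mid p)}{\sum_{q \in Q} P(O(p) \mid q)} \right| \approx \frac{1}{NF(p)} \left| \log \frac{P(O(p) \mid p)}{\max_{q \in Q} P(O(p) \mid q)} \right|
\end{aligned}
\tag{4.2}
\]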

The GOP(p) value is always equal to or greater than zero. The greater the value, the more likely it is that there is a mispronunciation. In contrast, the nearer it is to zero, the more probable it is that the pronunciation is that of a native speaker [38].
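As an illustration of the approximated score, the following Python sketch computes a frame-averaged GOP for one phoneme segment from per-frame posteriors. It assumes equal priors, so that the likelihood ratio of Eq. 4.2 can be replaced by a ratio of posteriors within each frame; the function name `gop_segment`, the data layout and the constant `eps` are hypothetical.

```python
import math

def gop_segment(posteriors, target):
    """Frame-averaged GOP of one phoneme segment (sketch of Eq. 4.2).

    posteriors: list of frames; each frame is a list with one posterior
                probability per phoneme.
    target:     index of the phoneme expected from the transcription.
    Assumes equal priors, so the per-frame ratio of posteriors stands in
    for the likelihood ratio.
    """
    eps = 1e-12  # avoids log(0); illustrative value
    total = 0.0
    for frame in posteriors:
        ratio = (frame[target] + eps) / (max(frame) + eps)
        total += abs(math.log(ratio))
    return total / len(posteriors)

# Target phoneme (index 0) is the most likely one in every frame: GOP is 0.
good = gop_segment([[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]], target=0)
# Another phoneme dominates: the GOP grows.
bad = gop_segment([[0.1, 0.8, 0.1]], target=0)
```

When the expected phoneme has the highest posterior in every frame the score is exactly zero, which matches the interpretation given above.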


4.2.3 NGOP

In order to reduce the influence of extreme values or outliers in the data set without having to remove them, sigmoidal normalization was applied. This way all the data is included and, since this normalization is almost linear near the mean value, the spread of the data around the mean is preserved. The normalized data lies in the range between 0.0 and 1.0 [15].

This normalization takes the raw GOP score and maps it to a score in the former range, called NGOP (normalized GOP). That is,

\[
NGOP = \mathrm{sigmoid}(s_u) = \frac{1}{1 + \exp(-\alpha s_u + \beta)} \tag{4.3}
\]

where s_u is the raw GOP score and the parameters alpha and beta are found empirically: alpha controls how rapidly the maximum value is reached, and beta sets the value on the abscissa at which the scale starts. This also makes it easier to visualize the boundary between a good and a bad score [15] [29].
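A minimal Python sketch of this normalization is given below. It follows the sign convention of Eq. 4.3 as written; the default values of `alpha` and `beta` are illustrative assumptions that place the midpoint of the curve at GOP = beta/alpha = 1, and are not claimed to be the values used in the experiments.

```python
import math

def ngop(gop, alpha=10.0, beta=10.0):
    """Sigmoidal normalization of a raw GOP score (sketch of Eq. 4.3).

    alpha sets the steepness of the curve; beta/alpha is the GOP value
    mapped to 0.5. Both defaults are illustrative assumptions.
    """
    return 1.0 / (1.0 + math.exp(-alpha * gop + beta))

low = ngop(0.0)    # close to 0: native-like pronunciation
mid = ngop(1.0)    # midpoint of the curve
high = ngop(1.4)   # close to 1: likely mispronunciation
```

With these assumed parameters, raw GOP values below about 0.5 are compressed towards 0 and values above about 1.5 towards 1, so that outliers no longer dominate the scale.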

Fig. 7- GOP and NGOP relation (horizontal axis: GOP, from 0 to 1.4; vertical axis: NGOP, from 0 to 1).

4.2.4 Threshold

In order to decide when a GOP score starts to indicate an incorrect utterance, a threshold must be established. The threshold is different for each phoneme and can be calculated in two different ways. First, given a training corpus of native speakers, the threshold can be calculated as the mean of the GOP for each phoneme. Secondly, it can use not only the mean but also the standard deviation (std) of the GOP, through the expression:

\[
T_{p1} = \mu_p + \alpha \, \mathrm{std}_p + \beta \tag{4.4}
\]

where α and β are empirically determined scaling constants [1].
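The two ways of estimating the threshold can be sketched in Python as follows. The function name `phoneme_threshold` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not state which estimator was used.

```python
import statistics

def phoneme_threshold(native_scores, alpha=0.0, beta=0.0):
    """Threshold of one phoneme from native scores (sketch of Eq. 4.4).

    With alpha = 0 and beta = 0 this reduces to the first method,
    i.e. the mean of the native scores.
    """
    mu = statistics.mean(native_scores)
    std = statistics.pstdev(native_scores)  # assumption: population std
    return mu + alpha * std + beta

scores = [0.1, 0.2, 0.3]  # hypothetical native scores of one phoneme
t_mean = phoneme_threshold(scores)
t_scaled = phoneme_threshold(scores, alpha=0.5, beta=0.1)
```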

Another phoneme-dependent threshold, proposed by [27], uses data from human labeling and is determined by averaging the normalized rejection counts over all speakers:

\[
T_{p2} = \log \left( \frac{1}{N} \sum_{n=1}^{N} \frac{c_n(p)}{\sum_{m=1}^{M} c_n(m)} \right) \tag{4.5}
\]

where c_n(p) is the total number of times that phoneme p, uttered by speaker n, was marked as mispronounced by one of the human judges in the database, M is the total number of phonemes and N is the total number of speakers [27]. However, this introduces another level of subjectivity, which depends on how the human judges decide to label; consequently, the former estimation was selected [1] [37].
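For completeness, Eq. 4.5 can be sketched in Python as below, although this estimation was not the one selected. The function name `rejection_threshold`, the dictionary layout of the counts and the use of the natural logarithm are assumptions of the sketch.

```python
import math

def rejection_threshold(counts, p):
    """Threshold from human rejection counts (sketch of Eq. 4.5).

    counts: one dict per speaker, mapping each phoneme to the number of
            times the judges marked it as mispronounced for that speaker.
    p:      phoneme whose threshold is wanted.
    """
    ratios = []
    for speaker in counts:
        total = sum(speaker.values())
        # Normalized rejection count of p for this speaker.
        ratios.append(speaker.get(p, 0) / total if total else 0.0)
    mean_ratio = sum(ratios) / len(ratios)
    # The logarithm is undefined when p was never rejected.
    return math.log(mean_ratio) if mean_ratio > 0 else float("-inf")

# Hypothetical rejection counts for two speakers and two phonemes.
counts = [{"a": 1, "b": 3}, {"a": 2, "b": 2}]
t_a = rejection_threshold(counts, "a")
```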


4.2.5 GOP for fluent speech

The previous expression can be satisfactory for a single phoneme, but its effectiveness can be limited for fluent speech. A different approach is to determine the acoustic boundaries and the corresponding likelihoods from Viterbi decoding. Firstly, the numerator of the GOP equation is calculated using a forced alignment network, in which the sequence of phoneme models is fixed by the known transcription. Secondly, the denominator is calculated using an unconstrained phoneme loop network [1].

\[
GOP(p) \approx \left| \frac{\log P(O(p) \mid q_i)}{f_e - f_s} - \sum_{j=1}^{N} \frac{\log P(O(p) \mid q_{ij})}{f_{je} - f_{js}} \right| \tag{4.6}
\]

where f_js and f_je denote the start and end frame numbers of the j-th phone q_ij occurring in the phoneme loop during the current interval from f_s to f_e, and N is the number of phonemes that contribute to this likelihood. This way, the alignment of the phoneme loop will differ from the forced alignment when there is an incorrect utterance. This method is preferred for long texts. However, since the quality of the corpus (see section 5.1) is not high and the audio recordings had to be divided into short sentences, it does not bring an advantage large enough to justify the much larger number of computations [1] [37].

Fig. 8- GOP for fluent speech [1].


4.2.6 Overall score

The overall word score is obtained as a weighted sum of the NGOP of each phoneme in the word, and there are two ways of doing so: using different weights or using equal weights. The latter constitutes an arithmetic mean which, despite simplifying the calculations, may not take into account characteristics of the data, e.g. the different thresholds or the fact that certain NGOP values are not as precise as others [1] [37].

\[
PS(\mathrm{word}) = \sum_{k=1}^{N} \omega_k \cdot NGOP(\mathrm{phoneme}_k) \tag{4.7}
\]
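The weighted sum of Eq. 4.7 can be sketched in Python as follows. The function name `word_score` and the example values are hypothetical; when no weights are given, equal weights are assumed, which yields the arithmetic mean.

```python
def word_score(ngops, weights=None):
    """Overall word score as a weighted sum of phoneme NGOPs (Eq. 4.7).

    ngops:   NGOP of each phoneme in the word.
    weights: one weight per phoneme; equal weights (arithmetic mean)
             are used when omitted.
    """
    if weights is None:
        weights = [1.0 / len(ngops)] * len(ngops)
    if len(weights) != len(ngops):
        raise ValueError("one weight per phoneme is required")
    return sum(w * s for w, s in zip(weights, ngops))

phoneme_ngops = [0.1, 0.3, 0.8]  # hypothetical NGOP values of one word
equal = word_score(phoneme_ngops)
weighted = word_score(phoneme_ngops, weights=[0.5, 0.25, 0.25])
```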

4.2.7 Performance measure

To analyse the performance of the NGOP classification algorithm for a given threshold, four decision types can be defined: correctly accepted (CA) phoneme realizations, when phonemes that were pronounced correctly are also judged as correct; correctly rejected (CR), when phonemes that were pronounced incorrectly are judged as incorrect; falsely accepted (FA), when phonemes that were mispronounced are erroneously judged as correct; and falsely rejected (FR), when phonemes that were pronounced correctly are judged as incorrect [1] [39].

To achieve a good performance, the algorithm has to be able not only to detect mispronunciations but also to avoid classifying them as correct articulations. As a result, the performance of the scoring can be defined by the scoring accuracy (SA):

\[
SA = \frac{CA + CR}{CA + CR + FA + FR} \times 100 \tag{4.8}
\]

where the objective is to achieve optimal performance by maximizing the scoring accuracy while minimizing the false acceptances. Other useful performance measures are the precision (the number of correct results divided by the number of all returned results), the recall (the number of correct results divided by the number of results that should have been returned) and the F-measure (the harmonic mean of the precision and the recall) of correctly accepted or correctly rejected phoneme realizations [1]:

\[
\text{Precision of } CA = \frac{CA}{CA + FA} \times 100 \tag{4.9}
\]

\[
\text{Precision of } CR = \frac{CR}{CR + FR} \times 100 \tag{4.10}
\]

\[
\text{Recall of } CA = \frac{CA}{CA + FR} \times 100 \tag{4.11}
\]

\[
\text{Recall of } CR = \frac{CR}{CR + FA} \times 100 \tag{4.12}
\]

\[
F_{measure} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4.13}
\]
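The measures of Eqs. 4.8 to 4.13 can be computed from the four decision counts as in the Python sketch below. The function name `scoring_metrics`, the dictionary keys and the example counts are hypothetical, and the sketch assumes that none of the denominators is zero.

```python
def scoring_metrics(ca, cr, fa, fr):
    """Performance measures from the four decision counts (Eqs. 4.8-4.13).

    All values are percentages. Assumes non-zero denominators.
    """
    def f_measure(precision, recall):
        return 2 * precision * recall / (precision + recall)

    precision_ca = 100.0 * ca / (ca + fa)
    precision_cr = 100.0 * cr / (cr + fr)
    recall_ca = 100.0 * ca / (ca + fr)
    recall_cr = 100.0 * cr / (cr + fa)
    return {
        "SA": 100.0 * (ca + cr) / (ca + cr + fa + fr),
        "precision_CA": precision_ca,
        "precision_CR": precision_cr,
        "recall_CA": recall_ca,
        "recall_CR": recall_cr,
        "F_CA": f_measure(precision_ca, recall_ca),
        "F_CR": f_measure(precision_cr, recall_cr),
    }

# Hypothetical counts for one phoneme.
metrics = scoring_metrics(ca=70, cr=15, fa=5, fr=10)
```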


4.3 Other classification methods

4.3.1 Likelihood Ratio

This classification method, proposed by [26] [27], uses a likelihood ratio instead of the GOP score. If x is the intended phoneme in forced alignment and y is the phoneme resulting from the free alignment, the likelihood ratio of the acoustic segment O is:

\[
LR(x, y, O) = \log \frac{P(O \mid x)}{P(O \mid y)} \tag{4.14}
\]

The Likelihood Ratio (LR) is useful if the information about y is available as well. The LR is based on binary classification, determining whether O is more like x or y. If the LR score is higher than 0, the segment O is judged as correct; otherwise it is not. This score shows good results [26] [27], but it also implies that more scores have to be calculated, which may not be viable in a web application.

4.3.2 MFCC and DTW based evaluation

This scoring method, proposed by [25], begins by analyzing the waveform and calculating the Mel-Frequency Cepstrum Coefficients (MFCC), since they are the most commonly used acoustic features in speech processing systems. Then the Euclidean distance between the MFCC of the student's pronunciation and those of the standard pronunciation is calculated using the Dynamic Time Warping (DTW) algorithm. This score, in conjunction with a comparison with a standard length of the phoneme, gives an evaluation of the pronunciation. While it is interesting that this scoring method uses the length of the phonemes, it also implies that the standard lengths and MFCC data need to be stored, which can be computationally heavy. Furthermore, this scoring method only helps students learn isolated phonemes and not fluent speech.
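The DTW step can be illustrated with the textbook form of the algorithm in Python. This is not the implementation of [25]: the function name `dtw_distance` is hypothetical, the feature vectors are toy values standing in for MFCC frames, and no path constraints or length normalization are applied.

```python
import math

def dtw_distance(a, b):
    """Accumulated Euclidean distance along the best DTW path.

    a, b: sequences of feature vectors (e.g. one MFCC vector per frame).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

reference = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
# Same trajectory uttered more slowly: one frame is repeated.
slower = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
d_same = dtw_distance(reference, slower)
d_diff = dtw_distance([(0.0,), (1.0,)], [(0.0,), (2.0,)])
```

Because the warping absorbs differences in speaking rate, the duration of the phoneme has to be compared separately, as the method described above does.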


5 Experiments and results


This chapter provides the validation of the steps for the implementation of the project and the results

obtained.


5.1 Corpus

The native corpus is composed of 15 people from the Lisbon area, 7 males and 8 females, aged between 22 and 24 years old. The non-native corpus is composed of one 23-year-old Venezuelan female, whose native language is Spanish, and a group of 11 Bulgarians, 6 males and 5 females, aged between 27 and 42 years old. The speakers were asked to read several sentences, which yielded over 14000 phonemes from the native speakers and 9600 from the non-native speakers.

The sentences were recorded using a high-quality head-mounted microphone, in mono, with 16-bit resolution and a sampling rate of 16 kHz. The text prompts were:

1. “Os industriais preveem uma diminuição da produção. Os empresários da

construção apontam para um recuo da procura. No comércio a retalho

espera-se uma evolução desfavorável do volume de negócios. A

confiança entre os consumidores vem conhecendo um forte recuo desde

Abril”;

2. “Na noite de segunda-feira por motivos ainda não totalmente apurados

os companheiros fizeram barulho antes de tempo o jovem acabou por ser

enleado na corda e arrastado por vários quilómetros”;

3. “O projecto está avaliado em cerca de um milhão de contos e pretende

evitar a entrada na lagoa de são martinho de grandes quantidades de

poluição evitam-se desta forma os transtornos verificados numa das

zonas mais importantes da região sobretudo na época estival”;

4. “O vento norte e o sol discutiam qual dos dois era o mais forte

quando passou um viajante envolto num casaco. Ao vê-lo apostaram que

aquele que primeiro conseguisse obrigar o viajante a tirar o casaco

seria considerado o mais forte. O vento norte começou a soprar com

muita força mas quanto mais soprava mais o viajante se embrulhava no

seu casaco até que o vento norte desistiu. O sol brilhou então com

toda a intensidade. E imediatamente o viajante tirou o casaco O vento

norte teve assim de reconhecer a superioridade do sol”.

The transcriptions of the phrases were altered whenever a speaker misread the text and substituted one word for another.

The native speakers' corpus was used to compute the threshold. The non-native corpus, separated into groups by native language, was used to test and adjust that threshold. It should be noted that the quality of the non-native corpus is not the best: many sentences contain several pauses and gasps, because the degree of difficulty of the sentences was not in accordance with the level of the students. In many cases the alignment was difficult, especially from the middle to the end of each track. To improve the alignment, the tracks were divided into simple sentences, which provided a slight improvement; nonetheless, the disparities still caused several misalignments.


5.2 Implementation

Using Audimus, the posterior probabilities are computed in forced alignment mode for the speech of the native speakers. These are used in the approximated expression Eq. 4.2 to calculate the GOP score. Then the normalization is applied with alpha = 10 and beta = -10, giving the NGOP. For comparison, the threshold was estimated for both the GOP and the NGOP scores.

The first part of the implementation is repeated for the non-native corpus, but this time each phoneme resulting from the alignment is classified by a human jury as either a good or a bad pronunciation. This classification is compared with the classification obtained by applying the predefined threshold to the score, and the SA is calculated.

For the first tests, the mean of the scores of the native speakers was established as the threshold. After testing the score on the non-native data, the threshold of the phonemes with an SA below 70% was raised using the expression Eq. 4.4 with alpha = 0.5 and beta = 0.1. Since the corpus is limited, the results can be inexact for certain phonemes.
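The threshold adjustment described above can be summarized by the Python sketch below, for a single phoneme. The function names, the data layout and the rule that a score at or below the threshold counts as an accepted pronunciation are assumptions of the sketch; the population standard deviation is also assumed.

```python
import statistics

def scoring_accuracy(scores, labels, threshold):
    """SA in percent: agreement between the system and the human jury.

    scores: NGOP of each realization of the phoneme.
    labels: True when the jury judged the realization as correct.
    Assumption: the system accepts a realization when score <= threshold.
    """
    hits = sum((s <= threshold) == ok for s, ok in zip(scores, labels))
    return 100.0 * hits / len(scores)

def adapt_threshold(native_scores, test_scores, test_labels,
                    alpha=0.5, beta=0.1, min_sa=70.0):
    """Start from the native mean; raise it with Eq. 4.4 if the SA is low."""
    mu = statistics.mean(native_scores)
    if scoring_accuracy(test_scores, test_labels, mu) >= min_sa:
        return mu
    return mu + alpha * statistics.pstdev(native_scores) + beta

# Hypothetical data for one phoneme.
native = [0.0, 0.1, 0.2]
non_native = [0.15, 0.18, 0.9, 0.05]
jury = [True, True, False, True]
threshold = adapt_threshold(native, non_native, jury)
sa_after = scoring_accuracy(non_native, jury, threshold)
```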


5.3 Results

5.3.1 Threshold for Native speakers

Before computing the threshold, it is crucial to validate the approximation in Eq. 4.2 and to verify whether the NGOP actually presents an improvement over the GOP.

5.3.1.1 Approximated vs exact GOP

First of all, it is important to verify the accuracy of the approximation in Eq. 4.2. For that purpose, the GOP score with and without the approximation was calculated for several sentences (native and non-native). The approximation is valid, since the two scores diverge on average by 0.0668 for GOP and 0.0521 for NGOP, with the maximum divergence occurring at higher scores. The figure below illustrates the comparison for some phonemes.

Fig. 9- Example of exact and approximated GOP

5.3.1.2 Comparison between GOP and NGOP

To compare the values of GOP and NGOP, the average standard deviation was first calculated, yielding 0.476 for GOP and 0.213 for NGOP. This means that there is a smaller divergence between values in NGOP, which allows a better classification. Secondly, it is important to verify whether the normalization improves the scoring. The GOP scores are more widely spread, while the NGOP values are concentrated at the extremes. Furthermore, the NGOP values are confined to the interval between 0 and 1, while the GOP scores are unbounded, and there are fewer borderline cases in NGOP.

As seen in Figure 10, the SA improves substantially when the normalized version is used instead of the non-normalized one.


Fig. 10- NGOP's and GOP's SA comparison for each phoneme (two panels, one group of bars per phoneme; vertical axis: SA; series: NGOP and GOP).

Since both assumptions are correct, the mean of the NGOP of each phoneme uttered by all native speakers was computed and established as a preliminary threshold, to be adjusted later with the tests on the non-native data.

Phoneme          Mean       N. occurrences    Phoneme   Mean       N. occurrences
u                0.140049   1416              o         0.02793    214
6                0.08742    1092              w~        0.098509   213
r                0.166272   962               w         0.334389   145
t                0.005636   880               l~        0.142358   143
d                0.127353   745               f         0.006546   142
@                0.230987   727               o~        0.057922   142
S                0.005135   612               e         0.053197   136
k                0.015536   567               z         0.252077   131
i                0.040606   527               E         0.019217   128
interwordpause   0.005403   526               Z         0.016086   117
a                0.004149   525               l         0.039551   116
s                0.014778   504               u~        0.315518   112
6~               0.065146   449               b         0.050564   101
v                0.044119   389               g         0.051099   87
j                0.043568   369               R         0.089092   84
m                0.017172   345               i~        0.06171    72
p                0.005864   340               L         0.040425   72
O                0.052711   266               j~        0.195488   72
n                0.086556   259               J         0.089502   57
e~               0.058275   245

Table 5- Mean and number of occurrences by phoneme

The values vary between 0.33 for /w/ and 0.005 for the inter-word pause. There are problems with vowel reduction, such as with /@/, /u/ and /6/, and some problems with word co-articulation, which result from the training of Audimus not taking these phenomena into account.

To better examine the results, and since a perfect pronunciation implies that the score is 0, i.e. that the posterior probability (PP) of the phoneme is the maximum PP, a confusion matrix was created in order to count which phoneme has the maximum PP (presented in Annex 1). It is noticeable that for native speakers the maximum is in most cases the phoneme itself, with the exception of some /z/ being pronounced as /S/ and some /u/ and /@/ being deleted, not only in intra-word position but also at word boundaries, as predicted in 3.1.1.
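The counting behind such a confusion matrix can be sketched in Python as follows. The function name `confusion_counts`, the frame-level data layout and the three-phoneme inventory are hypothetical and serve only to illustrate how the phoneme with the maximum PP is tallied against the expected one.

```python
from collections import Counter

def confusion_counts(frames, phones):
    """Count, for each expected phoneme, which phoneme has the maximum PP.

    frames: list of (expected_phoneme, posteriors) pairs.
    phones: phoneme labels, in the same order as the posteriors.
    Returns a Counter keyed by (expected, phoneme_with_maximum_PP).
    """
    counts = Counter()
    for expected, post in frames:
        best_index = max(range(len(post)), key=post.__getitem__)
        counts[(expected, phones[best_index])] += 1
    return counts

phones = ["z", "S", "u"]
frames = [
    ("z", [0.7, 0.2, 0.1]),  # maximum PP on the expected phoneme
    ("z", [0.3, 0.6, 0.1]),  # /z/ confused with /S/
    ("u", [0.1, 0.1, 0.8]),
]
matrix = confusion_counts(frames, phones)
```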


5.3.1 SA results for all non-native

Despite the larger number of occurrences of each phoneme in the native corpus, it is noticeable in Figure 11 that there are more scores near one (mispronunciations) in the non-native data (red) than in the native data (blue). The scores are also more spread over the [0,1] interval in the non-native data. The average SA is 77.04% and the F-measure is 83.30%. These results are not so interesting since, as explained before, the difficulties for each language are different. Starting from the threshold found previously, the SA was measured, and for the phonemes with low accuracy (SA < 70%) the formula Eq. 4.4 was applied with alpha = 0.5 and beta = 0.1.

Fig. 11- NGOP for each phoneme for native (blue) and non-native (red) with native mean and std above. One histogram panel per phoneme (horizontal axis: NGOP, from 0 to 1; vertical axis: count). Panel titles (phoneme, mean, std): A 0.186 0.373; a 0.195 0.380; A~ 0.198 0.376; b 0.371 0.470; S 0.188 0.375; d 0.281 0.436; E 0.257 0.399; e 0.245 0.413; @ 0.230 0.401; e~ 0.290 0.443; f 0.320 0.460; g 0.399 0.481; i 0.326 0.458; i~ 0.286 0.445; ip 0.024 0.146; Z 0.341 0.468; k 0.193 0.382; l 0.357 0.469; l~ 0.317 0.453; L 0.416 0.494; m 0.219 0.401; n 0.239 0.420; J 0.545 0.496; O 0.189 0.367; o 0.146 0.328; o~ 0.230 0.415; p 0.186 0.377; r 0.300 0.456; R 0.423 0.480; s 0.240 0.415; t 0.167 0.363; u 0.222 0.390; u~ 0.259 0.429; v 0.425 0.486; w 0.487 0.489; w~ 0.221 0.408; j 0.356 0.460; j~ 0.464 0.479; z 0.496 0.497.

5.3.2 SA results for Spanish

Observing the results of the confusion matrix for Spanish (Annex 1.2), there is, as expected, a difficulty in the pronunciation of nasalized vowels, since these do not exist in Spanish. Likewise, open-mid and mid vowels are often replaced by the closest-sounding phoneme. There were no considerable difficulties in the pronunciation of /v/, but there was significant mislabeling of /S/, /z/ and /Z/.

Fig. 12- SA for each Phoneme in Spanish

For the phonemes /e/, /@/, /o/, /u/, /u~/, /l~/ and /w/ the threshold was modified. However, although this improved the SA in the majority of the cases, the accuracy did not surpass 70%. Also, although some phonemes have an accuracy of 1 (perfect accuracy), this does not necessarily mean that the pronunciation is perfect; it can also mean that there are not enough occurrences of the phoneme. The average SA is 82.53% and the average F-measure is 88.45%.

With these results, the thresholds for Spanish are:

Phoneme   Threshold    Phoneme          Threshold    Phoneme   Threshold
6         0.08742      i~               0.16171      p         0.105864
a         0.104149     interwordpause   0.105403     r         0.266272
6~        0.165146     Z                0.116086     R         0.189092
b         0.150564     k                0.115536     s         0.114778
S         0.105135     l                0.139551     t         0.105636
d         0.227353     l~               0.399668     u         0.397042
E         0.119217     L                0.140425     u~        0.631825
e         0.248972     m                0.117172     v         0.144119
@         0.528522     n                0.186556     w         0.653362
e~        0.158275     J                0.189502     w~        0.198509
f         0.106546     O                0.152711     j         0.143568
g         0.151099     o                0.189265     j~        0.295488
i         0.140606     o~               0.157922     z         0.352077

Table 6- Threshold of each phoneme for Spanish


5.3.3 SA results for Bulgarian

Observing the results of the confusion matrix for Bulgarian, despite reasonable results for the nasalized phonemes, there is, as expected, confusion in the vowels. For coronal consonants, the maximums are more widely distributed among other phonemes.

Fig. 13– SA for each Phoneme in Bulgarian

Here, 21 phonemes had an SA lower than 70%, and their threshold was also modified. In this case too, although this improved the SA in the majority of the cases, the accuracy did not surpass 70%. Moreover, since there were more phoneme samples in this case, the SA decreased, and no phoneme has an accuracy equal to 1. The average SA is 73.23% and the average F-measure is 76.66%. The lower scores can also be explained by the quality of the audio tracks provided.

With these results, the thresholds for Bulgarian are:

Phoneme          Threshold
6                0.315838
a                0.004149
6~               0.273898
b                0.050564
S                0.005135
d                0.127353
E                0.169588
e                0.248972
@                0.528522
e~               0.058275
f                0.006546
g                0.051099
i                0.2325
i~               0.276879
interwordpause   0.140755
Z                0.016086
k                0.166636
l                0.225398
l                0.142358
L                0.040425
m                0.017172
n                0.322884
J                0.089502
O                0.259354
o                0.02793
o~               0.260756
p                0.005864
r                0.450067
R                0.328808
s                0.170876
t                0.005636
u                0.397042
u~               0.315518
v                0.044119
w                0.653362
w~               0.340577
j                0.232914
j~               0.485334
z                0.563233

Table 7 - Threshold of each phoneme for Bulgarian

[Bar chart: SA (vertical axis, 0 to 1) for each phoneme (horizontal axis)]

5.3.4 Comparison, Critique and Evaluation

Since the mathematical approximation is viable and the normalization reduces the outliers and confines the values to the interval [0,1], the computation of NGOP is validated for speech evaluation. The SA values for Spanish and Bulgarian are satisfactory, despite the poor recording conditions of the corpora, but they can also be influenced by how many times a phoneme occurs. A phoneme that occurs more frequently is better evaluated. It was also noted that, despite the good phoneme-based results, some phonemes were not uttered with the same duration as a native speaker would use, which makes the complete word or sentence sound unnatural. This implies that further improvements should take duration into account. Duration was not implemented here for two reasons. Firstly, Audimus already measures the duration of each phoneme: if a segment does not last at least a minimum time interval, computed by averaging the duration of that phoneme in the corpus on which Audimus was trained, it is not counted as that phoneme. Secondly, even if the Audimus interval is incorrect, adding a time variable would require scrutinizing every single phoneme in the text and storing its average duration at that particular position in the sentence, which would make the interface less adaptable to further updates.

5.3.4.1 Performance measures

With the thresholds established, and considering that this is a subjective technique, three performance measures were also computed to compare the scoring between the transcriptions of two judges, or between one judge and the automatic NGOP. This allows a cross-validation between judges in terms of the number of errors that each one finds.

The first measure is strictness, which measures how strict a judge is. It also shows how subjective judgment interferes with the borderline cases between correct and incorrect. The strictness of a judge's labelling can be defined as the overall fraction of phones that are rejected, i.e. the relative strictness.

$$S = \frac{\text{Count of Rejected Phonemes}}{\text{Total Count of Phonemes}} \qquad \text{Eq 5.1}$$

To compare two judges, J1 and J2, one simply computes the absolute difference between their strictness values.

$$\delta_S = \left| S_{J1} - S_{J2} \right| \qquad \text{Eq 5.2}$$

The second measure is the agreement, which takes into account whether each phoneme is considered mispronounced or not by two different judges.

$$A_{J1,J2} = 1 - \frac{1}{N} \left\lVert X_{J1} - X_{J2} \right\rVert_C \qquad \text{Eq 5.3}$$

where $\lVert X \rVert_C = \sum_{i=0}^{N-1} |x(i)|$, $X$ is a vector of size $N$ (the total number of phonemes in the sentence) and $x(i) \in \{0, 1\}$, with 0 for phonemes classified as correct and 1 for the others.

The last one, the cross-correlation, measures the overall agreement between the reference and the detected errors, i.e. the similarity between all segments which contain rejections in the transcriptions. It is measured by

$$CC_{J1,J2} = \frac{X_{J1}^{T} X_{J2}}{\lVert X_{J1} \rVert_E \, \lVert X_{J2} \rVert_E} \qquad \text{Eq 5.4}$$

where $\lVert X \rVert_E = \sqrt{\sum_{i=0}^{N-1} x(i)^2}$ is the standard Euclidean norm [1].
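The strictness, agreement and cross-correlation measures (Eq 5.1 to Eq 5.4) can be illustrated with a minimal Python sketch. It assumes that each judge's labels for a sentence are stored as a list of 0/1 values (0 for a phoneme judged correct, 1 for a rejected one), following the definition of X. The function names and the example labels are hypothetical and are not taken from the thesis data. Raising an error when a judge rejects nothing, a case in which Eq 5.4 is undefined, is a choice of this sketch only.

```python
import math


def strictness(labels):
    """Eq 5.1: fraction of phonemes a judge rejected (label 1)."""
    return sum(labels) / len(labels)


def strictness_difference(labels_j1, labels_j2):
    """Eq 5.2: absolute difference between two judges' strictness."""
    return abs(strictness(labels_j1) - strictness(labels_j2))


def agreement(labels_j1, labels_j2):
    """Eq 5.3: one minus the fraction of phonemes labelled differently."""
    n = len(labels_j1)
    city_block = sum(abs(a - b) for a, b in zip(labels_j1, labels_j2))
    return 1 - city_block / n


def cross_correlation(labels_j1, labels_j2):
    """Eq 5.4: normalised inner product of the two rejection vectors."""
    dot = sum(a * b for a, b in zip(labels_j1, labels_j2))
    norm_j1 = math.sqrt(sum(a * a for a in labels_j1))
    norm_j2 = math.sqrt(sum(b * b for b in labels_j2))
    if norm_j1 == 0 or norm_j2 == 0:
        # Undefined when a judge rejects nothing (choice made for this sketch).
        raise ValueError("cross-correlation is undefined without rejections")
    return dot / (norm_j1 * norm_j2)


# Hypothetical labels for one 8-phoneme sentence.
j1 = [0, 1, 0, 0, 1, 0, 1, 0]
j2 = [0, 1, 0, 1, 1, 0, 0, 0]

print(strictness(j1), strictness(j2))
print(strictness_difference(j1, j2))
print(agreement(j1, j2))
print(round(cross_correlation(j1, j2), 2))
```

In this example both judges reject 3 of 8 phonemes, so their strictness is identical (0.375), yet they disagree on two phonemes, which gives an agreement of 0.75 and a cross-correlation of about 0.67. This shows why strictness alone is not enough to compare judges.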

With the performance measures described above, a small study was conducted to compare the ratings between human judges. The inter-judge correlation was measured on 20 calibration sentences (8 from natives and 12 from non-natives) for 3 native judges. The results were calculated by averaging A, CC and PC for each judge in relation to all the others.

Performance measure   AA     CC     PC
Average               0.87   0.45   0.74

Table 8 - Average performance measures

For the comparison with the NGOP, after adjusting the threshold to give the best possible score, each judge obtained the AA, CC and PC values below.

Judge        AA     CC     PC
J1 vs NGOP   0.86   0.52   0.79
J2 vs NGOP   0.88   0.40   0.67
J3 vs NGOP   0.87   0.43   0.69

Table 9 - Judges vs GOP scoring

Evaluating these results, we find that the judges' views did not differ greatly from the NGOP; the differences depended on each judge's subjective interpretation. A stricter judge would scale down the results.

6 Interface

This chapter provides an overview of the interface into which the GOP module was integrated.

6.1 VITHEA project

VITHEA (Virtual Therapist for Aphasia treatment - Terapeuta Virtual para o tratamento da Afasia) is a

software program for the treatment of aphasic patients, particularly those that show difficulties when

recalling words, incorporating recent advances of speech and language technology. It was created

by L2F (Spoken Language Systems Lab - Laboratório de sistemas de Língua Falada) as part of the

INESC (Institute for Systems and Computer Engineering - Instituto de Engenharia de Sistemas e

Computadores) and by LEL (Language Research Laboratory - Laboratório de Estudos de

Linguagem) as part of the Department of Clinical Neurosciences of the Lisbon Faculty of Medicine

and the hospital Santa Maria.

The software acts as a "virtual therapist", asking the patient to recall the content of a photo or picture that is shown. Using automatic speech recognition (ASR) technology, the program is able to recognize what the patient said and to validate whether it was correct. The "virtual therapist" can provide help whenever the user asks for it, both semantically and phonologically, either as a written solution or as synthesized speech based on text-to-speech (TTS) technology [40].

6.2 3P Interface

The VITHEA project interface was adapted to create the 3P interface; however, due to time restrictions, it was not possible to test the interface as a whole. Nevertheless, the VITHEA interface has been thoroughly tested and performed well. It uses a JSP/Servlet server, connected to Audimus, a database management system and the internet, to host a Flash application available in a web browser.

Fig. 14 - 3P Interface

The interface shows on the left side what the student is supposed to repeat, and on the right side the sound recording and the feedback instructions player.

7 Conclusions

This chapter finalises this work, summarising the conclusions and pointing out aspects to be developed in future work.

7.1 Conclusions

In this thesis, an approach was explored for an interactive system that helps people learn a new language. It was motivated by the fact that EP learners find it difficult to find material for learning the language. Creating an interactive system that grades the different phonemes and words of European Portuguese therefore gives students an opportunity to study and reach pronunciation proficiency through self-study, as opposed to requiring the sole attention of a human teacher.

Gaining pronunciation proficiency as an adult is a difficult task that requires a great deal of training, repetition and memorization, so it makes sense to use automatic systems to aid learning. CALL explores techniques for developing an automatic teacher for pronunciation learning. This helps students train by repeating the same words or phrases while receiving a classification of their performance.

Portuguese is a Romance language with approximately 215-220 million native speakers and 260 million speakers in total, over 10 million of whom speak EP. Within EP there are two main dialects: northern and central/southern. Here we explored the central/southern dialect, spoken in the capital, Lisbon. This dialect is characterized mainly by the substitution of /ei/ and /ou/ by /6j/ and /o/. Spoken EP has the particularity of having vowel reductions and several co-articulations. Phonetically, EP is composed of 39 phonemes.

The native languages in the non-native corpus are Spanish and Bulgarian, which, in comparison with EP, are characterized by the lack of nasalised sounds and of open or semi-vowels, and by the absence of certain consonants.

This thesis used one of the most standard methods, first analyzed by Witt (1999), which is based on a pronunciation performance measure named GOP. A sigmoidal transformation was applied to reduce the outliers and confine the values to the range between 0 and 1, as opposed to 0 to infinity. This measure is referred to as NGOP. There are also other projects that focus on scoring phonemes, using for example MFCC and DTW, but according to their documentation these did not surpass the GOP either in efficiency or in ease of computation.

The system therefore calculates the NGOP on the audio, segmented by phonemes, using the probabilities computed by Audimus. Then, using a corpus of native speakers, the mean and standard deviation are calculated. Using the non-native corpus, divided by language, a threshold was established. This threshold was constructed by first applying the mean as a threshold to the non-native data to obtain the SA of each phoneme. For the phonemes with less than 70% SA, the threshold was increased using the formula in Eq. 4.4.
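The first steps of this threshold procedure can be sketched in Python as follows. This is only an illustration under several assumptions. A phoneme token is treated as accepted when its NGOP does not exceed the threshold, which is consistent with the remark that higher thresholds increase false acceptances. SA is taken here to be the fraction of tokens for which the automatic decision matches a human label. All function names and scores are hypothetical. The threshold increase of Eq. 4.4 is not reproduced, since the formula is defined elsewhere in the thesis; the sketch only flags the phonemes that would need it.

```python
from statistics import mean, stdev


def native_stats(native_scores):
    """Per-phoneme (mean, standard deviation) of NGOP over native speech."""
    return {p: (mean(v), stdev(v)) for p, v in native_scores.items()}


def scoring_accuracy(tokens, threshold):
    """Fraction of tokens where the automatic decision matches the human label.

    tokens: list of (ngop, human_rejected) pairs for one phoneme.
    Assumption: a token is rejected automatically when ngop > threshold.
    """
    hits = sum((score > threshold) == bool(rejected) for score, rejected in tokens)
    return hits / len(tokens)


def phonemes_needing_adjustment(native_scores, nonnative_tokens, min_sa=0.70):
    """Apply the native mean as threshold and flag phonemes with SA below min_sa."""
    stats = native_stats(native_scores)
    flagged = {}
    for phoneme, tokens in nonnative_tokens.items():
        initial_threshold = stats[phoneme][0]  # native mean
        sa = scoring_accuracy(tokens, initial_threshold)
        if sa < min_sa:
            flagged[phoneme] = sa
    return flagged


# Hypothetical NGOP scores, not taken from the thesis corpora.
native = {"a": [0.08, 0.10, 0.12], "u": [0.30, 0.40, 0.50]}
nonnative = {
    "a": [(0.05, 0), (0.30, 1), (0.09, 0), (0.20, 1)],
    "u": [(0.45, 0), (0.50, 0), (0.60, 1), (0.35, 0)],
}

print(phonemes_needing_adjustment(native, nonnative))
```

With these made-up scores, /a/ reaches an SA of 1 with the native mean as threshold, while /u/ reaches only 0.5 and is therefore the phoneme whose threshold would be raised.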

Although the results were satisfactory, the quality of the non-native corpus is not the best, with several pauses and gasps in the sentences, which raises some doubts about the reliability of the results.

For Spanish, since the corpus was too small, some SA values are 1. The threshold was recalculated for seven phonemes, the majority of them vowels that do not exist in the speakers' native language. The average SA is 82.53% and the average F-measure is 88.45%.

For Bulgarian, the scoring was worse, with 21 phonemes having an SA lower than 0.7. The threshold was recalculated for these cases, but there was no major improvement. It was decided not to shift the threshold to higher values, since this would increase the number of falsely accepted scores. The average SA is 73.23% and the average F-measure is 76.66%. The lower scores may also be explained by the quality of the audio recordings provided.

Overall, the SA values for Spanish and Bulgarian are satisfactory, but when listening to the whole sentence or word, some did not sound correct. One of the reasons was that the phonemes did not have the duration of native speech. A variable measuring time could solve this problem, but it would also diminish the simplicity of the NGOP.

Using the thresholds, performance measures were applied with three native judges. Evaluating the results, the judges' views did not differ greatly from the NGOP, but since everything depended on each judge's subjective interpretation, a stricter judge could scale down the results.

7.2 Future work

The principal flaw in this thesis was the lack of good non-native audio files, so a major improvement could be made by recording simpler sentences in accordance with the level of the students. Only two languages were analysed, Spanish and Bulgarian, so a wider and more differentiated corpus would help to accommodate a more varied set of native languages. Another enhancement could be made by retraining Audimus to take into consideration the vowel reductions and the co-articulations. Finally, although the VITHEA interface has been heavily tested, it would be good to test the 3P interface with native and non-native subjects.

Annex 1 – Extra Tables

This annex presents extra tables that support the explanation of the research.

A.1 Example of words using the EP phoneme list

Category                Symbol  Word      Transcription

Consonants
plosives                p       pai       p"aj
                        b       barco     b"arku
                        t       tenho     t"6Ju
                        d       doce      d"os@
                        k       com       ko~
                        g       grande    gr"6~d@
fricatives              f       falo      f"alu
                        v       verde     v"erd@
                        s       céu       s"Ew
                        z       casa      k"az6
                        S       chapéu    S6p"Ew
                        Z       jóia      Z"Oj6
nasals                  m       mar       m"ar
                        n       nada      n"ad6
                        J       vinho     v"iJu
liquids                 l       lanche    l"6~S@
                        L       trabalho  tr6b"aLu
                        r       caro      k"aru
                        R       rua       R"u6

Vowels and diphthongs   i       vinte     v"i~t@
                                lápis     l"apiS
                        e       fazer     f6z"er
                        E       belo      b"Elu
                        a       falo      f"alu
                        6       cama      k"6m6
                                madeira   m6d"6jr6
                        O       ontem     "O~t6~j~
                        o       lobo      l"obu
                        u       jus       Z"uS
                                futuro    fut"uru
                        @       felizes   f@l"iz@S
                        i~      fim       f"i~
                        e~      emprego   e~pr"egu
                        6~      irmã      irm"6~
                        o~      bom       b"o~
                        u~      um        u~
                        aw      mau       m"aw        etc.: iw, ew, Ew, (ow)
                        aj      mais      m"ajS       etc.: ej, Ej, Oj, oj,
                        6~j~    têm       t"6~j~6~j   etc.: e~j~, o~j~, u~j~

Taken from [20]


A.2 Confusion matrices for Portuguese, Spanish and Bulgarian

In this section, the number of times each phoneme is identified as another phoneme is presented. The counts in purple are the times a phoneme is identified as itself, and those in orange are cases in which a large portion of the occurrences of a phoneme were mistaken for another phoneme.
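As an illustration of how such a confusion matrix can be read, the Python sketch below computes, for each phoneme, the fraction of occurrences identified as itself and the most frequent confusion. It assumes that each row corresponds to one phoneme and each column to the phoneme it was identified as; the function names and the small 3x3 matrix are hypothetical and are not an excerpt of the matrices in this annex.

```python
def self_identification_rate(matrix, labels):
    """Fraction of each row's occurrences that fall on the diagonal."""
    rates = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])
        rates[label] = matrix[i][i] / total if total else None
    return rates


def top_confusion(matrix, labels):
    """Most frequent off-diagonal entry of each row, as (phoneme, count)."""
    result = {}
    for i, label in enumerate(labels):
        off_diagonal = [
            (count, labels[j]) for j, count in enumerate(matrix[i]) if j != i
        ]
        count, other = max(off_diagonal)
        result[label] = (other, count)
    return result


# Hypothetical counts for three fricatives (rows: phoneme, columns: identified as).
labels = ["s", "z", "S"]
matrix = [
    [40, 2, 8],
    [6, 10, 4],
    [1, 3, 46],
]

print(self_identification_rate(matrix, labels))
print(top_confusion(matrix, labels))
```

In this toy example /z/ is identified as itself in only half of its occurrences and is most often mistaken for /s/, which is the kind of pattern highlighted in orange in the matrices.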


Confusion matrix for natives

ip b d g p t k s z f v S Z l l~ L r R m n

ip 456 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0

b 0 62 6 1 20 1 1 0 0 1 2 0 0 0 0 0 0 0 1 0

d 14 2 444 7 5 122 4 10 0 0 6 17 1 3 3 2 10 2 3 15

g 1 2 1 58 0 0 17 0 0 1 0 0 0 0 0 0 0 0 0 0

p 18 1 0 0 286 21 5 0 0 1 0 0 0 0 0 0 2 0 0 0

t 21 0 10 0 9 735 36 10 0 5 0 10 0 0 0 1 7 1 0 1

k 24 0 1 0 4 5 508 1 0 0 0 2 0 0 0 0 2 2 0 0

s 10 0 0 0 2 5 2 440 2 13 0 26 0 0 0 0 0 1 0 0

z 1 0 7 2 0 4 0 7 53 1 7 23 13 0 0 0 1 0 0 0

f 2 0 0 0 12 2 3 1 0 120 0 2 0 0 0 0 0 0 0 0

v 3 2 17 5 4 2 3 0 1 4 306 3 1 2 0 0 1 11 6 1

S 15 0 5 0 0 2 1 5 19 3 2 533 16 0 0 0 3 0 0 0

Z 0 0 1 0 0 0 0 0 0 0 1 19 94 0 0 0 1 0 0 0

l 1 0 1 1 0 0 0 0 0 0 0 0 0 83 2 0 0 0 2 1

l~ 4 1 2 0 0 1 0 0 0 0 0 0 0 0 69 0 1 0 2 0

L 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 51 0 0 0 1

r 9 2 5 1 16 9 11 5 0 1 3 21 2 1 1 0 734 3 1 7

R 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 69 0 1

m 2 4 2 0 1 0 0 0 0 1 0 0 0 2 1 0 2 0 270 4

n 4 0 5 0 1 1 1 0 0 2 0 0 2 3 0 0 4 0 13 199

J 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 1

j 1 0 1 2 0 1 2 6 0 0 0 20 2 2 0 3 14 1 0 1

w 0 0 1 1 0 0 12 0 0 0 1 0 0 0 2 0 0 0 1 4

i~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2

o~ 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0

A~ 1 0 0 0 0 1 0 1 0 3 0 0 0 0 0 0 4 1 0 1

e~ 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1

u~ 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 10 1

j~ 0 0 0 0 0 3 0 2 0 0 0 1 0 0 0 0 0 0 0 0

w~ 1 0 2 0 1 0 3 3 0 0 0 0 0 1 5 0 1 1 8 0

E 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 3 0 1 0

O 0 0 2 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0

u 81 0 28 2 24 40 33 22 5 7 10 62 9 20 6 5 56 1 25 24

@ 53 2 12 1 2 28 11 42 1 6 6 66 9 2 1 0 29 5 4 1

i 2 0 1 1 0 0 3 0 0 0 0 0 0 0 0 8 5 0 8 1

e 0 0 0 0 0 0 0 1 0 0 5 0 0 0 0 0 0 0 0 0

A 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1

a 7 1 0 0 1 1 5 4 0 2 0 3 2 2 0 0 22 7 4 3

o 0 0 0 0 2 0 1 0 0 0 0 0 5 0 0 0 0 0 0 0


J j w i~ o~ A~ e~ u~ j~ w~ E O u @ i e A a o

ip 0 0 0 0 0 0 0 0 0 0 0 0 3 3 0 0 0 0 0

b 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 3

d 0 1 0 0 1 2 0 1 0 2 2 0 11 19 2 0 15 4 1

g 0 2 0 1 0 0 0 0 0 0 0 0 2 1 1 0 0 0 0

p 0 0 0 0 0 0 0 0 0 0 1 0 3 0 0 1 0 1 0

t 0 0 0 0 0 1 0 0 1 3 6 0 5 2 1 0 1 3 0

k 0 0 0 0 2 0 2 1 0 0 0 0 6 0 0 0 1 5 0

s 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0

z 0 0 0 0 0 0 0 0 0 0 0 0 2 7 2 0 1 0 0

f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

v 0 0 0 0 0 0 1 1 0 1 0 0 6 1 6 0 0 1 0

S 0 0 0 1 0 0 0 0 0 0 0 0 4 0 1 0 0 2 0

Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

l 0 0 1 0 1 0 0 2 0 1 0 1 14 2 0 0 0 0 3

l~ 0 0 7 0 2 0 1 0 0 3 0 22 17 0 0 2 1 1 7

L 2 8 0 0 1 0 0 0 0 0 0 0 2 1 3 1 0 0 0

r 0 6 0 0 2 1 0 0 1 1 4 5 42 24 9 7 9 16 3

R 0 0 0 0 0 0 0 0 0 0 0 0 5 1 0 0 0 2 0

m 0 0 0 9 7 0 0 13 0 4 0 1 14 2 3 0 2 1 0

n 1 0 0 1 0 0 0 3 0 0 2 1 8 1 1 2 1 2 1

J 25 0 0 12 0 0 0 1 1 0 0 0 6 1 3 0 0 1 0

j 0 204 0 0 0 2 2 0 2 0 15 0 1 6 32 20 6 23 0

w 2 0 43 0 19 0 0 0 1 0 0 4 45 0 5 0 2 0 2

i~ 1 3 0 39 0 0 6 0 8 0 0 0 2 0 7 1 0 2 0

o~ 0 0 0 0 112 0 1 1 0 0 0 1 10 0 0 0 0 4 11

A~ 0 1 2 4 23 243 50 5 3 1 14 4 8 1 7 3 6 55 7

e~ 0 3 0 8 2 0 158 2 9 0 14 0 2 1 16 14 0 12 0

u~ 0 0 0 2 6 0 0 33 0 18 0 0 33 2 0 0 0 1 3

j~ 2 3 0 2 1 0 6 2 30 0 7 0 2 1 6 4 0 0 0

w~ 0 0 0 0 3 1 9 6 0 153 0 4 9 0 0 0 0 1 1

E 0 1 0 0 0 1 0 0 0 0 86 0 0 0 5 14 0 16 0

O 0 0 0 0 1 2 0 0 0 1 0 210 3 0 0 0 15 6 24

u 1 9 4 3 2 2 4 16 2 8 0 7 678 119 32 2 0 36 13

@ 1 1 0 7 3 0 1 5 0 1 1 0 87 294 20 9 0 16 0

i 1 18 0 13 0 0 3 3 0 0 2 0 7 11 404 22 0 1 0

e 0 1 0 2 0 1 3 0 0 0 7 0 11 4 7 85 0 9 0

A 0 1 2 0 1 5 0 0 0 0 16 25 0 0 0 0 436 33 2

a 7 2 1 8 2 38 14 3 3 0 31 6 41 37 35 26 34 709 17

o 0 0 1 0 5 5 0 1 0 1 1 3 46 1 0 0 1 29 112


Confusion matrix for Spanish

ip b d g p t k s z f v S Z l l~ L r R m n

ip 34 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

b 0 3 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

d 1 0 28 1 0 3 0 0 0 0 1 1 0 0 0 0 3 0 0 8

g 1 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

p 1 0 0 0 20 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0

t 0 0 3 0 0 55 1 0 0 0 0 0 0 0 0 0 0 0 0 0

k 0 0 0 0 0 0 38 0 0 0 0 0 0 0 0 0 0 0 0 0

s 0 0 0 0 0 0 0 34 0 0 0 1 0 0 0 0 0 0 0 0

z 0 0 0 0 0 0 0 3 3 0 0 2 0 0 0 0 0 0 0 0

f 0 0 0 0 0 0 0 0 0 10 0 0 0 0 0 0 0 0 0 0

v 1 0 4 0 0 0 0 0 0 0 18 0 0 0 0 0 1 1 1 1

S 1 0 0 0 0 0 0 0 3 0 0 38 0 0 0 0 0 0 0 0

Z 0 0 0 0 0 0 0 0 0 0 0 1 7 0 0 0 0 0 0 0

l 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 1

l~ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 1

L 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 2

r 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 57 0 0 0

R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0

m 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 19 1

n 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15

J 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

j 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0

w 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0

i~ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

o~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

A~ 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

e~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 2

u~ 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

j~ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

w~ 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0

E 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

O 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

u 6 0 1 0 0 0 1 0 0 0 1 3 1 3 0 0 1 0 3 1

@ 2 0 0 0 0 0 0 1 0 0 0 2 1 0 0 0 1 0 0 0

i 2 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1

e 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

A 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

a 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

o 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0

J j w i~ o~ A~ e~ u~ j~ w~ E O u @ i e A a o

ip 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

b 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0

d 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 2 0

g 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0

p 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

t 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

s 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

v 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Z 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

l 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0

l~ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0

L 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

r 0 1 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 5 0

R 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

m 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0

n 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0

J 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

j 0 16 0 0 0 0 0 0 1 0 5 0 0 0 0 0 0 1 0

w 1 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 1 2 2

i~ 0 0 0 0 0 0 1 0 2 0 0 0 0 0 0 1 0 0 0

o~ 0 0 0 0 4 2 0 0 0 0 0 0 1 0 0 0 0 2 1

A~ 0 0 0 0 0 12 0 0 0 0 1 1 0 0 0 0 14 1 0

e~ 0 0 0 0 0 3 4 0 0 0 5 0 0 0 0 0 0 2 0

u~ 0 0 0 0 0 1 0 1 0 3 0 0 1 0 0 0 0 1 0

j~ 1 0 0 0 0 0 0 0 3 0 0 0 1 0 0 0 0 0 0

w~ 0 0 0 0 0 1 0 0 0 8 0 0 0 0 0 0 2 0 0

E 0 0 0 0 0 0 0 0 0 0 7 0 0 0 0 0 0 2 1

O 0 0 0 0 0 1 0 0 0 0 0 12 0 0 0 0 3 2 1

u 0 1 1 0 2 3 0 0 0 0 0 3 15 9 2 1 0 41 1

@ 0 1 0 0 0 0 1 0 0 0 1 0 1 10 5 2 0 24 0

i 1 1 0 1 0 0 1 0 0 0 2 0 1 1 20 2 0 2 0

e 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 3 0

A 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 31 2 0

a 1 1 0 0 0 4 2 0 0 0 3 2 0 0 0 0 12 50 0

o 0 0 0 0 0 1 0 0 0 0 0 2 1 0 0 0 2 6 0


Confusion matrix for Bulgarian

ip b d g p t k s z f v S Z l l~ L r R m n

ip 796 0 0 1 4 15 13 9 0 6 1 17 1 0 1 0 0 2 1 0

b 3 17 7 0 6 1 1 0 0 1 1 1 1 1 0 1 1 2 3 0

d 21 1 214 2 1 45 10 6 3 0 6 9 0 0 4 1 7 2 3 11

g 0 0 3 15 0 1 10 2 0 3 2 0 0 1 0 0 1 1 1 0

p 14 2 6 0 116 9 10 2 0 1 2 2 0 1 0 0 0 2 1 1

t 76 0 20 1 10 281 22 15 1 5 1 14 2 1 3 1 9 1 2 5

k 19 0 6 0 13 24 216 6 0 3 2 5 3 0 1 1 5 0 9 0

s 30 0 3 1 1 5 4 168 3 15 1 32 1 0 1 0 2 1 5 3

z 3 0 2 0 0 3 0 17 16 4 4 5 6 0 0 1 1 0 1 1

f 5 0 0 0 3 5 4 9 0 37 0 2 0 0 0 0 2 1 2 1

v 10 6 23 2 5 5 6 7 3 14 54 3 1 5 2 1 7 8 4 5

S 19 0 2 0 2 5 2 22 2 2 2 212 8 1 7 0 6 2 3 2

Z 2 0 0 1 1 0 0 4 0 0 1 9 31 3 1 0 0 0 3 1

l 0 0 3 0 1 0 0 1 0 0 0 0 2 19 0 4 2 1 3 7

l~ 4 0 0 0 0 3 1 0 0 0 0 0 0 1 27 0 6 1 1 0

L 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 9 1 0 0 2

r 25 0 10 0 2 10 8 8 0 3 4 24 3 4 1 2 313 14 4 4

R 8 0 0 0 0 0 5 0 0 0 0 3 0 0 0 0 17 3 0 0

m 10 1 7 0 3 2 4 0 0 1 2 1 0 1 1 1 8 0 109 10

n 5 0 6 0 1 4 5 1 0 1 1 1 1 3 0 1 5 1 16 70

J 3 0 0 0 0 0 0 1 0 0 0 2 0 1 1 1 1 0 2 8

j 17 0 4 0 3 1 0 4 0 1 0 4 1 0 1 7 5 0 1 3

w 2 0 0 0 0 1 2 1 0 0 1 3 0 3 4 0 1 0 1 2

i~ 2 0 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 2 0

o~ 3 0 0 0 2 0 2 3 0 0 0 2 0 0 0 0 2 0 1 0

A~ 10 0 1 0 0 0 1 2 0 0 0 5 0 2 2 0 6 1 8 4

e~ 10 0 2 0 1 2 3 1 0 0 0 4 0 1 0 0 3 0 4 4

u~ 2 0 2 0 1 2 0 0 0 1 0 0 0 0 0 0 0 0 5 1

j~ 1 0 0 0 2 5 0 0 0 0 0 1 1 0 0 0 0 0 3 2

w~ 9 0 1 0 0 1 0 0 0 0 0 4 0 1 18 0 1 0 8 0

E 3 0 0 0 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 0

O 0 0 1 0 0 0 0 2 0 0 0 1 4 1 0 0 1 2 1 0

u 70 1 11 0 6 15 15 14 2 3 3 36 5 4 7 3 15 6 20 9

@ 32 0 7 0 1 10 9 25 1 7 5 33 2 2 2 1 13 2 6 2

i 11 1 4 0 2 6 5 2 1 0 0 10 2 2 1 4 3 1 6 2

e 2 0 0 0 1 2 1 1 0 0 0 2 0 0 0 0 2 0 0 2

A 13 0 1 0 1 4 3 2 0 2 0 3 1 3 1 0 3 0 3 1

a 29 1 3 1 4 6 10 10 0 2 2 8 0 6 2 1 17 10 5 3

o 4 0 0 0 0 2 1 1 0 0 0 0 0 1 1 0 4 1 4 0

J j w i~ o~ A~ e~ u~ j~ w~ E O u @ i e A a o

ip 1 1 0 0 0 0 0 0 0 2 0 0 2 2 1 1 8 5 2

b 1 1 0 0 1 0 0 0 0 0 0 1 2 1 2 0 1 1 3

d 3 1 0 1 5 4 1 1 1 1 4 3 9 8 6 2 9 18 7

g 0 1 0 0 0 0 0 0 0 0 2 0 1 1 2 1 0 0 2

p 0 1 2 0 1 0 1 1 0 0 0 2 3 1 4 1 7 6 0

t 2 3 0 0 2 3 0 0 1 3 1 1 6 2 11 1 9 7 0

k 0 3 1 1 0 2 0 1 0 2 2 4 5 1 7 1 6 4 2

s 0 1 0 4 3 1 1 1 0 0 2 2 4 4 6 3 5 4 4

z 0 0 0 0 0 1 1 0 0 0 0 0 0 2 4 2 4 2 1

f 0 0 0 0 0 1 1 0 0 0 2 1 2 2 0 0 2 0 2

v 0 3 1 4 1 1 2 1 1 4 7 1 8 6 4 3 7 5 7

S 0 0 1 0 0 1 0 2 0 4 3 0 9 2 4 0 6 5 4

Z 0 0 0 0 1 0 2 0 0 0 1 1 0 1 3 1 3 0 0

l 0 2 0 0 1 1 2 3 0 0 1 2 6 3 4 0 4 4 3

l~ 0 0 0 0 1 0 1 0 0 5 1 2 8 1 0 0 4 6 12

L 0 4 0 0 1 0 0 1 0 2 1 0 3 0 2 1 1 1 0

r 1 4 0 3 4 11 2 6 0 1 3 5 20 14 13 4 20 14 11

R 0 0 0 0 1 0 0 0 0 1 0 2 2 0 0 0 3 1 3

m 0 2 0 1 5 0 0 2 0 4 0 1 6 4 3 2 3 1 2

n 2 2 0 0 2 2 0 0 0 0 1 0 2 5 1 0 5 4 3

J 1 1 0 0 1 0 1 0 0 1 0 1 1 2 3 0 0 1 0

j 0 55 0 3 1 4 1 2 3 1 17 3 6 11 19 20 10 16 1

w 0 1 13 1 7 1 2 1 0 2 1 6 21 2 4 0 5 5 5

i~ 0 1 0 10 1 2 5 1 5 0 0 0 1 1 5 0 3 1 0

o~ 0 1 0 0 43 0 0 1 0 0 2 2 10 1 2 0 1 3 6

A~ 1 0 0 1 14 65 11 2 3 5 8 6 7 5 6 7 45 26 10

e~ 0 2 0 6 0 12 44 1 3 1 13 0 1 5 7 1 3 9 1

u~ 0 0 0 0 4 3 1 12 2 9 0 0 14 2 1 0 3 1 2

j~ 0 6 0 1 0 0 2 0 4 4 2 1 2 1 4 1 1 4 1

w~ 0 0 1 0 1 0 0 2 0 34 1 3 12 1 0 0 6 6 11

E 0 3 0 1 0 0 0 1 1 3 16 2 1 6 4 7 4 18 1

O 0 0 0 0 3 4 1 0 2 1 1 50 6 3 0 1 12 12 49

u 3 7 3 5 26 1 1 5 2 17 6 21 320 37 25 10 15 50 56

@ 2 4 0 6 5 4 2 1 2 7 17 2 32 143 26 23 9 68 6

i 1 10 0 1 2 1 4 1 2 0 10 4 21 25 93 18 5 18 6

e 0 1 0 2 1 1 4 0 0 1 16 0 2 12 8 18 2 7 0

A 1 3 0 1 2 10 2 0 0 1 7 12 4 2 4 4 162 37 17

a 1 7 2 3 5 15 5 3 2 10 11 19 48 32 28 14 45 248 32

o 0 0 0 0 7 2 0 1 0 1 2 18 14 0 2 3 5 5 50


References

[1] WITT, S., Use of Speech Recognition in Computer-assisted Language Learning, November 1999

[2] TABILO, L., O ensino do português como língua estrangeira por professores não nativos,

Setembro 2011

[3] CHARRUA, C., Aquisição Fonética-Fonológica do Português Europeu dos 18 aos 36 meses,

Setembro 2011

[4] BOSCH, A., Learning to pronounce written words: A study in inductive language learning,

December 1997

[5] TAYLOR, E., Why we love repetition in music, Ted Talks, 2014

[6] RODRIGUES, S., Fonética e Fonologia no ensino da língua materna: modos de

operacionalização, Setembro 2005

[7] ARIZA, E., HANCOCK, S., Second Language Acquisition Theories as a framework for Creating

Distance Learning Courses, Florida Atlantic University, USA, 2003, online version:

(http://www.irrodl.org/index.php/irrodl/article/view/142/222)

[8] VILLALOBOS, O., Reflections on the connection between computer-assisted language learning

and second language acquisition, February 2013

[9] LEVIS, J., Computer technology in teaching and researching pronunciation, 2007

[10] NECIBI, K., An ASR-based System for Arabic Mispronunciation Detection, December, 2012

[11] NERI, A., The pedagogical effectiveness of ASR-based Computer Assisted Pronunciation

Training, 2007

[12] PEABODY, M., Methods for Pronunciation Assessment in Computer Aided Language Learning,

Massachusetts Institute of Technology, 2011

[13] ESKENAZI, M., Using a Computer in Foreign Language Pronunciation Training: What

Advantages?

[14] LEE, A., GLASS, J., A Comparison-based Approach to Mispronunciation Detection

[15] PRIDDY, K., KELLER, E., Artificial Neural Networks: An Introduction, SPIE Press, 2005

[16] BAHI, H., Hybrid ASR system for teaching pronunciation, September 2008

[17] LEWIS, M., SIMONS, G., FENNIG, C., Ethnologue: Languages of the World, Seventeenth edition,

Texas, 2014, online version: (http://www.ethnologue.com)

[18] A Pronúncia do Português Europeu, Instituto da cooperação e da Língua de Portugal

[19] MARQUILHAS, R., Gramática Histórica do Português, online version: ( http://cvc.instituto-

camoes.pt/hlp/gramhist/index.html)

[20] WELLS, J., UCL Phonetics and Linguistics , University College London, 1997, online version:

(http://www.phon.ucl.ac.uk/home/sampa/portug.htm)

[21] Dicionário Terminológico para consulta em linha, online version: (http://dt.dgidc.min-edu.pt/)

[22] The Phonetic Framework, online version: (https://www.uni-due.de/DI/Phonetic_Framework.htm)

[23] WELLS, J., SAMPA home page, UCL Phonetics and Linguistics and University College London ,

1996, online version: (http://www.phon.ucl.ac.uk/home/sampa/spanish.htm)

[24] WELLS, J., SAMPA home page, UCL Phonetics and Linguistics and BABEL, 1998, online version:

(http://www.phon.ucl.ac.uk/home/sampa/bulgar.htm)

[25] YAO, M., et al., The Implementation of an Evaluation System of English Phoneme Pronunciation

Quality

[26] WET, F., CUCCHIARINI, C., STRIK, H., BOVES, L., Using likelihood ratios to perform utterance

[27] NERI, A., CUCCHIARINI, C., STRIK, H., Pronunciation training in Dutch as a second language on

the basis of automatic speech recognition

[28] ZHAO, T., et al., Automatic Chinese pronunciation error detection using svm trained with

structural features

[29] MAK, B., et al., PLASER: Pronunciation Learning via Automatic Speech Recognition

[30] Forvo Media SL, 2014, online version: (forvo.com)

[31] WEBSTER, M., Learners’ Dictionary, online version: ( http://www.learnersdictionary.com/)

[32] WALKER, R., TAVARES, R., The Language Lover's Guide to Learning Portuguese, online

version: (http://www.learningportuguese.co.uk/guide/pronunciation/introduction)

[33] Bonjour de France, online version: (http://www.bonjourdefrance.com/)

[34] English Computerized, inc, online version: ( http://www.englishlearning.com/)

[35] MEINEDO, H., ABAD, A., PELLEGRINI, T., NETO, J., TRANCOSO, I., The L2F Broadcast News

Speech Recognition System


[36] MEINEDO, H., CASEIRO, D., NETO, J., TRANCOSO, I., A Broadcast News Speech Recognition

System for the European Portuguese Language

[37] WET, F., VAN DER WALT, C., NIESLER, T.R., Automatic assessment of oral language

proficiency and listening comprehension, Speech Communication 51, 864-874, March 2009

[38] STRIK, H., TRUONG, K., WET, F., CUCCHIARINI, C., Comparing different approaches for

automatic pronunciation error detection, Speech Communication 51, May 2009

[39] KANTERS, S., CUCCHIARINI, C., STRIK, H., The Goodness of Pronunciation Algorithm: a

Detailed Performance Study

[40] L2F (Spoken Language Systems Lab) and LEL (Language Research Laboratory), VITHEA

project, Virtual Therapist for Aphasia treatment, 2014, available online: (https://vithea.l2f.inesc-

id.pt/wiki/index.php/Main_Page)
