
Bachelor-seminar Thesis

Speech recognition of Austrian German with the Raspberry Pi

Building an ASR system based on PocketSphinx in combination with a beamformer

conducted at the
Signal Processing and Speech Communications Laboratory

Graz University of Technology, Austria

by
Matthias Blochberger, 1273011

Markus Huber, 1231594

Supervisors: Dipl.-Ing. Dr. Martin Hagmüller

Assessors/Examiners: Dipl.-Ing. Dr. Martin Hagmüller

Graz, June 23, 2016


Abstract

The goal of this thesis was to create a working system for offline speech recognition of Austrian German using a Raspberry Pi and the open-source speech recognition software PocketSphinx. For the purpose of noise reduction, a beamformer was applied and its influence on the ASR system's performance was evaluated. The procedure included creating an acoustic model, using an open-source speech database from native German speakers, and later adapting it with data from native Austrian speakers. In the course of this work, two types of language models and a phonetic pronunciation dictionary were generated. For testing and evaluation purposes, DIRHA, a project concerning voice-enabled home automation, was chosen as the area of application for the system. As a final step, conclusions were drawn from the results, which were based not only on automated tests from the Sphinx toolkit but also on newly recorded material in combination with the beamformer.


Statutory Declaration

I declare that I have authored this thesis independently, that I have not used other than the declared sources/resources, and that I have explicitly marked all material which has been quoted either literally or by content from the used sources.

date (signature)


Contents

1 Introduction
1.1 Definition of task
1.2 Concept of automatic speech recognition
1.3 Phonetic units

2 Speech recognition in PocketSphinx
2.1 Overview of the CMUSphinx toolkit
2.2 The acoustic model
2.3 Building the acoustic model
2.4 Adapting the acoustic model
2.5 The language model
2.6 The pronunciation dictionary
2.7 The beamformer

3 Method and procedure
3.1 Training the new acoustic model
3.1.1 Data preparation
3.1.2 Training
3.2 Adapting the acoustic model
3.3 Building the language models
3.3.1 Statistical language model
3.3.2 Grammar
3.4 Implementing the beamformer

4 Results
4.1 Evaluation parameters
4.2 Results for trained and adapted acoustic models
4.2.1 Results for the basic German acoustic model
4.2.2 Results for the adapted acoustic model
4.3 Results for applying different beamforming algorithms

5 Discussion and conclusion
5.1 Interpretation of the results
5.1.1 Discussion of tests in the course of acoustic model creation
5.1.2 Discussion of tests applying beamforming algorithms
5.2 Usability of the system
5.3 Suggestions for possible improvements


1 Introduction

1.1 Definition of task

Automatic speech recognition (ASR) systems increasingly take root in modern applications and are already used within a variety of reliable, well-functioning implementations. In most cases, however, online processing (e.g. by Google or Apple) is required to guarantee fast and accurate results. The scope of this work was to create a speech-to-text system that is capable of running offline on a small and simple platform with limited resources, such as the Raspberry Pi, and that recognises Austrian German speech in particular.

The speech recognition engine did not have to be built from scratch; a toolkit called PocketSphinx, provided by the CMUSphinx project, was chosen for this task. A German model already existed; the purpose of this work, though, was to determine whether optimisations for Austrian German are possible by creating a new, distinct model. It has to be clarified that "Austrian German" does not mean different dialects, but German spoken by native Austrian speakers. The original plan was to adapt an existing German acoustic model with a database of recorded Austrian German speech. Additionally, a beamformer was to be added to the processing chain of the ASR system to reduce noise and enhance the quality of the speech signal. To test the new speech recogniser, DIRHA, a project for voice-enabled home automation, was specified as a possible application. Eventually, the performance of the newly created model and the beamformer's role in this regard were evaluated.

1.2 Concept of automatic speech recognition

Paraphrased from [1]: Automatic speech recognition (ASR) samples acoustic information as a signal and processes it to obtain linguistic information, resulting in readable output (written words, sentences, etc.). This introductory chapter gives a short overview of the basic components of an ASR system, their functionality and their interaction.

Figure 1.1 shows the main components. The first step is extracting usable information from the signal. The extracted parameters ("features") should carry a maximum of information relevant for the classification and must allow the distinction between different linguistic units. This normally happens through frame-based spectral analysis of the signal, which can be regarded as static for the length of a frame. Each frame is processed, resulting in one feature vector per frame. These vectors are then used for classification by applying an acoustic model. In many cases the acoustic model is based on the Hidden Markov Model (HMM), a stochastic model usually generated by machine learning and trained for linguistic units, which usually are words or phones. An HMM is similar to a finite state machine, containing the transition probabilities from one state to another. Typically, an HMM uses three states for the statistical representation of the assigned word or phone.


Figure 1.1: Scheme of an ASR system [1]
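As a toy illustration of the transition part of such a model (the numbers are made up, not taken from any trained model), a left-to-right three-state HMM reduces to a small matrix, and the probability of one particular state path is the product of the transitions along it:

import numpy as np

# Left-to-right HMM with three emitting states (hypothetical probabilities).
# Row i holds the transition probabilities out of state i: each state may
# loop on itself or move one state forward.
A = np.array([
    [0.6, 0.4, 0.0],   # state 0 -> {0, 1}
    [0.0, 0.7, 0.3],   # state 1 -> {1, 2}
    [0.0, 0.0, 1.0],   # state 2 loops until the phone ends
])

def path_probability(path):
    """Probability of one state sequence under the transition model alone."""
    p = 1.0
    for s, t in zip(path, path[1:]):
        p *= A[s, t]
    return p

print(path_probability([0, 0, 1, 1, 2, 2]))  # 0.6 * 0.4 * 0.7 * 0.3 * 1.0 = 0.0504

In a real recogniser, this transition score is combined with the emission probability of each feature vector in every state, and the Viterbi algorithm searches for the best path.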

The pronunciation dictionary contains details about the pronunciation of single words in the form of phone combinations and can also be described as the mapping of words to phones. An excerpt of the dictionary used is shown and discussed further in Ch. 3.1. The language model restricts the word search and defines the probability of the occurrence of words and word combinations. Either a statistical model is used, which predicts the following word based on the previous sequence of n words ("n-gram"), or a fixed grammar, which is usually created for specific purposes and scopes of speech recognition (simple command-and-control applications, etc.), since grammars tend to become rather complex and impractical for a wider field of applications. Statistical language models are generated on the basis of written reference texts. In simpler terms: speech is analysed, resulting in feature vectors. The acoustic model then uses these feature vectors to output candidates from a certain phone set. Next, the phones are mapped to words, and ultimately the language model selects the most probable word sequence.
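As a minimal sketch of the statistical case (toy corpus with made-up command sentences), a bigram model simply turns word-pair counts into conditional probabilities:

from collections import Counter

# Toy training text; <s> and </s> mark utterance boundaries.
corpus = [
    "<s> schalte das licht ein </s>",
    "<s> schalte das licht aus </s>",
    "<s> schliesse das fenster </s>",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("licht", "das"))    # 2/3: "das" is followed by "licht" twice
print(p_bigram("fenster", "das"))  # 1/3

Real toolkits additionally apply smoothing and back-off, so that unseen word pairs do not receive zero probability.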

1.3 Phonetic units

When dealing with speech recognition, several terms concerning phonetic units and subunits can be encountered. In order to avoid any confusion, a short explanation of the most frequent terms is given in the following paragraphs.

Phones, in linguistic terms, are the smallest distinguishable units of sound and are not tied to a certain language. They exist to categorise all possible audible speech sounds with regard to articulation and acoustic characteristics. The standard reference for phones is the International Phonetic Alphabet (IPA). Of course, the number of feasible phones is restricted; however, the range of variation in producing speech sounds is practically limitless. [2]

Phonemes group phones that, in the context of a certain language, carry no difference in meaning. If a word is pronounced in different ways but its meaning stays the same, the phones that make up the word change, but the phonemes stay the same. Many speech recognition applications use phoneme sets in the pronunciation dictionary and in the HMMs of the acoustic model. For most languages, a set of 40 to 50 phonemes is sufficient for ASR.


It is favourable to deal with phones in context, as the sound of a phone depends largely on the previous and following phone. Biphones and triphones address this issue, describing a phone considering either one (biphone) or two (triphone) neighbouring phones. It is, however, not practical to implement triphones as the underlying unit of an acoustic model, because it is unlikely that the training data covers all triphone combinations of a language. Therefore, triphones are compressed and "represented by a small amount of distinct short sound detectors", called senones. [3]

Senones are essentially subunits of triphones. A senone represents, for example, the beginning or ending of a triphone, but it can be shared across multiple triphones, which reduces the number of entities needed to describe a language through triphones. In the context of an acoustic model, a senone acts as a set of Gaussian mixture density functions representing one state of an HMM.
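Numerically, scoring a senone therefore amounts to evaluating a Gaussian mixture on a feature vector. A minimal sketch (two mixture components in two dimensions with made-up parameters; actual models use 39-dimensional features and more components):

import numpy as np
from scipy.stats import multivariate_normal

# One "senone" modelled as a two-component Gaussian mixture (toy parameters).
weights = np.array([0.4, 0.6])
means = [np.array([0.0, 1.0]), np.array([2.0, -1.0])]
covs = [np.eye(2) * 0.5, np.eye(2) * 1.5]

def senone_log_likelihood(x):
    """log p(x) under the mixture, i.e. the senone's acoustic score for x."""
    densities = [w * multivariate_normal.pdf(x, mean=m, cov=c)
                 for w, m, c in zip(weights, means, covs)]
    return np.log(sum(densities))

print(senone_log_likelihood(np.array([1.0, 0.0])))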


2 Speech recognition in PocketSphinx

2.1 Overview of the CMUSphinx toolkit

PocketSphinx is a module of the open-source project CMUSphinx, developed by Carnegie Mellon University and now maintained by Nikolay Shmyrev. It is a lightweight speech recognition engine mainly developed for use on hand-held and mobile devices. Its basic functionality follows the scheme shown and explained in Ch. 1.2. It is a command-line tool that can run on any operating system, but it has to be compiled by the user, because the audio device/driver used (PulseAudio, ALSA, JACK, etc.) has to be present at compile time. The CMUSphinx speech recognition toolkit consists of three components: PocketSphinx (the actual recogniser library), SphinxTrain, which contains the training tools for the acoustic model, and a support library called SphinxBase. Along with PocketSphinx comes a standard acoustic model for US English.

PocketSphinx can handle two different types of input (executable names in parentheses):

• Input based on a single file . . . An audio file in WAVE format is read by PocketSphinx and used in the speech recognition process. (pocketsphinx)

• Continuous input stream . . . Continuous audio from a microphone or a similar audio-stream source is used. (pocketsphinx_continuous)

The learning module called SphinxTrain is used to generate the acoustic model from recorded speech and its transcription. It also holds the training tools for the adaptation of an acoustic model. To calculate the parameters of the HMMs, iterations of the forward-backward estimation algorithm (Baum-Welch) are used. SphinxTrain furthermore provides a script to determine the error rate and accuracy of an acoustic model on a data set.


2.2 The acoustic model

PocketSphinx knows three different types of acoustic models: continuous, semi-continuous and phonetically tied models (PTM). These differ essentially in the number of Gaussians they use; the choice of acoustic model type is therefore determined by the trade-off between computational performance and accuracy of the ASR system. This is an issue especially when dealing with platforms like the Raspberry Pi, on which computational resources are very limited. The acoustic model is based on HMMs, generated (learned, as explained in Ch. 2.3) from a database of speech samples. Recordings of human speech are transcribed and used to calculate the features. These features are extracted from frames of 10 ms of speech and are represented in a so-called feature vector consisting of 39 numbers. The numbers are derived from spectral analysis (formants, MFCC), although the exact method of determination is subject to active research and at this time not documented [3]. The recognition process itself uses these vectors, compares them to their HMM representation and determines probabilities of linguistic units (e.g. phones). To further optimise the speech recognition process, the previously mentioned phones are broken down into smaller linguistic units called senones.
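A common Sphinx-style layout of these 39 numbers is 13 cepstral coefficients plus their first and second time derivatives. The following sketch reproduces that layout with librosa (10 ms hop at 16 kHz; an approximation for illustration, not the toolkit's internal front end):

import librosa
import numpy as np

# Load speech at 16 kHz mono (file name is a placeholder).
y, sr = librosa.load("utterance.wav", sr=16000, mono=True)

# 13 MFCCs per frame; 10 ms hop (160 samples), 25 ms window (400 samples).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=160, n_fft=400)

# First and second time derivatives ("delta" and "delta-delta").
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, d1, d2])  # shape: (39, number_of_frames)
print(features.shape)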

2.3 Building the acoustic model

To build an acoustic model the following resources are necessary:

• Speech database . . . A large database of recorded human speech with corresponding transcriptions.

• Phone list . . . List of the used phones in ASCII symbols (Ch. 3.1).

• Pronunciation dictionary . . . Vocabulary in text form with the corresponding phone combinations (Ch. 3.1).

• SphinxTrain . . . Training module of the CMUSphinx project.

SphinxTrain needs the data in certain formats and locations. The speech database has to be provided in the WAVE audio format (.wav), mono, 16 kHz, 16 bit. Transcriptions have to be in UTF-8 compatible text format and follow a certain structure to indicate the beginning and end of a sentence or phrase, as well as to associate each sequence with the corresponding speech signal, which is done by a tag containing the audio file-ID. A short excerpt of a transcription is shown in Figure 2.1.

Figure 2.1: Transcription example

Apart from the previously mentioned requirements, there also has to be a control file containing all audio file-IDs in the right order. Two sets of transcriptions and control files have to be supplied, one for training and one for testing the model (the test set should cover about a tenth of the size of the training set). No words may occur in the transcriptions that are not in the dictionary, and the dictionary's phonetics have to be in accordance with the phone list. A second dictionary, the "filler" file, lists all non-speech sounds that may occur in the input data and maps them to the "corresponding non-speech or speech-like sound units" [4].
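A minimal sketch of producing both files in the expected shape (the utterances and file names are made up; the <s> ... </s> markers and the trailing (file-ID) tag follow the SphinxTrain convention described above):

# Hypothetical utterances: (file-ID without extension, transcription).
utterances = [
    ("rec_0001", "BITTE SCHLIESSE DAS FENSTER"),
    ("rec_0002", "WIE WARM IST ES"),
]

with open("train.transcription", "w", encoding="utf-8") as trans, \
     open("train.fileids", "w", encoding="utf-8") as ids:
    for file_id, text in utterances:
        # One line per utterance: sentence markers plus the file-ID tag.
        trans.write(f"<s> {text} </s> ({file_id})\n")
        # Control file: the file-IDs, one per line, in the same order.
        ids.write(file_id + "\n")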


After preparing and formatting the data, the parameters of the new model have to be configured. The most important steps include setting the acoustic model type (continuous, semi-continuous, PTM), the number of Gaussians and the number of senones. Sound feature parameters (sample rate, low/high-pass frequency) and parameters for multi-threaded processing may also be set. Which values to choose depends largely on the size and kind of the database. For example, if the number of senones is high, the sounds are divided into smaller pieces, which can cause problems for a small database, as many senones are then not trained properly or do not occur in the input data at all. The further training process basically consists of running a few scripts of the SphinxTrain module. The final acoustic model is represented by a folder comprising eight text files. To test the new model at the end of the training phase, a statistical language model is required.

2.4 Adapting the acoustic model

If speech recognition accuracy is to be improved, it is in most cases not necessary to build a new model from scratch, as training requires a lot of data and time. With only a rather small amount of data, good results can be achieved by adapting an existing acoustic model. The improvements are determined by the adaptation corpus and can be attained, for example, for a particular accent, a voice, a recording environment or an audio channel. Adaptation is basically very similar to training a new model: the SphinxTrain module uses transcribed speech data as well as a phonetic dictionary to determine the new parameters of the acoustic model, and again Baum-Welch algorithms are used to perform the adaptation. The configuration of the originally trained model cannot be changed in the adaptation process, so the dictionary and transcriptions have to match the previously used phonetics (in the context of this thesis, that was also a reason for training a new model, as the voxforge data was not consistent with the data available for adaptation). There are two adaptation methods used by SphinxTrain. Maximum likelihood linear regression (MLLR) creates a transformation matrix based on calculations of the Gaussian model parameters and is suitable for limited amounts of data. The second one is the maximum a posteriori (MAP) adaptation method, which updates and overwrites the parameters of the acoustic model. Both can be used in combination to reach a better accuracy [5].

2.5 The language model

PocketSphinx supports a variety of language model types. In this work, only two different language model formats are considered: a statistical language model in the ARPA format and a deterministic grammar in the JSpeech Grammar Format (JSGF). Statistical language models contain probabilities of words given the preceding sequence of words (n-gram) and are often used in modern speech recognition applications, because the speaker is not bound to a particular form and may speak in a natural way (as long as the language model knows all the used words). To build a statistical language model, the CMUSphinx project recommends the CMUCLMTK toolkit. The statistics are gathered from a "clean" reference text that is free from abbreviations, digits and punctuation characters. The beginning and end of each utterance in the text also have to be marked. In general, the reference texts are equivalent to the transcriptions, just without the file-ID tags.


Grammars are usually used for simple command-and-control applications and specify exactly what can be recognised. A grammar in JSGF is constructed from so-called rules. These rules consist of words and several special characters to group words and define options of what can be said. Creating a grammar manually in JSGF is basically very straightforward; an example is shown in Figure 2.2.

Figure 2.2: Example of a grammar in JSGF
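The grammar in Figure 2.2 is not reproduced here; a grammar of the same flavour, written out by a short script, could look as follows (rule names and vocabulary are made up, not the DIRHA grammar used later):

# A tiny command-and-control grammar in JSGF, written to a .gram file.
grammar = """\
#JSGF V1.0;
grammar lights;

// Alternatives in (...), optional parts in [...], sub-rules in <...>.
public <command> = [BITTE] (SCHALTE | DREH) <target> (EIN | AUS);
<target> = DAS LICHT | DIE LAMPE;
"""

with open("lights.gram", "w", encoding="utf-8") as f:
    f.write(grammar)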

2.6 The pronunciation dictionary

The pronunciation dictionary maps the words to their phonetic segmentation. It is provided in the form of a raw text file and simply contains the used vocabulary and the corresponding phonetic representation for each word. The used phonetics are listed in the phone-set file, which defines the linguistic entities the application should differentiate between. There is no constraint to a certain norm, therefore any set of phonemes may be applied. Choosing the phones and mapping them to linguistic sounds influences the accuracy: the balance lies between "enough phones to build the vocabulary coherently" and "not too many, to keep the set simple and consistent". Usually, grapheme-to-phoneme (g2p) converters are used to create a dictionary. Based on an example pronunciation dictionary, a model is trained using machine learning methods, which is then able to transcribe new words in the same phonetic manner from a simple text file listing all the new words line by line.

2.7 The beamformer

In order to improve the quality of speech signals and suppress interfering noise, a differential beamformer can be applied. The signal of interest is separated from sounds coming from other angles through constructive interference of time-shifted signals of a microphone array. Different directivities are possible, depending on the number of microphones and on the beamforming algorithm and its settings. Popular approaches are, for example, ADMA or spectral subtraction. For this work, the aim was to improve the speech recognition accuracy by applying a beamformer in order to get a cleaner signal.


3 Method and procedure

In general, during the process of building the components for the speech recogniser, the procedure documented on the website of the CMUSphinx project was taken as a template. After setting up the Raspberry Pi 1 with Raspbian (a Debian-based operating system) and installing the Sphinx components, most computations and operations were performed on a Lenovo ThinkPad, as the necessary files for the ASR system do not change with the platform and the workflow is a lot faster and easier.

3.1 Training the new acoustic model

The first major step and one of the integral parts of the thesis was building the acoustic model. The original plan was to use an existing German model from voxforge, a project revolving around the creation of acoustic models from user-provided speech data, and to adapt it with speech data of native Austrian speakers. This plan was later abandoned after learning that the creator of the model had used a specific phone set, and translating it would have been a tedious job. An additional concern was the high number of used phones, which can result in over-complication and bad results. It was deemed easier to build a new model from scratch using the same voxforge speech database, but applying different phonetics with the HTK phone set.

3.1.1 Data preparation

One particular phone set that is used in projects at the SPSC and has been refined with good results before is the "HTK" phone set. It was chosen for this project in order to provide simplicity and comparability with similar projects. The existing speech database, which was used to build the model, contained a pronunciation dictionary using X-SAMPA. Several simplifications were made to provide a consistent and understandable set (see Tab. 3.3). An automated script was written in PHP to translate the existing phone combinations into combinations using the desired HTK phones.

Excerpts from the pronunciation dictionaries:

Transcription        X-SAMPA
ABBILD               ' ? a p - b I l t
EINNAHME             ' ? aI - n a: - m @
MISSVERHAELTNIS      ' m I s - f E 6 - h E l t - n I s
SPEZIALANWENDUNGEN   S p e: - ' ts j a: l - ? a n - v E n - d U - N @ n

Table 3.1: Examples of words in the pronunciation dictionary with X-SAMPA phones


Transcription        HTK
ABBILD               a p b I l t
EINNAHME             aI n a m @
MISSVERHAELTNIS      m I s f E N6 h E l t n I s
SPEZIALANWENDUNGEN   S p eL t s j a l a n v E n d U N @ n

Table 3.2: Examples of words in the pronunciation dictionary with HTK phones

The phone set and the corresponding X-SAMPA phones [6]:

HTK    X-SAMPA
a      a, a:, A, A:, V
aI     aI
aU     aU
@      @, {, {:
b      b
C      C
d      d
E      E, E:
eL     e, e:
f      f, D, T
g      g
h      h, H
I      I
iL     i, i:
j      j
k      k
l      l
m      m
n      n
N      N
2L     9, 2, 2:, 3, 3:
N6     6
O      o, o:, O, O:
OY     oI, OI, OY
p      p
r      r, R
S      S, Z
s      s
t      t
U      u, u:, U
v      v, w
x      x
y      y, y:, Y
SIL    none (silence)

Table 3.3: HTK - X-SAMPA correlation
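The project's conversion script was written in PHP; the following Python sketch reproduces its core idea for the entries shown in Tables 3.1 and 3.2 (mapping table abbreviated; stress, syllable and glottal-stop marks are simply dropped):

# Partial X-SAMPA -> HTK mapping taken from Table 3.3 (abbreviated).
# "ts" is split into two HTK phones, as seen in SPEZIALANWENDUNGEN.
XSAMPA_TO_HTK = {
    "a": "a", "a:": "a", "aI": "aI", "@": "@", "6": "N6",
    "e:": "eL", "E": "E", "I": "I", "U": "U",
    "b": "b", "d": "d", "f": "f", "j": "j", "l": "l",
    "m": "m", "n": "n", "N": "N", "p": "p", "s": "s",
    "S": "S", "t": "t", "ts": "t s", "v": "v",
}
SEPARATORS = {"'", "-", "?", '"'}  # stress, syllable and glottal-stop marks

def xsampa_to_htk(pronunciation):
    """Translate one space-separated X-SAMPA pronunciation into HTK phones."""
    return " ".join(XSAMPA_TO_HTK[token]
                    for token in pronunciation.split()
                    if token not in SEPARATORS)

print(xsampa_to_htk("' ? aI - n a: - m @"))  # EINNAHME -> aI n a m @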


After creating the new dictionary, the next step was to prepare the data for training. The open-source speech data from voxforge comprised about 35,000 audio files, each containing a single sentence, phrase or utterance, including the corresponding transcriptions and control files. Some of the data used audio formats other than WAVE; this data was excluded, as it represented only a very small percentage of the overall data. To handle the large database and ensure consistent file names as well as properly formatted transcriptions, a script was written to check for the right audio format and an existing transcription, and to create the necessary data for SphinxTrain in the right format and file structure. It is nearly inevitable that mistakes occur in a transcription this large. The transcription had to be cleaned of misspelt words, words missing in the dictionary and other unknown characters. Therefore, two more scripts were created to check which words are missing in the dictionary and to correct misspelt words in the transcription. Nevertheless, some manual correction was necessary to ensure error-free training.

3.1.2 Training

To build the model, all the previously explained parts are necessary: phone set, pronunciation dictionary, speech database, transcriptions, control files and a filler dictionary (which in this case only consisted of silences, as no non-linguistic fragments were used in the transcription). Before training, the parameters in the configuration file were set. To be able to compare the different types of models, a decision was made to train a continuous model with 3000 senones and a semi-continuous model with 4000 senones. The full cfg files can be found in the appendix. The learning process itself can take up to several weeks, depending on processing power and the size of the database. In this case, the database contained about 35,000 audio files, each comprising a single sentence, and the learning process took about an hour to finish on a Lenovo ThinkPad T440p with an Intel i7 quad-core. To test the new acoustic model, the last sixty audio files were not used in the training process but kept as a test set. Furthermore, a language model had to be specified in the configuration file. A statistical language model in the ARPA format that came along with the voxforge data was used for this purpose. During the first tests, problems with low accuracy were encountered, which were attributed to this voxforge language model: it was created from transcriptions containing umlaut characters, but in the process of correcting the necessary files, the umlauts had been transliterated. Thus, for testing the acoustic model, a new language model based on the clean transcription had to be generated.

3.2 Adapting the acoustic model

Two sets of databases with transcribed speech recorded by Austrian speakers were available at the SPSC, differing in size and content. The first set held about 530, the second one about 4400 audio files, each containing one sentence or phrase. The first step was again to clean up all transcriptions and the files necessary for adaptation (see Ch. 2.4). Due to the smaller sets of data, the corrections were made manually; no additional scripts were used to prepare the transcriptions and file-ID files. In order to perform the adaptation, a dictionary is needed. Said dictionary was provided along with the data corpora, but it lacked a lot of words and, moreover, implemented a slightly different phonetic structure. Therefore, a new dictionary was created with the help of the g2p software "sequitur" (developed by RWTH Aachen), holding all the words that occurred in the training, adaptation and test transcriptions. The dictionary used in the training process acted as a reference dictionary for this new one.
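Sequitur is driven from the command line; a sketch of the usual train/apply cycle, wrapped in Python (file names are hypothetical; the options follow the sequitur-g2p documentation):

import subprocess

# Train a first g2p model on the reference dictionary.
subprocess.run(["g2p.py", "--train", "reference.dic",
                "--devel", "5%", "--write-model", "model-1"], check=True)

# Transcribe a plain list of new words (one per line) with that model.
with open("new_words.dic", "w", encoding="utf-8") as out:
    subprocess.run(["g2p.py", "--model", "model-1",
                    "--apply", "new_words.txt"], stdout=out, check=True)

In practice, several ramp-up iterations are usually trained, and the resulting entries are merged with the reference dictionary.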


All the adaptation audio files had to be downsampled and converted from 24 bit/48 kHz to 16 bit/16 kHz, which was only noticed after receiving error messages during the adaptation process and very poor first results (accuracy about -33%). The conversion was done with the software Audacity. Both adaptation methods were applied: the MLLR matrix was generated, and the parameters of the acoustic models were updated with the MAP adaptation method. Due to a misapplication during the process, using the MLLR matrix with the adapted acoustic model initially only diminished the accuracy rates: the SphinxTrain MLLR algorithm had first been applied to the base acoustic model instead of the model with the already updated parameters from the MAP algorithm. The MLLR algorithm was then performed again after running MAP adaptation, which led to better results. Two small test sets (one from each data set) were retained, containing 20 and 25 sentences. The adapted models were tested in combination with a variety of different language models.
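The conversion was done in Audacity; the same resampling and bit-depth change can also be scripted, for example as follows (file names are placeholders):

import librosa
import soundfile as sf

# Read the 24 bit/48 kHz recording and resample to 16 kHz mono on load.
y, sr = librosa.load("adapt_raw.wav", sr=16000, mono=True)

# Write it back as 16-bit PCM, the format SphinxTrain expects.
sf.write("adapt_16k.wav", y, sr, subtype="PCM_16")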

3.3 Building the language models

3.3.1 Statistical language model

As the language model has a big influence on the results, several language models were created, based on the three data sets used (the voxforge set and the two sets from the SPSC) and combinations of them, to be able to compare the results with regard to the language model. To generate the different statistical models, the CMUCLMTK tool was used. Its usage is simple and well explained in the README file and on the website of the CMUSphinx project [7]. Overall, seven models were created: voxforge, corpus 1, corpus 2, voxforge + corpus 1, voxforge + corpus 2, voxforge + corpus 1+2, and corpus 1+2. There is also the possibility of creating so-called linear models, which is recommended if words rather than sentences should be recognised. These linear models were created as well, which makes a total of fourteen different language models.
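The CMUCLMTK pipeline chains three small tools, from word counts to an ARPA model; a sketch of one run, wrapped in Python (file names are hypothetical; tool names and flags as documented by the CMUSphinx project):

import subprocess

corpus = "corpus.txt"  # cleaned reference text with <s> ... </s> markers

# Word frequencies -> vocabulary -> id n-grams -> ARPA language model.
steps = [
    f"text2wfreq < {corpus} | wfreq2vocab > corpus.vocab",
    f"text2idngram -vocab corpus.vocab -idngram corpus.idngram < {corpus}",
    "idngram2lm -vocab_type 0 -idngram corpus.idngram "
    "-vocab corpus.vocab -arpa corpus.arpa",
]
for cmd in steps:
    subprocess.run(cmd, shell=True, check=True)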

3.3.2 Grammar

The DIRHA project, which develops voice-enabled home automation environments, was determined as a possible application for this work's outcome. Actions may be carried out by DIRHA, controlled by relatively simple commands, hence using a grammar as the language model was a convenient option. The grammar was written in JSGF, based on a PDF file depicting all possible combinations of words to control the home system (see Figure 3.1). Parts of the grammar were placed in separate grammar files, which are then called by a main grammar file. After finishing the grammar, a couple of commands were recorded to test whether it works as expected on the Raspberry Pi. JSGF allows only one main grammar rule, which caused trouble in the final testing stage; however, this problem was easily avoided by condensing all of the rules into one.


Figure 3.1: Chart listing possible DIRHA commands


3.4 Implementing the beamformer

A beamformer developed by Elmar Messner [8], based on an eight-microphone array, was used in combination with the software by Thomas Pichler [9]. Using the beamformer made software compilation on the Linux notebook and on the Raspberry Pi more challenging: incompatibility of the ARM architecture with some of the used libraries and differing definitions of a particular function (a floating-point operation) in the beamforming software turned out to be troublesome. This, however, could be handled by updating the version of the required "QwtPolar" library and rewriting small pieces of the C++ code. Subsequently, more problems arose: the sound server JACK proved to be incompatible with Sphinx on the Raspberry Pi. Further attempts to get the system operating as planned were then performed on the notebook. Eventually, the beamformer's output signal failed to be routed to PocketSphinx via JACK; the PocketSphinx application showed an error message stating that the JACK buffer was full. After a lot of troubleshooting (setting buffer size, frame size and sample rate, killing JACK daemons, adjusting real-time settings, etc.), we decided to contact the developer of PocketSphinx, Nikolay Shmyrev. After quite a lot of e-mail correspondence, Mr. Shmyrev ultimately informed us that PocketSphinx no longer supports JACK. This forced us to abandon the original idea of a live ASR system with the Raspberry Pi and the beamformer. Instead, speech was recorded and fed to PocketSphinx as audio files in WAVE format (16 kHz, 16 bit).

In order to test how the beamformer affects the recognition abilities of the system, a simple test set-up was created: one loudspeaker plays recorded test sentences, while a second one plays music. This simulates a living room or, as often referenced in DIRHA, a cocktail-party room. Figure 3.2 shows a simple scheme of the test set-up.

Figure 3.2: Simple scheme of the test-set-up

By balancing the volume of speech and music in a certain way, this test set-up recreates a scenario in which the usage of a speech recognition system is still realistic. The beamformer faces the loudspeaker playing the speech signal, which is positioned at a distance of 40 cm; the other loudspeaker creates background noise in the form of music. A set consisting of ten sentences in the context of DIRHA was recorded by two different speakers and played back several times on the test set-up, each time applying a different beamforming algorithm. The following algorithms were chosen for the testing purposes:

• First-order ADMA with 2 microphones [8, Ch. 4.1.1]

• Second-order ADMA with 3 microphones [8, Ch. 4.1.2]

• First-/second-order Hybrid ADMA with 4 microphones [8, Ch. 4.1.3]


In order to obtain a clean reference signal, a second speech set was recorded with only one microphone and no background audio. Unfortunately, due to adverse behaviour of the beamformer, the speech and audio material was sometimes interrupted by chunks of noise, which rendered some of the recordings useless. To be able to evaluate the influence of the beamformer on the speech recogniser with the PocketSphinx testing script, transcriptions and control files again had to be created for this set of sentences. Furthermore, the dictionary needed a few minor updates. The tests did not include all combinations of acoustic and (statistical) language models; in fact, only the best models from the previous tests were used (both adapted or generated with data including corpus 1 and 2). Tests with the JSGF grammar initially led to surprisingly bad accuracy rates, which were caused by a bug in the grammar. The problem was easily fixed by summarising all grammar rules under one public rule.


4 Results

4.1 Evaluation parameters

The training process is followed by automated tests. The test set should consist of rather unique speech data to ensure the model's ability to recognise speech patterns not included in the training data. The script word_align.pl checks the output of the recognition process for wrong, missing or excess words, using the transcriptions for comparison. The following values are used to determine the accuracy of the tested model:

• Words: the number of words in the test set

• Correct: the number of correctly recognised words

• Errors: the number of words not recognised correctly

• Percentage correct: (Correct / Words) · 100%

• Error: the word error rate, WER = ((Insertions + Deletions + Substitutions) / Words) · 100%

• Accuracy: ((Words - Deletions - Substitutions) / Words) · 100%

• Insertions: extra words inserted between correct words.
Transcription: Bitte erhöhe die Temperatur um vier Grad.
Output of ASR: Bitte nonsens erhöhe die Temperatur um vier Grad.

• Deletions: missing words.
Transcription: Bitte erhöhe die Temperatur um vier Grad.
Output of ASR: Bitte die Temperatur um vier Grad.

• Substitutions: wrong words.
Transcription: Bitte erhöhe die Temperatur um vier Grad.
Output of ASR: Bitte senke die Temperatur um vier Grad.
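A minimal sketch of how these counts fall out of a word-level edit-distance alignment (simplified; word_align.pl additionally prints the alignment itself):

def wer_counts(ref, hyp):
    """Word-level Levenshtein alignment; returns (substitutions, deletions, insertions)."""
    ref, hyp = ref.split(), hyp.split()
    # d[i][j] = minimal edit cost between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match or substitution
    # Backtrack to count the individual error types.
    i, j, subs, dels, ins = len(ref), len(hyp), 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dels, i = dels + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return subs, dels, ins

reference = "BITTE ERHOEHE DIE TEMPERATUR UM VIER GRAD"
hypothesis = "BITTE SENKE DIE TEMPERATUR UM VIER GRAD"
s, d, i = wer_counts(reference, hypothesis)
n = len(reference.split())
print(f"WER = {(s + d + i) / n:.2%}")       # one substitution out of 7: 14.29%
print(f"Accuracy = {(n - d - s) / n:.2%}")  # per the definition above: 85.71%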

The results are presented in tables showing the different combinations of acoustic and language models. The models are either generated with data from the voxforge project or from a speech database provided by the SPSC (corpus 1+2). Some language models were created as linear language models (see Ch. 3.3). The tables furthermore indicate the type of the acoustic model (continuous or semi-continuous) and whether the MLLR matrix was applied during testing. Three different test sets were used throughout the testing phase. The corpus 2 test set was the preferred one, however, as it contained a little more new, unseen material compared to the corpus 1 set. The voxforge test set was only relevant for the evaluation of the basic acoustic model after training. The tables' evaluation parameters include the correctly recognised percentage of words, the word error rate and the accuracy, all listed in percent.


4.2 Results for trained and adapted acoustic models

4.2.1 Results for the basic German acoustic model

These are the results for the newly created acoustic model, trained with the data of the Germanvoxforge model. The configuration details can be seen in the appended .cfg file.

AM data    AM type           LM data            MLLR   Test set     Correct [%]   WER [%]   Accuracy [%]
voxforge   continuous        voxforge           no     voxforge     93.06          7.65     92.35
voxforge   continuous        voxforge           no     test set 2   38.74         75.68     24.32
voxforge   continuous        voxforge           yes    test set 2   42.34         71.17     28.83
voxforge   continuous        voxf. + c.1+2      no     test set 2   63.06         43.24     56.76
voxforge   continuous        voxf. + c.1+2      yes    test set 2   78.38         25.23     74.77
voxforge   continuous        corpus 2           no     test set 2   67.57         38.74     61.26
voxforge   semi-continuous   corpus 2           yes    test set 2   84.68         17.12     82.88
voxforge   semi-continuous   voxforge           no     test set 2   31.53         81.08     18.92
voxforge   semi-continuous   voxforge, linear   no     test set 2   30.63         80.18     19.82
voxforge   semi-continuous   voxf. + c.1+2      no     test set 2   67.57         36.04     63.96
voxforge   semi-continuous   corpus 1+2         no     test set 2   70.27         31.53     68.47
voxforge   semi-continuous   corpus 2           no     test set 2   70.27         31.53     68.47
voxforge   semi-continuous   corpus 1           no     test set 2   20.72         90.99      9.01

Table 4.1: Results of the newly trained acoustic model

4.2.2 Results for the adapted acoustic model

The adapted acoustic models rely on the semi-continuous or continuous German model and are adapted either with data from corpus 1, corpus 2 or both. In the case of adaptation with both corpora, two models were created: the first one was generated by adapting with both corpora at the same time, the second one by adapting stepwise (adapting with corpus 1 and then repeating the adaptation process with corpus 2). The models are labelled "step adapt." and "single adapt." accordingly.

Adaptation data   AM type           LM data               MLLR   Test set     Correct [%]   WER [%]   Accuracy [%]
corpus 1          continuous        corpus 1              no     test set 1   73.00         33.00     67.00
corpus 1          continuous        corpus 1              yes    test set 1   71.00         36.00     64.00
corpus 1          continuous        corpus 1, lin.        no     test set 1   71.00         32.00     68.00
corpus 1          continuous        corpus 1+2            no     test set 1   73.00         31.00     69.00
corpus 1          continuous        corpus 1+2, lin.      no     test set 1   74.00         30.00     70.00
corpus 1          continuous        voxf. + c.1+2         no     test set 1   77.00         28.00     72.00
corpus 1          continuous        voxf. + c.1+2, lin.   no     test set 1   73.00         31.00     69.00
corpus 1          semi-continuous   corpus 1              no     test set 2   18.92         95.50      4.50
corpus 1          semi-continuous   corpus 2              no     test set 2   70.27         31.53     68.47
corpus 1          semi-continuous   corpus 1+2            no     test set 2   68.47         33.33     66.67
corpus 1          semi-continuous   voxf. + c.1+2         no     test set 2   64.86         38.74     61.26

Table 4.2: Results of the acoustic model adapted with corpus 1


Adaptation data   AM type           LM data               MLLR   Test set     Correct [%]   WER [%]   Accuracy [%]
corpus 2          continuous        voxforge              no     test set 1    58.00        51.00      49.00
corpus 2          continuous        voxforge              yes    test set 1    58.00        51.00      49.00
corpus 2          continuous        corpus 1              no     test set 1    99.00         1.00      99.00
corpus 2          continuous        corpus 1, lin.        no     test set 1   100.00         0.00     100.00
corpus 2          continuous        corpus 2              no     test set 1    98.00         2.00      98.00
corpus 2          continuous        corpus 2              yes    test set 1    99.00         1.00      99.00
corpus 2          continuous        corpus 1+2            no     test set 1   100.00         0.00     100.00
corpus 2          continuous        voxf. + c.1+2         no     test set 1    99.00         1.00      99.00
corpus 2          continuous        voxforge              no     test set 2    44.14        63.06      36.94
corpus 2          continuous        corpus 1              no     test set 2    27.93        89.19      10.81
corpus 2          continuous        corpus 2              no     test set 2    98.20         1.80      98.20
corpus 2          continuous        corpus 1+2            no     test set 2    98.20         1.80      98.20
corpus 2          continuous        corpus 1+2, lin.      no     test set 2    99.10         0.90      99.10
corpus 2          continuous        voxf. + c.1+2         no     test set 2    89.19        10.81      89.19
corpus 2          continuous        voxf. + c.1+2, lin.   no     test set 2    94.59         5.41      94.59
corpus 2          semi-continuous   corpus 2              no     test set 2    76.58        26.13      73.87
corpus 2          semi-continuous   corpus 1+2            no     test set 2    72.07        29.73      70.27
corpus 2          semi-continuous   corpus 1+2, lin.      no     test set 2    84.68        15.32      84.68
corpus 2          semi-continuous   voxf. + c.1+2         no     test set 2    68.47        34.23      65.77
corpus 2          semi-continuous   voxf. + c.1+2, lin.   no     test set 2    75.68        27.03      72.97

Table 4.3: Results of the acoustic model adapted with corpus 2

Adaptation data             AM type           LM data               MLLR   Test set     Correct [%]   WER [%]   Accuracy [%]
corpus 1+2, step adapt.     continuous        corpus 1              no     test set 1   84.00         18.00     82.00
corpus 1+2, step adapt.     continuous        corpus 1+2            no     test set 1   81.00         20.00     80.00
corpus 1+2, step adapt.     continuous        voxf. + c.1+2         no     test set 1   80.00         22.00     78.00
corpus 1+2, single adapt.   continuous        voxforge              no     test set 1   58.00         51.00     49.00
corpus 1+2, single adapt.   continuous        voxforge              yes    test set 1   60.00         50.00     50.00
corpus 1+2, single adapt.   continuous        corpus 1              no     test set 1   98.00          2.00     98.00
corpus 1+2, single adapt.   continuous        corpus 2              no     test set 1   97.00          3.00     97.00
corpus 1+2, single adapt.   continuous        corpus 1+2            no     test set 1   99.00          1.00     99.00
corpus 1+2, single adapt.   continuous        voxf. + c.1+2         no     test set 1   98.00          2.00     98.00
corpus 1+2, single adapt.   continuous        voxf. + c.1+2         yes    test set 1   98.00          2.00     98.00
corpus 1+2, single adapt.   continuous        voxforge              no     test set 2   45.95         62.16     37.84
corpus 1+2, single adapt.   continuous        voxforge              yes    test set 2   45.95         62.16     37.84
corpus 1+2, single adapt.   continuous        corpus 1              no     test set 2   23.42         90.09      9.91
corpus 1+2, single adapt.   continuous        corpus 2              no     test set 2   97.30          2.70     97.30
corpus 1+2, single adapt.   continuous        corpus 2              yes    test set 2   97.30          2.70     97.30
corpus 1+2, single adapt.   continuous        corpus 1+2            no     test set 2   97.30          2.70     97.30
corpus 1+2, single adapt.   continuous        voxf. + c.1+2         no     test set 2   87.39         13.51     86.49
corpus 1+2, single adapt.   semi-continuous   corpus 1              no     test set 2   19.82         95.50      4.50
corpus 1+2, single adapt.   semi-continuous   corpus 1, lin.        no     test set 2   19.82         94.59      5.41
corpus 1+2, single adapt.   semi-continuous   corpus 2              no     test set 2   72.07         29.73     70.27
corpus 1+2, single adapt.   semi-continuous   corpus 1+2            no     test set 2   71.17         29.73     70.27
corpus 1+2, single adapt.   semi-continuous   corpus 1+2, lin.      no     test set 2   83.78         17.12     82.88
corpus 1+2, single adapt.   semi-continuous   voxf. + c.1+2         no     test set 2   65.77         36.04     63.96
corpus 1+2, single adapt.   semi-continuous   voxf. + c.1+2, lin.   no     test set 2   72.97         28.83     71.17

Table 4.4: Results of the acoustic model adapted with corpus 1+2


4.3 Results for applying different beamforming algorithms

The final step of the evaluation was to compare the beamforming algorithms listed in Ch. 3.4. To be able to compare the results in a reasonable manner, two reference situations (clean background, noisy background) were recorded using one microphone. All beamforming algorithms were tested with background noise (music). Only three different acoustic models were tested with the beamformer: the basic German model trained with the voxforge data, the continuous model adapted with corpus 1+2, and the semi-continuous model adapted with corpus 1+2. The language model used was the JSGF grammar. Tests with statistical language models are not shown in the tables, as they were skipped once they proved to be useless in practical terms.

Acoustic model       Algorithm                                        Speaker   Correct [%]   WER [%]   Accuracy [%]
basic German model   1 microphone, clean reference                    1         89.29         14.29     85.71
basic German model   1 microphone, clean reference                    2         87.50         12.50     87.50
basic German model   1 microphone, noisy reference                    1         28.57         73.21     26.79
basic German model   1 microphone, noisy reference                    2         32.14         67.86     32.14
basic German model   first-order ADMA, 2 microphones                  1         50.00         50.00     50.00
basic German model   first-order ADMA, 2 microphones                  2         65.96         34.04     65.96
basic German model   second-order ADMA, 3 microphones                 1         90.24          9.76     90.24
basic German model   second-order ADMA, 3 microphones                 2         59.57         42.55     57.45
basic German model   first-/second-order hybrid ADMA, 4 microphones   1         68.97         31.03     68.97
basic German model   first-/second-order hybrid ADMA, 4 microphones   2         66.67         33.33     66.67

Table 4.5: Results of tests with different beamforming algorithms using the basic German acoustic model


Acoustic model           Algorithm                                        Speaker   Correct [%]   WER [%]   Accuracy [%]
continuous, corpus 1+2   1 microphone, clean reference                    1         89.29          14.29    85.71
continuous, corpus 1+2   1 microphone, clean reference                    2         73.21          14.29    73.21
continuous, corpus 1+2   1 microphone, noisy reference                    1         12.50          87.50    12.50
continuous, corpus 1+2   1 microphone, noisy reference                    2          1.79          98.21     1.79
continuous, corpus 1+2   first-order ADMA, 2 microphones                  1         26.79          73.21    26.79
continuous, corpus 1+2   first-order ADMA, 2 microphones                  2         12.77          89.36    10.64
continuous, corpus 1+2   second-order ADMA, 3 microphones                 1         43.90          56.10    43.90
continuous, corpus 1+2   second-order ADMA, 3 microphones                 2         10.64          89.36    10.64
continuous, corpus 1+2   first-/second-order hybrid ADMA, 4 microphones   1         24.14          79.31    20.69
continuous, corpus 1+2   first-/second-order hybrid ADMA, 4 microphones   2          8.33         108.33    -8.33

Table 4.6: Results of tests with different beamforming algorithms using the continuous acoustic model adapted with corpus 1+2

Acoustic model                Algorithm                                        Speaker   Correct [%]   WER [%]   Accuracy [%]
semi-continuous, corpus 1+2   1 microphone, clean reference                    1         85.71          14.29    85.71
semi-continuous, corpus 1+2   1 microphone, clean reference                    2         73.21          26.79    73.21
semi-continuous, corpus 1+2   1 microphone, noisy reference                    1          5.36          94.64     5.36
semi-continuous, corpus 1+2   1 microphone, noisy reference                    2          1.79          98.21     1.79
semi-continuous, corpus 1+2   first-order ADMA, 2 microphones                  1         16.07          83.93    16.07
semi-continuous, corpus 1+2   first-order ADMA, 2 microphones                  2          2.13          97.87     2.13
semi-continuous, corpus 1+2   second-order ADMA, 3 microphones                 1         17.07          82.93    17.07
semi-continuous, corpus 1+2   second-order ADMA, 3 microphones                 2          2.13          97.87     2.13
semi-continuous, corpus 1+2   first-/second-order hybrid ADMA, 4 microphones   1         10.34          89.66    10.34
semi-continuous, corpus 1+2   first-/second-order hybrid ADMA, 4 microphones   2          0.00         100.00     0.00

Table 4.7: Results of tests with different beamforming algorithms using the semi-continuous acoustic model adapted with corpus 1+2


5 Discussion and conclusion

5.1 Interpretation of the results

5.1.1 Discussion of tests in the course of acoustic model creation

One important aspect for the interpretation of the results is to keep in mind the big impact of the test sets and the language model. Both test sets are rather small and were taken from the same data corpus as the adaptation data. Therefore, they contain only very few new words and word combinations. Tests with a language model created from the same or similar data as the test set achieve, in general, very good accuracy rates, which hinders an objective evaluation of the quality of the acoustic model (e.g. Tab. 4.3: tests with the continuous model on test set 2 show about 10% accuracy for the corpus 1 language model and 98.20% for the corpus 2 language model). The small test sets and their little diversity from the acoustic and language model data in terms of word combinations and recording conditions reduce the results' validity concerning the acoustic model. The big impact of the language model on the results is rather unsurprising, but still has to be emphasised. For that matter, more data does not always mean better results, as can be seen in Tab. 4.4: the continuous, single-adapted acoustic model has a lower accuracy for test set 2 in combination with the language model generated from all data (voxf. + c.1+2) than with the language model generated from corpus 2 only. Another thing to mention is that the evaluation parameters disregard the meaning of sentences. If, for example, only the word "bitte" is not recognised but the sentence's meaning stays the same, it still counts as an error. So one could say the results are evaluated in a "harsh" way.

Tab. 4.1 shows the accuracy of the newly built model using the voxforge data. The high accuracy of 92% for the voxforge test set was to be expected, as the database is relatively large and the model is applied to "German German". The necessity of creating a model for "Austrian German" can be seen as well, though: accuracy does not exceed 74.77% (for the continuous acoustic model), even with a robust language model that includes both Austrian German data sets. This cannot be regarded as a reliably working acoustic model. Although tested with test set 1 (which contains a lot of corpus 1 material), the acoustic model adapted with just the data from corpus 2 achieves the highest accuracy, 100%. The reason for this is that corpus 2 contains most of the corpus 1 material and is also greater in size. A decision about which is the best model is not that easy after all. Judging from the bare numbers, the acoustic model adapted with corpus 2 would be considered best. However, the adaptation with corpus 1 and 2 scores nearly equally good results, and adapting with more data in principle stands for a more robust and efficient outcome. Corpus 2, on the other hand, already includes practically the whole corpus 1, which means the additional data is redundant and may even cause overfitting. Either way, the two models are very close to each other, and choosing one "best" model seems unnecessary.


Generally speaking, the adapted models work very well for these particular test sets, and big improvements can be observed in Tables 4.2 - 4.4 compared with Tab. 4.1. The first thing that catches the eye is the differing performance of semi-continuous and continuous models: semi-continuous acoustic models tend to attain rather poor accuracy rates. In return, continuous models lead to longer recognition times (this issue is further discussed in Section 5.2). Further notable is the positive impact of linear language models, which improved results by up to 15%. Especially for semi-continuous models, the usage of linear LMs was beneficial, whereas continuous acoustic models often did not have much room for improvement and also featured the only decline of accuracy with a linear language model (Tab. 4.2, LM generated with voxforge + corpus 1+2 data). Applying the MLLR matrix does improve the recognition rates, but only to a very limited extent; in most cases, the expected gains fail to materialise. The effect on semi-continuous acoustic models seems to be bigger than on continuous ones, although the CMU website states the opposite. It has to be pointed out that better statements could have been made, especially about linear language models and the MLLR algorithm, if testing had been performed in a more systematically consistent manner.

5.1.2 Discussion of tests applying beamforming algorithms

Like the previous tests, the results presented in Tab. 4.5 - 4.7 depend on a relatively small test set, not even covering twenty sentences. This is important to remember when interpreting the results, as some statements may have to be put into perspective. When looking at the results in Section 4.3, the most conspicuous thing is the lacking performance of the adapted models compared to the original base model. Advances with the "Austrian German" models had already been observed in the previous tests, with test sets resembling the adaptation data (same speakers, phrases and recording surroundings). Why do they underperform in this situation? The original model works better in the case of a clean signal, but is also considerably more successful in noisy situations. The reason for this is the much bigger data corpus and the larger variety of different speakers and recording environments that go along with an online open-source database like voxforge. The data provided for adaptation did not differ at all in terms of recording environments and featured only rather few different speakers compared to voxforge. Although the adapted acoustic models are based on the original German model, they are nevertheless overfitted, so to speak. The tests were conducted using JSGF as the language model format, which proved to be a very reliable and well-functioning method for this application. The good outcomes of JSGF in combination with the German AM raise the question whether adaptation for "Austrian German" was necessary at all. Despite the relatively good results of the previous tests, it became clear early in the evaluation process that the statistical language models fail in a real-world scenario like this. Accuracy rates for the clean signal using statistical LMs lie below 10% and have not been included in the tables. As the LMs are based on data equivalent to the data used for generating the acoustic models, the bad performance is also attributed to the homogeneous database, which lacks a variety of distinct sentences and phrases. The beamformer's positive influence on the results is obvious; accuracy rates for all three tested models could be significantly improved. In the case of the basic German model, the accuracy for a previously noisy, beamformed signal is even higher (90%!) than for the clean signal without background noise. Even lower rates (under 70%) often allow the reconstruction of the sentence meaning from beamformed signals when the original acoustic model is used. Beamforming also works well for the adapted models, in spite of their poor results. The second-order ADMA beamforming algorithm with 3 microphones performs best; the hybrid algorithm with 4 microphones seems to be inferior to the others.


5.2 Usability of the system

The advantage in recognition accuracy of continuous over semi-continuous acoustic models has already been mentioned; the tradeoff, however, is significantly longer recognition times. Run-time varies by 5 to 10 seconds between the different acoustic models. An even more crucial factor in this respect is the choice of the language model type: with a statistical LM, decoding a single sentence on the Raspberry Pi takes around 30 seconds for continuous models and about 20 seconds for semi-continuous AMs. These long recognition times represent the biggest problem in terms of usability. A system which takes roughly 30 seconds to deliver a result is not at all suitable for everyday use in a home-automation scenario like DIRHA.

The use of JSGF instead of a statistical language model lowers these numbers. Depending on the length of the utterance and the AM used, processing times lie between about 4 and 11 seconds. Beside the better accuracy rates, this constitutes another big advantage of JSGF over statistical LMs, which have to score a very large number of possible word combinations. Nevertheless, the recognition times achieved still represent a significant delay in the reaction to a speech command, which would heavily affect the user's handling of the system (a simple way to measure these times per utterance is sketched below).

Since the recognition accuracy is determined only by the acoustic and language models, which can easily be transferred to another platform, processing time is the only restriction caused by the Raspberry Pi. In the course of this work, the tests were carried out with the Raspberry Pi 1. Even so, it is hardly probable that using a newer version of the Raspberry Pi would yield acceptable results in terms of usability with the DIRHA system.
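The per-utterance processing times discussed above can be measured with a few lines of Python around the decoder calls. This is a minimal sketch, assuming a decoder configured as in the earlier examples and a 16 kHz/16-bit mono raw test file; the file name is a placeholder.

```python
import time

def decode_timed(decoder, path, chunk=1024):
    """Decode one raw 16 kHz/16-bit mono file; return (hypothesis, seconds)."""
    start = time.time()
    decoder.start_utt()
    with open(path, 'rb') as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()
    hyp = decoder.hyp()
    return (hyp.hypstr if hyp else ''), time.time() - start

# Usage with a decoder built as in the earlier sketches, e.g. to compare
# a statistical-LM configuration against a JSGF configuration:
# text, seconds = decode_timed(decoder, 'test_utterance.raw')
# print('%-40s %.1f s' % (text, seconds))
```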

However, the long recognition time is not the only problem, as the accuracy of the latest configuration proved not to be at the level a reliable speech recognition system would require. The possible improvements explained in Section 5.3 may be able to raise it to an applicable level.

Another highly relevant topic concerning the practical implementation of the system is, of course, its unsuitability for live operation. As a consequence of PocketSphinx dropping its support for JACK, the system will not work as planned with the beamformer used here, which requires JACK to function.

5.3 Suggestions for possible improvements

As stated in the previous sections, the low accuracies give reason for further thought. The adaptation data consists of speech recorded at the SPSC, which has high recording quality and no noise. This results in an acoustic model which cannot handle situations in noisier environments. A possible solution would be to gather additional speech data from as diverse a set of noisy environments as possible and use it for adaptation. Another would be to reuse the existing speech data and run it through different filters to simulate noisy environments (one possible augmentation step is sketched below).

One of the conclusions of this work is that the adaptation data decisively determines the quality of the recognition engine. If only a small database is available, aligning the adaptation data with the expected application input is desirable. An application-specific data corpus is guaranteed if the same speakers, the same recording environments and ideally the same or similar sentences are used. A robust acoustic model, able to handle a variety of circumstances, can only be adapted with a diverse and heterogeneous database. The same applies to language models: if they are built with the final application in mind, useful results can be achieved.

Because the adaptation data plays such a major role, a parameter to set the weighting of the adaptation process, and therefore its influence on the final model, would be desirable. This is a suggestion for the further development of the PocketSphinx toolkit.
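The augmentation suggestion above could, for instance, start from simple additive mixing of a noise recording into the clean SPSC data at a controlled signal-to-noise ratio. The following sketch assumes the numpy and soundfile packages, mono recordings and placeholder file names; it shows one possible augmentation step, not a procedure used in this work.

```python
import numpy as np
import soundfile as sf  # assumed WAV I/O library

def mix_at_snr(clean_path, noise_path, out_path, snr_db=10.0):
    """Add a noise recording to a clean utterance at a given SNR in dB."""
    clean, fs = sf.read(clean_path)
    noise, fs_noise = sf.read(noise_path)
    if fs != fs_noise:
        raise ValueError('sample rates must match')
    # Loop the noise if it is shorter than the clean utterance (mono assumed).
    reps = int(np.ceil(len(clean) / float(len(noise))))
    noise = np.tile(noise, reps)[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    sf.write(out_path, clean + gain * noise, fs)

# Placeholder file names for illustration:
mix_at_snr('spsc_clean.wav', 'kitchen_noise.wav', 'spsc_noisy_10db.wav')
```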


An improvement could possibly also be attained by experimenting with different configurations for training the acoustic model, for example by changing the number of senones. Furthermore, regarding the optimisation of the system, a more thorough examination of the effect of the MLLR matrix and of linear LMs on the recognition accuracy would be preferable. With respect to these measures, however, the question arises whether the ratio of expended effort to final result is a reasonable one.

Not included in this work, but rather interesting for the improvement of this ASR system, is the building and testing of the recommended PTM (phonetically tied mixture) acoustic model type, which is said to lie between continuous and semi-continuous models in terms of recognition time and accuracy rate.
