PERCEPTUAL COMPENSATION | EPSRC 12-MONTH PROGRESS MEETING 1
Automatic Speech Recognition Studies
Guy Brown, Amy Beeston and Kalle Palomäki
Overview
• Aims
• The articulation index (AI) corpus
• Phone recogniser
• Results on sir/stir subset of AI corpus
• Future plans
Aims
• Aim to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).
• Should be compatible with the Watkins et al. findings, but also validated on a 'real world' ASR task:
– wider vocabulary
– range of reverberation conditions
– variety of speech contexts
– naturalistic speech, rather than interpolated stimuli
– consider phonetic confusions in reverberation in general
Progress to date
• Current work has focused on implementing a baseline ASR system for the articulation index (AI) corpus, which meets the requirements for speech material stated on the previous slide.
• So far we have results for phone recognition on a small test set, without any 'constancy' processing.
• Planning evaluation that compares phonetic confusions made by listeners and ASR on the same test.
The articulation index (AI) corpus
• Recorded by Jonathan Wright (University of Pennsylvania); available via the LDC.
• Intended for speech recognition in noise experiments similar to those of Fletcher.
• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins et al.:
– English (American)
– Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
– Target syllables are embedded in a context sentence drawn from a limited vocabulary
Details of the AI corpus
• Includes all “valid” English diphone (CV, VC) syllables.
• Triphone syllables (CVC, CCV, VCC) chosen according to frequency in the Switchboard corpus:
– correlated with syllable frequency in casual conversation.
• 12 male speakers, 8 female speakers.
• Approximately 2000 syllables common to all speakers.
• Small amount (10 min) of conversational data.
• All speech data sampled at 16 kHz.
AI corpus examples
• Target syllable preceded by two context words and followed by one context word:
– CW1 CW2 SYL CW3
– CW1, CW2 and CW3 drawn from sets of 8, 51 and 44 words respectively
• Examples:
they recognise sir entirely
people ponder stir second
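The CW1 CW2 SYL CW3 structure can be illustrated with a toy enumerator. The miniature word sets below are stand-ins; the real CW1, CW2 and CW3 sets contain 8, 51 and 44 words respectively:

```python
import itertools

# Hypothetical miniature versions of the AI-corpus context-word sets.
CW1 = ["they", "people"]
CW2 = ["recognise", "ponder"]
CW3 = ["entirely", "second"]
TARGETS = ["sir", "stir"]

def make_utterances():
    """Enumerate every CW1 CW2 SYL CW3 utterance from the toy sets."""
    return [" ".join(w) for w in itertools.product(CW1, CW2, TARGETS, CW3)]

utterances = make_utterances()
print(len(utterances))    # 2 * 2 * 2 * 2 = 16
print(utterances[0])      # "they recognise sir entirely"
```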
Phone recogniser
• Monophone recogniser implemented and trained on the TIMIT corpus.
• Based on HTK scripts by Tony Robinson [1].
• Front-end: speech encoded as 12 cepstral coefficients + energy + deltas + accelerations (39 features).
• Cepstral mean normalisation applied.
• 3 emitting states per phone model; observations modelled by a 20-component Gaussian mixture per state.
• Approx. 58% phone accuracy on the TIMIT test set. [1]

[1] http://www.cantabResearch.com/HTKtimit.html
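A minimal sketch of the delta/acceleration stacking and cepstral mean normalisation, assuming the 13 static coefficients (12 cepstra + energy) have already been computed; in the actual system HTK produces these features directly:

```python
import numpy as np

def deltas(feats, width=2):
    """HTK-style regression deltas computed over +/- `width` frames."""
    n = feats.shape[1]
    padded = np.pad(feats, ((0, 0), (width, width)), mode="edge")
    num = sum(t * (padded[:, width + t : width + t + n] -
                   padded[:, width - t : width - t + n])
              for t in range(1, width + 1))
    denom = 2 * sum(t * t for t in range(1, width + 1))
    return num / denom

def front_end(cepstra):
    """Given 13 static coefficients per frame (12 cepstra + energy),
    apply cepstral mean normalisation and stack deltas and
    accelerations to give the 39-feature vectors described above."""
    static = cepstra - cepstra.mean(axis=1, keepdims=True)  # CMN
    d = deltas(static)
    a = deltas(d)
    return np.vstack([static, d, a])

cepstra = np.random.randn(13, 100)   # stand-in for real static MFCCs
feats = front_end(cepstra)
print(feats.shape)                   # (39, 100)
```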
Training and testing
• Trained on TIMIT training set.
• Really needs adapting to the AI corpus material; work in progress.
• Removed allophones from the TIMIT labels (as is usual) to give a 41-phone set.
• Short pause and silence models.
• For testing on the AI corpus, word-level transcriptions were expanded into phone sequences using the Switchboard-ICSI pronunciation dictionary.
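The dictionary expansion step can be sketched as follows; the pronunciation entries here are invented for illustration and are not taken from the Switchboard-ICSI dictionary itself:

```python
# Toy pronunciation dictionary (entries invented for illustration).
PRON = {
    "they": ["dh", "ey"],
    "recognize": ["r", "eh", "k", "ax", "g", "n", "ay", "z"],
    "sir": ["s", "er"],
    "stir": ["s", "t", "er"],
    "entirely": ["eh", "n", "t", "ay", "er", "l", "iy"],
}

def expand(words):
    """Expand a word-level transcription into a phone sequence."""
    phones = []
    for w in words:
        phones.extend(PRON[w.lower()])
    return phones

print(expand(["they", "recognize", "sir", "entirely"]))
```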
Experiments
• Initial experiments done with a subset of AI corpus utterances in which the target syllable is “sir” or “stir”.
• Small test set of 40 utterances:
                 Male speaker   Female speaker
  “Sir”               12               8
  “Stir”              12               8
Experiment 1: Fletcher-style paradigm
• A recogniser grammar was used in which:
– The sets of context words CW1, CW2 and CW3 are specified;
– Target syllable is any sequence of two or three phones.
• Corresponds to task in which listener knows that context words are drawn from a limited set.
• Recogniser grammar is a (rather unconventional) mix of word-level and phone-level labels.
Experiment 1: recogniser grammar
$cw1 = I | YOU | WE | THEY | SOMEONE | NO-ONE | EVERYONE | PEOPLE;
$cw2 = SEE | SAW | HEAR | PERCEIVE | THINK | SAY | SAID | SPEAK | PRONOUNCE | WRITE | RECORD | OBSERVE | TRY | UNDERSTAND | ATTEMPT | REPEAT | DESCRIBE | DETECT | DETERMINE | DISTINGUISH | ECHO | EVOKE | PRODUCE | ELICIT | PROMPT | SUGGEST | UTTER | IMAGINE | PONDER | CHECK | MONITOR | RECALL | REMEMBER | RECOGNIZE | REPEAT | REPORT | USE | UTILIZE | REVIEW | SENSE | SHOW | NOTE | NOTICE | SPELL | READ | EXAMINE | STUDY | PROPOSE | WATCH | VIEW | WITNESS;
$cw3 = NOW | AGAIN | OFTEN | TODAY | WELL | CLEARLY | ENTIRELY | NICELY | PRECISELY | ANYWAY | DAILY | WEEKLY | YEARLY | HOURLY | MONTHLY | ALWAYS | EASILY | SOMETIME | TWICE | MORE | EVENLY | FLUENTLY | GLADLY | HAPPILY | NEATLY | NIGHTLY | ONLY | PROPERLY | FIRST | SECOND | THIRD | FOURTH | FIFTH | SIXTH | SEVENTH | EIGHTH | NINTH | TENTH | STEADILY | SURELY | TYPICALLY | USUALLY | WISELY;
$phn = AA | AE | AH | AO | AW | AX | AY | B | CH | D | DH | DX | EH | ER | EY | F | G | HH | IH | IY | JH | K | L | M | N | NG | OW | OY | P | R | S | SH | T | TH | UH | UW | V | W | Y | Z | ZH;
(!ENTER $cw1 $cw2 $phn $phn [$phn] $cw3 !EXIT)
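A quick way to sanity-check recogniser output against this grammar is a label-sequence matcher. The word and phone sets below are reduced stand-ins for the full $cw1/$cw2/$cw3/$phn definitions above:

```python
CW1 = {"I", "YOU", "WE", "THEY", "SOMEONE", "NO-ONE", "EVERYONE", "PEOPLE"}
# Reduced stand-ins for the full 51- and 44-word sets.
CW2 = {"RECOGNIZE", "PONDER", "IMAGINE", "SENSE", "EVOKE", "WITNESS"}
CW3 = {"ENTIRELY", "SECOND", "SURELY", "GLADLY", "PRECISELY", "DAILY"}
PHN = {"S", "T", "ER", "EH", "N", "P"}  # subset of the 41-phone set

def matches_grammar(labels):
    """Check a label sequence against the Experiment 1 grammar:
    $cw1 $cw2 $phn $phn [$phn] $cw3 (i.e. two or three target phones)."""
    if len(labels) not in (5, 6):
        return False
    if labels[0] not in CW1 or labels[1] not in CW2 or labels[-1] not in CW3:
        return False
    return all(p in PHN for p in labels[2:-1])

print(matches_grammar(["THEY", "IMAGINE", "S", "T", "ER", "SURELY"]))  # True
print(matches_grammar(["THEY", "SENSE", "S", "ER", "GLADLY"]))         # True
```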
Experiment 1: results
• Overall 47.5% correct at word level (sir/stir)
• Context words were not correctly recognised in some cases, with a knock-on effect on recognition of the target syllable.
• Examples:
  Reference                      Recogniser output               Outcome
  they imagine stir surely       they imagine s t er surely      Correct
  they sense stir gladly         they sense s er gladly          Deletion
  I evoke sir precisely          I evoke s eh precisely          Substitution (/eh/ as in 'head')
  they recognize sir entirely    they witness er n p daily       Incorrect context words
Experiment 2: constrained sir/stir
• A recogniser grammar was used in which:
– The sets of context words CW1, CW2 and CW3 are specified;
– Target syllable is constrained to “sir” or “stir”;
– Canonical pronunciation of “sir” and “stir” is assumed (i.e. “sir” = /s er/ and “stir” = /s t er/)
• Corresponds to Watkins-style task, except that context words vary and are drawn from a limited set.
• Utterances either presented clean or convolved with the left channel or right channel of the L-shaped room or corridor BRIRs.
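The reverberation step amounts to convolving each utterance with one BRIR channel; a minimal sketch with an invented two-channel impulse response standing in for the real BRIRs:

```python
import numpy as np

def reverberate(utterance, brir, channel=0):
    """Convolve a mono utterance with one channel of a binaural room
    impulse response (BRIR), as done for the test stimuli here."""
    ir = brir[:, channel] if brir.ndim == 2 else brir
    return np.convolve(utterance, ir)

fs = 16000
utt = np.random.randn(fs)               # stand-in for an AI-corpus utterance
brir = np.zeros((fs // 2, 2))           # toy two-channel impulse response
brir[0, :] = 1.0                        # direct sound
brir[800, 0] = 0.5                      # one early reflection, left channel
rev = reverberate(utt, brir, channel=0)
print(len(rev))                         # len(utt) + len(ir) - 1
```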
Experiment 2: recogniser grammar
• Recogniser grammar was:
$test = SIR | STIR;
( !ENTER $cw1 $cw2 $test $cw3 !EXIT )
with $cw1, $cw2 and $cw3 defined as before.
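For scale, the number of distinct utterances this grammar licenses follows directly from the set sizes stated earlier (8, 51 and 44 context words, two targets):

```python
# Distinct utterances licensed by the Experiment 2 grammar:
# 8 CW1 words x 51 CW2 words x 2 targets x 44 CW3 words.
n_cw1, n_cw2, n_targets, n_cw3 = 8, 51, 2, 44
n_sentences = n_cw1 * n_cw2 * n_targets * n_cw3
print(n_sentences)   # 35904
```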
Results: L-shaped room, left channel
  Impulse response ID     % Correct   SIR→SIR  SIR→STIR  STIR→SIR  STIR→STIR
  clean.wav                  95.0        18        2         0        20
  outconv22feb31p5.wav       92.5        18        2         1        19
  outconv22feb63.wav         85.0        18        2         4        16
  outconv22feb125.wav        72.5        13        7         4        16
  outconv22feb250.wav        67.5         9       11         2        18
  outconv22feb500.wav        62.5        15        5        10        10
  outconv22feb1000.wav       65.0        14        6         8        12
  (X→Y: spoken word X recognised as Y; 20 tokens per word)
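Each % Correct value is simply the trace of the corresponding 2x2 confusion matrix over its total; for example, for the clean and 500 ms conditions above:

```python
import numpy as np

def percent_correct(confusion):
    """Word-level accuracy from a 2x2 sir/stir confusion matrix whose
    rows are the spoken word and columns the recognised word."""
    confusion = np.asarray(confusion)
    return 100.0 * np.trace(confusion) / confusion.sum()

# Confusion matrices from the clean and 500 ms conditions in the table.
clean = [[18, 2], [0, 20]]
rev500 = [[15, 5], [10, 10]]
print(percent_correct(clean))    # 95.0
print(percent_correct(rev500))   # 62.5
```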
Results: L-shaped room, right channel
  Impulse response ID     % Correct   SIR→SIR  SIR→STIR  STIR→SIR  STIR→STIR
  clean.wav                  95.0        18        2         0        20
  outconv22feb31p5.wav       87.5        17        3         2        18
  outconv22feb63.wav         85.0        19        1         5        15
  outconv22feb125.wav        87.5        15        5         1        19
  outconv22feb250.wav        82.5        16        4         3        17
  outconv22feb500.wav        67.5        16        4         9        11
  outconv22feb1000.wav       65.0        14        6         8        12
  (X→Y: spoken word X recognised as Y; 20 tokens per word)
Results: corridor, left channel
  Impulse response ID     % Correct   SIR→SIR  SIR→STIR  STIR→SIR  STIR→STIR
  clean.wav                  95.0        18        2         0        20
  outconv22feb31p5.wav       90.0        18        2         2        18
  outconv22feb63.wav         87.5        19        1         4        16
  outconv22feb125.wav        77.5        15        5         4        16
  outconv22feb250.wav        72.5        17        3         8        12
  outconv22feb500.wav        67.5        17        3        10        10
  outconv22feb1000.wav       57.5        14        6        11         9
  (X→Y: spoken word X recognised as Y; 20 tokens per word)
Results: corridor, right channel
  Impulse response ID     % Correct   SIR→SIR  SIR→STIR  STIR→SIR  STIR→STIR
  clean.wav                  95.0        18        2         0        20
  outconv22feb31p5.wav       90.0        18        2         2        18
  outconv22feb63.wav         87.5        18        2         3        17
  outconv22feb125.wav        87.5        18        2         3        17
  outconv22feb250.wav        85.0        16        4         2        18
  outconv22feb500.wav        82.5        19        1         6        14
  outconv22feb1000.wav       60.0        15        5        11         9
  (X→Y: spoken word X recognised as Y; 20 tokens per word)
Conclusions
• The phone recogniser works well when constrained to recognise “sir”/”stir” only (95% correct).
• Recognition rate falls as reverberation increases, as expected.
• The fall in performance is not solely due to “stir” being reported as “sir”, as would be expected from human studies.
• There are some effects of BRIR channel on performance. The right channel of the corridor BRIR is less problematic, most likely due to a strong early reflection in the right channel for the 5 m condition.
Plans for next period: experiments
• The AI corpus lends itself to experiments in which target and context are varied, as in the Watkins et al. experiments.
• Suggestion:
– Compare listener and ASR phone confusions under conditions in which the whole utterance is reverberated, and when reverberation is added to the target syllable only.
• Possible problems:
– Relatively insensitive design? Will the effect of reverberation be sufficient to show up as consistent phone confusions?
– Are the contexts long enough? (some contexts are as short as 0.5 s)
– As shown in the baseline studies, the recogniser does not necessarily make the same mistakes as human listeners.
AI corpus sir/stir stimuli
• Utterances similar to the sir/stir format:
– Wider variety of speakers/contexts (but still limited vocabulary)
– Targets mostly nonsense, but some real words (e.g. sir/stir)
– Reverberated (by Amy) according to sir-stir paradigm
• Widening the sir/stir paradigm towards the ASR environment:
– Introduce different stop consonants first: s {t,p,k} ir
– Look for confusion in place of articulation
[Figure: near-near, near-far and far-far reverberation conditions]
Test words from AI corpus
We could record our own: sigh, sty, spy, sky (sky is missing)
Questions for Tony
• Generally - would this sort of thing work?
• Is the initial delay in BRIR kept?
• How should the AI corpus signals be level-normalised when mixed reverberation distances are used?
• How to control the ordering of stimuli?
Plans: system development
• Currently the ASR system is trained on TIMIT; expect improvement if adapted to the AI corpus material.
• We only have word-level transcriptions for the AI corpus, so phone labels must be obtained by forced alignment.
• We will try the efferent model as a front end for recognition of reverberated speech; however:
– it may not be sufficiently general, having been developed/tuned only for the sir/stir task
– that said, we have shown elsewhere that efferent suppression is effective in improving ASR performance in additive noise
– there is some relationship between the efferent model and successful engineering approaches
Plans: system development
• The current efferent model is not unrelated to the engineering approach of Thomas et al. (2008):
– “the effect of reverberation is reduced when features are extracted from gain normalized temporal envelopes of long duration in narrow subbands”
• Our efferent model also does gain control over long-duration windows (and will work in narrow bands).
• The model currently produces a spectral representation, but could be modified to give cepstral features for ASR.
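A rough sketch of long-window gain normalisation of a subband envelope, in the spirit of the Thomas et al. quote; the rectified envelope and 1 s window here are simplifying assumptions, not the efferent model itself:

```python
import numpy as np

def gain_normalised_envelope(subband, fs=16000, win_s=1.0, eps=1e-8):
    """Divide a subband envelope by its local mean over a long (1 s)
    window, so that slow gain changes (e.g. from reverberant energy
    build-up) are flattened out."""
    env = np.abs(subband)                 # crude rectified envelope
    win = int(win_s * fs)
    kernel = np.ones(win) / win
    local_mean = np.convolve(env, kernel, mode="same")
    return env / (local_mean + eps)

# A subband whose overall gain drifts upwards: normalisation flattens it.
t = np.linspace(0, 2, 32000)
subband = np.sin(2 * np.pi * 100 * t) * (1.0 + t)   # rising gain
norm = gain_normalised_envelope(subband)
```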
Plans: other approaches
• Parallel search over room acoustics and word models?
– How would context effects be included in such a scheme?
– On-line selection of word models trained in dry or reverberant conditions, according to context characteristics?
• Recognition within individual bands (i.e. train a recogniser for each band and combine posterior probabilities):
– May allow modelling of the Watkins et al. 8-band results
– Performance of multiband systems is generally lower than conventional ASR
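Posterior combination across bands is often done log-linearly (a weighted geometric mean); a minimal sketch, not tied to any particular multiband system:

```python
import numpy as np

def combine_band_posteriors(posteriors, weights=None):
    """Combine per-band phone posterior estimates by a weighted
    geometric mean (log-linear combination), then renormalise so
    the result sums to one."""
    P = np.asarray(posteriors, dtype=float)   # shape (n_bands, n_phones)
    if weights is None:
        weights = np.ones(P.shape[0]) / P.shape[0]
    log_comb = weights @ np.log(P + 1e-12)    # weighted sum of log posteriors
    comb = np.exp(log_comb)
    return comb / comb.sum()

# Two toy bands that disagree mildly over three phone classes.
combined = combine_band_posteriors([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
print(combined.argmax())   # 0
```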
Lunch