PERCEPTUAL COMPENSATION | EPSRC 12-MONTH PROGRESS MEETING 1
Automatic Speech Recognition Studies
Guy Brown, Amy Beeston and Kalle Palomäki
Overview
• Aims
• The articulation index (AI) corpus
• Phone recogniser
• Results on sir/stir subset of AI corpus
• Future plans
Aims
• Aim to develop a ‘perceptual constancy’ front-end for automatic speech recognition (ASR).
• Should be compatible with the Watkins et al. findings, but also validated on a 'real world' ASR task:
– wider vocabulary
– range of reverberation conditions
– variety of speech contexts
– naturalistic speech, rather than interpolated stimuli
– consider phonetic confusions in reverberation in general
Progress to date
• Current work has focused on implementing a baseline ASR system for the articulation index (AI) corpus, which meets the requirements for speech material stated on the previous slide.
• So far we have results for phone recognition on a small test set, without any 'constancy' processing.
• Planning evaluation that compares phonetic confusions made by listeners and ASR on the same test.
The articulation index (AI) corpus
• Recorded by Jonathan Wright (University of Pennsylvania); available via the LDC.
• Intended for speech recognition in noise experiments similar to those of Fletcher.
• Suggested to us by Hynek Hermansky; utterances are similar to those used by Watkins et al.:
– English (American)
– Target syllables are mostly nonsense, but some correspond to real words (including “sir” and “stir”)
– Target syllables are embedded in a context sentence drawn from a limited vocabulary
Details of the AI corpus
• Includes all “valid” English diphone (CV, VC) syllables.
• Triphone syllables (CVC, CCV, VCC) chosen according to frequency in the Switchboard corpus:
– correlated with syllable frequency in casual conversation.
• 12 male speakers, 8 female speakers.
• Approximately 2000 syllables common to all speakers.
• Small amount (10 min) of conversational data.
• All speech data sampled at 16 kHz.
AI corpus examples
• Target syllable preceded by two context words and followed by one context word:
– CW1 CW2 SYL CW3
– CW1, CW2 and CW3 drawn from sets of 8, 51 and 44 words respectively
• Examples:
they recognise sir entirely
people ponder stir second
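The CW1 CW2 SYL CW3 structure can be illustrated with a toy enumerator. The miniature word sets below are stand-ins; the real CW1, CW2 and CW3 sets contain 8, 51 and 44 words respectively:

```python
import itertools

# Hypothetical miniature versions of the AI-corpus context-word sets.
CW1 = ["they", "people"]
CW2 = ["recognise", "ponder"]
CW3 = ["entirely", "second"]
TARGETS = ["sir", "stir"]

def make_utterances():
    """Enumerate every CW1 CW2 SYL CW3 utterance from the toy sets."""
    return [" ".join(w) for w in itertools.product(CW1, CW2, TARGETS, CW3)]

utterances = make_utterances()
print(len(utterances))    # 2 * 2 * 2 * 2 = 16
print(utterances[0])      # "they recognise sir entirely"
```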
Phone recogniser
• Monophone recogniser implemented and trained on the TIMIT corpus.
• Based on HTK scripts by Tony Robinson [1].
• Front-end: speech encoded as 12 cepstral coefficients + energy + deltas + accelerations (39 features).
• Cepstral mean normalisation applied.
• 3 emitting states per phone model; observations modelled by a 20-component Gaussian mixture per state.
• Approx. 58% phone accuracy on the TIMIT test set. [1]

[1] http://www.cantabResearch.com/HTKtimit.html
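A minimal sketch of the delta/acceleration stacking and cepstral mean normalisation, assuming the 13 static coefficients (12 cepstra + energy) have already been computed; in the actual system HTK produces these features directly:

```python
import numpy as np

def deltas(feats, width=2):
    """HTK-style regression deltas computed over +/- `width` frames."""
    n = feats.shape[1]
    padded = np.pad(feats, ((0, 0), (width, width)), mode="edge")
    num = sum(t * (padded[:, width + t : width + t + n] -
                   padded[:, width - t : width - t + n])
              for t in range(1, width + 1))
    denom = 2 * sum(t * t for t in range(1, width + 1))
    return num / denom

def front_end(cepstra):
    """Given 13 static coefficients per frame (12 cepstra + energy),
    apply cepstral mean normalisation and stack deltas and
    accelerations to give the 39-feature vectors described above."""
    static = cepstra - cepstra.mean(axis=1, keepdims=True)  # CMN
    d = deltas(static)
    a = deltas(d)
    return np.vstack([static, d, a])

cepstra = np.random.randn(13, 100)   # stand-in for real static MFCCs
feats = front_end(cepstra)
print(feats.shape)                   # (39, 100)
```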
Training and testing
• Trained on TIMIT training set.
• Really needs adapting to the AI corpus material; work in progress.
• Removed allophones from the TIMIT labels (as is usual) to give a 41-phone set.
• Short pause and silence models.
• For testing on the AI corpus, word-level transcriptions were expanded into phone sequences using the Switchboard-ICSI pronunciation dictionary.
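The dictionary expansion step can be sketched as follows; the pronunciation entries here are invented for illustration and are not taken from the Switchboard-ICSI dictionary itself:

```python
# Toy pronunciation dictionary (entries invented for illustration).
PRON = {
    "they": ["dh", "ey"],
    "recognize": ["r", "eh", "k", "ax", "g", "n", "ay", "z"],
    "sir": ["s", "er"],
    "stir": ["s", "t", "er"],
    "entirely": ["eh", "n", "t", "ay", "er", "l", "iy"],
}

def expand(words):
    """Expand a word-level transcription into a phone sequence."""
    phones = []
    for w in words:
        phones.extend(PRON[w.lower()])
    return phones

print(expand(["they", "recognize", "sir", "entirely"]))
```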
Experiments
• Initial experiments done with a subset of AI corpus utterances in which the target syllable is “sir” or “stir”.
• Small test set of 40 utterances:
                 Male speaker   Female speaker
  “Sir”               12               8
  “Stir”              12               8
Experiment 1: Fletcher-style paradigm
• A recogniser grammar was used in which:
– The sets of context words CW1, CW2 and CW3 are specified;
– Target syllable is any sequence of two or three phones.
• Corresponds to task in which listener knows that context words are drawn from a limited set.
• Recogniser grammar is a (rather unconventional) mix of word-level and phone-level labels.
Experiment 1: recogniser grammar
$cw1 = I | YOU | WE | THEY | SOMEONE | NO-ONE | EVERYONE | PEOPLE;
$cw2 = SEE | SAW | HEAR | PERCEIVE | THINK | SAY | SAID | SPEAK | PRONOUNCE | WRITE | RECORD | OBSERVE | TRY | UNDERSTAND | ATTEMPT | REPEAT | DESCRIBE | DETECT | DETERMINE | DISTINGUISH | ECHO | EVOKE | PRODUCE | ELICIT | PROMPT | SUGGEST | UTTER | IMAGINE | PONDER | CHECK | MONITOR | RECALL | REMEMBER | RECOGNIZE | REPEAT | REPORT | USE | UTILIZE | REVIEW | SENSE | SHOW | NOTE | NOTICE | SPELL | READ | EXAMINE | STUDY | PROPOSE | WATCH | VIEW | WITNESS;
$cw3 = NOW | AGAIN | OFTEN | TODAY | WELL | CLEARLY | ENTIRELY | NICELY | PRECISELY | ANYWAY | DAILY | WEEKLY | YEARLY | HOURLY | MONTHLY | ALWAYS | EASILY | SOMETIME | TWICE | MORE | EVENLY | FLUENTLY | GLADLY | HAPPILY | NEATLY | NIGHTLY | ONLY | PROPERLY | FIRST | SECOND | THIRD | FOURTH | FIFTH | SIXTH | SEVENTH | EIGHTH | NINTH | TENTH | STEADILY | SURELY | TYPICALLY | USUALLY | WISELY;
$phn = AA | AE | AH | AO | AW | AX | AY | B | CH | D | DH | DX | EH | ER | EY | F | G | HH | IH | IY | JH | K | L | M | N | NG | OW | OY | P | R | S | SH | T | TH | UH | UW | V | W | Y | Z | ZH;
(!ENTER $cw1 $cw2 $phn $phn [$phn] $cw3 !EXIT)
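A quick way to sanity-check recogniser output against this grammar is a label-sequence matcher. The word and phone sets below are reduced stand-ins for the full $cw1/$cw2/$cw3/$phn definitions above:

```python
CW1 = {"I", "YOU", "WE", "THEY", "SOMEONE", "NO-ONE", "EVERYONE", "PEOPLE"}
# Reduced stand-ins for the full 51- and 44-word sets.
CW2 = {"RECOGNIZE", "PONDER", "IMAGINE", "SENSE", "EVOKE", "WITNESS"}
CW3 = {"ENTIRELY", "SECOND", "SURELY", "GLADLY", "PRECISELY", "DAILY"}
PHN = {"S", "T", "ER", "EH", "N", "P"}  # subset of the 41-phone set

def matches_grammar(labels):
    """Check a label sequence against the Experiment 1 grammar:
    $cw1 $cw2 $phn $phn [$phn] $cw3 (i.e. two or three target phones)."""
    if len(labels) not in (5, 6):
        return False
    if labels[0] not in CW1 or labels[1] not in CW2 or labels[-1] not in CW3:
        return False
    return all(p in PHN for p in labels[2:-1])

print(matches_grammar(["THEY", "IMAGINE", "S", "T", "ER", "SURELY"]))  # True
print(matches_grammar(["THEY", "SENSE", "S", "ER", "GLADLY"]))         # True
```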
Experiment 1: results
• Overall 47.5% correct at word level (sir/stir)
• Context words were not correctly recognised in some cases, with a knock-on effect on recognition of the target syllable.
• Examples:
  Reference                      Recogniser output               Outcome
  they imagine stir surely       they imagine s t er surely      Correct
  they sense stir gladly         they sense s er gladly          Deletion
  I evoke sir precisely          I evoke s eh precisely          Substitution (/eh/ as in 'head')
  they recognize sir entirely    they witness er n p daily       Incorrect context words
Experiment 2: constrained sir/stir
• A recogniser grammar was used in which:
– The sets of context words CW1, CW2 and CW3 are specified;
– Target syllable is constrained to “sir” or “stir”;
– Canonical pronunciation of “sir” and “stir” is assumed (i.e. “sir” = /s er/ and “stir” = /s t er/)
• Corresponds to Watkins-style task, except that context words vary and are drawn from a limited set.
• Utterances either presented clean or convolved with the left channel or right channel of the L-shaped room or corridor BRIRs.
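The reverberation step amounts to convolving each utterance with one BRIR channel; a minimal sketch with an invented two-channel impulse response standing in for the real BRIRs:

```python
import numpy as np

def reverberate(utterance, brir, channel=0):
    """Convolve a mono utterance with one channel of a binaural room
    impulse response (BRIR), as done for the test stimuli here."""
    ir = brir[:, channel] if brir.ndim == 2 else brir
    return np.convolve(utterance, ir)

fs = 16000
utt = np.random.randn(fs)               # stand-in for an AI-corpus utterance
brir = np.zeros((fs // 2, 2))           # toy two-channel impulse response
brir[0, :] = 1.0                        # direct sound
brir[800, 0] = 0.5                      # one early reflection, left channel
rev = reverberate(utt, brir, channel=0)
print(len(rev))                         # len(utt) + len(ir) - 1
```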
Experiment 2: recogniser grammar
• Recogniser grammar was:
$test = SIR | STIR;
( !ENTER $cw1 $cw2 $test $cw3 !EXIT )
with $cw1, $cw2 and $cw3 defined as before.
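For scale, the number of distinct utterances this grammar licenses follows directly from the set sizes stated earlier (8, 51 and 44 context words, two targets):

```python
# Distinct utterances licensed by the Experiment 2 grammar:
# 8 CW1 words x 51 CW2 words x 2 targets x 44 CW3 words.
n_cw1, n_cw2, n_targets, n_cw3 = 8, 51, 2, 44
n_sentences = n_cw1 * n_cw2 * n_targets * n_cw3
print(n_sentences)   # 35904
```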
Results: L-shaped room, left channel
  Impulse response ID     % Correct   SIR→SIR  SIR→STIR  STIR→SIR  STIR→STIR
  clean.wav                  95.0        18        2         0        20
  outconv22feb31p5.wav       92.5        18        2         1        19
  outconv22feb63.wav         85.0        18        2         4        16
  outconv22feb125.wav        72.5        13        7         4        16
  outconv22feb250.wav        67.5         9       11         2        18
  outconv22feb500.wav        62.5        15        5        10        10
  outconv22feb1000.wav       65.0        14        6         8        12
  (X→Y: spoken word X recognised as Y; 20 tokens per word)
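Each % Correct value is simply the trace of the corresponding 2x2 confusion matrix over its total; for example, for the clean and 500 ms conditions above:

```python
import numpy as np

def percent_correct(confusion):
    """Word-level accuracy from a 2x2 sir/stir confusion matrix whose
    rows are the spoken word and columns the recognised word."""
    confusion = np.asarray(confusion)
    return 100.0 * np.trace(confusion) / confusion.sum()

# Confusion matrices from the clean and 500 ms conditions in the table.
clean = [[18, 2], [0, 20]]
rev500 = [[15, 5], [10, 10]]
print(percent_correct(clean))    # 95.0
print(percent_correct(rev500))   # 62.5
```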
Results: L-shaped room, right channel
  Impulse response ID     % Correct   SIR→SIR  SIR→STIR  STIR→SIR  STIR→STIR
  clean.wav                  95.0        18        2         0        20
  outconv22feb31p5.wav       87.5        17        3         2        18
  outconv22feb63.wav         85.0        19        1         5        15
  outconv22feb125.wav        87.5        15        5         1        19
  outconv22feb250.wav        82.5        16        4         3        17
  outconv22feb500.wav        67.5        16        4         9        11
  outconv22feb1000.wav       65.0        14        6         8        12
  (X→Y: spoken word X recognised as Y; 20 tokens per word)
Results: corridor, left channel
  Impulse response ID     % Correct   SIR→SIR  SIR→STIR  STIR→SIR  STIR→STIR
  clean.wav                  95.0        18        2         0        20
  outconv22feb31p5.wav       90.0        18        2         2        18
  outconv22feb63.wav         87.5        19        1         4        16
  outconv22feb125.wav        77.5        15        5         4        16
  outconv22feb250.wav        72.5        17        3         8        12
  outconv22feb500.wav        67.5        17        3        10        10
  outconv22feb1000.wav       57.5        14        6        11         9
  (X→Y: spoken word X recognised as Y; 20 tokens per word)
Results: corridor, right channel
  Impulse response ID     % Correct   SIR→SIR  SIR→STIR  STIR→SIR  STIR→STIR
  clean.wav                  95.0        18        2         0        20
  outconv22feb31p5.wav       90.0        18        2         2        18
  outconv22feb63.wav         87.5        18        2         3        17
  outconv22feb125.wav        87.5        18        2         3        17
  outconv22feb250.wav        85.0        16        4         2        18
  outconv22feb500.wav        82.5        19        1         6        14
  outconv22feb1000.wav       60.0        15        5        11         9
  (X→Y: spoken word X recognised as Y; 20 tokens per word)
Conclusions
• The phone recogniser works well when constrained to recognise “sir”/”stir” only (95% correct).
• Recognition rate falls as reverberation increases, as expected.
• The fall in performance is not solely due to “stir” being reported as “sir”, as would be expected from human studies.
• There are some effects of BRIR channel on performance. The right channel of the corridor BRIR is less problematic, most likely due to a strong early reflection in the right channel for the 5 m condition.
Plans for next period: experiments
• The AI corpus lends itself to experiments in which target and context are varied, as in the Watkins et al. experiments.
• Suggestion:
– Compare listener and ASR phone confusions under conditions in which the whole utterance is reverberated, and when reverberation is added to the target syllable only.
• Possible problems:
– Relatively insensitive design? Will the effect of reverberation be sufficient to show up as consistent phone confusions?
– Are the contexts long enough? (some contexts are as short as 0.5 s)
– As shown in the baseline studies, the recogniser does not necessarily make the same mistakes as human listeners.
AI corpus sir/stir stimuli
• Utterances similar to the sir/stir format:
– Wider variety of speakers/contexts (but still limited vocabulary)
– Targets mostly nonsense, but some real words (e.g. sir/stir)
– Reverberated (by Amy) according to sir-stir paradigm
• Widening the sir/stir paradigm towards the ASR environment:
– Introduce different stop consonants first: s {t,p,k} ir
– Look for confusion in place of articulation
[Figure: near-near, near-far and far-far reverberation conditions]
Test words from AI corpus
We could record our own: sigh, sty, spy, sky (sky is missing)
Questions for Tony
• Generally - would this sort of thing work?
• Is the initial delay in BRIR kept?
• How should the AI corpus signals be level-normalised when mixed reverberation distances are used?
• How to control the ordering of stimuli?
Plans: system development
• Currently the ASR system is trained on TIMIT; expect improvement if adapted to the AI corpus material.
• We only have word-level transcriptions for the AI corpus, so phone labels must be obtained by forced alignment.
• We will try the efferent model as a front end for recognition of reverberated speech; however:
– it may not be sufficiently general, having been developed/tuned only for the sir/stir task
– that said, we have shown elsewhere that efferent suppression is effective in improving ASR performance in additive noise
– there is some relationship between the efferent model and successful engineering approaches
Plans: system development
• The current efferent model is not unrelated to the engineering approach of Thomas et al. (2008):
– “the effect of reverberation is reduced when features are extracted from gain normalized temporal envelopes of long duration in narrow subbands”
• Our efferent model also does gain control over long-duration windows (and will work in narrow bands).
• The model currently produces a spectral representation, but could be modified to give cepstral features for ASR.
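A rough sketch of long-window gain normalisation of a subband envelope, in the spirit of the Thomas et al. quote; the rectified envelope and 1 s window here are simplifying assumptions, not the efferent model itself:

```python
import numpy as np

def gain_normalised_envelope(subband, fs=16000, win_s=1.0, eps=1e-8):
    """Divide a subband envelope by its local mean over a long (1 s)
    window, so that slow gain changes (e.g. from reverberant energy
    build-up) are flattened out."""
    env = np.abs(subband)                 # crude rectified envelope
    win = int(win_s * fs)
    kernel = np.ones(win) / win
    local_mean = np.convolve(env, kernel, mode="same")
    return env / (local_mean + eps)

# A subband whose overall gain drifts upwards: normalisation flattens it.
t = np.linspace(0, 2, 32000)
subband = np.sin(2 * np.pi * 100 * t) * (1.0 + t)   # rising gain
norm = gain_normalised_envelope(subband)
```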
Plans: other approaches
• Parallel search over room acoustics and word models?
– How would context effects be included in such a scheme?
– On-line selection of word models trained in dry or reverberant conditions, according to context characteristics?
• Recognition within individual bands (i.e. train a recogniser for each band and combine posterior probabilities):
– May allow modelling of the Watkins et al. 8-band results
– Performance of multiband systems is generally lower than conventional ASR
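Posterior combination across bands is often done log-linearly (a weighted geometric mean); a minimal sketch, not tied to any particular multiband system:

```python
import numpy as np

def combine_band_posteriors(posteriors, weights=None):
    """Combine per-band phone posterior estimates by a weighted
    geometric mean (log-linear combination), then renormalise so
    the result sums to one."""
    P = np.asarray(posteriors, dtype=float)   # shape (n_bands, n_phones)
    if weights is None:
        weights = np.ones(P.shape[0]) / P.shape[0]
    log_comb = weights @ np.log(P + 1e-12)    # weighted sum of log posteriors
    comb = np.exp(log_comb)
    return comb / comb.sum()

# Two toy bands that disagree mildly over three phone classes.
combined = combine_band_posteriors([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
print(combined.argmax())   # 0
```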
Lunch