Development of the Embedded Speech Recognition Interface done for AIBO

Post on 16-Jan-2016

1

Development of the Embedded Speech Recognition Interface done for AIBO

ICSI Presentation

January 2003

2

Xavier Menendez-Pidal, jointly with: Gustavo Hernandez-Abrego, Lei Duan,

Lex Olorenshaw, Honda Hitoshi, Helmut Luke

Spoken Language Technology, SONY NSCA

3300 Zanker Rd MS/SJ1B5, San Jose CA

E-mail: xavier@slt.sel.sony.com

3

ABSTRACT

This presentation highlights three key techniques used in the embedded isolated-command recognition system developed for AIBO:

Robust Broadband HMMs

Small context-dependent HMMs

Efficient Confidence Measure (task independent)

4

Sony’s AIBO entertainment robot

5

General AIBO ASR Overview and Features

End-point Detection + Feature Extraction

- Noise attenuation: NSS
- Channel normalization: CMS, CMV, or DB eq.

ASR + CM for Speech Verification

- ASR based on PLUs
- Engine based on Viterbi with beam search
- Lexicon: ~100 to 300 dictionary entries
- 3 states / 1 Gaussian per state CHMM triphone

Clean Speech + Mixed with Noises + Artificially Reverberated

AIBO Dialogue Manager

Other sensors:

- Vision
- Tact

AIBO:

- Activity
- Personality, mood
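The channel normalization step named above can be illustrated with a minimal sketch. It reads CMS as cepstral mean subtraction and CMV as cepstral mean and variance normalization; both expansions are assumptions, since the slide gives only the abbreviations. The function names and the toy frames are hypothetical.

```python
def column_means(frames):
    """Per-coefficient mean over all frames of one utterance."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def cms(frames):
    """Subtract the per-utterance mean from every cepstral coefficient."""
    means = column_means(frames)
    return [[c - m for c, m in zip(frame, means)] for frame in frames]


def cmvn(frames):
    """Mean subtraction followed by scaling each coefficient to unit variance."""
    centered = cms(frames)
    n = len(centered)
    stds = [(sum(c * c for c in col) / n) ** 0.5 for col in zip(*centered)]
    return [[c / s if s > 0 else 0.0 for c, s in zip(frame, stds)]
            for frame in centered]


# Toy utterance (hypothetical): 3 frames, 2 cepstral coefficients each.
frames = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0]]
centered = cms(frames)
normalized = cmvn(frames)
```

A constant channel offset in the cepstral domain shifts every frame by the same vector, so subtracting the utterance mean removes it.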

6

HMM Training Strategies 1

TRAINING OBJECTIVES:

Obtain a robust recognizer in noisy far-field conditions. We simulate noisy matched conditions by:

- Mixing clean speech with the expected noises at the target SNR
- Artificially reverberating the training corpus using the frequency response filter of the expected far-field room environments (0.5 ~ 1.5 m)

Obtain an accurate recognizer in near-field, high-SNR conditions.

The recognizer should be close to real time.

A tradeoff is obtained by training in both matched noisy conditions and clean speech conditions: “Broadband HMM”
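The two simulation steps above (reverberate, then mix with noise at a target SNR) can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names, the 3-tap room response, and the toy signals are all hypothetical, and a real room response would be much longer.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full linear convolution: applies a room response to a signal."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so speech power / noise power equals snr_db, then add."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def simulate_far_field(clean, room_response, noise, snr_db):
    """Room_Response * Speech + Noise for one training utterance."""
    reverberated = convolve(clean, room_response)[: len(clean)]
    return mix_at_snr(reverberated, noise, snr_db)


# Toy data (hypothetical): a sine "utterance", a short room response, white noise.
random.seed(0)
clean = [math.sin(2 * math.pi * 0.01 * t) for t in range(1000)]
room_response = [1.0, 0.5, 0.25]
noise = [random.gauss(0.0, 1.0) for _ in range(1000)]
noisy = simulate_far_field(clean, room_response, noise, snr_db=10.0)
```

Running each clean utterance through several room responses and noise types yields the matched-condition copies of the corpus, while the untouched clean copy covers the near-field case.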

7

Robust “Broadband” HMMs

Training data and the accumulators it feeds:

- Room_Response_1 * Speech + Noise_1 -> HMM accumulators (Noise + Reverberation 1)
- ...
- Room_Response_N * Speech + Noise_N -> HMM accumulators (Noise + Reverberation N)
- Clean Speech -> ~N clean accumulators

All accumulators -> Final Broadband HMM
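One way to read this diagram is that each training condition produces its own sufficient statistics, and the final model is estimated from their sum. The sketch below illustrates that reading for a single one-dimensional Gaussian. It is an assumption, not the authors' procedure: the class name is hypothetical, and so is the interpretation of "~N clean accumulators" as counting the clean statistics about N times to balance the N noisy conditions.

```python
from dataclasses import dataclass


@dataclass
class GaussianAccumulator:
    """Sufficient statistics for one 1-D Gaussian of one HMM state."""
    count: float = 0.0
    total: float = 0.0
    total_sq: float = 0.0

    def add(self, x, weight=1.0):
        """Accumulate one observation (weight = state occupancy)."""
        self.count += weight
        self.total += weight * x
        self.total_sq += weight * x * x

    def merge(self, other, scale=1.0):
        """Add another accumulator's statistics, optionally scaled."""
        self.count += scale * other.count
        self.total += scale * other.total
        self.total_sq += scale * other.total_sq

    def estimate(self):
        """Return (mean, variance) from the pooled statistics."""
        mean = self.total / self.count
        return mean, self.total_sq / self.count - mean * mean


def accumulate(values):
    acc = GaussianAccumulator()
    for v in values:
        acc.add(v)
    return acc


# Toy frames (hypothetical) for N = 2 noisy conditions plus clean speech.
noisy_accs = [accumulate([1.0, 3.0]), accumulate([2.0, 4.0])]
clean_acc = accumulate([0.0, 2.0])

broadband = GaussianAccumulator()
for acc in noisy_accs:
    broadband.merge(acc)
broadband.merge(clean_acc, scale=len(noisy_accs))  # clean counted ~N times
mean, var = broadband.estimate()
```

Because the statistics are additive, each condition can be processed independently and combined only at the final re-estimation step.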

8

Embedded ASR System Specification

- HMM with small memory size: < 500 Kb
- CPU-efficient ASR: the CPU can calculate a maximum of 300 Gaussians per frame
- Compressed front end, 20 features: 6 MFCC + 7 delta-MFCC + 7 delta2-MFCC
- Vocabulary can be easily modified: phone-based approach

9

Monophone vs Triphone

                    Monophone 1   Monophone 2   Triphone
# of Gaussians      4             2             1
# of States         ~120          ~120          ~1500
Beam                200~600       150~full      300
Memory (Kb)         90            45            500
Ave. Word Acc. (%)  95.5          83.6~86       97.2

10

CM computation

CM Generator -> CM > Thres?

- yes: perform AIBO action
- no: reject or ask for confirmation
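The decision step in this flow is a single comparison. A minimal sketch follows; the function name, the action labels, and the threshold value of 0.5 are hypothetical, since the slide does not give a numeric threshold.

```python
def decide(command, cm, threshold):
    """Map a recognized command and its confidence to a dialogue action."""
    if cm > threshold:
        return ("perform", command)
    # Below the threshold the robot either rejects the input
    # or asks the user to confirm it.
    return ("reject_or_confirm", command)


THRESHOLD = 0.5  # hypothetical value, tuned per acoustic model in practice
accepted = decide("sit down", 0.82, THRESHOLD)
doubtful = decide("sit down", 0.31, THRESHOLD)
```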

11

Recognition process

SPEECH -> N-best Recognizer (AM + Vocabulary) -> Hypo 1, Hypo 2, ..., Hypo N

12

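CM Formulation 1

Likelihood score ratio: LR, defined against a background score S_BKW.

Approximation with the N-best: LR is approximated from the best score S_1 and the remaining scores S_i, i = 2 ... N.

Used in combination with a test for in-vocabulary errors (C, computed from S_1, S_2 and the scores S_i, i = 3 ... N), a confidence measure is built: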

13

CM Formulation 2

The confidence measure CM is computed from the N-best scores using a pseudo-filler score and a pseudo-background score, where:

- CM: confidence value in [0,1]
- N: number of hypotheses in the list
- S_i: i-th score in the N-best list

Figure labels: S_1, S_average, S_N
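The exact formula is not legible in this transcript, so the sketch below assumes one simple form that is consistent with the legend: the pseudo-filler score is taken as the average of the scores below the best one, the pseudo-background score as the worst score S_N, and CM = (S_1 - filler) / (S_1 - S_N). This mapping is an assumption for illustration only and may differ from the authors' definition. It does yield a value in [0,1] and needs nothing beyond the N-best list.

```python
def confidence_measure(scores):
    """Assumed CM from N-best scores sorted best first (higher is better)."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two hypotheses")
    s1, sn = scores[0], scores[-1]
    if s1 == sn:
        return 0.0  # all hypotheses score the same: no evidence for the best one
    pseudo_filler = sum(scores[1:]) / (n - 1)  # assumed: mean of S_2 ... S_N
    pseudo_background = sn                     # assumed: worst score in the list
    return (s1 - pseudo_filler) / (s1 - pseudo_background)


# Toy log-likelihood scores (hypothetical).
clear_winner = confidence_measure([-100.0, -140.0, -150.0, -160.0])
close_runner_up = confidence_measure([-100.0, -101.0, -130.0, -154.0])
```

With these toy scores, a best hypothesis that stands well clear of the rest gets a higher value than one with a close runner-up, which is the behaviour a threshold test relies on.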

14

CM Thresholds for several AM’s and AIBO life

15

CM thresholds for different vocabularies

16

Conclusions

Broadband HMMs provide a convenient tradeoff between noise robustness and accuracy in quiet conditions.

HMMs with context-dependent units (triphones or biphones) and 1 Gaussian per state are computationally less expensive, more accurate, and more robust to noise than monophones.

The CM presented is very simple to compute yet effective at separating correct results from incorrect ones and OOVs.

CMs are robust to changes in the vocabulary and architecture of the recognizer.

Due to its simplicity and stability, the CM looks appealing for real-life command applications.

17

References

H. Lucke, H. Honda, K. Minamino, A. Hiroe, H. Mori, H. Ogawa, Y. Asano, H. Kishi, “Development of a Spontaneous Speech Recognition Engine for an Entertainment Robot”, ISCA IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, 2003.

G. Hernández-Ábrego, X. Menéndez-Pidal, T. Kemp, K. Minamino, H. Lucke, “Automatic Set-up for Spontaneous Speech Recognition Engines Based on Merit Optimization”, ICASSP-2003, Hong Kong.

X. Menéndez-Pidal, L. Duan, J. Lu, B. Dukes, M. Emonts, G. Hernández-Ábrego, L. Olorenshaw, “Efficient phone-based Recognition Engines for Chinese and English Isolated command applications”, International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, August 2002.

G. Hernández-Ábrego, X. Menéndez-Pidal, L. Olorenshaw, “Robust and Efficient Confidence measure for Isolated command application”, in Proceedings of Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Trento, Italy, December 2001.
