
Page 1: Development of the Embedded Speech Recognition  Interface done for AIBO


Development of the Embedded Speech Recognition Interface done for AIBO

ICSI Presentation

January 2003

Page 2: Development of the Embedded Speech Recognition  Interface done for AIBO


Xavier Menendez-Pidal, jointly with: Gustavo Hernandez-Abrego, Lei Duan,

Lex Olorenshaw, Honda Hitoshi, Helmut Luke

Spoken Language Technology, SONY NSCA

3300 Zanker Rd MS/SJ1B5, San Jose CA

E-mail:[email protected]

Page 3: Development of the Embedded Speech Recognition  Interface done for AIBO


ABSTRACT

This presentation highlights three key techniques used in the embedded isolated-command recognition system developed for AIBO:

- Robust broadband HMMs

- Small context-dependent HMMs

- Efficient confidence measure (task independent)

Page 4: Development of the Embedded Speech Recognition  Interface done for AIBO


Sony’s AIBO entertainment robot

Page 5: Development of the Embedded Speech Recognition  Interface done for AIBO

General AIBO ASR Overview and Features

End-point Detection + Feature Extraction
- Noise attenuation: NSS
- Channel normalization: CMS, CMV, or dB equalization

ASR + CM for Speech Verification
- ASR based on PLUs (phone-like units)
- Engine based on Viterbi decoding with beam search
- Lexicon of ~100 to 300 dictionary entries
- Triphone CHMMs with 3 states / 1 Gaussian per state

Acoustic training data: clean speech, mixed with noises, and artificially reverberated

AIBO Dialogue Manager
- Other sensors: vision, touch
- AIBO state: activity, personality, mood
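The channel-normalization step listed above (CMS, cepstral mean subtraction) is simple enough to sketch. The following is a minimal, generic illustration, not the deployed Sony front end: subtract each utterance's mean cepstral vector from every frame, which cancels a stationary channel bias.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Channel normalization by CMS: subtract the per-utterance mean
    of each cepstral coefficient, removing stationary channel effects."""
    cepstra = np.asarray(cepstra, dtype=float)   # shape: (frames, coeffs)
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# A constant channel offset added to every frame disappears after CMS.
frames = np.random.randn(50, 6)                  # 50 frames, 6 MFCCs
shifted = frames + 3.0                           # simulated channel bias
assert np.allclose(cepstral_mean_subtraction(frames),
                   cepstral_mean_subtraction(shifted))
```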

Page 6: Development of the Embedded Speech Recognition  Interface done for AIBO

HMM Training Strategies 1

TRAINING OBJECTIVES:

Obtain a robust recognizer in noisy far-field conditions. We SIMULATE matched noisy conditions by:
- Mixing clean speech with the expected noises at target SNRs
- Artificially reverberating the training corpus using the frequency response filters of expected far-field room environments (0.5 ~ 1.5 m)

Obtain an accurate recognizer in near-field, high-SNR conditions. The recognizer should run close to real time.

A tradeoff is obtained by training on matched noisy conditions and clean speech conditions together: the "Broadband HMM".
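The data-simulation recipe above (mix clean speech with an expected noise at a target SNR, then convolve with a room response) can be sketched in a few lines. The signals and the 2-tap room response below are toy placeholders, not the corpora or filters actually used:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio hits snr_db,
    then add it to the clean speech."""
    noise = np.resize(noise, speech.shape)        # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise

def reverberate(speech, room_response):
    """Simulate a far-field room (0.5 ~ 1.5 m) by convolving with its
    impulse response; truncate back to the original length."""
    return np.convolve(speech, room_response)[: len(speech)]

# Hypothetical data: a clean "utterance", white noise, a toy 2-tap room.
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 16000.0)
noisy = mix_at_snr(clean, np.random.randn(8000), snr_db=10.0)
far_field = reverberate(noisy, room_response=np.array([1.0, 0.4]))
```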

Page 7: Development of the Embedded Speech Recognition  Interface done for AIBO

Robust "Broadband" HMMs

Each training condition contributes its own set of accumulators:
- Room_Response_1 * Speech + Noise_1 → HMM accumulators (noise + reverberation 1)
- ...
- Room_Response_N * Speech + Noise_N → HMM accumulators (noise + reverberation N)
- Clean Speech → ~N clean accumulators

All accumulators are combined into the final Broadband HMM.
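One way to read the diagram above: each condition produces standard Baum-Welch sufficient statistics ("accumulators"), and the broadband model is re-estimated from their pooled sums. A toy sketch for a single Gaussian, assuming simple (occupancy, sum, sum-of-squares) accumulators rather than the actual training-tool format:

```python
import numpy as np

def merge_accumulators(accs):
    """Pool the sufficient statistics gathered per training condition
    (clean, each noise+reverberation set) before re-estimating one HMM.
    Each accumulator: occupancy count, sum of frames, sum of squares."""
    occ = sum(a["occ"] for a in accs)
    s1 = sum(a["sum"] for a in accs)
    s2 = sum(a["sumsq"] for a in accs)
    mean = s1 / occ
    var = s2 / occ - mean ** 2                 # diagonal covariance
    return mean, var

# Toy accumulators for one Gaussian of one state, two conditions.
clean = {"occ": 100.0, "sum": np.array([50.0]), "sumsq": np.array([125.0])}
noisy = {"occ": 100.0, "sum": np.array([150.0]), "sumsq": np.array([325.0])}
mean, var = merge_accumulators([clean, noisy])  # mean=[1.0], var=[1.25]
```

The pooled variance (1.25) is wider than either condition's own variance (1.0), which is exactly the "broadband" effect: one model that covers both operating conditions.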

Page 8: Development of the Embedded Speech Recognition  Interface done for AIBO

Embedded ASR System Specification

- HMM with small memory size: < 500 KB

- CPU-efficient ASR: the CPU can compute a maximum of 300 Gaussians per frame

- Compressed front end, 20 features: 6 MFCC + 7 delta-MFCC + 7 delta2-MFCC

- Vocabulary can be easily modified: phone-based approach
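The 6 + 7 + 7 layout suggests the dynamic streams carry one extra coefficient each (presumably an energy term alongside the 6 MFCCs; that detail is an assumption, not stated on the slide). A standard regression-based delta computation, as a sketch:

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta coefficients over +/- `window` frames,
    with edge padding at the utterance boundaries."""
    n = len(features)
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    num = sum(t * (padded[window + t: n + window + t]
                   - padded[window - t: n + window - t])
              for t in range(1, window + 1))
    denom = 2 * sum(t * t for t in range(1, window + 1))
    return num / denom

# Hypothetical per-frame vectors: energy + 6 MFCCs (7 columns), so the
# delta and delta2 streams each contribute 7 of the 20 features.
static = np.random.randn(40, 7)
d1 = delta(static)                            # 7 delta-MFCC
d2 = delta(d1)                                # 7 delta2-MFCC
feats = np.hstack([static[:, 1:], d1, d2])    # 6 + 7 + 7 = 20 features
```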

Page 9: Development of the Embedded Speech Recognition  Interface done for AIBO

Monophone vs Triphone

                 Monophone 1   Monophone 2   Triphone
# of Gaussians        4             2            1
# of States        ~120          ~120        ~1500
Beam            200~600      150~full          300
Memory (KB)          90            45          500
Ave. Word Acc.     95.5       83.6~86         97.2

Page 10: Development of the Embedded Speech Recognition  Interface done for AIBO

CM computation

The recognizer's output feeds a CM generator, and the confidence measure is compared against a threshold (Thres):
- CM > Thres: perform the AIBO action
- Otherwise: reject, or ask for confirmation
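The accept/reject branch in the flow above reduces to a one-line comparison. A sketch, where the threshold value is purely illustrative, not the deployed setting:

```python
def handle_utterance(best_word, cm, threshold=0.5):
    """Dispatch on the confidence measure: act when CM clears the
    threshold, otherwise reject or ask the user for confirmation.
    The 0.5 default is a placeholder, not the tuned AIBO value."""
    if cm > threshold:
        return ("perform_action", best_word)
    return ("reject_or_confirm", best_word)

assert handle_utterance("sit down", 0.9)[0] == "perform_action"
assert handle_utterance("sit down", 0.2)[0] == "reject_or_confirm"
```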

Page 11: Development of the Embedded Speech Recognition  Interface done for AIBO

Recognition process

SPEECH enters an N-best recognizer driven by the acoustic models (AM) and the Vocabulary; the output is a ranked list of hypotheses: Hypo 1, Hypo 2, ..., Hypo N.

Page 12: Development of the Embedded Speech Recognition  Interface done for AIBO

CM Formulation 1

Likelihood score ratio, approximated with the N-best list (log-domain scores, best hypothesis first):

    S_LR = S_1 - (1/(N-1)) * sum_{i=2..N} S_i

Used in combination with a test for in-vocabulary errors (a background score S_BKW), a confidence measure is built:

    C = (S_LR + S_BKW) / 2

Page 13: Development of the Embedded Speech Recognition  Interface done for AIBO

CM Formulation 2

Pseudo-filler score: the average score of the competing hypotheses,

    S_average = (1/(N-1)) * sum_{i=2..N} S_i

Pseudo-background score: S_N, the worst score in the list.

The confidence value in [0,1] compares the best score S_1 against both:

    CM = (S_1 - S_average) / (S_1 - S_N)

where N is the number of hypotheses in the list and S_i is the i-th score in the N-best list.
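Putting the listed ingredients together (best score S_1, pseudo-filler average S_average, pseudo-background S_N), the measure can be sketched as below. This is a reconstruction from the slide's ingredients; the exact published weighting may differ:

```python
import numpy as np

def confidence_measure(scores):
    """Confidence from an N-best list of log-likelihood scores, best
    first. Reconstructed sketch: compare the best score S_1 with the
    pseudo-filler (average of the competitors) and pseudo-background
    (worst score S_N). Result is clipped to [0, 1]."""
    s = np.asarray(scores, dtype=float)
    s1, sn = s[0], s[-1]
    s_avg = s[1:].mean()                   # pseudo-filler score
    if s1 == sn:                           # degenerate list: no separation
        return 0.0
    cm = (s1 - s_avg) / (s1 - sn)          # grows as S_1 pulls away
    return float(np.clip(cm, 0.0, 1.0))

# A clear winner scores higher confidence than a near-tie with the runner-up.
clear = confidence_measure([-10.0, -40.0, -45.0, -50.0])
tied = confidence_measure([-10.0, -10.5, -11.0, -50.0])
assert clear > tied
```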

Page 14: Development of the Embedded Speech Recognition  Interface done for AIBO


CM Thresholds for several AM’s and AIBO life

Page 15: Development of the Embedded Speech Recognition  Interface done for AIBO


CM thresholds for different vocabularies

Page 16: Development of the Embedded Speech Recognition  Interface done for AIBO

Conclusions

Broadband HMMs provide a convenient tradeoff between noise robustness and accuracy in quiet conditions.

HMMs with context-dependent units (triphones or biphones) and 1 Gaussian/state are computationally less expensive and more accurate than monophones, and more robust to noise.

The CM presented is very simple to compute, yet effective at separating correct results from incorrect ones and OOVs. CMs are robust to changes in the vocabulary and architecture of the recognizer. Due to its simplicity and stability, the CM looks appealing for real-life command applications.

Page 17: Development of the Embedded Speech Recognition  Interface done for AIBO

References

H. Lucke, H. Honda, K. Minamino, A. Hiroe, H. Mori, H. Ogawa, Y. Asano, H. Kishi, "Development of a Spontaneous Speech Recognition Engine for an Entertainment Robot", ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, 2003.

G. Hernández Ábrego, X. Menéndez-Pidal, T. Kemp, K. Minamino, H. Lucke, "Automatic Set-up for Spontaneous Speech Recognition Engines Based on Merit Optimization", ICASSP-2003, Hong Kong.

X. Menéndez-Pidal, L. Duan, J. Lu, B. Dukes, M. Emonts, G. Hernández-Ábrego, L. Olorenshaw, "Efficient Phone-Based Recognition Engines for Chinese and English Isolated Command Applications", International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, August 2002.

G. Hernández Ábrego, X. Menéndez-Pidal, L. Olorenshaw, "Robust and Efficient Confidence Measure for Isolated Command Applications", Proceedings of the Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Trento, Italy, December 2001.