iitdmj 1

Multimedia Information Processing (1)

Koichi Shinoda

Tokyo Institute of Technology


Outline

• Theory and implementation of statistical speech recognition

– Hidden Markov models

– Clustering, Bayes estimation, etc.

– Speaker adaptation

• Video information retrieval


Syllabus

1. Introduction: Sound and speech

2. Speech analysis

3. Very simple speech recognition

4. Hidden Markov model (1)

5. Hidden Markov model (2)

6. Continuous speech recognition

7. Language model

8. Speaker adaptation

9. Video Information Retrieval (1)

10. Video Information Retrieval (2)


My CV

1987 Graduated from The University of Tokyo (Physics)

1989 MS from The University of Tokyo (Astronomical physics)

1989 Joined NEC Corporation. Research on speech recognition.

1997 Visiting Scholar at Bell Labs, NJ, USA (-1998)

2001 Dr. Eng. from Tokyo Institute of Technology

2001 Associate Professor at The University of Tokyo

2003 Associate Professor, Dept. of Computer Science, Tokyo Institute of Technology

Visiting Associate Professor at The Institute of Statistical Mathematics

2013 Professor, Dept. of Computer Science, Tokyo Tech


Research Area: Statistical Pattern Recognition (Speech, Video)

• Acoustic Modeling for speech recognition

– High speed calculation in pattern matching

– Autonomous model-size control

– Graphical Modeling

– Active learning

• Speaker Adaptation for speech recognition

– Rapid improvement from a small number of user utterances

• Robust speech recognition

– Noise, microphones, channels, ...

• Video Information Retrieval

– Highlight scene extraction from sports broadcasts

– High-level feature extraction

– Event detection (Surveillance)

• Multimodal interface

– Interface for simultaneous speech and gesture input

• Social Signal Processing

– Data mining from human-human communication


Speech recognition

• Familiar from science fiction (2001: A Space Odyssey, Blade Runner, Star Wars, …)

• Now used in car navigation, voice search, call centers, etc.

Problems:

Spontaneous speech, noisy environments, multi-modality, conversation, etc.


A brief history of speech recognition

1952: The first speech recognition system (10 digits, Bell Labs)

1952: Dynamic Programming (DP) was used in Operations Research

1968: The theory of the Hidden Markov Model (Baum)

1976: Research on speech recognition using HMMs (IBM)

1978: Commercial speech recognition system using DP matching (10 digits, N

1983: The development of HMM-based continuous speech recognition (AT&T

1980s∼: Large projects (DARPA)

1990s∼: Software for continuous speech recognition using HMMs

Speech recognition algorithm: Simple pattern matching → DP matching → HMM

Signal processing: extraction of good features

⇓ Computational theory, hardware

Information-theoretic approach: data mining from large databases


Gartner Hype Cycle for 2011

[Figure: hype cycle curve with the following technologies marked: Video Analysis for Consumer Service, Gesture Recognition, Image Recognition, Biometric Authentication Method, Speech Recognition. Slide annotations: "Babble", "Crash!"]

History of DARPA speech recognition benchmark tests

[Figure: word error rate (axis ticks at 100%, 10%, 1%) versus year (1988 to 2003). Speech types: Read Speech, Spontaneous Speech, Conversational Speech, Broadcast Speech. Task labels: Resource Management, ATIS, WSJ, NAB, Switchboard. Vocabulary sizes: 1k, 5k, 20k. Condition labels: Varied Microphone, Noisy, foreign.]

Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.

A speech recognition system


My Research in NEC

• Automatic interpretation system between Japanese and English (1989-1991)

• Large vocabulary speech recognition hardware (1993-1994)

• Speech recognition software on MS Windows (1994-1995)

• Dictation software (1998-2001)

• Robot with a speech recognition function

• Speech recognition middleware for car navigation systems

• Telephone speech recognition

• Japanese-English recognition

• Speech input interfaces for many applications, such as presentations, home appliances, and train transfer guides

• Robust speech recognition with a microphone array


Speech to speech translation system (Japanese ⇔ English) 1989-1991

• NEC’s C&C (Computers & Communications)

• Speech recognition + machine translation + speech synthesis

• Hardware implementation

• Demo at Telecom 91 (Geneva)

• I developed the English speech recognition tools.


Large Vocabulary Speech Recognition Device (1993-1994)

• Name: DS-1000

• Recognizes 1000 isolated words

• 2-3 million yen

• Market: hands-busy, eyes-busy tasks

– Grading meat by quality

– Wrapping fish and vegetables

• Since CPUs were not fast enough, we designed a special LSI

• I spent 3 months in the business department

• Circuit diagram, Time chart, Simulator, etc.


Dictation Software (1998-2001)

• Smart Voice series

• Large vocabulary continuous speech recognition

• Database, Algorithms, Evaluation,…

• Team leader for acoustic model development


Other projects


What you learn in this lecture

• Even beginners can run speech recognition

– Many tools and software packages are available: HTK, Sphinx, Juicer, T-cubed decoder

– But beginners do not know how they work

– And they do not know how to solve problems when they arise

Speech recognition INSIDE


Textbook

• S. Furui, "Digital Speech Processing, Synthesis, and Recognition", Second Edition, Marcel Dekker, 2001.

• C. M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.
