Multimedia Information Processing (1)
Koichi Shinoda
Tokyo Institute of Technology
Outline
• Theory and implementation of statistical speech recognition
– Hidden Markov models
– Clustering, Bayes estimation, etc.
– Speaker adaptation
• Video information retrieval
Syllabus
1. Introduction: Sound and speech
2. Speech analysis
3. Very simple speech recognition
4. Hidden Markov model (1)
5. Hidden Markov model (2)
6. Continuous speech recognition
7. Language model
8. Speaker adaptation
9. Video Information Retrieval (1)
10. Video Information Retrieval (2)
My CV
1987 Graduated from The University of Tokyo (Physics)
1989 MS from The University of Tokyo (Astronomical physics)
1989 Joined NEC Corporation. Research on speech recognition.
1997 Visiting Scholar at Bell Labs, NJ, USA (-1998)
2001 Dr. Eng. from Tokyo Institute of Technology
2001 Associate Professor of The University of Tokyo
2003 Associate Professor of Dept. Computer Science, Tokyo Institute of Technology
Visiting Associate Professor of The Institute of Statistical Mathematics
2013 Professor of Dept. Computer Science, Tokyo Tech
Research Area
Statistical Pattern Recognition (Speech, Video)
• Acoustic Modeling for speech recognition
– High speed calculation in pattern matching
– Autonomous model-size control
– Graphical Modeling
– Active learning
• Speaker Adaptation for speech recognition
– Rapid improvement from a small number of user utterances
• Robust speech recognition
– Noise, microphones, channels, ...
• Video Information Retrieval
– Highlight scene extraction from sports broadcasts
– High level feature extraction
– Event detection (Surveillance)
• Multimodal interface
– Interface for simultaneous speech and gesture input
• Social Signal Processing
– Data mining from human-human communication
Speech recognition
• Familiar from science fiction (2001: A Space Odyssey, Blade Runner, Star Wars, …)
• Now used in car navigation, voice search, call centers, etc.
Problems:
spontaneous speech, noisy environments, multi-modality, conversation, etc.
A brief history of speech recognition
1952: The first speech recognition system (10 digits, Bell Labs)
1952: Dynamic Programming (DP) was used in Operations Research
1968: The theory of the Hidden Markov Model (Baum)
1976: Research on speech recognition using HMMs (IBM)
1978: Commercial speech recognition system using DP matching (10 digits, N
1983: The development of HMM-based continuous speech recognition (AT&T
1980s∼: Large projects (DARPA)
1990s∼: Software for continuous speech recognition using HMMs
Speech recognition algorithm
Simple pattern matching → DP matching → HMM
Signal processing – Extraction of good features
⇓ Computational theory, hardware
Information-theoretic approach – Data mining from large databases
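The "DP matching" step in the progression above can be illustrated with a minimal sketch of dynamic time warping. This is not taken from the lecture: it assumes one-dimensional feature sequences and an absolute-difference local distance, and the function name `dtw_distance` is hypothetical. Real systems compare sequences of multi-dimensional feature vectors and usually add path constraints.

```python
import math


def dtw_distance(a, b):
    """Accumulated DP-matching (DTW) distance between two 1-D sequences.

    Assumption for illustration: the local distance is |a[i] - b[j]| and the
    allowed steps are horizontal, vertical, and diagonal with equal weight.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch b: a advances, b stays
                cost[i][j - 1],      # stretch a: b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A template and the same pattern spoken at half speed (made-up values).
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
other = [3.0, 1.0, 3.0, 1.0, 3.0]

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The time-stretched copy aligns with the template at zero cost, while the unrelated pattern does not. This tolerance to speaking rate is what simple frame-by-frame pattern matching lacks.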
Gartner Hype Cycle for 2011
Figure: hype cycle chart. Labeled technologies: Video Analysis for Consumer Service, Gesture Recognition, Image Recognition, Biometric Authentication Method, Speech Recognition. Annotations: "Babble", "Crash!"
History of DARPA speech recognition benchmark tests
Figure: word error rate (log scale, 1% to 100%) plotted by year, 1988 to 2003.
Tasks: Resource Management, ATIS, WSJ, NAB, Switchboard (vocabulary sizes 1k, 5k, 20k)
Speech types: read, spontaneous, conversational, broadcast
Conditions: varied microphone, noisy, foreign
Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.
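The benchmark figure reports word error rate. As a rough sketch of how that metric is commonly computed (not necessarily the exact scoring procedure used in the DARPA evaluations), the code below counts the minimum number of word substitutions, deletions, and insertions needed to turn the reference transcript into the recognizer output, then divides by the number of reference words. The function name `word_error_rate` and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate = (substitutions + deletions + insertions) / reference length.

    Assumption for illustration: words are separated by whitespace and
    compared as exact strings, with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "show me flights to boston"
hypothesis = "show me flight to austin boston"
print(f"WER = {word_error_rate(reference, hypothesis):.0%}")
```

Because insertions count as errors, the rate can exceed 100% when the recognizer outputs many extra words.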
A speech recognition system
My Research at NEC
• Automatic interpretation system between Japanese and English (1989–1991)
• Large vocabulary speech recognition hardware (1993-1994)
• Speech recognition software on MS Windows (1994-1995)
• Dictation software (1998-2001)
• Robot with speech recognition function
• Speech recognition middleware for car navigation systems
• Telephone speech recognition
• Japanese-English recognition
• Speech input interface for many applications, such as presentations, home appliances, and train transfer guides
• Robust speech recognition with a microphone array
Speech to speech translation system (Japanese ⇔ English) 1989-1991
• NEC’s C&C (Computers & Communications)
• Speech recognition + machine translation + speech synthesis
• Hardware implementation
• Demo at Telecom91 (Genève)
• I built the English speech recognition tools.
Large Vocabulary Speech Recognition Device (1993-1994)
• Name: DS-1000
• Recognizes 1000 isolated words
• 2-3 million yen
• Market: hands-busy, eyes-busy
– Classifying meat by quality
– Wrapping fish, vegetables
• Since CPUs were not fast enough, we designed a special LSI
• I spent 3 months in the business department
• Circuit diagram, Time chart, Simulator, etc.
Dictation Software (1998-2001)
• Smart Voice series
• Large vocabulary continuous speech recognition
• Database, Algorithms, Evaluation,…
• Team leader for acoustic model development
Other projects
What you will learn in this lecture
• Even beginners can run speech recognition
– Many tools and software packages: HTK, Sphinx, Julius, T-cubed decoder
– But beginners do not know how it works
– Nor do they know how to solve problems
Speech recognition INSIDE
Textbook
• S. Furui, "Digital Speech Processing, Synthesis, and Recognition", Second Edition, Marcel Dekker, 2001.
• C. M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006.