CMU Sphinx Speech Recognition Engine
Reporter: Chun-Feng Liao
NCCU Dept. of Computer Science
Intelligent Media Lab
Purposes of this project
• Finding out how an efficient speech recognition engine can be implemented.
• Examining the source code of Sphinx2 to find out the role and function of each component.
• Reading key chapters of Dr. Mosur K. Ravishankar's thesis as a reference.
• Some demo programs will be given during the oral presentation.
Presentation Agenda
• Project Summary / Agenda / Goal. (In English)
• Introduction.
• Basics of Speech Recognition.
• Architecture of CMU Sphinx.
– Acoustic Model and HMM.
– Language Model.
• Java™ Platform Issues.
• Demo.
• Conclusion.
Voice Technologies
• In the mid- to late 1990s, personal computers started to become powerful enough to support automatic speech recognition (ASR).
• The two key underlying technologies behind these advances are speech recognition (SR) and text-to-speech synthesis (TTS).
Basics of Speech Recognition
Speech Recognition
• Capturing speech (analog) signals.
• Digitizing the sound waves and converting them to basic language units, or phonemes.
• Constructing words from phonemes, and contextually analyzing the words to ensure the correct spelling for words that sound alike (such as "write" and "right").
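The digitizing step can be sketched as sampling followed by quantization. This is only an illustration under assumed parameters: the 16 kHz sample rate, the 16-bit depth, and the names `digitize` and `tone` are not taken from the slides or from Sphinx.

```python
import math

def digitize(signal, duration_s, sample_rate_hz=16000, bits=16):
    """Sample a continuous signal and quantize it to signed integers.

    `signal` is a function of time (seconds) returning a value in
    [-1.0, 1.0]. The rate and bit depth are assumed, illustrative values.
    """
    max_level = 2 ** (bits - 1) - 1              # 32767 for 16-bit audio
    n_samples = int(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # time of the n-th sample
        value = max(-1.0, min(1.0, signal(t)))   # clip to the valid range
        samples.append(round(value * max_level)) # quantize to an integer
    return samples

def tone(t):
    """A 440 Hz sine wave standing in for the analog speech signal."""
    return math.sin(2 * math.pi * 440.0 * t)

pcm = digitize(tone, duration_s=0.01)
print(len(pcm), min(pcm), max(pcm))
```

Real front ends read samples from an audio device instead of a function, but the resulting integer sequence has the same form.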
Speech Recognition Process Flow
Source: Microsoft Speech.NET Home (http://www.microsoft.com/speech/)
Recognition Process Flow Summary
• Step 1: User Input
– The system captures the user's voice in the form of an analog acoustic signal.
• Step 2: Digitization
– Digitize the analog acoustic signal.
• Step 3: Phonetic Breakdown
– Breaking the signal into phonemes.
Recognition Process Flow Summary (2)
• Step 4: Statistical Modeling
– Mapping phonemes to their phonetic representations using a statistical model.
• Step 5: Matching
– Based on the grammar, the phonetic representation, and the dictionary, the system returns an n-best list (i.e., words, each with a confidence score).
– Grammar: the set of words or phrases that constrains the range of input or output in the voice application.
– Dictionary: the table that maps phonetic representations to words (e.g., thu, thee → the).
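The matching step can be illustrated with a toy lookup. Everything in this sketch is hypothetical: the dictionary entries, the scores, and the function `n_best` are invented for illustration and do not reflect how any particular engine stores or ranks its data.

```python
# Hypothetical dictionary: phonetic representation -> candidate words.
DICTIONARY = {
    "thu": ["the"],
    "thee": ["the"],
    "rayt": ["write", "right"],   # assumed spelling of a shared pronunciation
}

def n_best(hypotheses, grammar, n=3):
    """Return up to n (word, confidence) pairs, best first.

    `hypotheses` is a list of (phonetic representation, score) pairs, as
    might come out of the statistical modeling step. `grammar` is the set
    of words the application accepts.
    """
    best = {}
    for phonetic, score in hypotheses:
        for word in DICTIONARY.get(phonetic, []):
            if word in grammar:                       # grammar constrains the output
                best[word] = max(score, best.get(word, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]

hypotheses = [("thee", 0.62), ("thu", 0.81), ("rayt", 0.40)]
grammar = {"the", "write"}
result = n_best(hypotheses, grammar)
print(result)
```

Here the grammar removes "right" even though it shares a pronunciation with "write", which is the constraining role described above.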
Architecture of CMU Sphinx.
Introduction to CMU Sphinx
• A speech recognition system developed at Carnegie Mellon University.
• Consists of a set of libraries
– Core speech recognition functions
– Low-level audio capture
• Continuous speech decoding
• Speaker-independent
Brief History of CMU Sphinx
• Sphinx-I (1987)
– The first speaker-independent, high-performance ASR system in the world.
– Written in C by Dr. Kai-Fu Lee (currently chief technical advisor / vice president at Microsoft Asia).
• Sphinx-II (1992)
– Written in C by Dr. Xuedong Huang (currently the leader of the Microsoft Speech.NET team).
– 5-state HMM / N-gram LM.
• (We can infer that the core technology of CMU Sphinx strongly influenced the Microsoft Speech SDK.)
Brief History of CMU Sphinx (2)
• Sphinx 3 (1996)
– Built by Eric Thayer and Mosur Ravishankar.
– Slower than Sphinx-II, but the design is more flexible.
• Sphinx 4 (originally Sphinx 3j)
– Refactored from Sphinx 3.
– Fully implemented in Java.
– Not finished yet.
Components of CMU Sphinx
Front End
• libsphinx2fe.lib / libsphinx2ad.lib
• Low-level audio access
• Continuous listening and silence filtering
• Front End API overview.
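One common way to filter silence is to drop frames whose average energy falls below a threshold. The sketch below shows that general idea only; it is not the Sphinx2 front end's algorithm, and the frame size, the threshold, and the function names are assumptions.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of integer samples."""
    return sum(s * s for s in frame) / len(frame)

def drop_silence(samples, frame_size=160, threshold=1.0e6):
    """Keep only the frames whose energy reaches the threshold.

    frame_size=160 corresponds to 10 ms at an assumed 16 kHz sample rate.
    """
    kept = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if frame_energy(frame) >= threshold:
            kept.extend(frame)
    return kept

# Synthetic input: a quiet frame, a loud frame, then another quiet frame.
quiet = [10, -12, 8, -9] * 40
loud = [5000, -4800, 5100, -4900] * 40
audio = quiet + loud + quiet
speech = drop_silence(audio)
print(len(audio), len(speech))
```

A fixed threshold is the simplest choice; a continuously listening system would more likely adapt it to the background noise level.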
Knowledge Base
• The data that drives the decoder.
• Three sets of data
– Acoustic Model.
– Language Model.
– Lexicon (Dictionary).
Acoustic Model
• /model/hmm/6k
• Database of statistical models.
• Each statistical model represents a phoneme.
• Acoustic models are trained by analyzing large amounts of speech data.
HMM in Acoustic Model
• HMMs represent each unit of speech in the Acoustic Model.
• A typical HMM uses 3-5 states to model a phoneme.
• Each state of HMM is represented by a set of Gaussian mixture density functions.
• Sphinx2 default phone set.
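As a rough illustration of a phoneme HMM, the sketch below builds a 3-state left-to-right model and computes the probability of an observation sequence with the forward algorithm. The transition and emission values are invented, and discrete symbols stand in for the Gaussian mixture densities that the slides describe, to keep the example short.

```python
# Left-to-right topology: each state may stay put or move one state forward.
# All probabilities are made-up illustrative values.
TRANS = [
    [0.6, 0.4, 0.0],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]

# Discrete emission tables (an assumption; real acoustic models score
# continuous feature vectors with Gaussian mixtures instead).
EMIT = [
    {"a": 0.8, "b": 0.2},
    {"a": 0.3, "b": 0.7},
    {"a": 0.5, "b": 0.5},
]

def forward_probability(observations):
    """Probability of the observations, summed over all state paths
    that start in state 0."""
    n = len(TRANS)
    alpha = [0.0] * n
    alpha[0] = EMIT[0][observations[0]]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * TRANS[i][j] for i in range(n)) * EMIT[j][obs]
            for j in range(n)
        ]
    return sum(alpha)

print(forward_probability(["a", "b"]))
```

An acoustic model holds one such HMM per speech unit; recognition compares how well each of them explains the incoming features.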
Gaussian Mixtures
• Refer to textbook p. 33, eq. 38.
• Represent each state in an HMM.
• Each set of Gaussian mixtures is called a "senone".
• HMMs can share senones.
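The textbook equation is not reproduced here. As a stand-in, the sketch below evaluates a standard Gaussian mixture density with diagonal covariances, and shows two hypothetical HMMs pointing at the same senone. The parameter values and the names `SENONES`, `HMM_A` and `HMM_B` are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a diagonal-covariance Gaussian at the vector x."""
    density = 1.0
    for xi, mi, vi in zip(x, mean, var):
        density *= math.exp(-0.5 * (xi - mi) ** 2 / vi) / math.sqrt(2 * math.pi * vi)
    return density

def mixture_pdf(x, senone):
    """Weighted sum of the Gaussian components of one senone."""
    return sum(
        w * gaussian_pdf(x, m, v)
        for w, m, v in zip(senone["weights"], senone["means"], senone["variances"])
    )

# Made-up one-dimensional senones; real features have many dimensions.
SENONES = {
    "s1": {"weights": [0.5, 0.5], "means": [[0.0], [2.0]], "variances": [[1.0], [1.0]]},
    "s2": {"weights": [1.0], "means": [[5.0]], "variances": [[2.0]]},
}

# Two hypothetical HMMs whose first state uses the same senone.
HMM_A = ["s1", "s2"]
HMM_B = ["s1", "s1"]

p = mixture_pdf([1.0], SENONES[HMM_A[0]])
print(p)
```

Because `HMM_A` and `HMM_B` refer to senones by name, the density for "s1" needs to be computed only once per feature vector and can be reused by every state that shares it.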
Language Model
• Describes what is likely to be spoken in a particular context
• Word transitions are defined in terms of transition probabilities
• Helps to constrain the search space
• See examples of LM.
N-gram Language Model
• The probability of word N depends on words N-1, N-2, ...
• Bigrams and trigrams are most commonly used
• Used for large-vocabulary applications such as dictation
• Typically trained on a very large corpus (millions of words)
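A bigram model can be sketched by counting word pairs and dividing by the count of the preceding word. The three-sentence corpus and the function name `train_bigram` are invented for illustration, and this maximum-likelihood estimate has no smoothing, so unseen word pairs get probability zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Return a function prob(word, previous) estimated from the sentences."""
    history_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        # <s> and </s> mark the sentence start and end.
        words = ["<s>"] + sentence.split() + ["</s>"]
        history_counts.update(words[:-1])
        pair_counts.update(zip(words[:-1], words[1:]))

    def prob(word, previous):
        if history_counts[previous] == 0:
            return 0.0                      # unseen history
        return pair_counts[(previous, word)] / history_counts[previous]

    return prob

corpus = ["open the door", "close the door", "open the window"]
prob = train_bigram(corpus)
print(prob("door", "the"), prob("window", "the"))
```

A trigram model works the same way with two words of history, which is why the amount of training text needed grows so quickly.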
Decoder
• Selects the next set of likely states
• Scores incoming features against these states
• Drops low-scoring states
• Generates results
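The four decoder steps above resemble a Viterbi search with beam pruning. The sketch below follows that general pattern on a toy 3-state model; the probabilities, the discrete observation symbols, and the beam width are assumptions, and this is not the Sphinx decoder itself.

```python
import math

# Toy model with made-up probabilities: state -> {next state: probability}.
TRANS = {
    "s0": {"s0": 0.6, "s1": 0.4},
    "s1": {"s1": 0.7, "s2": 0.3},
    "s2": {"s2": 1.0},
}
EMIT = {
    "s0": {"a": 0.8, "b": 0.2},
    "s1": {"a": 0.3, "b": 0.7},
    "s2": {"a": 0.5, "b": 0.5},
}

def decode(observations, start="s0", beam=5.0):
    """Beam-pruned Viterbi search; returns (best path, log score)."""
    # Each active state maps to (log score, best path reaching it).
    active = {start: (math.log(EMIT[start][observations[0]]), [start])}
    for obs in observations[1:]:
        candidates = {}
        for state, (score, path) in active.items():
            # Step 1: select the next set of likely states.
            for nxt, p in TRANS[state].items():
                # Step 2: score the incoming feature against that state.
                new_score = score + math.log(p) + math.log(EMIT[nxt][obs])
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        # Step 3: drop states scoring more than `beam` below the best one.
        best = max(score for score, _ in candidates.values())
        active = {s: v for s, v in candidates.items() if v[0] >= best - beam}
    # Step 4: generate the result from the best surviving state.
    final_state = max(active, key=lambda s: active[s][0])
    score, path = active[final_state]
    return path, score

path, score = decode(["a", "b", "b"])
print(path, math.exp(score))
```

A narrower beam discards more states per frame, which speeds up the search at the risk of pruning the path that would have won.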
Speech in Java™ Platform
Sun Java Speech API
• First released on October 26, 1998.
• The Java™ Speech API allows Java applications to incorporate speech technology into their user interfaces.
• Defines a cross-platform API to support command-and-control recognizers, dictation systems, and speech synthesizers.
Implementations of Java Speech API
• Open source
– FreeTTS / CMU Sphinx4.
• IBM Speech for Java.
• Cloud Garden.
• L&H TTS for Java Speech API.
• Conversa Web 3.0.
FreeTTS
• Fully implemented in Java.
• Based on Flite 1.1, a small run-time speech synthesis engine developed at CMU.
• Partial support for JSAPI 1.0.
– Speech Recognition functions.
– JSML.
Sphinx 4 (Sphinx 3j)
• Fully implemented in Java.
• Speed is equal to or faster than Sphinx3.
• The acoustic model and language model are under construction.
• Source code is available via CVS (but you cannot run any applications without models!).
For example, to check out Sphinx4, you can use the following command:
cvs -z3 -d:pserver:[email protected]:/cvsroot/cmusphinx co sphinx4
Java™ Platform Issues
• GC makes managing data much easier
• Native engines typically optimize inner loops for the CPU – can't do that on the Java platform.
• Native engines arrange data to optimize cache hits – can't really do that either.
DEMO
• Sphinx-II batch mode.
• Sphinx-II live mode.
• Sphinx-II client/server mode.
• A simple FreeTTS application.
• (Java-based) TTS vs. (C-based) SR.
• Motion Planner with FreeTTS, using Java Web Start™. (This is the GRA course final project.)
Summary
• Sphinx is an open-source speech recognition system developed at CMU.
• FE / KB / Decoder form the core of the SR system.
• The FE receives and processes the speech signal.
• The Knowledge Base provides data for the Decoder.
• The Decoder searches the states and returns the results.
• Speech recognition is a challenging problem for the Java platform.
Reference
• Mosur K. Ravishankar, Efficient Algorithms for Speech Recognition, CMU, 1996.
• Mosur K. Ravishankar, Kevin A. Lenzo, Sphinx-II User Guide, CMU, 2001.
• Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing, Prentice Hall, 2000.
Reference (on-line)
• On-line documents of Java™ Speech API
– http://java.sun.com/products/java-media/speech/
• On-line documents of FreeTTS
– http://freetts.sourceforge.net/docs/
• On-line documents of Sphinx-II
– http://www.speech.cs.cmu.edu/sphinx/
Q & A