Post on 19-Dec-2015
Progress of Sphinx 3.X: From X=5 to X=6
Arthur Chan, Evandro Gouvea, David J. Huggins-Daines, Alex I. Rudnicky, Mosur Ravishankar, Yitao Sun
Here is another one…… Take-home message 2
We need better acoustic models.
This talk (~37 pages)
- Overview (6 pages)
- Better Software Architecture (9 pages)
- Speed of Sphinx 3.6 (3 pages)
- Accuracy Improvement (7 pages)
- Functionalities Improvement (3 pages)
- Documentation (4 pages)
- Sphinx 3.X (X>6) and Conclusion (~5 pages)
- Discussion (10 mins?)
What is CMU Sphinx?
- Definition 1: Large-vocabulary speech recognizers with high accuracy and speed performance.
- Definition 2: A collection of tools and resources that enables developers/researchers to build successful speech recognition systems.
Family of CMU Sphinx
- Decoders: Sphinx {II – IV}, PocketSphinx (by Dave, Oct 2005)
- Acoustic Model Trainer: SphinxTrain
- Documentation: Hieroglyphs, Robust/SphinxTrain Tutorial
Sphinx Developers
- Sphinx is maintained by volunteer programmers/researchers who like speech recognition
  - Funded by different projects
  - Motivated by different reasons
- All contributions go to the same codebase
- Goal: sustainable development of Sphinx
- Sphinx Developer Meetings are held regularly (secretly) to decide the way to go in Sphinx
What is Sphinx 3.X?
- An extension of Sphinx 3’s recognizers
- “Sphinx 3.X (X=6)” means “Sphinx 3.6”
- Provides more functionalities, such as
  - Real-time speech recognition
  - Speaker adaptation
  - Application Programming Interfaces (APIs) for developers
  - Different search algorithms
- 3.X (X>3) is motivated by Project CALO and GALE
Development History of Sphinx 3.X
S3 - Sphinx 3 flat-lexicon recognizer (s3 slow)
S3.2 - Sphinx 3 tree-lexicon recognizer (s3 fast)
S3.3 - live-mode demo
S3.4 - fast GMM, class-based LM, dynamic LM
S3.5 - some support for speaker adaptation
- live-mode APIs
3.X/3.0 merge
- Better search architecture/implementation
- More support for speaker adaptation
- Gentle re-factoring of the code base
- Some support for FSG decoding and confidence
- Better documentation/tutorial
lm_convert(lm3g2dmp)
dp3.6
This talk – Progress of Sphinx 3.6
- From the perspective of
  - a developer
  - an observer
- Sphinx 3.6: Where are we now? Where will we go?
- Summary of 5 talks: http://www.cs.cmu.edu/~archan/sphinxPresentation.html
Motivation of Re-Architecting Sphinx 3.X
- We are starting to need new search algorithms
  - Developing a new search algorithm carries risk.
  - We don’t want to throw away the old one.
  - Mere replacement could cause backward-compatibility problems.
- Code has grown to a stage where
  - Some changes could be very hard.
  - Multiple programmers become active at the same time
  - CVS conflicts could become frequent if things are controlled by “if-else” structures
Architecture of Sphinx 3.X (X<6)
- Batch sequential architecture (Shaw 96)
- Each executable has customized sub-routines: decode, livepretend, decode_anytopo, align, allphone
[Figure: four parallel pipelines, numbered 1 to 4. Each has its own Command Line, Initialization (kb and kbcore for pipeline 1), GMM Computation (1: approx_cont_mgau; 2 to 4: gauden & senone, Methods 1 to 3), Search, and Process Controller.]
Architecture Diagram of Sphinx 3.6
[Figure: four-layer architecture diagram.]
- Applications: decode, decode (anytopo), livepretend, align, allphone, dag, astar, livedecodeAPI, User Defined Applications
- Controllers/Abstractions: SearchController, ProcessController, SearchInitializer, CommandLine, Processor
- Implementations: Fast Single Stream GMM Computation, Multi Stream GMM Computation, FSG Search, Flat Lexicon Search, Tree Lexicon Search
- Libraries: Dictionary Library, Search Library, LM Library, AM Library, Utility Library, Feature Library, Miscellaneous Library
Separation of Mechanism and Implementation
Search Mechanism Module (srch.c)
- A class that provides Atomic Search Operations (ASOs) in the form of function pointers
- Configured by just setting function pointers
- A single interface for applications
Search Implementation Modules (srch_????.c)
- Could have many of them
- Possibilities:
  A. Decoding with different implementations
  B. Concept of search including
     - alignment
     - phoneme recognition
     - keyword spotting
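The separation described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the C of srch.c, and every name in it (`SearchOps`, `run_search`, `make_ops`, the slot names) is hypothetical, not the actual Sphinx 3.6 interface. The point is only that the driver never changes, and an implementation is chosen by filling in a table of callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical names throughout; this is not the real srch.c interface.
@dataclass
class SearchOps:
    """Table of atomic search operations (ASOs), one callable per slot."""
    init: Callable[[], str]
    search_one_frame: Callable[[int], str]
    finish: Callable[[], str]


def run_search(ops: SearchOps, n_frames: int) -> List[str]:
    """The mechanism: a fixed driver that only calls through the table."""
    trace = [ops.init()]
    for frame in range(n_frames):
        trace.append(ops.search_one_frame(frame))
    trace.append(ops.finish())
    return trace


def make_ops(name: str) -> SearchOps:
    """Build one implementation by setting the slots of the table."""
    return SearchOps(
        init=lambda: f"{name}:init",
        search_one_frame=lambda frame: f"{name}:frame{frame}",
        finish=lambda: f"{name}:finish",
    )


# Two implementations behind the same single interface.
IMPLEMENTATIONS: Dict[str, SearchOps] = {
    "tree": make_ops("tree"),
    "flat": make_ops("flat"),
}

tree_trace = run_search(IMPLEMENTATIONS["tree"], 2)
flat_trace = run_search(IMPLEMENTATIONS["flat"], 1)
```

An application only ever calls `run_search`; adding alignment or keyword spotting would mean adding another table, not another driver.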
Search Mechanism Module – What does it do?
Computation of One Frame
[Figure: flow of steps for one frame, grouped into “GMM Compute” and “Search For One Frame”.]
- Select Active CD Senone
- Compute Approx. GMM Score (CI senone)
- Compute Detail GMM Score (CD senone)
- Compute Detail HMM Score (CD)
- Propagate Graph (Phone-Level)
- Rescoring at Word End using High-Level KS (e.g. LM)
- Propagate Graph (Word-Level)
Search Implementations
- Implemented (-op_mode)
  - Finite State Grammar Search (Mode 2)
  - Flat Lexicon Search (Mode 3)
  - Tree Search (Mode 4)
- Not in 3.6
  - Aligner (Mode 0)
  - Phoneme recognition (Mode 1)
  - A new tree search (Mode 5)
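As a small illustration of how a mode number might select a search implementation, here is a hedged sketch. The mode numbers and descriptions are taken from the list above; the dictionary and function names are hypothetical and do not reflect how the decoder actually parses `-op_mode`.

```python
# Mode numbers and descriptions come from the slide;
# the names IMPLEMENTED, NOT_IN_3_6 and select_search are hypothetical.
IMPLEMENTED = {
    2: "Finite State Grammar Search",
    3: "Flat Lexicon Search",
    4: "Tree Search",
}
NOT_IN_3_6 = {
    0: "Aligner",
    1: "Phoneme recognition",
    5: "A new tree search",
}


def select_search(op_mode: int) -> str:
    """Return the search implementation name for a given operation mode."""
    if op_mode in IMPLEMENTED:
        return IMPLEMENTED[op_mode]
    if op_mode in NOT_IN_3_6:
        raise NotImplementedError(
            f"mode {op_mode} ({NOT_IN_3_6[op_mode]}) is not in 3.6"
        )
    raise ValueError(f"unknown op_mode {op_mode}")
```

Keeping the not-yet-available modes in their own table lets the caller distinguish "planned but missing" from "never existed".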
Different ways to implement search implementations
1. Use the default implementation
   - Just specify all the atomic search operations (ASOs) provided
2. Override “search_one_frame”
   - Only need to specify GMM computation and how to “search_one_frame”
3. Override the whole mechanism
   - For people who dislike the default so much
   - Override how to “search”
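The three levels above can be pictured with a short sketch. It uses Python inheritance as a stand-in for setting function pointers in C, and all class and method names other than `search_one_frame` and `search` are assumptions made for illustration, not the real Sphinx 3.6 code.

```python
class DefaultSearch:
    """Level 1: supply only the atomic operations; inherit the rest."""

    def gmm_compute(self, frame):
        return f"gmm({frame})"

    def hmm_compute(self, frame):
        return f"hmm({frame})"

    def propagate(self, frame):
        return f"prop({frame})"

    def search_one_frame(self, frame):
        # Default per-frame pipeline built from the atomic operations.
        return [
            self.gmm_compute(frame),
            self.hmm_compute(frame),
            self.propagate(frame),
        ]

    def search(self, n_frames):
        # Default utterance-level loop.
        out = []
        for frame in range(n_frames):
            out.extend(self.search_one_frame(frame))
        return out


class CustomFrameSearch(DefaultSearch):
    """Level 2: keep GMM computation, replace the per-frame search."""

    def search_one_frame(self, frame):
        return [self.gmm_compute(frame), f"custom_frame({frame})"]


class CustomSearch(DefaultSearch):
    """Level 3: replace the whole utterance-level mechanism."""

    def search(self, n_frames):
        return [f"whole_utterance({n_frames})"]
```

Each level overrides strictly more of the default than the one before it, which is why the first option is the cheapest to maintain.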
Consequence of Re-factoring
- Calling decode
  - Could use flat-lexicon decoding as well
- decode_anytopo still exists
  - For backward compatibility
  - decode_anytopo = decode
- allphone, align, decode_anytopo could use fast GMM computation
- decode could use S3’s SCHMM
- Command-line is now synchronized
Summary on the Architecture
- Sphinx 3.6: a gentle re-factoring has been carried out.
- A more flexible architecture
- A better playground for AM and search people
  - S2 SCHMM computation routine?
  - NN, SVM, ML techniques for AM?
Speed in Sphinx 3.6
- Further work on Context-Independent Senone-based GMM Selection (CIGMMS)
  - 20-30% speed-up
- 3 tricks were proposed
  - Fixed amount of CD senone computation
  - Use of best Gaussian index
  - Tightening factor of CI-phone beam
- Published in “On Improvements of CI-based GMM Selection” (Chan 2005)
  - but not very well received
  - Alright, there is some accuracy loss
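For readers unfamiliar with CI-based GMM selection, the basic idea can be sketched as below: score the cheap context-independent (CI) senones first, then evaluate a context-dependent (CD) senone in detail only if its CI parent scored within a beam of the best CI score, otherwise back off to the CI score. This is a simplified sketch under that assumption; the function name, data layout, and example numbers are hypothetical, and none of the three tricks listed above is modeled.

```python
from typing import Callable, Dict


def cigmms_scores(
    ci_scores: Dict[str, float],           # log score of each CI senone, this frame
    cd_parent: Dict[str, str],             # CD senone -> its CI parent senone
    detail_score: Callable[[str], float],  # expensive full GMM evaluation
    ci_beam: float,                        # log-domain beam width (>= 0)
) -> Dict[str, float]:
    """Compute CD senone scores, skipping detail work outside the CI beam."""
    best_ci = max(ci_scores.values())
    scores = {}
    for cd, ci in cd_parent.items():
        if ci_scores[ci] >= best_ci - ci_beam:
            scores[cd] = detail_score(cd)   # parent is promising: full GMM
        else:
            scores[cd] = ci_scores[ci]      # back off to the CI parent score
    return scores


# Hypothetical numbers, for illustration only.
ci_scores = {"AH": -10.0, "S": -50.0}
cd_parent = {"AH_1": "AH", "AH_2": "AH", "S_1": "S"}
detail = {"AH_1": -8.0, "AH_2": -12.0, "S_1": -45.0}
evaluated = []


def detail_score(cd: str) -> float:
    evaluated.append(cd)  # record which senones needed the expensive path
    return detail[cd]


scores = cigmms_scores(ci_scores, cd_parent, detail_score, ci_beam=20.0)
```

In this example only the two senones under `AH` take the expensive path; `S_1` reuses its parent score, which is where both the speed-up and the possible accuracy loss come from.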
A note on Sphinx 3.6 Speed Performance
- Sphinx 3.X works under 1xRT in most tasks.
  - E.g. Smartnote/Sphinx Integration
  - Broadcast News UNTUNED RESULT: 1.5xRT
- Sphinx 3.X is still slower than Sphinx 2
  - Fast setup of Sphinx 2: use 256-codeword SCHMM
  - Fast setup of Sphinx 3: use 2000-6000 senone FCHMM
  - Historical notes: comparable SCHMM setup has 4096 codewords
  - Need benchmarking to truly judge
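The “xRT” figures above are real-time factors: processing time divided by the duration of the audio, so values below 1 mean faster than real time. A minimal sketch of the arithmetic, with a hypothetical function name and made-up timings:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time as a multiple of the audio duration (xRT)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timings: 90 s of decoding for 60 s of audio.
rtf = real_time_factor(90.0, 60.0)
```

With these made-up numbers the decoder would run at 1.5xRT, the same factor quoted for the untuned Broadcast News setup.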
Speed - Conclusion
- Sphinx 3.X is at a reasonable level
- Sphinx 2 should still be used in speed-critical conditions
- Further work
  - GALE/CALO will still be around in 3.6/3.7
  - Accuracy becomes a stronger motivation than speed
Our Immediate Problem
- What helps us more in accuracy?
  - Acoustic modeling?
  - Speaker adaptation?
  - Search improvement?
Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation
- Speaker adaptation techniques are shown to be crucial
  - Even in tough tasks (e.g. CALO)
  - 10-15% relative improvement
  - Gain similar to LM/AM modeling work
Accuracy Improvement of Sphinx 3.6 – Speaker Adaptation (cont.)
- Dave has done a great job on
  - Multiple-class MLLR
  - MAP adaptation
- Things to watch
  - Ziad’s VTLN implementation
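As background on multiple-class MLLR, the mean update is an affine transform per regression class: each Gaussian mean mu in class c becomes A_c * mu + b_c. The sketch below applies already-estimated transforms only; estimating A and b from adaptation data is not shown. All names and numbers are hypothetical and unrelated to the actual Sphinx implementation.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mllr_adapt_mean(mean: Vector, a: Matrix, b: Vector) -> Vector:
    """Apply one affine MLLR transform: new_mean = A * mean + b."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
        for row, b_i in zip(a, b)
    ]


def adapt_all(
    means: Dict[str, Vector],
    class_of: Dict[str, int],
    transforms: Dict[int, Tuple[Matrix, Vector]],
) -> Dict[str, Vector]:
    """Adapt every Gaussian mean with the transform of its regression class."""
    return {
        name: mllr_adapt_mean(mu, *transforms[class_of[name]])
        for name, mu in means.items()
    }


# Hypothetical two-dimensional example with two regression classes.
means = {"g0": [1.0, 2.0], "g1": [0.0, 1.0]}
class_of = {"g0": 0, "g1": 1}
transforms = {
    0: ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]),  # identity plus a bias shift
    1: ([[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0]),   # pure scaling
}
adapted = adapt_all(means, class_of, transforms)
```

Using more than one class lets different groups of Gaussians move differently, at the cost of needing enough adaptation data per class.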
Conclusion in Speaker Adaptation
- Observation in 3.6
  - Speaker adaptation is very important.
- What we still need:
  - Maximum likelihood linear transformation (MLLT)
  - Combination of MLLT, MLLR, MAP and VTLN
    - Proved to be additive
Accuracy Improvement of Sphinx 3.6 - Search
- Our attempts in the flat-lexicon decoder
  - Full triphones
    - 2.5% rel. gain
    - But 100xRT
  - Full trigram
    - Will give another 5-10 times slowdown
- Diff between tree vs flat lex. decoder
  - 5% relative
- Conclusion: further improvement in search is limited
Accuracy Improvement in Sphinx 3.6 - Modeling
- Mainly on
  - addition of data (major contributor)
  - interpolation of LM (very decent gain)
- Things to watch: Yi’s LDA
- Yet to explore
  - Speaker Adaptive Training (SAT)
  - Semi-tied Covariance (STC) Matrix
- Conclusion: commodity techniques are still not widely used in Sphinx (bad sign).
Conclusion of Accuracy Improvement 3.6
- 3.6 has a healthy development in speaker adaptation
- Improvement in search is hard
- Need 10x effort on acoustic modeling
  - Commodity techniques are still not there
  - Three final keywords: MLLT, SAT, STC
- Priorities: Adaptation > AM, LM > 2nd Stage Search >> 1st Stage
FSG search
- 3.6 supports FSG search
  - Adapted from Sphinx 2’s implementation
- Current issues
  - No lextree implementation
  - Static allocation of all HMMs; not allocated “on demand”
  - FSG transitions represented by NxN matrix
- Other wish list
  - No histogram pruning
  - No state-based implementation
- Need more testing
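To see why an NxN transition matrix is listed as an issue, compare it with an adjacency list for a sparse grammar. The sketch below is purely illustrative: the function names and the toy grammar are hypothetical, and the dense version here stores at most one arc per state pair, which is a simplification and not a claim about the actual FSG code.

```python
from typing import Dict, List, Optional, Tuple

Arc = Tuple[int, int, str]  # (source state, destination state, word)


def dense_transitions(n_states: int, arcs: List[Arc]) -> List[List[Optional[str]]]:
    """NxN matrix: one cell per state pair, whether or not an arc exists."""
    matrix: List[List[Optional[str]]] = [[None] * n_states for _ in range(n_states)]
    for src, dst, word in arcs:
        matrix[src][dst] = word
    return matrix


def sparse_transitions(n_states: int, arcs: List[Arc]) -> Dict[int, List[Tuple[int, str]]]:
    """Adjacency list: storage grows with the number of arcs, not N squared."""
    adjacency: Dict[int, List[Tuple[int, str]]] = {s: [] for s in range(n_states)}
    for src, dst, word in arcs:
        adjacency[src].append((dst, word))
    return adjacency


# Hypothetical toy grammar: a linear chain of four states.
arcs = [(0, 1, "call"), (1, 2, "home"), (2, 3, "now")]
dense = dense_transitions(4, arcs)
sparse = sparse_transitions(4, arcs)

dense_cells = sum(len(row) for row in dense)
sparse_cells = sum(len(out) for out in sparse.values())
```

For this chain the matrix holds 16 cells for 3 arcs; the gap widens quadratically as the number of states grows.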
Confidence Annotation
- conf
  - Adapted from Rong with permission
  - Computes the word posterior probability of a word given a lattice
  - Still under work
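A common way to obtain posteriors from a lattice is a forward-backward pass over its edges, sketched below. This is a generic illustration and not the `conf` program itself: the function names and lattice layout are assumptions, nodes are assumed to be numbered in topological order with node 0 as the start and the last node as the end, and score scaling and the merging of edges that carry the same word are left out.

```python
import math
from typing import List, Tuple

Edge = Tuple[int, int, str, float]  # (src node, dst node, word, log score)


def logadd(a: float, b: float) -> float:
    """Return log(exp(a) + exp(b)) without overflow."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def edge_posteriors(n_nodes: int, edges: List[Edge]) -> List[float]:
    """Posterior of each edge; assumes src < dst for every edge."""
    alpha = [-math.inf] * n_nodes  # log sum over paths from start to node
    beta = [-math.inf] * n_nodes   # log sum over paths from node to end
    alpha[0] = 0.0
    beta[n_nodes - 1] = 0.0
    # Forward pass: edges sorted by source, so alpha[src] is final when used.
    for src, dst, _, score in sorted(edges, key=lambda e: e[0]):
        alpha[dst] = logadd(alpha[dst], alpha[src] + score)
    # Backward pass: edges sorted by destination, descending.
    for src, dst, _, score in sorted(edges, key=lambda e: e[1], reverse=True):
        beta[src] = logadd(beta[src], score + beta[dst])
    total = alpha[n_nodes - 1]
    return [math.exp(alpha[s] + sc + beta[d] - total) for s, d, _, sc in edges]


# Hypothetical lattice: two competing words, then one shared word.
lattice = [
    (0, 1, "a", math.log(0.75)),
    (0, 1, "b", math.log(0.25)),
    (1, 2, "c", 0.0),
]
posteriors = edge_posteriors(3, lattice)
```

Here the competing edges split the probability mass 0.75 to 0.25, while the shared edge lies on every path and gets posterior 1.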
Language Model Related
- Now fully supports
  - Text-based LM reading
  - Inter-conversion of LM in TXT & DMP format
    - lm_convert = lm3g2dmp++
- LM switching API in live_decode_API
Hieroglyphs
- A collection of documentation on using Sphinx 3, SphinxTrain and the CMU LM Toolkit
- 1st draft is completed
  - All chapters are filled with information.
  - Writing the 2nd draft
- “Chief Editor”: Arthur Chan
  - Does it even exist?
Hieroglyph: An outline
- Chapter 1: Licensing of Sphinx, SphinxTrain and LM Toolkit
- Chapter 2: Introduction to Sphinx
- Chapter 3: Introduction to Speech Recognition
- Chapter 4: Recipe of Building Speech Application using Sphinx
- Chapter 5: Different Software Toolkits of Sphinx
- Chapter 6: Acoustic Model Training
- Chapter 7: Language Model Training
- Chapter 8: Search Structure and Speed-up of the Speech Recognizer
- Chapter 9: Speaker Adaptation
- Chapter 10: Research using Sphinx
- Chapter 11: Development using Sphinx
- Appendix A: Command Line Information
- Appendix B: FAQ
Book Reviews of Hieroglyphs
- “You wrote the worst preface I have ever seen in my life.” Dr. Evandro Gouvea
- “The content is o.k., but the writing is still ……” Prof. Alex I. Rudnicky
- “Wow, it is thick. And, oh…… there are no blank spaces! You are not supposed to add contents in any CMU open source manuals, don’t you know?” Dr. Alan W. Black
Other Documents
- Robust Tutorial (aka Sphinx 101)
  - Thanks to Evandro
  - Now could be used for archive_s3, Sphinx 2, Sphinx 3
  - http://www.cs.cmu.edu/~robust/Tutorial/
- Doxygen documentation for Sphinx 3.x is fully available
  - http://www.speech.cs.cmu.edu/sphinx/sphinx3/doxygen/html/
What is important?
- Keep the current design priorities:
  1. Accuracy
     - We are just OK and we badly need to improve it.
  2. Speed
     - We are OK and it doesn’t hurt to improve it
  3. Functionalities
     - Still a pain to use Sphinx 3, but it is constantly improving
     - Usability eventually implies distributing models.
- Accuracy should take priority over Speed
  - No excuse in 3.7
Roadmap: In X=7……
- For GALE/CALO
  - Speaker Clustering/SAT
    - Bridging SI and SA
  - VTLN
  - LDA
- 0.5 x CALO may need further speed improvement
  - BBI
  - More secret ideas in GMM computation
Roadmap (cont.)
- X=8
  - D.T.: MMIE, MCE
  - STC
  - Interface with HTK model
- X=9
  - D.T. + S.A.
- X>10
  - Time to fire Arthur Chan and hire an assistant professor
We need your help!
- Project Manager: Enable development of Sphinx
  - Translation: Kick/Fix people and get Kicked/Fixed by Evandro
- Developers: Incorporate state-of-the-art speech technology into Sphinx
  - Translation: Fix 1 bug and generate 5 more
- Maintainer: Ensure integrity of Sphinx code and resources
  - Translation: You become the so-called “Grand Janitor of Sphinx”.
- Tester: Enable test-based development in Sphinx
  - Translation: You will learn a lot of Zen Buddhism.
Our Current Motto (Subject to Change)
“Don’t ever underestimate yourself…… You never know what kind of mess you could make.”
-Dr. Evandro Gouvea
Conclusion for Sphinx 3.X
- We have done something
- We are making some sense in the system development now
- We have healthy growth in accuracy
- But we still need more
Thank you
Acknowledgement
- Rich/Alan: for your constant encouragement
- Alex: for your understanding of Yin/Yang
- Rong: for contributing the confidence estimation program
- Bano: for reminding me I could die at any time when we were in Lake Arthur -> Hieroglyphs 1st draft’s progress sped up.
- Sphinx developers: without you, I wouldn’t be the “Grand Janitor”.
- Sphinx users: for your capabilities of giving me nightmares
Postscript, a word from my friend
“Don’t ever underestimate yourself…… You never know what a mess you could make.”
–Dr. Evandro Gouvea
Pros/Cons of Batch Sequential Architecture
- Pros:
  - Great flexibility for individual programmers
    - No assumptions; data structures are usually optimized for the application.
    - Align and allphone have optimizations.
  - Crafting in individual applications is of high quality
- Cons:
  - Great difficulty in maintenance
    - Most changes need to be carried out 5-6 times.
  - Spreads the disease of code duplication
    - Code with the same functionality was duplicated multiple times