Real-Time Voice Actuation
• Designed and built a real-time speech recognition system
• Built with embedded hardware
• Used the Source-Filter model of speech and a Support Vector Machine classifier to recognize the commands “zero” through “nine”
• Finished system executes in real time and has GPIO-based actuation to demonstrate functional voice recognition
Project Goals
• Word characterization should be independent of volume, pitch, and duration of the word
• Simplify the speech production model to two parts:
  1. Source: vibration of the vocal cords
  2. Filter: vocal tract (i.e. positioning of tongue, mouth, etc.)
• Accurately modeling the filter provides a basis for word recognition[4]
Source-Filter Model of Speech
Broad sweeps of the spectrum (formants) result from the filter configuration. Rapidly varying peaks come from the harmonics of the source.
All-Pole Filter Coefficients
• An order-n all-pole filter can be estimated from the autocorrelation of the signal at lags 0 through n
• The Levinson-Durbin recursion calculates the all-pole filter coefficients from the autocorrelation
• We want to capture the spectral envelope, so we want ~10 filter coefficients [5]
• Too many coefficients lead to over-fitting of the curve
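The recursion named above can be sketched in a few lines. This is a minimal illustration in Python, not the project's code (the slides say the deployed version was generated from MATLAB). The function name `levinson_durbin` and the sign convention A(z) = 1 + a[1] z^-1 + ... are assumptions made for the example.

```python
def levinson_durbin(r, order):
    """Estimate all-pole filter coefficients from autocorrelation values.

    r     -- autocorrelation at lags 0 .. order
    order -- number of filter coefficients to estimate
    Returns (a, err): a[0] == 1.0 and the model is 1 / A(z) with
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order; err is the
    remaining prediction error energy.
    """
    if len(r) < order + 1:
        raise ValueError("need autocorrelation lags 0 through order")
    if r[0] <= 0.0:
        raise ValueError("r[0] must be positive (frame has no energy)")

    a = [1.0] + [0.0] * order
    err = float(r[0])
    for i in range(1, order + 1):
        # Correlation between the current prediction error and lag i.
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
        if err <= 0.0:                      # perfectly predictable input
            break
    return a, err


# Autocorrelation of a first-order process with a pole at 0.5.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

For this input the recursion recovers the single pole and leaves the second coefficient at zero, which is a quick sanity check before feeding in real speech frames.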
Cepstral Coefficients
• Cepstrum is useful in separating the source and filter
• Cepstral coefficients are a very compact representation of the spectral envelope and are highly uncorrelated
• Filter coefficients are too sensitive to numerical precision
• Better to transform LP coefficients into cepstral coefficients[5]
Cepstral analysis on the source-filter model: (a) DFT, (b) log magnitude of DFT, (c) IDFT
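The LP-to-cepstral conversion mentioned above has a standard closed-form recursion, sketched here in Python. The function name `lpc_to_cepstrum` is hypothetical, the sign convention A(z) = 1 + a[1] z^-1 + ... is an assumption, and the gain term c[0] is left out; the project's own conversion was written in MATLAB and may differ in these details.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert all-pole coefficients to cepstral coefficients.

    a      -- [1.0, a1, ..., ap], where the model is 1 / A(z) and
              A(z) = 1 + a1 z^-1 + ... + ap z^-p
    n_ceps -- number of cepstral coefficients c[1] .. c[n_ceps] to return
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)          # c[0] (gain term) is not computed
    for n in range(1, n_ceps + 1):
        acc = 0.0
        # Only terms with 1 <= n - k <= p contribute.
        for k in range(max(1, n - p), n):
            acc += k * c[k] * a[n - k]
        c[n] = -acc / n
        if n <= p:
            c[n] -= a[n]
    return c[1:]


# A single pole at 0.5 has the known cepstrum c[n] = 0.5**n / n.
cepstrum = lpc_to_cepstrum([1.0, -0.5], n_ceps=4)
```

The single-pole case is useful as a check because its cepstrum is known analytically.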
Support Vector Machine Learning
• Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression
• We use a multi-class Support Vector Machine
• Our algorithm uses the one-against-one method to construct k(k-1)/2 classifiers (k = number of classes), one SVM for each pair of classes; for the ten digits, that is 45 classifiers
• LIBSVM, an integrated software package for multi-class support vector classification, is used [6]
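The one-against-one scheme can be illustrated without any SVM library. In the sketch below, each pairwise classifier is replaced by a toy nearest-center rule on a single number; `make_toy_decider`, `centers`, and the tie-break (smallest label wins) are all assumptions for illustration and are not LIBSVM's API or the project's trained models.

```python
from collections import Counter
from itertools import combinations


def class_pairs(labels):
    """Every unordered pair of classes: k * (k - 1) / 2 pairs for k classes."""
    return list(combinations(sorted(labels), 2))


def one_vs_one_predict(x, pairwise_deciders):
    """Let each pairwise classifier vote; return the label with most votes."""
    votes = Counter(decide(x) for decide in pairwise_deciders.values())
    top = max(votes.values())
    # Assumed tie-break: the smallest label among the top scorers.
    return min(label for label, count in votes.items() if count == top)


def make_toy_decider(label_a, label_b, centers):
    """Stand-in for a trained binary SVM: pick the nearer class center."""
    def decide(x):
        dist_a = abs(x - centers[label_a])
        dist_b = abs(x - centers[label_b])
        return label_a if dist_a <= dist_b else label_b
    return decide


digits = range(10)
centers = {d: float(d) for d in digits}
deciders = {
    pair: make_toy_decider(pair[0], pair[1], centers)
    for pair in class_pairs(digits)
}
```

With ten classes this builds 45 deciders, matching k(k-1)/2, and a call such as `one_vs_one_predict(6.9, deciders)` tallies one vote from each of them.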
Library
• Stored autocorrelation coefficients calculated on the C5515
• Calculated cepstral coefficients in MATLAB
• Three male speakers with a combined 1920 recordings
  • 64 instances of each digit for each speaker
9 Coef    0    1    2    3    4    5    6    7    8    9
0       154    0    0    4    0    0   22    6    0    6
1         0  166    1    1   23    1    0    0    0    0
2         1    0  168   22    0    0    1    0    0    0
3        13    0    6  172    0    0    1    0    0    0
4         1    9    0    0  181    0    0    1    0    0
5         0    1    0    0    0  190    0    1    0    0
6         4    0    1    0    0    0  187    0    0    0
7         1    0    0    0    1    0    0  189    0    1
8         0    0    1    0    0    0    0    0  191    0
9         0    0    1    0    0    0    0    2    0  189
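If the numbers above are read as a 10x10 confusion matrix for the 9-coefficient case, one row per digit, the overall recognition rate follows from the diagonal. The short Python sketch below does that arithmetic; the row-by-row reading of the flattened table is an assumption, although it is consistent with each row summing to 192 (3 speakers x 64 instances).

```python
# Confusion matrix for the 9-coefficient case, read row by row.
confusion = [
    [154, 0, 0, 4, 0, 0, 22, 6, 0, 6],
    [0, 166, 1, 1, 23, 1, 0, 0, 0, 0],
    [1, 0, 168, 22, 0, 0, 1, 0, 0, 0],
    [13, 0, 6, 172, 0, 0, 1, 0, 0, 0],
    [1, 9, 0, 0, 181, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 190, 0, 1, 0, 0],
    [4, 0, 1, 0, 0, 0, 187, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 189, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 191, 0],
    [0, 0, 1, 0, 0, 0, 0, 2, 0, 189],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# Fraction of each row that lies on the diagonal.
per_digit = [confusion[i][i] / sum(confusion[i]) for i in range(len(confusion))]
```

Under that reading, 1787 of the 1920 recordings fall on the diagonal, or roughly 93%.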
Rejected Methods
• Classification based on correlation of cepstral coefficients
  • Took maximum correlation between new signal and library
  • Not very robust to small variations or scalable
• Classification using SVM on CRM database
  • Words cut off early in database or contaminated by other words
  • Recording conditions do not match our method
C5515: Vocalization Identification
• Implemented word vs. non-word identification
• Grab frame of 256 samples
• Compute RMS of frame, compare to threshold
  • If RMS > threshold
    • Accumulate frame data
  • Else if RMS < threshold and frames acquired > 3
    • Compute autocorrelation
    • Transmit data
  • Else
    • Reset stored data
• Specific values determined experimentally
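The frame logic above can be sketched as a small Python class. This is only an illustration of the decision rule, not the C5515 firmware: the names `WordDetector` and `push_frame`, the choice of 10 autocorrelation lags, and computing the autocorrelation over all accumulated frames at once are assumptions, and the threshold must be chosen experimentally as the slide notes.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def autocorrelation(samples, max_lag):
    """Unnormalized autocorrelation at lags 0 .. max_lag."""
    n = len(samples)
    return [
        sum(samples[i] * samples[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


class WordDetector:
    def __init__(self, threshold, min_frames=3, max_lag=10):
        self.threshold = threshold    # assumed: tuned by experiment
        self.min_frames = min_frames  # "frames acquired > 3" on the slide
        self.max_lag = max_lag        # assumed number of lags to send
        self.frames = []

    def push_frame(self, frame):
        """Feed one frame; return autocorrelation when a word ends, else None."""
        level = rms(frame)
        if level > self.threshold:
            self.frames.append(list(frame))       # accumulate frame data
            return None
        if level < self.threshold and len(self.frames) > self.min_frames:
            samples = [s for f in self.frames for s in f]
            self.frames = []
            return autocorrelation(samples, self.max_lag)
        self.frames = []                          # too short: reset
        return None
```

A caller would feed 256-sample frames one at a time and transmit whatever non-`None` result comes back.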
C5515: UART Transmission
• Transmit autocorrelation coefficients
  • UART is 115200 baud, 8 data bits, no parity, 1 stop bit
  • Data is signed 16-bit
  • Bit masking and reconstruction on the Raspberry Pi
• BlueSMiRF Bluetooth-UART pipes
  • Abstracts wireless transmission
  • Looks like UART to the microcontroller
  • Effectively plug and play
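Because the UART carries 8 bits at a time, each signed 16-bit coefficient has to be split into two bytes and rebuilt on the receiving side. The Python sketch below shows one way to do the masking and sign reconstruction; the high-byte-first order and the function names are assumptions, since the slides do not give the framing.

```python
def split_int16(value):
    """Split a signed 16-bit value into (high_byte, low_byte)."""
    if not -0x8000 <= value <= 0x7FFF:
        raise ValueError("value does not fit in signed 16 bits")
    raw = value & 0xFFFF              # two's-complement bit pattern
    return (raw >> 8) & 0xFF, raw & 0xFF


def join_int16(high_byte, low_byte):
    """Rebuild a signed 16-bit value from two received bytes."""
    raw = ((high_byte & 0xFF) << 8) | (low_byte & 0xFF)
    # Bit 15 set means the value is negative in two's complement.
    return raw - 0x10000 if raw & 0x8000 else raw


# Round trip for a handful of values, including the extremes.
samples = [-32768, -1234, -1, 0, 1, 32767]
round_trip = [join_int16(*split_int16(v)) for v in samples]
```

Both ends must agree on the byte order; swapping it silently produces plausible-looking but wrong coefficients.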
C5515: Major Challenges Faced
• Autocorrelation coefficient overflow
  • Function generator provided too large a voltage
  • Forced the autocorrelation to overflow
  • Bit-shifting worked temporarily, but reduced data precision: poor classifier performance and threshold variability
  • Solution: switched to microphone
• BlueSMiRF setup
  • Configuring the BlueSMiRF requires commands at precise times
  • Solution: implemented long delay function on C5515
Raspberry Pi: Word Classification
• Implemented All-pole Model of Speech Vocalization for Classification
• Computes LPC Coefficients from Autocorrelation
• Converts LPC Coefficients into Cepstral Coefficients
• LIBSVM multi-class classifier
• Algorithm written in mixed C/C++
  • LPC and cepstral functions codegen’d from MATLAB
  • Wrapper in hand-written code
  • Waits for autocorrelation input from UART
Raspberry Pi: Actuation
• State machine implemented
  • Displays infamous EECS 452 Fall 2014 image on the sequence “452”
  • Displays special Raspberry Pi image on “314”
• GPIO array drives LED binary counter
  • Capable of implementing more complicated functions
  • Planned for coffee machine actuation, ran out of time
• Renders graphics using the OpenVG library
  • Displays startup image
  • Displays digit image on classification
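A sequence-triggered state machine like the one described can be sketched as a sliding window over the most recent digits. The Python below is an illustration only: the class name, the action labels, the reset after a match, and the 4-bit LED pattern helper are assumptions, and no real GPIO or OpenVG calls are made.

```python
class SequenceDetector:
    """Fire a named action when its digit sequence has just been spoken."""

    def __init__(self, sequences):
        # sequences maps an action name to a tuple of digits.
        self.sequences = {name: tuple(seq) for name, seq in sequences.items()}
        self.max_len = max(len(seq) for seq in self.sequences.values())
        self.history = []

    def push(self, digit):
        """Record one classified digit; return an action name or None."""
        self.history.append(digit)
        self.history = self.history[-self.max_len:]
        for name, seq in self.sequences.items():
            if tuple(self.history[-len(seq):]) == seq:
                self.history = []     # assumed: start fresh after a match
                return name
        return None


def led_bits(value, width=4):
    """Binary pattern for an LED array, most significant bit first."""
    return [(value >> bit) & 1 for bit in reversed(range(width))]


detector = SequenceDetector({
    "eecs452_image": (4, 5, 2),
    "pi_image": (3, 1, 4),
})
actions = [detector.push(d) for d in (4, 4, 5, 2, 3, 1, 4)]
```

Each entry of `actions` lines up with one spoken digit, so the caller can switch images the moment a sequence completes and otherwise just show the digit.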
Raspberry Pi: Major Challenges Faced
• Initially planned to use a Simulink model to implement the code
  • Worked great for the algorithm
  • Did not work well for I/O
  • S-Functions are tricky to work with
  • Solution
    • Codegen core algorithm
    • Hand-write wrapper
• MATLAB Coder toolbox
  • Converts MATLAB code into ANSI C code, with processor-specific optimizations available
  • Extremely useful for complex algorithms
  • Very finicky to configure properly
  • Solution: study, study, study
Looking Forward
• Coffee Machine Actuation
• Build better library
  • More speakers
  • Female speakers
  • Non-Midwestern speakers
• Investigate Tuning SVM Parameters
References
[1] http://www.spectrumdigital.com/product_info.php?cPath=31&products_id=238
[2] https://www.sparkfun.com/products/12577
[3] http://www.adafruit.com/product/1914
[4] Dutoit, T., Moreau, N., Kroon, P., How is speech processed in a cell phone conversation?, 2009
[5] Rabiner, L., Schafer, R., Introduction to Digital Speech Processing, 2007
[6] http://www.csie.ntu.edu.tw/~cjlin/libsvm/