real-time voice actuation

Team Jarvis

Final Presentation

Pragya Agrawal, Dominic Calabrese, David Martel, Nathan Sawicki

Project Goals

• Design and build a real-time speech recognition system

• Build with embedded hardware

• Used the Source-Filter model of speech and a Support Vector Machine classifier to recognize the commands “zero” through “nine”

• Finished system executes in real time and uses GPIO-based actuation to demonstrate functional voice recognition

System Architecture

Source-Filter Model of Speech

• Word characterization should be independent of the volume, pitch, and duration of the word

• Simplify the speech production model to two parts:

  1. Source: vibration of the vocal cords

  2. Filter: vocal tract (i.e. positioning of tongue, mouth, etc.)

• Accurately modeling the filter provides a basis for word recognition[4]

Broad sweeps of the spectrum (formants) result from the filter configuration. Rapidly varying peaks come from the source harmonics.

All-Pole Filter Coefficients

• The first n filter coefficients can be estimated from the first n lags (time shifts) of the signal's autocorrelation

• Levinson-Durbin recursion algorithm calculates all-pole filter coefficients from autocorrelation

• Want to capture the spectral envelope, so want ~10 filter coefficients[5]

Too many coefficients lead to over-fitting of the curve
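The Levinson-Durbin step above can be sketched as follows. This is a minimal illustration, not the team's MATLAB-generated code: the function name `levinson_durbin` and the sign convention A(z) = 1 + a[1]z^-1 + ... + a[p]z^-p are assumptions of this sketch.

```python
def levinson_durbin(r, order):
    """Solve the autocorrelation normal equations for an all-pole model.

    r     : autocorrelation values r[0] .. r[order]
    order : number of filter coefficients (about 10 per the slide)
    Returns (a, err): a[0] == 1 and a[1..order] are the coefficients of
    A(z) = 1 + a[1] z^-1 + ... (assumed convention); err is the final
    prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = float(r[0])
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: stop early
        # Correlation between the current predictor and lag i
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err  # reflection coefficient for this order
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


# A first-order process with r[k] = 0.5**k needs only one coefficient
a, err = levinson_durbin([1.0, 0.5, 0.25], 2)
print(a, err)
```

For this toy autocorrelation the second coefficient comes out as zero, which is the expected behavior when the signal is fully described by a lower-order model.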

Cepstral Coefficients

• Cepstrum is useful in separating the source and filter

• Cepstral coefficients are a very compact representation of the spectral envelope and are highly uncorrelated

• Filter coefficients are too sensitive to numerical precision

• Better to transform LP coefficients into cepstral coefficients[5]

Cepstral analysis on the source-filter model: (a) DFT, (b) log magnitude of DFT, (c) IDFT
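The LP-to-cepstral conversion can be sketched with the standard recursion for the cepstrum of an all-pole filter 1/A(z). This is a hedged sketch: the function name `lpc_to_cepstrum` is hypothetical, the gain term c[0] is omitted, and the convention A(z) = 1 + a[1]z^-1 + ... is assumed; the team's MATLAB routine may have differed.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert all-pole coefficients to cepstral coefficients.

    a      : [1, a1, ..., ap], with A(z) = 1 + a1 z^-1 + ... (assumed)
    n_ceps : number of cepstral coefficients to return (c1 .. c_n_ceps)
    """
    p = len(a) - 1
    c = [0.0] * (n_ceps + 1)  # c[0] (log gain) is left out of this sketch
    for n in range(1, n_ceps + 1):
        acc = 0.0
        # Only terms with 1 <= n - k <= p contribute
        for k in range(max(1, n - p), n):
            acc += k * c[k] * a[n - k]
        c[n] = -acc / n
        if n <= p:
            c[n] -= a[n]
    return c[1:]


# For A(z) = 1 - 0.5 z^-1 the cepstrum is 0.5**n / n
print(lpc_to_cepstrum([1.0, -0.5], 3))
```

The single-pole case is a convenient check because the series expansion of -log(1 - 0.5 z^-1) is known in closed form.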

Support Vector Machine Learning

• Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression

• We use a multi-class Support Vector Machine

• Our algorithm uses the one-against-one method to construct k(k-1)/2 classifiers (k = number of classes), one SVM for each pair of classes

• LIBSVM, an integrated software package for multi-class support vector classification, is used[6]
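For the ten digits, k = 10 gives 10 · 9 / 2 = 45 pairwise classifiers. The sketch below illustrates the one-against-one idea with majority voting; it is not LIBSVM's code. The `decide` callback stands in for a trained pairwise SVM, and breaking ties toward the smaller label is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations


def classify(sample, classes, decide):
    """One-against-one voting: every pair of classes casts one vote.

    decide(sample, a, b) is a stand-in for a trained two-class SVM and
    must return either a or b.
    """
    winners = [decide(sample, a, b) for a, b in combinations(classes, 2)]
    counts = Counter(winners)
    best = max(counts.values())
    # Assumed tie-break: the smallest label among the top vote-getters
    return min(c for c in classes if counts[c] == best)


digits = list(range(10))
print(len(list(combinations(digits, 2))))  # number of pairwise classifiers


# Toy pairwise rule (hypothetical): pick whichever label is closer to the sample
def closer(x, a, b):
    return a if abs(x - a) <= abs(x - b) else b


print(classify(6.8, digits, closer))
```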

Library

• Stored autocorrelation coefficients calculated through C5515

• Calculated cepstral coefficients in MATLAB

• Three male speakers with combined 1920 recordings

  • 64 instances of each digit for each speaker

9 Coef    0    1    2    3    4    5    6    7    8    9
0       154    0    0    4    0    0   22    6    0    6
1         0  166    1    1   23    1    0    0    0    0
2         1    0  168   22    0    0    1    0    0    0
3        13    0    6  172    0    0    1    0    0    0
4         1    9    0    0  181    0    0    1    0    0
5         0    1    0    0    0  190    0    1    0    0
6         4    0    1    0    0    0  187    0    0    0
7         1    0    0    0    1    0    0  189    0    1
8         0    0    1    0    0    0    0    0  191    0
9         0    0    1    0    0    0    0    2    0  189
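The table above can be summarized numerically. The sketch below assumes rows are the spoken digit and columns the classifier output; the slide does not label the axes, but every row sums to 192 = 3 speakers × 64 instances, which is consistent with that reading.

```python
# Counts copied from the 9-coefficient table; rows assumed to be the spoken digit
confusion = [
    [154, 0, 0, 4, 0, 0, 22, 6, 0, 6],
    [0, 166, 1, 1, 23, 1, 0, 0, 0, 0],
    [1, 0, 168, 22, 0, 0, 1, 0, 0, 0],
    [13, 0, 6, 172, 0, 0, 1, 0, 0, 0],
    [1, 9, 0, 0, 181, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 190, 0, 1, 0, 0],
    [4, 0, 1, 0, 0, 0, 187, 0, 0, 0],
    [1, 0, 0, 0, 1, 0, 0, 189, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 0, 191, 0],
    [0, 0, 1, 0, 0, 0, 0, 2, 0, 189],
]


def overall_accuracy(matrix):
    """Fraction of recordings that land on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_digit_rate(matrix):
    """Diagonal count divided by row total, one value per digit."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


print(f"overall: {overall_accuracy(confusion):.3f}")
for digit, rate in enumerate(per_digit_rate(confusion)):
    print(digit, f"{rate:.3f}")
```

Under that assumption, 1787 of the 1920 recordings fall on the diagonal, about 93.1%.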

Rejected Methods

• Classification based on correlation of cepstral coefficients

  • Took maximum correlation between new signal and library

  • Not very robust to small variations or scalable

• Classification using SVM on CRM database

  • Words cut off early in database or contaminated by other words

  • Recording conditions do not match our method

C5515: Vocalization Identification

• Implemented Word from non-Word Identification

• Grab frame of 256 samples

• Compute RMS of frame, compare to threshold

• If RMS > Threshold

  • Accumulate frame data

• Else if RMS < Threshold and Frames Acquired > 3

  • Compute Autocorrelation

  • Transmit Data

• Else

  • Reset Stored Data

• Specific values determined experimentally
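The word/non-word logic above can be sketched in Python (the actual implementation ran on the C5515). The threshold value, the number of lags (11, i.e. lags 0 through 10), and clearing the buffer after a transmit are assumptions of this sketch; the slides only say the values were determined experimentally.

```python
import math

MIN_FRAMES = 3  # from the slide: transmit only if Frames Acquired > 3


def rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def autocorrelation(samples, n_lags):
    """Unnormalized autocorrelation for lags 0 .. n_lags - 1."""
    return [
        sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        for lag in range(n_lags)
    ]


def segment_words(frames, threshold, n_lags=11):
    """Yield one autocorrelation vector per detected word."""
    stored, n_frames = [], 0
    for frame in frames:
        if rms(frame) > threshold:
            stored.extend(frame)       # accumulate frame data
            n_frames += 1
        elif n_frames > MIN_FRAMES:
            yield autocorrelation(stored, n_lags)  # stands in for "transmit"
            stored, n_frames = [], 0   # assumed: start fresh after a word
        else:
            stored, n_frames = [], 0   # too short to be a word: reset


loud, quiet = [1000] * 256, [0] * 256
frames = [quiet] + [loud] * 4 + [quiet] + [loud] * 2 + [quiet]
words = list(segment_words(frames, threshold=100))
print(len(words))
```

In the toy input, the four-frame burst is treated as a word and the two-frame burst is discarded as too short.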

C5515: UART Transmission

• Transmit Autocorrelation Coefficients

• UART is 115200 baud, 8 bit, No Parity, 1 stop bit

• Data is signed 16 bit

  • Bit masking and Reconstruction on the Raspberry Pi

• BlueSmirf Bluetooth-UART Pipes

  • Abstracts wireless transmission

  • Looks like UART to microcontroller

  • Effectively Plug&Play
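Because the UART carries 8 bits at a time, each signed 16-bit coefficient arrives as two bytes and has to be rebuilt on the receiving side. A minimal sketch of that masking and reconstruction follows; the high-byte-first ordering and the function names are assumptions, since the slides do not state the byte order.

```python
def bytes_to_int16(high, low):
    """Rebuild one signed 16-bit value from two received bytes."""
    value = ((high & 0xFF) << 8) | (low & 0xFF)
    # Bit 15 set means the value is negative in two's complement
    return value - 0x10000 if value & 0x8000 else value


def decode_stream(data):
    """Decode a byte sequence into signed 16-bit values (assumed high byte first)."""
    return [
        bytes_to_int16(data[i], data[i + 1])
        for i in range(0, len(data) - 1, 2)
    ]


print(decode_stream(bytes([0x01, 0x00, 0xFF, 0xFE, 0x80, 0x00])))
```

A trailing odd byte is ignored in this sketch; a real receiver would need some framing to stay aligned with the sender.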

C5515: Major Challenges Faced

• Autocorrelation Coefficient Overflow

  • Function generator provided too large a voltage

  • Forced the autocorrelation to overflow

  • Bit-shifting worked temporarily, but reduced data precision: poor classifier performance and threshold variability

  • Solution: Switched to Microphone

• BlueSmirf Setup

  • Configuring the BlueSmirf requires commands at precise times

  • Solution: Implemented long delay function on C5515

Raspberry Pi: Word Classification

• Implemented All-pole Model of Speech Vocalization for Classification

• Computes LPC Coefficients from Autocorrelation

• Converts LPC Coefficients into Cepstral Coefficients

• LIBSVM multi-class classifier

• Algorithm written in mixed C/C++

  • LPC and Cepstral functions codegen’d from Matlab

  • Wrapper in hand-written code

  • Waits for autocorrelation input from UART

Raspberry Pi: Actuation

• State Machine implemented

• Displays infamous EECS 452 Fall 2014 Image on sequence of “452”

• Displays special Raspberry Pi Image on “314”

• GPIO array drives LED Binary Counter

  • Capable of implementing more complicated functions

  • Planned for Coffee Machine Actuation, ran out of time

• Renders graphics using OpenVG Library

  • Displays Startup Image

  • Displays Digit Image on Classification
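The sequence-triggered behavior can be sketched as a small state machine that remembers the most recent digits. This is only an illustration in Python (the project code was C/C++); the class name and the action labels `eecs452_image` and `raspberry_pi_image` are hypothetical.

```python
class SequenceDetector:
    """Track recent digits and report when a known sequence completes."""

    def __init__(self, sequences):
        # sequences: dict mapping a tuple of digits to an action label
        self.sequences = sequences
        self.max_len = max(len(seq) for seq in sequences)
        self.history = []

    def feed(self, digit):
        """Add one classified digit; return an action label or None."""
        self.history.append(digit)
        self.history = self.history[-self.max_len:]
        for seq, action in self.sequences.items():
            if tuple(self.history[-len(seq):]) == seq:
                return action
        return None


detector = SequenceDetector({
    (4, 5, 2): "eecs452_image",        # hypothetical label
    (3, 1, 4): "raspberry_pi_image",   # hypothetical label
})
for digit in [4, 5, 2, 3, 1, 4]:
    print(digit, detector.feed(digit))
```

Keeping only the last few digits means a sequence is recognized wherever it occurs in the stream, without an explicit reset command.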

Raspberry Pi: Major Challenges Faced

• Initially planned to use Simulink Model to implement code

  • Worked great for algorithm

  • Did not work well for IO

  • S-Functions are tricky to work with

  • Solution

    • Codegen core algorithm

    • Hand write wrapper

• Matlab Coder Toolbox

  • Converts Matlab code into ANSI C code, with processor-specific optimizations available

  • Extremely useful for complex algorithms

  • Very finicky to configure properly

  • Solution: Study, study, study

Design Expo Pictures

Design Expo Demonstration

http://youtube.com/v/iKBFW55iNik

Looking Forward

• Coffee Machine Actuation

• Build Better Library

  • More speakers

  • Female speakers

  • Non-Midwestern speakers

• Investigate Tuning SVM Parameters

Questions / Comments

References

[1] http://www.spectrumdigital.com/product_info.php?cPath=31&products_id=238

[2] https://www.sparkfun.com/products/12577

[3] http://www.adafruit.com/product/1914

[4] Dutoit, T., Moreau, N., Kroon, P., How is speech processed in a cell phone conversation?, 2009

[5] Rabiner, L., Schafer, R., Introduction to Digital Speech Processing, 2007

[6] http://www.csie.ntu.edu.tw/~cjlin/libsvm/
