Isolated Digit Recognizer using GMM’s

ECE5526 FINAL PROJECT, SPRING 2011

JIM BRYAN

Abstract
• Provide an in-depth look at how GMM’s can be used for word recognition, based on Matlab’s Statistics Toolbox.
• The isolated digit recognizer is based on a voice activity detector using energy thresholding and zero-crossing detection. Moreover, the recognizer uses MFCC’s as the basis for acoustic speech representation. These are standard voice processing techniques with which the reader is assumed to be familiar. The focus of this presentation is on the details of the GMM implementation in Matlab, with the idea that a good understanding of the Matlab approach will yield insight into other system implementations such as Sphinx and HTK.
• Word recognition comprises two components: model training and model testing.
• The Statistics Toolbox function gmdistribution.fit is used for training.
• The Statistics Toolbox function posterior is used for testing.
• The purpose of this effort is to train and run the recognizer, and to understand the basic functionality of the gmdistribution.fit and posterior function calls.

Introduction
• Based on “Developing an Isolated Word Recognition System in MATLAB” by Daryl Ning, MATLAB Digest, January 2010
• Describe the Matlab GUI-based recognizer application
• Provide introductory material on GMM’s using a simple 2-mixture example with 2 models
• Discuss in detail the algorithms used to determine the best model match
• Show examples of Matlab’s Statistics Toolbox representation of GMM’s
• Run the simulation
• Discuss simulation results and show possible improvements
• Summary
• Conclusions
• Areas for further study

Isolated digit recognizer overview
• Uses one 8-mixture GMM per digit to train and recognize an individual user’s voice
• Matlab GUI-based digit recognizer uses the following toolboxes
– Signal Processing Toolbox provides filtering and signal processing functions
– Statistics Toolbox is used to implement a GMM expectation-maximization algorithm to build the GMM’s and to compute the Mahalanobis distance during recognition
– Data Acquisition Toolbox is used to stream the microphone input to Matlab for continuous recognition
• Single digit recognizer implemented using a dictionary of digits 0–9
• Training is done with 30-second captures of repeated utterances of the given digit, loaded with the wavread function in Matlab
• Data is input continuously to Matlab via the Data Acquisition Toolbox while the GUI recognizer is running
• The recognized digit is displayed on the GUI

Overview Continued
• Uses the laptop’s internal microphone
• Sample rate is 8 ksps
• Uses 20 msec frames (160 samples per frame) with a 10 msec overlap
• Uses a simple voice activity detector based on an energy threshold and zero crossings per second for both training and the recognizer
• Voice activity energy and zero-crossing thresholds are programmable and must be the same for training and recognition
• There is no model for silence or a missed digit, so the recognizer always displays the closest digit
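The framing described above can be sketched in a few lines. This is an illustrative Python sketch, not the deck's Matlab code. The function name `split_into_frames` is hypothetical, and dropping a trailing partial frame is an assumption.

```python
def split_into_frames(samples, frame_len=160, hop=80):
    """Slice a sample sequence into overlapping frames.

    frame_len=160 and hop=80 correspond to 20 msec frames that
    advance every 10 msec at 8000 samples per second.
    A trailing partial frame is dropped (an assumption).
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


fs = 8000                     # samples per second
signal = list(range(fs))      # one second of placeholder samples
frames = split_into_frames(signal)
print(len(frames), len(frames[0]))
```

With a 10 msec hop, one second of audio yields just under 100 frames, and each frame shares half of its samples with its neighbor.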

GMM training and recognizer Matlab function calls

• The recognizer computes the posterior probabilities using the Statistics Toolbox function ‘posterior’
• ‘posterior’ accepts a gmdistribution object/model as its input, along with an input data set, and returns a negative log-likelihood that represents how well the data set matches the model
• The smallest negative log-likelihood corresponds to the best match
• The recognizer computes the negative log-likelihood of the current “word” against each model in the dictionary. The model with the lowest negative log-likelihood is the recognized digit.
• A gmdistribution object is created during training for each dictionary entry, in this case digits 0-9, using the function call gmdistribution.fit.
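The decision rule can be illustrated as follows: score the utterance against every model and keep the entry with the smallest negative log-likelihood. This is a Python sketch for illustration only, not the deck's Matlab code. The function name `recognize` and the score values are hypothetical.

```python
def recognize(nlogl_by_digit):
    """Return the dictionary entry with the smallest negative log-likelihood."""
    return min(nlogl_by_digit, key=nlogl_by_digit.get)


# Hypothetical scores for one utterance (illustrative numbers only)
scores = {"one": 1510.2, "two": 1388.7, "three": 1642.9}
best = recognize(scores)
print(best)
```

Because there is no silence or reject model, this rule always returns some digit, even for background noise.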

Example using 2 GMM’s with 2 mixtures

[Figure: plot of pdf(obj2,[x,y]) for the two models, gmm1 and gmm2; the x and y axes span -8 to 6]

Posterior
• posterior extracts the gmdistribution object parameters necessary to call wdensity
• wdensity performs the actual log-likelihood calculation for the GMM, given the data set
• wdensity returns two arrays
– log_lh is an array of size length(data) x order(GMM)
– mahalaD is an array of size length(data) x order(GMM); this is the squared Mahalanobis distance rather than the distance itself
– mahalaD = (x - μ) Σ^-1 (x - μ)^T
• estep calculates the log-likelihood based on the log_lh array and returns ll, which is the log-likelihood of data x given the model

Wdensity function description

Example function call: [log_lh,mahalaD]=wdensity(X, mu, Sigma, p, sharedCov, CovType)
• X is the input data
• mu is an array of means, with mu(j,:) corresponding to the jth mean vector
• Sigma is an array of arrays, with Sigma(:,:,j) corresponding to the jth covariance in the model
• p are the mixture weights
• sharedCov indicates whether the covariance matrices are common to all mixtures
• CovType may be either diagonal or full

Wdensity log-likelihood calculation

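For data point $x_i$ of dimension $d$ and mixture component $j$ with weight $p_j$, mean $\mu_j$ and covariance $\Sigma_j$:

$$\mathrm{log\_lh}(i,j) = \log p_j - \tfrac{1}{2}\log\lvert\Sigma_j\rvert - \tfrac{1}{2}\,(x_i - \mu_j)\,\Sigma_j^{-1}\,(x_i - \mu_j)^T - \tfrac{d}{2}\log(2\pi)$$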

Wdensity log-likelihood implementation details

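Xcentered = bsxfun(@minus, X, mu(j,:)); % subtract the jth mean from every row of X

L = sqrt(Sigma(:,:,j)); % a vector (diagonal covariance case)

xRinv = bsxfun(@times, Xcentered, (1./ L)); % scale each dimension by 1/sigma

mahalaD(:,j) = sum(xRinv.^2, 2); % squared Mahalanobis distance

log_lh(:,j) = -0.5 * mahalaD(:,j) + ...
    (-0.5 * logDetSigma + log_prior(j)) - d*log(2*pi)/2; % d is the data dimension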
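For readers who want to check the arithmetic by hand, the diagonal-covariance calculation can be sketched as follows. This is an illustrative Python re-implementation, not the toolbox code. The name `wdensity_diag` and the argument `var` (one variance vector per component) are assumptions, and the full-covariance and shared-covariance cases are not covered.

```python
import math


def wdensity_diag(X, mu, var, p):
    """Weighted per-component log-densities for a diagonal-covariance GMM.

    X   : list of n data points, each a list of d values
    mu  : list of k mean vectors, each of length d
    var : list of k variance vectors (the diagonal of each covariance)
    p   : list of k mixture weights
    Returns (log_lh, mahalaD), each with n rows and k columns.
    """
    d = len(X[0])
    log_lh, mahalaD = [], []
    for x in X:
        lh_row, md_row = [], []
        for mu_j, var_j, p_j in zip(mu, var, p):
            # Squared Mahalanobis distance for a diagonal covariance
            md = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu_j, var_j))
            log_det = sum(math.log(vi) for vi in var_j)
            lh = (-0.5 * md - 0.5 * log_det + math.log(p_j)
                  - d * math.log(2 * math.pi) / 2)
            lh_row.append(lh)
            md_row.append(md)
        log_lh.append(lh_row)
        mahalaD.append(md_row)
    return log_lh, mahalaD


log_lh, mahalaD = wdensity_diag(
    X=[[0.0, 0.0], [1.0, 2.0]],
    mu=[[0.0, 0.0], [1.0, 1.0]],
    var=[[1.0, 1.0], [4.0, 4.0]],
    p=[0.5, 0.5],
)
```

Each row of `log_lh` holds one data point and each column one mixture component, matching the length(data) x order(GMM) layout described earlier.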

estep
• [ll, post, logpdf] = estep(log_lh)
• Find the max of each row of the log_lh matrix (maxll)
• This identifies the mixture component that best matches this data point
• Convert log_lh to relative probabilities using post = exp(bsxfun(@minus, log_lh, maxll)); there will always be a 1 in the column of the maximum value, so each row sum is always >= 1
• Sum across each row to obtain the normalizing term: density = sum(post,2);
• Normalize the posteriors: post = bsxfun(@rdivide, post, density)
• Calculate logpdf = log(density) + maxll;
• ll = sum(logpdf)
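The estep steps can be sketched as follows. This is an illustrative Python version, not the toolbox source. Subtracting the row maximum before exponentiating is the standard log-sum-exp trick for avoiding underflow. The two input rows are the first two rows of the P12 ("data not from model") log_lh example that follows.

```python
import math


def estep(log_lh):
    """Normalize per-component log-likelihoods row by row.

    log_lh : n rows by k columns of weighted log-densities
    Returns (ll, post, logpdf).
    """
    post, logpdf = [], []
    for row in log_lh:
        maxll = max(row)
        rel = [math.exp(v - maxll) for v in row]  # largest entry becomes 1
        density = sum(rel)                        # therefore always >= 1
        post.append([r / density for r in rel])
        logpdf.append(math.log(density) + maxll)
    return sum(logpdf), post, logpdf


ll, post, logpdf = estep([[-6.2916, -6.2281], [-6.1189, -7.3603]])
print(logpdf)
```

The resulting logpdf values agree with the first two logpdf entries of the P12 example to the four printed decimal places.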

Estep example showing log_lh inputs for two Gaussian mixtures and the maximum value of log_lh (P11: data from model)

log_lh(:,1)  log_lh(:,2)  maxll
-18.6236     -3.0708     -3.0708
-36.2569     -3.0821     -3.0821
-24.1669     -2.2514     -2.2514
-33.8821     -3.2357     -3.2357
-18.4447     -3.2818     -3.2818
 -5.8488     -4.2339     -4.2339
-18.4529     -2.5661     -2.5661
-14.7058     -3.5421     -3.5421
 -2.7563    -19.3866     -2.7563
 -3.0744    -21.2154     -3.0744
 -2.4251    -14.8179     -2.4251
 -4.1699    -12.7317     -4.1699
 -2.5825    -16.8520     -2.5825
 -4.4938     -8.5847     -4.4938
 -3.7883    -13.7861     -3.7883
 -2.8691     -7.2573     -2.8691

Estep example showing post and density; density is used to normalize post (P11: data from model)

post = exp(bsxfun(@minus, log_lh, maxll)); density = sum(post,2)

post(:,1)  post(:,2)  density
1.0000     0.0000     1.0000
1.0000     0.0000     1.0000
0.0000     1.0000     1.0000
1.0000     0.0000     1.0000
1.0000     0.0001     1.0001
0.0832     1.0000     1.0832
1.0000     0.0109     1.0109
1.0000     0.0008     1.0008
0.0000     1.0000     1.0000
0.0003     1.0000     1.0003
0.0001     1.0000     1.0001
0.0000     1.0000     1.0000
0.0000     1.0000     1.0000
0.0000     1.0000     1.0000
0.0000     1.0000     1.0000
0.0000     1.0000     1.0000

Estep example showing post after normalization and logpdf (P11: data from model)

post = bsxfun(@rdivide, post, density); logpdf = log(density) + maxll; ll = sum(logpdf) = -53.7464

post(:,1)  post(:,2)  logpdf
1.0000     0.0000     -3.6490
1.0000     0.0000     -4.6937
0.0000     1.0000     -2.3765
1.0000     0.0000     -3.3219
0.9999     0.0001     -3.1317
0.0768     0.9232     -4.4911
0.9892     0.0108     -4.0361
0.9992     0.0008     -3.8076
0.0000     1.0000     -2.7171
0.0003     0.9997     -2.5739
0.0001     0.9999     -2.3359
0.0000     1.0000     -2.6023
0.0000     1.0000     -2.1502
0.0000     1.0000     -5.5963
0.0000     1.0000     -2.2777
0.0000     1.0000     -3.9857

Estep example showing log_lh inputs for two Gaussian mixtures and the maximum value of log_lh (P12: data not from model)

log_lh(:,1)  log_lh(:,2)  maxll
 -6.2916     -6.2281     -6.2281
 -6.1189     -7.3603     -6.1189
-12.5238     -2.5414     -2.5414
 -7.3336    -24.5710     -7.3336
 -7.0679    -14.3058     -7.0679
 -5.7049     -7.7255     -5.7049
 -7.8564    -23.6082     -7.8564
 -6.8128     -4.4655     -4.4655
-27.4139    -19.2832    -19.2832
-20.1139    -14.0730    -14.0730
-27.0048    -11.4791    -11.4791
-17.2614     -8.2714     -8.2714
-33.8912    -15.5351    -15.5351
-26.0666     -9.9934     -9.9934
-20.4353     -9.9218     -9.9218
-15.9387    -13.2732    -13.2732

Estep example showing post and density; density is used to normalize post (P12: data not from model)

post = exp(bsxfun(@minus, log_lh, maxll)); density = sum(post,2)

post(:,1)  post(:,2)  density
0.9384     1.0000     1.9384
1.0000     0.2890     1.2890
0.0000     1.0000     1.0000
1.0000     0.0000     1.0000
1.0000     0.0007     1.0007
1.0000     0.1326     1.1326
1.0000     0.0000     1.0000
0.0956     1.0000     1.0956
0.0003     1.0000     1.0003
0.0024     1.0000     1.0024
0.0000     1.0000     1.0000
0.0001     1.0000     1.0001
0.0000     1.0000     1.0000
0.0000     1.0000     1.0000
0.0000     1.0000     1.0000
0.0696     1.0000     1.0696

Estep example showing post after normalization and logpdf (P12: data not from model)

post = bsxfun(@rdivide, post, density); logpdf = log(density) + maxll; ll = sum(logpdf) = -147.9445

post(:,1)  post(:,2)  logpdf
0.4841     0.5159      -5.5662
0.7758     0.2242      -5.8650
0.0000     1.0000      -2.5414
1.0000     0.0000      -7.3336
0.9993     0.0007      -7.0671
0.8829     0.1171      -5.5804
1.0000     0.0000      -7.8564
0.0873     0.9127      -4.3742
0.0003     0.9997     -19.2829
0.0024     0.9976     -14.0706
0.0000     1.0000     -11.4791
0.0001     0.9999      -8.2713
0.0000     1.0000     -15.5351
0.0000     1.0000      -9.9934
0.0000     1.0000      -9.9218
0.0650     0.9350     -13.2060

Log-likelihood for 2 mixture example

Nlogl = -ll

P =
   55.3416  109.3820
  184.7868   42.8043

The diagonal terms are the cases where the data came from the model; these are the smaller values and indicate the better match.

The off-diagonal terms are the cases where the data came from the other model.

Gaussian Models in Matlab

Model for ‘one’

Gaussian Mixture Distribution Structure ‘one’

8 Gaussian model means 8x39 ‘one’

Diagonal Covariance Matrix

Training the GMM’s

• Before recording can begin, it is necessary to set up the laptop’s internal microphone
• Training involves finding a quiet environment and recording 30 seconds of utterances for each digit
• These are captured using Matlab’s wavrecord
• y = wavrecord(30*8000,8000);
• A supplied utility displays the output of the voice activity detection algorithm so that correct capture of the training data can be verified
• speechdetect(y);

Trainmodels overview
• Generates frames of speech based on 160 samples/frame with an 80-sample overlap
• Uses the same energy detect and zero-crossing thresholds as the recognizer
• Determines portions of voiced speech based on these thresholds as well as a minimum duration of 250 msec for each word
• A minimum of 100 msec is required between words
• Frames are marked as VA (voice active) and stored in a buffer called ALLdata
• ALLdata is arranged so that the frames are in columns; the dimensions are 160 x numFRAMES
• Once all the words are captured, MFCC is called and passed the ALLdata buffer for Mel cepstral coefficient processing
• MFCC returns MFCC vectors of 39 coefficients per frame
• gmdistribution.fit is passed the MFCC vectors and runs an EM algorithm on them to generate an 8-mixture GMM for each digit
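The word-isolation rules above (at least 250 msec per word, at least 100 msec between words) can be sketched on a list of per-frame voice-activity flags. This is an illustrative Python sketch, not the trainmodels code. The name `find_words` is hypothetical. Merging runs separated by short gaps and discarding short runs are assumptions about how the two minimums are applied.

```python
def find_words(active, min_word=25, min_gap=10):
    """Group per-frame voice-activity flags into (start, end) word spans.

    active   : list of booleans, one per 10 msec frame
    min_word : minimum word length in frames (25 frames = 250 msec)
    min_gap  : minimum silence between words in frames (10 frames = 100 msec)
    The end index of each span is exclusive.
    """
    # Collect raw runs of consecutive active frames
    runs, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(active)))

    # Merge runs separated by fewer than min_gap silent frames (assumed rule)
    merged = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)

    # Discard anything shorter than min_word (assumed rule)
    return [(s, e) for s, e in merged if e - s >= min_word]


flags = ([False] * 20 + [True] * 15 + [False] * 3 + [True] * 15
         + [False] * 30 + [True] * 5)
words = find_words(flags)
print(words)
```

In this example the 3-frame pause inside the first word is bridged, while the isolated 5-frame burst at the end is rejected as too short to be a digit.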

MFCC credits
Derived from the original function 'mfcc.m' in the Auditory Toolbox, written by:
Malcolm Slaney
Interval Research Corporation
http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/

Also uses the 'deltacoeff.m' function written by:
Olutope Foluso Omogbenigun
London Metropolitan University
http://www.mathworks.com/matlabcentral/fileexchange/19298

MFCC overview
• Pre-filter the data using a pre-emphasis filter
• preEmphasized = filter([1 -.97], 1, input);
• Window the data with a Hamming window
• preEmphasized = preEmphasized.*repmat(hamWindow(:),1,frames);
• fftMag = abs(fft(preEmphasized,fftSize));
• earMag = log10(mfccFilterWeights * fftMag);
• ceps = mfccDCTMatrix * earMag;
• meanceps = mean(ceps,2);
• ceps = ceps - repmat(meanceps,1,frames);
• d = (deltacoeff(ceps')).*0.6; %Computes delta-mfcc
• d1 = (deltacoeff(d)).*0.4; %as above for delta-delta-mfcc
• ceps = [ceps; d'; d1']; %concatenates all together
• Returns a vector of 13 cepstral, 13 delta and 13 delta-delta coefficients
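Three of the simpler steps in this chain (pre-emphasis, windowing, cepstral mean subtraction) can be sketched as follows. This is an illustrative Python sketch, not the mfcc.m code. The function names are hypothetical, and the symmetric Hamming formula is an assumption about how hamWindow is defined.

```python
import math


def pre_emphasis(x, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1], with the sample before x[0] taken as 0."""
    return [x[n] - alpha * (x[n - 1] if n > 0 else 0.0) for n in range(len(x))]


def hamming(n):
    """Symmetric Hamming window of length n (assumed window definition)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def subtract_mean(ceps):
    """Cepstral mean subtraction: remove each coefficient's mean over frames.

    ceps holds one row per coefficient and one column per frame.
    """
    return [[v - sum(row) / len(row) for v in row] for row in ceps]


emphasized = pre_emphasis([1.0, 1.0, 1.0])
window = hamming(5)
centered = subtract_mean([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
```

Pre-emphasis reduces a constant input to 3% of its value after the first sample, which is what a first-order high-pass with coefficient 0.97 should do.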

Sound Settings for Microphone on Windows 7 laptop

Voice Activity Detector Overview
• Voice activity detection is based on energy detection and zero-crossing rate
• std_zxings: the zero-crossing threshold, default = .5
• std_energy: the energy detect threshold, default = .5
• The background silence energy and zero-crossing rate are estimated from the first 500 msec of the training recording and used to set the energy and zero-crossing thresholds
• The same threshold settings must be used for all digit recordings
• Once a good recording has been made, save it to the hard drive using:
• wavwrite(y,8000,'one.wav');
• Repeat for all the digits
• Run “transcript” to train the GMM’s
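A detector of this kind can be sketched as follows. This is an illustrative Python sketch, not the speechdetect code. The threshold form (noise mean plus a multiple of the noise standard deviation) is an assumption suggested by the parameter names std_energy and std_zxings. Flagging a frame when either measure exceeds its threshold is also an assumption.

```python
import math


def frame_energy(frame):
    return sum(s * s for s in frame)


def zero_crossings(frame):
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def mean_std(values):
    m = sum(values) / len(values)
    return m, math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def detect_voice(frames, noise_frames=50, std_energy=0.5, std_zxings=0.5):
    """Flag frames whose energy or zero-crossing count exceeds its threshold.

    The first noise_frames frames are treated as background silence
    (50 frames at a 10 msec hop is roughly 500 msec).
    """
    energies = [frame_energy(f) for f in frames]
    zxings = [zero_crossings(f) for f in frames]
    e_mean, e_std = mean_std(energies[:noise_frames])
    z_mean, z_std = mean_std(zxings[:noise_frames])
    e_thresh = e_mean + std_energy * e_std  # assumed threshold form
    z_thresh = z_mean + std_zxings * z_std  # assumed threshold form
    return [e > e_thresh or z > z_thresh for e, z in zip(energies, zxings)]


# Synthetic input: low-level alternating "noise", then a 200 Hz tone at 8 kHz
quiet = [(2.0 ** -10) * (-1) ** n for n in range(160)]
loud = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(160)]
flags = detect_voice([quiet] * 50 + [loud] * 5)
```

Because the thresholds are derived from the recording's own opening silence, a noisy first half second raises them for the whole capture. That is one reason a quiet environment matters for training.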

Author’s Ideal Voice Activity Detector

[Figure: voice detect using default thresholds, digit = ‘one’; x-axis from 7.5 to 11 (x 10^4), y-axis from -0.2 to 1.2]

[Figure: voice detect using thresholds 1, 1, digit = ‘one’; x-axis from 4 to 7.5 (x 10^4), y-axis from -0.2 to 1.2]

[Figure: voice detect using thresholds 1.5, 1.5, digit = ‘one’; x-axis from 5.5 to 9 (x 10^4), y-axis from -0.4 to 1]

Transcript reads each model and calls trainmodels

y = wavread('one.wav'); trainmodels(y,'one');
y = wavread('two.wav'); trainmodels(y,'two');
y = wavread('three.wav'); trainmodels(y,'three');
y = wavread('four.wav'); trainmodels(y,'four');
y = wavread('five.wav'); trainmodels(y,'five');
y = wavread('six.wav'); trainmodels(y,'six');
y = wavread('seven.wav'); trainmodels(y,'seven');
y = wavread('eight.wav'); trainmodels(y,'eight');
y = wavread('nine.wav'); trainmodels(y,'nine');
y = wavread('zero.wav'); trainmodels(y,'zero');

GMM dimensions for typical utterance

• Assume the average digit length is 300 msec
• Fs = 8000 Hz
• 1/Fs = 125 μsec
• 160 samples/Fs = 20 msec
• Since the Hamming-windowed frames overlap by 50%, one frame occurs every 10 msec
• Average number of frames per word: 300/10 = 30
• MFCC takes in 30x160 samples and produces 30x39 MFCC vectors on average
• Average size of the log_lh matrix per word for 8 Gaussian mixtures = 30x8
• The log-likelihood is therefore based on a 30x8 matrix on average
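The same arithmetic, written out so that other frame sizes or word lengths can be tried. This is an illustrative Python sketch with hypothetical variable names.

```python
fs = 8000              # sample rate in Hz
frame_len = 160        # samples per frame
hop = frame_len // 2   # 50% overlap between consecutive frames
word_msec = 300        # assumed average digit duration
n_coeffs = 39          # MFCC, delta and delta-delta coefficients per frame
n_mixtures = 8         # Gaussian mixtures per digit model

frame_msec = 1000 * frame_len / fs              # duration of one frame
hop_msec = 1000 * hop / fs                      # time between frame starts
frames_per_word = int(word_msec / hop_msec)     # frames in an average word
mfcc_shape = (frames_per_word, n_coeffs)        # MFCC matrix per word
log_lh_shape = (frames_per_word, n_mixtures)    # log_lh matrix per word per model
print(frame_msec, hop_msec, frames_per_word, mfcc_shape, log_lh_shape)
```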

Voice activity detect filter implemented as a 128-tap FIR filter based on a Chebyshev window with 40 dB sidelobe attenuation

[Figure: voice detector using a 125-750 Hz, 128-tap Chebyshev bandpass filter with 40 dB sidelobe suppression, a 20 msec pre one-shot and a 40 msec post one-shot, digit = ‘one’; x-axis from 1 to 7 (x 10^4), y-axis from -1000 to 1000]

[Figure: training vector for digit ‘one’ after modified VA detection; x-axis from 0 to 2.5 (x 10^5), y-axis from -1.5 to 1.5 (x 10^6)]

Scoring
• Difficult to score based on the real-time recognizer
• Recognizer “fires” on ambient noise
• Recognizer is slow, as it has to perform the GMM calculation for all dictionary entries
• A recorded test set, counting from 1-9, 0, produced 70% accuracy; two, seven and eight were not classified correctly
• Had to lower the zero-crossing threshold for the test to collect all the utterances
• The limited accuracy might be due to insufficient training data
• Could have bad models for some of the “classes”
• Hand scoring is difficult because each utterance must be correctly labeled for the classifier. Seven had a null portion in the middle
• The laptop computer’s fan kicked on during training; this caused ambient noise, so the training data set was not perfect

[Figure: test set counting 1-9, 0 and repeat, frame based with silence removed; x-axis from 2 to 14 (x 10^4), y-axis from -0.04 to 0.03]

Summary
• 8-mixture GMM’s for speech recognition were demonstrated. Digit recognition was achieved using only a small training set, a laptop microphone and an 8000 Hz sample rate
• Care and feeding of the GMM’s is very important for successful implementation
• Garbage in, garbage out is especially true for speech recognition
• Background noise is a very big problem for accurate speech recognition. Adaptive noise cancellation using a second microphone for just the background noise should improve accuracy
• The voice activity detector is a critical component of the recognizer
• Scoring is also a difficult problem, as the acoustic data must be synchronized with the dictionary to provide accurate results
• Marking the speech pattern and isolating words is not without difficulties, as pauses between syllables occur during a single utterance

Conclusion
• GMM’s are very powerful models for speech recognition
• Scoring the models is difficult. The EM algorithm will produce different models depending on the random seeding of the starting conditions
• A simple training set of ~15 repetitions per utterance is not sufficient for good GMM accuracy
• The voice activity detector plays a significant part in the training and testing of the data
• A new voice activity detector did not magically produce 100 percent scoring accuracy with a recorded test wav file
• Noise cancellation techniques and sophisticated voice detection algorithms, as well as model optimization, are necessary for good performance

Areas for further investigation

• Automate the scoring process
• Improve the voice activity detector in the real-time recognizer
• Add a second microphone for adaptive noise cancellation
• Convert GMM’s to a combination of GMM’s and HMM’s so the dictionary search isn’t so computationally intensive
• Modify the number of mixtures of the GMM’s with an HMM phonetic implementation
• HMM’s will allow for continuous digit recognition
