Sequence Scoring Experiments Using the TIMIT Corpus and the HTK Recognition Framework
Author: Arthur Gerald Kunkle
Committee Chair: Dr. Veton Z. Këpuska

ASR Defined Automatic Speech Recognition (ASR) - mapping an acoustic signal into a string of words. ASR systems play a big role in Human Machine Interaction (HMI). Speech has a natural potential to be much more intuitive to use to command a machine versus the existing input methods, such as keyboard and mouse.

Early ASR Systems The earliest ASR systems modeled the natural resonances that occur as air flows through the vocal tract to create sounds. Example: to recognize the digit "five", the system would determine that the vowel sound "eye" matched the correct digit. Limitation: the utterance could contain only a single digit and no other word or non-speech event that would confuse the system.

ASR Improvements ASR system development in the 1980s and 1990s introduced the use of Hidden Markov Models (HMMs), which have remained in wide use over the past two decades, with improvements being made on a continual basis. ASR received interest from DARPA, leading to new and notable ASR systems such as CMU Sphinx (Carnegie Mellon University). This interest also formalized the tasks and evaluation criteria used to measure ASR system performance.

Major Tasks in ASR History

Timeline of ASR Achievements

Characteristics of ASR Systems ASR systems are defined by the tasks they are designed to solve. We have already discussed some examples of tasks. Tasks involve the following parameters: vocabulary size, fluency, environmental effects, and speaker characteristics.

Vocabulary Size Milestones in ASR systems are often related to how large a vocabulary a system can handle while keeping the error rate at a minimum. Simple task vocabulary: recognizing digits (zero, one, two, ..., nine, and oh). These eleven words are the in-vocabulary (INV) words. Any words the system encounters outside of this set are known as out-of-vocabulary (OOV) words.

ASR Tasks and Vocabulary Sizes
Task Name | Vocabulary Size | Word Error Rate (%)
Texas Instruments (TI) Digits | 11 (zero-nine, oh) | 0.5
Wall Street Journal 1 | 5,000 | 3
Wall Street Journal 2 | 20,000 | 3
Broadcast News | 64,000+ | 10
Conversational Telephone Speech | 64,000+ | 20
As the vocabulary size of a task increases, so does the Word Error Rate (WER). WER is the standard evaluation metric for speech recognition.

Example WER Calculation This example is an output hypothesis of a string of numbers from an ASR system, compared with the true sentence string. The bottom line marks the types of errors as they occur in the transcription.
Reference:  ONE TWO THREE FOUR FIVE SIX SEVEN ***
Hypothesis: *** TWO ***** FIVE FIVE SIX SEVEN ONE
Evaluation: D       D     S                   I
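The slide shows the alignment but not the arithmetic. As a hedged sketch, WER is commonly computed as 100 * (S + D + I) / N, where S, D, and I are the substitution, deletion, and insertion counts of a minimum-edit alignment and N is the number of reference words. The function names below are illustrative and are not part of HTK or SeqRec.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for a minimum-edit word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # best[i][j] = (total_errors, S, D, I) for ref[:i] aligned with hyp[:j]
    best = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        best[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(len(hyp) + 1):
        best[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            mismatch = 0 if ref[i - 1] == hyp[j - 1] else 1
            e, s, d, n = best[i - 1][j - 1]
            diagonal = (e + mismatch, s + mismatch, d, n)   # match or substitution
            e, s, d, n = best[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = best[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            best[i][j] = min(diagonal, deletion, insertion)  # fewest total errors wins
    return best[-1][-1][1:]


def word_error_rate(reference, hypothesis):
    """WER in percent: 100 * (S + D + I) / number of reference words."""
    s, d, i = align_counts(reference, hypothesis)
    return 100.0 * (s + d + i) / len(reference.split())


REF = "ONE TWO THREE FOUR FIVE SIX SEVEN"
HYP = "TWO FIVE FIVE SIX SEVEN ONE"
print(align_counts(REF, HYP), round(word_error_rate(REF, HYP), 1))
```

For the slide's example this gives one substitution, two deletions, and one insertion, so WER = 4/7, or about 57.1%.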

ASR System Fluency Fluency measures the rigidity of input speech. In isolated-word recognition, the speech to be processed is surrounded by a known silence or pause. Examples include the digit recognition or command-and-control tasks. Continuous-speech systems must take non-speech events and segmentation of real words into account. This is much harder to accomplish!

Other ASR System Parameters Environmental noise and channel characteristics. Recording instruments may be located at different distances from each speaker and may pick up other noises in addition to speech. Speaker-dependent characteristics. Speaker dialect and accent.

Wake-up-Word Paradigm The Wake-up-Word (WUW) ASR Problem: Detect a single word or phrase when spoken in an alerting context, while rejecting all other words, phrases, sounds, noises and other acoustic events with virtually 100% accuracy including the same word or phrase of interest spoken in a non-alerting (i.e. referential) context.

WUW Example Application The user utters the WUW "Computer" to alert a machine to perform various commands. When the user utters the command phrase "Computer, begin presentation", WUW technology should detect that "Computer" was spoken in the alerting context and perform the requested command. If the user utters the phrase "I want to buy a new computer", WUW technology must detect that "computer" was used in a non-alerting context and avoid parsing the command.

WUW Problem Areas
- Detecting WUW Context: The WUW system must be able to notify the host system that attention is required in certain circumstances and with high accuracy. Unlike keyword-spotting, WUW dictates these occurrences only be reported during an alerting context. This context can be determined using features such as leading and trailing silence, difference in the long term average of speech features, and prosodic information (pitch, intonation, rhythm, etc.).
- Identifying WUW: After identifying the correct context for a spoken utterance, the WUW paradigm shall be responsible for determining if the utterance contains the pre-defined Wake-up-Word to be used for command (e.g. "Computer") with a high degree of accuracy, e.g., > 99%.
- Correct Rejection of Non-WUW: Similar to identification of the WUW, the system shall also be capable of filtering speech tokens that are not WUWs with practically 100% accuracy to guarantee 0% false acceptances.

Current WUW System Currently being used for practical applications such as: PowerPoint Commander, Elevator Simulator, Car Inspection System, and Nursing Call Center

Motivations for External Scoring Toolkit
- Support for standard speech recognition testing data sets. Provide support for evaluating the TIMIT data set in order to evaluate novel scoring methods against a broader class of words.
- Integration of standard toolkits. Utilize the Hidden Markov Model Toolkit (HTK) and the SVM library (LIBSVM) to build and evaluate HMM and SVM models. Using industry-standard frameworks has the benefit of a well-documented environment and previous results.
- Integration of novel scoring techniques with standard toolkits. The novel method used in the WUW system must be integrated with the existing workflow in the HTK framework in order to augment the technique and evaluate its effectiveness against additional data sources.
- Provide MATLAB-based analysis and experimentation tools. Once results are obtained using the SeqRec tools for HTK and LIBSVM, MATLAB scripts will be used to provide visualization of the results.
- Provide support for One-Class SVM modeling, a technique that allows a recognition model to be built on only INV data scores. This SVM type will be applied to WUW and the benefits and disadvantages will be explored.

SeqRec System Overview In order to further explore and refine the unique speech recognition elements of the WUW system, the Sequence Recognizer (SeqRec) Toolkit was developed.

Speech Recognition Goals Speech recognition systems often assume speech is a realization of some message encoded as a sequence of one or more discrete symbols. Speech is normally converted into a sequence of equally spaced discrete parameter vectors (typically one every 10 ms). This assumes that a speech waveform can be regarded as a stationary process over the sampling time.

Speech Recognition Goals, contd. The speech recognizer's job is to create a mapping between the sequences of speech frames and the underlying speech symbols that constitute the utterance.

Probability Theory of ASR What is the most likely discrete symbol sequence out of all valid sequences in the language L, given some acoustic input O? The acoustic input is a sequence of discrete observations: $O = o_1, o_2, \ldots, o_T$. The symbol sequence is defined as: $W = w_1, w_2, \ldots, w_N$. Fundamental ASR system goal: $\hat{W} = \arg\max_{W \in L} P(W \mid O)$.

Probability Theory of ASR, contd. Applying Bayes' Theorem: $\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}$. The new quantities are easier to compute than P(W|O). P(W) is defined as the prior probability for the sequence itself. This is calculated by using the prior knowledge of occurrences of the sequence W. P(O) is the prior probability of the acoustic input occurrence.

Probability Theory of ASR, contd. P(O) is not needed: it is the same for every candidate sequence W, so it does not affect which sequence the argmax selects. The probability P(O|W), which is the likelihood of the acoustic input O given the sequence W, is defined as the observation likelihood (often referred to as the acoustic score). This quantity can be determined using the Hidden Markov Model.

Elements of HMMs The set of states constituting the model. Although the states themselves are hidden from the perspective of state assignment of each observation vector, the exact number of states often carries a physical significance

Elements of HMMs, contd. The transition probability matrix. Each element of this matrix represents the probability of transitioning from state i to state j. Each row of this matrix must sum to 1 to be valid.

Elements of HMMs, contd. The emission probabilities. Each of these expresses the probability of an observation being generated during state i. Note that the beginning and end states of an HMM do not have an associated emission probability.

Elements of HMMs, contd. The probability distribution of starting in each state.

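Elements of HMMs, contd. The following equation is used to express all the parameters of an HMM in a compact form: $\lambda = (A, B, \pi)$, where A is the transition probability matrix, B is the set of emission probabilities, and $\pi$ is the probability distribution of starting in each state.

ASR HMMs An ASR HMM is normally used to model a phoneme, the smallest distinguishable sound unit in a language. Phoneme HMMs generally have three emitting states in order to model the transition-in, steady-state, and transition-out regions of the phoneme. A whole-word HMM is created by simply concatenating the HMMs of the phonemes that make up the pronunciation of the word in question.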

Acoustic Scores Using HMMs So how do we use HMMs to calculate the probability of an observation sequence, given a specific model? Restated: Score how well a given model matches an input observation sequence. For HMMs, each hidden state produces only a single observation. Length(sequence of traversed states) == Length(sequence of observation)

Acoustic Scores Using HMMs, contd. The actual state sequence that observation sequence will take is hidden. Assuming independence, have to calculate joint probability of a particular observation sequence and a particular state sequence: This probability must be calculated across all valid state sequences in the model:

Acoustic Scores Using HMMs, contd. While this solution is valid, it presents a calculation that requires O(N^T) computations. For speech processing applications of HMM, these parameters can become quite large. In order to reduce the amount of calculations needed, the forward algorithm is used.

Forward Algorithm The forward algorithm is a dynamic programming technique that uses a table to store intermediate values as it builds the final probability of the observation sequence. Each cell is calculated by summing over the extensions in all paths that lead to the current cell.

Forward Algorithm, contd. The forward algorithm is a three step process: 1. Initialization: 2. Induction: 3. Termination:

Forward Algorithm, contd.
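The three steps can be sketched as follows, using the textbook form from Rabiner: initialization alpha_1(i) = pi_i * b_i(o_1), induction alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1}), and termination P(O | model) = sum_i alpha_T(i). This is an illustration, not HTK's implementation. It assumes discrete emissions and an explicit initial distribution, with no non-emitting start and end states, and the toy parameter values are made up. A brute-force sum over all N^T state sequences is included only to check the result.

```python
from itertools import product


def forward(pi, A, B, obs):
    """P(obs | model) via the forward algorithm, O(N^2 * T) work."""
    n = len(pi)
    # 1. Initialization: probability of starting in state i and emitting obs[0]
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # 2. Induction: extend every path by one transition and one emission
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # 3. Termination: sum over all states the model may end in
    return sum(alpha)


def brute_force(pi, A, B, obs):
    """Same probability by enumerating every state sequence, O(N^T) paths."""
    total = 0.0
    for path in product(range(len(pi)), repeat=len(obs)):
        p = pi[path[0]] * B[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Toy two-state model with two observation symbols (values are illustrative only)
PI = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]     # each row sums to 1
B = [[0.5, 0.5], [0.1, 0.9]]     # B[state][symbol]
OBS = [0, 1, 1]
print(forward(PI, A, B, OBS), brute_force(PI, A, B, OBS))
```

Both functions return the same probability; only the forward version stays tractable as the number of states and frames grows.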

HMM Parameter Re-estimation HMM parameter re-estimation addresses how the model parameters should be adjusted in order to maximize the acoustic score. This problem is addressed by using the Baum-Welch algorithm.

HMM Parameter Re-estimation, contd. Goal for re-estimating the transition probability matrix A: Goal for re-estimating the emission probability distributions:

HMM Parameter Re-estimation, contd. These calculations lead to the following equations. (See Rabiner for details and derivations.)

HMM Parameter Re-estimation, contd. If a current model is re-estimated using the EM algorithm to create a new, refined model, then either: 1. The initial model defines a critical point of the likelihood function, in which case the re-estimated model equals the initial model (no HMM parameter updates were made). 2. A new model has been discovered that describes an HMM in which the observation sequence O is more likely to have been produced. The final model produced by EM is called the maximum likelihood HMM.

Speech-Specific HMM Recognition The previous section presented the fundamentals associated with using HMMs to perform general sequence recognition. There are some additional concepts associated specifically with the speech recognition task domain: feature representation of speech and Gaussian Mixture Model distributions.

Feature Representation of Acoustic Speech Signals The input to an ASR system is normally a continuous speech waveform. This input must be transformed into a sequence of acoustic feature vectors, each of which captures a small amount of information within the original waveform.

Feature Representation of Acoustic Speech Signals, contd. Pre-emphasis This stage is used to amplify energy in the high-frequencies of the input speech signal. This allows information in these regions to be more recognizable during HMM model training and recognition.
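A minimal sketch of pre-emphasis, assuming the common first-order filter y[n] = x[n] - a * x[n-1]. The coefficient a = 0.97 is a widely used value and is an assumption here; the slides do not state which value SeqRec uses.

```python
def pre_emphasize(samples, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out


slow = pre_emphasize([1.0, 1.0, 1.0, 1.0])      # constant (low-frequency) input
fast = pre_emphasize([1.0, -1.0, 1.0, -1.0])    # alternating (high-frequency) input
print(slow, fast)
```

The constant signal is almost cancelled after the first sample, while the rapidly alternating signal is nearly doubled in amplitude, which is the high-frequency boost the slide describes.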

Feature Representation of Acoustic Speech Signals, contd. Windowing This stage slices the input signal into discrete time segments. A Hamming window is commonly used to prevent edge effects associated with the sharp changes in a Rectangular window.
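As an illustration, the standard Hamming window is w[n] = 0.54 - 0.46 * cos(2 * pi * n / (N - 1)) for a frame of N samples. The helper names below are hypothetical and the frame length is chosen only for the demonstration.

```python
import math


def hamming_window(length):
    """Return the Hamming window coefficients for a frame of `length` samples."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def apply_window(frame):
    """Taper a frame so its edges shrink toward zero instead of cutting off sharply."""
    return [x * w for x, w in zip(frame, hamming_window(len(frame)))]


window = hamming_window(5)
print(window)
```

The coefficients fall to 0.08 at both edges and peak at 1.0 in the middle, unlike a rectangular window, which weights every sample equally.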

Feature Representation of Acoustic Speech Signals, contd. Discrete Fourier Transform DFT is applied to the windowed speech signal, resulting in the magnitude and phase representation of the signal.

Feature Representation of Acoustic Speech Signals, contd. Mel Filter Bank: Human hearing is less sensitive at frequencies above 1000 Hz, so the spectrum is warped using a logarithmic Mel scale. A bank of filters is constructed with filters distributed equally below 1000 Hz and spaced logarithmically above 1000 Hz.
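One common way to realize this warping is the formula mel(f) = 2595 * log10(1 + f / 700), which is close to linear below 1000 Hz and logarithmic above it. The sketch below places filter centers at equal spacing on the Mel axis. The choice of 26 filters is an assumption for illustration; the 8000 Hz upper edge is the Nyquist frequency for 16 kHz audio.

```python
import math


def hz_to_mel(freq_hz):
    """Warp a frequency in Hz onto the Mel scale."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_centers_hz(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters spaced equally on the Mel axis."""
    low, high = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high - low) / (num_filters + 1)
    return [mel_to_hz(low + step * (k + 1)) for k in range(num_filters)]


centers = filter_centers_hz(26, 0.0, 8000.0)
print([round(c) for c in centers])
```

Equal steps in Mel turn into steadily widening steps in Hz, so the filters are dense at low frequencies and sparse at high frequencies.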

Feature Representation of Acoustic Speech Signals, contd. Inverse DFT The IDFT of the Mel spectrum is computed, resulting in the cepstrum. This representation is valuable because it separates characteristics of the source and filter of the speech waveform. The first 12 values of the resulting cepstrum are recorded. Delta MFCC Features In order to capture the changes in speech from frame-to-frame, the first and second derivative of the MFCC coefficients are also calculated and included.
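A hedged sketch of the delta computation, using the regression formula d_t = sum_{k=1..K} k * (c_{t+k} - c_{t-k}) / (2 * sum_{k=1..K} k^2). A window of K = 2 and replication of the first and last frames at the edges are assumptions made for this example. Applying the same function to the deltas yields the double-delta features.

```python
def delta_features(frames, window=2):
    """frames: list of equal-length coefficient lists. Returns deltas of the same shape."""
    denom = 2 * sum(k * k for k in range(1, window + 1))
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        row = []
        for c in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, last)][c]   # replicate the last frame at the end
                behind = frames[max(t - k, 0)][c]     # replicate the first frame at the start
                acc += k * (ahead - behind)
            row.append(acc / denom)
        deltas.append(row)
    return deltas


ramp = [[float(t)] for t in range(6)]      # one coefficient rising by 1 per frame
deltas = delta_features(ramp)
double_deltas = delta_features(deltas)
print(deltas, double_deltas)
```

For a coefficient that rises by one unit per frame, the interior deltas equal 1.0, matching the slope of the ramp.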

Feature Representation of Acoustic Speech Signals, contd. Energy Feature This step is performed in parallel with the MFCC feature extraction and involves calculating the total energy of the input frame.

Feature Representation of Acoustic Speech Signals, contd. Results in a 39-element observation vector for each frame of speech.
Feature Type | Count
Cepstral Coefficients | 12
Delta Cepstral Coefficients | 12
Double Delta Cepstral Coefficients | 12
Energy Coefficient | 1
Delta Energy Coefficient | 1
Double Delta Energy Coefficient | 1
Total | 39

Gaussian Mixture Models Until now, the emission probability associated with an HMM state was left as a general probability distribution. In most ASR systems, these output probabilities are continuous-density multivariate output distributions. The most common form of this distribution used in speech recognition is the Gaussian Mixture Model (GMM).

Gaussian Mixture Models, contd. A simple Gaussian distribution describing a one-dimensional random variable X is described by its mean and variance.

Gaussian Mixture Models, contd. Assume a simple (though impractical) ASR system exists where a single-variable Gaussian is used. Each HMM state would have an emission probability that assumes the values of each observation vector are normally distributed.

Gaussian Mixture Models, contd. Recall that each observation is actually a D-element vector (where we found D = 39 for common MFCC representations). The distribution is therefore extended to a multivariate Gaussian distribution. In this case, the mean is a vector of length D and the covariance is a matrix of size D x D.

Gaussian Mixture Models, contd. What if some of the features do not follow a strictly normal distribution? This is actually quite common. In order to account for complex, non-normal distributions, the Gaussian Mixture Model is used. It is the result of combining M Gaussian components, where the contribution of each is given by a scalar weight.
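A one-dimensional sketch of the mixture density, p(x) = sum_m c_m * N(x; mu_m, var_m), with weights c_m that sum to 1. The three components below are invented for illustration and do not come from the TIMIT models.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Weighted sum of the component Gaussian densities."""
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


WEIGHTS = [0.5, 0.3, 0.2]       # scalar mixture weights, must sum to 1
MEANS = [-2.0, 0.0, 3.0]
VARIANCES = [0.5, 1.0, 2.0]

# Crude numerical integration over [-20, 20] to confirm the density has unit area
STEP = 0.01
area = sum(gmm_pdf(-20.0 + k * STEP, WEIGHTS, MEANS, VARIANCES) * STEP
           for k in range(4001))
print(round(area, 4))
```

Because the weights sum to 1, the mixture remains a valid probability density even though its shape is no longer a single bell curve.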

GMM Example Example of a non-normal, one-dimensional probability distribution that is more effectively modeled using a GMM with 3 mixtures

The Hidden Markov Modeling Toolkit (HTK) HTK is a well-established framework, primarily designed to build HMM-based systems used for speech processing and speech recognition tools.

HTK Data Preparation Tools Provide mechanisms to prepare arbitrarily formatted speech sound files and textual transcriptions into a uniform format suitable for HMM model training. The raw waveform audio must also be converted to MFCCs. Support data such as the phonetic dictionary must be properly formatted to ensure all pronunciations are available prior to training.

HTK Training Tools Uses the HTK-formatted data from the previous stage to define, initialize, and re-estimate the set of HMM models.

HTK Testing Tools Tools for generating text hypotheses given a set of unknown speech data. HTK provides features for full speech recognition, but SeqRec only needs the tools that generate the acoustic scores.

TIMIT Corpus Experiments use the Texas Instruments and Massachusetts Institute of Technology (TIMIT) corpus. It contains recordings of 630 speakers in 8 dialects of U.S. English. Each speaker is assigned 10 sentences to read that are carefully designed to contain a wide range of phonetic variability. Each utterance is recorded as a 16-bit waveform file sampled at 16 kHz. Two partitions of TIMIT: TRAIN, used to generate HMM models, and TEST, unseen by the SeqRec system until the final evaluation.

TIMIT Experiment Data Set The 24 words with the highest count of occurrences in the database. Varying length from ~7 frames for "a" to ~39 frames for "greasy". Highlighted words are another subset that will be used to show detailed experiment results.
Word | TRAIN Count | TEST Count | Phonemic Pronunciation | Phonemic Length | Frame Length
the | 1603 | 599 | dh ah | 2 | 8.26
to | 1018 | 352 | t uw | 2 | 10.69
in | 947 | 313 | ih n | 2 | 13.14
a | 867 | 301 | ah | 1 | 6.69
all | 545 | 223 | ao l | 2 | 20.67
that | 612 | 215 | dh ae t | 3 | 31.02
she | 572 | 208 | sh iy | 2 | 19.86
an | 571 | 207 | ae n | 2 | 9.83
your | 565 | 202 | y ao r | 3 | 12.31
me | 517 | 193 | m iy | 2 | 12.11
of | 455 | 185 | ah v | 2 | 11.99
had | 526 | 183 | hh ae d | 3 | 24.54
like | 518 | 179 | l ay k | 3 | 23.65
year | 473 | 177 | y ih r | 3 | 30.75
and | 492 | 175 | ah n d | 3 | 14.13
dark | 473 | 171 | d aa r k | 4 | 33.2
water | 479 | 170 | w ao t er | 4 | 28.35
ask | 464 | 169 | ae s k | 3 | 28.12
carry | 463 | 169 | k ae r iy | 4 | 36.51
suit | 462 | 168 | s uw t | 3 | 34.99
greasy | 462 | 168 | g r iy s iy | 5 | 39.05
wash | 469 | 168 | w aa sh | 3 | 35.07
oily | 470 | 168 | oy l iy | 3 | 33.38
rag | 470 | 168 | r ae g | 3 | 34.23

The HTK Recipe The versatility of the HTK Toolkit presents a steep learning curve. The HTK Recipe is used by SeqRec to provide a known-good starting point to create a well-trained set of monophone HMM models based on TIMIT.

Isolated Word Recognition Result Format Red - normalized acoustic scores for the INV evaluated against INV HMM Blue - normalized acoustic scores for the OOV evaluated against the INV HMM

Isolated Word Recognition Result Format, contd. CDFs plotted for each score distribution, OOV reversed. Point where two CDFs intersect is the operating threshold.

Isolated Word Recognition Result Format, contd. FA Rate: false acceptances of OOV words as INV. FR Rate: false rejections of INV words as OOV. Total Error Rate = FA Rate + FR Rate.

HMM Biasing Prior to scoring, the monophone HMM models constituting the INV word are re-estimated against only the INV training data. This allows SeqRec to simulate the performance improvement of context-dependent models. It was found experimentally that performing two re-estimations yielded the optimum increase in performance.
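The error rates and the operating threshold can be sketched as below. The sketch assumes a token is accepted as INV when its score is at or above the threshold, and it approximates the CDF intersection by picking the candidate threshold where the FA and FR rates are closest. The score values are invented for illustration and are not SeqRec results.

```python
def error_rates(inv_scores, oov_scores, threshold):
    """Return (FA rate, FR rate) when scores >= threshold are accepted as INV."""
    fa = sum(s >= threshold for s in oov_scores) / len(oov_scores)  # OOV accepted
    fr = sum(s < threshold for s in inv_scores) / len(inv_scores)   # INV rejected
    return fa, fr


def operating_threshold(inv_scores, oov_scores):
    """Pick the observed score where the FA and FR rates are closest to equal."""
    candidates = sorted(set(inv_scores) | set(oov_scores))

    def gap(threshold):
        fa, fr = error_rates(inv_scores, oov_scores, threshold)
        return abs(fa - fr)

    return min(candidates, key=gap)


INV = [-60.0, -62.0, -65.0, -70.0, -58.0]    # made-up normalized INV scores
OOV = [-80.0, -75.0, -72.0, -68.0, -90.0]    # made-up normalized OOV scores
THRESHOLD = operating_threshold(INV, OOV)
FA, FR = error_rates(INV, OOV, THRESHOLD)
print(THRESHOLD, FA, FR, FA + FR)
```

With these toy scores the threshold lands at -68, where one OOV token is accepted and one INV token is rejected, giving FA = FR = 0.2 and a Total Error Rate of 0.4.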

HMM Biasing and Increased Recognizer Performance

Baseline Monophone HMM Results The TIMIT single-word recognizer performance baseline was established using monophone HMMs with 1, 8, and 16 Gaussian Mixture components

Validation of Results Same TIMIT data set was evaluated against third-party WSJ models from the author of the HTK Recipe procedure. Average Total Error Rate was compared to the SeqRec models.

Baseline Results Observations
- In general, a higher number of mixture components in the GMM models yields lower error rates. This is expected, due to the complex distributions of many of the MFCC features used to represent the speech data.
- HMM models generated by SeqRec perform slightly better than the WSJ models. The WSJ models are re-estimated many times using data from a much broader set than just TIMIT.
- Overall, the baseline monophone models have shown that 16-mixture TIMIT monophone HMMs yield the lowest average Total Error Rate, 20.07%.

Incorporating Additional Scores Key feature of the existing WUW system is the application of an additional scoring method. Score 2 can be computed using the same HTK tools that were used to determine the acoustic score.

Distribution of Multiple Scores When combined, score 1 and score 2 each contribute unique information to the recognition task. Score 2 shifts the INV score result distribution below the OOV results

Introduction to SVM Cannot use the simple, one-dimensional binary classifier with two scores. Support Vector Machines (SVMs) are a set of learning methods that can be used to build a complex classification model for data with multiple features.

Fundamentals of SVM Classifiers Consider a task requiring the binary classification of m data points, each having a classification label of +1 or -1. Each data point is represented by a d-dimensional collection of attributes (also known as a feature vector).

Discriminant Plane The vector w describes the orientation and b describes the offset of a discriminant plane that can be used to classify the data. There are infinitely many planes that can be applied to a set of points.

Maximal Margin The plane given by the solid line provides the best solution because it would be more robust to additional data that exhibit perturbations from the training set. This plane is said to provide the maximal margin between the two classes of data points.

Maximal Margin, contd. For linearly separable data, a method for determining the maximum margin between the two classes is to maximize the margin between two parallel supporting planes. The distance between these planes is maximized to determine the optimal plane for classification.

Maximal Margin, contd. Maximizing the margin is equivalent to maximizing the distance between the two supporting planes. Solved using the following Quadratic Programming problem:

Linearly Inseparable Data For this type of data, a slack variable is introduced into each constraint and the slack variables are then added to the objective as a weighted penalty term. Practically, the C parameter represents a trade-off between classification error and maximal margin.

Alternate Form of the QP Writing the classification rule in its dual form reveals that the maximum margin hyperplane is only a function of the support vectors - the training data that lie on the margin Orange data points in previous slides

Non-Linear Classification For many data distributions, a simple linear plane cannot be effectively applied to classify points. This data distribution would be best classified using an elliptical classification surface.

Non-Linear Classification, contd. Consider 2-dimensional training data with attributes [r, s]. To construct a quadratic discriminant function, the 2-dimensional input can be mapped into a 5-dimensional data set described by [r, s, rs, r^2, s^2]. A linear discriminant can then be computed in this new feature space. This can be substituted into the original linear discriminant function, taking into account the mapping function into feature space.

Non-Linear Classification, contd. The existing Quadratic Programming problem can be modified to use the mapping function: For practical usage of SVM, it is not feasible to calculate the mapping function. SVMs work around this issue by using kernel functions, which allow us to evaluate the inner product without having to explicitly know the mapping function.

Non-Linear Classification, contd. Final form of the Quadratic Programming problem: The following table outlines the popular kernel functions used in SVMs:
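The kernel idea can be checked numerically. The sketch below uses a degree-2 polynomial kernel, K(x, y) = (1 + x . y)^2, and an explicit 6-dimensional map that adds a constant term and sqrt(2) scaling to the [r, s, rs, r^2, s^2] features so that the two inner products agree exactly. This variant of the map is chosen for the demonstration and is not necessarily the one used on the slides.

```python
import math


def quadratic_map(point):
    """Explicit feature map whose inner product equals (1 + x . y)^2."""
    r, s = point
    root2 = math.sqrt(2.0)
    return [1.0, root2 * r, root2 * s, root2 * r * s, r * r, s * s]


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(x, y, degree=2):
    """Evaluate the feature-space inner product without building the map."""
    return (1.0 + dot(x, y)) ** degree


x, y = (1.0, 2.0), (3.0, 4.0)
explicit = dot(quadratic_map(x), quadratic_map(y))   # inner product in feature space
implicit = polynomial_kernel(x, y)                   # same value from the input space
print(explicit, implicit)
```

Both computations give 144 for these two points, but the kernel only ever touches the original 2-dimensional inputs.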

Summary of SVM Procedure 1. Select the C parameter (recall this is the trade-off between classification error and margin maximization). 2. Select the kernel function and any kernel-specific parameter values. 3. Solve the Quadratic Problem to determine the set of support vectors and multipliers. 4. Recover the threshold variable b using the set of support vectors. 5. Apply the SVM to classify a new data point x using the final classification function.

Example of SVM Polynomial Kernel

Applying SVM to SeqRec LIBSVM is a software library that provides tools to allow users to easily and quickly implement SVM-based classifiers.
- svm-scale: scales the features of the input data.
- svm-train: trains an SVM model using a set of labeled training data. It supports the popular kernel functions and specification of the C parameter to use.
- svm-predict: takes unlabeled data and a previously generated SVM model and outputs the classification label hypothesis determined by applying the decision function.

SVM Parameter Search The RBF kernel will be used for the experiments with TIMIT. The two parameters that must be selected when applying the RBF kernel to SVM are C and γ. A common method to perform parameter searching is known as cross-validation. LIBSVM provides an implementation known as v-fold cross-validation. The training data set is first sub-divided into v subsets. Each subset is then sequentially tested using a classifier trained on the other v-1 subsets, so that each instance of the whole training set is predicted exactly once. The cross-validation accuracy is the percentage of data correctly classified using this procedure.
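The v-fold procedure can be sketched independently of LIBSVM. In the sketch below, the fold assignment (every v-th item) and the stand-in midpoint classifier are assumptions made only to keep the example self-contained. In the actual workflow the training and prediction steps would be performed by the SVM.

```python
def v_fold_splits(n_items, v):
    """Yield (train_indices, test_indices) for each of the v folds."""
    folds = [list(range(k, n_items, v)) for k in range(v)]
    for k in range(v):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def cross_validation_accuracy(features, labels, v, train_fn, predict_fn):
    """Percentage of items classified correctly when each item is predicted once."""
    correct = 0
    for train_idx, test_idx in v_fold_splits(len(labels), v):
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(predict_fn(model, features[i]) == labels[i] for i in test_idx)
    return 100.0 * correct / len(labels)


def train_midpoint(xs, ys):
    """Stand-in classifier: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_midpoint(threshold, x):
    return 1 if x >= threshold else -1


XS = [1.0, 1.2, 0.9, 1.1, -1.0, -1.2, -0.9, -1.1]   # toy one-dimensional features
YS = [1, 1, 1, 1, -1, -1, -1, -1]
accuracy = cross_validation_accuracy(XS, YS, 4, train_midpoint, predict_midpoint)
print(accuracy)
```

In a grid search, this accuracy would be recomputed for each candidate (C, γ) pair and the pair with the highest value kept.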

SVM Parameter Search, contd. Cross-validation has the property of avoiding the problem of overtraining. If parameters were chosen that yielded the best classification accuracy for the entire training data set, the SVM may be too specific and would falsely reject unseen data. SVM may have worse accuracy during the model building stage but in general will perform better against unseen data. Cross-validation accuracy is computed across the following parameter ranges:

Applying SVM to Multiple Scores TIMIT "greasy" TRAIN score data yields γ = 0.03125 and C = 8 as the best parameters. To evaluate the model on unseen data, the SVM model is now applied to the TEST scores 1 and 2 for TIMIT "greasy". LIBSVM is able to output decision values in addition to the binary class labels. A greater magnitude of a decision value means greater confidence that the value is a part of the chosen class. These values can then be treated as a single-dimensional input to the original binary classifier.

Two-Class SVM TIMIT greasy Total Error Rate is reduced from 2.45% to 0.97% Recognition rate is 2.55 times better!

Incorporating the Word Duration Feature As opposed to the TIMIT greasy example considered so far, the distributions for some words are highly correlated and do not exhibit good performance using just Score 1 & 2. One possible cause for this is that the shorter the time duration of the word, the more apparent any errors in the hand-labeled durations are.

Incorporating the Word Duration Feature, contd. SVMs are capable of handling data with many features, so it makes sense to think of the length of the scored word as a feature itself. If two phonetically similar words such as "a" and "and" produce very similar acoustic scores, duration could intuitively be used to increase the reliability of the decision.

SVM with Duration - TIMIT "and" The SVM with the duration feature lowers the original monophone classifier error rate from 61.83% to 32.95%, a relative improvement of 88%, or 1.88 times. Notice that the SVM applied without the duration feature is basically useless for this particular word.

One-Class SVM SVMs that have been considered thus far have operated by classifying data vectors into one of two different classes. This requires a database of acoustic scores for both the INV word, as well as scores of all other words. One-Class SVM is a class of SVM models that only depend on having a single class of data available for classification.

One-Class SVM, contd. Problem Statement - Suppose that some data set has a probability distribution P in feature space. Find a simple subset S of the feature space such that the probability that a test point from P lies outside of S is bounded by some a priori value.

One-Class SVM, contd. The strategy is to map the data into kernel feature space (same as regular SVM), and then separate the data from the origin with maximum margin. The origin in feature space is the only original member of the negative class. Results in a modification to the Two-Class SVM Quadratic Programming problem: The classification function is the same as Two-Class SVM:

One-Class SVM - ν Parameter The modified QP introduces the ν parameter. As ν approaches 0, the upper bound in the second inequality becomes very large and has a decreasing impact on the expression. This leads to a hard-margin problem because the penalty for errors becomes infinite. As ν is increased, the misclassification penalty is relaxed and errors are allowed. Notice the effect of ν on outliers when the penalty for errors is low.

One-Class SVM Parameter Search The cross-validation grid-search strategy will be applied for One-Class SVM parameter optimization:
1. The ν parameter is searched for instead of the cost parameter C.
2. The input SVM training data now only includes INV TRAIN data.
3. The One-Class SVM model is evaluated against INV and OOV TRAIN data. This accuracy is recorded in order to evaluate the effect of ν on overall error rate.
4. Select the parameters that yield the highest accuracy from (3).

One-Class SVM TIMIT greasy Results in a Total Error Rate of 1.10%, as compared to the Two-Class SVM classifier that is able to achieve 0.88%

One-Class SVM Observations
- The number of support vectors required for a competitive One-Class SVM model is much lower than the number required for Two-Class SVM (54 for Two-Class versus 19 for One-Class for TIMIT "greasy").
- The processing time to train the One-Class SVM model is much lower because only the INV data has to be considered in the Quadratic Programming optimization problem to determine the maximum margin classifier (2.330 s for Two-Class versus 0.001 s for One-Class for TIMIT "greasy").
- The overall performance is generally lower for One-Class SVM models. "Of course, the absence of negative information entails a price, and one should not expect as good results as when this information is available."

Final SeqRec Experiment Configurations The following techniques will be evaluated against the 25-word TIMIT test subset:
1. Score 1 Classification (Code: Score 1)
2. (Score 1 + Score 2) Classification With Two-Class SVM (Code: CSVMND)
3. (Score 1 + Score 2 + Duration) Classification With Two-Class SVM (Code: CSVM)
4. (Score 1 + Score 2 + Duration) Classification With One-Class SVM (Code: OSVM)
All acoustic scores will be generated using 16-mixture monophone HMMs generated using SeqRec. The TIMIT test set is divided into two groups to increase graph readability.

Evaluation Metrics The Total Error Rate metric will be the primary criterion of performance for each method. The Relative Error Rate Reduction (RERR) and Error Rate Reduction (ERRR) will be calculated and used to compare performance between two methods, where B is the baseline Total Error Rate and N is the new Total Error Rate.
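The defining equations did not survive on this slide. The sketch below uses definitions inferred from the figures reported elsewhere in the deck, namely RERR = 100 * (B - N) / N and ERRR = B / N, and should be read as an assumption rather than the author's stated formulas. They reproduce the reported improvement for the TIMIT word "and" (61.83% to 32.95%).

```python
def rerr(baseline, new):
    """Relative Error Rate Reduction in percent (inferred form): 100 * (B - N) / N."""
    return 100.0 * (baseline - new) / new


def errr(baseline, new):
    """Error Rate Reduction as a ratio (inferred form): B / N."""
    return baseline / new


B, N = 61.83, 32.95      # baseline and new Total Error Rates for TIMIT "and"
print(round(rerr(B, N)), round(errr(B, N), 2))
```

Under these definitions ERRR = 1 + RERR / 100, which matches every row of the results tables, including rows where the new method is worse than the baseline (negative RERR, ERRR below 1).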

Manual Parameter Selections Experimentation has revealed that the grid search method does not always yield the most appropriate parameters for Two-Class SVM. The following words were found to perform considerably better for the Two-Class SVM models when using the parameters listed in the right columns, as opposed to the parameters in the left columns that the grid search discovered. Using these determined values for the problem words in the TIMIT test data set demonstrates the actual capabilities of the SeqRec classifier.
Word | C - Grid | γ - Grid | C - Select | γ - Select
ask | 2048 | 8 | 8 | 8
all | 8192 | 2 | 32 | 8
water | 8192 | 0.5 | 32 | 2
year | 2048 | 2 | 8192 | 2
in | 8192 | 8 | | 0.5
that | 512 | 8 | 8192 | 0.5

SeqRec Results TIMIT Test Set 1

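Word | RERR (%) CSVMND | ERRR CSVMND | RERR (%) CSVM | ERRR CSVM | RERR (%) OSVM | ERRR OSVM
suit | 391 | 4.91 | 723 | 8.23 | 90 | 1.90
greasy | 146 | 2.46 | 172 | 2.72 | 117 | 2.17
year | 23 | 1.23 | 126 | 2.26 | 8 | 1.08
rag | 518 | 6.18 | 147 | 2.47 | 19 | 1.19
wash | 1432 | 15.32 | 4303 | 44.03 | 2249 | 23.49
carry | 407 | 5.07 | 1152 | 12.52 | 1152 | 12.52
water | 156 | 2.56 | 180 | 2.80 | 197 | 2.97
she | 124 | 2.24 | 344 | 4.44 | 245 | 3.45
all | 22 | 1.22 | 82 | 1.82 | 59 | 1.59
dark | 590 | 6.90 | 737 | 8.37 | 465 | 5.65
had | 205 | 3.05 | 253 | 3.53 | 575 | 6.75
ask | 158 | 2.58 | 324 | 4.24 | 169 | 2.69
oily | 89 | 1.89 | 228 | 3.28 | 111 | 2.11
me | 175 | 2.75 | 368 | 4.68 | 211 | 3.11
like | 389 | 4.89 | 538 | 6.38 | 522 | 6.22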

SeqRec Results TIMIT Test Set 2

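Word | RERR (%) CSVMND | ERRR CSVMND | RERR (%) CSVM | ERRR CSVM | RERR (%) OSVM | ERRR OSVM
your | 251 | 3.51 | 360 | 4.60 | 137 | 2.37
that | 106 | 2.06 | 133 | 2.33 | 10 | 1.10
an | 64 | 1.64 | 301 | 4.01 | 79 | 1.79
in | -15 | 0.85 | 116 | 2.16 | 53 | 1.53
of | 65 | 1.65 | 198 | 2.98 | 55 | 1.55
to | 261 | 3.61 | 725 | 8.25 | 278 | 3.78
and | -36 | 0.64 | 88 | 1.88 | -8 | 0.92
the | 275 | 3.75 | 673 | 7.73 | 258 | 3.58
Average | 245 | 3.45 | 532 | 6.32 | 301 | 4.00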

Concluding Remarks
- The SeqRec system successfully integrated off-the-shelf speech recognition and SVM frameworks to create a working single-word classification system that shows remarkable error rate improvements on the well-known TIMIT data set.
- Average RERR of 532% using Two-Class SVM scoring with the Duration feature. This leads to a single-word recognition system capable of performing with an overall average Total Error Rate of 5.4%, as compared to the baseline of 20.6%.
- The highest gain was for the TIMIT word "wash": the baseline Total Error Rate was 2.51% and the Two-Class SVM with Duration Total Error Rate was 0.06%, an RERR of 4303%.
- One-Class SVM is indeed a viable method for significantly reducing recognizer error, with an average RERR of 301%. It outperforms Two-Class SVM without the Duration feature.

Acknowledgements A very special thank you to Dr. Këpuska for his dedication to the field of Speech Recognition and for allowing me to participate in a very exciting part of it! Thanks to FIT's ECE Department for the support provided to this field of study.

Wrap-up (Time Permitting) Show Individual TIMIT Word Results in MATLAB. Future Work Topics. Questions from the audience.
