
GRAPH BASED MANIFOLD REGULARIZED DEEP NEURAL NETWORKS FOR AUTOMATIC SPEECH RECOGNITION

Vikrant Singh Tomar1,2, Richard Rose1,3

1 McGill University, Department of Electrical and Computer Engineering, Montreal, Canada
2 Fluent.ai, Montreal, QC, Canada

3 Google Inc., NYC, USA

ABSTRACT

Deep neural networks (DNNs) have been successfully applied to a wide variety of acoustic modeling tasks in recent years. These include the applications of DNNs either in a discriminative feature extraction or in a hybrid acoustic modeling scenario. Despite the rapid progress in this area, a number of challenges remain in training DNNs. This paper presents an effective way of training DNNs using a manifold learning based regularization framework. In this framework, the parameters of the network are optimized to preserve underlying manifold based relationships between speech feature vectors while minimizing a measure of loss between network outputs and targets. This is achieved by incorporating manifold based locality constraints in the objective criterion of DNNs. Empirical evidence is provided to demonstrate that training a network with manifold constraints preserves structural compactness in the hidden layers of the network. Manifold regularization is applied to train bottleneck DNNs for feature extraction in hidden Markov model (HMM) based speech recognition. The experiments in this work are conducted on the Aurora-2 spoken digits and the Aurora-4 read news large vocabulary continuous speech recognition tasks. The performance is measured in terms of word error rate (WER) on these tasks. It is shown that the manifold regularized DNNs result in up to 37% reduction in WER relative to standard DNNs.

Index Terms— manifold learning, deep neural networks, manifold regularization, manifold regularized deep neural networks, speech recognition

1. INTRODUCTION

Recently there has been a resurgence of research in the area of deep neural networks (DNNs) for acoustic modeling in automatic speech recognition (ASR) [1–6]. Much of this research has been concentrated on techniques for regularization of the algorithms used for DNN parameter estimation [7–9]. At the same time, there has also been a great deal of research on graph based techniques that facilitate the preservation of local neighborhood relationships among feature vectors for parameter estimation in a number of application areas [10–13].

Algorithms that preserve these local relationships are often referred to as having the effect of applying manifold based constraints. This is because subspace manifolds are generally defined to be low-dimensional, perhaps nonlinear, surfaces where locally linear relationships exist among points along the surface. Hence, preservation of these local neighborhood relationships is often thought, to some approximation, to have the effect of preserving the structure of the manifold. To the extent that this assumption is correct, it suggests that the manifold shape can be preserved without knowledge of the overall manifold structure.

This paper presents an approach for applying manifold based constraints to the problem of regularizing training algorithms for DNN based acoustic models in ASR. This approach involves redefining the optimization criterion in DNN training to incorporate the local neighborhood relationships among acoustic feature vectors. To this end, this work uses discriminative manifold learning (DML) based constraints. The resultant training mechanism is referred to here as manifold regularized DNN (MRDNN) training and is described in Section 3. Results are presented demonstrating the impact of MRDNN training on both the ASR word error rates (WERs) obtained from MRDNN trained models and on the behavior of the associated nonlinear feature space mapping.

Previous work in optimizing deep network training includes approaches for pre-training of network parameters such as restricted Boltzmann machine (RBM) based generative pre-training [14–16], layer-by-layer discriminative pre-training [17], and stacked auto-encoder based pre-training [18, 19]. However, more recent studies have shown that, if there are enough observation vectors available for training models, network pre-training has little impact on ASR performance [6, 9]. Other techniques, such as dropout and the use of rectified linear units (ReLUs) as nonlinear elements in place of sigmoid units in the hidden layers of the network, are also thought to have a regularizing effect on DNN training [7, 8].

The manifold regularization techniques described in this paper are applied to estimate parameters for a tandem DNN with a low-dimensional bottleneck hidden layer [20].


The discriminative features obtained from the bottleneck layer are input to a Gaussian mixture model (GMM) based HMM (GMM-HMM) speech recognizer [4, 21]. This work extends [22] by presenting an improved MRDNN training algorithm that results in reduced computational complexity and better ASR performance. ASR performance of the proposed technique is evaluated on two well known speech-in-noise corpora. The first is the Aurora-2 speech corpus, a connected digit task [23], and the second is the Aurora-4 speech corpus, a read news large vocabulary continuous speech recognition (LVCSR) task [24]. Both speech corpora and the ASR systems configured for these corpora are described in Section 4.

An important impact of the proposed technique is the well-behaved internal feature representations associated with MRDNN trained networks. It is observed that the modified objective criterion results in a feature space in the internal network layers such that the local neighborhood relationships between feature vectors are preserved. That is, if two input vectors, x_i and x_j, lie within the same local neighborhood in the input space, then their corresponding mappings, z_i and z_j, in the internal layers will also be neighbors. This property can be characterized by means of an objective measure referred to as the contraction ratio [25], which describes the relationship between the sizes of the local neighborhoods in the input and the mapped feature spaces. The performance of MRDNN training in producing mappings that preserve these local neighborhood relationships is presented in terms of the measured contraction ratio in Section 3. The locality preservation constraints associated with MRDNN training are also shown in Section 3 to lead to a more robust gradient estimation in error back propagation training.

The rest of this paper is structured as follows. Section 2 provides a review of basic principles associated with DNN training, manifold learning, and manifold regularization. Section 3 describes the MRDNN algorithm formulation and provides a discussion of the contraction ratio as a measure of locality preservation in the DNN. Section 4 describes the task domains and system configurations and presents the ASR WER performance results. Section 5 discusses the computational complexity of the proposed MRDNN training and the effect of noise on the manifold learning methods. Section 5.3 discusses another way of applying manifold regularization to DNN training, in which the manifold regularization term is used only for the first few training epochs. Conclusions and suggestions for future work are presented in Section 6.

2. BACKGROUND

This section provides a brief review of DNNs and manifold learning principles as context for the MRDNN approach presented in Section 3. Section 2.1 provides a summary of DNNs and their applications in ASR. An introduction to manifold learning and related techniques is provided in Section 2.2; this includes the discriminative manifold learning (DML) framework and locality sensitive hashing techniques for speeding up the neighborhood search for these methods. Section 2.3 briefly describes the manifold regularization framework and some of its example implementations.

2.1. Deep Neural Networks

A DNN is simply a feed-forward artificial neural network that has multiple hidden layers between the input and output layers. Typically, such a network produces a mapping f_dnn : x → z from the inputs, x_i, i = 1 ... N, to the output activations, z_i, i = 1 ... N. This is achieved by finding an optimal set of weights, W, to minimize a global error or loss measure, V(x_i, t_i, f_dnn), defined between the outputs of the network and the targets, t_i, i = 1 ... N. In this work, L2-norm regularized DNNs are used as baseline DNN models. The objective criterion for such a model is defined as

F(W) = \frac{1}{N} \sum_{i=1}^{N} V(x_i, t_i, f_{dnn}) + \gamma_1 \|W\|^2,    (1)

where the second term refers to the L2-norm regularization applied to the weights of the network, and γ_1 is the regularization coefficient that controls the relative contributions of the two terms.

The weights of the network are updated for several epochs over the training set using mini-batch gradient descent based error back-propagation (EBP),

w^l_{m,n} \leftarrow w^l_{m,n} + \eta \nabla_{w^l_{m,n}} F(W),    (2)

where w^l_{m,n} refers to the weight on the input to the nth unit in the lth layer of the network from the mth unit in the preceding layer. The parameter η corresponds to the gradient learning rate. The gradient of the global error with respect to w^l_{m,n} is given as

\nabla_{w^l_{m,n}} F(W) = -\Delta^l_{m,n} - 2\gamma_1 w^l_{m,n},    (3)

where \Delta^l_{m,n} is the error signal in the lth layer and its form depends on both the error function and the activation function. Typically, a soft-max nonlinearity is used at the output layer along with the cross-entropy objective criterion. Assuming that the input to the pth unit in the final layer is calculated as net^L_p = \sum_n w^L_{n,p} z^{L-1}_n, the error signal for this case is given as \Delta^L_{n,p} = (z_p - t_p) z^{L-1}_n [26, 27].
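As a concrete illustration of Eqs. (1)-(3), the following minimal numpy sketch (not the authors' implementation; the layer sizes and hyperparameter values are illustrative assumptions) performs one L2-regularized gradient step for a single softmax output layer trained with the cross-entropy loss.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

def l2_regularized_step(W, X, T, gamma1=1e-4, eta=1e-3):
    """One gradient step for a single softmax layer z = softmax(X W).

    X: (N, D) inputs, T: (N, C) one-hot targets, W: (D, C) weights.
    Equivalent to the update of Eq. (2) with the gradient of Eq. (3),
    written here in the usual descent form W <- W - eta * dF/dW.
    """
    N = X.shape[0]
    Z = softmax(X @ W)                          # network outputs
    grad_ce = X.T @ (Z - T) / N                 # cross-entropy/softmax error signal
    grad = grad_ce + 2.0 * gamma1 * W           # add the L2 weight-decay term of Eq. (1)
    return W - eta * grad

# Purely illustrative usage with random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 429))                      # 429-dim spliced feature inputs
T = np.eye(180)[rng.integers(0, 180, size=32)]      # one-hot CDHMM state targets
W = 0.01 * rng.normal(size=(429, 180))
W = l2_regularized_step(W, X, T)
```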

2.2. Manifold Learning

Manifold learning techniques work under the assumption that high-dimensional data can be considered as a set of geometrically related points lying on or close to the surface of a smooth low dimensional manifold [12, 28, 29]. Typically, these techniques are used for dimensionality reducing feature space transformations with the goal of preserving manifold domain relationships among data vectors in the original space [28, 30].


The application of manifold learning methods to ASR is supported by the argument that speech is produced by the movements of loosely constrained articulators [13, 31]. Motivated by this, manifold learning can be used in acoustic modeling and feature space transformation techniques to preserve the local relationships between feature vectors. This can be realized by embedding the feature vectors into one or more partially connected graphs [32]. An optimality criterion is then formulated based on the preservation of manifold based geometrical relationships. One such example is locality preserving projections (LPP) for feature space transformations in ASR [33]. The objective function in LPP is formulated so that feature vectors close to each other in the original space are also close to each other in the target space.

2.2.1. Discriminative Manifold Learning

Most manifold learning techniques, LPP for example, are unsupervised and non-discriminative. As a result, these techniques do not generally exploit the underlying inter-class discriminative structure in speech features. For this reason, features belonging to different classes might not be well separated in the target space. A discriminative manifold learning (DML) framework for feature space transformations in ASR is proposed in [10, 34, 35]. The DML techniques incorporate discriminative training into manifold based nonlinear locality preservation. For a dimensionality reducing mapping f_G : x ∈ R^D → z ∈ R^d, where d ≪ D, these techniques attempt to preserve the within-class manifold based relationships in the original space while maximizing a criterion related to the separability between classes in the output space.

In DML, the manifold based relationships between the feature vectors are characterized by two separate partially connected graphs, namely the intrinsic and penalty graphs [10, 32]. If the feature vectors, X ∈ R^{N×D}, are represented by the nodes of the graphs, the intrinsic graph, G_{int} = {X, Ω_{int}}, characterizes the within-class or within-manifold relationships between feature vectors. The penalty graph, G_{pen} = {X, Ω_{pen}}, characterizes the relationships between feature vectors belonging to different speech classes. Ω_{int} and Ω_{pen} are known as the affinity matrices for the intrinsic and penalty graphs respectively and represent the manifold based distances or weights on the edges connecting the nodes of the graphs. The elements of these affinity matrices are defined in terms of Gaussian kernels as

\omega^{int}_{ij} = \begin{cases} \exp\!\left(-\frac{\|x_i - x_j\|^2}{\rho}\right) & C(x_i) = C(x_j),\ e(x_i, x_j) = 1 \\ 0 & \text{otherwise} \end{cases}    (4)

and

\omega^{pen}_{ij} = \begin{cases} \exp\!\left(-\frac{\|x_i - x_j\|^2}{\rho}\right) & C(x_i) \neq C(x_j),\ e(x_i, x_j) = 1 \\ 0 & \text{otherwise,} \end{cases}    (5)

where C(x_i) refers to the class or label associated with vector x_i and ρ is the Gaussian kernel heat parameter. The function e(x_i, x_j) indicates whether x_j is in the neighborhood of x_i. Neighborhood relationships between two vectors can be determined by a k-nearest neighbor (kNN) search.
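For illustration, the following sketch (an assumption, not the authors' code) builds the intrinsic and penalty affinity matrices of Eqs. (4) and (5) with a brute-force kNN search; Section 2.2.2 below describes how locality sensitive hashing replaces this exhaustive search for large datasets. The defaults k = 10 and ρ = 1000 follow the values reported later in the paper.

```python
import numpy as np

def affinity_matrices(X, labels, k=10, rho=1000.0):
    """X: (N, D) feature vectors, labels: (N,) class indices.

    Returns the intrinsic and penalty affinity matrices of Eqs. (4)-(5),
    using an exhaustive O(N^2) neighbor search for simplicity.
    """
    N = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise ||x_i - x_j||^2
    kernel = np.exp(-d2 / rho)                             # Gaussian kernel weights
    omega_int = np.zeros((N, N))
    omega_pen = np.zeros((N, N))
    for i in range(N):
        nbrs = np.argsort(d2[i])[1:k + 1]                  # k nearest neighbors (skip self)
        for j in nbrs:                                     # e(x_i, x_j) = 1 for these j
            if labels[i] == labels[j]:
                omega_int[i, j] = kernel[i, j]             # within-class edge, Eq. (4)
            else:
                omega_pen[i, j] = kernel[i, j]             # between-class edge, Eq. (5)
    return omega_int, omega_pen
```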

In order to design an objective criterion that minimizes manifold domain distances between feature vectors belonging to the same speech class while maximizing the distances between feature vectors belonging to different speech classes, a graph scatter measure is defined. This measure represents the spread of the graph, or the average distance between feature vectors in the target space with respect to the original space. For a generic graph G = {X, Ω}, a measure of the graph's scatter for a mapping f_G : x → z can be defined as

F_G(Z) = \sum_{i,j} \|z_i - z_j\|^2 \omega_{ij}.    (6)

The objective criterion in discriminative manifold learning is designed as the difference of the intrinsic and penalty scatter measures,

F^{diff}_G(Z) = \sum_{i,j} \|z_i - z_j\|^2 \omega^{int}_{ij} - \sum_{i,j} \|z_i - z_j\|^2 \omega^{pen}_{ij} = \sum_{i,j} \|z_i - z_j\|^2 \omega^{diff}_{ij},    (7)

where \omega^{diff}_{ij} = \omega^{int}_{ij} - \omega^{pen}_{ij}.
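A direct numpy transcription of Eqs. (6) and (7), given the mapped vectors Z and the affinity matrices sketched above, might look as follows (illustrative, not the authors' code).

```python
import numpy as np

def graph_scatter(Z, omega):
    """Graph scatter of Eq. (6) for mapped vectors Z: (N, d) and affinities omega: (N, N)."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # ||z_i - z_j||^2
    return (d2 * omega).sum()

def dml_criterion(Z, omega_int, omega_pen):
    """Discriminative criterion of Eq. (7): intrinsic scatter minus penalty scatter."""
    return graph_scatter(Z, omega_int) - graph_scatter(Z, omega_pen)
```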

DML based dimensionality reducing feature space transformation techniques have been reported to provide significant gains over conventional techniques such as linear discriminant analysis (LDA) and unsupervised manifold learning based LPP for ASR [10, 34, 35]. This has motivated the use of DML based constraints for regularizing DNNs in this work.

2.2.2. Locality Sensitive Hashing

One key challenge in applying manifold learning based methods to ASR is that, because of their large computational complexity, these methods do not directly scale to big datasets. This complexity originates from the need to calculate a pair-wise similarity measure between feature vectors to construct the neighborhood graphs. This work uses locality sensitive hashing (LSH) based methods for fast construction of the neighborhood graphs as described in [36, 37]. LSH creates hashed signatures of vectors in order to distribute them into a number of discrete buckets such that vectors close to each other are more likely to fall into the same bucket [38, 39]. In this manner, one can efficiently perform similarity searches by exploring only the data points falling into the same or adjacent buckets.
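The specific LSH schemes of [36, 37] are not reproduced here; the following is a generic random-hyperplane sketch of the idea, in which vectors are bucketed by the signs of a few random projections and exact distances are computed only against the candidates in the matching bucket.

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, n_bits=12, seed=0):
    """Hash every row of X: (N, D) into a bucket keyed by a sign signature."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))         # random hyperplanes
    codes = (X @ planes > 0).astype(np.uint8)               # sign signatures, one per vector
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[code.tobytes()].append(i)                   # group indices by signature
    return buckets, planes

def approx_neighbors(X, i, buckets, planes, k=10):
    """Approximate k nearest neighbors of X[i] using only its own bucket."""
    code = (X[i] @ planes > 0).astype(np.uint8).tobytes()
    cand = [j for j in buckets[code] if j != i]              # same-bucket candidates only
    d2 = ((X[cand] - X[i]) ** 2).sum(-1)
    return [cand[j] for j in np.argsort(d2)[:k]]
```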

2.3. Manifold Regularization

Manifold regularization is a data dependent regularization framework that is capable of exploiting the existing manifold based structure of the data distribution.


It is introduced in [11], where the authors also present manifold-extended versions of the regularized least squares and support vector machine algorithms for a number of text and image classification tasks.

There have been several recent efforts to incorporate some form of manifold based learning into a variety of machine learning tasks. In [40], the authors investigated manifold regularization in single hidden layer multilayer perceptrons for a phone classification task. Manifold learning based semi-supervised embedding has been applied to deep learning for a hand-written character recognition task in [41]. Manifold regularized single-layer neural networks have been used for image classification in [42]. Despite these efforts to apply manifold regularization in various application domains, similar efforts are not known for training deep models in ASR.

3. MANIFOLD REGULARIZED DEEP NEURAL NETWORKS

The multi-layered nonlinear structure of DNNs makes them capable of learning the highly nonlinear relationships between speech feature vectors. This section proposes an extension of EBP based DNN training by incorporating the DML based constraints discussed in Section 2.2.1 as a regularization term. The algorithm formulation is provided in Section 3.1. The proposed training procedure emphasizes local relationships among speech feature vectors along a low dimensional manifold while optimizing network parameters. To support this claim, empirical evidence is provided in Section 3.3 that a network trained with manifold constraints has a higher capability of preserving the local relationships between the feature vectors than one trained without these constraints.

3.1. Manifold Regularized Training

This work incorporates locality and geometrical relationships preserving manifold constraints as a regularization term in the objective criterion of a deep network. These constraints are derived from a graph characterizing the underlying manifold of speech feature vectors in the input space. The objective criterion for a MRDNN network producing a mapping f_{mrdnn} : x → z is given as follows,

F(W; Z) = \sum_{i=1}^{N} \left( \frac{1}{N} V(x_i, t_i, f_{mrdnn}) + \gamma_1 \|W\|^2 + \frac{\gamma_2}{k^2} \sum_{j=1}^{k} \|z_i - z_j\|^2 \omega^{int}_{ij} \right),    (8)

where V(x_i, t_i, f_{mrdnn}) is the loss between a target vector t_i and an output vector z_i given an input vector x_i; V(·) is taken to be the cross-entropy loss in this work. W is the matrix representing the weights of the network. The second term in Eq. (8) is the L2-norm regularization penalty on the network weights; this helps in maintaining smoothness for the assumed continuity of the source space. The effect of this regularizer is controlled by the multiplier γ_1. The third term in Eq. (8) represents the manifold learning based locality preservation constraints as defined in Section 2.2.1. Note that only the constraints modeled by the intrinsic graph, G_{int} = {X, Ω_{int}}, are included, and the constraints modeled by the penalty graph, G_{pen} = {X, Ω_{pen}}, are ignored. This is further discussed in Section 3.2. The scalar k denotes the number of nearest neighbors connected to each feature vector, and ω^{int}_{ij} refers to the weights on the edges of the intrinsic graph as defined in Eq. (4). The relative importance of the mapping along the manifold is controlled by the regularization coefficient γ_2.

This framework assumes that the data distribution is supported on a low dimensional manifold and corresponds to a data-dependent regularization that exploits the underlying manifold domain geometry of the input distribution. By imposing manifold based locality preserving constraints on the network outputs in Eq. (8), this procedure encourages a mapping where relationships along the manifold are preserved and different speech classes are well separated. The manifold regularization term penalizes the objective criterion for vectors that belong to the same neighborhood in the input space but have become separated in the output space after projection.
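For a single input vector, the third term of Eq. (8) reduces to a weighted sum of squared output distances to its k same-class neighbors, scaled by 1/k^2, as in the following sketch (illustrative, not the authors' code).

```python
import numpy as np

def manifold_penalty(z_i, Z_nbrs, w_int, k):
    """Third term of Eq. (8) for one input vector.

    z_i: (C,) network output for x_i; Z_nbrs: (k, C) outputs for its k
    same-class neighbors; w_int: (k,) intrinsic affinities omega^int_ij.
    """
    d2 = ((Z_nbrs - z_i) ** 2).sum(axis=1)       # ||z_i - z_j||^2 for each neighbor
    return (w_int * d2).sum() / (k ** 2)         # (gamma_2 multiplier applied by the caller)
```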

The objective criterion given in Eq. (8) has a very similar form to that of a standard DNN given in Eq. (1). The weights of the network are optimized using EBP and gradient descent,

\nabla_W F(W; Z) = \sum_{i=1}^{N} \frac{\partial F(W; Z)}{\partial z_i} \frac{\partial z_i}{\partial W},    (9)

where \nabla_W F(W; Z) is the gradient of the objective criterion with respect to the DNN weight matrix W. Using the same nomenclature as defined in Eq. (3), the gradient with respect to the weights in the last layer is calculated as

\nabla_{w^L_{n,p}} F(W; Z) = -\Delta^L_{n,p} - 2\gamma_1 w^L_{n,p} - \frac{2\gamma_2}{k^2} \sum_{j=1}^{k} \omega_{ij} \left( z_{(i),p} - z_{(j),p} \right) \left( \frac{\partial z_{(i),p}}{\partial w^L_{n,p}} - \frac{\partial z_{(j),p}}{\partial w^L_{n,p}} \right),    (10)

where z_{(i),p} refers to the activation of the pth unit in the output layer when the input vector is x_i. The error signal, \Delta^L_{n,p}, is the same as the one specified in Eq. (3).
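A hand-derived gradient such as Eq. (10) is easy to get wrong; one practical check, not taken from the paper, is to compare it against a central finite-difference approximation of the full objective on a tiny network, as sketched below for any scalar objective of the weights.

```python
import numpy as np

def numerical_grad(objective, W, eps=1e-5):
    """Central-difference approximation of dF/dW for a scalar objective(W).

    Useful for spot-checking an analytic gradient such as Eq. (10) before
    training at scale. W must be a float array; it is restored in place.
    """
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        orig = W[idx]
        W[idx] = orig + eps
        f_plus = objective(W)
        W[idx] = orig - eps
        f_minus = objective(W)
        W[idx] = orig                              # restore the original weight
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        it.iternext()
    return grad
```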

3.2. Architecture and Implementation

The computation of the gradient in Eq. (10) depends not only on the input vector, x_i, but also on its neighbors, x_j, j = 1, ..., k, that belong to the same class. Therefore, MRDNN training can be broken down into two components. The first is the standard DNN component that minimizes a cross-entropy based error with respect to given targets. The second is the manifold regularization based component that focuses on penalizing a criterion related to the preservation of neighborhood relationships.


An architecture for manifold regularized DNN training is shown in Figure 1. For each input feature vector, x_i, k of its nearest neighbors, x_j, j = 1, ..., k, belonging to the same class are selected. These k + 1 vectors are forward propagated through the network. This can be visualized as making k + 1 separate copies of the DNN: one for the input vector and the remaining k for its neighbors. The DNN corresponding to the input vector, x_i, is trained to minimize the cross-entropy error with respect to a given target, t_i. Each copy-DNN corresponding to one of the selected neighbors, x_j, is trained to minimize a function of the distance between its output, z_j, and that corresponding to the input vector, z_i. Note that the weights of all these copies are shared, and only an average gradient is used for the weight update. This algorithm is then extended for use in mini-batch training.

Fig. 1. Illustration of the MRDNN architecture. For each input vector, x_i, k of its neighbors, x_j, belonging to the same class are selected and forward propagated through copies of the DNN with shared weights W. For the neighbors, the target is set to be the output vector corresponding to x_i, and the cross-entropy error for x_i and the manifold error for the neighbors are combined in error back-propagation. The scalar ω^{int}_{ij} represents the intrinsic affinity weights as defined in Eq. (4).
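The forward-and-backward pass sketched below compresses this architecture to a single softmax layer so that the shared-weight bookkeeping stays visible; it is an illustrative assumption, not the authors' multi-layer implementation. The input vector contributes a cross-entropy gradient, each neighbor copy contributes a gradient that pulls its output towards z_i (its target in Figure 1), and all copies update the same weight matrix W once.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def mrdnn_step(W, x_i, t_i, X_nbrs, w_int, gamma1=1e-4, gamma2=1e-3, eta=1e-3):
    """One MRDNN-style update for a single softmax layer with shared weights W.

    x_i: (D,) input vector, t_i: (C,) one-hot target,
    X_nbrs: (k, D) same-class neighbors, w_int: (k,) intrinsic affinities.
    """
    k = X_nbrs.shape[0]
    z_i = softmax(x_i @ W)                      # output of the input-vector copy
    Z_n = softmax(X_nbrs @ W)                   # outputs of the k neighbor copies

    # Cross-entropy gradient for the input-vector copy.
    g_ce = np.outer(x_i, z_i - t_i)

    # Manifold gradient: each neighbor copy is trained towards the fixed target z_i,
    # penalizing (w_ij / k^2) * ||z_i - z_j||^2 (third term of Eq. (8)).
    g_man = np.zeros_like(W)
    for j in range(k):
        diff = Z_n[j] - z_i                                   # (z_j - z_i)
        jac = np.diag(Z_n[j]) - np.outer(Z_n[j], Z_n[j])      # softmax Jacobian for copy j
        g_man += (2.0 * w_int[j] / k ** 2) * np.outer(X_nbrs[j], jac @ diff)

    grad = g_ce + 2.0 * gamma1 * W + gamma2 * g_man
    return W - eta * grad                       # shared weights updated once
```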

It should be clear from this discussion that the relationships between a given vector and its neighbors play an important role in the weight update using EBP. Given the assumption of a smooth manifold, the graph based manifold regularization is equivalent to penalizing rapid changes of the classification function. This results in smoothing of the gradient learning curve, which leads to robust computation of the weight updates.

In applications of the DML framework to feature space transformations, the inclusion of both the intrinsic and penalty graph terms, as shown in Eq. (7), is found to be important [10, 34, 43]. For this reason, experiments in the previous work included both of these terms [22]. However, in initial experiments performed on the corpora described in Section 4, the gains achieved by including the penalty graph were found to be inconsistent across datasets and different noise conditions. This may be because adding an additional term for discriminating between classes of speech feature vectors might not always impact the performance of DNNs, since DNNs are inherently powerful discriminative models. Therefore, the penalty graph based term is not included in any of the experiments presented in this work, and the manifold learning based expression given in Eq. (8) consists of the intrinsic component only.

3.3. Preserving Neighborhood Relationships

This section describes a study conducted to characterize the effect of applying manifold based constraints on the behavior of a deep network. This is an attempt to quantify how neighborhood relationships between feature vectors are preserved within the hidden layers of a manifold regularized DNN. To this end, a measure referred to as the contraction ratio is investigated. A form of this measure is presented in [25].

In this work, the contraction ratio is defined as the ratio of the distance between two feature vectors at the output of the first hidden layer to that at the input of the network. The average contraction ratio between a feature vector and its neighbors can be seen as a measure of locality preservation and compactness of its neighborhood. Thus, the evolution of the average contraction ratio for a set of vectors as a function of distances between them can be used to characterize the overall locality preserving behavior of a network. To this end, a subset of feature vectors not seen during training is selected. The distribution of pair-wise distances for the selected vectors is used to identify a number of bins. The edges of these bins are treated as a range of radii around the feature vectors. An average contraction ratio is computed as a function of radius in a range r_1 < r ≤ r_2 over all the selected vectors and their neighbors falling in that range as

CR(r_1, r_2) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \in \Phi} \frac{1}{k_\Phi} \cdot \frac{\|z^1_i - z^1_j\|^2}{\|x_i - x_j\|^2},    (11)

where, for a given vector x_i, Φ represents the set of vectors x_j such that r_1^2 < \|x_i - x_j\|^2 \leq r_2^2, and k_Φ is the number of such vectors. z^1_i represents the output of the first layer corresponding to vector x_i at the input. It follows that a smaller contraction ratio represents a more compact neighborhood.
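Eq. (11) can be computed directly from a held-out set of input vectors X and their first-hidden-layer outputs Z1, as in this sketch (illustrative, not the authors' code).

```python
import numpy as np

def contraction_ratio(X, Z1, r1, r2):
    """Average contraction ratio CR(r1, r2) of Eq. (11).

    X: (N, D) held-out input vectors; Z1: (N, H) corresponding outputs of the
    first hidden layer; (r1, r2] is the input-space radius bin.
    """
    N = X.shape[0]
    total = 0.0
    for i in range(N):
        d_in = ((X - X[i]) ** 2).sum(axis=1)              # ||x_i - x_j||^2 for all j
        phi = np.where((d_in > r1 ** 2) & (d_in <= r2 ** 2))[0]
        phi = phi[phi != i]                               # exclude the vector itself
        if phi.size == 0:
            continue                                      # no neighbors in this bin
        d_out = ((Z1[phi] - Z1[i]) ** 2).sum(axis=1)      # ||z1_i - z1_j||^2
        total += (d_out / d_in[phi]).mean()               # (1/k_Phi) * sum over Phi
    return total / N                                      # outer average over the N vectors
```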

Figure 2 displays the contraction ratio of the output to input neighborhood sizes relative to the radius of the neighborhood in the input space for the DNN and MRDNN systems. The average contraction ratios between the input and the first layer's output features are plotted as functions of the median radius of the bins. It can be seen from the plots that the features obtained from a MRDNN represent a more compact neighborhood than those obtained from a DNN. Therefore it can be concluded that the hidden layers of a MRDNN are able to learn the manifold based local geometrical representation of the feature space. It should also be noted that the contraction ratio increases with the radius, indicating that the effect of manifold preservation diminishes as one moves farther from a given vector.


Fig. 2. Contraction ratio of the output to input neighborhood sizes as a function of input neighborhood radius for the DNN and MRDNN systems.

This is in agreement with the local invariance assumption over low-dimensional manifolds.

4. EXPERIMENTAL STUDY

This section presents the experimental study conducted to evaluate the effectiveness of the proposed MRDNNs in terms of ASR WER. The ASR performance of a MRDNN is presented and compared with that of DNNs without manifold regularization and traditional GMM-HMM systems. The experiments in this work are conducted on two separate speech-in-noise tasks, namely Aurora-2, which is a connected digit task, and Aurora-4, which is a read news LVCSR task. The speech tasks and the system setup are described in Section 4.1. The results of the experiments on the Aurora-2 task are presented in Section 4.2. The results and comparative performance of the proposed technique on the Aurora-4 task are presented in Section 4.3. In addition, experiments on the clean-condition training sets of the Aurora-2 and Aurora-4 are also performed. The results for these experiments are provided in Section 4.4.

4.1. Task Domain and System Configuration

The first dataset used in this work is the Aurora-2 connected digit speech-in-noise corpus. In most of the experiments, the Aurora-2 mixed-condition training set is used for training [23]. Some experiments have also used the clean training set. Both the clean and mixed-conditions training sets contain a total of 8440 utterances by 55 male and 55 female speakers. In the mixed-conditions set, the utterances are corrupted by adding four different noise types to the clean utterances. The conventional GMM-HMM ASR system is configured using the standard configuration specified in [23]. This corresponds to using 10 word-based continuous density HMMs (CDHMMs) for digits 0 to 9 with 16 states per word-model, and additional models with 3 states for silence and 1 state for short pause. In total, there are 180 CDHMM states, each modeled by a mixture of 3 Gaussians. During the test phase, five different subsets are used corresponding to uncorrupted clean utterances and utterances corrupted with four different noise types, namely subway, babble, car and exhibition hall, at signal-to-noise ratios (SNRs) ranging from 5 to 20 dB. There are 1001 utterances in each subset. The ASR performance obtained for the GMM-HMM system configuration agrees with that reported elsewhere [23].

The second dataset used in this work is the Aurora-4 read newspaper speech-in-noise corpus. This corpus is created by adding noise to the Wall Street Journal corpus [24]. Aurora-4 represents an LVCSR task with a vocabulary size of 5000 words. This work uses the standard 16 kHz mixed-conditions training set of Aurora-4 [24]. It consists of 7138 noisy utterances from a total of 83 male and female speakers, corresponding to about 14 hours of speech. One half of the utterances in the mixed-conditions training set are recorded with a primary Sennheiser microphone and the other half with a secondary microphone, which introduces the effect of the transmission channel. Both halves contain a mixture of uncorrupted clean utterances and noise corrupted utterances with SNR levels varying from 10 to 20 dB in 6 different noise conditions (babble, street traffic, train station, car, restaurant and airport). A bi-gram language model is used with a perplexity of 147. Context-dependent cross-word triphone CDHMM models are used for configuring the ASR system. There are a total of 3202 senones and each state is modeled by 16 Gaussian components. Similar to the training set, the test set is recorded with the primary and secondary microphones. Each subset is further divided into seven subsets, where one subset is clean speech data and the remaining six are obtained by randomly adding the same six noise types as in training at SNR levels ranging from 5 to 15 dB. Thus, there are a total of 14 subsets. Each subset contains 330 utterances from 8 speakers.

It should be noted that both of the corpora used in this work represent simulated speech-in-noise tasks. These are created by adding noises from multiple sources to speech utterances spoken in a quiet environment. For this reason, one should be careful about generalizing these results to other speech-in-noise tasks.

For both corpora, the baseline GMM-HMM ASR systems are configured using 12-dimensional static Mel-frequency cepstrum coefficient (MFCC) features augmented by normalized log energy, difference cepstrum, and second difference cepstrum, resulting in 39-dimensional vectors. The ASR performance is reported in terms of WERs. These GMM-HMM systems are also used for generating the context-dependent state alignments. These alignments are then used as the target labels for deep network training as described below.


Collectively, the term "deep network" is used to refer to both the DNN and MRDNN configurations. Both of these configurations include L2-norm regularization applied to the weights of the networks. The ASR performance of these models is evaluated in a tandem setup where a bottleneck deep network is used as a feature extractor for a GMM-HMM system. The deep networks take 429-dimensional input vectors that are created by concatenating 11 context frames of 12-dimensional MFCC features augmented with log energy, and first and second order differences. The number of units in the output layer is equal to the number of CDHMM states. The class labels or targets at the output are set to the CDHMM states obtained by a single-pass forced alignment using the baseline GMM-HMM system. The regularization coefficient γ_1 for the L2 weight decay is set to 0.0001. In the MRDNN setup, the number of nearest neighbors, k, is set to 10, and γ_2 is set to 0.001. The experimental study performed in this work is limited to tandem DNN-HMM scenarios; while the manifold regularization technique is not evaluated for hybrid DNN-HMM systems, it is expected that the results reported here should also generalize to DNN training for hybrid DNN-HMM configurations. This is a topic for future work.

For the Aurora-2 experiments, the deep networks have five hidden layers. The first four hidden layers have 1024 hidden units each, and the fifth layer is a bottleneck layer with 40 hidden units. For the Aurora-4 experiments, larger networks are used. There are seven hidden layers; the first six hidden layers have 2048 hidden units each, and the seventh layer is a bottleneck layer with 40 hidden units. This larger network for Aurora-4 is in line with other recent work [44, 45]. The hidden units use ReLUs as activation functions. The soft-max nonlinearity is used in the output layer with the cross-entropy loss between the outputs and targets of the network as the error [26, 27]. After the EBP training, the 40-dimensional output features are taken from the bottleneck layer and de-correlated using principal component analysis (PCA). Only features corresponding to the top 39 components are kept during the PCA transformation to match the dimensionality of the baseline system. The resultant features are used to train a GMM-HMM ASR system using a maximum-likelihood criterion. Although some might argue that PCA is not necessary after the compressed bottleneck output, in this work a performance gain of 2-3% absolute is observed with PCA on the same bottleneck features for different noise and channel conditions. No difference in performance is observed for the clean data.
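The tandem feature pipeline just described can be summarized by the following sketch, where bottleneck_forward is a placeholder (an assumption, not an API from the paper) for the trained network truncated at its 40-unit bottleneck layer; PCA keeps the top 39 components to match the 39-dimensional baseline front end.

```python
import numpy as np

def fit_pca(F, n_components=39):
    """Estimate a PCA decorrelation from bottleneck features F: (N, 40)."""
    mean = F.mean(axis=0)
    cov = np.cov(F - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1][:n_components]    # keep top components by variance
    return mean, eigvec[:, order]

def tandem_features(X, bottleneck_forward, mean, proj):
    """Map spliced MFCC inputs X to decorrelated 39-dim tandem features."""
    F = bottleneck_forward(X)                          # (N, 40) bottleneck activations
    return (F - mean) @ proj                           # (N, 39) features for the GMM-HMM
```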

4.2. Results for the mixed-noise Aurora-2 Spoken Digits Corpus

The ASR WERs for the Aurora-2 speech-in-noise task are presented in Table 1. The models are trained on the mixed-noise set.

Table 1. WERs for mixed noise training and noisy testing on the Aurora-2 speech corpus for the GMM-HMM, DNN and MRDNN systems. For each noise type and SNR level, the best performance is obtained by the MRDNN system.

Noise        Technique    SNR 20 dB   15 dB   10 dB   5 dB
Subway       GMM-HMM      2.99        4.00    6.21    11.89
             DNN          1.19        1.69    2.95    6.02
             MRDNN        0.91        1.60    2.39    5.67
Babble       GMM-HMM      3.31        4.37    7.97    18.06
             DNN          1.15        1.42    2.81    7.38
             MRDNN        0.83        1.26    2.27    7.05
Car          GMM-HMM      2.77        3.36    5.45    12.31
             DNN          1.05        1.85    2.98    6.92
             MRDNN        0.84        1.37    2.59    6.38
Exhibition   GMM-HMM      3.34        3.83    6.64    12.72
             DNN          1.23        1.54    3.30    7.87
             MRDNN        0.96        1.44    2.48    7.24

The test results in Table 1 are grouped into four blocks, each corresponding to a different noisy subset of the Aurora-2 test set. Each noisy subset is further divided into four subsets corresponding to noise corrupted utterances at 20 to 5 dB SNR. For each noise condition, the ASR results for features obtained from three different techniques are compared. The first row of each block, labeled 'GMM-HMM', contains the ASR WER for the baseline GMM-HMM system trained using MFCC features appended with first and second order differences. The second row, labeled 'DNN', displays results for the bottleneck features taken from the DNN described in Section 2.1. The final row, labeled 'MRDNN', presents the ASR WER results for the bottleneck features obtained from the MRDNN described in Section 3.1. In all cases, the GMM-HMM and DNN configurations described in Section 4.1 are used. The initial learning rates for both systems are set to 0.001 and decreased exponentially with each epoch. Each system is trained for 40 epochs over the training set.

Two main observations can be made from the results presented in Table 1. First, both the DNN and MRDNN provide significant reductions in ASR WER over the GMM-HMM system. The second observation can be made by comparing the ASR performance of the DNN and MRDNN systems. It is evident from the results presented that the features derived from a manifold regularized network provide consistent gains over those derived from a network that is regularized only with the L2 weight decay. The maximum relative WER gain obtained by using MRDNN over DNN is 37%.

4.3. Results for the mixed-noise Aurora-4 Read News Corpus

The ASR WERs for the Aurora-4 task are given in Table 2. All three acoustic models in the table are trained on the Aurora-4 mixed-conditions set.


Table 2. WERs for mixed conditions training on the Aurora-4 task for the GMM-HMM, DNN, and MRDNN systems.

            Clean    Noise    Channel    Noise + Channel
GMM-HMM     13.02    18.62    20.27      30.11
DNN          5.91    10.32    11.35      22.78
MRDNN        5.30     9.50    10.11      21.90

The table lists ASR WERs for four sets, namely clean, additive noise (noise), channel distortion (channel), and additive noise combined with channel distortion (noise + channel). These sets are obtained by grouping together the fourteen subsets described in Section 4.1. The first row of the table, labeled 'GMM-HMM', provides results for the traditional GMM-HMM system trained using MFCC features appended with first and second order differences. The second row, labeled 'DNN', presents WERs corresponding to the features derived from the baseline L2 regularized DNN system. The ASR performance of the baseline systems reported in this work agrees with that reported elsewhere [44, 45]. The baseline set-up is consistent with that specified for the Aurora-4 task in order to be able to make comparisons with other results obtained in the literature for that task. The third row, labeled 'MRDNN', displays the WERs for features obtained from a MRDNN. Similar to the case for Aurora-2, L2 regularization with the coefficient γ_1 set to 0.0001 is used for training both the DNN and MRDNN. The initial learning rate is set to 0.003 and reduced exponentially for 40 epochs, when the training is stopped.

Similar to the results for the Aurora-2 corpus, two main observations can be made by comparing the performance of the presented systems. First, both the DNN and MRDNN provide large reductions in WERs over the GMM-HMM for all conditions. Second, the trend of WER reductions from using MRDNN over DNN is visible here as well. The maximum relative WER reduction obtained by using MRDNN over DNN is 10.3%.

The relative gain in ASR performance for the Aurora-4 corpus is less than that for the Aurora-2 corpus. This might be due to the fact that the performance of manifold learning based algorithms is highly susceptible to the presence of noise. This increased sensitivity is linked to the Gaussian kernel scale factor, ρ, defined in Eq. (4) [43]. In this work, ρ is set to 1000 for all the experiments. This value is taken from previous work, where it was empirically derived on a subset of Aurora-2 for a different task [34, 43]. While it is easy to analyze the effect of noise on the performance of the discussed models at each SNR level for the Aurora-2 corpus, it is difficult to do the same for the Aurora-4 corpus because of the way the corpus is organized. The Aurora-4 corpus only provides a mixture of utterances at different SNR levels for each noise type. Therefore, the average WER given for the Aurora-4 corpus might be affected by the dependence of ASR performance on the choice of ρ and SNR levels, and a better tuned model is expected to provide improved performance. This is further discussed in Section 5.2.

Note that other techniques that have been reported to provide gains for DNNs were also investigated in this work. In particular, the use of RBM based generative pre-training was investigated to initialize the weights of the DNN system. Surprisingly, however, the generative pre-training mechanism did not lead to any reductions in ASR WER on these corpora. This is in agreement with other recent studies in the literature, which have suggested that the use of ReLU hidden units on a relatively well-behaved corpus with enough training data obviates the need for a pre-training mechanism [9].

4.4. Results for Clean-condition Training

In addition to the experiments on the mixed-conditions noisy datasets discussed in Sections 4.2 and 4.3, experiments are also conducted to evaluate the performance of MRDNN for matched clean training and testing scenarios. To this end, the clean training sets of the Aurora-2 and Aurora-4 corpora are used for training the deep networks as well as for building the manifold based neighborhood graphs [23, 24]. The results are presented in Table 3. It can be observed from the results presented in the table that MRDNN provides a 21.05% relative improvement in WER over DNN for the Aurora-2 set and a 14.43% relative improvement for the Aurora-4 corpus.

Table 3. WERs for clean training and clean testing on the Aurora-2 and Aurora-4 speech corpora for the GMM-HMM, DNN, and MRDNN models. The last column lists the WER improvement (%) of MRDNN relative to DNN.

            GMM-HMM    DNN     MRDNN    % Imp.
Aurora-2     0.93      0.57    0.45     21.05
Aurora-4    11.87      8.11    6.94     14.43

The ASR performance results presented for the Aurora-2 and Aurora-4 corpora in Sections 4.2, 4.3 and 4.4 demonstrate the effectiveness of the proposed manifold regularized training for deep networks. While the conventional pre-training and regularization approaches did not lead to any performance gains, MRDNN training provided consistent reductions in WERs. The primary reason for this performance gain is the ability of MRDNNs to learn and preserve the underlying low-dimensional manifold based relationships between feature vectors. This has been demonstrated with empirical evidence in Section 3.3. Therefore, the proposed MRDNN technique provides a compelling regularization scheme for training deep networks.

5. DISCUSSION AND ISSUES

There are a number of factors that can have an impact on the performance and application of MRDNN to ASR tasks.


This section highlights some of the factors and issues affecting these techniques.

5.1. Computational Complexity

Though the inclusion of a manifold regularization factor in DNN training leads to encouraging gains in WER, it does so at the cost of additional computational complexity for parameter estimation. This additional cost for MRDNN training is two-fold. The first is the cost associated with calculating the pair-wise distances for populating the intrinsic affinity matrix, Ω_{int}. As discussed in Section 2.2.2 and [36, 37], this computational complexity can be managed by using locality sensitive hashing for approximate nearest neighbor search without sacrificing ASR performance.

The second source of this additional complexity is the inclusion of k neighbors for each feature vector during forward and back propagation. This results in an increase in the computational cost by a factor of k. This cost can be managed by various parallelization techniques [3] and becomes negligible in light of the massively parallel architectures of modern processing units. This added computational cost is only relevant during the training of the networks and has no impact during the test phase or when data is transformed using a trained network.

The networks in this work are trained on Nvidia K20 graphics boards using tools developed on Python based frameworks such as numpy, gnumpy and cudamat [46, 47]. For DNN training, each epoch over the Aurora-2 set took 240 seconds and each epoch over the Aurora-4 set took 480 seconds. In comparison, MRDNN training took 1220 seconds per epoch over the Aurora-2 set and 3000 seconds per epoch over the Aurora-4 set.

5.2. Effect of Noise

In previous work, the authors have demonstrated that manifold learning techniques are very sensitive to the presence of noise [43]. This sensitivity can be traced to the Gaussian kernel scale factor, ρ, used for defining the local affinity matrices in Eq. (4). This argument might apply to MRDNN training as well. Therefore, the performance of a MRDNN might be affected by the presence of noise. This is visible to some extent in the results presented in Table 1, where the WER gains obtained by using MRDNN over DNN vary with the SNR level. On the other hand, the Aurora-4 test corpus contains a mixture of noise corrupted utterances at different SNR levels for each noise type. Therefore, only an average over all SNR levels per noise type is presented.

There is a need to conduct extensive experiments to investigate this effect, as was done in [43]. However, if these models are affected by the presence of noise in a similar manner, additional gains could be derived by designing a method for building the intrinsic graphs separately for different noise conditions.

Table 4. Average WERs for mixed noise training and noisy testing on the Aurora-2 speech corpus for the DNN, MRDNN and MRDNN 10 models.

                        SNR (dB)
             clean    20      15      10      5
DNN          0.91     1.16    1.63    3.01    7.03
MRDNN        0.66     0.88    1.42    2.43    6.59
MRDNN 10     0.62     0.81    1.37    2.46    6.45

During the DNN training, an automated algorithm could select a graph that best matches the estimated noise conditions associated with a given utterance. Training in this way could result in a MRDNN that is able to provide further gains in ASR performance in various noise conditions.

5.3. Alternating Manifold Regularized Training

Section 4 has demonstrated reductions in ASR WER obtained by forcing the output feature vectors to conform to the local neighborhood relationships present in the input data. This is achieved by applying the underlying input manifold based constraints to the DNN outputs throughout the training. There have also been studies in the literature where manifold regularization is used only for the first few iterations of model training. For instance, the authors of [48] applied manifold regularization to multi-task learning and argued that optimizing deep networks by alternating between training with and without manifold based constraints can increase the generalization capacity of the model.

Motivated by these efforts, this section investigates a scenario in which the manifold regularization based constraints are used only for the first few epochs of training. All layers of the networks are randomly initialized and then trained with the manifold based constraints for the first 10 epochs. The resultant network is then further trained for 20 epochs using standard EBP without manifold based regularization. Note that, contrary to the pre-training approaches in deep learning, this is not a greedy layer-by-layer training.
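The schedule just described can be written compactly as below; mrdnn_epoch and dnn_epoch are hypothetical caller-supplied functions (not from the paper) performing one training epoch with and without the manifold term of Eq. (8), respectively.

```python
def train_mrdnn_10(W, data, graph, mrdnn_epoch, dnn_epoch,
                   n_manifold_epochs=10, n_plain_epochs=20):
    """Staged schedule: manifold-regularized epochs first, then standard EBP.

    `mrdnn_epoch(W, data, graph)` and `dnn_epoch(W, data)` are hypothetical
    single-epoch training functions supplied by the caller.
    """
    for _ in range(n_manifold_epochs):
        W = mrdnn_epoch(W, data, graph)    # epochs with the manifold constraints
    for _ in range(n_plain_epochs):
        W = dnn_epoch(W, data)             # remaining epochs without them
    return W
```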

ASR results for these experiments are given in Table 4 for the Aurora-2 dataset. For brevity, the table only presents the ASR WERs as an average over the four noise types described in Table 1 at each SNR level. In addition to the average WERs for the DNN and MRDNN, a row labeled 'MRDNN 10' is given. This row refers to the case where manifold regularization is used only for the first 10 training epochs.

A number of observations can be made from the results presented in Table 4. First, both the MRDNN and MRDNN 10 training scenarios improve ASR WERs when compared to standard DNN training. Second, MRDNN 10 training provides further reductions in ASR WERs for the Aurora-2 set. Experiments on the clean training and clean testing set of the Aurora-2 corpus also yield an interesting comparison.


In this case, MRDNN 10 training improved ASR performance to 0.39% WER, compared to 0.45% for MRDNN and 0.57% for DNN training. That translates to a 31.5% relative gain in ASR WER over the DNNs.

The WER reductions achieved in these experiments are encouraging. Furthermore, this approach, in which manifold constraints are applied only for the first few epochs, might lead to a more efficient manifold regularized training procedure, because manifold regularized training has higher computational complexity than standard DNN training (as discussed in Section 5.1). However, unlike [48], this work applied only one cycle of manifold constrained-unconstrained training. It should be noted that these are preliminary experiments. Similar gains are not seen when these techniques are applied to the Aurora-4 dataset. Therefore, further investigation is required before making any substantial conclusions.

6. CONCLUSIONS AND FUTURE WORK

This paper has presented a framework for regularizing the training of DNNs using discriminative manifold learning based locality preserving constraints. The manifold based constraints are derived from a graph characterizing the underlying manifold of the speech feature vectors. Empirical evidence has also been provided showing that the hidden layers of a MRDNN are able to learn the local neighborhood relationships between feature vectors. It has also been conjectured that the inclusion of a manifold regularization term in the objective criterion of DNNs results in a robust computation of the error gradient and weight updates.

It has been shown through experimental results that MRDNNs result in consistent reductions in ASR WERs that range up to 37% relative to standard DNNs on the Aurora-2 corpus. For the Aurora-4 corpus, the WER gains range up to 10% relative for the mixed-noise training and 14.43% for the clean training task. Therefore, the proposed manifold regularization based training can be seen as a compelling regularization scheme.

The studies presented here open a number of interesting possibilities. One would expect MRDNN training to have a similar impact in hybrid DNN-HMM systems. The application of these techniques in semi-supervised scenarios is also a topic of interest. To this end, the manifold regularization based techniques could be used for learning from a large unlabeled dataset, followed by fine-tuning on a smaller labeled set.

Acknowledgements

The authors thank the Nuance Foundation for providing financial support for this project. The experiments were conducted on the Calcul-Quebec and Compute-Canada supercomputing clusters. The authors are thankful to the consortium for providing the access and support.

7. REFERENCES

[1] Navdeep Jaitly, Patrick Nguyen, Andrew Senior, and Vincent Vanhoucke, “An application of pretrained deep neural networks to large vocabulary conversational speech recognition,” in Interspeech, 2012, pp. 3–6.

[2] Geoffrey Hinton, Li Deng, and Dong Yu, “Deep neural networks for acoustic modeling in speech recognition,” IEEE Signal Process. Mag., pp. 1–27, 2012.

[3] Tara N. Sainath, Brian Kingsbury, Bhuvana Ramabhadran, Petr Fousek, Petr Novak, and Abdel-Rahman Mohamed, “Making deep belief networks effective for large vocabulary continuous speech recognition,” in IEEE Work. Autom. Speech Recognit. Underst., Dec. 2011, pp. 30–35.

[4] Z. Tuske, Martin Sundermeyer, R. Schluter, and Hermann Ney, “Context-dependent MLPs for LVCSR: TANDEM, hybrid or both,” in Interspeech, Portland, OR, USA, 2012.

[5] Yoshua Bengio, “Deep learning of representations: Looking forward,” Stat. Lang. Speech Process., pp. 1–37, 2013.

[6] Li Deng, Geoffrey Hinton, and Brian Kingsbury, “New types of deep neural network learning for speech recognition and related applications: An overview,” in Proc. ICASSP, 2013.

[7] G. E. Dahl, T. N. Sainath, and G. E. Hinton, “Improving deep neural networks for LVCSR using rectified linear units and dropout,” in ICASSP, 2013.

[8] M. D. Zeiler, M. Ranzato, R. Monga, and M. Mao, “On rectified linear units for speech processing,” pp. 3–7, 2013.

[9] Li Deng, Jinyu Li, J. T. Huang, Kaisheng Yao, and Dong Yu, “Recent advances in deep learning for speech research at Microsoft,” in ICASSP, 2013, pp. 0–4.

[10] Vikrant Singh Tomar and Richard C. Rose, “A family of discriminative manifold learning algorithms and their application to speech recognition,” IEEE/ACM Trans. Audio, Speech Lang. Process., vol. 22, no. 1, pp. 161–171, 2014.

[11] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn., vol. 1, pp. 1–36, 2006.


[12] Xiaofei He and Partha Niyogi, “Locality preserving projections,” in Neural Inf. Process. Syst., 2002.

[13] Aren Jansen and Partha Niyogi, “Intrinsic Fourier analysis on the manifold of speech sounds,” in ICASSP IEEE Int. Conf. Acoust. Speech Signal Process., 2006.

[14] G. E. Hinton, Simon Osindero, and Y. W. Teh, “A fast learning algorithm for deep belief nets,” Neural Comput., 2006.

[15] Dong Yu, Li Deng, and George E. Dahl, “Roles of pre-training and fine-tuning in context-dependent DBN-HMMs for real-world speech recognition,” in NIPS Work. Deep Learn. Unsupervised Featur. Learn., 2010.

[16] G. E. Dahl, Dong Yu, Li Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large vocabulary speech recognition,” IEEE Trans. Audio, Speech Lang. Process., pp. 1–13, 2012.

[17] Frank Seide, Gang Li, Xie Chen, and Dong Yu, “Feature engineering in context-dependent deep neural networks for conversational speech transcription,” in IEEE Work. Autom. Speech Recognit. Underst., Dec. 2011, pp. 24–29.

[18] Pascal Vincent, Hugo Larochelle, and Isabelle Lajoie, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, pp. 3371–3408, 2010.

[19] Hugo Larochelle and Yoshua Bengio, “Exploring strategies for training deep neural networks,” J. Mach. Learn. Res., vol. 1, pp. 1–40, 2009.

[20] F. Grezl and M. Karafiat, “Probabilistic and bottle-neck features for LVCSR of meetings,” in Int. Conf. Acoust. Speech, Signal Process., 2007.

[21] Hynek Hermansky, Daniel Ellis, and Sangita Sharma, “Tandem connectionist feature extraction for conventional HMM systems,” in ICASSP, 2000, pp. 1–4.

[22] Vikrant Singh Tomar and Richard C. Rose, “Manifold regularized deep neural networks,” in Interspeech, 2014.

[23] H. G. Hirsch and David Pearce, “The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions,” in Autom. Speech Recognit. Challenges next millenium, 2000.

[24] N. Parihar and Joseph Picone, “Aurora Working Group: DSR Front End LVCSR Evaluation,” Tech. Rep., European Telecommunications Standards Institute, 2002.

[25] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio, “Contractive auto-encoders: Explicit invariance during feature extraction,” in Int. Conf. Mach. Learn., 2011, vol. 85, pp. 833–840.

[26] R. A. Dunne and N. A. Campbell, “On the pairing of the Softmax activation and cross-entropy penalty functions and the derivation of the Softmax activation function,” in Proc. 8th Aust. Conf. Neural Networks, 1997, pp. 1–5.

[27] P. Golik, P. Doetsch, and H. Ney, “Cross-entropy vs. squared error training: A theoretical and experimental comparison,” in Interspeech, 2013, pp. 2–6.

[28] J. B. Tenenbaum, “Mapping a manifold of perceptual observations,” Adv. Neural Inf. Process. Syst., 1998.

[29] L. K. Saul and S. T. Roweis, “Think globally, fit locally: Unsupervised learning of low dimensional manifolds,” J. Mach. Learn. Res., vol. 4, pp. 119–155, 2003.

[30] M. Liu, K. Liu, and X. Ye, “Find the intrinsic space for multiclass classification,” ACM Proc. Int. Symp. Appl. Sci. Biomed. Commun. Technol. (ACM ISABEL), 2011.

[31] Aren Jansen and Partha Niyogi, “A geometric perspective on speech sounds,” Tech. Rep., University of Chicago, 2005.

[32] Shuicheng Yan, Dong Xu, Benyu Zhang, Hong-Jiang Zhang, Qiang Yang, and Stephen Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.

[33] Yun Tang and Richard Rose, “A study of using locality preserving projections for feature extraction in speech recognition,” in IEEE Int. Conf. Acoust. Speech Signal Process., Las Vegas, NV, Mar. 2008, pp. 1569–1572.

[34] Vikrant Singh Tomar and Richard C. Rose, “Application of a locality preserving discriminant analysis approach to ASR,” in 2012 11th Int. Conf. Inf. Sci. Signal Process. their Appl., Montreal, QC, Canada, July 2012, pp. 103–107.

[35] Vikrant Singh Tomar and Richard C. Rose, “A correlational discriminant approach to feature extraction for robust speech recognition,” in Interspeech, Portland, OR, USA, 2012.

[36] Vikrant Singh Tomar and Richard C. Rose, “Efficient manifold learning for speech recognition using locality sensitive hashing,” in ICASSP IEEE Int. Conf. Acoust. Speech Signal Process., Vancouver, BC, Canada, 2013.


[37] Vikrant Singh Tomar and Richard C. Rose, “Locality sensitive hashing for fast computation of correlational manifold learning based feature space transformations,” in Interspeech, Lyon, France, 2013, pp. 2–6.

[38] Piotr Indyk and R. Motwani, “Approximate nearest neighbors: Towards removing the curse of dimensionality,” in Proc. Thirtieth Annu. ACM Symp. Theory Comput., 1998, pp. 604–613.

[39] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proc. Twent. Annu. Symp. Comput. Geom. (SCG ’04), p. 253, 2004.

[40] Amarnag Subramanya and Jeff Bilmes, “The semi-supervised switchboard transcription project,” in Interspeech, 2009.

[41] Jason Weston, Frederic Ratle, and Ronan Collobert, “Deep learning via semi-supervised embedding,” in Proc. 25th Int. Conf. Mach. Learn. (ICML ’08), pp. 1168–1175, 2008.

[42] Frederic Ratle, Gustavo Camps-Valls, and Jason Weston, “Semi-supervised neural networks for efficient hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., pp. 1–12, 2009.

[43] Vikrant Singh Tomar and Richard C. Rose, “Noise aware manifold learning for robust speech recognition,” in ICASSP IEEE Int. Conf. Acoust. Speech Signal Process., 2013.

[44] Rongfeng Su, Xurong Xie, Xunying Liu, and Lan Wang, “Efficient use of DNN bottleneck features in generalized variable parameter HMMs for noise robust speech recognition,” in Interspeech, 2015.

[45] M. Seltzer, D. Yu, and Yongqiang Wang, “An investigation of deep neural networks for noise robust speech recognition,” in Proc. ICASSP, pp. 7398–7402, 2013.

[46] Volodymyr Mnih, “CUDAMat: A CUDA-based matrix class for Python,” Tech. Rep., Department of Computer Science, University of Toronto, 2009.

[47] Tijmen Tieleman, “Gnumpy: An easy way to use GPU boards in Python,” Tech. Rep., Department of Computer Science, University of Toronto, 2010.

[48] A. Agarwal, Samuel Gerber, and H. Daume, “Learning multiple tasks using manifold regularization,” in Adv. Neural Inf. Process. Syst., 2010, pp. 1–9.