Towards Relevance and Sequence Modeling in Language Recognition

Bharat Padi, Anand Mohan and Sriram Ganapathy, Senior Member, IEEE

Abstract—The task of automatic language identification (LID) involving multiple dialects of the same language family in the presence of noise is a challenging problem. In these scenarios, the identity of the language/dialect may be reliably present only in parts of the temporal sequence of the speech signal. The conventional approaches to LID (and to speaker recognition) ignore the sequence information by extracting a long-term statistical summary of the recording, assuming independence of the feature frames. In this paper, we propose a neural network framework utilizing short-sequence information in language recognition. In particular, a new model is proposed for incorporating relevance in language recognition, where parts of the speech data are weighted more based on their relevance to the language recognition task. This relevance weighting is achieved using a bidirectional long short-term memory (BLSTM) network with attention modeling. We explore two approaches: the first approach uses segment level i-vector/x-vector representations that are aggregated in the neural model, and in the second approach the acoustic features are directly modeled in an end-to-end neural model. Experiments are performed on the language recognition task of the NIST LRE 2017 Challenge using clean, noisy and multi-speaker speech data, as well as on the RATS language recognition corpus. In these experiments on the noisy LRE tasks as well as the RATS dataset, the proposed approach yields significant improvements over the conventional i-vector/x-vector based language recognition approaches as well as over other previous models incorporating sequence information.

Index Terms—Language Recognition, i-vectors, Long Short-Term Memory (LSTM) networks, Sequence Modeling, Attention Networks.

I. INTRODUCTION

THE problem of recognizing the spoken language of a given audio segment is of considerable interest for several commercial applications like speech translation [1], multi-lingual speech recognition [2] and document retrieval [3], as well as in defense and surveillance applications [4]. In recent years, several advances in speech signal modeling, the most prominent of them being the application of factor analysis methods, have contributed to improving the performance of language recognition systems [5]. However, the task remains considerably challenging when the recognition involves multiple dialects of the same language (like the recent NIST Language Recognition Evaluation (LRE) 2017 challenge). Further, the language identification (LID) performance is degraded in the presence of noise and other artifacts, such as in the robust automatic transcription of speech (RATS) databases [4], [6], [7]. In this paper, we propose a modeling framework to address the challenge of noise in language recognition.

This work was partly supported by grants from the Department of Science and Technology, Early Career Award (ECR-01341/2017) and the Extra Mural Grant (EMR-2016/007934).

Bharat Padi is with mind.ai, Bangalore. This work was performed when he was a student at the LEAP lab.

Anand Mohan is with the Amazon Alexa Lab, Bangalore, India. This work was performed when he was a student at the LEAP lab.

Sriram Ganapathy is with the Learning and Extraction of Acoustic Patterns (LEAP) lab, Electrical Engineering, Indian Institute of Science, Bengaluru, India, 560012.

Traditionally, phoneme recognition followed by language modeling (PRLM) was one of the popular methods for the automatic LID task [8], [9]. In the recent past, the use of deep neural network (DNN) based posterior features was attempted for LID [10]. Bottleneck features based on the acoustic model of a speech recognition system have also shown promising results for noisy language recognition [11].

The development of i-vectors as one of the primary representations for LID was first introduced in [12]. The i-vectors are features of fixed dimension derived from variable length speech utterances using a background model [13]. The background model can be a Gaussian mixture model (GMM) [14] or a DNN model [11], [15]. The i-vectors extracted from the training data are used to train classifiers such as support vector machines (SVMs) which perform the task of language identification [16], [17].

One of the main drawbacks of the i-vector representations [12] and the recently proposed x-vector representations [18] is the long-term summarization of the audio signal. Even in the extraction of x-vector embeddings using the TDNN models [19], only a temporal context of 15 frames (about 150 msec.) is used in the forward propagation of frame level features. For tasks like dialect identification in the presence of noise and other artifacts, some regions of the audio may be more reliable than the rest. We hypothesize that there is a need to extract information only from the relevant regions of the audio signal rather than from a long-term summary of the signal for the task of noisy LID. The attention approach in neural network modeling was originally proposed for neural machine translation [20] and image captioning [21]. The attention approach to speech recognition was initially investigated for phoneme recognition [22], where the issues in dealing with variable length speech sequences were identified. Recently, attention based models have also been applied to end-to-end speech recognition tasks [23]–[25]. End-to-end approaches to language recognition have been explored with long short term memory (LSTM) networks and with DNNs [26]. A recent approach using curriculum learning has also been applied for noise robust language recognition [27]. However, the state-of-the-art language recognition systems evaluated in the large scale NIST language recognition evaluation (LRE) challenges continue to use the i-vector/x-vector based approaches with a support vector machine classifier [28].

In this paper, we propose an attention based framework for language recognition to perform relevance weighting of the audio signal regions in the presence of noise. The term relevance used in this work corresponds to the relative importance of short regions of audio (1000 msec. chunks) for determining the language/dialect of the given utterance. For neural modeling of relevance, we explore two approaches:

(i) Modeling short segment statistics from i-vector representations using a deep bidirectional LSTM model [29]. We refer to this as the i/x-BLSTM model.

(ii) Modeling the audio features directly using an end-to-end deep recurrent model with hierarchical gated recurrent units. We refer to this model as the HGRU model [30].

The rest of the paper is organized as follows. The state-of-the-art systems using the i-vector based model and the LSTM based neural network approach to end-to-end language recognition are discussed in Sec. II-A and Sec. II-B respectively. Then, we discuss several approaches for incorporating relevance using short-term i-vector representations in Sec. III. In Sec. III-B, we discuss the proposed hybrid model of using short-term i-vector representations in a neural framework for language recognition. Sec. III-C details the proposed end-to-end framework for language recognition. The details of the experimental setup and the various datasets used are given in Sec. IV. The LID results of the various language recognition tasks on the LRE 2017 and RATS datasets are reported in Sec. V. This is followed by an analysis of the robustness of the proposed approaches in the presence of noise in Sec. VI. In Sec. VII, we summarize the important contributions of this paper.

II. RELEVANT PRIOR WORK

A. i-vector/x-vector SVM LID system

The i-vectors are among the most widely used features for language recognition [12]. A Gaussian Mixture Universal Background Model (GMM-UBM) is obtained by pooling the front-end features from all the utterances in the training dataset. The means of the GMM are adapted to each utterance using the Baum-Welch (BW) statistics of the front-end features. A Total Variability Model (TVM) is assumed as a generative model for the adapted GMM mean supervector, which is given by,

$$ M(s) = M_0 + T\, y(s), $$

where M_0 and M(s) are the UBM mean supervector and the adapted mean supervector of recording s, respectively. Here, T is a rectangular matrix, and y(s) ∼ N(0, I) is a latent variable. The maximum a posteriori (MAP) estimate of y(s) given the front-end features of the recording s is called the i-vector of the recording, denoted y*(s). The i-vectors extracted for each of the speech files are processed with length normalization [31] and dimensionality reduction is performed using linear discriminant analysis (LDA).
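For reference, the closed-form MAP point estimate implied by this model follows the standard total variability formulation of [13]. The symbols N(s), F̃(s) and Σ below (the zeroth-order statistics matrix, the centered first-order statistics supervector and the UBM covariance) are not defined in the text above and are introduced here only for illustration:

$$ y^*(s) = \left( I + T^{\top} \Sigma^{-1} N(s)\, T \right)^{-1} T^{\top} \Sigma^{-1} \tilde{F}(s). $$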

Recently, x-vector representations have been developed for speaker and language recognition [18], [19]. The x-vector model is based on a time delay neural network (TDNN) where the initial layers operate on frame level features. The higher layers convert these features to the segment level by summarizing the mean and standard deviation of the frame level features. This is followed by 2 fully connected hidden layers, and the entire neural network model is trained to classify languages. The utterance level features before the last fully connected hidden layer are used as embeddings and are termed x-vectors. The x-vector features can be used instead of i-vector embeddings in an SVM classifier for language recognition [19].
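As a rough illustration of the mean and standard deviation summarization step described above, a minimal numpy sketch could look as follows (the array shapes and values are placeholders, not the actual TDNN configuration of [19]):

```python
import numpy as np

def stats_pooling(frame_feats):
    """Map frame-level features (T x D) to a single 2D-dim segment vector
    by concatenating the per-dimension mean and standard deviation."""
    mu = frame_feats.mean(axis=0)        # (D,)
    sigma = frame_feats.std(axis=0)      # (D,)
    return np.concatenate([mu, sigma])   # (2D,)

# toy example: 300 frames of 512-dim frame-level outputs (placeholder values)
frames = np.random.randn(300, 512)
segment_vector = stats_pooling(frames)   # shape (1024,)
```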

B. Long Short Term Memory Neural Network LID system

Long Short-Term Memory (LSTM) networks [32], [33] are designed to address the issue of learning long-term dependencies in recurrent neural networks [34]. In [35], [36], the authors have proposed a Long Short Term Memory Recurrent Neural Network (LSTM-RNN) based end-to-end model for exploiting temporal information useful for LID. In [35], the authors have shown that the LSTM model outperforms a baseline i-vector system and a deep neural network model on short duration (3 sec.) test segments. In a recent work, the authors of [36] show that neural networks are useful even for challenging NIST datasets, where the LSTM model outperforms the i-vector baseline on short duration (3 sec.) test segments but performs relatively worse on longer duration test segments (10 sec., 30 sec.).

III. RELEVANCE IN LANGUAGE RECOGNITION

We propose three different approaches for incorporating relevance and sequence information in language recognition.

A. Relevance Weighted Baum-Welch statistics

The main motivation in this approach is the use of inverse entropy as a measure of confidence/relevance. The entropy of the class posterior distribution carries information regarding the uncertainty of the classification. Given a discrete class posterior distribution, the higher the entropy, the lower the confidence of the decision. Misra et al. [37] showed that the entropy increases with decreasing SNR, as the posterior probability distribution becomes more and more uniform when the noise level increases. If two segments s_1, s_2 of the audio have two different SNR values and the corresponding i-vector representations are denoted as y_1 and y_2, the inverse of the entropy of the corresponding language posteriors H(l|y_1) and H(l|y_2) provides a measure of the reliability of the decisions from each of the audio regions. The optimal language decision using the information from the two audio regions can be taken by using an inverse-entropy weighted linear combination of the posteriors [37], [38],

$$ l^* = \arg\max_l \left[ w_1\, p(l|y_1) + w_2\, p(l|y_2) \right] \qquad (1) $$

where w_1 and w_2 are weights given by

$$ w_1 = \frac{1/H(l|y_1)}{1/H(l|y_1) + 1/H(l|y_2)} \qquad (2) $$

and w_2 = 1 − w_1. In the proposed work, the relevance weighting of the Baum-Welch statistics is used to provide the optimal statistics for i-vector estimation (similar to the SAD probability estimation done in [39]).
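A minimal numpy sketch of the inverse-entropy combination in Eqs. (1)-(2) is given below; the posterior vectors are toy values, not outputs of the actual language classifier:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete posterior distribution."""
    p = np.clip(p, eps, 1.0)
    return -np.sum(p * np.log(p))

def inverse_entropy_combination(p1, p2):
    """Combine two language posteriors with inverse-entropy weights (Eqs. 1-2)."""
    w1 = (1.0 / entropy(p1)) / (1.0 / entropy(p1) + 1.0 / entropy(p2))
    w2 = 1.0 - w1
    combined = w1 * p1 + w2 * p2
    return np.argmax(combined), combined

# toy posteriors over 3 languages from two audio segments
p_seg1 = np.array([0.7, 0.2, 0.1])    # confident (low entropy) -> higher weight
p_seg2 = np.array([0.4, 0.35, 0.25])  # uncertain (high entropy) -> lower weight
best_lang, post = inverse_entropy_combination(p_seg1, p_seg2)
```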


Fig. 1: Proposed BLSTM model using 1000 msec. i-vector/x-vector features for language recognition. (Block diagram: speech features [25 ms windows with a shift of 10 ms] → segment-wise i-vector/x-vector extraction → BLSTM layers producing hidden states h_1, ..., h_T → attention module and mean computation → fully connected layers → language classes.)

In our proposed approach, we use short-term i-vectors that are labeled individually according to the label of the corresponding utterance. These 1000 msec. i-vectors from the train and development datasets are then used to train a feed-forward deep neural network (DNN). The DNN has an input layer of 500 dimensions and 3 hidden layers of 1024 dimensions with rectified linear unit (ReLU) non-linearity. The output layer has L dimensions, where L is the number of language classes.

Once the DNN model is trained, for each non-overlapping short-term i-vector y_i from every utterance, the information entropy H_i is computed from the posteriors of the DNN model, where i is the index of the 1000 msec. i-vector. We use the entropy measure H_i to compute a relevance parameter γ_i. The value of γ_i changes only at 1 sec. intervals. The value of γ_i is obtained from H_i as,

$$ \gamma_i = \begin{cases} 1 & \text{when } H_i < H_{min} \\[4pt] \dfrac{H_{max} - H_i}{H_{max} - H_{min}} & \text{when } H_{min} \leq H_i \leq H_{max} \\[4pt] 0 & \text{when } H_i > H_{max} \end{cases} \qquad (3) $$

where H_max and H_min are hyper-parameters. The zeroth and first-order Baum-Welch statistics used for i-vector extraction are now modified as:

$$ N_c(s) = \sum_{i=1}^{H(s)} \gamma_i\, p(c\,|\,x_i) \qquad (4) $$

$$ F_{X,c}(s) = \sum_{i=1}^{H(s)} \gamma_i\, p(c\,|\,x_i)\,(x_i - \mu_c) \qquad (5) $$
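A compact numpy sketch of the relevance weighting in Eqs. (3)-(5) is given below. The frame posteriors, features, UBM means and the H_min/H_max values are placeholders, and the per-frame gamma array is assumed to be obtained by repeating each 1 sec. segment value over its frames:

```python
import numpy as np

def relevance_gamma(H, H_min=0.5, H_max=2.0):
    """Map segment entropy H to a relevance weight gamma (Eq. 3)."""
    if H < H_min:
        return 1.0
    if H > H_max:
        return 0.0
    return (H_max - H) / (H_max - H_min)

def weighted_bw_stats(frames, frame_post, gamma, ubm_means):
    """Relevance-weighted zeroth/first-order Baum-Welch statistics (Eqs. 4-5).
    frames: (T, D) features, frame_post: (T, C) UBM posteriors p(c|x_i),
    gamma: (T,) per-frame relevance weights, ubm_means: (C, D)."""
    w_post = frame_post * gamma[:, None]              # gamma_i * p(c|x_i)
    N = w_post.sum(axis=0)                            # (C,)  Eq. (4)
    F = w_post.T @ frames - N[:, None] * ubm_means    # (C, D) centered, Eq. (5)
    return N, F
```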

Fig. 2: Attention modeling in the proposed i/x-BLSTM/HGRU model. (Schematic: the hidden states h_1, h_2, ..., h_T are transformed with the parameters W_e, b_e, compared with the attention vector u_e to obtain the weights a_1, a_2, ..., a_T, and combined into the embedding e.)

B. Hybrid i-vector/x-vector BLSTM model

This approach is shown in Fig. 1. The short-term i-vectors/x-vectors, extracted every 200 msec. from overlapping windows of 1000 msec. duration (100 frames of acoustic features f_1, f_2, ... extracted with a 10 msec. shift), are modeled using a bidirectional LSTM model (BLSTM) with attention. The input to the BLSTM is the variable length sequence of short-term i-vectors/x-vectors. The LSTM architecture that we use in this paper contains two layers with 256 memory cells in each layer¹. We propose to use an attention model to weight the 1 sec. representations based on their relevance to the language classification task. The attention method [20] provides an efficient way to aggregate the sequence of 1 sec. vectors. The attention mechanism used in this work is shown in Fig. 2. The model implements the following set of equations,

$$ u_t = \tanh(W_e h_t + b_e) \qquad (6) $$

$$ a_t = \frac{\exp(u_t^{\top} u_e)}{\sum_t \exp(u_t^{\top} u_e)} \qquad (7) $$

$$ e = \sum_t a_t h_t \qquad (8) $$

Here, W_e and b_e are the weights and the bias of the attention module, which are learned in the training process along with the vector u_e. The output e denotes the fixed dimensional embedding of the input sequence. The attention module, based on the similarity of u_t with u_e, assigns normalized weights a_t using a softmax function. The utterance level representation e is mapped to the final language targets through a fully connected network (FCN) layer with 512 dimensions and ReLU non-linearity. The output layer has a softmax non-linearity, and the model is trained with the cross entropy loss using stochastic gradient descent [40] and is implemented in TensorFlow [41].
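A minimal numpy forward pass of the attention pooling in Eqs. (6)-(8) is sketched below; in the actual system the parameters W_e, b_e, u_e are learned jointly with the rest of the network in TensorFlow, and the dimensions here are placeholders:

```python
import numpy as np

def attention_pool(H, W_e, b_e, u_e):
    """Forward pass of the attention pooling in Eqs. (6)-(8).
    H: (T, d) sequence of BLSTM hidden states h_t.
    W_e: (d, d_a), b_e: (d_a,), u_e: (d_a,) learned parameters."""
    U = np.tanh(H @ W_e + b_e)        # Eq. (6): u_t for every time step
    scores = U @ u_e                  # similarity u_t^T u_e
    a = np.exp(scores - scores.max())
    a = a / a.sum()                   # Eq. (7): softmax weights a_t
    e = a @ H                         # Eq. (8): weighted-sum embedding
    return e, a
```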

C. End-to-End Hierarchical GRU model

We propose to use a simplified version of the LSTM function, the Gated Recurrent Unit (GRU) architecture [42], in the end-to-end model. The GRU combines the input and forget gates into a single gate and, in many tasks, GRUs, which have a relatively smaller number of parameters, are shown to match or improve over the performance of LSTMs [43].

¹All models proposed in this work are available at https://github.com/iiscleap/lre-relevance-weighting


Fig. 3: End-to-end Hierarchical GRU RNN with attention module and duration dependent target layers. (Block diagram: speech features [25 ms windows with a shift of 10 ms] → GRU layer over 200 msec. windows with a 100 msec. shift → GRU layer accumulating non-overlapping 1 sec. representations → bidirectional GRU (BGRU) layer → attention module → separate fully connected network (FCN) branches and language class target layers for inputs < 10 sec. and ≥ 10 sec.)

Even with LSTM/GRU models, the direct modeling of long sequences can be cumbersome [36]. In order to model such long sequences, we propose a novel hierarchical bidirectional GRU [30] network with attention in this paper.

The block schematic of the proposed model is given in Fig. 3. At the first layer, a 256-cell unidirectional GRU block accumulates information across a window of 200 msec., i.e., a sequence of 20 feature vectors, with a shift of 100 msec. over the entire input sequence. The output from the first layer is a sequence of vectors sampled every 100 msec., with each vector representing information from an overlapping 200 msec. segment of the input speech. This is then fed to the second GRU layer with 512 cells, where the information is accumulated over a window of 1 sec. The accumulated 1 sec. vectors from the second layer are fed to the final bidirectional GRU layer [44] with 512 cells in the third layer.

The output of the three layer hierarchical GRU model contains representations at the 1 sec. level. The rest of the model framework is similar to the i-BLSTM model (Sec. III-B). For the end-to-end model, we found that the distribution of the embeddings from the attention network is quite different for short and long duration inputs. Hence, we propose to use two separate target layers, one for short duration inputs of the order of 3 sec. and the other for longer duration input sequences of 10 sec. or above. The entire network is trained using Adam optimization and the Back Propagation Through Time (BPTT) algorithm [40].
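The following Keras sketch illustrates the hierarchical idea under simplifying assumptions: non-overlapping windows instead of the overlapping 200 msec./100 msec. shift scheme, a single-dense-layer attention scorer instead of Eqs. (6)-(8), and no duration-dependent target branches. All dimensions and names are placeholders, not the exact configuration of the proposed HGRU model:

```python
import tensorflow as tf
from tensorflow.keras import layers

feat_dim, n_lang = 80, 14
frames_per_200ms, segs_per_1s = 20, 5

# input: (batch, n_seconds, 5 sub-windows, 20 frames, feat_dim)
inp = layers.Input(shape=(None, segs_per_1s, frames_per_200ms, feat_dim))
# level 1: summarize each 200 msec. window of 20 frames with a GRU
x = layers.TimeDistributed(layers.TimeDistributed(layers.GRU(256)))(inp)
# level 2: summarize each 1 sec. group of 200 msec. vectors with a GRU
x = layers.TimeDistributed(layers.GRU(512))(x)
# level 3: bidirectional GRU over the sequence of 1 sec. vectors
x = layers.Bidirectional(layers.GRU(512, return_sequences=True))(x)
# simplified attention pooling over the 1 sec. representations
attn = layers.Dense(1, activation="tanh")(x)
attn = layers.Softmax(axis=1)(attn)
emb = tf.reduce_sum(attn * x, axis=1)
# fully connected layers and the language class targets
out = layers.Dense(n_lang, activation="softmax")(
    layers.Dense(512, activation="relu")(emb))
model = tf.keras.Model(inp, out)
```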

TABLE I: LRE17 training set: language clusters, target languages and total number of hours.

Cluster  | Target Language                               | Hours
---------|-----------------------------------------------|------
Arabic   | Egyptian Arabic (ara-arz)                     | 190.9
         | Iraqi Arabic (ara-acm)                        | 130.8
         | Levantine Arabic (ara-apc)                    | 440.7
         | Maghrebi Arabic (ara-ary)                     |  81.8
Chinese  | Mandarin (zho-cmn)                            | 379.4
         | Min Nan (zho-nan)                             |  13.3
English  | British English (eng-gbr)                     |   4.8
         | General American English (eng-usg)            | 327.7
Slavic   | Polish (qsl-pol)                              |  59.3
         | Russian (qsl-rus)                             |  69.5
Iberian  | Caribbean Spanish (spa-car)                   | 166.3
         | European Spanish (spa-eur)                    |  24.7
         | Latin American Continental Spanish (spa-lac)  | 175.9
         | Brazilian Portuguese (por-brz)                |   4.1

IV. EXPERIMENTAL SETUP

A. Feature Extraction

All the systems in this paper use the same front-end features. The features are Deep Neural Network (DNN) Bottleneck (BN) features [11], [28]. We extract the BN features from a feed-forward deep neural network trained for automatic speech recognition (ASR) using the Kaldi [45] framework. The ASR-DNN was trained using 39-dimensional (13 + Δ + ΔΔ) mel frequency cepstral coefficient (MFCC) features with a 10 msec. frame shift and 25 msec. windows. The ASR-DNN was trained on speech from the combined Switchboard (SWB1) and Fisher corpora (about 2000 hours of labeled audio). The ASR-DNN model uses 7 hidden layers with ReLU activation and layer-wise batch normalization. We use the last hidden layer output of size 80.

B. Datasets

1) NIST LRE 2017 Dataset: The details of the LRE17 dataset are given in Table I. The LRE17 training data (LDC2017E22) has five major language clusters with 14 target dialects and a total duration of 2069 hours in 16205 files. The development dataset consists of 3661 files which contain 253 hours of audio, and the evaluation dataset consists of 25451 files with 1065 hours of audio. The development and evaluation datasets are further partitioned into utterances of durations 3 sec., 10 sec., 30 sec. or 1000 sec. The datasets contain conversational telephone speech (CTS), broadcast narrow band speech (BNBS) and speech extracted from videos, or video speech (VS) (the 1000 sec. files). We have not used the development data in training the embedding models (i-vector/x-vector) or the back-end models.

In addition to the standard LRE17 test set, we test the robustness of the various LID systems using noisy versions of the evaluation data for the 10 sec. recording condition. We use four different noise types (Babble, Subway, Airport and Street) from the Aurora-4 corpus, and these are added to the audio signal at different signal-to-noise ratios (SNRs) of 5, 10, 15 and 20 dB using the filter and noise adding tool (FANT) [46]. While adding the noise, either the entire audio file is corrupted with noise (Noisy cond.) or only the first half of the audio file is corrupted with noise (Partial Noise).


TABLE II: Performance of the reference and the developed systems on the NIST LRE17 evaluation dataset in terms of percentage accuracy, Cavg and EER. The performance of the x-vector systems is shown in parentheses. The x-vector systems were evaluated only for the 3, 10 and 30 sec. conditions. The x-post system uses the x-vector network directly without the proposed BLSTM backend model.

Accuracy (%)
Dur.    | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | RWBW  | i-BLSTM (x-BLSTM) | HGRU  | x-post  | x-BLSTM-E2E
3       | 53.84 (46.14)         | 54.7      | 54.6          | 51.37 | 54.80 (56.99)     | 55.13 | (60.55) | (60.2)
10      | 72.36 (73.19)         | 72.1      | 72.5          | 68.42 | 75.89 (71.60)     | 74.06 | (69.82) | (70.5)
30      | 82.98 (85.09)         | 76.1      | 79.7          | 77.59 | 82.27 (76.02)     | 82.98 | (72.02) | (74.2)
1000    | 56.23                 | 42.8      | 50.2          | 58.48 | 54.07             | 53.53 | -       | -
overall | 67.86                 | 64.3      | 66.36         | 64.78 | 68.65             | 68.48 | -       | -

Cavg
Dur.    | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | RWBW  | i-BLSTM (x-BLSTM) | HGRU  | x-post  | x-BLSTM-E2E
3       | 0.53 (0.49)           | 0.55      | 0.53          | 0.58  | 0.50 (0.46)       | 0.55  | (0.45)  | (0.46)
10      | 0.27 (0.21)           | 0.35      | 0.28          | 0.35  | 0.26 (0.28)       | 0.32  | (0.32)  | (0.32)
30      | 0.13 (0.10)           | 0.28      | 0.20          | 0.21  | 0.18 (0.23)       | 0.23  | (0.28)  | (0.28)
1000    | 0.54                  | 0.79      | 0.61          | 0.51  | 0.50              | 0.62  | -       | -
overall | 0.37                  | 0.48      | 0.40          | 0.40  | 0.36              | 0.42  | -       | -

EER (%)
Dur.    | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | RWBW  | i-BLSTM (x-BLSTM) | HGRU  | x-post  | x-BLSTM-E2E
3       | 13.40 (16.92)         | 15.39     | 14.98         | 15.60 | 15.47 (13.50)     | 15.33 | (13.70) | (13.45)
10      | 6.47 (5.98)           | 8.70      | 7.05          | 7.84  | 6.32 (7.33)       | 7.49  | (9.20)  | (8.53)
30      | 3.50 (2.75)           | 7.25      | 4.73          | 4.61  | 3.67 (6.28)       | 4.93  | (8.11)  | (7.12)
1000    | 15.35                 | 26.27     | 15.72         | 14.20 | 14.71             | 17.02 | -       | -
overall | 9.26                  | 14.38     | 10.80         | 10.34 | 9.65              | 10.79 | -       | -

The latter condition is considered to simulate non-stationary test conditions where the noise characteristics can change within the course of the recording.

2) RATS LID Dataset: The DARPA Robust Automatic Transcription of Speech (RATS) [4] program targets the development of speech systems operating on highly distorted speech recorded over "degraded" radio channels. The data used here consists of recordings obtained from re-transmitting a clean signal over eight different radio channel types, where each channel introduces a unique degradation mode specific to the device and modulation characteristics [4]. For the language identification (LID) task, the performance is degraded due to the short segment duration of the speech recordings in addition to the significant amount of channel noise [47].

The training data for the RATS experiments consist of 20000 recordings (about 1600 hours of audio) from five target languages (Arabic, Pashto, Dari, Farsi and Urdu) as well as from several other non-target languages. We have used 6 out of the 8 given channels (channels B-G) for training and testing purposes. The development and the evaluation data consist of 5663 and 14757 recordings respectively from the above 6 channels. We also evaluate the models on sampled 3 sec., 10 sec. and 30 sec. chunks of voiced data from the full length evaluation files.

C. Evaluation Metrics

The performance is measured using the primary metric described in the evaluation plan of NIST LRE 2017, Cavg [28], along with the Equal Error Rate (EER) and accuracy.

• Cavg - The pair-wise language recognition performance is computed for all the target-language/non-target-language pairs (L_T, L_N). An average performance cost for each system is computed as

$$ C_{avg}(\beta) = \frac{1}{N_L} \sum_{L_T} \left\{ P_m(L_T) + \frac{\sum_{L_N} \beta\, P_f(L_T, L_N)}{N_L - 1} \right\} $$

where P_m is the probability of miss detection and P_f is the probability of false alarm, both computed by applying a detection threshold of log β to the system scores. The primary metric used for LRE17 is the average of the two Cavg measures obtained using β_1 = 1 and β_2 = 9 (a minimal sketch of this computation is given after this list).

• EER - The Equal Error Rate is measured for individual languages and then their average is computed. Note that the EER and Cavg consider the LID system as a detection problem.

• Accuracy - The accuracy is measured in a classification setting. It considers the LID system as a multi-class learning problem (14 classes in the LRE17 dataset and 6 classes in the RATS evaluation).
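The sketch below follows the Cavg formula above under simplifying assumptions about the score and label layout; the official NIST scoring tool should be used for reported numbers, and the variable names are illustrative only:

```python
import numpy as np

def c_avg(scores, labels, langs, beta):
    """Average detection cost following the Cavg formula above (sketch).
    scores: dict mapping target language -> array of detection scores for all
            trials; labels: array of true language indices per trial;
    langs: list of target languages; beta: cost parameter."""
    thr = np.log(beta)
    n_l = len(langs)
    cost = 0.0
    for i, lt in enumerate(langs):
        tgt = scores[lt][labels == i]          # trials whose true language is lt
        p_miss = np.mean(tgt < thr) if len(tgt) else 0.0
        p_fa_sum = 0.0
        for j, ln in enumerate(langs):
            if j == i:
                continue
            non = scores[lt][labels == j]      # trials of language ln scored against lt
            p_fa_sum += beta * (np.mean(non >= thr) if len(non) else 0.0)
        cost += p_miss + p_fa_sum / (n_l - 1)
    return cost / n_l

# primary metric: average of the costs at beta = 1 and beta = 9
# primary = 0.5 * (c_avg(S, y, langs, 1.0) + c_avg(S, y, langs, 9.0))
```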

D. i-vector/x-vector baseline system

Once the BN features are extracted from the ASR-DNN for the LID data, a speech activity detection (SAD) algorithm is applied to remove the unvoiced frames [48]. We use the implementation of SAD from the Voicebox toolkit [49]. This is followed by cepstral mean variance normalization (CMVN) over each utterance, followed by CMVN over a sliding window of 3 sec. The number of UBM mixture components was set to C = 2048 and the dimension of the total variability space was fixed to R = 500. The i-vectors are processed with the within class covariance normalization (WCCN) technique [50] and length normalized. The dimension of the i-vectors is then reduced to 13 and 5 for the LRE17 and RATS datasets respectively using linear discriminant analysis (LDA), and the resulting vectors are modeled using support vector machines (LDA-SVM). The i-vector SVM system implementation follows the recipe provided by the NIST LRE system [28].
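A rough scikit-learn sketch of the backend chain described above (length normalization → LDA → SVM) is given below; the WCCN step and the exact NIST recipe are omitted, and the arrays, labels and dimensions are placeholders:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

def length_normalize(X):
    """Project i-vectors onto the unit sphere (length normalization)."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# X_train: (N, 500) i-vectors, y_train: language labels (placeholder data)
X_train = np.random.randn(1000, 500)
y_train = np.random.randint(0, 14, size=1000)

X = length_normalize(X_train)
lda = LinearDiscriminantAnalysis(n_components=13).fit(X, y_train)   # 500 -> 13 dims
clf = SVC(kernel="linear", probability=True).fit(lda.transform(X), y_train)

# scoring a test i-vector
x_test = length_normalize(np.random.randn(1, 500))
lang_posteriors = clf.predict_proba(lda.transform(x_test))
```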

The x-vector baseline system is implemented using the TDNN model architecture described in [19]. The x-vector system consists of 5 layers of time delay neural network (TDNN) architecture followed by a statistics (mean and standard deviation) pooling layer which converts frame level features to utterance level features.


TABLE III: Performance of the systems when evaluated on data (10 sec. recording condition) corrupted with noise at various SNR levels. The performance of the x-vector systems is shown in parentheses.

Accuracy (%)
SNR  | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | i-BLSTM (x-BLSTM) | HGRU  | (x-post) | (x-BLSTM-E2E)
5dB  | 47.28 (49.96)         | 48.36     | 46.35         | 51.58 (55.44)     | 45.78 | (58.13)  | (57.41)
10dB | 53.42 (58.43)         | 56.30     | 54.86         | 59.86 (62.65)     | 53.92 | (63.57)  | (63.73)
15dB | 57.47 (63.20)         | 61.63     | 61.58         | 64.30 (66.27)     | 60.20 | (65.88)  | (66.94)
20dB | 59.64 (66.66)         | 65.28     | 65.76         | 67.72 (68.24)     | 64.21 | (67.40)  | (68.36)
Avg. | 54.45 (59.56)         | 57.89     | 57.13         | 60.87 (63.15)     | 56.02 | (63.74)  | (64.36)

Cavg
SNR  | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | i-BLSTM (x-BLSTM) | HGRU | (x-post) | (x-BLSTM-E2E)
5dB  | 0.62 (0.71)           | 0.65      | 0.60          | 0.59 (0.47)       | 0.68 | (0.47)   | (0.50)
10dB | 0.53 (0.63)           | 0.54      | 0.49          | 0.48 (0.38)       | 0.58 | (0.40)   | (0.41)
15dB | 0.48 (0.58)           | 0.47      | 0.41          | 0.42 (0.33)       | 0.48 | (0.36)   | (0.37)
20dB | 0.45 (0.54)           | 0.42      | 0.37          | 0.37 (0.31)       | 0.43 | (0.34)   | (0.35)
Avg. | 0.52 (0.62)           | 0.52      | 0.47          | 0.47 (0.37)       | 0.54 | (0.39)   | (0.41)

EER (%)
SNR  | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | i-BLSTM (x-BLSTM) | HGRU  | (x-post) | (x-BLSTM-E2E)
5dB  | 22.78 (26.52)         | 19.79     | 17.95         | 17.01 (14.44)     | 19.16 | (16.25)  | (15.54)
10dB | 18.20 (21.97)         | 15.88     | 13.27         | 12.90 (11.11)     | 15.61 | (12.64)  | (11.94)
15dB | 15.63 (19.30)         | 12.87     | 10.88         | 11.05 (9.26)      | 13.19 | (10.95)  | (10.07)
20dB | 13.99 (17.87)         | 11.11     | 9.43          | 9.44 (8.37)       | 11.69 | (10.11)  | (9.20)
Avg. | 17.65 (21.41)         | 14.91     | 12.88         | 12.60 (10.80)     | 14.91 | (12.49)  | (11.69)

These segment level statistics are fed through two layers of feed-forward network with 512 hidden dimensions. The x-vectors are the embeddings from the affine layer after the statistics pooling (the output of layer 6 before the non-linearity, with 512 dimensions). The targets for the x-vector model are the language labels. We use the LRE2017 training data along with data augmentation to train the x-vector model. The model used a five-fold data augmentation procedure with room reverberation as well as additive noises like music, noise and babble (MUSAN corpus [51]) at SNR values of 5, 10, 15 and 20 dB. The entire model is trained using the Kaldi recipe [45]. We report the results using x-vector features for the LRE2017 evaluation in Table II. We also report the performance of the x-vector system directly using the posteriors from the trained network (denoted as x-post).

E. DNN and LSTM baseline

We implement the best performing LSTM model [36], which is a two layer LSTM with 512 units in each layer followed by an output softmax layer, as a baseline end-to-end system. We also compare the performance of the proposed models with a recently developed DNN with attention [27].

F. Proposed Relevance Models for LID

The i-BLSTM model uses an i-vector extraction setup similar to the one used in the baseline system. The audio recordings are chunked into 1000 msec. segments with a shift of 200 msec. The segment i-vectors are used to train the BLSTM model on the training data. We use 80,000 recordings of 15 sec. duration from the LRE training dataset, and the model is trained using a cross entropy loss. For training the proposed end-to-end HGRU model on the LRE17 data, audio snippets of duration ranging from 3 sec. to 30 sec. are randomly sampled from the training data. A similar setup is used for the RATS dataset, where around 100,000 samples of 15 sec. duration are drawn for training the i-BLSTM model, and random sampling of 3 sec. and 30 sec. snippets is done for the end-to-end HGRU model. Finally, we also experiment with replacing the i-vector features with the x-vector features in the BLSTM model (this model is referred to as the x-BLSTM system).

V. LANGUAGE RECOGNITION RESULTS

A. LRE Evaluation Results

The results for the LRE evaluations are listed in Table II. The relevance weighted models proposed in this paper are the relevance weighted Baum-Welch statistics (RWBW), the hybrid i-vector BLSTM model (i-BLSTM) and the hierarchical GRU model (HGRU).

As seen in Table II, in terms of the Cavg, the baseline neural models like the LSTM [36] and the DNN with attention [27] perform comparatively well on short durations (3 sec. and 10 sec.). However, on the longer durations of 30 sec. and 1000 sec., the i-vector based LDA-SVM system provides the best performance. This indicates the limitations of the previous models in incorporating long-term dependencies. The proposed relevance based approaches compare favorably with the i-vector SVM baseline in all conditions. The RWBW system provides the best performance in the long duration (1000 sec.) condition, as there are a large number of short segment i-vectors in these recordings. However, in the short duration conditions, the RWBW is inferior to the other proposed methods. The HGRU model performs similarly to the previously proposed DNN-attn model. The best system in terms of the Cavg and the EER metric is the x-BLSTM system.

B. Results on Noisy LRE Data

The results with the various noisy versions of the evaluation data are shown in Table III and Table IV. For the experiments with noisy data (Table III), among the various baseline systems, the LSTM model [36] shows the best robustness. The proposed hybrid i-BLSTM model provides significant improvements over all the models considered, for all SNR conditions.


TABLE IV: Performance of the systems with partial noise (50% of the utterance, 10 sec. recording condition) at various SNR levels. The x-vector based systems are shown in parentheses.

Accuracy (%)
SNR  | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | i-BLSTM (x-BLSTM) | HGRU  | (x-post) | (x-BLSTM-E2E)
5dB  | 53.16 (57.36)         | 56.50     | 54.64         | 59.79 (61.41)     | 55.32 | (58.85)  | (61.68)
10dB | 55.57 (61.07)         | 60.42     | 58.22         | 63.02 (63.56)     | 60.20 | (61.44)  | (63.64)
15dB | 58.33 (64.10)         | 62.61     | 61.61         | 65.90 (65.66)     | 63.16 | (63.05)  | (65.60)
20dB | 59.56 (65.53)         | 64.61     | 63.55         | 68.16 (66.51)     | 65.65 | (63.79)  | (66.16)
Avg. | 56.65 (62.01)         | 61.04     | 59.50         | 64.20 (64.29)     | 61.08 | (61.78)  | (64.27)

Cavg
SNR  | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | i-BLSTM (x-BLSTM) | HGRU | (x-post) | (x-BLSTM-E2E)
5dB  | 0.54 (0.66)           | 0.52      | 0.50          | 0.47 (0.38)       | 0.57 | (0.45)   | (0.44)
10dB | 0.50 (0.61)           | 0.47      | 0.44          | 0.43 (0.35)       | 0.51 | (0.41)   | (0.41)
15dB | 0.47 (0.58)           | 0.44      | 0.41          | 0.39 (0.33)       | 0.47 | (0.38)   | (0.38)
20dB | 0.45 (0.56)           | 0.42      | 0.39          | 0.37 (0.32)       | 0.44 | (0.37)   | (0.37)
Avg. | 0.49                  | 0.46      | 0.44          | 0.42 (0.35)       | 0.50 | (0.40)   | (0.40)

EER (%)
SNR  | i-LDA-SVM (x-LDA-SVM) | LSTM [36] | DNN-Attn [27] | i-BLSTM (x-BLSTM) | HGRU  | (x-post) | (x-BLSTM-E2E)
5dB  | 17.56 (22.32)         | 14.59     | 13.28         | 12.30 (10.67)     | 15.36 | (13.81)  | (12.54)
10dB | 15.79 (20.66)         | 13.10     | 11.52         | 10.84 (9.75)      | 13.05 | (12.41)  | (11.22)
15dB | 14.32 (18.75)         | 11.80     | 10.45         | 9.60 (9.04)       | 11.89 | (11.60)  | (10.62)
20dB | 13.70 (17.97)         | 11.20     | 9.64          | 8.95 (8.59)       | 11.08 | (11.34)  | (10.22)
Avg. | 15.34 (19.92)         | 12.67     | 11.22         | 10.42 (9.51)      | 12.85 | (12.29)  | (11.15)

Fig. 4: Plot of a partially noisy (noise present in the first half of the audio) LRE-17 audio file spectrogram and the corresponding attention weights estimated every 1000 msec. for the proposed HGRU model and the i-BLSTM model. (Panels: spectrogram; attention weights for the HGRU model, clean vs. noisy audio; attention weights for the 1s-BLSTM attention model, clean vs. noisy audio.)

The hierarchical GRU model also improves over the LDA-SVM baseline, but the performance of this system is inferior to the hybrid x-BLSTM system as well as the DNN-attn model [27]. On average, the proposed x-BLSTM model shows a relative improvement of 28.8% over the baseline i-LDA-SVM in terms of the accuracy measure.

In the partially noisy conditions (Table IV), the trends are similar. The baseline performance of the LDA-SVM system is improved by the DNN-attn model. However, the proposed i-BLSTM model improves over the previous neural network based approaches for LID. On average, the x-BLSTM model improves over the baseline LDA-SVM relatively by 28.5% in terms of the accuracy measure.

The reason for the counterintuitive lack of improvement for the partial noise over the full noise case for the baseline systems (comparing Table III and Table IV for the x-post system results) may be attributed to the inability of baseline methods like x-post to model the non-stationarity of the signal (partial noise). The results for the proposed approach (x-BLSTM) in terms of [Acc]{Cavg}(EER) reported in Table III are [63.15]{0.37}(10.80). These results are improved for the partial noise case in Table IV to [64.29]{0.35}(9.51). We attribute this improvement to the ability of the attention modeling to re-weight the relevance towards regions of the signal that are less noisy.

TABLE V: Performance of the baseline system and the proposed models on the RATS dataset in terms of accuracy, Cavg and EER.

Duration (sec.) | i-LDA-SVM | i-BLSTM | HGRU
Accuracy (%)
3               | 65.39     | 64.16   | 69.78
10              | 77.46     | 79.07   | 81.76
30              | 85.38     | 87.12   | 87.89
120             | 92.22     | 92.69   | 91.43
Cavg
3               | 1.12      | 1.04    | 0.84
10              | 0.81      | 0.63    | 0.57
30              | 0.57      | 0.43    | 0.43
120             | 0.40      | 0.32    | 0.35
EER (%)
3               | 25.90     | 24.32   | 19.56
10              | 17.81     | 13.69   | 12.23
30              | 11.88     | 8.61    | 8.87
120             | 7.64      | 5.74    | 6.81

C. Results on RATS Dataset

Here, we compare the performance of the i-vector LDA-SVM with the proposed approaches of the hybrid i-BLSTM system and the end-to-end HGRU model. As seen in the results, the proposed approaches yield significant improvements over the baseline system in terms of all the performance metrics and for all the duration conditions. The HGRU model is more efficient in the short duration conditions, while the i-BLSTM model improves over the HGRU model for the longer duration, full length (120 sec.) recording condition.


Fig. 5: Plot of (a) a clean LRE utterance from the validation data, (b) the same utterance with 10 dB babble noise, (c) segment level SNR measured with 1000 msec. windows shifted by 200 msec. and (d) the corresponding attention weights estimated from the i-BLSTM model for 1000 msec. windows shifted by 200 msec.


D. Results with x-vector embeddings

From Table II, we observe that the x-vector based LDA-SVM improves over the baseline i-vector based system with a similar back-end in terms of the Cavg measure for all durations. The proposed BLSTM based backend improves over the x-LDA-SVM system significantly for the short duration (3 sec.) condition. However, for the longer durations, the baseline x-vector LDA-SVM provides better performance than the proposed model. The x-vector network posteriors directly used for LRE are also worse compared to the use of the SVM backend (especially for long durations).

The results on the noisy LRE17 dataset and the partially noisy dataset are given in Table III and Table IV respectively. As seen here, the x-vector embeddings provide noticeable improvements over their i-vector counterparts. Further, the proposed backend approach of x-BLSTM provides significant improvements over the baseline system both in the noisy and the partially noisy case. The relative improvements in terms of the Cavg metric over the baseline system range from 34-43% for the noisy conditions and are about 42% for the partially noisy conditions. It is also noteworthy that the proposed approach suffers from a relative performance degradation of less than 20% (in terms of Cavg) compared with clean conditions when the SNR is above 15 dB, while the baseline x-LDA-SVM system suffers from more than 40% relative degradation. These results are also in alignment with the i-vector based results.

VI. DISCUSSION AND ANALYSIS

A. Importance of Relevance Modeling

The plot shown in Fig. 4 illustrates the spectrogram of a partially noisy audio recording (noise present in the first half of the recording) and the corresponding attention weights from the HGRU and the i-BLSTM models. The attention weights for the same audio recording without any noise (clean condition) are also shown for reference. As seen in this plot, the attention weights are considerably changed due to the presence of the noise in the first half of the file. The model is able to adapt and focus the language detection on the more reliable regions of the audio. The increased sensitivity of the i-BLSTM system also potentially explains the improved performance of the i-BLSTM model in the experiments reported in Table III and Table IV.

Specifically, in the presence of noise, different parts of the signal have different signal-to-noise ratio values (as the speech signal is time-varying even if the noise is stationary). This can be visualized using the example given in Fig. 5. While the SNR for this entire speech utterance is 10 dB, the SNR measured at the short-term segment level is highly time varying (ranging between 20 dB and −10 dB). The goal of the relevance models for LID proposed in this work is to de-emphasize the contributions from the noisy regions while enhancing the contribution of the high SNR regions in making the decision about the language identity.


An example of the attention weights from the i-BLSTM model is also shown in Fig. 5 (d). As seen in this plot, the attention weights tend to track the short-term SNR, thereby giving the embeddings from regions of high SNR higher weights while reducing the weights associated with regions of low SNR.

B. End-to-end LID with x-vector

With the use of the x-vector based embeddings in the proposed approach, we explore whether the proposed attention based BLSTM and the x-vector embedding model can be jointly trained. This model is referred to as the x-BLSTM-E2E. The results for the clean LRE17 test set are reported in Table II and the noisy LRE results are reported in Table III and Table IV. As seen here, the end-to-end system improves over the x-BLSTM system for short durations, while the performance is degraded for the longer duration conditions. It is also important to note that the performance of the x-BLSTM-E2E is better than that of the x-post system which uses the x-vector network outputs directly for LRE. These results indicate that the attention based backend modeling is important for the noisy LRE task.

VII. CONCLUSION

In this paper, we have proposed a new hybrid i-vector/x-vector BLSTM attention model (i/x-BLSTM) for language recognition, where the sequence of 1000 msec. i-vectors/x-vectors is modeled in a bidirectional attention based network. A novel processing pipeline with hierarchical gated recurrent units (HGRU) or an x-vector BLSTM model is also proposed for end-to-end spoken language recognition. While the conventional backend systems based on LDA-SVM classifiers ignore sequential speech information, the attention mechanism in the proposed model plays the role of relevance weighting, where portions of the speech signal more relevant to the classification decision are given higher weight. This relevance modeling is shown, in particular, to improve the robustness of the model for language recognition in noisy conditions. With several experiments on the noisy LRE dataset as well as the RATS dataset, we show the effectiveness of the proposed model. We also present a detailed analysis which highlights the role of attention modeling in language recognition.

ACKNOWLEDGMENT

The authors would like to acknowledge Shreyas Ramoji, Satish Kumar and Vaishnavi Y for their help in setting up the LRE 2017 baseline, and Omid Sadjadi for code fragments implementing the NIST LRE baseline system. The work reported in this paper was partly funded by grants from the Department of Science and Technology (EMR/2016/007934) and the Department of Atomic Energy (DAE/34/20/12/2018-BRNS/34088).

REFERENCES

[1] Alex Waibel, Petra Geutner, L Mayfield Tomokiyo, Tanja Schultz, and Monika Woszczyna, "Multilinguality in speech and spoken language systems," Proceedings of the IEEE, vol. 88, no. 8, pp. 1297–1313, 2000.

[2] Tanja Schultz and Alex Waibel, "Language-independent and language-adaptive acoustic modeling for speech recognition," Speech Communication, vol. 35, no. 1-2, pp. 31–51, 2001.

[3] Ciprian Chelba, Timothy J Hazen, and Murat Saraclar, "Retrieval and browsing of spoken content," IEEE Signal Processing Magazine, vol. 25, no. 3, 2008.

[4] Kevin Walker and Stephanie Strassel, "The RATS radio traffic collection system," in Odyssey 2012 - The Speaker and Language Recognition Workshop, 2012.

[5] Haizhou Li, Bin Ma, and Kong Aik Lee, "Spoken language recognition: from fundamentals to practice," Proceedings of the IEEE, vol. 101, no. 5, pp. 1136–1159, 2013.

[6] Sriram Ganapathy, Mohamed Omar, and Jason Pelecanos, "Unsupervised channel adaptation for language identification using co-training," in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 6857–6861.

[7] Mitchell McLaren, Diego Castan, and Luciana Ferrer, "Analyzing the effect of channel mismatch on the SRI language recognition evaluation 2015 system," in Proc. Odyssey: Speaker Lang. Recognit. Workshop, 2016, pp. 188–195.

[8] M. A. Zissman, "Comparison of four approaches to automatic language identification of telephone speech," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, pp. 31, 1996.

[9] Jiri Navratil, "Spoken language recognition - a step toward multilinguality in speech processing," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 678–685, 2001.

[10] Mhamed Faouzi BenZeghiba, Jean-Luc Gauvain, and Lori Lamel, "Phonotactic language recognition using MLP features," in Interspeech, 2012.

[11] Fred Richardson, Douglas Reynolds, and Najim Dehak, "Deep neural network approaches to speaker and language recognition," IEEE Signal Processing Letters, vol. 22, no. 10, pp. 1671–1675, 2015.

[12] Najim Dehak, Pedro A Torres-Carrasquillo, Douglas Reynolds, and Reda Dehak, "Language recognition via i-vectors and dimensionality reduction," in Twelfth Annual Conference of the International Speech Communication Association, 2011.

[13] Najim Dehak, Patrick J Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[14] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, no. 1-3, pp. 19–41, 2000.

[15] Yun Lei, Nicolas Scheffer, Luciana Ferrer, and Mitchell McLaren, "A novel scheme for speaker recognition using a phonetically-aware deep neural network," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 1695–1699.

[16] Sriram Ganapathy, Kyu Han, Samuel Thomas, Mohamed Omar, Maarten Van Segbroeck, and Shrikanth S Narayanan, "Robust language identification using convolutional neural network features," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[17] Bharat Padi, Shreyas Ramoji, Vaishnavi Yeruva, Satish Kumar, and Sriram Ganapathy, "The LEAP language recognition system for LRE 2017 challenge - improvements and error analysis," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 31–38.

[18] David Snyder, Daniel Garcia-Romero, and Daniel Povey, "Time delay deep neural network-based universal background models for speaker recognition," in Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015, pp. 92–97.

[19] David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur, "Spoken language recognition using x-vectors," in Odyssey, 2018, pp. 105–111.

[20] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.

[21] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Rich Zemel, and Yoshua Bengio, "Show, attend and tell: Neural image caption generation with visual attention," in International Conference on Machine Learning, 2015, pp. 2048–2057.

[22] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, 2015, pp. 577–585.

[23] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, "End-to-end attention-based large vocabulary speech recognition," in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4945–4949.

[24] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.

[25] Shubham Bansal, Karan Malhotra, and Sriram Ganapathy, "Speaker and language aware training for end-to-end ASR," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2019, pp. 494–501.

[26] KV Mounika, Sivanand Achanta, HR Lakshmi, Suryakanth V Gangashetty, and Anil Kumar Vuppala, "An investigation of deep neural network architectures for language recognition in Indian languages," in Interspeech, 2016, pp. 2930–2933.

[27] Ravi Kumar Vuddagiri, Hari Krishna Vydana, and Anil Kumar Vuppala, "Curriculum learning based approach for noise robust language identification using DNN with attention," Expert Systems with Applications, vol. 110, pp. 290–297, 2018.

[28] Seyed Omid Sadjadi et al., "The 2017 NIST language recognition evaluation," in Proc. Odyssey, Les Sables d'Olonne, France, June 2018.

[29] Bharat Padi, Anand Mohan, and Sriram Ganapathy, "Attention based hybrid i-vector BLSTM model for language recognition," in Proc. Interspeech 2019, pp. 1263–1267, 2019.

[30] Bharat Padi, Anand Mohan, and Sriram Ganapathy, "End-to-end language recognition using attention based hierarchical gated recurrent unit models," in Proc. ICASSP, 2019.

[31] Daniel Garcia-Romero and Carol Y Espy-Wilson, "Analysis of i-vector length normalization in speaker recognition systems," in Twelfth Annual Conference of the International Speech Communication Association, 2011.

[32] Sepp Hochreiter and Jurgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[33] Felix A Gers, Jurgen Schmidhuber, and Fred Cummins, "Learning to forget: Continual prediction with LSTM," IET, 1999.

[34] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jurgen Schmidhuber, et al., "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," 2001.

[35] Ignacio Lopez-Moreno, Javier Gonzalez-Dominguez, Oldrich Plchot, David Martinez, Joaquin Gonzalez-Rodriguez, and Pedro Moreno, "Automatic language identification using deep neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 5337–5341.

[36] Ruben Zazo, Alicia Lozano-Diez, and Joaquin Gonzalez-Rodriguez, "Evaluation of an LSTM-RNN system in different NIST language recognition frameworks," in Proc. of Odyssey 2016 Speaker and Language Recognition Workshop. ATVS-UAM, June 2016.

[37] Hemant Misra, Herve Bourlard, and Vivek Tyagi, "New entropy based combination rules in HMM/ANN multi-stream ASR," in 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03). IEEE, 2003, vol. 2, pp. II-741.

[38] Fabio Valente, "A novel criterion for classifiers combination in multi-stream speech recognition," IEEE Signal Processing Letters, vol. 16, no. 7, pp. 561–564, 2009.

[39] Mitchell McLaren, Martin Graciarena, and Yun Lei, "SoftSAD: Integrated frame-based speech confidence for speaker recognition," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 4694–4698.

[40] Paul J Werbos et al., "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.

[41] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al., "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," arXiv preprint arXiv:1603.04467, 2016.

[42] Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv preprint arXiv:1406.1078, 2014.

[43] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in NIPS 2014 Workshop on Deep Learning, December 2014.

[44] Mike Schuster, Kuldip K. Paliwal, and A. General, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, 1997.

[45] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011, number EPFL-CONF-192584.

[46] H Guenter Hirsch, "FaNT - filtering and noise adding tool," Niederrhein University of Applied Sciences, http://dnt.kr.hsnr.de/download.html, 2005.

[47] K. J. Han, S. Ganapathy, M. Li, M. Omar, and S. Narayanan, "TRAP language identification system for RATS phase II evaluation," in Interspeech. ISCA, 2013.

[48] Jongseo Sohn, Nam Soo Kim, and Wonyong Sung, "A statistical model-based voice activity detection," IEEE Signal Processing Letters, vol. 6, pp. 1–3, 1999.

[49] N. Brummer, S. Cumani, O. Glembek, M. Karafiat, and P. Matejka, "Description and analysis of the Brno system for LRE2011," Odyssey 2012 - The Speaker and Language Recognition Workshop, 2012.

[50] Andrew O Hatch, Sachin Kajarekar, and Andreas Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in Ninth International Conference on Spoken Language Processing, 2006.

[51] David Snyder, Guoguo Chen, and Daniel Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.

Bharat Padi received his Bachelor of Engineering in Electrical and Electronics Engineering from Andhra University, Visakhapatnam, India in 2009 and his Master's Degree in System Science and Automation from the Indian Institute of Science (IISc), Bengaluru, India in 2018. Between 2009 and 2016, he worked as a systems engineer in Infosys Technologies and later as an Assistant Manager (Electrical) in the Visakhapatnam steel plant. He is currently working at minds.ai as a neural network research engineer, developing controllers for vehicles using deep reinforcement learning. His current research interests include developing deep end-to-end models for spoken language recognition and reinforcement learning. He is a member of the IEEE.

Anand Mohan is currently working as an Applied Scientist in the Alexa Speech team at Amazon, Bangalore, India. He received his Master's Degree in Artificial Intelligence from the Indian Institute of Science (IISc), Bangalore, India in 2019 and his Bachelor's Degree in Electronics and Communication Engineering from the National Institute of Technology, Calicut, India in 2015. His research interests include end-to-end ASR technologies, and speaker and language identification.

Sriram Ganapathy is a faculty member in Electrical Engineering at the Indian Institute of Science, Bangalore, where he heads the activities of the Learning and Extraction of Acoustic Patterns (LEAP) lab. Prior to joining the Indian Institute of Science in 2016, he was a research staff member at the IBM Watson Research Center, Yorktown Heights. He received his Doctor of Philosophy from the Center for Language and Speech Processing, Johns Hopkins University in 2012. He obtained his Bachelor of Technology from the College of Engineering, Trivandrum, India in 2004 and his Master of Engineering from the Indian Institute of Science, Bangalore in 2006. He has also worked as a Research Assistant at the Idiap Research Institute, Switzerland from 2006 to 2008.

At the LEAP lab, his research interests include signal processing, machine learning and deep learning methodologies for speech/speaker recognition and auditory neuroscience. He is a subject editor for the Speech Communication journal, a member of ISCA and a senior member of the IEEE.