domain attention with an ensemble of experts · mainwithlimitedsupervisiongivenk ex-isting domains....

11
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 643–653 Vancouver, Canada, July 30 - August 4, 2017. c 2017 Association for Computational Linguistics https://doi.org/10.18653/v1/P17-1060 Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 643–653 Vancouver, Canada, July 30 - August 4, 2017. c 2017 Association for Computational Linguistics https://doi.org/10.18653/v1/P17-1060 Domain Attention with an Ensemble of Experts Young-Bum Kim Karl Stratos Dongchan Kim Microsoft AI and Research Bloomberg L. P. {ybkim, dongchan.kim}@microsoft.com [email protected] Abstract An important problem in domain adapta- tion is to quickly generalize to a new do- main with limited supervision given K ex- isting domains. One approach is to re- train a global model across all K +1 do- mains using standard techniques, for in- stance Daum´ e III (2009). However, it is desirable to adapt without having to re- estimate a global model from scratch each time a new domain with potentially new intents and slots is added. We describe a solution based on attending an ensemble of domain experts. We assume K domain- specific intent and slot models trained on respective domains. When given domain K +1, our model uses a weighted combi- nation of the K domain experts’ feedback along with its own opinion to make predic- tions on the new domain. In experiments, the model significantly outperforms base- lines that do not use domain adaptation and also performs better than the full re- training approach. 1 Introduction An important problem in domain adaptation is to quickly generalize to a new domain with limited supervision given K existing domains. In spo- ken language understanding, new domains of in- terest for categorizing user utterances are added on a regular basis 1 . For instance, we may 1 A scenario frequently arising in practice is having a re- quest for creating a new virtual domain targeting a specific application. One typical use case is that of building natural language capability through intent and slot modeling (with- out actually building a domain classifier) targeting a specific application. add ORDERPIZZA domain and desire a domain- specific intent and semantic slot tagger with a lim- ited amount of training data. Training only on the target domain fails to utilize the existing resources in other domains that are relevant (e.g., labeled data for PLACES domain with place name, location as the slot types), but naively training on the union of all domains does not work well since different domains can have widely varying distributions. Domain adaptation offers a balance between these extremes by using all data but simultane- ously distinguishing domain types. A common approach for adapting to a new domain is to re- train a global model across all K +1 domains us- ing well-known techniques, for example the fea- ture augmentation method of Daum´ e III (2009) which trains a single model that has one domain- invariant component along with K +1 domain- specific components each of which is specialized in a particular domain. While such a global model is effective, it requires re-estimating a model from scratch on all K +1 domains each time a new do- main is added. This is burdensome particularly in our scenario in which new domains can arise fre- quently. In this paper, we present an alternative solu- tion based on attending an ensemble of domain experts. We assume that we have already trained K domain-specific models on respective domains. Given a new domain K +1 with a small amount of training data, we train a model on that data alone but queries the K experts as part of the training procedure. We compute an attention weight for each of these experts and use their combined feed- back along with the model’s own opinion to make predictions. This way, the model is able to selec- tively capitalize on relevant domains much like in 643

Upload: duongthien

Post on 29-Aug-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 643–653Vancouver, Canada, July 30 - August 4, 2017. c©2017 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17-1060

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 643–653Vancouver, Canada, July 30 - August 4, 2017. c©2017 Association for Computational Linguistics

https://doi.org/10.18653/v1/P17-1060

Domain Attention with an Ensemble of Experts

Young-Bum Kim† Karl Stratos‡ Dongchan Kim†

†Microsoft AI and Research‡Bloomberg L. P.

{ybkim, dongchan.kim}@[email protected]

Abstract

An important problem in domain adapta-tion is to quickly generalize to a new do-main with limited supervision givenK ex-isting domains. One approach is to re-train a global model across all K + 1 do-mains using standard techniques, for in-stance Daume III (2009). However, it isdesirable to adapt without having to re-estimate a global model from scratch eachtime a new domain with potentially newintents and slots is added. We describe asolution based on attending an ensembleof domain experts. We assume K domain-specific intent and slot models trained onrespective domains. When given domainK + 1, our model uses a weighted combi-nation of the K domain experts’ feedbackalong with its own opinion to make predic-tions on the new domain. In experiments,the model significantly outperforms base-lines that do not use domain adaptationand also performs better than the full re-training approach.

1 Introduction

An important problem in domain adaptation is toquickly generalize to a new domain with limitedsupervision given K existing domains. In spo-ken language understanding, new domains of in-terest for categorizing user utterances are addedon a regular basis1. For instance, we may

1A scenario frequently arising in practice is having a re-quest for creating a new virtual domain targeting a specificapplication. One typical use case is that of building naturallanguage capability through intent and slot modeling (with-out actually building a domain classifier) targeting a specificapplication.

add ORDERPIZZA domain and desire a domain-specific intent and semantic slot tagger with a lim-ited amount of training data. Training only on thetarget domain fails to utilize the existing resourcesin other domains that are relevant (e.g., labeleddata for PLACES domain with place name,location as the slot types), but naively trainingon the union of all domains does not work wellsince different domains can have widely varyingdistributions.

Domain adaptation offers a balance betweenthese extremes by using all data but simultane-ously distinguishing domain types. A commonapproach for adapting to a new domain is to re-train a global model across all K + 1 domains us-ing well-known techniques, for example the fea-ture augmentation method of Daume III (2009)which trains a single model that has one domain-invariant component along with K + 1 domain-specific components each of which is specializedin a particular domain. While such a global modelis effective, it requires re-estimating a model fromscratch on all K + 1 domains each time a new do-main is added. This is burdensome particularly inour scenario in which new domains can arise fre-quently.

In this paper, we present an alternative solu-tion based on attending an ensemble of domainexperts. We assume that we have already trainedK domain-specific models on respective domains.Given a new domainK+1 with a small amount oftraining data, we train a model on that data alonebut queries the K experts as part of the trainingprocedure. We compute an attention weight foreach of these experts and use their combined feed-back along with the model’s own opinion to makepredictions. This way, the model is able to selec-tively capitalize on relevant domains much like in

643

Page 2: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

standard domain adaptation but without explicitlyre-training on all domains together.

In experiments, we show clear gains in a do-main adaptation scenario across 7 test domains,yielding average error reductions of 44.97% forintent classification and 32.30% for slot taggingcompared to baselines that do not use domainadaptation. Moreover we have higher accuracythan the full re-training approach of Kim et al.(2016c), a neural analog of Daume III (2009).

2 Related Work

2.1 Domain Adaptation

There is a venerable history of research on do-main adaptation (Daume III and Marcu, 2006;Daume III, 2009; Blitzer et al., 2006, 2007; Panet al., 2011) which is concerned with the shiftin data distribution from one domain to another.In the context of NLP, a particularly successfulapproach is the feature augmentation method ofDaume III (2009) whose key insight is that if wepartition the model parameters to those that handlecommon patterns and those that handle domain-specific patterns, the model is forced to learn fromall domains yet preserve domain-specific knowl-edge. The method is generalized to the neu-ral paradigm by Kim et al. (2016c) who jointlyuse a domain-specific LSTM and also a globalLSTM shared across all domains. In the con-text of SLU, Jaech et al. (2016) proposed Kdomain-specific feedforward layers with a sharedword-level LSTM layer across domains; Kimet al. (2016c) instead employed K + 1 LSTMs.Hakkani-Tur et al. (2016) proposed to employ asequence-to-sequence model by introducing a fic-titious symbol at the end of an utterance of whichtag represents the corresponding domain and in-tent.

All these methods require one to re-train amodel from scratch to make it learn the correlationand invariance between domains. This becomesdifficult to scale when there is a new domain com-ing in at high frequency. We address this problemby proposing a method that only calls K traineddomain experts; we do not have to re-train thesedomain experts. This gives a clear computationaladvantage over the feature augmentation method.

2.2 Spoken Language Understanding

Recently, there has been much investment on thepersonal digital assistant (PDA) technology in in-

dustry (Sarikaya, 2015; Sarikaya et al., 2016). Ap-ples Siri, Google Now, Microsofts Cortana, andAmazons Alexa are some examples of personaldigital assistants. Spoken language understanding(SLU) is an important component of these exam-ples that allows natural communication betweenthe user and the agent (Tur, 2006; El-Kahky et al.,2014). PDAs support a number of scenarios in-cluding creating reminders, setting up alarms, notetaking, scheduling meetings, finding and consum-ing entertainment (i.e. movie, music, games), find-ing places of interest and getting driving directionsto them (Kim et al., 2016a).

Naturally, there has been an extensive line ofprior studies for domain scaling problems to eas-ily scale to a larger number of domains: pre-training (Kim et al., 2015c), transfer learning (Kimet al., 2015d), constrained decoding with a sin-gle model (Kim et al., 2016a), multi-task learn-ing (Jaech et al., 2016), neural domain adap-tation (Kim et al., 2016c), domainless adapta-tion (Kim et al., 2016b), a sequence-to-sequencemodel (Hakkani-Tur et al., 2016), adversary do-main training (Kim et al., 2017) and zero-shotlearning(Chen et al., 2016; Ferreira et al., 2015).

There are also a line of prior works on enhanc-ing model capability and features: jointly mod-eling intent and slot predictions (Jeong and Lee,2008; Xu and Sarikaya, 2013; Guo et al., 2014;Zhang and Wang, 2016; Liu and Lane, 2016a,b),modeling SLU models with web search click logs(Li et al., 2009; Kim et al., 2015a) and enhancingfeatures, including representations (Anastasakoset al., 2014; Sarikaya et al., 2014; Celikyilmazet al., 2016, 2010; Kim et al., 2016d) and lexicon(Liu and Sarikaya, 2014; Kim et al., 2015b).

3 Method

We use an LSTM simply as a mapping φ : Rd ×Rd′ → Rd′ that takes an input vector x and a statevector h to output a new state vector h′ = φ(x, h).See Hochreiter and Schmidhuber (1997) for a de-tailed description. At a high level, the individ-ual model consists of builds on several ingredientsshown in Figure 1: character and word embed-ding, a bidirectional LSTM (BiLSTM) at a charac-ter layer, a BiLSTM at word level, and feedfowardnetwork at the output.

644

Page 3: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

…𝑔1𝑡

𝑦1𝑡

Feedforward𝑔2𝑡

𝑦2𝑡 𝑦𝑛

𝑡

𝑔𝑛𝑡𝑔𝑖

Character

embeddingWord

embedding

Utterance

𝑤1

𝑐1,1 𝑐1,2 𝑐1,𝑚

𝑤2

𝑐2,1 𝑐2,2 𝑐2,𝑚

𝑤𝑛

𝑐𝑛,1 𝑐𝑛,2 𝑐𝑛,𝑚

Char-level

Bidirectional

LSTM…𝜙𝑓𝑐

𝜙𝑏𝑐

𝜙𝑓𝑐

𝜙𝑏𝑐

𝜙𝑓𝑐

𝜙𝑏𝑐

…𝜙𝑓𝑐

𝜙𝑏𝑐

𝜙𝑓𝑐

𝜙𝑏𝑐

𝜙𝑓𝑐

𝜙𝑏𝑐

…𝜙𝑓𝑐

𝜙𝑏𝑐

𝜙𝑓𝑐

𝜙𝑏𝑐

𝜙𝑓𝑐

𝜙𝑏𝑐

𝑦𝑖

𝜙𝑓𝑤

𝜙𝑏𝑤

Word-level

Bidirectional

LSTM

…𝜙𝑓𝑤

𝜙𝑏𝑤

𝜙𝑓𝑤

𝜙𝑏𝑤

𝜙𝑓𝑖

𝜙𝑏𝑖

Figure 1: The overall network architecture of the individual model.

3.1 Individual Model ArchitectureLet C denote the set of character types and Wthe set of word types. Let ⊕ denote the vectorconcatenation operation. A wildly successful ar-chitecture for encoding a sentence (w1 . . . wn) ∈Wn is given by bidirectional LSTMs (BiLSTMs)(Schuster and Paliwal, 1997; Graves, 2012). Ourmodel first constructs a network over an utteranceclosely following Lample et al. (2016). The modelparameters Θ associated with this BiLSTM layerare

• Character embedding ec ∈ R25 for each c ∈C

• Character LSTMs φCf , φCb : R25×R25 → R25

• Word embedding ew ∈ R100 for each w ∈ W

• Word LSTMs φWf , φWb : R150×R100 → R100

Letw1 . . . wn ∈ W denote a word sequence whereword wi has character wi(j) ∈ C at position j.First, the model computes a character-sensitiveword representation vi ∈ R150 as

fCj = φCf(ewi(j), f

Cj−1)

∀j = 1 . . . |wi|bCj = φCb

(ewi(j), b

Cj+1

)∀j = |wi| . . . 1

vi = fC|wi| ⊕ bC1 ⊕ ewi

for each i = 1 . . . n.2 Next, the model computes

fWi = φWf(vi, f

Wi−1)

∀i = 1 . . . n

bWi = φWb(vi, b

Wi+1

)∀i = n . . . 1

and induces a character- and context-sensitiveword representation hi ∈ R200 as

hi = fWi ⊕ bWi (1)

for each i = 1 . . . n. These vectors can be used toperform intent classification or slot tagging on theutterance.

Intent Classification We can predict the intentof the utterance using (h1 . . . hn) ∈ R200 in (1)as follows. Let I denote the set of intent types.We introduce a single-layer feedforward networkgi : R200 → R|I| whose parameters are denotedby Θi. We compute a |I|-dimensional vector

µi = gi

(n∑

i=1

hi

)

and define the conditional probability of the cor-rect intent τ as

p(τ |h1 . . . hn) ∝ exp(µiτ)

(2)

2For simplicity, we assume some random initial state vec-tors such as fC0 and bC|wi|+1 when we describe LSTMs.

645

Page 4: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

The intent classification loss is given by the nega-tive log likelihood:

Li(Θ,Θi

)= −

l

log p(τ (l)|h(l)

)(3)

where l iterates over intent-annotated utterances.

Slot Tagging We predict the semantic slots ofthe utterance using (h1 . . . hn) ∈ R200 in (1) asfollows. Let S denote the set of semantic types andL the set of corresponding BIO label types 3 thatis, L = {B-e : e ∈ E}∪{I-e : e ∈ E}∪{O}. Weadd a transition matrix T ∈ R|L|×|L| and a single-layer feedforward network gt : R200 → R|L| tothe network; denote these additional parametersby Θt. The conditional random field (CRF) tag-ging layer defines a joint distribution over labelsequences of y1 . . . yn ∈ L of w1 . . . wn as

p(y1 . . .yn|h1 . . . hn)

∝ exp

(n∑

i=1

Tyi−1,yi × gtyi(hi))

(4)

The tagging loss is given by the negative log like-lihood:

Lt(Θ,Θt

)= −

l

log p(y(l)|h(l)

)(5)

where l iterates over tagged sentences in the data.Alternatively, we can optimize the local loss:

Lt−loc(Θ,Θt

)= −

l

i

log p(y(l)i |h

(l)i

)

(6)

where p(yi|hi) ∝ exp(gtyi(hi)

).

4 Method

4.1 Domain Attention Architecture

Now we assume that for each of theK domains wehave an individual model described in Section 3.1.Denote these domain experts by Θ(1) . . .Θ(K).We now describe our model for a new domainK + 1. Given an utterance w1 . . . wn, it uses aBiLSTM layer to induce a feature representationh1 . . . hn as specified in (1). It further invokes Kdomain experts Θ(1) . . .Θ(K) on this utterance toobtain the feature representations h(k)1 . . . h

(k)n for

3For example, to/O San/B-Source Fran-cisco/I-Source airport/O.

Figure 2: The overall network architecture of thedomain attention, which consists of three compo-nents: (1) K domain experts + 1 target BiLSTMlayer to induce a feature representation, (2) K do-main experts + 1 target feedfoward layer to out-put pre-trained label embedding (3) a final feedfor-ward layer to output an intent or slot. We have twoseparate attention mechanisms to combine feed-back from domain experts.

k = 1 . . .K. For each word wi, the model com-putes an attention weight for each domain k =1 . . .K domains as

qdoti,k = h>i h

(k) (7)

in the simplest case. We also experiment with thebilinear function

qbii,k = h>i Bh

(k) (8)

where B is an additional model parameter, andalso the feedforward function

qfeedi,k = W tanh

(Uh>i + V h(k) + b1

)+ b2 (9)

where U, V,W, b1, b2 are additional model param-eters. The final attention weights a(1)i . . . a

(1)i are

obtained by using a softmax layer

ai,k =exp(qi,k)∑Kk=1 exp(qi,k)

(10)

The weighted combination of the experts’ feed-back is given by

hexpertsi =

K∑

k=1

ai,kh(k)i (11)

646

Page 5: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

and the model makes predictions by usingh1 . . . hn where

hi = hi ⊕ hexpertsi (12)

These vectors replace the original feature vectorshi in defining the intent or tagging losses.

4.2 Domain Attention Variants

We also consider two variants of the domain atten-tion architecture in Section 4.1.

Label Embedding In addition to the state vec-tors h(1) . . . h(K) produced by K experts, we fur-ther incorporate their final (discrete) label predic-tions using pre-trained label embeddings. We in-duce embeddings ey for labels y from all domainsusing the method of Kim et al. (2015d). At the i-thword, we predict the most likely label y(k) underthe k-th expert and compute an attention weight as

qdoti,k = h>i e

y(k) (13)

Then we compute an expectation over the experts’predictions

ai,k =exp(qi,k)∑Kk=1 exp(qi,k)

(14)

hlabeli =

K∑

k=1

ai,key(k)

i (15)

and use it in conjunction with hi. Note that thismakes the objective a function of discrete decisionand thus non-differentiable, but we can still opti-mize it in a standard way treating it as learning astochastic policy.

Selective Attention Instead of computing atten-tion over all K experts, we only consider the topK ′ ≤ K that predict the highest label scores. Weonly compute attention over these K ′ vectors. Weexperiment with various values of K ′

5 Experiments

In this section, we describe the set of experi-ments conducted to evaluate the performance ofour model. In order to fully assess the contri-bution of our approach, we also consider severalbaselines and variants besides our primary expertmodel.

Domain |I| |S| DescriptionEVENTS 10 12 Buy event ticketsFITNESS 10 9 Track healthM-TICKET 8 15 Buy movie ticketsORDERPIZZA 19 27 Order pizzaREMINDER 19 20 Remind task

TAXI 8 13 Find/book an cabTV 7 5 Control TV

Table 1: The number of intent types (|I|), the num-ber of slot types (|S|), and a short description ofthe test domains.

OverlappingDomain Intents SlotsEVENTS 70.00% 75.00%FITNESS 30.00% 77.78%M-TICKET 37.50% 100.00%

ORDERPIZZA 47.37% 74.07%REMINDER 68.42% 85.00%

TAXI 50.00% 100.00%TV 57.14% 60.00%AVG 51.49% 81.69%

Table 2: The overlapping percentage of intenttypes and slot types with experts or source do-mains.

5.1 Test domains and Tasks

To test the effectiveness of our proposed approach,we apply it to a suite of 7 Microsoft Cortana do-mains with 2 separate tasks in spoken language un-derstanding: (1) intent classification and (2) slot(label) tagging. The intent classification task isa multi-class classification problem with the goalof determining to which one of the |I| intents auser utterance belongs within a given domain. Theslot tagging task is a sequence labeling problemwith the goal of identifying entities and chunk-ing of useful information snippets in a user utter-ance. For example, a user could say “reserve atable at joeys grill for thursday at seven pm forfive people”. Then the goal of the first task wouldbe to classify this utterance as “make reservation”intent given the places domain, and the goal ofthe second task would be to tag “joeys grill” asrestaurant, “thursday” as date, “seven pm”as time, and “five” as number people.

The short descriptions on the 7 test domains areshown in Table 1. As the table shows, the testdomains have different granularity and diverse se-mantics. For each personal assistant test domain,

647

Page 6: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

we only used 1000 training utterances to simulatescarcity of newly labeled data. The amount of de-velopment and test utterance was 100 and 10k re-spectively.

The similarities of test domains, representedby overlapping percentage, with experts or sourcedomains are represented in Table 2. The in-tent overlapping percentage ranges from 30% onFITNESS domain to 70% on EVENTS, which av-erages out at 51.49%. And the slots for test do-mains overlaps more with those of source domainsranging from 60% on TV domain to 100% on bothM-TICKET and TAXI domains, which averagesout at 81.69%.

5.2 Experimental Setup

Category |D| ExampleTrans. 4 BUS, FLIGHTTime 4 ALARM, CALENDARMedia 5 MOVIE, MUSICAction 5 HOMEAUTO, PHONELoc. 3 HOTEL, BUSINESSInfo 4 WEATHER, HEALTH

TOTAL 25

Table 3: Overview of experts or source domains:Domain categories which have been created basedon the label embeddings. These categorizationsare solely for the purpose of describing domainsbecause of the limited space and they are com-pletely unrelated to the model. The number ofsentences in each domain is in the range of 50kto 660k and the number of unique intents and slotsare 200 and 500 respectively. In total, we have 25domain-specific expert models. For the averageperformance, intent accuracy is 98% and slot F1score is 96%.

In testing our approach, we consider a domainadaptation (DA) scenario, where a target domainhas a limited training data and the source domainhas a sufficient amount of labeled data. We furtherconsider a scenario, creating a new virtual domaintargeting a specific scenario given a large inven-tory of intent and slot types and underlying modelsbuild for many different applications and scenar-ios. One typical use case is that of building naturallanguage capability through intent and slot model-ing (without actually building a domain classifier)targeting a specific application. Therefore, our ex-perimental settings are rather different from previ-

ously considered settings for domain adaptation intwo aspects:

• Multiple source domains: In most previousworks, only a pair of domains (source vs. tar-get) have been considered, although they canbe easily generalized to K > 2. Here, weexperiment with K = 25 domains shown inTable 3.

• Variant output: In a typical setting for do-main adaptation, the label space is invariantacross all domains. Here, the label space canbe different in different domains, which isa more challenging setting. See Kim et al.(2015d) for details of this setting.

For this DA scenario, we test whether our ap-proach can effectively make a system to quicklygeneralize to a new domain with limited supervi-sion given K existing domain experts shown in 3.

In summary, our approach is tested with 7 Mi-crosoft Cortana personal assistant domains across2 tasks of intent classification and slot tagging.Below shows more detail of our baselines and vari-ants used in our experiments.

Baselines: All models below use same underly-ing architecture described in Section 3.1

• TARGET: a model trained on a targeted do-main without DA techniques.

• UNION: a model trained on the union of a tar-geted domain and 25 domain experts.

• DA: a neural domain adaptation method ofKim et al. (2016c) which trains domain spe-cific K LSTMs with a generic LSTM on alldomain training data.

Domain Experts (DE) variants: All models be-low are based on attending on an ensemble of25 domain experts (DE) described in Section 4.1,where a specific set of intent and slots models aretrained for each domain. We have two feedbackfrom domain experts: (1) feature representationfrom LSTM, and (2) label embedding from feed-foward described in Section 4.1 and Section 4.2,respectively.

• DEB: DE without domain attention mecha-nism. It uses the unweighted combination offirst feedback from experts like bag-of-wordmodel.

648

Page 7: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

• DE1: DE with domain attention with theweighted combination of the first feedbacksfrom experts.

• DE2: DE1 with additional weighted combina-tion of second feedbacks.

• DES2: DE2 with selected attention mecha-nism, described in Section 4.2.

In our experiments, all the models were imple-mented using Dynet (Neubig et al., 2017) andwere trained using Stochastic Gradient Descent(SGD) with Adam (Kingma and Ba, 2015)—anadaptive learning rate algorithm. We used the ini-tial learning rate of 4× 10−4 and left all the otherhyper parameters as suggested in Kingma and Ba(2015). Each SGD update was computed with-out a minibatch with Intel MKL (Math Kernel Li-brary)4. We used the dropout regularization (Sri-vastava et al., 2014) with the keep probability of0.4 at each LSTM layer.

To encode user utterances, we used bidirec-tional LSTMs (BiLSTMs) at the character leveland the word level, along with 25 dimensionalcharacter embedding and 100 dimensional wordembedding. The dimension of both the input andoutput of the character LSTMs were 25, and thedimensions of the input and output of the wordLSTMs were 1505 and 100, respectively. The di-mension of the input and output of the final feed-forward network for intent, and slot were 200 andthe number of their corresponding task. Its activa-tion was rectified linear unit (ReLU).

To initialize word embedding, we used wordembedding trained from (Lample et al., 2016). Inthe following sections, we report intent classifica-tion results in accuracy percentage and slot resultsin F1-score. To compute slot F1-score, we usedthe standard CoNLL evaluation script6

5.3 Results

We show our results in the DA setting where wehad a sufficient labeled dataset in the 25 sourcedomains shown in Table 3, but only 1000 labeleddata in the target domain. The performance of thebaselines and our domain experts DE variants areshown in Table 4. The top half of the table shows

4https://software.intel.com/en-us/articles/intelr-mkl-and-c-template-libraries

5We concatenated last two outputs from the characterLSTM and word embedding, resulting in 150 (25+25+100)

6http://www.cnts.ua.ac.be/conll2000/chunking/output.html

the results of intent classification and the results ofslot tagging is in the bottom half.

The baseline which trained only on the targetdomain (TARGET) shows a reasonably good per-formance, yielding on average 87.7% on the in-tent classification and 83.9% F1-score on the slottagging. Simply training a single model with ag-gregated utterance across all domains (UNION)brings the performance down to 77.4% and 75.3%.Using DA approach of Kim et al. (2016c) showsa significant increase in performance in all 7 do-mains, yielding on average 90.3% intent accuracyand 86.2%.

The DE without domain attention (DEB) showssimilar performance compared to DA. Using DEmodel with domain attention (DE1) shows an-other increase in performance, yielding on aver-age 90.9% intent accuracy and 86.9%. The per-formance increases again when we use both fea-ture representation and label embedding (DE2),yielding on average 91.4% and 88.2% and observenearly 93.6% and 89.1% when using selective at-tention (DES2). Note that DES2 selects the appro-priate number of experts per layer by evaluationon a development set.

The results show that our expert variant ap-proach (DES2) achieves a significant performancegain in all 7 test domains, yielding average er-ror reductions of 47.97% for intent classificationand 32.30% for slot tagging. The results suggestthat our expert approach can quickly generalize toa new domain with limited supervision given Kexisting domains by having only a handful moredata of 1k newly labeled data points. The poorperformance of using the union of both sourceand target domain data might be due to the rela-tively very small size of the target domain data,overwhelmed by the data in the source domain.For example, a word such as “home” can be la-beled as place type under the TAXI domain,but in the source domains can be labeled as ei-ther home screen under the PHONE domain orcontact name under the CALENDAR domain.

5.4 Training Time

The Figure 3 shows the time required for trainingDES2 and DA of Kim et al. (2016c). The trainingtime for DES2 stays almost constant as the numberof source domains increases. However, the train-ing time for DA grows exponentially in the num-ber of source domains. Specifically, when trained

649

Page 8: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

Task Domain TARGET UNION DA DEB DE1 DE2 DES2

Intent

EVENTS 88.3 78.5 89.9 93.1 92.5 92.7 94.5FITNESS 88.0 77.7 92.0 92.0 91.2 91.8 94.0M-TICKET 88.2 79.2 91.9 94.4 91.5 92.7 93.4ORDERPIZZA 85.8 76.6 87.8 89.3 89.4 90.8 92.8REMINDER 87.2 76.3 91.2 90.0 90.5 90.2 93.1

TAXI 87.3 76.8 89.3 89.9 89.6 89.2 93.7TV 88.9 76.4 90.3 81.5 91.5 92.0 94.0

AVG 87.7 77.4 90.3 90.5 90.9 91.4 93.6

Slot

EVENTS 84.8 76.1 87.1 87.4 88.1 89.4 90.2FITNESS 84.0 75.6 86.4 86.3 87.0 88.1 88.9M-TICKET 84.2 75.6 86.4 86.1 86.8 88.4 89.7ORDERPIZZA 82.3 73.6 84.2 84.4 85.0 86.3 87.1REMINDER 83.5 75.0 85.9 86.3 87.0 88.3 89.2

TAXI 83.0 74.6 85.6 85.5 86.3 87.5 88.6TV 85.4 76.7 87.7 87.6 88.3 89.3 90.1

AVG 83.9 75.3 86.2 86.2 86.9 88.2 89.1

Table 4: Intent classification accuracy (%) and slot tagging F1-score (%) of our baselines and variants ofDE. The numbers in boldface indicate the best performing methods.

Figure 3: Comparison of training time betweenour DES2 model and DA model of Kim et al.(2016c) as the number of domains increases. Thehorizontal axis means the number of domains, thevertical axis is training time per epoch in seconds.Here we use CALENDAR as the target domain,which has 1k training data.

with 1 source or expert domain, both took around aminute per epoch on average. When training withfull 25 source domains, DES2 took 3 minutes perepoch while DA took 30 minutes per epoch. Sincewe need to iterate over all 25+1 domains to re-trainthe global model, the net training time ratio couldbe over 250.

EVENTS

FITNESS

M-TICKET

ORDERPIZZA

REMINDER

TAXI

TV

Number of Experts

Inte

nt A

ccu

racy

(%

)

Figure 4: Learning curves in accuracy across allseven test domains as the number of expert do-mains increases.

5.5 Learning Curve

We also measured the performance of our methodsas a function of the number of domain experts. Foreach test domain, we consider all possible sizesof experts ranging from 1 to 25 and we then takethe average of the resulting performances obtainedfrom the expert sets of all different sizes. Figure 4shows the resulting learning curves for each testdomain. The overall trend is clear: as the more ex-pert domains are added, the more the test perfor-mance improves. With ten or more expert domainsadded, our method starts to get saturated achiev-

650

Page 9: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

REMINDER

TAXI

M-TICKET

0.92

0.84

0.93

Figure 5: Heatmap visualizing attention weights.

ing more than 90% in accuracy across all sevendomains.

5.6 Attention weightsFrom the heatmap shown in Figure 5, we can seethat the attention strength generally agrees withcommon sense. For example, the M-TICKET andTAXI domain selected MOVIE and PLACES astheir top experts, respectively.

5.7 Oracle Expert

Domain TARGET DE2 Top 1ALARM 70.1 98.2 ALARM (.99)HOTEL 65.2 96.9 HOTEL (.99)

Table 5: Intent classification accuracy with an or-acle expert in the expert pool.

The results in Table 5 show the intent classi-fication accuracy of DE2 when we already havethe same domain expert in the expert pool. Tosimulate such a situation, we randomly sampled1,000, 100, and 100 utterances from each domainas training, development and test data, respec-tively. In both ALARM and HOTEL domains, thetrained models only on the 1,000 training utter-ances (TARGET) achieved only 70.1%and 65.2%in accuracy, respectively. Whereas, with ourmethod (DE2) applied, we reached almost thefull training performance by selectively paying at-tention to the oracle expert, yielding 98.2% and96.9%, respectively. This result again confirmsthat the behavior of the trained attention networkindeed matches the semantic closeness betweendifferent domains.

5.8 Selective attentionThe results in Table 6 examines how the intent pre-diction accuracy of DES2 varies with respect to the

Domain Top 1 Top 3 Top 5 Top 25EVENTS 98.1 98.8 99.2 96.4

TV 81.4 82.0 81.7 80.9AVG 89.8 90.4 90.5 88.7

Table 6: Accuracies of DES2 using different num-ber of experts.

number of experts in the pool. The rationale be-hind DES2 is to alleviate the downside of soft at-tention, namely distributing probability mass overall items even if some are bad items. To deal withsuch issues, we apply a hard cut-off at top k do-mains. From the result, a threshold at top 3 or 5yielded better results than that of either 1 or 25 ex-perts. This matches our common sense that theirare only a few of domains that are close enough tobe of help to a test domain. Thus it is advisable tofind the optimal k value through several rounds ofexperiments on a development dataset.

6 Conclusion

In this paper, we proposed a solution for scal-ing domains and experiences potentially to a largenumber of use cases by reusing existing data la-beled for different domains and applications. Oursolution is based on attending an ensemble of do-main experts. When given a new domain, ourmodel uses a weighted combination of domainexperts’ feedback along with its own opinion tomake prediction on the new domain. In both in-tent classification and slot tagging tasks, the modelsignificantly outperformed baselines that do notuse domain adaptation and also performed betterthan the full re-training approach. This approachenables creation of new virtual domains througha weighted combination of domain experts’ feed-back reducing the need to collect and annotate thesimilar intent and slot types multiple times for dif-ferent domains. Future work can include an exten-sion of domain experts to take into account dialoghistory aiming for a holistic framework that canhandle contextual interpretation as well.

651

Page 10: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

ReferencesTasos Anastasakos, Young-Bum Kim, and Anoop De-

oras. 2014. Task specific continuous word repre-sentations for mono and multi-lingual spoken lan-guage understanding. In Acoustics, Speech and Sig-nal Processing (ICASSP), 2014 IEEE InternationalConference on. IEEE, pages 3246–3250.

John Blitzer, Mark Dredze, Fernando Pereira, et al.2007. Biographies, bollywood, boom-boxes andblenders: Domain adaptation for sentiment classi-fication. In ACL. volume 7, pages 440–447.

John Blitzer, Ryan McDonald, and Fernando Pereira.2006. Domain adaptation with structural correspon-dence learning. In Proceedings of the 2006 confer-ence on empirical methods in natural language pro-cessing. Association for Computational Linguistics,pages 120–128.

Asli Celikyilmaz, Ruhi Sarikaya, Minwoo Jeong, andAnoop Deoras. 2016. An empirical investiga-tion of word class-based features for natural lan-guage understanding. IEEE/ACM Transactions onAudio, Speech and Language Processing (TASLP)24(6):994–1005.

Asli Celikyilmaz, Silicon Valley, and Dilek Hakkani-Tur. 2010. Convolutional neural network based se-mantic tagging with entity embeddings. genre .

Yun-Nung Chen, Dilek Hakkani-Tur, and XiaodongHe. 2016. Zero-shot learning of intent embed-dings for expansion by convolutional deep struc-tured semantic models. In Acoustics, Speech andSignal Processing (ICASSP), 2016 IEEE Interna-tional Conference on. IEEE, pages 6045–6049.

Hal Daume III. 2009. Frustratingly easy domain adap-tation. arXiv preprint arXiv:0907.1815 .

Hal Daume III and Daniel Marcu. 2006. Domain adap-tation for statistical classifiers. Journal of ArtificialIntelligence Research 26:101–126.

Ali El-Kahky, Derek Liu, Ruhi Sarikaya, Gokhan Tur,Dilek Hakkani-Tur, and Larry Heck. 2014. Ex-tending domain coverage of language understand-ing systems via intent transfer between domains us-ing knowledge graphs and search query click logs.IEEE, Proceedings of the ICASSP.

Emmanuel Ferreira, Bassam Jabaian, and FabriceLefevre. 2015. Zero-shot semantic parser for spokenlanguage understanding. In Sixteenth Annual Con-ference of the International Speech CommunicationAssociation.

Alex Graves. 2012. Neural networks. In Super-vised Sequence Labelling with Recurrent NeuralNetworks, Springer, pages 15–35.

Daniel Guo, Gokhan Tur, Wen-tau Yih, and GeoffreyZweig. 2014. Joint semantic utterance classifica-tion and slot filling with recursive neural networks.

In Spoken Language Technology Workshop (SLT),2014 IEEE. IEEE, pages 554–559.

Dilek Hakkani-Tur, Gokhan Tur, Asli Celikyilmaz,Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frameparsing using bi-directional rnn-lstm. In Proceed-ings of The 17th Annual Meeting of the InternationalSpeech Communication Association.

Sepp Hochreiter and Jurgen Schmidhuber. 1997.Long short-term memory. Neural computation9(8):1735–1780.

Aaron Jaech, Larry Heck, and Mari Ostendorf. 2016.Domain adaptation of recurrent neural networks fornatural language understanding. arXiv preprintarXiv:1604.00117 .

Minwoo Jeong and Gary Geunbae Lee. 2008.Triangular-chain conditional random fields. IEEETransactions on Audio, Speech, and Language Pro-cessing 16(7):1287–1302.

Young-Bum Kim, Minwoo Jeong, Karl Stratos, andRuhi Sarikaya. 2015a. Weakly supervised slot tag-ging with partially labeled sequences from websearch click logs. In Proceedings of the NAACL. As-sociation for Computational Linguistics.

Young-Bum Kim, Alexandre Rochette, and RuhiSarikaya. 2016a. Natural language model re-usability for scaling to different domains. Pro-ceedings of the Empiricial Methods in Natural Lan-guage Processing (EMNLP). Association for Com-putational Linguistics .

Young-Bum Kim, Karl Stratos, and Dongchan Kim.2017. Adversarial adaptation of synthetic or staledata. In Annual Meeting of the Association for Com-putational Linguistics.

Young-Bum Kim, Karl Stratos, Xiaohu Liu, and RuhiSarikaya. 2015b. Compact lexicon selection withspectral methods. In Proc. of Annual Meeting of theAssociation for Computational Linguistics: HumanLanguage Technologies.

Young-Bum Kim, Karl Stratos, and Ruhi Sarikaya.2015c. Pre-training of hidden-unit crfs. In Proc.of Annual Meeting of the Association for Computa-tional Linguistics: Human Language Technologies.pages 192–198.

Young-Bum Kim, Karl Stratos, and Ruhi Sarikaya.2016b. Domainless adaptation by constrained de-coding on a schema lattice. Proceedings of the26th International Conference on ComputationalLinguistics (COLING) .

Young-Bum Kim, Karl Stratos, and Ruhi Sarikaya.2016c. Frustratingly easy neural domain adaptation.Proceedings of the 26th International Conference onComputational Linguistics (COLING) .

652

Page 11: Domain Attention with an Ensemble of Experts · mainwithlimitedsupervisiongivenK ex-isting domains. ... each of these experts and use their combined feed-back along with the model's

Young-Bum Kim, Karl Stratos, and Ruhi Sarikaya.2016d. Scalable semi-supervised query classifica-tion using matrix sketching. In The 54th AnnualMeeting of the Association for Computational Lin-guistics. page 8.

Young-Bum Kim, Karl Stratos, Ruhi Sarikaya, andMinwoo Jeong. 2015d. New transfer learning tech-niques for disparate label sets. ACL. Association forComputational Linguistics .

Diederik Kingma and Jimmy Ba. 2015. Adam: Amethod for stochastic optimization. The Inter-national Conference on Learning Representations(ICLR). .

Guillaume Lample, Miguel Ballesteros, Sandeep Sub-ramanian, Kazuya Kawakami, and Chris Dyer. 2016.Neural architectures for named entity recognition.arXiv preprint arXiv:1603.01360 .

Xiao Li, Ye-Yi Wang, and Alex Acero. 2009. Extract-ing structured information from user queries withsemi-supervised conditional random fields. In Pro-ceedings of the 32nd international ACM SIGIR con-ference on Research and development in informationretrieval.

Bing Liu and Ian Lane. 2016a. Attention-based recur-rent neural network models for joint intent detectionand slot filling. In Interspeech 2016. pages 685–689.

Bing Liu and Ian Lane. 2016b. Joint online spoken lan-guage understanding and language modeling withrecurrent neural networks. In Proceedings of the17th Annual Meeting of the Special Interest Groupon Discourse and Dialogue. Association for Com-putational Linguistics, Los Angeles.

Xiaohu Liu and Ruhi Sarikaya. 2014. A discrimi-native model based entity dictionary weighting ap-proach for spoken language understanding. In Spo-ken Language Technology Workshop (SLT). IEEE,pages 195–199.

Graham Neubig, Chris Dyer, Yoav Goldberg, AustinMatthews, Waleed Ammar, Antonios Anastasopou-los, Miguel Ballesteros, David Chiang, DanielClothiaux, Trevor Cohn, et al. 2017. Dynet: Thedynamic neural network toolkit. arXiv preprintarXiv:1701.03980 .

Sinno Jialin Pan, Ivor W Tsang, James T Kwok, andQiang Yang. 2011. Domain adaptation via transfercomponent analysis. IEEE Transactions on NeuralNetworks 22(2):199–210.

Ruhi Sarikaya. 2015. The technology powering per-sonal digital assistants. Keynote at Interspeech,Dresden, Germany.

Ruhi Sarikaya, Asli Celikyilmaz, Anoop Deoras, andMinwoo Jeong. 2014. Shrinkage based features forslot tagging with conditional random fields. In IN-TERSPEECH. pages 268–272.

Ruhi Sarikaya, Paul Crook, Alex Marin, MinwooJeong, Jean-Philippe Robichaud, Asli Celikyilmaz,Young-Bum Kim, Alexandre Rochette, Omar ZiaKhan, Xiuahu Liu, et al. 2016. An overview of end-to-end language understanding and dialog manage-ment for personal digital assistants. In IEEE Work-shop on Spoken Language Technology.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirec-tional recurrent neural networks. IEEE Transactionson Signal Processing 45(11):2673–2681.

Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky,Ilya Sutskever, and Ruslan Salakhutdinov. 2014.Dropout: a simple way to prevent neural networksfrom overfitting. Journal of Machine Learning Re-search 15(1):1929–1958.

Gokhan Tur. 2006. Multitask learning for spokenlanguage understanding. In In Proceedings of theICASSP. Toulouse, France.

Puyang Xu and Ruhi Sarikaya. 2013. Convolutionalneural network based triangular crf for joint in-tent detection and slot filling. In Automatic SpeechRecognition and Understanding (ASRU), 2013 IEEEWorkshop on. IEEE, pages 78–83.

Xiaodong Zhang and Houfeng Wang. 2016. A jointmodel of intent determination and slot filling forspoken language understanding. IJCAI.

653