abstract - arxiv · abstract conventional neural autoregressive decod-ing commonly assumes a fixed...

15
Insertion-based Decoding with automatically Inferred Generation Order Jiatao Gu , Qi Liu * , and Kyunghyun Cho ‡† Facebook AI Research University of Oxford New York University, CIFAR Azrieli Global Scholar {jgu, kyunghyuncho}@fb.com [email protected] Abstract Conventional neural autoregressive decod- ing commonly assumes a fixed left-to-right generation order, which may be sub-optimal. In this work, we propose a novel decod- ing algorithm – InDIGO – which supports flexible sequence generation in arbitrary or- ders through insertion operations. We extend Transformer, a state-of-the-art sequence gen- eration model, to efficiently implement the proposed approach, enabling it to be trained with either a pre-defined generation order or adaptive orders obtained from beam-search. Experiments on four real-world tasks, in- cluding word order recovery, machine trans- lation, image caption and code generation, demonstrate that our algorithm can generate sequences following arbitrary orders, while achieving competitive or even better perfor- mance compared to the conventional left-to- right generation. The generated sequences show that InDIGO adopts adaptive genera- tion orders based on input information. 1 Introduction Neural autoregressive models have become the de facto standard in a wide range of sequence genera- tion tasks, such as machine translation (Bahdanau et al., 2015), summarization (Rush et al., 2015) and dialogue systems (Vinyals and Le, 2015). In these studies, a sequence is modeled autoregressively with the left-to-right generation order, which raises the question of whether generation in an arbitrary order is worth considering (Vinyals et al., 2016; Ford et al., 2018). Nevertheless, previous studies on generation orders mostly resort to a fixed set of generation orders, showing particular choices of ordering are helpful (Wu et al., 2018; Ford et al., 2018; Mehri and Sigal, 2018), without providing * This work was completed while the author worked as an AI resident at Facebook AI Research. <S> </S> dream <S> </S> dream I <S> </S> dream I a <S> </S> dream I a have <S> </S> dream I have a 0 2 1 0 0 0 2 3 4 5 3 4 1 1 1 2 3 2 insert to right insert to left Figure 1: An example of InDIGO. At each step, we simultaneously predict the next token and its (relative) position to be inserted. The final output sequence is obtained by mapping the words based on their positions. an efficient algorithm for finding adaptive genera- tion orders, or restrict the problem scope to n-gram segment generation (Vinyals et al., 2016). In this paper, we propose a novel decoding al- gorithm, Insertion-based Decoding with Inferred Generation Order (InDIGO), which models gener- ation orders as latent variables and automatically infers the generation orders by simultaneously pre- dicting a word and its position to be inserted at each decoding step. Given that absolute positions are unknown before generating the whole sequence, we use a relative-position-based representation to capture generation orders. We show that decoding consists of a series of insertion operations with a demonstration shown in Figure 1. We extend Transformer (Vaswani et al., 2017) for supporting insertion operations, where the gen- eration order is directly captured as relative po- sitions through self-attention inspired by (Shaw et al., 2018). For learning, we maximize the evi- dence lower-bound (ELBO) of the maximum like- lihood objective, and study two approximate pos- terior distributions of generation orders based on a pre-defined generation order and adaptive orders arXiv:1902.01370v3 [cs.CL] 28 Oct 2019

Upload: others

Post on 18-Jun-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

  • Insertion-based Decoding with automaticallyInferred Generation Order

    Jiatao Gu†, Qi Liu�∗, and Kyunghyun Cho‡††Facebook AI Research �University of Oxford

    ‡New York University, CIFAR Azrieli Global Scholar†{jgu, kyunghyuncho}@fb.com ‡[email protected]

    Abstract

    Conventional neural autoregressive decod-ing commonly assumes a fixed left-to-rightgeneration order, which may be sub-optimal.In this work, we propose a novel decod-ing algorithm – InDIGO – which supportsflexible sequence generation in arbitrary or-ders through insertion operations. We extendTransformer, a state-of-the-art sequence gen-eration model, to efficiently implement theproposed approach, enabling it to be trainedwith either a pre-defined generation order oradaptive orders obtained from beam-search.Experiments on four real-world tasks, in-cluding word order recovery, machine trans-lation, image caption and code generation,demonstrate that our algorithm can generatesequences following arbitrary orders, whileachieving competitive or even better perfor-mance compared to the conventional left-to-right generation. The generated sequencesshow that InDIGO adopts adaptive genera-tion orders based on input information.

    1 Introduction

    Neural autoregressive models have become the defacto standard in a wide range of sequence genera-tion tasks, such as machine translation (Bahdanauet al., 2015), summarization (Rush et al., 2015) anddialogue systems (Vinyals and Le, 2015). In thesestudies, a sequence is modeled autoregressivelywith the left-to-right generation order, which raisesthe question of whether generation in an arbitraryorder is worth considering (Vinyals et al., 2016;Ford et al., 2018). Nevertheless, previous studieson generation orders mostly resort to a fixed set ofgeneration orders, showing particular choices ofordering are helpful (Wu et al., 2018; Ford et al.,2018; Mehri and Sigal, 2018), without providing

    ∗This work was completed while the author worked as anAI resident at Facebook AI Research.

    dream

    dream I

    dream I a

    dream I a have

    dreamI have a

    0 2 1

    0

    0

    0

    23

    4

    5

    3

    4

    1

    1

    1

    2

    3 2

    insert to right

    insert to left

    Figure 1: An example of InDIGO. At each step,we simultaneously predict the next token and its(relative) position to be inserted. The final outputsequence is obtained by mapping the words basedon their positions.

    an efficient algorithm for finding adaptive genera-tion orders, or restrict the problem scope to n-gramsegment generation (Vinyals et al., 2016).

    In this paper, we propose a novel decoding al-gorithm, Insertion-based Decoding with InferredGeneration Order (InDIGO), which models gener-ation orders as latent variables and automaticallyinfers the generation orders by simultaneously pre-dicting a word and its position to be inserted ateach decoding step. Given that absolute positionsare unknown before generating the whole sequence,we use a relative-position-based representation tocapture generation orders. We show that decodingconsists of a series of insertion operations with ademonstration shown in Figure 1.

    We extend Transformer (Vaswani et al., 2017)for supporting insertion operations, where the gen-eration order is directly captured as relative po-sitions through self-attention inspired by (Shawet al., 2018). For learning, we maximize the evi-dence lower-bound (ELBO) of the maximum like-lihood objective, and study two approximate pos-terior distributions of generation orders based ona pre-defined generation order and adaptive orders

    arX

    iv:1

    902.

    0137

    0v3

    [cs

    .CL

    ] 2

    8 O

    ct 2

    019

  • obtained from beam-search, respectively.Experimental results on word order recovery,

    machine translation, code generation and imagecaption demonstrate that our algorithm can gener-ate sequences with arbitrary orders, while achiev-ing competitive or even better performance com-pared to the conventional left-to-right generation.Case studies show that the proposed method adoptsadaptive orders based on input information. Thecode will be released as part of the official repo ofFairseq (https://github.com/pytorch/fairseq).

    2 Neural Autoregressive Decoding

    Let us consider the problem of generating a se-quence y = (y1, ..., yT ) conditioned on some in-puts, e.g., a source sequence x = (x1, ..., xT ′).Our goal is to build a model parameterized by θthat models the conditional probability of y givenx, which is factorized as:

    pθ(y|x) =T∏t=0

    pθ(yt+1|y0:t, x1:T ′), (1)

    where y0 and yT+1 are special tokens 〈s〉 and 〈/s〉,respectively. The model sequentially predicts theconditional probability of the next token at eachstep t, which can be implemented by any func-tion approximator such as RNNs (Bahdanau et al.,2015) and Transformer (Vaswani et al., 2017).

    Learning Neural autoregressive model is com-monly learned by maximizing the conditional like-lihood log p(y|x) =

    ∑Tt=0 log pθ(yt+1|y0:t, x1:T ′)

    given a set of parallel examples.

    Decoding A common way to decode a sequencefrom a trained model is to make use of the autore-gressive nature that allows us to predict one wordat each step. Given any source x, we essentiallyfollow the order of factorization to generate tokenssequentially using some heuristic-based algorithmssuch as greedy decoding and beam search.

    3 Insertion-based Decoding withInferred Generation Order (InDIGO)

    Eq. 1 explicitly assumes a left-to-right (L2R) gen-eration order of the sequence y. In principle, wecan factorize the sequence probability in any per-mutation and train a model for each permutationseparately. As long as we have infinite amount ofdata with proper optimization performed, all thesemodels are equivalent. Nevertheless, Vinyals et al.

    (2016) have shown that the generation order of asequence actually matters in many real-world tasks,e.g., language modeling.

    Although the L2R order is a strong inductivebias, as it is natural for most human-beings to readand write sequences from left to right, L2R is notnecessarily the optimal option for generating se-quences. For instance, people sometimes tend tothink of central phrases first before building up awhole sentence; For programming languages, it isbeneficial to be generated based on abstract syntaxtrees (Yin and Neubig, 2017).

    Therefore, a natural question arises, how can wedecode a sequence in its best order?

    3.1 Orders as Latent VariablesWe address this question by modeling generationorders as latent variables. Similar to Vinyals et al.(2016), we rewrite the target sequence y in a par-ticular order π = (z2, ..., zT , zT+1) ∈ PT 1 as a setyπ = {(y2, z2), ..., (yT+1, zT+1)}, where (yt, zt)represents the t-th generated token and its absoluteposition, respectively. Different from the commonnotation, the target sequence is 2-step drifted be-cause the two special tokens (y0, z0) = (〈s〉, 0)and (y1, z1) = (〈/s〉, T + 1) are always prependedto represent the left and right boundaries, respec-tively. Then, we model the conditional probabilityas the joint distribution of words and positions bymarginalizing all the orders:

    pθ(y|x) =∑π∈PT

    pθ(yπ|x),

    where for each element:

    pθ(yπ|x) = pθ(yT+2|y0:T+1, z0:T+1, x1:T ′)·T∏t=1

    pθ(yt+1, zt+1|y0:t, z0:t, x1:T ′)(2)

    where the third special token yT+2 = 〈eod〉 is intro-duced to signal the end-of-decoding, and p(yT+2|·)is the end-of-decoding probability.

    At decoding time, the factorization allows us todecode autoregressively by predicting word yt+1and its position zt+1 step by step. The generationorder is automatically inferred during decoding.

    3.2 Relative Representation of PositionsIt is difficult and inefficient to predict the absolutepositions zt without knowing the actual length T .

    1 PT is the set of all the permutations of (1, ..., T ).

    https://github.com/pytorch/fairseq

  • One solution is directly using the absolute posi-tions zt0, ..., z

    tt of the partial sequence y0:t at each

    autoregressive step t. For example, the absolutepositions for the sequence (〈s〉, 〈/s〉, dream, I) are(zt0 = 0, z

    t1 = 3, z

    t2 = 2, z

    t3 = 1) in Figure 1 at

    step t = 3. It is however inefficient to model suchexplicit positions using a single neural networkwithout recomputing the hidden states for the en-tire partial sequence, as some positions are changedat every step (as shown in Figure 1).

    Relative Positions We propose using relative-position representations rt0:t instead of absolutepositions zt0:t. We use a ternary vector r

    ti ∈

    {−1, 0, 1}t+1 as the relative-position representa-tion for zti . The j-th element of r

    ti is defined as:

    rti,j =

    −1 ztj > zti (left)0 ztj = z

    ti (middle)

    1 ztj < zti (right)

    , (3)

    where the elements of rti show the relative po-sitions with respect to all the other words inthe partial sequence at step t. We use a matrixRt =

    [rt0, r

    t1, ..., r

    tt

    ]to show the relative-position

    representations of all the words in the sequence.The relative-position representation can always bemapped back to the absolute position zti by:

    zti =

    t∑j=0

    max(0, rti,j) (4)

    One of the biggest advantages for using such vector-based representations is that at each step, updatingthe relative-position representations is simply ex-tending the relative-position matrix Rt with thenext predicted relative position, because the (left,middle, right) relations described in Eq. (3) stayunchanged once they are created. Thus, we updateRt as follows:

    Rt+1 =

    rt+1t+1,0

    Rt...

    rt+1t+1,t−rt+1t+1,0 · · · −r

    t+1t+1,t 0

    (5)

    where we use rt+1t+1 to represent the relative positionat step t+1. This append-only property enables ourmethod to reuse the previous hidden states withoutrecomputing the hidden states at each step. Forsimplicity, the superscript of r is omitted from nowon without causing conflicts.

    Algorithm 1 Insertion-based Decoding

    Initialize: y = (〈s〉, 〈/s〉), R =[

    0 1−1 0

    ], t = 1

    repeatPredict the next word yt+1 based on y, R.if yt+1 is 〈eod〉 then

    breakend ifChoose an existing word yk ∈ y;Choose the left or right (s) of yk to insert;Obtain the next position rt+1 with k, s (Eq. (6)).Update R by appending rt+1 (Eq. (5)).Update y by appending yt+1Update t = t+ 1

    until Reach the maximum lengthMap back to absolute positions π (Eq. (4))Reorder y: yzi = yi ∀zi ∈ π, i ∈ [0, t]

    3.3 Insertion-based DecodingGiven a partial sequence y0:t and its correspond-ing relative-position representations r0:t, not allof the 3t+2 possible vectors are valid for the nextrelative-position representation, rt+1. Only thesevectors corresponding to insertion operations sat-isfy Eq. (4). In Algorithm 1, we describe aninsertion-based decoding framework based on thisobservation. The next word yt+1 is predicted basedon y0:t and r0:t. We then choose an existing wordyk (0 ≤ k ≤ t)) from y0:t and insert yt+1 to itsleft or right. As a result, the next position rt+1 isdetermined by

    rt+1,j =

    {s j = k

    rk,j j 6= k, ∀j ∈ [0, t] (6)

    where s = −1 if yt+1 is on the left of yk, ands = 1 otherwise. Finally, we use rt+1 to updatethe relative-position matrix R as shown in Eq. (5).

    4 Model

    We present Transformer-InDIGO, an extension ofTransformer (Vaswani et al., 2017), supportinginsertion-based decoding. The overall frameworkis shown in Figure 2.

    4.1 Network DesignWe extend the decoder of Transformer with relative-position-based self-attention, joint word & positionprediction and position updating modules.

    Self-Attention One of the major challenges thatprevents the vanilla Transformer from generatingsequences following arbitrary orders is that the

  • dream I

    a

    Relative Positions

    Transformer-Decoder

    R L R RL L

    CausalSelf-attention

    Update

    a

    0 -1 +1

    R key for insert at right L key for insert at left dream I a have

    0 5 4 1 23

    dreamI ahave

    sort

    htAAAB6nicbVA9SwNBEJ2LXzF+RQUbm8UgWIU7m1gGbSwTNB+QHHFvs5cs2ds7dueEcOQn2FgoYusvsrPxt7j5KDTxwcDjvRlm5gWJFAZd98vJra1vbG7ltws7u3v7B8XDo6aJU814g8Uy1u2AGi6F4g0UKHk70ZxGgeStYHQz9VuPXBsRq3scJ9yP6ECJUDCKVrob9rBXLLlldwaySrwFKVVP6t8PAFDrFT+7/ZilEVfIJDWm47kJ+hnVKJjkk0I3NTyhbEQHvGOpohE3fjY7dULOrdInYaxtKSQz9fdERiNjxlFgOyOKQ7PsTcX/vE6K4ZWfCZWkyBWbLwpTSTAm079JX2jOUI4toUwLeythQ6opQ5tOwYbgLb+8SpqXZc8te3WbxjXMkYdTOIML8KACVbiFGjSAwQCe4AVeHek8O2/O+7w15yxmjuEPnI8f5jSPtA==AAAB6nicbVA9SwNBEJ2LXzF+RQUbm8UgWIU7Gy1DbCwTNImQHGFvs5cs2ds7dueEcOQn2FgoYmvrv/AX2Nn4W9x8FJr4YODx3gwz84JECoOu++XkVlbX1jfym4Wt7Z3dveL+QdPEqWa8wWIZ67uAGi6F4g0UKPldojmNAslbwfBq4rfuuTYiVrc4Srgf0b4SoWAUrXQz6GK3WHLL7hRkmXhzUqoc1b/Fe/Wj1i1+dnoxSyOukElqTNtzE/QzqlEwyceFTmp4QtmQ9nnbUkUjbvxseuqYnFqlR8JY21JIpurviYxGxoyiwHZGFAdm0ZuI/3ntFMNLPxMqSZErNlsUppJgTCZ/k57QnKEcWUKZFvZWwgZUU4Y2nYINwVt8eZk0z8ueW/bqNo0qzJCHYziBM/DgAipwDTVoAIM+PMATPDvSeXRenNdZa86ZzxzCHzhvPzeEkXA=AAAB6nicbVA9SwNBEJ2LXzF+RQUbm8UgWIU7Gy1DbCwTNImQHGFvs5cs2ds7dueEcOQn2FgoYmvrv/AX2Nn4W9x8FJr4YODx3gwz84JECoOu++XkVlbX1jfym4Wt7Z3dveL+QdPEqWa8wWIZ67uAGi6F4g0UKPldojmNAslbwfBq4rfuuTYiVrc4Srgf0b4SoWAUrXQz6GK3WHLL7hRkmXhzUqoc1b/Fe/Wj1i1+dnoxSyOukElqTNtzE/QzqlEwyceFTmp4QtmQ9nnbUkUjbvxseuqYnFqlR8JY21JIpurviYxGxoyiwHZGFAdm0ZuI/3ntFMNLPxMqSZErNlsUppJgTCZ/k57QnKEcWUKZFvZWwgZUU4Y2nYINwVt8eZk0z8ueW/bqNo0qzJCHYziBM/DgAipwDTVoAIM+PMATPDvSeXRenNdZa86ZzxzCHzhvPzeEkXA=AAAB6nicbVA9TwJBEJ3DL8Qv1NJmIzGxInc2WhJtLDEKksCF7C17sGFv97I7Z0Iu/AQbC42x9RfZ+W9c4AoFXzLJy3szmZkXpVJY9P1vr7S2vrG5Vd6u7Ozu7R9UD4/aVmeG8RbTUptORC2XQvEWCpS8kxpOk0jyx2h8M/Mfn7ixQqsHnKQ8TOhQiVgwik66H/WxX635dX8OskqCgtSgQLNf/eoNNMsSrpBJam038FMMc2pQMMmnlV5meUrZmA5511FFE27DfH7qlJw5ZUBibVwpJHP190ROE2snSeQ6E4oju+zNxP+8bobxVZgLlWbIFVssijNJUJPZ32QgDGcoJ45QZoS7lbARNZShS6fiQgiWX14l7Yt64NeDO7/WuC7iKMMJnMI5BHAJDbiFJrSAwRCe4RXePOm9eO/ex6K15BUzx/AH3ucPV5yNzw==

    CAAAB6HicbZC7SwNBEMbnfMb4ilraLAbBKtzZaCMG01gmYB6QHGFvM5es2ds7dveEcATsbSwUsfWfsbfzv3HzKDTxg4Uf3zfDzkyQCK6N6347K6tr6xubua389s7u3n7h4LCh41QxrLNYxKoVUI2CS6wbbgS2EoU0CgQ2g2FlkjcfUGkeyzszStCPaF/ykDNqrFWrdAtFt+RORZbBm0Px+jN/9QgA1W7hq9OLWRqhNExQrduemxg/o8pwJnCc76QaE8qGtI9ti5JGqP1sOuiYnFqnR8JY2ScNmbq/OzIaaT2KAlsZUTPQi9nE/C9rpya89DMuk9SgZLOPwlQQE5PJ1qTHFTIjRhYoU9zOStiAKsqMvU3eHsFbXHkZGuclzy15NbdYvoGZcnAMJ3AGHlxAGW6hCnVggPAEL/Dq3DvPzpvzPitdceY9R/BHzscPA+yOkA==AAAB6HicbZC7SgNBFIbPxltcb1FLm8UgWIVdG23EYBrLBMwFkiXMTs4mY2Znl5lZISx5AhsLRWz1YextxLdxcik08YeBj/8/hznnBAlnSrvut5VbWV1b38hv2lvbO7t7hf2DhopTSbFOYx7LVkAUciawrpnm2Eokkijg2AyGlUnevEepWCxu9ShBPyJ9wUJGiTZWrdItFN2SO5WzDN4cilcf9mXy/mVXu4XPTi+maYRCU06Uantuov2MSM0ox7HdSRUmhA5JH9sGBYlQ+dl00LFzYpyeE8bSPKGdqfu7IyORUqMoMJUR0QO1mE3M/7J2qsMLP2MiSTUKOvsoTLmjY2eytdNjEqnmIwOESmZmdeiASEK1uY1tjuAtrrwMjbOS55a8mlssX8NMeTiCYzgFD86hDDdQhTpQQHiAJ3i27qxH68V6nZXmrHnPIfyR9fYD9WyQBA==AAAB6HicbZC7SgNBFIbPxltcb1FLm8UgWIVdG23EYBrLBMwFkiXMTs4mY2Znl5lZISx5AhsLRWz1YextxLdxcik08YeBj/8/hznnBAlnSrvut5VbWV1b38hv2lvbO7t7hf2DhopTSbFOYx7LVkAUciawrpnm2Eokkijg2AyGlUnevEepWCxu9ShBPyJ9wUJGiTZWrdItFN2SO5WzDN4cilcf9mXy/mVXu4XPTi+maYRCU06Uantuov2MSM0ox7HdSRUmhA5JH9sGBYlQ+dl00LFzYpyeE8bSPKGdqfu7IyORUqMoMJUR0QO1mE3M/7J2qsMLP2MiSTUKOvsoTLmjY2eytdNjEqnmIwOESmZmdeiASEK1uY1tjuAtrrwMjbOS55a8mlssX8NMeTiCYzgFD86hDDdQhTpQQHiAJ3i27qxH68V6nZXmrHnPIfyR9fYD9WyQBA==AAAB6HicbVA9TwJBEJ3DL8Qv1NJmIzGxInc2UhJpLCGRjwQuZG+Zg5W9vcvungm58AtsLDTG1p9k579xgSsUfMkkL+/NZGZekAiujet+O4Wt7Z3dveJ+6eDw6PikfHrW0XGqGLZZLGLVC6hGwSW2DTcCe4lCGgUCu8G0sfC7T6g0j+WDmSXoR3QsecgZNVZqNYblilt1lyCbxMtJBXI0h+WvwShmaYTSMEG17ntuYvyMKsOZwHlpkGpMKJvSMfYtlTRC7WfLQ+fkyiojEsbKljRkqf6eyGik9SwKbGdEzUSvewvxP6+fmrDmZ1wmqUHJVovCVBATk8XXZMQVMiNmllCmuL2VsAlVlBmbTcmG4K2/vEk6N1XPrXott1K/y+MowgVcwjV4cAt1uIcmtIEBwjO8wpvz6Lw4787HqrXg5DPn8AfO5w+UpYzD

    DAAAB6HicbZC7SwNBEMbnfMbzFbW0WQyCVbiz0UYMamGZgHlAcoS9zVyyZm/v2N0TQgjY21goYus/Y2/nf+PmUWjiBws/vm+GnZkwFVwbz/t2lpZXVtfWcxvu5tb2zm5+b7+mk0wxrLJEJKoRUo2CS6wabgQ2UoU0DgXWw/71OK8/oNI8kXdmkGIQ067kEWfUWKty084XvKI3EVkEfwaFy0/34hEAyu38V6uTsCxGaZigWjd9LzXBkCrDmcCR28o0ppT1aRebFiWNUQfDyaAjcmydDokSZZ80ZOL+7hjSWOtBHNrKmJqens/G5n9ZMzPReTDkMs0MSjb9KMoEMQkZb006XCEzYmCBMsXtrIT1qKLM2Nu49gj+/MqLUDst+l7Rr3iF0hVMlYNDOIIT8OEMSnALZagCA4QneIFX5955dt6c92npkjPrOYA/cj5+AAVwjpE=AAAB6HicbZC7SgNBFIbPxltcb1FLm8UgWIVdG23EoBaWCZgLJEuYnZxNxszOLjOzQljyBDYWitjqw9jbiG/j5FJo4g8DH/9/DnPOCRLOlHbdbyu3tLyyupZftzc2t7Z3Crt7dRWnkmKNxjyWzYAo5ExgTTPNsZlIJFHAsREMrsZ54x6lYrG41cME/Yj0BAsZJdpY1etOoeiW3ImcRfBmULz4sM+T9y+70il8trsxTSMUmnKiVMtzE+1nRGpGOY7sdqowIXRAetgyKEiEys8mg46cI+N0nTCW5gntTNzfHRmJlBpGgamMiO6r+Wxs/pe1Uh2e+RkTSapR0OlHYcodHTvjrZ0uk0g1HxogVDIzq0P7RBKqzW1scwRvfuVFqJ+UPLfkVd1i+RKmysMBHMIxeHAKZbiBCtSAAsIDPMGzdWc9Wi/W67Q0Z8169uGPrLcf9vCQBQ==AAAB6HicbZC7SgNBFIbPxltcb1FLm8UgWIVdG23EoBaWCZgLJEuYnZxNxszOLjOzQljyBDYWitjqw9jbiG/j5FJo4g8DH/9/DnPOCRLOlHbdbyu3tLyyupZftzc2t7Z3Crt7dRWnkmKNxjyWzYAo5ExgTTPNsZlIJFHAsREMrsZ54x6lYrG41cME/Yj0BAsZJdpY1etOoeiW3ImcRfBmULz4sM+T9y+70il8trsxTSMUmnKiVMtzE+1nRGpGOY7sdqowIXRAetgyKEiEys8mg46cI+N0nTCW5gntTNzfHRmJlBpGgamMiO6r+Wxs/pe1Uh2e+RkTSapR0OlHYcodHTvjrZ0uk0g1HxogVDIzq0P7RBKqzW1scwRvfuVFqJ+UPLfkVd1i+RKmysMBHMIxeHAKZbiBCtSAAsIDPMGzdWc9Wi/W67Q0Z8169uGPrLcf9vCQBQ==AAAB6HicbVA9SwNBEJ2LXzF+RS1tFoNgFe5sTBnUwjIB8wHJEfY2c8mavb1jd08IR36BjYUitv4kO/+Nm+QKTXww8Hhvhpl5QSK4Nq777RQ2Nre2d4q7pb39g8Oj8vFJW8epYthisYhVN6AaBZfYMtwI7CYKaRQI7AST27nfeUKleSwfzDRBP6IjyUPOqLFS825QrrhVdwGyTrycVCBHY1D+6g9jlkYoDRNU657nJsbPqDKcCZyV+qnGhLIJHWHPUkkj1H62OHRGLqwyJGGsbElDFurviYxGWk+jwHZG1Iz1qjcX//N6qQlrfsZlkhqUbLkoTAUxMZl/TYZcITNiagllittbCRtTRZmx2ZRsCN7qy+ukfVX13KrXdCv1mzyOIpzBOVyCB9dQh3toQAsYIDzDK7w5j86L8+58LFsLTj5zCn/gfP4AlimMxA==

    EAAAB6HicbZC7SwNBEMbnfMbzFbW0WQyCVbiz0UYMimCZgHlAcoS9zVyyZm/v2N0TQgjY21goYus/Y2/nf+PmUWjiBws/vm+GnZkwFVwbz/t2lpZXVtfWcxvu5tb2zm5+b7+mk0wxrLJEJKoRUo2CS6wabgQ2UoU0DgXWw/71OK8/oNI8kXdmkGIQ067kEWfUWKty084XvKI3EVkEfwaFy0/34hEAyu38V6uTsCxGaZigWjd9LzXBkCrDmcCR28o0ppT1aRebFiWNUQfDyaAjcmydDokSZZ80ZOL+7hjSWOtBHNrKmJqens/G5n9ZMzPReTDkMs0MSjb9KMoEMQkZb006XCEzYmCBMsXtrIT1qKLM2Nu49gj+/MqLUDst+l7Rr3iF0hVMlYNDOIIT8OEMSnALZagCA4QneIFX5955dt6c92npkjPrOYA/cj5+AAb0jpI=AAAB6HicbZDLSsNAFIZP6q3GW9Wlm2ARXJXEjW7EogguW7AXaEOZTE/asZNJmJkIJfQJ3LhQxK0+jHs34ts4vSy09YeBj/8/hznnBAlnSrvut5VbWl5ZXcuv2xubW9s7hd29uopTSbFGYx7LZkAUciawppnm2Ewkkijg2AgGV+O8cY9SsVjc6mGCfkR6goWMEm2s6nWnUHRL7kTOIngzKF582OfJ+5dd6RQ+292YphEKTTlRquW5ifYzIjWjHEd2O1WYEDogPWwZFCRC5WeTQUfOkXG6ThhL84R2Ju7vjoxESg2jwFRGRPfVfDY2/8taqQ7P/IyJJNUo6PSjMOWOjp3x1k6XSaSaDw0QKpmZ1aF9IgnV5ja2OYI3v/Ii1E9Knlvyqm6xfAlT5eEADuEYPDiFMtxABWpAAeEBnuDZurMerRfrdVqas2Y9+/BH1tsP+HSQBg==AAAB6HicbZDLSsNAFIZP6q3GW9Wlm2ARXJXEjW7EogguW7AXaEOZTE/asZNJmJkIJfQJ3LhQxK0+jHs34ts4vSy09YeBj/8/hznnBAlnSrvut5VbWl5ZXcuv2xubW9s7hd29uopTSbFGYx7LZkAUciawppnm2Ewkkijg2AgGV+O8cY9SsVjc6mGCfkR6goWMEm2s6nWnUHRL7kTOIngzKF582OfJ+5dd6RQ+292YphEKTTlRquW5ifYzIjWjHEd2O1WYEDogPWwZFCRC5WeTQUfOkXG6ThhL84R2Ju7vjoxESg2jwFRGRPfVfDY2/8taqQ7P/IyJJNUo6PSjMOWOjp3x1k6XSaSaDw0QKpmZ1aF9IgnV5ja2OYI3v/Ii1E9Knlvyqm6xfAlT5eEADuEYPDiFMtxABWpAAeEBnuDZurMerRfrdVqas2Y9+/BH1tsP+HSQBg==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0m82GNRBI8t2A9oQ9lsJ+3azSbsboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKHSPJYPZpqgH9GR5CFn1FipeTcoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ1P+MySQ1KtlwUpoKYmMy/JkOukBkxtYQyxe2thI2poszYbEo2BG/15XXSvqp6btVrupX6TR5HEc7gHC7Bg2uowz00oAUMEJ7hFd6cR+fFeXc+lq0FJ585hT9wPn8Al62MxQ==

    FAAAB6HicbZC7SwNBEMbnfMbzFbW0WQyCVbiz0UYMCmKZgHlAcoS9zVyyZm/v2N0TQgjY21goYus/Y2/nf+PmUWjiBws/vm+GnZkwFVwbz/t2lpZXVtfWcxvu5tb2zm5+b7+mk0wxrLJEJKoRUo2CS6wabgQ2UoU0DgXWw/71OK8/oNI8kXdmkGIQ067kEWfUWKty084XvKI3EVkEfwaFy0/34hEAyu38V6uTsCxGaZigWjd9LzXBkCrDmcCR28o0ppT1aRebFiWNUQfDyaAjcmydDokSZZ80ZOL+7hjSWOtBHNrKmJqens/G5n9ZMzPReTDkMs0MSjb9KMoEMQkZb006XCEzYmCBMsXtrIT1qKLM2Nu49gj+/MqLUDst+l7Rr3iF0hVMlYNDOIIT8OEMSnALZagCA4QneIFX5955dt6c92npkjPrOYA/cj5+AAh4jpM=AAAB6HicbZDLSsNAFIZP6q3GW9Wlm2ARXJXEjW7EoiAuW7AXaEOZTE/asZNJmJkIJfQJ3LhQxK0+jHs34ts4vSy09YeBj/8/hznnBAlnSrvut5VbWl5ZXcuv2xubW9s7hd29uopTSbFGYx7LZkAUciawppnm2Ewkkijg2AgGV+O8cY9SsVjc6mGCfkR6goWMEm2s6nWnUHRL7kTOIngzKF582OfJ+5dd6RQ+292YphEKTTlRquW5ifYzIjWjHEd2O1WYEDogPWwZFCRC5WeTQUfOkXG6ThhL84R2Ju7vjoxESg2jwFRGRPfVfDY2/8taqQ7P/IyJJNUo6PSjMOWOjp3x1k6XSaSaDw0QKpmZ1aF9IgnV5ja2OYI3v/Ii1E9Knlvyqm6xfAlT5eEADuEYPDiFMtxABWpAAeEBnuDZurMerRfrdVqas2Y9+/BH1tsP+fiQBw==AAAB6HicbZDLSsNAFIZP6q3GW9Wlm2ARXJXEjW7EoiAuW7AXaEOZTE/asZNJmJkIJfQJ3LhQxK0+jHs34ts4vSy09YeBj/8/hznnBAlnSrvut5VbWl5ZXcuv2xubW9s7hd29uopTSbFGYx7LZkAUciawppnm2Ewkkijg2AgGV+O8cY9SsVjc6mGCfkR6goWMEm2s6nWnUHRL7kTOIngzKF582OfJ+5dd6RQ+292YphEKTTlRquW5ifYzIjWjHEd2O1WYEDogPWwZFCRC5WeTQUfOkXG6ThhL84R2Ju7vjoxESg2jwFRGRPfVfDY2/8taqQ7P/IyJJNUo6PSjMOWOjp3x1k6XSaSaDw0QKpmZ1aF9IgnV5ja2OYI3v/Ii1E9Knlvyqm6xfAlT5eEADuEYPDiFMtxABWpAAeEBnuDZurMerRfrdVqas2Y9+/BH1tsP+fiQBw==AAAB6HicbVBNS8NAEJ3Ur1q/qh69LBbBU0m82GNREI8t2A9oQ9lsJ+3azSbsboQS+gu8eFDEqz/Jm//GbZuDtj4YeLw3w8y8IBFcG9f9dgobm1vbO8Xd0t7+weFR+fikreNUMWyxWMSqG1CNgktsGW4EdhOFNAoEdoLJ7dzvPKHSPJYPZpqgH9GR5CFn1FipeTcoV9yquwBZJ15OKpCjMSh/9YcxSyOUhgmqdc9zE+NnVBnOBM5K/VRjQtmEjrBnqaQRaj9bHDojF1YZkjBWtqQhC/X3REYjradRYDsjasZ61ZuL/3m91IQ1P+MySQ1KtlwUpoKYmMy/JkOukBkxtYQyxe2thI2poszYbEo2BG/15XXSvqp6btVrupX6TR5HEc7gHC7Bg2uowz00oAUMEJ7hFd6cR+fFeXc+lq0FJ585hT9wPn8AmTGMxg==

    WAAAB6HicbZC7SwNBEMbn4ivGV9TSZjEIVuHORhsxaGOZgHlAcoS9zVyyZm/v2N0TwhGwt7FQxNZ/xt7O/8bNo9DEDxZ+fN8MOzNBIrg2rvvt5FZW19Y38puFre2d3b3i/kFDx6liWGexiFUroBoFl1g33AhsJQppFAhsBsObSd58QKV5LO/MKEE/on3JQ86osVat2S2W3LI7FVkGbw6lq8/C5SMAVLvFr04vZmmE0jBBtW57bmL8jCrDmcBxoZNqTCgb0j62LUoaofaz6aBjcmKdHgljZZ80ZOr+7shopPUoCmxlRM1AL2YT87+snZrwws+4TFKDks0+ClNBTEwmW5MeV8iMGFmgTHE7K2EDqigz9jYFewRvceVlaJyVPbfs1dxS5RpmysMRHMMpeHAOFbiFKtSBAcITvMCrc+88O2/O+6w058x7DuGPnI8fIjyOpA==AAAB6HicbZC7SgNBFIbPeo3rLWppMxgEq7Bro40YtLFMwFwgCWF2cjYZMzu7zMwKYckT2FgoYqsPY28jvo2TS6GJPwx8/P85zDknSATXxvO+naXlldW19dyGu7m1vbOb39uv6ThVDKssFrFqBFSj4BKrhhuBjUQhjQKB9WBwPc7r96g0j+WtGSbYjmhP8pAzaqxVqXfyBa/oTUQWwZ9B4fLDvUjev9xyJ//Z6sYsjVAaJqjWTd9LTDujynAmcOS2Uo0JZQPaw6ZFSSPU7Wwy6IgcW6dLwljZJw2ZuL87MhppPYwCWxlR09fz2dj8L2umJjxvZ1wmqUHJph+FqSAmJuOtSZcrZEYMLVCmuJ2VsD5VlBl7G9cewZ9feRFqp0XfK/oVr1C6gqlycAhHcAI+nEEJbqAMVWCA8ABP8OzcOY/Oi/M6LV1yZj0H8EfO2w8Ty5AYAAAB6HicbZC7SgNBFIbPeo3rLWppMxgEq7Bro40YtLFMwFwgCWF2cjYZMzu7zMwKYckT2FgoYqsPY28jvo2TS6GJPwx8/P85zDknSATXxvO+naXlldW19dyGu7m1vbOb39uv6ThVDKssFrFqBFSj4BKrhhuBjUQhjQKB9WBwPc7r96g0j+WtGSbYjmhP8pAzaqxVqXfyBa/oTUQWwZ9B4fLDvUjev9xyJ//Z6sYsjVAaJqjWTd9LTDujynAmcOS2Uo0JZQPaw6ZFSSPU7Wwy6IgcW6dLwljZJw2ZuL87MhppPYwCWxlR09fz2dj8L2umJjxvZ1wmqUHJph+FqSAmJuOtSZcrZEYMLVCmuJ2VsD5VlBl7G9cewZ9feRFqp0XfK/oVr1C6gqlycAhHcAI+nEEJbqAMVWCA8ABP8OzcOY/Oi/M6LV1yZj0H8EfO2w8Ty5AYAAAB6HicbVBNT8JAEJ3iF+IX6tHLRmLiibRe9Ej04hESCyTQkO0yhZXtttndmpCGX+DFg8Z49Sd589+4QA8KvmSSl/dmMjMvTAXXxnW/ndLG5tb2Tnm3srd/cHhUPT5p6yRTDH2WiER1Q6pRcIm+4UZgN1VI41BgJ5zczf3OEyrNE/lgpikGMR1JHnFGjZVanUG15tbdBcg68QpSgwLNQfWrP0xYFqM0TFCte56bmiCnynAmcFbpZxpTyiZ0hD1LJY1RB/ni0Bm5sMqQRImyJQ1ZqL8nchprPY1D2xlTM9ar3lz8z+tlJroJci7TzKBky0VRJohJyPxrMuQKmRFTSyhT3N5K2JgqyozNpmJD8FZfXiftq7rn1r2WW2vcFnGU4QzO4RI8uIYG3EMTfGCA8Ayv8OY8Oi/Ou/OxbC05xcwp/IHz+QOy9YzX

    Position Prediction

    Word Prediction

    (a)AAAB6nicbVDLSgNBEOz1GeNr1aOXwSDES9j1oseAF48JmgckS5id9CZDZmeXmVkhLPkELx4U8eqH+A3e/As/wcnjoIkFDUVVN91dYSq4Np735aytb2xubRd2irt7+weH7tFxUyeZYthgiUhUO6QaBZfYMNwIbKcKaRwKbIWjm6nfekCleSLvzTjFIKYDySPOqLHSXZle9NySV/FmIKvEX5BS1a1/fwBAred+dvsJy2KUhgmqdcf3UhPkVBnOBE6K3UxjStmIDrBjqaQx6iCfnToh51bpkyhRtqQhM/X3RE5jrcdxaDtjaoZ62ZuK/3mdzETXQc5lmhmUbL4oygQxCZn+TfpcITNibAllittbCRtSRZmx6RRtCP7yy6ukeVnxvYpft2l4MEcBTuEMyuDDFVThFmrQAAYDeIRneHGE8+S8Om/z1jVnMXMCf+C8/wChyI+FAAAB6nicbVC7SgNBFL0bXzG+Vu20GQxCbMKujZYBGwuLBM0DkiXMTmaTIbMzy8ysEJZ8gdhYKGLrn/gHdv6Fn+DkUWjigQuHc+7l3nvChDNtPO/Lya2srq1v5DcLW9s7u3vu/kFDy1QRWieSS9UKsaacCVo3zHDaShTFcchpMxxeTfzmPVWaSXFnRgkNYtwXLGIEGyvdlvBZ1y16ZW8KtEz8OSlW3Nr3x83RQ7XrfnZ6kqQxFYZwrHXb9xITZFgZRjgdFzqppgkmQ9ynbUsFjqkOsumpY3RqlR6KpLIlDJqqvycyHGs9ikPbGWMz0IveRPzPa6cmugwyJpLUUEFmi6KUIyPR5G/UY4oSw0eWYKKYvRWRAVaYGJtOwYbgL768TBrnZd8r+zWbhgcz5OEYTqAEPlxABa6hCnUg0IdHeIYXhztPzqvzNmvNOfOZQ/gD5/0H55aQeQ==AAAB6nicbVC7SgNBFL0bXzG+Vu20GQxCbMKujZYBGwuLBM0DkiXMTmaTIbMzy8ysEJZ8gdhYKGLrn/gHdv6Fn+DkUWjigQuHc+7l3nvChDNtPO/Lya2srq1v5DcLW9s7u3vu/kFDy1QRWieSS9UKsaacCVo3zHDaShTFcchpMxxeTfzmPVWaSXFnRgkNYtwXLGIEGyvdlvBZ1y16ZW8KtEz8OSlW3Nr3x83RQ7XrfnZ6kqQxFYZwrHXb9xITZFgZRjgdFzqppgkmQ9ynbUsFjqkOsumpY3RqlR6KpLIlDJqqvycyHGs9ikPbGWMz0IveRPzPa6cmugwyJpLUUEFmi6KUIyPR5G/UY4oSw0eWYKKYvRWRAVaYGJtOwYbgL768TBrnZd8r+zWbhgcz5OEYTqAEPlxABa6hCnUg0IdHeIYXhztPzqvzNmvNOfOZQ/gD5/0H55aQeQ==AAAB6nicbVA9SwNBEJ2LXzF+RS1tFoMQm7Bno2XAxjKi+YDkCHubSbJkb+/Y3RPCkZ9gY6GIrb/Izn/jJrlCEx8MPN6bYWZemEhhLKXfXmFjc2t7p7hb2ts/ODwqH5+0TJxqjk0ey1h3QmZQCoVNK6zETqKRRaHEdji5nfvtJ9RGxOrRThMMIjZSYig4s056qLLLfrlCa3QBsk78nFQgR6Nf/uoNYp5GqCyXzJiuTxMbZExbwSXOSr3UYML4hI2w66hiEZogW5w6IxdOGZBhrF0pSxbq74mMRcZMo9B1RsyOzao3F//zuqkd3gSZUElqUfHlomEqiY3J/G8yEBq5lVNHGNfC3Ur4mGnGrUun5ELwV19eJ62rmk9r/j2t1GkeRxHO4Byq4MM11OEOGtAEDiN4hld486T34r17H8vWgpfPnMIfeJ8/ggGNNA==

    (b)AAAB6nicbVDLSgNBEOz1GeNr1aOXwSDES9j1oseAF48JmgckS5id9CZDZmeXmVkhLPkELx4U8eqH+A3e/As/wcnjoIkFDUVVN91dYSq4Np735aytb2xubRd2irt7+weH7tFxUyeZYthgiUhUO6QaBZfYMNwIbKcKaRwKbIWjm6nfekCleSLvzTjFIKYDySPOqLHSXTm86Lklr+LNQFaJvyClqlv//gCAWs/97PYTlsUoDRNU647vpSbIqTKcCZwUu5nGlLIRHWDHUklj1EE+O3VCzq3SJ1GibElDZurviZzGWo/j0HbG1Az1sjcV//M6mYmug5zLNDMo2XxRlAliEjL9m/S5QmbE2BLKFLe3EjakijJj0ynaEPzll1dJ87LiexW/btPwYI4CnMIZlMGHK6jCLdSgAQwG8AjP8OII58l5dd7mrWvOYuYE/sB5/wGjTY+GAAAB6nicbVC7SgNBFL0bXzG+Vu20GQxCbMKujZYBGwuLBM0DkiXMTmaTIbMzy8ysEJZ8gdhYKGLrn/gHdv6Fn+DkUWjigQuHc+7l3nvChDNtPO/Lya2srq1v5DcLW9s7u3vu/kFDy1QRWieSS9UKsaacCVo3zHDaShTFcchpMxxeTfzmPVWaSXFnRgkNYtwXLGIEGyvdlsKzrlv0yt4UaJn4c1KsuLXvj5ujh2rX/ez0JEljKgzhWOu27yUmyLAyjHA6LnRSTRNMhrhP25YKHFMdZNNTx+jUKj0USWVLGDRVf09kONZ6FIe2M8ZmoBe9ifif105NdBlkTCSpoYLMFkUpR0aiyd+oxxQlho8swUQxeysiA6wwMTadgg3BX3x5mTTOy75X9ms2DQ9myMMxnEAJfLiAClxDFepAoA+P8AwvDneenFfnbdaac+Yzh/AHzvsP6RuQeg==AAAB6nicbVC7SgNBFL0bXzG+Vu20GQxCbMKujZYBGwuLBM0DkiXMTmaTIbMzy8ysEJZ8gdhYKGLrn/gHdv6Fn+DkUWjigQuHc+7l3nvChDNtPO/Lya2srq1v5DcLW9s7u3vu/kFDy1QRWieSS9UKsaacCVo3zHDaShTFcchpMxxeTfzmPVWaSXFnRgkNYtwXLGIEGyvdlsKzrlv0yt4UaJn4c1KsuLXvj5ujh2rX/ez0JEljKgzhWOu27yUmyLAyjHA6LnRSTRNMhrhP25YKHFMdZNNTx+jUKj0USWVLGDRVf09kONZ6FIe2M8ZmoBe9ifif105NdBlkTCSpoYLMFkUpR0aiyd+oxxQlho8swUQxeysiA6wwMTadgg3BX3x5mTTOy75X9ms2DQ9myMMxnEAJfLiAClxDFepAoA+P8AwvDneenFfnbdaac+Yzh/AHzvsP6RuQeg==AAAB6nicbVA9SwNBEJ2LXzF+RS1tFoMQm7Bno2XAxjKi+YDkCHubSbJkb+/Y3RPCkZ9gY6GIrb/Izn/jJrlCEx8MPN6bYWZemEhhLKXfXmFjc2t7p7hb2ts/ODwqH5+0TJxqjk0ey1h3QmZQCoVNK6zETqKRRaHEdji5nfvtJ9RGxOrRThMMIjZSYig4s056qIaX/XKF1ugCZJ34OalAjka//NUbxDyNUFkumTFdnyY2yJi2gkuclXqpwYTxCRth11HFIjRBtjh1Ri6cMiDDWLtSlizU3xMZi4yZRqHrjJgdm1VvLv7ndVM7vAkyoZLUouLLRcNUEhuT+d9kIDRyK6eOMK6Fu5XwMdOMW5dOyYXgr768TlpXNZ/W/HtaqdM8jiKcwTlUwYdrqMMdNKAJHEbwDK/w5knvxXv3PpatBS+fOYU/8D5/AIOGjTU=

    (c)AAAB6nicbVDLSgNBEOz1GeNr1aOXwSDES9j1oseAF48JmgckS5id9CZDZmeXmVkhLPkELx4U8eqH+A3e/As/wcnjoIkFDUVVN91dYSq4Np735aytb2xubRd2irt7+weH7tFxUyeZYthgiUhUO6QaBZfYMNwIbKcKaRwKbIWjm6nfekCleSLvzTjFIKYDySPOqLHSXZld9NySV/FmIKvEX5BS1a1/fwBAred+dvsJy2KUhgmqdcf3UhPkVBnOBE6K3UxjStmIDrBjqaQx6iCfnToh51bpkyhRtqQhM/X3RE5jrcdxaDtjaoZ62ZuK/3mdzETXQc5lmhmUbL4oygQxCZn+TfpcITNibAllittbCRtSRZmx6RRtCP7yy6ukeVnxvYpft2l4MEcBTuEMyuDDFVThFmrQAAYDeIRneHGE8+S8Om/z1jVnMXMCf+C8/wCk0o+HAAAB6nicbVC7SgNBFL0bXzG+Vu20GQxCbMKujZYBGwuLBM0DkiXMTmaTIbMzy8ysEJZ8gdhYKGLrn/gHdv6Fn+DkUWjigQuHc+7l3nvChDNtPO/Lya2srq1v5DcLW9s7u3vu/kFDy1QRWieSS9UKsaacCVo3zHDaShTFcchpMxxeTfzmPVWaSXFnRgkNYtwXLGIEGyvdlshZ1y16ZW8KtEz8OSlW3Nr3x83RQ7XrfnZ6kqQxFYZwrHXb9xITZFgZRjgdFzqppgkmQ9ynbUsFjqkOsumpY3RqlR6KpLIlDJqqvycyHGs9ikPbGWMz0IveRPzPa6cmugwyJpLUUEFmi6KUIyPR5G/UY4oSw0eWYKKYvRWRAVaYGJtOwYbgL768TBrnZd8r+zWbhgcz5OEYTqAEPlxABa6hCnUg0IdHeIYXhztPzqvzNmvNOfOZQ/gD5/0H6qCQew==AAAB6nicbVC7SgNBFL0bXzG+Vu20GQxCbMKujZYBGwuLBM0DkiXMTmaTIbMzy8ysEJZ8gdhYKGLrn/gHdv6Fn+DkUWjigQuHc+7l3nvChDNtPO/Lya2srq1v5DcLW9s7u3vu/kFDy1QRWieSS9UKsaacCVo3zHDaShTFcchpMxxeTfzmPVWaSXFnRgkNYtwXLGIEGyvdlshZ1y16ZW8KtEz8OSlW3Nr3x83RQ7XrfnZ6kqQxFYZwrHXb9xITZFgZRjgdFzqppgkmQ9ynbUsFjqkOsumpY3RqlR6KpLIlDJqqvycyHGs9ikPbGWMz0IveRPzPa6cmugwyJpLUUEFmi6KUIyPR5G/UY4oSw0eWYKKYvRWRAVaYGJtOwYbgL768TBrnZd8r+zWbhgcz5OEYTqAEPlxABa6hCnUg0IdHeIYXhztPzqvzNmvNOfOZQ/gD5/0H6qCQew==AAAB6nicbVA9SwNBEJ2LXzF+RS1tFoMQm7Bno2XAxjKi+YDkCHubSbJkb+/Y3RPCkZ9gY6GIrb/Izn/jJrlCEx8MPN6bYWZemEhhLKXfXmFjc2t7p7hb2ts/ODwqH5+0TJxqjk0ey1h3QmZQCoVNK6zETqKRRaHEdji5nfvtJ9RGxOrRThMMIjZSYig4s056qPLLfrlCa3QBsk78nFQgR6Nf/uoNYp5GqCyXzJiuTxMbZExbwSXOSr3UYML4hI2w66hiEZogW5w6IxdOGZBhrF0pSxbq74mMRcZMo9B1RsyOzao3F//zuqkd3gSZUElqUfHlomEqiY3J/G8yEBq5lVNHGNfC3Ur4mGnGrUun5ELwV19eJ62rmk9r/j2t1GkeRxHO4Byq4MM11OEOGtAEDiN4hld486T34r17H8vWgpfPnMIfeJ8/hQuNNg==

    Figure 2: The overall framework of the proposed Transformer-InDIGO which includes (a) the word &position prediction module; (b) the one step decoding with position updating; (c) final decoding output byreordering. The black-white blocks represent the relative position matrix.

    absolute-position-based positional encodings areinefficient as mentioned in Section 3.2, in that abso-lute positions are changed during decoding, invali-dating the previous hidden states. In contrast, weadapt Shaw et al. (2018) to use relative positions inself-attention. Different from Shaw et al. (2018), inwhich a clipping distance d (usually d ≥ 2) is setfor relative positions, our relative-position repre-sentations only preserve d = 1 relations (Eq. (3)).

    Each attention head in a multi-head self-attentionmodule of Transformer-InDIGO takes the hiddenstates of a partial sequence y0:t, denoted as U =(u0, ...,ut), and its corresponding relative positionmatrix Rt as input, where each input state ui ∈Rdmodel . The logit ei,j for attention is computed as:

    ei,j =

    (u>i Q

    )·(u>j K +A[ri,j+1]

    )>√dmodel

    , (7)

    where Q,K ∈ Rdmodel×dmodel and A ∈ R3×dmodelare parameter matrices. A[ri,j+1] is the row vectorindexed by ri,j +1, which biases all the input keysbased on the relative position, ri,j .

    Word & Position Prediction Like the vanillaTransformer, we take the representations from thelast layer of self-attention, H = (h0, ...,ht) andH ∈ Rdmodel×(t+1), to predict both the next wordyt+1 and its position vector rt+1 in two stagesbased on the following factorization:

    p(yt+1, rt+1|H)=p(yt+1|H) · p(rt+1|yt+1, H)

    It can also be factorized as predicting the positionbefore predicting the next word, yet our prelimi-nary experiments show that predicting the wordfirst works slightly better. The prediction module

    for word & position prediction are shown in Fig-ure 2(a).

    First, we predict the next word yt+1 from thecategorical distribution pword(y|H) as:

    pword(y|H) = softmax((h>t F ) ·W>

    ), (8)

    where W ∈ RdV×dmodel is the embedding matrixand dV is the size of vocabulary. We linearly projectthe last representation ht using F ∈ Rdmodel×dmodelfor querying W . Then, as shown in Eq. (6), the pre-diction of the next position is done by performinginsertion operations to existing words which canbe modeled similarly to Pointer Networks (Vinyalset al., 2015). We predict a pointer kt+1 ∈ [0, 2t+1]based on:

    ppointer(k|yt+1, H) =

    softmax

    ((h>t E +W[yt+1]) ·

    [H>CH>D

    ]>),

    (9)

    where C,D,E ∈ Rdmodel×dmodel and W[yt+1] is theembedding of the predicted word. C,D are usedto obtain the left and right keys, respectively, con-sidering that each word has two “keys” (its leftand right) for inserting the generated word. Thequery vector is obtained by adding up the word em-bedding W[yt+1], and the linearly projected state,h>t E. The resulting relative-position vector, rt+1is computed using kt+1 according to Eq. (6). Wemanually set ppointer(0|·) = ppointer(2 + t|·) = 0 toavoid any word from being inserted to the left of〈s〉 and the right of 〈/s〉.

    Position Updating As mentioned in Sec. 3.1, weupdate the relative position representation Rt with

  • Pre-defined Order Descriptions

    Left-to-right (L2R) Generate words from left to right. (Wu et al., 2018)Right-to-left (R2L) Generate words from right to left. (Wu et al., 2018)

    Odd-Even (ODD) Generate words at odd positions from left to right, then generate even positions. (Ford et al., 2018)Balanced-tree (BLT) Generate words with a top-down left-to-right order from a balanced binary tree. (Stern et al., 2019)Syntax-tree (SYN) Generate words with a top-down left-to-right order from the dependency tree. (Wang et al., 2018b)Common-First (CF) Generate all common words first from left to right, and then generate the others. (Ford et al., 2018)Rare-First (RF) Generate all rare words first from left to right, and then generate the remaining. (Ford et al., 2018)

    Random (RND) Generate words in a random order shuffled every time the example was loaded.

    Table 1: Descriptions of the pre-defined orders used in this work. Major references that have exploredthese generation orders with different models and applications are also marked.

    the predicted rt+1. Because updating the relativepositions will not change the pre-computed relative-position representations, Transformer-InDIGO canreuse the previous hidden states in the next decod-ing step the same as the vanilla Transformer.

    4.2 Learning

    Training requires maximizing the marginalizedlikelihood in Eq. (2). Yet this is intractable sincewe need to enumerate all of the T ! permutations oftokens. Instead, we maximize the evidence lower-bound (ELBO) of the original objective by intro-ducing an approximate posterior distribution ofgeneration orders q(π|x,y), which provides theprobabilities of latent generation orders based onthe ground-truth sequences x and y:

    LELBO = Eπ∼q

    log pθ(yπ|x) +H(q)

    = Er2:T+1∼q

    T+1∑t=1

    log pθ(yt+1|y0:t, r0:t, x1:T ′)︸ ︷︷ ︸Word Prediction Loss

    +T∑t=1

    log pθ(rt+1|y0:t+1, r0:t, x1:T ′)︸ ︷︷ ︸Position Prediction Loss

    +H(q),(10)

    where π = r2:T+1, sampled from q(π|x,y), isrepresented as relative positions. H(q) is the en-tropy term which can be ignored if q is fixed duringtraining. Eq. (10) shows that given a sampled or-der, the learning objective is divided into word &position objectives. For calculating the positionprediction loss, we aggregate the two probabilitiescorresponding to the same position by

    pθ(rt+1|·) = ppointer(kl|·) + ppointer(kr|·), (11)

    where ppointer(kl|·) and ppointer(kr|·) are calculatedsimultaneously from the same softmax function in

    Eq. (9). kl, kr(kl 6= kr) represent the keys corre-sponding to the same relative position.

    Here, we study two types of q(π|x,y):

    Pre-defined Order If we already possess someprior knowledge about the sequence, e.g., the L2Rorder is proven to be a strong baseline in manyscenarios, we assume a Dirac-delta distributionq(π|x,y) = δ(π = π∗(x,y)), where π∗(x,y))is a predefined order. In this work, we study a set ofpre-defined orders which can be found in Table. 1,for evaluating their effect on generation.

    Searched Adaptive Order (SAO) We choosethe approximate posterior q as the point estima-tion that maximizes log pθ(yπ|x), which can alsobe seen as the maximum-a-posteriori (MAP) esti-mation on the latent order π. In practice, we ap-proximate these generation orders π through beam-search (Pal et al., 2006). Unlike the original beam-search for autoregressive decoding that searches inthe sequence space to find the sequence maximiz-ing the probability shown in Eq. 1, we search in thespace of all the permutations of the target sequenceto find π maximising Eq. 2, as all the target tokensare known in advance during training.

    More specifically, we maintain B sub-sequenceswith the maximum probabilities using a set B ateach step t. For every sub-sequence y(b)0:t ∈ B, weevaluate the probabilities of every possible choicefrom the remaining words y′ ∈ y \ y(b)0:t and itsposition r′. We calculate the cumulative likelihoodfor each y′, r′, based on which we select top-B sub-sequences as the new set B for the next step. Afterobtaining the B generation orders, we optimize ourobjective as an average over these orders:

    LSAO =1

    B

    ∑π∈B

    log pθ(yπ|x) (12)

  • where we assume q(π|x,y) ={1/B π ∈ B0 otherwise

    .

    Beam Search with Dropout The goal of beamsearch is to approximately find the most likelygeneration orders, which limits learning from ex-ploring other generation orders that may not befavourable currently but may ultimately be deemedbetter. Prior research (Vijayakumar et al., 2016)also pointed out that the search space of the stan-dard beam-search is restricted. We encourage ex-ploration by injecting noise during beam search(Cho, 2016). Particularly, we found it effective tokeep the dropout on (e.g., dropout = 0.1).

    Bootstrapping from a Pre-defined Order Dur-ing preliminary experiments, sequences returnedby beam-search were often degenerated by alwayspredicting common or functional words (e.g., “the”,“,”, etc.) as the first several tokens, leading to in-ferior performance. We conjecture that is due tothe fact that the position prediction module learnsmuch faster than the word prediction module, andit quickly captures spurious correlations inducedby a poorly initialized model. It is essential to bal-ance the learning progress of these modules. Todo so, we bootstrap learning by pre-training themodel with a pre-defined order (e.g., L2R), beforetraining with beam-searched orders.

    4.3 Decoding

    As for decoding, we directly follow Algorithm 1to sample or decode greedily from the proposedmodel. However, in practice beam-search is im-portant to explore the output space for neural au-toregressive models. In our implementation, weperform beam-search for InDIGO as a two-stepsearch. Suppose the beam size B, at each step, wedo beam-search for word prediction and then withthe searched words, try out all possible positionsand select the top-B sub-sequences. In preliminaryexperiments, we also tried doing beam-search forword and positions simultaneously with their jointprobability. However, it did not seem helpful.

    5 Experiments

    We evaluate InDIGO extensively on four challeng-ing sequence generation tasks: word order recov-ery, machine translation, natural language to codegeneration (NL2Code, Ling et al., 2016) and imagecaptioning. We compare our model trained with thepre-defined orders and the adaptive orders obtained

    Dataset Train Dev Test Length

    WMT16 Ro-En 620k 2000 2000 26.48WMT18 En-Tr 207k 3007 3000 25.81KFTT En-Ja 405k 1166 1160 27.51Django 16k 1000 1801 8.87MS-COCO 567k 5000 5000 12.52

    Table 2: Dataset statistics for the machine transla-tion, code generation and image captioning tasks.Length represents the average number of tokens fortarget sentences of the training set.

    by beam-search. We use the same architecture forall orders including the standard L2R order.

    5.1 Experimental SettingsDataset The machine translation experiments areconducted on three language pairs for studyinghow the decoding order influences the translationquality of languages with diversified characteris-tics: WMT’16 Romanian-English (Ro-En),2 WMT18 English-Turkish (En-Tr)3 and KFTT English-Japanese (En-Ja, Neubig, 2011) 4 The Englishpart of the Ro-En dataset is used for the wordorder recovery task. For the NL2Code task, Weuse the Django dataset (Oda et al., 2015)5 andthe MS COCO (Lin et al., 2014) with the stan-dard split (Karpathy and Fei-Fei, 2015) for theNL2Code task and image captioning, respectively.The dataset statistics are shown in Table 2.

    Preprocessing We apply the Moses tokeniza-tion6 and normalization on all the text datasets ex-cept for codes. We perform 32, 000 joint BPE (Sen-nrich et al., 2016) operations for the MT datasets,while using all the unique words as the vocabularyfor NL2Code. For image captioning, we follow thesame procedure as described by Lee et al. (2018),where we use 49 512-dimensional image featurevectors (extracted from a pretrained ResNet-18 (Heet al., 2016)) as the input to the Transformer en-coder. The image features are fixed during training.

    Models We set dmodel = 512, dhidden = 2048,nheads = 8, nlayers = 6, lrmax = 0.0005,warmup = 4000 and dropout = 0.1 throughoutall the experiments. The source and target embed-ding matrices are shared except for En-Ja, as our

    2 http://www.statmt.org/wmt16/translation-task.html3 http://www.statmt.org/wmt18/translation-task.html4http://www.phontron.com/kftt/.5 https://github.com/odashi/ase15-django-dataset6 https://github.com/moses-smt/mosesdecoder

    http://www.statmt.org/wmt16/translation-task.htmlhttp://www.statmt.org/wmt18/translation-task.htmlhttp://www.phontron.com/kftt/https://github.com/odashi/ase15-django-datasethttps://github.com/moses-smt/mosesdecoder

  • Order WMT16 Ro→ En WMT18 En→ Tr KFTT En→ JaBLEU Ribes Meteor TER BLEU Ribes Meteor TER BLEU Ribes Meteor TER

    RND 20.20 79.35 41.00 63.20 03.04 55.45 19.12 90.60 17.09 70.89 35.24 70.11

    L2R 31.82 83.37 52.19 50.62 14.85 69.20 33.90 71.56 30.87 77.72 48.57 59.92R2L 31.62 83.18 52.09 50.20 14.38 68.87 33.33 71.91 30.44 77.95 47.91 61.09ODD 30.11 83.09 50.68 50.79 13.64 68.85 32.48 72.84 28.59 77.01 46.28 60.12BLT 24.38 81.70 45.67 55.38 08.72 65.70 27.40 77.76 21.50 73.97 40.23 64.39SYN 29.62 82.65 50.25 52.14 – –CF 30.25 83.22 50.71 50.72 12.04 67.61 31.18 74.75 28.91 77.06 46.46 61.56RF 30.23 83.29 50.72 51.73 12.10 67.44 30.72 73.40 27.35 76.40 45.15 62.14

    SAO 32.47 84.10 53.00 49.02 15.18 70.06 34.60 71.56 31.91 77.56 49.66 59.80

    Table 3: Results of translation experiments for three language pairs in different decoding orders. Scoresare reported on the test set with four widely used evaluation metrics (BLEU↑, Meteor↑, TER↓ and Ribes↑).We do not report models trained with SYN order on En-Tr and En-Ja due to the lack of reliable dependencyparsers. The statistical significance analysis6 between the outputs of SAO and L2R are conducted usingBLEU score as the metric, and the p-values are ≤ 0.001 for all three language pairs.

    1 3 5 7 9 11 13 15 17 19decoding beam size

    29

    30

    31

    32

    33

    L2RSAO

    Figure 3: The BLEU scores on the test set for wordorder recovery with various decoding beam sizes.

    preliminary experiments showed that keeping theembeddings not shared significantly improves thetranslation quality. Both the encoder and decoderuse relative positions during self-attention exceptfor the word order recovery experiments (where theposition embedding is removed in the encoder, asthere is no ground-truth position information in theinput.) We do not introduce task-specific modulessuch as copying mechanism (Gu et al., 2016).

    Training When training with the pre-defined or-ders, we reorder words of each training sequencein advance accordingly which provides supervisionof the ground-truth positions that each word shouldbe inserted. We test the pre-defined orders listed inTable 1. The SYN orders were generated accordingto the dependency parse obtained by a dependencyparse parser from Spacy (Honnibal and Montani,2017) following a parent-to-children left-to-rightorder. The CF & RF orders are obtained based on

    vocabulary cut-off so that the number of commonwords and the number of rare words are approxi-mately the same (Ford et al., 2018). We also con-sider on-the-fly sampling a random order for eachsentence as the baseline (RND). When using L2Ras the pre-defined order, Transformer-InDIGO isalmost equivalent to the vanilla Transformer, as theposition prediction simply learns to predict the nextposition as the left of the 〈s〉 symbol. The only dif-ference is that it enhances the vanilla Transformerwith a small number of additional parameters forthe position prediction.

    We also train Transformer-InDIGO using thesearched adaptive order (SAO) where we set thebeam size to 8. In default, models trained withSAO are bootstrapped from a slightly pre-trained(6,000 steps) model in L2R order.

    Inference During the test time, we do beam-search as described in Sec. 4.3. We observe fromour preliminary experiments that models trainedwith different orders (either pre-defined or SAO)have very different optimal beam sizes for decod-ing. Therefore, we perform sensitivity studies, inwhich the beam sizes vary from 1 ∼ 20 and pickthe beam size with the highest BLEU score on thevalidation set for each particular model.

    5.2 Results and AnalysisWord Order Recovery Word order recoverytakes a bag of words as input and recovers its origi-nal word order, which is challenging as the searchspace is factorial. We do not restrict the vocabularyof the input words. We compare our model trainedwith the L2R order and eight searched adaptive

  • Model Django MS-COCOBLEU Accuracy BLEU CIDEr-D

    L2R 36.74 13.6% 22.12 68.88SAO 42.33 16.3% 22.58 69.42

    Table 4: Results on the official test sets for bothcode generation and image captioning tasks.

    Model Variants dev test

    Baseline L2R 32.53 31.82SAO default 33.60 32.47

    no bootstrap 32.86 31.88no bootstrap, no noise 32.64 31.72bootstrap from R2L order 33.12 32.02bootstrap from SYN order 33.09 31.93

    Stern et al. (2019) - Uniform 29.99 28.52Stern et al. (2019) - Binary 32.27 30.66

    Table 5: Ablation study for machine translation onWMT16 Ro-En. Results of Stern et al. (2019) arebased on greedy decoding with the EOS penalty.

    orders (SAO) from beam search for word order re-covery. The BLEU scores over various beam sizesare shown in Figure 3. The model trained withSAO lead to higher BLEU scores over that trainedwith L2R with a gain up to 3 BLEU scores. Fur-thermore, increasing the beam size brings more im-provements for SAO compared to L2R, suggestingthat InDIGO produces more diversified predictionsso that it has higher chances to recover the order.

    Machine Translation As shown in Table 3, wecompare our model trained with pre-defined ordersand the searched adaptive orders (SAO) with vary-ing setups. We use four evaluation metrics includ-ing BLEU (Papineni et al., 2002), Ribes (Isozakiet al., 2010), Meteor (Banerjee and Lavie, 2005)and TER (Snover et al., 2006) to avoid using asingle metric that might be in favor of a particu-lar generation order. Most of the pre-defined or-ders (except for the random order and the balancedtree (BLT) order) perform reasonably well with In-DIGO on the three language pairs. The best scorewith a predefined word ordering is reached by theL2R order among the pre-defined orders except forEn-Ja, where the R2L order works slightly betteraccording to Ribes. This indicates that in machinetranslation, the monotonic orders are reasonableand reflect the languages. ODD, CF and RF show

    Model Training (b/s) Decoding (ms/s)

    L2R 4.21 12.3SAO (b = 1) 1.12 12.5SAO (b = 8) 0.58 12.8

    Table 6: Comparison of the L2R order with SAOon running time, where b/s is batches per secondand ms/s is ms per sentence. All experiments areconducted on 8 Nvidia V100 GPUs with 2000 to-kens per GPU. We also compare beam sizes of 1and 8 for SAO to search the best orders duringtraining. We report the decoding speed of all threemodels based on greedy decoding.

    similar performance, which is below the L2R andR2L orders by around 2 BLEU scores. The tree-based orders, such as the SYN and BLT orders donot perform well, indicating that predicting wordsfollowing a syntactic path is not preferable. Onthe other hand, Table 3 shows that the model withSAO achieves competitive and even statisticallysignificant improvements over the L2R order. Theimprovements are larger for Turkish and Japanese,indicating that a flexible generation order may im-prove the translation quality for languages withdifferent syntactic structures from English.

    Code Generation The goal of this task is to gen-erate Python code based on a natural language de-scription, which can be achieved by using a stan-dard sequence-to-sequence generation frameworksuch as the proposed Transformer-InDIGO. Asshown in Table 4, SAO works significantly betterthan the L2R order in terms of both BLEU and ac-curacy. This shows that flexible generation ordersare more preferable in code generation.

    Image Captioning For the captioning task, onecaption is generated per image and is comparedagainst five human-created captions during testing.As show in Table 4, we observe that SAO obtainshigher BLEU and CIDEr-D (Vedantam et al., 2015)compared to the L2R order, and it implies thatbetter captions are generated with different orders.

    5.3 Ablation StudyModel Variants Table 5 shows the results of theablation studies using the machine translation task.SAO without bootstrapping nor beam-search de-generates by approximate 1 BLEU score on Ro-En,demonstrating the effectiveness of these two meth-ods. We also test SAO by bootstrapping from amodel trained with a R2L order as well as a SYN

  • [Input] there are many shrines nation@@ wide with the same name .

    [Ground Truth]

    [Output]

    Figure 4: An instantiated concrete example of the decoding process using InDIGO sampled from theEn-Ja translation datset. The final output is reordered based on the predicted relative-position matrix.

    order, which obtains slightly worse yet compara-ble results compared to bootstrapping from L2R.This suggests that the SAO algorithm is quite ro-bust with different bootstrapping methods, and L2Rbootstrapping performs the best. In addition, were-implement a recent work (Stern et al., 2019),which adopts a similar idea of generating sequencesthrough insertion operations for machine transla-tion. We use the best settings of their algorithm,i.e., training with binary-tree/uniform slot-lossesand slot-termination, while removing the knowl-edge distillation for a fair comparison with ours.Our model obtains better performance comparedto Stern et al. (2019) on WMT16 Ro-En.

    Running Time As shown in Table 6, InDIGOdecodes sentences as efficient as the standard L2R

    autoregressive models. However, it is slower interms of training time using SAO as the supervi-sion, as additional efforts are needed to search thegeneration orders, and it is difficult to parallelizethe SAO. SAO with beam sizes 1 and 8 are 3.8 and7.2 times slower compared to L2R, respectively.Note that enlarging the beam size during trainingwon’t affect the decoding time as searching the bestorders only happen in the training time. We willinvestigate off-line searching methods to speed upSAO training and make InDIGO more scalable inthe future.

    5.4 Visualization

    Relative-Position Matrix In Figure 4, we showan instantiated example produced by InDIGO,which is randomly sampled from the validation set

  • WMT16 Ro-En[source] acestia dezvaluie secre@@ tele pant@@ om@@ im@@ ei lor la bbc break@@ fast .[target] they reveal their pan@@ to secre@@ ts to bbc break@@ fast .

    — InDIGO learned from Pre-defined R2L Order.fast .break@@ fast .bbc break@@ fast .the bbc break@@ fast .at the bbc break@@ fast .

    ts at the bbc break@@ fast .secre@@ ts at the bbc break@@ fast .their secre@@ ts at the bbc break@@ fast .reveal their secre@@ ts at the bbc break@@ fast .they reveal their secre@@ ts at the bbc break@@ fast .

    they reveal the secre@@ ts of their sho@@ esthey reveal the secre@@ ts of their sho@@ es to they reveal the secre@@ ts of their sho@@ es to the they reveal the secre@@ ts of their sho@@ es to the bbc they reveal the secre@@ ts of their sho@@ es to the bbc break@@ they reveal the secre@@ ts of their sho@@ es to the bbc break@@ fastthey reveal the secre@@ ts of their sho@@ es to the bbc break@@ fast .

    — InDIGO learned from Pre-defined L2R Ordertheythey reveal they reveal the they reveal the secre@@ they reveal the secre@@ ts they reveal the secre@@ ts ofthey reveal the secre@@ ts of theirthey reveal the secre@@ ts of their sho@@

    — InDIGO learned from Pre-defined SYN Orderrevealthey revealthey reveal secre@@they reveal secre@@ tsthey reveal secre@@ ts .they reveal the secre@@ ts .they reveal the secre@@ ts of .

    they reveal the secre@@ ts of sho@@ .they reveal the secre@@ ts of sho@@ e .they reveal the secre@@ ts of their sho@@ e .they reveal the secre@@ ts of their sho@@ e in .they reveal the secre@@ ts of their sho@@ e in break@@ .they reveal the secre@@ ts of their sho@@ e in break@@ fast .they reveal the secre@@ ts of their sho@@ e in bbc break@@ fast .

    — InDIGO learned from Searched Adaptive Order (SAO).the .the bbc .the bbc break@@ .the bbc break@@ fast .they the bbc break@@ fast .they the the bbc break@@ fast .they the secre@@ the bbc break@@ fast .

    they the secre@@ ts the bbc break@@ fast .they the secre@@ ts of the bbc break@@ fast .they the secre@@ ts of their the bbc break@@ fast .they the secre@@ ts of their sho@@ the bbc break@@ fast .they the secre@@ ts of their sho@@ e the bbc break@@ fast .they the secre@@ ts of their sho@@ e at the bbc break@@ fast .they reveal the secre@@ ts of their sho@@ e at the bbc break@@ fast .

    — InDIGO learned from Pre-defined RF Orderrevealreveal secrecyreveal secrecy sho@@reveal secrecy sho@@ esreveal secrecy sho@@ es bbcreveal secrecy sho@@ es bbc break@@reveal secrecy sho@@ es bbc break@@ fast

    they reveal secrecy sho@@ es bbc break@@ fastthey reveal the secrecy sho@@ es bbc break@@ fast they reveal the secrecy of sho@@ es bbc break@@ fast they reveal the secrecy of their sho@@ es bbc break@@ fast they reveal the secrecy of their sho@@ es at bbc break@@ fast they reveal the secrecy of their sho@@ es at the bbc break@@ fastthey reveal the secrecy of their sho@@ es at the bbc break@@ fast .

    [caption-1] a tall woman is standing in a small kitchen[caption-2] a girl is standing in a kitchen with a mug in her hands .[caption-3] woman in knitted jump pants and yellow slee@@ v@@ eless top , in kitchen scene with matching yellow tone area .[caption-4] a woman standing in a kitchen near a refrigerator and a stove[caption-5] a woman with pi@@ g@@ tails is standing in a kitchen .

    — InDIGO learned from Pre-defined L2R Ordera a man a man standinga man standing ina man standing in a

    a man standing in a kitchen a man standing in a kitchen holdinga man standing in a kitchen holding aa man standing in a kitchen holding a bananaa man standing in a kitchen holding a banana .

    — InDIGO learned from Searched Adaptive Order (SAO)a a woman a woman standinga woman standing in a woman standing in fronta woman standing in front of

    a woman standing in front of aa woman standing in front of a refrigeratora woman standing in front of a refrigerator .a woman in standing in front of a refrigerator .a woman in a standing in front of a refrigerator .a woman in a kitchen standing in front of a refrigerator .

    MS-COCO

    [source instruction] if tok starts with _STR:0_ ,[target Python code] if tok . startswith ( ' _STR:0_ ' ) :

    — InDIGO learned from Searched Adaptive Order (SAO)if if parserif parser .if parser . (

    if parser . ( 'if parser . ( ' ' if parser . ( ' ' )if parser . ( ' ' ) :

    Django

    if parser . ( ' _STR:0_ ' ) :if parser . startswith ( ' _STR:0_ ' ) :

    Figure 5: Examples randomly sampled from three tasks that are instructed to decode using InDIGO withvarious learned generation order. Words in red and underlined are the inserted token at each step. Forvisually convenience, we reordered all the partial sequences to its correct positions at each decoding step.

  • of the KFTT En-Ja dataset. The relative-positionmatrices (Rt) and their corresponding absolute po-sitions (zt) are shown at each step. We argue thatrelative-position matrices are flexible to encode po-sition information, and its append-only propertyenables InDIGO to reuse previous hidden states.

    Case Study We demonstrate how InDIGO worksby uniformly sampling examples from the valida-tion sets for machine translation (Ro-En), imagecaptioning and code generation. As shown in Fig-ure 5, the proposed model generates sequences indifferent orders based on the order used for learn-ing (either pre-defined or SAO). For instance, themodel generates tokens approximately followingthe dependency parse wheb we used the SYN orderfor the machine translation task. On the other hand,the model trained using the RF order learns to firstproduce verbs and nouns first, before filling up thesequence with remaining functional words.

    We observe several key characteristics about theinferred orders of SAO by analyzing the model’soutput for each task: (1) For machine translation,the generation order of an output sequence does notdeviate too much from L2R. Instead, the sequencesare shuffled with chunks, and words within eachchunk are generated in a L2R order; (2) In theexamples of image captioning and code generation,the model tends to generate most of the words inthe L2R order and insert a few words afterwardin certain locations. Moreover, we provide moreexamples in the appendix.

    6 Related Work

    Decoding for Neural Models Neural autoregres-sive modelling has become one of the most success-ful approaches for generating sequences (Sutskeveret al., 2011; Mikolov, 2012), which has been widelyused in a range of applications, such as machinetranslation (Sutskever et al., 2014), dialogue re-sponse generation (Vinyals and Le, 2015), im-age captioning (Karpathy and Fei-Fei, 2015) andspeech recognition (Chorowski et al., 2015). An-other stream of work focuses on generating a se-quence of tokens in a non-autoregressive fashion(Gu et al., 2018; Lee et al., 2018; van den Oordet al., 2018), in which the discrete tokens are gen-erated in parallel. Semi-autoregressive modelling(Stern et al., 2018; Wang et al., 2018a) is a mix-ture of the two approaches, while largely adheringto left-to-right generation. Our method is differ-ent from these approaches as we support flexible

    generation orders, while decoding autoregressively.

    Non-L2R Orders Previous studies on genera-tion order of sequences mostly resort to a fixedset of generation orders. Wu et al. (2018) empir-ically show that R2L generation outperforms itsL2R counterpart in a few tasks. Ford et al. (2018)devises a two-pass approach that produces partially-filled sentence “templates” and then fills in missingtokens. Zhu et al. (2019) also proposes to generatetokens by first predicting a text template and infillthe sentence afterwards while in a more generalway. Mehri and Sigal (2018) proposes a middle-out decoder that firstly predicts a middle-word andsimultaneously expands the sequence in both direc-tions afterwards. Previous studies also focused ondecoding in a bidirectional fashion such as (Sunet al., 2017; Zhou et al., 2019a,b). Another lineof work models sequence generation based on syn-tax structures (Yamada and Knight, 2001; Char-niak et al., 2003; Chiang, 2005; Emami and Je-linek, 2005; Zhang et al., 2016; Dyer et al., 2016;Aharoni and Goldberg, 2017; Wang et al., 2018b;Eriguchi et al., 2017). In contrast, Transformer-InDIGO supports fully flexible generation ordersduring decoding.

    There are two concurrent work (Welleck et al.,2019; Stern et al., 2019), which study sequence gen-eration in a non-L2R order. Welleck et al. (2019)propose a tree-like generation algorithm. Unlikethis work, the tree-based generation order only pro-duces a subset of all possible generation orderscompared to our insertion-based models. Further,Welleck et al. (2019) find L2R is superior to theirlearned orders on machine translation tasks, whiletransformer-InDIGO with searched adaptive ordersachieves better performance. Stern et al. (2019)propose a very similar idea of using insertion oper-ations in Transformer for machine translation. Themajor difference is that they directly use absolutepositions, while ours utilizes relative positions. Asa result, their model needs to re-encode the partialsequence at every step, which is computationallymore expensive. In contrast, our approach does notnecessitate re-encoding the entire sentence duringgeneration. In addition, knowledge distillation wasnecessary to achieve good performance in Sternet al. (2019), while our model is able to match theperformance of L2R even without bootstrapping.

  • 7 Conclusion

    We have presented a novel approach – InDIGO –which supports flexible sequence generation. Ourmodel was trained with either pre-defined ordersor searched adaptive orders. In contrast to conven-tional neural autoregressive models which oftengenerate from left to right, our model can flexiblygenerate a sequence following an arbitrary order.Experiments show that our method achieved com-petitive or even better performance compared to theconventional left-to-right generation on four tasks,including machine translation, word order recovery,code generation and image captioning.

    For future work, it is worth exploring a trainableinference model to directly predict the permuta-tion (Mena et al., 2018) instead of beam-search.Also, the proposed InDIGO could be extended forpost-editing tasks such as automatic post-editingfor machine translation (APE) and grammaticalerror correction (GEC) by introducing additionaloperations such as “deletion” and “substitution”.

    Acknowledgement

    We specially thank our action editor AlexandraBirch and all the reviewers for their great effortsto review the draft. We also would like to thankDouwe Kiela, Marc’Aurelio Ranzato, Jake Zhaoand our colleagues at FAIR for the valuable feed-back, discussions and technical assistance. Thiswork was partly supported by Samsung AdvancedInstitute of Technology (Next Generation DeepLearning: from pattern recognition to AI) and Sam-sung Electronics (Improving Deep Learning usingLatent Structure). KC thanks for the support ofeBay and Nvidia.

    References

    Roee Aharoni and Yoav Goldberg. 2017. Towardsstring-to-tree neural machine translation. In Pro-ceedings of the 55th Annual Meeting of the Asso-ciation for Computational Linguistics (Volume2: Short Papers), volume 2, pages 132–140.

    Dzmitry Bahdanau, Kyunghyun Cho, and YoshuaBengio. 2015. Neural machine translation byjointly learning to align and translate. In 3rdInternational Conference on Learning Represen-tations, ICLR 2015, San Diego, CA, USA, May7-9, 2015, Conference Track Proceedings.

    Satanjeev Banerjee and Alon Lavie. 2005. Meteor:An automatic metric for mt evaluation with im-proved correlation with human judgments. InProceedings of the acl workshop on intrinsicand extrinsic evaluation measures for machinetranslation and/or summarization, pages 65–72.

    Eugene Charniak, Kevin Knight, and Kenji Ya-mada. 2003. Syntax-based language models forstatistical machine translation. In In MT SummitIX. Intl. Assoc. for Machine Translation. Cite-seer.

    David Chiang. 2005. A hierarchical phrase-basedmodel for statistical machine translation. In Pro-ceedings of the 43rd Annual Meeting on Associ-ation for Computational Linguistics, pages 263–270. Association for Computational Linguistics.

    Kyunghyun Cho. 2016. Noisy parallel approx-imate decoding for conditional recurrent lan-guage model. arXiv preprint arXiv:1605.03835.

    Jan K Chorowski, Dzmitry Bahdanau, DmitriySerdyuk, Kyunghyun Cho, and Yoshua Bengio.2015. Attention-based models for speech recog-nition. In NIPS, pages 577–585.

    Chris Dyer, Adhiguna Kuncoro, Miguel Balles-teros, and Noah A. Smith. 2016. Recurrentneural network grammars. In NAACL HLT2016, The 2016 Conference of the North Amer-ican Chapter of the Association for Computa-tional Linguistics: Human Language Technolo-gies, San Diego California, USA, June 12-17,2016, pages 199–209.

    Ahmad Emami and Frederick Jelinek. 2005. A neu-ral syntactic language model. Machine learning,60(1-3):195–227.

    Akiko Eriguchi, Yoshimasa Tsuruoka, andKyunghyun Cho. 2017. Learning to parse andtranslate improves neural machine translation.In Proceedings of the 55th Annual Meeting ofthe Association for Computational Linguistics,ACL 2017, Vancouver, Canada, July 30 - August4, Volume 2: Short Papers, pages 72–78.

    Nicolas Ford, Daniel Duckworth, MohammadNorouzi, and George E. Dahl. 2018. The impor-tance of generation order in language modeling.In Proceedings of the 2018 Conference on Empir-ical Methods in Natural Language Processing,

    http://arxiv.org/abs/1409.0473http://arxiv.org/abs/1409.0473http://aclweb.org/anthology/N/N16/N16-1024.pdfhttp://aclweb.org/anthology/N/N16/N16-1024.pdfhttps://doi.org/10.18653/v1/P17-2012https://doi.org/10.18653/v1/P17-2012https://aclanthology.info/papers/D18-1324/d18-1324https://aclanthology.info/papers/D18-1324/d18-1324

  • Brussels, Belgium, October 31 - November 4,2018, pages 2942–2946.

    Jiatao Gu, James Bradbury, Caiming Xiong, Vic-tor O.K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In 6thInternational Conference on Learning Represen-tations, ICLR 2018, Vancouver, Canada, April30-May 3, 2018, Conference Track Proceedings.

    Jiatao Gu, Zhengdong Lu, Hang Li, and VictorO. K. Li. 2016. Incorporating copying mecha-nism in sequence-to-sequence learning. In Pro-ceedings of the 54th Annual Meeting of the As-sociation for Computational Linguistics, ACL2016, August 7-12, 2016, Berlin, Germany, Vol-ume 1: Long Papers.

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, andJian Sun. 2016. Deep residual learning for imagerecognition. In Proceedings of the IEEE confer-ence on computer vision and pattern recognition,pages 770–778.

    Matthew Honnibal and Ines Montani. 2017. spaCy2: Natural language understanding with Bloomembeddings, convolutional neural networks andincremental parsing. To appear.

    Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Kat-suhito Sudoh, and Hajime Tsukada. 2010. Auto-matic evaluation of translation quality for distantlanguage pairs. In Proceedings of the 2010 Con-ference on Empirical Methods in Natural Lan-guage Processing, pages 944–952. Associationfor Computational Linguistics.

    Andrej Karpathy and Li Fei-Fei. 2015. Deepvisual-semantic alignments for generating imagedescriptions. In Proceedings of the IEEE confer-ence on computer vision and pattern recognition,pages 3128–3137.

    Jason Lee, Elman Mansimov, and Kyunghyun Cho.2018. Deterministic non-autoregressive neuralsequence modeling by iterative refinement. InProceedings of the 2018 Conference on Empir-ical Methods in Natural Language Processing,Brussels, Belgium, October 31 - November 4,2018, pages 1173–1182.

    Tsung-Yi Lin, Michael Maire, Serge Belongie,James Hays, Pietro Perona, Deva Ramanan, Pi-otr Dollár, and C Lawrence Zitnick. 2014. Mi-crosoft coco: Common objects in context. In

    European conference on computer vision, pages740–755. Springer.

    Wang Ling, Phil Blunsom, Edward Grefenstette,Karl Moritz Hermann, Tomás Kociský, FuminWang, and Andrew W. Senior. 2016. Latent pre-dictor networks for code generation. In Proceed-ings of the 54th Annual Meeting of the Associa-tion for Computational Linguistics, ACL 2016,August 7-12, 2016, Berlin, Germany, Volume 1:Long Papers.

    Shikib Mehri and Leonid Sigal. 2018. Middle-out decoding. In S. Bengio, H. Wallach,H. Larochelle, K. Grauman, N. Cesa-Bianchi,and R. Garnett, editors, Advances in Neural In-formation Processing Systems 31, pages 5523–5534. Curran Associates, Inc.

    Gonzalo Mena, David Belanger, Scott Linderman,and Jasper Snoek. 2018. Learning latent permu-tations with gumbel-sinkhorn networks. In 6thInternational Conference on Learning Represen-tations, ICLR 2018, Vancouver, Canada, April30-May 3, 2018, Conference Track Proceedings.

    Tomáš Mikolov. 2012. Statistical language mod-els based on neural networks. Presentation atGoogle, Mountain View, 2nd April.

    Graham Neubig. 2011. The Kyoto free translationtask. http://www.phontron.com/kftt.

    Yusuke Oda, Hiroyuki Fudaba, Graham Neubig,Hideaki Hata, Sakriani Sakti, Tomoki Toda, andSatoshi Nakamura. 2015. Learning to generatepseudo-code from source code using statisticalmachine translation. In Proceedings of the 201530th IEEE/ACM International Conference onAutomated Software Engineering (ASE), ASE’15, pages 574–584, Lincoln, Nebraska, USA.IEEE Computer Society.

    Aaron van den Oord, Yazhe Li, Igor Babuschkin,Karen Simonyan, Oriol Vinyals, KorayKavukcuoglu, George van den Driessche,Edward Lockhart, Luis Cobo, Florian Stimberg,Norman Casagrande, Dominik Grewe, SebNoury, Sander Dieleman, Erich Elsen, NalKalchbrenner, Heiga Zen, Alex Graves, HelenKing, Tom Walters, Dan Belov, and Demis Hass-abis. 2018. Parallel WaveNet: Fast high-fidelityspeech synthesis. In Proceedings of the 35thInternational Conference on Machine Learning,

    http://aclweb.org/anthology/P/P16/P16-1154.pdfhttp://aclweb.org/anthology/P/P16/P16-1154.pdfhttps://aclanthology.info/papers/D18-1149/d18-1149https://aclanthology.info/papers/D18-1149/d18-1149http://aclweb.org/anthology/P/P16/P16-1057.pdfhttp://aclweb.org/anthology/P/P16/P16-1057.pdfhttp://papers.nips.cc/paper/7796-middle-out-decoding.pdfhttp://papers.nips.cc/paper/7796-middle-out-decoding.pdfhttps://doi.org/10.1109/ASE.2015.36https://doi.org/10.1109/ASE.2015.36https://doi.org/10.1109/ASE.2015.36http://proceedings.mlr.press/v80/oord18a.htmlhttp://proceedings.mlr.press/v80/oord18a.html

  • volume 80 of Proceedings of Machine LearningResearch, pages 3918–3926, Stockholmsmssan,Stockholm Sweden. PMLR.

    Chris Pal, Charles Sutton, and Andrew McCallum.2006. Sparse forward-backward using minimumdivergence beams for fast training of conditionalrandom fields. In Acoustics, Speech and Sig-nal Processing, 2006. ICASSP 2006 Proceed-ings. 2006 IEEE International Conference on,volume 5, pages V–V. IEEE.

    Kishore Papineni, Salim Roukos, Todd Ward, andWei-Jing Zhu. 2002. Bleu: a method for au-tomatic evaluation of machine translation. InProceedings of the 40th annual meeting on as-sociation for computational linguistics, pages311–318. Association for Computational Lin-guistics.

    Alexander M. Rush, Sumit Chopra, and Jason We-ston. 2015. A neural attention model for abstrac-tive sentence summarization. In Proceedings ofthe 2015 Conference on Empirical Methods inNatural Language Processing, pages 379–389,Lisbon, Portugal. Association for ComputationalLinguistics.

    Rico Sennrich, Barry Haddow, and AlexandraBirch. 2016. Neural machine translation of rarewords with subword units. In Proceedings of the54th Annual Meeting of the Association for Com-putational Linguistics (Volume 1: Long Papers),pages 1715–1725, Berlin, Germany. Associationfor Computational Linguistics.

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani.2018. Self-attention with relative position repre-sentations. In Proceedings of the 2018 Confer-ence of the North American Chapter of the Asso-ciation for Computational Linguistics: HumanLanguage Technologies, Volume 2 (Short Pa-pers), pages 464–468, New Orleans, Louisiana.Association for Computational Linguistics.

    Matthew Snover, Bonnie Dorr, Richard Schwartz,Linnea Micciulla, and John Makhoul. 2006. Astudy of translation edit rate with targeted humanannotation. In In Proceedings of Association forMachine Translation in the Americas, pages 223–231.

    Mitchell Stern, William Chan, Jamie Kiros, andJakob Uszkoreit. 2019. Insertion transformer:

    Flexible sequence generation via insertion oper-ations. arXiv preprint arXiv:1902.03249.

    Mitchell Stern, Noam Shazeer, and Jakob Uszkor-eit. 2018. Blockwise parallel decoding for deepautoregressive models. In Advances in NeuralInformation Processing Systems, pages 10107–10116.

    Qing Sun, Stefan Lee, and Dhruv Batra. 2017. Bidi-rectional beam search: Forward-backward infer-ence in neural sequence models for fill-in-the-blank image captioning. In Proceedings of theIEEE Conference on Computer Vision and Pat-tern Recognition, pages 6961–6969.

    Ilya Sutskever, James Martens, and Geoffrey EHinton. 2011. Generating text with recurrentneural networks. In Proceedings of the 28thInternational Conference on Machine Learning(ICML-11), pages 1017–1024.

    Ilya Sutskever, Oriol Vinyals, and Quoc V. Le.2014. Sequence to sequence learning with neu-ral networks. In Advances in Neural Informa-tion Processing Systems 27: Annual Confer-ence on Neural Information Processing Systems2014, December 8-13 2014, Montreal, Quebec,Canada, pages 3104–3112.

    Ashish Vaswani, Noam Shazeer, Niki Parmar,Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,Lukasz Kaiser, and Illia Polosukhin. 2017. At-tention is all you need. In Proceedings of theAnnual Conference on Neural Information Pro-cessing Systems (NIPS).

    Ramakrishna Vedantam, C Lawrence Zitnick, andDevi Parikh. 2015. Cider: Consensus-based im-age description evaluation. In Proceedings of theIEEE conference on computer vision and patternrecognition, pages 4566–4575.

    Ashwin K Vijayakumar, Michael Cogswell, Ram-prasath R Selvaraju, Qing Sun, Stefan Lee,David Crandall, and Dhruv Batra. 2016. Di-verse beam search: Decoding diverse solutionsfrom neural sequence models. arXiv preprintarXiv:1610.02424.

    Oriol Vinyals, Samy Bengio, and Manjunath Kud-lur. 2016. Order matters: Sequence to sequencefor sets. In 4th International Conference onLearning Representations, ICLR 2016, San Juan,

    https://doi.org/10.18653/v1/D15-1044https://doi.org/10.18653/v1/D15-1044https://doi.org/10.18653/v1/P16-1162https://doi.org/10.18653/v1/P16-1162https://doi.org/10.18653/v1/N18-2074https://doi.org/10.18653/v1/N18-2074http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networkshttp://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networkshttp://arxiv.org/abs/1511.06391http://arxiv.org/abs/1511.06391

  • Puerto Rico, May 2-4, 2016, Conference TrackProceedings.

    Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly.2015. Pointer networks. In Advances in NeuralInformation Processing Systems, pages 2692–2700.

    Oriol Vinyals and Quoc Le. 2015. A neu-ral conversational model. arXiv preprintarXiv:1506.05869.

    Chunqi Wang, Ji Zhang, and Haiqing Chen. 2018a.Semi-autoregressive neural machine translation.In Proceedings of the 2018 Conference on Empir-ical Methods in Natural Language Processing,pages 479–488, Brussels, Belgium. Associationfor Computational Linguistics.

    Xinyi Wang, Hieu Pham, Pengcheng Yin, and Gra-ham Neubig. 2018b. A tree-based decoder forneural machine translation. In Proceedings ofthe 2018 Conference on Empirical Methods inNatural Language Processing, pages 4772–4777,Brussels, Belgium. Association for Computa-tional Linguistics.

    Sean Welleck, Kianté Brantley, Hal Daumé III,and Kyunghyun Cho. 2019. Non-monotonicsequential text generation. arXiv preprintarXiv:1902.02192.

    Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jian-huang Lai, and Tie-Yan Liu. 2018. Beyond er-ror propagation in neural machine translation:Characteristics of language also matter. In Pro-ceedings of the 2018 Conference on EmpiricalMethods in Natural Language Processing, pages3602–3611, Brussels, Belgium. Association forComputational Linguistics.

    Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceed-ings of the 39th Annual Meeting of the Associa-tion for Computational Linguistics.

    Pengcheng Yin and Graham Neubig. 2017. A syn-tactic neural model for general-purpose codegeneration. In Proceedings of the 55th AnnualMeeting of the Association for ComputationalLinguistics (Volume 1: Long Papers), pages 440–450, Vancouver, Canada. Association for Com-putational Linguistics.

    Xingxing Zhang, Liang Lu, and Mirella Lapata.2016. Top-down tree long short-term memorynetworks. In Proceedings of the 2016 Confer-ence of the North American Chapter of the As-sociation for Computational Linguistics: Hu-man Language Technologies, pages 310–320,San Diego, California. Association for Compu-tational Linguistics.

    Long Zhou, Jiajun Zhang, and Chengqing Zong.2019a. Synchronous bidirectional neural ma-chine translation. Transactions of the Associa-tion for Computational Linguistics, 7:91–105.

    Long Zhou, Jiajun Zhang, Chengqing Zong, andHeng Yu. 2019b. Sequence generation: Fromboth sides to the middle. IJCAI.

    Wanrong Zhu, Zhiting Hu, and Eric Xing.2019. Text Infilling. arXiv e-prints, pagearXiv:1901.00158.

    https://www.aclweb.org/anthology/D18-1044https://www.aclweb.org/anthology/D18-1509https://www.aclweb.org/anthology/D18-1509https://www.aclweb.org/anthology/D18-1396https://www.aclweb.org/anthology/D18-1396https://www.aclweb.org/anthology/D18-1396https://doi.org/10.18653/v1/P17-1041https://doi.org/10.18653/v1/P17-1041https://doi.org/10.18653/v1/P17-1041https://doi.org/10.18653/v1/N16-1035https://doi.org/10.18653/v1/N16-1035http://arxiv.org/abs/1901.00158