
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 1993–2003, August 1–6, 2021. ©2021 Association for Computational Linguistics


Glancing Transformer for Non-Autoregressive Neural Machine Translation

Lihua Qian1,2* Hao Zhou2 Yu Bao3 Mingxuan Wang2
Lin Qiu1 Weinan Zhang1 Yong Yu1 Lei Li2
1 Shanghai Jiao Tong University 2 ByteDance AI Lab 3 Nanjing University

{qianlihua,lqiu,wnzhang,yyu}@apex.sjtu.edu.cn
{zhouhao.nlp,wangmingxuan.89,lileilab}@bytedance.com

[email protected]

Abstract

Recent work on non-autoregressive neural machine translation (NAT) aims at improving efficiency via parallel decoding without sacrificing quality. However, existing NAT methods are either inferior to the Transformer or require multiple decoding passes, leading to reduced speedup. We propose the Glancing Language Model (GLM) for single-pass parallel generation models. With GLM, we develop the Glancing Transformer (GLAT) for machine translation. With only single-pass parallel decoding, GLAT is able to generate high-quality translations with an 8×-15× speedup. Note that GLAT does not modify the network architecture; it is a training method for learning word interdependency. Experiments on multiple WMT language directions show that GLAT outperforms all previous single-pass non-autoregressive methods and is nearly comparable to the Transformer, reducing the gap to 0.25-0.9 BLEU points.

1 Introduction

The Transformer has been the most widely used architecture for machine translation (Vaswani et al., 2017). Despite its strong performance, the decoding of the Transformer is inefficient as it adopts a sequential autoregressive factorization for its probability model (Figure 1a). Recent work such as the non-autoregressive transformer (NAT) aims to decode target tokens in parallel to speed up generation (Gu et al., 2018). However, the vanilla NAT still lags behind the Transformer in translation quality, with a gap of about 7.0 BLEU points. NAT assumes conditional independence of the target tokens given the source sentence. We suspect that NAT's conditional independence assumption prevents learning word interdependency in the target sentence. Notice that such word interdependency is crucial, as the Transformer explicitly captures it via decoding from left to right (Figure 1a).

* The work was done when the first author was an intern at ByteDance.

[Figure 1: Probabilistic models for machine translation methods. (a) Sequential LM; (b) conditionally independent LM; (c) masked LM (MLM); (d) glancing LM (GLM). The vanilla NAT (b) uses a conditionally independent LM. Mask-Predict (c) uses an MLM and requires multiple passes of decoding. Our proposed GLM (d) leverages the decoder prediction to decide the glancing sampling policy during training and requires only one pass of decoding during inference.]


Several remedies have been proposed (Ghazvininejad et al., 2019; Gu et al., 2019) to capture word interdependency while keeping parallel decoding. Their common idea is to decode the target tokens iteratively, where each pass of decoding is trained using the masked language model (Figure 1c). Since these methods require multiple passes of decoding, their generation speed is measurably slower than the vanilla NAT. With single-pass generation only, these methods still largely lag behind the autoregressive Transformer.


One open question is whether a completely parallel decoding model can achieve machine translation performance comparable to the Transformer. Such a model should be non-autoregressive and take only one pass of decoding at inference time.

To address this question, we propose the glancing language model (GLM), a new method for training a probabilistic sequence model. Based on GLM, we develop the glancing Transformer (GLAT) for neural machine translation. It achieves parallel text generation with only a single decoding pass, yet it outperforms previous NAT methods and achieves performance comparable to the strong Transformer baseline in multiple cases. Intuitively, GLM adopts an adaptive glancing sampling strategy: during training, GLAT glances at some fragments of the reference if the reference is too difficult to fit. Correspondingly, as the model becomes better trained, it adaptively reduces the percentage of glancing sampling, making sure that the resulting model learns to generate the whole sentence in a single-pass fashion. This gradual learning process smooths the learning curve of single-pass parallel generation.

Specifically, our proposed GLM differs from MLM in two aspects. Firstly, GLM proposes an adaptive glancing sampling strategy, which enables GLAT to generate sentences in one iteration, relying on gradual training instead of iterative inference (see Figure 1d). In spirit, GLM is quite similar to curriculum learning (Bengio et al., 2009): first learning to generate some fragments and gradually moving to learning whole sentences (from easy to hard). To achieve adaptive glancing sampling, GLM performs decoding twice during training. The first decoding is the same as the vanilla NAT, and its prediction accuracy indicates whether the current reference is "difficult" to fit. In the second decoding, GLM obtains words of the reference via glancing sampling according to the first decoding, and learns to predict the remaining words that are not sampled. Note that only the second decoding updates the model parameters. Secondly, instead of using the [MASK] token, GLM directly uses representations from the encoder at the corresponding positions, which is more natural and enhances the interaction between sampled words and signals from the encoder.

Note that GLAT does not modify the network architecture; it is a training method that explicitly learns word interdependency. Experimental results show that GLAT obtains significant improvements

(about 5 BLEU) on standard benchmarks compared to the vanilla NAT, without losing the inference speed-up. GLAT achieves competitive results against iterative approaches like Mask-Predict (Ghazvininejad et al., 2019), even outperforming Mask-Predict on WMT14 DE-EN and WMT16 RO-EN. Compared to the strong AT baseline, GLAT closes the performance gap to within 0.9 BLEU points while keeping a 7.9× speed-up. Empirically, we even find that GLAT outperforms AT when the length of the reference is less than 20 on WMT14 DE-EN. We speculate this is because GLM can capture bidirectional context for generation while its left-to-right counterpart is only unidirectional, which indicates the potential of parallel generation approaches like GLAT.

2 Probability Models of Machine Translation

We state and compare different probability models for machine translation. A machine translation task can be formally defined as a sequence-to-sequence generation problem: given the source sentence X = {x1, x2, ..., xN}, generate the target sentence Y = {y1, y2, ..., yT} according to the conditional probability P(Y|X; θ), where θ denotes the parameters of a network. Different methods factorize the conditional probability differently.

The Transformer uses the autoregressive factorization to maximize the following likelihood:

$$\mathcal{L}_{\mathrm{AT}} = \log P(Y \mid X; \theta) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, X; \theta),$$

where y_{<t} = {[BOS], y1, ..., y_{t-1}}. For simplicity, we omit the number of samples in the equation. Note that the training of AT adopts left-to-right teacher forcing on the target tokens (Vaswani et al., 2017). Word interdependency is learned in a unidirectional way. During inference, the preceding predicted token is fed into the decoder to generate the next token.

The vanilla NAT consists of the same encoder as the Transformer and a parallel decoder with layers of multi-head attention (Gu et al., 2018). During training, it uses the conditionally independent factorization for the target sentence:

$$\mathcal{L}_{\mathrm{NAT}} = \sum_{t=1}^{T} \log P(y_t \mid X; \theta).$$
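To make the contrast concrete, the following toy sketch (ours, with random stand-in logits, not the paper's code) computes both log-likelihoods. In the AT case the T output distributions come from teacher forcing on y_{<t}; in the NAT case they come from one parallel pass conditioned only on X through the copied encoder representations.

```python
# A minimal sketch with toy tensors; the logits stand in for real decoder outputs.
import torch
import torch.nn.functional as F

T, V = 5, 1000                      # target length, vocabulary size
target = torch.randint(V, (T,))     # ground-truth tokens y_1..y_T

# Autoregressive (AT): position t is predicted from y_<t (teacher forcing),
# so the decoder consumes [BOS], y_1, ..., y_{T-1} and emits T distributions.
at_logits = torch.randn(T, V)       # stand-in for f_dec(y_<t, X)
L_AT = F.log_softmax(at_logits, -1)[torch.arange(T), target].sum()

# Vanilla NAT: all T distributions are produced in one parallel pass,
# each conditioned only on X (via the copied encoder representations H).
nat_logits = torch.randn(T, V)      # stand-in for f_dec(H, f_enc(X))
L_NAT = F.log_softmax(nat_logits, -1)[torch.arange(T), target].sum()
```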


[Figure 2: The training procedure with glancing sampling in GLAT. H is the representation computed by the encoder, ŷ's are the initial predicted tokens of the parallel decoder, y's are the ground-truth target tokens, and H′ is fed into the decoder again to calculate the training loss L_GLM.]

Notice that NAT's log-likelihood is an approximation to the full log-likelihood log P(Y|X; θ). During inference, the encoder representation is copied as the input to the decoder, so all tokens on the target side can be generated in parallel. Such a conditional independence assumption does not hold in general, which explains the inferior performance of NAT.

Multi-pass iterative decoding approaches such as Mask-Predict (Ghazvininejad et al., 2019) extend the vanilla NAT. They still use the conditionally independent factorization, together with a random masking scheme:

$$\mathcal{L}_{\mathrm{MLM}} = \sum_{y_t \in \mathrm{RM}(Y)} \log p\big(y_t \mid \Phi(Y, \mathrm{RM}(Y)), X; \theta\big),$$

where RM(Y) is a set of randomly selected words from Y, and Φ(·) replaces these selected words in Y with the [MASK] token. For example, in Figure 1c, RM(Y) = {y2, y3} and Φ(Y, RM(Y)) = {y1, [MASK], [MASK], y4, y5}. The number of masked tokens is distributed uniformly from 1 to the total number of tokens in the target sequence. This training objective is used to learn a refinement model θ that can predict the masked tokens given the source sentence X and the words generated in the previous iteration.
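A minimal sketch of this masking scheme, assuming integer token ids and a placeholder MASK_ID; this is our illustration, not Mask-Predict's released code.

```python
# Toy sketch of RM(Y) and Phi(Y, RM(Y)): mask a uniformly drawn number of
# positions; the MLM loss is then taken only over the masked positions.
import random

MASK_ID = 0  # assumed placeholder id for [MASK]

def random_mask(y):
    """Return (masked sequence Phi(Y, RM(Y)), masked positions RM(Y))."""
    n = random.randint(1, len(y))                 # number of masked tokens ~ U[1, T]
    positions = random.sample(range(len(y)), n)   # RM(Y)
    masked = [MASK_ID if i in positions else tok for i, tok in enumerate(y)]
    return masked, positions

# e.g. random_mask([11, 12, 13, 14, 15]) may give
# ([11, 0, 0, 14, 15], [1, 2]), matching the Figure 1c example.
```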

The vanilla NAT breaks word interdependency, while MLM requires multiple passes of decoding to re-establish word interdependency. Our goal in this work is to design a better probability model and training objective that enable word interdependency learning for single-pass parallel generation.

3 Glancing Transformer

In this section, we present GLAT in detail. GLAT uses the same encoder-decoder architecture as the vanilla NAT (Gu et al., 2018). GLAT differs from the vanilla NAT in that it explicitly encourages word interdependency via training with the glancing language model (GLM). It differs from iterative NAT with MLM in that it is trained to produce the output in a single parallel decoding pass, whereas MLM is used for prediction refinement.

3.1 The Glancing Language Model

Given the input source sentence X = {x1, x2, ..., xN}, the task is to predict Y = {y1, y2, ..., yT}. The glancing Transformer (GLAT) formulates a glancing language model (GLM) during training. It maximizes the following:

$$\mathcal{L}_{\mathrm{GLM}} = \sum_{y_t \in \overline{\mathrm{GS}(Y, \hat{Y})}} \log p\big(y_t \mid \mathrm{GS}(Y, \hat{Y}), X; \theta\big) \qquad (1)$$

where Ŷ is the sequence of initially predicted tokens and GS(Y, Ŷ) is a subset of tokens selected via the glancing sampling strategy (Figure 2, described in detail in the next section). The glancing sampling strategy selects words from the target sentence by comparing the initial prediction against the ground-truth tokens: it selects more tokens, and feeds the embeddings of these tokens into the decoder input, when the network's initial prediction is less accurate. $\overline{\mathrm{GS}(Y, \hat{Y})}$ is the remaining subset of tokens within the target Y that are not selected. The training loss above is calculated over these remaining tokens.


GLAT adopts a similar encoder-decoder architecture to the Transformer with some modifications (Figure 1d). Its encoder f_enc consists of the same multi-head attention layers. Its decoder f_dec includes multiple layers of multi-head attention, where each layer attends to the full sequence of both the encoder representation and the previous decoder layer's representation.

During the initial prediction, the inputs to the decoder H = {h1, h2, ..., hT} are copied from the encoder output using either uniform copy or soft copy (Wei et al., 2019). The initial tokens Ŷ are predicted using argmax decoding with f_dec(f_enc(X; θ), H; θ).

To calculate the loss L_GLM, we compare the initial prediction Ŷ against the ground truth to select tokens within the target sentence, i.e. GS(Y, Ŷ). We then replace the sampled indices of the h's with the corresponding target word embeddings, H′ = RP(Emb_{y_t ∈ GS(Y, Ŷ)}(y_t), H), where RP replaces the corresponding indices. Namely, if a token in the target is sampled, its word embedding replaces the corresponding h. Here the word embeddings are obtained from the softmax embedding matrix of the decoder. The updated H′ is then fed into the decoder f_dec again to calculate the output token probabilities. Specifically, the output probabilities of the remaining tokens, p(y_t | GS(Y, Ŷ), X; θ), are computed with f_dec(H′, f_enc(X; θ); θ).
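Putting this together, one GLAT training step can be sketched as follows. This is a schematic under assumed interfaces, not the authors' released API: `f_dec`, `embed`, and `glancing_sample` (see the sketch in Section 3.2) are placeholder callables, E is the encoder output f_enc(X), and H is the decoder input copied from it.

```python
# Schematic GLAT training step; f_dec(H, E) returns (T, V) logits,
# embed maps a token id tensor to a d-dimensional embedding.
import torch
import torch.nn.functional as F

def glat_training_step(Y, E, H, f_dec, embed, glancing_sample):
    """Y: list of T target token ids; E: encoder output; H: (T, d) copied inputs."""
    # 1) First decoding: plain parallel prediction, used only for sampling.
    with torch.no_grad():
        Y_hat = f_dec(H, E).argmax(-1).tolist()    # initial tokens \hat{Y}

    # 2) Glancing sampling: choose GS(Y, \hat{Y}) by comparing \hat{Y} with Y.
    sampled = glancing_sample(Y, Y_hat)            # set of sampled positions

    # 3) Replace decoder inputs at sampled positions with target embeddings (H').
    H_prime = H.clone()
    for t in sampled:
        H_prime[t] = embed(torch.tensor(Y[t]))

    # 4) Second decoding: cross-entropy only on the remaining positions,
    #    i.e. the terms of Eq. (1); only this pass updates the parameters.
    logits = f_dec(H_prime, E)                     # (T, V)
    remaining = [t for t in range(len(Y)) if t not in sampled]
    targets = torch.tensor([Y[t] for t in remaining])
    return F.cross_entropy(logits[remaining], targets)
```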

3.2 The Glancing Sampling Strategy

One important component of GLM is to adaptively select positions of tokens from the target sentence. The selected tokens provide "correct" information from the ground-truth target, which helps train the decoder to predict the remaining, non-selected tokens. Intuitively, our adaptive sampling strategy guides the model to first learn the generation of fragments and then gradually turn to whole sentences. The glancing sampling strategy selects many words at the start of training, when the model is not yet well tuned. As the model progressively improves, the sampling strategy samples fewer words, enabling the model to learn parallel generation of the whole sentence. Note that the sampling strategy is crucial to the training of GLAT.

As illustrated in Figure 2, glancing sampling can be divided into two steps: first deciding a sampling number S, and then randomly selecting S words from the reference. The sampling number

S will be larger when the model is poorly trained and decreases along the training process. Note that we choose to select the S words from the reference at random. This random reference word selection is simple and yields good performance empirically.

Formally, given the input X, its predicted sentence Ŷ and its reference Y, the goal of the glancing sampling function GS(Y, Ŷ) is to obtain a subset of words sampled from Y:

$$\mathrm{GS}(Y, \hat{Y}) = \mathrm{Random}\big(Y, S(Y, \hat{Y})\big) \qquad (2)$$

Here, Random(Y, S) randomly selects S tokens from Y, and S is computed by comparing the difference between Y and Ŷ: S(Y, Ŷ) = λ · d(Y, Ŷ). The sampling ratio λ is a hyper-parameter that controls the number of sampled tokens more flexibly. d(Y, Ŷ) is a metric for measuring the difference between Y and Ŷ. We adopt the Hamming distance (Hamming, 1950) as the metric, computed as $d(Y, \hat{Y}) = \sum_{t=1}^{T} (y_t \neq \hat{y}_t)$. With d(Y, Ŷ), the sampling number can be decided adaptively, taking the current model's prediction capability into account. For situations where Y and Ŷ have different lengths, d(Y, Ŷ) could be another distance such as the Levenshtein distance (Levenshtein, 1966).
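A minimal sketch of Eq. (2) with the Hamming distance, assuming Y and Ŷ are equal-length lists of token ids; the function names are ours.

```python
# Adaptive glancing sampling: the number of glanced tokens shrinks as the
# first-pass prediction gets closer to the reference.
import random

def hamming_distance(Y, Y_hat):
    """d(Y, Y_hat): number of positions where the prediction is wrong."""
    return sum(y != y_hat for y, y_hat in zip(Y, Y_hat))

def glancing_sample(Y, Y_hat, lambda_=0.5):
    """Return GS(Y, Y_hat): a random subset of positions of size
    S = lambda * d(Y, Y_hat)."""
    S = int(lambda_ * hamming_distance(Y, Y_hat))
    return set(random.sample(range(len(Y)), S))

# e.g. with 3 wrong predictions and lambda_=0.5, one position is glanced:
# glancing_sample([1, 2, 3, 4, 5], [1, 9, 3, 9, 9], 0.5)
```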

Alternative glancing sampling strategies can be adopted as well. For example, one simple alternative is to set the number of sampled tokens to be proportional to the target sentence length, i.e. S = λ · T. We evaluate the effects of these variations in the experiments.

3.3 Inference

GLAT only modifies the training procedure. Its inference is fully parallel with only a single pass. For parallel generation, we need to decide the output length before decoding. A simple way to decide the output length is to predict it with representations from the encoder.

In GLAT, length prediction is implemented as in Ghazvininejad et al. (2019). An additional [LENGTH] token is added to the source input, and the encoder output for the [LENGTH] token is used to predict the length.

We also use two more complex methods to better decide the output length: noisy parallel decoding (NPD) and connectionist temporal classification (CTC). For NPD (Gu et al., 2018), we first predict m target length candidates, then generate an output sequence with argmax decoding for each target length candidate, and finally use a pre-trained transformer to rank these sequences and identify the best overall output as the final output. For CTC (Graves et al., 2006), following Libovicky and Helcl (2018), we set the maximum output length to twice the source input length, and remove the blanks and repeated tokens after generation.
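The two length-handling schemes can be summarized in pseudocode. This is a hedged sketch, not the released implementation: `predict_length_candidates`, `decode_with_length`, and `at_score` are assumed helper names standing in for the length predictor, the single-pass parallel decoder, and the pre-trained autoregressive reranker.

```python
# Sketch of NPD reranking and CTC post-processing as described above.

def npd_inference(X, m=7):
    """Noisy parallel decoding: decode once per length candidate (in parallel
    in practice), then keep the candidate the AT reranker scores highest."""
    candidates = []
    for L in predict_length_candidates(X, m):        # m target lengths
        candidates.append(decode_with_length(X, L))  # single-pass argmax decode
    return max(candidates, key=lambda Y: at_score(X, Y))

def ctc_collapse(tokens, blank=0):
    """CTC post-processing: drop repeated tokens, then drop blanks."""
    out, prev = [], None
    for t in tokens:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out
```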


| Group | Models | Idec | WMT14 EN-DE | WMT14 DE-EN | WMT16 EN-RO | WMT16 RO-EN | Speed Up |
|---|---|---|---|---|---|---|---|
| AT Models | Transformer (Vaswani et al., 2017) | T | 27.30 | / | / | / | / |
| | Transformer (ours) | T | 27.48 | 31.27 | 33.70 | 34.05 | 1.0× † |
| Iterative NAT | NAT-IR (Lee et al., 2018) | 10 | 21.61 | 25.48 | 29.32 | 30.19 | 1.5× |
| | LaNMT (Shu et al., 2020) | 4 | 26.30 | / | / | 29.10 | 5.7× |
| | LevT (Gu et al., 2019) | 6+ | 27.27 | / | / | 33.26 | 4.0× |
| | Mask-Predict (Ghazvininejad et al., 2019) | 10 | 27.03 | 30.53 | 33.08 | 33.31 | 1.7× |
| | JM-NAT (Guo et al., 2020b) | 10 | 27.31 | 31.02 | / | / | 5.7× |
| Fully NAT | NAT-FT (Gu et al., 2018) | 1 | 17.69 | 21.47 | 27.29 | 29.06 | 15.6× |
| | Mask-Predict (Ghazvininejad et al., 2019) | 1 | 18.05 | 21.83 | 27.32 | 28.20 | / |
| | imit-NAT (Wei et al., 2019) | 1 | 22.44 | 25.67 | 28.61 | 28.90 | 18.6× |
| | NAT-HINT (Li et al., 2019) | 1 | 21.11 | 25.24 | / | / | / |
| | Flowseq (Ma et al., 2019) | 1 | 23.72 | 28.39 | 29.73 | 30.72 | 1.1× |
| | NAT-DCRF (Sun et al., 2019) | 1 | 23.44 | 27.22 | / | / | 10.4× |
| w/ CTC | NAT-CTC (Libovicky and Helcl, 2018) | 1 | 16.56 | 18.64 | 19.54 | 24.67 | / |
| | Imputer (Saharia et al., 2020) | 1 | 25.80 | 28.40 | 32.30 | 31.70 | 18.6× |
| w/ NPD | NAT-FT + NPD (m=100) | 1 | 19.17 | 23.20 | 29.79 | 31.44 | 2.4× |
| | imit-NAT + NPD (m=7) | 1 | 24.15 | 27.28 | 31.45 | 31.81 | 9.7× |
| | NAT-HINT + NPD (m=9) | 1 | 25.20 | 29.52 | / | / | / |
| | Flowseq + NPD (m=30) | 1 | 25.31 | 30.68 | 32.20 | 32.84 | / |
| | NAT-DCRF + NPD (m=9) | 1 | 26.07 | 29.68 | / | / | 6.1× |
| Ours | NAT-base † | 1 | 20.36 | 24.81 | 28.47 | 29.43 | 15.3× † |
| | CTC † | 1 | 25.52 | 28.73 | 32.60 | 33.46 | 14.6× † |
| | GLAT | 1 | 25.21 | 29.84 | 31.19 | 32.04 | 15.3× † |
| | GLAT + CTC | 1 | 26.39 | 29.54 | 32.79 | 33.84 | 14.6× † |
| | GLAT + NPD (m=7) | 1 | 26.55 | 31.02 | 32.87 | 33.51 | 7.9× † |

Table 1: Results on the WMT14 EN-DE/DE-EN and WMT16 EN-RO/RO-EN benchmarks. Idec is the number of decoding iterations and m is the number of length reranking candidates. NPD is noisy parallel decoding, CTC is connectionist temporal classification. † indicates results obtained by our implementation. Note that our work and previous work may use different hardware settings and implementations, so the speed-ups may not be directly comparable.


4 Experiments

In this section, we first introduce the settings of our experiments and then report the main results compared with several strong baselines. Ablation studies and further analysis are also included to verify the effects of the different components used in GLAT.

4.1 Experimental Settings

Datasets  We conduct experiments on three machine translation benchmarks: WMT14 EN-DE (4.5M translation pairs), WMT16 EN-RO (610k translation pairs), and IWSLT16 DE-EN (150k translation pairs). These datasets are tokenized and segmented into subword units using BPE encodings (Sennrich et al., 2016). We preprocess WMT14 EN-DE following the data preprocessing in Vaswani et al. (2017). For WMT16 EN-RO and IWSLT16 DE-EN, we use the processed data provided by Lee et al. (2018).

Knowledge Distillation  Following previous work (Gu et al., 2018; Lee et al., 2018; Wang et al., 2019), we also use sequence-level knowledge distillation for all datasets. We employ the Transformer with the base setting in Vaswani et al. (2017) as the teacher for knowledge distillation, and train GLAT on the distilled data.

Baselines and Setup  We compare our method with the base Transformer and strong representative NAT baselines in Table 1. For all tasks, we obtain the other NAT models' performance by directly using the figures reported in their papers where available.

We adopt the vanilla model that copies the source input uniformly in Gu et al. (2018) as our base model (NAT-base) and replace UniformCopy with an attention mechanism using positions. Note that the output length does not equal the length of the reference in models using CTC. Therefore, for GLAT with CTC, we adopt the longest common subsequence distance for comparing Y and Ŷ, and the glancing target is the target alignment that maximizes the output probability, $\arg\max_{a \in \mathcal{B}^{-1}(Y)} P(a \mid X; \theta)$. $\mathcal{B}^{-1}$ is the mapping proposed by Graves et al. (2006), which expands the reference to the length of the output by inserting blanks or repeating words.


[Figure 3: The trade-off between relative decoding speed-up and performance (BLEU) on WMT14 DE-EN for GLAT, Transformer, NAT-base, NAT-DCRF, imit-NAT, CTC, and Mask-Predict.]


For the WMT datasets, we follow the hyperparameters of the base Transformer in Vaswani et al. (2017). We choose a smaller setting for IWSLT16, as it is a smaller dataset: 5 layers for the encoder and decoder, with the model size d_model set to 256. Using Nvidia V100 GPUs, we train the model with batches of 64k/8k tokens for the WMT/IWSLT datasets, respectively. We set the dropout rate to 0.1 and use the Adam optimizer (Kingma and Ba, 2014) with β = (0.9, 0.999). For the WMT datasets, the learning rate warms up to 5e-4 in 4k steps and gradually decays according to the inverse square root schedule in Vaswani et al. (2017). For IWSLT16 DE-EN, we adopt linear annealing (from 3e-4 to 1e-5) as in Lee et al. (2018). For the hyper-parameter λ, we adopt linear annealing from 0.5 to 0.3 for the WMT datasets and a fixed value of 0.5 for IWSLT16. The final model is created by averaging the 5 best checkpoints chosen by validation BLEU scores. We report tokenized BLEU for all datasets used in the experiments. We measure the average latency per sentence on a single Nvidia 1080Ti GPU.
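For reference, the setup above can be collected into one illustrative configuration. The key names below are our own; only values stated in the text are filled in, and anything not mentioned there is left out.

```python
# Illustrative summary of the reported training setup (not the authors' config file).
GLAT_CONFIG = {
    "wmt": {
        "arch": "transformer-base",             # follows Vaswani et al. (2017)
        "tokens_per_batch": 64_000,
        "dropout": 0.1,
        "optimizer": ("adam", {"betas": (0.9, 0.999)}),
        "lr_schedule": {"peak_lr": 5e-4, "warmup_steps": 4_000,
                        "decay": "inverse_sqrt"},
        "glancing_ratio": {"start": 0.5, "end": 0.3, "anneal": "linear"},
    },
    "iwslt16": {
        "enc_layers": 5, "dec_layers": 5, "d_model": 256,
        "tokens_per_batch": 8_000,
        "lr_schedule": {"start": 3e-4, "end": 1e-5, "anneal": "linear"},
        "glancing_ratio": 0.5,                  # fixed
    },
    "checkpoint_averaging": 5,                  # average the 5 best checkpoints
}
```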

4.2 Main Results

The main results on the benchmarks are presented in Table 1. GLAT significantly improves translation quality and outperforms strong baselines by a large margin. Our method introduces explicit word interdependency modeling for the decoder and gradually learns simultaneous generation of whole sequences, enabling the model to better capture the underlying data structure.

[Figure 4: Performance (BLEU) under different source input lengths on WMT14 DE-EN for NAT-base, Transformer, and GLAT.]

Compared to models with iterative decoding, our method completely maintains the inference efficiency advantage of fully non-autoregressive models, since GLAT generates with a single pass. Compared with the baselines, we highlight our empirical advantages:

• GLAT is highly effective. Compared with the vanilla NAT-base model, GLAT obtains significant improvements (about 5 BLEU) on EN-DE/DE-EN. Additionally, GLAT also outperforms other fully non-autoregressive models by a substantial margin (almost +2 BLEU points on average). The results are even very close to those of the AT model, which shows great potential.

• GLAT is simple and can be applied to other NAT models flexibly, as we only modify the training process by reference glancing while keeping inference unchanged. For comparison, NAT-DCRF utilizes a CRF to generate sequentially, and NAT-IR and Mask-Predict need multiple decoding iterations.

• CTC and NPD use different approaches to determine the best output length, and each has its own advantages and disadvantages. CTC requires the output length to be longer than the exact target length; with longer output lengths, training consumes more time and GPU memory. As for NPD, with a certain number of length reranking candidates, the inference speed will be slower than models using CTC. Note that NPD can use pre-trained AT models or the non-autoregressive model itself to rerank multiple outputs.

We also present a scatter plot in Figure 3, displaying the trend of speed-up and BLEU for different NAT models.


[Figure 5: The BLEU scores of GLAT with different decoding iterations on WMT14 DE-EN and WMT14 EN-DE.]

| Model | WMT14 EN-DE | WMT14 DE-EN |
|---|---|---|
| NAT-base | 8.32% | 7.10% |
| GLAT | 1.19% | 1.05% |
| GLAT w/ NPD | 0.32% | 0.16% |

Table 2: Token repetition ratio on WMT14 EN-DE and WMT14 DE-EN.

It shows that the point for GLAT lies to the top-right of the competing methods. Obviously, GLAT outperforms our competitors in BLEU if speed-up is controlled, and in speed-up if BLEU is controlled. This indicates that GLAT outperforms previous NAT methods. Although iterative models like Mask-Predict achieve competitive BLEU scores, they only maintain minor speed advantages over AT. In contrast, fully non-autoregressive models remarkably improve the inference speed.

4.3 Analysis

Effect of Source Input Length  To analyze the effect of source input length on the models' performance, we split the source sentences into different intervals by length after BPE and compute the BLEU score for each interval. The histogram of results is presented in Figure 4. NAT-base's performance drops sharply for long sentences, while the gradual learning process enables GLAT to boost performance by a large margin, especially for long sentences. We also find that GLAT outperforms the autoregressive Transformer when the source input length is smaller than 20.

GLAT Reduces Repetition  We also measure the percentage of repeated tokens on the test sets of WMT14 EN-DE and WMT14 DE-EN. Table 2 presents the token repetition ratio of sentences generated by NAT-base and GLAT. The results show that GLAT significantly reduces the occurrence of repetition, and the repetition ratio can be further reduced with NPD.
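As an illustration of how such a metric can be computed, here is a minimal sketch; the exact counting used for Table 2 is not spelled out in the text, so treat this as our reading of "token repetition ratio" (the fraction of tokens identical to their immediate predecessor).

```python
# Our illustrative repetition-ratio metric over a list of tokenized outputs.
def repetition_ratio(sentences):
    repeats = total = 0
    for toks in sentences:
        repeats += sum(a == b for a, b in zip(toks, toks[1:]))
        total += len(toks)
    return repeats / max(total, 1)

# repetition_ratio([["an", "apple", "apple", "in"]]) == 0.25
```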

| Sampling Number | λ | BLEU |
|---|---|---|
| Fixed | 0.0 | 24.66 |
| Fixed | 0.1 | 24.91 |
| Fixed | 0.2 | 27.12 |
| Fixed | 0.3 | 24.98 |
| Fixed | 0.4 | 22.96 |
| Adaptive | - | 29.61 |

Table 3: Performance on IWSLT16 with fixed sampling ratios.

| Sampling Number | λs | λe | BLEU |
|---|---|---|---|
| Decreasing | 0.5 | 0 | 27.80 |
| Decreasing | 0.5 | 0.1 | 28.21 |
| Decreasing | 0.5 | 0.2 | 27.15 |
| Decreasing | 0.5 | 0.3 | 23.37 |
| Adaptive | - | - | 29.61 |

Table 4: Performance on IWSLT16 with decreasing sampling ratios.

We think an important cause of the improvement is better interdependency modeling. Since GLAT explicitly encourages word interdependency modeling to better capture the dependencies between target tokens, wrong generation patterns such as repetition can be largely avoided.

GLAT Achieves Strong Results without Multiple Iterations  We conduct experiments with GLAT using more than one decoding iteration at inference, adopting the inference algorithm of Mask-Predict for multiple-iteration decoding. The results are shown in Figure 5. We find that GLAT achieves decent performance with only one decoding iteration, while further iterations only obtain minor improvements of 0.2-0.3 BLEU.

4.4 Ablation Study

Effectiveness of the Adaptive Sampling Number  To validate the effectiveness of the adaptive sampling strategy for the sampling number S(Y, Ŷ), we also introduce two fixed approaches for comparison. The first decides the sampling number as λ · T, where T is the length of Y and λ is a constant ratio. The second is relatively more flexible: it sets a start ratio λs and an end ratio λe, and linearly reduces the sampling number from λs · T to λe · T along the training process; both are sketched below.
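The three schedules can be written out explicitly. This is a small sketch under our own naming; `progress` (the fraction of training completed) and the default ratios are illustrative, not the paper's fixed values.

```python
# Fixed, decreasing, and adaptive schedules for the sampling number S.
def sampling_number(strategy, T, progress=0.0, d=None,
                    lam=0.2, lam_start=0.5, lam_end=0.1):
    if strategy == "fixed":                 # S = lambda * T
        return int(lam * T)
    if strategy == "decreasing":            # linear from lam_start*T to lam_end*T
        lam_t = lam_start + (lam_end - lam_start) * progress
        return int(lam_t * T)
    if strategy == "adaptive":              # S = lambda * d(Y, Y_hat), Eq. (2)
        return int(lam * d)
    raise ValueError(strategy)
```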

As shown in Table 3 and Table 4, our adaptive approach (Adaptive in the tables) outperforms the baseline models by large margins. The results confirm our intuition that the sampling schedule affects the generation performance of our NAT model.


| Selection Strategy | GLAT | GLAT w/ NPD |
|---|---|---|
| random | 25.21 | 26.55 |
| p_ref | 24.87 | 25.83 |
| 1 − p_ref | 25.37 | 26.52 |
| most certain | 24.99 | 26.22 |
| most uncertain | 24.86 | 26.13 |

Table 5: Performance on WMT14 EN-DE with different reference word selection strategies.

| Method | Distance | WMT14 EN-DE | WMT14 DE-EN |
|---|---|---|---|
| GLAT | Levenshtein | 24.56 | 28.96 |
| GLAT | Hamming | 25.21 | 29.84 |
| GLAT w/ NPD | Levenshtein | 26.21 | 30.85 |
| GLAT w/ NPD | Hamming | 26.55 | 31.02 |

Table 6: Performance on WMT14 EN-DE and WMT14 DE-EN with different distances.

The sampling strategy, which first offers relatively easy generation problems and then turns harder, benefits the final performance. Besides, even with the simplest constant ratio, GLAT still achieves remarkable results: when λ is set to 0.2, it even outperforms the baseline λ = 0.0 by 2.5 BLEU points.

These experiments support the idea that it is beneficial to learn the generation of fragments at the start and gradually transfer to whole sequences. The flexible decreasing-ratio method works better than the constant one, and our proposed adaptive approach achieves the best results.

Influence of Reference Word Selection  To analyze how the strategy for selecting reference words affects glancing sampling, we conduct experiments with different selection strategies. By default, we assume all words in the reference are equally important and randomly choose reference words for glancing. Besides the random strategy, we devise four other selection methods that consider the prediction of the first decoding. For p_ref and 1 − p_ref, the sampling probability of each reference word is proportional to the output probability of the reference word, p_ref, and to 1 − p_ref, respectively. Similar to the word selection strategy for masking words during inference in Mask-Predict, we also add two strategies related to prediction confidence: "most certain" and "most uncertain". We choose the positions where predictions have higher confidence for "most certain", and vice versa for "most uncertain". The results for the different selection methods are listed in Table 5.
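A sketch of how these selection strategies might be implemented; `p_ref` (the first pass's probability assigned to each reference word), `weighted_sample`, and sampling without replacement are our own choices, not details given in the paper.

```python
# Selecting S glanced positions under the strategies compared in Table 5.
import random

def select_positions(strategy, p_ref, S):
    idx = list(range(len(p_ref)))
    if strategy == "random":
        return random.sample(idx, S)
    if strategy == "p_ref":            # sampling prob proportional to p_ref
        return weighted_sample(idx, weights=p_ref, k=S)
    if strategy == "1-p_ref":          # proportional to 1 - p_ref
        return weighted_sample(idx, weights=[1 - p for p in p_ref], k=S)
    if strategy == "most_certain":     # highest-confidence positions
        return sorted(idx, key=lambda i: -p_ref[i])[:S]
    if strategy == "most_uncertain":   # lowest-confidence positions
        return sorted(idx, key=lambda i: p_ref[i])[:S]
    raise ValueError(strategy)

def weighted_sample(idx, weights, k):
    """Draw k distinct indices with probability proportional to `weights`."""
    chosen, pool, w = [], list(idx), list(weights)
    for _ in range(k):
        i = random.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(i))
        w.pop(i)
    return chosen
```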


| Method | WMT14 EN-DE | WMT14 DE-EN |
|---|---|---|
| GLAT w/ uniform sampling | 19.16 | 23.56 |
| GLAT w/ [MASK] inputs | 24.99 | 29.48 |
| GLAT | 25.21 | 29.84 |

Table 7: Ablation study comparing GLAT and Mask-Predict variants on WMT14 EN-DE and DE-EN.

In comparison, the model with the selection strategy 1 − p_ref outperforms the one with p_ref, indicating that words that are hard to predict are more important for glancing during training. We also find that the random strategy performs a little better than the two confidence-based strategies. We think this indicates that introducing more randomness in sampling enables GLAT to explore more interdependency among target words. We adopt the random strategy for its simplicity and good performance.

Comparison of Different Distances for Glancing Sampling  We conduct experiments with two distances for comparing the predictions of the first decoding against the references; the results are presented in Table 6. They show that both distances can be used to improve the quality of one-iteration generation, and that GLAT with Hamming distance is better than GLAT with Levenshtein distance. Especially when there is no target length reranking, GLAT with Hamming distance outperforms GLAT with Levenshtein distance by about 0.7 BLEU and 0.9 BLEU on WMT14 EN-DE and DE-EN, respectively. We think Hamming distance is stricter than Levenshtein distance, because only identical words at the corresponding positions are regarded as correct, which is more consistent with the training of GLAT.

Advantages of GLAT over Mask-Predict  To study the effects of the sampling strategy and the decoder inputs of GLAT, we conduct experiments replacing these two modules of GLAT with the corresponding parts of Mask-Predict. The results are presented in Table 7. GLAT employs the glancing sampling strategy instead of the uniform sampling strategy used in Mask-Predict, and replaces the [MASK] token inputs with source representations from the encoder. The results show that the glancing sampling strategy outperforms the uniform sampling strategy by 5-6 BLEU points, and that feeding representations from the encoder as the decoder input still improves the strong baseline by 0.2-0.3 BLEU points after adopting glancing sampling. To sum up, the adaptive glancing sampling approach contributes the most to the final improvement, and the use of representations from the encoder also helps a bit.



5 Related Work

Fully Non-Autoregressive Models  A line of work introduces various forms of latent variables to reduce the model's burden of dealing with dependencies among output words (Gu et al., 2018; Ma et al., 2019; Bao et al., 2019; Ran et al., 2019; Bao et al., 2021). Another branch of work considers transferring knowledge from autoregressive models to non-autoregressive models (Wei et al., 2019; Li et al., 2019; Guo et al., 2020a; Sun and Yang, 2020). Besides, some works apply different training objectives to train non-autoregressive models (Libovicky and Helcl, 2018; Shao et al., 2020; Ghazvininejad et al., 2020a) or add regularization terms (Wang et al., 2019; Guo et al., 2019).

Non-Autoregressive Models with Structured Decoding  To model the dependencies between words, Sun et al. (2019) introduce a CRF inference module in NAT and perform additional sequential decoding after the non-autoregressive computation at inference. Deng and Rush (2020) propose cascaded CRF decoding. Since GLAT only performs single-pass non-autoregressive generation, our approach is orthogonal to the method proposed by Sun et al. (2019); we could also combine our approach with such structured decoding methods.

Non-Autoregressive Models with Iterative Refinement  A series of works is devoted to semi-autoregressive models that refine the outputs with multi-pass iterative decoding (Lee et al., 2018; Miao et al., 2019; Gu et al., 2019; Ghazvininejad et al., 2019, 2020b; Kasai et al., 2020; Li et al., 2020). Lee et al. (2018) propose an iterative refinement method based on a denoising autoencoder. Gu et al. (2019) utilize insertion and deletion to refine the outputs at inference. Ghazvininejad et al. (2019) train the model with the masked language model, and the model iteratively replaces masked tokens with new outputs. Li et al. (2020) first predict the left and right tokens for each position, and then decode the final token at the current position conditioned on the previously predicted left-and-right tokens. Despite the relatively better accuracy, the multiple decoding iterations reduce the inference efficiency of non-autoregressive models.

Scheduled Sampling  To alleviate exposure bias in autoregressive models, previous work attempts to close the gap between training and inference with scheduled sampling (Bengio et al., 2015; Mihaylova and Martins, 2019). Although scheduled sampling also modifies decoder inputs during training, there are two main differences from our work. Firstly, scheduled sampling mixes up the predicted sequence and the gold target sequence, whereas our method does not mix predicted sequences into decoder inputs. Secondly, GLAT aims to learn word interdependency for single-pass parallel generation, while scheduled sampling is designed to alleviate exposure bias.

6 Conclusion

In this paper, we propose the Glancing Transformer with a glancing language model to improve the performance of single-pass parallel generation models. With the glancing language model, the model starts from learning the generation of sequence fragments and gradually moves to whole sequences. Experimental results show that our approach significantly improves the performance of non-autoregressive machine translation with single-pass parallel generation. As GLAT achieves competitive performance compared with autoregressive models, applying our approach to other generation tasks is a promising direction for future work.

Acknowledgments

We thank all the anonymous reviewers for their valuable comments. Hao Zhou and Lei Li are corresponding authors.

References

Yu Bao, Shujian Huang, Tong Xiao, Dongqi Wang, Xinyu Dai, and Jiajun Chen. 2021. Non-autoregressive translation by learning target categorical codes. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5749–5759.
Yu Bao, Hao Zhou, Jiangtao Feng, Mingxuan Wang, Shujian Huang, Jiajun Chen, and Lei Li. 2019. Non-autoregressive transformer by position learning. arXiv preprint arXiv:1911.10677.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, pages 1171–1179.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In ICML, pages 41–48.
Yuntian Deng and Alexander Rush. 2020. Cascaded text generation with Markov transformers. Advances in Neural Information Processing Systems, 33.
Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, and Omer Levy. 2020a. Aligned cross entropy for non-autoregressive machine translation. In International Conference on Machine Learning, pages 3515–3523. PMLR.
Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel decoding of conditional masked language models. In EMNLP-IJCNLP, pages 6114–6123.
Marjan Ghazvininejad, Omer Levy, and Luke Zettlemoyer. 2020b. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785.
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376.
Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In ICLR.
Jiatao Gu, Changhan Wang, and Junbo Zhao. 2019. Levenshtein transformer. In NeurIPS, pages 11179–11189.
Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In AAAI, volume 33, pages 3723–3730.
Junliang Guo, Xu Tan, Linli Xu, Tao Qin, Enhong Chen, and Tie-Yan Liu. 2020a. Fine-tuning by curriculum learning for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7839–7846.
Junliang Guo, Linli Xu, and Enhong Chen. 2020b. Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation. In ACL, pages 376–385.
Richard W. Hamming. 1950. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147–160.
Jungo Kasai, James Cross, Marjan Ghazvininejad, and Jiatao Gu. 2020. Non-autoregressive machine translation with disentangled context transformer. In International Conference on Machine Learning, pages 5144–5155. PMLR.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP, pages 1173–1182.
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, pages 707–710.
Xiaoya Li, Yuxian Meng, Arianna Yuan, Fei Wu, and Jiwei Li. 2020. LAVA NAT: A non-autoregressive translation model with look-around decoding and vocabulary attention. arXiv preprint arXiv:2002.03084.
Zhuohan Li, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Hint-based training for non-autoregressive translation. In EMNLP-IJCNLP.
Jindrich Libovicky and Jindrich Helcl. 2018. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In EMNLP, pages 3016–3021.
Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In EMNLP-IJCNLP, pages 4273–4283, Hong Kong, China.
Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In AAAI.
Tsvetomila Mihaylova and André F. T. Martins. 2019. Scheduled sampling for transformers. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 351–356.
Qiu Ran, Yankai Lin, Peng Li, and Jie Zhou. 2019. Guiding non-autoregressive neural machine translation decoding with reordering information. arXiv preprint arXiv:1911.02215.
Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1098–1108.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL, pages 1715–1725.
Chenze Shao, Jinchao Zhang, Yang Feng, Fandong Meng, and Jie Zhou. 2020. Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. In AAAI, pages 198–205.
Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2020. Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8846–8853.
Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhihong Deng. 2019. Fast structured decoding for sequence models. In NeurIPS, pages 3016–3026.
Zhiqing Sun and Yiming Yang. 2020. An EM approach to non-autoregressive conditional sequence generation. In International Conference on Machine Learning, pages 9249–9258. PMLR.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 5998–6008.
Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Non-autoregressive machine translation with auxiliary regularization. In AAAI.
Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, and Xu Sun. 2019. Imitation learning for non-autoregressive neural machine translation. In ACL.