Self-Attention for Generative Models · Music Transformer (ICLR 2019) by Cheng-Zhi Anna Huang, ...
TRANSCRIPT
Self-Attention For Generative Models
Ashish Vaswani and Anna Huang
Joint work with: Noam Shazeer, Niki Parmar, Lukasz Kaiser, Illia Polosukhin, Llion Jones, Justin Gilmer, David Bieber, Jonathan Frankle, Jakob Uszkoreit, and others.
Learning Representations of Variable Length Data
Basic building block of sequence-to-sequence learning
Neural machine translation, summarization, QA, …
Recurrent Neural Networks
Model of choice for learning variable-length representations.
Natural fit for sentences and sequences of pixels.
LSTMs, GRUs and variants dominate recurrent models.
But…
Sequential computation inhibits parallelization.
No explicit modeling of long- and short-range dependencies.
We want to model hierarchy.
RNNs (w/ sequence-aligned states) seem wasteful!
Convolutional Neural Networks?
Trivial to parallelize (per layer).
Exploits local dependencies.
‘Interaction distance’ between positions is linear or logarithmic.
Long-distance dependencies require many layers.
Attention
Attention between encoder and decoder is crucial in NMT.
Why not use attention for representations?
Self-Attention
Text generation
Self-Attention
Constant ‘path length’ between any two positions.
Gating/multiplicative interactions.
Trivial to parallelize (per layer).
Can replace sequential computation entirely?
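To make these properties concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the form used in the Transformer; the toy dimensions and weight matrices are illustrative, not code from the talk:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence X of shape (length, dim).

    Every position attends to every other position in a single step, so the
    'path length' between any two positions is constant, and all positions
    are computed in parallel (per layer).
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # project inputs to queries/keys/values
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (length, length) pairwise compatibilities
    weights = softmax(scores, axis=-1)        # attention: a weighted average
    return weights @ V

# toy usage: 5 positions, model dim 8
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)           # (5, 8)
```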
Previous work
Classification & regression with self-attention:
Parikh et al. (2016), Lin et al. (2016)
Self-attention with RNNs:
Long et al. (2016), Shao, Gouws et al. (2017)
Recurrent attention:
Sukhbaatar et al. (2015)
The Transformer
Encoder Self-Attention
Decoder Self-Attention
FLOPs (length=1000, dim=1000, kernel_width=3)

| Layer type | Complexity | FLOPs |
|---|---|---|
| Self-Attention | O(length² · dim) | 4·10⁹ |
| RNN (LSTM) | O(length · dim²) | 16·10⁹ |
| Convolution | O(length · dim² · kernel_width) | 6·10⁹ |

Attention is Cheap!
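As a sanity check on the arithmetic, a few lines of Python reproduce the orders of magnitude; the slide's 4/16/6 multipliers come from constant factors (e.g. the LSTM's gate matrices) that big-O notation drops, so this sketch only recovers the 10⁹ scale:

```python
length, dim, kernel_width = 1000, 1000, 3

flops = {
    "Self-Attention": length**2 * dim,                 # O(length^2 · dim)
    "RNN (LSTM)":     length * dim**2,                 # O(length · dim^2)
    "Convolution":    length * dim**2 * kernel_width,  # O(length · dim^2 · kernel_width)
}
for name, f in flops.items():
    print(f"{name:15s} ~ {f:.1e} FLOPs")
# With length = dim, self-attention is no worse than the alternatives,
# and it wins outright once length < dim.
```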
Attention: a weighted average
The cat stuck out its tongue and licked its owner
Convolutions
I kicked the ball
Who? Did what? To whom?
Self-Attention
I kicked the ball
Who? Did what? To whom?
Parallel attention heads
I kicked the ball
Who? Did what? To whom?
Self-Attention: Averaging
I kicked the ball
Who? Did what? To whom?
kicked
Attention head: Who
I kicked the ball
Who
kicked
Attention head: Did What?
I kicked the ball
Who? Did what?
kicked
Attention head: To Whom?
I kicked the ball
Who? Did what? To whom?
kicked
Multihead Attention
I kicked the ball
Who? Did what? To whom?
kicked
Convolution: different linear transformations by relative position.
The cat stuck out its tongue and licked its owner
Attention: a weighted average
The cat stuck out its tongue and licked its owner
Multi-head Attention: parallel attention layers with different linear transformations on input and output.
The cat stuck out its tongue and licked its owner
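A sketch of multi-head attention under the usual convention of splitting the model dimension evenly across heads, concatenating, and linearly mixing; names and toy sizes here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Parallel attention heads, each with its own linear transformations.

    Wq, Wk, Wv: per-head projection lists, each matrix (dim, dim // n_heads);
    Wo: (dim, dim) output transformation that mixes the concatenated heads.
    Each head can specialize, e.g. "who", "did what?", "to whom?".
    """
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(1)
length, dim, n_heads = 5, 8, 2
X = rng.normal(size=(length, dim))
Wq, Wk, Wv = ([rng.normal(size=(dim, dim // n_heads)) * 0.1 for _ in range(n_heads)]
              for _ in range(3))
Wo = rng.normal(size=(dim, dim)) * 0.1
out = multi_head_attention(X, Wq, Wk, Wv, Wo)   # (length, dim)
```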
Results
Machine Translation: WMT-2014 BLEU

| Model | EN-DE | EN-FR |
|---|---|---|
| GNMT (orig) | 24.6 | 39.9 |
| ConvSeq2Seq | 25.2 | 40.5 |
| Transformer* | 28.4 | 41.8 |

Attention Is All You Need (NeurIPS 2017). Vaswani*, Shazeer*, Parmar*, Uszkoreit*, Jones*, Kaiser*, Gomez*, Polosukhin*.
*Transformer models trained >3x faster than the others.
Frameworks: tensor2tensor, Sockeye
Importance of Residuals
Residuals carry positional information to higher layers, among other information.
(Attention plots: with residuals / without residuals / without residuals, with timing signals)
Training Details
ADAM optimizer with a learning rate warmup (warmup + exponential decay)
Dropout during training at every layer just before adding residual
Layer-norm
Attention dropout (for some experiments)
Checkpoint-averaging
Label smoothing
Auto-regressive decoding with beam search and length biasing…
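One concrete instance of the warmup schedule is the one given in Attention Is All You Need (linear warmup, then inverse-square-root decay); a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Warmup-then-decay schedule from "Attention Is All You Need":
    lr rises linearly for warmup_steps, then decays as step ** -0.5."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# peaks around 7e-4 at step 4000 for d_model=512, then decays
for s in (100, 4000, 40000):
    print(f"step {s:6d}: lr = {transformer_lr(s):.2e}")
```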
Results
Generating Wikipedia by Summarizing Long Sequences
| Model | ROUGE |
|---|---|
| seq2seq-attention | 12.7 |
| Transformer-ED (L=500) | 34.2 |
| Transformer-DMCA (L=11000) | 36.2 |
msaleh@ et al. submission to ICLR’18
Self-Similarity, Image and Music Generation
Self-similarity in images
https://en.wikipedia.org/wiki/Self-similarity
Self-Similarity in Images
Starry Night (Van Gogh, June 1889)
Self-similarity in music: motifs repeat, immediately and also at a distance.
Probabilistic Image Generation
Model the joint distribution of pixels
Turning it into a sequence modeling problem
Assigning probabilities allows measuring generalization
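Written out, the factorization that turns pixel modeling into sequence modeling (with pixels in a fixed raster order) is:

```latex
p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}),
\qquad
\log p(x) = \sum_{i=1}^{n} \log p(x_i \mid x_{<i})
```

The per-pixel log-likelihood is what the tables later in the talk report, and evaluating it on held-out images is what makes generalization measurable.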
Probabilistic Image Generation
RNNs and CNNs are state-of-the-art (PixelRNN, PixelCNN)
CNNs incorporating gating now match RNNs in quality
CNNs are much faster due to parallelization
van den Oord et al. (2016), Salimans et al. (2017), Kalchbrenner et al. (2016)
Probabilistic Image Generation
Long-range dependencies matter for images (e.g. symmetry)
Likely increasingly important with increasing image size
Modeling long-range dependencies with CNNs requires either:
Many layers, likely making training harder
Large kernels, at a large parameter/computational cost
Texture Synthesis with Self-Similarity
Texture Synthesis by Non-parametric Sampling (Efros and Leung, 1999)
Non-local Means
A Non-local Algorithm for Image Denoising (Buades, Coll, and Morel. CVPR 2005)
Non-local Neural Networks (Wang et al., 2018)
Previous work
Self-attention:
Parikh et al. (2016), Lin et al. (2016), Vaswani et al. (2017)
Autoregressive Image Generation:
van den Oord et al. (2016), Salimans et al. (2017)
Self-Attention
The Image Transformer
Decoder Self-Attention
FLOPs

| Layer type | Complexity |
|---|---|
| Self-Attention | O(length² · dim), length=3072 for 32×32×3 images |
| RNN (LSTM) | O(length · dim²) |
| Convolution | O(length · dim² · kernel_width) |

Attention is Cheap if length << dim!
Combining Locality with Self-Attention
Restrict the attention windows to be local neighborhoods
Good assumption for images because of spatial locality
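A simplified sketch of 1D block-local attention: queries attend only within their own block, so the cost drops from O(length² · dim) to O(length · block · dim). The Image Transformer's actual query-block/memory-block layout also lets queries see part of the preceding block; block size and names here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_attention_1d(Q, K, V, block=64):
    """Restrict attention to local neighborhoods: each query attends
    only to keys/values inside its own fixed-size block."""
    L, d = Q.shape
    out = np.zeros_like(V)
    for start in range(0, L, block):
        sl = slice(start, min(start + block, L))
        scores = Q[sl] @ K[sl].T / np.sqrt(d)   # (block, block) instead of (L, L)
        out[sl] = softmax(scores) @ V[sl]
    return out

rng = np.random.default_rng(2)
L, d = 3072, 32                       # e.g. a flattened 32x32x3 image
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out = local_attention_1d(Q, K, V)     # (3072, 32)
```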
Image Transformer Layer
Tasks
Super-resolution
Unconditional and conditional image generation
Results
Image Transformer. Parmar*, Vaswani*, Uszkoreit, Kaiser, Shazeer, Ku, and Tran. ICML 2018.
Unconditional Image Generation

| Model | CIFAR-10 (Test) | ImageNet (Validation) |
|---|---|---|
| PixelRNN | 3.00 | 3.86 |
| Gated PixelCNN | 3.03 | 3.83 |
| PixelCNN++ | 2.92 (dmol) | - |
| PixelSNAIL | 2.85 | 3.8 |
| Image Transformer, 1D local | 2.9 (xent) | 3.77 |
| Image Transformer, 1D local | 2.9 (dmol) | 3.78 |

Cross entropy of various models on the CIFAR-10 and ImageNet datasets.
CIFAR-10 Samples
CelebA Super Resolution
(Figure columns: Input · Local 1D (Γ=0.8, 0.9, 1.0) · Local 2D (Γ=0.8, 0.9, 1.0) · Truth)
CelebA Super Resolution: % Fooled

| Model | Γ = n/a | Γ = 1.0 | Γ = 0.9 | Γ = 0.8 |
|---|---|---|---|---|
| ResNet | 4.0 | - | - | - |
| srez GAN (Garcia, 2016) | 8.5 | - | - | - |
| Pixel Recursive (Dahl et al., 2017) | - | 11.0 | 10.4 | 10.25 |
| Image Transformer, 1D local | - | 35.94 ± 3.0 | 33.5 ± 3.5 | 29.6 ± 4.0 |
| Image Transformer, 2D local | - | 36.11 ± 2.5 | 34 ± 3.5 | 30.64 ± 4.0 |

Human eval performance for the Image Transformer on CelebA. The fraction of humans fooled is significantly better than the previous state of the art.
CIFAR-10 Super-Resolution
Conditional Image Completion
Music generation using relative self-attention
Music Transformer (ICLR 2019) by Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu and Douglas Eck.
Blog post: https://magenta.tensorflow.org/music-transformer
Raw representations in music and language
(Image from Simon & Oore, 2016)
Music Language Model: prior work, Performance RNN (Simon & Oore, 2016)
(RNN diagram with state hₜ and input xₜ; event vocabulary: note on, note off, note velocity, advance clock)
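With this vocabulary, a performance is just a token sequence; for example (tokens as they appear on the attention slides later in the talk, fragment illustrative):

```python
# A two-note fragment in the Performance RNN event vocabulary.
# NoteOn60 = MIDI pitch 60 (middle C); TimeShift advances the clock.
performance = [
    "TimeShift100", "TimeShift100", "TimeShift30",
    "NoteOn60",                      # press middle C
    "TimeShift20",
    "NoteOn62",                      # press D shortly after
    "TimeShift90",
    "NoteOff62", "NoteOff60",        # release both notes
    "TimeShift90",
]
```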
Continuations to given initial motif
RNN-LSTM
Transformer
Music Transformer
Given motif
Self-Similarity in Music
Sample from Music Transformer
Attention: a weighted average
TimeShift100 TimeShift100 TimeShift30 NoteOn60 TimeShift20 NoteOn62 TimeShift90 NoteOff62 NoteOff60 TimeShift90
Convolution: different linear transformations by relative position.
TimeShift100 TimeShift100 TimeShift30 NoteOn60 TimeShift20 NoteOn62 TimeShift90 NoteOff62 NoteOff60 TimeShift90
Relative attention (Shaw et al., 2018)
Multihead attention + convolution?
TimeShift100 TimeShift100 TimeShift30 NoteOn60 TimeShift20 NoteOn62 TimeShift90 NoteOff62 NoteOff60 TimeShift90
Closer look at attention
QErᵀ
Closer look at relative attention

QErᵀ is indexed by absolute position pairs:

0,0 0,1 0,2
1,0 1,1 1,2
2,0 2,1 2,2

but modulated by relative distances:

 0  1  2
-1  0  1
-2 -1  0
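A sketch of relative attention in the style of Shaw et al. (2018), materializing the per-pair embedding tensor explicitly; the clipping window max_rel and all names are illustrative. This is the O(L²D) memory cost that the skewing trick below removes:

```python
import numpy as np

def relative_logits(Q, K, Er_table, max_rel=8):
    """Attention logits with relative positions (Shaw et al., 2018).

    Er_table: (2*max_rel + 1, d) embeddings for clipped distances in [-max_rel, max_rel].
    Builds the (L, L, d) tensor of per-pair embeddings explicitly, i.e. O(L^2 d) memory.
    """
    L, d = Q.shape
    rel = np.arange(L)[None, :] - np.arange(L)[:, None]   # distances j - i
    rel = np.clip(rel, -max_rel, max_rel) + max_rel       # indices into the table
    E = Er_table[rel]                                     # (L, L, d) gathered embeddings
    return (Q @ K.T + np.einsum('id,ijd->ij', Q, E)) / np.sqrt(d)

rng = np.random.default_rng(3)
L, d = 6, 4
Q, K = rng.normal(size=(L, d)), rng.normal(size=(L, d))
Er_table = rng.normal(size=(17, d)) * 0.1
logits = relative_logits(Q, K, Er_table)                  # (6, 6)
```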
Machine Translation (Shaw et al., 2018)

| Model | Position Representation | BLEU EN-DE | BLEU EN-FR |
|---|---|---|---|
| Transformer Big | Absolute | 27.9 | 41.3 |
| Transformer Big | Relative | 29.2 | 41.5 |
Previous work (Shaw et al., 2018), O(L²D): 8.5 GB per layer (L=2048, D=512)
(Gather a relative embedding for every pair of positions into an L×L×D tensor of relative distances, then multiply by Q.)
Our formulation, O(LD): 4.2 MB per layer (L=2048, D=512)
Skew: pad, reshape, and slice to go from absolute-by-relative to absolute-by-absolute indexing.
Goal of the skewing procedure: re-index QErᵀ from absolute-by-relative to absolute-by-absolute.
Skewing to reduce relative memory from O(L²D) to O(LD)

Previous work (Shaw et al., 2018), O(L²D): 8.5 GB per layer; our work, O(LD): 4.2 MB per layer (L=2048, D=512).
Directly multiply Q by the relative embeddings Er, then pad, reshape, and slice: Srel = skew(QErᵀ).
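A NumPy sketch of the skew, following the pad-reshape-slice recipe from the Music Transformer paper (written here for the causal case, with Er rows ordered from distance -(L-1) up to 0):

```python
import numpy as np

def skew(qe):
    """Music Transformer skewing: shift QErᵀ from absolute-by-relative to
    absolute-by-absolute indexing without materializing an O(L^2 d) tensor.

    qe: (L, L) = Q @ Er.T, where column r corresponds to relative distance
    r - (L - 1). Returns Srel with Srel[i, j] = qe[i, j - i + L - 1] for
    j <= i (entries above the diagonal are garbage and get masked anyway).
    """
    L = qe.shape[0]
    padded = np.pad(qe, [(0, 0), (1, 0)])   # pad one dummy column on the left
    reshaped = padded.reshape(L + 1, L)     # (L, L+1) -> (L+1, L)
    return reshaped[1:]                     # slice off the first row -> (L, L)

rng = np.random.default_rng(4)
L, d = 4, 8
Q = rng.normal(size=(L, d))
Er = rng.normal(size=(L, d))                # one embedding per distance -(L-1)..0
Srel = skew(Q @ Er.T)                       # O(L d) extra memory, not O(L^2 d)
# spot check: entry (i=2, j=1) should use the embedding for distance j - i = -1
assert np.isclose(Srel[2, 1], Q[2] @ Er[(1 - 2) + (L - 1)])
```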
A Jazz sample from Music Transformer
Convolutions and Translational Equivariance
Relative Positions and Translational Equivariance
Relative Attention And Graphs
Relational inductive biases, deep learning, and graph networks. (Battaglia et al., 2018)
Self-Attention With Relative Position Representations (Shaw et al., 2018)
Message Passing Neural Networks
(Figure: a graph with node states h1, h2, h3 exchanging messages)
Slide credit: Justin Gilmer
Neural Message Passing For Quantum Chemistry. Gilmer et al. ICML 2017
Multiple Towers
Mixing Network
● Run k smaller copies of the MPNN in parallel.
● Mix node states after each message pass.
● Offers a factor of k speedup for the same node dimension d (> 2x speedup when d=200).
● Also helped improve performance when used with matrix multiply message function.
Slide credit: Justin Gilmer
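A toy sketch of the towers idea under simple assumptions (sum-over-neighbors messages, a dense mixing matrix; the actual MPNN message and update functions are richer):

```python
import numpy as np

def tower_pass(H, A, W):
    """One message pass in a single tower: sum neighbor states and transform."""
    return np.tanh(A @ H @ W)   # (n, d_k)

def multi_tower_step(H, A, tower_Ws, W_mix):
    """Multiple towers: run k smaller MPNN copies in parallel on d/k-dim
    slices of the node states, then mix node states after the pass.
    Per-pass cost scales as k * (d/k)^2 = d^2 / k, a factor-k saving."""
    k = len(tower_Ws)
    d_k = H.shape[1] // k
    towers = [tower_pass(H[:, i * d_k:(i + 1) * d_k], A, tower_Ws[i])
              for i in range(k)]
    return np.concatenate(towers, axis=-1) @ W_mix   # mixing network across towers

rng = np.random.default_rng(5)
n, d, k = 6, 8, 2
A = (rng.random((n, n)) < 0.4).astype(float)         # toy adjacency matrix
H = rng.normal(size=(n, d))
tower_Ws = [rng.normal(size=(d // k, d // k)) * 0.3 for _ in range(k)]
W_mix = rng.normal(size=(d, d)) * 0.3
H_next = multi_tower_step(H, A, tower_Ws, W_mix)     # (n, d)
```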
Graph Library
Code
With Justin Gilmer, Jonathan Frankle, and David Bieber
Self-Attention
Constant ‘path length’ between any two positions.
Unbounded memory.
Trivial to parallelize (per layer).
Models Self-Similarity.
Relative attention provides expressive timing, equivariance, and extends naturally to graphs.
Active Research Area

Non-Autoregressive Transformer (Gu and Bradbury et al., 2018)
Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement (Lee, Mansimov, and Cho, 2018)
Fast Decoding in Sequence Models Using Discrete Latent Variables (ICML 2018). Kaiser, Roy, Vaswani, Parmar, Bengio, Uszkoreit, Shazeer.
Towards a Better Understanding of Vector Quantized Autoencoders. Roy, Vaswani, Parmar, Neelakantan, 2018.
Blockwise Parallel Decoding for Deep Autoregressive Models (NeurIPS 2018). Stern, Shazeer, Uszkoreit.
Transfer learning
Improving Language Understanding by Generative Pre-Training (Radford, Narasimhan, Salimans, and Sutskever)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin, Chang, Lee, and Toutanova)
Optimization and Large Models
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost (ICML 2018). Shazeer, Stern.
Memory-Efficient Adaptive Optimization for Large-Scale Learning (2019). Anil, Gupta, Koren, Singer.
Mesh-TensorFlow: Deep Learning for Supercomputers (NeurIPS 2018). Shazeer, Cheng, Parmar, Tran, Vaswani, Koanantakool, Hawkins, Lee, Hong, Young, Sepassi, Hechtman. Code (5 billion parameters).
Self-attention in Other Work.
Generating Wikipedia by Summarizing Long Sequences (ICLR 2018). Liu, Saleh, Pot, Goodrich, Sepassi, Shazeer, Kaiser.
Universal Transformers (ICLR 2019). Dehghani*, Gouws*, Vinyals, Uszkoreit, Kaiser.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (2019). Dai, Yang, Yang, Carbonell, Le, Salakhutdinov.
A Time-Restricted Self-Attention Layer for ASR (ICASSP 2018). Povey, Hadian, Ghahremani, Li, Khudanpur.
Character-Level Language Modeling with Deeper Self-Attention (2018). Al-Rfou*, Choe*, Guo*, Constant*, Jones*.
Ongoing and Future Work
Ongoing:
Self-supervision and classification for images and video
Understanding transfer

Future:
Multitask learning
Long-range attention