Self-Attention and Transformers - Cornell University
TRANSCRIPT
Self-Attention and Transformers
Instructor: Yoav Artzi
CS 5740: Natural Language Processing
Slides adapted from Greg Durrett
Overview
• Motivation
• Self-Attention and Transformers
• Encoder-decoder with Transformers
Encoders
• RNN: maps each token vector to a new context-aware token embedding using an autoregressive process
• CNN: similar outcome, but with local context using filters
• Attention is an alternative method to generate context-dependent embeddings
(Figure: encoding the example sentence "the movie was great")
LSTM/CNN Context
• What context do we want token embeddings to take into account?
• What words need to be used as context here?

The ballerina is very excited that she will dance in the show.

• Pronouns' context should be their antecedents (i.e., what they refer to)
• Ambiguous words should consider local context
• Words should look at their syntactic parents/children
• Problem: very hard with RNNs and CNNs, even if possible
LSTM/CNN Context
• Want: context from relevant words, even distant ones
• LSTMs/CNNs tend to be local
• To appropriately contextualize, we need to pass information over long distances for each word

The ballerina is very excited that she will dance in the show.
Self-attention
• Each word is a query to form attention over all tokens
• This generates a context-dependent representation of each token: a weighted sum of all tokens
• The attention weights dynamically mix how much is taken from each token
• Can run this process iteratively, at each step computing self-attention on the output of the previous level

[Vaswani et al. 2017]
(Figure: self-attention over the example sentence "the movie was great")
Self-attention w/ Dot-product

[Vaswani et al. 2017]

(Figure: dot-product self-attention over the example sentence "the movie was great")
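The dot-product form of self-attention can be sketched in a few lines of NumPy. This is a minimal illustration without learned parameters (the slides' full version adds per-head projection matrices); the function names are my own, not from the slides:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def dot_product_self_attention(X):
    """Each token vector acts as a query against all token vectors.

    X: (n, d) matrix of token embeddings.
    Returns (n, d) context-dependent embeddings: each row is a
    softmax-weighted sum of all rows of X.
    """
    scores = X @ X.T           # (n, n) dot-product similarities
    weights = softmax(scores)  # each row sums to 1
    return weights @ X         # weighted sums of token vectors

# Example: 4 tokens ("the movie was great") with 3-dim embeddings
X = np.random.randn(4, 3)
out = dot_product_self_attention(X)
assert out.shape == X.shape
```

Each output row mixes all input rows, so every token's new embedding depends on the whole sentence.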
Multiple Attention Heads
• Multiple attention heads can learn to attend in different ways
• Why multiple heads? Softmax operations often end up peaky, making it hard to put weight on multiple items
• Requires additional parameters to compute different attention values and transform vectors
• Analogous to multiple convolutional filters
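A tiny numeric illustration of the "peaky softmax" point (the score values here are made up for illustration):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    scores = scores - scores.max()
    exp = np.exp(scores)
    return exp / exp.sum()

# Modest differences in raw scores already concentrate most of the
# probability mass on a single item ("peaky" attention) -- so one head
# struggles to spread weight over several relevant tokens at once.
scores = np.array([2.0, 4.0, 1.0, 0.5])
weights = softmax(scores)
print(weights.round(3))
assert weights.argmax() == 1 and weights.max() > 0.8
```

Because a single softmax tends toward one winner, multiple heads give the model several independent chances to attend to different tokens.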
Multiple Attention Heads
(Figure: multi-head attention over the example sentence "the movie was great")
Notation: $k$ is the level number, $L$ the number of heads, and $X = x_1, \dots, x_n$ the input vectors.

$$x_i^1 = x_i$$

$$\bar{\alpha}_{i,j}^{k,l} = x_i^{k-1} Q^{k,l} \cdot x_j^{k-1} K^{k,l}$$

$$\alpha_i^{k,l} = \mathrm{softmax}\big(\bar{\alpha}_{i,1}^{k,l}, \dots, \bar{\alpha}_{i,n}^{k,l}\big)$$

$$x_i^{k,l} = \sum_{j=1}^{n} \alpha_{i,j}^{k,l} \, x_j^{k-1} V^{k,l}$$

$$x_i^k = [x_i^{k,1}; \dots; x_i^{k,L}]$$
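The per-head equations above can be sketched directly in NumPy. This is a minimal illustration of one level $k$; the parameter shapes and variable names are my own choices, and the $1/\sqrt{d}$ scaling from Vaswani et al. is omitted for simplicity:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Q, K, V):
    """One level of multi-head self-attention.

    X: (n, d) input vectors x_1..x_n.
    Q, K, V: lists of L matrices of shape (d, d_h), one per head.
    Returns (n, L * d_h): the concatenation [x^{k,1}; ...; x^{k,L}].
    """
    heads = []
    for Ql, Kl, Vl in zip(Q, K, V):
        queries = X @ Ql                  # x_i^{k-1} Q^{k,l}
        keys = X @ Kl                     # x_j^{k-1} K^{k,l}
        scores = queries @ keys.T         # \bar{alpha}^{k,l}_{i,j}
        weights = softmax(scores)         # alpha^{k,l}_i = softmax(...)
        heads.append(weights @ (X @ Vl))  # sum_j alpha_{i,j} x_j V^{k,l}
    return np.concatenate(heads, axis=1)  # concatenate the L heads

# 4 tokens, model dim 8, 2 heads of dim 4 (sizes chosen for illustration)
rng = np.random.default_rng(0)
n, d, L, d_h = 4, 8, 2, 4
X = rng.normal(size=(n, d))
Q = [rng.normal(size=(d, d_h)) for _ in range(L)]
K = [rng.normal(size=(d, d_h)) for _ in range(L)]
V = [rng.normal(size=(d, d_h)) for _ in range(L)]
out = multi_head_self_attention(X, Q, K, V)
assert out.shape == (n, L * d_h)
```

Stacking levels means feeding the output of one call back in as the next level's `X` (after projecting back to dimension `d`).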
What Can Self-attention Do?
• Attend to nearby related terms
• But equally to semantically related terms that are far away

Example attention weights over the sentence (one row per attention distribution):

| The | ballerina | is | very | excited | that | she | will | dance | in | the | show. |
|-----|-----------|----|------|---------|------|-----|------|-------|----|-----|-------|
| 0 | 0.5 | 0 | 0 | 0.1 | 0.1 | 0 | 0.1 | 0.2 | 0 | 0 | 0 |
| 0 | 0.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0.5 | 0 | 0.4 | 0 |
Details, Details, Details
• Self-attention is the basic building block of an architecture called the Transformer
• Many details are needed to get it to work
• Significant improvements for many tasks, starting with machine translation (Vaswani et al. 2017) and later context-dependent pre-trained embeddings (BERT; Devlin et al. 2018)
• A detailed technical description (with code): https://www.aclweb.org/anthology/W18-2509/

[Figure from Vaswani et al. 2017]
MT with Transformers
• Input: sentence in the source language
• Output: sentence in the target language
• An encoder Transformer processes the input
• A decoder Transformer generates the output
• More generally: this defines an encoder-decoder architecture with Transformers
Encoder
• Self-attention is not order-sensitive
• Need to add positional information
• Add a time-dependent function (sin and cos) to the token embeddings
• Output: a set of token embeddings

[Positional dimensions figure from Rush 2018]
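The sin/cos positional function can be sketched as follows, using the formula from Vaswani et al. 2017 ($PE_{pos,2i} = \sin(pos/10000^{2i/d})$, $PE_{pos,2i+1} = \cos(pos/10000^{2i/d})$); the function name is my own:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings (Vaswani et al. 2017).

    Returns an (n_positions, d_model) array that is added to the token
    embeddings so that self-attention can see token order.
    """
    positions = np.arange(n_positions)[:, None]    # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d/2) even dims
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cos
    return pe

pe = sinusoidal_positional_encoding(10, 8)
assert pe.shape == (10, 8)
# At position 0: sin(0) = 0 in even dims, cos(0) = 1 in odd dims.
assert np.allclose(pe[0, 0::2], 0.0) and np.allclose(pe[0, 1::2], 1.0)
```

Each dimension oscillates at a different frequency, so every position gets a distinct pattern that the model can use to recover order.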
Encoder
• Use parameterized attention
• Multiple attention heads, each with separate parameters
• This increases the flexibility of the attention
Decoder
• Can't attend to the whole output
• Why? It doesn't exist yet!
• Tokens are generated one by one
• Solution: mask tokens that have not been predicted yet in the attention
• First: self-attend to the output generated so far; second: attend to both input and output
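The masking trick for decoder self-attention can be sketched like this (unscaled, unparameterized dot-product attention for simplicity; the function name is my own):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def masked_self_attention(X):
    """Decoder-style self-attention: position i may only attend to
    positions j <= i, since later tokens do not exist yet at decoding
    time. X: (n, d). Returns (n, d).
    """
    n = X.shape[0]
    scores = X @ X.T
    # Strictly upper-triangular positions are future tokens: set their
    # scores to -inf so they get exactly zero attention weight.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores)
    return weights @ X

X = np.random.randn(5, 4)
out = masked_self_attention(X)
assert out.shape == X.shape
# The first token can only attend to itself, so its output equals x_1.
assert np.allclose(out[0], X[0])
```

The same mask works during training: the whole output sequence is processed in parallel, but each position still only sees its past.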