One Model To Learn Them All
Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit
CS546 Course Presentation, University of Illinois
Shruti Bhargava (shrutib2)
Advised by: Prof. Julia Hockenmaier
Outline
➢ Motivation
➢ Understanding the task
➢ Model Architecture
➢ Datasets
➢ Training details
➢ Performance Evaluation
➢ Key contributions / Limitations
Motivation
What is your favourite fruit?
Apple /ˈapəl/
Image Modality
Text Modality
Audio Modality
Write? Draw? Speak?
1. Process the question and think of an answer
2. Convey the answer to me
Motivation
➢ Humans reason about concepts independent of input/output modality
➢ Humans are able to reuse conceptual knowledge in different tasks
Understanding the task
➢ Multimodal Learning: single task, different domains
E.g. Visual Question Answering
Input: Images + Text; Output: Text
➢ Multitask Learning: multiple tasks, mostly same domain
E.g. Translation + Parsing
➢ This work = Multimodal + Multitask
Question addressed: Can one unified model solve tasks across multiple domains?
Multiple Tasks/Domains, One Model - MultiModel
Outline
➢ Motivation
➢ Understanding the task
➢ Model Architecture
➢ Datasets
➢ Training details
➢ Performance Evaluation
➢ Key contributions / Limitations
MultiModel Architecture
➢ Modality Nets
➢ Encoder-Decoder
➢ I/O Mixer
MultiModel: Input → Output
➢ Modality Net: domain-specific input → unified representation
➢ Encoder: unified input representations → encoded input
➢ I/O Mixer: encoded input ⇌ previous outputs
➢ Decoder: decodes (input + mixture) → output representation
➢ Modality Net: unified representation → domain-specific output
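The five stages above compose end to end. A minimal NumPy sketch of the data flow, where every function body, name, and shape is a hypothetical stand-in for the real layers; only the wiring mirrors the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # size of the unified representation (illustrative)

def input_modality_net(raw_tokens):      # domain-specific input -> unified vectors
    return rng.standard_normal((len(raw_tokens), D))

def encoder(x):                          # unified input -> encoded input
    return x + 0.1 * np.tanh(x)

def io_mixer(encoded, prev_outputs):     # mix encoded input with outputs so far
    return encoded.mean(axis=0, keepdims=True) + prev_outputs

def decoder(encoded, mixed):             # (input + mixture) -> output representation
    return np.tanh(mixed)

def output_modality_net(h):              # unified representation -> output logits
    return h @ rng.standard_normal((D, 8000))  # 8k subword vocabulary

tokens = ["what", "is", "your", "favourite", "fruit", "?"]
enc = encoder(input_modality_net(tokens))
prev = np.zeros((1, D))                  # autoregressive outputs produced so far
logits = output_modality_net(decoder(enc, io_mixer(enc, prev)))
print(logits.shape)  # (1, 8000)
```

The point of the sketch is the interface: everything between the two modality nets operates only on unified-size vectors, so the body never sees which domain the data came from.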
MultiModel: Input → Output
(architecture figure: Input → Modality Nets → Output)
MultiModel: Modality Nets
Domain-specific Representation ↔ Unified Representation
4 modality nets, one net per domain:
➢ Language
➢ Image
➢ Audio
➢ Categorical (output only)
Modality Nets: Language Modality
(figure: language modality Input Net and Output Net)
See Details for Vocabulary construction here.
Input tokenized into 8k subword units
➢ Gives an effectively open vocabulary, e.g. [ad|mi|ral]
➢ Accounts for rare words
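The [ad|mi|ral] example can be illustrated with a greedy longest-match segmenter over a toy vocabulary; this is a deliberate simplification, not the paper's actual subword-vocabulary construction algorithm:

```python
def subword_tokenize(word, vocab):
    """Greedily match the longest known subword at each position,
    falling back to single characters, so no word is out-of-vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])         # character fallback for unseen text
            i += 1
    return pieces

# Toy vocabulary: the rare word "admiral" is absent, but its pieces are covered.
vocab = {"ad", "mi", "ral", "apple", "fruit"}
print(subword_tokenize("admiral", vocab))  # ['ad', 'mi', 'ral']
```

Because every character is always a valid fallback piece, rare and unseen words still get a segmentation instead of an out-of-vocabulary token.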
MultiModel: Domain Agnostic Body
MultiModel: Domain Agnostic Body
(figure: Input → Encoder → I/O Mixer → Decoder)
MultiModel: Building Blocks
Combines 3 state-of-the-art building blocks:
➢ Convolutional: SOTA for images
➢ Attention: SOTA in language understanding
➢ Mixture-of-Experts (MoE): previously studied only for language
Building Block: ConvBlock
Depthwise Separable Convolutions
➢ convolution applied to each feature channel separately
➢ pointwise convolution to produce the desired depth
Layer Normalisation
➢ statistics computed over a layer (per sample, not per batch)
See Details on Layer normalisation and Separable Convolutions.
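The two convolution steps above (per-channel filtering, then 1x1 channel mixing) and per-sample layer normalisation can be sketched in NumPy; shapes and values are illustrative, and this naive loop is not how the real ConvBlock is implemented:

```python
import numpy as np

def depthwise_separable_conv(x, depthwise_k, pointwise_w):
    """x: (H, W, C_in); depthwise_k: (kH, kW, C_in), one filter per channel;
    pointwise_w: (C_in, C_out). Valid padding, stride 1."""
    H, W, C = x.shape
    kH, kW, _ = depthwise_k.shape
    out = np.zeros((H - kH + 1, W - kW + 1, C))
    for i in range(out.shape[0]):              # depthwise step: each channel is
        for j in range(out.shape[1]):          # convolved with its own filter
            patch = x[i:i + kH, j:j + kW, :]
            out[i, j, :] = (patch * depthwise_k).sum(axis=(0, 1))
    return out @ pointwise_w                   # pointwise step: mix channels to C_out

def layer_norm(h, eps=1e-5):
    # statistics over a single sample's activations, not over the batch
    return (h - h.mean()) / np.sqrt(h.var() + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
y = layer_norm(depthwise_separable_conv(
    x, rng.standard_normal((3, 3, 3)), rng.standard_normal((3, 16))))
print(y.shape)  # (6, 6, 16)
```

The factorization is the point: a full 3x3x3→16 convolution would need 3·3·3·16 weights per position, while the separable version needs only 3·3·3 depthwise plus 3·16 pointwise weights.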
Building Block: Attention
See Details on the attention block here.
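The core operation of the attention block is the scaled dot-product attention of Vaswani et al., softmax(QKᵀ/√d_k)V. A single-head NumPy sketch, omitting the multi-head projections, masking, and positional signals the full block uses:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends over all keys; outputs are convex combinations
    of the value rows, weighted by softmax(Q K^T / sqrt(d_k))."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Since the attention weights are non-negative and sum to 1 per query, every output row stays inside the range spanned by the value rows.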
Building Block: Mixture of Experts
Sparsely-gated mixture-of-experts layer
➢ Experts: feed-forward neural networks
➢ Selection: trainable gating network
➢ Known booster for language tasks
See Details on the MoE block here.
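The three bullets above amount to a top-k gated layer: a gating network scores all experts, only the best k run, and their outputs are blended. A toy NumPy version for a single input vector, with linear "experts" for brevity (the paper's noisy top-k gating and load-balancing terms are omitted):

```python
import numpy as np

def moe_layer(x, expert_ws, gate_w, k=4):
    """expert_ws: (n_experts, d_in, d_out) one weight matrix per expert;
    gate_w: (d_in, n_experts) trainable gating weights."""
    scores = x @ gate_w                     # one gating logit per expert
    top = np.argsort(scores)[-k:]           # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                            # softmax over the selected experts only
    # only the k selected experts are evaluated -- hence "sparsely-gated"
    return sum(wi * (x @ expert_ws[i]) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
experts = rng.standard_normal((60, 16, 16))  # 60 experts, as in single-task training
out = moe_layer(x, experts, rng.standard_normal((16, 60)), k=4)
print(out.shape)  # (16,)
```

Sparsity is what makes the layer cheap: with 60 experts and k=4, only 4/60 of the expert parameters are touched per input, so capacity grows without a matching growth in compute.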
Outline
➢ Motivation
➢ Understanding the task
➢ Model Architecture
➢ Datasets
➢ Training details
➢ Performance Evaluation
➢ Key contributions / Limitations
Datasets/Tasks
➢ WSJ speech
➢ WSJ parsing
➢ ImageNet
➢ COCO image-captioning
➢ WMT English-German
➢ WMT German-English
➢ WMT English-French
➢ WMT German-French
Outline
➢ Motivation
➢ Understanding the task
➢ Model Architecture
➢ Datasets
➢ Training details
➢ Performance Evaluation
➢ Key contributions / Limitations
Training Details
➢ A task token, e.g. To-English or To-Parse-Tree, is given to the decoder; an embedding vector for each token is learned
➢ Mixture-of-experts block:
● 240 experts for joint training, 60 for single-task training
● gating selects 4 experts per input
➢ Adam optimizer with gradient clipping
➢ Experiments on all tasks use the same hyperparameter values
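Gradient clipping by global norm, a common way to pair clipping with Adam, rescales all gradients jointly when their combined L2 norm exceeds a threshold. A sketch with illustrative values (the slide does not specify the threshold used):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale every gradient by the same factor so the global L2 norm
    of the whole list is at most max_norm."""
    total = np.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], float(total)

grads = [np.ones((3, 3)), np.ones(3)]   # global norm = sqrt(9 + 3) ~ 3.464
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(round(norm, 3))  # 3.464
```

Scaling all gradients by one shared factor preserves their relative directions, which is why global-norm clipping is usually preferred over clipping each tensor independently.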
Outline
➢ Motivation
➢ Understanding the task
➢ Model Architecture
➢ Datasets Used
➢ Training details
➢ Experiments / Results
➢ Key contributions / Limitations
Experiments
➢ How does MultiModel compare with the state of the art?
➢ Does simultaneous training on 8 problems help?
➢ Do blocks specialised for one domain help or harm the others?
Results
1. MultiModel vs state-of-the-art?
Results
2. Does simultaneous training help?
Results
3. Do blocks specialising in one domain help or harm the others?
MoE and Attention: blocks from the language domain
Outline
➢ Motivation
➢ Understanding the task
➢ Model Architecture
➢ Datasets Used
➢ Training details
➢ Performance Evaluation
➢ Key contributions / Limitations
Key Contributions
➢ First model performing large-scale tasks across multiple domains
➢ Sets a blueprint for potential, broadly applicable future AI
➢ Designs a multi-modal architecture with building blocks from diverse modalities
➢ Demonstrates transfer learning across domains
Limitations
➢ Comparison with SOTA: the last few percentage points, as models approach 100%, are the most crucial part, and MultiModel falls short there
➢ Incomplete experimentation: hyperparameters were not tuned
➢ Incomplete reporting: results given only for some tasks
➢ Could be less robust to adversarial-sample attacks
References
➢ https://venturebeat.com/2017/06/19/google-advances-ai-with-one-model-to-learn-them-all/
➢ https://aidangomez.ca/multitask.pdf
➢ https://blog.acolyer.org/2018/01/12/one-model-to-learn-them-all/
➢ Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.
➢ Chollet, François. "Xception: Deep learning with depthwise separable convolutions." arXiv preprint (2016).
➢ Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).
Thank You!
Modality Nets
Image Modality Net: analogous to the Xception entry flow; uses residual convolution blocks
Categorical Modality Net: analogous to the Xception exit flow; global average pooling after conv layers
Audio Modality Net: similar to the Image Modality Net