TRANSCRIPT
Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition
Seungwhan Moon, Suyoun Kim, Haohan Wang (Carnegie Mellon University)
MMML'15 Workshop

Objective
Leverage source domain data to improve the target domain task.
Input: imbalanced (e.g. in label space) multimodal parallel datasets for training (e.g. source: audio, target: video).
[Figure: at train time, audio covers labels A-Z while video covers only A-M; at test time, the video recognition network sees video labels A-Z, some unforeseen during training.]
Output: a robust deep neural network for the target task (a video recognition network).
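As a concrete illustration of this label-space imbalance, here is a minimal Python sketch; the label sets follow the poster's audio A-Z / video A-M example, and everything else is illustrative:

```python
import string

# Source modality (audio) covers the full label space A-Z;
# target modality (video) covers only the partial space A-M.
labels_audio = list(string.ascii_uppercase)        # 26 labels: A-Z
labels_video = list(string.ascii_uppercase[:13])   # 13 labels: A-M

# Parallel (correspondent) instances exist where the modalities overlap;
# source-only instances carry labels the target network never sees in training.
parallel_labels    = [y for y in labels_audio if y in set(labels_video)]
source_only_labels = [y for y in labels_audio if y not in set(labels_video)]

print(parallel_labels)     # ['A', ..., 'M']
print(source_only_labels)  # ['N', ..., 'Z']
```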
Applications
Multimodal tasks with imbalanced datasets:
• Audio-visual recognition
  - Lip-reading recognition
  - In-video action recognition
• Multi-lingual natural language learning
  - Rare language text classification
• Text-image joint learning
Our Approach
Fine-tune a target network with source instances transferred at intermediate layers.
[Figure: two deep networks side by side, one taking audio data and one taking video data, each ending in an output label layer; circled markers ①-④ indicate the steps of the transfer procedure below.]
① Train a separate model for each modality (audio and video); define the activation at the i-th layer (see the equation after this list).
② Learn a transfer function using source-target correspondent instances.
③ Transfer auxiliary source data to the target network, and compute activations at upper layers.
④ Fine-tune the target network with the transferred source instances.
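The activation definition in step ① did not survive transcription; a standard feed-forward form, stated here as an assumption, is $a^{(i)} = f(W^{(i)} a^{(i-1)} + b^{(i)})$, where $f$ is the layer nonlinearity. Below is a minimal PyTorch sketch of steps ①-④ under that assumption; the network sizes, the choice of transfer layer, and the linear form of the transfer function g are illustrative, not the authors' configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D_AUDIO, D_VIDEO, D_HID, N_LABELS = 80, 120, 64, 26  # illustrative sizes

# ① One network per modality, split at the (assumed) transfer layer i:
#    lower layers compute a^(i); upper layers map a^(i) to output labels.
audio_lower = nn.Sequential(nn.Linear(D_AUDIO, D_HID), nn.ReLU())
audio_upper = nn.Linear(D_HID, N_LABELS)
video_lower = nn.Sequential(nn.Linear(D_VIDEO, D_HID), nn.ReLU())
video_upper = nn.Linear(D_HID, N_LABELS)
# (... train each modality's network on its own data as usual ...)

# ② Transfer function g: audio activation at layer i -> video activation
#    at layer i, fit on source-target correspondent (parallel) instances.
g = nn.Linear(D_HID, D_HID)
opt_g = torch.optim.Adam(g.parameters(), lr=1e-3)

def fit_transfer_step(x_audio, x_video):
    with torch.no_grad():
        a_src = audio_lower(x_audio)   # source activation at layer i
        a_tgt = video_lower(x_video)   # target activation at layer i
    loss = F.mse_loss(g(a_src), a_tgt)
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss.item()

# ③+④ Map source-only audio instances into the target network at layer i,
#      then fine-tune the target network's upper layers on them.
opt_ft = torch.optim.Adam(video_upper.parameters(), lr=1e-4)

def finetune_step(x_audio_only, y):
    with torch.no_grad():
        a_transferred = g(audio_lower(x_audio_only))  # ③ transferred activation
    logits = video_upper(a_transferred)               # ④ upper layers forward
    loss = F.cross_entropy(logits, y)
    opt_ft.zero_grad(); loss.backward(); opt_ft.step()
    return loss.item()
```

Fine-tuning only the upper layers here mirrors the interpretation below: the more reliable the transfer, the fewer layers need fine-tuning.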
Results
Label Space Setup
  Train:      audio: full;  video: partial
  Fine-tune:  video: full (+transferred)
  Test:       video: full
Datasets: AV_Letters (26 labels), Stanford (49 labels)
Interpretation
• Intractable or less reliable transfer: fine-tune more layers; more reliable transfer: fine-tune fewer layers.
• Performance is soft-upper-bounded by feature mapping accuracy.
• Trade-off between transfer reliability and the number of layers to fine-tune (a sketch of probing this follows below).
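One way to make this trade-off concrete, offered as an assumption rather than the poster's method: measure the transfer function's reconstruction error on held-out parallel pairs at each candidate layer, since that mapping accuracy soft-upper-bounds performance. The helper below reuses the hypothetical g, audio_lower, and video_lower names from the sketch above.

```python
import torch
import torch.nn.functional as F

def mapping_error(g, audio_to_layer, video_to_layer, x_audio, x_video):
    """Held-out MSE between mapped source activations and true target
    activations at one candidate transfer layer (lower is more reliable)."""
    with torch.no_grad():
        return F.mse_loss(g(audio_to_layer(x_audio)),
                          video_to_layer(x_video)).item()

# Comparing mapping_error across candidate layers lets one pick the transfer
# layer: a reliable (low-error) mapping leaves fewer layers to fine-tune.
```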
Future work
• Comparison with state-of-the-art transfer learning methods (heterogeneous transfer, deep shared representation, etc.)
• Artificial construction of target modality instances via top-down inference, using source modality instances