Sequence to Sequence Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
Presented by Dewal Gupta, UCSD CSE 291G, Winter 2019

TRANSCRIPT

Page 1: Sequence to Sequence Video to Text - cseweb.ucsd.educseweb.ucsd.edu/classes/wi19/cse291-g/student_presentations/Image... · Sequence to Sequence Video to Text Subhashini Venugopalan,

Sequence to Sequence Video to Text

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

Presented by Dewal Gupta, UCSD CSE 291G, Winter 2019

Page 2:

BACKGROUND
Challenge: Create a description for a given video

Important in:

- describing videos for the blind
- human-robot interactions

Challenging because:

- diverse set of scenes and actions
- necessary to recognize the salient action in context

Page 3:

PREVIOUS WORK: Template Models
- Tag video with captions and use as a bag of words

- Two-stage pipeline:
  - first: tag the video with semantic information on objects and actions
    - treated as a classification problem
    - FGM labels subject, verb, object, place
  - second: generate a sentence from the semantic information

- S2VT approach: avoids separating content identification from sentence generation

Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild - Mooney et al., 2014

Page 4:

PREVIOUS WORK: Mean Pooling

- CNN trained on object classification (subset of ImageNet)

- 2 layer LSTM with video and previous word as input

- Ignores video frame ordering

Translating Videos to Natural Language Using Deep Recurrent Neural Networks - Mooney et al., 2015

Page 5:

PREVIOUS WORK: Exploiting Temporal Structure

Encoder:
- train a 3D ConvNet on action recognition
- fixed frame input
- exploits local temporal structure

Describing Videos by Exploiting Temporal Structure - Courville et al., 2015

Page 6:

PREVIOUS WORK: Exploiting Temporal Structure

Decoder:
- similar to our HW 2
- exploits global temporal structure

Describing Videos by Exploiting Temporal Structure - Courville et al., 2015

Page 7:

GOAL

An end-to-end differentiable model that can:

1. Handle variable video length (i.e. variable input length)

2. Learn temporal structure

3. Learn a language model that is capable of generating descriptive sentences

Page 8:

MODEL: LSTM
A single network of 2 stacked LSTM layers
- 1000 hidden units (ht)
- red layer: models visual elements
- green layer: models linguistic elements
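The key trick in S2VT (illustrated by the red/green layers above) is that one two-layer LSTM stack is unrolled first over the video frames and then over the caption words, with a pad token filling the absent modality. A minimal sketch of that input schedule, assuming illustrative token names (`<pad>`, `<BOS>`) rather than the paper's actual code:

```python
# Sketch of S2VT's single-stack unrolling. During encoding the LSTM stack
# reads frame features with no word input; during decoding it reads words
# with no frame input. <pad> marks the absent modality at each timestep.
# Token names (<pad>, <BOS>) are illustrative assumptions.

def s2vt_schedule(frames, words, pad="<pad>"):
    """Return one (frame_input, word_input) pair per LSTM timestep."""
    steps = [(f, pad) for f in frames]              # encoding phase
    steps += [(pad, w) for w in ["<BOS>"] + words]  # decoding phase
    return steps

schedule = s2vt_schedule(["f1", "f2", "f3"], ["a", "man", "runs"])
```

Because both phases run through the same weights, the stack can learn temporal structure from the frames and language structure from the words jointly, which is the end-to-end property the paper aims for.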

Page 9:

MODEL: VGG-16

Page 10:

MODEL: AlexNet

Used for RGB & Flow!

Page 11:

MODEL: Details
- Use a text embedding (of 500 dimensions)
  - self-trained, a simple linear transformation
- RGB networks are pre-trained on a subset of ImageNet
  - used the networks from the original works
- Optical flow network is pre-trained on the UCF101 dataset
  - action classification task
  - original work from ‘Action Tubes’
- All layers except the last are frozen during training
- Flow and RGB are combined by a “shallow fusion” technique
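The “shallow fusion” in the last bullet can be read as a weighted mix of the next-word distributions produced by the RGB and flow networks at each decoding step. The sketch below assumes that reading; the mixing weight `alpha` is a tuned hyperparameter, and the exact formulation is an assumption, not the paper's code:

```python
# Hypothetical shallow fusion: mix the next-word probability
# distributions from the RGB and flow networks with a scalar weight.

def shallow_fusion(p_rgb, p_flow, alpha=0.6):
    """Elementwise convex combination of two word distributions."""
    return [alpha * r + (1 - alpha) * f for r, f in zip(p_rgb, p_flow)]

fused = shallow_fusion([0.7, 0.2, 0.1], [0.5, 0.4, 0.1])
```

Since the result is a convex combination, it remains a valid probability distribution, and the caption is decoded from the fused scores.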

Page 12:

DATASETS
3 datasets used:

- Microsoft Video Description corpus (MSVD)
- MPII Movie Description Corpus (MPII-MD)
- Montreal Video Annotation Dataset (M-VAD)

MSVD: web clips with human annotations

MPII-MD: Hollywood clips with descriptions from script & audio (originally for the visually impaired)

M-VAD: Hollywood clips with audio descriptions

All three have single-sentence descriptions

Page 13:

DATASETS: Metrics
Authors use the METEOR metric

- uses exact token, stemmed token and WordNet synonym matches

- better correlation with human judgement than BLEU or ROUGE

- outperforms CIDEr when there are fewer references
  - these datasets have only 1 reference each

where:
- m is the number of unigram (or n-gram) matches after alignment
- wr is the length of the reference
- wt is the length of the candidate
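The formula image on this slide did not survive extraction; the standard METEOR definition that the legend above describes is, as best reconstructed (with chunks denoting the number of contiguous matched spans):

```latex
\[
P = \frac{m}{w_t}, \qquad
R = \frac{m}{w_r}, \qquad
F_{\mathrm{mean}} = \frac{10\,P\,R}{R + 9P}
\]
\[
\mathrm{Penalty} = 0.5\left(\frac{\mathrm{chunks}}{m}\right)^{3}, \qquad
\mathrm{METEOR} = F_{\mathrm{mean}}\,\bigl(1 - \mathrm{Penalty}\bigr)
\]
```

The recall-heavy F-mean and the fragmentation penalty are what give METEOR its better correlation with human judgement than precision-only BLEU.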

Page 14:

RESULTS: MSVD
FGM is template-based
- not very descriptive
- predicts a noun, verb, object, and place
- builds a sentence off the template

Page 15:

RESULTS: MSVD
Mean-pooling-based method
- very similar to the authors' method

Page 16:

RESULTS: MSVD
Temporal Attention method

- Encoder/Decoder using attention

Page 17:

RESULTS: Frame ordering

- Training with random ordering of frames results in “considerably lower” performance

Page 18:

RESULTS: Optical Flow
- Flow results in better performance only when combined with RGB (not when used alone)

- Flow can be very different even for same activities

- Flow can’t account for polysemous words like “play” - eg. “play guitar” vs “play golf”

Page 19:

RESULTS: SOTA
- Authors claim the accurate comparison is with GoogLeNet with NO 3D-CNN (global temporal attention)
  - a questionable claim

Page 20:

Results: MPII-MD, M-VAD

- Similar performance to Visual-Labels
- VL uses more semantic information (e.g. object detection) but no temporal information

[Result tables: MPII-MD and M-VAD]

Page 21:

Results: Edit Distance
Levenshtein distance represents the edit distance between two strings

- 42.9% of generated samples match exactly with a sentence in the training corpus of MSVD

- model struggles to learn M-VAD
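Levenshtein distance, used above to measure how closely generated captions match training sentences, is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming implementation (not the authors' code):

```python
def levenshtein(a, b):
    """Minimum edit distance between strings a and b."""
    # prev[j] holds the cost of turning the processed prefix of a into
    # b[:j]; we sweep the DP table one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3, and a distance of 0 corresponds to the exact training-corpus matches reported on the slide.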

Page 22:

CRITICISM
- Model fails to learn temporal relations
  - it performs only about as well as the mean-pooling technique, which makes no use of temporal relations
- Model struggles more on the M-VAD dataset than on the others, for unclear reasons
- Authors should have reported BLEU and/or CIDEr scores as well (other studies report them)
- Conduct a user study (where humans judge the captions)?
- Could it improve by using better text embeddings?

Page 23:

FURTHER WORK

End-to-End Video Captioning with Multitask Reinforcement Learning - Li & Gong, 2019

- Use Inception ResNet v2 as backbone CNN

- Train CNN against mined video “attributes”

- Achieve a +5% METEOR score on MSVD
- Same architecture as S2VT

Page 24:

FURTHER WORK
- Use a 3D CNN instead of LSTMs to get better clip embeddings

- proven better in activity recognition

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification - Xie et al., 2017

Page 25:

CONCLUSION
The authors build an end-to-end differentiable model that can:

1. Handle variable video length (i.e. variable input length)

2. Learn temporal structure

3. Learn a language model that is capable of generating descriptive sentences

Has become a baseline for many video captioning works

Page 26:

EXAMPLES