Sequence to Sequence Video to Text
Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
Presented by Dewal Gupta, UCSD CSE 291G, Winter 2019

TRANSCRIPT

Page 1: Sequence to Sequence Video to Text - cseweb.ucsd.educseweb.ucsd.edu/classes/wi19/cse291-g/student_presentations/Image... · Sequence to Sequence Video to Text Subhashini Venugopalan,

Sequence to Sequence Video to Text

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

Presented by Dewal Gupta, UCSD CSE 291G, Winter 2019

Page 2:

BACKGROUND
Challenge: Create a description for a given video

Important in:

- describing videos for the blind
- human-robot interactions

Challenging because:

- diverse set of scenes and actions
- necessary to recognize the salient action in context

Page 3:

PREVIOUS WORK: Template Models
- Tag video with captions and use as a bag of words

- Two-stage pipeline:
  - first: tag the video with semantic information on objects and actions
    - treated as a classification problem
    - FGM labels subject, verb, object, place
  - second: generate a sentence from the semantic information

- S2VT approach: avoids separating content identification from sentence generation

Integrating Language and Vision to Generate Natural Language Descriptions of Videos in the Wild - Mooney et al., 2014

Page 4:

PREVIOUS WORK: Mean Pooling

- CNN trained on object classification (subset of ImageNet)

- 2 layer LSTM with video and previous word as input

- Ignores video frame ordering

Translating Videos to Natural Language Using Deep Recurrent Neural Networks - Mooney et al., 2015

Page 5:

PREVIOUS WORK: Exploiting Temporal Structure

Encoder:
- train a 3D ConvNet on action recognition
- fixed frame input
- exploits local temporal structure

Describing Videos by Exploiting Temporal Structure - Courville et al., 2015

Page 6:

PREVIOUS WORK: Exploiting Temporal Structure

Decoder:
- similar to our HW 2
- exploits global temporal structure

Describing Videos by Exploiting Temporal Structure - Courville et al., 2015

Page 7:

GOAL

An end-to-end differentiable model that can:

1. Handle variable video length (i.e. variable input length)

2. Learn temporal structure

3. Learn a language model that is capable of generating descriptive sentences

Page 8:

MODEL: LSTM
A single network of 2 stacked LSTM layers
- 1000 hidden units (ht)
- red layer: models visual elements
- green layer: models linguistic elements
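The key trick in S2VT (illustrated by the red/green layers above) is that one two-layer LSTM stack is unrolled first over the video frames and then over the caption words, with a pad token filling the absent modality. A minimal sketch of that input schedule, assuming illustrative token names (`<pad>`, `<BOS>`) rather than the paper's actual code:

```python
# Sketch of S2VT's single-stack unrolling. During encoding the LSTM stack
# reads frame features with no word input; during decoding it reads words
# with no frame input. <pad> marks the absent modality at each timestep.
# Token names (<pad>, <BOS>) are illustrative assumptions.

def s2vt_schedule(frames, words, pad="<pad>"):
    """Return one (frame_input, word_input) pair per LSTM timestep."""
    steps = [(f, pad) for f in frames]              # encoding phase
    steps += [(pad, w) for w in ["<BOS>"] + words]  # decoding phase
    return steps

schedule = s2vt_schedule(["f1", "f2", "f3"], ["a", "man", "runs"])
```

Because both phases run through the same weights, the stack can learn temporal structure from the frames and language structure from the words jointly, which is the end-to-end property the paper aims for.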

Page 9:

MODEL: VGG-16

Page 10:

MODEL: AlexNet

Used for RGB & Flow!

Page 11:

MODEL: Details
- Use a text embedding (of 500 dimensions)
  - self-trained, a simple linear transformation
- RGB networks are pre-trained on a subset of ImageNet
  - used the networks from the original works
- Optical flow network is pre-trained on the UCF101 dataset
  - action classification task
  - original work from ‘Action Tubes’
- All layers except the last are frozen during training
- Flow and RGB are combined by a “shallow fusion” technique
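The “shallow fusion” in the last bullet can be read as a weighted mix of the next-word distributions produced by the RGB and flow networks at each decoding step. The sketch below assumes that reading; the mixing weight `alpha` is a tuned hyperparameter, and the exact formulation is an assumption, not the paper's code:

```python
# Hypothetical shallow fusion: mix the next-word probability
# distributions from the RGB and flow networks with a scalar weight.

def shallow_fusion(p_rgb, p_flow, alpha=0.6):
    """Elementwise convex combination of two word distributions."""
    return [alpha * r + (1 - alpha) * f for r, f in zip(p_rgb, p_flow)]

fused = shallow_fusion([0.7, 0.2, 0.1], [0.5, 0.4, 0.1])
```

Since the result is a convex combination, it remains a valid probability distribution, and the caption is decoded from the fused scores.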

Page 12:

DATASETS
3 datasets used:

- Microsoft Video Description corpus (MSVD)
- MPII Movie Description Corpus (MPII-MD)
- Montreal Video Annotation Dataset (M-VAD)

MSVD: web clips with human annotations

MPII-MD: Hollywood clips with descriptions from script & audio (originally for the visually impaired)

M-VAD: Hollywood clips with audio descriptions

All three have single-sentence descriptions

Page 13:

DATASETS: Metrics
Authors use the METEOR metric

- uses exact token, stemmed token and WordNet synonym matches

- better correlation with human judgement than BLEU or ROUGE

- outperforms CIDEr when there are fewer references
  - these datasets have only 1 reference each

where:
- m is the number of unigram (or n-gram) matches after alignment
- wr is the length of the reference
- wt is the length of the candidate
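The formula image on this slide did not survive extraction; the standard METEOR definition that the legend above describes is, as best reconstructed (with chunks denoting the number of contiguous matched spans):

```latex
\[
P = \frac{m}{w_t}, \qquad
R = \frac{m}{w_r}, \qquad
F_{\mathrm{mean}} = \frac{10\,P\,R}{R + 9P}
\]
\[
\mathrm{Penalty} = 0.5\left(\frac{\mathrm{chunks}}{m}\right)^{3}, \qquad
\mathrm{METEOR} = F_{\mathrm{mean}}\,\bigl(1 - \mathrm{Penalty}\bigr)
\]
```

The recall-heavy F-mean and the fragmentation penalty are what give METEOR its better correlation with human judgement than precision-only BLEU.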

Page 14:

RESULTS: MSVD
FGM is template-based
- not very descriptive
- predicts a noun, verb, object, and place
- builds a sentence off the template

Page 15:

RESULTS: MSVD
Mean-pooling-based method
- very similar to the authors' method

Page 16:

RESULTS: MSVD
Temporal Attention method

- Encoder/Decoder using attention

Page 17:

RESULTS: Frame ordering

- Training with random ordering of frames results in “considerably lower” performance

Page 18:

RESULTS: Optical Flow
- Flow results in better performance only when combined with RGB (not when used alone)

- Flow can be very different even for same activities

- Flow can’t account for polysemous words like “play” - eg. “play guitar” vs “play golf”

Page 19:

RESULTS: SOTA
- Authors claim the accurate comparison is with GoogLeNet with NO 3D-CNN (global temporal attention)
  - a questionable claim

Page 20:

Results: MPII-MD, M-VAD

- Similar performance to Visual-Labels
- VL uses more semantic information (e.g. object detection) but no temporal information

[Result tables: MPII-MD and M-VAD]

Page 21:

Results: Edit Distance
Levenshtein distance represents the edit distance between two strings

- 42.9% of generated samples match exactly with a sentence in the training corpus of MSVD

- model struggles to learn M-VAD
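Levenshtein distance, used above to measure how closely generated captions match training sentences, is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A standard dynamic-programming implementation (not the authors' code):

```python
def levenshtein(a, b):
    """Minimum edit distance between strings a and b."""
    # prev[j] holds the cost of turning the processed prefix of a into
    # b[:j]; we sweep the DP table one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]
```

For example, `levenshtein("kitten", "sitting")` is 3, and a distance of 0 corresponds to the exact training-corpus matches reported on the slide.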

Page 22:

CRITICISM
- Model fails to learn temporal relations
  - it performs only about as well as the mean-pooling technique, which makes no use of temporal relations
- Model struggles more on the M-VAD dataset than on the others, for unclear reasons
- Authors should have reported BLEU and/or CIDEr scores as well (other studies report them)
- Conduct a user study (where humans judge the captions)?
- Could it improve by using better text embeddings?

Page 23:

FURTHER WORK

End-to-End Video Captioning with Multitask Reinforcement Learning - Li & Gong, 2019

- Use Inception ResNet v2 as backbone CNN

- Train CNN against mined video “attributes”

- Achieve a +5% METEOR score on MSVD
- Same architecture as S2VT

Page 24:

FURTHER WORK
- Use a 3D CNN instead of LSTMs to get better clip embeddings

- proven better in activity recognition

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification - Xie et al., 2017

Page 25:

CONCLUSION
The authors build an end-to-end differentiable model that can:

1. Handle variable video length (i.e. variable input length)

2. Learn temporal structure

3. Learn a language model that is capable of generating descriptive sentences

Has become a baseline for many video captioning works

Page 26:

EXAMPLES