video + language: where does domain knowledge fit in?

Video + Language: Where Does Domain Knowledge Fit in? Jiebo Luo Department of Computer Science July 10, 2016 Keynote@2016 IJCAI Workshop on Semantic Machine Learning

Upload: goergen-institute-for-data-science

Post on 12-Apr-2017


TRANSCRIPT

Page 1: Video + Language: Where Does Domain Knowledge Fit in?

Video + Language: Where Does Domain Knowledge Fit in?
Jiebo Luo
Department of Computer Science

July 10, 2016

Keynote@2016 IJCAI Workshop on Semantic Machine Learning

Page 2: Video + Language: Where Does Domain Knowledge Fit in?

Domain Knowledge in Machine Learning

• Domain knowledge is used frequently in ML applications (sometimes without knowing that you are actually doing it)
– A good example is feature extraction. What features to use?
– Other uses include objective function, parameter selection
– Even in deep learning (architecture, learning rate, etc.)
– Certainly probabilistic graphical models (including priors)
– Data cleaning (yes!)

• Context models encode domain knowledge
– Spatial context (e.g., in computer vision)
– Temporal context (e.g., in sequence analysis)
– Social context (e.g., in social media data mining)

• We will focus on some less obvious, more sophisticated forms of domain knowledge, especially in the area of “vision and language”, an emerging fertile ground in machine learning

Page 3: Video + Language: Where Does Domain Knowledge Fit in?

IEEE Signal Processing, 2005

Page 4: Video + Language: Where Does Domain Knowledge Fit in?

Introduction

• Video has become ubiquitous on the Internet, TV, as well as personal devices.

• Recognition of video content has been a fundamental challenge in computer vision for decades, where previous research predominantly focused on understanding videos using a predefined yet limited vocabulary.

• Thanks to the recent development of deep learning techniques, researchers in both computer vision and multimedia communities are now striving to bridge video with natural language, which can be regarded as the ultimate goal of video understanding.

• We present recent advances in exploring the synergy of video understanding and language processing, including video-language alignment, video captioning, and video emotion analysis.

Page 5: Video + Language: Where Does Domain Knowledge Fit in?

Video Growth

• CISCO Trends


Page 6: Video + Language: Where Does Domain Knowledge Fit in?

SEMANTICS <> USER INTENT

Page 7: Video + Language: Where Does Domain Knowledge Fit in?

SEMANTICS IN LARGE SCALE VIDEO SEARCH & RETRIEVAL

Page 8: Video + Language: Where Does Domain Knowledge Fit in?

SEMANTIC VIDEO ENTITY LINKING

Users

User Modeling

User Content Understanding

Learning From User

Tags

Yuncheng Li, Xitong Yang, Jiebo Luo

Page 9: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

Page 10: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

• Motivations to use visual content
1. Video entity linking is very challenging with only title & descriptions, especially for UGC
2. Video entity linking must be of high quality
3. The visual content of a video truly represents the user intent for video watching and sharing

Page 11: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

Page 12: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

• Sequence-to-Set matching

Key Frame Sequence

Representative Image Set

Match?

Page 13: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

• Train Data: Triplets (Keyframes, Images, Label)

Key Frame Sequence

Representative Image Set

Yes

Key Frame Sequence

No

Representative Image Set

Page 14: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

• Goal: Keyframe-Image distance (metric)

Key Frame Sequence

Representative Image Set

Page 15: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

• Challenge: Not all pairs are true matches

Key Frame Sequence

Representative Image Set

Yes

Page 16: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

• Solution: Multiple Instance Metric Learning
– EM-like algorithm:
• E-Step: infer the true matches
• M-Step: retrain the metric

• More Challenges
– Overwhelming noise (variability & ambiguity) in both visual content

• Solution
– Structured constraints to suppress noise
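The EM-like loop above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the metric is a non-negative diagonal weighting of squared feature differences, the E-step takes the closest keyframe/image pair of each bag as its "true match", and the M-step uses a simple hinge-style update. The function names (`fit_metric`, `bag_distance`) and the margins are hypothetical.

```python
import numpy as np

def pair_sq_diffs(frames, images):
    """Squared per-dimension differences for every keyframe/image pair: shape (n, m, d)."""
    return (frames[:, None, :] - images[None, :, :]) ** 2

def closest_pair(frames, images, w):
    """Return the difference vector and distance of the closest pair under weights w."""
    diffs = pair_sq_diffs(frames, images)
    dists = diffs @ w                                   # weighted squared distances, (n, m)
    i, j = np.unravel_index(np.argmin(dists), dists.shape)
    return diffs[i, j], dists[i, j]

def bag_distance(frames, images, w):
    """Sequence-to-set distance: distance of the best-matching pair."""
    return closest_pair(frames, images, w)[1]

def fit_metric(bags, dim, rounds=5, steps=50, lr=0.05, pos_margin=0.5, neg_margin=1.5):
    """bags: list of (keyframes, images, label) triplets with label 1 (match) or 0."""
    w = np.ones(dim)                                    # assumed diagonal metric
    for _ in range(rounds):
        # E-step: infer the (assumed) true match of each bag under the current metric.
        chosen = [(closest_pair(f, g, w)[0], label) for f, g, label in bags]
        # M-step: retrain the metric on the inferred pairs with a hinge-style update.
        for _ in range(steps):
            grad = np.zeros(dim)
            for diff, label in chosen:
                dist = diff @ w
                if label == 1 and dist > pos_margin:    # matched pair is too far apart
                    grad += diff
                elif label == 0 and dist < neg_margin:  # non-matching pair is too close
                    grad -= diff
            w = np.maximum(w - lr * grad, 0.0)          # keep the weights non-negative
    return w
```

For a negative bag the closest pair acts as the hardest negative, so pushing it past the margin pushes every pair in that bag past it as well.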

Page 17: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

Title: “Jason Taylor Career Highlights”

Page 18: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

• Experiments
– Dataset: 1920 videos
– Labels: Amazon Mechanical Turk

Page 19: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking

We propose two constraints to guide the video entity linking process:
1. Temporal Smoothness: the entity occurrence (matches with any of the representative images) is smooth over time
2. Representativeness Smoothness: in order to reduce the significant irrelevant information
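The slide does not say how temporal smoothness is enforced. As a rough stand-in only, the sketch below smooths per-keyframe match indicators with a moving average, so an isolated match is suppressed while a sustained run survives. The function name `smooth_occurrence`, the window size, and the 0.5 threshold are assumptions made for illustration.

```python
import numpy as np

def smooth_occurrence(scores, window=3):
    """Moving average of per-keyframe match scores (window must be odd)."""
    scores = np.asarray(scores, dtype=float)
    kernel = np.ones(window) / window
    # Repeat the edge values so the output has one value per keyframe.
    padded = np.pad(scores, window // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# One isolated match (index 2) and one sustained run (indices 5 to 7).
raw = [0, 0, 1, 0, 0, 1, 1, 1, 0]
kept = (smooth_occurrence(raw) >= 0.5).astype(int)
```

With these inputs only the sustained run stays above the threshold.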

Page 20: Video + Language: Where Does Domain Knowledge Fit in?

Conclusions

• Metric Learning helps by adapting to different topics and domains

• Structured constraints are important for suppressing noise

• Future work includes integrating the video metadata information and building entity integrated applications, e.g., video spam detection

Page 21: Video + Language: Where Does Domain Knowledge Fit in?

Semantic Video Entity Linking. An Application: Video Spam Removal

• “guardians of the galaxy full movie”
– Let’s watch the movie

Page 22: Video + Language: Where Does Domain Knowledge Fit in?

Unsupervised Alignment of Actions in Video with Text Descriptions

Y. Song, I. Naim, A. Mamun, K. Kulkarni, P. Singla, J. Luo, D. Gildea, H. Kautz

Page 23: Video + Language: Where Does Domain Knowledge Fit in?

Overview

• Unsupervised alignment of video with text

• Motivations
– Generate labels from data (reduce burden of manual labeling)
– Learn new actions from only parallel video+text
– Extend noun/object matching to verbs and actions

[Figure: an overview of the text and video alignment framework, matching nouns to objects and verbs to actions for the sentence “The person takes out a knife and cutting board” [Naim et al., 2015]]

Page 24: Video + Language: Where Does Domain Knowledge Fit in?

Hyperfeatures for Actions

• High-level features required for alignment with text
→ Motion features are generally low-level

• Hyperfeatures, originally used for image recognition, extended for use with motion features
→ Use temporal domain instead of spatial domain for vector quantization (clustering)

Originally described in “Hyperfeatures: Multilevel Local Coding for Visual Recognition”, Agarwal, A. (ECCV 06), for images

[Figure: hyperfeatures for actions]

Page 25: Video + Language: Where Does Domain Knowledge Fit in?

Hyperfeatures for Actions

• From low-level motion features, create high-level representations that can easily align with verbs in text

[Figure: hyperfeature construction pipeline, with these annotations]
– Each color code is a vector-quantized STIP point
– Accumulate over the frame at time t & cluster: vector-quantized STIP point histogram at time t
– Conduct vector quantization of the histogram at time t (e.g., Cluster 3 at time t)
– Accumulate clusters over window (t-w/2, t+w/2] and conduct vector quantization → first-level hyperfeatures (e.g., Cluster 3, 5, …, 5, 20 = Hyperfeature 6)
– Align hyperfeatures with verbs from text (using LCRF)
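The two quantization stages described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the STIP descriptors have already been quantized into integer codeword ids per frame, uses a small hand-written k-means as the vector quantizer, and introduces hypothetical names (`hyperfeatures`, `k1`, `k2`).

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means with deterministic, evenly spaced initial centers."""
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):                      # skip empty clusters
                centers[j] = points[labels == j].mean(axis=0)
    d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1), centers

def frame_histograms(codes_per_frame, n_codes):
    """Histogram of quantized STIP codewords for each frame: shape (T, n_codes)."""
    return np.stack([np.bincount(np.asarray(c, dtype=int), minlength=n_codes)
                     for c in codes_per_frame]).astype(float)

def window_sums(rows, w):
    """Sum rows over the half-open window (t - w/2, t + w/2] for every time t."""
    T = len(rows)
    csum = np.vstack([np.zeros(rows.shape[1]), np.cumsum(rows, axis=0)])
    out = np.empty_like(rows)
    for t in range(T):
        lo = max(0, t - w // 2 + 1)                      # first frame inside the window
        hi = min(T, t + w // 2 + 1)                      # one past the last frame
        out[t] = csum[hi] - csum[lo]
    return out

def hyperfeatures(codes_per_frame, n_codes, w, k1, k2):
    hists = frame_histograms(codes_per_frame, n_codes)
    frame_clusters, _ = kmeans(hists, k1)                # quantize each frame histogram
    onehot = np.eye(k1)[frame_clusters]                  # cluster id per frame, one-hot
    windowed = window_sums(onehot, w)                    # accumulate clusters over the window
    labels, _ = kmeans(windowed, k2)                     # quantize again: hyperfeature id
    return labels
```

The output is one hyperfeature id per frame, which is the kind of discrete symbol sequence that can then be aligned with verbs.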

Page 26: Video + Language: Where Does Domain Knowledge Fit in?

Latent-variable CRF Alignment

• CRF where the latent variable is the alignment
– N pairs of video/text observations $\{(x_i, y_i)\}_{i=1}^{N}$ (indexed by i)
– $x_{i,m}$ represents nouns and verbs extracted from the m-th sentence
– $y_{i,n}$ represents blobs and actions in interval n in the video

• Conditional likelihood
– conditional probability of [equation not preserved in the transcript]

• Learning weights w
– Stochastic gradient descent, where [feature function not preserved in the transcript]

More details in Naim et al. 2015 NAACL paper, “Discriminative unsupervised alignment of natural language instructions with corresponding video segments”
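For orientation, the generic form of a latent-variable CRF objective is written out below. This is the textbook formulation, given here as an assumption about what the missing slide equations express; the latent alignment symbol $h$, the feature function $\Phi$, and the partition function $Z$ are notation introduced for this sketch, and the exact definitions should be taken from Naim et al. 2015.

```latex
% Conditional probability of the video observations y_i given the text x_i,
% marginalizing over the latent alignment h:
p(y_i \mid x_i; w) = \sum_{h} p(y_i, h \mid x_i; w)
                   = \frac{\sum_{h} \exp\big(w^{\top} \Phi(x_i, y_i, h)\big)}{Z(x_i; w)},
\qquad
Z(x_i; w) = \sum_{y'} \sum_{h'} \exp\big(w^{\top} \Phi(x_i, y', h')\big)

% Conditional log-likelihood over the N pairs:
L(w) = \sum_{i=1}^{N} \log p(y_i \mid x_i; w)

% Gradient used in each stochastic gradient step (difference of two expectations):
\nabla_w \log p(y_i \mid x_i; w)
  = \mathbb{E}_{p(h \mid x_i, y_i; w)}\big[\Phi(x_i, y_i, h)\big]
  - \mathbb{E}_{p(y, h \mid x_i; w)}\big[\Phi(x_i, y, h)\big]
```

A stochastic gradient step updates $w$ along this gradient for one pair (or a small batch of pairs) at a time.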

Page 27: Video + Language: Where Does Domain Knowledge Fit in?

Experiments: Wetlab Dataset

• RGB-Depth video with lab protocols in text
– Compare addition of hyperfeatures generated from motion features to previous results (Naim et al. 2015)

• Small improvement over previous results
– Activities already highly correlated with object-use

[Figure: detection of objects in 3D space using color and point-cloud]

[Table: previous results using object/noun alignment only, compared with the addition of different types of motion features. 2DTraj: Dense trajectories]

*Using hyperfeature window size w=150

Page 28: Video + Language: Where Does Domain Knowledge Fit in?

Experiments: TACoS Dataset

• RGB video with crowd-sourced text descriptions
– Activities such as “making a salad,” “baking a cake”
– No object recognition, alignment using actions only

– Uniform: Assume each sentence takes the same amount of time over the entire sequence
– Segmented LCRF: Assume the segmentation of actions is known, infer only the action labels
– Unsupervised LCRF: Both segmentation and alignment are unknown

• Effect of window size and number of clusters
– Consistent with average action length: 150 frames

*Using hyperfeature window size w=150

*d(2)=64
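The Uniform baseline is simple enough to write down directly. The sketch below assumes frames are indexed from 0 and that the alignment is represented as one sentence index per frame; the function names are hypothetical, and the accuracy helper is just one plausible way to score an alignment against ground truth.

```python
def uniform_alignment(n_frames, n_sentences):
    """Assign each frame a sentence index, giving every sentence the same duration."""
    return [t * n_sentences // n_frames for t in range(n_frames)]

def frame_accuracy(predicted, reference):
    """Fraction of frames whose predicted sentence index matches the reference."""
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)
```

For example, `uniform_alignment(6, 3)` spreads six frames evenly over three sentences, two frames each.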

Page 29: Video + Language: Where Does Domain Knowledge Fit in?

Experiments: TACoS Dataset

• Segmentation from a sequence in the dataset

[Figure: crowd-sourced descriptions]

Example of text and video alignment generated by the system on the TACoS corpus for sequence s13-d28

Page 30: Video + Language: Where Does Domain Knowledge Fit in?

Image Captioning with Semantic Attention (CVPR 2016)

Quanzeng You, Jiebo Luo, Hailin Jin, Zhaowen Wang and Chen Fang

Page 31: Video + Language: Where Does Domain Knowledge Fit in?

Image Captioning

• Motivations
– Real-world Usability
• Help visually impaired people, learning-impaired
– Improving Image Understanding
• Classification, Object detection
– Image Retrieval

1. a young girl inhales with the intent of blowing out a candle
2. girl blowing out the candle on an ice cream

1. A shot from behind home plate of children playing baseball
2. A group of children playing baseball in the rain
3. Group of baseball players playing on a wet field

Page 32: Video + Language: Where Does Domain Knowledge Fit in?

Introduction of Image Captioning

• Machine learning as an approach to solve the problem

Model sentence

1. A young girl inhales with the intent of blowing out a candle.
2. A young girl is preparing to blow out her candle.
3. A kid is to blow out the single candle in the bowl of birthday goodness.
4. Girl blowing out the candle on an ice-cream
5. A little girl is getting ready to blow out a candle on a small dessert.

1. A shot from behind home plate of children playing baseball
2. A group of children playing baseball in the rain
3. Group of baseball players playing on a wet field
4. A batter leaning back so they don’t get hit by a ball
5. A group of young boys playing baseball in the rain

1. A girl in a park area flies a multi-colored kite.
2. A girl flying a kit in the sky
3. A young woman flying a rainbow colored kite.
4. A person in a large field flying a kite in the sky.
5. A woman looks up at her colorful sailing kite.

Page 33: Video + Language: Where Does Domain Knowledge Fit in?

Overview

• Brief overview of current approaches
• Our main motivation
• The proposed semantic attention model
• Evaluation results

Page 34: Video + Language: Where Does Domain Knowledge Fit in?

Brief Introduction of Recurrent Neural Network

• Different from CNN

$h_t = f(x_t, h_{t-1}) = A x_t + B h_{t-1}$

$y_t = C h_t$

• Unfolding over time
– Feedforward network
– Backpropagation Through Time

[Figure: a convolutional neural network (inputs, hidden units, outputs) beside a recurrent network unfolded over time steps t-2, t-1 and t, with inputs $x_t$, hidden units $h_{t-1}$ and $h_t$, outputs $y_t$, and shared weights A, B, C]
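Unfolding the recurrence over time can be sketched with a plain loop that reuses the same A, B and C at every step. This follows the linear form on the slide; practical RNNs usually wrap the update in a nonlinearity such as tanh, which is exposed here as an optional, assumed `activation` argument. The function name `rnn_unroll` is hypothetical.

```python
import numpy as np

def rnn_unroll(xs, A, B, C, h0, activation=lambda z: z):
    """Apply h_t = activation(A x_t + B h_{t-1}) and y_t = C h_t over a sequence."""
    h = h0
    hidden, outputs = [], []
    for x in xs:                       # the same weights are shared across time steps
        h = activation(A @ x + B @ h)
        hidden.append(h)
        outputs.append(C @ h)
    return np.array(hidden), np.array(outputs)

# One-dimensional example with A = 1, B = 0.5, C = 2 and a constant input of 1.
A, B, C = np.array([[1.0]]), np.array([[0.5]]), np.array([[2.0]])
hs, ys = rnn_unroll([np.array([1.0])] * 3, A, B, C, h0=np.zeros(1))
```

Because the unrolled loop is an ordinary feedforward computation with tied weights, gradients can be propagated back through every step, which is what backpropagation through time refers to.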

Page 35: Video + Language: Where Does Domain Knowledge Fit in?

Applications of Recurrent Neural Networks

• Machine Translation
• Reads input sentence “ABC” and produces “WXYZ”

[Figure: Encoder RNN followed by Decoder RNN]

Page 36: Video + Language: Where Does Domain Knowledge Fit in?

Encoder-Decoder Framework for Captioning

• Inspired by neural network based machine translation

• Loss function

$L = -\log p(w \mid I) = -\sum_{t=1}^{N} \log p(w_t \mid I, w_0, \ldots, w_{t-1})$

[Figure: the image passes through a convolutional neural network into a recurrent neural network that reads $w_{start}, w_1, \ldots, w_N$ and predicts $w_1, w_2, \ldots, w_{end}$; example output for an image: “Some elephants roaming around on a river bank.”]
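The loss is the summed negative log-probability that the decoder assigns to each ground-truth word. A minimal sketch, assuming the decoder has already produced one probability distribution over the vocabulary per time step (the names `caption_nll`, `step_probs` and `target_ids` are illustrative):

```python
import numpy as np

def caption_nll(step_probs, target_ids):
    """Return -sum_t log p(w_t | I, w_0..w_{t-1}) for one caption.

    step_probs: (N, vocab) array; row t is the predicted distribution at step t.
    target_ids: length-N sequence of ground-truth word indices.
    """
    step_probs = np.asarray(step_probs, dtype=float)
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return -np.sum(np.log(picked))

# Two-step caption over a two-word vocabulary.
loss = caption_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Here the loss equals -(log 0.5 + log 0.75), the negative log of the probability of the whole caption.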

Page 37: Video + Language: Where Does Domain Knowledge Fit in?

Our Motivation

• Additional textual information
– Own noisy title, tags or captions (Web)
– Visually similar nearest neighbor images
– Success of low-level tasks
• Visual attributes detection

Page 38: Video + Language: Where Does Domain Knowledge Fit in?

Image Captioning with Semantic Attention

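• Main idea

[Figure: a CNN yields candidate attributes (wave, riding, man, surfboard, ocean, water, surfer, surfing, person, board); an attention module assigns them weights (bar chart, axis 0 to 0.3) and feeds the RNN; highlighted attributes: surfboard, wave, surfing]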

Page 39: Video + Language: Where Does Domain Knowledge Fit in?

First Idea

• Provide additional knowledge at each input node

• Concatenate the input word and the extra attributes K

• Each image has a fixed keyword list

$h_t = f(x_t, h_{t-1}) = f([w_t, W_k K + b], h_{t-1})$

Visual Features: 1024 (GoogLeNet)
LSTM Hidden states: 512

Training details:
1. 256 image/sentence pairs
2. RMSProp

[Figure: the image passes through a convolutional neural network into a recurrent neural network that reads $w_{start}, w_1, \ldots, w_N$ and predicts $w_1, w_2, \ldots, w_{end}$; keywords and key-phrases K are obtained by feature extraction and by retrieving tags, titles and descriptions from weakly annotated images, and are fed to the RNN at every step]

Page 40: Video + Language: Where Does Domain Knowledge Fit in?

Using Attributes along with Visual Features

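• Provide additional knowledge at each input node

• Concatenate the visual embedding and keywords for $h_0$

$h_0 = f(v, h_{-1}) = [W_v v; W_k K + b]$

[Figure: the image passes through a convolutional neural network into a recurrent neural network that reads $w_{start}, w_1, \ldots, w_N$ and predicts $w_1, w_2, \ldots, w_{end}$; keywords and key-phrases K are obtained by feature extraction and by retrieving tags, titles and descriptions from weakly annotated images, and enter the RNN through the initial state $h_0$]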

Page 41: Video + Language: Where Does Domain Knowledge Fit in?

Attention Model on Attributes

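• Instead of using the same set of attributes at every step
• At each step, select the attributes

$att(w_t, K) = \sum_m \alpha_{tm} k_m$

$\alpha_t = \mathrm{softmax}(w_t^{T} V K)$

$h_t = f(x_t, h_{t-1}) = f([x_t; att(w_t, K)], h_{t-1})$

[Figure: the image passes through a CNN into an RNN that reads $w_{start}, w_1, \ldots, w_N$ and predicts $w_1, w_2, \ldots, w_{end}$; keywords and key-phrases K are obtained by feature extraction and by retrieving tags, titles and descriptions from weakly annotated images, and are attended to at each step]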
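The attention step above amounts to a bilinear score between the current word and every attribute, a softmax, and a weighted sum. The sketch below assumes the attribute embeddings $k_m$ are stored as the columns of K and that $w_t$, V and K have compatible shapes (d, d×d and d×M); the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def attribute_attention(w_t, K, V):
    """Return att(w_t, K) = sum_m alpha_tm * k_m and the weights alpha_t.

    w_t: (d,) embedding of the current word
    K:   (d, M) matrix whose columns are the attribute embeddings k_m
    V:   (d, d) bilinear parameter matrix
    """
    alpha = softmax(w_t @ V @ K)       # alpha_t = softmax(w_t^T V K), shape (M,)
    context = K @ alpha                # weighted sum of attribute embeddings, shape (d,)
    return context, alpha

def attended_input(x_t, w_t, K, V):
    """Build the RNN input [x_t; att(w_t, K)] by concatenation."""
    context, _ = attribute_attention(w_t, K, V)
    return np.concatenate([x_t, context])
```

Because the weights are recomputed from $w_t$ at every step, different attributes can dominate at different positions in the caption.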

Page 42: Video + Language: Where Does Domain Knowledge Fit in?

Overall Framework

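• Training with a bilinear/bilateral attention model

[Figure: overall framework. Components shown: Image, CNN, visual feature $v$, attribute detectors AttrDet 1 … AttrDet N producing attributes $\{A_i\}$, an RNN with input $x_t$, hidden state $h_t$, output $p_t$ and predicted word $Y_t$, and two attention functions $\varphi$ and $\phi$]

[Figure: a CNN yields candidate attributes (wave, riding, man, surfboard, ocean, water, surfer, surfing, person, board); an attention module assigns them weights (bar chart, axis 0 to 0.3) and feeds the RNN; highlighted attributes: surfboard, wave, surfing]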

Page 43: Video + Language: Where Does Domain Knowledge Fit in?

Visual Attributes

• A secondary contribution
• We try different approaches

[Figure: top attributes predicted for an example image by each approach]
– k-NN: vase, flowers, bathroom, table, glass, sink, blue, small, white, clear
– Multi-label Ranking: sitting, table, small, many, little, glass, different, flowers, vase, shown
– FCN: vase, flowers, table, glass, sitting, kitchen, water, room, white, filled

Page 44: Video + Language: Where Does Domain Knowledge Fit in?

Performance

• Examples showing the impact of visual attributes on captions (slide columns: Google NIC, Top-5 visual attributes, ATT-FCN)

Each entry lists the two generated captions for an image, followed by its top-5 visual attributes.

– “a white plate topped with a variety of food.” / “a plate with a sandwich and french fries.” / plate, broccoli, fries, food, french
– “a baby with a toothbrush in its mouth.” / “a baby is eating a piece of paper.” / teeth, brushing, toothbrush, holding, baby
– “a traffic light is on a city street.” / “a street with cars and a clock tower.” / street, sign, cars, clock, traffic
– “a yellow and black train on a track.” / “a train traveling down tracks next to a building.” / train, tracks, clock, tower, down
– “a close up of a plate of food on a table.” / “a table topped with a cake with candles on it.” / cake, table, plate, sitting, birthday
– “a teddy bear sitting on top of a chair.” / “a white teddy bear sitting next to a stuffed animal.” / teddy, cat, bear, stuffed, white
– “a person is holding colorful umbrella.” / “a black umbrella sitting on top of a sandy beach.” / umbrella, beach, water, sitting, boat
– “a woman is holding a cell phone in her hand.” / “a woman holding a pair of scissors in her hands.” / woman, bathroom, her, scissors, man

Page 45: Video + Language: Where Does Domain Knowledge Fit in?

Performance on the Testing Dataset

• Publicly available split

Page 46: Video + Language: Where Does Domain Knowledge Fit in?

Performance

• MS-COCO Image Captioning Challenge

Page 47: Video + Language: Where Does Domain Knowledge Fit in?

TGIF: A New Dataset and Benchmark on Animated GIF Description

Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Jiebo Luo

Page 48: Video + Language: Where Does Domain Knowledge Fit in?

Overview

Page 49: Video + Language: Where Does Domain Knowledge Fit in?

Comparison with Existing Datasets

Page 50: Video + Language: Where Does Domain Knowledge Fit in?

Examples

– a skate boarder is doing trick on his skate board.
– a gloved hand opens to reveal a golden ring.
– a sport car is swinging on the race playground
– the vehicle is moving fast into the tunnel

Page 51: Video + Language: Where Does Domain Knowledge Fit in?

Contributions

• A large scale animated GIF description dataset for promoting image sequence modeling and research

• Performing automatic validation to collect natural language descriptions from crowd workers

• Establishing baseline image captioning methods for future benchmarking

• Comparison with existing datasets, highlighting the benefits with animated GIFs

Page 52: Video + Language: Where Does Domain Knowledge Fit in?

In Comparison with Existing Datasets

• The language in our dataset is closer to common language

• Our dataset has an emphasis on the verbs

• Animated GIFs are more coherent and self-contained

• Our dataset can be used to solve the more difficult movie description problem

Page 53: Video + Language: Where Does Domain Knowledge Fit in?

Machine Generated Sentence Examples

Page 54: Video + Language: Where Does Domain Knowledge Fit in?

Machine Generated Sentence Examples

Page 55: Video + Language: Where Does Domain Knowledge Fit in?

Machine Generated Sentence Examples

Page 56: Video + Language: Where Does Domain Knowledge Fit in?

Comparing Professionals and Crowd-workers

Crowd worker: two people are kissing on a boat.
Professional: someone glances at a kissing couple then steps to a railing overlooking the ocean an older man and woman stand beside him.

Crowd worker: two men got into their car and not able to go anywhere because the wheels were locked.
Professional: someone slides over the camaros hood then gets in with his partner he starts the engine the revving vintage car starts to backup then lurches to a halt.

Crowd worker: a man in a shirt and tie sits beside a person who is covered in a sheet.
Professional: he makes eye contact with the woman for only a second.

More: http://beta-web2.cloudapp.net/lsmdc_sentence_comparison.html

Page 57: Video + Language: Where Does Domain Knowledge Fit in?

Movie Descriptions versus TGIF

• Crowd workers are encouraged to describe the major visual content directly, and not to use overly descriptive language

• Because our animated GIFs are presented to crowd workers without any context, the sentences in our dataset are more self-contained

• Animated GIFs are perfectly segmented since they are carefully curated by online users to create a coherent visual story

Page 58: Video + Language: Where Does Domain Knowledge Fit in?

Where is CV (AI) in 2016?

Winter 2002, Fall 2003, Summer 2008

Image/Video captioning: learning words from pictures (看图识字), describing pictures in words (看图说话)

Page 59: Video + Language: Where Does Domain Knowledge Fit in?

Thanks

Q & A

Google***, Baidu, Sogou, Bing, XiaoIce

Are you smarter than a 5th grader?

What does it take to go from a 5-year-old to a 5th grader?
1. Learning from “small data”
2. Unsupervised learning
3. Transfer learning
4. Integration of domain knowledge or experience

Page 60: Video + Language: Where Does Domain Knowledge Fit in?

Visual Intelligence & Social Multimedia Analytics
www.cs.rochester.edu/u/jluo

Questions?