text, language, and imagerygrauman/courses/spring2009/slides/em… · video and text transcription,...

52
TEXT , LANGUAGE, AND IMAGERY Yu-Ting Peng

Upload: others

Post on 30-May-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

TEXT, LANGUAGE, AND IMAGERY

Yu-Ting Peng1

Page 2: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

RESOURCE - SCRIPTS

2

Page 3: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

RESOURCE - SUBTITLES

3

Page 4: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

RESOURCE - NEWS

4

Page 5: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

RESOURCE - WIKIPEDIA

5

Page 6: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

Paper Resource Objective

Names and Faces in the News, by T. Berg, A.

Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-

Miller and D. Forsyth, CVPR 2004.

News

photos

Name faces

“Hello! My name is... Buffy” – Automatic

Naming of Characters in TV Video, by M.

Everingham, J. Sivic and A. Zisserman, BMVC 2006.

Movies /

TV series

(video)

Name faces

Movie/Script: Alignment and Parsing of

Video and Text Transcription, by T. Cour, C.

Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008.

Movies /

TV series

(video)

Action retrieval

and movie

structure

recovery

Nouns: Exploiting Prepositions and

Comparative Adjectives for Learning

Visual Classifiers, A. Gupta and L. Davis, ECCV

2008.

Corel

dataset

(images)

learning

classifiers for

nouns and

relationships

(prep. & adj).6

Page 7: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

NAMES AND FACES IN THE NEWS� Aim: Given an input image and an associated caption, automatically detects faces in the image and possible name strings.

� Application: to label faces in news images or to organize news pictures by individuals present.

7

Page 8: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

DATASET� half a million news pictures and captions from Yahoo News over a period of roughly two years.

� Obtained 44,773 face images� more realistic than usual face recognition datasets� it contains faces captured “in the wild” in a variety of configurations with respect to the camera, taking a variety of expressions, and under illumination of widely varying color. 8

Page 9: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

PROCEDURE

Extract names from the caption.

Detect and represent

facesImages

associated with names 9

Page 10: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

EXTRACT NAMES FROM THE CAPTION.Words are classified as verbs by first applying a list of morphological rules to present tense singular forms, and then comparing these to a database of known verbs.identifying two or more capitalized words followed by a present tense verb.This lexicon is matched to each caption. Each face detected in an image is associated with every name extracted from the associated caption 10

Page 11: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

EXAMPLES

11

Page 12: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

PROCEDURE

Extract names from the caption.

Detect and represent

facesImages

associated with names 12

Page 13: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

FACE DETECTION &RECTIFICATION� Face detector (K. Mikolajczyk) - biased to frontal faces � Rectify face to canonical pose.

• Geometric blur applied to grayscale patches• 5 SVM (trained with 150 hand clicked faces)• Determine affine transformation which best maps detected points to canonical positions

� Remove images with low rectification score

1313

Page 14: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

REPRESENT FACES� kernel principal components analysis (kPCA)-to reduce the dimensionality of data

� linear discriminant analysis (LDA) - to project data into a space that is suited for the discrimination task.

14

Page 15: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

PROCEDURE

Extract names from the caption.

Detect and represent

facesImages

associated with names 15

Page 16: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

MODIFIED K-MEANS CLUSTERINGRandomly assign each image to one of its extracted names For each distinct name (cluster), calculate the mean of image vectors assigned to that nameReassign each image to theclosest mean of its extracted namesRepeat until convergence 16

Page 17: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

MERGING CLUSTERS� Aim: different names that actually correspond to a single person.

� merge names that correspond to faces that look the same.

17

Page 18: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

18

Page 19: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

Paper Resource Objective

Names and Faces in the News, by T. Berg, A.

Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-

Miller and D. Forsyth, CVPR 2004.

News

photos

Name faces

“Hello! My name is... Buffy” – Automatic

Naming of Characters in TV Video, by M.

Everingham, J. Sivic and A. Zisserman, BMVC 2006.

Movies /

TV series

(video)

Name faces

Movie/Script: Alignment and Parsing of

Video and Text Transcription, by T. Cour, C.

Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008.

Movies /

TV series

(video)

Action retrieval

and movie

structure

recovery

Nouns: Exploiting Prepositions and

Comparative Adjectives for Learning

Visual Classifiers, A. Gupta and L. Davis, ECCV

2008.

Corel

dataset

(images)

learning

classifiers for

nouns and

relationships

(prep. & adj).19

Page 20: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

THE DIFFICULTY OF FACE RECOGNITION

20• scale, pose

• lighting

• partial occlusion

• expressions *slides from Andrew Zisserman

Page 21: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

“HELLO! MY NAME IS... BUFFY”� Aim - automatically label television or movie footage with the identity of the people present in each frame of the video.

21

Page 22: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

PROCESS

Obtain names

Extract face

tracksAssign

labels to faces 22

Page 23: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

OBTAIN NAMES� to extract an initial prediction of who appears in

the video, and when.

� Subtitles-What is said, and when, but not who

says it

� Script-What is said, and who says it, but not

when

� By automatic alignment of the two sources, it is

possible to extract who says what and when. 23

Page 24: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

ALIGNMENT BY DYNAMIC TIME WARPING

24

Page 25: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

PROCESS

Obtain names

Extract face

tracksAssign

labels to faces 25

Page 26: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

DETECT AND TRACK FACES

� Face detector- by P. Viola and M. Jones. � Frontal face

� KLT tracker-point tracks� Reduces the volume of data to be processed� Allows stronger appearance models to be built for each character. 26*Pictures from Andrew Zisserman

Page 27: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

FACE TRACKS

27*slides from Andrew Zisserman

Page 28: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

REPRESENTING FACE APPEARANCE

28Representing Face(SIFT Descriptor or Simple pixel-wised descriptor)Face normalization (Affine transform)

Locate facial features(Nine facial features eyes/nose/mouth)

Page 29: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

REPRESENTING CLOTHING APPEARANCE

29� Matching the appearance of the face can be extremely challenging; clothing can provide additional cues� Represent Clothing Appearance by detecting a bounding box containing cloth of a person� Similar clothing appearance suggests the same character, butdifferent clothing does not necessarily imply a differentcharacter� Straightforward weighting of the clothing appearance relativeto the face appearance proved effective

Page 30: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

SPEAKER AMBIGUITY

30

Page 31: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

SPEAKER DETECTION

31

Page 32: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

PROCESS

Obtain names

Extract face

tracksAssign

labels to faces 32

Page 33: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

ASSIGN LABELS TO FACES

*Graphs from Andrew Zisserman 33

• Assign names to unlabelled faces by classification based on extracted exemplars

• Classify tracks by nearest exemplar• Estimate probability of class from distance ratios

Page 34: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

Paper Resource Objective

Names and Faces in the News, by T. Berg, A.

Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-

Miller and D. Forsyth, CVPR 2004.

News

photos

Name faces

“Hello! My name is... Buffy” – Automatic

Naming of Characters in TV Video, by M.

Everingham, J. Sivic and A. Zisserman, BMVC 2006.

Movies /

TV series

(video)

Name faces

Movie/Script: Alignment and Parsing of

Video and Text Transcription, by T. Cour, C.

Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008.

Movies /

TV series

(video)

Action retrieval

and movie

structure

recovery

Nouns: Exploiting Prepositions and

Comparative Adjectives for Learning

Visual Classifiers, A. Gupta and L. Davis, ECCV

2008.

Corel

dataset

(images)

learning

classifiers for

nouns and

relationships

(prep. & adj).34

Page 35: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

MOVIE/SCRIPT: ALIGNMENT AND PARSING OFVIDEO AND TEXT TRANSCRIPTION� Aim: Automatically extracting large collections of actions “in the wild”

� Method: recovering scene structure in movies and TV series

� Application: semantic retrieval and indexing,browsing by character or object, re-editing and many more.

35

Page 36: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

GENERATIVE MODEL FOR SCENE STRUCTURE� This uncovered structure can be used to analyze the content of the video for tracking objects across cuts, action retrieval, as well as enriching browsing and editing interfaces.

� To model the scene structure, we propose a unified generative model for joint scene segmentation and shot threading.

36

Page 37: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

VIDEO DECONSTRUCTION

37

Page 38: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

ALIGNMENT

38

� screenplay � Dialogues� speaker identity, � scene transitions � no time-stamps

� closed captions� Dialogues� time-stamps � nothing else.

Page 39: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

PRONOUN RESOLUTION AND VERB FRAMES

39

Page 40: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

ACTION RETRIEVAL

40

� After pronoun resolution and verb frames, then matched to detected and named characters in the video sequence

Page 41: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

Paper Resource Objective

Names and Faces in the News, by T. Berg, A.

Berg, J. Edwards, M. Maire, R. White, Y. Teh, E. Learned-

Miller and D. Forsyth, CVPR 2004.

News

photos

Name faces

“Hello! My name is... Buffy” – Automatic

Naming of Characters in TV Video, by M.

Everingham, J. Sivic and A. Zisserman, BMVC 2006.

Movies /

TV series

(video)

Name faces

Movie/Script: Alignment and Parsing of

Video and Text Transcription, by T. Cour, C.

Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008.

Movies /

TV series

(video)

Action retrieval

and movie

structure

recovery

Nouns: Exploiting Prepositions and

Comparative Adjectives for Learning

Visual Classifiers, A. Gupta and L. Davis, ECCV

2008.

Corel

dataset

(images)

learning

classifiers for

nouns and

relationships

(prep. & adj).41

Page 42: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

NOUNS: EXPLOITING PREPOSITIONS AND

COMPARATIVE ADJECTIVES FOR LEARNING

VISUAL CLASSIFIERS,

42� Aim: to learn classifiers (i.e models) for nouns and relationships (prepositions and comparative adjectives).

above(statue,rocks);ontopof(rocks, water); larger(water,statue)

Page 43: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

LEARNING RELATIONSHIPSThese classifiers are based on differential features extracted from pairs of regions in an image.

43

Page 44: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

FEATURES� Each image region is represented by a set of visual features based on appearance and shape (e.g area, RGB).

� The classifiers for nouns are based on these features.

� The classifiers for relationships are based on differential features extracted from pairs of regions such as the difference in area of two regions. 44

Page 45: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

GENERATIVE MODEL

45Visual featurenouns nouns

Parameters of the appearance modelsA type of relationshipParameters of therelationship model

Visual featureImage features

Page 46: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

46The rjk represent the possiblewords for the relationship between regions ( j, k).

Page 47: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

EM-APPROACH� to simultaneously solve for the correspondence and for learning the parameters of classifiers.

� E-step: evaluate possible assignments using the parameters obtained at previous iterations.

� M-step: Using the probabilistic distribution of assignment computed in the E-step, we estimate the maximum likelihood parameters of the classifiers in the M-step. 47

Page 48: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

INFERENCE� use a Bayesian network to represent our labeling problem and use belief propagation for inference.

� Previous approaches estimate nouns for regions independently of each other. Here they use priors on relationships between pair of nouns to constrain the labeling problem.

48

Page 49: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

LABELING NEW IMAGES

49

near(birds,sea); below(birds,sun); above(sun, sea);larger(sea,sun);brighter(sun,sea);below(waves,sun)below(coyote, sky); below(bush, sky);left(bush, coyote);greener(grass, coyote);below(grass,sky)

Page 50: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

LABELING NEW IMAGES

50

below(building, sky); below(tree,building);below(tree, skyline); behind(buildings,tree);blueish(sky, tree)above(statue,rocks);ontopof(rocks, water); larger(water,statue) below(flowers,horses); ontopof(horses,field); below(flowers,foals)

Page 51: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

CONCLUSION� Lots of data out there with both text and images � Text provides potential labels of images� Scripts give cues about scene structure and actions performed

� Understanding the semantics of language can help in disambiguating labels

51

Page 52: Text, language, and imagerygrauman/courses/spring2009/slides/Em… · Video and Text Transcription, by T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar, ECCV 2008. Movies / TV series

DISCUSSION:

52extract names wrong association

•What resources also contain both text and image?•How can understanding languages help with the ambiguous labels?