Words and Pictures Rahul Raguram

Post on 20-Jan-2016

Page 1:

Words and Pictures

Rahul Raguram

Page 2:

Motivation

Huge datasets where text and images co-occur

~ 3.6 billion photos

Page 4:

Motivation

Huge datasets where text and images co-occur

Photos in the news

Page 5:

Motivation

Huge datasets where text and images co-occur

Subtitles

Page 6:

Motivation

Interacting with large image datasets

Image content

‘Blobworld’ [Carson et al., 99]

Page 7:

Motivation

Interacting with large photo collections

Image content

‘Blobworld’ [Carson et al., 99]

Page 9:

Motivation

Interacting with large photo collections

Image content

Query by sketch [Jacobs et al., 95]

Page 11:

Motivation

Interacting with large photo collections

Large disparity between user needs and what technology provides (Armitage and Enser 1997, Enser 1993, Enser 1995, Markkula and Sormunen 2000)

Queries based on image histograms, texture, overall appearance, etc. are vanishingly rare

Page 12:

Motivation

Interacting with large photo collections

Text queries

Page 13:

Motivation

Text and images may be separately ambiguous; jointly they tend not to be

Image descriptions often leave out what is visually obvious (e.g. the colour of a flower)

…but often include properties that are difficult to infer using vision (e.g. the species of the flower)

Page 14:

Linking words and pictures: Applications

Automated image annotation

Auto illustration

Browsing support

tiger cat mouth teeth

“statue of liberty”

Page 15:

Learning the Semantics of Words and Pictures

Barnard and Forsyth, ICCV 2001

Page 16:

Key idea

Model the joint distribution of words and image features

Joint probability model for text and image features

Random bits: impossible

Keywords “apple, tree”: unlikely

Keywords “sky, water, sun”: reasonable

Slide credit: David Forsyth

Page 17:

Input Representation

Extract keywords

Segment the image into a set of ‘blobs’

Page 18:

EM revisited: Image segmentation

Examples from: http://www.eecs.umich.edu/~silvio/teaching/

Page 19:

EM revisited: Image segmentation

Image

Segment 1, Segment 2, . . ., Segment k

Segment distributions: $N(\mu_1, \Sigma_1)$, $N(\mu_2, \Sigma_2)$, . . ., $N(\mu_k, \Sigma_k)$

Generative model: $p(x) = \sum_l \pi_l \, p(x \mid \theta_l)$, with parameters $\theta_l = (\mu_l, \Sigma_l)$ and mixing weights $\pi_l$

Problem: You don’t know the parameters, the mixing weights, or the segmentation

Page 20:

EM revisited: Image segmentation

Image

If you knew the segmentation, then you could find the parameters easily

Compute maximum likelihood estimates for $\theta_l = (\mu_l, \Sigma_l)$

Fraction of the image in the segment gives the mixing weight $\pi_l$

Page 21:

EM revisited: Image segmentation

Image

If you knew the segmentation, then you could find the parameters easily

If you knew the parameters, you could easily determine the segmentation

Calculate the posteriors $p(\theta_l \mid x)$

Solution: iterate
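The iteration described on these slides can be sketched as EM for a Gaussian mixture over per-pixel feature vectors. This is a minimal illustration, not the code used in the talk: the function name `em_gmm`, the optional `init_mu` argument, and the small ridge added to each covariance are assumptions made here.

```python
import numpy as np

def em_gmm(x, k, init_mu=None, n_iter=50, seed=0):
    """Fit a k-segment Gaussian mixture to feature vectors x of shape (n, d)."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    rng = np.random.default_rng(seed)
    if init_mu is None:
        mu = x[rng.choice(n, size=k, replace=False)].copy()  # random pixels as means
    else:
        mu = np.array(init_mu, dtype=float)
    base = np.atleast_2d(np.cov(x.T)) + 1e-6 * np.eye(d)
    sigma = np.stack([base] * k)          # start every segment at the global covariance
    pi = np.full(k, 1.0 / k)              # mixing weights
    for _ in range(n_iter):
        # E-step: posterior probability of each segment for each pixel.
        log_r = np.empty((n, k))
        for l in range(k):
            diff = x - mu[l]
            maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigma[l]), diff)
            _, logdet = np.linalg.slogdet(sigma[l])
            log_r[:, l] = np.log(pi[l]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)   # stabilise before exponentiating
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood estimates of the parameters.
        nk = r.sum(axis=0) + 1e-12
        pi = nk / n                        # soft "fraction of the image in the segment"
        mu = (r.T @ x) / nk[:, None]
        for l in range(k):
            diff = x - mu[l]
            sigma[l] = (r[:, l, None] * diff).T @ diff / nk[l] + 1e-6 * np.eye(d)
    return pi, mu, sigma, r
```

Taking `r.argmax(axis=1)` gives a hard segmentation; the soft posteriors `r` are what the M-step actually uses.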

Page 22:

EM revisited: Image segmentation

Image from: http://www.ics.uci.edu/~dramanan/teaching/

Page 23:

Input Representation

Segment the image into a set of ‘blobs’

Each region/blob represented by a vector of 40 features (size, position, colour, texture, shape)

Page 24:

Modeling image dataset statistics

Generative, hierarchical model

Extension of Hofmann’s model for text (1998)

Each node emits blobs and words

Higher nodes emit more general words and blobs (e.g. sky)

Middle nodes emit moderately general words and blobs (e.g. sun)

Lower nodes emit more specific words and blobs (e.g. waves)

Page 25:

Modeling image dataset statistics

Generative, hierarchical model

Extension of Hofmann’s model for text (1998)

Following a path from root to leaf generates image and associated text

Nodes along the path: sky, sun, waves

Generated text: sun, sky, waves

Page 26:

Modeling image dataset statistics

Generative, hierarchical model

Extension of Hofmann’s model for text (1998)

Each cluster is associated with a path from the root to a leaf

Cluster of images

Page 27:

Modeling image dataset statistics

Generative, hierarchical model

Extension of Hofmann’s model for text (1998)

Each cluster is associated with a path from the root to a leaf

Tree nodes: sky (top); sun, sea (middle); waves, rocks (leaves)

Adjacent clusters: “sun, sea, sky, waves” and “sun, sea, sky, rocks”

Page 28:

Modeling image dataset statistics

$D = \text{blobs} \cup \text{words}$

$P(D) = \sum_c P(c) \, P(D \mid c)$

Each cluster is associated with a path from a leaf to the root

Conditional independence of the items: $P(D) = \sum_c P(c) \prod_{i \in D} P(i \mid c)$

Nodes along the path from leaf to root: $P(D) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c) \, P(l \mid c)$

Page 29:

Modeling image dataset statistics

$P(D) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c) \, P(l \mid c)$

For blobs: $P(b \mid l, c) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$

For words: tabulate word frequencies

Page 30:

Modeling image dataset statistics

$P(D) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c) \, P(l \mid c)$

Model fitting: EM

Missing data is the path and the nodes that generated each data element

Two hidden variables:

$H_{d,c}$: document d is in cluster c

$V_{d,i,l}$: item i of document d was generated at level l

If path and node were known for each data element, it would be easy to get the maximum likelihood estimate of the parameters

Given a parameter estimate, path and node are easy to figure out
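As a rough illustration of the "given a parameter estimate" half of the iteration, the sketch below evaluates P(D) and the cluster posterior P(c | D) for a toy word-only model. The tree, the probability tables, and the function name `cluster_posterior` are all invented for this example; blobs would use a Gaussian density in place of the word table.

```python
def cluster_posterior(items, p_c, p_l_given_c, p_item_given_node, paths):
    """Return (P(D), [P(c | D) for each cluster c]).

    P(D) = sum_c P(c) * prod_{i in D} sum_l P(i | l, c) * P(l | c),
    where paths[c] lists the nodes from the root to leaf c, so clusters
    that share upper nodes also share those nodes' emission tables.
    """
    joint = []
    for c, path in enumerate(paths):
        prob = p_c[c]
        for item in items:
            # Mix the emissions of every node (level) on this cluster's path.
            prob *= sum(p_l_given_c[c][l] * p_item_given_node[node].get(item, 0.0)
                        for l, node in enumerate(path))
        joint.append(prob)
    evidence = sum(joint)
    return evidence, [j / evidence for j in joint]


# Hypothetical two-level tree: a shared root and two leaves.
emissions = {
    "root": {"sky": 1.0},
    "sea": {"waves": 0.5, "sun": 0.5},
    "hill": {"rocks": 0.8, "sun": 0.2},
}
paths = [["root", "sea"], ["root", "hill"]]
evidence, posterior = cluster_posterior(
    ["sky", "sun"], p_c=[0.5, 0.5],
    p_l_given_c=[[0.5, 0.5], [0.5, 0.5]],
    p_item_given_node=emissions, paths=paths)
```

In a full EM fit these posteriors would play the role of the expected values of the cluster indicators, with a similar computation over levels for each item.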

Page 31:

Results

Clustering

Does text+image clustering have an advantage?

Only text

Page 32:

Results

Clustering

Does text+image clustering have an advantage?

Only blob features

Page 33:

Results

Clustering

Does text+image clustering have an advantage?

Both text and image segments

Page 34:

Results

Clustering

Does text+image clustering have an advantage?

User study:

Generate 64 clusters for 3000 images

Generate 64 random clusters from the same images

Present a random cluster to the user and ask them to rate its coherence (yes/no)

94% accuracy

Page 35:

Results

Image search

Supply a combination of text + image features

Approach: compute, for each candidate image, the probability of emitting the query items

$P(Q \mid d) = \sum_c P(Q \mid c, d) \, P(c \mid d) = \sum_c \left[ \prod_{q \in Q} \sum_l P(q \mid l, c) \, P(l \mid c) \right] P(c \mid d)$

Q – set of query items

d – candidate document
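A small sketch of how such a score could rank candidate images. It assumes the sum over levels has already been folded into a per-cluster table P(q | c), and that each document's cluster posterior P(c | d) is available; the table values, file names, and function names are hypothetical.

```python
def query_score(query, p_item_given_c, p_c_given_d):
    """P(Q | d) = sum_c [prod_{q in Q} P(q | c)] * P(c | d)."""
    score = 0.0
    for c, p_cd in enumerate(p_c_given_d):
        emit = 1.0
        for q in query:
            emit *= p_item_given_c[c].get(q, 0.0)
        score += emit * p_cd
    return score


def rank_documents(query, p_item_given_c, posteriors):
    """Sort document ids by decreasing probability of emitting the query."""
    return sorted(posteriors,
                  key=lambda d: query_score(query, p_item_given_c, posteriors[d]),
                  reverse=True)


# Hypothetical per-cluster emission tables and per-document cluster posteriors.
p_item_given_c = [
    {"sky": 0.5, "waves": 0.25, "sun": 0.25},
    {"sky": 0.5, "rocks": 0.5},
]
posteriors = {"beach.jpg": [0.9, 0.1], "cliff.jpg": [0.2, 0.8]}
ranking = rank_documents(["sky", "waves"], p_item_given_c, posteriors)
```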

Page 36:

Results

Image search

Image credit: David Forsyth

Page 39:

Results

Auto-annotation

Compute: $P(w \mid B) = \sum_c P(w \mid c, B) \, P(c \mid B)$

Page 40:

Results

Auto-annotation

Quantitative performance:

Use 160 Corel CDs, each with 100 images (grouped by theme)

Select 80 of the CDs, split into training (75%) and test (25%)

Remaining 80 CDs are a ‘harder’ test set

Model scoring:

n – number of words for the image

r – number of words predicted correctly

w – number of words predicted incorrectly

N – vocabulary size

All words that exceed a threshold are predicted

Page 41:

Results

Auto-annotation

Quantitative performance:

Use 160 Corel CDs, each with 100 images (grouped by theme)

Select 80 of the CDs, split into training (75%) and test (25%)

Remaining 80 CDs are a ‘harder’ test set

Model scoring:

n – number of words for the image

r – number of words predicted correctly

Model predicts n words

Can do surprisingly well just by using the empirical word frequency!

Page 42:

Results

Auto-annotation

Quantitative performance:

Reported measures: $E_{NS}^{m} - E_{NS}^{e}$ and $E_{PR}^{m} - E_{PR}^{e}$ (m: model, e: empirical)

Score of 0.1 indicates roughly 1 out of every 3 words is correctly predicted (vs. 1 out of 6 for the empirical model)
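The scoring formulas themselves did not survive in this transcript. The sketch below assumes a normalized score of the form r/n − w/(N − n) for the thresholded setting and r/n for the setting where the model predicts exactly n words; both are assumptions chosen to fit the variables n, r, w and N defined on the slides, and the function names are illustrative.

```python
def normalized_score(predicted, actual, vocab_size):
    """Assumed form: E_NS = r/n - w/(N - n)."""
    n = len(actual)
    r = len(predicted & actual)        # words predicted correctly
    w = len(predicted - actual)        # words predicted incorrectly
    return r / n - w / (vocab_size - n)


def precision_recall_score(predicted, actual):
    """Assumed form: E_PR = r/n, when exactly n words are predicted."""
    if len(predicted) != len(actual):
        raise ValueError("E_PR assumes the model predicts exactly n words")
    return len(predicted & actual) / len(actual)


actual = {"sky", "sun", "waves"}
predicted = {"sky", "sun", "tree"}
e_ns = normalized_score(predicted, actual, vocab_size=153)
e_pr = precision_recall_score(predicted, actual)
```

Under this assumed form, predicting every word in the vocabulary scores 0, as does predicting nothing, which is why a thresholded predictor cannot game the measure by over-predicting.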

Page 43:

Names and Faces in the News

Berg et al., CVPR 2004

Page 44:

Motivation

President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Page 47:

Motivation

Organize news photographs for browsing and retrieval

Build a large ‘real-world’ face dataset

Datasets captured in lab conditions do not truly reflect the complexity of the problem

In many traditional face datasets, it’s possible to get excellent performance by using no facial features at all (Shamir, 2008)

Page 48:

Motivation

Top left 100×100 pixels of the first 10 individuals in the color FERET dataset. The IDs of the subjects are listed to the right of the images

Page 49:

Dataset

Download news photos and captions

~500,000 images from Yahoo News, over a period of two years

Run a face detector

44,773 faces

Resized to 86x86 pixels

Extract names from the captions

Identify two or more capitalized words followed by a present tense verb

Associate every face in the image with every detected name

Goal is to label each face detector output with the correct name
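The name-extraction rule can be approximated with a regular expression plus a verb lexicon. This is only a sketch: the tiny `PRESENT_TENSE_VERBS` set stands in for whatever verb detection the original pipeline used, and the pattern and function names are assumptions.

```python
import re

# Hypothetical stand-in for a real present-tense verb detector.
PRESENT_TENSE_VERBS = {"makes", "waves", "shows", "looks", "speaks", "arrives"}

# Two or more capitalized tokens (initials such as "W." allowed), then a lowercase word.
NAME_THEN_WORD = re.compile(r"\b((?:[A-Z][a-z]*\.?\s+)+[A-Z][a-z]+)\s+([a-z]+)")


def extract_names(caption):
    """Return capitalized word runs that are immediately followed by a known verb."""
    return [m.group(1) for m in NAME_THEN_WORD.finditer(caption)
            if m.group(2) in PRESENT_TENSE_VERBS]


def associate(face_ids, names):
    """Pair every detected face with every detected name (the ambiguous starting point)."""
    return [(face, name) for face in face_ids for name in names]


caption = ("President George W. Bush makes a statement in the Rose Garden while "
           "Secretary of Defense Donald Rumsfeld looks on, July 23, 2003.")
names = extract_names(caption)
pairs = associate(["face_0", "face_1"], names)
```

On this caption the rule picks up "Defense Donald Rumsfeld" rather than the full title, which is the kind of name variant that has to be reconciled later.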

Page 51:

Dataset Properties

Diverse

Large variation in lighting and pose

Broad range of expressions

Name frequencies follow a long-tailed distribution

Doctor Nikola shows a fork that was removed from an Israeli woman who swallowed it while trying to catch a bug that flew in to her mouth, in Poriah Hospital northern Israel July 10, 2003. Doctors performed emergency surgery and removed the fork. (Reuters)

President George W. Bush waves as he leaves the White House for a day trip to North Carolina, July 25, 2002. A White House spokesman said that Bush would be compelled to veto Senate legislation creating a new department of homeland security unless changes are made. (Kevin Lamarque/Reuters)

Page 52:

Preprocessing

Rectify faces to canonical position

Train 5 SVMs as feature detectors

Corners of left and right eyes, tip of the nose, corners of the mouth

Use 150 hand-clicked faces to train the SVMs

For a test image, run the SVMs over the entire image

Produces 5 feature maps

Detect maximal outputs in the 5 maps, and estimate the affine transformation to the canonical pose

Image credit: Y. J. Lee

Page 54:

Preprocessing

Rectify faces to canonical position

Train 5 SVMs as feature detectors

Corners of left and right eyes, tip of the nose, corners of the mouth

Use 150 hand-clicked faces to train the SVMs

For a test image, run the SVMs over the entire image

Produces 5 feature maps

Detect maximal outputs in the 5 maps, and estimate the affine transformation to the canonical pose

Reject images with poor rectification scores

This leaves 34,623 images

Throw out images with more than 4 names

27,742 faces
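Estimating the affine transformation from the detected feature locations is a small least-squares problem. The sketch below assumes five detected points and five canonical target positions; the coordinates and the function name `estimate_affine` are made up for illustration, and the rectification score itself is not modelled.

```python
import numpy as np


def estimate_affine(src, dst):
    """Least-squares 2x3 matrix A such that dst is approximately A @ [x, y, 1]."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    design = np.hstack([src, np.ones((len(src), 1))])        # one row [x, y, 1] per point
    solution, _, _, _ = np.linalg.lstsq(design, dst, rcond=None)
    return solution.T                                        # shape (2, 3)


# Hypothetical detected feature locations in a face-detector window.
detected = np.array([[30.0, 41.0], [44.0, 38.0], [58.0, 37.0],
                     [71.0, 35.0], [52.0, 60.0]])
true_affine = np.array([[1.1, 0.2, 3.0], [-0.1, 0.9, -2.0]])
# Hypothetical canonical positions, generated here from a known transform.
canonical = detected @ true_affine[:, :2].T + true_affine[:, 2]
estimated = estimate_affine(detected, canonical)
```

With five point pairs the system is overdetermined (ten equations, six unknowns), so noisy detections are averaged rather than fitted exactly.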

Page 55:

Face representation

86x86 images – 7396 dimensional vectors

However, relatively few 7396 dimensional vectors actually correspond to valid face images

We want to effectively model the subspace of valid face images

Slide credit: S. Lazebnik

Page 56:

Face representation

We want to construct a low-dimensional linear subspace that best explains the variation in the set of face images

Slide credit: S. Lazebnik

Page 57:

Principal Component Analysis (PCA)

Define covariance matrix

Formulation: C. Bishop

Page 58:

Principal Component Analysis (PCA)

Want to maximize the projected variance

Alternate formulation: minimize sum-of-squares error

Maximize $u_1^T S u_1$ subject to $u_1^T u_1 = 1$

Use Lagrange multipliers: $u_1$ must be an eigenvector of $S$, i.e. $S u_1 = \lambda_1 u_1$

Choose maximum eigenvalue to maximize variance

Image, formulation: C. Bishop

Page 59:

Principal Component Analysis (PCA)

The direction that captures the maximum variance of the data is the eigenvector corresponding to the largest eigenvalue of the data covariance matrix

Furthermore, the top k orthogonal directions that capture the most variance of the data are the k eigenvectors corresponding to the k largest eigenvalues

Slide credit: S. Lazebnik
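A minimal NumPy sketch of this eigenvector view of PCA, on synthetic 2-D data rather than face images. The function name `pca` and the toy data are assumptions for illustration.

```python
import numpy as np


def pca(x, k):
    """Return (mean, top-k directions as columns, their eigenvalues)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    centred = x - mean
    cov = centred.T @ centred / len(x)            # data covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]         # keep the k largest
    return mean, eigvecs[:, order], eigvals[order]


# Synthetic data stretched along the (1, 1) diagonal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2)) * [3.0, 0.3]
c = np.sqrt(0.5)
data = latent @ np.array([[c, c], [-c, c]])
mean, directions, variances = pca(data, k=1)
projected = (data - mean) @ directions           # 1-D coordinates of each point
```

For 86x86 faces the covariance matrix is 7396 x 7396, so in practice the decomposition is often computed from the smaller N x N Gram matrix or via an SVD of the centred data instead.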

Page 60:

Limitations of PCA

PCA assumes that the data has a Gaussian distribution (mean µ, covariance matrix Σ)

Slide credit: S. Lazebnik

Page 61:

Limitations of PCA

The direction of maximum variance is not always good for classification

Image credit: C. Bishop

Page 62:

Limitation #1

Shape of the data not modeled well by the linear principal components

Page 63:

The return of the kernel trick

Basic idea: express conventional PCA in terms of dot products

From before:

For convenience, assume that you’ve subtracted off the mean from each vector

Consider a nonlinear function Φ(x) mapping into M-dimensions (M>D)

Assume

Covariance matrix

Formulation: C. Bishop

Page 64:

The return of the kernel trick

Covariance matrix in feature space

Now MxM

Substituting for C

Scalar values

The eigenvectors vi can be written as a linear combination of the Φ(xn)

Formulation: C. Bishop

Page 65:

The return of the kernel trick

Key step: express this in terms of the kernel function

Multiply both sides by $\Phi^T(x_l)$

Projection of a point onto eigenvector i

Formulation: C. Bishop
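The slide equations are missing from this transcript, so the sketch below follows the standard kernel PCA recipe rather than the slides' exact notation: build the kernel matrix, centre it in feature space, take its leading eigenvectors as the coefficient vectors, and project. The RBF kernel, the `gamma` value, and the function names are assumptions.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2) for every pair of rows."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def kernel_pca(x, k, gamma=1.0):
    """Project the training points onto the top-k kernel principal components."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    kmat = rbf_kernel(x, x, gamma)
    ones = np.full((n, n), 1.0 / n)
    # Centre the mapped data without ever computing the mapping explicitly.
    centred = kmat - ones @ kmat - kmat @ ones + ones @ kmat @ ones
    eigvals, eigvecs = np.linalg.eigh(centred)
    order = np.argsort(eigvals)[::-1][:k]
    # Rescale so each feature-space eigenvector has unit length.
    coeffs = eigvecs[:, order] / np.sqrt(eigvals[order])
    return centred @ coeffs        # row n holds sum_m coeffs[m] * k(x_n, x_m)


rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))
embedding = kernel_pca(points, k=2, gamma=0.5)
```

Only kernel evaluations appear, never the M-dimensional features themselves, which is the point of expressing PCA through dot products.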

Page 66:

Kernel PCA

Image credit: C. Bishop

Page 67:

Limitation #2

The direction of maximum variance is not always good for classification

Image credit: C. Bishop

Page 68:

Linear Discriminant Analysis (LDA)

Goal: Perform dimensionality reduction while preserving as much of the class discriminatory information as possible

Try to find directions along which the classes are best separated

Capable of distinguishing image variation due to identity from variation due to other sources such as illumination and expression

Page 69:

Linear Discriminant Analysis (LDA)

Define inter- and intra-class scatter matrices

W – intra-class

B – inter-class

LDA computes a projection that maximizes the ratio $\frac{|w^T B w|}{|w^T W w|}$

by solving the generalized eigenvalue problem $B w = \lambda W w$
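A compact sketch of LDA via the scatter matrices, using the slide's names W (intra-class) and B (inter-class). The generalized eigenproblem is solved here by applying the inverse of W to B, which assumes W is non-singular; the toy data and the function name `lda` are illustrative.

```python
import numpy as np


def lda(x, y, k):
    """Return the top-k discriminant directions as columns."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    d = x.shape[1]
    overall_mean = x.mean(axis=0)
    within = np.zeros((d, d))     # W: intra-class scatter
    between = np.zeros((d, d))    # B: inter-class scatter
    for label in np.unique(y):
        members = x[y == label]
        class_mean = members.mean(axis=0)
        centred = members - class_mean
        within += centred.T @ centred
        offset = (class_mean - overall_mean)[:, None]
        between += len(members) * (offset @ offset.T)
    # B w = lambda W w, rewritten as (W^-1 B) w = lambda w.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(within, between))
    order = np.argsort(eigvals.real)[::-1][:k]
    return eigvecs[:, order].real


# Two classes separated along x but with much larger spread along y.
rng = np.random.default_rng(0)
samples = np.vstack([rng.normal([0.0, 0.0], [0.3, 5.0], (300, 2)),
                     rng.normal([2.0, 0.0], [0.3, 5.0], (300, 2))])
labels = np.repeat([0, 1], 300)
direction = lda(samples, labels, k=1)[:, 0]
```

On this data PCA would pick the high-variance y axis, while the discriminant direction lies almost entirely along x, where the classes actually differ.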

Page 70:

Class labels for LDA

For the unsupervised names and faces dataset, you don’t have true labels

Use proxy for labeled training data

Images from the dataset with only one detected face and one detected name

Observation: Using LDA on top of the space found by kernel PCA improves performance significantly

Page 71:

Clustering faces

Now that we have a representation for faces, the goal is to ‘clean up’ this dataset

Modified k-means clustering

Obama, Bush, Clinton, Saddam

Page 74:

Clustering faces

Now that we have a representation for faces, the goal is to ‘clean up’ this dataset

Modified k-means clustering

Bush, Saddam
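The slides do not spell out the modification. One reading, assumed here, is that each face may only be assigned to clusters for names that appear in its own caption: start from a random allowed name, then alternate between recomputing cluster means and reassigning each face to its nearest allowed mean. The function name and toy data are hypothetical.

```python
import numpy as np


def constrained_kmeans(faces, candidates, n_iter=20, seed=0):
    """Assign each face one of its caption's candidate names."""
    faces = np.asarray(faces, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initial assignment, restricted to each face's own caption names.
    labels = [names[int(rng.integers(len(names)))] for names in candidates]
    for _ in range(n_iter):
        means = {}
        for name in set(labels):
            members = [i for i, lab in enumerate(labels) if lab == name]
            means[name] = faces[members].mean(axis=0)
        new_labels = []
        for face, names in zip(faces, candidates):
            options = [nm for nm in names if nm in means]   # current label is always here
            new_labels.append(min(options, key=lambda nm: np.linalg.norm(face - means[nm])))
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Toy 2-D "face vectors": two people, two faces with ambiguous captions.
faces = [[0.0, 0.0], [0.2, 0.1], [0.1, -0.1],
         [10.0, 10.0], [10.1, 9.9], [9.8, 10.2],
         [0.1, 0.2], [9.9, 10.1]]
candidates = [["Bush"]] * 3 + [["Saddam"]] * 3 + [["Bush", "Saddam"]] * 2
assignment = constrained_kmeans(faces, candidates)
```

Faces whose captions contain a single name act as anchors, pulling each cluster mean toward the right person so the ambiguous faces resolve correctly.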

Page 75:

Pruning clusters

Remove clusters with < 3 faces

This leaves 19,355 images

For every data point, compute a likelihood score

k – number of nearest neighbours being considered

k_i – number of nearest neighbours that are in cluster i

n – total number of points in the dataset

n_i – total number of points in cluster i

Remove points with low likelihood
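The likelihood formula is missing from this transcript. The sketch below assumes one ratio that is consistent with the four quantities defined on the slide, (k_i / k) / (n_i / n): the fraction of a point's nearest neighbours that share its cluster, relative to that cluster's share of the dataset. Treat the formula, the function name, and the toy data as assumptions.

```python
import numpy as np


def likelihood_scores(points, labels, k):
    """Assumed score per point: (k_i / k) / (n_i / n)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    scores = np.empty(n)
    for j in range(n):
        neighbours = np.argsort(dist[j])[:k]
        k_i = np.sum(labels[neighbours] == labels[j])
        n_i = np.sum(labels == labels[j])
        scores[j] = (k_i / k) / (n_i / n)
    return scores


points = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11],
          [10.5, 10.5]]                            # last point is mislabelled
labels = ["A"] * 4 + ["B"] * 4 + ["A"]
scores = likelihood_scores(points, labels, k=3)
keep = scores >= 0.5                               # hypothetical threshold
```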

Page 76:

Pruning clusters

For various thresholds:

Page 77:

Merging clusters

Merge clusters with different names that correspond to a single person

Defense Donald Rumsfeld and Donald Rumsfeld

Or Colin Powell and Secretary of State

Look at distance between the means in discriminant space

If below a threshold, merge
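The merge rule can be sketched with a small union-find over cluster names, linking any two clusters whose means are closer than a threshold. The mean vectors, the threshold value, and the function name are invented for illustration.

```python
import numpy as np


def merge_clusters(means, threshold):
    """Group cluster names whose mean vectors lie within `threshold` of each other."""
    names = list(means)
    parent = {nm: nm for nm in names}

    def find(nm):
        # Follow parent links up to the representative of nm's group.
        while parent[nm] != nm:
            nm = parent[nm]
        return nm

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gap = np.linalg.norm(np.asarray(means[a], float) - np.asarray(means[b], float))
            if gap < threshold:
                parent[find(b)] = find(a)

    groups = {}
    for nm in names:
        groups.setdefault(find(nm), []).append(nm)
    return list(groups.values())


# Hypothetical cluster means in the discriminant space.
cluster_means = {
    "Donald Rumsfeld": [0.0, 0.0],
    "Defense Donald Rumsfeld": [0.1, 0.0],
    "Colin Powell": [5.0, 5.0],
}
merged = merge_clusters(cluster_means, threshold=1.0)
```

Because merging is transitive here, a chain of nearby clusters collapses into one group, so the threshold needs to be conservative.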

Page 78:

Merging clusters

Image credit: David Forsyth

Page 79:

Results