Words and Pictures
Rahul Raguram
Motivation
Huge datasets where text and images co-occur:
~3.6 billion photos
Photos in the news
Subtitles
Motivation
Interacting with large photo collections using image content:
'Blobworld' [Carson et al., 99]
Query by sketch [Jacobs et al., 95]
Motivation
Interacting with large photo collections: there is a large disparity between user needs and what the technology provides (Armitage and Enser 1997; Enser 1993; Enser 1995; Markkula and Sormunen 2000)
In practice, queries based on image histograms, texture, overall appearance, etc. make up a vanishingly small fraction of what users ask for
Motivation
Interacting with large photo collections: text queries
Motivation
Text and images may be separately ambiguous; jointly they tend not to be
Image descriptions often leave out what is visually obvious (e.g., the colour of a flower)…
…but often include properties that are difficult to infer using vision (e.g., the species of the flower)
Linking words and pictures: Applications
Automated image annotation (e.g., predicting 'tiger cat mouth teeth' for an image)
Auto-illustration (e.g., finding an image for "statue of liberty")
Browsing support
Learning the Semantics of Words and Pictures
Barnard and Forsyth, ICCV 2001
Key idea
Model the joint distribution of words and image features
Joint probability model for text and image features
Random bits: impossible
Keywords 'apple, tree': unlikely
Keywords 'sky, water, sun': reasonable
Slide credit: David Forsyth
Input Representation
Extract keywords
Segment the image into a set of ‘blobs’
EM revisited: Image segmentation
Examples from: http://www.eecs.umich.edu/~silvio/teaching/
EM revisited: Image segmentation
The image is modeled as a mixture of k segments, each a Gaussian $N(\mu_l, \Sigma_l)$:
$$p(x) = \sum_{l=1}^{k} \pi_l \, p(x \mid \theta_l), \qquad \theta_l = (\mu_l, \Sigma_l)$$
Generative model
Problem: You don’t know the parameters, the mixing weights, or the segmentation
EM revisited: Image segmentation
If you knew the segmentation, then you could find the parameters easily:
Compute maximum likelihood estimates for $(\mu_l, \Sigma_l)$
The fraction of the image in each segment gives the mixing weight $\pi_l$
EM revisited: Image segmentation
If you knew the segmentation, then you could find the parameters easily
If you knew the parameters, you could easily determine the segmentation: calculate the posteriors $p(l \mid x)$
Solution: iterate between the two
EM revisited: Image segmentation
Image from: http://www.ics.uci.edu/~dramanan/teaching/
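To make the iteration concrete, here is a minimal EM sketch for Gaussian-mixture segmentation over pixel feature vectors (e.g., RGB); the data layout, initialization, and regularization constants are illustrative assumptions, not the original implementation:

```python
# Minimal EM for Gaussian-mixture image segmentation (illustrative sketch).
import numpy as np
from scipy.stats import multivariate_normal

def em_segment(X, k, n_iters=50, seed=0):
    """X: (N, D) pixel feature vectors. Returns mixing weights, means,
    covariances, and per-pixel posteriors p(l | x)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, size=k, replace=False)]          # random initial means
    sigma = np.stack([np.cov(X.T) + 1e-3 * np.eye(D)] * k)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: posteriors p(l | x), using the current parameters
        resp = np.stack([pi[l] * multivariate_normal.pdf(X, mu[l], sigma[l])
                         for l in range(k)], axis=1)      # (N, k)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: ML estimates, treating the posteriors as a soft segmentation
        Nl = resp.sum(axis=0)
        pi = Nl / N                                       # fraction of image per segment
        mu = (resp.T @ X) / Nl[:, None]
        for l in range(k):
            diff = X - mu[l]
            sigma[l] = (resp[:, l, None] * diff).T @ diff / Nl[l] + 1e-6 * np.eye(D)
    return pi, mu, sigma, resp

# Hard segmentation: labels = resp.argmax(axis=1)
```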
Input Representation
Segment the image into a set of 'blobs'
Each region/blob is represented by a vector of 40 features (size, position, colour, texture, shape)
Modeling image dataset statistics
Generative, hierarchical model: an extension of Hofmann's model for text (1998)
Each node emits blobs and words
Higher nodes emit more general words and blobs (e.g., 'sky')
Middle nodes emit moderately general words and blobs (e.g., 'sun')
Lower nodes emit more specific words and blobs (e.g., 'waves')
Modeling image dataset statistics
Following a path from the root to a leaf generates an image and its associated text (e.g., 'sun sky waves')
Modeling image dataset statistics
Each cluster is associated with a path from the root to a leaf, and corresponds to a cluster of images
Modeling image dataset statistics
Adjacent clusters share nodes near the root: e.g., two paths sharing 'sky' and 'sun, sea' but ending at 'waves' vs. 'rocks' generate the texts 'sun sea sky waves' and 'sun sea sky rocks'
Modeling image dataset statistics
A document $D$ consists of blobs and words. Each cluster $c$ is associated with a path from a leaf to the root, and items are conditionally independent given the cluster:
$$P(D) = \sum_c P(c) P(D \mid c) = \sum_c P(c) \prod_{i \in D} P(i \mid c) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c) P(l \mid c)$$
where $l$ ranges over the nodes along the path from leaf to root.
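As a concrete reading of this factorization, here is a minimal sketch that evaluates the likelihood of one document, assuming Gaussian blob emitters and word-frequency tables at each node on a cluster's path; all names and the data layout are illustrative, not the paper's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def doc_likelihood(blobs, words, clusters):
    """blobs: list of feature vectors; words: list of strings.
    Each cluster is a dict with 'prior' P(c), 'level_weight' P(l|c), and
    per-level Gaussian parameters 'mu'/'sigma' and word tables P(w|l,c)."""
    total = 0.0
    for c in clusters:
        levels = range(len(c["level_weight"]))
        p_items = []
        for b in blobs:   # P(b|c) = sum_l P(b|l,c) P(l|c)
            p_items.append(sum(c["level_weight"][l] *
                               multivariate_normal.pdf(b, c["mu"][l], c["sigma"][l])
                               for l in levels))
        for w in words:   # P(w|c) = sum_l P(w|l,c) P(l|c)
            p_items.append(sum(c["level_weight"][l] * c["word_table"][l].get(w, 1e-9)
                               for l in levels))
        total += c["prior"] * np.prod(p_items)
    return total
```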
Modeling image dataset statistics
For blobs, each node emits a $d$-dimensional feature vector from a Gaussian:
$$P(b \mid l, c) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
For words, tabulate word frequencies at each node.
Modeling image dataset statistics
Model fitting: EM, where the missing data are the path and the nodes that generated each data element
Two hidden variables:
$H_{d,c}$: document $d$ is in cluster $c$
$V_{d,i,l}$: item $i$ of document $d$ was generated at level $l$
If the path and node were known for each data element, it would be easy to get maximum likelihood estimates of the parameters; given a parameter estimate, the path and node are easy to figure out
Results
Clustering: does text+image clustering have an advantage?
Compare clusters obtained from only text, only blob features, and both text and image segments
User study:
Generate 64 clusters for 3000 images
Generate 64 random clusters from the same images
Present a random cluster to the user and ask them to rate its coherence (yes/no)
94% accuracy
Results
Image search: supply a combination of text and image features as a query
Approach: compute, for each candidate image, the probability of emitting the query items:
$$P(Q \mid d) = \sum_c P(Q \mid c, d) P(c \mid d) = \sum_c \left[\prod_{q \in Q} \sum_l P(q \mid l, c) P(l \mid c)\right] P(c \mid d)$$
$Q$: the set of query items; $d$: a candidate document
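Building on the likelihood sketch above, a minimal ranking routine for this query model might look as follows; P(c|d) is assumed precomputed per document, and all names are illustrative:

```python
from scipy.stats import multivariate_normal

def rank_images(query_blobs, query_words, documents, clusters):
    """Rank documents by P(Q|d) = sum_c P(Q|c) P(c|d)."""
    def p_item_given_c(c, levels, blob=None, word=None):
        if blob is not None:
            return sum(c["level_weight"][l] *
                       multivariate_normal.pdf(blob, c["mu"][l], c["sigma"][l])
                       for l in levels)
        return sum(c["level_weight"][l] * c["word_table"][l].get(word, 1e-9)
                   for l in levels)

    scores = []
    for d in documents:
        p_q = 0.0
        for ci, c in enumerate(clusters):
            levels = range(len(c["level_weight"]))
            p_q_c = 1.0
            for b in query_blobs:
                p_q_c *= p_item_given_c(c, levels, blob=b)
            for w in query_words:
                p_q_c *= p_item_given_c(c, levels, word=w)
            p_q += p_q_c * d["cluster_posterior"][ci]   # P(c|d)
        scores.append(p_q)
    return sorted(range(len(documents)), key=lambda i: -scores[i])
```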
Results
Image search: example queries and retrieved images
Image credit: David Forsyth
Results
Auto-annotation: given the image blobs $B$, compute
$$P(w \mid B) = \sum_c P(w \mid c, B) P(c \mid B)$$
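A minimal annotation sketch under the same illustrative assumptions as the earlier code: obtain P(c|B) from the blob likelihoods via Bayes' rule, then mix the clusters' word distributions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def annotate(blobs, clusters, vocabulary, n_words=5):
    """Return the n_words most probable words under P(w|B)."""
    levels = lambda c: range(len(c["level_weight"]))
    # P(c|B) proportional to P(c) * prod_b P(b|c)
    p_c = np.array([c["prior"] * np.prod([
        sum(c["level_weight"][l] * multivariate_normal.pdf(b, c["mu"][l], c["sigma"][l])
            for l in levels(c)) for b in blobs]) for c in clusters])
    p_c /= p_c.sum()
    # P(w|B) = sum_c P(w|c) P(c|B)
    p_w = {w: sum(p_c[ci] * sum(c["level_weight"][l] * c["word_table"][l].get(w, 0.0)
                                for l in levels(c))
                  for ci, c in enumerate(clusters))
           for w in vocabulary}
    return sorted(p_w, key=p_w.get, reverse=True)[:n_words]
```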
Results
Auto-annotation: quantitative performance
Use 160 Corel CDs, each with 100 images (grouped by theme)
Select 80 of the CDs and split them into training (75%) and test (25%) sets; the remaining 80 CDs form a 'harder' test set
Model scoring variables:
n: number of words for the image
r: number of words predicted correctly
w: number of words predicted incorrectly
N: vocabulary size
Prediction regimes: either all words that exceed a threshold are predicted, or the model predicts n words
Note: one can do surprisingly well just by using the empirical word frequency!
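The slides define the scoring variables but not the formula; one standard normalized score built from exactly these quantities (an assumption here, following the related Barnard et al. evaluations) is:

```python
def annotation_score(n, r, w, N):
    """Fraction of the image's n words predicted correctly, minus the
    fraction of the remaining vocabulary predicted incorrectly.
    NOTE: assumed form of the score, reconstructed from the variable list."""
    return r / n - w / (N - n)
```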
Results
Auto-annotation: quantitative performance
A score of 0.1 indicates that roughly 1 out of every 3 words is correctly predicted (vs. 1 out of 6 for the empirical word-frequency model)
Names and Faces in the News
Berg et al., CVPR 2004
Motivation
Example caption: 'President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters'
Motivation
Organize news photographs for browsing and retrieval
Build a large 'real-world' face dataset: datasets captured in lab conditions do not truly reflect the complexity of the problem
In many traditional face datasets, it’s possible to get excellent performance by using no facial features at all (Shamir, 2008)
Motivation
Top left 100×100 pixels of the first 10 individuals in the color FERET dataset. The IDs of the subjects are listed to the right of the images
Dataset
Download news photos and captions: ~500,000 images from Yahoo News, over a period of two years
Run a face detector: 44,773 faces, each resized to 86x86 pixels
Extract names from the captions: identify two or more capitalized words followed by a present-tense verb (see the sketch below)
Associate every face in the image with every detected name
Goal: label each face detector output with the correct name
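A minimal sketch of the name extractor described above; the regular expression and the tiny verb list are stand-ins for the actual detector (which would use a proper lexicon or tagger):

```python
import re

# Hypothetical mini-lexicon; the real system needs a full verb list or tagger.
PRESENT_TENSE_VERBS = {"makes", "waves", "smiles", "speaks", "looks", "arrives"}

def extract_names(caption):
    """Find runs of two or more capitalized words (initials allowed)
    immediately followed by a present-tense verb."""
    names = []
    for m in re.finditer(r"((?:[A-Z](?:[a-z]+|\.)\s+)+[A-Z][a-z]+)\s+([a-z]+)", caption):
        if m.group(2) in PRESENT_TENSE_VERBS:
            names.append(m.group(1))
    return names

# extract_names("President George W. Bush makes a statement ...")
# -> ['President George W. Bush']
```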
Dataset Properties
Diverse: large variation in lighting and pose; broad range of expressions
Name frequencies follow a long-tailed distribution
Example captions:
Doctor Nikola shows a fork that was removed from an Israeli woman who swallowed it while trying to catch a bug that flew in to her mouth, in Poriah Hospital northern Israel July 10, 2003. Doctors performed emergency surgery and removed the fork. (Reuters)
President George W. Bush waves as he leaves the White House for a day trip to North Carolina, July 25, 2002. A White House spokesman said that Bush would be compelled to veto Senate legislation creating a new department of homeland security unless changes are made. (Kevin Lamarque/Reuters)
Preprocessing
Rectify faces to a canonical position
Train 5 SVMs as feature detectors: corners of the left and right eyes, tip of the nose, corners of the mouth
Use 150 hand-clicked faces to train the SVMs
For a test image, run the SVMs over the entire image, producing 5 feature maps
Detect the maximal outputs in the 5 maps, and estimate the affine transformation to the canonical pose (see the sketch below)
Reject images with poor rectification scores; this leaves 34,623 images
Throw out images with more than 4 names: 27,742 faces
Image credit: Y. J. Lee
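A minimal sketch of the rectification step: fit a least-squares affine transform from the 5 detected feature locations to canonical targets. The canonical coordinates below are made-up placeholders, not the paper's values:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine transform mapping src to dst.
    src, dst: (N, 2) point arrays, N >= 3 (here N = 5 facial features)."""
    M = np.hstack([src, np.ones((len(src), 1))])   # [x y 1] rows
    A, *_ = np.linalg.lstsq(M, dst, rcond=None)    # (3, 2) affine parameters
    return A

# Hypothetical canonical positions (eye corners, nose tip, mouth corners)
canonical = np.array([[25, 30], [61, 30], [43, 48], [30, 65], [56, 65]], float)
detected = np.array([[28, 35], [60, 28], [45, 50], [33, 68], [58, 63]], float)
A = fit_affine(detected, canonical)
# The residual of the fit can serve as a rectification score
score = np.linalg.norm(np.hstack([detected, np.ones((5, 1))]) @ A - canonical)
```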
Face representation
86x86 images are 7396-dimensional vectors; however, relatively few 7396-dimensional vectors actually correspond to valid face images
We want to effectively model the subspace of valid face images
Slide credit: S. Lazebnik
Face representation
We want to construct a low-dimensional linear subspace that best explains the variation in the set of face images
Slide credit: S. Lazebnik
Principal Component Analysis (PCA)
Define the data covariance matrix
$$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$
Formulation: C. Bishop
Principal Component Analysis (PCA)
Want to maximize the projected variance (alternate formulation: minimize the sum-of-square projection errors)
Maximize $u_1^T S u_1$ subject to $u_1^T u_1 = 1$
Using a Lagrange multiplier, this gives $S u_1 = \lambda_1 u_1$, so $u_1$ must be an eigenvector of $S$
Choose the eigenvector with the maximum eigenvalue $\lambda_1$ to maximize the variance
Image, formulation: C. Bishop
Principal Component Analysis (PCA) The direction that captures the maximum
covariance of the data is the eigenvector corresponding to the largest eigenvalue of the data covariance matrix
Furthermore, the top k orthogonal directions that capture the most variance of the data are the k eigenvectors corresponding to the k largest eigenvalues
Slide credit: S. Lazebnik
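A minimal eigenfaces-style PCA sketch for the face vectors above; the Gram-matrix trick (eigendecompose an N x N matrix rather than the 7396 x 7396 covariance) is a standard implementation choice, not something the slides specify:

```python
import numpy as np

def pca(X, k):
    """X: (N, D) faces as row vectors. Returns the mean face and the
    top-k principal directions (columns of U)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # For D >> N, use the (N, N) Gram matrix: same nonzero eigenvalues
    G = Xc @ Xc.T / len(X)
    vals, vecs = np.linalg.eigh(G)                 # ascending eigenvalues
    order = np.argsort(vals)[::-1][:k]
    U = Xc.T @ vecs[:, order]                      # map back to pixel space
    U /= np.linalg.norm(U, axis=0)
    return mean, U

# Usage: mean, U = pca(faces, 50); coords = (faces - mean) @ U
```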
Limitations of PCA
PCA assumes that the data has a Gaussian distribution (mean µ, covariance matrix Σ)
Slide credit: S. Lazebnik
Limitations of PCA
The direction of maximum variance is not always good for classification
Image credit: C. Bishop
Limitation #1
Shape of the data not modeled well by the linear principal components
The return of the kernel trick
Basic idea: express conventional PCA in terms of dot products
From before: with zero-mean data, $S = \frac{1}{N} \sum_n x_n x_n^T$ and the principal directions satisfy $S u_i = \lambda_i u_i$
For convenience, assume that you've subtracted off the mean from each vector
Consider a nonlinear function $\Phi(x)$ mapping into $M$ dimensions ($M > D$); assume $\sum_n \Phi(x_n) = 0$
The covariance matrix in feature space is $C = \frac{1}{N} \sum_n \Phi(x_n) \Phi(x_n)^T$
Formulation: C. Bishop
The return of the kernel trick
The feature-space covariance matrix, now $M \times M$, satisfies $C v_i = \lambda_i v_i$
The eigenvectors $v_i$ can be written as a linear combination of the $\Phi(x_n)$: $v_i = \sum_n a_{in} \Phi(x_n)$
Key step: express this in terms of the kernel function $k(x_n, x_m) = \Phi(x_n)^T \Phi(x_m)$
Substituting for $C$ and multiplying both sides by $\Phi(x_l)^T$ turns the feature-space products into scalar kernel values, giving the $N \times N$ eigenvalue problem $K a_i = \lambda_i N a_i$
Projection of a point onto eigenvector $i$: $y_i(x) = \Phi(x)^T v_i = \sum_n a_{in} k(x, x_n)$
Formulation: C. Bishop
Kernel PCA
Image credit: C. Bishop
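A minimal kernel PCA sketch following the derivation above, with an RBF kernel as an illustrative choice; the explicit Gram-matrix centering replaces the slides' zero-mean assumption:

```python
import numpy as np

def kernel_pca(X, k, gamma=0.1):
    """X: (N, D) data. Returns the top-k kernel principal coordinates."""
    N = len(X)
    sq = ((X[:, None] - X[None, :]) ** 2).sum(-1)  # pairwise squared distances
    K = np.exp(-gamma * sq)                        # RBF kernel matrix
    # Center the kernel matrix in feature space
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one
    # Solve K a_i = (lambda_i N) a_i
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:k]
    # Scale a_i so the feature-space eigenvectors v_i have unit norm
    a = vecs[:, order] / np.sqrt(np.maximum(vals[order], 1e-12))
    return Kc @ a                                  # y_i(x_n) = sum_m a_im k(x_n, x_m)
```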
Limitation #2
The direction of maximum variance is not always good for classification
Image credit: C. Bishop
Linear Discriminant Analysis (LDA)
Goal: perform dimensionality reduction while preserving as much of the class-discriminatory information as possible
Try to find directions along which the classes are best separated
Capable of distinguishing image variation due to identity from variation due to other sources such as illumination and expression
Linear Discriminant Analysis (LDA)
Define inter-class (between, $S_B$) and intra-class (within, $S_W$) scatter matrices
LDA computes a projection $w$ that maximizes the ratio
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
by solving the generalized eigenvalue problem $S_B w = \lambda S_W w$
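A minimal LDA sketch matching the generalized eigenvalue formulation above (the small ridge on $S_W$ is a common numerical safeguard, added here as an assumption):

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, y, k):
    """X: (N, D) features; y: (N,) integer class labels.
    Returns the top-k discriminant directions."""
    D = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((D, D))                  # intra-class (within) scatter
    Sb = np.zeros((D, D))                  # inter-class (between) scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean)[:, None]
        Sb += len(Xc) * d @ d.T
    # Generalized eigenvalue problem: Sb w = lambda Sw w
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(D))
    order = np.argsort(vals)[::-1][:k]
    return vecs[:, order]
```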
Class labels for LDA
For the unsupervised names-and-faces dataset, you don't have true labels
Use a proxy for labeled training data: images from the dataset with only one detected face and one detected name
Observation: using LDA on top of the space found by kernel PCA improves performance significantly
Clustering faces
Now that we have a representation for faces, the goal is to 'clean up' this dataset
Modified k-means clustering: each face is assigned to a cluster corresponding to one of the names detected in its caption (figure labels: Obama, Bush, Clinton, Saddam)
Pruning clusters
Remove clusters with < 3 faces; this leaves 19,355 images
For every data point, compute a likelihood score and remove points with low likelihood, where:
k: number of nearest neighbours being considered
k_i: number of nearest neighbours that are in cluster i
n: total number of points in the dataset
n_i: total number of points in cluster i
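The slides list the variables but not the score itself; a natural k-nearest-neighbour density-ratio reading of them (an assumption, not necessarily the paper's exact expression) is:

```python
import numpy as np

def likelihood_score(idx, X, labels, k=10):
    """Score for point idx: fraction of its k nearest neighbours in its own
    cluster, normalized by the cluster's overall share of the data.
    NOTE: assumed form, reconstructed from the variable definitions."""
    x, cluster_id = X[idx], labels[idx]
    dists = np.linalg.norm(X - x, axis=1)
    nn = np.argsort(dists)[1:k + 1]          # skip the point itself
    k_i = np.sum(labels[nn] == cluster_id)   # neighbours in cluster i
    n_i = np.sum(labels == cluster_id)       # points in cluster i
    return (k_i / k) / (n_i / len(X))
```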
Pruning clusters
For various thresholds: [results figure]
Merging clusters
Merge clusters with different names that correspond to a single person, e.g., 'Defense Donald Rumsfeld' and 'Donald Rumsfeld', or 'Colin Powell' and 'Secretary of State'
Look at the distance between the cluster means in discriminant space; if it is below a threshold, merge
Merging clusters
Image credit: David Forsyth
Results