
Words and Pictures

Rahul Raguram

Motivation

Huge datasets where text and images co-occur

~ 3.6 billion photos

Photos in the news

Subtitles

Motivation

Interacting with large photo collections using image content:

'Blobworld' [Carson et al., 99]

Query by sketch [Jacobs et al., 95]

Motivation

Interacting with large photo collections: there is a large disparity between user needs and what technology provides (Armitage and Enser 1997; Enser 1993; Enser 1995; Markkula and Sormunen 2000)

Queries based on image histograms, texture, overall appearance, etc. account for a vanishingly small fraction of user queries

Motivation

Interacting with large photo collections: text queries

Motivation

Text and images may be separately ambiguous; jointly they tend not to be

Image descriptions often leave out what is visually obvious (e.g., the colour of a flower)

…but often include properties that are difficult to infer using vision (e.g., the species of the flower)

Linking words and pictures: Applications

Automated image annotation

Auto-illustration

Browsing support

(figure examples: annotation keywords 'tiger cat mouth teeth'; query "statue of liberty")

Learning the Semantics of Words and Pictures

Barnard and Forsyth, ICCV 2001

Key idea

Model the joint distribution of words and image features

Joint probability model for text and image features (figure examples):

Random bits – impossible

Keywords: 'apple, tree' – unlikely

Keywords: 'sky, water, sun' – reasonable

Slide credit: David Forsyth

Input Representation

Extract keywords

Segment the image into a set of ‘blobs’

EM revisited: Image segmentation

Examples from: http://www.eecs.umich.edu/~silvio/teaching/

EM revisited: Image segmentation

Generative model: the image is a mixture of k segments (Segment 1, Segment 2, …, Segment k); segment l has a Gaussian feature distribution $N(\mu_l, \Sigma_l)$, and pixel features are generated according to

$p(x) = \sum_l \pi_l \, p(x \mid \theta_l), \qquad \theta_l = (\mu_l, \Sigma_l)$

Problem: You don’t know the parameters, the mixing weights, or the segmentation

EM revisited: Image segmentation

If you knew the segmentation, then you could find the parameters easily: compute maximum likelihood estimates for $\theta_l = (\mu_l, \Sigma_l)$; the fraction of the image in segment l gives the mixing weight $\pi_l$

EM revisited: Image segmentation

If you knew the segmentation, then you could find the parameters easily

If you knew the parameters, you could easily determine the segmentation

Solution: iterate; calculate the posteriors $p(l \mid x)$ given the current parameters, then re-estimate the parameters from the posteriors (see the sketch below)
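A minimal sketch of this EM loop for segmentation, assuming per-pixel feature vectors (e.g., colour) and a fixed number of segments k; the feature choice and the small regularization constants are illustrative, not those of Blobworld.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_segment(features, k, n_iter=30, seed=0):
    """EM for a k-component Gaussian mixture over per-pixel features (N, d)."""
    rng = np.random.default_rng(seed)
    N, d = features.shape
    means = features[rng.choice(N, k, replace=False)]            # initial segment means
    covs = np.array([np.cov(features.T) + 1e-6 * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)                                 # mixing weights pi_l
    for _ in range(n_iter):
        # E-step: posteriors p(l | x) for every pixel
        resp = np.stack([w * multivariate_normal.pdf(features, m, S)
                         for w, m, S in zip(weights, means, covs)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True) + 1e-300
        # M-step: re-estimate (mu_l, Sigma_l) and pi_l from the posteriors
        Nk = resp.sum(axis=0)
        weights = Nk / N
        means = (resp.T @ features) / Nk[:, None]
        for l in range(k):
            diff = features - means[l]
            covs[l] = (resp[:, l, None] * diff).T @ diff / Nk[l] + 1e-6 * np.eye(d)
    return resp.argmax(axis=1), means, covs, weights              # hard segmentation + parameters
```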

EM revisited: Image segmentation

Image from: http://www.ics.uci.edu/~dramanan/teaching/

Input Representation

Segment the image into a set of 'blobs'; each region/blob is represented by a vector of 40 features (size, position, colour, texture, shape)

Modeling image dataset statistics

Generative, hierarchical model; an extension of Hofmann's model for text (1998)

Each node emits blobs and words: higher nodes emit more general words and blobs (e.g., 'sky'), middle nodes moderately general ones (e.g., 'sun'), and lower nodes more specific ones (e.g., 'waves')

Following a path from root to leaf generates an image and its associated text (e.g., 'sun sky waves')

Each cluster is associated with a path from the root to a leaf

Adjacent clusters share nodes near the root (e.g., 'sky'; 'sun, sea') and differ lower down ('waves' vs. 'rocks'), generating captions such as 'sun sea sky waves' and 'sun sea sky rocks'

Modeling image dataset statistics

A document D is a set of blobs and words. Each cluster is associated with a path from a leaf to the root, and

$P(D) = \sum_c P(c)\, P(D \mid c) = \sum_c P(c) \prod_{i \in D} P(i \mid c)$  (conditional independence of the items)

$P(i \mid c) = \sum_l P(i \mid l, c)\, P(l \mid c)$, where the sum is over the nodes l along the path from leaf to root

so that $P(D) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c)\, P(l \mid c)$

Modeling image dataset statistics

For blobs, each node emits blob features according to a Gaussian:

$P(b \mid l, c) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$

For words, tabulate word frequencies at each node
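A hedged sketch of evaluating $P(D)$ under this model, assuming the per-cluster parameters have already been fit. The `clusters` data structure (per-path prior, per-node Gaussians and word tables) is a hypothetical layout for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def document_log_likelihood(blobs, words, clusters):
    """log P(D) = log sum_c P(c) * prod_{i in D} sum_l P(i|l,c) P(l|c).

    `clusters` is a hypothetical list, one entry per cluster/path, with keys:
      'prior'       : P(c)
      'level_prior' : list of P(l|c) over the nodes on the path
      'gaussians'   : per-node (mean, cov) pairs for blob features
      'word_probs'  : per-node dict mapping word -> P(word|l,c)
    """
    log_terms = []
    for c in clusters:
        ll = np.log(c['prior'])
        for b in blobs:
            p_b = sum(pl * multivariate_normal.pdf(b, mean=m, cov=S)
                      for pl, (m, S) in zip(c['level_prior'], c['gaussians']))
            ll += np.log(p_b + 1e-300)
        for w in words:
            p_w = sum(pl * wp.get(w, 1e-6)
                      for pl, wp in zip(c['level_prior'], c['word_probs']))
            ll += np.log(p_w)
        log_terms.append(ll)
    return np.logaddexp.reduce(log_terms)   # log-sum-exp over clusters
```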

Modeling image dataset statistics

Model fitting: EM. The missing data are the path and the nodes that generated each data element; two hidden variables:

$H_{d,c}$ – document d is in cluster c

$V_{d,i,l}$ – item i of document d was generated at level l

If the path and node were known for each data element, it would be easy to get maximum likelihood estimates of the parameters; given a parameter estimate, the path and node are easy to figure out

Results

Clustering: does text+image clustering have an advantage?

Only text

Only blob features

Both text and image segments

User study: generate 64 clusters for 3,000 images and 64 random clusters from the same images; present a randomly chosen cluster to the user and ask them to rate its coherence (yes/no); 94% accuracy

Results

Image search: supply a combination of text + image features as the query

Approach: for each candidate image, compute the probability of emitting the query items

$P(Q \mid d) = \sum_c P(Q \mid c, d)\, P(c \mid d) = \sum_c \left[\prod_{q \in Q} \sum_l P(q \mid l, c)\, P(l \mid c)\right] P(c \mid d)$

Q – set of query items; d – candidate document
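A minimal sketch of scoring and ranking candidate images by $P(Q \mid d)$, assuming precomputed per-document cluster posteriors $P(c \mid d)$; `item_prob` is a hypothetical helper returning $P(q \mid l, c)$ (a Gaussian density for a blob query item, a word frequency for a text item).

```python
import numpy as np

def query_log_prob(query_items, cluster_post, level_prior, item_prob):
    """log P(Q|d) = log sum_c [ prod_{q in Q} sum_l P(q|l,c) P(l|c) ] P(c|d)."""
    log_terms = []
    for c, p_c in enumerate(cluster_post):
        ll = np.log(p_c + 1e-300)
        for q in query_items:
            p_q = sum(level_prior[c][l] * item_prob(q, l, c)
                      for l in range(len(level_prior[c])))
            ll += np.log(p_q + 1e-300)
        log_terms.append(ll)
    return np.logaddexp.reduce(log_terms)

def rank_images(query_items, doc_posteriors, level_prior, item_prob):
    """Rank candidate documents; doc_posteriors[d] is that document's P(c|d) vector."""
    scores = [query_log_prob(query_items, post, level_prior, item_prob)
              for post in doc_posteriors]
    return np.argsort(scores)[::-1]     # best-matching documents first
```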

Results

Image search

Image credit: David Forsyth


Results

Auto-annotation: compute

$P(w \mid B) = \sum_c P(w \mid c, B)\, P(c \mid B)$

where B is the set of blobs in the image and w a candidate word
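A small sketch of the annotation step, approximating $P(w \mid c, B)$ by $\sum_l P(w \mid l, c)\, P(l \mid c)$; `word_prob` is a hypothetical helper returning $P(w \mid l, c)$, and the threshold value is illustrative only.

```python
def annotate(cluster_post, level_prior, word_prob, vocabulary, threshold=0.05):
    """Predict every word whose posterior P(w|B) exceeds the threshold.

    cluster_post[c] ~ P(c|B): posterior over clusters given the image's blobs
    (assumed precomputed); level_prior[c][l] = P(l|c).
    """
    predictions = {}
    for w in vocabulary:
        p_w = sum(p_c * sum(level_prior[c][l] * word_prob(w, l, c)
                            for l in range(len(level_prior[c])))
                  for c, p_c in enumerate(cluster_post))
        if p_w > threshold:
            predictions[w] = p_w
    return predictions
```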

Results

Auto-annotation, quantitative performance: use 160 Corel CDs, each with 100 images (grouped by theme); select 80 of the CDs and split them into training (75%) and test (25%) sets; the remaining 80 CDs form a 'harder' test set

Model scoring:
n – number of words for the image
r – number of words predicted correctly
w – number of words predicted incorrectly
N – vocabulary size
All words that exceed a threshold are predicted

Results

Alternative scoring: the model predicts exactly n words for each image (n – number of words for the image, r – number of words predicted correctly)

One can do surprisingly well just by using the empirical word frequency!

Results

Auto-annotation, quantitative performance: scores are reported relative to the empirical word-frequency baseline, i.e. $E_{NS} = E^{m}_{NS} - E^{e}_{NS}$ and $E_{PR} = E^{m}_{PR} - E^{e}_{PR}$ (model minus empirical)

A score of 0.1 indicates roughly 1 out of every 3 words is correctly predicted (vs. 1 out of 6 for the empirical model)

Names and Faces in the News

Berg et al., CVPR 2004

Motivation

President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters


Motivation

Organize news photographs for browsing and retrieval

Build a large 'real-world' face dataset: datasets captured in lab conditions do not truly reflect the complexity of the problem


In many traditional face datasets, it’s possible to get excellent performance by using no facial features at all (Shamir, 2008)

Motivation

Top left 100×100 pixels of the first 10 individuals in the color FERET dataset. The IDs of the subjects are listed to the right of the images

Dataset

Download news photos and captions: ~500,000 images from Yahoo News, over a period of two years

Run a face detector: 44,773 faces, resized to 86×86 pixels

Extract names from the captions: identify two or more capitalized words followed by a present-tense verb (see the sketch below); associate every face in the image with every detected name

The goal is to label each face detector output with the correct name
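A minimal sketch of the caption heuristic ("two or more capitalized words followed by a present-tense verb"); the verb list and regular expression are illustrative stand-ins, not the authors' lexicon.

```python
import re

# Two or more consecutive capitalized words followed by a present-tense verb.
PRESENT_TENSE_VERBS = {"makes", "waves", "says", "smiles", "looks", "arrives", "speaks"}
NAME_PATTERN = re.compile(r"((?:[A-Z][a-z]+\s+)+[A-Z][a-z]+)\s+(\w+)")

def extract_names(caption):
    names = []
    for match in NAME_PATTERN.finditer(caption):
        candidate, next_word = match.group(1), match.group(2)
        if next_word.lower() in PRESENT_TENSE_VERBS:
            names.append(candidate)
    return names

# Note how the heuristic over-captures titles; this is one reason clusters with
# different name strings are merged later in the pipeline.
print(extract_names("Secretary of Defense Donald Rumsfeld looks on during the briefing"))
# ['Defense Donald Rumsfeld']
```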

Dataset Properties

Diverse Large variation in lighting and pose Broad range of expressions


Name frequencies follow a long-tailed distribution

Doctor Nikola shows a fork that was removed from an Israeli woman who swallowed it while trying to catch a bug that flew in to her mouth, in Poriah Hospital northern Israel July 10, 2003. Doctors performed emergency surgery and removed the fork. (Reuters)

President George W. Bush waves as he leaves the White House for a day trip to North Carolina, July 25, 2002. A White House spokesman said that Bush would be compelled to veto Senate legislation creating a new department of homeland security unless changes are made. (Kevin Lamarque/Reuters)

Preprocessing

Rectify faces to a canonical position: train 5 SVMs as feature detectors (corners of the left and right eyes, tip of the nose, corners of the mouth), using 150 hand-clicked faces to train the SVMs

For a test image, run the SVMs over the entire image to produce 5 feature maps; detect the maximal outputs in the 5 maps and estimate the affine transformation to the canonical pose (see the sketch below)

Image credit: Y. J. Lee
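A minimal sketch of the rectification step: fit the 2×3 affine transform that maps the five detected feature locations to canonical positions in the least-squares sense. The canonical coordinates below are placeholders, not the values used in the paper.

```python
import numpy as np

def fit_affine(detected_pts, canonical_pts):
    """detected_pts, canonical_pts: (5, 2) arrays of (x, y) feature locations."""
    n = detected_pts.shape[0]
    A = np.hstack([detected_pts, np.ones((n, 1))])     # (5, 3): [x, y, 1]
    # Solve A @ M.T ~= canonical_pts for the 2x3 affine matrix M (least squares)
    M, _, _, _ = np.linalg.lstsq(A, canonical_pts, rcond=None)
    return M.T                                          # rows: [a, b, tx], [c, d, ty]

detected = np.array([[30., 40.], [56., 41.], [43., 60.], [34., 72.], [52., 73.]])
canonical = np.array([[26., 35.], [60., 35.], [43., 55.], [32., 70.], [54., 70.]])
M = fit_affine(detected, canonical)
warped = np.hstack([detected, np.ones((5, 1))]) @ M.T   # rectified point locations
```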



Preprocessing

Reject images with poor rectification scores; this leaves 34,623 images

Throw out images with more than 4 associated names, leaving 27,742 faces

Face representation

86×86 images are 7,396-dimensional vectors; however, relatively few 7,396-dimensional vectors actually correspond to valid face images

We want to effectively model the subspace of valid face images

Slide credit: S. Lazebnik

Face representation

We want to construct a low-dimensional linear subspace that best explains the variation in the set of face images

Slide credit: S. Lazebnik

Principal Component Analysis (PCA)

Define the data covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N}(x_n - \bar{x})(x_n - \bar{x})^T$

Formulation: C. Bishop

Principal Component Analysis (PCA)

Want to maximize the projected variance $u_1^T S u_1$, subject to $u_1^T u_1 = 1$ (an alternate formulation minimizes the sum-of-squares reconstruction error)

Using a Lagrange multiplier gives $S u_1 = \lambda_1 u_1$, so $u_1$ must be an eigenvector of S; choose the eigenvector with the maximum eigenvalue to maximize the variance

Image, formulation: C. Bishop

Principal Component Analysis (PCA)

The direction that captures the maximum variance of the data is the eigenvector corresponding to the largest eigenvalue of the data covariance matrix

Furthermore, the top k orthogonal directions that capture the most variance of the data are the k eigenvectors corresponding to the k largest eigenvalues

Slide credit: S. Lazebnik
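A minimal PCA sketch on vectorized face images, using the SVD of the centered data rather than forming the 7,396×7,396 covariance matrix explicitly; the right singular vectors are exactly the eigenvectors of S.

```python
import numpy as np

def pca(X, k):
    """X: (N, D) data matrix, one flattened 86x86 face per row."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Right singular vectors of the centered data = eigenvectors of S = (1/N) Xc^T Xc
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return mean, Vt[:k].T                 # (D, k) top-k principal directions

def project(X, mean, components):
    return (X - mean) @ components        # (N, k) low-dimensional codes

# Usage with random stand-in data (real input would be the rectified faces)
X = np.random.rand(100, 86 * 86)
mean, U = pca(X, k=50)
codes = project(X, mean, U)
```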

Limitations of PCA

PCA assumes that the data has a Gaussian distribution (mean µ, covariance matrix Σ)

Slide credit: S. Lazebnik

Limitations of PCA

The direction of maximum variance is not always good for classification

Image credit: C. Bishop

Limitation #1

Shape of the data not modeled well by the linear principal components

The return of the kernel trick

Basic idea: express conventional PCA in terms of dot products

From before: $S u_i = \lambda_i u_i$ (for convenience, assume that you've subtracted off the mean from each vector)

Consider a nonlinear function Φ(x) mapping into M dimensions (M > D); assume $\sum_n \Phi(x_n) = 0$

Covariance matrix in feature space: $C = \frac{1}{N} \sum_n \Phi(x_n)\, \Phi(x_n)^T$

Formulation: C. Bishop

The return of the kernel trick

Covariance matrix in feature space (now M×M): the eigenvalue equation becomes $C v_i = \lambda_i v_i$

Substituting for C: $\frac{1}{N}\sum_n \Phi(x_n)\left(\Phi(x_n)^T v_i\right) = \lambda_i v_i$; the terms in parentheses are scalar values, so the eigenvectors $v_i$ can be written as a linear combination of the $\Phi(x_n)$: $v_i = \sum_n a_{in}\, \Phi(x_n)$

Key step: express this in terms of the kernel function $k(x_n, x_m) = \Phi(x_n)^T \Phi(x_m)$; multiplying both sides by $\Phi(x_l)^T$ leads to the eigenvalue problem $K a_i = \lambda_i N a_i$

The projection of a point x onto eigenvector i is then $y_i(x) = \Phi(x)^T v_i = \sum_n a_{in}\, k(x, x_n)$

Formulation: C. Bishop

Kernel PCA

Image credit: C. Bishop
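A minimal kernel-PCA sketch following the eigenvalue problem $K a_i = \lambda_i N a_i$ above, with the kernel matrix centered in feature space; the RBF kernel and its width are assumptions for illustration (the slides do not specify the kernel used for faces).

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1e-3):
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_pca(X, k, gamma=1e-3):
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    one = np.ones((N, N)) / N
    Kc = K - one @ K - K @ one + one @ K @ one          # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)               # ascending order
    order = np.argsort(eigvals)[::-1][:k]
    # Normalize coefficients so the feature-space eigenvectors have unit length
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))
    return Kc @ alphas                                  # (N, k) kernel-PCA coordinates

codes = kernel_pca(np.random.rand(200, 10), k=5)
```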

Limitation #2

The direction of maximum variance is not always good for classification

Image credit: C. Bishop

Linear Discriminant Analysis (LDA)

Goal: perform dimensionality reduction while preserving as much of the class-discriminatory information as possible

Try to find directions along which the classes are best separated

Capable of distinguishing image variation due to identity from variation due to other sources such as illumination and expression

Linear Discriminant Analysis (LDA)

Define inter- and intra-class scatter matrices: $S_W$ (intra-class, within) and $S_B$ (inter-class, between)

LDA computes a projection V that maximizes the ratio $\dfrac{|V^T S_B V|}{|V^T S_W V|}$ by solving the generalized eigenvalue problem $S_B v = \lambda\, S_W v$
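A minimal LDA sketch that builds $S_W$ and $S_B$ from labelled features (e.g., kernel-PCA codes) and solves the generalized eigenvalue problem; the small regularizer on $S_W$ is an assumption for numerical stability, not part of the original formulation.

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, y, k):
    """X: (N, D) features, y: (N,) integer class labels; returns (D, k) projection."""
    mean = X.mean(axis=0)
    D = X.shape[1]
    S_W = np.zeros((D, D))
    S_B = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)                  # within-class scatter
        diff = (mc - mean)[:, None]
        S_B += Xc.shape[0] * diff @ diff.T              # between-class scatter
    # Generalized eigenproblem S_B v = lambda S_W v (S_W regularized to be PD)
    evals, evecs = eigh(S_B, S_W + 1e-6 * np.eye(D))
    order = np.argsort(evals)[::-1][:k]
    return evecs[:, order]

X = np.random.rand(300, 20)
y = np.random.randint(0, 5, size=300)
V = lda(X, y, k=4)
Z = X @ V   # discriminant-space coordinates
```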

Class labels for LDA

For the unsupervised names-and-faces dataset, you don't have true labels; use a proxy for labeled training data: images from the dataset with only one detected face and one detected name

Observation: Using LDA on top of the space found by kernel PCA improves performance significantly

Clustering faces

Now that we have a representation for faces, the goal is to ‘clean up’ this dataset

Modified k-means clustering (figure: clusters for Obama, Bush, Clinton, Saddam)

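A hedged sketch of a modified k-means over the discriminant-space face features. The modification assumed here (not spelled out in the slide text) is that each face may only be assigned to clusters of names detected in its own caption; the random initialization is purely illustrative (a real implementation might seed means from the single-face, single-name images).

```python
import numpy as np

def constrained_kmeans(X, candidate_names, all_names, n_iter=20, seed=0):
    """X: (N, d) face features; candidate_names[i]: non-empty set of names in face i's caption."""
    rng = np.random.default_rng(seed)
    means = {name: rng.normal(size=X.shape[1]) for name in all_names}
    assign = [None] * len(X)
    for _ in range(n_iter):
        # Assignment step: nearest mean among this face's caption names only
        for i, x in enumerate(X):
            cands = list(candidate_names[i])
            dists = [np.linalg.norm(x - means[name]) for name in cands]
            assign[i] = cands[int(np.argmin(dists))]
        # Update step: recompute each name's mean from the faces assigned to it
        for name in all_names:
            members = X[[a == name for a in assign]]
            if len(members) > 0:
                means[name] = members.mean(axis=0)
    return assign, means
```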

Pruning clusters

Remove clusters with fewer than 3 faces; this leaves 19,355 images

For every data point, compute a likelihood score; remove points with low likelihood

k – number of nearest neighbours being considered
k_i – number of those nearest neighbours that are in cluster i
n – total number of points in the dataset
n_i – total number of points in cluster i
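A hedged sketch of the pruning step using the quantities defined above. The slides do not give the exact likelihood formula; here each point is scored by how over-represented its own cluster is among its k nearest neighbours relative to the cluster's overall share of the data, score = (k_i / k) / (n_i / n), and low-scoring points are dropped.

```python
import numpy as np

def prune(X, labels, k=10, threshold=1.0):
    """X: (N, d) features, labels: (N,) cluster assignments; returns a keep mask."""
    n = X.shape[0]
    keep = np.ones(n, dtype=bool)
    # Pairwise squared distances (fine at this scale; use a KD-tree for larger sets)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    for idx in range(n):
        nn = np.argsort(d2[idx])[1:k + 1]               # k nearest neighbours, excluding self
        k_i = np.sum(labels[nn] == labels[idx])
        n_i = np.sum(labels == labels[idx])
        score = (k_i / k) / (n_i / n)
        if score < threshold:
            keep[idx] = False
    return keep

X = np.random.rand(500, 16)
labels = np.random.randint(0, 8, size=500)
mask = prune(X, labels, k=10, threshold=1.0)
X_clean, labels_clean = X[mask], labels[mask]
```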

Pruning clusters

For various thresholds:

Merging clusters

Merge clusters with different names that correspond to a single person, e.g. 'Defense Donald Rumsfeld' and 'Donald Rumsfeld', or 'Colin Powell' and 'Secretary of State'

Look at the distance between the cluster means in discriminant space; if it is below a threshold, merge (see the sketch below)
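A minimal sketch of the merge step, assuming each name cluster is stored as a matrix of discriminant-space coordinates; the distance threshold and the sample data are illustrative, not the values used in the paper.

```python
import numpy as np

def merge_close_clusters(clusters, threshold=0.5):
    """clusters: dict mapping name -> (N_i, d) array of projected faces."""
    names = list(clusters)
    means = {name: clusters[name].mean(axis=0) for name in names}
    merged = {name: name for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if np.linalg.norm(means[a] - means[b]) < threshold:
                merged[b] = merged[a]     # greedy relabelling; a full version might use union-find
    return merged

clusters = {
    "Donald Rumsfeld": np.random.rand(40, 4),
    "Defense Donald Rumsfeld": np.random.rand(12, 4),
    "Colin Powell": np.random.rand(35, 4) + 2.0,
}
# With this stand-in data the two Rumsfeld clusters typically merge; Powell stays separate.
print(merge_close_clusters(clusters))
```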

Merging clusters

Image credit: David Forsyth

Results
