
Computer Science and Artificial Intelligence Laboratory
Technical Report

Massachusetts Institute of Technology, Cambridge, MA 02139 USA — www.csail.mit.edu

MIT-CSAIL-TR-2007-024
April 23, 2007

Tiny images
Antonio Torralba, Rob Fergus, and William T. Freeman


Tiny images

Antonio Torralba, Rob Fergus, William T. Freeman
CSAIL, Massachusetts Institute of Technology,
32 Vassar St., Cambridge, MA 02139, USA
{torralba,fergus,billf}@csail.mit.edu

Abstract

The human visual system is remarkably tolerant to degradations in image resolution: in a scene recognition task, human performance is similar whether 32 × 32 color images or multi-megapixel images are used. With small images, even object recognition and segmentation are performed robustly by the visual system, despite the objects being unrecognizable in isolation. Motivated by these observations, we explore the space of 32 × 32 images using a database of 10^8 32 × 32 color images gathered from the Internet using image search engines. Each image is loosely labeled with one of the 70,399 non-abstract nouns in English, as listed in the Wordnet lexical database. Hence the image database represents a dense sampling of all object categories and scenes. With this dataset, we use nearest neighbor methods to perform object recognition across the 10^8 images.

1 Introduction

When we look at the images in Fig. 1, we can recognize the scene and its constituent objects. Interestingly though, these pictures have only 32 × 32 color pixels (the entire image is just a vector of 3072 dimensions with 8 bits per dimension), yet at this resolution, the images seem to already contain most of the relevant information needed to support reliable recognition.

Motivated by our ability to perform object and scene recognition using very small images, in this paper we explore a number of fundamental questions: (i) what is the smallest image dimensionality that suffices? (ii) how many different tiny images are there? (iii) how much data do we need to viably perform recognition with nearest neighbor approaches?

Currently, most successful computer vision approaches to scene and object recognition rely on extracting textural cues, edge fragments, or patches from the image. These methods require high-resolution images, since only they can provide the rich set of features required by the algorithms. Low resolution images, by contrast, provide a nearly independent source of information to that presently extracted from high resolution images by feature detectors and the like. Hence any method successful in the low-resolution domain can augment existing methods suitable for high-resolution images.

Figure 1. 1st & 3rd columns: Eight 32 × 32 resolution color images. Despite their low resolution, it is still possible to recognize most of the objects and scenes. These are samples from a large dataset of 10^8 32 × 32 images we collected from the web which spans all visual object classes. 2nd & 4th columns: Collages showing the 16 nearest neighbors within the dataset to each image in the adjacent column. Note the consistency between the neighbors and the query image, having related objects in similar spatial arrangements. The power of the approach comes from the copious amount of data, rather than sophisticated matching methods.

Another benefit of working with tiny images is that it becomes practical to store and manipulate datasets orders of magnitude bigger than those typically used in computer vision. Correspondingly, we introduce a dataset of 70 million unique 32 × 32 color images gathered from the Internet. Each image is loosely labelled with one of 70,399 English nouns, so the dataset covers all visual object classes. This is in contrast to existing datasets, which provide a sparse selection of object classes.

With overwhelming amounts of data, many problems can be solved without the need for sophisticated algorithms. One example in the textual domain is Google's "Did you mean?" tool, which corrects errors in search queries not through a complex parsing of the query but by memorizing billions of correct query strings and suggesting the closest to the user's query. We explore a visual analogy to this tool using our dataset and nearest-neighbor matching schemes.

Nearest-neighbor methods have previously been used in a variety of computer vision problems, primarily for interest point matching [3, 12, 17]. They have also been used for global image matching, albeit in more restrictive domains such as pose estimation [24].

The paper is divided into three parts. In Section 2 we investigate the performance of human recognition on tiny images, establishing the minimal resolution required for robust scene and object recognition. In Sections 3 and 4 we introduce our dataset of 70 million images and explore the manifold of images within it. In Section 5 we attempt scene and object recognition using a variety of nearest-neighbor methods. We measure performance at a number of semantic levels, obtaining impressive results for certain object classes, despite the labelling noise in the dataset.

2 Human recognition of low-resolution images

In this section we study the minimal image resolution which still retains useful information about the visual world. In order to do this, we perform a series of human experiments on (i) scene recognition and (ii) object recognition.

Studies on face perception [1, 14] have shown that only 16 × 16 pixels are needed for robust face recognition. This remarkable performance is also found in a scene recognition task [20]. However, no studies have explored the minimal image resolution required to perform visual tasks such as generic object recognition, segmentation, and scene recognition with many categories. In computer vision, existing work on low-resolution images relies on motion cues [7].

In this section we provide experimental evidence showing that 32 × 32 color images (see footnote 1) contain enough information for scene recognition, object detection and segmentation (even when the objects occupy just a few pixels in the image). A significant drop in performance is observed when the resolution drops below 32^2 pixels.

Footnote 1: 32 × 32 is very small. For reference, typical thumbnail sizes are: Google Images (130×100), Flickr (180×150), default Windows thumbnails (90×90).

Figure 2. a) Human performance on scene recognition as a function of image resolution (8 to 256 pixels; y-axis: correct recognition rate, 0-100%). The green and black curves show the performance on color and grayscale images respectively. For color 32 × 32 images the performance only drops by 7% relative to full resolution, despite having 1/64th of the pixels. (b) Computer vision algorithms applied to the same data as (a): a baseline algorithm (blue, PCA on raw intensities) and state-of-the-art algorithms (Gabor with SVM, SIFT words with SVM) [16, 21].

Note that this problem is distinct from studies investigating scene recognition using very short presentation times [19, 22, 23]. Here, we are interested in characterizing the amount of information available in the image as a function of the image resolution (there is no constraint on presentation time). We start with a scene recognition task.

2.1 Scene recognition

In cognitive psychology, the gist of the scene [19, 25] refers to a short summary of the scene (the scene category, and a description of a few objects that compose the scene). In computer vision, the term gist is used to refer to a low dimensional representation of the entire image that provides sufficient information for scene recognition and context for object detection. In this section, we show that this low dimensional representation can rely on very low-resolution information and, therefore, can be computed very efficiently.

We evaluate the scene recognition performance of both humans and existing computer vision algorithms [9, 16, 21] as the image resolution is decreased. The test set of 15 scenes was taken from [16]. Fig. 2(a) shows human performance on this task when presented with grayscale and color images (see footnote 2) of varying resolution. For grayscale images, humans need around 64 × 64 pixels. When the images are in color, humans need only 32 × 32 pixels. Below this resolution the performance rapidly decreases. Interestingly, when color and grayscale results are plotted against image dimensionality (number of pixels × color bands), the curves for color and grayscale images overlap (not shown). Therefore, humans need around 3072 dimensions of either color or grayscale data to perform this task.

Footnote 2: 100% recognition rate cannot be achieved on this dataset as there is no perfect separation between the 15 categories.


Figure 3. Human segmentation of tiny images. (a) Segmentations of 32×32 images: humans can correctly recognize and segment objects at very low resolutions, even when the objects in isolation can not be recognized (b, cropped objects). (c) Summary of human performance as a function of image resolution (12×12, 16×16, 24×24, 32×32) and object size: % of objects correctly labelled and average number of reported objects, plotted against the number of pixels per object (4-16, 16-64, 64-256, 256-1024). Humans analyze images quite well at 32 × 32 resolution.

Fig. 2(b) compares human performance (on grayscale data) with state-of-the-art computer vision algorithms as a function of image resolution. The algorithms used for scene recognition are: 1) PCA on raw intensities, 2) an SVM classifier on a vector of Gabor filter outputs [21], and 3) a descriptor built using histograms of quantized SIFT features [16]. We used 100 images from each class for training, as in [16]. Raw intensities perform very poorly. The best algorithms are (magenta) Gabor descriptors [21] with an SVM using a Gaussian kernel and (orange) the SIFT histograms [16] (see footnote 3). There is no significant difference in performance between the two. All the algorithms show similar trends, with performance at 256^2 pixels still below human performance at 32^2.
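As a rough illustration of the weakest baseline above, here is a minimal sketch of a classifier built on PCA-reduced raw intensities. It assumes scikit-learn and images already loaded as flattened grayscale arrays; pairing the PCA features with an SVM is an illustrative choice, since the paper does not describe the classifier used with its raw-intensity baseline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# X_train, X_test: (N, 32*32) arrays of raw grayscale intensities of the
# downsampled scene images; y_train: scene-category labels (15 classes).
def raw_intensity_baseline(X_train, y_train, X_test, n_components=50):
    """PCA on raw pixel intensities followed by an SVM classifier."""
    model = make_pipeline(PCA(n_components=n_components),
                          SVC(kernel="rbf", C=1.0))
    model.fit(X_train, y_train)
    return model.predict(X_test)
```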

2.2 Object recognition

Recently, the PASCAL object recognition challenge evaluated a large number of algorithms in a detection task for several object categories [8]. Fig. 4 shows the performance (ROC) of the best performing algorithms in the competition on the car classification task (is there a car present in the image?). These algorithms require access to relatively high resolution images. We studied the ability of human participants to perform the same detection task but at very low resolutions. Human participants were shown color images from the test set scaled to have 32 pixels on the smallest axis. Fig. 4 shows some examples of tiny PASCAL images. Each participant classified between 200 and 400 images selected randomly. Fig. 4 shows the performance of the four human observers that participated in the experiment. Although around 10% of cars are missed, the performance is still very good, significantly outperforming the computer vision algorithms using full resolution images.

Footnote 3: SIFT descriptors have an image support of 16 × 16 pixels. Therefore, when working at low resolutions it was necessary to upsample the images. The best performance was obtained when the images were upsampled to 256^2.

Figure 4. Car detection task on the PASCAL 2006 test dataset (x-axis: false positive rate; y-axis: true positive rate). The colored dots show the performance of four human subjects classifying tiny versions of the test data. The ROC curves of the best vision algorithms (running on full resolution images) are shown for comparison. All lie below the performance of humans on the tiny images, which rely on none of the high-resolution cues exploited by the computer vision algorithms.

2.3 Object segmentation

A more challenging task is that of object segmentation. Here, participants are shown color images at different resolutions (12^2, 16^2, 24^2, and 32^2) and their task is to segment and categorize as many objects as they can. Fig. 3(a) shows some of the manually segmented images at 32^2. It is important to note that taking objects out of their context drastically reduces the recognition rate. Fig. 3(b) shows crops of some of the smallest objects correctly recognized. The resolution is so low that recognition of these objects is almost entirely based on context. Fig. 3(c) shows human performance (evaluation is done by a referee who sees the original high resolution image and the label assigned by the participant; the referee does not know at which resolution the image was presented). The horizontal axis corresponds to the number of pixels occupied by the object in the image.


Figure 5. Statistics of the tiny images database. a) A histogram of the number of images per keyword collected (total unique, non-uniform images: 73,454,453; total number of words: 70,399; mean of 1,043 images per word). Around 10% of keywords have very few images. b) Recall-precision performance of the various search engines (Altavista, Ask, Cydral, Flickr, Google, Picsearch, Webshots), evaluated on hand-labeled ground truth. Google and Altavista are the best performing and Cydral and Flickr the worst.

The two plots show the recognition rate (% of objects correctly labelled) and the number of objects reported for each object size. Each curve corresponds to a different image resolution. At 32^2, participants report around 8 objects per image and the correct recognition rate is around 80%. Clearly, sufficient information remains for reliable segmentation.

Of course, not all visual tasks can be solved using such low resolution images, and the experiments in this section have focused only on recognition tasks. However, we argue that 32 × 32 color images are the minimum viable size at which to study the manifold of natural images: any further lowering of resolution results in a rapid performance drop.

3 A large dataset of 32x32 images

Current experiments in object recognition typically use 10^2-10^4 images spread over a few different classes; the largest available dataset is one with 256 classes from the Caltech vision group [13]. Other fields, such as speech, routinely use 10^6 data points for training, since they have found that large training sets are vital for achieving low error rates in testing. As the visual world is far more complex than the aural one, it would seem natural to use very large sets of training images. Motivated by this and by the ability of humans to recognize objects and scenes in 32 × 32 images, we have collected a database of 10^8 such images, made possible by the minimal storage requirements for each image.

3.1 Collection procedure

We use Wordnet (see footnote 4) to provide a comprehensive list of all classes (see footnote 5) likely to have any kind of visual consistency.

Footnote 4: Wordnet [26] is a lexical dictionary, meaning that it gives the semantic relations between words in addition to the information usually given in a dictionary.

Footnote 5: The tiny database is not just about objects. It is about everything that can be indexed with Wordnet, and this includes scene-level classes such as streets, beaches and mountains, as well as category-level classes and more specific objects such as US presidents, astronomical objects and Abyssinian cats.

We do this by extracting all non-abstract nouns from the database, 75,378 of them in total. Note that in contrast to existing object recognition datasets, which use a sparse selection of classes, by collecting images for all nouns we have a dense coverage of all visual forms. Fig. 5(a) shows a histogram of the number of images per class.

We selected 7 independent image search engines: Altavista, Ask, Flickr, Cydral, Google, Picsearch and Webshots (others have outputs correlated with these). We automatically downloaded all the images provided by each engine for all 75,378 nouns. Running over 6 months, this method gathered 95,707,423 images in total. Once intra-word duplicates and uniform images (images with zero variance) are removed, this number is reduced to 73,454,453 images from 70,399 words (around 10% of the keywords had no images). Storing this number of images at full resolution is impractical on the standard hardware used in our experiments, so we down-sampled the images to 32 × 32 as they were gathered (see footnote 6). The dataset fits onto a single hard disk, occupying 600 GB in total.
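As an illustration of the kind of preprocessing involved (a sketch, not the authors' actual pipeline), the following downsamples a fetched image to 32 × 32 and packs it into a fixed-size record, assuming Pillow and NumPy; the storage estimate in the comment follows directly from the image count above.

```python
import numpy as np
from PIL import Image

def to_tiny(path):
    """Downsample an image to a 32x32x3 uint8 array (3072 bytes per image).
    73,454,453 images * 3072 bytes is roughly 226 GB of raw pixels; the 600 GB
    quoted in the paper presumably also covers the aspect-ratio-preserving
    copies and URLs mentioned in footnote 6."""
    img = Image.open(path).convert("RGB").resize((32, 32), Image.LANCZOS)
    return np.asarray(img, dtype=np.uint8)   # shape (32, 32, 3)

def append_record(store, path):
    """Append one tiny image to an open binary file of fixed 3072-byte records."""
    store.write(to_tiny(path).tobytes())
```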

3.2 Characterization of labeling noise

The images gathered by the engines are loosely labeled, in that the visual content is often unrelated to the query word. In Fig. 5(b) we quantify this using a hand-labeled portion of the dataset. 78 animal classes were labeled in a binary fashion (belongs to class or not) and a recall-precision curve was plotted for each search engine. The differing performance of the various engines is visible, with Google and Altavista performing the best and Cydral and Flickr the worst. Various methods exist for cleaning up the data by removing images visually unrelated to the query word. Berg and Forsyth [5] have shown a variety of effective methods for doing this with images of animals gathered from the web. Berg et al. [4] showed how text and visual cues could be used to cluster faces of people from cluttered news feeds. Fergus et al. [11, 10] have shown the use of a variety of approaches for improving Internet image search engines. However, due to the extreme size of our dataset, it is not practical to employ these methods. In Section 5, we show that reasonable recognition performance can be achieved despite the high labelling noise.


Footnote 6: We also stored a version maintaining the original aspect ratio (the minimum dimension was set at 32 pixels) and a link to the original thumbnail and high resolution URL.


4 The manifold of natural images

Using the dataset we are able to explore the manifold of natural images (see footnote 7). Despite 32 × 32 being a very low resolution, each image lives in a space of 3072 dimensions. This is still a huge space: if each dimension has 8 bits, there are a total of 10^7400 possible images. However, natural images correspond to only a tiny fraction of this space (most of the images correspond to white noise), and it is natural to investigate the size of that fraction. To measure this fraction we must first define an appropriate distance metric to use in the 3072 dimensional space.
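As a quick check of the 10^7400 figure, counting all possible 8-bit assignments of the 3072 dimensions gives

(2^8)^{3072} = 2^{24576} = 10^{24576 \log_{10} 2} \approx 10^{7400}.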

We introduce three different distance measures between a pair of images i1 and i2; a short code sketch of these metrics follows the definitions below. We assume that all images have already been normalized to zero mean, unit variance.

• Sum of squared differences (SSD) between the normalized images (across all three color channels). Note that D1 = 2(1 − ρ), ρ being the normalized correlation.

D_1 = \sum_{x,y,c} \left( i_1(x,y,c) - i_2(x,y,c) \right)^2

• Warping. We optimize each image by transforming i2 (horizontal mirror; translations and scalings up to 10 pixel shifts) to give the minimum SSD. The transformation parameters θ are optimized by gradient descent.

D_2 = \min_{\theta} \sum_{x,y,c} \left( i_1(x,y,c) - T_\theta[i_2(x,y,c)] \right)^2

• Pixel shifting. We allow for additional distortion in the images by shifting every pixel individually within a 5 by 5 window to give the minimum SSD (w = 2). We assume that i2 has already been warped: i2 = T_θ[i2]. The minimum can be found by exhaustive evaluation of all shifts, only possible due to the low resolution of the images.

D_3 = \min_{|D_x|,|D_y| \le w} \sum_{x,y,c} \left( i_1(x,y,c) - i_2(x+D_x, y+D_y, c) \right)^2

Computing distances to 70,000,000 images is computationally expensive. To improve speed, we index the images using the first 20 principal components of the 70,000,000 images. With only 20 components, all 3 metrics are equivalent and the entire index structure can be held in memory. Using exhaustive search we find the closest 4000 images in 30 seconds per image (see footnote 8).

Footnote 7: Although our dataset is large, it is not necessarily representative of all natural images. Images on the Internet have their own biases, e.g. objects tend to be centered and fairly large in the image.

Footnote 8: Undoubtedly, if efficient data structures such as a kd-tree were used, the matching would be significantly faster. Nister and Stewenius [18] used related methods to index over 1 million images in ~1 sec.
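A minimal sketch of this two-stage search (a 20-D PCA index kept in memory, an exhaustive scan for the 4000 candidates, exact distances computed only on those); the subsampling used to estimate the principal components is an assumption, not the authors' procedure.

```python
import numpy as np

def build_pca_index(images, n_components=20, fit_subset=100_000, seed=0):
    """images: (N, 3072) array of normalized tiny images.
    Returns (mean, basis, index) where index is the (N, 20) in-memory index."""
    rng = np.random.default_rng(seed)
    sample = images[rng.choice(len(images), size=min(fit_subset, len(images)),
                               replace=False)]
    mean = sample.mean(axis=0)
    _, _, Vt = np.linalg.svd(sample - mean, full_matrices=False)
    basis = Vt[:n_components]                 # (20, 3072) principal directions
    index = (images - mean) @ basis.T         # (N, 20) compact representation
    return mean, basis, index

def candidate_neighbors(query, mean, basis, index, k=4000):
    """Exhaustive search in the 20-D index; D1/D2/D3 are then evaluated
    only on the k returned candidates."""
    q = (query - mean) @ basis.T
    d = np.sum((index - q) ** 2, axis=1)
    return np.argpartition(d, k)[:k]
```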

Figure 6. Image matching using distance metrics D1, D2 and D3 (panels: target, neighbor, warping, pixel shifting; in this example D1=1.04, D2=0.54, D3=0.32). For D2 and D3 we show the closest manipulated image to the target.

Figure 7. Exploring the dataset using D1. (a) Cumulative probability for the correlation with the nearest neighbor (max normalized correlation ρ = 1 − D1/2), for dataset sizes from 1,000 up to 70,000,000 images and for white noise. (b) Cross-section of figure (a): probability of finding a neighbor with correlation > 0.8 and > 0.9 as a function of dataset size. (c) Probability that two images are duplicates as a function of pixelwise correlation. (d) Probability that two images belong to the same category as a function of pixelwise correlation (duplicate images are removed). Each curve represents a different human labeler.

The distances D1, D2 and D3 to these 4000 neighbors are then computed, D1 taking negligible time while D2 and D3 take a minute or so. Fig. 6 shows a pair of images being matched using the 3 metrics. Fig. 1 shows examples of query images and sets of neighboring images from our dataset found using D3.

Inspired by studies on the statistics of image patches [6], we use our dataset to explore the density of images using D1. Fig. 7 shows several plots measuring various properties as the size of the dataset is increased. In Fig. 7(a), we show the probability that the nearest neighbor has a normalized correlation exceeding a certain value. Fig. 7(b) shows a vertical section through Fig. 7(a) at 0.8 and 0.9 as the number of images grows logarithmically. Fig. 7(c) shows the probability of the matched image being a duplicate as a function of D1. While we remove duplicates within each word, it is not trivial to remove them between words.
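A sketch of the kind of measurement behind Fig. 7(a) and (b), using the relation ρ = 1 − D1/2; the probe set, subset sizes and brute-force random sampling are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np

def prob_close_neighbor(probes, dataset, sizes, rho_min=0.8, seed=0):
    """For each dataset size, the fraction of probe images whose nearest
    neighbor (under D1) in a random subset has correlation rho > rho_min.
    probes: (P, 3072), dataset: (N, 3072), both normalized as for D1."""
    rng = np.random.default_rng(seed)
    fractions = []
    for n in sizes:
        subset = dataset[rng.choice(len(dataset), size=n, replace=False)]
        hits = 0
        for p in probes:
            d1_min = np.min(np.sum((subset - p) ** 2, axis=1))  # brute force
            if 1.0 - d1_min / 2.0 > rho_min:                    # rho = 1 - D1/2
                hits += 1
        fractions.append(hits / len(probes))
    return fractions
```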


In Fig. 7(d) we explore how the plots shown in Fig. 7(a) & (b) relate to recognition performance. Three human subjects labelled pairs of images as belonging to the same visual class or not. As the normalized correlation exceeds 0.8, the probability of belonging to the same class grows rapidly. From Fig. 7(b) we see that a quarter of the images in the dataset are expected to have a neighbor with correlation > 0.8. Hence a simple nearest-neighbor approach might be effective with our size of dataset.

5 Recognition

We now attempt to recognize objects and scenes in our dataset. While a variety of existing computer vision algorithms could be adapted to work on 32 × 32 images, we prefer to use a simple nearest-neighbor scheme based on one of the distance metrics D1, D2 or D3. Instead of relying on the complexity of the matching scheme, we let the data do the work for us: the hope is that there will always be images close to a given query image with some semantic connection to it.

Given the large number of classes in our dataset (70,399) and their highly specific nature, it is not practical or desirable to try and classify each of the classes separately. Instead we make use of Wordnet [26], which provides a semantic hierarchy (hypernyms) for each noun. Using this hierarchy, we can perform classification at a variety of different semantic levels; thus instead of trying to recognize the noun "yellowfin tuna" we can also perform recognition at the level of "tuna" or "fish" or "animal". Other work making use of Wordnet includes Hoogs and Collins [15], who use it to assist with image segmentation, and Barnard et al. [2], who showed how to simultaneously learn the visual and text tags of images.

An additional factor in our dataset is the labelling noise. To cope with this we use a voting scheme based around the Wordnet semantic hierarchy.

5.1 Classification using Wordnet voting

Wordnet provides semantic relationships between the 70,399 nouns for which we have collected images. We decompose the graph-structured relationships into a tree by taking the most common meaning of each word. This tree is then used to accumulate votes from the set of neighbors found for a given query image. Each neighbor has its own branch within the tree for which it votes. By accumulating these branches, the query image may be classified at a variety of levels within the tree.
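A minimal sketch of this voting scheme; `hypernym_path` is a hypothetical helper returning the root-to-noun branch in the Wordnet tree just described, and the unit vote weights follow the description of Fig. 8.

```python
from collections import Counter

def wordnet_votes(neighbor_labels, hypernym_path):
    """neighbor_labels: the noun attached to each of the K nearest neighbors.
    hypernym_path(noun): hypothetical helper returning the branch from the
    tree root down to that noun, e.g. ['entity', 'object', 'artifact', ...].
    Each neighbor votes with unit weight for every node on its branch."""
    votes = Counter()
    for label in neighbor_labels:
        votes.update(hypernym_path(label))
    return votes

def classify_at_level(votes, nodes_at_level):
    """Pick the most-voted node among the candidates of one semantic level,
    e.g. ['person', 'plant', 'animal'] at the intermediate level."""
    return max(nodes_at_level, key=lambda node: votes[node])
```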

In Fig. 8(a) we show a query image of a vise from our test set. In Fig. 8(b) we show a selection from the K = 80 nearest neighbors using D3 over the 70 million images. In Fig. 8(c) we show the Wordnet branch for "vise".

Figure 9. Average ROC curve area at different semantic levels (1, 2 and 3) as a function of the number of images in the dataset, for D1 (red), D2 (green) and D3 (blue). Words within each of the semantic levels are shown in each subplot (level 1: location, object, substance, organism, ...; level 2: plant, device, drug, person, mountain, vehicle, animal, ...; level 3: herb, bird, arthropod, flower, fish, artist, politician). The red dot shows the expected performance if all images in Google image search were used (~2 billion), extrapolating linearly.

In Fig. 8(d) we show the accumulated votes from the neighbors at different levels in the tree, each image voting with unit weight. For clarity, we only show the parts of the tree with at least three votes (the full Wordnet tree has 45,815 non-leaf nodes). The nodes shown in red illustrate the branch with the most votes, which matches the majority of levels in the query image's branch (Fig. 8(c)). Note that many of the neighbors, despite not being vises, are some kind of device or instrument.

5.2 Results

We used a test set of 323 images, hand-picked so that the visual content was consistent with the text label. Using the voting tree described above, we classified them using K = 80 neighbors at a variety of semantic levels. To simplify the presentation of results, we collapsed the Wordnet tree (which had 19 levels) by hand down to 3 levels corresponding to one very high level ("organism", "object"), an intermediate level ("person", "plant", "animal") and a level typical of existing datasets ("fish", "bird", "herb").

In Fig. 9 we show the average ROC curve area for a classification task per word at each of the three semantic levels for D1, D2 and D3 as the number of images in the dataset is varied. Note that (i) the classification performance increases as the number of images increases; (ii) D3 outperforms the other distance metrics; (iii) the performance drops off as the classes become more specific.
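One way such per-word ROC areas can be computed from the voting output, sketched here with scikit-learn; treating the accumulated vote count for a word as the detection score is an assumption about the evaluation, not a statement of the authors' code.

```python
from sklearn.metrics import roc_auc_score

def word_roc_area(word, test_items):
    """test_items: list of (votes, true_branch) pairs for the hand-labeled
    test set, where votes is the Counter returned by wordnet_votes and
    true_branch is the set of Wordnet nodes on the image's ground-truth path.
    Positive examples are images whose branch contains the word."""
    scores = [votes[word] for votes, _ in test_items]
    labels = [int(word in branch) for _, branch in test_items]
    return roc_auc_score(labels, scores)   # requires both classes to be present
```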

In Fig. 10 we show the ROC curve area for a number of classes at different semantic levels, comparing the D1 and D3 metrics. For the majority of classes, D3 may be seen to outperform D1.


Figure 8. (a) Query image of a vise. (b) First 16 of 80 neighbors found using D3. (c) Ground-truth Wordnet branch for vise: entity, object, artifact, instrumentality, device, holding, vise. (d) Sub-tree formed by accumulating branches from all 80 neighbors (e.g. entity 71, object 56, artifact 36, instrumentality 25, device 14). The red branch shows the nodes with the most votes. Note that this branch substantially agrees with the branch for vise.

Figure 10. Scatter plot of ROC area under D3 against ROC area under D1 for classes from the semantic levels (1, 2 and 3) used in Fig. 9. In most cases D3 beats D1. Note the wide variation in performance of classes at the finer semantic level.

To illustrate the quality of the recognition achieved using the 70,000,000 weakly labeled images, we show in Fig. 11, for categories at three semantic levels, the images that were most confidently assigned to each class. Note that despite the simplicity of the matching procedure presented here, the recognition performance achieves reasonable levels even for fine levels of categorization.

5.3 People

For certain classes the dataset is extremely rich. For example, many images on the web contain pictures of people. Thus for this class we are able to reliably find neighbors with similar locations, scales and poses to the query image, as shown in Fig. 12.

6 Conclusions

Many recognition tasks can be solved with images as small as 32 × 32 pixels. Working with tiny images has multiple benefits: features can be computed efficiently, and collecting and working with larger collections of images becomes practical. We have presented a dataset of 70,000,000 images, organized on a semantic hierarchy, and we have shown that, despite the dataset being weakly labeled, it can be effectively used for recognition tasks.

We have used simple nearest neighbor methods to obtain good recognition performance for a number of classes, such as "person". However, the performance on some other classes is poor (some of the classes in Fig. 10 have ROC curve areas around 65-70%). We believe that this is due to two shortcomings of our dataset: (i) sparsity of images in some classes; (ii) labelling noise. The former may be overcome by collecting more images, perhaps from sources other than the Internet. One approach to overcoming the labelling noise would be to bias the Wordnet voting toward images with high rank (using the performance curves obtained in Fig. 5(b)).

The dense sampling of categories provides an important dataset for developing transfer learning techniques useful for object recognition.


Figure 11. Test images assigned to words at each semantic level (panels: agent, topographic point, plant, device, scientist, vehicle). The images are ordered by voting confidence from left to right (the confidence is shown above each image). The color of the text indicates if the image was correctly assigned (green) or not (red). The text on top of each image corresponds to the string that returned the image as a result of querying online image indexing tools. Each word is one of the 70,399 nouns from Wordnet.

Small images also present challenges for recognition: many objects cannot be recognized in isolation. Therefore, recognition requires algorithms that incorporate contextual models, a direction for future work.

References

[1] T. Bachmann. Identification of spatially quantized tachistoscopic images of faces: How many pixels does it take to carry identity? European Journal of Cognitive Psychology, 3:85–103, 1991.

[2] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. Jordan. Matching words and pictures. JMLR, 3:1107–1135, Feb 2003.

[3] A. Berg, T. Berg, and J. Malik. Shape matching and object recognition using low distortion correspondence. In Proc. CVPR, volume 1, pages 26–33, June 2005.

[4] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y.-W. Teh, E. Learned-Miller, and D. Forsyth. Names and faces in the news. In Proc. CVPR, volume 2, pages 848–854, 2004.

[5] T. Berg and D. Forsyth. Animals on the web. In Proc. CVPR, pages 1463–1470, 2006.

[6] D. M. Chandler and D. J. Field. Estimates of the information content and dimensionality of natural scenes from proximity distributions. JOSA, 2006.

[7] A. Efros, A. Berg, G. Mori, and J. Malik. Recognizing action at a distance. In IEEE International Conference on Computer Vision, pages 726–733, Nice, France, 2003.

[8] M. Everingham, A. Zisserman, C. K. I. Williams, L. Van Gool, et al. The 2005 PASCAL visual object classes challenge. In LNAI 3944, pages 117–176, 2006.

[9] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In CVPR, pages 524–531, 2005.

[10] R. Fergus, L. Fei-Fei, P. Perona, and A. Zisserman. Learning object categories from Google's image search. In Proc. ICCV, volume 2, pages 1816–1823, Oct 2005.

[11] R. Fergus, P. Perona, and A. Zisserman. A visual category filter for Google images. In Proc. ECCV, pages 242–256, May 2004.

[12] K. Grauman and T. Darrell. Pyramid match hashing: Sub-linear time indexing over partial correspondences. In Proc. CVPR, 2007.

[13] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. Technical Report UCB/CSD-04-1366, California Institute of Technology, 2007.

[14] L. D. Harmon and B. Julesz. Masking in visual recognition: Effects of two-dimensional noise. Science, 180:1194–1197, 1973.

[15] A. Hoogs and R. Collins. Object boundary detection in images using a semantic ontology. In AAAI, 2006.

[16] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.

[17] D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150–1157, Sep 1999.

[18] D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In CVPR, pages 2161–2168, 2006.

[19] A. Oliva. Gist of the scene. In Neurobiology of Attention, L. Itti, G. Rees & J. K. Tsotsos (Eds.), pages 251–256, 2005.

[20] A. Oliva and P. Schyns. Diagnostic colors mediate scene recognition. Cognitive Psychology, 41:176–210, 2000.

[21] A. Oliva and A. Torralba. Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision, 42:145–175, 2001.


Figure 12. Some examples of test images belonging to the "person" node of the Wordnet tree, organized according to body size. Each pair shows the query image and the 25 closest neighbors out of 70,000,000 images using D3 with 32 × 32 images. Note how the neighbors match not just the category but also the location and size of the body in the image, which varies considerably in the examples. The bottom left plot shows the ROC curve area for the "person" node in the Wordnet tree as the number of images is varied, for D1, D2 and D3. The bottom right plot shows the evolution of the ROC curve (detection rate vs. false alarm rate) for "person" with D3 as the number of images is varied from 7,000 to 70,000,000.

[22] M. Potter. Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning and Memory, 2:509–522, 1976.

[23] L. Renninger and J. Malik. When is scene recognition just texture recognition? Vision Research, 44:2301–2311, 2004.

[24] G. Shakhnarovich, P. Viola, and T. Darrell. Fast pose estimation with parameter sensitive hashing. In Proc. ICCV, 2003.

[25] J. Wolfe. Visual memory: What do you know about what you saw? Current Biology, 8:R303–R304, 1998.

[26] Wordnet: a lexical database for the English language. Princeton Cognitive Science Laboratory, 2003. Version 2.0.
