Learning from Descriptive Text, Tamara L Berg, Stony Brook University

Upload: zukun

Post on 20-Jul-2015

Page 1: Fcv hum mach_t_berg

Learning from Descriptive Text

Tamara L Berg

Stony Brook University

Page 2: Fcv hum mach_t_berg

Vision

Humans

Language

Tags: canon, eos, macro, japan, vacation, frog, animal, toad, amphibian, pet, eye, feet, mouth, finger, hand, prince, photo, art, light, photo, flickr, blurry, favorite, nice.

It's the perfect party dress. With distinctly feminine details such as a wide sash bow around an empire waist and a deep scoopneck, this linen dress will keep you comfortable and feeling elegant all evening long.

Page 3: Fcv hum mach_t_berg

Visually Descriptive Text

Visually descriptive language provides:

• information about how people construct natural language for imagery.

• information about the world, especially the visual world.

• guidance for computational visual recognition.

How do people describe the world?

“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” – from Gone with the Wind by Margaret Mitchell

How does the world work?

What should we recognize?

Page 5: Fcv hum mach_t_berg

What’s in a description?

What do people describe?

“A bearded man is holding a child in a sling.”

“A bearded man stands while holding a small child in a green sheet.”

“A bearded man with a baby in a sling poses.”

“Man standing in kitchen with little girl in green sack.”

“Man with beard and baby”

What’s in this image?

man, baby, sling, shirt, glasses, ladder, fridge, table, watermelon, chair, boxes, cups, water bottle, wall, pacifier, beard, …

Page 6: Fcv hum mach_t_berg

What’s in a description?

1) Given an image, predict what people will describe

2) Given a caption, predict what’s in the image

“looking for castles in the clouds out my car window”

clouds ✔, car ✖, window ✖, castle ?

“two women sitting brunette blonde on bench reading magazine”

women ✔, bench ✔, magazine ✔, grass ✖, skirt ✖, …

e.g. Spain & Perona, 2010

Page 7: Fcv hum mach_t_berg

President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Who’s in the picture? T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth

Model                          Accuracy of labeling
Vision model, No Lang model    67%
Vision model + Lang model      78%

Page 9: Fcv hum mach_t_berg

Vision is hard

World knowledge (from descriptive text) can be used to smooth noisy vision predictions!

Green sheep
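The smoothing idea on this slide can be illustrated with a minimal sketch. It assumes a detector that outputs per-color scores for a sheep region and a prior estimated from how often descriptive text pairs each color with "sheep"; all numbers, names, and the log-linear combination rule are hypothetical stand-ins, not the model from the cited work.

```python
import math

def smooth(vision_scores, text_prior, weight=0.5, eps=1e-9):
    """Blend vision scores with a text-derived prior (log-linear), then renormalize.

    weight=0 trusts vision only; weight=1 trusts the text prior only.
    """
    combined = {}
    for label, score in vision_scores.items():
        prior = text_prior.get(label, eps)  # unseen labels get a tiny prior
        log_mix = (1 - weight) * math.log(max(score, eps)) + weight * math.log(max(prior, eps))
        combined[label] = math.exp(log_mix)
    total = sum(combined.values())
    return {label: value / total for label, value in combined.items()}

# Hypothetical noisy detector output: the sheep region looks slightly "green".
vision = {"green": 0.55, "white": 0.45}
# Hypothetical prior from text: people almost never write "green sheep".
prior = {"green": 0.01, "white": 0.99}

smoothed = smooth(vision, prior)
best = max(smoothed, key=smoothed.get)
```

With these made-up numbers the vision-only choice is "green", while the smoothed choice is "white", which is the kind of correction the slide describes.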

Page 10: Fcv hum mach_t_berg

Learning World Knowledge

Attributes

Relationships

green green grass by the lake

a very shiny car in the car museum in my hometown of upstate NY.

Our cat Tusik sleeping on the sofa near a hot radiator.

very little person in a big rocking chair

BabyTalk: Understanding and Generating Simple Image Descriptions. Kulkarni, Premraj, Dhar, Li, Choi, AC Berg, TL Berg, CVPR 2011

Page 11: Fcv hum mach_t_berg

System Flow

1) Input image

2) Extract objects/stuff

a) dog

b) person

c) sofa

3) Predict attributes (three score columns, in transcript order)

attribute    (1)     (2)     (3)
brown        0.32    0.94    0.01
striped      0.09    0.10    0.16
furry        .04     .06     .26
wooden       .2      .8      .2
feathered    .04     .08     .06
...

4) Predict prepositions

pair     near    against    beside
(a,b)    1       .11        .24
(b,a)    1       .04        .17
(a,c)    1       .3         .5
(c,a)    1       .05        .45
(b,c)    1       .67        .0
(c,b)    1       .33        .19
...

5) Predict labeling – vision potentials smoothed with text potentials

[Diagram of the labeling model; node labels are not recoverable]

<<null,person_b>,against,<brown,sofa_c>>

<<null,dog_a>,near,<null,person_b>>

<<null,dog_a>,beside,<brown,sofa_c>>

6) Generate natural language description

This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.
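The last step, turning labeled triples such as <<null,dog_a>,near,<null,person_b>> into sentences, can be sketched with simple templates. This is only an illustration under assumptions: the tuple encoding, the function names `noun_phrase` and `describe`, and the template wording are hypothetical, and the cited system's actual generation procedure may differ.

```python
def noun_phrase(attribute, identifier):
    """Render an (attribute, object) pair; a None attribute (the slide's 'null') is omitted."""
    name = identifier.rsplit("_", 1)[0]  # "sofa_c" -> "sofa"
    return f"{attribute} {name}" if attribute else name

def describe(triples):
    """Fill fixed templates from ((attr, obj), preposition, (attr, obj)) triples."""
    # Collect each object once, in first-seen order.
    phrases = {}
    for subject, _, target in triples:
        for attribute, identifier in (subject, target):
            phrases.setdefault(identifier, noun_phrase(attribute, identifier))

    # Group prepositional relations by their subject.
    relations = {}
    for subject, preposition, target in triples:
        relations.setdefault(subject[1], []).append(f"{preposition} the {phrases[target[1]]}")

    listing = " and ".join(f"one {phrase}" for phrase in phrases.values())
    sentences = [f"This is a photograph of {listing}."]
    for identifier, parts in relations.items():
        sentences.append(f"The {phrases[identifier]} is " + ", and ".join(parts) + ".")
    return " ".join(sentences)

# The three triples shown on the slide, with 'null' encoded as None.
triples = [
    ((None, "person_b"), "against", ("brown", "sofa_c")),
    ((None, "dog_a"), "near", (None, "person_b")),
    ((None, "dog_a"), "beside", ("brown", "sofa_c")),
]
text = describe(triples)
```

For the slide's triples this produces text close to the slide's description; it differs only in omitting the leading "And" of the final sentence.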

Page 12: Fcv hum mach_t_berg

BabyTalk results

This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road.

Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky.

This is a picture of two dogs. The first dog is near the second furry dog.

Objects, Attributes, Prepositions

Page 14: Fcv hum mach_t_berg

What should we recognize?

• Recognition is beginning to work

• Open question – what should we recognize?

• Maybe objects aren’t (always) the right base level entities

Page 15: Fcv hum mach_t_berg

Object Recognition

Parts, Poselets, Attributes

For example: [Fergus, Perona, Zisserman 2003], [Bourdev, Malik 2009], …

Slide Credit: Ali Farhadi

Page 16: Fcv hum mach_t_berg

Learn which terms in descriptions are depictable

Fully beaded with megawatt crystals, this Christian Louboutin suede pump matches the gleam in your eye.

Pump's linear heel plays up the alluring curves of its dipped sides.

Round toe frames low-cut vamp.

Tonally topstitched collar.

4" straight, covered heel shows off signature red sole.

Creamy leather lining with padded insole.

"Fifi" is made in Italy.

attributes

Automatically Discovering Attributes from Noisy Web Data. T.L. Berg, A.C. Berg, J. Shih, ECCV 2010

Page 17: Fcv hum mach_t_berg

Given Web Images + Noisy Text Descriptions:

1) Discover visual attribute terms in text descriptions - likely domain dependent

2) Learn appearance models for attributes without labeled data

3) Characterize attributes by: type, localizability

Page 18: Fcv hum mach_t_berg

Object Recognition

For example: [Oliva, Torralba 2001], [SUN 2010], …

Slide Credit: Ali Farhadi

Scenes

Page 19: Fcv hum mach_t_berg

What are the right quanta of Recognition?

Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011

Page 20: Fcv hum mach_t_berg

Participating in phrases profoundly affects the appearance of objects

Farhadi & Sadeghi, Recognition using Visual Phrases, CVPR 2011

Page 21: Fcv hum mach_t_berg

Maybe descriptive text can inform entity hypotheses!

“the dog is sleeping”

“sleeping dog in delhi”

“A dog is sleeping in”

“a sleeping dog in NTHU”

What should we recognize?

Page 22: Fcv hum mach_t_berg

“cat in a bag”

“cat in the bag”

“cat in bag”

“the cat is in the bag”

What should we recognize?

Maybe descriptive text can inform entity hypotheses!
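One rough way to see how descriptive text could suggest entity hypotheses is to normalize captions and count recurring phrases. The sketch below is an assumption-laden illustration, not a method from the talk: the stopword list, the function names, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter

# Assumed minimal stopword list; a real system would need a fuller one.
STOPWORDS = {"a", "an", "the", "is"}

def normalize(caption):
    """Lowercase a caption and drop stopwords so phrasing variants collapse together."""
    return " ".join(word for word in caption.lower().split() if word not in STOPWORDS)

def candidate_entities(captions, min_count=2):
    """Return (phrase, count) pairs for normalized phrases seen at least min_count times."""
    counts = Counter(normalize(caption) for caption in captions)
    return [(phrase, n) for phrase, n in counts.most_common() if n >= min_count]

# The four captions from the slide.
captions = ["cat in a bag", "cat in the bag", "cat in bag", "the cat is in the bag"]
hypotheses = candidate_entities(captions)
```

All four captions collapse to the single phrase "cat in bag", so it surfaces as one candidate entity rather than separate "cat" and "bag" objects.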

Page 23: Fcv hum mach_t_berg

Conclusion

Use large pools of descriptive text to:

Learn how people describe the visual world

Learn how the world works

Guide future efforts in recognition

Apply this knowledge to multi-modal collections & applications

Page 24: Fcv hum mach_t_berg

Acknowledgements

• Collaborators: Alex Berg, David Forsyth, Jaety Edwards, Jonathan Shih, Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Vicente Ordonez, Siming Li, Yejin Choi, Kota Yamaguchi

• Funded by NSF Faculty Early Career Development (CAREER) Program: Award #1054133