from image sequences to natural language: a first step toward automatic perception and description...
TRANSCRIPT
This article was downloaded by: [The University of Manchester Library] On: 10 October 2014, At: 04:15. Publisher: Taylor & Francis. Informa Ltd Registered in England and Wales, Registered Number: 1072954, Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Applied Artificial Intelligence: An International Journal. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uaai20
FROM IMAGE SEQUENCES TO NATURAL LANGUAGE: A First Step toward Automatic Perception and Description of Motions. J.R.J. SCHIRRA (a), G. BOSCH (a), C.K. SUNG (b) & G. ZIMMERMANN (b). (a) Department of Computer Science, Universität des Saarlandes, D-6600 Saarbrücken 11, Federal Republic of Germany. (b) Fraunhofer-Institut für Informations- und Datenverarbeitung, D-7500 Karlsruhe 1, Federal Republic of Germany. Published online: 24 Oct 2007.
To cite this article: J.R.J. SCHIRRA, G. BOSCH, C.K. SUNG & G. ZIMMERMANN (1987) FROM IMAGE SEQUENCES TO NATURAL LANGUAGE: A First Step toward Automatic Perception and Description of Motions, Applied Artificial Intelligence: An International Journal, 1:4, 287-305, DOI: 10.1080/08839518708927976
To link to this article: http://dx.doi.org/10.1080/08839518708927976
FROM IMAGE SEQUENCES TO NATURAL LANGUAGE: A First Step toward Automatic Perception and Description of Motions
J. R. J. SCHIRRA and G. BOSCH
Department of Computer Science, Universität des Saarlandes, D-6600 Saarbrücken 11, Federal Republic of Germany

C. K. SUNG and G. ZIMMERMANN
Fraunhofer-Institut für Informations- und Datenverarbeitung, D-7500 Karlsruhe 1, Federal Republic of Germany
We present our work concerning the connection of a vision system to a natural language system. That is, automatic processing transforms the original sequence of TV images into natural language descriptions concerning moving objects. It is the first time that this transformation has been achieved entirely by computer. A vision system that has been developed in Karlsruhe is briefly introduced. By analyzing displacement vector fields, trajectories of object candidates are recognized. The natural language system CITYTOUR is presented. The verbalization of spatial relations between static and moving objects can be studied with this system. The present state of the connection is described, and the data resulting from the vision system are partially used and verbalized by CITYTOUR.
INTRODUCTION
One of the most fascinating aspects of human intelligence is the capability of perceiving the surrounding environment and describing it to another person. AI, as a field of cognitive science concerned with machine-simulated intelligent behavior, also tries to model this relationship between perception and language.
The problem of automatic description of a natural scene is usually divided into two steps: (1) constructing an abstract propositional description of the scene, a task carried out by a vision system, and (2) expressing these abstract
We thank Ingrid Wellner, who implemented the data transformation algorithm and integrated the static background in CITYTOUR, Wilfried Enkelmann for his endeavor to establish communication via DFN, and Gudula Retz-Schmidt for her insights, comments, and suggestions on earlier versions of this paper.

The work described in this paper was supported by the German Special Collaboration Project SFB 314 on AI and Knowledge-Based Systems of the German Science Foundation (DFG).
Applied Artificial Intelligence, 1:287-305, 1987. Copyright © 1987 by Hemisphere Publishing Corporation.
descriptions in natural language, a task carried out by a natural language system. Both of these tasks have been dealt with separately. Systems like MORIO (Dreschler and Nagel, 1982) try to solve the first step, while others, like HAM-ANS (Nebel and Marburger, 1982) or NAOS (Neumann and Novak, 1986), deal with the second, using simulated data of the scene description.
Obviously, there are advantages in working on both partial problems simultaneously. For humans, the data from vision systems alone are usually difficult to comprehend. Their interpretation often demands precise knowledge of the underlying analysis process. Language, as our natural medium of communication, is very efficient and well suited to representing the relevant structures within the data in a clear and compact manner.
Conversely, if there is no relationship between language notions and percepts, semantic theory will remain fragmentary. Within this traditional, purely intensional approach, the world surrounding the system is not considered. Schwind (1984) characterized this approach as follows: "A semantic concept shall simulate or represent something, which we do not know directly: conceptual structures, that humans construct to understand and produce sentences." With respect to intensional semantics, Lakoff (1987) remarks: "As should be obvious, such models of concepts make no use of any experiential aspect of human cognition. That is, intensions have nothing in them corresponding to human perceptual abilities, imaging capabilities, motor abilities, etc." Thus, the reference-semantic anchoring of meaning verified by algorithms, which should be a result of the integration of perceptual (or motor) systems and natural language systems, is of great importance for computational linguistics.
Indeed, the modeling of systems integrating the capability of visual perception and natural language communication is still at an early stage. In only a few systems does processing span the entire way between image sequences and natural language descriptions, and these are restricted to very small domains.
LANDSCAN (Bajcsy et al., 1985), for example, is such a system, but it deals only with static scenes. Another system, ALVEN (Tsotsos, 1980), works on a very specialized domain, left ventricular heart motion, but does not generate sentences in natural language. However, most concepts of motion it recognizes in the image sequence correspond to certain natural language notions of change.
Project VITRA (VIsual TRAnslator)* is concerned with the principles governing the relationship between natural language and visual perception. Experimental studies dealing with the connection of image and language understanding systems are made with the goal of developing a system capable of describing an image sequence in the German language. The vision project V1 is engaged in the
"The projects VITRA and V1 are parts of the SFB 314 Research Program of the German ScienceFoundation (DFG) on AI and Knowledge-Based Systems.
task of automatically extracting information about the motion of objects from real scenes.
In the following sections, our mutual research regarding the connection of a vision system to a natural language system is presented. This is the first time the transformation of sequential TV images into natural language descriptions of moving objects has been achieved entirely by computer. In this paper, the natural language half of this connection is emphasized. First, a vision system developed in Karlsruhe is briefly introduced. Then we present the natural language system CITYTOUR, which has been developed in Saarbrücken. The present state of this vision/NL connection is dealt with in the fourth section.
IMAGE SEQUENCE PROCESSING
Image Material
From a building about 35 m high, a stationary TV camera recorded an image sequence of a road crossing on videotape. From this tape, 130 frames (5.2 seconds) were selected, digitized (512 x 412 pixels, 8 bits), and stored on a magnetic disk. The scene shows a tram moving from the left to the right of the screen, together with other vehicles moving in the opposite direction. In the upper left-hand corner, one car has already stopped while three others are slowing down in front of red traffic lights (Fig. 1).
FIGURE 1. First frame of the sequence. Moving objects are marked by a frame parallel to the direction of motion indicated by an arrow inside the frame; the center of gravity of the vectors of the cluster is indicated by a white dot.
Segmentation by Motion
In this paper we deal only with motions within the image sequence. Up to now, objects in three-dimensional (3D) space cannot be perceived; therefore, candidates representing moving objects within the image plane are employed.
Moving object candidates are segmented from the stationary ones by computing and analyzing displacement vector fields (Enkelmann et al., 1985; Kories and Zimmermann, 1986; Sung and Zimmermann, 1986; Zimmermann and Kories, 1984). To reduce the amount of data, vector fields are computed from four consecutive image pairs. Then, as long as they persist, these vectors are individually chained together. Out of about 1350 vectors of the initial four image pairs, approximately 540 exist during the entire 130-image sequence. Because most moving objects leave the field of view during this sequence, most of the persisting vector chains must belong to the stationary background.
The extraction of moving objects is carried out in four steps:
1. To obtain stable starting conditions, only vectors persisting over 24 image pairs are considered.

2. To rule out the stationary background, all vectors with a displacement of less than 15 pixels are discarded.

3. Within a neighborhood of ±20 pixels of each of the remaining vectors, another vector with the same displacement (threshold ±1 pixel) is sought. This results in clusters of similar displacement vectors.

4. As described in steps 2 and 3, broken vector chains within each cluster possibly belonging to the cluster itself are sought. The condition of persistence over 24 image pairs of step 1 is dropped. Condition 2 is retained for the initial part of the broken vector chain. Step 4 allows short occlusions of the objects, e.g., by lampposts or trees.
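Steps 1-3 can be sketched as follows, using the thresholds given in the text (24 image pairs of persistence, a 15-pixel minimum displacement, a ±20-pixel neighborhood, and a ±1-pixel displacement tolerance). This is a minimal illustration, not the original implementation; all data-structure and function names are assumptions.

```python
# Sketch of extraction steps 1-3. Each vector chain is a dict with a
# 'start' point (x, y), a 'disp' displacement (dx, dy), and a 'life'
# counting the image pairs over which the chain persists.

def extract_moving_clusters(chains, min_life=24, min_disp=15.0,
                            neighborhood=20.0, disp_tol=1.0):
    # Step 1: keep only vector chains persisting over min_life image pairs.
    stable = [c for c in chains if c["life"] >= min_life]
    # Step 2: discard the stationary background (small displacements).
    moving = [c for c in stable
              if (c["disp"][0] ** 2 + c["disp"][1] ** 2) ** 0.5 >= min_disp]
    # Step 3: group vectors whose start points lie within the neighborhood
    # and whose displacements agree within disp_tol into clusters.
    clusters, assigned = [], set()
    for i, c in enumerate(moving):
        if i in assigned:
            continue
        cluster = [c]
        assigned.add(i)
        for j, d in enumerate(moving):
            if j in assigned:
                continue
            close = all(abs(c["start"][k] - d["start"][k]) <= neighborhood
                        for k in (0, 1))
            similar = all(abs(c["disp"][k] - d["disp"][k]) <= disp_tol
                          for k in (0, 1))
            if close and similar:
                cluster.append(d)
                assigned.add(j)
        if len(cluster) > 1:  # a lone vector does not form a cluster
            clusters.append(cluster)
    return clusters
```

Step 4 (re-collecting broken vector chains) would then be run per cluster with the persistence condition relaxed, as the text describes.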
Cueing of Object Candidates
Cueing is carried out in two phases. First, the next 24 image pairs, with an overlap of four pairs, are analyzed using the same procedure as above, but now with only vectors contained in the clusters previously found. The condition of step 2 is applied to ensure that only object candidates initially moving are cued. This is repeated for the third 24 image pairs.
In the next phase, lasting until the end of the image sequence, the condition of step 2 is dropped, allowing displacements down to zero. To ensure the stability of the size of the cluster, the condition of step 1 is reintroduced, allowing only uninterrupted vectors.
FIGURE 2. Cueing of the object candidates from Image 1 to Image 125. For seven time points a frame is drawn around the object. Note that the objects in the upper part of the image come to rest.
The cueing results are shown in Fig. 2, where the object candidates are displayed with respect to images 1, 25, 45, 65, 85, 105, and 125. Initially, 10 moving object candidates are found. The tram in the lower part of the image moves out of the field of view. Of the vehicles on the lanes directly above the tram, all leave the field of view except the two cars on the far right. One of them overtakes a cyclist (third frame from the right in Fig. 1); the cyclist is lost by the algorithm during this process of overtaking. Later, a vector with zero displacement is erroneously included in this car's cluster (the long, small frame in the center of Fig. 2), and in the subsequent cueing step, the algorithm loses this car.
Despite occlusions by trees, lampposts, and lamp suspension wires, three stopping cars are cued through the whole scene in the upper part of the image. It must be noted that in using the present procedure, the frames tend to become smaller because each new search begins with the previously found cluster and, in the second cueing phase, broken vector chains are discarded.
Results
Moving object candidates are represented by the center of gravity of the starting points of all vectors within the cluster. For subsequent processing, additional information is given. This consists of the four corner points of the frame around the cluster and the mean displacement vector of the cluster. Together with the identification number of the object candidates for each 20th frame, all data are transmitted through the entire sequence (Fig. 3).
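The per-candidate record described above can be sketched as a small data structure: the center of gravity of the vectors' starting points, the corners of the frame around the cluster, and the mean displacement vector. Field and function names here are illustrative assumptions, not the transmitted format itself.

```python
from dataclasses import dataclass

@dataclass
class ObjectCandidate:
    ident: int
    center: tuple      # center of gravity of the vector starting points
    frame: tuple       # (row_min, col_min, row_max, col_max) of the cluster
    mean_disp: tuple   # mean displacement vector of the cluster

def summarize_cluster(ident, cluster):
    """cluster: list of (start, disp) pairs with start = (row, col)."""
    n = len(cluster)
    rows = [s[0] for s, _ in cluster]
    cols = [s[1] for s, _ in cluster]
    center = (sum(rows) / n, sum(cols) / n)
    frame = (min(rows), min(cols), max(rows), max(cols))
    mean_disp = (sum(d[0] for _, d in cluster) / n,
                 sum(d[1] for _, d in cluster) / n)
    return ObjectCandidate(ident, center, frame, mean_disp)
```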
[Figure 3 shows the processing chain from displacement vector fields to the geometric image sequence description, followed by two blocks of the transmitted trajectory data: for each object candidate, the row/column coordinates of the center of gravity and of the four frame corners, plus the mean displacement vector. The numeric columns are too garbled in this transcript to reproduce.]

FIGURE 3. Diagram of the motion analysis algorithm with part of the resulting data.
THE NATURAL LANGUAGE COMPONENT: CITYTOUR
The German dialogue system CITYTOUR (Andre et al., 1985) answers questions concerning
The spatial relationships between static objects (static relationships),

Direction and path of moving objects, i.e., the geometric properties of motions (dynamic relationships), and

Other visible, especially kinematic, properties of motions (e.g., velocity, acceleration).
The objects are supposed to be arranged in a two-dimensional Euclidean area. The questioner is assumed to be one of these objects. Thus, the conversational partner is part of the scene under discussion. Therefore CITYTOUR's answer can take into account the current position of the observer.
Representation of Objects and Motions
The domain of discourse in question consists of the so-called static background, e.g., buildings, streets, or places, and a set of dynamic objects able to move within the scene, e.g., cars, trams, and cyclists. Essentially, static objects are represented as closed polygons; more primitive forms of representation, delineative rectangles and centers of gravity, can be calculated if needed. As another property of some of the static objects, their prominent front is defined; its use is explained below.
Moving objects, like cars or pedestrians, are all equally represented by their centers of gravity. Their motions are represented as trajectories, i.e., lists of pairs

((t_0, P_t0), (t_1, P_t1), ..., (t_n, P_tn))

where P_ti denotes the position (x_ti, y_ti) of that object's center of gravity at time t_i on the underlying discrete time axis (Fig. 4). The set of representations of the location of both static and dynamic objects is called the geometric scene description (GSD).
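The trajectory representation above can be sketched directly as a list of (t_i, (x_ti, y_ti)) pairs over the discrete time axis. The helper names below are illustrative assumptions, not CITYTOUR's actual interface.

```python
def position_at(trajectory, t):
    """Return the center-of-gravity position of an object at time t."""
    for ti, p in trajectory:
        if ti == t:
            return p
    raise KeyError(f"no position recorded for t={t}")

def displacement(trajectory, t):
    """Displacement between two consecutive recorded time points."""
    p0, p1 = position_at(trajectory, t), position_at(trajectory, t + 1)
    return (p1[0] - p0[0], p1[1] - p0[1])
```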
Three examples of this general domain were examined:

1. The city center of a city map. Here CITYTOUR simulates aspects of a fictitious sightseeing tour through that part of the city (Andre et al., 1986a, 1986b; Retz-Schmidt, 1986). Figure 4a includes part of the static background of this domain.

2. A campus guide of the University of Saarbrücken. In this example, the
ll: ""
Wa
slI
egl
all
es
hin
ter
der
Po
st1
IHA
US
8b
efl
nd
et
alc
hU
NM
lTT
EL
BA
RH
INT
ER
de
"P
OS
TH
AU
S9
be
rln
de
tel
chR
EC
HT
GU
TH
INT
ER
de
"P
OS
Td
as
ST
EA
KH
AU
Sb
erl
nd
et
sle
hR
EC
HT
GU
Ttl
lNT
ER
de'
"P
OS
Td
aBR
AT
HA
US
be
fln
de
tB
leh
INE
rWA
HIN
TE
Rd
....
PO
STd
ieB
1Efl
AK
AD
EM
IEb
efl
nd
et
sk
hR
EC
HT
GU
TH
INT
ER
de
rP
OS
Td
ee
we"'.
lIeg
td
as
flat
hau
sh
inte
rd
e,.
Sp
ark
a.s
o'1
nern
,d
a.
kenn
man
nlc
ht
sag
en
TR
AC
E~:
-'-:
:::c
':;:
~)
y>
(s)
..it
FIG
UR
E4.
The
besl
cC
ITY
TOU
Rw
indo
ws.
(8)
The
city
map
dom
ain:
The
win
dow
atth
efa
rri
ght
disp
lays
part
ofth
est
atic
hack
grou
nd(o
nly
the
hous
es)
and
two
traj
ecto
ries
.T
hebi
gdo
tat
the
botto
mre
pres
ents
the
ob
sen
erIn
the
scen
e.W
ithin
the
win
dow
DIA
WG
.se
vera
lque
stlo
nsan
dan
swer
sar
egi
ven.
[Figure page: a CITYTOUR screen shot. The DIALOG and TRACE window text, including the internal list representation of a trajectory for object0003, is too garbled in this transcript to reproduce.]

FIGURE 4. The basic CITYTOUR windows (Continued). (b) The street crossing domain: The graphic window displays a house, a car park, a bus stop, several lanes, and five trajectories. In the TRACE window, the internal representation of a trajectory is shown.
use and verbalization of canonical trajectories is studied, i.e., the normal paths of, for example, bus lines. This particular capability will not be dealt with here.

3. A bird's-eye view of a street crossing. In this domain of discourse, we examine spatial relationships between moving objects as well as simple events, using data pertaining to real motion. The corresponding static background is illustrated in Fig. 4b.
Static Spatial Relationships
Several kinds of spatial relationships between objects of the GSD are of particular interest. Together with their degrees of applicability, they are calculated on request from the GSD.
CITYTOUR is able to recognize relationships between two or three objects within the static background, so-called static relationships. The following static relationships are implemented: in front of, behind, to the left, to the right, beside, at, on, in, and between (in German vor, hinter, links, rechts, neben, an, auf, in, and zwischen). Using them, CITYTOUR can answer the following types of question:

Is the house to the right of the bus stop?

Is the post office between the bus stop and the house?

Is the house behind the bus stop from here?
The applicability of the three-place relationship between, for example, is calculated by means of the following algorithm (cf. Fig. 5) (Ob1 and Ob2 are the names of the two reference objects; the object possibly between Ob1 and Ob2 is called the subject of the relationship):

Step 1: Calculate the two tangents g1 and g2 between the reference objects using their closed-polygon representation;

Step 2: If:
A: both tangents cross the subject (also in its polygon representation), the relationship between holds with degree 1;
B: the subject is totally enclosed by the tangents and the reference objects, the relationship is also applicable with degree 1;
C: only one of the tangents intersects the subject, the degree of applicability is calculated, depending on its penetration depth in the area between the tangents:
FIGURE 5. The three cases of between.
Applicability degree = max( a/(a+b), a/(a+c) )

Otherwise:
D: the relationship is not applicable: degree = 0.
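Cases A-D above can be sketched as follows. The geometric tests (tangent intersection, enclosure) are abstracted into boolean inputs, and a, b, c are the penetration depths sketched in Fig. 5; all names are assumptions for illustration.

```python
def between_degree(crosses_g1, crosses_g2, enclosed, a=0.0, b=0.0, c=0.0):
    """Degree of applicability of the three-place relation 'between'."""
    if crosses_g1 and crosses_g2:    # case A: both tangents cross subject
        return 1.0
    if enclosed:                     # case B: subject enclosed by tangents
        return 1.0                   #         and reference objects
    if crosses_g1 or crosses_g2:     # case C: penetration-depth ratio
        return max(a / (a + b), a / (a + c))
    return 0.0                       # case D: not applicable
```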
The degrees of applicability are used for two purposes: When answering Yes/No questions, they can be verbalized as linguistic hedges:

Is the post office between the bus stop and the house?

Yes, the post office is approximately between the bus stop and the house.

When answering Where questions, they help to choose the best reference object. In this case, the applicabilities of the four basic relationships (behind, in front of, to the right, and to the left) are calculated for several reference objects, also considering a degree of salience (in [0..1]) associated with each static object. The relationship with the resulting highest degree of applicability is verbalized.

Where is the post office? It is directly behind the church.
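The reference-object selection for Where questions can be sketched as follows: among candidate (relation, reference object) pairs, pick the one with the highest applicability, weighted by the object's salience in [0..1]. The names and the simple product weighting are assumptions, not CITYTOUR's actual scoring.

```python
def best_reference(candidates):
    """candidates: list of (relation, ref_name, applicability, salience)."""
    return max(candidates, key=lambda c: c[2] * c[3])

candidates = [
    ("behind",      "church",   0.9, 0.8),
    ("to the left", "bus stop", 0.7, 0.5),
    ("in front of", "house",    0.8, 0.6),
]
relation, ref, _, _ = best_reference(candidates)
# verbalized, e.g., as: "It is behind the church."
```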
A group of two-place prepositions, called relational prepositions, makes it possible to localize the subject first with respect to intrinsic properties of the
reference object, e.g., its intrinsic prominent front, and second with respect to the position of an observer. The first case is called intrinsic use; the second, extrinsic use of the preposition. If the position of the observer is equal to that of the speaker (or listener), we call it deictic use.
In CITYTOUR, the four basic prepositions as well as beside can be used intrinsically and deictically and, since the observer is assumed to be located within the scene, in combination with to pass as well. Thus, spatial relationships can be calculated with respect to the observer's position. In order to distinguish intrinsic and deictic use, we employ the following strategy derived from Miller and Johnson-Laird (1976): If the reference object has a prominent front, the intrinsic use is seen as the default use. Otherwise, or if it is forced by from here, the deictic interpretation is used [for deictic versus intrinsic use of prepositions, see Andre et al. (1986b) and Retz-Schmidt (1986)].
Because the necessary data pertaining to the static background cannot be calculated by the vision component, the static relationships concerning the connection between the vision system and CITYTOUR are presently not of great interest.
Dynamic Relationships and Computational Semantics for Path Prepositions
In CITYTOUR we also consider relationships between a dynamic and a static object, especially ones described by path prepositions such as past and along (in German vorbei and entlang). Furthermore, the four basic relationships mentioned above can be used in their directional reading:

The policeman went behind the building from here.

The car drove past the bus stop.

The tram went along 3rd Street.
The description of the paths of moving objects, i.e., the decision about the applicability of dynamic relationships, is based on knowledge of the full trajectories. Within the graphic window of CITYTOUR, the positions of the moving objects are projected onto the static background for all instances of time; the last point of the trajectory is regarded as the actual time of observation.
Precise analysis of past and along shows that they differ in the following aspects: In both cases, and depending on the size of the reference object, the distance between the moving subject of the relationship and the static reference object should not exceed a certain threshold. Along has a smaller threshold than past. In addition, in the case of along, the trajectory must follow more closely
the boundary of the reference object (Fig. 6). Therefore, the closed-polygon representation is used to calculate the applicability for along, whereas for past, the more general delineative rectangle representation is sufficient. During the applicability of along, the moving object does not have to change its direction; past, on the other hand, is free of this restriction. Thus, along seems to imply past. But that is not the case, because past requires that the subject move the full length from one side of the reference object to its other side; to move along the object, it has only to follow its shape for a minimal distance.
These relationships are quite important in the present state of the linkage. Their implementation also works well in conjunction with the transferred data from Karlsruhe. They are described in detail in Andre et al. (1985, 1986a, 1986b).
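The criteria for past can be sketched against a delineative rectangle: the trajectory must stay within a distance threshold of the reference object and traverse it from one side to the other. The axis-aligned simplification and all names are assumptions for illustration, not the system's actual geometry code.

```python
def drove_past(trajectory, rect, threshold):
    """trajectory: list of (x, y) points; rect: (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect

    def distance(p):
        # distance from a point to the rectangle (0 if inside)
        dx = max(xmin - p[0], 0.0, p[0] - xmax)
        dy = max(ymin - p[1], 0.0, p[1] - ymax)
        return (dx * dx + dy * dy) ** 0.5

    near = [p for p in trajectory if distance(p) <= threshold]
    if not near:
        return False
    # full traversal: the nearby part must span the rectangle's x-extent
    return min(p[0] for p in near) <= xmin and max(p[0] for p in near) >= xmax
```

A test for along would instead use the closed-polygon boundary, a smaller threshold, and only require the trajectory to follow the shape for a minimal distance, as discussed above.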
Recognition of Simple Events
We also consider simple events in which only one dynamic object is involved. More precisely, in these cases we are concerned with relationships between several positions of the object at different times.
FIGURE 6. Differences between past and along.
In CITYTOUR, to start off and to stop (in German anfahren and anhalten) are implemented. Formally, we define two predicates with the object O as the first and the time interval {t0 .. tn} as the second argument as follows:
stop(O, {t0 .. tn}) := ∃t ∈ {t0 .. tn}: move(O, {t0 .. t}) ∧ stand(O, {t+1 .. tn});

start(O, {t0 .. tn}) := ∃t ∈ {t0 .. tn}: stand(O, {t0 .. t}) ∧ move(O, {t+1 .. tn});
Sentences like Object O stopped are interpreted as would be the case in ordinary language: Object O moved during an interval of time and did not move in the following interval. To start off describes the symmetrical case. The auxiliary predications move( ) and stand( ) hold within every interval in which a subinterval exists without any standstill or motion, respectively.
move(O, {t0 .. tn}) := ∃{t1 .. t2} ⊆ {t0 .. tn}: not-stand(O, {t1 .. t2});

stand(O, {t0 .. tn}) := ∃{t1 .. t2} ⊆ {t0 .. tn}: not-move(O, {t1 .. t2});
Thus, no durative events are defined, since for these kinds of events the corresponding predication must hold for every subinterval, too. The corresponding durative event types are defined by the following two predications:
not-stand(O, {t0 .. tn}) := ∀t ∈ {t0 .. tn}: position(O, t) ≠ position(O, t+1);

not-move(O, {t0 .. tn}) := ∀t ∈ {t0 .. tn}: position(O, t) = position(O, t+1);
These definitions correspond closely to the colloquial meaning. If we say that a car did not move for a certain period, then the event so described is surely durative. We mean that the car did not move for any length of time during the period in question. By comparison, if we say that the car moved within a certain period, it is absolutely possible that the car stood still for a subperiod, even if it did not stand still during another subinterval.
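The four predicates above can be transcribed directly over a discrete trajectory, given as a list of positions indexed by time. Intervals are inclusive index ranges; the helper names are assumptions, and the brute-force subinterval search is only for illustration.

```python
def not_stand(pos, t0, tn):
    """Object moves at every instant of the interval."""
    return all(pos[t] != pos[t + 1] for t in range(t0, tn))

def not_move(pos, t0, tn):
    """Object stands still at every instant of the interval."""
    return all(pos[t] == pos[t + 1] for t in range(t0, tn))

def move(pos, t0, tn):
    """Some subinterval exists without any standstill."""
    return any(not_stand(pos, t1, t2)
               for t1 in range(t0, tn + 1) for t2 in range(t1 + 1, tn + 1))

def stand(pos, t0, tn):
    """Some subinterval exists without any motion."""
    return any(not_move(pos, t1, t2)
               for t1 in range(t0, tn + 1) for t2 in range(t1 + 1, tn + 1))

def stop(pos, t0, tn):
    return any(move(pos, t0, t) and stand(pos, t + 1, tn)
               for t in range(t0, tn + 1))

def start_off(pos, t0, tn):
    return any(stand(pos, t0, t) and move(pos, t + 1, tn)
               for t in range(t0, tn + 1))
```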
Thus, CITYTOUR can now also answer questions of the following type:

Did the car start off? Yes, it did a short time ago.

By mentioning its location, a recognized event can be described more precisely. All static prepositions of CITYTOUR can be used for this purpose:
Did the cyclist stop?
Yes, he stopped beside the post office.

Did the tram stop at the tram stop? No, it went past it.
The difference between deictic and intrinsic use is considered here as well:
Did the van stop in front of the church from here?
Yes, it did just now.
Finally, to turn into (in German einbiegen) is implemented in its meaning of changing the street. More precisely, a relationship between the moving subject and two streets is described as follows:
Did the van turn from Kaiserstrasse into Brunnengasse?
Up to now, changes in direction have not been considered here.
THE CONNECTION
The connection between a vision system and a natural language system described here is based on mutual advances in both technical fields during the past 10 years. The understanding of both kinds of systems pertaining to the typical problems and solutions gradually arose during this preparation phase.
Finally, we succeeded in establishing a first and very simple linkage between the two systems described above. Both systems are simply linked one after the other and work sequentially; i.e., a scene pertaining to a certain period is thoroughly analyzed by the image sequence analysis system, with the resulting data being transmitted en bloc to CITYTOUR. Thus, there is no feedback from the natural language system to the vision system, nor are the data processed in a kind of "pipelining," nor are the events, so to speak, incrementally recognized and verbalized.
Technical Presuppositions
To bridge the rather long spatial distance between the two systems, we now use a computer network as the transmission medium. The data come through the DFN (German Research Net), Cantus, and Ethernet. Apart from the actual results (7 x approximately 12 KB), several digitized pictures of the scene (between 250 KB and 1 MB) are also transmitted.
Working with the transmitted data requires the construction of a framework to simultaneously represent digitized pictures, trajectories, and dialogues as part
of the language processing system. The pixel scroll windows developed for this purpose allow for scrolling within a digitized image and for sequences of these images to be animated. Graphical representations of trajectories can be faded in as well (Fig. 7).
Because the format of the transmitted results did not correspond to the trajectory format expected by CITYTOUR, we also implemented an algorithm to transform the formats, allowing us to select, out of the set of all trajectories, those that are especially interesting. The parameters of this algorithm are the spatial and temporal borders of the relevant part of the scene, with only objects whose trajectories start and stop at these borders being considered further. No object should pop up within the field of view or just disappear there.
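The selection criterion above can be sketched as follows: keep only trajectories that begin and end at the spatial or temporal borders of the relevant part of the scene, so that no object appears or vanishes in the middle of the field of view. Parameter names and the border margin are assumptions for illustration.

```python
def keep_trajectory(traj, xmin, xmax, ymin, ymax, t_start, t_end, margin=5.0):
    """traj: list of (t, (x, y)) pairs, sorted by time."""
    def at_border(t, p):
        spatial = (p[0] <= xmin + margin or p[0] >= xmax - margin or
                   p[1] <= ymin + margin or p[1] >= ymax - margin)
        temporal = t <= t_start or t >= t_end
        return spatial or temporal

    (t0, p0), (tn, pn) = traj[0], traj[-1]
    # both endpoints must lie at a border of the relevant scene part
    return at_border(t0, p0) and at_border(tn, pn)
```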
Present State of the Connection
The CITYTOUR system takes for granted that the scene is observed/represented from a bird's-eye view, i.e., that there is no perspective distortion. As mentioned above, the reconstruction of the 3D objects cannot yet be included automatically. Therefore, the spatial relationships recognized by CITYTOUR refer to the picture plane: Instead of the real location of an object within the street plane, the position of an object candidate within the image is used.
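Since relations refer to the picture plane, a relation such as left of reduces to comparing image coordinates; the graded measure below is purely illustrative and not CITYTOUR's actual applicability function:

```python
import math

def left_of_degree(a, b):
    """Graded 'left of' in the picture plane: 1.0 when `a` lies exactly
    left of `b`, falling off as the direction deviates from horizontal.
    Positions are (x, y) image coordinates (centers of gravity), not
    street-plane locations."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx <= 0:
        return 0.0                            # `a` is not left of `b` at all
    angle = abs(math.atan2(dy, dx))           # 0 when exactly horizontal
    return max(0.0, 1.0 - angle / (math.pi / 2))
```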
At present, the representations of the static objects, i.e., the polygons for the buildings and lanes, are fed into the system manually. This is supported by a digitized TV picture of the scene that can be faded into the graphic window of CITYTOUR as shown in Fig. 7. Nevertheless, because perspective distortion is still being ignored, the polygons can be copied in quite simply by means of a mouse-directed graphic editor.
In contrast to the static objects, the descriptions of the dynamic objects are calculated from the data from the vision system. These data are, as mentioned above, transferred via the DFN from Karlsruhe to Saarbrücken and then transformed automatically from the original time-oriented format (described in the section on cueing) to the trajectory format (described in the section on representation of objects and motions). These trajectories can be loaded directly into CITYTOUR. Of the results obtained from the vision system, only the centers of gravity of the object candidates are presently used.
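The automatic transformation can be sketched as regrouping per-frame candidate lists into per-object trajectories (the field layout and names are assumptions, not the original format):

```python
from collections import defaultdict

def to_trajectories(frames):
    """Convert the time-oriented format into the trajectory format.
    `frames` is a list of (time, [(object_id, (x, y)), ...]) records,
    where (x, y) is the center of gravity of an object candidate.
    Returns {object_id: [(time, x, y), ...]} sorted by time."""
    tracks = defaultdict(list)
    for t, candidates in frames:
        for obj_id, (x, y) in candidates:
            tracks[obj_id].append((t, x, y))
    return {obj: sorted(points) for obj, points in tracks.items()}
```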
FUTURE WORK
The vision group in Karlsruhe will concentrate on classifying the dynamic objects into, e.g., cyclists, cars, vans, lorries, buses, and trams (object recognition) and on cueing them through more complicated trajectories, including object rotation. The algorithm presented will also be tested with scenes of up to 20 seconds in duration.
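As a rough indication of how such a classification might start, a toy size-based rule is shown below; the thresholds are invented for illustration and this is not the Karlsruhe group's actual recognition method:

```python
def classify_by_size(width, height):
    """Toy classifier for object candidates based on the area of the
    delineative rectangle, in image pixels.  Thresholds are invented
    for illustration only."""
    area = width * height
    if area < 400:
        return "cyclist"
    if area < 2000:
        return "car"
    if area < 5000:
        return "van"
    return "lorry/bus/tram"
```

A real recognizer would of course use shape, motion, and 3D cues rather than area alone.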
FIGURE 7. The windows of CITYTOUR with the digitized photo of the background and some trajectories.
The VITRA project in Saarbrücken plans to extend CITYTOUR with spatial relationships between two or more dynamic objects, e.g., x followed y (in German x folgte y). Furthermore, the more complex events to overtake, to park, and to back-park (in German überholen, parken, and einparken) will be elaborated. In some of these cases it is necessary to represent more than just the centers of gravity of moving objects; at least delineative rectangles are required.
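A relation such as x followed y could, for example, be approximated from the trajectories alone (a sketch under simple assumptions about distance and common heading, not VITRA's eventual event definition):

```python
import math

def follows(traj_x, traj_y, max_dist=30.0, min_steps=3):
    """True if, over at least `min_steps` common time steps, x stays
    within `max_dist` of y while both move in a similar direction.
    Trajectories are equal-length lists of (x, y) positions per frame;
    the thresholds are illustrative assumptions."""
    count = 0
    for i in range(1, min(len(traj_x), len(traj_y))):
        vx = (traj_x[i][0] - traj_x[i-1][0], traj_x[i][1] - traj_x[i-1][1])
        vy = (traj_y[i][0] - traj_y[i-1][0], traj_y[i][1] - traj_y[i-1][1])
        dist = math.dist(traj_x[i], traj_y[i])
        same_dir = vx[0] * vy[0] + vx[1] * vy[1] > 0   # positive dot product
        if dist <= max_dist and same_dir:
            count += 1
    return count >= min_steps
```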
Another goal is the derivation of temporarily prominent fronts of moving objects, induced by their direction of motion. That will allow for extrinsic use of prepositions referring to the moving object (cf. Retz-Schmidt, 1986).
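Deriving such a motion-induced front from a delineative rectangle might look like this (the representation and names are assumptions for illustration):

```python
def prominent_front(rect, velocity):
    """rect = (xmin, ymin, xmax, ymax), a delineative rectangle in
    image coordinates (y grows downward); velocity = (vx, vy).
    Returns the side of the rectangle that currently acts as the
    object's front, together with that side's midpoint."""
    xmin, ymin, xmax, ymax = rect
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    vx, vy = velocity
    if abs(vx) >= abs(vy):                  # predominantly horizontal motion
        return ("right", (xmax, cy)) if vx > 0 else ("left", (xmin, cy))
    return ("bottom", (cx, ymax)) if vy > 0 else ("top", (cx, ymin))
```

The front side changes as the object turns, which is what makes these fronts only temporarily prominent.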
APPARATUS
Image processing has been done with a VTE digital videodisk and a VAX 11/780, programmed in Pascal. CITYTOUR is implemented in FUZZY and ZetaLISP on a Symbolics 3600. The machines are connected via the DFN and TCP/IP.
REFERENCES
André, E., Bosch, G., Herzog, G., and Rist, T. 1985. CITYTOUR - Ein natürlichsprachliches Anfragesystem zur Evaluierung räumlicher Präpositionen. Abschlußbericht zum Fortgeschrittenenpraktikum Prof. Dr. Wahlster, Wintersemester 1984/85; Fachbereich Informatik, Universität des Saarlandes.
André, E., Bosch, G., Herzog, G., and Rist, T. 1986a. Characterizing Trajectories of Moving Objects Using Natural Language Path Descriptions. Universität des Saarlandes, SFB 314, Memo No. 5; also in Proc. 7th ECAI.
André, E., Bosch, G., Herzog, G., and Rist, T. 1986b. Coping with the Intrinsic and Deictic Use of Spatial Prepositions. Universität des Saarlandes, SFB 314, Bericht No. 9; also in Proc. AIMSA 1986.
Bajcsy, R., Joshi, A., Krotkov, E., and Zwarico, A. 1985. LandScan: A natural language and computer vision system for analyzing aerial images. Proc. IJCAI 1985.
Dreschler, L., and Nagel, H.-H. 1982. Volumetric model and 3D-trajectory of a moving car derived from monocular TV-frame sequences of a street scene. Comput. Graphics Image Process. 20:199-228.
Enkelmann, W., Kories, R., Nagel, H.-H., and Zimmermann, G. 1985. An experimental investigation of estimation approaches for optical flow fields. In Motion Understanding: Robot and Human Vision, eds. W. N. Martin and J. K. Aggarwal. Hingham, Mass.: Kluwer.
Kories, R., and Zimmermann, G. 1986. A versatile method for the estimation of displacement vector fields from image sequences. Proc. Workshop on Motion: Representation and Analysis, pp. 101-106, May 7-9, 1986, Kiawah Island Resort, Charleston, S.C.
Lakoff, G. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. Chicago: Univ. of Chicago Press.
Miller, G. A., and Johnson-Laird, P. N. 1976. Language and Perception. London: Cambridge Univ. Press.
Nebel, B., and Marburger, H. 1982. Das natürlichsprachliche System HAM-ANS: Intelligenter Zugriff auf heterogene Wissens- und Datenbasen. Univ. of Hamburg, Bericht ANS-7.
Neumann, B., and Novak, H.-J. 1986. NAOS, ein System zur natürlichsprachlichen Beschreibung zeitveränderlicher Szenen. Informatik Forsch. Entwicklung 1:83-92.
Retz-Schmidt, G. 1986. Deictic and Intrinsic Uses of Spatial Prepositions: A Multidisciplinary Comparison. Universität des Saarlandes, SFB 314, Memo 13; also in Kak, A., and Chen, S.-S., eds. 1987. Spatial Reasoning and Multi-Sensor Fusion: Proc. 1987 Workshop. Los Altos, Calif.: Morgan Kaufmann.
Schwind, C. 1984. Semantikkonzepte in der Künstlichen Intelligenz. In Künstliche Intelligenz: Proc. 2. Frühjahrsschule über KI in Dassel, IFB-KI 93. Berlin: Springer.
Sung, C. K., and Zimmermann, G. 1986. Detektion und Verfolgung mehrerer Objekte in Bildfolgen. In Mustererkennung 1986, Informatik-Fachberichte 125, pp. 181-184. Berlin: Springer.
Tsotsos, J. K. 1980. A Framework for Visual Motion Understanding. TR CSRG-114, Univ. of Toronto.
Zimmermann, G., and Kories, R. 1984. Eine Familie von Bildmerkmalen für die Bewegungsbestimmung in Bildfolgen. In Mustererkennung 1984, Informatik-Fachberichte 125, pp. 181-184. Berlin: Springer.
Received July 13, 1987
Request reprints from J. R. J. Schirra.