from image sequences to natural language: a first step toward automatic perception and description...
TRANSCRIPT
This article was downloaded by: [The University of Manchester Library] On: 10 October 2014, At: 04:15. Publisher: Taylor & Francis. Informa Ltd Registered in England and Wales, Registered Number: 1072954, Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK
Applied Artificial Intelligence: An International Journal. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/uaai20
FROM IMAGE SEQUENCES TO NATURAL LANGUAGE: A First Step toward Automatic Perception and Description of Motions. J.R.J. SCHIRRA (a), G. BOSCH (a), C.K. SUNG (b) & G. ZIMMERMANN (b). (a) Department of Computer Science, Universität des Saarlandes, D-6600 Saarbrücken 11, Federal Republic of Germany. (b) Fraunhofer-Institut für Informations- und Datenverarbeitung, D-7500 Karlsruhe 1, Federal Republic of Germany. Published online: 24 Oct 2007.
To cite this article: J.R.J. SCHIRRA, G. BOSCH, C.K. SUNG & G. ZIMMERMANN (1987) FROM IMAGE SEQUENCES TO NATURAL LANGUAGE: A First Step toward Automatic Perception and Description of Motions, Applied Artificial Intelligence: An International Journal, 1:4, 287-305, DOI: 10.1080/08839518708927976
To link to this article: http://dx.doi.org/10.1080/08839518708927976
FROM IMAGE SEQUENCES TO NATURAL LANGUAGE: A First Step toward Automatic Perception and Description of Motions
J. R. J. SCHIRRA and G. BOSCH
Department of Computer Science, Universität des Saarlandes, D-6600 Saarbrücken 11, Federal Republic of Germany

C. K. SUNG and G. ZIMMERMANN
Fraunhofer-Institut für Informations- und Datenverarbeitung, D-7500 Karlsruhe 1, Federal Republic of Germany
We present our work concerning the connection of a vision system to a natural language system. That is, automatic processing transforms the original sequence of TV images into natural language descriptions concerning moving objects. It is the first time that this transformation has been achieved entirely by computer. A vision system that has been developed in Karlsruhe is briefly introduced. By analyzing displacement vector fields, trajectories of object candidates are recognized. The natural language system CITYTOUR is presented. The verbalization of spatial relations between static and moving objects can be studied with this system. The present state of the connection is described, and the data resulting from the vision system are partially used and verbalized by CITYTOUR.
INTRODUCTION
One of the most fascinating aspects of human intelligence is the capability of perceiving the surrounding environment and describing it to another person. AI, as a field of cognitive science concerned with machine-simulated intelligent behavior, also tries to model this relationship between perception and language.
The problem of automatic description of a natural scene is usually divided into two steps: (1) constructing an abstract propositional description of the scene, a task carried out by a vision system, and (2) expressing these abstract
We thank Ingrid Wellner, who implemented the data transformation algorithm and integrated the static background in CITYTOUR, Wilfried Enkelmann for his endeavor to establish communication via DFN, and Gudula Retz-Schmidt for her insights, comments, and suggestions on earlier versions of this paper.

The work described in this paper was supported by the German Special Collaboration Project SFB 314 on AI and Knowledge-Based Systems of the German Science Foundation (DFG).
Applied Artificial Intelligence, 1:287-305, 1987. Copyright © 1987 by Hemisphere Publishing Corporation.
descriptions in natural language, a task carried out by a natural language system. Both of these tasks have been dealt with separately. Systems like MORIO (Dreschler and Nagel, 1982) try to solve the first step, while others, like HAM-ANS (Nebel and Marburger, 1982) or NAOS (Neumann and Novak, 1986), deal with the second, using simulated data of the scene description.
Obviously, there are advantages in working on both partial problems simultaneously. For humans, the data from vision systems alone are usually difficult to comprehend. Their interpretation often demands precise knowledge of the underlying analysis process. Language, as our natural medium of communication, is very efficient and well suited to representing the relevant structures within the data in a clear and compact manner.
Conversely, if there is no relationship between language notions and percepts, semantic theory will remain fragmentary. Within this traditional, purely intensional approach, the world surrounding the system is not considered. Schwind (1984) characterized this approach as follows: "A semantic concept shall simulate or represent something, which we do not know directly: conceptual structures, that humans construct to understand and produce sentences." With respect to intensional semantics, Lakoff (1987) remarks: "As should be obvious, such models of concepts make no use of any experiential aspect of human cognition. That is, intensions have nothing in them corresponding to human perceptual abilities, imaging capabilities, motor abilities, etc." Thus, the reference-semantic anchoring of meaning verified by algorithms, which should be a result of the integration of perceptual (or motor) systems and natural language systems, is of great importance for computational linguistics.
Indeed, the modeling of systems integrating the capability of visual perception and natural language communication is still at an early stage. In only a few systems does processing span the entire way between image sequences and natural language descriptions, and these are restricted to very small domains.
LANDSCAN (Bajcsy et al., 1985), for example, is such a system, but it deals only with static scenes. Another system, ALVEN (Tsotsos, 1980), works on a very specialized domain, left ventricular heart motion, but does not generate sentences in natural language. However, most concepts of motion it recognizes in the image sequence correspond to certain natural language notions of change.
Project VITRA (VIsual TRAnslator)* is concerned with the principles governing the relationship between natural language and visual perception. Experimental studies dealing with the connection of image and language understanding systems are made with the goal of developing a system capable of describing an image sequence in the German language. The vision project V1 is engaged in the
"The projects VITRA and V1 are parts of the SFB 314 Research Program of the German ScienceFoundation (DFG) on AI and Knowledge-Based Systems.
task of automatically extracting information about the motion of objects from real scenes.
In the following sections, our mutual research regarding the connection of a vision system to a natural language system is presented. This is the first time the transformation of sequential TV images into natural language descriptions of moving objects has been achieved entirely by computer. In this paper, the natural language half of this connection is emphasized. First, a vision system developed in Karlsruhe is briefly introduced. Then we present the natural language system CITYTOUR, which has been developed in Saarbrücken. The present state of this vision/NL connection is dealt with in the fourth section.
IMAGE SEQUENCE PROCESSING
Image Material
From a building about 35 m high, a stationary TV camera recorded an image sequence of a road crossing on videotape. From this tape, 130 frames (5.2 seconds) were selected, digitized (512 x 412 pixels, 8 bits), and stored on a magnetic disk. The scene shows a tram moving from the left to the right of the screen, together with other vehicles moving in the opposite direction. In the upper left-hand corner, one car has already stopped while three others are slowing down in front of red traffic lights (Fig. 1).
FIGURE 1. First frame of the sequence. Moving objects are marked by a frame parallel to the direction of motion indicated by an arrow inside the frame; the center of gravity of the vectors of the cluster is indicated by a white dot.
Segmentation by Motion
In this paper we deal only with motions within the image sequence. Up to now, objects in three-dimensional (3D) space cannot be perceived; therefore, candidates representing moving objects within the image plane are employed.
Moving object candidates are segmented from the stationary ones by computing and analyzing displacement vector fields (Enkelmann et al., 1985; Kories and Zimmermann, 1986; Sung and Zimmermann, 1986; Zimmermann and Kories, 1984). To reduce the amount of data, vector fields are computed from four consecutive image pairs. Then, as long as they persist, these vectors are individually chained together. Out of about 1350 vectors of the initial four image pairs, approximately 540 exist during the entire 130-image sequence. Because most moving objects leave the field of view during this sequence, most of the persisting vector chains must belong to the stationary background.
The extraction of moving objects is carried out in four steps:
1. To obtain stable starting conditions, only vectors persisting over 24 image pairs are considered.

2. To rule out the stationary background, all vectors with a displacement of less than 15 pixels are discarded.

3. Within a neighborhood of ±20 pixels of each of the remaining vectors, another vector with the same displacement (threshold ±1 pixel) is sought. This results in clusters of similar displacement vectors.

4. As described in steps 2 and 3, broken vector chains within each cluster possibly belonging to the cluster itself are sought. The condition of persistence over 24 image pairs of step 1 is dropped. Condition 2 is retained for the initial part of the broken vector chain. Step 4 allows short occlusions of the objects, e.g., by lampposts or trees.
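Steps 1-3 can be sketched as follows, using the thresholds given in the text (24 image pairs of persistence, a 15-pixel minimum displacement, a ±20-pixel neighborhood, and a ±1-pixel displacement tolerance). This is a minimal illustration, not the original implementation; all data-structure and function names are assumptions.

```python
# Sketch of extraction steps 1-3. Each vector chain is a dict with a
# 'start' point (x, y), a 'disp' displacement (dx, dy), and a 'life'
# counting the image pairs over which the chain persists.

def extract_moving_clusters(chains, min_life=24, min_disp=15.0,
                            neighborhood=20.0, disp_tol=1.0):
    # Step 1: keep only vector chains persisting over min_life image pairs.
    stable = [c for c in chains if c["life"] >= min_life]
    # Step 2: discard the stationary background (small displacements).
    moving = [c for c in stable
              if (c["disp"][0] ** 2 + c["disp"][1] ** 2) ** 0.5 >= min_disp]
    # Step 3: group vectors whose start points lie within the neighborhood
    # and whose displacements agree within disp_tol into clusters.
    clusters, assigned = [], set()
    for i, c in enumerate(moving):
        if i in assigned:
            continue
        cluster = [c]
        assigned.add(i)
        for j, d in enumerate(moving):
            if j in assigned:
                continue
            close = all(abs(c["start"][k] - d["start"][k]) <= neighborhood
                        for k in (0, 1))
            similar = all(abs(c["disp"][k] - d["disp"][k]) <= disp_tol
                          for k in (0, 1))
            if close and similar:
                cluster.append(d)
                assigned.add(j)
        if len(cluster) > 1:  # a lone vector does not form a cluster
            clusters.append(cluster)
    return clusters
```

Step 4 (re-collecting broken vector chains) would then be run per cluster with the persistence condition relaxed, as the text describes.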
Cueing of Object Candidates
Cueing is carried out in two phases. First, the next 24 image pairs, with an overlap of four pairs, are analyzed using the same procedure as above, but now with only vectors contained in the clusters previously found. The condition of step 2 is applied to ensure that only object candidates initially moving are cued. This is repeated for the third 24 image pairs.
In the next phase, lasting until the end of the image sequence, the condition of step 2 is dropped, allowing displacements down to zero. To ensure the stability of the size of the cluster, the condition of step 1 is reintroduced, allowing only uninterrupted vectors.
FIGURE 2. Cueing of the object candidates from Image 1 to Image 125. For seven time points a frame is drawn around the object. Note that the objects in the upper part of the image come to rest.
The cueing results are shown in Fig. 2, where the object candidates are displayed with respect to images 1, 25, 45, 65, 85, 105, and 125. Initially, 10 moving object candidates are found. The tram in the lower part of the image moves out of the field of view. Of the vehicles on the lanes directly above the tram, all leave the field of view except the two cars on the far right. One of them overtakes a cyclist (third frame from the right in Fig. 1); the cyclist is lost by the algorithm during this process of overtaking. Later, a vector with zero displacement is erroneously included in this car's cluster (the long, small frame in the center of Fig. 2), and in the subsequent cueing step, the algorithm loses this car.
Despite occlusions by trees, lampposts, and lamp suspension wires, three stopping cars are cued through the whole scene in the upper part of the image. It must be noted that in using the present procedure, the frames tend to become smaller because each new search begins with the previously found cluster and, in the second cueing phase, broken vector chains are discarded.
Results
Moving object candidates are represented by the center of gravity of the starting points of all vectors within the cluster. For subsequent processing, additional information is given. This consists of the four corner points of the frame around the cluster and the mean displacement vector of the cluster. Together with the identification number of the object candidates for each 20th frame, all data are transmitted through the entire sequence (Fig. 3).
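The per-candidate record described above can be sketched as a small data structure: the center of gravity of the vectors' starting points, the corners of the frame around the cluster, and the mean displacement vector. Field and function names here are illustrative assumptions, not the transmitted format itself.

```python
from dataclasses import dataclass

@dataclass
class ObjectCandidate:
    ident: int
    center: tuple      # center of gravity of the vector starting points
    frame: tuple       # (row_min, col_min, row_max, col_max) of the cluster
    mean_disp: tuple   # mean displacement vector of the cluster

def summarize_cluster(ident, cluster):
    """cluster: list of (start, disp) pairs with start = (row, col)."""
    n = len(cluster)
    rows = [s[0] for s, _ in cluster]
    cols = [s[1] for s, _ in cluster]
    center = (sum(rows) / n, sum(cols) / n)
    frame = (min(rows), min(cols), max(rows), max(cols))
    mean_disp = (sum(d[0] for _, d in cluster) / n,
                 sum(d[1] for _, d in cluster) / n)
    return ObjectCandidate(ident, center, frame, mean_disp)
```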
[Figure 3 shows the processing chain from displacement vector fields to the geometric image sequence description, followed by two blocks of the transmitted trajectory data: for each object candidate, the row/column coordinates of the center of gravity and of the four frame corners, plus the mean displacement vector. The numeric columns are too garbled in this transcript to reproduce.]

FIGURE 3. Diagram of the motion analysis algorithm with part of the resulting data.
THE NATURAL LANGUAGE COMPONENT: CITYTOUR
The German dialogue system CITYTOUR (Andre et al., 1985) answers questions concerning
The spatial relationships between static objects (static relationships),

Direction and path of moving objects, i.e., the geometric properties of motions (dynamic relationships), and

Other visible, especially kinematic, properties of motions (e.g., velocity, acceleration).
The objects are supposed to be arranged in a two-dimensional Euclidean area. The questioner is assumed to be one of these objects. Thus, the conversational partner is part of the scene under discussion. Therefore CITYTOUR's answer can take into account the current position of the observer.
Representation of Objects and Motions
The domain of discourse in question consists of the so-called static background, e.g., buildings, streets, or places, and a set of dynamic objects able to move within the scene, e.g., cars, trams, and cyclists. Essentially, static objects are represented as closed polygons; more primitive forms of representation, delineative rectangles and centers of gravity, can be calculated if needed. As another property of some of the static objects, their prominent front is defined; its use is explained below.
Moving objects, like cars or pedestrians, are all equally represented by their centers of gravity. Their motions are represented as trajectories, i.e., lists of pairs

((t_0, P_t0), (t_1, P_t1), ..., (t_n, P_tn))

where P_ti denotes the position (x_ti, y_ti) of that object's center of gravity at time t_i on the underlying discrete time axis (Fig. 4). The set of representations of the location of both static and dynamic objects is called the geometric scene description (GSD).
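The trajectory representation above can be sketched directly as a list of (t_i, (x_ti, y_ti)) pairs over the discrete time axis. The helper names below are illustrative assumptions, not CITYTOUR's actual interface.

```python
def position_at(trajectory, t):
    """Return the center-of-gravity position of an object at time t."""
    for ti, p in trajectory:
        if ti == t:
            return p
    raise KeyError(f"no position recorded for t={t}")

def displacement(trajectory, t):
    """Displacement between two consecutive recorded time points."""
    p0, p1 = position_at(trajectory, t), position_at(trajectory, t + 1)
    return (p1[0] - p0[0], p1[1] - p0[1])
```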
Three examples of this general domain were examined:

1. The city center of a city map. Here CITYTOUR simulates aspects of a fictitious sightseeing tour through that part of the city (Andre et al., 1986a, 1986b; Retz-Schmidt, 1986). Figure 4a includes part of the static background of this domain.

2. A campus guide of the University of Saarbrücken. In this example, the
ll: ""
Wa
slI
egl
all
es
hin
ter
der
Po
st1
IHA
US
8b
efl
nd
et
alc
hU
NM
lTT
EL
BA
RH
INT
ER
de
"P
OS
TH
AU
S9
be
rln
de
tel
chR
EC
HT
GU
TH
INT
ER
de
"P
OS
Td
as
ST
EA
KH
AU
Sb
erl
nd
et
sle
hR
EC
HT
GU
Ttl
lNT
ER
de'
"P
OS
Td
aBR
AT
HA
US
be
fln
de
tB
leh
INE
rWA
HIN
TE
Rd
....
PO
STd
ieB
1Efl
AK
AD
EM
IEb
efl
nd
et
sk
hR
EC
HT
GU
TH
INT
ER
de
rP
OS
Td
ee
we"'.
lIeg
td
as
flat
hau
sh
inte
rd
e,.
Sp
ark
a.s
o'1
nern
,d
a.
kenn
man
nlc
ht
sag
en
TR
AC
E~:
-'-:
:::c
':;:
~)
y>
(s)
..it
FIG
UR
E4.
The
besl
cC
ITY
TOU
Rw
indo
ws.
(8)
The
city
map
dom
ain:
The
win
dow
atth
efa
rri
ght
disp
lays
part
ofth
est
atic
hack
grou
nd(o
nly
the
hous
es)
and
two
traj
ecto
ries
.T
hebi
gdo
tat
the
botto
mre
pres
ents
the
ob
sen
erIn
the
scen
e.W
ithin
the
win
dow
DIA
WG
.se
vera
lque
stlo
nsan
dan
swer
sar
egi
ven.
[Figure page: a CITYTOUR screen shot. The DIALOG and TRACE window text, including the internal list representation of a trajectory for object0003, is too garbled in this transcript to reproduce.]

FIGURE 4. The basic CITYTOUR windows (Continued). (b) The street crossing domain: The graphic window displays a house, a car park, a bus stop, several lanes, and five trajectories. In the TRACE window, the internal representation of a trajectory is shown.
use and verbalization of canonical trajectories is studied, i.e., the normal paths of, for example, bus lines. This particular capability will not be dealt with here.

3. A bird's-eye view of a street crossing. In this domain of discourse, we examine spatial relationships between moving objects as well as simple events, using data pertaining to real motion. The corresponding static background is illustrated in Fig. 4b.
Static Spatial Relationships
Several kinds of spatial relationships between objects of the GSD are of particular interest. Together with their degrees of applicability, they are calculated on request from the GSD.
CITYTOUR is able to recognize relationships between two or three objects within the static background, so-called static relationships. The following static relationships are implemented: in front of, behind, to the left, to the right, beside, at, on, in, and between (in German vor, hinter, links, rechts, neben, an, auf, in, and zwischen). Using them, CITYTOUR can answer the following types of question:

Is the house to the right of the bus stop?

Is the post office between the bus stop and the house?

Is the house behind the bus stop from here?
The applicability of the three-place relationship between, for example, is calculated by means of the following algorithm (cf. Fig. 5) (Ob1 and Ob2 are the names of the two reference objects; the object possibly between Ob1 and Ob2 is called the subject of the relationship):

Step 1: Calculate the two tangents g1 and g2 between the reference objects using their closed-polygon representation;

Step 2: If:
A: both tangents cross the subject (also in its polygon representation), the relationship between holds with degree 1;
B: the subject is totally enclosed by the tangents and the reference objects, the relationship is also applicable with degree 1;
C: only one of the tangents intersects the subject, the degree of applicability is calculated, depending on its penetration depth in the area between the tangents:
FIGURE 5. The three cases of between.
Applicability degree = max( a/(a+b), a/(a+c) )

Otherwise:
D: the relationship is not applicable: degree = 0.
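Cases A-D above can be sketched as follows. The geometric tests (tangent intersection, enclosure) are abstracted into boolean inputs, and a, b, c are the penetration depths sketched in Fig. 5; all names are assumptions for illustration.

```python
def between_degree(crosses_g1, crosses_g2, enclosed, a=0.0, b=0.0, c=0.0):
    """Degree of applicability of the three-place relation 'between'."""
    if crosses_g1 and crosses_g2:    # case A: both tangents cross subject
        return 1.0
    if enclosed:                     # case B: subject enclosed by tangents
        return 1.0                   #         and reference objects
    if crosses_g1 or crosses_g2:     # case C: penetration-depth ratio
        return max(a / (a + b), a / (a + c))
    return 0.0                       # case D: not applicable
```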
The degrees of applicability are used for two purposes: When answering Yes/No questions, they can be verbalized as linguistic hedges:

Is the post office between the bus stop and the house?

Yes, the post office is approximately between the bus stop and the house.

When answering Where questions, they help to choose the best reference object. In this case, the applicabilities of the four basic relationships (behind, in front of, to the right, and to the left) are calculated for several reference objects, also considering a degree of salience (in [0..1]) associated with each static object. The relationship with the resulting highest degree of applicability is verbalized.

Where is the post office? It is directly behind the church.
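The reference-object selection for Where questions can be sketched as follows: among candidate (relation, reference object) pairs, pick the one with the highest applicability, weighted by the object's salience in [0..1]. The names and the simple product weighting are assumptions, not CITYTOUR's actual scoring.

```python
def best_reference(candidates):
    """candidates: list of (relation, ref_name, applicability, salience)."""
    return max(candidates, key=lambda c: c[2] * c[3])

candidates = [
    ("behind",      "church",   0.9, 0.8),
    ("to the left", "bus stop", 0.7, 0.5),
    ("in front of", "house",    0.8, 0.6),
]
relation, ref, _, _ = best_reference(candidates)
# verbalized, e.g., as: "It is behind the church."
```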
A group of two-place prepositions, called relational prepositions, makes it possible to localize the subject first with respect to intrinsic properties of the
reference object, e.g., its intrinsic prominent front, and second with respect to the position of an observer. The first case is called intrinsic use; the second, extrinsic use of the preposition. If the position of the observer is equal to that of the speaker (or listener), we call it deictic use.
In CITYTOUR, the four basic prepositions as well as beside can be used intrinsically and deictically and, since the observer is assumed to be located within the scene, in combination with to pass as well. Thus, spatial relationships can be calculated with respect to the observer's position. In order to distinguish intrinsic and deictic use, we employ the following strategy derived from Miller and Johnson-Laird (1976): If the reference object has a prominent front, the intrinsic use is seen as the default use. Otherwise, or if it is forced by from here, the deictic interpretation is used [for deictic versus intrinsic use of prepositions, see Andre et al. (1986b) and Retz-Schmidt (1986)].
Because the necessary data pertaining to the static background cannot be calculated by the vision component, the static relationships concerning the connection between the vision system and CITYTOUR are presently not of great interest.
Dynamic Relationships and Computational Semantics for Path Prepositions
In CITYTOUR we also consider relationships between a dynamic and a static object, especially ones described by path prepositions such as past and along (in German vorbei and entlang). Furthermore, the four basic relationships mentioned above can be used in their directional reading:

The policeman went behind the building from here.

The car drove past the bus stop.

The tram went along 3rd Street.
The description of the paths of moving objects, i.e., the decision about the applicability of dynamic relationships, is based on knowledge of the full trajectories. Within the graphic window of CITYTOUR, the positions of the moving objects are projected onto the static background for all instances of time; the last point of the trajectory is regarded as the actual time of observation.
Precise analysis of past and along shows that they differ in the following aspects: In both cases, and depending on the size of the reference object, the distance between the moving subject of the relationship and the static reference object should not exceed a certain threshold. Along has a smaller threshold than past. In addition, in the case of along, the trajectory must follow more closely
the boundary of the reference object (Fig. 6). Therefore, the closed-polygon representation is used to calculate the applicability for along, whereas for past, the more general delineative rectangle representation is sufficient. During the applicability of along, the moving object does not have to change its direction; past, on the other hand, is free of this restriction. Thus, along seems to imply past. But that is not the case, because past requires that the subject move the full length from one side of the reference object to its other side; to move along the object, it has only to follow its shape for a minimal distance.
These relationships are quite important in the present state of the linkage. Their implementation also works well in conjunction with the transferred data from Karlsruhe. They are described in detail in Andre et al. (1985, 1986a, 1986b).
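The criteria for past can be sketched against a delineative rectangle: the trajectory must stay within a distance threshold of the reference object and traverse it from one side to the other. The axis-aligned simplification and all names are assumptions for illustration, not the system's actual geometry code.

```python
def drove_past(trajectory, rect, threshold):
    """trajectory: list of (x, y) points; rect: (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect

    def distance(p):
        # distance from a point to the rectangle (0 if inside)
        dx = max(xmin - p[0], 0.0, p[0] - xmax)
        dy = max(ymin - p[1], 0.0, p[1] - ymax)
        return (dx * dx + dy * dy) ** 0.5

    near = [p for p in trajectory if distance(p) <= threshold]
    if not near:
        return False
    # full traversal: the nearby part must span the rectangle's x-extent
    return min(p[0] for p in near) <= xmin and max(p[0] for p in near) >= xmax
```

A test for along would instead use the closed-polygon boundary, a smaller threshold, and only require the trajectory to follow the shape for a minimal distance, as discussed above.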
Recognition of Simple Events
We also consider simple events in which only one dynamic object is involved. More precisely, in these cases we are concerned with relationships between several positions of the object at different times.
FIGURE 6. Differences between past and along.
In CITYTOUR, to start off and to stop (in German anfahren and anhalten) are implemented. Formally, we define two predicates with the object O as the first and the time interval {t0 .. tn} as the second argument as follows:
stop(O, {t0 .. tn}) := ∃t ∈ {t0 .. tn}: move(O, {t0 .. t}) ∧ stand(O, {t+1 .. tn});

start(O, {t0 .. tn}) := ∃t ∈ {t0 .. tn}: stand(O, {t0 .. t}) ∧ move(O, {t+1 .. tn});
Sentences like Object O stopped are interpreted as would be the case in ordinary language: Object O moved during an interval of time and did not move in the following interval. To start off describes the symmetrical case. The auxiliary predications move( ) and stand( ) hold within every interval in which a subinterval exists without any standstill or motion, respectively.
move(O, {t0 .. tn}) := ∃{t1 .. t2} ⊆ {t0 .. tn}: not-stand(O, {t1 .. t2});

stand(O, {t0 .. tn}) := ∃{t1 .. t2} ⊆ {t0 .. tn}: not-move(O, {t1 .. t2});
Thus, no durative events are defined, since for these kinds of events the corresponding predication must hold for every subinterval, too. The corresponding durative event types are defined by the following two predications:
not-stand(O, {t0 .. tn}) := ∀t ∈ {t0 .. tn}: position(O, t) ≠ position(O, t+1);

not-move(O, {t0 .. tn}) := ∀t ∈ {t0 .. tn}: position(O, t) = position(O, t+1);
These definitions correspond closely to the colloquial meaning. If we say that a car did not move for a certain period, then the event so described is surely durative. We mean that the car did not move for any length of time during the period in question. By comparison, if we say that the car moved within a certain period, it is absolutely possible that the car stood still for a subperiod, even if it did not stand still during another subinterval.
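The four predicates above can be transcribed directly over a discrete trajectory, given as a list of positions indexed by time. Intervals are inclusive index ranges; the helper names are assumptions, and the brute-force subinterval search is only for illustration.

```python
def not_stand(pos, t0, tn):
    """Object moves at every instant of the interval."""
    return all(pos[t] != pos[t + 1] for t in range(t0, tn))

def not_move(pos, t0, tn):
    """Object stands still at every instant of the interval."""
    return all(pos[t] == pos[t + 1] for t in range(t0, tn))

def move(pos, t0, tn):
    """Some subinterval exists without any standstill."""
    return any(not_stand(pos, t1, t2)
               for t1 in range(t0, tn + 1) for t2 in range(t1 + 1, tn + 1))

def stand(pos, t0, tn):
    """Some subinterval exists without any motion."""
    return any(not_move(pos, t1, t2)
               for t1 in range(t0, tn + 1) for t2 in range(t1 + 1, tn + 1))

def stop(pos, t0, tn):
    return any(move(pos, t0, t) and stand(pos, t + 1, tn)
               for t in range(t0, tn + 1))

def start_off(pos, t0, tn):
    return any(stand(pos, t0, t) and move(pos, t + 1, tn)
               for t in range(t0, tn + 1))
```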
Thus, CITYTOUR can now also answer questions of the following type:

Did the car start off? Yes, it did a short time ago.

By mentioning its location, a recognized event can be described more precisely. All static prepositions of CITYTOUR can be used for this purpose:
Did the cyclist stop?
Yes, he stopped beside the post office.

Did the tram stop at the tram stop? No, it went past it.
The difference between deictic and intrinsic use is considered here as well:
Did the van stop in front of the church from here?
Yes, it did just now.
Finally, to turn into (in German einbiegen) is implemented in its meaning of changing the street. More precisely, a relationship between the moving subject and two streets is described as follows:
Did the van turn from Kaiserstrasse into Brunnengasse?
Up to now, changes in direction have not been considered here.
THE CONNECTION
The connection between a vision system and a natural language system described here is based on mutual advances in both technical fields during the past 10 years. The understanding of both kinds of systems pertaining to the typical problems and solutions gradually arose during this preparation phase.
Finally, we succeeded in establishing a first and very simple linkage between the two systems described above. Both systems are simply linked one after the other and work sequentially; i.e., a scene pertaining to a certain period is thoroughly analyzed by the image sequence analysis system, with the resulting data being transmitted en bloc to CITYTOUR. Thus, there is no feedback from the natural language system to the vision system, nor are the data processed in a kind of "pipelining," nor are the events, so to speak, incrementally recognized and verbalized.
Technical Presuppositions
To bridge the rather long spatial distance between the two systems, we now use a computer network as the transmission medium. The data come through the DFN (German Research Net), Cantus, and Ethernet. Apart from the actual results (7 x approximately 12 KB), several digitized pictures of the scene (between 250 KB and 1 MB) are also transmitted.
Working with the transmitted data requires the construction of a framework to simultaneously represent digitized pictures, trajectories, and dialogues as part
of the language processing system. The pixel scroll windows developed for this purpose allow for scrolling within a digitized image and for sequences of these images to be animated. Graphical representations of trajectories can be faded in as well (Fig. 7).
Because the format of the transmitted results did not correspond to the trajectory format expected by CITYTOUR, we also implemented an algorithm to transform the formats, allowing us to select, out of the set of all trajectories, those that are especially interesting. The parameters of this algorithm are the spatial and temporal borders of the relevant part of the scene, with only objects whose trajectories start and stop at these borders being considered further. No object should pop up within the field of view or just disappear there.
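The selection criterion above can be sketched as follows: keep only trajectories that begin and end at the spatial or temporal borders of the relevant part of the scene, so that no object appears or vanishes in the middle of the field of view. Parameter names and the border margin are assumptions for illustration.

```python
def keep_trajectory(traj, xmin, xmax, ymin, ymax, t_start, t_end, margin=5.0):
    """traj: list of (t, (x, y)) pairs, sorted by time."""
    def at_border(t, p):
        spatial = (p[0] <= xmin + margin or p[0] >= xmax - margin or
                   p[1] <= ymin + margin or p[1] >= ymax - margin)
        temporal = t <= t_start or t >= t_end
        return spatial or temporal

    (t0, p0), (tn, pn) = traj[0], traj[-1]
    # both endpoints must lie at a border of the relevant scene part
    return at_border(t0, p0) and at_border(tn, pn)
```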
Present State of the Connection
The CITYTOUR system takes for granted that the scene is observed/represented from a bird's-eye view, i.e., that there is no perspective distortion. As mentioned above, the reconstruction of the 3D objects cannot yet be included automatically. Therefore, the spatial relationships recognized by CITYTOUR refer to the picture plane: Instead of the real location of an object within the street plane, the position of an object candidate within the image is used.
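Since relations refer to the picture plane, a relation such as left of reduces to comparing image coordinates; the graded measure below is purely illustrative and not CITYTOUR's actual applicability function:

```python
import math

def left_of_degree(a, b):
    """Graded 'left of' in the picture plane: 1.0 when `a` lies exactly
    left of `b`, falling off as the direction deviates from horizontal.
    Positions are (x, y) image coordinates (centers of gravity), not
    street-plane locations."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    if dx <= 0:
        return 0.0                            # `a` is not left of `b` at all
    angle = abs(math.atan2(dy, dx))           # 0 when exactly horizontal
    return max(0.0, 1.0 - angle / (math.pi / 2))
```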
At present, the representations of the static objects, i.e., the polygons for the buildings and lanes, are fed into the system manually. This is supported by a digitized TV picture of the scene that can be faded into the graphic window of CITYTOUR as shown in Fig. 7. Nevertheless, because perspective distortion is still being ignored, the polygons can be copied in quite simply by means of a mouse-directed graphic editor.
In contrast to the static objects, the descriptions of the dynamic objects are calculated from the data from the vision system. These data are, as mentioned above, transferred via the DFN from Karlsruhe to Saarbrücken and then transformed automatically from the original time-oriented format (described in the section on cueing) to the trajectory format (described in the section on representation of objects and motions). These trajectories can be loaded directly into CITYTOUR. Of the results obtained from the vision system, only the centers of gravity of the object candidates are presently used.
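The automatic transformation can be sketched as regrouping per-frame candidate lists into per-object trajectories (the field layout and names are assumptions, not the original format):

```python
from collections import defaultdict

def to_trajectories(frames):
    """Convert the time-oriented format into the trajectory format.
    `frames` is a list of (time, [(object_id, (x, y)), ...]) records,
    where (x, y) is the center of gravity of an object candidate.
    Returns {object_id: [(time, x, y), ...]} sorted by time."""
    tracks = defaultdict(list)
    for t, candidates in frames:
        for obj_id, (x, y) in candidates:
            tracks[obj_id].append((t, x, y))
    return {obj: sorted(points) for obj, points in tracks.items()}
```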
FUTURE WORK
The vision group in Karlsruhe will concentrate on classifying the dynamic objects into, e.g., cyclists, cars, vans, lorries, buses, and trams (object recognition) and on cueing them through more complicated trajectories, including object rotation. The algorithm presented will also be tested with scenes of up to 20 seconds in duration.
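As a rough indication of how such a classification might start, a toy size-based rule is shown below; the thresholds are invented for illustration and this is not the Karlsruhe group's actual recognition method:

```python
def classify_by_size(width, height):
    """Toy classifier for object candidates based on the area of the
    delineative rectangle, in image pixels.  Thresholds are invented
    for illustration only."""
    area = width * height
    if area < 400:
        return "cyclist"
    if area < 2000:
        return "car"
    if area < 5000:
        return "van"
    return "lorry/bus/tram"
```

A real recognizer would of course use shape, motion, and 3D cues rather than area alone.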
FIGURE 7. The windows of CITYTOUR with the digitized photo of the background and some trajectories.
The VITRA project in Saarbrücken plans to extend CITYTOUR with spatial relationships between two or more dynamic objects, e.g., x followed y (in German x folgte y). Furthermore, the more complex events to overtake, to park, and to back-park (in German überholen, parken, and einparken) will be elaborated. In some of these cases it is necessary to represent more than just the centers of gravity of moving objects; at least delineative rectangles are required.
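A relation such as x followed y could, for example, be approximated from the trajectories alone (a sketch under simple assumptions about distance and common heading, not VITRA's eventual event definition):

```python
import math

def follows(traj_x, traj_y, max_dist=30.0, min_steps=3):
    """True if, over at least `min_steps` common time steps, x stays
    within `max_dist` of y while both move in a similar direction.
    Trajectories are equal-length lists of (x, y) positions per frame;
    the thresholds are illustrative assumptions."""
    count = 0
    for i in range(1, min(len(traj_x), len(traj_y))):
        vx = (traj_x[i][0] - traj_x[i-1][0], traj_x[i][1] - traj_x[i-1][1])
        vy = (traj_y[i][0] - traj_y[i-1][0], traj_y[i][1] - traj_y[i-1][1])
        dist = math.dist(traj_x[i], traj_y[i])
        same_dir = vx[0] * vy[0] + vx[1] * vy[1] > 0   # positive dot product
        if dist <= max_dist and same_dir:
            count += 1
    return count >= min_steps
```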
Another goal is the derivation of temporarily prominent fronts of moving objects, induced by their direction of motion. That will allow for extrinsic use of prepositions referring to the moving object (cf. Retz-Schmidt, 1986).
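Deriving such a motion-induced front from a delineative rectangle might look like this (the representation and names are assumptions for illustration):

```python
def prominent_front(rect, velocity):
    """rect = (xmin, ymin, xmax, ymax), a delineative rectangle in
    image coordinates (y grows downward); velocity = (vx, vy).
    Returns the side of the rectangle that currently acts as the
    object's front, together with that side's midpoint."""
    xmin, ymin, xmax, ymax = rect
    cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
    vx, vy = velocity
    if abs(vx) >= abs(vy):                  # predominantly horizontal motion
        return ("right", (xmax, cy)) if vx > 0 else ("left", (xmin, cy))
    return ("bottom", (cx, ymax)) if vy > 0 else ("top", (cx, ymin))
```

The front side changes as the object turns, which is what makes these fronts only temporarily prominent.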
APPARATUS
Image processing has been done with a VTE digital videodisk and a VAX 11/780, programmed in Pascal. CITYTOUR is implemented in FUZZY and ZetaLISP on a Symbolics 3600. The machines are connected via the DFN and TCP/IP.
REFERENCES
André, E., Bosch, G., Herzog, G., and Rist, T. 1985. CITYTOUR - Ein natürlichsprachliches Anfragesystem zur Evaluierung räumlicher Präpositionen. Abschlußbericht zum Fortgeschrittenenpraktikum Prof. Dr. Wahlster, Wintersemester 1984/85; Fachbereich Informatik, Universität des Saarlandes.
André, E., Bosch, G., Herzog, G., and Rist, T. 1986a. Characterizing Trajectories of Moving Objects Using Natural Language Path Descriptions. Universität des Saarlandes, SFB 314, Memo No. 5; also in Proc. 7th ECAI.
André, E., Bosch, G., Herzog, G., and Rist, T. 1986b. Coping with the Intrinsic and Deictic Use of Spatial Prepositions. Universität des Saarlandes, SFB 314, Bericht No. 9; also in Proc. AIMSA 1986.
Bajcsy, R., Joshi, A., Krotkov, E., and Zwarico, A. 1985. LandScan: A natural language and computer vision system for analyzing aerial images. Proc. IJCAI 1985.
Dreschler, L., and Nagel, H.-H. 1982. Volumetric model and 3D-trajectory of a moving car derived from monocular TV-frame sequences of a street scene. Comput. Graphics Image Process. 20:199-228.
Enkelmann, W., Kories, R., Nagel, H.-H., and Zimmermann, G. 1985. An experimental investigation of estimation approaches for optical flow fields. In Motion Understanding: Robot and Human Vision, eds. W. N. Martin and J. K. Aggarwal. Hingham, Mass.: Kluwer.
Kories, R., and Zimmermann, G. 1986. A versatile method for the estimation of displacement vector fields from image sequences. Proc. Workshop on Motion: Representation and Analysis, pp. 101-106, May 7-9, 1986, Kiawah Island Resort, Charleston, S.C.
Lakoff, G. 1987. Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. Chicago: Univ. of Chicago Press.
Miller, G. A., and Johnson-Laird, P. N. 1976. Language and Perception. London: Cambridge Univ. Press.
Nebel, B., and Marburger, H. 1982. Das natürlichsprachliche System HAM-ANS: Intelligenter Zugriff auf heterogene Wissens- und Datenbasen. Univ. of Hamburg, Bericht ANS-7.
Neumann, B., and Novak, H.-J. 1986. NAOS, ein System zur natürlichsprachlichen Beschreibung zeitveränderlicher Szenen. Informatik Forsch. Entwicklung 1:83-92.
Retz-Schmidt, G. 1986. Deictic and Intrinsic Uses of Spatial Prepositions: A Multidisciplinary Comparison. Universität des Saarlandes, SFB 314, Memo 13; also in Kak, A., and Chen, S.-S., eds. 1987. Spatial Reasoning and Multi-Sensor Fusion: Proc. 1987 Workshop. Los Altos, Calif.: Morgan Kaufmann.
Schwind, C. 1984. Semantikkonzepte in der Künstlichen Intelligenz. In Künstliche Intelligenz: Proc. 2. Frühjahrsschule über KI in Dassel, IFB-KI 93. Berlin: Springer.
Sung, C. K., and Zimmermann, G. 1986. Detektion und Verfolgung mehrerer Objekte in Bildfolgen. In Mustererkennung 1986, Informatik-Fachberichte 125, pp. 181-184. Berlin: Springer.
Tsotsos, J. K. 1980. A Framework for Visual Motion Understanding. TR CSRG-114, Univ. of Toronto.
Zimmermann, G., and Kories, R. 1984. Eine Familie von Bildmerkmalen für die Bewegungsbestimmung in Bildfolgen. In Mustererkennung 1984, Informatik-Fachberichte 125, pp. 181-184. Berlin: Springer.
Received July 13, 1987
Request reprints from J. R. J. Schirra.