


1644 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 19, NO. 6, NOVEMBER/DECEMBER 1989

morphology are already being used to great effect in certain industrial machine vision systems [24], [27]. Future low-cost massively parallel systems under development [28] will dramatically increase the speed of voting logic, rank filters, and iconic neural networks.

B. Morphological Networks

Among the various types of image domain operators in the literature, voting logic seems to be a pivotal concept. One generalization leads to the more traditional analytic and biological development of neural networks. The other generalization leads to weighted rank-order filters, the basis for morphology. Not only is morphology made richer with the inclusion of weighted rank-order filters, but it is a natural generalization that cannot be avoided. The idea of vector convolutions is applied to the other operators so that in Fig. 9, the entire network of generalization paths of the scalar operators is replaced by the more general vector operators. The definition of vector operators allows not only an enrichment of the theory of the common image operations, but results in new robust image processing techniques. Morphological networks can be defined using weighted rank-order filters, although it is doubtful that biological systems operate that way. There is no question that learning methods similar to those used in traditional neural networks can be brought to vector morphological networks for automatically computing weights, ranks, and structuring elements; however, due to the grossly nonlinear nature of the field, a formal analysis of its properties will be a challenge.

ACKNOWLEDGMENT

The author would like to thank K. Preston, Jr., who pointed out references for rank-order filters.

REFERENCES

[1] K. E. Batcher, "Design of a massively parallel processor," IEEE Trans. Comput., vol. C-29, pp. 836-840, 1980.
[2] H. A. David, Order Statistics. New York: Wiley, 1970.
[3] P. Danielsson, "Getting the median faster," Comput. Graphics Image Proc., vol. 17, pp. 71-78, 1981.
[4] E. R. Dougherty and P. Sehdev, "A robust image processing language in the context of image algebra," in Proc. Computer Vision and Pattern Recog., Ann Arbor, MI, June 5-9, 1988, pp. 748-753.
[5] F. A. Gerritsen and P. W. Verbeek, "Implementation of cellular logic operators using 3x3 convolution and table lookup hardware," Comput. Vision Graphics Image Proc., vol. 27, pp. 115-123, 1984.
[6] V. Goetcharian, "From binary to gray-level tone image processing by using fuzzy logic concepts," Pattern Recog., vol. 12, pp. 7-15, 1980.
[7] H. Hadwiger, Vorlesungen über Inhalt, Oberfläche und Isoperimetrie. Berlin, Germany: Springer, 1957.
[8] R. M. Haralick, S. R. Sternberg, and X. Zhuang, "Image analysis using mathematical morphology," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-9, no. 4, pp. 532-550, 1987.
[9] R. M. Haralick, X. Zhang, C. Lin, and J. Lee, "Binary morphology: Working in the sampled domain," in Proc. Computer Vision and Pattern Recog., Ann Arbor, MI, June 5-9, 1988, pp. 748-753.
[10] G. Heygster, "Rank filters in digital image processing," in Proc. 5th Int. Conf. Pattern Recog., Miami Beach, FL, 1980, pp. 1165-1167.
[11] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," J. Physiol., vol. 195, no. 2, pp. 215-244, Nov. 1968.
[12] B. I. Justusson, "Median filtering: Statistical properties," in Two-Dimensional Digital Signal Processing II, T. S. Huang, Ed. Heidelberg, Germany: Springer-Verlag, 1981, pp. 161-196.
[13] M. Maresca and H. Li, "Morphological operations on mesh connected architecture: A generalized convolution algorithm," in Proc. Comput. Vision and Pattern Recog., Miami Beach, FL, June 22-26, 1986, pp. 299-304.
[14] P. Maragos and R. W. Schafer, "A unification of linear, median, order-statistics, and morphological filters under mathematical morphology," in Proc. 1985 IEEE Int. Conf. Acoust. Speech Signal Processing, Tampa, FL, Mar. 1985, pp. 1329-1332.
[15] P. Maragos and R. W. Schafer, "Morphological filters-Part I: Their set-theoretic analysis and relations to linear shift-invariant filters," IEEE Trans. Acoust. Speech Signal Proc., vol. ASSP-35, no. 8, pp. 1153-1169, Aug. 1987.
[16] P. Maragos and R. W. Schafer, "Morphological filters-Part II: Their relations to median, order-statistic, and stack filters," IEEE Trans. Acoust. Speech Signal Proc., vol. ASSP-35, no. 8, pp. 1170-1184, Aug. 1987.
[17] G. Matheron, Random Sets and Integral Geometry. New York: Wiley, 1975.
[18] H. Minkowski, "Volumen und Oberfläche," Math. Ann., vol. 57, pp. 447-495, 1903.
[19] Y. Nakagawa and A. Rosenfeld, "A note on the use of the local min and max operations in digital picture processing," IEEE Trans. Syst. Man Cybern., vol. SMC-8, no. 8, pp. 632-637, 1978.
[20] K. Preston, Jr., "Xi filters," IEEE Trans. Acoust. Speech Signal Proc., vol. ASSP-31, no. 4, Aug. 1983.
[21] K. Preston, Jr. and M. J. B. Duff, Modern Cellular Automata. New York: Plenum Press, 1984.
[22] G. X. Ritter, M. A. Shrader-Frechette, and J. N. Wilson, "Image algebra: A rigorous and unified way of expressing all image processing operations," in Proc. 1987 SPIE Technical Symp. Southeast on Optics, Elec.-Optics and Sensors, Orlando, FL, 1987.
[23] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing (A Bradford Book). Cambridge, MA: MIT Press, 1986.
[24] L. A. Schmitt and S. S. Wilson, "The AIS-5000 parallel processor," IEEE Trans. Pattern Anal. Machine Intell., vol. 10, no. 3, pp. 320-330, May 1988.
[25] J. Serra, Image Analysis and Mathematical Morphology. London: Academic Press, 1982.
[26] S. R. Sternberg, "Grey-scale morphology," Comput. Vision Graph. Image Proc., vol. 35, no. 3, pp. 333-355, 1986.
[27] S. S. Wilson, "The PIXIE-5000, a systolic array processor," in Proc. IEEE Workshop Comput. Architecture Pattern Anal. Image Database Management, Miami Beach, FL, Nov. 1985, pp. 477-483.
[28] S. S. Wilson, "One-dimensional SIMD architectures-the AIS-5000," in Multicomputer Vision, S. Levialdi, Ed. London: Academic Press, 1988.

Recognizing and Locating Partially Occluded 2-D Objects: Symbolic Clustering Method

VINCENT S. S. WANG, MEMBER, IEEE

Abstract-Recognition of objects using a model can be formulated as finding affine transforms such that the locations of all object features are consistent with the projected positions of the model from a single view. A method for computing the transform using the symbolic clustering method is described. The advantage of this approach is that the desired transform can be computed efficiently and in parallel. The author's experiments support this observation.

I. INTRODUCTION

A primary objective of an object recognition system is to recognize the identity and position of the desired object in a scene. The complexity of the system is often determined by whether the objects are two- or three-dimensional (2-D or 3-D), whether the objects are against a contrasting background, and whether the objects are isolated or partially occluded. In this paper, we describe a technique for identifying partially occluded complex 2-D objects against a complex background using object models.

The task of recognizing partially occluded 2-D objects can often be described as searching for a mapping between part of the image and a particular view of the object. One possible approach is the hypothesize-and-test strategy. Possible locations of the

Manuscript received September 20, 1988; revised March 30, 1989. This work was supported in part by the National Science Foundation under Grant DCR-8517513. This work was presented in part at the IEEE Workshop on Computer Vision, Miami Beach, FL, 1987 and at the SPIE Applications of Artificial Intelligence VI, Orlando, FL, 1988.

The author was with the Artificial Intelligence Laboratory, Department of Computer Sciences, University of Texas, Austin, TX. He is now with the Artificial Intelligence Technical Center, Washington C3I Division, The MITRE Corporation, 7525 Colshire Drive, McLean, VA 22102.

IEEE Log Number 8930342.

0018-9472/89/1100-1644$01.00 © 1989 IEEE



Transform T1: match set = {(A,a), (B,b), (C,c)}. Transform T2: match set = {(D,d), (E,e), (F,f)}.

Fig. 1. (a) Airplane. (b) Test image. (c) Several possible transforms. Projected object is shown in dotted lines.

object are first computed. This is followed by the verification of the presence of the object at the computed locations. Three basic approaches have been explored; they differ mainly in the way they generate the hypotheses.

The first approach is to generate the hypotheses based on no prior information about the image. This approach assumes that the object may be anywhere in the image. The hypothesis that minimizes some correlation error measurement is taken to be the correct hypothesis. An example of this approach is template-based matching [6]. In this approach, the generation of hypotheses requires no computation, while the verification of the hypotheses requires extensive computation.

The second approach is to generate the hypotheses based on all the available image information. Hypotheses are generated after extensive analysis of the image is performed. No verification is needed once the hypotheses are generated. Examples of this approach include the generalized Hough transform [2], [3], graph matching, and relaxation [3]. In this approach, hypothesis generation requires extensive computation while little or no computation is required for the verification process.

The third approach is to generate the hypotheses based on some image information. Object models are analyzed and salient features are computed. Hypotheses are generated after the salient features are located in the image. Further analysis is required to refine the hypotheses. Examples of this approach include the alignment technique [7], local feature focus [5], and recursive refinement [1]. Moderate amounts of computation are required for the generation and the verification process. The advantage of this approach is that through the careful selection of hypotheses the total amount of computation could be reduced.

This correspondence describes a method for computing the transform based on the symbolic clustering method using the

Fig. 2. (a) Clusters consisting of two hypotheses. (b) Projected objects shown in dotted lines.

third approach. Matches between image features and object features are explored to generate hypotheses of possible object locations. Consistent hypotheses are grouped to form clusters. Supporting evidence of the participating hypotheses of a cluster is collected to generate a new transform hypothesis. The clusters that contain a sufficient amount of evidence are selected for further verification. Hypotheses are verified by comparing the object against the image directly. The advantage of this approach is that the basic hypotheses can be computed easily and in parallel and the clusters can be generated by combining the basic hypotheses. Also, since clusters with strong support are selected and investigated first, the probability that the correct transforms are computed earlier in the hypothesize-and-test process is high. Therefore, the total amount of computation for the recognition task may be reduced.

To demonstrate the technique, we present several examples of recognizing two-dimensional, rigid objects under orthographic projection. Partial occlusion is allowed. We first present our formulation of the object recognition task. Section III describes the symbolic clustering algorithm. An object recognition system based on the symbolic clustering technique is described in Section IV. Several examples are presented in Section V. Section VI presents the conclusions.

II. PROBLEM FORMULATION

Let us assume that the objects to be recognized are flat or almost flat so that they can be regarded as two-dimensional objects. The object recognition task can be formulated as the process of computing the affine transforms that map the object model into image coordinates. Furthermore, using the hypothesize-and-test paradigm, the object recognition task can be formulated as follows.

1) Hypothesis generation: searching for sets of matches between the object model and image features in order to generate possible transforms;

2) Hypothesis verification: verification of the computed transforms.

Since a hypothesis can be verified effectively by comparing the object against the image directly, and since the complexity of this verification is usually constant, the complexity of this approach is determined by the complexity of the hypothesis generation process as well as by the quality of the hypotheses generated.

Given a set of matches between object and image features, a transform can be computed using the least-squares fitting method [9], [11]. Let us call the computed transform a hypothesis and the set of matches used to compute the transform its match-set. Mathematically, when a small set of matches (e.g., three noncollinear object points and their corresponding image points for orthographic



Fig. 3. (a) Cluster consisting of three hypotheses. (b) Projected object shown in dotted lines.

2-D projection) is determined, the transform can be computed. In practice, due to noise and uncertainty in the feature extraction process, the computation of the transform using a small match-set is usually inaccurate and unreliable. It is often necessary to use a large match-set in order to obtain reliable results. However, the amount of computation required to compute the match-set grows exponentially with the size of the match-set (e.g., the maximum-clique-finding algorithm). When the object is complex, it is not practical to find large match-sets directly.
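As a concrete illustration of the fitting step, a least-squares affine fit to a match-set might look as follows. This is a sketch under the paper's assumptions, not the paper's own implementation; the function names are invented.

```python
def solve3(M, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    a = [row[:] + [bi] for row, bi in zip(M, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(3):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[r][3] / a[r][r] for r in range(3)]

def fit_affine(matches):
    """Least-squares affine transform (A, t) mapping model points to image
    points, given a match-set of (model_point, image_point) pairs.
    Minimizes the squared residual; needs >= 3 noncollinear matches."""
    rows = [(mx, my, 1.0) for (mx, my), _ in matches]
    # Normal equations: (X^T X) p = X^T b, solved once per output coordinate.
    XtX = [[sum(r[p] * r[q] for r in rows) for q in range(3)] for p in range(3)]
    params = []
    for k in (0, 1):
        Xtb = [sum(r[p] * img[k] for r, (_, img) in zip(rows, matches))
               for p in range(3)]
        params.append(solve3(XtX, Xtb))
    (a11, a12, tx), (a21, a22, ty) = params
    return ((a11, a12), (a21, a22)), (tx, ty)

# Three noncollinear points under a pure translation of (5, -2):
A, t = fit_affine([((0, 0), (5, -2)), ((1, 0), (6, -2)), ((0, 1), (5, -1))])
```

With more than three matches, the residual of the fit also gives a natural measure of hypothesis quality, which is one reason larger match-sets yield more reliable transforms.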

One approach to match-set generation is to use the symbolic clustering technique. Small match-sets are first computed by matching object features against image features. A transform hypothesis is computed for each match-set. All the transforms are recorded in a database. Clusters of "consistent" transform hypotheses (e.g., transforms that map the object into similar locations in the image) are computed. Each cluster is represented by a new transform whose match-set is the union of the match-sets of the participating hypotheses. This process continues until a "good" transform hypothesis is computed. At this point, the hypothesis is selected for further verification.

Fig. 1 shows several transform hypotheses generated during the recognition of an airplane. Fig. 1(a) shows the airplane model, and Fig. 1(b) shows a partially occluded airplane instance. Fig. 1(c) shows several computed transforms and their corresponding match-sets. For example, transform T1 is generated based on the match of corner abc (see Fig. 1(b)) with corner ABC (see Fig. 1(a)). The airplane is displayed (in dotted lines) in the image coordinates. Let Mi be the match-set of Ti, 1 ≤ i ≤ 3. Fig. 2(a) shows the clusters and the associated match-sets constructed by the symbolic clustering process. First, clusters Cij = {Ti, Tj}, 1 ≤ i < j ≤ 3, are constructed. A new transform Tij is computed for each cluster Cij based on match-set Mi ∪ Mj. Fig. 2(b) shows the resulting transforms. Once this is completed, cluster C123 = {T1, T2, T3}, which is the largest cluster, can be constructed. Its corresponding transform hypothesis T123 can be computed using match-set M1 ∪ M2 ∪ M3. Transform hypothesis T123 and its match-set are shown in Fig. 3.

Note that through symbolic clustering, a more accurate transform hypothesis is constructed without resorting to exhaustive matching of object features against image features. Initially, many transform hypotheses may be generated. As clusters are computed, transforms with larger match-sets are generated. These newly generated transforms are more accurate than the basic transforms since they are computed using larger match-sets. For example, the match-set of transform T123 contains nine elements, and the matches have a wider spatial distribution than those of Ti, 1 ≤ i ≤ 3. It is clear from Fig. 1 that T123 is more accurate than Ti, 1 ≤ i ≤ 3.

III. SYMBOLIC CLUSTERING

An important task in recognizing objects using the hypothesize-and-test approach is the generation of "good" hypotheses. One approach is to use the symbolic clustering technique, where a good hypothesis is generated by the clustering of consistent hypotheses.

In symbolic clustering, hypotheses with similar transform parameters are clustered to form situations (a situation is a cluster of consistent hypotheses). A new transform hypothesis is computed for each situation. The most promising hypothesis is then selected for further analysis.

The clustering of consistent hypotheses in our approach is similar to the generalized Hough transform in the sense that transforms with similar parameters are clustered into more reliable hypotheses. Our clustering algorithm differs from the generalized Hough transform in the way hypotheses are represented and combined. In symbolic clustering, each hypothesis is attached to its supporting evidence (i.e., the match-set). As hypotheses are clustered, the supporting evidence is combined symbolically (e.g., by the union of the match-sets). In the rest of this section, we describe the symbolic clustering technique.

A. Representation of Hypothesis

Each hypothesis represents a possible transform, generated from a given match-set, that maps an object into image coordinates. The transform parameters can be computed once the match-set is given [9]. A hypothesis H is represented symbolically by a tuple consisting of H's transform parameters (say P) and its supporting match-set (say M) (i.e., H is represented as {P, M}).

Under our assumptions, the transform parameters can be described by a five-tuple (Tx, Ty, Sx, Sy, θ). Note that the scaling factors Sx and Sy are unity under the orthographic projection assumption. In practice, Sx and Sy may not always be 1 due to the nonlinearity of the camera's optical system. When the amount of distortion is small, the distortion does not affect the clustering process.
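As an illustration, the {P, M} pair might be represented as a small data structure. The field names and the sample values below are invented for illustration, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    """A transform hypothesis: parameters P = (Tx, Ty, Sx, Sy, theta)
    and the supporting match-set M of (model_feature, image_feature) pairs."""
    tx: float      # translation in x (pixels)
    ty: float      # translation in y (pixels)
    sx: float      # x scaling factor (near 1 under orthographic projection)
    sy: float      # y scaling factor
    theta: float   # rotation (degrees)
    matches: frozenset  # the match-set M

    @property
    def params(self):
        return (self.tx, self.ty, self.sx, self.sy, self.theta)

# A hypothesis supported by three corner matches, as in Fig. 1:
t1 = Hypothesis(12.0, 30.0, 1.0, 1.0, 15.0,
                frozenset([("A", "a"), ("B", "b"), ("C", "c")]))
```

Making the structure immutable (frozen, with a frozenset match-set) lets hypotheses be placed directly into the sets and unions used by the clustering step.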

B. Consistency Between Hypotheses

Two hypotheses H1: {(Tx1, Ty1, Sx1, Sy1, θ1), M1} and H2: {(Tx2, Ty2, Sx2, Sy2, θ2), M2} are consistent iff

1) abs(Tx1 - Tx2) + abs(Ty1 - Ty2) ≤ threshold_T,
2) abs(θ1 - θ2) ≤ threshold_θ,

where abs(a) returns the absolute value of a. Scaling factors are not taken into account since they are close to unity. However, a hypothesis is rejected whenever either one of its scaling factors differs from unity by more than 0.15. In our experiment, threshold_T is 15 pixels and threshold_θ is 20 degrees.

Let us call a pair of consistent hypotheses 2-consistent. A set of n (n > 2) hypotheses is n-consistent if every possible subset of (n - 1) of the hypotheses is (n - 1)-consistent. Note that this definition implies that a set of n hypotheses is n-consistent if and only if every possible pair of hypotheses in the set is 2-consistent.
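Because n-consistency reduces to pairwise checks, the test is simple to sketch. In the illustrative code below, hypotheses are bare parameter tuples (Tx, Ty, Sx, Sy, theta) and the thresholds are the experimental values above; none of the names come from the paper.

```python
T_THRESH = 15.0   # pixels: threshold on translation difference
A_THRESH = 20.0   # degrees: threshold on rotation difference

def consistent(h1, h2):
    """2-consistency test between hypotheses (Tx, Ty, Sx, Sy, theta)."""
    (tx1, ty1, _, _, th1), (tx2, ty2, _, _, th2) = h1, h2
    return (abs(tx1 - tx2) + abs(ty1 - ty2) <= T_THRESH
            and abs(th1 - th2) <= A_THRESH)

def n_consistent(hyps):
    """A set of n hypotheses is n-consistent iff every pair is 2-consistent."""
    hyps = list(hyps)
    return all(consistent(hyps[i], hyps[j])
               for i in range(len(hyps)) for j in range(i + 1, len(hyps)))

h1 = (10.0, 20.0, 1.0, 1.0, 30.0)
h2 = (14.0, 25.0, 1.0, 1.0, 40.0)   # within both thresholds of h1
h3 = (90.0, 20.0, 1.0, 1.0, 30.0)   # translation too far from h1 and h2
ok = n_consistent([h1, h2])
bad = n_consistent([h1, h2, h3])
```

A production version would also apply the scaling-factor rejection rule and handle angle wrap-around near 0/360 degrees.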



Fig. 4. (a) SLDB before T3 is inserted. (b) SLDB after T3 is inserted.

Fig. 5. Two phases of the object recognition system: the training phase and the recognition phase.

C. Formation and Selection of Situations

A situation is a set of consistent hypotheses. The hypotheses are said to participate in the situation and form the P-set of the situation. A situation S1 is said to be at a higher level than S2 if the P-set of S2 is a subset of that of S1. This ordering is used to organize the situations into a hierarchical structure called the situation lattice. The roots of the lattice are the maximal elements of the lattice. The height of a root is defined to be the cardinality of its P-set. The situation lattices are recorded in the situation lattice database (SLDB). The SLDB is dynamically updated whenever a new hypothesis is generated. The update algorithm is described in the following paragraphs.

When a new hypothesis Hnew is generated, the current SLDB is updated by first computing the set U, which contains all previously generated hypotheses that are consistent with Hnew. Then, we iteratively compute all lists of n-consistent hypotheses for those hypotheses in the set U. Each such list of n-consistent hypotheses forms the P-set of some situation. Algorithm 1 describes this process. The complexity of this algorithm is discussed in Section III-D.

Fig. 4 shows an example of how the SLDB is updated when hypothesis T3 is inserted. A situation is denoted by its participating hypotheses. Fig. 4(a) shows the situation lattice before the insertion of T3 (see Fig. 1 for hypotheses Ti, 1 ≤ i ≤ 3). Since T3 is consistent with both T1 and T2, the set U would then include T1 and T2. The first time that Step 3 is evaluated, set R contains the following situations:

S13 = {T1, T3}, S23 = {T2, T3}.

The second time that Step 3 is evaluated, set R contains the following situation:

S123 = {T1, T2, T3}.

The updating stops at the third iteration. Fig. 4(b) shows the situation lattice after the updating process.

A situation with the largest-cardinality P-set is selected for verification. If there is more than one possible candidate, one is selected arbitrarily. Other selection criteria can be used. For example, one can select a P-set based on the spatial distribution of the points in the match-sets of the participating hypotheses (select the one with the widest spread).

For each situation selected, a new hypothesis is generated. The match-set of the hypothesis is the union of the match-sets of the participating hypotheses, and its transform parameters are determined by the computed match-set. A situation is removed once it is selected.

Fig. 6. Control block diagram of the object recognition system.

Fig. 7. Sample points for the airplane object.

Fig. 8. Image of refrigerator magnet.

D. Complexity Analysis

The complexity of the symbolic clustering algorithm is a function of the number of basic transform hypotheses in the SLDB, the number of roots in the situation lattice, and the average height of the roots. Let M be the total number of basic transforms, W the total number of roots, and K the average height of the roots. The complexity of generating a root with height K is 2^K. W is bounded above by M^2. The complexity of Algorithm 1 is W × 2^K + M. Algorithm 1 is as follows.

Algorithm 1 - Updating the Situation Lattice

Step 1: Suppose the newly generated hypothesis is H_new. Compute the set U.

Step 2: Set N = 2. Compute the set R of all the N-consistent hypotheses for the hypotheses in U. Remove any that do not contain H_new.

Step 3: If R is empty, then exit. Otherwise, insert all the elements of R into the situation lattice.

Step 4: Increment N by 1. Construct all the pairs of elements in R. Represent each pair by the union of the members in each element. Remove any pair that is not N-consistent or does not contain H_new. Set R to be the set of resulting N-consistent hypotheses.

Step 5: Go to Step 3.

Fig. 9. Preprocessing results of the refrigerator magnet image. (a) Edge segments. (b) Salient edge segments. (c) Corners. (d) Model sample points.
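For concreteness, Algorithm 1 can be sketched in Python. This is a hypothetical illustration, not the paper's implementation: each hypothesis is modeled as a frozenset of basic transform matches, and `n_consistent(h, n)` is an assumed application-supplied predicate for N-consistency.

```python
from itertools import combinations

def update_situation_lattice(h_new, u, n_consistent, lattice):
    """Sketch of Algorithm 1: cluster the new hypothesis h_new with the
    hypotheses in U, inserting N-consistent unions into the lattice.
    Hypotheses are modeled as frozensets of basic transform matches;
    n_consistent(h, n) is an assumed application-supplied predicate."""
    n = 2
    # Step 2: all 2-consistent unions of pairs drawn from U that contain h_new.
    r = {a | b for a, b in combinations(u, 2)
         if h_new <= (a | b) and n_consistent(a | b, n)}
    while r:                      # Step 3: exit when R is empty
        lattice.update(r)         # insert all elements of R into the lattice
        n += 1                    # Step 4: grow unions one level higher
        r = {a | b for a, b in combinations(r, 2)
             if h_new <= (a | b) and n_consistent(a | b, n)}
    return lattice

```

With a toy consistency predicate (elements must lie within a small range), the clustering grows unions until no larger N-consistent set containing the new hypothesis exists.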

In general, M and W can be large, and the computational cost of Algorithm 1 can be very high. Two methods are used to control the complexity of the clustering algorithm. First, the parameter space of all possible transforms (partitioned along the x, y, and θ axes in our system) is divided into disjoint bins, similar to the partitioning of the parameter space in the generalized Hough transform. An SLDB is generated for each bin, and only hypotheses in the same bin are checked for consistency. This approach reduces the size of each SLDB (i.e., M) as well as the number of roots (W). Second, the clustering of hypotheses in a bin is halted temporarily whenever a root whose height exceeds an upper bound B (in our experiment, B is set to 4) is generated. In this case, all the situations whose P-sets contain more than B elements are selected for verification. The clustering process continues once these situations are removed from the SLDB.
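The binning of the transform parameter space can be sketched as follows. The ranges and bin counts here are illustrative assumptions (roughly mirroring the 20 x 24 x 20 partition used in the experiments), not the system's exact values.

```python
def bin_index(tx, ty, theta_deg,
              x_range=(0.0, 400.0), y_range=(0.0, 460.0),
              nx=20, ny=24, ntheta=20):
    """Map a transform hypothesis to its disjoint parameter-space bin.
    Ranges and bin counts are illustrative, not the paper's exact values."""
    span_x = x_range[1] - x_range[0]
    span_y = y_range[1] - y_range[0]
    bx = min(nx - 1, max(0, int((tx - x_range[0]) / span_x * nx)))
    by = min(ny - 1, max(0, int((ty - y_range[0]) / span_y * ny)))
    bt = int((theta_deg % 360.0) / 360.0 * ntheta) % ntheta
    return bx, by, bt

```

Only hypotheses that fall into the same bin are checked for consistency, so each bin maintains its own SLDB.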

IV. AN OBJECT RECOGNITION SYSTEM USING SYMBOLIC CLUSTERING

Fig. 10. First test image of the magnet.

A. Overview

The top level of our object recognition system is shown in Fig. 5. The system consists of two phases: a model acquisition phase and an object recognition phase.

During the model acquisition phase, the object is shown to the system. The system extracts features and constructs the object model automatically. The model is represented as a list of corners and sample points.

Fig. 11. Preprocessing results of Fig. 10. (a) Edge segments. (b) Salient edge segments. (c) Corners.

The recognition phase consists of the following steps.

Step 1: Preprocessing: Extract image features from the image.

Step 2: Hypothesis generation: Match image features against the object model to generate possible transform hypotheses.

Step 3: Symbolic clustering and hypothesis selection: Cluster "consistent" transform hypotheses and select promising transform hypotheses for further analysis.

Step 4: Hypothesis verification: For each selected transform hypothesis, verify the hypothesis using the object model.

Step 1 is performed once for each image. Steps 2, 3, and 4 are performed iteratively. The analysis terminates when the goal is accomplished (the goal is examined at Step 4).
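The iterative control structure above can be sketched as a loop; the helper callables here are hypothetical stand-ins for the system's components (preprocessing, hypothesis generation, clustering/selection, and verification), passed in as parameters.

```python
def recognize(image, model, preprocess, generate_hypotheses,
              cluster_and_select, verify, goal_instances=1):
    """Sketch of the recognition control loop (Steps 1-4). The four helper
    callables are hypothetical stand-ins for the system's components."""
    features = preprocess(image)                 # Step 1: run once per image
    instances = []
    for hypothesis in generate_hypotheses(features, model):   # Step 2
        for candidate in cluster_and_select(hypothesis):      # Step 3
            if verify(candidate, features, model):            # Step 4
                instances.append(candidate)
                if len(instances) >= goal_instances:  # goal examined at Step 4
                    return instances
    return instances

```

The goal check sits inside Step 4, so the loop stops as soon as the requested number of object instances has been verified.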

A control-flow block diagram of the object recognition system is shown in Fig. 6. Image features are first computed by the low-level vision system (LLVS) and are stored in an iconic database. The feature matcher matches the computed features against the stored object features to generate possible transform hypotheses. All the hypotheses are stored in the situation lattice database. The situation selector selects prominent situations and provides them to the verifier. Finally, the verifier checks whether the hypothesis is correct.

In the following subsections, we describe preprocessing, hypothesis generation, goal description, and some performance analysis.


Fig. 12. (a) Verification result: measurement. θ = -32.57 degrees; Tx = 219.29; Ty = -59.95; Sx = 1.03; Sy = 0.97; Percentage of Verification = 73.3%; Error(T) = 1.94 pixels. (b) Verification result: projected object is shown in solid white lines.

Fig. 13. Second test image of the magnet.

B. Preprocessing

The object recognition system accepts gray-level images. To extract linear edge segments, the Sobel edge operator is applied to the image. The result is thinned using a nonmaximum suppression operation [12]. Branch points are then removed, and eight-connected contours are extracted. A recursive split algorithm is then used to construct connected linear edge segments [6]. Right-hand orientation is used to define an edge contour's orientation [9]. Finally, least-squares fitting is used to compute line equations for the connected linear edge segments [6]. Attributes computed for each linear edge segment include the orientation, the line equation, the centroid, the length, and the average intensity on both sides.

To extract corners, the following steps are used:

1) For each linear edge segment (L), find a non-collinear line segment (R) such that the distance between L's ending point and R's starting point is the shortest (over all line segments).

2) Construct a corner C using L and R. The angle of the corner is the orientation of R minus the orientation of L. L is said to be the left side of C and R the right side of C.

To reduce the number of corners to be processed, we compute only those corners formed by edge segments whose length is longer than seven pixels and whose contrast (the difference of the average intensity on the two sides of the edge segment) is more than 20. These edge segments are called salient edge segments. Also, the search for a pairing edge segment is limited to a radius of no more than 15 pixels. Attributes computed for each corner include the angle, the two sides, and the vertex (i.e., the intersection of the sides).

C. Hypothesis Generation

Corners extracted from the image are matched against the object model. For every match of corners, a three-element match-set is generated. This match-set is then used to compute the transform hypothesis.

An object corner "matches" a corner in the image if the difference between their angles is small (20 degrees in our experiment). Whenever two matching corners (say an object corner C_o and an image corner C_i) are found, we construct a three-element match-set

{(V_o, V_i), (V_o^l, V_i^l), (V_o^r, V_i^r)}

where V_o and V_i are the vertices of corners C_o and C_i, respectively; V_o^l and V_i^l are points on the left sides of corners C_o and C_i that are dist_l pixels away from the vertices, respectively; and V_o^r and V_i^r are points on the right sides of corners C_o and C_i that are dist_r pixels away from the vertices, respectively. dist_l and dist_r are both 10 in our experiment.
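The construction of the three-element match-set can be sketched as follows. The corner representation (a dict with a vertex, two side directions, and an angle, all in degrees) is a hypothetical illustration, not the system's data structure.

```python
import math

def corner_point(vertex, direction_deg, dist):
    """A point `dist` pixels from the vertex along a side's direction."""
    vx, vy = vertex
    rad = math.radians(direction_deg)
    return (vx + dist * math.cos(rad), vy + dist * math.sin(rad))

def match_set(obj_corner, img_corner, dist_l=10, dist_r=10, angle_tol=20):
    """Build the three-element match-set for a pair of matching corners.
    Corners are hypothetical dicts with 'vertex', 'left_dir', 'right_dir',
    and 'angle' (all angles in degrees)."""
    if abs(obj_corner["angle"] - img_corner["angle"]) > angle_tol:
        return None                     # angles differ too much: no match
    pairs = [(obj_corner["vertex"], img_corner["vertex"])]
    for side, dist in (("left_dir", dist_l), ("right_dir", dist_r)):
        pairs.append((corner_point(obj_corner["vertex"], obj_corner[side], dist),
                      corner_point(img_corner["vertex"], img_corner[side], dist)))
    return pairs

```

The resulting three point correspondences (vertex, left-side point, right-side point) are exactly what the transform computation consumes.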

A transform hypothesis can be computed for each possible match of object corners. Using the matching criterion, a corner may have several possible matches. Although one may eliminate some false matches by considering the intensity statistics of the regions outlined by the corners, a unique match is not always achievable.
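A least-squares fit of a transform to the three correspondences can be sketched with complex arithmetic; this is a simplified stand-in that recovers rotation, a single uniform scale, and translation, whereas the paper's transforms carry separate Sx and Sy scales.

```python
import cmath
import math

def fit_similarity(pairs):
    """Least-squares similarity transform q ~ a*p + b over complex points,
    where a encodes rotation plus uniform scale and b the translation.
    `pairs` is a list of ((px, py), (qx, qy)) correspondences.
    (A sketch: the system's transforms have separate x and y scales,
    which a uniform-scale fit cannot distinguish.)"""
    ps = [complex(x, y) for (x, y), _ in pairs]
    qs = [complex(x, y) for _, (x, y) in pairs]
    pm = sum(ps) / len(ps)                      # centroids
    qm = sum(qs) / len(qs)
    num = sum((p - pm).conjugate() * (q - qm) for p, q in zip(ps, qs))
    den = sum(abs(p - pm) ** 2 for p in ps)
    a = num / den                               # complex least-squares gain
    b = qm - a * pm                             # translation
    return math.degrees(cmath.phase(a)), abs(a), (b.real, b.imag)

```

On noise-free correspondences generated by a known similarity transform, the fit recovers the parameters exactly (up to floating-point error).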

D. Hypothesis Verification

To verify a transform hypothesis, the projected object model is compared against the image directly. First, a set of sample object points is computed along the edge segments of the object. A sample object point is selected every N pixels (N is 10 in our experiment) along each edge segment. The attributes computed for each sample object point include the position (x, y coordinates) and the orientation of the line segment passing through that point. This approach is similar to that used in [10]. Fig. 7 shows the sample object points for the airplane object, where sample points are marked by "+".

Let T_v be the transform hypothesis to be verified. Each point in the sample set (say P_i) is mapped into image coordinates using T_v (say P_i^T). Then the presence of edge points in the


Fig. 14. Preprocessing results of Fig. 13. (a) Edge segments. (b) Salient edge segments. (c) Corners.

image (within an eight-pixel radius of P_i^T) that have similar orientations (within a 15-degree difference) is checked. In case of multiple candidates, the image point that is closest to P_i^T is chosen. Let Q_i be the edge point found. An error measurement

error(T_v) = average distance between P_i^T and Q_i

is computed for the verification. Note that error(T_v) is computed only for those points where matching edge points are found.

If more than 50 percent of the sample object points have matching edge points and the error measurement is less than five pixels, the object is said to be present and the transform hypothesis is accepted. In this case, an object instance is said to have

been found. Otherwise, the transform hypothesis is rejected. Note that since objects may be partially occluded, it is not practical to require a high percentage of the sample set to have matching edge points in the image. Instead, it is more practical to require that the error measurement be small.
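The acceptance test (at least half of the sample points matched, mean error under five pixels) can be sketched as follows. `find_matching_edge` is a hypothetical helper returning the distance to the closest compatible edge point, or None when none is found within the search radius.

```python
def accept_hypothesis(sample_points, find_matching_edge,
                      min_fraction=0.5, max_error=5.0):
    """Sketch of the acceptance test: a hypothesis is accepted when at least
    half of the projected sample points find a matching edge point and the
    average distance to those matches is under five pixels.
    `find_matching_edge(pt)` is a hypothetical helper returning the distance
    to the closest compatible edge point, or None if none is found."""
    dists = [d for d in map(find_matching_edge, sample_points) if d is not None]
    if not sample_points or len(dists) / len(sample_points) < min_fraction:
        return False                    # too few sample points matched
    return sum(dists) / len(dists) < max_error   # mean-error criterion

```

Note that, as in the text, the error is averaged only over the sample points that found a match, which is what makes the test tolerant of partial occlusion.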

It is possible that the same refined transform hypothesis may be generated at different times during the recognition process. To avoid redundant effort, a processed-set is maintained. The processed-set is initially empty. Whenever a refined hypothesis is computed, it is compared against the processed-set. If a similar hypothesis exists, no verification is performed. Otherwise, the hypothesis is verified and is inserted into the processed-set.
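The processed-set bookkeeping can be sketched in a few lines; the `verify` and `similar` callables are hypothetical stand-ins for the verification procedure and the hypothesis-similarity test.

```python
def verify_once(hypothesis, processed, verify, similar):
    """Sketch of the processed-set bookkeeping: skip verification when a
    similar hypothesis was already handled, otherwise verify and record it.
    `verify` and `similar` are hypothetical callables."""
    if any(similar(hypothesis, seen) for seen in processed):
        return None                     # redundant: already processed
    processed.append(hypothesis)
    return verify(hypothesis)

```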


Fig. 15. (a) Verification result: measurement. θ = 62.51 degrees; Tx = 258.59; Ty = -28.66; Sx = 1.07; Sy = 0.9; Percentage of Verification = 74.8%; Error(T) = 2.4 pixels. (b) Verification result: projected object is shown in solid white lines.

E. Description of Goal

The goal is described by the number of object instances to be searched for. The recognition process stops whenever the specified number of object instances is found.

Two types of goals are often encountered. One is to find exactly one object instance; this goal can be described as finding one object instance in the image. The other is to find all object instances; this goal can be described as finding P object instances in the image, where P is a very large number.

F. Performance Analysis

The performance of the system depends on the number of situations selected as well as on the sequence in which the hypotheses are generated. Domain knowledge can be used to improve the system's performance. Examples include the use of salient features to generate reliable hypotheses, and the use of probability to measure the likelihood of matching corners caused by accidental grouping. In our experiment, the features are randomly ordered.

If the goal is to find one object instance, the system terminates when the first object instance is found. Usually, only a few situations are examined before the object instance is found. Since the verification of a hypothesis is initiated whenever a promising hypothesis is generated, it is often the case that only a small portion of the basic hypotheses has been generated when the first object instance is found. In practice, the complexity of achieving such a goal is much smaller than the theoretical complexity given in Section III-D. Our experiments support this observation. However, when the goal is to find all object instances, exhaustive processing of all situations is necessary. Such a goal requires extensive computation.

Potential parallelism in our system can be identified. First, the matching of object corners and image corners can be performed

Fig. 16. Image of beer bottle cap.

in parallel. Second, while the transform hypotheses are being computed, the SLDB can be updated simultaneously. Third, once promising situations are constructed, the verification process can be performed immediately.

V. EXAMPLES

We implemented the object recognition system on a Symbolics 3670. This section presents two examples. In each of the exam- ples, the object model is first shown to the system. The system constructs the model automatically. Then test images that con- tain the object are shown to the system. The goal of the system is set to find an object instance in the image.

A. Example 1 - A Refrigerator Magnet

The first object is a refrigerator magnet. This magnet is almost flat. The magnet is shown to the system by placing it on a sheet of black paper (Fig. 8). This image is 384 by 384 with 256 intensity levels. Fig. 9 shows the linear edge segments, the corners, and the model sample points computed. There are 1084 edge segments (Fig. 9(a)), and 313 of these edge segments are salient (Fig. 9(b)). Using these salient edge segments, 151 corners are computed (Fig. 9(c)). There are 652 sample object points (Fig. 9(d)). Note that some edge segments are extracted from the "background". It would be quite easy to remove these nonobject edge segments; to demonstrate the robustness of our system, we choose not to remove them.

Fig. 10 shows the first test image, where the refrigerator magnet is put on a magazine and is partially occluded by the magazine. This image is 400 by 460 with 256 intensity levels. Fig. 11 shows the linear edge segments (Fig. 11(a)), the salient edge segments (Fig. 11(b)), and the corners computed (Fig. 11(c)). There are 3427 edge segments, of which 404 are marked as salient; 196 corners are computed. The transform parameters and the error measurement of the verified transform hypothesis are given in Fig. 12(a). Fig. 12(b) shows the magnet instance computed (in solid white lines). Visual inspection shows that the computed transform is very accurate.

In this example, the parameter space is partitioned into 9600 bins (i.e., a partition of 20, 24, 20 bins in the x, y, θ axes, respectively). The upper bound on the height of a root is set to 4. The total number of possible transforms, based on the match of the angles of corners in the image and in the object model, is 3552. However, our system generates only 106 transforms, or 3 percent of the total possible transforms, when the correct object instance is found. The total number of nonempty SLDBs is 1853, and each SLDB contains, on the average, 1.29 roots. The largest SLDB generated contains 6 roots. The total processing time excluding the edge extraction process is 106 seconds.


Fig. 13 shows the second test image, where the refrigerator magnet is put on a magazine with a pen on the magnet. This image is 400 by 460 with 256 intensity levels. Fig. 14 shows the linear edge segments (Fig. 14(a)), the salient edge segments (Fig. 14(b)), and the corners computed (Fig. 14(c)). There are 2693 edge segments, of which 428 are marked as salient; 172 corners are computed. The transform parameters and the error measurement of the verified transform hypothesis are given in Fig. 15(a). Fig. 15(b) shows the magnet instance computed (in solid white lines). The computed transform is very accurate in spite of the occlusion and the shadow cast on the magnet by the pen.

In this example, the parameter space is partitioned similarly to the previous example. The upper bound on the height of a root is also set to 4. The total number of possible transforms is 2869. However, our system generates only 189 transforms, or 6.6 percent of the total possible transforms, when the correct object instance is found. The total number of nonempty SLDBs is 2643, and each SLDB contains, on the average, 1.71 roots. The largest SLDB generated contains 12 roots. The total processing time excluding the edge extraction process is 165 s.

B. Example 2 - A Beer Bottle Cap

The second object is a beer bottle cap. The cap is shown to the system by placing it on a sheet of black paper (Fig. 16). This image is 384 by 384 with 256 intensity levels. Fig. 17 shows the linear edge segments (Fig. 17(a)), the salient edge segments (Fig. 17(b)), the corners (Fig. 17(c)), and the model sample points computed (Fig. 17(d)). There are 950 edge segments, and 214 of these edge segments are salient. Using these salient edge segments, 92 corners are computed. There are 471 sample object points.

Fig. 18. First test image of the bottle cap.

Fig. 18 shows the first test image, where the beer bottle cap is put on a magazine. This image is 512 by 480 with 256 intensity

Fig. 17. Preprocessing results of the beer bottle cap. (a) Edge segments. (b) Salient edge segments. (c) Corners. (d) Model sample points.


Fig. 20. (a) Verification result: measurement. θ = 39.5 degrees; Tx = 226.67; Ty = -11.63; Sx = 1.0; Sy = 0.95; Percentage of Verification = 77.3%; Error(T) = 1.5 pixels. (b) Verification result: projected object is shown in solid black lines.

Fig. 21. Second test image of the bottle cap.

levels. There is no occlusion in this example. Fig. 19 shows the linear edge segments (Fig. 19(a)), the salient edges (Fig. 19(b)), and the corners computed (Fig. 19(c)). There are 4852 edge segments, 493 of which are marked as salient; 164 corners are computed. The transform parameters and the error measurement of the verified transform hypothesis are given in Fig. 20(a). Fig. 20(b) shows the cap instance computed (in solid black lines).

Fig. 19. Preprocessing results of Fig. 18. (a) Edge segments. (b) Salient edge segments. (c) Corners.

In this example, the parameter space is partitioned into 15,360 bins (i.e., a partition of 32, 24, 20 bins in the x, y, θ axes, respectively). The upper bound on the height of a root is set to 4. The total number of possible transforms, based on the match of the


Fig. 23. (a) Verification result: measurement. θ = 37.84 degrees; Tx = 218.96; Ty = -16.88; Sx = 1.04; Sy = 0.96; Percentage of Verification = 67.4%; Error(T) = 1.57 pixels. (b) Verification result: projected object is shown in solid black lines.

Fig. 22. Preprocessing results of Fig. 21. (a) Edge segments. (b) Salient edge segments. (c) Corners.

angles of corners in the image and in the object model, is 1679. However, our system generates only 150 transforms, or 8.9 percent of the total possible transforms, when the correct object instance is found. The total number of nonempty SLDBs is 2210, and each SLDB contains, on the average, 1.31 roots. The largest SLDB generated contains 17 roots. The total processing time excluding the edge extraction process is 59 s.

Fig. 21 shows the second test image, where the beer bottle cap is put on a magazine with a pen on the cap. This image is 512 by 480 with 256 intensity levels. Fig. 22 shows the linear edge segments (Fig. 22(a)), the salient edges (Fig. 22(b)), and the corners computed (Fig. 22(c)). There are 4888 edge segments, 487 of which are marked as salient; 163 corners are computed. The transform parameters and the error measurement of the verified transform hypothesis are given in Fig. 23(a). Fig. 23(b) shows the cap instance computed (in solid black lines). The computed transform is very accurate in spite of the occlusion and the shadow cast on the cap by the pen.

In this example, the parameter space is partitioned similarly to the previous example. The upper bound on the height of a root is also set to 4. The total number of possible transforms is 1740. However, our system generates only 345 transforms, or 19.8 percent of the total possible transforms, when the correct object instance is found. The total number of nonempty SLDBs is 5279, and each SLDB contains, on the average, 1.54 roots. The largest SLDB generated contains 13 roots. The total processing time excluding the edge extraction process is 75 s.

VI. CONCLUSION

We have presented an approach for the recognition of 2-D objects. Our approach uses a symbolic clustering technique to locate instances of an object in images. A model is constructed automatically by showing the system examples of the object. Matches between features of the object and the image are used to construct transform hypotheses. Consistent hypotheses are clustered to form more reliable hypotheses. Promising hypotheses are selected for further model-driven analysis of the image. The system successfully recognizes partially occluded objects in complex real images. Initial experiments indicate that our approach is robust and general. They also demonstrate that the symbolic clustering method can be effective, since only a small portion of the possible matches is actually explored. We intend to extend our system to the recognition of several objects, revise it to recognize objects under perspective projection, and develop algorithms for the recognition of 3-D objects using 2-D images.

ACKNOWLEDGMENT

The author would like to thank Professor Jake Aggarwal for his valuable advice and encouragement.

REFERENCES

[1] N. Ayache, "A model-based vision system to identify and locate partially visible industrial parts," in Proc. Computer Vision and Pattern Recognition, 1983, pp. 492-494.

[2] D. Ballard, "Generalizing the Hough transform to detect arbitrary shapes," Pattern Recognition, vol. 13, no. 2, pp. 111-122, 1981.

[3] S. Barnard and W. Thompson, "Disparity analysis of images," IEEE Trans. Pattern Anal. Machine Intell., vol. 2, no. 4, pp. 333-340, 1980.

[4] R. Bolles, "Robust feature matching through maximal cliques," in Proc. SPIE Tech. Symp. Imaging Applications in Automated Industrial Inspection and Assembly, Bellingham, WA, 1979.

[5] R. Bolles and R. Cain, "Recognizing and locating partially visible objects: The local feature focus method," Int. J. Robotics Res., vol. 1, no. 3, 1982.

[6] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.

[7] D. Huttenlocher and S. Ullman, "Object recognition using alignment," in Proc. Int. Conf. Computer Vision, 1987, pp. 102-111.

[8] V. Hwang, L. S. Davis, and T. Matsuyama, "Hypothesis integration in image understanding systems," Computer Vision, Graphics, and Image Processing, vol. 36, pp. 321-371, 1986.

[9] V. Hwang, "Recognition of two-dimensional objects using hypothesis integration technique," Tech. Rep. AI TR-87-57, Univ. of Texas, Austin, TX, 1987.

[10] W. Perkins, "A model-based vision system for industrial parts," IEEE Trans. Computers, vol. C-27, no. 2, pp. 126-143, 1978.

[11] D. Rogers and J. Adams, Mathematical Elements for Computer Graphics. New York: McGraw-Hill, 1976.

[12] A. Rosenfeld and A. Kak, Digital Picture Processing. New York: Academic Press, 1976.

[13] G. Stockman, S. Kopstein, and S. Benett, "Matching images to models for registration and object detection via clustering," IEEE Trans. Pattern Anal. Machine Intell., vol. 4, no. 3, pp. 229-241, 1982.

Why Direction-Giving is Hard: The Complexity of Using Landmarks in One-Dimensional Navigation

JOHN R. KENDER AND AVRAHAM LEFF

Abstract - A formal model of topological navigation in one-dimensional spaces such as single roads is outlined. The concepts of direction-giving and of custom map, as well as some specifications for feature detector models, including the idea of sensor synchronicity, are defined formally. The representations necessary to model and exploit the differences between the world itself, the world as perceived by the map-maker, and the world as experienced by the navigator are discussed. The difficulty of giving precise meaning to what is meant by a "good" map is demonstrated, but it is shown how to operationally define what is meant by a "landmark": regardless of starting position, any custom map can attain a landmark with but a single instruction. It is shown that even in the simplest cases, NP-complete problems arise in the efficient selection and sequencing of sensor modalities, even while attempting to navigate from one single object to another. A heuristic that appears reasonable for map creation is provided, and examples of several very different maps that are each "optimal" under eight reasonable criteria are given.

I. INTRODUCTION

In this correspondence, we define a model for characterizing a class of robotic navigation problems, that of navigation-in-the-large within a linear environment (such as a single corridor of a building, or a single subway line), without the use of "graphic" information such as street signs or house numbers. This work is ultimately motivated by two questions: what is a "good" map, and what is a "landmark"? It addresses the abstract problem of how to optimally choose sensory features in order to describe and discriminate objects, and how to create short or efficient sequences of such descriptions for low-overhead navigation from a given place to another.

First, we formalize the map-maker's relationship to both the world and the navigator. Second, we document the problems that navigation even along a line entails. Third, we show that heuristics are necessary, since several of the problems are provably NP-complete.

II. THREE WORLD VIEWS

There are three similar but subtly different perceptions of "the world." First, the one-dimensional world as it exists is mathematically rich: it is continuous, has a distance measure, and objects "embedded" in it have finite extent. But, second, the map-maker can abstract it into a sequence: the world is conceived as a function over the integers. Empty space, distance, and object extent are ignored. A plane flight would be simply the sequence (New York, London, Paris, Rome). Third, the map-maker communicates even less of this sequence to a navigator in the form of a "best" "custom map" of "landmarks."

We will refer to the world as "Lineland," the abstraction of it as the "world model" or the "abstract sequence," and the custom map as, simply, the "map." (In layman's terms, the world model is usually called a "map," and the custom map would probably be called "the list of directions." However, at car rental agencies equipped with map-makers, the layman can be given exactly what we call a custom map, if he has a unique destination in mind. Surprisingly, research shows that such maps are preferred by human beings over the garden-variety world models [3].)

III. THE NAVIGATOR

The representation of Lineland as a sequence depends critically on several epistemological assumptions. To consider objects as point-like requires enough processing intelligence in the navigator to recognize an object as a single object, and to capture, hold, and dismiss the single object from its sensory array. Further, we assume the navigator can only verify the imminent presence of an object: it is "nearsighted."

Lineland can now be modeled by a vector-valued sequence; each object is a vector of sensations, one vector component for each sensor or feature detector. "Sensors" may be any measurable quality such as color, distance, area, texture, shape, number of holes, etc.: an object can be perceived as "green, far, large, rough, square, two."

The navigator is modeled as having exactly S sensors or feature detectors, and each sensor s_k takes on its values from its associated discrete domain d_k. For example, d_k might be {red, green, blue}. Thus, formally, Lineland is modeled by the map-maker as

Manuscript received December 14, 1988; revised April 16, 1989. This work was supported in part by the Defense Advanced Research Projects Agency under Contract DACA76-86-C-0024.

The authors are with the Department of Computer Science, Columbia University, New York, NY 10027.

IEEE Log Number 8930343.

0018-9472/89/1100-1656$01.00 ©1989 IEEE