A real-time multitarget tracking system with robust multichannel CNN-UM algorithms

<ul><li><p>1358 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 52, NO. 7, JULY 2005</p><p>A Real-Time Multitarget Tracking System With Robust Multichannel CNN-UM Algorithms</p><p>Gergely Tímár and Csaba Rekeczky, Member, IEEE</p><p>Abstract—This paper introduces a tightly coupled topographic sensor-processor and digital signal processor (DSP) architecture for real-time visual multitarget tracking (MTT) applications. We define real-time visual MTT as the task of tracking targets contained in an input image flow at a sampling rate that is higher than the speed of the fastest maneuvers that the targets make. We utilize a sensor-processor based on the cellular neural network universal machine architecture that permits the offloading of the main image processing tasks from the DSP and introduces opportunities for sensor adaptation based on the tracking-performance feedback from the DSP. To achieve robustness, the image processing algorithms running on the sensor borrow ideas from biological systems: the input is processed in different parallel channels (spatial, spatio-temporal and temporal), and the interaction of these channels generates the measurements for the digital tracking algorithms. These algorithms (running on the DSP) are responsible for distance calculation, state estimation, data association and track maintenance. The performance of the proposed system is studied using actual hardware for different video flows containing rapidly moving, maneuvering targets.</p><p>I. INTRODUCTION</p><p>RECOGNIZING and interpreting the motion of objects in image sequences (flows) is an essential task in a number of applications, such as security, surveillance, etc. 
In many instances, the objects to be tracked have no known distinguishing features that would allow feature (or token) tracking [1], [2], optical flow or motion estimation [3], [4]. Therefore, the targets can only be identified and tracked by their measured positions and derived motion parameters. Target tracking algorithms developed for tracking targets based on sonar and radar measurements are widely known and could be used for tracking based on visual input (also known as motion correspondence). However, our requirement that the system should operate at least at video frame-rate (possibly even higher) limits the choices between the well-established statistical and nonstatistical tracking algorithms.</p><p>Manuscript received May 7, 2004; revised October 5, 2004. This work was supported in part by the Hungarian National Research and Development Program (NKFP), in part by TeleSense under Project Grant 2/035/2001, in part by EOARD under Grant FA8655-03-1-3047, and in part by the Human Frontier Science Program (HFSP) Young Investigators program. This paper was recommended by Associate Editor Y.-K. Chen.</p><p>G. Tímár is with the Analogic and Neural Computing Systems Laboratory, Computer and Automation Institute, Hungarian Academy of Sciences, H-1052 Budapest, Hungary.</p><p>Cs. Rekeczky is with the Ányos Jedlik Laboratories, Department of Information Technology, Péter Pázmány Catholic University, H-1052 Budapest, Hungary.</p><p>Digital Object Identifier 10.1109/TCSI.2005.851703</p><p>
The real-time requirements motivated the use of a unique image sensing and processing device, the cellular neural network universal machine (CNN-UM) [5]–[8] and its very large-scale integration (VLSI) implementations, which provide several advantages over conventional CMOS or CCD sensors:</p><p>• the possibility of focal-plane processing, which means that the acquired image does not have to be moved from the sensor to the processor;</p><p>• very fast parallel image processing operators;</p><p>• unique trigger-wave and diffusion-based operators.</p><p>We posited that the decreased running time of image processing algorithms provides some headroom within the real-time constraints that allows for the use of more complex state estimation and data assignment algorithms and sensor adaptation possibilities. Unfortunately, as will be explored in detail in Section V, current-generation CNN-UM devices contain several design compromises that severely limit their potential for use in these kinds of applications. These shortcomings, however, have nothing to do with their inherent capabilities for very high-speed image processing; rather, they are mostly the result of communication bottlenecks between the processor and the rest of the system.</p><p>In the next section, we give a high-level overview of the system, and then we present the algorithms running on the CNN-UM. In Section IV, we give an overview of the algorithms used in estimating the state of the targets, creating and maintaining the target tracks, and the adaptation possibilities that the tight coupling of the track maintenance system (TMS) and the sensor can provide. 
The main focus of this paper is on describing a real-time multitarget tracking (MTT) system that contains a CNN-UM sensor-processor and the various algorithms needed to accomplish this task; for a detailed description of state-of-the-art data association, data assignment and state estimation methods, the reader is kindly referred to [9], [10]. Finally, we present experimental results obtained by running the algorithms on actual hardware.</p><p>II. BLOCK-LEVEL OVERVIEW OF THE SYSTEM</p><p>The system contains two main architectural levels (Fig. 1): the CNN-UM level, where all of the image processing takes place (and possibly image acquisition), and the digital signal processor (DSP) level, where track management functions are performed. Using our ACE-BOX-based [11] implementation, image acquisition is performed by the host PC, since the Ace4k [12] does not have an optical input; by utilizing the more recent Ace16k chips [13], it is possible to use their optical input for image capture.</p><p>1057-7122/$20.00 © 2005 IEEE</p></li><li><p>TÍMÁR AND REKECZKY: REAL-TIME MTT SYSTEM 1359</p><p>Fig. 1. CNN-UM/DSP hybrid system architecture for MTT. The main processing blocks are divided into three categories: those that are best performed on the CNN-UM processor, those that are especially suitable for the DSP, and those that have to be performed using both processors.</p><p>After image acquisition, the CNN-UM can perform image enhancement to compensate for ambient lighting changes, motion extraction and related image processing tasks, and feature extraction for some types of features. The DSP runs the rest of the feature extraction routines and the motion correspondence algorithms, such as distance calculation, gating, data assignment and target state estimation. It also calculates new values for some CNN-UM algorithm parameters, thus adapting the processing to the current environment.</p><p>III. 
CNN-UM ALGORITHMS</p><p>The algorithms presented here were heavily influenced by knowledge gained from the study of the mammalian visual system, especially the retina. Recent studies [14] uncovered that the retina processes visual information in parallel spatio-temporal channels and that only a sparse encoding of the information on these channels is transmitted via the optic nerve to higher visual cortical areas. Additionally, there is context- and content-sensitive interaction between these parallel channels via enhancement and suppression that results in remarkable adaptivity. These are highly desirable characteristics for all image-processing systems, but they are essential for visual tracking tasks, where degradation of the input measurements cannot always be compensated for at the later stages of processing (by domain knowledge, for example).</p><p>In the following subsections, we describe a conceptual framework for such a complex CNN-UM-based front-end algorithm. First, we discuss the computing blocks in general and then specify the characteristics of the test implementation on the Ace4k CNN-UM chip operating on the ACE-BOX computational infrastructure.</p><p>A. Enhancement Methods and Spatio-Temporal Channel Processing</p><p>We tried to capture the main ideas from the natural system by defining three change-enhancing channels on the input image flow: a spatial, a temporal and a spatio-temporal channel [see Fig. 2(a)]. The spatial channel contains the response of filters that detect spatial, i.e., brightness, changes, revealing the edges in a frame. The temporal channel contains the result of computing the difference between two consecutive frames, thereby giving a response to changes, while the spatio-temporal channel contains the nonlinear combination of the spatial and temporal filter responses. 
In a general scheme, it can also be assumed that the input flow is preprocessed (enhanced) by a noise-suppressing reconstruction filter.</p><p>The filtering on the parallel channels can be defined as causal recursive difference-type filtering using some linear or nonlinear filters as prototypes [typically, difference of Gaussian (DoG) filters implemented using constrained linear diffusion [15], or difference of morphology (DoM) filters implemented by min-max statistical filters [16]]. These filters can be thought of as bandpass filters tuned to a specific spatial/temporal frequency (or a very narrow band of frequencies), thus enabling highly selective filtering of grayscale images and sequences.</p><p>The three main parameters of these change-enhancing filters are as follows.</p><p>• Spatial scale: the spatial frequency (essentially the object size) the filter is tuned to (in pixels).</p><p>• Temporal rate: the rate of change in an image sequence the filter is tuned to (in pixels/frame).</p><p>• Orientation: the direction in which the filter is sensitive (in radians).</p><p>In our current framework, the orientation parameter is not used, since we rely on isotropic Gaussian kernels (or approximations thereof) to construct our filters, but we include it here because the framework does not inherently preclude its use. It is possible to tune the spatial channel's response to objects of a specific size (in pixels) using the scale parameter. Similarly, the rate parameter allows the filtering out of all image changes except those occurring at a certain rate (in pixels per frame). This enables the multichannel framework to specifically detect targets with certain characteristics.</p><p>The output of these channels is filtered through a sigmoid function</p><p>f(x) = 1 / (1 + e^(−λ(x − θ)))   (1)</p><p>The parameters of this function are the threshold θ and the slope λ. For every x &gt; θ, the output of the function exceeds 1/2, hence the name threshold. 
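A soft threshold of this kind can be sketched as follows. The logistic form and the parameter values here are illustrative assumptions, not the chip's implementation.

```python
import numpy as np

def soft_threshold(x, theta=0.2, lam=8.0):
    """Logistic sigmoid: output in (0, 1), crossing 1/2 at x = theta and
    approaching a hard 0/1 step as the slope lam grows."""
    return 1.0 / (1.0 + np.exp(-lam * (np.asarray(x, dtype=float) - theta)))

# As the slope increases, the soft threshold approaches a hard step:
x = np.linspace(-1.0, 1.0, 5)
hard = np.where(x > 0.2, 1.0, 0.0)
steep = soft_threshold(x, theta=0.2, lam=200.0)
```

With a large slope such as lam = 200, the response is numerically indistinguishable from the hard step everywhere except in a vanishingly small band around x = theta.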
The slope parameter specifies the steepness of the transition from 0 to 1 and, as it becomes larger, the sigmoid approximates the traditional threshold step function more closely.</p><p>The output of the best-performing individual channel could be used by itself as the output of the image processing front-end if the conditions where the system is deployed are static and well controlled. If the conditions are dynamic or unknown a priori, then there is no way to predict the best-performing channel in advance. Furthermore, even after the system is running, no automatic direct measurement of channel performance can be given, short of a human observer deciding which output is the best. To circumvent this problem, we decided to combine the output of the individual channels through a so-called interaction matrix and use the combined output for further processing. The inclusion of the interaction matrix enables the flexible runtime combination of the images on these parallel channels and the prediction map, while also specifying a framework that can be incorporated into the system at design time. Our experimental results and measurements indicate that the combined output is on average more accurate than each single channel for different image sequences. Fig. 2(a) shows the conceptual block diagram of the multichannel spatio-temporal algorithm with all computing blocks to be discussed in the following section. The pseudo code for the entire multichannel algorithm along with all operators is given in Appendix C.</p></li><li><p>Fig. 2. (a) Block overview of the channel-based image processing algorithm for change detection. (b) Ace4k implementation of the same algorithm. The input image is first enhanced (histogram modified), and then it is processed in three parallel change-enhancing channels. These channels and the prediction image are combined through the interaction matrix and thresholded to form the final detection image. Observe that the framework allows the entire processing to be grayscale (through the use of fuzzy logic); the only constraint is that the detection image must be binary. In the Ace4k implementation, the results of the channel processing are thresholded to arrive at binary images, which are then combined using Boolean logic functions as specified by the interaction matrix. The parameters for the Ace4k algorithm are: the temporal rate of change; the scale (on the spatial (SP) and spatio-temporal (SPT) channels); the per-channel threshold values; L, logical inversion (−1) or simple transfer (+1); N, the number of morphological opening (N &gt; 0) or closing (N &lt; 0) operations.</p><p>B. General Remarks on the Implementation of Multichannel CNN Algorithms on the Ace4k Chip</p><p>The change-enhancing channels are actually computed serially (time multiplexed) in the current implementation, but this is not a problem due to the high speed of the CNN-UM chips used. In the first stage of the ongoing experiments, only isotropic spatio-temporal processing has been considered, followed by crisp thresholding through a hard nonlinearity. Thus, the three types of general parameters used to derive and control the associated CNN templates (or algorithmic blocks) are the scale and rate parameters and the threshold parameter. Fig. 
2(b) shows the functional building blocks of the Ace4k implementation of the algorithm (a hardware-oriented simplification of the conceptual model) with all associated parameters.</p><p>The enhancement (smoothing) techniques have been implemented in the form of nearest-neighbor (NN) convolution filters (a circular positive template with entries normalized to 1) and applied to the actual frame (the scale parameter determines the scale of the prefiltering in pixels, i.e., the number of convolution steps performed).</p><p>The spatio-temporal channel filtering (including the temporal filtering solution) has been implemented as a fading-memory NN convolution filter applied to the actual and previous frames. In the temporal filtering configuration (no spatial smoothing), the rate parameter represents the fading rate (in temporal steps), thereby specifying the temporal scale of the difference enhancement. In the spatio-temporal filtering configuration (the fading rate is set to a fixed value), the scale parameter represents the spatial scale (in pixels) at which the changes are to be enhanced (the number of convolution operations on the current and the previous frame is calculated implicitly from this information).</p><p>The pure spatial filtering is based on Sobel-type spatial processing of the actual frame along the horizontal-vertical directions, combining the outputs into a single isotropic solution (here repres...</p></li></ul>
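The building blocks described in this subsection can be approximated in software as follows. This is a hedged sketch of the ideas (fading-memory temporal filtering, a Sobel-type isotropic edge response, and the Boolean interaction-matrix combination of thresholded channels), not the Ace4k template code; the update rule, the kernels and the +1/−1/0 coding are illustrative assumptions.

```python
import numpy as np

def fading_memory_filter(frame, state, fading_rate=0.5):
    """Fading-memory temporal filter: the state is an exponentially
    weighted average of past frames; subtracting it from the current
    frame enhances changes at the temporal scale set by fading_rate."""
    change = frame - state
    new_state = fading_rate * state + (1.0 - fading_rate) * frame
    return change, new_state

def cross_correlate3(img, k):
    """Zero-padded 3x3 neighborhood filtering (cross-correlation)."""
    p = np.pad(img, 1)
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + h, j:j + w]
    return out

def sobel_isotropic(img):
    """Sobel-type processing along the horizontal and vertical directions,
    combined into a single isotropic edge-magnitude response."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    return np.hypot(cross_correlate3(img, kx), cross_correlate3(img, kx.T))

def combine_channels(masks, coding):
    """Boolean interaction-row combination of thresholded (binary) channel
    images: +1 transfers a mask, -1 inverts it, 0 ignores it; the selected
    masks are OR-ed into the detection image."""
    det = np.zeros_like(masks[0], dtype=bool)
    for m, c in zip(masks, coding):
        if c == +1:
            det |= m
        elif c == -1:
            det |= ~m
    return det
```

In use, each channel output would be passed through a threshold to obtain a binary mask, and one row of the interaction matrix would supply the +1/−1/0 coding that `combine_channels` applies to form the detection image.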