A real-time multitarget tracking system with robust multichannel CNN-UM algorithms









    A Real-Time Multitarget Tracking System With Robust Multichannel CNN-UM Algorithms

    Gergely Tímár and Csaba Rekeczky, Member, IEEE

    Abstract: This paper introduces a tightly coupled topographic sensor-processor and digital signal processor (DSP) architecture for real-time visual multitarget tracking (MTT) applications. We define real-time visual MTT as the task of tracking targets contained in an input image flow at a sampling rate that is higher than the speed of the fastest maneuvers that the targets make. We utilize a sensor-processor based on the cellular neural network universal machine architecture that permits the offloading of the main image processing tasks from the DSP and introduces opportunities for sensor adaptation based on the tracking performance feedback from the DSP. To achieve robustness, the image processing algorithms running on the sensor borrow ideas from biological systems: the input is processed in different parallel channels (spatial, spatio-temporal and temporal) and the interaction of these channels generates the measurements for the digital tracking algorithms. These algorithms (running on the DSP) are responsible for distance calculation, state estimation, data association and track maintenance. The performance of the proposed system is studied using actual hardware for different video flows containing rapidly moving maneuvering targets.



    Manuscript received May 7, 2004; revised October 5, 2004. This work was supported in part by the Hungarian National Research and Development Program (NKFP), in part by TeleSense under Project Grant 2/035/2001, in part by EORD under Grant FA8655-03-1-3047, and in part by Human Frontier Science Program (HFSP) Young Investigators. This paper was recommended by Associate Editor Y.-K. Chen.

    G. Tímár is with the Analogic and Neural Computing Systems Laboratory, Computer and Automation Institute, Hungarian Academy of Sciences, H-1052 Budapest, Hungary.

    Cs. Rekeczky is with the Ányos Jedlik Laboratories, Department of Information Technology, Péter Pázmány Catholic University, H-1052 Budapest, Hungary.

    Digital Object Identifier 10.1109/TCSI.2005.851703

    Recognizing and interpreting the motion of objects in image sequences (flows) is an essential task in a number of applications, such as security and surveillance. In many instances, the objects to be tracked have no known distinguishing features that would allow feature (or token) tracking [1], [2], optical flow or motion estimation [3], [4]. Therefore, the targets can only be identified and tracked by their measured positions and derived motion parameters. Target tracking algorithms developed for tracking targets based on sonar and radar measurements are widely known and could be used for tracking based on visual input (also known as motion correspondence). However, our requirement that the system should operate at least at video frame rate (possibly even higher) limits the choice among the well-established statistical and nonstatistical tracking algorithms. The real-time requirements motivated the use of a unique image sensing and processing device, the cellular neural network universal machine (CNN-UM) [5]-[8] and its very large-scale integration (VLSI) implementations, which provide several advantages over conventional CMOS or CCD sensors:

    • possibility of focal plane processing, which means that the acquired image does not have to be moved from the sensor to the processor;

    • very fast parallel image processing operators;

    • unique trigger-wave and diffusion based operators.

    We posited that the decreased running time of image processing algorithms provides some headroom within the real-time constraints that allows for the use of more complex state estimation and data assignment algorithms and sensor adaptation possibilities. Unfortunately, as will be explored in detail in Section V, current generation CNN-UM devices contain several design compromises that severely limit their potential for use in these kinds of applications. These shortcomings, however, have nothing to do with their inherent capabilities for very high-speed image processing; rather, they are mostly the result of communication bottlenecks between the processor and the rest of the system.

    In the next section, we give a high-level overview of the system, and then we present the algorithms running on the CNN-UM. In Section IV, we give an overview of the algorithms used in estimating the state of the targets, and creating and maintaining the target tracks and the adaptation possibilities that the tight coupling of the track maintenance system (TMS) and the sensor can provide. The main focus of this paper is on describing a real-time multitarget tracking (MTT) system that contains a CNN-UM sensor-processor and various algorithms needed to accomplish this, so for a detailed description of the state-of-the-art data association, data assignment and state estimation methods, the reader is kindly referred to [9], [10]. Finally, we present experimental results obtained by running the algorithms on actual hardware.


    The system contains two main architectural levels (Fig. 1): the CNN-UM level, where all of the image processing takes place (and possibly image acquisition), and the digital signal processor (DSP) level, where track management functions are performed. In our ACE-BOX-based [11] implementation, image acquisition is performed by the host PC since the Ace4k [12] does not have an optical input, but by utilizing more recent Ace16k chips [13] it is possible to use their optical input for image capture.

    1057-7122/$20.00 © 2005 IEEE


    Fig. 1. CNN-UM/DSP hybrid system architecture for MTT. The main processing blocks are divided into three categories: those that are best performed on the CNN-UM processor, those that are especially suitable for the DSP, and those that have to be performed using both processors.

    After image acquisition, the CNN-UM can perform image enhancement to compensate for ambient lighting changes, motion extraction and related image processing tasks, and feature extraction for some types of features. The DSP runs the rest of the feature extraction routines and the motion correspondence algorithms, such as distance calculation, gating, data assignment and target state estimation. It also calculates new values for some CNN-UM algorithm parameters, thus adapting the processing to the current environment.
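    As a rough illustration of the DSP-side steps named above (distance calculation, gating and data assignment), the following Python sketch pairs predicted track positions with measured positions. It assumes a fixed circular Euclidean gate and a greedy nearest-neighbor rule, which are chosen here only for brevity and are not claimed to be the assignment method used in the system; the function name `associate` and the `gate` parameter are hypothetical.

```python
import math

def associate(predicted, measurements, gate):
    """Greedy nearest-neighbor association with a circular gate.

    predicted    -- list of (x, y) predicted track positions
    measurements -- list of (x, y) measured target positions
    gate         -- maximum allowed distance (pixels) for a pairing (assumed)
    Returns a dict {track_index: measurement_index}.
    """
    # Distance calculation: evaluate every track/measurement pair.
    pairs = []
    for t, (px, py) in enumerate(predicted):
        for m, (mx, my) in enumerate(measurements):
            d = math.hypot(px - mx, py - my)
            if d <= gate:  # gating: discard implausible pairings
                pairs.append((d, t, m))

    # Data assignment: accept the closest pairs first; each track and
    # each measurement is used at most once.
    pairs.sort()
    assignment, used = {}, set()
    for d, t, m in pairs:
        if t not in assignment and m not in used:
            assignment[t] = m
            used.add(m)
    return assignment

tracks = [(10.0, 10.0), (40.0, 12.0)]
detections = [(41.0, 13.0), (11.0, 9.0), (90.0, 90.0)]
result = associate(tracks, detections, gate=5.0)
print(result)
```

    Measurements that fall outside every gate (the third detection here) stay unassigned; in a full tracker they would be candidates for starting new tracks, a step this sketch leaves out.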


    The algorithms presented here were heavily influenced by knowledge gained from the study of the mammalian visual system, especially the retina. Recent studies [14] uncovered that the retina processes visual information in parallel spatio-temporal channels and only a sparse encoding of the information on these channels is transmitted via the optic nerve to higher visual cortical areas. Additionally, there is context and content sensitive interaction between these parallel channels via enhancement and suppression that results in remarkable adaptivity. These are highly desirable characteristics for all image-processing systems but are essential for visual tracking tasks where degradation of input measurements cannot always be compensated for at the later stages of processing (by domain knowledge, for example).

    In the following subsections, we will describe a conceptual framework for such a complex CNN-UM based front-end algorithm. First, we will discuss the computing blocks in general and then specify the characteristics of the test implementation on the Ace4k CNN-UM chip operating on the ACE-BOX computational infrastructure.

    A. Enhancement Methods and Spatio-Temporal Channel Processing

    We tried to capture the main ideas from the natural system by defining three change-enhancing channels on the input image flow: a spatial, a temporal and a spatio-temporal channel [see Fig. 2(a)]. The spatial channel contains the response of filters that detect spatial (i.e., brightness) changes, revealing the edges in a frame. The temporal channel contains the result of computing the difference between two consecutive frames, thereby giving a response to changes, while the spatio-temporal channel contains the nonlinear combination of the spatial and temporal filter responses. In a general scheme, it can also be assumed that the input flow is preprocessed (enhanced) by a noise-suppressing reconstruction filter.
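    The three channels described above can be sketched in plain Python on small grayscale frames stored as lists of rows. The specific operators are assumptions made for illustration: an absolute 4-neighbor Laplacian stands in for the spatial filter, and a pixelwise minimum stands in for the nonlinear combination, which the text does not specify at this point. All function names are hypothetical.

```python
def spatial_channel(frame):
    """Edge response: absolute 4-neighbor Laplacian (border pixels left at 0)."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (frame[y - 1][x] + frame[y + 1][x] +
                   frame[y][x - 1] + frame[y][x + 1] - 4 * frame[y][x])
            out[y][x] = abs(lap)
    return out

def temporal_channel(prev, curr):
    """Change response: absolute difference of two consecutive frames."""
    return [[abs(c - p) for p, c in zip(rp, rc)]
            for rp, rc in zip(prev, curr)]

def spatio_temporal_channel(spatial, temporal):
    """Nonlinear combination of both responses (assumed: pixelwise minimum)."""
    return [[min(s, t) for s, t in zip(rs, rt)]
            for rs, rt in zip(spatial, temporal)]

# A single bright pixel that moves one step to the right between frames.
prev = [[0.0] * 5 for _ in range(5)]
curr = [[0.0] * 5 for _ in range(5)]
prev[2][1] = 1.0
curr[2][2] = 1.0

s = spatial_channel(curr)
t = temporal_channel(prev, curr)
st = spatio_temporal_channel(s, t)
```

    With the minimum as the combining rule, a pixel responds on the spatio-temporal channel only where both an edge and a frame-to-frame change are present; for instance, `st[1][2]` is zero here because that pixel shows a spatial response but no temporal change.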

    The filtering on the parallel channels can be defined as causal recursive difference-type filtering using some linear or nonlinear filters as prototypes [typically, difference of Gaussian (DoG) filters implemented using constrained linear diffusion [15], or difference of morphology (DoM) filters implemented by min-max statistical filters [16]]. These filters can be thought of as bandpass filters tuned to a specific spatial/temporal frequency (or a very narrow band of frequencies), thus enabling highly selective filtering of grayscale images and sequences.
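    To make the bandpass behavior of a DoG filter concrete, the following one-dimensional Python sketch approximates Gaussian smoothing by iterating a discrete diffusion step and subtracts a coarse-scale result from a fine-scale one. It is a simplified stand-in, assuming plain (unconstrained) diffusion with replicated borders rather than the constrained linear diffusion of [15]; the names `diffuse` and `dog` and the step counts are hypothetical.

```python
def diffuse(signal, steps):
    """Approximate Gaussian smoothing by iterating a discrete diffusion step.

    Each step applies the 3-tap kernel [0.25, 0.5, 0.25] with replicated
    borders; more steps correspond to a wider Gaussian.
    """
    s = list(signal)
    for _ in range(steps):
        padded = [s[0]] + s + [s[-1]]
        s = [0.25 * padded[i - 1] + 0.5 * padded[i] + 0.25 * padded[i + 1]
             for i in range(1, len(padded) - 1)]
    return s

def dog(signal, fine_steps, coarse_steps):
    """Difference of Gaussians: fine-scale minus coarse-scale smoothing."""
    fine = diffuse(signal, fine_steps)
    coarse = diffuse(signal, coarse_steps)
    return [f - c for f, c in zip(fine, coarse)]

flat = [5.0] * 11                      # uniform brightness, no structure
blob = [0.0] * 5 + [1.0] + [0.0] * 5   # a one-pixel object in the middle

flat_response = dog(flat, 1, 4)
blob_response = dog(blob, 1, 4)
```

    The uniform signal produces no response because both smoothed versions equal the input, while the small object yields a positive peak at its center flanked by negative lobes. Changing the two step counts shifts the object size to which the filter responds most strongly.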

    The three main parameters of these change-enhancing filters are as follows.

    • Spatial scale: the spatial frequency or frequencies (basically the object size) the filter is tuned to (in pixels).

    • Temporal rate: the rate of change in an image sequence the filter is tuned to (in pixels/frame).

    • Orientation: the direction in which the filter is sensitive (in radians).

    In our current framework, the orientation parameter is not used, since we are relying on isotropic Gaussian kernels (or approximations thereof) to construct our filters, but we are including it here because the framework does not inherently preclude its use. It is po