A real-time multitarget tracking system with robust multichannel CNN-UM algorithms

<ul><li><p>1358 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 52, NO. 7, JULY 2005</p><p>A Real-Time Multitarget Tracking System With Robust Multichannel CNN-UM Algorithms</p><p>Gergely Tímár and Csaba Rekeczky, Member, IEEE</p><p>Abstract—This paper introduces a tightly coupled topographic sensor-processor and digital signal processor (DSP) architecture for real-time visual multitarget tracking (MTT) applications. We define real-time visual MTT as the task of tracking targets contained in an input image flow at a sampling rate that is higher than the speed of the fastest maneuvers that the targets make. We utilize a sensor-processor based on the cellular neural network universal machine architecture that permits the offloading of the main image processing tasks from the DSP and introduces opportunities for sensor adaptation based on the tracking-performance feedback from the DSP. To achieve robustness, the image processing algorithms running on the sensor borrow ideas from biological systems: the input is processed in different parallel channels (spatial, spatio-temporal and temporal), and the interaction of these channels generates the measurements for the digital tracking algorithms. These algorithms (running on the DSP) are responsible for distance calculation, state estimation, data association and track maintenance. The performance of the proposed system is studied using actual hardware for different video flows containing rapidly moving, maneuvering targets.</p><p>I. INTRODUCTION</p><p>RECOGNIZING and interpreting the motion of objects in image sequences (flows) is an essential task in a number of applications, such as security, surveillance, etc. 
In many instances, the objects to be tracked have no known distinguishing features that would allow feature (or token) tracking [1], [2], optical flow or motion estimation [3], [4]. Therefore, the targets can only be identified and tracked by their measured positions and derived motion parameters. Target tracking algorithms developed for tracking targets based on sonar and radar measurements are widely known and could be used for tracking based on visual input (also known as motion correspondence). However, our requirement that the system should operate at least at video frame-rate (possibly even higher) limits the choices between the well-established statistical and nonstatistical tracking algorithms.</p><p>Manuscript received May 7, 2004; revised October 5, 2004. This work was supported in part by the Hungarian National Research and Development Program (NKFP), in part by TeleSense under Project Grant 2/035/2001, in part by EOARD under Grant FA8655-03-1-3047, and in part by the Human Frontier Science Program (HFSP) Young Investigators program. This paper was recommended by Associate Editor Y.-K. Chen.</p><p>G. Tímár is with the Analogic and Neural Computing Systems Laboratory, Computer and Automation Institute, Hungarian Academy of Sciences, H-1052 Budapest, Hungary.</p><p>Cs. Rekeczky is with the Ányos Jedlik Laboratories, Department of Information Technology, Péter Pázmány Catholic University, H-1052 Budapest, Hungary.</p><p>Digital Object Identifier 10.1109/TCSI.2005.851703</p><p>
The real-time requirements motivated the use of a unique image sensing and processing device, the cellular neural network universal machine (CNN-UM) [5]–[8] and its very large-scale integration (VLSI) implementations, which provide several advantages over conventional CMOS or CCD sensors:</p><p>• the possibility of focal-plane processing, which means that the acquired image does not have to be moved from the sensor to the processor;</p><p>• very fast parallel image processing operators;</p><p>• unique trigger-wave and diffusion-based operators.</p><p>We posited that the decreased running time of image processing algorithms provides some headroom within the real-time constraints that allows for the use of more complex state estimation and data assignment algorithms and sensor adaptation possibilities. Unfortunately, as will be explored in detail in Section V, current-generation CNN-UM devices contain several design compromises that severely limit their potential for use in these kinds of applications. These shortcomings, however, have nothing to do with their inherent capabilities for very high-speed image processing; rather, they are mostly the result of communication bottlenecks between the processor and the rest of the system.</p><p>In the next section, we give a high-level overview of the system, and then we present the algorithms running on the CNN-UM. In Section IV, we give an overview of the algorithms used in estimating the state of the targets, creating and maintaining the target tracks, and the adaptation possibilities that the tight coupling of the track maintenance system (TMS) and the sensor can provide. 
The main focus of this paper is on describing a real-time multitarget tracking (MTT) system that contains a CNN-UM sensor-processor and the various algorithms needed to accomplish this task; for a detailed description of state-of-the-art data association, data assignment and state estimation methods, the reader is kindly referred to [9], [10]. Finally, we present experimental results obtained by running the algorithms on actual hardware.</p><p>II. BLOCK-LEVEL OVERVIEW OF THE SYSTEM</p><p>The system contains two main architectural levels (Fig. 1): the CNN-UM level, where all of the image processing takes place (and possibly image acquisition), and the digital signal processor (DSP) level, where track management functions are performed. Using our ACE-BOX-based [11] implementation, image acquisition is performed by the host PC, since the Ace4k [12] does not have an optical input; by utilizing the more recent Ace16k chips [13], it is possible to use their optical input for image capture.</p><p>1057-7122/$20.00 © 2005 IEEE</p></li><li><p>TÍMÁR AND REKECZKY: REAL-TIME MTT SYSTEM 1359</p><p>Fig. 1. CNN-UM/DSP hybrid system architecture for MTT. The main processing blocks are divided into three categories: those that are best performed on the CNN-UM processor, those that are especially suitable for the DSP, and those that have to be performed using both processors.</p><p>After image acquisition, the CNN-UM can perform image enhancement to compensate for ambient lighting changes, motion extraction and related image processing tasks, and feature extraction for some types of features. The DSP runs the rest of the feature extraction routines and the motion correspondence algorithms, such as distance calculation, gating, data assignment and target state estimation. It also calculates new values for some CNN-UM algorithm parameters, thus adapting the processing to the current environment.</p><p>III. 
CNN-UM ALGORITHMS</p><p>The algorithms presented here were heavily influenced by knowledge gained from the study of the mammalian visual system, especially the retina. Recent studies [14] uncovered that the retina processes visual information in parallel spatio-temporal channels and that only a sparse encoding of the information on these channels is transmitted via the optic nerve to higher visual cortical areas. Additionally, there is context- and content-sensitive interaction between these parallel channels via enhancement and suppression that results in remarkable adaptivity. These are highly desirable characteristics for all image-processing systems, but they are essential for visual tracking tasks, where degradation of the input measurements cannot always be compensated for at the later stages of processing (by domain knowledge, for example).</p><p>In the following subsections, we describe a conceptual framework for such a complex CNN-UM-based front-end algorithm. First, we discuss the computing blocks in general and then specify the characteristics of the test implementation on the Ace4k CNN-UM chip operating on the ACE-BOX computational infrastructure.</p><p>A. Enhancement Methods and Spatio-Temporal Channel Processing</p><p>We tried to capture the main ideas from the natural system by defining three change-enhancing channels on the input image flow: a spatial, a temporal and a spatio-temporal channel [see Fig. 2(a)]. The spatial channel contains the response of filters that detect spatial, i.e., brightness, changes, revealing the edges in a frame. The temporal channel contains the result of computing the difference between two consecutive frames, thereby giving a response to changes, while the spatio-temporal channel contains the nonlinear combination of the spatial and temporal filter responses. 
In a general scheme, it can also be assumed that the input flow is preprocessed (enhanced) by a noise-suppressing reconstruction filter.</p><p>The filtering on the parallel channels can be defined as causal recursive difference-type filtering using some linear or nonlinear filters as prototypes [typically, difference of Gaussian (DoG) filters implemented using constrained linear diffusion [15], or difference of morphology (DoM) filters implemented by min-max statistical filters [16]]. These filters can be thought of as bandpass filters tuned to a specific spatial/temporal frequency (or a very narrow band of frequencies), thus enabling highly selective filtering of grayscale images and sequences.</p><p>The three main parameters of these change-enhancing filters are as follows.</p><p>• Spatial scale: the spatial frequency (essentially the object size) the filter is tuned to (in pixels).</p><p>• Temporal rate: the rate of change in an image sequence the filter is tuned to (in pixels/frame).</p><p>• Orientation: the direction in which the filter is sensitive (in radians).</p><p>In our current framework, the orientation parameter is not used, since we rely on isotropic Gaussian kernels (or approximations thereof) to construct our filters, but we include it here because the framework does not inherently preclude its use. It is possible to tune the spatial channel's response to objects of a specific size (in pixels) using the scale parameter. Similarly, the rate parameter allows the filtering out of all image changes except those occurring at a certain rate (in pixels per frame). This enables the multichannel framework to specifically detect targets with certain characteristics.</p><p>The output of these channels is filtered through a sigmoid function</p><p>f(x) = 1 / (1 + e^(−λ(x − θ)))   (1)</p><p>The parameters of this function are the threshold θ and the slope λ. For every x &gt; θ, the output of the function exceeds 1/2, hence the name threshold. 
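A soft threshold of this kind can be sketched as follows. The logistic form and the parameter values here are illustrative assumptions, not the chip's implementation.

```python
import numpy as np

def soft_threshold(x, theta=0.2, lam=8.0):
    """Logistic sigmoid: output in (0, 1), crossing 1/2 at x = theta and
    approaching a hard 0/1 step as the slope lam grows."""
    return 1.0 / (1.0 + np.exp(-lam * (np.asarray(x, dtype=float) - theta)))

# As the slope increases, the soft threshold approaches a hard step:
x = np.linspace(-1.0, 1.0, 5)
hard = np.where(x > 0.2, 1.0, 0.0)
steep = soft_threshold(x, theta=0.2, lam=200.0)
```

With a large slope such as lam = 200, the response is numerically indistinguishable from the hard step everywhere except in a vanishingly small band around x = theta.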
The slope parameter specifies the steepness of the transition from 0 to 1 and, as it becomes larger, the sigmoid approximates the traditional threshold step function more closely.</p><p>The output of the best-performing individual channel could be used by itself as the output of the image processing front-end if the conditions where the system is deployed are static and well controlled. If the conditions are dynamic or unknown a priori, then there is no way to predict the best-performing channel in advance. Furthermore, even after the system is running, no automatic direct measurement of channel performance can be given, short of a human observer deciding which output is the best. To circumvent this problem, we decided to combine the output of the individual channels through a so-called interaction matrix and use the combined output for further processing. The inclusion of the interaction matrix enables the flexible runtime combination of the images on these parallel channels and the prediction map, while also specifying a framework that can be incorporated into the system at design time. Our experimental results and measurements indicate that the combined output is on average more accurate than each single channel for different image sequences. Fig. 2(a) shows the conceptual block diagram of the multichannel spatio-temporal algorithm with all computing blocks to be discussed in the following section. The pseudo code for the entire multichannel algorithm along with all operators is given in Appendix C.</p></li><li><p>Fig. 2. (a) Block overview of the channel-based image processing algorithm for change detection. (b) Ace4k implementation of the same algorithm. The input image is first enhanced (histogram modified), and then it is processed in three parallel change-enhancing channels. These channels and the prediction image are combined through the interaction matrix and thresholded to form the final detection image. Observe that the framework allows the entire processing to be grayscale (through the use of fuzzy logic); the only constraint is that the detection image must be binary. In the Ace4k implementation, the results of the channel processing are thresholded to arrive at binary images, which are then combined using Boolean logic functions as specified by the interaction matrix. The parameters for the Ace4k algorithm are: the temporal rate of change; the scale (on the spatial (SP) and spatio-temporal (SPT) channels); the per-channel threshold values; L, logical inversion (−1) or simple transfer (+1); N, the number of morphological opening (N &gt; 0) or closing (N &lt; 0) operations.</p><p>B. General Remarks on the Implementation of Multichannel CNN Algorithms on the Ace4k Chip</p><p>The change-enhancing channels are actually computed serially (time multiplexed) in the current implementation, but this is not a problem due to the high speed of the CNN-UM chips used. In the first stage of the ongoing experiments, only isotropic spatio-temporal processing has been considered, followed by crisp thresholding through a hard nonlinearity. Thus, the three types of general parameters used to derive and control the associated CNN templates (or algorithmic blocks) are the scale and rate parameters and the threshold parameter. Fig. 
2(b) shows the functional building blocks of the Ace4k implementation of the algorithm (a hardware-oriented simplification of the conceptual model) with all associated parameters.</p><p>The enhancement (smoothing) techniques have been implemented in the form of nearest-neighbor (NN) convolution filters (a circular positive template with entries normalized to 1) and applied to the actual frame (the scale parameter determines the scale of the prefiltering in pixels, i.e., the number of convolution steps performed).</p><p>The spatio-temporal channel filtering (including the temporal filtering solution) has been implemented as a fading-memory NN convolution filter applied to the actual and previous frames. In the temporal filtering configuration (no spatial smoothing), the rate parameter represents the fading rate (in temporal steps), thereby specifying the temporal scale of the difference enhancement. In the spatio-temporal filtering configuration (the fading rate is set to a fixed value), the scale parameter represents the spatial scale (in pixels) at which the changes are to be enhanced (the number of convolution operations on the current and the previous frame is calculated implicitly from this information).</p><p>The pure spatial filtering is based on Sobel-type spatial processing of the actual frame along the horizontal-vertical directions, combining the outputs into a single isotropic solution (here repres...</p></li></ul>
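The building blocks described in this subsection can be approximated in software as follows. This is a hedged sketch of the ideas (fading-memory temporal filtering, a Sobel-type isotropic edge response, and the Boolean interaction-matrix combination of thresholded channels), not the Ace4k template code; the update rule, the kernels and the +1/−1/0 coding are illustrative assumptions.

```python
import numpy as np

def fading_memory_filter(frame, state, fading_rate=0.5):
    """Fading-memory temporal filter: the state is an exponentially
    weighted average of past frames; subtracting it from the current
    frame enhances changes at the temporal scale set by fading_rate."""
    change = frame - state
    new_state = fading_rate * state + (1.0 - fading_rate) * frame
    return change, new_state

def cross_correlate3(img, k):
    """Zero-padded 3x3 neighborhood filtering (cross-correlation)."""
    p = np.pad(img, 1)
    h, w = img.shape
    out = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            out += k[i, j] * p[i:i + h, j:j + w]
    return out

def sobel_isotropic(img):
    """Sobel-type processing along the horizontal and vertical directions,
    combined into a single isotropic edge-magnitude response."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    return np.hypot(cross_correlate3(img, kx), cross_correlate3(img, kx.T))

def combine_channels(masks, coding):
    """Boolean interaction-row combination of thresholded (binary) channel
    images: +1 transfers a mask, -1 inverts it, 0 ignores it; the selected
    masks are OR-ed into the detection image."""
    det = np.zeros_like(masks[0], dtype=bool)
    for m, c in zip(masks, coding):
        if c == +1:
            det |= m
        elif c == -1:
            det |= ~m
    return det
```

In use, each channel output would be passed through a threshold to obtain a binary mask, and one row of the interaction matrix would supply the +1/−1/0 coding that `combine_channels` applies to form the detection image.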