Image and Vision Computing xxx (2011) xxx–xxx

IMAVIS-03067; No of Pages 15

Contents lists available at ScienceDirect

Image and Vision Computing

journal homepage: www.elsevier.com/locate/imavis

Cost-effective solution to synchronised audio-visual data capture using multiple sensors

Jeroen Lichtenauer a, Jie Shen a, Michel Valstar a, Maja Pantic a,b

a Department of Computing, Imperial College London, 180 Queen's Gate, SW7 2AZ, UK
b Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente, The Netherlands

Article history:
Received 13 February 2011
Received in revised form 7 June 2011
Accepted 18 July 2011
Available online xxxx

Keywords:
Audio recording
Video recording
Multisensor systems

Abstract

Applications such as surveillance and human behaviour analysis require high-bandwidth recording from multiple cameras, as well as from other sensors. In turn, sensor fusion has increased the required accuracy of synchronisation between sensors. Using commercial off-the-shelf components may compromise quality and accuracy due to several challenges, such as dealing with the combined data rate from multiple sensors; unknown offset and rate discrepancies between independent hardware clocks; the absence of trigger inputs or -outputs in the hardware; as well as different methods for time-stamping the recorded data. To achieve accurate synchronisation, we centralise the synchronisation task by recording all trigger- or timestamp signals with a multi-channel audio interface. For sensors that don't have an external trigger signal, we let the computer that captures the sensor data periodically generate timestamp signals from its serial port output. These signals can also be used as a common time base to synchronise multiple asynchronous audio interfaces. Furthermore, we show that a consumer PC can currently capture 8-bit video data with 1024×1024 spatial- and 59.1 Hz temporal resolution, from at least 14 cameras, together with 8 channels of 24-bit audio at 96 kHz. We thus improve the quality/cost ratio of multi-sensor data capture systems.

© 2011 Elsevier B.V. All rights reserved.

This paper has been recommended for acceptance by Jan-Michael Frahm.
Corresponding author. Tel.: +44 20 7594 8336; fax: +44 20 7581 8024.
E-mail address: j.lichtenauer@imperial.ac.uk (J. Lichtenauer).
0262-8856/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.imavis.2011.07.004
Please cite this article as: J. Lichtenauer, et al., Cost-effective solution to synchronised audio-visual data capture using multiple sensors, Image Vis. Comput. (2011), doi:10.1016/j.imavis.2011.07.004

1. Introduction

In the past two decades, the use of CCTV (Closed Circuit Television) and other visual surveillance technologies has grown to unprecedented levels. Besides security applications, multi-sensorial surveillance technology has also become an indispensable building block of various systems aimed at detection, tracking, and analysis of human behaviour, with a wide range of applications including proactive human-computer interfaces, personal wellbeing and independent living technologies, personalised assistance, etc. Furthermore, sensor fusion combining video analysis with the analysis of audio, as well as other sensor modalities, is becoming an increasingly active area of research [1]. It is also considered a prerequisite to increase the accuracy and robustness of automatic human behaviour analysis [2]. Although humans tolerate an audio lag of up to 200 ms or a video lag of up to 45 ms [3], multimodal data fusion algorithms may benefit from higher synchronisation accuracy. For example, in [4], correction of a 40 ms time difference between the audio and video streams recorded by a single camcorder resulted in a significant increase in performance of speaker identification based on Audio-Visual (A/V) data fusion. Lienhart et al. [5] demonstrated that microsecond accuracy between audio channels helps to increase signal separation gain in distributed blind signal separation.

With the ever-increasing need for multi-sensorial surveillance systems, the commercial sector started offering multi-channel frame grabbers and Digital Video Recorders (DVR) that encode video (possibly combined with audio) in real-time (e.g. see [6]). Although these systems can be the most suitable solutions for current surveillance applications, they may not allow the flexibility, quality, accuracy or number of sensors required for technological advancements in automatic human behaviour analysis. The spatial and temporal resolutions, as well as the supported camera types of real-time video encoders, are often fixed or limited to a small set of choices, dictated by established video standards. The accuracy of synchronisation between audio and video is mostly based on human perceptual acceptability, and could be inadequate for sensor fusion. Even if A/V synchronisation accuracy is maximised, an error below the time duration between subsequent video frame captures can only be achieved when it is exactly known how the recorded video frames correspond to the audio samples. Furthermore, commercial solutions are often closed systems that do not allow the accuracy of synchronisation that can be achieved with direct connections between the sensors. Some systems provide functionality of time-stamping the sensor data with GPS or IRIG-B modules. Such modules can provide microsecond synchronisation accuracy between remote systems. However, the applicability of such a solution depends on sensor hard- and software, as well as on the environment (GPS receivers need an unblocked view to the GPS satellites orbiting the Earth). Also, actual accuracy can never exceed the uncertainty of the time lag in the I/O process that precedes time-stamping of sensor data. For PC systems, this can be in the order of milliseconds [5].
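As noted above, sub-frame A/V accuracy requires knowing exactly which audio sample corresponds to each video frame capture. As a hypothetical illustration (not the paper's implementation), the mapping is simple arithmetic once frame capture instants are expressed on the same time base as the audio stream; the constants below are taken from the camera and audio modes reported in this paper, and the constant-frame-rate assumption is an idealisation:

```python
# Map a video frame index to the audio sample captured at the same
# instant, assuming both are referred to one common time base.

AUDIO_RATE_HZ = 96_000   # 24-bit audio at 96 kHz, as used in the paper
FRAME_RATE_HZ = 61.7     # one of the camera modes reported

def frame_to_audio_sample(frame_index, t_first_frame_s,
                          audio_rate_hz=AUDIO_RATE_HZ,
                          frame_rate_hz=FRAME_RATE_HZ):
    """Audio sample index co-captured with the given video frame,
    assuming a constant frame rate and a shared clock."""
    t_frame = t_first_frame_s + frame_index / frame_rate_hz
    return round(t_frame * audio_rate_hz)
```

At 61.7 fps one frame period spans roughly 1556 audio samples at 96 kHz, which is the granularity lost when only whole-frame A/V alignment is available.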

A few companies aim at custom solutions for applications with requirements that cannot be met with what is currently offered by commercial surveillance hardware. For example, Boulder Imaging [7] builds custom solutions for any application, and Cepoint Networks offers professional video equipment such as the Studio 9000 DVR [8], which can record up to 4 video streams per module, as well as external trigger events, with an option to timestamp with IRIG-B. It also has the option of connecting an audio interface through a Serial Digital Interface (SDI) input. However, it is not clear from the specifications if the time-stamping of audio and video can be done without being affected by the latency between the sensors and the main device. Furthermore, when more than 4 video streams have to be recorded, a single Studio 9000 will still not suffice. The problem of the high cost of custom solutions and specialised professional hardware is that it keeps accurately synchronised multi-sensor data capture out of reach for most computer vision and pattern recognition researchers. This is an important bottleneck for research on multi-camera and multi-modal human behaviour analysis. To overcome this, we propose solutions and present findings regarding the two most important difficulties in using low-cost Commercial Off-The-Shelf (COTS) components: reaching the required bandwidth for data capture and achieving accurate multi-sensor synchronisation.

Fortunately, recent developments in computer hardware technology have significantly increased the data bandwidths of commercial PC components, allowing for more audio-visual sensors to be connected to a single PC. Our low-cost PC configuration facilitates simultaneous, synchronous recordings of audio-visual data from 12 cameras having 780×580 pixels spatial resolution and 61.7 fps temporal resolution, together with eight 24-bit 96 kHz audio channels. The relevant components of our system setup are summarised in Table 1. By using six internal 1.5 TB Hard Disk Drives (HDD), 7.6 h of continuous recordings can be made. With a different motherboard and an extra HDD controller card to increase the amount of HDDs to 14, we show that 1 PC is capable of continuously recording from 14 Gigabit Ethernet cameras with 1024×1024 pixels spatial resolution and 59.1 fps, for up to 6.7 h. In Table 2 we show the maximum number of cameras that can be used in the different configurations that we tested. A higher number of cameras per PC means a reduction of cost, complexity as well as space requirements for visual data capture.
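The quoted recording durations can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes raw 8-bit video (1 byte per pixel), decimal-terabyte drives, and ignores the audio channels and filesystem overhead, so it lands close to, but not exactly on, the reported 7.6 h and 6.7 h:

```python
# Rough check of continuous recording time from camera count, frame
# geometry and total disk capacity (raw 8-bit video, video only).

def recording_hours(n_cameras, width, height, fps, n_disks, disk_tb=1.5):
    video_rate = n_cameras * width * height * fps   # bytes/s at 1 B/pixel
    capacity = n_disks * disk_tb * 1e12             # bytes
    return capacity / video_rate / 3600

h_firewire = recording_hours(12, 780, 580, 61.7, n_disks=6)     # ~7.5 h
h_gige     = recording_hours(14, 1024, 1024, 59.1, n_disks=14)  # ~6.7 h
```

The Gigabit Ethernet figure matches the paper's 6.7 h almost exactly; the FireWire estimate of roughly 7.5 h is slightly below the reported 7.6 h, presumably due to details of the actual disk layout not stated here.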

    Synchronisation between COTS sensors is hindered by the offset andrate discrepancies between independent hardware clocks, the absenceof trigger inputs or -outputs in the hardware, as well as differentmethods of time-stamping of the recorded data. To accurately derivesynchronisation between the independent timings of different sensors,
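The paper's approach records trigger and timestamp signals on a common audio interface; as a generic illustration of the clock-discrepancy problem this solves (not the paper's own algorithm), the offset and rate of one free-running clock relative to another can be estimated from paired timestamp observations with an ordinary least-squares line fit. All names and numbers below are illustrative:

```python
# Estimate t_dev ≈ rate * t_ref + offset between two unsynchronised
# clocks from paired timestamps, via a least-squares line fit.

def fit_clock_mapping(t_ref, t_dev):
    n = len(t_ref)
    mean_r = sum(t_ref) / n
    mean_d = sum(t_dev) / n
    cov = sum((r - mean_r) * (d - mean_d) for r, d in zip(t_ref, t_dev))
    var = sum((r - mean_r) ** 2 for r in t_ref)
    rate = cov / var                    # clock rate ratio (drift)
    offset = mean_d - rate * mean_r     # clock offset at t_ref = 0
    return rate, offset

# Synthetic example: a device clock running 50 ppm fast, offset 2.5 s.
ref = [float(i) for i in range(0, 100, 10)]
dev = [2.5 + t * 1.000050 for t in ref]
rate, offset = fit_clock_mapping(ref, dev)
```

With noisy real timestamps the fit averages out jitter in the individual observations, which is why more observation pairs give a better drift estimate.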

Table 2
Camera support of a single consumer PC.

Spatial resolution    Temporal resolution    Rate per camera    Max. no. of cameras
780×580 pixels        61.7 fps               26.6 MB/s          14
780×580 pixels        49.9 fps               21.5 MB/s          16
780×580 pixels        40.1 fps               17.3 MB/s          18

With controller card for 8 additional HDDs:
1024×1024 pixels      59.1 fps               59.1 MB/s          14
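The "Rate per camera" column of Table 2 can be reproduced as one byte per pixel times the frame rate; the table's values only come out when MB is read as 2**20 bytes, which appears to be the convention used:

```python
# Reproduce Table 2's per-camera data rates: width * height * fps bytes/s
# for 8-bit video, expressed in MB of 2**20 bytes.

def rate_mb_per_s(width, height, fps):
    return width * height * fps / 2**20

rates_firewire = [round(rate_mb_per_s(780, 580, fps), 1)
                  for fps in (61.7, 49.9, 40.1)]      # [26.6, 21.5, 17.3]
rate_gige = round(rate_mb_per_s(1024, 1024, 59.1), 1)  # 59.1
```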


Fig. 1. Overview of our synchronised multi-sensor data capture system, consisting of (a) microphones, (b) video cameras, (c) a multi-channel A/D converter, (d) an A/V capture PC, (e) an eye gaze capture PC, (f) an eye gaze tracker and (g) a photo diode to

Table 1
Components of the capture system for 8 FireWire cameras with a resolution of 780×580 pixels and 61.7 fps.

Sensor component              Description
7 monochrome video cameras    AVT Stin
