Image and Vision Computing xxx (2011) xxx–xxx
IMAVIS-03067; No of Pages 15

Contents lists available at ScienceDirect

Image and Vision Computing

journal homepage: www.elsevier.com/locate/imavis

Cost-effective solution to synchronised audio-visual data capture using multiple sensors

Jeroen Lichtenauer a,*, Jie Shen a, Michel Valstar a, Maja Pantic a,b

a Department of Computing, Imperial College London, 180 Queen's Gate, SW7 2AZ, UK
b Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente

This paper has been recommended for acceptance by Jan-Michael Frahm.
* Corresponding author. Tel.: +44 20 7594 8336; fax: +44 20 7581 8024.
E-mail address: firstname.lastname@example.org (J. Lichtenauer).

Article history: Received 13 February 2011; Received in revised form 7 June 2011; Accepted 18 July 2011; Available online xxxx

Keywords: Video recording; Audio recording; Multisensor systems

Abstract

Applications such as surveillance and human behaviour analysis require high-bandwidth recording from multiple cameras, as well as from other sensors. In turn, sensor fusion has increased the required accuracy of synchronisation between sensors. Using commercial off-the-shelf components may compromise quality and accuracy due to several challenges, such as dealing with the combined data rate from multiple sensors; unknown offset and rate discrepancies between independent hardware clocks; the absence of trigger inputs or -outputs in the hardware; as well as the different methods for time-stamping the recorded data. To achieve accurate synchronisation, we centralise the synchronisation task by recording all trigger- or timestamp signals with a multi-channel audio interface. For sensors that do not have an external trigger signal, we let the computer that captures the sensor data periodically generate timestamp signals from its serial port output. These signals can also be used as a common time base to synchronise multiple asynchronous audio interfaces. Furthermore, we show that a consumer PC can currently capture 8-bit video data with 1024×1024 spatial and 59.1 Hz temporal resolution, from at least 14 cameras, together with 8 channels of 24-bit audio at 96 kHz. We thus improve the quality/cost ratio of multi-sensor data capture systems.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

In the past two decades, the use of CCTV (Closed Circuit Television) and other visual surveillance technologies has grown to unprecedented levels. Besides security applications, multi-sensorial surveillance technology has also become an indispensable building block of various systems aimed at detection, tracking, and analysis of human behaviour, with a wide range of applications including proactive human-computer interfaces, personal wellbeing and independent living technologies, personalised assistance, etc. Furthermore, sensor fusion combining video analysis with the analysis of audio, as well as other sensor modalities, is becoming an increasingly active area of research. It is also considered a prerequisite to increase the accuracy and robustness of automatic human behaviour analysis. Although humans tolerate an audio lag of up to 200 ms or a video lag of up to 45 ms, multimodal data fusion algorithms may benefit from higher synchronisation accuracy. For example, correction of a 40 ms time difference between the audio and video streams recorded by a single camcorder resulted in a significant increase in performance of speaker identification based on Audio-Visual (A/V) data fusion. Lienhart et al. demonstrated that microsecond accuracy between audio channels helps to increase signal separation gain in distributed blind signal separation.

With the ever-increasing need for multi-sensorial surveillance systems, the commercial sector started offering multi-channel frame grabbers and Digital Video Recorders (DVR) that encode video (possibly combined with audio) in real time. Although these systems can be the most suitable solutions for current surveillance applications, they may not allow the flexibility, quality, accuracy or number of sensors required for technological advancements in automatic human behaviour analysis. The spatial and temporal resolutions, as well as the supported camera types, of real-time video encoders are often fixed or limited to a small set of choices dictated by established video standards. The accuracy of synchronisation between audio and video is mostly based on human perceptual acceptability, and could be inadequate for sensor fusion. Even if A/V synchronisation accuracy is maximised, an error below the time duration between subsequent video frame captures can only be achieved when it is exactly known how the recorded video frames correspond to the audio samples. Furthermore, commercial solutions are often closed systems that do not allow the accuracy of synchronisation that can be achieved with direct connections between the sensors. Some systems provide the functionality of time-stamping the sensor data with GPS or IRIG-B modules. Such modules can provide microsecond synchronisation accuracy between remote systems. However, the applicability of such a solution depends on sensor hard- and software, as well as on the environment (GPS receivers need an unblocked view to the GPS satellites orbiting the Earth). Also, the actual accuracy can never exceed the uncertainty of the time lag in the I/O process that precedes time-stamping of sensor data. For PC systems, this can be in the order of milliseconds.

0262-8856/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.imavis.2011.07.004
Please cite this article as: J. Lichtenauer, et al., Cost-effective solution to synchronised audio-visual data capture using multiple sensors, Image Vis. Comput. (2011), doi:10.1016/j.imavis.2011.07.004
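With the centralised approach of recording every trigger or timestamp signal on channels of a multi-channel audio interface, the core post-processing step reduces to locating pulse edges in a recorded audio channel and converting their sample indices to times on the audio time base. The following minimal Python sketch is our own illustration, not code from the described system; the threshold value and function names are assumptions.

```python
# Sketch (our own, hypothetical): locate trigger pulses recorded on a spare
# channel of a multi-channel audio interface. Detected rising-edge sample
# indices give trigger times on the audio interface's time base.

AUDIO_RATE = 96_000  # Hz; 24-bit, 96 kHz audio interface as used in the paper

def find_rising_edges(samples, threshold=0.5):
    """Return sample indices where the signal crosses `threshold` upward."""
    edges = []
    for i in range(1, len(samples)):
        if samples[i - 1] < threshold <= samples[i]:
            edges.append(i)
    return edges

def edge_times(samples, rate=AUDIO_RATE, threshold=0.5):
    """Trigger times in seconds on the audio interface's time base."""
    return [i / rate for i in find_rising_edges(samples, threshold)]

if __name__ == "__main__":
    # Synthetic channel: silence with two 10-sample pulses starting at
    # samples 960 and 96,960 (i.e. 0.01 s and 1.01 s into the recording).
    channel = [0.0] * 100_000
    for start in (960, 96_960):
        for i in range(start, start + 10):
            channel[i] = 1.0
    print(edge_times(channel))
```

Because every sensor's trigger pulses end up in the same sample-accurate audio stream, all inter-sensor timing is expressed on one clock, sidestepping per-device timestamping latency.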
A few companies aim at custom solutions for applications with requirements that cannot be met with what is currently offered by commercial surveillance hardware. For example, Boulder Imaging builds custom solutions for any application, and Cepoint Networks offers professional video equipment such as the Studio 9000 DVR, which can record up to 4 video streams per module, as well as external trigger events, with an option to timestamp with IRIG-B. It also has the option of connecting an audio interface through a Serial Digital Interface (SDI) input. However, it is not clear from the specifications if the time-stamping of audio and video can be done without being affected by the latency between the sensors and the main device. Furthermore, when more than 4 video streams have to be recorded, a single Studio 9000 will still not suffice. The problem of the high cost of custom solutions and specialised professional hardware is that it keeps accurately synchronised multi-sensor data capture out of reach for most computer vision and pattern recognition researchers. This is an important bottleneck for research on multi-camera and multi-modal human behaviour analysis. To overcome this, we propose solutions and present findings regarding the two most important difficulties in using low-cost Commercial Off-The-Shelf (COTS) components: reaching the required bandwidth for data capture and achieving accurate multi-sensor synchronisation.
Fortunately, recent developments in computer hardware technology have significantly increased the data bandwidths of commercial PC components, allowing for more audio-visual sensors to be connected to a single PC. Our low-cost PC configuration facilitates simultaneous, synchronous recordings of audio-visual data from 12 cameras having 780×580 pixels spatial resolution and 61.7 fps temporal resolution, together with eight 24-bit 96 kHz audio channels. The relevant components of our system setup are summarised in Table 1. By using six internal 1.5 TB Hard Disk Drives (HDD), 7.6 h of continuous recordings can be made. With a different motherboard and an extra HDD controller card to increase the number of HDDs to 14, we show that 1 PC is capable of continuously recording from 14 Gigabit Ethernet cameras with 1024×1024 pixels spatial resolution and 59.1 fps, for up to 6.7 h. In Table 2 we show the maximum number of cameras that can be used in the different configurations that we tested. A higher number of cameras per PC means a reduction of cost and complexity, as well as of the space requirements for visual data capture.
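The bandwidth and duration figures above follow from simple arithmetic. The sketch below is our own back-of-the-envelope check, not code from the system; it assumes uncompressed 8-bit video and takes the quoted "MB/s" as MiB/s (2^20 bytes), which is what makes the per-camera rates come out as stated.

```python
# Back-of-the-envelope check (our own sketch) of the capture bandwidth and
# recording-duration figures quoted in the text.

MIB = 2 ** 20  # bytes per MiB; the paper's "MB/s" rates match this unit

def camera_rate_mib(width, height, fps, bytes_per_pixel=1):
    """Per-camera data rate in MiB/s for uncompressed 8-bit video."""
    return width * height * bytes_per_pixel * fps / MIB

def recording_hours(n_cameras, rate_mib, total_storage_bytes):
    """Continuous recording time in hours before the disks fill up."""
    total_rate = n_cameras * rate_mib * MIB          # bytes per second
    return total_storage_bytes / total_rate / 3600   # seconds -> hours

rate = camera_rate_mib(1024, 1024, 59.1)    # ~59.1 MiB/s per camera
hours = recording_hours(14, rate, 14 * 1.5e12)  # 14 HDDs of 1.5 TB each
print(f"{rate:.1f} MiB/s per camera, {hours:.1f} h of recording")
```

For the 1024×1024 @ 59.1 fps configuration this reproduces both the 59.1 MiB/s per-camera rate and the quoted 6.7 h of continuous recording on 14 × 1.5 TB of storage.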
Synchronisation between COTS sensors is hindered by the offset andrate discrepancies between independent hardware clocks, the absenceof trigger inputs or -outputs in the hardware, as well as differentmethods of time-stamping of the recorded data. To accurately derivesynchronisation between the independent timings of different sensors,
Table 2
Camera support of a single consumer PC.

Spatial resolution    Temporal resolution    Rate per camera    Max. no. of cameras
780×580 pixels        61.7 fps               26.6 MB/s          14
780×580 pixels        49.9 fps               21.5 MB/s          16
780×580 pixels        40.1 fps               17.3 MB/s          18

With controller card for 8 additional HDDs:
1024×1024 pixels      59.1 fps               59.1 MB/s          14
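The offset and rate discrepancies between independent hardware clocks can be estimated once the same timestamp events are observed on both a sensor's own clock and the central audio time base. A standard way to do this is an ordinary least-squares line fit; the sketch below is our own illustration of that idea, not the paper's code, and the 50 ppm drift in the example is an assumed value.

```python
# Hypothetical illustration (our own): estimate the offset and rate
# discrepancy between a sensor clock and the shared audio time base from
# paired observations of the same timestamp events on both clocks.

def fit_clock_map(sensor_times, audio_times):
    """Least-squares fit audio_time ~= rate * sensor_time + offset."""
    n = len(sensor_times)
    mean_s = sum(sensor_times) / n
    mean_a = sum(audio_times) / n
    cov = sum((s - mean_s) * (a - mean_a)
              for s, a in zip(sensor_times, audio_times))
    var = sum((s - mean_s) ** 2 for s in sensor_times)
    rate = cov / var
    offset = mean_a - rate * mean_s
    return rate, offset

def to_audio_time(t_sensor, rate, offset):
    """Map a sensor-clock timestamp onto the shared audio time base."""
    return rate * t_sensor + offset

# Example: the sensor clock runs 50 ppm fast and its origin lies 0.25 s
# after the audio clock's origin.
sensor = [1.0, 2.0, 3.0, 4.0, 5.0]
audio = [t / 1.00005 + 0.25 for t in sensor]
rate, offset = fit_clock_map(sensor, audio)
```

With the fitted (rate, offset) pair, every recorded sensor timestamp can be remapped onto the common time base; averaging over many timestamp events suppresses the jitter of individual observations.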
Fig. 1. Overview of our synchronised multi-sensor data capture system, consisting of (a) microphones, (b) video cameras, (c) a multi-channel A/D converter, (d) an A/V capture PC, (e) an eye gaze capture PC, (f) an eye gaze tracker and (g) a photo diode to
Table 1
Components of the capture system for 8 FireWire cameras with a resolution of 780×580 pixels and 61.7 fps.
Sensor component Description
7 monochrome video cameras