AUTOMATIC VIDEO OBJECT EXTRACTION



    CHAPTER I

    INTRODUCTION


    1.1. VIDEO OBJECT EXTRACTION

A desirable video object extraction scheme for content-based applications should meet the following criteria:

1) The segmented object should conform to human perception, i.e., semantically meaningful objects should be segmented.

2) The segmentation algorithm should be efficient and fast.

3) Initialization should be simple and easy for users to operate (human intervention should be minimized).

One feasible solution that satisfies these criteria is edge change detection.

In Video Object (VO) segmentation methods that use mathematical morphology and a perspective motion model, objects of interest must initially be outlined by a human observer. From the manually specified object boundary, the correct object boundary is calculated using a morphological segmentation tool. The obtained Video Object Plane (VOP) is then automatically tracked and updated in successive frames. Such methods have difficulty dealing with large non-rigid object movement and with occlusion, especially in the VOP tracking stage.

The algorithm based on edge change detection allows automatic detection of newly appearing VOs. Edge change detection on the inter-frame difference is another popular family of schemes because it is straightforward to implement and enables automatic detection of new appearances. This ability makes it possible to develop a fully automated object-based system, such as an object-based video surveillance system.

It is found that algorithms based on inter-frame change detection provide automatic detection of objects and allow larger non-rigid motion compared to the mathematical morphology and perspective motion model methods. Their drawback is small false regions detected because of decision errors caused by noise. Thus, small hole removal using morphological operations and removal of false parts, such as uncovered background identified by motion information, are usually incorporated.

Another drawback of edge change detection is that object boundaries are irregular in some critical image areas and must be smoothed and adapted using spatial edge information. Since spatial edge information is useful for generating VOPs with accurate boundaries, a simple binary edge difference scheme may be assumed to be a good solution. In


order to overcome boundary inaccuracy, multiple features, multiple frames, and spatial-temporal entropy methods are used. In addition, this gives robustness to noise and occluding pixels.

The first stage is applied to the first two frames of a video shot to discover moving objects, while the second stage is applied to the rest of the frames to extract the detected objects throughout the shot. The algorithm is applied to the first two frames of the image sequence to detect the moving objects in the video. The first two frames of the video sequence are taken and motion vectors are computed using the Adaptive Rood Pattern Search (ARPS) algorithm. Simultaneously, the components of optical flow are computed for each block in the image. Using the motion vectors, a motion-compensated frame is generated.

Initial segmentation is performed on the first frame of the traffic sequence. Applying the watershed transformation directly to the image gradient results in over-segmentation. To avoid this, the morphological gradient is computed on the frame and then the watershed transformation is performed. Even so, some regions may need to be merged afterwards because of residual over-segmentation.
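As a rough sketch of this step (not the report's exact implementation; the 3x3 window and the marker rule are assumptions), a morphological gradient can be computed and passed to a marker-based watershed as follows:

from scipy import ndimage as ndi
from skimage.segmentation import watershed

def segment_frame(gray):
    """gray: 2-D uint8 frame; returns an integer label image."""
    # Morphological gradient (dilation minus erosion over a 3x3 window) is
    # smoother than the plain intensity gradient, which reduces the number
    # of spurious catchment basins.
    grad = ndi.morphological_gradient(gray.astype(float), size=(3, 3))
    # Markers taken from low-gradient areas keep the region count manageable;
    # the threshold below is purely illustrative.
    markers, _ = ndi.label(grad < grad.mean())
    return watershed(grad, markers)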

A Canny binary edge image is used to localize an object in subsequent frames of the video sequence and to detect true weak edges. Intensity edge pixels are used as feature points due to the key role that edges play in the human visual process and the fact that edges are

    little affected by variation of luminance. Object models evolve from one frame to the next,

    capturing the changes in the shape of objects as they move. The algorithm naturally

    establishes the temporal correspondence of objects throughout the video sequence, and the

    output of the algorithm is a sequence of binary models representing the motion and shape

    changes of the objects.

The object model is obtained by subtracting the background edges from the edge image and eliminating unlinked pixels. After a binary model for the object of interest has been derived, the motion vectors generated by the ARPS algorithm are used to match the subsequent frames in the sequence. Matching is performed on edge images because it is computationally efficient and fairly insensitive to changes in illumination. The degree of change in the shape of an object from one frame to the next is determined using a simplified Hausdorff distance, defined as a combination of distance transformation and correlation: the image is distance transformed and then thresholded by different amounts to form different dilated image sets. To search for the object in the image, it is required to obtain the


amount by which the image must be dilated such that the maximum number of points in the object model are matched to the image set.
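A hedged sketch of this dilate-and-match idea is given below: the distance transform of the frame's edge map is thresholded by increasing amounts, and the dilation radius that covers the largest fraction of model points is selected. The function names and the search range are illustrative, not taken from the report.

import cv2
import numpy as np

def match_fraction(edge_img, model_pts, radius):
    """edge_img: binary edge map (uint8, 255 at edges); model_pts: Nx2 (row, col)."""
    # Distance transform of the inverted edge map gives, at each pixel, the
    # distance to the nearest edge pixel.
    dist = cv2.distanceTransform(255 - edge_img, cv2.DIST_L2, 3)
    # A model point "matches" when it lies inside the edge map dilated by
    # `radius`, i.e. within `radius` of some edge pixel.
    hits = dist[model_pts[:, 0], model_pts[:, 1]] <= radius
    return hits.mean()

def best_dilation(edge_img, model_pts, radii=range(1, 10)):
    # Smallest dilation that maximizes the matched fraction of model points.
    return max(radii, key=lambda r: match_fraction(edge_img, model_pts, r))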

In this automatic VO segmentation algorithm, edge change detection starts with edge detection, the first and most important stage of the human visual process. Edge information plays a key role in capturing physical changes of the corresponding surface in a real scene. However, exploiting a simple difference of edges to extract the shape information of moving objects in a video sequence suffers from a great deal of noise, even with a stationary background. This is because the random noise created in one frame differs from that created in the successive frame, which results in slight changes of the edge locations between successive frames. The difference edge of frames therefore suppresses the noise in the luminance difference by means of the Canny edge detector.
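A minimal sketch of this idea, assuming 8-bit grayscale frames and illustrative Canny thresholds: the edges of the frame difference are gated by the current frame's spatial edges, so that edge-location jitter caused by random noise is suppressed.

import cv2

def difference_edge_map(prev_gray, curr_gray, lo=50, hi=150):
    diff = cv2.absdiff(curr_gray, prev_gray)     # luminance change between frames
    diff_edges = cv2.Canny(diff, lo, hi)         # edges of the change image
    curr_edges = cv2.Canny(curr_gray, lo, hi)    # spatial edges of the current frame
    # Keep only changed edges that coincide with real spatial edges.
    return cv2.bitwise_and(diff_edges, curr_edges)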

    Motion estimation is based on temporal changes in image intensities. The underlying

    supposition behind motion estimation is that the patterns corresponding to objects and

background in a frame of a video sequence move within the frame to form corresponding objects in the subsequent frame. Motion estimation is accomplished using the ARPS algorithm.

    The ARPS algorithm makes use of the fact that the general motion in a frame is usually

    coherent, i.e. if the macro blocks around the current macro block move in a particular

    direction then there is a high probability that the current macro block will also have a similar

    motion vector. This algorithm uses the motion vector of the macro block to its immediate left

    to predict its own motion vector.

The ARPS algorithm tries to achieve a Peak Signal to Noise Ratio (PSNR) similar to that of the Exhaustive Search (ES) algorithm. The ES algorithm, also known as Full Search, is the most computationally expensive block matching algorithm. It calculates the cost function at every possible location in the search window. As a result, it finds the best possible match and gives the highest PSNR of any block matching algorithm. The obvious disadvantage of exhaustive search is that it requires far more computations to estimate the motion vectors. ARPS tries to achieve the same PSNR as ES with far fewer computations.
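For reference, a tiny exhaustive (full) search block matcher using the sum of absolute differences is sketched below; ARPS reaches a similar PSNR while evaluating far fewer candidate positions. The macro block size and search range are illustrative.

import numpy as np

def full_search(ref, cur, block=16, p=7):
    """Return (dy, dx) motion vectors for each macro block of `cur` w.r.t. `ref`."""
    h, w = cur.shape
    mvs = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            tgt = cur[by:by + block, bx:bx + block].astype(int)
            best, best_mv = np.inf, (0, 0)
            for dy in range(-p, p + 1):          # every candidate in the window
                for dx in range(-p, p + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    cand = ref[y:y + block, x:x + block].astype(int)
                    sad = np.abs(tgt - cand).sum()   # cost function (SAD)
                    if sad < best:
                        best, best_mv = sad, (dy, dx)
            mvs[by // block, bx // block] = best_mv
    return mvs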


    CHAPTER II

    LITERATURE SURVEY


    2.1. EXISTING SYSTEM

The existing methodology involves a saliency-based video object extraction (VOE)

    framework. The proposed framework aims to automatically extract foreground objects of

    interest without any user interaction or the use of any training data. To separate foreground

    and background regions within and across video frames, the proposed method utilizes visual

    and motion saliency information extracted from the input video. A conditional random field

    is applied to effectively combine the saliency induced features, which allows us to deal with

    unknown pose and scale variations of the foreground object (and its articulated parts). Based

    on the ability to preserve both spatial continuity and temporal consistency in the proposed

    VOE framework, experiments on a variety of videos verify that our method is able to produce

    quantitatively and qualitatively satisfactory VOE results.

    2.1.1. Introduction

    A human can easily determine the subject of interest in a video, even though that

    subject is presented in an unknown or cluttered background or even has never been seen

    before. With the complex cognitive capabilities exhibited by human brains, this process can

    be interpreted as simultaneous extraction of both foreground and background information

    from a video.

    Many researchers have been working toward closing the gap between human and

    computer vision. However, without any prior knowledge on the subject of interest or training

    data, it is still very challenging for computer vision algorithms to automatically extract the

    foreground object of interest in a video. As a result, if one needs to design an algorithm to

    automatically extract the foreground objects from a video, several tasks need to be addressed.

    1) Unknown object category and unknown number of the object instances in a video.

    2) Complex or unexpected motion of foreground objects due to articulated parts or arbitrary

    poses.

    3) Ambiguous appearance between foreground and background regions due to similar color,

low contrast, insufficient lighting, and other such conditions.


    In practice, it is infeasible to manipulate all possible foreground object or background

    models beforehand. However, if one can extract representative information from either

    foreground or background (or both) regions from a video, the extracted information can be

    utilized to distinguish between foreground and background regions, and thus the task of

    foreground object extraction can be addressed. As discussed later in Section II, most of the

    prior works either consider a fixed background or assume that the background exhibits

    dominant motion across video frames. These assumptions might not be practical for real

    world applications, since they cannot generalize well to videos captured by freely moving

    cameras with arbitrary movements.

    In this system, we propose a robust video object extraction (VOE) framework, which

    utilizes both visual and motion saliency information across video frames. The observed

    saliency information allows us to infer several visual and motion cues for learning foreground

    and background models, and a conditional random field (CRF) is applied to automatically

determine the label (foreground or background) of each pixel based on the observed models.

    With the ability to preserve both spatial and temporal consistency, our VOE

    framework exhibits promising results on a variety of videos, and produces quantitatively and

    qualitatively satisfactory performance. While we focus on VOE problems for single concept

    videos (i.e., videos which have only one object category of interest presented), our proposed

    method is able to deal with multiple object instances (of the same type) with pose, scale, etc.

    variations.

Methods to solve the occlusion problem in tracking multiple interacting objects have been presented previously. Shiloh, Chang and Dockstader overcame occlusion in multiple object tracking by fusing multiple camera inputs. Cucchiara proposed probabilistic masks and

    appearance models to cope with frequent shape changes and large occlusions. Eng developed

    a Bayesian segmentation approach that fused a region-based background subtraction and a

    human shape model for people tracking under occlusion. Wu proposed a dynamic Bayesian

    network which accommodates an extra hidden process for partial occlusion handling. Andrew

    used appearance models to track occluded objects. Siebel proposed a tracking system with

    three co-operating parts: an active shape tracker, a region tracker and a head detector. The

    region tracker exploits the other two modules to solve occlusions. Hieu proposed a template


matching algorithm and updated the template using appearance features smoothed by a Kalman

filter. Tao presented a dynamic background layer model and modeled each moving object as a foreground layer; together with the foreground ordering, this includes the complete information necessary for reliably tracking objects through occlusion. Alper tracked the complete object and evolved the contour from frame to frame by minimizing energy functions.

An obvious step towards video segmentation, as noted in Efficient Hierarchical Graph-Based Video Segmentation by Matthias Grundmann et al., is to apply image segmentation techniques to video frames without considering temporal coherence. These methods are inherently scalable and may generate segmentation results in real time. However, the lack of temporal information from neighboring frames may cause jitter across frames. Freedman and Kisilev applied a sampling-

    based fast mean shift approach to a cluster of 10 frames as a larger set of image features to

    generate smoother results without taking into account temporal information.

    Spatio-temporal video segmentation techniques can be distinguished by whether the

    information from future frames is used in addition to past frames. Causal methods apply

    Kalman filtering to aggregate data over time, which only consider past data. Paris et al.

    derived the equivalent tool of mean-shift image segmentation for video streams based on the

    ubiquitous use of the Gaussian kernel. They achieved real-time performance without

    considering future frames in the video.

    Another class of spatio-temporal techniques takes advantage of both past and future

    data in a video. They treat the video as a 3D space-time volume, and typically use a variant of

    the mean shift algorithm for segmentation. Dementhon applied mean shift on a 3D lattice and

    used a hierarchical strategy to cluster the space-time video stack for computational efficiency.

    Wang et al. used anisotropic kernel mean shift segmentation for video tooning. Wang and

    Adelson used motion heuristics to iteratively segment video frames into motion consistent

    layers. Tracking-based video segmentation methods generally define segments at frame-level

    and use motion, color and spatial cues to force temporal coherence.

    Following the same line of work, Brendel and Todorovic used contour cues to allow

    splitting and merging of segments to boost the tracking performance. Interactive object

    segmentation has recently shown significant progress. These systems produce high quality

    segmentations driven by user input. We exhibit a similar interactive framework driven by our

segmentation. Our video segmentation method builds on Felzenszwalb and Huttenlocher's graph-based image segmentation technique. Their algorithm is efficient, being nearly linear


    in the number of edges in the graph, which makes it suitable for extension to spatio-temporal

    segmentation. We extend the technique to video making use of both past and future frames,

    and improve the performance and efficiency using a hierarchical framework.

    2.1.2. Methodology

    In this paper, we aim at automatically extracting foreground objects in videos which

are captured by freely moving cameras. Instead of assuming that the background motion is dominant and different from that of the foreground, as prior works did, we relax this assumption and allow foreground objects to be present in freely moving scenes.

    We advance both visual and motion saliency information across video frames, and a

    CRF model is utilized for integrating the associated features for VOE (i.e., visual saliency,

    shape, foreground/background color models, and spatial/temporal energy terms). From our

    quantitative and qualitative experiments, we verify that our VOE performance exhibits spatial

consistency and temporal continuity, and our method is shown to outperform state-of-the-art unsupervised VOE approaches. It is worth noting that our proposed VOE framework is an unsupervised approach, which requires neither prior knowledge (i.e., training data) of the object of interest nor user interaction for any annotation.

Fig. 2.1. Block Diagram of Existing System (input video → visual saliency and motion saliency using optical flow → color and shape detection → cooperating multiple cues → detected object)


    Most existing unsupervised VOE approaches assume the foreground objects as

    outliers in terms of the observed motion information, so that the induced appearance, color,

    etc. features are utilized for distinguishing between foreground and background regions.

    However, these methods cannot generalize well to videos captured by freely moving cameras

    as discussed earlier. In this work, we propose a saliency-based VOE framework which learns

    saliency information in both spatial (visual) and temporal (motion) domains.

By applying conditional random fields (CRF), the integration of the resulting

    features can automatically identify the foreground object without the need to treat either

    foreground or background as outliers.

    In general, one can address VOE problems using supervised or unsupervised

    approaches. Supervised methods require prior knowledge on the subject of interest and need

    to collect training data beforehand for designing the associated VOE algorithms. For

    example, Wu and Nevatia and Lin and Davis both decomposed an object shape model in a

    hierarchical way to train object part detectors, and these detectors are used to describe all

    possible configurations of the object of interest (e.g. pedestrians).

    Another type of supervised methods requires user interaction for annotating candidate

foreground regions. For example, some proposed image segmentation algorithms focus on an interactive scheme and require users to manually provide the ground-truth label information.

For videos captured by a monocular camera, methods applied a conditional random field (CRF) maximizing a joint probability of color, motion, and other models to predict the label of


each image pixel. Although the color features can be automatically determined from the input video, these methods still need the user to train object detectors for extracting shape or motion features.

Fig. 2.2. Overview of Existing VOE Framework

    Recently, researchers proposed to use some preliminary strokes to manually select the

    foreground and background regions, and they utilized such information to train local

    classifiers to detect the foreground objects. While these works produce promising results, it

    might not be practical for users to manually annotate a large amount of video data.

    2.1.3. Visual Saliency

The salience (also called saliency) of an item, be it an object, a person, or a pixel, is the state or quality by which it stands out relative to its neighbors. Saliency detection is

    considered to be a key attention mechanism that facilitates learning and survival by enabling

    organisms to focus their limited perceptual and cognitive resources on the most pertinent

    subset of the available sensory data.

    Saliency typically arises from contrasts between items and their neighborhood, such

    as a red dot surrounded by white dots, a flickering message indicator of an answering

    machine, or a loud noise in an otherwise quiet environment. Saliency detection is often

    studied in the context of the visual system, but similar mechanisms operate in other sensory

    systems. What is salient can be influenced by training: for example, for human subjects

    particular letters can become salient by training.

    When attention deployment is driven by salient stimuli, it is considered to be bottom-

    up, memory-free, and reactive. Attention can also be guided by top-down, memory-

    dependent, or anticipatory mechanisms, such as when looking ahead of moving objects or

    sideways before crossing streets. Humans and other animals have difficulty paying attention

    to more than one item simultaneously, so they are faced with the challenge of continuously

    integrating and prioritizing different bottom-up and top-down influences.

    In the domain of psychology, efforts have been made in modeling the mechanism of

    human attention, including the learning of prioritizing the different bottom-up and top-down

    influences.


    In the domain of computer vision, efforts have been made in modeling the mechanism

    of human attention, especially the bottom-up attention mechanism. Such a process is also

    called visual saliency detection.

Generally speaking, there are two kinds of models that mimic the bottom-up saliency mechanism. One is based on spatial contrast analysis; for example, a center-surround mechanism is used to define saliency across scales, inspired by the putative neural mechanism. The other is based on frequency domain analysis. While some works used the amplitude spectrum to assign saliency to rarely occurring magnitudes, Guo et al. use the phase spectrum instead, and a later system uses both the amplitude and the phase information.
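To make the frequency-domain idea concrete, the sketch below flattens the amplitude spectrum and keeps only the phase before transforming back; this is the well-known phase-spectrum formulation and is not claimed to be the exact method of the works cited here. The blur kernel size is an assumption.

import cv2
import numpy as np

def phase_spectrum_saliency(gray, ksize=9):
    f = np.fft.fft2(gray.astype(float))
    # Discard the amplitude: reconstructing from phase alone highlights
    # structure that deviates from the "expected" spectrum.
    recon = np.fft.ifft2(np.exp(1j * np.angle(f)))
    sal = np.abs(recon) ** 2
    sal = cv2.GaussianBlur(sal, (ksize, ksize), 0)
    return sal / (sal.max() + 1e-12)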

A key limitation of many such approaches is their computational complexity, which results in less than real-time performance even on modern computer hardware. Some recent work attempts to overcome these issues, but at the expense of saliency detection quality under some conditions.

    Our attention is attracted to visually salient stimuli. It is important for complex

    biological systems to rapidly detect potential prey, predators, or mates in a cluttered visual

world. However, simultaneously identifying any and all interesting targets in one's visual field has prohibitive computational complexity, making it a daunting task even for the most sophisticated biological brains, let alone for any existing computer.

    One solution, adopted by primates and many other animals, is to restrict complex

    object recognition process to a small area or a few objects at any one time. The many objects

    or areas in the visual scene can then be processed one after the other. This serialization of



visual scene analysis is operationalized through mechanisms of visual attention: a common (although somewhat inaccurate) metaphor for attention is that of a virtual spotlight, shifting to and highlighting different sub-regions of the visual world, so that one region at a time can be subjected to more detailed visual analysis.

Fig. 2.3. Example of visual saliency calculations. (a) Original video frame. (b) and (c) Visual saliency of (a)

    Visual attention may be a solution to the inability to fully process all locations in

    parallel. However, this solution produces a problem. If you are only going to process one

    region or object at a time, how do you select that target of attention? Visual salience helps

    your brain achieve reasonably efficient selection. Early stages of visual processing give rise

    to a distinct subjective perceptual quality which makes some stimuli stand out from among

    other items or locations. Our brain has evolved to rapidly compute salience in an automatic

    manner and in real-time over the entire visual field. Visual attention is then attracted towards

    salient visual locations.

The core of visual salience is a bottom-up, stimulus-driven signal that announces that this location is sufficiently different from its surroundings to be worthy of your attention.

This bottom-up deployment of attention towards salient locations can be strongly modulated, or even sometimes overridden, by top-down, user-driven factors. Thus, a lone red object in a green field will be salient and will attract attention in a bottom-up manner. In addition, if you are looking through a child's toy bin for a red plastic dragon, amidst plastic objects of many vivid colors, no one color may be especially salient until your top-down desire to find the red object renders all red objects, whether dragons or not, more salient.

    Visual salience is sometimes carelessly described as a physical property of a visual

    stimulus. It is important to remember that salience is the consequence of an interaction of a

    stimulus with other stimuli, as well as with a visual system (biological or artificial). As a

straightforward example, consider that a color-blind person will have a dramatically different experience of visual salience than a person with normal color vision, even when both look at exactly the same physical scene.

    As a more controversial example, it may be that expertise changes the salience of

some stimuli for some observers. Nevertheless, because visual salience arises from fairly low-level and stereotypical computations in the early stages of visual processing, the factors contributing to salience are generally quite comparable from one observer to the next, leading to similar experiences across a range of observers and of behavioral conditions.


    2.1.4. Motion Saliency

    Motion saliency detection has an important impact on further video processing tasks,

such as video segmentation, object recognition and adaptive compression. Different from image saliency, in videos, moving regions (objects) catch human beings' attention much more easily than static ones. Based on this observation, we propose a novel method of motion saliency detection, which makes use of the low-rank and sparse decomposition on video slices along the X-T and Y-T planes to achieve the goal, i.e. separating foreground moving objects from backgrounds. However, most of the techniques for images (mentioned above) are not applicable to detecting motion saliency in videos.

Unlike in image saliency detection, moving regions (objects) catch more of human beings' attention than static ones, even when the latter have large contrast to their neighbours in still images. That is to say, the focal point changes from regions with large contrast to their neighbours in images to regions with motion discrimination in videos. Therefore, contrast-based methods can hardly be applied to videos directly.

An exception exists, which extends the spectral residual analysis of images to videos. Actually, the goal of separating moving objects from the background is the same as that of motion saliency detection. A few solutions for separating foreground moving objects from backgrounds have been proposed, such as the Gaussian Mixture Model.

To detect each moving part and its corresponding pixels, we perform dense optical-flow forward and backward propagation at each frame of a video. A moving pixel qt at frame t is determined from the pixel pair detected by forward or backward optical flow propagation.
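A hedged sketch of this forward/backward check using Farneback dense optical flow is given below; the magnitude and consistency thresholds are illustrative assumptions, and the Farneback parameters are generic defaults.

import cv2
import numpy as np

def moving_pixel_mask(prev_gray, curr_gray, min_mag=1.0, max_fb_err=0.5):
    fwd = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                       0.5, 3, 15, 3, 5, 1.2, 0)
    bwd = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                       0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the forward flow, then read the backward flow at the landing point.
    x2 = np.clip((xs + fwd[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip((ys + fwd[..., 1]).astype(int), 0, h - 1)
    fb_err = np.linalg.norm(fwd + bwd[y2, x2], axis=-1)
    mag = np.linalg.norm(fwd, axis=-1)
    # A pixel counts as "moving" when its motion is non-trivial and the
    # forward and backward flows are consistent.
    return (mag > min_mag) & (fb_err < max_fb_err)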

    Determination of Shape Cues:

    Although motion saliency allows us to capture motion salient regions within and

    across video frames, those regions might only correspond to moving parts of the foreground

    object within some time interval.

    Determination of Color Cues:

Besides the motion-induced shape information, we also extract both foreground and background color information for improved VOE performance. Since it cannot be assumed that each moving part of the foreground object forms a


complete sampling of itself, we cannot construct foreground or background color models simply based on the visual or motion saliency detection results at each individual frame.

    2.1.5. CONDITIONAL RANDOM FIELDS (CRFs)

Utilizing an undirected graph, a conditional random field (CRF) is a powerful technique

    to estimate the structural information (e.g. class label) of a set of variables with the associated

    observations. For video foreground object segmentation, CRF has been applied to predict the

    label of each observed pixel in an image I.

    It is a class of statistical modeling method often applied in pattern recognition and

    machine learning, where they are used for structured prediction. Whereas an

    ordinary classifier predicts a label for a single sample without regard to "neighboring"

    samples, a CRF can take context into account; e.g., the linear chain CRF popular in natural

    language processing predicts sequences of labels for sequences of input samples.

CRFs are a type of discriminative undirected probabilistic graphical model. They are used to encode known relationships between observations and to construct consistent interpretations. They are often used for labeling or parsing of sequential data, such as natural language text

    or biological sequences and in computer vision. Specifically, CRFs find applications

    in shallow parsing, named entity recognition and gene finding, among other tasks, being an

    alternative to the related hidden Markov models. In computer vision, CRFs are often used for

    object recognition and image segmentation.

    Conditional random fields (CRFs) are a probabilistic framework for labeling and

    segmenting structured data, such as sequences, trees and lattices. The underlying idea is that

    of defining a conditional probability distribution over label sequences given a particular

    observation sequence, rather than a joint distribution over both label and observation

    sequences.

    Fig. 2.4. Motion Saliency Calculated for Fig 2.3. (a) Calculation of the Optical Flow.

    (b) Motion Saliency Derived from (a)


    The primary advantage of CRFs over hidden Markov models is their conditional

    nature, resulting in the relaxation of the independence assumptions required by HMMs in

    order to ensure tractable inference.

    Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum

    entropy Markov models (MEMMs) and other conditional Markov models based on directed

    graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world

    tasks in many fields, including bioinformatics, computational linguistics and speech

    recognition.

Imagine you have a sequence of snapshots from a day in Justin Bieber's life, and you

    want to label each image with the activity it represents (eating, sleeping, driving, etc.). How

    can you do this?

One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month's worth of labeled snapshots, you might learn

    that dark images taken at 6am tend to be about sleeping, images with lots of bright colors

    tend to be about dancing, and images of cars are about driving, and so on.

    By ignoring this sequential aspect, however, you lose a lot of information. For

example, what happens if you see a close-up picture of a mouth: is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it is more likely this picture is about eating; if, however, the previous image contains

    Justin Bieber singing or dancing, then this one probably shows him singing as well.

    Thus, to increase the accuracy of our labeler, we should incorporate the labels of

    nearby photos, and this is precisely what a conditional random field does.
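To connect this back to pixel labeling, the toy sketch below trades a per-pixel (unary) cost against agreement with 4-connected neighbours using a synchronous ICM-style sweep; it illustrates the kind of energy being optimized, not the inference method used in the works cited here (which rely on graph cuts or exact CRF inference).

import numpy as np

def icm_labeling(unary_fg, unary_bg, beta=0.5, iters=5):
    """unary_fg / unary_bg: HxW data costs for labeling a pixel foreground / background."""
    labels = (unary_fg < unary_bg).astype(np.int32)   # initialise from unaries alone
    for _ in range(iters):
        pad = np.pad(labels, 1, mode='edge')
        # Number of 4-neighbours currently labelled foreground, per pixel.
        nb_fg = pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] + pad[1:-1, 2:]
        cost_bg = unary_bg + beta * nb_fg             # disagreement penalty if we pick background
        cost_fg = unary_fg + beta * (4 - nb_fg)       # disagreement penalty if we pick foreground
        labels = (cost_fg < cost_bg).astype(np.int32)
    return labels                                     # 1 = foreground, 0 = background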

    2.1.6. Disadvantages

It is not possible to work with videos that have a moving background.

    Shape and colour cues extract foreground objects with missing parts.


2.2. SALIENCY BASED VIDEO SEGMENTATION WITH GRAPH CUTS AND SEQUENTIALLY UPDATED PRIORS

-Ken Fukuchi, Kouji Miyazato, Akisato Kimura, Shigeru Takagi and Junji Yamato

    This paper proposes a new method for achieving precise video segmentation without

    any supervision or interaction. The main contributions of this report include

    1) the introduction of fully automatic segmentation based on the maximum a

    posteriori (MAP) estimation of the Markov random field (MRF) with graph cuts and saliency

    driven priors.

    2) the updating of priors and feature likelihoods by integrating the previous

    segmentation results and the currently estimated saliency-based visual attention.

    Methods used

    Markov random field (MRF) model

Here each hidden state corresponds to the label of a position, representing an object or background, and an observation is a frame of the input video. The density calculated in the previous step can be utilized for estimating the priors of objects/backgrounds and the feature likelihoods of the MRF. When calculating priors and likelihoods, the regions extracted from the previous frames are also available.

    Image segmentation

Consider a set of random variables A = {Ax}, x ∈ I, defined on a set I of coordinates. Each random variable Ax takes a value ax from the set L = {0, 1}, which corresponds to the background (0) and an object (1), respectively.

A MAP-based MRF estimation can be formulated as an energy minimization problem, where the energy corresponding to the configuration a is the negative log likelihood of the joint posterior density of the MRF, E(a|D) = -log p(A = a|D), where D represents the input image.

    Result

    Proposed a new method for achieving precise video segmentation without any

    supervised interactions.


    The main contributions included

    1) the introduction of MAP-based frame-wise segmentation with graph cuts and

    saliency-driven priors.

    2) the technique for updating priors and likelihoods with a Kalman Filter.

    Disadvantage:

    Segmented regions were randomly switched

    2.3. SALIENCY DETECTION USING MAXIMUM SYMMETRIC SURROUND

    -Radhakrishna Achanta and Sabine Susstrunk

    Detection of visually salient image regions is useful for applications like object

segmentation, adaptive compression, and object recognition. Recently, full-resolution saliency maps that retain well-defined boundaries have attracted attention. In these maps, boundaries

    are preserved by retaining substantially more frequency content from the original image than

    older techniques. However, if the salient regions comprise more than half the pixels of the

    image, or if the background is complex, the background gets highlighted instead of the salient

    object.

This paper introduces a method for salient region detection that retains the

    advantages of such saliency maps while overcoming their shortcomings. Our method exploits

    features of color and luminance, is simple to implement and is computationally efficient.

    Methods used

    Saliency computation methods:

    Saliency has been referred to as visual attention, unpredictability, rarity, or surprise.

    Saliency estimation methods can broadly be classified as biologically based, purely

    computational, or those that combine the two ideas. In general, most methods employ a low-

    level approach of determining contrast of image regions relative to their surroundings using

    one or more features of intensity, color, and orientation.
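As a simple, hedged illustration of such low-level contrast computation (in the spirit of frequency-tuned saliency, not the maximum-symmetric-surround method itself, which uses a per-pixel surround region instead of the global mean):

import cv2
import numpy as np

def contrast_saliency(bgr):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    mean = lab.reshape(-1, 3).mean(axis=0)        # global mean colour in Lab
    blurred = cv2.GaussianBlur(lab, (5, 5), 0)    # suppress high-frequency noise
    sal = np.linalg.norm(blurred - mean, axis=2)  # colour/luminance contrast
    return sal / (sal.max() + 1e-8)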

    Graph Based Segmentation

    Graph cuts based methods are popular for image segmentation applications. Boykov

    and Jolly perform interactive segmentation using graph cuts. They require a user to provide

    scribble based input to indicate foreground and background regions.


    A graph cuts based algorithm then segments foreground from background. We use a

    similar approach, however, instead of the user indicating the background and foreground

    pixels using scribbles, we use the saliency map to assign these pixels automatically.

    Result

    This method improves upon six existing state-of-the-art algorithms in precision and

    recall with respect to a ground truth database.

    Disadvantage

    The saliency maps generated by this method suffer from low resolution

    2.4. AUTOMATIC OBJECT EXTRACTION IN SINGLE-CONCEPT VIDEOS

    -Kuo-Chin Lien and Yu-Chiang Frank Wang

    Proposes a motion-driven video object extraction (VOE) method, which is able to

    model and segment foreground objects in single-concept videos, i.e. videos which have only

one object category of interest but may have multiple object instances with pose, scale, and other variations. Given such a video, we construct a compact shape model induced by motion cues, and extract the foreground and background color information accordingly.

    It integrates these feature models into a unified framework via a conditional random

    field (CRF), and this CRF can be applied to video object segmentation and further video

    editing and retrieval applications. One of the advantages of this method is that it does not

    require the prior knowledge of the object of interest, and thus no training data or

    predetermined object detectors are needed; this makes this approach robust and practical to

    real-world problems. Very attractive empirical results on a variety of videos with highly

    articulated objects support the feasibility of our proposed method.

    Methods used

    Object Modeling And Extraction

First, extract the motion cues from the moving object across video frames, and combine the motion-induced shape, foreground and background color models into a CRF.

    Without prior knowledge of the object of interest, this CRF model is designed to address

    VOE problems in an unsupervised setting.

    Conditional random field:

    By utilizing an undirected graph, CRF is a powerful technique to estimate the

    structural information (e.g. class label) of a set of variables. For object segmentation, CRF is


    used to predict the label of each observed pixel in an image I. As shown in Figure 1, pixel i is

    associated with observation zi, while the hidden node Fi indicates its corresponding label (i.e.

    foreground or background).

In this CRF framework, the label Fi is calculated from the observation zi, while the spatial coherence between this output and the neighboring observations zj and labels Fj is simultaneously taken into consideration.

    Extraction of motion cues

The aim was to extract different feature information from these moving parts for the later

    CRF construction. To detect the moving parts and their corresponding pixels, we perform

    dense optical flow forward and backward propagation at every frame.

    Result

Proposed a method which utilizes multiple motion-induced features, such as shape and foreground/background color models, to extract foreground objects in single-concept videos.

    We advanced a unified CRF framework to integrate the above feature models.

    Disadvantage

    Some of the motion cues might be negligible due to low contrast

    2.5. VISUAL SALIENCY FROM IMAGE FEATURES WITH APPLICATION TO

    COMPRESSION

-P. Harding and N. M. Robertson

    This technique is validated using comprehensive human eye-tracking experiments.

    This algorithm is known as Visual Interest (VI) since the resultant segmentation reveals

    image regions that are visually salient during the performance of multiple observer search

tasks. It is demonstrated that the technique works on generic, eye-level photographs and is not dependent on heuristic tuning.

The descriptor-matching property of the SURF feature points can be exploited via object recognition to modulate the context of the attention probability map for a given object search task, refining the salient area. The Visual Interest algorithm is fully validated by applying it to salient compression, using a pre-blur of non-salient regions prior to JPEG compression and conducting comprehensive observer performance tests.


    Methods used

    Feature points extraction

    Computer vision feature points (basically, local features at which the signal changes

    three-dimensionally in space and scale) have many attractive properties, such as robust

    invariant descriptor matching over scale, rotation and affine offset that could be useful in

    combination with their use as a primitive saliency detector.

    Feature matching has been used in estimating inter-frame homography mismatching

    as an estimate of temporal saliency in video, but not as a measure of spatial saliency. Harding

    and Robertson compare the co-occurrence of a set of computer vision feature points with

predictive maps of visual saliency.

    Saliency has been referred to as visual attention, unpredictability, rarity, or surprise.

    Saliency estimation methods can broadly be classified as biologically based, purely

    computational, or those that combine the two ideas. In general, most methods employ a low-

    level approach of determining contrast of image regions relative to their surroundings using

    one or more features of intensity, color, and orientation.

    Result

    Presented an image segmentation algorithm to segment image areas which are

visually salient to observers performing multiple tasks. In contrast to bottom-up saliency alone, and combined with specific task search models, our technique finds image areas relevant to the performance of multiple objective tasks, without the need for prior learning.

    The general mode acts on eye-level imagery with parameters chosen from careful

    experimentation and requires no machine learning stage. The technique is built upon feature

    points and the descriptors of these feature points can be compared to database representations

    of stored objects to narrow the focus of the attention prediction map for object class search,

    all in one algorithmic iteration.

    Disadvantage

The application to compression is just one possible use of a segmentation

    algorithm based on visually salient information.


    2.6. VISUAL ATTENTION DETECTION IN VIDEO SEQUENCES USING

    SPATIOTEMPORAL CUES

-Yun Zhai and Mubarak Shah

    A hierarchical spatial attention representation is established to reveal the interesting

    points in images as well as the interesting regions. Finally, a dynamic fusion technique is

    applied to combine both the temporal and spatial saliency maps, where temporal attention is

    dominant over the spatial model when large motion contrast exists, and vice versa. The

    proposed spatiotemporal attention framework has been extensively applied on several video

    sequences, and attended regions are detected to highlight interesting objects and motions

    present in the sequences with very high user satisfaction rate.

    Methods used:

    Video attention detection

    Propose a bottom-up approach for modeling the spatiotemporal attention in video

    sequences. The proposed technique is able to detect the attended regions as well as attended

    actions in video sequences. Different from the previous methods, most of which are based on

    the dense optical flow fields, our proposed temporal attention model utilizes the interest point

    correspondences and the geometric transformations between images.

In this model, feature points are first detected in consecutive video images, and correspondences are established between the interest points using the Scale Invariant Feature Transform (SIFT).
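An illustrative version of this correspondence step using OpenCV's SIFT and a standard ratio test is shown below; the file names and the 0.75 ratio are assumptions.

import cv2

f1 = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)    # illustrative paths
f2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(f1, None)
kp2, des2 = sift.detectAndCompute(f2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
# Lowe-style ratio test keeps only distinctive correspondences.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "interest-point correspondences")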

    Spatiotemporal saliency map

    A linear time algorithm is developed to compute pixel-level saliency maps. In this

    algorithm, color statistics of the images are used to reveal the color contrast information in

    the scene. Given the pixel-level saliency map, attended points are detected by finding the

    pixels with the local maxima saliency values. The region-level attention is constructed based

upon the attended points. Given an attended point, a unit region is created with the point as its center.

    This region is then iteratively expanded by computing the expansion potentials on the

    sides of the region. Rectangular attended regions are finally achieved. The temporal and

    spatial attention models are finally combined in a dynamic fashion. Higher weights are

    assigned to the temporal model if large motion contrast is present in the sequence.


    Result

    Presented a spatiotemporal attention detection framework for detecting both attention

    regions and interesting actions in video sequences. The saliency maps are computed

    separately for the temporal and spatial information of the videos.

    Disadvantage

Fails to highlight the entire salient region, or highlights smaller salient regions better than larger ones.

    2.7. BACKGROUND MODELING USING MIXTURE OF GAUSSIANS FOR

    FOREGROUND DETECTION - A SURVEY

    -T. Bouwmans, F. El Baf, B. Vachon

    Mixture of Gaussians is a widely used approach for background modeling to detect

    moving objects from static cameras. Numerous improvements of the original method

    developed by Stauffer and Grimson have been proposed over the recent years and the purpose

    of this paper is to provide a survey and an original classification of these improvements. It

also discusses relevant issues to reduce the computation time. Firstly, the original MOG is recalled and discussed in light of the challenges met in video sequences.

This survey categorizes the different improvements found in the literature, classifies them in terms of the strategies used to improve the original MOG, and discusses them in terms of the critical situations they claim to handle. After analyzing the

    strategies and identifying their limitations, we conclude with several promising directions for

    future research.

Methods used

Background Modeling

    In the context of a traffic surveillance system, Friedman and Russell proposed to

    model each background pixel using a mixture of three Gaussians corresponding to road,

    vehicle and shadows. This model is initialized using an EM algorithm. Then, the Gaussians

    are manually labeled in a heuristic manner as follows: the darkest component is labeled as

    shadow; in the remaining two components, the one with the largest variance is labeled as

    vehicle and the other one as road.


This remains fixed for the whole process, resulting in a lack of adaptation to changes over time. First, each pixel is characterized by its intensity in the RGB color space. Then, the probability of observing the current pixel value is given by a weighted sum of the K Gaussian components, as sketched below.
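In the standard multidimensional formulation (as in Stauffer and Grimson), that probability is P(Xt) = Σ(k=1..K) wk · η(Xt; μk, Σk), a weighted sum of K Gaussian densities with weights wk, means μk and covariances Σk. A practical, hedged example using OpenCV's MOG2 variant (one of the improved methods this survey covers; the file name and parameters are illustrative) is:

import cv2

cap = cv2.VideoCapture("traffic.avi")            # illustrative input file
mog = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                         detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = mog.apply(frame)                   # 255 = foreground, 127 = shadow, 0 = background
cap.release()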

    Foreground Detection

    In this case, the ordering and labeling phases are conserved and only the matching test

is changed to be more exact statistically. Indeed, Stauffer and Grimson checked every new pixel against the K existing distributions to classify it as a background or foreground pixel.

    Result

Allows the reader to survey the strategies and can effectively guide the selection of the best improvement for a specific application.

    Disadvantage

Can lead to misdetection of foreground objects and background.

    2.8. EFFICIENT HIERARCHICAL GRAPH-BASED VIDEO SEGMENTATION

-Matthias Grundmann, Vivek Kwatra, Mei Han and Irfan Essa

    This hierarchical approach generates high quality segmentations, which are

    temporally coherent with stable region boundaries, and allows subsequent applications to

choose from varying levels of granularity. It further improves segmentation quality by using dense optical flow to guide temporal connections in the initial graph, and also proposes two novel approaches to improve the scalability of this technique:

(a) a parallel out-of-core algorithm that can process volumes much larger than an in-core algorithm can.

    (b) a clip-based processing algorithm that divides the video into overlapping clips in

    time, and segments them successively while enforcing consistency.


    Methods used

    Graph-based Algorithm

Specifically, for image segmentation, a graph is defined with the pixels as nodes, connected by edges based on an 8-neighborhood. Edge weights are derived from the per-pixel normalized color difference.
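For a single frame, the scikit-image implementation of the Felzenszwalb-Huttenlocher algorithm gives a quick, hedged illustration (the file name and parameters below are placeholders); the paper extends this graph construction to the spatio-temporal volume.

from skimage import io
from skimage.segmentation import felzenszwalb

frame = io.imread("frame_000.png")                     # illustrative path
labels = felzenszwalb(frame, scale=100, sigma=0.8, min_size=50)
print(labels.max() + 1, "regions in this frame")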

    Hierarchical Spatio-Temporal Segmentation:

The above algorithm can be extended to video by constructing a graph over the spatio-temporal video volume with edges based on a 26-neighborhood in 3D space-time. Following this, the same segmentation algorithm can be applied to obtain volumetric regions. This simple approach generally leads to somewhat underwhelming results due to several drawbacks.

    Parallel Out-of-Core Segmentation:

It consists of a multi-grid-inspired out-of-core algorithm that operates on a subset of the video volume. Performing multiple passes over windows of increasing size, it still generates a segmentation identical to that of the in-memory algorithm. Besides segmenting large

    videos, this algorithm takes advantage of modern multi-core processors, and segments several

    parts of the same video in parallel.

    Result

This method applies the segmentation algorithm to a wide range of videos, from classic

    examples to long dynamic movie shots, studying the contribution of each part of this

    algorithm.

    Disadvantage

    However, as it defines a graph over the entire video volume, there is a restriction on

the size of the video that it can process, especially for the pixel-level over-segmentation stage.

    2.9. FAST APPROXIMATE ENERGY MINIMIZATION VIA GRAPH CUTS

-Yuri Boykov, Olga Veksler and Ramin Zabih

This paper addresses the problem of minimizing a large class of energy functions that occur in early vision. The major restriction is that the energy function's smoothness term must only involve pairs of pixels. It proposes two algorithms that use graph cuts to compute a local

    minimum even when very large moves are allowed.


The first move we consider is an α-β swap: for a pair of labels α, β, this move exchanges the labels between an arbitrary set of pixels labeled α and another arbitrary set labeled β.

The first algorithm generates a labeling such that there is no swap move that decreases the energy. The second move we consider is an α-expansion: for a label α, this move assigns an arbitrary set of pixels the label α. Our second algorithm, which requires the smoothness term to be a metric, generates a labeling such that there is no expansion move that decreases the energy. Moreover, this solution is within a known factor of the global minimum.

    Methods used

    Energy minimization via graph cuts:

    The most important property of these methods is that they produce a local minimum

    even when large moves are allowed. In this section, we discuss the moves we allow, which

    are best described in terms of partitions. We sketch the algorithms and list their basic

    properties. We then formally introduce the notion of a graph cut, which is the basis for our

    methods.

    Graph cuts

    The minimum cut problem is to find the cut with smallest cost. There are numerous

algorithms for this problem with low-order polynomial complexity; in practice these methods run in near-linear time.
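The toy construction below shows how a binary labeling problem maps onto a minimum s-t cut: unary costs become terminal edge capacities and a constant pairwise cost links 4-connected neighbours. It uses networkx for clarity only; it is far slower than the specialised max-flow solvers the paper relies on, and all values are illustrative.

import networkx as nx
import numpy as np

def graph_cut_labels(unary_fg, unary_bg, beta=1.0):
    """unary_fg / unary_bg: HxW costs of labeling a pixel foreground / background."""
    h, w = unary_fg.shape
    g = nx.DiGraph()
    for y in range(h):
        for x in range(w):
            p = (y, x)
            # Terminal links: a capacity is paid when the pixel ends up on the
            # other side of the cut, i.e. when it takes the corresponding label.
            g.add_edge("s", p, capacity=float(unary_bg[y, x]))  # paid if p is labelled background
            g.add_edge(p, "t", capacity=float(unary_fg[y, x]))  # paid if p is labelled foreground
            for q in ((y + 1, x), (y, x + 1)):                  # 4-neighbour smoothness links
                if q[0] < h and q[1] < w:
                    g.add_edge(p, q, capacity=beta)
                    g.add_edge(q, p, capacity=beta)
    _, (source_side, _) = nx.minimum_cut(g, "s", "t")
    labels = np.zeros((h, w), dtype=int)
    for p in source_side:
        if p != "s":
            labels[p] = 1            # pixels left connected to the source are foreground
    return labels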

    Disadvantage

This method can produce only a local minimum of the energy.

    2.10. GEODESIC MATTING: A FRAMEWORK FOR FAST INTERACTIVE

IMAGE AND VIDEO SEGMENTATION AND MATTING

    -Xue Bai, Guillermo Sapiro

    The proposed technique is based on the optimal, linear time, computation of weighted

    geodesic distances to user-provided scribbles, from which the whole data is automatically

    segmented. The weights are based on spatial and/or temporal gradients, considering the

    statistics of the pixels scribbled by the user, without explicit optical flow or any advanced and

    often computationally expensive feature detectors. These could be naturally added to the

    proposed framework as well if desired, in the form of weights in the geodesic distances.


    Methods used

    Segmentation

The algorithm starts from two types of user-provided scribbles, F for foreground and B for

    background, roughly placed across the main regions of interest. Now the problem is how to

    learn from them and propagate this prior information/labeling to the entire image, exploiting

    both the marked pixel statistics and their positions.

    Feature Distribution Estimation

    The role of the scribbles is twofold. The scribbles indicate spatial constraints, and also

    collect labeled information from F/B regions. We use discriminative features to learn from

    the samples on the scribbles (pixels marked by the user via the scribbles), and to classify all

    the remaining pixels in the image.

    Geodesic Distance:

    We use the geodesic distance from these user-provided scribbles to classify the pixels

x in the image (outside of the scribbles), labeling them F or B. The geodesic distance d(x) is

    simply the smallest integral of a weight function over all possible paths from the scribbles to

    x (in contrast with the average distance as used in random walks or diffusion/Laplace based

    frameworks).
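A self-contained sketch of this definition, under illustrative assumptions (4-connected grid, per-pixel weights supplied by the caller, e.g. a gradient of the foreground likelihood): Dijkstra's algorithm returns, for every pixel, the smallest accumulated weight over all paths from the scribbled pixels.

import heapq
import numpy as np

def geodesic_distance(weight, scribble_mask):
    """weight: HxW per-pixel cost; scribble_mask: boolean, True at scribbled pixels."""
    h, w = weight.shape
    dist = np.full((h, w), np.inf)
    heap = [(0.0, (int(y), int(x))) for y, x in zip(*np.nonzero(scribble_mask))]
    for _, p in heap:
        dist[p] = 0.0                      # scribbled pixels are the sources
    heapq.heapify(heap)
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if d > dist[y, x]:
            continue                       # stale queue entry
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + 0.5 * (weight[y, x] + weight[ny, nx])
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, (ny, nx)))
    return dist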

    Result

    Presented a geodesics-based algorithm for (interactive) natural image, 3D, and video

segmentation and matting. We introduced the framework for still images and extended it to

    video segmentation and matting, as well as to 3D medical data.

    Disadvantage

    Although the proposed framework is general, we mainly exploited weights in the

geodesic computation that depend on the pixel value distributions. The algorithm does not work

    when these distributions significantly overlap.


    2.11. INTERACTIVE VIDEO SEGMENTATION SUPPORTED BY MULTIPLE

    MODALITIES, WITH AN APPLICATION TO DEPTH MAPS

    -Jeroen van Baar, Paul Beardsley, Marc Pollefeys, Markus Gross

    This paper proposes an interactive method for the segmentation of objects in video. It

    aims to exploit multiple modalities to reduce the dependency on color discrimination alone.

Given an initial segmentation for the first and last frame of a video sequence, this method

    aims to propagate the segmentation to the intermediate frames of the sequence. Video frames

    are first segmented into super pixels.

    The segmentation propagation is then regarded as a super pixels labeling problem.

    The problem is formulated as an energy minimization problem which can be solved

    efficiently. Higher-order energy terms are included to represent temporal constraints. Our

    proposed method is interactive, to ensure correct propagation and relabel incorrectly labeled

    super pixels.

    Methods used

    Segmentation Propagation as Energy Minimization

We formulate the problem of propagating the known segmentations (of the first and last frames) as an energy minimization:

    E(x) = \sum_i \phi(x_i) + \sum_{(i,j)} \psi(x_i, x_j) + \sum_c \psi_c(x_c)

Here \phi(x_i) represents a unary term, \psi(x_i, x_j) represents a binary term between neighboring super pixels x_i and x_j, and \psi_c(x_c) represents a so-called higher-order clique term. Each super pixel is assigned a label, with the set of labels L defined by the different segments.

    Interactive Segmentation Correction

    Super pixels may have an incorrect label after propagation. It is therefore necessary

    for the user to correct these incorrect segmentation labels. Rather than requiring to re-label

    individual pixels, in our case the interactive correction step can be more easily performed on

    the super pixels directly.

    Exploiting Multiple Modalities:

    We exploit the thermal signal in the super pixel segmentation and in the matching of

    super pixels. This is especially helpful for scenes with human actors, since the thermal signal

    helps to separate the actors from their background, and could also help to separate actors

    from each other.


    Result:

Described an interactive video segmentation approach based on the propagation of known segmentations for the first and last frame to the intermediate frames of a video sequence. The straightforward matching of super pixels across the video sequence could easily handle moving cameras and non-rigidly moving objects.

    Disadvantage:

The method is not very efficient when the objects are moving very fast.

    2.12. SUMMARY

No. | Title | Technique Used | Drawbacks
--- | ----- | -------------- | ---------
1 | Saliency-based video segmentation with graph cuts and sequentially updated priors | Markov random field (MRF) model | Segmented regions were randomly switched
2 | Saliency detection using maximum symmetric surround | Saliency computation methods | The saliency maps generated by this method suffer from low resolution
3 | Automatic object extraction in single-concept videos | Object modeling and extraction | Some of the motion cues might be negligible due to low contrast
4 | Visual saliency from image features with application to compression | Feature point extraction | Compression is just one possible use for a segmentation algorithm based on visually salient information
5 | Visual attention detection in video sequences using spatiotemporal cues | Spatiotemporal saliency map | Fails to highlight the entire salient region
6 | Background modeling using mixture of Gaussians for foreground detection | Background modeling | Leads to misdetection of foreground objects and background
7 | Efficient hierarchical graph-based video segmentation | Hierarchical spatio-temporal segmentation | There is a restriction on the size of the video
8 | Fast approximate energy minimization via graph cuts | Energy minimization via graph cuts | Produces only an approximate (local-minimum) solution, not the global optimum
9 | A framework for fast interactive image and video segmentation and matting | Feature distribution estimation | Does not work when the pixel-value distributions significantly overlap
10 | Interactive video segmentation supported by multiple modalities with an application to depth maps | Interactive segmentation correction | Not very efficient when objects move very fast


    CHAPTER III

    PROPOSED SYSTEM


    3.1. INTRODUCTION

The project proposes efficient motion detection and people counting based on background subtraction using a dynamic threshold. Three different methods are used for object detection, and their performance is compared in terms of detection accuracy. The techniques used are frame differencing, dynamic-threshold-based detection, and the mixture-of-Gaussians model. After foreground object detection, parameters such as speed, velocity, and angle of motion are determined.

Most previous methods depend on the assumption that the background is static over short time periods. However, structured motion patterns of the background, which are distinct from variations due to noise, are hardly tolerated under this assumption and thus still lead to high false-positive rates when previous models are used. In dynamic-threshold-based object detection, morphological processing and filtering are also used to remove unwanted pixels from the background.

Along with this dynamic threshold, we introduce a background subtraction algorithm for temporally dynamic texture scenes using a mixture of Gaussians, which greatly attenuates color variations generated by background motion while still highlighting moving objects. Finally, the proposed method will be shown to be effective for background subtraction in dynamic texture scenes compared to several competitive methods, and the moving-object parameters will be evaluated for all methods.
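As a rough illustration of the mixture-of-Gaussians step, the following MATLAB sketch uses the Computer Vision Toolbox foreground detector; the file name and parameter values are illustrative assumptions, not the settings of the final system.

    % Sketch: mixture-of-Gaussians background subtraction on a video file.
    % Assumes the Computer Vision Toolbox; 'input.avi' and the parameter
    % values below are illustrative only.
    reader   = VideoReader('input.avi');
    detector = vision.ForegroundDetector('NumGaussians', 3, ...
                                         'NumTrainingFrames', 50);
    while hasFrame(reader)
        frame  = readFrame(reader);
        fgMask = step(detector, frame);     % binary foreground mask
        fgMask = bwareaopen(fgMask, 50);    % drop small noisy blobs
        imshow(fgMask); drawnow;
    end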

    3.2. OBJECTIVE

The objective is background subtraction for accurate moving-object detection from dynamic scenes, using dynamic threshold detection and a mixture-of-Gaussians model, together with the determination of object parameters.

    3.3. BLOCK DIAGRAM

[Block diagram: Input Video → Frame Separation → Frame Subtraction → Dynamic Threshold → Foreground Detection → Connected Component Analysis]

Fig. 3.1. Block Diagram of Proposed System


    3.4. METHODOLOGIES

    Frame Separation

    Frame Subtraction

    Dynamic Threshold Approach

    Morphological Filtering

    Object Detection

    Parameter analysis

    3.4.1. Frame Separation

A method to shorten the time required for frame separation in a time-division-multiplexed delta modulation system is described. Employing successive "1"s as the sync pattern, the system detects the sync channel out of the delta-modulated information pulses, which occur at an average rate of one half.

    Using a memory device with the capacity of one frame, the detection is performed by

    taking successive frame correlation of each channel in parallel. For example, in the system

    with 20 channels, the frame separation is established in approximately six frame periods.

The system includes facilities to stabilize the frame separation against errors in the sync pattern and against the occurrence of information patterns similar to the sync pattern.
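In this project, frame separation simply means splitting the input video into its individual frames. A minimal MATLAB sketch is given below; the file name and output folder are illustrative, and the frames/ folder is assumed to exist.

    % Sketch: separate an input video into individual image frames.
    v = VideoReader('input.avi');
    k = 0;
    while hasFrame(v)
        frame = readFrame(v);
        k = k + 1;
        imwrite(frame, sprintf('frames/frame_%04d.png', k));
    end
    fprintf('Separated %d frames at %.2f frames per second\n', k, v.FrameRate);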

    3.4.2. Frame Subtraction

    In digital photography, dark-frame subtraction is a way to minimize image noise for

    pictures taken with long exposure times. It takes advantage of the fact that a component of

    image noise, known as fixed-pattern noise, is the same from shot to shot: noise from the

    sensor, dead or hot pixels. It works by taking a picture with the shutter closed.

    A dark frame is an image captured with the sensor in the dark, essentially just an

    image of noise in an image sensor. A dark frame, or an average of several dark frames, can

    then be subtracted from subsequent images to correct for fixed-pattern noise such as that

caused by dark current. Dark-frame subtraction has been done for some time in scientific imaging;

    many newer consumer digital cameras offer it as an option, or may do it automatically for

    exposures beyond a certain time.


Visible fixed-pattern noise is often caused by hot pixels: pixel sensors with higher-than-normal dark current. On long exposures, they can appear as bright pixels. Sensors on the CCD that always appear as brighter pixels are called stuck pixels, while sensors that only brighten up after long exposures are called hot pixels.

The dark-frame-subtraction technique is also used in digital photogrammetry, to improve the contrast of satellite and aerial photograms, and is considered part of "best practice".

    Each CCD has its own dark signal signature that is present on every acquired image

(bias, dark frame, light frame). By cooling the CCD to a very low temperature (down to −30 °C), the dark signal is attenuated but not completely cancelled. Image processing is then

    required to remove dark signature from each raw image. The method usually used consists of

    acquiring dark frames that are subsequently subtracted from each image. This permits

    keeping only relevant information related to the observed target.

    During an acquisition session, an ideal dark frame is obtained by acquiring and

    averaging several dark frames. This ideal dark frame has a reduced noise signature and is, in

    theory, acquired with the same conditions (temperature and moisture) as the observation

    images. This averaged dark frame will be subtracted from every raw image.

This approach for reducing dark signal produces very nice astronomical images. It is suitable for many applications which do not require very accurate photometry results. However, the technique is not necessarily sufficiently accurate for detecting and analyzing very low signal-to-noise ratio objects.
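In the proposed system, the same subtraction idea is applied between consecutive video frames rather than dark frames. A minimal MATLAB sketch follows; the frame file names are illustrative and follow the separation step above.

    % Sketch: absolute difference between two consecutive grayscale frames.
    f1 = rgb2gray(imread('frames/frame_0001.png'));
    f2 = rgb2gray(imread('frames/frame_0002.png'));
    d  = imabsdiff(f2, f1);    % |f2 - f1|, computed without integer wrap-around
    imshow(d, []);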

    3.4.3. Dynamic Threshold Approach

Fixed-decision-boundary (fixed-threshold) classification approaches have been successfully applied to segment human skin. These fixed thresholds mostly fail in two situations, since they only search for a certain skin color range:

1) Any non-skin object may be classified as skin if the non-skin object's color values fall in the fixed threshold range.

2) Any true skin may be mistakenly classified as non-skin if its color values do not fall in the fixed threshold range.

Instead of predefined fixed thresholds, novel online-learned dynamic thresholds are used to overcome the above drawbacks. The experimental results show that our method is robust in overcoming these drawbacks.
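In this project the dynamic threshold is applied to the frame-difference image rather than to skin color. One simple way to obtain such a per-frame, data-driven threshold is Otsu's method; the sketch below illustrates the idea and is not necessarily the exact rule used in the implementation.

    % Sketch: per-frame (dynamic) threshold on the difference image d
    % from the previous step, instead of a fixed constant.
    level  = graythresh(d);     % Otsu threshold in [0,1], recomputed each frame
    fgMask = im2bw(d, level);   % binary foreground-candidate mask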


    3.4.4. Morphological Filtering

Given a sampled binary image signal f[x] with values 1 for the image object and 0 for the background, typical image transformations involving a moving window set W = {y_1, y_2, ..., y_n} of n sample indexes would be

    \psi_b(f)[x] = b(f[x - y_1], ..., f[x - y_n])        (1)

where b(v_1, ..., v_n) is a Boolean function of n variables. The mapping f -> \psi_b(f) is called a Boolean filter. By varying the Boolean function b, a large variety of Boolean filters can be obtained. For example, choosing a Boolean AND for b would shrink the input image object, whereas a Boolean OR would expand it.

Numerous other Boolean filters are possible, since there are 2^(2^n) possible Boolean functions of n variables. The main applications of such Boolean image operations have been in biomedical image processing, character recognition, object detection, and general 2D shape analysis.

Among the important concepts offered by mathematical morphology was to use sets to represent binary images and set operations to represent binary image transformations. Specifically, given a binary image, let the object be represented by the set X and its background by the set complement X^c. The Boolean OR transformation of X by a (window) set B is equivalent to the Minkowski set addition, also called dilation, of X by B:

    X \oplus B = \{ x + y : x \in X, y \in B \}        (2)

Extending morphological operators from binary to gray-level images can be done by using set representations of signals and transforming these input sets via morphological set operations. Thus, consider an image signal f(x) defined on the continuous or discrete plane D = R^2 or Z^2 and assuming values in \bar{R} = R \cup \{-\infty, +\infty\}. Thresholding f at all amplitude levels v produces an ensemble of binary images represented by the threshold sets

    \Theta_v(f) = \{ x \in D : f(x) \ge v \},   -\infty < v < +\infty        (3)

    Lattice Opening Filters

The three types of nonlinear filters defined below have proven to be very useful for image enhancement. If a 2D image f contains 1D objects, e.g. lines, and B is a 2D disk-like structuring element, then the simple opening or closing of f by B will eliminate these 1D objects. Another problem arises when f contains large-scale objects with sharp corners that need to be preserved; in such cases, opening or closing f by a disk B will round these corners. These two problems could be avoided in some cases if we replace the conventional opening with a radial opening.
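A small MATLAB sketch of the morphological clean-up applied to the binary foreground mask follows; the structuring-element shape and the size thresholds are illustrative.

    % Sketch: morphological filtering of the binary foreground mask.
    se     = strel('disk', 3);           % disk-like structuring element B
    fgMask = imopen(fgMask, se);         % opening removes thin spurious parts
    fgMask = imclose(fgMask, se);        % closing bridges small gaps
    fgMask = bwareaopen(fgMask, 50);     % remove components smaller than 50 px
    fgMask = imfill(fgMask, 'holes');    % fill interior holes of the objects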

    3.4.5. Object Detection

A substantial amount of research has been done on developing techniques for automatically locating objects of interest in digitized pictures; drawing the boundaries around objects is essential for pattern recognition, object tracking, image enhancement, data reduction, and various other applications. This constitutes a good survey of research and applications in image processing and picture analysis. Most researchers in picture analysis have assumed that:

    (1) The image of an object is more or less uniform or smooth in its local properties

    (that is, illumination, color, and local texture are smoothly changing inside the image of an

    object).

(2) There is a detectable discontinuity in local properties between the images of two different objects. These two assumptions are adopted in this paper, and textured images are not considered.

    The work on automatic location of objects in digitized images has split into two

approaches: edge detection and edge following versus region growing. Edge detection applies local independent operators over the picture to detect edges and then uses algorithms to trace the boundaries by following the detected local edges; a recent survey of the literature covers this area.

The region-growing approach uses various clustering algorithms to grow regions of almost uniform local properties in the image for typical applications. More detailed references will be given later. In this method the two approaches are combined to complement each other; the result is a more powerful mechanism for segmenting pictures into objects. A new edge detector is developed and combined with new region-growing techniques to locate objects; in so doing, the confusion that arises in regular edge following when more than one isolated object lies on a uniform background in the scene is resolved.
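For the connected-component analysis block of Fig. 3.1, a minimal MATLAB sketch of locating the detected objects in the cleaned foreground mask is given below; the drawing of boxes is only for visualization.

    % Sketch: locate objects in the cleaned foreground mask and mark them.
    [labels, numObjects] = bwlabel(fgMask);                 % connected components
    stats = regionprops(labels, 'BoundingBox', 'Centroid', 'Area');
    imshow(frame); hold on;
    for i = 1:numObjects
        rectangle('Position', stats(i).BoundingBox, 'EdgeColor', 'g');
        plot(stats(i).Centroid(1), stats(i).Centroid(2), 'r+');
    end
    hold off;
    title(sprintf('%d objects detected', numObjects));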


    3.4.6. Parameters Analysis

This section deals with the parameters used to evaluate the methods:

Correlation

Sensitivity

MSE

PSNR

    Correlation

    Correlation and Convolution are basic operations that we will perform to extract

    information from images. They are in some sense the simplest operations that we can perform

    on an image, but they are extremely useful. Moreover, because they are simple, they can be

    analyzed and understood very well, and they are also easy to implement and can be computed

    very efficiently.

    Our main goal is to understand exactly what correlation and convolution do, and why

    they are useful. We will also touch on some of their interesting theoretical properties; though

    developing a full understanding of them would take more time than we have.

    Sensitivity

We estimate the sensitivity of the image processing operators with respect to parameter changes by performing a sensitivity analysis. This is not new to the image processing community, but it is often left out for performance reasons.

    MSE & PSNR

    Image quality assessment is an important but difficult issue in image processing

    applications such as compression coding and digital watermarking. For a long time, mean

square error (MSE) and peak signal-to-noise ratio (PSNR) have been widely used to measure the

    degree of image distortion because they can represent the overall gray-value error contained

    in the entire image, and are mathematically tractable as well. In many applications, it is

    usually straightforward to design systems that minimize MSE or PSNR.

    MSE works satisfactorily when the distortion is mainly caused by contamination of

    additive noise. However the problem inherent in MSE and PSNR is that they do not take into

    account the viewing conditions and visual sensitivity with respect to image contents. With

MSE or PSNR, only gray-value differences between corresponding pixels of the original and the distorted version are considered. Pixels are treated as being independent of their neighbors. Moreover, all pixels in an image are assumed to be equally important. This, of

    course, is far from being true. As a matter of fact, pixels at different positions in an image can

    have very different effects on the human visual system (HVS).
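The three quantitative parameters above can be computed directly in MATLAB; the sketch below assumes two grayscale images of the same size, and the file names are illustrative.

    % Sketch: correlation, MSE and PSNR between a reference image and a
    % processed version of it (both grayscale, same size, 8-bit range).
    ref  = double(imread('reference.png'));
    test = double(imread('processed.png'));

    c       = corr2(ref, test);                 % 2-D correlation coefficient
    mseVal  = mean((ref(:) - test(:)).^2);      % mean square error
    psnrVal = 10 * log10(255^2 / mseVal);       % peak signal-to-noise ratio (dB)

    fprintf('Correlation = %.4f, MSE = %.2f, PSNR = %.2f dB\n', c, mseVal, psnrVal);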

    3.4.7. Advantages

    Less sensitive to noise

    Automatic background updating model

    Accuracy is more

    3.4.8. Applications

    Video surveillance

    Object detection

    People counting


    CHAPTER 4

    SYSTEM SPECIFICATION


    4.1. HARDWARE SPECIFICATION

    The computer used for the development of the project is

    Processor : PENTIUM IV

RAM : 256 MB DDR

    Hard Disc : 80 GB

    Floppy Disc : 1.44 MB

    CD drive : 52 * CD ROM DRIVE

Monitor : Samsung 17-inch

    Keyboard : 108 keys Samsung

    Mouse : Logitech scroll mouse

    Zip Drive : 250 MB

    Printer : HP DeskJet

    4.2. SOFTWARE SPECIFICATION

    The project was developed using the following software.

    Operating System : Windows XP

    Software : MATLAB


    4.3. SOFTWARE FEATURES:

The software used in this project is MATLAB 7.10 or above.

4.3.1. Introduction To MATLAB

MATLAB is a product of The MathWorks, Inc. The name MATLAB stands for MATRIX LABORATORY. It is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.

Typical uses include

    1. Math and computation

    2. Algorithm development

    3. Data acquisition

    4. Modeling, simulation, and prototyping

    5. Data analysis, exploration, and visualization

    6. Scientific and engineering graphics

    7. Application development, including graphical user interface building

    MATLAB is an interactive system whose basic data element is an array that does not

    require dimensioning. This allows you to solve many technical computing problems,

    especially those with matrix and vector formulations, in a fraction of the time it would take to

    write a program in a scalar non interactive language such as C or FORTRAN.

    The name MATLAB stands for matrix laboratory. MATLAB was originally written to

provide easy access to matrix software developed by the LINPACK and EISPACK projects. Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the

    state of the art in software for matrix computation.

    MATLAB has evolved over a period of years with input from many users. In

    university environments, it is the standard instructional tool for introductory and advanced

    courses in mathematics, engineering, and science. In industry, MATLAB is the tool of choice

    for high-productivity research, development, and analysis.


    MATLAB features a family of add-on application-specific solutions called toolboxes.

    Very important to most users of MATLAB, toolboxes allow you to learn and apply

    specialized technology. Toolboxes are comprehensive collections of MATLAB functions (M-

    files) that extend the MATLAB environment to solve particular classes of problems. Areas in

    which toolboxes are available include signal processing, control systems, neural networks,

    fuzzy logic, wavelets, simulation, and many others.

    4.3.2. THE MATLAB SYSTEM

    The MATLAB system consists of five main parts

    Desktop tools and development environment

    This is the set of tools and facilities that help you use MATLAB functions and files.

    Many of these tools are graphical user interfaces. It includes the MATLAB desktop and

    Command Window, a command history, an editor and debugger, a code analyzer and other

    reports, and browsers for viewing help, the workspace, files, and the search path.

    MATLAB windows

1. Command Window: This is the main window. It is characterized by the MATLAB command prompt (>>). All commands, including those for running user-written programs, are typed in this window at the MATLAB prompt.

2. Graphics Window: The output of all graphics commands typed is flushed into the graphics or figure window.

3. Edit Window: This is the place to create, write, edit and save programs in files called M-files. Any text editor can be used to carry out these tasks.

    The MATLAB mathematical function library

    This is a vast collection of computational algorithms ranging from elementary

    functions, like sum, sine, cosine, and complex arithmetic, to more sophisticated functions like

matrix inverse, matrix eigenvalues, Bessel functions, and fast Fourier transforms.


    The MATLAB language

    This is a high-level matrix/array language with control flow statements, functions,

    data structures, input/output, and object-oriented programming features. It allows both

    programming in the small to rapidly create quick and dirty throw-away programs, and

    programming in the large to create large and complex application programs.

    Graphics handler

    MATLAB has extensive facilities for displaying vectors and matrices as graphs, as

    well as annotating and printing these graphs. It includes high-level functions for two-

    dimensional and three-dimensional data visualization, image processing, animation, and

presentation graphics. It also includes low-level functions that allow you to fully customize the appearance of graphics, as well as to build complete graphical user interfaces for your MATLAB applications.

    4.3.3. The MATLAB application program interface

    This is a library that allows you to write C and Fortran programs that interact with

    MATLAB. It includes facilities for calling routines from MATLAB (dynamic linking),

calling MATLAB as a computational engine, and for reading and writing MAT-files.

    4.3.4. MATLAB documentation

    MATLAB provides extensive documentation, in both printed and online format, to

    help you learn about and use all of its features. If you are a new user, start with this Getting

    Started book. It covers all the primary MATLAB features at a high level, including many

    examples. The MATLAB online help provides task-oriented and reference information about

MATLAB.

    4.3.5. Introduction to M-function programming

    One of the most powerful features of MATLAB is the capability it provides users to

    program their own new functions. MATLAB function programming is flexible and

    particularly easy to learn.


    4.3.6. M-files

    M-files in MATLAB can be scripts that simply execute a series of MATLAB

    statements, or they can be functions that can accept arguments and can produce one or more

    outputs. M FILE functions extend the capabilities of both MATLAB and the Image

    Processing Toolbox to address specific, user-defined applications. M-files are created using a

    text editor and are stored with a name of the form filename.m, such as average.m and filter.m.

The components of a function M-file are the following (a minimal example is given after the list):

    The function definition line

    The H1 line

    Help text

    The function body

    Comments
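A minimal function M-file illustrating these components (the function name and its contents are illustrative):

    function y = average(x)
    %AVERAGE Compute the mean of the elements of a vector.        (H1 line)
    %   Y = AVERAGE(X) returns the arithmetic mean of the
    %   elements of the input vector X.                           (help text)

    % The first line above is the function definition line; what follows
    % is the function body with ordinary comments.
    y = sum(x(:)) / numel(x);   % sum of the elements divided by their count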

    4.3.7. Operators

MATLAB operators are grouped into three main categories (a one-line example of each is given after the list):

    Arithmetic operators that perform numeric computations

    Relational operators that compare operands quantitatively

    Logical operators that perform the functions AND, OR, and NOT
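A brief illustration of each category (the values are arbitrary):

    a = 5 + 3 * 2;       % arithmetic operators: numeric computation (a = 11)
    b = (a > 10);        % relational operator: quantitative comparison (b is true)
    c = b && ~(a == 0);  % logical operators: AND and NOT combined (c is true)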

    4.3.8. Image processing tool box

    Image Processing Toolbox provides a comprehensive set of reference-standard

    algorithms and graphical tools for image processing, analysis, visualization, and algorithm

    development. You can perform image enhancement, image deblurring, feature detection,

    noise reduction, image segmentation, geometric transformations, and image registration.

    Many toolbox functions are multithreaded to take advantage of multicore and multiprocessor

    computers.

Image Processing Toolbox supports a diverse set of image types, including high dynamic range, gigapixel resolution, embedded ICC profile, and tomography. It provides spatial image transformations, morphological operations, neighborhood and block operations, linear filtering and filter design, transforms, image analysis and enhancement, image registration, deblurring, and region-of-interest operations.


    You can extend the capabilities of Image Processing Toolbox by writing your own M-

    files, or by using the toolbox in combination with other toolboxes, such as Signal Processing

    Toolbox and Wavelet Toolbox.

    Graphical tools let you explore an image, examine a region of pixels, adjust the

    contrast, create contours or histograms, and manipulate regions of interest (ROIs). With

    toolbox algorithms you can restore degraded images, detect and measure features, analyze

shapes and textures, and adjust color balance.

    4.3.9. Interfacing with other languages

    Libraries written in Java, ActiveX or .NET can be directly called from MATLAB and

    many MATLAB libraries (for example XML or SQL support) are implemented as wrappers

around Java or ActiveX libraries. Calling MATLAB from Java is more complicated, but can be done with a MATLAB extension, which is sold separately by MathWorks, or by using an undocumented mechanism called JMI (Java-to-MATLAB Interface), which should not be confused with the unrelated Java Metadata Interface that is also called JMI.

As alternatives to the MuPAD-based Symbolic Math Toolbox available from MathWorks, MATLAB can be connected to Maple or Mathematica.

MATLAB has a direct link with modeFRONTIER, a multidisciplinary and multi-objective optimization and design environment written to allow coupling to almost any computer-aided engineering (CAE) tool. Once a result has been obtained using MATLAB, the data can be transferred to and stored in modeFRONTIER.

    4.4. SYSTEM DESIGN

    System design is the process of planning a new system to complement or altogether replace

the old system. The design phase is the first step in moving from the problem

    domain to the solution domain. The design of the system is the critical aspect that affects the

    quality of the software. System design is also called top-level design. The design phase

    translates the logical aspects of the system into physical aspects of the system.


    CHAPTER V

    RESULT AND IMPLEMENTATION


[Result figure: visual saliency output]


    CHAPTER VI

    WORKS TO BE DONE IN PHASE II


Background subtraction for accurate moving-object detection from dynamic scenes, using dynamic threshold detection and a mixture-of-Gaussians model, together with the determination of object parameters with the help of morphological filtering, will be done in the next phase.


    CHAPTER VII

REFERENCES


1. "Saliency-based video segmentation with graph cuts and sequentially updated priors," K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato, in Proc. IEEE Int. Conf. Multimedia Expo, Jun.–Jul. 2009, pp. 638–641.

2. "Saliency detection using maximum symmetric surround," R. Achanta and S. Süsstrunk, in Proc. IEEE Int. Conf. Image Process., Sep. 2010, pp. 2653–2656.

3. "Automatic object extraction in single concept videos," K.-C. Lien and Y.-C. F. Wang, in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2011, pp. 1–6.

4. "Visual saliency from image features with application to compression," P. Harding and N. M. Robertson, Cognit. Comput., vol. 5, no. 1, pp. 76–98, 2012.

5. "Visual attention detection in video sequences using spatiotemporal cues," Y. Zhai and M. Shah, in Proc. ACM Int. Conf. Multimedia, 2006, pp. 815–824.

6. "Background modeling using mixture of Gaussians for foreground detection - A survey," T