AUTOMATIC VIDEO OBJECT EXTRACTION



    CHAPTER I

    INTRODUCTION


    1.1. VIDEO OBJECT EXTRACTION

A desirable video object extraction scheme for content-based applications should meet the following criteria:

1) The segmented object should conform to human perception, i.e., semantically meaningful objects should be segmented.

2) The segmentation algorithm should be efficient and fast.

3) Initialization should be simple and easy for users to operate (human intervention should be minimized).

One feasible solution that satisfies these criteria is edge change detection.

In Video Object (VO) segmentation methods that use mathematical morphology and a perspective motion model, objects of interest must initially be outlined by a human observer. From the manually specified object boundary, the correct object boundary is calculated using a morphological segmentation tool. The obtained Video Object Plane (VOP) is then automatically tracked and updated in successive frames. Such methods have difficulty dealing with large non-rigid object movement and with occlusion, especially in the VOP tracking stage.

The algorithm based on edge change detection allows automatic detection of newly appearing VOs. Edge change detection on the inter-frame difference is another popular family of schemes because it is straightforward to implement and enables automatic detection of new appearances. This ability makes it possible to develop a fully automated object-based system, such as an object-based video surveillance system.

It is found that algorithms based on inter-frame change detection provide automatic detection of objects and allow larger non-rigid motion compared to the mathematical morphology and perspective motion model methods. Their drawback is small false regions detected because of decision errors caused by noise. Thus, small hole removal using morphological operations and removal of false parts, such as uncovered background identified by motion information, are usually incorporated.

Another drawback of edge change detection is that object boundaries are irregular in some critical image areas and must be smoothed and adapted using spatial edge information. Since spatial edge information is useful for generating VOPs with accurate boundaries, a simple binary edge difference scheme may be assumed to be a good solution. In


order to overcome boundary inaccuracy, multiple features, multiple frames, and spatial-temporal entropy methods are used. In addition, this gives robustness to noise and occluding pixels.

The first stage is applied to the first two frames of a video shot to discover moving objects, while the second stage is applied to the rest of the frames to extract the detected objects throughout the shot. The algorithm is applied to the first two frames of the image sequence to detect the moving objects in the video. The first two frames of the video sequence are taken and motion vectors are computed using the Adaptive Rood Pattern Search (ARPS) algorithm. Simultaneously, the components of optical flow are computed for each block in the image. Using the motion vectors, a motion-compensated frame is generated.

Initial segmentation is performed on the first frame of the traffic sequence. Applying the watershed transformation directly to the image gradient results in over-segmentation. To avoid this, the morphological gradient is computed on the frame and then the watershed transformation is performed. Even so, some regions may need to be merged afterwards because of residual over-segmentation.
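As a rough sketch of this step (not the report's exact implementation; the 3x3 window and the marker rule are assumptions), a morphological gradient can be computed and passed to a marker-based watershed as follows:

from scipy import ndimage as ndi
from skimage.segmentation import watershed

def segment_frame(gray):
    """gray: 2-D uint8 frame; returns an integer label image."""
    # Morphological gradient (dilation minus erosion over a 3x3 window) is
    # smoother than the plain intensity gradient, which reduces the number
    # of spurious catchment basins.
    grad = ndi.morphological_gradient(gray.astype(float), size=(3, 3))
    # Markers taken from low-gradient areas keep the region count manageable;
    # the threshold below is purely illustrative.
    markers, _ = ndi.label(grad < grad.mean())
    return watershed(grad, markers)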

A Canny binary edge image is used to localize an object in subsequent frames of the video sequence and to detect true weak edges. Intensity edge pixels are used as feature points due to the key role that edges play in the human visual process and the fact that edges are

    little affected by variation of luminance. Object models evolve from one frame to the next,

    capturing the changes in the shape of objects as they move. The algorithm naturally

    establishes the temporal correspondence of objects throughout the video sequence, and the

    output of the algorithm is a sequence of binary models representing the motion and shape

    changes of the objects.

The object model is obtained by subtracting the background edges from the edge image and eliminating unlinked pixels. After a binary model for the object of interest has been derived, the motion vectors generated by the ARPS algorithm are used to match the subsequent frames in the sequence. Matching is performed on edge images because it is computationally efficient and fairly insensitive to changes in illumination. The degree of change in the shape of an object from one frame to the next is determined using a simplified Hausdorff distance, defined as a combination of distance transformation and correlation: the image is distance transformed and then thresholded by different amounts to form different dilated image sets. To search for the object in the image, it is required to obtain the


amount by which the image must be dilated such that the maximum number of points in the object model are matched to the image set.
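A hedged sketch of this dilate-and-match idea is given below: the distance transform of the frame's edge map is thresholded by increasing amounts, and the dilation radius that covers the largest fraction of model points is selected. The function names and the search range are illustrative, not taken from the report.

import cv2
import numpy as np

def match_fraction(edge_img, model_pts, radius):
    """edge_img: binary edge map (uint8, 255 at edges); model_pts: Nx2 (row, col)."""
    # Distance transform of the inverted edge map gives, at each pixel, the
    # distance to the nearest edge pixel.
    dist = cv2.distanceTransform(255 - edge_img, cv2.DIST_L2, 3)
    # A model point "matches" when it lies inside the edge map dilated by
    # `radius`, i.e. within `radius` of some edge pixel.
    hits = dist[model_pts[:, 0], model_pts[:, 1]] <= radius
    return hits.mean()

def best_dilation(edge_img, model_pts, radii=range(1, 10)):
    # Smallest dilation that maximizes the matched fraction of model points.
    return max(radii, key=lambda r: match_fraction(edge_img, model_pts, r))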

In this automatic VO segmentation algorithm, edge change detection starts with edge detection, the first and most important stage of the human visual process. Edge information plays a key role in capturing physical changes of the corresponding surface in a real scene. However, exploiting a simple difference of edges to extract the shape information of moving objects in a video sequence suffers from a great deal of noise, even with a stationary background. This is because the random noise created in one frame differs from that created in the successive frame, which results in slight changes of the edge locations between successive frames. The difference edge of frames therefore suppresses the noise in the luminance difference by means of the Canny edge detector.
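A minimal sketch of this idea, assuming 8-bit grayscale frames and illustrative Canny thresholds: the edges of the frame difference are gated by the current frame's spatial edges, so that edge-location jitter caused by random noise is suppressed.

import cv2

def difference_edge_map(prev_gray, curr_gray, lo=50, hi=150):
    diff = cv2.absdiff(curr_gray, prev_gray)     # luminance change between frames
    diff_edges = cv2.Canny(diff, lo, hi)         # edges of the change image
    curr_edges = cv2.Canny(curr_gray, lo, hi)    # spatial edges of the current frame
    # Keep only changed edges that coincide with real spatial edges.
    return cv2.bitwise_and(diff_edges, curr_edges)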

    Motion estimation is based on temporal changes in image intensities. The underlying

    supposition behind motion estimation is that the patterns corresponding to objects and

background in a frame of a video sequence move within the frame to form corresponding objects in the subsequent frame. Motion estimation is accomplished using the ARPS algorithm.

    The ARPS algorithm makes use of the fact that the general motion in a frame is usually

    coherent, i.e. if the macro blocks around the current macro block move in a particular

    direction then there is a high probability that the current macro block will also have a similar

    motion vector. This algorithm uses the motion vector of the macro block to its immediate left

    to predict its own motion vector.

The ARPS algorithm tries to achieve a Peak Signal to Noise Ratio (PSNR) similar to that of the Exhaustive Search (ES) algorithm. The ES algorithm, also known as Full Search, is the most computationally expensive block matching algorithm. It calculates the cost function at every possible location in the search window. As a result, it finds the best possible match and gives the highest PSNR of any block matching algorithm. The obvious disadvantage of exhaustive search is that it requires far more computations to estimate the motion vectors. ARPS tries to achieve the same PSNR as ES with far fewer computations.
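For reference, a tiny exhaustive (full) search block matcher using the sum of absolute differences is sketched below; ARPS reaches a similar PSNR while evaluating far fewer candidate positions. The macro block size and search range are illustrative.

import numpy as np

def full_search(ref, cur, block=16, p=7):
    """Return (dy, dx) motion vectors for each macro block of `cur` w.r.t. `ref`."""
    h, w = cur.shape
    mvs = np.zeros((h // block, w // block, 2), dtype=int)
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            tgt = cur[by:by + block, bx:bx + block].astype(int)
            best, best_mv = np.inf, (0, 0)
            for dy in range(-p, p + 1):          # every candidate in the window
                for dx in range(-p, p + 1):
                    y, x = by + dy, bx + dx
                    if y < 0 or x < 0 or y + block > h or x + block > w:
                        continue
                    cand = ref[y:y + block, x:x + block].astype(int)
                    sad = np.abs(tgt - cand).sum()   # cost function (SAD)
                    if sad < best:
                        best, best_mv = sad, (dy, dx)
            mvs[by // block, bx // block] = best_mv
    return mvs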


    CHAPTER II

    LITERATURE SURVEY


    2.1. EXISTING SYSTEM

The existing methodology involves a saliency-based video object extraction (VOE)

    framework. The proposed framework aims to automatically extract foreground objects of

    interest without any user interaction or the use of any training data. To separate foreground

    and background regions within and across video frames, the proposed method utilizes visual

    and motion saliency information extracted from the input video. A conditional random field

    is applied to effectively combine the saliency induced features, which allows us to deal with

    unknown pose and scale variations of the foreground object (and its articulated parts). Based

    on the ability to preserve both spatial continuity and temporal consistency in the proposed

    VOE framework, experiments on a variety of videos verify that our method is able to produce

    quantitatively and qualitatively satisfactory VOE results.

    2.1.1. Introduction

    A human can easily determine the subject of interest in a video, even though that

    subject is presented in an unknown or cluttered background or even has never been seen

    before. With the complex cognitive capabilities exhibited by human brains, this process can

    be interpreted as simultaneous extraction of both foreground and background information

    from a video.

    Many researchers have been working toward closing the gap between human and

    computer vision. However, without any prior knowledge on the subject of interest or training

    data, it is still very challenging for computer vision algorithms to automatically extract the

    foreground object of interest in a video. As a result, if one needs to design an algorithm to

    automatically extract the foreground objects from a video, several tasks need to be addressed.

    1) Unknown object category and unknown number of the object instances in a video.

    2) Complex or unexpected motion of foreground objects due to articulated parts or arbitrary

    poses.

    3) Ambiguous appearance between foreground and background regions due to similar color,

low contrast, insufficient lighting, and other such conditions.


    In practice, it is infeasible to manipulate all possible foreground object or background

    models beforehand. However, if one can extract representative information from either

    foreground or background (or both) regions from a video, the extracted information can be

    utilized to distinguish between foreground and background regions, and thus the task of

    foreground object extraction can be addressed. As discussed later in Section II, most of the

    prior works either consider a fixed background or assume that the background exhibits

    dominant motion across video frames. These assumptions might not be practical for real

    world applications, since they cannot generalize well to videos captured by freely moving

    cameras with arbitrary movements.

    In this system, we propose a robust video object extraction (VOE) framework, which

    utilizes both visual and motion saliency information across video frames. The observed

    saliency information allows us to infer several visual and motion cues for learning foreground

    and background models, and a conditional random field (CRF) is applied to automatically

determine the label (foreground or background) of each pixel based on the observed models.

    With the ability to preserve both spatial and temporal consistency, our VOE

    framework exhibits promising results on a variety of videos, and produces quantitatively and

    qualitatively satisfactory performance. While we focus on VOE problems for single concept

    videos (i.e., videos which have only one object category of interest presented), our proposed

    method is able to deal with multiple object instances (of the same type) with pose, scale, etc.

    variations.

Methods to solve the occlusion problem in tracking multiple interacting objects have been presented previously. Shiloh, Chang and Dockstader overcame occlusion in multiple object tracking by fusing multiple camera inputs. Cucchiara proposed probabilistic masks and

    appearance models to cope with frequent shape changes and large occlusions. Eng developed

    a Bayesian segmentation approach that fused a region-based background subtraction and a

    human shape model for people tracking under occlusion. Wu proposed a dynamic Bayesian

    network which accommodates an extra hidden process for partial occlusion handling. Andrew

    used appearance models to track occluded objects. Siebel proposed a tracking system with

    three co-operating parts: an active shape tracker, a region tracker and a head detector. The

    region tracker exploits the other two modules to solve occlusions. Hieu proposed a template


matching algorithm and updated the template using appearance features smoothed by a Kalman

filter. Tao presented a dynamic background layer model and modeled each moving object as a foreground layer; together with the foreground ordering, this includes the complete information necessary for reliably tracking objects through occlusion. Alper tracked the complete object and evolved the contour from frame to frame by minimizing energy functions.

An obvious step towards video segmentation, as noted in Efficient Hierarchical Graph-Based Video Segmentation by Matthias Grundmann et al., is to apply image segmentation techniques to video frames without considering temporal coherence. These methods are inherently scalable and may generate segmentation results in real time. However, the lack of temporal information from neighboring frames may cause jitter across frames. Freedman and Kisilev applied a sampling-

    based fast mean shift approach to a cluster of 10 frames as a larger set of image features to

    generate smoother results without taking into account temporal information.

    Spatio-temporal video segmentation techniques can be distinguished by whether the

    information from future frames is used in addition to past frames. Causal methods apply

    Kalman filtering to aggregate data over time, which only consider past data. Paris et al.

    derived the equivalent tool of mean-shift image segmentation for video streams based on the

    ubiquitous use of the Gaussian kernel. They achieved real-time performance without

    considering future frames in the video.

    Another class of spatio-temporal techniques takes advantage of both past and future

    data in a video. They treat the video as a 3D space-time volume, and typically use a variant of

    the mean shift algorithm for segmentation. Dementhon applied mean shift on a 3D lattice and

    used a hierarchical strategy to cluster the space-time video stack for computational efficiency.

    Wang et al. used anisotropic kernel mean shift segmentation for video tooning. Wang and

    Adelson used motion heuristics to iteratively segment video frames into motion consistent

    layers. Tracking-based video segmentation methods generally define segments at frame-level

    and use motion, color and spatial cues to force temporal coherence.

    Following the same line of work, Brendel and Todorovic used contour cues to allow

    splitting and merging of segments to boost the tracking performance. Interactive object

    segmentation has recently shown significant progress. These systems produce high quality

    segmentations driven by user input. We exhibit a similar interactive framework driven by our

segmentation. Our video segmentation method builds on Felzenszwalb and Huttenlocher's graph-based image segmentation technique. Their algorithm is efficient, being nearly linear


    in the number of edges in the graph, which makes it suitable for extension to spatio-temporal

    segmentation. We extend the technique to video making use of both past and future frames,

    and improve the performance and efficiency using a hierarchical framework.

    2.1.2. Methodology

    In this paper, we aim at automatically extracting foreground objects in videos which

are captured by freely moving cameras. Instead of assuming that the background motion is dominant and different from that of the foreground, as prior works did, we relax this assumption and allow foreground objects to be present in freely moving scenes.

    We advance both visual and motion saliency information across video frames, and a

    CRF model is utilized for integrating the associated features for VOE (i.e., visual saliency,

    shape, foreground/background color models, and spatial/temporal energy terms). From our

    quantitative and qualitative experiments, we verify that our VOE performance exhibits spatial

consistency and temporal continuity, and our method is shown to outperform state-of-the-art unsupervised VOE approaches. It is worth noting that our proposed VOE framework is an unsupervised approach, which requires neither prior knowledge (i.e., training data) of the object of interest nor user interaction for any annotation.

Fig. 2.1. Block Diagram of Existing System (input video → visual saliency and motion saliency using optical flow → color and shape detection → cooperating multiple cues → detected object)


    Most existing unsupervised VOE approaches assume the foreground objects as

    outliers in terms of the observed motion information, so that the induced appearance, color,

    etc. features are utilized for distinguishing between foreground and background regions.

    However, these methods cannot generalize well to videos captured by freely moving cameras

    as discussed earlier. In this work, we propose a saliency-based VOE framework which learns

    saliency information in both spatial (visual) and temporal (motion) domains.

By applying conditional random fields (CRF), the integration of the resulting

    features can automatically identify the foreground object without the need to treat either

    foreground or background as outliers.

    In general, one can address VOE problems using supervised or unsupervised

    approaches. Supervised methods require prior knowledge on the subject of interest and need

    to collect training data beforehand for designing the associated VOE algorithms. For

    example, Wu and Nevatia and Lin and Davis both decomposed an object shape model in a

    hierarchical way to train object part detectors, and these detectors are used to describe all

    possible configurations of the object of interest (e.g. pedestrians).

    Another type of supervised methods requires user interaction for annotating candidate

foreground regions. For example, some proposed image segmentation algorithms focus on an interactive scheme and require users to manually provide the ground-truth label information.

For videos captured by a monocular camera, methods applied a conditional random field (CRF) maximizing a joint probability of color, motion, and other models to predict the label of


each image pixel. Although the color features can be automatically determined from the input video, these methods still need the user to train object detectors for extracting shape or motion features.

Fig. 2.2. Overview of Existing VOE Framework

    Recently, researchers proposed to use some preliminary strokes to manually select the

    foreground and background regions, and they utilized such information to train local

    classifiers to detect the foreground objects. While these works produce promising results, it

    might not be practical for users to manually annotate a large amount of video data.

    2.1.3. Visual Saliency

The salience (also called saliency) of an item, be it an object, a person, or a pixel, is the state or quality by which it stands out relative to its neighbors. Saliency detection is

    considered to be a key attention mechanism that facilitates learning and survival by enabling

    organisms to focus their limited perceptual and cognitive resources on the most pertinent

    subset of the available sensory data.

    Saliency typically arises from contrasts between items and their neighborhood, such

    as a red dot surrounded by white dots, a flickering message indicator of an answering

    machine, or a loud noise in an otherwise quiet environment. Saliency detection is often

    studied in the context of the visual system, but similar mechanisms operate in other sensory

    systems. What is salient can be influenced by training: for example, for human subjects

    particular letters can become salient by training.

    When attention deployment is driven by salient stimuli, it is considered to be bottom-

    up, memory-free, and reactive. Attention can also be guided by top-down, memory-

    dependent, or anticipatory mechanisms, such as when looking ahead of moving objects or

    sideways before crossing streets. Humans and other animals have difficulty paying attention

    to more than one item simultaneously, so they are faced with the challenge of continuously

    integrating and prioritizing different bottom-up and top-down influences.

    In the domain of psychology, efforts have been made in modeling the mechanism of

    human attention, including the learning of prioritizing the different bottom-up and top-down

    influences.


    In the domain of computer vision, efforts have been made in modeling the mechanism

    of human attention, especially the bottom-up attention mechanism. Such a process is also

    called visual saliency detection.

Generally speaking, there are two kinds of models that mimic the bottom-up saliency mechanism. One is based on spatial contrast analysis; for example, a center-surround mechanism is used to define saliency across scales, inspired by the putative neural mechanism. The other is based on frequency domain analysis. While some works used the amplitude spectrum to assign saliency to rarely occurring magnitudes, Guo et al. use the phase spectrum instead, and a later system uses both the amplitude and the phase information.
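To make the frequency-domain idea concrete, the sketch below flattens the amplitude spectrum and keeps only the phase before transforming back; this is the well-known phase-spectrum formulation and is not claimed to be the exact method of the works cited here. The blur kernel size is an assumption.

import cv2
import numpy as np

def phase_spectrum_saliency(gray, ksize=9):
    f = np.fft.fft2(gray.astype(float))
    # Discard the amplitude: reconstructing from phase alone highlights
    # structure that deviates from the "expected" spectrum.
    recon = np.fft.ifft2(np.exp(1j * np.angle(f)))
    sal = np.abs(recon) ** 2
    sal = cv2.GaussianBlur(sal, (ksize, ksize), 0)
    return sal / (sal.max() + 1e-12)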

A key limitation of many such approaches is their computational complexity, which results in less than real-time performance even on modern computer hardware. Some recent work attempts to overcome these issues, but at the expense of saliency detection quality under some conditions.

    Our attention is attracted to visually salient stimuli. It is important for complex

    biological systems to rapidly detect potential prey, predators, or mates in a cluttered visual

world. However, simultaneously identifying any and all interesting targets in one's visual field has prohibitive computational complexity, making it a daunting task even for the most sophisticated biological brains, let alone for any existing computer.

    One solution, adopted by primates and many other animals, is to restrict complex

    object recognition process to a small area or a few objects at any one time. The many objects

    or areas in the visual scene can then be processed one after the other. This serialization of



visual scene analysis is operationalized through mechanisms of visual attention: a common (although somewhat inaccurate) metaphor for attention is that of a virtual spotlight, shifting to and highlighting different sub-regions of the visual world, so that one region at a time can be subjected to more detailed visual analysis.

Fig. 2.3. Example of visual saliency calculations. (a) Original video frame. (b) and (c) Visual saliency of (a)

    Visual attention may be a solution to the inability to fully process all locations in

    parallel. However, this solution produces a problem. If you are only going to process one

    region or object at a time, how do you select that target of attention? Visual salience helps

    your brain achieve reasonably efficient selection. Early stages of visual processing give rise

    to a distinct subjective perceptual quality which makes some stimuli stand out from among

    other items or locations. Our brain has evolved to rapidly compute salience in an automatic

    manner and in real-time over the entire visual field. Visual attention is then attracted towards

    salient visual locations.

The core of visual salience is a bottom-up, stimulus-driven signal that announces that this location is sufficiently different from its surroundings to be worthy of your attention.

This bottom-up deployment of attention towards salient locations can be strongly modulated, or even sometimes overridden, by top-down, user-driven factors. Thus, a lone red object in a green field will be salient and will attract attention in a bottom-up manner. In addition, if you are looking through a child's toy bin for a red plastic dragon, amidst plastic objects of many vivid colors, no one color may be especially salient until your top-down desire to find the red object renders all red objects, whether dragons or not, more salient.

    Visual salience is sometimes carelessly described as a physical property of a visual

    stimulus. It is important to remember that salience is the consequence of an interaction of a

    stimulus with other stimuli, as well as with a visual system (biological or artificial). As a

straightforward example, consider that a color-blind person will have a dramatically different experience of visual salience than a person with normal color vision, even when both look at exactly the same physical scene.

    As a more controversial example, it may be that expertise changes the salience of

some stimuli for some observers. Nevertheless, because visual salience arises from fairly low-level and stereotypical computations in the early stages of visual processing, the factors contributing to salience are generally quite comparable from one observer to the next, leading to similar experiences across a range of observers and of behavioral conditions.


    2.1.4. Motion Saliency

    Motion saliency detection has an important impact on further video processing tasks,

such as video segmentation, object recognition and adaptive compression. Different from image saliency, in videos, moving regions (objects) catch human beings' attention much more easily than static ones. Based on this observation, we propose a novel method of motion saliency detection, which makes use of the low-rank and sparse decomposition on video slices along the X-T and Y-T planes to achieve the goal, i.e. separating foreground moving objects from backgrounds. However, most of the techniques for images (mentioned above) are not applicable to detecting motion saliency in videos.

Unlike in image saliency detection, moving regions (objects) catch more of human beings' attention than static ones, even when the latter have large contrast to their neighbours in still images. That is to say, the focal point changes from regions with large contrast to their neighbours in images to regions with motion discrimination in videos. Therefore, contrast-based methods can hardly be applied to videos directly.

An exception exists, which extends the spectral residual analysis of images to videos. Actually, the goal of separating moving objects from the background is the same as that of motion saliency detection. A few solutions for separating foreground moving objects from backgrounds have been proposed, such as the Gaussian Mixture Model.

To detect each moving part and its corresponding pixels, we perform dense optical-flow forward and backward propagation at each frame of a video. A moving pixel qt at frame t is determined from the pixel pair detected by forward or backward optical flow propagation.
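A hedged sketch of this forward/backward check using Farneback dense optical flow is given below; the magnitude and consistency thresholds are illustrative assumptions, and the Farneback parameters are generic defaults.

import cv2
import numpy as np

def moving_pixel_mask(prev_gray, curr_gray, min_mag=1.0, max_fb_err=0.5):
    fwd = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                       0.5, 3, 15, 3, 5, 1.2, 0)
    bwd = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
                                       0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Follow the forward flow, then read the backward flow at the landing point.
    x2 = np.clip((xs + fwd[..., 0]).astype(int), 0, w - 1)
    y2 = np.clip((ys + fwd[..., 1]).astype(int), 0, h - 1)
    fb_err = np.linalg.norm(fwd + bwd[y2, x2], axis=-1)
    mag = np.linalg.norm(fwd, axis=-1)
    # A pixel counts as "moving" when its motion is non-trivial and the
    # forward and backward flows are consistent.
    return (mag > min_mag) & (fb_err < max_fb_err)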

    Determination of Shape Cues:

    Although motion saliency allows us to capture motion salient regions within and

    across video frames, those regions might only correspond to moving parts of the foreground

    object within some time interval.

    Determination of Color Cues:

Besides the motion-induced shape information, we also extract both foreground and background color information for improved VOE performance. Since it cannot be assumed that each moving part of the foreground object forms a


complete sampling of itself, we cannot construct foreground or background color models simply based on the visual or motion saliency detection results at each individual frame.

    2.1.5. CONDITIONAL RANDOM FIELDS (CRFs)

Utilizing an undirected graph, a conditional random field (CRF) is a powerful technique

    to estimate the structural information (e.g. class label) of a set of variables with the associated

    observations. For video foreground object segmentation, CRF has been applied to predict the

    label of each observed pixel in an image I.

    It is a class of statistical modeling method often applied in pattern recognition and

    machine learning, where they are used for structured prediction. Whereas an

    ordinary classifier predicts a label for a single sample without regard to "neighboring"

    samples, a CRF can take context into account; e.g., the linear chain CRF popular in natural

    language processing predicts sequences of labels for sequences of input samples.

CRFs are a type of discriminative undirected probabilistic graphical model. They are used to encode known relationships between observations and to construct consistent interpretations. They are often used for labeling or parsing of sequential data, such as natural language text

    or biological sequences and in computer vision. Specifically, CRFs find applications

    in shallow parsing, named entity recognition and gene finding, among other tasks, being an

    alternative to the related hidden Markov models. In computer vision, CRFs are often used for

    object recognition and image segmentation.

    Conditional random fields (CRFs) are a probabilistic framework for labeling and

    segmenting structured data, such as sequences, trees and lattices. The underlying idea is that

    of defining a conditional probability distribution over label sequences given a particular

    observation sequence, rather than a joint distribution over both label and observation

    sequences.

    Fig. 2.4. Motion Saliency Calculated for Fig 2.3. (a) Calculation of the Optical Flow.

    (b) Motion Saliency Derived from (a)


    The primary advantage of CRFs over hidden Markov models is their conditional

    nature, resulting in the relaxation of the independence assumptions required by HMMs in

    order to ensure tractable inference.

    Additionally, CRFs avoid the label bias problem, a weakness exhibited by maximum

    entropy Markov models (MEMMs) and other conditional Markov models based on directed

    graphical models. CRFs outperform both MEMMs and HMMs on a number of real-world

    tasks in many fields, including bioinformatics, computational linguistics and speech

    recognition.

Imagine you have a sequence of snapshots from a day in Justin Bieber's life, and you

    want to label each image with the activity it represents (eating, sleeping, driving, etc.). How

    can you do this?

One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month's worth of labeled snapshots, you might learn

    that dark images taken at 6am tend to be about sleeping, images with lots of bright colors

    tend to be about dancing, and images of cars are about driving, and so on.

    By ignoring this sequential aspect, however, you lose a lot of information. For

example, what happens if you see a close-up picture of a mouth: is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it is more likely this picture is about eating; if, however, the previous image contains

    Justin Bieber singing or dancing, then this one probably shows him singing as well.

    Thus, to increase the accuracy of our labeler, we should incorporate the labels of

    nearby photos, and this is precisely what a conditional random field does.
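To connect this back to pixel labeling, the toy sketch below trades a per-pixel (unary) cost against agreement with 4-connected neighbours using a synchronous ICM-style sweep; it illustrates the kind of energy being optimized, not the inference method used in the works cited here (which rely on graph cuts or exact CRF inference).

import numpy as np

def icm_labeling(unary_fg, unary_bg, beta=0.5, iters=5):
    """unary_fg / unary_bg: HxW data costs for labeling a pixel foreground / background."""
    labels = (unary_fg < unary_bg).astype(np.int32)   # initialise from unaries alone
    for _ in range(iters):
        pad = np.pad(labels, 1, mode='edge')
        # Number of 4-neighbours currently labelled foreground, per pixel.
        nb_fg = pad[:-2, 1:-1] + pad[2:, 1:-1] + pad[1:-1, :-2] + pad[1:-1, 2:]
        cost_bg = unary_bg + beta * nb_fg             # disagreement penalty if we pick background
        cost_fg = unary_fg + beta * (4 - nb_fg)       # disagreement penalty if we pick foreground
        labels = (cost_fg < cost_bg).astype(np.int32)
    return labels                                     # 1 = foreground, 0 = background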

    2.1.6. Disadvantages

It is not possible to work with videos that have a moving background.

    Shape and colour cues extract foreground objects with missing parts.


2.2. SALIENCY BASED VIDEO SEGMENTATION WITH GRAPH CUTS AND SEQUENTIALLY UPDATED PRIORS

-Ken Fukuchi, Kouji Miyazato, Akisato Kimura, Shigeru Takagi and Junji Yamato

    This paper proposes a new method for achieving precise video segmentation without

    any supervision or interaction. The main contributions of this report include

    1) the introduction of fully automatic segmentation based on the maximum a

    posteriori (MAP) estimation of the Markov random field (MRF) with graph cuts and saliency

    driven priors.

    2) the updating of priors and feature likelihoods by integrating the previous

    segmentation results and the currently estimated saliency-based visual attention.

    Methods used

    Markov random field (MRF) model

Here each hidden state corresponds to the label of a position, representing an object or background, and an observation is a frame of the input video. The density calculated in the previous step can be utilized for estimating the priors of objects/backgrounds and the feature likelihoods of the MRF. When calculating priors and likelihoods, the regions extracted from the previous frames are also available.

    Image segmentation

Consider a set of random variables A = {Ax}, x ∈ I, defined on a set I of coordinates. Each random variable Ax takes a value ax from the set L = {0, 1}, which corresponds to the background (0) and an object (1), respectively.

A MAP-based MRF estimation can be formulated as an energy minimization problem, where the energy corresponding to the configuration a is the negative log likelihood of the joint posterior density of the MRF, E(a|D) = -log p(A = a|D), where D represents the input image.

    Result

    Proposed a new method for achieving precise video segmentation without any

    supervised interactions.


    The main contributions included

    1) the introduction of MAP-based frame-wise segmentation with graph cuts and

    saliency-driven priors.

    2) the technique for updating priors and likelihoods with a Kalman Filter.

    Disadvantage:

    Segmented regions were randomly switched

    2.3. SALIENCY DETECTION USING MAXIMUM SYMMETRIC SURROUND

    -Radhakrishna Achanta and Sabine Susstrunk

    Detection of visually salient image regions is useful for applications like object

segmentation, adaptive compression, and object recognition. Recently, full-resolution saliency maps that retain well-defined boundaries have attracted attention. In these maps, boundaries

    are preserved by retaining substantially more frequency content from the original image than

    older techniques. However, if the salient regions comprise more than half the pixels of the

    image, or if the background is complex, the background gets highlighted instead of the salient

    object.

This paper introduces a method for salient region detection that retains the

    advantages of such saliency maps while overcoming their shortcomings. Our method exploits

    features of color and luminance, is simple to implement and is computationally efficient.

    Methods used

    Saliency computation methods:

    Saliency has been referred to as visual attention, unpredictability, rarity, or surprise.

    Saliency estimation methods can broadly be classified as biologically based, purely

    computational, or those that combine the two ideas. In general, most methods employ a low-

    level approach of determining contrast of image regions relative to their surroundings using

    one or more features of intensity, color, and orientation.
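As a simple, hedged illustration of such low-level contrast computation (in the spirit of frequency-tuned saliency, not the maximum-symmetric-surround method itself, which uses a per-pixel surround region instead of the global mean):

import cv2
import numpy as np

def contrast_saliency(bgr):
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    mean = lab.reshape(-1, 3).mean(axis=0)        # global mean colour in Lab
    blurred = cv2.GaussianBlur(lab, (5, 5), 0)    # suppress high-frequency noise
    sal = np.linalg.norm(blurred - mean, axis=2)  # colour/luminance contrast
    return sal / (sal.max() + 1e-8)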

    Graph Based Segmentation

    Graph cuts based methods are popular for image segmentation applications. Boykov

    and Jolly perform interactive segmentation using graph cuts. They require a user to provide

    scribble based input to indicate foreground and background regions.


    A graph cuts based algorithm then segments foreground from background. We use a

    similar approach, however, instead of the user indicating the background and foreground

    pixels using scribbles, we use the saliency map to assign these pixels automatically.

    Result

    This method improves upon six existing state-of-the-art algorithms in precision and

    recall with respect to a ground truth database.

    Disadvantage

    The saliency maps generated by this method suffer from low resolution

    2.4. AUTOMATIC OBJECT EXTRACTION IN SINGLE-CONCEPT VIDEOS

    -Kuo-Chin Lien and Yu-Chiang Frank Wang

    Proposes a motion-driven video object extraction (VOE) method, which is able to

    model and segment foreground objects in single-concept videos, i.e. videos which have only

one object category of interest but may have multiple object instances with pose, scale, and other variations. Given such a video, we construct a compact shape model induced by motion cues, and extract the foreground and background color information accordingly.

    It integrates these feature models into a unified framework via a conditional random

    field (CRF), and this CRF can be applied to video object segmentation and further video

    editing and retrieval applications. One of the advantages of this method is that it does not

    require the prior knowledge of the object of interest, and thus no training data or

    predetermined object detectors are needed; this makes this approach robust and practical to

    real-world problems. Very attractive empirical results on a variety of videos with highly

    articulated objects support the feasibility of our proposed method.

    Methods used

    Object Modeling And Extraction

First, extract the motion cues from the moving object across video frames, and combine the motion-induced shape, foreground and background color models into a CRF.

    Without prior knowledge of the object of interest, this CRF model is designed to address

    VOE problems in an unsupervised setting.

    Conditional random field:

    By utilizing an undirected graph, CRF is a powerful technique to estimate the

    structural information (e.g. class label) of a set of variables. For object segmentation, CRF is


    used to predict the label of each observed pixel in an image I. As shown in Figure 1, pixel i is

    associated with observation zi, while the hidden node Fi indicates its corresponding label (i.e.

    foreground or background).

In this CRF framework, the label Fi is calculated from the observation zi, while the spatial coherence between this output and the neighboring observations zj and labels Fj is simultaneously taken into consideration.

    Extraction of motion cues

The aim was to extract different feature information from these moving parts for the later

    CRF construction. To detect the moving parts and their corresponding pixels, we perform

    dense optical flow forward and backward propagation at every frame.

    Result

Proposed a method which utilizes multiple motion-induced features, such as shape and foreground/background color models, to extract foreground objects in single-concept videos.

    We advanced a unified CRF framework to integrate the above feature models.

    Disadvantage

    Some of the motion cues might be negligible due to low contrast

    2.5. VISUAL SALIENCY FROM IMAGE FEATURES WITH APPLICATION TO

    COMPRESSION

-P. Harding and N. M. Robertson

    This technique is validated using comprehensive human eye-tracking experiments.

    This algorithm is known as Visual Interest (VI) since the resultant segmentation reveals

    image regions that are visually salient during the performance of multiple observer search

tasks. It is demonstrated that the technique works on generic, eye-level photographs and is not dependent on heuristic tuning.

The descriptor-matching property of the SURF feature points can be exploited via object recognition to modulate the context of the attention probability map for a given object search task, refining the salient area. The Visual Interest algorithm is fully validated by applying it to salient compression, using a pre-blur of non-salient regions prior to JPEG compression and conducting comprehensive observer performance tests.


    Methods used

    Feature points extraction

    Computer vision feature points (basically, local features at which the signal changes

    three-dimensionally in space and scale) have many attractive properties, such as robust

    invariant descriptor matching over scale, rotation and affine offset that could be useful in

    combination with their use as a primitive saliency detector.

    Feature matching has been used in estimating inter-frame homography mismatching

    as an estimate of temporal saliency in video, but not as a measure of spatial saliency. Harding

    and Robertson compare the co-occurrence of a set of computer vision feature points with

predictive maps of visual saliency.

    Saliency has been referred to as visual attention, unpredictability, rarity, or surprise.

    Saliency estimation methods can broadly be classified as biologically based, purely

    computational, or those that combine the two ideas. In general, most methods employ a low-

    level approach of determining contrast of image regions relative to their surroundings using

    one or more features of intensity, color, and orientation.

    Result

    Presented an image segmentation algorithm to segment image areas which are

visually salient to observers performing multiple tasks. In contrast to bottom-up saliency alone, and combined with specific task search models, our technique finds image areas relevant to the performance of multiple objective tasks, without the need for prior learning.

    The general mode acts on eye-level imagery with parameters chosen from careful

    experimentation and requires no machine learning stage. The technique is built upon feature

    points and the descriptors of these feature points can be compared to database representations

    of stored objects to narrow the focus of the attention prediction map for object class search,

    all in one algorithmic iteration.

    Disadvantage

The application to compression is just one possible use of a segmentation

    algorithm based on visually salient information.


    2.6. VISUAL ATTENTION DETECTION IN VIDEO SEQUENCES USING

    SPATIOTEMPORAL CUES

-Yun Zhai and Mubarak Shah

    A hierarchical spatial attention representation is established to reveal the interesting

    points in images as well as the interesting regions. Finally, a dynamic fusion technique is

    applied to combine both the temporal and spatial saliency maps, where temporal attention is

    dominant over the spatial model when large motion contrast exists, and vice versa. The

    proposed spatiotemporal attention framework has been extensively applied on several video

    sequences, and attended regions are detected to highlight interesting objects and motions

    present in the sequences with very high user satisfaction rate.

    Methods used:

    Video attention detection

    Propose a bottom-up approach for modeling the spatiotemporal attention in video

    sequences. The proposed technique is able to detect the attended regions as well as attended

    actions in video sequences. Different from the previous methods, most of which are based on

    the dense optical flow fields, our proposed temporal attention model utilizes the interest point

    correspondences and the geometric transformations between images.

In this model, feature points are first detected in consecutive video images, and correspondences are established between the interest points using the Scale Invariant Feature Transform (SIFT).
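An illustrative version of this correspondence step using OpenCV's SIFT and a standard ratio test is shown below; the file names and the 0.75 ratio are assumptions.

import cv2

f1 = cv2.imread("frame_t.png", cv2.IMREAD_GRAYSCALE)    # illustrative paths
f2 = cv2.imread("frame_t1.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(f1, None)
kp2, des2 = sift.detectAndCompute(f2, None)
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
# Lowe-style ratio test keeps only distinctive correspondences.
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "interest-point correspondences")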

    Spatiotemporal saliency map

    A linear time algorithm is developed to compute pixel-level saliency maps. In this

    algorithm, color statistics of the images are used to reveal the color contrast information in

    the scene. Given the pixel-level saliency map, attended points are detected by finding the

    pixels with the local maxima saliency values. The region-level attention is constructed based

upon the attended points. Given an attended point, a unit region is created with the point as its center.

    This region is then iteratively expanded by computing the expansion potentials on the

    sides of the region. Rectangular attended regions are finally achieved. The temporal and

    spatial attention models are finally combined in a dynamic fashion. Higher weights are

    assigned to the temporal model if large motion contrast is present in the sequence.


    Result

    Presented a spatiotemporal attention detection framework for detecting both attention

    regions and interesting actions in video sequences. The saliency maps are computed

    separately for the temporal and spatial information of the videos.

    Disadvantage

Fails to highlight the entire salient region, or highlights smaller salient regions better than larger ones.

    2.7. BACKGROUND MODELING USING MIXTURE OF GAUSSIANS FOR

    FOREGROUND DETECTION - A SURVEY

    -T. Bouwmans, F. El Baf, B. Vachon

    Mixture of Gaussians is a widely used approach for background modeling to detect

    moving objects from static cameras. Numerous improvements of the original method

    developed by Stauffer and Grimson have been proposed over the recent years and the purpose

    of this paper is to provide a survey and an original classification of these improvements. It

also discusses relevant issues to reduce the computation time. Firstly, the original MOG is recalled and discussed in light of the challenges met in video sequences.

This survey categorizes the different improvements found in the literature, classifies them in terms of the strategies used to improve the original MOG, and discusses them in terms of the critical situations they claim to handle. After analyzing the

    strategies and identifying their limitations, we conclude with several promising directions for

    future research.

Methods used

Background Modeling

    In the context of a traffic surveillance system, Friedman and Russell proposed to

    model each background pixel using a mixture of three Gaussians corresponding to road,

    vehicle and shadows. This model is initialized using an EM algorithm. Then, the Gaussians

    are manually labeled in a heuristic manner as follows: the darkest component is labeled as

    shadow; in the remaining two components, the one with the largest variance is labeled as

    vehicle and the other one as road.


This remains fixed for the whole process, resulting in a lack of adaptation to changes over time. First, each pixel is characterized by its intensity in the RGB color space. Then, the probability of observing the current pixel value is given by a weighted sum of the K Gaussian components, as sketched below.
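In the standard multidimensional formulation (as in Stauffer and Grimson), that probability is P(Xt) = Σ(k=1..K) wk · η(Xt; μk, Σk), a weighted sum of K Gaussian densities with weights wk, means μk and covariances Σk. A practical, hedged example using OpenCV's MOG2 variant (one of the improved methods this survey covers; the file name and parameters are illustrative) is:

import cv2

cap = cv2.VideoCapture("traffic.avi")            # illustrative input file
mog = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                         detectShadows=True)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = mog.apply(frame)                   # 255 = foreground, 127 = shadow, 0 = background
cap.release()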

    Foreground Detection

    In this case, the ordering and labeling phases are conserved and only the matching test

is changed to be more exact statistically. Indeed, Stauffer and Grimson checked every new pixel against the K existing distributions to classify it as a background or foreground pixel.

    Result

Allows the reader to survey the strategies and can effectively guide the selection of the best improvement for a specific application.

    Disadvantage

Can lead to misdetection of foreground objects and background.

    2.8. EFFICIENT HIERARCHICAL GRAPH-BASED VIDEO SEGMENTATION

-Matthias Grundmann, Vivek Kwatra, Mei Han and Irfan Essa

    This hierarchical approach generates high quality segmentations, which are

    temporally coherent with stable region boundaries, and allows subsequent applications to

choose from varying levels of granularity. It further improves segmentation quality by using dense optical flow to guide temporal connections in the initial graph, and also proposes two novel approaches to improve the scalability of this technique:

(a) a parallel out-of-core algorithm that can process volumes much larger than an in-core algorithm can.

    (b) a clip-based processing algorithm that divides the video into overlapping clips in

    time, and segments them successively while enforcing consistency.


    Methods used

    Graph-based Algorithm

Specifically, for image segmentation, a graph is defined with the pixels as nodes, connected by edges based on an 8-neighborhood. Edge weights are derived from the per-pixel normalized color difference.
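For a single frame, the scikit-image implementation of the Felzenszwalb-Huttenlocher algorithm gives a quick, hedged illustration (the file name and parameters below are placeholders); the paper extends this graph construction to the spatio-temporal volume.

from skimage import io
from skimage.segmentation import felzenszwalb

frame = io.imread("frame_000.png")                     # illustrative path
labels = felzenszwalb(frame, scale=100, sigma=0.8, min_size=50)
print(labels.max() + 1, "regions in this frame")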

    Hierarchical Spatio-Temporal Segmentation:

The above algorithm can be extended to video by constructing a graph over the spatio-temporal video volume with edges based on a 26-neighborhood in 3D space-time. Following this, the same segmentation algorithm can be applied to obtain volumetric regions. This simple approach generally leads to somewhat underwhelming results due to several drawbacks.

    Parallel Out-of-Core Segmentation:

It consists of a multi-grid-inspired out-of-core algorithm that operates on a subset of the video volume. Performing multiple passes over windows of increasing size, it still generates a segmentation identical to that of the in-memory algorithm. Besides segmenting large

    videos, this algorithm takes advantage of modern multi-core processors, and segments several

    parts of the same video in parallel.

    Result

This method applies the segmentation algorithm to a wide range of videos, from classic

    examples to long dynamic movie shots, studying the contribution of each part of this

    algorithm.

    Disadvantage

    However, as it defines a graph over the entire video volume, there is a restriction on

the size of the video that it can process, especially for the pixel-level over-segmentation stage.

    2.9. FAST APPROXIMATE ENERGY MINIMIZATION VIA GRAPH CUTS

-Yuri Boykov, Olga Veksler and Ramin Zabih

This paper addresses the problem of minimizing a large class of energy functions that occur in early vision. The major restriction is that the energy function's smoothness term must only involve pairs of pixels. It proposes two algorithms that use graph cuts to compute a local

    minimum even when very large moves are allowed.


The first move we consider is an α-β swap: for a pair of labels α, β, this move exchanges the labels between an arbitrary set of pixels labeled α and another arbitrary set labeled β.

The first algorithm generates a labeling such that there is no swap move that decreases the energy. The second move we consider is an α-expansion: for a label α, this move assigns an arbitrary set of pixels the label α. Our second algorithm, which requires the smoothness term to be a metric, generates a labeling such that there is no expansion move that decreases the energy. Moreover, this solution is within a known factor of the global minimum.

    Methods used

    Energy minimization via graph cuts:

    The most important property of these methods is that they produce a local minimum

    even when large moves are allowed. In this section, we discuss the moves we allow, which

    are best described in terms of partitions. We sketch the algorithms and list their basic

    properties. We then formally introduce the notion of a graph cut, which is the basis for our

    methods.

    Graph cuts

    The minimum cut problem is to find the cut with smallest cost. There are numerous

algorithms for this problem with low-order polynomial complexity; in practice these methods run in near-linear time.
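The toy construction below shows how a binary labeling problem maps onto a minimum s-t cut: unary costs become terminal edge capacities and a constant pairwise cost links 4-connected neighbours. It uses networkx for clarity only; it is far slower than the specialised max-flow solvers the paper relies on, and all values are illustrative.

import networkx as nx
import numpy as np

def graph_cut_labels(unary_fg, unary_bg, beta=1.0):
    """unary_fg / unary_bg: HxW costs of labeling a pixel foreground / background."""
    h, w = unary_fg.shape
    g = nx.DiGraph()
    for y in range(h):
        for x in range(w):
            p = (y, x)
            # Terminal links: a capacity is paid when the pixel ends up on the
            # other side of the cut, i.e. when it takes the corresponding label.
            g.add_edge("s", p, capacity=float(unary_bg[y, x]))  # paid if p is labelled background
            g.add_edge(p, "t", capacity=float(unary_fg[y, x]))  # paid if p is labelled foreground
            for q in ((y + 1, x), (y, x + 1)):                  # 4-neighbour smoothness links
                if q[0] < h and q[1] < w:
                    g.add_edge(p, q, capacity=beta)
                    g.add_edge(q, p, capacity=beta)
    _, (source_side, _) = nx.minimum_cut(g, "s", "t")
    labels = np.zeros((h, w), dtype=int)
    for p in source_side:
        if p != "s":
            labels[p] = 1            # pixels left connected to the source are foreground
    return labels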

    Disadvantage

This method can produce only a local minimum of the energy.

    2.10. GEODESIC MATTING: A FRAMEWORK FOR FAST INTERACTIVE

IMAGE AND VIDEO SEGMENTATION AND MATTING

    -Xue Bai, Guillermo Sapiro

    The proposed technique is based on the optimal, linear time, computation of weighted

    geodesic distances to user-provided scribbles, from which the whole data is automatically

    segmented. The weights are based on spatial and/or temporal gradients, considering the

    statistics of the pixels scribbled by the user, without explicit optical flow or any advanced and

    often computationally expensive feature detectors. These could be naturally added to the

    proposed framework as well if desired, in the form of weights in the geodesic distances.


    Methods used

    Segmentation

The algorithm starts from two types of user-provided scribbles, F for foreground and B for

    background, roughly placed across the main regions of interest. Now the problem is how to

    learn from them and propagate this prior information/labeling to the entire image, exploiting

    both the marked pixel statistics and their positions.

    Feature Distribution Estimation

    The role of the scribbles is twofold. The scribbles indicate spatial constraints, and also

    collect labeled information from F/B regions. We use discriminative features to learn from

    the samples on the scribbles (pixels marked by the user via the scribbles), and to classify all

    the remaining pixels in the image.

    Geodesic Distance:

    We use the geodesic distance from these user-provided scribbles to classify the pixels

x in the image (outside of the scribbles), labeling them F or B. The geodesic distance d(x) is

    simply the smallest integral of a weight function over all possible paths from the scribbles to

    x (in contrast with the average distance as used in random walks or diffusion/Laplace based

    frameworks).
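A self-contained sketch of this definition, under illustrative assumptions (4-connected grid, per-pixel weights supplied by the caller, e.g. a gradient of the foreground likelihood): Dijkstra's algorithm returns, for every pixel, the smallest accumulated weight over all paths from the scribbled pixels.

import heapq
import numpy as np

def geodesic_distance(weight, scribble_mask):
    """weight: HxW per-pixel cost; scribble_mask: boolean, True at scribbled pixels."""
    h, w = weight.shape
    dist = np.full((h, w), np.inf)
    heap = [(0.0, (int(y), int(x))) for y, x in zip(*np.nonzero(scribble_mask))]
    for _, p in heap:
        dist[p] = 0.0                      # scribbled pixels are the sources
    heapq.heapify(heap)
    while heap:
        d, (y, x) = heapq.heappop(heap)
        if d > dist[y, x]:
            continue                       # stale queue entry
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w:
                nd = d + 0.5 * (weight[y, x] + weight[ny, nx])
                if nd < dist[ny, nx]:
                    dist[ny, nx] = nd
                    heapq.heappush(heap, (nd, (ny, nx)))
    return dist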

    Result

    Presented a geodesics-based algorithm for (interactive) natural image, 3D, and video

segmentation and matting. We introduced the framework for still images and extended it to

    video segmentation and matting, as well as to 3D medical data.

    Disadvantage

    Although the proposed framework is general, we mainly exploited weights in the

geodesic computation that depend on the pixel value distributions. The algorithm does not work

    when these distributions significantly overlap.


    2.11. INTERACTIVE VIDEO SEGMENTATION SUPPORTED BY MULTIPLE

    MODALITIES, WITH AN APPLICATION TO DEPTH MAPS

    -Jeroen van Baar, Paul Beardsley, Marc Pollefeys, Markus Gross

    This paper proposes an interactive method for the segmentation of objects in video. It

    aims to exploit multiple modalities to reduce the dependency on color discrimination alone.

Given an initial segmentation for the first and last frame of a video sequence, this method

    aims to propagate the segmentation to the intermediate frames of the sequence. Video frames

    are first segmented into super pixels.

    The segmentation propagation is then regarded as a super pixels labeling problem.

    The problem is formulated as an energy minimization problem which can be solved

    efficiently. Higher-order energy terms are included to represent temporal constraints. Our

    proposed method is interactive, to ensure correct propagation and relabel incorrectly labeled

    super pixels.

    Methods used

    Segmentation Propagation as Energy Minimization

We formulate the problem of propagating the known segmentations (of the first and last frames) as an energy minimization:

    E(x) = \sum_i \phi(x_i) + \sum_{(i,j)} \psi(x_i, x_j) + \sum_c \psi_c(x_c)

Here \phi(x_i) represents a unary term, \psi(x_i, x_j) represents a binary term between neighboring super pixels x_i and x_j, and \psi_c(x_c) represents a so-called higher-order clique term. Each super pixel is assigned a label, with the set of labels L defined by the different segments.

    Interactive Segmentation Correction

    Super pixels may have an incorrect label after propagation. It is therefore necessary

    for the user to correct these incorrect segmentation labels. Rather than requiring to re-label

    individual pixels, in our case the interactive correction step can be more easily performed on

    the super pixels directly.

    Exploiting Multiple Modalities:

    We exploit the thermal signal in the super pixel segmentation and in the matching of

    super pixels. This is especially helpful for scenes with human actors, since the thermal signal

    helps to separate the actors from their background, and could also help to separate actors

    from each other.


    Result:

Described an interactive video segmentation approach based on the propagation of known segmentations for the first and last frame to the intermediate frames of a video sequence. The straightforward matching of super pixels across the video sequence could easily handle moving cameras and non-rigidly moving objects.

    Disadvantage:

The method is not very efficient when the objects are moving very fast.

    2.12. SUMMARY

No. | Title | Technique Used | Drawbacks
--- | ----- | -------------- | ---------
1 | Saliency-based video segmentation with graph cuts and sequentially updated priors | Markov random field (MRF) model | Segmented regions were randomly switched
2 | Saliency detection using maximum symmetric surround | Saliency computation methods | The saliency maps generated by this method suffer from low resolution
3 | Automatic object extraction in single-concept videos | Object modeling and extraction | Some of the motion cues might be negligible due to low contrast
4 | Visual saliency from image features with application to compression | Feature point extraction | Compression is just one possible use for a segmentation algorithm based on visually salient information
5 | Visual attention detection in video sequences using spatiotemporal cues | Spatiotemporal saliency map | Fails to highlight the entire salient region
6 | Background modeling using mixture of Gaussians for foreground detection | Background modeling | Leads to misdetection of foreground objects and background
7 | Efficient hierarchical graph-based video segmentation | Hierarchical spatio-temporal segmentation | There is a restriction on the size of the video
8 | Fast approximate energy minimization via graph cuts | Energy minimization via graph cuts | Produces only an approximate (local-minimum) solution, not the global optimum
9 | A framework for fast interactive image and video segmentation and matting | Feature distribution estimation | Does not work when the pixel-value distributions significantly overlap
10 | Interactive video segmentation supported by multiple modalities with an application to depth maps | Interactive segmentation correction | Not very efficient when objects move very fast


    CHAPTER III

    PROPOSED SYSTEM


    3.1. INTRODUCTION

The project proposes efficient motion detection and people counting based on background subtraction using a dynamic threshold. Three different methods are used for object detection, and their performance is compared in terms of detection accuracy. The techniques used are frame differencing, dynamic-threshold-based detection, and the mixture-of-Gaussians model. After foreground object detection, parameters such as speed, velocity, and angle of motion are determined.

Most previous methods depend on the assumption that the background is static over short time periods. However, structured motion patterns of the background, which are distinct from variations due to noise, are hardly tolerated under this assumption and thus still lead to high false-positive rates when previous models are used. In dynamic-threshold-based object detection, morphological processing and filtering are also used to remove unwanted pixels from the background.

Along with this dynamic threshold, we introduce a background subtraction algorithm for temporally dynamic texture scenes using a mixture of Gaussians, which greatly attenuates color variations generated by background motion while still highlighting moving objects. Finally, the proposed method will be shown to be effective for background subtraction in dynamic texture scenes compared to several competitive methods, and the moving-object parameters will be evaluated for all methods.
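As a rough illustration of the mixture-of-Gaussians step, the following MATLAB sketch uses the Computer Vision Toolbox foreground detector; the file name and parameter values are illustrative assumptions, not the settings of the final system.

    % Sketch: mixture-of-Gaussians background subtraction on a video file.
    % Assumes the Computer Vision Toolbox; 'input.avi' and the parameter
    % values below are illustrative only.
    reader   = VideoReader('input.avi');
    detector = vision.ForegroundDetector('NumGaussians', 3, ...
                                         'NumTrainingFrames', 50);
    while hasFrame(reader)
        frame  = readFrame(reader);
        fgMask = step(detector, frame);     % binary foreground mask
        fgMask = bwareaopen(fgMask, 50);    % drop small noisy blobs
        imshow(fgMask); drawnow;
    end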

    3.2. OBJECTIVE

The objective is background subtraction for accurate moving-object detection from dynamic scenes, using dynamic threshold detection and a mixture-of-Gaussians model, together with the determination of object parameters.

    3.3. BLOCK DIAGRAM

[Block diagram: Input Video → Frame Separation → Frame Subtraction → Dynamic Threshold → Foreground Detection → Connected Component Analysis]

Fig. 3.1. Block Diagram of Proposed System


    3.4. METHODOLOGIES

    Frame Separation

    Frame Subtraction

    Dynamic Threshold Approach

    Morphological Filtering

    Object Detection

    Parameter analysis

    3.4.1. Frame Separation

A method to shorten the time required for frame separation in a time-division-multiplexed delta modulation system is described. Employing successive "1"s as the sync pattern, the system detects the sync channel out of the delta-modulated information pulses, which occur at an average rate of one half.

    Using a memory device with the capacity of one frame, the detection is performed by

    taking successive frame correlation of each channel in parallel. For example, in the system

    with 20 channels, the frame separation is established in approximately six frame periods.

The system includes facilities to stabilize the frame separation against errors in the sync pattern and against the occurrence of information patterns similar to the sync pattern.
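In this project, frame separation simply means splitting the input video into its individual frames. A minimal MATLAB sketch is given below; the file name and output folder are illustrative, and the frames/ folder is assumed to exist.

    % Sketch: separate an input video into individual image frames.
    v = VideoReader('input.avi');
    k = 0;
    while hasFrame(v)
        frame = readFrame(v);
        k = k + 1;
        imwrite(frame, sprintf('frames/frame_%04d.png', k));
    end
    fprintf('Separated %d frames at %.2f frames per second\n', k, v.FrameRate);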

    3.4.2. Frame Subtraction

    In digital photography, dark-frame subtraction is a way to minimize image noise for

    pictures taken with long exposure times. It takes advantage of the fact that a component of

    image noise, known as fixed-pattern noise, is the same from shot to shot: noise from the

    sensor, dead or hot pixels. It works by taking a picture with the shutter closed.

    A dark frame is an image captured with the sensor in the dark, essentially just an

    image of noise in an image sensor. A dark frame, or an average of several dark frames, can

    then be subtracted from subsequent images to correct for fixed-pattern noise such as that

caused by dark current. Dark-frame subtraction has been done for some time in scientific imaging;

    many newer consumer digital cameras offer it as an option, or may do it automatically for

    exposures beyond a certain time.


Visible fixed-pattern noise is often caused by hot pixels: pixel sensors with higher-than-normal dark current. On long exposures, they can appear as bright pixels. Sensors on the CCD that always appear as brighter pixels are called stuck pixels, while sensors that only brighten up after long exposures are called hot pixels.

The dark-frame-subtraction technique is also used in digital photogrammetry, to improve the contrast of satellite and aerial photograms, and is considered part of "best practice".

    Each CCD has its own dark signal signature that is present on every acquired image

(bias, dark frame, light frame). By cooling the CCD to a very low temperature (down to −30 °C), the dark signal is attenuated but not completely cancelled. Image processing is then

    required to remove dark signature from each raw image. The method usually used consists of

    acquiring dark frames that are subsequently subtracted from each image. This permits

    keeping only relevant information related to the observed target.

    During an acquisition session, an ideal dark frame is obtained by acquiring and

    averaging several dark frames. This ideal dark frame has a reduced noise signature and is, in

    theory, acquired with the same conditions (temperature and moisture) as the observation

    images. This averaged dark frame will be subtracted from every raw image.

This approach for reducing dark signal produces very nice astronomical images. It is suitable for many applications which do not require very accurate photometry results. However, the technique is not necessarily sufficiently accurate for detecting and analyzing very low signal-to-noise ratio objects.
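In the proposed system, the same subtraction idea is applied between consecutive video frames rather than dark frames. A minimal MATLAB sketch follows; the frame file names are illustrative and follow the separation step above.

    % Sketch: absolute difference between two consecutive grayscale frames.
    f1 = rgb2gray(imread('frames/frame_0001.png'));
    f2 = rgb2gray(imread('frames/frame_0002.png'));
    d  = imabsdiff(f2, f1);    % |f2 - f1|, computed without integer wrap-around
    imshow(d, []);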

    3.4.3. Dynamic Threshold Approach

Fixed-decision-boundary (fixed-threshold) classification approaches have been successfully applied to segment human skin. These fixed thresholds mostly fail in two situations, since they only search for a certain skin color range:

1) Any non-skin object may be classified as skin if the non-skin object's color values fall in the fixed threshold range.

2) Any true skin may be mistakenly classified as non-skin if its color values do not fall in the fixed threshold range.

Instead of predefined fixed thresholds, novel online-learned dynamic thresholds are used to overcome the above drawbacks. The experimental results show that our method is robust in overcoming these drawbacks.
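In this project the dynamic threshold is applied to the frame-difference image rather than to skin color. One simple way to obtain such a per-frame, data-driven threshold is Otsu's method; the sketch below illustrates the idea and is not necessarily the exact rule used in the implementation.

    % Sketch: per-frame (dynamic) threshold on the difference image d
    % from the previous step, instead of a fixed constant.
    level  = graythresh(d);     % Otsu threshold in [0,1], recomputed each frame
    fgMask = im2bw(d, level);   % binary foreground-candidate mask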


    3.4.4. Morphological Filtering

Given a sampled binary image signal f[x] with values 1 for the image object and 0 for the background, typical image transformations involving a moving window set W = {y_1, y_2, ..., y_n} of n sample indexes would be

    \psi_b(f)[x] = b(f[x - y_1], ..., f[x - y_n])        (1)

where b(v_1, ..., v_n) is a Boolean function of n variables. The mapping f -> \psi_b(f) is called a Boolean filter. By varying the Boolean function b, a large variety of Boolean filters can be obtained. For example, choosing a Boolean AND for b would shrink the input image object, whereas a Boolean OR would expand it.

Numerous other Boolean filters are possible, since there are 2^(2^n) possible Boolean functions of n variables. The main applications of such Boolean image operations have been in biomedical image processing, character recognition, object detection, and general 2D shape analysis.

Among the important concepts offered by mathematical morphology was to use sets to represent binary images and set operations to represent binary image transformations. Specifically, given a binary image, let the object be represented by the set X and its background by the set complement X^c. The Boolean OR transformation of X by a (window) set B is equivalent to the Minkowski set addition, also called dilation, of X by B:

    X \oplus B = \{ x + y : x \in X, y \in B \}        (2)

Extending morphological operators from binary to gray-level images can be done by using set representations of signals and transforming these input sets via morphological set operations. Thus, consider an image signal f(x) defined on the continuous or discrete plane D = R^2 or Z^2 and assuming values in \bar{R} = R \cup \{-\infty, +\infty\}. Thresholding f at all amplitude levels v produces an ensemble of binary images represented by the threshold sets

    \Theta_v(f) = \{ x \in D : f(x) \ge v \},   -\infty < v < +\infty        (3)

    Lattice Opening Filters

The three types of nonlinear filters defined below have proven to be very useful for image enhancement. If a 2D image f contains 1D objects, e.g. lines, and B is a 2D disk-like structuring element, then the simple opening or closing of f by B will eliminate these 1D objects. Another problem arises when f contains large-scale objects with sharp corners that need to be preserved; in such cases, opening or closing f by a disk B will round these corners. These two problems could be avoided in some cases if we replace the conventional opening with a radial opening.
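A small MATLAB sketch of the morphological clean-up applied to the binary foreground mask follows; the structuring-element shape and the size thresholds are illustrative.

    % Sketch: morphological filtering of the binary foreground mask.
    se     = strel('disk', 3);           % disk-like structuring element B
    fgMask = imopen(fgMask, se);         % opening removes thin spurious parts
    fgMask = imclose(fgMask, se);        % closing bridges small gaps
    fgMask = bwareaopen(fgMask, 50);     % remove components smaller than 50 px
    fgMask = imfill(fgMask, 'holes');    % fill interior holes of the objects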

    3.4.5. Object Detection

A substantial amount of research has been done on developing techniques for automatically locating objects of interest in digitized pictures; drawing the boundaries around objects is essential for pattern recognition, object tracking, image enhancement, data reduction, and various other applications. This constitutes a good survey of research and applications in image processing and picture analysis. Most researchers in picture analysis have assumed that:

    (1) The image of an object is more or less uniform or smooth in its local properties

    (that is, illumination, color, and local texture are smoothly changing inside the image of an

    object).

(2) There is a detectable discontinuity in local properties between the images of two different objects. These two assumptions are adopted in this paper, and textured images are not considered.

    The work on automatic location of objects in digitized images has split into two

approaches: edge detection and edge following versus region growing. Edge detection applies local independent operators over the picture to detect edges and then uses algorithms to trace the boundaries by following the detected local edges; a recent survey of the literature covers this area.

The region-growing approach uses various clustering algorithms to grow regions of almost uniform local properties in the image for typical applications. More detailed references will be given later. In this method the two approaches are combined to complement each other; the result is a more powerful mechanism for segmenting pictures into objects. A new edge detector is developed and combined with new region-growing techniques to locate objects; in so doing, the confusion that arises in regular edge following when more than one isolated object lies on a uniform background in the scene is resolved.
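For the connected-component analysis block of Fig. 3.1, a minimal MATLAB sketch of locating the detected objects in the cleaned foreground mask is given below; the drawing of boxes is only for visualization.

    % Sketch: locate objects in the cleaned foreground mask and mark them.
    [labels, numObjects] = bwlabel(fgMask);                 % connected components
    stats = regionprops(labels, 'BoundingBox', 'Centroid', 'Area');
    imshow(frame); hold on;
    for i = 1:numObjects
        rectangle('Position', stats(i).BoundingBox, 'EdgeColor', 'g');
        plot(stats(i).Centroid(1), stats(i).Centroid(2), 'r+');
    end
    hold off;
    title(sprintf('%d objects detected', numObjects));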


    3.4.6. Parameters Analysis

This section deals with the parameters used to evaluate the methods:

Correlation

Sensitivity

MSE

PSNR

    Correlation

    Correlation and Convolution are basic operations that we will perform to extract

    information from images. They are in some sense the simplest operations that we can perform

    on an image, but they are extremely useful. Moreover, because they are simple, they can be

    analyzed and understood very well, and they are also easy to implement and can be computed

    very efficiently.

    Our main goal is to understand exactly what correlation and convolution do, and why

    they are useful. We will also touch on some of their interesting theoretical properties; though

    developing a full understanding of them would take more time than we have.

    Sensitivity

We estimate the sensitivity of the image processing operators with respect to parameter changes by performing a sensitivity analysis. This is not new to the image processing community, but it is often left out for performance reasons.

    MSE & PSNR

    Image quality assessment is an important but difficult issue in image processing

    applications such as compression coding and digital watermarking. For a long time, mean

square error (MSE) and peak signal-to-noise ratio (PSNR) have been widely used to measure the

    degree of image distortion because they can represent the overall gray-value error contained

    in the entire image, and are mathematically tractable as well. In many applications, it is

    usually straightforward to design systems that minimize MSE or PSNR.

    MSE works satisfactorily when the distortion is mainly caused by contamination of

    additive noise. However the problem inherent in MSE and PSNR is that they do not take into

    account the viewing conditions and visual sensitivity with respect to image contents. With

MSE or PSNR, only gray-value differences between corresponding pixels of the original and the distorted version are considered. Pixels are treated as being independent of their neighbors. Moreover, all pixels in an image are assumed to be equally important. This, of

    course, is far from being true. As a matter of fact, pixels at different positions in an image can

    have very different effects on the human visual system (HVS).
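The three quantitative parameters above can be computed directly in MATLAB; the sketch below assumes two grayscale images of the same size, and the file names are illustrative.

    % Sketch: correlation, MSE and PSNR between a reference image and a
    % processed version of it (both grayscale, same size, 8-bit range).
    ref  = double(imread('reference.png'));
    test = double(imread('processed.png'));

    c       = corr2(ref, test);                 % 2-D correlation coefficient
    mseVal  = mean((ref(:) - test(:)).^2);      % mean square error
    psnrVal = 10 * log10(255^2 / mseVal);       % peak signal-to-noise ratio (dB)

    fprintf('Correlation = %.4f, MSE = %.2f, PSNR = %.2f dB\n', c, mseVal, psnrVal);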

    3.4.7. Advantages

    Less sensitive to noise

    Automatic background updating model

    Accuracy is more

    3.4.8. Applications

    Video surveillance

    Object detection

    People counting


    CHAPTER 4

    SYSTEM SPECIFICATION


    4.1. HARDWARE SPECIFICATION

    The computer used for the development of the project is

    Processor : PENTIUM IV

RAM : 256 MB DDR

    Hard Disc : 80 GB

    Floppy Disc : 1.44 MB

    CD drive : 52 * CD ROM DRIVE

Monitor : Samsung 17-inch

    Keyboard : 108 keys Samsung

    Mouse : Logitech scroll mouse

    Zip Drive : 250 MB

    Printer : HP DeskJet

    4.2. SOFTWARE SPECIFICATION

    The project was developed using the following software.

    Operating System : Windows XP

    Software : MATLAB


    4.3. SOFTWARE FEATURES:

The software used in this project is MATLAB 7.10 or above.

4.3.1. Introduction To MATLAB

MATLAB is a product of The MathWorks, Inc. The name MATLAB stands for MATRIX LABORATORY. It is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation.

Typical uses include

    1. Math and computation

    2. Algorithm development

    3. Data acquisition

    4. Modeling, simulation, and prototyping

    5. Data analysis, exploration, and visualization

    6. Scientific and engineering graphics

    7. Application development, including graphical user interface building

    MATLAB is an interactive system whose basic data element is an array that does not

    require dimensioning. This allows you to solve many technical computing problems,

    especially those with matrix and vector formulations, in a fraction of the time it would take to

    write a program in a scalar non interactive language such as C or FORTRAN.

    The name MATLAB stands for matrix laboratory. MATLAB was originally written to

provide easy access to matrix software developed by the LINPACK and EISPACK projects. Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the

    state of the art in software for matrix computation.

    MATLAB has evolved over a period of years with input from many users. In

    university environments, it is the standard instructional tool for introductory and advanced

    courses in mathematics, engineering, and science. In industry, MATLAB is the tool of choice

    for high-productivity research, development, and analysis.


    MATLAB features a family of add-on application-specific solutions called toolboxes.

    Very important to most users of MATLAB, toolboxes allow you to learn and apply

    specialized technology. Toolboxes are comprehensive collections of MATLAB functions (M-

    files) that extend the MATLAB environment to solve particular classes of problems. Areas in

    which toolboxes are available include signal processing, control systems, neural networks,

    fuzzy logic, wavelets, simulation, and many others.

    4.3.2. THE MATLAB SYSTEM

    The MATLAB system consists of five main parts

    Desktop tools and development environment

    This is the set of tools and facilities that help you use MATLAB functions and files.

    Many of these tools are graphical user interfaces. It includes the MATLAB desktop and

    Command Window, a command history, an editor and debugger, a code analyzer and other

    reports, and browsers for viewing help, the workspace, files, and the search path.

    MATLAB windows

1. Command Window: This is the main window. It is characterized by the MATLAB command prompt (>>). All commands, including those for running user-written programs, are typed in this window at the MATLAB prompt.

2. Graphics Window: The output of all graphics commands typed is flushed into the graphics or figure window.

3. Edit Window: This is the place to create, write, edit and save programs in files called M-files. Any text editor can be used to carry out these tasks.

    The MATLAB mathematical function library

    This is a vast collection of computational algorithms ranging from elementary

    functions, like sum, sine, cosine, and complex arithmetic, to more sophisticated functions like

matrix inverse, matrix eigenvalues, Bessel functions, and fast Fourier transforms.


    The MATLAB language

    This is a high-level matrix/array language with control flow statements, functions,

    data structures, input/output, and object-oriented programming features. It allows both

    programming in the small to rapidly create quick and dirty throw-away programs, and

    programming in the large to create large and complex application programs.

    Graphics handler

    MATLAB has extensive facilities for displaying vectors and matrices as graphs, as

    well as annotating and printing these graphs. It includes high-level functions for two-

    dimensional and three-dimensional data visualization, image processing, animation, and

presentation graphics. It also includes low-level functions that allow you to fully customize the appearance of graphics, as well as to build complete graphical user interfaces for your MATLAB applications.

    4.3.3. The MATLAB application program interface

    This is a library that allows you to write C and Fortran programs that interact with

    MATLAB. It includes facilities for calling routines from MATLAB (dynamic linking),

calling MATLAB as a computational engine, and for reading and writing MAT-files.

    4.3.4. MATLAB documentation

    MATLAB provides extensive documentation, in both printed and online format, to

    help you learn about and use all of its features. If you are a new user, start with this Getting

    Started book. It covers all the primary MATLAB features at a high level, including many

    examples. The MATLAB online help provides task-oriented and reference information about

MATLAB.

    4.3.5. Introduction to M-function programming

    One of the most powerful features of MATLAB is the capability it provides users to

    program their own new functions. MATLAB function programming is flexible and

    particularly easy to learn.


    4.3.6. M-files

    M-files in MATLAB can be scripts that simply execute a series of MATLAB

    statements, or they can be functions that can accept arguments and can produce one or more

    outputs. M FILE functions extend the capabilities of both MATLAB and the Image

    Processing Toolbox to address specific, user-defined applications. M-files are created using a

    text editor and are stored with a name of the form filename.m, such as average.m and filter.m.

The components of a function M-file are the following (a minimal example is given after the list):

    The function definition line

    The H1 line

    Help text

    The function body

    Comments
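A minimal function M-file illustrating these components (the function name and its contents are illustrative):

    function y = average(x)
    %AVERAGE Compute the mean of the elements of a vector.        (H1 line)
    %   Y = AVERAGE(X) returns the arithmetic mean of the
    %   elements of the input vector X.                           (help text)

    % The first line above is the function definition line; what follows
    % is the function body with ordinary comments.
    y = sum(x(:)) / numel(x);   % sum of the elements divided by their count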

    4.3.7. Operators

MATLAB operators are grouped into three main categories (a one-line example of each is given after the list):

    Arithmetic operators that perform numeric computations

    Relational operators that compare operands quantitatively

    Logical operators that perform the functions AND, OR, and NOT
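A brief illustration of each category (the values are arbitrary):

    a = 5 + 3 * 2;       % arithmetic operators: numeric computation (a = 11)
    b = (a > 10);        % relational operator: quantitative comparison (b is true)
    c = b && ~(a == 0);  % logical operators: AND and NOT combined (c is true)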

    4.3.8. Image processing tool box

    Image Processing Toolbox provides a comprehensive set of reference-standard

    algorithms and graphical tools for image processing, analysis, visualization, and algorithm

    development. You can perform image enhancement, image deblurring, feature detection,

    noise reduction, image segmentation, geometric transformations, and image registration.

    Many toolbox functions are multithreaded to take advantage of multicore and multiprocessor

    computers.

Image Processing Toolbox supports a diverse set of image types, including high dynamic range, gigapixel resolution, embedded ICC profile, and tomography. It provides spatial image transformations, morphological operations, neighborhood and block operations, linear filtering and filter design, transforms, image analysis and enhancement, image registration, deblurring, and region-of-interest operations.


    You can extend the capabilities of Image Processing Toolbox by writing your own M-

    files, or by using the toolbox in combination with other toolboxes, such as Signal Processing

    Toolbox and Wavelet Toolbox.

    Graphical tools let you explore an image, examine a region of pixels, adjust the

    contrast, create contours or histograms, and manipulate regions of interest (ROIs). With

    toolbox algorithms you can restore degraded images, detect and measure features, analyze

shapes and textures, and adjust color balance.

    4.3.9. Interfacing with other languages

    Libraries written in Java, ActiveX or .NET can be directly called from MATLAB and

    many MATLAB libraries (for example XML or SQL support) are implemented as wrappers

around Java or ActiveX libraries. Calling MATLAB from Java is more complicated, but can be done with a MATLAB extension, which is sold separately by MathWorks, or by using an undocumented mechanism called JMI (Java-to-MATLAB Interface), which should not be confused with the unrelated Java Metadata Interface that is also called JMI.

As alternatives to the MuPAD-based Symbolic Math Toolbox available from MathWorks, MATLAB can be connected to Maple or Mathematica.

MATLAB has a direct link with modeFRONTIER, a multidisciplinary and multi-objective optimization and design environment written to allow coupling to almost any computer-aided engineering (CAE) tool. Once a result has been obtained using MATLAB, the data can be transferred to and stored in modeFRONTIER.

    4.4. SYSTEM DESIGN

    System design is the process of planning a new system to complement or altogether replace

the old system. The design phase is the first step in moving from the problem

    domain to the solution domain. The design of the system is the critical aspect that affects the

    quality of the software. System design is also called top-level design. The design phase

    translates the logical aspects of the system into physical aspects of the system.


    CHAPTER V

    RESULT AND IMPLEMENTATION


[Result figure: visual saliency output]


    CHAPTER VI

    WORKS TO BE DONE IN PHASE II


Background subtraction for accurate moving-object detection from dynamic scenes, using dynamic threshold detection and a mixture-of-Gaussians model, together with the determination of object parameters with the help of morphological filtering, will be done in the next phase.


    CHAPTER VII

REFERENCES


1. "Saliency-based video segmentation with graph cuts and sequentially updated priors," K. Fukuchi, K. Miyazato, A. Kimura, S. Takagi, and J. Yamato, in Proc. IEEE Int. Conf. Multimedia Expo, Jun.–Jul. 2009, pp. 638–641.

2. "Saliency detection using maximum symmetric surround," R. Achanta and S. Süsstrunk, in Proc. IEEE Int. Conf. Image Process., Sep. 2010, pp. 2653–2656.

3. "Automatic object extraction in single concept videos," K.-C. Lien and Y.-C. F. Wang, in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2011, pp. 1–6.

4. "Visual saliency from image features with application to compression," P. Harding and N. M. Robertson, Cognit. Comput., vol. 5, no. 1, pp. 76–98, 2012.

5. "Visual attention detection in video sequences using spatiotemporal cues," Y. Zhai and M. Shah, in Proc. ACM Int. Conf. Multimedia, 2006, pp. 815–824.

6. "Background modeling using mixture of Gaussians for foreground detection - A survey," T