a study on moving object detection and tracking with partial decoding in h.264|avc bitstream domain



A Thesis for the degree of Master

A Study on Moving Object Detection

and Tracking with Partial Decoding

in H.264|AVC Bitstream Domain

Wonsang You

School of Engineering

Information and Communications University

2008




Abstract

Object detection and tracking has long been an important topic in computer vision and video processing because it enables efficient analysis of video content. It can be used not only in surveillance systems but also in interactive broadcasting services.

However, most current object detection and tracking techniques, which use only raw pixel data, are impractical because of their very high computational complexity. Furthermore, most video is transmitted in the form of encoded bitstreams to improve transmission efficiency. In that case, the pixel domain approach requires additional computation time to fully decode the encoded bitstream.

Meanwhile, H.264|AVC has become a popular video compression tool thanks to its high coding efficiency and the availability of real-time encoding devices. Fortunately, the H.264|AVC bitstream contains encoded information such as motion vectors, residual data, and macroblock types, which can be used directly as effective clues for object detection and tracking. Traditional compressed domain algorithms that use such encoded information achieve fast computation with low computational complexity. However, these algorithms work only under limited circumstances. In addition, they are difficult to combine with color extraction of objects or with object recognition, which distinguishes one object from another.

In this thesis, two methods for moving object detection and tracking with partial decoding in the H.264|AVC bitstream domain are introduced. One is a semi-automatic method in which users initially select a target object in stationary or non-stationary scenes; the other is an automatic method in which all moving objects are automatically detected and tracked, especially in stationary scenes. The former benefits metadata authoring tools that generate additional content, such as the position information of an object, for interactive broadcasting services. The latter is effective for surveillance systems with fixed cameras. Unlike conventional compressed domain algorithms, the proposed methods use partially decoded pixel data for object detection and tracking. As a result, they show reliable performance in various scene situations and are fast enough to run in real time. They can also support color extraction of objects and object recognition.



Contents

A Thesis for the degree of Master
Abstract
Contents
List of Tables
List of Figures
List of Abbreviations
I Introduction
II Related Works
    2.1 Overview of MPEG-4 Advanced Video Coding
    2.2 Pixel Domain Approach
        2.2.1 Region-based Methods
        2.2.2 Contour-based Methods
        2.2.3 Feature-based Methods
        2.2.4 Template-based Methods
    2.3 Compressed Domain Approach
        2.3.1 Clustering-based Methods
        2.3.2 Filtering-based Methods
        2.3.4 Issues in Compressed Domain Approach
III Proposed Schemes for Moving Object Detection and Tracking with Partial Decoding in H.264|AVC Bitstream Domain
    3.1 Semi-automatic Approach
        3.1.1 Forward Mapping of Backward Motion Vectors
        3.1.2 Texture Dissimilarity Energy
        3.1.3 Form Dissimilarity Energy
        3.1.4 Motion Dissimilarity Energy
        3.1.5 Energy Minimization
        3.1.6 Adaptive Weight Factors
    3.2 Automatic Approach
        3.2.1 Block Group Extraction
        3.2.2 Spatial Filtering
        3.2.3 Temporal Filtering
        3.2.4 Region Prediction of Moving Objects in I-frames
        3.2.5 Partial Decoding and Background Subtraction in I-frames
        3.2.6 Motion Interpolation in P-frames
IV Experiments
    4.1 Semi-automatic Approach
    4.2 Automatic Approach
V Conclusions and Future Works
Summary in Korean
References



List of Tables

Table 1. The processing time of compressed domain algorithms.

Table 2. The processing time of the proposed automatic method.



List of Figures

Figure 1. The region-matching method for constructing the forward motion field.

Figure 2. The search region is centered at the predicted point located by a forward motion vector. A candidate point inside the search region has a square neighborhood that is used to compute E_C.

Figure 3. The structure of partial decoding in the first P-frame of a GOP which contains one I-frame and three P-frames. Two decoded block sets Dk,n(k+1) and Dk,n(k+2) in the first P-frame are projected from two predicted search regions Pk,n(k+1) and Pk,n(k+2).

Figure 4. The network of feature points in the previous frame and the network of candidate points in the current frame.

Figure 5. The reliability of forward motion vectors. A large gap between a forward motion vector and a backward motion vector results in low reliability.

Figure 6. The neural network for updating weight factors.

Figure 7. A procedure of object region extraction and refinement.

Figure 8. Block groups before and after spatial filtering.

Figure 9. Temporal filtering based on the occurrence probability of active group trains.

Figure 10. Train tangling. (a) Train merging. (b) Train separation.

Figure 11. Optimizing the feature vector of an object through background subtraction in an I-frame. (a) The background image. (b) The I-frame in the original sequence. (c) A partially decoded image from the H.264|AVC bitstream. (d) A background-subtracted image.

Figure 12. Motion interpolation. The dotted rectangular boxes are estimated simply by enclosing active groups corresponding to the real object. These boxes are replaced by the solid rectangular boxes through motion interpolation.

Figure 13. The object tracking in “Coastguard” with 100 frames.

Figure 14. The object tracking in “Stefan” with 100 frames.

Figure 15. The object tracking in “Lovers” with 300 frames. Partially decoded regions are shown in “Lovers”.

Figure 16. (a) The processing time which includes partial decoding in “Coastguard”, and (b) the processing time which does not include partial decoding.

Figure 17. (a) Dissimilarity energies in “Stefan” and (b) “Coastguard”.

Figure 18. (a) The variation of weight factors in “Coastguard”, and (b) the squared error of dissimilarity energy in “Stefan”.

Figure 19. (a) The average reliabilities of forward motion vectors in “Coastguard” and (b) in “Stefan”.

Figure 20. The performance measurement of spatial filtering and temporal filtering in the indoor sequence. (a) The plot of spatial filtering rates, and (b) the temporal filtering results in which one active group train becomes the real object.

Figure 21. The performance measurement of spatial filtering and temporal filtering in the outdoor sequence. (a) The plot of spatial filtering rates, and (b) the temporal filtering results in which three active group trains become the real objects.

Figure 22. The effect of motion interpolation on correction of object trajectory. (a)-(b) are object locations and sizes in one GOP before motion interpolation, and (c)-(d) after motion interpolation.




List of Abbreviations

AC Alternating Current

ADE Accumulated Dissimilarity Energy

AVC MPEG-4 Part 10: Advanced Video Coding

DC Direct Current

DCT Discrete Cosine Transform

DSP Digital Signal Processing

HS Hue Saturation

ISDB-T Integrated Services Digital Broadcasting - Terrestrial

ISO International Organization for Standardization

IT Integer Transform

MAF Multimedia Application Format

MPEG Moving Picture Experts Group

MRF Markov Random Field

MV Motion Vector

RD Rate Distortion

ROI Region of Interest

S-DMB Satellite Digital Multimedia Broadcasting

STMF Spatial and Temporal Macroblock Filter

T-DMB Terrestrial Digital Multimedia Broadcasting




which extracts the object information from an input video should have a short computation time in order to be practical for industrial use.

Before discussing the computational complexity of current object detection and tracking techniques, we need to introduce two classes of such techniques: the pixel domain approach and the compressed domain approach. Traditionally, most object detection and tracking techniques belong to the pixel domain approach, which uses only raw pixel data as its resource. However, pixel domain algorithms are difficult to run quickly or in real time because they require a tremendous amount of computation. Moreover, most video content used in industry and by the public is encoded with a compression tool such as MPEG in order to improve communication efficiency by reducing the size of the content. In that case, the pixel domain approach requires additional computation time to fully decode the encoded bitstream before the main algorithm for object detection and tracking can start. This difficulty has recently been alleviated, since special-purpose DSP circuits for surveillance have been developed and the performance of personal computers has steadily improved. Nevertheless, the problem remains unresolved under limited hardware resources, especially in applications such as large-scale distributed surveillance systems that must handle several surveillance videos at the same time. Likewise, pixel domain algorithms place a heavy computational burden on a metadata authoring tool, which makes it impractical to run on a general-purpose PC. For this reason, maximizing the performance of object detection and tracking under restricted hardware resources has become an important issue in designing real-time surveillance systems and metadata authoring tools for interactive broadcasting services.

As an effective alternative, compressed domain algorithms have been proposed by many researchers. Unlike pixel domain algorithms, compressed domain algorithms use encoded information such as motion vectors, DCT coefficients, and macroblock types, which are included in the encoded bitstream. This encoded information helps reduce the computational complexity because it can be exploited directly as effective clues for object detection and tracking. Accordingly, the compressed domain approach is more effective than the pixel domain approach for implementing real-time or fast object detection and tracking systems under restricted hardware resources.

Nevertheless, conventional compressed domain algorithms have serious drawbacks even though their computational complexity is lower than that of pixel domain algorithms. First, these algorithms tend to show poor detection and tracking performance because the data extracted from the encoded bitstream are unreliable or insufficient. In particular, some of the assumptions adopted in these algorithms do not hold in every situation that can occur in a video scene. In such situations the algorithms cannot guarantee reliable performance, and object detection and tracking can fail.




The second problem with the compressed domain approach is that these algorithms cannot support the color extraction of objects, which is necessary for object recognition or for metadata construction in interactive broadcasting services. Since they exploit only the encoded information instead of raw pixel data, the pixel data of the object region are not extracted. Although some algorithms perform partial decoding, the decoded regions are restricted to the boundaries of objects in order to refine the edges of object regions. That is why the representative color information of objects cannot be obtained from compressed domain algorithms.

Lastly, most compressed domain algorithms deal only with video content encoded by MPEG-1, MPEG-2, or MPEG-4 Visual (ISO/IEC 14496 Part 2). However, H.264|AVC has recently become a popular compression tool for video content because of its high coding efficiency and the availability of real-time encoding devices. For example, five representative standards for mobile broadcasting services, namely T-DMB, S-DMB, ISDB-T, DVB-H, and MediaFLO, have adopted or have considered adopting H.264|AVC as their video compression technology. MPEG has also recently started to standardize the surveillance MAF, which adopts H.264|AVC as its video resource. Given this trend, the need for object detection and tracking algorithms that work on H.264|AVC encoded video has been emphasized. Nevertheless, most traditional compressed domain algorithms are not fully applicable to H.264|AVC video, since the intra and inter prediction schemes in H.264|AVC differ slightly from those of MPEG-1, MPEG-2, and MPEG-4 Visual. Although some researchers have proposed algorithms designed specifically for H.264|AVC video, these algorithms do not ensure reliable performance in various scenes, for the reasons noted above.

In this thesis, the proposed methods for object detection and tracking in the H.264|AVC bitstream domain exploit partially decoded pixel data as well as the encoded information in order to overcome the limitations of conventional compressed domain algorithms. The main contribution of this thesis is that, unlike conventional compressed domain algorithms, the proposed methods show reliable performance in various scene situations while remaining fast enough to run in real time.

The proposed methods are divided into a semi-automatic method and an automatic method. In the semi-automatic method, users manually choose what they want to track in any kind of scene. This method is valuable for metadata authoring tools that quickly generate the position and motion information of a predefined target object in the form of metadata for interactive broadcasting services. The automatic method, on the other hand, automatically detects and tracks all moving objects in an environment where the camera is fixed. It benefits real-time surveillance systems in which the monitored video is sent in the form of H.264|AVC bitstreams from a camera to the main processing unit.

This thesis is organized as follows. Section II briefly explains MPEG-4 Advanced Video Coding, then describes related research and compares it with the proposed methods. Section III introduces the two proposed schemes for moving object detection and tracking with partial decoding in the H.264|AVC bitstream domain: the semi-automatic method and the automatic method. In particular, a dissimilarity minimization algorithm is applied in the semi-automatic method, and the spatial and temporal macroblock filter (STMF) is adopted in the automatic method. Section IV provides the experimental results for the proposed methods and analyzes them in terms of performance and computation time. Finally, Section V presents the conclusions and future work.




II Related Works

Object detection and tracking techniques can be categorized into the pixel domain approach and the compressed domain approach. The pixel domain approach uses the original pixel data, fully decoded from compressed bitstreams such as MPEG video. The compressed domain approach, on the other hand, exploits encoded information such as motion vectors, DCT coefficients, and macroblock types extracted from a compressed bitstream. Traditionally, research on object detection and tracking has concentrated on the pixel domain approach, since it provides powerful tracking capability by using computer vision techniques. However, the pixel domain approach takes a long time to perform object detection and tracking, even though it detects and tracks objects precisely. Since the late 1990s, the compressed domain approach has been seriously considered as a way to reduce the computational complexity of object detection and tracking. It can greatly reduce the computational complexity and make real-time or fast processing possible, although its precision is no better than that of the pixel domain approach. Recently, H.264|AVC-based algorithms, which handle video encoded by H.264|AVC, the most popular compression technology, have started to be proposed. The conventional H.264|AVC-based algorithms use motion vectors rather than DCT coefficients. In this chapter, H.264|AVC is summarized with respect to its Baseline profile only. Then the pixel domain approach and the compressed domain approach to object detection and tracking are explained in turn.

2.1 Overview of MPEG-4 Advanced Video Coding

H.264|AVC, that is, H.264|MPEG-4 Advanced Video Coding, is a video compression standard developed by the Moving Picture Experts Group (MPEG), a working group of the International Organization for Standardization (ISO), jointly with the ITU-T Video Coding Experts Group (VCEG). In contrast to MPEG-4 Visual, which emphasizes high flexibility in coding techniques and resources, H.264|AVC concentrates on the efficiency and reliability of video compression and transmission. To support popular applications of video compression, its original version defines only three profiles: the Baseline profile, the Extended profile, and the Main profile. The Baseline profile is particularly suited to real-time applications such as video conferencing and wireless mobile systems, since it contains error resilience functions as well as the basic technology for video compression. The Main profile is defined with applications such as broadcasting and multimedia storage in mind. Since these applications deal with large amounts of content, the Main profile emphasizes the technical functions that enhance compression efficiency, although they require high computational complexity. The Extended profile is useful for video streaming applications, whose purpose is the real-time playback of pre-encoded video content delivered in serial order. For this reason, it does not consider real-time encoding techniques but pursues high compression efficiency.

In this thesis, only the Baseline profile is considered among the three profiles, because it suits applications that need object detection and tracking, such as surveillance systems and metadata authoring tools for digital broadcasting services. This profile allows only I-slices and P-slices. I-slices contain only I-type macroblocks, which are encoded using intra prediction, while P-slices contain both I-type macroblocks and P-type macroblocks, which are encoded using inter prediction. The encoded information includes motion vectors, macroblock types, and DCT coefficients, which represent the residual, that is, the differences between the original pixel values and the predicted values.

In the Baseline profile, I-type macroblocks are encoded with either 4x4 or 16x16 intra prediction. Intra prediction uses some of the neighboring blocks of each block in the I-type macroblock as reference data. P-type macroblocks, on the other hand, can be split for inter prediction into macroblock partitions (16x16, 16x8, 8x16, or 8x8), and each 8x8 partition can be split further into sub-macroblock partitions (8x4, 4x8, or 4x4). Each partition has a single motion vector. P-type macroblocks can also be encoded in skip mode, in which the pixels in the motion-compensated region are used directly as the reconstructed data without residual data. Such macroblocks occur mainly in background regions whose color does not change under camera motion. Note that a macroblock in skip mode can still have a motion vector.
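As a rough illustration of the partitioning described above, the following Python sketch enumerates the partition sizes named in the text and counts how many motion vectors a single P-type macroblock would carry for a given choice. The names `MB_PARTITIONS`, `SUB_PARTITIONS`, and `motion_vector_count` are hypothetical and do not come from any decoder API; the sketch also assumes, for simplicity, that all four 8x8 partitions share the same sub-partition.

```python
# Illustrative sketch only: partition sizes are those listed in the text;
# the names below are hypothetical, not part of any H.264|AVC library.
MB_SIZE = 16  # a macroblock covers 16x16 luma samples

# Macroblock partitions (width, height) for inter prediction.
MB_PARTITIONS = [(16, 16), (16, 8), (8, 16), (8, 8)]
# Sub-macroblock partitions available inside each 8x8 partition.
SUB_PARTITIONS = [(8, 8), (8, 4), (4, 8), (4, 4)]


def motion_vector_count(mb_partition, sub_partition=(8, 8)):
    """Return the number of motion vectors in one P-type macroblock.

    Each partition carries one motion vector. Sub-partitions only apply
    when the macroblock is split into 8x8 partitions; this sketch assumes
    that all four 8x8 blocks use the same sub-partition.
    """
    if mb_partition not in MB_PARTITIONS:
        raise ValueError("unknown macroblock partition")
    w, h = mb_partition
    count = (MB_SIZE // w) * (MB_SIZE // h)
    if mb_partition == (8, 8):
        if sub_partition not in SUB_PARTITIONS:
            raise ValueError("unknown sub-macroblock partition")
        sw, sh = sub_partition
        count *= (8 // sw) * (8 // sh)
    return count
```

Under these assumptions, an unsplit 16x16 macroblock carries one motion vector, while a macroblock split entirely into 4x4 sub-partitions carries sixteen, which is why the motion vector field can be much finer in textured or moving regions than in flat background.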




2.2 Pixel Domain Approach

In the special case of two-dimensional rigid object detection and tracking, the pixel domain approach can be classified into four categories: 1) region-based methods, 2) contour-based methods, 3) feature-based methods, and 4) template-based methods. Region-based methods perform object detection and tracking using characteristics of the Region of Interest (ROI) such as its color histogram and motion distribution. Contour-based methods find the position and shape of the ROI by modeling the contour of objects. Feature-based methods calculate the motion parameters of feature points that are defined automatically or semi-automatically inside objects; some algorithms of this kind use cross-correlation or Gabor wavelets. Template-based methods define a template or model of the objects in advance and extract the ROI area that best matches the model. In this section, the four kinds of pixel domain methods are explained in turn.

2.2.1 Region-based Methods

In region-based methods, the region of an object is defined as a set of pixels with similar properties. Such an object region can be separated from an image sequence using motion information and object properties such as the color histogram [27-29]. Color information is especially useful for region-based methods when the representative colors of an object are clearly distinguishable from the background or from other objects. Since region-based methods that use color information tend to be sensitive to illumination change, they use illumination invariance or color correction to compensate for it [30].

Modeling the color distribution of objects in advance is an effective way to improve the performance of object detection and tracking. Color modeling can be categorized into parametric methods and nonparametric methods. Parametric methods apply Gaussian models to a color space normalized with respect to illumination [31]. In every frame, the regions that best match the color reference model are searched for. Since the color distribution of an object can change with the illumination conditions, it is statistically estimated and updated frame by frame. Nonparametric methods, on the other hand, apply a lookup table or a Bayes probability map to the Hue Saturation (HS) color space [32].
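The parametric idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the model of [31]: it uses normalized (r, g) chromaticity as the illumination-normalized color space, a single Gaussian with diagonal covariance, and an exponential blending rate for the frame-by-frame update. The names `normalize_rgb` and `GaussianColorModel` are hypothetical.

```python
import math


def normalize_rgb(r, g, b):
    """Map RGB to illumination-normalized chromaticity (r, g)."""
    s = r + g + b
    return (r / s, g / s) if s > 0 else (1 / 3, 1 / 3)


class GaussianColorModel:
    """Single Gaussian with diagonal covariance over (r, g) chromaticity.

    Assumption: real systems may use full covariances or mixtures.
    """

    def __init__(self, mean, var):
        self.mean = list(mean)
        self.var = list(var)

    def log_likelihood(self, c):
        """Log-probability density of chromaticity c under the model."""
        return sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(c, self.mean, self.var)
        )

    def update(self, samples, rate=0.1):
        """Blend the model toward the statistics of this frame's object pixels."""
        n = len(samples)
        for i in range(2):
            m = sum(s[i] for s in samples) / n
            v = sum((s[i] - m) ** 2 for s in samples) / n
            self.mean[i] = (1 - rate) * self.mean[i] + rate * m
            # Keep the variance strictly positive so the density stays defined.
            self.var[i] = max((1 - rate) * self.var[i] + rate * v, 1e-6)
```

In use, candidate regions in each frame would be scored by the summed log-likelihood of their pixels, and the model would then be updated with the pixels of the best-matching region. Scaling all three RGB channels by the same factor leaves the chromaticity unchanged, which is the sense in which this color space is normalized with respect to illumination.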

Background subtraction is another popular technique that is frequently used in region-based methods [29,33-35]. Although most object regions can be extracted by background subtraction, background-subtracted images can contain errors due to measurement errors or changes in the scene environment. Thus, techniques such as morphological filtering or dynamic updating of the background model are used to extract more precise foreground regions [36].
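The three ingredients mentioned above (thresholded subtraction, morphological filtering, and background updating) can be sketched as follows. This is a simplified illustration on grayscale frames stored as lists of rows; the threshold, the learning rate, the 3x3 erosion, and the running-average update are assumptions chosen for brevity rather than the techniques of the cited works.

```python
# Illustrative sketch: grayscale frames are lists of rows of ints (0-255).
# THRESHOLD and ALPHA are assumed values, not taken from the thesis.
THRESHOLD = 30   # minimum absolute difference to mark a pixel as foreground
ALPHA = 0.05     # learning rate for the running-average background update


def subtract_background(frame, background, threshold=THRESHOLD):
    """Return a binary foreground mask (1 = foreground, 0 = background)."""
    return [
        [1 if abs(p - b) > threshold else 0 for p, b in zip(frow, brow)]
        for frow, brow in zip(frame, background)
    ]


def erode(mask):
    """3x3 binary erosion: keep a pixel only if its whole neighborhood is set.

    Border pixels are always cleared, so isolated noise pixels and thin
    structures are removed.
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(mask[y + dy][x + dx]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y][x] = 1
    return out


def update_background(background, frame, mask, alpha=ALPHA):
    """Blend the current frame into the background where no object was found."""
    return [
        [b if m else (1 - alpha) * b + alpha * p
         for b, p, m in zip(brow, frow, mrow)]
        for brow, frow, mrow in zip(background, frame, mask)
    ]
```

Erosion alone shrinks the detected regions; in practice it is usually paired with a dilation (a morphological opening) to restore the object size, which is omitted here to keep the sketch short.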

2.2.2 Contour-based Methods

Contour-based methods for object detection and tracking find both the position and the shape of objects by modeling contour information. They are more robust to partial occlusion and are applicable to deformable objects as well as rigid objects, although their computational complexity is higher than that of region-based methods.

The active contour is a concept that has been widely used in contour-based algorithms [18,37-38]. An active contour is a polygon consisting of several points placed at specific features such as lines and edges. The total energy of an active contour is defined as the sum of its internal and external energies, which are related to contour elasticity and image content, respectively. Among several candidates for the active contour of an object, the one with the minimum energy is selected.
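The energy-based selection can be sketched as follows. The specific energy terms are assumptions made for illustration (squared segment lengths for elasticity, negative edge magnitude for the image term, and weights `alpha` and `beta`); the cited works use their own formulations, and all function names here are hypothetical.

```python
# Illustrative sketch; the energy terms and weights are assumptions chosen
# for simplicity, not the formulation used in the cited works.

def internal_energy(contour):
    """Elasticity term: sum of squared distances between consecutive
    points of the closed polygon."""
    n = len(contour)
    total = 0.0
    for i in range(n):
        (x0, y0), (x1, y1) = contour[i], contour[(i + 1) % n]
        total += (x1 - x0) ** 2 + (y1 - y0) ** 2
    return total


def external_energy(contour, edge_strength):
    """Image term: points lying on strong edges lower the energy.

    edge_strength[y][x] is assumed to be a precomputed, non-negative
    edge magnitude map.
    """
    return -sum(edge_strength[y][x] for x, y in contour)


def total_energy(contour, edge_strength, alpha=1.0, beta=1.0):
    """Weighted sum of the internal and external energies."""
    return (alpha * internal_energy(contour)
            + beta * external_energy(contour, edge_strength))


def best_contour(candidates, edge_strength, alpha=1.0, beta=1.0):
    """Pick the candidate contour with the minimum total energy."""
    return min(candidates,
               key=lambda c: total_energy(c, edge_strength, alpha, beta))
```

With equal weights, two candidates of the same shape have the same internal energy, so the one whose points lie on stronger edges is selected; raising `alpha` favors shorter, smoother contours instead.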

Another type of contour-based method is the graph-cut algorithm [39-40]. It defines the object region as a graph built from the combination of an inner and an outer boundary. The points on the inner boundary are called sources, and the points on the outer boundary are called sinks. The final deformation is then decided by computing the minimum cut of the graph.

2.2.3 Feature-based Methods

Feature-based methods for object detection and tracking in the pixel domain calculate the motion parameters of feature points. These motion parameters describe an affine transformation composed of rotation and translation in 2D space. In this approach, the object that users want to track is usually defined by a bounding box or a convex hull.

This approach can produce more tracking errors than other methods because it is sensitive to partial occlusion. Furthermore, its performance depends heavily on the selection of feature points. That is, feature points should be visually prominent, such as edges or boundaries that can be clearly separated and recognized from their neighborhood. To select feature points with such distinctive properties, various techniques such as the Hough transform and Gabor wavelets can be used [41-42].

Once feature points are selected, their displacement can be computed by minimizing the dissimilarity, as described in [43]. As an alternative to dissimilarity minimization, the cross-correlation method is also effective for tracking feature points, as introduced in [44-45]. It searches the square neighborhood of the point's position in the previous frame for the candidate that maximizes the cross-correlation. Alternatively, some researchers use the 2D golden section algorithm based on a mesh created by interconnecting feature points [42].
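The cross-correlation search can be sketched as follows. This is a minimal illustration that assumes grayscale frames stored as lists of rows and uses normalized cross-correlation over a small square patch; the patch radius `r`, the search range `search`, and the function names are assumptions, and the feature point is assumed to lie at least `r` pixels from the border of the previous frame.

```python
import math


def patch(image, cx, cy, r):
    """Flatten the (2r+1)x(2r+1) patch centered at (cx, cy) into a list."""
    return [image[y][x]
            for y in range(cy - r, cy + r + 1)
            for x in range(cx - r, cx + r + 1)]


def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a)
                    * sum((q - mb) ** 2 for q in b))
    return num / den if den > 0 else 0.0


def track_point(prev, curr, point, r=1, search=2):
    """Find the position in `curr` whose patch best matches the patch
    around `point` in `prev`, searching a square neighborhood."""
    px, py = point
    h, w = len(curr), len(curr[0])
    template = patch(prev, px, py, r)
    best, best_score = point, -2.0  # NCC is always >= -1
    for cy in range(max(r, py - search), min(h - r, py + search + 1)):
        for cx in range(max(r, px - search), min(w - r, px + search + 1)):
            score = ncc(template, patch(curr, cx, cy, r))
            if score > best_score:
                best, best_score = (cx, cy), score
    return best
```

Normalizing by the patch means and energies makes the score insensitive to uniform brightness and contrast changes between frames, which is one reason this measure is often preferred over a raw correlation sum.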

2.2.4 Template-based Methods

Template-based methods track specific objects such as faces by using a predefined template. First, a template is created by observation over a particular period or from a statistically constructed database [46]. The next step is template matching, which searches for the region that best matches the template. The template is projected onto the target image by minimizing a distance measure; in other words, the parameters of this geometric transformation are estimated [47-48].

However, since such an algorithm works only for rigid objects, algorithms for deformable objects have been introduced [49-51]. In these algorithms, the parameters of a deformable template are obtained by minimizing the template energy, which is composed of terms that attract the template toward prominent features such as edges. The deformed template is then obtained by deforming the template according to these parameters. Meanwhile, to cope with problems such as viewpoint change and illumination change, color-invariant features can be extracted by updating the template using Kalman or particle filters [48].

2.3 Compressed Domain Approach

The conventional compressed domain algorithms exploit motion vectors

or DCT coefficients instead of original pixel data in order to reduce the

computational complexity of object detection and tracking. However, these encoded

data are neither reliable enough nor sufficient to detect and track moving objects.

For example, the motion vectors in encoded bitstreams do not always coincide

with the 2D projected true motion (sometimes called optical flow), since the block

matching algorithm that produces motion vectors in a video encoder is designed

for data reduction rather than optical flow estimation. Furthermore, the motion

vectors are sparsely distributed over an image in units of blocks such as 8×8 or

16×16; that is, the motion vector field is not dense. With respect to DCT

coefficients in MPEG-1 or MPEG-2, the DC image can be constructed from the DCT

coefficients of an I-frame, which are produced directly from original pixel values

without intra prediction. However, the DC image provides insufficient information

about texture because its resolution is lower than that of the original image.

For these reasons, compressed domain algorithms have concentrated on

overcoming these limitations.
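The relation between the DC image and the original frame can be illustrated with a short sketch. With the 8×8 DCT scaled as in MPEG-1 and MPEG-2, the DC coefficient of a block equals eight times the block mean, so the DC image is, up to this scale factor, the block-mean image. The function below computes that block-mean image from pixel data purely for illustration; an actual compressed domain algorithm would read the DC coefficients from the bitstream instead. The function name and the list-of-rows frame representation are assumptions made here.

```python
def dc_image(frame, block=8):
    """Return the block-mean image of `frame` (a list of rows of pixel values).

    Each output value is the mean of one block x block tile, i.e. the DC
    coefficient of that tile divided by `block` for an orthonormal DCT.
    Incomplete tiles at the right and bottom borders are ignored.
    """
    h, w = len(frame), len(frame[0])
    out = []
    for by in range(0, h - h % block, block):
        row = []
        for bx in range(0, w - w % block, block):
            total = sum(frame[y][x]
                        for y in range(by, by + block)
                        for x in range(bx, bx + block))
            row.append(total / (block * block))
        out.append(row)
    return out
```

For a 352×288 frame and 8×8 blocks, the resulting image has only 44×36 samples, which shows why texture detail is largely lost.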

Most compressed domain algorithms for object detection and tracking

carry out object segmentation, which partitions an image into several segments

that represent background or moving objects with block-unit boundaries.

Since these algorithms extract the boundaries of objects as well as their location

and size, they may require more computation time than those which extract only

object location and size without describing object boundaries. Nevertheless,

it is worth surveying these object segmentation algorithms because they

are closely related to the object detection and tracking procedure.

In general, the compressed domain algorithms consist of two steps: the

clustering step and the filtering step. Depending on how these steps are organized,

the algorithms can be categorized into clustering-based methods and

filtering-based methods. The clustering-based methods group all blocks into several

regions according to their spatial or temporal similarity. Then, these regions are

merged with each other or classified as background or foreground. On the other

hand, the filtering-based methods extract foreground regions by filtering out blocks

which are expected to belong to background or by classifying all blocks into

foreground and background. Then, the foreground region is split into several object

parts through a clustering procedure. In this section, these two types of compressed

domain algorithms are described in turn.

2.3.1 Clustering-based Methods

As the most important measurement for extracting the moving object re-

gion, the clustering-based methods emphasize the local similarity of blocks ra-

ther than the global similarity in a whole image. First of all, they split an image

into several regions which consist of blocks with homogeneous properties of

motion vectors or DCT coefficients. Then, after similar regions are merged,

these regions are classified as background or foreground. In most clustering-based

methods, the primary cue for block clustering is the similarity of motion vectors,

while the similarity of DCT coefficients is employed as a complement to

improve the performance or refine object boundaries.

In the simplest algorithm, introduced in [11], blocks are merged

by grouping similar nearby motion vectors, and each merged block group is

considered a moving object. After the target object is manually selected among

several block groups, it is tracked by searching for the corresponding block

group which has a similar average motion vector. If such a group cannot be found

due to occlusion, a similarity measure based on DCT coefficients can be

employed.
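The grouping of similar nearby motion vectors can be sketched as a flood fill over the block grid. This is an illustrative sketch and not the implementation of [11]: the tolerance `tol`, the use of 4-connectivity, and the treatment of zero motion vectors as background are assumptions made here.

```python
from collections import deque

def cluster_blocks(mv, tol=1.0):
    """Label 4-connected blocks whose motion vectors differ by at most `tol`.

    `mv` is a grid (list of rows) of (dx, dy) tuples, one per block.
    Blocks with a zero motion vector keep label 0 (assumed background).
    """
    h, w = len(mv), len(mv[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] or mv[y][x] == (0, 0):
                continue
            next_label += 1
            labels[y][x] = next_label
            queue = deque([(x, y)])
            while queue:
                cx, cy = queue.popleft()
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if not (0 <= nx < w and 0 <= ny < h):
                        continue
                    if labels[ny][nx] or mv[ny][nx] == (0, 0):
                        continue
                    # Compare the neighbour with the block it is reached from.
                    dx = mv[ny][nx][0] - mv[cy][cx][0]
                    dy = mv[ny][nx][1] - mv[cy][cx][1]
                    if dx * dx + dy * dy <= tol * tol:
                        labels[ny][nx] = next_label
                        queue.append((nx, ny))
    return labels
```

Because each block is compared only with its already-labeled neighbor, a group may drift gradually in motion; comparing against the group average instead would be a stricter alternative.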

The leveled watershed technique is another possible approach to block

clustering. As described in [10], the leveled watershed technique can be applied

to low-resolution images which are generated from the DC and first two AC

coefficients. Then, the algorithm constructs a motion map in which the dominant

motion blocks are extracted from the histogram of accumulated motion vectors.

In particular, the intra-coded macroblocks are also added to the motion map. Based

on this motion map, all connected regions which have similar motion vectors are

merged.

However, the above similarity measure of motion vectors is neither accurate

nor reliable since motion vectors do not always correspond to optical flow. Thus,

the performance of object detection and tracking can be improved by measuring

the reliability of motion vectors. As described in [9], reliable motion vectors can

be extracted by the noise adaptive soft switching median (NASM) filter. Then,

for spatial segmentation in P-frames, it clusters NASM-filtered reliable motion

vectors into an optimal number of homogeneous groups according to motion

homogeneity. To compensate for the limitation of spatial segmentation due to the sparse

distribution of motion vectors, temporal segmentation is additionally employed.

After moving object regions in a P-frame are projected onto the current I-frame,

more precise boundaries can be obtained by maximizing the entropy based on

the DC image constructed from DCT coefficients.

In addition to the unreliability of motion vectors, the motion vector field

extracted from compressed videos is too sparse since one motion vector is

assigned per macroblock. Therefore, the clustering-based methods have evolved

to overcome the insufficiency of motion information caused by the sparse motion

vector field. In [13], Babu et al. introduced an advanced technique for extracting

more reliable and dense motion information from a sparse and unreliable

motion vector field. The algorithm calculates the reliability of motion vectors

based on the energy of DCT coefficients. Only reliable motion vectors are tem-

porally accumulated over several frames, and then are spatially interpolated by

median filtering to get the dense motion field; that is, one motion vector is

assigned to each pixel. In principle, the dense motion field can be clustered by

fitting an affine parametric motion model; however, such a clustering cannot be

precise since the dense motion field still remains unreliable. This problem is

handled by the expectation maximization (EM) algorithm, an iterative technique

that alternately refines the segmentation and the motion estimates. Also, the

optimal number of motion models is estimated by K-means

clustering. Such initially segmented object partitions are temporally tracked over

frames. Finally, the edge refinement process is done based on partially decoded

data from each edge block and its eight neighboring blocks.

Some algorithms make use of DCT coefficients rather than motion vectors

as the main resource because they do not trust the reliability of motion vectors. An

example of such algorithms is introduced in [3]. The algorithm merges spatially

similar blocks based on DCT coefficients. A region with spatial homogeneity can

contain both true motion and false motion blocks. The decision rule for true mo-

tion blocks is dependent on the motion-compensated error which is derived from

motion vectors and DCT coefficients. Basically, a region which includes more

true motion blocks than false motion blocks can be considered as a part of mov-

ing objects and called a “dynamic region”. If a non-dynamic region is over-

lapped with the regions projected from moving objects in the previous frame, its

status can also be altered into “projected dynamic region” according to the num-

ber of true motion blocks in the non-dynamic region. All dynamic regions and

projected dynamic regions are merged into moving objects in the current frame.

The algorithm described in [8] does not utilize motion vectors at all; it exploits

only DCT coefficients for object detection and segmentation. It initially clusters

a frame into several fragments according to the similarity of AC components and

DC image which is constructed from DCT coefficients. Next, it merges homoge-

neous fragments based on two spatiotemporal similarity measures. One measure

is to merge spatiotemporally similar fragments, and the other is to merge the

fragments with lower spatiotemporal similarity but high average temporal

change within the fragments. These similarity measures are defined as the com-

bination of the spatial similarity and the temporal similarity. While the spatial

similarity is based on the entropy of AC coefficients, the temporal similarity is

measured by applying the 3D Sobel filter along the x-, y-, and t-axes. Finally,

the fragments with high average temporal change are classified as objects, and

others are classified as background. Then, detailed features like edge information

can be extracted by decoding DCT coefficients around the boundaries of objects.

Region growing is a typical technique widely employed in clustering-based

methods. Chen et al. suggested a region growing algorithm which operates in the

MPEG-2 compressed domain [26]. In the first

stage, the algorithm extracts DC images in I-frames and P-frames. While the DC

image in an I-frame is extracted directly from DC coefficients, the DC image in

a P-frame can be estimated from the DC image of its reference frame as de-

scribed in [21]. The second stage is the object segmentation which is composed

of three steps: (1) obtaining object fragments by extracting blocks with high

temporal change in the DC image, (2) region growing which is performed by

merging interconnected nearby object fragments, and (3) merging the regions

with similar motion vectors and spatially close regions. In the case of non-

stationary cameras, global motion compensation can be applied. The last stage is

object tracking, which is performed by searching for the corresponding object in

the subsequent frame. The correspondence of objects is judged based on the

similarity of their pixel counts and the similarity of their center positions.

Another region growing algorithm has been proposed by Porikli and Sun

in [17]. The algorithm defines the frequency-temporal data structure which is

constructed from DCT coefficients and motion vectors. The feature vector at a

block indexed by $(m,n)$ in the $t$-th frame is defined as follows:

\[ f_{t,m,n} = \left[ dc_y,\, dc_u,\, dc_v,\, ac_h,\, ac_v,\, ac_d,\, e,\, mv_x,\, mv_y \right]^{T} \tag{1} \]

where $dc_y$, $dc_u$, and $dc_v$ denote the DC coefficients of the Y-, U-, and

V-channels; $ac_h$, $ac_v$, and $ac_d$ denote the averages of the horizontal,

vertical, and diagonal AC coefficients; $e$ describes the energy of the DCT

coefficients in a block; and $mv_x$ and $mv_y$ denote the x- and y-components of the

forward motion vector. It should be noted that the original motion vectors, which

point in the backward direction, are converted to forward motion vectors. After

constructing the frequency-temporal data structure, a seed region grows the volume

in spatial and temporal directions by merging

homogeneous blocks which have similar feature vectors. Next, each volume

is fitted to a motion model by estimating the affine motion parameters. Lastly,

each segmented volume is hierarchically clustered into objects by using its motion

parameters. To terminate the iteration of hierarchical clustering, the algorithm

measures a validity score that evaluates the result of object segmentation.

The reported processing time of the algorithm is 0.9~2 ms per frame.

2.3.2 Filtering-based Methods

While the first step in clustering-based methods is related to block cluster-

ing, the filtering-based methods first extract the foreground region by removing

all blocks which are unreliable or judged to belong to background. After such a

global segmentation is performed, the foreground region is split into multiple

objects by an appropriate clustering technique. A variety of filtering-based tech-

niques have been proposed.

Wang et al. employed three (spatial, temporal, and texture) confidence

measures and global motion compensation for filtering unreliable macroblocks.

The spatial confidence measure assesses how well a motion vector conforms to

the local motion smoothness constraint within a neighborhood region in terms of

its magnitude and direction. The temporal confidence measure assesses how

smoothly a motion vector changes over neighboring frames. On the other

hand, the texture confidence measure is based on the assumption that a

low-textured region tends to cause false motion vectors which do not coincide with

optical flow. Specifically, the average energy of AC coefficients in four

neighboring blocks is computed; if the AC energy is higher than a predefined

threshold, the current block has perfect texture confidence. After the three

confidence measures are combined, motion vectors with a low confidence score are

rejected, and the resulting holes in the motion field are repaired by spatial and

temporal motion filtering. The dominant foreground region is separated through

iterative estimation of global motion parameters such as zoom and vertical and

horizontal translations. Then, it is split into multiple objects by performing

K-means and EM clustering based on spatial and motion features. These objects are

tracked by their location and motion.
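The texture confidence measure can be sketched as below. The text only states that a block receives full confidence when the average AC energy of its neighboring blocks exceeds a threshold; the linear ramp used below that threshold, the function names, and the coefficient layout (DC term first) are assumptions made for this illustration.

```python
def ac_energy(block_coeffs):
    """Sum of squared AC coefficients of one block; block_coeffs[0] is the DC term."""
    return sum(c * c for c in block_coeffs[1:])

def texture_confidence(neighbor_blocks, threshold):
    """Confidence in [0, 1] derived from the mean AC energy of the neighbouring blocks.

    Returns 1.0 when the mean energy reaches `threshold`; below it, the value
    falls off linearly (an assumed choice, not taken from the source).
    """
    mean_energy = sum(ac_energy(b) for b in neighbor_blocks) / len(neighbor_blocks)
    return min(1.0, mean_energy / threshold)
```

A motion vector lying in a flat region then receives a low score and is more likely to be rejected when the three measures are combined.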

As shown in the above algorithm, global motion estimation is one of the

popular filtering-based techniques for extracting the foreground region. Similarly,

foreground/background segmentation can be performed by iterative macroblock

rejection, which iteratively estimates the parameters of the global-motion

model [22]. Then, the foreground is clustered by examining the temporal

consistency of the iterative rejection output.
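The idea of iterative macroblock rejection can be sketched with a deliberately simplified global-motion model. The sketch below assumes a pure translation estimated as the mean motion vector, whereas [22] estimates the parameters of a richer global-motion model; the threshold `thresh`, the iteration limit, and the function names are hypothetical.

```python
import math

def mean_mv(mvs, idx):
    """Mean motion vector over the macroblocks listed in `idx`."""
    n = len(idx)
    return (sum(mvs[i][0] for i in idx) / n, sum(mvs[i][1] for i in idx) / n)

def iterative_rejection(mvs, thresh=2.0, max_iter=10):
    """Estimate a translational global motion and return the rejected macroblocks.

    Each pass re-estimates the global motion from the current inliers and
    rejects macroblocks whose motion vector deviates by more than `thresh`.
    """
    inliers = list(range(len(mvs)))
    for _ in range(max_iter):
        gx, gy = mean_mv(mvs, inliers)
        kept = [i for i in inliers
                if math.hypot(mvs[i][0] - gx, mvs[i][1] - gy) <= thresh]
        if not kept or kept == inliers:
            break  # converged, or nothing left to fit
        inliers = kept
    rejected = sorted(set(range(len(mvs))) - set(inliers))
    return mean_mv(mvs, inliers), rejected
```

The rejected macroblocks are the foreground candidates; the sketch also makes the known weakness visible, since an object moving almost like the background never exceeds the threshold.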

The background subtraction technique is also useful for extracting the

foreground in the filtering-based methods. Zeng et al. proposed a change-based

algorithm which extracts moving objects through background detection based

on the inter-frame difference of DC images. First, background subtraction

is performed by applying the moment-preserving thresholding method to the

histogram of the inter-frame difference, on the basis of the experimental

observation that the background tends to have low inter-frame difference values.

Then, if moving objects are assumed to be non-Gaussian signals, they

can be detected by the fourth-order moment measure, which is computed within a

moving window of the inter-frame difference in block units. The blocks with a high

moment are considered part of moving objects.
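The fourth-order moment test can be sketched as follows. The sketch assumes the central fourth-order moment, a square moving window clipped at the image border, and a fixed threshold; these details, like the function names, are illustrative assumptions and are not taken from the cited work.

```python
def fourth_moment(values):
    """Fourth-order central moment of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 4 for v in values) / len(values)

def moving_blocks(diff, win=3, thresh=1.0):
    """Mark blocks whose windowed fourth-order moment of `diff` exceeds `thresh`.

    `diff` is the block-level inter-frame difference of two DC images,
    given as a list of rows.
    """
    h, w, r = len(diff), len(diff[0]), win // 2
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [diff[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            mask[y][x] = fourth_moment(window) > thresh
    return mask
```

A window with uniform difference values yields a zero moment, so static background and uniform illumination changes are not flagged.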

A more advanced background subtraction technique was introduced by Aggarwal

et al. in [2]. In this algorithm, a user manually selects a target object in an I-frame

by drawing a rectangular box. The location of the object in the subsequent

frames of one GOP is estimated by using motion vectors within the target object

region. The object region in the subsequent I-frame can be found by the back-

ground subtraction of DC images. In other words, all foreground regions are ex-

tracted by subtracting the DC image which is constructed from DC coefficients

from the low-resolution background image. The foreground is clustered into sev-

eral candidate objects, and then the target object is found among these candidate

objects. Specifically, the previous target object is projected onto the current

I-frame based on its motion vector. Then, each candidate object is compared with

the projected object in terms of the histograms of the DCT coefficients of the

chrominance components (Cb and Cr) and the distance from the projected object;

that is, the candidate object which has the smallest difference from the projected

object is considered the target object in the current I-frame. Finally, the locations

of the target object in the previous P- or B-frames are updated by object

interpolation. This method is applicable only to surveillance videos taken from a

stationary camera.

As an exception, the algorithm proposed by Yu et al. in [25] combines

background subtraction with the region growing technique, which is a kind of

clustering-based method. It builds a motion mask by clustering the given motion

vector field with the region growing algorithm, as well as a difference mask by

applying background subtraction to the DC image extracted from DCT

coefficients. The two masks are combined to obtain the final mask of moving

object regions.

Another technique for moving object segmentation is based on Markov

random field (MRF) theory. The MRF-based algorithms provide more reliable

performance; however, they do not show a significant reduction in computational

complexity. Benzougar et al. first proposed an algorithm based on a

Markovian macroblock labeling framework [6]. First of all, a 2D affine motion

model of the dominant image motion in each P-frame is computed from motion

vectors. Then, a frame is divided into two groups: the regions corresponding to

the estimated dominant image motion and those not corresponding to it. Since

the dominant image motion represents the global motion, the regions not

conforming to the dominant image motion can be regarded as the moving object

regions. However, this labeling is still not reliable due to the drawbacks of motion

vectors. To make the segmentation more reliable, the algorithm employs the

displaced frame difference (DFD) between successive DC images constructed from

the DC coefficients of P-frames. That is to say, if a block has a low DFD value, the

above labeling decision for this block is regarded as more reliable. For this reason,

the algorithm considers two factors: the DFD and the difference between the

motion vector of each block and the dominant image motion. The two factors are

combined in the form of an energy, and this energy is then minimized within the

Markovian labeling framework to find the optimal configuration of moving object

regions.

As another MRF-based approach, Treetasanatavorn et al. have proposed

an algorithm that applies the Gibbs-Markov random field theory and the

Bayesian estimation framework to separate the significant foreground object

from compressed motion vector field [15]. This algorithm performs object detec-

tion and tracking by maximizing the following probability density based on the

maximum a posteriori probability (MAP) estimation and the Gibbs-Markov ran-

dom field theory:

\[ \Pr(S, \hat{S}, V) = \Pr(V \mid S, \hat{S}) \, \Pr(S \mid \hat{S}) \, \Pr(\hat{S}) \tag{2} \]

where $S$ denotes an initial partition, $\hat{S}$ is its predicted partition, and $V$ is

the compressed motion vector field. First, this method uses only reliable

motion vectors after evaluating the reliability of motion vectors. The object

segmentation in the first frame is performed by the stochastic motion coherency

analysis introduced in [23]. In the subsequent frames, it additionally applies the

partition projection and relaxation scheme introduced in [24] in order to predict

$\hat{S}$. The conflict between the two predictions, that from the stochastic motion

coherency analysis and that from the partition projection and relaxation scheme, is

resolved by checking the incongruity between the two predicted partitions. In the

next step, the method attempts to obtain the optimal partition from the predicted

partition obtained in the first step by using the Bayesian estimation framework. To

this end, the algorithm relaxes region boundaries and then searches for the partition

configuration which maximizes $\Pr(S, \hat{S}, V)$. Finally, it classifies partitions

into background and foreground based on the reliability of motion vectors.

Recently, an MRF-based algorithm for H.264|AVC compressed video has

been proposed by Zeng et al. in [12]. The basic structure of this algorithm is

similar to that of other MRF-based algorithms; that is, similar motion vectors are

merged into multiple moving objects by minimizing the MRF energy. Its unique trait

is that it considers variable block sizes (such as 4×4, 4×8, and 16×16),

which are supported by H.264|AVC but not by MPEG-1 or MPEG-2. The

algorithm has two steps: (1) motion vector classification and (2) moving block

extraction. In the first step, motion vectors are classified into four types:

background MVs, edge MVs, foreground MVs, and noise MVs. In stationary scenes,

motion vectors with small magnitude are considered background MVs, while

motion vectors with large magnitude are classified as foreground MVs. Motion

vectors with intermediate magnitude are regarded as noise MVs. On the other

hand, a motion vector which is similar to the average motion vector of its

neighboring blocks can be considered an edge MV. The number of neighboring

blocks is decided by the macroblock partition type. In the second step,

the MRF classification is applied to find the optimal configuration of block

labeling (foreground or background) and to extract the blocks which represent

moving objects, in consideration of two cues: (1) the MV spatial similarity

within a moving region and (2) the temporal consistency of moving objects. The

object segmentation in I-frames is achieved by projecting the object regions

in the previous P-frame onto the current I-frame according to the inverted motion

vectors.

Thilak and Creusere have also proposed an algorithm for H.264|AVC

compressed videos which uses the probabilistic data association filter (PDAF),

a kind of Bayesian algorithm [25]. The algorithm has two separate steps:

the detection step and the tracking step. In the detection step, it constructs

a binary image; the pixels whose motion vectors have a magnitude that is neither

too small nor too large are considered to belong to moving objects. Mathematically,

such pixels should satisfy the following condition:



\[ I_i = \begin{cases} 1, & M_L \le M_i \le M_H \\ 0, & \text{otherwise} \end{cases} \tag{3} \]

where $I_i$ is the pixel at location $i$, $M_i$ is the magnitude of its motion vector,

and $M_L$ and $M_H$ are the lower and upper thresholds for the motion vector

magnitude. To improve the classification performance, the optimal threshold

values of $M_L$ and $M_H$ can be obtained by minimizing the Bayesian risk, which

is constructed from the probability densities and prior probabilities of the two

classes (target and background). Then, all pixels which are connected to each

other in the binary image are merged, and several fragments are formed. In the

tracking step, the motion of the target object is modeled as follows:

\[ \begin{aligned} x_{k+1} &= F x_k + v_k \\ z_k &= H x_k + w_k \end{aligned} \tag{4} \]

where $x_k$ is the state vector of the target at time $k$, $z_k$ is the observation

vector of the target, $v_k$ and $w_k$ are zero-mean, white, Gaussian noise

sequences with covariance matrices $Q_k$ and $R_k$ respectively, and $F$ and $H$ are

time-invariant matrices. Classically, such an object can be

tracked by the Kalman filter; however, since the Kalman filter can track only one

fragment, it can produce serious errors when one object is split into several

fragments. Therefore, the PDAF can be applied to handle such cases. It seems to

show reliable performance, but it does not guarantee a reduction of computational

complexity.
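For reference, one predict-and-update cycle of the classical Kalman filter for the linear model above can be sketched as follows. The sketch assumes a one-dimensional constant-velocity target, so that the state is (position, velocity), F = [[1, 1], [0, 1]] and H = [1, 0]; the noise levels `q` and `r` and the function name are hypothetical, and the sketch does not include the data association performed by the PDAF.

```python
def kalman_step(x, P, z, q=1e-3, r=1.0):
    """One Kalman predict/update cycle for a 1D constant-velocity target.

    x: state [position, velocity]; P: 2x2 covariance (list of rows);
    z: scalar position measurement; Q = q*I and R = r are assumed noise levels.
    """
    # Predict: x' = F x, P' = F P F^T + Q with F = [[1, 1], [0, 1]].
    xp = [x[0] + x[1], x[1]]
    a, b, c, d = P[0][0], P[0][1], P[1][0], P[1][1]
    Pp = [[a + b + c + d + q, b + d],
          [c + d, d + q]]
    # Update with H = [1, 0]: the innovation covariance is a scalar.
    s = Pp[0][0] + r
    k0, k1 = Pp[0][0] / s, Pp[1][0] / s
    y = z - xp[0]
    xn = [xp[0] + k0 * y, xp[1] + k1 * y]
    # P = (I - K H) P'
    Pn = [[(1 - k0) * Pp[0][0], (1 - k0) * Pp[0][1]],
          [Pp[1][0] - k1 * Pp[0][0], Pp[1][1] - k1 * Pp[0][1]]]
    return xn, Pn
```

Because the update accepts exactly one measurement `z` per step, the filter has no way to use several fragments of the same object, which is the limitation that motivates the PDAF.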




2.3.4 Issues in Compressed Domain Approach

The essential goal of the compressed domain approach is to significantly

reduce the computational complexity, even though this slightly deteriorates the

performance of object detection and tracking. The processing speed of the major

algorithms is shown in Table 1; for the other algorithms, the processing speed has

not been reported.

Table 1. The processing time of compressed domain algorithms.

Authors                        Frames/sec   PC (CPU clock)   Note
Zen et al. [11]                5~10         unknown
Wang et al. [7]                2            450 MHz
Chen et al. [26]               43           unknown
Benzougar et al. [6]           40           400 MHz          Excluding video decoding
Mezaris et al. [22]            200          800 MHz          Excluding video decoding
Zeng et al. [12]               2~16         700 MHz          Available for H.264|AVC
Treetasanatavorn et al. [15]   0.1          500 MHz
Porikli and Sun [17]           111~500      4.3 GHz
Aggarwal et al. [2]            100          1.8 GHz

The algorithms which show significantly fast processing time are Chen et al.'s,

Mezaris et al.'s, Porikli and Sun's, and Aggarwal et al.'s algorithms. In particular,

although Mezaris et al.'s algorithm and Porikli and Sun's algorithm perform

object segmentation simultaneously with object detection and tracking,

their processing time is remarkably fast.

Nevertheless, these algorithms have some critical shortcomings which

cause poor performance of object detection and tracking. First of all, they are

applicable only in extremely restricted environments; that is, they can produce

serious errors in particular scene situations. For instance, Chen et al.'s algorithm

first extracts the foreground region from the difference image of temporally

neighboring DC images [26]. This is not reasonable because most internal parts of

an object region can be excluded from the extracted foreground region when the

interior of the object is low-textured. In other words, it can achieve successful

results only when the texture of most of the object region changes noticeably

between frames. In the case of Mezaris et al.'s algorithm, the foreground is obtained

through global-motion compensation based on an iterative macroblock rejection

scheme [22]. That is, a macroblock whose motion vector differs greatly from the

global motion is considered foreground. However, it can fail to extract the whole

foreground region when the motion of moving objects is not clearly distinguishable

from the global motion. Porikli and Sun's algorithm can also make errors due to the

limitation of the region merging technique [17]. For spatiotemporal segmentation, it

merges blocks that have similar motion vectors and DCT coefficients. However,

an object region can contain chaotic motion vectors; for example, when a

deformable object moves along the viewing direction of the camera, or when a large

portion of it consists of homogeneous texture, a chaotic set of motion

vectors is produced with various amplitudes or directions in unpredictable

patterns. Likewise, the limitation of Aggarwal et al.'s algorithm is that it does not

consider changes in the size of the target object, which is manually selected as

a rectangular box [2]. The algorithm is applicable only when the object

size is constant over frames.

Another problem in these algorithms is that they are not compatible with

H.264|AVC. These algorithms commonly exploit the DC images which are

formed from DCT coefficients in an I-frame. In MPEG-1 or MPEG-2 bitstreams,

the DC image formation is possible because raw pixel data in I-frames is directly

converted by the discrete cosine transform (DCT) without intra prediction. On

the other hand, since in H.264|AVC the difference between original pixel data

and intra-predicted pixel value is converted by the integer transform (IT), the DC

image cannot be built in I-frames and P-frames.

Additionally, these algorithms do not support consistent object recognition

based on color information. For example, when we track multiple persons, a

person can repeatedly enter and leave the camera view. In that case, it is difficult

in H.264|AVC videos to recognize the person's identity based only on motion

vectors and IT coefficients.

In the proposed methods, the above problems are addressed in three

ways: (1) reinforcing the adaptability to various scenes, (2) reflecting the

distinctive features of H.264|AVC bitstreams, and (3) partially decoding the ROIs.

As a result, the proposed methods can not only maintain fast computation

time but also achieve more reliable performance than the traditional

algorithms.




III Proposed Schemes for Moving Object Detection and Tracking with Partial Decoding in H.264|AVC Bitstream Domain

In this chapter, two algorithms for object detection and tracking in

H.264|AVC bitstream domain are introduced. One approach is the semi-

automatic method for interactive broadcasting services, and the other approach is

the automatic method especially for real-time surveillance applications. The

semi-automatic method adopts the dissimilarity minimization algorithm, whereas

the automatic method is based on the spatial and temporal macroblock filter

(STMF). Both techniques concentrate on improving the performance

in various scenes where the traditional compressed domain algorithms are not

applicable.

It should be noted that, unlike traditional compressed domain algorithms,

the proposed algorithms exploit partially decoded pixel data as well as encoded

information like motion vectors or IT coefficients in order to detect and track

moving objects. Even though some compressed domain algorithms contain par-

tial decoding process, it does not positively contribute to object detection and

tracking procedure; it is just for boundary refinement [8,13]. The partial decod-

ing in the proposed algorithms can increase the processing time; however, it

makes a great contribution to finding more accurate locations and sizes of moving

objects. In addition, it provides the color information of multiple objects,

which can be used for object recognition or metadata formation.

3.1 Semi-automatic Approach

In order to extract location information of a predefined target object from

stationary or non-stationary scenes encoded by H.264|AVC, the dissimilarity

energy minimization algorithm can be exploited. It makes use of motion vectors

and partially decoded luminance signals to perform tracking adaptively according

to the properties of the target object in H.264|AVC videos. It is a semi-automatic

feature-based approach that tracks feature points selected by a

user. First, it roughly predicts the position of each feature point using motion

vectors extracted from the H.264|AVC bitstream. Then, it finds the best position

inside the given search region by considering three cues: the texture, form,

and motion dissimilarity energies. Since only the neighborhood regions of feature

points are partially decoded to compute these energies, the computational

complexity is greatly reduced. The set of the best positions of feature points in each frame is

selected to minimize the total dissimilarity energy by dynamic programming.

Also, weight factors for dissimilarity energies are adaptively updated by the

neural network. Compared with the traditional compressed domain algorithms,

the algorithm can successfully track the target object even when its shape is de-

formable over frames or its motion vectors are not homogeneous due to high-

textured background.




where $S_{k-1}(i,j)$ stands for the overlapping area between $b_{k,j}$ and $b_{k-1,i}$,

and $mv_k(b_{k,j})$ denotes the backward motion vector of $b_{k,j}$ with

$i, j = 1, 2, \ldots, N$. We assume that H.264|AVC videos are encoded in the baseline

profile, in which each GOP contains just one I-frame and several P-frames. It should

be noted that the above region-matching method cannot be applied to the last

P-frame in a GOP since the next I-frame does not have backward motion vectors.

Assuming that the motion of each block is approximately constant within a small

time interval, the forward motion vector of any block in the last P-frame can be

assigned as the reverse of the backward motion vector, as expressed by

\[ fmv_{k-1}(b_{k-1,i}) = -mv_{k-1}(b_{k-1,i}). \tag{6} \]

Thereafter, the positions of the feature points in the next frame are predicted using the forward motion vectors. If the $n$th feature point in the $(k-1)$th frame has the displacement vector $f_{k-1,n} = (fx_{k-1,n}, fy_{k-1,n})$ and is included in the $i$th block $b_{k-1,i}$, the predicted displacement vector $p_{k,n} = (px_{k,n}, py_{k,n})$ in the $k$th frame is defined as

\[ p_{k,n} = f_{k-1,n} + fmv_{k-1}(b_{k-1,i}). \tag{7} \]
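As a rough illustration of Equations (6) and (7), the following sketch predicts feature point positions from a block-level motion vector field. It rests on several assumptions that are not part of the original method description: the function names, the dictionary layout of the motion vector field, and the 4x4 block size are hypothetical, and the forward motion vector is always approximated by reversing the backward one, which the text applies to the last P-frame of a GOP.

```python
from typing import Dict, List, Tuple

Vec = Tuple[float, float]
BlockIndex = Tuple[int, int]


def forward_mv(backward_mv: Vec) -> Vec:
    """Eq. (6): reverse the backward motion vector to approximate the forward one."""
    return (-backward_mv[0], -backward_mv[1])


def predict_feature_points(
    points: List[Vec],
    backward_mvs: Dict[BlockIndex, Vec],
    block_size: int = 4,  # assumed block size; not specified by the text
) -> List[Vec]:
    """Eq. (7): shift each feature point by the forward motion vector of its block."""
    predicted = []
    for x, y in points:
        block = (int(x) // block_size, int(y) // block_size)
        # Blocks without a motion vector (e.g. skipped blocks) are assumed static.
        fmv = forward_mv(backward_mvs.get(block, (0, 0)))
        predicted.append((x + fmv[0], y + fmv[1]))
    return predicted
```

For instance, a feature point at (10, 20) lying in a block whose backward motion vector is (-3, 2) is predicted at (13, 18).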

Since the predicted position of a feature point is not precise, we must search for its best position inside the search region centered at the predicted position $p_{k,n} = (px_{k,n}, py_{k,n})$. It is checked whether




Figure 2. The search region is centered at the predicted point located by a forward motion vector. Each candidate point inside the search region has a square neighborhood which is used to compute $E_C$.

Only the necessary blocks need to be partially decoded in P-frames, which reduces the computational complexity. On the other hand, intra-coded blocks cannot be partially decoded in isolation since they are spatially predicted from their neighboring blocks.

General partial decoding takes a long time since decoding particular blocks in P-frames requires many reference blocks to be decoded in the previous frames. We can predict the blocks to be decoded in order to reduce the computation time. To predict the decoded blocks in the $k$th P-frame, we assume that the velocity inside one GOP is uniform and equal to the forward motion vector of the $(k-2)$th frame. For the $i$th frame with $i = k, k+1, \ldots, K$, the predicted search region $P_{k,n}(i)$ is defined as the set of pixels which are necessary to calculate the texture dissimilarity energies of all possible candidate points for the $n$th feature point. Then, the half maximum interval $T_{k,i}$ of $P_{k,n}(i)$ is $T_{k,i} = (i-k+1) \times M + W + \gamma$, where $M$ is the search half interval, $W$ is the neighborhood half interval, and $\gamma$ denotes the prediction error. Then,




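$P_{k,n}(i)$ is given as follows:

\[ P_{k,n}(i) = \left\{ p \;\middle|\; p = f_{k-1,n} + (i-k+1)\, fmv_{k-2}\big(b(f_{k-2,n})\big) + m,\; m = (x_m, y_m),\; x_m, y_m = -T_{k,i}, \ldots, T_{k,i} \right\} \tag{9} \]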

where $b(f_{k-2,n})$ stands for the block which includes the $n$th feature point $f_{k-2,n}$. The decoded block set $D_{k,n}(i)$ is defined as the set of blocks which should be decoded to reconstruct $P_{k,n}(i)$. Using the motion vector of the $(k-1)$th frame, $D_{k,n}(i)$ is given by

\[ D_{k,n}(i) = \left\{ b(d) \;\middle|\; d = p + (i-k)\, mv_{k-1}\big(b(f_{k-1,n})\big),\; p \in P_{k,n}(i) \right\} \tag{10} \]

Assuming that there exist $F$ feature points, the total decoded block set $D_k$ in the $k$th frame can be finally computed as

\[ D_k = \bigcup_{n=1}^{F} \bigcup_{i=k}^{K} D_{k,n}(i) \tag{11} \]

Figure 3 shows how partial decoding is performed in the first P-frame of a GOP which contains one I-frame and three P-frames. Note that the time for calculating the total decoded block set is proportional to the GOP size.




Figure 3. The structure of partial decoding in the first P-frame of a GOP which contains

one I-frame and three P-frames. Two decoded block sets Dk,n(k+1) and Dk,n(k+2) in the

first P-frame are projected from two predicted search regions Pk,n(k+1) and Pk,n(k+2).

3.1.3 Form Dissimilarity Energy

The similarity of form measures how similar the network of candidate points is to the network of feature points in the previous frame. The feature points are linked to one another by straight lines as in Figure 4. After a feature point is initially selected, it is connected to the closest one among the non-linked feature points. In this way, the feature network in the first frame is built by connecting all feature points successively.

To calculate the form dissimilarity energy of each candidate point, we assume that the feature points are arranged in the order established in the first frame. The feature point $f_{k-1,n}$ in the $(k-1)$th frame has its difference vector $fd_{k-1,n} = f_{k-1,n} - f_{k-1,n-1}$ as shown in Figure 4. Likewise, the $i$th candidate point of the $n$th feature point in the $k$th frame has its difference vector $cd_{k,n}(i) = c_{k,n}(i) - c_{k,n-1}(j)$. Then, the form dissimilarity energy $E_F$ for the $i$th candidate point of the $n$th feature point ($n > 0$) is defined as follows:

\[ E_F(k; n, i) = \left\| cd_{k,n}(i) - fd_{k-1,n} \right\|^{1/2} \tag{12} \]

All candidate points of the first feature point ($n = 0$) have zero form dissimilarity energy, $E_F(k; 0, i) = 0$. The smaller $E_F$ is, the less the form of the feature network is transformed. The form dissimilarity energy thus drives the best position of a candidate point toward the position where the form of the feature network changes as little as possible.
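The form dissimilarity energy can be sketched in a few lines, assuming that it is the square root of the Euclidean distance between the candidate difference vector and the feature difference vector. The function name and the argument layout are hypothetical and serve only as an illustration.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def form_energy(cand: Vec, prev_cand: Vec, feat: Vec, prev_feat: Vec) -> float:
    """Square root of the distance between the candidate and feature difference vectors."""
    # Difference vector between candidates of consecutive feature points (current frame).
    cd = (cand[0] - prev_cand[0], cand[1] - prev_cand[1])
    # Difference vector between consecutive feature points (previous frame).
    fd = (feat[0] - prev_feat[0], feat[1] - prev_feat[1])
    return math.sqrt(math.hypot(cd[0] - fd[0], cd[1] - fd[1]))
```

When the candidate network is a pure translation of the previous feature network, both difference vectors coincide and the energy is zero, so rigid motion is not penalized.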

Figure 4. The network of feature points in the previous frame and the network of

candidate points in the current frame.

3.1.4 Motion Dissimilarity Energy

The reliability of a forward motion vector measures how close it is to the true motion, so that the predicted point is as exact as possible. Following Fu et al. [6], if the predicted point $p_{k,n}$ located by the forward motion vector $fmv_{k-1}$ returns to its original location in the previous frame through the backward motion vector $mv_k$, then $fmv_{k-1}$ is highly reliable. Assuming that $p_{k,n}$ is included in the $j$th block $b_{k,j}$, the reliability $R$ can be given as follows:

\[ R(p_{k,n}) = \exp\left( -\frac{\left\| fmv_{k-1}(b_{k-1,i}) + mv_k(b_{k,j}) \right\|^2}{2\sigma^2} \right) \tag{13} \]

where σ is the variance of reliability. Figure 5 shows forward motion vectors with high and low reliability. In a way similar to Fu's definition [18], the motion dissimilarity energy $E_M$ for the $i$th candidate point is defined as follows:

\[ E_M(k; n, i) = R(p_{k,n}) \left\| c_{k,n}(i) - p_{k,n} \right\| \tag{14} \]

With high reliability $R$, $E_M$ has a greater effect on finding the best point than $E_C$ or $E_F$ since it varies sharply with the distance between a predicted point and a candidate point.
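The reliability and the motion dissimilarity energy can be illustrated as follows. This is a minimal sketch which assumes a Gaussian reliability of the round-trip error (forward plus backward motion vector) and an energy equal to the reliability times the distance between the candidate and the predicted point; the function names and the default value of sigma are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def reliability(fmv: Vec, mv: Vec, sigma: float = 1.0) -> float:
    """Close to 1 when the backward vector cancels the forward vector."""
    ex, ey = fmv[0] + mv[0], fmv[1] + mv[1]  # round-trip error
    return math.exp(-(ex * ex + ey * ey) / (2.0 * sigma ** 2))


def motion_energy(candidate: Vec, predicted: Vec, fmv: Vec, mv: Vec,
                  sigma: float = 1.0) -> float:
    """Distance to the predicted point, weighted by the reliability of the prediction."""
    distance = math.hypot(candidate[0] - predicted[0], candidate[1] - predicted[1])
    return reliability(fmv, mv, sigma) * distance
```

An unreliable forward motion vector yields a reliability near zero, so the candidate is then barely penalized for lying far from the predicted point.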

Figure 5. The reliability of forward motion vectors. The great gap between a forward

motion vector and a backward motion vector results in low reliability.




3.1.5 Energy Minimization

The dissimilarity energy $E_{k,n}(i)$ for the $i$th candidate point of the $n$th feature point is defined as follows:

\[ E_{k,n}(i) = w_C(k)\, E_C(k; n, i) + w_F(k)\, E_F(k; n, i) + w_M(k)\, E_M(k; n, i) \tag{15} \]

where $w_C(k)$, $w_F(k)$, and $w_M(k)$ are the weight factors for the texture, form, and motion dissimilarity energies. If a configuration of candidate points is denoted as $I = \{c_{k,1}(i_1), c_{k,2}(i_2), \ldots, c_{k,F}(i_F)\}$, the optimal configuration $I_{opt}(k)$ in the $k$th frame is selected as the one which minimizes the total dissimilarity energy $E_k(I)$ expressed by

\[ E_k(I) = \sum_{n=1}^{F} E_{k,n}(i_n) \tag{16} \]

When all possible configurations of candidate points are considered, the search takes $\Theta((2M+1)^{2F})$ time, which is prohibitive especially for a large search region or many feature points. We can reduce the amount of computation to $\Theta(F)$ by using a discrete multistage decision process called dynamic programming, which consists of two steps [19]:

A. The accumulated dissimilarity energy (ADE) $E_{local}(n, i)$ for the $i$th candidate point of the $n$th feature point ($n > 0$) is calculated as follows:

\[ E_{local}(n, i) = \min_{j} \left[ E_{k,n}(i, j) + E_{local}(n-1, j) \right] \tag{17} \]
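The accumulation step can be sketched as a Viterbi-style recursion over the chain of feature points. The code below is an illustration under assumptions: the function name and the callback signature are hypothetical, and the backtracking step, which recovers the configuration from the stored minimizers, is the usual companion of the accumulation step rather than a transcription of the original procedure.

```python
from typing import Callable, List, Tuple


def minimize_total_energy(
    num_candidates: List[int],
    energy: Callable[[int, int, int], float],
) -> Tuple[List[int], float]:
    """Dynamic programming over the chain of feature points.

    energy(n, i, j) is the dissimilarity energy of candidate i of feature
    point n given candidate j of feature point n-1 (j is ignored for n == 0).
    """
    local = [[energy(0, i, 0) for i in range(num_candidates[0])]]
    back: List[List[int]] = [[-1] * num_candidates[0]]
    for n in range(1, len(num_candidates)):
        row, brow = [], []
        for i in range(num_candidates[n]):
            # Accumulated energy: own energy plus the cheapest predecessor.
            costs = [energy(n, i, j) + local[n - 1][j]
                     for j in range(num_candidates[n - 1])]
            best_j = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[best_j])
            brow.append(best_j)
        local.append(row)
        back.append(brow)
    # Assumed second step: backtrack from the cheapest final candidate.
    i = min(range(len(local[-1])), key=local[-1].__getitem__)
    total = local[-1][i]
    path = [i]
    for n in range(len(num_candidates) - 1, 0, -1):
        i = back[n][i]
        path.append(i)
    return path[::-1], total
```

The number of stages grows linearly with the number of feature points, whereas exhaustive search over all configurations grows exponentially.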




to its output value $\dot{E}_k$ by the nonlinear activation function $\xi$. The update of the weight factors is performed by the backpropagation algorithm, which minimizes the square output error $\varepsilon_k$ defined as follows:

\[ \varepsilon_k = \frac{1}{2} \left( E_d - \dot{E}_k \right)^2 \tag{20} \]

where $E_d$ denotes the ideal output value. If the activation function $\xi$ is the unipolar sigmoidal function ($\xi(x) = 1/(1+e^{-x})$), the gradient of a weight factor is calculated as

\[ \Delta w_x(k) = \eta \left( E_d - \dot{E}_k \right) \dot{E}_k \left( 1 - \dot{E}_k \right) E_x(k) \tag{21} \]

where $x$ can be T (texture), F (form), or M (motion), and $\eta$ is the learning constant [20].
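One update step of the weight factors can be sketched as follows. The sketch assumes a single neuron whose net input is the weighted sum of the three dissimilarity energies, a unipolar sigmoid activation, and the delta rule for the weight change; the function names and the dictionary keys are hypothetical.

```python
import math
from typing import Dict, Tuple


def sigmoid(x: float) -> float:
    """Unipolar sigmoidal activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def update_weights(
    weights: Dict[str, float],
    energies: Dict[str, float],
    ideal_output: float,
    eta: float,
) -> Tuple[Dict[str, float], float]:
    """Return the updated weight factors and the square output error before the update."""
    net = sum(weights[x] * energies[x] for x in weights)  # assumed net input
    out = sigmoid(net)
    error = 0.5 * (ideal_output - out) ** 2
    # Delta rule with the sigmoid derivative out * (1 - out).
    delta = (ideal_output - out) * out * (1.0 - out)
    updated = {x: weights[x] + eta * delta * energies[x] for x in weights}
    return updated, error
```

Calling the function repeatedly with the energies of successive frames moves the output toward the ideal value, so the returned error shrinks from one call to the next on a fixed input.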

Figure 6. The neural network for updating weight factors.




3.2 Automatic Approach

For the automatic detection and tracking of moving objects in the H.264|AVC bitstream domain, a novel method based on the spatial and temporal macroblock filter (STMF) is introduced. The STMF exploits macroblock types and IT coefficients, which indicate the existence of motion and the temporal texture change in a macroblock; this encoded information is used to extract foreground regions.

As depicted in Figure 7, the method is composed of two stages: object extraction and object refinement. In the object extraction stage, all object regions are roughly extracted by the STMF based on the occurrence probability of the objects. The STMF first removes the blocks which are judged to be background based on macroblock types and IT coefficients, and then clusters the remaining blocks into several fragments called block groups. Since some block groups can still belong to the background, it calculates the occurrence probability of each block group based on its temporal consistency. Only block groups with high probability are considered real objects. In the object refinement stage, the location and size of the object regions are then precisely refined by background subtraction with partial decoding in I-frames and motion interpolation in P-frames.



Figure 7. A procedure of object region extraction and refinement. [Diagram: the object extraction stage (block group extraction, spatial filtering, temporal filtering) runs on P-frames; the object refinement stage comprises region prediction, partial decoding, and background subtraction on I-frames, and motion interpolation on P-frames.]

3.2.1 Block Group Extraction

To detect and track moving objects in surveillance videos encoded by an H.264|AVC baseline profile encoder, we assume that the surveillance camera is fixed, so that there is no camera motion, and that I-frames are periodically inserted at intervals of fewer than 10 frames. It is observed that, with a fixed camera, most macroblocks of the background tend to be encoded in skip mode in P-frames, while most parts of the foreground tend to be encoded in non-skip modes. From these observations, we may consider sets of non-skip blocks as the foreground candidates for moving object detection and tracking.



Figure 8. Block groups before and after spatial filtering. [Diagram: block groups labeled F1 and B1 to B8.]

Figure 8 shows that the approximate foreground in a P-frame consists of a set of ‘block groups’, each of which consists of non-skip blocks connected in the horizontal, vertical, or diagonal directions. However, such a simple segmentation into block groups is not enough to define moving objects, since non-skip blocks may occur in the background and skip-mode blocks may occur in the foreground region. For example, some macroblocks in a homogeneous region of the background are encoded as inter-coded blocks with motion vectors instead of skip-mode blocks. Likewise, when the visual change caused by object motion is negligible, the whole or some parts of the object can be encoded as skip-mode blocks. Moreover, one object region can be separated into several block groups which are disconnected from one another. Therefore, block grouping based on the simple classification into skip-mode and non-skip-mode blocks is not sufficient to define moving objects as ROIs. To decide whether each block group represents a real object or a part of the background, we use the spatial and temporal macroblock filter (STMF), which is applied only in P-frames. The filter consists of two modules: spatial filtering and temporal filtering.

3.2.2 Spatial Filtering

Spatial filtering removes most of the block groups in the background by using IT coefficients. That is, the block groups which contain just one non-skip macroblock or do not contain any non-zero IT coefficients are considered to belong to the background, since such groups tend to occur in the background rather than in the foreground. In other words, we regard as candidates for real objects the block groups which contain more than one non-skip macroblock and include non-zero IT coefficients. Although some block groups of a real foreground object may be mistaken for background, this rarely happens; far more block groups are removed in the background. Spatial filtering therefore removes a large number of false block groups in the background at little cost to the foreground.

As shown in Figure 8, nine block groups (indicated as F1 and B1 to B8) are first detected in a frame. After spatial filtering, two active block groups (F1 and B4) are left while the other block groups are removed. It can be seen that most of the block groups belonging to the background consist of a single macroblock, except B3 and B4. After spatial filtering, B3 is removed since all of its IT coefficients are zero, but B4 survives due to its non-zero IT coefficients. Each frame after spatial filtering can contain several active block groups, so the proposed method supports the detection and tracking of multiple objects.
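Block grouping and spatial filtering can be sketched as an 8-connected component search followed by the two removal rules. This is only an illustration: the function names and the two per-macroblock input maps (a non-skip flag and a flag for non-zero IT coefficients) are hypothetical stand-ins for information parsed from the bitstream.

```python
from typing import List, Sequence, Set, Tuple

Block = Tuple[int, int]  # (row, column) of a macroblock


def extract_block_groups(nonskip: Sequence[Sequence[int]]) -> List[Set[Block]]:
    """Group non-skip macroblocks connected horizontally, vertically, or diagonally."""
    rows = len(nonskip)
    cols = len(nonskip[0]) if rows else 0
    seen: Set[Block] = set()
    groups: List[Set[Block]] = []
    for r in range(rows):
        for c in range(cols):
            if not nonskip[r][c] or (r, c) in seen:
                continue
            group: Set[Block] = set()
            stack = [(r, c)]
            seen.add((r, c))
            while stack:  # depth-first flood fill over the 8-neighborhood
                y, x = stack.pop()
                group.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and nonskip[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
            groups.append(group)
    return groups


def spatial_filter(groups: List[Set[Block]],
                   nonzero_it: Sequence[Sequence[int]]) -> List[Set[Block]]:
    """Keep groups with more than one macroblock and at least one non-zero IT coefficient."""
    return [g for g in groups
            if len(g) > 1 and any(nonzero_it[r][c] for r, c in g)]
```

The groups returned by `spatial_filter` play the role of the active block groups that are passed on to temporal filtering.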

3.2.3 Temporal Filtering

The temporal filtering process further removes the block groups in the background which survive spatial filtering. The block groups that survive spatial filtering are called active block groups. The active block groups are labeled with their object IDs by object detection and tracking through temporal evolution. Each block group can be classified as a real object or as background. In particular, the active block groups for which it has not yet been determined whether they are real objects or background are called candidate objects. Hence, an active block group can be labeled as a candidate object C, a real object R, or background B.

For the classification of active block groups, a newly appeared (or detected) active block group is initially regarded as a candidate object. The candidate object is regarded as a real object when it exhibits temporal coherence, that is, when a high occurrence probability is obtained during an observation period. In contrast, the active block groups in the background tend to appear and disappear randomly in time, while those in the foreground tend to move smoothly and persist over a relatively long period of time in subsequent frames.

The more frequently a candidate object occurs during a given observation period, the higher its occurrence probability. The longer the observation period is, the more precise the classification becomes. The structure of temporal filtering is illustrated in Figure 9.

Figure 9. Temporal filtering based on the occurrence probability of active group trains. [Diagram: active group trains $T_1$ to $T_6$ over the frames of an observation period; three of the trains are marked as real objects.]

Before temporal filtering is applied to an initial active block group $A$, it is assigned the active group train $T_l$, which is labeled by $l$ and is defined as follows:

\[ T_l = \left\{ G_l^i \;\middle|\; G_l^1 = A,\; 1 \le i \le \tau \right\} \tag{22} \]

where $\tau$ indicates the length of an observation period, and $G_l^i$, called the succeeding active block groups, denotes the set of active groups corresponding to $A$ in the $i$th frame during the observation period as follows:

\[ G_l^i = \left\{ X \;\middle|\; X \cap G_l^{i-1} \neq \emptyset,\; X \in C^i \right\} \tag{23} \]

where $C^i$ denotes the set of all active block groups in the $i$th frame during the observation period and $X$ is an active block group. In other words, $G_l^i$ consists of all active block groups in the $i$th frame that overlap with $G_l^{i-1}$. If $G_l^i = \emptyset$, we let $G_l^i = G_l^{i-1}$, assuming that the corresponding object does not move or that there is little or no change in the intensity of the active block group; $G_6^3$ in Figure 9 corresponds to such a case.

In this way, we compute $G_l^i$ recursively for $i > 1$, and then obtain $T_l$ (a sequence of $G_l^i$) by accumulating the initial active block group and its succeeding active block groups through all frames in the observation period.

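Thereafter, in the last frame of the observation period, we calculate the occurrence probability $P_l$ for the active group train $T_l$, which is defined as follows:

\[ P_l = P\left( L_l = \mathrm{R} \mid G_l^1, G_l^2, \ldots, G_l^{\tau} \right) \tag{24} \]

where $L_l$ indicates the type of the active group train $T_l$ after the observation period and $\tau$ is the length of the observation period. That is, $P_l$ describes the probability that all candidate objects which correspond to the active group train $T_l$ are real objects. According to the Bayes rule, we have:

\[ P\left( L_l = \mathrm{R} \mid G_l^1, G_l^2, \ldots, G_l^{\tau} \right) = \frac{P\left( L_l = \mathrm{R}, G_l^1, G_l^2, \ldots, G_l^{\tau} \right)}{P\left( G_l^1, G_l^2, \ldots, G_l^{\tau} \right)} = \frac{\prod_{i=1}^{\tau} P\left( G_l^i \mid G_l^{i-1}, \ldots, G_l^1, L_l = \mathrm{R} \right) P\left( L_l = \mathrm{R} \right)}{P\left( G_l^1, G_l^2, \ldots, G_l^{\tau} \right)} \tag{25} \]

Suppose that the succeeding candidate object $G_l^i$ in the current frame depends only on $G_l^{i-1}$ in the previous frame. Then, we have

\[ P\left( G_l^i \mid G_l^{i-1}, \ldots, G_l^1, L_l = \mathrm{R} \right) = P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) \tag{26} \]

From (25) and (26), we have

\[ P\left( L_l = \mathrm{R} \mid G_l^1, G_l^2, \ldots, G_l^{\tau} \right) = \frac{P\left( L_l = \mathrm{R} \right) \prod_{i=1}^{\tau} P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right)}{P\left( G_l^1, G_l^2, \ldots, G_l^{\tau} \right)} \propto \prod_{i=1}^{\tau} P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) \tag{27} \]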

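Since $P(L_l = \mathrm{R})$ and $P(G_l^1, G_l^2, \ldots, G_l^{\tau})$ reflect the nature of the scene, that is, they are a priori probabilities, we only consider the conditional probability in (27). Accordingly, we judge that the active group train $T_l$ is a real object if the following condition is satisfied:

\[ \sum_{i=1}^{\tau} \ln P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) > \theta \tag{28} \]

where $\theta$ is the threshold of occurrence with $\theta < 0$, and $\tau$ is the length of the observation period. If Equation (28) does not hold, then the active group train $T_l$ is removed because it is regarded as a part of the background. If $G_l^i \neq \emptyset$, $P(G_l^i \mid G_l^{i-1}, L_l = \mathrm{R})$ can be calculated as follows:

\[ P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) = \frac{n\left( G_l^i \cap G_l^{i-1} \right)}{n\left( G_l^{i-1} \right)} \tag{29} \]

where $n(G_l^{i-1})$ denotes the number of macroblocks in the region of $G_l^{i-1}$. If $G_l^i = \emptyset$, we have

\[ P\left( G_l^i \mid G_l^{i-1}, L_l = \mathrm{R} \right) = \frac{c_l}{\tau} \tag{30} \]

where $c_l$ is the number of frames in which succeeding candidate objects for the active group train $T_l$ are found during the observation period.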
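The temporal filtering decision can be sketched as follows, under several assumptions: block groups are represented as sets of macroblock coordinates, the succeeding groups of a frame are merged into one region, the overlap ratio serves as the transition probability when a successor is found, and the fraction of frames with a successor is used otherwise. All names are hypothetical, and the observation length is taken as the number of frames following the initial group.

```python
import math
from typing import List, Set, Tuple

Block = Tuple[int, int]
Group = Set[Block]


def log_occurrence(initial: Group, frames: List[List[Group]]) -> float:
    """Sum of log transition probabilities of one active group train."""
    steps = []  # (previous region, current region, successor found?) per frame
    prev = set(initial)
    for groups in frames:
        # Union of the active groups overlapping the previous region.
        cur = set().union(*[g for g in groups if g & prev])
        found = bool(cur)
        if not found:
            cur = prev  # the object is assumed to stay where it was
        steps.append((prev, cur, found))
        prev = cur
    hits = sum(1 for _, _, found in steps if found)
    total = 0.0
    for before, after, found in steps:
        # Overlap ratio when a successor exists; otherwise the assumed hit ratio.
        p = len(after & before) / len(before) if found else hits / len(frames)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def is_real_object(initial: Group, frames: List[List[Group]],
                   threshold: float) -> bool:
    """Keep the train if its log occurrence exceeds the (negative) threshold."""
    return log_occurrence(initial, frames) > threshold
```

A train that is never matched in any frame receives a log occurrence of minus infinity and is therefore always classified as background.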

Once an active group train is regarded as the motion trajectory of a real object, object tracking is performed by searching, throughout the subsequent frames after the observation period, for the candidate objects that overlap with the corresponding real object group in the previous frame. In this case, the train becomes that of the real object and is extended towards the subsequent frames. Real object tracking is performed in the same way as candidate object tracking in Equation (23). If a real object does not have any succeeding candidate objects in a subsequent frame, it is assumed that the real object stays at its location without moving.

When we detect and track multiple objects with active block groups, we may encounter the train tangling problem, in which two or more trains are merged together (train merging), or one train is separated into two or more individual trains (train separation). Train merging occurs when one active block group overlaps with several candidate or real objects in the previous frame, as shown in Figure 10(a). For simplicity, we only consider the case in which two active group trains merge. Figure 10(b) shows train separation, where an active group train is divided into two active groups.

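Figure 10. Train tangling. (a) Train merging. (b) Train separation. [Diagram: in (a), the trains $T_{l_1}$ and $T_{l_2}$ overlap a single active block group $A$; in (b), the train $T_l$ overlaps two active block groups $A_1$ and $A_2$.]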



When two active group trains $T_{l_1}$ and $T_{l_2}$ overlap with a single active block group and their corresponding objects are both labeled as candidate objects ($L_{l_1} = \mathrm{C}$ and $L_{l_2} = \mathrm{C}$), one of the two trains is removed. In the case of one real object and one candidate object, the candidate object train is removed. When both trains are real objects, the overlapped active block group is split into two active block groups, one for each of the trains $T_{l_1}$ and $T_{l_2}$. That is, if both active group trains belong to real objects, the two objects are not merged, which means that two overlapped real objects are considered to move independently.

On the other hand, train separation occurs when several active block groups overlap with one candidate or one real object in the previous frame, as shown in Figure 10(b). If the active group train $T_l$ in Figure 10(b) was a candidate object in the previous frame, the two active block groups $A_1$ and $A_2$ overlapped with $T_l$ are merged into one candidate object. If the active group train $T_l$ is a real object, the two active block groups are considered independent objects: one is regarded as the real object corresponding to $T_l$ and the other as a new candidate object.

3.2.4 Region Prediction of Moving Objects in I-frames

Finally, the location and size of the real object are determined by a rectangle that encompasses the exterior of the active block group. We define the feature vector $f_{i,l}$ of a real object that corresponds to the train $T_l$ in the $i$th frame



Figure 11. Optimizing the feature vector of an object through background subtraction in an I-frame. (a) The background image. (b) The I-frame in the original sequence. (c) A partially decoded image from the H.264|AVC bitstream. (d) A background-subtracted image.

where $\hat{f}_{i,l}$ denotes the predicted object feature vector of $f_{i,l}$, and $N$ denotes the length of one GOP. The predicted location $\hat{p}_{i,l}$ is the same as the location $p_{i-1,l}$ in the previous P-frame. Likewise, the predicted height and width are determined by the respective maximums of the heights and widths in the P-frames between two consecutive I-frames (one GOP), which increases the possibility of encompassing the entire region of the real object. Then, the region estimated by the maximum height and width is partially decoded. The partially decoded regions in Figure 11(c) are subtracted from the initial background image in Figure 11(a). After subtraction, the final real object region is determined by the rectangle that most tightly encompasses the real object, as shown in Figure 11(d).
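The region prediction for an I-frame can be illustrated with a short sketch. It assumes that each feature vector is stored as a tuple (x, y, width, height) and that the list holds the P-frames of the preceding GOP in display order; the function name and the tuple layout are hypothetical.

```python
from typing import List, Tuple

Feature = Tuple[float, float, float, float]  # (x, y, width, height), assumed layout


def predict_iframe_region(p_frame_features: List[Feature]) -> Feature:
    """Predict the object region in an I-frame from the P-frames of the previous GOP."""
    # The location is taken from the last P-frame before the I-frame.
    x, y, _, _ = p_frame_features[-1]
    # The size is the maximum width and height observed over the P-frames.
    width = max(f[2] for f in p_frame_features)
    height = max(f[3] for f in p_frame_features)
    return (x, y, width, height)
```

Taking the maximum size deliberately overestimates the region so that the whole object is likely to fall inside the area that is partially decoded.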

3.2.5 Partial Decoding and Background Subtraction in I-frames

In the H.264|AVC baseline profile, I-frame decoding can be performed in units of either 16x16 macroblocks or 4x4 sub-macroblocks. More specifically, each unit block refers to the pixels of its neighboring blocks for spatial prediction. In order to partially decode a certain block in an I-frame, its neighboring blocks therefore need to be decoded in advance. In the worst case, when the block to be partially decoded is at the bottom-right corner, many blocks to its left and above must be decoded first, which increases the computational complexity of partial decoding. To avoid this problem, we substitute the reference pixels in the neighboring blocks with the corresponding pixels of the initial background, without actually decoding them. In this case, perfect reconstruction is not possible, so the blocks in each MB are reconstructed imperfectly. However, we observe that this approach is reasonable for surveillance environments with a fixed camera and no significant illumination change. The imperfect reconstruction problem can be further alleviated by comparing the difference between the partially decoded region and the initial background with a preset threshold, as indicated in Equation (33).



\[ S_l = \left\{ x \;\middle|\; \left| p_l(x) - p_B(x) \right| > \kappa,\; x \in D_l \right\} \tag{33} \]

where $S_l$ is the region of the real object found by comparing the difference between the pixels of the partially decoded region and the initial background pixels with a predefined threshold. $p_l(x)$ and $p_B(x)$ are the pixels belonging to the partially decoded region and the initial background, respectively. The threshold $\kappa$ is used to separate the foreground from the background, and $D_l$ denotes the partially decoded area. Then, the size $(w_{i,l}, h_{i,l})$ is refined as the width and height of the rectangle box which most tightly encloses $S_l$, and the location $p_{i,l} = (x_{i,l}, y_{i,l})$ is set to the center point of the rectangle box.
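The refinement step can be sketched as a thresholded difference followed by a tight bounding box. The sketch assumes that the partially decoded image and the background are available as two-dimensional lists of luma values and that the decoded area is an axis-aligned rectangle with inclusive bounds; the function name, the argument layout, and the threshold value in the usage note are hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]


def refine_region(
    decoded: Image,
    background: Image,
    area: Tuple[int, int, int, int],  # (x0, y0, x1, y1), inclusive bounds
    threshold: int,
) -> Optional[Tuple[Tuple[float, float], Tuple[int, int]]]:
    """Return ((center_x, center_y), (width, height)) of the foreground, or None."""
    x0, y0, x1, y1 = area
    # Foreground pixels: absolute difference to the background above the threshold.
    foreground = [
        (x, y)
        for y in range(y0, y1 + 1)
        for x in range(x0, x1 + 1)
        if abs(decoded[y][x] - background[y][x]) > threshold
    ]
    if not foreground:
        return None
    xs = [x for x, _ in foreground]
    ys = [y for _, y in foreground]
    size = (max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)
    center = ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)
    return center, size
```

With an all-zero 5x5 background and a bright patch covering columns 1 to 3 of rows 2 and 3, a threshold of 30 yields a box of width 3 and height 2 centered at (2.0, 2.5).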

3.2.6 Motion Interpolation in P-frames

Object tracking is performed by projecting the active block groups in the current P-frame onto the previous P-frame. In this case, it is observed that the sizes and locations of the real objects vary significantly over P-frames. In Figure 12, the sizes and locations of the regions ($R_2^t$, $R_3^t$, $R_4^t$) for a real object being tracked change over three P-frames. If we assume that an object moves slowly enough, with uniform motion between two successive I-frames, then the object feature vector (size and location) can be linearly interpolated so that the interpolated regions (shaded rectangles in the P-frames in Figure 12) become the final object regions in P-frames.



Figure 12. Motion interpolation. The dotted rectangle boxes are estimated simply by enclosing the active groups corresponding to the real object. These boxes are replaced by the rectangle boxes obtained through motion interpolation. [Diagram: a GOP with the frame sequence I, P, P, P, I; the regions $R_2^t$, $R_3^t$, and $R_4^t$ mark the object in the three P-frames.]

Therefore, the interpolation of the object feature vector in a P-frame can be computed as follows:

\[ f_{i+k,l} = f_{i,l} + \frac{k}{N} \left( f_{i+N,l} - f_{i,l} \right) \tag{34} \]

where $N$ is the length of a GOP and $k$ ($0 < k < N$) is the index of the P-frames. Note that as the length of one GOP gets longer, the updated feature vectors become less reliable because the linearity assumption of uniform motion no longer holds.
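The interpolation amounts to a linear blend between the feature vectors refined in two consecutive I-frames. The following minimal sketch assumes that a feature vector is a tuple of numbers (for example x, y, width, height); the function name is hypothetical.

```python
from typing import List, Tuple

Feature = Tuple[float, ...]


def interpolate_features(f_start: Feature, f_end: Feature,
                         gop_length: int) -> List[Feature]:
    """Linearly interpolate the feature vectors of the P-frames between two I-frames."""
    return [
        tuple(a + (k / gop_length) * (b - a) for a, b in zip(f_start, f_end))
        for k in range(1, gop_length)  # k indexes the P-frames, 0 < k < N
    ]
```

For a GOP of length 4, the call returns three feature vectors located at one quarter, one half, and three quarters of the way between the two I-frame vectors.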




IV Experiments

4.1 Semi-automatic Approach

To demonstrate the performance of the proposed semi-automatic method, the tracking results of various objects were extracted from videos such as “Stefan”, “Coastguard” and “Lovers” in CIF size. Each video was encoded with the GOP structure ‘IPPP’ in the baseline profile, in which each P-frame uses only its previous frame as a reference frame. Figure 13 shows the tracking results of a rigid object with slow motion in “Coastguard”. Four feature points were tracked well, with the form of the feature network remaining uniform. Figure 14 shows the tracking result of a deformable object with fast motion in “Stefan”. We can observe that tracking is successful even though the form of the feature network changes greatly due to fast three-dimensional motion.

Figure 15 presents the visual results of partial decoding in the P-frames of “Lovers” when the search half interval M and the neighborhood half interval W are set to 5 and 3, respectively. Only the neighborhood regions of the three feature points were partially decoded. Even over the 300 frames of the “Lovers” sequence, no tracking errors were found.




Figure 13. The object tracking in “Coastguard” with 100 frames.




Figure 14. The object tracking in “Stefan” with 100 frames.




Figure 15. The object tracking in “Lovers” with 300 frames. Partially decoded regions

are shown in “Lovers”.




Numerical tracking data from two video samples are shown in Figures 16 to 19. In Figure 17(a) and (b), the dissimilarity energies in “Coastguard” are lower than those in “Stefan”. We can see from this result that the variation of texture, form, and motion in “Coastguard” is smaller than in “Stefan”. Figure 19(a) and (b) show that the forward motion vectors in “Stefan” are less reliable than those in “Coastguard” due to Stefan's complex motion: the average reliability is 93.9% in “Coastguard”, compared with 81.7% in “Stefan”. As a matter of fact, the “Stefan” sequence contains a highly textured background (e.g. many spectators) as well as a fast-moving deformable object (e.g. the tennis player), which causes false and chaotic motion vectors during the motion estimation process of the encoder. Even in such an intricate sequence as “Stefan”, the tracking performance is satisfactory, as shown in Figure 14, since the three traits of the target object (texture, form, and motion) are jointly considered.

Through the neural network, the square error of the dissimilarity energy is minimized over a few frames, as shown in Figure 18(b). When the learning constant was equal to 5, this error was approximately zero after the 15th frame. Moreover, the weight factors converge to optimal values, as shown in Figure 18(a). We can observe that the weight factor variations and dissimilarity energies increase greatly from the 61st frame to the 66th frame in “Coastguard”; this illustrates that the weight factors are adaptively controlled when another ship is approaching.

When the JM10.2 reference software was used to read the H.264/AVC bitstream, the processing time including the partial decoding process in “Coastguard” was as shown in Figure 16(a); in particular, the processing time increases abruptly at every I-frame due to full decoding in I-frames. The average processing time was about 58.9 ms per frame (17 frames per second) on an Intel Pentium 4 CPU at 3.2 GHz with 1 GB of RAM. However, since most of the time (about 45 ms per frame) is spent in the partial decoding process of JM10.2, the processing time can be effectively reduced by using a decoder that is faster than JM10.2. As illustrated in Figure 16(b), when the partial decoding time, which depends on the capability of the decoder, is not considered, the processing time is 14.2 ms per frame (70.3 frames per second). Thus, the proposed algorithm is well suited to real-time applications.

Note that the computation time varies sharply with the size of the search range. For example, when the search half interval M is doubled (M = 10) with margin 10, the processing time increases nearly sevenfold (430 ms per frame), as shown in [16].

In conclusion, the experimental results demonstrate that the proposed semi-automatic method provides reliable performance even in scenes that include a complicated background or deformable objects. Moreover, its processing time remains low enough that the algorithm can be built into applications that are required to work fast, such as metadata authoring tools.



Figure 16. (a) The processing time which includes partial decoding in “Coastguard”, and (b) the processing time which does not include partial decoding.

Figure 17. (a) Dissimilarity energies in “Stefan” and (b) “Coastguard”.

Figure 18. (a) The variation of weight factors in “Coastguard”, and (b) the squared error of dissimilarity energy in “Stefan”.

Figure 19. (a) The average reliabilities of forward motion vectors in “Coastguard” and (b) in “Stefan”.




4.2 Automatic Approach

To test the proposed automatic method for object detection and tracking in the H.264|AVC bitstream domain, we used two sequences which were taken by one fixed camera in indoor and outdoor environments. While only one person walking in a corridor of a university building appears in the indoor sequence, three persons who enter the scene individually appear in the outdoor sequence, without visual occlusion between persons. In each sequence, there is no significant illumination change in the background. Each sequence, with a frame size of 320x240, was encoded at 30 frames per second by the JM 12.4 reference software with the GOP structure ‘IPPIP’ based on the H.264|AVC baseline profile. In particular, P-frames were set to have no intra-coded macroblocks. Also, the length of the observation period for temporal filtering was set to 8 frames.

To evaluate the performance of spatial filtering, we define the spatial filtering rate as the ratio of the number of filtered block groups to the total number of block groups. It represents the fraction of block groups that are filtered out in each frame. Figures 20(a) and 21(a) show the spatial filtering rates, averaged over every 60 frames. The average spatial filtering rate is 70.8% in the indoor sequence and 64.2% in the outdoor sequence, which means that most of the block groups are removed by the spatial filtering process.



Figure 20. The performance measurement of spatial filtering and temporal filtering in the indoor sequence. (a) The plot of spatial filtering rates, and (b) the temporal filtering results, in which one active group train becomes the real object. [Plots: (a) spatial filtering rates versus frames; (b) active group trains versus frames, with merged active trains and the real object marked.]



Figure 21. The performance measurement of spatial filtering and temporal filtering in the outdoor sequence. (a) The plot of spatial filtering rates, and (b) the temporal filtering results, in which three active group trains become the real objects. [Plots: (a) spatial filtering rates versus frames; (b) active group trains versus frames, with merged active trains and the real objects marked.]




Figures 20(b) and 21(b) illustrate the results of temporal filtering in the indoor and outdoor sequences. Among the active group trains that survive spatial filtering, one is identified as a real object in the indoor sequence and three are identified as real objects in the outdoor sequence. These results coincide exactly with the actual situations in the two sequences. Note that active group trains which are not real objects are not always removed by temporal filtering; sometimes they are merged into a neighboring real object. As a result, 96% of all active group trains are removed in both sequences.
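The decision made by temporal filtering can be pictured with a small sketch. It assumes that an active group train is kept as a real object only when it stays active for a full observation period (8 frames in these experiments). The persistence criterion, the data layout, and all names are assumptions made for illustration, and the merging of trains into a neighboring object is not modelled.

```python
from typing import Dict, List

OBSERVATION_PERIOD = 8  # frames, the value used in the experiments


def longest_active_run(activity: List[bool]) -> int:
    """Length of the longest run of consecutive active frames."""
    best = current = 0
    for active in activity:
        current = current + 1 if active else 0
        best = max(best, current)
    return best


def temporal_filter(trains: Dict[str, List[bool]],
                    period: int = OBSERVATION_PERIOD) -> List[str]:
    """Ids of trains that stay active for at least `period` consecutive frames."""
    return [tid for tid, activity in trains.items()
            if longest_active_run(activity) >= period]


# Hypothetical per-frame activity flags for three candidate trains.
trains = {
    "person": [True] * 12,
    "flicker": [True, False, True, False] * 3,
    "shadow": [True] * 5 + [False] * 7,
}
real_objects = temporal_filter(trains)
print(real_objects)
```

Short-lived candidates such as the hypothetical "flicker" and "shadow" trains never reach the observation period and are dropped.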

Then, to obtain more precise object regions, we used background subtraction and motion interpolation as explained before. Figure 11 shows the three steps of background subtraction in I-frames: partial decoding, foreground extraction, and optimization of object location and size. As shown in Figure 25, partial decoding in I-frames is significantly faster than full decoding. When only the object regions were partially decoded in the indoor sequence, the frame rate was 49.5 frames per second, whereas it was 20.46 frames per second in full decoding mode. That is, the computational cost is greatly reduced in partial decoding mode compared with full decoding mode.
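The size of this gain follows directly from the two reported frame rates. The short calculation below uses only those two figures; the variable names are illustrative.

```python
# Frame rates reported for the indoor sequence (frames per second).
partial_fps = 49.5
full_fps = 20.46

# Speed-up of partial decoding relative to full decoding.
speedup = partial_fps / full_fps

# Fraction of per-frame time saved: time per frame is the inverse of the rate.
time_saved = 1.0 - full_fps / partial_fps

print(f"speed-up: {speedup:.2f}x")              # speed-up: 2.42x
print(f"per-frame time saved: {time_saved:.1%}")  # per-frame time saved: 58.7%
```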

It can be observed in Figure 22(a) and (b) that, before the ROI refinement, some parts of the real object are not included in the rectangular boxes. After background subtraction in I-frames and motion interpolation in P-frames, the refined ROIs in the rectangular boxes enclose the whole of the real object, as shown in Figure 22(c) and (d).

Figure 22. The effect of motion interpolation on the correction of the object trajectory. (a)-(b) Object locations and sizes in one GOP before motion interpolation, and (c)-(d) after motion interpolation.
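One way to picture the refinement in P-frames is as an interpolation of the ROI between the two I-frames that bound a GOP. The sketch below assumes plain linear interpolation of the box position and size. This is an illustrative simplification and not necessarily the interpolation rule used in this thesis; the `Box` type and the sample coordinates are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    x: float  # left edge in pixels
    y: float  # top edge in pixels
    w: float  # width in pixels
    h: float  # height in pixels


def interpolate_rois(first: Box, last: Box, num_p_frames: int) -> List[Box]:
    """Linearly interpolate ROIs for the P-frames between two I-frame ROIs."""
    if num_p_frames < 0:
        raise ValueError("num_p_frames must be non-negative")
    steps = num_p_frames + 1
    rois = []
    for k in range(1, steps):
        rois.append(Box(
            x=first.x + (last.x - first.x) * k / steps,
            y=first.y + (last.y - first.y) * k / steps,
            w=first.w + (last.w - first.w) * k / steps,
            h=first.h + (last.h - first.h) * k / steps,
        ))
    return rois


# Hypothetical refined ROIs in two consecutive I-frames with two P-frames between.
roi_first_i = Box(100, 80, 40, 90)
roi_next_i = Box(112, 80, 46, 99)
p_frame_rois = interpolate_rois(roi_first_i, roi_next_i, num_p_frames=2)
for roi in p_frame_rois:
    print(roi)
```

Because both position and size are interpolated, the box can follow an object that grows or shrinks as it moves along the viewing direction of the camera.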

Figures 23 and 24 show the object detection and tracking results for the indoor sequence with a single moving object and the outdoor sequence with three moving objects, respectively. The proposed method of tracking moving objects in the H.264|AVC bitstream domain exhibits satisfactory performance over the 720 frames of the indoor sequence and the 990 frames of the outdoor sequence.

It can be seen in Figure 23 that when the object moves along the viewing direction of the camera, detection and tracking remain reliable even though the apparent size of the object changes as it moves. Moreover, although the parts of the object (head, arms, and legs) move differently, the rectangular box always encloses the whole body precisely. Likewise, even in the outdoor sequence, which contains multiple objects (persons), the proposed method works very well, as shown in Figure 24. Although the three persons move in different directions, the proposed algorithm detects and tracks each of them separately.

The computation of object detection and tracking, including the refinement of the resulting ROIs, involves three processes: (1) partial decoding in I-frames; (2) the extraction of MB types and IT coefficients in P-frames; and (3) object detection and tracking. It does not include the loading time of the H.264|AVC bitstream, since that mostly depends on the performance of the decoder used. The processing times for the two sequences are shown in Figure 25. On a PC with a 3.2 GHz Pentium 4 CPU and 1 GB of RAM, processing took 20.2 milliseconds per frame (49.5 frames per second) for the indoor sequence and 26.9 milliseconds per frame (37.12 frames per second) for the outdoor sequence. The proposed algorithm is therefore fast enough to be applied to real-time surveillance systems.
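Whether these rates suffice for real-time use can be checked against the 30 frames per second at which the test sequences were encoded. The sketch below uses only the reported frame rates; the helper name is illustrative.

```python
SOURCE_FPS = 30.0  # encoding frame rate of both test sequences

# Processing rates reported for the two sequences (frames per second).
processing_fps = {"indoor": 49.5, "outdoor": 37.12}


def realtime_margin(fps: float, source_fps: float = SOURCE_FPS) -> float:
    """Ratio of processing rate to source rate; above 1.0 is faster than real time."""
    return fps / source_fps


margins = {name: realtime_margin(fps) for name, fps in processing_fps.items()}
for name, margin in margins.items():
    print(f"{name}: {margin:.2f}x real time")
```

Both margins exceed 1.0, so each sequence is processed faster than its frames arrive.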



[Figure 23: image panels (a) to (l)]

Figure 23. The object detection and tracking results for the indoor sequence with a single moving object.



[Figure 24: image panels]

Figure 24. The object detection and tracking results for the outdoor sequence with three moving objects.




V Conclusions and Future Works

Moving object detection and tracking techniques have recently become a necessary component of intelligent visual systems such as surveillance and interactive broadcasting. In this thesis, two methods for moving object detection and tracking with partial decoding in the H.264|AVC bitstream domain are proposed: a semi-automatic method and an automatic method. The semi-automatic method exploits a dissimilarity minimization algorithm, which tracks feature points adaptively according to properties such as texture, form, and motion. The automatic method, which is intended for surveillance with fixed cameras, exploits several techniques such as spatial and temporal macroblock filtering, background subtraction, and motion interpolation. In particular, it can detect and track multiple objects at the same time. While the former method utilizes only motion vectors, the latter makes use of integer transform (IT) coefficients and macroblock types to detect and track multiple moving objects. Unlike traditional compressed domain algorithms, the proposed methods reflect the distinctive features of the encoded information in H.264|AVC bitstreams.

The main contribution of the proposed methods is that they not only have computational complexity low enough for real-time operation, but also perform well in a wider variety of situations than those considered by traditional algorithms. Specifically, the proposed semi-automatic method successfully tracks a predefined target object that is deformable or moves fast against a complicatedly textured background. The proposed automatic method, on the other hand, is able to detect and track objects that move along the viewing direction of the camera, either toward it or away from it.

It should be noted that the proposed methods include a partial decoding process to obtain detailed texture information from H.264|AVC bitstreams. Unlike in traditional compressed domain algorithms, this partial decoding process makes color extraction and object recognition possible. In MPEG-7 metadata authoring tools, for example, the partially decoded color information of each object can be converted into MPEG-7 metadata for interactive broadcasting services. Likewise, the color information can be used to distinguish one object from others by applying object recognition techniques in the pixel domain. Consequently, the proposed methods make it possible to combine compressed domain techniques with pixel domain techniques for higher performance and extended functions.
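As one illustration of how partially decoded color could help tell objects apart, the sketch below compares coarse color histograms by histogram intersection. The choice of histogram intersection, the bin count, and the synthetic pixel values are assumptions made for this example; no particular recognition technique is prescribed here.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # 8-bit (R, G, B) values


def color_histogram(pixels: Sequence[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins_per_channel**3 bins."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    if bins_per_channel <= 0 or 256 % bins_per_channel != 0:
        raise ValueError("bins_per_channel must divide 256")
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        index = ((r // step) * bins_per_channel + g // step) * bins_per_channel + b // step
        hist[index] += 1.0
    total = float(len(pixels))
    return [count / total for count in hist]


def histogram_intersection(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Similarity in [0, 1]; 1.0 means identical normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Synthetic object regions (hypothetical pixel values, not decoded data).
red_object = [(200, 30, 30)] * 50 + [(60, 60, 60)] * 10
red_object_later = [(210, 40, 35)] * 45 + [(50, 50, 50)] * 15
blue_object = [(30, 30, 200)] * 50 + [(60, 60, 60)] * 10

reference = color_histogram(red_object)
sim_same = histogram_intersection(reference, color_histogram(red_object_later))
sim_other = histogram_intersection(reference, color_histogram(blue_object))
print(f"same object: {sim_same:.2f}, other object: {sim_other:.2f}")
```

A higher intersection value suggests that two regions belong to the same object, so the score can serve as a cue when several tracked objects must be kept apart.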

In future work, the proposed methods need to be extended beyond the Baseline profile to other profiles of the H.264|AVC standard, such as the Main profile; in other words, B-frames should also be handled for object detection and tracking, along with I-frames and P-frames. Novel compressed domain techniques that deal with specific situations such as illumination change, occlusion, and silhouette also need to be developed for better performance. In addition, the proposed methods can evolve into an object segmentation technique based on partial decoding in



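Korean Abstract

Master's Thesis

A Study on Moving Object Detection and Tracking with Partial Decoding in H.264|AVC Bitstream Domain

School of Engineering, Wonsang You

Moving object recognition and tracking is an important function of intelligent video systems and has been studied mainly in the pixel domain using computer vision techniques. In most video systems, object recognition and tracking must run in real time, but pixel domain approaches are difficult to apply directly because they require a considerable amount of computation. Recent advances in hardware and software have eased this difficulty considerably, yet large-scale video systems such as multiple distributed surveillance systems still struggle to process large volumes of video data with limited resources.

Meanwhile, most commonly used video data is transmitted in a compressed form such as MPEG, and such compressed video contains information that can be used for object recognition and tracking, such as motion vectors and residual signals. Accordingly, compressed domain approaches, which speed up processing by performing object recognition and tracking directly on the compressed data without fully decoding the video, have recently been studied.

In this thesis, we propose two new algorithms that perform fast yet accurate object detection and tracking using H.264|AVC compressed video data and partially decoded pixel information. The first is a semi-automatic algorithm, a dissimilarity minimization algorithm that adaptively tracks the feature points of a user-selected object according to their texture, form, and motion properties. The second is an automatic algorithm that uses spatio-temporal macroblock filtering, background subtraction, and motion interpolation; it is applicable when the camera is stationary and can track multiple objects simultaneously.

The merit of the proposed algorithms is that they maintain a processing speed fast enough for real-time operation while recognizing and tracking objects in more complex videos than those handled by existing compressed domain algorithms. For example, whereas existing compressed domain algorithms do not address the tracking of objects moving along the viewing direction of the camera, the proposed automatic algorithm can track such motion relatively accurately. The proposed semi-automatic algorithm, in turn, can track objects that deform or move fast against a complex background.

In particular, the proposed algorithms can extract the color information of objects by means of partial decoding. This color information can serve as metadata for interactive broadcasting or as a clue for distinguishing multiple objects. The proposed algorithms are thus a new type of algorithm that combines the strengths of compressed domain and pixel domain techniques through partial decoding, and they have the potential to develop further functions, such as object recognition, that existing compressed domain algorithms cannot offer.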



