

Video-Based Human Movement Analysis and Its Application to Surveillance Systems

Jun-Wei Hsieh, Member, IEEE, Yung-Tai Hsu, Hong-Yuan Mark Liao, Senior Member, IEEE, and Chih-Chiang Chen

Manuscript received July 12, 2006; revised August 4, 2007. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grant NSC95-2221-E-155-071 and by the Ministry of Economic Affairs under Contract 94-EC-17-A-02-S1-032. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Kiyoharu Aizawa.

J.-W. Hsieh, Y.-T. Hsu, and C.-C. Chen are with the Department of Electrical Engineering, Yuan Ze University, Chung-Li 320, Taiwan, R.O.C. (e-mail: [email protected]; [email protected]; [email protected]).

H.-Y. M. Liao is with the Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2008.917403

Abstract: This paper presents a novel posture classification system that analyzes human movements directly from video sequences. In the system, each sequence of movements is converted into a posture sequence. To better characterize a posture in a sequence, we triangulate it into triangular meshes, from which we extract two features: the skeleton feature and the centroid context feature. The first feature is used as a coarse representation of the subject, while the second is used to derive a finer description. We adopt a depth-first search (DFS) scheme to extract the skeletal features of a posture from the triangulation result. The proposed skeleton feature extraction scheme is more robust and efficient than conventional silhouette-based approaches. The skeletal features extracted in the first stage are used to extract the centroid context feature, which is a finer representation that can characterize the shape of a whole body or of body parts. The two descriptors working together make human movement analysis a very efficient and accurate process because they generate a set of key postures from a movement sequence. The ordered key posture sequence is represented by a symbol string. Matching two arbitrary action sequences then becomes a symbol string matching problem. Our experiment results demonstrate that the proposed method is a robust, accurate, and powerful tool for human movement analysis.

Index Terms: Behavior analysis, event detection, string matching, triangulation.

I. INTRODUCTION

THE analysis of human body movements can be applied in a variety of application domains, such as video surveillance, video retrieval, human-computer interaction systems, and medical diagnosis. In some cases, the results of such analysis can be used to identify people acting suspiciously and other unusual events directly from videos. Many approaches have been proposed for video-based human movement analysis [1]-[6], [9], [10]. For example, Oliver et al. [9] developed a visual surveillance system that models and recognizes human behavior using hidden Markov models (HMMs) and a trajectory feature. Rosales and Sclaroff [10] proposed a trajectory-based recognition system to detect pedestrians in outdoor environments and recognize their activities from multiple views based on a mixture of Gaussian classifiers. Other approaches extract human postures or body parts (such as the head, hands, torso, or feet) to analyze human behavior [5], [6]. Mikic et al. [5] used complex 3-D models and multiple video cameras to extract 3-D information for 3-D posture analysis, while Werghi [6] extracted 3-D information and applied wavelet transforms to analyze 3-D human postures. Although 3-D features are more informative than 2-D features (like contours or silhouettes) for classifying human postures, the inherent correspondence problem and high computational cost make this approach inappropriate for real-time applications. As a result, a number of researchers have based their analyses of human behavior on 2-D postures. In [2], Cucchiara et al. proposed a probabilistic posture classification scheme to identify several types of movement, such as walking, running, squatting, or sitting. Ivanov and Bobick [11] presented a 2-D posture classification system that uses HMMs to recognize human gestures and behaviors. In [13], Wren et al. proposed the Pfinder system for tracking and recognizing human behavior based on a 2-D blob model. The drawback of this approach is that incorporating 2-D posture models into the analysis of human behavior raises the possibility of ambiguity between the adopted models and real human behavior due to mutual occlusions between body parts and loose clothing. In [17] and [18], Guo et al. used a stick model that represents the parts of a human body as sticks and links the sticks with joints for human motion tracking. Then, in [19], Ben-Arie et al. used a voting technique to recover the parameters of the stick model from video sequences for action analysis. In addition to the stick model, the cardboard model [20] is widely recognized as a good way to model articulated human motion. However, the prerequisite that body parts must be well segmented makes this model inappropriate for real-time analysis of human behavior.

To solve the problem with the cardboard model, Park and Aggarwal [15] used a dynamic Bayesian network to segment a body into different parts [16]. This blob-based approach is very promising for analyzing human behavior at the semantic level, but it is very sensitive to changes in illumination. In addition to blobs, several researchers have proposed methods for classifying human postures based on silhouettes. Weik and Liedtke [12] traced the negative minimum curvatures along body contours to segment body parts and then identified body postures using a modified Iterative Closest Point (ICP) algorithm. Fujiyoshi et al. [8] presented a skeleton-based method that recognizes postures by extracting skeletal features based on curvature changes along the human silhouette. In addition, Chang and Huang [7] used different morphological operations to extract skeletal features from postures and then identified movements using an HMM framework. Although the contour-based approach is simple and efficient for coarse classification of human postures, it is vulnerable to the effects of noise, imperfect contours, and occlusions. Another approach used to analyze human behavior is the Gaussian probabilistic model. In [2], Cucchiara et al. used a probabilistic projection map to model postures and performed frame-by-frame posture classification to recognize human behavior. The method uses the concept of a state-transition graph to integrate temporal information about postures in order to deal with the occlusion problem. However, the projection histogram is not robust because it changes dramatically under different lighting conditions.

Fig. 1. Flowchart of the proposed system.

In this paper, we propose a novel triangulation-based system that analyzes human behavior directly from videos. The detailed components of the system are shown in Fig. 1. First, we apply background subtraction to extract body postures from video sequences and then derive their boundaries by contour tracing. Next, a Delaunay triangulation technique [14] is used to divide a posture into different triangular meshes. From the triangulation result, two important features, the skeleton and the centroid context (CC), are extracted for posture recognition. The skeleton feature is used for a coarse search, while the centroid context, being a finer feature, is used to classify postures more accurately. To extract the skeleton feature, we propose a graph search method that builds a spanning tree from the triangulation result. The spanning tree corresponds to the skeletal structure of the analyzed body posture. The proposed skeleton extraction method is much simpler than silhouette-based techniques [8], [13]. In addition to the skeletal structure, the tree provides important information for segmenting a posture into different body parts. Based on the result of body part segmentation, we construct a new posture descriptor, namely, the centroid context descriptor, for recognizing postures. This descriptor utilizes a polar labeling scheme to label every triangular mesh with a unique number. Then, for each body part, a feature vector, i.e., the centroid context, is constructed by recording all related features of the centroid of each triangular mesh according to this unique number. We can then compare different postures more accurately by measuring the distance between their centroid contexts. Each posture is assigned a semantic symbol so that each human movement can be converted into and represented by a set of symbols. Based on this representation, we use a new string matching scheme to recognize different human movements directly from videos. The string-based method modifies the calculation of the edit distance by using different weights to measure the operations of insertion, deletion, and replacement. Because of this modification, even if two movements differ by large temporal scaling, their edit distance remains small. The slow increase in the edit distance substantially reduces the effect of the time warping problem. The experiment results demonstrate the feasibility and superiority of the proposed approach for analyzing human behavior.

Fig. 2. Sampling of control points. (a) A point with a high curvature. (b) Two points with high curvatures that are too close to each other.

The remainder of the paper is organized as follows. The Delaunay triangulation technique is introduced in Section II. In Section III, we describe the posture classification scheme based on the skeletal feature. In Section IV, the centroid context for posture classification is discussed. The key posture selection and string matching techniques are detailed in Section V. The experiment results are given in Section VI. We then present our conclusions in Section VII.

II. DEFORMABLE TRIANGULATIONS

In this paper, it is assumed that all video sequences are captured by a stationary camera. When the camera is static, the background of the captured video sequence can be reconstructed using a mixture of Gaussian models [24]. This allows us to detect foreground objects by subtracting the background from each frame. We then apply some simple morphological operations to remove noise. After these pre-processing steps, we partition a foreground human posture into triangular meshes using the constrained Delaunay triangulation technique and extract both the skeletal features and the centroid context directly from the triangulation result [25].
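For the background-modeling step, the following Python sketch (based on OpenCV, whose MOG2 subtractor is a mixture-of-Gaussians model in the spirit of [24]) illustrates one plausible realization; the file name and parameter values are our own illustrative choices, not the authors' settings.

import cv2

# Mixture-of-Gaussians background model in the spirit of [24]; parameters illustrative.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

cap = cv2.VideoCapture('surveillance.avi')  # hypothetical input clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)          # foreground mask (255 = foreground)
    mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]  # drop shadow pixels
    # Simple morphological operations to remove noise, as in the pre-processing step.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
cap.release()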

Assume that P is a posture extracted in binary form by image subtraction. To triangulate P, a set of control points must be extracted along its contour in advance. Let Γ be the set of boundary points along the contour of P. We extract high-curvature points from Γ as the set of control points. Let θ_p be the angle of a point p in Γ. As shown in Fig. 2(a), the angle can be determined by two specified points, p+ and p-, selected from both sides of p along Γ to satisfy

d_min ≤ |p - p+| ≤ d_max  and  d_min ≤ |p - p-| ≤ d_max,    (1)

where d_min and d_max are two thresholds defined in terms of L_Γ, the total length of the contour of P. With p+ and p-, the angle θ_p can be determined by

θ_p = arccos( (p+ - p)·(p- - p) / ( |p+ - p| |p- - p| ) ).    (2)

If θ_p is less than a threshold (here we set it at 150°), p is selected as a control point. In addition to (2), it is required that two control points be far apart. This enforces the constraint that the distance between any two control points must be larger than the threshold d_min defined in (1). As shown in Fig. 2(b), if two candidates p and q are close to each other, i.e., |p - q| ≤ d_min, the candidate with the smaller angle is chosen as a control point.

Fig. 3. All the vertices are labeled anticlockwise such that the interior of V is located on their left.
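To make the selection rule concrete, the following Python sketch implements the angle test of (2) and the minimum-distance constraint; the contour is assumed to be an ordered, closed list of (x, y) points, the fixed arc-length offset used to pick p+ and p- is our simplification of (1), and all function and variable names are ours rather than the paper's.

import numpy as np

def select_control_points(contour, d_min, d_max, angle_thresh_deg=150.0):
    # contour: ordered (x, y) points along a closed boundary.
    n = len(contour)
    pts = np.asarray(contour, dtype=float)
    offset = max(1, int(round((d_min + d_max) / 2)))  # arc-length step to p+ / p- (assumed)
    candidates = []
    for i in range(n):
        p = pts[i]
        p_plus = pts[(i + offset) % n]    # point ahead of p along the contour
        p_minus = pts[(i - offset) % n]   # point behind p along the contour
        v1, v2 = p_plus - p, p_minus - p
        cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))  # Eq. (2)
        if angle < angle_thresh_deg:      # sharp corner: keep as candidate
            candidates.append((i, angle))
    # Spacing constraint: of two nearby candidates, keep the one with the smaller angle.
    candidates.sort(key=lambda c: c[1])   # sharpest candidates first
    selected = []
    for i, angle in candidates:
        if all(min(abs(i - j), n - abs(i - j)) > d_min for j, _ in selected):
            selected.append((i, angle))
    selected.sort()
    return [tuple(pts[i]) for i, _ in selected]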

Assume that V is the set of control points extracted along the boundary of P. Each point on V is labeled anticlockwise and indexed modulo the size of V. If any two adjacent points on V are connected by an edge, V can be regarded as a planar graph, i.e., a polygon. We adopt the constrained Delaunay triangulation technique proposed by Chew [26] to divide V into triangular meshes.

As shown in Fig. 3, the vertices of V are ordered anticlockwise so that the interior of the polygon lies on their left. For a triangulation T of V, T is said to be a constrained Delaunay triangulation of V if a) each edge of V is an edge of T and b) for each remaining edge e of T, there exists a circle C such that 1) the endpoints of e are on the boundary of C, and 2) if a vertex v of V is inside C, then it cannot be seen from at least one of the endpoints of e. More precisely, given three vertices v_i, v_j, and v_k of V, the triangle ∆(v_i, v_j, v_k) belongs to the constrained Delaunay triangulation if and only if

(i) v_i, v_j, and v_k are mutually visible, i.e., each pair can be joined by a segment that does not cross any edge of V;

(ii) no vertex of V visible from the interior of ∆(v_i, v_j, v_k) lies inside C, where C is the circumcircle of v_i, v_j, and v_k. That is, the interior of C does not contain a visible vertex of V.

Based on the above definition, Chew [26] proposed a divide-and-conquer algorithm that computes a constrained Delaunay triangulation of V in O(n log n) time. The algorithm works recursively. When V contains only three vertices, V itself is the triangulation result. However, when V contains more than three vertices, we choose an edge from V and search for the corresponding third vertex that satisfies properties (i) and (ii). Then, we subdivide V into two sub-polygons, V1 and V2. The same procedure is applied recursively to V1 and V2 until the processed polygon contains only one triangle. In summary, the four steps of the algorithm are as follows:

1. Choose a starting edge e = (v_i, v_j) of V.

2. Find the third vertex v_k for e that satisfies properties (i) and (ii).

3. Subdivide V into two sub-polygons, V1 and V2, separated by the triangle ∆(v_i, v_j, v_k).

4. Repeat Steps 1-3 on V1 and V2 until the processed polygon consists of only one triangle.

Details of this algorithm can be found in Bern and Eppstein [23]. Fig. 4 shows an example of the triangulation result when the algorithm is applied to a human posture.
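As a sketch of the recursion only (the full algorithm with its geometric predicates is given in [26] and [23]), the steps above might be coded as follows in Python; is_cdt_vertex, which would test properties (i) and (ii), is an assumed helper, not a real library call.

def cdt(polygon):
    # polygon: vertices in anticlockwise order; returns a list of triangles.
    if len(polygon) == 3:
        return [tuple(polygon)]                  # base case: one triangle
    vi, vj = polygon[0], polygon[1]              # Step 1: choose a starting edge
    k = next(k for k in range(2, len(polygon))
             if is_cdt_vertex(polygon, vi, vj, polygon[k]))   # Step 2
    # Step 3: split along triangle (vi, vj, vk) into two sub-polygons.
    v1 = polygon[1:k + 1]                        # vj ... vk
    v2 = polygon[k:] + polygon[:1]               # vk ... v(n-1), vi
    triangles = [(vi, vj, polygon[k])]
    for sub in (v1, v2):                         # Step 4: recurse on both parts
        if len(sub) >= 3:
            triangles += cdt(sub)
    return triangles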

Fig. 4. Triangulation result of a body posture. (a) Input posture. (b) Triangulation result of (a).

Fig. 5. Flowchart of posture classification using skeletons.

III. SKELETON-BASED POSTURE RECOGNITION

In this work, we extract the skeleton and the centroid context as features from the triangulation result. Conventionally, the extracted skeleton is based on the contour of an object. For example, in [8], Fujiyoshi et al. extract feature points with minimum curvatures from the silhouette of a human body to define the skeleton. However, as that method is vulnerable to noise, we use a graph search scheme to solve the problem.

A. Triangulation-Based Skeleton Extraction

In Section II, we presented a technique for triangulating a human body image into triangular meshes. By connecting the centroids of every pair of connected meshes, a graph can be formed. In this section, we use a depth-first search technique to find the skeleton that will be used for posture recognition. Fig. 5 shows the block diagram of posture classification using skeletons. In what follows, the details of each block are described.

Assume that P is a binary posture. Using the above technique, P is decomposed into a set of triangular meshes T, i.e., T = {t_1, t_2, ..., t_K}. Each triangular mesh t_i in T has a centroid c_i, and two meshes, t_i and t_j, are connected if they share one common edge. Then, based on this connectivity, T can be converted into an undirected graph G, where all centroids c_i in T are nodes of G, and an edge exists between c_i and c_j if t_i and t_j are connected. The degree of a node on the graph is defined by the number of edges connected to it. Then, we perform a graph search on G to extract the skeleton of P.

To derive a skeleton based on a graph search, we seek a node v_head whose degree is one and whose position is the highest among all the nodes of G; v_head is defined as the head of P. Then, we perform a depth-first search [21] from v_head to construct a spanning tree T_P in which all leaf nodes correspond to different limbs of P. The branch nodes (whose degree is three on G) are the key points used to decompose P into different body parts, such as the hands, feet, or torso. Let c_P be the centroid of P, and let Ω be the union of v_head, the branch nodes, the leaf nodes, and c_P. The skeleton S_P of P can be extracted by directly linking any two nodes in Ω if they are connected through other nodes (i.e., if a path exists between the two nodes [21]). Next, we summarize the algorithm for skeleton extraction. The time complexity of this algorithm is O(K), where K is the number of triangle meshes.

Fig. 8. Polar transform of a human posture.

Fig. 9. Body component extraction. (a) Triangulation result of a posture; (b) the spanning tree of (a); and (c) centroids derived from different parts (determined by removing all branch nodes).

Then, similar to the technique used in shape context [27], we project a sample onto a polar coordinate system and label each mesh. Fig. 8 shows a polar transform of a human posture. We use n_r to represent the number of shells used to quantize the radial axis and n_s to represent the number of sectors that we want to quantize in each shell. Therefore, the total number of bins used to construct the centroid context is B = n_r × n_s. For the centroid c_i of the ith triangular mesh of a posture, we construct a vector histogram h_i = (h_i(1), ..., h_i(B)), in which h_i(k) is the number of triangular mesh centroids in the kth bin when c_i is considered as the origin, i.e.,

h_i(k) = #{ c_j | c_j ≠ c_i and c_j ∈ bin_k(c_i) },    (6)

where bin_k(c_i) is the kth bin of the log-polar coordinate system centered at c_i. Then, given two histograms, h_i and h_j, the distance between them can be measured by a normalized intersection [31]

d(h_i, h_j) = 1 - (1/K) Σ_{k=1}^{B} min( h_i(k), h_j(k) ),    (7)

where B is the number of bins and K denotes the number of meshes calculated from a posture. Using (6) and (7), we can define a centroid context to describe the characteristics of an arbitrary posture P.
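Under our reading of (6) and (7), the histogram and its distance might be sketched as follows in Python; the log-spaced shell radii (out to the farthest centroid) and the names are our assumptions.

import numpy as np

def centroid_histogram(origin, centroids, n_shells=4, n_sectors=30):
    # Eq. (6): count mesh centroids falling in each log-polar bin around `origin`.
    origin = np.asarray(origin, float)
    pts = np.asarray([c for c in centroids if not np.allclose(c, origin)], float)
    d = pts - origin
    r = np.linalg.norm(d, axis=1)
    theta = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2 * np.pi)
    # Log-spaced shell boundaries out to the farthest centroid (assumed design).
    edges = np.logspace(0, np.log10(r.max() + 1.0), n_shells + 1) - 1.0
    shell = np.clip(np.searchsorted(edges, r, side='right') - 1, 0, n_shells - 1)
    sector = np.minimum((theta / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)
    hist = np.zeros(n_shells * n_sectors)
    np.add.at(hist, shell * n_sectors + sector, 1)   # one count per mesh centroid
    return hist

def hist_distance(h1, h2, n_meshes):
    # Eq. (7): one minus the histogram intersection, normalized by the mesh count K.
    return 1.0 - np.sum(np.minimum(h1, h2)) / float(n_meshes)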

In the previous section, we presented a tree search algorithm that finds a spanning tree T_P from a posture P based on the triangulation result. As shown in Fig. 9, (b) is the spanning tree derived from (a). The tree captures the skeleton feature of P. Here, we call a node a branch node if it has more than one child. By this definition, there are three branch nodes in Fig. 9(b). If we remove all the branch nodes from T_P, it will be decomposed into different branch paths. Then, by collecting the set of triangular meshes along each path, we find that each path corresponds to one body component of P. For example, in Fig. 9(b), if we remove the first branch node from T_P, two branch paths will be formed; the first corresponds to the head and neck of P, and the second corresponds to the left hand of P. In some cases, a path may not correspond exactly to a high-level semantic body component. However, if the length of a path is further constrained, over-segmentation can easily be avoided. Since the segmentation of body parts is an ill-posed problem, we do not propose a complete system for segmenting a human posture into high-level semantic body parts. Instead, we use the current decomposition result to recognize postures up to the middle level (using components similar to body parts, but not real ones).

Fig. 10. Multiple centroid contexts using different numbers of sectors and shells: (a) 4 shells and 15 sectors, (b) 8 shells and 30 sectors, and (c) centroid histogram of the gravity center of the posture.

Given a path, we collect the set of triangular meshes along it. Let c be the centroid of the triangular mesh closest to the center of this set of meshes. As shown in Fig. 9(c), each such c is the centroid extracted from one branch path. Given a centroid c, we can obtain its corresponding histogram h_c using (6). Assume that the set of these path centroids is S_P. Then, based on S_P, the centroid context of P is defined as follows:

CC(P) = { h_c | c ∈ S_P },    (8)

where |S_P| is the number of elements in S_P. Fig. 10 shows two cases of multiple centroid contexts in which the numbers of shells and sectors are set to (4, 15) and (8, 30), respectively. The centroid contexts are extracted from the head and the center of the posture, respectively, and Fig. 10(c) is the centroid histogram of the gravity center of the posture. Given two postures, A and B, the distance between their centroid contexts is measured by

d_CC(A, B) = Σ_(i,j) w_i^A w_j^B d(h_i^A, h_j^B),    (9)

where w_i^A and w_j^B are the area ratios of the ith and jth body parts residing in A and B, respectively, and the summation runs over corresponding pairs of body parts. Based on (9), an arbitrary pair of postures can be compared. We now summarize the algorithm for finding the centroid context of a posture P.


Algorithm for Centroid Context Extraction:

Input: the spanning tree T_P of a posture P.

Output: the centroid context CC(P) of P.

Step 1: Recursively trace T_P using a depth-first search until the tree is empty. If a branch node (a node with more than one child) is found, collect all the visited nodes into a new path and remove these nodes from T_P.

Step 2: If a new path contains only two nodes, eliminate it; otherwise, find its path centroid c.

Step 3: Use (6) to find the centroid histogram h_c of each path centroid c.

Step 4: Collect all the histograms h_c as the centroid context CC(P) of P.

The time complexity of centroid context extraction is O(vK), where K is the number of triangle meshes in P and v is the number of extracted branch paths (v < K).
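Read together with the earlier sketches, the four steps might be implemented as below; the recursive path-splitting logic and the reuse of centroid_histogram from the previous sketch are our assumptions, not the authors' code.

import numpy as np

def extract_centroid_context(tree, head, centroids):
    # tree: child lists of the spanning tree; centroids: mesh centroids by index.
    paths, current = [], []

    def dfs(node):
        nonlocal current
        current.append(node)
        children = tree.get(node, [])
        if len(children) > 1:            # Step 1: a branch node closes the current path
            paths.append(current)
            current = []
        for child in children:
            dfs(child)
        if not children and current:     # a leaf also ends the current path
            paths.append(current)
            current = []

    dfs(head)
    context = []
    for path in paths:
        if len(path) <= 2:               # Step 2: eliminate trivial two-node paths
            continue
        pts = np.asarray([centroids[i] for i in path], float)
        mid = pts.mean(axis=0)
        # Path centroid: the mesh centroid closest to the path's mean position.
        c = pts[np.argmin(np.linalg.norm(pts - mid, axis=1))]
        context.append(centroid_histogram(c, centroids))   # Steps 3-4, via Eq. (6)
    return context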

B. Posture Recognition Using the Skeleton and the Centroid Context

Given a posture, we can extract its skeleton feature and centroid context using the techniques described in Sections III-A and IV-A, respectively. Then, the distance between any two postures can be measured using (5) (for the skeleton) or (9) (for the centroid context). As noted previously, the skeleton feature is used for a coarse search, and the centroid context feature is used for a fine search. To obtain better recognition results, the two distance measures should be integrated. We use the following weighted sum to represent the total distance:

d_total(A, B) = w d_S(A, B) + (1 - w) d_CC(A, B),    (10)

where d_total(A, B) is the total distance between two postures A and B and w is a weight used to balance the skeleton distance d_S and the centroid context distance d_CC.

V. BEHAVIOR ANALYSIS USING STRING MATCHING

We represent each movement sequence of a human subject by a continuous sequence of postures that changes over time. To simplify the analysis, we convert a movement sequence into a set of posture symbols. Then, different movement sequences of a human subject can be recognized and analyzed using a string matching process that selects key postures as its basis. In the following, we describe a method that automatically selects a set of key postures from video sequences, and then present our string matching scheme for behavior recognition.

A. Key Posture Selection

The movement sequences of a human subject are analyzed directly from videos. Some postures in a sequence may be redundant, so they should be removed from the modeling process. To do this, we use a clustering technique to select a set of key postures from a collection of surveillance video clips.

Assume that all the postures have been extracted from a video clip and that each frame contains only one posture. In addition, assume p_i is the posture extracted from the ith frame. Given two adjacent postures, p_i and p_{i+1}, the distance between them, d_i, is calculated using (10), where the weight w is set at 0.5. Assume μ is the average value of d_i over all pairs of adjacent postures. For a posture p_i, if d_i is greater than μ, we define it as a posture-change instance. By collecting all postures that correspond to posture-change instances, we derive a set of key posture candidates Θ. However, Θ may still contain many redundant postures, which would degrade the accuracy of the sequence modeling. To address this problem, we use a clustering technique to find a better set of key postures.

Initially, we assume each element in Θ individually forms a cluster C_i. Then, given two clusters, C_i and C_j, the distance between them is defined by

d(C_i, C_j) = (1 / (|C_i| |C_j|)) Σ_{p ∈ C_i} Σ_{q ∈ C_j} d_total(p, q),    (11)

where d_total is defined in (10) and |C| represents the number of elements in cluster C. According to (11), we can execute an iterative merging scheme to find a compact set of key postures from Θ. Let C_i^t and Φ^t be the ith cluster and the collection of all clusters C_i^t, respectively, at the tth iteration. At each iteration, we choose the pair of clusters, C_a^t and C_b^t, whose distance is the minimum among all pairs in Φ^t, i.e.,

(C_a^t, C_b^t) = arg min_{C_i^t, C_j^t ∈ Φ^t, i ≠ j} d(C_i^t, C_j^t).

If d(C_a^t, C_b^t) is less than a merging threshold, then C_a^t and C_b^t are merged to form a new cluster, and a new collection of clusters Φ^{t+1} is constructed. The merging process is executed iteratively until no further merging is possible. Assume Φ is the final set of clusters. Then, from the ith cluster C_i in Φ, we can extract a key posture K_i that satisfies the following equation:

K_i = arg min_{p ∈ C_i} Σ_{q ∈ C_i} d_total(p, q).    (12)

Based on (12) and checking all clusters in Φ, the set of key postures K, i.e., K = {K_1, K_2, ..., K_{|Φ|}}, can be constructed for analyzing human movement sequences.
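A compact Python sketch of this agglomerative scheme, under our reconstruction of (11) and (12) and with the pairwise total distances of (10) supplied as a matrix d, is given below; the merging threshold is left as a parameter because its value is not specified here.

import numpy as np

def select_key_postures(d, merge_thresh):
    # d: symmetric matrix of pairwise d_total values between candidate postures.
    clusters = [[i] for i in range(len(d))]   # each candidate starts as its own cluster

    def cluster_dist(a, b):                   # Eq. (11): average pairwise distance
        return np.mean([d[p][q] for p in a for q in b])

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        if cluster_dist(clusters[i], clusters[j]) >= merge_thresh:
            break                             # no pair is close enough: stop merging
        clusters[i] += clusters.pop(j)        # merge the closest pair of clusters
    # Eq. (12): each key posture is the cluster member closest to all the others.
    return [min(c, key=lambda p: sum(d[p][q] for q in c)) for c in clusters]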

B. Human Movement Sequence Recognition Using String Matching

Based on the results of key posture selection and posture classification, we can model different human movement sequences as strings. For example, in Fig. 11, there are three kinds of movement sequence: walking, picking up, and falling. Let the symbols s and e denote the start and end points of a sequence. Then, the sequences shown in Fig. 11 can be represented by (a) "swwwe", (b) "swwppwwe", and (c) "swwwwffe", where w denotes walking, p denotes picking up, and f denotes falling. Based on this behavior parsing, different movement sequences can be compared using a string matching scheme.

Fig. 11. Three kinds of movement sequence: (a) walking, (b) picking up, and (c) falling.

Assume A and B are two movement sequences whose string representations are S_A and S_B, respectively. We use the edit distance [30] between S_A and S_B to measure the dissimilarity between A and B. The edit distance between S_A and S_B is defined as the minimum number of edit operations required to change S_A into S_B. The operations include replacements, insertions, and deletions. For any two strings S_A and S_B, we define D(i, j) as the edit distance between the first i characters of S_A and the first j characters of S_B; that is, D(i, j) is the minimum number of edit operations needed to transform the first i characters of S_A into the first j characters of S_B. In addition, let δ(a_i, b_j) be a function whose value is 0 if a_i = b_j and 1 if a_i ≠ b_j. Then, we get D(0, 0) = 0, D(i, 0) = i, and D(0, j) = j. D(i, j) is calculated by the following recursive equation:

D(i, j) = min{ D(i, j-1) + 1, D(i-1, j) + 1, D(i-1, j-1) + δ(a_i, b_j) },    (13)

where the insertion, deletion, and replacement operations correspond, respectively, to the transitions from cell (i, j-1) to cell (i, j), from cell (i-1, j) to cell (i, j), and from cell (i-1, j-1) to cell (i, j).

Assume that V is a walking video clip whose string representation is "swwwwwwe". V differs from the sequence in Fig. 11(a) due to the scaling problem along the time axis. According to (13), the edit distance between V and the sequence in Fig. 11(a) is 3, while the edit distance between V and the sequences shown in Fig. 11(b) and (c) is 2. However, V is more similar to Fig. 11(a) than to Fig. 11(b) and (c). Clearly, (13) cannot be directly applied to human movement sequence analysis and must be modified. Thus, we define a new edit distance to address this problem.

Let c_I(i, j), c_R(i, j), and c_D(i, j) be the respective costs of the insertion, replacement, and deletion operations performed at the ith and jth characters of S_A and S_B. Then, (13) can be rewritten as

D(i, j) = min{ D(i, j-1) + c_I(i, j), D(i-1, j) + c_D(i, j), D(i-1, j-1) + c_R(i, j) }.    (14)

Here, the replacement operation is considered more important than insertion and deletion, since it indicates a change of posture type. Thus, the weights of insertion and deletion should be scaled down to η, where η < 1. Following this rule, if an insertion is found when calculating D(i, j), its weight will be η if a_i = b_j; otherwise, its weight will be 1. This implies that c_I(i, j) = η when a_i = b_j and c_I(i, j) = 1 otherwise. Similarly, for the deletion operation, we obtain c_D(i, j) = η when a_i = b_j and c_D(i, j) = 1 otherwise. If the insertion and deletion weights were unconditionally set to η, they would always be smaller than the weight of the replacement operation; therefore, it would be impossible to choose replacement as the next operation. This problem can be easily solved by scaling the weight down only when a_i = b_j, so that a_i and b_j are not compared when calculating D(i, j); however, they will be compared when calculating D(i+1, j+1). Since the same end symbol e appears in S_A and S_B, the final distance D(|S_A|, |S_B|) still accounts for every symbol of both strings. Thus, the delay in comparison will not cause an error; however, the cost of insertion and deletion will increase if any incorrect editing operations are chosen. Therefore, (14) should be modified as follows:

D(i, j) = min{ D(i, j-1) + c_I(i, j), D(i-1, j) + c_D(i, j), D(i-1, j-1) + δ(a_i, b_j) },    (15)

where c_I(i, j) = η if a_i = b_j and 1 otherwise, and c_D(i, j) = η if a_i = b_j and 1 otherwise. In this work, η is consistently set to 0.1 because we consider the influence of replacement to be more significant than that of insertion and deletion. In the experiments, we obtained satisfactory results with this setting. The algorithm of string matching that calculates the edit distance between S_A and S_B is summarized as follows.

Algorithm for String Matching:

Input: two strings S_A and S_B.

Output: the edit distance D(|S_A|, |S_B|) between S_A and S_B.

Step 1: Set D(0, 0) = 0.

Step 2: Set D(i, 0) = i and D(0, j) = j for all 1 ≤ i ≤ |S_A| and 1 ≤ j ≤ |S_B|.

Step 3: For all 1 ≤ i ≤ |S_A| and 1 ≤ j ≤ |S_B|, compute D(i, j) using (15).

Step 4: Return D(|S_A|, |S_B|).
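The following Python sketch implements the algorithm under our reconstruction of (15); the condition a_i = b_j for the reduced insertion and deletion costs is our reading of the text above.

def modified_edit_distance(sa, sb, eta=0.1):
    # Edit distance of (15): insertions/deletions cost eta when the current
    # characters match (time-scaling repeats) and 1 otherwise.
    m, n = len(sa), len(sb)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                       # Step 2: boundary conditions
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):             # Step 3: fill the table using (15)
        for j in range(1, n + 1):
            match = sa[i - 1] == sb[j - 1]
            c_ins = eta if match else 1.0  # reduced cost for a delayed comparison
            c_del = eta if match else 1.0
            c_rep = 0.0 if match else 1.0
            D[i][j] = min(D[i][j - 1] + c_ins,
                          D[i - 1][j] + c_del,
                          D[i - 1][j - 1] + c_rep)
    return D[m][n]                        # Step 4

# With eta = 0.1, "swwwwwwe" is now closer to the walking string "swwwe"
# (distance 0.3) than to "swwppwwe" or "swwwwffe" (distance 2.0 each).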

VI. EXPERIMENTAL RESULTS

To test the efficiency and effectiveness of the proposed approach, we constructed a large database of more than thirty thousand postures, from which we obtained over three hundred human movement sequences. Twenty percent of the subjects in the database are female. In each sequence, we recorded a specific type of human behavior: walking, sitting, lying, squatting, falling, picking up, jumping, climbing, stooping, or gymnastics. Fig. 12 illustrates the skeleton-based posture recognition system. The left-hand side of the figure shows a person walking, while the right-hand side shows twelve pre-determined key postures for matching. Among the twelve key postures, which were determined by a clustering algorithm, the one surrounded by a bold frame matches the current posture of the query action sequence [see Fig. 12(b)]. Fig. 13 shows the centroid context-based posture recognition system. The current posture shown on the left-hand side matches the third key posture (surrounded by a bold frame) on the right-hand side of the figure [see Fig. 13(b)].

Fig. 12. Results of walking behavior analysis using the skeleton feature. The recognized posture is highlighted with a bold frame and shown in (b).

Fig. 13. Result of walking behavior analysis using the centroid context feature. The recognized type is highlighted with a bold frame and shown in (b).

In the first set of experiments, the performance of key posture selection was examined. Since there is no benchmark behavior database in the literature, twenty movement sequences (two sequences per type) were chosen to evaluate the accuracy of our proposed key posture selection algorithm. For each sequence, a set of key postures was manually selected to serve as the ground truth. Two postures were considered correctly matched if their distance was smaller than 0.1 [see (10)]. Let N_m and N_a be the numbers of key postures selected by a manual method and by our approach, respectively. We can then define the accuracy of key posture selection as follows:

accuracy = (1/2) ( N_a^c / N_a + N_m^c / N_m ),

where N_a^c is the number of key postures selected by our method, each of which matches one of the manually selected key postures, and N_m^c is the number of manually selected key postures, each of which correctly matches one of the automatically selected key postures. Fig. 14 shows the results of key posture selection from sequences of walking, squatting, and gymnastics. Due to space limitations, only 36 key postures are shown in Fig. 14. With this set of key postures, different movement sequences can be converted into a set of symbols and then recognized through a string matching method. In the experiments, the average accuracy of key posture selection was 91.9%.

Fig. 14. Result of key posture selection from three action sequences: walking, squatting, and gymnastics.

Fig. 15. Results of skeleton extraction using different sampling rates. The first row shows the skeletons extracted by connecting all triangle centroids, while the second row shows the skeletons extracted by connecting all branch nodes.

In the second set of experiments, we evaluated the performance of skeleton extraction using triangulation. Fig. 15 shows the results of skeleton extraction using different sampling rates. The top row illustrates the skeletons extracted by connecting all the triangle centroids, while the bottom row shows the skeletons extracted by connecting all the branch nodes. No matter which sampling rate was used, the skeletons were always extracted; however, a higher sampling rate achieved a better approximation. Fig. 16 shows another example of extracting a simple skeleton and a complex skeleton directly from a profile posture. In this paper, a simple skeleton means a skeleton derived by connecting all the branch nodes, and a complex skeleton is a skeleton derived by connecting the centroids of all the triangles. To classify a posture, we have to convert a skeleton into its corresponding distance map. Fig. 17(a) and (b) show the results of applying the distance transform to the skeletons shown in Fig. 16(a) and (b), respectively.
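A distance map can be obtained from a binary skeleton image with a standard Euclidean distance transform; a SciPy-based sketch is shown below (the chamfer transform of [22] would serve equally well), with the array convention as our assumption.

import numpy as np
from scipy.ndimage import distance_transform_edt

def skeleton_distance_map(skeleton):
    # skeleton: boolean image, True on skeleton pixels.
    # distance_transform_edt measures the distance to the nearest zero pixel,
    # so the skeleton is inverted before applying the transform.
    return distance_transform_edt(~np.asarray(skeleton, dtype=bool))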

Table I details the success rates of posture recognition when different numbers of meshes (resolutions) were used to extract both simple and complex skeletons. From the table, we observe that 1) the performance obtained using the complex skeleton representation was better than that derived using the simple skeleton and 2) the more meshes used for representation, the better the performance of posture recognition.

Fig. 16. Skeleton extraction directly from a posture profile. (a) Simple skeleton derived by connecting branch nodes and (b) complex skeleton derived by connecting all triangle centroids.

Fig. 17. Result of distance transform. (a) Distance map extracted from Fig. 16(a) and (b) distance map extracted from Fig. 16(b).

TABLE I. SUCCESS RATES OF POSTURE RECOGNITION WHEN DIFFERENT NUMBERS OF MESHES WERE USED FOR SKELETON EXTRACTION

We explain how the posture recognition process works with the following example. The top row of Fig. 18 shows a number of input postures represented by simple skeletons. Their corresponding binary maps are shown in the bottom row of the figure using dark shading. The associated silhouettes, shown in the second row, represent the corresponding posture types retrieved from the posture database. Fig. 19 compares the performance of the simple and complex skeletons in recognizing postures from different action (movement) types. From Fig. 19, it is obvious that the complex skeleton consistently outperforms the simple skeleton. This is because the simple skeleton loses more information than the complex skeleton; therefore, the complex skeleton performs better in terms of accuracy.

The next set of experiments was conducted to evaluate the performance of the centroid context feature. Since the results of this experiment are influenced by the numbers of predefined sectors and shells, we conducted six sets of experiments: (4, 15), (4, 20), (4, 30), (8, 15), (8, 20), and (8, 30), in which the first argument represents the number of shells and the second represents the number of sectors. Fig. 20 shows two types of polar partition. Table II lists the recognition rates when different (shell, sector) parameter sets were predefined and applied to the single centroid context and the multiple centroid context cases. In the former case, the center was set at the gravity center of a posture, but in the latter case, the center of each component was taken as the gravity center of every partitioned component. From the recognition rates reported in Table II, it is obvious that using multiple centroid contexts is definitely better than using a single centroid context. Therefore, in our behavior analysis system, the multiple centroid context is implemented and adopted for posture classification. Among the different parameter sets, we found that the best setting for the (shell, sector) pair was (4, 30). Therefore, we used this setting in all the experiments.

Fig. 18. Posture recognition results when the simple skeleton is used. The first row shows the skeletons of all query postures. The silhouettes in the second row show the corresponding posture types retrieved from the database.

Fig. 19. Success rates of posture recognition using the simple skeleton and the complex skeleton on different movement types.

Fig. 20. Different numbers of sectors and shells used to define the centroid context: (a) 4 shells and 15 sectors and (b) 8 shells and 30 sectors.

TABLE II. RECOGNITION RATES OF POSTURES WHEN DIFFERENT NUMBERS OF SECTORS AND SHELLS WERE USED

Fig. 21. Posture recognition results using multiple centroid contexts.

Fig. 22. Comparison of posture recognition accuracy for different behavior types.

Fig. 21 shows the results when multiple centroid contexts were adopted. The images in the top row show how multiple centroid contexts cover four different postures. In these experiments, we used only two centroid contexts to represent each posture. In the bottom row, we retrieved database postures (in silhouette form) and then superimposed their corresponding query postures (in shaded form). All the retrieved postures were actually very close to the query postures. Fig. 22 lists the recognition rates for different types of behavior when the multiple centroid context approach was adopted. For all movement types, the average retrieval accuracy rate was 88.9%.

Fig. 23. Posture recognition results using (from left to right) the CC, shape context, PPM, and Zernike moment methods, respectively. The shaded region shows the input query, and the silhouettes are the recognized posture types.

TABLE III. FUNCTIONAL COMPARISONS AMONG CENTROID CONTEXT, SHAPE CONTEXT, PPM, AND MOMENTS. K: NUMBER OF MESHES; v: NUMBER OF BODY PARTS; n: NUMBER OF BOUNDARY POINTS; (w, h): DIMENSION OF THE INPUT POSTURE; L: NUMBER OF MOMENT ORDERS. T: TRANSLATION; S: SCALING; R: ROTATION. IT IS NOTICED THAT n >> K > v

In another set of experiments, we compared our multiple centroid context approach with three other approaches, namely the shape context-based approach [27], the Zernike moment-based approach [29], and the probabilistic projection map-based (PPM) approach [2]. Fig. 23 illustrates the recognition results for a walking posture (the shaded region). The silhouettes in the figure, from left to right, represent the database postures retrieved by the multiple centroid context approach [see (10)], the shape context approach, the PPM approach, and the Zernike approach, respectively. The shape context approach and the PPM approach misclassified the posture type as "running." The Zernike approach correctly recognized the posture type as the walking mode; however, the recognized posture belonged to the next key posture of the same movement sequence. Table III lists the functional comparisons among these methods. The Zernike approach has several invariance properties, including rotation, translation, and scaling. However, the calculation of moments at different orders is very time consuming. The PPM approach is simple and efficient but not robust to noise. The shape context is robust to noise but has a higher time complexity. Our centroid context is robust to noise and can represent a posture up to a syntactic level. Due to its region-based nature, it can classify postures very efficiently. Fig. 24 compares the performance of the four approaches on ten behavior types. In terms of accuracy, our approach's performance was similar to that of the shape context approach. However, our approach is better at dealing with sequences that involve significant movements, such as climbing and walking. Most importantly, our method has a lower time complexity than the shape context method because it is a region-based descriptor, whereas the shape context approach is pixel-based.

Next, we conducted a series of experiments to evaluate movement sequence recognition using the proposed string matching scheme. We collected 450 movement sequences, divided equally among the ten behavior types (i.e., each behavior type had 45 movement sequences). The experiment results, listed in Table IV, show that most of the test sequences were correctly identified and classified into the appropriate categories. However, some of the squatting sequences tended to be misclassified as picking up due to the high degree of similarity between the two behaviors. In this set of experiments, the average accuracy of behavior analysis was 94.5%.

Fig. 24. Comparison of the performance of different approaches.

TABLE IV. ANALYSIS OF TEN BEHAVIOR TYPES USING OUR PROPOSED STRING MATCHING SCHEME

To test the proposed approach in real-world applications, we analyzed irregular (or suspicious) human movements captured by a surveillance camera. First, we extracted a set of normal key postures from video sequences to learn regular human movements like walking or running. After this training process, we took unknown surveillance sequences and determined whether the postures in them were regular or not. If irregular postures appear continuously, an alarm message triggers a security warning. For example, in Fig. 25(a), a set of normal key postures was extracted from a walking sequence. From these postures, we can tell that the posture in Fig. 25(b) is a normal posture. However, the posture in Fig. 25(c) is classified as abnormal, since the person adopted a bending-down posture. We used a marker to label such abnormal posture sequences until they returned to normal. Fig. 26 shows another two cases of abnormal posture detection. The postures in Fig. 26(a) and (c) were recognized as normal, since they were similar to the postures in Fig. 25(a). However, the postures in Fig. 26(b) and (d) were classified as abnormal, since the subjects adopted "shooting" and "climbing wall" postures. The abnormal posture detection function improves a video surveillance system in two ways. First, it saves a great deal of storage space because only abnormal postures need to be stored. Second, it responds immediately when an abnormal posture is identified.

Fig. 25. Abnormal activity detection. (a) Five key postures defining several normal human movements. (b) A normal condition is detected. (c) A warning message is triggered when an abnormal posture is detected.

Fig. 26. Abnormal posture detection. (a) and (c): Normal postures were detected. (b) and (d): Abnormal postures were detected due to the unexpected "shooting" and "climbing wall" postures, respectively.
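The alarm logic described above might be sketched as follows; the distance threshold, the run length, and all names are our illustrative choices, with d_total and the key posture set taken from the earlier sections.

def monitor(postures, key_postures, d_total, dist_thresh=0.1, run_len=5):
    # Flag frames whose posture matches no normal key posture; raise an alarm
    # when abnormal postures persist for run_len consecutive frames.
    abnormal_run = 0
    for t, p in enumerate(postures):
        is_normal = any(d_total(p, k) < dist_thresh for k in key_postures)
        abnormal_run = 0 if is_normal else abnormal_run + 1
        if abnormal_run == run_len:       # irregular postures appear continuously
            print("frame %d: security warning triggered" % t)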

VII. CONCLUSION

We have proposed a method for analyzing human movements directly from videos. Specifically, the method comprises the following:

1) A triangulation-based technique that extracts two important features, the skeleton feature and the centroid context feature, from a posture to derive more semantic meaning. The features form a finer descriptor that can describe a posture from the shape of the whole body or from body parts. Since the features complement each other, all human postures can be compared and classified very accurately.

2) A clustering scheme for key posture selection, which recognizes and codes human movements using a set of symbols.

3) A novel string-based technique for recognizing human movements. Even though events may involve different temporal scaling, they can still be recognized.

The average accuracies of posture classification and behavior analysis were 95.16% and 94.5%, respectively. The experiment results show that the proposed method is superior to other approaches for analyzing human movements in terms of accuracy, robustness, and stability.

REFERENCES

[1] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Comput. Vis. Image Understand., vol. 81, no. 3, pp. 231-268, Mar. 2001.

[2] R. Cucchiara, C. Grana, A. Prati, and R. Vezzani, "Probabilistic posture classification for human-behavior analysis," IEEE Trans. Syst., Man, Cybern. A, vol. 35, no. 1, pp. 42-54, Jan. 2005.

[3] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 809-830, 2000.

[4] S. Maybank and T. Tan, "Introduction to special section on visual surveillance," Int. J. Comput. Vis., vol. 37, no. 2, p. 173, 2000.

[5] I. Mikic et al., "Human body model acquisition and tracking using voxel data," Int. J. Comput. Vis., vol. 53, no. 3, pp. 199-223, 2003.

[6] N. Werghi, "A discriminative 3D wavelet-based descriptor: Application to the recognition of human body postures," Pattern Recognit. Lett., vol. 26, no. 5, pp. 663-677, 2005.

[7] I.-C. Chang and C.-L. Huang, "The model-based human body motion analysis system," Image Vis. Comput., vol. 18, no. 14, pp. 1067-1083, 2000.

[8] H. Fujiyoshi, A. J. Lipton, and T. Kanade, "Real-time human motion analysis by image skeletonization," IEICE Trans. Inform. Syst., vol. E87-D, no. 1, pp. 113-120, Jan. 2004.

[9] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 831-843, Aug. 2000.

[10] R. Rosales and S. Sclaroff, "3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 1999, vol. 2, pp. 637-663.

[11] Y. Ivanov and A. Bobick, "Recognition of visual activities and interactions by stochastic parsing," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 852-872, Aug. 2000.

[12] S. Weik and C. E. Liedtke, "Hierarchical 3D pose estimation for articulated human body models from a sequence of volume data," in Proc. Int. Workshop on Robot Vision, Auckland, New Zealand, Feb. 2001, pp. 27-34.

[13] C. R. Wren et al., "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 7, pp. 780-785, Jul. 1997.

[14] J. Shewchuk, "Delaunay refinement algorithms for triangular mesh generation," Comput. Geometry: Theory Applicat., vol. 22, pp. 21-74, May 2002.

[15] S. Park and J. K. Aggarwal, "Segmentation and tracking of interacting human body parts under occlusion and shadowing," in Proc. IEEE Workshop on Motion and Video Computing, Orlando, FL, 2002, pp. 105-111.

[16] S. Park and J. K. Aggarwal, "Semantic-level understanding of human actions and interactions using event hierarchy," in Proc. 2004 Conf. Computer Vision and Pattern Recognition Workshop, Washington, DC, Jun. 2004.

[17] Y. Guo, G. Xu, and S. Tsuji, "Understanding human motion patterns," in Proc. IEEE Int. Conf. Pattern Recognition, 1994, vol. 2, pp. 325-329.

[18] Y. Guo, G. Xu, and S. Tsuji, "Tracking human body motion based on a stick figure model," J. Vis. Commun. Image Represent., vol. 5, 1994.

[19] J. Ben-Arie, Z. Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 8, pp. 1091-1104, Aug. 2002.

[20] S. X. Ju, M. J. Black, and Y. Yacoob, "Cardboard people: A parameterized model of articulated image motion," in Proc. Int. Conf. Automatic Face and Gesture Recognition, Killington, VT, 1996, pp. 38-44.

[21] E. Horowitz, S. Sahni, and S. Anderson-Freed, Fundamentals of Data Structures in C. New York: W. H. Freeman, 1993.

[22] G. Borgefors, "Distance transformations in digital images," Comput. Vis., Graph., Image Process., vol. 34, no. 3, pp. 344-371, Feb. 1986.

[23] M. Bern and D. Eppstein, "Mesh generation and optimal triangulation," in Computing in Euclidean Geometry, 2nd ed. Singapore: World Scientific, 1995, pp. 47-123.

[24] C. Stauffer and E. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 747-757, 2000.

[25] D. Chetverikov and Z. Szabo, "A simple and efficient algorithm for detection of high curvature points in planar curves," in Proc. 23rd Workshop of the Austrian Pattern Recognition Group, 1999, pp. 175-184.

[26] L. P. Chew, "Constrained Delaunay triangulations," Algorithmica, vol. 4, no. 1, pp. 97-108, 1989.

[27] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 509-522, Apr. 2002.

[28] Y. Rui, A. C. She, and T. S. Huang, "Modified Fourier descriptors for shape representation: a practical approach," in Proc. 1st Int. Workshop on Image Databases and Multi-Media Search, Amsterdam, The Netherlands, Aug. 1996.

[29] A. Khotanzad and Y. H. Hong, "Invariant image recognition by Zernike moments," IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 5, pp. 489-497, May 1990.

[30] E. Ukkonen, "On approximate string matching," in Proc. Int. Conf. Foundations of Computation Theory, 1983, pp. 487-495.

[31] M. J. Swain and D. H. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, no. 1, pp. 11-32, 1991.

Jun-Wei Hsieh (M'06) received the B.S. degree in computer science from Tunghai University, Taiwan, R.O.C., in 1990 and the Ph.D. degree in computer engineering from National Central University, Taiwan, in 1995.

From 1996 to 2000, he was a Research Fellow at the Industrial Technology Research Institute, Hsinchu, Taiwan, where he managed a team that developed video-related technologies. He is presently an Associate Professor with the Department of Electrical Engineering, Yuan Ze University, Chung-Li, Taiwan. His research interests include content-based multimedia databases, video indexing and retrieval, computer vision, and pattern recognition.

Dr. Hsieh received the Best Paper Awards from the Image Processing and Pattern Recognition Society of Taiwan in 2005, 2006, and 2007, and the Distinguished Research Award from Yuan Ze University in 2006. He is a recipient of the Phi Tau Phi Award.

Yung-Tai Hsu was born in Kaohsiung, Taiwan, R.O.C., in 1978. He received the B.S. and M.S. degrees in mechanical engineering from Yuan Ze University, Taiwan, in 2000 and 2002, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at Yuan Ze University, Chung-Li, Taiwan.

His research interests include pattern recognition, computer vision, and machine learning.

Mr. Hsu received the Best Paper Award at the 2006 IPPR Conference on Computer Vision, Graphics, and Image Processing.


Hong-Yuan Mark Liao (SM'01) received the B.S. degree in physics from National Tsing-Hua University, Hsinchu, Taiwan, R.O.C., in 1981, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University, Evanston, IL, in 1985 and 1990, respectively.

He was a Research Associate with the Computer Vision and Image Processing Laboratory, Northwestern University, in 1990-1991. In July 1991, he joined the Institute of Information Science, Academia Sinica, Taipei, Taiwan, as an Assistant Research Fellow. He was promoted to Associate Research Fellow and then Research Fellow in 1995 and 1998, respectively. From August 1997 to July 2000, he served as the Deputy Director of the institute. From February 2001 to January 2004, he was the Acting Director of the Institute of Applied Science and Engineering Research. He is jointly appointed as a Professor in the Computer Science and Information Engineering Department of National Chiao-Tung University. His current research interests include multimedia signal processing, video surveillance, and content-based multimedia retrieval.

Dr. Liao is the Editor-in-Chief of the Journal of Information Science and Engineering. He is on the Editorial Boards of the International Journal of Visual Communication and Image Representation, the EURASIP Journal on Advances in Signal Processing, and Research Letters in Signal Processing. He was an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA from 1998 to 2001. He received the Young Investigators' Award from Academia Sinica in 1998; the Excellent Paper Award from the Image Processing and Pattern Recognition Society of Taiwan in 1998 and 2000; the Distinguished Research Award from the National Science Council of Taiwan in 2003; and the National Invention Award in 2004. He served as the Program Chair of the International Symposium on Multimedia Information Processing (ISMIP'97), the Program Co-Chair of the Second IEEE Pacific-Rim Conference on Multimedia (2001), the Conference Co-Chair of the Fifth IEEE International Conference on Multimedia and Expo (ICME 2004), and the Technical Co-Chair of the Eighth IEEE ICME (2007).

Chih-Chiang Chen was born in Taipei, Taiwan, R.O.C., in 1982. He received the B.S. degree in electrical engineering from Yuan Ze University, Taiwan, in 2004, where he is currently pursuing the Ph.D. degree in electrical engineering.

His research interests include image processing, behavior analysis, and video surveillance.