

Video-Based Human Movement Analysis and Its Application to Surveillance Systems

Jun-Wei Hsieh, Member, IEEE, Yung-Tai Hsu, Hong-Yuan Mark Liao, Senior Member, IEEE, and Chih-Chiang Chen

Manuscript received July 12, 2006; revised August 4, 2007. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grant NSC95-2221-E-155-071 and by the Ministry of Economic Affairs under Contract 94-EC-17-A-02-S1-032. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Kiyoharu Aizawa.

J.-W. Hsieh, Y.-T. Hsu, and C.-C. Chen are with the Department of Electrical Engineering, Yuan Ze University, Chung-Li 320, Taiwan, R.O.C. (e-mail: [email protected]; [email protected]; [email protected]).

H.-Y. M. Liao is with the Institute of Information Science, Academia Sinica, Taipei, Taiwan, R.O.C. (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TMM.2008.917403

Abstract: This paper presents a novel posture classification system that analyzes human movements directly from video sequences. In the system, each sequence of movements is converted into a posture sequence. To better characterize a posture in a sequence, we triangulate it into triangular meshes, from which we extract two features: the skeleton feature and the centroid context feature. The first feature is used as a coarse representation of the subject, while the second is used to derive a finer description. We adopt a depth-first search (DFS) scheme to extract the skeletal features of a posture from the triangulation result. The proposed skeleton feature extraction scheme is more robust and efficient than conventional silhouette-based approaches. The skeletal features extracted in the first stage are used to extract the centroid context feature, which is a finer representation that can characterize the shape of a whole body or of body parts. The two descriptors working together make human movement analysis a very efficient and accurate process because they generate a set of key postures from a movement sequence. The ordered key posture sequence is represented by a symbol string. Matching two arbitrary action sequences then becomes a symbol string matching problem. Our experiment results demonstrate that the proposed method is a robust, accurate, and powerful tool for human movement analysis.

Index Terms: Behavior analysis, event detection, string matching, triangulation.

I. INTRODUCTION

THE analysis of human body movements can be applied in a variety of application domains, such as video surveillance, video retrieval, human-computer interaction systems, and medical diagnosis. In some cases, the results of such analysis can be used to identify people acting suspiciously and other unusual events directly from videos. Many approaches have been proposed for video-based human movement analysis [1]-[6], [9], [10]. For example, Oliver et al. [9] developed a visual surveillance system that models and recognizes human behavior using hidden Markov models (HMMs) and a trajectory feature. Rosales and Sclaroff [10] proposed a trajectory-based recognition system to detect pedestrians in outdoor environments and recognize their activities from multiple views based on a mixture of Gaussian classifiers. Other approaches extract human postures or body parts (such as the head, hands, torso, or feet) to analyze human behavior [5], [6]. Mikic et al. [5] used complex 3-D models and multiple video cameras to extract 3-D information for 3-D posture analysis, while Werghi [6] extracted 3-D information and applied wavelet transforms to analyze 3-D human postures. Although 3-D features are more informative than 2-D features (like contours or silhouettes) for classifying human postures, the inherent correspondence problem and high computational cost make this approach inappropriate for real-time applications. As a result, a number of researchers have based their analyses of human behavior on 2-D postures. In [2], Cucchiara et al. proposed a probabilistic posture classification scheme to identify several types of movement, such as walking, running, squatting, or sitting. Ivanov and Bobick [11] presented a 2-D posture classification system that uses HMMs to recognize human gestures and behaviors. In [13], Wren et al. proposed the Pfinder system for tracking and recognizing human behavior based on a 2-D blob model. The drawback of this approach is that incorporating 2-D posture models into the analysis of human behavior raises the possibility of ambiguity between the adopted models and real human behavior due to mutual occlusions between body parts and loose clothing. In [17] and [18], Guo et al. used a stick model that represents the parts of a human body as sticks and links the sticks with joints for human motion tracking. Then, in [19], Ben-Arie et al. used a voting technique to recover the parameters of the stick model from video sequences for action analysis. In addition to the stick model, the cardboard model [20] is widely recognized as a good way to model articulated human motion. However, the prerequisite that body parts must be well segmented makes this model inappropriate for real-time analysis of human behavior.

To solve the problem with the cardboard model, Park and Aggarwal [15] used a dynamic Bayesian network to segment a body into different parts [16]. This blob-based approach is very promising for analyzing human behavior at the semantic level, but it is very sensitive to changes in illumination. In addition to blobs, several researchers have proposed methods for classifying human postures based on silhouettes. Weik and Liedtke [12] traced the negative minimum curvatures along body contours to segment body parts and then identified body postures using a modified Iterative Closest Point (ICP) algorithm. Fujiyoshi et al. [8] presented a skeleton-based method that recognizes postures by extracting skeletal features based on curvature changes along the human silhouette. In addition, Chang and Huang [7] used different morphological operations to extract skeletal features from postures and then identified movements using an HMM framework. Although the contour-based approach is simple and efficient for coarse classification of human postures, it is vulnerable to the effects of noise, imperfect contours, and occlusions. Another approach used to analyze human behavior is the Gaussian probabilistic model. In [2], Cucchiara et al. used a probabilistic projection map to model postures and performed frame-by-frame posture classification to recognize human behavior. The method uses the concept of a state-transition graph to integrate temporal information about postures in order to deal with the occlusion problem. However, the projection histogram is not robust because it changes dramatically under different lighting conditions.

Fig. 1. Flowchart of the proposed system.

In this paper, we propose a novel triangulation-based system that analyzes human behavior directly from videos. The detailed components of the system are shown in Fig. 1. First, we apply background subtraction to extract body postures from video sequences and then derive their boundaries by contour tracing. Next, a Delaunay triangulation technique [14] is used to divide a posture into different triangular meshes. From the triangulation result, two important features, the skeleton and the centroid context (CC), are extracted for posture recognition. The skeleton feature is used for a coarse search, while the centroid context, being a finer feature, is used to classify postures more accurately. To extract the skeleton feature, we propose a graph search method that builds a spanning tree from the triangulation result. The spanning tree corresponds to the skeletal structure of the analyzed body posture. The proposed skeleton extraction method is much simpler than silhouette-based techniques [8], [13]. In addition to the skeletal structure, the tree provides important information for segmenting a posture into different body parts. Based on the result of body part segmentation, we construct a new posture descriptor, namely, the centroid context descriptor, for recognizing postures. This descriptor utilizes a polar labeling scheme to label every triangular mesh with a unique number. Then, for each body part, a feature vector, i.e., the centroid context, is constructed by recording all related features of the centroid of each triangular mesh according to this unique number. We can then compare different postures more accurately by measuring the distance between their centroid contexts. Each posture is assigned a semantic symbol so that each human movement can be converted into and represented by a set of symbols. Based on this representation, we use a new string matching scheme to recognize different human movements directly from videos. The string-based method modifies the calculation of the edit distance by using different weights to measure the operations of insertion, deletion, and replacement. Because of this modification, even if two movements differ by large temporal scaling, their edit distance remains small. The slow increase in the edit distance substantially reduces the effect of the time warping problem. The experiment results demonstrate the feasibility and superiority of the proposed approach for analyzing human behavior.

Fig. 2. Sampling of control points. (a) A point with a high curvature. (b) Two points with high curvatures that are too close to each other.

The remainder of the paper is organized as follows. The Delaunay triangulation technique is introduced in Section II. In Section III, we describe the posture classification scheme based on the skeletal feature. In Section IV, the centroid context for posture classification is discussed. The key posture selection and string matching techniques are detailed in Section V. The experiment results are given in Section VI. We then present our conclusions in Section VII.

II. DEFORMABLE TRIANGULATIONS

In this paper, it is assumed that all video sequences are captured by a stationary camera. When the camera is static, the background of the captured video sequence can be reconstructed using a mixture of Gaussian models [24]. This allows us to detect foreground objects by subtracting the background from each frame. We then apply some simple morphological operations to remove noise. After these pre-processing steps, we partition a foreground human posture into triangular meshes using the constrained Delaunay triangulation technique and extract both the skeletal features and the centroid context directly from the triangulation result [25].
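For the background-modeling step, the following Python sketch (based on OpenCV, whose MOG2 subtractor is a mixture-of-Gaussians model in the spirit of [24]) illustrates one plausible realization; the file name and parameter values are our own illustrative choices, not the authors' settings.

import cv2

# Mixture-of-Gaussians background model in the spirit of [24]; parameters illustrative.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                                detectShadows=True)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

cap = cv2.VideoCapture('surveillance.avi')  # hypothetical input clip
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)          # foreground mask (255 = foreground)
    mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]  # drop shadow pixels
    # Simple morphological operations to remove noise, as in the pre-processing step.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
cap.release()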

Assume that P is a posture extracted in binary form by image subtraction. To triangulate P, a set of control points must be extracted along its contour in advance. Let Γ be the set of boundary points along the contour of P. We extract high-curvature points from Γ as the set of control points. Let θ_p be the angle of a point p in Γ. As shown in Fig. 2(a), the angle can be determined by two specified points, p+ and p-, selected from both sides of p along Γ to satisfy

d_min ≤ |p - p+| ≤ d_max  and  d_min ≤ |p - p-| ≤ d_max,    (1)

where d_min and d_max are two thresholds defined in terms of L_Γ, the total length of the contour of P. With p+ and p-, the angle θ_p can be determined by

θ_p = arccos( (p+ - p)·(p- - p) / ( |p+ - p| |p- - p| ) ).    (2)

If θ_p is less than a threshold (here we set it at 150°), p is selected as a control point. In addition to (2), it is required that two control points be far apart. This enforces the constraint that the distance between any two control points must be larger than the threshold d_min defined in (1). As shown in Fig. 2(b), if two candidates p and q are close to each other, i.e., |p - q| ≤ d_min, the candidate with the smaller angle is chosen as a control point.

Fig. 3. All the vertices are labeled anticlockwise such that the interior of V is located on their left.
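To make the selection rule concrete, the following Python sketch implements the angle test of (2) and the minimum-distance constraint; the contour is assumed to be an ordered, closed list of (x, y) points, the fixed arc-length offset used to pick p+ and p- is our simplification of (1), and all function and variable names are ours rather than the paper's.

import numpy as np

def select_control_points(contour, d_min, d_max, angle_thresh_deg=150.0):
    # contour: ordered (x, y) points along a closed boundary.
    n = len(contour)
    pts = np.asarray(contour, dtype=float)
    offset = max(1, int(round((d_min + d_max) / 2)))  # arc-length step to p+ / p- (assumed)
    candidates = []
    for i in range(n):
        p = pts[i]
        p_plus = pts[(i + offset) % n]    # point ahead of p along the contour
        p_minus = pts[(i - offset) % n]   # point behind p along the contour
        v1, v2 = p_plus - p, p_minus - p
        cos_a = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)
        angle = np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0)))  # Eq. (2)
        if angle < angle_thresh_deg:      # sharp corner: keep as candidate
            candidates.append((i, angle))
    # Spacing constraint: of two nearby candidates, keep the one with the smaller angle.
    candidates.sort(key=lambda c: c[1])   # sharpest candidates first
    selected = []
    for i, angle in candidates:
        if all(min(abs(i - j), n - abs(i - j)) > d_min for j, _ in selected):
            selected.append((i, angle))
    selected.sort()
    return [tuple(pts[i]) for i, _ in selected]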

Assume that V is the set of control points extracted along the boundary of P. Each point on V is labeled anticlockwise and indexed modulo the size of V. If any two adjacent points on V are connected by an edge, V can be regarded as a planar graph, i.e., a polygon. We adopt the constrained Delaunay triangulation technique proposed by Chew [26] to divide V into triangular meshes.

As shown in Fig. 3, the vertices of V are ordered anticlockwise so that the interior of the polygon lies on their left. For a triangulation T of V, T is said to be a constrained Delaunay triangulation of V if a) each edge of V is an edge of T and b) for each remaining edge e of T, there exists a circle C such that 1) the endpoints of e are on the boundary of C, and 2) if a vertex v of V is inside C, then it cannot be seen from at least one of the endpoints of e. More precisely, given three vertices v_i, v_j, and v_k of V, the triangle ∆(v_i, v_j, v_k) belongs to the constrained Delaunay triangulation if and only if

(i) v_i, v_j, and v_k are mutually visible, i.e., each pair can be joined by a segment that does not cross any edge of V;

(ii) no vertex of V visible from the interior of ∆(v_i, v_j, v_k) lies inside C, where C is the circumcircle of v_i, v_j, and v_k. That is, the interior of C does not contain a visible vertex of V.

Based on the above definition, Chew [26] proposed a divide-and-conquer algorithm that computes a constrained Delaunay triangulation of V in O(n log n) time. The algorithm works recursively. When V contains only three vertices, V itself is the triangulation result. However, when V contains more than three vertices, we choose an edge from V and search for the corresponding third vertex that satisfies properties (i) and (ii). Then, we subdivide V into two sub-polygons, V1 and V2. The same procedure is applied recursively to V1 and V2 until the processed polygon contains only one triangle. In summary, the four steps of the algorithm are as follows:

1. Choose a starting edge e = (v_i, v_j) of V.

2. Find the third vertex v_k for e that satisfies properties (i) and (ii).

3. Subdivide V into two sub-polygons, V1 and V2, separated by the triangle ∆(v_i, v_j, v_k).

4. Repeat Steps 1-3 on V1 and V2 until the processed polygon consists of only one triangle.

Details of this algorithm can be found in Bern and Eppstein [23]. Fig. 4 shows an example of the triangulation result when the algorithm is applied to a human posture.
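As a sketch of the recursion only (the full algorithm with its geometric predicates is given in [26] and [23]), the steps above might be coded as follows in Python; is_cdt_vertex, which would test properties (i) and (ii), is an assumed helper, not a real library call.

def cdt(polygon):
    # polygon: vertices in anticlockwise order; returns a list of triangles.
    if len(polygon) == 3:
        return [tuple(polygon)]                  # base case: one triangle
    vi, vj = polygon[0], polygon[1]              # Step 1: choose a starting edge
    k = next(k for k in range(2, len(polygon))
             if is_cdt_vertex(polygon, vi, vj, polygon[k]))   # Step 2
    # Step 3: split along triangle (vi, vj, vk) into two sub-polygons.
    v1 = polygon[1:k + 1]                        # vj ... vk
    v2 = polygon[k:] + polygon[:1]               # vk ... v(n-1), vi
    triangles = [(vi, vj, polygon[k])]
    for sub in (v1, v2):                         # Step 4: recurse on both parts
        if len(sub) >= 3:
            triangles += cdt(sub)
    return triangles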

Fig. 4. Triangulation result of a body posture. (a) Input posture. (b) Triangulation result of (a).

Fig. 5. Flowchart of posture classification using skeletons.

III. SKELETON-BASED POSTURE RECOGNITION

In this work, we extract the skeleton and the centroid context as features from the triangulation result. Conventionally, the extracted skeleton is based on the contour of an object. For example, in [8], Fujiyoshi et al. extract feature points with minimum curvatures from the silhouette of a human body to define the skeleton. However, as that method is vulnerable to noise, we use a graph search scheme to solve the problem.

A. Triangulation-Based Skeleton Extraction

In Section II, we presented a technique for triangulating a human body image into triangular meshes. By connecting the centroids of every pair of connected meshes, a graph can be formed. In this section, we use a depth-first search technique to find the skeleton that will be used for posture recognition. Fig. 5 shows the block diagram of posture classification using skeletons. In what follows, the details of each block are described.

Assume that P is a binary posture. Using the above technique, P is decomposed into a set of triangular meshes T, i.e., T = {t_1, t_2, ..., t_K}. Each triangular mesh t_i in T has a centroid c_i, and two meshes, t_i and t_j, are connected if they share one common edge. Then, based on this connectivity, T can be converted into an undirected graph G, where all centroids c_i in T are nodes of G, and an edge exists between c_i and c_j if t_i and t_j are connected. The degree of a node on the graph is defined by the number of edges connected to it. Then, we perform a graph search on G to extract the skeleton of P.

To derive a skeleton based on a graph search, we seek a node v_head whose degree is one and whose position is the highest among all the nodes of G; v_head is defined as the head of P. Then, we perform a depth-first search [21] from v_head to construct a spanning tree T_P in which all leaf nodes correspond to different limbs of P. The branch nodes (whose degree is three on G) are the key points used to decompose P into different body parts, such as the hands, feet, or torso. Let c_P be the centroid of P, and let Ω be the union of v_head, the branch nodes, the leaf nodes, and c_P. The skeleton S_P of P can be extracted by directly linking any two nodes in Ω if they are connected through other nodes (i.e., if a path exists between the two nodes [21]). Next, we summarize the algorithm for skeleton extraction. The time complexity of this algorithm is O(K), where K is the number of triangle meshes.

Fig. 8. Polar transform of a human posture.

Fig. 9. Body component extraction. (a) Triangulation result of a posture; (b) the spanning tree of (a); and (c) centroids derived from different parts (determined by removing all branch nodes).

Then, similar to the technique used in shape context [27], we project a sample onto a polar coordinate system and label each mesh. Fig. 8 shows a polar transform of a human posture. We use n_r to represent the number of shells used to quantize the radial axis and n_s to represent the number of sectors that we want to quantize in each shell. Therefore, the total number of bins used to construct the centroid context is B = n_r × n_s. For the centroid c_i of the ith triangular mesh of a posture, we construct a vector histogram h_i = (h_i(1), ..., h_i(B)), in which h_i(k) is the number of triangular mesh centroids in the kth bin when c_i is considered as the origin, i.e.,

h_i(k) = #{ c_j | c_j ≠ c_i and c_j ∈ bin_k(c_i) },    (6)

where bin_k(c_i) is the kth bin of the log-polar coordinate system centered at c_i. Then, given two histograms, h_i and h_j, the distance between them can be measured by a normalized intersection [31]

d(h_i, h_j) = 1 - (1/K) Σ_{k=1}^{B} min( h_i(k), h_j(k) ),    (7)

where B is the number of bins and K denotes the number of meshes calculated from a posture. Using (6) and (7), we can define a centroid context to describe the characteristics of an arbitrary posture P.
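Under our reading of (6) and (7), the histogram and its distance might be sketched as follows in Python; the log-spaced shell radii (out to the farthest centroid) and the names are our assumptions.

import numpy as np

def centroid_histogram(origin, centroids, n_shells=4, n_sectors=30):
    # Eq. (6): count mesh centroids falling in each log-polar bin around `origin`.
    origin = np.asarray(origin, float)
    pts = np.asarray([c for c in centroids if not np.allclose(c, origin)], float)
    d = pts - origin
    r = np.linalg.norm(d, axis=1)
    theta = np.mod(np.arctan2(d[:, 1], d[:, 0]), 2 * np.pi)
    # Log-spaced shell boundaries out to the farthest centroid (assumed design).
    edges = np.logspace(0, np.log10(r.max() + 1.0), n_shells + 1) - 1.0
    shell = np.clip(np.searchsorted(edges, r, side='right') - 1, 0, n_shells - 1)
    sector = np.minimum((theta / (2 * np.pi) * n_sectors).astype(int), n_sectors - 1)
    hist = np.zeros(n_shells * n_sectors)
    np.add.at(hist, shell * n_sectors + sector, 1)   # one count per mesh centroid
    return hist

def hist_distance(h1, h2, n_meshes):
    # Eq. (7): one minus the histogram intersection, normalized by the mesh count K.
    return 1.0 - np.sum(np.minimum(h1, h2)) / float(n_meshes)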

In the previous section, we presented a tree search algorithm that finds a spanning tree T_P from a posture P based on the triangulation result. As shown in Fig. 9, (b) is the spanning tree derived from (a). The tree captures the skeleton feature of P. Here, we call a node a branch node if it has more than one child. By this definition, there are three branch nodes in Fig. 9(b). If we remove all the branch nodes from T_P, it will be decomposed into different branch paths. Then, by collecting the set of triangular meshes along each path, we find that each path corresponds to one body component of P. For example, in Fig. 9(b), if we remove the first branch node from T_P, two branch paths will be formed; the first corresponds to the head and neck of P, and the second corresponds to the left hand of P. In some cases, a path may not correspond exactly to a high-level semantic body component. However, if the length of a path is further constrained, over-segmentation can easily be avoided. Since the segmentation of body parts is an ill-posed problem, we do not propose a complete system for segmenting a human posture into high-level semantic body parts. Instead, we use the current decomposition result to recognize postures up to the middle level (using components similar to body parts, but not real ones).

Fig. 10. Multiple centroid contexts using different numbers of sectors and shells: (a) 4 shells and 15 sectors, (b) 8 shells and 30 sectors, and (c) centroid histogram of the gravity center of the posture.

Given a path, we collect the set of triangular meshes along it. Let c be the centroid of the triangular mesh closest to the center of this set of meshes. As shown in Fig. 9(c), each such c is the centroid extracted from one branch path. Given a centroid c, we can obtain its corresponding histogram h_c using (6). Assume that the set of these path centroids is S_P. Then, based on S_P, the centroid context of P is defined as follows:

CC(P) = { h_c | c ∈ S_P },    (8)

where |S_P| is the number of elements in S_P. Fig. 10 shows two cases of multiple centroid contexts in which the numbers of shells and sectors are set to (4, 15) and (8, 30), respectively. The centroid contexts are extracted from the head and the center of the posture, respectively, and Fig. 10(c) is the centroid histogram of the gravity center of the posture. Given two postures, A and B, the distance between their centroid contexts is measured by

d_CC(A, B) = Σ_(i,j) w_i^A w_j^B d(h_i^A, h_j^B),    (9)

where w_i^A and w_j^B are the area ratios of the ith and jth body parts residing in A and B, respectively, and the summation runs over corresponding pairs of body parts. Based on (9), an arbitrary pair of postures can be compared. We now summarize the algorithm for finding the centroid context of a posture P.


Algorithm for Centroid Context Extraction:

Input: the spanning tree T_P of a posture P.

Output: the centroid context CC(P) of P.

Step 1: Recursively trace T_P using a depth-first search until the tree is empty. If a branch node (a node with more than one child) is found, collect all the visited nodes into a new path and remove these nodes from T_P.

Step 2: If a new path contains only two nodes, eliminate it; otherwise, find its path centroid c.

Step 3: Use (6) to find the centroid histogram h_c of each path centroid c.

Step 4: Collect all the histograms h_c as the centroid context CC(P) of P.

The time complexity of centroid context extraction is O(vK), where K is the number of triangle meshes in P and v is the number of extracted branch paths (v < K).
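Read together with the earlier sketches, the four steps might be implemented as below; the recursive path-splitting logic and the reuse of centroid_histogram from the previous sketch are our assumptions, not the authors' code.

import numpy as np

def extract_centroid_context(tree, head, centroids):
    # tree: child lists of the spanning tree; centroids: mesh centroids by index.
    paths, current = [], []

    def dfs(node):
        nonlocal current
        current.append(node)
        children = tree.get(node, [])
        if len(children) > 1:            # Step 1: a branch node closes the current path
            paths.append(current)
            current = []
        for child in children:
            dfs(child)
        if not children and current:     # a leaf also ends the current path
            paths.append(current)
            current = []

    dfs(head)
    context = []
    for path in paths:
        if len(path) <= 2:               # Step 2: eliminate trivial two-node paths
            continue
        pts = np.asarray([centroids[i] for i in path], float)
        mid = pts.mean(axis=0)
        # Path centroid: the mesh centroid closest to the path's mean position.
        c = pts[np.argmin(np.linalg.norm(pts - mid, axis=1))]
        context.append(centroid_histogram(c, centroids))   # Steps 3-4, via Eq. (6)
    return context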

B. Posture Recognition Using the Skeleton and the Centroid Context

Given a posture, we can extract its skeleton feature and centroid context using the techniques described in Sections III-A and IV-A, respectively. Then, the distance between any two postures can be measured using (5) (for the skeleton) or (9) (for the centroid context). As noted previously, the skeleton feature is used for a coarse search, and the centroid context feature is used for a fine search. To obtain better recognition results, the two distance measures should be integrated. We use the following weighted sum to represent the total distance:

d_total(A, B) = w d_S(A, B) + (1 - w) d_CC(A, B),    (10)

where d_total(A, B) is the total distance between two postures A and B and w is a weight used to balance the skeleton distance d_S and the centroid context distance d_CC.

V. BEHAVIOR ANALYSIS USING STRING MATCHING

We represent each movement sequence of a human subject by a continuous sequence of postures that changes over time. To simplify the analysis, we convert a movement sequence into a set of posture symbols. Then, different movement sequences of a human subject can be recognized and analyzed using a string matching process that selects key postures as its basis. In the following, we describe a method that automatically selects a set of key postures from video sequences, and then present our string matching scheme for behavior recognition.

A. Key Posture Selection

The movement sequences of a human subject are analyzed directly from videos. Some postures in a sequence may be redundant, so they should be removed from the modeling process. To do this, we use a clustering technique to select a set of key postures from a collection of surveillance video clips.

Assume that all the postures have been extracted from a video clip and that each frame contains only one posture. In addition, assume p_i is the posture extracted from the ith frame. Given two adjacent postures, p_i and p_{i+1}, the distance between them, d_i, is calculated using (10), where the weight w is set at 0.5. Assume μ is the average value of d_i over all pairs of adjacent postures. For a posture p_i, if d_i is greater than μ, we define it as a posture-change instance. By collecting all postures that correspond to posture-change instances, we derive a set of key posture candidates Θ. However, Θ may still contain many redundant postures, which would degrade the accuracy of the sequence modeling. To address this problem, we use a clustering technique to find a better set of key postures.

Initially, we assume each element in Θ individually forms a cluster C_i. Then, given two clusters, C_i and C_j, the distance between them is defined by

d(C_i, C_j) = (1 / (|C_i| |C_j|)) Σ_{p ∈ C_i} Σ_{q ∈ C_j} d_total(p, q),    (11)

where d_total is defined in (10) and |C| represents the number of elements in cluster C. According to (11), we can execute an iterative merging scheme to find a compact set of key postures from Θ. Let C_i^t and Φ^t be the ith cluster and the collection of all clusters C_i^t, respectively, at the tth iteration. At each iteration, we choose the pair of clusters, C_a^t and C_b^t, whose distance is the minimum among all pairs in Φ^t, i.e.,

(C_a^t, C_b^t) = arg min_{C_i^t, C_j^t ∈ Φ^t, i ≠ j} d(C_i^t, C_j^t).

If d(C_a^t, C_b^t) is less than a merging threshold, then C_a^t and C_b^t are merged to form a new cluster, and a new collection of clusters Φ^{t+1} is constructed. The merging process is executed iteratively until no further merging is possible. Assume Φ is the final set of clusters. Then, from the ith cluster C_i in Φ, we can extract a key posture K_i that satisfies the following equation:

K_i = arg min_{p ∈ C_i} Σ_{q ∈ C_i} d_total(p, q).    (12)

Based on (12) and checking all clusters in Φ, the set of key postures K, i.e., K = {K_1, K_2, ..., K_{|Φ|}}, can be constructed for analyzing human movement sequences.
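A compact Python sketch of this agglomerative scheme, under our reconstruction of (11) and (12) and with the pairwise total distances of (10) supplied as a matrix d, is given below; the merging threshold is left as a parameter because its value is not specified here.

import numpy as np

def select_key_postures(d, merge_thresh):
    # d: symmetric matrix of pairwise d_total values between candidate postures.
    clusters = [[i] for i in range(len(d))]   # each candidate starts as its own cluster

    def cluster_dist(a, b):                   # Eq. (11): average pairwise distance
        return np.mean([d[p][q] for p in a for q in b])

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        if cluster_dist(clusters[i], clusters[j]) >= merge_thresh:
            break                             # no pair is close enough: stop merging
        clusters[i] += clusters.pop(j)        # merge the closest pair of clusters
    # Eq. (12): each key posture is the cluster member closest to all the others.
    return [min(c, key=lambda p: sum(d[p][q] for q in c)) for c in clusters]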

B. Human Movement Sequence Recognition Using String Matching

Based on the results of key posture selection and posture classification, we can model different human movement sequences as strings. For example, in Fig. 11, there are three kinds of movement sequence: walking, picking up, and falling. Let the symbols s and e denote the start and end points of a sequence. Then, the sequences shown in Fig. 11 can be represented by (a) "swwwe", (b) "swwppwwe", and (c) "swwwwffe", where w denotes walking, p denotes picking up, and f denotes falling. Based on this behavior parsing, different movement sequences can be compared using a string matching scheme.

Fig. 11. Three kinds of movement sequence: (a) walking, (b) picking up, and (c) falling.

Assume A and B are two movement sequences whose string representations are S_A and S_B, respectively. We use the edit distance [30] between S_A and S_B to measure the dissimilarity between A and B. The edit distance between S_A and S_B is defined as the minimum number of edit operations required to change S_A into S_B. The operations include replacements, insertions, and deletions. For any two strings S_A and S_B, we define D(i, j) as the edit distance between the first i characters of S_A and the first j characters of S_B; that is, D(i, j) is the minimum number of edit operations needed to transform the first i characters of S_A into the first j characters of S_B. In addition, let δ(a_i, b_j) be a function whose value is 0 if a_i = b_j and 1 if a_i ≠ b_j. Then, we get D(0, 0) = 0, D(i, 0) = i, and D(0, j) = j. D(i, j) is calculated by the following recursive equation:

D(i, j) = min{ D(i, j-1) + 1, D(i-1, j) + 1, D(i-1, j-1) + δ(a_i, b_j) },    (13)

where the insertion, deletion, and replacement operations correspond, respectively, to the transitions from cell (i, j-1) to cell (i, j), from cell (i-1, j) to cell (i, j), and from cell (i-1, j-1) to cell (i, j).

Assume that V is a walking video clip whose string representation is "swwwwwwe". V differs from the sequence in Fig. 11(a) due to the scaling problem along the time axis. According to (13), the edit distance between V and the sequence in Fig. 11(a) is 3, while the edit distance between V and the sequences shown in Fig. 11(b) and (c) is 2. However, V is more similar to Fig. 11(a) than to Fig. 11(b) and (c). Clearly, (13) cannot be directly applied to human movement sequence analysis and must be modified. Thus, we define a new edit distance to address this problem.

Let c_I(i, j), c_R(i, j), and c_D(i, j) be the respective costs of the insertion, replacement, and deletion operations performed at the ith and jth characters of S_A and S_B. Then, (13) can be rewritten as

D(i, j) = min{ D(i, j-1) + c_I(i, j), D(i-1, j) + c_D(i, j), D(i-1, j-1) + c_R(i, j) }.    (14)

Here, the replacement operation is considered more important than insertion and deletion, since it indicates a change of posture type. Thus, the weights of insertion and deletion should be scaled down to η, where η < 1. Following this rule, if an insertion is found when calculating D(i, j), its weight will be η if a_i = b_j; otherwise, its weight will be 1. This implies that c_I(i, j) = η when a_i = b_j and c_I(i, j) = 1 otherwise. Similarly, for the deletion operation, we obtain c_D(i, j) = η when a_i = b_j and c_D(i, j) = 1 otherwise. If the insertion and deletion weights were unconditionally set to η, they would always be smaller than the weight of the replacement operation; therefore, it would be impossible to choose replacement as the next operation. This problem can be easily solved by scaling the weight down only when a_i = b_j, so that a_i and b_j are not compared when calculating D(i, j); however, they will be compared when calculating D(i+1, j+1). Since the same end symbol e appears in S_A and S_B, the final distance D(|S_A|, |S_B|) still accounts for every symbol of both strings. Thus, the delay in comparison will not cause an error; however, the cost of insertion and deletion will increase if any incorrect editing operations are chosen. Therefore, (14) should be modified as follows:

D(i, j) = min{ D(i, j-1) + c_I(i, j), D(i-1, j) + c_D(i, j), D(i-1, j-1) + δ(a_i, b_j) },    (15)

where c_I(i, j) = η if a_i = b_j and 1 otherwise, and c_D(i, j) = η if a_i = b_j and 1 otherwise. In this work, η is consistently set to 0.1 because we consider the influence of replacement to be more significant than that of insertion and deletion. In the experiments, we obtained satisfactory results with this setting. The algorithm of string matching that calculates the edit distance between S_A and S_B is summarized as follows.

Algorithm for String Matching:

Input: two strings S_A and S_B.

Output: the edit distance D(|S_A|, |S_B|) between S_A and S_B.

Step 1: Set D(0, 0) = 0.

Step 2: Set D(i, 0) = i and D(0, j) = j for all 1 ≤ i ≤ |S_A| and 1 ≤ j ≤ |S_B|.

Step 3: For all 1 ≤ i ≤ |S_A| and 1 ≤ j ≤ |S_B|, compute D(i, j) using (15).

Step 4: Return D(|S_A|, |S_B|).
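The following Python sketch implements the algorithm under our reconstruction of (15); the condition a_i = b_j for the reduced insertion and deletion costs is our reading of the text above.

def modified_edit_distance(sa, sb, eta=0.1):
    # Edit distance of (15): insertions/deletions cost eta when the current
    # characters match (time-scaling repeats) and 1 otherwise.
    m, n = len(sa), len(sb)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                       # Step 2: boundary conditions
    for j in range(1, n + 1):
        D[0][j] = j
    for i in range(1, m + 1):             # Step 3: fill the table using (15)
        for j in range(1, n + 1):
            match = sa[i - 1] == sb[j - 1]
            c_ins = eta if match else 1.0  # reduced cost for a delayed comparison
            c_del = eta if match else 1.0
            c_rep = 0.0 if match else 1.0
            D[i][j] = min(D[i][j - 1] + c_ins,
                          D[i - 1][j] + c_del,
                          D[i - 1][j - 1] + c_rep)
    return D[m][n]                        # Step 4

# With eta = 0.1, "swwwwwwe" is now closer to the walking string "swwwe"
# (distance 0.3) than to "swwppwwe" or "swwwwffe" (distance 2.0 each).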

VI. EXPERIMENTAL RESULTS

To test the efficiency and effectiveness of the proposed approach, we constructed a large database of more than thirty thousand postures, from which we obtained over three hundred human movement sequences. Twenty percent of the subjects in the database are female. In each sequence, we recorded a specific type of human behavior: walking, sitting, lying, squatting, falling, picking up, jumping, climbing, stooping, or gymnastics. Fig. 12 illustrates the skeleton-based posture recognition system. The left-hand side of the figure shows a person walking, while the right-hand side shows twelve pre-determined key postures for matching. Among the twelve key postures, which were determined by a clustering algorithm, the one surrounded by a bold frame matches the current posture of the query action sequence [see Fig. 12(b)]. Fig. 13 shows the centroid context-based posture recognition system. The current posture shown on the left-hand side matches the third key posture (surrounded by a bold frame) on the right-hand side of the figure [see Fig. 13(b)].

Fig. 12. Results of walking behavior analysis using the skeleton feature. The recognized posture is highlighted with a bold frame and shown in (b).

Fig. 13. Result of walking behavior analysis using the centroid context feature. The recognized type is highlighted with a bold frame and shown in (b).

In the first set of experiments, the performance of key posture selection was examined. Since there is no benchmark behavior database in the literature, twenty movement sequences (two sequences per type) were chosen to evaluate the accuracy of our proposed key posture selection algorithm. For each sequence, a set of key postures was manually selected to serve as the ground truth. Two postures were considered correctly matched if their distance was smaller than 0.1 [see (10)]. Let N_m and N_a be the numbers of key postures selected by a manual method and by our approach, respectively. We can then define the accuracy of key posture selection as follows:

accuracy = (1/2) ( N_a^c / N_a + N_m^c / N_m ),

where N_a^c is the number of key postures selected by our method, each of which matches one of the manually selected key postures, and N_m^c is the number of manually selected key postures, each of which correctly matches one of the automatically selected key postures. Fig. 14 shows the results of key posture selection from sequences of walking, squatting, and gymnastics. Due to space limitations, only 36 key postures are shown in Fig. 14. With this set of key postures, different movement sequences can be converted into a set of symbols and then recognized through a string matching method. In the experiments, the average accuracy of key posture selection was 91.9%.

Fig. 14. Result of key posture selection from three action sequences: walking, squatting, and gymnastics.

Fig. 15. Results of skeleton extraction using different sampling rates. The first row shows the skeletons extracted by connecting all triangle centroids, while the second row shows the skeletons extracted by connecting all branch nodes.

In the second set of experiments, we evaluated the performance of skeleton extraction using triangulation. Fig. 15 shows the results of skeleton extraction using different sampling rates. The top row illustrates the skeletons extracted by connecting all the triangle centroids, while the bottom row shows the skeletons extracted by connecting all the branch nodes. No matter which sampling rate was used, the skeletons were always extracted; however, a higher sampling rate achieved a better approximation. Fig. 16 shows another example of extracting a simple skeleton and a complex skeleton directly from a profile posture. In this paper, a simple skeleton means a skeleton derived by connecting all the branch nodes, and a complex skeleton is a skeleton derived by connecting the centroids of all the triangles. To classify a posture, we have to convert a skeleton into its corresponding distance map. Fig. 17(a) and (b) show the results of applying the distance transform to the skeletons shown in Fig. 16(a) and (b), respectively.
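A distance map can be obtained from a binary skeleton image with a standard Euclidean distance transform; a SciPy-based sketch is shown below (the chamfer transform of [22] would serve equally well), with the array convention as our assumption.

import numpy as np
from scipy.ndimage import distance_transform_edt

def skeleton_distance_map(skeleton):
    # skeleton: boolean image, True on skeleton pixels.
    # distance_transform_edt measures the distance to the nearest zero pixel,
    # so the skeleton is inverted before applying the transform.
    return distance_transform_edt(~np.asarray(skeleton, dtype=bool))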

Table I details the success rates of posture recognition when different numbers of meshes (resolutions) were used to extract both simple and complex skeletons. From the table, we observe that 1) the performance obtained using the complex skeleton representation was better than that derived using the simple skeleton and 2) the more meshes used for representation, the better the performance of posture recognition.

Fig. 16. Skeleton extraction directly from a posture profile. (a) Simple skeleton derived by connecting branch nodes and (b) complex skeleton derived by connecting all triangle centroids.

Fig. 17. Result of distance transform. (a) Distance map extracted from Fig. 16(a) and (b) distance map extracted from Fig. 16(b).

TABLE I. SUCCESS RATES OF POSTURE RECOGNITION WHEN DIFFERENT NUMBERS OF MESHES WERE USED FOR SKELETON EXTRACTION

We explain how the posture recognition process works with the following example. The top row of Fig. 18 shows a number of input postures represented by simple skeletons. Their corresponding binary maps are shown in the bottom row of the figure using dark shading. The associated silhouettes, shown in the second row, represent the corresponding posture types retrieved from the posture database. Fig. 19 compares the performance of the simple and complex skeletons in recognizing postures from different action (movement) types. From Fig. 19, it is obvious that the complex skeleton consistently outperforms the simple skeleton. This is because the simple skeleton loses more information than the complex skeleton; therefore, the complex skeleton performs better in terms of accuracy.

The next set of experiments was conducted to evaluate the performance of the centroid context feature. Since the results of this experiment are influenced by the numbers of predefined sectors and shells, we conducted six sets of experiments: (4, 15), (4, 20), (4, 30), (8, 15), (8, 20), and (8, 30), in which the first argument represents the number of shells and the second represents the number of sectors. Fig. 20 shows two types of polar partition. Table II lists the recognition rates when different (shell, sector) parameter sets were predefined and applied to the single centroid context and the multiple centroid context cases. In the former case, the center was set at the gravity center of a posture, but in the latter case, the center of each component was taken as the gravity center of every partitioned component. From the recognition rates reported in Table II, it is obvious that using multiple centroid contexts is definitely better than using a single centroid context. Therefore, in our behavior analysis system, the multiple centroid context is implemented and adopted for posture classification. Among the different parameter sets, we found that the best setting for the (shell, sector) pair was (4, 30). Therefore, we used this setting in all the experiments.

Fig. 18. Posture recognition results when the simple skeleton is used. The first row shows the skeletons of all query postures. The silhouettes in the second row show the corresponding posture types retrieved from the database.

Fig. 19. Success rates of posture recognition using the simple skeleton and the complex skeleton on different movement types.

Fig. 20. Different numbers of sectors and shells used to define the centroid context: (a) 4 shells and 15 sectors and (b) 8 shells and 30 sectors.

TABLE II. RECOGNITION RATES OF POSTURES WHEN DIFFERENT NUMBERS OF SECTORS AND SHELLS WERE USED

Fig. 21. Posture recognition results using multiple centroid contexts.

Fig. 22. Comparison of posture recognition accuracy for different behavior types.

Fig. 21 shows the results when multiple centroid contexts were adopted. The images in the top row show how multiple centroid contexts cover four different postures. In these experiments, we used only two centroid contexts to represent each posture. In the bottom row, we retrieved database postures (in silhouette form) and then superimposed their corresponding query postures (in shaded form). All the retrieved postures were actually very close to the query postures. Fig. 22 lists the recognition rates for different types of behavior when the multiple centroid context approach was adopted. For all movement types, the average retrieval accuracy rate was 88.9%.

Fig. 23. Posture recognition results using (from left to right) the CC, shape context, PPM, and Zernike moment methods, respectively. The shaded region shows the input query, and the silhouettes are the recognized posture types.

TABLE III. FUNCTIONAL COMPARISONS AMONG CENTROID CONTEXT, SHAPE CONTEXT, PPM, AND MOMENTS. K: NUMBER OF MESHES; v: NUMBER OF BODY PARTS; n: NUMBER OF BOUNDARY POINTS; (w, h): DIMENSION OF THE INPUT POSTURE; L: NUMBER OF MOMENT ORDERS. T: TRANSLATION; S: SCALING; R: ROTATION. IT IS NOTICED THAT n >> K > v

In another set of experiments, we compared our multiple centroid context approach with three other approaches, namely the shape context-based approach [27], the Zernike moment-based approach [29], and the probabilistic projection map-based (PPM) approach [2]. Fig. 23 illustrates the recognition results for a walking posture (the shaded region). The silhouettes in the figure, from left to right, represent the database postures retrieved by the multiple centroid context approach [see (10)], the shape context approach, the PPM approach, and the Zernike approach, respectively. The shape context approach and the PPM approach misclassified the posture type as "running." The Zernike approach correctly recognized the posture type as the walking mode; however, the recognized posture belonged to the next key posture of the same movement sequence. Table III lists the functional comparisons among these methods. The Zernike approach has several invariance properties, including rotation, translation, and scaling. However, the calculation of moments at different orders is very time consuming. The PPM approach is simple and efficient but not robust to noise. The shape context is robust to noise but has a higher time complexity. Our centroid context is robust to noise and can represent a posture up to a syntactic level. Due to its region-based nature, it can classify postures very efficiently. Fig. 24 compares the performance of the four approaches on ten behavior types. In terms of accuracy, our approach's performance was similar to that of the shape context approach. However, our approach is better at dealing with sequences that involve significant movements, such as climbing and walking. Most importantly, our method has a lower time complexity than the shape context method because it is a region-based descriptor, whereas the shape context approach is pixel-based.

Next, we conducted a series of experiments to evaluate movement sequence recognition using the proposed string matching scheme. We collected 450 movement sequences, divided equally among the ten behavior types (i.e., each behavior type had 45 movement sequences). The experiment results, listed in Table IV, show that most of the test sequences were correctly identified and classified into the appropriate categories. However, some of the squatting sequences tended to be misclassified as picking up due to the high degree of similarity between the two behaviors. In this set of experiments, the average accuracy of behavior analysis was 94.5%.

Fig. 24. Comparison of the performance of different approaches.

TABLE IV. ANALYSIS OF TEN BEHAVIOR TYPES USING OUR PROPOSED STRING MATCHING SCHEME

To test the proposed approach in real-world applications, we analyzed irregular (or suspicious) human movements captured by a surveillance camera. First, we extracted a set of normal key postures from video sequences to learn regular human movements like walking or running. After this training process, we took unknown surveillance sequences and determined whether the postures in them were regular or not. If irregular postures appear continuously, an alarm message triggers a security warning. For example, in Fig. 25(a), a set of normal key postures was extracted from a walking sequence. From these postures, we can tell that the posture in Fig. 25(b) is a normal posture. However, the posture in Fig. 25(c) is classified as abnormal, since the person adopted a bending-down posture. We used a marker to label such abnormal posture sequences until they returned to normal. Fig. 26 shows another two cases of abnormal posture detection. The postures in Fig. 26(a) and (c) were recognized as normal, since they were similar to the postures in Fig. 25(a). However, the postures in Fig. 26(b) and (d) were classified as abnormal, since the subjects adopted "shooting" and "climbing wall" postures. The abnormal posture detection function improves a video surveillance system in two ways. First, it saves a great deal of storage space because only abnormal postures need to be stored. Second, it responds immediately when an abnormal posture is identified.

Fig. 25. Abnormal activity detection. (a) Five key postures defining several normal human movements. (b) A normal condition is detected. (c) A warning message is triggered when an abnormal posture is detected.

Fig. 26. Abnormal posture detection. (a) and (c): Normal postures were detected. (b) and (d): Abnormal postures were detected due to the unexpected "shooting" and "climbing wall" postures, respectively.
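The alarm logic described above might be sketched as follows; the distance threshold, the run length, and all names are our illustrative choices, with d_total and the key posture set taken from the earlier sections.

def monitor(postures, key_postures, d_total, dist_thresh=0.1, run_len=5):
    # Flag frames whose posture matches no normal key posture; raise an alarm
    # when abnormal postures persist for run_len consecutive frames.
    abnormal_run = 0
    for t, p in enumerate(postures):
        is_normal = any(d_total(p, k) < dist_thresh for k in key_postures)
        abnormal_run = 0 if is_normal else abnormal_run + 1
        if abnormal_run == run_len:       # irregular postures appear continuously
            print("frame %d: security warning triggered" % t)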

VII. CONCLUSION

We have proposed a method for analyzing human movements directly from videos. Specifically, the method comprises the following:

1) A triangulation-based technique that extracts two important features, the skeleton feature and the centroid context feature, from a posture to derive more semantic meaning. The features form a finer descriptor that can describe a posture from the shape of the whole body or from body parts. Since the features complement each other, all human postures can be compared and classified very accurately.

2) A clustering scheme for key posture selection, which recognizes and codes human movements using a set of symbols.

3) A novel string-based technique for recognizing human movements. Even though events may involve different temporal scaling, they can still be recognized.

The average accuracies of posture classification and behavior analysis were 95.16% and 94.5%, respectively. The experiment results show that the proposed method is superior to other approaches for analyzing human movements in terms of accuracy, robustness, and stability.

REFERENCES

[1] T. B. Moeslund and E. Granum, "A survey of computer vision-based human motion capture," Comput. Vis. Image Understand., vol. 81, no. 3, pp. 231-268, Mar. 2001.

[2] R. Cucchiara, C. Grana, A. Prati, and R. Vezzani, "Probabilistic posture classification for human-behavior analysis," IEEE Trans. Syst., Man, Cybern. A, vol. 35, no. 1, pp. 42-54, Jan. 2005.

[3] I. Haritaoglu, D. Harwood, and L. S. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 809-830, 2000.

[4] S. Maybank and T. Tan, "Introduction to special section on visual surveillance," Int. J. Comput. Vis., vol. 37, no. 2, p. 173, 2000.

[5] I. Mikic et al., "Human body model acquisition and tracking using voxel data," Int. J. Comput. Vis., vol. 53, no. 3, pp. 199-223, 2003.

[6] N. Werghi, "A discriminative 3D wavelet-based descriptor: Application to the recognition of human body postures," Pattern Recognit. Lett., vol. 26, no. 5, pp. 663-677, 2005.

[7] I.-C. Chang and C.-L. Huang, "The model-based human body motion analysis system," Image Vis. Comput., vol. 18, no. 14, pp. 1067-1083, 2000.

[8] H. Fujiyoshi, A. J. Lipton, and T. Kanade, "Real-time human motion analysis by image skeletonization," IEICE Trans. Inform. Syst., vol. E87-D, no. 1, pp. 113-120, Jan. 2004.

[9] N. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 831-843, Aug. 2000.

[10] R. Rosales and S. Sclaroff, "3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Jun. 1999, vol. 2, pp. 637-663.

[11] Y. Ivanov and A. Bobick, "Recognition of visual activities and interactions by stochastic parsing," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 852-872, Aug. 2000.

[12] S. Weik and C. E. Liedtke, "Hierarchical 3D pose estimation for articulated human body models from a sequence of volume data," in Proc. Int. Workshop on Robot Vision, Auckland, New Zealand, Feb. 2001, pp. 27-34.

[13] C. R. Wren et al., "Pfinder: Real-time tracking of the human body," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 7, pp. 780-785, Jul. 1997.

[14] J. Shewchuk, "Delaunay refinement algorithms for triangular mesh generation," Comput. Geometry: Theory Applicat., vol. 22, pp. 21-74, May 2002.

[15] S. Park and J. K. Aggarwal, "Segmentation and tracking of interacting human body parts under occlusion and shadowing," in Proc. IEEE Workshop on Motion and Video Computing, Orlando, FL, 2002, pp. 105-111.

[16] S. Park and J. K. Aggarwal, "Semantic-level understanding of human actions and interactions using event hierarchy," in Proc. 2004 Conf. Computer Vision and Pattern Recognition Workshop, Washington, DC, Jun. 2004.

[17] Y. Guo, G. Xu, and S. Tsuji, "Understanding human motion patterns," in Proc. IEEE Int. Conf. Pattern Recognition, 1994, vol. 2, pp. 325-329.

[18] Y. Guo, G. Xu, and S. Tsuji, "Tracking human body motion based on a stick figure model," J. Vis. Commun. Image Represent., vol. 5, 1994.

[19] J. Ben-Arie, Z. Wang, P. Pandit, and S. Rajaram, "Human activity recognition using multidimensional indexing," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 8, pp. 1091-1104, Aug. 2002.

[20] S. X. Ju, M. J. Black, and Y. Yacoob, "Cardboard people: A parameterized model of articulated image motion," in Proc. Int. Conf. Automatic Face and Gesture Recognition, Killington, VT, 1996, pp. 38-44.

[21] E. Horowitz, S. Sahni, and S. Anderson-Freed, Fundamentals of Data Structures in C. New York: W. H. Freeman, 1993.

[22] G. Borgefors, "Distance transformations in digital images," Comput. Vis., Graph., Image Process., vol. 34, no. 3, pp. 344-371, Feb. 1986.

[23] M. Bern and D. Eppstein, "Mesh generation and optimal triangulation," in Computing in Euclidean Geometry, 2nd ed. Singapore: World Scientific, 1995, pp. 47-123.

[24] C. Stauffer and E. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Machine Intell., vol. 22, no. 8, pp. 747-757, 2000.

[25] D. Chetverikov and Z. Szabo, "A simple and efficient algorithm for detection of high curvature points in planar curves," in Proc. 23rd Workshop of the Austrian Pattern Recognition Group, 1999, pp. 175-184.

[26] L. P. Chew, "Constrained Delaunay triangulations," Algorithmica, vol. 4, no. 1, pp. 97-108, 1989.

[27] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 4, pp. 509-522, Apr. 2002.

[28] Y. Rui, A. C. She, and T. S. Huang, "Modified Fourier descriptors for shape representation: a practical approach," in Proc. 1st Int. Workshop on Image Databases and Multi-Media Search, Amsterdam, The Netherlands, Aug. 1996.

[29] A. Khotanzad and Y. H. Hong, "Invariant image recognition by Zernike moments," IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 5, pp. 489-497, May 1990.

[30] E. Ukkonen, "On approximate string matching," in Proc. Int. Conf. Foundations of Computation Theory, 1983, pp. 487-495.

[31] M. J. Swain and D. H. Ballard, "Color indexing," Int. J. Comput. Vis., vol. 7, no. 1, pp. 11-32, 1991.

Jun-Wei Hsieh (M'06) received the B.S. degree in computer science from Tunghai University, Taiwan, R.O.C., in 1990 and the Ph.D. degree in computer engineering from National Central University, Taiwan, in 1995.

From 1996 to 2000, he was a Research Fellow at the Industrial Technology Research Institute, Hsinchu, Taiwan, where he managed a team that developed video-related technologies. He is presently an Associate Professor with the Department of Electrical Engineering, Yuan Ze University, Chung-Li, Taiwan. His research interests include content-based multimedia databases, video indexing and retrieval, computer vision, and pattern recognition.

Dr. Hsieh received the Best Paper Awards from the Image Processing and Pattern Recognition Society of Taiwan in 2005, 2006, and 2007, and the Distinguished Research Award from Yuan Ze University in 2006. He is a recipient of the Phi Tau Phi Award.

Yung-Tai Hsu was born in Kaohsiung, Taiwan, R.O.C., in 1978. He received the B.S. and M.S. degrees in mechanical engineering from Yuan Ze University, Taiwan, in 2000 and 2002, respectively. He is currently pursuing the Ph.D. degree in electrical engineering at Yuan Ze University, Chung-Li, Taiwan.

His research interests include pattern recognition, computer vision, and machine learning.

Mr. Hsu received the Best Paper Award at the 2006 IPPR Conference on Computer Vision, Graphics, and Image Processing.


Hong-Yuan Mark Liao (SM'01) received the B.S. degree in physics from National Tsing-Hua University, Hsinchu, Taiwan, R.O.C., in 1981, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University, Evanston, IL, in 1985 and 1990, respectively.

He was a Research Associate with the Computer Vision and Image Processing Laboratory, Northwestern University, in 1990-1991. In July 1991, he joined the Institute of Information Science, Academia Sinica, Taipei, Taiwan, as an Assistant Research Fellow. He was promoted to Associate Research Fellow and then Research Fellow in 1995 and 1998, respectively. From August 1997 to July 2000, he served as the Deputy Director of the institute. From February 2001 to January 2004, he was the Acting Director of the Institute of Applied Science and Engineering Research. He is jointly appointed as a Professor in the Computer Science and Information Engineering Department of National Chiao-Tung University. His current research interests include multimedia signal processing, video surveillance, and content-based multimedia retrieval.

Dr. Liao is the Editor-in-Chief of the Journal of Information Science and Engineering. He is on the Editorial Boards of the International Journal of Visual Communication and Image Representation, the EURASIP Journal on Advances in Signal Processing, and Research Letters in Signal Processing. He was an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA from 1998 to 2001. He received the Young Investigators' Award from Academia Sinica in 1998; the Excellent Paper Award from the Image Processing and Pattern Recognition Society of Taiwan in 1998 and 2000; the Distinguished Research Award from the National Science Council of Taiwan in 2003; and the National Invention Award in 2004. He served as the Program Chair of the International Symposium on Multimedia Information Processing (ISMIP'97), the Program Co-Chair of the Second IEEE Pacific-Rim Conference on Multimedia (2001), the Conference Co-Chair of the Fifth IEEE International Conference on Multimedia and Expo (ICME 2004), and the Technical Co-Chair of the Eighth IEEE ICME (2007).

Chih-Chiang Chen was born in Taipei, Taiwan, R.O.C., in 1982. He received the B.S. degree in electrical engineering from Yuan Ze University, Taiwan, in 2004, where he is currently pursuing the Ph.D. degree in electrical engineering.

His research interests include image processing, behavior analysis, and video surveillance.