
Action Recognition based on Local Space-Time Features

CHRISTIAN SCHÜLDT

Master's Degree Project
Stockholm, Sweden 2004

TRITA-NA-E04054

Department of Numerical Analysis and Computer Science (Numerisk analys och datalogi)
KTH, Royal Institute of Technology
SE-100 44 Stockholm, Sweden

CHRISTIAN SCHÜLDT

TRITA-NA-E04054

Master's Thesis in Computer Science (20 credits) at the School of Electrical Engineering,
Royal Institute of Technology, year 2004.
Supervisors at Nada were Barbara Caputo and Ivan Laptev.

Examiner was Jan-Olof Eklundh

Action Recognition based on Local Space-Time Features

Abstract
This thesis presents a novel action recognition approach using a video representation in the form of spatio-temporal interest points in combination with Support Vector Machine (SVM) classification. By adapting the local neighborhood of the interest points, we construct a representation that is invariant to the spatial scale, temporal frequency and relative motion of the camera. Moreover, the interest point representation eliminates the need for the segmentation and tracking required by many other related approaches, and is still robust with respect to different subjects, lighting conditions and scale variations.

In the area of classification, SVM has recently, due to its good generalization capabilities, proven to outperform simpler classifiers such as Nearest Neighbor Classification (NNC). However, one problem arises when using the interest point representation for classification: the representation consists of sets of vectors, whereas classifiers generally operate on single vectors. A solution to this problem has been proposed and can easily be implemented together with SVM, because of its kernel-based nature.

To evaluate the performance of this method we develop an extensive video database, containing 2391 sequences of 25 subjects performing six actions in four different conditions against a homogeneous background. In our experiments on this database, our method shows promising results, almost consistently outperforming the related approach we use for benchmarking. We expect the method to be robust when classifying sequences with a moving camera and heterogeneous background.

Rörelseigenkänning med lokala kännetecken (Action Recognition with Local Features)

Summary
This Master's project investigates a new action recognition method based on spatio-temporal interest points together with Support Vector Machine (SVM) classification. Adapting the region around each interest point enables a representation that is invariant with respect to spatial scale, temporal frequency and camera motion. Moreover, in contrast to other action recognition methods, the interest point representation requires neither segmentation nor tracking. Despite this, the representation remains robust with respect to variation in subject, lighting conditions and scale.

In the area of classification, SVM has, owing to its good generalization properties, been shown to perform better than simpler algorithms such as Nearest Neighbour Classification (NNC). One problem that arises when classifying the interest point representation is that it usually consists of a set of vectors, whereas classification methods generally operate on single vectors. A solution to this problem has previously been proposed and can easily be implemented together with SVM.

To evaluate the performance of the action recognition method, a video database has been compiled. The database contains 2391 sequences of 25 subjects performing six different actions in four different scenarios against a homogeneous background. In the experiments carried out on this database, the method shows promising results by almost consistently performing better than the other method used for comparison. Furthermore, the method is also expected to be robust when classifying sequences with a moving camera and heterogeneous background.

Preface

This thesis is the result of a Master's project in Computer Science, carried out at the Computational Vision and Active Perception Laboratory (CVAP) research group at the Department of Numerical Analysis and Computer Science (NADA) at KTH in Stockholm, Sweden.

I would specifically like to thank my supervisors Barbara Caputo and Ivan Laptev for their support throughout the project. I would also like to thank everyone who volunteered to help extend the video database by allowing me to film them.

Contents

1 Introduction
  1.1 Motivation
  1.2 Contribution of this Thesis
  1.3 Outline

2 Background
  2.1 Related Work
    2.1.1 Tracking
    2.1.2 Dense Methods
    2.1.3 Statistical Methods
    2.1.4 Local Features
    2.1.5 Classification
  2.2 Our Strategy

3 Representation of Image- and Video-data
  3.1 Introduction
  3.2 Scale-space
    3.2.1 Scale Selection
  3.3 Spatial Interest Points
  3.4 Spatio-temporal Interest Points
    3.4.1 Scale and Velocity Adaptation
  3.5 Video Representation and Matching
    3.5.1 Local Features
    3.5.2 Histograms of Local Features
    3.5.3 Histograms of Spatio-temporal Gradients
    3.5.4 Discussion
  3.6 Summary

4 Classification of Image- and Video-data
  4.1 Introduction
  4.2 Nearest Neighbor Classification
  4.3 Support Vector Machines
    4.3.1 Hyper-plane Classifier with Two Linearly Separable Classes
    4.3.2 Non-separable Classes
    4.3.3 Non-linear Classifiers
  4.4 Multi-class SVM
  4.5 One-class SVM
  4.6 Using SVM with Local Features
    4.6.1 Local Features Kernel
  4.7 Summary

5 Experiments
  5.1 Introduction
  5.2 Video Database
  5.3 Experiments
    5.3.1 The SVM Kernel
    5.3.2 Multi-class
    5.3.3 Two-class SVM
  5.4 Overall Discussion
  5.5 Summary

6 Conclusions
  6.1 Conclusions
  6.2 Future Work

References

A Additional Results for Multi-class SVM & NNC

B Additional Results for Two-class SVM

Chapter 1

Introduction

1.1 Motivation

Recognition of different actions is a key component in many computer vision systems. An example could be a human-computer interface where the user controls the computer (fully or partly) through different visual commands such as hand and arm movements (for example, Akyol and Canzler [AC02] show an info terminal which is controlled by sign language). Another example is surveillance, where it might be crucial to detect certain actions and activities that can be interpreted as unwanted and/or illegal. An example of this is [CLK+00], who present an autonomous system used for monitoring activity in a parking lot and for airborne surveillance of battlefields. Other examples are traffic-related applications, such as "smart cars" detecting traffic signs, pedestrians [XF02, VJS03] and other vehicles.

Perhaps one of the most challenging aspects of this type of recognition is to construct a representation that is both stable under various transformations and discriminative with respect to different classes of objects and motions. Hence, in order to get robust recognition we want similar measurements from the data independently of the viewing angle, the lighting conditions, whether it is snowing, etc.

A basic problem in computer vision originates from the fact that natural 3-D objects are projected onto 2-D images. This projection causes a loss of information. What information is lost depends on the relative position and angle between the viewer (camera) and the object/action we are considering. One solution is to use multiple images with slight translations of the view.

Given several images of the same rigid scene it is often possible to automatically find corresponding points in these images and perform a 3-D reconstruction of the scene. A tremendous amount of research has been done in this field during recent decades ([VVF+03, DTM96], just to mention a few).

Other problems that occur are uncontrollable factors such as shadows, varying lighting, background clutter and occlusions. Shadows and occlusions might conceal vital information in the scene, and the lighting can cause phenomena such as specular effects, affecting the overall appearance of the scene. Background clutter can also affect appearance, by dissolving distinct contours. Unfortunately, these problems are very difficult and there exist no straightforward solutions.

More specific to motion-based recognition are the problems of variation in temporal frequency (consider fast/slow walking, for example) and changes in relative camera velocity, such as a moving background or locomotive actions. Attempts to compensate for these difficulties have included camera stabilization techniques and segmentation, where for example the scene is separated into foreground and background. This segmentation tends to work well for simple scenes, but runs into problems when the background is complex. Another suggested method is to use local feature based descriptors, such as interest points, which capture complex non-constant motion changes. By scale and velocity adaptation, this representation can be made invariant to scale changes and camera movements.

Local feature representations typically consist of a set of vectors, where each vector represents local neighborhood properties. This becomes a problem when attempting classification, as classification methods assume single vectors as input. Recently, different types of Support Vector Machines (SVM) solving this problem have been presented [WCG03, WS03], and they show interesting results for spatial classification. SVM is a classification method which has been gaining popularity due to its ability to correctly classify unseen data, as opposed to methods such as Nearest Neighbor Classification (NNC) (see Chapter 4).

1.2 Contribution of this Thesis

In this thesis we consider some of the previously described problems. In order to avoid the use of tracking and segmentation, we adopt a spatio-temporal interest point representation. Adaptation is then used to obtain invariance to scale and first-order camera motion. The final step is a combination of the spatio-temporal interest point representation and an SVM classifier, using a local features kernel. It is this combination that defines the novelty of our method.

Unfortunately, when it comes to performance comparison of different methods, there are significantly fewer databases for action recognition than there are for, for example, object recognition (see Section 5.2).

This leads us to the two main contributions:

• A novel approach to action recognition, using a local feature representation and SVM classification.

• A database of human actions, containing about 2400 video sequences of six different human actions performed by 25 subjects in four different scenarios.

We evaluate the performance of the local features method by comparing it to a previously used global method based on spatio-temporal gradient histograms. The evaluation is performed on the developed database.

1.3 Outline

The rest of the thesis is organized as follows. Chapter 2 begins with a short overview of related work in the field of action recognition and then continues with a section about the strategy we use. This is followed by a theory chapter (Chapter 3) briefly describing the concept of scales, spatial and spatio-temporal interest points, and adaptation with respect to scale and velocity. The chapter ends with a description of the three representation methods used in the experiments. The next chapter is also a theory chapter, explaining the classification techniques used in the experiments, starting with a section about NNC followed by SVM and the extensions to non-separable classes, non-linear classification, multi-class and so-called one-class SVM, ending with a section on the use of SVM with local features. Chapter 5 explains the experimental setup, our video database, results and discussion. Finally, we present conclusions and future work in Chapter 6, and in Appendices A and B we present the results that, for the sake of readability, were left out of Chapter 5.

Chapter 2

Background

2.1 Related Work

There exist numerous approaches to action recognition. The purpose of this chapter is to give a short overview of these methods, to present their main ideas as well as to describe their advantages and disadvantages.

2.1.1 Tracking

Tracking methods follow the location of image features over time. The tracked entity can be a whole person, an object (such as vehicles [Avi01]) or different parts of an articulated body. Yacoob and Black [YB98] track the motion of body parts over time and use Principal Component Analysis (PCA) of their translation for representation. Bregler [Bre97] also tracks selected body parts, but uses different levels of abstraction where Hidden Markov Models (HMMs) are used to model motion at a high level. HMMs have shown promising results in the field of acoustic speech recognition and interpretation, and have the advantage of being computationally efficient, with possibilities of real-time implementation.

Tracking approaches often suffer from the problem of re-initialization, i.e. if the tracking gets lost it needs automatic or manual re-initialization. Sullivan and Carlsson [SC02] use pose estimation of key-frames to solve the automatic re-initialization problem. Moreover, tracking assumes that the appearance does not change over time.

2.1.2 Dense Methods

Another popular class of methods are dense methods, or template matching. The templates can be constructed in many ways, where one popular method is using optical flow. Efros et al. [EBM+03] use optical flow to form spatio-temporal motion descriptors which are used for nearest neighbor classification. Davis and Bobick [DB97] use motion templates constructed from binary cumulative motion images, which they test on different aerobics exercises. They argue that since humans are very good at distinguishing different motions from very blurred data, extracting motion information must be more relevant than appearance for action recognition. A similar approach is taken by Masoud and Papanikolopoulos [MP03], who use a recursive-filtering technique to create what they call feature images. As this technique involves image differencing, it removes static background fairly well. However, for complex scenes with moving background, such as traffic scenes with vehicles moving in different directions, the performance is likely to be worse.

Optical flow methods often use PCA to decompose the flow field into a set of basis flow fields [YB98, BYJ+97]. The idea is that all flow fields corresponding to the actions of interest can be approximated by linear combinations of the basis flow fields. The actions are then generally represented by a number of basis flow field coefficients. Related work has been done by Hoey and Little [HL00], decomposing flow fields in terms of Zernike polynomial coefficients instead of using PCA.

As optical flow methods rely on segmentation or the use of some window, it is possible to get erroneous estimates if the segmentation is inaccurate. Moreover, the optic flow does not always correspond to the motion field (consider a rotating so-called "barber pole", where the stripes appear to be moving along the axis of rotation). There is also the aperture problem, constraining the optical flow to be determined only along the image gradient.

2.1.3 Statistical Methods

This class of methods relies on statistical properties, where histograms are perhaps most common. Histogram based methods are widely used for spatial recognition. For example, Chang and Krumm [CK99] use color co-occurrence histograms for finding objects in images, and Schiele and Crowley [SC00] use multi-dimensional histograms of Gaussian filter responses.

For spatio-temporal recognition, there are methods using histograms of optical flow [FB00] and local intensity gradients [ZI01], as well as multi-dimensional histograms of spatio-temporal filter responses [CC99]. The advantage of histogram based representation methods is that they provide much information in a very compact representation, if the dimensionality of the histogram is low. However, the size of the representation data grows exponentially with the dimension. Additionally, the histograms only capture statistics, as opposed to, for example, template matching, where positional information (that is, where in the sequence the movement occurs) is used.

2.1.4 Local Features

For interpretation of static images, many recent methods focus on local features. Such features have stable positions and descriptive neighborhoods in images [MS01, MS02] and have proven to be successful for object recognition [FPZ03], wide baseline matching [VVF+03] and image retrieval [SM97]. One of the most popular methods for detecting local features is the Harris detector [HS88]. It detects points in the image where the variation of the direction of the gradient is high, which is typically the case at corners.

The adaptation of interest points with respect to scale and affine transformations is relevant due to possible viewpoint variations, to which invariance is desired. In the light of this, Laptev and Lindeberg [LL03] extend the idea of interest points into the spatio-temporal domain and use it for matching walking patterns in scenes with complex background.

One disadvantage of local features is the sparse representation; hence not all available information is used. Since local methods focus only on certain parts of the data, there is always a risk of missing vital information.

2.1.5 Classification

Support Vector Machine (SVM) classification, based on statistical learning theory and introduced by Boser et al. [BGV92], has recently gained popularity due to its good generalization capabilities, shown both by theory and by experimental results. For spatial recognition there exist numerous SVM approaches, for example using various histograms [CHV99] and local features [WCG03]. SVM based techniques have also been used for tracking, for example by Zhu et al. [ZWR+01]. Their basic idea consists of matching interest points in two consecutive images. In an iterative manner, using SVM regression, interest points that are considered outliers are then removed to make the matching more robust. A different tracking approach using SVM is taken by Avidan [Avi01], who uses SVM in combination with optical flow to track vehicles in traffic situations.

While SVM and other classifiers traditionally assume vectors as input, Wallraven et al. [WCG03] and Wolf and Shashua [WS03] present different approaches to classifying sets of vectors, which are common for local features. They both evaluate their kernels for object and face recognition purposes and show promising results. Although Wolf and Shashua [WS03] do not use local features in their experiments, their kernel could still be used for this purpose.

Viola et al. [VJS03] claim to be the first to use both motion based and appearance based detectors. They also detect pedestrians, but do not use tracking. Instead they apply rectangle filters to both images and difference images, thus utilizing both motion and appearance.

2.2 Our Strategy

The goal of this thesis was to develop a method that combines local space-time features [LL03] with a novel SVM classification approach [WCG03] and apply it to the problem of recognizing human actions in video sequences. To the best of our knowledge, this has not been done before. The theory behind the interest point representation together with scale and velocity adaptation is explained in Sections 3.4 and 3.4.1. Use of the representation together with an SVM classifier is described in Section 4.6.

Chapter 3

Representation of Image- and Video-data

3.1 Introduction

Computer vision applications commonly have to handle real world images and/or video sequences, influenced by variations in lighting, occlusions, different camera angles, etc. This emphasizes the importance of constructing stable representations that reduce the influence of external conditions. What is desired is to suppress the irrelevant information and emphasize the discriminative properties. For example, consider recognition of apples in images. We want the apple representation to focus on the properties that are discriminative for apples (such as color and shape), and not on irrelevant things such as background texture, shadows and lighting.

Representation methods can roughly be divided into two categories: global and local methods. Global methods take the whole image or video sequence into account. Examples of global methods are color or grey-level histograms computed over an entire image, or histograms of the space-time gradients computed for video sequences (see Section 3.5.3).

Local methods, on the other hand, focus on certain parts of the image or video data. When comparing an image or a sequence to another, these different parts are commonly matched to the parts of the other image or sequence. This gives rise to an issue referred to as the correspondence problem, which is addressed in Section 3.5.1.

Local techniques show several advantages, such as robustness with respect to partial occlusions and the possibility of local adaptation to scale, velocity and lighting variations. The ability to handle different parts of a scene individually is a significant advantage, as scale, velocity, lighting and occlusions tend to vary strongly within the scene when considering natural images and video sequences.

In this chapter we describe the theory behind the construction of different image and video representations. In Section 3.2 we briefly explain the concept of scale-spaces. This is followed by Section 3.2.1, which addresses scale selection. The detection of spatial interest points using the Harris operator is presented in Section 3.3. Section 3.4 then continues with spatio-temporal interest points. Adaptation of interest points, with respect to scale and velocity, is explained in Section 3.4.1. The chapter ends with Section 3.5, which describes three different types of representation that we use for video sequences.

3.2 Scale-space

The concept of scale is very important to the field of computer vision. For example, an image of a leaf or a Mandelbrot set may look very different depending on the scale at which we observe it. At a finer scale, the information is very detailed, whereas at a coarser scale fine lines and details are not visible. What is preferred is entirely dependent on the application, as different scales contain different information (compare the left and right images of Figure 3.1, where the right image is calculated at a finer scale and the left at a coarser one). Automatic scale selection makes it possible to automatically select which scale we should work at, and thus enables the construction of scale-invariant data representations.

Figure 3.1. Images from the Mandelbrot set at different scales.

Scale-space is a large subject and there is extensive research and theory behind it (for a review, we refer the reader to [Lin94a] and references therein). Scale-spaces can generally be divided into two categories: linear and non-linear. A non-linear scale-space can be constructed by, for example, applying morphological transformations (erosion, dilation, opening, closing etc.).

A linear scale-space L of a multidimensional signal f(x) : ℝ^n → ℝ can be constructed by the convolution

L(x, t) = \int_{ℝ^n} f(ξ) \, g(ξ − x, t) \, dξ,    (3.1)

where g(x, t) is an n-dimensional Gaussian kernel

g(x, t) = \frac{1}{(2πt)^{n/2}} \, e^{−\frac{x · x}{2t}}    (3.2)

and t is a continuous scale parameter. A large t implies a coarser scale and a small t implies a finer scale.

In fact, it can be shown [Lin94b] that the Gaussian is the only kernel that can be used to create a linear, separable scale-space that satisfies natural axioms such as shift invariance (independence of the position in the image), isotropy (independence of direction), causality (details at coarser scales must have "a cause" at finer scales) and non-enhancement of local extrema. It can also be shown [Lin94b] that convolution with a Gaussian is equivalent to the solution of the diffusion equation

∂_t L = \frac{1}{2} ∇^2 L,
L(x, 0) = f(x).

There is one more important issue to address: until now we have only been dealing with continuous signals, whereas computers handle discrete data. The problem is that, when dealing with discrete data, differentiation is an ill-posed problem. However, using the fact that convolution and the derivative operator commute, we can avoid the problem by taking the derivative of the Gaussian before discretization and convolving:

g(x, t) ∗ f′(x) \xrightarrow{\mathcal{F}} G(ω) \, iω F(ω) = iω \, G(ω) F(ω) \xrightarrow{\mathcal{F}^{−1}} g′(x, t) ∗ f(x).
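As an illustration of the two points above (smoothing with Gaussians of increasing scale, and differentiating by convolving with Gaussian derivatives), the following is a minimal sketch using scipy; the image and the list of scales are illustrative stand-ins, not values used in the thesis.

```python
# Minimal sketch: linear scale-space and Gaussian derivatives via scipy.
# The image `f` and the scale list are illustrative, not taken from the thesis.
import numpy as np
from scipy.ndimage import gaussian_filter

f = np.random.rand(128, 128)          # stand-in for a grey-level image f(x)
scales = [1.0, 2.0, 4.0, 8.0]         # scale parameters t (variances)

scale_space = {}
for t in scales:
    sigma = np.sqrt(t)                # t is the variance, sigma the std. dev.
    # L(x, t) = (g(., t) * f)(x): smoothing with a Gaussian of variance t
    scale_space[t] = gaussian_filter(f, sigma=sigma)

# Derivatives are obtained by convolving with derivatives of the Gaussian
# (order=1 along the chosen axis), i.e. g'(x, t) * f(x), rather than
# differentiating the discrete data directly.
Lx = gaussian_filter(f, sigma=np.sqrt(scales[0]), order=(0, 1))
Ly = gaussian_filter(f, sigma=np.sqrt(scales[0]), order=(1, 0))
```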

3.2.1 Scale Selection

As real world images tend to contain objects of different sizes (due to perspective transformations), it is often desirable to be able to analyze and compare images independently of the size of the objects in the images. A scale-space representation of the images can be used in order to achieve this scale invariance. However, we need a way to automatically decide which scale we should work at for each image. Lindeberg [Lin98] introduces a method for automatic scale selection by finding maxima of differential expressions formulated in terms of γ-normalized derivatives

∂_{x,γ-norm} = t^{γ/2} ∂_x.

When using "regular" derivatives and scaling, the absolute values of the minima and maxima in the image decrease unconditionally with scale (this complies with the non-enhancement of local extrema mentioned in the previous section). However, when using the γ-normalized derivatives, the absolute values of the extrema peak at a certain scale and then decrease. It is shown in the work of Lindeberg [Lin98] that for a sinusoidal signal, the scale at which we get a peak is proportional to the wavelength of the signal. Furthermore, it is shown that for natural images, the scale at which we get a peak roughly corresponds to the size of the corresponding object in the image.
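A minimal sketch of this scale-selection idea, assuming γ = 1 and using the scale-normalized Laplacian as the differential expression; the image and the scale range are illustrative.

```python
# Sketch of automatic scale selection: pick, per pixel, the scale where the
# normalized Laplacian |t * (Lxx + Lyy)| attains its maximum (gamma = 1).
# Image and scale range are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_laplace

f = np.random.rand(128, 128)
scales = np.array([1.0, 2.0, 4.0, 8.0, 16.0])    # variance-like scales t

responses = []
for t in scales:
    sigma = np.sqrt(t)
    # gaussian_laplace computes (Lxx + Lyy); multiply by t for normalization
    responses.append(np.abs(t * gaussian_laplace(f, sigma=sigma)))
responses = np.stack(responses)                  # shape: (n_scales, H, W)

selected_scale = scales[np.argmax(responses, axis=0)]   # per-pixel best scale
```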

The idea of a Gaussian scale-space can be extended to spatio-temporal signals f(x) : ℝ^2 × ℝ_+ → ℝ [LL03]. For this purpose we can use the Gaussian kernel

g(x, y, t; σ^2, τ^2) = \frac{1}{\sqrt{(2π)^3 σ^4 τ^2}} \, e^{−\left( \frac{x^2 + y^2}{2σ^2} + \frac{t^2}{2τ^2} \right)},    (3.3)

with a spatial variance σ^2 and a temporal variance τ^2, as the spatial and temporal extents of different events generally are independent.

Laptev and Lindeberg [LL03] demonstrate a scale selection technique in space-time using the Laplacian, by studying the scale-space representation L for a prototype event represented by a spatio-temporal Gaussian blob. The second order derivatives (L_xx, L_yy and L_tt) assume local extrema over space and time at the center of the blob. Appropriate selection of the γ-normalization parameters also gives local extrema at scales corresponding to the spatial and temporal scale of the blob. From the sum of the second order derivatives, they define a normalized spatio-temporal Laplace operator

∇^2_{norm} L = L_{xx,norm} + L_{yy,norm} + L_{tt,norm} = σ^2 τ^{1/2} (L_{xx} + L_{yy}) + σ τ^{3/2} L_{tt}.    (3.4)

3.3 Spatial Interest Points

Local methods, as briefly discussed at the beginning of this chapter, rely on certain parts of the image. To find these interesting parts, we can compute what are known as interest points.

An interest point can be described as a point in the image where there are significant local variations of the image values (grey-scale or any type of color representation) in all directions (x, y), hence high information content. These interest points can then be used as a compact description of the image, by using a descriptor for each interest point which can contain various information about the point and its neighborhood (such as position, scale, derivatives etc.).

One method to detect interest points is the Harris detector

H = det(µ) − k trace^2(µ),    (3.5)

where k is a constant and µ is the second moment matrix

µ = \begin{pmatrix} µ_{11} & µ_{12} \\ µ_{12} & µ_{22} \end{pmatrix} = g(x, y, σ_i^2) ∗ \begin{pmatrix} L_x^2 & L_x L_y \\ L_x L_y & L_y^2 \end{pmatrix}.    (3.6)

Here σ_i corresponds to the integration scale and is defined as σ_i = s σ_l, where s is yet another constant. Furthermore, the derivatives L_x and L_y are defined as

L_x = ∂_x (g(x, y, σ_l^2) ∗ f),
L_y = ∂_y (g(x, y, σ_l^2) ∗ f),    (3.7)

and g is a two-dimensional Gaussian kernel (Equation 3.2). To detect interest points, positive maxima of H over (x, y) are searched for. These positive maxima correspond to image points with high eigenvalues (λ_max, λ_min) of µ, indicating significant variations of f in both image directions [HS88].

Consider the eigenvalue ratio α = λ_max/λ_min. Here α = 1 indicates a perfect isometric corner, while larger ratios correspond to "stretched" or "skewed" corners. If we rewrite Equation 3.5 in terms of the eigenvalues of µ we get

H = λ_max λ_min − k (λ_max + λ_min)^2.

Using the restriction H > 0 and inserting the eigenvalue ratio α, we end up with

k ≤ \frac{α}{(α + 1)^2}.

It follows that α ≥ 1 leads to k ≤ 1/4, where k = 1/4 defines a detector that is non-sensitive, detecting only corners with λ_max = λ_min. As k decreases, the detector becomes more sensitive.
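A minimal sketch of this spatial Harris detector (Equations 3.5-3.7); the test image, the scales σ_l and s, and the constant k are illustrative choices rather than values from the thesis.

```python
# Sketch of the spatial Harris detector (Equations 3.5-3.7).
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def harris(f, sigma_l=1.0, s=2.0, k=0.04):
    # Derivatives at the local scale sigma_l
    Lx = gaussian_filter(f, sigma_l, order=(0, 1))
    Ly = gaussian_filter(f, sigma_l, order=(1, 0))
    # Second moment matrix entries, smoothed at the integration scale s*sigma_l
    sigma_i = s * sigma_l
    mu11 = gaussian_filter(Lx * Lx, sigma_i)
    mu12 = gaussian_filter(Lx * Ly, sigma_i)
    mu22 = gaussian_filter(Ly * Ly, sigma_i)
    # H = det(mu) - k * trace(mu)^2
    return mu11 * mu22 - mu12 ** 2 - k * (mu11 + mu22) ** 2

f = np.random.rand(128, 128)                      # stand-in image
H = harris(f)
# Interest points: positive local maxima of H
corners = np.argwhere((H == maximum_filter(H, size=5)) & (H > 0))
```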

3.4 Spatio-temporal Interest Points

The idea of interest points has been extended from the spatial domain to the spatio-temporal domain by Laptev and Lindeberg [LL03]. Hence, the spatio-temporal interest points represent significant local variation in both spatial directions and in the temporal direction. It can be seen as searching for "3-D corners" in the (x, y, t) space of the sequence. We get these corners where there are sudden changes of motion; for example, if we have a sequence of an arm waving up and down, we typically detect interest points where the arm motion changes direction. Figure 3.2 illustrates detected interest points in two different video sequences, and Figure 3.3 shows interest points detected in a walking sequence together with a 3-D plot of the leg motion.

Figure 3.2. Selected frames from two different sequences ((a) hand waving, (b) walking) with overlaid circles marking detected spatio-temporal interest points. The radii of the circles illustrate the selected spatial scales of the detected features.

In order to detect spatio-temporal interest points, we modify the approach for detecting spatial interest points described in the previous section. A separable scale-space, constructed by convolution with the Gaussian in Equation 3.3 (where σ_l^2 is the spatial variance and τ_l^2 is the temporal variance), is used. Equations 3.5, 3.6 and 3.7 are then rewritten as

H = det(µ) − k trace^3(µ),    (3.8)

µ = \begin{pmatrix} µ_{11} & µ_{12} & µ_{13} \\ µ_{12} & µ_{22} & µ_{23} \\ µ_{13} & µ_{23} & µ_{33} \end{pmatrix} = g(x, y, t; σ_i^2, τ_i^2) ∗ \begin{pmatrix} L_x^2 & L_x L_y & L_x L_t \\ L_x L_y & L_y^2 & L_y L_t \\ L_x L_t & L_y L_t & L_t^2 \end{pmatrix},    (3.9)

and

L_x = ∂_x (g(x, y, t; σ_l^2, τ_l^2) ∗ f),
L_y = ∂_y (g(x, y, t; σ_l^2, τ_l^2) ∗ f),
L_t = ∂_t (g(x, y, t; σ_l^2, τ_l^2) ∗ f).    (3.10)

Positive maxima of H (large eigenvalues of µ) are searched for, as for the spatial interest points. Selection of the constant k is also very similar. Setting k = 1/27 makes the detector non-sensitive (detecting only isometric corners), and a smaller k makes the detector more sensitive.
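A minimal sketch of the spatio-temporal extension (Equations 3.8-3.10), computing H over an (x, y, t) volume; the test sequence, the smoothing scales and the value of k are illustrative assumptions.

```python
# Sketch of the spatio-temporal Harris function (Equations 3.8-3.10): the same
# second moment matrix, now 3x3 over (x, y, t).
import numpy as np
from scipy.ndimage import gaussian_filter

def st_harris(f, sigma_l=1.5, tau_l=1.5, s=2.0, k=0.005):
    sig = (tau_l, sigma_l, sigma_l)                 # (t, y, x) smoothing scales
    Lt = gaussian_filter(f, sig, order=(1, 0, 0))
    Ly = gaussian_filter(f, sig, order=(0, 1, 0))
    Lx = gaussian_filter(f, sig, order=(0, 0, 1))
    sig_i = tuple(s * v for v in sig)               # integration scales
    m = lambda a, b: gaussian_filter(a * b, sig_i)  # smoothed products
    mu = np.stack([np.stack([m(Lx, Lx), m(Lx, Ly), m(Lx, Lt)], -1),
                   np.stack([m(Lx, Ly), m(Ly, Ly), m(Ly, Lt)], -1),
                   np.stack([m(Lx, Lt), m(Ly, Lt), m(Lt, Lt)], -1)], -2)
    # H = det(mu) - k * trace(mu)^3, evaluated point-wise
    return np.linalg.det(mu) - k * np.trace(mu, axis1=-2, axis2=-1) ** 3

f = np.random.rand(30, 64, 64)                      # (t, y, x) stand-in sequence
H = st_harris(f)                                    # interest points: positive maxima of H
```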

Figure 3.3. Spatio-temporal interest points detected for a walking person: (a) 3-D plot of a spatio-temporal leg motion (upside down) and corresponding features (in black); (b) features overlaid on selected frames of a sequence.

3.4.1 Scale and Velocity Adaptation

It is often desirable to have a representation that is invariant to the size in the image and to camera movement, as described previously in this chapter; in other words, a representation that is scale and velocity invariant. Let us first consider adaptation of scale.

The idea of detecting spatial interest points which are both maxima of the Harris function in the spatial domain and extrema of the Laplacian over scale, as proposed by Mikolajczyk and Schmid [MS01], is extended to detecting spatio-temporal interest points which are maxima of the Harris function (Equation 3.8) over space and time and extrema of the Laplacian over both spatial and temporal scales (Equation 3.4).

As this means finding extrema over five dimensions (x, y, t, σ^2, τ^2), which can be cumbersome if done in a straightforward manner, the process is split into a two-step iterative process. We first detect interest points which are maxima of the Harris function over a few sparsely selected scales. Then we track each interest point by iteratively calculating the value of the Laplacian for the point and its neighborhood and then re-detecting the interest point (using the Harris function) until convergence has been reached.

Regarding the velocity adaptation, we assume that the camera movement can be approximated as Galilean motion

G = \begin{pmatrix} 1 & 0 & v_x \\ 0 & 1 & v_y \\ 0 & 0 & 1 \end{pmatrix},    (3.11)

where v_x and v_y are the velocities in the directions of the x- and y-axis respectively (it can be seen as an affine transformation over time). We can estimate the velocities and then cancel their effects. For this purpose, we extend the previously described scale-space to a Galilean scale-space, defined by convolution with the non-uniform Gaussian kernel

g(x, Σ) = \frac{1}{(2π)^{n/2} \sqrt{det(Σ)}} \, e^{−\frac{x^T Σ^{−1} x}{2}}.    (3.12)

The difference between the kernels in (Equation 3.2) and (Equation 3.12) is the covariance matrix Σ, which in this case is a 3 × 3 matrix that describes the "skew" of the Gaussian. If Σ is a diagonal matrix, (Equation 3.12) corresponds to (Equation 3.2); hence there is no skew.

When undergoing a linear transformation (the Galilean motion in this particular case), Σ transforms as Σ′ = GΣG^T. To cancel this effect we can use the covariance matrix Σ′′ = G^{−1}Σ′G^{−T}. For this, we need an estimate of G and hence estimates of v_x and v_y. To estimate the velocities we use the motion constraint equation [HS99]

(∇L)^T v + L_t = 0    (3.13)

and assume that the velocities are constant over a small region. This can be written as a least-squares formulation where we want to minimize the squared error

e^2 = \int_{−∞}^{∞} w(x − x′) \left( (∇L)^T v + L_t \right)^2 dx′,    (3.14)

in which w(x) is a weighting function (the Gaussian in this case) and v = (v_x, v_y)^T. We want to tune our velocity parameters v_x and v_y so that the squared error e is minimized. This corresponds to the solution of the following equation system [HS99]

\begin{pmatrix} µ_{11} & µ_{12} \\ µ_{12} & µ_{22} \end{pmatrix} \begin{pmatrix} v_x \\ v_y \end{pmatrix} = −\begin{pmatrix} µ_{13} \\ µ_{23} \end{pmatrix}.    (3.15)

Estimates of the velocities v_x and v_y, however, depend on the covariance matrix Σ′′. To solve this properly we use an iterative approach. First we compute Σ′′ with some initial G. Then we compute the µ-matrix and obtain velocity estimates v_x and v_y. We use those estimates to update the G matrix. G is in turn used to calculate a new Σ′′, and so on, until we either reach convergence or assume divergence.

We can view the whole process in the following way: at first we start with a uniform Gaussian, since we do not know anything about the velocities. Then, for each iteration, we estimate the velocities and compensate for them using Σ′′. This process converges to a Σ′′ that cancels the Galilean motion.
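A minimal sketch of a single velocity-estimation step (Equations 3.13-3.15); for simplicity the averages are taken over the whole sequence rather than locally around each interest point, the data and smoothing scales are illustrative, and the sign of the right-hand side follows from minimizing the squared error in (3.14).

```python
# Sketch of one velocity-estimation step: solve the 2x2 system (3.15) built
# from weighted products of spatio-temporal derivatives.
import numpy as np
from scipy.ndimage import gaussian_filter

f = np.random.rand(20, 64, 64)                        # (t, y, x) stand-in sequence
sig = (1.5, 1.5, 1.5)
Lt = gaussian_filter(f, sig, order=(1, 0, 0))
Ly = gaussian_filter(f, sig, order=(0, 1, 0))
Lx = gaussian_filter(f, sig, order=(0, 0, 1))

w = lambda a: gaussian_filter(a, sig)                 # Gaussian weighting w(x)
A = np.array([[w(Lx * Lx).mean(), w(Lx * Ly).mean()],
              [w(Lx * Ly).mean(), w(Ly * Ly).mean()]])
b = -np.array([w(Lx * Lt).mean(), w(Ly * Lt).mean()])  # sign from minimizing (3.14)

vx, vy = np.linalg.solve(A, b)                        # least-squares velocities
```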

By combining scale selection with velocity adaptation, the iterative approach for detecting adapted space-time interest points can be summarized as follows:

1. Detect interest points (using a uniform Gaussian kernel) over a few sparsely distributed scales.
2. For each interest point:
3.   Compute the Laplacian at the current and neighboring scales.
4.   Estimate the velocities.
5.   If the Laplacian for the current scale < the Laplacian for a neighboring scale AND not scale convergence:
       set current scale = neighboring scale,
     else:
       scale convergence = true.
6.   If the velocities > ε AND not velocity convergence:
       recompute Σ′′ using the estimated velocities,
     else:
       velocity convergence = true.
7.   If not velocity convergence OR not scale convergence:
       re-detect interest points, select the interest point closest to the previous one, and go to 3.
8. End.

3.5 Video Representation and Matching

Using the theory presented above, we construct three different types of video representations: local features (LF), histograms of local features (HistLF) and histograms of spatio-temporal gradients (HistSTG). Each method is described below.

3.5.1 Local Features

The representation by local features is constructed by first detecting spatio-temporal interest points (as described in the previous sections) at a few initial scales. The iterative scale and velocity adaptation scheme described in Section 3.4.1 is then used to estimate a scale and velocity for each interest point. A Gaussian filter with a covariance matrix corresponding to the estimated scale and velocity is then applied locally at each interest point. A descriptor for each interest point is defined as a vector of normalized spatio-temporal derivatives (a local jet)

l = (L_x, L_y, L_t, L_{xx}, ..., L_{ttt}),    (3.16)

with

L_{x^m y^n t^k} = σ^{m+n} τ^k (∂_{x^m y^n t^k} g) ∗ f.    (3.17)

We use this representation as the derivatives describe the intensity function well locally, which is a desired property.
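A minimal sketch of computing such a local jet at one detected interest point, using scale-normalized Gaussian derivatives; the sequence, the point, the scales and the maximum derivative order (here 2, whereas the jet above goes to higher order) are illustrative assumptions.

```python
# Sketch of a local jet descriptor (Equations 3.16-3.17): scale-normalized
# Gaussian derivatives sampled at one interest point.
import numpy as np
from scipy.ndimage import gaussian_filter
from itertools import product

f = np.random.rand(30, 64, 64)             # (t, y, x) stand-in sequence
sigma, tau = 2.0, 2.0                      # adapted spatial / temporal scales
point = (15, 32, 32)                       # detected interest point (t, y, x)

jet = []
for k, n, m in product(range(3), repeat=3):          # derivative orders in t, y, x
    if 0 < k + n + m <= 2:                           # up to 2nd order here
        L = gaussian_filter(f, (tau, sigma, sigma), order=(k, n, m))
        jet.append(sigma ** (m + n) * tau ** k * L[point])   # normalization
descriptor = np.array(jet)
```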

When comparing different video sequences, we compare the descriptors of the individual interest points detected in the sequences. For this comparison, different distance measures can be applied (see Chapter 4 and particularly Section 4.6). One problem with the interest point comparison is to determine which interest points should be matched. This introduces an issue known as the correspondence problem.

Correspondence Problem

The correspondence problem is the problem of matching an interest point descriptor in sequence h to a corresponding interest point descriptor in sequence k. For example, if we have an interest point detected around the foot region of a walking person in sequence h, we want this to be matched to another foot-region interest point in sequence k (see Figure 3.4), if there is any.

Figure 3.4. The correspondence problem: which features should be matched?

The matching can be tackled in two ways: one-to-many matching, where an interest point descriptor in h can be matched to several interest point descriptors in k, or one-to-one matching, where only pairs of descriptors are matched. One-to-many matching, which is used by Wallraven et al. [WCG03] and described in Section 4.6.1, is simpler to implement than one-to-one matching, although the one-to-one approach is probably more intuitive.

There are several algorithms for one-to-one matching, generally divided into the categories greedy and non-greedy. A greedy method finds the pair of interest point descriptors that correspond best, records their correspondence, removes them, finds the next pair of descriptors that correspond best, and so on; a sketch of such a matcher is given below. While being an easy approach, it does not always provide optimal matching [KSG01]. Non-greedy methods generally perform better, but are more complex. For examples of non-greedy algorithms, such as the Hungarian method or different hybrid methods, the interested reader is referred to [KSG01].
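A minimal sketch of the greedy one-to-one matcher just described; the descriptor arrays are random stand-ins and the Euclidean distance is an illustrative choice of correspondence measure.

```python
# Sketch of greedy one-to-one matching of interest point descriptors between
# two sequences h and k.
import numpy as np
from scipy.spatial.distance import cdist

desc_h = np.random.rand(12, 34)            # descriptors from sequence h
desc_k = np.random.rand(15, 34)            # descriptors from sequence k

D = cdist(desc_h, desc_k)                  # pairwise descriptor distances
matches = []
while np.isfinite(D).any():
    i, j = np.unravel_index(np.argmin(D), D.shape)   # best remaining pair
    matches.append((i, j))
    D[i, :] = np.inf                                  # remove matched row
    D[:, j] = np.inf                                  # and matched column
```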

3.5.2 Histograms of Local Features

K-means clustering is used to divide the descriptors l (Equation 3.16) into K different clusters. Repeating events give similar interest points, which in turn yield well-populated clusters [LL03]. Sporadic events, on the other hand, give rise to less populated clusters. The frequencies of features in each cluster for a particular sequence are then used to define a feature histogram H = (h_1, ..., h_K) that represents any given sequence. As histograms have proven to be robust and are widely used in many different forms [CHV99, SZ03, BMP00], we have chosen to use this representation.

Because local features correspond to distinctive parts of a video sequence, the k-means clustering and histogram form an intuitive representation, as it groups the distinctive parts and uses the frequency of occurrence of the different events. For example, if a sequence contains many detected interest points around the leg area, it probably contains some sort of walking, jogging or running.
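A minimal sketch of the HistLF construction: cluster all descriptors with k-means and represent each sequence by its cluster-occupancy histogram. The descriptor dimensionality, the number of clusters K and the data are illustrative.

```python
# Sketch of HistLF: k-means over all jet descriptors, then per-sequence
# histograms of cluster frequencies.
import numpy as np
from sklearn.cluster import KMeans

K = 8
all_descriptors = np.random.rand(500, 34)             # jets from all sequences
kmeans = KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_descriptors)

def histlf(sequence_descriptors):
    """Histogram H = (h_1, ..., h_K) of cluster occupancies for one sequence."""
    labels = kmeans.predict(sequence_descriptors)
    h = np.bincount(labels, minlength=K).astype(float)
    return h / h.sum()                                # normalized frequencies

H = histlf(np.random.rand(40, 34))                    # one sequence's descriptors
```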

3.5.3 Histograms of Spatio-temporal Gradients

For the HistSTG representation, the method presented by Zelnik-Manor and Irani [ZI01] is used as the benchmark method. A temporal pyramid, containing video sequences sub-sampled (in the temporal domain) at a few different scales, is constructed. For each point (x, y, t) a local intensity gradient (L_x, L_y, L_t) is calculated. Furthermore, the absolute value of the gradient is used and the magnitude is normalized to 1:

(N_x, N_y, N_t) = \frac{(|L_x|, |L_y|, |L_t|)}{\sqrt{L_x^2 + L_y^2 + L_t^2}}.

A histogram for each scale and gradient component (x, y, t) is finally calculated. The relevance of this representation is argued by Zelnik-Manor and Irani [ZI01], who state that the gradient direction depends mostly on properties related to movement, whereas the gradient magnitude tends to describe spatial appearance.
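A minimal sketch of this benchmark representation, under the assumptions that the pyramid is built by simple temporal sub-sampling and that each normalized component is binned into a fixed number of histogram bins; the sequence, bin count and pyramid depth are illustrative.

```python
# Sketch of HistSTG: histograms of normalized absolute gradient components,
# one per component and temporal scale.
import numpy as np

def hist_stg(f, n_bins=32, n_scales=3):
    features = []
    for s in range(n_scales):
        g = f[::2 ** s]                                 # temporal sub-sampling
        Lt, Ly, Lx = np.gradient(g.astype(float))       # local intensity gradient
        mag = np.sqrt(Lx ** 2 + Ly ** 2 + Lt ** 2) + 1e-8
        for comp in (np.abs(Lx), np.abs(Ly), np.abs(Lt)):
            h, _ = np.histogram(comp / mag, bins=n_bins, range=(0, 1))
            features.append(h / h.sum())
    return np.concatenate(features)                     # global descriptor

f = np.random.rand(32, 64, 64)                          # (t, y, x) stand-in sequence
descriptor = hist_stg(f)
```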

3.5.4 Discussion

Of the three described methods, two (LF and HistLF) are local methods and one (HistSTG) is global and based on dense measurements. The local methods are expected to perform better for scenes with scale variations and relative camera movement, due to the adaptation described in Section 3.4.1. Moreover, they are expected to be more robust with respect to background clutter, occlusions and lighting variations. A disadvantage is that their representation is sparse; hence they capture only certain parts of the video sequence (as opposed to global methods like HistSTG). The local methods enable the detection of multiple actions in one scene, which is not possible for dense methods such as HistSTG. However, the LF representation introduces the correspondence problem, which is not an issue for the histogram based representations HistLF and HistSTG.

3.6 Summary

In this chapter we have addressed different representation methods for visual data. We have argued that local methods have a theoretical advantage over global methods, and selected the widely used Harris detector to detect interest points, due to its simplicity and yet high repeatability [MS03, LL04]. Adaptation with respect to scale and velocity has been discussed in order to obtain invariance to first-order camera motion and scale variations, and a rough outline of an algorithm for such adaptation has been presented. We have also described the three representation methods used for the experiments, of which two use local features and one uses global features. Finally, we have addressed the correspondence problem (the problem of matching features in one sequence to features in another sequence).

Chapter 4

Classification of Image- and Video-data

4.1 Introduction

In the previous chapter we discussed representation. We will now continue with the use of this representation for classification with a Support Vector Machine (SVM) classifier [CS00] and a Nearest Neighbor Classifier [LI93]. Both classifiers fall into the category of supervised learning, which means that both data and the desired output are provided during training. Other categories of classification algorithms are unsupervised learning, where the desired output is not provided during training, and reinforcement learning, where the desired output is also not provided during training, but instead some sort of reward/punishment system is used ([DHS01] and the references therein).

The outline of this chapter is as follows. We start with a brief description of NNC (Section 4.2) and continue with SVM in Section 4.3. The issues of non-separable classes (Section 4.3.2) and non-linear classification (Section 4.3.3) are then addressed. As the SVM method is originally a binary classifier, we have a short discussion of extensions to multi-class problems and one-class SVM in Sections 4.4 and 4.5 respectively. We then end with the kernel for local features used with SVM in Section 4.6.

4.2 Nearest Neighbor Classification

Assume we have N input vectors x_i ∈ ℝ^m, each vector belonging to a class y_i, where i = 1, ..., N. Figure 4.1 shows the setup, where we illustrate a two-dimensional example (m = 2) and circles and crosses illustrate the class of each input vector. The problem is now to determine y_{N+1} of a new input vector x_{N+1}, given the classes y_i, i = 1, ..., N, of the previous input vectors. The Nearest Neighbor Classifier assigns the class of the new vector to be the same as that of its nearest neighbor. Thus, y_{N+1} = y_j where j = arg min_i { d(x_i, x_{N+1}) }, i = 1, ..., N, and d(x, y) is the distance between the vectors x and y.

Figure 4.1. 3-NN classification. The square shows the vector we wish to classify and d_1, d_2 and d_3 are the distances to the nearest neighbors.

This can be extended to k-NNC, where the k nearest neighbors are used for classification. We then assign the new vector to the class to which the majority of the k nearest neighbors belong. Hence, k should be an odd number.

The advantage of NNC is that it is easy to implement and can be computationally fast, if the dimensionality of the input vectors is low, the number of input vectors is small and k is small. It is probably due to these factors that it is still widely used in visual pattern recognition [EBM+03, BMP00, ZI01, MP03]. However, as NNC minimizes the training error (empirical risk), it generally performs worse than structural risk minimizers (such as SVM) [Sch00]. Also, as the dimensionality of the input space grows, the implicit mapping to a feature space by a kernel^1 can drastically reduce the computation time for SVM, whereas the computation time for NNC rapidly increases.
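A minimal sketch of k-NN classification on single-vector inputs as described above; the training data and the value of k are illustrative.

```python
# Sketch of k-NN classification by majority vote among the k nearest neighbors.
import numpy as np

def knn_classify(X_train, y_train, x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # d(x_i, x_new)
    nearest = np.argsort(dists)[:k]                   # indices of k nearest
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                # majority vote

X_train = np.random.rand(100, 5)
y_train = np.random.randint(0, 2, size=100)
label = knn_classify(X_train, y_train, np.random.rand(5))
```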

4.3 Support Vector Machines

SVMs, originating from statistical learning theory, are known for their good generalization capabilities. This means minimization of the structural risk; in other words, minimizing the risk of misclassifying new, unlabeled data (given the current training set). Systems that minimize the training error, such as feed-forward neural networks, can suffer from overfitting. This means that the network learns the training examples too well, and loses its ability to generalize.

^1 The topic of kernels and feature space mapping is discussed in Section 4.3.3.

4.3.1 Hyper-plane Classifier with Two Linearly Separable Classes

Consider the same setup as for NNC in Section 4.2, but now a situation with two linearly separable classes (see Figure 4.2). The number of input vectors is N. We let the class of each input vector x_i be y_i ∈ {−1, 1}, where i = 1, ..., N.

Figure 4.2. The classes (circles and crosses) are separated by the hyper-plane with normal w. The two dotted lines show the planes (w · x) + d = 1 and (w · x) + d = −1. The geometrical margin γ is the distance between these two planes. We can also see that there are two vectors from the cross class and one vector from the circle class which lie on the margin. These vectors are called support vectors. Note that it is only the support vectors that affect the parameters of the hyper-plane; all other vectors are unimportant to the SVM.

A hyper-plane (w · x) + d = 0, where w is the normal and d is the distance from the origin, is used for the separation, which gives the classifier

y_i = sgn((w · x_i) + d).    (4.1)

We now take a look at the geometric margin and the so-called functional margin. The geometric margin is defined as the geometric distance from the closest input vector(s) to the hyper-plane. The geometric distance from a point x_i to a hyper-plane can be written as

d(w, d, x_i) = \frac{|(w · x_i) + d|}{||w||}.

The geometric margin then becomes γ = d(w, d, x_i) + d(w, d, x_j), where x_i is the input vector (or one of the vectors) belonging to class 1 (y_i = 1) that is closest to the hyper-plane, and x_j is the input vector (or one of the vectors) belonging to class 2 (y_j = −1) that is closest to the hyper-plane.

Furthermore, we introduce the functional margin. The functional distance from a point x_i to a hyper-plane is defined as |(w · x_i) + d|. Since the plane (the parameters w and d) can be rescaled without changing the geometric properties (hence, we have a degree of freedom), we can use this to choose an appropriate functional distance. A canonical hyper-plane is a hyper-plane that has the functional distance to the closest point(s) equal to 1 (|(w · x_i) + d| = 1 for all i representing points closest to the hyper-plane).

The defined functional distance to the closest point(s) is inserted into the equation for the geometric margin. We get the resulting geometric margin

γ = \frac{|(w · x_i) + d|}{||w||} + \frac{|(w · x_j) + d|}{||w||} = \frac{2}{||w||}.

We want to maximize this function, which means maximizing the distance to the closest points, which in turn is equivalent to minimizing

\frac{1}{2} ||w||^2.    (4.2)

However, we also have some restrictions. The restrictions are that all the points of one class should lie on one side of the hyper-plane and all points belonging to the other class should lie on the other side, hence

y_i((w · x_i) + d) ≥ 1  ∀i.    (4.3)

The problem can now be formulated as an optimization problem with the goal function (Equation 4.2) and restrictions (Equation 4.3):

minimize_{w,d}  \frac{1}{2} ||w||^2
subject to  y_i((w · x_i) + d) − 1 ≥ 0  ∀i.    (4.4)

We note that the goal function is convex. Since the constraints are linear, they form a convex region; hence, the problem is convex. This means that there is only one local optimum, and the problem can be solved with Quadratic Programming (QP). This is a significant advantage over feed-forward neural networks trained with steepest descent, where the problem may have several local optima which we may end up in, depending on the starting position.

For reasons that will become apparent later (when introducing the non-linear SVM in Section 4.3.3), the problem will be further modified. From optimization theory, it is given that, for this primal problem, we can formulate the dual problem, which has the same optimum as the primal one (under certain conditions) [HL01]. The dual problem is formulated as: maximize the Lagrangian

L(w, d, α) = \frac{1}{2} ||w||^2 − \sum_{i=1}^{N} α_i (y_i(w · x_i + d) − 1)    (4.5)

with respect to α under the conditions ∂_w L(w, d, α) = 0, ∂_d L(w, d, α) = 0 and α_i ≥ 0 ∀i. Here α = (α_1, ..., α_N) are Lagrange multipliers. We can rewrite

∂_w L(w, d, α) = 0  ⟹  w = \sum_{i=1}^{N} y_i α_i x_i,    (4.6)

∂_d L(w, d, α) = 0  ⟹  \sum_{i=1}^{N} y_i α_i = 0,    (4.7)

and insert (Equation 4.6) and (Equation 4.7) into the expression for the Lagrangian (Equation 4.5), which (after some simplifications) becomes

L(w, d, α) = \sum_{i=1}^{N} α_i − \frac{1}{2} \sum_{i,j=1}^{N} α_i α_j y_i y_j (x_i · x_j).    (4.8)

Now we can write the simplified goal function and restrictions for the dual problem:

maximize_α  \sum_{i=1}^{N} α_i − \frac{1}{2} \sum_{i,j=1}^{N} α_i α_j y_i y_j (x_i · x_j)
subject to  \sum_{i=1}^{N} α_i y_i = 0,
            α_i ≥ 0  ∀i.    (4.9)

To get the resulting classifier, we insert the expression for w (Equation 4.6) into our original classifier (Equation 4.1), which gives

y_j = sgn\left( \sum_{i=1}^{N} y_i α_i (x_j · x_i) + d \right).

Here d can be obtained through (Equation 4.3), using an i that corresponds to a support vector. We know that x_i is a support vector if the corresponding α_i > 0. Then the constraint becomes

y_i((w · x_i) + d) = 1,

because the point lies exactly on the margin. By observing the optimization problem from a more "physical" point of view we can get an intuitive insight. Assume that the goal function forms some kind of "landscape" and the constraints are walls, forming a region within this landscape. If we drop a ball in this region, it will roll down the "hills" of this landscape until it reaches a wall (or several walls forming a corner) or a local minimum (a hole in the landscape). This can be seen as a minimization of the goal function. Since the goal function has only one local minimum (as stated earlier), which obviously is w = 0 but is never a valid solution because of the restrictions, there must be at least one wall constraining the ball from rolling further down. These walls (constraints) are the support vectors. Solving for d and inserting the expression for w (Equation 4.6) gives

d = y_i − \sum_{k=1}^{N} y_k α_k (x_k · x_i).

When going from this linear separation to a non-linear one in Section 4.3.3, the fact that the input vectors x appear only in dot products with other input vectors (x_i · x_j) is very important.

4.3.2 Non-separable Classes

In the previous section we assumed linearly separable classes (hard margin SVM). However, in "real life" applications we probably will not have linearly separable data. Also, there is no quick way of telling whether we have linearly separable data or not, so we have to modify the solution into what is called soft margin SVM. To do this, slack variables ξ = (ξ_1, ..., ξ_N) are introduced in the primal optimization problem (Equation 4.4), still using a linear separation (a hyper-plane), resulting in

minimize_{w,d}  \frac{1}{2} ||w||^2 + C \sum_{i=1}^{N} ξ_i
subject to  y_i((w · x_i) + d) − 1 + ξ_i ≥ 0  ∀i,
            ξ_i ≥ 0  ∀i.    (4.10)

The slack variables relax the restrictions, making it possible for some points to be on the wrong side of the plane. However, we naturally want to keep these slack variables small, and therefore a term for them is added to the goal function. C is a positive constant that defines the trade-off between minimizing the slack variables and maximizing the margin. In real life applications, this constant is often set by trial-and-error: a number of different values of C are tested and the one which gives the best result is used.
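A minimal sketch of this trial-and-error selection of C via cross-validated grid search; the data and the candidate values of C are illustrative, and scikit-learn is assumed for the SVM implementation.

```python
# Sketch of selecting the soft-margin constant C by cross-validated grid search.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(60, 10)                 # stand-in training data
y = np.random.randint(0, 2, size=60)

search = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=3)
search.fit(X, y)
best_C = search.best_params_["C"]          # value of C that generalized best
```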

Now, what we would like to do is to formulate the dual problem, as in the previous section. This can be done in a very similar way, using the Lagrangian and derivatives etc. (the interested reader is referred to [Sch00]). The result is very similar to Equation (4.9), with one difference: the restrictions (α_i ≥ 0 ∀i) change to (0 ≤ α_i ≤ C ∀i):

maximize_α  \sum_{i=1}^{N} α_i − \frac{1}{2} \sum_{i,j=1}^{N} α_i α_j y_i y_j (x_i · x_j)
subject to  \sum_{i=1}^{N} α_i y_i = 0,
            0 ≤ α_i ≤ C  ∀i.    (4.11)

The expression for the classifier is almost the same as for the hard margin classifier. There is a slight difference in the way d is computed, due to the slack variables in the constraints. Derivation of this d is straightforward.

4.3.3 Non-linear Classifiers

Until now we have been using a hyper-plane as the separator for the two classes. Why not use some other surface? Since a more complex surface means introducing more variables to optimize (more "knobs to turn") and thus making the problem much harder to solve, we can instead manipulate the input data. What we do is a transformation from the input space ℝ^m to the so-called feature space F where the classes are (ideally) linearly separable. We can write this transformation as Φ : ℝ^m → F (see Figure 4.3 for an example). F can be of either higher or lower dimension than the input space, but normally it is of higher dimension. Now comes the beauty of having the input vectors x appear only in dot products with other input vectors (x_i · x_j). Instead of doing the transformation into feature space explicitly and then calculating the dot product (Φ(x_i) · Φ(x_j)), we can do the transformation into feature space implicitly using a so-called kernel K(x_i, x_j), which takes care of both the transformation and the dot product. This kernel trick (instead of first mapping to feature space and then calculating the dot product) can save a tremendous amount of computation time [BGV92].

Figure 4.3. (left) One-dimensional example with two classes; linear separation is not possible. (right) The one-dimensional data mapped to a two-dimensional feature space with Φ(xi) = (xi1, xi1^2), enabling linear separation.

There are two ways of deriving the kernel: either directly (not bothering about the Φ mapping) or from a specified Φ. Consider the second way, where we have

Φ(xi) = (1, √2 xi1, √2 xi2, √2 xi1 xi2, xi1^2, xi2^2),   where xi = (xi1, xi2).

Hence, a mapping from 2 to 6 dimensions (the first component is constant, so we are really only "using" 5 dimensions). Writing out the definition of the kernel gives

K(xi, xj) = Φ(xi) · Φ(xj)
          = 1 + 2 xi1 xj1 + 2 xi2 xj2 + 2 xi1 xi2 xj1 xj2 + (xi1 xj1)^2 + (xi2 xj2)^2
          = ((xi · xj) + 1)^2.

Obviously, calculating K(xi, xj) = ((xi · xj) + 1)^2, instead of explicitly calculating the transformation Φ for both points and then the dot product, is much more efficient. This can of course be extended to an arbitrary polynomial kernel

K(xi, xj) = ((xi · xj) + θ)^d,

where θ and d are design parameters to choose. Examples of other interesting, frequently used kernels are [ACS01]

- Gaussian RBF: K(xi, xj) = exp(−||xi − xj||^2 / (2σ^2))

- Sigmoid: K(xi, xj) = tanh(κ(xi · xj) + θ)

- Inverse multi-quadratic: K(xi, xj) = 1 / sqrt(||xi − xj||^2 + c^2)
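As a quick numerical check of the kernel trick above, the following sketch (assuming NumPy; all names and the test points are illustrative) verifies that the polynomial kernel ((xi · xj) + 1)^2 gives the same value as the explicit mapping Φ followed by a dot product.

    import numpy as np

    def phi(x):
        # Explicit feature map for 2-D input: R^2 -> R^6 (see the derivation above).
        x1, x2 = x
        return np.array([1.0,
                         np.sqrt(2) * x1,
                         np.sqrt(2) * x2,
                         np.sqrt(2) * x1 * x2,
                         x1 ** 2,
                         x2 ** 2])

    def poly_kernel(xi, xj):
        # The same quantity computed directly in input space (the kernel trick).
        return (np.dot(xi, xj) + 1.0) ** 2

    xi = np.array([1.5, -0.3])
    xj = np.array([0.7, 2.1])
    print(np.dot(phi(xi), phi(xj)))   # explicit mapping, then dot product
    print(poly_kernel(xi, xj))        # kernel trick; same value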

When deriving a kernel directly (not by starting with a desired Φ as we did above), one has to make sure that it fulfills certain criteria. A kernel which fulfills these criteria is called a Mercer kernel [Sch00]. We shall introduce yet another definition: the kernel matrix

K = (K(xi, xj))_{i,j}   ∀i, j.

A Mercer kernel must satisfy two constraints [Sch00]:

- Its kernel matrix must be positive definite, which implies positivity on the diagonal,

  K(xi, xi) ≥ 0   ∀i.

- The kernel matrix must also be symmetric,

  K(xi, xj) = K(xj, xi).

So which kernel should we use, and why? This depends on the distribution of the input data, which is probably not known. A common solution is to try a few different kernels and use the one that gives the best result. RBF-type kernels, for example, often give good results for image classification [CHV99].
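To make the kernel-matrix constraints concrete, here is a small sketch (assuming NumPy; the data and the kernel choice are illustrative) that builds the Gram matrix for a Gaussian RBF kernel and checks that it is symmetric, has non-negative diagonal entries and has no significantly negative eigenvalues.

    import numpy as np

    def rbf_kernel(xi, xj, sigma=1.0):
        # Gaussian RBF kernel from the list above.
        return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

    rng = np.random.RandomState(1)
    X = rng.randn(10, 3)                            # 10 illustrative points in R^3

    # Kernel (Gram) matrix K_ij = K(x_i, x_j).
    K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

    print(np.allclose(K, K.T))                      # symmetry
    print(np.all(np.diag(K) >= 0))                  # positivity on the diagonal
    print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)  # no negative eigenvalues (up to precision)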

4.4 Multi-class SVM

So far we have only considered a two-class problem. When dealing with more than two classes, the two perhaps most common approaches are the following; a sketch of the first approach is given after this list.

One against all  Pick a class and perform two-class SVM on this class vs. all others. Repeat for all classes (an n-class problem is turned into n two-class problems). If we get a positive result for more than one class, use the one with the largest margin.

Pairwise  Combine pairs of the classes and perform two-class SVM on each pair (an n-class problem is turned into n(n − 1)/2 two-class problems).

In terms of performance, the methods have proven to give very similar results [WW98]. Thus, the one-against-all approach is often preferred over the pairwise approach due to its lower complexity. There also exist numerous extensions of these two basic ideas, as well as completely different solutions.
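The following is a minimal sketch of the one-against-all strategy (assuming scikit-learn; the data and all names are illustrative): one binary SVM is trained per class, and a sample is assigned to the class whose classifier gives the largest signed margin. Note that scikit-learn also provides this behavior directly through OneVsRestClassifier.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.randn(300, 2)
    y = rng.randint(0, 3, size=300)               # three illustrative classes: 0, 1, 2

    # One-against-all: train one two-class SVM per class (class k vs. the rest).
    classifiers = []
    for k in np.unique(y):
        clf = SVC(kernel="rbf", C=1.0).fit(X, (y == k).astype(int))
        classifiers.append(clf)

    def predict(x):
        # Assign the class whose classifier gives the largest decision value (margin).
        scores = [clf.decision_function(x.reshape(1, -1))[0] for clf in classifiers]
        return int(np.argmax(scores))

    print(predict(np.array([0.2, -1.0])))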

4.5 One-class SVM

In the case of two-class SVM, we are dealing with two classes which we want to separate. In one-class SVM we only have one class, hence only examples that belong to the class and no examples that do not. Still, we can use the information we do have to create a boundary for the class. What we are trying to do is basically to estimate its probability distribution. Outliers are treated as not belonging to the class. In geometrical terms, it can be seen as trying to fit the class within a hyper-sphere. We want this hyper-sphere to be big enough to fit all the training data, but not so big that it contains samples that do not belong to the class.

In practical computer vision applications, one-class SVM can be used for recognition, i.e. if we want to detect an object in a scene (output true if the object is present in the scene, or false if no such object is present).
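A minimal sketch of this idea, assuming scikit-learn (which provides a one-class SVM as OneClassSVM); the data and the parameter values are illustrative only.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(0)
    X_class = 0.5 * rng.randn(100, 2)             # training examples from the single class

    # nu bounds the fraction of training points treated as outliers;
    # gamma is the RBF kernel width.
    ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(X_class)

    X_test = np.array([[0.1, -0.2],               # close to the class
                       [4.0, 4.0]])               # far away, likely an outlier
    print(ocsvm.predict(X_test))                  # +1 = belongs to the class, -1 = outlier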

4.6 Using SVM with Local Features

Assume that we are given an image or video sequence i. We define lji as the jet descriptor of interest point j in i, and let ni be the number of interest points in i. A simple way to construct one vector out of this is to concatenate the jet descriptors for all interest points:

Fi = (l1i, l2i, . . . , l_{ni,i}).

Now assume an SVM comparison between two such feature vectors, Fh and Fk. Since it is likely that we detect a different number of interest points in different images or sequences, we get vectors of different lengths. This is a problem because SVM uses the dot product for comparison, and the dot product between vectors of different lengths is not defined. A simple solution is to zero-pad the shorter of the vectors. However, we will show that this is not a good solution [WCG03].

If we assume that nh < nk, Fh has to be zero-padded. The dot product between the vectors becomes

Fh · Fk = l1h · l1k + l2h · l2k + . . . + 0 · l_{(nk−1),k} + 0 · l_{nk,k}.

It is obvious that this is equivalent to shortening Fk to match the length of the unpadded Fh, so information that could be vital is discarded. Also, how the jets are ordered in the feature vector affects the result.

To avoid these problems, we can instead match the individual interest points (their jets) in the corresponding images or video sequences. This introduces yet another issue, the correspondence problem, previously described in Section 3.5.1.

4.6.1 Local Features Kernel

Instead of just putting all jet descriptors in one large vector, we now define a set of jet descriptors Li = {lji}_{j=1}^{ni}, where lji is the jet descriptor of interest point j in i and ni is the number of interest points in i, as in Section 4.6. From this setup, Wallraven et al. [WCG03] present a new set of kernels

KL(Lh, Lk) = (1/2) [K(Lh, Lk) + K(Lk, Lh)]

with

K(Lh, Lk) = (1/nh) Σ_{jh=1}^{nh} max_{jk=1,...,nk} {Kl(l_{jh}, l_{jk})}.

Kl can be any Mercer kernel, and in this particular case we use an exponential function based on the normalized cross-correlation,

Kl(x, y) = exp{ −ρ (1 − 〈x − µx | y − µy〉 / (||x − µx|| · ||y − µy||)) },          (4.12)

where µx is the mean of x and µy is the mean of y. Other possible choices of Kl are [WCG03]

Kl(x, y) = exp{−ρ χ^2(x, y)},          (4.13)
Kl(x, y) = exp{−ρ ||xa − yb||}.          (4.14)

It is shown in the work of Wallraven et al. [WCG03] that if Kl satisfies Mercer's theorem, then so does K and, in turn, KL.

Unfortunately, it turns out that the issue of correspondence still persists, since the local neighborhoods around many local features can look very similar. In order to improve the matching it is possible to use, for example, positional information and/or distance histograms [BMP00] in addition to the jet comparison.
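The following is a minimal sketch of this matching kernel (assuming NumPy; function and variable names are illustrative, and the value of ρ is arbitrary), computing KL for two sequences represented as sets of jet descriptors, with the normalized cross-correlation based Kl of Equation (4.12).

    import numpy as np

    def kl_ncc(x, y, rho=1.0):
        # Local kernel of Equation (4.12): exponential of (1 - normalized cross-correlation).
        xc, yc = x - x.mean(), y - y.mean()
        ncc = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
        return np.exp(-rho * (1.0 - ncc))

    def k_match(Lh, Lk, rho=1.0):
        # K(Lh, Lk): for each jet in Lh, take the best-matching jet in Lk, then average.
        return np.mean([max(kl_ncc(lh, lk, rho) for lk in Lk) for lh in Lh])

    def kernel_KL(Lh, Lk, rho=1.0):
        # Symmetrized local features kernel KL(Lh, Lk).
        return 0.5 * (k_match(Lh, Lk, rho) + k_match(Lk, Lh, rho))

    # Two illustrative "sequences" with different numbers of 34-dimensional jets.
    rng = np.random.RandomState(0)
    Lh = [rng.randn(34) for _ in range(60)]
    Lk = [rng.randn(34) for _ in range(80)]
    print(kernel_KL(Lh, Lk))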


4.7 Summary

This chapter has described the theory behind the NNC and SVM classifiers, which will be used in the experiments in the next chapter. We began with NNC, explaining why it is outperformed by, for example, SVM. We then continued by describing SVM and its extensions to handle non-separable data, multi-class problems and one-class problems. This was followed by a section about the concept of kernels and non-linear classification. Finally, we ended with the problem of using local features with SVM and described a class of kernels which solves this problem.


Chapter 5

Experiments

5.1 Introduction

With the theory from the two previous chapters in mind, we now continue with the experiments. We start this chapter with a presentation of our video database in Section 5.2. This is followed by an overview of the selected representation methods, based on the theory from Section 3.5. We then continue with the multi-class experiments (Section 5.3.2), followed by two-class SVM experiments (Section 5.3.3). Finally, we discuss the results of the experiments (Section 5.4) and end with a summary.

5.2 Video Database

While, for object recognition, there exist several publicly available databases [FPZ03, MN95], the situation is different for action recognition. The existing databases tend to be more or less specialized, focusing on tasks such as gait recognition and analysis (the CMU MoBo database [GS01], containing 25 subjects performing four walking-type actions recorded from different views, and the Georgia-Tech gait database [GTDB], which consists of 20 subjects in different environments) or visual speech recognition (the Tulips1 database [Mov01]).

In order to test our approach on the problem of recognizing human actions, we decided to make a new database for the experiments. For this video database, the number of actions was chosen to be six: three "arm actions" (hand wave, hand clap and box) and three "leg actions" (walk, jog and run). Each action was filmed in four different scenarios:

s1 Outdoors.

s2 Outdoors with scale variation. For locomotive actions (walk, run and jog) this means that the subject performs the action diagonally, starting at a distance and proceeding closer to the camera. For non-locomotive actions (box, hand wave and hand clap) the same effect is achieved by zooming.

s3 Outdoors with different clothes. The subjects were asked to wear something that might make recognition of the action somewhat more difficult, such as a backpack, a long coat and/or a scarf fluttering behind the running subject.

s4 Indoors.

All scenarios were filmed with a static camera and homogeneous background (as close to homogeneous as we could achieve). The outdoor filming took place at the Östermalms IP sports field, where the gravel field was used as the "homogeneous background". For the indoor filming, a bright unicolored wall was used as background.

Filming was done with a 3CCD DV-camera at a sampling rate of 25 frames per second, and all sequences were subsampled to a spatial resolution of 160 × 120 pixels in 256 levels of grey-scale. The clips were saved in uncompressed AVI format to avoid possible artifacts (blockiness) from lossy compression such as MPEG. However, the DV-camera internally stores the data in a lossy MPEG format, so there are artifacts anyway. On the other hand, the resolution used by the camera is much higher than ours, so it is assumed that these artifacts do not have any significant impact on the result of the action recognition.

25 persons were filmed performing the six actions under the four conditions. Four sequences of each person-action-condition combination were filmed, giving a total of 2400 video sequences. However, due to various reasons, a few sequences are missing, so the true number of sequences is 2391. To the best of our knowledge, this is the largest video database with sequences of human actions taken over different scenarios, and it is our plan to make it available online. The sequences are all short, ranging from a little over a second to several seconds.

A few samples from the database, showing the different actions and scenarios, are displayed in Figure 5.1. As can be seen, there are some variations within each condition. For example, when it is sunny there are shadows on the ground, and when the ground is wet from rain the subject leaves visible footprints.

5.3 Experiments

All sequences were divided, with respect to the subjects, into a training set, a validation set and a test set. The sequences in the training set were used to train the classifiers, while the validation set was used to optimize the classifier parameters (validation phase). The recognition results were obtained on the test set (test phase). To analyze the influence of the different scenarios, we performed separate training on four different subsets of scenarios: s1, {s1, s4}, {s1, s3, s4} and {s1, s2, s3, s4}.

The experiments were conducted using three different types of representation methods, described earlier in Chapter 3.

Figure 5.1. Samples from the database: the six actions (walking, jogging, running, boxing, hand waving and hand clapping) shown for each of the four scenarios s1–s4.

Local features (LF) For this representation we detect spatio-temporal Harris points [LL03], and for each interest point we compose a jet of normalized spatio-temporal derivatives l = (Lx, Ly, Lt, Lxx, . . . , Ltttt), with L_{x^m y^n t^k} = σ^{m+n} τ^k (∂_{x^m y^n t^k} g) ∗ f. We use derivatives of order up to 4, resulting in 34-dimensional jets (there are 3 + 6 + 10 + 15 = 34 distinct partial derivatives of orders 1–4 in the three variables x, y, t). The video sequence is then represented by this set of jets. We use the local features kernel described in Section 4.6 together with SVM for classification.

Histograms of local features (HistLF) Here we used histograms of the local feature jets described above. The jets were quantized into 128 clusters according to Section 3.5.2, so we ended up with a 128-bin histogram for each video sequence. We also used a "stop list" to remove the 5% most populated clusters, as this had been shown to increase performance in similar applications [SZ03]. A sketch of this construction is given below.
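As an illustration of how such histograms could be formed, here is a minimal sketch (assuming scikit-learn and NumPy; the cluster count and stop-list fraction follow the description above, but the data and all names are illustrative): jets from the training sequences are clustered with k-means, the most populated clusters are put on a stop list, and each sequence is then represented by a histogram over the remaining clusters.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.RandomState(0)
    train_jets = rng.randn(5000, 34)              # illustrative pool of 34-D jets

    # Quantize the jet space into 128 clusters.
    kmeans = KMeans(n_clusters=128, n_init=10, random_state=0).fit(train_jets)

    # "Stop list": drop the 5% most populated clusters.
    counts = np.bincount(kmeans.labels_, minlength=128)
    n_stop = int(round(0.05 * 128))
    stop_list = np.argsort(counts)[-n_stop:]

    def histogram_of_jets(jets):
        # Histogram over cluster labels, ignoring stop-listed clusters.
        labels = kmeans.predict(jets)
        hist = np.bincount(labels, minlength=128).astype(float)
        hist[stop_list] = 0.0
        return hist / max(hist.sum(), 1.0)        # normalize

    sequence_jets = rng.randn(120, 34)            # jets of one illustrative sequence
    print(histogram_of_jets(sequence_jets).shape)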

Histograms of space-time gradients (HistSTG) This representation uses histograms of the spatio-temporal gradient, as in the work of Zelnik-Manor and Irani [ZI01]. For each point in the video sequence we calculate the absolute value of the intensity gradient (Lx, Ly, Lt). For each gradient component we then calculate a 128-bin histogram. All this is done over 4 temporal scales, resulting in 12 histograms per video sequence (3 gradient components (x, y, t) times 4 scales); see the sketch below.
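A rough sketch of this global descriptor (assuming NumPy and SciPy; the temporal smoothing scales and the bin range are illustrative assumptions, not the exact settings used in the thesis): absolute space-time gradients are computed for the whole clip and binned per component and per temporal scale.

    import numpy as np
    from scipy.ndimage import gaussian_filter1d

    def hist_stg(video, scales=(1, 2, 4, 8), bins=128):
        # video: array of shape (T, H, W) with grey-level frames.
        histograms = []
        for tau in scales:
            smoothed = gaussian_filter1d(video.astype(float), sigma=tau, axis=0)
            Lt, Ly, Lx = np.gradient(smoothed)            # gradients along t, y, x
            for comp in (Lx, Ly, Lt):
                h, _ = np.histogram(np.abs(comp), bins=bins, range=(0.0, 255.0))
                histograms.append(h / max(h.sum(), 1))    # one normalized 128-bin histogram
        return np.stack(histograms)                       # 12 histograms: 3 components x 4 scales

    clip = np.random.randint(0, 256, size=(50, 120, 160))  # illustrative clip
    print(hist_stg(clip).shape)                            # (12, 128)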

Figure 5.2. Recognition results (correctly classified, %) for the multi-class experiments, plotted against the training scenario (s1, s1+s4, s1+s3+s4, s1+s2+s3+s4) for the five method/classifier combinations (Local features with SVM, Histogram LF with SVM, Histogram STG with SVM, Histogram LF with NNC, Histogram STG with NNC): (left) testing over all scenarios; (right) testing over the s2 scenario with scale variations.

5.3.1 The SVM Kernel

The kernel used with the LF representation was the local features kernel described in Section 4.6.1. Regarding correspondence, this kernel uses one-to-many matching; we changed this to one-to-one matching. As the number of detected interest points varies between approximately 50 and 270, it was decided (through testing) to use only the 50 strongest matches. This means that the sum in the kernel of Section 4.6.1 runs from 1 to 50.

For the HistLF and HistSTG representations we used the kernel K(x, y) = exp{−γχ^2(x, y)}, according to results reported in the literature (see, for example, [CHV99]).
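For concreteness, a minimal sketch of this histogram kernel (assuming NumPy; the value of γ and the small constant added to the denominator are illustrative choices), computed between two normalized histograms:

    import numpy as np

    def chi2_kernel(x, y, gamma=1.0, eps=1e-10):
        # Chi-square distance between two histograms, turned into a kernel value.
        chi2 = np.sum((x - y) ** 2 / (x + y + eps))
        return np.exp(-gamma * chi2)

    h1 = np.array([0.2, 0.3, 0.5])
    h2 = np.array([0.25, 0.25, 0.5])
    print(chi2_kernel(h1, h2))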

5.3.2 Multi-class

We start with a multi-class experiment, since it is perhaps the simplest and most intuitive, and yet provides much information about the performance of each representation (which actions/conditions it fails to classify correctly, etc.). All six actions are used and divided into six classes. We evaluate the performance for training on different conditions, as described in Section 5.3. This is done in order to see whether it is fruitful to train on various conditions, or whether the methods show good results regardless of the training condition(s).

We use the three representation methods LF, HistLF and HistSTG, as described previously. The LF representation was used together with an SVM classifier, while the histogram-based methods were used with both SVM and NNC. The reason for using NNC is to have some kind of reference results; as shown in the literature [CHV99], we expect SVM to perform better than NNC.

Sequences of 8 subjects were used for training, 8 subjects for validation and 9 subjects for testing.

LF+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    83.78   0       16.22   0       0       0
Run     6.25    54.86   38.89   0       0       0
Jog     22.92   16.67   60.42   0       0       0
Box     0.70    0       0       97.90   0.70    0.70
Hclap   1.39    0       0       35.42   59.72   3.47
Hwav    0.69    0       0       20.83   4.86    73.61

HistSTG+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    64.86   0       27.03   0       0       8.11
Run     4.86    50.69   44.44   0       0       0
Jog     19.44   15.28   65.28   0       0       0
Box     0       0       0       67.83   17.48   14.69
Hclap   0       0       0       40.28   54.86   4.86
Hwav    0       0       1.39    9.72    10.42   78.47

LF+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     13.89   16.67   69.44   0       0       0
Jog     66.67   0       33.33   0       0       0
Box     0       0       0       97.22   2.78    0
Hclap   0       0       0       36.11   58.33   5.56
Hwav    0       0       0       25.00   5.56    69.44

HistSTG+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    66.67   0       0       0       0       33.33
Run     19.44   11.11   69.44   0       0       0
Jog     77.78   0       22.22   0       0       0
Box     0       0       0       38.89   36.11   25.00
Hclap   0       0       0       33.33   47.22   19.44
Hwav    0       0       5.56    2.78    30.56   61.11

Table 5.1. Confusion matrices (results in %) for multi-class experiments with training on s1, for different test scenarios.

Figure 5.2 shows the proportion of actions correctly classified for the whole test set (all scenarios) and for scenario s2, for the four different training sets. Four confusion matrices are shown in Table 5.1. The remaining figures and confusion matrices can be found in the appendix (Tables A.1 - A.11 and Figure A.1).

Discussion

From Figure 5.2 it can be noted that the local features method outperforms the other methods for all four constellations of training sets, and that both histogram-based methods on average perform better with SVM than with NNC, as expected. However, when testing on scenario s2, NNC tends to outperform SVM; unfortunately, we do not have any explanation for this. We can also see that all methods seem to benefit, more or less, from the addition of scenario s2 to the training set. However, LF and HistLF seem to be more stable with respect to scale variation than HistSTG. Hence, it seems that scale adaptation of the local features gives an increase in performance.


LF
        Walk    Run     Jog     Box     Hclap
Run     88.01
Jog     80.82   70.14
Box     97.59   96.17   99.65
Hclap   95.55   95.14   99.65   82.23
Hwav    97.28   97.92   98.26   87.80   87.85

HistLF
        Walk    Run     Jog     Box     Hclap
Run     87.67
Jog     80.82   71.18
Box     96.56   100     98.95
Hclap   92.81   100     100     79.79
Hwav    95.55   100     99.65   82.92   82.64

HistSTG
        Walk    Run     Jog     Box     Hclap
Run     91.10
Jog     74.66   70.83
Box     98.97   98.95   98.95
Hclap   99.66   98.96   98.96   65.85
Hwav    98.29   100     100     86.76   87.50

Table 5.2. Recognition rates (%) for action pairs (walk vs. jog, walk vs. run, etc.) using two-class SVM with training on scenario s1 (test on all conditions).

As can be seen from the confusion matrices in Table 5.1, there is a clear separation between arm actions and leg actions. The most confusion occurs between the jogging and running sequences, as well as between the hand clapping and boxing sequences. A similar structure can be observed in the confusion matrices for other training and test sets, as well as for the HistLF method (see Tables A.1 - A.11 in the appendix).

5.3.3 Two-class SVM

We continue with two-class SVM experiments, where we take out pairs of actions which we try to classify (e.g. walk vs. jog, run vs. box, etc.). This is done in order to further evaluate the performance of the methods. NNC was not used in the two-class experiments, since we had confirmed in the multi-class experiments that SVM almost consistently outperforms NNC.

Sequences of 8 subjects were used in the training set, 8 subjects in the validation set and 9 subjects in the test set, as for the multi-class experiments.

The recognition rates for the different methods with training on scenario s1 are shown in Table 5.2. All action combinations and the corresponding results are shown (e.g., for the LF representation, run vs. walk gives a recognition rate of 88.01% and jog vs. run gives a recognition rate of 70.14%). Results for the other training subsets can be found in the appendix (see Tables B.1, B.2 and B.3).

Discussion

An observation from Table 5.2 is the poor result of HistSTG for the actions box vs. hand clap. We think this is caused by the direction-of-action invariance of the representation, since the two actions mainly contain horizontal arm movements. It is also worth noting that the HistSTG method seems to perform better (than LF and HistLF) on action pairs where the velocity of the subject is a good cue for separation (such as walk vs. run), whereas the situation is the opposite for more similar pairs (for example jog vs. run). The velocity invariance of the local methods is believed to cause this behavior.

5.4 Overall Discussion

First of all, the confusion between jogging and running can partly be explained by the high similarity of the classes (the running of some people may appear very similar to the jogging of others).

Evaluation of the results for the three different methods over all experiments shows that the LF method performs better on average than the other two. However, the performance of the HistSTG representation varies considerably, and it performs better in some cases. This is especially apparent in the results of the two-class experiments, where its recognition rate for the action pair hand clapping vs. boxing is only 65.85% (hence, much lower than for the other methods), whereas for most action pairs containing walking it performs better than both LF and HistLF.

The LF and HistLF methods are designed to be scale invariant and do, indeed, classify the scale-varying s2 scenario better. However, we expected the performance to be better still. One possible explanation is that the s2 scenario contains variations other than scaling, such as expansion due to camera zoom for the arm actions and variations in the view (45°) for the leg actions.

Furthermore, as our database only contains sequences with a static camera and homogeneous background, we believe that we do not use the full capacity of the LF representation and its velocity invariance.

5.5 Summary

In this chapter we have presented a novel video database, containing video sequences of 25 subjects performing six different actions under four different scenarios. We have used this database for an extensive evaluation of three different representations, of which two are local and one is global. We conclude that the two local methods outperform the global method in our experiments, even though the video database lacks scenes where local representation and velocity adaptation would probably aid the recognition even more in comparison to global methods (such as scenes with a moving camera and textured background).


Chapter 6

Conclusions

6.1 Conclusions

The objective of this project was to investigate the possibilities of using local features, in the form of spatio-temporal interest points, together with an SVM classifier to recognize human actions. The experiments have shown that this is possible, and that such a method even performs better on the created database than the normalized space-time gradient method.

There are several motivations for the use of a local representation, such as the possibilities of bottom-up segmentation, handling of multiple simultaneous actions, use of positional information, independence of static and constantly moving backgrounds, and achieving invariance to size and camera velocity. In this thesis we have used adaptation to construct a scale- and velocity-invariant representation. However, for this particular database, velocity, and to some extent also spatial scale, is a helpful cue for recognizing the action; by using adaptation, this cue is removed. This is probably one reason why the normalized space-time gradient method sometimes gives better recognition results. We believe that extending the database with sequences with static heterogeneous background, sequences with a moving camera and sequences with a moving background will give more hints about the potential of the local feature representation.

6.2 Future Work

We plan to extend this database by adding sequences with static textured background, moving background and moving camera, which we believe will show the benefits of the local features representation and velocity adaptation more clearly.

Positional information could be used in many ways to improve the recognition results for the local features representation. Possible approaches are to search for interest point matches in a local neighborhood, to use distance or angle histograms (similar to the shape descriptors presented in [BMP00]) as part of the matching, and/or to match clusters of interest points to other clusters instead of matching single points. However, when using positional information, the possible introduction of unwanted dependencies, such as on translation, rotation and scale, has to be considered.


References

[AC02] S. Akyol and U. Canzler, An information terminal using vision based sign language recognition, ITEA Workshop on Virtual Home Environments, VHE Middleware Consortium, 2002.

[Avi01] S. Avidan, Support vector tracking, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 184-191, 2001.

[ACS01] N.E. Ayat, M. Cheriet and C.Y. Suen, Empirical error based optimization of SVM kernels: Application to digit image recognition, International Workshop on Frontiers of Handwriting Recognition (IWFHR), pp. 292-297, 2002.

[BMP00] S. Belongie, J. Malik and J. Puzicha, Shape context: A new descriptor for shape matching and object recognition, Neural Information Processing Systems (NIPS), pp. 831-837, 2000.

[BYJ+97] M. Black, Y. Yacoob, A. Jepson and D. Fleet, Learning parameterized models of image motion, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 561-567, 1997.

[BGV92] B. Boser, I. Guyon and V. Vapnik, A training algorithm for optimal margin classifiers, Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, 1992.

[Bre97] C. Bregler, Learning and recognizing human dynamics in video sequences, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 568-574, 1997.

[CK99] P. Chang and J. Krumm, Object recognition with color cooccurrence histograms, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 498-504, 1999.

[CHV99] O. Chapelle, P. Haffner and V. Vapnik, SVMs for histogram-based image classification, IEEE Transactions on Neural Networks, 10(5), 1999.

[CC99] O. Chomat and J. Crowley, Probabilistic recognition of activity using local appearance, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 104-109, 1999.

[CS00] N. Cristianini and J. Shawe-Taylor, Introduction to support vector machines, Cambridge University Press, ISBN 0-521-78019-5, 2000.

[CLK+00] R. Collins, A. Lipton, T. Kanade, H. Fujiyoshi, D. Duggins, Y. Tsin, D. Tolliver, N. Enomoto and O. Hasegawa, A system for video surveillance and monitoring, Technical Report CMU-RI-TR-00-12, Robotics Institute, Carnegie Mellon University, 2000.

[DB97] J. Davis and A. Bobick, The representation and recognition of action using temporal templates, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 928-934, 1997.

[DTM96] P. Debevec, C. Taylor and J. Malik, Modeling and rendering architecture from photographs: A hybrid geometry- and image-based approach, In Proceedings of SIGGRAPH, International Conference on Computer Graphics and Interactive Techniques, pp. 11-20, 1996.

[DHS01] R. Duda, P. Hart and D. Stork, Pattern classification, Wiley-Interscience, ISBN 0-471-05669-3, 2001.

[EBM+03] A.A. Efros, A.C. Berg, G. Mori and J. Malik, Recognizing action at a distance, In Proceedings of the International Conference on Computer Vision (ICCV), vol. 2, pp. 726-733, 2003.

[FB00] R. Fablet and P. Bouthemy, Statistical motion-based object indexing using optic flow field, In Proceedings of the International Conference on Pattern Recognition (ICPR), vol. 4, pp. 287-290, 2000.

[FPZ03] R. Fergus, P. Perona and A. Zisserman, Object class recognition by unsupervised scale-invariant learning, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 264-271, 2003.

[GS01] R. Gross and J. Shi, The CMU motion of body (MoBo) database, Technical Report CMU-RI-TR-01-18, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, 2001.

[HS88] C. Harris and M. Stephens, A combined edge and corner detector, In Proceedings of The Fourth Alvey Vision Conference, pp. 147-151, 1988.

[HS99] H. Haussecker and H. Spies, Handbook of computer vision and applications, volume 2, signal processing and pattern recognition, Academic Press, ISBN 0-12-379772-1, pp. 312-324, 1999.

[HL01] F. Hillier and G. Lieberman, Introduction to operations research, McGraw-Hill Companies, Inc., ISBN 0-07-118163-6, 2001.

[HL00] J. Hoey and J. Little, Representation and recognition of complex human motion, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 1752-1759, 2000.

[KSG01] S. Kumar, M. Sallam and D. Goldgof, Matching point features under small nonrigid motion, Pattern Recognition, vol. 34, pp. 2353-2365, 2001.

[LI93] P. Langley and W. Iba, Average-case analysis of a nearest neighbor algorithm, In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 889-894, 1993.

[LL03] I. Laptev and T. Lindeberg, On space-time interest points, Tech. Rep. ISRN KTH/NA/P–03/12–SE, Oct. 2003.

[LL04] I. Laptev and T. Lindeberg, Velocity adaptation of space-time interest points, To appear in the Proceedings of the International Conference on Pattern Recognition (ICPR), 2004.

[Lin94a] T. Lindeberg, Scale-space theory in computer vision, Kluwer Academic Publishers, ISBN 0-7923-9418-6, 1994.

[Lin94b] T. Lindeberg, Scale-space theory: A basic tool for analysing structures at different scales, Journal of Applied Statistics, pp. 224-270, 1994.

[Lin98] T. Lindeberg, Feature detection with automatic scale selection, International Journal of Computer Vision, 1998.

[MN95] H. Murase and S. Nayar, Visual learning and recognition of 3D objects from appearance, International Journal of Computer Vision (IJCV), vol. 14, pp. 5-24, 1995.

[MP03] O. Masoud and N. Papanikolopoulos, Recognizing human activities, In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 157-162, 2003.

[MS01] K. Mikolajczyk and C. Schmid, Indexing based on scale invariant interest points, In Proceedings of the International Conference on Computer Vision (ICCV), vol. 1, pp. 525-531, 2001.

[MS02] K. Mikolajczyk and C. Schmid, An affine invariant interest point detector, In Proceedings of the European Conference on Computer Vision (ECCV), vol. 1, pp. 128-142, 2002.

[MS03] K. Mikolajczyk and C. Schmid, A performance evaluation of local descriptors, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 257-263, 2003.

[Mov01] J.R. Movellan, Visual speech recognition with stochastic networks, In G. Tesauro, D. Touretzky and T. Leen, editors, Advances in Neural Information Processing Systems, MIT Press, Cambridge, Massachusetts, 1995.

[SC00] B. Schiele and J.L. Crowley, Recognition without correspondence using multidimensional receptive field histograms, International Journal of Computer Vision (IJCV), vol. 36, pp. 31-52, 2000.

[SM97] C. Schmid and R. Mohr, Local grayvalue invariants for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 530-535, 1997.

[Sch00] B. Schölkopf, Statistical learning and kernel methods, MSR-TR-2000-23, Microsoft Research, 2000.

[SZ03] J. Sivic and A. Zisserman, Video Google: A text retrieval approach to object matching in videos, In Proceedings of the International Conference on Computer Vision (ICCV), pp. 1470-1477, 2003.

[SC02] J. Sullivan and S. Carlsson, Recognizing and tracking human action, In Proceedings of the European Conference on Computer Vision (ECCV), vol. 1, pp. 629-644, 2002.

[VVF+03] M. Vergauwen, F. Verbiest, V. Ferrari, C. Strecha and L. van Gool, Wide-baseline 3D reconstruction from digital stills, ISPRS Workshop on Visualization and Animation of Reality-based 3D Models, 2003.

[VJS03] P. Viola, M.J. Jones and D. Snow, Detecting pedestrians using patterns of motion and appearance, In Proceedings of the International Conference on Computer Vision (ICCV), vol. 2, pp. 734-741, 2003.

[WB01] C. Wallraven and H.H. Bülthoff, View-based recognition under illumination changes using local features, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Workshop on Identifying Objects Across Variations in Lighting: Psychophysics and Computation, 2001.

[WCG03] C. Wallraven, B. Caputo and A. Graf, Recognition with local features: the kernel recipe, In Proceedings of the International Conference on Computer Vision (ICCV), vol. 1, pp. 257-264, 2003.

[WW98] J. Weston and C. Watkins, Multi-class support vector machines, Technical Report CSD-TR-98-04, Department of Computer Science, Royal Holloway, University of London, 1998.

[WS03] L. Wolf and A. Shashua, Kernel principal angles for classification machines with applications to image sequence interpretation, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2003.

[XF02] F. Xu and K. Fujimura, Pedestrian detection and tracking with night vision, IEEE Intelligent Vehicle Symposium, 2002.

[YB98] Y. Yacoob and M.J. Black, Parameterized modeling and recognition of activities, International Conference on Computer Vision (ICCV), 1998.

[ZI01] L. Zelnik-Manor and M. Irani, Event-based analysis of video, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 123-130, 2001.

[ZWR+01] W. Zhu, S. Wang, R. Lin and S. Levinson, Tracking of object with SVM regression, In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 240-245, 2001.

[GTDB] The Georgia-Tech gait recognition database. http://www.cc.gatech.edu/cpl/projects/hid/images.html [April 2004].

Appendix A

Additional Results for Multi-class SVM & NNC

Figure A.1. Recognition results (correctly classified, %) for multi-class experiments, plotted against the training scenario (s1, s1+s4, s1+s3+s4, s1+s2+s3+s4) for the five method/classifier combinations (Local features with SVM, Histogram LF with SVM, Histogram STG with SVM, Histogram LF with NNC, Histogram STG with NNC): (top) testing over the s1 scenario; (bottom left) testing over the s3 scenario (different clothes); (bottom right) testing over the s4 scenario (indoors).

LF+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    63.89   0       36.11   0       0       0
Run     0       88.89   11.11   0       0       0
Jog     0       19.44   80.56   0       0       0
Box     0       0       0       97.14   0       2.86
Hclap   0       0       0       36.11   61.11   2.78
Hwav    2.78    0       0       30.56   0       66.67

HistLF+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    75.00   0       25.00   0       0       0
Run     0       83.33   16.67   0       0       0
Jog     2.78    13.89   83.33   0       0       0
Box     2.86    0       0       97.14   0       0
Hclap   0       0       0       44.44   52.78   2.78
Hwav    5.56    0       0       30.56   5.56    58.33

HistSTG+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    55.56   0       44.44   0       0       0
Run     0       83.33   16.67   0       0       0
Jog     0       19.44   80.56   0       0       0
Box     0       0       0       82.86   14.29   2.86
Hclap   0       0       0       30.56   69.44   0
Hwav    0       0       0       16.67   11.11   72.22

Table A.1. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1, for test scenario s1.

LF+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    72.50   0       27.50   0       0       0
Run     5.56    77.78   16.67   0       0       0
Jog     2.78    47.22   50      0       0       0
Box     0       0       0       100     0       0
Hclap   5.56    0       0       36.11   52.78   5.56
Hwav    0       0       0       5.56    13.89   80.56

HistLF+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    75.00   0       25.00   0       0       0
Run     5.56    72.22   22.22   0       0       0
Jog     2.78    19.44   77.78   0       0       0
Box     0       0       0       100     0       0
Hclap   2.78    0       0       36.11   58.33   2.78
Hwav    0       0       0       13.89   8.33    77.78

HistSTG+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    60      0       40      0       0       0
Run     0       72.22   27.78   0       0       0
Jog     0       41.67   58.33   0       0       0
Box     0       0       0       86.11   2.78    11.11
Hclap   0       0       0       69.44   30.56   0
Hwav    0       0       0       19.44   0       80.56

LF+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     5.56    36.11   58.33   0       0       0
Jog     22.22   0       77.78   0       0       0
Box     2.78    0       0       97.22   0       0
Hclap   0       0       0       33.33   66.67   0
Hwav    0       0       0       22.22   0       77.78

HistLF+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    97.22   0       2.78    0       0       0
Run     8.33    16.67   75.00   0       0       0
Jog     25.00   0       75.00   0       0       0
Box     0       0       0       97.22   0       2.78
Hclap   0       0       0       36.11   63.89   0
Hwav    8.33    0       0       27.78   5.56    58.33

HistSTG+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    77.78   0       22.22   0       0       0
Run     0       36.11   63.89   0       0       0
Jog     0       0       100     0       0       0
Box     0       0       0       63.89   16.67   19.44
Hclap   0       0       0       27.78   72.22   0
Hwav    0       0       0       0       0       100

Table A.2. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1, for different test scenarios.

LF+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    87.84   0       11.49   0.68    0       0
Run     3.47    66.67   29.86   0       0       0
Jog     15.97   21.53   62.50   0       0       0
Box     2.10    0       0       93.71   4.20    0
Hclap   0.69    0       0       25.00   74.31   0
Hwav    0.69    0       0       20.14   13.89   65.28

HistLF+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    82.43   0       16.22   1.35    0       0
Run     5.56    54.17   40.28   0       0       0
Jog     21.53   11.11   67.36   0       0       0
Box     5.59    0       0       91.61   2.10    0.70
Hclap   1.39    0       0       35.42   63.19   0
Hwav    2.78    0       0       26.39   9.03    61.81

HistSTG+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    70.95   0       24.32   4.73    0       0
Run     4.86    63.19   31.94   0       0       0
Jog     22.92   12.50   64.58   0       0       0
Box     2.10    0       0       70.63   25.87   1.40
Hclap   0       0       0       38.89   61.11   0
Hwav    0       0       0       19.44   14.58   65.97

Table A.3. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1 & s4, for all test scenarios.

LF+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    72.22   0       27.78   0       0       0
Run     0       91.67   8.33    0       0       0
Jog     0       22.22   77.78   0       0       0
Box     0       0       0       97.14   2.86    0
Hclap   2.78    0       0       25.00   72.22   0
Hwav    2.78    0       0       30.56   13.89   52.78

HistLF+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    66.67   0       33.33   0       0       0
Run     0       88.89   11.11   0       0       0
Jog     0       16.67   83.33   0       0       0
Box     0       0       0       97.14   2.86    0
Hclap   0       0       0       41.67   58.33   0
Hwav    5.56    0       0       30.56   5.56    58.33

HistSTG+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    47.22   0       52.78   0       0       0
Run     0       88.89   11.11   0       0       0
Jog     0       13.89   86.11   0       0       0
Box     0       0       0       88.57   11.43   0
Hclap   0       0       0       33.33   66.67   0
Hwav    0       0       0       25.00   16.67   58.33

LF+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    97.22   0       0       2.78    0       0
Run     11.11   16.67   72.22   0       0       0
Jog     52.78   0       47.22   0       0       0
Box     2.78    0       0       91.67   5.56    0
Hclap   0       0       0       13.89   86.11   0
Hwav    0       0       0       22.22   16.67   61.11

HistLF+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    94.44   0       0       5.56    0       0
Run     19.44   13.89   66.67   0       0       0
Jog     69.44   0       30.56   0       0       0
Box     11.11   0       0       86.11   2.78    0
Hclap   0       0       0       44.44   55.56   0
Hwav    5.56    0       0       36.11   8.33    50

HistSTG+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    80.56   0       0       19.44   0       0
Run     19.44   11.11   69.44   0       0       0
Jog     77.78   0       22.22   0       0       0
Box     0       0       0       38.89   58.33   2.78
Hclap   0       0       0       27.78   72.22   0
Hwav    0       0       0       19.44   41.67   38.89

Table A.4. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1 & s4, for different test scenarios.

LF+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    82.50   0       17.50   0       0       0
Run     0       80.56   19.44   0       0       0
Jog     0       52.78   47.22   0       0       0
Box     5.56    0       0       94.44   0       0
Hclap   0       0       0       38.89   61.11   0
Hwav    0       0       0       8.33    19.44   72.22

HistLF+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    70      0       30      0       0       0
Run     0       75.00   25.00   0       0       0
Jog     0       27.78   72.22   0       0       0
Box     2.78    0       0       97.22   0       0
Hclap   5.56    0       0       33.33   61.11   0
Hwav    0       0       0       13.89   13.89   72.22

HistSTG+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    62.50   0       37.50   0       0       0
Run     0       75.00   25.00   0       0       0
Jog     0       36.11   63.89   0       0       0
Box     0       0       0       91.67   5.56    2.78
Hclap   0       0       0       63.89   36.11   0
Hwav    0       0       0       33.33   0       66.67

LF+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     2.78    77.78   19.44   0       0       0
Jog     11.11   11.11   77.78   0       0       0
Box     0       0       0       91.67   8.33    0
Hclap   0       0       0       22.22   77.78   0
Hwav    0       0       0       19.44   5.56    75.00

HistLF+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     2.78    38.89   58.33   0       0       0
Jog     16.67   0       83.33   0       0       0
Box     8.33    0       0       86.11   2.78    2.78
Hclap   0       0       0       22.22   77.78   0
Hwav    0       0       0       25.00   8.33    66.67

HistSTG+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    94.44   0       5.56    0       0       0
Run     0       77.78   22.22   0       0       0
Jog     13.89   0       86.11   0       0       0
Box     8.33    0       0       63.89   27.78   0
Hclap   0       0       0       30.56   69.44   0
Hwav    0       0       0       0       0       100

Table A.5. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1 & s4, for different test scenarios.

LF+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    86.49   0       10.81   2.03    0.68    0
Run     4.17    60.42   35.42   0       0       0
Jog     19.44   11.11   69.44   0       0       0
Box     1.40    0       0       92.31   6.29    0
Hclap   0       0       0       15.28   81.94   2.78
Hwav    0.69    0       0.69    13.19   16.67   68.75

HistLF+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    77.03   0       18.24   4.05    0.68    0
Run     6.25    55.56   38.19   0       0       0
Jog     20.14   7.64    72.22   0       0       0
Box     0       0       0       90.21   6.99    2.80
Hclap   0       0       0       26.39   68.75   4.86
Hwav    0.69    0       0       20.83   16.67   61.81

HistSTG+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    80.41   0       14.86   4.73    0       0
Run     6.94    63.89   29.17   0       0       0
Jog     24.31   5.56    70.14   0       0       0
Box     0       0       0       78.32   20.98   0.70
Hclap   0       0.69    0.69    25.00   72.22   1.39
Hwav    1.39    1.39    0.69    17.36   10.42   68.75

Table A.6. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1, s3 & s4, for all test scenarios.

LF+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    83.33   0       16.67   0       0       0
Run     0       88.89   11.11   0       0       0
Jog     0       8.33    91.67   0       0       0
Box     0       0       0       97.14   2.86    0
Hclap   0       0       0       16.67   83.33   0
Hwav    2.78    0       0       25.00   13.89   58.33

HistLF+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    66.67   0       33.33   0       0       0
Run     0       86.11   13.89   0       0       0
Jog     0       11.11   88.89   0       0       0
Box     0       0       0       94.29   5.71    0
Hclap   0       0       0       27.78   72.22   0
Hwav    0       0       0       33.33   13.89   52.78

HistSTG+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    66.67   0       33.33   0       0       0
Run     0       88.89   11.11   0       0       0
Jog     0       8.33    91.67   0       0       0
Box     0       0       0       85.71   14.29   0
Hclap   0       0       0       22.22   77.78   0
Hwav    2.78    0       0       22.22   11.11   63.89

LF+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    88.89   0       0       8.33    2.78    0
Run     13.89   19.44   66.67   0       0       0
Jog     69.44   0       30.56   0       0       0
Box     0       0       0       86.11   13.89   0
Hclap   0       0       0       11.11   86.11   2.78
Hwav    0       0       2.78    16.67   16.67   63.89

HistLF+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    80.56   0       0       16.67   2.78    0
Run     22.22   11.11   66.67   0       0       0
Jog     66.67   0       33.33   0       0       0
Box     0       0       0       83.33   11.11   5.56
Hclap   0       0       0       38.89   50      11.11
Hwav    0       0       0       22.22   13.89   63.89

HistSTG+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    80.56   0       0       19.44   0       0
Run     27.78   11.11   61.11   0       0       0
Jog     91.67   0       8.33    0       0       0
Box     0       0       0       66.67   30.56   2.78
Hclap   0       2.78    2.78    52.78   36.11   5.56
Hwav    2.78    5.56    2.78    16.67   27.78   44.44

Table A.7. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1, s3 & s4, for different test scenarios.

LF+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    75.00   0       25.00   0       0       0
Run     0       80.56   19.44   0       0       0
Jog     0       36.11   63.89   0       0       0
Box     0       0       0       100     0       0
Hclap   0       0       0       16.67   75.00   8.33
Hwav    0       0       0       0       25.00   75.00

HistLF+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    62.50   0       37.50   0       0       0
Run     0       75.00   25.00   0       0       0
Jog     0       19.44   80.56   0       0       0
Box     0       0       0       97.22   0       2.78
Hclap   0       0       0       22.22   69.44   8.33
Hwav    0       0       0       0       22.22   77.78

HistSTG+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    80      0       20      0       0       0
Run     0       77.78   22.22   0       0       0
Jog     0       13.89   86.11   0       0       0
Box     0       0       0       97.22   2.78    0
Hclap   0       0       0       11.11   88.89   0
Hwav    0       0       0       30.56   0       69.44

LF+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     2.78    52.78   44.44   0       0       0
Jog     8.33    0       91.67   0       0       0
Box     5.56    0       0       86.11   8.33    0
Hclap   0       0       0       16.67   83.33   0
Hwav    0       0       0       11.11   11.11   77.78

HistLF+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     2.78    50      47.22   0       0       0
Jog     13.89   0       86.11   0       0       0
Box     0       0       0       86.11   11.11   2.78
Hclap   0       0       0       16.67   83.33   0
Hwav    2.78    0       0       27.78   16.67   52.78

HistSTG+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    94.44   0       5.56    0       0       0
Run     0       77.78   22.22   0       0       0
Jog     5.56    0       94.44   0       0       0
Box     0       0       0       63.89   36.11   0
Hclap   0       0       0       13.89   86.11   0
Hwav    0       0       0       0       2.78    97.22

Table A.8. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1, s3 & s4, for different test scenarios.

LF+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    85.14   0       14.86   0       0       0
Run     2.08    64.58   33.33   0       0       0
Jog     9.03    13.89   77.08   0       0       0
Box     0       0       0       92.31   6.99    0.70
Hclap   0.69    0       0       27.08   67.36   4.86
Hwav    0       0       0.69    18.75   10.42   70.14

HistLF+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    80.41   0       18.24   1.35    0       0
Run     4.86    55.56   39.58   0       0       0
Jog     19.44   10.42   70.14   0       0       0
Box     1.40    0       0       95.10   1.40    2.10
Hclap   0       0       0       32.64   63.89   3.47
Hwav    1.39    0       0       28.47   8.33    61.81

HistSTG+SVM, test on all scenarios
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    75.00   0       25.00   0       0       0
Run     0.69    65.97   33.33   0       0       0
Jog     13.89   14.58   71.53   0       0       0
Box     2.10    0       0       74.83   20.28   2.80
Hclap   0       0       0       25.69   70.83   3.47
Hwav    0       0       0       15.97   13.19   70.83

Table A.9. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1, s2, s3 & s4, for all test scenarios.

LF+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    75.00   0       25.00   0       0       0
Run     0       91.67   8.33    0       0       0
Jog     0       16.67   83.33   0       0       0
Box     0       0       0       97.14   2.86    0
Hclap   0       0       0       41.67   58.33   0
Hwav    0       0       0       33.33   8.33    58.33

HistLF+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    63.89   0       36.11   0       0       0
Run     0       91.67   8.33    0       0       0
Jog     0       16.67   83.33   0       0       0
Box     0       0       0       97.14   2.86    0
Hclap   0       0       0       41.67   58.33   0
Hwav    2.78    0       0       36.11   5.56    55.56

HistSTG+SVM, test on s1
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    50      0       50      0       0       0
Run     0       88.89   11.11   0       0       0
Jog     0       16.67   83.33   0       0       0
Box     0       0       0       85.71   11.43   2.86
Hclap   0       0       0       22.22   77.78   0
Hwav    0       0       0       27.78   13.89   58.33

LF+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     5.56    27.78   66.67   0       0       0
Jog     30.56   0       69.44   0       0       0
Box     0       0       0       94.44   5.56    0
Hclap   0       0       0       30.56   63.89   5.56
Hwav    0       0       2.78    27.78   2.78    66.67

HistLF+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    94.44   0       0       5.56    0       0
Run     16.67   13.89   69.44   0       0       0
Jog     63.89   0       36.11   0       0       0
Box     2.78    0       0       97.22   0       0
Hclap   0       0       0       47.22   44.44   8.33
Hwav    2.78    0       0       33.33   11.11   52.78

HistSTG+SVM, test on s2
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     2.78    16.67   80.56   0       0       0
Jog     55.56   0       44.44   0       0       0
Box     0       0       0       55.56   36.11   8.33
Hclap   0       0       0       30.56   58.33   11.11
Hwav    0       0       0       11.11   36.11   52.78

Table A.10. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1, s2, s3 & s4, for different test scenarios.

LF+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    67.50   0       32.50   0       0       0
Run     0       80.56   19.44   0       0       0
Jog     0       38.89   61.11   0       0       0
Box     0       0       0       97.22   0       2.78
Hclap   0       0       0       19.44   66.67   13.89
Hwav    0       0       0       0       19.44   80.56

HistLF+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    65.00   0       35.00   0       0       0
Run     0       77.78   22.22   0       0       0
Jog     0       25.00   75.00   0       0       0
Box     0       0       0       94.44   0       5.56
Hclap   0       0       0       22.22   72.22   5.56
Hwav    0       0       0       13.89   11.11   75.00

HistSTG+SVM, test on s3
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    52.50   0       47.50   0       0       0
Run     0       80.56   19.44   0       0       0
Jog     0       41.67   58.33   0       0       0
Box     0       0       0       91.67   8.33    0
Hclap   0       0       0       36.11   61.11   2.78
Hwav    0       0       0       25.00   0       75.00

LF+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     2.78    58.33   38.89   0       0       0
Jog     5.56    0       94.44   0       0       0
Box     0       0       0       80.56   19.44   0
Hclap   2.78    0       0       16.67   80.56   0
Hwav    0       0       0       13.89   11.11   75.00

HistLF+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     2.78    38.89   58.33   0       0       0
Jog     13.89   0       86.11   0       0       0
Box     2.78    0       0       91.67   2.78    2.78
Hclap   0       0       0       19.44   80.56   0
Hwav    0       0       0       30.56   5.56    63.89

HistSTG+SVM, test on s4
        Walk    Run     Jog     Box     Hclap   Hwav
Walk    100     0       0       0       0       0
Run     0       77.78   22.22   0       0       0
Jog     0       0       100     0       0       0
Box     8.33    0       0       66.67   25.00   0
Hclap   0       0       0       13.89   86.11   0
Hwav    0       0       0       0       2.78    97.22

Table A.11. Confusion matrices (results in %) for LF+SVM, HistLF+SVM and HistSTG+SVM with training on s1, s2, s3 & s4, for different test scenarios.

Appendix B

Additional Results for Two-class SVM

LF
        Walk    Run     Jog     Box     Hclap
Run     92.80
Jog     86.64   69.79
Box     98.63   99.65   99.30
Hclap   99.66   97.92   100     82.93
Hwav    92.47   100     97.92   88.50   86.46

HistLF
        Walk    Run     Jog     Box     Hclap
Run     91.78
Jog     79.79   72.57
Box     95.19   100     98.26
Hclap   97.60   100     98.95   80.49
Hwav    95.89   100     99.65   84.32   85.76

HistSTG
        Walk    Run     Jog     Box     Hclap
Run     92.80
Jog     76.71   72.57
Box     98.97   99.30   98.95
Hclap   99.32   100     100     64.46
Hwav    98.29   94.79   97.57   81.88   85.76

Table B.1. Recognition rates (%) for action pairs (walk vs. jog, walk vs. run, etc.) using two-class SVM with training on scenarios s1 & s4 (test on all conditions).

LF
        Walk    Run     Jog     Box     Hclap
Run     91.44
Jog     84.93   76.74
Box     97.94   99.65   98.95
Hclap   95.21   96.18   100     88.15
Hwav    98.63   98.96   97.92   90.94   84.38

HistLF
        Walk    Run     Jog     Box     Hclap
Run     92.12
Jog     81.85   73.61
Box     94.50   95.12   98.95
Hclap   98.63   100     100     78.75
Hwav    97.26   94.79   99.65   83.97   82.99

HistSTG
        Walk    Run     Jog     Box     Hclap
Run     90.07
Jog     80.14   74.31
Box     97.25   100     100
Hclap   99.66   99.65   99.31   76.66
Hwav    97.60   100     100     89.55   85.76

Table B.2. Recognition rates (%) for action pairs (walk vs. jog, walk vs. run, etc.) using two-class SVM with training on scenarios s1, s3 & s4 (test on all conditions).

LF
        Walk    Run     Jog     Box     Hclap
Run     93.15
Jog     89.04   77.43
Box     99.31   98.95   98.95
Hclap   99.66   96.88   98.26   82.58
Hwav    98.97   99.31   98.96   90.59   85.07

HistLF
        Walk    Run     Jog     Box     Hclap
Run     93.15
Jog     83.90   70.83
Box     97.25   100     98.26
Hclap   99.32   100     100     82.23
Hwav    97.60   100     98.61   82.23   84.72

HistSTG
        Walk    Run     Jog     Box     Hclap
Run     89.73
Jog     81.85   75.00
Box     98.97   100     100
Hclap   100     100     99.65   75.96
Hwav    100     100     100     85.02   88.54

Table B.3. Recognition rates (%) for action pairs (walk vs. jog, walk vs. run, etc.) using two-class SVM with training on scenarios s1, s2, s3 & s4 (test on all conditions).