research article affine-invariant feature extraction for...

Hindawi Publishing CorporationISRNMachine VisionVolume 2013 Article ID 215195 7 pageshttpdxdoiorg1011552013215195

Research ArticleAffine-Invariant Feature Extraction for Activity Recognition

Samy Sadek1 Ayoub Al-Hamadi2 Gerald Krell2 and Bernd Michaelis2

1 Department of Mathematics and Computer Science Faculty of Science Sohag University 82524 Sohag Egypt2 Institute for Information Technology and Communications (IIKT) Otto von Guericke University Magdeburg39106 Magdeburg Germany

Correspondence should be addressed to Samy Sadek samytechnikgmailcom

Received 28 April 2013 Accepted 4 June 2013

Academic Editors A Gasteratos D P Mukherjee and A Torsello

Copyright copy 2013 Samy Sadek et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

We propose an innovative approach for human activity recognition based on affine-invariant shape representation and SVM-based feature classification In this approach a compact computationally efficient affine-invariant representation of action shapesis developed by using affine moment invariants Dynamic affine invariants are derived from the 3D spatiotemporal action volumeand the average image created from the 3D volume and classified by an SVM classifier On two standard benchmark action datasets(KTH andWeizmann datasets) the approach yields promising results that compare favorably with those previously reported in theliterature while maintaining real-time performance

1 IntroductionVisual recognition and interpretation of human-inducedactions and events are among the most active research areasin computer vision pattern recognition and image under-standing communities [1] Although a great deal of progresshas been made in automatic recognition of human actionsduring the last two decades the approaches proposed inthe literature remain limited in their ability This leadsto a need for much research work to be conducted toaddress the ongoing challenges and develop more efficientapproaches It is clear that developing good algorithms forsolving the problem of human action recognitionwould yieldhuge potential for a large number of potential applicationsfor example the search and the structuring of large videoarchives human-computer interaction video surveillancegesture recognition and robot learning and control Infact the nonrigid nature of human body and clothes invideo sequences resulting from drastic illumination changeschanging in pose and erratic motion patterns presents thegrand challenge to human detection and action recognitionIn addition while the real-time performance is a majorconcern in computer vision especially for embedded com-puter vision systems the majority of state-of-the-art humanaction recognition systems often employ sophisticated featureextraction and learning techniques creating a barrier to

the real-time performance of these systems This suggests atrade-off between accuracy and real-time performance Theremainder of this paper commences by briefly reviewingthe most relevant literature in this area of human actionrecognition in Section 2 Then in Section 3 we describethe details of the proposed method for action recognitionThe experimental results corroborating the proposedmethodeffectiveness are presented and analyzed in Section 4 Finallyin Section 5 we conclude and mention possible future work

2 The Literature Overview

Recent few years have witnessed a resurgence of interestin more research on the analysis and interpretation ofhuman motion motivated by the rise of security concernsand increased ubiquity and affordability of digital mediaproduction equipment Human action can generally be rec-ognized using various visual cues such as motion [2 3]and shape [4 5] Scanning the literature one notices thata significant body of work in human action recognitionfocuses on using spatial-temporal key points and local featuredescriptors [6] The local features are extracted from theregion around each key point detected by the key pointdetection process These features are then quantized toprovide a discrete set of visual words before they are fed

2 ISRNMachine Vision

Figure 1 GMM background subtraction the first and third rows display two sequences of walking and running actions from KTH andWeizmann action datasets respectively while the second and fourth rows show the results of background subtraction where foregroundobjects are shown in cyan color

into the classification module Another thread of research isconcerned with analyzing patterns of motion to recognizehuman actions For instance in [7] periodic motions aredetected and classified to recognize actions Alternativelysome researchers have opted to use both motion and shapecues In [8] the authors detect the similarity between videosegments using a space-time correlation model In [9]Rodriguez et al present a template-based approach usinga Maximum Average Correlation Height (MACH) filter tocapture intraclass variabilities Likewise a significant amountof work is targeted at modelling and understanding humanmotions by constructing elaborated temporal dynamic mod-els [10] There is also an attractive area of research whichfocuses on using generative topic models for visual recogni-tion based on the so-called Bag-of-Words (BoW) model [11]The underlying concept of a BoW is that each video sequenceis represented by counting the number of occurrences ofdescriptor prototypes so-called visual words Topic modelsare built and then applied to the BoW representation Threeexamples of commonly used topic models include Corre-lated Topic Models (CTMs) [11] Latent Dirichlet Alloca-tion (LDA) [12] and probabilistic Latent Semantic Analysis(pLSA) [13]

3 Proposed Methodology

In this section the proposed method for action recognitionis described The main steps of the framework are explainedin detail along the following subsections

31 Background Subtraction In this paper we use GaussianMixture Model (GMM) as a basis to model backgrounddistribution Formally speaking let 119883

119905be a pixel in the

current frame 119868119905 where 119905 is the frame index Then each pixel

can be modeled separately by a mixture of 119870 Gaussians

119875 (119883119905) =

119870

sum

119894=1

120596119894119905120578 (119883119905 120583119894119905 Σ119894119905) (1)

Where 120578 is a Gaussian probability density function 120583119894119905

Σ119894119905 and 120596

119894119905are the mean covariance and an estimate of

the weight of the 119894th Gaussian in the mixture at time 119905respectively 119870 is the number of distributions which is setto 5 in experiments Before the foreground is detected thebackground is updated (see [14] for details about the updatingprocedure) After the updates are done the weights 120596

119894119905are

normalized By applying a threshold 119879 (set to 06 in ourexperiments) the background distribution remains on topwith the lowest variance where

119861 = argmin119887

(sum119887

119894=1120596119894119905

sum119870

119894=1120596119894119905

gt 119879) (2)

Finally all pixels 119883119905that match none of the components

are good candidates to be marked as foreground An exampleof GMM background subtraction can be seen in Figure 1

32 Average Images from 3D Action Volumes The 3D volumein the spatio-temporal (119883119884119879) domain is formed by piling

ISRNMachine Vision 3

Figure 2 2D average image created from the 3D spatio-temporalvolume of a walking sequence

up the target region in the image sequences of one actioncycle which is used to partition the sequences for thespatiotemporal volume An action cycle is a fundamental unitto describe the action In this work we assume that the spatio-temporal volume consists of a number of small voxels Theaverage image 119868

119886V(119909 119910) is defined as

119868119886V (119909 119910) =

1

120591

120591minus1

sum

119905=0

119868 (119909 119910 119905) (3)

where 120591 is the number of frames in action cycle (we use120591 = 25 in our experiments) 119868(119909 119910 119905) represents the densityof the voxels at time 119905 An example of average image createdfrom the 3D spatio-temporal volume of the running sequenceis shown in Figure 2 For characterizing these 2D averageimages the 2D affine moment invariants are considered asfeatures [26]

33 Feature Extraction As is well known the momentsdescribe shape properties of an object as it appears Affinemoment invariants are moment-based descriptors which areinvariant under a general affine transform Six affinemomentinvariants can be conventionally derived from the centralmoments [27] as follows

1198681=

1

1205784

00

[1205782012057802

minus 1205782

11]

1198682=

1

12057810

00

[1205782

031205782

30minus 612057830120578211205781212057803

+ 4120578301205783

12

+4120578031205783

21minus 31205782

211205782

12]

1198683=

1

1205787

00

[12057820

(1205782112057803

minus 1205782

12) minus 12057811

(1205783012057803

minus 1205782112057812)

+12057802

(1205780312057812

minus 1205782

21)]

1198684=

1

12057811

00

[1205783

201205782

03minus 61205782

20120578111205781212057803

minus 61205782

20120578021205782112057803

+ 91205782

20120578021205782

12+ 12120578201205782

111205782112057803

+ 61205782012057811120578021205783012057803

+ 181205782012057811120578021205783012057812

minus 81205783

111205783012057803

minus 6120578201205782

021205783012057812

+ 9120578201205782

021205782

21

+121205782

11120578021205783012057812

minus 6120578111205782

021205783012057812

+ 1205783

021205783

30]

1198685=

1

1205786

00

[1205784012057804

minus 41205783112057813

+ 31205782

22]

1198686=

1

1205789

00

[120578401205780412057822

+ 2120578311205781312057822

minus 120578401205782

13

minus120578041205782

13minus 1205783

22]

(4)

where 120578119901119902

is the central moment of order 119901 + 119902For a spatio-temporal (119883119884119879) space the 3D moment of

order (119901 + 119902 + 119903) of 3D object O is derived using the sameprocedure of the 2D centralized moment

120578119901119902119903

= sumsumsum

(119909119910119905)isinO

(119909 minus 119909119892)119901

(119910 minus 119910119892)119902

(119905 minus 119905119892)119903

119868 (119909 119910 119905)

(5)

Where (119909119892 119910119892 119905119892) is the centroid of object in the spatio-

temporal space Based on the definition of the 3D momentin (5) six 3D affine moment invariants can be defined Thefirst two of these moment invariants are given by

1198691=

1

1205785

000

[120578200

120578020

120578002

+ 2120578110

120578101

120578011

minus 120578200

1205782

011

minus120578020

1205782

101minus 120578002

1205782

110]

1198692=

1

1205787

000

[120578400

(120578040

120578004

+ 31205782

022minus 4120578013

120578031

)

+ 3120578202

(120578040

120578202

minus 4120578112

120578130

+ 41205782

121)

+ 12120578211

(120578022

120578211

+ 120578103

120578130

minus 120578031

120578202

minus120578121

120578112

)

+ 4120578310

(120578031

120578103

minus 120578004

120578220

+3120578013

120578121

minus 3120578022

120578112

)

+ 3120578220

(120578004

120578220

+ 2120578022

120578202

+4120578112

minus 4120578013

120578311

minus 4120578121

120578103

)

+ 4120578301

(120578013

120578130

minus 120578040

120578103

+ 3120578031

120578112

minus3120578022

120578121

) ]

(6)

Due to their long formulae the remaining four momentinvariants are not displayed here (refer to [28]) Figure 3


Walk Jog Run Box Wave Clap

Walk Jog Run Box Wave ClapWalk Jog Run Box Wave Clap




07

08

09

1

0

02

04

06

08

0

0005

001

0015

002

I 1 I 2

I 3 I 4

I 5 I 6

times10minus4

5

0

minus5

minus10

minus003

minus002

minus001

0

001

minus03

minus02

minus01

0

01

Figure 3 Plots of 2D affine moment invariants (119868119894 119894 = 1 6) computed on the average images of walking jogging running boxing

waving and clapping sequences

shows a series of plots of 2D dynamic affine invariants withdifferent action classes computed on the average images ofaction sequences

34 Action Classification Using SVM In this section we for-mulate the action recognition task as a multiclass learningproblem where there is one class for each action and thegoal is to assign an action to an individual in each videosequence [1 29] There are various supervised learning algo-rithms by which action recognizer can be trained SupportVector Machines (SVMs) are used in this work due to theiroutstanding generalization capability and reputation of ahighly accurate paradigm [30] SVMs that provide a bestsolution to data overfitting in neural networks are basedon the structural risk minimization principle from compu-tational theory Originally SVMs were designed to handledichotomic classes in a higher dimensional space where amaximal separating hyperplane is created On each side ofthis hyperplane two parallel hyperplanes are conductedThen SVM attempts to find the separating hyperplane thatmaximizes the distance between the two parallel hyperplanes(see Figure 4) Intuitively a good separation is achieved bythe hyperplane having the largest distance Hence the largerthemargin the lower the generalization error of the classifierFormally let D = (x

119894 119910119894) | x119894

isin R119889 119910119894

isin minus1 +1 be atraining dataset Vapnik [30] shows that the problem is best

120585i

xi

120585j

xj

120573x+ 1205730

= +1

120573x+ 1205730

= 0

120573x+ 1205730

= minus1

Figure 4 Generalized optimal separating hyperplane

addressed by allowing some examples to violate the marginconstraints These potential violations are formulated withsome positive slack variables 120585

119894and a penalty parameter 119862 ge

0 that penalize the margin violations Thus the generalizedoptimal separating hyperplane is determined by solving thefollowing quadratic programming problem

min1205731205730

1

2

10038171003817100381710038171205731003817100381710038171003817

2

+ 119862sum

119894

120585119894 (7)

subject to (119910119894(⟨x119894120573⟩ + 120573

0) ge 1 minus 120585

119894forall119894) and (120585

119894ge 0 forall119894)


Geometrically 120573 isin R119889 is a vector going through thecenter and perpendicular to the separating hyperplane Theoffset parameter 120573

0is added to allow the margin to increase

and not to force the hyperplane to pass through the originthat restricts the solution For computational purposes itis more convenient to solve SVM in its dual formulationThis can be accomplished by forming the Lagrangian andthen optimizing over the Lagrangemultiplier 120572The resultingdecision function hasweight vector120573 = sum

119894120572119894x119894119910119894 0 le 120572

119894le 119862

The instances x119894with 120572

119894gt 0 are called support vectors as they

uniquely define the maximummargin hyperplaneIn the current approach several classes of actions are cre-

ated Several one-versus-all SVM classifiers are trained usingaffine moment features extracted from action sequences inthe training dataset For each action sequence a set of six2D affine moment invariants is extracted from the averageimage Also another set of six 3D affine moment invariantsis extracted from the spatio-temporal silhouette sequenceThen SVM classifiers are trained on these features to learnvarious categories of actions

4 Experiments and Results

To evaluate the proposed approach two main experimentswere carried out and the results we achieved were comparedwith those reported by other state-of-the-art methods

41 Experiment 1 We conducted this experiment using KTHaction dataset [31] To illustrate the effectiveness of themethod the obtained results are compared with those ofother similar state-of-the-art methods The KTH datasetcontains action sequences comprised of six types of humanactions (ie walking jogging running boxing handwavingand hand clapping) These actions are performed by a totalof 25 individuals in four different settings (ie outdoorsoutdoors with scale variation outdoors with different clothesand indoors) All sequences were acquired by a static cameraat 25 fps and a spatial resolution of 160 times 120 pixels overhomogeneous backgrounds To the best of our knowledgethere is no other similar dataset already available in theliterature of sequences acquired on different environments Inorder to prepare the experiments and to provide an unbiasedestimation of the generalization abilities of the classificationprocess a set of sequences (75 of all sequences) performedby 18 subjects was used for training and other sequences(the remaining 25) performed by the other 7 subjects wereset aside as a test set SVMs with Gaussian radial basisfunction (RBF) kernel are trained on the training set whilethe evaluation of the recognition performance is performedon the test set

The confusion matrix that shows the recognition resultsachieved on the KTH action dataset is given in Table 1 whilethe comparison of the obtained results with those obtainedby other methods available in the literature is shown inTable 3 As follows from the figures tabulated in Table 1most actions are correctly classified Furthermore there isa high distinction between arm actions and leg actionsMost of the mistakes where confusions occur are betweenldquojoggingrdquo and ldquorunningrdquo actions and between ldquoboxingrdquo and

Table 1 Confusion matrix for the KTH dataset

WalkingRunningJoggingBoxingWavingClapping

Walking

094000004000000000

Running

001096008000000000

Jogging

005004088000000000

Boxing

000000000094002001

Waving

000000000002093003

Clapping

000000000004005096

Action

ldquoclappingrdquo actions This is intuitively plausible due to thefact of high similarity between each pair of these actionsFrom the comparison given by Table 3 it turns out that ourmethod performs competitively with other state-of-the-artmethods It is pertinent to mention here that the state-of-the-art methods with which we compare our method haveused the same dataset and the same experimental conditionstherefore the comparison seems to be quite fair

42 Experiment 2 This second experiment was conductedusing the Weizmann action dataset provided by Blank etal [32] in 2005 which contains a total of 90 video clips(ie 5098 frames) performed by 9 individuals Each videoclip contains one person performing an action There are 10categories of action involved in the dataset namely walkingrunning jumping jumping in place bending jacking skippinggalloping sideways one-hand waving and two-hand wavingTypically all the clips in the dataset are sampled at 25Hz andlast about 2 seconds with image frame size of 180 times 144 Inorder to provide an unbiased estimate of the generalizationabilities of the proposedmethod we have used the leave-one-out cross-validation (LOOCV) technique in the validationprocess As the name suggests this involves using a groupof sequences from a single subject in the original dataset asthe testing data and the remaining sequences as the trainingdata This is repeated such that each group of sequences inthe dataset is used once as the validation Again as with thefirst experiment SVMs with Gaussian RBF kernel are trainedon the training set while the evaluation of the recognitionperformance is performed on the test set

The confusion matrix in Table 2 provides the recognitionresults obtained by the proposed method where correctresponses define the main diagonal From the figures in thematrix a number of points can be drawn The majority ofactions are correctly classified An average recognition rateof 978 is achieved with our proposed method What ismore there is a clear distinction between arm actions andleg actions The mistakes where confusions occur are onlybetween skip and jump actions and between jump and runactions This intuitively seems to be reasonable due to thefact of high closeness or similarity among the actions in eachpair of these actions In order to quantify the effectiveness ofthe method the obtained results are compared qualitativelywith those obtained previously by other investigators Theoutcome of this comparison is presented in Table 3 In thelight of this comparison one can see that the proposedmethod is competitive with the state-of-the-art methods


Table 2 Confusion matrix for the Weizmann dataset

Action

Bend

Bend

Jump

Jump

Pjump

Pjump

Walk

Walk

Run

Run

Side

Side

Jack

Jack

Skip

Skip

Wave 1

Wave 1

Wave 2

Wave 2

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

090

000

000

010

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

010

000

000

090

000

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

100

000

Table 3 Comparison with the state of the art on the KTH andWeizmann datasets

Method KTH WeizmannOur method 935 980Liu and Shah [15] 928 mdashWang and Mori [16] 925 mdashJhuang et al [17] 917 mdashRodriguez et al [9] 886 mdashRapantzikos et al [18] 883 mdashDollar et al [19] 812 mdashKe et al [20] 630 mdashFathi and Mori [21] mdash 100Bregonzio et al [22] mdash 966Zhang et al [23] mdash 928Niebles et al [24] mdash 900Dollar et al [19] mdash 852Klaser et al [25] mdash 843

It is worthwhile to mention that all the methods that wecompared our method with except the method proposedin [21] have used similar experimental setups thus thecomparison seems to be meaningful and fair A final remarkconcerns the real-time performance of our approach Theproposed action recognizer runs at 18fps on average (using a28GHz Intel dual core machine with 4GB of RAM running32-bit Windows 7 Professional)

5 Conclusion and Future Work

In this paper we have introduced an approach for activityrecognition based on affine moment invariants for activityrepresentation and SVMs for feature classification On two

benchmark action datasets the results obtained by theproposed approach were compared favorably with thosepublished in the literature The primary focus of our futurework will be to investigate the empirical validation of theapproach on more realistic datasets presenting many techni-cal challenges in data handling such as object articulationocclusion and significant background clutter

References

[1] S Sadek A Al-Hamadi B Michaelis and U Sayed ldquoRecog-nizing human actions a fuzzy approach via chord-length shapefeaturesrdquo ISRN Machine Vision vol 1 pp 1ndash9 2012

[2] A A Efros A C Berg G Mori and J Malik ldquoRecognizingaction at a distancerdquo in Proceedings of the 9th IEEE InternationalConference on Computer Vision (ICCV rsquo03) vol 2 pp 726ndash733October 2003

[3] S Sadek A Al-Hamadi B Michaelis and U Sayed ldquoTowardsrobust human action retrieval in videordquo in Proceedings of theBritish Machine Vision Conference (BMVC rsquo10) AberystwythUK September 2010

[4] S Sadek A Al-Hamadi B Michaelis and U Sayed ldquoHumanactivity recognition a scheme using multiple cuesrdquo in Proceed-ings of the International Symposium on Visual Computing (ISVCrsquo10) vol 1 pp 574ndash583 Las Vegas Nev USA November 2010

[5] S Sadek A AI-Hamadi M Elmezain B Michaelis and USayed ldquoHuman activity recognition via temporal momentinvariantsrdquo in Proceedings of the 10th IEEE International Sym-posiumon Signal Processing and Information Technology (ISSPITrsquo10) pp 79ndash84 Luxor Egypt December 2010

[6] S Sadek A Al-Hamadi B Michaelis and U Sayed ldquoAn actionrecognition scheme using fuzzy log-polar histogram and tem-poral self-similarityrdquo EURASIP Journal on Advances in SignalProcessing vol 2011 Article ID 540375 2011

[7] R Cutler and L S Davis ldquoRobust real-time periodic motiondetection analysis and applicationsrdquo IEEE Transactions on


Pattern Analysis andMachine Intelligence vol 22 no 8 pp 781ndash796 2000

[8] E Shechtman and M Irani ldquoSpace-time behavior based corre-lationrdquo in Proceedings of the IEEE Computer Society Conferenceon Computer Vision and Pattern Recognition (CVPR rsquo05) vol 1pp 405ndash412 June 2005

[9] M D Rodriguez J Ahmed and M Shah ldquoAction MACH aspatio-temporal maximum average correlation height filter foraction recognitionrdquo in Proceedings of the 26th IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR rsquo08) June2008

[10] N Ikizler and D Forsyth ldquoSearching video for complex activ-ities with finite state modelsrdquo in Proceedings of the IEEE Com-puter Society Conference on Computer Vision and Pattern Recog-nition (CVPR rsquo07) June 2007

[11] D M Blei and J D Lafferty ldquoCorrelated topic modelsrdquo inAdvances in Neural Information Processing Systems (NIPS) vol18 pp 147ndash154 2006

[12] D M Blei A Y Ng and M I Jordan ldquoLatent Dirichlet alloca-tionrdquo Journal of Machine Learning Research vol 3 no 4-5 pp993ndash1022 2003

[13] T Hofmann ldquoProbabilistic latent semantic indexingrdquo in Pro-ceedings of the 22nd Annual International ACM SIGIR Con-ference on Research and Development in Information Retrieval(SIGIR rsquo99) pp 50ndash57 1999

[14] S J McKenna Y Raja and S Gong ldquoTracking colour objectsusing adaptive mixture modelsrdquo Image and Vision Computingvol 17 no 3-4 pp 225ndash231 1999

[15] J Liu and M Shah ldquoLearning human actions via informationmaximizationrdquo in Proceedings of the 26th IEEE Conference onComputer Vision and Pattern Recognition (CVPR rsquo08) June2008

[16] YWang andGMori ldquoMax-Margin hidden conditional randomfields for human action recognitionrdquo in Proceedings of the IEEEComputer Society Conference on Computer Vision and PatternRecognition Workshops (CVPR rsquo09) pp 872ndash879 June 2009

[17] H Jhuang T Serre L Wolf and T Poggio ldquoA biologicallyinspired system for action recognitionrdquo in Proceedings of the 11thIEEE International Conference on Computer Vision (ICCV rsquo07)pp 257ndash267 October 2007

[18] K Rapantzikos Y Avrithis and S Kollias ldquoDense saliency-based spatiotemporal feature points for action recognitionrdquoin Proceedings of the IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshops (CVPRrsquo09) pp 1454ndash1461 June 2009

[19] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehaviorrecognition via sparse spatio-temporal featuresrdquo in Proceedingsof the 2nd Joint IEEE International Workshop on Visual Surveil-lance and Performance Evaluation of Tracking and Surveillance(VS-PETS rsquo05) pp 65ndash72 October 2005

[20] Y Ke R Sukthankar and M Hebert ldquoEfficient visual eventdetection using volumetric featuresrdquo in Proceedings of the 10thIEEE International Conference on Computer Vision (ICCV rsquo05)pp 166ndash173 October 2005

[21] A Fathi and GMori ldquoAction recognition by learning mid-levelmotion featuresrdquo in Proceedings of the 26th IEEE Conferenceon Computer Vision and Pattern Recognition (CVPR rsquo08) June2008

[22] M Bregonzio S Gong and T Xiang ldquoRecognising action asclouds of space-time interest pointsrdquo in Proceedings of the IEEEComputer Society Conference on Computer Vision and PatternRecognition Workshops (CVPR rsquo09) pp 1948ndash1955 June 2009

[23] Z Zhang YHu S Chan and L-T Chia ldquoMotion context a newrepresentation for human action recognitionrdquo in Proceeding ofthe European Conference on Computer Vision (ECCV rsquo08) vol4 pp 817ndash829 2008

[24] J C Niebles H Wang and L Fei-Fei ldquoUnsupervised learningof human action categories using spatial-temporalwordsrdquo Inter-national Journal of Computer Vision vol 79 no 3 pp 299ndash3182008

[25] A Klaser M Marszaek and C Schmid ldquoA spatiotemporaldescriptor based on 3D-gradientsrdquo in Proceedings of the BritishMachine Vision Conference (BMVC rsquo08) 2008

[26] S Sadek A Al-Hamadi B Michaelis and U Sayed ldquoHumanaction recognition via affinemoment invariantsrdquo in Proceedingsof the 21st International Conference on Pattern Recognition(ICPR rsquo12) pp 218ndash221 Tsukuba Science City Japan November2012

[27] J Flusser and T Suk ldquoPattern recognition by affine momentinvariantsrdquo Pattern Recognition vol 26 no 1 pp 167ndash174 1993

[28] D Xu and H Li ldquo3-D affine moment invariants generated bygeometric primitivesrdquo in Proceedings of the 18th InternationalConference on Pattern Recognition (ICPR rsquo06) pp 544ndash547August 2006

[29] S Sadek A Al-Hamadi B Michaelis and U Sayed ldquoAn SVMapproach for activity recognition based on chord-length-function shape featuresrdquo inProceedings of the IEEE InternationalConference on Image Processing (ICIP rsquo12) pp 767ndash770OrlandoFla USA October 2012

[30] VN VapnikTheNature of Statistical LearningTheory SpringerNew York NY USA 1995

[31] C Schuldt I Laptev and B Caputo ldquoRecognizing humanactions a local SVM approachrdquo in Proceedings of the 17thInternational Conference on Pattern Recognition (ICPR rsquo04) pp32ndash36 2004

[32] M Blank L Gorelick E Shechtman M Irani and R BasrildquoActions as space-time shapesrdquo in Proceedings of the 10th IEEEInternational Conference on Computer Vision (ICCV rsquo05) vol 2pp 1395ndash1402 October 2005

International Journal of

AerospaceEngineeringHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014


Active and Passive Electronic Components

Control Scienceand Engineering

Journal of



RotatingMachinery


Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design



Shock and Vibration


Civil EngineeringAdvances in

Acoustics and VibrationAdvances in



Electrical and Computer Engineering

Journal of

Advances inOptoElectronics


Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of


Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014


Chemical EngineeringInternational Journal of Antennas and

Propagation




Navigation and Observation



DistributedSensor Networks



Figure 1 GMM background subtraction the first and third rows display two sequences of walking and running actions from KTH andWeizmann action datasets respectively while the second and fourth rows show the results of background subtraction where foregroundobjects are shown in cyan color

into the classification module Another thread of research isconcerned with analyzing patterns of motion to recognizehuman actions For instance in [7] periodic motions aredetected and classified to recognize actions Alternativelysome researchers have opted to use both motion and shapecues In [8] the authors detect the similarity between videosegments using a space-time correlation model In [9]Rodriguez et al present a template-based approach usinga Maximum Average Correlation Height (MACH) filter tocapture intraclass variabilities Likewise a significant amountof work is targeted at modelling and understanding humanmotions by constructing elaborated temporal dynamic mod-els [10] There is also an attractive area of research whichfocuses on using generative topic models for visual recogni-tion based on the so-called Bag-of-Words (BoW) model [11]The underlying concept of a BoW is that each video sequenceis represented by counting the number of occurrences ofdescriptor prototypes so-called visual words Topic modelsare built and then applied to the BoW representation Threeexamples of commonly used topic models include Corre-lated Topic Models (CTMs) [11] Latent Dirichlet Alloca-tion (LDA) [12] and probabilistic Latent Semantic Analysis(pLSA) [13]

3 Proposed Methodology

In this section the proposed method for action recognitionis described The main steps of the framework are explainedin detail along the following subsections

31 Background Subtraction In this paper we use GaussianMixture Model (GMM) as a basis to model backgrounddistribution Formally speaking let 119883

119905be a pixel in the

current frame 119868119905 where 119905 is the frame index Then each pixel

can be modeled separately by a mixture of 119870 Gaussians

119875 (119883119905) =

119870

sum

119894=1

120596119894119905120578 (119883119905 120583119894119905 Σ119894119905) (1)

Where 120578 is a Gaussian probability density function 120583119894119905

Σ119894119905 and 120596

119894119905are the mean covariance and an estimate of

the weight of the 119894th Gaussian in the mixture at time 119905respectively 119870 is the number of distributions which is setto 5 in experiments Before the foreground is detected thebackground is updated (see [14] for details about the updatingprocedure) After the updates are done the weights 120596

119894119905are

normalized By applying a threshold 119879 (set to 06 in ourexperiments) the background distribution remains on topwith the lowest variance where

119861 = argmin119887

(sum119887

119894=1120596119894119905

sum119870

119894=1120596119894119905

gt 119879) (2)

Finally all pixels 119883119905that match none of the components

are good candidates to be marked as foreground An exampleof GMM background subtraction can be seen in Figure 1

32 Average Images from 3D Action Volumes The 3D volumein the spatio-temporal (119883119884119879) domain is formed by piling




119886V(119909 119910) is defined as

119868119886V (119909 119910) =

1

120591

120591minus1

sum

119905=0

119868 (119909 119910 119905) (3)



1198681=

1

1205784

00

[1205782012057802

minus 1205782

11]

1198682=

1

12057810

00

[1205782

031205782

30minus 612057830120578211205781212057803

+ 4120578301205783

12

+4120578031205783

21minus 31205782

211205782

12]

1198683=

1

1205787

00

[12057820

(1205782112057803

minus 1205782

12) minus 12057811

(1205783012057803

minus 1205782112057812)

+12057802

(1205780312057812

minus 1205782

21)]

1198684=

1

12057811

00

[1205783

201205782

03minus 61205782

20120578111205781212057803

minus 61205782

20120578021205782112057803

+ 91205782

20120578021205782

12+ 12120578201205782

111205782112057803

+ 61205782012057811120578021205783012057803

+ 181205782012057811120578021205783012057812

minus 81205783

111205783012057803

minus 6120578201205782

021205783012057812

+ 9120578201205782

021205782

21

+121205782

11120578021205783012057812

minus 6120578111205782

021205783012057812

+ 1205783

021205783

30]

1198685=

1

1205786

00

[1205784012057804

minus 41205783112057813

+ 31205782

22]

1198686=

1

1205789

00

[120578401205780412057822

+ 2120578311205781312057822

minus 120578401205782

13

minus120578041205782

13minus 1205783

22]

(4)

where 120578119901119902



120578119901119902119903

= sumsumsum

(119909119910119905)isinO

(119909 minus 119909119892)119901

(119910 minus 119910119892)119902

(119905 minus 119905119892)119903

119868 (119909 119910 119905)

(5)



1198691=

1

1205785

000

[120578200

120578020

120578002

+ 2120578110

120578101

120578011

minus 120578200

1205782

011

minus120578020

1205782

101minus 120578002

1205782

110]

1198692=

1

1205787

000

[120578400

(120578040

120578004

+ 31205782

022minus 4120578013

120578031

)

+ 3120578202

(120578040

120578202

minus 4120578112

120578130

+ 41205782

121)

+ 12120578211

(120578022

120578211

+ 120578103

120578130

minus 120578031

120578202

minus120578121

120578112

)

+ 4120578310

(120578031

120578103

minus 120578004

120578220

+3120578013

120578121

minus 3120578022

120578112

)

+ 3120578220

(120578004

120578220

+ 2120578022

120578202

+4120578112

minus 4120578013

120578311

minus 4120578121

120578103

)

+ 4120578301

(120578013

120578130

minus 120578040

120578103

+ 3120578031

120578112

minus3120578022

120578121

) ]

(6)








07

08

09

1

0

02

04

06

08

0

0005

001

0015

002

I 1 I 2

I 3 I 4

I 5 I 6

times10minus4

5

0

minus5

minus10

minus003

minus002

minus001

0

001

minus03

minus02

minus01

0

01





119894 119910119894) | x119894

isin R119889 119910119894


120585i

xi

120585j

xj

120573x+ 1205730

= +1

120573x+ 1205730

= 0

120573x+ 1205730

= minus1





min1205731205730

1

2

10038171003817100381710038171205731003817100381710038171003817

2

+ 119862sum

119894

120585119894 (7)

subject to (119910119894(⟨x119894120573⟩ + 120573

0) ge 1 minus 120585

119894forall119894) and (120585

119894ge 0 forall119894)





119894120572119894x119894119910119894 0 le 120572

119894le 119862











Walking

094000004000000000

Running

001096008000000000

Jogging

005004088000000000

Boxing

000000000094002001

Waving

000000000002093003

Clapping

000000000004005096

Action






Action

Bend

Bend

Jump

Jump

Pjump

Pjump

Walk

Walk

Run

Run

Side

Side

Jack

Jack

Skip

Skip

Wave 1

Wave 1

Wave 2

Wave 2

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

090

000

000

010

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

010

000

000

090

000

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

100

000







References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation












119886V(119909 119910) is defined as

119868119886V (119909 119910) =

1

120591

120591minus1

sum

119905=0

119868 (119909 119910 119905) (3)



1198681=

1

1205784

00

[1205782012057802

minus 1205782

11]

1198682=

1

12057810

00

[1205782

031205782

30minus 612057830120578211205781212057803

+ 4120578301205783

12

+4120578031205783

21minus 31205782

211205782

12]

1198683=

1

1205787

00

[12057820

(1205782112057803

minus 1205782

12) minus 12057811

(1205783012057803

minus 1205782112057812)

+12057802

(1205780312057812

minus 1205782

21)]

1198684=

1

12057811

00

[1205783

201205782

03minus 61205782

20120578111205781212057803

minus 61205782

20120578021205782112057803

+ 91205782

20120578021205782

12+ 12120578201205782

111205782112057803

+ 61205782012057811120578021205783012057803

+ 181205782012057811120578021205783012057812

minus 81205783

111205783012057803

minus 6120578201205782

021205783012057812

+ 9120578201205782

021205782

21

+121205782

11120578021205783012057812

minus 6120578111205782

021205783012057812

+ 1205783

021205783

30]

1198685=

1

1205786

00

[1205784012057804

minus 41205783112057813

+ 31205782

22]

1198686=

1

1205789

00

[120578401205780412057822

+ 2120578311205781312057822

minus 120578401205782

13

minus120578041205782

13minus 1205783

22]

(4)

where 120578119901119902



120578119901119902119903

= sumsumsum

(119909119910119905)isinO

(119909 minus 119909119892)119901

(119910 minus 119910119892)119902

(119905 minus 119905119892)119903

119868 (119909 119910 119905)

(5)



1198691=

1

1205785

000

[120578200

120578020

120578002

+ 2120578110

120578101

120578011

minus 120578200

1205782

011

minus120578020

1205782

101minus 120578002

1205782

110]

1198692=

1

1205787

000

[120578400

(120578040

120578004

+ 31205782

022minus 4120578013

120578031

)

+ 3120578202

(120578040

120578202

minus 4120578112

120578130

+ 41205782

121)

+ 12120578211

(120578022

120578211

+ 120578103

120578130

minus 120578031

120578202

minus120578121

120578112

)

+ 4120578310

(120578031

120578103

minus 120578004

120578220

+3120578013

120578121

minus 3120578022

120578112

)

+ 3120578220

(120578004

120578220

+ 2120578022

120578202

+4120578112

minus 4120578013

120578311

minus 4120578121

120578103

)

+ 4120578301

(120578013

120578130

minus 120578040

120578103

+ 3120578031

120578112

minus3120578022

120578121

) ]

(6)








07

08

09

1

0

02

04

06

08

0

0005

001

0015

002

I 1 I 2

I 3 I 4

I 5 I 6

times10minus4

5

0

minus5

minus10

minus003

minus002

minus001

0

001

minus03

minus02

minus01

0

01





119894 119910119894) | x119894

isin R119889 119910119894


120585i

xi

120585j

xj

120573x+ 1205730

= +1

120573x+ 1205730

= 0

120573x+ 1205730

= minus1





min1205731205730

1

2

10038171003817100381710038171205731003817100381710038171003817

2

+ 119862sum

119894

120585119894 (7)

subject to (119910119894(⟨x119894120573⟩ + 120573

0) ge 1 minus 120585

119894forall119894) and (120585

119894ge 0 forall119894)





119894120572119894x119894119910119894 0 le 120572

119894le 119862











Walking

094000004000000000

Running

001096008000000000

Jogging

005004088000000000

Boxing

000000000094002001

Waving

000000000002093003

Clapping

000000000004005096

Action






Action

Bend

Bend

Jump

Jump

Pjump

Pjump

Walk

Walk

Run

Run

Side

Side

Jack

Jack

Skip

Skip

Wave 1

Wave 1

Wave 2

Wave 2

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

090

000

000

010

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

010

000

000

090

000

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

100

000







References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation















07

08

09

1

0

02

04

06

08

0

0005

001

0015

002

I 1 I 2

I 3 I 4

I 5 I 6

times10minus4

5

0

minus5

minus10

minus003

minus002

minus001

0

001

minus03

minus02

minus01

0

01





119894 119910119894) | x119894

isin R119889 119910119894


120585i

xi

120585j

xj

120573x+ 1205730

= +1

120573x+ 1205730

= 0

120573x+ 1205730

= minus1





min1205731205730

1

2

10038171003817100381710038171205731003817100381710038171003817

2

+ 119862sum

119894

120585119894 (7)

subject to (119910119894(⟨x119894120573⟩ + 120573

0) ge 1 minus 120585

119894forall119894) and (120585

119894ge 0 forall119894)





119894120572119894x119894119910119894 0 le 120572

119894le 119862











Walking

094000004000000000

Running

001096008000000000

Jogging

005004088000000000

Boxing

000000000094002001

Waving

000000000002093003

Clapping

000000000004005096

Action






Action

Bend

Bend

Jump

Jump

Pjump

Pjump

Walk

Walk

Run

Run

Side

Side

Jack

Jack

Skip

Skip

Wave 1

Wave 1

Wave 2

Wave 2

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

090

000

000

010

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

010

000

000

090

000

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

100

000







References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation













119894120572119894x119894119910119894 0 le 120572

119894le 119862











Walking

094000004000000000

Running

001096008000000000

Jogging

005004088000000000

Boxing

000000000094002001

Waving

000000000002093003

Clapping

000000000004005096

Action






Action

Bend

Bend

Jump

Jump

Pjump

Pjump

Walk

Walk

Run

Run

Side

Side

Jack

Jack

Skip

Skip

Wave 1

Wave 1

Wave 2

Wave 2

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

090

000

000

010

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

010

000

000

090

000

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

100

000







References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation











Action

Bend

Bend

Jump

Jump

Pjump

Pjump

Walk

Walk

Run

Run

Side

Side

Jack

Jack

Skip

Skip

Wave 1

Wave 1

Wave 2

Wave 2

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

090

000

000

010

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

010

000

000

090

000

000

000

000

000

000

000

000

000

000

000

100

000

000

000

000

000

000

000

000

100

000







References





































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation






































RoboticsJournal of





Journal of



RotatingMachinery





VLSI Design



Shock and Vibration







Journal of



Volume 2014


SensorsJournal of





Propagation









research article affine-invariant feature extraction for...

Documents