An Introduction to Action Recognition/Detection
Sami Benzaid
November 17, 2009
What is action recognition?
The process of identifying actions (in this case, performed by humans) that occur in video sequences.
Example videos: http://www.nada.kth.se/cvap/actions/
Why perform action recognition?
Surveillance footage
User interfaces
Automatic video organization / tagging
Search-by-video?
Complications
Different scales: People may appear at different scales in different videos, yet perform the same action.
Movement of the camera: The camera may be handheld, and the person holding it can cause it to shake; the camera may also be mounted on something that moves.
Complications, continued
Movement with the camera: The subject performing an action (e.g., skating) may be moving with the camera at a similar speed.
Figure from Niebles et al.
Complications, continued
Occlusions: The action may not be fully visible.
Figure from Ke et al.
Complications, continued
Background “clutter”: Other objects or humans may be present in the video frame.
Human variation: Humans are of different sizes and shapes.
Action variation: Different people perform the same action in different ways.
Etc.
Why have I chosen this topic?
I wanted an opportunity to learn about it. (I knew nothing of it beforehand).
I will likely be incorporating it into my research in the future, but I'm not there yet.
Why have I chosen this topic?
Specifically:
Want to know if someone is on the phone (hence not interruptible); status can be set to “busy.”
Want to know if someone is present, by looking for characteristic human actions. Things other than humans can cause motion, so this makes for a more reliable presence detector; online status can change accordingly.
Want to know immediately when someone leaves, so status can accurately be set to unavailable.
1st Paper Overview
Recognizing Human Actions: A Local SVM Approach (2004)
Christian Schuldt, Ivan Laptev and Barbara Caputo
Uses local space-time features to represent video sequences that contain actions.
Classification is done via an SVM; results are also computed for KNN for comparison.
The Dataset
Video dataset with a few thousand instances: 25 people each
perform 6 different actions (walking, jogging, running, boxing, hand waving, hand clapping)
in 4 different scenarios (outdoors, outdoors with scale variation, outdoors with different clothes, indoors), several times.
Backgrounds are mostly free of clutter. Only one person performs a single action per video.
The Dataset
Figure from Schuldt et al.
Local Space-time features
Figure from Schuldt et al.
Representation of Features
Spatial-temporal “jets” (4th order) are computed at each feature center:
l = (L_x, L_y, L_t, L_xx, ..., L_tttt)
Using k-means clustering, a vocabulary of words h_i is created from the jet descriptors.
Finally, a given video is represented by a histogram of counts of occurrences of features corresponding to each word in that video:
H = (h_1, ..., h_n)
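The vocabulary-then-histogram pipeline above can be sketched in a few lines. This is a toy illustration with a naive k-means, not the authors' implementation; function names and parameters are hypothetical.

```python
import numpy as np

def build_vocabulary(descriptors, n_words, n_iter=20, seed=0):
    """Naive k-means over all pooled jet descriptors (rows of `descriptors`)."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), n_words, replace=False)].astype(float)
    for _ in range(n_iter):
        # assign each descriptor to its nearest center
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_words):
            members = descriptors[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers

def video_histogram(video_descriptors, centers):
    """H = (h_1, ..., h_n): counts of each vocabulary word in one video."""
    dists = np.linalg.norm(video_descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(centers))
```

Each video thus becomes a fixed-length count vector regardless of how many features it contains.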
Recognition Methods
2 representations of data:
[1] Raw jet descriptors (LF), with a “local feature” kernel (Wallraven, Caputo, and Graf, 2003)
[2] Histograms of words (HistLF), with a χ² kernel
2 classification methods:
SVM
K Nearest Neighbor
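For the HistLF case, a χ² kernel between histograms is straightforward to compute. The exponential form and the `gamma` parameter below are a common choice, not necessarily the paper's exact formulation.

```python
import numpy as np

def chi2_kernel(A, B, gamma=0.5):
    """Exponential chi-square kernel between rows of two histogram matrices."""
    diff = (A[:, None, :] - B[None, :, :]) ** 2
    summ = A[:, None, :] + B[None, :, :] + 1e-12   # avoid division by zero
    chi2 = 0.5 * (diff / summ).sum(axis=2)
    return np.exp(-gamma * chi2)
```

The resulting Gram matrix can be handed to any kernel machine that accepts a precomputed kernel (e.g., scikit-learn's `SVC(kernel="precomputed")`).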
Results
Figure from Schuldt et al.
Results
Some categories can be confused with others (running vs. jogging vs. walking / waving vs. boxing) due to different ways that people perform these tasks.
Local Features (raw jet descriptors without histograms) combined with SVMs was the best-performing technique for all tested scenarios.
A Potentially Prohibitive Cost
The method we just saw was entirely supervised. All videos used in the training set had labels attached to them.
Sometimes, labeling videos can be a very expensive operation.
What if we want to train on a set of videos that are not necessarily labeled, and still recognize occurrences of their actions in a test set?
2nd Paper Overview
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words (2008)
Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei
An unsupervised approach to classifying actions that occur in video.
Uses pLSA (Probabilistic Latent Semantic Analysis) to learn a model.
Data
Training set: videos in which a single person performs a single action. Videos are unlabeled.
Test set (relaxed requirement): videos that can contain multiple people performing multiple actions simultaneously.
Space-time Interest Points
Figure from Niebles et al.
Space-time Interest Points
Figure from Niebles et al.
Feature Descriptors
Brightness gradients calculated for each interest point ‘cube’.
Image gradients computed for each cube (at different scales of smoothing).
Gradients are concatenated to form a feature vector. Length of vector: [# of pixels in interest “point”] × [# of smoothing scales] × [# of gradient directions].
The vector is projected to a lower dimension using PCA.
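The descriptor construction above can be sketched as follows. This is a simplified illustration (one smoothing scale, gradients from `np.gradient`); the function names are hypothetical, not the authors' code.

```python
import numpy as np

def cube_descriptor(cube):
    """Concatenate brightness gradients of one space-time cube into a vector.
    `cube` has axes (t, y, x); a single smoothing scale is used for brevity."""
    gt, gy, gx = np.gradient(cube.astype(float))  # gradients along t, y, x
    return np.concatenate([gt.ravel(), gy.ravel(), gx.ravel()])

def pca_project(X, n_components):
    """Project descriptor rows of X to a lower dimension via SVD-based PCA."""
    Xc = X - X.mean(axis=0)                      # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # coordinates in top components
```

The raw descriptor length is [# of pixels] × [# of gradient directions] here; adding smoothing scales multiplies it further, which is why the PCA step matters.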
Codebook Formation
Codebook of spatial-temporal words:
K-means clustering of all space-time interest points in the training set.
Clustering metric: Euclidean distance.
Videos represented as collections of spatial-temporal words.
Representation
Space-time interest points:
Figure from Niebles et al.
pLSA: Learning a Model
Probabilistic Latent Semantic Analysis: a generative model.
Variables:
w_i = spatial-temporal word
d_j = video
n(w_i, d_j) = co-occurrence table (# of occurrences of word w_i in video d_j)
z = topic, corresponding to an action
Probabilistic Latent Semantic Analysis
• Unsupervised technique
• Two-level generative model: a video is a mixture of topics, and each topic has its own characteristic “word” distribution
d → z → w
(video → topic → word, via P(z|d) and P(w|z))
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999
Slide: Lana Lazebnik
Probabilistic Latent Semantic Analysis
• Unsupervised technique
• Two-level generative model: a video is a mixture of topics, and each topic has its own characteristic “word” distribution
d → z → w
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999
p(w_i|d_j) = Σ_{k=1..K} p(w_i|z_k) p(z_k|d_j)
Slide: Lana Lazebnik
The pLSA model
p(w_i|d_j) = Σ_{k=1..K} p(w_i|z_k) p(z_k|d_j)
p(w_i|d_j): probability of word i in video j (known)
p(w_i|z_k): probability of word i given topic k (unknown)
p(z_k|d_j): probability of topic k given video j (unknown)
Slide: Lana Lazebnik
The pLSA model
In matrix form, the observed codeword distributions factor into per-topic codeword distributions and per-video class distributions:
[p(w_i|d_j)] (words × videos, M×N) = [p(w_i|z_k)] (words × topics, M×K) × [p(z_k|d_j)] (topics × videos, K×N)
Slide: Lana Lazebnik
Learning pLSA parameters
Maximize the likelihood of the data, given the observed counts n(w_i, d_j) of word i in video j.
M … number of codewords
N … number of videos
Slide credit: Josef Sivic
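The standard way to maximize this likelihood is EM, alternating between computing topic responsibilities and re-estimating both distributions. A minimal dense-matrix sketch, assuming a small M×N count matrix (not the authors' implementation):

```python
import numpy as np

def plsa_em(n_wd, K, n_iter=100, seed=0):
    """EM for pLSA on an M x N word-by-video count matrix n(w_i, d_j)."""
    rng = np.random.default_rng(seed)
    M, N = n_wd.shape
    p_w_z = rng.random((M, K)); p_w_z /= p_w_z.sum(axis=0)   # p(w|z), columns sum to 1
    p_z_d = rng.random((K, N)); p_z_d /= p_z_d.sum(axis=0)   # p(z|d), columns sum to 1
    for _ in range(n_iter):
        # E-step: responsibilities p(z_k | w_i, d_j), shape (M, N, K)
        joint = p_w_z[:, None, :] * p_z_d.T[None, :, :]
        resp = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: re-estimate distributions from expected counts
        weighted = n_wd[:, :, None] * resp
        p_w_z = weighted.sum(axis=1)
        p_w_z /= p_w_z.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=0).T
        p_z_d /= p_z_d.sum(axis=0, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

Each iteration provably does not decrease the data likelihood; in practice one also monitors the log-likelihood for convergence.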
Inference
• Finding the most likely topic (class) for a video:
z* = argmax_z p(z|d)
Slide: Lana Lazebnik
Inference
• Finding the most likely topic (class) for a video:
z* = argmax_z p(z|d)
• Finding the most likely topic (class) for a visual word in a given video:
z* = argmax_z p(z|w,d) = argmax_z [ p(w|z) p(z|d) / Σ_{z′} p(w|z′) p(z′|d) ]
Slide: Lana Lazebnik
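Given learned distributions p_w_z (M×K) and p_z_d (K×N), both argmax rules reduce to one-liners; a small sketch under those shape assumptions:

```python
import numpy as np

def video_topic(p_z_d, j):
    """Most likely topic for video d_j: z* = argmax_z p(z|d_j)."""
    return int(np.argmax(p_z_d[:, j]))

def word_topic(p_w_z, p_z_d, i, j):
    """Most likely topic for word w_i in video d_j:
    z* = argmax_z p(w_i|z) p(z|d_j); the shared denominator
    over z' does not affect the argmax, so it is omitted."""
    return int(np.argmax(p_w_z[i, :] * p_z_d[:, j]))
```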
Datasets
KTH human motion dataset (Schuldt et al. 2004) (Dataset introduced in previous paper.)
Weizmann human action dataset (Blank et al. 2005)
Figure skating dataset (Wang et al. 2006)
Example of Testing (KTH)
Figure from Niebles et al.
KTH Dataset Results
Figure from Niebles et al.
Results Compared to Prev. Paper
Figures from Niebles et al., Schuldt et al.
Results Compared to Prev. Paper
First Paper (Supervised Technique), columns in the same class order as the rows:

              Walk   Run    Jog    Wave   Clap   Box
Walking       83.8   0      16.2   0      0      0
Running       6.3    54.9   38.9   0      0      0
Jogging       22.9   16.7   60.4   0      0      0
Handwaving    0.7    0      0      73.6   4.9    20.8
Handclapping  1.4    0      0      3.5    59.7   35.4
Boxing        0.7    0      0      0.7    0.7    97.9

This Paper (Unsupervised Technique):

              Walk   Run    Jog    Wave   Clap   Box
Walking       82     0      17     0      0      1
Running       1      88     11     0      0      0
Jogging       9      37     53     0      1      0
Handwaving    0      0      0      93     0      7
Handclapping  0      0      0      0      86     14
Boxing        0      0      0      0      2      98
Weizmann Dataset
10 action categories, 9 different people performing each category.
90 videos. Static camera, simple background.
Weizmann Examples
Figure from Niebles et al.
Example of Testing (Weizmann)
Figure from Niebles et al.
Weizmann Dataset Results
Figure from Niebles et al.
Figure Skating Dataset
32 video sequences, 7 people, 3 actions:
Stand-spin, camel-spin, sit-spin
Camera motion, background clutter, viewpoint changes.
Example Frames
Figure from Niebles et al.
Example of Testing
Figure from Niebles et al.
Figure Skating Results
Figure from Niebles et al.
A Distinction
Action Recognition vs. Event Detection
Action Recognition = Classify a video of an actor performing a certain class of action.
Event Detection = Detect instance(s) of event(s) from predefined classes, occurring in video that more closely resembles real life (e.g., cluttered background, multiple humans, occlusions).
3rd Paper Overview
Event Detection in Cluttered Videos (2007)
Y. Ke, R. Sukthankar, and M. Hebert
Represents actions as spatial-temporal volumes.
Detection is done via a distance threshold between a template action volume and a video sequence.
Example images
Cluttered background + occlusion: hand-waving
Cluttered background: picking something up
Images from Ke et al.
Representation of an event
Spatio-temporal volume:
Example: hand-waving
Image from Ke et al.
Preprocessing
A video in which event detection is to be performed is first segmented into space-time regions using mean-shift.
The video is treated as a volume, and not individually by frames.
Objects in the video are over-segmented in space and time.
Authors state this is equivalent to “superpixels”, in the space-time context.
Preprocessing
Figure from Ke et al.
Detecting an Event in Video
Want to detect the event corresponding to template T (e.g., a hand-waving volume) in video volume V (which has been over-segmented).
Slide the template along all locations in the video volume V, measuring the shape-matching distance between T and the relevant subset of V.
Shape-matching distance
Appropriate region intersection distance (the mismatch between template T and the candidate region of V):
d(T, V) = |T ∩ V̄| + |T̄ ∩ V|
The authors point out that enumerating all subsets of segmented objects in V is very inefficient.
They detail an optimization that reduces run-time computation to table lookups, as well as a different distance metric.
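On explicit binary voxel volumes, this symmetric-difference distance is a two-line computation. A toy sketch for intuition only; the paper's contribution is precisely avoiding this brute-force voxel counting:

```python
import numpy as np

def shape_distance(T, V):
    """Voxel count of the symmetric difference |T ∩ V̄| + |T̄ ∩ V|
    between two binary space-time volumes of the same shape."""
    T = np.asarray(T, dtype=bool)
    V = np.asarray(V, dtype=bool)
    return int((T & ~V).sum() + (~T & V).sum())
```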
Shape-matching distance
Basic Idea:
Figure from Ke et al.
Flow Distance
Additionally, a flow distance metric is used.
For a spatial-temporal patch P1 in T and P2 in V, calculate flow correlation distance (whether or not the same flow could have generated both patches).
Flow correlation distance algorithm by Shechtman and Irani (2005)
Breaking up the template
Why is this useful?
Figure from Ke et al.
Matching template parts to regions
Template parts may not match the over-segmented regions well:
Figure from Ke et al.
Detection with parts
Use a cutting plane (where the template was split) when calculating distance.
Encode the relationship between the multiple parts of a template as a graph, with information about distances.
Energy function to minimize, over part placements L = (l_1, ..., l_n):
L* = argmin_L [ Σ_{i=1..n} a_i(l_i) + Σ_{(v_i, v_j) ∈ E} d_ij(l_i, l_j) ]
Appearance distance:
a_i(l_i) = d_N(T_i, V; l_i) + d_F(T_i, V; l_i)
Detection occurs when the distance is below a chosen threshold.
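The structure of this energy (per-part unary costs plus pairwise costs over graph edges) can be illustrated with a brute-force minimizer over a tiny label set. This is a toy stand-in: labels are plain integers and the pairwise term is lam * |l_i - l_j|, which is an assumption, not the paper's d_ij.

```python
import itertools
import numpy as np

def best_part_placement(unary, edges, lam=1.0):
    """Brute-force argmin of sum_i a_i(l_i) + sum_{(i,j) in edges} lam*|l_i - l_j|.
    `unary` is (n_parts, n_labels); feasible only for tiny problems."""
    n_parts, n_labels = unary.shape
    best_labels, best_energy = None, float("inf")
    for labels in itertools.product(range(n_labels), repeat=n_parts):
        e = sum(unary[i, l] for i, l in enumerate(labels))
        e += lam * sum(abs(labels[i] - labels[j]) for i, j in edges)
        if e < best_energy:
            best_labels, best_energy = labels, e
    return best_labels, best_energy
```

With no edges the parts are placed independently at their unary minima; a strong pairwise weight instead forces a coherent joint placement, which is the point of the graph.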
The Data
About 20 minutes of video, containing approximately 110 events for which templates exist.
Templates (the “training set”) are created from a single instance of an action being performed; they are manually segmented and split.
Example Detections
Figure from Ke et al.
Results
Figure from Ke et al.
Reference Links
Recognizing Human Actions: A Local SVM Approach. Christian Schuldt, Ivan Laptev and Barbara Caputo. http://www.nada.kth.se/cvap/actions/
Event Detection in Cluttered Videos. Y. Ke, R. Sukthankar, and M. Hebert. http://www.cs.cmu.edu/~yke/video/
Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words. Juan Carlos Niebles, Hongcheng Wang and Li Fei-Fei. http://vision.stanford.edu/niebles/humanactions.htm
Extra Slides
Extra Slides after this point.
Detection of features (Schuldt et al.)
A video is an image sequence f(x, y, t). A scale-space representation of f is constructed by convolution with a Gaussian kernel:
L(·; σ², τ²) = g(·; σ², τ²) * f(·)
A second-moment matrix μ is computed using spatio-temporal image gradients:
μ(·; σ², τ²) = g(·; sσ², sτ²) * (∇L (∇L)ᵀ)
(σ and τ are the spatial / temporal scale parameters.)
Detection of features (Schuldt et al.)
The positions of the features are the maxima of H = det(μ) − k · trace³(μ) over (x, y, t).
Size/neighborhood of features is determined by the scale parameters.
More detailed info about space-time interest points: Ivan Laptev and Tony Lindeberg. Space-Time Interest Points. In Proc. ICCV 2003, Nice, France, pp. I:432-439.
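The response H = det(μ) − k·trace³(μ) can be sketched from smoothed gradient products. This is a simplified illustration: box averaging stands in for the Gaussian integration window, and the choice k = 0.005 is an assumption.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def box_smooth(a, w=3):
    """Crude box averaging standing in for the Gaussian integration window;
    output shrinks by w-1 along each axis."""
    return sliding_window_view(a, (w, w, w)).mean(axis=(-3, -2, -1))

def harris3d_response(f, k=0.005, w=3):
    """H = det(mu) - k * trace(mu)^3 from smoothed products of space-time
    gradients; a simplified sketch of the Laptev-Lindeberg detector."""
    gt, gy, gx = np.gradient(f.astype(float))     # gradients along t, y, x
    xx = box_smooth(gx * gx, w); xy = box_smooth(gx * gy, w); xt = box_smooth(gx * gt, w)
    yy = box_smooth(gy * gy, w); yt = box_smooth(gy * gt, w); tt = box_smooth(gt * gt, w)
    # determinant and trace of the symmetric 3x3 second-moment matrix per voxel
    det = xx * (yy * tt - yt * yt) - xy * (xy * tt - yt * xt) + xt * (xy * yt - yy * xt)
    trace = xx + yy + tt
    return det - k * trace ** 3
```

Interest points would then be the local maxima of this response volume.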
Space-time Interest Points (Niebles et al.)
Detection algorithm based on: Dollár, Rabaud, Cottrell, & Belongie (2005)
Response function:
R = (I * g * h_ev)² + (I * g * h_od)²
A 2-D Gaussian kernel g(x, y; σ) is applied only in the spatial dimensions; a quadrature pair of 1-D “Gabor filters” h_ev, h_od is applied over time.
Point locations are local maxima of the response function.
Size is determined by spatial/temporal scale factors.
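This response function is easy to sketch with separable convolutions. Filter widths and the default omega = 4/tau below are assumptions drawn from common practice, not necessarily Dollár et al.'s exact settings.

```python
import numpy as np

def dollar_response(I, sigma=1.0, tau=1.5, omega=None):
    """R = (I*g*h_ev)^2 + (I*g*h_od)^2: spatial Gaussian smoothing plus a
    temporal even/odd Gabor pair. `I` has axes (t, y, x)."""
    if omega is None:
        omega = 4.0 / tau
    I = I.astype(float)
    # separable spatial Gaussian over y and x
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    g = np.exp(-x**2 / (2 * sigma**2)); g /= g.sum()
    S = np.apply_along_axis(lambda v: np.convolve(v, g, "same"), 1, I)
    S = np.apply_along_axis(lambda v: np.convolve(v, g, "same"), 2, S)
    # temporal even (cosine) and odd (sine) Gabor filters
    t = np.arange(-int(2 * tau), int(2 * tau) + 1)
    env = np.exp(-t**2 / tau**2)
    h_ev = np.cos(2 * np.pi * omega * t) * env
    h_od = np.sin(2 * np.pi * omega * t) * env
    Rev = np.apply_along_axis(lambda v: np.convolve(v, h_ev, "same"), 0, S)
    Rod = np.apply_along_axis(lambda v: np.convolve(v, h_od, "same"), 0, S)
    return Rev**2 + Rod**2
```

Summing the squares of the quadrature pair makes R respond to periodic temporal intensity change regardless of its phase.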
Recognition of Multiple Actions (Niebles et al.)
Figure from Niebles et al.