9.913 pattern recognition for vision - mit opencourseware...example: utterance classification...
TRANSCRIPT
![Page 1: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/1.jpg)
9.913 Pattern Recognition for Vision
Class XIII, Motion and Gesture Yuri Ivanov
Fall 2004
![Page 2: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/2.jpg)
TOC
• Movement – Activity – Action • View-based representation • Sequence comparison • Hidden Markov Models • Hierarchical representations
Fall 2004 Pattern Recognition for Vision
![Page 3: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/3.jpg)
![Page 4: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/4.jpg)
From Tracking to Classification
How do we describe that? How do we classify that?
Figure by MIT OCW.
Fall 2004 Pattern Recognition for Vision
![Page 5: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/5.jpg)
Sequence Analysis
• We might want to ask: – Is <it> doing something meaningful? – What exactly? – How does it do it?
• How fast – e.g. conducting • How accurately – e.g. dance instruction • What style?
• That leads us to a sequence analysis
Fall 2004 Pattern Recognition for Vision
![Page 6: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/6.jpg)
Motion Taxonomy
• Movement – Primitive motion – is “what it looks like”
• Activity – Requires explicit sequence model
• Action – Requires contextual information – Requires relational information – And many other things…
Images removed due to copyright considerations. Please see: Bobick, A. "Movement, Activity, and Action: The Role of Knowledge in Perception
of Motion." Phyl Trans of Royal Statistical Society 352 (1997).
Self-evidential,
Fall 2004 Pattern Recognition for Vision
![Page 7: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/7.jpg)
Basic Problem
1 1
1 1
1 1
x
y
q
Ø ø Œ œ Œ œ Œ œ Œ œ Œ œº ß�
1
1
1
M
M
M
x
y
q
�
2 1
2 1
2 1
x
y
q
Ø ø Œ œ Œ œ Œ œ Œ œ Œ œº ß�
2
2
2
N
N
N
x
y
q
Ø ø Œ œ Œ œ Œ œ Œ œ Œ œº ß�
1 1 1 1... Ms = x x 2 2
2 1 ... Ns = x x
1 2
? s s<>
Object trajectory in the feature space
Ø ø Œ œ Œ œ Œ œ Œ œ Œ œ º ß
Fall 2004 Pattern Recognition for Vision
![Page 8: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/8.jpg)
Motion Energy Image
1
0
( , ( , , ) i
E t
t
-
=
= -∪
( , , )D x y t
Sum the differences over the last τ frames:
Photographs and figures from: Bobick, A., and J. Davis. "The Representation and Recognition of Action Using Temporal Templates."
IEEE Transactions on Pattern Analysis and Machine Intelligence 23 3 (2002). Courtesy of IEEE, A. Bobick, and J. Davis. Copyright 2002
IEEE. Used with Permission.
First idea –implicit representation of time
, ) x y t D x y t i
- frame differencing
- WHERE motion happened
, no.
Fall 2004 Pattern Recognition for Vision
![Page 9: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/9.jpg)
- -
Motion History Image
Step two: include temporal information
( ,( , , )
( , , 1) H x y t
H otherwiset t
t =� = �
� t
( , ( , , ( ( , ( , , 1))H H H x y tt t ta= - + - -t
Photographs and figures from: Bobick, A., and J. Davis. "The Representation and Recognition of Action Using Temporal Templates." IEEE Transactions on Pattern Analysis and Machine Intelligence 23, no. 3 (2002). Courtesy of IEEE, A. Bobick, and J. Davis. Copyright 2002 IEEE. Used with Permission.
, ) 1 max(0, 1)
if D x y t x y t
- HOW motion happened Aside - you can compute a similar measure recursively:
, ) 1) , ) x y t x y t D x y t
Fall 2004 Pattern Recognition for Vision
![Page 10: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/10.jpg)
Illustration
OpenCV – Intel Open source Computer Vision Library
Fall 2004 Pattern Recognition for Vision
![Page 11: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/11.jpg)
Classification
Feature vector: x = [7 for MEI + 7 for MHI]
RTS invariant shape descriptors (see the end of notes)
2[ ) ]E xw w w w wm m= S = -
With the usual Gaussian assumption on distribution of x:
( ) ( )1T x xw w ww m m -Ø ø= - S -º ß
Then the class, ω :
Hu moments Hu moments
]; [( E x
argmin
Fall 2004 Pattern Recognition for Vision
![Page 12: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/12.jpg)
The model is replicated for a discrete number of views:
( ) ( )1,( ) min i i i
T
iV x xq q q w w w q
w m m -Ø ø= - S -Œ œº ß
[ ]( )iVw w=
For q = � �
Closest member of each class
Nearest class
Photographs and figures from: Bobick, A., and J. Davis. "The Representation and Recognition of Action Using Temporal Templates." IEEE Transactions on Pattern Analysis and Machine Intelligence 23, no. 3 (2002). Courtesy of IEEE, A. Bobick, and J. Davis. Copyright 2002 IEEE. Used with Permission.
Multi-View Recognition
argmin
{0 ...90 }
Fall 2004 Pattern Recognition for Vision
![Page 13: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/13.jpg)
Example Application
KidsRoom
http://whitechapel.media.mit.edu/vismod/demos/kidsroom/kidsroom.html
• Interactive story • Autonomous system • Narration is controlled • Input from cameras and mike
• Visual events: • position • motion energy • motion direction • gross body motion
Image removed due to copyright considerations. See:
Fall 2004 Pattern Recognition for Vision
![Page 14: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/14.jpg)
Motion Energy
Reference object
Move right Move left Move straight
MEI
Vismod Tech Report # 398
Fall 2004 Pattern Recognition for Vision
![Page 15: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/15.jpg)
Movement Classification
MEI/MHI
Last game sequence
“Flap”
“Spin”
Fall 2004 Pattern Recognition for Vision
![Page 16: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/16.jpg)
Temporal Alignment
< > ?
them as vectors
YX
Another idea – temporal alignment
If sequences are aligned to a common time axis, then we can treat
Fall 2004 Pattern Recognition for Vision
![Page 17: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/17.jpg)
Temporal Alignment
1
1
yT
xT
1 T
1 T
1
yT
1
xT
( )y yi kf=
( )x xi kf=
k
k
yi
xi
ix and iy that align X and Y to a common time axis k while minimizing dissimilarity.
Global normalization
[ ]( ){ }2
1
1 ( ) T
x x y y n
E mMf =
Ø ø= - º ß�
Local weighting
Find re-indexing sequences
One solution – Dynamic Time Warp algorithm
( ) ( ) n s i n s i n
Fall 2004 Pattern Recognition for Vision
![Page 18: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/18.jpg)
Example: Utterance Classification
Master Prototype
Class Prototypes
Original
samples
Normalized 59
samples
Time normalization: 1. Find the least distortion
prototype in each class 2. Pick the longest one 3. Warp all data to it 4. Train classifier
30-70
“sit”
“roll over”
Fall 2004 Pattern Recognition for Vision
![Page 19: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/19.jpg)
Example: Utterance Classification
1. Compute the symmetric DTW between all pairs
2. Compute an RBF Kernel
( , ) ( , )
2 i j j i
ijd +
=
)i j ijdg= -
1
SVM: N
i i
ba =
= +� x x
Danger: K
Alternative - pair-wise alignment:
D s s D s s
( , ) exp( K s s
( ) ( , ) i i f x y K
might not be a proper kernel matrix – need to regularize
Fall 2004 Pattern Recognition for Vision
![Page 20: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/20.jpg)
Example: Utterance Classification
Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20%
Japanese Vowel Set (UCI Machine Learning Repository): • Speaker identification task • 9 speakers • saying the same Japanese vowel • features 12 cepstral coefficients • each utterance – 7-30 samples • 340 training examples • 240 testing examples
Fall 2004 Pattern Recognition for Vision
![Page 21: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/21.jpg)
Hidden Markov Model (HM&M)Yet another idea:
Fall 2004 Figure by MIT OCW. Pattern Recognition for Vision
![Page 22: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/22.jpg)
Hidden Markov Model (HM&M)Yet another idea:
Fall 2004 Figure by MIT OCW Pattern Recognition for Vision
![Page 23: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/23.jpg)
HMMs
Figure by MIT OCW.
Another view – “Graphical Model”:
Fall 2004 Pattern Recognition for Vision
![Page 24: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/24.jpg)
Components of an HMM
A Bl p=
1) π i = p(q1 =i)
2) A aij=p(qt =i| q =j
3) B given state: bi =p(ot|qt =i)
1
1 N
ij j
a =
=�
1
1 N
i i
p =
=�
1ib =�
{ , , }
- probability of starting from a particular state ⎪ π
- probability of moving to a state, given the history t-1 ) – Markov assumption
- probability of outputing a particular observation from a (o)
( ) x dx
Fall 2004 Pattern Recognition for Vision
![Page 25: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/25.jpg)
HMM in Pictures
N states
T transitions
…
• A = p(qt| q ,…, q1)= p(qt| q• p(qt| q ) is independent of t
π
qt
q qTq1
o1
t-1 t-1) – Markov assumption t-1 – stationary => a matrix
t-1
Fall 2004 Pattern Recognition for Vision
![Page 26: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/26.jpg)
HMM Example
Input Alphabet
Output distributions Transition
matrix Initial state
How HMM sees it
Fall 2004 Pattern Recognition for Vision
![Page 27: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/27.jpg)
Three Tasks of HMM
1. Given a sequence of observations find a probability of it given the model, p(O|λ)
2. Given a sequence of observations recover a sequence of states, P(q|O,λ)
3. Given a sequence, estimate parameters of the model
Fall 2004 Pattern Recognition for Vision
![Page 28: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/28.jpg)
Problem I – Probability Calculation
( | ) ( , | ) ( | , ) ( | )P P P Pl l l l " "
= =� � q q
O q
Marginalize:
1 21 2 3( ( ) ( ( )Tq q qP o b ol =O q
1 1 2 2 3 1( | ) ...
T Tq q qP a a al p -
=q
Given: 1( To o=O
Calculate: ( | )P lO
N states, T transitions => |q|=NT !!!!
N=5, T=100 => 2TNT= 2*100*5100 ~ 1072 computations
1 1 1 2 2 2 3 3 11 2 3( ( ) ( ) ( ( )T T Tq q q q q q q TP P o a b ol l p
-=q
65536*1072 particles in the universe
Take I – brute force:
O q O q
| , ) )... b o b q q q q
,..., )
| , ) ( | ) )... q q q q b o a b o a b O q
Fall 2004 Pattern Recognition for Vision
![Page 29: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/29.jpg)
Try Again
1
1 (1) 2 ) 3 ( ( )( ) ( ) ( ( )q q q q q q q q Tb b b o a b op -=
= � 7210
q
q q
2N T»
2 TTN» S
S
( | ) ( , | )P Pl l "
= � q
O O q
1 2 3( ) ( ) ( ( )ij j j l T m l j i
p= �
(1) (1) (2) (2) (2) (3 (3) 1) ( ) )... q T q T q T o a o a
)... i i k k m m b o a b o a b o a b o �� ��
Fall 2004 Pattern Recognition for Vision
![Page 30: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/30.jpg)
Problem I – Probability Calculation
Define a “forward variable”, α 1 2( , | )t t ti P oa l= = t
and ending up in state i
1. Initialize
1 1( )i ii b oa p= …
2. Induce
1 1 1
( ) N
t t ij j t i
j ia a+ + =
Ø ø = Œ œº ß
� …
3. Terminate
1
( | ) ( ) N
T i
P il a =
= �O …
Take II – forward procedure:
( ) ... o o q i - probability of seeing the string up to
( )
( ) ( ) a b o
Fall 2004 Pattern Recognition for Vision
![Page 31: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/31.jpg)
While We Are At It…
Define a “backward variable”, β
1 2( | , )t t t T tb l+ + = = string after t and after visiting state i at t
1. Initialize
T ib = …
2. Induce
1 1 1
( ) ( ) N
t ij j t t j
i a jb b+ + =
= � …
3. Terminate …
( ) ... i P o o o q i - probability of seeing the rest of the
( ) 1
( ) b o
Fall 2004 Pattern Recognition for Vision
![Page 32: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/32.jpg)
Task II – Optimal State Sequence
i at time t.
Given: 1( To o=O
Find: ( | , )t t q
q P q l= O
1
( , )( | )
( , )
t t N
t i
P
P =
= = =
=� O
O O
rule
1 1 1 1 1( , ) ( , ) ( , | , )t t t T t t t t T t tP q + + = =O
…
1 1( ,t t t T t t tP o q a b+ = =
1o 2o 3o 1To - To
“Optimality” - maximum probability of being in a state
,..., )
argmax
q i P q i
q i
by Bayes
... ... , ... ) ( ... ... P o o o o q P o o q P o o o o q
... ) ( ... | ) o o q P o
Fall 2004 Pattern Recognition for Vision
![Page 33: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/33.jpg)
State Posterior
1 1
( , ) ( )( | ) ( )
( , ) ( )
t t t t tN N
t t t j j
P q i i ii i
P q j j j
a b g
a b = =
= = = = =
=� � O
O O
So,
1. α matrix 2. β matrix 3. 4. Normalize columns
2N T» 2N T»
NT 2N T»
What’s the problem?
But not entirely useless! We will need it later.
( )
( ) P q
Forward pass – compute Backward pass – compute Multiply element-by element
Inconsistent paths – some might not even be allowed
Fall 2004 Pattern Recognition for Vision
![Page 34: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/34.jpg)
Task II – Viterbi Algorithm
Given: 1( To o=O
Find: | , )P l q
q O
single maximum probability path.
Define: 1
1 1 1 ... ( , ,
t t t t t q
i q qd -
-= =
By the optimality principle (Bellman, ’57):
1 1( )t t ij j tij b od d+ +
Ø ø= º ß
Max prob. path so far
Just need to keep track of max probability states along the way
,..., )
argmax (
“Optimality” –
1 2
( ) max ... ... ) q q
P q i o o
( ) max ( ) i a
Fall 2004 Pattern Recognition for Vision
![Page 35: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/35.jpg)
Task II – Viterbi Algorithm (cont.)
1 1( )i ii b od p=
* * 1 1( )t t tq qy + + =
1. Initialize
2. Recurse
3. Terminate
4. Backtrack
11 ( )t t ij j ti N
j b od d -£ £ Ø ø= º ß
1 i N£ £
1 iy =
2 t T£ £ 1 j N£ £
1 1
( )t t iji N
j i ay d -£ £
Ø ø= º ß 2 t T£ £ 1 j N£ £
Housekeeping variable
*
1 ( )Ti N
P id £ £
= *
1 ( )T T
i N q id
£ £ =
(t T= -
( )
( ) max ( ) i a
( ) 0
( ) argmax
max
argmax
1), ..., 1
Fall 2004 Pattern Recognition for Vision
![Page 36: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/36.jpg)
Viterbi Illustration
• Similar to the forward procedure • Typically, you’ll do it in log space for speed and underflows:
- replace all parameters with their logarithms - replace all multiplications with additions
Fall 2004 Pattern Recognition for Vision
![Page 37: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/37.jpg)
Task III – Parameter Estimation
Given: 1( To o=O
Find: , ,A Bp
1 1
( , , )) ( , | )
( ) t t
t t t
ji j j
P x +
+
= = = = = =
OO
O
1( ) ( )
( ) t ij j t t j
P
a b+= O
1( )ij j t+
( )i ta ( 1)j tb +
,..., )
Baum-Welch algorithm (EM for HMMs)
First, introduce another greek letter:
( , P q i q
P q i q
( ) i a b o
a b o
Fall 2004 Pattern Recognition for Vision
![Page 38: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/38.jpg)
Transition Probability
[ ] [ ]
1
1 1
1
) .) ( )
T
t t
ij T
t t
i jE i j a
E i i
x
g
-
= -
=
fi = =
fi
�
�
…
1o 2o 3o 1To - To
expected # of transitions from i to j
This leads to:
The rest is easy
S
( , ) #( #(
Fall 2004 Pattern Recognition for Vision
![Page 39: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/39.jpg)
Priors and Outputs
[ ] [ ]
1
1
( )
( ) ( )
t k
T
t t o vk
i T
t t
i E i v
b k E i i
g
g
= =
=
= =
�
�
Sum probabilities of being in state i while
k
[ ] 1 ( )i E ip g= = =
Output distribution (discrete):
Prior distribution:
Normalize
Figure by MIT OCW.
#( , ) #( )
seeing symbol v
#( , 1) i t
Fall 2004 Pattern Recognition for Vision
![Page 40: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/40.jpg)
Continuous Output Case
( , )i i ib o N m= S
Output distribution (continuous, Gaussian):
1
1
( )
( )
T
t t t
i T
t t
i o
i
g m
g
=
=
� =
�
�
1
1
)
( )
T T
t t i t i t
i T
t t
i o o
i
g m m
g
=
=
� - -S =
�
�
These should look VERY familiar
Observation at time t weighted by the probability of being in the state at that time
( )
( ) ( )(
Fall 2004 Pattern Recognition for Vision
![Page 41: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/41.jpg)
Semi-Continuous HMM Example
Input
Transition matrix
Initial state
Output distributions
How HMM sees it
Fall 2004 Pattern Recognition for Vision
![Page 42: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/42.jpg)
Gesture Recognition –Trajectory Model
Modeling a tracked hand trajectory.
Fall 2004 Pattern Recognition for Vision
![Page 43: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/43.jpg)
HMM Classifier
Input Sequence
ArgMax
l1
l2
l3
l4
l
Nothing unusual:
Bank of HMMs
Fall 2004 Pattern Recognition for Vision
![Page 44: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/44.jpg)
Applications – American Sign Language
Task: Recognition of sentences of American Sign Language
40 word lexicon:
• Single camera • No special markings on hands • Real-time
part of speech vocabulary pronoun I, you, he, we,
you(pl), they verb want, like, lose,
dontwant, dontlike, love, pack, hit, loan
noun box, car, book, table, paper, pants, bicycle, bottle, can, wristwatch, umbrella, coat, pencil, shoes, food, magazine, fish, mouse, pill, bowl
adjective red, brown, black, gray, yellow
Table from: Starner, T., and et. al. "Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video." IEEE Transactions on Pattern Analysis and Machine Intelligence (1998). Courtesy of IEEE. Copyright 1998 IEEE. Used with Permission.
Fall 2004 Pattern Recognition for Vision
![Page 45: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/45.jpg)
ASL – Features and Model
with a single skip transition:
( )min, , , , , , , / ,(...) T
leftrighto l lØ ø= º ß
Features (from skin model):
System 1: Second person System 2: First person
Nose could be used for initializing the skin model Courtesy of Thad Starner. Used with permission.
Photo marked with black box due to copyright consideration.
“Word” model – a 4-state L-R HMM
max max x y dx dy area q l
Fall 2004 Pattern Recognition for Vision
![Page 46: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/46.jpg)
ASL – In Action
Courtesy of Thad Starner. Used with permission.
Fall 2004 Pattern Recognition for Vision
![Page 47: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/47.jpg)
Applications – American Sign Language
experiment training set test set all features 94.10% 91.90%
relative features 89.60% 87.20% all features & unrestricted
grammar
81.0% (87%) (D=31, S=287, I=137, N=2390)
74.5% (83%) (D=3, S=76, I=41, N=470)
grammar training set test set part-of-speech 99.30% 97.80%
5-word sentence 98.2% (98.4%) (D = 5, S=36, I=5 N =2500) 97.80%
unrestricted 96.4% (97.8%) (D=24, S=32, I=35, N=2500)
96.8% (98.0%) (D=4, S=6, I=6,
N=500)
500 sentences (400 training, 100 testing)
System 1:
System 2:
fq`ll`q
% words recognized correctly
Word accuracy,
1 D S I
N + +
-
o`qs,ne,roddbg
Fall 2004 Pattern Recognition for Vision
![Page 48: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/48.jpg)
Beyond HMM
Where can we go if HMM is not sufficient?
Ideas:
Capable of generating only a regular language
More expressive, may include memory, but harder to deal with
Explicit representation of structure Compact representation
Figures from: Ivanov, Y., and A. Bobick. "Recognition of Visual Activities and Interactions." IEEE Transactions of Pattern Analysis and Machine Intelligence (2000). Courtesy of IEEE. Copyright 2000 IEEE. Used with Permission.
• Hierarchical HMM • More complex models - SCFG
Fall 2004 Pattern Recognition for Vision
![Page 49: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/49.jpg)
Structured Gesture
Problem: 2 directions = 2 models WHY???
Solution – split the model in two: • Components (trajectories) • Structure (events)
Fall 2004 Pattern Recognition for Vision
![Page 50: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/50.jpg)
Heterogeneous Representation
• – Pitching, cooking, dancing, stealing a car from a parking lot
• Components – Signal level model – Variability in performance – Hidden state representation (HMM, etc.)
• Structure – – Uncertainty in component detections – State is NOT hidden (SRG, SCFG, etc)
• Right tool for the right task!
Many high-level activities are sequences of primitives
Event-level model
Fall 2004 Pattern Recognition for Vision
![Page 51: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/51.jpg)
Two-tier Recognition Architecture
Fall 2004 Pattern Recognition for Vision
![Page 52: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/52.jpg)
Application: Conducting Music
V V V V 4 4
“Dictionary” gestures
Fall 2004 Pattern Recognition for Vision
![Page 53: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/53.jpg)
Application: Conducting Music
123 4 56 1234
56
= = 123
456
Grammar:
~70%Individual
~95%
~85%
Correct
Component
Bar
Jean Sibelius, Second Symphony, Opus 43, D Major
Fall 2004 Pattern Recognition for Vision
![Page 54: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/54.jpg)
Component Detection
Input sequence
Peaks of likelihoodHMM Likelihoods
Single event
Fall 2004 Pattern Recognition for Vision
![Page 55: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/55.jpg)
Temporal Consistency
a b a ab b b
|A fi Grammar:
Input
ab abA
Fall 2004 Pattern Recognition for Vision
![Page 56: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/56.jpg)
a b a ab b b
Temporal Consistency
|A fi Grammar:
Input
Terminals have temporal extent!
Figures from: Ivanov, Y., and A. Bobick. "Recognition of Visual Activities and Interactions." IEEE Transactions of Pattern Analysis and Machine Intelligence (2000). Courtesy of IEEE. Copyright 2000 IEEE. Used with Permission.
ab abA
Fall 2004 Pattern Recognition for Vision
![Page 57: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/57.jpg)
a b a ab b b
Consistent parse
Temporal Consistency
|A fi Grammar:
Input
Terminals have temporal extent!
Inconsistent parse
Figures from: Ivanov, Y., and A. Bobick. "Recognition of Visual Activities and Interactions." IEEE Transactions of Pattern Analysis and Machine Intelligence (2000). Courtesy of IEEE. Copyright 2000 IEEE. Used with Permission.
ab abA
Fall 2004 Pattern Recognition for Vision
![Page 58: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/58.jpg)
a b a ab b b
Temporal Consistency
|A fi Grammar:
Input
Terminals have temporal extent!
Inconsistent parse
Consistent parse
Figures from: Ivanov, Y., and A. Bobick. "Recognition of Visual Activities and Interactions." IEEE Transactions of Pattern Analysis and Machine Intelligence (2000). Courtesy of IEEE. Copyright 2000 IEEE. Used with Permission.
ab abA
Fall 2004 Pattern Recognition for Vision
![Page 59: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/59.jpg)
Parsing
Parser
a b c d e
adecbDetection
The idea is that the top level parse will filter out mistakes in low level detections
Fall 2004 Pattern Recognition for Vision
![Page 60: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/60.jpg)
Stochastic Context-Free Grammar
Example Grammar:
TRACK [0.50] [0.50]
[0.25] [0.25] [0.25] [0.25]
. . .
[0.70] [0.30]
semantically significant groups of events
Terminals events
Rule probabilities
-> CAR-TRACK | PERSON-TRACK
CAR-TRACK -> CAR-THROUGH | CAR-PICKUP | CAR-OUT | CAR-DROP
CAR-THROUGH . . .
CAR-EXIT -> car-exit | SKIP car-exit
Non-terminals
- individual
SKIP-rule - noise symbol
Fall 2004 Pattern Recognition for Vision
![Page 61: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/61.jpg)
Event Parsing
X
Z
a b c
SKIP SKIP
a x y b p q r c
X Zc Z ab
fi fi
For the production X, events a, b and c should be consistent
- production rules (states)
X - target non-terminal (label)
Z - intermediate non-terminal
- input stream (tracking events)
- noise rules
Fall 2004 Pattern Recognition for Vision
![Page 62: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/62.jpg)
Application: Musical Conducting
~70%Individual
~95% ~85%
Correct
Component Bar
Courtesy of Teresa Marrin-Nakra. Used with permission.
Fall 2004 Pattern Recognition for Vision
![Page 63: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/63.jpg)
![Page 64: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/64.jpg)
Application: Surveillance System
• changes
• Static cameras • •
parking lot • Handling simultaneous events
Outdoor environment - occlusions and lighting
Real-time performance Labeling activities and person-vehicle interactions in a
Fall 2004 Pattern Recognition for Vision
![Page 65: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/65.jpg)
Monitoring System
Photos and figures from: Stauffer, Chris, and Eric Grimson, "Learning Patterns of Activity Using Real-Time Tracking." IEEE Transactions on PatternRecognition and Machine Intelligence (TPAMI) 22, no. 8 (2000): 747-757. Courtesy of IEEE, Chris Stauffer, and Eric Grimson. Copyright 2000 IEEE. Used with Permission.
• Tracker (Stauffer, Grimson) – assigns identity to the moving objects – collects the trajectory data into partial tracks
• Event Generator – maps partial tracks onto a set of events
• Parser – labels sequences of events according to a grammar – enforces spatial and temporal constraints
Fall 2004 Pattern Recognition for Vision
![Page 66: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/66.jpg)
Tracker
• Adaptive to slow lighting changes: – Each pixel is modeled by a mixture
• Foreground regions are found by connected components algorithm
• Object dynamics is modeled in 2D by a set of Kalman filters
•
w Xt t i
K
( ) ( , , ), , , = * = � h m S
1
Details - (Stauffer, Grimson CVPR 99)
P X i t i t i t
Fall 2004 Pattern Recognition for Vision
![Page 67: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/67.jpg)
Tracker
Camera view Connected components
Trajectories over time An object Photos and figures from: Stauffer, Chris, and Eric Grimson, "Learning Patterns of Activity Using Real-Time Tracking." IEEE Transactions on PatternRecognition and Machine Intelligence (TPAMI) 22, no. 8 (2000): 747-757. Courtesy of IEEE, Chris Stauffer, and Eric Grimson. Copyright 2000 IEEE. Used with Permission.
Fall 2004 Pattern Recognition for Vision
![Page 68: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/68.jpg)
Event Generator
( , )
• ( , )
• Each event is complemented if the label probability is < 1 ( , , )
Map tracks onto events: car-enter, person-enter, car-found, person-found, car-lost, person-lost, stopped
Photos and figures from: Stauffer, Chris, and Eric Grimson, "Learning Patterns of Activity Using Real-Time Tracking." IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI) 22, no. 8 (2000): 747-757. Courtesy of IEEE, Chris Stauffer, and Eric Grimson. Copyright 2000 IEEE.
Used with permission.
• Events along with class likelihoods are posted at the endpoints of each track car-appear [0.5] car-disappear [1.0]
Action label is assigned to each event in accordance with the environment map car-enter [0.5] car-exit [1.0]
car-enter [0.5] person-enter [0.5] car-exit [1.0]
Fall 2004 Pattern Recognition for Vision
![Page 69: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/69.jpg)
Parking Lot Grammar (Partial)
Photos and figures from: Stauffer, Chris, and Eric Grimson, "Learning Patterns of Activity Using Real-Time Tracking." IEEE Transactions on Pattern Recognition and Machine Intelligence (TPAMI) 22, no. 8 (2000): 747-757. Courtesy of IEEE, Chris Stauffer, and Eric Grimson. Copyright 2000 IEEE. Used with Permission.
Fall 2004 Pattern Recognition for Vision
![Page 70: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/70.jpg)
Consistency
• Temporal – Events should happen in particular order – Temporally close events are more likely to be related – Tracks overlapping in time are definitely not related to the
same object
• Spatial – Spatially close events are more likely to be related
• Other – Objects don’t change identity within a track
Fall 2004 Pattern Recognition for Vision
![Page 71: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/71.jpg)
Spatio-Temporal Consistency
r r rp d t t= + -1 1 2 1( ) Predict new position:
f
t t
p p T
p( , )
, ( )
exp ( )r r r r r r2
2 1
2 2
0 0
=
- <
- -FHG
IKJ
R S| T|
if
q
Penalize:
t Z
X
t
Z
t Z
X
X
reject
accept
penalize
( , ( , )d= =r r
) (
), x y dx dy
Fall 2004 Pattern Recognition for Vision
![Page 72: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/72.jpg)
Input Data
Fall 2004 Pattern Recognition for Vision
![Page 73: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/73.jpg)
Event Generator
Interleaved events in the input stream
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop on Visual Surveillance (ICCV 2001)W. E. L. Grimson. Copyright 1999 IEEE. Used with Permission.
(1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and
Fall 2004 Pattern Recognition for Vision
![Page 74: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/74.jpg)
Parse 1: Person-Pass-Through
Action label Object track
Temporal extent
Component labels
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and W. E. L. Grimson. Copyright 1999 IEEE. Used with Permission. Fall 2004 Pattern Recognition for Vision
![Page 75: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/75.jpg)
Parse 2: Drive-In
Action label Object tracks
Temporal extent
Component labels
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and W. E. L. Grimson. Copyright 1999 IEEE. Used with Permission. Fall 2004 Pattern Recognition for Vision
![Page 76: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/76.jpg)
Parse 3: Drop-off
Action labelAction label Object track
Temporal extent
Component labels
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and W. E. L. Grimson. Copyright 1999 IEEE. Used with Permission. Fall 2004 Pattern Recognition for Vision
![Page 77: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/77.jpg)
Parse 4: Car-Pass-Through
Action label Object track
Temporal extent
Component labels
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and W. E. L. Grimson. Copyright 1999 IEEE. Used with Permission. Fall 2004 Pattern Recognition for Vision
![Page 78: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/78.jpg)
Summary
• • • Extended robust parsing algorithm • Events are staged in real environment with other cars
and people • • •
Real-time system First of a kind end-to end system
~10-15 events per minute Staged events - 100% detected Accidental events - ~80% detected
Fall 2004 Pattern Recognition for Vision
![Page 79: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/79.jpg)
Automatic Surveillance System
• • Static cameras • •
parking lot • Handling simultaneous events
Outdoor environment - occlusions and lighting changes
Real-time performance Labeling activities and person-vehicle interactions in a
Fall 2004 Pattern Recognition for Vision
![Page 80: 9.913 Pattern Recognition for Vision - MIT OpenCourseWare...Example: Utterance Classification Accuracy KNN 94.60% MCC 94.10% HMM 96.20% SVM 98.20% DynSVM 98.20% Japanese Vowel Set](https://reader035.vdocuments.site/reader035/viewer/2022071114/5fead942154b6566040a46b6/html5/thumbnails/80.jpg)
Appendix: Hu Moments
Full appendix from: Bobick, A., and J. Davis. "The Representation and Recognition of Action Using Temporal Templates." IEEE Transactions on Pattern Analysis and Machine Intelligence 23, no. 3 (2002). Courtesy of IEEE. Copyright 2002 IEEE. Used with Permission.
Fall 2004 Pattern Recognition for Vision