using grammars for action recognition · aniket bera. video analysis with cfgs the “inverse...

Using Grammars for Action Recognition

Aniket Bera

Video analysis with CFGs

The “Inverse Hollywood problem”:

From video to scripts and storyboards via causal analysis.

Brand 1997

Action Recognition using Probabilistic Parsing.Bobick and Ivanov 1998

Recognizing Multitasked Activities from Video using

Stochastic Context-Free Grammar.

Moore and Essa 2001

13

CFG for human activities

enter detach leave enter detach attach touch touch detach attach leave

M. Brand. The "Inverse Hollywood Problem":From video to scripts and storyboards

via causal analysis. AAAI 1997.

14

Parse treeSCENE (Open up a PC)

IN ACTION (Open PC)

OUT IN

ADD ADD

enter detach leave enter

ACTION (unscrew) OUT

MOVE REMOVE

MOTION MOTION

detach attach touch touch detach attach leave

• Deterministic low-level primitive detection• Deterministic parsing

M. Brand. The "Inverse Hollywood Problem": From video to scripts and storyboards via causal analysis. AAAI 1997.

15

Stochastic CFGs

Action Recognition using Probabilistic Parsing.Bobick and Ivanov 1998

16

Gesture analysis with CFGs

Primitive recognition with HMMs

Action Recognition using Probabilistic Parsing. Bobick and Ivanov 1998 17

left-right


up-down


right-left


down-up


Parse Tree

S

RH

TOP UD BOT DU

LR RL

left-right up-down right-left down-up

22

Errors

Likelihood value over time (not discrete symbols)

HMM a

HMM b

Errors are inevitable…

but the grammar acts as a top-down constraint


Dealing with uncertainty & errors

Stolcke-Early (probabilistic) parser

SKIP rules to deal with insertion errors

HMM a

HMM b

HMM c


SCFG for Blackjack

Recognizing Multitasked Activities from Video usingStochastic Context-Free Grammar.

Moore and Essa 2001

• Deals with more complex activities• Deals with more error types

25

Stochastic Grammars: Overview

• Representation: Stochastic grammar• Terminals: object interactions• Context-sensitive due to internal scene models

• Domain: Towers of Hanoi• Requires activities with

strong temporal constraints

• Contributions• Showed recognition &

decomposition with veryweak appearance models

• Demonstrated usefulnessof feedback from high tolow-level reasoning components

Expectation Grammars(CVPR 2003)

• Analyze video of a person physically solving the Towers of Hanoi task

• Recognize valid activity

• Identify each move

• Segment objects

• Detect distracters / noise

System Overview

ToH: Low-Level Vision

Raw VideoBackground

Model

ForegroundComponents

Foreground andshadow detection

Low-Level Features• Explanation-based symbols

• Blob interaction events

• merge, split, enter, exit, tracked, noise

• Future Work: hidden, revealed, blob-part, coalesce

• All possible explanations generated• Inconsistent explanations heuristically pruned

Enter

Merge

Contributions

• Showed activity recognition and decomposition without appearance models

• Demonstrated usefulness of feedback from high-level, long-term interpretations to low-level, short-term decisions

using grammars for action recognition · aniket bera. video analysis with cfgs the “inverse...

Documents