seeing action: part 2 statistics and/or structure

Seeing Action: Part 2

Statistics and/or Structure

Aaron [email protected]

School of Interactive Computing College of Computing

Georgia Tech

Continuing from the lower middle???

Three levels of understanding motion or behavior:

Movement – atomic behaviors defined by motion
– "Bending down", "(door) rising up", "Swinging a hammer"

Action – a single, semantically meaningful "event"
– "Opening a door", "Lifting a package"
– Typically short in time
– Might be definable in terms of motion, especially so in a particular context

Activity – a behavior or collection of actions with a purpose/intention
– "Delivering packages"
– Typically has causal underpinnings
– Can be thought of as statistically structured events

Maybe Actions are movements in context??

Context

Structure and Statistics (Old and new)

Grammar-based representation and parsing
– Highly expressive for activity description
– Easy to build higher-level activities from a reused low-level vocabulary

P-Net (Propagation nets) – really stochastic Petri nets
– Specify the structure; with some annotation, can learn detectors and triggering probabilities

Statistics of events
– Low-level events are statistically sequenced; too hard to learn the full model
– N-grams or suffix trees

"Higher-level" Activities: Known structure, uncertain elements

Many activities are composed of a priori defined sequences of primitive elements.
– Dancing, conducting, pitching, stealing a car from a parking lot.
– The states are not hidden.

The activities can be described by a set of grammar-like rules; often ad hoc approaches are taken.

But the sequences are uncertain:
– Uncertain performance of elements
– Uncertain observation of elements

The basic idea and approach

Idea: split the problem into:
– Low-level primitives with uncertain feature detection (individual elements might be HMMs)
– High-level description found by parsing the input stream of uncertain primitives

Approach:
– Extend Stochastic Context-Free Grammars to handle perceptually relevant uncertainty

Stochastic CFGs

Traditional SCFGs have probabilities associated with the production rules. Traditional parsing yields the most likely parse given a known set of input symbols.

    PIECE -> BAR PIECE [0.5]
           | BAR [0.5]
    BAR   -> TWO [0.5]
           | THREE [0.5]
    THREE -> down3 right3 up3 [1.0]
    TWO   -> down2 up2 [1.0]

Thanks to Andreas Stolcke's prior work on parsing SCFGs using an efficient Earley parser.
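To make the rule probabilities concrete, here is a minimal Python sketch that sums the probabilities of all parses of a known token string under the PIECE grammar above. It uses a plain memoized inside computation, not Stolcke's Earley parser, and the names `GRAMMAR` and `string_probability` are illustrative assumptions.

```python
from functools import lru_cache

# The grammar from the slide: nonterminal -> list of (right-hand side, probability).
GRAMMAR = {
    "PIECE": [(("BAR", "PIECE"), 0.5), (("BAR",), 0.5)],
    "BAR": [(("TWO",), 0.5), (("THREE",), 0.5)],
    "THREE": [(("down3", "right3", "up3"), 1.0)],
    "TWO": [(("down2", "up2"), 1.0)],
}


def string_probability(tokens, start="PIECE"):
    """Sum of the probabilities of all parses of `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def inside(symbol, i, j):
        # Probability that `symbol` derives tokens[i:j] (non-empty span).
        if symbol not in GRAMMAR:  # terminal symbol
            return 1.0 if j - i == 1 and tokens[i] == symbol else 0.0
        return sum(p * seq(rhs, i, j) for rhs, p in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def seq(rhs, i, j):
        # Probability that the symbol sequence `rhs` derives tokens[i:j].
        if not rhs:
            return 1.0 if i == j else 0.0
        head, rest = rhs[0], rhs[1:]
        # Every symbol covers at least one token (the grammar has no empty rules).
        return sum(inside(head, i, k) * seq(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return inside(start, 0, len(tokens))


print(string_probability("down2 up2 down3 right3 up3".split()))  # 0.0625
```

The string "down2 up2 down3 right3 up3" has a single parse (a TWO bar followed by a THREE bar), which uses four rules of probability 0.5 and two of probability 1.0, hence 0.0625.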

Extending SCFGs (Ivanov and Bobick, PAMI)

Within the parser we handle:
– Uncertainty about input symbols
    Input is a multi-valued string (vector of likelihoods)
– Deletion, substitution, and insertion errors
    Introduce error rules
– Individually recognized primitives are typically temporally inconsistent
    Introduce penalty for overlap. Spatial and temporal consistency enforced.

Need to define when a symbol has been generated.

How do we learn production probabilities? (Not many examples.) Make sure not too sensitive to them.
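The effect of likelihood-vector input and error rules can be illustrated with a much simpler stand-in than the extended parser: a max-product alignment of one expected terminal string against a sequence of likelihood vectors. This is only a sketch, not the Ivanov and Bobick algorithm; the function name and the penalties `p_ins` and `p_del` are assumptions chosen for illustration.

```python
def best_alignment_score(expected, observations, p_ins=0.05, p_del=0.05):
    """Best score for explaining likelihood vectors with an expected terminal string.

    expected:     list of terminal names, e.g. ["down2", "up2"].
    observations: list of dicts mapping terminal name -> detector likelihood.
    p_ins, p_del: assumed penalties for a spurious observation (insertion)
                  and for a terminal that was never observed (deletion).
    """
    n, m = len(expected), len(observations)
    # score[i][t]: best score explaining observations[:t] with expected[:i]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 1.0
    for i in range(n + 1):
        for t in range(m + 1):
            s = score[i][t]
            if s == 0.0:
                continue
            if i < n and t < m:
                # Observation t accounts for terminal i. A low likelihood here
                # plays the role of a soft substitution error.
                like = observations[t].get(expected[i], 0.0)
                score[i + 1][t + 1] = max(score[i + 1][t + 1], s * like)
            if t < m:  # insertion error: skip a spurious observation
                score[i][t + 1] = max(score[i][t + 1], s * p_ins)
            if i < n:  # deletion error: the terminal was missed
                score[i + 1][t] = max(score[i + 1][t], s * p_del)
    return score[n][m]


clean = [{"down2": 0.8, "down3": 0.2}, {"up2": 0.6, "up3": 0.4}]
noisy = [{"down2": 0.8}, {"noise": 1.0}, {"up2": 0.6}]
print(best_alignment_score(["down2", "up2"], clean))
print(best_alignment_score(["down2", "up2"], noisy))
```

In the noisy case the only non-negligible explanation skips the middle observation and pays the insertion penalty once.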

Enforcing temporal consistency

[Figure: output of one HMM parsing backwards. Axes: Time vs. P(primitive); the output event is marked.]

Video Sample

Event Grammar and Parsing

Tracker generates events: ENTER, LOST, FOUND, EXIT, STOP. Tracks have properties (e.g. size) and trajectories.

Tracker assigns class to each event, though only probabilistically.

Parser parses single stream that contains interleaved events: (CAR-ENTER, CAR-STOP, PERSON-FOUND, CAR-EXIT, PERSON-EXIT)

Parser enforces spatial and temporal consistency for each object class and interactions (e.g. to be a PICK-UP, the PERSON-FOUND event must be close to CAR-STOP)

Spatial and temporal consistency eliminates symbolic ambiguity.
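As a rough illustration of the consistency check, the sketch below tests whether a PERSON-FOUND event is close enough to a CAR-STOP event to support a PICK-UP hypothesis. The `Event` fields and the thresholds `max_dist` and `max_gap` are hypothetical; the slides do not give the actual criteria.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Event:
    label: str   # e.g. "CAR-STOP" or "PERSON-FOUND"
    t: float     # time stamp in seconds (assumed unit)
    x: float     # position in scene coordinates (assumed units)
    y: float


def consistent_pickup(car_stop, person_found, max_dist=5.0, max_gap=30.0):
    """True if the two events are close in both space and time."""
    near = hypot(car_stop.x - person_found.x,
                 car_stop.y - person_found.y) <= max_dist
    close_in_time = abs(person_found.t - car_stop.t) <= max_gap
    return near and close_in_time


stop = Event("CAR-STOP", t=12.0, x=40.0, y=10.0)
found_near = Event("PERSON-FOUND", t=20.0, x=42.0, y=11.0)
found_far = Event("PERSON-FOUND", t=20.0, x=90.0, y=60.0)
print(consistent_pickup(stop, found_near), consistent_pickup(stop, found_far))
```

A parser could use such a predicate to reject candidate rule applications, which is one way the symbolic ambiguity mentioned above might be pruned.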

Advantages of SCFGs

What grammar can do (simplified):

    CAR_PASS   -> CAR_ENTER CAR_EXIT | CAR_ENTER CAR_HIDDEN CAR_EXIT
    CAR_HIDDEN -> CAR_LOST CAR_FOUND | CAR_LOST CAR_FOUND CAR_HIDDEN

Skip allows concurrency (and junk):

    PERSON_LOST -> person_lost | SKIP person_lost

Concurrent parse:

    Events: ce pe cl cf cs px pl cx
    PICKUP -> ce pe cl cf cs px pl cx
    P_PASS -> ce pe cl cf cs px pl cx
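The CAR_PASS rules describe the language "enter, any number of lost/found pairs, exit", and SKIP lets events from other objects be ignored. The sketch below checks this with a regular expression in place of the grammar parser. It assumes that the abbreviations starting with "c" are car events (ce = enter, cl = lost, cf = found, cs = stop, cx = exit), which is a reading of the slide, not something it states.

```python
import re

# CAR_PASS -> CAR_ENTER (CAR_LOST CAR_FOUND)* CAR_EXIT, written over the
# abbreviated event names.
CAR_PASS = re.compile(r"ce( cl cf)* cx")


def is_car_pass(events):
    """Check the car sub-stream of an interleaved event list against CAR_PASS."""
    # Dropping non-car events plays the role of the SKIP rule.
    car_events = [e for e in events if e.startswith("c")]
    return CAR_PASS.fullmatch(" ".join(car_events)) is not None


print(is_car_pass("ce pe cl cf px cx".split()))        # True
print(is_car_pass("ce pe cl cf cs px pl cx".split()))  # False: the car stops
```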

Parsing System

Parse 1: Person-pass-through

Parse 2: Drive-in

Parse 3: Car-pass-through

Parse 4: Drop-off

Advantages of STCFG approach

Structure and components of activities defined a priori and are the right levels of annotation to recover (compare to HMMs).

FSM vs CFG is not the point. Rather explicit representation of structural elements and uncertainties.

Often many (enough) examples of each primitive to support training, but not of higher level activity.

Allows for integration of heterogeneous primitive detectors; only assumes likelihood generation.

More robust than ad-hoc rule based techniques: handles errors through probability.

No notion of causality, or anything other than (multi-stream) sequencing.

Some Q's about Representations…

Scope and Range:
– thoughts???

"Language" of the representation
– Grammar of explicit symbols

Computability of an instance:
– Quite easy. Given the input string, the parsing is both the computation and the matching

Learnability of the "class":
– Inside-outside algorithm for learning CFGs, but let's be serious…

Stability in face of perceptual uncertainty
– Explicitly designed to handle this uncertainty

Inference-support
– Depends on what you mean by inference. No notion of real semantics or explicit time.

P-Nets (Propagation Networks) (Shi and Bobick, ’04 and ’06)

Nodes represent activation intervals

– Active vs. inactive: Token propagation

– More than one node can be active at a time!

Links represent partial order as well as logical constraints

Duration model on each link and node:
– Explicit model on length of activation
– Explicit model on length between successive intervals

Observation model on each node

Conceptual Schema

Logical relation
– Autonomous assumption: logic constraints only exist at start/end points of any intervals
– Conditional probability function can represent any logical function

Examples of logic constraint

Propagation Net – Computing

Computational Schema

A DBN-style rollout to compute the corresponding conceptual schema

Experiment: Glucose Project

Task: monitor a user calibrating a glucose meter and point out operating errors as feedback.

– Constructed a 16-node P-Net as the representation
– 3 subjects with a total of 21 perfect sequences, 10 missing_1_step sequences and 10 missing_6_steps sequences

D-Condensation

Initiate 1 particle at dummy starting node
Repeat
    For each particle
        generate all possible consequent states
        calculate the probability for each state
    End
    Select n particles to survive
Until the final time step is reached
Output the path represented by the particle with highest probability
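The loop above can be sketched in Python as follows. The sketch assumes that "select n particles to survive" means keeping the n most probable candidates, and it replaces the P-Net with a hypothetical `successors(state, t)` callback and a toy transition table; neither comes from the slides.

```python
def d_condensation(start, successors, n_steps, n_particles):
    """Deterministic particle propagation over a fixed number of time steps.

    successors(state, t) must return an iterable of (next_state, probability).
    Returns (probability, path) for the best surviving particle.
    """
    particles = [(1.0, [start])]  # one particle at the dummy starting node
    for t in range(n_steps):
        candidates = []
        for prob, path in particles:
            # Generate all consequent states and their probabilities.
            for nxt, p in successors(path[-1], t):
                candidates.append((prob * p, path + [nxt]))
        if not candidates:
            break
        # Assumed survival rule: keep the n most probable candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        particles = candidates[:n_particles]
    return max(particles, key=lambda c: c[0])


# Toy transition table standing in for the P-Net (illustrative only).
TOY = {
    "start": [("A", 0.6), ("B", 0.4)],
    "A": [("A", 0.1), ("B", 0.9)],
    "B": [("A", 0.5), ("B", 0.5)],
}


def toy_successors(state, t):
    return TOY[state]


prob, path = d_condensation("start", toy_successors, n_steps=2, n_particles=2)
print(path, prob)
```

With a small `n_particles` this behaves like a beam search, so the returned path is the best among the survivors, not necessarily the global optimum.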

Experiment: Glucose Meter Calibration

Experiment: Classification Performance

Experiment: Label individual frames

Labeling individual nodes

Labels on Node J: Insert

And now some statistics…

Problem: the higher level world of activity is not usually a P-Net or an FSM or an HMM or …

Two possible solutions:
1. Understand what's really going on… another time.
2. Lose the structure

Stochastic sequencing (Hamid and Bobick)

A priori define some low-level "actions"/events that can be stochastically detected in context – e.g. Door opening

Collect training data (streams of events) of activities – making a delivery, UPS pick-up, trash collection

Collect histograms of N-tuples and do both activity discovery and recognition

–Later can focus on anomalies

Advantages: cheat where easy, learn the hard stuff, exploit the context
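Collecting a histogram of N-tuples from an event stream takes only a few lines. This is a minimal sketch; the event names and the choice n = 2 are made up for illustration.

```python
from collections import Counter


def ngram_histogram(events, n=3):
    """Count every run of n consecutive events in the stream."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


stream = ["truck_enters", "door_opens", "person_exits", "door_opens"]
for ngram, count in ngram_histogram(stream, n=2).items():
    print(ngram, count)
```

Each activity instance then becomes a sparse count vector over n-grams, which is the form that similarity measures and clustering can work on.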

Barnes & Noble Loading Dock

Two levels in the representation

Low Level: Events (computer vision problem)
– Background subtraction and foreground extraction (better "modeling")
– Classifying (per frame) each foreground object as either
    • Person
    • Vehicle (what type if possible)
    • Package
    • Tool used to move packages
    • Miscellaneous object
– Tracking people, vehicles, packages, tools, and miscellaneous objects over multiple frames

Two levels in the representation

Higher Level: Statistical characterization of subsequences
– Instances of the same activity class have certain common subsequences.
– But partial ordering will typically rearrange subsequences within the sequence.
– Find a "soft" characterization of the statistics of the subsequences
– Define a similarity measure for such a characterization.
    • Allows discovery of activity classes
    • Allows for detection of anomalous examples

Caveats:
– We provide the events; whether that is done manually or by specifying the detector doesn't really matter (except for publication)
– Training needs pre-segmentation

Stochastic sequencing: n-grams

Experimental Setup – Loading Dock

Barnes & Noble Loading Dock Area

One month's worth of data:
– 5 days a week
– 9 a.m. till 5 p.m.

Event Vocabulary – 61 events

– Hand-labeled for testing activity labeling, noise sensitivity.
– Training detectors for these events

Bird’s Eye View of Experimental Setup

B&N Processing Video

Activity-Class Discovery

Treating activities as individual instances

Activity-class discovery – finding maximal cliques in edge weighted graphs

Need to come up with:
– Activity Similarity Metric
– Procedure to group similar activities

Activity Similarity

Two types of differences
– structural differences
– frequency differences

Sim(A,B) = 1 – normalized difference between the counts of non-zero event n-grams

Properties
– additive identity
– is commutative
– does not follow the triangle inequality
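One way to turn the verbal definition of Sim(A,B) into code is sketched below. The slide does not give the normalization, so two choices here are assumptions: each n-gram's count difference is divided by the sum of its two counts, and the total is averaged over the n-grams that occur in either activity.

```python
def similarity(hist_a, hist_b):
    """Sim(A, B) = 1 - normalized count difference over non-zero n-grams.

    hist_a, hist_b: dicts mapping an n-gram to its (positive) count.
    The normalization is an assumption, not taken from the slides.
    """
    keys = set(hist_a) | set(hist_b)  # n-grams with a non-zero count somewhere
    if not keys:
        return 1.0
    diff = 0.0
    for k in keys:
        a, b = hist_a.get(k, 0), hist_b.get(k, 0)
        diff += abs(a - b) / (a + b)  # 0 for equal counts, 1 if one side lacks k
    return 1.0 - diff / len(keys)


A = {("a", "b"): 2, ("b", "a"): 1}
B = {("a", "b"): 1, ("b", "c"): 1}
print(similarity(A, A), similarity(A, B), similarity(B, A))
```

Under this choice, an n-gram present in only one activity (a structural difference) contributes the maximum penalty of 1, while a shared n-gram with unequal counts (a frequency difference) contributes less. The measure is symmetric and lies in [0, 1].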

Activity-Class Discovery

A graph-theoretic problem of finding maximal cliques in edge-weighted graphs [Pavan, Pelillo ‘03]

Sequentially find maximal cliques in edge weighted graph of activities

Activities different enough from all the regular activities are anomalies
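Dominant sets in the sense of Pavan and Pelillo are commonly extracted with replicator dynamics on the similarity matrix, peeling off one group at a time. The sketch below follows that scheme in pure Python. The cutoff, tolerance, minimum class size, and the toy matrix are assumptions, and the slides do not say which solver was used.

```python
def dominant_set(A, iters=1000, tol=1e-9, cutoff=1e-4):
    """Indices in the support of the replicator-dynamics fixed point.

    A: symmetric, non-negative similarity matrix with a zero diagonal.
    """
    n = len(A)
    x = [1.0 / n] * n  # start at the barycenter of the simplex
    for _ in range(iters):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        xAx = sum(x[i] * Ax[i] for i in range(n))
        if xAx == 0.0:
            return []  # no similarity at all among these nodes
        new_x = [x[i] * Ax[i] / xAx for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            break
    return [i for i in range(n) if x[i] > cutoff]


def discover_classes(A, min_size=2):
    """Sequentially extract dominant sets; leftovers are candidate anomalies."""
    remaining = list(range(len(A)))
    classes = []
    while len(remaining) >= min_size:
        sub = [[A[i][j] for j in remaining] for i in remaining]
        members = [remaining[k] for k in dominant_set(sub)]
        if len(members) < min_size:
            break
        classes.append(members)
        remaining = [i for i in remaining if i not in members]
    return classes, remaining


# Toy similarities: activities 0-2 resemble each other, activity 3 does not.
S = [
    [0.0, 0.9, 0.9, 0.1],
    [0.9, 0.0, 0.9, 0.1],
    [0.9, 0.9, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
print(discover_classes(S))
```

On the toy matrix the first three activities form one class and the fourth is left over, matching the idea that activities unlike all regular ones are anomalies.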

Activity-Class Discovery – Dominant Sets

Anomaly Detection

Compute the within-Class similarity of the test activity w.r.t. previous class members

Learn the detection threshold from training data – can be done using an R.O.C.
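A threshold can be read off an R.O.C. sweep in several ways. The sketch below picks the similarity threshold that maximizes true-positive rate minus false-positive rate (Youden's J), which is one common criterion and an assumption here. The scores and labels are made-up training values.

```python
def pick_threshold(scores, is_anomaly):
    """Sweep thresholds over within-class similarity scores.

    An activity is flagged as anomalous when its score is <= the threshold.
    Assumes the training data contains both anomalous and regular examples.
    Returns (threshold, J) maximizing J = TPR - FPR.
    """
    pos = sum(1 for a in is_anomaly if a)
    neg = len(is_anomaly) - pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, a in zip(scores, is_anomaly) if a and s <= t)
        fp = sum(1 for s, a in zip(scores, is_anomaly) if not a and s <= t)
        j = tp / pos - fp / neg  # one point on the R.O.C. curve, scored by J
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


scores = [0.1, 0.2, 0.7, 0.8, 0.9]
labels = [True, True, False, False, False]
print(pick_threshold(scores, labels))
```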

Anomaly "Explanation"

Explanatory features – their frequency has high mean and low variance

Explanation based on features that were:

– Missing from an anomaly but were frequently and consistently present in regular members

– Extraneous in an anomaly but consistently absent from the regular members
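The two kinds of explanation can be sketched directly from n-gram counts. The thresholds `min_mean` and `max_var` and the feature names are hypothetical; the slides only say "high mean and low variance".

```python
from statistics import mean, pvariance


def explain(anomaly, members, min_mean=1.0, max_var=0.5):
    """Return (missing, extraneous) feature lists for an anomalous activity.

    anomaly: dict feature -> count for the test activity.
    members: list of such dicts for the regular members of the class.
    """
    keys = set(anomaly).union(*members)
    missing, extraneous = [], []
    for k in sorted(keys):
        counts = [m.get(k, 0) for m in members]
        in_anomaly = anomaly.get(k, 0) > 0
        # Frequently and consistently present in regular members, absent here.
        if not in_anomaly and mean(counts) >= min_mean and pvariance(counts) <= max_var:
            missing.append(k)
        # Present here, consistently absent from regular members.
        elif in_anomaly and max(counts) == 0:
            extraneous.append(k)
    return missing, extraneous


regular = [{"close_door": 1, "unload": 2}, {"close_door": 1, "unload": 3}]
test_activity = {"unload": 2, "extra_person": 1}
print(explain(test_activity, regular))
```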

Results

General Characteristics of Discovered Activity Classes
– UPS Delivery Vehicles
– Fed Ex Delivery Vehicles
– Delivery Trucks – multiple packages delivered
– Cars and vans, only 1 or 2 packages delivered
– Motorized cart used to pick and drop packages
– Van deliveries – no use of motorized cart
– Delivery trucks – multiple people

A few of the detected anomalies

(a) Back door of delivery not closed
(b) More than usual number of people involved in unloading
(c) Very few vocabulary events performed

Results

Are the detected anomalous activities ‘interesting’ from human view-point?

Anecdotal Validation:
– Studied 7 users
– Showed each user 8 regular activities selected randomly
– Showed each user 10 test activities, 5 regular and 5 detected anomalous activities
– 8 out of 10 activity-labels of the users matched the labels of our system
– Probability of this match happening by chance is 4.4%
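The slides do not say how the 4.4% was computed. One reading that reproduces it is a fair-coin binomial model, where each of the 10 labels matches by chance with probability 0.5 and exactly 8 match. The check below assumes that model.

```python
from math import comb


def p_exactly(k, n, p=0.5):
    """Binomial probability of exactly k matches out of n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


exactly_8 = p_exactly(8, 10)
at_least_8 = sum(p_exactly(k, 10) for k in range(8, 11))
print(round(100 * exactly_8, 1))   # 4.4
print(round(100 * at_least_8, 1))  # 5.5
```

Under the same model, the probability of 8 or more matches is about 5.5%, so the quoted figure corresponds to exactly 8 matches.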

Some Q's about Representations… (more discussion)

Scope and Range:
– A monitored scene with pre-designed detectors

"Language" of the representation
– Histograms and other statistics of feature n-gram occurrences

Computability of an instance:
– Given detectors, easy to compute

Learnability of the "class":
– Full power of statistical learning. Even allowing notion of outlier detector.

Stability in face of perceptual uncertainty
– Fair. Needs to be better.

Inference-support
– Distance-in-feature space reasoning only.
