
Page 1: Linking Video Analysis to Annotation Technologies

Linking Video Analysis to Annotation Technologies

Presentation for the BMVA’s

Computer Vision Methods for Ambient Intelligence

31st May 2006

Dimitrios Makris (Kingston University)

&

Bogdan Vrusias (University of Surrey)

Page 2: Linking Video Analysis to Annotation Technologies

REVEAL project

• EPSRC-funded project, initiated in 2004

– Academic partners
• Kingston University, University of Surrey

– Industrial partners/observers
• SIRA Ltd, Ipsotek Ltd, CrowdDynamics Ltd, Overview Ltd

– End-Users
• PITO, PSDB (Home Office), Surrey Police

• Aim: “to promote those key technologies which will enable automated extraction of evidence from CCTV archives”

Page 3: Linking Video Analysis to Annotation Technologies

See No Evil
Hear No Evil

Speak No Evil

Page 4: Linking Video Analysis to Annotation Technologies

Scope of REVEAL

• See Evil
– Computer Vision
– Input: Video Streams

• Hear Evil
– Natural Language Processing
– Input: Annotations of Video Streams

• Speak Evil
– Link the two together
– Output: Automatic Video Annotations

Page 5: Linking Video Analysis to Annotation Technologies

Challenges

• Development of Visual Evidence Thesaurus
– Automatic Extraction of Surveillance Ontology

• Extracting Visual Semantics
– Motion Detection & Tracking
– Geometric and Colour Constancy
– Object Classification, Behaviour Analysis, Semantic Landscape

• Analysing Crowds

• Development of Surveillance Meta-Data Model

• Multimodal Data Fusion
– Fusion of Visual Semantics and Annotations

• Video Summarisation

Page 7: Linking Video Analysis to Annotation Technologies

Video Analysis Overview

• Motion Analysis
– Motion Detection
– Motion Tracking
– Crowd Analysis

• Automatic Camera Calibration
– Colour
– Geometric

• Visual Semantics Extraction
– Object Classification
– Behaviour Analysis
– Semantic Landscape

Page 8: Linking Video Analysis to Annotation Technologies

Motion Analysis (1/2)

• Motion Detection
– Novel technique for handling rapid light variations, based on correlating changes in YUV (Renno et al., VS2006)

• Motion Tracking
– Blob-based Kalman filter for tackling partial occlusion (Xu & Ellis, BMVC2002); see the sketch below
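As a rough illustration of the idea behind blob-based Kalman tracking (not the Xu & Ellis implementation), here is a minimal constant-velocity Kalman filter for a blob centroid; the noise settings are assumed values.

```python
import numpy as np

class BlobTracker:
    """Constant-velocity Kalman filter for one blob centroid (x, y).
    Illustrative sketch only, not the Xu & Ellis (BMVC2002) tracker."""

    def __init__(self, x, y, dt=1.0):
        self.s = np.array([x, y, 0.0, 0.0])        # state: [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                  # state covariance
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], float)  # constant-velocity model
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], float)   # only position is measured
        self.Q = np.eye(4) * 0.01                  # process noise (assumed)
        self.R = np.eye(2) * 1.0                   # measurement noise (assumed)

    def predict(self):
        """Advance the track one frame; used alone while the blob is occluded."""
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]

    def update(self, z):
        """Correct the track with a measured centroid z = (x, y)."""
        y = np.asarray(z, float) - self.H @ self.s
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.s = self.s + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

During a partial occlusion the measured centroid is missing or unreliable, so the track coasts on predict() alone and is re-attached when a matching blob reappears.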

Page 9: Linking Video Analysis to Annotation Technologies

Motion Analysis (2/2)

Example

Page 10: Linking Video Analysis to Annotation Technologies

Crowd Analysis (1/3)

• Problem: Detect and Track Individuals in Crowded situations

[Images: Original Frame | Foreground Mask]

Page 11: Linking Video Analysis to Annotation Technologies

Crowd Analysis (2/3)

• Combine edges of original image with edges of foreground mask

[Images: Original Frame Edges | Foreground Mask Edges]
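A minimal sketch of the edge-combination step described above, assuming OpenCV and a binary (0/255) foreground mask; the Canny thresholds are placeholders, not values from the project.

```python
import cv2

def crowd_edge_map(frame_gray, fg_mask):
    """Combine edges of the original frame with edges of the
    foreground mask. Thresholds are illustrative placeholders."""
    img_edges  = cv2.Canny(frame_gray, 100, 200)   # edges in the image
    mask_edges = cv2.Canny(fg_mask, 100, 200)      # boundary of the foreground
    # Keep image edges that fall inside the foreground,
    # then add the foreground boundary itself
    inside = cv2.bitwise_and(img_edges, img_edges, mask=fg_mask)
    return cv2.bitwise_or(inside, mask_edges)
```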

Page 12: Linking Video Analysis to Annotation Technologies

Crowd Analysis (3/3)

• Fit a head-shoulder (Omega) model

Head Candidates in the scene

Head Candidates on the boundaries of the foreground

Head Candidates within the foreground

Page 13: Linking Video Analysis to Annotation Technologies

Automatic Geometric Calibration (1/3)

[Plot: pedestrian and vehicle heights (pixels) against image position (pixels), with linear models for Person, Vehicle and Large Vehicle converging towards the marked horizon.]

• Pedestrian height model
– Estimate a linear pedestrian height model from observations (Renno et al., ICIP2002); see the sketch below
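To make the linear height model concrete: the sketch below fits apparent height as a linear function of image row and reads off the horizon as the row where the height shrinks to zero. The observation numbers are invented for illustration only.

```python
import numpy as np

# (image row of the pedestrian's feet, blob height in pixels)
# -- invented observations for illustration
rows    = np.array([100, 150, 200, 250, 300], float)
heights = np.array([ 22,  45,  68,  90, 113], float)

# Least-squares fit: height = a * row + b
a, b = np.polyfit(rows, heights, 1)

# Apparent pedestrian height vanishes at the horizon
horizon_row = -b / a
print(f"height = {a:.3f} * row + {b:.1f}; horizon near row {horizon_row:.0f}")
```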

Page 14: Linking Video Analysis to Annotation Technologies

Automatic Geometric Calibration (2/3)

• Ground Plane Estimation
– Use the pedestrian linear model to estimate the ground plane (Renno et al., BMVC2002)

Page 15: Linking Video Analysis to Annotation Technologies

Automatic Geometric Calibration (3/3)

[Images: Occlusion Edges | Depth Map]

• Scene Depth Map
– Use the estimated depths of moving objects to determine the scene depth map (Renno et al., BMVC2004)

Page 16: Linking Video Analysis to Annotation Technologies

Automatic Colour Calibration (1/3)

• Variation of colour responses is significant!

• A real-time colour constancy algorithm is required

Page 17: Linking Video Analysis to Annotation Technologies

Automatic Colour Calibration (2/3)

• Grey World and Gamut Mapping algorithms were tested.

• An automatic method is used to select the reference frame.

• Gamut Mapping performs better, but Grey World can operate in real time (see the sketch below).

(Renno et al., VS-PETS2005)
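For reference, the Grey World algorithm mentioned above is simple enough to state in a few lines; this is the textbook form, not necessarily the exact variant evaluated in (Renno et al., VS-PETS2005).

```python
import numpy as np

def grey_world(img):
    """Grey World colour correction: scale each channel so its mean
    equals the overall mean intensity. img is an HxWx3 uint8 array."""
    img = img.astype(np.float64)
    channel_means = img.reshape(-1, 3).mean(axis=0)   # mean per channel
    gains = channel_means.mean() / channel_means      # per-channel gain
    return np.clip(img * gains, 0, 255).astype(np.uint8)
```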

Page 18: Linking Video Analysis to Annotation Technologies

Automatic Colour Calibration (3/3)

• Real Time Colour Constancy

Page 19: Linking Video Analysis to Annotation Technologies

Visual Semantics (Makris et al., ECOVISION 2004)

Targets
– Pedestrians
– Cars
– Large vehicles

Actions
– move
– stop
– enter/exit
– accelerate
– turn left/right

Static features
– road/corridor
– door/gate
– ATM
– desk
– bus stop

Page 20: Linking Video Analysis to Annotation Technologies

Visual Semantics

• Object Classification (ongoing work)

• Behaviour Analysis (ongoing work)

• Semantic landscape
– Label the static scene by observing activity (Makris & Ellis, AVSS 2003)

Page 21: Linking Video Analysis to Annotation Technologies

Reverse Engineering

Page 22: Linking Video Analysis to Annotation Technologies

Entry/Exit Zones

Detected by an EM-based algorithm
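A sketch of how an EM-based zone detector can be built from off-the-shelf parts: fit a Gaussian mixture (trained by EM) to trajectory start/end points, so each component becomes a candidate entry/exit zone. The component count and the random placeholder data are assumptions; the project's own EM formulation may differ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# (x, y) positions where tracked trajectories begin or end --
# placeholder data standing in for real track endpoints
endpoints = np.random.rand(500, 2) * np.array([352.0, 288.0])

# Fit a Gaussian mixture by EM; each component is a candidate zone.
# The component count (4 here) would in practice be model-selected.
gmm = GaussianMixture(n_components=4, covariance_type="full").fit(endpoints)
for mean in gmm.means_:
    print("candidate entry/exit zone centred at", mean.round(1))
```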

Page 23: Linking Video Analysis to Annotation Technologies

Detected Routes

Page 24: Linking Video Analysis to Annotation Technologies

Segmentation of Routes to Paths & Junctions

Page 25: Linking Video Analysis to Annotation Technologies

Possible extensions

• Use target labels
– paths: traffic roads or pavements
– pedestrian crossing: junction of
• a pedestrian route
• a vehicle route

• More complicated rules
– bus stop:
• pedestrians stop
• vehicles stop
• pedestrians merge with a vehicle

Page 26: Linking Video Analysis to Annotation Technologies

Data Hierarchy in Video Analysis

Pixels → Blobs → Trajectories → Actor labels / Scene labels / Action labels → Textual Summary

Page 27: Linking Video Analysis to Annotation Technologies

Natural Language Processing (Surrey)

Hypothesis: Experts use a common language and common keywords to describe crime scene/video evidence.

• Visual Evidence Thesaurus
– Data acquisition (workshops)
– Data analysis
– Automatic ontology extraction

Page 28: Linking Video Analysis to Annotation Technologies

Video Annotation Workshops

• 2 different workshops were organised and run to prove the hypothesis and construct a domain thesaurus.
– Different experts:
• Police Forces (Surrey, West Yorkshire)
• Forensic Services (London, Birmingham)
• Private video evidence expert (Luton)
– Several data collection tasks

• Purpose: Gather knowledge and feedback from experts in order to understand the way videos are observed and perceived.

• Task: Validate the hypothesis and extract the common keywords used and the description pattern.

Page 29: Linking Video Analysis to Annotation Technologies

Video Annotation Workshops

[Images: selected samples from the workshop]

Page 30: Linking Video Analysis to Annotation Technologies

• Workshop Feedback:
– Strong interest in the project.
– Willing to help (within the legal limits).

• Workshop Outputs:
– Initial descriptions from experts, for analysis.
– Useful feedback and comments.

Video Evidence Thesaurus

Page 31: Linking Video Analysis to Annotation Technologies

• 2 or 3 people walking around what looks like between 2 buildings.

• 2 people fighting,………

• 5 people, 1 person walks away to the bottom of screen.

• 2 people walk towards each other,……….

• Person walking across holding piece of paper on heart. Person walking away looking over his left shoulder.

• 2 people passing out each other in corridor, appear to wave hands, having some kind of interaction.

• Same video clip.

• Descriptions from 3 different people.

• Pattern (Identify, Elaborate, Location)


Analysis

Page 32: Linking Video Analysis to Annotation Technologies

Analysis:

• Descriptions from different people, for the same video clips.

• Pattern (Identify <I>, Elaborate <E>, Location <L>).

• Grammar:
• <Description>: <I><E|L><L|E|{Φ}><Description|{Φ}>
• <I>: <Single|Group>
• <Single>: <{Person}|{Male}|{Female}|…>
• <Group>: <2|3|…|n><{People}|Single>
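One way to make this grammar operational is to read each annotated description as a string of I/E/L tags and check it with a regular expression; the pattern below encodes one reading of the rules above, with Φ treated as the empty string.

```python
import re

# A clause is Identify, then Elaborate or Location, then optionally the
# other one (Φ = empty); a description is one or more such clauses.
DESCRIPTION = re.compile(r"^(I[EL][EL]?)+$")

for tags in ["IE", "IEL", "ILE", "IELIL", "I", "EL"]:
    print(tags, "valid" if DESCRIPTION.match(tags) else "invalid")
```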

Video Evidence Thesaurus

Page 33: Linking Video Analysis to Annotation Technologies

Thesaurus Construction Methodology

• We adopt a text-driven and bottom-up method: starting from a collection of texts in a specialist domain, together with a representative general language corpus

• Use a five-step algorithm for identifying discourse patterns with more or less unique meanings, without any overt access to an external knowledge base

Page 34: Linking Video Analysis to Annotation Technologies

I. Select training corpora: CCTV-Related Corpus and a general language corpus.

II. Extract key words;

III. Extract key collocates;

IV. Extract local grammar using collocation and relevance feedback;

V. Assert the grammar as a finite state automaton.

Thesaurus Construction Methodology

Page 35: Linking Video Analysis to Annotation Technologies

• Once the single terms, especially the 'weird' terms, are identified, we find candidate compound terms by computing collocation statistics between the single terms and other open-class words in the entire corpus.

Development of Visual Evidence Thesaurus

Page 36: Linking Video Analysis to Annotation Technologies

• Collocates of the weird term EARPRINT + collocation statistics

Development of Visual Evidence Thesaurus

Page 37: Linking Video Analysis to Annotation Technologies

• Collocates of the weird term EARPRINT IDENTIFICATION

Development of Visual Evidence Thesaurus

Page 38: Linking Video Analysis to Annotation Technologies

An inheritance hierarchy of EARPRINT collocates, exported to the knowledge representation system PROTEGE:

Development of Visual Evidence Thesaurus

Page 39: Linking Video Analysis to Annotation Technologies

• A multiple inheritance hierarchy of EARPRINT collocates, now exported to the knowledge representation workbench PROTEGE:

Rubbish!

Development of Visual Evidence Thesaurus

Page 40: Linking Video Analysis to Annotation Technologies

Experiments and Evaluation

• I. Select training corpora

• Training corpora:
– The British National Corpus, comprising 100 million tokens distributed over 4,124 texts (Aston and Burnard 1998);
– Crime Alerts Corpus (FBI Crime Alerts; Wanted notices from the Royal Canadian Mounted Police (RCMP) and Polizei Bayern; journal/conference papers), comprising 109 articles and containing 214,437 words.

Page 41: Linking Video Analysis to Annotation Technologies

Experiments and Evaluation

• II. Extract key words
– The frequencies of individual words in the Crime Alerts Corpus were computed using System Quirk;

Page 42: Linking Video Analysis to Annotation Technologies

Experiments and Evaluation

Ranks | Crime Alerts Corpus (N_CAC = 214,437) | Cumulative Tokens (%) | British National Corpus (N_BNC = 100 million) | Cumulative Tokens (%)
1-10 | the, of, and, in, to, a, by, for, county, total | 39,724 (18.88%) | the, of, and, a, in, to, for, is, as, that | 22.3 M (22.3%)
11-20 | percent, is, state, city, with, law, that, s, population, enforcement | 11,910 (5.66%) | was, I, on, with, as, be, he, you, at, by | 6.51 M (6.5%)
21-30 | offenses, crime, or, are, rate, on, township, area, as, agencies | 8,923 (4.24%) | are, this, have, but, not, from, had, his, they, or | 4.23 M (4.2%)
31-40 | be, was, counties, this, per, from, were, an, number, continued | 7,126 (3.39%) | which, an, she, where, here, we, one, there, all, been | 3.05 M (3.1%)
41-50 | data, inhabitants, reporting, university, estimated, theft, cities, metropolitan, at, other | 6,043 (2.87%) | their, if, has, will, so, would, no, what, can, when | 2.35 M (2.4%)

Page 43: Linking Video Analysis to Annotation Technologies

Experiments and Evaluation

Token | Crime Alerts Corpus (N_CAC = 214,437): rank, f_CAC, f_CAC/N_CAC (a) | BNC (N_BNC = 100,000,000): rank, f_BNC, f_BNC/N_BNC (b) | Weirdness (a/b)
crime | 22, 960, 0.448% | 1512, 7155, 0.007% | 62.63
theft | 46, 618, 0.288% | 5031, 1727, 0.002% | 167.05
murder | 57, 492, 0.229% | 1811, 5935, 0.006% | 38.70
crimes | 91, 301, 0.140% | 4901, 1789, 0.002% | 78.54
scars | 181, 145, 0.068% | 14217, 379, 0.0004% | 178.60
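The weirdness column is simply the ratio of relative frequencies in the two corpora. A minimal check against the table's figures (the small discrepancies come from rounding in the printed percentages):

```python
def weirdness(f_spec, n_spec, f_gen, n_gen):
    """Relative frequency in the specialist corpus divided by
    relative frequency in the general corpus."""
    return (f_spec / n_spec) / (f_gen / n_gen)

print(round(weirdness(960, 214_437, 7_155, 100_000_000), 2))  # crime -> 62.57
print(round(weirdness(618, 214_437, 1_727, 100_000_000), 2))  # theft -> 166.88
```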

Page 44: Linking Video Analysis to Annotation Technologies

Experiments and Evaluation

• III. Extract key collocates

Node word: scars (figures as printed: 65763)

Collocate | f | Left | Right | Total | z-score
acne | 70 | 42 | 0 | 42 | 8.16
boxcar | 43 | 32 | 5 | 37 | 7.13
rolling | 26 | 16 | 11 | 27 | 5.06
deep | 40 | 20 | 3 | 23 | 4.24
facial | 34 | 22 | 0 | 22 | 4.03
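The slides do not give the z-score formula; a standard choice for this kind of collocation statistic is the Berry-Rogghe z-score, sketched below with an assumed window size.

```python
from math import sqrt

def collocation_z(f_pair, f_node, f_coll, corpus_size, span=4):
    """Berry-Rogghe-style z-score. f_pair: node-collocate co-occurrences
    inside the span; span: words inspected either side of the node
    (an assumed value -- the slides do not state it)."""
    p = f_coll / corpus_size              # chance rate of the collocate
    expected = p * f_node * 2 * span      # expected co-occurrences
    return (f_pair - expected) / sqrt(expected * (1 - p))
```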

Page 45: Linking Video Analysis to Annotation Technologies

Experiments and Evaluation

• IV. Extract local grammar using collocation and relevance feedback

Pattern | f | Collocate | Left | Right | z-score
facial acne scars | 21 | pitted | 9 | 1 | 2.37
deep boxcar scars | 16 | had | 3 | 3 | 2.12
has scars on | 8 | his | 1 | 5 | 3.95
has scars on | 8 | left | 1 | 4 | 3.05

Page 46: Linking Video Analysis to Annotation Technologies

Experiments and Evaluation

• V. Assert the grammar as a finite state automaton
– The (re-)collocation patterns can then be asserted as finite state automata for each of the movement verbs and spatial preposition metaphors

Page 47: Linking Video Analysis to Annotation Technologies

An experiment:

• 16 videos from the CAVIAR data set were shown to 4 different surveillance experts;

• The experts were asked to describe the videos in their own words – in English 'surveillance speak';

• Experts describe videos in a succinct manner, using the terminology of their domain and framing the description in a 'local grammar';

• The interviews were transcribed, and sentences and phrases were marked up using a basic ontology: Action, Location, Result, Miscellaneous.

Describing Videos

Page 48: Linking Video Analysis to Annotation Technologies

One of our experts described the frame on the left as

Man in blue t-shirt, centre of scene, facing camera, raises white card high above his head. Second man wearing a dark top with white stripes down the sleeve enters scene from above. Meets a third individual with a dark top and pale trousers and an altercation occurs in the centre of the open space. Individuals meet briefly and leave scene in opposite directions. The original person with the white sleeves -- white stripes on the sleeves leaves scene below camera. Second person in the altercation leaves scene by the red chairs. That was an assault, I’d say.

Describing Videos

Page 49: Linking Video Analysis to Annotation Technologies

One of our experts described the frame on the left as

Event 1: Man in blue t-shirt, centre of scene, facing camera, raises white card high above his head.

Event 1 (marked up): Miscellaneous: Man in blue t-shirt; Location: centre of scene; Action: facing camera; Result: raises white card high above his head.

Describing Videos

Page 50: Linking Video Analysis to Annotation Technologies

One of our experts described the frame on the left as

Event 1:
M: Man in blue t-shirt,
L: centre of scene,
A: facing camera,
R: raises white card high above his head.

Event 2:
A: Second man wearing a dark top with white stripes down the sleeve enters scene
L: from above.
A: Meets a third individual with a dark top and pale trousers and an altercation occurs
L: in the centre of the open space.
A: Individuals meet briefly
R: and leave scene in opposite directions.

Describing Videos

Page 51: Linking Video Analysis to Annotation Technologies

Inter-indexer variability: triplets at the start of event descriptions

Triplet | Expert 1 % | Expert 2 % | Expert 3 % | Expert 4 % | Avg | Std Dev
ALA | 36 | 31 | 33 | 23 | 30.75 | 5.6
LAL | 23 | 21 | 15 | 17 | 19 | 3.6
ALM | 22 | 21 | 11 | 3 | 14.25 | 8.9
ALR | 11 | 0 | 11 | 17 | 9.75 | 7.1
MAL | 5 | 3 | 7 | 17 | 8 | 6.2
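A quick check of the table's statistics confirms that the Std Dev column is the sample standard deviation:

```python
import numpy as np

# Per-expert percentages for the ALA triplet, from the table above
ala = np.array([36, 31, 33, 23])
print(ala.mean())        # 30.75 -- matches the Avg column
print(ala.std(ddof=1))   # 5.56 -> 5.6, i.e. the sample standard deviation
```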

Describing Videos

Page 52: Linking Video Analysis to Annotation Technologies

Inter-indexer variability:

The most frequent triplet that ends with an R is ALR. ALR, in turn, was found in frequently occurring patterns like:

ALALR, ALALALR
ALALALALR, AALALR

Describing Videos

Page 53: Linking Video Analysis to Annotation Technologies

Inter-indexer variability: the following local grammar was 'discovered' from the corpus of marked transcripts for all four descriptions:

L?M?((A+L)M?)+L?A*M?R
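Read over the tag alphabet {M, L, A, R}, the local grammar is an ordinary regular expression, so asserting it as a finite state automaton is direct; for example:

```python
import re

# The local grammar above, read as a regular expression over {M, L, A, R}
LOCAL_GRAMMAR = re.compile(r"^L?M?((A+L)M?)+L?A*M?R$")

for seq in ["ALR", "MALR", "ALALR", "ALALALALR", "LALR", "AR"]:
    print(seq, "accepted" if LOCAL_GRAMMAR.match(seq) else "rejected")
```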

Describing Videos

Page 54: Linking Video Analysis to Annotation Technologies

Most frequently used verbs

Actions | Expert 1 % | Expert 2 % | Expert 3 % | Expert 4 % | Mean
Verb of motion | 71 | 70 | 51 | 50 | 60.5
Verb of action | 20 | 17 | 37 | 41 | 28.7
Verb of stasis | 7 | 7 | 7 | 5 | 6.5
Verb of action + prep. | 2 | 1 | 0 | 0 | 0.7

Describing Videos

Page 55: Linking Video Analysis to Annotation Technologies

Location Type | Exemplars
Real-world location | Seating area; Stairs; Walkway; Shop entrance
Relative spatial location | To the left; Left to right; From the right; Area from left
Location relative to portrayal | Right hand side of screen; Top left field of view of camera; Away from us; Towards the camera

Describing Videos

Page 56: Linking Video Analysis to Annotation Technologies

What about the labelling in CAVIAR? Actions are described through verbal nouns – three videos (Fight_Chase, Browse, LeftBagBehindChair).

Situation | Type | FC | BR3 | LBBC
Moving | Verbal noun of motion | 923 | 799 | 1295
Inactive | Adjective of stasis | 618 | 70 | 228
Fighting | Verbal noun of action | 280 | 0 | 0
Joining | Verbal noun of action | 52 | 0 | 0
Split up | Verbal noun of action | 200 | 0 | 0
Browsing | Verbal noun of motion | 0 | 135 | 0

Describing Videos

Page 57: Linking Video Analysis to Annotation Technologies

What about the labelling in CAVIAR? Actions are described through verbal nouns – three videos (Fight_Chase, Browse, LeftBagBehindChair).

Context | Type | FC | BR3 | LBBC
Walking | Verbal noun of motion | 522 | 156 | 897
Fighting | Verbal noun of action | 532 | 0 | 0
Immobile | Adjective of stasis | 1019 | 200 | 626
Browsing | Verbal noun of motion | 0 | 648 | 0

Describing Videos

Page 58: Linking Video Analysis to Annotation Technologies

What about the labelling in CAVIAR? Actions are described through verbal nouns – three videos (Fight_Chase, Browse, LeftBagBehindChair). How do our experts compare with the CAVIAR labelling?

Acts & Results | Expert Mean | CAVIAR Context | CAVIAR Situation
Motion | 59.25 | 48 | 69
Action | 29.75 | 12 | 12
Stasis | 6.25 | 40 | 20

Describing Videos

Page 59: Linking Video Analysis to Annotation Technologies

Surveillance Meta-Data Model (both)

• How can we create a model to describe CCTV video?

• What is an activity?

• Activity: an interaction between actors and scene objects (illustrated below)
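As an illustration only (the REVEAL meta-data model itself is not specified on the slide), an activity defined as an interaction between actors and scene objects might be modelled like this:

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    label: str            # e.g. "door", "ATM", "bus stop"
    region: tuple         # image region, e.g. (x, y, w, h)

@dataclass
class Actor:
    track_id: int
    category: str         # e.g. "pedestrian", "vehicle"

@dataclass
class Activity:
    """An activity: an interaction between actors and scene objects."""
    actors: list          # list of Actor
    objects: list         # list of SceneObject
    action: str           # e.g. "enter", "stop", "turn left"
    start_frame: int
    end_frame: int
```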

Page 60: Linking Video Analysis to Annotation Technologies

Multimodal Data Fusion

• Can we use machine learning methods to link vision and text?

• Can that link be created in an unsupervised fashion?

[Diagram: frame images from a video sequence linked to a text description]
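One simple way such an unsupervised vision-text link could be built, as a hedged answer to the two questions above: cluster the visual features without labels, then link each cluster to annotation keywords by co-occurrence. This is an illustrative baseline with placeholder data, not the project's fusion method.

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

# Placeholder inputs: one visual feature vector per clip, plus the
# keywords from that clip's textual annotation
features = np.random.rand(200, 16)
keywords = [["person", "walk", "corridor"]] * 200

# Step 1: cluster the visual features without any supervision
labels = KMeans(n_clusters=8, n_init=10).fit_predict(features)

# Step 2: link each visual cluster to the words that co-occur with it
links = {k: Counter() for k in range(8)}
for cluster, words in zip(labels, keywords):
    links[cluster].update(words)
for cluster, counts in sorted(links.items()):
    print(cluster, counts.most_common(3))
```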

Page 61: Linking Video Analysis to Annotation Technologies

Multimodal Data Fusion

Page 62: Linking Video Analysis to Annotation Technologies

Video Summarisation

• Automatic annotation / labelling

• Video content categorisation

• Retrieval

Page 64: Linking Video Analysis to Annotation Technologies

Summary/Conclusions

• Development of Visual Evidence Thesaurus
– Automatic Extraction of Surveillance Ontology

• Extracting Visual Semantics
– Motion Detection & Tracking
– Geometric and Colour Constancy
– Object Classification, Behaviour Analysis, Semantic Landscape

• Analysing Crowds

• Development of Surveillance Meta-Data Model

• Multimodal Data Fusion
– Fusion of Visual Semantics and Annotations

• Video Summarisation

Page 65: Linking Video Analysis to Annotation Technologies

Summary/Conclusions

• Computer Vision to extract the Visual Semantics (See Evil)

• Natural Language Processing to identify the Surveillance Ontology (Hear Evil)

• Linking the two technologies to construct a Video Summarisation System (Speak Evil)