Wei-Ta Chu, Che-Cheng Lin, Jen-Yu Yu

Using Cross-Media Correlation for Scene Detection in Travel Videos


Page 1

Using Cross-Media Correlation for Scene Detection in Travel Videos

Page 2

Outline

Introduction
Approach
Experiments
Conclusion

Page 3

Introduction

Why use cross-media correlation for scene detection in travel videos?

What is the correlation between photos and videos?

More and more people record their daily lives and travel experiences with both digital cameras and camcorders, since both kinds of devices have become much cheaper.

Page 4

Why use cross-media correlation for scene detection in travel videos? What is the correlation between photos and videos?

People often capture travel experiences with both still cameras and camcorders.

The content stored in photos and videos carries similar information, such as landmarks and human faces.

Massive amounts of home video are captured in uncontrolled environments and suffer from problems such as overexposure/underexposure and hand shaking.

Page 5

Why use cross-media correlation for scene detection in travel videos?

Direct scene detection in video is hard.

There is high correlation between photos and video.

Photos provide high-quality data, so scene detection on them is much easier.

Page 6

Approach

Why do people use photos and videos for different purposes, even when capturing the same things?

Photo: to obtain high-quality data, capturing a famous landmark or a person's face.

Video: to capture the evolution of an event.

We utilize the correlation so that work that is hard to conduct in videos can instead be done in photos, where it is easier.

Page 7

Framework

To perform scene detection in photos: first we cluster photos by checking their time information.

To perform scene detection in videos: first we extract several keyframes for each video shot, then find the optimal matching between the photo and keyframe sequences.

Page 8

The idea of scene detection based on cross-media alignment

Page 9

The proposed cross-media scene detection framework

Photos: time-based clustering → visual word representation → DP-based matching.

Videos: shot change detection → keyframe extraction → filtering (discarding motion-blurred keyframes) → visual word representation → DP-based matching.

The matching stage outputs the scene boundaries.

This process not only reduces the time of cross-media matching but also eliminates the influence of bad-quality images.

Page 10

Preprocessing: Scene Detection for Photos

Utilize differences in shooting time to cluster photos.

Denote the time difference between the i-th photo and the (i+1)-th photo as g_i:

g_i = t_{i+1} - t_i

A scene change is claimed to occur between the n-th and (n+1)-th photos when the gap g_n is sufficiently large relative to its neighbors. We set K to 17 and d to 10 in this work.

K is an empirical threshold; d is the size of the sliding window.
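The boundary test itself appeared only as an image on the slide. The Python sketch below assumes a Platt-style log-gap criterion, in which a boundary is declared where a gap clearly exceeds the local average of log gaps within a window of d neighbors; the test is illustrative, not the authors' stated formula.

    import math

    def photo_scene_boundaries(timestamps, K=17, d=10):
        """Time-based photo clustering (a sketch, not the paper's exact test).

        timestamps: shooting times in seconds, sorted ascending.
        Returns each index n where a scene change is claimed between
        the n-th and (n+1)-th photos.
        """
        # g_i = t_{i+1} - t_i, as defined above
        gaps = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
        boundaries = []
        for n, g in enumerate(gaps):
            lo, hi = max(0, n - d), min(len(gaps), n + d + 1)
            window = gaps[lo:hi]  # sliding window of up to 2d + 1 gaps
            local_avg = sum(math.log(x + 1) for x in window) / len(window)
            # Assumed test: a gap far above its local context is a boundary.
            if math.log(g + 1) > math.log(K) + local_avg:
                boundaries.append(n)
        return boundaries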

Page 11

Preprocessing

Use the global k-means algorithm to extract keyframes.

Detect and filter out blurred keyframes; as sketched below, this not only reduces the time of cross-media matching but also eliminates the influence of bad-quality images.
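The slides do not say how blurred keyframes are detected. A common focus measure is the variance of the Laplacian; the sketch below uses it with OpenCV, and the threshold of 100 is an arbitrary placeholder, not a value from the paper.

    import cv2

    def is_blurred(path, threshold=100.0):
        """Flag a keyframe as blurred via the variance-of-Laplacian focus
        measure (an assumed method; the threshold is a placeholder)."""
        gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if gray is None:
            return True  # treat unreadable frames as discardable
        return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

    # Usage with a hypothetical list of keyframe image paths:
    # sharp = [p for p in keyframe_paths if not is_blurred(p)]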

Page 12

Visual Word Representation

Apply the difference-of-Gaussians (DoG) detector to find feature points in keyframes and photos.

Use SIFT (Scale-Invariant Feature Transform) to describe each point as a 128-dimensional feature vector.

SIFT feature vectors are clustered by a k-means algorithm, and feature points in the same cluster are said to belong to the same visual word.
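A minimal sketch of this step, assuming OpenCV (4.4 or later, where SIFT is built in; its detector is DoG-based and its descriptors are 128-dimensional, as stated above) and scikit-learn for k-means. The vocabulary size of 500 is a placeholder; the paper evaluates several sizes.

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def build_vocabulary(image_paths, n_words=500):
        """Cluster SIFT descriptors from all images into visual words."""
        sift = cv2.SIFT_create()
        descriptors = []
        for path in image_paths:
            gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            _, desc = sift.detectAndCompute(gray, None)
            if desc is not None:
                descriptors.append(desc)
        return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(descriptors))

    def word_histogram(desc, vocabulary):
        """Represent one image as a normalized visual word histogram."""
        words = vocabulary.predict(desc)
        hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)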

Page 13

Visual Word Representation

Keyframes and photos → SIFT feature points (128-dimensional feature vectors) → k-means clustering → visual words.

Page 14

Visual Word Histogram Matching

Let X_i denote the i-th prefix of X, i.e., X_i = <x_1, x_2, ..., x_i>.

LCS(X_i, Y_j) denotes the length of the longest common subsequence between X_i and Y_j.
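The recurrence behind LCS(X_i, Y_j) was shown only as an image; below is the standard dynamic programming formulation. Because the sequence elements here are visual word histograms rather than symbols, exact equality is replaced by a similar() predicate; thresholded histogram similarity is an assumption, though it is consistent with the similarity thresholds mentioned in the experiments.

    def lcs_length(X, Y, similar):
        """Length of the longest common subsequence of X and Y, where
        c[i][j] = LCS(X_i, Y_j) for the i-th and j-th prefixes."""
        m, n = len(X), len(Y)
        c = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if similar(X[i - 1], Y[j - 1]):
                    c[i][j] = c[i - 1][j - 1] + 1
                else:
                    c[i][j] = max(c[i - 1][j], c[i][j - 1])
        return c[m][n]

    # A hypothetical match test: histogram intersection over a threshold.
    # similar = lambda h1, h2: sum(map(min, h1, h2)) > 0.6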

Page 15

Evaluation Data

Page 16

Evaluation Metric

The first term indicates the fraction of the currently evaluated scene, and the second term indicates how much a given scene is split into smaller scenes.

The purity value ranges from 0 to 1; a larger purity value means the result is closer to the ground truth.

τ(s_i, s_j*) is the length of the overlap between the scene s_i and the scene s_j*.

τ(s_i) is the length of the scene s_i.

T is the total length of all scenes.
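The purity formula itself appeared only as an image. The sketch below is a plausible reconstruction from the definitions above: the first factor weights each scene by its length, and the squared-overlap sum shrinks when a scene is split across several ground-truth scenes. Treat the exact form as an assumption.

    def purity(detected, ground_truth):
        """Purity of detected scenes against ground truth, both given as
        lists of (start, end) intervals. Assumed form:
        sum_i tau(s_i)/T * sum_j (tau(s_i, s_j*) / tau(s_i))**2."""
        def overlap(a, b):  # tau(s_i, s_j*): length of interval overlap
            return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

        T = sum(end - start for start, end in detected)  # total length
        total = 0.0
        for s_i in detected:
            tau_i = s_i[1] - s_i[0]
            split = sum((overlap(s_i, s_j) / tau_i) ** 2 for s_j in ground_truth)
            total += (tau_i / T) * split
        return total  # in [0, 1]; larger is closer to ground truth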

Page 17

Performance in terms of purity based on different numbers of visual words, with different similarity thresholds

Page 18

Performance based on four different scene detection approaches

(One of the compared approaches is based on hue-saturation-value color features.)

Page 19

Conclusion

For videos, we extract keyframes with the global k-means algorithm. (Scene spots can easily be determined from the time information of photos.)

We represent the keyframes and the photo set as sequences of visual words, transforming scene detection into a sequence matching problem.

Page 20

Conclusion

Using a dynamic programming approach, we find the optimal matching between the two sequences and determine video scene boundaries with the help of photo scene boundaries.

Experiments on different travel videos and with different parameter settings show that exploiting the correlation between different modalities is effective.