
Page 1:

Introduction to Content-based Media Analysis and Search Technology

Technology Overview and Historical Trends from an Academic Point of View

Shin’ichi Satoh
National Institute of Informatics

Page 2:

Nowadays, abundant multimedia information is available:

Web, broadband networks, CATV, satellite... digital cameras, mobile phones, ...

Abundant Multimedia

Page 3:

YouTube: 35 hours of video uploaded every minute

Abundant Multimedia

Page 4:

Flickr: 5 billion photos
Facebook: 3 billion photos per month

Abundant Multimedia

Page 5:

How can we utilize such huge amounts of multimedia?

Search could be one promising option. Any technical problems?

It seems like multimedia search is already available: Google, Yahoo!, Bing image search, Flickr, YouTube, etc.

Abundant Multimedia

Page 6:

Multimedia search is possible only via text search technology

This problem is especially prominent for visual media (audio can be converted into text via ASR)

Major Part of Multimedia is Inaccessible

Page 7:

But a major part of multimedia data has no text data

We checked a number of photos on Flickr and found that around 85% of them have no tags or descriptions

As long as we use text-based search technologies, such large amounts of multimedia are not accessible at all!

Major Part of Multimedia is Inaccessible

Page 8:

Moreover, text-based multimedia search is NOT perfect

Searching for images of "people playing drums": some results are good,

but some results are very strange

Major Part of Multimedia is Inaccessible


Page 9:

Multimedia semantic content analysis is required

However it’s difficult
◦ Multimedia is difficult to handle by computers
◦ Inherently difficult due to “Semantic Gap”

Multimedia Content Analysis and Search

Query: Lion


Page 10:

Multimedia data is huge
◦ text: 1 kb/s (10 words), audio: 100 kb/s (MP3), video: 10 Mb/s (MPEG-2) (see the rough calculation after this slide)

Computers since the 1940s (ENIAC, 1946); text processing by computer since the 1950s! (Turing test 1950, ELIZA and SHRDLU in the 1960s)
Project Gutenberg since 1971
CD-ROM (1985), DVD (1993), larger memory, external storage (hard disk drives)

Multimedia data (audio/image/video) became manageable only after the 1990s

Handling Multimedia
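
To make these rates concrete, here is a quick back-of-the-envelope comparison in Python; the 1 kb/s, 100 kb/s, and 10 Mb/s figures are the ones quoted on the slide, while the one-hour duration is only an illustrative assumption:

    # Rough size of one hour of content at the data rates quoted on the slide.
    RATES_BPS = {
        "text (plain)": 1_000,        # ~10 words per second
        "audio (MP3)": 100_000,
        "video (MPEG-2)": 10_000_000,
    }
    SECONDS_PER_HOUR = 3600

    for medium, bits_per_second in RATES_BPS.items():
        megabytes = bits_per_second * SECONDS_PER_HOUR / 8 / 1e6   # bits -> megabytes
        print(f"{medium}: {megabytes:.1f} MB per hour")

    # text (plain): 0.5 MB per hour
    # audio (MP3): 45.0 MB per hour
    # video (MPEG-2): 4500.0 MB per hour

An hour of text fits on a floppy disk, while an hour of MPEG-2 video needs a DVD, which is roughly why multimedia became manageable only in the 1990s.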

Page 11:

Please guess what this is.

Semantic Gap

Water Lilies, Monet

Page 12:

Please guess what this is.

Semantic Gap

Page 13:

Semantic Gap

Page 14:

Computers are very good at handling text, but not so good at handling multimedia

◦ text: artificial media, symbolic by nature
◦ multimedia: ambiguous, dependent on cognition, natural media, not symbolized, etc.

Humans can easily “see” or perceive, but we cannot explain how we “see”

Semantic Gap

The quick brown fox jumps over the lazy dog

Page 15:

1980s: Landsat images, medical images, stock photos

Search using relational DBs, only via statistics and text
The issue was how to handle “huge” amounts of image data

Less attention was paid to content analysis

Early Media Search System

Page 16:

CBIR: image retrieval based on “content”
T. Kato, TRADEMARK & ART MUSEUM (1989); IBM QBIC (1990s)

Take an image as a query, and return “similar” images
Use “features,” e.g., color histogram, edge, shape, etc. (a small sketch follows after this slide)

It worked for images without metadata
Assumes that images similar in the feature space are semantically similar as well
But this is not always true

Content-Based Image Retrieval (CBIR)
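
As a concrete illustration of the query-by-example idea, here is a minimal CBIR sketch based on color histograms; it is not QBIC or any specific system, just a toy example assuming NumPy and an in-memory list of images stored as H×W×3 uint8 RGB arrays:

    import numpy as np

    def color_histogram(image, bins=8):
        """Joint RGB color histogram, L1-normalized so images of different sizes compare."""
        quantized = (image // (256 // bins)).reshape(-1, 3).astype(int)
        index = quantized[:, 0] * bins * bins + quantized[:, 1] * bins + quantized[:, 2]
        hist = np.bincount(index, minlength=bins ** 3).astype(float)
        return hist / hist.sum()

    def search(query_image, database_images, top_k=5):
        """Rank database images by histogram intersection with the query (higher = more similar)."""
        q = color_histogram(query_image)
        scores = [np.minimum(q, color_histogram(img)).sum() for img in database_images]
        return np.argsort(scores)[::-1][:top_k]

Histogram intersection is one classical similarity measure; Euclidean or chi-squared distance over the same features would work as well. The slide's caveat applies directly: two images can have nearly identical color histograms yet completely different semantics.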

Page 17:

Content-Based Image Retrieval (CBIR)

(Figure: feature space and the semantic gap)

Page 18:

Let’s take a look at face detection as an example...

Face detection is now a very stable technology

Before 1990, face detection was very unstable
◦ Shapes of facial features and their geometric relations were hard-coded

After the late 1990s, face detectors using machine learning achieved very stable performance
◦ Simply provide a lot of face image examples (a few thousand) to the system and let it learn (see the sketch after this slide)

Multimedia Semantic Content Analysis via Machine Learning

(Figure: early face detection method vs. machine learning)
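
For reference, a trained detector of exactly this kind (the Viola-Jones cascade listed in the timeline later) ships with OpenCV; the sketch below assumes the opencv-python package is installed and uses a hypothetical input file name:

    import cv2

    # Pre-trained Haar cascade, learned from thousands of labeled face examples.
    cascade_file = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_file)

    image = cv2.imread("photo.jpg")                      # hypothetical input image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Returns one (x, y, width, height) box per detected face.
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("faces_detected.jpg", image)

No facial-feature geometry is hard-coded anywhere; all of the detector's knowledge comes from the training examples.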

Page 19:

• Following the success of machine-learning-based approaches in face detection, OCR, ASR, etc., researchers decided to “train” computers for media semantic content analysis

• build a corpus: tens, hundreds, or thousands of images/video shots per concept, with manual annotation

• extract features (low-level, but recently “local” features are known to be more effective)

• train computers to automatically map low-level features to semantic categories using machine learning (a minimal sketch follows after this slide)

• Several corpora are available

Media Semantic Content Analysis
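
A minimal sketch of this train-and-classify pipeline, assuming scikit-learn; the feature vectors and concept labels below are random stand-ins for a real annotated corpus (in practice the features would be color histograms, local descriptors, etc.):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    # Stand-in for an annotated corpus: one feature vector and one concept label per image.
    rng = np.random.default_rng(0)
    num_images, feature_dim = 600, 64
    concepts = ["lion", "drum", "airplane"]
    X = rng.random((num_images, feature_dim))     # placeholder low-level features
    y = rng.choice(concepts, size=num_images)     # placeholder manual annotations

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Learn a mapping from low-level features to semantic concepts.
    classifier = LinearSVC()
    classifier.fit(X_train, y_train)
    print("held-out accuracy:", classifier.score(X_test, y_test))

With random stand-in features the accuracy stays near chance, which is exactly why the corpus and the features matter: the learned mapping is only as good as the annotated examples it is trained on.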

Page 20:

Caltech 101 (2003), Caltech 256 (2007): 101/256 concepts
Define the set of concepts first, then collect images (via image search engines)

Manual selection, so clean annotation
Up to a few hundred images per concept
Standard benchmark datasets
“Small world effect” anticipated; questionable selection of concepts

Caltech 101/256

Page 21:

airplane, chair, elephant, faces, leopards, rhino

bonsai, brain, scorpion, trilobite, yin_yang...

Page 22:

Large number of concepts, large number of images

#concepts: 10,000+; #images: 10,000,000+
Concepts are systematically selected from WordNet (a computer-readable thesaurus); a small example follows after this slide

ImageNet
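
To show what selecting concepts “systematically from WordNet” looks like, here is a small sketch using NLTK's WordNet interface (assuming the nltk package is installed; the corpus download is a one-time step):

    import nltk
    nltk.download("wordnet", quiet=True)    # one-time download of the WordNet data
    from nltk.corpus import wordnet as wn

    # A WordNet concept is a "synset": a set of synonyms sharing one meaning.
    lion = wn.synset("lion.n.01")
    print(lion.definition())

    # Walk up the hypernym (is-a) hierarchy: lion -> big cat -> feline -> ... -> entity
    node = lion
    while node.hypernyms():
        node = node.hypernyms()[0]
        print("  is a kind of:", node.name())

    # Sibling concepts under the same parent suggest related categories to collect.
    print([s.name() for s in lion.hypernyms()[0].hyponyms()])

Organizing concepts along this hierarchy is what lets a dataset scale to 10,000+ categories while keeping them systematically related rather than hand-picked.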

Page 23:

Manual annotation by Amazon Mechanical Turk
Hard to control quality
Scalability issue

Page 24:

• Currently researchers are focusing on the issue: how to effectively learn semantic concepts from a GIVEN training media corpus

• Corpus: the larger, the better
• But how to obtain a large corpus?
• CGM (Flickr, Web): noisy
• Manual annotation (AMT): costly, less scalable
• Other approaches such as the ESP game could be interesting

Issues

Page 25:

(Timeline figure, 1970–2010: milestones by medium)

Text: Project Gutenberg, bag-of-words, TF/IDF, WSJ, TREC, PageRank

Audio/Speech: Viterbi, HMM, MFCC, single digit, 1000 words, LVCSR, IBM ViaVoice

Image: USPS OCR, CMU-MIT Face DB, V-J Face Det., Caltech 101, Pascal VOC, ImageNet

Video: TRECVID

Page 26:

Multimedia content analysis research: “just started”

More advanced results to come
Business value? Killer applications?

Conclusion