Multimedia Image and Video Processing, Second Edition (Image Processing Series)

  • Multimedia Image and Video Processing

    Second Edition


  • Multimedia Image and Video Processing

    Edited by Ling Guan, Yifeng He, and Sun-Yuan Kung

    Second Edition

    CRC Press is an imprint of the Taylor & Francis Group, an informa business

    Boca Raton London New York


  • CRC Press
    Taylor & Francis Group
    6000 Broken Sound Parkway NW, Suite 300
    Boca Raton, FL 33487-2742

    © 2012 by Taylor & Francis Group, LLC
    CRC Press is an imprint of Taylor & Francis Group, an Informa business

    No claim to original U.S. Government works
    Version Date: 20120215

    International Standard Book Number-13: 978-1-4398-3087-1 (eBook - PDF)

    This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

    Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

    For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

    Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

    Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com


  • Contents

    List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
    Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxvii
    Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . xxix
    Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xxxi
    Editors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . li
    Contributors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . liii

    Part I Fundamentals of Multimedia

    1. Emerging Multimedia Standards . . . 3
       Huifang Sun

    2. Fundamental Methods in Image Processing . . . 29
       April Khademi, Anastasios N. Venetsanopoulos, Alan R. Moody, and Sridhar Krishnan

    3. Application-Specific Multimedia Architecture . . . 77
       Tung-Chien Chen, Tzu-Der Chuang, and Liang-Gee Chen

    4. Multimedia Information Mining . . . 129
       Zhongfei (Mark) Zhang and Ruofei Zhang

    5. Information Fusion for Multimodal Analysis and Recognition . . . 153
       Yongjin Wang, Ling Guan, and Anastasios N. Venetsanopoulos

    6. Multimedia-Based Affective Human–Computer Interaction . . . 173
       Yisu Zhao, Marius D. Cordea, Emil M. Petriu, and Thomas E. Whalen

    Part II Methodology, Techniques, and Applications: Coding of Video and Multimedia Content

    7. Part Overview: Coding of Video and Multimedia Content . . . 197
       Oscar Au and Bing Zeng

    8. Distributed Video Coding . . . 215
       Zixiang Xiong

    9. Three-Dimensional Video Coding . . . 233
       Anthony Vetro

    10. AVS: An Application-Oriented Video Coding Standard . . . 255
        Siwei Ma, Li Zhang, Debin Zhao, and Wen Gao

    Part III Methodology, Techniques, and Applications: Multimedia Search, Retrieval, and Management

    11. Multimedia Search and Management . . . 291
        Linjun Yang, Xian-Sheng Hua, and Hong-Jiang Zhang


    12. Video Modeling and Retrieval . . . 301
        Zheng-Jun Zha, Jin Yuan, Yan-Tao Zheng, and Tat-Seng Chua

    13. Image Retrieval . . . 319
        Lei Zhang and Wei-Ying Ma

    14. Digital Media Archival . . . 345
        Chong-Wah Ngo and Song Tan

    Part IV Methodology, Techniques, and Applications: Multimedia Security

    15. Part Review on Multimedia Security . . . 367
        Alex C. Kot, Huijuan Yang, and Hong Cao

    16. Introduction to Biometry . . . 397
        Carmelo Velardo, Jean-Luc Dugelay, Lionel Daniel, Antitza Dantcheva, Nesli Erdogmus, Neslihan Kose, Rui Min, and Xuran Zhao

    17. Watermarking and Fingerprinting Techniques for Multimedia Protection . . . 419
        Sridhar Krishnan, Xiaoli Li, Yaqing Niu, Ngok-Wah Ma, and Qin Zhang

    18. Image and Video Copy Detection Using Content-Based Fingerprinting . . . 459
        Mehrdad Fatourechi, Xudong Lv, Mani Malek Esmaeili, Z. Jane Wang, and Rabab K. Ward

    Part V Methodology, Techniques, and Applications: Multimedia Communications and Networking

    19. Emerging Technologies in Multimedia Communications and Networking: Challenges and Research Opportunities . . . 489
        Chang Wen Chen

    20. A Proxy-Based P2P Live Streaming Network: Design, Implementation, and Experiments . . . 519
        Dongni Ren, S.-H. Gary Chan, and Bin Wei

    21. Scalable Video Streaming over the IEEE 802.11e WLANs . . . 531
        Chuan Heng Foh, Jianfei Cai, Yu Zhang, and Zefeng Ni

    22. Resource Optimization for Distributed Video Communications . . . 549
        Yifeng He and Ling Guan

    Part VI Methodology, Techniques, and Applications: Architecture Design and Implementation for Multimedia Image and Video Processing

    23. Algorithm/Architecture Coexploration . . . 573
        Gwo Giun (Chris) Lee, He Yuan Lin, and Sun Yuan Kung


    24. Dataflow-Based Design and Implementation of Image Processing Applications . . . 609
        Chung-Ching Shen, William Plishker, and Shuvra S. Bhattacharyya

    25. Application-Specific Instruction Set Processors for Video Processing . . . 631
        Sung Dae Kim and Myung Hoon Sunwoo

    Part VII Methodology, Techniques, and Applications: Multimedia Systems and Applications

    26. Interactive Multimedia Technology in Learning: Integrating Multimodality, Embodiment, and Composition for Mixed-Reality Learning Environments . . . 659
        David Birchfield, Harvey Thornburg, M. Colleen Megowan-Romanowicz, Sarah Hatton, Brandon Mechtley, Igor Dolgov, Winslow Burleson, and Gang Qian

    27. Literature Survey on Recent Methods for 2D to 3D Video Conversion . . . 691
        Raymond Phan, Richard Rzeszutek, and Dimitrios Androutsos

    28. Haptic Interaction and Avatar Animation Rendering Centric Telepresence in Second Life . . . 717
        A. S. M. Mahfujur Rahman, S. K. Alamgir Hossain, and A. El Saddik

    Index . . . 741


  • List of Figures

    1.1 Typical MPEG-1 encoder structure. . . . 6

    1.2 (a) An example of an MPEG GOP of 9, N = 9, M = 3. (b) Transmission order of an MPEG GOP of 9 and (c) display order of an MPEG GOP of 9. . . . 7

    1.3 Two zigzag scan methods for MPEG-2 video coding. . . . 8

    1.4 Block diagram of an H.264 encoder. . . . 13

    1.5 Encoding process of JPEG-2000. . . . 18

    1.6 (a) MPEG-1 audio encoder. (b) MPEG-1 audio decoder. . . . 20

    1.7 Relations between tools of MPEG-7. . . . 23

    1.8 Illustration of MPEG-21 DIA. . . . 25

    2.1 Histogram example with L bins. (a) FLAIR MRI (brain). (b) PDF pG(g) of (a). . . . 31

    2.2 Example histograms with varying numbers of bins (bin widths). (a) 100 bins, (b) 30 bins, (c) 10 bins, (d) 5 bins. . . . 32

    2.3 Empirical histogram and KDA estimate of two random variables, N(0, 1) and N(5, 1). (a) Histogram. (b) KDA. . . . 33

    2.4 Types of kernels for KDA. (a) Box, (b) triangle, (c) Gaussian, and (d) Epanechnikov. . . . 34

    2.5 KDA of random sample (N(0, 1) + N(5, 1)) for box, triangle, and Epanechnikov kernels. (a) Box, (b) triangle, and (c) Epanechnikov. . . . 34

    2.6 Example image and its corresponding histogram with mean and variance indicated. (a) g(x, y). (b) PDF pG(g) of (a). . . . 36

    2.7 HE techniques applied to mammogram lesions. (a) Original. (b) Histogram equalized. . . . 37

    2.8 The KDA of lesion (e) in Figure 2.7, before and after enhancement. Note that after equalization, the histogram resembles a uniform PDF. (a) Before equalization. (b) After equalization. . . . 38
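    Caption 2.8 states the defining property of histogram equalization (HE): after equalization the graylevel histogram approximates a uniform PDF. As a minimal illustration of that idea (a hypothetical NumPy sketch assuming an 8-bit grayscale image, not the authors' code):

        import numpy as np

        def equalize_histogram(img, levels=256):
            # Build the empirical CDF of the gray levels and use it as a
            # look-up table; mapping through the CDF flattens the histogram.
            hist, _ = np.histogram(img, bins=levels, range=(0, levels))
            cdf = hist.cumsum() / hist.sum()
            lut = np.round((levels - 1) * cdf).astype(np.uint8)
            return lut[img]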

    2.9 Image segmentation based on global histogram thresholding. (a) Original. (b) B(x, y) · g(x, y). (c) (1 − B(x, y)) · g(x, y). . . . 39

    2.10 The result of a three-class Otsu segmentation on the image of Figure 2.6a. The left image is the segmentation result of all three classes (each class is assigned a unique intensity value). The remaining images are binary segmentations for each tissue class B(x, y). (a) Otsu segmentation. (b) Background class. (c) Brain class. (d) Lesion class. . . . 40

    2.11 Otsu's segmentation on a retinal image showing several misclassified pixels. (a) Original. (b) PDF pG(g) of (a). (c) Otsu segmentation. . . . 41


    2.12 Example FLAIR with WML, gradient image, and fuzzy edge mapping functions. (a) y(x1, x2). (b) g(x1, x2) = |∇y|. (c) k and pG(g). (d) k(x1, x2). . . . 42

    2.13 T1- and T2-weighted MR images (1 mm slice thickness) of the brain and corresponding histograms. Images are from the BrainWeb database; see http://www.bic.mni.mcgill.ca/brainweb/. (a) T1-weighted MRI. (b) T2-weighted MRI. (c) Histogram of Figure 2.13a. (d) Histogram of Figure 2.13b. . . . 44

    2.14 T1- and T2-weighted MR images (1 mm slice thickness) with 9% noise and corresponding histograms. Images are from the BrainWeb database; see http://www.bic.mni.mcgill.ca/brainweb/. (a) T1-weighted MRI with 9% noise. (b) T2-weighted MRI with 9% noise. (c) Histogram of Figure 2.14a. (d) Histogram of Figure 2.14b. . . . 45

    2.15 (Un)correlated noise sources and their 3D surface representation. (a) 2D Gaussian IID noise. (b) Surface representation of Figure 2.15a. (c) 2D colored noise. (d) Surface representation of Figure 2.15c. . . . 47

    2.16 Empirically found M^2 distribution and the observed M_obs^2 for uncorrelated and correlated 2D data of Figure 2.15. (a) p(M^2) and M_obs^2 for Figure 2.15a. (b) p(M^2) and M_obs^2 for Figure 2.15c. . . . 48

    2.17 Correlated 2D variables generated from normally (N) and uniformly (U) distributed random variables. Parameters used to simulate the random distributions are shown in Table 2.1. . . . 49

    2.18 1D nonstationary data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

    2.19 Grid for 2D-extension of RA test. (a), (b), and (c) show several examples of different spatial locations where the number of RAs is computed. . . . 51

    2.20 Empirically found distribution of R and the observed R for 2D stationary and nonstationary data. (a) IID stationary noise. (b) p(R) and R of (a). (c) Nonstationary noise. (d) p(R) and R of (c). . . . 52

    2.21 Nonstationary 2D variables generated from normally (N) and uniformly (U) distributed random variables. Parameters (μ, σ) and (a, b) used to simulate the underlying distributions are shown in Table 2.1. . . . 53

    2.22 Scatterplot of gradient magnitude images of the original image (x-axis) and reconstructed version (y-axis). . . . 54

    2.23 Bilaterally filtered examples. (a) Original. (b) Bilaterally filtered. (c) Original. (d) Bilaterally filtered. . . . 56

    2.24 Image reconstruction of example shown in Figure 2.23a. (a) Y_rec^0.35, (b) Y_rec^0.50, (c) Y_est^0.58, and (d) Y_rec^0.70. . . . 58

    2.25 Reconstruction example (α = 0.51 and α = 0.53, respectively). (a) S(Y_rec(x1, x2))/C(Y_rec(x1, x2)). (b) Y_rec^0.51. (c) Hist(Y). (d) Hist(Y_rec^0.51). (e) S(Y_rec(x1, x2))/C(Y_rec(x1, x2)). (f) Y_rec^0.53. (g) Hist(Y). (h) Hist(Y_rec^0.53). . . . 60

    2.26 Normalized differences in smoothness and sharpness between the proposed method and the bilateral filter. (a) Smoothness. (b) Sharpness. . . . 61


    2.27 Fuzzy edge strength k versus intensity y for the image in Figure 2.23a. (a) k vs. y, (b) the global edge profile as a function of y, and (c) the corresponding spatial map over (x1, x2). . . . 62

    2.28 Original image y(x1, x2), global edge profile as a function of y, and global edge values mapped back to the spatial domain (x1, x2). (a) y(x1, x2), (b) the edge profile, and (c) its spatial map. . . . 63

    2.29 Modified transfer function c(y) with original graylevel PDF pY(y), and the resultant image c(x1, x2). (a) c(y) and pY(y) and (b) c(x1, x2). . . . 64

    2.30 CE transfer function and contrast-enhanced image. (a) yCE(y) and pY(y). (b) yCE(x1, x2). . . . 65

    2.31 Original, contrast-enhanced images and WML segmentation. (a–c) Original. (d–f) Enhanced. (g–i) Segmentation. . . . 66

    2.32 One level of DWT decomposition of retinal images. (a) Normal image decomposition; (b) decomposition of the retinal images with diabetic retinopathy. CE was performed in the higher frequency bands (HH, LH, HL) for visualization purposes. . . . 68

    2.33 Medical images exhibiting texture. (a) Normal small bowel, (b) small bowel lymphoma, (c) normal retinal image, (d) central retinal vein occlusion, (e) benign lesion, and (f) malignant lesion. CE was performed on (e) and (f) for visualization purposes. . . . 71

    3.1 A general architecture of a multimedia application system. . . . 79

    3.2 (a) The general architecture and (b) hardware design issues of the video/image processing engine. . . . 80

    3.3 Memory hierarchy: trade-offs and characteristics. . . . . . . . . . . . . . . . . . . . . . . . 82

    3.4 Conventional two-stage macroblock pipelining architecture. . . . . . . . . . . . . . . . 84

    3.5 Block diagram of the four-stage MB pipelining H.264/AVC encoding system. . . 85

    3.6 The spatial relationship between the current macroblock and the searching range. . . . 86

    3.7 The procedure of ME in a video coding system for a sequence. . . . . . . . . . . . . . 87
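    Captions 3.6 and 3.7 refer to the basic ME loop: every candidate displacement inside the search range is scored against the reference frame, typically by the sum of absolute differences (SAD). As a minimal illustration (a hypothetical NumPy sketch of full-search block matching, not the chapter's hardware architectures):

        import numpy as np

        def full_search_me(cur_blk, ref, x, y, sr=16):
            # Exhaustively score every candidate motion vector within a
            # +/-sr window around the block position (x, y) using SAD.
            n = cur_blk.shape[0]
            best_cost, best_mv = float("inf"), (0, 0)
            for dy in range(-sr, sr + 1):
                for dx in range(-sr, sr + 1):
                    u, v = y + dy, x + dx
                    if u < 0 or v < 0 or u + n > ref.shape[0] or v + n > ref.shape[1]:
                        continue  # candidate falls outside the reference frame
                    cand = ref[u:u + n, v:v + n].astype(np.int32)
                    sad = int(np.abs(cur_blk.astype(np.int32) - cand).sum())
                    if sad < best_cost:
                        best_cost, best_mv = sad, (dx, dy)
            return best_mv, best_cost

    The SAD-tree and partial-SAD architectures listed below (Figures 3.15 through 3.22) are hardware reorganizations of exactly this summation.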

    3.8 Block partition of H.264/AVC variable block size. . . . . . . . . . . . . . . . . . . . . . . . 88

    3.9 The hardware architecture of 1DInterYSW, where N = 4, Ph = 2, and Pv = 2. . . . 89

    3.10 The hardware architecture of 2DInterYH, where N = 4, Ph = 2, and Pv = 2. . . . 90

    3.11 The hardware architecture of 2DInterLC, where N = 4, Ph = 2, and Pv = 2. . . . 90

    3.12 The hardware architecture of 2DIntraVS, where N = 4, Ph = 2, and Pv = 2. . . . 91

    3.13 The hardware architecture of 2DIntraKP, where N = 4, Ph = 2, and Pv = 2. . . . 92

    3.14 The hardware architecture of 2DIntraHL, where N = 4, Ph = 2, and Pv = 2. . . . 92

    3.15 (a) The concept, (b) the hardware architecture, and (c) the detailed architecture of PE array with 1-D adder tree, of Propagate Partial SAD, where N = 4. . . . 93

    2012 by Taylor & Francis Group, LLC www.ebook3000.com

  • xii List of Figures

    3.16 (a) The concept, (b) the hardware architecture, and (c) the scan order and memory access, of SAD Tree, where N = 4. . . . 94

    3.17 The hardware architecture of inter-level PE with data flow I for (a) FBSME, where N = 16; (b) VBSME, where N = 16 and n = 4. . . . 95

    3.18 The hardware architecture of Propagate Partial SAD with Data Flow II for VBSME, where N = 16 and n = 4. . . . 96

    3.19 The hardware architecture of SAD Tree with Data Flow III for VBSME, where N = 16 and n = 4. . . . 97

    3.20 Block diagram of the IME engine. It mainly consists of eight PE-Array SAD Trees. Eight horizontally adjacent candidates are processed in parallel. . . . 101

    3.21 M-parallel PE-array SAD Tree architecture. The inter-candidate data reuse can be achieved in both horizontal and vertical directions with Ref. Pels Reg. Array, and the on-chip SRAM bandwidth is reduced. . . . 101

    3.22 PE-array SAD Tree architecture. The costs of 16 4×4 blocks are separately summed up by 16 2-D adder sub-trees and then reduced by one VBS tree for larger blocks. . . . 102

    3.23 The operation loops of MRF-ME for H.264/AVC. . . . . . . . . . . . . . . . . . . . . . . . 103

    3.24 The level-C data reuse scheme. (a) There are overlapped regions of SWs for horizontally adjacent MBs; (b) the physical location to store SW data in local memory. . . . 103

    3.25 The MRSC scheme for MRF-ME requires multiple SW memories. The reference pixels of multiple reference frames are loaded independently according to the level-C data reuse scheme. . . . 104

    3.26 The SRMC scheme can exploit the frame-level DR for MRF-ME. Only a single SW memory is required. . . . 105

    3.27 Schedule of MB tasks for MRF-ME; (a) the original (MRSC) version; (b) the proposed (SRMC) version. . . . 106

    3.28 Estimated MVPs in PMD for Lagrangian mode decision. . . . . . . . . . . . . . . . . . 107

    3.29 Proposed architecture with SRMC scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

    3.30 The schedule of SRMC scheme in the proposed framework. . . . . . . . . . . . . . . . 109

    3.31 The rate-distortion efficiency of the reference software and the proposed framework. Four sequences with different characteristics are used for the experiment. Foreman has lots of deformation with medium motions. Mobile has complex textures and regular motion. Akiyo has a still scene, while Stefan has large motions. The encoding parameters are baseline profile, IPPP... structure, CIF, 30 frames/s, 4 reference frames, 16-pel search range, and low-complexity mode decision. (a) Akiyo (CIF, 30 fps); (b) Mobile (CIF, 30 fps); (c) Stefan (CIF, 30 fps); (d) Foreman (CIF, 30 fps). . . . 110

    3.32 Multiple reference frame motion estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . 112

    3.33 Variable block size motion estimation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112


    3.34 Interpolation scheme for the luminance component: (a) 6-tap FIR filter for half-pixel interpolation. (b) Bilinear filter for quarter-pixel interpolation. . . . 112

    3.35 Best partition for a picture with different quantization parameters (black block: inter block, gray block: intra block). . . . 113

    3.36 FME refinement flow for each block and sub-block. . . . . . . . . . . . . . . . . . . . . . 113

    3.37 FME procedure of Lagrangian inter mode decision in H.264/AVC reference software. . . . 114

    3.38 The matching cost flowchart of each candidate. . . . . . . . . . . . . . . . . . . . . . . . . 115

    3.39 Nested loops of fractional motion estimation. . . . . . . . . . . . . . . . . . . . . . . . . . 115

    3.40 Data reuse exploration with loop analysis. (a) Original nested loops; (b) Loop i and Loop j are interchanged. . . . 116

    3.41 Intra-candidate data reuse for fractional motion estimation. (a) Reference pixels in the overlapped (gray) interpolation windows for two horizontally adjacent interpolated pixels P0 and P1 can be reused; (b) overlapped (gray) interpolation-window data reuse for a 4×4 interpolated block. In total, 9×9 reference pixels are enough with the technique of intra-candidate data reuse. . . . 118

    3.42 Inter-candidate data reuse for half-pel refinement of fractional motion estimation. The overlapped (gray) region of interpolation windows can be reused to reduce memory access. . . . 118

    3.43 Hardware architecture for fractional motion estimation engine. . . . . . . . . . . . . 119

    3.44 Block diagram of 4×4-block PU. . . . 120

    3.45 Block diagram of interpolation engine. . . . 121

    3.46 Hardware processing flow of variable-block size fractional motion estimation. (a) Basic flow; (b) advanced flow. . . . 121

    3.47 Inter-4×4-block interpolation window data reuse. (a) Vertical data reuse, (b) horizontal data reuse. . . . 122

    3.48 Search window SRAM data arrangement. (a) Physical location of reference pixels in the search window; (b) traditional data arrangement with 1-D random access; (c) proposed ladder-shaped data arrangement with 2-D random access. . . . 122

    3.49 Illustration of fractional motion estimation algorithm. The white circles are the best integer-pixel candidates. The light-gray circles are the half-pixel candidates. The dark-gray circles are the quarter-pixel candidates. The circles labeled 1 and 2 are the candidates refined in the first and second passes, respectively. (a) Conventional two-step algorithm; (b) proposed one-pass algorithm. The 25 candidates inside the dark square are processed in parallel. . . . 124

    3.50 Rate-distortion performance of the proposed one-pass FME algorithm. The solid, dashed, and dotted lines show the performance of the two-step algorithm in the reference software, the proposed one-pass algorithm, and the algorithm with only half-pixel refinement. . . . 125


    3.51 Architecture of fractional motion estimation. The processing engines on the left side are used to generate the matching costs of integer-pixel and half-pixel candidates. The transformed residues are reused to generate the matching costs of quarter-pixel candidates with the processing engines inside the light-gray box on the right side. Then, the 25 matching costs are compared to find the best MV. . . . 125

    4.1 Relationships among the fields interconnected with multimedia information mining. . . . 132

    4.2 The typical architecture of a multimedia information mining system. . . . . . . . . 134

    4.3 Graphic representation of the model developed for the randomized data generation for exploiting the synergy between imagery and text. . . . 137

    4.4 The architecture of the prototype system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

    4.5 An example of image and annotation word pairs in the generated database. The number following each word is the corresponding weight of the word. . . . 143

    4.6 The interface of the automatic image annotation prototype. . . . . . . . . . . . . . . . 144

    4.7 Average SWQP(n) comparisons between MBRM and the developed approach. . . . 146

    4.8 Precision comparison between UPMIR and UFM. . . . . . . . . . . . . . . . . . . . . . . 147

    4.9 Recall comparison between UPMIR and UFM. . . . . . . . . . . . . . . . . . . . . . . . . . 148

    4.10 Average precision comparison among UPMIR, Google Image Search, and Yahoo! Image Search. . . . 149

    5.1 Multimodal information fusion levels. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

    5.2 Block diagram of kernel matrix fusion-based system. . . . . . . . . . . . . . . . . . . . . 164

    5.3 Block diagram of KCCA-based fusion at the feature level. . . . . . . . . . . . . . . . . 165

    5.4 Block diagram of KCCA-based fusion at the score level. . . . . . . . . . . . . . . . . . . 165

    5.5 Experimental results of kernel matrix fusion (KMF)-based method (weighted sum (WS), multiplication (M)). . . . 167

    5.6 Experimental results of KCCA-based fusion at the feature level. . . . . . . . . . . . 167

    5.7 Experimental results of KCCA-based fusion at the score level. . . . . . . . . . . . . . 168

    6.1 HCI devices for three main human sensing modalities: audio, video, and haptic. . . . 174

    6.2 Examples of emotional facial expressions from JAFFE (first three rows), MMI (fourth row), and FG-NET (last row) databases. . . . 177

    6.3 Muscle-controlled 3D wireframe head model. . . . . . . . . . . . . . . . . . . . . . . . . . 179

    6.4 Person-dependent recognition of facial expressions for faces from the MMI database. . . . 179

    6.5 Person-independent recognition of facial expressions for faces from the MMI database. . . . 180


    6.6 Visual tracking and recognition of facial expression. . . . . . . . . . . . . . . . . . . . . 181

    6.7 General steps of proposed head movement detection. . . . . . . . . . . . . . . . . . . . 182

    6.8 General steps of proposed eye gaze detection. . . . . . . . . . . . . . . . . . . . . . . . . . 183

    6.9 Geometrical eye and nostril model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

    6.10 Example of gaze detection based on the |D − D0| global parameter difference. . . . 184

    6.11 Taxonomy of the human-head language attributes. . . . 185

    6.12 Fuzzy inference system for multimodal emotion evaluation. . . . 186

    6.13 Fuzzy membership functions for the five input variables. (a) Happiness, (b) anger, (c) sadness, (d) head-movement, and (e) eye-gaze. . . . 187

    6.14 Fuzzy membership functions for the three output variables. (a) Emotion set-A, (b) emotion set-B, and (c) emotion set-C. . . . 188

    6.15 Image sequence of female subject showing the admire emotion state. . . . . . . . . 188

    6.16 Facial muscles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

    6.17 The architecture of the 3D head and facial animation system. . . . . . . . . . . . . . . 190

    6.18 The muscle control of the wireframe model of the face. . . . . . . . . . . . . . . . . . . 191

    6.19 Fundamental facial expressions generated by the 3D muscle-controlled facial animation system: surprise, disgust, fear, sadness, anger, happiness, and neutral position. . . . 192

    7.1 9-Mode intraprediction for 4×4 blocks. . . . 203

    7.2 4×4 ICT and inverse ICT matrices in H.264. . . . 204

    7.3 Multiple reference frame. . . . 204

    8.1 (a) Direct MT source coding. (b) Indirect MT source coding (the chief executive officer (CEO) problem). . . . 217

    8.2 Block diagram of the interframe video coder proposed by Witsenhausen and Wyner in their 1980 patent. . . . 219

    8.3 Witsenhausen–Wyner video coding. (a) Encoding, (b) decoding. . . . 221

    8.4 Witsenhausen–Wyner video coding versus H.264/AVC and H.264/AVC IntraSkip coding when the bitstreams are protected with Reed–Solomon codes and transmitted over a simulated CDMA2000 1X channel. (a) Football with a compression/transmission rate of 3.78/4.725 Mb/s. (b) Mobile with a compression/transmission rate of 4.28/5.163 Mb/s. . . . 221

    8.5 Block diagram of layered WZ video coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 222

    8.6 Error robustness performance of WZ video coding compared with H.26L FGS for Football. The 10th decoded frame by H.26L FGS (a) and WZ video coding (b) in the 7th simulated transmission (out of a total of 200 runs). . . . 222

    8.7 (a) 3D camera settings and (b) first pair of frames from the 720×288 stereo sequence tunnel. . . . 223


    8.8 PSNR versus frame number comparison among separate H.264/AVC coding, two-terminal video coding, and joint encoding at the same sum rate of 6.581 Mbps for the (a) left and the (b) right sequences of the tunnel. . . . 224

    8.9 The general framework proposed in [46] for three-terminal video coding. . . . . 224

    8.10 An example of left-and-right-to-center frame warping (based on the first frames of the Ballet sequence). (a) The decoded left frame. (b) The original center frame. (c) The decoded right frame. (d) The left frame warped to the center. (e) The warped center frame, and (f) the right frame warped to the center. . . . 225

    8.11 Depth camera-assisted MT video coding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226

    8.12 An MT video capturing system with four HD texture cameras and one low-resolution (QCIF) depth camera. . . . 227

    8.13 An example of depth map refinement and side information comparisons. (a) The original HD frame. (b) The preprocessed (warped) depth frame. (c) The refined depth frame. (d) The depth frame generated without the depth camera. (e) Side information with depth camera help, and (f) side information without depth camera help. . . . 228

    9.1 Applications of 3D and multiview video. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

    9.2 Illustration of inter-view prediction in MVC. . . . . . . . . . . . . . . . . . . . . . . . . . . 237

    9.3 Sample coding results for Ballroom and Race1 sequences; each sequence includes eight views at video graphics array (VGA) resolution. . . . 239

    9.4 Subjective picture quality evaluation results given as average MOS with 95% confidence intervals. . . . 241

    9.5 Comparison of full-resolution and frame-compatible formats: (a) full-resolution stereo pair; (b) side-by-side format; (c) top-and-bottom format. . . . 243

    9.6 Illustration of video codec for scalable resolution enhancement of frame-compatible video. . . . 244

    9.7 Example of 2D-plus-depth representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

    9.8 Effect of down/up sampling filters on depth maps and corresponding synthesis result (a, b) using conventional linear filters; (c, d) using nonlinear filtering as proposed in [58]. . . . 249

    9.9 Sample plot of quality for a synthesized view versus bit rate where optimal combinations of QP for texture and depth are determined for a target set of bit rates. . . . 250

    10.1 The block diagram of the AVS video encoder. . . . 258

    10.2 Neighboring samples used for intraluma prediction. (a) 8×8 based. (b) 4×4 based. . . . 259

    10.3 Five intraluma prediction modes in all profiles in AVS1-P2. . . . . . . . . . . . . . . . 260

    10.4 Macroblock partitions in AVS1-P2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261


    10.5 VBMC performance testing on QCIF and 720p test sequences. (a) QCIF and (b) 1280×720 progressive. . . . 261

    10.6 Multiple reference picture performance testing. . . . . . . . . . . . . . . . . . . . . . . . . 262

    10.7 Video codec architecture for video sequences with static background (AVS1-P2 Shenzhan Profile). . . . 262

    10.8 Interpolation filter performance comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . 264

    10.9 Filtering for fractional sample accuracy MC. Uppercase letters indicate samples on the full-sample grid, lowercase letters represent samples at half- and quarter-sample positions, and the remaining samples with integer-number subscripts are at eighth-pixel locations. . . . 265

    10.10 Temporal direct mode in AVS1-P2. (a) Motion vector derivation for direct mode in frame coding. The colocated block's reference index is 0 (solid line) or 1 (dashed line). (b) Motion vector derivation for direct mode in top field coding. The colocated block's reference index is 0. (c) Motion vector derivation for direct mode in top field coding. The colocated block's reference index is 1 (solid line), 2 (dashed line pointing to bottom field), or 3 (dashed line pointing to top field). (d) Motion vector derivation for direct mode in top field coding. The colocated block's reference index is 1. (e) Motion vector derivation for direct mode in top field coding. The colocated block's reference index is 0 (solid line), 2 (dashed line pointing to bottom field), or 3 (dashed line pointing to top field). . . . 268

    10.11 Motion vector derivation for symmetric mode in AVS1-P2. (a) Frame coding. (b) Field coding, forward reference index is 1, backward reference index is 0. (c) Field coding, forward reference index is 0, backward reference index is 1. . . . 270

    10.12 Quantization matrix patterns in AVS1-P2 Jiaqiang Profile. . . . . . . . . . . . . . . . . 272

    10.13 Predefined quantization weighting parameters in AVS1-P2 Jiaqiang Profile: (a) default parameters, (b) parameters for keeping detail information of texture, and (c) parameters for removing detail information of texture. . . . 272

    10.14 Coefficient scan in AVS1-P2. (a) Zigzag scan. (b) Alternate scan. . . . 273

    10.15 Coefficient coding process in AVS1-P2 2D VLC entropy coding scheme. (a) Flowchart of coding one intraluma block. (b) Flowchart of coding one interluma block. (c) Flowchart of coding one interchroma block. . . . 275

    10.16 An example table in AVS1-P2 VLC1_Intra: from (Run, Level) to CodeNum. . . . 276

    10.17 Coefficient coding process in AVS1-P2 context-adaptive arithmetic coding. . . . 277

    10.18 Deblocking filter process in AVS1-P2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278

    10.19 Slice-type conversion process. E: entropy coding, E^-1: entropy decoding, Q: quantization, Q^-1: inverse quantization, T: transform, T^-1: inverse transform, MC: motion compensation. (a) Convert P-slice to L-slice. (b) Convert L-slice to P-slice. . . . 280


    10.20 Slice structure in AVS1-P2. (a) Normal slice structure, where the slice can only contain contiguous lines of macroblocks. (b) Flexible slice set allowing more flexible grouping of macroblocks in slice and slice set. . . . 280

    10.21 Test sequences: (a) Vidyo 1 (1280×720@60 Hz); (b) Kimono 1 (1920×1080@24 Hz); (c) Crossroad (352×288@30 Hz); (d) Snowroad (352×288@30 Hz); (e) News and (f) Paris. . . . 284

    10.22 Rate–distortion curves of different profiles. (a) Performance of Jiaqiang Profile, (b) performance of Shenzhan Profile, and (c) performance of Yidong Profile. . . . 285

    11.1 Overview of the offline processing and indexing process for a typical multimedia search system. . . . 292

    11.2 Overview of the query process for a typical multimedia search system. . . . . . . 292

    12.1 An illustration of SVM. The support vectors are circled. . . . . . . . . . . . . . . . . . . 303

    12.2 The framework of automatic semantic video search. . . . . . . . . . . . . . . . . . . . . 307

    12.3 The query representation as structured concept threads. . . . . . . . . . . . . . . . . . 309

    12.4 UI and framework of VisionGo system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312

    13.1 A general CBIR framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320

    13.2 A typical flowchart of relevance feedback. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 328

    13.3 Three different two-dimensional (2D) distance metrics. The red dot q denotes the initial query point, and the green dot denotes the learned optimal query point, which is estimated to be the center of all the positive examples. Circles and crosses are positive and negative examples. (a) Euclidean distance; (b) normalized Euclidean distance; and (c) Mahalanobis distance. . . . 329
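    The three metrics in Figure 13.3 differ only in how feature dimensions are weighted. As a minimal illustration (a hypothetical NumPy sketch, with the query point q, the per-dimension variances, and the covariance matrix all assumed to be learned from the positive feedback examples):

        import numpy as np

        def euclidean(x, q):
            # All dimensions weighted equally.
            return float(np.linalg.norm(x - q))

        def normalized_euclidean(x, q, var):
            # Each dimension down-weighted by its variance.
            return float(np.sqrt(((x - q) ** 2 / var).sum()))

        def mahalanobis(x, q, cov):
            # Full covariance: also accounts for correlated dimensions.
            d = x - q
            return float(np.sqrt(d @ np.linalg.inv(cov) @ d))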

    13.4 The framework of search-based annotation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 333

    14.1 Large digital video archival management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347

    14.2 Near-duplicates detection framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348

    14.3 Partial near-duplicate videos. Given a video corpus, near-duplicate segments create hyperlinks to interrelate different portions of the videos. . . . 353

    14.4 A temporal network. The columns of the lattice are frames from the reference videos, ordered according to the k-NN of the query frame sequence. The label on each frame shows its time stamp in the video. The optimal path is highlighted. For ease of illustration, not all paths and keyframes are shown. . . . 354

    14.5 Automatically tagging the movie 3:10 to Yuma using YouTube clips. . . . 356

    14.6 Topic structure generation and video documentation framework. . . . . . . . . . . 358

    14.7 A graphical view of the topic structure of the news videos about the Arkansas School Shooting. . . . 359

    14.8 Google-context video summarization system. . . . . . . . . . . . . . . . . . . . . . . . . . 361


    14.9 Timeline-based visualization of videos about the topic US Presidential Election 2008. Important videos are mined and aligned with news articles, and then attached to a milestone timeline of the topic. When an event is selected, the corresponding scene, tags, and news snippet are presented to users. . . . 361

    15.1 Forgery image examples in comparison with their authentic versions. . . . . . . . 375

    15.2 Categorization of image forgery detection techniques. . . . . . . . . . . . . . . . . . . . 378

    15.3 Image acquisition model and common forensic regularities. . . . . . . . . . . . . . . . 379

    16.1 Scheme of a general biometric system and its modules: enrollment, recognition, and update. Typical interactions among the components are shown. . . . 399

    16.2 The lines represent two examples of cumulative matching characteristic curve plots for two different systems. The solid line represents the system that performs better. N is the number of subjects in the database. . . . 404

    16.3 Typical examples of biometric system graphs. The two distributions (a) represent the client/impostor scores; by varying the threshold, different values of FAR and FRR can be computed. An ROC curve (b) is used to summarize the operating points of a biometric system; for each different application, different performance is required of the system. . . . 405

    16.4 (a) Average face and (b),(c) eigenfaces 1 to 2, (d),(e) eigenfaces 998–999 as estimated on a subset of 1000 images of the FERET face database. . . . 407

    16.5 A colored (a) and a near-infrared (b) version of the same iris. . . . . . . . . . . . . . . 410

    16.6 A scheme that summarizes the steps performed in Daugman's approach. . . . 410

    16.7 Example of a fingerprint (a), and of the minutiae: (b) termination, (c) bifurcation, (d) crossover, (e) lake, and (f) point or island. . . . 411

    16.8 The two interfaces of Google Picasa (a) and Apple iPhoto (b). Both systems summarize all the persons present in the photo collection. The two programs give the opportunity to look for a particular face among all the others. . . . 415

    17.1 Generic watermarking process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421

    17.2 Fingerprint extraction/registration and identification procedure for legacy content protection. (a) Populating the database and (b) identifying the new file. . . . 423

    17.3 Structure of the proposed P2P fingerprinting method. . . . . . . . . . . . . . . . . . . . 423

    17.4 Overall spatio-temporal JND model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

    17.5 The process of eye track analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426

    17.6 Watermark bit corresponding to approximate energy subregions. . . . . . . . . . . 429

    17.7 Diagram of combined spatio-temporal JND model-guided watermark embedding. . . . 429

    17.8 Diagram of combined spatio-temporal JND model-guided watermark extraction. . . . 430


    17.9 (a) Original walk PAL video. (b) Watermarked PAL video by Model 1. (c) Watermarked PAL video by Model 2. (d) Watermarked PAL video by Model 3. (e) Watermarked PAL video by the combined spatio-temporal JND model. . . . 431

    17.10 (a) Robustness versus MPEG-2 compression by four models. (b) Robustness versus MPEG-4 compression by four models. . . . 432

    17.11 Robustness versus Gaussian noise. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433

    17.12 Robustness versus valumetric scaling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433

    17.13 BER results of each frame versus MPEG-2 compression. . . . 434

    17.14 BER results of each frame versus Gaussian noise. . . . . . . . . . . . . . . . . . . . . . . . 435

    17.15 BER results of each frame versus valumetric scaling. . . . . . . . . . . . . . . . . . . . . 435

    17.16 Example of decomposition with MMP algorithm. (a) The original music signal. (b) The MDCT coefficients of the signal. (c) The molecule atoms after 10 iterations. (d) The reconstructed signal based on the molecule atoms in (c). . . . 439

    17.17 Example of decomposition with MMP algorithm. . . . . . . . . . . . . . . . . . . . . . . 440

    17.18 Fingerprint matching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442

    17.19 MDCT coefficients after low-pass filter. (a) MDCT coefficients of the low-pass-filtered signal. (b) MDCT coefficient differences between the original signal and the low-pass-filtered signal. . . . 443

    17.20 MDCT coefficients after random noise. (a) MDCT coefficients of the noised signal. (b) MDCT coefficient differences between the original signal and the noised signal. . . . 444

    17.21 MDCT coefficients after MP3 compression. (a) MDCT coefficients of the MP3 signal with bit rate 16 kbps. (b) MDCT coefficient differences between the original signal and the MP3 signal. . . . 444

    17.22 Fingerprint embedding flowchart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448

    17.23 Two kinds of fingerprints in a video. UF denotes that a unique fingerprint is embedded and SF denotes that a sharable fingerprint is embedded. . . . 452

    17.24 The topology of base file and supplementary file distribution. . . . . . . . . . . . . . 452

    17.25 Comparison of images before and after fingerprinting. (a) Original Lena. (b) Original Baboon. (c) Original Peppers. (d) Fingerprinted Lena. (e) Fingerprinted Baboon. (f) Fingerprinted Peppers. . . . 453

    17.26 Images after Gaussian white noise, compression, and median filter. (a) Lena with noise power at 7000. (b) Baboon with noise power at 7000. (c) Peppers with noise power at 7000. (d) Lena at quality 5 of JPEG compression. (e) Baboon at quality 5 of JPEG compression. (f) Peppers at quality 5 of JPEG compression. (g) Lena with median filter [9×9]. (h) Baboon with median filter [9×9]. (i) Peppers with median filter [9×9]. . . . 454

    18.1 The building blocks of a CF algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 461

    18.2 Overall scheme for finding copies of an original digital media using CF. . . . . . 461


    18.3 An example of partitioning an image into overlapping blocks of size m×m. . . . 464

    18.4 Some of the common preprocessing algorithms for content-based video fingerprinting. . . . 465

    18.5 (a–c) Frames 61, 75, and 90 from a video. (d) A representative frame generated as a result of linearly combining these frames. . . . 466

    18.6 Example of how SIFT can be used for feature extraction from an image. (a) Original image, (b) SIFT features (original image), and (c) SIFT features (rotated image). . . . 468

    18.7 Normalized Hamming distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 472
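    In content-based fingerprinting, the normalized Hamming distance of Figure 18.7 is the usual matching score between binary fingerprints. A minimal sketch (hypothetical NumPy code, not the chapter's implementation):

        import numpy as np

        def normalized_hamming(f1, f2):
            # Fraction of differing bits between two equal-length binary
            # fingerprints: near 0 for copies, near 0.5 for unrelated content.
            f1 = np.asarray(f1, dtype=bool)
            f2 = np.asarray(f2, dtype=bool)
            return np.count_nonzero(f1 != f2) / f1.size

    A query is typically declared a copy when this distance falls below a decision threshold, which is how ROC curves such as those in Figure 18.10 are swept.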

    18.8 (a) An original image and (b–f) sample content-preserving attacks. . . . 473

    18.9 The overall structure of FJLT, FMT-FJLT, and HCF algorithms. . . . . . . . . . . . . . 477

    18.10 The ROC curves for NMF, FJLT, and HCF fingerprinting algorithms when tested on a wide range of attacks. . . . 478

    18.11 A nonsecure version of the proposed content-based video fingerprinting algorithm. . . . 479

    18.12 Comparison of the secure and nonsecure version in presence of (a) time shift from −0.5 s to +0.5 s and (b) noise with variance σ^2. . . . 479

    19.1 Illustration of the wired-cum-wireless networking scenario. . . . . . . . . . . . . . . 499

    19.2 Illustration of the proposed HTTP streaming proxy. . . . . . . . . . . . . . . . . . . . . . 500

    19.3 Example of B frame hierarchy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501

    19.4 User feedback-based video adaptation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504

    19.5 User attention-based video adaptation scheme. . . . . . . . . . . . . . . . . . . . . . . . . 505

    19.6 Integration of UEP and authentication. (a) Joint ECC-based scheme. (b) Joint media error and authentication protection. . . . 512

    19.7 Block diagram of the JMEAP system. . . . 513

    19.8 Structure of transmission packets. The dashed arrows represent hash appending. . . . 513

    20.1 A proxy-based P2P streaming network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 520

    20.2 Overview of FastMesh–SIM architecture. . . . 524

    20.3 Software design. (a) FastMesh architecture; (b) SIM architecture; (c) RP architecture. . . . 525

    20.4 HKUST–Princeton trials. (a) A lab snapshot; (b) a topology snapshot; (c) screen capture. . . . 527

    20.5 Peer delay distribution. (a) Asian peers; (b) US peers. . . . 528

    20.6 Delay reduction by IP multicast. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529

    21.1 The four ACs in an EDCA node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534


    21.2 The encoding structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535

    21.3 An example of the loss impact results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 536

    21.4 An example of the RPI value for each packet. . . . . . . . . . . . . . . . . . . . . . . . . . . 537

    21.5 Relationship between packet loss probability, retry limit, and transmissioncollision probability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538

    21.6 PSNR performance of scalable video traffic delivery over EDCA and EDCAwith various ULP schemes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 540

    21.7 Packet loss rate of scalable video traffic delivery over EDCA. . . . . . . . . . . . . . . 540

    21.8 Packet loss rate of scalable video traffic delivery over EDCA with fixed retry limit-based ULP. . . . . . . . . . . 541

    21.9 Packet loss rate of scalable video traffic delivery over EDCA with adaptive retry limit-based ULP. . . . . . . . . . . 541

    21.10 Block diagram of the proposed cross-layer QoS design. . . . . . . . . . . . . . . . . . . 543

    21.11 PSNR of received video for DCF. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545

    21.12 PSNR of received video for EDCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545

    21.13 PSNR for our cross-layer design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 546

    22.1 Illustration of a WVSN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557

    22.2 Comparison of power consumption at each sensor node. . . . . . . . . . . . . . . . . . 564

    22.3 Trade-off between the PSNR requirement and the achievable maximum network lifetime in lossless transmission. . . . . . . . . . . 565

    22.4 Comparison of the visual quality at frame 1 in Foreman CIF sequence with different distortion requirements Dh, h ∈ V: (a) Dh = 300.0, (b) Dh = 100.0, and (c) Dh = 10.0. . . . . . . . . . . 565

    23.1 Complexity spectrum for advanced visual computing algorithms. . . . . . . . . . . 574

    23.2 Spectrum of platforms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576

    23.3 Levels of abstraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577

    23.4 Features in various levels of abstraction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578

    23.5 Concept of AAC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578

    23.6 Advanced visual system design methodology. . . . . . . . . . . . . . . . . . . . . . . . . . 579

    23.7 Dataflow model of a 4-tap FIR filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580

    23.8 Pipeline view of dataflow in a 4-tap FIR filter . . . . . . . . . . . . . . . . . . . . . . . . . 581

    23.9 An example for an illustration of quantifying the algorithmic degree of parallelism. . . . . . . . . . . 587

    23.10 Lifetime analysis of input data for typical visual computing systems. . . . . . . . . 589

    23.11 Filter support of a 3-tap horizontal filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590


    23.12 Filter support of a 3-tap vertical filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590

    23.13 Filter support of a 3-tap temporal filter. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591

    23.14 Filter support of a 3 × 3 × 3 spatial-temporal filter. . . . . . . . . . . 591

    23.15 Search windows for motion estimation: (a) Search window of a single block. (b) Search window reuse of two consecutive blocks, where the gray region is the overlapped region. . . . . . . . . . . 592

    23.16 Search windows for motion estimation at coarser data granularity. (a) Search window of a single big block. (b) Search window reuse of two consecutive big blocks, where the gray region is the overlapped region. . . . . . . . . . . 593

    23.17 Average external data transfer rates versus local storage at various data granularities. . . . . . . . . . . 594

    23.18 Dataflow graph of Loeffler DCT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596

    23.19 Dataflow graphs of various DCT: (a) 8-point CORDIC-based Loeffler DCT, (b) 8-point integer DCT, and (c) 4-point integer DCT. . . . . . . . . . . 597

    23.20 Reconfigurable dataflow of the 8-point type-II DCT, 8-point integer DCT, and 4-point DCT. . . . . . . . . . . 598

    23.21 Dataflow graph of H.264/AVC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 599

    23.22 Dataflow graph schedule of H.264/AVC at a fine granularity. . . . . . . . . . . . . . 599

    23.23 Dataflow graph schedule of H.264/AVC at a coarse granularity. . . . . . . . . . . . 600

    23.24 Data granularities possessing various shapes and sizes. . . . . . . . . . . . . . . . . . . 601

    23.25 Linear motion trajectory in spatio-temporal domain. . . . . . . . . . . . . . . . . . . . . 602

    23.26 Spatio-temporal motion search strategy for backward motion estimation. . . . . 603

    23.27 Data rate comparison of the STME for various numbers of search locations. . . . . . . . . . . 604

    23.28 PSNR comparison of the STME for various numbers of search locations. . . . . . . . . . . 604

    23.29 PSNR comparison of ME algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605

    23.30 Block diagram of the STME architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 605

    24.1 Dataflow graph of an image processing application for Gaussian filtering. . . . . 616

    24.2 A typical FPGA architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621

    24.3 Simplified Xilinx Virtex-6 FPGA CLB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621

    24.4 Parallel processing for tile pixels geared toward FPGA implementation. . . . . . . 622

    24.5 A typical GPU architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625

    25.1 SoC components used in recent electronic devices. . . . . . . . . . . 633

    25.2 Typical structure of ASIC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634

    25.3 Typical structure of a processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634

    25.4 Progress of DSPs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635


    25.5 Typical structure of ASIP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 636

    25.6 Xtensa LX3 DPU architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638

    25.7 Design flow using LISA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639

    25.8 Example of DFG representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640

    25.9 ADL-based ASIP design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641

    25.10 Overall VSIP architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 642

    25.11 Packed pixel data located in block boundary. . . . . . . . . . . . . . . . . . . . . . . . . . . 643

    25.12 Horizontal packed addition instructions in VSIP. (a) dst = HADD(src). (b) dst = HADD(src:mask). (c) dst = HADD(src:mask1.mask2). . . . . . . . . . . 643

    25.13 Assembly program of core block for in-loop deblocking filter. . . . . . . . . . . . . . 644

    25.14 Assembly program of intraprediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 644

    25.15 Operation flow of (a) fTRAN and (b) TRAN instruction in VSIP. . . . . . . . . . . . . 645

    25.16 Operation flow of ME hardware accelerator in VSIP. (a) ME operation in the first cycle. (b) ME operation in the second cycle. . . . . . . . . . . 646

    25.17 Architecture of the ASIP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647

    25.18 Architecture of the ASIP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 649

    25.19 Architecture example of the ASIP [36] with 4 IPEU, 1 FPEU, and 1 IEU. . . . . . . 651

    25.20 Top-level system architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652

    25.21 SIMD unit of the proposed ASIP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653

    26.1 SMALLab mixed-reality learning environment. . . . . . . . . . . . . . . . . . . . . . . . . 661

    26.2 (a) The SMALLab system with cameras, speakers, and projector, and (b) SMALLab software architecture. . . . . . . . . . . 669

    26.3 The block diagram of the object tracking system used in the multimodal sensing module of SMALLab. . . . . . . . . . . 670

    26.4 Screen capture of projected Layer Cake Builder scene. . . . . . . . . . . . . . . . . . . . . 675

    26.5 Layer Cake Builder interaction architecture schematic. . . . . . . . . . . . . . . . . . . . . 676

    26.6 Students collaborating to compose a layer cake structure in SMALLab. . . . . . . . 678

    26.7 Layer cake structure created in SMALLab. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 678

    27.1 Tsukuba image pair: left view (a) and right view (b). . . . . . . . . . . . . . . . . . . . . 692

    27.2 Disparity map example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 693

    27.3 3DTV System by MERL. (a) Array of 16 cameras, (b) array of 16 projectors, (c) rear-projection 3D display with double-lenticular screen, and (d) front-projection 3D display with single-lenticular screen. . . . . . . . . . . 693

    27.4 The ATTEST 3D-video processing chain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 695

    27.5 Flow diagram of the algorithm by Ideses et al. . . . . . . . . . . . . . . . . . . . . . . . . . 697


    27.6 Block diagram of the algorithm by Huang et al. . . . . . . . . . . . . . . . . . . . . . . . . 698

    27.7 Block diagram of the algorithm by Chang et al. . . . . . . . . . . . . . . . . . . . . . . . . 699

    27.8 Block diagram of the algorithm by Kim et al. . . . . . . . . . . . . . . . . . . . . . . . . . . 700

    27.9 Multiview synthesis using SfM and DIBR by Knorr et al. Gray: original camera path, red: virtual stereo cameras, blue: original camera of a multiview setup. . . . . . . . . . . 701

    27.10 Block diagram of the algorithm by Li et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702

    27.11 Block diagram of the algorithm by Wu et al. . . . . . . . . . . . . . . . . . . . . . . . . . . 703

    27.12 Block diagram of the algorithm by Xu et al. . . . . . . . . . . . . . . . . . . . . . . . . . . 704

    27.13 Block diagram of the algorithm by Yan et al. . . . . . . . . . . . . . . . . . . . . . . . . . . 704

    27.14 Block diagram of the algorithm by Cheng et al. . . . . . . . . . . . . . . . . . . . . . . . . 706

    27.15 Block diagram of the algorithm by Li et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707

    27.16 Block diagram of the algorithm by Ng et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . 708

    27.17 Flow chart of the algorithm by Cheng and Liang. . . . . . . . . . . . . . . . . . . . . . . 710

    27.18 Flow chart of the algorithm by Yamada and Suzuki. . . . . . . . . . . . . . . . . . . . . 711

    28.1 A basic communication block diagram depicting various components of the SL interpersonal haptic communication system. . . . . . . . . . . 719

    28.2 The haptic jacket controller and its hardware components. An array of vibro-tactile motors is placed in the gaiter-like wearable cloth in order to wirelessly stimulate haptic interaction. . . . . . . . . . . 722

    28.3 The flexible avatar annotation scheme allows the user to annotate any part of the virtual avatar body with haptic and animation properties. When the other party interacts with an annotated part, the user receives the corresponding haptic rendering on his/her haptic jacket and views the animation rendering on the screen. . . . . . . . . . . 722

    28.4 User-dependent haptic interaction access design. The haptic and animation data are annotated based on the target user groups, such as family, friends, lovers, and formal contacts. . . . . . . . . . . 724

    28.5 SL and haptic communication system block diagram. . . . . . . . . . . . . . . . . . . . 726

    28.6 A code snippet depicting a portion of the Linden Script that allows customized control of the user interaction. . . . . . . . . . . 727

    28.7 An overview of the target user group specific interaction rules stored (and shareable) in an XML file. . . . . . . . . . . 728

    28.8 Processing time of different interfacing modules of the SL Controller. The figure depicts the modules that interface with our system. . . . . . . . . . . 729

    28.9 Processing time of the components of the implemented interaction controller with respect to different haptic and animation interactions. . . . . . . . . . . 729

    28.10 Haptic and animation rendering time over 18 samples. The interaction response time changes due to the network parameters of the SL controller system. . . . . . . . . . . 731


    28.11 Average of the interaction response times sampled at particular time intervals. The data were gathered during three weeks of experiment sessions and averaged. From our analysis, we observed that, depending on the server load, users might experience delays in their interactions. . . . . . . . . . . 731

    28.12 Interaction response time in varying density of traffic in the SL map location for the Nearby Interaction Handler. . . . . . . . . . . 733

    28.13 Usability study of the SL haptic interaction system. . . . . . . . . . . . . . . . . . . . . . 735

    28.14 Comparison between the responses of users from different (a) gender, (b) age groups, and (c) technical background. . . . . . . . . . . 736


Preface

We have witnessed significant advances in multimedia research and applications due to the rapid increase in digital media, computing power, communication speed, and storage capacity. Multimedia has become an indispensable aspect of contemporary daily life, and we can feel its presence in many applications, ranging from online multimedia search, Internet Protocol Television (IPTV), and mobile multimedia to social media. The proliferation of diverse multimedia applications has been the motivating force for the research and development of numerous paradigm-shifting technologies in multimedia processing.

This book documents the most recent advances in multimedia research and applications. It is a comprehensive book, covering a wide range of topics including multimedia information mining, multimodal information fusion and interaction, multimedia security, multimedia systems, hardware for multimedia, multimedia coding, multimedia search, and multimedia communications. Each chapter is contributed by prominent experts in the field and therefore offers an insightful treatment of its topic.

This book includes an Introduction and 28 chapters. The Introduction provides a comprehensive overview of recent advances in multimedia research and applications. The 28 chapters are organized into 7 parts. Part I focuses on Fundamentals of Multimedia, and Parts II through VII focus on Methodology, Techniques, and Applications.

Part I includes Chapters 1 through 6. Chapter 1 provides an overview of multimedia standards, including video coding, still image coding, audio coding, multimedia interfaces, and multimedia frameworks. Chapter 2 presents the fundamental methods for histogram processing, image enhancement, and feature extraction and classification. Chapter 3 gives an overview of the design of an efficient application-specific multimedia architecture. Chapter 4 presents the architecture of a typical multimedia information mining system. Chapter 5 reviews recent methods in multimodal information fusion and outlines the strengths and weaknesses of different fusion levels. Chapter 6 presents bidirectional (human-to-computer and computer-to-human) affective interaction techniques.

Part II focuses on coding of video and multimedia content. It includes Chapters 7 through 10. Chapter 7 is a part overview, which reviews various multimedia coding standards including JPEG, MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264. Chapter 8 surveys recent work on applying distributed source coding principles to video compression. Chapter 9 reviews a number of important 3D representation formats and the associated compression techniques. Chapter 10 gives a detailed description of the Audio Video Coding Standard (AVS) developed by the China Audio Video Coding Standard Working Group.

Part III focuses on multimedia search, retrieval, and management. It includes Chapters 11 through 14. Chapter 11 is a part overview, which surveys research trends in the area of multimedia search and management. Chapter 12 reviews recent work on video modeling and retrieval, including semantic concept detection, semantic video retrieval, and interactive video retrieval. Chapter 13 presents a variety of existing techniques for image retrieval, including visual feature extraction, relevance feedback, automatic image annotation, and large-scale visual indexing. Chapter 14 describes three basic components that enable the management of large digital media archives: content structuring and organization, data cleaning, and summarization.

Part IV focuses on multimedia security. It includes Chapters 15 through 18. Chapter 15 is a part overview, which reviews the techniques for information hiding for digital media, multimedia forensics, and multimedia biometrics. Chapter 16 provides a broad view of biometric systems and the techniques for measuring system performance. Chapter 17 presents the techniques in watermarking and fingerprinting for multimedia protection. Chapter 18 reviews content-based fingerprinting approaches that are applied to images and videos.

Part V focuses on multimedia communications and networking. It includes Chapters 19 through 22. Chapter 19 is a part overview, which discusses several emerging technical challenges as well as research opportunities in next-generation networked mobile video communication systems. Chapter 20 presents a two-tier proxy-based peer-to-peer (P2P) live streaming network, which consists of a low-delay, high-bandwidth proxy backbone and a peer-level network. Chapter 21 presents recent studies on exploring the scalability of scalable video coding (SVC) and the quality of service (QoS) provided by IEEE 802.11e to improve performance for video streaming over wireless local area networks (WLANs). Chapter 22 reviews recent advances in optimal resource allocation for video communications over P2P streaming systems, wireless ad hoc networks, and wireless visual sensor networks.

Part VI focuses on architecture design and implementation for multimedia image and video processing. It includes Chapters 23 through 25. Chapter 23 presents the methodology for concurrent optimization of both algorithms and architectures. Chapter 24 introduces dataflow-based methods for efficient parallel implementations of image processing applications. Chapter 25 presents the design issues and methodologies of application-specific instruction set processors (ASIPs) for video processing.

Part VII focuses on multimedia systems and applications. It includes Chapters 26 through 28. Chapter 26 presents the design and implementation of a mixed-reality environment for learning. Chapter 27 reviews recent methods for converting conventional monocular video sequences to stereoscopic or multiview counterparts for display using 3D visualization technology. Chapter 28 presents a Second Life (SL) HugMe prototype system that bridges the gap between virtual and real-world events by incorporating an interpersonal haptic communication system.

The target audience of the book includes researchers, educators, students, and engineers. The book can serve as a reference in undergraduate or graduate courses on multimedia processing or multimedia systems. It can also be used as a reference in research on multimedia processing and the design of multimedia systems.

Ling Guan
Yifeng He

    Sun-Yuan Kung


Acknowledgments

First, we would like to thank all the chapter contributors, without whom this book would not exist. We would also like to thank the chapter reviewers for their constructive comments. We are grateful to Nora Konopka, Jessica Vakili, and Jennifer Stair of Taylor & Francis, LLC, and S.M. Syed of Techset Composition, for their assistance in the publication of the book. Finally, we would like to give special thanks to our families for their patience and support while we worked on the book.


Introduction: Recent Advances in Multimedia Research and Applications

    Guo-Jun Qi, Liangliang Cao, Shen-Fu Tsai, Min-Hsuan Tsai, and Thomas S. Huang

    CONTENTS

0.1 Overview . . . . . . . . . . . xxxi
0.2 Advances in Content-Based Multimedia Annotation . . . . . . . . . . . xxxii
    0.2.1 Typical Multimedia Annotation Algorithms . . . . . . . . . . . xxxii
    0.2.2 Multimodality Annotation Algorithms . . . . . . . . . . . xxxii
    0.2.3 Concept-Correlative Annotation Algorithms . . . . . . . . . . . xxxiii
0.3 Advances in Constructing Multimedia Ontology . . . . . . . . . . . xxxiv
    0.3.1 Construction of Multimedia Ontologies . . . . . . . . . . . xxxiv
    0.3.2 Ontological Inference . . . . . . . . . . . xxxv
0.4 Advances in Sparse Representation and Modeling for Multimedia . . . . . . . . . . . xxxv
    0.4.1 Computation . . . . . . . . . . . xxxvi
    0.4.2 Application . . . . . . . . . . . xxxvi
        0.4.2.1 Face Recognition . . . . . . . . . . . xxxvi
        0.4.2.2 Video Foreground Detection . . . . . . . . . . . xxxvi
    0.4.3 Robust Principal Component Analysis . . . . . . . . . . . xxxvii
0.5 Advances in Social Media . . . . . . . . . . . xxxvii
    0.5.1 Retrieval and Search for Social Media . . . . . . . . . . . xxxvii
    0.5.2 Multimedia Recommendation . . . . . . . . . . . xxxviii
0.6 Advances in Distributed Multimedia Mining . . . . . . . . . . . xxxix
0.7 Advances in Large-Scale Multimedia Annotation and Retrieval . . . . . . . . . . . xli
0.8 Advances in Geo-Tagged Social Media . . . . . . . . . . . xlii
0.9 Advances in Multimedia Applications . . . . . . . . . . . xliii
References . . . . . . . . . . . xliv

    0.1 Overview

In the past 10 years, we have witnessed significant advances in multimedia research and applications. Many new technologies have been invented for various fundamental multimedia research problems; they are helping computing machines better perceive, organize, and retrieve multimedia content. With the rapid development of multimedia hardware and software, we can now easily create, access, and share multimedia content on a scale that could not have been imagined only 10 years ago. All of this raises many urgent technical problems in effectively utilizing the exploding volume of multimedia information, especially for efficient multimedia organization and retrieval at different levels, from personal photo albums to web-scale search and retrieval systems. We look into some of these cutting-edge techniques arising in the past few years and briefly summarize how they are applied to emerging multimedia research and application problems.

    0.2 Advances in Content-Based Multimedia Annotation

Content-based multimedia annotation has attracted great effort, and significant progress has been made toward high effectiveness and efficiency in the past decade. A large number of machine learning and pattern recognition algorithms have been introduced and adopted to improve annotation accuracy, among which are support vector machines (SVMs), ensemble methods (e.g., AdaBoost), and semisupervised classifiers. To apply these classic classification algorithms to the annotation task, multimodality algorithms have been invented to fuse different kinds of feature cues, ranging from color, texture, and shape features to the popular scale-invariant feature transform (SIFT) descriptors. They make use of the complementary information across different feature descriptors to enhance annotation accuracy. On the other hand, recent results show that the annotation tasks of different multimedia concepts do not exist independently; rather, they are strongly correlated with each other. This observation has yielded many new annotation algorithms that explore intrinsic concept correlations. In this section, we first review some basic annotation algorithms that have been successfully applied to annotation tasks, followed by some classic algorithms that fuse different feature cues and explore concept correlations.

    0.2.1 Typical Multimedia Annotation Algorithms

Kernel methods and ensemble classifiers have enjoyed great success since they were proposed in the late 1990s. As typical discriminative models, they have become prevalent in real annotation systems [62]. Generally speaking, when there are enough training samples, discriminative models yield more accurate classification results than generative models [66]. At the same time, generative models, including naive Bayes, Bayesian networks, Gaussian mixture models (GMMs), hidden Markov models (HMMs), and many graphical models, are also widely applied to multimedia annotation, and results have shown that they can complement discriminative models to improve annotation accuracy. For example, Refs. [71] and [105] use two-dimensional dependency-tree hidden Markov models and GMMs, respectively, to represent each image, adapting from universal background models in the first step. Discriminative SVMs are then built upon kernel machines that compare the similarity of these image representations.
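For illustration, the sketch below shows this generative-representation-plus-discriminative-classifier pattern in Python. It is a minimal toy under our own simplifying assumptions, not the exact pipelines of [71] or [105]: the data, the relevance factor r, and the helper adapted_means are all hypothetical. A universal background model (UBM) is fit on pooled local descriptors, its means are adapted to each image in a MAP style to form a fixed-length representation, and a linear SVM is trained on those representations.

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Ten toy "images", each a bag of 50 local descriptors of dimension 8.
images = [rng.normal(size=(50, 8)) for _ in range(10)]
y = np.array([0, 1] * 5)

# Generative step: a universal background model fit on pooled descriptors.
ubm = GaussianMixture(n_components=4, random_state=0)
ubm.fit(np.vstack(images))

def adapted_means(descriptors, ubm, r=10.0):
    # MAP-style adaptation of the UBM means to one image; the stacked
    # adapted means serve as a fixed-length image representation.
    post = ubm.predict_proba(descriptors)            # (n_desc, n_comp)
    n_k = post.sum(axis=0)                           # soft counts
    ex = (post.T @ descriptors) / np.maximum(n_k, 1e-9)[:, None]
    alpha = (n_k / (n_k + r))[:, None]               # relevance factor
    return (alpha * ex + (1 - alpha) * ubm.means_).ravel()

# Discriminative step: a linear SVM on the adapted representations.
reps = np.vstack([adapted_means(im, ubm) for im in images])
clf = SVC(kernel="linear").fit(reps, y)
print(clf.predict(reps[:2]))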

When only a small number of training samples is available, semisupervised algorithms are more effective for annotation. They exploit the distribution of the testing samples so that more robust annotation results can be achieved, avoiding overfitting to a small training set. Zhou et al. [104] propose to combine partial labels to estimate the score of each sample. A similar idea is developed by Zhu et al. [107], where the scores on labeled samples are fixed to their values in the training set, and the resulting method corresponds to the harmonic solution of the unnormalized graph Laplacian.
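The harmonic solution admits a very compact sketch (an illustrative toy, not the exact algorithm of [107]; the affinity construction and all names are our own): unlabeled scores are repeatedly replaced by the weighted average of their neighbors' scores while labeled scores stay clamped, and the iteration converges to the harmonic solution of the unnormalized graph Laplacian.

import numpy as np

def harmonic_label_propagation(W, y, labeled_mask, n_iter=200):
    # W: (n, n) symmetric nonnegative affinity matrix
    # y: (n,) initial scores, meaningful only where labeled
    # labeled_mask: (n,) boolean, True for labeled samples (kept fixed)
    d_inv = 1.0 / np.maximum(W.sum(axis=1), 1e-12)
    f = y.astype(float).copy()
    for _ in range(n_iter):
        # Each unlabeled score becomes the weighted mean of its
        # neighbors (the harmonic property of the graph Laplacian).
        f_new = d_inv * (W @ f)
        f_new[labeled_mask] = y[labeled_mask]  # clamp labeled scores
        f = f_new
    return f

# Toy usage: five samples, the first two labeled (+1 / -1).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=2) ** 2)
np.fill_diagonal(W, 0.0)
y = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
mask = np.array([True, True, False, False, False])
print(harmonic_label_propagation(W, y, mask))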

    0.2.2 Multimodality Annotation Algorithms

Efficiently fusing a set of multimodal features is one of the key problems in multimedia annotation. The weights of different features often vary for each annotation task, or even change from one multimedia object to another. This propels us to develop sophisticated modality fusion algorithms. A common approach to fusing multiple features is to use different kernels for different features and then combine them by a weighted summation, which is the so-called multiple kernel learning (MKL) [5,46,75]. In MKL, the weight of each feature does not depend on the individual objects and remains the same across all samples. Consequently, such a linear weighting approach does not describe possible nonlinear relationships among different types of features.
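As a minimal sketch of this weighted-sum fusion (the weights here are fixed for illustration, whereas actual MKL learns them jointly with the classifier; the two modalities and their kernels are synthetic):

import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernels, betas):
    # Weighted sum of base kernels: K = sum_m beta_m * K_m.
    K = np.zeros_like(kernels[0])
    for K_m, b in zip(kernels, betas):
        K += b * K_m
    return K

# Two toy feature modalities (say, color and texture descriptors).
rng = np.random.default_rng(1)
X_color = rng.normal(size=(20, 8))
X_texture = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)

K_color = X_color @ X_color.T        # linear kernel on modality 1
K_texture = X_texture @ X_texture.T  # linear kernel on modality 2
K = combine_kernels([K_color, K_texture], betas=[0.7, 0.3])

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.predict(K[:3]))  # predictions for the first three samples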

Recently, Gönen and Alpaydin [29] proposed a localized weighting approach to MKL by introducing a weighting function over the samples that is assumed to be either a linear or quadratic function of the input sample. Cao et al. [11] proposed an alternative Heterogeneous Feature Machine (HFM) that builds a kernel logistic regression (LR) model based on similarities that combine different features and distance metrics.
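The localized idea can be sketched as follows (a toy with a softmax gating of our own choosing, not the exact formulation of [29]; the gating parameters V are fixed here but would normally be learned): per-sample weights modulate each base kernel on both sides, so different samples can emphasize different modalities.

import numpy as np

def localized_kernel(kernels, X, V):
    # kernels: list of (n, n) base kernel matrices
    # X: (n, d) inputs to the gating function; V: (d, m) gating params
    # K(i, j) = sum_m beta_m(x_i) * K_m[i, j] * beta_m(x_j)
    logits = X @ V
    beta = np.exp(logits)
    beta /= beta.sum(axis=1, keepdims=True)  # softmax weights per sample
    K = np.zeros_like(kernels[0])
    for m, K_m in enumerate(kernels):
        K += np.outer(beta[:, m], beta[:, m]) * K_m
    return K

rng = np.random.default_rng(4)
X = rng.normal(size=(12, 6))
K1 = X @ X.T                                       # linear kernel
K2 = np.exp(-0.5 * np.linalg.norm(X[:, None] - X[None, :], axis=2) ** 2)
V = rng.normal(size=(6, 2))                        # one column per kernel
print(localized_kernel([K1, K2], X, V).shape)      # (12, 12)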

    0.2.3 Concept-Correlative Annotation Algorithms

The goal of multimedia annotation is to assign a set of labels to multimedia documents based on their semantic content. In many cases, multiple concepts can be assigned to one multimedia document simultaneously. For example, on many online video/image sharing web sites (e.g., Flickr, Picasa, and YouTube), most multimedia documents carry more than one tag manually assigned by users. This results in a multilabel multimedia annotation problem that is more complex and challenging than the multiclass annotation problem, because the annotations of multiple concepts are not independent but strongly correlated with each other. Evidence has shown that exploring label correlations plays a key role in improving annotation results. Naphade et al. [63] propose a probabilistic Bayesian Multinet approach that explicitly models the relationship between multiple concepts through a factor graph built upon the underlying multimedia ontology semantics. Wu et al. [97] use an ontology-based multilabel learning algorithm for multimedia concept detection: each concept is first independently modeled by a classifier, and a predefined ontology hierarchy is then leveraged to improve the detection accuracy of each individual classifier. Smith and Naphade [85] present a two-step Discriminative Model Fusion approach that mines unknown or indirect relationships among concepts by constructing model vectors from the detection scores of individual classifiers; an SVM is then trained to refine the detection results of the individual classifiers. Alternative fusion strategies can also be used; for example, Hauptmann et al. [34] proposed to use LR to fuse the individual detections. Users were involved in their approach to annotate a few concepts for extra video clips, and these manual annotations are then utilized to help infer and improve the detections of other concepts.
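A schematic of this two-step fusion follows (our simplified sketch of the general scheme rather than the exact method of [85]; we use LR for both stages, as in the fusion of [34], and all data are synthetic). Stage one trains independent per-concept detectors; their scores form a model vector, on which a second stage is trained to exploit inter-concept correlations.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, d, n_concepts = 100, 16, 5
X = rng.normal(size=(n, d))
Y = rng.integers(0, 2, size=(n, n_concepts))  # multilabel ground truth

# Step 1: independent per-concept detectors on the raw features.
stage1 = [LogisticRegression(max_iter=500).fit(X, Y[:, c])
          for c in range(n_concepts)]
# Model vector: concatenated detection scores of all concepts.
model_vec = np.column_stack([m.predict_proba(X)[:, 1] for m in stage1])

# Step 2: refine each concept from the scores of *all* concepts,
# letting the fusion stage exploit inter-concept correlations.
stage2 = [LogisticRegression(max_iter=500).fit(model_vec, Y[:, c])
          for c in range(n_concepts)]
refined = np.column_stack([m.predict_proba(model_vec)[:, 1]
                           for m in stage2])
print(refined.shape)  # (100, 5) refined concept scores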

Although it is intuitively correct that contextual relationships can help improve the accuracy of individual detectors, experimental results have shown that such improvement is not always stable, and the overall performance can even be worse than that of the individual detectors alone. This is because these algorithms are built on top of independent binary detectors, with a second step to fuse them; the output of the individual independent detectors can be unreliable, and their detection errors can therefore propagate to the fusion step. To