

IMAGE PROCESSING SERIES
Series Editor: Phillip A. Laplante, Pennsylvania State University

Published Titles

Adaptive Image Processing: A Computational Intelligence Perspective
Stuart William Perry, Hau-San Wong, and Ling Guan

Color Image Processing: Methods and Applications
Rastislav Lukac and Konstantinos N. Plataniotis

Image Acquisition and Processing with LabVIEW™
Christopher G. Relf

Image and Video Compression for Multimedia Engineering, Second Edition
Yun Q. Shi and Huifang Sun

Multimedia Image and Video Processing
Ling Guan, S.Y. Kung, and Jan Larsen

Shape Analysis and Classification: Theory and Practice
Luciano da Fontoura Costa and Roberto Marcondes Cesar Jr.

Software Engineering for Image Processing Systems
Phillip A. Laplante


Yun Q. Shi
New Jersey Institute of Technology
Newark, New Jersey, USA

Huifang Sun
Mitsubishi Electric Research Laboratories
Cambridge, Massachusetts, USA

CRC Press is an imprint of the Taylor & Francis Group, an informa business

Boca Raton London New York


CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2008 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-13: 978-0-8493-7364-0 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Shi, Yun Q.
Image and video compression for multimedia engineering : fundamentals, algorithms, and standards / Yun Q. Shi and Huifang Sun. -- 2nd ed.
p. cm. -- (Image processing series)
Includes bibliographical references and index.
ISBN 978-0-8493-7364-0 (alk. paper)
1. Multimedia systems. 2. Image compression. 3. Video compression. I. Sun, Huifang. II. Title.
QA76.575.S555 2008
006.7--dc22    2007048389

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com

and the CRC Press Web site at http://www.crcpress.com


To beloved Kong Wai Shih, Wen Su,

Yi Xi Li, Shu Jun Zheng, and

Xian Hong Li

and

To beloved Xuedong,

Min, Yin, Andrew, Rich, Haixin, and

Allison


Contents

Preface to the Second Edition
Preface to the First Edition
Content and Organization of the Book
Authors

Part I Fundamentals

Chapter 1 Introduction
1.1 Practical Needs for Image and Video Compression
1.2 Feasibility of Image and Video Compression
    1.2.1 Statistical Redundancy
        1.2.1.1 Spatial Redundancy
        1.2.1.2 Temporal Redundancy
        1.2.1.3 Coding Redundancy
    1.2.2 Psychovisual Redundancy
        1.2.2.1 Luminance Masking
        1.2.2.2 Texture Masking
        1.2.2.3 Frequency Masking
        1.2.2.4 Temporal Masking
        1.2.2.5 Color Masking
        1.2.2.6 Color Masking and Its Application in Video Compression
        1.2.2.7 Summary: Differential Sensitivity
1.3 Visual Quality Measurement
    1.3.1 Subjective Quality Measurement
    1.3.2 Objective Quality Measurement
        1.3.2.1 Signal to Noise Ratio
        1.3.2.2 An Objective Quality Measure Based on Human Visual Perception
1.4 Information Theory Results
    1.4.1 Entropy
        1.4.1.1 Information Measure
        1.4.1.2 Average Information per Symbol
    1.4.2 Shannon's Noiseless Source Coding Theorem
    1.4.3 Shannon's Noisy Channel Coding Theorem
    1.4.4 Shannon's Source Coding Theorem
    1.4.5 Information Transmission Theorem
1.5 Summary
Exercises
References

Chapter 2 Quantization
2.1 Quantization and the Source Encoder
2.2 Uniform Quantization
    2.2.1 Basics
        2.2.1.1 Definitions
        2.2.1.2 Quantization Distortion
        2.2.1.3 Quantizer Design
    2.2.2 Optimum Uniform Quantizer
        2.2.2.1 Uniform Quantizer with Uniformly Distributed Input
        2.2.2.2 Conditions of Optimum Quantization
        2.2.2.3 Optimum Uniform Quantizer with Different Input Distributions
2.3 Nonuniform Quantization
    2.3.1 Optimum (Nonuniform) Quantization
    2.3.2 Companding Quantization
2.4 Adaptive Quantization
    2.4.1 Forward Adaptive Quantization
    2.4.2 Backward Adaptive Quantization
    2.4.3 Adaptive Quantization with a One-Word Memory
    2.4.4 Switched Quantization
2.5 Pulse Code Modulation
2.6 Summary
Exercises
References

Chapter 3 Differential Coding
3.1 Introduction to DPCM
    3.1.1 Simple Pixel-to-Pixel DPCM
    3.1.2 General DPCM Systems
3.2 Optimum Linear Prediction
    3.2.1 Formulation
    3.2.2 Orthogonality Condition and Minimum Mean Square Error
    3.2.3 Solution to Yule–Walker Equations
3.3 Some Issues in the Implementation of DPCM
    3.3.1 Optimum DPCM System
    3.3.2 1-D, 2-D, and 3-D DPCM
    3.3.3 Order of Predictor
    3.3.4 Adaptive Prediction
    3.3.5 Effect of Transmission Errors
3.4 Delta Modulation
3.5 Interframe Differential Coding
    3.5.1 Conditional Replenishment
    3.5.2 3-D DPCM
    3.5.3 Motion Compensated Predictive Coding
3.6 Information-Preserving Differential Coding
3.7 Summary
Exercises
References

Chapter 4 Transform Coding
4.1 Introduction
    4.1.1 Hotelling Transform
    4.1.2 Statistical Interpretation
    4.1.3 Geometrical Interpretation
    4.1.4 Basis Vector Interpretation
    4.1.5 Procedures of Transform Coding
4.2 Linear Transforms
    4.2.1 2-D Image Transformation Kernel
        4.2.1.1 Separability
        4.2.1.2 Symmetry
        4.2.1.3 Matrix Form
        4.2.1.4 Orthogonality
    4.2.2 Basis Image Interpretation
    4.2.3 Subimage Size Selection
4.3 Transforms of Particular Interest
    4.3.1 Discrete Fourier Transform
    4.3.2 Discrete Walsh Transform
    4.3.3 Discrete Hadamard Transform
    4.3.4 Discrete Cosine Transform
        4.3.4.1 Background
        4.3.4.2 Transformation Kernel
        4.3.4.3 Relationship with DFT
    4.3.5 Performance Comparison
        4.3.5.1 Energy Compaction
        4.3.5.2 Mean Square Reconstruction Error
        4.3.5.3 Computational Complexity
        4.3.5.4 Summary
4.4 Bit Allocation
    4.4.1 Zonal Coding
    4.4.2 Threshold Coding
        4.4.2.1 Thresholding and Shifting
        4.4.2.2 Normalization and Roundoff
        4.4.2.3 Zigzag Scan
        4.4.2.4 Huffman Coding
        4.4.2.5 Special Code Words
        4.4.2.6 Rate Buffer Feedback and Equalization
4.5 Some Issues
    4.5.1 Effect of Transmission Error
    4.5.2 Reconstruction Error Sources
    4.5.3 Comparison between DPCM and TC
    4.5.4 Hybrid Coding
4.6 Summary
Exercises
References

Chapter 5 Variable-Length Coding: Information Theory Results (II)
5.1 Some Fundamental Results
    5.1.1 Coding an Information Source
    5.1.2 Some Desired Characteristics
        5.1.2.1 Block Code
        5.1.2.2 Uniquely Decodable Code
        5.1.2.3 Instantaneous Codes
        5.1.2.4 Compact Code
    5.1.3 Discrete Memoryless Sources
    5.1.4 Extensions of a Discrete Memoryless Source
        5.1.4.1 Definition
        5.1.4.2 Entropy
        5.1.4.3 Noiseless Source Coding Theorem
5.2 Huffman Codes
    5.2.1 Required Rules for Optimum Instantaneous Codes
    5.2.2 Huffman Coding Algorithm
        5.2.2.1 Procedures
        5.2.2.2 Comments
        5.2.2.3 Applications
5.3 Modified Huffman Codes
    5.3.1 Motivation
    5.3.2 Algorithm
    5.3.3 Codebook Memory Requirement
    5.3.4 Bounds on Average Code Word Length
5.4 Arithmetic Codes
    5.4.1 Limitations of Huffman Coding
    5.4.2 The Principle of Arithmetic Coding
        5.4.2.1 Dividing Interval [0, 1) into Subintervals
        5.4.2.2 Encoding
        5.4.2.3 Decoding
        5.4.2.4 Observations
    5.4.3 Implementation Issues
        5.4.3.1 Incremental Implementation
        5.4.3.2 Finite Precision
        5.4.3.3 Other Issues
    5.4.4 History
    5.4.5 Applications
5.5 Summary
Exercises
References

Chapter 6 Run-Length and Dictionary Coding: Information Theory Results (III)
6.1 Markov Source Model
    6.1.1 Discrete Markov Source
    6.1.2 Extensions of a Discrete Markov Source
        6.1.2.1 Definition
        6.1.2.2 Entropy
    6.1.3 Autoregressive Model
6.2 Run-Length Coding
    6.2.1 1-D Run-Length Coding
    6.2.2 2-D Run-Length Coding
        6.2.2.1 Five Changing Pixels
        6.2.2.2 Three Coding Modes
    6.2.3 Effect of Transmission Error and Uncompressed Mode
        6.2.3.1 Error Effect in the 1-D RLC Case
        6.2.3.2 Error Effect in the 2-D RLC Case
        6.2.3.3 Uncompressed Mode
6.3 Digital Facsimile Coding Standards
6.4 Dictionary Coding
    6.4.1 Formulation of Dictionary Coding
    6.4.2 Categorization of Dictionary-Based Coding Techniques
        6.4.2.1 Static Dictionary Coding
        6.4.2.2 Adaptive Dictionary Coding
    6.4.3 Parsing Strategy
    6.4.4 Sliding Window (LZ77) Algorithms
        6.4.4.1 Introduction
        6.4.4.2 Encoding and Decoding
        6.4.4.3 Summary of the LZ77 Approach
    6.4.5 LZ78 Algorithms
        6.4.5.1 Introduction
        6.4.5.2 Encoding and Decoding
        6.4.5.3 LZW Algorithm
        6.4.5.4 Summary
        6.4.5.5 Applications
6.5 International Standards for Lossless Still Image Compression
    6.5.1 Lossless Bilevel Still Image Compression
        6.5.1.1 Algorithms
        6.5.1.2 Performance Comparison
    6.5.2 Lossless Multilevel Still Image Compression
        6.5.2.1 Algorithms
        6.5.2.2 Performance Comparison
6.6 Summary
Exercises
References

Part II Still Image Compression

Chapter 7 Still Image Coding: Standard JPEG
7.1 Introduction
7.2 Sequential DCT-Based Encoding Algorithm
7.3 Progressive DCT-Based Encoding Algorithm
7.4 Lossless Coding Mode
7.5 Hierarchical Coding Mode
7.6 Summary
Exercises
References

Chapter 8 Wavelet Transform for Image Coding: JPEG2000
8.1 A Review of Wavelet Transform
    8.1.1 Definition and Comparison with Short-Time Fourier Transform
    8.1.2 Discrete Wavelet Transform
    8.1.3 Lifting Scheme
        8.1.3.1 Three Steps in Forward Wavelet Transform
        8.1.3.2 Inverse Transform
        8.1.3.3 Lifting Version of CDF (2,2)
        8.1.3.4 A Demonstration Example
        8.1.3.5 (5,3) Integer Wavelet Transform
        8.1.3.6 A Demonstration Example of (5,3) IWT
        8.1.3.7 Summary
8.2 Digital Wavelet Transform for Image Compression
    8.2.1 Basic Concept of Image Wavelet Transform Coding
    8.2.2 Embedded Image Wavelet Transform Coding Algorithms
        8.2.2.1 Early Wavelet Image Coding Algorithms and Their Drawbacks
        8.2.2.2 Modern Wavelet Image Coding
        8.2.2.3 Embedded Zerotree Wavelet Coding
        8.2.2.4 Set Partitioning in Hierarchical Trees Coding
8.3 Wavelet Transform for JPEG2000
    8.3.1 Introduction of JPEG2000
        8.3.1.1 Requirements of JPEG2000
        8.3.1.2 Parts of JPEG2000
    8.3.2 Verification Model of JPEG2000
    8.3.3 An Example of Performance Comparison between JPEG and JPEG2000
8.4 Summary
Exercises
References

Chapter 9 Nonstandard Still Image Coding
9.1 Introduction
9.2 Vector Quantization
    9.2.1 Basic Principle of Vector Quantization
        9.2.1.1 Vector Formation
        9.2.1.2 Training Set Generation
        9.2.1.3 Codebook Generation
        9.2.1.4 Quantization
    9.2.2 Several Image Coding Schemes with Vector Quantization
        9.2.2.1 Residual VQ
        9.2.2.2 Classified VQ
        9.2.2.3 Transform Domain VQ
        9.2.2.4 Predictive VQ
        9.2.2.5 Block Truncation Coding
    9.2.3 Lattice VQ for Image Coding
9.3 Fractal Image Coding
    9.3.1 Mathematical Foundation
    9.3.2 IFS-Based Fractal Image Coding
    9.3.3 Other Fractal Image Coding Methods
9.4 Model-Based Coding
    9.4.1 Basic Concept
    9.4.2 Image Modeling
9.5 Summary
Exercises
References


Part III Motion Estimation and Compensation

Chapter 10 Motion Analysis and Motion Compensation
10.1 Image Sequences
10.2 Interframe Correlation
10.3 Frame Replenishment
10.4 Motion Compensated Coding
10.5 Motion Analysis
    10.5.1 Biological Vision Perspective
    10.5.2 Computer Vision Perspective
    10.5.3 Signal Processing Perspective
10.6 Motion Compensation for Image Sequence Processing
    10.6.1 Motion Compensated Interpolation
    10.6.2 Motion Compensated Enhancement
    10.6.3 Motion Compensated Restoration
    10.6.4 Motion Compensated Down-Conversion
10.7 Summary
Exercises
References

Chapter 11 Block Matching
11.1 Nonoverlapped, Equally Spaced, Fixed Size, Small Rectangular Block Matching
11.2 Matching Criteria
11.3 Searching Procedures
    11.3.1 Full Search
    11.3.2 2-D Logarithm Search
    11.3.3 Coarse–Fine Three-Step Search
    11.3.4 Conjugate Direction Search
    11.3.5 Subsampling in the Correlation Window
    11.3.6 Multiresolution Block Matching
    11.3.7 Thresholding Multiresolution Block Matching
        11.3.7.1 Algorithm
        11.3.7.2 Threshold Determination
        11.3.7.3 Thresholding
        11.3.7.4 Experiments
11.4 Matching Accuracy
11.5 Limitations with Block Matching Techniques
11.6 New Improvements
    11.6.1 Hierarchical Block Matching
    11.6.2 Multigrid Block Matching
        11.6.2.1 Thresholding Multigrid Block Matching
        11.6.2.2 Optimal Multigrid Block Matching
    11.6.3 Predictive Motion Field Segmentation
    11.6.4 Overlapped Block Matching
11.7 Summary
Exercises
References


Chapter 12 Pel Recursive Technique
12.1 Problem Formulation
12.2 Descent Methods
    12.2.1 First-Order Necessary Conditions
    12.2.2 Second-Order Sufficient Conditions
    12.2.3 Underlying Strategy
    12.2.4 Convergence Speed
        12.2.4.1 Order of Convergence
        12.2.4.2 Linear Convergence
    12.2.5 Steepest Descent Method
        12.2.5.1 Formulae
        12.2.5.2 Convergence Speed
        12.2.5.3 Selection of Step Size
    12.2.6 Newton–Raphson's Method
        12.2.6.1 Formulae
        12.2.6.2 Convergence Speed
        12.2.6.3 Generalization and Improvements
    12.2.7 Other Methods
12.3 Netravali–Robbins' Pel Recursive Algorithm
    12.3.1 Inclusion of a Neighborhood Area
    12.3.2 Interpolation
    12.3.3 Simplification
    12.3.4 Performance
12.4 Other Pel Recursive Algorithms
    12.4.1 Bergmann's Algorithm (1982)
    12.4.2 Bergmann's Algorithm (1984)
    12.4.3 Cafforio and Rocca's Algorithm
    12.4.4 Walker and Rao's Algorithm
12.5 Performance Comparison
12.6 Summary
Exercises
References

Chapter 13 Optical Flow
13.1 Fundamentals
    13.1.1 2-D Motion and Optical Flow
    13.1.2 Aperture Problem
    13.1.3 Ill-Posed Problem
    13.1.4 Classification of Optical Flow Techniques
13.2 Gradient-Based Approach
    13.2.1 Horn and Schunck's Method
        13.2.1.1 Brightness Invariance Equation
        13.2.1.2 Smoothness Constraint
        13.2.1.3 Minimization
        13.2.1.4 Iterative Algorithm
    13.2.2 Modified Horn and Schunck Method
    13.2.3 Lucas and Kanade's Method
    13.2.4 Nagel's Method
    13.2.5 Uras, Girosi, Verri, and Torre's Method
13.3 Correlation-Based Approach
    13.3.1 Anandan's Method
    13.3.2 Singh's Method
        13.3.2.1 Conservation Information
        13.3.2.2 Neighborhood Information
        13.3.2.3 Minimization and Iterative Algorithm
    13.3.3 Pan, Shi, and Shu's Method
        13.3.3.1 Proposed Framework
        13.3.3.2 Implementation and Experiments
        13.3.3.3 Discussion and Conclusion
13.4 Multiple Attributes for Conservation Information
    13.4.1 Weng, Ahuja, and Huang's Method
    13.4.2 Xia and Shi's Method
        13.4.2.1 Multiple Image Attributes
        13.4.2.2 Conservation Stage
        13.4.2.3 Propagation Stage
        13.4.2.4 Outline of Algorithm
        13.4.2.5 Experimental Results
        13.4.2.6 Discussion and Conclusion
13.5 Summary
Exercises
References

Chapter 14 Further Discussion and Summary on 2-D Motion Estimation
14.1 General Characterization
    14.1.1 Aperture Problem
    14.1.2 Ill-Posed Inverse Problem
    14.1.3 Conservation Information and Neighborhood Information
    14.1.4 Occlusion and Disocclusion
    14.1.5 Rigid and Nonrigid Motion
14.2 Different Classifications
    14.2.1 Deterministic Methods versus Stochastic Methods
    14.2.2 Spatial Domain Methods versus Frequency Domain Methods
        14.2.2.1 Optical Flow Determination Using Gabor Energy Filters
    14.2.3 Region-Based Approaches versus Gradient-Based Approaches
    14.2.4 Forward versus Backward Motion Estimation
14.3 Performance Comparison between Three Major Approaches
    14.3.1 Three Representatives
    14.3.2 Algorithm Parameters
    14.3.3 Experimental Results and Observations
14.4 New Trends
    14.4.1 DCT-Based Motion Estimation
        14.4.1.1 DCT and DST Pseudophases
        14.4.1.2 Sinusoidal Orthogonal Principle
        14.4.1.3 Performance Comparison
14.5 Summary
Exercises
References


Part IV Video Compression

Chapter 15 Fundamentals of Digital Video Coding
15.1 Digital Video Representation
15.2 Information Theory Results: Rate Distortion Function of Video Signal
15.3 Digital Video Formats
    15.3.1 Digital Video Color Systems
    15.3.2 Progressive and Interlaced Video Signals
    15.3.3 Video Formats Used by Video Industry
        15.3.3.1 ITU-R
        15.3.3.2 Source Input Format
        15.3.3.3 Common Intermediate Format
        15.3.3.4 ATSC Digital Television Format
15.4 Current Status of Digital Video/Image Coding Standards
    15.4.1 JPEG Standard
    15.4.2 JPEG2000
    15.4.3 MPEG-1
    15.4.4 MPEG-2
    15.4.5 MPEG-4
    15.4.6 H.261
    15.4.7 H.263, H.263 Version 2 (H.263+), H.263++, and H.26L
    15.4.8 MPEG-4 Part 10 Advanced Video Coding or H.264/AVC
    15.4.9 VC-1
    15.4.10 RealVideo
15.5 Summary
Exercises
References

Chapter 16 Digital Video Coding Standards: MPEG-1/2 Video
16.1 Introduction
16.2 Features of MPEG-1/2 Video Coding
    16.2.1 MPEG-1 Features
        16.2.1.1 Introduction
        16.2.1.2 Layered Structure Based on Group of Pictures
        16.2.1.3 Encoder Structure
        16.2.1.4 Structure of the Compressed Bitstream
        16.2.1.5 Decoding Process
    16.2.2 MPEG-2 Enhancements
        16.2.2.1 Field/Frame Prediction Mode
        16.2.2.2 Field/Frame DCT Coding Syntax
        16.2.2.3 Downloadable Quantization Matrix and Alternative Scan Order
        16.2.2.4 Pan and Scan
        16.2.2.5 Concealment Motion Vector
        16.2.2.6 Scalability
16.3 MPEG-2 Video Encoding
    16.3.1 Introduction
    16.3.2 Preprocessing
    16.3.3 Motion Estimation and Motion Compensation
        16.3.3.1 Matching Criterion
        16.3.3.2 Searching Algorithm
        16.3.3.3 Advanced Motion Estimation
16.4 Rate Control
    16.4.1 Introduction of Rate Control
    16.4.2 Rate Control of Test Model 5 for MPEG-2
16.5 Optimum Mode Decision
    16.5.1 Problem Formation
    16.5.2 Procedure for Obtaining the Optimal Mode
        16.5.2.1 Optimal Solution
        16.5.2.2 Near-Optimal Greedy Solution
    16.5.3 Practical Solution with New Criteria for the Selection of Coding Mode
16.6 Statistical Multiplexing Operations on Multiple Program Encoding
    16.6.1 Background of Statistical Multiplexing Operation
    16.6.2 VBR Encoders in StatMux
    16.6.3 Research Topics of StatMux
16.7 Summary
Exercises
References

Chapter 17 Application Issues of MPEG-1/2 Video Coding
17.1 Introduction
17.2 ATSC DTV Standards
    17.2.1 A Brief History
    17.2.2 Technical Overview of ATSC Systems
        17.2.2.1 Picture Layer
        17.2.2.2 Compression Layer
        17.2.2.3 Transport Layer
        17.2.2.4 Transmission Layer
17.3 Transcoding with Bitstream Scaling
    17.3.1 Background
    17.3.2 Basic Principles of Bitstream Scaling
    17.3.3 Architectures of Bitstream Scaling
        17.3.3.1 Architecture 1: Cutting AC Coefficients
        17.3.3.2 Architecture 2: Increasing Quantization Step
        17.3.3.3 Architecture 3: Re-Encoding with Old Motion Vectors and Old Decisions
        17.3.3.4 Architecture 4: Re-Encoding with Old Motion Vectors and New Decisions
        17.3.3.5 Comparison of Bitstream Scaling Methods
    17.3.4 MPEG-2 to MPEG-4 Transcoding
17.4 Down-Conversion Decoder
    17.4.1 Background
    17.4.2 Frequency Synthesis Down-Conversion
    17.4.3 Low-Resolution Motion Compensation
    17.4.4 Three-Layer Scalable Decoder
    17.4.5 Summary of Down-Conversion Decoder
17.5 Error Concealment
    17.5.1 Background
    17.5.2 Error Concealment Algorithms
        17.5.2.1 Code Word Domain Error Concealment
        17.5.2.2 Spatio-Temporal Error Concealment
    17.5.3 Algorithm Enhancements
        17.5.3.1 Directional Interpolation
        17.5.3.2 I-Picture Motion Vectors
        17.5.3.3 Spatial Scalable Error Concealment
    17.5.4 Summary of Error Concealment
17.6 Summary
Exercises
References

Chapter 18 MPEG-4 Video Standard: Content-Based Video Coding
18.1 Introduction
18.2 MPEG-4 Requirements and Functionalities
    18.2.1 Content-Based Interactivity
        18.2.1.1 Content-Based Manipulation and Bitstream Editing
        18.2.1.2 Synthetic and Natural Hybrid Coding
        18.2.1.3 Improved Temporal Random Access
    18.2.2 Content-Based Efficient Compression
        18.2.2.1 Improved Coding Efficiency
        18.2.2.2 Coding of Multiple Concurrent Data Streams
    18.2.3 Universal Access
        18.2.3.1 Robustness in Error-Prone Environments
        18.2.3.2 Content-Based Scalability
    18.2.4 Summary of MPEG-4 Features
18.3 Technical Description of MPEG-4 Video
    18.3.1 Overview of MPEG-4 Video
    18.3.2 Motion Estimation and Compensation
        18.3.2.1 Adaptive Selection of 16 × 16 Block or Four 8 × 8 Blocks
        18.3.2.2 Overlapped Motion Compensation
    18.3.3 Texture Coding
        18.3.3.1 INTRA DC and AC Prediction
        18.3.3.2 Motion Estimation/Compensation of Arbitrary Shaped VOP
        18.3.3.3 Texture Coding of Arbitrary Shaped VOP
    18.3.4 Shape Coding
        18.3.4.1 Binary Shape Coding with CAE Algorithm
        18.3.4.2 Gray-Scale Shape Coding
    18.3.5 Sprite Coding
    18.3.6 Interlaced Video Coding
    18.3.7 Wavelet-Based Texture Coding
        18.3.7.1 Decomposition of the Texture Information
        18.3.7.2 Quantization of Wavelet Coefficients
        18.3.7.3 Coding of Wavelet Coefficients of Low–Low Band and Other Bands
        18.3.7.4 Adaptive Arithmetic Coder
    18.3.8 Generalized Spatial and Temporal Scalability
    18.3.9 Error Resilience
18.4 MPEG-4 Visual Bitstream Syntax and Semantics
18.5 MPEG-4 Visual Profiles and Levels
18.6 MPEG-4 Video Verification Model
    18.6.1 VOP-Based Encoding and Decoding Process
    18.6.2 Video Encoder
        18.6.2.1 Video Segmentation
        18.6.2.2 Intra/Inter Mode Decision
        18.6.2.3 Off-Line Sprite Generation
        18.6.2.4 Multiple VO Rate Control
    18.6.3 Video Decoder
18.7 Summary
Exercises
References

Chapter 19 ITU-T Video Coding Standards H.261 and H.263
19.1 Introduction
19.2 H.261 Video Coding Standard
    19.2.1 Overview of H.261 Video Coding Standard
    19.2.2 Technical Detail of H.261
    19.2.3 Syntax Description
        19.2.3.1 Picture Layer
        19.2.3.2 Group of Blocks Layer
        19.2.3.3 Macroblock Layer
        19.2.3.4 Block Layer
19.3 H.263 Video Coding Standard
    19.3.1 Overview of H.263 Video Coding
    19.3.2 Technical Features of H.263
        19.3.2.1 Half-Pixel Accuracy
        19.3.2.2 Unrestricted Motion Vector Mode
        19.3.2.3 Advanced Prediction Mode
        19.3.2.4 Syntax-Based Arithmetic Coding
        19.3.2.5 PB-Frames
19.4 H.263 Video Coding Standard Version 2
    19.4.1 Overview of H.263 Version 2
    19.4.2 New Features of H.263 Version 2
        19.4.2.1 Scalability
        19.4.2.2 Improved PB-Frames
        19.4.2.3 Advanced Intracoding
        19.4.2.4 Deblocking Filter
        19.4.2.5 Slice-Structured Mode
        19.4.2.6 Reference Picture Selection
        19.4.2.7 Independent Segmentation Decoding
        19.4.2.8 Reference Picture Resampling
        19.4.2.9 Reduced-Resolution Update
        19.4.2.10 Alternative Inter VLC and Modified Quantization
        19.4.2.11 Supplemental Enhancement Information
19.5 H.263++ Video Coding and H.26L
19.6 Summary
Exercises
References

Chapter 20 A New Video Coding Standard: H.264/AVC
20.1 Introduction
20.2 Overview of H.264/AVC Codec Structure
20.3 Technical Description of H.264/AVC Coding Tools
    20.3.1 Instantaneous Decoding Refresh Picture
    20.3.2 Switching I-Slices and Switching P-Slices
    20.3.3 Transform and Quantization
    20.3.4 Intraframe Coding with Directional Spatial Prediction
    20.3.5 Adaptive Block Size Motion Compensation
    20.3.6 Motion Compensation with Multiple References
    20.3.7 Entropy Coding
    20.3.8 Loop Filter
    20.3.9 Error-Resilience Tools
20.4 Profiles and Levels of H.264/AVC
    20.4.1 Profiles of H.264/AVC
    20.4.2 Levels of H.264/AVC
20.5 Summary
Exercises
References

Chapter 21 MPEG System: Video, Audio, and Data Multiplexing
21.1 Introduction
21.2 MPEG-2 System
    21.2.1 Major Technical Definitions in MPEG-2 System Document
    21.2.2 Transport Streams
        21.2.2.1 Structure of Transport Streams
        21.2.2.2 Transport Stream Syntax
    21.2.3 Transport Streams Splicing
    21.2.4 Program Streams
    21.2.5 Timing Model and Synchronization
21.3 MPEG-4 System
    21.3.1 Overview and Architecture
    21.3.2 Systems Decoder Model
    21.3.3 Scene Description
    21.3.4 Object Description Framework
21.4 Summary
Exercises
References


Preface to the Second Edition

When looking at the preface of the first edition of this book, published in 1999, it is observed that most of the presentation, analyses, and discussion made there are still valid. The trend of switching from analog to digital communications continues. Digital image and video, digital multimedia, the Internet, and the World Wide Web have been continuously and vigorously growing during the past eight years. Therefore, in this second edition of this book, we have retained most of the material of the first edition, but with some necessary updates and new additions. Two major and some minor changes made in this second edition are as follows.

First, the major parts of JPEG2000 have become standards since 1999. Hence, we have updated Chapter 8, which presents fundamental concepts and algorithms of JPEG2000. Second, a new chapter describing the recently developed video coding standard, MPEG-4 Part 10 Advanced Video Coding or H.264, has been added to this second edition as Chapter 20. For this purpose, Chapter 20 of the first edition, which covers the system part of MPEG (multiplexing/demultiplexing and synchronizing the coded audio, video, and other data), has become Chapter 21 in this new edition. Other minor changes have been made wherever necessary, including the addition of new color systems of digital video and of the profiles and levels of video coding standards.

Acknowledgments

Both authors acknowledge the great efforts of Dr. Zhicheng Ni in preparing the solution manual of this book. Yun Qing Shi would like to thank Professor Guorong Xuan, Tongji University, Shanghai, China, for constructive discussions on wavelet transform. Huifang Sun expresses his appreciation to his colleague, Dr. Anthony Vetro, for fruitful technical discussions and proofreading related to the new chapter (Chapter 20) in this edition. He would like to thank Drs. Ajay Divakaran and Fatih Porikli for their help in many aspects of this edition. He also extends his appreciation to many friends and colleagues among the MPEGers who provided MPEG documents and tutorial materials cited in some revised chapters of this edition. He would like to thank Drs. Richard Waters, Kent Wittenburg, Masatoshi Kameyama, and Joseph Katz for their continuing support and encouragement. He would also like to thank Tokumichi Murakami and Kohtaro Asai for their friendly support and encouragement.

Yun Qing Shi
New Jersey Institute of Technology

Newark, New Jersey

Huifang Sun
Mitsubishi Electric Research Laboratories

Cambridge, Massachusetts


Preface to the First Edition

It is well known that in the 1960s the advent of the semiconductor computer and the space program swiftly brought the field of digital image processing into public focus. Since then the field has experienced rapid growth and has entered every aspect of modern technology. Since the early 1980s, digital image sequence processing has been an attractive research area because an image sequence, as a collection of images, may provide more information than a single image frame. The increased computational complexity and memory space required for image sequence processing are becoming more attainable. This is due to more advanced, achievable computational capability, resulting from the continuing progress made in technologies, especially those associated with the VLSI industry and information processing.

In addition to image and image sequence processing in the digitized domain, facsimile transmission has switched from analog to digital since the 1970s. However, the concept of high definition television (HDTV), when proposed in the late 1970s and early 1980s, continued to be analog. This has since changed. In the United States, the first digital system proposal for HDTV appeared in 1990. The Advanced Television Systems Committee (ATSC), formed by the television industry, recommended the digital HDTV system developed jointly by the seven Grand Alliance members as the standard, which was approved by the Federal Communications Commission (FCC) in 1997. Today's worldwide prevailing concept of HDTV is digital. Digital television (DTV) provides a signal that can be used in computers. Consequently, the marriage of TV and computers has begun. Direct broadcasting by satellite (DBS), digital video disks (DVD), video-on-demand (VOD), video games, and other digital video related media and services are now, or soon will be, available.

As in the case of image and video transmission and storage, audio transmission and storage through some media have changed from analog to digital. Examples include entertainment audio on compact disks (CD) and telephone transmission over long and medium distances. The digital TV signals discussed above provide another example, since they include audio signals. Transmission and storage of audio signals through some other media are about to change to digital. Examples of this include telephone transmission through the local area and cable TV.

Although most signals generated from various sensors are analog in nature, the switching from analog to digital is motivated by the superiority of digital signal processing and transmission over their analog counterparts. The principal advantage of being digital is the robustness against various noises. Clearly, this results from the fact that only binary digits exist in digital format, and it is much easier to distinguish one state from the other than to handle analog signals.

Another advantage of being digital is ease of signal manipulation. In addition to the development of a variety of digital signal (including image, video, and audio) processing techniques and specially designed software and hardware, which may be well known, the following development is an example of this advantage. The digitized information format, i.e., the bitstream, often in compressed form, is a revolutionary change in the video industry that enables many manipulations which are either impossible or very complicated to execute in analog format. For instance, video, audio, and other data can be first compressed to separate bitstreams and then combined into a single bitstream, thus providing a multimedia solution for many practical applications. Information from different sources and to different devices can be multiplexed and demultiplexed in terms of the bitstream. Bitstream conversion in terms of bit rate conversion, resolution conversion, and syntax conversion becomes feasible. In digital video, content-based coding, retrieval, and manipulation, as well as editing video in the compressed domain, become feasible. All system timing signals in digital systems can be included in the bitstream instead of being transmitted separately as in traditional analog systems.

Being digital is well suited to the recent development of modern telecommunication structures, as exemplified by the Internet and the World Wide Web (WWW). Therefore, we can see that digital computers, consumer electronics (including television and video games), and telecommunications networks are combined to produce an information revolution. By combining audio, video, and other data, multimedia becomes an indispensable element of modern life. While the pace and the future of this revolution cannot be predicted, one thing is certain: this process is going to drastically change many aspects of our world in the next several decades.

One of the enabling technologies in the information revolution is digital data compression, because the digitization of analog signals causes data expansion. In other words, storing and/or transmitting digitized signals requires more bandwidth and/or storage space than the original analog signals do.

The focus of this book is on image and video compression encountered in multimedia engineering. Fundamentals, algorithms, and standards are three emphases of the book. It is intended to serve as a graduate level text. Its material is sufficient for a one-semester or one-quarter graduate course on digital image and video coding. For this purpose, at the end of each chapter, there is a section of exercises, containing problems and projects, for practice, and a section of references for further reading.

Based on this book, a short course, entitled "Image and Video Compression for Multimedia," was conducted at Nanyang Technological University, Singapore, in March and April 1999. The response to the short course was overwhelmingly positive.

Acknowledgments

We are pleased to express our gratitude here for the support and help we received in the course of writing this book.

The first author thanks his friend and former colleague Dr. C.Q. Shu for fruitful technical discussion related to some contents of the book. Sincere thanks also are directed to several of his friends and former students, Drs. J.N. Pan, X. Xia, S. Lin, and Y. Shi, for their technical contributions and computer simulations related to some subjects of the book. He is grateful to Ms. L. Fitton for her English editing of 11 chapters, and to Dr. Z.F. Chen for her help in preparing many graphics.

The second author expresses his appreciation to his colleagues Anthony Vetro and Ajay Divakaran for fruitful technical discussion related to some contents of the book and for their proofreading of nine chapters. He also extends his appreciation to Drs. Weiping Li and Xiaobing Lee for their help in providing some useful references, and to many friends and colleagues among the MPEGers who provided wonderful MPEG documents and tutorial materials that are cited in some chapters of this book.

Both authors would like to express their deep appreciation to Dr. Z.F. Chen for her great help in formatting all the chapters of the book. They both thank Dr. F. Chichester for his help in preparing the book.

Special thanks go to the editor-in-chief of the Digital Image Processing book series of CRC Press, Dr. P. Laplante, for his constant encouragement and guidance. The help from the acquisition editor of Electrical Engineering of CRC Press, Nora Konopka, is appreciated.

The first author acknowledges the support he received for the writing of this book from the Electrical and Computer Engineering Department at the New Jersey Institute of Technology, New Jersey, U.S.A. In particular, thanks are directed to the department chairman, Professor R. Haddad, and the associate chairman, Professor K. Sohn. He is also grateful to the Division of Information Engineering and the Electrical and Electronic Engineering School at Nanyang Technological University (NTU), Singapore, for the support he received during his sabbatical leave. It was in Singapore that he finished writing the manuscript. In particular, thanks go to the Dean of the School, Professor Er Meng Hwa, and the Division head, Professor A.C. Kot. With pleasure, he expresses his appreciation to many of his colleagues at NTU for their encouragement and help. In particular, his thanks go to Drs. G. Li and J.S. Li, and Dr. G.A. Bi. Thanks are also directed to many colleagues, graduate students, and some technical staff from industrial companies in Singapore who attended the short course, which was based on this book, in March/April 1999 and contributed their enthusiastic support and some fruitful discussion.

Last, but not least, both authors thank their families for their patient support during the course of the writing. Without their understanding and support, we would not have been able to complete this book.

Yun Qing Shi and Huifang Sun
June 23, 1999


Content and Organization of the Book

This book consists of 21 chapters, grouped into four parts: (1) fundamentals, (2) still image compression, (3) motion estimation and compensation, and (4) video compression. The following paragraphs summarize the aim and content of each chapter and each part, and the relationship between some chapters and the four parts.

Part I includes Chapters 1–6, which provide readers with the fundamentals for understanding the remaining three parts of the book. In Chapter 1, the practical needs for image and video compression are demonstrated, and the feasibility of image and video compression is analyzed. Specifically, both statistical and psychovisual redundancies are analyzed, and the removal of these redundancies leads to image and video compression. In the course of the analysis, some fundamental characteristics of the human visual system are discussed. Visual quality measurement, another important concept in compression, is addressed in terms of both subjective and objective quality measures, and the new trend of combining the merits of the two measures is also discussed. Finally, some information theory results are presented as the concluding subject of the chapter.
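
As a concrete illustration of the objective quality measurement mentioned above (an editorial example, not taken from the book), the short Python sketch below computes the peak signal-to-noise ratio (PSNR) between an original and a reconstructed 8-bit image; the function name and the peak value of 255 are assumptions made for this example.

    import numpy as np

    def psnr(original, reconstructed, peak=255.0):
        """Peak signal-to-noise ratio (in dB) between two 8-bit images."""
        original = original.astype(np.float64)
        reconstructed = reconstructed.astype(np.float64)
        mse = np.mean((original - reconstructed) ** 2)   # mean square error
        if mse == 0:
            return float("inf")                          # identical images
        return 10.0 * np.log10(peak ** 2 / mse)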

Chapter 2 discusses quantization, a crucial step in lossy compression. It is known that quantization has a direct impact on both the coding bit rate and the quality of reconstructed frames. Both uniform and nonuniform quantization are covered in this chapter. The issues of quantization distortion, optimum quantization, and adaptive quantization are also addressed. The last subject discussed in the chapter is pulse code modulation (PCM), which, as the earliest, best-established, and most frequently applied coding system, normally serves as a standard against which other coding techniques are compared.
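
The following is a minimal sketch of a midtread uniform quantizer (illustrative only, not the book's notation); the function name and the step size are assumptions. It shows the index a coder would transmit, the reconstruction level, and the resulting distortion, which for a fine step and roughly uniform input is close to step**2 / 12.

    import numpy as np

    def uniform_quantize(x, step):
        """Midtread uniform quantizer: map x to the nearest multiple of 'step'."""
        index = np.round(x / step)       # quantizer index (what a coder would transmit)
        return index, index * step       # index and reconstruction level

    # Example: quantize a ramp with step size 16 and measure the distortion
    x = np.arange(0, 64, dtype=float)
    idx, xq = uniform_quantize(x, 16.0)
    mse = np.mean((x - xq) ** 2)         # quantization distortion, roughly step**2 / 12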

Two efficient coding schemes, differential coding and transform coding (TC), are discussed in Chapters 3 and 4, respectively. Both techniques utilize the redundancies discussed in Chapter 1, thus achieving data compression. In Chapter 3, the formulation of general differential pulse code modulation (DPCM) systems is described first, followed by the discussion of optimum linear prediction and several implementation issues. Then, delta modulation (DM), as an important, simple, special case of DPCM, is presented. Finally, the application of the differential coding technique to interframe coding and information-preserving differential coding are covered.
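
As an illustrative sketch of simple pixel-to-pixel DPCM (not code from the book), the loop below predicts each sample by the previously reconstructed sample and quantizes the prediction error; the function name and the uniform error quantizer are assumptions made for this example. Using the reconstructed (rather than original) previous sample keeps the encoder and decoder predictions in step.

    import numpy as np

    def dpcm_encode_decode(line, step=8.0):
        """1-D pixel-to-pixel DPCM: predict each sample by the previous
        reconstructed sample and quantize the prediction error uniformly."""
        recon = np.zeros(len(line), dtype=float)
        errors = np.zeros(len(line), dtype=float)
        prev = 0.0                                # predictor initialized to zero
        for i, x in enumerate(np.asarray(line, dtype=float)):
            e = x - prev                          # prediction error
            eq = step * np.round(e / step)        # quantized error (what is coded)
            errors[i] = eq
            recon[i] = prev + eq                  # decoder reconstruction
            prev = recon[i]                       # prediction uses the reconstructed value
        return errors, recon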

Chapter 4 begins with the introduction of the Hotelling transform, the discrete version of the optimum Karhunen–Loève transform. Through the statistical, geometrical, and basis vector (image) interpretations, this introduction provides a solid understanding of the transform coding technique. Several linear unitary transforms are then presented, followed by performance comparisons between these transforms in terms of energy compaction, mean square reconstruction error, and computational complexity. It is demonstrated that the discrete cosine transform (DCT) performs better than the others in general. In the discussion of bit allocation, an efficient adaptive scheme using threshold coding, devised by Chen and Pratt in 1984, is featured; it established a basis for the international still image coding standard JPEG. The comparison between DPCM and TC is also given. The combination of these two techniques (hybrid transform/waveform coding) and its application in image and video coding are also described.
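
A small numerical sketch of 2-D transform coding with the DCT (for illustration only; the matrix construction and the smooth test block are assumptions, not material from the book): an orthonormal 8 × 8 DCT-II matrix is built, a block is transformed separably, and the fraction of energy captured by the DC coefficient gives a feel for energy compaction.

    import numpy as np

    def dct_matrix(n=8):
        """Orthonormal DCT-II basis matrix of size n x n."""
        c = np.zeros((n, n))
        for k in range(n):
            for i in range(n):
                c[k, i] = np.cos((2 * i + 1) * k * np.pi / (2 * n))
        c *= np.sqrt(2.0 / n)
        c[0, :] /= np.sqrt(2.0)               # DC row scaled for orthonormality
        return c

    C = dct_matrix(8)
    block = np.outer(np.arange(8), np.ones(8)) * 16.0   # a smooth 8 x 8 test block
    coeff = C @ block @ C.T                              # separable 2-D DCT
    energy_dc = coeff[0, 0] ** 2 / np.sum(coeff ** 2)    # fraction of energy in the DC term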


The last two chapters in Part I cover several coding (i.e., code word assignment) techniques. In Chapter 5, two types of variable-length coding techniques, Huffman coding and arithmetic coding, are discussed. First, an introduction to basic coding theory is presented, which can be viewed as a continuation of the information theory results presented in Chapter 1. Then the Huffman code, as an optimum and instantaneous code, and a modified version are covered. Huffman coding is a systematic procedure for encoding a source alphabet in which each source symbol has an occurrence probability. As a block code (a fixed code word having an integer number of bits is assigned to each source symbol), it is optimum in the sense that it produces minimum coding redundancy. Some limitations of Huffman coding are analyzed. As a stream-based coding technique, arithmetic coding is distinct from, and is gaining more popularity than, Huffman coding. It maps a string of source symbols into a string of code symbols. Free of the integer-bits-per-source-symbol restriction, arithmetic coding is more efficient. The principle of arithmetic coding and some of its implementation issues are addressed.
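
For illustration (not the book's procedure or notation), here is a compact Huffman code construction in Python using a binary heap; the function name, the dictionary interface, and the example probabilities are assumptions. Repeatedly merging the two least probable groups and prefixing '0'/'1' yields a prefix (instantaneous) code whose average length is minimal for the given probabilities.

    import heapq

    def huffman_code(probabilities):
        """Build a binary Huffman code for a {symbol: probability} mapping."""
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probabilities.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            p1, _, code1 = heapq.heappop(heap)        # two least probable groups
            p2, _, code2 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in code1.items()}
            merged.update({s: "1" + c for s, c in code2.items()})
            heapq.heappush(heap, (p1 + p2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    # Example: code word lengths 1, 2, 3, 3 match the source entropy of 1.75 bits
    code = huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})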

While the two types of variable-length coding techniques introduced in Chapter 5 can be classified as fixed-length-to-variable-length coding techniques, run-length coding (RLC) and dictionary coding, discussed in Chapter 6, can be classified as variable-length-to-fixed-length coding techniques. The discrete Markov source model (another portion of the information theory results), which can be used to characterize 1-D RLC, is introduced at the beginning of Chapter 6. Both 1-D RLC and 2-D RLC are then introduced. A comparison between 1-D and 2-D RLC is made in terms of coding efficiency and the effect of transmission errors. The digital facsimile coding standards based on 1-D and 2-D RLC are introduced. Another focus of Chapter 6 is on dictionary coding. Two groups of adaptive dictionary coding techniques, the LZ77 and LZ78 algorithms, are presented, and their applications are discussed. At the end of the chapter, a discussion of international standards for lossless still image compression is given. For both lossless bilevel and multilevel still image compression, the respective standard algorithms and their performance comparisons are provided.
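
A minimal sketch of the 1-D run-length representation of a bilevel scan line (illustrative only; the function name and the string input are assumptions): runs of identical pixels are replaced by their lengths, and it is these run lengths, rather than the individual pixels, that a facsimile coder would entropy code.

    def run_lengths(bilevel_line):
        """1-D run-length representation of a bilevel scan line:
        returns (starting color, list of run lengths)."""
        if not bilevel_line:
            return None, []
        runs, count = [], 1
        for prev, cur in zip(bilevel_line, bilevel_line[1:]):
            if cur == prev:
                count += 1
            else:
                runs.append(count)
                count = 1
        runs.append(count)
        return bilevel_line[0], runs

    # Example: "0001111100" -> ("0", [3, 5, 2])
    color, runs = run_lengths("0001111100")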

Part II of the book includes Chapters 7 through 9, which are devoted to still image compression. In Chapter 7, the international still image coding standard JPEG is introduced. Two classes of encoding, i.e., lossy and lossless, and four modes of operation, i.e., sequential DCT-based mode, progressive DCT-based mode, lossless mode, and hierarchical mode, are covered. The discussion in Part I is very useful in understanding what is introduced here for JPEG.

Because of its higher coding efficiency and superior spatial and quality scalability features compared with the DCT coding technique, discrete wavelet transform (DWT) coding has been adopted by the JPEG2000 still image coding standard as the core technology. Chapter 8 begins with an introduction to the wavelet transform (WT), which includes a comparison between the WT and the short-time Fourier transform (STFT), and presents the WT as a unification of several existing techniques known as filter bank analysis, pyramid coding, and subband coding. Then the DWT for still image coding is discussed. In particular, the embedded zerotree wavelet (EZW) technique and set partitioning in hierarchical trees (SPIHT) are discussed. The updated JPEG2000 standard activity is also presented here.
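
The following is a hedged sketch of one level of the reversible (5,3) integer wavelet transform computed by lifting on a 1-D signal, written in Python with numpy; the function name, the even-length assumption, and the simplified boundary handling are assumptions of this example rather than the exact JPEG2000 procedure.

    import numpy as np

    def lift_53_forward(x):
        """One level of the (5,3) integer wavelet transform by lifting on a 1-D
        signal of even length, with simple mirrored boundary handling."""
        x = np.asarray(x, dtype=np.int64)
        even, odd = x[0::2].copy(), x[1::2].copy()
        # Predict step: high-pass (detail) coefficients
        right = np.append(even[1:], even[-1])          # mirror the last even sample
        d = odd - np.floor((even + right) / 2).astype(np.int64)
        # Update step: low-pass (approximation) coefficients
        left = np.insert(d[:-1], 0, d[0])              # mirror the first detail sample
        s = even + np.floor((left + d + 2) / 4).astype(np.int64)
        return s, d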

Chapter 9 presents three nonstandard still image coding techniques: vector quantization (VQ), fractal coding, and model-based image coding. All three techniques have several important features, such as a very high compression ratio for certain kinds of images and very simple decoding procedures. Owing to some limitations, however, they have not been adopted by the still image coding standards. On the other hand, the facial model and face animation techniques have been adopted by the MPEG-4 video standard.
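
As an illustrative sketch of the basic vector quantization step (not code from the book), each image block, flattened to a vector, is mapped to the index of its nearest code vector; the function names and array shapes are assumptions, and codebook design (e.g., by a training algorithm) is left out of this sketch.

    import numpy as np

    def vq_encode(blocks, codebook):
        """Map each block vector to the index of the nearest code vector
        (minimum squared Euclidean distance).
        blocks: (num_blocks, dim); codebook: (codebook_size, dim)."""
        dists = ((blocks[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return np.argmin(dists, axis=1)       # transmitted indices

    def vq_decode(indices, codebook):
        """Reconstruct each block by a table lookup into the codebook."""
        return codebook[indices]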

Part III of this book, consisting of Chapters 10 through 14, addresses motion estimation and motion compensation, which are key issues in modern video compression. Part III is a prerequisite to Part IV, which discusses various video coding standards. The first chapter in Part III, Chapter 10, introduces motion analysis and compensation in general. The chapter begins with the concept of imaging space, which characterizes all images and all image sequences in the temporal and spatial domains. Both temporal and spatial image sequences are special proper subsets of the imaging space. A single image becomes merely a specific cross section of the imaging space. Two techniques in video compression utilizing interframe correlation, both developed in the late 1960s and early 1970s, are presented here. Frame replenishment is relatively simpler in modeling and implementation. However, motion compensated coding achieves higher coding efficiency and better quality in reconstructed frames with a 2-D displacement model. Motion analysis is then viewed from a signal processing perspective. Three techniques in motion analysis are briefly discussed. They are block matching, pel recursion, and optical flow, which are presented in detail in Chapters 11 through 13, respectively. Finally, other applications of motion compensation to image sequence processing are discussed.

Chapter 11 addresses the block matching technique, which is presently the most frequently used motion estimation technique. The chapter first presents the original block matching technique proposed by Jain and Jain. Several different matching criteria and search strategies are then discussed. A thresholding multiresolution block matching algorithm is described in some detail so as to provide an insight into the technique. Then, the limitations of block matching techniques are analyzed, from which several new improvements are presented. They include hierarchical block matching, multigrid block matching, predictive motion field segmentation, and overlapped block matching. All of these techniques modify the nonoverlapped, equally spaced, fixed-size, small rectangular block model proposed by Jain and Jain in some way so that the motion estimation is more accurate and has fewer block artifacts and less overhead side information.
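
A minimal full-search block matching sketch with the sum of absolute differences (SAD) criterion, for illustration only; the function name, the 16-pel block size, and the +/-7-pel search range are assumptions of this example, not parameters taken from the book.

    import numpy as np

    def full_search(cur, ref, top, left, block=16, search=7):
        """Exhaustive block matching with the SAD criterion: return the
        displacement (dy, dx) of the best match within +/- 'search' pels."""
        target = cur[top:top + block, left:left + block].astype(np.int32)
        best, best_mv = None, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                    continue                     # candidate block falls outside the frame
                cand = ref[y:y + block, x:x + block].astype(np.int32)
                sad = np.abs(target - cand).sum()
                if best is None or sad < best:
                    best, best_mv = sad, (dy, dx)
        return best_mv, best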

The pel recursive technique is discussed in Chapter 12. First, the determination of 2-D displacement vectors is converted, via the use of the displaced frame difference (DFD) concept, to a minimization problem. Second, descent methods in optimization theory are discussed. In particular, the steepest descent method and the Newton–Raphson method are addressed in terms of algorithm, convergence, and implementation issues such as the selection of step size and initial value. Third, the first pel recursive techniques, proposed by Netravali and Robbins, are presented. Finally, several improvement algorithms are described.
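
The following is a schematic Python sketch of a steepest-descent update on the squared displaced frame difference at a single pel, intended only to illustrate the idea described above; it is not the exact Netravali–Robbins formulation, and the function name, step size, integer-pel evaluation, and boundary clamping are all assumptions of this example.

    import numpy as np

    def pel_recursive_update(cur, prev, y, x, d, eps=0.01, iters=5):
        """Refine a displacement estimate d = (dy, dx) at pel (y, x) by gradient
        descent on 0.5 * DFD**2, where DFD = cur[y, x] - prev[y - dy, x - dx].
        Integer-pel evaluation only; practical pel recursive coders interpolate."""
        gy, gx = np.gradient(prev.astype(np.float64))   # spatial gradients of the previous frame
        dy, dx = d
        for _ in range(iters):
            yy = min(max(int(round(y - dy)), 0), prev.shape[0] - 1)   # clamp to the frame
            xx = min(max(int(round(x - dx)), 0), prev.shape[1] - 1)
            dfd = float(cur[y, x]) - float(prev[yy, xx])              # displaced frame difference
            # Gradient of 0.5 * DFD**2 with respect to (dy, dx) is DFD * grad(prev) at (yy, xx)
            dy -= eps * dfd * gy[yy, xx]
            dx -= eps * dfd * gx[yy, xx]
        return dy, dx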

Optical flow, the third technique in motion estimation for video coding, is covered in Chapter 13. First, some fundamental issues in motion estimation are addressed. They include the difference and relationships between 2-D motion and optical flow, the aperture problem, and the ill-posed nature of motion estimation. The gradient-based and correlation-based approaches to optical flow determination are then discussed in detail. For the former, the Horn and Schunck algorithm is illustrated as a representative technique and some other algorithms are briefly introduced. For the latter, the Singh method is introduced as a representative technique. In particular, the concepts of conservation information and neighborhood information are emphasized. A correlation-feedback algorithm is presented in detail to provide an insight into the correlation technique. Finally, multiple attributes for conservation information are discussed.
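
As a compact illustration of the gradient-based approach (a sketch, not the book's algorithm), the iteration below follows the well-known Horn–Schunck update; the simple derivative estimates, the periodic-boundary neighborhood average, and the parameter values are assumptions of this example.

    import numpy as np

    def horn_schunck(img1, img2, alpha=10.0, iters=100):
        """Sketch of the Horn-Schunck iteration: update the flow (u, v) using the
        brightness constancy term and a smoothness weight alpha."""
        I1 = img1.astype(np.float64); I2 = img2.astype(np.float64)
        Iy, Ix = np.gradient(I1)                    # crude spatial derivatives
        It = I2 - I1                                # crude temporal derivative
        u = np.zeros_like(I1); v = np.zeros_like(I1)

        def neighbor_avg(f):
            # 4-neighbor average with periodic boundaries (a simplification)
            return (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                    np.roll(f, 1, 1) + np.roll(f, -1, 1)) / 4.0

        for _ in range(iters):
            u_bar, v_bar = neighbor_avg(u), neighbor_avg(v)
            common = (Ix * u_bar + Iy * v_bar + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
            u = u_bar - Ix * common
            v = v_bar - Iy * common
        return u, v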

Chapter 14, the last chapter in Part III, provides a further discussion and summary of 2-D motion estimation. First, a few features common to all three major techniques discussed in Chapters 11 through 13 are addressed. They are the aperture and ill-posed inverse problems, conservation and neighborhood information, occlusion and disocclusion, and rigid and nonrigid motion. Second, a variety of different classifications of motion estimation techniques are presented. Frequency domain methods are discussed as well. Third, a performance comparison between the three major techniques in motion estimation is made. Finally, the new trends in motion estimation are presented.


Part IV, containing Chapters 15 through 21, covers various video coding standards. Chapter 15 presents the fundamentals of video coding. First, digital video representation is discussed. Second, the rate distortion function of the video signal is covered, the fourth portion of the information theory results presented in this book. Third, various digital video formats are discussed. Finally, the current digital image/video coding standards are summarized. The full names and abbreviations of some organizations, the completion time, and the major features of the various image/video coding standards are listed in two tables.

Chapter 16 is devoted to the video coding standards MPEG-1/2, which are the most widely used video coding standards at present. The basic technique of MPEG-1/2 is a full-motion compensated DCT and DPCM hybrid coding algorithm. The features of MPEG-1 (including the layered data structure) and the MPEG-2 enhancements (including field/frame modes for supporting interlaced video input and the scalability extension) are described. Issues of rate control, optimum mode decision, and multiplexing are discussed.

Chapter 17 presents several application examples of the MPEG-1/2 video standards. They are the ATSC DTV standard, approved by the Federal Communications Commission (FCC) in the United States, transcoding, the down-conversion decoder, and error concealment. Discussion of these applications can enhance the understanding and mastering of the MPEG-1/2 standards. Some of the research work reported here may be helpful for graduate students in broadening their knowledge of digital video processing, an active research field.

Chapter 18 presents the MPEG-4 video standard. The predominant feature of MPEG-4, content-based manipulation, is emphasized. The underlying concept of audio/visual objects (AVOs) is introduced. The important functionalities of MPEG-4, i.e., content-based interactivity (including bitstream editing and synthetic and natural hybrid coding (SNHC)), content-based coding efficiency, and universal access (including content-based scalability), are discussed. Since neither MPEG-1 nor MPEG-2 includes synthetic video and content-based coding, the most important application of MPEG-4 is in a multimedia environment.

Chapter 19 introduces the ITU-T video coding standards H.261 and H.263, which are utilized mainly for videophony and videoconferencing. The basic technical detail of H.261, the earliest video coding standard, is presented. The technical improvements with which H.263 achieves higher coding efficiency are discussed. Features of H.263+, H.263++, and H.26L are also presented.

Chapter 20 introduces the recently developed video coding standard, MPEG-4 Part 10 Advanced Video Coding (H.264), which was jointly developed by the joint video team (JVT) of MPEG and ITU-T VCEG and is simply called H.264/AVC. H.264/AVC is an efficient, state-of-the-art video compression standard whose coding efficiency is about two times better than that of MPEG-2, at the expense of increased complexity. H.264/AVC has been planned for many applications including HD-DVD, DTV for satellite and wireless networks, IPTV, and many others.

Chapter 21 covers the systems part of MPEG: multiplexing/demultiplexing and synchronizing the coded audio, video, and other data. Specifically, MPEG-2 systems and MPEG-4 systems are introduced. For MPEG-2 systems, the two forms, i.e., the program stream and the transport stream, are described. For MPEG-4 systems, some multimedia application related issues are discussed.

Yun Qing Shi
New Jersey Institute of Technology

Newark, New Jersey

Huifang Sun
Mitsubishi Electric Research Laboratories

Cambridge, Massachusetts


Authors


Yun Qing Shi joined the New Jersey Institute of Technology (NJIT), Newark, New Jersey in 1987, and is currently a professor of Electrical and Computer Engineering. He obtained his BS and MS from Shanghai Jiao Tong University, Shanghai, China, and his MS and PhD from the University of Pittsburgh, Pennsylvania. His research interests include visual signal processing and communications, multimedia data hiding and security, theory of multidimensional systems, and signal processing. Before entering graduate school, he had industrial experience in numerical control manufacturing and electronic broadcasting. Some of his research projects have been funded by several federal and New Jersey state agencies.

Dr. Shi is an author and coauthor of 200 papers, one book, and four book chapters. He holds two U.S. patents, and has 20 U.S. patents pending (all of these pending patents have been licensed to third parties by NJIT). He is the chairman of the Signal Processing Chapter of the IEEE North Jersey Section, the founding editor-in-chief of LNCS Transactions on Data Hiding and Multimedia Security (Springer), an editorial board member of Multidimensional Systems and Signal Processing (Springer), a member of three technical committees of the IEEE Circuits and Systems Society (CASS), the technical chair of the IEEE International Conference on Multimedia and Expo 2007 (ICME07), a co-technical chair of the International Workshop on Digital Watermarking 2007 (IWDW07), and a fellow of the IEEE. He was an associate editor of IEEE Transactions on Signal Processing and IEEE Transactions on Circuits and Systems Part II, a guest editor of special issues for several journals, a formal reviewer of the Mathematical Reviews, a contributing author for the Comprehensive Dictionary of Electrical Engineering (CRC), an IEEE CASS Distinguished Lecturer, a member of the IEEE Signal Processing Society's Technical Committee of Multimedia Signal Processing, a co-general chair of the IEEE 2002 International Workshop on Multimedia Signal Processing (MMSP02), a co-technical chair of MMSP05, and a co-technical chair of IWDW06.

Huifang Sun graduated from Harbin Engineering Institute, China, and received his PhD from the University of Ottawa, Canada. He joined the Electrical Engineering Department of Fairleigh Dickinson University in 1986 and was promoted to associate professor before moving to Sarnoff Corporation in 1990. He joined the Sarnoff laboratory as a member of the technical staff and was later promoted to technology leader of digital video communication. In 1995, he joined Mitsubishi Electric Research Laboratories (MERL) as a senior principal technical staff member, and was promoted to vice president and fellow of MERL and deputy director in 2003. Dr. Sun's research interests include digital video/image compression


and digital communication. He has coauthored two books and has published more than 150 journal and conference papers. He holds 48 U.S. patents. He received the technical achievement award in 1994 at the Sarnoff laboratory. He received the 1992 best paper award of IEEE Transactions on Consumer Electronics, the 1996 best paper award of ICCE, and the 2003 best paper award of IEEE Transactions on Circuits and Systems for Video Technology. Dr. Sun is now an associate editor for IEEE Transactions on Circuits and Systems for Video Technology and was the chair of the Visual Processing Technical Committee of the IEEE Circuits and Systems Society. He is an IEEE Fellow.


Part I

Fundamentals


1 Introduction

Image and video data compression refers to a process in which the amount of data used to represent image and video is reduced to meet a bit rate requirement (below or at most equal to the maximum available bit rate), while the quality of the reconstructed image or video satisfies a requirement for a certain application and the complexity of computation involved is affordable for the application. In this book, the terms image and video data compression, image and video compression, and image and video coding are synonymous. Figure 1.1 shows the functionality of image and video data compression in visual transmission and storage. Image and video data compression has been found to be necessary in these important applications, because the huge amount of data involved in these and other applications usually well exceeds the capability of today's hardware despite rapid advancements in the semiconductor, computer, and other industries.

It is noted that information and data are closely related yet different concepts. Data represents information, and the quantity of data can be measured. In the context of digital image and video, data is usually measured in the number of binary units (bits). Information is defined as knowledge, facts, and news according to the Cambridge International Dictionary of English. That is, while data is the representation of knowledge, facts, and news, information is the knowledge, facts, and news itself. Information, however, may also be quantitatively measured.

Bit rate (also known as coding rate), an important parameter in image and video compression, is often expressed in a unit of bits per second (bits/s, or bps), which is suitable in visual communication. In fact, an example in Section 1.1 concerning videophony (a case of visual transmission) uses bit rate in terms of bits per second. In the application of image storage, bit rate is usually expressed in a unit of bits per pixel (bpp). The term pixel is an abbreviation for picture element and is sometimes referred to as pel. In information source coding, bit rate is sometimes expressed in a unit of bits per symbol. In Section 1.4.2, when discussing the noiseless source coding theorem, we consider bit rate as the average length of code words in the unit of bits per symbol.

The required quality of the reconstructed image and video is application dependent. In medical diagnosis and some scientific measurements, we may need the reconstructed image and video to mirror the original image and video. In other words, only reversible, information-preserving schemes are allowed. This type of compression is referred to as lossless compression. In applications such as motion pictures and television (TV), a certain amount of information loss is allowed. This type of compression is called lossy compression.

From its definition, one can see that image and video data compression involves several fundamental concepts including information, data, visual quality of image and video, and computational complexity. This chapter is concerned with these fundamental concepts. First, the necessity as well as the feasibility of image and video data compression are discussed. The discussion includes the utilization of several types of redundancy inherent in image and video data, and the visual perception of the


FIGURE 1.1 Image and video compression for visual transmission and storage.

human visual system (HVS). As the quality of the reconstructed image and video is one of our main concerns, the subjective and objective measures of visual quality are addressed. Then we present some fundamental information theory results, considering that they play a key role in image and video compression.

1.1 Practical Needs for Image and Video Compression

Needless to say, visual information is of vital importance for human beings to perceive, recognize, and understand the surrounding world. With the tremendous progress that has been made in advanced technologies, particularly in very large-scale integrated (VLSI) circuits and increasingly powerful computers and computations, it is becoming more possible than ever for video to be widely utilized in our daily life. Examples include videophony, videoconferencing, high definition TV (HDTV), and the digital video disk (also known as the digital versatile disk [DVD]), to name a few.

Video, as a sequence of video frames, however, involves a huge amount of data. Let us take a look at an illustrative example. Assume the public switched telephone network (PSTN) modem can operate at a maximum bit rate of 56,600 bits/s. Assume each video frame has a resolution of 288 × 352 (288 lines and 352 pixels/line), which is comparable with that of a normal TV picture and is referred to as common intermediate format (CIF). Each of the three primary colors RGB (red, green, blue) is represented for one pixel with 8 bits, as usual, and the frame rate in transmission is 30 frames/s to provide a continuous-motion video. The required bit rate, then, is 288 × 352 × 8 × 3 × 30 = 72,990,720 bits/s. Therefore, the ratio between the required bit rate and the largest possible bit rate is about 1289. This implies that we have to compress the video data by at least 1289 times in order to accomplish the transmission described in this example. Note that an audio signal has not been accounted for yet in this illustration.
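As a quick check of this arithmetic, the short Python sketch below recomputes the required bit rate and the resulting compression ratio; all of the numbers (CIF resolution, 8 bits per primary color, 30 frames/s, and a 56,600 bits/s modem) are simply those of the example.

    # Recompute the videophony example: CIF frames sent over a PSTN modem.
    lines, pixels_per_line = 288, 352        # CIF resolution
    bits_per_pixel = 8 * 3                   # 8 bits for each of R, G, B
    frame_rate = 30                          # frames per second
    modem_rate = 56_600                      # maximum modem bit rate (bits/s)

    required_rate = lines * pixels_per_line * bits_per_pixel * frame_rate
    print(required_rate)                     # 72,990,720 bits/s
    print(required_rate / modem_rate)        # about 1289.6, hence roughly 1289:1 compression needed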

With increasingly demanding video services, such as three-dimensional (3-D) movies and games, and higher video quality, such as HDTV, advanced image and video data compression is necessary. It becomes an enabling technology to bridge the gap between the huge amount of video data required and the limited hardware capability.

1.2 Feasibility of Image and Video Compression

In this section, we shall see that image and video compression is not only a necessity for the rapid growth of digital visual communications, but is also feasible. Its feasibility rests with two types of redundancies, i.e., statistical redundancy and psychovisual redundancy. By eliminating these redundancies, we can achieve image and video compression.


1.2.1 Statistical Redundancy

Statistical redundancy can be classified into two types: interpixel redundancy and coding redundancy. By interpixel redundancy we mean that pixels of an image frame, and pixels of a group of successive image or video frames, are not statistically independent. On the contrary, they are correlated to various degrees. (The difference and relationship between image and video sequences are discussed in Chapter 10, when we begin to discuss video compression.) This type of interpixel correlation is referred to as interpixel redundancy. Interpixel redundancy can further be divided into two categories: spatial redundancy and temporal redundancy. By coding redundancy, we mean the statistical redundancy associated with coding techniques.

1.2.1.1 Spatial Redundancy

Spatial redundancy represents the statistical correlation between pixels within an image frame. Hence it is also called intraframe redundancy.

It is well known that for most properly sampled TV signals the normalized autocorrelation coefficient along a row (or a column) with a one-pixel shift is very close to the maximum value 1. That is, the intensity values of pixels along a row (or a column) have a very high autocorrelation (close to the maximum autocorrelation) with those of pixels along the same row (or the same column) but shifted by a pixel. This does not come as a surprise because most of the intensity values change continuously from pixel to pixel within an image frame except for the edge regions (Figure 1.2). Figure 1.2a is a pretty normal picture, a boy and a girl in a park, and is of a resolution of 883 × 710. The intensity profiles along the 318th row and the 262nd column are depicted in Figure 1.2b and c, respectively. For easy reference, the positions of the 318th row and 262nd column in the picture are shown in Figure 1.2d. The vertical axis represents intensity values, while the horizontal axis indicates the pixel position within the row or the column. These two curves indicate that intensity values often change gradually from one pixel to another along a row and along a column.

The study of the statistical properties of video signals can be traced back to the 1950s. Knowing that we must study and understand redundancy to remove it, Kretzmer designed some experimental devices such as a picture autocorrelator and a probabiloscope to measure several statistical quantities of TV signals and published his outstanding work in [kretzmer 1952]. He found that the autocorrelation in both horizontal and vertical directions exhibits similar behaviors, as shown in Figure 1.3. Autocorrelation functions of several pictures with different complexity were measured. It was found that from picture to picture, the shape of the autocorrelation curves ranges from remarkably linear to somewhat exponential. The central symmetry with respect to the vertical axis and the bell-shaped distribution, however, remain generally the same. When the pixel shift is small, it was found that the autocorrelation is high. This local autocorrelation can be as high as 0.97 to 0.99 for one- or two-pixel shifting. For very detailed pictures, it can range from 0.43 to 0.75. It was also found that autocorrelation generally has no preferred direction.
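The one-pixel-shift autocorrelation described here is easy to estimate numerically. The Python sketch below is an illustration only, assuming a grayscale frame stored as a NumPy array; it correlates a row with a one-pixel-shifted copy of itself, and the smoothly varying synthetic frame is merely a hypothetical stand-in for a natural image.

    import numpy as np

    def row_shift_correlation(image, row, shift=1):
        """Normalized correlation between a row and the same row shifted by `shift` pixels."""
        x = image[row, :-shift].astype(np.float64)
        y = image[row, shift:].astype(np.float64)
        x -= x.mean()
        y -= y.mean()
        return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

    # A smooth synthetic frame; natural images typically give values close to 1
    # for a one-pixel shift, as reported in the text.
    frame = np.tile(np.linspace(0, 255, 352), (288, 1)) + np.random.normal(0, 2, (288, 352))
    print(row_shift_correlation(frame, row=144, shift=1))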

The Fourier transform of the autocorrelation, the power spectrum, is known as another important function in studying statistical behavior. Figure 1.4 shows a typical power spectrum of a TV signal [fink 1957; connor 1972]. It is reported that the spectrum is quite flat until 30 kHz for a broadcast TV signal. Beyond this line frequency the spectrum starts to drop at a rate of around 6 dB per octave. This reveals the heavy concentration of video signals in low frequencies, considering a nominal bandwidth of 5 MHz.


Spatial redundancy implies that the intensity value of a pixel can be guessed from those of its neighboring pixels. In other words, it is not necessary to represent each pixel in an image frame independently. Instead, one can predict a pixel from its neighbors. Predictive coding, also known as differential coding, is based on this observation and is discussed in Chapter 3. The direct consequence of the recognition of spatial redundancy is that by removing a large amount of the redundancy (or utilizing the high correlation) within an image frame, we may save a lot of data in representing the frame, thus achieving data compression.
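As a minimal sketch of this observation (assuming a grayscale frame held in a NumPy array), the code below predicts each pixel from its left neighbor and compares the variance of the prediction residual with that of the original pixels; the synthetic frame is only a stand-in for a natural image, and the left-neighbor predictor is just the simplest possible choice, not the scheme of Chapter 3.

    import numpy as np

    def horizontal_prediction_residual(frame):
        """Predict each pixel from its left neighbor and return the residual frame."""
        frame = frame.astype(np.int16)
        residual = frame.copy()
        residual[:, 1:] = frame[:, 1:] - frame[:, :-1]   # first column is kept as-is
        return residual

    # For natural images the residual typically has a much smaller variance (and
    # entropy) than the original pixels, which is what predictive coding exploits.
    frame = np.clip(np.cumsum(np.random.normal(0, 3, (288, 352)), axis=1), 0, 255)
    residual = horizontal_prediction_residual(frame)
    print(frame.var(), residual.var())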

FIGURE 1.2 (See color insert following page 288.) (a) A picture of a boy and a girl. (b) Intensity profile along the 318th row. (c) Intensity profile along the 262nd column. (d) Positions of the 318th row and 262nd column.

1.2.1.2 Temporal Redundancy

Temporal redundancy is concerned with the statistical correlation between pixels from successive frames in a temporal image or video sequence. Therefore, it is also called interframe redundancy.

Consider a temporal image sequence. That is, a camera is fixed in the 3-D world and it takes pictures of the scene one by one as time goes by. As long as the time interval between two consecutive pictures is short enough, i.e., the pictures are taken densely enough, we can


FIGURE 1.3 Autocorrelation in horizontal directions for some testing pictures. (From Kretzmer, E.R., Bell Syst. Tech. J., 31, 751, 1952. With permission.)

imagine that the similarity between two neighboring frames is strong. Figure 1.5a and b shows the 21st and 22nd frames of the "Miss America" sequence, respectively. The frames have a resolution of 176 × 144. Among the total 25,344 pixels, only 3.4% change their gray value by more than 1% of the maximum gray value (255 in this case) from the 21st frame to the 22nd frame. This confirms an observation made in [mounts 1969]: for a videophone-like signal with moderate motion in the scene, on average, less than 10% of pixels change their gray values between two consecutive frames by an amount of 1% of the peak signal. The high interframe correlation was reported in [kretzmer 1952]. There, the autocorrelation between two adjacent frames was measured for two typical motion picture films. The measured autocorrelations were 0.80 and 0.86. The concept of frame difference coding of television signals was also reported in [seyler 1962], and the probability density functions of television frame differences were analyzed in [seyler 1965]. In summary, pixels within successive frames usually bear a strong similarity or correlation. As a result, we may predict a frame from its neighboring frames along the temporal dimension. This is referred to as interframe predictive coding and is discussed in Chapter 3. A more precise, hence more efficient, interframe predictive coding

FIGURE 1.4 A typical power spectrum of a TV broadcast signal. (From Fink, D.G., Television Engineering Handbook, New York, 1957, Sect. 10.7. With permission.)


FIGURE 1.5 (a) 21st frame and (b) 22nd frame of the Miss America sequence.

scheme, which has been in development since the 1980s, uses motion analysis. That is, it considers that the changes from one frame to the next are mainly due to the motion of some objects in the frame. Taking this motion information into consideration, we refer to the method as motion compensated (MC) predictive coding. Both interframe correlation and MC predictive coding are discussed in detail in Chapter 10.

Removing a large amount of temporal redundancy leads to a great deal of data compression. At present, all the international video coding standards have adopted MC predictive coding, which has been a vital factor in the increased use of digital video in digital media.
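The kind of frame-difference statistic quoted earlier for the "Miss America" frames can be computed in a few lines. The sketch below assumes two consecutive frames are available as 8-bit NumPy arrays; the stand-in frames created here are hypothetical and serve only to make the snippet runnable.

    import numpy as np

    def fraction_changed(frame1, frame2, peak=255, threshold_ratio=0.01):
        """Fraction of pixels whose gray level changes by more than threshold_ratio * peak."""
        diff = np.abs(frame1.astype(np.int16) - frame2.astype(np.int16))
        return float(np.mean(diff > threshold_ratio * peak))

    # With two neighboring frames of a low-motion sequence (e.g., 176 x 144 as in
    # the Miss America example), this fraction is typically only a few percent.
    f21 = np.random.randint(0, 256, (144, 176), dtype=np.uint8)
    f22 = f21.copy()   # stand-in for the next frame; real neighboring frames differ slightly
    print(fraction_changed(f21, f22))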

1.2.1.3 Coding Redundancy

As we discussed, interpixel redundancy is concerned with the correlation between pixels. That is, some information associated with pixels is redundant. The psychovisual redundancy (Section 1.2.2) is related to the information that is psychovisually redundant, i.e., to which the HVS is not sensitive. It is hence clear that both interpixel and psychovisual redundancies are somehow associated with some information contained in image and video. Eliminating these redundancies, or utilizing these correlations, by using fewer bits to represent the information results in image and video data compression. In this sense, the coding redundancy is different. It has nothing to do with information redundancy but with the representation of information, i.e., coding itself. To see this, let us take a look at the following example.

One illustrative example is provided in Table 1.1. The first column lists five distinct symbols that need to be encoded. The second column contains the occurrence probabilities of

TABLE 1.1
An Illustrative Example

Symbol    Occurrence Probability    Code 1    Code 2
a1        0.1                       000       0000
a2        0.2                       001       01
a3        0.5                       010       1
a4        0.05                      011       0001
a5        0.15                      100       001


these five symbols. The third column lists code 1, a set of code words obtained by using uniform-length code word assignment. (This code is known as the natural binary code.) The fourth column lists code 2, in which each code word has a variable length. Therefore, code 2 is called a variable-length code. It is noted that the symbols with higher occurrence probabilities are encoded with shorter lengths. Let us examine the efficiency of the two different codes. That is, we will examine which one provides a shorter average length of code words. It is obvious that the average length of code words in code 1, Lavg,1, is 3 bits. The average length of code words in code 2, Lavg,2, can be calculated as follows:

Lavg,2 = 4 × 0.1 + 2 × 0.2 + 1 × 0.5 + 4 × 0.05 + 3 × 0.15 = 1.95 bits/symbol.   (1.1)

Therefore, it is concluded that code 2 with variable-length coding is more efficient than code 1 with natural binary coding.

From this example, we can see that for the same set of symbols, different codes may perform differently. Some may be more efficient than others. For the same amount of information, code 1 contains some redundancy. That is, some data in code 1 is not necessary and can be removed without any effect. Huffman coding and arithmetic coding, two variable-length coding techniques, are discussed in Chapter 5.
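The comparison can also be verified directly from the entries of Table 1.1; the short sketch below recomputes the two average codeword lengths.

    # Occurrence probabilities and code words from Table 1.1.
    probabilities = {'a1': 0.1, 'a2': 0.2, 'a3': 0.5, 'a4': 0.05, 'a5': 0.15}
    code1 = {'a1': '000', 'a2': '001', 'a3': '010', 'a4': '011', 'a5': '100'}
    code2 = {'a1': '0000', 'a2': '01', 'a3': '1', 'a4': '0001', 'a5': '001'}

    def average_length(code, prob):
        """Expected codeword length in bits per symbol."""
        return sum(prob[s] * len(code[s]) for s in prob)

    print(average_length(code1, probabilities))   # 3.0 bits/symbol
    print(average_length(code2, probabilities))   # 1.95 bits/symbol, matching Equation 1.1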

From the study of coding redundancy, it is clear that we should search for more efficient coding techniques to compress image and video data.

1.2.2 Psychovisual Redundancy

While interpixel redundancy inherently rests in image and video data, psychovisual redundancy originates from the characteristics of the HVS.

It is known that the HVS perceives the outside world in a rather complicated way. Its response to visual stimuli is a nonlinear function of the strength of some physical attributes of the stimuli, such as intensity and color. HVS perception is different from camera sensing. In the HVS, visual information is not perceived equally; some information may be more important than other information. This implies that if we apply less data to represent the less important visual information, perception will not be affected. In this sense, we see that some visual information is psychovisually redundant. Eliminating this type of psychovisual redundancy leads to data compression.

To understand this type of redundancy, let us study some properties of the HVS. We may model the human visual system as a cascade of two units [lim 1990], as depicted in Figure 1.6. The first one is a low-level processing unit that converts incident light into a neural signal. The second one is a high-level processing unit that extracts information from the neural signal. Although much research has been carried out to investigate low-level processing, high-level processing remains wide open. The low-level processing unit is

FIGURE 1.6 A two-unit cascade model of the human visual system (HVS).


known as a nonlinear system (approximately logarithmic, as shown below). As a great body of literature exists, we limit our discussion only to video compression-related results. That is, several aspects of the HVS, which are closely related to image and video compression, are discussed in this section. They are luminance masking, texture masking, frequency masking, temporal masking, and color masking. Their relevance in image and video compression is addressed. Finally, a summary is provided, in which it is pointed out that all of these features can be unified as one: differential sensitivity. This seems to be the most important feature of human visual perception.

1.2.2.1 Luminance Masking

Luminance masking concerns the brightness perception of the HVS, which is the most fundamental aspect among the five to be discussed here. Luminance masking is also referred to as luminance dependence [connor 1972] and contrast masking [legge 1980; watson 1987]. As pointed out in [legge 1980], the term masking usually refers to a destructive interaction or interference among stimuli that are closely coupled in time or space. This may result in a failure in detection, or errors in recognition. Here, we are mainly concerned with the detectability of one stimulus when another stimulus is present simultaneously. The effect of one stimulus on the detectability of another, however, does not have to decrease detectability. Indeed, there are some cases in which a low-contrast masker increases the detectability of a signal. This is sometimes referred to as facilitation, but in this discussion we only use the term masking.

Consider the monochrome image shown in Figure 1.7. There, a uniform disk-shaped object with a gray level (intensity value) I1 is imposed on a uniform background with a gray level I2. Now the question is: under what circumstances can the disk-shaped object be discriminated from the background by the HVS? That is, we want to study the effect of one stimulus (the background in this example, the masker) on the detectability of another stimulus (in this example, the disk). Two extreme cases are obvious. If the difference between the two gray levels is quite large, the HVS has no problem with discrimination, or in other words the HVS notices the object against the background. If, on the other hand, the two gray levels are the same, the HVS cannot identify the existence of the object. What we are concerned with here is the critical threshold in the gray level difference for discrimination to take place.

If we define the threshold ΔI as the gray level difference ΔI = I1 - I2 such that the object can be noticed by the HVS with a 50% chance, then we have the following relation, known as the contrast sensitivity function, according to Weber's law:

ΔI / I ≈ constant,   (1.2)

where the constant is approximately 0.02. Weber's law states that for a relatively very wide range of I, the threshold for discrimination, ΔI, is directly proportional to the intensity I. The implication of this result is that when the background is bright, a larger difference in gray levels is needed for the HVS to discriminate the object from the background.

FIGURE 1.7 A uniform object with gray level I1 imposed on a uniform background with gray level I2.


On the other hand, the intensity difference required could be smaller if the background is relatively dark. It is noted that Equation 1.2 implies a logarithmic response of the HVS, and Weber's law holds for all other human senses as well.

Further research has indicated that the luminance threshold ΔI increases more slowly than predicted by Weber's law. Some more accurate contrast sensitivity functions have been presented in the literature. In [legge 1980], it was reported that an exponential function replaces the linear relation in Weber's law. The following exponential expression is reported in [watson 1987]:

ΔI = I0 · max(1, I/I0)^a,   (1.3)

where I0 is the luminance detection threshold when the gray level of the background is equal to zero, i.e., I = 0, and a is a constant, approximately equal to 0.7.
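A small numerical sketch of Equation 1.3 is given below; the value used for I0 is only a placeholder (the text does not specify one), and the Weber prediction of Equation 1.2 is printed alongside for comparison.

    def luminance_threshold(I, I0=1.0, a=0.7):
        """Detection threshold per Equation 1.3: delta_I = I0 * max(1, I / I0) ** a.
        I0 here is a placeholder value; a is approximately 0.7 as stated in the text."""
        return I0 * max(1.0, I / I0) ** a

    # Weber's law (Equation 1.2) predicts a threshold growing linearly with I
    # (delta_I ~ 0.02 * I); Equation 1.3 grows more slowly (exponent 0.7).
    for I in (1, 10, 100, 1000):
        print(I, luminance_threshold(I), 0.02 * I)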

Figure 1.8 shows a picture uniformly corrupted by additive white Gaussian noise (AWGN). It can be observed that the noise is more visible in the dark area than in the bright area if one compares, for instance, the dark portion and the bright portion of the cloud above the bridge. This indicates that noise filtering is more necessary in the dark area than in the bright area. The lighter area can accommodate more additive noise before the noise becomes visible. This property has found application in embedding digital watermarks [huang 1998].

The direct impact that luminance masking has on image and video compression is related to quantization, which is covered in detail in Chapter 2. Roughly speaking, quantization is a process that converts a continuously distributed quantity into a set of

FIGURE 1.8 (See color insert following page 288.) The bridge in Vancouver: (a) Original [Courtesy of Minhuai Shi]. (b) Uniformly corrupted by additive white Gaussian noise (AWGN).

finitely many distinct quantities. The number of these distinct quantities (known as quantization levels) is one of the keys in quantizer design. It significantly influences the resulting bit rate and the quality of the reconstructed image and video. An effective quantizer should be able to minimize the visibility of quantization error. The contrast sensitivity function provides a guideline in the analysis of the visibility of quantization error. Therefore, it can be applied to quantizer design. Luminance masking suggests a nonuniform quantization scheme that takes the contrast sensitivity function into consideration. One such example was presented in [watson 1987].

1.2.2.2 Texture Masking

Texture masking is sometimes also called detail dependence [connor 1972], spatial masking [netravali 1977; lim 1990], or activity masking [mitchell 1996]. It states that the discrimination threshold increases with increasing picture detail. That is, the stronger the texture, the larger the discrimination threshold. In Figure 1.8, it can be observed that the additive random noise is less pronounced in the strong texture area than in the smooth area if one compares, for instance, the dark portion of the cloud (the upper-right corner of the picture) with the water area (the bottom-right corner of the picture). This confirms texture masking.

In Figure 1.9b, the number of quantization levels decreases from 256 (as in Figure 1.9a) to 16. That is, we use only 4 bits, instead of 8 bits, to represent the intensity value for each pixel. The unnatural contours, caused by coarse quantization, can be noticed in the relatively uniform regions, compared with Figure 1.9a. This phenomenon was first noted in [goodall 1951] and is called false contouring [gonzalez 1992]. Now we see that the false contouring

� 2007 by Taylor & Francis Group, LLC.

Page 47: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

FIGURE 1.9 (See color insert following page 288.) Christmas in Winorlia: (a) Original, (b) Four-bit quantized, (c) Improved IGS quantized with 4 bits.

can be explained by using texture masking, because texture masking indicates that the human eye is more sensitive to the smooth region than to the textured region, where intensity exhibits a high variation. A direct impact on image and video compression is that the number of quantization levels, which affects bit rate significantly, should be adapted according to the intensity variation of image regions.

1.2.2.3 Frequency Masking

While the above two characteristics are picture dependent in nature, frequency masking is picture independent. It states that the discrimination threshold increases as frequency increases. It is also referred to as frequency dependence.

Frequency masking can be well illustrated by using Figure 1.9. In Figure 1.9c, high-frequency random noise has been added to the original image before quantization. This method is referred to as improved gray-scale (IGS) quantization [gonzalez 1992, p. 318]. With the same number of quantization levels (16) as in Figure 1.9b, the picture quality of Figure 1.9c is improved dramatically compared with that of Figure 1.9b: the annoying false contours have disappeared despite the increase of the root mean square value of the total noise in Figure 1.9c. This is due to the fact that the low-frequency quantization error is converted to high-frequency noise, and that the HVS is less sensitive to high-frequency content. We thus see, as pointed out in [connor 1972], that our human eyes function like a low-pass filter.
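The effect can be imitated numerically. The sketch below is not the exact IGS procedure of [gonzalez 1992]; it simply contrasts plain 4-bit uniform quantization of a smooth ramp with a dither-before-quantize variant that, in the same spirit, adds small high-frequency noise before quantization so that false contours are broken up.

    import numpy as np

    def quantize_uniform(frame, bits=4):
        """Keep only the top `bits` bits of an 8-bit image (coarse uniform quantization)."""
        step = 256 // (1 << bits)
        return (frame // step) * step

    def quantize_with_dither(frame, bits=4, rng=None):
        """Add small high-frequency noise before quantizing (IGS-like, not the exact IGS)."""
        rng = rng or np.random.default_rng(0)
        step = 256 // (1 << bits)
        noisy = np.clip(frame.astype(np.int16) + rng.integers(0, step, frame.shape), 0, 255)
        return ((noisy // step) * step).astype(np.uint8)

    # A smooth ramp shows false contours under plain 4-bit quantization; the dithered
    # version keeps the same 16 levels but trades the contours for high-frequency noise.
    ramp = np.tile(np.linspace(0, 255, 352).astype(np.uint8), (64, 1))
    print(np.unique(quantize_uniform(ramp)).size)       # 16 distinct levels
    print(np.unique(quantize_with_dither(ramp)).size)   # still 16 levels, contours broken up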


Owing to frequency masking, in the transform domain, say, the discrete cosine transform (DCT) domain, we can drop some high-frequency coefficients with small magnitudes to achieve data compression without noticeably affecting the perception of the HVS. This leads to a technique called transform coding, discussed in Chapter 4.

1.2.2.4 Temporal Masking

Temporal masking is another picture-independent feature of the HVS. It states that it takes a while for the HVS to adapt itself to the scene when the scene changes abruptly. During this transition the HVS is not sensitive to details. The masking takes place both before and after the abrupt change. It is called forward temporal masking if it happens after the scene change; otherwise, it is referred to as backward temporal masking [mitchell 1996].

This implies that one should take temporal masking into consideration when allocating data in image and video coding.

1.2.2.5 Color Masking

Digital color image processing is gaining increasing popularity due to the wide application of color images in modern life. As mentioned earlier, we are not going to cover all aspects of the perception of the HVS. Instead, we cover only those aspects related to psychovisual redundancy, and thus to image and video compression. Therefore, our discussion here on color perception is by no means exhaustive.

In physics, it is known that any visible light corresponds to an electromagnetic spectral distribution. Therefore, a color, as a sensation of visible light, is an energy with an intensity as well as a set of wavelengths associated with the electromagnetic spectrum. Obviously, intensity is an attribute of visible light. The composition of wavelengths is another attribute: chrominance. There are two elements in the chrominance attribute: hue and saturation. The hue of a color is characterized by the dominant wavelength in the composition. Saturation is a measure of the purity of a color. A pure color has a saturation of 100%, whereas a white light has a saturation of 0.

1.2.2.5.1 RGB Model

The RGB primary color system is the most well known among several color systems. This is due to the following feature of the human perception of color. The color-sensitive area in the HVS consists of three different sets of cones, and each set is sensitive to light of one of the three primary colors: red, green, and blue. Consequently, any color sensed by the HVS can be considered as a particular linear combination of the three primary colors. Many research results are available, the C.I.E. (Commission Internationale de l'Eclairage) chromaticity diagram being a well-known example. These results can be easily found in many classic optics and digital image processing texts.

The RGB model is used mainly in color image acquisition and display. In color signal processing, including image and video compression, however, the luminance–chrominance color system is more efficient and, hence, widely used. This has something to do with the color perception of the HVS. It is known that the HVS is more sensitive to green than to red, and is least sensitive to blue. An equal representation of red, green, and blue leads to inefficient data representation when the HVS is the ultimate viewer. Allocating data only to the information that the HVS can perceive, on the other hand, can make video coding more efficient.

Luminance is concerned with the perceived brightness, while chrominance is related to the perception of hue and saturation of color. That is, roughly speaking, the luminance–chrominance representation agrees more with the color perception of the HVS. This feature


makes the luminance–chrominance color models more suitable for color image processing. A good example, concerning histogram equalization, is presented in [gonzalez 1992]. It is well known that applying histogram equalization can bring out some details originally in dark regions. Applying histogram equalization to the RGB components separately can certainly achieve this goal. In doing so, however, the chrominance elements hue and saturation are changed, thus leading to color distortion. With a luminance–chrominance model, histogram equalization can be applied to the luminance component only. Hence, the details in the dark regions are brought out, whereas the chrominance elements remain unchanged and there is no color distortion. With the luminance component Y serving as a black–white signal, a luminance–chrominance color model offers compatibility with black and white TV systems. This is another merit of luminance–chrominance color models.

To be discussed next are several different luminance–chrominance color models: HSI, YUV, YCbCr, and YIQ.

1.2.2.5.2 Gamma-Correction

It is known that a nonlinear relationship (basically a power function) exists between electrical signal magnitude and light intensity for both cameras and CRT-based display monitors [haskell 1996]. That is, the light intensity is a linear function of the signal voltage raised to the power of γ. It is common practice to correct this nonlinearity before transmission. This is referred to as gamma-correction. The gamma-corrected RGB components are denoted by R′, G′, and B′, respectively. They are used in the discussion of the various color models. For the sake of notational brevity, we simply use R, G, and B instead of R′, G′, and B′ in the following discussion, while keeping the gamma-correction in mind.

1.2.2.5.3 HSI Model

In this model, I stands for the intensity component, H for the hue component, and S for the saturation component. One merit of this color system is that the intensity component is decoupled from the chromatic components. As analyzed above, this decoupling usually facilitates color image processing tasks. Another merit is that this model is closely related to the way the HVS perceives color pictures. Its main drawback is the complicated conversion between the RGB and HSI models. A detailed derivation of the conversion may be found in [gonzalez 1992]. Because of this complexity, the HSI model is not used in any TV systems.

1.2.2.5.4 YUV Model

In this model, Y denotes the luminance component, and U and V are the two chrominance components. The luminance Y can be determined from the RGB model via the following relation:

Y = 0.299R + 0.587G + 0.114B.   (1.4)

It is noted that the three weights associated with the three primary colors, R, G, and B, are not the same. Their different magnitudes reflect the different responses of the HVS to different primary colors.

Instead of being directly related to hue and saturation, the other two chrominance components, U and V, are defined as color differences as follows:

U = 0.492(B - Y),   (1.5)


and

V = 0.877(R - Y).   (1.6)

In this way, the YUV model lowers computational complexity. It has been used in PAL (phase alternating line) TV systems. Note that PAL is an analog composite color TV standard and is used in most of the European countries, some Asian countries, and Australia. By composite systems, we mean that both the luminance and chrominance components of the TV signals are multiplexed within the same channel. For completeness, an expression of YUV in terms of RGB is given below:

[Y]   [ 0.299   0.587   0.114] [R]
[U] = [-0.147  -0.289   0.436] [G]     (1.7)
[V]   [ 0.615  -0.515  -0.100] [B]

1.2.2.5.5 YIQ Model

This color space has been utilized in NTSC (National Television Systems Committee) TV systems for years. Note that NTSC is an analog composite color TV standard and is used in North America and Japan. The Y still acts as the luminance component. The two chrominance components are linear transformations of the U and V components defined in the YUV model. Specifically,

I = -0.545U + 0.839V,   (1.8)

and

Q = 0.839U + 0.545V.   (1.9)

Substituting the U and V components expressed in Equations 1.5 and 1.6 into Equations 1.8 and 1.9, we can express YIQ directly in terms of RGB. That is,

[Y]   [0.299   0.587   0.114] [R]
[I] = [0.596  -0.275  -0.321] [G]     (1.10)
[Q]   [0.212  -0.523   0.311] [B]

1.2.2.5.6 YDbDr Model

The YDbDr model is used in the SECAM (Sequential Couleur a Memoire) TV system. Note that SECAM is used in France, Russia, and some eastern European countries. The relationship between YDbDr and RGB appears below:

[Y ]   [ 0.299   0.587   0.114] [R]
[Db] = [-0.450  -0.883   1.333] [G]     (1.11)
[Dr]   [-1.333   1.116  -0.217] [B]

where

Db = 3.059U,   (1.12)

and

Dr = -2.169V.   (1.13)


1.2.2.5.7 YCbCr Model

From the above, we can see that the U and V chrominance components are differences between the gamma-corrected color B and the luminance Y, and between the gamma-corrected R and the luminance Y, respectively. The chrominance component pairs I and Q, and Db and Dr, are both linear transforms of U and V. Hence they are very closely related to each other. It is noted that U and V may be negative as well. To make the chrominance components nonnegative, the Y, U, and V components are scaled and shifted to produce the YCbCr model, which is used in the international coding standards JPEG and MPEG. (These two standards are covered in Chapters 7 and 16, respectively.)

[Y ]   [ 0.257   0.504   0.098] [R]   [ 16]
[Cb] = [-0.148  -0.291   0.439] [G] + [128]     (1.14)
[Cr]   [ 0.439  -0.368  -0.071] [B]   [128]
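For illustration, Equation 1.14 can be applied directly as a matrix-vector product. The sketch below assumes gamma-corrected 8-bit R, G, and B values; the function name is ours, introduced only for this example.

    import numpy as np

    # Matrix and offset taken from Equation 1.14 (gamma-corrected R, G, B in 0-255).
    YCBCR_MATRIX = np.array([[ 0.257,  0.504,  0.098],
                             [-0.148, -0.291,  0.439],
                             [ 0.439, -0.368, -0.071]])
    YCBCR_OFFSET = np.array([16.0, 128.0, 128.0])

    def rgb_to_ycbcr(rgb):
        """Convert an (..., 3) RGB array to YCbCr using Equation 1.14."""
        return rgb.astype(np.float64) @ YCBCR_MATRIX.T + YCBCR_OFFSET

    print(rgb_to_ycbcr(np.array([255, 255, 255])))   # white -> Y ~ 235, Cb = Cr = 128
    print(rgb_to_ycbcr(np.array([0, 0, 0])))         # black -> Y = 16,  Cb = Cr = 128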

1.2.2.6 Color Masking and Its Application in Video Compression

It is well known that the HVS is much more sensitive to the luminance component Y than to the chrominance components U and V. Following Van Ness and Mullen [van ness 1967; mullen 1985], Mitchell, Pennebaker, Fogg, and LeGall included a figure in [mitchell 1996] to quantitatively illustrate the above statement. A modified version is shown in Figure 1.10. There, the abscissa represents spatial frequency in units of cycles per degree (cpd), while

FIGURE 1.10 Contrast sensitivity versus spatial frequency for the luminance, red-green, and blue-yellow channels. (From Van Ness, F.I. and Bouman, M.A., J. Opt. Soc. Am. 57, 401, 1967; Mullen, K.T., J. Physiol., 359, 381, 1985.)


the ordinate is the contrast sensitivity defined for the sinusoidal testing signal. Two observations are in order. First, for each of the three curves, i.e., the curves for the luminance component Y and the chrominance components U and V, the contrast sensitivity generally increases when spatial frequency increases. This agrees with the frequency masking discussed above. Second, for the same contrast sensitivity, we see that the luminance component corresponds to a much higher spatial frequency. This is an indication that the HVS is more sensitive to luminance than to chrominance. This statement can also be confirmed, perhaps more easily, by examining those spatial frequencies at which all three curves have data available. Then we can see that the contrast sensitivity of luminance is much lower than that of the chrominance components.

The direct impact of color masking on image and video coding is that by utilizing this psychovisual feature, we can allocate more bits to the luminance component than to the chrominance components. This leads to a common practice in color image and video coding: using full resolution for the intensity component, while using a 2:1 subsampling both horizontally and vertically for the two chrominance components. This has been adopted in the related international coding standards, discussed in Chapter 16.
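A minimal sketch of this chroma subsampling is given below, assuming the Cb and Cr planes are stored as NumPy arrays; the sample-count arithmetic in the comment uses the 288 × 352 frame of Section 1.1.

    import numpy as np

    def subsample_chroma(cb, cr):
        """2:1 subsampling of the chrominance planes in both directions."""
        return cb[::2, ::2], cr[::2, ::2]

    # For a 288 x 352 frame: full-resolution Y plus quarter-size Cb and Cr need
    # 288*352 + 2*(144*176) = 152,064 samples instead of 3*288*352 = 304,128,
    # i.e., half the data before any entropy coding is applied.
    cb = np.zeros((288, 352))
    cr = np.zeros((288, 352))
    cb_s, cr_s = subsample_chroma(cb, cr)
    print(cb_s.shape, cr_s.shape)    # (144, 176) (144, 176)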

1.2.2.7 Summary: Differential Sensitivity

In this section, we have discussed luminance, texture, frequency, temporal, and color maskings. Before we enter Section 1.3, let us summarize what we have discussed so far.

We see that luminance masking, also known as contrast masking, is of fundamental importance among the several types of masking. It states that the sensitivity of the eyes to a stimulus depends on the intensity of another stimulus. Thus it is a differential sensitivity. Both the texture (detail or activity) and the frequency of another stimulus significantly influence this differential sensitivity. The same mechanism exists in color perception, where the HVS is more sensitive to luminance than to chrominance. Therefore, we conclude that differential sensitivity is the key in studying human visual perception.

These features can be utilized to eliminate psychovisual redundancy, and thus compress image and video data.

It is also noted that variable quantization, which depends on activity and luminance in different regions, seems to be reasonable from a data compression point of view. Its practical applicability, however, is somewhat questionable. That is, some experimental work does not support this expectation [mitchell 1996].

It is noted that this differential sensitivity feature of the HVS is common to human perception. For instance, there is also forward and backward temporal masking in human audio perception.

1.3 Visual Quality Measurement

As the definition of image and video compression indicates, image and video quality is an important factor in dealing with image and video compression. For instance, in evaluating two different compression methods, we have to base the evaluation on some definite image and video quality. When both methods achieve the same quality of reconstructed image and video, the one that requires less data is considered to be superior to the other. Alternatively, with the same amount of data, the method providing a higher-quality reconstructed image or video is considered the better method. Note that here we have not considered other performance criteria, such as computational complexity.


Surprisingly, however, it turns out that the measurement of image and video quality is not straightforward. There are two types of visual quality assessment. One is objective assessment (using electrical measurements), and the other is subjective assessment (using human observers). Each has its own merits and demerits. A combination of these two methods is now widely utilized in practice. In this section, we will first discuss subjective visual quality measurement, followed by objective quality measurement.

1.3.1 Subjective Quality Measurement

It is natural that the visual quality of reconstructed video frames should be judged by human viewers if they are to be the ultimate receivers of the data (see Figure 1.1). Therefore, the subjective visual quality measure plays an important role in visual communications.

In subjective visual quality measurement, a set of video frames is generated with varying coding parameters. Observers are invited to subjectively evaluate the visual quality of these frames. Specifically, observers are asked to rate the pictures by giving some measure of picture quality. Alternatively, observers are requested to provide some measure of impairment to the pictures. A five-scale rating system of the degree of impairment, used by Bell Laboratories, is listed below [sakrison 1979]. It has been adopted as one of the standard scales in CCIR Recommendation 500-3 [CCIR 1986]. Note that CCIR is now ITU-R (the Radiocommunication Sector of the International Telecommunication Union).

1. Impairment is not noticeable.

2. Impairment is just noticeable.

3. Impairment is definitely noticeable, but not objectionable.

4. Impairment is objectionable.

5. Impairment is extremely objectionable.

In the subjective evaluation, there are a few things worth mentioning. In most applications, there is a whole array of pictures simultaneously available for evaluation. These pictures are generated with different encoding parameters. By keeping some parameters fixed while making one parameter (or a subset of parameters) free to change, the resulting quality rating can be used to study the effect of that one parameter (or subset of parameters) on encoding. An example using this method to study the effect of varying the number of quantization levels on image quality can be found in [gonzalez 1992].

Another possible way to study subjective evaluation is to identify pictures with the same subjective quality measure from the whole array of pictures. From this subset of test pictures, we can produce, in the encoding parameter space, isopreference curves that can be used to study the effect of the parameter(s) under investigation. An example using this method to study the effect of varying both image resolution and the number of quantization levels on image quality can be found in [huang 1965].

In the rating, a whole array of pictures is usually divided into columns, with each column sharing some common conditions. The evaluation starts within each column with a pairwise comparison. This is because pairwise comparison is relatively easy for the eyes. As a result, pictures in one column are arranged in an order according to visual quality, and quality or impairment measures are then assigned to the pictures in that column. After each column has been rated, unification between columns is necessary. That is, different columns need to have a unified quality measurement. As pointed out in [sakrison 1979], this task is not easy because it means we may need to equate impairment that results from different types of errors.


One thing is understood from the above discussion: subjective evaluation of visual quality is costly, and it needs a large number of pictures and observers. The evaluation takes a long time because human eyes are easily fatigued and bored. Some special measures have to be taken to arrive at an accurate subjective quality measure. Examples in this regard include averaging subjective ratings and taking their deviation into consideration. For further details on subjective visual quality measurement, readers may refer to [sakrison 1979; hidaka 1990; webster 1993].

1.3.2 Objective Quality Measurement

In this section, we first introduce the concept of the signal to noise ratio (SNR), which is a popularly utilized objective quality assessment. Then we present a promising new objective visual quality assessment technique based on human visual perception.

1.3.2.1 Signal to Noise Ratio

Consider Figure 1.11, where f(x, y) is the input image to a processing system. The system can be a low-pass filter, a subsampling system, or a compression system.

It can even represent a process in which AWGN corrupts the input image. The g(x, y) is the output of the system. In evaluating the quality of g(x, y), we define an error function e(x, y) as the difference between the input and the output. That is,

e(x, y) = f(x, y) - g(x, y).   (1.15)

The mean square error, Ems, is defined as

Ems = (1/MN) Σ_{x=0..M-1} Σ_{y=0..N-1} e(x, y)^2,   (1.16)

where M and N are the dimensions of the image in the horizontal and vertical directions. Note that it is sometimes denoted by MSE. The root mean square error, Erms, is defined as

Erms = √Ems.   (1.17)

It is sometimes denoted by RMSE. As noted earlier, SNR is widely used in objective quality measurement. Depending on

whether mean square error or root mean square error is used, the SNR may be called themean square signal to noise ratio, SNRms, or the root mean square signal to noise ratio,SNRrms. We have

SNR_{ms} = 10 \log_{10} \left( \frac{\sum_{x=0}^{M-1} \sum_{y=0}^{N-1} g(x, y)^2}{MN \cdot E_{ms}} \right), \qquad (1.18)

FIGURE 1.11 An image processing system: input f(x, y), processing system, output g(x, y).


and

SNR_{rms} = \sqrt{SNR_{ms}}. \qquad (1.19)

In image and video data compression, another closely related term, PSNR (peak signal to noise ratio), which is essentially a modified version of SNRms, is widely used. It is defined as follows:

PSNR = 10 \log_{10} \left( \frac{255^2}{E_{ms}} \right). \qquad (1.20)
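To make Equations 1.15 through 1.20 concrete, the following minimal sketch (in Python, assuming NumPy arrays and 8-bit images with a peak value of 255) computes Ems, Erms, SNRms, and PSNR for a pair of images; the test images are synthetic and purely illustrative.

import numpy as np

def quality_metrics(f, g):
    # f: original image, g: processed image; both 2-D arrays of the same shape.
    f = f.astype(np.float64)
    g = g.astype(np.float64)
    e = f - g                                  # error image, Equation 1.15
    mse = np.mean(e ** 2)                      # Ems, Equation 1.16
    rmse = np.sqrt(mse)                        # Erms, Equation 1.17
    snr_ms = 10 * np.log10(np.sum(g ** 2) / (g.size * mse))  # SNRms, Equation 1.18
    psnr = 10 * np.log10(255.0 ** 2 / mse)     # PSNR, Equation 1.20 (8-bit peak assumed)
    return mse, rmse, snr_ms, psnr

# Illustrative test: an original image versus a coarsely requantized version of it.
rng = np.random.default_rng(0)
f = rng.integers(0, 256, size=(64, 64))
g = (f // 16) * 16 + 8                         # keep only 4 bits per pixel
print(quality_metrics(f, g))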

The interpretation of the SNR is that the larger the SNR (SNRms, SNRrms, or PSNR), the better the quality of the processed image, g(x, y). That is, the closer the processed image g(x, y) is to the original image f(x, y). This seems correct; however, from our earlier discussion about the features of the HVS, we know that the HVS does not respond to visual stimuli in a straightforward way. Its low-level processing unit is known to be nonlinear. Several masking phenomena exist. Each confirms that the visual perception of the HVS is not simple. It is worth noting that our understanding of the high-level processing unit of the HVS is far from complete. Therefore, we may understand that the SNR does not always provide us with reliable assessments of image quality. One good example is presented in Section 1.2.2.3, which uses the IGS quantization technique to achieve high compression (using only four bits for quantization instead of the usual eight bits) without introducing noticeable false contouring. In this case, the subjective quality is high, and the SNR decreases due to low-frequency quantization noise and additive high-frequency random noise. Another example drawn from our discussion about the masking phenomena is that some additive noise in bright areas or in highly textured regions may be masked, whereas some minor artifacts in dark and uniform regions may turn out to be quite annoying. In this case, the SNR cannot truthfully reflect visual quality.

On the one hand, we see that the objective quality measure does not always provide reliable picture quality assessment. On the other hand, however, its implementation is much faster and easier than that of the subjective quality measure. Furthermore, objective assessment is repeatable. Owing to these merits, objective quality assessment is still widely used despite this drawback.

It is noted that combining subjective and objective assessment has been a common practice in international coding-standard activity.

1.3.2.2 An Objective Quality Measure Based on Human Visual Perception

Introduced here is a new development in visual quality assessment, which is an objective quality measurement based on human visual perception [webster 1993]. Since it belongs to the category of objective assessment, it possesses merits such as repeatability and fast, easy implementation. On the other hand, being based on human visual perception, its assessment of visual quality agrees closely with that of subjective assessment. In this sense, the new method attempts to combine the merits of the two different types of assessment.

1.3.2.2.1 Motivation

Visual quality assessment is best conducted via the subjective approach because in this case the HVS is the ultimate viewer. The implementation of subjective assessment is, however, time-consuming, costly, and it lacks repeatability. On the other hand, although not always accurate, objective assessment is fast, easy, and repeatable. The motivation here is to develop an objective quality measurement system such that its quality assessment is very close to that obtained by using subjective assessment. To achieve this goal, this objective system is based on subjective assessment. That is, it uses the rating achieved via subjective assessment as a criterion to search for new objective measurements so as to have the objective rating as close to the subjective one as possible.

1.3.2.2.2 Methodology

The derivation of the objective quality assessment system is shown in Figure 1.12. The input testing video goes through a degradation block, resulting in degraded input video. The degradation block, or impairment generator, includes various video compression codecs (coder–decoder pairs) with bit rates ranging from 56 kbits/s to 45 Mbits/s, and other video operations. The input video and degraded input video form a pair of testing video, which is sent to a subjective assessment block as well as a statistical feature selection block.

A normal subjective visual quality assessment, as introduced in Section 1.3.1, is performed in the subjective assessment block, which involves a large panel of observers (e.g., 48 observers in [webster 1993]). In the statistical feature selection block, a variety of statistical operations are conducted and various statistical features are selected. Examples cover Sobel filtering, Laplacian operator, first-order differencing, moment calculation, fast Fourier transform, etc. Statistical measurements are then selected based on these statistical operations and features. An objective assessment is formed as follows:

s = a_0 + \sum_{i=1}^{l} a_i n_i, \qquad (1.21)

where s denotes the output rating of the objective assessment, or simply the objective measure, which is supposed to be a good estimate of the corresponding subjective score. The n_i, i = 1, . . . , l are selected objective measurements. The a_0, a_i, i = 1, . . . , l are coefficients in the linear model of the objective assessment.

The results of the objective and subjective assessments are applied to a statistical analysis block. In the statistical analysis block, the objective assessment rating is compared with the subjective assessment rating. The result of the comparison is fed back to the statistical feature selection block. The statistical measurements obtained in the statistical feature selection block are examined according to their performance in the assessment. A statistical measurement is regarded to be good if it can reduce by a significant amount the difference between the objective and the subjective assessments. The best measurement is determined via an exhaustive search among the various measurements. Note that the coefficients in Equation 1.21 are also examined in the statistical analysis block in a similar manner to that used for the measurements.

FIGURE 1.12 Block diagram of objective assessment based on subjective assessment: the input testing video and its degraded version feed both a subjective assessment block and a statistical feature selection block; an objective assessment formed from the selected measurements and coefficients is compared with the subjective assessment in a statistical analysis block, whose output is an estimate of the subjective assessment.

The measurements and coefficients determined after iterations result in an optimal objective assessment via Equation 1.21, which is finally passed to the last block as the output of the system. The whole process will become much clearer with the explanation provided below.

1.3.2.2.3 Results

The results reported in [webster 1993] are introduced here.

1.3.2.2.4 Information Features

As mentioned in Section 1.2.2, differential sensitivity is a key in human visual perception. Two selected features, perceived spatial information (SI) (the amount of spatial detail) and perceived temporal information (TI) (the amount of temporal luminance variation), involve pixel differencing. SI is defined as shown:

SI(f_n) = STD_s\{Sobel(f_n)\}, \qquad (1.22)

where STD_s denotes the standard deviation operator in the spatial domain, Sobel denotes the Sobel operation, and f_n denotes the nth video frame.

Temporal information is defined similarly:

TI(f_n) = STD_s\{\Delta f_n\}, \qquad (1.23)

where \Delta f_n = f_n - f_{n-1}, i.e., the successive frame difference.
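As an illustration of Equations 1.22 and 1.23, the short sketch below computes SI and TI for one frame; it assumes grayscale frames stored as NumPy arrays and realizes the Sobel operation as a gradient magnitude built from scipy.ndimage Sobel filters, which is one common interpretation of the operator named above.

import numpy as np
from scipy import ndimage

def spatial_information(frame):
    # SI(fn) = STDs{Sobel(fn)}, Equation 1.22; Sobel gradient magnitude assumed.
    frame = frame.astype(np.float64)
    gx = ndimage.sobel(frame, axis=1)
    gy = ndimage.sobel(frame, axis=0)
    return np.std(np.hypot(gx, gy))

def temporal_information(frame, previous_frame):
    # TI(fn) = STDs{fn - fn-1}, Equation 1.23.
    return np.std(frame.astype(np.float64) - previous_frame.astype(np.float64))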

1.3.2.2.5 Determined Measurements

The parameter l in Equation 1.21 is chosen as three. That is,

s = a_0 + a_1 n_1 + a_2 n_2 + a_3 n_3. \qquad (1.24)

The measurements n_1, n_2, and n_3 are formulated based on the above-defined information features: SI and TI.

Measurement n_1:

n_1 = RMS_t \left\{ 5.81 \, \frac{SI(of_n) - SI(df_n)}{SI(of_n)} \right\}, \qquad (1.25)

where RMS_t represents the root mean square value taken over the time dimension; of_n and df_n denote the original nth frame and the degraded nth frame, respectively. It is observed that n_1 is a measure of the relative change in the SI between the original frame and the degraded frame.


Measurement n_2:

n_2 = \mathcal{F}_t \{ 0.108 \cdot \mathrm{MAX}\{ [TI(of_n) - TI(df_n)], 0 \} \}, \qquad (1.26)

where

\mathcal{F}_t \{ y_t \} = STD_t \{ \mathrm{CONV}(y_t, [-1, 2, -1]) \}, \qquad (1.27)

where STD_t denotes the standard deviation operator with respect to time, and CONV indicates the convolution operation between its two arguments. It is understood that TI measures temporal luminance variation (temporal motion) and the convolution kernel, [-1, 2, -1], enhances the variation due to its high-pass filter nature. Therefore, n_2 measures the difference of TI between the original and degraded frames.

Measurement n_3:

n_3 = \mathrm{MAX}_t \left\{ 4.23 \cdot \log_{10} \frac{TI(df_n)}{TI(of_n)} \right\}, \qquad (1.28)

where MAX_t indicates taking of the maximum value over time. Therefore, measurement n_3 responds to the ratio between the TI of the degraded video and that of the original video. Distortion, such as block artifacts (discussed in Chapter 11) and motion jerkiness (discussed in Chapter 10), which occurs in video coding, will cause n_3 to be large.

1.3.2.2.6 Objective Estimator

The least square error procedure is applied to testing video sequences with measurements n_i, i = 1, 2, 3, determined above, to minimize the difference between the rating scores obtained from the subjective and the objective assessments, resulting in the estimated coefficients a_0 and a_i, i = 1, 2, 3. Consequently, the objective assessment of visual quality s becomes

s = 4.77 - 0.992\,n_1 - 0.272\,n_2 - 0.356\,n_3. \qquad (1.29)
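Assuming per-frame SI and TI values are already available for the original and degraded sequences (for example, from the sketch after Equation 1.23), the following sketch evaluates n_1, n_2, and n_3 per Equations 1.25 through 1.28 and the estimator of Equation 1.29; the boundary handling of the temporal convolution is an implementation assumption.

import numpy as np

def objective_score(si_orig, si_degr, ti_orig, ti_degr):
    # si_*/ti_*: per-frame SI and TI time series for the original and degraded video.
    si_orig = np.asarray(si_orig, dtype=np.float64)
    si_degr = np.asarray(si_degr, dtype=np.float64)
    ti_orig = np.asarray(ti_orig, dtype=np.float64)
    ti_degr = np.asarray(ti_degr, dtype=np.float64)

    # n1: RMS over time of the relative SI change, Equation 1.25.
    n1 = np.sqrt(np.mean((5.81 * (si_orig - si_degr) / si_orig) ** 2))

    # n2: temporal-variation filter applied to the clipped TI loss, Equations 1.26 and 1.27.
    y = 0.108 * np.maximum(ti_orig - ti_degr, 0.0)
    n2 = np.std(np.convolve(y, [-1.0, 2.0, -1.0], mode='valid'))   # boundary mode assumed

    # n3: maximum over time of the scaled log TI ratio, Equation 1.28.
    n3 = np.max(4.23 * np.log10(ti_degr / ti_orig))

    # Linear estimator of the subjective score, Equation 1.29.
    return 4.77 - 0.992 * n1 - 0.272 * n2 - 0.356 * n3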

1.3.2.2.7 Reported Experimental Results

It was reported that the correlation coefficient between the subjective and the objective assessment scores (an estimate of the subjective score) is in the range of 0.92 to 0.94. It is noted that a set of 36 testing scenes containing various amounts of SI and TI were used in the experiment. Hence, it is apparent that quite good performance was achieved.

Although there is surely room for further improvement, this work does open a new and promising way to assess visual quality by combining subjective and objective approaches. It is objective and thus fast and easy; and because it is based on the subjective measure, it is more accurate in terms of the high correlation to human perception. Theoretically, the SI and TI measures defined on differencing are very important. They reflect the most important aspect of human visual perception.

1.4 Information Theory Results

In the beginning of this chapter, it was noted that the term information is considered one of the fundamental concepts in image and video compression. We will now address some information theory results. In this section, measure of information and the entropy of an information source are covered first. We then introduce some coding theorems, which play a fundamental role in studying image and video compression.

1.4.1 Entropy

Entropy is a very important concept in information theory and communications. So it is in image and video compression. We first define the information content of a source symbol. Then we define entropy as the average information content per symbol for a discrete memoryless source.

1.4.1.1 Information Measure

As mentioned at the beginning of this chapter, information is defined as knowledge, fact, and news. It can be measured quantitatively. The carriers of information are symbols. Consider a symbol with an occurrence probability p. Its information content (i.e., the amount of information contained in the symbol), I, is defined as follows:

I = \log_2 \frac{1}{p} \ \text{bits} \quad \text{or} \quad I = -\log_2 p \ \text{bits}, \qquad (1.30)

where the bit is a contraction of binary unit. In Equation 1.30, we set the base of the logarithmic function to equal 2. It is noted that these results can be easily converted as follows for the case where r-ary digits are used for encoding:

I = -\log_r 2 \cdot \log_2 p \ \text{r-ary units}. \qquad (1.31)

Hence, from now on, we restrict our discussion to binary encoding.

According to Equation 1.30, the information contained within a symbol is a logarithmic function of its occurrence probability. The smaller the probability, the more information the symbol contains. This agrees with common sense. The occurrence probability is somewhat related to the uncertainty of the symbol. A small occurrence probability means large uncertainty. In this way, we see that the information content of a symbol is about the uncertainty of the symbol. It is noted that the information measure defined here is valid for both equally probable symbols and nonequally probable symbols [lathi 1998].

1.4.1.2 Average Information per Symbol

Now consider a discrete memoryless information source. By discreteness, we mean the source is a countable set of symbols. By memoryless, we mean the occurrence of a symbol in the set is independent of that of its preceding symbol. Take a look at a source of this type that contains m possible symbols {s_i, i = 1, 2, . . . , m}. The corresponding occurrence probabilities are denoted by {p_i, i = 1, 2, . . . , m}. According to the discussion in Section 1.4.1.1, the information content of a symbol s_i, I_i, is equal to I_i = -log_2 p_i bits. Entropy is defined as the average information content per symbol of the source. Obviously, the entropy, H, can be expressed as follows:

H = -\sum_{i=1}^{m} p_i \log_2 p_i \ \text{bits}. \qquad (1.32)


From this definition, we see that the entropy of an information source is a function of occurrence probabilities. It is straightforward to show that the entropy reaches the maximum when all symbols in the set are equally probable.
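The following small sketch illustrates Equations 1.30 and 1.32 for an assumed four-symbol source: it computes the per-symbol information content and the entropy, and checks that the equiprobable case gives the maximum.

import numpy as np

def entropy(p):
    # H = -sum p_i log2 p_i (bits per symbol), Equation 1.32; zero-probability symbols skipped.
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

probs = [0.5, 0.25, 0.125, 0.125]              # assumed four-symbol source
info = [-np.log2(q) for q in probs]            # per-symbol information, Equation 1.30
print(info)                                    # [1.0, 2.0, 3.0, 3.0] bits
print(entropy(probs))                          # 1.75 bits/symbol
print(entropy([0.25] * 4))                     # 2.0 bits/symbol: the equiprobable maximum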

1.4.2 Shannon’s Noiseless Source Coding Theorem

Consider a discrete, memoryless, stationary information source. In what is called source encoding, a code word is assigned to each symbol in the source. The number of bits in the code word is referred to as the length of the code word. The average length of code words is referred to as bit rate, expressed in the unit of bits per symbol.

Shannon’s noiseless source coding theorem states that for a discrete, memoryless, stationary information source, the minimum bit rate required to encode a symbol on average is equal to the entropy of the source. This theorem provides us with a lower bound in source coding. Shannon showed that the lower bound can be achieved when the encoding delay extends to infinity. By encoding delay, we mean the encoder waits and then encodes a certain number of symbols at once. Fortunately, with finite encoding delay, we can already achieve an average code word length fairly close to the entropy. That is, we do not have to actually sacrifice bit rate much to avoid long encoding delay, which involves high computational complexity and a large amount of memory space.

Note that the discreteness assumption is not necessary. We assume a discrete source simply because digital image and video are the focus here. The stationarity assumption is necessary in deriving the noiseless source coding theorem. This assumption may not be satisfied in practice. Hence, Shannon’s theorem is a theoretical guideline only. There is no doubt, however, that it is a fundamental theoretical result in information theory.

In summary, the noiseless source coding theorem, Shannon’s first theorem published in his celebrated paper [shannon 1948], is concerned with the case where both the channel and the coding system are noise free. The aim under these circumstances is coding compactness. The more compact, the better the coding. This theorem specifies the lower bound, which is the source entropy, and how to reach this lower bound.

One way to evaluate the efficiency of a coding scheme is to determine its efficiency with respect to the lower bound, i.e., entropy. The efficiency η is defined as follows:

\eta = \frac{H}{L_{avg}}, \qquad (1.33)

where H is entropy, and Lavg denotes the average length of the code words in the code. As the entropy is the lower bound, the efficiency never exceeds the unity, i.e., η ≤ 1. The same definition can be generalized to calculate the relative efficiency between two codes. That is,

\eta = \frac{L_{avg,1}}{L_{avg,2}}, \qquad (1.34)

where Lavg,1 and Lavg,2 represent the average code word length for code 1 and code 2, respectively. We usually put the larger of the two in the denominator, and η is called the efficiency of code 2 with respect to code 1. A complementary parameter of coding efficiency is coding redundancy, ζ, which is defined as

\zeta = 1 - \eta. \qquad (1.35)
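As a quick illustration of Equations 1.33 and 1.35, the sketch below compares the average code word length of a hypothetical prefix code against the source entropy; the probabilities and code lengths are chosen only to show the calculation.

import numpy as np

probs = np.array([0.5, 0.25, 0.125, 0.125])    # assumed source probabilities
lengths = np.array([1, 2, 3, 3])               # hypothetical code word lengths (bits)

H = -np.sum(probs * np.log2(probs))            # source entropy, Equation 1.32
L_avg = np.sum(probs * lengths)                # average code word length (bit rate)

eta = H / L_avg                                # coding efficiency, Equation 1.33
zeta = 1.0 - eta                               # coding redundancy, Equation 1.35
print(H, L_avg, eta, zeta)                     # for this code eta = 1.0 and zeta = 0.0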


1.4.3 Shannon’s Noisy Channel Coding Theorem

If a code has an efficiency of η = 1 (i.e., it reaches the lower bound of source encoding), then coding redundancy is ζ = 0. Now consider a noisy transmission channel. In transmitting the coded symbol through the noisy channel, the received symbols may be erroneous due to the lack of redundancy. On the other hand, it is well known that by adding redundancy (e.g., parity check bits) some errors occurring during the transmission over the noisy channel may be identified. The coded symbols are then resent. In this way, we see that adding redundancy may combat noise.

Shannon’s noisy channel coding theorem states that it is possible to transmit symbols over a noisy channel without error if the bit rate is below the channel capacity, C. That is,

R < C, \qquad (1.36)

where R denotes the bit rate. The channel capacity is determined by the noise and signal power.

In conclusion, the noisy channel coding theorem, Shannon’s second theorem [shannon 1948], is concerned with a noisy, memoryless channel. By memoryless, we mean the channel output corresponding to the current input is independent of the output corresponding to previous input symbols. Under these circumstances, the aim is reliable communication. To be error-free, the bit rate cannot exceed channel capacity. That is, channel capacity sets an upper bound on the bit rate.

1.4.4 Shannon’s Source Coding Theorem

As seen in Sections 1.4.2 and 1.4.3, the noiseless source coding theorem defines the lowest possible bit rate for noiseless source coding and noiseless channel transmission; whereas the noisy channel coding theorem defines the highest possible coding bit rate for error-free transmission. Therefore, both theorems work for reliable (no error) transmission. In this section, we continue to deal with discrete memoryless information sources, but we discuss the situation in which lossy coding is encountered. As a result, distortion of the information sources takes place. For instance, quantization, discussed in Chapter 2, causes information loss. Therefore, it is concluded that if an encoding procedure involves quantization, then it is lossy coding. That is, errors occur during the coding process, even though the channel is error-free. We want to find the lower bound of the bit rate for this case.

The source coding theorem [shannon 1948] states that for a given distortion D, there exists a rate distortion function R(D) [berger 1971], which is the minimum bit rate required to transmit the source with distortion less than or equal to D. That is, to have distortion not larger than D, the bit rate, R, must satisfy the following condition:

R \geq R(D). \qquad (1.37)

A more detailed discussion about this theorem and the rate distortion function is given in Chapter 15, which deals with video coding.

1.4.5 Information Transmission Theorem

It is clear that by combining the noisy channel coding theorem and the source coding theorem, we can derive the following relationship:

C \geq R(D). \qquad (1.38)


This is called the information transmission theorem [slepian 1973]. It states that if the channel capacity of a noisy channel, C, is larger than the rate distortion function R(D), then it is possible to transmit an information source with distortion D over a noisy channel.

1.5 Summary

In this chapter, we first discussed the necessity for image and video compression. It is shown that image and video compression has become an enabling technique in today’s exploding number of digital multimedia applications. Then, we show that the feasibility of image and video compression rests in redundancy removal. Three types of redundancy are studied: statistical redundancy, coding redundancy, and psychovisual redundancy. Statistical redundancy comes from interpixel correlation. By interpixel correlation, we mean correlation between pixels either located in one frame (spatial or intraframe redundancy) or located in successive frames (temporal or interframe redundancy). Psychovisual redundancy is based on the features (several types of masking phenomena) of human visual perception. That is, visual information is not perceived equally from the human visual point of view. In this sense, some information is psychovisually redundant. Coding redundancy is related to the coding technique.

The visual quality of reconstructed image and video is a crucial criterion in the evaluation of the performance of visual transmission or storage systems. Both subjective and objective assessments are discussed. A new and promising objective technique based on subjective assessment is introduced. Because it combines the merits of both types of visual quality assessment, it achieves a quite satisfactory performance. The selected statistical features reveal some possible mechanism of the human visual perception. Further study in this regard would be fruitful.

In the last section, we introduced some fundamental information theory results relevant to image and video compression. The results introduced include information measurement, entropy, and several theorems. All the theorems assume discrete, memoryless, and stationary information sources. The noiseless source coding theorem points out that the entropy of an information source is the lower bound of the coding bit rate that a source encoder can achieve. The source coding theorem deals with lossy coding applied in a noise-free channel. It states that for a given distortion, D, there is a rate distortion function, R(D). Whenever the bit rate in the source coding is greater than R(D), the reconstructed source at the receiving end satisfies the fidelity requirement defined by D. The noisy channel coding theorem states that, to achieve error-free performance, the source coding bit rate must be smaller than the channel capacity. Channel capacity is a function of noise and signal power. The information transmission theorem combines the noisy channel coding theorem and the source coding theorem. It states that it is possible to have a reconstructed waveform at the receiving end, satisfying the fidelity requirement corresponding to distortion, D, if the channel capacity, C, is larger than the rate distortion function, R(D). Although some of the assumptions on which these theorems were developed may not be valid in complicated practical situations, these theorems provide important theoretical limits for image and video coding. They can also be used for evaluation of the performance of different coding techniques.

Exercises

1. Using your own words, define spatial and temporal redundancy, and psychovisual redundancy, and state the impact they have on image and video compression.


2. Why is differential sensitivity considered the most important feature in human visual perception?

3. From the description of the newly developed objective assessment technique based on subjective assessment (Section 1.3), what points do you think are related to and support the statement made in problem 2?

4. Using your own words, interpret Weber's law.

5. What is the advantage possessed by color models that decouple the luminance component from the chrominance components?

6. Why has the HSI model not been adopted by any TV systems?

7. What is the problem with the objective visual quality measure of PSNR?

References

[berger 1971] T. Berger, Rate Distortion Theory, Prentice-Hall, Englewood Cliffs, NJ, 1971.
[CCIR 1986] CCIR Recommendation 500-3, Method for the subjective assessment of the quality of television pictures, Recommendations and Reports of the CCIR, 1986, XVIth Plenary Assembly, Volume XI, Part 1.
[connor 1972] D.J. Connor, R.C. Brainard, and J.O. Limb, Interframe coding for picture transmission, Proceedings of the IEEE, 60, 7, 779–790, July 1972.
[fink 1957] D.G. Fink, Television Engineering Handbook, McGraw-Hill, New York, 1957, Sect. 10.7.
[goodall 1951] W.M. Goodall, Television by pulse code modulation, Bell System Technical Journal, 30, 33–49, January 1951.
[gonzalez 1992] R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison Wesley, Reading, MA, 1992.
[haskell 1996] B.G. Haskell, A. Puri, and A.N. Netravali, Digital Video: An Introduction to MPEG-2, Chapman and Hall, ITP, New York, 1996.
[hidaka 1990] T. Hidaka and K. Ozawa, Subjective assessment of redundancy-reduced moving images for interactive application: Test methodology and report, Signal Processing: Image Communication, 2, 201–219, 1990.
[huang 1965] T.S. Huang, PCM picture transmission, IEEE Spectrum, 2, 12, 57–63, 1965.
[huang 1998] J. Huang and Y.Q. Shi, Adaptive image watermarking scheme based on visual masking, IEE Electronics Letters, 34, 8, 748–750, April 1998.
[kretzmer 1952] E.R. Kretzmer, Statistics of television signal, Bell System Technical Journal, 31, 4, 751–763, July 1952.
[lathi 1998] B.P. Lathi, Modern Digital and Analog Communication Systems, 3rd edn., Oxford University Press, New York, 1998.
[legge 1980] G.E. Legge and J.M. Foley, Contrast masking in human vision, Journal of Optical Society of America, 70, 12, 1458–1471, December 1980.
[lim 1990] J.S. Lim, Two-Dimensional Signal and Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[mitchell 1996] J.L. Mitchell, MPEG Video: Compression Standard, J.L. Mitchell, W.B. Pennebaker, C.E. Fogg, and D.J. LeGall (Eds.), Chapman and Hall, ITP, New York, 1996.
[mounts 1969] F.W. Mounts, A video encoding system with conditional picture-element replenishment, Bell System Technical Journal, 48, 7, 2545–2554, September 1969.
[mullen 1985] K.T. Mullen, The contrast sensitivity of human color vision to red-green and blue-yellow chromatic gratings, Journal of Physiology, 359, 381–400, 1985.
[netravali 1977] A.N. Netravali and B. Prasada, Adaptive quantization of picture signals using spatial masking, Proceedings of the IEEE, 65, 536–548, April 1977.
[sakrison 1979] D.J. Sakrison, Image coding applications of vision model, in Image Transmission Techniques, W.K. Pratt (Ed.), Academic Press, New York, 1979, pp. 21–71.
[seyler 1962] A.J. Seyler, The coding of visual signals to reduce channel-capacity requirements, Proceedings of the I.E.E., 109C, 676–684, 1962.
[seyler 1965] A.J. Seyler, Probability distributions of television frame difference, Proceedings of IREE (Australia), 26, 355–366, November 1965.
[shannon 1948] C.E. Shannon, A mathematical theory of communication, Bell System Technical Journal, 27, 379–423 (Part I), July 1948, 623–656 (Part II), October 1948.
[slepian 1973] D. Slepian (Ed.), Key Papers in the Development of Information Theory, IEEE Press, New York, 1973.
[van ness 1967] F.I. Van Ness and M.A. Bouman, Spatial modulation transfer in the human eye, Journal of Optical Society of America, 57, 3, 401–406, March 1967.
[watson 1987] A.B. Watson, Efficiency of a model human image code, Journal of Optical Society of America A, 4, 12, 2401–2417, December 1987.
[webster 1993] A.A. Webster, C.T. Jones, and M.H. Pinson, An objective video quality assessment system based on human perception, Proceedings of Human Vision, Visual Processing and Digital Display IV, J.P. Allebach and B.E. Rogowitz (Eds.), SPIE Proceedings, 1913, 15–26, September 1993.


2 Quantization

After an introduction to image and video compression was presented in Chapter 1, we address several fundamental aspects of image and video compression in the remaining chapters of Part I. Chapter 2, as the first chapter in the series, deals with quantization. Quantization is a necessary component in lossy coding and has direct impact on the bit rate and the distortion of reconstructed image or video. We will discuss concepts, principles, and various quantization techniques, which include uniform and nonuniform quantization, optimum quantization, and adaptive quantization.

2.1 Quantization and the Source Encoder

The functionality of image and video compression in the applications of visual communications and storage is depicted in Figure 1.1. In the context of visual communications, the whole system may be illustrated as shown in Figure 2.1. In the transmitter, the input analog information source is converted to a digital format in the A/D converter block. The digital format is compressed through the image and video source encoder. In the channel encoder, some redundancy is added to help combat noise and, hence, transmission error. Modulation makes digital data suitable for transmission through the analog channel, such as air space in the application of a TV broadcast. At the receiver, the counterpart blocks reconstruct the input visual information. As far as storage of visual information is concerned, the blocks of channel, channel encoder, channel decoder, modulation, and demodulation may be omitted (Figure 2.2). If input and output are required to be in the digital format in some applications, then the A/D and D/A converters are omitted from the system. If they are required, however, other blocks, such as encryption and decryption, can be added to the system [sklar 1988]. Hence, what is depicted in Figure 2.1 is a conceptually fundamental block diagram of a visual communication system.

This book is mainly concerned with source encoding and decoding. To this end, we take a step further. That is, we show block diagrams of a source encoder and decoder (Figure 2.3). As shown in Figure 2.3a, there are three components in the source encoding: transformation, quantization, and code word assignment. After the transformation, some form of an input information source is presented to a quantizer. In other words, the transformation block decides which types of quantities from the input image and video are to be encoded. It is not necessary that the original image and video waveform be quantized and coded: we will show that some formats obtained from the input image and video are more suitable for encoding. An example is the difference signal. From the discussion of interpixel correlation in Chapter 1, it is known that a pixel is normally highly correlated with its immediately horizontal or vertical neighboring pixel. Therefore, a better strategy is to encode the difference of gray level values between a pixel and its neighbor. Since these data are highly correlated, the difference usually has a smaller dynamic range. Consequently, the encoding is more efficient. This idea is discussed in detail in Chapter 3.

FIGURE 2.1 Block diagram of a visual communication system: transmitter (input visual information, A/D, source encoder, channel encoder, modulation), channel, and receiver (demodulation, channel decoder, source decoder, D/A, received visual information).

Another example is what is called transform coding (Chapter 4). There, instead of encoding the original input image and video, we encode a transform of the input image and video. Because the redundancy in the transform domain is reduced greatly, the coding efficiency is much higher compared with directly encoding the original image and video.

Note that the term transformation in Figure 2.3a is sometimes referred to as mapper and signal processing in the literature [gonzalez 1992; li 1995]. Quantization refers to a process that converts input data into a set of finitely many different values. Often, the input data to a quantizer is continuous in magnitude.

FIGURE 2.2 Block diagram of a visual storage system: input visual information, A/D, source encoder, storage; retrieval, source decoder, D/A, retrieved visual information.


FIGURE 2.3 Block diagram of (a) a source encoder (input information, transformation, quantization, code word assignment, code word string) and (b) a source decoder (code word string, code word decoder, inverse transformation, reconstructed information).

Hence, quantization is essentially discretization in magnitude, which is an important step in the lossy compression of digital image and video. (The reason that the term lossy compression is used here is shown shortly.) The input and output of quantization can be either scalars or vectors. The quantization with scalar input and output is called scalar quantization, whereas that with vector input and output is referred to as vector quantization. In this chapter, we discuss scalar quantization. Vector quantization is addressed in Chapter 9.

After quantization, code words are assigned to the finitely many different values, the output of the quantizer. Natural binary code (NBC) and variable-length code (VLC), introduced in Chapter 1, are two examples of this. Other examples are the widely utilized entropy code (including Huffman code and arithmetic code), dictionary code, and run-length code (RLC) (frequently used in facsimile transmission), which are covered in Chapters 5 and 6.

The source decoder, as shown in Figure 2.3b, consists of two blocks: code word decoder and inverse transformation. They are counterparts of the code word assignment and transformation in the source encoder. Note that there is no block that corresponds to quantization in the source decoder. The implication of this observation is the following. First, quantization is an irreversible process. That is, in general, there is no way to find the original value from the quantized value. Second, quantization is, therefore, a source of information loss. In fact, quantization is a critical stage in image and video compression. It has significant impact on the distortion of reconstructed image and video as well as the bit rate of the encoder. Obviously, coarse quantization results in more distortion and lower bit rate than fine quantization.

In this chapter, uniform quantization, which is the simplest yet the most important case, is discussed first. Nonuniform quantization is covered after that, followed by optimum quantization for both uniform and nonuniform cases. Then a discussion of adaptive quantization is provided. Finally, pulse code modulation (PCM), the best established and most frequently implemented digital coding method involving quantization, is described.

2.2 Uniform Quantization

Uniform quantization is the simplest, yet very popular, quantization technique. Conceptually, it is of great importance. Hence, we start our discussion on quantization with uniform quantization. Several fundamental concepts of quantization are introduced in this section.


2.2.1 Basics

This section concerns several basic aspects of uniform quantization. They are some fundamental terms, quantization distortion, and quantizer design.

2.2.1.1 Definitions

In Figure 2.4, the horizontal axis denotes the input to a quantizer, while the vertical axis represents the output of the quantizer. The relationship between the input and the output best characterizes this quantizer; this type of curve is referred to as the input–output characteristic of the quantizer. From the curve, it can be seen that there are nine intervals along the x-axis. Whenever the input falls in one of the intervals, the output assumes a corresponding value. The input–output characteristic of the quantizer is staircase-like and, hence, clearly nonlinear.

The end points of the intervals are called decision levels, denoted by di with i being the index of intervals. The output of the quantization is referred to as the reconstruction level (also known as quantizing level [musmann 1979]), denoted by yi with i being its index. The length of the interval is called the step size of the quantizer, denoted by Δ. With the above terms defined, we can now mathematically define the function of the quantizer in Figure 2.4 as follows:

y_i = Q(x), \quad \text{if } x \in (d_i, d_{i+1}), \qquad (2.1)

where i = 1, 2, . . . , 9 and Q(x) is the output of the quantizer with respect to the input x.

FIGURE 2.4 Input–output characteristic of a uniform midtread quantizer with nine reconstruction levels y1, . . . , y9 and decision levels d1 = −∞, d2 = −3.5, . . . , d9 = 3.5, d10 = ∞ (step size Δ = 1).


It is noted that in Figure 2.4, Δ = 1. The decision levels and reconstruction levels are evenly spaced. It is a uniform quantizer because it possesses the following two features:

1. Except possibly the right-most and left-most intervals, all intervals (hence, decision levels) along the x-axis are uniformly spaced. That is, each inner interval has the same length.

2. Except possibly the outer intervals, the reconstruction levels of the quantizer are also uniformly spaced. Furthermore, each inner reconstruction level is the arithmetic average of the two decision levels of the corresponding interval along the x-axis.

The uniform quantizer depicted in Figure 2.4 is called a midtread quantizer. Its counterpart is called a midrise quantizer, in which the reconstructed levels do not include the value of zero. A midrise quantizer having step size Δ = 1 is shown in Figure 2.5. While midtread quantizers are usually utilized for an odd number of reconstruction levels, midrise quantizers are used for an even number of reconstruction levels.

Note that the input–output characteristic of both the midtread and midrise uniform quantizers, as depicted in Figures 2.4 and 2.5, respectively, is odd symmetric with respect to the vertical axis x = 0. In the rest of this chapter, our discussion develops under this symmetry assumption. The results thus derived will not lose generality since we can always subtract the statistical mean of input x from the input data and thus achieve this symmetry. After quantization, we can add the mean value back.

The total number of reconstruction levels of a quantizer is denoted by N. Figures 2.4 and 2.5 reveal that if N is even, then the decision level d(N/2)+1 is located in the middle of the input x-axis, i.e., d(N/2)+1 = 0. If N is odd, on the other hand, then the reconstruction level y(N+1)/2 = 0. This convention is important in understanding the design tables of quantizers in the literature.
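A minimal sketch of the uniform quantizers just described, assuming a step size Δ and a number of reconstruction levels N; an odd N gives a midtread quantizer and an even N a midrise quantizer, and inputs falling in the two outer intervals are mapped to the outermost reconstruction levels.

import numpy as np

def uniform_quantize(x, step, n_levels):
    # Uniform quantizer Q(x) with step size `step` and `n_levels` reconstruction levels.
    # Odd n_levels: midtread (zero is a reconstruction level).
    # Even n_levels: midrise (zero is a decision level).
    x = np.asarray(x, dtype=np.float64)
    if n_levels % 2:
        idx = np.round(x / step)               # nearest multiple of the step size
        idx_max = (n_levels - 1) // 2
    else:
        idx = np.floor(x / step) + 0.5         # half-integer multiples of the step size
        idx_max = n_levels / 2 - 0.5
    idx = np.clip(idx, -idx_max, idx_max)      # outer intervals map to the outermost levels
    return idx * step

# Example: the nine-level midtread quantizer of Figure 2.4 (step size 1).
x = np.array([-5.2, -0.3, 0.6, 2.4, 7.1])
print(uniform_quantize(x, step=1.0, n_levels=9))    # [-4. -0.  1.  2.  4.]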

FIGURE 2.5 Input–output characteristic of a uniform midrise quantizer with eight reconstruction levels y1, . . . , y8 and decision levels d1 = −∞, d2 = −3.0, . . . , d8 = 3.0, d9 = ∞ (step size Δ = 1).


2.2.1.2 Quantization Distortion

The source coding theorem presented in Chapter 1 states that for a certain distortion D, there exists a rate distortion function R(D), such that as long as the bit rate used is larger than R(D) it is possible to transmit the source with a distortion smaller than D. Since we cannot afford an infinite bit rate to represent an original source, some distortion in quantization is inevitable. We can also say that since quantization causes information loss irreversibly, we encounter quantization error and, consequently, an issue: how to evaluate the quality or, equivalently, the distortion of quantization. According to our discussion on visual quality assessment in Chapter 1, we know that there are two ways to do so: subjective evaluation and objective evaluation.

In terms of subjective evaluation, in Section 1.3.1 we introduced a five-scale rating adopted in CCIR Recommendation 500-3. We also described the false contouring phenomenon, which is caused by coarse quantization. That is, our human eyes are more sensitive to the relatively uniform regions in an image plane. Therefore, an insufficient number of reconstruction levels results in annoying false contours. In other words, more reconstruction levels are required in relatively uniform regions than in relatively nonuniform regions.

In terms of objective evaluation, in Section 1.3.2 we defined mean square error (MSE) and root mean square error (RMSE), signal to noise ratio (SNR) and peak signal to noise ratio (PSNR). In dealing with quantization, we define quantization error, eq, as the difference between the input signal and the quantized output:

e_q = x - Q(x), \qquad (2.2)

where x and Q(x) are input and quantized output, respectively. Quantization error is often referred to as quantization noise. It is a common practice to treat input x as a random variable with a probability density function (pdf) fX(x). Mean square quantization error, MSEq, can thus be expressed as

MSE_q = \sum_{i=1}^{N} \int_{d_i}^{d_{i+1}} (x - Q(x))^2 f_X(x)\,dx, \qquad (2.3)

where N is the total number of reconstruction levels. Note that the outer decision levels may be −∞ or +∞ (Figures 2.4 and 2.5). It is clear that, when the pdf, fX(x), remains unchanged, fewer reconstruction levels (smaller N) result in more distortion. That is, coarse quantization leads to large quantization noise. This confirms the statement that quantization is a critical component in a source encoder, which significantly influences both bit rate and distortion of the encoder. As mentioned earlier, the assumption that the input–output characteristic is odd symmetric with respect to the x = 0 axis implies that the mean of the random variable, x, is equal to zero, i.e., E(x) = 0. Therefore, MSEq is the variance of the quantization error eq, i.e., MSEq = σq².

The quantization noise associated with the midtread quantizer depicted in Figure 2.4 is shown in Figure 2.6. It is clear that the quantization noise is signal dependent. It is observed that, associated with the inner intervals, the quantization noise is bounded by ±0.5Δ. This type of quantization noise is referred to as granular noise. The noise associated with the right-most and the left-most intervals is unbounded as the input x approaches either −∞ or +∞. This type of quantization noise is called overload noise. Denoting the mean square granular noise and overload noise by MSEq,g and MSEq,o, respectively, we then have the following relations:

MSE_q = MSE_{q,g} + MSE_{q,o} \qquad (2.4)


FIGURE 2.6 Quantization noise of the uniform midtread quantizer shown in Figure 2.4: granular quantization noise over the inner intervals and overload quantization noise over the two outer intervals.

and

MSE_{q,g} = \sum_{i=2}^{N-1} \int_{d_i}^{d_{i+1}} (x - Q(x))^2 f_X(x)\,dx \qquad (2.5)

MSE_{q,o} = 2 \int_{d_1}^{d_2} (x - Q(x))^2 f_X(x)\,dx \qquad (2.6)
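To illustrate Equations 2.3 through 2.6, the sketch below estimates the total, granular, and overload mean square quantization error by Monte Carlo simulation for a zero-mean, unit-variance Gaussian input and a midtread uniform quantizer; the step size, number of levels, and sample size are arbitrary assumptions.

import numpy as np

def midtread_quantize(x, step, n_levels):
    # Uniform midtread quantizer; overload inputs are clipped to the outermost levels.
    idx_max = (n_levels - 1) // 2
    return np.clip(np.round(x / step), -idx_max, idx_max) * step

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)             # zero-mean, unit-variance Gaussian input
step, n_levels = 0.5, 9                        # assumed quantizer parameters
q = midtread_quantize(x, step, n_levels)
err2 = (x - q) ** 2                            # squared quantization error, Equation 2.2

# The two outer intervals start at the outermost finite decision levels +/-(N - 2)/2 * step.
outer = np.abs(x) > (n_levels - 2) / 2 * step
mse_total = err2.mean()                                   # Equation 2.3
mse_overload = err2[outer].sum() / x.size                 # Equation 2.6 (both outer intervals)
mse_granular = err2[~outer].sum() / x.size                # Equation 2.5
print(mse_total, mse_granular, mse_overload)              # total = granular + overload (Eq. 2.4)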

2.2.1.3 Quantizer Design

The design of a quantizer (either uniform or nonuniform) involves choosing the number of reconstruction levels, N (hence, the number of decision levels, N + 1), and selecting the values of decision levels and reconstruction levels (deciding where to locate them). In other words, the design of a quantizer is equivalent to specifying its input–output characteristic.

The optimum quantizer design can be stated as follows. For a given pdf of the input random variable, fX(x), determine the number of reconstruction levels, N, choose a set of decision levels {di, i = 1, . . . , N + 1} and reconstruction levels {yi, i = 1, . . . , N} such that the MSEq, defined in Equation 2.3, is minimized.

In the uniform quantizer design, the total number of reconstruction levels, N, is usually given. According to the two features of uniform quantizers described in Section 2.2.1.1, we know that the reconstruction levels of a uniform quantizer can be derived from the decision levels. Hence, only one of these two sets is independent. Furthermore, both decision levels and reconstruction levels are uniformly spaced except possibly the outer intervals. These constraints together with the symmetry assumption lead to the following observation: in fact, there is only one parameter that needs to be decided in uniform quantizer design, which is the step size Δ. As to the optimum uniform quantizer design, a different pdf leads to a different step size.


2.2.2 Optimum Uniform Quantizer

In this section, we first discuss optimum uniform quantizer design when the input x obeys uniform distribution. Then, we cover optimum uniform quantizer design when the input x has other types of probabilistic distributions.

2.2.2.1 Uniform Quantizer with Uniformly Distributed Input

Let us return to Figure 2.4, where the input–output characteristic of a nine reconstruction level midtread quantizer is shown. Now, consider that the input x is a uniformly distributed random variable. Its input–output characteristic is shown in Figure 2.7. We notice that the new characteristic is restricted within a finite range of x, i.e., −4.5 ≤ x ≤ 4.5. This is due to the definition of uniform distribution. Consequently, the overload quantization noise does not exist in this case, which is shown in Figure 2.8.

The mean square quantization error, MSEq, is found to be

MSE_q = N \int_{d_1}^{d_2} (x - Q(x))^2 \frac{1}{N\Delta}\,dx = \frac{\Delta^2}{12}. \qquad (2.7)

This result indicates that if the input to a uniform quantizer has a uniform distribution and the number of reconstruction levels is fixed, then the MSEq is directly proportional to the square of the quantization step size. Or, in other words, the root MSEq (the standard deviation of the quantization noise) is directly proportional to the quantization step. The larger the step size, the larger (according to square law) the MSEq. This agrees with our earlier observation: coarse quantization leads to large quantization error.

FIGURE 2.7 Input–output characteristic of a uniform midtread quantizer with input x having uniform distribution in [−4.5, 4.5].

FIGURE 2.8 Quantization noise of the quantizer shown in Figure 2.7: granular quantization noise only, with zero overload quantization noise.

As mentioned above, the MSEq is equal to the variance of the quantization noise, i.e., MSEq = σq². To find the SNR of the uniform quantization in this case, we need to determine the variance of the input x. Note that we assume the input x to be a zero mean uniform random variable. So, according to probability theory, we have

\sigma_x^2 = \frac{(N\Delta)^2}{12}. \qquad (2.8)

Therefore, the mean square signal to noise ratio, SNRms, defined in Chapter 1, is equal to

SNR_{ms} = 10 \log_{10} \frac{\sigma_x^2}{\sigma_q^2} = 10 \log_{10} N^2. \qquad (2.9)

Note that here we use the subscript ms to indicate the SNR in the mean square sense, as defined in Chapter 1. If we assume N = 2^n, we then have

SNR_{ms} = 20 \log_{10} 2^n = 6.02\,n \ \text{dB}. \qquad (2.10)

The interpretation of the above result is as follows. If we use the NBC to code the reconstruction levels of a uniform quantizer with a uniformly distributed input source, then every additional bit in the coding brings a 6.02 dB increase in the SNRms. An equivalent statement can be derived from Equation 2.7. That is, whenever the step size of the uniform quantizer is halved, the MSEq decreases to a quarter of its previous value.
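A short numerical check of Equations 2.7 through 2.10: a uniformly distributed input is quantized with an N = 2^n level midrise uniform quantizer matched to its range, and the simulated MSEq and SNRms are compared against Δ²/12 and 6.02n dB; the input range and sample size are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-4.5, 4.5, size=1_000_000)     # uniformly distributed input (arbitrary range)

for n_bits in (3, 4, 5):
    n_levels = 2 ** n_bits
    step = 9.0 / n_levels                      # cover [-4.5, 4.5] with N equal intervals
    q = (np.floor(x / step) + 0.5) * step      # midrise uniform quantizer matched to the range
    mse = np.mean((x - q) ** 2)
    snr = 10 * np.log10(np.var(x) / mse)
    # Simulated MSE vs. step**2 / 12 (Eq. 2.7); simulated SNR vs. 6.02 n dB (Eq. 2.10).
    print(n_bits, mse, step ** 2 / 12, snr, 6.02 * n_bits)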

2.2.2.2 Conditions of Optimum Quantization

The conditions under which the MSEq is minimized were derived [lloyd 1957, 1982; max 1960] for a given pdf of the quantizer input, fX(x).


TABLE 2.1
Optimal Symmetric Uniform Quantizer for Uniform, Gaussian, Laplacian, and Gamma Distributions (Having Zero Mean and Unit Variance)

      Uniform                Gaussian               Laplacian              Gamma
N     Δ       MSE            Δ       MSE            Δ       MSE            Δ       MSE
2     1.000   8.33 x 10^-2   1.596   0.363          1.414   0.500          1.154   0.668
4     0.500   2.08 x 10^-2   0.996   0.119          1.087   1.963 x 10^-1  1.060   0.320
8     0.250   5.21 x 10^-3   0.586   3.74 x 10^-2   0.731   7.17 x 10^-2   0.796   0.132
16    0.125   1.30 x 10^-3   0.335   1.15 x 10^-2   0.456   2.54 x 10^-2   0.540   5.01 x 10^-2

For each entry, Δ is the optimal step size (the numbers enclosed in rectangles in the original table) and MSE is the resulting mean square quantization error; the corresponding decision levels are 0, ±Δ, ±2Δ, . . . , ±(N/2)Δ and the reconstruction levels are ±Δ/2, ±3Δ/2, . . . , ±(N − 1)Δ/2, as listed in full in the source tables.

Sources: From Max, J., IRE Trans. Inf. Theory, IT-6, 7, 1960; Paez, M.D. and Glisson, T.H., IEEE Trans. Commun., COM-20, 225, 1972.


The MSEq was given in Equation 2.3. The necessary conditions for optimum (minimum MSE) quantization are as follows. That is, the derivatives of MSEq with respect to the di and yi have to be zero.

(d_i - y_{i-1})^2 f_X(d_i) - (d_i - y_i)^2 f_X(d_i) = 0, \quad i = 2, \ldots, N \qquad (2.11)

\int_{d_i}^{d_{i+1}} (x - y_i) f_X(x)\,dx = 0, \quad i = 1, \ldots, N \qquad (2.12)

The sufficient conditions can be derived accordingly by involving the second order derivatives [max 1960; fleischer 1964]. The symmetry assumption of the input–output characteristic made earlier holds here as well. These sufficient conditions are listed below:

1. d_1 = -\infty \ \text{and} \ d_{N+1} = +\infty \qquad (2.13)

2. \int_{d_i}^{d_{i+1}} (x - y_i) f_X(x)\,dx = 0, \quad i = 1, 2, \ldots, N \qquad (2.14)

3. d_i = \frac{1}{2}(y_{i-1} + y_i), \quad i = 2, \ldots, N \qquad (2.15)

Note that the first condition is for an input x whose range is −∞ < x < ∞. The interpretation of the above conditions is that each decision level (except for the outer intervals) is the arithmetic average of the two neighboring reconstruction levels, and each reconstruction level is the centroid of the area under the pdf fX(x) between the two adjacent decision levels.

Note that the above conditions are general in the sense that there is no restriction imposed on the pdf. In the next section, we discuss the optimum uniform quantization when the input of the quantizer assumes different distributions.
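Conditions 2.14 and 2.15 suggest the familiar iterative design procedure associated with [lloyd 1957; max 1960]: alternately place each decision level midway between neighboring reconstruction levels and move each reconstruction level to the centroid of its interval. The sketch below applies this to a zero-mean, unit-variance Gaussian pdf using numerical integration on a grid; the grid, truncation range, initialization, and iteration count are assumptions, and the result is the optimum (generally nonuniform) quantizer rather than the optimum uniform one.

import numpy as np

def lloyd_max_gaussian(n_levels, iters=100):
    # Alternate Equations 2.15 and 2.14 for a zero-mean, unit-variance Gaussian pdf,
    # approximated on a dense grid over an assumed truncation range.
    x = np.linspace(-8.0, 8.0, 20_001)
    fx = np.exp(-0.5 * x ** 2) / np.sqrt(2.0 * np.pi)

    y = np.linspace(-3.0, 3.0, n_levels)       # assumed initial reconstruction levels
    for _ in range(iters):
        d = 0.5 * (y[:-1] + y[1:])             # Equation 2.15: inner decision levels
        cell = np.searchsorted(d, x)           # index of the interval containing each grid point
        for i in range(n_levels):              # Equation 2.14: centroid of each interval
            w = fx[cell == i]
            if w.sum() > 0:
                y[i] = np.sum(x[cell == i] * w) / w.sum()
    return d, y

d, y = lloyd_max_gaussian(4)
print(np.round(d, 3))                          # inner decision levels
print(np.round(y, 3))                          # reconstruction levels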

2.2.2. 3 Optimum Unif orm Quan tizer with Different Inpu t Distri butions

Let us retur n to our disc ussion on the optimu m quan tizer des ign whose inpu t has uniformdistri bution. Becau se the input has unifo rm distrib ution, the outer int ervals are also finit e.For uniform distri bution, Equatio n 2.14 impl ies that each rec onstru ction level is thearithme tic averag e of the two correspo nding deci sion levels . Cons idering the two fea turesof a uniform quan tizer, pres ented in Se ction 2.2.1.1, we see that a unifo rm quan tizer isoptim um (minim izing the MSEq) when the inpu t has unifo rm dis tribution.

Whe n the inpu t x is uniform ly distribut ed in [ � 1, 1], the step size D of the optimu muniform quan tizer is list ed in Table 2.1 for the numb er of rec onstru ction levels, N , equ al to2, 4, 8, 16, and 32. From this table, we note that the M SE q of the uniform quan tization with auniform ly dis tributed input decre ases four times as N doub les. As menti oned in Secti on2.2.2.1, this is equiva lent to an increa se of SNRms by 6.02 dB as N doubles.

The derivation above is a special case, i.e., the uniform quantizer is optimum for auniformly distributed input. Normally, if the pdf is not uniform, the optimum quantizeris not a uniform quantizer. Due to the simplicity of uniform quantization, however, it maysometimes be desirable to design an optimum uniform quantizer for an input with another-than-uniform distribution.

� 2007 by Taylor & Francis Group, LLC.

Page 78: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Unde r the se circ umsta nces, howeve r, Equ ations 2.13 through 2.15 are not a set ofsimult aneous equati ons one can hope to sol ve with any ease . Nu merical procedur eswere sugge sted to solve for des ign of optimu m unifo rm quantiz ers. Max deri ved uniformquan tization step size D for an input with a Gau ssian distribut ion [max 1960 ]. Paez andGliss on foun d step size D for Lapl acian and Gamma -distribu ted input signals [paez 1972].These result s are listed in Table 2.1. Note that all three dis tributions have a zero mean andunit standard deviati on. If the mean is no t zero, only a shift in input is neede d whileappl ying these resu lts. If the standard deviati on is not unit, the tab ulated step size needs tobe multi plied by the standard devia tion. The theo retical MSE is also listed in Tabl e 2.1.Note that the subscri pt q associ ated wi th MSE has been dropped for the sake of notationalbrev ity from now on in the c hapter as long as it does no t cause con fusion.

2.3 N onuniform Q uantization

It is no t dif ficult to see that, exc ept fo r the special case of the uniform ly distri buted inputvariab le x, the optim um (minim um MSE, also denote d sometime s by MMSE) quan tizersshoul d be no nunifor m. Consi der a c ase in which the input rand om vari able obe ys theGaussi an distrib ution with a zer o mean and unit variance , and the numb er of reconst ruc-tion levels is fi nite. We natu rally conside r that havi ng deci sion levels mo re dense lylocated aroun d the middl e of the x-axi s, x ¼ 0 (high-p robabilit y dens ity region), andchoo sing deci sion levels mo re coarsely dis tributed in the range far away from the cen terof the x-axis (low-p robabil ity density regio n) will lead to less MSE. The strate gy adoptedhere is analogou s to the superiori ty of VLC over fi xed-length code (FLC) dis cussed inChapter 1.

2.3.1 Optimu m (Nonun iform) Quanti zation

Condi tions of optimu m quan tization were discusse d in Section 2.2.2.2. With some con-strai nts, these conditio ns were sol ved in a closed form [pante r 1951]. Th e equa tionscharac terizing these cond itions, howe ver, cannot be solved in a closed form in general .Lloyd and Max proposed an iterat ive proced ure to num erically solve the equati ons. Theoptimu m quantiz ers thus des igned are calle d Lloyd –Max quan tizers.

When input x obe ys Gau ssian distri bution, the solution to optimu m quan tizer design forfinit ely ma ny rec onstructi on levels N was obtain ed [lloy d 1957, 1982; max 1960]. That is,the decisio n and rec onstru ction levels tog ether with theoret ical min imum MSE and opti-mum SNR have bee n determine d. Follow ing this proce dure, the des ign fo r Laplaci an andGamma dis tribution were tabulated in [paez 1972]. These results are contai ned in Table 2.2.As st ated before, we see in the table onc e ag ain that uniform quantiz ation is optimal if theinput x is a uniform random variable.

Figure 2.9 [max 196 0] gives a perform ance compar ison betw een opti mum uniformquantization and optimum quantization for the case of a Gaussian-distributed inputwith a zero mean and unit variance. The abscissa represents the number of reconstructionlevels, N, and the ordinate the ratio between the error of the optimum quantizer and theerror of the optimum uniform quantizer. It can be seen that when N is small, the ratio isclose to one. That is, the performances are close. When N increases, the ratio decreases.Specifically, when N is large the nonuniform quantizer is about 20% to 30% more efficientthan the uniform optimum quantizer for the Gaussian distribution with a zero mean andunit variance.

� 2007 by Taylor & Francis Group, LLC.

Page 79: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

TABLE 2.2

Optimal Symmetric Quantizer for Uniform, Gaussian, Laplacian, and Gamma Distributions (The Uniform Distribution Is between [�1, 1];the Other Three Distributions Have Zero Mean and Unit Variance.)

Uniform Gaussian Laplacian Gamma

N di yi MSE di yi MSE di yi MSE di yi MSE

�1.000 �1 �1 �1�0.500 �0.799 �0.707 �0.577

2 0.000 8.333 10�2 0.000 0.363 0.000 0.500 0.000 0.6680.500 0.799 0.707 0.577

1.000 1 1 1�1.000 �1 �1 �1

�0.750 �1.510 �1.834 �2.108�0.500 �0.982 �1.127 �1.205

�0.250 �0.453 �0.420 �0.3024 0.000 2.083 10�2 0.000 0.118 0.000 1.7653 10�1 0.000 0.233

0.250 0.453 0.420 0.3020.500 �0.982 1.127 1.205

0.750 1.510 1.834 2.1081.000 1 1 1�1.000 �1 �1 �1

�0.875 �2.152 �3.087 �3.799�0.750 �1.748 �2.377 �2.872

�0.625 �1.344 �1.673 �1.944�0.500 �1.050 �1.253 �1.401

�0.375 �0.756 �0.833 �0.859�0.250 �0.501 �0.533 �0.504

�0.125 �0.245 �0.233 �0.1498 0.000 5.213 10�3 0.000 3.453 10�2 0.000 5.483 10�2 0.000 7.123 10�2

0.125 0.245 0.233 0.1490.250 0.501 0.533 0.504

0.375 0.756 0.833 0.8590.500 1.050 1.253 1.401

0.625 1.344 1.673 1.944

�2007

byTaylor

&Francis

Group,L

LC.

Page 80: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

0.750 1.748 2.377 2.8720.875 2.152 3.087 3.799

1.000 1 1 1�1.000 �1 �1 �1

�0.938 �2.733 �4.316 �6.085�0.875 �2.401 �3.605 �5.050

�0.813 �2.069 �2.895 �4.015�0.750 �1.844 �2.499 �3.407

�0.688 �1.618 �2.103 �2.798�0.625 �1.437 �1.821 �2.372

�0.563 �1.256 �1.540 �1.945�0.500 �1.099 �1.317 �1.623

�0.438 �0.942 �1.095 �1.300�0.375 �0.800 �0.910 �1.045

�0.313 �0.657 �0.726 �0.791�0.250 �0.522 �0.566 �0.588

�0.188 �0.388 �0.407 �0.386�0.125 �0.258 �0.266 �0.229

�0.063 �0.128 �0.126 �0.07216 0.000 1.303 10�3 0.000 9.503 10�3 0.000 1.543 10�2 0.000 1.963 10�2

0.063 0.128 0.126 0.0720.125 0.258 0.266 0.229

0.188 0.388 0.407 0.3860.250 0.522 0.566 0.588

0.313 0.657 0.726 0.7910.375 0.800 0.910 1.045

0.438 0.942 1.095 1.3000.500 1.099 1.317 1.623

0.563 1.256 1.540 1.9450.625 1.437 1.821 2.372

0.688 1.618 2.103 2.7980.750 1.844 2.499 3.407

0.813 2.069 2.895 4.0150.875 2.401 3.605 5.050

0.938 2.733 4.316 6.0851.000 1 1 1

Sources: From Lloyd, S.P., Least squares quantization in PCM, Institute of Mathematical Statistics Meeting, Atlantic City, NJ, September 1957; Lloyd, S.P., Least squaresquantization in PCM, IEEE Trans. Inf. Theory, IT-28, 129, 1982; Paez, M.D. and Glisson, T.H., IEEE Trans. Commun., COM-20, 225, 1972; Max, J., IRE Trans. Inf. Theory,IT-6, 7, 1960.

�2007

byTaylor

&Francis

Group,L

LC.

Page 81: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Err

or r

atio

Number of reconstruction levels N

1.0

2 6 10 14 18 22 26 30 34 38

0.9

0 8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

FIGURE 2.9Ratio of error for optimal quantizer to error for optimum uniform quantizer vs. number of reconstruction levels N.(minimum mean square error [MSE] for Gaussian distributed input with a zero mean and unit variance). (FromMax, J., IRE Trans. Inf. Theory, IT-6, 7, 1960. With permission.)

2.3.2 Companding Quantization

It is known that a speech signal usually has a large dynamic range. Moreover, its statisticaldistribution reveals that very low speech volumes predominate most voice communications.Specifically, by a 50% chance, the voltage characterizing detected speech energy is less than25% of the root mean square (RMS) value of the signal. Large amplitude values are rare:only by a 15% chance does the voltage exceed the RMS value [skalr 1988]. These statisticsnaturally lead to the need for nonuniform quantization with relatively dense decisionlevels in the small magnitude range and relatively coarse decision levels in the largemagnitude range.

When the bit rate is 8 bits=sample, the following companding technique [smith 1957],which realizes nonuniform quantization, is found to be extremely useful. Though speechcoding is not the main focus of this book, we briefly discuss the companding techniquehere as an alternative way to achieve nonuniform quantization.

The companding technique, also known as logarithmic quantization, consists of thefollowing three stages: compressing, uniform quantization, and expanding [gersho1977] as shown in Figure 2.10. First, it compresses the input signal with a logarithmic

Input OutputCompressing

Uniformquantization

Expanding

Nonuniform quantization

FIGURE 2.10Companding technique in achieving quantization.

� 2007 by Taylor & Francis Group, LLC.

Page 82: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

0

C (x )

x

Q [C (x )]

C (x )

(a) (b)

(c) (d)

0

Q [C (x )]

E {Q [C (x )]} Qn (x )

0

x

0

FIGURE 2.11Characteristics of companding techniques. (a) Compressing characteristic, (b) uniform quantizer characteristic,(c) expanding characteristic, and (d) nonuniform quantizer characteristic.

characteristi c, a nd second , it q uant ize s th e c ompressed input using a uniform quantizer.Fina lly, the uniform ly quantiz ed results are expan ded inverse ly. An illustrat ion of thecharac teristics of these three stages and the resultant nonuni form quan tization are shownin Figu re 2.11.

In practi ce, a pi ecewise linear app roximati on of the logarithm ic compr ession charac ter-istic is used. Ther e are two diffe rent ways. In No rth America, a m-law compr essioncharac teristic is used, which is defi ned as fo llows.

c ( x) ¼ xmaxln[ 1 þ m( jxj =xmax )]

ln(1 þ m) sgn x, (2: 16)

where sgn is a sign fun ction de fined as

sgn x ¼ þ1 if x � 0�1 if x < 0

�(2 : 17)

The m-law compr ession character istic is sh own in Figu re 2.12a. The standard value ofm is 255. Note from the figure that the case of m¼ 0 corresponds to uniform quantization.

� 2007 by Taylor & Francis Group, LLC.

Page 83: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

0

C(x

)/x m

ax

μ = 255

μ = 0

A = 87.6

A = 1

x/xmax

1.0

C(x

)/x m

ax

1.0

1.00 x/xmax 1.0

(a) m-law (b) A-law

FIGURE 2.12Compression characteristics.

In Europe, the A -law character istic is used. Th e A -law character istic is depicted inFigu re 2.12b, and is de fi ned as fo llows.

c( x) ¼xmax

A( j xj =xmax )1 þ ln A sgn x 0 <

j xjxmax

� 1A

xmax1 þ ln[ A (j xj =xmax )]

1 þ ln A sgn x 1A <

jxjxmax

< 1

8>><>>:

(2: 18)

It is no ted that the standard v alue of A is 87.6. Th e case of A ¼ 1 correspond s to uniformquan tization .

2.4 Adapt ive Q uantization

In the last sectio n, we st udied nonuni form quantiz ation , whos e motiv ation is to minimiz eMSEq . We foun d that no nuniform quan tization is nec essary if the pdf of the input rand omvari able x is not uniform . Co nsider an opti mum quan tizer for a Gaussian- distri buted inputwhen the num ber of rec onstruc tion levels N is eight. Its inpu t –outpu t charac teristi c can bederive d from Table 2.2 and is shown in Figu re 2.13. This curv e reveals that the deci sionlevels are densely locate d in the central region of the x-axi s and coarsely elsew here. In otherwords , the deci sion levels are dense ly dis tributed in the regio n having a high er probabi lityof occurren ce and coars ely distri buted in othe r regions. A logarithm ic compan ding tech-nique a lso allo cates deci sion levels densely in the small magni tude regi on, whi ch corre s-ponds to a high occurren ce probabili ty, but in a different way. We con clude thatnonu niform quantiz ation achieves minimum MSEq by distri buting decisio n levels accord-ing to the st atistics of the input random vari able.

These two types of nonu niform quantiz ers are both time-inva riant. Th at is, the y are notdesign ed for no nstatio nary inpu t signal s. Moreove r, even for a station ary inpu t signal , ifits pdf devi ates from that wi th which the optimu m quan tizer is design ed then a mismat chwill take place and the perform ance of the quan tizer wi ll deteri orate . Ther e are two maintypes of mismatch. One is calle d va riance mismat ch. That is, the pdf of input signal ismatch ed, while the vari ance is mismat ched. Anoth er type is pdf m ismatch. Noted thatthese two kinds of mismat ch also occur in optimu m uniform quantiz ation, because the rethe optimi zation is also achi eved based on the input statist ics assu mption. For a detai led

� 2007 by Taylor & Francis Group, LLC.

Page 84: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

−∞ −1.7479 −1.0500 −0.5005 0.5005 1.0500 1.7479 ∞

2.1519

1.3439

0.7560

0.2451

Y

X

FIGURE 2.13Input–output characteristic of the optimal quantizer for Gaussian distribution with zero mean, unit variance,and N¼ 8.

analysis of the effects of the two types of mismatch on quantization, readers are referred to[jayant 1984].

Adaptive quantization attempts to make the quantizer design adapt to the varying inputstatistics to achieve better performance. It is a means to combat the mismatch problemdiscussed above. By statistics, we mean the statistic mean, variance (or the dynamic range),and type of input pdf. When the mean of the input changes, differential coding (discussedin the next chapter) is a suitable method to handle the variation. For other types of cases,adaptive quantization is found to be effective. The price paid in adaptive quantization isprocessing delay and an extra storage requirement as seen below.

There are two different types of adaptive quantization: forward and backward adapta-tions. Before we discuss these, however, let us describe an alternative way to define quant-ization [jayant 1984]. Quantization can be viewed as a two-stage process (Figure 2.14). Thefirst stage is the quantization encoder and the second stage is the quantization decoder. Inthe encoder, the input to quantization is converted to the index of an interval into which the

Quantizationencoder

Quantizationdecoder

Interval index

Reconstructionlevel

Output yInput x

FIGURE 2.14A two-stage model of quantization.

� 2007 by Taylor & Francis Group, LLC.

Page 85: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

ReceiverTransmitter

Statistical parameters

Buffering Statisticalanalysis

Quantizationencoder

Quantizationdecoder

Intervalindex

Reconstructionlevel

Output yInput x

FIGURE 2.15Forward adaptive quantization.

inpu t x falls. Th is index is map ped to (the code word that repres ents) the reconst ruction levelcorre spondin g to the inter val in the deco der. Roughl y spe akin g, thi s de fi nition cons iders aquan tizer as a commu nication system in which the quan tization encoder is in the transmi tterside whil e the quantiz ation decoder is in the receiver side. In thi s sense , this de finition isbro ader than that of quantiz ation de fi ned in Figu re 2.3a.

2.4.1 Forwa rd Adaptive Quanti zation

A block diagram of forwa rd adapt ive quantiz ation is shown in Figure 2.15. Ther e, the inputto the quan tizer, x, is first split into block s, each with a certain length. Blo cks are stored in abuff er on e at a time. A st atistical analysis is then c arried out with resp ect to the block in thebuff er. Base d on the anal ysis, the quan tization encode r is set up, and the input data withinthe block are assigned indexe s of respectiv e int ervals. In additi on to the se indexes , theenc oder setting par ameters are sent to the quan tization deco der as side inf ormatio n. Theterm side comes from the fact that the am ount of bits used for codi ng the setting parame teris usually a small fract ion of the total amount of bits used.

The selectio n of block size is a critica l iss ue. If the size is small, the ada ptation to the localstatist ics will be effective , but the side informa tion needs to be sent freq uently. That is,mo re bits are used for sending the sid e inform ation. If the size is large, the bits used fo r sid einfor mation decrease . On the othe r hand, the ada ptation become s less sensi tive to c hangingstatist ics, and both proce ssing del ay and storage requi red increase. In pra ctice, a prop ercom promise betw een quan tity of sid e informati on a nd eff ectivenes s of ada ptation pro-duce s a good selection of the block size.

Exa mples of using fo rward mann er to ada pt quan tization to a changi ng inpu t vari ance(to comb at vari ance mismat ch) can be fo und in [jay ant 1984; sayood 199 6].

2.4.2 Back ward Adapt ive Quan tization

Figu re 2.16 shows a block diagram of back ward ada ptive quan tization . A clos e look at theblock diagram reveals that in both the quantization encoder and decoder the buffering andthe statistical analysis are carried out with respect to the output of quantization encoder. Inthis way, there is no need to send side information. The sensitivity of adaptation to the

� 2007 by Taylor & Francis Group, LLC.

Page 86: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Buffering

Statisticalanalysis

Quantizationencoder

Quantizationdecoder

Output yInput x

Buffering

Statisticalanalysis

Transmitter Receiver

FIGURE 2.16Backward adaptive quantization.

changing statistics wi ll be degraded , howeve r, since, inste ad of the original input, only isthe output of the quan tization enc oder use d in the statist ical anal ysis. Th at is, the quan ti-zation no ise is inv olved in the st atistical anal ysis.

2.4.3 Adaptive Quanti zation with a One-W ord Memory

Intuiti vely, it is expecte d that obse rving a suf ficient large number of input or outpu t(qua ntized) data is nec essary to track the changing statistics a nd then ada pt the quan tizersetting in adapt ive quantiz ation. Through an analysis , Jayant showe d that effect iveadapt ations can be rea lized with an expl icit mem ory of only one word. Th at is, eitherone inpu t sam ple, x, in forwa rd adapt ive quan tization or a quan tized outpu t, y, inbackwar d ada ptive quantiz ation is suf ficient [ja yant 197 3].

In [ja yant 1984], example s on step size adapt ation (with the number of to tal reconst ruc-tion levels larger than four) were give n. The ide a is as fo llows. If at momen t ti the inpu tsample xi falls into the outer inter val, then the st ep size at the next mo ment t i þ1 will beenlarge d by a factor of mi (multiplyi ng the cur rent step size by m i , m i > 1). On the othe r hand,if the input xi falls int o an inn er interva l close to x ¼ 0 the n, the multi plier is less than 1,i.e., mi < 1 . That is, the multi plier mi is sm all in the interval near x ¼ 0 and mo notonic allyincrea ses for an increa sed x. Its range varies from a sm all positive numb er less than 1 to anumb er larger than 1. In this way, the quantiz er ada pts itself to the inpu t to avo id overlo adas we ll as underl oad to achi eve better perform ance.

2.4.4 Switched Quantization

This is ano ther adapt ive quan tization schem e. A block diagram is shown in Figure 2.17. Itconsists of a bank of L quantizers. Each quantizer in the bank is fixed, but collectively they

� 2007 by Taylor & Francis Group, LLC.

Page 87: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Buffering Statisticalanalysis

Input x Output y

Q1

Q2

Q3

Q4

QL

FIGURE 2.17Switched quantization.

form a bank of quan tizers with a vari ety of input –outpu t charac teristics . Based on astatist ical anal ysis of rec ent inpu t or outp ut sam ples, a swit ch conne cts the curren t inputto one of the quan tizers in the bank such that the best pos sible perform ance may beachi eved. It is reported that in both video and spe ech appl ications, this sch eme hasshown improve d perform ance even when the num ber of quan tizers in the bank, L, istwo [jayant 1984]. Intere stingly, it is no ted that as L !1, the swit ched quan tizationconve rges to the adaptive quan tizer dis cussed above.

2. 5 Pul se Cod e Modu lati on

Pulse code mo dulation is closely related to quan tization , the fo cus of this chap ter. Furt her-mo re, as poi nted in [jay ant 1984], PCM is the ear liest, best-e stablishe d, and mo st fre-quentl y appl ied coding system despite the fact that it is the most bit-cons umingdigit izing system (s ince it enc odes each pixe l indep endentl y) as we ll as a ver y deman dingsystem in terms of bit error rate on the digital channel. Ther efore, we discuss the PCMtechni que in this secti on.

PCM is now the most importan t fo rm of pulse modul ation. The other forms of pulsemo dulation are pu lse amp litude mo dulati on (PAM), pulse width m odulation (PW M), andpulse position modul ation (PPM) , which are cov ered in mo st commu nication texts. Brie fl yspe aking, pulse modulati on links an anal og signal to a pulse train in the followi ng way.The analog signal is first sam pled (a discreti zation in time doma in). The sam pled values areuse d to modula te a pulse train. If the modul ation is ca rried out throug h the amp litude ofthe pulse train, it is calle d PA M. If the mo di fied param eter of the pulse trai n is the pulsewidt h, we the n have PWM . If the pulse width and magnitu de are constant — only theposit ion of pulses is mo dulate d by the sample valu es— we then encount er PPM. Anillus tration of these pulse modul ations is shown in Figure 2.18.

In PCM, an anal og signal is first sam pled. The sample d value is then quan tized. Fina lly,the quantiz ed value is enc oded , resultin g in a bit steam. Figure 2.19 provid es an example ofPCM. We see that through a sampling and a uniform quantization the PCM systemconverts the input analog signal, which is continuous in both time and magnitude, into adigital signal (discretized in both time and magnitude) in the form of a NBC sequence. Inthis way, an analog signal modulates a pulse train with a NBC.

� 2007 by Taylor & Francis Group, LLC.

Page 88: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

PPM

0

0

0

PWM

PAM

f (t )

t

t

t

t

t

0

0

The pulse train

FIGURE 2.18Pulse modulation.

By far, PCM is more popular than other types of pulse modulation because the codemodulation is much more robust against various noises than amplitude modulation,width modulation and position modulation. In fact, almost all coding techniques includea PCM component. In digital image processing, given digital images usually appear inPCM format. It is known that an acceptable PCM representation of monochrome picturerequires 6–8 bits=pixel [huang 1965]. It is used so commonly in practice that its per-formance normally serves as a standard against which other coding techniques arecompared.

Let us rec all the false contour ing phenome non, discusse d in textu re mask ing (Chapt er 1).It states that our eyes are more sensitive to relatively uniform regions in an image plane. Ifthe number of reconstruction levels is not large enough (coarse quantization) then someunnatural contours will appear. When frequency masking was discussed, it was noted thatby adding some high frequency signal before quantization, the false contouring can beeliminated to a great extent. This technique is called dithering. The high frequency signalused is referred to as a dither signal. Both false contouring and dithering were firstreported in [goodall 1951].

� 2007 by Taylor & Francis Group, LLC.

Page 89: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

0101 0101 0101 0101 0101 0100 0011 0011 0010 0001 0001 0001 0010 0010 0011

0100 0101 1000 1010 1100 1101 1110 1110 1111 1111 1111 1111 1110 1110

20

1110

9

8

7

6

13

26

18

17

16

15

14

12

1 2 3 4 5

y

x

29

2725

23 2824

21

22

19

d10000 y1

d20001 y2

d30010 y3

d40011 y4

d50100 y5

d60101 y6

d70110 y7

d80111 y8

d91000 y9

d101001 y10

d111010 y11

d121011 y12

d131100 y13

d141101 y14

d151110 y15

d16

d17

1111 y16

Output code (from left to right, from top to bottom):

FIGURE 2.19Pulse code modulation (PCM).

2.6 Summary

Quantization is a process in which a quantity having possibly infinitely many values isconverted to another quantity having only finitely many values. It is an important elementin source encoding that has significant impact on both bit rate and distortion of recon-structed images and video in visual communication systems. Depending on whether thequantity is a scalar or a vector, quantization is called either scalar or vector quantization. Inthis chapter, we considered only scalar quantization.

Uniform quantization is the simplest and yet the most important case. In uniformquantization, except for outer intervals, both decision levels and reconstruction levels areuniformly spaced. Moreover, a reconstruction level is the arithmetic average of the twocorresponding decision levels. In uniform quantization design, the step size is the onlyparameter that needs to be specified.

� 2007 by Taylor & Francis Group, LLC.

Page 90: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Optimum quantization implies minimization of the MSEq. When the input has a uniformdistribution, uniform quantization is optimum. For the sake of simplicity, a uni-form optimum quantizer is sometimes desired even when the input does not obey uniformdistribution. The design under these circumstances involves an iterative procedure.The design problem in cases where the input has Gaussian, Laplacian, or Gamma distri-bution was solved and the parameters are available.

When the constraint of uniform quantization is removed, the conditions for optimumquantization are derived. The resultant optimum quantizer is normally nonuniform.An iterative procedure to solve the design is established and the optimum design para-meters for Gaussian, Laplacian, and Gamma distribution are tabulated.

The companding technique is an alternative way to implement nonuniform quantiza-tion. Both nonuniform quantization and companding are time-invariant and hence notsuitable for nonstationary input. Adaptive quantization deals with nonstationary inputand combats the mismatch that occurs in optimum quantization design.

In adaptive quantization, buffering is necessary to store some recent input or sampledoutput data. A statistical analysis is carried out with respect to the stored recent data.Based on the analysis, the parameters of the quantizer are adapted to changing inputstatistics to achieve better quantization performance. There are two types of adaptivequantization: forward and backward adaptive quantizations. With the forward type, thestatistical analysis is derived from the original input data, whereas with the backwardtype, quantization noise is involved in the analysis. Therefore, the forward type usuallyachieves more effective adaptation than the backward type. The latter, however, doesnot need to send quantizer setting parameters as side information to the receiver side,since the output values of quantization encoder (based on which the statistics are analyzedand parameters of the quantizer are adapted) are available in both the transmitter andreceiver sides.

Switched quantization is another type of adaptive quantization. In this scheme, a bankof fixed quantizers is utilized, each quantizer having different input–output character-istics. A statistical analysis based on recent input decides which quantizer in the bank issuitable for the present input. The system then connects the input to this particularquantizer.

Nowadays, PCM is the most frequently used form of pulse modulation due to itsrobustness against noise. PCM consists of three stages: sampling, quantization, and encod-ing. First, the analog signals are sampled with a proper sampling frequency. Second, thesampled data are quantized using a uniform quantizer. Finally, the quantized values areencoded with NBC. It is the best established and most applied coding system. Despite itsbit-consuming feature, it is utilized in almost all coding systems.

Exercises

1. Using your own words, define quantization and uniform quantization. What are thetwo features of uniform quantization?

2. What is optimum quantization? Why is uniform quantization sometimes desired, evenwhen the input has a pdf different from uniform? How was this problem solved? Drawan input–output characteristic of an optimum uniform quantizer with an input obeyingGaussian pdf having zero mean, unit variance, and the number of reconstruction levels,N, equal to 8.

� 2007 by Taylor & Francis Group, LLC.

Page 91: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

3. What are the c onditions of optimu m no nunifor m quan tization ? From Table 2.2, whatobservations can you make?

4. Define variance mismatch and pdf mismatch. Discuss how you can resolve the mis-match problem.

5. What is the difference between forward and backward adaptive quantization?Comment on the merits and drawbacks for each.

6. What are PAM, PWM, PPM, and PCM? Why is PCM the most popular type of pulsemodulation?

References

[fleischer 1964] P.E. Fleischer, Sufficient conditions for achieving minimum distortion in quantizer,IEEE International Convention Records, Part I, 12, 104–111, 1964.

[gersho 1977] A. Gersho, Quantization, IEEE Communications Society Magazine, 16–29, September1977.

[gonzalez 1992] R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison-Wesley, Reading,MA, 1992.

[goodall 1951] W.M. Goodall, Television by pulse code modulation, Bell System Technical Journal, 30,33–49, January 1951.

[huang 1965] T.S. Huang, PCM picture transmission, IEEE Spectrum, 2, 57–63, December 1965.[jayant 1973] N.S. Jayant, Adaptive quantization with one word memory, Bell System Technical

Journal, 52, 1119–1144, September 1973.[jayant 1984] N.S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs,

NJ, 1984.[li 1995] W. Li and Y.-Q. Zhang, Vector-based signal processing and quantization for image and

video compression, Proceedings of the IEEE, 83, 2, 317–335, February 1995.[lloyd 1957] S.P. Lloyd, Least Squares Quantization in PCM, Institute of Mathematical Statistics

Meeting, Atlantic City, NJ, September 1957.[lloyd 1982] S.P. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory,

IT-28, 129–137, March 1982.[max 1960] J. Max, Quantizing for minimum distortion, IRE Transactions on Information Theory, IT-6,

7–12, 1960.[musmann 1979] H.G. Musmann, Predictive image coding, in Image Transmission Techniques,

W.K. Pratt (Ed.), Academic Press, NY, 1979.[paez 1972] M.D. Paez and T.H. Glisson, Minimum mean-squared-error quantization in speech PCM

and DPCM Systems, IEEE Transactions on Communications, COM-20, 225–230, April 1972.[panter 1951] P.F. Panter and W. Dite, Quantization distortion in pulse count modulation with

nonuniform spacing of levels, Proceedings of the IRE, 39, 44–48, January 1951.[sayood 1996] K. Sayood, Introduction to Data Compression, Morgan Kaufmann, San Francisco,

CA, 1996.[sklar 1988] B. Sklar, Digital Communications: Fundamentals and Applications, PTR Prentice-Hall,

Englewood Cliffs, NJ, 1988.[smith 1957] B. Smith, Instantaneous companding of quantized signals, Bell System Technical Journal,

36, 653–709, May 1957.

� 2007 by Taylor & Francis Group, LLC.

Page 92: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

3Differential Coding

Instead of enc oding a signal directl y, the different ial coding techniq ue code s the differencebetwe en the signal itse lf and its pred iction. Ther efore, it is also known as predictiv e c oding.By util izing spati al and temporal int erpixel correlat ion, differe ntial coding is an ef ficient andyet com putatio nally simple coding technique. In this chap ter, we wi ll first descri be thedifferent ial technique in general . And the n its two compone nts: predic tion and quan tization .Ther e is a n emphas is on (optimu m) pred iction, since quantiz ation was already dis cussedin Chapte r 2. Whe n the differe nce signal (also known as predic tion error ) is quantiz ed,the differe ntial codi ng is call ed differe ntial pulse code mo dulation (DPCM ). Some issues inDPCM are discuss ed, after which delta modul ation (DM) a s a special case of DPCMis covered. Th e idea of different ial coding involvi ng image sequ ences is brie fly dis cussedin thi s chap ter. A more detailed covera ge is presente d in Parts III and IV , startingfrom Chap ter 10. If quan tization is not includ ed, the different ial codi ng is refer red to asinforma tion-prese rving differe ntial coding and discuss ed at the end of the chapter.

3.1 I ntro du ction t o D PCM

As dep icted in Figure 2.3, a sou rce enc oder cons ists of the follo wing three compone nts:Transfo rmation, quan tization , and code wo rd a ssignment. The transf orm ation conve rtsinput into a format fo r quan tization follo wed by code word assignme nt. In other words ,the compone nt of transf ormati on decides which format of inpu t to be encode d. As m en-tioned in Chapte r 2, inpu t itse lf is not necess arily the most suitable format for encoding .

Consi der the cas e of mo nochrome image enc oding. The inpu t is usu ally a 2 -D array ofgray level value s of an image obt ained via PCM coding. The concept of spatial redun-dancy, discuss ed in Section 1.2.1.1, tells us that neighbori ng pixe ls of an image are usu allyhighly correlat ed. Th erefore, it is mo re ef ficient to enc ode the gray difference betwe en twoneighbo ring pixels inste ad of enc oding the gray level value s of each pi xel. At the receiver ,the deco ded difference is added back to rec onstru ct the gray level valu e of the pixel. Asneighbo ring pixe ls are highly corre lated, the ir gray level value s bear a great simi larity.Hence , we expect that the vari ance of the differe nce signal wi ll be smaller than that of theoriginal signal . As sume uniform quan tization and natural bina ry coding for the sake ofsimplici ty. Then we see that for the same bit rate (bits per sam ple) the quantiz ation err orwill be smaller , i.e., a higher qual ity of rec onstructe d signal ca n be achieved . Or, for thesame quality of rec onstruc ted signal , we need a lower bit rate.

Assume a bit rate of 8 bits =sample in the quantiz ation. W e can see that altho ugh thedynami c range of the difference signal is theo retically doubled , from 256 to 512, thevariance of the difference signal is actually muc h sm aller. This can be con firmed fromthe histog rams of the boy and girl image (refe r to Figure 1.2) and its differe nce image

� 2007 by Taylor & Francis Group, LLC.

Page 93: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

obt ained by ho rizontal pixel-t o-pixel differenci ng, shown in Figure 3.1a and 3.1b, respect-ively. Figu re 3.1b and its close- up (Figure 3.1c) indicate that by a rate of 42.44% thediffere nce value s fall into the range of �1, 0 , and þ 1. In othe r words, the histogra m ofthe differe nce signal is much more narrow ly concentrat ed than that of the origi nal signal.

3.1.1 Simpl e Pixel -to-Pi xel DPCM

Deno te the gray level value s of pixe ls along a row of an image as zi , i ¼ 1, . . . , M , where M isthe to tal num ber of pixe ls within the ro w. Usin g the imm ediate ly prece ding pixel ’s graylevel value , zi � 1, as a pred iction of that of the presen t pixe l, z i , i.e.,

zi ¼ z i� 1 , (3: 1)

we then have the differe nce signal

di ¼ z i � z i ¼ z i � z i� 1 : (3: 2)

A block diagram of the schem e des cribed abov e is shown in Figure 3.2. Ther e zi denote sthe sequ ence of pixe ls along a ro w, di is the correspo nding differe nce signal, and di is thequan tized versio n of the differe nce, i.e.,

di ¼ Q ( di ) ¼ di þ e q (3: 3)

where eq repre sents quan tization error. In the decod er, �z i repre sents the rec onstru cted pixe lgray value , and we have

�zi ¼ �z i � 1 þ di : (3: 4)

This simp le sch eme, ho wever, suffers from an accumul ated quan tization error. We cansee this clearl y from the fo llowing deri vation [s ayood 1996], where we assume the initia lvalue z0 is availabl e for both the encode r and the decod er:

as i ¼ 1, d1 ¼ z 1 � z 0d1 ¼ d1 þ e q ,1�z1 ¼ z 0 þ d1 ¼ z 0 þ d1 þ e q,1 ¼ z 1 þ e q,1 :

(3:5)

Similarly, we can have

as i ¼ 2, �z2 ¼ z2 þ eq,1 þ eq,2 (3:6)

and, in general,

�zi ¼ zi þXi

j¼1eq,j: (3:7)

This problem can be rem edied by the follo wing sch eme, shown in Figu re 3.3. Now wesee that in both the encoder and the decoder, the reconstructed signal is generated inthe same way, i.e.,

� 2007 by Taylor & Francis Group, LLC.

Page 94: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

0

(a)

0.002

0.004

0.006

0.008

0.01

0.012

0.014

1 16 31 46 61 76 91 106 121 136 151 166 181 196 211 226 241 256

Gray level value

Occ

urre

nce

rate

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

–255 0(b)

255

Difference value

Occ

urre

nce

rate

0

0.02

0.04

0.06

0.08

0.1

0.12

0.14

0.16

0.18

(c)–9 –8 –7 –6 –5 –4 –3 –2 –1 0 +1 +2 +3 +4 +5 +6 +7 +8 +9

Difference value

Occ

urre

nce

rate

FIGURE 3.1(a) Histogram of the original boy and girl image. (b) Histogram of the difference image obtained by usinghorizontal pixel-to-pixel differencing. (c) A close-up of the central portion of the histogram of the difference image.

� 2007 by Taylor & Francis Group, LLC.

Page 95: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

(a) Encoder (b) Decoder

+

+ zi

zi–1

Σ

Delay

di

Σ Quantizationdi

zi

zi–1 = z i

+ di

ˆ

ˆ

FIGURE 3.2Block diagram of a pixel-to-pixel differential coding system.

�zi ¼ �zi�1 þ di, (3:8)

and in the encoder the difference signal changes to

di ¼ zi � �zi�1: (3:9)

That is, the previous reconstructed �zi�1 is used as the prediction, zi, i.e.,

zi ¼ �zi�1: (3:10)

In this way, we have

as i ¼ 1, d1 ¼ z1 � z0

d1 ¼ d1 þ eq,1

�z1 ¼ z0 þ d1 ¼ z0 þ d1 þ eq,1 ¼ z1 þ eq,1:

(3:11)

Similarly, we have

as i ¼ 2, d2 ¼ z2 � �z1

d2 ¼ d2 þ eq,2

�z2 ¼ �z1 þ d2 ¼ z2 þ eq,2:

(3:12)

+

+

++

Σ Quantization

Delay

dizizi

zi

zi = zi–1

zi–1zi

zi

zi

zi–1

Σ

Σ

Delay

+

di

ˆ

di

(a) Encoder (b) Decoder

FIGURE 3.3Block diagram of a practical pixel-to-pixel differential coding system.

� 2007 by Taylor & Francis Group, LLC.

Page 96: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

In general ,

�zi ¼ z i þ e q, i : (3 : 13)

Thus, we see that the probl em of quan tiza tion error a ccumulati on has been resol vedby having bot h the enc oder and the deco der work in the same fas hion, as indicate d inFigure 3.3, or in Equ ations 3.3, 3.9, and 3.10.

3.1.2 Gener al DPCM System s

In the abov e discuss ion, we can view the rec onstru cted neighbo ring pixel ’s gray value as apredic tion of that of the pixe l being code d. No w, we general ize this simp le pixel-t o-pixelDPCM. In a general DPCM syste m, a pixel ’s gray level value is first predic ted from thepreced ing reconst ructed pixels ’ gray level value s. The differe nce betw een the pixe l’ s graylevel value and the pred icted value is then quan tized. Fina lly, the quan tized diffe rence isencode d and transmitt ed to the receiver . A block diagram of this general diffe rentialcoding schem e is shown in Figu re 3.4, where the code word assignme nt in the enc oderand its count erpart in decoder are not inc luded.

It is note d that, inste ad of using the previ ous rec onstru cted sample , �zi � 1, as a predic tor,we now have the predic ted ver sion of zi, z i, as a function of the n previo us reconst ructedsample s, �zi � 1, �z i � 2, . . . , �z i� n . That is,

zi ¼ f (�z i� 1 , �z i� 2 , . . . , �z i � n ) : (3 : 14)

Linear pred iction, i.e., the function f in Equati on 3.14 is line ar, is of par ticular inter est andis widely use d in different ial coding. In line ar pred iction, we have

zi ¼Xnj¼ 1

aj �z i � j , (3: 15)

where aj are real parame ters. He nce, we see that the simple pi xel-to-pixe l diffe rentialcoding is a special case of general different ial coding with linear predic tion, i.e., n ¼ 1and a1¼ 1.

In Figure 3.4, di is the difference signal and is equal to the difference between the originalsignal, zi, and the prediction zi. That is,

Σ

Σ

Σ

Σ

Quantization

di +zizi

zi = ziziajzi–j

zi

zi

zi

zi+

+ +

+−

j=1

n

DelayPrediction

DelayPrediction

diˆ di

ˆ

FIGURE 3.4Block diagram of a general differential pulse code modulation (DPCM) system.

� 2007 by Taylor & Francis Group, LLC.

Page 97: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

di ¼ z i � z i : (3: 16)

The quantiz ed ver sion of di is denoted by di. The reconst ructed v ersion of z i is repre-sented by �zi, and

�zi ¼ z i þ di : (3: 17)

Note that thi s is true for both the enc oder and the deco der. Recall that the acc umulati on ofthe quan tization error can be remedi ed by using this met hod.

The difference betwe en the original inpu t and the predicted input is calle d predictionerror , whic h is denoted by ep. That is,

ep ¼ z i � z i , (3: 18)

where the ep is unders tood as the predic tion err or assoc iated wi th the index i. Quant izationerror , eq , is equal to the reconst ruction err or or coding error , e r , defi ned as the differe ncebetwe en the or iginal signal , zi, and the rec onstru cted signal , �z i, when the transmi ssion iserror fre e:

eq ¼ di � di¼ ( zi � z i ) � (�z i � z i )

¼ zi � �z i ¼ er : (3: 19)

This indicate s that quan tization error is the only sou rce of informati on loss with an errorfree transmis sion chann el.

The DP CM system depict ed Figure 3.4 is also called closed-l oop DP CM with feed backaroun d the quan tizer [jay ant 1984]. This term reflects the featur e in DPCM structur e.

Before we leave thi s section, let us take a loo k at the history of the develop ment ofdiffere ntial image coding. Accordi ng to an excellent early article on diffe rential imagecoding [mu smann 1979], the fi rst theoretical and expe rimen tal approach es to image codinginv olving linear predic tion began in 1952 at the Bell Tel ephone Labor atori es [harrison 1952;kret zmer 1952; oliv er 1952]. Th e concepts of DPCM and DM were also develop ed in 1952[cu tler 1 952; dejager 1952]. Predic tive coding capa ble of preservin g infor mation fo r a PCMsignal was establ ished at the Massa chusett s Instituti on of Techno logy [elias 1955].

Diffe rential coding techni que has played an impo rtant role in image and video coding.In the inter national coding st andard for still images , JPEG , which is covered in Chapte r 7,we ca n see that the different ial codi ng is used in lossle ss mo de, and in DC T-based mode forcoding DC coef ficients. Motion compe nsated (MC ) coding has bee n a maj or develop mentin video coding since 1980s and has been adop ted by all the internati onal video codingstand ards, such as H.261 and H.263 (Ch apter 19), MPEG 1 and MPEG 2 (Chapt er 16). MCcoding is essentially a predictive coding technique applied to video sequences involvingdisplacement motion vectors.

3. 2 Op timum Lin ear Pre dict i on

Figu re 3.4 indicate s that a different ial c oding syste m cons ists of two maj or com ponents:predic tion and quan tization . Quantizat ion was discusse d in the Chapter 2. He nce, in thi s

� 2007 by Taylor & Francis Group, LLC.

Page 98: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

chapte r, we emp hasize predic tion. Below, we formu late the optimum linear pred ictionproblem and then present a theoret ical sol ution to the probl em.

3.2.1 Formul ation

Optimu m linear predic tion can be formu lated as follo ws. Consider a discrete -time rando mproces s z . At a typical momen t i, it is a random variable zi . We have n previo us observa-tions �zi� 1, �z i� 2, � � �, �z i � n availabl e and wou ld like to form a pred iction of z i , denoted by z i .The outp ut of the predic tor, zi , is a line ar fun ction of the n previo us obse rvations . Th at is,

zi ¼Xnj¼ 1

aj �z i � j , (3: 20)

with aj , j ¼ 1, 2, . . . , n being a set of real coef ficients. An illustrat ion of a linear predic tor isshown in Figure 3.5. As de fined abov e, the predic tion error , ep, is

ep ¼ z i � z i : (3 : 21)

The m ean square predic tion error , MSEp, is

MSEp ¼ E ( e p ) 2� � ¼ E ( zi � z i )2

� �(3 : 22)

The optimum pred iction then refers to the determinat ion of a set of coef ficients aj , j ¼1, 2, . . . , n such that the MSEp is min imize d.

This optimi zation probl em turns out to be com putational ly intracta ble for most pra cticalcases due to the feedback aro und the quantiz er shown in Figu re 3.4, and the no nlinearnature of the quan tizer. Therefor e, the opti mization problem is solved in two sep aratestages. That is, the best linear predic tor is first designed ignoring the quan tizer. Th en, thequan tizer is optimi zed for the distribut ion of the difference signal [habib i 1971]. Althou ghthe predicto r thus design ed is sub optimal, ignoring the quan tizer in the optimum predic tordesign allo ws us to substi tute the reconstru cted �zi �j by z i � j for j ¼ 1, 2, . . . , n, accordin gto Equati on 3.19. Conse quently, we can app ly the the ory of optimu m linear pred iction tohandle the design of the optimum predictor as shown below.

+a1+a2+an

zi−n zi−2 zi−1

Σ

zi

FIGURE 3.5An illustration of a linear predictor.

� 2007 by Taylor & Francis Group, LLC.

Page 99: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

3.2.2 Orthog ona lity Condi tion and Minimu m Mean Squ are Error

By takin g the different iation of MSEp with resp ect to coef ficient aj , on e can derivethe follo wing nec essary conditio ns, which are usua lly ref erred to as the orthogo nalitycond ition.

E ( ep � z i� j ) ¼ 0 for j ¼ 1, 2, . . . , n: (3: 23)

The int erpretat ion of Equati on 3.2 3 is that the predic tion error, ep, must be orthogo nal to allthe observat ions, which are now the precedin g sam ples: zi �j , j ¼ 1, 2, � � �, n, ac cording to ourdiscuss ion mad e in Secti on 3.2.1. These are equiva lent to

Rz ( m ) ¼Xnj¼ 1

aj R z ( m � j) for m ¼ 1, 2, . . . , n, (3: 24)

where Rz repres ents the aut ocorrelatio n fun ction of z . In a vector –matri x format, the abov eorthog onal cond itions can be writte n as

Rz (1)Rz (2)...

..

.

Rz (n )

2666664

3777775¼

Rz (0) Rz (1) . . . . . . Rz (n � 1)Rz (1) Rz (2) . . . . . . Rz (n � 2)... ..

.. . . . . . ..

.

..

. ...

. . . . . . ...

Rz (n � 1) Rz ( n) . . . . . . Rz (0)

2666664

3777775�

a1a2...

..

.

an

2666664

3777775

(3: 25)

Equations 3.24 and 3.25 are called Yule–Walker equations.The minimum MSEp is then found to be

MSEp ¼ Rz(0)�Xnj¼1

ajRz( j): (3:26)

These results can be found in texts on random processes [leon-garcia 1994].

3.2.3 Solution to Yule–Walker Equations

Once autocorrelation data is available, the Yule–Walker equation can be solved bymatrix inversion. A recursive procedure was developed by Levinson to solve the Yule–Walker equations [leon-garcia 1994]. When the number of previous samples used in thelinear predictor is large, i.e., the dimension of the matrix is high, the Levinson recursivealgorithm becomes more attractive. Note that in the field of image coding the autocorrela-tion function of various types of video frames is derived from measurements [o’neal 1966;habibi 1971].

3.3 Some Issues in the Implementation of DPCM

Several related issues in the implementation of DPCM are discussed in this section.

� 2007 by Taylor & Francis Group, LLC.

Page 100: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

3.3.1 Optimu m DPCM System

As DPCM consists mainly of two parts, predic tion and quan tization , its optimi zationshoul d not be carried out separate ly. Th e int eraction betwe en the two par ts is quitecomplic ated, howeve r, and thu s comb ined optimi zation of the whol e DP CM syste m isdif ficult. For tunately, with the mean square error criter ion, the relatio n between quan tiza-tion err or a nd pred iction error has been found as

MSEq � 92N 2

MSE p , (3: 27)

where N is the total number of reconst ruction levels in the quantiz er [o ’ neal 1966;musma nn 1979]. Th at is, the mean squ are err or of quantiz ation, MSEq, is approxi matelypropo rtional to the, MSEp, mean square error of predic tion. Wit h this appro ximati on,we c an optimi ze two parts separatel y as men tioned in Secti on 3.2.1. While the optimi za-tion of quan tization was address ed in Chap ter 2, the optimu m predic tor was discusse d inSecti on 3. 2. A large amo unt of work has bee n done in this subj ect. For instance , theoptimu m predic tor fo r color image coding was designed and tested in [pirs ch 1977].

3.3.2 1-D, 2-D, and 3-D DPCM

In Se ction 3.1.2, we expres sed linear predic tion in Equati on 3.15. Howeve r, so far we havenot dis cussed ho w to pred ict a pixel ’s gray level value by usi ng its neighboring pixels ’coded gray level value s.

Practi cal pixe l-to-pixel different ial coding system was dis cussed in Se ction 3.1.1. There,the reconstru cted inten sity of the imm ediate ly preced ing pixel along the sam e sca n line isused as a predic tion of the pixe l inten sity being coded . This type of differe ntial codi ngis referred to as 1-D DPCM. In general , 1-D DPCM may use the reconstru cted gr ay levelvalue s of more than one prece ding pixe ls within the sam e scan line to predic t that of a pixelbeing code d. By far, ho wever, the immedi ately preced ing pixe l in the sam e sca n li ne ismost frequentl y used in 1-D DPCM. Th at is, pixel A in Figu re 3.6 is of ten used as apredic tion of pixe l Z, which is being DP CM coded .

Some times in DPCM image coding, both the decoded int ensity value s of adjace nt pixelswithin the sam e scan line and the decod ed int ensity values of neighboring pixe ls in thedifferent scan lines are involved in the prediction. This is called 2-D DPCM. A typical pixelarrangement in 2-D predictive coding is shown in Figure 3.6. Note that the pixels involvedin the prediction are restricted to be either in the lines above the line where the pixel beingcoded, Z, is located or on the left-hand side of pixel Z if they are in the same line.Traditionally, a TV frame is scanned from top to bottom and from left to right. Hence,

L

ZAB

C D E F G

K J I H

0

x

y

FIGURE 3.6Pixel arrangement in 1-D and 2-D prediction.

� 2007 by Taylor & Francis Group, LLC.

Page 101: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

the abov e restrict ion indi cates that on ly those pixels, which have bee n code d, availabl e inbot h the transmi tter and the receiver, are use d in the predic tion. In 2-D system theory, thi ssup port is referred to as recursive ly compu table [bose 1982]. An often used 2-D predictioninv olves pixe ls A, D, and E.

Obvi ously, 2-D predictiv e coding util izes not on ly the spatial corre lation existi ngwithi n a scan line but also that existing in neighbo ring sca n lines. In ot her words , thespatia l cor relation is utilized both horizontal ly and v ertically . It was repo rted that 2-Dpredic tive codi ng outpe rforms 1-D pred ictive codi ng by decreasin g the predic tion errorby a factor of 2, or equiva lently 3 dB in SNR. The improve ment in subjective assessm entis even large r [mu smann 1979]. Furt hermore , the transmiss ion err or in 2-D predic tiveimage coding is much less seve re than in 1-D predic tive image coding. Th is is discusse din Se ction 3.6.

In the conte xt of image sequ ences, neighboring pixe ls may be locate d not only in thesam e image frame but also in succe ssive frames. Th at is, neig hboring pi xels along the tim edime nsion are a lso involve d. If the predic tion of a DPCM syste m involve s three types ofneighbo ring pixels: thos e along the same scan line, those in the differe nt sca n line s ofthe sam e image frame, and those in the different frames , the DPCM is then calle d 3-Ddiffere ntial codi ng, dis cussed in Section 3.5.

3.3.3 Order of Predicto r

The num ber of coef ficients in the line ar predic tion, n, is refer red to as the or der ofthe pred ictor. The relati on betwe en the mean square predic tion err or, MSEp, and theorder of the predic tor, n, has been studi ed. As shown in Figure 3.7, the MSE p decre asesas n increa ses quite effective ly, but the perform ance improve ment becomes negl igible asn > 3 [habi bi 1971].

3.3.4 Adapt ive Predict ion

Adap tive DPCM means adaptive pred iction and adapt ive quan tization . As adaptivequan tization was already dis cussed in Chapte r 2, her e we will dis cuss only adapt iveprediction.

Similar to the discussion on adaptive quantization, adaptive prediction can be done intwo different ways: forward adaptive and backward adaptive predictions. In the former,adaptation is based on the input of a DPCM system, while in the latter, adaptation is basedon the output of the DPCM. Therefore, forward adaptive prediction is more sensitive tochanges in local statistics. Prediction parameters (the coefficients of the predictor), how-ever, need to be transmitted as side information to the decoder. On the other hand,quantization error is involved in backward adaptive prediction. Hence, the adaptation isless sensitive to local changing statistics. But, it does not need to transmit side information.

In either case, the data (either input or output) has to be buffered. Autocorrelation coefficients are analyzed, based on which the prediction parameters are determined.
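
The following Python sketch illustrates forward-adaptive prediction for a buffered 1-D block of samples; the block length, the predictor order, and the use of a biased autocorrelation estimate are illustrative assumptions.

```python
import numpy as np

def forward_adaptive_coeffs(block, order=3):
    """Sketch of forward-adaptive prediction for one buffered block of samples.

    The autocorrelation of the buffered input is estimated, and the
    Yule-Walker (normal) equations R a = r are solved for the optimum
    linear predictor coefficients, which would then be sent to the decoder
    as side information.
    """
    x = np.asarray(block, dtype=float) - np.mean(block)
    n = len(x)
    # Biased autocorrelation estimates r(0), r(1), ..., r(order)
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz autocorrelation matrix R and right-hand side r(1..order)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])   # optimum coefficients (minimum MSEp)
    return a

# Usage: the coefficients adapt to the local statistics of each buffered block.
rng = np.random.default_rng(0)
samples = np.cumsum(rng.normal(size=256))    # a highly correlated toy signal
print(forward_adaptive_coeffs(samples, order=3))
```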

3.3.5 Effect of Transmission Errors

Transmission errors caused by channel noise may reverse the binary bit information from 0 to 1 or from 1 to 0 with what is known as the bit error probability, or bit error rate. The effect of transmission errors on reconstructed images varies depending on the coding technique used.


FIGURE 3.7
Mean square prediction error (MSEp) versus order of predictor (experimental and theoretical results; the first-order predictor uses horizontal correlation). (From Habibi, A., IEEE Trans. Commun. Technol., COM-19, 948, 1971. With permission.)

In the case of the PCM coding technique, each pixel is coded independently. Therefore, a bit reversal in the transmission only affects the gray level value of the corresponding pixel in the reconstructed image. It does not affect other pixels in the reconstructed image.

In DPCM, however, the effect caused by transmission errors becomes more severe. Consider a bit reversal occurring in transmission. It causes an error in the corresponding pixel. But this is not the end of the effect. The affected pixel causes errors in the reconstruction of those pixels for which the erroneous gray level value was used in the prediction. In this way, the transmission error propagates.

Interestingly, it is reported that error propagation is more severe in 1-D differential image coding than in 2-D differential coding. This may be explained as follows: in 1-D differential coding, usually the immediately preceding pixel in the same scan line is involved in the prediction. Therefore, an error will propagate along the scan line until the beginning of the next line, where the pixel gray level value is reinitialized. In 2-D differential coding, the prediction of a pixel gray level value depends not only on the reconstructed gray level values of pixels along the same scan line but also on the reconstructed gray level values of the vertical neighbors. Hence, the effect caused by a bit reversal transmission error is less severe than in 1-D differential coding.

For this reason, the bit error rate required by DPCM coding is lower than that required by PCM coding. For instance, while a bit error rate less than 5 × 10⁻⁶ is normally required for PCM to provide broadcast TV quality, for the same application a bit error


rate less than 10⁻⁷ and 10⁻⁹ is required for DPCM coding with 2-D and 1-D predictions, respectively [musmann 1979].

Channel encoding with an error correction capability was applied to lower the bit error rate. For instance, to lower the bit error rate from the order of 10⁻⁶ to 10⁻⁹ for DPCM coding with 1-D prediction, an error correction code adding 3% redundancy in channel coding has been used [bruders 1978].

3.4 Delta Modulation

Delta modulation is an important, simple, special case of DPCM, as discussed above. It was widely applied and is thus an important coding technique in and of itself.

The above discussion and characterization of DPCM systems are applicable to DM systems. This is because DM is essentially a special type of DPCM, with the following two features:

1. The linear predictor is of the first order, with the coefficient a1 equal to 1.

2. The quantizer is a 1-bit quantizer. That is, depending on whether the difference signal is positive or negative, the output is either +Δ/2 or −Δ/2.

To perceive these two features, let us take a look at the block diagram of a DM system and the input–output characteristic of its 1-bit quantizer, shown in Figures 3.8 and 3.9, respectively. Due to the first feature listed above, we have

$$ \hat{z}_i = \bar{z}_{i-1}. \tag{3.28} $$

Next, we see that there are only two reconstruction levels in quantization because of the second feature. That is,

$$ \bar{d}_i = \begin{cases} +\Delta/2 & \text{if } z_i > \bar{z}_{i-1} \\ -\Delta/2 & \text{if } z_i < \bar{z}_{i-1} \end{cases} \tag{3.29} $$

From the relation between the reconstructed value and the predicted value in DPCM discussed above, and the fact that DM is a special case of DPCM, we have

FIGURE 3.8
Block diagram of Delta modulation (DM) systems: (a) encoder; (b) decoder.


FIGURE 3.9
Input–output characteristic of the two-level quantization in Delta modulation (DM).

$$ \bar{z}_i = \hat{z}_i + \bar{d}_i. \tag{3.30} $$

Combining Equations 3.28 through 3.30, we have

$$ \bar{z}_i = \begin{cases} \bar{z}_{i-1} + \Delta/2 & \text{if } z_i > \bar{z}_{i-1} \\ \bar{z}_{i-1} - \Delta/2 & \text{if } z_i < \bar{z}_{i-1} \end{cases} \tag{3.31} $$

The above mathematical relationships are important in understanding DM systems. For instance, Equation 3.31 indicates that the step size Δ of DM is a crucial parameter. We note that a large step size compared with the magnitude of the difference signal causes granular error, as shown in Figure 3.10.

FIGURE 3.10
Delta modulation (DM) with fixed step size.


Therefore, to reduce the granular error, we should choose a small step size. On the other hand, a small step size compared with the magnitude of the difference signal will lead to the overload error discussed in Chapter 2 for quantization. Since in DM systems it is the difference signal that is quantized, however, the overload error in DM becomes slope overload error, as shown in Figure 3.10. That is, it takes time (multiple steps) for the reconstructed samples to catch up with a sudden change in the input. Therefore, the step size should be large to avoid slope overload. Considering these two conflicting factors, a proper compromise in choosing the step size is common practice in DM.
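
A minimal Python sketch of fixed-step DM following Equation 3.31 is given below; the step size and the toy input signal are illustrative assumptions, chosen so that both slope overload (on the ramp) and granular error (on the flat segment) appear.

```python
import numpy as np

def delta_modulate(z, step=0.2):
    """Minimal fixed-step delta modulation sketch (Equation 3.31).

    The 1-bit quantizer outputs +step/2 or -step/2 depending on the sign of
    z_i minus the previous reconstructed value, so the reconstruction tracks
    the input in increments of +/- step/2.
    """
    recon = np.zeros_like(z)
    prev = 0.0
    for i, zi in enumerate(z):
        d = step / 2 if zi > prev else -step / 2   # two-level quantization
        recon[i] = prev + d
        prev = recon[i]
    return recon

# A steep ramp followed by a flat segment: slope overload, then granular error.
t = np.arange(200)
z = np.clip(t * 0.2, 0, 10.0)
out = delta_modulate(z, step=0.2)
print(np.max(np.abs(z - out)))   # large lag during the ramp = slope overload
```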

To improve the performance of DM, an oversampling technique is often applied. That is, the input is oversampled before the application of DM. By oversampling, we mean that the sampling frequency is higher than the sampling frequency used in obtaining the original input signal. The increased sample density caused by oversampling decreases the magnitude of the difference signal. Consequently, a relatively small step size can be used so as to decrease the granular noise without increasing the slope overload error. In this way, the resolution of the DM-coded image is kept the same as that of the original input [jayant 1984; lim 1990].

To achieve better performance for changing inputs, an adaptive technique can be applied in DM. That is, either input (forward adaptation) or output (backward adaptation) data is buffered and the data variation is analyzed. The step size is then chosen accordingly. If it is forward adaptation, side information is required for transmission to the decoder. Figure 3.11 demonstrates step size adaptation. We see the same input as that shown in Figure 3.10. But the step size is now not fixed. Instead, the step size is adapted according to the varying input. When the input changes with a large slope, the step size increases to avoid the slope overload error. On the other hand, when the input changes slowly, the step size decreases to reduce the granular error.

FIGURE 3.11
Adaptive Delta modulation (DM).


3.5 Interframe Differential Coding

As was mentioned in Section 3.3.2, 3-D differential coding involves an image sequence. Consider a sensor located in 3-D world space. For instance, in applications such as videophony and videoconferencing, the sensor is fixed in position for a while and it takes pictures. As time goes by, the images form a temporal image sequence. The coding of such an image sequence is referred to as interframe coding. The subject of image sequence and video coding is addressed in Parts III and IV. In this section, we briefly discuss how differential coding is applied to interframe coding.

3.5.1 Conditional Replenishment

Recognizing the great similarity between consecutive TV frames, a conditional replenishment coding technique was proposed and developed [mounts 1969]. It was regarded as one of the first real demonstrations of interframe coding exploiting interframe redundancy [netravali 1979].

In this scheme, the previous frame is used as a reference for the present frame. Consider a pair of pixels: one in the previous frame and the other in the present frame, both occupying the same spatial position in the frames. If the gray level difference between the pair of pixels exceeds a certain criterion, then the pixel is considered a changing pixel. The present pixel gray level value and its position information are transmitted to the receiving side, where the pixel is replenished. Otherwise, the pixel is considered unchanged, and at the receiver its previous gray level is repeated. A block diagram of conditional replenishment is shown in Figure 3.12. There, a frame memory unit in the transmitter is used to store frames. The differencing and thresholding of corresponding pixels in two consecutive frames can then be conducted there. A buffer in the transmitter is used to smooth the transmission data rate.

FIGURE 3.12
Block diagram of conditional replenishment: (a) transmitter; (b) receiver.


This is necessary because the data rate varies from region to region within an image frame and from frame to frame within an image sequence. A buffer in the receiver is needed for a similar consideration. In the frame memory unit, the replenishment is carried out for the changing pixels, and the gray level values are repeated in the receiver for the unchanged pixels.
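
The following Python sketch illustrates the conditional replenishment idea; the threshold value, the frame contents, and the payload format are illustrative assumptions.

```python
import numpy as np

def conditional_replenish(prev_recon, current, threshold=8):
    """Conditional replenishment sketch.

    Pixels whose gray level differs from the previous reconstructed frame by
    more than the threshold are declared 'changing'; only their values and
    positions are (conceptually) transmitted, and the receiver repeats the
    previous gray level for all other pixels.
    """
    changed = np.abs(current.astype(int) - prev_recon.astype(int)) > threshold
    recon = prev_recon.copy()
    recon[changed] = current[changed]                             # replenish changing pixels
    payload = list(zip(*np.nonzero(changed), current[changed]))   # (row, col, value) triples
    return recon, payload

# Usage: only the moved region generates transmission data.
prev = np.full((8, 8), 100, dtype=np.uint8)
curr = prev.copy()
curr[2:4, 3:6] = 160                                              # a small moving object
rec, data = conditional_replenish(prev, curr)
print(len(data), "changing pixels out of", curr.size)
```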

With conditional replenishment, a considerable savings in bit rate was achieved in applications such as videophony, videoconferencing, and TV broadcasting. Experiments in real time, using the head-and-shoulder view of a person in animated conversation as the video source, demonstrated an average bit rate of 1 bit/pixel with a quality of reconstructed video comparable with standard 8 bits/pixel PCM transmission [mounts 1969]. Compared with pixel-to-pixel 1-D DPCM, the most popularly used coding technique at the time, the conditional replenishment technique is more efficient due to its exploitation of the high interframe redundancy. As pointed out in [mounts 1969], there is more correlation between television pixels along the frame-to-frame temporal dimension than there is between adjacent pixels within a single frame. That is, the temporal redundancy is normally higher than the spatial redundancy for TV signals.

Tremendous efforts have been made to improve the efficiency of this rudimentary technique. For an excellent review, readers are referred to [haskell 1972, 1979]. 3-D DPCM coding is among these improvements and is discussed next.

3.5.2 3-D DPCM

It was soon realized that it is more efficient to transmit the gray level difference than to transmit the gray level itself, resulting in interframe differential coding. Furthermore, instead of treating each pixel independently of its neighboring pixels, it is more efficient to utilize spatial redundancy as well as temporal redundancy, resulting in 3-D DPCM.

Consider two consecutive TV frames, each consisting of an odd and an even field. Figure 3.13 demonstrates the small neighborhood of a pixel, Z, in this context. As with the 1-D and 2-D DPCM discussed before, the prediction can only be based on the previously encoded pixels.

FIGURE 3.13
Pixel arrangement in two TV frames. (From Haskell, B.G., in Image Transmission Techniques, Academic Press, New York, 1979. With permission.)


TABLE 3.1
Some Linear Prediction Schemes

                                           Original      Prediction                   Differential
                                           Signal (Z)    Signal (Ẑ)                   Signal (dz)
Element difference                         Z             G                            Z − G
Field difference                           Z             (E + J)/2                    Z − (E + J)/2
Frame difference                           Z             T                            Z − T
Element difference of frame difference     Z             T + G − S                    (Z − G) − (T − S)
Line difference of frame difference        Z             T + B − M                    (Z − B) − (T − M)
Element difference of field difference     Z             T + (E + J)/2 − (Q + W)/2    [Z − (E + J)/2] − [T − (Q + W)/2]

Source: From Haskell, B.G., in Image Transmission Techniques, Academic Press, New York, 1979. With permission.

If the pixel under consideration, Z, is located in the even field of the present frame, then the odd field of the present frame and both the odd and even fields of the previous frame are available. As mentioned in Section 3.3.2, it is assumed that in the even field of the present frame, only those pixels in the lines above the line where pixel Z lies and those pixels to the left of Z in the line where Z lies are used for prediction.

Table 3.1 lists several commonly utilized linear prediction schemes. It is recognized that the case of element difference is a 1-D predictor, because the immediately preceding pixel is used as the predictor. The field difference is defined as the arithmetic average of the two immediately vertically neighboring pixels in the previous odd field. As the odd field is generated first, followed by the even field, this predictor cannot be regarded as a pure 2-D predictor. Instead, it should be considered a 3-D predictor. The remaining cases belong to 3-D predictors. One thing is common to all the cases: the gray levels of the pixels used in the prediction have already been coded and thus are available in both the transmitter and the receiver.

The prediction error of each changing pixel Z identified in the thresholding process is then quantized and coded.

An analysis of the relationship between the entropy of moving areas (bits per changing pixel) and the motion speed (pixels per frame interval) in scenery containing a moving mannequin's head was carried out for the different linear predictions listed in Table 3.1 [haskell 1979]. It was found that the element difference of field difference generally corresponds to the lowest entropy, meaning that this prediction is the most efficient. The frame difference and element difference correspond to higher entropies. It is recognized that, in these circumstances, transmission error will be propagated if the pixels in the previous line are used in prediction [connor 1973]. Hence, the linear predictor should use only pixels from the same line or from the same line in the previous frame when bit reversal errors in transmission need to be considered. Combining these two factors, the element difference of frame difference prediction is preferred.
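
As an illustration of one entry of Table 3.1, the Python sketch below forms the "element difference of frame difference" signal. The pixel positions assumed for T, S, and G are inferred from the differential expressions in the table, and the use of original rather than reconstructed values is a simplification.

```python
import numpy as np

def elem_diff_of_frame_diff(prev_frame, cur_frame):
    """Sketch of the 'element difference of frame difference' predictor.

    For pixel Z at (y, x) in the current frame the prediction is
        Z_hat = T + G - S,
    where T is the pixel at the same position in the previous frame, G is the
    pixel to the left of Z in the current frame, and S is the pixel to the
    left of T (positions inferred from Table 3.1 for illustration).
    """
    diff = np.zeros_like(cur_frame, dtype=int)
    h, w = cur_frame.shape
    for y in range(h):
        for x in range(1, w):
            T = int(prev_frame[y, x])
            S = int(prev_frame[y, x - 1])
            G = int(cur_frame[y, x - 1])
            Z = int(cur_frame[y, x])
            diff[y, x] = (Z - G) - (T - S)
    return diff

# For a scene that merely shifts brightness between frames, the differential
# signal collapses to zero.
prev = np.arange(64).reshape(8, 8)
cur = prev + 5
print(np.abs(elem_diff_of_frame_diff(prev, cur)).max())   # 0
```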

3.5.3 Motion Compensated Predictive Coding

When frames are taken densely enough, changes in successive frames can be attributed to the motion of objects during the interval between frames. Under this assumption, if we can analyze object motion from successive frames, then we should be able to predict objects in the next frame based on their positions in the previous frame and the estimated motion.


The difference between the original frame and the predicted frame is thus generated, and the motion vectors are then quantized and coded. If the motion estimation is accurate enough, the MC prediction error can be smaller than that of 3-D DPCM. In other words, the variance of the prediction error will be smaller, resulting in more efficient coding. Taking motion into consideration, this differential technique is called MC predictive coding. This technique has been a major development in image sequence coding since the 1980s and has been adopted by all international video coding standards. A more detailed discussion is provided in Chapter 10.

3.6 Information-Preserving Differential Coding

As emphasized in Chapter 2, quantization is not reversible in the sense that it causes permanent information loss. The DPCM technique, discussed above, includes quantization and hence is lossy coding. In applications such as those involving scientific measurements, information preservation is required. In this section, the following question is addressed: under these circumstances, how should we apply differential coding to reduce the bit rate while preserving information?

Figure 3.14 shows a block diagram of information-preserving differential coding. First, we see that there is no quantizer. Therefore, the irreversible information loss associated with quantization does not exist in this technique. Second, we observe that prediction and differencing are still used. That is, the differential (predictive) technique still applies. Hence, it is expected that the variance of the difference signal is smaller than that of the original signal (Section 3.1), and the resulting more highly peaked histogram makes coding more efficient. Third, an efficient lossless coder is utilized. Since quantizers cannot be used here, PCM with natural binary coding is not used here either. As the histogram of the difference signal is narrowly concentrated about its mean, lossless coding techniques such as an efficient Huffman coder (discussed in Chapter 5) are naturally a suitable choice here.

FIGURE 3.14
Block diagram of information-preserving differential coding: (a) encoder; (b) decoder.


As mentioned earlier, input images are normally in a PCM coded format with a bit rate of 8 bits/pixel for monochrome pictures. The difference signal is therefore integer valued. Having no quantization and using an efficient lossless coder, the coding system depicted in Figure 3.14 is, therefore, an information-preserving differential coding technique.
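
A minimal Python sketch of the idea is given below; the previous-pixel predictor and the first-order entropy estimate (standing in for an actual Huffman coder) are illustrative assumptions.

```python
import numpy as np

def entropy_bits(values):
    """First-order entropy (bits/sample) of an integer-valued signal."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def lossless_differential(img):
    """Information-preserving differential coding sketch.

    No quantizer is used: the integer difference between each pixel and its
    left neighbor (a simple illustrative predictor) is formed, and a lossless
    coder such as a Huffman coder would then encode the narrowly peaked
    differences. The original image is recovered exactly by cumulative sums.
    """
    diff = np.diff(img.astype(int), axis=1, prepend=0)
    recon = np.cumsum(diff, axis=1)                      # exact inverse
    assert np.array_equal(recon, img.astype(int))        # information preserved
    return diff

rng = np.random.default_rng(1)
toy = np.cumsum(rng.integers(-2, 3, size=(64, 64)), axis=1) + 128
d = lossless_differential(toy)
print(entropy_bits(toy.ravel()), ">", entropy_bits(d.ravel()))
```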

3.7 Summary

Rather than coding the signal itself, differential coding, also known as predictive coding, encodes the difference between the signal and its prediction. By utilizing the spatial and/or temporal correlation between pixels in the prediction, the variance of the difference signal can be made much smaller than that of the original signal, thus making differential coding quite efficient.

Among differential coding methods, differential pulse code modulation (DPCM) is used most widely. In DPCM coding, the difference signal is quantized and code words are assigned to the quantized difference. Prediction and quantization are therefore the two major components in DPCM systems. Since quantization was already addressed in Chapter 2, this chapter emphasizes prediction. The theory of optimum linear prediction was introduced. Here, optimum means minimization of the mean square prediction error (MSEp). The formulation of optimum linear prediction, the orthogonality condition, and the minimum MSEp were presented. The orthogonality condition states that the prediction error must be orthogonal to each observation, i.e., to the reconstructed sample intensity values used in the linear prediction. By solving the Yule–Walker equation, the optimum prediction coefficients may be determined.

In addition, some fundamental issues in implementing the DPCM technique were discussed. One issue is the dimensionality of the predictor in DPCM. We discussed 1-D, 2-D, and 3-D predictors. DPCM with a 2-D predictor demonstrates better performance than that with a 1-D predictor, because 2-D DPCM utilizes more spatial correlation, i.e., not only horizontally but also vertically. As a result, a 3 dB improvement in SNR was reported. 3-D prediction is encountered in what is known as interframe coding. There, temporal correlation exists, and 3-D DPCM utilizes both the spatial and the temporal correlation between neighboring pixels in successive frames. Consequently, more redundancy can be removed. Motion compensated (MC) predictive coding, a very powerful technique in video coding, belongs to differential coding. It uses a more advanced translational motion model in the prediction, however, and it is covered in Parts III and IV.

Another issue is the order of the predictor and its effect on the performance of prediction in terms of MSEp. Increasing the prediction order can lower the MSEp effectively, but the performance improvement becomes insignificant after the third order.

Adaptive prediction is another issue. Similar to adaptive quantization, discussed in Chapter 2, we can adapt the prediction coefficients in the linear predictor to varying local statistics.

The last issue is concerned with the effect of transmission errors. Bit reversal in transmission has a different effect on reconstructed images depending on what type of coding technique is used. PCM is known to be bit-consuming. (An acceptable PCM representation of monochrome images requires 6–8 bits/pixel.) But a single bit reversal only affects an individual pixel. For the DPCM coding technique, however, a transmission error may propagate from one pixel to another. In particular, DPCM with a 1-D predictor suffers from error propagation more severely than DPCM with a 2-D predictor.

Delta modulation is an important special case of DPCM, in which the predictor is of the first order. Specifically, the immediately preceding coded sample is used as the prediction of the present input sample. Furthermore, the quantizer has only two reconstruction levels.


Finally, an information-preserving differential coding technique was discussed. As mentioned in Chapter 2, quantization is an irreversible process: it causes information loss. To be able to preserve information, there is no quantizer in this type of system. To be efficient, lossless codes such as Huffman codes or arithmetic codes are used for difference signal encoding.

Exercises

1. Justify the necessity of the closed-loop DPCM with feedback around the quantizer. That is, give a suitable reason why the quantization error will be accumulated if, instead of using the reconstructed preceding samples, we use the immediately preceding sample as the prediction of the sample being coded in DPCM.

2. Why does the overload error encountered in quantization appear as slope overload in DM?

3. What advantage does oversampling bring to the DM technique?

4. What are the two features of DM that make it a subclass of DPCM?

5. Explain why DPCM with a 1-D predictor suffers from bit reversal transmission errors more severely than DPCM with a 2-D predictor.

6. Explain why no quantizer can be used in information-preserving differential coding, and why the differential system can work without a quantizer.

7. Why do all the pixels involved in the prediction of differential coding have to be in a recursively computable order from the point of view of the pixel being coded?

8. Discuss the similarity and dissimilarity between DPCM and MC predictive coding.

References

[bose 1982] N.K. Bose, Applied Multidimensional System Theory, Van Nostrand Reinhold, New York, 1982.

[bruders 1978] R. Bruders, T. Kummerow, P. Neuhold, and P. Stamnitz, Ein versuchssystem zur digitalen ubertragung von fernsehsignalen unter besonderer berucksichtigung von ubertragungsfehlern, Festschrift 50 Jahre Heinrich-Hertz-Institut, Berlin, 1978.

[connor 1973] D.J. Connor, IEEE Transactions on Communications, COM-21, 695–706, 1973.

[cutler 1952] C.C. Cutler, U.S. Patent 2,605,361, 1952.

[dejager 1952] F. DeJager, Philips Research Report, 7, 442–466, 1952.

[elias 1955] P. Elias, IRE Transactions on Information Theory, IT-1, 16–32, 1955.

[habibi 1971] A. Habibi, Comparison of nth-order DPCM encoder with linear transformations and block quantization techniques, IEEE Transactions on Communication Technology, COM-19, 6, 948–956, December 1971.

[harrison 1952] C.W. Harrison, Bell System Technical Journal, 31, 764–783, 1952.

[haskell 1972] B.G. Haskell, F.W. Mounts, and J.C. Candy, Interframe coding of videotelephone pictures, Proceedings of the IEEE, 60, 7, 792–800, July 1972.

[haskell 1979] B.G. Haskell, Frame replenishment coding of television, in Image Transmission Techniques, W.K. Pratt (Ed.), Academic Press, New York, 1979.

[jayant 1984] N.S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs, NJ, 1984.

[kretzmer 1952] E.R. Kretzmer, Statistics of television signals, Bell System Technical Journal, 31, 751–763, July 1952.


[leon-garcia 1994] A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, 2nd edn., Addison Wesley, Reading, MA, 1994.

[lim 1990] J.S. Lim, Two-Dimensional Signal and Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[mounts 1969] F.W. Mounts, A video encoding system with conditional picture-element replenishment, Bell System Technical Journal, 48, 7, 2545–2554, September 1969.

[musmann 1979] H.G. Musmann, Predictive image coding, in Image Transmission Techniques, W.K. Pratt (Ed.), Academic Press, New York, 1979.

[netravali 1979] A.N. Netravali and J.D. Robbins, Motion compensated television coding: Part I, The Bell System Technical Journal, 58, 3, 631–670, March 1979.

[oliver 1952] B.M. Oliver, Bell System Technical Journal, 31, 724–750, 1952.

[o'neal 1966] J.B. O'Neal, Bell System Technical Journal, 45, 689–721, 1966.

[pirsch 1977] P. Pirsch and L. Stenger, Acta Electronica, 19, 277–287, 1977.

[sayood 1996] K. Sayood, Introduction to Data Compression, Morgan Kaufmann Publishers, San Francisco, CA, 1996.


4
Transform Coding

As introduced in Chapter 3, differential coding achieves high coding efficiency by utilizing the correlation between pixels existing in image frames. Transform coding (TC), which is the focus of this chapter, is another efficient coding scheme based on utilization of interpixel correlation. As we will see in Chapter 7, TC has become a fundamental technique recommended by the international still image coding standard, Joint Photographic Experts Group (JPEG) coding. In addition, TC was found to be efficient in coding the prediction error in motion compensated (MC) predictive coding. As a result, it was also adopted by the international video coding standards such as H.261, H.263, and MPEG 1, 2, and 4. This will be discussed in Part IV.

4.1 Introduction

As shown in Figure 2.3, there are three components in a source encoder: transformation, quantization, and code word assignment. It is the transformation component that decides which format of the input source is quantized and encoded. In differential pulse code modulation (DPCM), for instance, the difference between an original signal and a predicted version of the original signal is quantized and encoded. As long as the prediction error is small enough, i.e., the prediction resembles the original signal well (by using correlation between pixels), differential coding is efficient.

In TC, the main idea is that if the transformed version of a signal is less correlated compared with the original signal, then quantizing and encoding the transformed signal may lead to data compression. At the receiver, the encoded data are decoded and transformed back to reconstruct the signal. Therefore, in TC, the transformation component illustrated in Figure 2.3 is a transform. Quantization and code word assignment are carried out with respect to the transformed signal, i.e., in the transformed domain.

We begin with the Hotelling transform, using it as an example of how a transform may decorrelate a signal in the transform domain.

4.1.1 Hotelling Transform

Consider an N-dimensional (N-D) vector z_s. The ensemble of such vectors, {z_s}, s ∈ I, where I represents the set of all vector indexes, can be modeled by a random vector z with each of its components z_i, i = 1, 2, …, N, being a random variable. That is,

$$ \vec{z} = (z_1, z_2, \ldots, z_N)^T, \tag{4.1} $$


where T stands for the operator of matrix transposition. The mean vector of the population, m_z, is defined as

$$ m_{\vec{z}} = E[\vec{z}\,] = (m_1, m_2, \ldots, m_N)^T, \tag{4.2} $$

where E stands for the expectation operator. Note that m_z is an N-D vector with the ith component, m_i, being the expectation value of the ith random variable component of z:

$$ m_i = E[z_i], \quad i = 1, 2, \ldots, N. \tag{4.3} $$

The covariance matrix of the population, denoted by C_z, is equal to

$$ C_{\vec{z}} = E[(\vec{z} - m_{\vec{z}})(\vec{z} - m_{\vec{z}})^T]. \tag{4.4} $$

Note that the product inside the E operator is referred to as the outer product of the vector (z − m_z). Denote the entry at the ith row and jth column of the covariance matrix by c_{i,j}. From Equation 4.4, it can be seen that c_{i,j} is the covariance between the ith and jth components of the random vector z. That is,

$$ c_{i,j} = E[(z_i - m_i)(z_j - m_j)] = \mathrm{Cov}(z_i, z_j). \tag{4.5} $$

On the main diagonal of the covariance matrix C_z, the element c_{i,i} is the variance of the ith component of z, namely z_i.

Obviously, the covariance matrix C_z is a real and symmetric matrix. It is real because of the definition of random variables. It is symmetric because Cov(z_i, z_j) = Cov(z_j, z_i). According to the theory of linear algebra, it is always possible to find a set of N orthonormal eigenvectors of the matrix C_z, with which we can convert the real symmetric matrix C_z into a full-ranked diagonal matrix. This statement can be found in texts on linear algebra [strang 1998].

Denote the set of N orthonormal eigenvectors and their corresponding eigenvalues of the covariance matrix C_z by e_i and λ_i, i = 1, 2, …, N, respectively. Note that the eigenvectors are column vectors. Form a matrix F such that its rows comprise the N eigenvectors. That is,

$$ F = (\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_N)^T. \tag{4.6} $$

Now, consider the following transformation.

$$ \vec{y} = F(\vec{z} - m_{\vec{z}}). \tag{4.7} $$

It is easy to verify that the transformed random vector y has the following two characteristics:

1. The mean vector, m_y, is a zero vector. That is,

$$ m_{\vec{y}} = 0. \tag{4.8} $$


2. The covariance matrix of the transformed random vector, C_y, is

$$ C_{\vec{y}} = F C_{\vec{z}} F^T = \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_N \end{pmatrix}. \tag{4.9} $$

This transform is called the Hotelling transform [hotelling 1933], or the eigenvector transform [tasto 1971; wintz 1972].

The inverse Hotelling transform is defined as

$$ \vec{z} = F^{-1}\vec{y} + m_{\vec{z}}, \tag{4.10} $$

where F^{-1} is the inverse matrix of F. It is easy to see from its formation discussed above that the matrix F is orthogonal. Therefore, we have F^T = F^{-1}. Hence, the inverse Hotelling transform can be expressed as

$$ \vec{z} = F^{T}\vec{y} + m_{\vec{z}}. \tag{4.11} $$

Note that in implementing the Hotelling transform, the mean vector m_z and the covariance matrix C_z can be calculated approximately by using a given set of K sample vectors [gonzalez 2001]:

$$ m_{\vec{z}} = \frac{1}{K}\sum_{s=1}^{K} \vec{z}_s \tag{4.12} $$

$$ C_{\vec{z}} = \frac{1}{K}\sum_{s=1}^{K} \vec{z}_s \vec{z}_s^{\,T} - m_{\vec{z}}\, m_{\vec{z}}^{\,T} \tag{4.13} $$

The analogous transform for continuous data was devised by Karhunen and Loève [karhunen 1947; loeve 1948]. Alternatively, the Hotelling transform can be viewed as the discrete version of the Karhunen–Loève transform (KLT). We observe that the covariance matrix C_y is a diagonal matrix. The elements on the diagonal are the eigenvalues of the covariance matrix C_z. That is, the two covariance matrices have the same eigenvalues and eigenvectors because the two matrices are similar. The fact that zero values appear everywhere except along the main diagonal of C_y indicates that the components of the transformed vector y are uncorrelated. That is, the correlation previously existing between the different components of the random vector z has been removed in the transformed domain. Therefore, if the input is split into blocks and the Hotelling transform is applied blockwise, the coding may be more efficient because the data in the transformed blocks are uncorrelated. At the receiver, we may produce a replica of the input with an inverse transform. This basic idea behind TC will be further illustrated below. Note that TC is also referred to as block quantization [huang 1963].
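
The following Python sketch computes the Hotelling transform of a set of sample vectors using Equations 4.12, 4.13, 4.6, and 4.7, and checks that the transformed components are uncorrelated; the toy data and the use of NumPy's eigendecomposition are illustrative choices.

```python
import numpy as np

def hotelling_transform(samples):
    """Hotelling (discrete KLT) sketch.

    'samples' holds K sample vectors as rows. The mean vector and covariance
    matrix are estimated as in Equations 4.12 and 4.13, the rows of F are the
    transposed eigenvectors of C_z (Equation 4.6), and y = F (z - m_z)
    (Equation 4.7) yields components with a diagonal covariance matrix.
    """
    z = np.asarray(samples, dtype=float)
    m = z.mean(axis=0)
    C = (z.T @ z) / len(z) - np.outer(m, m)          # Equation 4.13
    eigvals, eigvecs = np.linalg.eigh(C)             # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # sort in nonincreasing order
    F = eigvecs[:, order].T                          # rows = eigenvectors
    y = (z - m) @ F.T                                # Equation 4.7, row-wise
    return y, F, m, eigvals[order]

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
data = rng.normal(size=(1000, 4)) @ A.T              # correlated toy vectors
y, F, m, lam = hotelling_transform(data)
Cy = np.cov(y.T, bias=True)
print(np.round(Cy, 3))                               # diagonal, entries = eigenvalues
```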

4.1.2 Statistical Interpretation

Let us continue our discussion of the 1-D Hotelling transform, recalling that the covariance matrix of the transformed vector y, C_y, is a diagonal matrix.


The elements on the main diagonal are the eigenvalues of the covariance matrix C_y. According to the definition of the covariance matrix, these elements are the variances of the components of the vector y, denoted by σ²_y,1, σ²_y,2, …, σ²_y,N. Let us arrange the eigenvalues (variances) in nonincreasing order, i.e., λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_N. Choose an integer L with L < N. Using the corresponding L eigenvectors, e_1, e_2, …, e_L, we form a matrix F̄ with these L eigenvectors (transposed) as its L rows. Obviously, the matrix F̄ is of size L × N. Hence, using the matrix F̄ in Equation 4.7 yields a transformed vector y of size L × 1. That is,

$$ \vec{y} = \bar{F}(\vec{z} - m_{\vec{z}}). \tag{4.14} $$

The inverse transform changes accordingly:

$$ \vec{z}\,' = \bar{F}^{T}\vec{y} + m_{\vec{z}}. \tag{4.15} $$

Note that the reconstructed version of z, denoted by z′, is still an N × 1 column vector. It can be shown [wintz 1972] that the mean square reconstruction error between the original vector z and the reconstructed vector z′ is given by

$$ \mathrm{MSE}_r = \sum_{i=L+1}^{N} \sigma_{y,i}^{2}. \tag{4.16} $$

Equation 4.16 indicates that the mean square reconstruction error equals the sum of the variances of the discarded components. Note that although we discuss the reconstruction error here, we have not considered the quantization error and transmission error involved. Equation 4.16 implies that if, in the transformed vector y, the variances of the first L components account for a large percentage of the total variance, the mean square reconstruction error will not be large even though only the first L components are kept, i.e., even though the (N − L) remaining components of y are discarded. Quantizing and encoding only L components of the vector y in the transform domain leads to higher coding efficiency, which is the basic idea behind TC.
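
The Python sketch below verifies Equation 4.16 numerically: keeping only the L components with the largest variances and reconstructing via Equation 4.15 yields a mean square error equal to the sum of the discarded variances. The toy data and the value of L are illustrative assumptions.

```python
import numpy as np

def truncated_klt_mse(samples, L):
    """Check Equation 4.16: MSE_r equals the sum of the discarded variances.

    Only the L eigenvectors with the largest eigenvalues are kept; the
    remaining components of y are discarded before the inverse transform
    of Equation 4.15.
    """
    z = np.asarray(samples, dtype=float)
    m = z.mean(axis=0)
    C = np.cov(z.T, bias=True)
    lam, E = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]
    lam, E = lam[order], E[:, order]
    F_bar = E[:, :L].T                        # L x N truncated transform matrix
    y = (z - m) @ F_bar.T                     # L coefficients per vector (Equation 4.14)
    z_rec = y @ F_bar + m                     # inverse transform (Equation 4.15)
    mse = np.mean(np.sum((z - z_rec) ** 2, axis=1))
    return mse, lam[L:].sum()                 # the two numbers should match

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 6)) @ rng.normal(size=(6, 6))
print(truncated_klt_mse(data, L=3))
```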

4.1.3 Geometrical Interpretation

Transforming a set of statistically dependent data into another set of uncorrelated data, and then discarding the insignificant transform coefficients (those having small variances), as illustrated above using the Hotelling transform, can be viewed as a statistical interpretation of TC. Here, we give a geometrical interpretation of TC. For this purpose, we use 2-D vectors instead of N-D vectors.

Consider the binary image of a car in Figure 4.1a. Each pixel in the shaded object region corresponds to a 2-D vector with its two components being the coordinates z1 and z2, respectively. Hence, the set of all pixels associated with the object forms a population of vectors. We can determine its mean vector and covariance matrix using Equations 4.12 and 4.13, respectively. We can then apply the Hotelling transform by using Equation 4.7. Figure 4.1b depicts the same object after the application of the Hotelling transform in the y1–y2 coordinate system. We note that the origin of the new coordinate system is now located at the centroid of the binary object. Furthermore, the new coordinate system is aligned with the two eigenvectors of the covariance matrix C_z.

As mentioned earlier, the elements along the main diagonal of C_y (the two eigenvalues of C_y and C_z) are the two variances of the two components of the y population. Since the covariance matrix C_y is a diagonal matrix, the two components are uncorrelated after the transform.


FIGURE 4.1
(a) A binary object in the z1–z2 coordinate system. (b) After the Hotelling transform, the object is aligned with its principal axes.

As one variance (along the y1 direction) is larger than the other (along the y2 direction), it is possible for us to achieve higher coding efficiency by ignoring the component associated with the smaller variance without too much sacrifice of the reconstructed image quality.

It is noted that the alignment of the object with the eigenvectors of the covariance matrix is of importance in pattern recognition [gonzalez 2001].

4.1.4 Basis Vector Interpretation

Basis vector expansion is another interpretation of TC. For simplicity, in this section we assume a zero mean vector. Under this assumption, the Hotelling transform and its inverse transform become

$$ \vec{y} = F\vec{z} \tag{4.17} $$

$$ \vec{z} = F^{T}\vec{y} \tag{4.18} $$

Recall that the row vectors in the matrix F are the transposed eigenvectors of the covariance matrix C_z. Equation 4.18 can be written as

$$ \vec{z} = \sum_{i=1}^{N} y_i\, \vec{e}_i. \tag{4.19} $$

In Equation 4.19, we can view the vector z as a linear combination of the basis vectors e_i, i = 1, 2, …, N. The components of the transformed vector y, y_i, i = 1, 2, …, N, serve as the coefficients in the linear combination, or as the weights in the weighted sum of basis vectors. The coefficient y_i, i = 1, 2, …, N, can be produced according to Equation 4.17:

$$ y_i = \vec{e}_i^{\,T}\vec{z}. \tag{4.20} $$

That is, y_i is the inner product between the vectors e_i and z. Therefore, the coefficient y_i can be interpreted as the amount of correlation between the basis vector e_i and the original signal z.


In the Hotelling transform, the coefficients y_i, i = 1, 2, …, N, are uncorrelated. Their variances can be arranged in nonincreasing order. For i > L, the variance of the coefficient becomes insignificant. We can then discard these coefficients without introducing significant error in the linear combination of basis vectors, and thereby achieve higher coding efficiency.

In the above three interpretations of TC, we see that the linear unitary transform can provide the following two functions:

1. Decorrelate the input data; i.e., the transform coefficients are less correlated than the original data.

2. Make some transform coefficients more significant than others (with larger variance, eigenvalue, or weight in the basis vector expansion) so that the transform coefficients can be treated differently: some can be discarded, some can be coarsely quantized, and some can be finely quantized.

Note that the definition of a unitary transform is given shortly, in Section 4.2.1.3.

4.1.5 Procedures of Transform Coding

In this section, we summarize the procedures of TC. There are three steps in TC, as shown in Figure 4.2. First, the input data (frame) is divided into blocks (subimages). Each block is then linearly transformed. The transformed version is then truncated, quantized, and encoded. These last three functions, which are discussed in Section 4.4, can be grouped and termed bit allocation. The output of the encoder is a bit stream.

In the receiver, the bit stream is decoded and then inversely transformed to form reconstructed blocks. All the reconstructed blocks collectively produce a replica of the input image.

FIGURE 4.2
Block diagram of transform coding: (a) transmitter; (b) receiver.


4.2 Linear Transforms

Here, we first discuss a general formulation of a linear unitary 2-D image transform. Then, a basis image interpretation of TC is given.

4.2.1 2-D Image Transformation Kernel

There are two different ways to handle image transformation. In the first way, we convert a 2-D array representing a digital image into a 1-D array via row-by-row stacking. That is, from the second row on, the beginning of each row in the 2-D array is cascaded to the end of its previous row. Then we transform this 1-D array using a 1-D transform. After the transformation, we can convert the 1-D array back into a 2-D array. In the second way, a 2-D transform is applied directly to the 2-D array corresponding to an input image, resulting in a transformed 2-D array. These two ways are essentially the same. It can be straightforwardly shown that the difference between the two is simply a matter of notation [wintz 1972]. In this section, we use the second way to handle image transformation. That is, we work on 2-D image transformation.

Assume a digital image is represented by a 2-D array g(x, y), where (x, y) are the coordinates of a pixel in the 2-D array, and g is the gray level value (also often called intensity or brightness) of the pixel. Denote the 2-D transform of g(x, y) by T(u, v), where (u, v) are the coordinates in the transformed domain. Assume that both g(x, y) and T(u, v) are square 2-D arrays of size N × N, i.e., 0 ≤ x, y, u, v ≤ N − 1.

The 2-D forward and inverse transforms are defined as

$$ T(u, v) = \sum_{x=0}^{N-1}\sum_{y=0}^{N-1} g(x, y)\, f(x, y, u, v) \tag{4.21} $$

and

$$ g(x, y) = \sum_{u=0}^{N-1}\sum_{v=0}^{N-1} T(u, v)\, i(x, y, u, v), \tag{4.22} $$

where f(x, y, u, v) and i(x, y, u, v) are referred to as the forward and inverse transformation kernels, respectively.

A few characteristics of transforms are discussed below.

4.2.1.1 Separability

A transformation kernel is called separable (hence, the transform is said to be separable) if the following conditions are satisfied:

$$ f(x, y, u, v) = f_1(x, u)\, f_2(y, v) \tag{4.23} $$

and

$$ i(x, y, u, v) = i_1(x, u)\, i_2(y, v). \tag{4.24} $$

Note that a 2-D separable transform can be decomposed into two 1-D transforms. That is, a 2-D transform can be implemented as a 1-D transform applied rowwise followed by another 1-D transform applied columnwise:


$$ T_1(x, v) = \sum_{y=0}^{N-1} g(x, y)\, f_2(y, v), \tag{4.25} $$

where 0 ≤ x, v ≤ N − 1, and

$$ T(u, v) = \sum_{x=0}^{N-1} T_1(x, v)\, f_1(x, u), \tag{4.26} $$

where 0 ≤ u, v ≤ N − 1. Of course, the 2-D transform can also be implemented in the reverse order with two 1-D transforms, i.e., columnwise first, followed by rowwise. The counterparts of Equations 4.25 and 4.26 for the inverse transform can be derived similarly.

4.2.1.2 Symmetry

The transformation kernel is symmetric (hence, the transform is symmetric) if the kernel is separable and the following condition is satisfied:

$$ f_1(y, v) = f_2(y, v). \tag{4.27} $$

That is, f_1 is functionally equivalent to f_2.

4.2.1.3 Matrix Form

If a transformation kernel is symmetric (hence, separable), then the 2-D image transform discussed above can be expressed compactly in the following matrix form. Denote an image matrix by G, with G = {g_{i,j}} = {g(i − 1, j − 1)}. That is, a typical element (at the ith row and jth column) of the matrix G is the pixel gray level value in the 2-D array g(x, y) at the same geometrical position. Note that the subtraction of one in the notation g(i − 1, j − 1) comes from Equations 4.21 and 4.22. That is, the indexes of a square 2-D image array are conventionally defined from 0 to N − 1, while the indexes of a square matrix run from 1 to N. Denote the forward transform matrix by F, with F = {f_{i,j}} = {f_1(i − 1, j − 1)}. We then have the following matrix form of a 2-D transform:

$$ T = F^{T} G F, \tag{4.28} $$

where T on the left-hand side of the equation denotes the matrix corresponding to the transformed 2-D array, formed in the same fashion as the G matrix defined above. The inverse transform can be expressed as

$$ G = I^{T} T I, \tag{4.29} $$

where the matrix I is the inverse transform matrix, with I = {i_{j,k}} = {i_1(j − 1, k − 1)}. The forward and inverse transform matrices have the following relation:

$$ I = F^{-1}. \tag{4.30} $$

Note that all of the matrices defined above, G, T, F, and I, are of size N × N. It is known that the discrete Fourier transform (DFT) involves complex quantities.

In this case, the counterparts of Equations 4.28 through 4.30 become Equations 4.31 through 4.33, respectively:

$$ T = F^{*T} G F \tag{4.31} $$


$$ G = I^{*T} T I \tag{4.32} $$

$$ I = F^{-1} = F^{*T}, \tag{4.33} $$

where * indicates complex conjugation. Note that the transform matrices F and I contain complex quantities and satisfy Equation 4.33. They are called unitary matrices, and the transform is referred to as a unitary transform.

4.2.1.4 Orthogonality

A transform is said to be orthogonal if the transform matrix is orthogonal. That is,

$$ F^{T} = F^{-1}. \tag{4.34} $$

Note that an orthogonal matrix (orthogonal transform) is a special case of a unitary matrix (unitary transform), in which only real quantities are involved. We will see that all of the 2-D image transforms presented in Section 4.3 are separable, symmetric, and unitary.
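
The following Python sketch ties the separable implementation of Equations 4.25 and 4.26 to the matrix form of Equation 4.28. The orthonormal DCT-II matrix used here as the symmetric kernel f1(x, u), and the random test block, are illustrative choices.

```python
import numpy as np

def dct_matrix(N):
    """Orthonormal DCT-II matrix, used here as an example of a symmetric,
    separable transformation kernel f1(x, u)."""
    F = np.zeros((N, N))
    for x in range(N):
        for u in range(N):
            c = np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
            F[x, u] = c * np.cos((2 * x + 1) * u * np.pi / (2 * N))
    return F

def separable_2d_transform(G, F):
    """Implement Equations 4.25 and 4.26: a 1-D transform applied along the
    rows (over y) followed by a 1-D transform applied along the columns
    (over x)."""
    T1 = G @ F          # T1(x, v) = sum_y g(x, y) f(y, v)
    T = F.T @ T1        # T(u, v)  = sum_x T1(x, v) f(x, u)
    return T

N = 8
rng = np.random.default_rng(2)
G = rng.integers(0, 256, size=(N, N)).astype(float)
F = dct_matrix(N)
T_sep = separable_2d_transform(G, F)
T_mat = F.T @ G @ F                       # matrix form of Equation 4.28
print(np.allclose(T_sep, T_mat))          # True: the two are identical
```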

4.2.2 Basis Image Interpretation

Here we study the concept of basis images or basis matrices. Recall that we discussed basis vectors when we considered the 1-D transform. That is, the components of the transformed vector (also referred to as the transform coefficients) can be interpreted as the coefficients in the basis vector expansion of the input vector. Each coefficient is essentially the amount of correlation between the input vector and the corresponding basis vector. The concept of basis vectors can be extended to basis images in the context of 2-D image transforms.

The 2-D inverse transform introduced in Section 4.2.1 (Equation 4.22) is defined as

$$ g(x, y) = \sum_{u=0}^{N-1}\sum_{v=0}^{N-1} T(u, v)\, i(x, y, u, v), \tag{4.35} $$

where 0 ≤ x, y ≤ N − 1. Equation 4.35 can be viewed as a component form of the inverse transform. As defined in Section 4.2.1.3, the whole image {g(x, y)} is denoted by the image matrix G of size N × N. We now denote the image formed by the inverse transformation kernel {i(x, y, u, v), 0 ≤ x, y ≤ N − 1} as a 2-D array I_{u,v} of size N × N for a specific pair of (u, v) with 0 ≤ u, v ≤ N − 1. Recall that a digital image can be represented by a 2-D array of gray level values, and that, in turn, the 2-D array can be arranged into a matrix. Namely, we treat the following three interchangeably: a digital image, a 2-D array (with proper resolution), and a matrix (with proper indexing). We then have

$$ I_{u,v} = \begin{pmatrix} i(0,0,u,v) & i(0,1,u,v) & \cdots & i(0,N-1,u,v) \\ i(1,0,u,v) & i(1,1,u,v) & \cdots & i(1,N-1,u,v) \\ \vdots & \vdots & \ddots & \vdots \\ i(N-1,0,u,v) & i(N-1,1,u,v) & \cdots & i(N-1,N-1,u,v) \end{pmatrix} \tag{4.36} $$


The 2-D array I_{u,v} is referred to as a basis image. There are N² basis images in total because 0 ≤ u, v ≤ N − 1. The inverse transform expressed in Equation 4.35 can then be written in the collective form

$$ G = \sum_{u=0}^{N-1}\sum_{v=0}^{N-1} T(u, v)\, I_{u,v}. \tag{4.37} $$

We can interpret Equation 4.37 as a series expansion of the original image G into a set of N² basis images I_{u,v}. The transform coefficients T(u, v), 0 ≤ u, v ≤ N − 1, become the coefficients of the expansion. Alternatively, the image G is said to be a weighted sum of basis images. Note that, similar to the 1-D case, the coefficient or weight T(u, v) is a correlation measure between the image G and the basis image I_{u,v} [wintz 1972].

Note that the basis images have nothing to do with the input image. Instead, they are completely defined by the transform itself. That is, basis images are an attribute of 2-D image transforms. Different transforms have different sets of basis images.

The motivation behind TC is that with a proper transform, and hence a proper set of basis images, the transform coefficients are more independent than the gray scales of the original input image. In the ideal case, the transform coefficients are statistically independent. We can then optimally encode the coefficients independently, which can make coding more efficient and simple. As pointed out in [wintz 1972], however, this is generally impossible for the following two reasons. First, it requires the joint probability density function (pdf) of the N² pixels, which has not been deduced from basic physical laws and cannot be measured. Second, even if the joint pdf were known, the problem of devising a reversible transform that can generate independent coefficients is unsolved. The best an optimum linear transform can achieve is uncorrelated coefficients; when a Gaussian distribution is involved, uncorrelatedness implies independence, and we can then have independent transform coefficients. In addition to the uncorrelatedness of the coefficients, the variances of the coefficients vary widely. Insignificant coefficients can be ignored without introducing significant distortion in the reconstructed image. Significant coefficients can be allocated more bits in encoding. The coding efficiency is thus enhanced.

As shown in Figure 4.3, TC can be viewed as expanding the input image into a set of basis images, and then quantizing and encoding the coefficients associated with the basis images separately. At the receiver, the coefficients are used to reconstruct a replica of the input image. This strategy is similar to that of subband coding, which is discussed in Chapter 8. From this point of view, TC can be considered a special case of subband coding, though TC was devised much earlier than subband coding.

It is worth mentioning an alternative way to define basis images. That is, the basis image with indexes (u, v), I_{u,v}, of a transform can be constructed as the outer product of the uth basis vector, b_u, and the vth basis vector, b_v, of the transform. The basis vector b_u is the uth column vector of the inverse transform matrix I [jayant 1984]. That is,

$$ I_{u,v} = \vec{b}_u \vec{b}_v^{\,T}. \tag{4.38} $$
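
The Python sketch below illustrates Equations 4.37 and 4.38 numerically: an image block is reconstructed as a weighted sum of basis images formed as outer products of basis vectors. An arbitrary orthogonal matrix stands in for the transform here; with the convention T = F^T G F of Equation 4.28, the basis vectors used in this sketch are the columns of F (equivalently, the rows of its inverse F^T).

```python
import numpy as np

# Sketch of Equations 4.37 and 4.38: reconstruct an image block as a weighted
# sum of basis images, each basis image being the outer product of two basis
# vectors. An arbitrary orthogonal matrix stands in for the transform matrix F.
N = 4
rng = np.random.default_rng(3)
F, _ = np.linalg.qr(rng.normal(size=(N, N)))   # orthogonal transform matrix
G = rng.integers(0, 256, size=(N, N)).astype(float)
T = F.T @ G @ F                                # forward transform (Equation 4.28)

G_rec = np.zeros_like(G)
for u in range(N):
    for v in range(N):
        basis_image = np.outer(F[:, u], F[:, v])   # I_{u,v} = b_u b_v^T (Equation 4.38)
        G_rec += T[u, v] * basis_image             # weighted sum (Equation 4.37)

print(np.allclose(G, G_rec))                       # True: the expansion recovers G
```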

4.2.3 Subimage Size Selection

The selection of the subimage (block) size, N, is important. Normally, the larger the size, the more decorrelation TC can achieve. It has been shown, however, that the correlation between image pixels becomes insignificant when the distance between pixels becomes large, e.g., when it exceeds 20 pixels [habibi 1971a].


FIGURE 4.3
Basis image interpretation of TC (Q: quantizer, E: encoder, D: decoder): (a) transmitter; (b) receiver.

On the other hand, a large size causes some problems. In adaptive TC, a large block cannot adapt well to local statistics. As will be discussed later in this chapter, a transmission error in TC affects the whole associated subimage. Hence, a large size implies a possibly severe effect of transmission errors on the reconstructed images. As will be shown in video coding (Parts III and IV), TC is used together with motion compensated (MC) coding. Considering that a large block size is not used in motion estimation, subimage sizes of 4, 8, and 16 are used most often. In particular, N = 8 is adopted by the international still image coding standard JPEG as well as the video coding standards H.261, H.263, H.264, MPEG 1, 2, and 4.


4.3 Transforms of Particular Interest

Several commonly used image transforms are discussed in this section. They include the DFT, the discrete Walsh transform (DWT), the discrete Hadamard transform (DHT), the discrete cosine transform (DCT), and the discrete sine transform. All of these transforms are symmetric (hence, separable as well), unitary, and reversible. For each transform, we define its transformation kernel and discuss its basis images.

4.3.1 Discrete Fourier Transform

The DFT is of great importance in the field of digital signal processing. Owing to the fast Fourier transform (FFT) based on the algorithm developed in [cooley 1965], the DFT is widely utilized for various tasks of digital signal processing. It has been discussed in many signal and image processing texts. Here we only define it using the transformation kernel just introduced above. The forward and inverse transformation kernels of the DFT are

f(x, y, u, v) = \frac{1}{N} \exp\{-j 2\pi (xu + yv)/N\}  (4.39)

and

i(x, y, u, v) = \frac{1}{N} \exp\{j 2\pi (xu + yv)/N\}.  (4.40)

Clearly, since complex quantities are involved in the DFT transformation kernels, the DFT is generally complex. Hence, we use the unitary matrix to handle the DFT (refer to Section 4.2.1.3). The basis vector of the DFT, \vec{b}_u, is an N × 1 column vector and is defined as

\vec{b}_u = \frac{1}{\sqrt{N}} \left[ 1,\ \exp\!\left(j 2\pi \frac{u}{N}\right),\ \exp\!\left(j 2\pi \frac{2u}{N}\right),\ \ldots,\ \exp\!\left(j 2\pi \frac{(N-1)u}{N}\right) \right]^T.  (4.41)

As mentioned, the basis image with index (u, v), I_{u,v}, is equal to \vec{b}_u \vec{b}_v^{\,T}. A few basis images are listed below for N = 4.

I_{0,0} = \frac{1}{4} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}  (4.42)

I_{0,1} = \frac{1}{4} \begin{pmatrix} 1 & j & -1 & -j \\ 1 & j & -1 & -j \\ 1 & j & -1 & -j \\ 1 & j & -1 & -j \end{pmatrix}  (4.43)

I_{1,2} = \frac{1}{4} \begin{pmatrix} 1 & -1 & 1 & -1 \\ j & -j & j & -j \\ -1 & 1 & -1 & 1 \\ -j & j & -j & j \end{pmatrix}  (4.44)

I_{3,3} = \frac{1}{4} \begin{pmatrix} 1 & -j & -1 & j \\ -j & -1 & j & 1 \\ -1 & j & 1 & -j \\ j & 1 & -j & -1 \end{pmatrix}  (4.45)
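As a quick numerical check (not part of the original text), the basis vectors of Equation 4.41 can be combined by outer products to reproduce the matrices above; the rounding in the print statements is only for display.

```python
import numpy as np

N = 4
# DFT basis vectors of Equation 4.41: b_u[x] = exp(j*2*pi*u*x/N) / sqrt(N)
b = [np.exp(2j * np.pi * u * np.arange(N) / N) / np.sqrt(N) for u in range(N)]

def basis_image(u, v):
    # Equation 4.38: outer product of b_u and b_v (no conjugation,
    # matching the book's definition I_{u,v} = b_u b_v^T).
    return np.outer(b[u], b[v])

print(np.round(4 * basis_image(0, 1)))   # rows of [1, j, -1, -j], cf. Equation 4.43
print(np.round(4 * basis_image(1, 2)))   # cf. Equation 4.44
print(np.round(4 * basis_image(3, 3)))   # cf. Equation 4.45
```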


4.3.2 Discrete Walsh Transform

The transformation kernels of the DWT [walsh 1923] are defined as

f(x, y, u, v) = \frac{1}{N} \prod_{i=0}^{n-1} \left[ (-1)^{p_i(x)\, p_{n-1-i}(u)} (-1)^{p_i(y)\, p_{n-1-i}(v)} \right]  (4.46)

and

i(x, y, u, v) = f(x, y, u, v),  (4.47)

where n = log₂N, and p_i(arg) represents the ith bit in the natural binary representation of arg; the 0th bit corresponds to the least significant bit and the (n−1)th bit corresponds to the most significant bit. For instance, consider N = 16; then n = 4. The natural binary code of the number 8 is 1000. Hence, p₀(8) = p₁(8) = p₂(8) = 0, and p₃(8) = 1. We see that if the factor 1/N is put aside, then the forward transformation kernel is always an integer: either +1 or −1. In addition, the inverse transformation kernel is the same as the forward transformation kernel. Therefore, we conclude that the implementation of the DWT is simple.

When N = 4, the 16 basis images of the DWT are shown in Figure 4.4. Each corresponds to a specific pair of (u, v) and is of resolution 4 × 4 in the x–y coordinate system. They are binary images, where bright represents +1 and dark −1. The transform matrix of the DWT is shown below for N = 4.

F = \frac{1}{2} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix}  (4.48)
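Because the kernel of Equation 4.46 is separable, the corresponding 1-D transform matrix can be generated directly from the bit functions p_i. The following sketch is only an illustration (not the book's code); for N = 4 it should reproduce the matrix of Equation 4.48.

```python
import numpy as np

def bit(value, i):
    """p_i(value): the i-th bit of the natural binary representation of value."""
    return (value >> i) & 1

def walsh_matrix(N):
    """1-D Walsh transform matrix derived from the kernel of Equation 4.46
    (the separable 2-D kernel is the product of two such 1-D factors)."""
    n = int(np.log2(N))
    F = np.empty((N, N))
    for u in range(N):
        for x in range(N):
            sign = 1
            for i in range(n):
                sign *= (-1) ** (bit(x, i) * bit(u, n - 1 - i))
            F[u, x] = sign / np.sqrt(N)
    return F

print(walsh_matrix(4))   # should match the matrix of Equation 4.48
```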

4.3.3 Discrete Hadamard Transform

The DHT [hadamard 1893] is closely related to the DWT. This can be seen from the following definition of the transformation kernels.

f(x, y, u, v) = \frac{1}{N} \prod_{i=0}^{n-1} \left[ (-1)^{p_i(x)\, p_i(u)} (-1)^{p_i(y)\, p_i(v)} \right]  (4.49)

and

i(x, y, u, v) = f(x, y, u, v),  (4.50)

FIGURE 4.4
When N = 4, a set of 16 basis images of the DWT, arranged in a 4 × 4 array indexed by u (rows) and v (columns), each running from 0 to 3.


where the definitions of n, i, and p_i(arg) are the same as in the DWT. For this reason, the term Walsh–Hadamard transform (DWHT) is frequently used to represent either of the two transforms.

When N is a power of 2, the transform matrices of the DWT and DHT have the same row (or column) vectors except that the order of the row (or column) vectors in the matrices is different. This is the only difference between the DWT and DHT under the circumstance N = 2ⁿ. Because of this difference, while the DWT can be implemented by using the FFT algorithm with a straightforward modification, the DHT needs more work to use the FFT algorithm. On the other hand, the DHT possesses the following recursive feature, while the DWT does not:

F_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}  (4.51)

and

F_{2N} = \begin{pmatrix} F_N & F_N \\ F_N & -F_N \end{pmatrix},  (4.52)

where the subscripts indicate the size of the transform matrices. It is obvious that the transform matrix of the DHT can be easily derived by using the recursion.

Note that the number of sign changes between consecutive entries in a row (or a column) of a transform matrix (from positive to negative and from negative to positive) is known as sequency. It is observed that the sequency does not monotonically increase as the order number of rows (or columns) increases in the DHT. Since sequency bears some similarity to frequency in the Fourier transform, sequency is desired to be an increasing function of the order number of rows (or columns). This is realized by the ordered Hadamard transform [gonzalez 2001].
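The recursion of Equations 4.51 and 4.52 and the notion of sequency are easy to illustrate in a few lines. The sketch below is illustrative only (not the book's code): it builds the unnormalized Hadamard matrix by the recursion and counts the sign changes per row, which for the natural-ordered DHT comes out nonmonotonic.

```python
import numpy as np

def hadamard(N):
    """Hadamard matrix via the recursion of Equations 4.51 and 4.52
    (normalization factor omitted, as in the text)."""
    F = np.array([[1, 1],
                  [1, -1]])
    while F.shape[0] < N:
        F = np.block([[F, F],
                      [F, -F]])
    return F

def sequency(row):
    """Number of sign changes between consecutive entries of a row."""
    return int(np.sum(row[:-1] * row[1:] < 0))

H8 = hadamard(8)
print([sequency(r) for r in H8])   # not monotonically increasing for the natural order
```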

The transformation kernel of the ordered Hadamard transform is defined as

f(x, y, u, v) = \frac{1}{N} \prod_{i=0}^{n-1} \left[ (-1)^{p_i(x)\, d_i(u)} (-1)^{p_i(y)\, d_i(v)} \right],  (4.53)

where the definitions of i and p_i(arg) are the same as defined above for the DWT and DHT. The d_i(arg) is defined as

d_0(arg) = b_{n-1}(arg)
d_1(arg) = b_{n-1}(arg) + b_{n-2}(arg)
  ⋮
d_{n-1}(arg) = b_1(arg) + b_0(arg)  (4.54)

The 16 basis images of the ordered Hadamard transform are shown in Figure 4.5 for N = 4. It is observed that the variation of the binary basis images becomes monotonically more frequent as u and v increase. We also see that the basis image expansion is similar to the frequency expansion of the Fourier transform in the sense that an image is decomposed into components with different variations. In TC, these components with different coefficients are treated differently.


FIGURE 4.5
When N = 4, a set of 16 basis images of the ordered DHT, arranged in a 4 × 4 array indexed by u (rows) and v (columns), each running from 0 to 3.

4.3.4 Discrete Cosine Transform

The discrete cosine transform is the most commonly used transform for image and video coding.

4.3.4.1 Background

The DCT, which plays an extremely important role in image and video coding, was established by Ahmed et al. [ahmed 1974]. There, it was shown that the basis member cos[(2x+1)uπ/2N] is the uth Chebyshev polynomial T_u(ξ) evaluated at the xth zero of T_N(ξ). Recall that the Chebyshev polynomials are defined as

T_0(ξ) = 1/\sqrt{2}  (4.55)

T_k(ξ) = \cos[k \cos^{-1}(ξ)],  (4.56)

where T_k(ξ) is the kth-order Chebyshev polynomial, which has k zeros, starting from the first zero to the kth zero. Furthermore, it was demonstrated that the basis vectors of the 1-D DCT provide a good approximation to the eigenvectors of the class of Toeplitz matrices defined as

\begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \cdots & \rho^{N-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{N-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \rho^{N-3} & \cdots & 1 \end{pmatrix},  (4.57)

where 0 < ρ < 1.

4.3.4.2 Transformation Kernel

The transformation kernel of the 2-D DCT can be extended straightforwardly from that of the 1-D DCT as follows:

f(x, y, u, v) = C(u)\,C(v) \cos\!\left[\frac{(2x+1)u\pi}{2N}\right] \cos\!\left[\frac{(2y+1)v\pi}{2N}\right],  (4.58)


FIGURE 4.6
When N = 8, a set of 64 basis images of the DCT.

where

C(u) = \begin{cases} \sqrt{1/N} & \text{for } u = 0 \\ \sqrt{2/N} & \text{for } u = 1, 2, \ldots, N-1 \end{cases}  (4.59)

i(x, y, u, v) = f(x, y, u, v).  (4.60)

Note that C(v) is defined in the same way as in Equation 4.59. The 64 basis images of the DCT are shown in Figure 4.6 for N = 8.
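Since the 2-D kernel of Equation 4.58 is separable, the 2-D DCT of an N × N block reduces to two matrix multiplications. The following is a small illustrative sketch (the 8 × 8 ramp block is an arbitrary test input, not an example from the book): for such a smooth block, most of the energy lands in the DC and low-frequency coefficients.

```python
import numpy as np

def dct_matrix(N):
    """1-D DCT basis: row u is C(u) * cos((2x+1) u pi / (2N)), with C(u) as in Eq. 4.59."""
    A = np.empty((N, N))
    for u in range(N):
        c = np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
        A[u, :] = c * np.cos((2 * np.arange(N) + 1) * u * np.pi / (2 * N))
    return A

def dct2(block):
    """2-D DCT of an N x N block using the separable kernel of Equation 4.58."""
    A = dct_matrix(block.shape[0])
    return A @ block @ A.T

g = np.add.outer(np.arange(8), np.arange(8)).astype(float)   # smooth 8x8 ramp
T = dct2(g)
print(np.round(T, 1))   # energy concentrated in the top-left (low-frequency) corner
```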

4.3.4.3 Relationship with DFT

The DCT is closely related to the DFT. This can be examined from an alternative method of defining the DCT. It is known that applying the DFT to an N-point sequence g_N(n), n = 0, 1, …, N−1, is equivalent to the following:

1. Repeating g_N(n) every N points to form a periodic sequence, \tilde{g}_N(n), with a fundamental period N, that is,

\tilde{g}_N(n) = \sum_{i=-\infty}^{\infty} g_N(n - iN).  (4.61)


2. Determine the Fourier series expansion of the periodic sequence \tilde{g}_N(n). That is, determine all the coefficients in the Fourier series, which are known to be periodic with the same fundamental period N.

3. Truncate the sequence of the Fourier series coefficients so as to have the same support as that of the given sequence g_N(n). That is, only keep the N coefficients with indexes 0, 1, …, N−1, and set all the others to equal zero. These N Fourier series coefficients form the DFT of the given N-point sequence g_N(n).

An N-point sequence g_N(n) and the periodic sequence \tilde{g}_N(n) generated from g_N(n) are shown in Figure 4.7a and b, respectively. In summary, the DFT can be viewed as a correspondence between two periodic sequences. One is the periodic sequence \tilde{g}_N(n), which is formed by periodically repeating g_N(n). The other is the periodic sequence of Fourier series coefficients of \tilde{g}_N(n).

FIGURE 4.7
An example to illustrate the differences and similarities between the DFT and the DCT. (a) Original 1-D input sequence g_N(n). (b) Formation of a periodic sequence \tilde{g}_N(n) with a fundamental period of N (DFT). (c) Formation of a back-to-back 2N-point sequence g_{2N}(n). (d) Formation of a periodic sequence \tilde{g}_{2N}(n) with a fundamental period of 2N (DCT).


The DCT of an N-point sequence is obtained through the following three steps:

1. Flip over the given sequence with respect to its end point to form a 2N-point sequence, g_{2N}(n), as shown in Figure 4.7c. Then form a periodic sequence \tilde{g}_{2N}(n), shown in Figure 4.7d, according to

\tilde{g}_{2N}(n) = \sum_{i=-\infty}^{\infty} g_{2N}(n - 2iN).  (4.62)

2. Find the Fourier series coefficients of the periodic sequence \tilde{g}_{2N}(n).

3. Truncate the resultant periodic sequence of the Fourier series coefficients to have the support of the given finite sequence g_N(n). That is, keeping only the N coefficients with indexes 0, 1, …, N−1, set all the others to equal zero. These N Fourier series coefficients form the DCT of the given N-point sequence g_N(n).

A comparison between Figure 4.7b and d reveals that the periodic sequence \tilde{g}_N(n) is not smooth. There exist discontinuities at the beginning and end of each period. These end-head discontinuities cause a high-frequency distribution in the corresponding DFT. On the contrary, the periodic sequence \tilde{g}_{2N}(n) does not have this type of discontinuity, owing to the flipping over of the given finite sequence. As a result, there is no high-frequency component corresponding to end-head discontinuities. Hence, the DCT possesses better energy compaction in the low frequencies than the DFT. By energy compaction, we mean more energy is compacted in a fraction of the transform coefficients. For instance, it is known that most of the energy of an image is contained in a small region of low frequency in the DFT domain. Vivid examples can be found in [gonzalez 2001]. In terms of energy compaction, when compared with the KLT (the Hotelling transform is its discrete version), which is known as the optimal, the DCT is the best among the DFT, DWT, DHT, and discrete Haar transform.

Besides this advantage, the DCT can be implemented using the FFT. This can be seen from the above discussion. There, it has been shown that the DCT of an N-point sequence, g_N(n), can be obtained from the DFT of the 2N-point sequence g_{2N}(n). Furthermore, the even symmetry in \tilde{g}_{2N}(n) makes the computation required for the DCT of an N-point sequence equal to that required for the DFT of the N-point sequence. Because of these two merits, the DCT is the most popular image transform used in image and video coding. No other transform has been proven to be better than the DCT from a practical standpoint [haskell 1996].
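The three steps above translate directly into a short computation. The sketch below is an illustration under the stated convention (not the book's code): it obtains the DCT of an N-point sequence from the 2N-point FFT of the back-to-back sequence and checks the result against direct evaluation of the kernel. The factor exp(-jπk/2N) accounts for the half-sample shift introduced by the mirror extension.

```python
import numpy as np

def dct_via_fft(g):
    """DCT of an N-point sequence from the 2N-point DFT of the mirror-extended
    (back-to-back) sequence, following the three steps in the text."""
    N = len(g)
    g2N = np.concatenate([g, g[::-1]])              # flip over the end point
    Y = np.fft.fft(g2N)                             # 2N-point DFT
    k = np.arange(N)
    raw = 0.5 * np.real(np.exp(-1j * np.pi * k / (2 * N)) * Y[:N])
    scale = np.where(k == 0, np.sqrt(1.0 / N), np.sqrt(2.0 / N))  # C(u), Eq. 4.59
    return scale * raw

def dct_direct(g):
    """Direct evaluation of the 1-D DCT kernel, for comparison."""
    N = len(g)
    x = np.arange(N)
    out = np.empty(N)
    for u in range(N):
        c = np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
        out[u] = c * np.sum(g * np.cos((2 * x + 1) * u * np.pi / (2 * N)))
    return out

g = np.random.rand(8)
assert np.allclose(dct_via_fft(g), dct_direct(g))
```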

4.3.5 Performance Comparison

In this section, we compare the performance of a few commonly used transforms in terms of energy compaction, mean square reconstruction error, and computational complexity.

4.3.5.1 Energy Compaction

Since all the transforms we discussed are symmetric (hence separable) and unitary, the matrix form of the 2-D image transform can be expressed as T = FᵀGF, as discussed in Section 4.2.1.3. In the 1-D case, the transform matrix F is the counterpart of the matrix F discussed in the Hotelling transform. Using F, one can transform a 1-D column vector \vec{z} into another 1-D column vector \vec{y}. The components of the vector \vec{y} are transform coefficients. The variances of these transform coefficients, and therefore the signal energy associated with the transform coefficients, can be arranged in a nondecreasing order.


It can be shown that the total energy before and after the transform remains the same. Therefore, the more energy compacted into a fraction of the total coefficients, the better energy compaction the transform has. One measure of energy compaction is the transform coding gain G_TC, which is defined as the ratio between the arithmetic mean and the geometric mean of the variances of all the components in the transformed vector [jayant 1984].

G_{TC} = \frac{\frac{1}{N} \sum_{i=0}^{N-1} \sigma_i^2}{\left( \prod_{i=0}^{N-1} \sigma_i^2 \right)^{1/N}}.  (4.63)

A larger G_TC indicates higher energy compaction. The TC gains for a first-order autoregressive source with ρ = 0.95 achieved by using the DCT, DFT, and KLT were reported in [zelinski 1975; jayant 1984]. The TC gain afforded by the DCT compares very closely to that of the optimum KLT.
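Equation 4.63 is straightforward to evaluate for the first-order autoregressive model whose covariance is the Toeplitz matrix of Equation 4.57. The sketch below is illustrative only (the exact gain values reported in [zelinski 1975; jayant 1984] may differ in detail): it computes the coefficient variances as the diagonal of A R Aᵀ for a unitary transform A and forms the ratio of the arithmetic to the geometric mean.

```python
import numpy as np

def dct_matrix(N):
    A = np.empty((N, N))
    for u in range(N):
        c = np.sqrt(1.0 / N) if u == 0 else np.sqrt(2.0 / N)
        A[u, :] = c * np.cos((2 * np.arange(N) + 1) * u * np.pi / (2 * N))
    return A

def coding_gain(A, R):
    """Transform coding gain of Equation 4.63: arithmetic over geometric mean
    of the transform-coefficient variances diag(A R A^T)."""
    var = np.diag(A @ R @ A.T)
    return var.mean() / np.exp(np.mean(np.log(var)))

N, rho = 16, 0.95
R = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))  # Toeplitz, Eq. 4.57
print(coding_gain(dct_matrix(N), R))   # DCT gain
print(coding_gain(np.eye(N), R))       # no transform: gain equals 1
```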

4.3.5.2 Mean Square Reconstruction Error

The performance of the transforms can be compared in terms of the mean square reconstruction error as well. This was mentioned in Section 4.1.2 when we provided a statistical interpretation for TC. That is, after arranging all the N transformed coefficients according to their variances in a nonincreasing order, if L < N and we discard the last N − L coefficients to reconstruct the original input signal \vec{z} (similar to what we did with the Hotelling transform), then the mean square reconstruction error is

MSE_r = E\!\left[ \| \vec{z} - \vec{z}\,' \|^2 \right] = \sum_{i=L+1}^{N} \sigma_i^2,  (4.64)

where \vec{z}\,' denotes the reconstructed vector. Note that in the above-defined mean square reconstruction error, the quantization error and transmission error have not been included. Hence, it is sometimes referred to as the mean square approximation error.

Therefore it is desired to choose a transform such that the transformed coefficients are more independent and more energy is concentrated in the first L coefficients. Then it is possible to discard the remaining coefficients to save coding bits without causing significant distortion in the input signal reconstruction.

In terms of the mean square reconstruction error, the performance of the DCT, KLT, DFT, DWT, and discrete Haar transform for the 1-D case was reported in [ahmed 1974]. The variances of the 16 transform coefficients are shown in Figure 4.8 for N = 16 and ρ = 0.95. Note that N stands for the dimension of the 1-D vector, while the parameter ρ is that shown in the Toeplitz matrix (refer to Equation 4.57). We can see that the DCT compares most closely to the KLT, which is known to be optimum.

Note that the unequal variance distribution among transform coefficients has also found application in the field of pattern recognition. Results similar to those in [ahmed 1974] for the DFT, DWT, and Haar transform were reported in [andrews 1971].

A similar analysis can be carried out for the 2-D case [wintz 1972]. Recall that an image g(x, y) can be expressed as a weighted sum of basis images I_{u,v}. That is,

G = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} T(u, v)\, I_{u,v},  (4.65)


FIGURE 4.8
Transform coefficient variances when N = 16, ρ = 0.95, plotted against the transform component index for the discrete cosine, Fourier, Walsh–Hadamard, Haar, and Karhunen–Loeve transforms. (From Ahmed, N., Natarajan, T., and Rao, K.R., IEEE Trans. Comput., 90–93, 1974. With permission.)

where the weights are transform coefficients. We arrange the coefficients according to their variances in a nonincreasing order. For some choices of the transform (hence basis images), the coefficients become insignificant after the first L terms, and the image can be approximated well by truncating the coefficients after L. That is,

G = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} T(u, v)\, I_{u,v} \approx \sum_{u=0}^{L} \sum_{v=0}^{L} T(u, v)\, I_{u,v}.  (4.66)

The mean square reconstruction error is given by

MSE_r = \sum_{u=L}^{N-1} \sum_{v=L}^{N-1} \sigma_{u,v}^2.  (4.67)

A comparison among the KLT, DHT, and DFT in terms of the mean square reconstruction error for a 2-D array of 16 × 16 (i.e., 256 transform coefficients) was reported in [Figure 5, wintz 1972]. Note that the discrete KLT is image dependent. In the comparison, the KLT is calculated with respect to an image named Cameraman. It shows that while the KLT achieves the best performance, the other transforms perform closely.

In essence, the criteria of mean square reconstruction error and energy compaction are closely related. It has been shown that the discrete KLT, also known as the Hotelling transform, is the optimum in terms of energy compaction and mean square reconstruction error. The DWT, DHT, DFT, and DCT are close to the optimum [wintz 1972; ahmed 1974]; however, the DCT is the best among these several suboptimum transforms.

Note that the performance comparison among various transforms in terms of bit rate versus distortion in the reconstructed image was reported in [pearl 1972; ahmed 1974]. The


same conclusion was drawn. That is, the KLT is optimum, while the DFT, DWT, DCT, and Haar transforms are close in performance. Among the suboptimum transforms, the DCT is the best.

4.3.5.3 Computational Complexity

Note that while the DWT, DHT, DFT, and DCT are input-image independent, the discrete KLT (the Hotelling transform) is input dependent. More specifically, the row vectors of the Hotelling transform matrix are transposed eigenvectors of the covariance matrix of the input random vector. So far there is no fast transform algorithm available. This computational complexity prohibits the Hotelling transform from practical usage. It can be shown that the DWT, DFT, and DCT can be implemented using the FFT algorithm.

4.3.5.4 Summary

As pointed out earlier, the DCT is the best among the suboptimum transforms in terms of energy compaction. Moreover, the DCT can be implemented using the FFT. Even though a 2N-point sequence is involved, the even symmetry makes the computation involved in the N-point DCT equivalent to that of the N-point FFT. For these two reasons, the DCT finds the widest application in image and video coding.

4.4 Bit Allocation

As shown in Figure 4.2, in TC an input image is first divided into blocks (subimages). Then a 2-D linear transform is applied to each block. The transformed blocks go through truncation, quantization, and code word assignment. The last three functions, truncation, quantization, and code word assignment, are combined and called bit allocation.

From the previous section, it is known that the applied transform decorrelates subimages. Moreover, it redistributes image energy in the transform domain in such a way that most of the energy is compacted into a small fraction of coefficients. Therefore, it is possible to discard the majority of transform coefficients without introducing significant distortion.

As a result, we see that in TC there are mainly three types of error involved. One is due to truncation, i.e., the majority of coefficients are truncated to zero. The other comes from quantization. Transmission error is the third type of error. (Note that truncation can also be considered a special type of quantization.) The mean square reconstruction error discussed in Section 4.3.5.2 is in fact only related to truncation error. For this reason, it was referred to more precisely as mean square approximation error. In general, the reconstruction error, i.e., the error between the original image signal and the reconstructed image at the receiver, includes three types of error: truncation error, quantization error, and transmission error.

There are two different ways to truncate transform coefficients. One is called zonal coding, while the other is threshold coding. They are discussed below.

4.4.1 Zonal Coding

In zonal coding, also known as zonal sampling, a zone in the transformed block is predefined according to a statistical average obtained from many blocks. All transform coefficients in the zone are retained, while all coefficients outside the zone are set to zero. As mentioned in Section 4.3.5.1, the total energy of the image remains the same after applying the transforms discussed there. Since it is known that the DC and low-frequency


FIGURE 4.9
Two illustrations of zonal coding for an 8 × 8 block of transform coefficients, with coordinates running from (0, 0) to (0, 7) across the top and down to (7, 0) on the left. In each illustration the retained coefficients (marked 1) form a zone around the DC and low-frequency positions in the top-left corner, while the remaining coefficients (marked 0) are set to zero.

AC coefficients of the DCT occupy most of the energy, the zone is located in the top-left portion of the transformed block when the transform coordinate system is set conventionally. (Note that by DC we mean u = v = 0. By AC we mean u and v do not equal zero simultaneously.) That is, the origin is at the top-left corner of the transformed block. Two typical zones are shown in Figure 4.9. The simplest uniform quantization with natural binary coding can be used to quantize and encode the retained transform coefficients. With this simple technique, there is no overhead side information that needs to be sent to the receiver, since the structure of the zone and the schemes of quantization and encoding are known at both the transmitter and the receiver.
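As an illustration of zonal truncation (a sketch, not the book's implementation; the triangular zone shape and its width are arbitrary choices), a zone can be represented as a binary mask applied to the block of transform coefficients.

```python
import numpy as np

def triangular_zone(N, width):
    """One possible zone: keep coefficient (u, v) when u + v < width,
    i.e., the DC and low-frequency corner of the block."""
    u, v = np.indices((N, N))
    return (u + v) < width

def zonal_truncate(T, mask):
    """Retain transform coefficients inside the zone, zero the rest."""
    return np.where(mask, T, 0.0)

T = np.random.randn(8, 8)          # stand-in for an 8x8 block of transform coefficients
mask = triangular_zone(8, 4)
print(int(mask.sum()), "of 64 coefficients retained")
print(zonal_truncate(T, mask))
```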

The coding efficiency, however, may not be very high. This is because the zone is predefined based on average statistics. Therefore some coefficients outside the zone might be large in magnitude, while some coefficients inside the zone may be small in magnitude. Uniform quantization and natural binary encoding are simple, but they are known not to be efficient enough.

For further improvement of coding efficiency, an adaptive scheme has to be used. There, a two-pass procedure is applied. In the first pass, the variances of transform coefficients are measured or estimated. Based on the statistics, the quantization and encoding schemes are determined. In the second pass, quantization and encoding are carried out [habibi 1971a; chen 1977].

4.4.2 Threshold Coding

In threshold coding, also known as threshold sampling, there is no predefined zone. Instead, each transform coefficient is compared with a threshold. If it is smaller than the threshold, then it is set to zero. If it is larger than the threshold, it will be retained for quantization and encoding. Compared with zonal coding, this scheme is adaptive in truncation in the sense that the coefficients with more energy are retained no matter where they are located. The addresses of these retained coefficients, however, have to be sent to the receiver as side information. Furthermore, the threshold is determined after an evaluation of all coefficients. Hence, it was usually a two-pass adaptive technique.

Chen and Pratt devised an efficient adaptive scheme to handle threshold coding [chen 1984]. It is a one-pass adaptive scheme, in contrast to two-pass adaptive schemes. Hence it is fast in implementation. With several effective techniques addressed here, it achieved excellent results in TC. Specifically, it demonstrated satisfactory quality of reconstructed


FIGURE 4.10
Block diagram of the algorithm proposed by Chen and Pratt. (a) Transmitter: each input subimage g(x, y) is DCT transformed to C(u, v); the coefficients then undergo thresholding and shifting, normalization with a variable quantization step size, roundoff, zigzag scan, and Huffman coding, and the resulting bit stream passes through a rate buffer whose status is fed back to control the thresholding and normalization. (b) Receiver: the input bit stream is buffered, decoded, inverse normalized, shifted back, arranged into blocks of transform coefficients, and inverse DCT transformed to give the reconstructed subimage. (From Chen, W.H. and Pratt, W.K., IEEE Trans. Commun., COM-32, 225–232, 1984. With permission.)

frames at a bit rate of 0.4 bits/pixel for coding of color images, which corresponds to real-time color television transmission over a 1.5 Mbits/s channel. This scheme has been adopted by the international still image coding standard JPEG. A block diagram of the threshold coding proposed by Chen and Pratt is shown in Figure 4.10. More details and the modifications made by JPEG will be described in Chapter 7.

4.4.2.1 Thresholding and Shifting

The DCT is used in the scheme because of its superiority, described in Section 4.3. Here we use C(u, v) to denote the DCT coefficients. The DC coefficient, C(0, 0), is processed differently. As mentioned in Chapter 3, the DC coefficients are encoded with a differential coding technique. For more detail, refer to Chapter 7. For all the AC coefficients, the following thresholding and shifting are carried out:

C_T(u, v) = \begin{cases} C(u, v) - T & \text{if } C(u, v) > T \\ 0 & \text{if } C(u, v) \le T \end{cases}  (4.68)


FIGURE 4.11
Input–output characteristic of thresholding and shifting: C_T(u, v) versus C(u, v), in units of the threshold T.

where T on the right-hand side is the threshold. Note that Equation 4.68 also implies a shifting of transform coefficients by T. The input–output characteristic of the thresholding and shifting is shown in Figure 4.11.

Figure 4.12 demonstrates that more than 60% of the DCT coefficients normally fall below a threshold value as low as 5. This indicates that with a properly selected threshold value it is possible to set most of the DCT coefficients equal to zero. The threshold value is adjusted by the feedback from the rate buffer, or by the desired bit rate.

4.4.2.2 Normalization and Roundoff

The threshold-subtracted transform coefficients C_T(u, v) are normalized before roundoff. The normalization is implemented as follows:

C_{TN}(u, v) = \frac{C_T(u, v)}{\Gamma_{u,v}},  (4.69)

FIGURE 4.12
Amplitude distribution of the DCT coefficients: percentage of coefficients falling below the threshold, plotted against the coefficient threshold (from 0 to 40), for the Miss America and Football sequences.


where the normalization factor Γ_{u,v} is controlled by the rate buffer. The roundoff process converts floating point to integer as follows:

R[C_{TN}(u, v)] = C^*_{TN}(u, v) = \begin{cases} \lfloor C_{TN}(u, v) + 0.5 \rfloor & \text{if } C_{TN}(u, v) \ge 0 \\ \lceil C_{TN}(u, v) - 0.5 \rceil & \text{if } C_{TN}(u, v) < 0 \end{cases}  (4.70)

where the operator ⌊x⌋ means the largest integer smaller than or equal to x, and the operator ⌈x⌉ means the smallest integer larger than or equal to x. The input–output characteristics of the normalization and roundoff are shown in Figure 4.13a and b, respectively.
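Equations 4.68 through 4.70 can be summarized in a few lines. The sketch below is illustrative only: it applies the thresholding to coefficient magnitudes with the sign restored, following the symmetric characteristic of Figure 4.11, and treats all entries alike even though the DC coefficient is actually coded differentially; the sample array and parameter values are arbitrary.

```python
import numpy as np

def threshold_and_shift(C, T):
    """Equation 4.68 applied to the magnitude of each coefficient, with the sign
    kept, following the symmetric characteristic of Figure 4.11."""
    mag = np.abs(C)
    return np.where(mag > T, np.sign(C) * (mag - T), 0.0)

def normalize(C_T, gamma):
    """Equation 4.69: divide by the normalization factor (the quantization step)."""
    return C_T / gamma

def roundoff(C_TN):
    """Equation 4.70: round half away from zero (a uniform midtread quantizer
    with unit step)."""
    return np.where(C_TN >= 0, np.floor(C_TN + 0.5), np.ceil(C_TN - 0.5))

C = np.array([[35.0, -7.2, 1.4],
              [ 6.1, -2.9, 0.3],
              [ 4.0,  0.8, -0.1]])
q = roundoff(normalize(threshold_and_shift(C, T=2.0), gamma=4.0))
print(q)
```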

From these input–output characteristics, we can see that the roundoff is a uniform midtread quantizer with a unit quantization step. The combination of normalization and roundoff is equivalent to a uniform midtread quantizer with the quantization step size equal to the normalization factor Γ_{u,v}. Normalization is a scaling process, which makes the resultant uniform midtread quantizer adapt to the dynamic range of the associated transform coefficient. It is therefore possible for one quantizer design to be applied to various coefficients with different ranges. Obviously, by adjusting the parameter Γ_{u,v} (the quantization step size), the variable bit rate (VBR) and the mean square quantization error can be controlled. The selection of the normalization factors for different transform coefficients can hence take

FIGURE 4.13
Input–output characteristic of (a) normalization, a scaling with slope 1/Γ_{u,v}, and (b) roundoff, a uniform midtread staircase with unit step.


FIGURE 4.14
Quantization tables. (a) Luminance quantization table. (b) Chrominance quantization table.

(a) Luminance:
16 11 10 16  24  40  51  61
12 12 14 19  26  58  60  55
14 13 16 24  40  57  69  56
14 17 22 29  51  87  80  62
18 22 37 56  68 109 103  77
24 35 55 64  81 104 113  92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103  99

(b) Chrominance:
17 18 24 47 99 99 99 99
18 21 26 66 99 99 99 99
24 26 56 99 99 99 99 99
47 66 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99
99 99 99 99 99 99 99 99

the statistical features of the images and the characteristics of the HVS into consideration. In general, most image energy is contained in the DC and low-frequency AC transform coefficients. The HVS is more sensitive to a relatively uniform region than to a relatively detailed region, as discussed in Chapter 1. Chapter 1 also mentions that, with regard to color images, the HVS is more sensitive to the luminance component than to the chrominance components.

These have been taken into consideration in JPEG. A matrix consisting of all the normalization factors is called a quantization table in JPEG. A luminance quantization table and a chrominance quantization table used in JPEG are shown in Figure 4.14. We observe that, in general, in both tables the small normalization factors are assigned to the DC and low-frequency AC coefficients. The large Γs are associated with the high-frequency transform coefficients. Compared with the luminance quantization table, the chrominance quantization table has larger quantization step sizes for the low- and middle-frequency coefficients and almost the same step sizes for the DC and high-frequency coefficients, indicating that the chrominance components are relatively coarsely quantized compared with the luminance component.

4.4.2.3 Zigzag Scan

As mentioned at the beginning of this section, while threshold coding is adaptive to the local statistics and hence is more efficient in truncation, it needs to send the addresses of retained coefficients to the receiver as overhead side information. An efficient scheme, called zigzag scan, was proposed in [chen 1984] and is shown in Figure 4.15. As shown in Figure 4.12, a great majority of transform coefficients has magnitude smaller than a threshold of 3. Consequently, most quantized coefficients are zero. Hence, in the 1-D sequence obtained by zigzag scanning, most of the numbers are zero. A code known as run-length code (RLC), discussed in Chapter 6, is very efficient under these circumstances for encoding the address information of nonzero coefficients. The run-length of zero coefficients is understood as the number of consecutive zeros in the zigzag scan. Zigzag scanning minimizes the use of RLCs in the block, hence making the codes most efficient.
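A zigzag scan order can be generated by walking the antidiagonals of the block in alternating directions. The sketch below uses one common convention (the same start direction as JPEG); the exact traversal shown in Figure 4.15 may differ in minor details.

```python
import numpy as np

def zigzag_order(N=8):
    """Zigzag scan order for an N x N block: coefficients are visited along
    antidiagonals u + v = 0, 1, 2, ..., alternating direction."""
    order = []
    for s in range(2 * N - 1):
        diag = [(u, s - u) for u in range(N) if 0 <= s - u < N]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def zigzag_scan(block):
    """Flatten a 2-D coefficient block into the 1-D zigzag sequence."""
    return np.array([block[u, v] for u, v in zigzag_order(block.shape[0])])

print(zigzag_order(4)[:6])   # (0,0), (0,1), (1,0), (2,0), (1,1), (0,2)
```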

4.4.2.4 Huffman Coding

Statistical studies of the magnitude of nonzero DCT coefficients and the run-length of zero DCT coefficients in the zigzag scan were conducted in [chen 1984]. The domination of coefficients with small amplitude and of short run-lengths was found and is shown in Figures 4.16 and 4.17. This justifies the application of Huffman coding to the magnitude of nonzero transform coefficients and the run-lengths of zeros.


FIGURE 4.15
Zigzag scan of DCT coefficients within an 8 × 8 block.

4.4.2.5 Special Code Words

Two special code words were used in [chen 1984]. One is called end of block (EOB). Another is called run-length prefix. Once the last nonzero DCT coefficient in the zigzag scan is coded, EOB is appended, indicating the termination of coding the block. This further saves bits used in coding. The run-length prefix is used to discriminate the RLC words from the amplitude code words.
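The run-length and EOB idea can be sketched as follows (illustrative only; the actual code word formats in [chen 1984] and in JPEG differ in detail). The input is assumed to be the zigzag-scanned, quantized AC coefficients of one block, and the output is the sequence of (run-of-zeros, amplitude) symbols that would then be entropy coded.

```python
def run_length_symbols(zigzag_ac, eob="EOB"):
    """Turn zigzag-scanned, quantized AC coefficients into
    (run-of-zeros, amplitude) symbols, terminated by an end-of-block code."""
    last_nonzero = -1
    for i, c in enumerate(zigzag_ac):
        if c != 0:
            last_nonzero = i
    symbols, run = [], 0
    for c in zigzag_ac[:last_nonzero + 1]:
        if c == 0:
            run += 1
        else:
            symbols.append((run, c))
            run = 0
    symbols.append(eob)
    return symbols

print(run_length_symbols([5, 0, 0, -2, 1, 0, 0, 0, 3, 0, 0, 0, 0]))
# [(0, 5), (2, -2), (0, 1), (3, 3), 'EOB']
```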

4.4.2.6 Rate Buffer Feedback and Equalization

As shown in Figure 4.10, a rate buffer accepts a variable-rate data input from the encoding process and provides a fixed-rate data output to the channel. The status of the rate buffer is monitored and fed back to control the threshold and the normalization factor. In this fashion a one-pass adaptation is achieved.

FIGURE 4.16
Histogram of DCT coefficients in absolute amplitude: number of coefficients versus amplitude in absolute value, for the Miss America and Football sequences.


FIGURE 4.17
Histogram of zero run-length: number of coefficients versus number of consecutive zeros, for the Miss America and Football sequences.

4.5 Some Issues

4.5.1 Effect of Transmission Error

In TC, each pixel in the reconstructed image relies on all transform coefficients in the subimage where the pixel is located. Hence, a bit reversal transmission error will be spread. That is, an error in a transform coefficient will lead to errors in all the pixels within the subimage. As discussed in Section 4.2.3, this is one of the reasons the selected subimage size cannot be very large. Depending on which coefficient is in error, the effect caused by a bit reversal error on the reconstructed image varies. For instance, an error in the DC or a low-frequency AC coefficient may be objectionable, while an error in a high-frequency coefficient may be less noticeable.

4.5.2 Reconstruction Error Sources

As discussed, three sources contribute to the reconstruction error: truncation (discarding transform coefficients with small variances), quantization, and transmission. It is noted that in TC the transform is applied block by block. Quantization and encoding of transform coefficients are also conducted blockwise. At the receiver, reconstructed blocks are put together to form the whole reconstructed image. In the process, block artifacts are produced. Sometimes, even though they may not severely affect an objective assessment of the reconstructed image quality, block artifacts can be annoying to the HVS, especially when the coding rate is low.

To alleviate the blocking effect, several techniques have been proposed. One is to overlap blocks in the source image. Another is to postfilter the reconstructed image along block boundaries. The selection of advanced transforms is an additional possible method [lim 1990].


In the block overlapping method, when the blocks are finally organized to form the reconstructed image, each pixel in the overlapped regions takes an average value of all its reconstructed gray level values from multiple blocks. In this method, extra bits are used for those pixels involved in the overlapped regions. For this reason, the overlapped region is usually only one pixel wide.

Due to the sharp transition along block boundaries, block artifacts are of high frequency in nature. Low-pass filtering is hence normally used in the postfiltering method. To avoid the blurring effect caused by low-pass filtering on the nonboundary image area, low-pass postfiltering is only applied to block boundaries. Unlike the block overlapping method, the postfiltering method does not need extra bits. Moreover, it has been shown that the postfiltering method can achieve better results in combating block artifacts [reeve 1984; ramamurthi 1986]. For these two reasons, the postfiltering method has been adopted by the international coding standards.

4.5.3 Comparison between DPCM and TC

As mentioned at the beginning of the chapter, both differential coding and TC utilize interpixel correlation and are efficient coding techniques. Comparisons between these two techniques have been reported in [habibi 1971b]. Looking at the techniques discussed in Chapters 3 and 4, we can see that differential coding is simpler than TC. This is because the linear prediction and differencing involved in differential coding are simpler than the 2-D transform involved in TC. In terms of the memory requirement and processing delay, differential coding such as DPCM is superior to TC. That is, DPCM needs less memory and has less processing delay than TC. The design of the DPCM system, however, is sensitive to image-to-image variation, and so is its performance. That is, an optimum DPCM design is matched to the statistics of a certain image. When the statistics change, the performance of the DPCM will be affected. On the contrary, TC is less sensitive to the variation in the image statistics. In general, the optimum DPCM coding system with a third- or higher-order predictor performs better than TC when the bit rate is about 2–3 bits/pixel for single images. When the bit rate is below 2–3 bits/pixel, TC is normally preferred. As a result, the international still image coding standard JPEG is based on TC, whereas, in JPEG, DPCM is used for coding the DC coefficients of the DCT, and information-preserving differential coding is used for lossless still image coding.

4.5.4 Hybrid Coding

A method called hybrid transform/waveform coding, or simply hybrid coding, was devised to combine the merits of the two methods. By waveform coding, we mean coding techniques that code the waveform of a signal instead of the transformed signal. DPCM is a waveform coding technique. Hybrid coding combines TC and DPCM coding. That is, TC can be applied first rowwise, followed by DPCM coding columnwise, or vice versa. In this way, the two techniques complement each other. That is, the hybrid coding technique simultaneously has TC's small sensitivity to variable image statistics and DPCM's simplicity in implementation.

It is worth mentioning a successful hybrid coding scheme in interframe coding: predictive coding along the temporal domain. Specifically, it uses MC predictive coding. That is, the motion analyzed from successive frames is used to more accurately predict a frame. The prediction error (in the 2-D spatial domain) is transform coded. This hybrid coding scheme has been very efficient and was adopted by the international video coding standards H.261, H.263, and MPEG-1, 2, and 4.


4.6 Summary

In TC, instead of the original image or some function of the original image in the spatial and temporal domain, the image in the transform domain is quantized and encoded. The main idea behind TC is that the transformed version of the image is less correlated. Moreover, the image energy is compacted into a small proper subset of transform coefficients.

The basis vector (1-D) and the basis image (2-D) provide a meaningful interpretation of TC. This type of interpretation considers the original image to be a weighted sum of basis vectors or basis images. The weights are the transform coefficients, each of which is essentially a correlation measure between the original image and the corresponding basis image. These weights are less correlated than the gray level values of pixels in the original image. Furthermore, they have a great disparity in variance distribution. Some weights have large variances; they are retained and finely quantized. Some weights have small energy; they are retained and coarsely quantized. A vast majority of weights are insignificant and discarded. In this way, a high coding efficiency is achieved in TC. Because the quantized nonzero coefficients have a very nonuniform probability distribution, they can be encoded by using efficient variable-length codes. In summary, three factors, truncation (discarding a great majority of transform coefficients), adaptive quantization, and variable-length coding, contribute mainly to the high coding efficiency of TC.

Several linear, reversible, unitary transforms have been studied and utilized in TC. They include the discrete KLT (the Hotelling transform), the DFT, the Walsh transform, the Hadamard transform, and the discrete cosine transform. It is shown that the KLT is the optimum. The transform coefficients of the KLT are uncorrelated. The KLT can compact the most energy in the smallest fraction of transform coefficients. However, the KLT is image dependent, and there is no fast algorithm to implement it. This prohibits the KLT from practical use in TC. While the rest of the transforms perform closely, the DCT appears to be the best. Its energy compaction is very close to that of the optimum KLT, and it can be implemented using the FFT. The DCT has been found to be efficient not only for still image coding but also for coding residual images (prediction error) in MC interframe predictive coding. These features make the DCT the most widely used transform in image and video coding.

There are two ways to truncate transform coefficients: zonal coding and threshold coding. In zonal coding, a zone is predefined based on average statistics. The transform coefficients within the zone are retained, while those outside the zone are discarded. In threshold coding, each transform coefficient is compared with a threshold. Those coefficients larger than the threshold are retained, while those smaller are discarded. Threshold coding is adaptive to local statistics. A two-pass procedure was usually taken. That is, the local statistics are measured or estimated in the first pass, and the truncation takes place in the second pass. The addresses of the retained coefficients need to be sent to the receiver as overhead side information.

A one-pass adaptive framework of TC has evolved as a result of the tremendous research efforts in image coding. It became a basis of the international still image coding standard JPEG. Its fundamental components include the DCT transform, thresholding and adaptive quantization of transform coefficients, zigzag scan, Huffman coding of the magnitude of nonzero DCT coefficients and the run-length of zeros in the zigzag scan, the code word of EOB, and rate buffer feedback control.

The threshold and normalization factor are controlled by rate buffer feedback. Since the threshold decides how many transform coefficients are retained and the normalization factor is actually the quantization step size, the rate buffer has a direct impact on the bit rate


of the TC system. Selection of quantization steps takes the energy compaction of the DCT and the characteristics of the HVS into consideration. That is, it uses not only statistical redundancy but also psychovisual redundancy to enhance coding efficiency.

After thresholding, normalization, and roundoff are applied to the DCT transform coefficients in a block, a great majority of transform coefficients are set to zero. Zigzag scan can convert the 2-D array of transform coefficients into a 1-D sequence. The number of consecutive zero-valued coefficients in the 1-D sequence is referred to as the run-length of zeros and is used to provide address information of nonzero DCT coefficients. Both the magnitude of nonzero coefficients and the run-length information need to be coded. Statistical analysis has demonstrated that small magnitudes and short run-lengths are dominant. Therefore, efficient lossless entropy coding methods such as Huffman coding and arithmetic coding (the focus of the next chapter) can be applied to magnitude and run-length.

In a reconstructed subimage, there are three types of error involved: truncation error (some transform coefficients have been set to zero), quantization error, and transmission error. In a broad sense, the truncation can be viewed as a part of the quantization. That is, these truncated coefficients are quantized to zero. A transmission error in terms of bit reversal will affect the whole reconstructed subimage. This is because, in the inverse transform (such as the inverse DCT), each transform coefficient makes a contribution.

In reconstructing the original image, all the subimages are organized to form the whole image. Therefore the independent processing of individual subimages causes block artifacts. Though they may not severely affect the objective assessment of reconstructed image quality, block artifacts can be annoying, especially in low bit rate image coding. Block overlapping and postfiltering are the two effective ways to alleviate block artifacts. In the former, neighboring blocks are purposely overlapped by one pixel. In reconstructing the image, those pixels that have been coded more than once take an average of the multiple decoded values. Extra bits are used. In the latter technique, a low-pass filter is applied along the boundaries of blocks. No extra bits are required in the process, and the effect of combating block artifacts is better than with the former technique.

The selection of subimage size is an important issue in the implementation of TC. In general, a large size will remove more interpixel redundancy. But it has been shown that the pixel correlation becomes insignificant when the distance between pixels exceeds 20. On the other hand, a large size is not suitable for adaptation to local statistics, while adaptation is required in handling nonstationary images. A large size also makes the effect of transmission error spread more widely. For these reasons, the subimage size should not be large. In MC predictive interframe coding, motion estimation is normally carried out in sizes of 16 × 16 or 8 × 8. To be compatible, the subimage size in TC is frequently chosen as 8 × 8.

Both predictive coding, say DPCM, and TC utilize interpixel correlation and are efficient coding schemes. Compared with TC, DPCM is simpler in computation. It needs less storage and has less processing delay. But it is more sensitive to image-to-image variation. On the other hand, TC provides higher adaptation to statistical variation. TC is capable of removing more interpixel correlation, thus providing higher coding efficiency. Traditionally, people consider that predictive coding is preferred if the bit rate is in the range of 2–3 bits/pixel, while TC is preferred when the bit rate is below 2–3 bits/pixel. However, the situation has changed. TC has become the core technology in image and video coding. Many special VLSI chips have been designed and manufactured for reducing computational complexity, so complexity has become less important. Consequently, predictive coding such as DPCM is only used in some very simple applications.

In the context of interframe coding, 3-D (two spatial dimensions and one temporal dimension) TC has not found wide application in practice due to the complexity in


computation and storage. Hybrid transform/waveform coding has proven to be very efficient in interframe coding. There, MC predictive coding is used along the temporal dimension, while TC is used to code the prediction error in the two spatial dimensions.

Exercises

1. Consider the following eight points in a 3-D coordinate system: (0,0,0)ᵀ, (1,0,0)ᵀ, (0,1,0)ᵀ, (0,0,1)ᵀ, (0,1,1)ᵀ, (1,0,1)ᵀ, (1,1,0)ᵀ, (1,1,1)ᵀ. Find the mean vector and covariance matrix using Equations 4.12 and 4.13.
2. For N = 4, find the basis images of the DFT, I_{u,v}, when (a) u = 0, v = 0; (b) u = 1, v = 0; (c) u = 2, v = 2; (d) u = 3, v = 2. Use both methods discussed in the text, i.e., the method with basis images and the method with basis vectors.
3. For N = 4, find the basis images of the ordered DHT when (a) u = 0, v = 2; (b) u = 1, v = 3; (c) u = 2, v = 3; (d) u = 3, v = 3. Verify your results by comparing them with Figure 4.5.
4. Repeat the previous problem for the DWT, and verify your results by comparing them with Figure 4.4.
5. Repeat problem 3 for the DCT and N = 4.
6. When N = 8, draw the transform matrix F for the DWT, DHT, the ordered DHT, DFT, and DCT.
7. The matrix forms of the forward and inverse 2-D symmetric image transforms are expressed in texts such as [jayant 1984] as T = FGFᵀ and G = ITIᵀ, which are different from Equations 4.28 and 4.29. Can you explain this discrepancy?
8. Derive Equation 4.64. [NB: Use the concept of basis vectors and the orthogonality of basis vectors.]
9. Justify that the normalization factor is the quantization step.
10. The transform used in TC has two functions: decorrelation and energy compaction. Does decorrelation automatically lead to energy compaction? Comment.
11. Using your own words, explain the main idea behind TC.
12. Read the techniques by Chen and Pratt presented in Section 4.4.2. Compare them with JPEG discussed in Chapter 7. Comment on the similarity and dissimilarity between them.
13. How is the one-pass adaptation to local statistics in the algorithm of Chen and Pratt achieved?
14. Using your own words, explain why the DCT is superior to the DFT in terms of energy compaction.
15. Why is the subimage size of 8 × 8 widely used?

References

[ahmed 1974] N. Ahmed, T. Natarajan, and K.R. Rao, Discrete cosine transform, IEEE Transactions on Computers, 90–93, January 1974.

[andrews 1971] H.C. Andrews, Multidimensional rotations in feature selection, IEEE Transactions on Computers, C-20, 1045–1051, September 1971.

[chen 1977] W.H. Chen and C.H. Smith, Adaptive coding of monochrome and color images, IEEE Transactions on Communications, COM-25, 1285–1292, November 1977.

[chen 1984] W.H. Chen and W.K. Pratt, Scene adaptive coder, IEEE Transactions on Communications, COM-32, 225–232, March 1984.

[cooley 1965] J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation, 19, 297–301, 1965.

[gonzalez 2001] R.C. Gonzalez and R.E. Woods, Digital Image Processing, 2nd edition, Prentice Hall, Upper Saddle River, NJ, 2001.

[habibi 1971a] A. Habibi and P.A. Wintz, Image coding by linear transformations and block quantization, IEEE Transactions on Communication Technology, COM-19, 50–60, February 1971.

[habibi 1971b] A. Habibi, Comparison of nth-order DPCM encoder with linear transformations and block quantization techniques, IEEE Transactions on Communication Technology, COM-19, 6, 948–956, December 1971.

[hadamard 1893] J. Hadamard, Resolution d'une question relative aux determinants, Bulletin des Sciences Mathematiques, Series 2, 17, Part I, 240–246, 1893.

[haskell 1996] B.G. Haskell, A. Puri, and A.N. Netravali, Digital Video: An Introduction to MPEG-2, Chapman & Hall, 1996.

[hotelling 1933] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24, 417–441, 498–520, 1933.

[huang 1963] J.-Y. Huang and P.M. Schultheiss, Block quantization of correlated Gaussian random variables, IEEE Transactions on Communication Systems, CS-11, 289–296, September 1963.

[jayant 1984] N.S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs, NJ, 1984.

[karhunen 1947] H. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung, Ann. Acad. Sci. Fenn., Ser. A.I. 37, Helsinki, 1947. (An English translation is available as "On linear methods in probability theory" (I. Selin, transl.), The RAND Corp., Dec. T-131, Aug. 11, 1960.)

[lim 1990] J.S. Lim, Two-Dimensional Signal and Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[loeve 1948] M. Loève, Fonctions aleatoires de seconde ordre, in P. Levy, Processus Stochastiques et Mouvement Brownien, Hermann, Paris, France, 1948.

[pearl 1972] J. Pearl, H.C. Andrews, and W.K. Pratt, Performance measures for transform data coding, IEEE Transactions on Communication Technology, COM-20, 411–415, June 1972.

[ramamurthi 1986] B. Ramamurthi and A. Gersho, Nonlinear space-variant postprocessing of block coded images, IEEE Transactions on Acoustics, Speech and Signal Processing, 34, 1258–1267, October 1986.

[reeve 1984] H.C. Reeve III and J.S. Lim, Reduction of blocking effects in image coding, Journal of Optical Engineering, 23, 34–37, January/February 1984.

[strang 1998] G. Strang, Introduction to Linear Algebra, Wellesley-Cambridge Press, Cambridge, MA, June 1998.

[tasto 1971] M. Tasto and P.A. Wintz, Image coding by adaptive block quantization, IEEE Transactions on Communication Technology, COM-19, 6, 957–972, December 1971.

[walsh 1923] J.L. Walsh, A closed set of normal orthogonal functions, American Journal of Mathematics, 45, 1, 5–24, 1923.

[wintz 1972] P.A. Wintz, Transform picture coding, Proceedings of the IEEE, 60, 7, 809–820, July 1972.

[zelinski 1975] R. Zelinski and P. Noll, Adaptive block quantization of speech signals (in German), Technical Report no. 181, Heinrich Hertz Institut, Berlin, 1975.

� 2007 by Taylor & Francis Group, LLC.

Page 147: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

� 2007 by Taylor & Francis Group, LLC.

Page 148: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

5 Variable-Length Coding: Information Theory Results (II)

There are three stages that take place in an encoder: transformation, quantization, and codeword assignment (Figure 2.3). Quantization was discussed in Chapter 2. Differential coding and transform coding, using two different transformation components, were covered in Chapters 3 and 4, respectively. In differential coding it is the difference signal that is quantized and encoded, whereas in transform coding it is the transformed signal that is quantized and encoded. In this chapter and the next chapter, we discuss several codeword assignment (encoding) techniques. In this chapter, we cover two types of variable-length coding (VLC): Huffman coding and arithmetic coding.

First we introduce some fundamental concepts of encoding. Then, the rules that must be obeyed by all optimum and instantaneous codes are discussed. On the basis of these rules, the Huffman coding algorithm is presented. A modified version of the Huffman coding algorithm is introduced as an efficient way to dramatically reduce codebook memory while keeping almost the same optimality.

The promising arithmetic coding algorithm, which is quite different from Huffman coding, is another focus of the chapter. While Huffman coding is a block-oriented coding technique, arithmetic coding is a stream-oriented coding technique. With improvements in implementation, arithmetic coding has gained increasing popularity. Both Huffman and arithmetic coding are included in the international still image coding standard JPEG (Joint Photographic Experts Group coding). The adaptive arithmetic coding algorithms are adopted by the international bi-level image coding standard JBIG (Joint Bi-level Image Experts Group coding). Note that the material presented in this chapter can be viewed as a continuation of the information theory results presented in Chapter 1.

5.1 Some Fundamental Results

Before presenting Huffman coding and arithmetic coding, we first provide some fundamental concepts and results as necessary background.

5.1.1 Coding an Information Source

Consider an information source, represented by a source alphabet S.

S = {s1, s2, ..., sm},   (5.1)


where si, i = 1, 2, ..., m, are source symbols. Note that the terms source symbol and information message are used interchangeably in the literature. In this book, however, we would like to distinguish them. That is, an information message can be a source symbol, or a combination of source symbols. We denote the code alphabet by A and

A = {a1, a2, ..., ar},   (5.2)

where aj, j = 1, 2, ..., r, are code symbols. A message code is a sequence of code symbols that represents a given information message. In the simplest case, a message consists of only a source symbol. Encoding is then a procedure to assign a code word to the source symbol. Namely,

si → Ai = (ai1, ai2, ..., aik),   (5.3)

where the code word Ai is a string of k code symbols assigned to the source symbol si. The term message ensemble is defined as the entire set of messages. A code, also known as an ensemble code, is defined as a mapping of all the possible sequences of symbols of S (message ensemble) into the sequences of symbols in A.

Note that in binary coding, the number of code symbols r is equal to 2, since there are only two code symbols available: the binary digits "0" and "1". Two examples are given below to illustrate the above concepts.

Example 5.1
Consider an English article and the ASCII code. Refer to Table 5.1. In this context, the source alphabet consists of all the English letters in both lower and upper cases and all the punctuation marks. The code alphabet consists of the binary 1 and 0. There are a total of 128 7-bit binary code words. From Table 5.1, we see that the code word assigned to the capital letter A is 1000001. That is, A is a source symbol, while 1000001 is its code word.

Example 5.2
Table 5.2 lists what is known as the (5,2) code. It is a linear block code. In this example, the source alphabet consists of the four (2^2) source symbols listed in the left column of the table: 00, 01, 10, and 11. The code alphabet consists of the binary 1 and 0. There are four code words listed in the right column of the table. From this table, we see that the code assigns a 5-bit code word to each source symbol. Specifically, the code word of the source symbol 00 is 00000. The source symbol 01 is encoded as 10100. The code word assigned to 10 is 01111. The symbol 11 is mapped to 11011.

5.1.2 Some Desired Characteristics

To be practical in use, codes need to have some desired characteristics [abramson 1963]. Some of the characteristics are addressed in this subsection.

5.1.2.1 Block Code

A code is said to be a block code if it maps each source symbol in S into a fixed code word in A. Hence, the codes listed in the above two examples are block codes.

5.1.2.2 Uniquely Decodable Code

A code is uniquely decodable if it can be unambiguously decoded. Obviously, a code has to be uniquely decodable if it is to be in use.


TABLE 5.1

Seven-bit American Standard Code for Information Interchange

              Bits 5:  0    1    0    1    0    1    0    1
              Bits 6:  0    0    1    1    0    0    1    1
              Bits 7:  0    0    0    0    1    1    1    1
Bits 1 2 3 4
     0 0 0 0   NUL  DLE  SP   0    @    P    `    p
     1 0 0 0   SOH  DC1  !    1    A    Q    a    q
     0 1 0 0   STX  DC2  "    2    B    R    b    r
     1 1 0 0   ETX  DC3  #    3    C    S    c    s
     0 0 1 0   EOT  DC4  $    4    D    T    d    t
     1 0 1 0   ENQ  NAK  %    5    E    U    e    u
     0 1 1 0   ACK  SYN  &    6    F    V    f    v
     1 1 1 0   BEL  ETB  '    7    G    W    g    w
     0 0 0 1   BS   CAN  (    8    H    X    h    x
     1 0 0 1   HT   EM   )    9    I    Y    i    y
     0 1 0 1   LF   SUB  *    :    J    Z    j    z
     1 1 0 1   VT   ESC  +    ;    K    [    k    {
     0 0 1 1   FF   FS   ,    <    L    \    l    |
     1 0 1 1   CR   GS   -    =    M    ]    m    }
     0 1 1 1   SO   RS   .    >    N    ^    n    ~
     1 1 1 1   SI   US   /    ?    O    _    o    DEL

NUL  Null, or all zeros          DC1  Device control 1
SOH  Start of heading            DC2  Device control 2
STX  Start of text               DC3  Device control 3
ETX  End of text                 DC4  Device control 4
EOT  End of transmission         NAK  Negative acknowledgment
ENQ  Enquiry                     SYN  Synchronous idle
ACK  Acknowledge                 ETB  End of transmission block
BEL  Bell, or alarm              CAN  Cancel
BS   Backspace                   EM   End of medium
HT   Horizontal tabulation       SUB  Substitution
LF   Line feed                   ESC  Escape
VT   Vertical tabulation         FS   File separator
FF   Form feed                   GS   Group separator
CR   Carriage return             RS   Record separator
SO   Shift out                   US   Unit separator
SI   Shift in                    SP   Space
DLE  Data link escape            DEL  Delete

Example 5.3
Table 5.3 specifies a code. Obviously it is not uniquely decodable, since if a binary string "00" is received we do not know which of the following two source symbols has been sent out: s1 or s3.

TABLE 5.2

(5,2) Linear Block Code

Source Symbol Code Word

s1 (0 0) 00000

s2 (0 1) 10100

s3 (1 0) 01111

s4 (1 1) 11011


TABLE 5.3

Not Uniquely Decodable Code

Source Symbol Code Word

s1 00

s2 10

s3 00

s4 11

Nonsingular Code
A block code is nonsingular if all the code words are distinct.

Example 5.4
Table 5.4 gives a nonsingular code since all four code words are distinct. If a code is not a nonsingular code, i.e., at least two code words are identical, then the code is not uniquely decodable. Notice, however, that a nonsingular code does not guarantee unique decodability. The code shown in Table 5.4 is such an example: it is nonsingular, yet it is not uniquely decodable. It is not uniquely decodable because once the binary string "11" is received, we do not know whether the source symbols transmitted are s1 followed by s1 or simply s2.

The nth Extension of a Block Code
The nth extension of a block code, which maps the source symbol si into the code word Ai, is a block code that maps the sequences of source symbols si1 si2 ... sin into the sequences of code words Ai1 Ai2 ... Ain.

A Necessary and Sufficient Condition of Block Codes' Unique Decodability
A block code is uniquely decodable if and only if the nth extension of the code is nonsingular for every finite n.

Example 5.5
The second extension of the nonsingular block code shown in Example 5.4 is listed in Table 5.5. Clearly, this second extension of the code is not a nonsingular code, since the entries s1s2 and s2s1 are the same. This confirms the nonunique decodability of the nonsingular code in Example 5.4.

5.1.2.3 Instantaneous Codes

5.1.2.3.1 Definition of Instantaneous Codes

A uniquely decodable code is said to be instantaneous if it is possible to decode each codeword in a code symbol sequence without knowing the succeeding code words.

TABLE 5.4

Nonsingular Code

Source Symbol Code Word

s1 1

s2 11

s3 00

s4 01


TABLE 5.5

Second Extension of the Nonsingular Block Code Shown in Example 5.4

Source Symbol   Code Word      Source Symbol   Code Word
s1 s1           11             s3 s1           001
s1 s2           111            s3 s2           0011
s1 s3           100            s3 s3           0000
s1 s4           101            s3 s4           0001
s2 s1           111            s4 s1           011
s2 s2           1111           s4 s2           0111
s2 s3           1100           s4 s3           0100
s2 s4           1101           s4 s4           0101

Example 5.6
Table 5.6 lists three uniquely decodable codes. The first one is in fact a 2-bit natural binary code. In decoding, we can immediately tell which source symbols are transmitted since each code word has the same length. In the second code, the code symbol "1" functions like a comma. Whenever we see a "1," we know it is the end of the code word. The third code is different from the earlier two codes in that if we see a "10" string we are not sure whether it corresponds to s2 until we see a succeeding "1." Specifically, if the next code symbol is "0," we still cannot tell whether it is s3, since the one after that may be "0" (hence s4) or "1" (hence s3). In this example, the next "1" belongs to the succeeding code word. Therefore, we see that code 3 is uniquely decodable. However, it is not instantaneous.

Definition of the jth Prefix
Assume a code word Ai = ai1 ai2 ... aik. Then the sequence of code symbols ai1 ai2 ... aij with 1 ≤ j ≤ k is the jth-order prefix of the code word Ai.

Example 5.7
If a code word is 11001, it has the following five prefixes: 11001, 1100, 110, 11, 1. The first-order prefix is 1, while the fifth-order prefix is 11001.

Necessary and Sufficient Condition of Being Instantaneous Codes
A code is instantaneous if and only if no code word is a prefix of some other code word. This condition is often referred to as the prefix condition. Hence, the instantaneous code is also called the prefix condition code or sometimes simply the prefix code. In many applications, we need a block code that is nonsingular, uniquely decodable, and instantaneous.

TABLE 5.6

Three Uniquely Decodable Codes

Source Symbol Code 1 Code 2 Code 3

s1 00 1 1

s2 01 01 10

s3 10 001 100

s4 11 0001 1000
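To make the prefix condition concrete, the following short Python sketch checks it for a set of code words; the helper name is_prefix_free is ours, not from the text. Applied to the three codes of Table 5.6, it accepts codes 1 and 2 and rejects code 3, which is uniquely decodable but not instantaneous.

    def is_prefix_free(codewords):
        """Return True if no code word is a prefix of another (the prefix condition)."""
        for a in codewords:
            for b in codewords:
                if a != b and b.startswith(a):
                    return False
        return True

    # The three codes of Table 5.6
    code1 = ["00", "01", "10", "11"]      # 2-bit natural binary code
    code2 = ["1", "01", "001", "0001"]    # "comma" code
    code3 = ["1", "10", "100", "1000"]

    print(is_prefix_free(code1))   # True  -> instantaneous
    print(is_prefix_free(code2))   # True  -> instantaneous
    print(is_prefix_free(code3))   # False -> not instantaneous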


5.1.2.4 Compact Code

A uniquely decodable code is said to be compact if its average length is the minimum among all other uniquely decodable codes based on the same source alphabet S and code alphabet A. A compact code is also referred to as a minimum redundancy code, or an optimum code.

Note that the average length of a code was defined in Chapter 1 and is restated below.

5.1.3 Discrete Memoryless Sources

This is the simplest model of an information source. In this model, the symbols generated by the source are independent of each other. That is, the source is memoryless, or it has a zero-memory.

Consider the information source expressed in Equation 5.1 as a discrete memoryless source. The occurrence probabilities of the source symbols can be denoted by p(s1), p(s2), ..., p(sm). The lengths of the code words can be denoted by l1, l2, ..., lm. The average length of the code is then equal to

Lavg = Σ_{i=1}^{m} li p(si).   (5.4)

Recall Shannon's first theorem, i.e., the noiseless coding theorem, described in Chapter 1. The average length of the code is bounded below by the entropy of the information source. The entropy of the source S is defined as

H(S) = −Σ_{i=1}^{m} p(si) log2 p(si).   (5.5)

Recall that entropy is the average amount of information contained in a source symbol. In Chapter 1, the efficiency of a code, η, is defined as the ratio between the entropy and the average length of the code. That is, η = H(S)/Lavg. The redundancy of the code, ζ, is defined as ζ = 1 − η.
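As a small numerical illustration of Equations 5.4 and 5.5, the following Python sketch computes H(S), Lavg, the efficiency η, and the redundancy ζ for a six-symbol source (the same one used later in Example 5.9); the function names are ours, not from the text.

    import math

    def entropy(probs):
        """H(S) = -sum of p(si) log2 p(si), Equation 5.5."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def average_length(probs, lengths):
        """Lavg = sum of li p(si), Equation 5.4."""
        return sum(l * p for l, p in zip(lengths, probs))

    probs = [0.3, 0.1, 0.2, 0.05, 0.1, 0.25]   # occurrence probabilities
    lengths = [2, 3, 2, 4, 4, 2]               # code word lengths

    H = entropy(probs)                    # about 2.37 bits/symbol
    L = average_length(probs, lengths)    # 2.40 bits/symbol
    eta = H / L                           # efficiency
    zeta = 1 - eta                        # redundancy
    print(H, L, eta, zeta)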

5.1.4 Extensions of a Discrete Memoryless Source

Instead of coding each source symbol in a discrete source alphabet, it is often useful to code blocks of symbols. It is, therefore, necessary to define the nth extension of a discrete memoryless source.

5.1.4.1 Definition

Consider the zero-memory source alphabet S defined in Equation 5.1. That is, S = {s1, s2, ..., sm}. If n symbols are grouped into a block, then there are a total of m^n blocks. Each block is considered as a new source symbol. These m^n blocks thus form an information source alphabet, called the nth extension of the source S, which is denoted by S^n.

5.1.4.2 Entropy

Let each block be denoted by bi and

bi = (si1, si2, ..., sin).   (5.6)


TABLE 5.7

Discrete Memoryless Source Alphabet

Source Symbol Occurrence Probability

s1 0.6

s2 0.4

Then we have the following relation due to the memoryless assumption:

p(bi) = Π_{j=1}^{n} p(sij).   (5.7)

Hence the relationship between the entropy of the source S and the entropy of its nth extension is as follows:

H(S^n) = n · H(S).   (5.8)

Example 5.8
Table 5.7 lists a source alphabet. Its second extension is listed in Table 5.8.

The entropy of the source and of its second extension are calculated below:

H(S) = −0.6 log2(0.6) − 0.4 log2(0.4) ≈ 0.97,
H(S^2) = −0.36 log2(0.36) − 2(0.24) log2(0.24) − 0.16 log2(0.16) ≈ 1.94.

It is seen that H(S^2) = 2H(S).
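The relation H(S^2) = 2H(S) can be checked numerically with a few lines of Python (a sketch with our own variable names; the probabilities are those of Tables 5.7 and 5.8):

    import math
    from itertools import product

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    p = {"s1": 0.6, "s2": 0.4}                                  # Table 5.7
    p2 = {a + b: p[a] * p[b] for a, b in product(p, repeat=2)}  # Equation 5.7, Table 5.8

    print(entropy(p.values()))    # about 0.97
    print(entropy(p2.values()))   # about 1.94, i.e., 2 * H(S), Equation 5.8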

5.1.4.3 Noiseless Source Coding Theorem

The noiseless source coding theorem, also known as Shannon's first theorem, was presented in Chapter 1, but without a mathematical expression. Here, we provide some mathematical expressions to give more insight into the theorem.

For a discrete zero-memory information source S, the noiseless coding theorem can be expressed as

H(S) ≤ Lavg < H(S) + 1,   (5.9)

TABLE 5.8

Second Extension of the Source Alphabet Shown in Table 5.7

Source Symbol Occurrence Probability

s1 s1 0.36

s1 s2 0.24

s2 s1 0.24

s2 s2 0.16


that is, there exists a VLC whose average length is bounded below by the entropy of the source (that is encoded) and bounded above by the entropy plus 1. Since the nth extension of the source alphabet, S^n, is itself a discrete memoryless source, we can apply the above result to it. That is,

H(S^n) ≤ Lavg^n < H(S^n) + 1,   (5.10)

where Lavg^n is the average code word length of a VLC for S^n. Since H(S^n) = nH(S) and Lavg^n = nLavg, we have

H(S) ≤ Lavg < H(S) + 1/n.   (5.11)

Therefore, when coding blocks of n source symbols, the noiseless source coding theorem states that for an arbitrary positive number ε there is a VLC that satisfies

H(S) ≤ Lavg < H(S) + ε   (5.12)

as n is large enough. That is, the average number of bits used in coding per source symbol is bounded below by the entropy of the source and is bounded above by the sum of the entropy and an arbitrary positive number. To make ε arbitrarily small, i.e., to make the average length of the code arbitrarily close to the entropy, we have to make the block size n large enough. This version of the noiseless coding theorem suggests a way to make the average length of a VLC approach the source entropy. It is known, however, that the high coding complexity that occurs when n approaches infinity makes implementation of the code impractical.

5.2 Huffman Codes

Consider the source alphabet defined in Equation 5.1. The method of encoding source symbols according to their probabilities suggested in [shannon 1948; fano 1949] is not optimum. It approaches the optimum, however, when the block size n approaches infinity. This results in a large storage requirement and high computational complexity. In many cases, we need a direct encoding method that is optimum and instantaneous (hence uniquely decodable) for an information source with finite source symbols in source alphabet S. The Huffman code is the first such optimum code [huffman 1952], and is the technique most frequently used at present. It can be used for r-ary encoding with r > 2. For notational brevity, however, we discuss only the Huffman coding used in the binary case.

5.2.1 Required Rules for Optimum Instantaneous Codes

Let us rewrite Equation 5.1 as follows:

S = (s1, s2, ..., sm).   (5.13)

Without loss of generality, assume the occurrence probabilities of the source symbols are as follows:

p(s1) ≥ p(s2) ≥ ... ≥ p(sm−1) ≥ p(sm).   (5.14)


As we are seeking the optimum code for S, the lengths of the code words assigned to the source symbols should be

l1 ≤ l2 ≤ ... ≤ lm−1 ≤ lm.   (5.15)

On the basis of the requirements of the optimum and instantaneous code, Huffman derived the following rules (restrictions):

1. l1 ≤ l2 ≤ ... ≤ lm−1 = lm.   (5.16)

Equations 5.14 and 5.16 imply that when the source symbol occurrence probabilities are arranged in a nonincreasing order, the lengths of the corresponding code words should be in a nondecreasing order. In other words, the code word length of a more probable source symbol should not be longer than that of a less probable source symbol. Furthermore, the lengths of the code words assigned to the two least probable source symbols should be the same.

2. The code words of the two least probable source symbols should be the same except for their last bits.

3. Each possible sequence of (lm − 1) bits must be used either as a code word or must have one of its prefixes used as a code word.

Rule 1 can be justified as follows. If the first part of the rule, i.e., l1 ≤ l2 ≤ ... ≤ lm−1, is violated, say, l1 > l2, then we can exchange the two code words to shorten the average length of the code. This means the code is not optimum, which contradicts the assumption that the code is optimum. Hence it is impossible. That is, the first part of rule 1 has to be the case. Now assume that the second part of the rule is violated, i.e., lm−1 < lm. (Note that lm−1 > lm can be shown to be impossible by using the same reasoning we just used to prove the first part of the rule.) Since the code is instantaneous, code word Am−1 is not a prefix of code word Am. This implies that the last bit in the code word Am is redundant. It can be removed to reduce the average length of the code, implying that the code is not optimum. This contradicts the assumption, thus proving rule 1.

Rule 2 can be justified as follows. As above, Am−1 and Am are the code words of the two least probable source symbols. Assume that they do not have an identical prefix of order (lm − 1). Since the code is optimum and instantaneous, code words Am−1 and Am cannot have prefixes of any order that are identical to other code words. This implies that we can drop the last bits of Am−1 and Am to achieve a lower average length. This contradicts the optimum code assumption. It proves that rule 2 has to be the case.

Rule 3 can be justified using a similar strategy. If a possible sequence of (lm − 1) bits has not been used as a code word and none of its prefixes has been used as a code word, then it can be used in place of the code word of the mth source symbol, resulting in a reduction of the average length Lavg. This is a contradiction to the optimum code assumption and it justifies the rule.

5.2.2 Huffman Coding Algorithm

On the basis of these three rules, we see that the two least probable source symbols have equal-length code words. These two code words are identical except for the last bits, the binary 0 and 1, respectively. Therefore, these two source symbols can be combined to form a single new symbol. Its occurrence probability is the sum of the probabilities of the two source symbols,


TABLE 5.9

Source Alphabet and Huffman Codes in Example 5.9

Source Symbol   Occurrence Probability   Code Word Assigned   Length of Code Word
s1              0.3                      00                   2
s2              0.1                      101                  3
s3              0.2                      11                   2
s4              0.05                     1001                 4
s5              0.1                      1000                 4
s6              0.25                     01                   2

i.e., p(sm−1) + p(sm). Its code word is the common prefix of order (lm − 1) of the two code words assigned to sm and sm−1, respectively. The new set of source symbols thus generated is referred to as the first auxiliary source alphabet, which has one source symbol fewer than the original source alphabet. In the first auxiliary source alphabet, we can rearrange the source symbols according to a nonincreasing order of their occurrence probabilities. The same procedure can then be applied to this newly created source alphabet. A binary 0 and a binary 1 are, respectively, assigned to the last bits of the two least probable source symbols in the alphabet. The second auxiliary source alphabet will again have one source symbol fewer than the first auxiliary source alphabet. The procedure continues. In some step, the resultant source alphabet will have only two source symbols. At this time, we combine them to form a single source symbol with a probability of 1. Then the coding is complete.

Let's go through the following example to illustrate the above Huffman algorithm.

Example 5.9
Consider a source alphabet whose six source symbols and their occurrence probabilities are listed in Table 5.9. Figure 5.1 demonstrates the Huffman coding procedure applied. In the example, among the two least probable source symbols encountered at each step, we assign binary 0 to the top symbol and binary 1 to the bottom symbol.

5.2.2.1 Procedures

In summary, the Huffman coding algorithm consists of the following steps:

FIGURE 5.1 Huffman coding procedure in Example 5.9. (The figure shows the successive merges of the two least probable symbols: s5 and s4 into s5,4 (0.15); s5,4 and s2 into s5,4,2 (0.25); s5,4,2 and s3 into s5,4,2,3 (0.45); s1 and s6 into s1,6 (0.55); and finally s1,6 and s5,4,2,3 into s (1.0), with a 0 and a 1 assigned at each merge.)


1. Arrange all source symbols in such a way that their occurrence probabilities are in a nonincreasing order.

2. Combine the two least probable source symbols:
   • Form a new source symbol with a probability equal to the sum of the probabilities of the two least probable symbols.
   • Assign a binary 0 and a binary 1 to the two least probable symbols.

3. Repeat until the newly created auxiliary source alphabet contains only one source symbol.

4. Start from the source symbol in the last auxiliary source alphabet and trace back to each source symbol in the original source alphabet to find the corresponding code words.
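The following Python sketch is one possible direct implementation of these four steps using a heap; the function name huffman_code and the tie-breaking details are our own choices, not part of the algorithm's definition. Because of ties among equal probabilities, the particular code words it produces for Example 5.9 may differ from those in Table 5.9 and Figure 5.1, but the average length, 2.4 bits/symbol, is the same.

    import heapq

    def huffman_code(probs):
        """Binary Huffman code for a dict {symbol: probability} (at least two symbols)."""
        # Each heap entry: (probability, tie-breaker, {symbol: partial code word})
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p0, _, group0 = heapq.heappop(heap)   # least probable group
            p1, _, group1 = heapq.heappop(heap)   # second least probable group
            # Prepend a 0 to one group and a 1 to the other; the bit of the final
            # merge ends up as the first bit of each code word
            merged = {s: "0" + c for s, c in group0.items()}
            merged.update({s: "1" + c for s, c in group1.items()})
            heapq.heappush(heap, (p0 + p1, counter, merged))
            counter += 1
        return heap[0][2]

    probs = {"s1": 0.3, "s2": 0.1, "s3": 0.2, "s4": 0.05, "s5": 0.1, "s6": 0.25}
    code = huffman_code(probs)
    avg = sum(probs[s] * len(code[s]) for s in probs)
    print(code)
    print(avg)   # 2.4 bits/symbol, an optimum code for this source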

5.2.2.2 Comments

First, it is noted that the assignment of the binary 0 and 1 to the two least probable source symbols in the original source alphabet and in each of the first (u − 1) auxiliary source alphabets can be implemented in two different ways, where u denotes the total number of auxiliary source alphabets in the procedure. Hence, there is a total of 2^u possible Huffman codes. In Example 5.9, there are five auxiliary source alphabets, hence a total of 2^5 = 32 different codes. Note that each is optimum: that is, each has the same average length.

Second, in sorting the source symbols, there may be more than one symbol having equal probabilities. This results in multiple arrangements of symbols, hence multiple Huffman codes. While all of these Huffman codes are optimum, they may have some other different properties. For instance, some Huffman codes result in the minimum code word length variance [sayood 1996]. This property is desired for applications in which a constant bit rate is required.

Third, Huffman coding can be applied to r-ary encoding with r > 2. That is, the code symbols are r-ary with r > 2.

5.2.2.3 Applications

As a systematic procedure to encode a finite discrete memoryless source, the Huffman code has found wide application in image and video coding. Recall that it has been used in differential coding and transform coding. In transform coding, as introduced in Chapter 4, the magnitudes of the quantized transform coefficients and the run lengths of zeros in the zigzag scan are encoded by using the Huffman code.

5.3 Modified Huffman Codes

5.3.1 Motivation

As a result of Huffman coding, a set of all the code words, called a codebook, is created. It is an agreement between the transmitter and the receiver. Consider the case where the occurrence probabilities are skewed, i.e., some are large, whereas some are small. Under these circumstances, the improbable source symbols take a disproportionately large amount of memory space in the codebook. The size of the codebook will be very


large if the number of improbable source symbols is large. A large codebook requires a large memory space and increases the computational complexity. A modified Huffman (MH) procedure was therefore devised to reduce the memory requirement while keeping almost the same optimality [hankamer 1979].

Example 5.10
Consider a source alphabet consisting of 16 symbols, each being a 4-bit binary sequence. That is, S = {si, i = 1, 2, ..., 16}. The occurrence probabilities are

p(s1) = p(s2) = 1/4,
p(s3) = p(s4) = ... = p(s16) = 1/28.

The source entropy can be calculated as follows:

H(S) = −2 (1/4) log2(1/4) − 14 (1/28) log2(1/28) ≈ 3.404 bits/symbol.

Applying the Huffman coding algorithm, we find that the code word lengths associated with the symbols are l1 = l2 = 2, l3 = 4, and l4 = l5 = ... = l16 = 5, where li denotes the length of the ith code word. The average length of the Huffman code is

Lavg = Σ_{i=1}^{16} p(si) li = 3.464 bits/symbol.

We see that the average length of the Huffman code is quite close to the lower entropy bound. It is noted, however, that the required codebook memory, M (defined as the sum of the code word lengths), is quite large:

M = Σ_{i=1}^{16} li = 73 bits.

This number is obviously larger than the average code word length multiplied by the number of code words. This should not come as a surprise since the average here is in the statistical sense instead of in the arithmetic sense. When the total number of improbable symbols increases, the required codebook memory space will increase dramatically, resulting in a great demand on memory space.

5.3.2 Algorithm

Consider a source alphabet S that consists of 2^v binary sequences, each of length v. In other words, each source symbol is a v-bit code word in the natural binary code. The occurrence probabilities are highly skewed and there is a large number of improbable symbols in S. The MH coding algorithm is based on the following idea: lumping all the improbable source symbols into a category named ELSE [weaver 1978]. The algorithm is described below.


1. Categorize the source alphabet S into two disjoint groups, S1 and S2, such that

S1 = { si | p(si) > 1/2^v }   (5.17)

and

S2 = { si | p(si) ≤ 1/2^v }.   (5.18)

2. Establish a source symbol ELSE with its occurrence probability equal to p(S2).

3. Apply the Huffman coding algorithm to the source alphabet S3, with S3 = S1 ∪ ELSE.

4. Convert the codebook of S3 to that of S as follows:
   • Keep the same code words for those symbols in S1.
   • Use the code word assigned to ELSE as a prefix for those symbols in S2.
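A sketch of these four steps in Python is given below. It assumes that the huffman_code function from the sketch in Section 5.2.2 is in scope, identifies each source symbol with its v-bit natural binary pattern so that step 4 can append that pattern after the ELSE code word, and arbitrarily takes the two probable symbols of Example 5.11 to be the patterns 0000 and 0001; all of these naming choices are ours, not part of the MH definition.

    def modified_huffman_code(probs, v):
        """MH codebook for a source of 2**v symbols, each a v-bit binary pattern."""
        threshold = 1.0 / (2 ** v)
        S1 = {s: p for s, p in probs.items() if p > threshold}    # step 1
        S2 = {s: p for s, p in probs.items() if p <= threshold}
        S3 = dict(S1)
        S3["ELSE"] = sum(S2.values())                             # step 2
        reduced = huffman_code(S3)                                # step 3
        # Step 4: keep S1 code words; prefix the ELSE code word to S2 symbols
        code = {s: reduced[s] for s in S1}
        code.update({s: reduced["ELSE"] + s for s in S2})
        return code, reduced

    # Example 5.11: sixteen 4-bit symbols, two with probability 1/4, fourteen with 1/28
    probs = {format(i, "04b"): (1 / 4 if i < 2 else 1 / 28) for i in range(16)}
    code, reduced = modified_huffman_code(probs, v=4)
    print(sum(len(w) for w in reduced.values()))   # codebook memory: 2 + 2 + 1 = 5 bits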

5.3.3 Codebook Memory Requirement

The codebook memory M is the sum of the code word lengths. The M required by Huffman coding with respect to the original source alphabet S is

M = Σ_{i∈S} li = Σ_{i∈S1} li + Σ_{i∈S2} li,   (5.19)

where li denotes the length of the ith code word, as defined earlier. In the case of the MH coding algorithm, the memory required, MmH, is

MmH = Σ_{i∈S3} li = Σ_{i∈S1} li + lELSE,   (5.20)

where lELSE is the length of the code word assigned to ELSE. The above equation reveals the big savings in memory requirement when the probability distribution is skewed. The following example is used to illustrate the MH coding algorithm and the resulting dramatic memory savings.

Example 5.11
In this example, we apply the MH coding algorithm to the source alphabet presented in Example 5.10. We first lump the 14 symbols having the least occurrence probabilities together to form a new symbol ELSE. The probability of ELSE is the sum of the 14 probabilities. That is,

p(ELSE) = (1/28) × 14 = 1/2.

We then apply Huffman coding to the new source alphabet S3 = {s1, s2, ELSE}, as shown in Figure 5.2.


FIGURE 5.2 The modified Huffman (MH) coding procedure in Example 5.11. (The figure merges s1 (1/4) and s2 (1/4) into s1,2 (1/2), which is then merged with ELSE (1/2) into s (1).)

From Figure 5.2, it is seen that the code words assigned to the symbols s1, s2, and ELSE are, respectively, 10, 11, and 0. Hence, for every source symbol lumped into ELSE, its code word is 0 followed by the original 4-bit binary sequence. Therefore,

MmH = 2 + 2 + 1 = 5 bits,

i.e., the required codebook memory is only 5 bits. Compared with the 73 bits required by Huffman coding (refer to Example 5.10), there is a savings of 68 bits in codebook memory space. Similar to the comment made in Example 5.10, the memory savings will be even larger if the probability distribution is more severely skewed and the number of improbable symbols is larger. The average length of the MH code is

Lavg,mH = (1/4) × 2 × 2 + (1/28) × 5 × 14 = 3.5 bits/symbol.

This demonstrates that MH coding retains almost the same coding efficiency as that achieved by Huffman coding.

5.3.4 Bounds on Average Code Word Length

It has been shown that the average length of the MH codes satisfies the following condition:

H(S) ≤ Lavg < H(S) + 1 − p log2 p,   (5.21)

where p = Σ_{si∈S2} p(si). It is seen that, compared with the noiseless source coding theorem, the upper bound on the average code length is increased by the quantity −p log2 p. In Example 5.11 it is seen that the average length of the MH code is close to that achieved by the Huffman code. Hence the MH code is almost optimum.

5.4 Arithmetic Codes

Arithmetic coding, which is quite different from Huffman coding, is gaining increasing popularity. In this section, we first analyze the limitations of Huffman coding. Then the principle of arithmetic coding is introduced. Finally, some implementation issues are discussed briefly.


5.4.1 Limitations of Huffman Coding

As seen in Section 5.2, Huffman coding is a systematic procedure for encoding a source alphabet, with each source symbol having an occurrence probability. Under these circumstances, Huffman coding is optimum in the sense that it produces minimum coding redundancy. It has been shown that the average code word length achieved by Huffman coding satisfies the following inequality [gallagher 1978]:

H(S) ≤ Lavg < H(S) + pmax + 0.086,   (5.22)

where H(S) is the entropy of the source alphabet, and pmax denotes the maximum occurrence probability in the set of the source symbols. This inequality implies that the upper bound of the average code word length of the Huffman code is determined by the entropy and the maximum occurrence probability of the source symbols being encoded.

In the case where the probability distribution among source symbols is skewed (some probabilities are small, while some are quite large), the upper bound may be large, implying that the coding redundancy may not be small. Imagine the following extreme situation. There are only two source symbols. One has a very small probability, while the other has a very large probability (very close to 1). The entropy of the source alphabet in this case is close to 0 since the uncertainty is very small. Using Huffman coding, however, we need one bit for each of the two symbols. That is, the average code word length is 1, which means that the redundancy is very close to 1. This agrees with Equation 5.22. This inefficiency is due to the fact that Huffman coding always encodes a source symbol with an integer number of bits.

The noiseless coding theorem (reviewed in Section 5.1) indicates that the average code word length of a block code can approach the source alphabet entropy when the block size approaches infinity. As the block size approaches infinity, however, the required storage, the codebook size, and the coding delay also approach infinity, and the complexity of the coding becomes unmanageable. Fortunately, it is often the case in practice that when the block size is large enough, the average code word length of a block code is rather close to the source alphabet entropy.

The fundamental idea behind Huffman coding and Shannon–Fano coding (devised a little earlier than Huffman coding [bell 1990]) is block coding. That is, some code word having an integral number of bits is assigned to a source symbol. A message may be encoded by cascading the relevant code words. It is this block-based approach that is responsible for the limitations of Huffman codes.

Another limitation is that when encoding a message that consists of a sequence of source symbols, the nth extension Huffman coding needs to enumerate all possible sequences of source symbols having the same length, as discussed in coding the nth extended source alphabet. This is not computationally efficient.

Quite different from Huffman coding, arithmetic coding is stream based. It overcomes the drawbacks of Huffman coding. A string of source symbols is encoded as a string of code symbols. It is hence free of the integral-bits-per-source-symbol restriction and is more efficient. Arithmetic coding may reach the theoretical bound of coding efficiency specified in the noiseless source coding theorem for any information source. Below, we introduce the principle of arithmetic coding, from which we can see its stream-based nature.

5.4.2 The Principle of Arithmetic Coding

To understand the different natures of Huffman coding and arithmetic coding, let us look at Example 5.12, where we use the same source alphabet and the associated occurrence


probabilities used in Example 5.9. In this example, however, a string of source symbols s1 s2 s3 s4 s5 s6 is encoded. Note that we consider the terms string and stream to be slightly different. By stream, we mean a message, or possibly several messages, which may correspond to quite a long sequence of source symbols. Moreover, stream gives a dynamic flavor. Later we will see that arithmetic coding is implemented in an incremental manner. Hence stream is a suitable term to use for arithmetic coding. In this example, however, only six source symbols are involved. Hence we consider the term string to be suitable, aiming at distinguishing it from the term block.

Example 5.12
The set of six source symbols and their occurrence probabilities are listed in Table 5.10. In this example, the string to be encoded using arithmetic coding is s1 s2 s3 s4 s5 s6. In the following four subsections, we use this example to illustrate the principle of arithmetic coding and decoding.

5.4.2.1 Dividing Interval [0, 1) into Subintervals

As pointed out by Elias, it is not necessary to sort the source symbols according to their occurrence probabilities. Therefore, in Figure 5.3a, the six symbols are arranged in their natural order, s1, s2, ..., s6. The real interval between 0 and 1 is divided into six subintervals, each having a length of p(si), i = 1, 2, ..., 6. Specifically, the interval denoted by [0, 1), where 0 is included (the left end is closed) and 1 is excluded (the right end is open), is divided into six subintervals. The first subinterval, [0, 0.3), corresponds to s1 and has a length of p(s1), i.e., 0.3. Similarly, the subinterval [0, 0.3) is said to be closed on the left and open on the right. The remaining five subintervals are similarly constructed. All six subintervals thus formed are disjoint and their union is equal to the interval [0, 1). This is because the sum of all the probabilities is equal to 1.

We also list the sum of the preceding probabilities, known as the cumulative probability (CP) [langdon 1984], in the rightmost column of Table 5.10. Note that the concept of CP is slightly different from that of the cumulative distribution function (CDF) in probability theory. Recall that in the case of discrete random variables the CDF is defined as follows:

CDF(si) = Σ_{j=1}^{i} p(sj).   (5.23)

TABLE 5.10

Source Alphabet and Cumulative Probabilities in Example 5.12

Source Symbol   Occurrence Probability   Associated Subinterval   Cumulative Probability
s1              0.3                      [0, 0.3)                 0
s2              0.1                      [0.3, 0.4)               0.3
s3              0.2                      [0.4, 0.6)               0.4
s4              0.05                     [0.6, 0.65)              0.6
s5              0.1                      [0.65, 0.75)             0.65
s6              0.25                     [0.75, 1.0)              0.75


FIGURE 5.3 Arithmetic coding working on the same source alphabet as that given in Example 5.9. The encoded symbol string is s1 s2 s3 s4 s5 s6. (Parts (a) through (f) show the successive subdivisions of the intervals [0, 1), [0, 0.3), [0.09, 0.12), [0.102, 0.108), [0.1056, 0.1059), and [0.105795, 0.105825) into six subintervals each.)

The CP is defined as

CP(si) = Σ_{j=1}^{i−1} p(sj),   (5.24)

where CP(s1) = 0 is defined. Now we see that each subinterval has its lower end point located at CP(si). The width of each subinterval is equal to the probability of the corresponding source symbol. A subinterval can be completely defined by its lower end point and its width. Alternatively, it is determined by its two end points: the lower and upper end points (sometimes also called the left and right end points).


Now consider encoding the string of source symbols s1 s2 s3 s4 s5 s6 with the arithmetic coding method.

5.4.2.2 Encoding

5.4.2.2.1 Encoding the First Source Symbol

As the first symbol is s1, we pick up its subinterval [0, 0.3). Picking up the subinterval [0, 0.3) means that any real number in the subinterval, i.e., any real number equal to or greater than 0 and smaller than 0.3, can be a pointer to the subinterval, thus representing the source symbol s1. This can be justified by considering that all six subintervals are disjoint (see Figure 5.3a).

5.4.2.2.2 Encoding the Second Source Symbol

We use the same procedure as in Figure 5.3a to divide the interval [0, 0.3) into six subintervals (Figure 5.3b). Since the second symbol to be encoded is s2, we pick up its subinterval [0.09, 0.12).

Notice that the subintervals are recursively generated from Figure 5.3a to b. It is known that an interval may be completely specified by its lower end point and width. Hence, the subinterval recursion in the arithmetic coding procedure is equivalent to the following two recursions: end point recursion and width recursion.

From the interval [0, 0.3) derived in Figure 5.3a to the interval [0.09, 0.12) obtained in Figure 5.3b, we can conclude the following lower end point recursion:

Lnew = Lcurrent + Wcurrent × CPnew,   (5.25)

where Lnew and Lcurrent represent, respectively, the lower end points of the new and current recursions, and Wcurrent and CPnew denote the width of the interval in the current recursion and the CP in the new recursion, respectively. The width recursion is

Wnew = Wcurrent × p(si),   (5.26)

where Wnew and p(si) are, respectively, the width of the new subinterval and the probability of the source symbol si that is being encoded. These two recursions, also called the double recursion [langdon 1984], play a central role in arithmetic coding.
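A direct Python rendering of this double recursion is sketched below for the source of Example 5.12 (Table 5.10). It uses ordinary floating point, so it ignores the precision problem discussed in Section 5.4.3; the function name arithmetic_encode is ours.

    def arithmetic_encode(symbols, probs, cum_probs):
        """Return the final subinterval [low, high) for a string of source symbols."""
        low, width = 0.0, 1.0
        for s in symbols:
            low = low + width * cum_probs[s]   # lower end point recursion, Equation 5.25
            width = width * probs[s]           # width recursion, Equation 5.26
        return low, low + width

    probs = {"s1": 0.3, "s2": 0.1, "s3": 0.2, "s4": 0.05, "s5": 0.1, "s6": 0.25}
    cum_probs = {"s1": 0.0, "s2": 0.3, "s3": 0.4, "s4": 0.6, "s5": 0.65, "s6": 0.75}

    low, high = arithmetic_encode(["s1", "s2", "s3", "s4", "s5", "s6"], probs, cum_probs)
    print(low, high)   # approximately [0.1058175, 0.1058250)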

5.4.2.2.3 Encoding the Third Source Symbol

When the third source symbol is encoded, the subinterval generated above in part (b) is similarly divided into six subintervals. Since the third symbol to encode is s3, its subinterval [0.102, 0.108) is picked up (see Figure 5.3c).

5.4.2.2.4 Encoding the Fourth, Fifth, and Sixth Source Symbols

The subinterval division is carried out according to Equations 5.25 and 5.26. The symbols s4, s5, and s6 are encoded. The final subinterval generated is [0.1058175, 0.1058250) (see Figure 5.3d through f).

That is, the resulting subinterval [0.1058175, 0.1058250) can represent the source symbol string s1 s2 s3 s4 s5 s6. Note that in this example decimal digits instead of binary digits are used. In binary arithmetic coding, the binary digits 0 and 1 are used.

5.4.2.3 Decoding

As seen in this example, for the encoder of arithmetic coding, the input is a source symbol string, and the output is a subinterval. Let us call this the final subinterval or the resultant


subinterval. Theoretically, any real number in the interval can be the code string for the input symbol string since all subintervals are disjoint. Often, however, the lower end of the final subinterval is used as the code string. Now let us examine how the decoding process is carried out with the lower end of the final subinterval.

Decoding in a sense reverses what encoding has done. The decoder knows the encoding procedure and therefore has the information contained in Figure 5.3a. It compares the lower end point of the final subinterval, 0.1058175, with all the end points. It is determined that

0 < 0.1058175 < 0.3.

That is, the lower end falls into the subinterval associated with the symbol s1. Therefore, the symbol s1 is decoded first.

Once the first symbol is decoded, the decoder knows the partition of subintervals shown in Figure 5.3b. It is then determined that

0.09 < 0.1058175 < 0.12.

That is, the lower end is contained in the subinterval corresponding to the symbol s2. As a result, s2 is the second decoded symbol.

The procedure repeats itself until all six symbols are decoded. That is, based on Figure 5.3c, it is found that

0.102 < 0.1058175 < 0.108.

The symbol s3 is decoded. The symbols s4, s5, and s6 are then subsequently decoded because the following inequalities are determined:

0.1056 < 0.1058175 < 0.1059,
0.105795 < 0.1058175 < 0.105825,
0.1058175 ≤ 0.1058175 < 0.1058250.

(The last relation holds because each subinterval is closed at its left end: 0.1058175 is exactly the lower end point of the subinterval associated with s6.)

Note that a terminal symbol is necessary to inform the decoder to stop decoding.

The above procedure gives us an idea of how decoding works. The decoding process,

however, does not need to construct Figure 5.3b through f. Instead, the decoder only needs the information contained in Figure 5.3a. Decoding can be split into the following three steps: comparison, readjustment (subtraction), and scaling [langdon 1984].

As described above, through comparison we decode the first symbol, s1. From the way Figure 5.3b is constructed, we know the decoding of s2 can be accomplished as follows. We subtract the lower end of the subinterval associated with s1 in Figure 5.3a, i.e., 0 in this example, from the lower end of the final subinterval, 0.1058175, resulting in 0.1058175. Then we divide this number by the width of the subinterval associated with s1, i.e., the probability of s1, 0.3, resulting in 0.352725. From Figure 5.3a, it is found that

0.3 < 0.352725 < 0.4.

That is, the number is within the subinterval corresponding to s2. Therefore, the second decoded symbol is s2. Note that these three decoding steps exactly undo what encoding has done.

To decode the third symbol, we subtract the lower end of the subinterval associated with s2, 0.3, from 0.352725, obtaining 0.052725. This number is divided by the probability of s2, 0.1,


resulting in 0.52725. The comparison of 0.52725 with the end points in Figure 5.3a reveals that the third decoded symbol is s3.

In decoding the fourth symbol, we first subtract the lower end of s3's subinterval in Figure 5.3a, 0.4, from 0.52725, getting 0.12725. Dividing 0.12725 by the probability of s3, 0.2, results in 0.63625. Referring to Figure 5.3a, we decode the fourth symbol as s4 by comparison.

Subtraction of the lower end of the subinterval of s4 in Figure 5.3a, 0.6, from 0.63625 leads to 0.03625. Division of 0.03625 by the probability of s4, 0.05, produces 0.725. The comparison between 0.725 and the end points decodes the fifth symbol as s5.

Subtracting the lower end of the subinterval associated with s5, 0.65, from 0.725 gives 0.075. Dividing 0.075 by the probability of s5, 0.1, generates 0.75. The comparison indicates that the sixth decoded symbol is s6.

In summary, considering the way in which Figure 5.3b through f is constructed, we see that the three steps discussed in the decoding process, comparison, readjustment, and scaling, exactly undo what the encoding procedure has done.
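The three decoding steps can likewise be sketched in a few lines of Python. The function name and the explicit symbol count below are our simplifications; as noted above, a real decoder relies on a terminal symbol, and floating-point rounding can disturb values lying exactly on subinterval boundaries.

    def arithmetic_decode(value, probs, cum_probs, n_symbols):
        """Decode n_symbols from a code value by comparison, readjustment, and scaling."""
        decoded = []
        for _ in range(n_symbols):
            for s in probs:                      # comparison against the subintervals
                if cum_probs[s] <= value < cum_probs[s] + probs[s]:
                    decoded.append(s)
                    # readjustment (subtraction) and scaling (division)
                    value = (value - cum_probs[s]) / probs[s]
                    break
        return decoded

    probs = {"s1": 0.3, "s2": 0.1, "s3": 0.2, "s4": 0.05, "s5": 0.1, "s6": 0.25}
    cum_probs = {"s1": 0.0, "s2": 0.3, "s3": 0.4, "s4": 0.6, "s5": 0.65, "s6": 0.75}

    print(arithmetic_decode(0.1058175, probs, cum_probs, 6))
    # expected: ['s1', 's2', 's3', 's4', 's5', 's6'], as in Example 5.12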

5.4.2.4 Observations

Both encoding and decoding involve only arithmetic operations (addition and multiplication in encoding, subtraction and division in decoding). This explains the name arithmetic coding.

We see that an input source symbol string s1 s2 s3 s4 s5 s6, via encoding, corresponds to a subinterval [0.1058175, 0.1058250). Any number in this interval can be used to denote the string of the source symbols.

We also observe that arithmetic coding can be carried out in an incremental manner. That is, source symbols are fed into the encoder one by one and the final subinterval is refined continually, i.e., the code string is generated continually. Furthermore, it is done in a manner called first in first out (FIFO). That is, the source symbol encoded first is decoded first. This manner is superior to that of last in first out (LIFO), because FIFO is suitable for adaptation to the statistics of the symbol string.

Obviously, the width of the final subinterval becomes smaller and smaller as the length of the source symbol string becomes larger and larger. This causes what is known as the precision problem. It is this problem that prohibited arithmetic coding from practical usage for quite a long period of time. Only after this problem was solved in the late 1970s did arithmetic coding become an increasingly important coding technique.

It is necessary to have a termination symbol at the end of an input source symbol string. In this way, an arithmetic coding system is able to know when to terminate decoding.

Compared with Huffman coding, arithmetic coding is quite different. Basically, Huffman coding converts each source symbol into a fixed code word with an integral number of bits, whereas arithmetic coding converts a source symbol string into a code symbol string. To encode the same source symbol string, Huffman coding can be implemented in two different ways. One way is shown in Example 5.9. We construct a fixed code word for each source symbol. Since Huffman coding is instantaneous, we can cascade the corresponding code words to form the output, a 17-bit code string 00.101.11.1001.1000.01, where, for easy reading, the five periods are used to indicate different code words. As we have seen, for the same source symbol string, the final subinterval obtained by using arithmetic coding is [0.1058175, 0.1058250). It is noted that the 16-bit binary fraction 0.0001101100010111 is equal to the decimal fraction 0.1058197021484375, which falls into the final subinterval representing the string s1 s2 s3 s4 s5 s6. This indicates that, for this example, arithmetic coding is more efficient than Huffman coding.


Another way is to form the 6th extension of the source alphabet, as discussed in Section 5.1.4: treat each group of six source symbols as a new source symbol; calculate its occurrence probability by multiplying the related six probabilities; then apply the Huffman coding algorithm to the 6th extension of the discrete memoryless source. This is called the 6th extension of the Huffman block code (refer to Section 5.1.2.2). In other words, to encode the source string s1 s2 s3 s4 s5 s6, (the 6th extension of) Huffman coding encodes all of the 6^6 = 46656 code words in the 6th extension of the source alphabet. This implies a high complexity in implementation and a large codebook. It is therefore not efficient.

Note that we use decimal fractions in this section. In binary arithmetic coding, we use binary fractions. In [langdon 1984] both binary source and code alphabets are used in binary arithmetic coding.

Similar to the case of Huffman coding, arithmetic coding is also applicable to r-ary encoding with r > 2.

5.4.3 Implementation Issues

As mentioned, the final subinterval resulting from arithmetic encoding of a source symbol stream becomes smaller and smaller as the length of the source symbol string increases. That is, the lower and upper bounds of the final subinterval become closer and closer. This causes a growing precision problem. It is this problem that prohibited arithmetic coding from practical usage for a long period. The problem has been resolved, and finite precision arithmetic is now used in arithmetic coding. This advance is due to the incremental implementation of arithmetic coding.

5.4.3.1 Incremental Implementation

Recall that in Example 5.12, as source symbols come in one by one, the lower and upper ends of the final subinterval get closer and closer. In Figure 5.3, these lower and upper ends in Example 5.12 are listed. We observe that after the third symbol, s3, is encoded, the resultant subinterval is [0.102, 0.108). That is, the two most significant decimal digits are the same, and they remain the same for the rest of the encoding process. Hence, we can transmit these two digits without affecting the final code string. After the fourth symbol s4 is encoded, the resultant subinterval is [0.1056, 0.1059). That is, one more digit, 5, can be transmitted. Or we say the cumulative output is now 0.105. After the sixth symbol is encoded, the final subinterval is [0.1058175, 0.1058250). The cumulative output is 0.1058. Refer to Table 5.11. This important observation reveals that we are able to incrementally transmit output (the code symbols) and receive input (the source symbols that need to be encoded).

TABLE 5.11

Final Subintervals and Cumulative Output in Example 5.12

                Final Subinterval
Source Symbol   Lower End     Upper End     Cumulative Output
s1              0             0.3           —
s2              0.09          0.12          —
s3              0.102         0.108         0.10
s4              0.1056        0.1059        0.105
s5              0.105795      0.105825      0.105
s6              0.1058175     0.1058250     0.1058
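The cumulative output column of Table 5.11 can be reproduced by simply comparing the digits of the two ends of the current subinterval. The small Python sketch below (with our own helper name, and decimal digits rather than the binary digits used in practice) emits the digits that both ends already share; further encoding only narrows the interval, so these digits cannot change.

    def common_prefix_digits(low, high, max_digits=10):
        """Decimal digits shared by low and high; they can be transmitted at once."""
        low_digits = f"{low:.{max_digits}f}"[2:]     # digits after "0."
        high_digits = f"{high:.{max_digits}f}"[2:]
        shared = []
        for a, b in zip(low_digits, high_digits):
            if a != b:
                break
            shared.append(a)
        return "0." + "".join(shared)

    print(common_prefix_digits(0.102, 0.108))            # 0.10
    print(common_prefix_digits(0.1056, 0.1059))          # 0.105
    print(common_prefix_digits(0.1058175, 0.1058250))    # 0.1058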


5.4.3.2 Finite Precision

With the incremental manner of transmission of encoded digits and reception of input source symbols, it is possible to use finite precision to represent the lower and upper bounds of the resultant subinterval, which get closer and closer as the length of the source symbol string grows.

Instead of floating-point math, integer math is used. The potential problems, namely underflow and overflow, however, need to be carefully monitored and controlled [bell 1990].

5.4.3.3 Other Issues

There are some other problems that need to be handled in the implementation of binary arithmetic coding. Two of them are listed below [langdon 1981].

5.4.3.3.1 Eliminating Multiplication

The multiplication in the recursive division of subintervals is expensive in hardware as well as in software. It can be avoided in binary arithmetic coding so as to simplify the implementation. The idea is to approximate the lower end of the interval by the closest binary fraction 2^−Q, where Q is an integer. Consequently, the multiplication by 2^−Q becomes a right shift by Q bits. A simpler approximation to eliminate multiplication is used in the Skew Coder [langdon 1982] and the Q-Coder [pennebaker 1988].
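The flavor of this shift-based approximation can be shown in a couple of lines of Python; the rounding rule below is only illustrative and is not the actual procedure of the Skew Coder or the Q-Coder.

    import math

    def shift_width_update(width, prob):
        """Approximate width * prob by a right shift, with prob rounded to 2**-Q."""
        Q = max(1, round(-math.log2(prob)))   # closest integer exponent
        return width >> Q                     # width is an integer register

    print(shift_width_update(1 << 16, 0.2))   # 0.2 is treated as 2**-2, so shift by 2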

5.4.3.3.2 Carry-Over Problem

Carry-over takes place in the addition required in the recursion updating the lower end of the resultant subintervals. A carry may propagate over q bits. If q is larger than the number of bits in the fixed-length (FL) register utilized in finite precision arithmetic, the carry-over problem occurs. To block the carry-over problem, a technique known as bit stuffing is used, in which an additional buffer register is utilized.

For a detailed discussion of the various issues involved, readers are referred to [langdon 1981, 1982, 1984; pennebaker 1988, 1992]. Some computer programs for arithmetic coding in the C language can be found in [bell 1990; nelson 1996].

5.4.4 History

The idea of encoding by using cumulative probability in some ordering, and decoding by comparison of the magnitude of binary fractions, was introduced in Shannon's celebrated paper [shannon 1948]. The recursive implementation of arithmetic coding was devised by Elias. This unpublished result was first introduced by Abramson as a note in his book on information theory and coding [abramson 1963]. The result was further developed by Jelinek in his book on information theory [jelinek 1968]. The growing precision problem prevented arithmetic coding from practical usage, however. The proposal of using finite precision arithmetic was made independently by Pasco [pasco 1976] and Rissanen [rissanen 1976]. Practical arithmetic coding was developed by several independent groups [rissanen 1979; rubin 1979; guazzo 1980]. A well-known tutorial paper on arithmetic coding appeared in [langdon 1984]. The tremendous efforts made at IBM led to a new form of adaptive binary arithmetic coding known as the Q-coder [pennebaker 1988]. On the basis of the Q-coder, the activities of the international still image coding standards JPEG and JBIG combined the best features of the various existing arithmetic coders and developed the binary arithmetic coding procedure known as the QM-coder [pennebaker 1992].


5.4.5 Applications

Arithmetic coding is becoming popular. Note that in text and bi-level image applications there are only two source symbols (black and white), and the occurrence probability is skewed. Therefore, binary arithmetic coding achieves high coding efficiency. It has been successfully applied to bi-level image coding [langdon 1981] and adopted by the international standard for bi-level image compression, JBIG. It has also been adopted by the international still image coding standard JPEG. More in this regard is covered in the next chapter when we introduce JBIG.

5.5 Summary

So far in this chapter, not much has been explicitly discussed regarding the term variable-length codes (VLC). It is known that if source symbols in a source alphabet are equally probable, i.e., their occurrence probabilities are the same, then fixed-length codes (FLC) such as the natural binary code are a reasonable choice. When the occurrence probabilities are, however, unequal, VLCs should be used to achieve high coding efficiency. This is one of the restrictions on the minimum redundancy codes imposed by Huffman. That is, the length of the code word assigned to a probable source symbol should not be larger than that associated with a less probable source symbol. If the occurrence probabilities happen to be integral powers of 1/2, then choosing the code word length equal to −log2 p(si) for a source symbol si having the occurrence probability p(si) results in minimum redundancy coding. In fact, the average length of the code thus generated is equal to the source entropy.

Huffman devised a systematic procedure to encode a source alphabet consisting of finitely many source symbols, each having an occurrence probability. It is based on some restrictions imposed on the optimum, instantaneous codes. By assigning code words with variable lengths according to the variable probabilities of source symbols, Huffman coding results in minimum redundancy codes, or optimum codes for short. These have found wide applications in image and video coding and have been adopted in the international still image coding standard JPEG and the video coding standards H.261, H.263, MPEG-1, and MPEG-2.

When some source symbols have small probabilities and their number is large, the codebook of Huffman codes will require a large memory space. The modified Huffman (MH) coding technique employs a special symbol to lump all the symbols with small probabilities together. As a result, it can reduce the codebook memory space drastically while retaining almost the same coding efficiency as that achieved by the conventional Huffman coding technique.

On the one hand, Huffman coding is optimum as a block code for a fixed source alphabet. On the other hand, compared with the source entropy (the lower bound of the average code word length), it is not efficient when the probabilities of a source alphabet are skewed with the maximum probability being large. This is caused by the restriction that Huffman coding can only assign an integral number of bits to each code word.

Another limitation of Huffman coding is that it has to enumerate and encode all the possible groups of n source symbols in the nth extension Huffman code, even though there may be only one such group that needs to be encoded.

Arithmetic coding can overcome the limitations of Huffman coding because it is stream-oriented rather than block-oriented. It translates a stream of source symbols into a stream of code symbols. It can work in an incremental manner. That is, the source symbols are fed into the coding system one by one and the code symbols are output continually. In this stream-oriented way, arithmetic coding is more efficient. It can approach the lower coding bounds set by the noiseless source coding theorem for various sources.


The recursive subinterval division (equivalently, the double recursion: the lower end recursion and width recursion) is the heart of arithmetic coding. Several measures have been taken in the implementation of arithmetic coding. They include the incremental manner, finite precision, and the elimination of multiplication. As a result of its merits, binary arithmetic coding has been adopted by the international bi-level image coding standard JBIG and the still image coding standard JPEG. It is becoming an increasingly important coding technique.

Exercises

1. What does the noiseless source coding theorem state (using your own words)? Under what condition does the average code length approach the source entropy? Comment on the method suggested by the noiseless source coding theorem.

2. What characterizes a block code? Consider another definition of block code in [blahut 1986]: a block code breaks the input data stream into blocks of fixed length (FL) n and encodes each block into a code word of FL m. Are these two definitions (the one above and the one in Section 5.1, which comes from [abramson 1963]) essentially the same? Explain.

3. Is a uniquely decodable code necessarily a prefix condition code?

4. For text encoding, there are only two source symbols for black and white. It is said that Huffman coding is not efficient in this application. But it is known as the optimum code. Is there a contradiction? Explain.

5. A set of source symbols and their occurrence probabilities is listed in Table 5.12. Apply the Huffman coding algorithm to encode the alphabet.

6. Find the Huffman code for the source alphabet shown in Example 5.10.

7. Consider a source alphabet S = {si, i = 1, 2, ..., 32} with p(s1) = 1/4 and p(si) = 3/124 for i = 2, 3, ..., 32. Determine the source entropy and the average length of the Huffman code if applied to the source alphabet. Then apply the MH coding algorithm. Calculate the average length of the MH code. Compare the codebook memory required by the Huffman code and the MH code.

8. A source alphabet consists of the following four source symbols: s1, s2, s3, and s4, with their occurrence probabilities equal to 0.25, 0.375, 0.125, and 0.25, respectively. Applying arithmetic coding as shown in Example 5.12 to the source symbol string s2s1s3s4, determine the lower end of the final subinterval.

TABLE 5.12
Source Alphabet in Problem 5

Source Symbol     Occurrence Probability     Code Word Assigned
s1                0.20
s2                0.18
s3                0.10
s4                0.10
s5                0.10
s6                0.06
s7                0.06
s8                0.04
s9                0.04
s10               0.04
s11               0.04
s12               0.04


9. For the above problem, show step by step how we can decode the original source string from the lower end of the final subinterval.

10. In Problem 8, find the code word of the symbol string s2s1s3s4 by using the 4th extension of the Huffman code. Compare the two methods, arithmetic coding and Huffman coding.

11. Discuss how modern arithmetic coding overcame the growing precision problem.

References

[abramson 1963] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, 1963.
[bell 1990] T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[blahut 1986] R.E. Blahut, Principles and Practice of Information Theory, Addison-Wesley, Reading, MA, 1986.
[fano 1949] R.M. Fano, The transmission of information, Technical Report 65, Research Laboratory of Electronics, MIT, Cambridge, MA, 1949.
[gallagher 1978] R.G. Gallagher, Variations on a theme by Huffman, IEEE Transactions on Information Theory, IT-24, 6, 668–674, November 1978.
[guazzo 1980] M. Guazzo, A general minimum-redundancy source-coding algorithm, IEEE Transactions on Information Theory, IT-26, 1, 15–25, January 1980.
[hankamer 1979] M. Hankamer, A modified Huffman procedure with reduced memory requirement, IEEE Transactions on Communications, COM-27, 6, 930–932, June 1979.
[huffman 1952] D.A. Huffman, A method for the construction of minimum-redundancy codes, Proceedings of the IRE, 40, 1098–1101, September 1952.
[jelinek 1968] F. Jelinek, Probabilistic Information Theory, McGraw-Hill, New York, 1968.
[langdon 1981] G.G. Langdon, Jr. and J. Rissanen, Compression of black-white images with arithmetic coding, IEEE Transactions on Communications, COM-29, 6, 858–867, June 1981.
[langdon 1982] G.G. Langdon, Jr. and J. Rissanen, A simple general binary source code, IEEE Transactions on Information Theory, IT-28, 800, 1982.
[langdon 1984] G.G. Langdon, Jr., An introduction to arithmetic coding, IBM Journal of Research and Development, 28, 2, 135–149, March 1984.
[nelson 1996] M. Nelson and J. Gailly, The Data Compression Book, 2nd edn., M&T Books, New York, 1996.
[pasco 1976] R. Pasco, Source coding algorithms for fast data compression, Ph.D. dissertation, Stanford University, Palo Alto, CA, 1976.
[pennebaker 1988] W.B. Pennebaker, J.L. Mitchell, G.G. Langdon, Jr., and R.B. Arps, An overview of the basic principles of the Q-coder adaptive binary arithmetic coder, IBM Journal of Research and Development, 32, 6, 717–726, November 1988.
[pennebaker 1992] W.B. Pennebaker and J.L. Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1992.
[rissanen 1976] J.J. Rissanen, Generalized Kraft inequality and arithmetic coding, IBM Journal of Research and Development, 20, 198–203, May 1976.
[rissanen 1979] J.J. Rissanen and G.G. Langdon, Arithmetic coding, IBM Journal of Research and Development, 23, 2, 149–162, March 1979.
[rubin 1979] F. Rubin, Arithmetic stream coding using fixed precision registers, IEEE Transactions on Information Theory, IT-25, 6, 672–675, November 1979.
[sayood 1996] K. Sayood, Introduction to Data Compression, Morgan Kaufmann Publishers, San Francisco, CA, 1996.
[shannon 1948] C.E. Shannon, A mathematical theory of communication, Bell System Technical Journal, 27, 379–423 (Part I), July 1948; 623–656 (Part II), October 1948.
[weaver 1978] C.S. Weaver, Digital ECG data compression, in Digital Encoding of Electrocardiograms, H.K. Wolf (Ed.), Springer-Verlag, Berlin, 1979.


6 Run-Length and Dictionary Coding: Information Theory Results (III)

As mentioned at the beginning of Chapter 5, we study some code word assignment (encoding) techniques in Chapters 5 and 6. In this chapter, we focus on run-length coding (RLC) and dictionary-based coding techniques. We first introduce Markov models as a type of dependent source model, in contrast to the memoryless source model discussed in Chapter 5. Based on the Markov model, RLC is suitable for facsimile encoding. Its principle and application to facsimile encoding are discussed, followed by an introduction to dictionary-based coding, which is quite different from the Huffman and arithmetic coding techniques discussed in Chapter 5. Two types of adaptive dictionary coding techniques, the LZ77 and LZ78 algorithms, are presented. Finally, a summary of and a performance comparison between international standard algorithms for lossless still image coding are presented.

Since the Markov source model, RLC, and dictionary-based coding are the core of this chapter, we consider this chapter as the third part of the information theory results presented in the book. It is noted, however, that the emphasis is placed on their applications to image and video compression.

6.1 Markov Source Model

In Chapter 5, we discussed the discrete memoryless source model, in which source symbols are assumed to be independent of each other. In other words, the source has zero memory, i.e., the previous status does not affect the present one at all. In reality, however, many sources are dependent in nature. Namely, the source has memory in the sense that the previous status has an influence on the present status. For instance, as mentioned in Chapter 1, there is interpixel correlation in digital images. That is, pixels in a digital image are not independent of each other. As discussed in this chapter, there is some dependence between characters in text. For instance, the letter u often follows the letter q in English. Therefore, it is necessary to introduce models that can reflect this type of dependence. A Markov source model is often used in this regard.

6.1.1 Discrete Markov Source

Here, as in Chapter 5, we denote a source alphabet by S = {s1, s2, ..., sm} and the occurrence probability by p. An lth-order Markov source is characterized by the following equation of conditional probabilities:


p(sj | si1, si2, ..., sil, ...) = p(sj | si1, si2, ..., sil),    (6.1)

where j, i1, i2, ..., il, ... ∈ {1, 2, ..., m}, i.e., the symbols sj, si1, si2, ..., sil, ... are chosen from the source alphabet S. This equation states that the source symbols are not independent of each other. The occurrence probability of a source symbol is determined by some of its previous symbols. Specifically, the probability of sj given its history being si1, si2, ..., sil, ... (also called the transition probability) is determined completely by the immediately previous l symbols si1, ..., sil. That is, the knowledge of the entire sequence of previous symbols is equivalent to that of the l symbols immediately preceding the current symbol sj.

An lth-order Markov source can be described by what is called a state diagram. A state is a sequence (si1, si2, ..., sil) with i1, i2, ..., il ∈ {1, 2, ..., m}. That is, any group of l symbols from the m symbols in the source alphabet S forms a state. When l = 1, it is called a first-order Markov source. The state diagrams of first-order Markov sources, with their source alphabets having two and three symbols, are shown in Figure 6.1a and b, respectively. Obviously, an lth-order Markov source with m symbols in the source alphabet has a total of m^l different states. Therefore, we conclude that a state diagram consists of all the m^l states. In the diagram, all the transition probabilities together with appropriate arrows are used to indicate the state transitions.

The source entropy at a state (si1, si2, ..., sil) is defined as

H(S | si1, si2, ..., sil) = − Σ_{j=1}^{m} p(sj | si1, si2, ..., sil) log2 p(sj | si1, si2, ..., sil).    (6.2)

The source entropy is defined as the statistical average of the entropy at all the states. That is,

FIGURE 6.1
State diagrams of the first-order Markov sources with their source alphabets having (a) two symbols and (b) three symbols. (In each diagram, the states are the source symbols and the arrows are labeled with the transition probabilities p(sj|si).)


H(S) = Σ_{(si1, si2, ..., sil) ∈ S^l} p(si1, si2, ..., sil) H(S | si1, si2, ..., sil),    (6.3)

where, as defined in Chapter 5, S^l denotes the lth extension of the source alphabet S. That is, the summation is carried out with respect to all l-tuples taken over S^l. Extensions of a Markov source are defined below.
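As a small illustration of Equations 6.2 and 6.3, the following Python sketch computes the entropy of a first-order (l = 1) Markov source with two symbols. The transition probabilities are hypothetical, and the state probabilities p(si) are obtained as the stationary distribution of the transition matrix.

```python
# Entropy of a first-order Markov source (Equations 6.2 and 6.3).
import numpy as np

P = np.array([[0.9, 0.1],      # p(s1|s1), p(s2|s1)
              [0.2, 0.8]])     # p(s1|s2), p(s2|s2)

# Stationary state probabilities p(si): left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
stat = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
stat = stat / stat.sum()

# Entropy at each state (Equation 6.2) and the source entropy (Equation 6.3).
H_state = -np.sum(P * np.log2(P), axis=1)
H = np.dot(stat, H_state)
print(stat, H_state, H)      # e.g. stat = [2/3, 1/3], H ~ 0.55 bits/symbol
```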

6.1.2 Extensions of a Discrete Markov Source

An extension of a Markov source can be defined in a similar way to that of an extension of a memoryless source in Chapter 5. The definition of extensions of a Markov source and the relation between the entropy of the original Markov source and the entropy of the nth extension of the Markov source are presented below without derivation. For the derivation, readers are referred to [abramson 1963].

6.1.2.1 Definition

Consider an lth-order Markov source S = {s1, s2, ..., sm} and a set of conditional probabilities p(sj | si1, si2, ..., sil), where j, i1, i2, ..., il ∈ {1, 2, ..., m}. Similar to the memoryless source discussed in Chapter 5, if n symbols are grouped into a block, then there are a total of m^n blocks. Each block can be viewed as a new source symbol. Hence, these m^n blocks form a new information source alphabet, called the nth extension of the source S, and denoted by S^n. The nth extension of the lth-order Markov source is a kth-order Markov source, where k is the smallest integer greater than or equal to the ratio between l and n. That is,

k = ⌈l/n⌉,    (6.4)

where the notation ⌈a⌉ represents the operation of taking the smallest integer greater than or equal to the quantity a.

6.1.2.2 Entropy

Denote the entropy of the lth-order Markov source S by H(S), and the entropy of the nth extension of the lth-order Markov source, S^n, by H(S^n), respectively. The relation between the two entropies can be shown as

H(S^n) = nH(S).    (6.5)

6.1.3 Autoregressive Model

The Markov source discussed earlier represents a kind of dependence between source symbols in terms of the transition probability. Concretely, in determining the transition probability of a present source symbol given all the previous symbols, only the set of finitely many immediately preceding symbols matters. The autoregressive (AR) model is another kind of dependent source model that has been used often in image coding. It is defined as

sj = Σ_{k=1}^{l} ak · sik + xj,    (6.6)

� 2007 by Taylor & Francis Group, LLC.

Page 177: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

where sj represents the currently observed source symbol, while sik with k = 1, 2, ..., l denote the l preceding observed symbols; the ak's represent coefficients; and xj represents the current input to the model.

If l = 1, the model defined in Equation 6.6 is referred to as the first-order AR model. Clearly, in this case, the current source symbol is a linear function of its preceding symbol.
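A minimal sketch of the first-order AR model follows; the coefficient a1 and the Gaussian input are hypothetical choices used only to show the recursion of Equation 6.6.

```python
# First-order AR model: the current sample is a scaled copy of the previous one
# plus a current input term (white Gaussian noise here).
import random

a1 = 0.95                    # hypothetical AR coefficient
s_prev = 0.0
samples = []
for _ in range(10):
    x = random.gauss(0.0, 1.0)     # current input x_j
    s = a1 * s_prev + x            # Equation 6.6 with l = 1
    samples.append(s)
    s_prev = s
print(samples)
```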

6.2 Run-Length Coding

The term "run" is used to indicate the repetition of a symbol, while the term "run-length" is used to represent the number of repeated symbols, in other words, the number of consecutive symbols of the same value. Instead of encoding the consecutive symbols, it is obvious that encoding the run-length and the value that these consecutive symbols commonly share may be more efficient. According to an excellent early review on binary image compression [arps 1979], RLC has been in use since the earliest days of information theory [shannon 1949; laemmel 1951].

From the discussion of the Joint Photographic (image) Experts Group coding (JPEG) in Chapter 4 (with more detail in Chapter 7), it is seen that most of the DCT coefficients within a block of 8 × 8 are zero after certain manipulations. The DCT coefficients are zigzag scanned. The nonzero DCT coefficients and their addresses in the 8 × 8 block need to be encoded and transmitted to the receiver side. There, the nonzero DCT values are referred to as labels. The position information about the nonzero DCT coefficients is represented by the run-length of zeros between the nonzero DCT coefficients in the zigzag scan. The labels and the run-lengths of zeros are then Huffman coded.
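The following sketch illustrates how the zigzag-scanned coefficients of a block can be represented as (zero-run, label) pairs; the coefficient values are hypothetical and are assumed to be already in zigzag order, and the subsequent Huffman coding of the pairs is omitted.

```python
# Represent zigzag-scanned DCT coefficients as (zero-run, nonzero-label) pairs.
def run_level_pairs(zigzag_coeffs):
    pairs, run = [], 0
    for c in zigzag_coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))   # run of zeros preceding this nonzero label
            run = 0
    pairs.append("EOB")              # end-of-block marker covering trailing zeros
    return pairs

coeffs = [35, -3, 0, 0, 2, 0, 0, 0, -1] + [0] * 55   # hypothetical 8x8 block in zigzag order
print(run_level_pairs(coeffs))       # [(0, 35), (0, -3), (2, 2), (3, -1), 'EOB']
```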

Many documents such as letters, forms, and drawings can be transmitted using facsimile machines over the general switched telephone network (GSTN). In digital facsimile techniques, these documents are quantized into binary levels: black and white. The resolution of these binary tone images is usually very high. In each scan line, there are many consecutive white and black pixels, i.e., many alternate white runs and black runs. Therefore, it is not surprising to see that RLC has proven to be efficient in binary document transmission. RLC has been adopted in the international standards for facsimile coding: the CCITT Recommendations T.4 and T.6.

RLC using only the horizontal correlation between pixels on the same scan line is referred to as 1-D RLC. It is noted that the first-order Markov source model with two symbols in the source alphabet depicted in Figure 6.1a can be used to characterize 1-D RLC. To achieve higher coding efficiency, 2-D RLC utilizes both horizontal and vertical correlation between pixels. Both 1-D and 2-D RLC algorithms are introduced below.

6.2.1 1-D Run-Length Coding

In this technique, each scan line is encoded independently. Each scan line can be considered as a sequence of alternating, independent white and black runs. As an agreement between encoder and decoder, the first run in each scan line is assumed to be a white run. If the first actual pixel is black, then the run-length of the first white run is set to be zero. At the end of each scan line, there is a special code word called end-of-line (EOL). The decoder knows the end of a scan line when it encounters an EOL code word.
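A minimal sketch of this convention is given below: a binary scan line is converted into alternating white and black run-lengths, with the first run taken as white (of length zero when the line actually begins with a black pixel).

```python
# 1-D RLC run extraction: lengths alternate white, black, white, ...
def line_to_runs(pixels):            # pixels: 0 = white, 1 = black
    runs, color, count = [], 0, 0    # start counting a white run
    for p in pixels:
        if p == color:
            count += 1
        else:
            runs.append(count)       # close the current run (possibly of length 0)
            color, count = p, 1
    runs.append(count)
    return runs

print(line_to_runs([1, 1, 0, 0, 0, 1, 0, 0]))   # [0, 2, 3, 1, 2]
```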


Denote the run-length by r, which is integer-valued. All of the possible run-lengths construct a source alphabet R, which is a random variable. That is,

R = {r : r = 0, 1, 2, ...}.    (6.7)

Measurements on typical binary documents have shown that the maximum compression ratio, zmax, which is defined below, is about 25% higher when the white and black runs are encoded separately [hunter 1980]. The average white run-length, r̄W, can be expressed as

r̄W = Σ_{r=0}^{m} r · PW(r),    (6.8)

where m is the maximum value of the run-length, and PW(r) denotes the occurrence probability of a white run with length r.

The entropy of the white runs, HW, is

HW = − Σ_{r=0}^{m} PW(r) log2 PW(r).    (6.9)

For the black runs, the average run-length r̄B and the entropy HB can be defined similarly. The maximum theoretical compression factor zmax is

zmax = (r̄W + r̄B) / (HW + HB).    (6.10)

Huffman coding is then applied to the two source alphabets. According to CCITT Recommendation T.4, A4 size (210 × 297 mm) documents should be accepted by facsimile machines. In each scan line, there are 1728 pixels. This means that the maximum run-length for both white and black runs is 1728, i.e., m = 1728. Two source alphabets of such a large size imply the requirement of two large codebooks, hence the requirement of large storage space. Therefore, some modification was made, resulting in the modified Huffman (MH) code.

In the MH code, if the run-length is larger than 63, then the run-length is represented as

r = M × 64 + T, for r > 63,    (6.11)

where M takes integer values from 1 to 27 and M × 64 is referred to as the makeup run-length; T takes integer values from 0 to 63 and is called the terminating run-length. That is, if r ≤ 63, the run-length is represented by a terminating code word only. Otherwise, if r > 63, the run-length is represented by a makeup code word and a terminating code word. A portion of the MH code table [hunter 1980] is shown in Table 6.1. In this way, the requirement of large storage space is alleviated. The idea is similar to that behind the MH coding discussed in Chapter 5.
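A one-line consequence of Equation 6.11 is that any run-length larger than 63 can be split into a makeup part and a terminating part with an integer division, as the following sketch shows.

```python
# Split a run-length into makeup and terminating parts (Equation 6.11).
def mh_split(r):
    if r <= 63:
        return None, r              # terminating code word only
    M, T = divmod(r, 64)            # r = M*64 + T, with 1 <= M <= 27
    return M, T

for r in (30, 63, 64, 200, 1728):
    print(r, mh_split(r))
# e.g. 200 -> (3, 8): makeup code word for 192 followed by terminating code word for 8
```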

6.2.2 2-D Run-Length Coding

The 1-D RLC discussed above only utilizes correlation between pixels within a scan line. Inorder to utilize correlation between pixels in neighboring scan lines and to achieve highercoding efficiency, 2-D RLC was developed. In Recommendation T.4, the modified relative


TABLE 6.1
Modified Huffman Code Table

Run-Length     White Runs        Black Runs

Terminating code words
0              00110101          0000110111
1              000111            010
2              0111              11
3              1000              10
4              1011              011
5              1100              0011
6              1110              0010
7              1111              00011
8              10011             000101
...            ...               ...
60             01001011          000000101100
61             00110010          000001011010
62             00110011          000001100110
63             00110100          000001100111

Makeup code words
64             11011             0000001111
128            10010             000011001000
192            010111            000011001001
256            0110111           000001011011
...            ...               ...
1536           010011001         0000001011010
1600           010011010         0000001011011
1664           011000            0000001100100
1728           010011011         0000001100101
EOL            000000000001      000000000001

Source: From Hunter, R. and Robinson, A.H., Proc. IEEE, 68, 7, 854–867, 1980. With permission.

element address designate (READ) code, also known as the modified READ code or simply the MR code, is adopted.

The modified READ code operates in a line-by-line manner. In Figure 6.2, two lines are shown. The top line is called the reference line, which has been coded, while the bottom line is referred to as the coding line, which is being coded. There is a group of five changing pixels, a0, a1, a2, b1, b2, in the two lines. Their relative positions decide which of the three coding modes is used. The starting changing pixel a0 (hence, the five changing points) moves from left to right and from top to bottom as 2-D RLC proceeds. The five changing pixels and the three coding modes are defined below.

6.2.2.1 Five Changing Pixels

By a changing pixel, we mean the first pixel encountered in white or black runs when wescan an image line-by-line, from left to right, and from top to bottom. The five changingpixels are defined below.


a0: The reference changing pixel in the coding line. Its position is defined in the previous coding mode, whose meaning will be explained shortly. At the beginning of a coding line, a0 is an imaginary white changing pixel located before the first actual pixel in the coding line.
a1: The next changing pixel in the coding line. Because of the above-mentioned left-to-right and top-to-bottom scanning order, it is at the right-hand side of a0. Since it is a changing pixel, it has an opposite "color" to that of a0.


FIGURE 6.2
2-D run-length coding: (a) pass mode, (b) vertical mode, and (c) horizontal mode. (Each part shows a reference line and a coding line with the changing pixels a0, a1, a2, b1, and b2 marked.)


a2: The next changing pixel after a1 in the coding line. It is to the right of a1 and has the same color as that of a0.
b1: The changing pixel in the reference line that is closest to a0 from the right and has the same color as a1.
b2: The next changing pixel in the reference line after b1.

6.2.2.2 Three Coding Modes

6.2.2.2.1 Pass Coding Mode

If the changing pixel b2 is located to the left of the changing pixel a1, it means that the run in the reference line starting from b1 is not adjacent to the run in the coding line starting from a1. Note that these two runs have the same color. This is called the pass coding mode. A special code word, "0001," is sent out from the transmitter. The receiver then knows that the run starting from a0 in the coding line does not end at the pixel below b2. This pixel (below b2 in the coding line) is identified as the reference changing pixel a0 of the new set of five changing pixels for the next coding mode.

6.2.2.2.2 Vertical Coding Mode

If the relative distance along the horizontal direction between the changing pixels a1 and b1 is not larger than three pixels, the coding is conducted in vertical coding mode. That is, the position of a1 is coded with reference to the position of b1. Seven different code words are assigned to seven different cases: the distance between a1 and b1 equals 0, ±1, ±2, ±3, where + means a1 is to the right of b1, while − means a1 is to the left of b1. The a1 then becomes the reference changing pixel a0 of the new set of five changing pixels for the next coding mode.

6.2.2.2.3 Horizontal Coding Mode

If the relative distance between the changing pixels a1 and b1 is larger than three pixels, thecoding is conducted in horizontal coding mode. Here, 1-D RLC is applied. Specifically,

y Taylor & Francis Group, LLC.

Page 181: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

TABLE 6.2
2-D Run-Length Coding Table. |xiyj|: distance between xi and yj; xiyj > 0: xi is to the right of yj; xiyj < 0: xi is to the left of yj; (xiyj): code word of the run denoted by xiyj, taken from the modified Huffman code

Mode                       Conditions       Output Code Word          Position of New a0
Pass coding mode           b2a1 < 0         0001                      Under b2 in the coding line
Vertical coding mode       a1b1 = 0         1                         a1
                           a1b1 = 1         011
                           a1b1 = 2         000011
                           a1b1 = 3         0000011
                           a1b1 = −1        010
                           a1b1 = −2        000010
                           a1b1 = −3        0000010
Horizontal coding mode     |a1b1| > 3       001 + (a0a1) + (a1a2)     a2

Source: From Hunter, R. and Robinson, A.H., Proc. IEEE, 68, 7, 854–867, 1980.

the transmitter sends out a code word consisting of the following three parts: a flag "001"; a 1-D RLC word for the run from a0 to a1; and a 1-D RLC word for the run from a1 to a2. The a2 then becomes the reference changing pixel a0 of the new set of five changing pixels for the next coding mode.

Table 6.2 summarizes the three coding modes and the corresponding output code words. There, (a0a1) and (a1a2) represent the 1-D run-length code words of the run-lengths a0a1 and a1a2, respectively.
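The mode decision itself can be summarized in a few lines of code. The following sketch assumes the changing pixels a1, b1, and b2 are given as column indices; it only selects the mode and does not produce the code words of Table 6.2.

```python
# Mode decision for 2-D RLC, following the conditions of Table 6.2.
def read_mode(a1, b1, b2):
    if b2 < a1:                    # b2 lies to the left of a1: pass mode
        return "pass"
    if abs(a1 - b1) <= 3:          # a1 close enough to b1: vertical mode
        return "vertical"
    return "horizontal"            # otherwise fall back to 1-D RLC of a0a1 and a1a2

print(read_mode(a1=10, b1=9, b2=14))    # 'vertical'
print(read_mode(a1=10, b1=20, b2=25))   # 'horizontal'
print(read_mode(a1=10, b1=4, b2=7))     # 'pass'
```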

6.2.3 Effect of Transmission Error and Uncompressed Mode

In this section, the effect of transmission errors in the 1-D and 2-D RLC cases, as well as the uncompressed mode, is discussed.

6.2.3.1 Error Effect in the 1-D RLC Case

As introduced above, the special code word EOL is used to indicate the end of each scanline. With the EOL, 1-D RLC encodes each scan line independently. If a transmission erroroccurs in a scan line, there are two possibilities that the effect caused by the error is limitedwithin the scan line. One possibility is that resynchronization is established after a fewruns. One example is shown in Figure 6.3. There the transmission error takes place in the

FIGURE 6.3
Establishment of resynchronization after a few runs.
Original coded line:        1000 011 0111 0011 1110 10 ...   (runs 3W 4B 2W 5B 6W 3B)
Error-contaminated line:    1000 0010 1110 011 1110 10 ...   (decoded as 3W 6B 6W 4B 6W 3B)


second run from the left. Resynchronization is established in the fifth run in this example.Another possibility lies in the EOL, which forces resynchronization.

In summary, it is seen that 1-D RLC will not propagate a transmission error between scan lines. In other words, a transmission error will be restricted within a scan line. Although error detection and retransmission of data through an automatic repeat request (ARQ) system are supposed to be able to handle the error susceptibility issue effectively, the ARQ technique was not included in Recommendation T.4 due to the computational complexity and extra transmission time required.

Once the number of decoded pixels between two consecutive EOL code words is notequal to 1728 (for an A4 size document), an error has been identified. Some error conceal-ment techniques can be used to reconstruct the scan line [hunter 1980]. For instance, we canrepeat the previous line, or replace the damaged line by a white line, or use a correlationtechnique to recover the line as much as possible.

6.2.3.2 Error Effect in the 2-D RLC Case

From the above discussion, we realize that, on the one hand, 2-D RLC is more efficient than 1-D RLC. On the other hand, 2-D RLC is more susceptible to transmission errors than 1-D RLC. To prevent error propagation, there is a parameter used in 2-D RLC, known as the K-factor, which specifies the number of scan lines that are 2-D RLC coded.

Recommendation T.4 defined that no more than K – 1 consecutive scan lines be 2-D RLCcoded after a 1-D RLC coded line. For binary documents scanned at normal resolution,K¼ 2. For documents scanned at high resolution, K¼ 4.

According to Arps [arps 1979], there are two different types of algorithms in binaryimage coding: raster algorithms and area algorithms. Raster algorithms operate only ondata within one or two raster scan lines. They are hence mainly 1-D in nature. Areaalgorithms are truly 2-D in nature. They require that all, or a substantial portion, of theimage is in random access memory. From our discussion above, we see that both 1-D and2-D RLC defined in T.4 belong to the category of raster algorithms. Area algorithms requirelarge memory space and are susceptible to transmission noise.

6.2.3.3 Uncompressed Mode

For some detailed binary document images, both 1-D and 2-D RLC may result in data expansion instead of data compression. Under these circumstances, the number of coding bits is larger than the number of bilevel pixels. An uncompressed mode is created as an alternative way to avoid data expansion. Special code words are assigned for the uncompressed mode.

For the performance of 1-D and 2-D RLC applied to eight CCITT test document images, and for issues such as fill bits and minimum scan line time (MSLT), to name only a few, readers are referred to [hunter 1980].

6.3 Digital Facsimile Coding Standards

Facsimile transmission, an important means of communication in modern society, is often used as an example to demonstrate the mutual interaction between widely used applications and standardization activities. Active facsimile applications and the market brought on the necessity for international standardization to facilitate interoperability between facsimile machines worldwide. Successful international standardization, in turn,


TABLE 6.3
Facsimile Coding Standards

Group of       Speed Requirement       Analog or         CCITT              Compression Technique
Facsimile      for A4 Size Document    Digital Scheme    Recommendation     (Model / Basic Coder / Algorithm Acronym)
Apparatuses
G1             6 min                   Analog            T.2                —
G2             3 min                   Analog            T.3                —
G3             1 min                   Digital           T.4                1-D RLC, 2-D RLC (optional) / Modified Huffman / MH, MR
G4             1 min                   Digital           T.6                2-D RLC / Modified Huffman / MMR

has stimulated wider use of facsimile transmission and, hence, a more demanding market. Facsimile has also been considered as a major application for binary image compression.

So far, facsimile machines are classified into four different groups. Facsimile apparatuses in groups 1 and 2 use analog techniques. They can transmit an A4 size (210 × 297 mm) document scanned at 3.85 lines/mm in 6 and 3 minutes, respectively, over the GSTN. International standards for these two groups of facsimile apparatuses are CCITT (now ITU) Recommendations T.2 and T.3, respectively. Group 3 facsimile machines use digital techniques and hence achieve high coding efficiency. They can transmit an A4 size binary document scanned at a resolution of 3.85 lines/mm and sampled at 1728 pixels/line in about 1 minute at a rate of 4800 bits/s over the GSTN. The corresponding international standard is CCITT Recommendation T.4. Group 4 facsimile apparatuses have the same transmission speed requirement as that for group 3 machines, but the coding technique is different. Specifically, the coding technique used for group 4 machines is based on the 2-D RLC discussed above, but modified to achieve higher coding efficiency. Hence it is referred to as the modified modified READ (MMR) coding. The corresponding standard is CCITT Recommendation T.6. Table 6.3 summarizes the above descriptions.

6.4 Dictionary Coding

Dictionary coding, the focus of this section, is different from the Huffman and arithmetic coding techniques discussed in Chapter 5. Both Huffman and arithmetic coding techniques are based on a statistical model, and the occurrence probabilities play a particularly important role. Recall that in Huffman coding the shorter code words are assigned to more frequently occurring source symbols. In dictionary-based data compression techniques, a symbol or a string of symbols generated from a source alphabet is represented by an index to a dictionary constructed from the source alphabet. A dictionary is a list of symbols and strings of symbols. There are many examples of this in our daily lives. For instance, the string "September" is sometimes represented by an index "9," while a social security number represents a person in the United States.

Dictionary coding is widely used in text coding. Consider English text coding. The source alphabet includes 26 English letters in both upper and lower cases, numbers, various punctuation marks, and the space bar. Huffman or arithmetic coding treats each symbol based on its occurrence probability. That is, the source is modeled as a memoryless source. It is well known, however, that this is not true in many applications. In text coding, structure or context plays a significant role. As mentioned earlier, it is very likely that the


letter u appears after the letter q. Similarly, it is likely that the word "concerned" will appear after "As far as the weather is." The strategy of dictionary coding is to build a dictionary that contains frequently occurring symbols and strings of symbols. When a symbol or a string is encountered and it is contained in the dictionary, it is encoded with an index to the dictionary. Otherwise, if not in the dictionary, the symbol or the string of symbols is encoded in a less efficient manner.

6.4.1 Formulation of Dictionary Coding

To facilitate further discussion, we define dictionary coding in a precise manner [bell 1990]. We denote a source alphabet by S. A dictionary consisting of two elements is defined as D = (P, C), where P is a finite set of phrases generated from S, and C is a coding function mapping P onto a set of code words.

The set P is said to be complete if any input string can be represented by a series of phrases chosen from P. The coding function C is said to obey the prefix property if there is no code word that is a prefix of any other code word. For practical usage, i.e., for reversible compression of any input text, the phrase set P must be complete and the coding function C must satisfy the prefix property.

6.4.2 Categorization of Dictionary-Based Coding Techniques

The heart of dictionary coding is the formulation of the dictionary. A successfully built dictionary results in data compression; the opposite case may lead to data expansion. According to the ways in which dictionaries are constructed, dictionary coding techniques can be classified as static or adaptive.

6.4.2.1 Static Dictionary Coding

In some particular applications, the knowledge about the source alphabet and the related strings of symbols, also known as phrases, is sufficient for a fixed dictionary to be produced before the coding process. The dictionary is used at both the transmitting and the receiving ends. This is referred to as static dictionary coding. The merit of the static approach is its simplicity. Its drawbacks are its relatively lower coding efficiency and less flexibility compared with adaptive dictionary techniques. By less flexibility, we mean that a dictionary built for a specific application is not normally suitable for utilization in other applications.

An example of a static algorithm is digram coding. In this simple and fast coding technique, the dictionary contains all source symbols and some frequently used pairs of symbols. In encoding, two symbols are checked at once to see if they are in the dictionary. If so, they are replaced by the index of the two symbols in the dictionary, and the next pair of symbols is encoded in the next step. If not, then the index of the first symbol is used to encode the first symbol. The second symbol is combined with the third symbol to form a new pair, which is encoded in the next step.
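A minimal sketch of digram coding follows; the dictionary contents are hypothetical.

```python
# Digram coding: the dictionary holds all single symbols plus some frequent pairs;
# pairs are tried first, single symbols are the fallback.
def digram_encode(text, dictionary):
    indices, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in dictionary:           # two symbols matched at once
            indices.append(dictionary[pair])
            i += 2
        else:                            # fall back to the single first symbol
            indices.append(dictionary[text[i]])
            i += 1
    return indices

D = {"a": 0, "b": 1, "c": 2, "ab": 3, "ca": 4}   # hypothetical dictionary
print(digram_encode("abcab", D))                 # [3, 4, 1]
```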

Digram coding can be straightforwardly extended to n-gram coding. In the extension, the size of the dictionary increases and so does its coding efficiency.

6.4.2.2 Adaptive Dictionary Coding

As opposed to the static approach, with the adaptive approach a completely defined dictionary does not exist before the encoding process and the dictionary is not fixed.


At the beginning of coding, only an initial dictionary exists. It adapts itself to the input during the coding process. All the adaptive dictionary coding algorithms can be traced back to two different original works by Ziv and Lempel [ziv 1977, 1978]. The algorithms based on [ziv 1977] are referred to as the LZ77 algorithms, while those based on [ziv 1978] are referred to as the LZ78 algorithms. Before introducing the two landmark works, we will discuss the parsing strategy.

6.4.3 Parsing Strategy

Once we have a dictionary, we need to examine the input text and find a string of symbols that matches an item in the dictionary. Then the index of the item to the dictionary is encoded. This process of segmenting the input text into disjoint strings (whose union equals the input text) for coding is referred to as parsing. Obviously, the way to segment the input text into strings is not unique.

In terms of the highest coding efficiency, optimal parsing is essentially a shortest-path problem [bell 1990]. In practice, however, a method called greedy parsing is used most often. In fact, it is used in all the LZ77 and LZ78 algorithms [nelson 1995]. With greedy parsing, the encoder searches for the longest string of symbols in the input that matches an item in the dictionary at each coding step. Greedy parsing may not be optimal, but it is simple in implementation.

Example 6.1
Consider a dictionary, D, whose phrase set is P = {a, b, ab, ba, bb, aab, bbb}. The code words assigned to these strings are C(a) = 10, C(b) = 011, C(ab) = 010, C(ba) = 0101, C(bb) = 01, C(aab) = 11, and C(bbb) = 0110. Now the input text is abbaab.

Using greedy parsing, we then encode the text as C(ab).C(ba).C(ab), which is a 10-bit string: 010.0101.010. In the above representation, the periods are used to indicate the division of segments in the parsing. This, however, is not an optimum solution. Obviously, the following parsing will be more efficient, i.e., C(a).C(bb).C(aab), which is a 6-bit string: 10.01.11.
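The greedy parsing of this example can be reproduced with a few lines of code, as the following sketch shows (the dictionary and code words are those of Example 6.1).

```python
# Greedy parsing with the dictionary of Example 6.1: at each step the longest
# phrase matching the remaining input is chosen.
C = {"a": "10", "b": "011", "ab": "010", "ba": "0101",
     "bb": "01", "aab": "11", "bbb": "0110"}

def greedy_parse(text, codebook):
    out, i = [], 0
    while i < len(text):
        # longest phrase in the codebook matching the input at position i
        match = max((p for p in codebook if text.startswith(p, i)), key=len)
        out.append(codebook[match])
        i += len(match)
    return ".".join(out)

print(greedy_parse("abbaab", C))   # 010.0101.010  (10 bits, as in Example 6.1)
```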

6.4.4 Sliding Window (LZ77) Algorithms

As mentioned earlier, LZ77 algorithms are a group of adaptive dictionary coding algorithms rooted in the pioneering work in [ziv 1977]. Since they are adaptive, there is no complete and fixed dictionary before coding. Instead, the dictionary changes as the input text changes.

6.4.4.1 Introduction

In the LZ77 algorithms [bell 1990; nelson 1995], the dictionary used is actually a portion of the input text, which has been recently encoded. The text that needs to be encoded is compared with the strings of symbols in the dictionary. The longest matched string in the dictionary is characterized by a pointer (sometimes called a token), which is represented by a triple of data items. Note that this triple functions as an index to the dictionary, as mentioned earlier. In this way, a variable-length string of symbols is mapped to a fixed-length (FL) pointer.

There is a sliding window in the LZ77 algorithms. The window consists of two parts: a search buffer and a look-ahead buffer. The search buffer contains the portion of the text stream that has recently been encoded which, as mentioned, is the dictionary; while the look-ahead buffer contains the text to be encoded next. The window slides through the input text stream from beginning to end during the entire encoding process.


This explains the term sliding window. The size of the search buffer is much larger than that of the look-ahead buffer. This is expected because what is contained in the search buffer is in fact the adaptive dictionary. The sliding window is usually on the order of a few thousand symbols, whereas the look-ahead buffer is on the order of several tens to one hundred symbols.

6.4.4.2 Encoding and Decoding

Below we present more detail about the sliding window dictionary coding technique, i.e., the LZ77 approach, via a simple illustrative example.

Example 6.2
Figure 6.4 shows a sliding window. The input text stream is

ikaccbadaccbaccbaccgikmoabcc

In Figure 6.4a, a search buffer of nine symbols and a look-ahead buffer of six symbols are shown. All the symbols in the search buffer, accbadacc, have just been encoded. All the symbols in the look-ahead buffer, baccba, are to be encoded. (It is understood that the symbols before the search buffer have been encoded and the symbols after the look-ahead buffer are to be encoded.) The strings of symbols, ik and ccgikmoabcc, are not covered by the sliding window at the moment.

At the moment, or in other words, in the first step of encoding, the symbol (symbols) to be encoded begins (begin) with the symbol b. The pointer starts searching for the symbol b from the last symbol in the search buffer, c, which is immediately to the left of the first symbol b in the look-ahead buffer. It finds a match at the sixth position from b. It further determines that the longest string of the match is ba. That is, the maximum matching length is two. The pointer is then represented by a triple, <i, j, k>. The first item, i, represents the distance between the first symbol in the look-ahead buffer and the position of the pointer (the position of the first symbol of the matched string). This distance is called the offset. In this step, the offset is six. The second item in the triple, j, indicates the length of the matched string. Here, the length of the matched string ba is two. The third item, k, is the code word assigned to the symbol immediately following the matched string in the look-ahead buffer. In this step, the third item is C(c), where C is used to represent a function to map symbol(s)

FIGURE 6.4
An encoding example using LZ77. A search buffer of size 9 and a look-ahead buffer of size 6 slide over the text ikaccbadaccbaccbaccgikmoabcc; the triples produced in the three steps shown are (a) <6, 2, C(c)>, (b) <4, 5, C(g)>, and (c) <0, 0, C(i)>.


to a code word, as defined in Section 6.4.1. That is, the resulting triple after the first step is <6, 2, C(c)>.

The reason to include the third item k in the triple is as follows. In the case where there is no match in the search buffer, both i and j will be zero. The third item at this moment is the code word of the first symbol in the look-ahead buffer itself. This means that even in the case where we cannot find a matching string, the sliding window still works. In the third step of the encoding process described below, we will see that the resulting triple is <0, 0, C(i)>. The decoder hence understands that there is no matching, and the single symbol i is decoded.

The second step of the encoding is illustrated in Figure 6.4b. The sliding window has been shifted to the right by three positions. The first symbol to be encoded now is c, which is the leftmost symbol in the look-ahead buffer. The search pointer moves towards the left from the symbol c. It first finds a match in the first position with a length of one. It then finds another match in the fourth position from the first symbol in the look-ahead buffer. Interestingly, the maximum matching can exceed the boundary between the search and the look-ahead buffers and can enter the look-ahead buffer. Why this is possible will be explained shortly, when we discuss the decoding process. In this manner, it is found that the maximum length of matching is five. The last match is found at the fifth position. The length of the matched string is, however, only one. As greedy parsing is used, the match with a length of five is chosen. That is, the offset is four and the maximum match length is five. Consequently, the triple resulting from the second step is <4, 5, C(g)>.

The sliding window is then shifted to the right by six positions. The third step of the encoding is depicted in Figure 6.4c. Obviously, there is no matching of i in the search buffer. The resulting triple is hence <0, 0, C(i)>.

The encoding process can continue in this way. The possible cases we may encounter in the encoding, however, are described in the above-mentioned three steps. Hence, we end our discussion of the encoding process and start discussing the decoding process. Compared with the encoding, the decoding is simpler because there is no need for matching, which involves many comparisons between the symbols in the look-ahead buffer and the symbols in the search buffer. The decoding process is illustrated in Figure 6.5.

In the above-mentioned three steps, the resulting triples are <6, 2, C(c)>, <4, 5, C(g)>, and <0, 0, C(i)>. Now let us see how the decoder works. That is, how the decoder recovers the string baccbaccgi from these three triples.

In Figure 6.5a, the search buffer is the same as that in Figure 6.4a. That is, the string accbadacc stored in the search window is what was just decoded.

Once the first triple <6, 2, C(c)> is received, the decoder will move the decoding pointer from the first position in the look-ahead buffer to the left by six positions. That is, the pointer will point to the symbol b. The decoder then copies the two symbols starting from b, i.e., ba, into the look-ahead buffer. The symbol c will be copied to the right of ba. This is shown in Figure 6.5b. The window is then shifted to the right by three positions, as shown in Figure 6.5c.

After the second triple <4, 5, C(g)> is received, the decoder moves the decoding pointer from the first position of the look-ahead buffer to the left by four positions. The pointer points to the symbol c. The decoder then copies five successive symbols starting from the symbol c pointed to by the pointer. We see that at the beginning of this copying process there are only four symbols available for copying. Once the first symbol is copied, however, all five symbols are available. After copying, the symbol g is added to the end of the five copied symbols in the look-ahead buffer. The results are shown in Figure 6.5d. Figure 6.5e then shows the window shifting to the right by six positions.

After receiving the triple <0, 0, C(i)>, the decoder knows that there is no matching and a single symbol i is encoded. Hence, the decoder adds the symbol i following the symbol g. This is shown in Figure 6.5f.


FIGURE 6.5
A decoding example using LZ77: (a) search buffer at the beginning; (b) after decoding <6, 2, C(c)>; (c) shifting the sliding window; (d) after decoding <4, 5, C(g)>; (e) shifting the sliding window; (f) after decoding <0, 0, C(i)>.

In Figure 6.5, for each part, the last encoded symbol c before receiving the three triples is shaded. From Figure 6.5f, we see that the string added after the symbol c due to the three triples is baccbaccgi. This agrees with the sequence mentioned at the beginning of our discussion about the decoding process. We thus conclude that the decoding process has correctly decoded the encoded sequence from the last encoded symbol and the received triples.
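The triple-generating procedure of Example 6.2 can be summarized in a short program. The following sketch assumes the window sizes of the example (search buffer of 9 symbols, look-ahead buffer of 6) and treats the first 11 symbols of the text as already encoded, so that the first step starts from the configuration of Figure 6.4a; a match is allowed to extend past the boundary into the look-ahead buffer, as explained above.

```python
# LZ77 triple encoding (offset, match length, next symbol).
def lz77_encode(text, start, SB=9, L=6):
    triples, pos = [], start
    while pos < len(text):
        best_off, best_len = 0, 0
        for off in range(1, min(SB, pos) + 1):           # candidate match positions
            length = 0
            while (length < L - 1 and pos + length < len(text) - 1
                   and text[pos - off + length] == text[pos + length]):
                length += 1                               # may extend into the look-ahead buffer
            if length > best_len:
                best_off, best_len = off, length
        next_sym = text[pos + best_len]                   # symbol following the match
        triples.append((best_off, best_len, next_sym))
        pos += best_len + 1
    return triples

text = "ikaccbadaccbaccbaccgikmoabcc"
print(lz77_encode(text, start=11)[:3])
# [(6, 2, 'c'), (4, 5, 'g'), (0, 0, 'i')] -- the triples of Example 6.2
```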

6.4.4.3 Summary of the LZ77 Approach

The sliding window consists of two parts: the search buffer and the look-ahead buffer. The most recently encoded portion of the input text stream is contained in the search buffer, while the portion of the text that needs to be encoded immediately is in the look-ahead buffer. The first symbol in the look-ahead buffer, located to the right of the boundary between the two buffers, is the symbol or the beginning of a string of symbols to be encoded at the moment. Let us call it the symbol s. The size of the search buffer is usually much larger than that of the look-ahead buffer.

In encoding, the search pointer moves to the left, away from the symbol s, to find a match of the symbol s in the search buffer. Once a match is found, the encoding process will further determine the length of the matched string. When there are multiple matches, the match that produces the longest matched string is chosen. The match is denoted by a triple <i, j, k>. The first item in the triple, i, is the offset, which is the distance between the pointer pointing to the symbol giving the maximum match and the symbol s. The second item, j, is the length of the matched string. The third item, k, is the code word of the symbol following the matched string in the look-ahead buffer. The sliding window is then shifted to the right by j + 1 positions before the next coding step takes place.


When there is no matching in the search buffer, the triple is represented by < 0, 0, C(s) >, where C(s) is the code word assigned to the symbol s. The sliding window is then shifted to the right by one position.

The sliding window is shifted along the input text stream during the encoding process. The symbol s moves from the beginning symbol to the ending symbol of the input text stream.

At the very beginning, the content of the search buffer can be arbitrarily selected. For instance, the symbols in the search buffer may all be the space symbol.

Let us denote the size of the search buffer by SB, the size of the look-ahead buffer by L, and the size of the source alphabet by A. Assume that the natural binary code (NBC) is used. Then we see that the LZ77 approach encodes variable-length strings of symbols with fixed-length code words. Specifically, the offset i is of coding length ⌈log2(SB)⌉, the length of the matched string j is of coding length ⌈log2(SB + L)⌉, and the code word k is of coding length ⌈log2(A)⌉, where the sign ⌈a⌉ denotes the smallest integer not less than a.

The coding length of the matched string j is ⌈log2(SB + L)⌉ because the search for the maximum matching can enter into the look-ahead buffer, as shown in Example 6.2.
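As a quick numerical illustration (the buffer and alphabet sizes below are our own example values, not values prescribed by the LZ77 approach), the fixed number of bits spent on each triple can be computed directly:

    from math import ceil, log2

    SB, L, A = 8176, 16, 256                      # illustrative sizes only
    bits_per_triple = ceil(log2(SB)) + ceil(log2(SB + L)) + ceil(log2(A))
    print(bits_per_triple)                        # 13 + 13 + 8 = 34 bits per triple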

The decoding process is simpler than the encoding process since there is no comparison involved in the decoding.

The most recently encoded symbols in the search buffer serve as the dictionary used in the LZ77 approach. The merit of doing so is that the dictionary is well adapted to the input text. The limitation of the approach is that if the distance between the repeated patterns in the input text stream is larger than the size of the search buffer, then the approach cannot utilize the structure to compress the text. A vivid example can be found in [sayood 1996].

A window with a moderate size, say, SB + L of about 8192, can compress a variety of texts well. Several reasons have been analyzed in [bell 1990].

Many variations have been made to improve the coding efficiency of the LZ77 approach. The LZ77 produces a triple in each encoding step; i.e., the offset (position of the matched string), the length of the matched string, and the code word of the symbol following the matched string. The transmission of the third item in each coding step is not efficient. This is true especially at the beginning of coding. A variant of the LZ77, referred to as the LZSS algorithm [bell 1986], improves this inefficiency.

6.4.5 LZ78 Algorithms

6.4.5.1 Introduction

As mentioned earlier, the LZ77 algorithms use a sliding window of fixed size, and both the search buffer and the look-ahead buffer have a fixed size. This means that if the distance between two repeated patterns is larger than the size of the search buffer, the LZ77 algorithms cannot work efficiently. The fixed size of both buffers implies that the matched string cannot be longer than the sum of the sizes of the two buffers, which places another limitation on coding efficiency. Increasing the sizes of the search buffer and the look-ahead buffer would seemingly resolve the problems. A closer look, however, reveals that it also leads to increases in the number of bits required to encode the offset and matched string length as well as an increase in processing complexity.

The LZ78 algorithms [ziv 1978; bell 1990; nelson 1995] eliminate the use of the sliding window. Instead these algorithms use the encoded text as a dictionary which, potentially, does not have a fixed size. Each time a pointer (token) is issued, the encoded string is included in the dictionary. Theoretically the LZ78 algorithms reach optimal performance as the encoded text stream approaches infinity. In practice, however, as mentioned above with respect to the LZ77, a very large dictionary will affect coding efficiency negatively. Therefore, once a preset limit to the dictionary size has been reached, either the dictionary is fixed for the future (if the coding efficiency is good), or it is reset to zero, i.e., it must be restarted.


Instead of the triples used in the LZ77, only pairs are used in the LZ78. Specifically, onlythe position of the pointer to the matched string and the symbol following the matchedstring need to be encoded. The length of the matched string does not need to be encodedbecause both the encoder and the decoder have exactly the same dictionary, i.e., thedecoder knows the length of the matched string.

6.4.5.2 Encoding and Decoding

As in the discussion of the LZ77 algorithms, we will go through an example to describe the LZ78 algorithms.

Example 6.3
Consider the text stream: baccbaccacbcabccbbacc. Table 6.4 shows the coding process. We see that for the first three symbols there is no match between the individual input symbols and the entries in the dictionary. Therefore, the doubles are < 0, C(b) >, < 0, C(a) >, and < 0, C(c) >, respectively, where 0 means no match, and C(b), C(a), and C(c) represent the code words of b, a, and c, respectively. After symbols b, a, c, comes c, which finds a match in the dictionary (the third entry). Therefore, the next symbol b is combined to be considered. Since the string cb did not appear before, it is encoded as a double and it is appended as a new entry into the dictionary. The first item in the double is the index of the matched entry c, 3; the second item is the index/code word of the symbol following the match b, 1. That is, the double is < 3, 1 >. The following input symbol is a, which appeared in the dictionary. Hence the next symbol c is taken into consideration. Since the string ac is not an entry of the dictionary, it is encoded with a double. The first item in the double is the index of symbol a, 2; the second item is the index of symbol c, 3, i.e., < 2, 3 >. The encoding proceeds in this way. In Table 6.4, as the encoding proceeds, the entries in the dictionary become longer and longer. First, entries with single symbols come out; later more and more entries with two symbols show up. After that more and more entries with three symbols appear. This means that coding efficiency is increasing.
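The growth of the dictionary and the emission of the doubles can be sketched as follows. This Python function is our own illustration; it writes each double as (index of the longest matched entry, next symbol), with index 0 meaning no match, whereas Table 6.4 further replaces the symbol by its index/code word.

    def lz78_encode(text):
        dictionary = {}                    # string -> index, built as we go
        doubles = []
        phrase = ''
        for symbol in text:
            candidate = phrase + symbol
            if candidate in dictionary:
                phrase = candidate         # keep extending the match
            else:
                doubles.append((dictionary.get(phrase, 0), symbol))
                dictionary[candidate] = len(dictionary) + 1    # new entry
                phrase = ''
        if phrase:                         # flush a final pending match, if any
            doubles.append((dictionary[phrase], ''))
        return doubles

    print(lz78_encode('baccbaccacbcabccbbacc'))
    # [(0,'b'), (0,'a'), (0,'c'), (3,'b'), (2,'c'), (3,'a'), (4,'c'),
    #  (2,'b'), (3,'c'), (1,'b'), (5,'c')]  -- the 11 entries of Table 6.4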

Now consider the decoding process. Since the decoder knows the rule applied in the encoding, it can reconstruct the dictionary and decode the input text stream from the received doubles. When the first double < 0, C(b) > is received, the decoder knows that there is no match. Hence, the first entry in the dictionary is b. So is the first decoded symbol. From the second double < 0, C(a) >, symbol a is known as the second entry in the dictionary as well as the second decoded symbol.

TABLE 6.4
An Encoding Example Using the LZ78 Algorithm

Index    Doubles        Encoded Symbols
1        < 0, C(b) >    b
2        < 0, C(a) >    a
3        < 0, C(c) >    c
4        < 3, 1 >       cb
5        < 2, 3 >       ac
6        < 3, 2 >       ca
7        < 4, 3 >       cbc
8        < 2, 1 >       ab
9        < 3, 3 >       cc
10       < 1, 1 >       bb
11       < 5, 3 >       acc


Similarly, the next entry in the dictionary and the next decoded symbol are known as c. When the following double < 3, 1 > is received, the decoder knows from the two items, 3 and 1, that the next two symbols are the third and the first entries in the dictionary. This indicates that the symbols c and b are decoded, and the string cb becomes the fourth entry in the dictionary.

We omit the next two doubles and take a look at the double < 4, 3 >, which is associated with index 7 in Table 6.4. Since the first item in the double is 4, it means that the maximum matched string is cb, which is associated with index 4 in Table 6.4. The second item in the double, 3, implies that the symbol following the match is the third entry, c. Therefore, the decoder decodes a string cbc. Also the string cbc becomes the seventh entry in the reconstructed dictionary. In this way, the decoder can reconstruct the exact same dictionary as that established by the encoder and decode the input text stream from the received doubles.
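The corresponding decoder, again written as an illustrative Python sketch with the doubles represented as (index, symbol) pairs, rebuilds exactly the same dictionary while emitting the text:

    def lz78_decode(doubles):
        entries = ['']                     # entries[0]: the empty string for index 0
        out = []
        for index, symbol in doubles:
            phrase = entries[index] + symbol
            entries.append(phrase)         # same dictionary-growth rule as the encoder
            out.append(phrase)
        return ''.join(out)

    doubles = [(0, 'b'), (0, 'a'), (0, 'c'), (3, 'b'), (2, 'c'), (3, 'a'),
               (4, 'c'), (2, 'b'), (3, 'c'), (1, 'b'), (5, 'c')]
    print(lz78_decode(doubles))            # baccbaccacbcabccbbacc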

6.4.5.3 LZW Algorithm

Both the LZ77 and LZ78 approaches, when published in 1977 and 1978, respectively, were theory oriented. The effective and practical improvement over the LZ78 in [welch 1984] brought much attention to the LZ dictionary coding techniques. The resulting algorithm is referred to as the LZW algorithm [bell 1990; nelson 1995]. It removed the second item in the double (the index of the symbol following the longest matched string) and hence enhanced coding efficiency. In other words, the LZW only sends the indexes of the dictionary to the decoder. For this purpose, the LZW first forms an initial dictionary, which consists of all the individual source symbols contained in the source alphabet. Then, the encoder examines the input symbol. Since the input symbol matches an entry in the dictionary, its succeeding symbol is cascaded to form a string. The cascaded string does not find a match in the initial dictionary. Hence the index of the matched symbol is encoded and the enlarged string (the matched symbol followed by the cascaded symbol) is listed as a new entry in the dictionary. The encoding process continues in this manner.

For the encoding and decoding processes, let us go through an example to see how the LZW algorithm can encode only the indexes and the decoder can still decode the input text string.

Example 6.4
Consider the following input text stream: accbadaccbaccbacc. We see that the source alphabet is S = {a, b, c, d}. The top portion of Table 6.5 (with indexes 1, 2, 3, 4) gives a possible initial dictionary used in the LZW. When the first symbol a is input, the encoder finds that it has a match in the dictionary. Therefore, the next symbol c is taken to form a string ac. As the string ac is not in the dictionary, it is listed as a new entry in the dictionary and is given an index, 5. The index of the matched symbol a, 1, is encoded. When the second symbol, c, is input, the encoder takes the following symbol c into consideration because there is a match to the second input symbol c in the dictionary. Since the string cc does not match any existing entry, it becomes a new entry in the dictionary with an index, 6. The index of the matched symbol (the second input symbol), c, is encoded. Now consider the third input symbol c, which appeared in the dictionary. Hence, the following symbol b is cascaded to form a string cb. Since the string cb is not in the dictionary, it becomes a new entry in the dictionary and is given an index, 7. The index of the matched symbol c, 3, is encoded. The process proceeds in this fashion. Take a look at entry 11 in the dictionary shown in Table 6.5. The input symbol at this point is a. Since it has a match in the previous entries, its next symbol c is considered. Since the string ac appeared in entry 5, the succeeding symbol c is combined. Now the new enlarged string becomes acc and it does not have a match in the previous entries.


TABLE 6.5
An Example of the Dictionary Coding Using the LZW Algorithm

Index    Entry     Input Symbols    Encoded Index
1        a         (initial dictionary)
2        b
3        c
4        d
5        ac        a                1
6        cc        c                3
7        cb        c                3
8        ba        b                2
9        ad        a                1
10       da        d                4
11       acc       a, c             5
12       cba       c, b             7
13       accb      a, c, c          11
14       bac       b, a             8
15       cc ...    c, c, ...

It is thus added to the dictionary, and a new index, 11, is given to the string acc. The index of the matched string ac, 5, is encoded and transmitted. The final sequence of encoded indexes is 1, 3, 3, 2, 1, 4, 5, 7, 11, 8. Like the LZ78, the entries in the dictionary become longer and longer in the LZW algorithm. This implies high coding efficiency since long strings can be represented by indexes.
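The encoding loop can be sketched compactly in Python as below; the initial dictionary and the index numbering follow Table 6.5, while the function name and the flush of the final pending match are our own illustrative choices.

    def lzw_encode(text, alphabet=('a', 'b', 'c', 'd')):
        dictionary = {s: i + 1 for i, s in enumerate(alphabet)}   # initial entries 1-4
        indexes = []
        phrase = text[0]
        for symbol in text[1:]:
            if phrase + symbol in dictionary:
                phrase += symbol                       # keep extending the match
            else:
                indexes.append(dictionary[phrase])     # send index of longest match
                dictionary[phrase + symbol] = len(dictionary) + 1
                phrase = symbol                        # restart from the unmatched symbol
        indexes.append(dictionary[phrase])             # flush the final match
        return indexes

    print(lzw_encode('accbadaccbaccbacc'))
    # [1, 3, 3, 2, 1, 4, 5, 7, 11, 8, 6] -- the trailing 6 flushes the pending
    # match cc, which corresponds to the still-open entry 15 of Table 6.5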

Now let us take a look at the decoding process to see how the decoder can decode the input text stream from the received indexes. Initially, the decoder has the same dictionary (the top four rows in Table 6.5) as that in the encoder. Once the first index 1 comes, the decoder decodes a symbol a. The second index is 3, which indicates that the next symbol is c. From the rule applied in encoding, the decoder knows further that a new entry ac has been added to the dictionary with an index 5. The next index is 3. It is known that the next symbol is also c. It is also known that the string cc has been added into the dictionary as the sixth entry. In this way, the decoder reconstructs the dictionary and decodes the input text stream.
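The decoding rule just described can be sketched as follows (an illustrative Python function of our own). A complete decoder must also handle the corner case in which a received index refers to the entry that is only now being created; the branch below covers it, although it does not arise in this example.

    def lzw_decode(indexes, alphabet=('a', 'b', 'c', 'd')):
        entries = {i + 1: s for i, s in enumerate(alphabet)}
        previous = entries[indexes[0]]
        out = [previous]
        for index in indexes[1:]:
            if index in entries:
                current = entries[index]
            else:                                   # index of the entry being created
                current = previous + previous[0]
            out.append(current)
            # New entry: previous match followed by the first symbol of the current one.
            entries[len(entries) + 1] = previous + current[0]
            previous = current
        return ''.join(out)

    print(lzw_decode([1, 3, 3, 2, 1, 4, 5, 7, 11, 8, 6]))   # accbadaccbaccbacc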

6.4.5.4 Summary

The LZW algorithm, as a representative of the LZ78 approach, is summarized below.

The initial dictionary contains the indexes for all the individual source symbols. At the beginning of encoding, when a symbol is input, since it has a match in the initial dictionary, the next symbol is cascaded to form a two-symbol string. Since the two-symbol string cannot find a match in the initial dictionary, the index of the first symbol is encoded and transmitted, and the two-symbol string is added to the dictionary with a new, incremented index. The next encoding step starts with the second symbol among the two symbols.

In the middle of encoding, the encoding process starts with the last symbol of the latest added dictionary entry. Since it has a match in the previous entries, its succeeding symbol is cascaded after the symbol to form a string. If this string has appeared before in the dictionary (i.e., the string finds a match), the next symbol is cascaded as well. This process continues until such an enlarged string cannot find a match in the dictionary. At this moment, the index of the last matched string (the longest match) is encoded and transmitted, and the enlarged and unmatched string is added into the dictionary as a new entry with a new, incremented index.


Decoding is a process of transforming the index string back to the corresponding symbol string. To do so, however, the dictionary must be reconstructed exactly the same as that established in the encoding process. That is, the initial dictionary is constructed first in the same way as that in the encoding. When decoding the index string, the decoder reconstructs the same dictionary as that in the encoder according to the rule used in the encoding.

Specifically, at the beginning of the decoding, after receiving an index, a corresponding single symbol can be decoded. Through the next received index, another symbol can be decoded. From the rule used in the encoding, the decoder knows that the two symbols should be cascaded to form a new entry added into the dictionary with an incremented index. The next step in the decoding will start from the second symbol among the two symbols.

Now consider the middle of the decoding process. The presently received index is used to decode a corresponding string of input symbols according to the dictionary reconstructed at the moment. (Note that this string is said to be with the present index.) It is known from the encoding rule that the symbols in the string associated with the next index should be considered. (Note that this string is said to be with the next index.) That is, the first symbol in the string with the next index should be appended to the last symbol in the string with the present index. The resultant combination, i.e., the string with the present index followed by the first symbol in the string with the next index, cannot find a match to an entry in the dictionary. Therefore, the combination should be added to the dictionary with an incremented index. At this moment, the next index becomes the new present index, and the index following the next index becomes the new next index. The decoding process then proceeds in the same fashion in a new decoding step.

Compared with the LZ78 algorithm, the LZW algorithm eliminates the necessity of having the second item in the double, an index/code word of the symbol following a matched string. That is, the encoder only needs to encode and transmit the first item in the double. This greatly enhances the coding efficiency and reduces the complexity of the LZ algorithm.

6.4.5.5 Applications

The CCITT Recommendation V.42 bis is a data compression standard used in modems that connect computers with remote users via the GSTN. In the compressed mode, the LZW algorithm is recommended for data compression.

In image compression, the LZW finds its application as well. Specifically, it is utilized in the graphics interchange format (GIF) that was created to encode graphical images. GIF is now also used to encode natural images, though it is not very efficient in this regard. For more information, readers are referred to [sayood 1996]. The LZW algorithm is also used in the Unix Compress command.

6.5 International Standards for Lossless Still Image Compression

In Chapter 5, we studied Huffman and arithmetic coding techniques. We also briefly discussed the international standard for bilevel image compression, known as the JBIG. In this chapter, so far we have discussed another two coding techniques: the RLC and dictionary coding techniques. We have also introduced the international standards for facsimile compression, in which the techniques known as the MH, MR, and MMR were recommended. All of these techniques involve lossless compression. In Chapter 7,


the international still image coding standard JPEG will be introduced. As we will see, the JPEG has four different modes. They can be divided into two compression categories: lossy and lossless. Hence, we can discuss the lossless JPEG there. Before leaving this chapter, however, we will briefly discuss, compare, and summarize the various techniques used in the international standards for lossless still image compression. For more detail, readers are referred to an excellent survey paper [arps 1994].

6.5.1 Lossless Bilevel Still Image Compression

6.5.1.1 Algorithms

As mentioned earlier, there are four different international standard algorithms falling into this category.

MH (modified Huffman coding): This algorithm defined in CCITT Recommendation T.4 for facsimile coding uses the 1-D RLC technique followed by the MH coding technique.

MR (modified READ (relative element address designate) coding): This algorithm defined in CCITT Recommendation T.4 for facsimile coding uses the 2-D RLC technique followed by the MH coding technique.

MMR (modified modified READ coding): This algorithm defined in CCITT Recommendation T.6 is based on MR, but is modified to maximize compression.

JBIG (Joint Bilevel Image experts Group coding): This algorithm defined in CCITT Recommendation T.82 uses an adaptive 2-D coding model, followed by an adaptive arithmetic coding technique.

6.5.1.2 Performance Comparison

The JBIG test image set was used to compare the performance of the above-mentioned algorithms. The set contains scanned business documents with different densities, graphic images, digital halftones, and mixed (document and halftone) images.

Note that digital halftones, also named (digital) halftone images, are generated by using only binary devices. Some small black units are imposed on a white background. The units may assume different shapes: circle, square, and so on. The denser the black units in a spot of an image, the darker the spot appears. The digital halftoning method has been used for printing gray-level images in newspapers and books. Digital halftoning through character overstriking, used to generate digital images in the early days for the experimental work associated with courses on digital image processing, is described in [gonzalez 1992].

The following two observations on the performance comparison were made after the application of several techniques to the JBIG test image set.

For bilevel images excluding digital halftones, the compression ratio achieved by these techniques ranges from 3 to 100. The compression ratio increases monotonically in the order of the following standard algorithms: MH, MR, MMR, and JBIG.

For digital halftones, MH, MR, and MMR result in data expansion, while JBIG achieves compression ratios in the range of 5–20. This demonstrates that among the techniques, JBIG is the only one suitable for the compression of digital halftones.

6.5.2 Lossless Multilevel Still Image Compression

6.5.2.1 Algorithms

There are two international standards for multilevel still image compression:

JBIG (Joint Bilevel Image experts Group coding): Defined in CCITT Recommendation T.82,

uses an adaptive arithmetic coding technique. To encode multilevel images, the JBIG


decomposes multilevel images into bit-planes, then compresses these bit-planes using its bilevel image compression technique. To further enhance the compression ratio, it uses Gray coding to represent pixel amplitudes instead of weighted binary coding.

JPEG (Joint Photographic (image) Experts Group coding): Defined in CCITT Recommendation T.81. For lossless coding, it uses the differential coding technique. The prediction error is encoded using either Huffman coding or adaptive arithmetic coding techniques.

6.5.2.2 Performance Comparison

A set of color test images from the JPEG standards committee was used for performance comparison. The luminance component (Y) is of resolution 720 × 576 pixels, while the chrominance components (U and V) are of 360 × 576 pixels. The compression ratios calculated are the combined results for all the three components. The following observations have been reported:

When quantized in 8 bits/pixel, the compression ratios vary much less for multilevel images than for bilevel images, and are roughly equal to 2.

When quantized with 5 bits/pixel down to 2 bits/pixel, compared with the lossless JPEG, the JBIG achieves an increasingly higher compression ratio, up to a maximum of 29%.

When quantized with 6 bits/pixel, JBIG and lossless JPEG achieve similar compression ratios.

When quantized with 7–8 bits/pixel, the lossless JPEG achieves a 2.4%–2.6% higher compression ratio than JBIG.

6.6 Summary

Both Huffman coding and arithmetic coding, discussed in Chapter 5, are referred to as variable-length coding techniques, because the lengths of code words assigned to different entries in a source alphabet are different. In general, a code word of a shorter length is assigned to an entry with higher occurrence probabilities. They are also classified as fixed-length-to-variable-length coding techniques [arps 1979], since the entries in a source alphabet have the same fixed length. Run-length coding (RLC) and dictionary coding, which are the focus of this chapter, are opposite and are referred to as variable-length-to-fixed-length coding techniques. This is because the runs in the RLC and the strings in the dictionary coding are variable and are encoded with code words of the same fixed length.

Based on RLC, the international standard algorithms for facsimile coding, MH, MR, and MMR, have worked successfully except for dealing with digital halftones. That is, these algorithms result in data expansion when applied to digital halftones. The JBIG, based on an adaptive arithmetic coding technique, not only achieves a higher coding efficiency than MH, MR, and MMR for facsimile coding, but also compresses digital halftones effectively.

Note that 1-D RLC utilizes the correlation between pixels within a scan line, whereas 2-D RLC utilizes the correlation between pixels within a few scan lines. As a result, 2-D RLC can obtain higher coding efficiency than 1-D RLC on the one hand. On the other hand, 2-D RLC is more susceptible to transmission errors than 1-D RLC.

In text compression, the dictionary-based techniques have proven to be efficient. All the adaptive dictionary-based algorithms can be classified into two groups. One is based


on a pioneering work by Ziv and Lempel in 1977, and another is based on their pioneering work in 1978. They are called the LZ77 and LZ78 algorithms, respectively. With the LZ77 algorithms, a fixed-size window slides through the input text stream. The sliding window consists of two parts: the search buffer and the look-ahead buffer. The search buffer contains the most recently encoded portion of the input text, while the look-ahead buffer contains the portion of the input text to be encoded immediately. For the symbols to be encoded, the LZ77 algorithms search for the longest match in the search buffer. The information about the match: the distance between the matched string in the search buffer and that in the look-ahead buffer, the length of the matched string, and the code word of the symbol following the matched string in the look-ahead buffer are encoded. Many improvements have been made in the LZ77 algorithms.

The performance of the LZ77 algorithms is limited by the sizes of the search buffer and the look-ahead buffer. With a finite size for the search buffer, the LZ77 algorithms will not work well in the case where repeated patterns are apart from each other by a distance longer than the size of the search buffer. With a finite size for the sliding window, the LZ77 algorithms will not work well in the case where matching strings are longer than the window. In order to be efficient, however, these sizes cannot be very large.

In order to overcome the problem, the LZ78 algorithms work in a different way. They do not use the sliding window at all. Instead of using the most recently encoded portion of the input text as a dictionary, the LZ78 algorithms use the index of the longest matched string as an entry of the dictionary. That is, each matched string cascaded with its immediate next symbol is compared with the existing entries of the dictionary. If this combination (a new string) does not find a match in the dictionary constructed at the moment, the combination will be included as an entry in the dictionary. Otherwise, the next symbol in the input text will be appended to the combination and the enlarged new combination will be checked with the dictionary. The process continues until the new combination cannot find a match in the dictionary. Among various variants of the LZ78 algorithms, the LZW algorithm is perhaps the most important one. It only needs to encode the indexes of the longest matched strings to the dictionary. It can be shown that the decoder can decode the input text stream from the given index stream. In doing so, the same dictionary as that established in the encoder needs to be reconstructed at the decoder, and this can be implemented since the same rule used in the encoding is known in the decoder.

The size of the dictionary cannot be infinitely large because, as mentioned above, the coding efficiency will not be high. The common practice of the LZ78 algorithms is to keep the dictionary fixed once a certain size has been reached and the performance of the encoding is satisfactory. Otherwise, the dictionary will be set to empty and will be reconstructed from scratch.

Considering the fact that there are several international standards concerning still image coding (for both bilevel and multilevel images), a brief summary and a performance comparison are presented at the end of this chapter. At the beginning of this chapter, a description of the discrete Markov source and its nth extensions are provided. The Markov source and the autoregressive (AR) model serve as important models for the dependent information sources.

Exercises

1. Draw the state diagram of a second-order Markov source with two symbols in the source alphabet. That is, S = {s1, s2}. It is assumed that the conditional probabilities are


p(s1 | s1 s1) = p(s2 | s2 s2) = 0.7,
p(s2 | s1 s1) = p(s1 | s2 s2) = 0.3, and
p(s1 | s1 s2) = p(s1 | s2 s1) = p(s2 | s1 s2) = p(s2 | s2 s1) = 0.5.

2. What are the definitions of raster algorithm and area algorithm in binary image coding? Which category does 1-D RLC belong to? Which category does 2-D RLC belong to?

3. What effect does a transmission error have on 1-D RLC and 2-D RLC, respectively? What is the function of the code word EOL?

4. Make a convincing argument that the MH algorithm reduces the requirement of large storage space.

5. Which three different modes does 2-D RLC have? How do you view the vertical mode?

6. Using your own words, describe the encoding and decoding processes of the LZ77 algorithms. Go through Example 6.2.

7. Using your own words, describe the encoding and decoding processes of the LZW algorithm. Go through Example 6.4.

8. Read the reference paper [arps 1994], which is an excellent survey on the international standards for lossless still image compression. Pay particular attention to all the figures and Table 6.1.

References

[abramson 1963] N. Abramson, Information Theory and Coding, McGraw-Hill, New York, 1963.
[arps 1979] R.B. Arps, Binary image compression, in Image Transmission Techniques, W.K. Pratt (Ed.), Academic Press, New York, 1979.
[arps 1994] R.B. Arps and T.K. Truong, Comparison of international standards for lossless still image compression, Proceedings of the IEEE, 82, 6, 889–899, June 1994.
[bell 1986] T.C. Bell, Better OPM/L text compression, IEEE Transactions on Communications, COM-34, 1176–1182, December 1986.
[bell 1990] T.C. Bell, J.G. Cleary, and I.H. Witten, Text Compression, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[gonzalez 1992] R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison Wesley, Reading, MA, 1992.
[hunter 1980] R. Hunter and A.H. Robinson, International digital facsimile coding standards, Proceedings of the IEEE, 68, 7, 854–867, 1980.
[laemmel 1951] A.E. Laemmel, Coding processes for bandwidth reduction in picture transmission, Rep. R-246-51, PIB-187, Microwave Research Institute, Polytechnic Institute of Brooklyn, New York.
[nelson 1995] M. Nelson and J.-L. Gailly, The Data Compression Book, 2nd edn., M&T Books, New York, 1995.
[sayood 1996] K. Sayood, Introduction to Data Compression, Morgan Kaufmann Publishers, San Francisco, CA, 1996.
[shannon 1949] C.E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, IL, 1949.
[welch 1984] T. Welch, A technique for high-performance data compression, IEEE Computer, 17, 6, 8–19, June 1984.
[ziv 1977] J. Ziv and A. Lempel, A universal algorithm for sequential data compression, IEEE Transactions on Information Theory, 23, 3, 337–343, May 1977.
[ziv 1978] J. Ziv and A. Lempel, Compression of individual sequences via variable-rate coding, IEEE Transactions on Information Theory, 24, 5, 530–536, September 1978.


Part II

Still Image Compression


7
Still Image Coding: Standard JPEG

In this chapter, the JPEG standard is introduced. This standard allows for lossy and lossless encoding of still images, and four distinct modes of operation are supported: sequential DCT-based mode, progressive DCT-based mode, lossless mode, and hierarchical mode.

7.1 Introduction

Still image coding is an important application of data compression. When an analog image or picture is digitized, each pixel is represented by a fixed number of bits, which correspond to a certain number of gray levels. In this uncompressed format, the digitized image requires a large number of bits to be stored or transmitted. As a result, compression becomes necessary due to the limited communication bandwidth or storage size. Since the mid-1980s, the ITU and ISO have been working together to develop a joint international standard for the compression of still images. Officially, JPEG [jpeg 1992] is the ISO/IEC international standard 10918-1: digital compression and coding of continuous-tone still images, or the ITU-T Recommendation T.81. JPEG became an international standard in 1992. The JPEG standard allows for both lossy and lossless encoding of still images. The algorithm for lossy coding is a discrete cosine transform (DCT)-based coding scheme. This is the baseline of JPEG and is sufficient for many applications. However, to meet the needs of applications that cannot tolerate loss, e.g., compression of medical images, a lossless coding scheme is also provided and is based on a predictive coding scheme. From the algorithmic point of view, JPEG includes four distinct modes of operation: sequential DCT-based mode, progressive DCT-based mode, lossless mode, and hierarchical mode. In the following sections, an overview of these modes is provided. Further technical details can be found in the books by Pennelbaker and Symes [pennelbaker 1992, symes 1998].

In the sequential DCT-based mode, an image is first partitioned into blocks of 8 × 8 pixels. The blocks are processed from left to right and top to bottom. The 8 × 8 two-dimensional (2-D) forward DCT is applied to each block and the 8 × 8 DCT coefficients are quantized. Finally, the quantized DCT coefficients are entropy encoded and output as part of the compressed image data.

In the progressive DCT-based mode, the process of block partitioning and forward DCT transform is the same as in the sequential DCT-based mode. However, in the progressive mode, the quantized DCT coefficients are first stored in a buffer before the encoding is performed. The DCT coefficients in the buffer are then encoded by a multiple scanning process. In each scan, the quantized DCT coefficients are partially encoded by either spectral selection or successive approximation. In the method of spectral selection, the quantized DCT coefficients are divided into multiple spectral bands according to a zigzag (ZZ) order. In each scan, a specified band is encoded.


FIGURE 7.1
(a) Sequential coding and (b) progressive coding.

In the method of successive approximation, a specified number of most significant bits of the quantized coefficients are first encoded, followed by the least significant bits in later scans.

The difference between sequential coding and progressive coding is shown in Figure 7.1. In sequential coding, an image is encoded part by part according to the scanning order, while in progressive coding, the image is encoded by a multiscanning process and in each scan the full image is encoded to a certain quality level.

As mentioned earlier, lossless coding is achieved by a predictive coding scheme. In this scheme, three neighboring pixels are used to predict the current pixel to be coded. The prediction difference is entropy coded using either Huffman or arithmetic coding. Because the prediction is not quantized, the coding is lossless.

Finally, in the hierarchical mode, an image is first spatially down-sampled to a multi-layered pyramid, resulting in a sequence of frames as shown in Figure 7.2. This sequence of frames is encoded by a predictive coding scheme. Except for the first frame, the predictive coding process is applied to the differential frames, i.e., the differences between the frame to be coded and the predictive reference frame.

FIGURE 7.2
Hierarchical multiresolution encoding.


It is important to note that the reference frame is equivalent to the earlier frame that would be reconstructed in the decoder. The coding method for the difference frame may either use the DCT-based coding method, the lossless coding method, or the DCT-based processes with a final lossless process. Down-sampling and up-sampling filters are used in the hierarchical mode. The hierarchical coding mode provides a progressive presentation similar to the progressive DCT-based mode, but is also useful in applications that have multiresolution requirements. The hierarchical coding mode also provides the capability of progressive coding to a final lossless stage.

7.2 Sequential DCT-Based Encoding Algorithm

The sequential DCT-based coding algorithm is the baseline algorithm of the JPEG coding standard. The block diagram of the encoding process is shown in Figure 7.3. As shown in Figure 7.4, the digitized image data is first partitioned into blocks of 8 × 8 pixels. The 2-D forward DCT is applied to each 8 × 8 block. The 2-D forward and inverse DCT of an 8 × 8 block are defined as follows:

FDCT:

$$S_{uv} = \frac{1}{4} C_u C_v \sum_{i=0}^{7} \sum_{j=0}^{7} s_{ij} \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16}$$

IDCT:

$$s_{ij} = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} C_u C_v S_{uv} \cos\frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16}$$

$$C_u, C_v = \begin{cases} 1/\sqrt{2} & \text{for } u, v = 0 \\ 1 & \text{otherwise} \end{cases} \qquad (7.1)$$

where sij is the value of the pixel at position (i, j) in the block and Suv is the transformed (u, v) DCT coefficient.
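A direct, unoptimized implementation of the forward transform in Equation 7.1 is sketched below in Python for illustration; practical encoders use fast DCT algorithms, and the function name here is our own.

    import math

    def fdct_8x8(block):
        # Forward 2-D DCT of an 8x8 block of pixel values, following Equation 7.1.
        def c(k):
            return 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        S = [[0.0] * 8 for _ in range(8)]
        for u in range(8):
            for v in range(8):
                acc = 0.0
                for i in range(8):
                    for j in range(8):
                        acc += (block[i][j]
                                * math.cos((2 * i + 1) * u * math.pi / 16)
                                * math.cos((2 * j + 1) * v * math.pi / 16))
                S[u][v] = 0.25 * c(u) * c(v) * acc
        return S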

FIGURE 7.3
Block diagram of the sequential discrete cosine transform (DCT)-based encoding process: the input image is partitioned, each block undergoes the forward DCT, the coefficients are quantized using the quantization tables (table specification), and entropy encoding produces the compressed image data.


FIGURE 7.4
Partitioning to 8 × 8 blocks (samples s00 through s77 in each block).

After the forward DCT, quantization of the transformed DCT coefficients is performed. Each of the 64 DCT coefficients is quantized by a uniform quantizer:

$$Sq_{uv} = \operatorname{round}\!\left(\frac{S_{uv}}{Q_{uv}}\right) \qquad (7.2)$$

where Squv is the quantized value of the DCT coefficient Suv, and Quv is the quantization step obtained from the quantization table.

There are four quantization tables that may be used by the encoder, but there is no default quantization table specified by the standard. Two particular quantization tables are shown in Table 7.1.

At the decoder, the dequantization is performed as follows:

$$Rq_{uv} = Sq_{uv} \times Q_{uv} \qquad (7.3)$$

where Rquv is the value of the dequantized DCT coefficient. After quantization, the DC coefficient, Sq00, is treated separately from the other 63 AC coefficients.
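Equations 7.2 and 7.3 translate directly into code. The short Python sketch below (names are ours) quantizes and dequantizes one 8 × 8 coefficient block given a quantization table such as those in Table 7.1; note that Python's round() resolves exact halves to the nearest even integer, which may differ slightly from a particular encoder's handling of ties.

    def quantize(S, Q):
        # Uniform quantization of the 8x8 DCT coefficients (Equation 7.2).
        return [[round(S[u][v] / Q[u][v]) for v in range(8)] for u in range(8)]

    def dequantize(Sq, Q):
        # Dequantization performed at the decoder (Equation 7.3).
        return [[Sq[u][v] * Q[u][v] for v in range(8)] for u in range(8)]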

TABLE 7.1
Two Examples of Quantization Tables Used by JPEG

Luminance Quantization Table
16  11  10  16   24   40   51   61
12  12  14  19   26   58   60   55
14  13  16  24   40   57   69   56
14  17  22  29   51   87   80   62
18  22  37  56   68  109  103   77
24  35  55  64   81  104  113   92
49  64  78  87  103  121  120  101
72  92  95  98  112  100  103   99

Chrominance Quantization Table
17  18  24  47  99  99  99  99
18  21  26  66  99  99  99  99
24  26  56  99  99  99  99  99
47  66  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99
99  99  99  99  99  99  99  99


TABLE 7.2
Huffman Coding of DC Coefficients

SSSS   Difference (DIFF) Values              Additional Bits
0      0                                     —
1      −1, 1                                 0, 1
2      −3, −2, 2, 3                          00, 01, 10, 11
3      −7, ..., −4, 4, ..., 7                000, ..., 011, 100, ..., 111
4      −15, ..., −8, 8, ..., 15              0000, ..., 0111, 1000, ..., 1111
5      −31, ..., −16, 16, ..., 31            00000, ..., 01111, 10000, ..., 11111
6      −63, ..., −32, 32, ..., 63            ..., ...
7      −127, ..., −64, 64, ..., 127          ..., ...
8      −255, ..., −128, 128, ..., 255        ..., ...
9      −511, ..., −256, 256, ..., 511        ..., ...
10     −1023, ..., −512, 512, ..., 1023      ..., ...
11     −2047, ..., −1024, 1024, ..., 2047    ..., ...

The DC coefficients are encoded by a predictive coding scheme. The encoded value is the difference (DIFF) between the quantized DC coefficient of the current block (Sq00) and that of the earlier block of the same component (PRED):

$$\mathrm{DIFF} = Sq_{00} - \mathrm{PRED} \qquad (7.4)$$

The value of DIFF is entropy coded with Huffman tables. More specifically, the two's complement of the possible DIFF magnitudes are grouped into 12 categories, "SSSS." The Huffman codes for these 12 difference categories and additional bits are shown in Table 7.2.

For each nonzero category, additional bits are added to the code word to uniquely identify which difference within the category actually occurred. The number of additional bits is defined by "SSSS," and the additional bits are appended to the least significant bit of the Huffman code (most significant bit first) according to the following rule. If the difference value is positive, the "SSSS" low-order bits of DIFF are appended; if the difference value is negative, then the "SSSS" low-order bits of DIFF − 1 are appended. As an example, the Huffman tables used for coding the luminance and chrominance DC coefficients are shown in Tables 7.3 and 7.4, respectively. These two tables have been developed from the average statistics of a large set of images with 8-bit precision.
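The category and additional-bits rule for the DC differences can be sketched as follows; the Huffman table passed in would map each SSSS category to its code word (Table 7.3 or 7.4), and the helper names are our own.

    def dc_category(diff):
        # SSSS category of a DC difference (Table 7.2): number of bits in |DIFF|.
        ssss, magnitude = 0, abs(diff)
        while magnitude:
            ssss += 1
            magnitude >>= 1
        return ssss

    def dc_additional_bits(diff, ssss):
        # Low-order SSSS bits of DIFF if positive, or of DIFF - 1 if negative.
        if ssss == 0:
            return ''
        value = diff if diff >= 0 else diff - 1
        return format(value & ((1 << ssss) - 1), '0{}b'.format(ssss))

    def encode_dc_diff(diff, huffman_table):
        # Code word for the category followed by the additional bits.
        ssss = dc_category(diff)
        return huffman_table[ssss] + dc_additional_bits(diff, ssss)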

In contrast to the coding of DC coefficients, the quantized AC coefficients are arranged in a zigzag order before being entropy coded. This scan order is shown in Figure 7.5.

TABLE 7.3
Huffman Table for Luminance DC Coefficient Differences

Category  Code Length  Code Word
0         2            00
1         3            010
2         3            011
3         3            100
4         3            101
5         3            110
6         4            1110
7         5            11110
8         6            111110
9         7            1111110
10        8            11111110
11        9            111111110


TABLE 7.4
Huffman Table for Chrominance DC Coefficient Differences

Category  Code Length  Code Word
0         2            00
1         2            01
2         2            10
3         3            110
4         4            1110
5         5            11110
6         6            111110
7         7            1111110
8         8            11111110
9         9            111111110
10        10           1111111110
11        11           11111111110

According to the zigzag scanning order, the quantized coefficients can be represented as

$$ZZ(0) = Sq_{00}, \quad ZZ(1) = Sq_{01}, \quad ZZ(2) = Sq_{10}, \quad \ldots, \quad ZZ(63) = Sq_{77} \qquad (7.5)$$

When many of the quantized AC coefficients become zero, they can be very efficiently encoded by exploiting the run lengths of zeros. The run lengths of zeros are identified by the nonzero coefficients. An 8-bit code 'RRRRSSSS' is used to represent the nonzero coefficient. The four least significant bits, 'SSSS,' define a category for the value of the next nonzero coefficient in the zigzag sequence, which ends the zero run. The four most significant bits, 'RRRR,' define the run length of zeros in the zigzag sequence or the position of the nonzero coefficient in the zigzag sequence. The composite value, RRRRSSSS, is shown in Figure 7.6. The value 'RRRRSSSS' = '11110000' is defined as ZRL: 'RRRR' = '1111' represents a run length of 16 zeros and 'SSSS' = '0000' represents a zero amplitude. Therefore, ZRL is used to represent a run length of 16 zero coefficients followed by a zero-amplitude coefficient; it is not an abbreviation. In the case of a run length of zero coefficients that exceeds 15, multiple symbols will be used. A special value 'RRRRSSSS' = '00000000' is used to code the end-of-block (EOB). An EOB occurs when the remaining coefficients in the block are zero. The entries marked N/A are undefined.

The composite value, RRRRSSSS, is then Huffman coded. SSSS is actually the number that indicates the category in the Huffman code table. The coefficient values for each category are shown in Table 7.5.
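The pairing of zero runs with the category of the terminating nonzero coefficient can be sketched as below; the Python function is illustrative only, it assumes the coefficients are already in zigzag order, and it follows the ZRL and EOB conventions described above.

    def ac_runlength_symbols(zz):
        # Convert ZZ(1)..ZZ(63) into (RRRR, SSSS, amplitude) symbols:
        # (15, 0) is ZRL and (0, 0) is EOB.
        symbols = []
        run = 0
        for coeff in zz[1:]:                 # skip ZZ(0), the DC coefficient
            if coeff == 0:
                run += 1
                continue
            while run > 15:                  # a run longer than 15 zeros uses ZRL symbols
                symbols.append((15, 0, 0))
                run -= 16
            ssss = abs(coeff).bit_length()   # magnitude category (Table 7.5)
            symbols.append((run, ssss, coeff))
            run = 0
        if run:                              # only zeros remain: end-of-block
            symbols.append((0, 0, 0))
        return symbols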

FIGURE 7.5
Zigzag scanning order of DCT coefficients (starting from the DC coefficient).


FIGURE 7.6
Two-dimensional (2-D) value array of composite values for Huffman coding, indexed by RRRR (0 to 15) and SSSS (0, 1, 2, ..., 10); the entry with RRRR = 0 and SSSS = 0 is EOB, the entry with RRRR = 15 and SSSS = 0 is ZRL, and the remaining entries in the SSSS = 0 column are N/A.

Each Huffman code is followed by additional bits that specify the sign and exact amplitude of the coefficients. As with the DC code tables, the AC code tables have also been developed from the average statistics of a large set of images with 8-bit precision. Each composite value is represented by a Huffman code in the AC code table. The format for the additional bits is the same as in the coding of DC coefficients. The value of SSSS gives the number of additional bits required to specify the sign and precise amplitude of the coefficient. The additional bits are either the low-order SSSS bits of ZZ(k) when ZZ(k) is positive or the low-order SSSS bits of ZZ(k) − 1 when ZZ(k) is negative. Here, ZZ(k) is the kth coefficient in the zigzag scanning order of coefficients being coded. The Huffman tables for AC coefficients can be found in Annex K of the JPEG standard [jpeg 1992] and are not listed here due to space limitations.

As described above, Huffman coding is used as the means of entropy coding. However, an adaptive arithmetic coding procedure can also be used. As with the Huffman coding technique, the binary arithmetic coding technique is also lossless. It is possible to transcode between the two systems without either the FDCT or IDCT processes. Moreover, this transcoding is a lossless process; it does not affect the picture quality of the reconstructed image. The arithmetic encoder encodes a series of binary symbols, zeros or ones, where each symbol represents the possible result of a binary decision. The binary decisions include the choice between positive and negative signs, a magnitude being zero or nonzero, or a particular bit in a sequence of binary digits being zero or one. There are four steps in the

TABLE 7.5
Huffman Coding for AC Coefficients

Category (SSSS)  AC Coefficient Range
1                −1, 1
2                −3, −2, 2, 3
3                −7, ..., −4, 4, ..., 7
4                −15, ..., −8, 8, ..., 15
5                −31, ..., −16, 16, ..., 31
6                −63, ..., −32, 32, ..., 63
7                −127, ..., −64, 64, ..., 127
8                −255, ..., −128, 128, ..., 255
9                −511, ..., −256, 256, ..., 511
10               −1023, ..., −512, 512, ..., 1023
11               −2047, ..., −1024, 1024, ..., 2047


arithmetic coding: initializing the statistical area, initializing the encoder, terminating the code string, and adding restart markers.

7.3 Progressive DCT-Based Encoding Algorithm

In progressive DCT-based coding, the input image is first partitioned into blocks of 8 × 8 pixels. The 2-D 8 × 8 DCT is then applied to each block. The transformed DCT-coefficient data is then encoded with multiple scans. In each scan, a portion of the transformed DCT coefficient data is encoded. This partially encoded data can be reconstructed to obtain a full-size image of lower quality. The coded data of each additional scan will enhance the reconstructed image quality until the full quality has been achieved at the completion of all scans. Two methods have been used in the JPEG standard to perform the DCT-based progressive coding. These include spectral selection and successive approximation.

In the method of spectral selection, the transformed DCT coefficients are first reordered as a zigzag sequence and then divided into several bands. A frequency band is defined in the scan header by specifying the starting and ending indices in the zigzag sequence. The band containing the DC coefficient is encoded at the first scan. In the following scans, it is not necessary for the coding procedure to follow the zigzag ordering. In the method of successive approximation, the DCT coefficients are reduced in precision by the point transform. The point transform of the DCT coefficients is an arithmetic shift right by a specified number of bits, or division by a power of 2 (near zero, there is a slight difference in truncation of precision between the arithmetic shift and division by 2; see Annex K10 of [jpeg 1992]). This specified number is the successive approximation bit position. To encode using successive approximations, the significant bits of the DCT coefficients are encoded in the first scan, and each successive scan that follows progressively improves the precision of the coefficients by 1 bit. This continues until full precision is reached.
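As a one-line illustration (the function name and the use of Python's arithmetic right shift are our own assumptions), the point transform by Al bits can be written as:

    def point_transform(coefficient, al):
        # Arithmetic shift right by 'al' bits; for negative values this rounds
        # toward minus infinity, which is the slight difference from division
        # by 2**al (rounding toward zero) mentioned above.
        return coefficient >> al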

The principles of spectral selection and successive approximation are shown in Figure 7.7. For both methods, the quantized coefficients are coded with either Huffman or arithmetic codes at each scan. In spectral selection and the first scan of successive approximation for an image, the AC coefficient coding model is similar to that used in the sequential DCT-based coding mode. However, the Huffman code tables are extended to include coding of runs of end-of-bands (EOBs). For distinguishing the end-of-band and end-of-block, a number, n, which is used to indicate the range of run length, is added to the end-of-band (EOBn). The EOBn code sequence is defined as follows. Each EOBn is followed by an extension field, which has the minimum number of bits required to specify the run length. The EOBn run structure allows efficient coding of blocks which have only zero coefficients. For example, an EOB run of length 5 means that the current block and the next four blocks have an EOB with no intervening nonzero coefficients. The Huffman coding structure of the subsequent scans of successive approximation for a given image is similar to the coding structure of the first scan of that image. Each nonzero quantized coefficient is described by a composite 8-bit run length–magnitude value of the form RRRRSSSS. The four most significant bits, RRRR, indicate the number of zero coefficients between the current coefficient and the previously coded coefficient. The four least significant bits, SSSS, give the magnitude category of the nonzero coefficient. The run length–magnitude composite value is Huffman coded. Each Huffman code is followed by additional bits: 1 bit is used to code the sign of the nonzero coefficient and another bit is used to code the correction, where 0 means no correction and 1 means add one to the decoded magnitude of the coefficient.


FIGURE 7.7
Progressive coding with spectral selection and successive approximation.


FIGURE 7.8
Spatial relation between the pixel to be coded (x) and the three decoded neighbors (a to the left, b above, and c above-left).

Although the above technique has been described using Huffman coding, it should be noted that arithmetic encoding can also be used in its place.

7.4 Lossless Coding Mode

In the lossless coding mode, the coding method is spatial-based coding instead of DCT-based coding. However, the coding method is extended from the method for coding the DC coefficients in the sequential DCT-based coding mode. Each pixel is coded with a predictive coding method, where the predicted value is obtained from one of three one-dimensional (1-D) or one of four 2-D predictors, which are shown in Figure 7.8.

In Figure 7.8, the pixel to be coded is denoted by x, and the three causal neighbors are denoted by a, b, and c. The predictive value of x, Px, is obtained from the three neighbors a, b, and c in one of seven ways, as listed in Table 7.6.

In Table 7.6, the selection value 0 is only used for differential coding in the hierarchical coding mode. Selections 1, 2, and 3 are 1-D predictions and 4, 5, 6, and 7 are 2-D predictions. Each prediction is performed with full integer precision, and without clamping of either underflow or overflow beyond the input bounds. To achieve lossless coding, the prediction differences are coded with either Huffman coding or arithmetic coding. The prediction difference values can be from 0 to 2^16 for 8-bit pixels. The Huffman tables developed for coding DC coefficients in the sequential DCT-based coding mode are used with one additional entry to code the prediction differences.

TABLE 7.6
Predictors for Lossless Coding

Selection Value    Prediction
0                  No prediction (hierarchical mode)
1                  Px = a
2                  Px = b
3                  Px = c
4                  Px = a + b − c
5                  Px = a + [(b − c)/2]*
6                  Px = b + [(a − c)/2]*
7                  Px = (a + b)/2

Note: * represents the shift-right arithmetic operation.
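The seven predictors translate directly into code; the Python sketch below (our own naming) returns Px for a given selection value, with the shift-right operation marked by * in Table 7.6 written as an arithmetic right shift by one bit.

    def lossless_predictor(selection, a, b, c):
        # JPEG lossless predictors of Table 7.6, selection values 1..7.
        if selection == 1:
            return a
        if selection == 2:
            return b
        if selection == 3:
            return c
        if selection == 4:
            return a + b - c
        if selection == 5:
            return a + ((b - c) >> 1)      # * shift-right arithmetic operation
        if selection == 6:
            return b + ((a - c) >> 1)      # * shift-right arithmetic operation
        if selection == 7:
            return (a + b) // 2
        raise ValueError('selection 0 is reserved for the hierarchical mode')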


For arithmetic coding, the statistical model defined for the DC coefficients in the sequential DCT-based coding mode is generalized to a 2-D form in which differences are conditioned on the pixel to the left and the line above.

7.5 Hierarchical Coding Mode

The hierarchical coding mode provides a progressive coding similar to the progressive DCT-based coding mode, but it offers more functionality. This functionality addresses applications with multiresolution requirements. In the hierarchical coding mode, an input image frame is first decomposed into a sequence of frames, such as the pyramid shown in Figure 7.2. Each frame is obtained through a down-sampling process, i.e., low-pass filtering followed by subsampling. The first frame (the lowest resolution) is encoded as a nondifferential frame. The following frames are encoded as differential frames, where the differential is with respect to the earlier coded frame. Note that an up-sampled version that would be reconstructed in the decoder is used. The first frame can be encoded by the methods of sequential DCT-based coding, the spectral selection method of progressive coding, or lossless coding with either Huffman code or arithmetic code. However, within an image, the differential frames are either coded by the DCT-based coding method, the lossless coding method, or the DCT-based process with a final lossless coding. All frames within the image must use the same entropy coding, either Huffman or arithmetic, with the exception that a nondifferential frame coded with the baseline coding may occur in the same image with frames coded with arithmetic coding methods. The differential frames are coded with the same method used for the nondifferential frame except the final frame. The final differential frame for each image may use the differential lossless coding method. In the hierarchical coding mode, resolution changes of frames may occur. These resolution changes occur if down-sampling filters are used to reduce the spatial resolution of some or all frames of an image. When the resolution of a reference frame does not match the resolution of the frame to be coded, an up-sampling filter is used to increase the resolution of the reference frame. The block diagram of coding a differential frame is shown in Figure 7.9.

The up-sampling filter increases the spatial resolution by a factor of two in both the horizontal and vertical directions by using bilinear interpolation of two neighboring pixels. The up-sampling with bilinear interpolation is consistent with the down-sampling filter that is used for the generation of the down-sampled frames. It should be noted that the hierarchical coding mode allows one to improve the quality of the reconstructed frames at a given spatial resolution.
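As an illustration of this step, here is a minimal Python sketch of factor-of-two bilinear up-sampling along one dimension; the function name and the border treatment (replicating the last sample) are our own assumptions rather than details taken from the standard. Applying the same operation along the rows and then along the columns is one way to realize the 2-D up-sampling described above.

def upsample_by_two(line):
    """Bilinearly up-sample a 1-D list of samples by a factor of two.

    Even output positions copy the input samples; odd output positions
    are the average of the two neighboring input samples.
    """
    out = []
    for i, v in enumerate(line):
        out.append(v)
        nxt = line[i + 1] if i + 1 < len(line) else v  # replicate at the border
        out.append((v + nxt) / 2.0)
    return out

# Example: upsample_by_two([10, 20, 30]) -> [10, 15.0, 20, 25.0, 30, 30.0]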

FIGURE 7.9 Coding of a differential frame in hierarchical coding (block diagram: input frame, frame memory, up-sampling, encoding, coded differential frame).



7.6 Summary

In this chapter, the still-image coding standard, JPEG, has been introduced. The JPEG coding standard includes four coding modes: sequential DCT-based coding mode, progressive DCT-based coding mode, lossless coding mode, and hierarchical coding mode. The DCT-based coding method is probably the one most readers are familiar with; however, the lossless coding mode in JPEG, which uses a spatial-domain predictive coding process, has many interesting applications as well. For each coding mode, the entropy coding can be implemented with either Huffman coding or arithmetic coding. JPEG has been widely adopted for many applications.

Exercises

1. What is the difference between sequential coding and progressive coding in JPEG? Conduct a project to encode an image with sequential coding and progressive coding, respectively.

2. Use the JPEG lossless mode to code several images and explain why different bit rates are obtained.

3. Generate a Huffman code table using a set of images with 8-bit precision (approximately 2-3 images) using the method presented in Annex C of the JPEG specification. This set of images is called the training set. Use this table to code an image within the training set and an image that is not in the training set, and explain the results.

4. Design a three-layer progressive JPEG coder using (a) spectral selection and (b) successive approximation (0.3 bits/pixel at the first layer, 0.2 bits/pixel at the second layer, and 0.1 bits/pixel at the third layer).

References

[jpeg 1992] Digital compression and coding of continuous-tone still images: Requirements and Guidelines, ISO/IEC International Standard 10918-1, CCITT T.81, September 1992.

[pennelbaker 1992] W.B. Pennebaker and J.L. Mitchell, JPEG: Still Image Data Compression Standard, Van Nostrand Reinhold, New York, September 1992.

[symes 1998] P. Symes, Video Compression: Fundamental Compression Techniques and an Overview of the JPEG and MPEG Compression Systems, McGraw-Hill, New York, April 1998.



8 Wavelet Transform for Image Coding: JPEG2000

Since the mid-1980s, a number of signal processing applications have emerged using wavelet theory. Among those applications, the most widespread developments have occurred in the area of data compression. Wavelet techniques have demonstrated the ability to provide not only high coding efficiency but also spatial and quality scalability features. In this chapter, we focus on the utility of the wavelet transform for image data compression applications.

We first introduce wavelet transform theory by starting with the short-time Fourier transform (STFT). Then the discrete wavelet transform (DWT) is presented. Finally, the lifting scheme, known as the second-generation wavelet transform, is described.

We then discuss the basic concept of image wavelet transform coding with an emphasis on embedded image wavelet transform coding algorithms, which form the basis of JPEG2000.

Finally, JPEG2000, the newest still image coding standard, is described in this chapter with emphasis on its functionality and current status.

8.1 A Review of Wavelet Transform

8.1.1 Definition and Comparison with Short-Time Fourier Transform

The wavelet transform, as a specialized research field, started more than two decades ago [grossman 1984]. It is known that the wavelet transform is rather different from the Fourier transform. The former is suitable for studying the transient properties of signals, while the latter is not suitable for studying signals in the time–frequency space. The so-called STFT was designed to overcome this drawback of the Fourier transform. For a better understanding, we first give a very short review of the STFT because there are some similarities between the STFT and the wavelet transform. As we know, the STFT uses sinusoidal waves as its orthogonal basis and is defined as

F(\omega,\tau) = \int_{-\infty}^{+\infty} f(t)\, w(t-\tau)\, e^{-j\omega t}\, dt    (8.1)

where w(t) is a time-domain windowing function, the simplest of which is a rectangular window that has unit value over a time interval and is zero elsewhere. The value τ is the starting position of the window. Thus, the STFT maps a function f(t) onto a 2-D plane (ω, τ), where ω and τ stand for frequency and time moment, respectively. The STFT is also referred to as the Gabor transform [cohen 1989]. Similar to the STFT, the wavelet transform



also maps a time or spatial function into a 2-D function of a and τ, where a and τ denote dilation and translation in time, respectively. The wavelet transform is defined as follows. Let f(t) be any square integrable function, i.e., it satisfies

\int_{-\infty}^{+\infty} |f(t)|^2\, dt < \infty    (8.2)

The continuous-time wavelet transform of f(t) with respect to a wavelet ψ(t) is defined as

W(a,\tau) = \int_{-\infty}^{+\infty} f(t)\, |a|^{-1/2}\, \psi^{*}\!\left(\frac{t-\tau}{a}\right) dt    (8.3)

where a and τ are real variables and * denotes complex conjugation. The wavelet, denoted by ψ_{aτ}(t), is expressed as

\psi_{a\tau}(t) = |a|^{-1/2}\, \psi\!\left(\frac{t-\tau}{a}\right)    (8.4)

Equation 8.4 represents a set of functions that are generated from a single function, ψ(t), by dilations and translations. The variable τ represents the time shift and the variable a corresponds to the amount of time scaling or dilation. If a > 1, there is an expansion of ψ(t), while if 0 < a < 1, there is a contraction of ψ(t). For negative values of a, the wavelet experiences a time reversal in combination with a dilation. The function ψ(t) is referred to as the mother wavelet, and it must satisfy two conditions:

1. The function ψ(t) integrates to zero:

\int_{-\infty}^{+\infty} \psi(t)\, dt = 0    (8.5)

2. The function is square integrable, or has finite energy:

\int_{-\infty}^{+\infty} |\psi(t)|^2\, dt < \infty    (8.6)

The continuous-time wavelet transform can now be rewritten as

W(a,\tau) = \int_{-\infty}^{+\infty} f(t)\, \psi^{*}_{a\tau}(t)\, dt    (8.7)

In the following, we give two well-known examples of ψ(t) and their Fourier transforms. The first example is the Morlet (modulated Gaussian) wavelet [daubechies 1992],

\psi(t) = e^{-t^2/2}\, e^{j\omega_0 t}
\Psi(\omega) = (2\pi)^{1/2} \exp\!\left[-(\omega-\omega_0)^2/2\right]    (8.8)



and the second example is the Haar wavelet:

\psi(t) =
\begin{cases}
1 & 0 \le t \le 1/2 \\
-1 & 1/2 \le t \le 1 \\
0 & \text{otherwise}
\end{cases}

\Psi(\omega) = j\, e^{-j\omega/2}\, \frac{\sin^2(\omega/4)}{\omega/4}    (8.9)

From the above definitions and examples, we can find that the wavelets have zero DC value. This is clear from Equation 8.5. To have good time localization, the wavelets are usually bandpass signals and they decay rapidly toward zero with time. We can also find several other important properties of the wavelet transform and several differences between the STFT and the wavelet transform.

The STFT uses a sinusoidal wave as its basis function, which keeps the same frequency over the entire time interval. In contrast, the wavelet transform uses a particular wavelet as its basis function. Hence, wavelets vary in both position and frequency over the time interval. Examples of two basis functions for the sinusoidal wave and wavelet are shown in Figure 8.1a and b, respectively, where the vertical axes stand for magnitude and the horizontal axes for time.

The STFT uses a single analysis window. In contrast, the wavelet transform uses a short time window at high frequencies and a long time window at low frequencies. This is referred to as constant Q-factor filtering or relative constant bandwidth frequency analysis. A comparison of the constant bandwidth analysis of the STFT and the relative constant bandwidth wavelet transform is shown in Figure 8.2a and b, respectively.

This feature can be further explained based on the concept of a time–frequency plane (Figure 8.3). It is known from the Heisenberg inequality [rioul 1991] that the product of time resolution and frequency resolution is lower bounded as follows:

\Delta t \cdot \Delta f \ge \frac{1}{4\pi}

From the above expression, we see that these two resolutions cannot be arbitrarily small.

As shown in Figure 8.3, the window size of the STFT in the time domain is always chosen to be constant.

FIGURE 8.1 Wave versus wavelet: (a) two sinusoidal waves and (b) two wavelets, where vertical axes stand for magnitude and horizontal axes for time. (From Castleman, K.R., Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1996. With permission.)



FIGURE 8.2 (a) Constant bandwidth analysis (for the Fourier transform) and (b) relative constant bandwidth analysis (for the wavelet transform). (From Rioul, O. and Vetterli, M., IEEE Signal Process. Mag., 8, 14, 1991. With permission.)

The corresponding frequency bandwidth is also constant. In the wavelet transform, the window size in the time domain varies with the frequency. A longer time window is used for lower frequencies and a shorter time window is used for higher frequencies. This property is very important for image data compression. For image data, the concept of the time–frequency plane becomes the spatial–frequency plane. The spatial resolution of a digital image is measured with pixels, as described in Chapter 15. To overcome the limitations of discrete cosine transform (DCT)-based coding, the wavelet transform allows the spatial resolution and frequency bandwidth to vary in the spatial–frequency plane. With this variation, better bit allocation for active and smooth areas can be achieved.

The continuous-time wavelet transform can be considered as a correlation. For fixed a, it is clear from Equation 8.3 that W(a, τ) is the cross-correlation of the function f(t) with the related wavelet conjugate dilated to scale factor a at time lag τ. This is an important property of the wavelet transform for multiresolution analysis of image data. While the convolution can be seen as a filtering operation, the integral wavelet transform can be seen as a bank of linear filters acting upon f(t). This implies that the image data can be decomposed by a bank of filters defined by the wavelet transform.
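To make the correlation view concrete, the sketch below numerically approximates W(a, τ) of Equation 8.3 by correlating a sampled signal with dilated, translated copies of a mother wavelet. It is only an illustration of the filter-bank interpretation; the sampling grid, the real Morlet-like wavelet with ω0 = 5, and the function names are our own assumptions.

import numpy as np

def mother_wavelet(t, omega0=5.0):
    # Real part of a Morlet-like wavelet: Gaussian envelope times a cosine.
    return np.exp(-t**2 / 2.0) * np.cos(omega0 * t)

def cwt(f, t, scales):
    """Approximate W(a, tau) = integral of f(t) |a|^{-1/2} psi((t - tau)/a) dt
    for each scale a and every shift tau on the sampling grid.
    (The wavelet is real here, so no complex conjugation is needed.)"""
    dt = t[1] - t[0]
    out = np.empty((len(scales), len(t)))
    for i, a in enumerate(scales):
        for j, tau in enumerate(t):
            psi = np.abs(a)**-0.5 * mother_wavelet((t - tau) / a)
            out[i, j] = np.sum(f * psi) * dt   # one bandpass filter output per scale
    return out

# Example: analyze a two-tone test signal at a few dyadic scales.
t = np.linspace(-8, 8, 512)
f = np.sin(2 * np.pi * 1.0 * t) + 0.5 * np.sin(2 * np.pi * 4.0 * t)
coeffs = cwt(f, t, scales=[0.5, 1.0, 2.0, 4.0])

Each row of coeffs is the output of one bandpass filter, which is exactly the filter-bank picture of Figure 8.4.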

The continuous-time wavelet transform can be seen as an operator. First, it has the property of linearity. If we rewrite W(a, τ) as W_{aτ}[f(t)], then we have

W_{a\tau}[\alpha f(t) + \beta g(t)] = \alpha\, W_{a\tau}[f(t)] + \beta\, W_{a\tau}[g(t)]    (8.10)

where α and β are constant scalars.

FIGURE 8.3 Comparison of the short-time Fourier transform (STFT) and the wavelet transform in the time–frequency plane. (From Rioul, O. and Vetterli, M., IEEE Signal Process. Mag., 8, 14, 1991. With permission.)



Second, it has the property of translation:

W_{a\tau}[f(t-\lambda)] = W(a, \tau-\lambda)    (8.11)

where λ is a time lag. Finally, it has the property of scaling:

W_{a\tau}[f(t/\alpha)] = W(a/\alpha, \tau/\alpha)    (8.12)

8.1.2 Discrete Wavelet Transform

In the continuous-time wavelet transform, the function f(t) is transformed to a function W(a, τ) using the wavelet ψ(t) as a basis function. Recall that the two variables, a and τ, are the dilation and translation in time, respectively. Now let us find a means of obtaining the inverse transform, i.e., given W(a, τ), find f(t). If we know how to get the inverse transform, we can then represent any arbitrary function f(t) as a summation of wavelets, just as the Fourier transform and discrete cosine transform provide a set of coefficients for reconstructing the original function using sines and cosines as the basis functions. In fact, this is possible if the mother wavelet satisfies the admissibility condition:

C = \int_{-\infty}^{+\infty} |\Psi(\omega)|^2\, |\omega|^{-1}\, d\omega    (8.13)

where C is a finite constant and Ψ(ω) is the Fourier transform of the mother wavelet function ψ(t).

Then, the inverse wavelet transform is

f(t) = \frac{1}{C} \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} |a|^{-2}\, W(a,\tau)\, \psi_{a\tau}(t)\, da\, d\tau    (8.14)

The above results can be extended to 2-D signals. If f(x, y) is a 2-D function, its continuous-time wavelet transform is defined as

W(a, \tau_x, \tau_y) = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} f(x,y)\, \psi^{*}_{a\tau_x\tau_y}(x,y)\, dx\, dy    (8.15)

where τ_x and τ_y specify the translations of the transform in 2-D. The inverse 2-D continuous-time wavelet transform is then defined as

f(x,y) = \frac{1}{C} \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} |a|^{-3}\, W(a, \tau_x, \tau_y)\, \psi_{a\tau_x\tau_y}(x,y)\, da\, d\tau_x\, d\tau_y    (8.16)

where C is defined as in Equation 8.13 and ψ(x, y) is a 2-D wavelet, whose dilated and translated version ψ_{aτxτy}(x, y) is given in Equation 8.17.



FIGURE 8.4 The wavelet transform implemented with a bank of filters.

\psi_{a\tau_x\tau_y}(x,y) = |a|^{-1}\, \psi\!\left(\frac{x-\tau_x}{a}, \frac{y-\tau_y}{a}\right)    (8.17)

For image coding, the wavelet transform is used to decompose the image data into wavelets. As indicated in the third property of the wavelet transform, the wavelet transform can be viewed as the cross-correlation of the function f(t) and the wavelets ψ_{aτ}(t). Therefore, the wavelet transform is equivalent to finding the output of a bank of bandpass filters specified by the wavelets ψ_{aτ}(t), as shown in Figure 8.4. This process decomposes the input signal into several subbands. As each subband can be further partitioned, the filter bank implementation of the wavelet transform can be used for multiresolution analysis (MRA). Intuitively, when the analysis is viewed as a filter bank, the time resolution must increase with the central frequency of the analysis filters. This can be exactly obtained by the scaling property of the wavelet transform, where the center frequencies of the bandpass filters increase as the bandwidth widens. Again, the bandwidth becomes wider as the dilation parameter a reduces. It should be noted that such an MRA is consistent with the constant Q-factor property of the wavelet transform. Furthermore, the resolution limitation of the STFT does not exist in the wavelet transform because the time–frequency resolutions in the wavelet transform vary, as shown in Figure 8.2b.

For digital image compression, it is preferred to represent f(t) as a discrete superposition sum rather than an integral. With this move to the discrete space, the dilation parameter a in Equation 8.10 takes the values a = 2^k and the translation parameter τ takes the values τ = 2^k l, where both k and l are integers. From Equation 8.4, the discrete version of ψ_{aτ}(t) becomes

\psi_{kl}(t) = 2^{-k/2}\, \psi(2^{-k}t - l)    (8.18)

Its corresponding wavelet transform can be rewritten as

W(k,l) = \int_{-\infty}^{+\infty} f(t)\, \psi^{*}_{kl}(t)\, dt    (8.19)

and the inverse transform becomes

f(t) = \sum_{k=-\infty}^{+\infty} \sum_{l=-\infty}^{+\infty} d(k,l)\, 2^{-k/2}\, \psi(2^{-k}t - l)    (8.20)



The values of the wavelet transform at those a and τ are represented by d(k, l):

d(k,l) = W(k,l)/C    (8.21)

The d(k, l) coefficients are referred to as the DWT of the function f(t) [daubechies 1992; vetterli 1995]. Note that the discretization so far has been applied only to the parameters a and τ; the transform still operates on the continuous-time function f(t). If the discretization is further applied to the time domain by letting t = mT, where m is an integer and T is the sampling interval (without loss of generality, we assume T = 1), then the discrete-time wavelet transform is defined as

W_d(k,l) = \sum_{m=-\infty}^{+\infty} f(m)\, \psi^{*}_{kl}(m)    (8.22)

Of course, the sampling interval has to be chosen according to the Nyquist sampling theorem so that no information is lost in the process of sampling. The inverse discrete-time wavelet transform is then

f(m) = \sum_{k=-\infty}^{+\infty} \sum_{l=-\infty}^{+\infty} d(k,l)\, 2^{-k/2}\, \psi(2^{-k}m - l)    (8.23)

8.1.3 Lifting Scheme

An alternative implementation of the DWT has been proposed more recently, known as the lifting scheme [sweldens 1995; daubechies 1998]. The lifting scheme is very efficient to implement, much as the fast Fourier transform (FFT) is an efficient implementation of the Fourier transform. In this subsection, we first introduce how the lifting scheme works. Then, we comment on its features and merits and on its application in JPEG2000.

8.1.3.1 Three Steps in Forward Wavelet Transform

Similar to the discrete Fourier transform (DFT), the lifting scheme is conducted from one resolution level to the next lower resolution level iteratively. To simplify the presentation, we consider here only the case of a 1-D data sequence; the extension to 2-D data follows straightforwardly. One iteration of the lifting scheme is described as follows. Denote the 1-D data sequence by X_i, with X_i = {x_j}. After one iteration of the lifting scheme, the data sequence X_i becomes two data sequences, X_{i+1} and Y_{i+1}, where the former is the low-pass component and the latter is the high-pass component. One iteration of the lifting scheme consists of the following three steps: splitting, prediction, and update.

Splitting (often referred to as the lazy wavelet transform): The data sequence X_i is split into two parts, X_{i+1} and Y_{i+1}.

Prediction (often referred to as dual lifting): In this step, the data sequence Y_{i+1} is predicted from the data sequence X_{i+1}, and Y_{i+1} is then replaced by the prediction error:

Y_{i+1} \leftarrow Y_{i+1} - P(X_{i+1})

Update (often referred to as primary lifting): In this step, the data sequence X_{i+1} is updated with the data sequence Y_{i+1}, and X_{i+1} is replaced as follows:



X_{i+1} \leftarrow X_{i+1} + U(Y_{i+1})

After this iteration, the next iteration applies the same three steps to X_{i+1} to generate X_{i+2} and Y_{i+2}.

It is observed that the 1-D DWT via the lifting scheme is very simple and efficient if the prediction and update operators are simple. The example that follows shows that this is the case.

8.1.3.2 Inverse Transform

The inverse wavelet transform via the lifting scheme is exactly the reverse process of the forward transform above, in the sense that plus and minus are exchanged, splitting and merging are exchanged, and the order of the three steps is reversed. The corresponding three steps of the inverse transform are shown below.

Inverse update

X_{i+1} \leftarrow X_{i+1} - U(Y_{i+1})

Inverse prediction

Y_{i+1} \leftarrow Y_{i+1} + P(X_{i+1})

Merging: The data sequence X_i is formed by the union of X_{i+1} and Y_{i+1}.

8.1.3.3 Lifting Version of CDF (2,2)

In this section, a specific lifting scheme is introduced, which essentially implements the well-known CDF(2,2) wavelet transform, named after its three inventors: Cohen, Daubechies, and Feauveau.

Forward Transform:

Splitting: The data set X_i = {x_j} is split into two parts, the even data samples and the odd data samples. This is similar to the fast Fourier transform (FFT) technique. That is,

s_j \leftarrow x_{2j} \quad \text{and} \quad d_j \leftarrow x_{2j+1}

Note that we use the notations {s_j} and {d_j} to denote the even and odd data sequences of {x_j} in order to avoid confusion in the subscripts.

Prediction: Predict the odd data samples from the even data samples and replace the odd data samples by the prediction error as follows:

d_j \leftarrow d_j - \tfrac{1}{2}(s_j + s_{j+1})

Update: Update the even data samples with the odd data samples to preserve the mean of the data samples:

s_j \leftarrow s_j + \tfrac{1}{4}(d_{j-1} + d_j)



It is observed that both the prediction and the update are linear in this example. The inverse transform runs just opposite to the forward transform, as shown below.

Inverse Transformation:

Inverse update

s_j \leftarrow s_j - \tfrac{1}{4}(d_{j-1} + d_j)

Inverse prediction

d_j \leftarrow d_j + \tfrac{1}{2}(s_j + s_{j+1})

Merging: The data sequence X_i is formed by the union of {s_j} and {d_j}.

8.1.3.4 A Demonstration Example

We present a simple numerical example for illustration purposes. In this example, we consider only a four-point data sequence, i.e.,

X_i = {1, 2, 3, 4}

After splitting, we have

{s_j} = {2, 4} and {d_j} = {1, 3}

After prediction, we have

d_1 = d_1 - \tfrac{1}{2}(s_1 + s_2) = -2 \quad \text{and} \quad d_2 = d_2 - \tfrac{1}{2}(s_2 + s_3) = 1

Note that the value of s_3 is not available and is hence treated as s_3 = 0 for the sake of simplicity. For a discussion of boundary treatment, readers are referred to the literature, e.g., [uytterhoeven 1999].

After update, we have

s_1 = s_1 + \tfrac{1}{4}(d_0 + d_1) = 1.5 \quad \text{and} \quad s_2 = s_2 + \tfrac{1}{4}(d_1 + d_2) = 3.75

Similarly, d_0 is not available and is hence treated as d_0 = 0. Thus, after CDF(2,2) via the lifting scheme, we have X_{i+1} = {1.5, 3.75} and Y_{i+1} = {-2, 1}. The former is the low-pass frequency part of the data sequence X_i = {1, 2, 3, 4}, and the latter is the high-pass frequency part of X_i. For the inverse transform, after the inverse update, we have

s_1 = s_1 - \tfrac{1}{4}(d_0 + d_1) = 2 \quad \text{and} \quad s_2 = s_2 - \tfrac{1}{4}(d_1 + d_2) = 4

After inverse prediction, we have

d_1 = d_1 + \tfrac{1}{2}(s_1 + s_2) = 1 \quad \text{and} \quad d_2 = d_2 + \tfrac{1}{2}(s_2 + s_3) = 3

Hence, the inverse transform via the lifting scheme reproduces the original sequence X_i = {1, 2, 3, 4}.
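The numbers above are easy to reproduce in code. Below is a minimal Python sketch of one CDF(2,2) lifting iteration and its inverse, using the same zero-padding boundary treatment as the example; the function names are our own.

def cdf22_forward(x):
    """One CDF(2,2) lifting iteration on a 1-D sequence.

    Returns (s, d): the low-pass and high-pass parts.
    Samples beyond the boundary are treated as zero, as in the text.
    """
    s = x[1::2]          # even-position samples (s_j), e.g. {2, 4}
    d = x[0::2]          # odd-position samples (d_j), e.g. {1, 3}
    # Prediction (dual lifting): d_j <- d_j - (s_j + s_{j+1}) / 2
    d = [d[j] - 0.5 * (s[j] + (s[j + 1] if j + 1 < len(s) else 0))
         for j in range(len(d))]
    # Update (primary lifting): s_j <- s_j + (d_{j-1} + d_j) / 4
    s = [s[j] + 0.25 * ((d[j - 1] if j > 0 else 0) + d[j])
         for j in range(len(s))]
    return s, d

def cdf22_inverse(s, d):
    """Reverse the update, then the prediction, then merge."""
    s = [s[j] - 0.25 * ((d[j - 1] if j > 0 else 0) + d[j])
         for j in range(len(s))]
    d = [d[j] + 0.5 * (s[j] + (s[j + 1] if j + 1 < len(s) else 0))
         for j in range(len(d))]
    x = []
    for dj, sj in zip(d, s):
        x.extend([dj, sj])
    return x

s, d = cdf22_forward([1, 2, 3, 4])   # s = [1.5, 3.75], d = [-2.0, 1.0]
assert cdf22_inverse(s, d) == [1, 2, 3, 4]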



8.1.3.5 (5,3) Integer Wavelet Transform

An additional advantage of the lifting scheme has to do with the integer wavelet transform (IWT); that is, the IWT can map integers to integers. Image gray-scale values are integers. Hence, the IWT can be used for reversible transformation, which finds applications in lossless image compression. Furthermore, both the forward IWT and the inverse IWT can easily be conducted via the lifting scheme. In this subsection, we present the integer version of CDF(2,2), which is referred to as the (5,3) IWT. Note that the (5,3) IWT has been adopted in JPEG2000 [rabbani 2001; skodras 2001]. In addition, the IWT has also been used for reversible image data embedding [xuan 2002].

Forward (5,3) IWT

Splitting

s_j \leftarrow x_{2j} \quad \text{and} \quad d_j \leftarrow x_{2j+1}

Prediction

d_j \leftarrow d_j - \left\lfloor \tfrac{1}{2}(s_j + s_{j+1}) + \tfrac{1}{2} \right\rfloor

Update

s_j \leftarrow s_j + \left\lfloor \tfrac{1}{4}(d_{j-1} + d_j) + \tfrac{1}{2} \right\rfloor

where the notation ⌊z⌋ indicates the largest integer not larger than z.

Inverse (5,3) IWT

Inverse primary lifting

s_j \leftarrow s_j - \left\lfloor \tfrac{1}{4}(d_{j-1} + d_j) + \tfrac{1}{2} \right\rfloor

Inverse dual lifting

d_j \leftarrow d_j + \left\lfloor \tfrac{1}{2}(s_j + s_{j+1}) + \tfrac{1}{2} \right\rfloor

Merging

x_{2j} \leftarrow s_j \quad \text{and} \quad x_{2j+1} \leftarrow d_j

8.1.3.6 A Demonstration Example of (5,3) IWT

Here we work on the same 1-D data sequence as used in the example in Section 8.1.3.4. That is,

X_i = {1, 2, 3, 4}

After splitting, we have

{s_j} = {2, 4} and {d_j} = {1, 3}

After prediction, we have



d_1 = d_1 - \left\lfloor \tfrac{1}{2}(s_1 + s_2) + \tfrac{1}{2} \right\rfloor = -2 \quad \text{and} \quad d_2 = d_2 - \left\lfloor \tfrac{1}{2}(s_2 + s_3) + \tfrac{1}{2} \right\rfloor = 1

After update, we have

s_1 = s_1 + \left\lfloor \tfrac{1}{4}(d_0 + d_1) + \tfrac{1}{2} \right\rfloor = 2 \quad \text{and} \quad s_2 = s_2 + \left\lfloor \tfrac{1}{4}(d_1 + d_2) + \tfrac{1}{2} \right\rfloor = 4

Because the values of s_3 and d_0 are not available, we treat s_3 = 0 and d_0 = 0 for the sake of simplicity. For a discussion of boundary treatment, readers are referred to the literature, e.g., [uytterhoeven 1999]. We then see that after the (5,3) IWT with the lifting scheme we have X_{i+1} = {2, 4} and Y_{i+1} = {-2, 1}. The former is the low-pass frequency part of the data sequence X_i = {1, 2, 3, 4}, and the latter is the high-pass frequency part of X_i. It is observed that both parts are integers; that is, the IWT maps integers to integers.

Now we show that the inverse (5,3) IWT gives back exactly the original data sequence X_i = {1, 2, 3, 4}, as follows.

For the inverse transform, after the inverse primary lifting, we have

s_1 = s_1 - \left\lfloor \tfrac{1}{4}(d_0 + d_1) + \tfrac{1}{2} \right\rfloor = 2 \quad \text{and} \quad s_2 = s_2 - \left\lfloor \tfrac{1}{4}(d_1 + d_2) + \tfrac{1}{2} \right\rfloor = 4

After the inverse dual lifting, we have

d_1 = d_1 + \left\lfloor \tfrac{1}{2}(s_1 + s_2) + \tfrac{1}{2} \right\rfloor = 1 \quad \text{and} \quad d_2 = d_2 + \left\lfloor \tfrac{1}{2}(s_2 + s_3) + \tfrac{1}{2} \right\rfloor = 3

After merging, we have X_i = {1, 2, 3, 4}.
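The integer lifting steps above translate directly into integer arithmetic. The sketch below is a minimal Python version of one (5,3) IWT iteration and its inverse, again treating samples beyond the boundary as zero. Note that for integer u, ⌊u/2 + 1/2⌋ = (u + 1) // 2 and ⌊u/4 + 1/2⌋ = (u + 2) // 4, which is what the code uses; the function names are our own.

def iwt53_forward(x):
    """One (5,3) integer wavelet transform iteration (integers in, integers out)."""
    s = list(x[1::2])    # even-position samples, e.g. {2, 4}
    d = list(x[0::2])    # odd-position samples,  e.g. {1, 3}
    for j in range(len(d)):          # prediction (dual lifting)
        nxt = s[j + 1] if j + 1 < len(s) else 0
        d[j] -= (s[j] + nxt + 1) // 2        # floor((s_j + s_{j+1})/2 + 1/2)
    for j in range(len(s)):          # update (primary lifting)
        prev = d[j - 1] if j > 0 else 0
        s[j] += (prev + d[j] + 2) // 4       # floor((d_{j-1} + d_j)/4 + 1/2)
    return s, d

def iwt53_inverse(s, d):
    s, d = list(s), list(d)
    for j in range(len(s)):          # inverse primary lifting
        prev = d[j - 1] if j > 0 else 0
        s[j] -= (prev + d[j] + 2) // 4
    for j in range(len(d)):          # inverse dual lifting
        nxt = s[j + 1] if j + 1 < len(s) else 0
        d[j] += (s[j] + nxt + 1) // 2
    x = []
    for dj, sj in zip(d, s):
        x.extend([dj, sj])
    return x

s, d = iwt53_forward([1, 2, 3, 4])   # s = [2, 4], d = [-2, 1]
assert iwt53_inverse(s, d) == [1, 2, 3, 4]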

8.1.3.7 Summary

In this section, we summarize the merits of the lifting scheme. Because of these merits, the lifting scheme has been adopted by JPEG2000 [rabbani 2001]:

1. The lifting scheme provides a different way to view the wavelet transform. Most wavelet transform theory starts from Fourier transform theory; the lifting scheme, however, provides a way to view the wavelet transform without using the Fourier transform.

2. The lifting scheme is simple and hence efficient in implementation.

3. In addition, the lifting scheme can reduce the memory requirement significantly. It provides so-called in-place computation of wavelet coefficients; that is, it can overwrite the memory used to store the input data with the wavelet coefficients. This bears similarity to the fast Fourier transform.

4. The lifting scheme lends itself easily to integer wavelet transform computation.

8.2 Digital Wavelet Transform for Image Compression

8.2.1 Basic Concept of Image Wavelet Transform Coding

From the last section, we have learned that the wavelet transform has several features that are different from those of traditional transforms. It is noted from Figure 8.2 that each transform



FIGURE 8.5 Block diagram of image coding with the wavelet transform (input image, 2-D wavelet transform for image decomposition, quantization, coding of quantized coefficients, bitstream).

coefficient in the STFT represents a constant interval of time regardless of which band the coefficient belongs to, whereas for the wavelet transform, the coefficients at the coarse level represent a larger time interval but a narrower band of frequencies. This feature of the wavelet transform is very important for image coding. In traditional image transform coding, which makes use of the Fourier transform or the discrete cosine transform, one difficult problem is to choose the block size or window width so that statistics computed within that block provide good models of the image signal behavior. The choice of the block size has to be a compromise so that it can handle both active and smooth areas. In the active areas, the image data is more localized in the spatial domain, while in the smooth areas, the image data is more localized in the frequency domain. With traditional transform coding, it is very hard to reach a good compromise. The main contribution of wavelet transform theory is that it provides an elegant framework in which both statistical behaviors of image data can be analyzed with equal importance. This is because wavelets can provide a signal representation in which some of the coefficients represent long data lags corresponding to a narrow band or low frequency range, and some of the coefficients represent short data lags corresponding to a wide band or high frequency range. Therefore, it is possible to obtain a good trade-off between the spatial and frequency domains with the wavelet representation of image data.

To use the wavelet transform for image coding applications, an encoding process is needed, which includes three major steps: image data decomposition, quantization of the transformed coefficients, and coding of the quantized transformed coefficients. A simplified block diagram of this process is shown in Figure 8.5. The image decomposition is usually a lossless process, which converts the image data from the spatial domain to the frequency domain, where the transformed coefficients are decorrelated. The information loss happens in the quantization step and the compression is achieved in the coding step. To begin the decomposition, the image data is first partitioned into four subbands labeled LL1, HL1, LH1, and HH1, as shown in Figure 8.6a. Each coefficient represents a spatial area corresponding to one-quarter of the original image size. The low frequencies represent a bandwidth corresponding to 0 < |ω| < π/2, while the high frequencies represent the band π/2 < |ω| < π. To obtain the next level of decomposition, the LL1 subband is further decomposed into the next level of four subbands, as shown in Figure 8.6b. The low frequencies of the second-level decomposition correspond to 0 < |ω| < π/4, while the high frequencies at the second level correspond to π/4 < |ω| < π/2. This decomposition can be

FIGURE 8.6 A 2-D wavelet transform: (a) first-level decomposition and (b) second-level decomposition (L denotes low band, H denotes high band, and the subscript denotes the level number; for example, LL1 denotes the low-low band at level 1).



continued to as many levels as needed. The filters used to compute the DWT are generally the symmetric quadrature mirror filters (QMFs), as described in [woods 1986]. A QMF-pyramid subband decomposition is illustrated in Figure 8.6b.
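A single level of this decomposition can be sketched by applying a 1-D lifting transform along the rows and then along the columns. The sketch below reuses the cdf22_forward function defined after the demonstration example in Section 8.1.3.4; the separable row/column ordering, the subband labeling, and the function name dwt2d_one_level are our own illustrative assumptions, not the definition used by any particular standard.

def dwt2d_one_level(image):
    """One level of a separable 2-D wavelet decomposition.

    'image' is a list of rows (each a list of samples) with even dimensions.
    Rows are transformed first, then columns, yielding four subbands labeled
    as in Figure 8.6a (first letter: horizontal band, second: vertical band).
    """
    # Transform each row into (low, high) halves.
    rows = [cdf22_forward(row) for row in image]
    low_half = [r[0] for r in rows]    # horizontally low-pass half-image
    high_half = [r[1] for r in rows]   # horizontally high-pass half-image

    def column_transform(half):
        # Transform each column of a half-image; collect low/high parts.
        cols = [cdf22_forward([row[c] for row in half])
                for c in range(len(half[0]))]
        low = [[col[0][r] for col in cols] for r in range(len(cols[0][0]))]
        high = [[col[1][r] for col in cols] for r in range(len(cols[0][1]))]
        return low, high

    LL1, LH1 = column_transform(low_half)
    HL1, HH1 = column_transform(high_half)
    return LL1, HL1, LH1, HH1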

During quantization, each subband is quantized differently depending on its importance, which is usually based on its energy or variance [jayant 1984]. To reach the predetermined bit rate or compression ratio, coarse quantizers or large quantization steps are used to quantize the low-energy subbands, while finer quantizers or small quantization steps are used to quantize the high-energy subbands. This results in fewer bits allocated to the low-energy subbands and more bits for the high-energy subbands.
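As a small illustration of this step, the sketch below applies a uniform quantizer to each subband with a step size that grows as the subband energy shrinks; the specific step-size rule (base_step scaled by the square root of the energy ratio) is our own assumption for illustration, not a rule from any standard.

def subband_energy(band):
    """Mean squared coefficient value of a subband given as a list of rows."""
    return sum(c * c for row in band for c in row) / sum(len(row) for row in band)

def quantize_subbands(bands, base_step=1.0):
    """Uniformly quantize each subband; low-energy bands get larger steps."""
    energies = [subband_energy(b) for b in bands]
    max_energy = max(energies)
    out = []
    for band, e in zip(bands, energies):
        # Larger step (coarser quantizer) for lower-energy subbands.
        step = base_step * (max_energy / e) ** 0.5 if e > 0 else float("inf")
        out.append([[round(c / step) for c in row] for row in band])
    return out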

8.2.2 Embedded Image Wavelet Transform Coding Algorithms

In this section, the wavelet transform based image coding methods, which form the basic framework of JPEG2000, are discussed. We first comment on the drawbacks of early wavelet transform based image coding methods. Then, we introduce the concept behind modern wavelet transform based image coding methods, i.e., embedded image wavelet transform coding algorithms, in particular, embedded image coding using zerotrees of wavelet coefficients (EZW).

8.2.2.1 Early Wavelet Image Coding Algorithms and Their Drawbacks

As with other transform coding schemes, most wavelet coefficients in the high-frequency bands have very low energy. After quantization, many of these high-frequency wavelet coefficients are quantized to zero. Based on the statistical properties of the quantized wavelet coefficients, Huffman coding tables can be designed. Generally, most of the energy in an image is contained in the low-frequency bands. The data structure of the wavelet transformed coefficients is suitable for exploiting this statistical property. Consider a multilevel decomposition of an image with the DWT, where the lowest levels of decomposition correspond to the highest frequency subbands and the finest spatial resolution, and the highest level of decomposition corresponds to the lowest frequency subband and the coarsest spatial resolution. Arranging the subbands from lowest to highest frequency, we expect a decrease in energy. Also, we expect that if the wavelet transformed coefficients at a particular level have low energy, then the coefficients at the lower levels or higher frequency subbands that correspond to the same spatial location will have smaller energy. Another feature of the wavelet coefficient data structure is spatial self-similarity across subbands, which is shown in Figure 8.7. The early wavelet image coding methods [vetterli 1984; woods 1986; antonini 1992] utilizing the above features of the wavelet transform are referred to as early or conventional wavelet image coding methods.

The drawbacks of these conventional wavelet image coding methods are the following [usevitch 2001]:

• The quantizer can be optimal only at coding rates larger than 1 bpp (bits per pixel).

• The optimal bit allocation changes as the overall bit rate changes, requiring the coding process to be repeated entirely for each new target bit rate desired.

• It is difficult to code an input to give an exact target bit rate (or predefined output size).

8.2.2.2 Modern Wavelet Image Coding

Embedded image coding is the basic concept of modern wavelet image coding algorithms. By embedded code, we mean that the code arranges its bits in the order of their



FIGURE 8.7 (a) Parent–children dependencies of subbands; the arrow points from the subband of the parents to the subband of the children. The top left is the lowest frequency band. (b) The scanning order of the subbands for encoding a significance map. (From Shapiro, J., IEEE Trans. Signal Process., 41, 3445, 1993. With permission.)

importance. In other words, the bits corresponding to a lower bit rate coding are arranged in the beginning portion of the embedded code. Because the embedded code has all lower-rate codes arranged at the beginning portion of the bit stream, the embedded code can be truncated to fit a targeted bit rate by simply truncating the bit stream accordingly. Embedded wavelet image coding began with the embedded zerotree wavelet (EZW) algorithm by Shapiro in the early 1990s [shapiro 1993]. This revolutionary breakthrough marks the beginning of the modern wavelet image coding age. The typical algorithms include the EZW, set partitioning in hierarchical trees (SPIHT) by Said and Pearlman [said 1996], and embedded block coding with optimized truncation of the embedded bit streams (EBCOT) [taubman 2000]. Consequently, modern wavelet coding has been adopted by JPEG2000 for still image coding. A given image can be coded once, and the resulting bitstream can then be easily and optimally truncated according to any given bit rate.

Several algorithms have been developed to exploit this and the above-mentioned properties for image coding. Among them, one of the first was proposed by Shapiro [shapiro 1993] and used an embedded image coding technique based on zerotrees of wavelet coefficients, referred to as EZW. Another algorithm is the so-called SPIHT developed by Said and Pearlman [said 1996]. This algorithm also produces an embedded bitstream. The advantage of the embedded coding schemes is that they allow an encoding process to terminate at any point so that a target bit rate or distortion metric can be met exactly. Intuitively, for a given bit rate or distortion requirement, a nonembedded code should be more efficient than an embedded code, since it has no constraints imposed by embedding requirements. However, embedded wavelet transform coding algorithms are currently the best; the additional constraints do not seem to have a deleterious effect. In the following, we introduce the two embedded coding algorithms, zerotree coding and set partitioning in hierarchical trees coding, with an emphasis on the former.

8.2.2.3 Embedded Zerotree Wavelet Coding

As with DCT-based coding, an important aspect of wavelet-based coding is to code the positions of those coefficients that will be transmitted as nonzero values. After quantization, the probability of the zero symbol must be extremely high for the very low bit rate case. It is well known that most of the energy of an image is contained in the low-low (LL) subband at the highest scale level and that the distribution of the wavelet coefficients of



the high-frequency subbands follows a generalized Laplacian density. That is, it has a high peak around zero and long tails toward both sides, which means a large number of coefficients have zero or small magnitude while a small number of coefficients have large magnitudes. The statistics just mentioned imply that a large portion of the bit budget will be spent on encoding the significance map, i.e., the binary decision map that indicates whether a transformed coefficient has a zero- or nonzero-quantized value. Therefore, the ability to efficiently encode the significance map becomes a key issue for coding images at very low bit rates. A new data structure, the zerotree, has been proposed for this purpose [shapiro 1993]. To describe the zerotree, we must first define insignificance. A wavelet coefficient is insignificant with respect to a given threshold value if the absolute value of this coefficient is smaller than this threshold. From the nature of the wavelet transform we can assume that every wavelet transform coefficient at a given scale is strongly related to a set of coefficients at the next finer scale of similar orientation. More specifically, we can further assume that if a wavelet coefficient at a coarse scale is insignificant with respect to the preset threshold, then all wavelet coefficients at finer scales are likely to be insignificant with respect to this threshold. Therefore, we can build a tree with these parent–child relationships, such that coefficients at a coarse scale are called parents, and all coefficients corresponding to the same spatial location at the next finer scale of similar orientation are called children. Furthermore, for a parent, the set of all coefficients at all finer scales of similar orientation corresponding to the same spatial location are called descendants. For a QMF-pyramid decomposition, the parent–children dependencies are shown in Figure 8.7a. For a multiscale wavelet transform, the scan of the coefficients begins at the lowest frequency subband and then follows the order LL, HL, LH, and HH from the coarser scale to the next finer scale, as shown in Figure 8.7b.

The zerotree is defined such that if a coefficient itself and all of its descendants are insignificant with respect to a threshold, then this coefficient is considered an element of a zerotree. An element of a zerotree is considered a zerotree root if this element is not the descendant of a previous zerotree root with respect to the same threshold value. The significance map can then be efficiently represented by a string with three symbols: zerotree root, isolated zero, and significant. The isolated zero means that the coefficient is insignificant but has some significant descendant. At the finest scale, only two symbols are needed because all coefficients have no children; thus the symbol for zerotree root is not used. The symbol string is then entropy encoded. Zerotree coding efficiently reduces the cost of encoding the significance map by using the self-similarity of the coefficients at different scales. Additionally, it is different from the traditional run-length coding (RLC) used in DCT-based coding schemes. Each symbol in a zerotree is a single terminating symbol that applies to all depths of the zerotree, similar to the end-of-block (EOB) symbol in the JPEG and MPEG video coding standards. The difference between the zerotree and the EOB is that the zerotree represents the insignificance information at a given orientation across different scale layers. Therefore, the zerotree can efficiently exploit the self-similarity of the coefficients at different scales corresponding to the same spatial location. The EOB only represents the insignificance information over the spatial area at the same scale. In summary, the zerotree coding scheme tries to reduce the number of bits needed to encode the significance map, which is used to encode the insignificant coefficients. Therefore, more bits can be allocated to encode the important significant coefficients. It should be emphasized that this zerotree coding scheme of wavelet coefficients is an embedded coder, which means that an encoder can terminate the encoding at any point according to a given target bit rate or target distortion metric. Similarly, a decoder that receives this embedded stream can terminate at any point to reconstruct an image that has been scaled in quality.
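To make the symbol definitions concrete, the sketch below classifies a single coefficient with respect to a threshold, given a map from each node to its children in the subband pyramid. The tree representation (plain dictionaries) and the function names are our own illustrative assumptions, not the data structures of [shapiro 1993]; in particular, whether an element satisfying the zerotree condition is actually coded as a root depends on the scan order, which this sketch does not model.

def is_significant(coeff, threshold):
    return abs(coeff) >= threshold

def descendants(node, children):
    """All descendants of 'node' in a parent-to-children map {node: [children...]}."""
    out = []
    for child in children.get(node, []):
        out.append(child)
        out.extend(descendants(child, children))
    return out

def classify(node, coeff, children, threshold):
    """Return the symbol for one coefficient at one threshold.

    coeff is a map {node: coefficient value}.
    """
    if is_significant(coeff[node], threshold):
        return "significant"
    if all(not is_significant(coeff[d], threshold)
           for d in descendants(node, children)):
        return "zerotree root"   # itself and all descendants insignificant
    return "isolated zero"       # insignificant, but a significant descendant exists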

In summary, the statistics of the wavelet transform coefficients indicate that how to code the coefficients' magnitudes and positions is a key issue in wavelet image coding. The EZW



coding proposes to code the positions of the zero coefficients using the wavelet transform's self-similarity characteristics instead of coding the positions of significant coefficients directly. This is referred to as significance map coding using zerotrees [usevitch 2001]. In addition to this key point, the EZW developed a successive approximation quantization, which generates a large number of zero coefficients and leads to embedded coding [usevitch 2001]. In both [shapiro 1993] and [usevitch 2001], there is an example of an 8 x 8 image with a three-level wavelet transform. In these examples, the EZW coding scheme is implemented step by step. Readers are encouraged to go through the examples to get first-hand experience with the EZW algorithm, to see how the above-mentioned two key techniques enhance the coding efficiency, and to see how embedded coding is realized. A problem similar to these two examples is provided in the Exercises.

8.2.2.4 Set Partitioning in Hierarchical Trees Coding

Another embedded wavelet coding method is the SPIHT-based algorithm [said 1996]. This algorithm includes two major core techniques: the set partitioning sorting algorithm and the spatial orientation tree. The set partitioning sorting algorithm hierarchically divides coefficients into significant and insignificant, from the most significant bit to the least significant bit, by decreasing the threshold value at each hierarchical step for constructing a significance map. At each threshold value, the coding process consists of two passes, the sorting pass and the refinement pass, except for the first threshold, which has only the sorting pass. Let c(i, j) represent the wavelet transformed coefficients and let m be an integer. The sorting pass involves selecting the coefficients such that 2^m <= |c(i, j)| < 2^{m+1}, with m being decreased at each pass. This process divides the coefficients into subsets and then tests each of these subsets for significant coefficients. The significance map constructed in the procedure is tree encoded. The significance information is stored in three ordered lists: the list of insignificant pixels (LIP), the list of significant pixels (LSP), and the list of insignificant sets (LIS). At the end of each sorting pass, the LSP contains the coordinates of all significant coefficients with respect to the threshold at that step. The entries in the LIS can be one of two types: type A represents all of an entry's descendants, and type B represents all of its descendants from its grandchildren onward. The refinement pass involves transmitting the mth most significant bit of all the coefficients with respect to the threshold 2^{m+1}.

The idea of a spatial orientation tree is based on the following observation. Normally, among the transformed coefficients, most of the energy is concentrated in the low frequencies. For the wavelet transform, when we move from the highest to the lowest levels of the subband pyramid, the energy usually decreases. It is also observed that there exists strong

FIGURE 8.8 Relationship between pixels in the spatial orientation tree.



spatial self-similarity between subbands at the same spatial location, as in the zerotree case. Therefore, a spatial orientation tree structure has been proposed for the SPIHT algorithm. The spatial orientation tree naturally defines the spatial relationship on the hierarchical pyramid, as shown in Figure 8.8.

During coding, the wavelet transformed coefficients are first organized into spatial orientation trees, as in Figure 8.8. In the spatial orientation tree, each pixel (i, j) from the former set of subbands is seen as a root for the pixels (2i, 2j), (2i + 1, 2j), (2i, 2j + 1), and (2i + 1, 2j + 1) in the subbands of the current scale. For a given n-level decomposition, this structure is used to link pixels of the adjacent subbands from level n down to level 1. At the highest level n, the pixels in the low-pass subband are linked to the pixels in the three high-pass subbands at the same level. In the subsequent levels, all the pixels of a subband are involved in the tree-forming process. Each pixel is linked to the pixels of the adjacent subband at the next lower level. The tree stops at the lowest level.

The implementation of the SPIHT algorithm consists of four steps: initialization, sorting pass, refinement pass, and quantization scale update. In the initialization step, we find an integer m = ⌊log2(max_(i,j){|c(i, j)|})⌋. Here ⌊·⌋ represents the operation of obtaining the largest integer not exceeding the argument. The value of m is used for testing the significance of coefficients and constructing the significance map. The LSP is set as an empty list. The LIS is initialized to contain all the coefficients in the low-pass subbands that have descendants. These coefficients can be used as roots of spatial trees. All these coefficients are assigned to be of type A. The LIP is initialized to contain all the coefficients in the low-pass subbands.
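For instance, the initial threshold exponent from the initialization step and the bit-plane membership test used in the sorting pass can be written as below; this is only an illustrative fragment, and the function names are our own.

import math

def initial_bitplane(coeffs):
    """m = floor(log2(max |c(i,j)|)), the most significant bit-plane."""
    return int(math.floor(math.log2(max(abs(c) for c in coeffs))))

def becomes_significant_at(c, m):
    """True when 2^m <= |c| < 2^(m+1), i.e., c first becomes significant at pass m."""
    return 2 ** m <= abs(c) < 2 ** (m + 1)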

In the sorting pass, each entry of the LIP is tested for significance with respect to the threshold value 2^m. The significance map is transmitted in the following way. If an entry is significant, a "1" is transmitted, a sign bit of the coefficient is transmitted, and the coefficient coordinates are moved to the LSP. Otherwise, a "0" is transmitted. Then, each entry of the LIS is tested for significant descendants. If there are none, a "0" is transmitted. If the entry has at least one significant descendant, then a "1" is transmitted and each of the immediate descendants is tested for significance. The significance map for the immediate descendants is transmitted in such a way that if a descendant is significant, a "1" plus a sign bit are transmitted and the coefficient coordinates are appended to the LSP; if it is not significant, a "0" is transmitted and the coefficient coordinates are appended to the LIP. If the coefficient has more descendants, then it is moved to the end of the LIS as an entry of type B. If an entry in the LIS is of type B, then its descendants are tested for significance. If at least one of them is significant, then this entry is removed from the list, and its immediate descendants are appended to the end of the list as entries of type A. In the refinement pass, the mth most significant bit of the magnitude of each entry of the LSP is transmitted, except for those entries added in the current sorting pass. In the quantization scale update step, m is decreased by 1 and the procedure is repeated from the sorting pass.

8.3 Wavelet Transform for JPEG2000

8.3.1 Introduction of JPEG2000

Most image coding standards exploited the DCT as their core technology for image decomposition for a long time. However, this changed later. The wavelet transform has been adopted by MPEG-4 for still image coding [mpeg4]. Also, JPEG2000 has used the wavelet transform as its core technology for the next generation of the still image coding standard [jpeg2000 vm]. This is because the wavelet transform can provide not only excellent coding efficiency but also good spatial and quality scalability functionality.



JPEG2000 is a new type of image compression system developed by the Joint Photographic Experts Group for still image coding. This standard is intended to meet a need for image compression with great flexibility and efficient interchangeability. JPEG2000 is also intended to offer unprecedented access into the image while still in the compressed domain. Thus, images can be accessed, manipulated, edited, transmitted, and stored in compressed form.

8.3.1.1 Requirements of JPEG2000

As a new coding standard, the detailed requirements of JPEG2000 include the following.

Low bit rate compression performance: JPEG2000 is required to offer excellent coding

performance at bit rates lower than 0.25 bits/pixel for highly detailed gray level images, as the current JPEG (10918-1) cannot provide satisfactory results in this range of bit rates. This is the primary feature of JPEG2000.

Lossless and lossy compression: it is desired to provide lossless compression naturally in the course of progressive decoding. This feature is especially important for medical image coding, where loss is not always allowed. Also, other applications such as high-quality image archival systems and network applications desire the functionality of lossless reconstruction.

Large images: currently, the JPEG image compression algorithm does not allow for images greater than 64K by 64K without tiling.

Single decomposition architecture: the current JPEG standard has 44 modes; many of these modes are for specific applications and are not used by the majority of JPEG decoders. It is desired to have a single decomposition architecture that can encompass interchange between applications.

Transmission in noisy environments: it is desirable to consider error robustness while designing the coding algorithm. This is important for wireless communication applications. The current JPEG has provision for restart intervals, but image quality suffers dramatically when bit errors are encountered.

Computer-generated imagery: the current JPEG is optimized for natural imagery and does not perform well on computer-generated imagery or computer graphics.

Compound documents: the new coding standard is desired to be capable of compressing both continuous-tone and bi-level images. The coding scheme should be able to compress and decompress images with from 1 to 16 bits for each color component. The current JPEG standard does not work well for bi-level images.

Progressive transmission by pixel accuracy and resolution: progressive transmission that allows images to be transmitted with increasing pixel accuracy or spatial resolution is important for many applications. The image can be reconstructed with different resolutions and pixel accuracies as needed for different target devices, such as in World Wide Web applications and image archiving.

Real-time encoding and decoding: for real-time applications, the coding scheme should be capable of compressing and decompressing with a single sequential pass. Of course, optimal performance cannot be guaranteed in this case.

Fixed-rate, fixed-size, and limited workspace memory: the requirement of a fixed bit rate allows the decoder to run in real time through channels with limited bandwidth. The limited memory space is required by hardware implementations of the decoder.

There are also some other requirements, such as backwards compatibility with JPEG, an open architecture for optimizing the system for different image types and applications, an interface with MPEG-4, and so on. All these requirements were seriously considered during the development of JPEG2000. There is no doubt that the basic requirement of good coding performance at very low bit rates for still image coding has been achieved by using



the wavelet-based coding as the core technology instead of the DCT-based coding used in JPEG.

8.3.1.2 Parts of JPEG2000

JPEG2000 consists of several parts. Some parts have become International Standards (ISs), while some are still in the development stage. We first present in this subsection some completed parts, then introduce some parts that are still in their evolution stages.

Part 1 of JPEG2000 is entitled JPEG2000 Image Coding System: Core Coding System. This is the counterpart of the JPEG baseline system and was issued as an IS in December 2000. It is royalty free.

Part 2 is entitled JPEG2000 Image Coding System: Extensions. It involves more advanced technologies and higher computational complexity, and it provides enhanced performance compared with Part 1.

Part 3 is Motion JPEG2000 (MJP2). It encodes motion pictures frame by frame using JPEG2000 technologies. As a result, it offers random access to any frame in the motion picture and is much less complicated than MPEG coding schemes. It has been adopted by Hollywood as a format for digital cinema. It is also used in digital cameras.

Part 4 is Conformance Testing.

Part 5 is about Reference Software for Part 1. Two available implementations are as

follows [rabbani 2001]: one is a C implementation by Image Power and the University of British Columbia [adams 2000], and the other is a Java implementation by the JJ2000 group [JJ2000].

Part 6 is the Compound Image File Format for document scanning and fax applications [rabbani 2001].

While the above-mentioned six parts became ISs before 2004, there are some parts that are still in the working stage. For instance, Part 8, Secure JPEG2000, abbreviated JPSEC, had its IS published in April 2007. Part 11 is about Wireless JPEG2000 (JPWL) and is close to the completion of its IS. Work on JPEG2000 is still going on, with some new issues and new parts at this writing.

8.3.2 Verification Model of JPEG2000

As in other standards such as MPEG-2 and MPEG-4, the verification model (VM) plays an important role during the development of a standard. This is because the VM or TM (test model for MPEG-2) is a platform for verifying and testing new techniques before they are adopted into the standard. The VM is updated by completing a set of core experiments from one meeting to the next. Experience has shown that the decoding part of the final version of the VM is very close to the final standard. Therefore, to give an overview of the wavelet transform related parts of JPEG2000, we start by introducing the latest version of the JPEG2000 VM [jpeg2000 vm]. The VM of JPEG2000 describes the encoding process, the decoding process, and the bitstream syntax, which eventually completely define the functionality of the existing JPEG2000 compression system.

The latest version of the JPEG2000 verification model at that time, VM 4.0, was revised on April 22, 1999. In this VM, final convergence had not yet been reached, but several candidates had been introduced. These techniques include a DCT-based coding mode, which is essentially baseline JPEG, and a wavelet-based coding mode. In the wavelet-based coding mode, several algorithms were proposed: overlapped spatial segmented wavelet transform (SSWT), nonoverlapped SSWT, and embedded block coding with optimized truncation (EBCOT). Among these techniques, by consensus, EBCOT was included in the final JPEG2000 standard.


FIGURE 8.9 Example of subblock partitioning for a block of 64 × 64: the block B_i is divided into sixteen 16 × 16 subblocks B_i^1, B_i^2, ..., B_i^16.

The basic idea of EBCOT is the combination of block coding with the wavelet transform. First the image is decomposed into subbands using the wavelet transform. The wavelet transform is not restricted to any particular decomposition. However, the Mallat wavelet provides the best compression performance on average for natural images; therefore, the current bitstream syntax is restricted to the standard Mallat wavelet transform in VM 4.0. After decomposition, each subband is divided into 64 × 64 blocks, except at image boundaries where some blocks may have smaller sizes. Every block is then coded independently. For each block, a separate bitstream is generated without utilizing any information from other blocks. The key techniques used for coding include an embedded quad-tree algorithm and fractional bit-plane coding.

The idea of the embedded quad-tree algorithm is that it uses a single bit to represent whether or not each leading bit-plane contains any significant samples. The quad-tree is formed in the following way. The subband is partitioned into basic blocks; the basic block size is 64 × 64. Each basic block is further partitioned into 16 × 16 subblocks, as shown in Figure 8.9. Let s_j(B_i^k) denote the significance of subblock B_i^k (the kth subblock, as shown in Figure 8.9) in the jth bit-plane of the ith block. If one or more samples in the subblock have a magnitude greater than 2^j, then s_j(B_i^k) = 1; otherwise, s_j(B_i^k) = 0. For each bit-plane, the information concerning the significant subblocks is encoded first. All other subblocks can then be bypassed in the remaining coding procedure for that bit-plane. To specify the exact coding sequence, we define a two-level quad-tree for the block size of 64 × 64 and subblock size of 16 × 16. The level-1 quads, Q_i^1[k], consist of four subblocks, B_i^1, B_i^2, B_i^3, and B_i^4 from Figure 8.9. In the same way, we define level-2 quads, Q_i^2[k], to be 2 × 2 groupings of level-1 quads. Let s_j(Q_i^1[k]) denote the significance of the level-1 quad Q_i^1[k] in the jth bit-plane. If at least one member subblock is significant in the jth bit-plane, then s_j(Q_i^1[k]) = 1; otherwise, s_j(Q_i^1[k]) = 0. At each bit-plane, the quad-tree coder visits the level-2 quad first, followed by the level-1 quads. When visiting a particular quad Q_i^L[k] (L = 1 or 2 is the level number), the coder sends the significance of each of the four child quads, s_j(Q_i^L[k]), or subblocks, s_j(B_i^k), as appropriate, except when the significance value can be deduced by the decoder. The significance may be deduced by the decoder in the following three cases: (1) the relevant quad or subblock was significant in the previous bit-plane; (2) the entire subblock is insignificant; or (3) this is the last child or subblock visited in Q_i^L[k] and all earlier quads or subblocks are insignificant.
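To make the bookkeeping concrete, the following Python sketch computes the significance flags s_j(B_i^k) for the sixteen 16 × 16 subblocks of a 64 × 64 block and the flags s_j(Q_i^1[k]) for the level-1 quads. It is illustrative only; the deduction rules and the actual bit emission described above are omitted, and the function names are not part of the standard.

    import numpy as np

    def subblock_significance(block, j, sub=16):
        # s_j(B_i^k): 1 if any sample of the 16x16 subblock exceeds 2**j in magnitude
        n = block.shape[0] // sub
        sig = np.zeros((n, n), dtype=int)
        for r in range(n):
            for c in range(n):
                tile = block[r * sub:(r + 1) * sub, c * sub:(c + 1) * sub]
                sig[r, c] = int(np.any(np.abs(tile) > 2 ** j))
        return sig

    def quad_significance(sig):
        # s_j(Q_i^1[k]): a level-1 quad is significant if any of its 2x2 subblocks is
        n = sig.shape[0] // 2
        quad = np.zeros((n, n), dtype=int)
        for r in range(n):
            for c in range(n):
                quad[r, c] = int(sig[2 * r:2 * r + 2, 2 * c:2 * c + 2].any())
        return quad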

The idea of bit-plane coding is to code the most significant bit first for all samples in the subblocks with entropy coding and to send the resulting bits. Then the next most significant bit is coded and sent; this process continues until all bit-planes have been coded and sent. This kind of bitstream structure can be used for robust transmission. If the bitstream is truncated due to a transmission error or some other reason, then some or all of the samples in the block may lose one or more least significant bits. This is equivalent to having used a coarser quantizer for the relevant samples, and we can still obtain a reduced-quality reconstructed image. The idea of fractional bit-plane coding is to code each bit-plane with four passes: a forward significance propagation pass, a backward significance propagation pass, a magnitude refinement pass, and a normalization pass. For the technical details of fractional bit-plane coding, interested readers can refer to the VM of JPEG2000 [jpeg2000 vm].
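As a hedged illustration of the bit-plane idea (without the context modeling, entropy coding, or the four fractional passes), the sketch below emits the magnitude bits of a block of coefficients one plane at a time, most significant plane first; truncating the emitted planes is equivalent to reconstructing with a coarser quantizer. The function names are purely illustrative.

    import numpy as np

    def bitplanes_msb_first(coeffs, n_planes):
        # Emit one list of bits per plane, starting with the most significant plane.
        mags = np.abs(coeffs).astype(int).ravel()
        return [((mags >> j) & 1).tolist() for j in range(n_planes - 1, -1, -1)]

    def rebuild_from_planes(planes, signs, n_planes):
        # Reconstruct magnitudes from however many planes were actually received;
        # missing (truncated) planes simply act as a coarser quantizer.
        mags = np.zeros(len(signs), dtype=int)
        for i, plane in enumerate(planes):
            mags += np.asarray(plane) << (n_planes - 1 - i)
        return np.asarray(signs) * mags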

Finally, we briefly describe the optimization issue of EBCOT. The encoding optimization algorithm is not a part of the standard, as the decoder does not need to know how the encoder generates the bitstream. From the viewpoint of the standard, the only requirement imposed on the encoder is that the bitstream be compliant with the syntax of the standard. On the other hand, the bitstream syntax can always be defined to favor certain coding algorithms that generate optimized bitstreams. The optimization algorithm described here is justified only if the distortion measure adopted for the code blocks is additive; that is, the final distortion, D, of the whole reconstructed image should satisfy

D = Σ_i D_i^{T_i}    (8.24)

where D_i^{T_i} is the distortion for block B_i and T_i is the truncation point for B_i. Let R be the total number of bits for coding all blocks of the image for a set of truncation points T_i; then

R = Σ_i R_i^{T_i}    (8.25)

where R_i^{T_i} is the number of bits for coding block B_i. The optimization process seeks the set of T_i values that minimizes D subject to the constraint R ≤ R_max, where R_max is the maximum number of bits assigned for coding the image. The solution is obtained by the method of Lagrange multipliers:

L = Σ_i (R_i^{T_i} − λ D_i^{T_i})    (8.26)

where the value of λ must be adjusted until the truncation points that minimize the value of L yield a rate satisfying R = R_max. From Equation 8.26, we have a separate, trivial optimization problem for each individual block: for each block B_i, we find the truncation point T_i that minimizes the value (R_i^{T_i} − λ D_i^{T_i}). This can be achieved by finding the slope turning points of the rate-distortion curves. In the VM, the set of truncation points and the slopes of the rate-distortion curves are computed immediately after each block is coded, and only enough information is stored to later determine the truncation points that correspond to the slope turning points of the rate-distortion curves. This information is generally much smaller than the bitstream itself, which is stored for the block. Also, the search for the optimal λ is extremely fast and occupies a negligible proportion of the overall computation time.
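The following sketch illustrates the slope-based truncation-point search described above. The per-block rate-distortion data are hypothetical (each block is given as a list of (rate, distortion) pairs, one per candidate truncation point, with rate increasing and distortion decreasing), the curves are assumed convex, and λ is found by a simple bisection so that the total rate approaches R_max from below; the bisection bounds and function names are illustrative choices.

    def truncation_points(block_rd, lam):
        # For each block keep extending the truncation point while the marginal
        # distortion-rate slope is at least lam (the slope turning-point rule).
        points, total_rate = [], 0.0
        for rd in block_rd:
            t = 0
            for i in range(1, len(rd)):
                d_rate = rd[i][0] - rd[t][0]
                d_dist = rd[t][1] - rd[i][1]
                if d_rate > 0 and d_dist / d_rate >= lam:
                    t = i
            points.append(t)
            total_rate += rd[t][0]
        return points, total_rate

    def optimize_lambda(block_rd, r_max, lo=0.0, hi=1e6, iters=50):
        # Bisect lambda: a larger lambda keeps fewer coding passes and lowers the rate.
        for _ in range(iters):
            lam = 0.5 * (lo + hi)
            _, rate = truncation_points(block_rd, lam)
            if rate > r_max:
                lo = lam          # rate too high: demand steeper slopes
            else:
                hi = lam
        return truncation_points(block_rd, hi)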

8.3.3 An Example of Performance Comparison between JPEG and JPEG2000

Before ending this chapter, we present an example comparing the performance of JPEG and JPEG2000 in low bit rate compression. We apply both the JPEG and the JPEG2000 algorithm to one of the JPEG2000 test images, named Woman or N1A, at 0.1 bpp (bits per pixel). At this rather low bit rate, the advantage of JPEG2000 over JPEG is quite obvious. At 0.1 bpp, the PSNR of the JPEG2000-compressed Woman image is 25.50 dB, while that of the JPEG-compressed Woman image is 23.91 dB. From the human visual system point of view, the superior visual quality of the JPEG2000-compressed image is also obvious, because severe false-contouring distortion (as discussed in Section 1.2.2.2) can be clearly observed in the JPEG-compressed Woman image (see Figure 8.10).


FIGURE 8.10 Performance comparison between the JPEG-compressed and JPEG2000-compressed Woman image at 0.1 bits/pixel (bits per pixel). (a) JPEG compressed, with PSNR 23.91 dB. (b) JPEG2000 compressed, with PSNR 25.50 dB.
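The PSNR figures quoted above follow the usual definition for 8-bit images, computed from the mean squared error between the original and the reconstructed image; a minimal sketch is given below.

    import numpy as np

    def psnr(original, reconstructed):
        # Peak signal-to-noise ratio in dB for 8-bit imagery (peak value 255).
        mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
        return 10.0 * np.log10(255.0 ** 2 / mse)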

8.4 Summary

In this chapter, image coding using the wavelet transform has been introduced. First, an overview of wavelet theory was given, and second, the principles of image coding using the wavelet transform were presented. Additionally, two particular embedded image coding algorithms were explained, namely the embedded zerotree wavelet (EZW) algorithm and set partitioning in hierarchical trees (SPIHT). Finally, the new standard for still image coding, JPEG2000, which adopts the wavelet transform as its core technique, was described.

Exercises

1. For the given function

f(t) = 1 for |t| ≤ 1, and 0 otherwise,

and the Mexican hat wavelet, use Equations 8.3 and 8.4 to derive a closed-form expression for the continuous wavelet transform, c_ab(t).

2. Consider the dilation equation

w(t) = √2 Σ_k h(k) w(2t − k)

How does w(t) change if h(k) is shifted? Specifically, let

g(k) = h(k − l)

u(t) = √2 Σ_k g(k) u(2t − k)

How does u(t) relate to w(t)?

3. Let w_a(t) and w_b(t) be two scaling functions generated by the two scaling filters h_a(k) and h_b(k). Show that the convolution w_a(t) * w_b(t) satisfies a dilation equation with h_a(k) * h_b(k)/√2.

4. In the applications of denoising and image enhancement, how can the wavelet transform improve the results?

5. For a given function

f(t) = 0 for t < 0, t for 0 ≤ t < 1, and 1 for t ≥ 1,

show that the wavelet transform of f(t) will be

W(a, b) = sgn(a) { |a|^{−1/2} [ 2 f(b + a/2) − f(b) − f(b + a) ] }

where sgn(x) is the signum function defined as

sgn(x) = −1 for x < 0, 1 for x > 0, and 0 for x = 0.

6. Given an 8 × 8 block of pixels from the central portion (255:262, 255:262) of the Barbara image, whose 8 × 8 pixel values are shown below:

144 163 194 210 195 151 136 191
170 194 209 200 162 136 178 226
185 200 205 183 152 159 207 218
186 201 188 167 172 197 200 176
204 194 166 171 203 199 162 167
208 167 163 203 212 164 155 215
180 148 186 219 181 141 191 234
141 170 217 198 146 166 231 194

a. Apply a three-level 9/7, 5/3, or Haar wavelet transform to this 8 × 8 image block. If you have difficulty with this, the result of the 5/3 transform is listed below.

Three-level integer 5/3 discrete wavelet transform (DWT):

144  45  26 −49  −6  15 −15  55
 56 −63 −15   1   8   3 −21   7
  9 −16  56 111   9 −14  18   6
−32  −2  62  61 −30  32 −41  32
  5  11 −15   6   5   4 −16  15
 −9   2  −5  10   7  −8  13 −32
 16 −13  20 −14  −6   4  −5  36
−39  33 −34  26  26 −19  22 −80

b. Then, following the examples shown in [shapiro 1993] or the example shown in [usevitch 2001], apply the EZW method to this 8 × 8 wavelet coefficient array completely. That is, produce the set of four different types of symbols (positive significant coefficient, negative significant coefficient, zerotree root, and isolated zero) that represents this 8 × 8 three-level wavelet coefficient 2-D array. Provide your coding results in terms of the four types of symbols in tables.

c. Comment on the efficiency and the embedding nature of the EZW code.

References

[adams 2000] M.D. Adams and F. Kossentini, JasPer: A software-based JPEG-2000 codec implementation, Proceedings of the IEEE International Conference on Image Processing, Vancouver, Canada, September 2000. See also the JasPer home page at http://www.ece.ubc.ca/~mdadams/jasper/.

[antonini 1992] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, Image coding using wavelet transform, IEEE Transactions on Image Processing, 1, 205–220, April 1992.

[castleman 1996] K.R. Castleman, Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1996.

[cohen 1989] L. Cohen, Time–frequency distributions—A review, Proceedings of the IEEE, 77, 7, 941–981, July 1989.

[daubechies 1992] I. Daubechies, Ten Lectures on Wavelets, CBMS-NSF Series in Applied Mathematics, SIAM, Philadelphia, 1992.

[daubechies 1998] I. Daubechies and W. Sweldens, Factoring wavelet transforms into lifting steps, Journal of Fourier Analysis and Applications, 4, 3, 247–269, 1998.

[grossman 1984] A. Grossman and J. Morlet, Decomposition of Hardy functions into square integrable wavelets of constant shape, SIAM Journal on Mathematical Analysis, 15, 4, 723–736, July 1984.

[jayant 1984] N.S. Jayant and P. Noll, Digital Coding of Waveforms, Prentice-Hall, Englewood Cliffs, NJ, 1984.

[jj2000] JJ2000: An implementation of JPEG2000 in Java, available at http://jj2000.epfl.ch

[jpeg2000 vm] JPEG2000 Verification Model 4.0 (technical description), SC29/WG01 N1282, April 22, 1999.

[mpeg4] ISO/IEC 14496-2, Coding of audio-visual objects, November 1998.

[rabbani 2001] M. Rabbani and R. Joshi, An overview of the JPEG 2000 still image compression standard, ISO/IEC JTC 1/SC 29/WG1 N2233, July 2001.

[rioul 1991] O. Rioul and M. Vetterli, Wavelets and signal processing, IEEE Signal Processing Magazine, 8, 4, 14–38, October 1991.

[said 1996] A. Said and W.A. Pearlman, A new fast and efficient image codec based on set partitioning in hierarchical trees, IEEE Transactions on Circuits and Systems for Video Technology, 6, 243–250, June 1996.

[shapiro 1993] J. Shapiro, Embedded image coding using zerotrees of wavelet coefficients, IEEE Transactions on Signal Processing, 41, 12, 3445–3462, December 1993.

[skodras 2001] A. Skodras, C. Christopoulos, and T. Ebrahimi, The JPEG2000 still image compression standard, IEEE Signal Processing Magazine, 18, 5, 36–58, September 2001.

[sweldens 1995] W. Sweldens, The lifting scheme: A new philosophy in biorthogonal wavelet constructions, Proceedings of SPIE, 2569, 68–79, 1995.

[taubman 2000] D. Taubman, High performance scalable image compression with EBCOT, IEEE Transactions on Image Processing, 9, 7, 1158–1170, July 2000.

[usevitch 2001] B.E. Usevitch, A tutorial on modern lossy wavelet image compression: Foundations of JPEG 2000, IEEE Signal Processing Magazine, 18, 5, 22–35, September 2001.

[uytterhoeven 1999] G. Uytterhoeven, Wavelets: Software and applications, doctoral dissertation, Department of Computer Science, K.U. Leuven, Belgium, April 1999.

[vetterli 1984] M. Vetterli, Multidimensional subband coding: Some theory and algorithms, Signal Processing, 6, 97–112, February 1984.

[vetterli 1995] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice-Hall, Englewood Cliffs, NJ, 1995.

[woods 1986] J. Woods and S. O'Neil, Subband coding of images, IEEE Transactions on Acoustics, Speech, and Signal Processing, 34, 1278–1288, October 1986.

[xuan 2002] G. Xuan, J. Zhu, J. Chen, Y.Q. Shi, Z. Ni, and W. Su, Distortionless data hiding based on integer wavelet transform, Electronics Letters, 38, 25, 1646–1648, December 2002.


9 Nonstandard Still Image Coding

In this chapter, we introduce three nonstandard image coding techniques: vector quantization (VQ) [nasrabadi 1988], fractal coding [barnsley 1993; jacquin 1993; fisher 1994], and model-based coding [li 1994].

9.1 Introduction

VQ, fractal coding, and model-based coding techniques have not been adopted by any image coding standard. However, due to their unique features, these techniques may find some special applications. VQ is an effective technique for performing data compression. Theoretically, VQ is always better than scalar quantization because it fully exploits the correlation between components within the vector. The optimal coding performance is obtained when the dimension of the vector approaches infinity, and then the correlation between all components is exploited for compression. Another very attractive feature of image VQ is that its decoding procedure is very simple, since it consists only of table lookups. However, there are two major problems with image VQ techniques. The first is that the complexity of VQ increases exponentially with the dimensionality of the vectors. Therefore, for VQ it is important to solve the problem of how to design a practical coding system that provides reasonable performance under a given complexity constraint. The second major problem of image VQ is the need for a codebook, which causes several problems in practical applications, such as generating a universal codebook for a large number of images, scaling the codebook to fit the bit rate requirement, and so on. Recently, lattice VQ schemes have been proposed to address these problems [li 1997].

Fractal theory has a long history. Fractal-based techniques have been used in several areas of digital image processing, such as image segmentation, image synthesis, and computer graphics, but only recently have they been extended to image compression applications [jacquin 1993]. A fractal is a geometric form that has the unique feature of having extremely high visual self-similar irregular detail while containing very low information content. Several methods for image compression have been developed based on different characteristics of fractals. One method is based on the iterated function system (IFS) proposed in [barnsley 1988]. This method uses the self-similar and self-affine property of fractals. Such a system consists of sets of transformations, including translation, rotation, and scaling. At the encoder side of a fractal image coding system, a set of fractals is generated from the input image. These fractals can be used to reconstruct the image at the decoder side. Because these fractals are represented by very compact fractal transformations, they require a very small amount of data to be expressed and stored as formulas. Therefore, the information that needs to be transmitted is very small. The second fractal image coding method is based on the fractal dimension [jang 1990; lu 1993]. Fractal dimension is a good representation of the roughness of image surfaces. In this method, the image is first segmented using the fractal dimension, and then the resulting uniform segments can be efficiently coded using the properties of the human visual system (HVS). Another fractal image coding scheme is based on fractal geometry, which is used to measure the length of a curve with a yardstick [walach 1986]. The details of these coding methods are discussed in Section 9.3.

The basic idea of model-based coding is to reconstruct an image with a set of model parameters. The model parameters are encoded and transmitted to the decoder. At the decoder, the decoded model parameters are used to reconstruct the image with the same model used at the encoder. Therefore, the key techniques in model-based coding are image modeling, image analysis, and image synthesis.

9.2 Vector Quantization

9.2.1 Basic Principle of Vector Quantization

An N-level vector quantizer, Q, is a mapping from a K-dimensional vector set {V} into a finite codebook, W = {w_1, w_2, ..., w_N}:

Q: V → W    (9.1)

In other words, it assigns an input vector, v, to a representative vector (code word), w, from the codebook, W. The vector quantizer, Q, is completely described by the codebook, W = {w_1, w_2, ..., w_N}, together with the disjoint partition, R = {r_1, r_2, ..., r_N}, where

r_i = {v: Q(v) = w_i}    (9.2)

and w and v are K-dimensional vectors. The partition should minimize the quantization error [gersho 1982]. A block diagram of the various steps involved in image VQ is depicted in Figure 9.1.

FIGURE 9.1 Principle of image vector quantization (VQ): vectors are formed from the input image and quantized against a codebook, producing the index of the selected code word; a training set of images is used for training set generation and codebook generation. The dashed lines correspond to the training set generation, codebook generation, and transmission (if it is necessary).


The first step of image VQ is vector formation: the image data are first partitioned into a set of vectors. A large number of vectors from various images are then used to form a training set. The training set is used to generate a codebook, normally using an iterative clustering algorithm. The quantization or coding step involves searching, for each input vector, for the closest code word in the codebook. The index of the selected code word is then coded and transmitted to the decoder. At the decoder, the index is decoded and converted to the corresponding vector by table lookup, using the same codebook as at the encoder. Thus, the design decisions in implementing image VQ include (1) vector formation, (2) training set generation, (3) codebook generation, and (4) quantization.

9.2.1.1 Vector Formation

The first step of VQ is vector formation, that is, the decomposition of the images into a set of vectors. Many different decompositions have been proposed; examples include the intensity values of a spatially contiguous block of pixels [gersho/ramamuthi 1982; baker 1983]; these same intensity values, but normalized by the mean and variance of the block [murakami 1982]; the transformed coefficients of the block pixels [li 1995]; and the adaptive linear predictive coding coefficients for a block of pixels [sun 1984]. Basically, the approaches to vector formation can be classified into two categories: direct spatial or temporal, and feature extraction. Direct spatial or temporal formation is a simple approach that forms vectors from the intensity values of a spatially or temporally contiguous block of pixels in an image or an image sequence. A number of image VQ schemes have been investigated with this method. The other kind of method is feature extraction. An image feature is a distinguishing primitive characteristic. Some features are natural in the sense that they are defined by the visual appearance of an image, while other, so-called artificial features result from specific manipulations or measurements of images or image sequences. In vector formation, it is well known that image data in the spatial domain can be converted to a different domain so that subsequent quantization and joint entropy coding can be more efficient. For this purpose, some features of the image data, such as transformed coefficients and block means, can be extracted and vector quantized. The practical significance of feature extraction is that it can reduce the vector size and, consequently, the complexity of the coding procedure.

9.2.1.2 Training Set Generation

An optimal vector quantizer should ideally match the statistics of the input vector source. However, if the statistics of the input vector source are unknown, a training set representative of the expected input vector source can be used to design the vector quantizer. If the expected vector source has a large variance, then a large training set is needed. To alleviate the implementation complexity caused by a large training set, the input vector source can be divided into subsets. For example, in [gersho 1982] the single input source is divided into edge and shade vectors, and separate training sets are then used to generate separate codebooks, respectively. Those separate codebooks are then concatenated into a final codebook. In other methods, small local input sources corresponding to portions of the image are used as the training sets; thus the codebook can better match the local statistics. However, the codebook needs to be updated to track changes in the local statistics of the input sources. This may increase the complexity and reduce the coding efficiency. Practically, in most coding systems a set of typical images is selected as the training set and used to generate the codebook. The coding performance can then be ensured for the images in the training set, or for those not in the training set but with statistics similar to those in the training set.


9.2.1.3 Codebook Generation

The key step of conventional image VQ is the development of a good codebook. The optimal codebook, under the mean squared error (MSE) criterion, must satisfy two necessary conditions [gersho 1982]. First, the input vector source is partitioned into a pre-decided number of regions with the minimum distance rule; the number of regions is decided by the requirement on the bit rate, or compression ratio, and the coding performance. Second, the code word, or representative vector, of each region is the mean value, or statistical center, of the vectors within the region. Under these two conditions, a generalized Lloyd clustering algorithm proposed by Linde, Buzo, and Gray (the so-called LBG algorithm [linde 1980]) has been extensively used to generate the codebook. The clustering algorithm is an iterative process that minimizes a performance index calculated from the distances between the sample vectors and their cluster centers. The LBG clustering algorithm can only generate a codebook with a local optimum, which depends on the initial cluster seeds. Two basic procedures have been used to obtain the initial codebook or cluster seeds. The first approach starts by finding a small codebook with only two code words and then recursively splitting the codebook until the required number of code words is obtained; this approach is referred to as binary splitting. The second starts with initial seeds for the required number of code words, these seeds being generated by preprocessing the training sets. To address the problem of the local optimum, Equitz [equitz 1989] proposed a new clustering algorithm, the pairwise nearest neighbor (PNN) algorithm. The PNN algorithm begins with a separate cluster for each vector in the training set and merges two clusters at a time until the desired codebook size is obtained. At the beginning of the clustering process, each cluster contains only one vector. In the following process, the two closest vectors in the training set are merged to their statistical mean value, in such a way that the error incurred by replacing these two vectors with a single code word is minimized. The PNN algorithm significantly reduces the computational complexity without sacrificing performance. This algorithm can also be used to generate the initial codebook for the LBG algorithm.
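A minimal sketch of the LBG iteration with binary splitting is given below; it assumes the training vectors are the rows of a NumPy array, and the perturbation factor, iteration count, and function name are illustrative choices rather than part of the original algorithm description.

    import numpy as np

    def lbg_codebook(training, size, iters=20, eps=1e-3):
        # Start from the global centroid, then alternate binary splitting with
        # Lloyd iterations (minimum-distance partition + centroid update).
        codebook = training.mean(axis=0, keepdims=True)
        while codebook.shape[0] < size:
            codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
            for _ in range(iters):
                d = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
                labels = d.argmin(axis=1)
                for k in range(codebook.shape[0]):
                    members = training[labels == k]
                    if len(members):
                        codebook[k] = members.mean(axis=0)
        return codebook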

9.2.1.4 Quantization

Quantization in the context of VQ involves selecting a code word in the codebook for each input vector. Optimal quantization, in turn, implies that for each input vector, v, the closest code word, w_i, is found, as shown in Figure 9.2. The distortion criterion could be the MSE, the absolute error, or another distortion measure.

A full-search quantization is an exhaustive search over the entire codebook for the closest code word, as shown in Figure 9.3a. It is optimal for the given codebook, but its computation is more expensive. An alternative approach is tree-search quantization, where the search is carried out on the basis of a hierarchical partition. A binary tree search is shown in Figure 9.3b. Tree search is much faster than full search, but it is clear that tree search is suboptimal for the given codebook and requires more memory for the codebook.
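The two search strategies can be contrasted with the small sketch below. The full search scans the whole codebook; the binary tree search descends a hierarchy of test vectors. The tree layout (nested dictionaries with two test centroids per node and a code word index at each leaf) is an assumed structure for illustration, not a prescribed format.

    import numpy as np

    def full_search(v, codebook):
        # Exhaustive (optimal) search: index of the closest code word.
        return int(((codebook - v) ** 2).sum(axis=1).argmin())

    def tree_search(v, node):
        # Suboptimal but fast: at each internal node follow the closer test centroid.
        while isinstance(node, dict):
            go_left = (((node["left_centroid"] - v) ** 2).sum()
                       <= ((node["right_centroid"] - v) ** 2).sum())
            node = node["left"] if go_left else node["right"]
        return node  # leaf holds the code word index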

FIGURE 9.2 Principle of vector quantization (VQ): for an input vector v, the quantizer searches the codebook for the code word w_k such that ||v − w_k|| = min_i ||v − w_i|| and outputs the index k.


FIGURE 9.3 (a) Full-search quantization and (b) binary tree-search quantization.

9.2.2 Several Image Coding Schemes with Vector Quantization

In this section, we present several image coding schemes using VQ, including residual VQ, classified VQ, transform domain VQ, predictive VQ, and block truncation coding (BTC), which can be seen as a binary VQ.

9.2.2.1 Residual VQ

In conventional image VQ, the vectors are formed by spatially partitioning the image data into blocks of 8 × 8 or 4 × 4 pixels. In the original spatial domain, the statistics of the vectors may be widely spread in the multidimensional vector space. This causes difficulty in generating a codebook of finite size and limits the coding performance. Residual VQ has been proposed to alleviate this problem. In residual VQ, the mean of the block is extracted and coded separately. The vectors are formed by subtracting the block mean from the original pixel values. This scheme can be further modified by considering the variance of the blocks. The original blocks are converted to vectors with zero mean and unit standard deviation with the following conversion formulas [murakami 1982]:

m_i = (1/K) Σ_{j=0}^{K−1} s_j    (9.3)

x_j = (s_j − m_i)/σ_i    (9.4)

σ_i = [ (1/K) Σ_{j=0}^{K−1} (s_j − m_i)^2 ]^{1/2}    (9.5)

where
m_i is the mean value of the ith block
σ_i is the standard deviation of the ith block
s_j is the value of pixel j (j = 0, ..., K − 1) in the ith block
K is the total number of pixels in the block
x_j is the normalized value of pixel j

The new vector X_i is now formed by the x_j (j = 0, 1, ..., K − 1):

X_i = [x_0, x_1, ..., x_{K−1}]_i    (9.6)


With the above normalization, the probability function P(X) of the input vector X is approximately similar for image data from different scenes. Therefore, it is easy to generate a codebook for the new vector set. The problem with this method is that the mean and variance values of the blocks have to be coded separately. This increases the overhead and limits the coding efficiency. Several methods have been proposed to improve the coding efficiency. One of these methods uses predictive coding to code the block mean values: the mean value of the current block can be predicted from one of its previously coded neighbors. In such a way, the coding efficiency increases through the use of inter-block correlation.
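A minimal sketch of the block normalization of Equations 9.3 through 9.5 is shown below; the block is assumed to be a small NumPy array (e.g., 4 × 4 pixels), and the mean and standard deviation returned are the side information that must be coded separately, as noted above.

    import numpy as np

    def normalize_block(block):
        s = block.astype(float).ravel()
        m = s.mean()                                  # Equation 9.3
        sigma = np.sqrt(((s - m) ** 2).mean())        # Equation 9.5
        x = (s - m) / sigma if sigma > 0 else s - m   # Equation 9.4
        return m, sigma, x                            # side info (m, sigma) plus vector X_i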

9.2.2.2 Classified VQ

In image VQ, the codebook is usually generated using a training set under the constraint of minimizing the MSE. This implies that the code word is the statistical mean of the region. During quantization, each input vector is replaced by its closest code word. Therefore, the coded images usually suffer from edge distortion at very low bit rates, since edges are smoothed by the averaging operation with a small-sized codebook. To overcome this problem, we can classify the training vectors as edge vectors and shade vectors [gersho 1982]. Two separate codebooks can then be generated from the two types of training sets, respectively. Each input vector can then be coded by the appropriate code word in the corresponding codebook. However, the edge vectors can be further classified into many types according to their location and angular orientation. The classified VQ can thus be extended into a system that contains many sub-codebooks, each representing one type of edge. However, this would increase the complexity of the system and would be hard to implement in practical applications.

9.2.2.3 Transform Domain VQ

VQ can also be performed in the transform domain. A spatial block of 4 × 4 or 8 × 8 pixels is first transformed into 4 × 4 or 8 × 8 transform coefficients. There are several ways to form vectors from the transform coefficients. In the first method, a number of high-order coefficients can be discarded because most of the energy is usually contained in the low-order coefficients for most blocks. This reduces the VQ computational complexity at the expense of a small increase in distortion. However, for some active blocks, edge information is contained in the high frequencies, or high-order coefficients, and discarding the high frequencies causes serious subjective distortion. In the second method, the transform coefficients are divided into several bands and each band is used to form its corresponding vector set. This method is equivalent to classified VQ in the spatial domain. An adaptive scheme can then be developed by using the two kinds of vector formation methods: the first method is used for blocks containing moderate intensity variation, and the second method for blocks with high spatial activity. However, the complexity increases, as more codebooks are needed in such adaptive coding systems.

9.2.2.4 Predictive VQ

The vectors are usually formed from spatially consecutive blocks. Consecutive vectors are therefore highly statistically dependent, and better coding performance can be achieved if the correlation between vectors is exploited. Several predictive VQ schemes have been proposed to address this problem. One kind of predictive VQ is finite-state VQ [dunham 1985; foster 1985]. Finite-state VQ is similar to a trellis coder. In finite-state VQ, the codebook consists of a set of sub-codebooks. A state variable is used to specify which sub-codebook should be selected for coding the input vector. The information about the state variable must be inferred from the received sequence of state symbols and the initial state, as in a trellis coder [stewart 1982]. Therefore, no side information or overhead needs to be transmitted to the decoder. The new encoder state is a function of the previous encoder state and the selected sub-codebook. This permits the decoder to track the encoder state if the initial condition is known. Finite-state VQ needs additional memory to store the previous state, but it takes advantage of the correlation between successive input vectors by choosing the appropriate codebook for the given history. It should be noted that the minimum distortion selection rule of conventional VQ is not necessarily optimal for finite-state VQ for a given decoder, because a low-distortion code word may lead to a bad state and hence to poor long-term behavior. Therefore, the key design issue of finite-state VQ is to find a good next-state function.

Another predictive VQ was proposed in [hang 1985]. In this system, the input vector is formed in such a way that the current pixel is the first element of the vector and the previous inputs are the remaining elements. The system acts like a mapping or a recursive filter that is used to predict the next pixel. The mapping is implemented by a vector quantizer lookup table and provides the prediction errors.

9.2.2.5 Block Truncation Coding

In block truncation coding (BTC) [delp 1979], an image is first divided into 4 × 4 blocks. Each block is then coded individually. The pixels in each block are first converted into two-level signals by using the first two moments of the block:

a = m + σ √((N − q)/q)

b = m − σ √(q/(N − q))    (9.7)

where
m is the mean value of the block
σ is the standard deviation of the block
N is the total number of pixels in the block
q is the number of pixels whose values are greater than m

Therefore, each block can be described by the block mean, the standard deviation, and a binary bit-plane that indicates whether each pixel has a value above or below the block mean. The binary bit-plane can be seen as a binary vector quantizer. If the mean and standard deviation of the block are each quantized to 8 bits, then 2 bits/pixel is achieved for blocks of 4 × 4 pixels. The conventional BTC scheme can be modified to increase the coding efficiency. For example, the block mean can be coded by a DPCM coder that exploits the inter-block correlation, and the bit-plane can be coded with an entropy coder on the patterns [udpikar 1987].
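The sketch below follows Equation 9.7 directly: the block is reduced to its mean, its standard deviation, and a binary bit-plane, and the decoder reconstructs each pixel as one of the two levels a and b. The handling of the degenerate cases q = 0 and q = N is an added safeguard not spelled out in the text.

    import numpy as np

    def btc_encode(block):
        m, sigma = block.mean(), block.std()
        bitplane = block > m                      # 1 for pixels above the block mean
        q, N = int(bitplane.sum()), block.size
        a = m + sigma * np.sqrt((N - q) / q) if q > 0 else m      # Equation 9.7
        b = m - sigma * np.sqrt(q / (N - q)) if q < N else m
        return bitplane, a, b

    def btc_decode(bitplane, a, b):
        return np.where(bitplane, a, b)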

9.2.3 Lattice VQ for Image Coding

In conventional image VQ schemes, there are several issues that cause difficulties for practical applications. The first problem is the limitation on the vector dimension. As indicated earlier, the coding performance of VQ increases with the vector dimension, but the coding complexity increases exponentially at the same time. Therefore, in practice, only a small vector dimension is possible under the complexity constraint. Another important issue in VQ is the need for a codebook. Much research effort has gone into finding out how to generate a codebook. However, in practical applications there is the further problem of how to scale the codebook for various rate-distortion requirements. The codebook generated by LBG-like algorithms with a training set is usually only suitable for a specified bit rate and does not have the flexibility of codebook scalability. For example, a codebook generated for an image with small resolution may not be suitable for images with high resolution. Even for the same spatial resolution, different bit rates would require different codebooks. Additionally, VQ needs a table to specify the codebook, and consequently the complexity of storing and searching the table is too high to allow a very large table. This further limits the coding performance of image VQ. These problems have become major obstacles to the implementation of image VQ. Recently, an algorithm of lattice VQ has been proposed to address these problems [li 1997]. Lattice VQ does not have the above problems. The codebook for lattice VQ is simply a collection of lattice points uniformly distributed over the vector space. Scalability can be achieved by scaling the cell size associated with every lattice point, just as in a scalar quantizer by scaling the quantization step. The basic concept of lattices can be found in [conway 1991]. A typical lattice VQ scheme is shown in Figure 9.4. There are two steps involved in image lattice VQ. The first step is to find the closest lattice point for the input vector. The second step is to label the lattice point, i.e., to map the lattice point to an index. Since lattice VQ does not need a codebook, the index assignment is based on a lattice labeling algorithm instead of a lookup table as in conventional VQ. Therefore, the key issue of lattice VQ is to develop an efficient lattice-labeling algorithm. Using this algorithm, the closest lattice point and its corresponding index within a finite boundary can be obtained by calculation at the encoder for each input vector.

At the decoder, the index is converted to the lattice point by the same labeling algorithm. The vector is then reconstructed from the lattice point. The efficiency of a labeling algorithm for lattice VQ is measured by how many bits are needed to represent the indices of the lattice points within a finite boundary. We use a two-dimensional (2-D) lattice to explain the lattice labeling efficiency. A 2-D lattice is shown in Figure 9.5.

In Figure 9.5, there are seven lattice points. One method to label these seven 2-D lattice points is to use their coordinates (x, y) to label each point. If we label x and y separately, we need 2 bits to label the three values of x and 3 bits to label the five possible values of y; thus, we need a total of 5 bits. It is clear, however, that 3 bits are sufficient to label seven lattice points. Therefore, different labeling algorithms may have different labeling efficiency. Several algorithms have been developed for multidimensional lattice labeling.

FIGURE 9.4 Block diagram of lattice vector quantization (VQ). The encoder finds the closest code word (lattice point) C_i for the input vector and extracts its index i (lattice labeling); the index is sent over the channel, and the decoder maps the index back to C_i (lattice de-labeling) to produce the output vector.


FIGURE 9.5 Labeling a two-dimensional (2-D) lattice.

In [conway 1983], the labeling method assigns an index to every lattice point within a Voronoi boundary, where the shape of the boundary is the same as the shape of the Voronoi cells. Apparently, for different dimensions the boundaries have different shapes. In the algorithm proposed in [laroia 1993], the same method is used to assign an index to each lattice point. However, the boundaries are defined by the labeling algorithm; this algorithm might not achieve 100% labeling efficiency for a prespecified boundary such as a pyramid boundary. The algorithm proposed in [fischer 1986] can assign an index to every lattice point within a prespecified pyramid boundary and achieves 100% labeling efficiency, but it can only be used for the Z^n lattice. In the recently proposed algorithm of [wang 1998], a technical breakthrough was obtained. In this algorithm, a labeling method has been developed for Construction-A and Construction-B lattices [conway 1983], which are very useful for VQ with a proper vector dimension such as 16, and 100% efficiency is achieved. Additionally, these algorithms are used for labeling lattice points of dimension 16 and provide the minimum distortion. The algorithms are developed based on the relations between lattices and linear block codes. Construction-A and Construction-B are the two simplest ways to construct a lattice from a binary linear block code C = (n, k, d), where n, k, and d are the length, the dimension, and the minimum distance of the code, respectively.

A Construction-A lattice is defined as

L_n = C + 2Z^n    (9.8)

where Z^n is the n-dimensional cubic lattice and C is a binary linear block code. There are two steps involved in labeling a Construction-A lattice. The first is to order the lattice points according to the binary linear block code C, and the second is to order the lattice points associated with a particular nonzero binary code word. For the lattice points associated with a nonzero binary code word, two sub-lattices are considered separately. One sub-lattice consists of all the dimensions that have a 0 component in the binary code word, and the other consists of all the dimensions that have a 1 component in the binary code word. The first sub-lattice is treated as a 2Z lattice, whereas the second is treated as a translated 2Z lattice. Therefore, the labeling problem is reduced to labeling the Z lattice at the final stage.

A Construction-B lattice is defined as

L_n = C + 2D_n    (9.9)

where D_n is an n-dimensional Construction-A lattice defined as

D_n = (n, n − 1, 2) + 2Z^n    (9.10)


and C is a binary doubly even linear block code. When n is equal to 16, the binary doubly even linear block code associated with L_16 is C = (16, 5, 8). The method for labeling a Construction-B lattice is similar to the method for labeling a Construction-A lattice, with two minor differences. The first difference is that, for any vector y = c + 2x, we have x ∈ Z^n if y is a Construction-A lattice point and x ∈ D_n if y is a Construction-B lattice point. The second difference is that C is a binary doubly even linear block code for Construction-B lattices, while it is not necessarily doubly even for Construction-A lattices. In the implementation of these lattice point labeling algorithms, the encoding and decoding functions for lattice VQ have been developed in [li 1997]. For a given input vector, an index representing the closest lattice point is found by the encoding function, and for an input index, the reconstructed vector is generated by the decoding function. In summary, the idea of lattice VQ for image coding is an important achievement in eliminating the need for a codebook in image VQ. The development of efficient algorithms for lattice point labeling makes lattice VQ feasible for image coding.
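As an illustration of the first step of lattice VQ (finding the closest lattice point), the brute-force sketch below exploits the coset structure of Equation 9.8: the nearest point of the coset c + 2Z^n is obtained by rounding, and the best coset over all code words of C wins. The exhaustive loop over code words is for clarity only; practical encoders use the structured algorithms cited above. The even-weight code used in the helper corresponds to the D_n lattice of Equation 9.10.

    import numpy as np
    from itertools import product

    def nearest_construction_a_point(v, codewords):
        # Closest point of L_n = C + 2Z^n to v: for each coset c + 2Z^n the nearest
        # point is c + 2*round((v - c)/2); keep the coset with the smallest distance.
        v = np.asarray(v, dtype=float)
        best, best_d = None, np.inf
        for c in codewords:
            c = np.asarray(c, dtype=float)
            p = c + 2.0 * np.round((v - c) / 2.0)
            d = ((v - p) ** 2).sum()
            if d < best_d:
                best, best_d = p, d
        return best

    def even_weight_codewords(n):
        # The (n, n-1, 2) single-parity-check code used in Equation 9.10 for D_n.
        return [w for w in product((0, 1), repeat=n) if sum(w) % 2 == 0]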

9.3 Fractal Image Coding

9.3.1 Mathematical Foundation

A fractal is a geometric form whose irregular details can be represented by objects at different scales and angles, described by a set of transformations such as affine transformations. Additionally, the objects used to represent the image's irregular details have some form of self-similarity, and these objects can be used to represent an image in a simple recursive way. An example of a fractal is the Von Koch curve, shown in Figure 9.6. Fractals can be used to generate an image. Fractal image coding based on the IFS is the inverse process of image generation with fractals; therefore, the key technology of fractal image coding is the generation of fractals with an IFS.

To explain what an IFS is, we start from the contractive affine transformation. A 2-D affine transformation A is defined as follows:

A (x, y)^T = [a b; c d] (x, y)^T + (e, f)^T    (9.11)

FIGURE 9.6 Construction of the Von Koch curve (stages E_0, E_1, E_2, and E_3).


This is a transformation that consists of a linear transformation followed by a shift, or translation, and it maps points in the Euclidean plane into new points in another Euclidean plane. We say that a transformation is contractive if the distance between two points P_1 and P_2 in the new plane is smaller than their distance in the original plane, i.e.,

d(A(P_1), A(P_2)) < s · d(P_1, P_2)    (9.12)

where s is a constant and 0 < s < 1. Contractive transformations have the property that when they are repeatedly applied to the points in a plane, these points converge to a fixed point. An IFS is defined as a collection of contractive affine transformations. A well-known example of an IFS contains the following four transformations:

A_i (x, y)^T = [a b; c d] (x, y)^T + (e, f)^T,    i = 1, 2, 3, 4    (9.13)

This is the IFS of a fern leaf, whose parameters are shown in Table 9.1. The transformations A_1, A_2, A_3, and A_4 are used to generate the stalk, the right leaf, the left leaf, and the main fern, respectively. A fundamental theorem of fractal geometry is that each IFS defines a unique fractal image. This image is referred to as the attractor of the IFS. In other words, an image corresponds to the attractor of an IFS. Now let us explain how to generate the image using the IFS. Suppose that an IFS contains N affine transformations, A_1, A_2, ..., A_N, and that each transformation has an associated probability, p_1, p_2, ..., p_N, respectively. Suppose that this is a complete set and the probabilities sum to 1, i.e.,

p_1 + p_2 + ... + p_N = 1 and p_i > 0 for i = 1, 2, ..., N    (9.14)

The procedure for generating an attractor is as follows. For any given point (x_0, y_0) in the Euclidean plane, one transformation in the IFS is selected according to its probability and applied to this point to generate a new point (x_1, y_1). Then another transformation is selected according to its probability and applied to the point (x_1, y_1) to obtain a new point (x_2, y_2). This process is repeated over and over again to obtain a long sequence of points: (x_0, y_0), (x_1, y_1), ..., (x_n, y_n), .... According to the theory of iterated function systems, these points converge to an image that is the attractor of the given IFS. The above-described procedure is shown in the flowchart of Figure 9.7. With the above algorithm and the parameters in Table 9.1, the initial point can be anywhere within the large square, but after several iterations it converges onto the fern. The 2-D affine transformations can be extended to three-dimensional (3-D) transformations, which can be used to create fractal surfaces with iterated function systems. Such a fractal surface can be considered as the gray level or brightness of a 2-D image.

TABLE 9.1

The Parameters of the Iterated Function System (IFS) of a Fern Leaf

      a       b       c       d      e     f
A1    0       0       0       0.16   0     0.2
A2    0.2    −0.26    0.23    0.22   0     0.2
A3   −0.15    0.28    0.26    0.24   0     0.2
A4    0.85    0.04   −0.04    0.85   0     0.2


FIGURE 9.7 Flowchart of generating an image with an iterated function system (IFS): given (x_0, y_0), choose a transformation A_k according to its probability p_k, compute (x_1, y_1) = A_k(x_0, y_0), plot (x_1, y_1), check convergence, and repeat until convergence is reached.
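The random-iteration procedure of Figure 9.7 can be sketched in a few lines of Python using the maps of Table 9.1. Table 9.1 does not list the selection probabilities, so the values used below (heavily favoring the main-fern map A_4) are an assumption for illustration.

    import random

    # (a, b, c, d, e, f) for A1..A4, taken from Table 9.1
    MAPS = [
        (0.00,  0.00,  0.00, 0.16, 0.0, 0.2),   # A1: stalk
        (0.20, -0.26,  0.23, 0.22, 0.0, 0.2),   # A2: right leaf
        (-0.15, 0.28,  0.26, 0.24, 0.0, 0.2),   # A3: left leaf
        (0.85,  0.04, -0.04, 0.85, 0.0, 0.2),   # A4: main fern
    ]
    PROBS = [0.01, 0.07, 0.07, 0.85]            # assumed probabilities p_1..p_4

    def ifs_attractor(n_points=50000, x=0.0, y=0.0):
        # Random iteration: repeatedly apply a randomly chosen affine map (Eq. 9.13).
        points = []
        for _ in range(n_points):
            a, b, c, d, e, f = random.choices(MAPS, weights=PROBS, k=1)[0]
            x, y = a * x + b * y + e, c * x + d * y + f
            points.append((x, y))
        return points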

9.3.2 IFS-Based Fractal Image Coding

As described in the last section, an IFS can be used to generate a unique image, which is referred to as the attractor of the IFS. In other words, an image is the attractor of an IFS, and this image can be represented simply by the parameters of the IFS. Therefore, if we can use an inverse procedure to generate a set of transformations, i.e., an IFS, from an image, then these transformations, or the IFS, can be used to represent an approximation of the image. The image coding system can use the parameters of the transformations in the IFS instead of the original image data for storage or transmission. As the IFS contains only very limited data, such as the transformation parameters, this image coding method may result in a very high compression ratio. For example, the fern image is represented by 24 integers, or 192 bits (if each integer is represented by 8 bits). This number is much smaller than the number of bits needed to represent the fern image pixel by pixel. Now the key issue of IFS-based fractal image coding is to generate the IFS for the given input image. Three methods have been proposed to obtain the IFS [lu 1993]. The first is the direct method, which directly finds a set of contractive affine transformations from the image based on the self-similarity of the image. The second method partitions an image into smaller objects whose IFSs are known; these IFSs are used to form a library, and the encoding procedure is to look for an IFS in the library for each small object. The third method is called the partitioned IFS (PIFS). In this method, the image is first divided into smaller blocks and then the IFS for each block is found by mapping a larger block into a smaller block.

In the first, direct approach, the image is partitioned into nonoverlapping blocks in such a way that each block is similar to the whole image and a transformation can map the whole image to the block. The transformation for each individual block may be different. The combination of these transformations can be taken as the IFS of the given image.


Much less data is then required to represent the IFS, or the transformations, than to transmit or store the given image pixel by pixel. For the second approach, the key issue is how to partition the given image into objects whose IFSs are known. Image processing techniques such as color separation, edge detection, spectrum analysis, and texture-variation analysis can be used for the image partitioning. However, for natural or arbitrary images it may be impossible or very difficult to find an IFS whose attractor perfectly covers the original image. Therefore, for most natural images the PIFS method has been proposed [lu 1993]. In this method, the transformations do not map the whole image into small blocks. For encoding an image, the whole image is first partitioned into a number of larger blocks, referred to as domain blocks; the domain blocks may overlap. Then the image is partitioned into a number of smaller blocks, called range blocks. The range blocks do not overlap, and together they cover the whole image. In the third step, a set of contractive transformations is chosen: each range block is mapped to a domain block with a searching method and a matching criterion. The combination of these transformations is used to form a PIFS. The parameters of the PIFS are transmitted to the decoder; note that no domain blocks are transmitted. The decoding starts with a flat background. The iterated process is then applied with the set of transformations, and the reconstructed image is obtained after the process converges. From the above discussion, it follows that there are three main design issues in a block fractal image coding system. The first is the partitioning technique, which includes the range block partitioning and the domain block partitioning; as mentioned earlier, the domain blocks are larger than the range blocks, and dividing the image into square blocks is the simplest partitioning approach. The second issue is the choice of distortion measure and searching method. The common distortion measure in block fractal image coding is the root mean square (RMS) error; the closest match between a range block and a transformed domain block is found using the RMS distortion measure. The third is the selection of a set of contractive transformations defined consistently with a partition.
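The matching step of PIFS coding can be sketched as below: for one range block, every candidate domain block (assumed here to be already downsampled to the range-block size) is fitted with a least-squares gray-level map s·D + o, and the pair with the smallest RMS error wins. The geometric isometries and the quantization of s and o used in practical fractal coders are omitted, and the function name is illustrative.

    import numpy as np

    def best_domain_match(range_block, domain_blocks):
        r = range_block.astype(float).ravel()
        best = None
        for idx, dom in enumerate(domain_blocks):
            d = dom.astype(float).ravel()
            var = d.var()
            # least-squares contrast s and brightness o for r ~ s*d + o
            s = ((d - d.mean()) * (r - r.mean())).mean() / var if var > 0 else 0.0
            o = r.mean() - s * d.mean()
            rms = np.sqrt(((s * d + o - r) ** 2).mean())
            if best is None or rms < best[0]:
                best = (rms, idx, s, o)
        return best   # (RMS error, domain block index, scale s, offset o)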

It should be noted that PIFS-based fractal image coding has several features similar to image VQ. Both coding schemes are block-based and need a codebook for encoding; for PIFS-based fractal image coding, the domain blocks can be seen as forming a virtual codebook. One difference is that fractal image coding does not need to transmit the codebook data (the domain blocks) to the decoder, whereas VQ does. The second difference is the block size: for VQ, the block sizes of the code vector and the input vector are the same, while in PIFS fractal coding the size of the domain block differs from the size of the range blocks. Another difference is that in fractal image coding the image itself serves as the codebook, while this is not true for VQ image coding.

9.3.3 Other Fractal Image Coding Methods

Apart from the IFS-based fractal image coding, there are several other fractal image codingmethods. One is the segmentation-based coding scheme using fractal dimension. In thismethod, the image is segmented into regions based on the properties of the HVS. Theimage is segmented into the regions; each of these regions is homogeneous in the sense ofhaving similar features in visual perception. This is different from the traditional imagesegmentation techniques that try to segment an image into regions of constant intensity.For complicated image, good representation of an image needs a large number of smallsegmentations. However, to obtain high compression ratio, the number of segmentationsis limited. The trade-off between image quality and bit rate has to be considered.A parameter, fractal dimension, is used as a measure to control the trade-off. Fractaldimension is a characteristic of a fractal. It is related to a metric property such as the lengthof a curve and the area of a surface. The fractal dimension can provide a good measure of


the perceptual roughness of curves and surfaces. For example, when a curve is approximated by many straight-line segments, the longer the segments used, the rougher the represented curve appears perceptually.
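As an illustration of how such a dimension can be estimated in practice, the following is a minimal box-counting sketch in C; the binary input image, the particular box sizes, and the log-log regression are assumptions made for illustration and are not taken from [jang 1990].

/* Box-counting estimate of a fractal dimension (illustrative sketch; a binary
 * edge/curve image is assumed, and the dimension is taken as the slope of
 * log N(s) versus log(1/s) over a few box sizes). */
#include <math.h>

/* Count boxes of size s that contain at least one set pixel. */
static long count_boxes(const unsigned char *bin, int w, int h, int s)
{
    long n = 0;
    for (int y = 0; y < h; y += s)
        for (int x = 0; x < w; x += s) {
            int hit = 0;
            for (int i = 0; i < s && y + i < h && !hit; i++)
                for (int j = 0; j < s && x + j < w; j++)
                    if (bin[(y + i) * w + x + j]) { hit = 1; break; }
            n += hit;
        }
    return n;
}

/* Least-squares slope of log N versus log(1/s) for box sizes 2, 4, ..., 32. */
double fractal_dimension(const unsigned char *bin, int w, int h)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    int k = 0;
    for (int s = 2; s <= 32; s *= 2, k++) {
        double xi = log(1.0 / s);
        double yi = log((double)count_boxes(bin, w, h, s));
        sx += xi; sy += yi; sxx += xi * xi; sxy += xi * yi;
    }
    return (k * sxy - sx * sy) / (k * sxx - sx * sx);
}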

9.4 Model-Based Coding

9.4.1 Basic Concept

In model-based coding, an image model, which can be a 2-D model for still images or a 3-D model for video sequences, is first constructed. At the encoder, the model is used to analyze the input image. The model parameters are then transmitted to the decoder. At the decoder, the reconstructed image is synthesized from the model parameters with the same image model used at the encoder. This basic idea of model-based coding is shown in Figure 9.8. The basic techniques in model-based coding are therefore image modeling, image analysis, and image synthesis; both image analysis and synthesis are based on the image model. The image modeling techniques used for image coding can normally be divided into two classes: structure modeling and motion modeling. Motion modeling is usually used for video sequences and moving pictures, whereas structure modeling is usually used for still image coding. The structure model is used for reconstruction of a 2-D or 3-D scene model.

9.4.2 Image Modeling

The geometric model is usually used for image structure description. Geometric models can be classified into surface-based and volume-based descriptions. The major advantage of the surface-based description is that it is easily converted into a surface representation that can be encoded and transmitted. In these models, the surface is approximated by planar polygonal patches such as triangular patches. The surface shape is represented by a set of points that are the vertices of these triangular meshes. The size of the triangular patches can be adjusted according to the surface complexity. In other words, for more complicated areas, more triangular meshes are needed to approximate the surface, whereas for smooth areas the mesh size can be larger, so fewer vertices are needed to represent the surface. The volume-based description is a natural approach for modeling most solid real-world objects. Most existing research work on

FIGURE 9.8 Basic principle of model-based coding. (Encoder: input image → image analysis → model parameter encoder → to channel; decoder: from channel → model parameter decoder → image synthesis → reconstructed image; an image model is shared by the analysis and synthesis blocks.)


volume-based description focuses on the parametric volume description. The volume-based description is, of course, used for 3-D objects or video sequences.

However, model-based coding is successfully applicable only to certain kinds of images, since it is very hard to find general image models suitable for most natural scenes. The few successful examples of image models include the human face, head, and body. These models have been developed for analysis and synthesis of moving images. Face animation has been adopted by MPEG-4 visual coding, and body animation is under consideration for version 2 of MPEG-4 visual coding.

9.5 Summary

In this chapter, three kinds of image coding techniques, VQ, fractal image coding, and model-based coding, which are not used in the current standards, have been presented. All three techniques have several important features, such as a very high compression ratio for certain kinds of images and a very simple decoding procedure (especially for VQ). However, due to some limitations, these techniques have not been adopted by industry standards. It should be noted that recently the facial model, i.e., the face animation technique, has been adopted by the MPEG-4 visual standard [mpeg4 visual].

Exercises

1. In the modified residual VQ described in Equation 9.5, with a 4 × 4 block size and 8 bits for each pixel of the original image, suppose we use 8 bits each for coding the block mean and the block variance. If we want the final bit rate to be 2 bits/pixel, what codebook size must be used for coding the residual, assuming that fixed-length coding (FLC) is used for the vector indices?

2. In the block truncation coding (BTC) described in Equation 9.7, what is the bit rate for a block size of 4 × 4 if the mean and variance are both encoded with 8 bits? Do you have any suggestions for reducing the bit rate without seriously affecting the reconstruction quality?

3. Is the codebook generated with the LBG algorithm a local optimum? List several important factors that affect the quality of codebook generation.

4. In image coding using VQ, what kinds of problems will be caused by using a codebook in practical applications (NB: changing bit rate)?

5. What is the most important improvement of lattice VQ over traditional VQ in practical applications? What is the key issue for lattice VQ in image coding applications?

6. Write a subroutine to generate a fern leaf (using C).
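As a starting point for Exercise 6, the following minimal sketch iterates the classical Barnsley fern IFS (four affine maps selected at random with fixed probabilities); the coefficients are the commonly published values, and instead of any particular plotting routine the attractor points are simply printed.

/* Fern-leaf generation sketch (illustrative; the affine coefficients are the
 * commonly published Barnsley fern IFS, and the output is printed as x,y pairs
 * rather than rendered). */
#include <stdio.h>
#include <stdlib.h>

/* Each row: a b c d e f  for  x' = a*x + b*y + e,  y' = c*x + d*y + f. */
static const double map[4][6] = {
    { 0.00,  0.00,  0.00, 0.16, 0.0, 0.00 },   /* stem,        p = 0.01 */
    { 0.85,  0.04, -0.04, 0.85, 0.0, 1.60 },   /* main frond,  p = 0.85 */
    { 0.20, -0.26,  0.23, 0.22, 0.0, 1.60 },   /* left frond,  p = 0.07 */
    {-0.15,  0.28,  0.26, 0.24, 0.0, 0.44 }    /* right frond, p = 0.07 */
};
static const double prob[4] = { 0.01, 0.85, 0.07, 0.07 };

void fern(long points)
{
    double x = 0.0, y = 0.0;
    for (long n = 0; n < points; n++) {
        /* Pick a map at random according to its probability. */
        double r = (double)rand() / RAND_MAX, acc = 0.0;
        int k = 3;
        for (int i = 0; i < 4; i++) { acc += prob[i]; if (r <= acc) { k = i; break; } }
        double nx = map[k][0]*x + map[k][1]*y + map[k][4];
        double ny = map[k][2]*x + map[k][3]*y + map[k][5];
        x = nx; y = ny;
        if (n > 10)                    /* skip the transient before the attractor */
            printf("%f %f\n", x, y);
    }
}

int main(void) { fern(100000); return 0; }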

References

[baker 1983] R.L. Baker and R.M. Gray, Image compression using non-adaptive spatial vector quantization, International Symposium on Circuits and Systems (ISCAS'83), 1983, pp. 55–61.

[barnsley 1988] M.F. Barnsley and A.E. Jacquin, Application of recurrent iterated function systems, SPIE, 1001, Visual Communications and Image Processing, 122–131, 1988.

[barnsley 1993] M.F. Barnsley and L.P. Hurd, Fractal Image Compression, AK Peters, Wellesley, MA, 1993.


[conway 1983] J. Conway and N.J.A. Sloane, A fast encoding method for lattice codes and quantizers, IEEE Transactions on Information Theory, IT-29, 820–824, 1983.

[conway 1991] J. Conway and N.J.A. Sloane, Sphere Packings, Lattices and Groups, Springer-Verlag, New York, 1991.

[delp 1979] E.J. Delp and O.R. Mitchell, Image compression using block truncation coding, IEEE Transactions on Communications, COM-27, 9, 1335–1342, September 1979.

[dunham 1985] M. Dunham and R. Gray, An algorithm for the design of labelled-transition finite-state vector quantizers, IEEE Transactions on Communications, COM-33, 83–89, May 1985.

[equitz 1989] W.H. Equitz, A new vector quantization clustering algorithm, IEEE Transactions on ASSP, 37, 1568–1575, October 1989.

[fischer 1986] T.R. Fischer, A pyramid vector quantizer, IEEE Transactions on Information Theory, IT-32, 568–583, 1986.

[fisher 1994] Y. Fisher, Fractal Image Compression: Theory and Application, Springer-Verlag, New York, 1994.

[foster 1985] J. Foster, R.M. Gray, and M.O. Dunham, Finite-state vector quantization for waveform coding, IEEE Transactions on Information Theory, IT-31, 348–359, May 1985.

[gersho/ramamurthi 1982] A. Gersho and B. Ramamurthi, Image coding using vector quantization, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'82), May 1982, pp. 428–431.

[gersho 1982] A. Gersho, On the structure of vector quantizers, IEEE Transactions on Information Theory, IT-28, 157–166, March 1982.

[hang 1985] H.M. Hang and J.W. Woods, Predictive vector quantization of images, IEEE Transactions on Communications, COM-33, 1208–1219, November 1985.

[jacquin 1993] A.E. Jacquin, Fractal image coding: A review, Proceedings of the IEEE, 81, 10, 1451–1465, October 1993.

[jang 1990] J. Jang and S.A. Rajala, Segmentation-based image coding using fractals and the human visual system, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 1957–1960.

[laroia 1993] R. Laroia and N. Farvardin, A structured fixed rate vector quantizer derived from a variable length scalar quantizer: I & II, IEEE Transactions on Information Theory, IT-39, 851–876, 1993.

[li 1994] H. Li, A. Lundmark, and R. Forchheimer, Image sequence coding at very low bitrates: A review, IEEE Transactions on Image Processing, 3, 5, 589–604, September 1994.

[li 1995] W. Li and Ya-qin Zhang, Vector-based signal processing and quantization for image and video compression, Proceedings of the IEEE, 83, 2, 317–335, February 1995.

[li 1997] W. Li et al., A video coding algorithm using vector-based technique, IEEE Transactions on Circuits and Systems for Video Technology, 7, 1, 146–157, February 1997.

[linde 1980] Y. Linde, A. Buzo, and R.M. Gray, An algorithm for vector quantizer design, IEEE Transactions on Communications, 28, 84–95, 1980.

[lu 1993] G. Lu, Fractal image compression, Signal Processing: Image Communication, 5, 327–343, 1993.

[mpeg4 visual] ISO/IEC 14496-2, Coding of audio-visual objects, Part 2, December 18, 1998.

[murakami 1982] T. Murakami, K. Asai, and E. Yamazaki, Vector quantization of video signals, Electronics Letters, 7, 1005–1006, November 1982.

[nasrabadi 1988] N.M. Nasrabadi and R.A. King, Image coding using vector quantization: A review, IEEE Transactions on Communications, COM-36, 8, 957–971, August 1988.

[stewart 1982] L.C. Stewart, R.M. Gray, and Y. Linde, The design of trellis waveform coders, IEEE Transactions on Communications, COM-30, 702–710, April 1982.

[sun 1984] H. Sun and M. Goldberg, Image coding using LPC with vector quantization, Proceedings of the IEEE International Conference on Digital Signal Processing, Florence, Italy, September 1984, pp. 508–512.

[udpikar 1987] V.R. Udpikar and J.P. Raina, BTC image coding using vector quantization, IEEE Transactions on Communications, COM-35, 352–356, March 1987.

[walach 1986] E. Walach and E. Karnin, A fractal-based approach to image compression, IEEE International Conference on Acoustics, Speech, and Signal Processing, 1986, pp. 529–532.

[wang 1998] C. Wang, H.Q. Cao, W. Li, and K.K. Tzeng, Lattice labeling algorithm for vector quantization, IEEE Transactions on Circuits and Systems for Video Technology, 8, 2, 206–220, April 1998.


Part III

Motion Estimation and Compensation


10 Motion Analysis and Motion Compensation

The basic techniques in image coding, specifically techniques utilized in still image coding, were discussed in the previous chapters. From this chapter on, we start to address the issue of video sequence compression. To fulfill this task, in this chapter we first define the concepts of image and video sequences. Then we address the issue of interframe correlation between successive frames. Next, two techniques that exploit interframe correlation, frame replenishment and motion compensated (MC) coding, are discussed. The rest of the chapter covers the concepts of motion analysis and motion compensation in general.

10.1 Image Sequences

In this section, the concept of various image sequences is defined in a theoretical and systematic manner. The relationship between image sequences and video sequences is also discussed.

It is well known that in the 1960s, the advent of the semiconductor computer and the space program swiftly brought the field of digital image processing into public focus. Since then, the field has experienced rapid growth and has entered every aspect of modern technology. Since the early 1980s, digital image sequence processing has been an attractive research area [huang 1981a, 1983]. This is not surprising, because an image sequence, as a collection of images, may provide more information than a single image frame. The increased computational complexity and memory space associated with image sequence processing are becoming more affordable due to more advanced, achievable computational capability. With the tremendous advancements continuously made in VLSI computer and information processing, image and video sequences are ever more indispensable elements of modern life. Although the pace and the future of this development cannot be predicted, one thing is certain: this process is going to drastically change all aspects of our world in the next several decades.

As far as image sequence processing is concerned, it is noted that in addition to temporal image sequences, stereo image pairs and stereo image sequences also attracted attention in the mid-1980s [waxman 1986]. The concepts of temporal and spatial image sequences, and the imaging space (which may be considered a next higher-level unification of temporal and spatial image sequences), may be illustrated as follows.

Consider a sensor located at a specific position in the three-dimensional (3-D) world space. It generates images of the scene, one after another. As time goes by, the images form a sequence. The set of these images can be represented with a brightness function g(x, y, t), where x and y are coordinates on the image plane. This is referred to as a temporal image sequence. This is the basic outline of the brightness function g(x, y, t) dealt with by researchers in both the computer vision [e.g., horn 1980] and signal processing [e.g., pratt 1979] fields.


Now consider a generalization of the above basic outline. A sensor, as a solid article, can be translated (in three free dimensions) and rotated (in two free dimensions). It is noted that the rotation of a sensor about its optical axis is not counted because the images generated will remain unchanged when this type of rotation takes place. Thus, we can obtain a variety of images when a sensor is translated to different coordinates and rotated to different angles in the 3-D world space. Equivalently, we can imagine that there is an infinite number of sensors in the 3-D world space, occupying all possible spatial coordinates and assuming all possible orientations at each coordinate; i.e., they are located at all possible positions. At one specific moment, all of these images form a set, which can be referred to as a spatial image sequence. When time varies, these sets of images form a much larger set of images, called an imaging space.

Clearly, it is impossible to describe such a set of images by using the above-mentioned g(x, y, t). Instead, it should be described by a more general brightness function,

g(x, y, t, s*),  (10.1)

where s* indicates the sensor's position in the 3-D world space, i.e., the coordinates of the sensor center and the orientation of the optical axis of the sensor. Hence s* is a five-dimensional (5-D) vector. That is,

s* = (x̃, ỹ, z̃, β, γ),  (10.2)

where x̃, ỹ, and z̃ represent the coordinates of the optical center of the sensor in the 3-D world space, and β and γ represent the orientation of the optical axis of the sensor in the 3-D world space. More specifically, each sensor in the 3-D world space may be considered to be associated with a 3-D Cartesian coordinate system such that its optical center is located at the origin and its optical axis is aligned with the OZ axis. In the 3-D world space, we choose a 3-D Cartesian coordinate system as the reference coordinate system. Hence, a sensor whose Cartesian coordinate system coincides with the reference coordinate system has its position in the 3-D world space denoted by s* = (0, 0, 0, 0, 0). An arbitrary sensor position denoted by s* = (x̃, ỹ, z̃, β, γ) can be described as follows. The sensor's associated Cartesian coordinate system is first shifted from the reference coordinate system in the 3-D world space so that its origin settles at (x̃, ỹ, z̃) in the reference coordinate system. Then it is rotated, with the rotation angles β and γ being the same as Euler angles [shu 1991; shi 1994]. Figure 10.1 shows the reference coordinate system and an arbitrary Cartesian coordinate system (indicating an arbitrary sensor position). There, oxy and o′x′y′ represent, respectively, the related image planes.

Assume now a world point P in the 3-D space that is projected onto the image plane as a pixel with the coordinates xP and yP. Then, xP and yP are also dependent on t and s*. That is, the coordinates of the pixel can be denoted by xP = xP(t, s*) and yP = yP(t, s*). So, generally speaking, we have

g = g(xP(t, s*), yP(t, s*), t, s*).  (10.3)

As far as temporal image sequences are concerned, let us take a look at the framework of Pratt [pratt 1979], and Horn and Schunck [horn 1980]. There, g = g(xP(t), yP(t), t) is actually a special case of Equation 10.3. That is,

g = g(xP(t, s* = constant vector), yP(t, s* = constant vector), t, s* = constant vector).

In other words, the variation of s* is restricted to be zero, i.e., Δs* = 0. This means the sensor is fixed in a certain position in the 3-D world space.


FIGURE 10.1 Two sensors' positions: s* = (0, 0, 0, 0, 0) and s* = (x̃, ỹ, z̃, β, γ). (The reference coordinate system OXYZ with image plane oxy, and a shifted and rotated system O′X′Y′Z′ with image plane o′x′y′, whose origin is at (x̃, ỹ, z̃).)

Obviously, an alternative is to define the imaging space as the set of all temporal image sequences, i.e., those taken by sensors located at all possible positions in the 3-D world space. Stereo image sequences can thus be viewed as a proper subset of the imaging space, just as a stereo pair of images can be considered a proper subset of a spatial image sequence.

In summary, the imaging space is a collection of all possible forms assumed by the general brightness function g(x, y, t, s*). Each picture, taken by a sensor located at a particular position at a specific moment, is merely a special cross section of this imaging space. Both temporal and spatial image sequences are special proper subsets of the imaging space. They are at the middle level, between the imaging space and the individual images. This hierarchical structure is depicted in Figure 10.2.

Before concluding this section, we discuss the relationship between image sequences and video sequences. It is noted that the term video is used very often nowadays in addition to the terms image frames and sequences. It is necessary to pause for a while to discuss the relationship between these terms. Image frames and sequences have been defined clearly above with the introduction of the concept of the imaging space. Video can mean an individual video frame or video sequences. It refers, however, to those frames and sequences that are associated with the visible frequency band in the electromagnetic spectrum. For image frames and sequences, there is no such restriction. For instance, infrared image frames and sequences correspond to a band outside the visible band in the spectrum. From this point of view, the scope of image frames and sequences is wider than that of video frames and sequences. When the visible band is concerned, the terms image frame and sequence are interchangeable with video frame and sequence.

Another point we would like to bring to readers' attention is as follows. Although video is referred to as visual information, which includes both a single frame and frame


FIGURE 10.2 A hierarchical structure. (Top level: the imaging space g(x, y, t, s*). Intermediate level: temporal image sequences g(x, y, t, s* = a specific vector) and spatial image sequences g(x, y, t = a specific moment, s*). Bottom level: individual images g(x, y, t = a specific moment, s* = a specific vector).)

sequences, in practice it is often used to mean sequences exclusively. Such an example can be found in the book entitled Digital Video Processing by Tekalp [tekalp 1995].

In this book, we use image compression to indicate still image compression, and video compression to indicate video sequence compression. Readers should keep in mind, however, that first, video can mean a single frame or sequences of frames; second, the scope of image is wider than that of video, and video is more pertinent to multimedia engineering.

10.2 Interframe Correlation

As far as video compression is concerned, all the techniques discussed in the previous chapters are applicable. By this we mean two classes of techniques. The first class, which is also the most straightforward way to handle video compression, is to code each frame separately; that is, individual frames are coded independently of each other. For instance, using a JPEG compression algorithm to code each frame in a video sequence results in motion JPEG [westwater 1997]. In the second class, methods utilized for still image coding are generalized for video compression. For instance, discrete cosine transform (DCT) coding can be generalized and applied to video coding by extending the two-dimensional (2-D) DCT to a 3-D DCT. That is, instead of a 2-D DCT, say 8 × 8, applied to a single image frame, we can apply a 3-D DCT, say 8 × 8 × 8, to a video sequence; see Figure 10.3. That is, eight 8 × 8 blocks, each located at the same position in one of eight successive frames of a video sequence, are coded together with the 3-D DCT. It was reported that this 3-D DCT technique is quite efficient [lim 1990; westwater 1997]. In addition, the differential pulse code modulation (DPCM) technique and the hybrid technique can be generalized and applied to video compression in a similar fashion [jain 1989; lim 1990]. It is noted that in the second class of techniques, several successive frames are grouped and coded together, whereas in the first class each frame is coded independently.
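To make the extension from the 2-D to the 3-D DCT concrete, the following is a minimal sketch of a separable 8 × 8 × 8 DCT in C. The direct (slow) 1-D DCT-II and the cube layout are illustrative assumptions; a practical coder would use a fast transform.

/* Sketch of a separable 8x8x8 3-D DCT applied to a cube of eight co-located
 * 8x8 blocks taken from eight successive frames (illustrative only). */
#include <math.h>

#define N 8
static const double PI = 3.14159265358979323846;

/* Direct 1-D DCT-II of length N with orthonormal scaling. */
static void dct1d(const double in[N], double out[N])
{
    for (int k = 0; k < N; k++) {
        double c = (k == 0) ? sqrt(1.0 / N) : sqrt(2.0 / N);
        double sum = 0.0;
        for (int n = 0; n < N; n++)
            sum += in[n] * cos(PI * (2 * n + 1) * k / (2.0 * N));
        out[k] = c * sum;
    }
}

/* 3-D DCT of cube[t][y][x]: 1-D transforms along x, then y, then t. */
void dct3d(double cube[N][N][N])
{
    double in[N], out[N];
    for (int t = 0; t < N; t++)                 /* along x */
        for (int y = 0; y < N; y++) {
            for (int x = 0; x < N; x++) in[x] = cube[t][y][x];
            dct1d(in, out);
            for (int x = 0; x < N; x++) cube[t][y][x] = out[x];
        }
    for (int t = 0; t < N; t++)                 /* along y */
        for (int x = 0; x < N; x++) {
            for (int y = 0; y < N; y++) in[y] = cube[t][y][x];
            dct1d(in, out);
            for (int y = 0; y < N; y++) cube[t][y][x] = out[y];
        }
    for (int y = 0; y < N; y++)                 /* along t (temporal axis) */
        for (int x = 0; x < N; x++) {
            for (int t = 0; t < N; t++) in[t] = cube[t][y][x];
            dct1d(in, out);
            for (int t = 0; t < N; t++) cube[t][y][x] = out[t];
        }
}

Dropping the coefficients with high temporal-frequency indices, as mentioned above, is then what provides the compression.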


FIGURE 10.3 A 3-D discrete cosine transform (DCT) of 8 × 8 × 8. (Eight co-located 8 × 8 blocks from eight successive frames, numbered 1 through 8.)

Video compression has its own characteristics, however, which make it quite different from still image compression. The major difference lies in the exploitation of the interframe correlation that exists between successive frames in video sequences, in addition to the intraframe correlation that exists within each frame. As mentioned in Chapter 1, interframe correlation is also referred to as temporal redundancy, while intraframe correlation is referred to as spatial redundancy. To achieve coding efficiency, we need to remove these redundancies for video compression. To do so, we must first understand them.

Consider a video sequence taken in a videophone service, where the camera is static most of the time. A typical scene is a head-and-shoulders view of a person imposed on a background. In this type of video sequence the background is usually static; only the speaker experiences motion, which is not severe. Therefore, there is a strong similarity between successive frames, that is, a strong adjacent-frame correlation. In other words, there is a strong interframe correlation. It was reported in [mounts 1969] that with videophone-like signals containing moderate motion in the scene, on average, less than one-tenth of the elements change between frames by an amount exceeding 1% of the peak signal. Here, a 1% change is regarded as significant. Our experiment on the first 40 frames of the Miss America sequence supports this observation. Two successive frames of the sequence, frames 24 and 25, are shown in Figure 10.4.

Now, consider a video sequence generated in a television broadcast. It is well known that television signals are generated with a scene scanned in a particular manner to maintain a steady picture for a human being to view, regardless of whether there is a scenery change or not. That is, even if there is no change from one frame to the next, the scene is still scanned constantly. Hence there is a great deal of frame-to-frame correlation [haskell 1972b; netravali 1979]. In TV broadcasts, the camera is most likely not static, and it may be panned, tilted, and zoomed. Furthermore, more movement is involved in the scene. As long as the TV frames are taken densely enough, most of the time the changes between successive frames are due mainly to the apparent motion of the objects in the scene that takes place during the frame intervals. This implies that there is also a high


FIGURE 10.4 Two frames of the Miss America sequence: (a) frame 24, (b) frame 25.

correlation between sequential frames. In other words, there is an interframe redundancy (interpixel redundancy between pixels in successive frames). There is more correlation between television picture elements along the frame-to-frame temporal dimension than there is between adjacent elements in a single frame along the spatial dimension. That is, there is generally more interframe correlation than intraframe correlation. Taking advantage of the interframe correlation, i.e., eliminating or decreasing the uncertainty of successive frames, leads to video data compression. This is analogous to the case of still image coding with the DPCM technique, where we can predict part of an image by knowing the other part. Now the knowledge of the previous frames can remove the uncertainty of the next frame. In both cases, knowledge of the past removes the uncertainty of the future, leaving less actual information to be transmitted [kretzmer 1952]. In Chapter 16, the words ''past'' and ''future'' used here are changed, respectively, to ''some frames'' and ''some other frames'' in advanced video coding techniques, such as MPEG. There, a frame might be predicted from both its earlier frames and its future frames.

At this point, it becomes clear that the second class of techniques (Section 10.2), which generalizes techniques originally developed for still image coding and applies them to video coding, exploits interframe correlation. For instance, in the case of the 3-D DCT technique, a strong temporal correlation causes an energy compaction within the low temporal frequency region. The 3-D DCT technique drops transform coefficients associated with high temporal frequencies, thus achieving data compression.

The two techniques specifically developed to exploit interframe redundancy, i.e., frame replenishment and MC coding, are introduced below. The former is the early work, whereas the latter is the more popular recent work.

10.3 Frame Replenishment

As mentioned in Chapter 3, frame-to-frame redundancy has long been recognized in TV signal compression. The first few experiments on frame sequence coders exploiting interframe redundancy may be traced back to the 1960s [seyler 1962, 1965; mounts 1969]. In [mounts 1969] the first real demonstration was presented and was termed conditional


replenishment. This frame replenishment technique can be briefly described as follows. Each pixel in a frame is classified into changing or unchanging areas depending on whether or not the intensity difference between its present value and its previous one (the intensity value at the same position in the previous frame) exceeds a threshold. If the difference does exceed the threshold, i.e., a significant change has been identified, the address and intensity of this pixel are coded, stored in a buffer, and then transmitted to the receiver to replenish the intensity. For those unchanging pixels, nothing is coded or transmitted; their earlier intensities are repeated at the receiver. It is noted that the buffer is utilized to present the information to the transmission channel at a smooth bit rate. The threshold is chosen to make the average replenishment rate match the channel capacity.
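A minimal sketch of the conditional replenishment idea just described is given below; the threshold value, the simple address-plus-value update format, and the function names are assumptions, and the buffering used for rate smoothing is omitted.

/* Conditional replenishment sketch (illustrative; format and names are
 * assumptions). Pixels whose intensity change exceeds the threshold are
 * emitted as (address, new value) pairs; all other pixels are repeated at the
 * receiver from the previous frame. */
#include <stdlib.h>

typedef struct { int addr; unsigned char value; } Update;

/* Encoder side: returns the number of updates written to out[]. */
int replenish_encode(const unsigned char *prev, const unsigned char *curr,
                     int npixels, int threshold, Update *out)
{
    int n = 0;
    for (int i = 0; i < npixels; i++) {
        int diff = abs((int)curr[i] - (int)prev[i]);
        if (diff > threshold) {           /* changing area: encode and send */
            out[n].addr = i;
            out[n].value = curr[i];
            n++;
        }                                  /* unchanging area: send nothing  */
    }
    return n;
}

/* Decoder side: start from the previously reconstructed frame and patch it. */
void replenish_decode(unsigned char *recon, const Update *upd, int n)
{
    for (int i = 0; i < n; i++)
        recon[upd[i].addr] = upd[i].value;
}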

Since the replenishment technique only encodes those pixels whose intensity values have changed significantly between successive frames, its coding efficiency is much higher than that of coding techniques which encode every pixel of every frame, say, the DPCM technique applied to each single frame. In other words, by utilizing interframe correlation, the replenishment technique achieves a lower bit rate while keeping equivalent reconstructed image quality.

Much effort has been made to further improve this type of simple replenishment algorithm. As mentioned in the discussion of 3-D DPCM in Chapter 3, for instance, it was soon realized that the intensity values of pixels in a changing area need not be transmitted independently of one another. Instead, using both spatial and temporal neighbors' intensity values to predict the intensity value of a changing pixel leads to a frame-difference predictive coding technique. There, the differential signal is coded instead of the original intensity values, thus achieving a lower bit rate (refer to Section 3.5.2 for more detail). Another example of the improvements is that measures have been taken to distinguish intensity differences caused by noise from those associated with genuine changes, in order to avoid the dirty window effect, whose meaning is given in the next paragraph. For more detailed information on these improvements over the simple frame replenishment technique, readers are referred to two excellent reviews [haskell 1972b, 1979].

The main drawback associated with the frame replenishment technique is that it is difficult to handle frame sequences containing more rapid changes. When there are more rapid changes, the number of pixels whose intensity values need to be updated increases. To maintain the transmission bit rate at a steady and proper level, the threshold has to be raised, which causes many slow changes that cannot show up at the receiver. This poorer reconstruction at the receiver is somewhat analogous to viewing a scene through a dirty window, and is referred to as the dirty window effect. The result of one experiment on the dirty window effect is displayed in Figure 10.5. From frames 22 to 25 of the Miss America sequence, there are 2166 pixels (less than 10% of the total pixels) which change their gray level values by more than 1% of the peak signal. When we only update the gray level values for 25% (randomly chosen) of these changing pixels, we can clearly see the dirty window effect. When rapid scene changes exceed a certain level, buffer saturation will result, causing picture breakup [mounts 1969]. MC coding, which is discussed below, has been proved to provide better performance than the replenishment technique in situations with rapid changes.

10.4 Motion Compensated Coding

In addition to the frame-difference predictive coding technique (a variant of the frame replenishment technique discussed above), another technique, displacement-based predictive coding, was developed at almost the same time [rocca 1969; haskell 1972a]. In this


FIGURE 10.5 Dirty window effect.

technique, a motion model is assumed. That is, the changes between successive frames are considered to be due to the translation of moving objects in the image planes. Displacement vectors of objects are first estimated. Differential signals between the intensity values of the picture elements in the moving areas and those of their counterparts in the previous frame, which are translated by the estimated displacement, are encoded. This approach, which takes motion into account to compress video sequences, is referred to as motion compensated predictive coding. It was found to be much more efficient than the frame-difference prediction technique.

To understand the above statement, let us take a look at the diagram shown in Figure 10.6. Assume a car translating from the right side to the left side of the image plane at a uniform speed during the time interval between two consecutive image frames. Other than this, there are no movements or changes in the frames. Under this circumstance, if we know the displacement vector of the car in the image plane during the time interval between the two consecutive frames, we can then predict the position of the car in the latter frame from its position in the former frame. One may think that if the translation vector is estimated well, then so is the prediction of the car position. This is true. In reality, however, estimation

FIGURE 10.6 Two consecutive frames of a video sequence: (a) at time tn−1, (b) at time tn.


errors occurring in the determination of the motion vector, which may be caused by various noises existing in the frames, may cause the predicted position of the car in the latter frame to differ from its actual position in that frame.

The above translational model is a very simple one; it cannot accommodate motions other than translation, say, rotation and camera zooming. Occlusion and disocclusion of objects make the situation even more complicated, because in the case of occlusion some portions of the images may disappear, whereas in the case of disocclusion some newly exposed areas may appear. Therefore, prediction error is almost inevitable. To have good-quality frames at the receiver, we can find the prediction error by subtracting the predicted version of the latter frame from its actual version. If we encode both the displacement vectors and the prediction error, and transmit the data to the receiver, we may be able to obtain high-quality reconstructed images at the receiver. This is because at the receiving end, using the displacement vectors transmitted from the transmitter and the reconstructed former frame, we can predict the latter frame. Adding the transmitted prediction error to the predicted frame, we can reconstruct the latter frame with satisfactory quality. Furthermore, by manipulating the procedure properly, we are able to achieve data compression.

The displacement vectors are referred to as side or overhead information to indicate their auxiliary nature. It is noted that motion estimation drastically increases the computational complexity of the coding algorithm. In other words, higher coding efficiency is obtained in MC coding, but with a higher computational burden. As pointed out in Section 10.1, this is both technically feasible and economically desirable because the cost of digital signal processing decreases much faster than that of transmission [dubois 1981].

MC video compression has been a major development in coding since then. For more information, readers should refer to several excellent survey papers [musmann 1985; zhang 1995; kunt 1995].

The common practice of MC coding in video compression can be split into the following three stages. First is the motion analysis stage; that is, displacement vectors for either every pixel or a set of pixels in the image planes are estimated from sequential images. Second, the present frame is predicted by using the estimated motion vectors and the previous frame, and the prediction error is then calculated; this stage is called prediction and differentiation. The third stage is encoding. The prediction error (the difference between the present frame and the predicted present frame) and the motion vectors are encoded. Through appropriate manipulation, the total amount of data for both the motion vectors and the prediction error is expected to be much less than the raw data in the image frames, thus resulting in data compression. A block diagram of MC coding is shown in Figure 10.7.
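A minimal sketch of the second stage, prediction and differentiation, is given below; it assumes 16 × 16 blocks with one motion vector per block, takes the vectors as given from the motion analysis stage, and assumes the vectors keep each block inside the previous frame.

/* Prediction-and-differentiation sketch for MC coding (illustrative; 16x16
 * blocks and one motion vector per block are assumptions). The prediction
 * error, not the frame itself, is what gets encoded in the third stage. */
#define B 16

typedef struct { int dx, dy; } MV;

/* prev, curr: frames of size w x h; mv: one vector per 16x16 block, laid out
 * row by row; err: output prediction error (current minus predicted). */
void mc_predict_and_diff(const unsigned char *prev, const unsigned char *curr,
                         int w, int h, const MV *mv, int *err)
{
    int blocks_per_row = w / B;
    for (int by = 0; by < h / B; by++)
        for (int bx = 0; bx < blocks_per_row; bx++) {
            MV v = mv[by * blocks_per_row + bx];
            for (int i = 0; i < B; i++)
                for (int j = 0; j < B; j++) {
                    int y = by * B + i, x = bx * B + j;
                    /* matched position in the previous frame (assumed in range) */
                    int py = y + v.dy, px = x + v.dx;
                    int pred = prev[py * w + px];
                    err[y * w + x] = (int)curr[y * w + x] - pred;
                }
        }
}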

Before concluding this section, we compare the frame replenishment technique with the MC coding technique. Qualitatively speaking, from the above discussion, we see that the

FIGURE 10.7 Block diagram of motion compensated (MC) coding: motion analysis → prediction and differentiation → encoding.


replenishment technique is also a kind of predictive coding in nature. This is particularly true if we consider the frame-difference predictive technique used in frame replenishment. There, a pixel's intensity value in the previous frame is used as an estimator of its intensity value in the present frame. Now let us take a look at MC coding. Consider a pixel on the present frame. Through motion analysis, the MC technique finds its counterpart in the previous frame. That is, a pixel in the previous frame is identified such that it is supposed to translate to the position of the pixel under consideration on the present frame during the time interval between successive frames. This counterpart's intensity value is used as an estimator of that of the pixel under consideration. Therefore, we see that the model used for MC coding is much more advanced than that used for frame replenishment, and hence it achieves much higher coding efficiency. An MC coding technique that utilized the first pel recursive algorithm for motion estimation [netravali 1979] was reported to achieve a bit rate 22%–50% lower than that obtained by simple frame-difference prediction, a version of frame replenishment.

The more advanced model utilized in MC coding, on the other hand, leads to higher computational complexity. Consequently, both the coding efficiency and the computational complexity in MC coding are higher than those in frame replenishment.

10.5 Motion Analysis

As discussed above, we usually conduct motion analysis in video sequence compression. There, 2-D displacement vectors of a pixel or a group of pixels on the image planes are estimated from given image frames. Motion analysis can be viewed from a much broader point of view. It is well known that the vision systems of both human beings and animals observe the outside world to ascertain motion and to navigate themselves in the 3-D world space. Two groups of scientists study vision. Scientists in the first group, including psychophysicists, physicians, and neurophysiologists, study human and animal vision. Their goal is to understand biological vision systems: their operation, features, and limitations. Computer scientists and electrical engineers form the second group. As pointed out in [aggarwal 1988], their ultimate goal is to develop computer vision systems with the ability to navigate, recognize, and track objects, and estimate their speed and direction. Each group benefits from the research results of the other. The knowledge and results of research in psychophysics, physiology, and neurophysiology have influenced the design of computer vision systems. Simultaneously, the research results achieved in computer vision have provided a framework for modeling biological vision systems and have helped in remedying faults in biological vision systems. This process will continue to advance research in both groups, hence benefiting human beings.

10.5.1 Biological Vision Perspective

In the field of biological vision, most scientists consider motion perception as a two-step process, even though there is no ample biological evidence to support this view [singh 1991]. The two steps are measurement and interpretation. The first step measures the 2-D motion projected on the imaging surfaces. The second step interprets the 2-D motion to induce the 3-D motion and structure of the scene.

10.5.2 Computer Vision Perspective

In the field of computer vision, motion analysis from image sequences is traditionally split into two steps. In the first step, intermediate variables are derived. By intermediate


FIGURE 10.8 Feature extraction and correspondence from two consecutive frames in a temporal image sequence.

variables, we mean 2-D motion parameters in image planes. In the second step, 3-D motion variables, say, speed, displacement, position, and direction, are determined.

Depending on the different intermediate results, all approaches to motion analysis can basically be classified into two categories: feature correspondence and optical flow. In the former category, a few distinct features are first extracted from image frames. For instance, consider an image sequence containing an aircraft. Two consecutive frames are shown in Figure 10.8. The head and tail of the aircraft and the tips of its wings may be chosen as features. The correspondence of these features on successive image frames needs to be established. In the second step, 3-D motion can then be analyzed from the extracted features and their correspondence in successive frames. In the latter category, the intermediate variables are optical flow. An optical flow vector is defined as the velocity vector of a pixel on an image frame. An optical flow field refers to the collection of the velocity vectors of all the pixels on the frame. In the first step, optical flow vectors are determined from image sequences as the intermediate variables. In the second step, 3-D motion is estimated from the optical flow. It is noted that optical flow vectors are closely related to displacement vectors in that a velocity vector multiplied by the time interval between two consecutive frames results in the corresponding displacement vector. Optical flow and its determination are discussed in detail in Chapter 13.

It is noted that there is a so-called direct method in motion analysis. Contrary to the above optical flow approach, instead of determining the 2-D motion variables (i.e., the intermediate variables) prior to 3-D motion estimation, the direct method attempts to estimate 3-D motion without explicitly solving for the intermediate variables. In [huang 1981b], the equation characterizing displacement vectors in the 2-D image plane and the equation characterizing motion parameters in 3-D world space are combined so that the motion parameters in 3-D world space can be directly derived. This method has been utilized to recover structure (object surfaces) in 3-D world space as well [negahdaripour 1987; horn 1988; shu 1993]. The direct method has certain limitations; that is, if the geometry of the object surfaces is unknown in advance, then the method fails.

The feature correspondence approach is sometimes referred to as the discrete approach, while the optical flow approach is sometimes referred to as the continuous approach. This is because the correspondence approach concerns only a set of relatively sparse but highly discriminatory 2-D features on image planes, whereas the optical flow approach is concerned with a dense field of motion vectors.

It has been found that both feature extraction and correspondence establishment are not trivial tasks. Occlusion and disocclusion, which cause some features to disappear and some features to reappear, respectively, make feature correspondence even more difficult.


The development of robust techniques to solve the correspondence problem is an active research area and is still in its infancy. So far, only partial solutions suitable for simplistic situations have been developed [aggarwal 1988]. Hence the feature correspondence approach is rarely used in video compression, and therefore we do not discuss this approach any further.

Motion analysis (sometimes referred to as motion estimation or motion interpretation) from image sequences is necessary in automated navigation. It has played a central role in the field of computer vision since the late 1970s and early 1980s. A great number of papers presented at the International Conference on Computer Vision cover motion analysis and related topics. Many workshops, symposiums, and special sessions are organized around this subject [thompson 1989].

10.5.3 Signal Processing Perspective

In the field of signal processing, motion analysis is mainly considered in the context of bandwidth reduction and data compression in the transmission of visual signals. Therefore, instead of the motion in 3-D world space, only the 2-D motion in the image plane is of concern.

Because of the real-time nature of visual transmission, the motion model cannot be very complicated. So far, the 2-D translational model is the one most frequently assumed in the field. In the 2-D translational model, it is assumed that the change between a frame and its previous one is due to the motion of objects in the frame plane during the time interval between two consecutive frames. In many cases, as long as frames are taken densely enough, this assumption is valid. By motion analysis, we mean the estimation of translational motion, either the displacement vectors or the velocity vectors. Using this kind of motion analysis, one can apply the MC coding discussed above, making coding more efficient.

Basically, there are three techniques in 2-D motion analysis: correlation, recursive, and differential techniques. Philosophically speaking, the first two techniques belong to the same group: region matching.

Refer to Figure 10.6, where the moving car is the object under investigation. By motion analysis, we mean finding the displacement vector, i.e., a vector representing the relative positions of the car in the two consecutive frames. With region matching, one may consider the car (or a portion of the car) as a region of interest, and seek the best matching between the two regions in the two frames: specifically, the region in the present frame and the region in the previous frame. For identifying the best matching, the correlation and recursive techniques work differently in methodology. The correlation technique finds the best matching by searching for the maximum correlation between the two regions in a predefined search range, whereas the recursive technique estimates the best matching by recursively minimizing a nonlinear measure of the dissimilarity between the two regions.

A couple of comments are in order. First, it is noted that the most frequently used technique in motion analysis is called block matching, which is a type of correlation technique. There, a video frame is divided into nonoverlapped rectangular blocks, each block having the same size, usually 16 × 16. Each block thus generated is assumed to move as one; i.e., all pixels in a block share the same displacement vector. For each block, we find its best matching in the previous frame with correlation. That is, the block in the previous frame which gives the maximum correlation is identified. The relative position of these two best-matched blocks produces a displacement vector. This block matching technique is simple and very efficient, and will be discussed in detail in Chapter 11 (a sketch of the search loop is given after this paragraph). Second, as multimedia finds more and more applications, the regions occupied by arbitrarily shaped objects (no longer always rectangular blocks) become increasingly important in content-based video retrieval and manipulation. Motion analysis in this case is discussed in Chapter 18. Third, although the recursive technique is categorized as a region matching


technique, it may be used for finding displacement vectors for individual pixels. In fact, the recursive technique was originally developed for determining displacement vectors of pixels, and hence it is called pel recursive. This technique is discussed in Chapter 12. Fourth, both the correlation and recursive techniques can be utilized for determining optical flow vectors. Optical flow is discussed in Chapter 13.
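The following is a minimal sketch of the full-search block matching mentioned in the first comment above; the 16 × 16 block size, the ±7 pixel search window, and the sum of absolute differences (SAD) as the matching criterion are assumptions, and the correlation criterion described in the text could be substituted.

/* Full-search block matching sketch (illustrative; block size, search range,
 * and the SAD criterion are assumptions). Returns the displacement of the
 * best-matching 16x16 block in the previous frame. */
#include <stdlib.h>
#include <limits.h>

#define B 16
#define SEARCH 7

typedef struct { int dx, dy; } MV;

static long sad(const unsigned char *prev, const unsigned char *curr,
                int w, int x, int y, int dx, int dy)
{
    long s = 0;
    for (int i = 0; i < B; i++)
        for (int j = 0; j < B; j++)
            s += labs((long)curr[(y + i) * w + x + j] -
                      (long)prev[(y + dy + i) * w + x + dx + j]);
    return s;
}

MV block_match(const unsigned char *prev, const unsigned char *curr,
               int w, int h, int x, int y)
{
    MV best = { 0, 0 };
    long best_sad = LONG_MAX;
    for (int dy = -SEARCH; dy <= SEARCH; dy++)
        for (int dx = -SEARCH; dx <= SEARCH; dx++) {
            /* skip candidates that fall outside the previous frame */
            if (x + dx < 0 || y + dy < 0 || x + dx + B > w || y + dy + B > h)
                continue;
            long s = sad(prev, curr, w, x, y, dx, dy);
            if (s < best_sad) { best_sad = s; best.dx = dx; best.dy = dy; }
        }
    return best;
}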

The third technique in 2-D motion analysis is the differential technique. This is one of the main techniques utilized in determining optical flow vectors. It is named after the term differential because it uses the partial differentiation of an intensity function with respect to the spatial coordinates x and y, as well as the temporal coordinate t. This technique is also discussed in Chapter 13.

10.6 Motion Compensation for Image Sequence Processing

Motion analysis has long been considered a key issue in image sequence processing [huang 1981a; shi 1997]. Obviously, in an area like automated navigation, motion analysis plays a central role. From the discussion in this chapter, we see that motion analysis also plays a key role in video data compression. Specifically, we have discussed the concept of motion compensated video coding in Section 10.4. In this section, we consider motion compensation for image sequence processing in general. Let us first consider motion compensated interpolation. Then we will discuss motion compensated enhancement, restoration, and down-conversion.

10.6.1 Motion Compensated Interpolation

Interpolation is a simple yet efficient and important method in image and video compression. In image compression, we may only transmit, say, every other row, and then try to interpolate the missing rows from the transmitted half in the receiver. In this way, we compress the data to half. As the interpolation is carried out within a frame, it is referred to as spatial interpolation. In video compression, for instance, in videophone service, instead of transmitting 30 frames/s, we may choose a lower frame rate, say 10 frames/s. In the receiver, we may try to interpolate the dropped frames from the transmitted frames. This strategy immediately drops the transmitted data to one-third. Another example is the conversion of a motion picture into an NTSC (National Television System Committee) TV signal. There, every first frame in the motion picture is repeated three times and the next frame twice, thus converting a 24 frame/s motion picture to a 60 field/s NTSC signal. This is commonly referred to as 3:2 pulldown. In these two examples concerning video, the interpolation is along the temporal dimension, which is referred to as temporal interpolation.

For basic concepts of zero-order interpolation, bilinear interpolation, and polynomial interpolation, readers are referred to signal processing texts, for instance, [lim 1990]. In temporal interpolation, zero-order interpolation means creation of a frame by copying its nearest frame along the time dimension. The conversion of a 24 frame/s motion picture to a 60 field/s NTSC signal can be classified into this type of interpolation. Weighted linear interpolation can be illustrated with Figure 10.9.

There, the weights are determined according to the lengths of the time intervals, which is similar to the bilinear interpolation widely used in spatial interpolation, except that here only one index (along the time axis) is used, while two indexes (along two spatial axes) are used in spatial bilinear interpolation. That is,

f(x, y, t) = [l2 / (l1 + l2)] f(x, y, t1) + [l1 / (l1 + l2)] f(x, y, t2).  (10.4)
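A direct per-pixel implementation of Equation 10.4 might look like the following sketch; the frame layout and names are assumptions.

/* Weighted linear temporal interpolation per Equation 10.4 (illustrative;
 * per-pixel operation on two given frames, no motion compensation). */
void temporal_interp(const unsigned char *f1, const unsigned char *f2,
                     unsigned char *out, int npixels,
                     double t1, double t2, double t)
{
    double l1 = t - t1, l2 = t2 - t;              /* time distances to the two frames */
    double w1 = l2 / (l1 + l2), w2 = l1 / (l1 + l2);
    for (int i = 0; i < npixels; i++)
        out[i] = (unsigned char)(w1 * f1[i] + w2 * f2[i] + 0.5);
}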


FIGURE 10.9 Weighted linear interpolation. (The frame f(x, y, t) to be interpolated lies between f(x, y, t1) and f(x, y, t2), at time distances l1 and l2, respectively.)

If there are one or more moving objects in successive frames, however, the weighted linear interpolation will blur the interpolated frames. Taking motion into account in the interpolation results in MC interpolation. In Figure 10.10, we still use the three frames shown in Figure 10.9 to illustrate the concept of MC interpolation. First, the motion between the two given frames is estimated; that is, the displacement vectors for each pixel are determined. Second, we choose the given frame that is nearer to the frame we want to interpolate. Third, the displacement vectors determined in the first step are proportionally converted to the frame to be created. Each pixel in this frame is projected via the determined motion trajectory to the frame chosen in step 2. In the process of MC interpolation, spatial interpolation in the frame chosen in step 2 is usually needed.
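A minimal sketch of this three-step procedure is given below; it assumes one displacement vector per pixel of the frame to be created is already available, takes the earlier frame as the nearer reference, and uses nearest-neighbor rounding in place of the spatial interpolation mentioned above.

/* MC interpolation sketch (illustrative; the per-pixel vector field, the choice
 * of the earlier frame as reference, and nearest-neighbor rounding are
 * assumptions). */
typedef struct { double dx, dy; } Flow;

/* f1 at time t1 is the chosen reference; out is the frame created at time t,
 * t1 < t < t2; disp[i] is the displacement of pixel i from f1 to f2. */
void mc_interpolate(const unsigned char *f1, const Flow *disp,
                    unsigned char *out, int w, int h,
                    double t1, double t2, double t)
{
    double a = (t - t1) / (t2 - t1);       /* fraction of the trajectory covered */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int i = y * w + x;
            /* project the pixel back along its proportionally scaled trajectory */
            int sx = (int)(x - a * disp[i].dx + 0.5);
            int sy = (int)(y - a * disp[i].dy + 0.5);
            if (sx < 0) sx = 0; if (sx >= w) sx = w - 1;
            if (sy < 0) sy = 0; if (sy >= h) sy = h - 1;
            out[i] = f1[sy * w + sx];
        }
}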

FIGURE 10.10 Motion compensated (MC) interpolation. (A pixel (x0, y0) of the frame f(x, y, t) to be created is projected along its motion trajectory: the full displacement (dx, dy) between f(x, y, t1) and f(x, y, t2) is scaled to (dxt, dyt) according to the time distances l1 and l2.)


10.6.2 Motion Compensated Enhancement

It is well known that when an image is corrupted by additive white Gaussian noise (AWGN) or burst noise, linear low-pass filtering such as simple averaging, or nonlinear low-pass filtering such as a median filter, performs well in removing the noise. When an image sequence is concerned, we may apply these types of filtering along the temporal dimension to remove noise. This is called temporal filtering. These types of low-pass filtering may blur images, an effect that may become quite serious when motion exists in the image planes. Enhancement that takes motion into account is referred to as MC enhancement, and it was found very efficient in temporal filtering [huang 1981c].

To facilitate the discussion, we consider simple averaging as a means of noise filtering in what follows. It is understood that other filtering techniques are possible, and that everything discussed here is applicable to them. Instead of simply averaging n successive image frames in a video sequence, MC temporal filtering first analyzes the motion existing in these frames. That is, we estimate the motion of pixels in successive frames first. Then averaging is conducted only on those pixels along the same motion trajectory. In Figure 10.11, three successive frames are shown and denoted by f(x, y, t1), f(x, y, t2), and f(x, y, t3), respectively. Assume that three pixels, denoted by (x1, y1), (x2, y2), and (x3, y3), respectively, are identified to be perspective projections of the same object point in the 3-D world space on the three frames. The averaging is then applied to these three pixels. It is noted that the number of successive frames, n, does not necessarily have to be three. The motion analysis can be any one of the techniques discussed in Section 10.5. MC temporal filtering is not necessarily implemented pixelwise; it can also be objectwise or regionwise.
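The following is a minimal sketch of MC temporal filtering by simple averaging over three frames; it assumes one displacement vector per pixel toward each neighboring frame is given and that the vectors stay inside the frame.

/* MC temporal filtering sketch (illustrative; three frames, per-pixel
 * displacement vectors, and simple averaging as the filter are assumptions).
 * Each output pixel is the average of the three pixels on its motion
 * trajectory. */
typedef struct { int dx, dy; } Disp;

/* f1, f2, f3: three successive frames of size w x h; d12: per-pixel
 * displacement from frame 2 back to frame 1; d32: from frame 2 to frame 3. */
void mc_temporal_average(const unsigned char *f1, const unsigned char *f2,
                         const unsigned char *f3, const Disp *d12,
                         const Disp *d32, unsigned char *out, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            int i = y * w + x;
            /* counterparts of (x, y) on frames 1 and 3 along the trajectory
               (vectors are assumed to keep the positions inside the frame) */
            int i1 = (y + d12[i].dy) * w + (x + d12[i].dx);
            int i3 = (y + d32[i].dy) * w + (x + d32[i].dx);
            out[i] = (unsigned char)(((int)f1[i1] + f2[i] + f3[i3]) / 3);
        }
}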

FIGURE 10.11  Motion compensated (MC) temporal filtering.
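For concreteness, here is a minimal sketch (not from the original text) of the three-frame case in Figure 10.11: each pixel of the middle frame is averaged with its counterparts on the same motion trajectory, where the trajectories are assumed to be available as integer per-pixel displacement fields d21 (from f2 to f1) and d23 (from f2 to f3).

    import numpy as np

    def mc_average3(f1, f2, f3, d21, d23):
        # d21 and d23 are pairs (dy, dx) of integer displacement fields with the
        # same shape as f2, mapping each pixel of f2 to its counterpart in f1/f3.
        h, w = f2.shape
        out = np.empty((h, w), dtype=np.float64)
        for y in range(h):
            for x in range(w):
                y1 = min(max(y + d21[0][y, x], 0), h - 1)
                x1 = min(max(x + d21[1][y, x], 0), w - 1)
                y3 = min(max(y + d23[0][y, x], 0), h - 1)
                x3 = min(max(x + d23[1][y, x], 0), w - 1)
                # average along the motion trajectory, not at a fixed location
                out[y, x] = (f1[y1, x1] + f2[y, x] + f3[y3, x3]) / 3.0
        return out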


10.6.3 Motion Compensated Restoration

Extensive attention has been paid to the restoration of full-length feature films. There, typical artifacts are due to dirt and sparkle. Earlier studies on the detection of these artifacts ignored motion information completely. Later, motion estimation was utilized to detect these artifacts based on the assumption that the artifacts occur occasionally along the temporal dimension. Once the artifacts have been found, MC temporal filtering and/or interpolation is used to remove them. One successful algorithm for the detection and removal of anomalies in digitized animation film can be found in [tom 1998].

10.6.4 Motion Compensated Down-Conversion

Here we present one more example in which motion compensation finds application in digital video processing.

It is believed that there will be a need to down-convert a high definition television (HDTV) image sequence for display onto an NTSC monitor during the upcoming transition to digital television broadcast. The most straightforward approach is to fully decode the image sequence first, and then apply a prefiltering and subsampling process to each field of the interlaced sequence. This is referred to as a full-resolution decoder (FRD). The merit of this approach is the high quality achieved, whereas the drawback is a high cost in terms of the large amount of memory required to store the reference frames. To reduce the required memory space, another approach is considered. In this approach, the down-conversion is conducted within the decoding loop, and is referred to as a low-resolution decoder (LRD). It can significantly reduce the required memory and still achieve reasonably good picture quality.

Prediction drift is a major type of artifact in the down-conversion. It is defined as the successive blurring of forward-predicted frames within a group of pictures. It is caused mainly by nonideal interpolation of sub-pixel intensities and the loss of high-frequency data within the block. An optimal set of filters to perform low-resolution motion compensation has been derived to effectively minimize the drift. For details of a down-conversion algorithm utilizing an optimal motion compensation scheme, readers are referred to [vetro 1998].

10.7 Summary

After Part II, still image compression, we shift our attention to video compression. Before Part IV, where we discuss various video compression algorithms and standards, however, we first address the issue of motion analysis and motion compensation in this chapter, which starts Part III, motion estimation and compensation. This is because video compression has its own characteristics, which are different from those of still image compression. The main difference lies in interframe correlation.

In this chapter, the concept of various image sequences is discussed in a broad scope. In doing so, a single image sequence, temporal image sequences, and spatial image sequences are all unified under the concept of imaging space. The redundancy between pixels in successive frames is analyzed for both videoconferencing and TV broadcast cases. In these applications, there is more interframe correlation than intraframe correlation in general. Therefore, the utilization of interframe correlation becomes a key issue in video compression.

There are two major techniques for the exploitation of interframe correlation: frame replenishment and motion compensation. In the conditional replenishment technique, only those


pixels' gray level values, whose variation from their counterparts in the previous frame exceeds a threshold, are encoded and transmitted to the receiver. These pixels are called changing pixels. For the pixels other than the changing pixels, their gray values are just repeated in the receiver. This simplest frame replenishment technique achieves higher coding efficiency than coding each pixel in each frame due to the utilization of interframe redundancy. In the more advanced frame replenishment techniques, say, the frame-difference predictive coding technique, both temporal and spatial neighboring pixels' gray values are used to predict that of a changing pixel. Instead of the intensity values of the changing pixels, the prediction error is encoded and transmitted. Because the variance of the prediction error is smaller than that of the intensity values, this more advanced frame replenishment technique is more efficient than the conditional replenishment technique.

The main drawback of the frame replenishment techniques is associated with rapid motion and/or intensity variation occurring on the image planes. Under these circumstances, frame replenishment will suffer from the dirty window effect, and even buffer saturation.

In MC coding, the motion of pixels is first analyzed. On the basis of previous frames and the estimated motion, the current frame is predicted. The prediction error together with motion vectors is encoded and transmitted to the receiver. Due to more accurate prediction based on the motion model, MC coding achieves higher coding efficiency compared with frame replenishment. This is conceivable because frame replenishment basically uses the intensity value of a pixel in the previous frame to predict that of the pixel in the same location in the present frame, whereas the prediction in MC coding uses the motion trajectory. This implies that the higher coding efficiency is obtained in motion compensation at the cost of higher computational complexity. This is technically feasible and economically desired since the cost of digital signal processing decreases much faster than that of transmission.

Because of the real-time requirement in video coding, only a simple 2-D translational model is used. There are mainly three types of motion analysis techniques used in MC coding. They are block matching, pel recursion, and optical flow. By far, block matching is used most frequently. These three techniques are discussed in detail in Chapters 11 through 13.

Motion compensation is also widely utilized in other tasks of digital video sequence processing. Examples include MC interpolation, MC enhancement, MC restoration, and MC down-conversion.

Exercises

1. Explain the analogy between a stereo image sequence versus the imaging space, and a stereo image pair versus the spatial image sequence to which the stereo image pair belongs.

2. Explain why the imaging space can be considered as a unification of image frames, spatial image sequences, and temporal image sequences.

3. Give the definitions of the following several concepts: image, image sequence, and video. Discuss the relation between them.

4. What feature causes video compression to be quite different from still image compression?

5. Describe the conditional replenishment technique. Why can it achieve higher coding efficiency in video coding than those techniques encoding each pixel in each frame?

6. Describe the frame-difference predictive coding technique. Refer to Section 3.5.2.

7. What is the main drawback of frame replenishment?


8. Both frame-difference predictive coding and MC coding are predictive coding in nature.

(a) What is the main difference between the two?

(b) Explain why MC coding is usually more efficient.

(c) What is the price paid for higher coding efficiency with MC coding?

9. Motion analysis is an important task encountered in both computer vision and video coding. What is the major difference in the requirements for motion analysis in these two fields?

10. Work on the first 40 frames of a video sequence other than the Miss America. Determine, on an average basis, what percentage of the total pixels change their gray level values by more than 1% of the peak signal between two consecutive frames.

11. Similar to the experiment associated with Figure 10.5, do your own experiment to observe the dirty window effect. That is, work on two successive frames of a video sequence chosen by yourself, and only update a part of those changing pixels.

12. Take two frames from the Miss America sequence, or from other sequences of your own choice, between which a relatively large motion is involved.

(a) Using the weighted linear interpolation defined in Equation 10.4, create an interpolated frame located at one-third of the time interval from the second frame (i.e., l2 = (1/3)(l1 + l2) according to Figure 10.9).

(b) Using MC interpolation, create an interpolated frame at the same position along the temporal dimension.

(c) Compare the two interpolated frames and make your comments.

References

[aggarwal 1988] J.K. Aggarwal and N. Nandhakumar, On the computation of motion from sequences of images—a review, Proceedings of the IEEE, 76, 8, 917–935, 1988.

[dubois 1981] E. Dubois, B. Prasada, and M.S. Sabri, Image sequence coding, chapter 3 in Image Sequence Analysis, T.S. Huang (Ed.), Springer-Verlag, Berlin, 1981.

[haskell 1972a] B.G. Haskell and J.O. Limb, Predictive video encoding using measured subject velocity, U.S. Patent 3,632,865, January 1972.

[haskell 1972b] B.G. Haskell, F.W. Mounts, and J.C. Candy, Interframe coding of videotelephone pictures, Proceedings of the IEEE, 60, 7, 792–800, July 1972.

[haskell 1979] B.G. Haskell, Frame replenishment coding of television, chapter 6 in Image Transmission Techniques, W.K. Pratt (Ed.), Academic Press, New York, 1979.

[horn 1980] B.K.P. Horn and B.G. Schunck, Determining optical flow, Artificial Intelligence, 17, 185–203, 1981.

[horn 1988] B.K.P. Horn and E.J. Weldon Jr., Direct methods for recovering motion, International Journal of Computer Vision, 2, 51–76, 1988.

[huang 1981a] T.S. Huang (Ed.), Image Sequence Analysis, Springer-Verlag, Berlin, 1981.

[huang 1981b] T.S. Huang and R.Y. Tsai, Image sequence analysis: Motion estimation, chapter 1 in Image Sequence Analysis, T.S. Huang (Ed.), Springer-Verlag, Berlin, 1981.

[huang 1981c] T.S. Huang and Y.P. Hsu, Image sequence enhancement, chapter 4 in Image Sequence Analysis, T.S. Huang (Ed.), Springer-Verlag, Berlin, 1981.

[huang 1983] T.S. Huang (Ed.), Image Sequence Processing and Dynamic Scene Analysis, Springer-Verlag, Berlin, 1983.

[jain 1989] A.K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1989.

[kretzmer 1952] E.R. Kretzmer, Statistics of television signal, The Bell System Technical Journal, 31, 4, 751–763, July 1952.

[kunt 1995] M. Kunt (Ed.), Special issue on digital television part 1: Technologies, Proceedings of the IEEE, 83, 6, June 1995.

[mounts 1969] F.W. Mounts, A video encoding system with conditional picture-element replenishment, The Bell System Technical Journal, 48, 7, 2545–2554, September 1969.

[musmann 1985] H.G. Musmann, P. Pirsch, and H.J. Grallert, Advances in picture coding, Proceedings of the IEEE, 73, 4, 523–548, 1985.

[negahdaripour 1987] S. Negahdaripour and B.K.P. Horn, Direct passive navigation, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-9, 1, 168–176, January 1987.

[netravali 1979] A.N. Netravali and J.D. Robbins, Motion compensated television coding: Part I, The Bell System Technical Journal, 58, 3, 631–670, March 1979.

[lim 1990] J.S. Lim, Two-Dimensional Signal and Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1990.

[pratt 1979] W.K. Pratt (Ed.), Image Transmission Techniques, Academic Press, New York, 1979.

[rocca 1969] F. Rocca, Television bandwidth compression utilizing frame-to-frame correlation and movement compensation, Symposium on Picture Bandwidth Compression, MIT, Cambridge, MA, 1969; Gordon and Breach, 1972.

[seyler 1962] A.J. Seyler, The coding of visual signals to reduce channel-capacity requirements, The Institution of Electrical Engineers Monograph, no. 533E, July 1962.

[seyler 1965] A.J. Seyler, Probability distributions of television frame difference, Proceedings of IREE (Australia), 26, 335, November 1965.

[singh 1991] A. Singh, Optical Flow Computation: A Unified Perspective, IEEE Computer Society Press, CA, 1991.

[shi 1994] Y.Q. Shi, C.Q. Shu, and J.N. Pan, Unified optical flow field approach to motion analysis from a sequence of stereo images, Pattern Recognition, 27, 12, 1577–1590, 1994.

[shi 1997] Y.Q. Shi, Editorial introduction to special issue on image sequence processing, International Journal of Imaging Systems and Technology, 9, 4, 189–191, August 1998.

[shu 1991] C.Q. Shu and Y.Q. Shi, On unified optical flow field, Pattern Recognition, 24, 6, 579–586, 1991.

[shu 1993] C.Q. Shu and Y.Q. Shi, Direct recovering of Nth order surface structure using unified optical flow field, Pattern Recognition, 26, 8, 1137–1148, 1993.

[tekalp 1995] A.M. Tekalp, Digital Video Processing, Prentice-Hall PTR, Upper Saddle River, NJ, 1995.

[thompson 1989] W.B. Thompson, Introduction to special issue on visual motion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 5, 449–450, 1989.

[tom 1998] B.C. Tom, M.G. Kang, M.C. Hong, and A.K. Katsaggelos, Detection and removal of anomalies in digitized animation film, in Y.Q. Shi (Ed.), special issue on image sequence processing, International Journal of Imaging Systems and Technology, 9, 4, 283–293, 1998.

[vetro 1998] A. Vetro and H. Sun, Frequency domain down-conversion of HDTV using an optimal motion compensation scheme, in Y.Q. Shi (Ed.), special issue on image sequence processing, International Journal of Imaging Systems and Technology, 9, 4, 274–282, 1998.

[waxman 1986] A.M. Waxman and J.H. Duncan, Binocular image flow: Steps towards stereo-motion fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8, 6, 715–729, 1986.

[westwater 1997] R. Westwater and B. Furht, Real-Time Video Compression, Kluwer Academic Publishers, Dordrecht, 1997.

[zhang 1995] Y.-Q. Zhang, W. Li, and M.L. Liou (Eds.), Special issue on advances in image and video compression, Proceedings of the IEEE, 83, 2, 133–340, February 1995.


11 Block Matching

As mentioned in Chapter 10, displacement vector measurement and its usage in motion compensation in interframe coding for a TV signal can be traced back to the 1970s. Netravali and Robbins [netravali 1979] developed a pel recursive technique, which estimates the displacement vector for each pixel recursively from its neighboring pixels using an optimization method. Limb and Murphy [limb 1975], Rocca and Zanoletti [rocca 1972], Cafforio and Rocca [cafforio 1976], and Brofferio and Rocca [brofferio 1977] developed techniques for the estimation of displacement vectors of a block of pixels. In the latter approach, an image is first segmented into areas, each having an approximately uniform translation. Then the motion vector is estimated for each area. The segmentation and motion estimation associated with these arbitrarily shaped blocks are very difficult. When there are multiple moving areas in images, the situation becomes more challenging. In addition to motion vectors, the shape information of these areas needs to be coded. Hence, when moving areas have various complicated shapes, both computational complexity and coding load will increase remarkably.

In contrast, the block matching technique, which is the focus of this chapter, is simple, straightforward, and yet very efficient. It has been by far the most popularly utilized motion estimation technique in video coding. In fact, it has been adopted by all the international video coding standards: ISO MPEG-1 and MPEG-2, and ITU H.261, H.263, and H.264. These standards will be introduced in detail in Chapters 16 through 20, respectively.

It is interesting to note that nowadays, with the tremendous advancements in multimedia engineering, object-based and/or content-based manipulation of audiovisual information is very demanding, particularly in audiovisual data storage, retrieval, and distribution. The applications include digital library, video-on-demand, audiovisual database, and so on. Therefore, the coding of arbitrarily shaped objects has regained great research attention these days. It is included in the MPEG-4 activities [iscas 1997], and hence will be discussed in Chapter 18.

In this chapter various aspects of block matching are addressed. They include the concept and algorithm, matching criteria, searching strategies, limitations, and new improvements.

11.1 Nonoverlapped, Equally Spaced, Fixed Size, Small Rectangular Block Matching

To avoid the kind of difficulties encountered in motion estimation and motion compensation with arbitrarily shaped blocks, the block matching technique was proposed by Jain and Jain in 1981 [jain 1981] based on the following simple motion model.


An image is partitioned into a set of nonoverlapped, equally spaced, fixed size, small rectangular blocks, and the translation motion within each block is assumed to be uniform. Although this simple model considers only translation motion, other types of motions, such as rotation and zooming of large objects, may be closely approximated by the piecewise translation of these small blocks provided that the blocks are small enough. This observation, originally made by Jain and Jain, has been confirmed again and again since then.

Displacement vectors for these blocks are estimated by finding their best-matched counterparts in the previous frame. In this manner, motion estimation is significantly easier than that for arbitrarily shaped blocks. Since the motion of each block is described by only one displacement vector, the side information on motion vectors decreases. Furthermore, the rectangular shape information is known to both the encoder and the decoder, and hence does not need to be encoded, which saves both computation load and side information.

The block size needs to be chosen properly. In general, the smaller the block size, the more accurate the approximation is. It is apparent, however, that a smaller block size leads to more motion vectors being estimated and encoded, which means an increase in both computation and side information. As a compromise, a size of 16 × 16 is considered to be a good choice. (This has been specified in international video coding standards such as H.261, H.263, MPEG-1, and MPEG-2.) Note that for finer estimation a block size of 8 × 8 is sometimes used.

Figure 11.1 is utilized to illustrate the block matching technique. In Figure 11.1a, an image frame at moment tn is segmented into nonoverlapped p × q rectangular blocks. As mentioned above, in common practice, square blocks with p = q = 16 are used most often. Consider one of the blocks, centered at (x, y). It is assumed that the block is translated as a whole. Consequently, only one displacement vector needs to be estimated for this block. Figure 11.1b shows the previous frame: the frame at moment tn−1. In order to estimate the displacement vector, a rectangular search window is opened in the frame tn−1 and centered at the pixel (x, y). Consider a pixel in the search window; a rectangular correlation window of the same size p × q is opened with the pixel located at its center. A certain type of similarity measure (correlation) is calculated. After this matching process has been completed for all candidate pixels in the search window, the correlation window corresponding to the largest similarity becomes the best match of the block under consideration in frame tn. The relative position between these two blocks (the block and its best match) gives the displacement vector (Figure 11.1b).

FIGURE 11.1  Block matching: (a) an original block in the tn frame; (b) the search window, correlation window, best block matching, and displacement vector in the tn−1 frame.

FIGURE 11.2  Search window and correlation window.

The size of the search window is determined by the size of the correlation window and the maximum possible displacement along the four directions: upwards, downwards, rightwards, and leftwards. In Figure 11.2 these four quantities are assumed to be the same and are denoted by d. Note that d is estimated from a priori knowledge about the translation motion, which includes the largest possible motion speed and the temporal interval between two consecutive frames, i.e., tn − tn−1.

11.2 Matching Criteria

Block matching belongs to image matching and can be viewed from a wider perspective. In many image processing tasks, we need to examine two images or two portions of images on a pixel-by-pixel basis. These two images or two image regions can be selected from a spatial image sequence, i.e., from two frames taken at the same time with two different sensors aiming at the same object, or from a temporal image sequence, i.e., from two frames taken at two different moments by the same sensor. The purpose of the examination is to determine the similarity between the two images or two portions of images. Examples of this type of application include image registration [pratt 1974] and template matching [jain 1989]. Image registration deals with spatial registration of images, while template matching extracts and recognizes an object in an image by matching the object template and a certain area of the image.

The similarity measure, or correlation measure, is a key element in the matching process. The basic correlation measure between two images tn and tn−1, C(s, t), is defined as follows [anuta 1969]:

C(s, t) = \frac{\sum_{j=1}^{p} \sum_{k=1}^{q} f_n(j, k)\, f_{n-1}(j+s, k+t)}{\sqrt{\sum_{j=1}^{p} \sum_{k=1}^{q} f_n(j, k)^2}\; \sqrt{\sum_{j=1}^{p} \sum_{k=1}^{q} f_{n-1}(j+s, k+t)^2}}.   (11.1)


This is also referred to as a normalized two-dimensional (2-D) cross-correlation function [musmann 1985].

Instead of finding the maximum similarity, or correlation, an equivalent yet more computationally efficient way of block matching is to find the minimum dissimilarity, or matching error. The dissimilarity (sometimes referred to as the error, distortion, or distance) between two images tn and tn−1, D(s, t), is defined as follows:

D(s, t) = \frac{1}{pq} \sum_{j=1}^{p} \sum_{k=1}^{q} M\big(f_n(j, k),\, f_{n-1}(j+s, k+t)\big),   (11.2)

where M(u, v) is a metric that measures the dissimilarity between the two arguments u and v. The quantity D(s, t) is also referred to as the matching criterion or the D value.

In the literature there are several types of matching criteria, among which the mean square error (MSE) [jain 1981] and the mean absolute difference (MAD) [koga 1981] are used most often. It is noted that the sum of squared differences (SSD) [anandan 1987] or the sum of squared errors (SSE) [chan 1990] is essentially the same as the MSE. The MAD is sometimes referred to as the mean absolute error (MAE) in the literature [nogaki 1992].

In the MSE matching criterion, the dissimilarity metric M(u, v) is defined as

M(u, v) = (u − v)^2.   (11.3)

In the MAD,

M(u, v) = |u − v|.   (11.4)

Obviously, both criteria are simpler than the normalized 2-D cross-correlation measure defined in Equation 11.1.

Before proceeding to Section 11.3, a comment on the selection of the dissimilarity measure is due. A study based on experimental work reported that the matching criterion does not significantly affect the search [srinivasan 1984]. The MAD is hence preferred due to its simplicity in implementation [musmann 1985].
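As an illustration of Equations 11.2 through 11.4 (a sketch, not code from the text), the following Python/NumPy helper evaluates D(s, t) for a p × q block with upper-left corner (i, j) in the current frame, matched against the correlation window displaced by (s, t) in the previous frame; boundary handling is omitted for brevity.

    import numpy as np

    def dissimilarity(fn, fn1, i, j, s, t, p, q, metric="MAD"):
        # Block in the current frame and the correlation window in the previous
        # frame, displaced by (s, t).
        block  = fn[i:i + p, j:j + q].astype(np.float64)
        window = fn1[i + s:i + s + p, j + t:j + t + q].astype(np.float64)
        diff = block - window
        if metric == "MSE":                  # Equation 11.3
            return float(np.mean(diff ** 2))
        return float(np.mean(np.abs(diff)))  # Equation 11.4 (MAD)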

11.3 Searching Procedures

Searching strategy is another important issue to deal with in block matching. Several searching strategies are discussed below.

11.3.1 Full Search

Figure 11.2 shows a search window, a correlation window, and their sizes. In searching for the best matching, the correlation window is moved to each candidate position within the search window. That is, there are a total of (2d + 1) × (2d + 1) positions that need to be examined. The minimum dissimilarity gives the best matching. Apparently, this full search procedure is brute force in nature. While the full search delivers good accuracy in searching for the best matching (thus, good accuracy in motion estimation), a large amount of computation is involved.
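A brute-force full search over the (2d + 1) × (2d + 1) candidate displacements can be sketched as follows (illustrative only; it reuses the dissimilarity() helper from Section 11.2 and ignores the picture border):

    def full_search(fn, fn1, i, j, p, q, d, metric="MAD"):
        # Examine every displacement (s, t) with |s|, |t| <= d and return the
        # one giving the minimum dissimilarity, together with that minimum.
        best_err, best_st = float("inf"), (0, 0)
        for s in range(-d, d + 1):
            for t in range(-d, d + 1):
                err = dissimilarity(fn, fn1, i, j, s, t, p, q, metric)
                if err < best_err:
                    best_err, best_st = err, (s, t)
        return best_st, best_err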

In order to lower the computational complexity, several fast searching procedures have been developed. They are introduced below.


11.3.2 2-D Logarithm Search

Jain and Jain developed a 2-D logarithmic searching procedure in [jain 1981]. Based on a 1-D logarithmic search procedure [knuth 1973], the 2-D procedure successively reduces the search area, thus reducing the computational burden. The first step computes the matching criterion for five points in the search window. These five points are as follows: the central point of the search window and the four points surrounding it, each being a midpoint between the central point and one of the four boundaries of the window. Among these five points, the one corresponding to the minimum dissimilarity is picked as the winner. In the next step, surrounding this winner, another set of five points is selected in a similar fashion to that in the first step, with the distances between the five points remaining unchanged. The exception takes place when either the central point of a set of five points or a boundary point of the search window gives the minimum D value. In these circumstances, the distances between the five points need to be reduced. The procedure continues until the final step, in which a set of candidate points is located in a 3 × 3 2-D grid. Figure 11.3 demonstrates two cases of the procedure. Figure 11.3a shows the case in which the minimum D value takes place on a boundary, while Figure 11.3b shows the case in which the minimum D value occurs in the central position.

A convergence proof of the procedure is presented in [jain 1981], under the assumption that the dissimilarity monotonically increases as the search point moves away from the point corresponding to the minimum dissimilarity.
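The sketch below (a simplified reading of the procedure, not the authors' code) captures the idea: five points in a "+" pattern are compared, the spacing is halved whenever the central point wins, and the search finishes with a 3 × 3 examination. The boundary rule of the original procedure is only approximated by clipping to the search range, and the dissimilarity() helper of Section 11.2 is assumed.

    def log2d_search(fn, fn1, i, j, p, q, d):
        # Simplified 2-D logarithmic search in the spirit of [jain 1981].
        cost = lambda st: dissimilarity(fn, fn1, i, j, st[0], st[1], p, q)
        s, t = 0, 0                       # current winner
        step = max(d // 2, 1)             # initial spacing of the five test points
        while step > 1:
            five = [(s, t), (s - step, t), (s + step, t), (s, t - step), (s, t + step)]
            u, v = min(five, key=cost)
            if (u, v) == (s, t):          # central point wins: reduce the spacing
                step //= 2
            # keep the winner inside the search window
            s = max(-d, min(d, u))
            t = max(-d, min(d, v))
        # final step: candidates on a 3 x 3 grid around the winner
        grid = [(s + a, t + b) for a in (-1, 0, 1) for b in (-1, 0, 1)]
        return min(grid, key=cost)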

FIGURE 11.3  (a) 2-D logarithm search procedure: points at (j, k + 2), (j + 2, k + 2), (j + 2, k + 4), and (j + 1, k + 4) are found to give the minimum dissimilarity in steps 1, 2, 3, and 4, respectively.


FIGURE 11.3 (continued)  (b) 2-D logarithm search procedure: points at (j, k − 2), (j + 2, k − 2), and (j + 2, k − 1) are found to give the minimum dissimilarity in steps 1, 2, 3, and 4, respectively.

11.3.3 Coarse–Fine Three-Step Search

Another important work on the block matching technique was completed at almost the same time by Koga, Linuma, Hirano, Iijima, and Ishiguro. A coarse–fine three-step procedure was developed for fast searching [koga 1981].

The three-step search is very similar to the 2-D logarithm search. There are, however, three main differences between the two procedures. First, each step in the three-step search compares a set of nine points that form a 3 × 3 2-D grid structure. Second, the distances between the points in the 3 × 3 2-D grid structure in the three-step search decrease monotonically in steps 2 and 3. Third, a total of only three steps are carried out. Obviously, these three items are different from the 2-D logarithm search described in Section 11.3.2.

An illustrative example of the three-step search is shown in Figure 11.4.
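A compact sketch of the three-step search (illustrative only; it again reuses the dissimilarity() helper, and the step sizes 4, 2, 1 are the usual choice for a ±7 search range):

    def three_step_search(fn, fn1, i, j, p, q):
        # Three passes over a 3 x 3 grid whose spacing shrinks from 4 to 2 to 1.
        cost = lambda st: dissimilarity(fn, fn1, i, j, st[0], st[1], p, q)
        s, t = 0, 0
        for step in (4, 2, 1):            # monotonically decreasing grid spacing
            grid = [(s + a * step, t + b * step) for a in (-1, 0, 1) for b in (-1, 0, 1)]
            s, t = min(grid, key=cost)
        return s, t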

11.3.4 Conjugate Direction Search

The conjugate direction search is another fast search algorithm, developed by Srinivasan and Rao. In principle, the procedure consists of two parts. In the first part, it finds the minimum dissimilarity along the horizontal direction with the vertical coordinate fixed at an initial position. In the second part, it finds the minimum D value along the vertical direction with the horizontal coordinate fixed at the position determined in the first part. Starting with the vertical direction followed by the horizontal direction is, of course, functionally equivalent. It was reported that this search procedure works quite efficiently [srinivasan 1984].

Figure 11.5 illustrates the principle of the conjugate direction search. In this example, each step involves a comparison between three testing points. If a point assumes the


FIGURE 11.4  Three-step search procedure: points (j + 4, k − 4), (j + 4, k − 6), and (j + 5, k − 7) give the minimum dissimilarity in steps 1, 2, and 3, respectively.

FIGURE 11.5  Conjugate direction search.


minimum D value compared with both of its two immediate neighbors (in one direction), then it is considered to be the best matching along this direction, and the search along another direction is started. Specifically, the procedure starts by comparing the D values for the three points (j, k − 1), (j, k), and (j, k + 1). If the D value of point (j, k − 1) appears to be the minimum among the three, then points (j, k − 2), (j, k − 1), and (j, k) are examined. The procedure continues, finding point (j, k − 3) as the best matching along the horizontal direction since its D value is smaller than those of points (j, k − 4) and (j, k − 2). The procedure is then conducted along the vertical direction. In this example the best matching is finally found at point (j + 2, k − 3).
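A hedged sketch of this one-direction-at-a-time idea (again reusing the dissimilarity() helper; search range, tie-breaking, and border details are simplified):

    def conjugate_direction_search(fn, fn1, i, j, p, q, d):
        cost = lambda s, t: dissimilarity(fn, fn1, i, j, s, t, p, q)

        def walk_to_local_min(make_point):
            # Slide along one direction until the current point beats both of
            # its immediate neighbours (or the search range d is exhausted).
            k = 0
            while abs(k) < d:
                trio = [k - 1, k, k + 1]
                best = min(trio, key=lambda u: cost(*make_point(u)))
                if best == k:
                    break
                k = best
            return k

        t = walk_to_local_min(lambda u: (0, u))   # horizontal pass, vertical fixed at 0
        s = walk_to_local_min(lambda u: (u, t))   # vertical pass at the column just found
        return s, t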

11.3.5 Subsampling in the Correlation Window

In the evaluation of the matching criterion, either MAD or MSE, all pixels within a correlation window at the tn−1 frame and an original block at the tn frame are involved in the computation. Note that the correlation window and the original block are of the same size (refer to Figure 11.1). In order to further reduce the computational effort, a subsampling inside the window and the block is performed [bierling 1988]. Aliasing effects can be avoided by using low-pass filtering. For instance, only every second pixel, both horizontally and vertically, inside the window and the block is taken into account in the evaluation of the matching criterion. Obviously, by using this subsampling technique, the computational burden is reduced by a factor of 4. Since 3/4 of the pixels within the window and the block are not involved in the matching computation, however, the use of such a subsampling procedure may affect the accuracy of the estimated motion vectors, especially in the case of small block sizes. Therefore, the subsampling technique is recommended only for cases with a large enough block size so that the matching accuracy will not be seriously affected. Figure 11.6 shows an example of 2 × 2 subsampling applied to both an original block of 16 × 16 at the tn frame and a correlation window of the same size at the tn−1 frame.
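The 2 × 2 subsampling amounts to evaluating the matching criterion on every second pixel in each direction; a hedged variant of the earlier dissimilarity() sketch:

    import numpy as np

    def dissimilarity_subsampled(fn, fn1, i, j, s, t, p, q):
        # MAD evaluated on every second pixel of the block and the correlation
        # window, cutting the arithmetic roughly by a factor of 4.
        block  = fn[i:i + p:2, j:j + q:2].astype(np.float64)
        window = fn1[i + s:i + s + p:2, j + t:j + t + q:2].astype(np.float64)
        return float(np.mean(np.abs(block - window)))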

11.3.6 Multiresolution Block Matching

It is well known that the multiresolution structure, also known as the pyramid structure, is a very powerful computational configuration for various image processing tasks. To save computation in block matching, it is natural to resort to the pyramid structure. In fact,

FIGURE 11.6  An example of 2 × 2 subsampling in the original block and correlation window for fast search: (a) an original block of 16 × 16 in the frame at tn; (b) a correlation window of 16 × 16 in the frame at tn−1.


the multiresolution technique has been regarded as one of the most efficient methods in block matching [tzovaras 1994]. In the so-called top-down multiresolution technique, a typical Gaussian pyramid is formed first.

Before diving into further description, let us give a short introduction to the concept of the Gaussian pyramid. A Gaussian pyramid can be understood as a set of images with different resolutions related to an original image in a certain way. The original image has the highest resolution and is considered as the lowest level, sometimes called the bottom level, in the set. From the bottom level to the top level, the resolution decreases monotonically. Specifically, between two consecutive levels, the upper level is half as large as the lower level in both the horizontal and vertical directions. The upper level is generated by applying a low-pass filter (which has a group of weights) to the lower level, followed by a 2 × 2 subsampling. That is, each pixel in the upper level is a weighted average of some pixels in the lower level. In general, this iterative procedure of generating a level in the set is equivalent to convolving a specific weight function with the original image at the bottom level followed by an appropriate subsampling. Under certain conditions, these weight functions can closely approximate the Gaussian probability density function (pdf), which is why the pyramid is named after Gaussian. (For a detailed discussion, readers are referred to [burt 1983, 1984].) A diagram of a Gaussian pyramid structure is depicted in Figure 11.7. Note that the Gaussian pyramid depicted in Figure 11.7 resembles the so-called quad-tree structure, in which each node has four children nodes. In the simplest quad-tree pyramid, each pixel in an upper level is assigned the average value of its corresponding four pixels in the next lower level.
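The fragment below (illustrative only) builds such a pyramid using the simplest quad-tree-style low-pass-and-subsample step just mentioned, i.e., each upper-level pixel is the plain average of its four children; a Gaussian-weighted kernel could be substituted without changing the structure.

    import numpy as np

    def build_pyramid(frame, levels):
        # pyr[0] is the bottom (full resolution) level; resolution halves per level.
        pyr = [frame.astype(np.float64)]
        for _ in range(levels - 1):
            f = pyr[-1]
            h, w = (f.shape[0] // 2) * 2, (f.shape[1] // 2) * 2   # trim to even size
            f = f[:h, :w]
            upper = 0.25 * (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2])
            pyr.append(upper)
        return pyr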

Now let us return to our discussion of the top-down multiresolution technique. After a Gaussian pyramid has been constructed, motion search ranges are allocated among the different pyramid levels. Block matching is initiated at the lowest resolution level to obtain an initial estimate of the motion vectors. These computed motion vectors are then propagated to the next higher resolution level, where they are corrected, and then propagated to the next level. This procedure continues until the highest resolution level is reached. As a result, a large amount of computation can be saved. In [tzovaras 1994] it was shown that a two-level Gaussian pyramid outperforms a three-level pyramid. Compared with full search block matching, the top-down multiresolution block search saves up to 67% of the computation without seriously affecting the quality of the reconstructed images.
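One propagate-and-refine step of the top-down search might look as follows (a sketch under the same assumptions as the earlier helpers: the co-located block one level up already has a vector coarse_mv, block sizes halve from one level to the next coarser one, and only a small ±r correction is searched at the current level). Starting from vectors obtained with full_search() at the top level and applying this step level by level down to the bottom gives the coarse-to-fine estimate described above.

    def refine_motion_vector(pyr_n, pyr_n1, lvl, bi, bj, block, coarse_mv, r=1):
        # pyr_n, pyr_n1: pyramids of the current and previous frames (index 0 =
        # bottom level); lvl: current level; (bi, bj): block indices; block:
        # block size at the bottom level, so the size here is block >> lvl.
        b = block >> lvl
        s0, t0 = 2 * coarse_mv[0], 2 * coarse_mv[1]      # scale the coarse vector
        best_err, best_st = float("inf"), (s0, t0)
        for ds in range(-r, r + 1):
            for dt in range(-r, r + 1):
                err = dissimilarity(pyr_n[lvl], pyr_n1[lvl], bi * b, bj * b,
                                    s0 + ds, t0 + dt, b, b)
                if err < best_err:
                    best_err, best_st = err, (s0 + ds, t0 + dt)
        return best_st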

FIGURE 11.7  A Gaussian pyramid structure (level 0 at the bottom, full resolution; the resolution decreases as the level number increases toward level 4).


In conclusion, it has been demonstrated that multiresolution is indeed an efficient computational structure in block matching. This once again confirms the high computational efficiency of the multiresolution structure.

11.3.7 Thresholding Multiresolution Block Matching

With the multiresolution technique discussed earlier, the computed motion vectors at any intermediate pyramid level are projected to the next higher resolution level. In reality, some computed motion vectors at the lower resolution level may be inaccurate and have to be refined further, while others may be relatively accurate and able to provide satisfactory motion compensation for the corresponding block. From a computation saving point of view, for the latter class it may not be worth propagating the motion vectors to the next higher resolution level for further processing.

Motivated by the above observation, a new multiresolution block matching method with a thresholding technique was developed [shi 1997]. The thresholding technique prevents those blocks whose estimated motion vectors provide satisfactory motion compensation from further processing, thus saving a lot of computation. In what follows, this technique is presented in detail so as to provide readers with an insight into both multiresolution block matching and thresholding multiresolution block matching techniques.

11.3.7.1 Algorithm

Let fn(x, y) be the frame of an image sequence at the current moment n. First, two Gaussian pyramids are formed, pyramids n and n − 1, from image frames fn(x, y) and fn−1(x, y), respectively. Let the levels of the pyramids be denoted by l, l = 0, 1, ..., L, where 0 is the lowest resolution level (top level), L is the full resolution level (bottom level), and L + 1 is the total number of levels in the pyramids. (Note that this way of numbering the levels of the pyramid structure [shi 1997] is different from the way depicted in Figure 11.7.) If (i, j) are the coordinates of the upper left corner of a block at level l of pyramid n, the block is referred to as block (i, j)_n^l. The horizontal and vertical dimensions of a block at level l are denoted by b_x^l and b_y^l, respectively. Like the variable block size method (refer to method 1 in [tzovaras 1994]), the size of the block in this work varies with the pyramid levels. That is, if the size of a block at level l is b_x^l × b_y^l, then the size of the block at level l − 1 becomes 2b_x^l × 2b_y^l. The variable block size method is used because it gives more efficient motion estimation than the fixed block size method. Here, the matching criterion used for motion estimation is the MAD because it does not require multiplication and gives similar performance as the MSE does. The MAD between block (i, j)_n^l of the current frame and block (i + v_x, j + v_y)_{n−1}^l of the previous frame at level l can be calculated as

MAD_{(i, j)_n^l}(v_x^l, v_y^l) = \frac{1}{b_x^l\, b_y^l} \sum_{k=0}^{b_x^l - 1} \sum_{m=0}^{b_y^l - 1} \left| f_n^l(i + k,\, j + m) - f_{n-1}^l\big(i + k + v_x^l,\, j + m + v_y^l\big) \right|,   (11.5)

where V^l = (v_x^l, v_y^l) is one of the candidates of the motion vector of block (i, j)_n^l, and v_x^l, v_y^l are the two components of the motion vector along the x and y directions, respectively.

A block diagram of the algorithm is shown in Figure 11.8. The threshold, in terms of MAD, needs to be determined in advance according to the accuracy requirement of the motion estimation. How to determine the threshold is discussed below in Section 11.3.7.2. Gaussian pyramids are formed for two consecutive frames of an image sequence from


FIGURE 11.8  Block diagram for a three-level threshold multiresolution block matching.

which motion estimation is desired. Block matching is then performed at the top level with the full search scheme. The estimated motion vectors are checked to see if they provide satisfactory motion compensation. If the accuracy requirement is met, then the motion vectors will be directly transformed to the bottom level of the pyramid. Otherwise, the motion vectors will be propagated to the next higher resolution level for further refinement. This thresholding process is discussed below in Section 11.3.7.3. The algorithm continues in this fashion until either the threshold has been satisfied or the bottom level has been reached. The skipping of some intermediate level computation provides the computational saving. Experimental work with quite different motion complexities demonstrates that the proposed algorithm reduces the processing time by 14% to 20%, while maintaining almost the same quality in the reconstructed image compared with the fastest existing multiresolution block matching algorithm [tzovaras 1994].

11.3.7.2 Threshold Determination

The MAD accuracy criterion is used in this work for the sake of saving computation. The threshold value has a direct impact on the performance of the proposed algorithm. A small threshold value can improve the reconstructed image quality at the expense of increased computational effort. On the other hand, a large threshold value can reduce the


TABLE 11.1
Parameters Used in Experiments

                               Low Resolution Level    Full Resolution Level
"Miss America" sequence
  Search range                 3 × 3                   1 × 1
  Block size                   4 × 4                   8 × 8
  Thresholding value           2                       None (not applicable)
"Train" sequence
  Search range                 4 × 4                   1 × 1
  Block size                   4 × 4                   8 × 8
  Thresholding value           3                       None (not applicable)
"Football" sequence
  Search range                 4 × 4                   1 × 1
  Block size                   4 × 4                   8 × 8
  Thresholding value           4                       None (not applicable)

computational complexity, but the quality of the reconstructed image may be degraded. One possible way to determine a threshold value, which is used in many experiments in [shi 1997], is as follows.

The peak signal-to-noise ratio (PSNR) is commonly used as a measure of the quality of the reconstructed image. As introduced in Chapter 1, it is defined as

PSNR = 10 \log_{10} \frac{255^2}{MSE}.   (11.6)

From the given required PSNR, one can find the necessary MSE value. The square root of this MSE value can be chosen as a threshold value, which is applied to the first two images of the sequence. If the resulting PSNR and required processing time are satisfactory, it is then used for the rest of the sequence. Otherwise, the threshold can be slightly adjusted accordingly and applied to the second and third images to check the PSNR and processing time. It was reported that this adjusted threshold value has been good enough, and that there is no need for further adjustment in numerous experiments. As shown in Table 11.1, the threshold values used for the "Miss America," "Train," and "Football" sequences (note that the three sequences have quite different motion complexities) are 2, 3, and 4, respectively. They are all determined in this fashion and give satisfactory performance, as shown in the three rows marked "New Method (TH = 2)," "New Method (TH = 3)," and "New Method (TH = 4)," respectively (Table 11.2). That is, the PSNR experiences only about 0.1 dB loss and the processing time decreases drastically. In the experiments, the threshold value of 3, i.e., the average value of 2, 3, and 4, was also tried. Refer to the three rows marked "New Method (TH = 3)" in Table 11.2. It is noted that this average threshold value of 3 has already given satisfactory performance for all three sequences. Specifically, for the Miss America sequence, as the criterion increases from 2 to 3, the PSNR loss increases from 0.12 to 0.48 dB, and the reduction in processing time increases from 20% to 38%. For the Football sequence, as the criterion decreases from 4 to 3, the PSNR loss decreases from 0.08 to 0.05 dB, and the reduction in processing time decreases from 14% to 9%. Obviously, for the Train sequence, the criterion, as well as the performance, remains the same. One can therefore conclude that the threshold determination may not require much computation at all.
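Read literally, this rule of thumb amounts to inverting Equation 11.6 and taking a square root; an illustrative couple of lines (not code from [shi 1997]):

    import math

    def threshold_from_psnr(required_psnr_db):
        # MSE implied by the required PSNR (Equation 11.6); its square root
        # serves as the initial MAD threshold to be checked on the first frames.
        mse = 255.0 ** 2 / 10.0 ** (required_psnr_db / 10.0)
        return math.sqrt(mse)

    # For example, a required PSNR of about 38 dB yields a threshold close to 3.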

11.3.7.3 Thresholding

Motion vectors estimated at each pyramid level will be checked to see if they provide satisfactory motion compensation. Assume V^l(i, j) = (v_x^l, v_y^l) is the estimated motion vector


TABLE 11.2
Experimental Results (I)

                               PSNR    Error Image Entropy  Vector Entropy   Blocks Stopped at        Processing Time
                               (dB)    (bits/pixel)         (bits/vector)    Top Level/Total Blocks   (Number of Additions, 10^6)
"Miss America" sequence
  Method 1 [tzovaras 1994]     38.91   3.311                6.02             0/1280                   10.02
  New method (TH = 2)          38.79   3.319                5.65             679/1280                 8.02
  New method (TH = 3)          38.43   3.340                5.45             487/1280                 6.17
"Train" sequence
  Method 1 [tzovaras 1994]     27.37   4.692                6.04             0/2560                   22.58
  New method (TH = 3)          27.27   4.788                5.65             1333/2560                18.68
"Football" sequence
  Method 1 [tzovaras 1994]     24.26   5.379                7.68             0/3840                   30.06
  New method (TH = 4)          24.18   5.483                7.58             1464/3840                25.90
  New method (TH = 3)          24.21   5.483                7.57             1128/3840                27.10

for block (i, j)_n^l at level l of pyramid n. For thresholding, V^l(i, j) should be directly projected to the bottom level L. The corresponding motion vector for the same block at the bottom level of pyramid n will be V^L(2^{L−l} i, 2^{L−l} j), and is given as

V^L(2^{L-l} i,\, 2^{L-l} j) = 2^{L-l}\, V^l(i, j).   (11.7)

The MAD between the block at the bottom pyramid level of the current frame and its counterpart in the previous frame can be determined according to Equation 11.5, where the motion vector is V^L = V^L(2^{L−l} i, 2^{L−l} j).

This computed MAD value can be compared with the predefined threshold. If this MAD value is less than the threshold, the computed motion vector V^L(2^{L−l} i, 2^{L−l} j) will be assigned to block (2^{L−l} i, 2^{L−l} j)_n^L at level L in the current frame and motion estimation for this block will be stopped. If not, the estimated motion vector V^l(i, j) at level l will be propagated to level l + 1 for further refinement. Figure 11.9 gives an illustration of the above thresholding process.
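Equation 11.7 and the accept-or-propagate decision can be sketched as follows (illustrative only; level 0 is the top and L the bottom, as in Section 11.3.7.1, and the dissimilarity() helper of Section 11.2 computes the MAD with the projected vector):

    def project_to_bottom(mv_l, l, L):
        # Equation 11.7: scale a level-l motion vector to the bottom level L.
        factor = 2 ** (L - l)
        return factor * mv_l[0], factor * mv_l[1]

    def accept_or_propagate(fn_L, fn1_L, i, j, mv_l, l, L, block_L, threshold):
        # Project the level-l vector of block (i, j) to the bottom level,
        # measure the resulting MAD there, and stop the search for this block
        # if the MAD is below the threshold; otherwise refine at level l + 1.
        factor = 2 ** (L - l)
        s, t = project_to_bottom(mv_l, l, L)
        mad = dissimilarity(fn_L, fn1_L, factor * i, factor * j, s, t,
                            block_L, block_L, metric="MAD")
        return mad < threshold, (s, t)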

11.3.7.4 Experiments

To verify the effectiveness of the proposed algorithm, extensive experiments have been conducted. The performance of the new algorithm is evaluated and compared with that of

FIGURE 11.9  The thresholding process: the motion vector of a block is estimated at level l of pyramid n, the block and its estimated motion vector are projected to level L, and the MAD of the block is calculated at level L.


method 1, one of the most efficient multiresolution block matching methods [tzovaras 1994], in terms of PSNR, error image entropy, motion vector entropy, the number of blocks stopped at the top level versus the total number of blocks, and processing time. The number of blocks stopped at the top level is the number of blocks withheld from further processing, while the total number of blocks is the number of blocks existing at the top level. It is noted that the total number of blocks is the same for each level in the pyramid. The processing time is the sum of the total number of additions involved in the evaluation of the MAD and the thresholding operation.

In the experiments, two-level pyramids are used since they give better performance for motion estimation purposes [tzovaras 1994]. The algorithms are tested on three video sequences with different motion complexities, i.e., the Miss America, Train, and Football. The Miss America sequence has a speaker imposed on a static background and contains less motion. The Train sequence has more detail and contains a fast moving object (train). The 20th frame of the sequence is shown in Figure 11.10. The Football sequence contains the most complicated motion compared with the other two sequences. The 20th frame is shown in Figure 11.11. Table 11.1 lists the implementation parameters used in the experiments. Tables 11.2 and 11.3 give the performance of the proposed algorithm compared with method 1. In all three cases, the motion estimation has a half-pixel accuracy, the meaning of which will be explained in the next section. All performance measures listed there are averaged over the first 25 frames of the testing sequences.

Each frame of the Miss America sequence is of 360 × 288 pixels. For convenience, only the central portion, 320 × 256 pixels, is processed. With the operational parameters listed in Table 11.1 (with a criterion value of 2), 38% of the total blocks at the top level satisfy the predefined criterion and are not propagated to the bottom level. The processing time needed by the proposed algorithm is 20% less than that of method 1, while the PSNR, the error image entropy, and the vector entropy are almost the same. Compared with method 1, an extra amount of computation (around 0.16 × 10^6 additions) is conducted on the thresholding operation, but large computational savings (around 2.16 × 10^6 additions) are achieved through withholding from further processing those blocks whose MAD values at the full resolution level are less than the predefined accuracy criterion.

The frames of the Train sequence are 720 × 288 pixels, and only the central portion, 640 × 256 pixels, is processed. With the operational parameters listed in Table 11.1 (with a

FIGURE 11.10  The 20th frame of the "Train" sequence.


FIGURE 11.11  The 20th frame in the "Football" sequence.

criterion value of 3), about 52% of the total blocks are stopped at the top level. The processing time is reduced by about 17% by the new algorithm, compared with method 1. The PSNR, the error image entropy, and the vector entropy are almost the same.

The frames of the Football sequence are 720 × 480 pixels, and only the central portion, 640 × 384 pixels, is processed. With the operational parameters listed in Table 11.1 (with a criterion value of 4), about 38% of the total blocks are stopped at the top level. The processing time is about 14% less than that required by method 1, while the PSNR, the error image entropy, and the vector entropy are almost the same.

As discussed, the experiments with a single accuracy criterion of 3 also produce similarly good performance for the three different image sequences.

In summary, it is clear that with the three different testing sequences, the thresholding multiresolution block matching algorithm works faster than the fastest existing top-down multiresolution block matching algorithm while achieving almost the same quality of the reconstructed image.

TABLE 11.3
Experimental Results (II)

                                       % of Total Blocks        % Saved Processing Time Compared
                                       Stopped at Top Level     with Method 1 in [tzovaras 1994]
"Miss America" sequence (TH = 2)       38                       20
"Train" sequence (TH = 3)              52                       17
"Football" sequence (TH = 4)           38                       14


11.4 Matching Accuracy

Apparently, the two components of the displacement vectors obtained using the technique described above are an integer multiple of pixels. This is referred to as one-pixel accuracy. If a higher accuracy is desired, i.e., if the components of the displacement vectors may be a non-integer multiple of pixels, then spatial interpolation is required. Not only will more computation be involved, but also more bits will be required to represent the motion vectors. The gain is more accurate motion estimation, hence less prediction error. In practice, half-pixel and quarter-pixel accuracy are the two most widely utilized accuracies other than one-pixel accuracy.

11.5 Limitations with Block Matching Techniques

Although very simple, straightforward, and efficient, and hence utilized most widely in video coding, the block matching motion compensation technique has its drawbacks. First, it has an unreliable motion vector field with respect to the true motion in 3-D world space; in particular, it has unsatisfactory motion estimation and compensation along moving boundaries. Second, it causes block artifacts. Third, it needs to handle side information. That is, it needs to encode and transmit motion vectors as an overhead to the receiving end, thus making it difficult to use a smaller block size to achieve higher accuracy in motion estimation.

All these drawbacks are due to its simple model: each block is assumed to experience a uniform translation; the motion vectors of partitioned blocks are estimated independently of each other.

Unreliable motion estimation, particularly along moving boundaries, causes more prediction error, hence reduced coding efficiency.

The block artifacts do not cause severe perceptual degradation to the human visual system (HVS) when the available coding bit rate is adequately high. This is because, with a high bit rate, a sufficient amount of the motion compensated (MC) prediction error can be transmitted to the receiving end, hence improving the subjective visual effect to such an extent that the block artifacts do not appear to be annoying. However, when the available bit rate is low, particularly less than 64 kbits/s (kilobits per second), the artifacts become visually unpleasant. In Figure 11.12, a reconstructed frame of the Miss America sequence at a low bit rate is shown. Obviously, block artifacts are very annoying, especially around the mouth and hair. The sequence was coded and decoded by using a codec following ITU-T Recommendation H.263, an international standard in which block matching is utilized for motion estimation.

The assumption that motion within each block is uniform requires a small block size such as 16 × 16 or 8 × 8. A small block size leads to a large number of motion vectors, however, resulting in a large overhead of side information. A study in [chan 1990] indicates that 8 × 8 block matching performs much better than 16 × 16 in terms of decoded image quality due to better motion estimation and compensation. The bits used for encoding motion vectors, however, increase significantly (about four times), which may be prohibitive for very low bit rate coding since the total bit rate needed for both prediction error and motion vectors may exceed the available bit rate. It is noted that when the coding bit rate is quite low, say, on the order of 20 kbits/s, the side information becomes comparable with the main information (prediction error) [lin 1997].
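As a rough illustration of the overhead involved, the following short sketch counts motion vectors for 16 × 16 and 8 × 8 blocks. The CIF frame size and the fixed cost of 10 bits per vector are assumptions made only for the arithmetic; the actual cost depends on the entropy coding used.

# A rough count, assuming a 352 x 288 (CIF) frame and a fixed cost per motion
# vector (the exact cost depends on the entropy coding actually used).
width, height = 352, 288
bits_per_mv = 10          # assumed average cost of one coded motion vector

for block in (16, 8):
    n_blocks = (width // block) * (height // block)
    print(block, n_blocks, n_blocks * bits_per_mv)
# 16 -> 396 blocks  -> about 3,960 bits of side information per frame
# 8  -> 1584 blocks -> about 15,840 bits, i.e., roughly four times as many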

Tremendous research efforts have been made to overcome the limitations of block matching techniques. Some improvements have been achieved and are discussed next. It should be kept in mind, however, that block matching is still by far the most popular and


FIGURE 11.12 The 21st reconstructed frame of the "Miss America" sequence by using a codec following H.263.

efficient motion estimation and compensation technique utilized for video coding, and it has been adopted by various international coding standards. In other words, block matching is the most appropriate in the framework of first-generation video coding [dufaux 1995].

11.6 New Improvements

11.6.1 Hierarchical Block Matching

Bierling developed the hierarchical search in 1988 [bierling 1988] based on the following two observations. On the one hand, for a relatively large displacement, accurate block matching requires a relatively large block size. This is conceivable if one considers its opposite case: a large displacement with a small correlation window. Under this circumstance, the search range is large. Therefore, the probability of finding multiple matches is high, resulting in unreliable motion estimation. On the other hand, however, a large block size may violate the assumption that all pixels in the block share the same displacement vector. Hence a relatively small block size is required to meet the assumption. These observations shed light on the problem of using a fixed block size, which may lead to unreliable motion estimation.


FIGURE 11.13 Hierarchical block matching. (a) Frame t_i. (b) Frame t_{i−1}.

To satisfy these two contradicting requirements simultaneously, in a hierarchical search procedure, a set of different sizes of blocks and correlation windows is utilized. To facilitate the discussion, consider a three-level hierarchical block matching algorithm, in which three block matching procedures are conducted, each with its own parameters. Block matching is first conducted with respect to the largest size of blocks and correlation windows. Using the estimated displacement vector as an initial vector at the second level, a new search is carried out with respect to the second largest size of blocks and correlation windows. The third search procedure is carried out similarly based on the results of the second search. An example with three correlation windows is illustrated in Figure 11.13. It is noted that the resultant displacement vector is the sum of the three displacement vectors determined by the three searches.

The parameters in these three levels are listed in Table 11.4. The algorithm is described below with an explanation of the various parameters in Table 11.4. Before each block matching, a separate low-pass filter is applied to the whole image to achieve reliable block matching. The low-pass filtering used is simply a local averaging. That is, the gray value of every pixel is replaced by the mean value of the gray values of all pixels within a square area centered at the pixel to which the mean value is assigned. In calculating the matching criterion D value, a subsampling is applied to the original block and the correlation window in order to save computation, as discussed in Section 11.3.5.

In the first level, for every eighth pixel horizontally and vertically (a step size of 8 × 8), block matching is conducted with the maximum displacement being ±7 pixels, a correlation window size of 64 × 64, and a subsampling factor of 4 × 4. A 5 × 5 averaging low-pass filter is applied before first-level block matching. Second-level block matching is

TABLE 11.4
Parameters Used in a Three-Level Hierarchical Block Matching

Hierarchical    Maximum               Correlation    Step    LPF Window    Subsampling
Level           Displacement (pel)    Window Size    Size    Size
1               ±7                    64 × 64        8       5 × 5         4 × 4
2               ±3                    28 × 28        4       5 × 5         4 × 4
3               ±1                    12 × 12        2       3 × 3         2 × 2

Source: M. Bierling, Displacement estimation by hierarchical block matching, Proceedings of Visual Communications and Image Processing, SPIE 1001, pp. 942–951, 1988. With permission.


conducted with respect to every fourth pixel horizontally and vertically (a step size of 4 × 4). Note that for a pixel whose displacement vector estimate has not been determined in first-level block matching, an average of the four nearest neighboring estimates will be taken as its estimate. All the parameters for the second level are listed in Table 11.4. One thing that needs to be emphasized is that in block matching at this level the search window should be displaced by the estimated displacement vector obtained in the first level. Third-level block matching is dealt with accordingly for every second pixel horizontally and vertically (a step size of 2 × 2). The different parameters are listed in Table 11.4.

In each of the three levels, the three-step search discussed in Section 11.3.3 is utilized. Experimental work has demonstrated more reliable motion estimation due to the usage of a set of different sizes for both the original block and the correlation window. The first level with a large window size and a large displacement range determines a major portion of the displacement vector reliably. The successive levels with smaller window sizes and smaller displacement ranges are capable of adaptively estimating motion vectors more locally.
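A minimal Python sketch of this coarse-to-fine idea is given below. It follows the window sizes and maximum displacements of Table 11.4 but, for brevity, omits the low-pass filtering and subsampling and replaces the three-step search at each level with a small exhaustive search; all function names are hypothetical.

import numpy as np

def mad(a, b):
    # Mean absolute difference between two equal-size blocks.
    return float(np.mean(np.abs(a.astype(float) - b.astype(float))))

def local_search(cur, ref, cy, cx, win, max_d, vy, vx):
    # Search around the initial vector (vy, vx) for the win x win window
    # centered at (cy, cx).  A small exhaustive search is used here for
    # brevity; the text uses the three-step search at each level.
    h = win // 2
    target = cur[cy - h:cy + h, cx - h:cx + h]
    best = (vy, vx, np.inf)
    for dy in range(-max_d, max_d + 1):
        for dx in range(-max_d, max_d + 1):
            y0, x0 = cy - h + vy + dy, cx - h + vx + dx
            if y0 < 0 or x0 < 0 or y0 + win > ref.shape[0] or x0 + win > ref.shape[1]:
                continue
            d = mad(target, ref[y0:y0 + win, x0:x0 + win])
            if d < best[2]:
                best = (vy + dy, vx + dx, d)
    return best[0], best[1]

def hierarchical_mv(cur, ref, cy, cx, levels=((64, 7), (28, 3), (12, 1))):
    # Three-level estimate for the pixel at (cy, cx); window sizes and maximum
    # displacements follow Table 11.4.  The result is the sum of the vectors
    # found at the three levels, carried along as the initial vector.
    vy = vx = 0
    for win, max_d in levels:
        vy, vx = local_search(cur, ref, cy, cx, win, max_d, vy, vx)
    return vy, vx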

Figure 11.14 shows a portion of an image with pixels processed in the three levels,respectively. It is noted that it is possible to apply one more interpolation after these threelevels so that a motion vector field of full resolution is available. Such a full resolutionmotion vector field is useful in such applications as MC interpolation in the context ofvideophony. There, to maintain a low bit rate some frames are skipped for transmission.At the receiving end these skipped frames need to be interpolated. As discussed in

FIGURE 11.14 A portion of an image with pixels processed in the three levels, respectively (each pixel is labeled with the levels, 1, 2, and/or 3, at which it is processed).


FIGURE 11.15 An illustration of a three-level hierarchical structure (levels l − 1, l, and l + 1).

Chapter 10, MC interpolation is able to produce better frame quality than that achievable by using weighted linear interpolation.

11.6.2 Multigrid Block Matching

Multigrid theory was developed originally in mathematics [hackbusch 1982]. It is a useful computational structure in image processing besides the multiresolution one described in Section 11.3.6. A diagram with three different levels is used to illustrate a multigrid structure (Figure 11.15). Although it is also a hierarchical structure, each level within the hierarchy is of the same resolution. A few algorithms based on the multigrid structure have been developed to improve the block matching technique. Two advanced methods are introduced below.

11.6.2.1 Thresholding Multigrid Block Matching

Realizing that the simple block-based motion model (assuming a uniform motion within a fixed-size block) in the block matching technique causes several drawbacks, Chan, Yu, and Constantinides proposed a variable size block matching technique. The main idea is to use a split-and-merge strategy with a multigrid structure in order to segment an image into a set of variable size blocks, each of which has an approximately uniform motion. A binary tree (also known as bin-tree) structure is used to record the relationship between these blocks of different sizes.

Specifically, an image frame is initially split into a set of square blocks by cutting the image alternately horizontally and vertically. With respect to each block thus generated, block matching is performed in conjunction with its previous frame. Then the matching accuracy in terms of the SSE is compared with a preset threshold. If it is smaller than or equal to the threshold, the block remains unchanged in the whole process and the estimated motion vector is final. Otherwise, the block will be split into two blocks, and a new run of block matching is conducted for each of these two children blocks. The process continues until either the estimated vector satisfies a preset accuracy requirement or the block size has reached a predefined minimum. At this point, a merge process is proposed by Chan et al.: neighboring blocks under the same intermediate nodes in the bin-tree are checked to see if they can be merged, i.e., if the merged block can be approximated by a block in the reconstructed previous frame with adequate accuracy. It is noted that the merge operation may be optional depending on the specific application.
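The split phase just described can be sketched as follows (the merge pass is omitted). The helper block_match, assumed to return the best vector and its SSE for a given block, stands in for whichever block matching routine is used; the code illustrates only the bin-tree splitting logic and is not the authors' implementation.

def match_or_split(cur, ref, top, left, h, w, threshold, min_size,
                   block_match, split_dir=0):
    # Split phase of variable-size (bin-tree) block matching.  block_match is
    # assumed to return (vy, vx, sse) for the h x w block at (top, left).
    # A block is accepted when its SSE is within the threshold or it has
    # reached the minimum size; otherwise it is cut in two (alternately
    # horizontally and vertically) and each child is matched again.
    vy, vx, sse = block_match(cur, ref, top, left, h, w)
    if sse <= threshold or min(h, w) <= min_size:
        return [(top, left, h, w, vy, vx)]
    leaves = []
    if split_dir == 0:  # cut horizontally into a top and a bottom child
        for t, hh in ((top, h // 2), (top + h // 2, h - h // 2)):
            leaves += match_or_split(cur, ref, t, left, hh, w, threshold,
                                     min_size, block_match, 1)
    else:               # cut vertically into a left and a right child
        for l, ww in ((left, w // 2), (left + w // 2, w - w // 2)):
            leaves += match_or_split(cur, ref, top, l, h, ww, threshold,
                                     min_size, block_match, 0)
    return leaves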

A block diagram of the multigrid block matching is shown in Figure 11.16. Note that it is similar to that shown in Figure 11.8 for the thresholding multiresolution block matching


FIGURE 11.16 A block diagram of multigrid block matching (starting from an intermediate level in the multigrid, block matching is performed; if the preset accuracy criterion is satisfied or the block size has reached a preset minimum, matching for the block is complete; otherwise the block is split, binary or quaternary, and block matching is repeated).

discussed in Section 11.3.6. This observation reflects the similarities between multigrid and multiresolution structures: both are hierarchical in nature and the split and merge can be easily performed.

An example of an image decomposition and its corresponding bin-tree are shown in Figure 11.17.

It was reported in [chan 1990] that, with respect to a picture of a computer mouse and a coin, the proposed variable size block matching achieves up to 6 dB improvement in SNR and about 30% reduction in required bits compared with fixed-size (16 × 16) block matching. For several typical videoconferencing sequences, the proposed algorithm consistently performs better than the fixed-size block matching technique in terms of improved SNR of reconstructed frames at the same bit rate.

A similar algorithm was reported in [xia 1996], where a quad-tree based segmentation is used. The thresholding technique is similar to that used in [shi 1997], and the emphasis is placed on the reduction of computational complexity. It was found that for head–shoulder type videophony sequences the thresholding multigrid block matching algorithm [xia 1996] performs better than the thresholding multiresolution block matching algorithm [shi 1997]. For video sequences that contain more complicated details and motion, however, the performance comparison turns out to be reversed.


FIGURE 11.17 Thresholding multigrid block matching. (a) An example of a decomposition. (b) The corresponding bin-tree.

A few remarks can be made as a conclusion for the thresholding technique. Although it needs to encode and transmit the bin-tree or quad-tree as a portion of the side information, and it has to resolve the preset threshold issue, overall the proposed algorithms achieve better performance compared with fixed-size block matching. With the flexibility provided through the variable-size methodology, the proposed approach is capable of making the motion model of uniform motion within each block more accurate than fixed-size block matching can.

11.6.2.2 Optimal Multigrid Block Matching

As pointed out in Chapter 10, the ultimate goal of motion estimation and motion compensation in the context of video coding is to provide high coding efficiency in real time. In other words, accurate true motion estimation is not the final goal, although accurate motion estimation is certainly desired. This point was presented in [bierling 1988] as well. There, the different requirements with respect to MC coding and MC interpolation were discussed. While the former requires motion vector estimation leading to minimum prediction error and at the same time a low amount of motion vector information, the latter requires accurate estimation of true vectors and a high resolution of the motion vector field.

This point was very much emphasized in [dufaux 1995]. There, Dufaux and Moscheni clearly stated that in the context of video coding, estimation of true motion in 3-D world space is not the ultimate goal. Instead, motion estimation should be able to provide good temporal prediction and at the same time require low overhead information. In a word, the total amount of information that needs to be encoded should be minimized. Based on this observation, a multigrid block matching technique with an advanced entropy criterion was proposed.

Since it belongs to the category of thresholding multigrid block matching, it shares many similarities with that in [chan 1990; xia 1996]. It also bears some resemblance to thresholding multiresolution block matching [shi 1997]. What really distinguishes this approach from other algorithms is its segmentation decision rule. Instead of a preset threshold, the algorithm works with an adaptive entropy criterion, which aims at controlling the segmentation in order to achieve an optimal solution in such a way that the total bits


needed for representing both the prediction error and the motion overhead are minimized. The decision to split a block is made only when the extra motion overhead involved in the splitting is lower than the gain obtained from less prediction error due to more accurate motion estimation. Not only is it optimal in the sense of bit saving, but it also eliminates the need for setting a threshold.

The amount of bits needed for encoding motion information can be estimated in a straightforward manner. As far as the prediction error is concerned, the amount of bits required can be represented by a total entropy of the prediction error, which can be estimated by using an analytical expression presented in [moscheni 1993, dufaux 1992, 1994]. Note that the coding cost for the quad-tree segmentation information is negligible compared with that used for encoding the prediction error and motion vectors and, hence, is omitted in determining the criterion.
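A hedged sketch of such an entropy-based split decision is given below. The sample-entropy estimate of the prediction-error bits is only one plausible stand-in for the analytical expression cited above, and the function names are hypothetical.

import numpy as np

def error_bits(residual, bins=256):
    # Rough estimate of the bits needed for a prediction-error block: the
    # number of samples times the sample entropy of the residual.
    hist, _ = np.histogram(residual, bins=bins)
    p = hist[hist > 0] / residual.size
    return residual.size * float(-(p * np.log2(p)).sum())

def should_split(parent_residual, child_residuals, extra_mv_bits):
    # Entropy criterion: split the block only if the extra motion-vector
    # overhead is smaller than the prediction-error bits saved by the more
    # accurate motion of the children (tree signalling cost neglected, as in
    # the text).
    saved = error_bits(parent_residual) - sum(error_bits(r) for r in child_residuals)
    return extra_mv_bits < saved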

In addition to this entropy criterion, a more advanced procedure is adopted in the algorithm for down-projecting the motion vectors between two consecutive grids in the coarse-to-fine iterative refinement process.

Both qualitative and quantitative assessments in experiments demonstrate its good performance. It was reported that, when the PSNR is fixed, the bit rate saving for the "Flower Garden" sequence is from 10% to 20%, for "Mobile Calendar" from 6% to 12%, and for "Table Tennis" up to 8%. This can be translated into a gain in the PSNR ranging from 0.5 to 1.5 dB. Subjectively, the visual quality is improved greatly. In particular, moving edges become much sharper. Figures 11.18 through 11.20 show a frame from the Flower Garden, Mobile Calendar, and Table Tennis sequences, respectively.

11.6.3 Predictive Motion Field Segmentation

As pointed out at the beginning of Section 11.5, the block-based model, which assumes constant motion within each block, leads to unreliable motion estimation and compensation. This

FIGURE 11.18 The 20th frame of the "Flower Garden" sequence.


FIGURE 11.19 The 20th frame of the "Mobile and Calendar" sequence.

FIGURE 11.20 The 20th frame of the "Table Tennis" sequence.


block effect becomes more obvious and severe in areas of motion discontinuity in image frames. This is because, in such areas, there are two or more regions in a block, each having a different motion. Using one motion vector to represent and compensate for the whole block results in a significant increase in prediction error.

Orchard proposed a predictive motion field segmentation technique to improve motionestimation and compensation along boundaries of moving objects in [orchard 1993]. Thesignificant improvement in the accuracy of the MC frame was achieved through relaxing therestrictive block-based model along moving boundaries. That is, for those blocks involvingmoving boundaries, the motion field assumes pixel resolution instead of block resolution.

Two key issues have to be resolved in order to realize this idea. One is the segmentation issue. It is known that the segmentation information is needed at the receiving end for motion compensation. This gives rise to a large increase in side information. To maintain almost the same coding cost as the conventional block matching technique, the motion field segmentation was proposed to be conducted based on previously decoded frames. This scheme is based on the following observation: the shape of a moving object does not change from frame to frame.

This segmentation is similar to the pel recursive technique (discussed in detail in Chapter 12) in the sense that both techniques operate backwards, i.e., based on previously decoded frames. The segmentation is different from the pel recursive method in that it only uses previously decoded frames to predict the shape of the discontinuity in the motion field, not the whole motion field itself. Motion vectors are still estimated using the current frame at the encoder. Consequently, this scheme is capable of achieving high accuracy in motion estimation, and at the same time it does not cause a large increase in side information due to the motion field segmentation.

Another key issue is how to achieve a reconstructed motion field with pixel resolution along moving boundaries. In order to avoid extra motion vectors that would need to be encoded and transmitted, the motion vectors applied to the segmented regions in the areas of motion discontinuity are selected from a set of neighboring motion vectors. As a result, the proposed technique is capable of reconstructing discontinuities in the motion field at pixel resolution while maintaining the same number of motion vectors as the conventional block matching technique.
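The following sketch illustrates, in a much simplified form, how pixels of a boundary block can be given vectors drawn from a set of neighboring block vectors without transmitting any new vectors. Assigning each pixel the candidate with the smallest absolute displaced difference is only an illustration; the actual scheme in [orchard 1993] derives the segmentation from previously decoded frames, which is not shown here.

import numpy as np

def assign_pixel_vectors(cur, ref, top, left, size, candidate_mvs):
    # For a block straddling a moving boundary, give every pixel one of the
    # candidate vectors taken from neighboring blocks, so that no extra
    # vectors need to be transmitted.  Each pixel takes the candidate with
    # the smallest absolute displaced difference (sign convention of the
    # candidates must match how the block vectors were estimated).
    h, w = ref.shape
    labels = np.zeros((size, size), dtype=int)
    for r in range(size):
        for c in range(size):
            y, x = top + r, left + c
            errs = []
            for (vy, vx) in candidate_mvs:
                yy = min(max(y + vy, 0), h - 1)
                xx = min(max(x + vx, 0), w - 1)
                errs.append(abs(float(cur[y, x]) - float(ref[yy, xx])))
            labels[r, c] = int(np.argmin(errs))
    return labels  # index into candidate_mvs for each pixel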

A number of algorithms using this type of motion field segmentation technique have been developed and their performance has been tested and evaluated on some real video sequences [orchard 1993]. Two of the 40-frame test sequences used were Table Tennis and Football. The former contains fast ball motion and camera zooming, while the latter contains small objects with relatively moderate amounts of motion and camera panning. Several proposed algorithms were compared with conventional block matching in terms of average pixel prediction error energy and bits per frame required for coding the prediction error. For the average pixel prediction error energy, the proposed algorithms achieve a significant reduction, ranging from −0.7 to −2.8 dB with respect to the Table Tennis sequence, and from −1.3 to −4.8 dB with the Football sequence. For bits per frame required for coding the prediction error, a reduction of 20%–30% was reported.

11.6.4 Overlapped Block Matching

All the techniques discussed so far in this section aim at more reliable motion estimation. As a result, they also alleviate annoying block artifacts to a certain extent. In this section we discuss a group of techniques, termed overlapped block matching, developed to alleviate or eliminate block artifacts [watanabe 1991; auyeung 1992; nogaki 1992].

The idea is to relax the restriction of a nonoverlapped block partition imposed in the block-based model in block matching. After the nonoverlapped, fixed size, small rectangular


FIGURE 11.21 Overlapped block matching. (a) Frame at t_n: an original nonoverlapped block, its enlarged target block, and a neighboring overlapped block. (b) Frame at t_{n−1}: the best matched enlarged block and the estimated motion vector.

block partition has been made, each block is enlarged along all four directions from thecenter of the block (refer to Figure 11.21). Both motion estimation (block matching) and MCprediction are conducted in the same manner as that in block matching except for theinclusion of a window function. That is, a 2-D window function is utilized in order tomaintain an appropriate quantitative level along the overlapped portion. The windowfunction decays toward the boundaries. In [nogaki 1992] a sine shaped window functionwas used.

Next, we use the algorithm proposed by Nogaki and Ohta as an example to specifically illustrate this type of technique. Consider one of the enlarged, overlapped original (also known as target) blocks, T(x, y), with a dimension of l × l. Assume that a vector v_i is one of the candidate displacement vectors under consideration. The predicted version of the target block with v_i is denoted by P_{v_i}(x, y). Thus, the prediction error with v_i, E_{v_i}(x, y), can be calculated according to Equation 11.8:

E_{v_i}(x, y) = P_{v_i}(x, y) − T(x, y).   (11.8)

The window function W(x, y) is applied at this stage as follows, resulting in a window-operated prediction error with v_i, WE_{v_i}:

WE_{v_i}(x, y) = E_{v_i}(x, y) · W(x, y).   (11.9)

Assume that the MAD is used as the matching criterion. It can then be determined as usual by using the window-operated prediction error WE_{v_i}(x, y). That is,

MAD = \frac{1}{l^2} \sum_{x=1}^{l} \sum_{y=1}^{l} |WE_{v_i}(x, y)|.   (11.10)

The best matching, which corresponds to the minimum MAD, produces the displacementvector v.


In MC prediction, the predicted version of the enlarged target block P_v(x, y) is derived from the frame at t_{i−1} by using the estimated vector v. The same window function W(x, y) is used to generate the final window-operated predicted version of the target block. That is,

WP_v(x, y) = P_v(x, y) · W(x, y).   (11.11)
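A small Python sketch of Equations 11.8 through 11.11 is given below. The separable sine-shaped window is one plausible reading of the window mentioned for [nogaki 1992]; the exact window, the search strategy, and the summation of overlapping windowed predictions over the whole frame are simplified here.

import numpy as np

def sine_window(l):
    # Separable sine-shaped 2-D window decaying toward the block boundaries
    # (one plausible choice; the exact window of [nogaki 1992] may differ).
    w1d = np.sin(np.pi * (np.arange(l) + 0.5) / l)
    return np.outer(w1d, w1d)

def overlapped_block_mv(cur, ref, top, left, l, max_d):
    # Estimate the vector of an enlarged l x l target block T at (top, left)
    # using the window-operated MAD of Equations 11.8 through 11.10.
    T = cur[top:top + l, left:left + l].astype(float)
    W = sine_window(l)
    best = (0, 0, np.inf)
    for vy in range(-max_d, max_d + 1):
        for vx in range(-max_d, max_d + 1):
            y0, x0 = top + vy, left + vx
            if y0 < 0 or x0 < 0 or y0 + l > ref.shape[0] or x0 + l > ref.shape[1]:
                continue
            E = ref[y0:y0 + l, x0:x0 + l].astype(float) - T   # Equation 11.8
            mad = float(np.mean(np.abs(E * W)))               # Equations 11.9 and 11.10
            if mad < best[2]:
                best = (vy, vx, mad)
    return best[0], best[1]

def overlapped_prediction(ref, top, left, l, vy, vx):
    # Window-operated prediction of the target block (Equation 11.11); the
    # overlapping windowed predictions are summed over the frame elsewhere.
    P = ref[top + vy:top + vy + l, left + vx:left + vx + l].astype(float)
    return P * sine_window(l)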

It was reported in [nogaki 1992] that the luminance signal of an HDTV sequence was used in computer simulation. A block size of 16 × 16 was used for conventional block matching, while a block size of 32 × 32 was employed for the proposed overlapped block matching. The maximum displacement range d was taken as d = 15, i.e., from −15 to +15 in both the horizontal and vertical directions. The simulation indicated a reduction in the power of the prediction error by about 19%. Subjectively, it was observed that the blocking edges originally existing in the prediction error signal with conventional block matching were largely removed with the proposed overlapped block matching technique.

11.7 Summary

By far, block matching is used more frequently than any other motion estimation technique in MC coding. By partitioning a frame into nonoverlapped, equally spaced, fixed size, small rectangular blocks, and assuming that all the pixels in a block experience the same translational motion, block matching avoids the difficulty encountered in motion estimation of arbitrarily shaped blocks. Consequently, block matching is much simpler and involves less side information compared with motion estimation with arbitrarily shaped blocks.

Although this simple model considers translation motion only, other types of motions,such as rotation and zooming of large objects, may be closely approximated by thepiecewise translation of these small blocks provided that these blocks are small enough.This important observation, originally made by Jain and Jain, has been confirmed againand again since then.

Various issues related to block matching, such as the selection of block sizes, matching criteria, search strategies, matching accuracy, its limitations, and improvements, are discussed in this chapter. Specifically, a block size of 16 × 16 is used most often. For more accurate motion estimation, the size of 8 × 8 is sometimes used. In the latter case, more accurate motion estimation is obtained at the cost of more side information and higher computational complexity.

There are several different types of matching criteria that can be used in block matching. Since it was shown that the different criteria do not cause significant differences in block matching, the MAD is preferred due to its simplicity in implementation.

On the one hand, the full search procedure delivers good accuracy in searching for the best match. On the other hand, it requires a large amount of computation. In order to lower the computational complexity, several fast search procedures were developed: 2-D logarithm search, coarse–fine three-step search, conjugate direction search, etc.

Besides these suboptimum search procedures, there are some other measures developed to lower computation. One of them is subsampling in the original blocks and the correlation windows. By subsampling, the computational burden in block matching can be reduced drastically, while the accuracy of the estimated motion vectors may be affected. Therefore, the subsampling procedure is only recommended for the case of a large block size.


Naturally, the multiresolution structure, a powerful computational configuration in image processing, lends itself well to fast search in block matching. It significantly reduces the computation involved in block matching. Thresholding multiresolution block matching further saves computation.

In terms of matching accuracy, several common choices are one-pixel, half-pixel, and quarter-pixel accuracies. Spatial interpolation is usually required for half-pixel and quarter-pixel accuracies. That is, a higher accuracy is achieved with more computation.

The limitations of block matching techniques are mainly an unreliable motion vector field and block artifacts. Both are caused by the simple model: each block is assumed to experience a uniform translation. Much effort has been made to overcome these drawbacks. Several techniques that improve on the conventional block matching technique are discussed in this chapter.

In the hierarchical block matching technique, a set of different sizes for both the original block and the correlation window is used. The first level in the hierarchy, with a large window size and a large displacement range, determines a major portion of the displacement vector reliably. The successive levels, with smaller window sizes and smaller displacement ranges, are capable of adaptively estimating motion vectors more locally.

The multigrid block matching technique uses the multigrid structure, another powerful computational structure in image processing, to provide variable size block matching. With a split-and-merge strategy, the thresholding multigrid block matching technique segments an image into a set of variable size blocks, each of which experiences an approximately uniform motion. A tree structure (bin-tree or quad-tree) is used to record the relationship between these variable size blocks. With the flexibility provided through the variable-size methodology, the thresholding block matching technique is capable of making the motion model of uniform motion within each block more accurate than fixed-size block matching can.

As pointed out in Chapter 10, the ultimate goal of motion compensation in video coding is to achieve high coding efficiency. In other words, accurate true motion estimation is not the final goal. From this point of view, in the above-mentioned multigrid block matching, the decision to split a block is made only when the bits used to encode the extra motion vectors involved in the splitting are fewer than the bits saved from encoding the reduced prediction error due to more accurate estimation. To this end, an adaptive entropy criterion is proposed and used in the optimal multigrid block matching technique. Not only is it optimal in the sense of bit saving, but it also eliminates the need for setting a threshold.

Apparently, the block-based model encounters a more severe problem along moving boundaries. To solve this problem, the predictive motion field segmentation technique lets the blocks involving moving boundaries have a motion field with pixel resolution instead of block resolution. In order to save shape overhead, the segmentation is carried out backwards, i.e., based on previously decoded frames. In order to avoid a large increase in side information associated with extra motion vectors, the motion vectors applied to these segmented regions along moving boundaries are selected from a set of neighboring motion vectors. As a result, the technique is capable of reconstructing discontinuities in the motion field at pixel resolution while maintaining the same number of motion vectors as the conventional block matching technique.

The last improvement over conventional block matching discussed in this chapter is overlapped block matching. In contrast to dealing with blocks independently of each other, the overlapped block matching technique enlarges blocks so as to make them overlap. A window function is then constructed and used in both motion estimation and motion compensation. Because it relaxes the restriction of a nonoverlapped block partition imposed by conventional block matching, it achieves better performance than conventional block matching.


Exercises

1. Referring to Figure 11.2, it is said that there are a total of (2d + 1) × (2d + 1) positions that need to be examined in block matching with full search if one-pixel accuracy is required. How many positions need to be examined in block matching with full search if half-pixel and quarter-pixel accuracies are required?

2. What are the two effects that subsampling in the original block and the correlation window may bring about?

3. Read [burt 1983, 1984] and explain why the pyramid is named after Gaussian.

4. Read [burt 1983, 1984] and explain why a pyramid structure is considered a powerful computational configuration. Specifically, in multiresolutional block matching, how and to what extent does it save computation dramatically compared with the conventional block matching technique? (Refer to Section 11.3.7.)

5. How is the threshold determined in the thresholding multidimensional block matching technique? (Refer to Section 11.3.7.) It is said that the square root of the MSE value, derived from the given PSNR (Equation 11.6), is used as an initial threshold value. Justify the necessity of the square root operation.

6. Refer to Section 11.6.1 or [bierling 1988]. State the different requirements in the applications of MC interpolation and MC coding. Discuss where a full-resolution translational motion vector field may be used.

7. Read [dufaux 1995] and explain the main feature of the optimal multigrid block matching. State how the adaptive entropy criterion is established. Implement the algorithm and compare its performance with that presented in [chan 1990].

8. Learn the predictive motion field segmentation technique [orchard 1993]. Explain howthe algorithms avoid a large increase in overhead due to motion field segmentation.

9. Implement the overlapped block matching algorithm introduced in [nogaki 1992].Compare its performance with that of the conventional block matching technique.

References

[anandan 1987] P. Anandan, Measurement visual motion from image sequences, Ph.D. thesis, COINS Department, University of Massachusetts, Amherst, 1987.

[anuta 1969] P.F. Anuta, Digital registration of multispectral video imagery, Society of Photo-OpticalInstrumentation Engineers Journal, 7, 168–175, September 1969.

[auyeung 1992] C. Auyeung, J. Kosmach, M. Orchard, and T. Kalafatis, Overlapped block motioncompensation, SPIE Proceedings of Visual Communication and Image Processing‘92, Vol. 1818,Boston, MA, pp. 561–571, November 1992.

[bierling 1988] M. Bierling, Displacement estimation by hierarchical block matching, Proceedings of Visual Communications and Image Processing, SPIE 1001, pp. 942–951, 1988.

[brofferio 1977] S. Brofferio and F. Rocca, Interframe redundancy reduction of video signals generatedby translating objects, IEEE Transactions on Communications, COM-25, 448–455, April 1977.

[burt 1983] P.J. Burt and E.H. Adelson, The Laplacian pyramid as a compact image code, IEEE Transactions on Communications, COM-31, 4, 532–540, April 1983.

[burt 1984] P.J. Burt, The pyramid as a structure for efficient computation, in Multiresolution ImageProcessing and Analysis, A. Rosenfeld (Ed.), Springer-Verlag, Germany, pp. 6–37, 1984.

[cafforio 1976] C. Cafforio and F. Rocca, Method for measuring small displacement of televisionimages, IEEE Transactions on Information Theory, IT-22, 573–579, September 1976.

[chan 1990] M.H. Chan, Y.B. Yu, and A.G. Constantinides, Variable size block matching motioncompensation with applications to video coding, IEE Proceedings, 137, Part I, 4, 205–212,August 1990.


[dufaux 1992] F. Dufaux and M. Kunt, Multigrid block matching motion estimation with an adaptivelocal mesh refinement, SPIE Proceedings of Visual Communications and Image Processing’92,Vol. 1818, Boston, MA, pp. 97–109, November 1992.

[dufaux 1994] F. Dufaux, Multigrid block matching motion estimation for generic video coding,Ph.D. dissertation, Swiss Federal Institute of Technology, Lausanne, Switzerland, 1994.

[dufaux 1995] F. Dufaux and F. Moscheni, Motion estimation techniques for digital TV: A review anda new contribution, Proceedings of the IEEE, 83, 6, 858–876, 1995.

[hackbusch 1982] W. Hackbusch and U. Trottenberg, Eds., Multigrid Methods, Springer-Verlag,New York, 1982.

[haskell 1972] B.G. Haskell and J.O. Limb, Predictive video encoding using measured subject velocity, U.S. Patent 3,632,865, January 1972.

[iscas 1997] J. Brailean, Universal accessibility and object-based functionality, ISCAS Tutorial onMPEG 4, Chapter 3.3, June 1997.

[jain 1981] J.R. Jain and A.K. Jain, Displacement measurement and its application in interframe image coding, IEEE Transactions on Communications, COM-29, 12, 1799–1808, December 1981.

[jain 1989] A.K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1989.

[koga 1981] T. Koga, K. Linuma, A. Hirano, Y. Iijima, and T. Ishiguro, Motion compensated interframe coding for video conferencing, Proceedings of NTC'81, New Orleans, LA, pp. G5.3.1–G5.3.5, December 1981.

[knuth 1973] D.E. Knuth, Searching and Sorting, Vol. 3: The Art of Computer Programming. Addison-Wesley, Reading, MA, 1973.

[limb 1975] J.O. Limb and J.A. Murphy, Measuring the speed of moving objects from televisionsignals, IEEE Transactions on Communications, COM-23, 474–478, April 1975.

[lin 1997] S. Lin, Y.Q. Shi, and Y.-Q. Zhang, An optical flow based motion compensation algorithmfor very low bit-rate video coding, Proceedings of 1997 IEEE International Conference on Acoustics,Speech and Signal Processing, Munich, Germany, pp. 2869–2872, April 1997; Y.Q. Shi, S. Lin, andY.-Q. Zhang, Optical flow-based motion compensation algorithm for very low-bit-rate videocoding, International Journal of Imaging Systems and Technology, 9, 4, 230–237, 1998.

[moscheni 1993] F. Moscheni, F. Dufaux, and H. Nicolas, Entropy criterion for optimal bit allocation between motion and prediction error information, SPIE 1993 Proceedings of Visual Communications and Image Processing, Cambridge, MA, pp. 235–242, November 1993.

[musmann 1985] H.G. Musmann, P. Pirsch, and H.J. Grallert, Advances in picture coding, Proceedings of the IEEE, 73, 4, 523–548, 1985.

[netravali 1979] A.N. Netravali and J.D. Robbins, Motion compensated television coding, Part I,The Bell System Technical Journal, 58, 3, 631–670, March 1979.

[nogaki 1992] S. Nogaki and M. Ohta, An overlapped block motion compensation for high qualitymotion picture coding, Proceedings of IEEE International Symposium on Circuits and Systems, 1,184–187, May 1992.

[orchard 1993] M.T. Orchard, Predictive motion-field segmentation for image sequence coding, IEEETransactions on Circuits and Systems for Video Technology, 3, 1, 54–69, February 1993.

[pratt 1974] W.K. Pratt, Correlation techniques of image registration, IEEE Transactions on Aerospaceand Electronic Systems, AES-10, 3, 353–358, May 1974.

[rocca 1972] F. Rocca and S. Zanoletti, Bandwidth reduction via movement compensation on a model of the random video process, IEEE Transactions on Communications, COM-20, 960–965, October 1972.

[shi 1997] Y.Q. Shi and X. Xia, A thresholding multidimensional block matching algorithm, IEEETransactions on Circuits and Systems for Video Technology, 7, 2, 437–440, April 1997.

[srinivasan 1984] R. Srinivasan and K.R. Rao, Predictive coding based on efficient motion estimation,Proceedings of ICC, pp. 521–526, May 1984.

[tzovaras 1994] D. Tzovaras, M.G. Strintzis, and H. Sahinolou, Evaluation of multiresolution block matching techniques for motion and disparity estimation, Signal Processing: Image Communication, 6, 56–67, 1994.

[xia 1996] X. Xia and Y.Q. Shi, A thresholding hierarchical block matching algorithm, Proceedings ofIEEE 1996 International Symposium on Circuits and Systems, Vol. II, Atlanta, GA, pp. 624–627,May 1996; X. Xia, Y.Q. Shi, and Y. Shi, A thresholding hierarchical block matching algorithm,Journal of Computer Science and Information Management, 1, 2, 83–90, 1998.


12
Pel Recursive Technique

As discussed in Chapter 10, the pel recursive technique is one of the three major approaches to two-dimensional (2-D) displacement estimation in image planes for the signal processing community. Conceptually speaking, it is one type of region matching technique. In contrast to block matching (which was discussed in Chapter 11), it recursively estimates a displacement vector for each pixel in an image frame. The displacement vector of a pixel is estimated by recursively minimizing a nonlinear function of the dissimilarity between two certain regions located in two consecutive frames. Note that a region means a group of pixels, but it could be as small as a single pixel. Also note that the terms "pel" and "pixel" have the same meaning. Both terms are used frequently in the field of signal and image processing.

This chapter is organized as follows. A general description of the recursive technique is provided in Section 12.1. Some fundamental techniques in optimization are covered in Section 12.2. Section 12.3 describes the Netravali and Robbins algorithm, the pioneering work in this category. Several other typical pel recursive algorithms are introduced in Section 12.4. In Section 12.5, a performance comparison between these algorithms is made.

12.1 Problem Formulation

In 1979 Netravali and Robbins published the first pel recursive algorithm to estimatedisplacement vectors for motion compensated interframe image coding. In [netravali1979], a quantity, called the displaced frame difference (DFD), was defined as follows:

DFD(x, y; d_x, d_y) = f_n(x, y) − f_{n−1}(x − d_x, y − d_y),   (12.1)

where
n and n − 1 indicate the two moments associated with two successive frames based on which motion vectors are to be estimated
x, y are coordinates in the image planes
d_x, d_y are the two components of the displacement vector \vec{d} along the horizontal and vertical directions in the image planes, respectively

DFD(x, y; d_x, d_y) can also be expressed as DFD(x, y; \vec{d}). Whenever it does not cause confusion, it can be written as DFD for the sake of brevity. Obviously, if there is no error in the estimation, i.e., the estimated displacement vector is exactly equal to the true motion vector, then the DFD will be zero.
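A minimal sketch of Equation 12.1 and the squared-DFD dissimilarity follows. Bilinear interpolation is used so that non-integer displacement estimates can be evaluated; the frame-indexing convention frame[y, x] and the function names are assumptions of this sketch.

import numpy as np

def sample(frame, y, x):
    # Bilinearly sample a frame at a (possibly non-integer) position (y, x);
    # positions are clamped to the frame for simplicity.
    y = min(max(y, 0.0), frame.shape[0] - 1.0)
    x = min(max(x, 0.0), frame.shape[1] - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    dy, dx = y - y0, x - x0
    y1 = min(y0 + 1, frame.shape[0] - 1)
    x1 = min(x0 + 1, frame.shape[1] - 1)
    return ((1 - dy) * (1 - dx) * frame[y0, x0] + (1 - dy) * dx * frame[y0, x1]
            + dy * (1 - dx) * frame[y1, x0] + dy * dx * frame[y1, x1])

def dfd(f_n, f_n_minus_1, x, y, dx, dy):
    # Displaced frame difference of Equation 12.1 at pixel (x, y) for the
    # displacement estimate (dx, dy).
    return float(f_n[y, x]) - sample(f_n_minus_1, y - dy, x - dx)

def dissimilarity(f_n, f_n_minus_1, x, y, dx, dy):
    # The squared DFD used by Netravali and Robbins as the quantity to minimize.
    return dfd(f_n, f_n_minus_1, x, y, dx, dy) ** 2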


A nonlinear function of the DFD was then proposed as a dissimilarity measure in [netravali 1979], namely the square of the DFD, i.e., DFD^2.

Netravali and Robbins thus converted displacement estimation into a minimization problem. That is, each pixel corresponds to a pair of integers (x, y), denoting its spatial position in the image plane. Therefore, the DFD is a function of \vec{d}. The estimated displacement vector \vec{d} = (d_x, d_y)^T, where (·)^T denotes the transposition of the argument vector or matrix, can be determined by minimizing DFD^2. This is a typical nonlinear programming problem, on which a large body of research has been reported in the literature. In the next section, several techniques that rely on a method, called the descent method, in optimization are introduced. The Netravali and Robbins algorithm can be applied to a pixel once or iteratively applied several times for displacement estimation. Then the algorithm moves to the next pixel. The estimated displacement vector of a pixel can be used as an initial estimate for the next pixel. This recursion can be carried out horizontally, vertically, or temporally. By temporally, we mean that the estimated displacement vector can be passed to the pixel of the same spatial position within image planes in a temporally neighboring frame. Figure 12.1 illustrates these three different types of recursion.

FIGURE 12.1 Three types of recursions: (a) horizontal, (b) vertical, and (c) temporal.


12.2 Descent Methods

Consider a nonlinear real-valued function z of a vector variable \vec{x},

z = f(\vec{x})   (12.2)

with \vec{x} ∈ R^n, where R^n represents the set of all n-tuples of real numbers. The question we face now is how to find a vector, denoted by \vec{x}^*, such that the function z is minimized. This is classified as an unconstrained nonlinear programming problem.

12.2.1 First-Order Necessary Conditions

According to optimization theory, if f(\vec{x}) has continuous first-order partial derivatives, then the first-order necessary conditions that \vec{x}^* has to satisfy are

∇f(\vec{x}^*) = 0,   (12.3)

where ∇ denotes the gradient operation with respect to \vec{x}, evaluated at \vec{x}^*. Note that whenever there is only one vector variable in the function z to which the gradient operator is applied, the sign ∇ remains without a subscript, as in Equation 12.3. Otherwise, i.e., if there is more than one vector variable in the function, we will explicitly write out the variable to which the gradient operator is applied as a subscript of the sign ∇. In component form, Equation 12.3 can be expressed as

\frac{\partial f(\vec{x})}{\partial x_1} = 0, \quad \frac{\partial f(\vec{x})}{\partial x_2} = 0, \quad \ldots, \quad \frac{\partial f(\vec{x})}{\partial x_n} = 0.   (12.4)

12.2.2 Second-Order Sufficient Conditions

If f(\vec{x}) has continuous second-order derivatives, then the second-order sufficient conditions for f(\vec{x}^*) to reach the minimum are known as

∇f(\vec{x}^*) = 0   (12.5)

and

H(\vec{x}^*) > 0,   (12.6)


where H denotes the Hessian matrix and is defined as follows:

H(\vec{x}) = \begin{bmatrix}
\frac{\partial^2 f(\vec{x})}{\partial x_1^2} & \frac{\partial^2 f(\vec{x})}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(\vec{x})}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f(\vec{x})}{\partial x_2 \partial x_1} & \frac{\partial^2 f(\vec{x})}{\partial x_2^2} & \cdots & \frac{\partial^2 f(\vec{x})}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(\vec{x})}{\partial x_n \partial x_1} & \frac{\partial^2 f(\vec{x})}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f(\vec{x})}{\partial x_n^2}
\end{bmatrix}   (12.7)

We can thus see that the Hessian matrix consists of all the second-order partial derivatives of f with respect to the components of \vec{x}. Equation 12.6 means that H is positive definite.

12.2.3 Underlying Strategy

Our aim is to derive an iterative procedure for the minimization. That is, we want to find asequence

\vec{x}_0, \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n, \ldots   (12.8)

such that

f(\vec{x}_0) > f(\vec{x}_1) > f(\vec{x}_2) > \cdots > f(\vec{x}_n) > \cdots   (12.9)

and the sequence converges to the minimum of f(\vec{x}), f(\vec{x}^*). A fundamental underlying strategy for almost all the descent algorithms [luenberger 1984]

is described as follows. We start with an initial point in the space; we determine a direction tomove according to a certain rule; then we move along the direction to a relative minimum ofthe function z. This minimum point becomes the initial point for the next iteration.

This strategy can be better visualized using a 2-D example, shown in Figure 12.2. There, \vec{x} = (x_1, x_2)^T. Several closed curves are referred to as contour curves or level curves. That is, each of the curves represents

f(x_1, x_2) = c   (12.10)

with c being a constant.

FIGURE 12.2 Descent method.


Assume that at the kth iteration we have a guess, \vec{x}_k. For the (k + 1)th iteration, we need to

1. Find a search direction, pointed to by a vector \vec{v}_k

2. Determine an optimal step size α_k with α_k > 0

such that the next guess \vec{x}_{k+1} is

\vec{x}_{k+1} = \vec{x}_k + α_k \vec{v}_k   (12.11)

and \vec{x}_{k+1} satisfies f(\vec{x}_k) > f(\vec{x}_{k+1}). In Equation 12.11, \vec{x}_k can be viewed as a prediction vector for \vec{x}_{k+1}, while α_k \vec{v}_k is an update vector. Hence, using the Taylor series expansion, we have

f(\vec{x}_{k+1}) = f(\vec{x}_k) + \langle ∇f(\vec{x}_k), α_k \vec{v}_k \rangle + ε,   (12.12)

where \langle \vec{s}, \vec{t} \rangle denotes the inner product between vectors \vec{s} and \vec{t}, and ε represents the higher-order terms in the expansion. Consider that the increment α_k \vec{v}_k is small enough and, thus, ε can be ignored. From Equation 12.12, it is obvious that to have f(\vec{x}_{k+1}) < f(\vec{x}_k) we must have \langle ∇f(\vec{x}_k), α_k \vec{v}_k \rangle < 0. That is,

f(\vec{x}_{k+1}) < f(\vec{x}_k) ⇒ \langle ∇f(\vec{x}_k), α_k \vec{v}_k \rangle < 0   (12.13)

Choosing a different update vector, i.e., the product of the \vec{v}_k vector and the step size α_k, results in a different algorithm implementing the descent method. In the same category of the descent method, a variety of techniques have been developed. The reader may refer to [luenberger 1984] or the many other existing books on optimization. Two commonly used techniques of the descent method are discussed here. One is called the steepest descent method, in which the search direction represented by the \vec{v} vector is chosen to be opposite to that of the gradient vector, and a real parameter for the step size α_k is used; the other is the Newton–Raphson method, in which the update vector in estimation, determined jointly by the search direction and the step size, is related to the Hessian matrix (defined in Equation 12.7). These two techniques are further discussed in Sections 12.2.5 and 12.2.6, respectively.

12.2.4 Convergence Speed

Speed of convergence is an important issue in discussing the descent method. It is utilizedto evaluate the performance of different algorithms.

12.2.4.1 Order of Convergence

Assume a sequence of vectors {\vec{x}_k}, with k = 0, 1, \ldots, ∞, converges to a minimum denoted by \vec{x}^*. We say that the convergence is of order p if the following formula holds [luenberger 1984]:

0 ≤ \overline{\lim}_{k \to \infty} \frac{|\vec{x}_{k+1} − \vec{x}^*|}{|\vec{x}_k − \vec{x}^*|^p} < ∞   (12.14)


where
p is positive
\overline{\lim} is the limit superior
|·| is the magnitude or norm of a vector argument

For the latter two notions, more descriptions follow. The concept of the limit superior is based on the concept of supremum. Hence, let us first

discuss the supremum. Consider a set of real numbers, denoted by Q, that is bounded above. Then there must exist a smallest real number o such that for all the real numbers in the set Q, i.e., q ∈ Q, we have q ≤ o. This real number o is referred to as the least upper bound or the supremum of the set Q, and it is denoted by

sup{q : q ∈ Q}   or   sup_{q ∈ Q}(q).   (12.15)

Now turn to a real sequence r_k, k = 0, 1, \ldots, ∞, that is bounded above. If s_k = sup{r_j : j ≥ k}, then the sequence {s_k} converges to a real number s^*. This real number s^* is referred to as the limit superior of the sequence {r_k}, and is denoted by

\overline{\lim}_{k \to \infty}(r_k).   (12.16)

The magnitude or norm of a vector \vec{x}, denoted by |\vec{x}|, is defined as

|\vec{x}| = \sqrt{\langle \vec{x}, \vec{x} \rangle},   (12.17)

where \langle \vec{s}, \vec{t} \rangle is the inner product between the vectors \vec{s} and \vec{t}. Throughout this discussion,

when we say vector we mean column vector. (Row vectors can be handled accordingly.) The inner product is therefore defined as

\langle \vec{s}, \vec{t} \rangle = \vec{s}^T \vec{t},   (12.18)

with the superscript T indicating the transposition operator. With the definitions of the limit superior and the magnitude of a vector introduced, we are

now in a position to easily understand the concept of the order of convergence defined in Equation 12.14. Since the sequences generated by the descent algorithms behave quite well in general [luenberger 1984], the limit superior is rarely necessary. Hence, roughly speaking, instead of the limit superior, the limit may be used in considering the speed of convergence.

12.2.4.2 Linear Convergence

Among the various orders of convergence, the order of unity is important, and is referred to as linear convergence. Its definition is as follows. If a sequence {\vec{x}_k}, k = 0, 1, \ldots, ∞, converges to \vec{x}^* with

\overline{\lim}_{k \to \infty} \frac{|\vec{x}_{k+1} − \vec{x}^*|}{|\vec{x}_k − \vec{x}^*|} = γ < 1,   (12.19)

then we say that this sequence converges linearly with a convergence ratio γ. Linear convergence is also referred to as geometric convergence because a linearly convergent sequence with convergence ratio γ converges to its limit at least as fast as the geometric sequence cγ^k, with c being a constant.


12.2.5 Steepest Descent Method

The steepest descent method, often referred to as the gradient method, is the oldest and simplest one among the various techniques in the descent method. As Luenberger pointed out in his book, it remains the fundamental method in the category for the following two reasons. First, owing to its simplicity, it is usually the first method attempted for solving a new problem. This observation is very true. As we shall see soon, when handling displacement estimation as a nonlinear programming problem in the pel recursive technique, the first algorithm developed by Netravali and Robbins is essentially the steepest descent method. Second, owing to the existence of a satisfactory analysis of the steepest descent method, it continues to serve as a reference for comparing and evaluating various newly developed and more advanced methods.

12.2.5.1 Formulae

In the steepest descent method, \vec{v}_k is chosen as

\vec{v}_k = −∇f(\vec{x}_k),   (12.20)

resulting in

\vec{x}_{k+1} = \vec{x}_k − α_k ∇f(\vec{x}_k),   (12.21)

where the step size α_k is a real parameter and, with our rule mentioned before, the sign ∇ here denotes the gradient operator with respect to \vec{x}_k. Since the gradient vector points in the direction along which the function f(\vec{x}) increases most rapidly, it is naturally expected that selecting the negative direction of the gradient as the search direction will lead to the steepest descent of f(\vec{x}). This is where the term steepest descent originated.
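A generic sketch of the steepest descent iteration of Equations 12.20 and 12.21 is shown below, with a fixed step size supplied by the caller; it is a plain illustration of the method, not the Netravali and Robbins algorithm itself.

import numpy as np

def steepest_descent(grad, x0, alpha=0.1, tol=1e-6, max_iter=1000):
    # Steepest descent (Equations 12.20 and 12.21):
    # x_{k+1} = x_k - alpha * grad(x_k), with a fixed step size alpha.
    # grad must return the gradient of the function being minimized.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - alpha * grad(x)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: minimize f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2
# grad = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 3)])
# steepest_descent(grad, [0.0, 0.0])  # converges toward (1, -3)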

12.2.5.2 Convergence Speed

It can be shown that if the sequence {*x} is bounded above, then the steepest descent methodwill converge to the minimum. Furthermore, it can be shown that the steepest descentmethod is linear convergent.

12.2.5.3 Selection of Step Size

It is worth noting that the selection of the step size ak has significant influence on thealgorithm’s performance. In general, if it is small, it produces an accurate estimate of *x *.But a smaller step size means it will take longer for the algorithm to reach the minimum.Although a larger step size will make algorithm converge faster, it may lead to an estimatewith large error. This situation can be demonstrated in Figure 12.3. There, for the sake of aneasy graphical illustration, *x is assumed to be one-dimensional (1-D). Two cases of too small(with subscript 1) and too large (with subscript 2) step sizes are shown for comparison.

12.2.6 Newton–Raphson’s Method

The Newton–Raphson method is the next most popular method among various descentmethods.

� 2007 by Taylor & Francis Group, LLC.

Page 311: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Small a

Large a

f (x)

x

v1k

x1kx1

k+1x2k+1x2

k

v2k

0

FIGURE 12.3An illustration of effect of selection of step size on minimization performance. Too small a requires more steps toreach x *. Too large a may cause overshooting.

12.2.6. 1 Formul ae

Consi der *x k at the k th iterat ion. The k þ 1th gue ss, *x k þ 1 , is the sum of *x k and *v k ,

*x k þ 1 ¼ *x k þ

*v k , (12: 22)

where *v k is a n upd ate vector as shown in Figure 12.4 . Now expan d the *x k þ 1 into the Taylor

series expl icitly contai ning the second-o rder term.

f ( *x k þ 1 ) ¼ f ( *x k ) þ hrf , *v i þ 12 h H ( *x k ) *v, *v i þ w, (12: 23)

wherew denotes the high er-ord er termsr deno tes the grad ientH denote s the Hessi an matri x

If*v is small enough, we can ignore the w. According to the first-order necessary conditions

for *x k þ 1 to be the min imum, discusse d in Secti on 12.2.1, we have

r~vf (*x k þ *

v ) ¼ rf (*x k)þH(*x k)*v ¼ 0, (12:24)

FIGURE 12.4Derivation of Newton–Raphson’s method.

xk + 1

vk

xk

� 2007 by Taylor & Francis Group, LLC.

Page 312: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

where r *v denote s the gradient operator with respect to *v. This lead s to

*v ¼ �H � 1 ( *x k )r f (*x k ) : (12 : 25)

The Ne wton –Raphson metho d is thu s derived as follows.

f ( *x k þ 1 ) ¼ f ( *x k ) � H � 1 ( *x k ) r f ( *x k ) : (12 : 26)

Anoth er loose and intuiti ve way to view the Newton –Raphso n met hod is that its fo rmat issimilar to the steepest des cent met hod, exc ept that the step size ak is now chos en asH � 1 ( *x k ), the inv erse of the Hessian matrix evalu ated at *x k .

The idea behind the Newton –Raphso n met hod is that the functi on being minimize d isapproxi mated locally by a quadra tic functi on and this quad ratic func tion is then minim-ized. It is note d that any fun ction will beha ve like a quad ratic function when it is close tothe min imum. Hence , the closer to the minimum, the mo re ef fi cient is the Ne wton –Raphso n metho d. Th is is the exa ct opposi te of the steepes t descen t met hod, which worksmore effi ciently at the begi nning, and less ef ficiently when close to the min imum. Th e pricepaid with the Newton–Raphson method is the extra calculation involved in evaluating theinverse of the Hessian matrix at *x k.

12.2.6.2 Convergence Speed

Assume that the second -order suf ficient conditions dis cussed in Se ction 12.2.2 are sat is fied.Furthermore, assume that the initial point *x 0 is sufficiently close to the minimum *x *.Then it can be shown that the Newton–Raphson method converges with an order of atleast two. This indicates that the Newton–Raphson method converges faster than thesteepest descent method.

12.2.6.3 Generalization and Improvements

In [luenberger 1984], a general class of algorithms is defined as

*x kþ1 ¼ *x k � akGrf (*x k), (12:27)

where G denotes an n3 n matrix, and ak a positive parameter. Both the steepest descentand the Newton–Raphson methods fall into this framework. It is clear that if G is an n3 nidentical matrix I, then this general form reduces to the steepest descent method. If G¼Hand a¼ 1 then this is the Newton–Raphson method.

Although it descends rapidly near the solution, the Newton–Raphson method may notdescend for points far away from the minimum because the quadratic approximation maynot be valid there. The introduction of the ak, which minimizes f, can guarantee the descentof f at the general points. Another improvement is to set G ¼ [zkIþH(*x k)]�1 with z � 0.Obviously this is a combination of the steepest descent method and the Newton–Raphsonmethod. Two extreme ends are the steepest method (very large zk) and the Newton–Raphson method (zk¼ 0). For most cases, the selection of the parameter zk aims at makingthe G matrix positive definite.

12.2.7 Other Methods

There are other gradient methods such as the Fletcher–Reeves method (also known asthe conjugate gradient method) and the Fletcher–Powell–Davidon method (also knownas the variable metric method). Readers may refer to [luenberger 1984] or otheroptimization texts.

� 2007 by Taylor & Francis Group, LLC.

Page 313: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

12 .3 Netra vali – Ro bb ins ’ Pel R ec ursi ve Algo rith m

Havi ng had an introd uction to some basic nonl inear progra mmi ng the ory, we now turn tothe pe l rec ursive tech nique in displace ment estimati on from the perspective of the descentmet hods. Let us take a look at the first pel recursi ve algori thm, the Ne travali –Robb ins pe lrec ursive algorithm . It actu ally estimate s dis placemen t vector s usi ng the steepest des centmet hod to minimize the squared DFD . That is,

*d k þ 1 ¼

*d k � 1

2 ar *d DFD 2 ( x, y,

*d k ), (12: 28)

where r *d DFD 2 ( x, y, *d k ) denote s the gradient of DFD 2 with respect to

*d evalu ated at ~dk , the

displa cement vector at the k th iterat ion, and a is pos itive. Equa tion 12.28 can be furtherwritte n as

*d k þ 1 ¼

*d k � aDFD( x, y,

*d k ) r *d DFD ( x, y,

*d k ) : (12: 29)

Owin g to Equatio n 12.1, the abov e equati on leads to*d k þ 1 ¼

*d k � aDFD ( x, y,

*d k ) rx , y f n� 1 ( x � dx , y � dy ), (12: 30)

where rx,y means a grad ient operato r with respec t to x and y. In [netrav ali 19 79], a constantof 1=1024 is assigne d to a.

12.3.1 Inclus ion of a Neighbo rhood Area

To mak e displacem ent estim ation more robust, Ne travali and Robbins cons idered an areafor evalu ating the DFD 2 in calcul ating the upd ate term. More precis ely, they assume thedispla cement vector is constant withi n a small neighborh ood V of the pixe l for which thedispla cement is being estimate d. That is,

*d k þ 1 ¼

*d k � 1

2 ar *d

Xi, x, y , 2V

wi DFD 2 ( x, y, *d k ), (12: 31)

wherei repre sents an index for the ith pixel ( x, y) wi thin Vwi is the we ight for the ith pixel in V

All the weights sat isfy the fo llowing two constrai nts:

wi � 0 (12: 32)Xwl ¼ 1 (12: 33)

(

i2 V

This inc lusion of a neighb orhood area also explain s why pe l recursi ve techni que is class i-fi ed into the category of region match ing techni ques as we discuss ed at the beginni ng ofthis c hapter.

12.3.2 Interpol ation

It is note d that inter polati on will be necessary when displace ment vect ors ’ com ponentsdx and dy are not intege r number of pi xels. A bilin ear int erpolation tech nique is used in[netrav ali 1979]. For the bilin ear interpol ation, rea ders may ref er to Chapter 10.

� 2007 by Taylor & Francis Group, LLC.

Page 314: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

12.3.3 Simpl ifi cation

To make the propo sed algori thm more ef ficient in com putation, Netravali and Robbi nsalso prop osed simp li fied versio ns of the displaceme nt estimati on and inter polati on algo-rithms in the ir paper.

One simpli fi ed ver sion of the Ne travali and Robb ins algori thm is as follows :

*d k þ 1 ¼

*d k � a sign{DF D(x, y;

*d k )} sign{ rx, y f n� 1 (x � dx , y � dy )} (12 : 34)

where sign{ s } ¼ 0, 1, �1 dep ending on s ¼ 0, s > 0, s < 0, respec tively, whil e the sign of avector quantit y is the vector of signs of its compone nts. In thi s ver sion the update vectorscan only assu me a n angle which is an integer multi ple of 45 8 . As sh own in [ne travali 1979],this ver sion is effect ive.

12.3.4 Perform ance

The perform ance of the Netraval i and Robbins algorithm has bee n evaluate d usi ngcompute r simu lation [ne travali 1979]. Two vide o seq uences with differe nt amoun ts anddifferent types of mo tion are tested. In eithe r cas e, the prop osed pel rec ursive algorithmdisplay s superior perform ance over the repl enishment algorithm [moun ts 1969; haskell1979], dis cussed brie fl y in Chapte r 10. The Netraval i and Robb ins algorithm ac hieves a bitrate which is 22% to 50% lower than that requi red by the repl enishme nt technique with thesimple frame differe nce predic tion.

12.4 Other P el Recursive A lgorithms

The progress and succe ss of the Netraval i and Robb ins algori thm stimula ted grea t researchinter ests in pe l rec ursive techniq ues. Many new algorithm s have bee n develop ed. Some ofthem are discuss ed in thi s section.

12.4.1 Bergmann ’s Algori thm (19 82)

Bergma nn modi fi ed Netravali and Robbi ns algori thm by using the Newton –Raphso nmetho d [ber gmann 1982]. In doing so, the follo wing difference betwe en the fun damentalframew ork of the descent met hods discusse d in Se ction 12.2 and the minimizat ion problemin displaceme nt estimati on discuss ed in Se ction 12.3 needs to be notic ed. That is, the objectfuncti on f ( *x ) discusse d in Section 12.2 now become s DFD 2 ( x, y;

*d ). Th e He ssian matrix H ,

cons isting of the second-ord er partia l derivati ves of the f (*x ) with respect to the compone ntsof *x now become the second-order derivatives of DFD2 with respect to dx and dy. Sincethe vector

*d is a 2-D column vector now, the H matrix is hence a 23 2 matrix. That is,

H ¼

@2DFD2(x, y,*d )

@2dx

@2DFD2(x, y,*d )

@dx@dy

@2DFD2(x, y,*d )

@dy@dx

@2DFD2(x, y,*d )

@2dy

266664

377775: (12:35)

� 2007 by Taylor & Francis Group, LLC.

Page 315: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

As expected, the Bergmann algorithm (1982) converges to the minimum faster than thesteepest descent method since the Newton–Raphson method converges with an order ofat least two.

12.4.2 Bergmann’s Algorithm (1984)

Based on Burkhard and Moll’s algorithm [burkhard 1979], Bergmann developed an algo-rithm, which is similar to the Newton–Raphson algorithm. The primary difference is thatan average of two second-order derivatives is used to replace those in the Hessian matrix.In this sense, it can be considered as a variation of the Newton–Raphson algorithm.

12.4.3 Cafforio and Rocca’s Algorithm

Based on their earlier study, Cafforio and Rocca proposed an algorithm in 1982, which isessentially the steepest descent method. That is, the step size a is defined as follows[cafforio 1983]:

a ¼ 1

rfn�1(x� dx, y� dy)�� ��2þh2

(12:36)

with h2¼ 100. The addition of h2 is intended to avoid the problem that would haveoccurred in a uniform region where the gradients are very small.

12.4.4 Walker and Rao’s Algorithm

Walker and Rao developed an algorithm based on the steepest descent method [walker1984; tekalp 1995], and also with a variable step size. That is,

a ¼ 1

2 rfn�1(x� dx, y� dy)�� ��2 , (12:37)

where

rfn�1(x� dx, y� dy)�� ��2¼ @fn�1(x� dx, y� dy)

@dx

� �2þ @fn�1(x� dx, y� dy)

@dy

� �2(12:38)

It is observed that this step size is variable instead of being a constant. Furthermore, thisvariable step size is reverse proportional to the norm square of the gradient of fn�1 (x � dx,y� dy) with respect to x, y. This means that this type of step size will be small in the edge orrough area, and will be large in the relatively smooth area. These features are desirable.

Although it is quite similar to the Cafforio and Rocca algorithm, the Walker and Raoalgorithm differs in the following two aspects: (i) a is selected differently, (ii) implemen-tation of the algorithm is different. For instance, instead of putting an h2 in the denominatorof a, the Walker and Rao algorithm uses a logic.

As a result of using the variable step size a, the convergence rate is improved substan-tially. This implies fast implementation and accurate displacement estimation. It wasreported that usually one to three iterations are able to achieve quite satisfactory resultsin most cases.

Another contribution is that the Walker and Rao algorithm eliminates the need totransmit explicit address information so as to bring out higher coding efficiency.

� 2007 by Taylor & Francis Group, LLC.

Page 316: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

12 .5 Per formance C ompariso n

A com prehensi ve survey of va rious algorithm s using the pe l recursive tech nique can befoun d in a study by Musma nn, Pirsch, and Grallert [mu smann 1985]. Ther e, two perform -ance fea tures are com pared amo ng the algorithm s. One is the converg ence rate and hencethe accuracy of dis placemen t estimati on. The other is the stability range . By stabili ty range ,we mean a range starti ng from which an algorithm can con verge to the minimu m of DFD 2,or the true dis placemen t vect or.

Comp ared with the Ne travali and Robb ins algorithm , thos e impro ved algorithm s dis-cussed in Se ction 12.4 do not use a constant st ep size , thus provid ing better adapt ation tolocal image statistics. Conse quently, they achieve a better conve rgence rate and moreaccu rate displace ment estimatio n. Acco rding to [ber gmann 1984] and [mu smann 1985],Bergma nn ’s algori thm (1984) pe rforms best among these various algori thms in ter ms ofconve rgence rate and accuracy.

Accordi ng to [mu smann 1985], the Newton –Raphson algorithm has a relative ly smallerstability range than the other algorithms. This agrees with our discussion in Section 12.2.2.That is, the performance of the Newton–Raphsonmethod improves when it works in the areaclose to the minimum. The choice of the initial guess, however, is relatively more restricted.

12.6 Summary

The pel recursive technique is one of three major approaches to displacement estimationfor motion compensation. It recursively estimates displacement vectors in a pixel-by-pixelfashion. There are three types of recursion: horizontal, vertical, and temporal. Displace-ment estimation is carried out by minimizing the square of the displaced frame difference(DFD). Therefore, the steepest descent method and the Newton–Raphson method, the twomost fundamental methods in optimization, naturally find their application in pel recursivetechniques. The pioneering Netravali and Robbins algorithm and several other algorithms,such as Bergmann’s (1982), Cafforio and Rocca, Walker and Rao, and Bergmann’s (1984) arediscussed in this chapter. They can be classified into one of two categories: The steepestdescent-based algorithms or the Newton–Raphson-based algorithms. Table 12.1 contains aclassification of these algorithms.

Note that the DFD can be evaluated within a neighborhood of the pixel for which adisplacement vector is being estimated. The displacement vector is assumed constantwithin this neighborhood. This makes the displacement estimation more robust againstvarious noises.

TABLE 12.1

Classification of Several Pel Recursive Algorithms

Category I Category IIAlgorithms Steepest Descent-Based Algorithm Newton–Raphson Based Algorithm

Netravali and Robbins Steepest descent

Bergmann (1982) Newton–Raphson

Walker and Rao Variation of steepest descent

Cafforio and Rocca Variation of steepest descent

Bergmann (1984) Variation of Newton–Raphson

� 2007 by Taylor & Francis Group, LLC.

Page 317: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Comp ared wi th the repl enishment techni que with simp le frame difference prediction(the fi rst rea l inter frame coding algori thm), the Ne travali and Robbins algorithm (the firstpel recursi ve tech nique) achi eves muc h higher coding ef ficiency. Speci fically , a 22% to 50%sav ing in bit rate has been reported for some compute r simu lations. Sever al new pelrec ursive algorithm s have made further impro vement s in ter ms of the conve rgence rateand the estimati on accuracy owin g to the repl acemen t of the fi xed step size util ized in theNe travali and Robbi ns algori thms, whi ch mak e these algorithm s mo re adapt ive to the localstatistics in image frames.

Exercises

1. What is the de fi nition of dis placed frame diffe rence? Justify Equati on 12.1.2. Why does the inclusion of a neighborhood area make the pel recursive algorithm more

robust against noise?3. Compare the performance of the steepest descent method with that of the Newton–

Raphson method.4. Explain the function of h2 in the Cafforio and Rocca algorithm.5. What is the advantage you expect to have from the Walker and Rao algorithm?6. What is the difference between the Bergmann algorithm (1982) and the Bergmann

algorithm (1984)?7. Why does the Newton–Raphson method have a smaller stability range?

References

[bergmann 1982] H.C. Bergmann, Displacement estimation based on the correlation of imagesegments, IEEE Proceedings of International Conference on Electronic Image Processing,York, England, pp. 215–219, July 1982.

[bergmann 1984] H.C. Bergmann, Ein schnell konvergierendes Displacement-Schätzverfahrenfür dieinterpolation von Fernsehbildsequenzen, Ph.D. dissertation, Technical University of Hannover,Hannover, Germany, February 1984.

[biemond 1987] J. Biemond, L. Looijenga, D.E. Boekee, and R.H.J.M. Plompen, A pel recursiveWiener-based displacement estimation algorithm, Signal Processing, 13, 399–412, December 1987.

[burkhard 1979] H. Burkhard and H. Moll, A modified Newton-Raphson search for the model-adaptive identification of delays, in Identification and System Parameter Identification, R. Isermann(Ed.), Pergamon Press, New York=Oxford, England, 1979, pp. 1279–1286.

[cafforio 1983] C. Cafforio and F. Rocca, The differential method for image motion estimation, inImage Sequence Processing and Dynamic Scene Analysis, T.S. Huang (Ed.), Springer-Verlag, Berlin,Germany, 1983, pp. 104–124.

[haskell 1979] B.G. Haskell, Frame replenishment coding of television, in Image TransmissionTechniques, W.K. Pratt (Ed.), Academic Press, New York, 1979.

[luenberger 1984] D.G. Luenberger, Linear and Nonlinear Programming, Addison Wesley, Reading,MA, 1984.

[mounts 1969] F.W. Mounts, A video encoding system with conditional picture-element replenish-ment, The Bell System Technical Journal, 48, 7, 2545–2554, September 1969.

[musmann 1985]H.G. Musmann, P. Pirsch, and H.J. Grallert, Advances in picture coding, Proceedingsof the IEEE, 73, 4, 523–548, 1985.

[netravali 1979] A.N. Netravali and J.D. Robbins, Motion compensated television coding, Part I, TheBell System Technical Journal, 58, 3, 631–670, March 1979.

[tekalp 1995] A.M. Tekalp, Digital Video Processing, Prentice-Hall, Englewood Cliffs, NJ, 1995.[walker 1984] D.R. Walker and K.R. Rao, Improved pel recursive motion compensation, IEEE

Transactions on Communications, COM-32, 1128–1134, October 1984.

� 2007 by Taylor & Francis Group, LLC.

Page 318: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

13Optical Flow

As men tioned in Chapte r 10, optical flow is one of the three major techniq ues that can beused to estimate dis placemen t vect ors from succe ssive image frame s. As opp osed to theothe r two displace ment estimati on techniq ues, block match ing and pel recursive method ,discuss ed in Chapte rs 11 and 12, howe ver, the optical flow tech nique was develop edprimar ily for 3-D m otion esti matio n in the compute r visi on comm unity. Al thoug h itprovid es a relative ly mo re accu rate displace ment estimatio n than the other two tech-nique s, as we shall see in thi s chap ter and the next chap ter, optical flow has not yet foundwide app lications for mo tion compensate d (MC) vide o c oding. This is mainly due to thefact that the re is a large number of motion vect ors (one vector per pixel) involve d, hence ,the more side inf ormatio n that needs to be enc oded and transmitt ed. As emphasi zed inChapter 11, we sh ould not forget the ultimate goal in MC video coding: to encode videodata wi th a total bit rate as low as possibl e, whil e maintaini ng a satisfac tory quality ofrecon structed video frames a t the receivin g end. If the extra bits require d for enc oding alarge amo unt of opti cal fl ow vectors counter balan ce the bits saved in enc oding thepredic tion err or (owing to mo re accurate motion esti matio n), then the usage of opticalflow in MC coding is not worth while. Beside s, mo re com putatio n is require d in opticalflow determi nation. These factors have preve nted optical fl ow from being pra cticallyutilized in MC vide o coding. With the conti nued adv ance in technol ogies, howe ver, webelieve this problem may be resol ved in the near future. In fact, a n initial , succes sfulattemp t has been made [shi 1998].

On the ot her hand, in theory, the opti cal fl ow tech nique is of gr eat impo rtance inunderstan ding the fund amental issues in 2-D mo tion dete rminatio n, such as the apertureproblem , the cons ervati on and neighborh ood constrai nts, and the dis tinction and relati on-ship betwe en 2-D mo tion and 2-D appa rent mo tion.

In this chap ter, we will focus on the optical flow techniq ue. In Secti on 1 3.1, as state dabov e, some fund amental issue s associate d with optical fl ow are addressed. Secti on 13.2discuss es the diffe rential met hod, while the corre lation met hod is covered in Secti on13.3. In Se ction 13.4, a multiple attrib utes appro ach is pres ented. Some perform ancecompar isons betwe en vari ous techni ques are inc luded in Secti ons 13.3 and 13.4.A summa ry is give n in Secti on 13.5.

13 .1 Fundamentals

Optical flow is ref erred to as the 2-D dis tribution of apparent velocit ies of movem ent ofinten sity patterns in an image plane [ho rn 1981]. In other word s, an optical flow fieldconsists of a dense velocity field with one velocity vector for each pixel in the image plane.

� 2007 by Taylor & Francis Group, LLC.

Page 319: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

If we know the time interval between two consecutive images, which is usually the case,then velocity vectors and displacement vectors can be converted from one to another. Inthis sense, optical flow is one of techniques used for displacement estimation.

13.1.1 2-D Motion and Optical Flow

In the above definition, it is noted that the word apparent is used and nothing about 3-Dmotion in the scene is stated. The implication behind this observation is discussed in thissection, beginning with the definition of 2-D motion. 2-D motion is referred to as motion ina 2-D image plane caused by 3-D motion in the scene. That is, 2-D motion is the projection(commonly perspective projection) of 3-D motion in the scene onto the 2-D image plane.This can be illustrated by using a very simple example shown in Figure 13.1. There theworld coordinate system O-XYZ and the camera coordinate systems o-xyz are aligned.The point C is the optical center of the camera. A point A1 moves to A2, while itsperspective projection moves correspondingly from a1 to a2. We then see that a 2-D motion(from a1 to a2) in image plane is invoked by a 3-D motion (from A1 to A2) in 3-D space. By a2-D motion field, or sometimes image flow, we mean a dense 2-D motion field: one velocityvector for each pixel in the image plane.

Optical flow, according to its definition, is caused by movement of intensity patterns inan image plane. Therefore, 2-D motion (field) and optical flow (field) are generally differ-ent. To support this conclusion, let us consider the following two examples. One is given byHorn and Schunck [horn 1981]. Imagine a uniform sphere rotating with a constant speed inthe scene. Assume that the luminance and all other conditions do not change at all whenpictures are taken. Then, there is no change in brightness patterns in the images. Accordingto the definition of optical flow, the optical flow is zero, whereas the 2-D motion field isobviously not zero. At the other extreme, consider a stationary scene; all objects in 3-Dworld space are still. If illuminance changes when pictures are taken in such a way thatthere is movement of intensity patterns in image planes, as a consequence, optical flowmay be nonzero. This confirms a statement made by Singh: the scene does not have to be inmotion relative to the image for the optical flow field to be nonzero [singh 1991]. It can beshown that the 2-D motion field and the optical flow field are equal under certainconditions. Understanding the difference between the two quantities and the conditionsunder which they are equal is important.

z, Z

a1a2

A1

A2

C

Of

x, X

y,Y

FIGURE 13.12-D motion versus 3-D motion.

� 2007 by Taylor & Francis Group, LLC.

Page 320: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

This unde rstandi ng can provi de us with some sort of guid e to evalu ate the reli ability ofestimati ng 3-D mo tion from optic al flow. Th is is becau se, in practi ce, time-v arying imagesequ ences are the only ones wha t we have at hand. Th e task in compute r vision is tointer pret 3-D motion from time-v arying sequ ences. Ther efore, we can only work withoptical flow in estimati ng 3-D motion . Since the main focus of thi s book is on image andvideo codi ng, we do not cover these equality cond itions here. (Inter ested rea ders may ref erto [singh 1991].) In mo tion com pensated video c oding, it is likew ise true that the imageframes and video data are on ly what we have at hand. We also, therefore , have to workwith optical fl ow. Our atte ntion is thus turne d to optical fl ow determinat ion and its usagein vide o data compress ion.

13.1.2 Apertu re Prob lem

Aper ture probl em is an imp ortant issue, originati ng in opti cs. Since it is inher ent in thelocal esti mation of opti cal flow , we address thi s issue in thi s sectio n. In optics, ape rtures areopenin gs in flat screens [bracewel l 1995]. Therefor e, apertures can have vari ous shapes ,such as circ ular, semicirc ular, a nd rectangu lar. Examp les of apertures includ e a thi n slit orarray of slit s in a scree n. A circ ular apertu re, a ro und hole made on the shut ter of awindo w, was used by Newton to st udy the compos ition of sunligh t. It is also we llknown that the circ ular a perture is of special inter est in st udying the diffr action pat tern[sear s 1986].

Roug hly speakin g, the aperture probl em in motion anal ysis refers to the problem thatoccurs when viewing motion via an aperture, i.e., a small opening in a flat screen. In [marr1982], it is stated that when a straight moving edge is observed through an aperture onlythe component of motion orthogonal to the edge can be measured. Let us examine somesimple example s dep icted in Figu re 13. 2. In Figu re 13.2a, a large rectangu lar ABCD islocated in the XOZ plane. A rectangular screen EFGH with a circular aperture is perpen-dicular to the OY axis. Figure 13.2b and c shows, respectively, what is observed throughthe aperture when the rectangular ABCD is moving along the positive X and Z directionswith a uniform speed. Since the circular opening is small and the line AB is very long, nomotion will be observed in Figure 13.2b. Obviously, in Figure 13.2c the upward movementcan be observed clearly. In Figure 13.2d, the upright corner of the rectangle ABCD, angle B,appears. At this time the translation along any direction in the XOZ plane can be observedclearly. The phenomena observed in this example demonstrate that it is sometimes impos-sible to estimate motion of a pixel by only observing a small neighborhood surrounding it.The only motion that can be estimated from observing a small neighborhood is the motionorthogonal to the underlying moving contour. In Figure 13.2b, there is no motion orthog-onal to the moving contour AB, the motion is aligned with the moving contour AB, whichcannot be observed through the aperture. Therefore, no motion can be observed throughthe aperture. In Figure 13.2c, the observed motion is upward, which is perpendicular to thehorizontal moving contour AB. In Figure 13.2d, any translation in the XOZ plane can bedecomposed into horizontal and vertical components. Either of these two components isorthogonal to one of the two moving contours: AB or BC.

A more accurate statement on the aperture problem needs a definition of the so-callednormal optical flow. The normal optical flow refers to the component of optical flow alongthe direction pointed by the local intensity gradient. Now we can make a more accuratestatement: the only motion in an image plane that can be determined is the normaloptical flow.

In general, the aperture problem becomes severe in image regions where strongintensity gradients exist, such as at the edges. In image regions with strong higher-orderintensity variations, such as corners or textured areas, the true motion can be estimated.

� 2007 by Taylor & Francis Group, LLC.

Page 321: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

X

Z

YO

A

B

C

D

(a)

E

F

G

H

C

A B

D

(b)

E F

GH

C

A B

D

(c)

E F

GH

(d)C

A

D

E F

GH

B

Any direction

FIGURE 13.2(a) Aperture problem: A large rectangle ABCD is located in the XOZ plane. A rectangular screen EFGH with acircular aperture is perpendicular to the OY axis. (b) Aperture problem: No motion can be observed through thecircular aperture when the rectangular ABCD is moving along the positive X direction. (c) Aperture problem: Themotion can be observed through the circular aperture when the ABCD is moving along the positive Z direction.(d) Aperture problem: The translation of ABCD along any direction in the XOZ plane can be observed throughthe circular aperture when the upright corner of the rectangle ABCD, angle B, appears in the aperture.

Singh provides a more elegant discussion on the aperture problem, in which he argues thatthe aperture problem should be considered as a continuous problem (it always exists, butin varying degrees of acuteness) instead of a binary problem (either it exists or it does not)[singh 1991].

� 2007 by Taylor & Francis Group, LLC.

Page 322: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

13.1.3 III-Pos ed Problem

Motion estimatio n fro m image seq uences, inc luding opti cal flow estima tion, belong s to thecateg ory of inv erse probl ems. This is because we wan t to infer motion from given 2-Dimages , which is the pe rspective proje ction of 3-D motion . Accordi ng to Hadama rd[berte ro 1988], a mathemat ical probl em is we ll-posed if it pos sesses the follo wing threecharac teristics :

1. Existe nce (the solution exists)

2. Uniqu eness (the sol ution is uniqu e)

3. Conti nuity (wh en the err or in the dat a ten ds towar d zero, then the induce d error inthe solut ion ten ds towar d zero as well)

Inverse probl ems usu ally a re no t well-p osed in that the solution m ay no t exist. In theexampl e discuss ed in Secti on 13.1.1, i.e., a uniform sphere rotated with illuminanc e fixed,the sol ution to motion estimati on does not exist since no motion can be infer red fromgiven image s. The aperture problem disc ussed in Section 13.1.2 is the cas e, where thesolut ion to the mo tion may not be uniqu e. Let us take a look at Figure 13.2b. From thegiven picture, one cannot tell whe ther the straigh t line AB is static, or is movinghoriz ontally. If it is mo ving horizon tally, one canno t tell the mo ving speed. In ot herwords , infinit ely many solutions exist for the case. In optical flow dete rminatio n, we willsee that com putations are noise sensitiv e. Th at is, even a small err or in the data canprodu ce an extremel y large error in the solution. Hence, we see that the motionestimati on from image sequ ences suffers from all the thre e aspects just mention ed:none xistence, nonu niquene ss, and disconti nuity. Th e last term is also refer red to as theinsta bility of the solut ion.

It is pointed out in [berte ro 1988] that all the low- level proces sing (also known as earlyvision ) in computat ional vision are inv erse problem s and are often ill-posed. Examp les inthe low- level proce ssing inc lude motion recovery, computat ion of opti cal fl ow, edgedetecti on, structur e from stereo, structur e from motion , structure from texture, sh apefrom shad ing, and so on. Fortun ately, the probl em with ear ly visi on is mildly ill-po sedin general. By mildly, we mean that a reduction of errors in the data can significantlyimprove the solution.

Since the early 1960s, the demand for accurate approximates and stable solutions inareas such as optics, radioastronomy, microscopy, and medical imaging has stimulatedgreat research efforts in inverse problems, resulting in a unified theory: the regularizationtheory of ill-posed problems [tikhonov 1977]. In the discussion of optical flow methods,we shall see that some regularization techniques have been posed and have improvedaccuracy in flow determination. More advanced algorithms continue to come.

13.1.4 Classification of Optical Flow Techniques

Optical flow in image sequences provides important information regarding both motionand structure, and it is useful in such diverse fields as robot vision, autonomous naviga-tion, and video coding. Although this subject has been studied for more than a decade,reducing the error in the flow estimation remains a difficult problem. A comprehensivereview and a comparison of the accuracy of various optical flow techniques have recentlybeen made [barron 1994]. So far, most of the techniques in the optical flow computationsuse one of the following basic approaches:

� 2007 by Taylor & Francis Group, LLC.

Page 323: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

. Gradien t-base d [horn 1981; lucas 198 1; nagel 1986; uras 1988; szeliski 1995;black 1996]

. Correlati on-bas ed [anan dan 1989; singh 1992; pan 1998]

. Spatiotem poral energy- based [ad elson 1985; heege r 1988; bigu n 1991]

. Phase-base d [wax man 1988; flee t 1990]

Beside s the se determi nistic approach es, the re is the stochas tic a pproach to opti cal flowcom putation [konrad 1992]. In thi s chapte r, we fo cus our discussi on of optical flow on thegrad ient-ba sed and correlat ion-ba sed techni ques because of their frequent appl ications inpra ctice a nd because of their fundame ntal impo rtance in theo ry. We also dis cuss multipl eattribu te tech niques in opti cal fl ow determinat ion. The othe r two app roaches will beexpl ained brie fly when we discuss new techni ques in mo tion estimati on in Chapter 14.

13 .2 Gradient -Bas ed Approach

It is note d that before the method s of optical flow determinat ion we re actu ally deve loped,optica l flow had been discusse d and exploited for motion and struc ture recovery fromimage sequenc es in com puter visi on for year s. Th at is, the optical flow field was assume dto be availab le in the study of motion recovery. Th e firs t ty pe of metho ds in optical flowdetermi natio n is ref erred to as gradient- based techni ques. This is becaus e the spati al andtem poral par tial derivative s of intensity function are util ized in these techni ques . In thi ssectio n, we shall presen t the Horn a nd Schunc k algorithm . It is rega rded as the mo stpromi nent repre sentativ e of this cate gory. Other m ethods in this categ ory are brie flydiscuss ed afte r presen ting the basic concepts .

13.2.1 Horn and Schu nck ’s Method

We shall begi n with a ver y general framew ork [shi 19 94] to derive a brigh tness tim e-invariance equation. We will then introduce Horn and Schunck’s method.

13.2.1.1 Brightness Invariance Equation

As state d in Chapter 10, the imaging space can be repre sented by

f (x, y, t, *s) (13:1)

where *s indicates the sensor’s position in 3-D world space, i.e., the coordinates of the sensorcenter and the orientation of the optical axis of the sensor. The *s is a 5-D vector. That is,*s ¼ (~x, ~y,~z,b,g) where ~x, ~y, and ~z represent the coordinate of the optical center of the sensorin 3-D world space; and b and g represent the orientation of the optical axis of the sensor in3-D world space, the Euler angles: pan and tilt, respectively.

With this very general notion, each picture, taken by a sensor located on a particularposition at a specific moment, is merely a special cross-section of this imaging space. Bothtemporal and spatial image sequences become a proper subset of the imaging space.

Assume now a world point P in 3-D space that is perspectively projected onto the imageplane as a pixel with the coordinates xP and yP. Then, xP and yP are also dependent ont and *s. That is,

f ¼ f (xP(t,*s), yP(t,

*s), t, *s) (13:2)

� 2007 by Taylor & Francis Group, LLC.

Page 324: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

If the optical radiation of the world point P is invariant with respect to the time intervalfrom t1 to t2, we then have

f (xP(t1,*s1), yP(t1,

*s1), t1,*s1) ¼ f (xP(t2,

*s1), yP(t2,*s1), t2,

*s1) (13:3)

This is the brightness time-invariance equation.At a specific moment t1, if the optical radiation of P is isotropical we then get

f (xP(t1,*s1), yP(t1,

*s1), t1,*s1) ¼ f (xP(t1,

*s2), yP(t1,*s2), t1,

*s2): (13:4)

This is the brightness space-invariance equation.If both conditions are satisfied, we get the brightness time-and-space-invariance

equation, i.e.,

f (xP(t1,*s1), yP(t1,

*s1), t1,*s1) ¼ f (xP(t2,

*s2), yP(t2,*s2), t2,

*s2): (13:5)

Consider two brightness functions f (x(t, *s), y(t, *s), t,*s) and f (x(tþ Dt, *sþ D*s), y(tþ Dt,*s þ D*s), t þ Dt, *sþ D*s) in which the variation in time, Dt, and the variation in thespatial position of the sensor, D*s, are very small. Due to the time-and-space-invariance ofbrightness, we can get

f (x(t, *s), y(t, *s), t, *s) ¼ f (x(tþ Dt, *sþ D*s), y(tþ Dt, *sþ D*s), tþ Dt, *sþ D *s) (13:6)

The expansion of the right-hand side of Equation 13.6 in the Taylor series at (t,~s), and theuse of Equation 13.5 lead to

@f@x

uþ @f@y

vþ @f@t

� �Dtþ @f

@xu

*s þ @f@y

v*s þ @f

@ *s

� �D*sþ « ¼ 0 (13:7)

where

u D¼@x@t

, v D¼@y@t

, u*s D¼

@x@ *s

, v*s D¼

@y@ *s

If D *s ¼ 0, i.e., the sensor is static in a fixed spatial position (in other words, both thecoordinate of the optical center of the sensor and its optical axis direction remainunchanged), dividing both sides of the equation by Dt and evaluating the limit as Dt! 0degenerate Equation 13.7 into

@f@x

uþ @f@y

vþ @f@t¼ 0 (13:8)

If Dt¼ 0, both its sides are divided by D*s and D*s! 0 is examined, Equation 13.7 thenreduces to

@f@x

u*s þ @f

@yv

*s þ @f@ *s¼ 0 (13:9)

when Dt¼ 0, i.e., at a specific time moment, the images generated with sensors at differentspatial positions can be viewed as a spatial sequence of images. Equation 13.9 is, then, theequation for the spatial sequence of images.

� 2007 by Taylor & Francis Group, LLC.

Page 325: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

For the sake of brevity, we shall fo cus on the grad ient-ba sed approa ch to opti cal flowdetermi natio n wi th respec t to temporal image sequenc es. Th at is, in the rest of this sectionwe shall addr ess only Equati on 13.8. It is noted that the derivati on can be extended tospatia l image sequenc es. Th e optical fl ow technique for spatial image seq uences is useful instereo image dat a compr ession. It plays an importan t role in motion and structur e rec ov-ery. Intere sted readers are refer red to [shi 1994; sh u 1993].

13.2.1. 2 Smoothne ss Constrai nt

Carefu l exa minati on of Equatio n 13.8 reveals that we have two unkn owns: u and v, i.e., thehoriz ontal a nd vertica l com ponents of an optic al flow vector at a three -tuple ( x, y, t ), butonly one equati on to relate the m. This once again demo nstrates the ill-posed nature ofoptica l fl ow determi natio n. This also indicate s that the re is no way to com pute opti cal flowby con sidering a singl e poi nt of the brigh tness pattern movin g indep endently. As state d inSecti on 13 .1.3, some regulari zation measure — here an ext ra cons traint — must be taken toovercome the difficulty.

A most popularly used constraint was proposed by Horn and Schunck and is referred toas the smoothness constraint. As the name implies, it constrains flow vectors to vary fromone to another smoothly. Clearly, this is true for points in the brightness pattern most of thetime, particularly for points belonging to the same object. It may be violated, however,along moving boundaries. Mathematically, the smoothness constraint is imposed in opticalflow determination by minimizing the square of the magnitude of the gradient of theoptical flow vectors:

@u@x

� �2

þ @u@y

� �2

þ @v@x

� �2

þ @v@y

� �2

(13:10)

It can be easily verified that the smoother the flow vector field, the smaller these quantities.Actually, the square of the magnitude of the gradient of intensity function with respect tothe spatial coordinates, summed over a whole image or an image region, has been used asa smoothness measure of the image or the image region in the digital image processingliterature [gonzalez 1992].

13.2.1.3 Minimization

Optical flow determination can then be converted into a minimization problem.The square of the left-hand side of Equation 13.8, which can be derived from the

brightness time-invariance equation, represents one type of error. It may be caused byquantization noise or other noises and can be written as

«2b ¼@f@x

uþ @f@y

vþ @f@t

� �2

(13:11)

The smoothness measure expressed in Equation 13.10 denotes another type of error,which is

«2s ¼@u@x

� �2

þ @u@y

� �2

þ @v@x

� �2

þ @v@y

� �2

(13:12)

� 2007 by Taylor & Francis Group, LLC.

Page 326: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

The total error to be minimized is

«2 ¼Xx

Xy

«2b þ a2«2s

¼Xx

Xy

@f@x

uþ @f@y

vþ @f@t

� �2

þa2 @u@x

� �2

þ @u@y

� �2

þ @v@x

� �2

þ @v@y

� �2" #

(13:13)

where a is a weight between these two types of errors. The optical flow quantities u and vcan be found by minimizing the total error. Using the calculus of variation, Horn andSchunck derived the following pair of equations for two unknown u and v at each pixel inthe image.

f 2x uþ fxfyv ¼ a2r2u� fxftfxfyuþ f 2y v ¼ a2r2v� fyft

�(13:14)

where fx ¼ @f@x , fy ¼ @f

@y , ft ¼ @f@t ;r2 denotes the Laplacian operator. The Laplacian operator

of u and v is defined below.

r2u ¼ @2u@x2þ @2u

@y2

r2v ¼ @2v@x2þ @2v@y2

(13:15)

13.2.1.4 Iterative Algorithm

Instead of using a classical algebraic method to solve the pair of equations for u and v,Horn and Schunck adopted the Gaussian Seidel [ralston 1978] method to have the follow-ing iterative procedure:

ukþ1 ¼ �uk � fx[ fx�uk þ fy�vk þ ft]a2 þ f 2x þ f 2y

vkþ1 ¼ �vk � fy[ fx�uk þ fy�vk þ ft]a2 þ f 2x þ f 2y

(13:16)

where the superscripts k and k þ 1 are indexes of iteration and �u, �v are the local averages ofu and v, respectively.

Horn and Schunck define �u, �v as follows:

�u ¼ 16{u(x, yþ 1)þ u(x, y� 1)þ u(xþ 1, y)þ u(x� 1, y)}

þ 112

{u(x� 1, y� 1)þ u(x� 1, yþ 1)þ u(xþ 1, y� 1)þ u(xþ 1, yþ 1)}

�v ¼ 16{v(x, yþ 1)þ v(x, y� 1)þ v(xþ 1, y)þ v(x� 1, y)}

þ 112

{v(x� 1, y� 1)þ v(x� 1, yþ 1)þ v(xþ 1, y� 1)þ v(xþ 1, yþ 1)}

(13:17)

� 2007 by Taylor & Francis Group, LLC.

Page 327: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

(x, y, t +1) (x, y +1, t +1)

(x +1,y, t +1) (x +1,y +1,t +1)

(x, y, t ) (x, y +1, t )

(x +1, y, t ) (x+1,y +1,t )

x

y

t

{[f(x +1,y,t) – f(x,y,t)] + [f(x +1,y,t +1) – f(x,y,t +1)]

{[f(x,y +1,t) – f(x,y,t)] + [f(x +1,y +1,t) – f(x +1,y,t )]

{[f(x,y,t +1) – f(x,y,t)] + [f(x +1,y,t +1) – f(x+1,y,t )]

+ [f(x + 1,y +1,t) – f(x,y,t)] + [f(x +1,y +1,t +1) – f(x, y +1,t +1)]}

+ [f(x,y +1,t +1) – f(x,y,t +1)] + [f(x +1,y +1,t +1) – f(x +1,y,t +1)]}

+ [f(x,y +1,t +1) – f(x,y +1,t)] + [f(x +1,y +1,t +1) – f(x +1,y +1,t)]}41

41

41

ft

fy

fx

=

=

=

FIGURE 13.3Estimation of fx, fy, and ft.

The estima tion of the par tial derivati ves of inten sity functi on and the Laplac ian of flowvect ors need to be addr essed. Horn and Schunc k cons idered a 2 3 2 3 2 spatiote mpora lneighbo rhood , shown in Figu re 13.3, for esti mation of par tial derivati ves fx, f y, and f t. Notethat repl acing the fi rst-orde r different iatio n by the fi rst-orde r difference is a commo npra ctice in managi ng digital images. Th e arithme tic aver age can rem ove the noise eff ect,thus making the obt ained first-ord er difference s less sensi tive to va rious noise s.

The Laplaci an ope rator of u and v is approxim ated by

r 2 u ¼ �u (x, y) � u (x , y)r 2 v ¼ �v( x, y) � v( x, y ) (13: 18)

Equi valent ly, the Lapl acian of u and v, r2( u) and r 2( v), can be obt ained by appl ying a

3 3 3 wind ow ope rator, sh own in Figu re 13.4, to each point in the u and v planes ,respec tively .

Sim ilar to the pel recursi ve techni que discusse d in Chapte r 12, the re are two differentways to iterat e. One way is to iterat e at a pixe l unt il a solut ion is steady. Anoth er way is to

� 2007 by Taylor & Francis Group, LLC.

Page 328: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

x

y

(x −1, y −1) (x −1, y )

(x, y −1) (x, y ) (x, y +1)

(x +1, y −1) (x +1, y )

[u(x –1, y) + u(x, y –1) + u(x, y +1) + u(x +1, y)]

[u(x –1, y–1) + u(x –1, y +1) + u(x +1, y –1) + u(x +1, y +1)]

[v(x –1, y –1) + v(x –1, y +1) + v(x +1, y –1) + v(x +1, y +1)]

[v(x –1, y) + v(x, y –1) + v(x, y +1) + v(x +1, y)]

12

16

1

12

16

1

+

≈∇2v

− u(x,y)

− v(x,y)

+

≈∇2u

12

1

6

1

12

1

6

16

1

12

1

6

1

12

1

−1(x −1, y +1)

(x +1, y +1)

FIGURE 13.4A 33 3 window operation for estimation of the Laplacian of flow vector.

iterat e only once for each pixe l. In the latter case, a good initial flow vect or is require d andis usually derive d from the previo us pixe l.

13.2.2 Modi fied Horn and Schu nck Method

Obse rving that the first-ord er difference is used to approxim ate the fi rst-orde r differe nti-ation in Horn and Schunc k ’s original algorithm , and rega rding this as a relatively cru deform and a source of error, Barro n et al. [barron 1994] develop ed a modi fied ver sion of theHorn and Schunc k method .

It featur es a spati otemporal presmo othing and a more advance d approximat ion ofdifferent iatio n. Speci fic ally, it uses a Gaussian filter as a spati otemporal pre filter. By theterm Gaussi an fi lter, we mean a low- pass filter wi th a mask sh aped similar to that ofthe Gaussian probability density function (pdf). This is similar to wha t was utilized in theformu lation of the Gau ssian pyramid discusse d in Chapte r 11. Th e term spati otemporalmeans that the Gau ssian fi lter is used for low- pass fi ltering in both spati al and tem poraldoma ins.

With resp ect to the more advance d approxi matio n of different iation, a fo ur-point cen traldifference operator is use d, which has a mas k, shown in Figure 13. 5.

As we shall see later in this chap ter, this mo di fied Horn and Schu nck algo rithm hasachieve d better perform ance than the original one owin g to the two above-m entionedmeasure s. This success indi cates that a reducti on of no ise in image (data) lead s to a

� 2007 by Taylor & Francis Group, LLC.

Page 329: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

FIGURE 13.5Four-point central difference operator mask.

12

1−

12

80 12

8−

12

1

signi ficant reducti on of no ise in optical fl ow (solu tion). Th is exa mple support s the state-men t we men tioned earlier that the ill-po sed problem in low- level computat ional vision ismild ly ill-pos ed.

13.2.3 Luc as and Kana de’ s Method

Lucas a nd Kanad e assume that a flow vector is c onstant within a small neighborh ood of apixe l, denoted by V. Then the y fo rm a weigh ted obje ct fun ction as foll ows:

X( x, y )2 V

w2 ( x,y)@ f ( x,y,t )

@ xu þ @ f ( x,y,t )

@ vv þ @ f ( x,y,t)

@ t

� �2(13: 19)

where w ( x, y) is a windo w functi on that gives more we ight to the cen tral portion than thesurrou nding portio n of the neighbo rhood V .

The fl ow determi nation thu s become s a problem of a least square fit of the brigh tnessinv ariance con straint. We obse rve that the smooth ness constrai nt has been implied inEquati on 13.19, where the flow vector is assume d to be con stant wi thin V .

13.2.4 Nagel ’s Method

Nage l first used the second-o rder derivative s in optical flow determi nation in the ver yearly days [nagel 1983]. Since the brightness function f (x,y,t, *s) is a real-valued function ofmulti ple variable s (o r a vector of variables ), the Hessi an matrix, discusse d in Chapter 12, isused for the second-order derivatives.

An oriented-smoothness constraint was developed by Nagel that prohibits imposition ofthe smoothness constraint across edges (Figure 13.6). In Figure 13.6, an edge AB separatestwo different moving regions: region 1 and region 2. The smoothness constraint is imposedin these regions separately. That is, no smoothness constraint is imposed across the edge.Obviously, it would be a disaster if we smoothen the flow vectors across the edge.As a result, this reasonable treatment effectively improves the accuracy of optical flowestimation [nagel 1989].

FIGURE 13.6Oriented-smoothness constraint.

Region 2

Region 1

A

BDirection across the edge

Edge

� 2007 by Taylor & Francis Group, LLC.

Page 330: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

13.2.5 Uras, Girosi, Ver ri, and Torre ’s Method

The Uras, Gir osi, Verri , and Torre met hod is another m ethod that uses second -orderderivati ves. Base d on a local procedur e, it perform s qui te well [uras 1988].

13.3 Correlation-Based Approach

The cor relation-bas ed app roach to opti cal flow determi natio n is similar to block match ing,covered in Chap ter 11 . As may be rec alled, the conv entiona l block match ing techni quepartit ions an image into nonove rlapped, fi xed-size , rectangl e blocks. Then, for each block,the best match ing in the previo us image frame is fo und. In doing so, a search wi ndow isopene d in the previ ous frame accordin g to some a priori knowledge : the time intervalbetwe en the two frame s and the maximu m pos sible mo ving velo city of objects in frames .Centere d on each of the candidat e pixe ls in the search wi ndow, a rectangl e corre lationwindo w of the same size as the or iginal block is ope ned. Th e best match ed block in thesearch wind ow is chosen so that either the simi larity measu re is maxi mized orthe dissimila rity measu re is minimi zed. The relati ve spatial pos ition betw een these twoblocks (the origi nal block in the current frame and the best match ed one in the previ ousframe) give s a transl ational mo tion vect or to the original block. In the correlat ion-ba sedapproach to optical flow compu tation, the m echanism is very simi lar to that in theconve ntional block match ing. The only differe nce is that fo r each pixel in an image, weopen a rec tangle cor relation windo w centered on this pixel for which an optical fl ow vect orneeds to be dete rmined. It is for thi s corre lation wind ow that we fi nd the best match in thesearch wind ow in its temporal neighbo ring image frame (Figure 13.7). A com pariso nbetwe en Figu res 13.7 and 11.1 can convin ce us about the above observat ion. In this secti on,we first brie fl y dis cuss Anan dan ’s met hod, which is a pi oneer work in this cate gory, andthen Si ngh ’s metho d is describ ed. His uni fied view of optical flow computat ion is intro-duced . We the n presen t a cor relation-f eedb ack met hod by Pan, Shi, and Shu, which usesthe feedback tech nique in fl ow calcu lation.

(x0,y0)p

q

q

p(x,y)

(x,y)

f (x, y, t )

The best matching correlation window

Optical flow vector

f (x, y, t −1)

Search window

Correlation window

The pixel to which optical flow needs to be determined

FIGURE 13.7Correlation-based approach to optical flow determination.

� 2007 by Taylor & Francis Group, LLC.

Page 331: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

13.3.1 Ananda n ’s Method

As mention ed in Chapter 11, the sum of squared differe nce (SS D) is used as a dis similar itymeasu re in [ananda n 1 987]. It is essen tially a simp li fied ver sion of the we ll-know n meansquare err or (MSE) . Due to its simp licity, it is use d in the m ethods deve loped by Singh[singh 1992] as we ll as Pan, Shi, and Shu [pa n 199 8].

In Anan dan ’ s method [anan dan 1989], a pyramid struc ture is forme d, and it can be use dfor an efficient coarse–fine search. This is very similar to the multiresolution block matchingtechni ques discusse d in Chapte r 11. In the high er level s (with lower reso lution) of thepyra mid, a full search can be perform ed withou t a subs tantial increa se in compu tation.The esti mated velocit y (o r displace ment) vect or can be propagate d to the lower level s (withhigh er resolutio n) fo r furthe r re finem ent. As a resu lt, a relative ly large mo tion vect or can beestima ted with a cer tain degree of accuracy.

Inste ad of the Gau ssian pyra mid discusse d in Chapte r 11, a Lapl acian pyra mid is use dhere. To understan d the Lapl acian pyra mid let us take a look at Figu re 13.8a. Th ere twocons ecutive levels are sh own in a Gaussia n pyramid st ructure: level k , denoted by f k ( x, y)and level k þ 1, f k þ 1( x, y ). Figure 1 3.8b shows how level k þ 1 can be deriv ed from level kin the Gaussian pyra mid. Th at is, as st ated in Chapter 11, level k þ 1 in the Gau ssianpyra mid can be obtain ed throu gh low-pass fi ltering appl ied to level k , followe d by sub-sam pling. In Figu re 13.8c, level k þ 1 is first inter polated, thu s produc ing an esti mate oflevel k , f k( x, y). Th e difference betwe en the original level k and the inter polate d estimate oflevel k generat es an error at level k , denote d by e k( x, y). If the re are no quan tization err orsinv olved, then level k , f k ( x, y) can be recovere d com pletely from the int erpolated esti mateof level k , f k ( x, y), and the error at level k , e k ( x, y). That is,

f_k(x, y) = \tilde{f}_k(x, y) + e_k(x, y)    (13.20)

With quantization errors, however, the recovery of level k, f_k(x, y), is not error free. It can be shown that coding \tilde{f}_k(x, y) and e_k(x, y) is more efficient than directly coding f_k(x, y).

A set of images e_k(x, y), k = 0, 1, ..., K − 1, and f_K(x, y) forms a Laplacian pyramid. Figure 13.8d displays a Laplacian pyramid with K = 5. It can be shown that Laplacian pyramids provide an efficient way for image coding [burt 1983]. A more detailed description of Gaussian and Laplacian pyramids can be found in [burt 1984; lim 1990].
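To make the pyramid construction concrete, the following sketch builds one Gaussian/Laplacian level pair with NumPy and SciPy. It is an illustration only, not the authors' code; the 5-tap binomial kernel and the use of bilinear interpolation for the up-sampling step are assumptions made here for brevity.

```python
import numpy as np
from scipy.ndimage import convolve, zoom

def gaussian_reduce(f):
    """Low-pass filter level k and subsample by 2 to obtain Gaussian level k+1."""
    k1d = np.array([1., 4., 6., 4., 1.]) / 16.0      # assumed 5-tap kernel
    kernel = np.outer(k1d, k1d)
    smoothed = convolve(f, kernel, mode='nearest')
    return smoothed[::2, ::2]

def laplacian_level(f_k):
    """Return (f_{k+1}, e_k): the next Gaussian level and the error image e_k
    such that f_k is approximately interpolate(f_{k+1}) + e_k (Equation 13.20)."""
    f_k = f_k.astype(float)
    f_k1 = gaussian_reduce(f_k)
    # interpolate level k+1 back to the size of level k (bilinear, order=1)
    f_k_est = zoom(f_k1, (f_k.shape[0] / f_k1.shape[0],
                          f_k.shape[1] / f_k1.shape[1]), order=1)
    e_k = f_k - f_k_est                               # error image at level k
    return f_k1, e_k
```

Repeating laplacian_level on successive Gaussian levels yields the set of error images and the coarsest level that together form the Laplacian pyramid.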

13.3.2 Singh's Method

Singh presented a unified point of view on optical flow computation in [singh 1991, 1992]. He classified the information available in image sequences for optical flow determination into two categories: conservation information and neighborhood information. Conservation information is the information assumed to be conserved from one image frame to the next in flow estimation. Intensity is an example of conservation information, which is used most frequently in flow computation. Clearly, the brightness invariance constraint in the Horn and Schunck method is another way to state this type of conservation. Some functions of intensity may be used as conservation information as well. In fact, Singh uses the Laplacian of intensity as conservation information for computational simplicity. More examples can be found later in Section 13.4. Other information, different from intensity, such as color, can be used as conservation information. Neighborhood information is the information available in the neighborhood of the pixel from which optical flow is estimated.

These two different types of information correspond to two steps in flow estimation. In the first step, conservation information is extracted, resulting in an initial estimate of flow


FIGURE 13.8 Laplacian pyramid. (a) Two consecutive levels in a pyramid structure. (b) Derivation of level k + 1 from level k in a Gaussian pyramid (low-pass filtering followed by subsampling). (c) Derivation of the error at level k in a Laplacian pyramid (interpolation of level k + 1 and subtraction from level k). (d) Structure of the Laplacian pyramid.

vector. In the second step, this initial estimate is propagated into a neighborhood area and is iteratively updated. Obviously, in the Horn and Schunck method, the smoothness constraint is essentially one type of neighborhood information. Iteratively, estimates of flow vectors are refined with neighborhood information so that flow estimates from areas having sufficient intensity variation, such as the intensity corners shown in Figure 13.2d and areas with strong texture, can be propagated into areas with relatively small intensity variation or uniform intensity distribution.

With this unified point of view on optical flow estimation, Singh treated flow computation as parameter estimation. By applying estimation theory to flow computation, he developed an estimation-theoretical method to determine optical flow. It is a correlation-based method and consists of the above-mentioned two steps.


13.3.2.1 Conservation Information

In the first step, for each pixel (x, y) in the current frame f_n(x, y), a correlation window of (2l + 1) × (2l + 1) is opened, centered on the pixel. A search window of (2N + 1) × (2N + 1) is opened in the previous frame f_{n−1}(x, y), centered on (x, y). An error distribution for those (2N + 1) × (2N + 1) samples is calculated by using the SSD as follows:

E_c(u, v) = \sum_{s=-l}^{l} \sum_{t=-l}^{l} [f_n(x+s, y+t) - f_{n-1}(x-u+s, y-v+t)]^2, \quad -N \le u, v \le N    (13.21)

A response distribution for these (2N + 1) × (2N + 1) samples is then calculated:

R_c(u, v) = e^{-\beta E_c(u, v)}    (13.22)

where β is a parameter whose function and selection will be described in Section 13.3.3.1. According to the weighted-least-square estimation, the optical flow can be estimated in this step as follows:

u_c = \frac{\sum_u \sum_v R_c(u, v)\, u}{\sum_u \sum_v R_c(u, v)}, \qquad v_c = \frac{\sum_u \sum_v R_c(u, v)\, v}{\sum_u \sum_v R_c(u, v)}    (13.23)

Assuming errors are additive and zero-mean random noise, we can also find the covariance matrix associated with the above estimate:

S_c = \begin{pmatrix} \dfrac{\sum_u \sum_v R_c(u, v)(u - u_c)^2}{\sum_u \sum_v R_c(u, v)} & \dfrac{\sum_u \sum_v R_c(u, v)(u - u_c)(v - v_c)}{\sum_u \sum_v R_c(u, v)} \\[2ex] \dfrac{\sum_u \sum_v R_c(u, v)(u - u_c)(v - v_c)}{\sum_u \sum_v R_c(u, v)} & \dfrac{\sum_u \sum_v R_c(u, v)(v - v_c)^2}{\sum_u \sum_v R_c(u, v)} \end{pmatrix}    (13.24)

13.3.2.2 Neighborhood Information

After step 1, all initial estimates are available. In step 2, they need to be refined according to neighborhood information. For each pixel, the method considers a (2w + 1) × (2w + 1) neighborhood centered on it. The optical flow of the center pixel is updated from the estimates in the neighborhood. A set of Gaussian coefficients is used in this method such that the closer the neighbor pixel is to the center pixel, the more influence the neighbor pixel has on the flow vector of the center pixel. The weighted-least-square-based estimate in this step is

\bar{u} = \frac{\sum_u \sum_v R_n(u, v)\, u}{\sum_u \sum_v R_n(u, v)}, \qquad \bar{v} = \frac{\sum_u \sum_v R_n(u, v)\, v}{\sum_u \sum_v R_n(u, v)}    (13.25)


FIGURE 13.9 3 × 3 Gaussian mask. (The weights are 1/4 at the center, 1/8 at the four edge neighbors, and 1/16 at the four corners, i.e., the separable products 0.5 × 0.5, 0.5 × 0.25, and 0.25 × 0.25.)

and the associated covariance matrix is

S_n = \begin{pmatrix} \dfrac{\sum_i R_n(u_i, v_i)(u_i - \bar{u})^2}{\sum_i R_n(u_i, v_i)} & \dfrac{\sum_i R_n(u_i, v_i)(u_i - \bar{u})(v_i - \bar{v})}{\sum_i R_n(u_i, v_i)} \\[2ex] \dfrac{\sum_i R_n(u_i, v_i)(u_i - \bar{u})(v_i - \bar{v})}{\sum_i R_n(u_i, v_i)} & \dfrac{\sum_i R_n(u_i, v_i)(v_i - \bar{v})^2}{\sum_i R_n(u_i, v_i)} \end{pmatrix}    (13.26)

where 1 ≤ i ≤ (2w + 1)^2.

In implementation, Singh uses a 3 × 3 neighborhood (i.e., w = 1) centered on the pixel under consideration. The weights are depicted in Figure 13.9.

13.3.2.3 Minimization and Iterative Algorithm

According to estimation theory [beck 1977], the two covariance matrices, expressed in Equations 13.24 and 13.26, respectively, are related to the confidence measure. That is, the reciprocals of the eigenvalues of the covariance matrix reveal the confidence of the estimate along the directions represented by the corresponding eigenvectors. Moreover, the conservation error and the neighborhood error can be represented as the following two quadratic terms, respectively:

(U - U_c)^T S_c^{-1} (U - U_c)    (13.27)

(U - \bar{U})^T S_n^{-1} (U - \bar{U})    (13.28)

where \bar{U} = (\bar{u}, \bar{v}), U_c = (u_c, v_c), and U = (u, v).

The minimization of the sum of these two errors over the image area leads to an optimal estimate of optical flow. That is, find (u, v) such that the following error is minimized:


\sum_x \sum_y \left[ (U - U_c)^T S_c^{-1} (U - U_c) + (U - \bar{U})^T S_n^{-1} (U - \bar{U}) \right]    (13.29)

An iterative procedure according to the Gauss–Seidel algorithm [ralston 1978] is used by Singh:

U^{k+1} = \left[ S_c^{-1} + S_n^{-1} \right]^{-1} \left[ S_c^{-1} U_c + S_n^{-1} \bar{U}^k \right], \qquad U^0 = U_c    (13.30)

Note that U_c and S_c are calculated once and remain unchanged in all the iterations. On the contrary, \bar{U} and S_n vary with each iteration. This agrees with the description of the method in Section 13.3.2.2.
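A minimal sketch of the iterative update in Equation 13.30 for a single pixel follows. In practice \bar{U} and S_n must be recomputed from the current neighborhood estimates at every iteration; that dependence is represented here by a caller-supplied function, which is an assumption of this sketch rather than part of Singh's published code.

```python
import numpy as np

def singh_iterate(U_c, S_c, neighborhood_fn, num_iter=10):
    """Iterate Equation 13.30: U^{k+1} = (S_c^-1 + S_n^-1)^-1 (S_c^-1 U_c + S_n^-1 Ubar^k).

    neighborhood_fn(U_k) must return (Ubar_k, S_n) computed from the flow
    estimates in the pixel's neighborhood (hypothetical helper).
    """
    U_c = np.asarray(U_c, dtype=float)
    Sc_inv = np.linalg.inv(S_c)
    U = U_c.copy()                              # U^0 = U_c
    for _ in range(num_iter):
        U_bar, S_n = neighborhood_fn(U)         # recomputed each iteration
        Sn_inv = np.linalg.inv(S_n)
        U = np.linalg.solve(Sc_inv + Sn_inv, Sc_inv @ U_c + Sn_inv @ np.asarray(U_bar))
    return U
```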

13.3.3 Pan, Shi, and Shu's Method

Applying feedback (a powerful technique widely used in automatic control and many other fields) to a correlation-based algorithm, Pan, Shi, and Shu developed a correlation-feedback method to compute optical flow. The method is iterative in nature. In each iteration, the estimated optical flow and its several variations are fed back. For each of the varied optical flow vectors, the corresponding sum of squared displaced frame difference (DFD) (Chapter 12), which often involves bilinear interpolation, is calculated. This useful information is then utilized in a revised version of a correlation-based algorithm [singh 1992]. They chose to work with this algorithm because it has several merits, and its estimation-theoretical computation framework lends itself to the application of the feedback technique.

As expected, the repeated usage of two given images via the feedback iterative procedure improves the accuracy of optical flow considerably. Several experiments on real image sequences in the laboratory and some synthetic image sequences demonstrate that the correlation-feedback algorithm performs better than some standard gradient- and correlation-based algorithms in terms of accuracy.

13.3.3.1 Proposed Framework

The block diagram of the proposed framework, shown in Figure 13.10, is described next.

FIGURE 13.10 Block diagram of correlation-feedback technique. (The blocks are initialization, observer, correlation, and propagation; the frames f_1 and f_2 are the inputs, u^0, v^0 is the initial estimate, u_c^k, v_c^k is the correlation-stage output, and the estimates u^k, v^k are fed back.)


13.3.3.1.1 Initialization

Although any flow algorithm can be used to generate an initial optical flow field \vec{u}^0 = (u^0, v^0) (even a nonzero initial flow field without applying any flow algorithm may work, but slowly), the Horn and Schunck algorithm [horn 1981], discussed in Section 13.2.1 (usually 5–10 iterations), is used to provide an appropriate starting point after preprocessing (involving low-pass filtering), since the algorithm is fast and the problem caused by the smoothness constraint is not serious in the first 10–20 iterations. The modified Horn and Schunck method, discussed in Section 13.2.2, may also be used for the initialization.

13.3.3.1.2 Observer

The DFD at the kth iteration is observed as f_n(\vec{x}) − f_{n−1}(\vec{x} − \vec{u}^k), where f_n and f_{n−1} denote two consecutive digital images, \vec{x} = (x, y) denotes the spatial coordinates of the pixel under consideration, and \vec{u}^k = (u^k, v^k) denotes the optical flow of this pixel estimated at the kth iteration. (Note that the vector representation of the spatial coordinates in image planes is used quite often in the literature, owing to its brevity in notation.) Demanding fractional pixel accuracy usually requires interpolation. In Pan et al.'s work, bilinear interpolation is adopted. The bilinearly interpolated image is denoted by \hat{f}_{n−1}.

13.3.3.1.3 Correlation

Once the bilinearly interpolated image is available, a correlation measure needs to be selected to search for the best match of a given pixel in f_n(\vec{x}) within a search area in the interpolated image. In their work, the sum of squared differences (SSD) is used. For each pixel in f_n, a correlation window W_c of size (2l + 1) × (2l + 1) is formed, centered on the pixel.

The search window in the proposed approach is quite different from that used in the correlation-based approach, say, in [singh 1992]. Let u be a quantity chosen from the following five quantities:

u \in \left\{ u^k - \frac{1}{2} u^k,\; u^k - \frac{1}{4} u^k,\; u^k,\; u^k + \frac{1}{4} u^k,\; u^k + \frac{1}{2} u^k \right\}    (13.31)

Let v be a quantity chosen from the following five quantities:

v \in \left\{ v^k - \frac{1}{2} v^k,\; v^k - \frac{1}{4} v^k,\; v^k,\; v^k + \frac{1}{4} v^k,\; v^k + \frac{1}{2} v^k \right\}    (13.32)

Hence, there are 25 (i.e., 5 × 5) possible combinations for (u, v). (It is noted that the restriction of the nonzero initial flow field mentioned above in Section 13.3.3.1.1 comes from here.) Note that other choices of variations around (u^k, v^k) are possible. Each of them corresponds to a pixel, (x − u, y − v), in the bilinearly interpolated image plane. A correlation window is formed and centered on this pixel. The 25 samples of the error distribution around (u^k, v^k) can be computed by using the SSD. That is,

E(u, v) = \sum_{s=-l}^{l} \sum_{t=-l}^{l} \left( f_n(x+s, y+t) - \hat{f}_{n-1}(x-u+s, y-v+t) \right)^2    (13.33)

The 25 samples of response distribution can be computed as follows:

R_c(u, v) = e^{-\beta E(u, v)}    (13.34)


where β is chosen so as to make the maximum R_c among the 25 samples of the response distribution a number close to unity. The choice of an exponential function for converting the error distribution into the response distribution is based primarily on the following consideration: the exponential function is well behaved when the error approaches zero, and all the response distribution values are positive. The choice of β mentioned above is motivated by the following observation: in this way, the R_c values, which are the weights used in Equation 13.35, will be more effective. That is, the computation in Equation 13.35 will be more sensitive to the variation of the error distribution defined in Equation 13.33.

The optical flow vector derived at this correlation stage is then calculated as follows, according to the weighted-least-square estimation [singh 1992]:

u_c^k(x, y) = \frac{\sum_u \sum_v R_c(u, v)\, u}{\sum_u \sum_v R_c(u, v)}, \qquad v_c^k(x, y) = \frac{\sum_u \sum_v R_c(u, v)\, v}{\sum_u \sum_v R_c(u, v)}    (13.35)

13.3.3.1.4 Propagation

Except in the vicinity of motion boundaries, the motion vectors associated with neighboring pixels are expected to be similar. Therefore, this constraint can be used to regularize the motion field. That is,

u^{k+1}(x, y) = \sum_{i=-w}^{w} \sum_{j=-w}^{w} w_1(i, j)\, u_c^k(x+i, y+j),

v^{k+1}(x, y) = \sum_{i=-w}^{w} \sum_{j=-w}^{w} w_1(i, j)\, v_c^k(x+i, y+j)    (13.36)

where w_1(i, j) is a weighting function. The Gaussian mask shown in Figure 13.9 is chosen as the weighting function w_1(i, j) used in their experiments. By using this mask, the velocity of the various pixels in a pixel's neighborhood will be weighted according to their distance from the pixel: the larger the distance, the smaller the weight. The mask smooths the optical flow field as well.
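The sketch below outlines one feedback iteration at a single pixel, combining the observer (bilinear interpolation of the previous frame), the 25 flow variations of Equations 13.31 and 13.32, the response distribution of Equation 13.34, and the Gaussian-mask propagation of Equation 13.36. The function names and border handling are illustrative assumptions, not the original implementation.

```python
import numpy as np
from scipy.ndimage import map_coordinates

GAUSS_MASK = np.array([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 16.0  # Figure 13.9

def correlation_stage(f_n, f_prev, x, y, uk, vk, l=1, beta=1.0):
    """Correlation stage at pixel (x, y): weigh 25 variations around (uk, vk).

    Note the nonzero initial flow restriction: if uk = vk = 0, all 25
    variations coincide and the estimate stays at zero.
    """
    factors = np.array([0.5, 0.75, 1.0, 1.25, 1.5])
    u_var, v_var = factors * uk, factors * vk          # Equations 13.31 and 13.32
    ss, tt = np.meshgrid(np.arange(-l, l + 1), np.arange(-l, l + 1))
    patch = f_n[y + tt, x + ss].astype(float)
    num_u = num_v = den = 0.0
    for u in u_var:
        for v in v_var:
            # observer: bilinear interpolation of f_{n-1} at (x - u + s, y - v + t)
            rows = (y - v + tt).ravel()
            cols = (x - u + ss).ravel()
            cand = map_coordinates(f_prev.astype(float), [rows, cols], order=1)
            E = np.sum((patch.ravel() - cand) ** 2)    # Equation 13.33
            R = np.exp(-beta * E)                      # Equation 13.34
            num_u, num_v, den = num_u + R * u, num_v + R * v, den + R
    return num_u / den, num_v / den                    # Equation 13.35

def propagation_stage(uc_field, vc_field, x, y):
    """Propagation stage: Gaussian-weighted average over the 3 x 3 neighborhood."""
    u_nb = uc_field[y - 1:y + 2, x - 1:x + 2]
    v_nb = vc_field[y - 1:y + 2, x - 1:x + 2]
    return (GAUSS_MASK * u_nb).sum(), (GAUSS_MASK * v_nb).sum()   # Equation 13.36
```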

13.3.3.1.5 Convergence

Under the assumption of a symmetric response distribution with a single maximum value assumed by the ground-truth optical flow, the convergence of the correlation-feedback technique is justified in [pan 1995].

13.3.3.2 Implementation and Experiments

13.3.3.2.1 Implementation

To make the algorithm more robust against noise, three consecutive images in an image sequence, denoted by f_1, f_2, and f_3, respectively, are used to implement their algorithm instead of the two images in the above principle discussion. This implementation was proposed in [singh 1992]. Assume the time interval between f_1 and f_2 is the same as that between f_2 and f_3. Also assume the apparent 2-D motion is uniform during these two intervals along the motion trajectories. From images f_1 and f_2, (u^0, v^0) can be computed. From (u^k, v^k), the optical flow estimated during the kth iteration, and f_1 and f_2, the response distribution, R_c^+(u^k, v^k), can be calculated as


R_c^+(u^k, v^k) = \exp\left\{ -\beta \sum_{s=-l}^{l} \sum_{t=-l}^{l} \left[ f_2(x+s, y+t) - \hat{f}_1(x-u^k+s, y-v^k+t) \right]^2 \right\}    (13.37)

Similarly, from images f_3 and f_2, (−u^k, −v^k) can be calculated. Then R_c^−(−u^k, −v^k) can be calculated as

R_c^-(-u^k, -v^k) = \exp\left\{ -\beta \sum_{s=-l}^{l} \sum_{t=-l}^{l} \left[ f_2(x+s, y+t) - \hat{f}_3(x+u^k+s, y+v^k+t) \right]^2 \right\}    (13.38)

The response distribution R_c(u^k, v^k) can then be determined as the sum of R_c^+(u^k, v^k) and R_c^−(−u^k, −v^k). The size of the correlation window and the weighting function is chosen to be 3 × 3, i.e., l = 1, w = 1. In each search window, β is chosen so as to make the larger one among R_c^+ and R_c^− a number close to unity. In the observer stage, bilinear interpolation is used, which is shown to be faster and better than the B-spline in Pan et al.'s many experiments.

13.3.3.2.2 Experiment I

Figure 13.11 shows the three successive image frames f_1, f_2, and f_3 about a square post. They were taken by a CCD video camera and a DATACUBE real-time image processing system supported by a Sun workstation. The square post is moving horizontally, perpendicular to the optical axis of the camera, at a uniform speed of 2.747 pixels per frame. To remove various noises to a certain extent and to speed up processing, these three 256 × 256 images are low-pass filtered and then subsampled prior to optical flow estimation. That is, the intensities of every 16 pixels in a block of 4 × 4 are averaged and the average value is assigned to represent this block. Note that the choice of other low-pass filters is also possible. In this way, these three images are compressed into three 64 × 64 images. The ‘‘ground-truth’’ 2-D motion velocity vector is hence known as u_a = −0.6868; v_a = 0.
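The 4 × 4 averaging-and-subsampling preprocessing described above can be written compactly with NumPy, as in the generic sketch below (not the original code); for block = 4 a 256 × 256 frame becomes 64 × 64.

```python
import numpy as np

def block_average_subsample(image, block=4):
    """Replace each block x block tile by its mean intensity."""
    h, w = image.shape
    trimmed = image[:h - h % block, :w - w % block].astype(float)
    return trimmed.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
```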

To compare the performance of the correlation-feedback approach with that of the gradient-based and correlation-based approaches, Horn and Schunck's algorithm is chosen to represent the gradient-based approach and Singh's framework to represent the correlation-based approach. Table 13.1 shows the results of the comparison. There, l, w, and N indicate the sizes of the correlation window, weighting function, and search window, respectively. The program that implements Singh's algorithm is provided by the authors of [barron 1994]. In the correlation-feedback algorithm, 10 iterations of Horn and Schunck's algorithm with α = 5 are used in the initialization. (Recall that α is a regularization parameter used in [horn 1981].) Only the central 40 × 40 flow vector array is used to compute u_error, which is the root mean square (RMS) error in the vector magnitudes between the ground-truth and estimated optical flow vectors. It is noted that the relative error in Experiment I is greater than 10%. This is because the denominator in the formula calculating the RMS error is too small due to the static background and, hence, many zero ground-truth 2-D motion velocity vectors in this experiment. Relatively speaking, the correlation-feedback algorithm performs best in determining optical flow for a texture post in translation. The correct optical flow field and those calculated by using three different algorithms are shown in Figure 13.12.

13.3.3.2.3 Experiment II

The images in Figure 13.13 were obtained by rotating a CCD camera with respect to the center of a ball. The rotating velocity is 2.5° per frame. Similarly, three 256 × 256 images are compressed into three 64 × 64 images by using the averaging and subsampling discussed above. Only the central 40 × 40 optical vector arrays are used to compute u_error. Table 13.2 reports the results for this experiment. There, u_error, l, w, and N have the same meaning as


FIGURE 13.11 (a) Texture square A. (b) Texture square B. (c) Texture square C.

that discussed in Experiment I. It is obvious that the correlation-feedback algorithm performs best in determining optical flow for this rotating ball case.

13.3.3.2.4 Experiment III

To compare the correlation-feedback algorithm with other existing techniques in a more objective, quantitative manner, Pan et al. cite some results reported in [barron 1994], which were obtained by applying some typical optical flow techniques to some image sequences

TABLE 13.1
Comparison in Experiment I

Technique                        Conditions                                     u_error
Gradient-based approach          Iteration number = 128, α = 5                  56.37%
Correlation-based approach       Iteration number = 25, l = 2, w = 2, N = 4     80.97%
Correlation-feedback approach    Iteration number = 10, l = 1, w = 1, N = 5     44.56%


chosen with deliberation. In the meantime they report the results obtained by applying their feedback technique to the identical image sequences with the same accuracy measurement as used in [barron 1994].

Three image sequences used in [barron 1994] were utilized here: namely ‘‘Translating Tree,’’ ‘‘Diverging Tree,’’ and ‘‘Yosemite.’’ The first two simulate translational camera motion with respect to a textured planar surface (see Figure 13.14), and are sometimes referred to as the ‘‘Tree 2-D’’ sequence. Therefore, there are no occlusions and no motion

FIGURE 13.12 (a) Correct optical flow field. (b) Optical flow field calculated by the gradient-based approach.


FIGURE 13.12 (continued) (c) Optical flow field calculated by the correlation-based approach. (d) Optical flow field calculated by the correlation-feedback approach.

discontinuities in these two sequences. In the Translating Tree sequence, the camera moves normally to its line of sight, with velocities between 1.73 and 2.26 pixels/frame parallel to the X-axis in the image plane. In the Diverging Tree sequence, the camera moves along its line of sight. The focus of expansion is at the center of the image. The speeds vary from 1.29


FIGURE 13.13 (a) Ball A. (b) Ball B. (c) Ball C.

pixels/frame on the left side to 1.86 pixels/frame on the right. The ‘‘Yosemite’’ sequence is a more complex test case (see Figure 13.15). The motion in the upper right is mainly divergent. The clouds translate to the right with a speed of 1 pixel/frame, while velocities in the lower left are about 4 pixels/frame. This sequence is challenging because of the range of velocities and the occluding edges between the mountains and at the horizon. There is severe aliasing in the lower portion of the images, causing most methods to produce poorer velocity measurements. Note that this synthetic sequence is for quantitative study purposes

TABLE 13.2
Comparison in Experiment II

Technique                        Conditions                                     u_error
Gradient-based approach          Iteration number = 128, α = 5                  65.67%
Correlation-based approach       Iteration number = 25, l = 2, w = 2, N = 4     55.29%
Correlation-feedback approach    Iteration number = 10, l = 1, w = 1, N = 5     49.80%


FIGURE 13.14 A frame of the ‘‘Tree 2-D’’ sequence.

FIGURE 13.15 A frame of the ‘‘Yosemite’’ sequence.


TABLE 13.3
Summary of the ‘‘Translating Tree’’ 2-D Velocity Results

Techniques                          Average Error (degree)   Standard Deviation (degree)   Density (%)
Horn and Schunck (original)         38.72                    27.67                         100
Horn and Schunck (modified)         2.02                     2.27                          100
Uras et al. (unthresholded)         0.62                     0.52                          100
Nagel                               2.44                     3.06                          100
Anandan                             4.54                     3.10                          100
Singh (step 1, l = 2, w = 2)        1.64                     2.44                          100
Singh (step 2, l = 2, w = 2)        1.25                     3.29                          100
Pan, Shi, and Shu (l = 1, w = 1)    1.07                     0.48                          100

since its ground-truth flow field is known and is, otherwise, far less complex than many real-world outdoor sequences processed in the literature.

The angular measure of the error used in [barron 1994] is utilized here as well. Let image velocity \vec{u} = (u, v) be represented as a 3-D direction vector,

\vec{V} \equiv \frac{1}{\sqrt{u^2 + v^2 + 1}} (u, v, 1)    (13.39)

The angular error between the correct image velocity \vec{V}_c and an estimate \vec{V}_e is \psi_E = \arccos(\vec{V}_c \cdot \vec{V}_e). It is obvious that the smaller the angular error \psi_E, the more accurate the estimation of the optical flow field will be. Despite the fact that the confidence measurement can be used in the correlation-feedback algorithm as well, Pan et al. did not consider the usage of the confidence measurement in their work. Therefore, only the results with 100% density in Table 4.6, Table 4.7, and Table 4.10 in [barron 1994] were used in Tables 13.3 through 13.5, respectively.
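Following Equation 13.39 and the arccosine definition above, the angular error between a correct and an estimated flow vector can be computed as in the short sketch below; the helper name and the conversion to degrees are choices made here for illustration.

```python
import numpy as np

def angular_error_deg(u_c, v_c, u_e, v_e):
    """Angular error (degrees) between correct (u_c, v_c) and estimated (u_e, v_e)."""
    Vc = np.array([u_c, v_c, 1.0]) / np.sqrt(u_c**2 + v_c**2 + 1.0)   # Equation 13.39
    Ve = np.array([u_e, v_e, 1.0]) / np.sqrt(u_e**2 + v_e**2 + 1.0)
    cosang = np.clip(np.dot(Vc, Ve), -1.0, 1.0)                       # guard rounding
    return np.degrees(np.arccos(cosang))
```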

Before computation of the optical flow field, the Yosemite and Tree 2-D test sequences were compressed by factors of 16 and 4, respectively, using the averaging and subsampling method discussed earlier.

As mentioned in [barron 1994], the optical flow field for the ‘‘Yosemite’’ sequence is complex, and Table 13.5 indicates that the correlation-feedback algorithm evidently performs best. In [black 1996], a robust method was developed and applied to a cloudless Yosemite sequence. It is noted that the performance of flow determination algorithms will be improved if the sky is removed from consideration [barron 1994; black 1996]. Still, it is clear that the algorithm in [black 1996] achieved very good performance in terms of accuracy. In order to make a comparison with their algorithm, the correlation-feedback algorithm was applied to the same cloudless ‘‘Yosemite’’ sequence. The results were reported in Table 13.6, from which it can be observed that the results obtained by Pan et al. are slightly

TABLE 13.4
Summary of the ‘‘Diverging Tree’’ 2-D Velocity Results

Techniques                          Average Error (degree)   Standard Deviation (degree)   Density (%)
Horn and Schunck (original)         12.02                    11.72                         100
Horn and Schunck (modified)         2.55                     3.67                          100
Uras et al. (unthresholded)         4.64                     3.48                          100
Nagel                               2.94                     3.23                          100
Anandan (frames 19 and 21)          7.64                     4.96                          100
Singh (step 1, l = 2, w = 2)        17.66                    14.25                         100
Singh (step 2, l = 2, w = 2)        8.60                     5.60                          100
Pan, Shi, and Shu (l = 1, w = 1)    5.12                     2.16                          100


TABLE 13.5
Summary of the ‘‘Yosemite’’ 2-D Velocity Results

Techniques                          Average Error (degree)   Standard Deviation (degree)   Density (%)
Horn and Schunck (original)         32.43                    30.28                         100
Horn and Schunck (modified)         11.26                    16.41                         100
Uras et al. (unthresholded)         10.44                    15.00                         100
Nagel                               11.71                    10.59                         100
Anandan (frames 19 and 21)          15.84                    13.46                         100
Singh (step 1, l = 2, w = 2)        18.24                    17.02                         100
Singh (step 2, l = 2, w = 2)        13.16                    12.07                         100
Pan, Shi, and Shu (l = 1, w = 1)    7.93                     6.72                          100

better. Tables 13.3 and 13.4 indicate that the feedback technique also performs very well in the translating and diverging texture post cases.

13.3.3.2.5 Experiment IV

Here, the correlation-feedback algorithm is applied to a real sequence named ‘‘Hamburg Taxi,’’ which is used as a testing sequence in [barron 1994]. There are four moving objects in the scene: a moving pedestrian in the upper left portion, a turning car in the middle, a car moving toward the right at the left side, and a car moving toward the left at the right side. A frame of the sequence and the needle diagram of flow vectors estimated by using 10 iterations of the correlation-feedback algorithm (with 10 iterations of Horn and Schunck's algorithm for initialization) are shown in Figures 13.16 and 13.17, respectively. The needle diagram is printed in the same fashion as those shown in [barron 1994]. It is noted that the moving pedestrian in the upper left portion cannot be shown because of the scale used in the needle diagram. The other three moving vehicles in the sequence are shown very clearly. The noise level is low. Compared with those diagrams reported in [barron 1994], the correlation-feedback algorithm achieves very good results.

For a comparison on a local basis, the portion of the needle diagram associated with the area surrounding the turning car (a sample of the velocity fields), obtained by 50 iterations of the correlation-feedback algorithm with 5 iterations of Horn and Schunck's algorithm as initialization, is provided in Figure 13.18c. Its counterparts obtained by applying Horn and Schunck's (50 iterations) and Singh's (50 iterations) algorithms are displayed in Figure 13.18a and b, respectively. It is observed that the correlation-feedback algorithm achieves the best results among the three algorithms.

13.3.3.3 Discussion and Conclusion

Although it uses a revised version of a correlation-based algorithm [singh 1992], the correlation-feedback technique is quite different from the correlation-based algorithm [singh 1992] in the following four aspects. First, different optimization criteria: the algorithm does not use the iterative minimization procedure used in [singh 1992]. Instead,

TABLE 13.6
Summary of the Cloudless ‘‘Yosemite’’ 2-D Velocity Results

Techniques                          Average Error (degree)   Standard Deviation (degree)   Density (%)
Robust formulation                  4.46                     4.21                          100
Pan, Shi, and Shu (l = 1, w = 1)    3.79                     3.44                          100


FIGURE 13.16 Hamburg taxi.

some variations of the estimated optical flow vectors are generated and fed back. The associated bilinearly interpolated displaced frame difference (DFD) for each variation is calculated and utilized. In essence, the feedback approach utilizes the two given images repeatedly, while the Singh method uses the two given images only once (u_c and v_c derived from the two given images are only calculated once). The best local matching between the displaced image, generated via feedback of the estimated optical flow, and the given image is actually used as the ultimate criterion for improving optical flow accuracy in the iterative process. Second, the search window in the algorithm is an adaptive ‘‘rubber’’ window, having a variable size depending on (u^k, v^k). In the correlation-based approaches [singh 1992], the search window has a fixed size. Third, the algorithm uses a bilinear interpolation technique in the observation stage and provides the correlation stage with a virtually

FIGURE 13.17 Needle diagram of the flow field of the Hamburg taxi sequence obtained by using the correlation-feedback algorithm.


continuous image field for more accurate motion vector computation, while that in [singh 1992] does not. Fourth, different performances are achieved when image intensity is a linear function of image coordinates. In fact, in the vicinity of a pixel, the intensity can usually be considered as such a linear function. Except if the optical flow vectors happen to have only an integer multiple of pixels as their components, an analysis in [pan 1994] shows that the correlation-based approach [singh 1992] will not converge to the apparent

FIGURE 13.18 A portion of the needle diagram obtained by using (a) Horn and Schunck's algorithm, and (b) Singh's algorithm.


FIGURE 13.18 (continued) (c) The correlation-feedback algorithm.

2-D motion vectors and will easily have an error much greater than 10%. In [pan 1994] it is also shown that the linear intensity function guarantees the assumption of the symmetric response distribution with a single maximum value assumed by the ground-truth optical flow. As discussed in Section 13.3.3.1, under this assumption the convergence of the correlation-feedback technique is justified.

Numerous experiments have demonstrated the correlation-feedback algorithm's convergence and accuracy, and usually it is more accurate than some standard gradient- and correlation-based approaches. In the complicated optical flow cases, specifically in the case of the ‘‘Yosemite’’ image sequence (regarded as the most challenging quantitative test image sequence in [barron 1994]), it performs better than all other techniques.

13.4 Multiple Attributes for Conservation Information

As stated at the beginning of this chapter, there are many algorithms in optical flow computation reported in the literature, and many new algorithms are to be developed. In Sections 13.2 and 13.3, we introduced some typical algorithms using gradient- and correlation-based approaches, and will not explore various algorithms any further here. It is hoped that the fundamental concepts and algorithms introduced above have provided a solid base for readers to study more advanced techniques.

We would like to discuss optical flow from another point of view, however: multiple image attributes versus a single image attribute. All of the methods discussed so far use only one kind of image attribute as conservation information in flow determination. Most methods use intensity. Singh's method uses the Laplacian of intensity, which is calculated by using the difference of the Gaussian operation [burt 1984]. It was reported by Weng, Ahuja, and Huang that using a single attribute as conservation information may result in ambiguity in matching two perspective views, while multiple attributes, which are


FIGURE 13.19 Multiple attributes versus single attribute. (a) With intensity information only, points D, E, and F tend to match points A, B, and C, respectively. (b) With intensity, edge, and corner information, points D and E tend to match points B and C, respectively.

motion insensitive, may reduce ambiguity remarkably, resulting in better matching [weng 1992]. An example is shown in Figure 13.19 to illustrate this argument. In this section, Weng et al.'s method is discussed first. Then we introduce Xia and Shi's method, which uses multiple attributes in a framework based on weighted-least-square estimation and feedback techniques.

13.4.1 Weng, Ahuja, and Huang's Method

Weng, Ahuja, and Huang proposed a quite different approach to image point matching [weng 1992]. Note that image matching amounts to flow field computation, since it calculates a displacement field for each point in the image planes, which is essentially a flow field if the time interval between two image frames is known.

Based on an analysis indicating that using image intensity as a single attribute is not enough for accurate image matching, Weng, Ahuja, and Huang utilize multiple attributes associated with images in the estimation of the dense displacement field. These image attributes are motion insensitive, i.e., they generally sustain only small change under motion assumed to be locally rigid. The image attributes used are image intensity, edgeness, and cornerness. For each image attribute, the algorithm forms a residual function, reflecting the inaccuracy of the estimated matching. The matching is then determined via an iterative procedure to minimize the weighted sum of these residual functions. In handling neighborhood information, a more advanced smoothness constraint is used to take care of moving discontinuities. The method considers uniform regions and the occlusion issue as well.

In addition to using multiple image attributes, the method is point-wise processing. There is no need for calculation of correlation within two correlation windows, which saves computation dramatically. However, the method also has some drawbacks. First, the edgeness and cornerness involve calculation of the spatial gradient, which is noise sensitive. Second, in solving for minimization, the method resorts to numerical differentiation again: the estimated displacement vectors are updated based on the partial derivatives of the noisy attribute images. In a word, the computational framework heavily relies on numerical differentiation, which is considered to be impractical for accurate computation [barron 1994].

On the othe r hand, the Pan, Shi, and Shu met hod, discusse d in Section 13.3.3 in thecategory of correlation-based approaches, seems to have some complementary features. Itis correlation-based. It uses intensity as a single attribute. In these two aspects the methodof Pan, Shi, and Shu is inferior to the method of Weng, Ahuja, and Huang. The feedbacktechnique and the weighted-least-square computation framework used in the Pan et al.method are, however, superior compared with the Weng et al. method. Motivated by theabove observations, an efficient, multiattribute feedback method was developed by Xiaand Shi [xia 1995, 1996], discussed in the next section. It is expected that more insight of theWeng, Ahuja, and Huang method will become clear in the discussion as well.


13.4.2 Xia and Shi’s Method

This method uses multiple attributes that are motion insensitive. The following five attributes are used: image intensity, horizontal edgeness, vertical edgeness, contrast, and entropy. The first three are used in [weng 1992] as well, and can be considered as structural attributes, while the last two, which are not used in [weng 1992], can be considered as textural attributes according to [haralick 1979].

Instead of the computational framework presented in [weng 1992], which, as discussed above, may not be practical for accurate computation, the method uses the computational framework of [pan 1994, 1998]. That is, the weighted-least-square estimation technique used in [singh 1992] and the feedback technique used in [pan 1994, 1998] are utilized here. Unlike in [weng 1992], subpixel accuracy is considered and a confidence measure is generated in the method.

The Xia and Shi method is also different from those algorithms presented in [singh 1992; pan 1995, 1998]. First, there is no correlation in the method, while both [singh 1992] and [pan 1995, 1998] are correlation-based. Specifically, the method is point-wise processing. Second, the method uses multiple attributes, while both [singh 1992] and [pan 1995, 1998] use image intensity as a single attribute.

In summary, the Xia and Shi method to compute optical flow is motivated by several existing algorithms mentioned above. It does, however, differ from each of them significantly.

13.4.2.1 Multiple Image Attributes

As mentioned earlier, there are five image attributes in the Xia and Shi method. They are defined below.

13.4.2.1.1 Image Intensity

The intensity at a pixel (x, y) in an image f_n(x, y) is denoted by A_i(x, y), i.e., A_i(x, y) = f_n(x, y).

13.4.2.1.2 Horizontal Edgeness

The horizontal edgeness at a pixel (x, y), denoted by A_h(x, y), is defined as

A_h(x, y) = \frac{\partial f(x, y)}{\partial y}    (13.40)

i.e., the partial derivative of f(x, y) with respect to y, the second component of the gradient of the intensity function at the pixel.

13.4.2.1.3 Vertical Edgeness

The vertical edgeness at a pixel (x, y), denoted by A_v(x, y), is defined as

A_v(x, y) = \frac{\partial f(x, y)}{\partial x}    (13.41)

i.e., the first component of the gradient of the intensity function at the pixel. Note that the partial derivatives in Equations 13.40 and 13.41 are computed by applying a Sobel operator [gonzalez 1992] in a 3 × 3 neighborhood of the pixel.
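The two edgeness attributes can be approximated with a Sobel operator as in the sketch below; the particular kernel orientation, sign convention, and boundary handling are assumptions made for this illustration.

```python
import numpy as np
from scipy.ndimage import convolve

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # d/dx
SOBEL_Y = SOBEL_X.T                                                    # d/dy

def edgeness_attributes(f):
    """Return (A_h, A_v): horizontal and vertical edgeness, Equations 13.40 and 13.41."""
    f = f.astype(float)
    A_h = convolve(f, SOBEL_Y, mode='nearest')   # partial derivative with respect to y
    A_v = convolve(f, SOBEL_X, mode='nearest')   # partial derivative with respect to x
    return A_h, A_v
```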

13.4.2.1.4 Contrast

The local contrast at a pixel (x, y), denoted by A_c(x, y), is defined as

A_c(x, y) = \sum_{i, j \in S} (i - j)^2 C_{i,j}    (13.42)


where S is the set of all the distinct gray levels within a 3 × 3 window centered at pixel (x, y). C_{i,j} specifies the relative frequency with which two neighboring pixels separated horizontally by a distance of 1 occur in the 3 × 3 window, one with gray level i and the other with gray level j.

13.4.2.1.5 Entropy

The local entropy at a point (x, y), denoted by A_e(x, y), is given by

A_e(x, y) = -\sum_{i \in S} p_i \log p_i    (13.43)

where S was defined above, and p_i is the probability of occurrence of the gray level i in the 3 × 3 window.

Since the intensity is assumed to be invariant to motion, so are the horizontal edgeness, vertical edgeness, contrast, and entropy.

As mentioned above, the intensity and edgeness are used as attributes in Weng et al.'s algorithm as well. Compared with the negative and positive cornerness used in Weng et al.'s algorithm, the local contrast and entropy need no differentiation and therefore are less sensitive to various noises in the original images. In addition, these two attributes are inexpensive in terms of computation. They reflect the textural information about the local neighborhood of the pixel for which the flow vector is to be estimated.
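A sketch of the two textural attributes over a 3 × 3 window follows. Counting horizontally adjacent gray-level pairs inside the window is one reasonable reading of the definition of C_{i,j} above; that reading, the function name, and the natural logarithm are assumptions of this illustration.

```python
import numpy as np

def contrast_entropy(f, x, y):
    """Local contrast (Equation 13.42) and entropy (Equation 13.43) in a 3x3 window."""
    win = f[y - 1:y + 2, x - 1:x + 2].astype(int)
    # relative frequency C_ij of horizontally adjacent gray-level pairs (i, j)
    pairs = list(zip(win[:, :-1].ravel(), win[:, 1:].ravel()))
    contrast = sum((i - j) ** 2 for i, j in pairs) / len(pairs)
    # probability p_i of each gray level i within the window
    _, counts = np.unique(win, return_counts=True)
    p = counts / counts.sum()
    entropy = -np.sum(p * np.log(p))
    return contrast, entropy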

13.4.2.2 Conservation Stage

In the Xia et al. algorithm, this stage is similar to that in the Pan et al. algorithm. That is, for a flow vector estimated at the kth iteration, denoted by (u^k, v^k), we find its 25 variations, (u, v), according to

u \in \left\{ u^k - \frac{u^k}{2},\; u^k - \frac{u^k}{4},\; u^k,\; u^k + \frac{u^k}{4},\; u^k + \frac{u^k}{2} \right\}

v \in \left\{ v^k - \frac{v^k}{2},\; v^k - \frac{v^k}{4},\; v^k,\; v^k + \frac{v^k}{4},\; v^k + \frac{v^k}{2} \right\}    (13.44)

For each of these 25 variations, the matching error is computed as

E(u, v) = r_{A_i}^2(x, y, u, v) + r_{A_h}^2(x, y, u, v) + r_{A_v}^2(x, y, u, v) + r_{A_c}^2(x, y, u, v) + r_{A_e}^2(x, y, u, v)    (13.45)

where r_{A_i}, r_{A_h}, r_{A_v}, r_{A_c}, and r_{A_e} denote the residual functions with respect to the five attributes, respectively.

The residual function of intensity is defined as

r_{A_i}(x, y, u, v) = A_{i_n}(x, y) - A_{i_{n-1}}(x - u, y - v) = f_n(x, y) - f_{n-1}(x - u, y - v)    (13.46)

where f_n(x, y) and f_{n−1}(x, y) are defined as before, i.e., the intensity functions at t_n and t_{n−1}, respectively; A_{i_n} and A_{i_{n−1}} denote the intensity attributes on f_n and f_{n−1}, respectively.

It is observed that the residual error of intensity is essentially the DFD (Chapter 12). The rest of the residual functions are defined similarly. When subpixel accuracy is required, spatial interpolation in the attribute images is generally necessary. Thus, the flow vector estimation is now converted to a minimization problem. That is, find u and v at pixel (x, y) such that the matching error defined in Equation 13.45 is minimized. The weighted-least-square method [singh 1992; pan 1998] is then used. That is,


R(u, v) = e^{-\beta E(u, v)}    (13.47)

u_c^{k+1} = \frac{\sum_u \sum_v R(u, v)\, u}{\sum_u \sum_v R(u, v)}, \qquad v_c^{k+1} = \frac{\sum_u \sum_v R(u, v)\, v}{\sum_u \sum_v R(u, v)}    (13.48)

Since the weighted-least-square method has been discussed in detail in Sections 13.3.2 and 13.3.3, we will not go into more detail here.
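Putting the five attributes together, the sketch below evaluates the multi-attribute matching error of Equation 13.45 for one candidate displacement and then forms the weighted-least-square estimate of Equations 13.47 and 13.48 over the 25 variations of Equation 13.44. The attribute images are assumed to have been precomputed (for example, with the helpers sketched earlier); integer displacements are assumed here so that no spatial interpolation is shown.

```python
import numpy as np

def matching_error(attrs_n, attrs_prev, x, y, u, v):
    """Equation 13.45: sum of squared attribute residuals at pixel (x, y).

    attrs_n / attrs_prev are same-keyed dicts of precomputed attribute
    images (intensity, edgeness, contrast, entropy) for frames n and n-1.
    """
    return sum((attrs_n[name][y, x] - attrs_prev[name][y - int(v), x - int(u)]) ** 2
               for name in attrs_n)

def xia_conservation_step(attrs_n, attrs_prev, x, y, uk, vk, beta=1.0):
    """Equations 13.44, 13.47, and 13.48: weighted-least-square flow update at (x, y)."""
    factors = (0.5, 0.75, 1.0, 1.25, 1.5)
    num_u = num_v = den = 0.0
    for u in (c * uk for c in factors):
        for v in (c * vk for c in factors):
            R = np.exp(-beta * matching_error(attrs_n, attrs_prev, x, y, u, v))
            num_u, num_v, den = num_u + R * u, num_v + R * v, den + R
    return num_u / den, num_v / den
```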

13.4.2.3 Propagation Stage

Similar to what was proposed in the Pan et al. algorithm, in this stage Xia et al. form a window W of size (2w + 1) × (2w + 1) centered at the pixel (x, y) in the image f_n(x, y). The flow estimate at the pixel (x, y) in this stage, denoted by (u^{k+1}, v^{k+1}), is calculated as a weighted sum of the flow vectors of the pixels within the window W.

u^{k+1} = \sum_{s=-w}^{w} \sum_{t=-w}^{w} w_1[f_n(x, y), f_n(x+s, y+t)] \cdot u_c^{k+1}(x+s, y+t)

v^{k+1} = \sum_{s=-w}^{w} \sum_{t=-w}^{w} w_1[f_n(x, y), f_n(x+s, y+t)] \cdot v_c^{k+1}(x+s, y+t)    (13.49)

where w_1[·, ·] is a weight function. For each point in the window W, a weight is assigned according to the weight function. Let (x + s, y + t) denote a pixel within the window W; then the weight of the pixel (x + s, y + t) is given by

w_1[f_n(x, y), f_n(x+s, y+t)] = \frac{c}{\varepsilon + |f_n(x, y) - f_n(x+s, y+t)|}    (13.50)

where ε is a small positive number to prevent the denominator from vanishing, and c is a normalization constant that makes the summation of all the weights in W equal to 1.

From Equation 13.50, we see that the weight is determined based on the intensity difference between the pixel under consideration and its neighboring pixel. The larger the difference in intensity, the more likely the two points belong to different regions. Therefore, the weight will be small in this case. On the other hand, the flow vectors in the same region will be similar since the corresponding weight is large. Thus, the weighting function implicitly takes flow discontinuity into account and is more advanced than that in [singh 1992; pan 1994, 1998].
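The intensity-based weighting of Equation 13.50 and the propagation of Equation 13.49 can be sketched together as follows; the value of ε and the function name are assumptions of this illustration.

```python
import numpy as np

def xia_propagation_step(f_n, uc_field, vc_field, x, y, w=1, eps=1e-3):
    """Equations 13.49 and 13.50: intensity-weighted flow update at pixel (x, y)."""
    center = float(f_n[y, x])
    nb_int = f_n[y - w:y + w + 1, x - w:x + w + 1].astype(float)
    weights = 1.0 / (eps + np.abs(center - nb_int))   # unnormalized weights
    weights /= weights.sum()                          # constant c makes them sum to 1
    u_nb = uc_field[y - w:y + w + 1, x - w:x + w + 1]
    v_nb = vc_field[y - w:y + w + 1, x - w:x + w + 1]
    return (weights * u_nb).sum(), (weights * v_nb).sum()
```

Because the weights shrink wherever the neighboring intensity differs strongly from the center pixel, estimates from a different region contribute little, which is how the weighting function accounts for flow discontinuities.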

13.4.2.4 Outline of Algorithm

The following points summarize the procedures of the algorithm:

1. Perform a low-pass prefiltering on two input images to remove various noises.

2. Generate attribute images: intensity, horizontal edgeness, vertical edgeness, local contrast, and local entropy. These attributes are computed at each grid point of both images.

3. Set the initial flow vectors to zero. Set the maximum iteration number and/or the estimation accuracy.

4. For each pixel under consideration, generate flow variations according to Equation 13.44. Compute the matching error for each flow variation according to Equation 13.45 and transform them to the corresponding response distribution R using Equation 13.47. Compute the flow estimate u_c, v_c using Equation 13.48.

5. Form a (2w + 1) × (2w + 1) neighborhood window W centered at the pixel. Compute the weight for each pixel within the window W using Equation 13.50. Update the flow vector using Equation 13.49.

6. Decrease the preset iteration number. If the iteration number is zero, the algorithm returns with the resultant optical flow field. Otherwise, go to the next step.

7. If the change in the flow vector over two successive iterations is less than the predefined threshold, the algorithm returns with the estimated optical flow field. Otherwise, go to step 4.

TABLE 13.7
Summary of the ‘‘Translating Tree’’ 2-D Velocity Results

Techniques                          Average Error (degree)   Standard Deviation (degree)   Density (%)
Horn and Schunck (original)         38.72                    27.67                         100
Horn and Schunck (modified)         2.02                     2.27                          100
Uras et al. (unthresholded)         0.62                     0.52                          100
Nagel                               2.44                     3.06                          100
Anandan                             4.54                     3.10                          100
Singh (step 1, n = 2, w = 2)        1.64                     2.44                          100
Singh (step 2, n = 2, w = 2)        1.25                     3.29                          100
Pan, Shi, and Shu (n = 1, w = 1)    1.07                     0.48                          100
Weng, Ahuja, and Huang              1.81                     2.03                          100
Xia and Shi                         0.55                     0.52                          100

13.4.2.5 Experimental Results

To compare the method with other methods existing in the literature, similar to what has been done in [pan 1998] (Section 13.3.3), the method was applied to three test sequences used in [barron 1994]: the ‘‘Translating Tree’’ sequence, the ‘‘Diverging Tree’’ sequence, and the ‘‘Yosemite’’ sequence. The same accuracy criterion is used as that in [barron 1994]. Only those results reported in [barron 1994] with 100% density are listed in Tables 13.7 through 13.9 for a fair and easy comparison. The algorithm of Weng et al. was implemented by Xia et al. and the results were reported in [xia 1995].

TABLE 13.8
Summary of the ‘‘Diverging Tree’’ 2-D Velocity Results

Techniques                              Average Error (degree)   Standard Deviation (degree)   Density (%)
Horn and Schunck (original)             32.43                    30.28                         100
Horn and Schunck (modified)             11.26                    16.41                         100
Uras et al. (unthresholded)             10.44                    15.00                         100
Nagel                                   11.71                    10.59                         100
Anandan                                 15.84                    13.46                         100
Singh (step 1, n = 2, w = 2, N = 4)     18.24                    17.02                         100
Singh (step 2, n = 2, w = 2, N = 4)     13.16                    12.07                         100
Pan, Shi, and Shu (n = 1, w = 1)        7.93                     6.72                          100
Weng, Ahuja, and Huang                  8.41                     8.22                          100
Xia and Shi                             7.54                     6.61                          100


TABLE 13.9
Summary of the ‘‘Yosemite’’ 2-D Velocity Results

Techniques                              Average Error (degree)   Standard Deviation (degree)   Density (%)
Horn and Schunck (original)             12.02                    11.72                         100
Horn and Schunck (modified)             2.55                     3.67                          100
Uras et al. (unthresholded)             4.64                     3.48                          100
Nagel                                   2.94                     3.23                          100
Anandan (frames 19 and 21)              7.64                     4.96                          100
Singh (step 1, n = 2, w = 2, N = 4)     17.66                    14.25                         100
Singh (step 2, n = 2, w = 2, N = 4)     8.60                     5.60                          100
Pan, Shi, and Shu (n = 1, w = 1)        5.12                     2.16                          100
Weng, Ahuja, and Huang                  8.01                     9.71                          100
Xia and Shi                             4.04                     3.82                          100

13.4.2.6 Discussion and Conclusion

The above experimental results demonstrate that the Xia and Shi method outperforms both the Pan, Shi, and Shu method and the Weng, Ahuja, and Huang method in terms of accuracy of the optical flow determined. Computationally speaking, Xia and Shi's method is less expensive than Pan et al.'s, since there is no correlation involved, and correlation is known to be computationally expensive.

13.5 Summary

The optical flow field is a dense 2-D distribution of apparent velocities of movement of intensity patterns in image planes, while the 2-D motion field can be understood as the perspective projection of 3-D motion in the scene onto image planes. They are different. Only under certain circumstances are they equal to each other. In practice, however, they are closely related in that image sequences are usually the only data we have in motion analysis. Hence, we can only deal with the optical flow in motion analysis, instead of the 2-D motion field. The aperture problem in motion analysis refers to the problem that occurs when viewing motion via an aperture. Specifically, the only motion we can observe from local measurement is the motion component orthogonal to the underlying moving contour. That is another way to manifest the ill-posed nature of optical flow computation. In general, motion analysis from image sequences is an inverse problem, which is ill-posed. Fortunately, low-level computational vision problems are only mildly ill-posed. Hence, lowering the noise in image data leads to a possible significant reduction of errors in flow determination.

Numerous flow determination algorithms have appeared over the course of more than one decade. Most of the techniques take one of the following approaches: the gradient-based approach, the correlation-based approach, the energy-based approach, and the phase-based approach. In addition to these deterministic approaches, there is also a stochastic approach. A unified point of view of optical flow computation is presented in Section 13.3. That is, for any algorithm in optical flow computation, there are two types of information that need to be extracted: conservation information and neighborhood information.

Several techniques are introduced for the gradient-based approach, particularly the Horn and Schunck algorithm, which is a pioneer work in flow determination. There, the brightness invariant equation is used to extract conservation information; the smoothness


constraint is used to extract neighborhood information. The modified Horn and Schunck algorithm shows significant error reduction in flow determination, owing to a reduction of noise in image data, which confirms the mildly ill-posed nature of optical flow computation.

Several techniques are discussed for the correlation-based approach. The Singh algorithm is emphasized due to its estimation-theoretical framework. The Pan, Shi, and Shu algorithm, which applies the feedback technique to the correlation method, demonstrates an accuracy enhancement in flow estimation.

Section 13.4 addresses the usage of multiple image attributes versus that of a single image attribute in the flow determination technique. It is found that the usage of multiple motion-insensitive attributes can help reduce the ambiguity in motion analysis. The application of multiple image attributes to conservation information turns out to be promising for flow computation.

Some experimental work was presented in Sections 13.3 and 13.4. With Barron et al.'s recent comprehensive survey of various existing optical flow algorithms, we can have a quantitative assessment of various optical flow techniques.

FIGURE 13.20 (a) The 21st original frame of the ‘‘Miss America’’ sequence, (b) the reconstructed 21st frame with H.263, and (c) the reconstructed 21st frame with the proposed technique.


Optical flow finds application in areas such as computer vision, image interpolation, temporal filtering, and video coding. In computational vision, raising the accuracy of optical flow estimation is important. In video coding, however, lowering the bit rate for both prediction error and motion overhead, while keeping a certain quality of reconstructed frames, is the ultimate goal. Proper handling of the large amount of velocity vectors is a key issue in this regard. It is noted that optical flow-based motion estimation for video compression has been studied for many years. However, the high bit overhead and computational complexity have prevented it from practical usage in video coding. With the continued advance in technologies, however, we believe this problem may be resolved in the near future. In fact, an initial, successful attempt has been made and reported in [shi 1998]. There, based on a study which demonstrates that flow vectors are highly correlated and can be modeled by a first-order autoregressive (AR) model, the DCT is applied to flow vectors. An adaptive threshold technique is developed to match optical flow motion prediction and minimize the residual errors. Consequently, this optical flow-based motion-compensated video coding algorithm achieves good performance for very low bit rate video coding. It obtains a bit rate compatible with that obtained by an H.263 standard algorithm, which uses block matching for motion estimation. (Note that the video coding standard H.263 is covered in Chapter 19.) Furthermore, the video frames reconstructed by using this flow-based algorithm are free of annoying blocking artifacts. This effect is demonstrated in Figure 13.20. Note that Figure 13.20b has appeared in Figure 11.12, where the same picture is displayed in a larger size and the blocking artifacts are hence clearer.

Exercises

1. What is an optical flow field? What is a 2-D motion field? What is the differencebetween the two? How are they related to each other?

2. What is an aperture problem? Give two of your own examples.3. What is the ill-posed problem? Why do we consider motion analysis from image

sequences an ill-posed problem?4. Is the relationship between the optical flow in an image plane and the velocities of

objects in 3-D world space necessarily obvious? Justify your answer.5. What does the smoothness constraint imply? Why is it required?6. How are the derivatives of intensity function and the Laplacian of flow components

estimated in the Horn and Schunck method?7. What are the differences between the Horn and Schunck original method and the

modified Horn and Schunck method? What do you observe from these differences?8. What is the difference between the smoothness constraint proposed by Horn and

Schunck and the oriented smoothness constraint proposed by Nagel? Providecomments.

9. In your own words, describe the Singh method. What is the weighted-least-squareestimation technique?

10. In your own words, describe conservation information and neighborhood information.Using this perspective, take a new look at the Horn and Schunck algorithm.

11. How is the feedback technique applied in the Pan et al. algorithm?12. In your own words, tell the difference between the Singh method and the Pan et al.

method.13. Give two of your own examples to show that multiple image attributes are able to

reduce ambiguity in image matching.

� 2007 by Taylor & Francis Group, LLC.

Page 357: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

14. How does the Xia et al. method differ from the Weng et al. method?15. How does the Xia et al. method differ from the Pan et al. method?

References

[adelson 1985] E.H. Adelson and J.R. Bergen, Spatiotemporal energy model for the perception ofmotion, Journal of the Optical Society of America A, 2, 2, 284–299, 1985.

[anandan 1987] P. Anandan, Measurement visual motion from image sequences, Ph.D. thesis, COINSDepartment, University of Massachusetts, Amherst, MA, 1987.

[anandan 1989] P. Anandan, A computational framework and an algorithm for the measurement ofvisual motion, International Journal of Computer Vision, 2, 283–310, 1989.

[barron 1994] J.L. Barron, D.J. Fleet, and S.S. Beauchemin, Systems and experiment performance ofoptical flow techniques, International Journal of Computer Vision, 12, 1, 43–77, 1994.

[beck 1977] J.V. Beck and K.J. Arnold, Parameter. Estimation in Engineering and Science, John Wiley &Sons, New York, 1977.

[bertero 1988]M. Bertero, T.A. Poggio, and V. Torre, Ill-posed problems in early vision, Proceedings ofthe IEEE, 76, 8, 869–889, August 1988.

[bigun 1991] J. Bigun, G. Granlund, and J. Wiklund, Multidimensional orientation estimation withapplications to texture analysis and optical flow, IEEE Transactions on Pattern Analysis andMachine Intelligence, 13, 775–790, 1991.

[black 1996] M.J. Black and P. Anandan, The robust estimation of multiple motions: Parametricand piecewise-smooth flow fields, Computer Vision and Image Understanding, 63, 1, 75–104,1996.

[bracewell 1995] R.N. Bracewell, Two-Dimensional Imaging, Prentice Hall, Englewood, NJ, 1995.[burt 1983] P.J. Burt and E.H. Adelson, The Laplacian pyramid as a compact image code, IEEE

Transactions on Communications, 31, 4, 532–540, April 1983.[burt 1984] P.J. Burt, The pyramid as a structure for efficient computation, in Multiresolution Image

Processing and Analysis, A. Rosenfeld (Ed.), Springer-Verlag, Germany, pp. 6–37, 1984.[gonzalez 1992] R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison-Wesley, Reading,

MA, 1992.[fleet 1990] D.J. Fleet and A.D. Jepson, Computation of component image velocity from local phase

information, International Journal of Computer Vision, 5, 77–104, 1990.[haralick 1979] R.M. Haralick, Statistical and structural approaches to texture, Proceedings of the IEEE,

67, 5, 786–804, May 1979.[heeger 1988] D.J. Heeger, Optical flow using spatiotemporal filters, International Journal of Computer

Vision, 1, 279–302, 1988.[horn 1981] B.K.P. Horn and B.G. Schunck, Determining optical flow, Artificial Intelligence, 17,

185–203, 1981.[konrad 1992] Stochastic approach J. Konrad and E. Dubois, Bayesian estimation of motion vector

fields, IEEE Transactions on Pattern Analysis and Machine Intelligence, 14, 9, 910–927, 1992.[lim 1990] J.S. Lim, Two-Dimensional Signal and Image Processing, Prentice Hall, Englewood Cliffs, NJ,

1990.[lucas 1981] B. Lucas and T. Kanade, An iterative image registration technique with an application to

stereo vision, Proceedings of DARPA Image Understanding Workshop, pp. 121–130, 1981.[marr 1982] D. Marr, Vision, Freeman, Boston, MA, 1982.[nagel 1983] H.H. Nagel, Displacement vectors derived from second-order intensity variations in

image sequences, Computer Graphics and Image Processing, 21, 85–117, 1983.[nagel 1986] H.H. Nagel and W. Enkelmann, An investigation of smoothness constraints for the

estimation of displacement vector fields from image sequences, IEEE Transactions on PatternAnalysis and Machine Intelligence, 8, 565–593, 1986.

[nagel 1989] H.H. Nagel, On a constraint equation for the estimation of displacement rates in imagesequences, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 13–30, 1989.

� 2007 by Taylor & Francis Group, LLC.

Page 358: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

[pan 1994] J.N. Pan, Motion estimation using optical flow field, Ph.D. dissertation, Electrical andComputer Engineering, New Jersey Institute of Technology, Newark, NJ, April 1994.

[pan 1995] J.N. Pan, Y.Q. Shi, and C.Q. Shu, A convergence justification of the correlation-feedbackalgorithm in optical flow determination, Technical Report, Electronic Imaging Laboratory, Electricaland Computer Engineering Department, New Jersey Institute of Technology, Newark, NJ, May1995.

[pan 1998] J.N. Pan, Y.Q. Shi, and C.Q. Shu, Correlation-feedback technique in optical flow deter-mination, IEEE Transactions on Image Processing, 7, 7, 1061–1067, July 1998.

[ralston 1978] A. Ralston and P. Rabinowitz, A First Course in Numerical Analysis, McGraw-Hill,New York, 1978.

[sears 1986] F.W. Sears, M.W. Zemansky, and H.D. Young, University Physics, Addison-Wesley,Readings, MA, 1986.

[shi 1994] Y.Q. Shi, C.Q. Shu, and J.N. Pan, Unified optical flow field approach to motion analysisfrom a sequence of stereo images, Pattern Recognition, 27, 12, 1577–1590, 1994.

[shi 1998] Y.Q. Shi, S. Lin, and Y.Q. Zhang, Optical flow-based motion compensation algorithmfor very low-bit-rate video coding, International Journal of Imaging Systems and Technology, 9, 4,230–237, 1998.

[shu 1993] C.Q. Shu and Y.Q. Shi, Direct recovering of Nth order surface structure using UOFFapproach, Pattern Recognition, 26, 8, 1137–1148, 1993.

[singh 1991] A. Singh, Optical Flow Computation: A Unified Perspective, IEEE Computer Society Press,Los Alamitos, CA, 1991.

[singh 1992] A. Singh, An estimation-theoretic framework for image-flow computation, CVGIP:Image Understanding, 56, 2, 152–177, 1992.

[szeliski 1995] R. Szeliski, S.B. Kang, and H.-Y. Shum, A parallel feature tracker for extended imagesequences, Proceedings of the International Symposium on Computer Vision, pp. 241–246, Florida,November 1995.

[tikhonov 1977] A.N. Tikhonov and V.Y. Arsenin, Solutions of Ill-posed Problems, Winston & Sons,Washington, DC, 1977.

[uras 1988] S. Uras, F. Girosi, A. Verri, and V. Torre, A computational approach to motion perception,Biological Cybernetics, 60, 79–97, 1988.

[waxman 1988] A.M. Waxman, J. Wu, and F. Bergholm, Convected activation profiles and receptivefields for real time measurement of short range visual motion, Proceedings of IEEE ComputerVision and Pattern Recognition, pp. 717–723, Ann Arbor, 1988.

[weng 1992] J. Weng, N. Ahuja, and T.S. Huang, Matching two perspective views, IEEE Transactionson PAMI, 14, 8, 806–825, August 1992.

[xia 1995] X. Xia and Y.Q. Shi, A multiple attributes algorithm to compute optical flow, Proceedings ofthe Twenty-ninth Annual Conference on Information Sciences and Systems, p. 480, The John HopkinsUniversity, Baltimore, MD, March 1995.

[xia 1996] X. Xia, Motion estimation and video coding, Ph.D. dissertation, Electrical and ComputerEngineering, New Jersey Institute of Technology, Newark, NJ, October, 1996.

� 2007 by Taylor & Francis Group, LLC.

Page 359: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

� 2007 by Taylor & Francis Group, LLC.

Page 360: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

14Further Discussion and Summary on 2-DMotion Estimation

Since Chapter 10, we have been devoting our discuss ion to motion analysis and motioncompens ated (MC ) coding. Followin g a general descript ion in Chapter 10, three maj ortechni ques, block m atching, pel rec ursion, and optical fl ow, are cov ered in Chapte rs 11, 12,and 13, respec tively.

Before conclud ing this subject , in this chapte r, we provide fur ther discuss ion and asumma ry. A general character ization of 2-D motion estimatio n is given in Secti on 14.1.In Section 14 .2, different classi ficati ons of vari ous met hods for 2-D mo tion anal ysis aregiven in a wider scope. Secti on 14.3 is con cerned with a perform ance c ompariso n amo ngthe three major techniq ues. More advance d tech niques and new tre nds in m otion analy sisand motion compensa tion are introd uced in Secti on 14.4.

14.1 General C haracterization

A few com mon fea tures charac terizi ng all three maj or tech niques are discusse d in thissectio n.

14.1.1 Apertu re Prob lem

The aperture probl em, dis cussed in Chapte r 13, des cribes phenome na that occ ur whenobservi ng motion throu gh a sma ll ope ning in a fl at scree n. That is, on e can on ly observenormal velocity . It is essentia lly a form of ill-pos ed probl em since it is concerne d withexiste nce and uniqu eness issues, as illustrat ed in Figure 13.2a and b. Th is problem isinherent with the optical flow technique.

We note, however, that the aperture problem also exists in block matching and pelrecursive techniques. Consider an area in an image plane having strong intensity gradients.According to our discussion in Chapter 13, the aperture problem does exist in this area nomatter what type of technique is applied to determine local motion. That is, motionperpendicular to the gradient cannot be determined as long as only a local measure isutilized. It is noted that, in fact, the steepest descent method of the pel recursive techniqueonly updates the estimate along the gradient direction [tekalp 1995].

14.1.2 Ill-Posed Inverse Problem

In Chapter 13, when we discussed the optical flow technique, a few fundamental issueswere raised. It is stated that optical flow computation from image sequences is an inverse

� 2007 by Taylor & Francis Group, LLC.

Page 361: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

probl em, which is usually ill-po sed. Speci fically , there are thre e problem s: nonexiste nce,nonu niquene ss, and insta bility. That is, the solution may not exist; if it exis ts, it may not beuniqu e; the solution may no t be stab le in the sense that a small perturbat ion in the imagedata may cause a huge error in the sol ution.

Now we can extend our discussi on to bot h block match ing and pel recursio n techniq ues.This is becaus e bot h the technique s are inten ded for determi ning 2-D motion from imagesequ ences, and are there fore inverse probl ems.

14.1.3 Cons ervatio n Informa tion and Nei ghborh ood Informa tion

Becau se of the ill-pos ed natu re of 2-D motion estimatio n, a uni fi ed point of view regardin gvari ous optical fl ow algorithm s is also appl icable for block match ing and pel recursivetechni ques. Tha t is, all three major tech niques inv olve extracti ng con servation informati onand extracti ng neighborh ood informati on.

Take a look at the bloc k match ing technique. Ther e, cons ervati on inf ormatio n is adistri bution of some sort of features (usual ly int ensity or function s of intensity) withinblocks . Neighbo rhood informatio n manife sts itse lf in that all pixels withi n a block share thesam e dis placemen t. If the latter cons traint is not impo sed, block match ing cannot work.On e example is the follo wing extreme cas e. Consi der a block size of 1 3 1, i.e., a blockcontai ning on ly a singl e pixe l. It is we ll known that the re is no way to estimate the mo tionof a pixel whose mo vement is indep endent of all its neighbors [horn 1981].

Wit h the pel rec ursive techni que, say, the st eepest descent method , con servation infor-mati on is the int ensity of the pixel for whi ch the displ acemen t vect or is to be estimate d.Ne ighborho od infor mation manife sts itself as rec ursive ly propagati ng dis placemen t esti-mate s to neighbori ng pixe ls (s patially or temporall y) as initial esti mates.

In Se ction 12.3, it is pointed out that Netravali and Robbi ns sug gested an altern ative,calle d ‘‘ inclus ion of a neighbo rhood area. ’’ That is, to make displace ment estimati on mo rerob ust, they cons ider a sm all neighborh ood V of the pixel for evalu ating the square ofdispla ced frame differe nce (DFD ) in calcul ating the update term. Th ey assume a constantdispla cement vect or within the area. The algori thm thu s becomes

*d k þ 1 ¼

*d k � 1

2 ar

*

d

Xi, x ,y 2 V

wi DFD 2 ( x,y; *d k ) (14: 1)

wherei repre sents an index for the ith pixel ( x, y) within Vwi is the we ight for the ith pixel in V

All the weights satisfy certain conditions; i.e., they are nonnegative, and their sum equals 1.Obviously, in this more advanced algorithm, the conservation information is the intensitydistribution within the neighborhood of the pixel, the neighborhood information isimposed more explicitly, and it is stronger than that in the steepest descent method.

14.1.4 Occlusion and Disocclusion

The problems of occlusion and disocclusion make motion estimation more difficult andhence more challenging. Here we give a brief description about these and other relatedconcepts.

Let us consider Figure 14.1. There, the rectangle ABCD represents an object in an imagetaken at the moment of tn�1, f(x, y, tn�1). The rectangle EFGH denotes the same object,which has been translated, in the image taken at tn moment, f(x, y, tn). In the image f(x, y, tn),

� 2007 by Taylor & Francis Group, LLC.

Page 362: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

D HC G

B FA E

An object at tn – 1 The object at tnFIGURE 14.1Occlusion and disocclusion.

the area BFDH is occluded by the object that newly moves in. On the other hand, in f(x, y, tn),the area ofAECG resurfaces and is referred to as a newly visible area, or a newly exposed area.

Clearly, when occlusion and disocclusion occur, all three major techniques discussed inthis part will encounter a fatal problem, since conservation information may be lost makingmotion estimation fail in the newly exposed areas. If image frames are taken denselyenough along the temporal dimension, however, occlusion and disocclusion may notcause serious problems, since the failure in motion estimation may be restricted to somelimited areas. An extra bit rate paid for the corresponding increase in encoding predictionerror is another way to resolve the problem. If high quality and low bit rate are bothdesired, then some special measures have to be taken.

One of the techniques suitable for handling the situation is Kalman filtering, which isknown as the best, by almost any reasonable criterion, working in the Gaussian white noisecase [brown 1992]. If we consider the system that estimates the 2-D motion to be contami-nated by the Gaussian white noise, we can use Kalman filtering to increase the accuracy ofmotion estimation, particularly along motion discontinuities. It is powerful in doingincremental, dynamic, and real-time estimation.

In estimating 3-D motion, the Kalman filtering was applied in [matthies 1989; pan1994a]. Kalman filters were also utilized in optical flow computation [singh 1991; pan1994b]. In using Kalman filter technique, the question of how to handle the newly exposedareas was raised in [matthies 1989]. In [pan 1994a], one way to handle this issue wasproposed, and some experimental work demonstrated its effectiveness.

14.1.5 Rigid and Nonrigid Motion

There are two types of motion: rigid motion and nonrigid motion. Rigid motion refers tomotion of rigid objects. It is known that our human vision system is capable ofperceiving 2-D projections of 3-D moving rigid bodies as 2-D moving rigid bodies.Most cases in computer vision are concerned with rigid motion. Perhaps this is due tothe fact that most applications in computer vision fall into this category. On the otherhand, rigid motion is easier to handle than nonrigid motion. This can be seen in thefollowing discussion.

Consider a point P in 3-D world space with the coordinates (X, Y, Z) that can berepresented by a column vector

*v :

*v ¼ (X,Y,Z)T (14:2)

Rigid motion involves rotation and translation and has six free motion parameters. Let Rdenote the rotation matrix and T the translational vector. The coordinates of point P in the3-D world after the rigid motion are denoted by

*v0. Then we have

� 2007 by Taylor & Francis Group, LLC.

Page 363: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

*v 0 ¼ R *v þ T (14: 3)

Nonri gid motion is more complic ated. It inv olves deform ation in additio n to rotati onand translatio n, and thus cannot be charac terized by Equatio n 14.3. Accordi ng to theHe lmholtz theory [som merfeld 1950], the counter part of Equatio n 14.3 become s

*v 0 ¼ R *v þ T þ D

*v (14: 4)

where D is a deform ation matrix. Note that R, T, and D are pixe l depen dent. Handlin gnonri gid m otion, hen ce, is ver y complic ated.

In vide ophony and videocon ferencin g appl ications, a typical scene might be a head andshoul der view of a perso n impo sed on a back ground. Th e facial expre ssion is no nrigid innature . Model-b ased facial codi ng has bee n st udied extensive ly [aiz awa 1989, 1995; li1993]. There, a 3-D wi re frame model is use d for handl ing rigid hea d motion . In [li 1993],the facial nonrig id motion is anal yzed as a weigh ted linear c ombina tion of a set of actionunits , instead of dete rmining D

*v directly. Si nce the num ber of acti on units is li mited, the

com putation becomes less expensiv e. In [aiz awa 1989], the portio ns in the hum an face withrich expres sion, such as lips, a re cut and then transmi tted out. At the receivin g end, theportio ns are pasted back in the face .

Among the three types of techni ques, block ma tching may be use d to manage rigidmo tion, while pel rec ursive and opti cal flow ma y be use d to handl e either rigid or no nrigidmo tion.

14.2 Differ ent Classi fi cations

Ther e are various method s in motion estimati on. They can be classi fied in many differentways. We will discuss some of the classi fication s here.

14.2.1 Dete rmin istic Metho ds versus Stochast ic Metho ds

Most algorithm s are dete rminist ic in nature. To see thi s, let us take a loo k at the mo stpromi nent algorithm fo r each of the three maj or 2-D motion estimation tech niques. That is,the Jain and Jain algorithm for the block match ing tech nique [jain 1981]; the Netravali andRobbi ns algori thm for the pel recursi ve tech nique [netrav ali 1979]; and the Horn andSchuck algori thm fo r the opti cal flow techni que [horn 19 81]. All are deterministic method s.Ther e are also stoch astic metho ds in 2-D motion estimati on, such as the Konr ad andDub ois algori thm [konrad 1992] that estimates 2-D motion usi ng the maximu m a posterioriprobability (MAP).

14.2.2 Spatial Domain Methods versus Frequency Domain Methods

While most techniques in 2-D motion analysis are spatial domain methods, there are alsofrequency domain methods [kughlin 1975; heeger 1988; porat 1990; girod 1993; kojima1993; koc 1998]. In [heeger 1988], a method to determine optical flow in the frequencydomain, which is based on spatiotemporal filters, was developed. The basic idea andprinciple of the method is introduced in this section. A very new and effective frequencymet hod for 2-D motion anal ysis [koc 1998] is presen ted in Secti on 14.4, whe re we dis cussnew trends in 2-D motion estimation.

� 2007 by Taylor & Francis Group, LLC.

Page 364: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

14.2.2. 1 Optical Flow Determi nation Using Gabor Energy Filters

The frequency doma in method of optic al fl ow computat ion develop ed by He eger issuitable for highly tex tured image sequ ences. First let us take a loo k at how mo tion canbe dete cted in the fre quency domain.

14.2.2.1.1 Moti on in the Spatio temporal Frequen cy Domain

We begi n our dis cussion wi th a one- dimens ional (1-D) case. The spati al freq uency of a(trans lational ly) movin g sinusoid al signal, vx, is defined as cycles per distanc e (usu allycycles per pixel), while temporal frequency, vt, is defi ned as cycles per time unit(usual ly cycles per frame). Hence, the velocit y of (tran slational ) mo tion, de fined as dis tanceper tim e unit (usual ly pixe ls per frame ), can be related to the spatial and temporalfrequenci es as follows:

v ¼ vt =v x (14 : 5)

A 1-D mo ving sig nal with a velo city v may have multi ple spati al fre quency compone nts.Each spati al freque ncy component vxi , i ¼ 1, 2, . . . has a correspo nding tem poral freq uencycompone nt vti such that

vti ¼ vv xi : (14 : 6)

This relati on is shown in Figu re 14.2. Thus, we see that in the spati otemporal freq uencydoma in, velocit y is the slo pe of a strai ght line relati ng temporal and spatial frequenci es.For 2-D movin g signal s, we denote spatial frequenci es by vx and v y, and velo city vect orby v ¼ ( vx , vy ) The ab ove 1-D result can be ext ended in a straightforw ard mann er asfollo ws:

vt ¼ vx v x þ vy v y (14 : 7)

The interpretation of Equation 14.7 is that a 2-D translating texture pattern occupies a planein the spatiotemporal frequency domain.

14.2.2.1.2 Gabor Energy Filters

As Adelson and Berger pointed out, the translational motion of image patterns is charac-terized by orientat ion in the spati otemporal doma in [adels on 1985] as sh own in Figu re 14.3.Therefore, motion can be detected by using spatiotemporally oriented filters. One of thistype of filter, suggested by Heeger, is the Gabor filter.

0wx

wt = nwx

wt

FIGURE 14.2Velocity in 1-D spatiotemporal frequencydomain.

� 2007 by Taylor & Francis Group, LLC.

Page 365: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

A0

C

D0

x

yy

0

xt

A0 B0

C0

(a)

B0

C0D0

A B

D

(b) (c)

t

x

FIGURE 14.3Orientation in spatiotemporal domain. (a) A horizontal bar translating downwards. (b) A spatiotemporal cube.(c) A slice of the cube perpendicular to y axis. The orientation of the slant edges represents the motion.

A 1-D sine phase Gabor filter is defined as follows:

g(t) ¼ 1ffiffiffiffiffiffi2pp

ssin(2pvt) exp � t2

2s2

� �(14:8)

Obviously, this is a product of a sine function and a Gaussian probability density function(pdf). In the frequency domain, this is the convolution between a pair of impulses locatedin v and �v, and the Fourier transform of the Gaussian, which is itself again a Gaussianfunction. Hence, the Gabor function is localized in a pair of Gaussian windows in thefrequency domain. This means that the Gabor filter is able to selectively pick up somefrequency components.

A 3-D sine Gabor function is

g(x, y, t) ¼ 1ffiffiffi2p

p32sxsyst

� exp � 12

x2

s2xþ y2

s2yþ t2

s2t

!( )

� sin[2p(vx0xþ vy0yþ vt0 t)] (14:9)

where sx, sy, and st are, respectively, the spreads of the Gaussian window along thespatiotemporal dimensions and vx0, vy0, and vt0 are, respectively, the central spatiotem-poral frequencies. The actual Gabor energy filter used by Heeger is the sum of a sine-phasefilter (which is defined above), and a cosine-phase filter (which shares the same spreadsand central frequencies as that in the sine-phase filter, and replaces sine by cosine inEquation 14.9). Its frequency response, therefore, is as follows:

� 2007 by Taylor & Francis Group, LLC.

Page 366: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

G( vx , v y , v t ) ¼ 14exp �4p 2 s 2x (v x � vx 0 )

2 þ s 2y ( vy � v y0 ) 2hn

þ s 2t ( vt � vt 0 ) 2ioþ 1

expn� 4p 2

hs 2x ( vx þ v x0 ) 2

4

þ s 2y ( v y þ vy 0 ) 2 þ s 2t (v t þ v t0 ) 2

io(14 : 10)

This indicate s that the Gabor filter is motion -sensi tive in that it respond s largel y to motionthat has mo re power dis tributed near the central freq uencies in the spatiote mpora l fre-quency doma in, while it respo nds poo rly to motion that has little pow er nea r the cen tralfrequenci es.

14.2.2.1.3 Flow Extraction with Motion Ene rgy

Using a vivid exa mple, Heeger expl ains in his pape r why on e such filter is no t suf fi cient indetecti on of mo tion. Multi ple Gabor filters must be used. In fact, a set of 12 Gabo r fi lters a reutilized in Heeger ’ s algorithm . The 12 Gabo r fi lters in the set have one thing in commo n:

v0 ¼ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiv2x 0 þ v2

y 0

q(14 : 11)

In other words, the 12 filters a re tun ed to the same spatia l fre quency ba nd but to differentspatial orientation and temporal fre quencies.

Brie fl y spe akin g, opti cal flow is dete rmined as fo llows. Denote the measure d motionenergy by ni , i ¼ 1, 2, . . . , 12. Here i indicates one of the 12 Gabor fi lters. The summa tion ofall ni is denoted by

n ¼X12i ¼ 1

ni (14 : 12)

Denote the predic ted motion energy by Pi ( vx, vy), and the sum of pred icted motion energy by

P ¼X12i ¼ 1

Pi ( vx , vy ) (14 : 13)

Similar to what many algorithm s do, opti cal fl ow determi nation is the n conve rted to aminimiz ation probl em. That is, optical fl ow should be able to min imize error betw een themeasure d and predic ted mo tion energi es.

J ( vx , vy ) ¼X12i¼ 1

ni � niPi (v x , vy )

Pi(vx, vy)

" #2(14:14)

Similarly, many readily available numerical methods can be used for solving this mini-mization problem.

14.2.3 Region-Based Approaches versus Gradient-Based Approaches

As state d in Chapte r 10, metho dolog ically speaking, there are generally two approaches to2-D motion analysis for video coding: region-based and gradient-based approaches. As wehave gone through the three major techniques, we can see this classification more clearly.

� 2007 by Taylor & Francis Group, LLC.

Page 367: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

The regi on-based approach can be charac terized as follows. For a regio n in animage frame , we find its best match in ano ther image frame. The relati ve spatia lposit ion betwe en the se two regions produ ces a displace ment vect or. Th e bes t matchingis found by minimizi ng a dissimi larity measu re between the two regi ons, which isde fined as

X( x, y ) 2 R

XM [ f ( x, y, t ), f ( x � dx , y � dy , t � D t )], (14: 15)

whereR denotes a spatial regio n, on which the dis placemen t vect or (dx, dy)

T estimate is basedM [ a, b] denotes a dissimi larity measu re betwe en two argumen ts a and bDt is the tim e inter val betwe en two consecuti ve frames

Block matching certain ly bel ongs to the region-base d appro ach (he re region mean s arec tangula r block ). For an original blo ck in a (cu rrent) frame , block match ing search esfor its bes t match in another (previo us) frame amo ng candidat es. Se veral dis similaritymeasu res are utilized , among which the mean absolut e difference (MA D) is usedmo st of ten.

Altho ugh it uses the spatial gradient of int ensity function, the pel recursiv e method withinc lusion of a neighborh ood area assu mes the same dis placemen t v ector withi n a neigh-bor hood region. A weighted sum of the squared DFD withi n the region is used as adissimi larity measu re. By usi ng numer ical metho ds such as vari ous des cent met hods, thepel recursive metho d iterat ively min imizes the dissimi larity measure, thu s deliveri ngdispla cement vectors . The pel recursi ve techniq ue is there fore in the categ ory of region-based app roaches .

In optical fl ow computat ion, the two most fre quently use d techni ques dis cussed inChapte r 13 are the grad ient met hod and the correlat ion met hod. Clear ly, the corre lationmet hod is regio n-base d. In fact, as we pointe d out in Chap ter 13, it is ver y similar toblock match ing.

As far as the grad ient-b ased approach is concer ned, we start its charac terizat ion with thebrigh tness invari ant equ ation, cov ered in Chapte r 13. Th at is, we assu me that brightne ss iscons erved during the time interval between two cons ecutive image frames.

f ( x, y, t ) ¼ f ( x � dx , y � dy , t � Dt ) (14: 16)

By expan ding the right-ha nd sid e of Equatio n 14.16 into the Taylor series, a pplyingEquati on 14.16, and some mathemat ical manipul ation, we can derive Equ ation 14.17.

fx u þ f y v þ f t ¼ 0 (14: 17)

where fx, f y, f t are partial deri vatives of inten sity functi on wi th respec t to x, y, and t ,respec tively ; and u and v are two component s of pixe l velo city. Equatio n 14.17 contai nsgrad ients of inten sity function with respec t to spatial and tem poral varia bles and links twocom ponents of the displace ment vect or. The square of the left-hand side in Equati on 14.17is an error that need s to be minimize d. Through the minimizat ion, we can estimatedispla cement vect ors.

Clear ly, the gradient metho d in opti cal flow dete rminati on, discusse d in Chap ter 13, fallsinto the above framew ork. Ther e, an extra constrai nt is impos ed and included into the errorrepresented in Equation 14.17.

Table 14.1 summa rizes what we discusse d in this secti on.

� 2007 by Taylor & Francis Group, LLC.

Page 368: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

TABLE 14.1

Region-Based versus Gradient-Based Approaches

Block Matching Pel Recursive

Optical Flow

Gradient-BasedMethod

Correlation-BasedMethod

Regional-based approachesp p p

Gradient-based approachesp

14.2.4 Forwa rd versus Bac kward Moti on Estimati on

Motion compensate d predic tive vide o coding may be done in two differe nt ways: forwa rdand back ward [bo roczky 1991 ]. These ways are dep icted in Figu res 14.4 and 14.5, resp ect-ively. With the forward manner, motion estimation is carried out by using the originalinput video frame and the reconstructed previous input video frame. With the backwardmanner, motion estimation is implemented with two successive reconstructed input videoframes.

Forward manner provides relatively higher accuracy in motion estimation and hencemore efficient motion compensation than the backward manner, owing to the fact thatthe original input video frames are utilized. However, the backward manner does notneed to transmit motion vectors to the receiving end as an overhead, while the forwardmanner does.

Block matching is used in almost all the international video coding standards, such asH.261, H.263, and MPEG 1 and MPEG 2 (covered in the next part of this book), as forwardmotion estimation. The pel recursive technique is used as backward motion estimation. Inthis way, the pel recursive technique avoids encoding a large amount of motion vectors.On the other hand, however, it provides relatively less accurate motion estimation than

fr

eVideo in f– T Q

Q–1

T

–1

MCP

+

ME

v

q

FB

fp

FIGURE 14.4Forward motion estimation and compensa-tion. T: transformer, Q: quantizer, FB: framebuffer, MCP: motion compensated predictor,ME: motion estimator, e: prediction error,f: input video frame, fp: predicted videoframe, fr: reconstructed video frame, q: quant-ized transform coefficients, v: motion vector.

� 2007 by Taylor & Francis Group, LLC.

Page 369: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

FIGURE 14.5Backward motion estimation andcompensation. T: transformer,Q: quantizer, FB: frame buffer, MCP:motion compensated predictor, ME:motion estimator, e: prediction error,f: input video frame, fp: predictedvideo frame, fr1: reconstructed videoframe, fr2: reconstructed previousvideo frame, q: quantized transformcoefficients.

fr1

fr2

eVideo in f– T Q

Q–1

T

–1

MCP

+

ME

q

FB

fp

block match ing. Optica l flow is usually use d as fo rward motion esti matio n in motioncom pensated video coding. Therefor e, as expe cted, it achieves higher motion estimati onaccu racy on the one hand and needs to handle a large amoun t of mo tion vect ors asoverh ead on the ot her hand. This wi ll be discusse d in Secti on 14.3.

It is note d that on e of the new improve ments in the block match ing techni que isdescri bed in Secti on 11.6.3. It is ca lled the pred ictive motion field segme ntation techni que[orchard 1993], and it is motivated by backward motion estimation. There, segmentation isconducted backwards, i.e., based on previously decoded frames. The purpose of this is tosave overhead for shape information of motion discontinuities.

14.3 Performance Comparison between Three Major Approaches

14.3.1 Three Representatives

A performance comparison between the three major approaches, block matching,pel recursion, and optical flow, was provided in a review paper by Dufaux and Moscheni[dufaux 1995]. Experimental work was carried out as follows. The conventional full searchblock matching is chosen as a representative for the block matching approach, while theNetravali and Robbins algorithm and the modified Horn and Schunck algorithm arechosen to represent the pel recursion and optical flow approaches, respectively.

14.3.2 Algorithm Parameters

In full search block matching, the block size is chosen as 163 16 pixels, the maximumdisplacement is �15 pixels, and the accuracy is half pixels. In the Netravali and Robbinspel recursion, «¼ 1=1024, the update term is averaged in an area of 53 5 pixels and clipped

� 2007 by Taylor & Francis Group, LLC.

Page 370: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

to a maximu m of 1=16 pixe ls per frame, a nd the algorithm iterat es one iterat ion per pixel.In the mo di fied Ho rn and Schunc k algorithm , the weigh t a2 is set to 100, and 100 iterat ionsof the Gauss and Seidel procedur e are carried out.

14.3.3 Experi mental Res ults and Obser vations

The three test video sequ ences are the ‘‘ Mo bile and Calen dar, ’’ ‘‘ Flower Gard en, ’’ and‘‘ Table Tenn is. ’’ Both subject ive criterion (in terms of needle diagram s sh owing displace -ment vector s) and obje ctive crit eria (in terms of DFD err or energy ) are appl ied to access thequalit y of m otion esti mation.

It turns out that the pe l recursi ve algori thm gives the wors t a ccuracy in motion estima-tion. In par ticular, it canno t follo w the fast and large motion s. Both block matching andoptical flow algorithm s give bett er motion esti matio n.

It is note d that we must be cautious in drawing c onclusions from these tests. This isbecaus e different algorithm s in the same cate gory and the sam e algori thm under differentimpl ementati on conditio ns will provi de qui te different perform ances. In the ab ove expe ri-ments , the full search block match ing with hal f-pixel accuracy is on e of the bett er blo ckmatch ing techni ques. On the con trary, there are many improve d pel rec ursive and opticalflow algorithm s, which outp erform the chosen repres entatives in the repo rted expe rimen ts.

The expe riments do, howe ver, provi de an insi ght about the three major approach es. Pelrecursi ve algori thms are seldo m used in video coding now, mainly due to their inaccuratemotion estim ation, alth ough they do no t require transmi tting mo tion vect ors to the receiv-ing end. Althou gh they can provi de relati vely accu rate motion estimatio n, optical fl owalgorithm s requ ire a large amoun t of overh ead for handlin g dense motion vector s. Thisprevents the optical fl ow tech niques from wide and practi cal usage in video codi ng. Blo ckmatch ing is simple, yet ver y ef fi cient for motion estimati on. It provid es quite accurate andreliab le motion esti matio n fo r mo st practi cal vide o sequenc es in spite of its simp le piece-wise transl ational model . At the same tim e it does no t require much overh ead. Ther efore,for fi rst-ge neration video codi ng, block match ing is cons idered to be the mo st sui tableamong the three approach es.

14 .4 Ne w T rends

In Chapte rs 11, 12, and 13, many new, eff ective improve ments wi thin the three maj orapproach es we re dis cussed. These techni ques include multi resolutio n block match ing,(locall y ada ptive) multi grid block match ing, overlap ped block match ing, thre sholdin gtechni ques, (pr edictiv e) motion fi eld segme ntation, feed back and multiple attribu tes inoptical flow computat ion, sub -pixel accu racy, and so on. Some imp rovemen ts will bediscuss ed in Part IV, where various internatio nal vide o coding st andards such as H.263and MPEG 2, and 4 are intro duced .

As pointe d out in [orchar d 1998], toda y our understan ding of motion analy sis and videocompr ession is still ba sed on an ad hoc frame work, in gene ral. What toda y’ s standards haveachieve d is no t near the ideally possible performance. Therefore, more efforts are continu-ously made in this field, seeking much more simple, practical, and efficient algorithms.

As an example of such developments, we conclude this chapter by presenting a novelmethod for 2-D motion estimation: the discrete cosine transform (DCT)-based motionestimation [koc 1998].

� 2007 by Taylor & Francis Group, LLC.

Page 371: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

14.4.1 DCT -Based Motion Esti mation

As poi nted out in Secti on 14.2.2, as opp osed to the conve ntional 2-D mo tion estimati ontechni ques, this method is carried out in the frequency doma in. It is also different from theGabo r energy filter met hod by He eger (Secti on 14. 2.2.1). Witho ut int roducing Gabor filters,this method is directly DCT-based. The fundamental concepts and techniques of thismethod are discussed below.

14.4.1.1 DCT and DST Pseudophases

The underlying idea behind this method is to estimate 2-D translational motion by deter-mining the DCT andDST (discrete sine transform) pseudophases. Let us use the simpler 1-Dcase to illustrate this concept. Once it is established, it can be easily extended to the 2-D case.

Consider a 1-D signal sequence { f(n), n 2 (0, 1, . . . , N� 1)} of length N. Its translatedversion is denoted by {g(n), n 2 (0, 1, . . . , N� 1)}. The translation is defined as follows:

g(n) ¼ f (n� d), if (n� d) 2 (0, 1, . . . ,N � 1)0, otherwise

�(14:18)

In Equation 14.18, d is the amount of the translation and it needs to be estimated. Let usdefine the following several functions before introducing the pseudophases. The DCT andthe DST of the second kind of g(n), GC(k) and GS(k), are defined as follows:

GC(k) ¼ 2NC(k)

XN�1n¼0

g(n) coskpN

(nþ 0:5)� �

k 2 {0, 1, . . . ,N � 1} (14:19)

S 2 XN�1 kp� �

G (k) ¼NC(k)

n¼0g(n) sin

N(nþ 0:5) k 2 {1, . . . ,N} (14:20)

The DCT and DST of the first kind of f(n), FC(k) and FS(k), are defined as

FC(k) ¼ 2NC(k)

XN�1n¼0

f (n) coskpN

n� �

k 2 {0, 1, . . . ,N � 1} (14:21)

S 2 XN�1 kp� �

F (k) ¼NC(k)

n¼0f (n) sin

Nn k 2 {1, . . . ,N} (14:22)

In the above equations, C(k) is defined as

C(k) ¼1ffiffi2p for n ¼ 0 or N1 otherwise

�(14:23)

Now we are in a position to introduce Equation 14.24, which relates the translationalamount d to the DCT and DST of the original sequence and its translated version, definedabove. That is,

GC(k)GS(k)

� �¼ FC(k) �FS(k)

FC(k) FC(k)

� �DC(k)DS(k)

� �(14:24)

where DC(k) and DS(k) are referred to as the pseudophases and defined as follows:

� 2007 by Taylor & Francis Group, LLC.

Page 372: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

D C ( k ) D¼ cosk pN

d þ 12

� �� �

D S ( k ) D¼ sink pN

d þ 12

� �� � (14 : 25)

Equatio n 14.24 can be sol ved for the amoun t of translati on d, thus motion estimati on. Thisbecome s cleare r when we rewrite the equatio n in a matrix –vect or format. Denote the 2 3 2matrix in Equatio n 14. 24 by F (k ), the 2 3 1 colu mn vect or at the left-h and side of theequati on by ~G ( k ), and the 2 3 1 column vect or at the right-ha nd sid e by ~D ( k ). It is easy toverify that the matrix F( k ) is orthog onal by obse rving the follo wing:

lFT ( k ) F( k ) ¼ I (14 : 26)

where I is a 2 3 2 ide ntity matrix, the constant l is

l ¼ 1

[ FC ( k )]2 þ [F S (k )] 2 (14 : 27)

We the n derive the matrix –vector fo rmat of Equati on 1 4.24 as fo llows:

~D( k ) ¼ lFT ( k ) *G ( k ) k 2 {1, . . . , N � 1} (14 : 28)

14.4.1. 2 Sinusoid al Orthog onal Princip le

It was show n that the pseudop hases, which contai n the transl ation inf ormatio n, can bedetermi ned in the DC T and DS T fre quency doma in. But ho w the amo unt of the translati oncan be found has no t been menti oned. Here, the algori thm uses the sinus oidal principle topick up thi s informati on. Th at is, the inv erse DST of the seco nd kind of scaled pse udo-phase, C ( k )D S ( k ), is found to equal a n alge braic sum of the following two dis crete imp ulsesaccord ing to the sinus oidal orthog onal princip le:

ISDTnC( k ) D S ( k )

o 2N

XNk ¼ 1

C2 ( k ) D S ( k ) sink pN

n þ 12

� �� �

¼ d( d � n ) � d( d þ n þ 1) (14 : 29)

Since the inv erse DST is limit ed to n 2 {0, 1, . . . , N � 1}, the only peak value am ong this setof N values indica tes the amo unt of the translati on d. Furthe rmore, the directio n of thetranslati on (pos itive or nega tive) can be dete rmine d from the polarity (positiv e or nega tive)of the peak value.

The block diagram of the algorithm is shown in Figure 14.6. This technique can be extendedto the 2-D case in a straightforward manner. Interested readers can refer to [koc 1998].

14.4.1.3 Performance Comparison

The algorithm was applied to several typical testing video sequences, such as the‘‘Miss America’’ and ‘‘Flower Garden’’ sequences, and an ‘‘Infrared Car’’ sequence.The results were compared with the conventional full search block matching technique

� 2007 by Taylor & Francis Group, LLC.

Page 373: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

FIGURE 14.6Block diagram of DCT-based motion estimation (1-D case).

f (n) and g (n)

DCT and DST

Pseudophase computation

IDST of {Ds(k)}

Determination of peak position and polarity

d

and several fast search block matching techniques such as the 2-D logarithm search,three-step search, search with subsampling in the original block, and the correlationwindows.

Before applying the algorithm, one of the following preprocessing procedures is imple-mented: frame differentiation or edge extraction. It was reported that for the ‘‘FlowerGarden’’ and ‘‘Infrared Car’’ sequences, the DCT-based algorithm achieves higher codingefficiency than all three fast search block matching methods, while for the Miss Americasequence it obtains a lower efficiency. It was also reported that it performs well even in anoisy situation.

A lower computational complexity, O(M2) for anM3M search range, is one of the majoradvantages possessed by the DCT-based motion estimation algorithm compared withconventional full search block matching, O(M2N2) for an M3M search range and anN3N block size.

With DCT-based motion estimation, a fully DCT-based motion compensated coderstructure becomes possible, which is expected to achieve a higher throughput and alower system complexity.

14.5 Summary

In this chapter, which concludes the motion analysis and compensation portion of thebook, we first generalize the discussion of the aperture problem, the ill-posed nature, and

� 2007 by Taylor & Francis Group, LLC.

Page 374: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

the conserv ation and neighbo rhood infor mation uni fied point of view, previo usly madewith respect to the opti cal flow techni que in Chapter 13, to cover blo ck match ing and pelrecursive techniques. Then, the occlusion and disocclusion, and rigidity and nonrigidityare discussed with respect to the three techniques. The difficulty of nonrigid motionestimation is analyzed. Its relevance in visual communications is addressed.

Different classifications of various methods in the three major 2-D motion estimationtechniques, block matching, pel recursion, and optical flow, are presented. Besides thefrequently utilized deterministic methods, spatial domain methods, region-based methods,and forward motion estimation, their counterparts: stochastic methods, frequency domainmethods, gradient methods, and backward motion estimation are introduced. In particu-lar, two frequency domain methods are presented with some detail. They are the methodusing the Gabor energy filter and the DCT-based method.

A performance comparison between the three techniques is also introduced here, basedon which observations are drawn. A main point is that block matching is at present themost suitable technique for 2-D motion estimation among the three techniques.

Exercises

1. What is the difference between the rigid motion and nonrigid motion? In facial encod-ing, what is the nonrigid motion? How is the nonrigid motion handled?

2. How is 2-D motion estimation carried out in the frequency domain? What are theunderlying ideas behind the Heeger method, and the Koc and Liu method?

3. Why is one Gabor energy filter not sufficient in motion estimation? Draw the powerspectrum of a 2-D sine-phase Gabor function.

4. Show the correspondence of a positive (negative) peak value in the inverse DST of thesecond kind of DST pseudophase to a positive (negative) translation in the 1-D spatialdomain.

5. How does neighborhood information manifest itself in the pel recursive technique?6. Using your own words and some diagrams, state that the translational motion of an

image pattern is characterized by orientation in the spatiotemporal domain.

References

[adelson 1985] E.H. Adelson and J.R. Bergen, Spatiotemporal energy models for the perception ofmotion, Journal of Optical Society of America, A2, 2, 284–299, 1985.

[aizawa 1989] K. Aizawa and H. Harashima, Model-based analysis synthesis image coding (MBA-SIC) system for a person’s face, Signal Processing: Image Communications, 139–152, 1989.

[aizawa 1995] K. Aizawa and T.S. Huang, Model-based image coding: Advanced video codingtechniques for very low bit rate applications, Proceedings of the IEEE, 83, 2, 259–271, February1995.

[boroczky 1991] L. Boroczky, Pel recursive motion estimation for image coding, Ph.D. dissertation,Delft University of Technology, The Netherlands, 1991.

[brown 1992] R.G. Brown and P.Y.C. Hwang, Introduction to Random Signals, 2nd edn., John Wiley &Sons, New York, 1992.

[dufaux 1995] F. Dufaux and F. Moscheni, Motion estimation techniques for digital TV: A review anda new contribution, Proceedings of the IEEE, 83, 6, 858–876, 1995.

[girod 1993] B. Girod, Motion-compensating prediction with fractional-pel accuracy, IEEE Transac-tions on Communications, 41, 604, 1993.

� 2007 by Taylor & Francis Group, LLC.

Page 375: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

[heeger 1988] D.J. Heeger, Optical flow using spatiotemporal filters, International Journal of ComputerVision, 1, 279–302, 1988.

[horn 1981] B.K.P. Horn and B.G. Schunck, Determining optical flow, Artificial Intelligence, 17,185–203, 1981.

[jain 1981] J.R. Jain and A.K. Jain, Displacement measurement and its application in interframe imagecoding, IEEE Transactions on Communications, COM-29, 12, 1799–1808, December 1981.

[koc 1998] U.-V. Koc and K.J.R. Liu, DCT-based motion estimation, IEEE Transactions on ImageProcessing, 7, 7, 948–865, July 1998.

[kojima 1993] A. Kojima, N. Sakurai, and J. Kishigami, Motion detection Using 3D FFT Spectrum,Proceedings of International Conference on Acoustics, Speech, and Signal Processing, V, 213–216,April 1993.

[konrad 1992] J. Konrad and E. Dubois, Bayesian estimation of motion vector fields, IEEE Transactionson Pattern Analysis and Machine Intelligence, 14, 9, 910–927, 1992.

[kughlin 1975] C.D. Kughlin and D.C. Hines, The phase correlation image alignment method,Proceedings of 1975 IEEE International Conference on Systems, Man, and Cybernetics,163–165, 1975.

[li 1993] H. Li, P. Roivainen, and R. Forchheimer, 3-D motion estimation in model-based facial imagecoding, IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 545–555, 1993.

[matthies 1989] L. Matthies, T. Kanade, and R. Szeliski, Kalman filter-based algorithms for estimatingdepth from image sequences, International Journal of Computer Vision, 3, 209–236 (1989).

[netravali 1979] A.N. Netravali and J.D. Robbins, Motion compensated television coding: Part I, TheBell System Technical Journal, 58, 3, 631–670, March 1979.

[orchard 1993] M.T. Orchard, Predictive motion-field segmentation for image sequence coding, IEEETransactions on Circuits and Systems for Video Technology, 3, 1, 54–69, February 1993.

[orchard 1998] M.T. Orchard, Visual coding standards: A research community’s midlife crisis? IEEESignal Processing Magazine, 43, March 1998.

[pan 1994a] J.N. Pan, Y.Q. Shi, and C.Q. Shu, A Kalman filter in motion analysis from stereo imagesequences, Proceedings of IEEE 1994 International Conference on Image Processing, Vol. 3,Austin, TX, pp. 63–67, November 1994.

[pan 1994b] J.N. Pan and Y.Q. Shi, A Kalman filter for improving optical flow accuracy along movingboundaries, Proceedings of SPIE 1994 Visual Communication and Image Processing, Vol. 1,Chicago, IL, pp. 638–649, September 1994.

[porat 1990] B. Porat and B. Friedlander, A frequency domain algorithm for multiframe detection andestimation of dim targets, IEEE Transactions on Pattern Recognition and Machine Intelligence, 12,398–401, April 1990.

[singh 1991] A. Singh, Incremental estimation of image-flow using a Kalman filter, Proceedings of1991 IEEE Workshop on Visual Motion, Princeton, NJ, 36–43, 1991.

[sommerfeld 1950] A. Sommerfeld, Mechanics of Deformable Bodies, Academic Press, 1950.[tekalp 1995] A.M. Tekalp, Digital Video Processing, Prentice-Hall PTR, Upper Saddle River, NJ, 1995.

� 2007 by Taylor & Francis Group, LLC.

Page 376: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Part IV

Video Compression

� 2007 by Taylor & Francis Group, LLC.

Page 377: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

� 2007 by Taylor & Francis Group, LLC.

Page 378: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

15Fundamentals of Digital Video Coding

In thi s chapte r, we introduce the fundame ntals of dig ital vide o coding which inc ludedigital video represen tation, rate dis tortion theo ry, and digit al video formats. We alsogive a brief overv iew of image and vide o coding st andards, which will be discuss ed in thesubsequ ent chapte rs.

15.1 Digital Video Representation

As we discuss ed in previo us chap ters, a digit al image is obt ained by quan tizing a con-tinuou s image bot h spati ally and in amp litude. Digitizat ion of the spatial coor dinat es iscalle d image sam pling , where as digitiza tion of the amplitu de is called gray-l evel quan ti-zation . Suppos e that a continu ous image is denote d by g( x, y), where the amplitu de orvalue of g at the point ( x, y) is the inten sity or brightne ss of an image at that point. Thetransf ormatio n of a conti nuous image to a digital image can the n be express ed as

f ( m, n) ¼ Q [ g( x0 þ m Dx, y0 þ nD y)] (15 : 1)

whereQ is a quan tiza tion operatorx0 and y0 are the origi ns of image planem and n are the discrete values 0, 1, 2, . . . , Dx and Dy are the samplin g intervals in the

horizontal and vertical directions, respectively

If the sampling process is extended to a third temporal direction (or the original signal inthe temporal direction is discrete format), a sequence, f(m, n, t), is obtained as introduced inChapter 10,

f (m,n, t) ¼ Q[ g(x0 þmDx, y0 þ nDy, t0 þ tDt)] (15:2)

wheret takes the values 0, 1, 2, . . .Dt is the time interval

Each point of the image or each basic element of the image is called as a pixel or pel. Eachindividual image is called a frame. According to the sampling theory, the original continu-ous signal can be recovered exactly from its samples if the sampling frequency is two timeshigher than the bandwidth of original signal [oppenheim 1989]. The frames are normallypresented at a regular time interval so that the eye can perceive fluid motion. For example,

� 2007 by Taylor & Francis Group, LLC.

Page 379: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

the NTSC (Nat ional Televis ion Syste ms Committe e) speci fied a temporal sam pling rate of30 frame s=s and interlace 2 to 1. Therefor e, as a result of this spati otemporal samplin g, thedigit al signal s exhibit high spatial and temporal cor relation, just as the anal og signal s didbefore video dat a com pression. In Section 15.2, we wi ll disc uss the theo retical basis of vide odigit ization. An impo rtant notion is the strong depen dence betw een value s of neighbo ringpixe ls wi thin the sam e frame , and betw een the frames thems elves; this can be rega rdedas statist ical redund ancy of the image sequ ence. In the follo wing secti on, we explainhow this statistical redundanc y is expl oited to achi eve com pression of the digitizedimage seq uence.

1 5 . 2 Information Theory Results: Rate Distortion Function of Video Signal

The princi pal goal in the design of a vide o codi ng syste m is to redu ce the transmi ssion raterequi remen ts of the vide o source subject to some pictu re quality cons traint. Th ere are onlytwo ways to acco mplish this goal: reductio n of the st atistical redundanc y and psychop hys-ical redunda ncy of the vide o source. The vide o source is normal ly very high ly corre lated,bot h spatial ly and tem porally ; that is, strong depen dence can be rega rded as statist icalredund ancy of the data source. If the vide o source to be coded in a transmi ssion system isviewe d by a human observe r, the perce ptual limit ations of hum an vision can be use d toreduce transmi ssion requi remen ts. Hu man observers are subj ect to pe rceptual limitation sin amplitu de, spatial resolution, and temporal a cuity. By prop er design of the codingsystem , it is possibl e to discard informati on withou t a ffecting perce ption, or at least, withonly min imal degrad ation. In summa ry, we c an use two factors: the statist ical struc ture ofthe data source and the fidelity require ments of the end user, whi ch make the com pressionposs ible. The perform ance of the video compr ession algorithm dep ends on several factors.First, and also a basic one, is the amo unt of redundanc y containe d in the video data sou rce.In other word s, if the original source contains a large amoun t of inform ation, or highcom plexity then more bits are neede d to repre sent the com press ed data. Second , if a lossycoding techni que is use d, by which some amo unt of loss is permi tted in the recon structedvide o data then the perfo rmance of the coding tech nique depends on the com pressionalgori thm and distortio n measure ments. In lossy coding, differe nt dis tortion measu re-men ts will perceiv e the loss in different ways , giving different subj ective results. Thedeve lopmen t of a dis tortion measu re that can provi de consistent num erical and subj ectiveresult s is a very dif ficulty task . Mo reover, the maj ority of the video com press ion appl ica-tions do not require lossless coding, i.e., it is not required that the reconstructed andoriginal images be identical or reversible.

This intuitive explanation of how redundancy and lossy coding methods can be used to reduce source data is made more precise by Shannon's rate distortion theory [berger 1971], which addresses the problem of how to characterize both the source and the distortion measure. Let us consider the model of a typical visual communication system depicted in Figure 15.1. The source video data is fed to the encoder system, which consists of two parts: source coding and channel coding. The function of the source coding is to remove the redundancy in both the spatial and temporal domains, whereas the function of channel coding is to insert controlled redundancy, which is used to protect the transmitted data from the interference of channel noise. It should be noted that, according to Shannon [shannon 1948], certain conditions allow the source and channel coding operations to be separated without any loss of optimality, such as when the sources are ergodic. However, Shannon did not indicate the complexity constraint on the coder involved. In practical systems, which are limited in complexity, this separation may not be possible.


FIGURE 15.1 A typical visual communication system: input data passes through a source encoder and a channel encoder to the channel, and through a channel decoder and a source decoder to yield the reconstructed data.

[viterbi 1979]. There is still some work on the joint optimization of the source and channelcoding [modestino 1981; sayood 1991]. Coming back to rate distortion theory, the problemaddressed here is the minimization of the channel capacity requirement, while maintainingthe average distortion at or below an acceptable level.

The rate distortion function R(D) is the minimum average rate (bits per element), and hence minimum channel capacity, required for a given average distortion level D. To make this more quantitative, we suppose that the source is a sequence of pixels, and these values are encoded by successive blocks of length N. Each block of pixels is then described by one of a denumerable set of messages, {X_i}, with probability function P(X_i). For a given input source, {X_i}, and output, {Y_j}, the decoder system can be described mathematically by the conditional probability Q(Y_j | X_i). Therefore, the probability of the output message is

T(Y_j) = \sum_i P(X_i) Q(Y_j | X_i)    (15.3)

The information transmitted is called the average mutual information between Y and X and is defined for a block of length N as follows:

I_N(X, Y) = \sum_i \sum_j P(X_i) Q(Y_j | X_i) \log_2 [ Q(Y_j | X_i) / T(Y_j) ]    (15.4)

In the case of error-free encoding, Y = X, and then

Q(Y_j | X_i) = 1 for j = i, and 0 for j \neq i, so that T(Y_j) = P(X_j)    (15.5)

In this case, Equation 15.4 becomes

I_N(X, Y) = -\sum_i P(X_i) \log_2 P(X_i) = H_N(X)    (15.6)

which is the Nth-order entropy of the data source. This can also be seen as the information contained in the data source under the assumption that no correlation exists between blocks and all the correlation between elements of each N-length block is considered. Therefore, at least H_N(X) bits are required to code the data source without any information loss. In other words, the optimal error-free encoder requires H_N(X) bits for the given data source. In the most general case, noise in the communication channel will result in errors at least some of the time, causing Y \neq X. As a result,

I_N(X, Y) = H_N(X) - H_N(X | Y)    (15.7)


where H_N(X | Y) is the entropy of the source data conditioned on the decoder output Y. Because entropy is a nonnegative quantity, the source entropy is an upper bound on the mutual information; i.e.,

I_N(X, Y) \leq H_N(X)    (15.8)
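To make Equations 15.3, 15.4, and 15.8 concrete, the short Python sketch below computes the output distribution, the average mutual information, and the source entropy for a small, made-up discrete source and test channel; the probability values are purely illustrative and are not taken from the text.

```python
import numpy as np

# Illustrative source distribution P(X_i) and test channel Q(Y_j | X_i);
# the numbers are made up for this example.
P = np.array([0.5, 0.3, 0.2])                       # P(X_i)
Q = np.array([[0.9, 0.1, 0.0],                      # Q(Y_j | X_i), each row sums to 1
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

# Equation 15.3: T(Y_j) = sum_i P(X_i) Q(Y_j | X_i)
T = P @ Q

# Equation 15.4: I(X, Y) = sum_i sum_j P(X_i) Q(Y_j|X_i) log2[Q(Y_j|X_i) / T(Y_j)]
joint = P[:, None] * Q                              # P(X_i) Q(Y_j | X_i)
ratio = np.where(Q > 0, Q / T[None, :], 1.0)        # log2(1) = 0 handles zero-probability terms
I = np.sum(joint * np.log2(ratio))

# Source entropy H(X); Equation 15.8 states that I(X, Y) <= H(X).
H = -np.sum(P * np.log2(P))
print("T(Y) =", T)
print("I(X, Y) = %.4f bits, H(X) = %.4f bits" % (I, H))
```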

Let d(X, Y) be the distortion between X and Y. Then, the average distortion per pixel is defined as

D(Q) = \frac{1}{N} E\{d(X, Y)\} = \frac{1}{N} \sum_i \sum_j d(X_i, Y_j) P(X_i) Q(Y_j | X_i)    (15.9)

The set of all conditional probability assignments, Q(Y | X), that yield average distortion less than or equal to D* can be written as

\{ Q : D(Q) \leq D^* \}    (15.10)

The N-block R(D) is then defined as the minimum of the average mutual information, I_N(X, Y), per pixel:

R_N(D^*) = \min_{Q : D(Q) \leq D^*} \frac{1}{N} I_N(X, Y)    (15.11)

The limiting value of the N-block R(D) is simply called the rate distortion function,

R(D^*) = \lim_{N \to \infty} R_N(D^*)    (15.12)

It should be clear from the above discussion that Shannon's rate distortion function is a lower bound on the transmission rate required to achieve an average distortion D when the block size is infinite. In other words, when the block size approaches infinity, the correlation between all elements within the block is counted as information contained in the data source. Therefore, the rate obtained is the lowest rate, or lower bound. Under these conditions, the rate at which a data source produces information, subject to a requirement of perfect reconstruction, is called the entropy of the data source, i.e., the information contained in the data source. It follows that R(D) is a generalization of the concept of entropy. Indeed, if perfect reproduction is assigned zero distortion, then R(0) is equal to the source entropy H(X). Shannon's coding theorem states that one can design a coding system with rate only negligibly greater than R(D) that achieves the average distortion D. As D increases, R(D) decreases monotonically and usually becomes zero at some finite value of distortion. The R(D) specifies the minimum achievable transmission rate required to transmit data with average distortion level D. The main value of this function in a practical application is that it potentially gives a measure for judging the performance of a coding system. However, this potential value has not been completely realized for video transmission. There are two reasons for this. First, there currently do not exist tractable and faithful mathematical models for an image source. The R(D) for Gaussian sources under the squared error distortion criterion can be found, but a Gaussian source is not a good model for images. Second, a suitable distortion measure, D, which matches the subjective evaluation of image quality, has not yet been found.
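For reference, the memoryless Gaussian source under the squared-error criterion is one of the few cases with a known closed form, R(D) = (1/2) log2(sigma^2 / D) for 0 < D <= sigma^2 and R(D) = 0 for larger D. A minimal Python sketch of this curve follows; it is illustrative only and, as noted above, is not a faithful image model.

```python
import numpy as np

def gaussian_rd(variance, D):
    """Rate distortion function of a memoryless Gaussian source under the
    squared-error distortion measure:
        R(D) = 0.5 * log2(variance / D)  for 0 < D <= variance,
        R(D) = 0                          for D  > variance.
    """
    D = np.asarray(D, dtype=float)
    return np.where(D < variance, 0.5 * np.log2(variance / D), 0.0)

# Example: unit-variance source, distortion levels from small to large.
D = np.array([0.01, 0.1, 0.5, 1.0, 2.0])
print(gaussian_rd(1.0, D))   # rates in bits per sample
```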


Some results have been investigated for this task, such as JND (just noticeable distortion) [jnd]. The issue of subjective and objective assessments of image quality has been discussed in Chapter 1. In spite of these drawbacks, the rate distortion theorem is still a mathematical basis for comparing the performance of different coding systems.

15.3 Digital Video Formats

15.3.1 Digital Video Color Systems

In practical applications, most video signals are color signals. Various color systems have been discussed in Chapter 1. A color signal can be seen as a summation of light intensities of three primary wavelength bands. In this section, we introduce several extensively used color systems in the video industry. There are two primary color spaces used to represent digital video signals: RGB (red, green, blue) and YCbCr. The difference between RGB and YCbCr is that RGB represents color as red, green, and blue, whereas YCbCr represents color as brightness and two color difference signals. In YCbCr, Y is the brightness (luma), Cb is blue minus luma (B - Y), and Cr is red minus luma (R - Y). We usually use YCC as a short way of saying YCbCr. The standard RGB color space is referred to as sRGB, which is an RGB color space created cooperatively by Hewlett-Packard and Microsoft Corporation. It has been endorsed by many industry players. It is also well accepted by open source software such as GIMP (GNU Image Manipulation Program), and is used in proprietary and open graphics file formats such as PNG (Portable Network Graphics). The sYCC is simply YCC created from sRGB [IEC 61966-2-1]. The YCbCr color representation is used for most video coding standards in compliance with ITU-R BT.601 [ITU-R BT.601], BT.709 [ITU-R BT.709], the common intermediate format (CIF), and the source input format (SIF). ITU-R BT.601 is an ITU-R standard for component digital video. It was designed to provide a common digital standard for interoperability between the three analog video/TV systems (NTSC, PAL, and SECAM). ITU-R BT.601 enables their signals to be converted to digital and then easily converted back again to any of the three formats for distribution. In 1990, BT.709 was introduced for high definition television (HDTV) with specifications for 1125 and 1250 lines. In 2000, BT.709-4 added 1080 lines to conform to the digital television (DTV) standard. Conversion between the YCbCr and RGB formats can be accomplished with the transformations in Chapter 1.
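As a rough illustration of the kind of RGB-to-YCbCr transformation referred to above (the exact matrices and the studio-range offsets are given in Chapter 1 and in the BT.601/BT.709 documents), the following sketch uses the BT.601 luma coefficients on normalized R'G'B' values; treat it as an assumption-laden example rather than the normative conversion.

```python
import numpy as np

def rgb_to_ycbcr_601(rgb):
    """Convert normalized R'G'B' (values in [0, 1]) to Y'CbCr using the
    BT.601 luma coefficients.  Y' is in [0, 1]; Cb and Cr are centered on 0.
    This is an illustrative sketch, not the full studio-range quantization
    defined in ITU-R BT.601."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  = 0.299 * r + 0.587 * g + 0.114 * b      # luma
    cb = (b - y) / 1.772                        # scaled blue difference (B' - Y')
    cr = (r - y) / 1.402                        # scaled red difference  (R' - Y')
    return np.stack([y, cb, cr], axis=-1)

# Example: pure red, green, blue, and white pixels.
pixels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 1.0, 1.0]])
print(rgb_to_ycbcr_601(pixels))
```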

Different color spaces cover different ranges of colors. The term "gamut" is used to represent the set of possible colors within a color system. Currently, most video displays are sRGB gamut-limited displays, while still image systems widely use displays with the sYCC color space. Recently, various kinds of extended-gamut displays have been emerging and are used for displaying still images. Users increasingly enjoy wide-gamut displays. In video signals, there are many unused regions that could store wide-gamut colors. Therefore, a new color space standard, IEC 61966-2-4, has recently been proposed for video displays [IEC 61966-2-4]. In IEC 61966-2-4, the extended-gamut color space for video applications, the xvYCC color space, has been proposed. The xvYCC is compatible with currently used video signals. It uses the same definition for the inside of the sRGB gamut, and no change is necessary for conventional contents. However, it adds an unambiguous definition to the currently undefined or out-of-sRGB-gamut regions. The xvYCC has 100% coverage, whereas sRGB has only 55%, as can be seen in Figures 15.2 and 15.3 [katoh 2005].


FIGURE 15.2 (See color insert following page 288.) Two-dimensional (2-D) view of the xvYCC gamut: the inner region corresponds to the sRGB/sYCC gamut (0 < R', G', B' < 1, the BT.709 gamut), and the extended regions (R', G', B' < 0 and 1 < R', G', B') carry the out-of-gamut colors.

15.3.2 Progressive and Interlaced Video Signals

Currently, most video signals generated by a TV camera are interlaced. These video signals are represented at 30 frames/s for an NTSC system. Each frame consists of two fields, the top field and the bottom field, which are 1/60 of a second apart. In the display of an interlaced frame, the top field is scanned first and the bottom field is scanned next. The top and bottom fields are composed of alternating lines of the interlaced frame. Progressive video does not consist of fields, only frames. In an NTSC system, these frames are spaced 1/30 s apart. In contrast to interlaced video, every line within the frame is successively scanned. An example of progressive and interlaced video is shown in Figure 15.4.

FIGURE 15.3 (See color insert following page 288.) Gamut coverage of the sRGB, sYCC, and xvYCC color spaces over the Munsell color cascade (769 colors, provided by Dr. Pointer and measured by NPL, UK); xvYCC reaches a cover ratio of 100%.


FIGURE 15.4 (See color insert following page 288.) (a) An example of progressive video (the full frame is scanned in 1/60 s) and (b) an example of interlaced video (odd field and even field, 1/60 s each; full frame in 1/30 s).

15.3.3 Video Formats Used by Video Industry

15.3.3.1 ITU-R

According to ITU-R 601 (earlier, ITU-R was CCIR), a color video source has three components: a luminance component (Y) and two color-difference or chrominance components (Cb and Cr, or U and V in some documents). The CCIR format has two options: one for the NTSC TV system and another for the PAL TV system; both are interlaced. The NTSC format uses 525 lines per frame at 30 frames/s. The luminance frames of this format have 720 × 480 active pixels. The chrominance frames have two kinds of formats: one has 360 × 480 active pixels and is referred to as the 4:2:2 format, whereas the other has 360 × 240 active pixels and is referred to as the 4:2:0 format. The PAL format uses 625 lines per frame at 25 frames/s. Its luminance frame has 720 × 576 active pixels per frame, and the chrominance frame has 360 × 576 active pixels per frame for the 4:2:2 format and 360 × 288 pixels per frame for the 4:2:0 format, both at 25 frames/s.

The a:b:c notation for sampling ratios, as found in the ITU-R BT.601 [ITU-R BT.601] specifications, has the following meaning:

4:2:2 means 2:1 horizontal downsampling, no vertical downsampling. (Think 4 Ysamples for every 2 Cb and 2 Cr samples in a scanline.)

4:2:0 means 2:1 horizontal and 2:1 vertical downsampling. (Think of 4 Y samples for every one Cb and one Cr sample.)
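A minimal sketch of how a chrominance plane could be 4:2:0 subsampled by 2 x 2 averaging; the standards do not mandate this particular (box) filter, so it is shown only to make the sampling ratios concrete.

```python
import numpy as np

def subsample_420(chroma):
    """Downsample a chrominance plane 2:1 horizontally and 2:1 vertically
    (4:2:0) by averaging each 2 x 2 block.  Height and width are assumed
    to be even.  Simple box filtering is used here for illustration;
    practical encoders may use better anti-aliasing filters."""
    h, w = chroma.shape
    blocks = chroma.reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

# Example: a 4 x 4 chroma plane becomes 2 x 2.
cb = np.arange(16, dtype=float).reshape(4, 4)
print(subsample_420(cb))
```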

15.3.3.2 Source Input Format

Source input format (SIF) has a luminance resolution of 360 × 240 pixels per frame at 30 frames/s or 360 × 288 pixels per frame at 25 frames/s. For both cases, the resolution of the


chrominance components is half the luminance resolution in both horizontal and verticaldimensions. SIF can easily be obtained from a CCIR format using an appropriate anti-aliasing filter followed by subsampling.

15.3.3.3 Common Intermediate Format

Common intermediate format (CIF) is a noninterlaced format. Its luminance resolution is 352 × 288 pixels per frame at 30 frames/s, and the chrominance has half the luminance resolution in both vertical and horizontal dimensions. Because its line value, 288, represents half the active lines in the PAL television signal, and its picture rate, 30 frames/s, is the same as the NTSC television signal, it is a common intermediate format for both PAL or PAL-like systems and NTSC systems. In the NTSC systems, only a line number conversion is needed, whereas in the PAL or PAL-like systems only a picture rate conversion is needed. For low bit rate applications, the quarter-SIF (QSIF) or quarter-CIF (QCIF) may be used because these formats have only a quarter of the number of pixels of the SIF and CIF formats, respectively.

15.3.3.4 ATSC Digital Television Format

The concept of digital television consists of SDTV (standard-definition television) and HDTV. Recently, in the United States, the FCC (Federal Communications Commission) approved the ATSC recommended DTV standard [atsc 1995]. The DTV picture format is not mandated in the standard owing to the divergent opinions of TV and computer manufacturers. Rather, it has been agreed that the picture format will be decided by the future market. The ATSC recommended DTV formats include two kinds: SDTV and HDTV. The ATSC DTV standard includes the following 18 formats:

For HDTV: 1920 × 1080 pixels at 23.976/24 Hz progressive scan, 29.97/30 Hz interlaced scan, and 59.94/60 Hz progressive scan; 1280 × 720 pixels at 24, 30, and 60 Hz progressive scan.

For SDTV: 704 × 480 pixels with 4:3 aspect ratio at 23.976/24, 29.97/30, and 59.94/60 Hz progressive scan and 30 Hz interlaced scan; 704 × 480 pixels with 16:9 aspect ratio at 23.976/24, 29.97/30, and 59.94/60 Hz progressive scan and 30 Hz interlaced scan; and 640 × 480 pixels with 4:3 aspect ratio at 23.976/24, 29.97/30, and 59.94/60 Hz progressive scan and 30 Hz interlaced scan.

It is noted that all HDTV formats use square pixels and only part of the SDTV formats use square pixels. The aspect ratio is the ratio of picture width to picture height; for square pixels it equals the ratio of the number of pixels per line to the number of lines per frame.

15.4 Current Status of Digital Video/Image Coding Standards

The fast growth of digital transmission services has generated a great deal of interest in the digital transmission of video signals. Some digitized video source signals require very high bit rates, ranging from more than 100 Mbits/s for broadcast-quality video to more than 1 Gbit/s for HDTV signals. Owing to this, video compression algorithms, which reduce the bit rates to an affordable level on practical communication channels, are required. Digital video coding techniques have been investigated over several decades. There are two factors that make video compression possible: the statistical structure of the data in the video source and the psychophysical redundancy of human vision. Video compression algorithms can remove the spatial and temporal correlation that is normally present in the video source. In addition, human observers are subject to perceptual limitations in amplitude, spatial resolution, and temporal acuity.


TABLE 15.1
List of Some Organizations for Standardization

Organization   Full Name of Organization
ITU            International Telecommunication Union
JPEG           Joint Photographic Experts Group
MPEG           Moving Picture Experts Group
ISO            International Standards Organization
IEC            International Electrotechnical Commission

By proper design of the coding system, it is possible to discard information without affecting perceived image quality, or at least with only minimal degradation.

Several traditional techniques have been developed for image and video data compression. Recently, with advances in data compression and VLSI techniques, these compression techniques have been extensively applied to video signal compression. Video compression techniques have been under development for over 20 years and have recently emerged as the core enabling technology for a new generation of DTV (both SDTV and HDTV) and multimedia applications. Digital video systems currently being implemented (or under active consideration) include terrestrial broadcasting of digital HDTV in the United States [atsc 1993], satellite DBS (Direct Broadcasting System) [isnardi 1993], computer multimedia [ada 1993], and video via packet networks [verbiest 1989]. In response to the needs of these emerging markets for digital video, several national and worldwide standards activities have been started over the last few years. The organizations involved include ISO (International Standards Organization), ITU (International Telecommunication Union, formerly known as CCITT, the International Telegraph and Telephone Consultative Committee), JPEG (Joint Photographic Experts Group), and MPEG (Moving Picture Experts Group), as shown in Table 15.1. The related standards include the JPEG standards, the MPEG-1, -2, and -4 standards, and the H.261 and H.263 video teleconferencing coding standards, as shown in Table 15.2. It should be noted that the JPEG standards are usually used for still image coding, but they can also be used in video coding. Although the coding efficiency is lower, they have been shown to be useful in some applications,

TABLE 15.2
Video/Image Coding Standards

Name                          Year of Completion   Major Features
JPEG                          1992                 For still image coding, DCT based
JPEG2000                      2000                 For still image coding, DWT based
H.261                         1990                 For videoconferencing, 64 kbits/s to 1.92 Mbits/s
MPEG-1                        1991                 For CD-ROM, 1.5 Mbits/s
MPEG-2 (H.262)                1994                 For DTV/DVD, 2-15 Mbits/s; for ATSC HDTV, 19.2 Mbits/s; most extensively used
H.263                         1995                 For very low bit rate coding, below 64 kbits/s
MPEG-4 Part 2                 1999                 For multimedia, content-based coding; its simple profile and advanced simple profile are applied to mobile video and streaming
H.264/AVC (MPEG-4 Part 10)    2005                 For many applications, with significantly improved coding performance over MPEG-2 and MPEG-4 Part 2
VC-1                          2005                 For many applications, coding performance close to H.264
RealVideo                     1997                 For many applications, coding performance similar to MPEG-4 Part 2
MPEG-7                        2000                 Content description and indexing
MPEG-21                       2002                 Multimedia framework


e.g., studio editing systems. Though the JPEG standards (discussed in Chapters 7 and 8) are not video coding standards, we include them here to give a full picture of all international image and video coding standards.

15.4.1 JPEG Standard

Since the mid-1980s, the ITU and ISO have been working together to develop a joint international standard for the compression of still images. Officially, JPEG [jpeg] is the ISO/IEC international standard 10918-1, "Digital compression and coding of continuous-tone still images," or the ITU-T recommendation T.81. JPEG became an international standard in 1992. JPEG is a DCT-based coding algorithm. The committee continues to work on future enhancements, which may adopt wavelet-based algorithms.

15.4.2 JPEG2000

JPEG2000 [jpeg2000] is a new type of image coding system developed by JPEG for still image coding. JPEG2000 uses the wavelet transform as its core technique. This is because the wavelet transform can provide not only excellent coding efficiency but also excellent spatial and quality scalability. This standard is intended to meet the need for image compression with great flexibility and efficient interchangeability. It is also intended to offer unprecedented access into the image while still in the compressed domain. Thus, an image can be accessed, manipulated, edited, transmitted, and stored in a compressed form.

15.4.3 MPEG-1

In 1988, ISO established MPEG to develop standards for the coded representation of moving pictures and associated audio information for digital storage applications. MPEG completed the first phase of its work in 1991. It is known as MPEG-1 [mpeg1] or ISO standard 11172, "Coding of moving picture and associated audio." The target application for this specification is digital storage media at bit rates up to about 1.5 Mbits/s.

15.4.4 MPEG-2

MPEG started its second phase of work, MPEG-2 [mpeg2], in 1990. MPEG-2 is an extensionof MPEG-1 that allows for greater input-format flexibility, higher data rate for SDTV orHDTV applications, and better error resilience. This work resulted in the ISO standard13818 or ITU-T Recommendation H.262, ‘‘Generic coding of moving pictures and associ-ated audio.’’

15.4.5 MPEG-4 Part 2

The MPEG-4 Part 2 Visual standard [mpeg4] was approved in 1999. MPEG-4 Part 2 Visual supports object-based coding technology, and it aims to provide enabling technology for a variety of functionalities and multimedia applications:

Universal accessibility and robustness in error-prone environments

High interactive functionality

Coding of natural and synthetic data or both

Compression efficiency


15.4.6 H.261

H.261 [h261] was adopted in 1990 and the final revision was approved in 1993 by the ITU-T. It is designed for video teleconferencing and utilizes a DCT-based motion compensation scheme. The target bit rates are from 64 to 1920 kbits/s.

15.4.7 H.263, H.263 Version 2 (H.263+), H.263++, and H.26L

The H.263 [h263] video coding standard is specifically designed for very low bit rate applications such as videoconferencing. Its technical content was completed in late 1995 and the standard was approved in early 1996. It is based on the H.261 standard with several added features: unrestricted motion vectors, syntax-based arithmetic coding, advanced prediction, and PB-frames. The H.263 version 2 video coding standard, also known as H.263+, was approved in January 1998 by the ITU-T. H.263+ includes a number of new optional features based on H.263. These new optional features are added to provide improved coding efficiency, a flexible video format, scalability, and backward-compatible supplemental enhancement information. H.263++ is the extension of H.263+ and was completed in the year 2000. H.26L was a long-term project looking for more efficient video coding algorithms. Finally, the activity of H.26L ended because the joint video team (JVT) of MPEG and ITU-T VCEG developed a new video coding standard, H.264, which has greatly improved coding efficiency over MPEG-2 and H.263.

15.4.8 MPEG-4 Part 10 Advanced Video Coding or H.264/AVC

Recently, the JVT of MPEG and ITU-T VCEG developed the new video coding standard H.264/AVC [h264]. Because many new tools have been used, H.264/AVC has achieved higher coding efficiency, roughly twice that of MPEG-2. Detailed information is introduced in Chapter 20.

15.4.9 VC-1

VC-1 is a video codec developed by Microsoft and later standardized by SMPTE (Society of Motion Picture and Television Engineers). It is implemented by Microsoft as Windows Media Video (WMV) 9. Its coding performance is close to that of H.264/AVC.

15.4.10 RealVideo

RealVideo is a video codec developed by RealNetworks. It was first released in 1997 and its version 10 was released in 2006. RealVideo is supported on many platforms, including Windows, Mac, Linux, Solaris, and several mobile phones. Its coding performance is close to that of MPEG-4 Part 2.

The above organizations and standards are summarized in Tables 15.1 and 15.2, respectively.

It should be noted that MPEG-7 [mpeg-7] and MPEG-21 [mpeg-21] in Table 15.2 are not coding standards; MPEG-7 is a multimedia content description standard, which can be used for fast indexing and searching of multimedia content, and MPEG-21 is a multimedia framework, which aims at defining an open framework for multimedia applications. VC-1 is an SMPTE standard, and RealVideo is not an international standard, but it is extensively supported by many platforms.

It is also interesting to note that in terms of video compression methods, there is a growing convergence towards motion compensated (MC), interframe DCT algorithms


represented by the video coding standards. However, wavelet-based coding techniques have found recent success in the compression of still images in both the JPEG2000 and MPEG-4 standards. This is because the wavelet transform possesses unique features in terms of high coding efficiency and excellent spatial and quality scalability. The wavelet transform has not been successfully applied to video coding owing to several difficulties. First, it is not clear how the temporal redundancy can be removed in this domain. Motion compensation is an effective technique for DCT-based video coding schemes; however, it is not so effective for wavelet-based video coding. This is because the wavelet transform uses a large block size or the full frame, but motion compensation is usually performed on a limited block size. This mismatch reduces the interframe coding efficiency. Many engineers and researchers are working on these problems.

Among these standards, MPEG-2 has had a great impact on the consumer electronics industry because DVD (digital versatile disk) and DTV have adopted it as the core technology. However, the recently developed coding standards are attracting many applications, including HD DVD, mobile TV, and others.

15.5 Summary

In this chapter, several fundamental issues of digital video coding have been presented. These include the representation and rate distortion function R(D) of digital video signals and the various video formats that are widely used by the video industry. Finally, existing and emerging video coding standards have been briefly introduced.

Exercises

1. Suppose that we have a one-dimensional (1-D) digital array (it can be extended to a 2-D array, such as an image), f(i) = X_i (i = 0, 1, 2, . . .). We use a first-order linear predictor to predict the current component value from the previous component, \hat{X}_i = a X_{i-1} + b, where a and b are the two parameters of this linear predictor. If we want to minimize the mean squared error (MSE) of the prediction, E\{(X_i - \hat{X}_i)^2\}, what values of a and b should we choose? Assume that E\{X_i\} = \mu, E\{X_i^2\} = \sigma^2, and E\{X_i X_{i-1}\} = \rho (for i = 0, 1, 2, . . .), where \mu, \sigma, and \rho are constants.

2. For a 128 × 128 or 256 × 256 digital image, write a program to use the two 3 × 3 operators (the Sobel operators)

   [ -1  -2  -1 ]        [ -1   0   1 ]
   [  0   0   0 ]  and   [ -2   0   2 ]
   [  1   2   1 ]        [ -1   0   1 ]

   to filter the image, separately. Discuss the resulting images. What will be the result if both operators are used?

3. The convolution of 2-D arrays is defined as

   y(m, n) = \sum_{k=-\infty}^{+\infty} \sum_{l=-\infty}^{+\infty} x(k, l) h(m - k, n - l)


and

   x = [ 1  4  1          h = [ 1   1
         2  5  3 ],             1  -1 ]

Calculate the convolution y(m, n). If h(m, n) is changed to

   [  0  -1   0
     -1   4  -1
      0  -1   0 ],

recalculate y(m, n).

4. The entropy of an image source is defined as

   H = -\sum_{k=1}^{M} p_k \log_2 p_k,

under the assumption that each pixel is an independent random variable. If the image is a binary image, i.e., M = 2, then the probabilities satisfy p_1 + p_2 = 1. If we define p_1 = p, then p_2 = 1 - p (0 \leq p \leq 1). The entropy can be rewritten as

   H = -p \log_2 p - (1 - p) \log_2 (1 - p).

Find several digital binary images and compute their entropies. If one image has an almost equal number of "0" and "1" pixels and the other has very different numbers of "0" and "1" pixels, which image has the larger entropy? Prove that the entropy of a binary source is maximum when the numbers of "0" and "1" are equal.

5. A transformation defined as y = f(x) is applied to a 256 × 256 digital image, where x is the original pixel value and y is the transformed pixel value. Obtain new images for the cases where (1) f is a linear function, (2) f is a logarithm, and (3) f is a square function; compare the results and indicate subjective differences between the resulting images. Repeat the experiments for different images and draw conclusions about the possible use of this procedure in image processing applications.

References

[IEC 61966-2-1] Multimedia systems and equipment - Colour measurement and management - Part 2-1: Colour management - Default RGB colour space - sRGB, December 7, 1999.
[IEC 61966-2-4] Multimedia systems and equipment - Colour measurement and management - Part 2-4: Colour management - Extended-gamut YCC colour space for video applications - xvYCC, January 2006.
[ITU-R BT.601] International Telecommunications Union, ITU-R BT.601, 1987.
[ITU-R BT.709] International Telecommunications Union, ITU-R BT.709, 1990.
[ada 1993] J.A. Ada, Interactive multimedia, IEEE Spectrum, March 1993, 22–31.
[atsc 1995] ATSC Digital Television Standard, Doc. A/53, September 16, 1995.
[berger 1971] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice-Hall, Englewood Cliffs, NJ, 1971.
[h261] ITU-T Rec. H.261, Video codec for audiovisual services at p x 64 kbit/s, March 1995.
[h263] ITU-T Rec. H.263, Video coding for low bit rate communication, May 2, 1996.
[h264] ITU-T Rec. H.264 / ISO/IEC 11496-10, Advanced video coding for generic audiovisual services, February 28, 2005.
[isnardi 1993] M. Isnardi, Consumers seek easy to use products, IEEE Spectrum, January 1993, 64.
[jnd] www.sarnoff.com/tech_realworld/broadcast/jnd/index.html.
[jpeg] ISO/IEC IS 10918-1, ITU-T Rec. T.81, 1992.
[jpeg 2000] ISO/IEC 15444-1, ITU-T Rec. T.800, Information technology - JPEG 2000 image coding system: Core coding system, 2000.
[katoh 2005] N. Katoh and Y. Simpuku, The extended-gamut color space for video application - xvYCC color space, attachment of proposal for the MPEG Hong Kong meeting 2005 on IEC 61966-2-4, Color Rendering Community, Sony Corporation, 2005.
[modestino 1981] J.W. Modestino, D.G. Daut, and A.L. Vickers, Combined source-channel coding of images using the block cosine transform, IEEE Transactions on Communications, COM-29, 1262–1274, September 1981.
[mpeg1] ISO/IEC JTC1 IS 11172, Coding of moving pictures and associated audio for digital storage media up to 1.5 Mbps, November 1992.
[mpeg2] ISO/IEC JTC1 IS 13818, Generic coding of moving pictures and associated audio, November 1994.
[mpeg4] ISO/IEC JTC1 FDIS 14496-2, Information technology - Generic coding of audio-visual objects, November 19, 1998.
[mpeg7] MPEG-7 Overview v.8, ISO/MPEG N4980, Klagenfurt, Austria, July 2002.
[mpeg21] MPEG-21 Overview v.5, ISO/IEC JTC1/SC29/WG11/N5231, October 2002.
[oppenheim 1989] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1989.
[sayood 1991] K. Sayood and J.C. Borkenhagen, Use of residual redundancy in the design of joint source/channel coders, IEEE Transactions on Communications, 39, 838–846, June 1991.
[shannon 1948] C.E. Shannon, A mathematical theory of communication, Bell System Technical Journal, 27, 379–423, 623–656, 1948.
[verbiest 1989] W. Verbiest and L. Pinnoo, A variable bit rate video codec for asynchronous transfer mode networks, IEEE Journal on Selected Areas in Communications, 7, 5, 761–770, 1989.
[viterbi 1979] A.J. Viterbi and J.K. Omura, Principles of Digital Communication and Coding, McGraw-Hill, New York, 1979.


16 Digital Video Coding Standards: MPEG-1/2 Video

In this chapter, we introduce the ISO/IEC digital video coding standards MPEG-1 [mpeg1] and MPEG-2 [mpeg2], which are extensively used in the video industry for television broadcast, visual communications, and multimedia applications.

16.1 Introduction

As we know, MPEG has successfully developed two standards, MPEG-1 and MPEG-2. The MPEG-1 video standard was completed in 1991 with the development of the ISO/IEC specification 11172, which is the standard for coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbits/s. To support a wide range of application profiles, the user can specify a set of input parameters, including flexible picture size and frame rate. MPEG-1 was developed for multimedia CD-ROM applications. Important features provided by MPEG-1 include frame-based random access of video, fast forward/fast backward searches through compressed bitstreams, reverse playback of video, and editability of the compressed bitstream. MPEG-2 is formally referred to as ISO/IEC specification 13818, which is the second phase of the MPEG video coding solution for applications not originally covered by the MPEG-1 standard. Specifically, MPEG-2 was developed to provide video quality not lower than NTSC/PAL (National Television Systems Committee/phase alternating line) and up to high definition television (HDTV) quality. The MPEG-2 standard was completed in 1994. Its target bit rates for NTSC/PAL are about 2-15 Mbits/s, and it is optimized at about 4 Mbits/s. The bit rates used for HDTV signals are about 19 Mbits/s. In general, MPEG-2 can be seen as a superset of the MPEG-1 coding standard and is backward compatible with the MPEG-1 standard. In other words, every MPEG-2 compatible decoder is able to decode a compliant MPEG-1 bitstream.

In this chapter, we briefly introduce the standard itself. As many books and publications exist that explain the standards [haskell 1997; mitchel 1997], we pay more attention to the utility of the standard, how the standard is used, and some interesting research topics that have emerged. In other words, the standards provide the knowledge of how to design decoders that are able to successfully decode compliant MPEG bitstreams. But the standards do not specify the means of generating these bitstreams. For instance, given some bit rate, how can one generate a bitstream that provides the best picture quality? To answer this, one needs to understand the encoding process, which is an informative part of the standard (referred to as the Test Model), but it is very important for content and service providers. In this chapter, the issues related to the encoding process are described. The main contents include the following topics: preprocessing, motion compensation, rate control, statistical multiplexing (StatMux) of multiple programs, and optimal mode decision. Some of the sections contain the authors' own research results.


These research results are useful in providing examples for the readers to understand how the standard is used.

16.2 Features of MPEG-1/2 Video Coding

It should be noted that MPEG-2 video coding has the feature of being backward compatible with MPEG-1. It turns out that most of the decoders in the market are MPEG-2 compliant decoders. For simplicity, we introduce the technical detail of MPEG-1 and then describe the enhanced features of MPEG-2, which MPEG-1 does not have.

16.2.1 MPEG-1 Features

16.2.1.1 Introduction

The algorithms employed by MPEG-1 do not provide a lossless coding scheme. However, the standard can support a variety of input formats and be applied to a wide range of applications. As we know, the main purpose of MPEG-1 video is to code moving image sequences or video signals. To achieve a high compression ratio, both intraframe and interframe redundancies should be exploited. This implies that it would not be efficient to code the video signal with an intraframe coding scheme alone, such as JPEG. On the other hand, to satisfy the requirement of random access, we have to use intraframe coding from time to time. Therefore, the MPEG-1 video algorithm is mainly based on discrete cosine transform (DCT) coding and interframe motion compensation. The DCT coding is used to remove the intraframe redundancy and the motion compensation is used to remove the interframe redundancy. With regard to input picture format, MPEG-1 allows progressive pictures only, but offers great flexibility in the size, up to 4095 × 4095 pixels. However, the coder itself is optimized for the extensively used video SIF picture format. The SIF is a simple derivative of the ITU-R 601 video format for digital television applications. According to ITU-R 601, a color video source has three components, a luminance component (Y) and two chrominance components (Cb and Cr), which are in the 4:2:0 subsampling format. Note that the 4:2:0 and 4:2:2 color formats were described in Chapter 15.

16.2.1.2 Layered Structure Based on Group of Pictures

The MPEG coding algorithm is a full motion-compensated DCT and DPCM hybrid coding algorithm. In MPEG coding, the video sequence is first divided into groups of pictures (GOPs), as shown in Figure 16.1. Each GOP may include three types of pictures

FIGURE 16.1 A group of pictures (GOP) of a video sequence in display order (N = 9, M = 3): the picture pattern is I B B P B B P B B I, with forward motion compensation from anchor to anchor and bidirectional motion compensation for the B-pictures.


or frames: intracoded (I) pictures or frames, predictive-coded (P) pictures or frames, and bidirectionally predictive-coded (B) pictures or frames. I-pictures are coded by intraframe techniques only, without need for previous information. In other words, I-pictures are self-sufficient. They are used as anchors for forward and backward prediction. P-pictures are coded using one-directional motion-compensated (MC) prediction from a previous anchor frame, which could be either an I- or a P-picture. The distance between the two nearest I-frames is denoted by N, which is the size of the GOP. The distance between the two nearest anchor frames is denoted by M. Both N and M are user-selectable parameters chosen during encoding. Larger values of N and M improve the coding performance but cause error propagation or drift. Usually, N is chosen from 12 to 15 and M from 1 to 3. If M is selected to be 1, then no B-picture will be used. Lastly, B-pictures can be coded using predictions from past or future anchor frames (I or P), or both. Regardless of the type of frame, each frame may be divided into slices; each slice consists of several macroblocks (MBs). There is no rule for deciding the slice size. A slice could contain all MBs in a row of a frame or all MBs of a frame. A smaller slice size is favorable for the purpose of error resilience, but decreases coding performance owing to higher overhead. An MB contains a 16 × 16 Y component and the spatially corresponding 8 × 8 Cb and Cr components. An MB has four luminance blocks and two chrominance blocks (for the 4:2:0 sampling format), and the MB is also the basic unit of adaptive quantization and motion compensation. Each block contains 8 × 8 pixels over which the DCT operation is performed.

To exploit the temporal redundancy in the video sequence, the motion vector for each MB is estimated from two original luminance pictures using a block matching algorithm. The criterion for the best match between the current MB and an MB in the anchor frame is the minimum mean absolute error (MAE). Once the motion vector for each MB is estimated, pixel values for the target MB can be predicted from the previously decoded frame. All MBs in an I-frame are coded in intramode with no motion compensation. MBs in P- and B-frames can be coded in several modes. Among the modes are intracoded and intercoded with motion compensation. This decision is made by mode selection. Most encoders depend on the values of the prediction differences to make this decision. Within each slice, the values of the motion vectors and the DC values of each MB are coded using DPCM. The detailed specifications of this coding can be found in the document produced by the MPEG video committee [mpeg2]. The structure of MPEG implies that if an error occurs within I-frame data, it will be propagated through all frames in the GOP. Similarly, an error in a P-frame will affect the related P- and B-frames, while B-frame errors will be isolated.
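The block matching step just described can be sketched as an exhaustive (full) search that minimizes the mean absolute error between the current MB and candidate MBs in the anchor frame; the block size and search range below are illustrative choices, and practical encoders use faster search strategies.

```python
import numpy as np

def full_search(cur, ref, top, left, block=16, search=7):
    """Exhaustive block matching for the 'block' x 'block' MB whose top-left
    corner is (top, left) in the current luminance frame 'cur'.  Candidate
    displacements within +/- 'search' pixels in the reference frame 'ref'
    are compared using the mean absolute error (MAE) criterion.
    Returns the best (dy, dx) motion vector and its MAE."""
    target = cur[top:top + block, left:left + block].astype(float)
    best_mv, best_mae = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + block > ref.shape[0] or x + block > ref.shape[1]:
                continue                       # candidate falls outside the frame
            cand = ref[y:y + block, x:x + block].astype(float)
            mae = np.mean(np.abs(target - cand))
            if mae < best_mae:
                best_mae, best_mv = mae, (dy, dx)
    return best_mv, best_mae

# Synthetic example: the current frame is the reference shifted by a few pixels.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, (64, 64))
cur = np.roll(ref, shift=(2, -3), axis=(0, 1))
print(full_search(cur, ref, top=24, left=24))   # expected motion vector (-2, 3), MAE 0
```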

16.2.1.3 Encoder Structure

The typical MPEG-1 video encoder structure is shown in Figure 16.2. It should be noted that when B-pictures are used, as shown in Figure 16.1, two frame memories are needed for bidirectional prediction. However, the encoding order is different from the display order; the input sequence has to be reordered for encoding. For example, if we choose the GOP size (N) to be 12 and the distance between the two nearest anchor frames (M) to be 3, the display order and encoding order are as shown in Table 16.1.

It should be noted that in the encoding order, or in the bitstream, the first frame in a GOP is always an I-picture. In the display order, the first frame can be either the I-picture or the first B-picture of the consecutive series of B-pictures that immediately precedes the first I-picture, and the last picture in a GOP is an anchor picture, either an I- or a P-picture. The first GOP always starts with an I-picture and, as a consequence, this GOP will have fewer B-pictures than the other GOPs.


FIGURE 16.2 Typical MPEG-1 encoder structure: the resequenced input is transformed by the DCT and block quantizer and sent to the VLC encoder, while a block dequantizer, IDCT, and two frame memories feed the motion estimation processor and the motion-compensated prediction loop; the motion vectors are also sent to the VLC encoder.

The MPEG-1 video compression technique uses motion compensation to remove the interframe redundancy. The concept of motion compensation is based on the estimation of motion between video frames. The fundamental model used assumes that the motion of a block can be approximated by a translational motion. If all elements in a video scene are approximately spatially displaced, the motion between frames can be described by a limited number of motion parameters. In other words, the motion can be described by motion vectors for the translatory motion of pixels. Because the spatial correlation between adjacent pixels is usually very high, it is not necessary to transmit motion information for each coded image pixel. This would be too expensive and the coder would never be able to reach a high compression ratio. The MPEG video uses the MB structure for motion compensation, i.e., for each 16 × 16 MB, only one or sometimes two motion vectors are transmitted. The motion vectors for any block are found within a search window that can be up to 512 pixels in each direction. Also, the matching can be done at half-pixel accuracy, where the half-pixel values are computed by averaging the full-pixel values, as shown in Figure 16.3.
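A small sketch of the half-pixel averaging mentioned above: values at half-pel positions are taken as averages of the neighboring full-pel values (the exact rounding rules of the standard are omitted here).

```python
import numpy as np

def half_pel_value(frame, y, x):
    """Return the luminance value at a (possibly half-pel) position (y, x),
    where y and x are in units of half pixels (e.g., x = 2*col + 1 lies
    halfway between two columns).  Half-pel values are the average of the
    surrounding full-pel values, as in Figure 16.3; the rounding details of
    the standard are omitted for clarity."""
    y0, x0 = y // 2, x // 2
    y_frac, x_frac = y % 2, x % 2
    a = frame[y0, x0]
    b = frame[y0, x0 + x_frac]
    c = frame[y0 + y_frac, x0]
    d = frame[y0 + y_frac, x0 + x_frac]
    return (float(a) + b + c + d) / 4.0

frame = np.array([[10, 20], [30, 40]], dtype=float)
print(half_pel_value(frame, 0, 1))   # horizontal half-pel: (10 + 20) / 2 = 15.0
print(half_pel_value(frame, 1, 1))   # diagonal half-pel: (10 + 20 + 30 + 40) / 4 = 25.0
```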

For interframe coding, the prediction differences, or error images, are coded and transmitted with the motion information. A two-dimensional (2-D) DCT is used for coding both the intraframe pixels and the predictive error pixels. The image to be coded is first partitioned into 8 × 8 blocks. Each 8 × 8 pixel block is then subject to an 8 × 8 DCT, resulting in a frequency domain representation of the block, as shown in Figure 16.4.
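A self-contained sketch of the 8 × 8 forward 2-D DCT used here, built from the orthonormal DCT-II basis matrix (the transform is C B C^T); it reproduces the kind of pixel-to-coefficient mapping illustrated in Figure 16.4, with arbitrary example data.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix C, so that the 2-D transform of an
    n x n block B is C @ B @ C.T and the inverse is C.T @ X @ C."""
    k = np.arange(n).reshape(-1, 1)          # frequency index
    i = np.arange(n).reshape(1, -1)          # sample index
    C = np.cos((2 * i + 1) * k * np.pi / (2 * n))
    C[0, :] *= 1.0 / np.sqrt(2.0)
    return C * np.sqrt(2.0 / n)

def dct2(block):
    """Forward n x n 2-D DCT of a pixel block."""
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T

# A flat block concentrates all of its energy in the DC coefficient.
block = np.full((8, 8), 128.0)
coeff = dct2(block)
print(round(coeff[0, 0], 1))   # DC = 8 * 128 = 1024.0; all AC coefficients are ~0
```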

The goal of the transformation is to decorrelate the block data so that the resulting transform coefficients can be coded more efficiently. The transform coefficients are then quantized. During the process of quantization, a weighted quantization matrix is used. The function of the quantization matrix is to quantize high frequencies with coarser quantization steps, which suppresses the high frequencies with no subjective degradation, thus taking advantage of human visual perception characteristics. The bits saved by coding the high frequencies coarsely are used for the lower frequencies to obtain better subjective coded images.

TABLE 16.1

Display Order and Encoding Order

Display order 0 1 2 3 4 5 6 7 8 9 10 11 12

Encoding order 0 3 1 2 6 4 5 9 7 8 12 10 11

Coding type I P B B P B B P B B I B B
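The display-to-encoding reordering of Table 16.1 can be sketched as follows: each anchor picture (I or P) is emitted before the B-pictures that precede it in display order. The helper below is illustrative, not part of the standard, and reproduces the table for N = 12, M = 3.

```python
def encoding_order(picture_types):
    """Given picture types in display order (e.g., 'IBBPBBP...'),
    return the frame indices in encoding (bitstream) order:
    each anchor (I or P) precedes the B-pictures that come before it
    in display order."""
    order, pending_b = [], []
    for idx, ptype in enumerate(picture_types):
        if ptype == 'B':
            pending_b.append(idx)        # hold B-pictures until their future anchor is sent
        else:
            order.append(idx)            # emit the anchor first
            order.extend(pending_b)      # then the B-pictures predicted from it
            pending_b = []
    return order + pending_b             # any trailing B-pictures

# Display order for N = 12, M = 3 (frames 0..12), as in Table 16.1.
display_types = "IBBPBBPBBPBBI"
print(encoding_order(display_types))
# -> [0, 3, 1, 2, 6, 4, 5, 9, 7, 8, 12, 10, 11]
```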


FIGURE 16.3 Half-pixel locations in motion compensation: the half-pixel positions lie midway between the full-pixel locations, horizontally, vertically, and diagonally.

There are two quantizer weighting matrices in Test Model 5 (TM5) [tm5]: an intra quantizer weighting matrix and a non-intra quantizer weighting matrix; the latter is flatter because the energy of the coefficients in interframe coding is more uniformly distributed than in intraframe coding.

In intra MBs, the DC value, dc, is an 11 bit value before quantization, and it will be quantized to 8, 9, or 10 bits according to the setting of the parameter. Thus the quantized DC (QDC) value is calculated as

QDC(8 bit) = dc // 8,   QDC(9 bit) = dc // 4,   or   QDC(10 bit) = dc // 2    (16.1)

where the symbol // means integer division with rounding to the nearest integer, and half-integer values are rounded away from zero unless otherwise specified. The AC coefficients, ac(i, j), are first quantized by individual quantization factors to the value ac~(i, j):

ac~(i, j) = (16 * ac(i, j)) // W_I(i, j)    (16.2)

where W_I(i, j) is the element at position (i, j) in the intra quantizer weighting matrix shown in Figure 16.5.

The quantized level QAC(i, j) is given by

QAC(i, j) = [ac~(i, j) + sign(ac~(i, j)) * ((p * mquant) // q)] / (2 * mquant)    (16.3)

where mquant is the quantizer scale or step, which is derived for each MB by the rate control algorithm, and p = 3 and q = 4 in TM5 [tm5]. For non-intra MBs,

ac~(i, j) = (16 * ac(i, j)) // W_N(i, j)    (16.4)

where W_N(i, j) is the non-intra quantizer weighting matrix in Figure 16.5, and

QAC(i, j) = ac~(i, j) / (2 * mquant)    (16.5)

An example of encoding an intrablock is shown in Figure 16.6.
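A minimal sketch of the TM5-style quantization of Equations 16.2 through 16.5, using the rounding-division convention defined above; truncation toward zero is assumed for the final division, and the coefficient, weighting value, and mquant in the example are arbitrary.

```python
import numpy as np

def rdiv(a, b):
    """Integer division with rounding to the nearest integer, halves rounded
    away from zero (the '//' operator of Equations 16.1 through 16.5)."""
    q = a / b
    return np.sign(q) * np.floor(np.abs(q) + 0.5)

def quantize_intra_ac(ac, W_I, mquant, p=3, q=4):
    """Equations 16.2 and 16.3 (intra AC quantization, TM5 values p=3, q=4).
    Truncation toward zero is assumed for the final division by 2*mquant."""
    ac_t = rdiv(16.0 * ac, W_I)                                    # Eq. 16.2
    return np.trunc((ac_t + np.sign(ac_t) * rdiv(p * mquant, q))
                    / (2.0 * mquant))                              # Eq. 16.3

def quantize_nonintra_ac(ac, W_N, mquant):
    """Equations 16.4 and 16.5 (non-intra AC quantization)."""
    ac_t = rdiv(16.0 * ac, W_N)                                    # Eq. 16.4
    return np.trunc(ac_t / (2.0 * mquant))                         # Eq. 16.5

# Hypothetical example: one AC coefficient of 51 at a position where
# W_I = 22, quantized with mquant = 8; the numbers are illustrative.
print(quantize_intra_ac(np.array(51.0), np.array(22.0), mquant=8))   # -> 2.0
```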

FIGURE 16.4 Example of an 8 × 8 discrete cosine transform (DCT): an 8 × 8 block of pixel values is transformed into an 8 × 8 block of frequency coefficients ordered from low to high frequency, with most of the energy concentrated in the low-frequency (upper-left) coefficients.


FIGURE 16.5 Quantizer matrices for intra- and non-intracoding.

Intra quantizer weighting matrix:

 8 16 19 22 26 27 29 34
16 16 22 24 27 29 34 37
19 22 26 27 29 34 34 38
22 22 26 27 29 34 37 40
22 26 27 29 32 35 40 48
26 27 29 32 35 40 48 58
26 27 29 34 38 46 56 69
27 29 35 38 46 56 69 83

Non-intra quantizer weighting matrix:

16 17 18 19 20 21 22 23
17 18 19 20 21 22 23 24
18 19 20 21 22 23 24 25
19 20 21 22 23 24 26 27
20 21 22 23 25 26 27 28
21 22 23 24 26 27 28 30
22 23 24 26 27 28 30 31
23 24 25 27 28 30 31 33

The coefficients are processed in zigzag order because the major part of the energy is usually concentrated in the lower-order coefficients. The zigzag ordering of the elements in an 8 × 8 matrix allows for a more efficient run-length coder. This is illustrated in Figure 16.7.

With the zigzag order, the run-length coder converts the quantized frequency coefficients to pairs of zero runs and nonzero coefficients:

34 0 1 0 -1 1 0 0 0 0 0 0 -1 0 0 0 0 . . .

After parsing, we obtain the pairs of zero runs and values:

34 | 0 1 | 0 -1 | 1 | 0 0 0 0 0 0 -1 | 0 0 0 0 . . .

FIGURE 16.6 An example of coding an intrablock: the 8 × 8 block of DCT coefficients from Figure 16.4 is quantized with the intra quantizer weighting matrix (WQ) and the adaptive quantization step (AQ), yielding a sparse block of quantized coefficients whose only nonzero values lie in the low-frequency corner.


FIGURE 16.7 Zigzag scan to get pairs of zero runs and values: the quantized frequency coefficients (34 at DC, with a few ±1 values in the low-frequency corner) are scanned in zigzag order and converted into the sequence 34; (1, 1); (1, -1); (0, 1); (6, -1); end of block.

These pairs of runs and values are then coded by a Huffman-type entropy coder. For example, the VLCs for the above run/value pairs are

Run/Value                 VLC (Variable-Length Code)
34                        (DC coefficient, coded separately)
1, 1                      0110
1, -1                     0111
0, 1                      110
6, -1                     0001011
End of block (EOB)        10

The variable-length code (VLC) tables are obtained by statistically optimizing a large number of training video sequences and are included in the MPEG-2 specification. The same idea is applied to code the DC values, motion vectors, and other information. Therefore, the MPEG video standard contains a number of VLC tables.
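A sketch of the zigzag scan and (run, level) pairing described above, applied to the quantized block of Figures 16.6 and 16.7; the scan-order helper below is the conventional 8 × 8 zigzag, and the DC coefficient is kept separate since it is coded with its own tables.

```python
import numpy as np

def zigzag_indices(n=8):
    """Return the (row, col) coordinates of an n x n block in zigzag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def run_level_pairs(block):
    """Zigzag-scan a quantized block and convert the AC coefficients into
    (zero-run, level) pairs; the trailing zeros are covered by the EOB.
    The DC coefficient is returned separately (it uses its own DPCM/VLC)."""
    scan = [int(block[r, c]) for r, c in zigzag_indices(block.shape[0])]
    dc, ac = scan[0], scan[1:]
    while ac and ac[-1] == 0:        # drop trailing zeros (replaced by EOB)
        ac.pop()
    pairs, run = [], 0
    for level in ac:
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return dc, pairs

# Quantized block from Figures 16.6 and 16.7.
q = np.zeros((8, 8), dtype=int)
q[0, 0], q[0, 2], q[1, 0], q[1, 1], q[2, 2] = 34, 1, 1, -1, -1
print(run_level_pairs(q))   # -> (34, [(1, 1), (1, -1), (0, 1), (6, -1)])
```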

16.2.1.4 Structure of the Compressed Bitstream

After coding, all the information is converted to binary bits. The MPEG video bitstream consists of several well-defined layers with headers and data fields. These layers include sequence, GOP (group of pictures), picture, slice, MB, and block. The important syntax elements contained in each layer are summarized in Table 16.2. The typical structure of the MPEG-1 video compressed bitstream is shown in Figure 16.8. The syntax elements contained in the headers and the number of bits defined for each element can be found in the standard.

At the picture layer, a frame is first partitioned into MBs (16 × 16 for luminance and 8 × 8 for chrominance in the 4:2:0 color representation). The compressed bitstream structure at this layer is shown in Figure 16.9. It is important to note that most elements in the syntax are coded by VLC. The tables of these variable run-length codes (RLCs) are obtained through the simulation of a large number of training video sequences.

16.2.1.5 Decoding Process

The decoding process is the inverse of the encoding procedure. The block diagram of a typical decoder is shown in Figure 16.10. The variable-length decoder (VLD) first decodes the coded data, or video bitstream. This process yields the quantized DCT coefficients and the motion vector data for each MB. The coefficients are inversely scanned and de-quantized.


TABLE 16.2
Summary of Important Syntax of Each Layer

Name of Layer              Important Syntax Elements
Sequence                   Picture size and frame rate; bit rate and buffering requirement; programmable coding parameters
Group of pictures (GOP)    Random access unit; time code
Picture                    Timing information (buffer fullness, temporal reference); coding type (I, P, or B)
Slice                      Intraframe addressing information; coding re-initialization (error resilience)
MB                         Basic coding structure; coding mode; motion vectors; quantization
Block                      Discrete cosine transform (DCT) coefficients

The decoded DCT coefficients are then inverse transformed to obtain the spatial-domain pixels. If the MB was intracoded, these pixels represent the reconstructed values, without any further processing. However, if the MB was intercoded, then motion compensation is performed to add the prediction from the corresponding reference frame(s).

16.2.2 MPEG-2 Enhancements

The basic coding structure of MPEG-2 video is the same as that of MPEG-1 video, i.e.,intraframe and interframe DCT with I-, P-, and B-pictures is used. The most importantfeatures of MPEG-2 video coding include

. Field/frame prediction modes for supporting interlaced video input

. Field=frame DCT coding syntax

. Downloadable quantization matrix and alternative scan order

. Scalability extension

FIGURE 16.8 Description of layered structure of compressed bitstream.


FIGURE 16.9 Picture layer data structure.

The above enhancement items are all coding performance improvements that are related to the support of interlaced material. There are also several non-compression enhancements, which include

. Syntax to facilitate 3:2 pull-down in the decoder

. Pan and scan codes with 1/16 pixel resolution

. Display flags indicating chromaticity, subcarrier amplitude, and phase (for NTSC/PAL/SECAM source material)

In the following, each of these enhancements is introduced.

16.2.2.1 Field/Frame Prediction Mode

In MPEG-1 video, each picture is always coded as a frame structure, whether the original material is progressive or interlaced. If the original sequence is interlaced, each frame consists of two fields: the top field and the bottom field, as shown in Figure 16.11. We can still use frame-based prediction if we consider the two fields as a frame, as shown in Figure 16.11.

In Figure 16.11, three frames are coded as I-, B-, and P-frames, and each frame consists of two fields. The P-frame is predicted from the I-frame with one motion vector.

FIGURE 16.10 Simplified MPEG video decoder.


FIGURE 16.11 Frame-based prediction of MPEG-1 video coding.

The B-frame can be predicted only from the I-frame (forward prediction), only from the P-frame (backward prediction), or from both I- and P-frames (bidirectional prediction); forward and backward prediction each need only one motion vector, and bidirectional prediction needs two motion vectors.

MPEG-2 video provides an enhanced prediction mode to support interlaced material, which uses adaptive field/frame selection based on the best-match criteria. Each frame consists of two fields: top field and bottom field. Each field can be predicted from either field of the previous anchor frame. The possible prediction modes are shown in Figure 16.12.

In a field-based prediction, the top field of the current frame can be predicted either from the top field or the bottom field of an anchor frame, as shown in Figure 16.12. The solid arrow represents the prediction from the top field and the dashed arrow represents the prediction from the bottom field. The same is also true for the bottom field of the current frame. If the current frame is a P-frame, there could be up to two motion vectors used to make the prediction (one for the top field and one for the bottom field); if the current frame is a B-frame, there could be up to four motion vectors (each field could use bidirectional prediction, which needs two motion vectors). At the MB level of MPEG-2, several coding modes are added to support these new field-based predictions. Additionally, there is another new prediction mode supported by the MPEG-2 syntax. This is the special prediction mode referred to as dual prime prediction. The basic idea of dual prime prediction is to code a set of field motion vectors with a scaling to a near or far field, plus a transmitted delta vector. Due to the correlation of adjacent pixels, the dual prime coding of field vectors can save the number of bits used for field motion vectors. The dual prime prediction is shown in Figure 16.13. In Figure 16.13, one field motion vector and the delta motion vector are transmitted; the motion vectors for the other field are derived from these two vectors.

It should be noted that only the P-picture is allowed to use dual prime prediction. In other words, if dual prime prediction is used in the encoder, there will be no B-pictures. The reason for this restriction is to limit the required memory bandwidth for a real system implementation.

FIGURE 16.12 Field-based prediction of enhanced option of MPEG-2 video coding.


FIGURE 16.13 Dual prime prediction in MPEG-2 video coding.

16.2.2.2 Field/Frame DCT Coding Syntax

Another important feature to support interlaced material is to allow adaptive selection of the field/frame DCT coding, as shown in Figure 16.14.

In Figure 16.14, the middle is a luminance MB of 16 × 16 pixels; the black rectangular block represents the eight pixels in the top field and the white rectangular block represents the eight pixels in the bottom field. The left is the field DCT, in which each 8 × 8 block contains only the pixels from the same field. The right is the frame DCT; each 8 × 8 block contains the pixels from both the top field and the bottom field.

At the MB level for interlaced video, the field-type DCT may be selected when the video scene contains less detail and experiences large motion. The difference between adjacent fields may be large when there is large motion between fields; it may then be more efficient to group the lines of the same field together, rather than the lines of the frame. In this way, the possibility that there exists more correlation among the fields can be exploited. Ultimately, this can provide much more efficient coding because the block data is represented with fewer coefficients, especially if there is not much detail contained in the scene.
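A small sketch may make the frame/field block organization of Figure 16.14 concrete; the code below is an illustrative reorganization of a 16 × 16 luminance MB into four 8 × 8 blocks, not the normative MPEG-2 description.

# Sketch: splitting a 16x16 luminance macroblock into four 8x8 DCT blocks,
# either frame-organized (adjacent lines kept together) or field-organized
# (top-field lines and bottom-field lines grouped separately).

def frame_blocks(mb):
    # mb is a 16x16 list of lists; returns four 8x8 blocks in frame order.
    return [[row[c:c + 8] for row in mb[r:r + 8]]
            for r in (0, 8) for c in (0, 8)]

def field_blocks(mb):
    # Group even lines (top field) and odd lines (bottom field) before
    # splitting into 8x8 blocks.
    top = [mb[r] for r in range(0, 16, 2)]     # 8 top-field lines
    bottom = [mb[r] for r in range(1, 16, 2)]  # 8 bottom-field lines
    blocks = []
    for field in (top, bottom):
        for c in (0, 8):
            blocks.append([row[c:c + 8] for row in field])
    return blocks

mb = [[r * 16 + c for c in range(16)] for r in range(16)]
assert len(frame_blocks(mb)) == len(field_blocks(mb)) == 4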

16.2.2.3 Downloadable Quantization Matrix and Alternative Scan Order

The new feature in MPEG-2 regarding the quantization matrix is that it can be downloaded for every frame. This may be helpful if the input video characteristics are very dynamic. In general, the quantizer matrices are different for intracoding and non-intracoding. With the 4:2:0 format, only two matrices are used, one for the intrablocks and another for the non-intrablocks. With the 4:2:2 or 4:4:4 formats, four matrices are used; both an intra- and a non-intra matrix are used for the luminance and chrominance blocks.

FIGURE 16.14 Frame and field discrete cosine transform (DCT) for interlaced video.


FIGURE 16.15 Two zigzag scan methods for MPEG-2 video coding: normal scan order and alternative scan order.

If the matrix load flags are not set, the decoder will use default matrices. The formats 4:2:0 and 4:2:2 are defined in Chapter 15. In the 4:4:4 format, the luminance and two chrominance pictures have the same picture size.

In the picture layer, there is a flag that can be set for an alternative scan of DCT blocks, instead of using the zigzag scan discussed earlier. Depending on the spectral distribution, the alternative scan can yield run lengths that better exploit the multitude of zero coefficients. The zigzag scan and alternative scan are shown in Figure 16.15.

The normal zigzag scan is used for MPEG-1 and as an option for MPEG-2. The alternative scan is not supported by MPEG-1 and is an option for MPEG-2. For frame-type DCT of interlaced video, more energy may exist at the bottom part of the block; hence the RLC may be better off with the alternative scan.

16.2.2.4 Pan and Scan

In MPEG-2, there are several parameters defined in the sequence display extension and picture display extension for panning and displaying a rectangle around a reconstructed frame. These parameters include display-horizontal-size and display-vertical-size in the sequence display extension, and frame-center-horizontal-offset and frame-center-vertical-offset in the picture display extension. The function of these parameters can be found in the MPEG-2 system specification. A typical example of using the pan–scan parameters is the conversion of a 16:9 frame to a 4:3 frame. The 4:3 region is defined by display-horizontal-size and display-vertical-size, and the 16:9 frame is defined by horizontal-size and vertical-size. If we choose the display-horizontal-size to be four pixels less than the horizontal-size, and keep the display-vertical-size the same as the vertical-size, then we can obtain a 4:3 picture on the display. Figure 16.16 shows the conversion of a 16:9 frame to a 4:3 frame using the pan–scan parameters, but there is no center offset involved in this example.

16.2.2.5 Concealment Motion Vector

The concealment motion vector (CMV) is a new tool supported by MPEG-2. This tool is useful in concealing errors in a noisy channel environment where the transmitted data may be lost or corrupted.

FIGURE 16.16 An example of pan–scan.


The basic idea of CMV is that motion vectors are sent for the intracoded MBs. These motion vectors are referred to as CMVs, which should be used in the MBs immediately below the one in which the CMV occurs. The details are described in Section 17.5.3.2.

16.2.2.6 Scalability

MPEG-2 video has several scalable modes, which include spatial scalability, temporal scalability, SNR scalability, and data partitioning. These scalability tools allow a subset of any bitstream to be decoded into meaningful imagery. Moreover, scalability is a useful tool for error resilience on prioritized transmission media. The drawback of scalability is that some coding efficiency is lost due to extra overhead. Here, we briefly introduce the basic notions of the above scalability features.

Spatial scalability allows multiresolution coding, which is suitable for video service inter-networking applications. In spatial scalability, a single video source is split into a base layer (lower spatial resolution) and enhancement layers (higher spatial resolution). For example, an ITU-R 601 video can be down-sampled to SIF format with spatial filtering, which can serve as the base layer video. The base layer or low-resolution video can be coded with MPEG-1 or MPEG-2, and the higher resolution layer must be coded with MPEG-2-supported syntax. For the up-sampled lower layer, an additional prediction mode is available in the MPEG-2 encoder. This is a flexible technique in terms of bit rate ratios, and the enhancement layer can be used in high quality service. The problem with spatial scalability is that there exists some bit rate penalty due to overhead and there is also a moderate increase in complexity. A block diagram that illustrates encoding with spatial scalability is shown in Figure 16.17. In Figure 16.17, the output of the decoding and spatial up-sampling block provides an additional choice of prediction for the MPEG-2 compatible coder, but not the only choice of prediction. The prediction can also be obtained from the HDTV input itself, depending on the prediction selection criterion, such as the minimum prediction difference.

It should be noted that spatial scalability coding allows the base layer to be coded independently from the enhancement layer. In other words, the base layer or lower layer bitstream is generated without regard for the enhancement layer and can be decoded independently. The enhancement layer bitstream is additional information, which can be seen as the prediction error based on the base layer data. This implies that the enhancement layer is useless without the base. However, this type of structure can find many applications, such as error concealment, which is discussed in Section 17.5.

FIGURE 16.17 Block diagram of spatial scalability encoder.


FIGURE 16.18 Block diagram of temporal scalability encoder.

Temporal scalability is a scalable coding technique in the temporal domain. An example of a two-layer temporal scalable coder is shown in Figure 16.18. This example uses temporal scalability to decompose the progressive image sequence into two interlaced image sequences; one is then coded as the base layer and one as the enhancement layer. Of course, the decomposition could be different. For the enhancement layer, there is a choice between a prediction from the base layer and a temporal prediction from the enhancement layer itself. It should be noted that the spatial resolution of the two layers is the same and the combined temporal rate of the two layers is the full temporal rate of the source. Again, it should be noted that the decoding output of the base layer bitstream by the MPEG decoder provides an additional choice of prediction, but not the only choice of prediction.

The signal-to-noise ratio (SNR) scalability provides a mechanism for transmitting a two-layer service with the same spatial resolution but different quality levels. The lower layer is coded with a coarse quantization step at 3–5 Mbits/s to provide NTSC/PAL/SECAM-quality video for a low capacity channel. In the enhancement layer, the difference between the original and coarse-quantized signals is then coded with a finer quantizer to generate an enhancement bitstream for high quality video applications.

The above three scalability schemes generate at least two bitstreams, one for the base layer and the other for the enhancement layer, and the lower layer bitstream can be independently decoded to provide low spatial resolution, low quality, or low frame rate video, respectively. There is another scalability scheme, data partitioning, in which the base layer bitstream cannot be independently decoded. In data partitioning, a single video source is split into a high priority portion, which can be better protected, and a low priority portion, which is less important with regard to the reconstructed video quality. The priority breakpoint in the syntax specifies which syntax elements are coded as low priority (for example, the higher-order DCT coefficients in the intercoded blocks).

16.3 MPEG-2 Video Encoding

16.3.1 Introduction

MPEG video compression is a generic standard that is essential for the growth of the digital video industry, as mentioned earlier. Although the MPEG video coding standard recommended a general coding methodology and syntax for the creation of a legitimate MPEG bitstream, there are many areas of research left open regarding how to generate high quality MPEG bitstreams.


This allows the designers of an MPEG encoder great flexibility in developing and implementing their own MPEG-specific algorithms, leading to product differentiation in the marketplace. To design a performance-optimized MPEG-2 encoder system, several major areas of research have to be considered. These include image preprocessing, motion estimation, coding mode decisions, and rate control. Algorithms for all of these areas in an encoder should aim to minimize subjective distortion for a prescribed bit rate and operating delay constraint. The preprocessing includes noise reduction and the removal of redundant fields, which are contained in the detelecine material. Telecine material comes from the movie industry, where film contains 24 progressive frames/s. The TV signal is 30 frames/s. The detelecine process converts the 24 frames/s film signal to the 30 frames/s TV signal; this is also referred to as the 3:2 pull-down process. Because the 30 frames/s detelecine material only contains 24 frames/s of unique pictures, the encoder has to detect and remove the redundant fields to obtain better coding performance. The process of noise reduction can reduce the bits wasted on coding random noise. Motion compensation is used to remove the temporal redundancy in the video signals. The motion vectors between the anchor picture and the current picture are obtained with motion estimation algorithms. Except for I-pictures, each MB can be inter- or intracoded, which is determined by the mode decision. The investigation of motion estimation algorithms is an important research topic because different motion estimation schemes may result in different coding efficiency. Rate control is always applied for non-variable bit rate (non-VBR) coding. The purpose of rate control is to properly assign the bits for each MB under the constraints of the total bit rate budget and buffer size. This is also an important topic because an optimized bit assignment scheme will result in better coding performance and better subjective reconstruction quality at a given bit rate. In this section, the areas of preprocessing and motion estimation are covered. The topics of rate control and optimum mode decision are discussed in later sections.

16.3.2 Preprocessing

For low bit rate video coding, preprocessing is sometimes applied to the video signals before coding to increase the coding efficiency. Usually the preprocessing implies a filtering of the video signals that are corrupted by random and burst noise for various reasons, such as imperfections of the scanner, transmission, or recording medium. Noise reduction not only improves the visual quality but also increases the performance of video coding. Noise reduction can be achieved by filtering each frame independently. There are a variety of spatial filters that have been developed for image noise filtering and restoration, which can be used for the noise reduction task [cano 1983; katsaggelos 1991]. On the other hand, it is also possible to filter the video sequence temporally along the motion trajectories using motion compensation [sezan 1991]. However, it was shown that, among the recursive stationary methods, MC spatiotemporal filtering performed better than spatial or MC temporal filtering alone [ozkan 1993].

Another important preprocessing is detelecine processing. As movie material is originally shot at 24 progressive frames/s, standard conversion to television at 30 frames/s is made by a 3:2 pull-down process, which periodically inserts repeated fields, giving 30 frames/s telecine source material. Because the 30 frames/s detelecine material only contains 24 frames/s of unique pictures, it is necessary to detect and remove the redundant fields before or during encoding. Rather than directly encoding the 30 frames/s detelecine material, one can remove the redundant fields first and then encode 24 frames/s of unique material, thereby realizing higher coding quality at the same bit rate. The decoder can simply reconstruct the redundant fields before presenting them. Examples of the telecine and detelecine processes are shown in Figure 16.19.


FIGURE 16.19 Examples of the telecine and detelecine processes: (a) telecine process, (b) detelecine process (24 frames/s film and 60 fields/s video).
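A compact way to see the 3:2 pull-down is to write out the field pattern it produces. The sketch below is illustrative only; the field names mirror Figure 16.19 (Ao/Ae for the odd and even fields of film frame A, and so on), and the exact field parity ordering is an assumption, since it varies in practice.

# Sketch of 3:2 pull-down: 4 film frames (24 frames/s) become 10 video fields
# (60 fields/s, i.e., 5 interlaced frames at 30 frames/s). Every other film
# frame contributes a repeated (redundant) field.

def pulldown_3_2(film_frames):
    # film_frames: list of frame labels, length a multiple of 4.
    fields = []
    for i, f in enumerate(film_frames):
        odd, even = f + "o", f + "e"
        # Alternate between emitting 3 fields and 2 fields per film frame.
        fields += [odd, even, odd] if i % 2 == 0 else [even, odd]
    return fields

print(pulldown_3_2(["A", "B", "C", "D"]))
# ['Ao', 'Ae', 'Ao', 'Be', 'Bo', 'Co', 'Ce', 'Co', 'De', 'Do']  -> 10 fields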

Television broadcast programmers frequently switch between telecine material and natural 30 frames/s material, such as when splicing to and from various sources of movies, ordinary television programs, and commercials. An MPEG-2 encoder should be able to cope with these transitions and consistently produce decent pictures. During movie segments, the encoder should realize the gains from coding at the lower frame rate after detelecine. Ideally, the process of source transition from the lower 24 frames/s rate to the higher 30 frames/s rate should not cause any quality drop in the encoded frames. The quality of the encoded frames should remain the same as in the case where the detelecine process is ignored and all material, regardless of source type, is coded at 30 frames/s.

16.3.3 Motion Estimation and Motion Compensation

In principle, for coding video signals, if the motion trajectory of each pixel could be measured, then only the initial or anchor reference frame and the motion vector information need to be coded. In such a way the interframe redundancy will be removed. To reproduce the pictures, one can simply propagate each pixel along its motion trajectory. Because there is also a cost for transmitting motion vector information, in practice one can measure only the motion vectors of a group of pixels, which will share the cost for transmission of the motion information. Of course, at the same time, the pixels in the same group are assumed to have the same motion information. This is not always true because the pixels in the block may move in different directions, or some of them may belong to the background. Therefore, both motion vectors and the prediction difference have to be transmitted. Usually, block matching can be considered the most practical method for motion estimation due to its lower hardware complexity. In the block matching method, the image frame is divided into fixed-size small rectangular blocks such as 16 × 16 or 16 × 8 in MPEG video coding. Each block is assumed to undergo a linear translation, and the displacement vector of each block and the predictive errors are coded and transmitted. The related issues for motion estimation and compensation include the motion vector searching algorithm, searching range, matching criteria, and coding method. Although the matching criteria and searching algorithms have been discussed in Chapter 11, we still briefly introduce them here for the sake of completeness.

16.3.3.1 Matching Criterion

The matching of the blocks can be determined according to various criteria, including the maximum cross-correlation, the minimum MSE, the minimum mean absolute difference (MAD), and the maximum matching pixel count (MPC).


For MSE and MAD, the best matching block is reached if the MSE or MAD is minimized at that location. In practice, we use MAD instead of MSE as the matching criterion due to its computational simplicity. The minimum MSE criterion is not commonly used in hardware implementations because it is difficult to realize the square operation. However, the performance of the MAD criterion deteriorates as the search area becomes larger due to the presence of several local minima. In the maximum MPC criterion, each pixel in the block is classified as either a matching pixel or a mismatching pixel according to whether the difference is smaller than a preset threshold. The best matching is then determined by the maximum number of matching pixels. However, the MPC criterion requires a threshold comparator and a counter.
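As a concrete reference for these criteria, the following sketch computes MAD and the matching pixel count for two equal-sized blocks; the block contents and the MPC threshold are illustrative choices, not values taken from the standard.

# Sketch of two block-matching criteria discussed above: mean absolute
# difference (MAD, to be minimized) and matching pixel count (MPC, maximized).

def mad(cur, ref):
    # Mean absolute difference between two equal-sized 2-D blocks.
    n = len(cur) * len(cur[0])
    return sum(abs(c - r) for cr, rr in zip(cur, ref)
               for c, r in zip(cr, rr)) / n

def mpc(cur, ref, threshold=3):
    # Number of pixels whose absolute difference is below the threshold.
    return sum(abs(c - r) < threshold for cr, rr in zip(cur, ref)
               for c, r in zip(cr, rr))

cur = [[10, 12], [11, 13]]
ref = [[9, 12], [14, 13]]
print(mad(cur, ref), mpc(cur, ref))   # 1.0  3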

16.3.3.2 Searching Algorithm

Finding the best matching block requires optimizing the matching criterion over all possible candidate displacement vectors at each pixel. The so-called full search, logarithmic search, and hierarchical searching algorithms can accomplish this.

16.3.3.2.1 Full Search

The full search algorithm evaluates the matching criterion for all possible values within the predefined searching window. If the search window is restricted to a [-p, p] square, for each motion vector there are (2p + 1)^2 search locations. For a block size of M × N pixels, at each search location we compare N × M pixels. If we know the matching criterion and the number of operations needed for each comparison, then we can calculate the computational complexity of the full search algorithm. The full search algorithm is computationally expensive, but it guarantees finding the global optimal matching within the defined searching range.
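A direct implementation of the full search over a [-p, p] window, using MAD as the criterion, might look like the sketch below; frame boundaries are handled simply by skipping out-of-range candidates, which is one possible simplification.

# Sketch of full-search block matching with the MAD criterion. For a [-p, p]
# window there are (2p + 1)^2 candidate displacements per block.

def block_mad(cur, ref, x, y, dx, dy, n):
    # MAD between the n x n block at (x, y) in cur and the block displaced
    # by (dx, dy) in ref.
    s = 0
    for r in range(n):
        for c in range(n):
            s += abs(cur[y + r][x + c] - ref[y + dy + r][x + dx + c])
    return s / (n * n)

def full_search(cur, ref, x, y, n=16, p=7):
    h, w = len(ref), len(ref[0])
    best = (None, float("inf"))
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= x + dx and x + dx + n <= w and
                    0 <= y + dy and y + dy + n <= h):
                continue
            cost = block_mad(cur, ref, x, y, dx, dy, n)
            if cost < best[1]:
                best = ((dx, dy), cost)
    return best   # ((dx, dy), MAD) of the best match in the window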

16.3.3.2.2 Logarithmic Search

Actually, the expected accuracy of motion estimation algorithms varies according to the application. In MC video coding, all one seeks is a matching block in terms of some metric, even if the match does not correlate well with the actual projected motion. Therefore, in most cases, search strategies faster than the full search are used, although they lead to suboptimal solutions. These faster search algorithms evaluate the criterion function only at a predetermined subset of the candidate motion vector locations instead of all possible locations. One of these faster search algorithms is the logarithmic search. Its more popular form is referred to as the three-step search. We explain the three-step search algorithm with the help of Figure 16.20, where only the search frame is depicted.

FIGURE 16.20 Three-step search.


Search locations corresponding to each of the steps in the three-step search procedure are labeled 1, 2, and 3. In the first step, starting from pixel 0, we compute the MAD for the nine search locations labeled 1. The spacing between these search locations is 4. Assume that the MAD is minimum for the search location (4,4), which is circled. In the second step, the criterion function is evaluated at eight locations around the circled 1, which are labeled 2. The spacing between locations is now two pixels. Assume now that the minimum MAD is at the location (6,2), which is also circled. Thus the new search origin is the circled 2, located at (6,2). For the third step, the spacing is set to 1, and the eight locations labeled 3 are searched. The search procedure is terminated at this point and the output motion vector is (7,1). Additional steps may be incorporated into the procedure if we wish to obtain subpixel accuracy in the motion estimation. Then the search frame needs to be interpolated to evaluate the criterion function at subpixel locations.
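The three-step search just described can be sketched as follows; the cost function is assumed to be supplied externally (for example, a MAD evaluation like the one shown for the full search), and boundary checks are omitted for brevity.

# Sketch of the three-step search: start with a spacing of 4, evaluate the
# criterion at the center and its 8 neighbors, move the center to the best
# location, halve the spacing, and repeat (spacings 4, 2, 1).

def three_step_search(cost):
    # cost(dx, dy) returns the matching criterion (e.g., MAD) for a
    # candidate displacement; returns the selected motion vector.
    cx, cy = 0, 0
    for step in (4, 2, 1):
        candidates = [(cx + sx * step, cy + sy * step)
                      for sx in (-1, 0, 1) for sy in (-1, 0, 1)]
        cx, cy = min(candidates, key=lambda d: cost(d[0], d[1]))
    return cx, cy

# Toy example: a cost surface whose minimum lies at (7, 1), as in Figure 16.20.
print(three_step_search(lambda dx, dy: (dx - 7) ** 2 + (dy - 1) ** 2))  # (7, 1)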

16.3.3.2.3 Hierarchical Motion Estimation

Hierarchical representations of images in the form of a Laplacian pyramid or wavelet transform are also quite often used with the block matching method for improved motion estimation. The basic idea of hierarchical block matching is to perform motion estimation at each level successively, starting with the lowest resolution level. The lower resolution levels serve to determine a rough estimate of the motion information using relatively larger blocks. The estimate of the motion vector at a lower resolution level is then passed on to the next higher resolution level as an initial estimate. The higher resolution levels are used to fine-tune the motion vector estimate. At higher resolution levels, relatively smaller window sizes can be used since we start with a good initial estimate. The hierarchical motion estimation can significantly reduce the implementation complexity because its search method is very efficient. However, such a method requires increased storage due to the need to keep pictures at different resolutions. Furthermore, this scheme may yield inaccurate motion vectors for regions containing small objects. When the search starts at the lowest resolution of the hierarchy, regions containing small objects may be eliminated and thus fail to be tracked. On the other hand, the creation of low-resolution pictures provides some immunity to noise. Some experimental results, obtained by one of the authors, are as follows: compared with the full search, the two-layer hierarchical motion estimation reduces the search complexity by a factor of 10 at the price of degrading the reconstruction quality by about 0.2 to 0.6 dB for frame-mode coding, by 0.26 to 0.38 dB for field-mode coding, and by only 0.16 to 0.37 dB for frame/field adaptive coding, for different video sequences at a fixed bit rate of 4 Mbits/s. In the case of VBR coding, similar results can be observed from the rate distortion curves.

In the above discussion, we have restricted the motion vector estimation to integer pixel grids, or pixel accuracy. Actually, the motion vectors can be estimated with fractional or sub-pixel accuracy. In MPEG-2 video coding, half-pixel accuracy motion estimation can be used. Half-pixel accuracy can be easily achieved by interpolating the current and reference pictures by a factor of 2 and then using any of the motion estimation methods described earlier.
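Half-pixel refinement relies on a simple bilinear interpolation of the reference picture. A minimal sketch of such an upsampling by a factor of 2 is shown below; replicating edge samples is one common simplification and is an assumption here, not a statement of the normative MPEG-2 interpolation.

# Sketch of bilinear interpolation by a factor of 2, as used to obtain
# half-pixel positions for motion estimation. Edge handling simply clamps
# to the last row/column.

def upsample2(img):
    h, w = len(img), len(img[0])
    out = [[0] * (2 * w) for _ in range(2 * h)]
    for y in range(2 * h):
        for x in range(2 * w):
            y0, x0 = y // 2, x // 2
            y1, x1 = min(y0 + y % 2, h - 1), min(x0 + x % 2, w - 1)
            # Average of the (up to four) surrounding integer-pel samples.
            out[y][x] = (img[y0][x0] + img[y0][x1] +
                         img[y1][x0] + img[y1][x1]) / 4.0
    return out

print(upsample2([[0, 4], [8, 12]])[1][1])  # 6.0: half-pel between all four samples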

16.3.3.3 Advanced Motion Estimation

Progress has recently been made in several aspects of motion estimation, which are described as follows.

16.3.3.3.1 Motion Estimation Using a Reduced Set of Image Data

The methods to reduce search complexity with subsampling and pyramid processing are well known and available in the literature [sun 1994].


However, the reduction achieved by lowering the precision of each sample does not appear to have been extensively studied. Some experimental results have shown that the performance degradation of the hierarchical motion estimation algorithm is not serious when each layer of an up to four-layer pyramid is limited to 6 bits/sample. At 4–5 bits/sample, the performance is degraded by 0.2 dB relative to full precision.

16.3.3.3.2 Overlapped Motion Estimation

A limitation of block matching is that it generates a significant proportion of motion vectors that do not represent the true motion present in the scene. One possible reason is that the motion vectors are estimated without reference to any picture data outside of the nonoverlapping blocks. This problem has been addressed by overlapped motion estimation. In the case of overlapped motion compensation, MC regions translated by the motion vectors are overlapped with each other. A window function is then used to determine the weighting factors for each vector. This technique has been adopted into the H.263 video coding standard. Some improvements have been clearly identified for low bit rate coding [katto 1994].

16.3.3.3.3 Frequency Domain Motion Estimation

An alternative to spatial-domain block matching methods is to estimate the motion vector in the frequency domain by calculating the cross-correlation [young 1993]. Most international standards, such as MPEG, H.263, and the proposed HDTV standard, use the DCT and block-based motion estimation as the essential elements to achieve spatial and temporal compression, respectively. A new motion estimation approach has been proposed in the DCT domain [koc 1998]. This method of motion estimation has certain merits over conventional methods. It has very low computational complexity and is robust even in a noisy environment. Moreover, the motion compensation loop in the encoder is much simplified because the IDCT is moved out of the loop [koc 1998].

16.3.3.3.4 Generalized Block Matching

In generalized block matching, the encoded frame is divided into triangular, rectangular, or arbitrary quadrilateral patches. We then search for the best matching triangular or quadrilateral patch in the search frame under a given spatial transformation. The choice of patch shape and spatial transform are mutually related. For example, triangular patches offer sufficient degrees of freedom with the affine transformation, which has only six independent parameters. The bilinear transform has eight free parameters; hence it is suitable for use with rectangular or quadrilateral patches. Generalized block matching is usually used adaptively, only for those blocks where standard block matching is not satisfactory, in order to avoid the imposed computational load.

16.4 Rate Control

16.4.1 Introduction of Rate Control

The purpose of rate control is to optimize the perceived picture quality and to achieve a given constant average bit rate by controlling the allocation of the bits. From the viewpoint of rate control, encoding can be classified into VBR coding and constant bit rate (CBR) coding. VBR coding can provide constant picture quality with a variable coding bit rate, whereas CBR coding provides a constant bit rate with nonuniform picture quality.


Rate control and buffer regulation are important issues for both VBR and CBR applications. In the case of VBR encoding, the rate controller attempts to achieve optimum quality for a given target rate. In the case of CBR encoding and real-time applications, the rate control scheme has to satisfy the low-latency and VBV (video buffering verifier) buffer constraints. The VBV is a hypothetical decoder, which is conceptually connected to the output of an encoder (Appendix C of [mpeg2]). The bitstream generated by the encoder is placed into the VBV buffer at the CBR that is being used. The rate control has to ensure that the VBV buffer will neither overflow nor underflow. In addition, the rate control scheme has to be applicable to a wide variety of sequences and bit rates. At the GOP level, the total number of available bits is allocated among the various picture types, taking into account the constraints of the decoder buffer, so that the perceived quality is balanced. Within each picture, the available bits are allocated among the MBs to maximize the visual quality and to achieve the desired target of encoded bits for the whole picture.

16.4.2 Rate Control of Test Model 5 for MPEG-2

As we described before, the standard only defines the syntax for decoding. TM5 is an example of an encoder, which may not be optimal; however, it can provide a compliant compressed bitstream. Also, the Test Model served as a reference during the development of the standard. The TM5 rate control algorithm consists of three steps for adapting the MB quantization parameter to control the bit rate.

Step 1: Target bit allocation
The target bit allocation is the first step of rate control. Before coding a picture, we need to estimate the number of bits available for coding this picture. The estimation is based on several factors. These include the picture type, buffer fullness, and picture complexity. The estimation of picture complexity is based on the number of bits and the quantization parameter used for coding the previous picture of the same type in the GOP. The initial complexity values are given according to the type of picture:

X_i = (160 * bit_rate) / 115
X_p = (60 * bit_rate) / 115          (16.6)
X_b = (42 * bit_rate) / 115

where the subscripts i, p, and b stand for picture types I, P, and B (this applies to the formulas in this section). After a picture of a certain type (I, P, or B) is encoded, the respective global complexity measure (X_i, X_p, or X_b) is updated as

X_i = S_i * Q_i,   X_p = S_p * Q_p,   X_b = S_b * Q_b          (16.7)

where S_i, S_p, and S_b are the numbers of bits generated by encoding the picture, and Q_i, Q_p, and Q_b are the average quantization parameters computed by averaging the actual quantization values used during the encoding of all the MBs, including the skipped MBs. This estimation is intuitive: if the picture is more complicated, more bits are needed to encode it. The quantization parameter (step or interval) is used to normalize this measure because the number of bits generated by the encoder is inversely proportional to the quantization step. The quantization step can also be considered as a measure of coded picture quality. The target number of bits for the next picture in the GOP (T_i, T_p, or T_b) is computed as follows:


T_i = max{ R / (1 + (N_p * X_p)/(X_i * K_p) + (N_b * X_b)/(X_i * K_b)),  bit_rate / (8 * picture_rate) }

T_p = max{ R / (N_p + (N_b * K_p * X_b)/(K_b * X_p)),  bit_rate / (8 * picture_rate) }          (16.8)

T_b = max{ R / (N_b + (N_p * K_b * X_p)/(K_p * X_b)),  bit_rate / (8 * picture_rate) }

where K_p and K_b are universal constants dependent on the quantization matrices. For the matrices of TM5, K_p = 1.0 and K_b = 1.4. R is the remaining number of bits assigned to the GOP; after coding a picture, this number is updated by subtracting the bits used for that picture. N_p and N_b are the numbers of P- and B-pictures remaining in the current GOP in encoding order. The problem with the above target bit assignment algorithm is that it does not handle scene changes efficiently.
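A compact sketch of this first step (complexity update and target bit computation), with variable names following Equations 16.6 through 16.8, is given below; it is a reading aid for the formulas, not a complete TM5 implementation.

# Sketch of TM5 Step 1: initial complexities (Eq. 16.6), global complexity
# update (Eq. 16.7), and target bit allocation (Eq. 16.8).
# K_P = 1.0 and K_B = 1.4 for the TM5 matrices.

K_P, K_B = 1.0, 1.4

def init_complexity(bit_rate):
    # Initial complexities from Eq. 16.6.
    return {"I": 160 * bit_rate / 115,
            "P": 60 * bit_rate / 115,
            "B": 42 * bit_rate / 115}

def update_complexity(X, ptype, S, Q):
    # Eq. 16.7: S is the number of bits generated for the picture just coded,
    # Q the average quantization parameter used.
    X[ptype] = S * Q

def target_bits(ptype, X, R, Np, Nb, bit_rate, picture_rate):
    # Eq. 16.8: target for the next picture of the given type.
    floor_bits = bit_rate / (8.0 * picture_rate)
    if ptype == "I":
        denom = 1 + Np * X["P"] / (X["I"] * K_P) + Nb * X["B"] / (X["I"] * K_B)
    elif ptype == "P":
        denom = Np + Nb * K_P * X["B"] / (K_B * X["P"])
    else:  # "B"
        denom = Nb + Np * K_B * X["P"] / (K_P * X["B"])
    return max(R / denom, floor_bits)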

Step 2: Rate control
Within a picture, the bits used for each MB are determined by the rate control algorithm. A quantizer step is then derived from the number of bits available for the MB to be coded. The following is an example of rate control for a P-picture.

In Figure 16.21, d_0^p is the initial virtual buffer fullness and T_p is the target number of bits for the P-picture. B_j is the number of bits generated by encoding all MBs in the picture up to and including the jth MB. MB_cnt is the number of MBs in the picture. Before encoding the jth MB, the virtual buffer fullness is adjusted during encoding according to the following equation for a P-picture:

d_j^p = d_0^p + B_{j-1} - T_p * (j - 1) / MB_cnt          (16.9)

Then the quantization step is computed with the equation:

Q_j^p = (d_j^p * 31) / r          (16.10)

where the reaction parameter r is given by r = 2 * bit_rate / picture_rate and d_j^p is the fullness of the appropriate virtual buffer. This procedure is shown in Figure 16.21.

FIGURE 16.21 Rate control for P-picture.


The fullness of the virtual buffer for the last MB is used as the initial fullness when encoding the next picture of the same type.

The above example can be extended to the general case for all I-, P-, and B-pictures. Before encoding the jth MB, we compute the fullness of the appropriate virtual buffer:

d_j^i = d_0^i + B_{j-1} - T_i * (j - 1) / MB_cnt

or

d_j^p = d_0^p + B_{j-1} - T_p * (j - 1) / MB_cnt          (16.11)

or

d_j^b = d_0^b + B_{j-1} - T_b * (j - 1) / MB_cnt

depending on the picture type, where d_0^i, d_0^p, and d_0^b are the initial fullness values of the virtual buffers and d_j^i, d_j^p, and d_j^b are the fullness values of the virtual buffers at the jth MB, one for each picture type. From the virtual buffer fullness we compute the quantization step Q_j for MB j according to the buffer fullness:

Q_j = (d_j * 31) / r          (16.12)

The initial values of the virtual buffer fullness are

d_0^i = 10 * r / 31
d_0^p = K_p * d_0^i          (16.13)
d_0^b = K_b * d_0^i

where K_p and K_b are the constants defined in Equation 16.8.
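The per-MB part of the rate control (Equations 16.9 through 16.13) can be sketched as follows, with the P-picture case standing in for all three picture types; the numeric values are illustrative assumptions.

# Sketch of TM5 Step 2 for one picture: virtual buffer update (Eq. 16.11)
# and quantization step computation (Eq. 16.12). d0 is the initial buffer
# fullness for this picture type (Eq. 16.13).

def quantizer_for_mb(j, d0, bits_so_far, T, mb_cnt, r):
    # j is the 1-based MB index; bits_so_far is B_{j-1}; T is the picture target.
    d_j = d0 + bits_so_far - T * (j - 1) / mb_cnt          # Eq. 16.11
    q_j = d_j * 31.0 / r                                   # Eq. 16.12
    return d_j, q_j

# Reaction parameter and initial fullness, as defined in the text:
bit_rate, picture_rate = 4000000, 30
r = 2 * bit_rate / picture_rate
d0_i = 10 * r / 31                                         # Eq. 16.13
d0_p, d0_b = 1.0 * d0_i, 1.4 * d0_i                        # K_p*d0_i, K_b*d0_i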

Step 3: Adaptive quantization
Adaptive quantization is the last step of the TM5 rate control. It is noted that for active or busy areas, the human eye is not so sensitive to quantization noise, whereas smooth areas are more sensitive to quantization noise, as discussed in Chapter 1. On the basis of this observation, we modulate the quantization step obtained from the previous step in such a way as to increase the quantization step for active areas and reduce the quantization step for smooth areas. In other words, we use more bits in the smooth areas and fewer bits for the active areas. Experimental results have shown that the subjective quality is higher with the adaptive quantization step than without it. The procedure of adaptive quantization in TM5 is as follows. First, the spatial activity measure for the jth MB is calculated from the four luminance frame-organized sub-blocks and the four luminance field-organized blocks using the intra (original) pixel values:

act_j = 1 + min_{sblk=1,...,8} (var_sblk)          (16.14)

where var_sblk is the variance of each spatial 8 × 8 block, which is calculated as

var_sblk = (1/64) * sum_{k=1}^{64} (P_k - P_mean)^2          (16.15)


and P_k is the pixel value in the original 8 × 8 block and P_mean is the mean value of the block, which is calculated as

P_mean = (1/64) * sum_{k=1}^{64} P_k          (16.16)

The normalized activity factor N_act_j is

N_act_j = (2 * act_j + avg_act) / (act_j + 2 * avg_act)          (16.17)

where avg_act is the average value of act_j over the last picture to be encoded. Therefore, this value will not give a good result when a scene change occurs. For the first picture, this parameter takes the value 400. Finally, we obtain the modulated quantization step for the jth MB:

mquant_j = Q_j * N_act_j          (16.18)

where Q_j is the reference quantization step value obtained in the last step. The final value of mquant_j is clipped to the range [1, 31] and is used and coded as described in the MPEG standard.
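Step 3 can likewise be condensed into a short sketch following Equations 16.14 through 16.18; the eight sub-blocks passed in are assumed to be the four frame-organized and four field-organized 8 × 8 luminance blocks of the MB.

# Sketch of TM5 Step 3: spatial activity (Eqs. 16.14-16.16), normalized
# activity (Eq. 16.17), and modulated quantization step (Eq. 16.18).

def variance(block):
    # var_sblk of Eq. 16.15 for an 8x8 block given as a flat list of 64 pixels.
    mean = sum(block) / 64.0                                  # Eq. 16.16
    return sum((p - mean) ** 2 for p in block) / 64.0

def mquant(sub_blocks, q_j, avg_act):
    # sub_blocks: the eight 8x8 luminance sub-blocks of the MB.
    act_j = 1 + min(variance(b) for b in sub_blocks)          # Eq. 16.14
    n_act = (2 * act_j + avg_act) / (act_j + 2 * avg_act)     # Eq. 16.17
    m = q_j * n_act                                           # Eq. 16.18
    # act_j is also returned so avg_act can be updated for the next picture.
    return max(1, min(31, round(m))), act_j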

As indicated before, the TM5 rate control provides only a reference model. It is not optimized in many aspects. Therefore, there is still a lot of room for improving the rate control algorithm, such as providing a more precise estimation of average activity by preprocessing. In the following section, we investigate the optimization problem of mode decision combined with rate control, which can provide a significant quality improvement, as shown by experimental results.

16.5 Optimum Mode Decision

16.5.1 Problem Formation

This section addresses the problem of determining the optimal MPEG [mpeg2] coding strategy in terms of the selection of MB coding modes and quantizer scales. In the Test Model [tm5], the rate control operates independently from the coding mode selection for each MB. The coding mode is decided based only upon the energy of the predictive residues. Actually, the two processes, coding mode decision and rate control, are intimately related to each other and should be determined jointly to achieve optimal coding performance. A constrained optimization problem can be formulated based on the rate distortion characteristics, or R(D) curves, for all the MBs that compose the picture being coded. Distortion for the entire picture is assumed to be decomposable and expressible as a function of individual MB distortions, with this being the objective function to minimize. The determination of the optimal solution is complicated by the MPEG differential encoding of motion vectors and dc coefficients, which introduces dependencies that carry over from MB to MB for a duration equal to the slice length. As an approximation, a near-optimum greedy algorithm can be developed. Once the upper bound in performance is calculated, it can be used to assess how well practical suboptimum methods perform.

Earlier related studies dealing with dependent quantization for MPEG include the works done by Ramchandran [ramchandran 1994] and Lee [lee 1994].


These works treated the problem of bit allocation where there is temporal dependency in coding complexity across I, P, and B frames. Although these techniques represent the most proper bit allocation strategies across frames from a theoretical viewpoint, no practical real-time MPEG encoding system will use even those proposed simplified techniques because they require an unwieldy number of pre-analysis encoding passes over the window of dependent frames (one MPEG GOP). To overcome these computational burdens, more pragmatic solutions that can realistically be implemented have been considered by Sun [sun 1997]. In this study, the major emphasis is not on the problem of bit allocation among I-, P-, and B-frames; rather, the authors choose to utilize the frame-level allocation method provided by the Test Model [tm5]. In this way, frame-level coding complexities are estimated from past frames without any forward pre-analysis knowledge of future frames. This type of analysis forms the most reasonable set of assumptions for a practical real-time encoding system. Another method that extends the basic Test Model idea to alter frame budgets heuristically in the case of scene changes, use of dynamic GOP size, and temporal masking effects can be found in [wang 1995]. These techniques also offer very effective and practical solutions for implementation. Given the chosen method for frame-level bit budget allocation, the focus of this section is on jointly optimizing MB coding modes and quantizers within each frame.

There exist many choices for the MB coding mode under the MPEG-2 standard for P- and B-pictures, including intramode, no motion compensation mode, frame/field/dual prime motion compensation intermode, forward/backward/average intermode, and field/frame DCT mode. In the standard Test Model reference [tm5], the coding mode for each MB is selected by comparing the energies of the predictive residuals. For example, the intra/inter decision is determined by a comparison of the variance of the MB pixels against the variance of the predictive residuals; the inter prediction mode is selected to be the intermode that has the least predictive residual MSE. The coding mode selected by the Test Model criteria does not necessarily result in the optimal coding performance.

In attempting to achieve optimal coding performance, it is important to realize that coding modes should be determined jointly with rate control because the best coding mode depends upon the operating point for rate. In deciding which of the various coding modes is best, one should consider what the operating point is for distortion, and also consider the trade-off between spending bits for coding the prediction residuals and bits for coding motion vectors.

The number of bits used for coding the MB is the sum of the bits used for coding motion vectors and the bits used for coding residuals:

R_MB = R_mv + R_residual          (16.19)

For example, in Figure 16.22, consider the decision between (a) frame-mode forward prediction and (b) field-mode bidirectional prediction. Mode (b) will almost always produce a prediction that has lower MSE than mode (a). However, mode (a) requires coding of fewer motion vectors than mode (b). Which mode is best? The answer depends on the operating point for distortion. When coding at a very coarse quant-scale, mode (a) can perform better than mode (b) because the difference in bits required for coding motion vectors between the two modes may be much greater than the difference in bits required for coding residuals between the two modes. However, when coding at a fine quant-scale, mode (b) can perform better than mode (a) because mode (b) provides a better prediction and the bits required for motion vectors become negligible compared to the bits for coding residuals.

Coding mode decisions and rate control can be determined jointly and optimally starting from the basics of constrained optimization using R(D) curves.


FIGURE 16.22 Rate distortion [R(D)] curves for different macroblock (MB) coding modes.

This optimal solution would be an a-posteriori solution that assumes complete knowledge of R(D). We investigate an optimal solution for objective functions of the form:

D_PICT = sum_{i=1}^{N_MB} D_MBi          (16.20)

Equation 16.20 states that the distortion for the picture, D_PICT, can be measured as an accumulation of the individual MB distortions, D_MBi, over all N_MB MBs in the picture. We minimize this objective function subject to the individual MB distortions being uniform over the picture:

D_1 = D_2 = ... = D_NMB          (16.21)

and subject to the bits generated from coding each MB, R_MB, summing to the target bit allocation for the entire picture, R_PICT:

sum_{i=1}^{N_MB} R_MBi = R_PICT          (16.22)

The choice for the MB distortion measure, D_MB, can be the MSE computed over the pixels in the MB, or it can be a measure that reflects subjective distortion more accurately, such as luminance- and frequency-weighted MSE. Other choices for D_MB may be the quantizer scale used for coding the MB, or better yet, the quantizer scale weighted by an activity masking factor. In this chapter, we select the distortion for each MB_i to be the spatial-masking-activity-weighted quantizer scale:

D_MBi = qscale_i / N_act_i          (16.23)

where N_act_i ∈ [0.5, 2.0] is the normalized spatial-masking-activity quantizer weighting factor, as defined in the Test Model [tm5]:

N_act_i = (2 * act_i + avg_act) / (act_i + 2 * avg_act)          (16.24)

where act_i is the minimum luma block spatial variance for MB_i and avg_act is the average value of act_i over the last picture to be coded.


N_act_i reflects the relative amount of quantization error that can be tolerated for MB_i as compared to the rest of the MBs that compose the picture. N_act_i depends strongly on whether the MB belongs to a smooth, edge, or textured region of the picture. Hence, the MB distortion metric is space variant and depends on the context of the local picture characteristics surrounding each MB. We assume that maintaining the same D_MBi for all MBs, or selecting the quantizer scales directly proportional to N_act_i in such a manner, corresponds to maintaining uniform subjective quality throughout the picture. The masking-activity-weighted quantizer scale is a somewhat coarse measure of image quality, but it reflects subjective image quality better than MSE or peak signal-to-noise ratio (PSNR), and it is a practical metric to compute that lends itself to an additive form for distortion.

It is important to note that the resulting distortion measure for the picture, D_PICT, is really only meaningful as a relative comparison figure for the same identical picture (thus having the same masking activities) quantized in different ways. It is not useful for comparing two different images. PSNR is only useful in this sense too, though with poorer subjective accuracy.

In the following, a procedure for obtaining the optimal coding performance with the joint optimization of coding mode selection and rate control is discussed. As this method would be too complex to implement, a practical suboptimal heuristic algorithm is presented. Some simulation results and comparisons between the different algorithms (Test Model algorithm, Near-Optimum algorithm, and the practical Sub-Optimum algorithm) are also provided to assist the reader in understanding the differences in performance.

16.5.2 Procedure for Obtaining the Optimal Mode

16.5.2.1 Optimal Solution

The solution to the optimization problem is unique because the objective function is monotonic and the individual MB R(D) functions are also monotonic. To solve for the optimal set of MB modes and quant-scales for the picture (the mode and qscale vectors), the differential encoding of motion vectors and intra dc coefficients as done in MPEG should be accounted for. According to MPEG, each slice has its own differential encoding chain. At the start of each slice, prediction motion vectors are reset to zero. As each MB is encoded in raster scan order, the MB motion vectors are encoded differentially with respect to prediction motion vectors that depend on the coding mode of the previous MB. These prediction motion vectors may be reset to zero in the case that the previous MB was coded as intra or skipped. Similarly, dc coefficients in continuous runs of intra MBs are encoded differentially with respect to the previous intra MB. The intra dc predictors are reset at the start of every slice, and at inter or skipped MBs. Slice boundaries delimit independent self-contained decodable units. Finding the optimal set of coding modes for the MBs in each slice entails a search through a trellis of dimensions S stages by M states per stage, with S being the slice size and M being the number of coding modes being considered (see Figure 16.23). This trellis structure arises because there are M^2 distinct rate distortion characteristic curves, R_mode|previous-mode(D), corresponding to each of the M coding modes, with each in turn having a different dependency for each of the M coding modes of the previous MB. We now consider populating the trellis links with values by sampling the set of these M^2 * S rate distortion curves at a specific distortion level. For a given fixed MB distortion level, D_MB, each link on the trellis is assigned a cost equal to the number of bits to code an MB in a certain mode given the mode from which the preceding MB was coded. For any group of links entering a node, the cost of these links differs only because of the difference in bits caused by the motion vector and dc coefficient coding dependency upon the earlier MB.


FIGURE 16.23 Full search trellis; M^S searches (M is the number of modes at each stage and S is the length of the slice) are needed to obtain the best path.

The computational requirements per slice involve the following:

. To determine link costs in the trellis, the number of "code the MB" operations (i.e., DCT + quantization + RLC/VLC) is equal to M^2 * S.

. After determining all trellis link costs, the number of path searches is equal to M^S.

A general iterative procedure for obtaining the optimal solution is as follows:

1. Initi alize a gues s for DMB ¼ D MB0. Ho wever, D MB is the same fo r every MB in thepictu re; this sets an initial gues s for the operating disto rtion level of the picture .

2. Follow the given proce dure for each slice in the picture:

(a) For each MB in the slic e and the mode conside red, determi ne the quantiz erscale, whi ch yields the disto rtion level DMB, i.e., qs ¼ f( D MB), whe re f is thefunction which descri bes the relationshi p between quan tizer scale qs and dis-tortion DMB. If we use spatial-masking-activity-weighted quantizer scale as ameasure of dis tortion (as from Equatio n 16.4), then qs equals N_act *DMB.

(b) Compute all the link costs in the trellis representing the slice.The link costs, RMBi (mode kjmode j), represent the number of resulting bits(total bits for coding residual, motion vectors, and MB header) for coding MBiin mode k given that the preceding MB was coded in mode j.

(c) Search through the trellis tofind the path that has the lowestSRMBi over the slice.

3. Compute S RMBi for all MBs in the picture and compare to target RPICT.

(a) If jSRMBi � RPICTj< « then the optimal mode���!

and qscale����!

has been found forpicture. Repeat the process for the next picture.

(b) If SRMBi < RPICT then decrement DMB¼DMB � DDMB and goto step 2.

(c) If SRMBi > RPICT then increment DMB¼DMB þ DDMB and goto step 2.

16.5.2.2 Near-Optimal Greedy Solution

The solution from the full exponential-order search requires an unwieldy amount of com-putations. To avoid the heavy computational burden, we can use a greedy approach [lee1994] to simplify and sidestep the dependency problems of the full search method. In thegreedy algorithm, the best coding mode selection for the current MB depends only upon

� 2007 by Taylor & Francis Group, LLC.

Page 419: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Macroblock number

Codingmode

The best mode at each stage

The greedy locally “best ” path

FIGURE 16.24Greedy approach, M3S comparisons needed to obtain the locally ‘‘best’’ path.

the best mode of the previous coded MB. Therefore, the upper bound we obtain is a near-optimumsolution instead of a global optimum. Figure 16.24 illustrates the greedy algorithm.After coding anMB in each of theMmodes, the mode resulting in the least number of bits ischosen to be best. The very next MB is coded with dependencies to that chosen best mode.The computations per slice are reduced to M3 S ‘‘code the MB’’ operations and M3 Scomparisons. A general iterative procedure for obtaining the greedy solution is as follows:

1. Initialize a guess for DMB¼DMB0.

2. Follow the given procedure for each MB:

(a) For each mode considered, determine the quantizer scale which yields the dis-tortion level DMB, i.e., qs¼ f(DMB), where f is the function we mentioned earlier.

(b) For each mode, code the MB in that mode with that qs value and record theresulting number of generated bits, RMBi(mode ijmode j). The MB is codedbased on the earlier determined mode of the preceding MB.

(c) The best mode for MBi is the mode for which RMBi(mode ijmode j), mode issmallest. This yields RMBi bits for MBi.

3. Compute SRMBi for all MBs in the picture and compare to target RPICT.

(a) If jSRMBi � RPICTj < « then the optimal mode���!

and qscale����!

has been found forpicture. Repeat the process for the next picture.

(b) If SRMBi < RPICT then decrement DMB¼DMB � DDMB and goto step 2.

(c) If SRMBi > RPICT then increment DMB¼DMB þ DDMB and goto step 2.

16.5.3 Practical Solution with New Criteria for the Selection of Coding Mode

It is obvious that the near-optimal solution discussed in the previous section is not apractical method because of its complexity. To determine the best mode, we have toknow how many bits it takes to code each MB in every mode with the same distortionlevel. The total number of bits for each MB, RMB, consists of three parts, bits for codingmotion vectors, Rmv, bits for coding the predictive residue, Rres, and bits for coding MBheader information, Rheader, such as MB-type, quantizer scale, and coded-block-pattern.

RMB ¼ Rmv þ Rres þ Rheader (16:25)

� 2007 by Taylor & Francis Group, LLC.

Page 420: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

QuantizerDCT VLC

Predictiveresidue

Compressedbitstream

FIGURE 16.25Coding stages to find out bit count.

The number of bits fo r mo tion vect ors, Rmv, can be easi ly obt ained by VLC table look-u p. Butto obtain the number of bits for coding the pred ictive residue, one has to go through thethree -step coding proced ure: (1) DC T, (2) quantiz ation , and (3) VLC as sh own in Figure16.25. At step 3, Rres is obtained with a loo k-up table accord ing to the run length of zeros andthe level of quan tized coef ficie nts, i.e., Rres depends on the pai r of values of run and level:

Rres ¼ f (run, lev el) (16 : 26)

As st ated above, to obtain the up per-bound coding perform ance all three steps are neededfor each codi ng m ode, and then the coding mo de resu lting in the least num ber of bits isselected as the best mo de.

To obtain a m uch less com putational ly inten sive met hod, it is prefe rred to use astatist ical model of DC T coef ficient bit usage ver sus varia nce of the prediction resid ualand quan tizer step size. This will provi de an app roximati on of the num ber of resi dual bits,Rres . For thi s purpose, we assume that the run and level pair in Equati on 16.7 is stronglydepen dent on valu es of the quan tizer scale, qs , and the variance of the residue , V res , fo r eachMB. Intuiti vely, we would expe ct the number of bits to enc ode an MB is propo rtional to thevariance of the residual and inv ersely proporti onal to the va lue of quan tizer step size .Ther efore a st atistical model can be cons tructed by plotti ng Rres ver sus the indep endentvariab les Vres and qs over a large set of repres entative MB pixe ls from images typical ofnatural vide o mate rial. This results in a sca tter plo t showi ng tigh t correlat ion, and hence asurfa ce can be fit through the dat a poi nts. It was found that Equati on 16.24 can beapproxi mately express ed as

Rres � f ( q s , V res ) ¼ [K =( Cq s þ q 2s )] V res (16 : 27)

where K and C are cons tants fo und throu gh surface fitting regr ession. If we assume Rheader

is a relatively fixed component that does not vary much with MB coding mode and can beignored , the n Equati on 16.23 can be approxim ately repl aced by

RMB0 ¼ Rmv þ [K=(Cqs þ q2s )]Vres (16:28)

The value of RMB0 reflects the variable portion of bit usage that is dependent on codingmode, and can be used as the measure for selecting the coding mode in our encoder. For agiven quantizer step size, the mode resulting in the smallest value of RMB0 is chosen as thebest mode. It is obvious that in the use of this new measurement to select the coding mode,the computational complexity increase over the Test Model method is very slight (the sameidentical calculation for Vres is made in the Test Model).

16.6 Statistical Multiplexing Operations on Multiple Program Encoding

In this section, the strategies for StatMux operation on the multiple program encoding areintroduced. This topic is an extension of rate control into the case of multiple program

� 2007 by Taylor & Francis Group, LLC.

Page 421: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

encoding. First, a background survey of general encoding and multiplexing modes isreviewed. Second, the specific algorithm used in some current systems has been intro-duced; its shortcomings are addressed and possible amendments to the basic algorithm aredescribed. Some potential research topics such as modeling strategies and methods forsolving the problem are proposed for investigation. These topics may be good researchtopics for the interested graduate student.

16.6.1 Background of Statistical Multiplexing Operation

In many applications, several video sources may often be combined, or multiplexed onto asingle link for transmission. At the receiving end, the individual sources of data from themultiplexed data are demultiplexed and supplied to the intended receivers. For example,in an asynchronous transfer mode (ATM) network scenario many video sources originat-ing from a local area are multiplexed onto a wide area backbone trunk. In a satellitebroadcasting scenario, several video sources are multiplexed for transmission through atransponder. In a cable TV scenario, hundreds of video programs are broadcasted onto acable bus. Because the transmission channel, such as a trunk, a transponder, or a cable, isalways an expensive resource, the limited channel capacity should be exploited as much aspossible. The goal of StatMux encoding is to make the best use of the limited channelcapacity as possible. There are several approaches to encoding and multiplexing a pluralityof video sources. In the following, we compare the methods and describe the situationwhere each method is applicable. The qualitative comparisons are made in terms of trade-offs among factors of computation, implementation complexity, encoded picture quality,buffering delay, and channel utilization. To understand the StatMux method, we introducea simple case of deterministic multiplexing function of CBR encoder. The standard methodfor performing the encoding and multiplexing function is to independent encode sourcewith a CBR. The CBR encoder produces an encoded bitstream, representing the videosupplied to it, at a predetermined CBR. To produce CBR, the CBR encoder utilizes a ratebuffer and feedback control mechanism that continually modifies the amount of quantiza-tion applied to the video signal as shown in Figure 16.26.

The CBR encoder provides a CBR with varying encoded picture quality. This means thatthe degree of quantization applied depends upon the current frame’s coding complexityoffered to the MPEG compression algorithm. Fine quantization is then applied to those

MPEGencoder

MPEGencoder

Ratecontrol

Ratecontrol

C

CBR

CBR

FIGURE 16.26Independent encoding=muxing of CBR sources.

� 2007 by Taylor & Francis Group, LLC.

Page 422: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

frames that have low spatial and temporal coding complexity, and conversely coarsequantization is applied to frames that possess high spatial and temporal coding complexityto meet the bit rate. However, varying the quantization level corresponds to varying thevideo quality. Thus, in a CBR encoder, spatial and temporal complexity tends to beencoded in such a manner that the subjective quality of the reproduced image is lowerthan that of less complex images. This makes any form of rate control inherently bad in thesense that control is always imposed in a direction contrary to the goal of achievinguniform image quality. Usually, bit rates for CBR encoders are chosen so that the moder-ately difficult scenes can be coded to an acceptable quality level. Given that moderatelydifficult scenes give good results, then all simpler scenes will yield even better results withthe given rate, while very difficult scenes will result in noticeable degradation. As CBRencoders produce CBR, the multiplexing of a plurality of sources is very simple. Therequired channel capacity would simply be the sum of all the individual CBRs. Determinis-tic time or frequency division multiplexing of the individual CBR bitstreams onto thechannel in a well-known and simple process. So with uniform CBR encoding, consistentimage quality is impossible for the video sequence with varying scene complexity but thereward is the ease of multiplexing. The penalty of CBR coding with easy multiplexing maynot only result in the nonuniform picture quality but also result in lower efficiency ofchannel bandwidth employment. Better efficiency can be gained by StatMux, wherebyeach source is encoded at a VBR coding approach. The VBR coding will result in uniformor consistent coded image quality by fixing the quantization scale or by modulatingquantization scale to a limited extent according to activity masking attributes of thehuman visual system (HVS). Then the bit rates generated by VBR coding vary with theincoming video source material’s coding complexity. The StatMux is referred to as StatMuxin short. The coding gain of StatMux is possible through sharing of the channel resourcejointly among the encoders. For example, two MPEG encoders may assign the appearanceof their I-pictures at different time, this may reduce the limitation of the maximum channelbandwidth requirement because coding I-picture may generate a large number of bits. Thismay not be a good example for practical applications. However, this explains that theprocess of StatMux is not a zero-sum game whereby one encoder’s gain must be exactlyanother encoder’s loss. In the process of StatMux, one encoder’s gain is obtained by usingthe channel bandwidth, which another encoder does not need at that time or would bring avery marginal gain for another encoder at that time. More exactly, this concept of gainsthrough sharing arises when the limited amount of bits is dynamically appropriatedtoward encoders that can best utilize those bits in substantially improving its image qualityduring complex segments and eschewed from encoders that can improve its imagequality only marginally during easy segment. It is obvious that the CBR-encoded sourcesdo not need StatMux because the bandwidth for each encoded source is well defined. Thegain of StatMux can only be possibly obtained with VBR-encoded sources. In the followingsection, we discuss two kinds of multiplexing with multiple VBR-encoded sources.

16.6.2 VBR Encoders in StatMux

There are two multiplexing methods for encoding multiple sources with VBR encoders,open-loop and close-loop. Each VBR encoder in open-loop multiplexing mode producesthe most consistently uniform predefined image quality level regardless of the codingcomplexity of incoming video sources. The image quality is decided by fixing quantizationscale. When the quantization scale is fixed, the SNR is fixed under assumption of whiteGaussian quantization noise. Sometimes, the quantization scale is slightly modulatedaccording to the image activity to match HVS for example in the method in MPEG-2TM5. The resulting VBR process is generated by allowing the encoder to freely use,

� 2007 by Taylor & Francis Group, LLC.

Page 423: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

however, many bits needed to meet the predetermined quality level. Usually, each videosource encoded by VBR encoder in open-loop mode is not geographically colocated andcannot be encoded jointly. However, the resulting VBR processes do share the channeljointly, in the sense that the total channel bandwidth is not rigidly allocated among thesources in a fixed manner such as done in CBR operation mode where each source has thefixed portion of channel bandwidth. The instantaneous combined rates of all the VBRencoders may exceed the channel capacity; especially, in the case when all the encodersgenerate the bursts of bits at the same time the joint buffer will overflow, thereby leading toloss of data. However, there always still exists a possibility to more efficiently utilize thechannel capacity by carefully allocating the loading conditions without losing of data. Butthe totally open-loop VBR coding is not stationary and it is hard to achieve both goodchannel utilization and very limited data loss. A practical method of VBR transmission foruse in ATM environment involves placing limitations to the degree of variability allowedin VBR processes. Figure 16.27 illustrates the idea of self-regulating VBR encoders.

The difference between proposed VBR encoder and totally open-loop VBR encoder isthat a looser form of rate control is imposed to the VBR encoder to avoid violatingtransmission constraints that are agreed to by the user and the network as part of thecontract negotiated during the call setup stage. The rate control will match the policingfunction, which is enforced by the network. Looser rate control means that the rate controlis not as strict as the one in CBR case because it allows for the encoder to vary its output bitrate according to the coding complexity up to a certain degree as decided by the policingfunction.

In some applications such as the TV broadcasting or cable TV, the video sources may begeographically colocated at the same site. In such scenarios, additional gains can berealized by the StatMux in which the sources are jointly encoded and jointly multiplexed.By using a common rate controller, all encoders operate in VBR mode, but withoutcontending and stepping over one another as in independent VBR encoding and multi-plexing. The joint rate controller assigns the total available channel capacity to eachencoder so that a certain common quality level is maintained. The bit rates assigned toeach individual encoder by joint rate control dynamically change based on the codingcomplexities of each video source to achieve the most uniform quality among the encoders

Ratecontrol

MPEGencoder

Policingfunction

VBR

VBR

VBR

VBR

C

Network switch

Ratecontrol

MPEGencoder

Policingfunction

Policingfunction

Policingfunction

FIGURE 16.27Independent encoding=muxing of geographically dispersed VBR sources.

� 2007 by Taylor & Francis Group, LLC.

Page 424: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Colocated encoders

Joint rate control

MPEG

Mux policy/ packet scheduler

VBR

VBR

C

MPEGencoder

MPEGencoder

FIGURE 16.28Method of joint rate control and multiplexing.

and along time for each encoder. In such a joint rate control method, although each encoderproduces its own VBRs, the sum of bits produced by all encoders combined together is aCBR to fit the channel capacity. Such an idea is shown in Figure 16.28.

16.6.3 Research Topics of StatMux

The major problem of StatMux is how to allocate the bit rate resource among the videosources which share the common channel bit rate and are jointly encoded by a joint ratecontroller. This allocation should be based on the coding complexity of each source. The bitrate, Ri(t), for encoder i at time t according to the normalized coding complexity of allencoders for the GOP period ending at time t such as

Ri(t) ¼ Xi(t)PNj¼1

Xj(t)�C (16:29)

where Xi(t) is the coding complexity of source for encoder i at the time t over a GOP periodand C is the total channel capacity. Also the bit rate assignment has to be updated fromtime to time to trace the variation of source complexity. In the following, we will discussseveral topics which may be the research topics for graduate students.

16.6.3.1 Forward Analysis

Without forward analysis, scene transitions are unanticipated and lead to incorrect bitallocation for a brief transient period following the scene changes. If the bit allocation ofcurrent video segment is based on the complexity of previous video segment and isadjusted by the available bit rate resource, those video segments which change fromeasy coding complexity to difficult coding complexity suffers the greatest degradationwithout pre-analysis of upcoming increased complexity. Pre-analysis could be performedwith a dual set of encoders operating with a certain preprocessing delay ahead of the actualencoding process. As a simple example, we start to assign the equal portion of bit rate foreach encoder, then we can obtain the average quantization scale for this GOP that can beconsidered as the forward analysis results of coding complexity. The real coding processcan operate on the coding complexity obtained by the pre-analysis. If we choose one or two

� 2007 by Taylor & Francis Group, LLC.

Page 425: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

GOPs according to the synchronous status of the input video sources to perform the pre-analysis, it will result in small buffering delay.

16.6.3.2 Potential Modeling Strategies and Methods

Several modeling strategies and methods have been investigated to find a suitable pro-cedure for classifying sources and determining what groups of sources can appropriatelybe jointly encoded together for transmission over a common channel as so to meet aspecified image quality level. These modeling strategies and methods include modelingof video encoding, modeling of source coding complexity, and source classification. Themodeling of video encoding algorithm involves measuring the operating performance ofthe individual encoders or characterizing its rate distortion function for a variety of scenes.Embodied into this model are the MPEG algorithms implemented for motion estimation,mode decision, rate control, and their joint optimization issues. It has been speculated thata hyperbolic functional form of

Rate ¼ X=distortion (16:30)

would be appropriate over the normal operating bit rate range of 3–7 Mbits=s for MPEG-2-encoded ITU-R 601-sized videos. The hyperbolic shape of rate distortion curves would bealso suitable for all video scenes. Actually, we can use a set of collected rate distortion datapairs with an encoder to fit a hyperbola through the points as shown in Figure 16.29 andestimate the shape parameter X. The value of X will be used to present the codingcomplexity offered to an encoder. For modeling at the GOP level, the rate would be thenumber of bits used to code that GOP and the distortion can be chosen as the averagingquantization scale over the GOP. In some literatures, the distortion is taken as the averagePSNR over the GOP or overall sequence. If it is assumed that the quantization noise ismodeled by white Gaussian noise then both distortion measures are equivalent.

After obtaining the correct coding complexity parameters, we can improve the StatMuxalgorithm by assigning an encoding bit budget to each encoder based on the GOP levelnormalized complexity measure X that each encoder is encoding. The GOP level normal-ized complexity measure X(n) is defined as

X(n) ¼X

i2GOP

T(i)Q(i) (16:31)

wheren is the GOP numberT(i) is the total number of bits used for encoding picture iQ(i) is the average quantization scale used for encoding picture i

Some research results have shown that the X(n) is insensitive to the operating bit rate;therefore, X(n) is a reliable measure of a video source’s loading characteristics. Therefore,

FIGURE 16.29Rate distortionmodeling of encoding algorithmand video source.

Bits per GOP

Averaging quantizationscale over GOP

� 2007 by Taylor & Francis Group, LLC.

Page 426: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

the study of accurate model of the random process of X(n) is very important for improvingthe operations of the StatMux algorithm. The accurate model of X(n) reflects the videosource’s loading characteristics which dictates the share of total bit budget that an encoderexpects to get. Several statistical models have been proposed to describe the complexitymeasure, X(n). For example, an auto-regressive process model is proposed for the intra-scene X(n) process. This proposed model is based on the following observations; thecomplexity measure within a single scene has a skewed distribution by the Gammafunction, and furthermore, the complexity measure within a scene displays a strongtemporal correlation and the form of the correlation is essentially exponential. The defini-tion for the Mth order auto-regressive model is

X(n) ¼XMm¼1

a(m) �X(n�m)þ e(n) (16:32)

wheree(n) is the white noise processa(m)s are the innovation filter coefficients

The statistics of the model such as the mean value, the variance, the correlation, andmarginal distribution are used to match those of actual signals by adjusting a(m)s, e(n),and M. Other cases, such as scene transition model and intercoded scene models, we leaveas the project topics for the graduate students.

16.7 Summary

In this chapter, the technical detail of MPEG video was introduced. The technical detail ofMPEG standards includes the decoding process of MPEG-1 and MPEG-2 video. Althoughthe encoding process is not a standard part, it is very important for the content providersand service providers. We discussed the most important parts of encoding techniques.Some examples such as the joint optimizing of mode decision and rate control are goodexamples to understand how the standard is used.

Exercises

1. According to your understanding, give several reasons to explain why the MPEGstandards specify only the decoding as normative part and define the encoding asinformative part (Test Model).

2. Can an MPEG-2 video decoder decode a bitstream generated by an MPEG-1 videoencoder? Summarize the main difference between the MPEG-1 and MPEG-2video standards.

3. Pre-filtering may reduce the noise of original video source and increase the codingefficiency. But at the same time the pre-filtering will result in a certain information loss.Conduct a project to investigate at what bit rate range the pre-filtering may benefit thecoding efficiency for some video sources.

4. Use TM5 rate control to encode several video sequences (such as Flower Gardensequence) in two ways: (a) with adaptive quantization step and (b) without adaptive

� 2007 by Taylor & Francis Group, LLC.

Page 427: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

quan tization step (E quation 16.16). Comp are and discuss the numerical result s andsubj ective result s (observe the sm ooth areas care fully).

5. Why does MPEG -2 use s different quantiz er matrice s for intra- and intercodi ng? Con-duct a proje ct to use different quan tization matrice s to enc ode seve ral vide o sequ encesand report the results.

6. Conduct a project to encode several video sequences (a) with B-picture and (b) withoutB-picture. Compare the numerical and subjective results. Observe what difference existsbetween the sequences with fast motion and the sequences with slow motion. (Typicalbit rates for ITU-R 601 sequences are 4–6 Mbits=s.)

References

[cano 1983] D. Cano and M. Benard, 3-D Kalman filtering of image sequences, in Image SequenceProcessing and Dynamic Scene Analysis, T.S. Huang (Ed.), Berlin, Springer, 1983, pp. 563–579.

[haskell 1997] B.G. Haskell, A. Puri, and A.N. Netravali, Digital Video: Introduction to MPEG-2,Chapman and Hall, New York, 1997.

[katsaggelos 1991] A.K. Katsaggelos, R.P. Kleihorst, S.N. Efstratiadis, and R.L. Lagendijk, Adaptiveimage sequence noise filtering methods, Proceeding of SPIE Visual Communication and ImageProcessing, Boston, MA, pp. 10–13, November 1991.

[katto 1994] Jiro Katto, Jun-ichi Ohki, Satoshi Nogaki, and Mutsumi Ohta, A wavelet codec withoverlapped motion compensation for very low bit rate enviroment, IEEE Transactions on Circuitsand Systems for Video Technology, 4, 3, 328–338, June 1994.

[koc 1998] U.-V. Koc and K.J.R. Liu, DCT-based motion estimation, IEEE Transactions on ImageProcessing, 7, 948–965, July 1998.

[lee 1994] J. Lee and B.W. Dickerson, Temporally adaptive motion interpolation exploiting temporalmasking in visual perception, IEEE Transactions on Image Processing, 3, 5, 513–526, September1994.

[mitchel 1997] J.L. Mitchell, W.B. Pennebaker, C.E. Fogg, and D.J. LeGall, MPEG Video CompressionStandard, Chapman and Hall, New York, 1997.

[mpeg1] ISO=IEC 11172, International Standard, 1992.[mpeg2] ISO=IEC 13818 MPEG-2 International Standard, Video Recommendation ITU-T H.262,

January 10, 1995.[ozkan 1993] M.K. Ozkan, M.I. Sezan, and A.M. Tekalp, Adaptive motion compensated filtering

of noisy image sequences, IEEE Transactions on Circuits and Systems for Video Technology, 3, 4,277–290, August 1993.

[ramchandran 1994] Kannan Ramchandran, Antonio Ortega, and Martin Vetterli, Bit Allocation forDependent Quantization with Application to MPEG Video Coders, IEEE Transactions on ImageProcessing, 3, 5, 526, 533–545, September 1994.

[sezan 1991] M.I. Sezan, M.K. Ozkan, and S.V. Fogel, Temporal adaptive filtering of noisy imagesequences using a robust motion estimation algorithm, IEEE ICASSP, 2429–2432, 1991.

[sun 1994] H. Sun, Sarnoff Internal Technical Report, May 1994.[sun 1997] H. Sun, W. Kwok, M. Chien, and C.H. John Ju, MPEG coding performance improvement

by jointly optimization coding mode decision and rate control, IEEE Transactions on Circuitsand Systems for Video Technology, 7, 3, 449–458, June 1997.

[tm5] MPEG-2 Test model 5, ISO-IEC=JTC1=SC29=WG11, April, 1993.[wang 1995] L. Wang, Rate control for MPEG-2 video coding, SPIE on Visual Communications

and Image Processing, Taipei, Taiwan, pp. 53–64, May 1995.[young 1993] R.W. Young and N.G. Kingsbury, Frequency-domain motion estimation using a

complex lapped transform, IEEE Transactions on Image Processing, 2, 1, 2–17, January 1993.

� 2007 by Taylor & Francis Group, LLC.

Page 428: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

17Application Issues of MPEG-1 = 2 Video Coding

This chap ter is an exten sion of Chapte r 16. We will introd uce several importan t appl icationissues of MPEG-1=2 video that include the Advanced Television Standard Committee(ATSC) DTV standard which has been adopted by the Federal Communications Commis-sion (FCC) as TV standard in the United States: transcoding, down-conversion decoder,and error concealment.

17.1 Introduction

Digital video signal processing is an area of science and engineering that has developedrapidly over the last decade. The maturity of the moving picture expert group (MPEG)video coding standard is a very important achievement for the video industry and pro-vides a strong support for digital transmission and storage of video signals. The MPEGcoding standard is now being deployed for a variety of applications, which include highdefinition television (HDTV), teleconferencing, direct broadcasting by satellite (DBS),interactive multimedia terminals, and digital video disc (DVD). The common feature ofthese applications is that the different source information such as video, audio, and dataare all converted to the digital format and then mixed together to a new format that isreferred to as the bitstream. This new format of information is a revolutionary change inthe multimedia industry, since the digitized information format, i.e., the bitstream, can bedecoded by not only the traditional consumer electronic products such as television, butalso the digital computer. In this chapter, we will present several application examples ofMPEG-1=2 video standards, which include the ATSC DTV standard, transcoding, down-conversion decoder, and error concealment. The DTV standard is the application extensionof MPEG video standard. The transcoding and down-conversion decoders are the practicalapplication issues which increase the features of compression-related products. The errorconcealment algorithms provide the tool for transmitting the compressed bitstream overnoisy channels.

17.2 ATSC DTV Standards

17.2.1 A Brief History

The birth of digital television (DTV) in the United States has undergone several stages,which are the initial stage, the competition stage, the collaboration stage, and the approvalstage [reitmeier 1996]. The concept of HDTV was proposed in Japan in the late 1970s and

� 2007 by Taylor & Francis Group, LLC.

Page 429: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

early 1980s. During that period, Japan and Europe continued to make their efforts in thedevelopment of analog television transmission systems such as MUSE and HD-MACsystems. In early 1987, U.S. broadcasters fell behind in this field and felt they shouldtake action to catch up with the new HDTV technology and petitioned the FCC to reservespectrum for terrestrial broadcasting of HDTV. As a result, the Advisory Committee onAdvanced Television Service (ACATS) was founded in August 1987. This committee takesthe role of recommending a standard to the FCC for approval. Thus, the process ofselecting an appropriate HDTV system for the United States started. At the initial stagebetween 1987 and 1990, there were over 23 different analog systems proposed; amongthese systems two typical approaches were extended definition television (EDTV) that fitsinto a single 6 MHz channel, and HDTV approach that requires two 6 MHz channels. By1990, ACATS had established the Advanced Television Test Center (ATTC), an officialtesting laboratory sponsored by broadcasters to conduct extensive laboratory tests inVirginia and field tests in Charlotte, North Carolina. Also, the industry had formed theAdvanced Television Standards Committee (ATSC) to perform the task of drafting theofficial standard documents of the selected winning system.

As we know, the current ATSC proposed television standard is a digital system. In theearly 1990s, the FCC issued a very difficult request to the industry about DTV standard.The FCC required the industry to provide full quality HDTV service in a single 6 MHzchannel. Having recognized the technical difficulty of this requirement at that time, theFCC also stated that this service could be provided by a simulcast service in whichprograms would be simultaneously broadcasted in both NTSC and the new televisionsystem. However, the FCC decided not to assign new spectrum bands for televisions. Thismeans that simulcasting would occur in the already crowded VHF and UHF spectrum. Thenew television system had to use low-power transmission to avoid excessive interferenceinto the existing NTSC services. In addition, the new television system had to use a veryaggressive compression approach to squeeze full HDTV signal into 6 MHz spectrum. Onegood thing was that backward compatibility with NTSC was not required. Actually, underthese constraints the backward compatibility had already become impossible. Moreover,this goal could not be achieved by any of the previously proposed system and it causedmost of the competing proponents to reconsider their approaches. Engineers realized thatit was almost impossible to use the traditional analog approaches to reach this goal andthat the solution may be in digital approaches. After a few months of considering, GeneralInstrument announced their first digital system proposal for HDTV, DigiCigher, in June1990. In the middle of the following year, three other digital systems were proposed: theAdvanced Digital HDTV by the Advanced Television Research Consortium, whichincluded Thomson, Philips, Sarnoff, and NBC in November 1990; Digital Spectrum Com-patible HDTV by Zenith and AT&T in December 1990; and Channel Compatible Digici-pher by General Instrument and the Massachusetts Institute of Technology in January1991. Thus the competition stage started. The prototypes of four competing digital systemsand the analog system, Narrow MUSE, proposed by NHK, were officially tested andextensively analyzed during 1992. After the first round of tests, they concluded that thedigital systems would be continued for further improvement and would be adopted. InFebruary 1992, the ACATS recommended digital HDTV for the U.S. standard. It alsorecommended that the competing systems be either further improved and retested, or becombined with a new system. In the middle of 1993, the former competitors joined a GrandAlliance. Then the DTV development entered the collaboration stage. The Grand Alliancebegan a collaborative effort to create the best system which combines the best features andcapabilities of the formerly competing systems into a single best-of-the-best system. Afterone-year of joint effort by the seven Grand Alliance members, the Grand Alliance provideda new system that was prototyped and extensively tested in the laboratory and field.

� 2007 by Taylor & Francis Group, LLC.

Page 430: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

The test results showe d that the syste m is indeed the best of the best compar ed withforme rly com peting systems [g a 19 94]. Th e ATSC the n rec ommen ded this syste m to theFCC as the ca ndidate HDTV stand ard in the United Stat es. Du ring the follo wing period,the compute r indu stry rea lized that DTV provid es the signal s that can no w be use d forcompute r appl ications and the TV industry was invad ing the ir ter rain. They pres enteddifferent opi nion about the signal format and where especial ly opposed to the interlacedformat. This reaction delay ed the appro val of the AT SC stand ard. Aft er long-t ime debate ,the FCC fi nally approv ed the ATSC standard in early 1997. But, the FCC did no t specif ythe picture formats and left this iss ue to be decided by the m arket.

17.2.2 Techn ical Overv iew of ATSC Systems

The ATSC DTV syste m has been designed to satisfy the FCC requi remen ts. Th e basicrequi rement is that no addi tional fre quency spectrum will be assig ned for DTV broad cast-ing. In ot her words , during a transition period, bot h NTSC and DTV service will besimult aneous ly bro adcast on different chann els and DT V can only use the taboo channels .This approach allows a sm ooth transi tion to DTV , such that the service s of the existi ngNTSC receiver s will rem ain and gr adually be phased out of exis tence in the year 2006. Thesimulca sting requi remen t causes some tech nical dif ficulties of DTV design . First, the high-qualit y HDTV progra m must be del ivered in a 6 MHz chann el to make ef fi cient use ofspectru m and fi t allo cation plans fo r spe ctrum assigne d to television br oadcast ing. Seco nd,a low- power and low-in terferen ce signal must be used so that simu lcasting in the samefrequency allocati ons as cur rent NT SC servi ce does no t cause excessive interfer ence to theexisti ng NTSC rec eiving, since the tabo o chann els are generall y unsu itable for bro adcastin gan NTSC signal due to high int erferenc e. Beside s satisfyi ng the fre quency spectrumrequi rement, the DTV stand ard has seve ral impo rtant featur es that allo w DTV to achi eveinter operabi lity with computers and data com munica tions. Th e first fea ture is the adopti onof a layered digit al syste m archite cture. Each individual layer of the system is designed tobe interope rable with ot her systems at the corre spondin g layers. For exa mple, the squarepixel and progres sive scan pictu re fo rmat should be provi ded to allow com puters access tothe com pression laye r or picture layer depen ding on the capa city of compu ters and theATM-li ke packet fo rmat for ATM net work to access the transp ort laye r. Second, the DTVstandard uses a header =descri ptor appro ach to provid e maxi mum fl exible operati ngcharac teristics . Ther efore, the layered archite cture is the mo st impo rtant fea ture of DTVstandards. The additional advantage of layering is that the elements of the system can becombined with other technologies to create new applications. The system of DTV standardincludes four layers: picture, compression, transport, and transmission.

17.2.2.1 Picture Layer

At the picture layer the input video formats have been defined. The Executive Committeeof the Advanced Television Systems Committee has approved to release the statementregarding the identification of the HDTV and SDTV transmission formats within the ATSCDTV standards. There are six video formats in the ATSC DTV standard, which are HDT(Table 17.1).

The remaining 12 video formats are not HDTV format. These formats represent someimprovements over analog NTSC and are referred to as standard definition television(SDT V) (Table 17.2).

These definitions are fully supported by the technical specifications for the variousformats as measured against the internationally accepted definition of HDTV establishedin 1989 by the International Telecommunication Union (ITU) and the definitions cited by

� 2007 by Taylor & Francis Group, LLC.

Page 431: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

TABLE 17.1

HDTV Formats

Spatial Format (X3Y Active Pixels) Aspect Ratio Temporal Rate

19203 1080 (square pixel) 16:9 23.976=24 Hz progressive scan29.97=30 Hz progressive scan59.94=60 Hz interlaced scan

12803 720 (square pixel) 16:9 23.976=24 Hz progressive scan29.97=30 Hz progressive scan59.94=60 Hz progressive scan

the FCC during the DTV standard develop process. These formats cover a wide variety ofapplications, which include motion picture film, currently available HDTV productionequipment, the NTSC television standard, and computers such as personal computersand workstations. However, there is no simple technique that can convert images fromone pixel format and frame rate to another that achieve interoperability among film and thevarious worldwide television standards. For example, all low cost computers use squarepixels and progressive scanning while current television uses rectangular pixels andinterlaced scanning. The video industry has paid a lot of attention to developing the formatconverting techniques. Some techniques such as de-interlacing, down=up conversion forformat conversion have already been developed. It should be noted that the broadcasters,content providers, and service providers can use any one of these DTV format. This resultsin a difficult problem for DTV receiver manufacturers who have to provide all kinds ofDTV receivers to decode all these formats and then to convert the decoded signal to itsparticular display format. On the other hand, this requirement also gives receiver manu-facturers the flexibility to produce a wide variety of products that have different function-ality and cost, and the consumers freedom to choose among them.

17.2.2.2 Compression Layer

The raw data rate of HDTV of 19203 10803 303 16 (16 bits=pixel corresponds to 4:2:2color format) is about 1 Gbits=s. The function of the compression layer is to compress theraw data from about 1 Gbits=s to the data rate of approximately 19 Mbits=s to satisfy the 6MHz spectrum requirement. This goal is achieved by using the main profile and high levelof MPEG-2 video standard. Actually, during the development of Grand Alliance HDTVsystem, many research results have been adopted by the MPEG-2 standard at the sametime. For example, the support for interlaced video format and the syntax for datapartitioning and scalability. The ATSC DTV standard is the first and important applicationexample of the MPEG-2 standard. The use of MPEG-2 video compression fundamentally

TABLE 17.2

SDTV Formats

Spatial Format (X3Y Active Pixels) Aspect Ratio Temporal Rate

7043 480 (CCIR 601) 16:9 or 4:3 23.976=24 Hz progressive scan29.97=30 Hz progressive scan59.94=60 Hz progressive scan

6403 480 (VGA, square pixel) 4:3 23.976=24 Hz progressive scan29.97=30 Hz progressive scan59.94=60 Hz progressive scan

� 2007 by Taylor & Francis Group, LLC.

Page 432: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

enab les AT SC DTV devi ces to interope rate with MPEG -1 =2 com puter mul timedia appli-cation s directly at the compr essed bitstream level.

17.2.2. 3 Transport Layer

The transp ort layer is another importan t iss ue for interope rabili ty. Th e ATSC DTV trans-port layer uses the MPEG -2 syste m transpo rt st ream syn tax. It is a fully com patibl e subs etof the MPEG -2 transpo rt protoco l. The ba sic function of transp ort layer is to de fine thebasic format of data packets. The purp oses of packeti zation inc lude:

. Packa ging the data into the fixed size cells or packe ts for forward error correcti on(FE C) enc oding to protect the bit- error due to the comm unication chann el no ise

. Mul tiplexing the video, audi o, and data of a progra m into a bitstr eam

. Provi ding tim e syn chronizati on fo r differe nt medi a el ements

. Provi ding fl exibility and exten sibili ty wi th backwar d com patibility.

The transport layer of ATSC DTV uses a fixed length (FL) packet. The packet size is 188 bytesconsisting of 184 bytes of payload and 4 bytes of header. Within the packet header, the 13-bitpacket identifier (PID) is used to provide the important capacity to combine the video, audio,and ancillary data stream into a single bitstream as shown in Figure 17.1. Each packet containsonly a single type of data (video, audio, data, program guide, etc.) identified by the PID.

This ty pe of packe t struc ture packetiz es the vide o, audio, and aux iliary data sep arately.It also provides the basic multi plexing functi on that produces a bitste am inc luding video,five- channel surround -sound audi o, and an auxiliar y data capacity. This kind of transp ortlayer approach also provid es com plete fl exibility to allo cate chann el capa city to achi eveany mix among video, audi o, and other data services. It should be noted that the selecti onof 188-pack et le ngth is a trade -off betwe en reduci ng the overh ead due to the transp ortheader and increasing the ef ficie ncy of error c orrection . Also, one ATSC DTV packet can becomple tely encapsu lated with its hea der wi thin four ATM packets by using 1 AAL byt e perATM header leaving 47 usabl e paylo ad bytes times 4 for 188 bytes. The details of thetranspo rt laye r will be discusse d in Chapter 21.

17.2.2. 4 Transm ission Layer

The functi on of transmi ssion layer is to mo dulate the transpo rt bitstr eam into a signal thatcan be transmitt ed over 6 MHz anal og chann el. Th e ATSC DTV system use s a trellis-c oded8-level vestig ial sideban d (8-V SB) modul ation tech nique to deliver app roximate ly 1 9.3Mbits=s in the 6 MHz terrestrial simulcast channel. VSB modulation inherently requiresonly processing of the in-phase signal sampled at the symbol rate, thus reducing the

184-byte payload

188-byte packet

Video Audio Video Video Audio PGM GD Video

4-byte packet header

FIGURE 17.1Packet structure of ATSC DTV transport layer.

� 2007 by Taylor & Francis Group, LLC.

Page 433: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

complexity of receiver, and ultimately the cost of implementation. The VSB signal isorganized in a data frame that provides a training signal to facilitate channel equalizationfor removing multipath distortion. However, from several field test results, the multi-path distortion is still a serious problem of terrestrial simulcast receiving. The frame isorganized into segments each with 832 symbols. Each transmitted segment consists ofone-synchronization byte (four symbols), 187 data bytes, and 20 R-S parity bytes. Thiscorresponds to a 188-byte packet, which is protected by 20-byte R-S code. Interoperabilityat the transmission layer is required by different transmission media applications. Thedifferent media use different modulation techniques now, such as QAM for cable andQPSK for satellite. Even for terrestrial transmission, European DVB systems use OFDMtransmission. The ATV receivers will not only be designed to receive terrestrial broadcasts,but also the programs from cable, satellite, and other media.

17.3 Transcoding with Bitstream Scaling

17.3.1 Background

As indicated in the previous chapters, digital video signals exist everywhere in the formatof compressed bitstreams. The compressed bitstreams of video signals are used for trans-mission and storage through different media such as terrestrial TV, satellite, cable, ATMnetwork, and the Internet. The decoding of a bitstream can be implemented in eitherhardware or software. However, for high bit rate compressed video bitstreams of high-definition video signals, specially designed hardware is still the major decoding approachdue to the speed limitation of current computer processors. The compressed bitstream as anew format of video signal is a revolutionary change of video industry since it enablesmany applications. For example, the coded video bitstreams can be decoded not only withdigital televisions or set-top boxes, but also with computers and mobile terminals, such ascellular phones. Therefore, the problem of interactivity and integration of video data withcomputer, cellular, and television systems is relatively new and subject to a great deal ofresearch worldwide. As the number of networks, types of devices, and content represen-tation formats increase, interoperability between different systems and different networksis becoming more important. Thus, devices such as gateways, multipoint control units, andservers must be developed to provide a seamless interaction between content creation andconsumption. Transcoding of video content is one key technology to make this possible.Generally speaking, transcoding can be defined as the conversion of one coded signal toanother. In the earliest work on transcoding, the majority of interest focused on reducingthe bit rate to meet an available channel capacity. Additionally, researchers investigatedconversions between constant bit rate (CBR) streams and variable bit rate (VBR) streams tofacilitate more efficient transport of video. As time moved on and mobile devices withlimited display and processing power became a reality, transcoding to achieve spatialresolution reduction, as well as temporal resolution reduction, has also been studied.Furthermore, with the introduction of packet radio services over mobile access networks,error-resilience video transcoding has gained a significant amount of attention lately,where the aim is to increase the resilience of the original bit stream to transmission errors.Also in some applications, the syntax conversion is needed between different compressionstandards such as JPEG, MPEG-1, MPEG-2, H.261, H.263, and H.264=AVC. In this section,we will focus on the topic of bit rate conversion since it finds wide application and thereaders can extend the idea for other kinds of transcoding. Also, we limit ourselves to focus

� 2007 by Taylor & Francis Group, LLC.

Page 434: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

on the problem of scaling an MPEG CBR encoded bitstream down to a lower CBR.A comprehensive survey of video transcoding can be found in [vetro 2003].

The basic function of bitstream scaling may be thought as a black box, which passivelyaccepts a precoded MPEG bitstream at the input and produces a scaled or size-reducedbitstream, which meets new constraints that are not known a priori during the creation ofthe original precoded bitstream. The bitstream scaler is a transcoder, or filter, that providesa match between an MPEG source bitstream and the receiving load. The receiving loadconsists of the transmission channel, the destination decoder, and perhaps a destinationstorage device or display. The constraint on the new bitstream may be bound by a varietyof conditions. Among them these conditions include the peak or average bit rate imposedby the communications channel, the total number of bits imposed by the storage device,and the variation of bit usage across pictures due to the amount of buffering available atthe receiving decoder. While the idea of bitstream scaling has many concepts similar tothose provided by the various MPEG-2 scalability profiles, the intended applications andgoals differ. The scalable video coding (SVC) methods are aimed at providing encoding ofsource video into multiple service grades (that are predefined at the time of encoding) andmultitiered transmission for increased signal robustness. The multiple bitstreams createdby MPEG SVC are hierarchically dependent in such a way that by decoding an increasingnumber of bitstreams, higher service grades are reconstructed. Bitstream transcodingmethods, in contrast, are primarily decoder=transcoder techniques for converting an exist-ing precoded bitstream to another one that meets new rate constraints. Several applicationsthat motivate bitstream scaling or transcoding include:

1. Video on-demand: Consider a video on-demand (VOD) scenario wherein a video file-server includes a storage device containing a library of precodedMPEG bitstreams.These bitstreams in the library are originally coded at a high quality (e.g., studioquality). A number of clients may request retrieval of these video programs at oneparticular time. The number of users and the quality of video delivered to the usersare constrained by the outgoing channel capacity. This outgoing channel, whichmay be a cable bus or an ATM trunk, for example, must be shared among the userswho are admitted to the service. Different usersmay require different levels of videoquality, and the quality of a respective program will be based on the fraction of thetotal channel capacity allocated to each user. To simultaneously accommodate aplurality of users, the video file-server must scale the stored precoded bitstreams toa reduced rate before it is delivered over the channel to respective users. The qualityof the resulting scaled bitstream should not be significantly degraded compared tothe quality of a hypothetical bitstream so obtained by coding the original sourcematerial at the reduced rate. Complexity cost is not such a critical factor becauseonly the file-server has to be equipped with the bitstream scaling hardware, notevery user. Presumably, video service providers would be willing to pay a high costfor delivering the possible highest quality video at a prescribed bit rate.

� 2007 b

Asanoption, a sophisticatedvideofile-servermayalsoperformscalingofmultiple originalprecoded bitstreams jointly and statistically multiplex the resulting scaled VBR bitstreamsinto the channel. By scaling the groupof bitstreams jointly, statistical gains canbe achieved.These statistical gains can be realized in the form of higher and more uniform picturequality for the same channel capacity. Statistical multiplexing over a DirecTv transponder[isnardi 1993] is one example for the application of video statistical multiplexing.

2. Trick-play track on digital VTRs: In this application, the video bitstream is reduced tocreate a sidetrack on video tape recorders. This sidetrack contains very coarse quality

y Taylor & Francis Group, LLC.

Page 435: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

video sufficient to facilitate trick-modes on the VTR (e.g., FF and REW at differentspeeds). Complexity cost for the bitstream scaling hardware is of significant concern inthis application since the VTR is a mass consumer item subject to mass production.

3. Ext ended- play recordi ng on digi tal VTRs : In this appl ication, vide o is bro adcast touse rs ’ homes at a certain broadcas t qualit y ( � 6 Mbit s=s for standard de finitionvide o and �19 M bits= s for high -defi nition video). With a bitstr eam sca ling featur ein their video tape rec orders, use rs may record the video at a reduce d rate, akin toext ended play (EP) m ode on today ’ s VHS recorders , the reby rec ording a grea terdurat ion of video progra ms onto a tape at lower qualit y. Agai n, hardw are com-plex ity costs would be a maj or factor her e.

4. Univ ersal mult imedia access ( UMA ): The concept of UMA is to ena ble access to anymul timedia content over any type of network, such as Intern et and Wir eless LAN,from any type of termin als with varyi ng capa bilitie s such as mob ile phones, per-sona l computers , and telev ision sets [mp eg21]. Th e primar y fun ction of UMAservi ces is to provi de the best QoS or use r experi ence by either selecting appropri atecon tent formats , or adapting the conte nt format directl y to meet the playb ackenv ironm ent, or to adapt the conte nt playb ack environ ment to acco mmod ate thecon tent. Tow ard the a bove goal, video transco der can bridge the mismatc h betwe enthe large number of video dat a, bandwid th of com munica tion c hannels , and cap-ab ility of end user ter minals. The concept of UMA is illus trated in Figu re 17.2.

17.3.2 Bas ic Princi ples of Bitst ream Scali ng

As describ ed previ ously, the idea of sca ling a n MPEG -2 com pressed bitstream dow n to alower bit rate is initia ted by seve ral applicat ions. One probl em is the criteria that shoul d beuse d to judge the perform ances of an architectu re that can reduce the size or rate of an MPEGcom pressed bitstream . Two basic princi ples of bitstr eam scaling are (1) the informati on inthe original bitstream should be exploited as much as possible, and (2) the resulting imagequality of the new bitstreamwith a lower bit rate should be as close as possible to a bitstreamcreated by coding the original source video at the reduced rate. Here, we assume that for agiven rate the original source is encoded in an optimalway.Of course, the implementation ofhardw are comple xity also has to be cons idered. Figu re 17.3 sh ows a simp li fied enc odingstructure of MPEG encoding in which the rate control mechanism is not shown.

Networks

Route with transcoder

FIGURE 17.2Concept of UMA.

� 2007 by Taylor & Francis Group, LLC.

Page 436: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

T Q

P

T

−1

Q−1

+

VLCInput source Bits

FIGURE 17.3Simplified encoder structure. T:transform; Q: quantizer; P: motioncompensated prediction; VLC:variable-length coding.

In this st ructure, a block of an image data is fi rst transf or med to a set of coef fi cients, thecoef ficients are the n quantiz ed wi th a quan tizer st ep that is decided by the given bit ratebudge t, or num ber of bits assi gned to thi s block . Fina lly, the quantiz ed coef ficients a re codedin vari able-len gth coding (VLC) to the binary format, which is calle d the bitstr eam or bits.

From this struc ture it is obvio us that the perform ance of changi ng the quan tizer step willbe bett er than cutting high er frequenci es whe n the sam e amo unt of rate needs to be reduce d.In the or iginal bitstr eam the coef fi cients are quan tized with fi ner quan tization steps that areoptimi zed at the or iginal high rate . After cut ting the coef ficients of high er frequenci esthe rest of coef ficie nts are not quan tized wi th an optimal quantiz er. In the met hod ofre-quan tization all coef fi cients are re-quan tized with an optimal quantiz er that is deter-mined by the reduced rate, the perform ance must be bett er than the one by cutting highfrequenci es to reach the redu ced rate. The the oretical anal ysis is given in the append ix.

In the fo llowing, seve ral different archite ctures that acco mplish the bitstr eam scaling arediscuss ed. The diffe rent met hods have varyi ng hardw are implement ation com plexity; eachhaving its own degre e of trade-o ff betw een require d hardw are and result ed image qualit y.

17.3.3 Archite ctures of Bitstream Scaling

Four archite ctures for bitstream scaling are discusse d. Each of the sca ling archite cturesdescri bed has the ir own par ticular bene fi ts that are suitable for a partic ular appl ication.

Architec ture 1: Th e bitstream is sca led by cut ting high frequenci es.

Architec ture 2: Th e bitstream is sca led by re-q uantizati on.

Architec ture 3: The bitstr eam is scaled by re- encoding the rec onstructe d pictures withmotion vectors and coding deci sion mo des extracted from the origi nalhigh-qu ality bitstream .

Architec ture 4: The bitstr eam is scaled by re- encoding the rec onstructe d pictures withmotion vector s extracted from the origi nal high-qu ality bitstr eam, butnew coding decisio ns are compute d based on reconst ructed pictures.

Archit ectures 1 a nd 2 are cons idered for VTR appl ications such as tri ck-play modes and EPrecor ding. Arch itectures 3 and 4 are cons idered fo r VOD and other applicabl e StatMu xscenar ios.

17.3.3. 1 Architectu re 1: Cutting AC Coef ficien ts

A block diagram illus trating archite cture 1 is shown in Figu re 17.4a. Th e met hod ofreduci ng the bit rate in archi tecture 1 is based on cutting the higher fre quency coef fi cients.The inc oming precod ed CBR strea m ent ers a deco der rate buffer . Follow ing the top branchleadin g from the rate buffer, a variable-length decoder (VLD) is used to parse the bits forthe next frame in the buffer to identify all the variable-length code words that correspond

� 2007 by Taylor & Francis Group, LLC.

Page 437: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

BitstreamDelay

VLD parserBit allocation

analysis

VLD parser

New bit rate

Rate controller(frequency cut)

Bits-out

(a)Cumulative bits

used for AC coefficients

Block number0

Profile of original bits

Scaled profile

New targettotal of AC bits

(b)

FIGURE 17.4(a) Architecture 1: cutting high frequencies. (b) Profile map.

to AC coefficients used in that frame. No bits are removed from the rate buffer. The codewords are not decoded, but just simply parsed by the VLD parser to determine code wordlengths. The bit allocation analyzer accumulates these AC bit-counts for every macroblock(MB) in the frame and creates an AC bit usage profile as shown in Figure 17.4b. That is, theanalyzer generates a running sum of AC DCT coefficient bits on an MB basis:

PVN ¼X

AC BITS (17:1)

where PVN is the profile value of a running sum of AC code word bits until the MB N. Inaddition, the analyzer counts the sum of all coded bits for the frame, TB (total bits). After allMBs for the frame have been analyzed, a target value TVAC of AC DCT coefficient bits perframe is calculated as

TVAC ¼ PVLS � a * TB� BEX (17:2)

whereTVAC is the target value of AC code word bits per framePVLS is the profile value at the last MBa is the percentage by which the pre-encoded bitstream is to be reducedTB is the total bitsBEX is the amount of bits by which the previous frame missed its desired target

The profile value of AC coefficient bits is scaled by the factor TVAC=PVLS. Multiplying eachPVN performs scaling by that factor to generate the linearly scaled profile shown in Figure

� 2007 by Taylor & Francis Group, LLC.

Page 438: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

17.4b. Follow ing the bot tom branch fro m the rate buffer, a delay is inserte d equal to theamoun t of tim e requi red fo r the top branch anal ysis proces sing to be comple ted fo rthe cur rent frame. A seco nd VLD par ser a ccesses and remov es all code word bits fromthe buffer and delivers them to a rate control ler. The rate contro ller receives the scaled targetbit usage pro file for the amoun t of AC bits to be used within the frame . The rate control lerhas mem ory to store all coef ficients assoc iated with the cur rent MB it is operati ng on. Alloriginal code word bits at a high er level than AC coef fi cients (i.e., all FL hea der codes ,motion vector codes , MB -type code s, etc.) are held in memor y a nd wi ll be re-mu ltiplexedwith all AC code words in that MB that have no t been excised to form the outgoing scale dbitstr eam. The rate con troller determi nes and flags in the MB code word mem ory whi ch ACcode words to keep and which to excise . AC code words are accesse d from the MB codeword mem ory in the order AC11, AC12, AC 13, AC14, AC15, AC 16, AC21, AC22, AC 23,AC24, AC 25, AC 26, AC31, AC32, AC 33, etc. , whe re AC ij denote s the i th AC code wordsfrom j th block in the MB if it is present. As the AC code words are accesse d from memor y,the respec tive code word bits are summed and conti nuously compar ed with the scaledpro file value to the curren t MB, less num ber of bits for inserti on of end- of-block (E OB) codewords . Respective AC code words are flag ged as kep t unt il the runni ng sum of AC codewords bits exceeds the scaled profile value less EOB bits. When this condition is met, allremaining AC code words are marked for being excised. This process continues until allMBs have their kept code words reassembled to form the scaled bitstream.

17.3.3.2 Architecture 2: Increasing Quantization Step

Architecture 2 is shown in Figure 17.5. The method of bitstream scaling in architecture 2 isbased on increasing the quantization step. This method requires additional dequantizer=quantizer and VLC hardware over the first method. Like the first method, it also makes afirst VLD pass on the bitstream and obtains a similar scaled profile of target cumulativecode word bits versus MB count to be used for rate control.

The rate control mechanism differs from this point on. After the second pass VLD ismade on the bitstream, quantized DCT (QDCT) coefficients are dequantized. A blockof finely QDCT coefficients is obtained as a result of this. This block of DCT coefficientsis re-quantized with a coarser quantizer scale. The value used for the coarser quantizerscale is determined adaptively by making adjustments after every MB so that the scaledtarget profile is tracked as we progress through the MBs in the frame:

QN ¼ QNOM þ G*XN�1

BU� PVN�1

!(17:3)

Bitstream Delay

VLD Parser

VLD &Dequantizer

New bit rate

Reconstruct

Bits-out

Re-encoder

Motion vectorand coding

decisionextracter

Motion vectors and

Macroblock decision modes

FIGURE 17.5Architecture 2: increasing quantization step.

� 2007 by Taylor & Francis Group, LLC.

Page 439: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

whereQN is the quantization factor for MB NQNOM is an estimate of the new nominal quantization factor for the frameSN�1BU is the cumulative amount of coded bits up to MB N�1G is a gain factor that controls how tightly the profile curve is tracked through the

picture

QNOM is initialized to an average guess value before the very first frame, and updatedfor the next frame by setting it to QLS (the quantization factor for the last MB) from theframe just completed. The coarsely re-quantized block of DCT coefficients is variable-length-coded to generate the scaled bitstream. The rate controller also has provisions forchanging some MB-layer code words, such as the MB-type and coded-block-pattern toensure a legitimate scaled bitstream that conforms to MPEG-2 syntax.

17.3.3.3 Architecture 3: Re-Encoding with Old Motion Vectorsand Old Decisions

The third architecture for bitstream scaling is shown in Figure 17.6. In this architec-ture, the motion vectors and MB coding decision modes are first extracted from theoriginal bitstream, and at the same time the reconstructed pictures are obtained fromthe normal decoding procedure. Then the scaled bitstream is obtained by re-encodingthe reconstructed pictures using the old motion vectors and MB decisions from theoriginal bitstream. The benefits obtained from this architecture compared to full decod-ing and re-encoding are that no motion estimation and decision computation areneeded.

SDTV

Pictures

(a)

(b)

VLD anddequantizer

IDCTDown

conversion

Frame store

HDTV

Bitstream+

HD MVs

SDTV

Pictures

VLD anddequantizer

IDCTDown-

conversion

Low-resolutionmotion compensation

Frame store

HDTV

Bitstream+

HD MVs

Full resolutionmotion compensation

FIGURE 17.6Architecture 3.

� 2007 by Taylor & Francis Group, LLC.

Page 440: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

17.3.3.4 Architecture 4: Re-Encoding with Old Motion Vectors and New Decisions

Architecture 4 is a modified version of architecture 3 in which new MB decision modesare computed during re-encoding based on reconstructed pictures. The scaled bitstreamcreated this way is expected to yield an improvement in picture quality because thedecision modes obtained from the high-quality original bitstream are not optimal forre-encoding at the reduced rate. For example, at higher rates the optimal modedecision for an MB is more likely to favor of bidirectional field motion compensation(MC) over forward frame MC. But at lower rates, only the opposite decision may betrue. In order for the re-encoder to have the possibility of deciding on new MB codingmodes, the entire pool of motion vectors of every type must be available. This can besupplied by augmenting the original high-quality bitstream with ancillary data contain-ing the entire pool of motion vectors during the time it was originally encoded. Itcould be inserted into the user data every frame. For the same original bit rate, thequality of an original bitstream obtained this way is degraded compared to an originalbitstream obtained from architecture 3 because the additional overhead required for theextra motion vectors steals away bits for actual encoding. However, the resultingscaled bitstream is expected to show quality improvement over the scaled bitstreamfrom architecture 3 if the gains from computing new and more accurate decisionmodes can overcome the loss in original picture quality. Table 17.3 outlines thehardware complexity savings of each of the three proposed architectures as comparedto full decoding and re-encoding.

17.3.3.5 Comparison of Bistream Scaling Methods

We have described four architectures for bitstream scaling, which are useful for variousapplications as described in the introduction. Among the four architectures, architec-tures 1 and 2 neither require entire decoding and encoding loops nor frame storememory for reconstructed pictures, thereby saving significant hardware complexity.However, video quality tends to degrade through the group of pictures (GOP) untilthe next I-picture due to drift in the absence of decoder=encoder loops. For largescaling, say for rate reduction greater than 25%, architecture 1 produces poor qualityblocky pictures, primarily because many bits were spent in the original high-qualitybitstream on finely quantizing the DC and other very low-order AC coefficients.Architecture 2 is a particularly good choice for VTR applications since it is a goodcompromise between the hardware complexity and reconstructed image quality. Archi-tectures 3 and 4 are suitable for video-on-demand (VOD) server applications and otherStatMux applications.

TABLE 17.3

Hardware Complexity Savings Over Full Decoding=Re-Encoding

Coding Method Hardware Complexity Savings

Architecture 1 No decoding loop, no DCT=IDCT, no frame store memory, no encoding loop,no quantizer=dequantizer, no motion compensation, no VLC, simplified rate control

Architecture 2 No decoding loop, no DCT=IDCT, no frame store memory, no encoding loop,no motion compensation, simplified rate control

Architecture 3 No motion estimation, no macroblock coding decisions

Architecture 4 No motion estimation

� 2007 by Taylor & Francis Group, LLC.

Page 441: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Appendi x

In this anal ysis, we assume that the opti mal quan tizer is obtained by assigni ng the num berof bits acco rding to the vari ance or energy of the c oeffi cients. It is slightly differe nt fromMPEG standard that wi ll be explain ed later, but the princip al concept is the sam e and theresult s wi ll hold for the M PEG st andard. We firs t analyze the error s cause d by cutting highcoef ficients and increa sing the quan tizer step . The opti mal bit assig nment is given by[jay ant 1984]:

Rk 0 ¼ R av0 þ 12log2

s 2k

QN � 1

i ¼ 0s 2i

� �1 = N , k ¼ 0, 1, . . . , N � 1, (17: 4)

whereN is the numb er of coef fi cients in the blockRk 0 is the num ber of bits assigne d to the k th coef ficient

Rav0 is the averag e num ber of bits assigned to each coef fi cient in the block, i.e.,RT0 ¼ N � Rav0 is the total bits for this block under a cer tain bit rate and is the vari ance ofk th coef ficient. Unde r the optimal bit assi gnment (Eq uation 17.3), the min imized averagequantizer error, s2

q0, is

s2q0 ¼

1N

XN�1k¼1

s2qk ¼

1N

XN�1k¼1

2�2Rk0 �s2k (17:5)

where s2qk is the quantizer error of kth coefficient. According to Equation 17.4, we have two

major methods to reduce the bit rate, cutting high coefficients or decreasing the Rav, i.e.,increasing the quantizer step. We are now analyzing the effects on the reconstructed errorscauseddue to the cut of bits using thesemethods.Assume that thenumberof thebits assignedto the block is reduced fromRT0 toRT1. Then the bits to be reduced,DR1, is equal toRT0 –RT1.

In the case of cutting high frequencies, say the number of coefficients is reduced fromN to M, then

Rk0 ¼ 0 for K < M, and DR1 ¼ RT0 � RT1 ¼XN�1k¼M

Rk0 (17:6)

the quantizer error increased due to the cutting is

Ds2q1 ¼ s2

q1 � s2q0 ¼

1N

XM�1k¼0

2�2Rk0 �s2k þ

XN�1k¼M

s2k �

XN�1k¼0

2�2Rk0 �s2k

!

¼ 1N

XN�1k¼M

s2k �

XN�1k¼M

2�2Rk0 �s2k

!

¼ 1N

XN�1k¼M

(1� 2�2Rk0 ) �s2k (17:7)

where s2q1 is the quantizer error after cutting the high frequencies.

In the method of increasing quantizer step, or decreasing the average bits, from Rav0 toRav2, assigned to each coefficient, the number of bits reduced for the block is

DR2 ¼ RT0 � RT2 ¼ N � (Rav0 � Rav2) (17:8)

� 2007 by Taylor & Francis Group, LLC.

Page 442: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

and the bits assigned to each coef fi cient beco me now

Rk 2 ¼ R av2 þ 12log2

s 2k

QN � 1

i ¼ 0s 2i

� �1 =N , k ¼ 0, 1, . . . , N � 1, (17 : 9)

The cor respond ing quan tizer error inc reased by the cut ting bits is

D s 2q2 ¼ s 2q 2 � s 2q0 ¼1N

XN � 1

k ¼ 02� 2 Rk 2 � s 2k �

XN � 1

k ¼ 02� 2 Rk 0 � s 2k

!

¼ 1N

XN � 1

k ¼ 0(2 � 2R k 2 � 2� 2 Rk 0 ) � s 2k (17 : 10)

where s 2q2 is the quan tizer error at the reduce d bit rate .If the same num ber of bits is reduce d, i.e., D R1 ¼ D R2, it is obvious that Ds 2q 2 is smaller

than D s 2q 1 since s 2q 2 is the minimize d value at the reduce d rate. This imp lies that theperform ance of changing the quantiz er step will be better than cut ting higher freque ncieswhen the same amo unt of rate need s to be reduce d. It shoul d be no ted that in the MPEGvideo coding, mo re sophi sticated bit assignme nt algorithm s are used. First, diffe rentquan tizer matrice s are used to improve the visu al percept ual perfo rmance. Second, differ-ent VLC tables are use d to code the DC value s and the AC transf or m coef ficients; and therun-l ength coding (RLC) is use d to code the pairs of the zer o-run length and the value s ofamp litudes. Howeve r, in gene ral, the bits are still assi gned acco rding to the statist icalmodel that indi cates the energy distribut ion of the transfor m coef fi cients. Ther efore, theabov e theoret ical analysis will hold for the MPEG vide o coding.

17.3.4 MPEG -2 to MP EG-4 Transcod ing

In this sectio n, we are goi ng to indicate that there is anothe r kind of transcod ing, whichis to conv ert the bitstr eam betwe en standard s. An example is the transcode r betw eenMPEG -2 to MPEG -4. The tech nical detail of MPEG -4 will be introduce d in Chapte r 18.Since we have not le arnt the MPEG -4 yet, here we jus t introd uce the concept of transcodi ngbetwe en MPEG -2 to MPEG -4. This concept can be extended to ot her standards .

The transcod ing met hods the mselves can be applied within the sam e syn tax fo rmat a sdescri bed in the previous sectio n or betwe en differe nt syn tax formats. As MPEG -4 simp lepro file is adop ted as the solution fo r mobile multi media com munica tions and a largeamount of MPEG-1=2 contents is available, we focus our discussion on MPEG-2 toMPEG-4 transcoding. The transcoding from MPEG-2 to MPEG-4 is necessary and usefulfor allowing the mobile or PDA terminals to receive MPEG-2 compressed contents withtheir limited display size. Here, we describe the principles and techniques used in theMPEG-2 to MPEG-4 video transcoder. The conversions include techniques for bit ratereduction, spatial resolution down-sampling, temporal resolution down-scaling, andpicture-type change. The main difficulty of MPEG-2 to MPEG-4 transcoding is to performtranscoding on both bit rate reduction and spatial resolution reduction at the same time.The transcoding on bit rate reduction and spatial resolution reduction would causeserious error drift due to the change of predictive references. To address this problem,several issues were investigated. First, an analysis of drift errors is provided to identifythe sources of quality degradation when transcoding to a lower spatial resolution. Twotypes of drift error are considered: a reference picture error and an error due to the

� 2007 by Taylor & Francis Group, LLC.

Page 443: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

nonco mmutati ve property of M C and dow n-samp ling. To overco me these source s oferror , seve ral novel transco ding archi tectures are the n pres ented. On e archi tectureattemp ts to compe nsate for the referenc e pictu re error in the reduc ed resol ution,whil e another atte mpts to do the sam e in the original resol ution. We pres ent a thirdarchite cture that atte mpts to elimina te the seco nd ty pe of drift error and a fi nal a rchitec-ture that reli es on an intrabl ock ref resh method to compensate all types of errors . In allthese architectu res, a vari ety of MB level conve rsions are requi red, such as motionvect or map ping and tex ture down-samp ling. These conve rsions a re dis cussed in detai l.Anoth er importan t iss ue fo r the transcod er is rate control . Rate contol is esp eciallyimpo rtant for the intrarefre sh archite cture since it must fi nd a balan ce betw een thenum ber of intrabl ocks use d to compens ate errors and the asso ciated rate-distor tioncharac teristi cs of the low- resolutio n signal. The com plexity and qual ity of the archi tec-tures are com pared. Based on the results, we find that the intrare fresh archi tecture offer sthe best trade -off betw een qualit y and com plexity, and is a lso the most fl exible. Aft erlearni ng MPEG -4 in Chapte r 18, you can refer to the tech nical details of MPEG-2 toMPEG -4 transco ding in [peng 2002].

17 .4 Down-Convers ion De coder

17.4.1 Bac kground

Dig ital video broad castin g has had a major impact in both acad emic and indus trial com-mun ities. A great deal of eff ort has been m ade to impro ve the coding ef ficiency at thetransmi ssion side and offer cost-ef fective impl ement ations in the overall end- to-end system.Alon g these lines, the notion of format conv ersion is beco ming increasingl y pop ular. On thetransmi ssion side, there are a number of different fo rmats that are likely candidat es fordigit al vide o bro adcast. Th ese formats vary in horiz ontal , vertical, and tem poral resol ution.Simila rly, on the rec eiving side, the re are a vari ety of display devices that the receiver shoul dacco unt for. In this sectio n, we are inter ested in the spe cific probl em of how to rec eive anHDTV bitstream and display it at a lower spatial resol ution. In the conve ntion al metho d ofobt aining a low-re solution image sequ ence, the HD bitstr eam is fully decod ed the n it issimp ly pre filtere d and subs ampled [tm5] . The blo ck diagram of thi s system, shown in Figu re17.7a, is referred to as a full-resol ution decod er (FR D) with spatial down-co nversion.Althou gh the qual ity is very good, the cost is quite high due to the large memory requi re-men ts. As a result , low-re solution deco ders (LRD s) have been prop osed to reduce some ofthe costs [ng 1993 ; sun 1993; boyce 1995; ba o 1996]. Al though the qualit y of the pictu re willbe compr omise d, signi fi cant reductio ns in the a mount of mem ory ca n be rea lized; the blockdiagram fo r this syste m is shown in Figu re 17.7 b. Here, incom ing block s are subject to down-conve rsion fi lters wi thin the decodin g loo p. In thi s way, the dow n-conve rted blocks arestor ed into mem ory rathe r than the full-resol ution blocks. To achi eve a high -quality outp utwith the LRD , it is importan t to take special care in the algori thms for down-con ver sion andMC. These two proces ses are of major importan ce to the deco der as they have signi ficantimpact on the final quality . Althou gh a moderat e am ount of c omplexit y withi n the decodin gloop is added, the reductions in external memory are expected to provide significant costsavings, provided that these algorithms can be incorporated into the typical decoder struc-ture in a seamless way.

As stated above, the filters used to perform the down-conversion are an integral part ofthe LRD. In Figure 17.7b, down-conversion is shown to take place before the IDCT.Although filtering is not required to take place in the DCT domain, we initially assume

� 2007 by Taylor & Francis Group, LLC.

Page 444: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

(a)

(b)

VLD anddequantizer

IDCTDown-

conversion

Full resolution motion compensation

Frame store

HDTV

Bitstream+

HD MV’s

SDTV

Pictures

SDTV

Pictures

VLD anddequantizer

IDCTDown-

conversion

Low-resolutionmotion compensation

Frame store

HDTV

Bitstream+

HD MV’s

FIGURE 17.7Decoder structures. (a) Block diagram of full-resolution decoder with down-conversion in the spatial domain. Thequality of this output will serve as a drift-free reference. (b) Block diagram of low-resolution decoder. Down-conversion is performed within the decoding loop and is a frequency domain process. Motion compensation isperformed from a low-resolution reference using motion vectors that are derived from the full-resolution encoder.Motion compensation is a spatial domain process.

that it takes place before the adder. In any case, it is usually more intuitive to derive adown-conversion filter in the frequency domain rather than in the spatial domain [mokry1994; pang 1996; merhav 1997]. The major drawback of these approaches is that high-frequency data is lost or not preserved very well. To overcome this, a method of down-conversion that better preserves high-frequency data within the MB has been reported in[bao 1996; vetro 1998a]; this method is referred to as frequency synthesis.

Although the above statement of the problem has mentioned only filtering basedapproaches to memory reduction within the decoding loop, readers should be awarethat other techniques have also been proposed. For the most part, these approaches relyon methods of embedded compression. For instance, in [de with 1998], the data beingwritten to memory is quantized adaptively using a block predictive coding scheme, then asegment of MBs is fit into a FL packet. Similarly, in [yu 1999], an adaptive minimum–

maximum quantizer and edge detector is proposed. With this method, each MB is com-pressed to a fixed size to simplify memory access. Another more simple approach may beto truncate the 8-bit data to 7 or 6 bits. However, in this case, it is expected the drift wouldaccumulate very fast and result in poor reconstruction quality. In [bruni 1998], a vectorquantization method has been utilized, and in [lei 1999] a wavelet-based approach isdescribed. Overall, these approaches offer exceptional techniques to reduce the memoryrequirements, but in most cases, the reconstructed video would still be a high-resolutionsignal. The reason is that compressed high-resolution data is stored in memory rather thanthe raw low-resolution data. For this reason, the remainder of this section emphasizes thefiltering based approach, in which the data stored in memory represents the actual low-resolution picture data.

� 2007 by Taylor & Francis Group, LLC.

Page 445: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

The main novelty of the system that we descri be is the filteri ng, which is used to performthe MC from low-re solution anchor frames . It is well known that predic tion dr ift has beendif ficult to a void. It is partly due to the loss of high-freq uency dat a from the down-conve rsion and par tly due to the inab ility to recover the los t inf ormatio n. Althou ghpredic tion dr ift cannot be total ly avoide d in an LRD, it is possibl e to signi fi cantly reducethe effect s of drift in contrast to simp le int erpola tion metho ds. The sol ution that wedescri be is optimal in the leas t-squa res sense and is dep endent on the metho d of down-conve rsion that is used [vetro 1998b]. In its direct form, the solution cannot be readil yappl ied to a practi cal deco ding schem e. Howe ver, it is sh own that a ca scaded realizati on iseasi ly impl ement ed into the FRD -type structure [vetro 1998c].

17.4.2 Frequ ency Synthesi s Dow n-Con version

The concept of fre quency synthes is was firs t reporte d in [bao 1996] and later expan ded in[ve tro 1998b]. The basic premise is to better preserve the fre quency charac teristi cs of an MBin c ompariso n to simpler metho ds that extract or cut spe cifi ed frequency com ponents of an8 3 8 blo ck. To acco mplish thi s, the four block s of an M B are subject to a glo bal transf orm-ation — this transf orm ation is refer red to as frequency syn thesis. Esse ntially, a single fre-quency doma in block can be realized using the informati on in the ent ire MB . Fro m thi s,low- resolutio n blocks can be achieved by cutting out the low-ord er frequency com ponentsof the synthe sized blo ck — this action repre sents the dow n-conversi on proce ss and isgenerally represented in the following way:

~A ¼ XA (17:11)

where~A denotes the original DCT MBA denotes the down-converted DCT blockX is a matrix that contains the frequency synthesis coefficients

The original idea for frequency synthesis down-conversion was to directly extract an 83 8block from the 16 3 16 syn thesized block in the DCT doma in as shown in Figure 17.8a. Th eadvantage of doing this is that the down-converted DCT block is directly applicable to an83 8 IDCT (for which fast algorithms exist). The major drawback with regard to compu-tation is that each frequency component in the synthesized block is dependent on all of thefrequency components in each of the 83 8 blocks, i.e., each synthesized frequency com-ponent is the result of a 256-tap filter. The major drawback with regard to quality is thatinterlaced video with field-based predictions should not be subject to frame-based filtering[vetro 1998b]. If frame-based filtering is used, it becomes impossible to recover the appro-priate field-based data that is required to make field-based predictions. In areas of largemotion, severe blocking artifacts will result.

Obviously, the original approach would incur too much computation and qualitydegradation; so instead, the operations are performed separately and vertical down-conversion is performed on a field-basis. In Figure 17.8b, it is shown that a horizontaldown-conversion can be performed. To perform this operation, a 16-tap filter is ultimatelyrequired. In this way, only the relevant row information is applied as the input to thehorizontal filtering operation and the structure of the incoming video has no bearing onthe down-conversion process. The reason is that the data in each row of an MB belongsto the same field, hence the format of the output block will be unchanged. It is noteworthythat the set of filter coefficients is dependent on the particular output frequency index. For

� 2007 by Taylor & Francis Group, LLC.

Page 446: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

Frequencysynthesis

16

16

8

8

16

8

N

M

(a) (b) (c)

FIGURE 17.8Concept of frequency synthesis down-conversion, (a) 256-tapfilter applied to every frequency component to achieve verticaland horizontal down-conversion by a factor of 2 frame-basedfiltering, (b) 16-tap filter applied to frequency components in thesame row to achieve horizontal down-conversion by 2, picturestructure is irrelevant, (c) illustrates that the amount of synthe-sized frequency components which are retained is arbitrary.

1-dim ensional (1-D ) filteri ng, this means that the filters used to com pute the seco nd outpu tindex, for exa mple, are different fro m thos e used to compute the fifth outp ut inde x.Similar to the horizontal down-con versio n, ver tical dow n-convers ion can a lso be appliedas a separate proces s. As rea soned earlier , fi eld-based filteri ng is necessary for interlacedvideo wi th fi eld-based predic tions.

Howe ver, since an MB consists of 8 line s for the even fi eld and 8 lines fo r the odd fiel d,and the v ertical bloc k unit is 8, fre quency synthes is cannot be applie d. Frequ ency synthe sisis a global transf ormatio n and is only appl icable whe n on e wi shes to observe the fre quencycharac teristics over a larger range of data than the basic unit. Therefor e, to perform thevertica l dow n-convers ion, we can simply cut the low-o rder fre quency com ponents in thevertica l dire ction. This loss that we accept in the vertical directi on is justi fied by the abilit yto perform accurate low- resolutio n MC that is free from seve re blocki ng arti facts.

In the ab ove, we have expla ined how the original idea to extract an 8 3 8 DC T block isbrok en down int o separabl e operati ons. Howeve r, since freq uency syn thesis provides anexpres sion for every fre quency com ponent in the new 16 3 16 block, it m akes sense togeneral ize the dow n-convers ion proces s so that decimati on, which are multiple s of 1 =16can be perfo rmed. In Figure 17 .8c, an M 3 N bloc k is ext racted. Altho ugh this type ofdown-co nversion filteri ng may no t be appropri ate before the IDCT operati on and may notbe appropri ate for a bitstr eam con taining fi eld-bas ed pred ictions, it may be appl icableelsew here, e.g., as a spatial domain filter somew here else in the syste m or for progress ivemateri al. To obtain a set of spatial doma in fi lters, an appro priate transf ormatio n can beappl ied. In this way, Equati on 17.8 is expres sed as

~a ¼ xa (17 : 12)

where the lowerca se count erparts denote spati al equ ivalents. Th e expre ssion that trans-forms X to x is derive d in Appendi x A.

17.4.3 Low-Res oluti on Moti on Compen sation

The focus of this secti on is to provi de an express ion for the opti mal set of low-resol utionMC filters given a set of dow n-convers ion filters. Th e resultin g filters are optimal in theleast-sq uares sense as they min imize the mean square d error (MSE) betwe en a refer enceblock and a blo ck obtain ed throu gh low-re solution MC. The results derived in [ve tro1998a] assu me that a spatial doma in filter, x, is appl ied to incom ing MBs to achieve thedown-co nversion. The schem e shown in Figure 17.9a illustrat es the proces s by whichrefer ence blocks are obtain ed. First, full-resol ution MC is performed on MBs a , b , c, and dto yield h. To execu te this process , the filters S(r)a , S

(r)b , S

(r)c , and S(r)d are used. Basically, these

� 2007 by Taylor & Francis Group, LLC.

Page 447: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

(b)

Full-resolutionmotion compensation

Sa, Sb, Sc, Sd

Down-conversion

X

h habcd

Down- conversion

XDown-

conversion X

Down- conversion

X

Down- conversion

X

Low-resolution motion compensation

N1, N2, N3, N4

MinimizeMSE bychoosing

˜h

a

b

c

d

a

b

c

d

(a)

N1, N2, N3, N4

FIGURE 17.9Comparison of decoding methods to achieve low-resolution image sequence. (a) FRD with spatial down-conversion, (b) LRD. The objective is to minimize the MSE between the two outputs by choosing N1, N2, N3,and N4 for a fixed down-conversion.

fi lters represen t the m asking =averagin g operati ons of the MC in a matrix form. Mo re onthe com position of these filters can be foun d in the Appendi x B. Once h is obtained, it isdown-converted to ~h via the spatial filter, x:

~h ¼ xh (17:13)

The above block is considered to be the drift-free reference. On the other hand, in thescheme of Figure 17.9b, the blocks a, b, c, and d are first subject to the down-conversionfilter, x, to yield the down-converted blocks, ã, ~b, ~c, and ~d, respectively. Using these down-converted blocks as input to the low-resolution MC process, the following expression canbe assumed:

~h ¼ [N1 N2 N3 N4]

~a~b~c~d

26664

37775 (17:14)

where Nk, k¼ 1, 2, 3, 4, are the unknown filters that are assumed to perform the low-resolution MC, ~h is the low-resolution prediction.

As in [vetro 1998a], these filters are solved for by differentiating the following objectivefunction:

J{Nk} ¼ ~h� ~h��� ���2 (17:15)

with respect to each unknown filter and setting each result equal to zero. It can be verifiedthat the optimal least-square solution for these filters is given by

� 2007 by Taylor & Francis Group, LLC.

Page 448: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

N ( r )1 ¼ xS( r)a xþ ; N ( r)2 ¼ xS( r )b xþ

N ( r )3 ¼ xS( r)c xþ ; N ( r)4 ¼ xS( r )d xþ (17 : 16)

where

xþ ¼ xT ( xx T ) � 1 (17 : 17)

is the M oore –Penrose Inve rse [lancas ter 1985] fo r an m 3 n matrix with m � n . In thesolut ion of Equatio n 17.16, the sup erscrip t (r ) is adde d to the filters, Nk , due to theirdepen dency on the full-re solution MC filters . In usi ng the se filters to perform the low-resol ution MC, the MSE between ~h and ~h is minimize d. It shoul d be emphas ized thatEquatio n 17.16 represen ts a generali zed set of MC filters that are applicable to any x, whichoperate s on a single MB. For the special cas e of the 4 3 4 cut, the se filters are equiva lent tothe ones that were dete rmine d in [morky 1994] to minimize the drift.

In Figu re 17.10, two equiva lent MC sch emes are shown. Ho wever, for implement ationpurpo ses, the opti mal M C sch eme is realized in a cascad e form rathe r than a dire ct form.The rea son is that the direct form fi lters a re dep endent on the matrice s that perform full -resol ution MC. Althou gh these matrices were very useful in anal ytically express ing thefull-resolution MC process, they require a huge amount of storage due to their dependencyon the prediction mode, motion vector, and half-pixel accuracy. Instead, the three linearproces ses in Equatio n 17.13 are sep arated, so that an up-convers ion, full -resolut ion MC,and down-conversion can be performed. Although one may be able to guess such ascheme, we have proven here that it is an optimal scheme provided the up-conversionfilter is Moore–Penrose inverse of the down-conversion filter. In [vetro, 1998b], the optimalMC scheme, which employs frequency synthesis, has been compared to a nonoptimal MCscheme, which employs bilinear interpolation, and an optimal MC scheme, which employs

Small memory Low-resolution

prediction

Down-conversionX

Full-resolutionMC

Up-conversionX

+

Low-resolution prediction

HDmotion vectors

Low-resolution MC

Frame store

Large memory for filter coefficient storage

N1, N2, N3, N4

Sa, Sb, Sc, Sd

Framestore

HDmotion vectors

Small memory

FIGURE 17.10Optimal low-resolution MC scheme: directform (top) versus cascade form (bottom).Both forms yield equivalent quality, butvary significantly in the amount of internalmemory.

� 2007 by Taylor & Francis Group, LLC.

Page 449: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

the 4 3 4 cut dow n-convers ion. Si gnifi cant redu ctions in the amount of drift were realizedby both optimal MC sch emes over the metho d, which used bilinear interpol ation as themet hod of up-convers ion. But more imp ortantly, a 3 5% redu ction in the amo unt of driftwas realized by the optimal MC schem e usi ng freq uency syn thesis over the optimal MCschem e usi ng the 4 3 4 cut.

17.4.4 Thr ee-Laye r Scalabl e Decoder

In this secti on, we show ho w the key algorithm s for dow n-convers ion and MC areintegr ated into a three -layer sca lable decoder . The cen tral concept of this decod er is thatthree layers of resol ution can be decod ed using a decreas ed amo unt of mem ory for thelower resol ution layers. Also, regard less of which layer is being decod ed, muc h of thelogic can be share d. Th ree possibl e deco der con figura tions are conside red: full-me morydecod er (F MD), hal f-memo ry decod er (HMD ), and quar ter-me mory decoder (QMD) . TheLRD c on figura tions are based on the key algori thms, which were desc ribed for down-conve rsion and MC. In the follo wing, three poss ible archite ctures are discusse d thatprovid e equal quality , but vary in system level comple xity. The first (ARCH1) is basedon the LRD model ed in Figure 17.7b , the second (ARC H2) is very similar , but atte mpts toreduce the IDCT computat ion, whi le the third (ARC H3) is concerne d with the am ount ofinter face with an exis ting high-l evel decod er.

Wit h rega rd to function ality, a ll of the archite ctures share simi lar charac teristics. For on e,an ef ficient impl ement ation is achieved by arrangin g the logic in a hierarchi cal m anner, i.e.,empl oys sep arable proce ssing. In this way, the FM D con figura tion is the simp lest andserves as the logic core from which other deco der c on figura tions are bui lt on. In the HMDcon fi guration , an addi tional horiz ontal down-con ver sion and up -conversion are per-forme d. In the QMD con fi guration , all of the logic compone nts from the HMDare utilized , such that an additio nal ver tical down-con ver sion is perform ed after a hori-zontal dow n-convers ion, and a n additio nal vertica l up-convers ion is perform ed afte r ahoriz ontal up-convers ion. In summa ry, the logic fo r the HMD is bui lt on the logic for theFMD, and the log ic for the QMD is built on the log ic of the HMD . The total syste m contai nsa moderat e inc rease in logic , but HD bitstr eams may be decoded to a lower resolution witha smaller amount of exter nal memor y. By simply remov ing exter nal mem ory, lower layerscan be achi eved at a redu ced cost.

The comple te block diagram of AR CH1 is sh own in Figu re 17.11a. The diagram shownhere assume s two things : (i) the initi al system model of an LRD from Figure 17.6b isassum ed and (ii) the dow n-conversi ons in the incom ing branch are pe rformed after theIDCT to avoid any confu sion rega rding MB format convers ions in the DCT doma in [vetro1998b ]. In look ing at the resultin g system, it is eviden t that full computat ion of the IDCT isrequi red, and that two indep endent down-con ver sion ope rations must be perform ed. Th elatter is nec essary so that low-re solution prediction s are added to low- resolutio n resid uals.Overal l, the increase in logic for the adde d feature of mem ory savings is quite sm all.Howe ver, it is eviden t that ARCH1 is not the most cost- effective impl ement ation, but itrepresents the foundation of previous assumptions, and allows us to better analyze theimpact of the two modified architectures to follow.

In Figure 17.11b, the block diagram of ARCH2 is shown. In this system, realizing that theIDCT operation is simply a linear filter reduces the combined computation for the IDCTand down-conversion. In the FMD, we know that a fast IDCT is applied separately to therows and columns of an 83 8 block. For the HMD, our goal is to combine the horizontaldown-conversion with the horizontal IDCT. In 1-D case, an 83 16 matrix can represent thehorizontal down-conversion, and an 83 8 matrix can represent the horizontal IDCT.Combining these processes such that the down-conversion operates on the incoming

� 2007 by Taylor & Francis Group, LLC.

Page 450: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

(a)

(b)

VLD andIQ

IDCT

Horizontal/verticaldown-conversion

+IDCT

Verticaldown-

conversion Horizontal

down-conversion

External memoryfor QMD

Additionalmemory for HMD

Additionalmemory for FMD

Displayprocessor

Full-resolutionmotion compensation

Verticalup-

conversionHorizontal

up-conversion

HDTVbitstream

480P

480I

Horizontal down-conversion

+IDCT

+

VLD andIQ

IDCT

Horizontal down-

conversion

Verticaldown-

conversion

Verticaldown-

conversionHorizontal

down-conversion

External memory for QMD

Additionalmemory for HMD

Additionalmemory for FMD

Displayprocessor

Full-resolutionmotion compensation

Verticalup-

conversionHorizontal

up-conversion

HDTVbitstream

480P

1080I/720P

1080I/720P

480I

+

FIGURE 17.11Block diagram of various three-layer scalable decoder architectures. All architectures provide equal quality withvarying system complexity. (a) ARCH1, derived directly from block diagram of assumed low-resolution decoder.(b) ARCH2, reduced computation of IDCT by combining down-conversion and IDCT filters.

(continued)

� 2007 by Taylor & Francis Group, LLC.

Page 451: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

(c)

VLD andIQ

IDCT

Verticaldown-

conversion

Horizontaldown-

conversion

Externalmemoryfor QMD

Additionalmemoryfor HMD

Additionalmemoryfor FMD

Displayprocessor

Full-resolutionmotion

compensation

Verticalup-

conversionHorizontal

up-conversion

HDTVbitstream

1080I/720P

480P

480I

FIGURE 17.11 (continued)(c) ARCH3, minimized interface with existing HL decoder by moving linear filtering for down-conversion outsideof the adder.

DCT rows first, results in a combined 83 16 matrix. To complete the transformation, theremaining columns can then be applied to the fast IDCT. In the above description,computational savings are achieved in two places: first, the horizontal IDCT is fullyabsorbed into the down-conversion computation which must take place anyway, andsecond, the fast IDCT is utilized for a smaller amount of columns. In the case of theQMD, these same principles can be used to combine the vertical down-conversion withthe vertical IDCT. In this case, one must be aware of the MB type (field-DCT or frame-DCT)so that an appropriate filter can be applied. In contrast to the previous two architectures,ARCH3 assumes that the entire front-end processing of the decoder is used; it is shown inFigure 17.11c. In this way, the adder is always a full-resolution adder, whereas in ARCH1and ARCH2, the adder needed to handle all three-layers of resolution. The major benefit ofARCH3 is that it does not require many interfaces with the existing decoder structure. Thememory is really the only place where a new interface needs to be defined. Essentially, adown-conversion filtering may be applied before storing the data, and an up-conversionfiltering may be applied, as the data is needed for full-resolution MC. This final architectureis similar in principle to the embedded compression schemes that were mentioned in thebeginning of this section. The main difference is that the resolution of the data is decreasedrather than compressed. This allows a simpler means of low-resolution display.

17.4.5 Summary of Down-Conversion Decoder

A number of integrated solutions for a scalable decoder have been presented. Each decoderis capable of decoding directly to a lower resolution using a reduced amount of memory incomparison to the memory required by the high-level decoder. The method of frequencysynthesis is successful in better preserving the high-frequency data within an MB and thefiltering that is used to perform optimal low-resolution MC is capable of minimizing

� 2007 by Taylor & Francis Group, LLC.

Page 452: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

the drift. It has been shown that a realizable implementation can be achieved, suchthat the filters for optimal low-resolution MC are equivalent to an up-conversion, full-resolution MC, and down-conversion, where the up-conversion filters are determined by aMoore–Penrose inverse of the down-conversion. The amount of logic required by theseprocesses is kept minimal since they are realized in a hierarchical structure. Since thedown-conversion and up-conversion processes are linear, the architecture design is flexiblein that equal quality can be achieved with varying levels of system complexity. The firstarchitecture that we examined came from the initial assumptions that were made on theLRD, i.e., a down-conversion is performed before the adder. It was noted that a full IDCTcomputation was required and that a down-conversion must be performed in two places.As a result, a second architecture was presented to reduce the IDCT computation, and athird was presented to minimize the amount of interface with the existing high-leveldecoder. The major point here is that the advantages of ARCH2 and ARCH3 cannot berealized by a single architecture. The reason is that performing a down-conversion in theincoming branch reduces the IDCT computation, therefore a down-conversion must beperformed after the full-resolution MC as well. In any case, equal quality is offered by eacharchitecture and the quality is of commercial grade.

Appendix A: DCT-to-Spatial Transformation

Our objective in this section is to express the following DCT domain relationship:

~A(k, l) ¼XM�1p¼0

XN�1q¼0

[Xk, l(p, q)A(p, q)] (17:18)

as

~a(i, j) ¼XM�1s¼0

XN�1t¼0

[xi, j(s, t)a(s, t)] (17:19)

whereà and ã are the DCT and spatial outputA and a are the DCT and spatial inputX and x are the DCT and spatial filters

By definition, the M3N DCT transform is defined by

A(k, l) ¼XM�1i¼0

XN�1j¼0

a(i, j)cMk (i)c

Nl ( j) (17:20)

and its inverse, the M3N IDCT by

a(i, j) ¼XM�1k¼0

XN�1l¼0

A(k, l)cMk (i)c

Nl ( j) (17:21)

where the basis function is given by

cNk ¼

ffiffiffiffi2N

ra(k) cos

2iþ 12N

kp� �

(17:22)

and

� 2007 by Taylor & Francis Group, LLC.

Page 453: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

a( k ) ¼1ffiffi2p for k ¼ 01 for k 6¼ 0:

�(17: 23)

By sub stituting Equatio n 17.22 into the express ion for the IDCT yields :

~a ( i, j) ¼XM� 1k ¼ 0

XN � 1

l¼ 0c Mk ( i )c

Nl ( j) �

XM � 1

p ¼ 0

XN � 1

q ¼ 0Xk , l ( p � q ) A( p � q )

24

35

¼XM� 1p¼ 0

XN � 1

q ¼ 0A ( p, q ) � XM � 1

k ¼ 0

XN � 1

l ¼ 0Xk , l ( p, q ) c Mk ( i) c

Nl ( j)

" #(17: 24)

Sub stituting the DCT de fi nition int o Equatio n 17.24 gives the fo llowing:

~a ( i , j) ¼XM� 1p¼ 0

XN � 1

q ¼ 0

XM � 1

s ¼ 0

XN � 1

t ¼ 0a ( s, t )c Mp ( s ) c

Nq ( t )

" #XM�1k¼0

XN�1l¼0

Xk, l(p, q) �cMp (i)c

Nq ( j)

h i(17:25)

Fina lly, Equatio n 17.16 can be forme d with

xi, j(s, t) ¼XM�1k¼0

XN�1l¼0

cMk (i) �c N

l ( j)XM�1p¼0

XN�1q¼0

Xk, l(p, q) �cMp (s)c

Nq (t)

24

35 (17:26)

and the transformation is fully defined.

Appendix B: Full-Resolution Motion Compensation in Matrix Form

In 2-D, a motion compensated MB may have contributions from at most 4 MBs per motionvector. As noted in Figure 17.12, MBs a, b, c, and d include four 83 8 blocks each. Thesesubblocks are raster-scanned so that each MB can be represented as a vector. According to

FIGURE 17.12Relationship between the input and out-put blocks of the motion compensationprocess in the FRD.

a1 a2 a3 a4

a2 b1 a4 b3

a3 a4 c1 c2

a4 b3 c2 d1

y1

a b

h

c d

To get h1

To get h2

To get h3

To get h4

+ + +

+ + +

+

+

+ +

+ +

Outcome offiltering relevant

block by M1

Outcome offiltering relevant

block by M2

Outcome offiltering relevant

block by M3

Outcome offiltering relevant

block by M4

� 2007 by Taylor & Francis Group, LLC.

Page 454: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

the motion vect or, ( dx , dy ), a local refer ence, ( y1, y2), is com puted to indi cate whe re theorigin of the MC blo ck is located ; the local referenc e is determine d by

y1 ¼ dy � 16 � [Int eger( dy =16) � g ( dy )]y2 ¼ dx � 16 � [Int eger( dx =16) � g ( dx )] (17 : 27)

where

g ( d) ¼ 1 if d < 0 and d mod 16 ¼ 00 ot herwis e:

�(17 : 28)

The ref erence point fo r this value is the or igin of the upper- leftmos t inpu t MB . Wit h this,the MC pred iction may be expres sed as

h ¼

h1

h2

h3

h4

2666664

3777775¼ S(r)a S(r)b S(r)c S(r)dh i

a

b

c

d

2666664

3777775; r ¼ 1, 2, 3, 4: (17:29)

As an example , Figu re 17.11 con siders ( y1, y2) 2 [0, 7], which implies that r¼ 1. In this casethe MC filters are given by

S(1)a ¼M1 M2 M3 M4

0 M1 0 M3

0 0 M1 M2

0 0 0 M1

2664

3775, S(1)b ¼

0 0 0 0M2 0 M4 00 0 0 00 0 M2 0

2664

3775,

S(1)c ¼0 0 0 00 0 0 0M3 M4 0 00 M3 0 0

2664

3775, S(1)d ¼

0 0 0 00 0 0 00 0 0 0M4 0 0 0

2664

3775

(17:30)

In Equation 17.30, the M1, M2, M3, and M4 matrices operate on the relevant 83 8 blocks ofa, b, c, and d. Their elements vary according to the amount of overlap as indicated by (y1, y2)and the type of prediction. The type of prediction may be frame-based or field-based and ispredicted with half-pixel accuracy. As a result, the matrices S(r)a ,S(r)b , S(r)c , and S(r)d areextremely sparse and may only contain nonzero values of 1, 1=2, and 1=4. For differentvalues of ( y1,y2) the configuration of the above matrices will change: y1 2 [0,7] and y2 2[8,15] implies r¼ 2; y1 2 [8,15] and y2 2 [0,7] implies r¼ 3; y1, y2 2 [8,15] implies r¼ 4. Theresulting matrices can easily be formed using the concepts illustrated in Figure 17.11.

17.5 Error Concealment

17.5.1 Background

Practical communication channels available for delivery of compressed digital video arecharacterized by occasional bit-error and packet loss, although the actual impairmentmechanism varies widely with the specific medium under consideration. The class of

� 2007 by Taylor & Francis Group, LLC.

Page 455: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

MPEG

Encoder

Transport

Encoder

Transport

Decoder

MPEGdecoder

Errorconcealment

Compressed video bitstream(or optional high priority layer)

Optional low-priority layer Noise, inteference

congestion

Packets/cells High priority

Lowpriority

Error tokens

Videoout

Transmissionmedium

e.g., BroadcastATM…

FIGURE 17.13System block diagram of visual communication system.

decod er error concea lment schemes descri bed here is based on identi fication and pred ict-ive repl acemen t of picture regions affe cted by bit-erro r or data loss. It is note d that thi sappro ach is based on conve rsion (via approp riate err or =loss dete ction mech anism s) of thetransmi ssion medi um into an erasure chann el in which all error or loss events ca n beide nti fied in the rec eived bitstream . In a block st ructured com pression algori thm such asMPEG , all channel impairme nts are manifest ed as era sures of vide o units (such as MPEGMB s or slices). Co ncealm ent at the decod er is the n based on expl oiting temporal and spati alpictu re redund ancy to obtain an estimate of era sed pictu re areas. The ef ficiency of errorconcea lment depen ds on redundanc ies in pictu res and on redu ndancies in the com pressedbitstr eam that are not remov ed by source coding. Block com pression a lgorithm s do notrem ove a cons iderable amount of inter -block redund ancies, such as structur e, texture, andmo tion informati on about objects in the scene.

To be mo re speci fic, error resi lience fo r com pressed video can be achieved throu gh theadditi on of sui table transport and error concealme nt method s, as outlined in the systemblock diagram shown in Figu re 17.13.

The key elements of such a robust video delivery system are outlined below:

. The video signal is encoded using an appropriate video compression syntax such as MPEG. Note that we have restricted consideration primarily to the practical case in which the video compression process itself is not modified, and robustness is achieved through additive transport and decoder concealment mechanisms (except for the I-picture motion vectors described in Section 17.5.3.2). This approach simplifies encoder design, because it separates media-independent video compression functions from media-dependent transport operations. On the receiver side, although a similar separation is substantially maintained, the video decoder must be modified to support an error token interface and error concealment functionality.

. Compressed video data is organized into a systematic data structure with appropriate headers for identification of the temporal and spatial pixel-domain location of encoded data [joseph 1992b]. When an erroneous/lost packet is detected, these video units serve as resynchronization points for resumption of normal decoding, while the headers provide a means for precisely locating regions of the picture that were not correctly received. Note that two-tier systems may require additional transport-level support for high and low priority (HP/LP) resynchronization [siracusa 1993].


. The video bitstream may optionally be segregated into two layers for prioritized transport [ghanbari 1989; karlsson 1989; kishno 1989; zdepski 1989; joseph 1992a,b; siracusa 1993] when a high degree of error resilience is required. Note that separation into high and low priorities may be achieved either by using a hierarchical (layered) compression algorithm [ghanbari 1989; siracusa 1993] or by direct code word parsing [zdepski 1989, 1990]. Note that both these layering mechanisms have been accepted for standardization by MPEG-2 [mpeg2].

. Once the temporal and spatial location(s) corresponding to lost or incorrectly received packets is determined by the decoder, it will execute an error concealment procedure for replacement of lost picture areas with subjectively acceptable material estimated from available picture regions [harthanck 1986; jeng 1991; wang 1991]. Generally, this error concealment procedure will be applied to all erased blocks in one-tier (single priority) transmission systems, while for two-tier (HP/LP) channels the concealment process may optionally ignore loss of LP data.

In the following sections, the technical details of some commonly used error concealment algorithms are provided. Specifically, we focus on the recovery of code word errors and errors that affect the pixels within an MB.

17.5.2 Error Concealment Algorithms

In general, the design of specific error concealment strategies depends on the system design. For example, if two-layered transmission is used, the receiver should be designed to conceal HP errors and LP errors with different strategies. Moreover, if some redundancy (steering information) could be added at the encoder, the concealment could be more efficient. However, we first assume that the encoder is designed for maximum compression efficiency and that concealment is only performed in the receiver. It should be noted that some exceptions exist for this assumption. These exceptions include the use of I-frame motion vectors, scalability concealment, and limitation of slice length (to perform acceptable concealment in the pixel domain, slices are limited to be no longer than one row of the picture). Figure 17.14 shows a block diagram of a generic one/two-tier video decoder with error concealment.

Note that Figure 17.14 shows two stages of decoder concealment, in the code word domain and the pixel domain, respectively.

FIGURE 17.14 MPEG video decoder with error concealment.


Code word domain concealment, in which locally generated decodable code words (e.g., B-picture motion vectors, EOB codes, etc.) are inserted into the bitstream, is convenient for implementation of simple temporal replacement functions (which in principle can also be performed in the pixel domain). The second stage of pixel-domain processing is for temporal and spatial operations not conveniently done in the code word domain. Advanced spatial processing will generally have to be performed in the pixel domain, although limited code word domain options can also be identified.

17.5.2.1 Code Word Domain Error Concealment

The code word domain concealment receives video data and error tokens from the transport processor/VLD. Under normal conditions, no action is taken and the data is passed along to the video decoder. When an error token is received, damaged data is repaired to the extent possible by insertion of locally generated code words and resynchronization codes. An error region ID is also created to indicate the image region to be concealed by subsequent pixel-domain processing. Two mechanisms have been used in code word domain error concealment: neglect the effect of lost data by declaring an EOB, or replace the lost data with a pseudo code to handle the MB types or other VLC codes. If high-level data such as a DC value or MB header is lost, the code word domain concealment with pseudo codes can only provide signal resynchronization (decodability) and replaces the image scene with a fixed gray level in the error region. Obviously, further improvement is needed in the video decoder. This task is implemented with the error concealment in the video decoder. It is desirable to replace erased I- or P-picture regions with a reasonably accurate estimate to minimize the impact of frame-to-frame propagation.

17.5.2.2 Spatio-Temporal Error Concealment

In general, two basic approaches are used for spatial domain error concealment: temporal replacement and spatial interpolation. In temporal replacement, as shown in Figure 17.15, the damaged blocks in the current frame are replaced by the spatially corresponding blocks in the previously decoded data, with MC if motion information is available. This method exploits temporal redundancy in the reconstructed video signals and provides satisfactory results in areas with small motion and for which motion vectors are provided. If motion information is lost, this method will fail in the moving areas. In the method of spatial interpolation, as shown in Figure 17.16, the lost blocks are interpolated from the data of the adjacent non-erroneous blocks with maximally smooth reconstruction criteria or other techniques.

In this method, the correlation between adjacent blocks in the received and reconstructed video signals is exploited. However, severe blurring will result from this method if data in adjacent blocks is also lost. In an MPEG decoder, the temporal replacement outlined above is based on previously decoded anchor (I, P) pictures that are available in the frame memory. If motion vectors corresponding to pixels in the erasure region can also be estimated, this temporal replacement operation can be improved via MC.
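As an illustration of temporal replacement, here is a minimal sketch for one damaged 16 × 16 MB, assuming the previously decoded anchor frame is held in a NumPy array and that an estimated motion vector (possibly zero) is supplied; the function name and the simple boundary clipping are not part of the standard and are assumptions of this sketch.

```python
import numpy as np

def temporal_replace(current, anchor, row, col, mv=(0, 0), size=16):
    """Conceal a damaged MB at (row, col) by copying the motion-compensated
    block from the previously decoded anchor frame."""
    dy, dx = mv
    # Clip the motion-compensated source block to the frame boundaries.
    y = np.clip(row + dy, 0, anchor.shape[0] - size)
    x = np.clip(col + dx, 0, anchor.shape[1] - size)
    current[row:row + size, col:col + size] = anchor[y:y + size, x:x + size]
    return current
```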

FIGURE 17.15 Error concealment uses temporal replenishment with motion compensation.



FIGURE 17.16 Error concealment uses spatial interpolation with the data from good neighbors.

Also, in the MPEG decoder, groups of video pixels (blocks, macroblocks, or slices) are separately decoded, so that pixel values and motion information corresponding to adjacent picture regions are generally available for spatial concealment. However, estimation from horizontally adjacent blocks may not always be useful, since cell loss tends to affect a number of adjacent blocks (due to the MPEG and ATM data structures); also, differential encoding between horizontally adjacent blocks tends to limit the utility of data obtained from such neighbors. Therefore, most of the usable spatial information will be located in blocks above or below the damaged region. That is, vertical processing/concealment is found to be most useful due to the transmission order of the data.

For I-pictures, the damaged data can be reconstructed either by temporal replacement from the previously decoded anchor frame or by spatial interpolation from good neighbors. These two methods will be discussed later. For P- and B-pictures, the main strategy to conceal the lost data is to replace the region with pixels from the corresponding (and possibly motion-compensated) location in the previously decoded anchor. In this replacement the motion vectors play a very important role. In other words, if good estimates of the motion information can be obtained, their use may be the least noticeable correction. Since DPCM coding of motion vectors exploits only the correlations between horizontally neighboring MBs, the redundancy between vertical neighbors still exists after encoding. Therefore, the lost motion information can be estimated from the vertical neighbors. In the following, three algorithms that have been developed for error concealment in the video decoder are described.

Algorithm 1: Spatial interpolation of missing I-picture data and temporal replacement for P- and B-pictures with MC [sun1 1992]

For I-pictures, the DC values of damaged blocks are replaced by interpolation from the closest top and bottom good neighbors; the AC coefficients of those blocks are synthesized from the DC values of the surrounding neighboring blocks.

For P-pictures, the previously decoded anchor frame with MC replaces the lost blocks. The lost motion vectors are estimated by interpolation of the ones from the top and bottom MBs. If motion vectors in both top and bottom MBs are not available, zero motion vectors are used. The same strategy is used for B-pictures; the only difference is that the closest anchor frame is used. In other words, the damaged part of the B-picture could be replaced by either the forward or backward anchor frame, depending on its temporal position.
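A small sketch of the motion-vector recovery step of Algorithm 1, assuming the MVs of the MBs above and below the damaged MB are supplied as (dy, dx) pairs or None when unavailable; the fallback of reusing a single available neighbor is an assumption of this sketch, since the text only specifies interpolation of both neighbors and the zero-MV fallback.

```python
def estimate_lost_mv(mv_top, mv_bottom):
    """Estimate the MV of a damaged MB from its vertical neighbors."""
    if mv_top is not None and mv_bottom is not None:
        # Interpolate (average) the top and bottom MVs.
        return ((mv_top[0] + mv_bottom[0]) / 2.0,
                (mv_top[1] + mv_bottom[1]) / 2.0)
    if mv_top is not None:
        return mv_top            # assumption: reuse the single available MV
    if mv_bottom is not None:
        return mv_bottom
    return (0.0, 0.0)            # neither neighbor available: zero MV
```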

Algorithm 2: Temporal replacement of missing I-picture data and temporal replacement for P- and B-pictures with top MC

For I-pictures, the damaged blocks are replaced with the colocated ones in the previously decoded anchor frame.

For P- and B-pictures, the closest previously decoded anchor frame replaces the damaged part with MC as in Algorithm 1. The only difference is that the motion vectors are


estimated only from the closest top MB instead of by interpolation of the top and bottom motion vectors. This makes the implementation of this scheme much easier. If these motion vectors are not available, then zero motion vectors are used.

In the above two algorithms, the damaged blocks in an I-picture (anchor frame) are concealed by two methods: temporal replacement and spatial interpolation. Temporal replacement is able to provide high-resolution image data to substitute for the lost data; however, in motion areas, a big difference might exist between the current intracoded frame and the previously decoded frame. In this case, temporal replacement will produce large shearing distortion unless some motion-based processing can be applied at the decoder. However, this type of processing is not generally available, since it is a computationally demanding task to locally compute motion trajectories at the decoder. In contrast, the spatial interpolation approach synthesizes lost data from the adjacent blocks in the same frame. Therefore, the intraframe redundancy between blocks is exploited, but the potential problem of severe blurring remains, due to insufficient high-order AC coefficients for active areas. To alleviate this problem, an adaptive concealment strategy can be used as a compromise; this is described in Algorithm 3.

Algorithm 3: Adaptive spatiotemporal replacement of missing I-picture data and temporal replacement with MC for P- and B-pictures

For I-pictures, the damaged blocks are concealed with temporal replacement or spatial interpolation according to the decision made from the top and bottom MBs (Figure 17.17). The decision of which concealment method to use is based on easily obtained measures of image activity from the neighboring top and bottom MBs. One candidate for the decision processor is to make the decision based on prediction error statistics measured in the neighborhood. The decision region is shown in Figure 17.17, where

\[
\mathrm{VAR} = E\left[(x - \tilde{x})^2\right], \qquad \mathrm{VAROR} = E[x^2] - \mu^2, \tag{17.31}
\]

FIGURE 17.17 Adaptive error concealment strategy.


where x is the neighboring good MB data in the current frame, x̃ is the data of the corresponding MB at the colocated position in the previously decoded frame, and μ is the average value of the neighboring good MB data in the current frame. One can appreciate that VAR is indicative of the local motion and VAROR of the local spatial detail. If VAR > VAROR and VAR > T, where T is a preset threshold value (set to 5 in the experiments), the concealment method is spatial interpolation; if VAR < VAROR or VAR < T, the concealment method is temporal replacement.
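The decision measure of Equation 17.31 might be computed as in the following sketch, where x is assumed to hold the good neighboring MB data in the current frame and x_prev the colocated MB data in the previously decoded frame; the threshold T = 5 follows the experimental value quoted above, and the function name is illustrative.

```python
import numpy as np

def choose_concealment(x, x_prev, T=5.0):
    """Select spatial interpolation or temporal replacement for a damaged MB.

    VAR measures local motion (difference against the previous frame);
    VAROR measures local spatial detail (variance of the neighboring data).
    """
    x = x.astype(np.float64)
    x_prev = x_prev.astype(np.float64)
    var = np.mean((x - x_prev) ** 2)               # VAR   = E[(x - x~)^2]
    varor = np.mean(x ** 2) - np.mean(x) ** 2      # VAROR = E[x^2] - mu^2
    if var > varor and var > T:
        return 'spatial_interpolation'
    return 'temporal_replacement'
```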

It should be noted that the concealment for luminance is performed on a block basis instead of an MB basis, while the chrominance is still on the MB basis. The detailed decisions for the luminance blocks are described as follows:

If both top and bottom are temporally replaced, then all four blocks (0, 1, 2, and 3) are replaced by the colocated ones (colocated means no MC) in the previously decoded frame;

If top is temporally replaced and bottom is spatially interpolated, then blocks 0 and 1 are replaced by the colocated ones in the previously decoded anchor frame and blocks 2 and 3 are interpolated from the block boundaries;

If top is spatially interpolated and bottom is temporally replaced, then blocks 0 and 1 are interpolated from the boundaries, and blocks 2 and 3 are replaced by the colocated ones in the previously decoded anchor frame;

If both top and bottom are not temporally replaced, all four blocks are spatially interpolated.

In spatial interpolation, a maximal smoothing technique with boundary conditions under certain smoothness measures is used. The spatial interpolation process is carried out in two steps: the mean value of the damaged block is first bilinearly interpolated from the means of the neighboring blocks, and then spatial interpolation for each pixel is performed with a Laplacian operator. Minimizing the Laplacian on the boundary pixels using the iterative process in [wang 1991] enforces maximum smoothness.
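A rough sketch of this two-step spatial interpolation follows, assuming the damaged block lies away from the frame border, the frame is stored as a float array, and a fixed number of Jacobi-style smoothing iterations stands in for the iterative process of [wang 1991]; initializing from the mean of the boundary pixels (rather than bilinear interpolation of the neighboring block means) is a simplification of this sketch.

```python
import numpy as np

def spatially_interpolate(frame, r0, c0, size=8, iters=50):
    """Fill a damaged block by mean substitution followed by Laplacian smoothing."""
    top = frame[r0 - 1, c0:c0 + size]
    bottom = frame[r0 + size, c0:c0 + size]
    left = frame[r0:r0 + size, c0 - 1]
    right = frame[r0:r0 + size, c0 + size]
    # Step 1: initialize the block with the mean of its boundary pixels.
    frame[r0:r0 + size, c0:c0 + size] = np.concatenate(
        [top, bottom, left, right]).mean()
    # Step 2: iteratively drive the Laplacian toward zero (maximum smoothness),
    # keeping the good boundary pixels fixed.
    for _ in range(iters):
        block = frame[r0 - 1:r0 + size + 1, c0 - 1:c0 + size + 1]
        smoothed = 0.25 * (block[:-2, 1:-1] + block[2:, 1:-1] +
                           block[1:-1, :-2] + block[1:-1, 2:])
        frame[r0:r0 + size, c0:c0 + size] = smoothed
    return frame
```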

For P- and B-pictures, a concealment method similar to Algorithm 2 is used, except that motion vectors from the top and bottom neighboring MBs are used for the top two blocks and the bottom two blocks, respectively.

A schematic block diagram for the implementation of adaptive error concealment for intra-coded frames is given in Figure 17.18. Corrupted MBs are first indicated by error tokens obtained through the transport interface.

FIGURE 17.18 Two-stage error concealment strategy.


Then, a decision regarding which concealment method (temporal replacement or spatial interpolation) should be used is based on easily obtained measures of image activity from the neighboring top and bottom MBs. The corrupted MBs are first classified into two classes according to the local activities. If local motion is smaller than spatial detail, the corrupted MBs are defined as the first class and will be concealed by temporal replacement; when local motion is greater than local spatial detail, the corrupted MBs are defined as the second class and will be concealed by spatial interpolation. The overall concealment procedure consists of two stages. First, temporal replacement is applied to all corrupted MBs of the first class throughout the whole frame. After the temporal replacement stage, the remaining unconcealed damaged MBs of the second class are more likely to be surrounded by valid image MBs. A stage of spatial interpolation is then performed on them. This will now result in less blurring, or the blurring will be limited to smaller areas. Therefore, a good compromise between shearing (discontinuity or shift of an edge or line) and blurring can be obtained.

17.5.3 Algorithm Enhancements

As discussed above, I-picture errors, which are imperfectly concealed, will tend to propagate through all frames in the GOP. Therefore, it is desirable to develop enhancements for the basic spatiotemporal error concealment technique to further improve the accuracy with which missing I-picture pixels are replaced. Three new algorithms have been developed for this purpose. The first is an extension of the spatial restoration technique outlined earlier, and is based on processing of edge information in a large local neighborhood to obtain better restoration of the missing data. The second and third are variations that involve encoder modifications aimed at improved error concealment performance. Specifically, information such as I-picture pseudo motion vectors, or low-resolution data in a hierarchical compression system, is added in the encoder. These redundancies can significantly benefit error concealment in decoders that must operate under higher cell loss/error conditions, while having a relatively modest impact on nominal image quality.

17.5.3.1 Directional Interpolation

Improvements in spatial interpolation algorithms (for use with MPEG I-pictures) have been proposed in [sun 1995; kwok 1993]. In these studies, additional smoothness criteria and directional filtering are used for estimating the picture area to be replaced. The new algorithms utilize spatially correlated edge information from a large local neighborhood of surrounding pixels and perform directional or multidirectional interpolation to restore the missing block. The block diagram illustrating the general principle of the restoration process is shown in Figure 17.19.

Three parts are included in the restoration processing: edge classification, spatial interpolation, and pattern mixing. The function of the classifier is to select the top one, two, or three directions that strongly characterize edge orientations in the surrounding neighborhood. Spatial interpolation is performed for each of the directions determined by the classifier. For a given direction, a series of 1-D interpolations is carried out along that direction. All of the missing pixels are interpolated from a weighted average of good neighborhood pixels. The weights depend inversely on the distance from the missing pixel to the good neighborhood pixels. The purpose of pattern mixing is to extract strong characteristic features of two or more images and merge them into one image, which is then used to replace the corrupted one. Results show that these algorithms are capable of providing subjectively better edge restoration in missing areas, and may thus be useful for I-picture processing in high error-rate scenarios. However, the computational practicality of these edge-filtering techniques needs further investigation for given application scenarios.
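A simplified sketch of the interpolation stage for a single direction follows, assuming the missing pixels are marked by a boolean mask inside a frame of otherwise good pixels; it walks along the chosen direction from each missing pixel until it finds good pixels on either side and blends them with inverse-distance weights. Edge classification and pattern mixing are omitted, and the function name is illustrative.

```python
import numpy as np

def interpolate_direction(frame, mask, angle_deg):
    """Fill masked (missing) pixels by 1-D interpolation along one direction.

    frame: 2-D float array; mask: boolean array, True where pixels are missing.
    """
    dy, dx = np.sin(np.radians(angle_deg)), np.cos(np.radians(angle_deg))
    out = frame.copy()
    rows, cols = frame.shape
    for r, c in zip(*np.nonzero(mask)):
        samples = []
        for sign in (+1, -1):                      # walk both ways along the line
            for step in range(1, max(rows, cols)):
                y = int(round(r + sign * step * dy))
                x = int(round(c + sign * step * dx))
                if not (0 <= y < rows and 0 <= x < cols):
                    break
                if not mask[y, x]:                 # first good pixel on this side
                    samples.append((frame[y, x], 1.0 / step))
                    break
        if samples:
            values, weights = zip(*samples)
            out[r, c] = np.average(values, weights=weights)
    return out
```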


FIGURE 17.19 The multi-directional edge restoration process: a classifier selects among a bank of directional interpolators (0°, 22.5°, 45°, 67.5°, 90°, 112.5°, 135°, 157.5°) whose outputs are combined by an image mixer to produce the restored MB.

17.5.3.2 I-Picture Motion Vectors

Motion information is very useful in concealing losses in P- and B-frames, but is not available for I-pictures. This limits the concealment algorithm to the spatial or direct temporal replacement options described above, which may not always be successful in moving areas of the picture. If motion vectors are made available for all MPEG frames (including intra-coded ones) as an aid for error concealment [sun1 1992], good error concealment performance can be obtained without the complexity of adaptive spatial processing. Therefore, a syntax extension has been adopted by MPEG-2 whereby motion vectors can be transmitted in an I-picture as redundancy for error concealment purposes [sun2 1992]. The MB syntax is not changed; however, the motion vectors are interpreted in the following way: the decoded forward motion vectors belong to the MB spatially below the current MB, and describe how that MB can be replaced from the previous anchor frame in the event that the MB cannot be recovered. Simulation results have shown that subjective picture quality with I-picture motion vectors is noticeably superior to conventional temporal replacement, and that the overhead for transmitting the additional motion vectors is less than 0.7% of the total bit rate at bit rates of about 6–7 Mbits/s.

17.5.3.3 Spatial Scalable Error Concealment

This approach for error concealment of MPEG video is based on the scalability (or hierarchy) feature of MPEG-2 [mpeg2]. Hierarchical transmission provides more possibilities for error concealment when a corresponding two-tier transmission medium is available. A block diagram illustrating the general principle of a coding system with spatial scalability and error concealment is shown in Figure 17.20.


FIGURE 17.20 Block diagram of spatial scalability with error concealment.

It should be noted that the concept of scalable error concealment is different from the two-tier concept with data partitioning. Scalable concealment uses the spatial scalability feature of MPEG-2, while the two-tier case uses the data partitioning feature of MPEG-2, in which the data corresponds to the same spatial resolution layer but is partitioned into two parts with a breakpoint. In spatial scalability, the encoder produces two separate bitstreams: one for the low-resolution base layer and another for the high-resolution enhancement layer. The high-resolution layer is encoded with an adaptive choice of temporal prediction from previous anchor frames and compatible spatial prediction (obtained from the up-sampled low-resolution layer) corresponding to the current temporal reference. In the decoder, redundancies that exist in the scaling data greatly benefit the error concealment processing. In a simple experiment with spatial scalable MPEG-2, we consider a scenario in which losses in the high-resolution MPEG-2 video are concealed with information from the low-resolution layer. Actually, there are two kinds of information in the lower layer that can be used to conceal the data loss in the high-resolution layer: up-sampled picture data and scaled motion information. Therefore, three error concealment approaches are possible:

1. Up-sampled substitution: Lost data is replaced by colocated up-sampled data in the low-resolution decoded frame. The up-sampled picture is obtained from the low-resolution picture with a proper up-sampling filter.

2. Mixed substitution: Lost MBs in I-pictures are replaced by colocated up-sampled MBs in the low-resolution decoded frame, while lost MBs in P- and B-pictures are temporally replaced by the previously decoded anchor frame using the motion vectors of the low-resolution layer.

3. Motion vector substitution: Lost MBs are replaced from the previously decoded anchor frame using the motion vectors of the low-resolution layer, appropriately scaled.

Since motion vectors are not available for I-pictures, method 3 obviously does not work for I-pictures (unless I-picture motion vectors [concealment motion vectors] of MPEG-2 are generated in the encoder). Simulation results have shown that, on average, up-sampled substitution outperforms the other two, and mixed substitution also provides acceptable results in the case of video with smooth motion.
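A minimal sketch of up-sampled substitution follows, assuming the decoded low-resolution frame is at half the spatial resolution and the lost MB is aligned to even coordinates; simple 2× pixel replication stands in for whatever up-sampling filter the decoder actually uses.

```python
import numpy as np

def upsample_substitute(hi_frame, lo_frame, row, col, size=16):
    """Conceal a lost high-resolution MB with the colocated up-sampled
    low-resolution data (up-sampled substitution)."""
    # Take the colocated low-resolution region and up-sample it by 2x replication.
    lo_block = lo_frame[row // 2:(row + size) // 2, col // 2:(col + size) // 2]
    up_block = np.repeat(np.repeat(lo_block, 2, axis=0), 2, axis=1)
    hi_frame[row:row + size, col:col + size] = up_block
    return hi_frame
```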


TABLE 17.4
Subjective Quality Comparison

Picture Material   Items               Algorithm 1   Algorithm 2   Algorithm 3   Comments
Still              Blurring            High          None          Low           Temporal replacement works
                   Shearing            None          None          None          very good in no-motion areas
                   Artifact blocking   Medium        None          Low
Slow motion        Blurring            High          None          Low           Temporal replacement works
                   Shearing            None          Low           Low           well in slow-motion areas
                   Artifact blocking   Medium        None          Low
Fast motion        Blurring            High          None          Medium        Temporal replacement causes more shearing
                   Shearing            None          High          Low           Spatial interpolation results in blurring
                   Artifact blocking   High          Low           Medium        Adaptive strategy limits blurring to smaller areas
Overall            The adaptive strategy of steering the temporal replacement and spatial interpolation according
                   to the measures of local activity and local motion gives a good compromise between shearing
                   and blurring.

17.5.4 Summary of Error Concealment

In this section, a general class of error concealment algorithms for MPEG video has been discussed. The error concealment approaches that have been described are practical for current MPEG decoder implementations and have been demonstrated to provide significant robustness. Specifically, it has been shown that the adaptive spatiotemporal algorithm can provide reasonable picture quality at cell loss ratios (CLR) as high as 10⁻³ when used in conjunction with an appropriate transport structure. These results confirm that compressed video is far less fragile than originally believed when appropriate transport and concealment techniques are employed. The results are summarized in Table 17.4.

Several concealment algorithm extensions based on directional filtering, I-picture pseudo motion vectors, and MPEG-2 scalability were also considered and shown to provide performance gains that may be useful in certain application scenarios. In view of the practical benefits of robust video delivery, it is recommended that such error-resilience functions (along with associated transport structures) be considered for implementation in emerging TV, HDTV, teleconferencing, and multimedia systems if the cell loss rates on these transmission systems are significant. Particularly for terrestrial broadcasting and ATM network scenarios, we believe that robust video delivery based on decoder error concealment is an essential element of a viable system design.

17.6 Summary

In this chapter, several application issues of MPEG-2 have been discussed. The most successful application of MPEG-2 is the US HDTV standard. The other application issues include transcoding with bitstream scaling, down-conversion decoding, and error concealment. Transcoding is a very interesting topic that deals with converting bitstreams between different standards. Error concealment is very useful for noisy communication channels such as terrestrial television broadcasting. The down-conversion decoder responds to the market requirement during the DTV transition period and to the long-term need for displaying DTV signals on computer monitors.


Exercises

1. In DTV applications, describe the advantages and disadvantages of the interlaced format and the progressive format. Explain why the computer industry favors the progressive format and TV manufacturers like the interlaced format.

2. Do all DTV formats have square pixels? Why is the square pixel format important for digital television?

3. Bitstream scaling is one kind of transcoding. Based on your knowledge, describe several other kinds of transcoding (such as MPEG-1 to JPEG) and propose a feasible solution to achieve the transcoding requirements.

4. What type of MPEG-2 frames will cause a higher degree of error propagation if errors occur? What technique of error concealment is allowed by the MPEG-2 syntax? Using this technique, perform simulations with several images to determine the penalty in the case of no errors.

5. To reduce the drift in a down-conversion decoder, what coding parameters can be chosen at the encoder? Will these actions affect the coding performance?

6. What are the advantages and disadvantages of a down-conversion decoder in the frequency domain and in the spatial domain?

References

[bao 1996] J. Bao, H. Sun, and T. Poon, HDTV down-conversion decoder, IEEE Transactions on Consumer Electronics, 42, 3, 402–410, August 1996.
[boyce 1995] J. Boyce, J. Henderson, and L. Pearlstein, An SDTV decoder with HDTV capability: An all-format ATV decoder, SMPTE Fall Conference, New Orleans, LA, 1995.
[bruni 1998] R. Bruni, A. Chimienti, M. Lucenteforte, D. Pau, and R. Sannino, A novel adaptive vector quantization method for memory reduction in MPEG-2 HDTV receivers, IEEE Transactions on Consumer Electronics, 44, 3, 537–544, August 1998.
[de with 1998] P.H.N. de With, P.H. Frencken, and M.v.d. Schaar-Mitrea, An MPEG decoder with embedded compression for memory reduction, IEEE Transactions on Consumer Electronics, 44, 3, 545–555, August 1998.
[ga 1994] Grand Alliance HDTV System Specification Version 2.0, December 7, 1994.
[ghanbari 1989] M. Ghanbari, Two-layer coding of video signals for VBR networks, IEEE Journal on Selected Areas in Communications, 7, 5, 771–781, June 1989.
[harthanck 1986] W. Harthanck, W. Keesen, and D. Westerkamp, Concealment techniques for block encoded TV-signals, Picture Coding Symposium, 1986.
[isnardi 1993] M.A. Isnardi, Consumers seek easy-to-use products, IEEE Spectrum, January 1993, p. 64.
[jayant 1984] N.N. Jayant and P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice-Hall, Englewood Cliffs, NJ, 1984.
[jeng 1991] F.-C. Jeng and S.H. Lee, Concealment of bit error and cell loss in inter-frame coded video transmission, ICC Proceedings, ICC '91, pp. 496–500.
[joseph 1992a] K. Joseph, S. Ng, D. Raychaudhuri, R.S. Girons, T. Savatier, R. Siracusa, and J. Zdepski, MPEG++: A robust compression and transport system for digital HDTV, Signal Processing: Image Communication, 4, 307–323, 1992.
[joseph 1992b] K. Joseph, S. Ng, D. Raychaudhuri, R. Saint Girons, R. Siracusa, and J. Zdepski, Prioritization and transport in the ADTV digital simulcast system, Proceedings of ICCE '92, June 1992.
[karlsson 1989] G. Karlsson and M. Vetterli, Packet video and its integration into the network architecture, IEEE Journal on Selected Areas in Communications, 739–751, June 1989.


[kishno 1989] F. Kishino, K. Manabe, Y. Hayashi, and H. Yasuda, Variable bit-rate coding of video signals for ATM networks, IEEE Journal on Selected Areas in Communications, 7, 5, 801–806, June 1989.
[kwok 1993] W. Kwok and H. Sun, Multi-directional interpolation for spatial error concealment, IEEE Transactions on Consumer Electronics, 455–460, August 1993.
[lancaster 1985] P. Lancaster and M. Tismenetsky, The Theory of Matrices with Applications, Academic Press, Boston, MA, 1985.
[lei 1999] S. Lei, A quadtree embedded compression algorithm for memory-saving DTV decoders, Proceedings of the International Conference on Consumer Electronics, Los Angeles, CA, June 1999.
[merhav 1997] N. Merhav and V. Bhaskaran, Fast algorithms for DCT-domain image down-sampling and for inverse motion compensation, IEEE Transactions on Circuits and Systems for Video Technology, 7, 3, 468–476, June 1997.
[mokry 1994] R. Mokry and D. Anastassiou, Minimal error drift in frequency scalability for motion compensated DCT coding, IEEE Transactions on Circuits and Systems for Video Technology, 4, 4, 392–406, August 1994.
[mpeg2] MPEG-2 International Standard, Video Recommendation ITU-T H.262, ISO/IEC 13818-2, January 10, 1995.
[mpeg21] MPEG-21 Overview v.5, ISO/IEC JTC1/SC29/WG11/N5231, October 2002.
[ng 1993] S. Ng, Thompson Consumer Electronics, Low resolution HDTV receivers, U.S. Patent 5,262,854, November 16, 1993.
[pang 1996] K.K. Pang, H.G. Lim, S. Dunstan, and J.M. Badcock, Frequency domain decimation and interpolation techniques, Picture Coding Symposium, Melbourne, Australia, March 1996.
[peng 2002] P. Yin, A. Vetro, B. Lui, and H. Sun, Drift compensation for reduced spatial resolution transcoding, IEEE Transactions on Circuits and Systems for Video Technology, 12, 1009–1020, November 2002.
[reitmeier 1996] G.A. Reitmeier, The U.S. Advanced Television Standard and its Impact on VLSI, submitted to VLSI and Signal Processing, 1996.
[siracusa 1993] R. Siracusa, K. Joseph, J. Zdepski, and D. Raychaudhuri, Flexible and robust packet transport for digital HDTV, IEEE Journal on Selected Areas in Communications, 11, 1, 88–98, January 1993.
[sun1 1992] H. Sun, K. Challapali, and J. Zdepski, Error concealment in simulcast AD-HDTV decoder, IEEE Transactions on Consumer Electronics, 38, 3, 108–118, August 1992.
[sun2 1992] H. Sun, M. Uz, J. Zdepski, and R. Saint Girons, A proposal for increased error resilience, ISO-IEC/JTC1/SC29/WG11, MPEG92, September 30, 1992.
[sun 1993] H. Sun, Hierarchical decoder for MPEG compressed video data, IEEE Transactions on Consumer Electronics, 39, 3, 559–562, August 1993.
[sun 1995] H. Sun and W. Kwok, Restoration of damaged block transform coded image using projection onto convex sets, IEEE Transactions on Image Processing, 4, 4, 470–477, April 1995.
[tm5] MPEG Test Model 5, ISO/IEC JTC1/SC29/WG11 Document, April 1993.
[vetro 1998a] A. Vetro and H. Sun, On the motion compensation within a down-conversion decoder, Journal of Electronic Imaging, 7, 3, July 1998.
[vetro 1998b] A. Vetro and H. Sun, Frequency domain down-conversion using an optimal motion compensation scheme, Journal of Imaging Science and Technology, 9, 4, August 1998.
[vetro 1998c] A. Vetro, H. Sun, P. DaGraca, and T. Poon, Minimum drift architectures for three-layer scalable DTV decoding, IEEE Transactions on Consumer Electronics, 44, 3, August 1998.
[vetro 2003] A. Vetro, C. Christopolous, and H. Sun, Video transcoding architectures and techniques: An overview, IEEE Signal Processing Magazine, 18–29, March 2003.
[wang 1991] Y. Wang and Q.-F. Zhu, Signal loss recovery in DCT-based image and video codecs, Proceedings of SPIE Visual Communications and Image Processing, Boston, 667–678, November 1991.
[yu 1999] H. Yu, W.-M. Lam, B. Canfield, and B. Beyers, Block-based image processor for memory efficient MPEG video decoding, Proceedings of the International Conference on Consumer Electronics, Los Angeles, CA, June 1999.


[zdepski 1989] J. Zdepski et al., Packet transport of rate-free interframe DCT compressed digital video on a CSMA/CD LAN, Proceedings of the IEEE Global Conference on Communications, Dallas, TX, November 1989.
[zdepski 1990] J. Zdepski et al., Prioritized packet transport of VBR CCITT H.261 format compressed video on a CSMA/CD LAN, Third International Workshop on Packet Video, Morristown, NJ, March 22–23, 1990.


18 MPEG-4 Video Standard: Content-Based Video Coding

This chapter provides an overview of the ISO MPEG-4 standard. The MPEG-4 work includes natural video, synthetic video, audio, and systems. Both natural and synthetic video have been combined into a single part of the standard, which is referred to as MPEG-4 visual [mpeg-4 visual]. It should be emphasized that neither MPEG-1 nor MPEG-2 considers synthetic video (or computer graphics), and MPEG-4 is also the first standard to consider the problem of content-based coding. Here, we will focus on the video parts of the MPEG-4 standard.

18.1 Introduction

As discussed in the previous chapters, MPEG has completed two standards: MPEG-1, which was mainly targeted at CD-ROM applications up to 1.5 Mbits/s, and MPEG-2, for digital TV and HDTV applications at bit rates between 2 and 30 Mbits/s. In July 1993, MPEG started its new project, MPEG-4, which was targeted at providing technology for multimedia applications. The first working draft (WD) was completed in November 1996 and the committee draft (CD) of version 1 was reached in November 1997. The draft international standard (DIS) of MPEG-4 was completed in November of 1998. The international standard (IS) of MPEG-4 version 1 was completed in February of 1999. The goal of the MPEG-4 standard is to provide the core technology that allows efficient content-based storage, transmission, and manipulation of video, graphics, audio, and other data within a multimedia environment. As mentioned earlier, there exist several video coding standards such as MPEG-1/2, H.261, and H.263. Why do we need a new standard for multimedia applications? In other words, are there any new attractive features of MPEG-4 that the current standards do not have or cannot provide? The answer is yes. MPEG-4 has many interesting features that will be described later in this chapter. Some of these features are focused on improving coding efficiency; some are used to provide robustness of transmission and interactivity with the end user. However, among these features the most important one is content-based coding. MPEG-4 is the first standard that supports content-based coding of audiovisual objects (AVO). For content providers or authors, the MPEG-4 standard can provide greater reusability, flexibility, and manageability of the content that is produced. For network providers, MPEG-4 will offer transparent information that can be interpreted and translated into the appropriate native signaling messages of each network. This can be accomplished with the help of relevant standards bodies that have the jurisdiction. For end users, MPEG-4 can provide more functionality to make the user


terminal have more capabilities of interaction with the content. To reach these goals, MPEG-4 should have the following important features.

The contents such as audio, video, or data are represented in the form of primitive AVOs. These AVOs can be natural scenes or sounds, which are recorded by a video camera, or synthetically generated by computers.

The AVOs can be composed together to create compound AVOs or scenes.

The data associated with AVOs can be multiplexed and synchronized so that they can be transported through network channels with certain quality requirements.

18.2 MPEG-4 Requirements and Functionalities

As the MPEG-4 standard is mainly targeted at multimedia applications, there are many requirements to ensure that several important features and functionalities are offered. These features include the allowance of interactivity, high compression, universal accessibility, and portability of audio and video content. From the MPEG-4 video requirement document, the main functionalities can be summarized by the following three aspects: content-based interactivity, content-based efficient compression, and universal access.

18.2.1 Content-Based Interactivity

In addition to provisions for efficient coding of conventional video sequences, MPEG-4 video has the following features of content-based interactivity.

18.2.1.1 Content-Based Manipulation and Bitstream Editing

MPEG-4 supports content-based manipulation and bitstream editing without the need for transcoding. In MPEG-1 and MPEG-2, there is no syntax and no semantics for supporting true manipulation and editing in the compressed domain. MPEG-4 provides the syntax and techniques to support content-based manipulation and bitstream editing. The level of access, editing, and manipulation can be done at the object level in connection with the features of content-based scalability.

18.2.1.2 Synthetic and Natural Hybrid Coding

MPEG-4 supports combining synthetic scenes or objects with natural scenes or objects. This is for compositing synthetic data with ordinary video, allowing for interactivity. The related techniques in MPEG-4 for supporting this feature include sprite coding, efficient coding of 2-D and 3-D surfaces, and wavelet coding for still textures.

18.2.1.3 Improved Temporal Random Access

MPEG-4 provides an efficient method to randomly access, within a limited time and with fine resolution, parts of an audio-visual sequence, e.g., video frames or arbitrarily shaped image objects. This includes conventional random access at very low bit rates. This feature is also important for content-based bitstream manipulation and editing.

18.2.2 Content-Based Efficient Compression

The initial goal of MPEG-4 was to provide a highly efficient coding tool with high compression at very low bit rates. But this goal has now been extended to a large range of bit rates from


10 kbits/s to 5 Mbits/s, which covers formats from QSIF to CCIR 601. Two important items are included in this requirement.

18.2.2.1 Improved Coding Efficiency

The MPEG-4 video standard provides subjectively better visual quality at comparable bit rates compared with the existing or emerging standards, including MPEG-1/2 and H.263. MPEG-4 video contains many new tools that optimize the coding in different bit rate ranges. Some experimental results have shown that it outperforms MPEG-2 and H.263 at low bit rates. Also, the content-based coding reaches a performance similar to that of frame-based coding.

18.2.2.2 Coding of Multiple Concurrent Data Streams

MPEG-4 provides the capability of coding multiple views of a scene efficiently. For stereoscopic video applications, MPEG-4 allows the ability to exploit the redundancy in multiple viewing points of the same scene, permitting joint coding solutions that allow compatibility with normal video as well as solutions without compatibility constraints.

18.2.3 Universal Access

Another important feature of MPEG-4 video is universal access.

18.2.3.1 Robustness in Error-Prone Environments

MPEG-4 video provides strong error robustness capabilities to allow access to applications over a variety of wireless and wired networks and storage media. Sufficient error robustness is provided for low bit rate applications under severe error conditions (e.g., long error bursts).

18.2.3.2 Content-Based Scalability

MPEG-4 video provides the ability to achieve scalability with fine granularity in content, quality (e.g., spatial and temporal resolution), and complexity. These scalabilities are especially intended to result in content-based scaling of visual information.

18.2.4 Summary of MPEG-4 Features

From the above description of MPEG-4 features, it is obvious that the most important application of MPEG-4 will be in a multimedia environment. The media that can use the coding tools of MPEG-4 include computer networks, wireless communication networks, and the Internet. Although it can also be used for satellite, terrestrial broadcasting, and cable TV, these are still the territories of MPEG-2 video because MPEG-2 has already made such a large impact in the market. A large number of silicon solutions exist and its technology is more mature compared with the current MPEG-4 standard. From the viewpoint of coding theory, we can say there is no significant breakthrough in MPEG-4 video compared with MPEG-2 video. Therefore, we cannot expect a significant improvement in coding efficiency when using MPEG-4 video over MPEG-2. Although MPEG-4 optimized its performance in a certain range of bit rates, its major strength is that it provides more functionality than MPEG-2. Recently, MPEG-4 added the necessary tools to support interlaced material. With this addition, MPEG-4 video supports the


functionalities already provided by MPEG-1 and MPEG-2, including the provision to efficiently compress standard rectangular-sized video at different levels of input formats, frame rates, and bit rates.

Overall, the incorporation of an object- or content-based coding structure is the feature that allows MPEG-4 to provide more functionality. It enables MPEG-4 to provide the most elementary mechanism for interactivity and manipulation with objects of images or video in the compressed domain without the need for further segmentation or transcoding at the receiver, since the receiver can receive separate bitstreams for different objects contained in the video. To achieve content-based coding, MPEG-4 uses the concept of a video object plane (VOP). It is assumed that each frame of an input video is first segmented into a set of arbitrarily shaped regions or VOPs. Each such region could cover a particular image or video object (VO) in the scene. Therefore, the input to the MPEG-4 encoder can be a VOP, and the shape and the location of the VOP can vary from frame to frame. A sequence of VOPs is referred to as a VO. The different VOs may be encoded into separate bitstreams. MPEG-4 specifies demultiplexing and composition syntax, which provide the tools for the receiver to decode the separate VO bitstreams and composite them into a frame. In this way, the decoders have more flexibility to edit or rearrange the decoded VOs. The detailed technical issues will be addressed in the following sections.

18.3 Technical Description of MPEG-4 Video

18.3.1 Overview of MPEG-4 Video

The major feature of MPEG-4 is to provide the technology for object-based compression, which is capable of separately encoding and decoding VOs. To clearly explain the idea of object-based coding, we should review the set of VO-related definitions. An image scene may contain several objects. In the example of Figure 18.1, the scene contains the background and two objects. Each time instant of a VO is referred to as a VOP. The concept of a VO provides a number of functionalities of MPEG-4 that are either impossible or very difficult in MPEG-1 or MPEG-2 video coding. Each VO is described by the information of texture, shape, and motion vectors (MVs). The video sequence can be encoded in a way that will allow the separate decoding and reconstruction of the objects and allow the editing and manipulation of the original scene by simple operations on the compressed bitstream domain. The feature of object-based coding is also able to support functionality such as warping of synthetic or natural text, textures, image, and video overlays on reconstructed VOs.

FIGURE 18.1 Video object definition and format. (a) Video object, (b) VOPs.


As MPEG-4 aims at providing coding tools for a multimedia environment, these tools not only allow one to efficiently compress natural VOs, but also to compress synthetic objects, which are a subset of the larger class of computer graphics. The tools of MPEG-4 video include:

. Motion estimation (ME) and MC

. Texture coding

. Shape coding

. Sprite coding

. Interlaced video coding

. Wavelet-based texture coding

. Generalized temporal and spatial as well as hybrid scalability

. Error resilience

The technical details of these tools will be explained in the following sections.

18.3.2 Motion Estimation and Compensation

For object-based coding, the coding task includes two parts: texture coding and shape coding. The current MPEG-4 video texture coding is still based on the combination of motion compensated (MC) prediction and transform coding (TC). MC predictive coding is a well-known approach for video coding. The MC is used to remove the interframe redundancy and the TC is used to remove the intraframe redundancy, as in the MPEG-2 video coding scheme. However, there are many modifications and technical details in MPEG-4 for coding over a very wide range of bit rates. Moreover, MPEG-4 coding has been optimized for low bit rate applications with a number of new tools. In other words, MPEG-4 video coding uses the most common coding technologies, such as MC and TC, but at the same time it modifies some traditional methods, such as advanced MC, and also creates some new features, such as sprite coding.

The basic technique to perform MC predictive coding for coding a video sequence is ME. The basic ME method used in MPEG-4 video coding is still the block matching technique. The basic principle of block matching for ME is to find the best-matched block in the previous frame for every block in the current frame. The displacement of the best-matched block relative to the current block is referred to as the MV. Positive values for both MV components indicate that the best-matched block is on the bottom-right of the current block. The MC prediction difference block is formed by subtracting the pixel values of the best-matched block from the current block, pixel by pixel. The difference block is then coded by a texture-coding method. In MPEG-4 video coding, the basic technique of texture coding is the discrete cosine transform (DCT). The coded MV information and difference block information is contained in the compressed bitstream, which is transmitted to the decoder. The major issues in ME and MC are the same as in MPEG-1 and MPEG-2, which include the matching criterion, the size of the search window (searching range), the size of the matching block, the accuracy of MVs (one pixel or half pixel), and the inter/intra mode decision. As we have discussed these topics already, we will focus on the new features in MPEG-4 video coding. The feature of advanced motion prediction is a new tool of MPEG-4 video. This feature includes two aspects: adaptive selection of a 16 × 16 block or four 8 × 8 blocks to match the current 16 × 16 block, and overlapped MC for luminance blocks.


18.3.2.1 Adaptive Selection of 16 × 16 Block or Four 8 × 8 Blocks

The purpose of the adaptive selection of the matching block size is to further enhance coding efficiency. The coding performance may be improved at low bit rates because the bits for coding the prediction difference could be greatly reduced at the limited extra cost of increasing the number of MVs. Of course, if the cost of coding MVs is too high, this method will not work, so the decision in the encoder should be made very carefully. To explain the decision procedure, we define {C(i, j), i, j = 0, 1, . . . , N − 1} to be the pixels of the current block and {P(i, j), i, j = 0, 1, . . . , N − 1} to be the pixels in the search window in the previous frame. The sum of absolute difference (SAD) is calculated as

\[
\mathrm{SAD}_N(x, y) =
\begin{cases}
\displaystyle \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| C(i, j) - P(i, j) \right| - T, & \text{if } (x, y) = (0, 0) \\[2ex]
\displaystyle \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| C(i, j) - P(i + x, j + y) \right|, & \text{otherwise}
\end{cases}
\tag{18.1}
\]

where (x, y) is a displacement within the range of the search window and T is a positive constant.

The following steps then make the decision:

Step 1: Find SAD16(MVx, MVy);

Step 2: Find SAD8(MV1x, MV1y), SAD8(MV2x, MV2y), SAD8(MV3x, MV3y), and SAD8(MV4x, MV4y);

Step 3: If

\[
\sum_{i=1}^{4} \mathrm{SAD}_8(MV_{ix}, MV_{iy}) < \mathrm{SAD}_{16}(MV_x, MV_y) - 128
\]

then choose 8 × 8 prediction; otherwise, choose 16 × 16 prediction.

If the 8 × 8 prediction is chosen, there are four MVs for the four 8 × 8 luminance blocks that will be transmitted. The MV for the two chrominance blocks is then obtained by taking the average of these four MVs and dividing the average value by a factor of two. Since each MV for the 8 × 8 luminance block has half-pixel accuracy, the MV for the chrominance blocks may have sixteenth-pixel accuracy.
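The decision of Step 3 might be applied as in the following sketch, assuming block matching has already produced sad16 and mv16 for the 16 × 16 match and four (sad8, mv8) pairs for the 8 × 8 matches, with the −T bias of Equation 18.1 already folded into the SAD values at zero displacement; the function name and return format are illustrative.

```python
def choose_block_size(sad16, mv16, results8):
    """Apply the Step 3 test: prefer four 8 x 8 MVs only if their total SAD
    beats the 16 x 16 SAD by more than 128."""
    total_sad8 = sum(sad for sad, _ in results8)
    if total_sad8 < sad16 - 128:
        mvs = [mv for _, mv in results8]           # four MVs, one per 8 x 8 block
        # Chrominance MV: average of the four luminance MVs, divided by two
        # (half-pixel luminance MVs give sixteenth-pixel chrominance accuracy).
        chroma_mv = (sum(mv[0] for mv in mvs) / 4.0 / 2.0,
                     sum(mv[1] for mv in mvs) / 4.0 / 2.0)
        return '8x8', mvs, chroma_mv
    return '16x16', [mv16], (mv16[0] / 2.0, mv16[1] / 2.0)
```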

18.3.2.2 Overlapped Motion Compensation

This kind of MC is always used for the case of four 8 × 8 blocks. The case of one MV for a 16 × 16 block can be considered as having four identical 8 × 8 MVs, each for an 8 × 8 block. Each pixel in an 8 × 8 best-matched luminance block is a weighted sum of three prediction values, as specified in Equation 18.2:

\[
p'(i, j) = \left( H_0(i, j)\, q(i, j) + H_1(i, j)\, r(i, j) + H_2(i, j)\, s(i, j) \right) / 8 \tag{18.2}
\]


where division is with roundoff. The weighting matrices are specified as

\[
H_0 = \begin{bmatrix}
4&5&5&5&5&5&5&4\\
5&5&5&5&5&5&5&5\\
5&5&6&6&6&6&5&5\\
5&5&6&6&6&6&5&5\\
5&5&6&6&6&6&5&5\\
5&5&6&6&6&6&5&5\\
5&5&5&5&5&5&5&5\\
4&5&5&5&5&5&5&4
\end{bmatrix}, \quad
H_1 = \begin{bmatrix}
2&2&2&2&2&2&2&2\\
1&1&2&2&2&2&1&1\\
1&1&1&1&1&1&1&1\\
1&1&1&1&1&1&1&1\\
1&1&1&1&1&1&1&1\\
1&1&1&1&1&1&1&1\\
1&1&2&2&2&2&1&1\\
2&2&2&2&2&2&2&2
\end{bmatrix}, \quad
H_2 = \begin{bmatrix}
2&1&1&1&1&1&1&2\\
2&2&1&1&1&1&2&2\\
2&2&1&1&1&1&2&2\\
2&2&1&1&1&1&2&2\\
2&2&1&1&1&1&2&2\\
2&2&1&1&1&1&2&2\\
2&2&1&1&1&1&2&2\\
2&1&1&1&1&1&1&2
\end{bmatrix}
\]

It is noted that H0(i, j) + H1(i, j) + H2(i, j) = 8 for all possible (i, j). The values of q(i, j), r(i, j), and s(i, j) are the values of the pixels in the previous frame at the locations

q(i, j) = p(i + MV0x, j + MV0y),
r(i, j) = p(i + MV1x, j + MV1y),
s(i, j) = p(i + MV2x, j + MV2y),        (18.3)

where (MV0x, MV0y) is the MV of the current 8×8 luminance block p(i, j), (MV1x, MV1y) is the MV of the block either above (for j = 0, 1, 2, 3) or below (for j = 4, 5, 6, 7) the current block, and (MV2x, MV2y) is the MV of the block either to the left (for i = 0, 1, 2, 3) or to the right (for i = 4, 5, 6, 7) of the current block. The overlapped MC can reduce the prediction noise to a certain extent.
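A direct transcription of Equation 18.2 might look like the following sketch. It assumes the three 8×8 predictions q, r, and s have already been fetched per pixel with the current, above/below, and left/right MVs as described above; H2 is built as the transpose of H1, and H0 from the constraint that the three matrices sum to 8, which matches the values listed.

import numpy as np

# Weighting matrices of Section 18.3.2.2.
H1 = np.array([[2, 2, 2, 2, 2, 2, 2, 2],
               [1, 1, 2, 2, 2, 2, 1, 1],
               [1, 1, 1, 1, 1, 1, 1, 1],
               [1, 1, 1, 1, 1, 1, 1, 1],
               [1, 1, 1, 1, 1, 1, 1, 1],
               [1, 1, 1, 1, 1, 1, 1, 1],
               [1, 1, 2, 2, 2, 2, 1, 1],
               [2, 2, 2, 2, 2, 2, 2, 2]])
H2 = H1.T                 # left/right weights are the transpose of top/bottom
H0 = 8 - H1 - H2          # enforces H0 + H1 + H2 = 8

def obmc_prediction(q, r, s):
    """Overlapped MC prediction of one 8x8 luminance block (Equation 18.2).

    q : 8x8 prediction fetched with the current block's MV
    r : 8x8 prediction fetched with the MV of the block above/below
    s : 8x8 prediction fetched with the MV of the block to the left/right
    """
    acc = H0 * q + H1 * r + H2 * s
    return (acc + 4) // 8      # division by 8 with roundoff (assumed as round-half-up)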

18.3.3 Texture Coding

Texture coding is used to code the INTRA VOPs and the prediction residual data after MC. The algorithm for video texture coding is based on the conventional 8×8 DCT with MC. The DCT is performed for each luminance and chrominance block, where the MC is performed only on the luminance blocks. This algorithm is similar to those in H.263 and MPEG-1 as well as MPEG-2. However, MPEG-4 video texture coding has to deal with the requirements of object-based coding, which is not included in the other video coding standards. Hereafter, we will focus on the new features of MPEG-4 video coding. These new features include the INTRA DC and AC prediction for I-VOP and P-VOP, the algorithm of ME and MC for arbitrary shaped VOPs, and the strategy of arbitrary shaped texture coding. The definitions of I-VOP, P-VOP, and B-VOP are similar to the intra-coded (I), predictive-coded (P), and bidirectionally predictive-coded (B) pictures in Chapter 16 for MPEG-1 and MPEG-2.

18.3.3.1 INTRA DC and AC Prediction

In intra-mode coding, predictive coding is applied not only to the DC coefficients but also to the AC coefficients to increase the coding efficiency. The adaptive DC prediction involves the selection of the quantized DC (QDC) value of either the immediately left block or the immediately above block. The selection criterion is based on a comparison of the horizontal and vertical DC gradients around the block to be coded. Figure 18.2 shows the three surrounding blocks A, B, and C of the current block X whose QDC is to be coded, where blocks A, B, and C are the immediately left, the above-left, and the immediately above block of X, respectively. The QDC value of block X, QDCX, is predicted by either


FIGURE 18.2  Previous neighboring blocks used in DC prediction: A (left), B (above-left), and C (above) of the current block X within the macroblock.

the QDC value of block A, QDCA, or the QDC value of block C, QDCC, based on the comparison of horizontal and vertical gradients as follows:

If |QDCA − QDCB| < |QDCB − QDCC|,  QDCP = QDCC;  otherwise  QDCP = QDCA        (18.4)

The differential DC is then obtained by subtracting the DC prediction, QDCP, from QDCX. If any of blocks A, B, or C is outside the VOP boundary, or does not belong to an INTRA-coded block, its QDC value is assumed to take the value 128 (if the pixels are quantized to 8 bits) for computing the prediction. The DC prediction is performed in the same way for the luminance block and for each of the two chrominance blocks.

For AC coefficient prediction, either the coefficients from the first row or those from the first column of a previously coded block are used to predict the co-sited (same position in the block) coefficients of the current block. On a block basis, the same rule used to select the best prediction direction (vertical or horizontal) for the DC coefficients is also used for AC coefficient prediction. A difference between DC and AC prediction is the issue of the quantization scale. All DC values are quantized to 8 bits for all blocks. However, the AC coefficients may be quantized with different quantization scales in different blocks. To compensate for the differences in quantization of the blocks used for prediction, scaling of the prediction coefficients becomes necessary. The prediction is scaled by the ratio of the current quantization step size to the quantization step size of the block used for prediction. In cases where AC prediction results in a larger range of prediction errors compared with the original signal, it is desirable to disable the AC prediction. The decision to switch AC prediction on or off is made on a macroblock (MB) basis instead of a block basis to avoid excessive overhead. This decision is based on a comparison of the sum of the absolute values of all AC coefficients to be predicted in an MB and that of their prediction differences. It should be noted that the same DC and AC prediction algorithm is used for the INTRA blocks in an inter-coded VOP. If any block used for prediction is not an INTRA block, the QDC and QAC values used for prediction are set to 128 and 0 for DC and AC prediction, respectively.
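The gradient test of Equation 18.4 can be sketched as follows. This is only an illustration of the selection rule; the function names are hypothetical, and a missing or non-INTRA neighbor is passed in as 128 as stated above.

def predict_qdc(qdc_a, qdc_b, qdc_c):
    """Select the DC predictor of block X from neighbors A (left),
    B (above-left), and C (above), Equation 18.4."""
    if abs(qdc_a - qdc_b) < abs(qdc_b - qdc_c):
        return qdc_c       # predict from the block above (C)
    return qdc_a           # predict from the block to the left (A)

def dc_residual(qdc_x, qdc_a, qdc_b, qdc_c):
    # Differential DC actually transmitted for block X.
    return qdc_x - predict_qdc(qdc_a, qdc_b, qdc_c)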

18.3.3.2 Motion Estimation/Compensation of Arbitrary Shaped VOP

In the previous section, we discussed the general issues of ME and MC. In this section, we discuss ME and MC for coding the texture in an arbitrary shaped VOP. In an arbitrary shaped VOP, the shape information is given either by binary shape information or by the alpha components of gray-level shape information. If the shape information is available to both the encoder and the decoder, three important modifications have to be considered for the arbitrary shaped VOP. The first concerns the blocks that are located on the border of the VOP. For these boundary blocks, the block matching criterion must be modified.


Second, a special padding technique is required for the reference VOP. Finally, since the VOPs have arbitrary shapes rather than rectangular shapes, and the shapes change from time to time, an agreement on a coordinate system is necessary to ensure the consistency of MC. In MPEG-4 video, the absolute frame coordinate system is used for referencing all of the VOPs. At each particular time instance, a bounding rectangle that includes the shape of the VOP is defined. The position of its upper-left corner in the absolute coordinates of the VOP spatial reference is transmitted to the decoder. Thus, the MV of a particular block inside a VOP refers to the displacement of the block in absolute coordinates.

Actually, the first and second modifications are related, since the padding of boundary blocks affects the matching in ME. The purpose of padding is to achieve more accurate block matching. In the current algorithm, repetitive padding is applied to the reference VOP before performing ME and MC. The repetitive padding process is performed in the following steps:

. Define any pixel outside the object boundary as a zero pixel.

. Scan each horizontal line of a block (one 16×16 block for luminance and two 8×8 blocks for chrominance). Each scan line is possibly composed of two kinds of line segments: zero and nonzero segments. Obviously, our task is to pad the zero segments. There are two kinds of zero segments: (1) between an end point of the scan line and the end point of a nonzero segment, and (2) between the end points of two different nonzero segments. In the first case, all zero pixels are replaced by the pixel value of the end pixel of the nonzero segment; in the second case, all zero pixels take the averaged value of the two end pixels of the nonzero segments.

. Scan each vertical line of the block and perform the identical procedure as described for the horizontal lines.

. If a zero pixel is located at the intersection of a horizontal and a vertical scan line, this zero pixel takes the average of the two possible values.

. For the remaining zero pixels, find the closest nonzero pixel on the same horizontal scan line and on the same vertical scan line (if there is a tie, the nonzero pixel on the left or on the top of the current pixel is selected). Replace the zero pixel by the average of these two nonzero pixels.

. For a fast-moving VOP, padding is further extended to the blocks outside the VOP but immediately next to the boundary blocks. These blocks are padded by replicating the pixel values of the adjacent boundary blocks. This extended padding is performed in both the horizontal and vertical directions. Since block matching is replaced by polygon matching for the boundary blocks of the current VOP, the SAD values are calculated by the modified formula:

SAD_N(x, y) = Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} |c(i, j) − p(i, j)| · α(i, j) − C        if (x, y) = (0, 0)

SAD_N(x, y) = Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} |c(i, j) − p(i + x, j + y)| · α(i, j)        otherwise        (18.5)

where C = NB/2 + 1, NB is the number of pixels of this block that lie inside the VOP, and α(i, j) is the alpha component specifying the shape information, which is not equal to zero here.
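The horizontal pass of the repetitive padding described in the list above can be sketched as follows (the vertical pass and the averaging at intersections follow the same pattern). The convention that the mask is a boolean array with True for pixels inside the object is an assumption of this illustration, not part of the standard text.

import numpy as np

def pad_scanline(line, mask):
    """Repetitively pad one scan line of a reference block.

    line : 1-D array of pixel values
    mask : 1-D boolean array, True where the pixel is inside the object
    Zero segments bounded by one nonzero segment are filled with the end
    pixel of that segment; segments bounded on both sides take the average
    of the two end pixels.
    """
    out = line.copy()
    inside = np.flatnonzero(mask)
    if inside.size == 0:
        return out                       # nothing to pad from on this line
    for k in np.flatnonzero(~mask):
        left = inside[inside < k]
        right = inside[inside > k]
        if left.size and right.size:     # between two nonzero segments
            out[k] = (int(line[left[-1]]) + int(line[right[0]])) // 2
        elif left.size:                  # only a segment to the left
            out[k] = line[left[-1]]
        else:                            # only a segment to the right
            out[k] = line[right[0]]
    return out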


FIGURE 18.3  A VOP represented by a bounding rectangular box; the shape separates macroblocks inside the VOP, boundary macroblocks, and macroblocks outside the VOP.

18.3.3.3 Texture Coding of Arbitrary Shaped VOP

During encoding, the VOP is represented by a bounding rectangle that is formed so as to completely contain the VO with the minimum number of MBs in it (Figure 18.3). The detailed procedure of VOP rectangle formation is given in the MPEG-4 video VM [mpeg-4 vm12].

There are three types of MBs in a VOP with arbitrary shape: MBs that are completely located inside the VOP, MBs that are located along the boundary of the VOP, and MBs that are located outside the boundary. The first kind of MB needs no particular modified technique; it is coded with the normal DCT followed by entropy coding of the quantized DCT (QDCT) coefficients, as in the H.263 coding algorithm. The second kind of MB, located along the boundary, contains two kinds of 8×8 blocks: blocks that lie along the boundary of the VOP, and blocks that do not belong to the arbitrary shape but lie inside the rectangular bounding box of the VOP; the latter are referred to as transparent blocks. For those 8×8 blocks that do lie along the boundary of the VOP, two different methods have been proposed: low-pass extrapolation (LPE) padding and the shape-adaptive DCT (SA-DCT). All blocks in the MBs outside of the boundary are also referred to as transparent blocks. Transparent blocks are skipped and not coded at all.

18.3.3.3.1 Low Pass Extrapolation Padding Technique

This block padding technique is applied to intra-coded blocks that are not located completely within the object boundary. To perform this padding technique, we first assign the mean value of those pixels that are located within the object boundary to each pixel outside the object boundary. Then an averaging operation is applied to each pixel p(i, j) outside the object boundary, starting from the upper-left corner of the block and proceeding row by row to the lower-right corner pixel:

p(i, j) = [p(i, j − 1) + p(i − 1, j) + p(i, j + 1) + p(i + 1, j)] / 4        (18.6)

If one or more of the four pixels used for filtering are outside of the block, the corresponding pixels are not considered in the averaging operation and the factor 1/4 is modified accordingly.
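A sketch of the LPE padding, assuming an 8×8 block, a boolean mask marking pixels inside the object, and the reading that the initial fill value in the first step is the mean of the interior pixels; the helper name is hypothetical.

import numpy as np

def lpe_pad(block, mask):
    """Low-pass extrapolation padding of an intra boundary block.

    block : 8x8 array of pixel values
    mask  : 8x8 boolean array, True for pixels inside the object
    """
    out = block.astype(float).copy()
    # Step 1: fill every exterior pixel with the mean of the interior pixels.
    out[~mask] = out[mask].mean()
    # Step 2: replace each exterior pixel by the average of its available
    # 4-neighbors, scanning row by row from the upper-left corner (Eq. 18.6);
    # neighbors outside the block are dropped and the divisor is adjusted.
    h, w = out.shape
    for i in range(h):
        for j in range(w):
            if mask[i, j]:
                continue
            neighbors = [out[i, j - 1] if j > 0 else None,
                         out[i - 1, j] if i > 0 else None,
                         out[i, j + 1] if j < w - 1 else None,
                         out[i + 1, j] if i < h - 1 else None]
            vals = [v for v in neighbors if v is not None]
            out[i, j] = sum(vals) / len(vals)
    return out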

18.3.3.3.2 SA-DCT

The shape-adaptive DCT is applied only to those 8×8 blocks that are located on the object boundary of an arbitrary shaped VOP. The idea of the SA-DCT is to apply a one-dimensional (1-D) DCT vertically and horizontally according to the number of active pixels in each column and row of the block, respectively. The size of each vertical DCT is the same as the number of active pixels in the corresponding column. After the vertical DCT is performed for all columns


FIGURE 18.4  Illustration of shape adaptive discrete cosine transformation (SA-DCT): the active image pixels, the coefficients of the column DCTs, and the final SA-DCT result after the row DCTs.

with at least one active pixel, the coefficients of the vertical DCTs with the same frequency index are lined up in a row. The DC coefficients of all vertical DCTs are lined up in the first row, the first-order vertical DCT coefficients are lined up in the second row, and so on. After that, a horizontal DCT is applied to each row. As with the vertical DCT, the size of each horizontal DCT is the same as the number of vertical DCT coefficients lined up in that particular row. The final coefficients of the SA-DCT are concentrated in the upper-left corner of the block. This procedure is shown in Figure 18.4.

The final number of SA-DCT coefficients is identical to the number of active pixels in the block. Since the shape information is transmitted to the decoder, the decoder can perform the inverse shape-adaptive DCT to reconstruct the pixels. The regular zigzag scan is modified so that the inactive coefficient locations are neglected when counting the runs for the run-length coding (RLC) of the SA-DCT coefficients. It is obvious that for a block in which all 8×8 pixels are active, the SA-DCT becomes a regular 8×8 DCT and the scanning of the coefficients is identical to the zigzag scan. All SA-DCT coefficients are quantized and coded in the same way as regular DCT coefficients, employing the same quantizers and VLC code tables. The SA-DCT is not included in MPEG-4 video version 1, but it is being considered for inclusion in version 2.
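The column-then-row flow of Figure 18.4 can be sketched as follows, using SciPy's orthonormal 1-D DCT purely as a stand-in transform; the normative transform, normalization, and DC handling are defined in the standard text, so this is an illustration of the data flow only.

import numpy as np
from scipy.fft import dct

def sa_dct(block, mask):
    """Shape-adaptive DCT of an 8x8 boundary block.

    block : 8x8 array of pixel values
    mask  : 8x8 boolean array, True for active (inside-the-object) pixels
    Returns an 8x8 array with the SA-DCT coefficients packed into the
    upper-left corner; inactive positions remain zero.
    """
    tmp = np.zeros((8, 8))
    col_len = np.zeros(8, dtype=int)
    # Vertical DCTs: shift the active pixels of each column to the top and
    # transform each column with a DCT of its own length.
    for j in range(8):
        col = block[:, j][mask[:, j]]
        col_len[j] = len(col)
        if len(col):
            tmp[:len(col), j] = dct(col, norm='ortho')
    # Horizontal DCTs: in row i, line up the i-th coefficients of all
    # columns long enough to have one, and transform that row.
    out = np.zeros((8, 8))
    for i in range(8):
        row = tmp[i, col_len > i]
        if len(row):
            out[i, :len(row)] = dct(row, norm='ortho')
    return out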

18.3.4 Shape Coding

Shape information of arbitrarily shaped objects is very useful not only in the fields of image analysis, computer vision, and graphics, but also in object-based video coding. MPEG-4 video coding is the first standard to make an effort at providing a standardized approach to compressing the shape information of objects and carrying the compressed results within a video bitstream. In the current MPEG-4 video coding standard, the video data can be coded on an object basis. The information in the video signal is decomposed into shape, texture, and motion. This information is then coded and transmitted within the bitstream. The shape information is provided in binary format or gray-scale format. The binary format of shape information consists of a pixel map that is generally the same size as the bounding box of the corresponding VOP. Each pixel takes on one of two possible values, indicating whether it is located within the VO or not. The gray-scale format is similar to the binary format, with the additional feature that each pixel can take on a range of values, i.e., an alpha value. Alpha typically has a normalized value between 0 and 1. The alpha value can be used to blend two images on a pixel-by-pixel basis as follows: new pixel = (alpha)(pixel A color) + (1 − alpha)(pixel B color).
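As a small numeric illustration of the blending rule just quoted (alpha normalized to the range 0–1):

def blend(pixel_a, pixel_b, alpha):
    # new pixel = alpha * pixel A + (1 - alpha) * pixel B
    return alpha * pixel_a + (1 - alpha) * pixel_b

# e.g., a half-transparent object pixel over the background:
# blend(200, 100, 0.5) -> 150.0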

Let us now discuss how to code the shape information. As mentioned earlier, shape information is classified as binary shape or gray-scale shape. Both binary and gray-scale shapes are referred to as an alpha plane. The alpha plane defines the transparency of an object. Multilevel alpha maps are frequently used to blend different images. A binary alpha map defines whether or not a pixel belongs to an object. The binary alpha planes are


encoded by modified content-based arithmetic encoding (CAE), while the gray-scale alpha planes are encoded by MC DCT coding, which is similar to texture coding. For binary shape coding, a rectangular box enclosing the arbitrarily shaped VOP is formed as shown in Figure 18.3. The bounding rectangular box is then extended in both the vertical and horizontal directions on the right-bottom side to a multiple of 16×16 blocks. Each 16×16 block within the rectangular box is referred to as a binary alpha block (BAB). Each BAB is associated with the colocated MB. A BAB can be classified into one of three types: transparent block, opaque block, and alpha or shape block. A transparent block does not contain any information about the object. An opaque block is entirely located inside the object. An alpha or shape block is located in the area of the object boundary, i.e., a part of the block is inside the object and the rest of the block is in the background. The value of pixels in the transparent region is zero. For shape coding, the type information is included in the bitstream and signaled to the decoder as the MB type, but only the alpha blocks need to be processed by the encoder and decoder. The methods used for each shape format contain several encoding modes. For example, the binary shape information can be encoded using either an intra or an inter mode. Each of these modes can be further divided into lossy and lossless options. Gray-scale shape information also has intra and inter modes; however, only a lossy option is used.

18.3.4.1 Binary Shape Coding with CAE Algorithm

As mentioned previously, CAE is used to code each binary pixel of the BAB. For a P-VOP, the BAB may be encoded in intra or inter mode. Pixels are coded in scan-line order, i.e., row by row, for both modes. The process for coding a given pixel includes three steps: (1) compute a context number, (2) index a probability table using the context number, and (3) use the indexed probability to drive an arithmetic encoder. In intra mode, a template of 10 pixels is used to define the causal context for predicting the shape value of the current pixel, as shown in Figure 18.5. For the pixels in the top and left boundary of the current MB, the template of the causal context contains pixels of the already transmitted MBs above and to the left of the current MB. For the two rightmost columns of the VOP, each undefined pixel of the context, such as C7, C3, and C2, is set to the value of its closest neighbor inside the MB, i.e., C7 takes the value of C8, and C3 and C2 take the value of C4.

A 10-bit context is calculated for each pixel X as

C = Σ_{k=0}^{9} C_k · 2^k        (18.7)

This causal context is used to predict the shape value of the current pixel. For encoding the state transition, a context-based arithmetic encoder is used. The probability table of the arithmetic encoder for the 1024 contexts was derived from sequences outside the test set. Two bytes are allocated to describe the symbol probability for each context, so the table size is 2048 bytes. To increase coding efficiency and to allow rate control, the algorithm also allows lossy shape coding.

FIGURE 18.5  Template of the ten causal pixels C0–C9 defining the context of the pixel X to be coded in intra mode.
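The 10-bit context of Equation 18.7 is simply the binary number formed by the ten template pixels of Figure 18.5. A minimal sketch, in which the ordering of the helper's argument and the table name prob_table are assumptions of this illustration:

def intra_context(c):
    """Compute the intra-mode context number of Equation 18.7.

    c : sequence of the ten causal template pixels C0..C9, each 0 or 1
    """
    return sum(ck << k for k, ck in enumerate(c))

# The context then indexes the 1024-entry probability table that drives the
# arithmetic coder, e.g.: p = prob_table[intra_context(c)]   (hypothetical table name)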


In lossy shape coding, an MB can be down-sampled by a factor of 2 or 4, resulting in a sub-block of size 8×8 pixels or 4×4 pixels, respectively. The sub-block is then encoded using the same method as for a full-size block. The down-sampling factor is included in the encoded bitstream and transmitted to the decoder. The decoder decodes the shape data and then up-samples the decoded sub-block to full MB size according to the down-sampling factor. Obviously, it is more efficient to code shape using a high down-sampling factor, but coding errors may occur in the decoded shape after up-sampling. However, in the case of low bit rate coding, lossy shape coding may be necessary since the bit budget may not be enough for lossless shape coding. Depending on the up-sampling filter, the decoded shape can look somewhat blocky. Several up-sampling filters were investigated. The best-performing filter in terms of subjective picture quality is an adaptive nonlinear up-sampling filter. It should be noted that the coding efficiency of shape coding also depends on the orientation of the shape data. Therefore, the encoder can choose to code the block as described above or to transpose the MB prior to arithmetic coding. Of course, the transposition information has to be signaled to the decoder.

For shape coding in a P-VOP or B-VOP, the inter mode may be used to exploit the temporal redundancy in the shape information with MC. For MC, a 2-D integer-pixel MV is estimated using full search for each MB in order to minimize the prediction error between the previously coded VOP shape and the current VOP shape. The shape MVs are predictively encoded with respect to the shape MVs of neighboring MBs. If no shape MV is available, texture MVs are used as predictors. The template for inter mode differs from the one used for intra mode. The inter mode template contains nine pixels, among which five pixels are located in the previous frame and four are the current neighbors, as shown in Figure 18.6.

The inter mode template defines a context of nine pixels. Accordingly, a 9-bit context, i.e., 512 contexts, can be computed in a way similar to Equation 18.7:

C = Σ_{k=0}^{8} C_k · 2^k        (18.8)

The probability for one symbol is again described by 2 bytes, giving a probability table size of 1024 bytes. The idea of lossy coding can also be applied to inter mode shape coding by down-sampling the original BABs. For inter mode shape coding, the total bits for coding the shape consist of two parts, one part for coding the MVs and the other for the prediction residue. The encoder may decide that the shape representation achieved just by using MVs is sufficient, so that the bits for coding the prediction error can be saved. Actually, there are several modes to code the shape information of each MB: transparent, opaque, intra, inter with/without shape MVs, and prediction error coding. These different options, with optional down-sampling and transposition, allow encoder implementations of different coding efficiency and implementation complexity. Again, this is a problem of encoder optimization that does not belong to the standard.

FIGURE 18.6  Template of nine pixels defining the context of the pixel X to be coded in inter mode: C0–C3 are pixels of the current BAB and C4–C8 are pixels of the motion-compensated BAB.


FIGURE 18.7  Gray-scale shape coding: the gray-scale alpha plane is split into a support, coded with binary shape coding, and a texture part, coded with texture coding.

18.3.4.2 Gray-Scale Shape Coding

The gray-scale shape information is encoded by separately encoding the shape and transparency information, as shown in Figure 18.7. For a transparent object, the shape information is referred to as the support function and is encoded using the binary shape coding method. The transparency or alpha values are treated as the texture of luminance and encoded using padding, MC, and the same 8×8 block DCT approach used for texture coding. For an object with varying alpha maps, the shape information is encoded in two steps. The boundary of the object is first losslessly encoded as a binary shape, and then the actual alpha map is encoded as texture coding.

Binary shape coding allows one to describe objects with constant transparency, while gray-scale shape coding can be used to describe objects with arbitrary transparency, providing more flexibility for image composition. One application example is a gray-scale alpha shape that consists of a binary alpha shape with the values around the edges tapered from 255 to 0 to provide a smooth composition with the background. The description of each VO layer includes the information for selecting one of six feathering modes: no effects, linear feathering, constant alpha, linear feathering and constant alpha, feathering filter, and feathering filter and constant alpha. The detailed description of the function of these modes is given in VM 12.0 [mpeg4 vm12].

18.3.5 Sprite Coding

As mentioned earlier, MPEG-4 video has investigated a number of new tools that attempt to improve the coding efficiency at low bit rates compared with MPEG-1/2 video coding. Among these tools, sprite coding is an efficient technology for reaching this goal. A sprite is a specially composed VO that is visible throughout an entire video sequence. For example, the sprite generated from a panning sequence contains all the visible pixels of the background throughout the video sequence. Portions of the background may not be seen in certain frames due to the occlusion by foreground objects or the camera motion. This particular example is one of the static sprites; in other words, a static sprite is essentially a still image. Since the sprite contains all visible background scenes of a video segment, where the changes within the background content are mainly caused by camera parameters, the sprite can be used for direct reconstruction of the background VOPs or as the prediction of the background VOPs within the video segment. Sprite coding technology first transmits this background efficiently to the receiver and then stores it in a frame memory at both the encoder and the decoder. The camera parameters are then transmitted to the decoder for each frame so that the appropriate part of the background scene can be used either as the direct reconstruction or as the prediction of the background VOP. Both cases can significantly save coding bits and increase the coding efficiency. There are two types of sprites, static sprites and dynamic sprites, which are being


considered as coding tools for MPEG-4 video. A static sprite is used for a video sequence in which the objects in a scene can be separated into foreground objects and a static background. A static sprite is a special VOP, which is generated by copying the background from a video sequence. This copying includes the appropriate warping and cropping. Therefore, a static sprite is always built off-line. In contrast, a dynamic sprite is dynamically built during the predictive coding. It can be built either online or off-line. The static sprite has shown significant coding gain over existing compression technology for certain video sequences. The dynamic sprite is more complicated for real-time applications due to the difficulty of updating the sprite during coding. Therefore, the dynamic sprite has not been adopted by version 1 of the standard. Additionally, both sprites are not easily applied to generic scene content. Also, there is another classification of sprite coding according to the method of sprite generation, namely, off-line and online sprites. Off-line generation is always used for static sprites. Off-line sprites are well suited for synthetic objects and objects that mostly undergo rigid motion, whereas online sprites are used only for dynamic sprites. Online sprites provide a no-latency solution in the case of natural sprite objects. Off-line dynamic sprites provide an enhanced predictive coding environment. The sprite is built in a similar way in both the off-line and online methods. In particular, the same global ME algorithm is exploited. The difference is that the off-line sprite is built before starting the encoding process, while in the online sprite case, both the encoder and the decoder build the same sprite from reconstructed VOPs. This is why online dynamic sprites are more complicated in implementation. The online sprite is not included in version 1, and will most likely not be considered for version 2 either. In sprite coding, the chrominance components are processed in the same way as the luminance components, with properly scaled parameters according to the video format.

18.3.6 Interlaced Video Coding

Since June 1997, MPEG-4 has extended its application to support interlaced video. Interlaced video consists of two fields per frame: the even field and the odd field. MPEG-2 has a number of tools that are used to deal with the field structure of video signals. These tools include frame/field adaptive DCT coding and frame/field adaptive MC. However, the field issue in MPEG-4 has to be considered on a VOP basis instead of the conventional frame basis. When field-based MC is specified, two field MVs and the corresponding reference fields are used to generate the prediction from each reference VOP. The shape information also has to be considered for interlaced video in MPEG-4.

18.3.7 Wavelet-Based Texture Coding

MPEG-4 includes a texture-coding mode that is used to code texture or still images, such as in JPEG. The basic technique used in this mode is wavelet-based texture coding. The reason for adopting the wavelet transform instead of the DCT for still texture coding is not only its high coding efficiency, but also that the wavelet can provide excellent scalability, both spatial scalability and SNR scalability. Since the principle of wavelet-based coding for image compression has been explained in Chapter 8, we only briefly describe the coding procedure of this mode. The block diagram of the encoder is shown in Figure 18.8.

18.3.7.1 Decomposition of the Texture Information

The texture or still image is first decomposed into bands using a bank of analysis filters. This decomposition can be applied recursively on the obtained bands to yield a decomposition tree of subbands. An example of decomposition to depth 2 is shown in Figure 18.9.
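The recursive band splitting can be sketched as follows. The Haar filter and the assumption of even image dimensions at each level are only for keeping the example short; MPEG-4 specifies its own default analysis/synthesis filter bank, and the helper names are hypothetical.

import numpy as np

def haar_split(x, axis):
    # One-level Haar analysis along one axis: returns (low band, high band).
    a = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis).astype(float)
    b = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis).astype(float)
    return (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)

def decompose(image, depth=2):
    """Recursive subband decomposition of a still texture (cf. Figure 18.9).

    Returns the lowest low-low band and a list of detail-band triples,
    one triple per level, from the finest level to the coarsest.
    """
    ll, details = image, []
    for _ in range(depth):
        lo, hi = haar_split(ll, axis=0)      # vertical split
        ll, d1 = haar_split(lo, axis=1)      # horizontal split of the low band
        d2, d3 = haar_split(hi, axis=1)      # horizontal split of the high band
        details.append((d1, d2, d3))
        # the next level recurses on the low-low band only
    return ll, details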


FIGURE 18.8  Block diagram of the encoder of wavelet-based texture coding, where DWT stands for discrete wavelet transform: the input is decomposed with the DWT; the low–low band is quantized, predicted, and arithmetic coded, while the other bands are quantized, zero-tree scanned, and arithmetic coded into the bitstream.

18.3.7.2 Quantization of Wavelet Coefficients

After decomposition, the coefficients of the lowest band are coded independently of the other bands. These coefficients are quantized using a uniform midrise quantizer. The coefficients of the higher bands are quantized with a multilevel quantization. The multilevel quantization provides a very flexible approach to support the right trade-off between levels and type of scalability, complexity, and coding efficiency for any application. All quantizers for the higher bands are uniform midrise quantizers with a dead zone that is twice the quantizer step size. The levels and quantization steps are determined by the encoder and specified in the bitstream. To achieve scalability, a bi-level quantization scheme is used for all multiple quantizers. This quantizer is also uniform and midrise with a dead zone that is twice the quantization step. The coefficients outside the dead zone are quantized to 1 bit. The number of quantizers is equal to the maximum number of bit planes in the wavelet transform representation. In this bi-level quantizer, the maximum number of bit planes, instead of the quantization step size, is specified in the bitstream.
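One way to read the dead-zone rule above is sketched below; this is an illustration of a uniform dead-zone quantizer only, not the normative bit-plane quantizer, and the function name is hypothetical.

def deadzone_quantize(w, step):
    """Uniform quantizer with a dead zone of twice the step size.

    Coefficients with |w| < step fall in the dead zone and map to level 0;
    the rest are quantized uniformly with the given step.
    """
    if abs(w) < step:
        return 0
    level = int(abs(w) // step)      # uniform levels outside the dead zone
    return level if w > 0 else -level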

18.3.7.3 Coding of Wavelet Coefficients of Low–Low Band and Other Bands

The quantized coefficients of the lowest band are DPCM coded. Each of the current coefficients is predicted from three other quantized coefficients in its neighborhood, as shown in Figure 18.10.

The coefficients in the higher bands are coded with the zero-tree algorithm [Shapiro 1993] discussed in Chapter 8.

18.3.7.4 Adaptive Arithmetic Coder

The quantized coefficients and the symbols generated by the zero-tree are coded using an adaptive arithmetic coder. In the arithmetic coder, three different tables corresponding to different statistical models are utilized. The method used here is very similar to the one in Chapter 8. Further details can be found in the MPEG-4 VM [mpeg4 vm12].

FIGURE 18.9  An example of wavelet decomposition of depth 2, with the subbands numbered 0 (low–low) through 6.


FIGURE 18.10  Adaptive DPCM coding of the coefficients in the lowest band: the coefficient of block X is predicted from its neighbors A (left), B (above-left), and C (above); if |WA − WB| < |WB − WC|, WXp = WC, else WXp = WA.

18.3.8 Generalized Spatial and Temporal Scalability

The scalability framework is referred to as generalized scalability; it includes spatial and temporal scalability similar to MPEG-2. The major difference is that MPEG-4 extends the concept of scalability to be content-based. This unique functionality allows MPEG-4 to resolve objects into different VOPs. Using the multiple VOP structure, different resolution enhancements can be applied to different portions of a video scene. Therefore, the enhancement layer may be applied only to a particular object or region of the base layer instead of to the entire base layer. This is a feature that MPEG-2 does not have.

In spatial scalability, the base layer and the enhancement layer can have different spatial resolutions. The base layer VOPs are encoded in the same way as in the non-scalable encoding technique described previously. The VOPs in the enhancement layer are encoded as P-VOPs or B-VOPs, as shown in Figure 18.11. The current VOP in the enhancement layer can be predicted from the up-sampled base layer VOP, from the previously decoded VOP in the same layer, or from both of them. The down-sampling and up-sampling processing in spatial scalability is not a part of the standard and can be defined by the user.

In temporal scalability, a subsequence of temporally subsampled VOPs is coded as the base layer. The remaining VOPs can be coded as enhancement layers. In this way, the frame rate of a selected object can be enhanced so that it has smoother motion than the other objects. An example of temporal scalability is illustrated in Figure 18.12, where VOL0 is the entire frame with both an object and a background, while VOL1 is a particular object in VOL0. VOL0 is encoded with a low frame rate and VOL1 is the enhancement layer. A high frame rate can be reached for the particular object by combining the decoded data from both the base layer and the enhancement layer. Of course, the B-VOP is also used in temporal scalability for coding the enhancement layer, which is another type of temporal scalability. As in spatial scalability, the enhancement layer can be used to improve either the entire base layer frame resolution or only a portion of the base layer resolution.

18.3.9 Error Resilience

The MPEG-4 visual coding standard provides error robustness and resilience to allow access to image and video data over a wide range of storage and transmission media. The error resilience tool development effort is divided into three major areas, which include

FIGURE 18.11  Illustration of spatial scalability: the base layer is coded with I- and P-VOPs, and the enhancement layer with P- and B-VOPs predicted from the base layer.


FIGURE 18.12  An example of temporal scalability: VOL0 (base layer) carries a temporally subsampled set of frames, while VOL1 (enhancement layer) carries the remaining frames of the selected object.

resynchronization, data recovery, and error concealment. As with other coding standards, MPEG-4 makes heavy use of variable-length coding (VLC) to reach high coding performance. However, if even one bit is lost or damaged, the entire bitstream can become undecodable due to loss of synchronization. The resynchronization tools attempt to enable resynchronization between the decoder and the bitstream after a transmission error or errors have been detected. Generally, the data between the synchronization point prior to the error and the first point where synchronization is reestablished is discarded. The purpose of resynchronization is to effectively localize the amount of data discarded by the decoder; other methods such as error concealment can then be used to conceal the damaged areas of the decoded picture. Currently, the resynchronization approach adopted by MPEG-4 is referred to as a packet approach. This approach is similar to the group of blocks (GOB) structure used in H.261 and H.263. In the GOB structure, the GOB contains a start code that provides information on the location of the GOB. MPEG-4 adopted a similar approach in which a resynchronization marker is periodically inserted into the bitstream at particular MB locations. The resynchronization marker is used to indicate the start of a new video packet. This marker is distinguishable from all possible VLC code words as well as from the VOP start code. The packet header information is then provided at the start of a video packet. The header contains the information necessary to restart the decoding process, including the MB number of the first MB contained in this packet and the quantization parameter necessary to decode the first MB. The MB number provides the necessary spatial resynchronization, while the quantization parameter allows the differential decoding process to be resynchronized. It should be noted that when error resilience is used within MPEG-4, some of the compression efficiency tools need to be modified. For example, all predictively encoded information must be contained within a video packet to avoid error propagation. In conjunction with the video packet approach to resynchronization, MPEG-4 has also adopted a fixed interval synchronization method, which requires that VOP start codes and resynchronization markers appear only at legal fixed interval locations in the bitstream. This helps to avoid the problems associated with start code emulation. When fixed interval synchronization is utilized, the decoder is only required to search for a VOP start code at the beginning of each fixed interval. The fixed interval synchronization method extends this approach to any predetermined interval.

After resynchronization is reestablished, the major problem is recovering lost data. A new tool called reversible variable-length codes (RVLC) has been developed for the purpose of data recovery. In this approach, the VLCs are designed such that the codes can be read


both in the forward and in the reverse direction. Examples of such codes include code words like 111, 101, and 010; all of these code words can be read reversibly. However, it is obvious that this approach reduces the coding efficiency achieved by the entropy coder. Therefore, this approach is used only in cases where error resilience is important.
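The three example code words quoted above can be parsed from either end because each one reads the same forward and backward. The toy sketch below only demonstrates the two-direction parse; the table, its symbols, and the fixed 3-bit length are inventions of this illustration, whereas the real RVLC tables are variable-length codes defined by the standard.

RVLC_TABLE = {'111': 'a', '101': 'b', '010': 'c'}   # toy table, not the standard one

def decode(bits, table=RVLC_TABLE, reverse=False):
    # Parse the toy reversible code in 3-bit groups, forward or backward.
    if reverse:
        bits = bits[::-1]
    symbols = [table[bits[i:i + 3]] for i in range(0, len(bits), 3)]
    return symbols[::-1] if reverse else symbols

# decode('111010101') == ['a', 'c', 'b'] whether parsed forward or backward,
# so a decoder can recover data on both sides of a detected error.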

Error concealment is an important component of any error-robust video coding. The error concealment strategy is highly dependent on the performance of the resynchronization technique. Basically, if the resynchronization method can localize the damaged data area efficiently, the error concealment strategy becomes much more tractable. Error concealment is actually a decoder issue if no additional information is provided by the encoder. There are many approaches to error concealment (see Chapter 17).

18.4 MPEG-4 Visual Bitstream Syntax and Semantics

A feature MPEG-4 has in common with MPEG-1 and MPEG-2 is the layered structure of the bitstream. MPEG-4 defines a syntactic description language to describe the exact binary syntax of an AVO bitstream, as well as that of the scene description information. This provides a consistent and uniform way of describing the syntax in a very precise form, while at the same time simplifying bitstream compliance testing. The visual syntax hierarchy includes the following layers:

– Video session (VS)

– Video object (VO)

– Video object layer (VOL) or texture object layer (TOL)

– Group of video object plane (GOV)

– Video object plane (VOP)

A typical video syntax hierarchy is shown in Figure 18.13. The video session is the highest syntactic structure of the coded video bitstream. A VS is a

collection of one or more VOs. A VO can consist of one or more layers. Since MPEG-4 has been extended from video coding to visual coding, the types of visual objects include not only VOs, but also still texture objects, mesh objects, and face objects. These layers can be either video or texture. Still texture coding is designed for high visual quality applications in the transmission and rendering of texture. The still texture coding algorithm supports a scalable representation of image or synthetic scene data such as luminance, color, and shape. This is very useful for the progressive transmission of images or 2-D/3-D synthetic scenes. The

FIGURE 18.13  MPEG-4 video syntax hierarchy: video session (VS), visual object (VO), video object layer (VOL), group of video object planes (GOV), and video object plane (VOP).


images can be gradually built up on the terminal monitor as they are received. The bitstreams for coded mesh objects are non-scalable; they define the structure and motion of a 2-D mesh. The texture of the mesh has to be coded as a separate VO. The bitstreams for face objects are also non-scalable; these bitstreams contain the face animation parameters. VOs are coded with different types of scalability. The base layer can be decoded independently, while the enhancement layers can be decoded only together with the base layer. In the special case of a single rectangular VO, all of the MPEG-4 layers can be related to MPEG-2 layers. That is, the VS is the same as the VO, since in this case a single VO is a video sequence; the VOL or TOL is the same as the sequence scalable extension; the GOV is like the GOP; and the VOP is a video frame. A visual object sequence may contain one or more visual objects coded concurrently. The visual object header information contains the start code followed by profile and level identification, and a visual object identification to indicate the type of object, which may be a VO, a still texture object, a mesh object, or a face object. The VO may contain one or more VOLs. In the VOL, the VO can be coded with spatial or temporal scalability. Also, the VO may be encoded in several layers from coarse to fine resolution. Depending on the needs of the application, the decoder can choose the number of layers to decode. A VO at a specified time instant is called a VOP. Thus, a VO contains many VOPs. A scene may contain many VOs. Each VO can be encoded into an independent bitstream. A collection of VOPs in a VOL is called a group of VOPs (GOV). This concept corresponds to the group of pictures (GOP) in MPEG-1 and MPEG-2. A VOP is then coded by shape coding and texture coding, which are specified at the lower layers of the syntax, such as the MB and block layers. The VOP layer and the layers above it always commence with a start code and are followed by the data of the lower layers, in a way similar to the MPEG-1 and MPEG-2 syntax.

18.5 MPEG-4 Visual Profiles and Levels

In MPEG-4, many profiles have been defined. In this section, we introduce only some of the visual profiles. There are in total 19 visual profiles defined for different applications:

1. Simple

2. Simple scalable

3. Core

4. Main

5. N-bit

6. Hybrid

7. Basic animated texture

8. Scalable texture

9. Simple face animation

10. Simple FBA

11. Advanced real-time simple

12. Core scalable

13. Advanced coding efficiency

14. Advance core profile

15. Advanced scalable texture

16. Simple studio


17. Core studio

18. Advanced simple

19. FGS

Among these visual profiles, the simple profile and the advanced simple profile have been extensively used by industry for mobile applications and for streaming over networks.

The simple profile is used to code rectangular video with intra (I) and predicted (P) VOPs, which is actually the same as MPEG-2 frame-based coding. The simple profile permits the use of three compression levels with bit rates from 64 kbits/s in level 1 to 384 kbits/s in level 3.

The advanced simple profile is also used to code rectangular video with intra (I) and predicted (P) VOPs, but it is enhanced by adding bidirectional (B) VOPs for better coding efficiency than the simple profile. It supports six compression levels (0–5). Levels 0–3 have bit rates from 128 to 768 kbits/s. Support for interlaced coding is added for levels 4 and 5, with bit rates from 3 to 8 Mbits/s.

18.6 MPEG-4 Video Verification Model

Since all video coding standards define only the bitstream syntax and the decoding process, a test model (TM) is needed to verify and optimize the algorithms during the development process. For this purpose, a common platform with a precise definition of the encoding and decoding algorithms has to be provided. The TM of MPEG-2 took this role; it was updated continually from version 1.0 to version 5.0, until the MPEG-2 Video IS was completed. MPEG-4 video uses a similar tool during the development process; in MPEG-4 this tool is called the verification model (VM). So far, the MPEG-4 video VM has evolved gradually from version 1.0 to version 12.0 and in the process has addressed an increasing number of desired functionalities such as content-based scalability, error resilience, and coding efficiency. The material presented in this section is different from that in Section 18.3: Section 18.3 presented the technologies adopted or to be adopted by MPEG-4, while this section provides an example of how to use the standard, i.e., how to encode or generate an MPEG-4 compliant bitstream. Of course, the decoder is also included in the VM.

18.6.1 VOP-Based Encoding and Decoding Process

Since the most important feature of MPEG-4 is its object-based coding method, the input video sequence is first decomposed into separate VOs, and these VOs are then encoded into separate bitstreams so that the user can access and manipulate (cut, paste, . . . ) the video sequence in the bitstream domain. Instances of a VO at a given time are called VOPs. The bitstream also contains composition information to indicate where and when each VOP is to be displayed. At the decoder, the user may be allowed to change the composition of the displayed scene by interactively changing the composition information.

18.6.2 Video Encoder

For object-based coding, the encoder mainly consists of two parts: the shape coding and the texture coding of the input VOP. Texture coding is based on DCT coding with traditional MC predictive coding. The VOP is represented by means of a bounding


FIGURE 18.14  Block diagram of the MPEG-4 video encoder structure: the arbitrarily shaped VOP is processed by shape coding, motion estimation and compensation, and texture coding; the shape, motion, and texture information are multiplexed and buffered into the bitstream, with a frame memory providing the reconstructed reference.

rectangle as described previously. The phase between the luminance and chrominance pixels of the bounding rectangle has to be correctly set to the 4:2:0 format as in MPEG-1/2. The block diagram of the encoding structure is shown in Figure 18.14.

The core technologies used in VOP coding of MPEG-4 have been described previously. Here we are going to discuss several encoding issues. Although these issues are essential to performance and applications, they are not dependent on the syntax. As a result, they are not included as normative parts of the standard, but are included as informative annexes.

18.6.2.1 Video Segmentation

Object-based coding is the most important feature of MPEG-4. Therefore, the tool for object boundary detection or segmentation is a key issue in efficiently performing the object-based coding scheme. However, the method of decomposing a natural scene into several separate objects is not specified by the standard, since it is a preprocessing issue. There are currently two kinds of algorithms for the segmentation of VOs. One kind is automatic segmentation. In the case of real-time applications, the segmentation must be done automatically. Real-time automatic segmentation algorithms are currently not mature. An automatic segmentation algorithm has been proposed in [mpeg96/m960]. This algorithm separates regions corresponding to moving objects from regions belonging to a static background for each frame of a video sequence. The algorithm is based on motion analysis for each frame. Motion analysis is performed over several frames to track each pixel of the current frame and to detect whether the pixel belongs to a moving object.

The other kind of segmentation algorithm is user-assisted or "semiautomatic". In non-real-time applications, semiautomatic segmentation may be used effectively and give better results than automatic segmentation. In the core experiments of MPEG-4, a semiautomatic segmentation algorithm was proposed in [mpeg97/m3147]. The block diagram of the semiautomatic segmentation is shown in Figure 18.15.

This technique consists of two steps. First, intraframe segmentation is applied to the first frame, which is considered to be a frame that either contains newly appeared objects or is a reset frame. Then interframe segmentation is applied to the consecutive frames. For the intraframe case, the segmentation is performed by a user manually or semiautomatically. The


FIGURE 18.15  Block diagram of a user-assisted video object segmentation method: intraframe segmentation in an initially marked region around the object boundary (drawn by the user via a GUI) is followed by interframe segmentation by object boundary tracking; the process returns to intraframe segmentation whenever unsatisfactory results or a shot boundary occur.

user uses a graphical user interface (GUI) to draw the boundaries of the objects of interest. The user can mark all the way around the objects using a mouse with a predefined line thickness (number of pixels). The mouse thus produces a marked swath, and this marked area is assumed to contain the object boundaries. A boundary detection algorithm is applied to the marked area to create the real object boundaries. For interframe segmentation, an object boundary-tracking algorithm is proposed to obtain the object boundaries of the consecutive frames. First, the boundary of the previous object is extracted and ME is performed on the object boundary. The object boundary of the current frame is initially obtained by MC and then refined using temporal information (TI) and spatial information (SI) all the way around the object boundary. Finally, the refined object boundary is obtained. As mentioned previously, the segmentation technique is an important tool for object-based processing in MPEG-4, but it is not defined by the standard. The method described here is just an example provided by the core experiments of MPEG-4. There are many other algorithms under investigation, such as the circular Viterbi algorithm described in [lin 1998].

18.6.2.2 Intra/Inter Mode Decision

For inter-VOP coding, an MB can be coded in one of four modes: direct coding, forward coding, backward coding, and bidirectional coding. In the encoder we have to decide which mode is the best. The mode decision is an important part of encoding optimization. An example of optimized mode decision has been given in Chapter 17 for the MPEG-2 encoder. The same technique can be extended to an MPEG-4 encoder. The basic idea of mode decision is to choose the coding mode that results in the best operating point on the rate distortion curve. To obtain the best operating point on the rate distortion curve, the encoder has to compare all possible coding modes and choose the best one. This is a very complicated procedure. In the MPEG-2 case, we used a quadratic model to unify the measures of bits used to code the prediction residues and the MVs, which resulted in a simplified but near-optimal mode decision method. Here, VM 12 proposes the following steps to make coding mode decisions. First, the MC prediction error is calculated for each of the four modes. Next, the SAD of each of the MC prediction MBs is calculated and compared with the variance of the MB to be coded. Then the mode generating the smallest SAD (for direct mode, a bias is applied) is selected. For interlaced video, more coding modes are involved. This method of mode


decision is simple, but it is not optimal since the cost of coding the MVs is not considered. Consequently, the selected mode may not lie at the best operating point on the rate distortion curve. But again, this is an encoding issue, and encoder designers have the freedom to use their own algorithms. The VM just provides an example of an encoder that can generate a compliant bitstream.

18.6.2.3 Off-Line Sprite Generation

The sprite is a useful tool in MPEG-4 for coding certain kinds of video sequences at very low bit rates. The method of generating a sprite for a video sequence is an encoder issue. The VM gives an example of off-line sprite generation. For a natural VO, the sprite refers to a representative view collected from a video sequence. Before decoding, the sprite is transmitted to the decoder. The MC can then be performed using the sprite, from which the video can be reconstructed. The effectiveness of the video reconstruction depends on whether the motion of the object can be effectively represented by a global motion model such as translation, zooming, affine, or perspective. The key technology of sprite generation is the ME used to find the perspective motion parameters. This can be implemented by many algorithms described in this book, such as the three-step matching technique. The block diagram of sprite generation using perspective ME is shown in Figure 18.16.

The sprite is generated from the input video sequence in the following steps. First, the first frame is used as the initial value of the sprite. From the second frame on, ME is applied to find the perspective motion parameters between two frames. The current frame is warped toward the initial sprite using the perspective MVs to obtain the warped image. Then the warped image is blended with the initial sprite to obtain an updated sprite. This procedure is continued over the entire video sequence, and the final sprite is then generated.
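Only the control flow of this loop is shown in the sketch below; the global ME, warping, and blending steps are left as hypothetical helper functions supplied by the encoder, since their algorithms are not part of the standard.

def generate_sprite(frames, estimate_perspective, warp, blend):
    """Off-line static sprite generation (cf. Figure 18.16).

    frames               : sequence of VOP images
    estimate_perspective : function(frame, sprite) -> perspective motion parameters
    warp                 : function(frame, params) -> frame warped toward the sprite
    blend                : function(sprite, warped) -> updated sprite
    All three helpers are placeholders for the encoder's own algorithms.
    """
    sprite = frames[0]                      # the first frame initializes the sprite
    for frame in frames[1:]:
        params = estimate_perspective(frame, sprite)
        warped = warp(frame, params)
        sprite = blend(sprite, warped)
    return sprite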

18.6.2.4 Multiple VO Rate Control

As we know, the purpose of rate control is to obtain the best coding performance for a given bit rate in constant bit rate (CBR) video coding. In MPEG-4 video coding, there is an additional objective for rate control: how to assign the bits among multiple VOs. In the multiple VO rate control algorithm, the total target is first adjusted based on the buffer fullness, and then distributed proportionally to the size of each object, the motion the object is experiencing, and its mean absolute difference. Based on the new individual targets and second-order model parameters [lee 1997], appropriate quantization parameters can be calculated for each VO. To compromise the trade-offs in spatial and temporal coding, two modes of operation have been introduced. With these modes, suitable decisions can be made to differentiate between low and high bit rate coding. In addition, a shape rate control algorithm has been included. The algorithm for performing

FIGURE 18.16 Block diagram of sprite generation. (The input VOP and its mask feed a motion estimation stage; the frame is then warped and blended with the sprite held in a frame memory.)


the joint rate control can be decomposed into a pre-encoding stage and a post-encoding stage. The pre-encoding stage consists of (i) target bit estimation, (ii) joint buffer control, (iii) pre-frame skip control, and (iv) quantization level and alpha threshold calculation. The post-encoding stage consists of (i) updating the rate distortion model, (ii) post-frame skip control, and (iii) determining the mode of operation. The initialization process is very similar to the single VOP initialization process. Since a single buffer is used, the buffer drain rate and initializations remain the same, but many of the parameters are extended to vector quantities.

As a means of regulating the trade-offs between spatial and temporal coding, two modes of operation are introduced: low mode and high mode. When encoding at high bit rates, the availability of bits allows the algorithm to be flexible in its target assignment to each VO. Under these circumstances, it is reasonable to impose homogeneous quality among the VOs. Therefore, the inclusion of MAD2[i] is essential to the target distribution and should carry the highest weighting. On the other hand, when the availability of bits is limited, it is very difficult (if not impossible) to achieve homogeneous quality among the VOs. Under these conditions, it is desirable to spend fewer bits on the background and more bits on the foreground. Consequently, the significance of the variance decreases and the significance of the motion increases. Besides regulating the quality within each frame, it is also important to regulate the temporal quality, i.e., to keep the frame skipping to a minimum. In high mode, this is very easy to do since the availability of bits is plentiful. However, in low mode, frame skipping occurs much more often. In fact, the number of frames being skipped is a good indication of which mode the algorithm should be operating in. Overall, this particular algorithm is able to successfully achieve the target bit rate, effectively code arbitrarily shaped objects, and maintain a stable buffer [vetro 1999].
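A simplified sketch of the target distribution step is shown below: the frame-level target is split among the VOs in proportion to a weighted combination of object size, motion, and MAD. The weights are illustrative placeholders only; the actual weighting (and its switching between the low and high modes) follows [vetro 1999], not the values chosen here.

/* Distribute a frame-level bit target among the video objects in proportion
   to size, motion, and MAD.  The weights are illustrative placeholders. */
void distribute_target(long frame_target, int num_vo,
                       const double size[], const double motion[], const double mad[],
                       double w_size, double w_motion, double w_mad,
                       long target[])
{
    double score[64];                 /* assumes num_vo <= 64 for this sketch */
    double score_sum = 0.0;

    for (int i = 0; i < num_vo; i++) {
        score[i] = w_size * size[i] + w_motion * motion[i] + w_mad * mad[i];
        score_sum += score[i];
    }
    for (int i = 0; i < num_vo; i++)
        target[i] = (long)(frame_target * score[i] / score_sum);
}

In high mode, the MAD term would carry the highest weight to keep the quality homogeneous among the VOs; in low mode, the motion term would be emphasized so that more bits go to the foreground.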

18.6.3 Video Decoder

The decoder mainly consists of three parts: shape, motion, and texture decoding. The decoder block diagram is shown in Figure 18.17. At the decoder, the bitstream is first demultiplexed into shape and motion information as well as texture information. The reconstructed VOP is obtained by the appropriate combination of the shape, texture, and motion information. Shape decoding is a unique feature of the MPEG-4 decoder. The basic technologies of shape decoding are context-based arithmetic decoding and block-based MC.

The primary data structure is the binary alpha block (BAB). The BAB is a square block of binary pixels representing the opacity or transparency of the pixels in a specified block-shaped spatial region of size 16 × 16 pixels that is colocated with each texture MB.

FIGURE 18.17 VOP decoder structure. (The bitstream is demultiplexed into shape decoding, motion decoding, and texture decoding paths; motion compensation, VOP memory, and VOP composition produce the output.)


FIGURE 18.18 Block diagram of texture decoding. (Variable-length decoding, inverse scan, inverse DC/AC prediction, inverse quantization, and inverse DCT, followed by motion compensation with a VOP memory, produce the reconstructed VOP.)

The block diagram of the texture decoder is shown in Figure 18.18.

The texture decoding is similar to the video decoding in MPEG-1/2 except for the inverse DC/AC prediction and additional quantization methods. The DC prediction is different from the one used in MPEG-1/2. In MPEG-4 the DC coefficient is adaptively predicted from the above block or the left block. The AC prediction is similar to the one used in H.263 but is not used in MPEG-1/2. For MC, the MVs must be decoded. The horizontal and vertical MV components are decoded differentially by using a prediction formed from a spatial neighborhood of three MVs already decoded. The final MV is obtained by adding the prediction MV values to the decoded differential motion values. Also, in MPEG-4 video coding several advanced MC modes, such as four 8 × 8 MV compensation and overlapped MC, have to be handled. Another issue of MC in MPEG-4 is raised by VOP-based coding. In order to perform MC prediction on a VOP basis, a special padding technique is applied to each MB that lies on the shape boundary of the VOP. The padding process defines the values of pixels located outside the VOP for prediction of arbitrarily shaped objects. Padding for luminance pixels and chrominance pixels is defined in the standard [mpeg4 visual]. Additional decoding issues that are special to MPEG-4 include sprite decoding, generalized scalable decoding, and still texture decoding. We do not go into further detail on these topics (interested readers can find the details in the standard documents). The outputs of the decoding process are the reconstructed VOPs, which are finally sent to the compositor. In the compositor, the VOPs are recursively blended in the order specified by the VOP composition order. It should be noted that the decoders can take advantage of object-based decoding; they are able to be flexible in the composition of the reconstructed VOPs, allowing relocation, rotation, or other editing actions.
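The differential decoding of the motion vectors can be sketched as follows. The median-of-three predictor follows the H.263/MPEG-4 style of spatial prediction mentioned above; the exact choice of the three candidate blocks and the boundary rules are omitted here.

/* Reconstruct a motion vector from its decoded differential and three
   previously decoded neighboring candidates. */
typedef struct { int x, y; } MV;

static int median3(int a, int b, int c)
{
    if (a > b) { int t = a; a = b; b = t; }    /* ensure a <= b              */
    if (b > c) b = c;                          /* b = min(b, c)              */
    return (a > b) ? a : b;                    /* median = max(a, min(b, c)) */
}

MV reconstruct_mv(MV mvd, MV cand1, MV cand2, MV cand3)
{
    MV mv;
    mv.x = mvd.x + median3(cand1.x, cand2.x, cand3.x);
    mv.y = mvd.y + median3(cand1.y, cand2.y, cand3.y);
    return mv;
}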

18.7 Summary

In this chapter, the new video coding standard, MPEG-4, was introduced. The unique feature of MPEG-4 video is content-based coding. This feature allows MPEG-4 to provide functionality that other video coding standards do not have. The key technologies used in MPEG-4 video have been described. These technologies provide the basic tools with which MPEG-4 video provides object-based coding functionality. Finally, the video VM, the platform for MPEG-4 development, and an encoding and decoding example have been described.


Exercises

1. Why is object (or content)-based coding the most important feature of the MPEG-4 visual coding standard? Describe several applications for this feature.

2. What are the new coding tools in MPEG-4 visual coding that are different from MPEG-2 video coding? Is MPEG-4 backward compatible with MPEG-2?

3. MPEG-4 video coding has the feature of using either a 16 × 16 block MV or 8 × 8 block MVs. For what kind of video sequences will the 8 × 8 block motion increase coding efficiency? For what kind of video sequences will the 8 × 8 block MC decrease the coding efficiency?

4. What approaches for error resilience are supported by the MPEG-4 syntax? Make a comparison with the error resilience method adopted in MPEG-2 (supported by MPEG-2 syntax) and indicate their relative advantages and disadvantages.

5. Design an arithmetic coder for zero-tree coding and write a program to test it with several images.

6. Sprite is a new feature of MPEG-4 video coding. MPEG-4 specifies the syntax for sprite coding, but does not give any detail on how to generate a sprite. Conduct a project to generate an off-line sprite for a video sequence and use it for coding the video sequence. Do you observe any increased coding efficiency? When do you expect to see such an increase?

7. Shape coding (binary-shape coding) is an important part of MPEG-4 due to object-based coding. Besides the shape coding method used in MPEG-4, name another shape coding method. Conduct a project to compare the method you know with the method proposed in MPEG-4. (Do not expect to get better performance, but expect to reduce the complexity.)

References

[lee 1997] H.J. Lee, T. Chiang, and Y.Q. Zhang, Scalable rate control for very low bit-rate coding, Proceedings of the International Conference on Image Processing (ICIP'97), Vol. II, Santa Barbara, CA, pp. 768–771, October 1997.

[lin 1998] I.-J. Lin, S.Y. Kung, A. Vetro, and H. Sun, Circular Viterbi: Boundary detection with dynamic programming, MPEG98.

[mpeg4 visual] ISO/IEC 14496-2, Coding of audio-visual objects, Part 2, December 18, 1998.

[mpeg4 vm12] ISO/IEC 14496-2, Video Verification Model V.12, N2552, December 1998.

[mpeg96/m960] S. Colonnese and G. Russo, FUB results on core experiment N2: comparison of automatic segmentation techniques, MPEG96/M960, Tampere, July 1996.

[mpeg97/m3147] J.G. Choi, M. Kim, H. Lee, and C. Ahn, Partial experiments on a user-assisted segmentation technique for video object plane generation, MPEG97/M3147, San Jose meeting of ISO/IEC JTC1/SC29/WG11, February 1998.

[shapiro 1993] J. Shapiro, Embedded image coding using zerotrees of wavelet coefficients, IEEE Transactions on Signal Processing, 41, 12, 3445–3462, December 1993.

[vetro 1999] A. Vetro, H. Sun, and Y. Wang, MPEG-4 rate control for multiple video objects, IEEE Transactions on Circuits and Systems for Video Technology, 9, 1, 186–199, February 1999.


19 ITU-T Video Coding Standards H.261 and H.263

This chapter introduces ITU-T video coding standards H.261 and H.263, which are established mainly for videophony and videoconferencing. The basic technical detail of H.261 is presented. The technical improvements with which H.263 achieves high coding efficiency are discussed. Features of H.263+, H.263++, and H.26L are presented.

19.1 Introduction

Very low bit rate video coding has found many industrial applications such as wireless and network communications. The rapid convergence of standardization of digital video coding standards is the reflection of several factors: the maturity of technologies in terms of algorithmic performance, hardware implementation with VLSI technology, and the market need for rapid advances in wireless and network communications. As stated in the previous chapters, these standards include JPEG for still image coding and MPEG-1/2 for CD-ROM storage and digital television (DTV) applications. In parallel with the ISO/IEC development of the MPEG-1/2 standards, the ITU-T has developed H.261 [h261] for videotelephony and videoconferencing applications in an ISDN environment.

19.2 H.261 Video Coding Standard

The H.261 video coding standard was developed by ITU-T Study Group XV during 1988–1993. It was adopted in 1990 and the final revision was approved in 1993. It is also referred to as the P × 64 standard because it encodes digital video signals at bit rates of P × 64 kbits/s, where P is an integer from 1 to 30, i.e., at bit rates from 64 kbits/s to 1.92 Mbits/s.

19.2.1 Overview of H.261 Video Coding Standard

The H.261 video coding standard has many common features with the MPEG-1 video coding standard. However, as they target different applications, there exist many differences between the two standards, such as data rates, picture quality, end-to-end delay, etc. Before indicating the differences between the two coding standards, we describe the major similarities between H.261 and MPEG-1/2. First, both standards are used to code similar video formats. H.261 is mainly used to code video with common intermediate format (CIF) or quarter-CIF (QCIF) spatial resolution for teleconference applications. MPEG-1 uses CIF, SIF, or higher spatial resolution for CD-ROM applications. The original motivation of developing


the H.261 video coding standard was to provide a standard that can be used for both PAL and NTSC television signals. Later, however, H.261 came to be used mainly for videoconferencing, and MPEG-1/2 is used for DTV, VCD (video CD), and DVD (digital video disk). The two TV systems, phase alternating line (PAL) and National Television Systems Committee (NTSC), use different line and picture rates. NTSC, which is used in North America and Japan, uses 525 lines per interlaced picture at 30 frames/s. The PAL system is used in most other countries and uses 625 lines per interlaced picture at 25 frames/s. For this purpose, the CIF was adopted as the source video format for the H.261 video coder. The CIF consists of 352 pixels per line, 288 lines per frame, and 30 frames/s. This format represents half the active lines of the PAL signal and the same picture rate as the NTSC signal. PAL systems need only perform a picture rate conversion and NTSC systems need only perform a line-number conversion. Color pictures consist of one luminance and two color-difference components (referred to as YCbCr format) as specified by the CCIR 601 standard. The Cb and Cr components are half the size of the luminance in both horizontal and vertical directions and have 176 pixels per line and 144 lines per frame. Another format, QCIF, is used for very low bit rate applications. The QCIF has half the number of pixels and half the number of lines of CIF. Second, the key coding algorithms of H.261 and MPEG-1 are very similar. Both H.261 and MPEG-1 use discrete cosine transform (DCT)-based coding to remove intraframe redundancy and motion compensation to remove interframe redundancy.

Now let us describe the main differences between the two coding standards with respect to coding algorithms. The main differences include:

H.261 uses only I- and P-macroblocks (MBs) but no B-MBs, whereas MPEG-1 uses three MB types, I-, P-, and B-MBs (an I-MB is an intraframe-coded MB, a P-MB is a predictive-coded MB, and a B-MB is a bidirectionally coded MB), as well as three picture types, I-, P-, and B-pictures, as defined in Chapter 16 for the MPEG-1 standard.

There is a constraint in H.261 that for every 132 interframe-coded MBs, which corresponds to four GOBs (groups of blocks) or to one-third of a CIF picture, at least one MB must be intraframe coded. To obtain better coding performance in low bit rate applications, most encoding schemes of H.261 prefer not to use intraframe coding on all the MBs of a picture but only on a few MBs in every picture with a rotational scheme. MPEG-1 uses the GOP (group of pictures) structure, where the size of the GOP (the distance between two I-pictures) is not specified.

The end-to-end delay is not a critical issue for MPEG-1 but is critical for H.261. The video encoder and video decoder delays of H.261 need to be known to allow audio compensation delays to be fixed when H.261 is used in interactive applications. This will allow lip synchronization to be maintained.

The accuracy of motion compensation in MPEG-1 is up to a half pixel, but only a full pixel in H.261. However, H.261 uses a loop filter to smooth the earlier frame. This filter attempts to minimize the prediction error.

In H.261, a fixed picture aspect ratio of 4:3 is used. In MPEG-1, several picture aspect ratios can be used and the picture aspect ratio is defined in the picture header.

Finally, in H.261, the encoded picture rate is restricted to allow up to three skipped frames. This allows the control mechanism in the encoder some flexibility to control the encoded picture quality and satisfy the buffer regulation. Although MPEG-1 has no restriction on skipped frames, the encoder usually does not perform frame skipping. Rather, the syntax for B-frames is exploited, as B-frames require much fewer bits than P-pictures.

19.2.2 Technical Detail of H.261

The key technologies used in the H.261 video coding standard are the DCT and motion compensation. The main components in the encoder include DCT, prediction,


FIGURE 19.1 Block diagram of a typical H.261 video encoder. (The input passes through an intra/inter mode switch to the DCT and quantizer Q, followed by the VLC and buffer; the prediction loop contains the inverse quantizer IQ, IDCT, loop filter, motion compensation, frame memory, and motion estimation, all under the coding control.)

quantization (Q), inverse DCT (IDCT), inverse quantization (IQ), loop filter, frame memory, variable-length coding (VLC), and the coding control unit. A typical encoder structure is shown in Figure 19.1.

The input video source is first converted to the CIF frame and then stored in the frame memory. The CIF frame is then partitioned into groups of blocks (GOBs). A GOB contains 33 MBs, which is one-twelfth of a CIF picture or one-third of a QCIF picture. Each MB consists of six 8 × 8 blocks, among which four are luminance (Y) blocks and two are chrominance blocks (one of Cb and one of Cr).

For the intraframe mode, each 8 × 8 block is first transformed with the DCT and then quantized. The VLC is applied to the quantized DCT (QDCT) coefficients with a zigzag scanning order such as in MPEG-1. The resulting bits are sent to the encoder buffer to form a bitstream.

For the interframe coding mode, the frame prediction is performed with motion estimation in a similar manner to that in MPEG-1, but only P-MBs and P-pictures are used; there are no B-MBs and B-pictures. Each 8 × 8 block of differences or prediction residues is coded by the same DCT coding path as for intraframe coding. In the motion compensated (MC) predictive coding, the encoder should perform the motion estimation with the reconstructed pictures instead of the original video data, as it will be done in the decoder. Therefore, the IQ and IDCT blocks are included in the motion compensation loop to reduce the error propagation drift. However, since the VLC operation is lossless, there is no need to include the VLC block in the motion compensation loop. The role of the spatial filter is to minimize the prediction error by smoothing the previous frame that is used for motion compensation.

The loop filter is a separable two-dimensional (2-D) spatial filter that operates on an 8 × 8 block. The corresponding one-dimensional (1-D) filters are nonrecursive with coefficients 1/4, 1/2, 1/4. At block boundaries, the coefficients are 0, 1, 0 so that the taps do not fall outside the block. It should be noted that MPEG-1 uses sub-pixel accurate motion vectors instead of a loop filter to smooth the anchor frame. A performance comparison of the two methods should be interesting.
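A sketch of this separable filter is given below: the (1/4, 1/2, 1/4) filter is applied along the rows and then along the columns of an 8 × 8 block, with the (0, 1, 0) coefficients, i.e., no filtering, at the block boundaries. The rounding used here is a simple illustration; the exact rounding rule is specified in the standard.

/* Separable loop filter over one 8 x 8 block (a sketch, not the normative code). */
static void loop_filter_1d(const int in[8], int out[8])
{
    out[0] = in[0];                                   /* boundary: coefficients (0, 1, 0) */
    for (int k = 1; k < 7; k++)
        out[k] = (in[k - 1] + 2 * in[k] + in[k + 1] + 2) / 4;   /* (1/4, 1/2, 1/4) */
    out[7] = in[7];
}

void loop_filter_8x8(int block[8][8])
{
    int tmp[8][8], col[8], filt[8];

    for (int i = 0; i < 8; i++)                       /* horizontal pass */
        loop_filter_1d(block[i], tmp[i]);

    for (int j = 0; j < 8; j++) {                     /* vertical pass */
        for (int i = 0; i < 8; i++) col[i] = tmp[i][j];
        loop_filter_1d(col, filt);
        for (int i = 0; i < 8; i++) block[i][j] = filt[i];
    }
}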

The role of coding control includes the rate control, the buffer control, the quantization control, and the frame rate control. These parameters are intimately related. The coding control is not part of the standard; however, it is an important part of the encoding process. For a given target bit rate, the encoder has to control several parameters to reach the rate target and at the same time provide reasonable coded picture quality.


As H.261 is a predictive coder and VLCs are used everywhere, such as for coding QDCT coefficients and motion vectors, a single transmission error may cause a loss of synchronization and consequently cause problems for the reconstruction. To enhance the performance of the H.261 video coder in noisy environments, the transmitted bitstream of H.261 can optionally contain a BCH (Bose, Chaudhuri, and Hocquenghem) (511,493) forward error correction code.

The H.261 video decoder performs the inverse operations of the encoder. After optional error correction decoding, the compressed bitstream enters the decoder buffer and then is parsed by the variable-length decoder (VLD). The output of the VLD is applied to the IQ and IDCT where the data are converted to values in the spatial domain. For the interframe coding mode, the motion compensation is performed and the data from the MBs in the anchor frame are added to the current data to form the reconstructed data.

19.2.3 Syntax Description

The syntax of H.261 video coding has a hierarchical layered structure. From the top to the bottom, the layers are the picture layer, GOB layer, MB layer, and block layer.

19.2.3.1 Picture Layer

The picture layer begins with a 20-bit picture start code (PSC). Following the PSC, there are the temporal reference (5 bits), picture-type information (PTYPE, 6 bits), extra insertion information (PEI, 1 bit), and spare information (PSPARE). Then the data for the GOBs follow.

19.2.3.2 Group of Blocks Layer

A GOB corresponds to 176 pixels by 48 lines of Y and 88 pixels by 24 lines of Cb and Cr. The GOB layer contains the following data in order: 16-bit GOB start code (GBSC), 4-bit group number (GN), 5-bit quantization information (GQUANT), 1-bit extra insertion information (GEI), and spare information (GSPARE). The number of bits for GSPARE is variable, depending on the setting of the GEI bit. If GEI is set to 1, then 9 bits follow, consisting of 8 bits of data and another GEI bit to indicate whether a further 9 bits follow, and so on. The GOB header data are then followed by data for the MBs.
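The GEI/GSPARE extension mechanism (and likewise PEI/PSPARE in the picture layer) can be parsed with a simple loop, sketched below; read_bits() is a hypothetical bitstream helper, not a function defined by the standard.

/* Skip the spare information that follows an extension bit (GEI or PEI):
   while the bit is 1, 8 bits of spare data and another extension bit follow. */
unsigned read_bits(void *bitstream, int nbits);   /* assumed helper */

void skip_spare_info(void *bitstream)
{
    while (read_bits(bitstream, 1) == 1)          /* GEI (or PEI) bit     */
        (void)read_bits(bitstream, 8);            /* 8 bits of spare data */
}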

19.2.3.3 Macroblock Layer

Each GOB contains 33 MBs, which are arranged as in Figure 19.2. An MB consists of 16 pixels by 16 lines of Y that spatially correspond to 8 pixels by 8 lines

of each Cb and Cr. Data in the bitstream for an MB consist of an MB header followed by data for the blocks. The MB header may include the MB address (MBA) (variable length), type information (MTYPE) (variable length), quantizer (MQUANT) (5 bits), motion vector data (MVD) (variable length), and coded block pattern (CBP) (variable length). The MBA information is always present and is coded by VLC. The VLC table for MB addressing is shown in Table 19.1. The presence of the other items depends on the MB-type information, which is shown in the VLC table (Table 19.2).

FIGURE 19.2 Arrangement of macroblocks (MBs) in a group of blocks (GOB).

1 2 3 4 5 6 7 8 9 10 11

12 13 14 15 16 17 18 19 20 21 22

23 24 25 26 27 28 29 30 31 32 33


TABLE 19.1

Variable-Length Coding (VLC) Table for Macroblock (MB) Addressing

MBA  Code           MBA  Code             MBA           Code
1    1              13   0000 1000        25            0000 0100 000
2    011            14   0000 0111        26            0000 0011 111
3    010            15   0000 0110        27            0000 0011 110
4    0011           16   0000 0101 11     28            0000 0011 101
5    0010           17   0000 0101 10     29            0000 0011 100
6    0001 1         18   0000 0101 01     30            0000 0011 011
7    0001 0         19   0000 0101 00     31            0000 0011 010
8    0000 111       20   0000 0100 11     32            0000 0011 001
9    0000 110       21   0000 0100 10     33            0000 0011 000
10   0000 1011      22   0000 0100 011    MBA stuffing  0000 0001 111
11   0000 1010      23   0000 0100 010    Start code    0000 0000 0000 0001
12   0000 1001      24   0000 0100 001

19.2.3.4 Block Layer

Data in the block layer consist of the transform coefficients followed by an end of block (EOB) marker (10). The data of the transform coefficients (TCOEFF) are first converted to pairs of RUN and LEVEL according to the zigzag scanning order. The RUN represents the number of successive zeros and the LEVEL represents the value of the nonzero coefficient. The pairs of RUN and LEVEL are then encoded with VLCs. The DC coefficient of an intrablock is coded by a fixed-length code with 8 bits. All VLC tables can be found in the standard document [h261].
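The run–length conversion can be illustrated with the short sketch below. The zigzag[] table holds the scan order and emit_vlc()/emit_eob() are hypothetical output helpers; the intra DC coefficient, which is coded separately with a fixed 8-bit code, is not handled here.

/* Convert the quantized coefficients of one block into (RUN, LEVEL) pairs
   in zigzag order and terminate the block with the EOB marker. */
void emit_vlc(int run, int level);   /* assumed VLC output helper */
void emit_eob(void);                 /* assumed EOB output helper */

void code_block(const int coeff[64], const int zigzag[64])
{
    int run = 0;
    for (int k = 0; k < 64; k++) {
        int level = coeff[zigzag[k]];
        if (level == 0) {
            run++;                   /* count successive zeros     */
        } else {
            emit_vlc(run, level);    /* code the (RUN, LEVEL) pair */
            run = 0;
        }
    }
    emit_eob();                      /* end of block marker ("10") */
}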

19.3 H.263 Video Coding Standard

The H.263 video coding standard [h263] is specifically designed for very low bit rate applications such as practical video telecommunication. Its technical content was completed in late 1995 and the standard was approved in early 1996.

TABLE 19.2

Variable-Length Coding (VLC) Table for Macroblock (MB) Type

Prediction          MQUANT  MVD  CBP  TCOEFF  VLC
Intra                                 X       0001
Intra               X                 X       0000 001
Inter                            X    X       1
Inter               X            X    X       0000 1
Inter + MC                  X                 0000 0000 1
Inter + MC                  X    X    X       0000 0001
Inter + MC          X       X    X    X       0000 0000 01
Inter + MC + FIL            X                 001
Inter + MC + FIL            X    X    X       01
Inter + MC + FIL    X       X    X    X       0000 01

Notes: The columns denote the MB quantizer (MQUANT), motion vector data (MVD), coded block pattern (CBP), transform coefficients (TCOEFF), and the variable-length code (VLC). "X" means that the item is present in the MB. It is possible to apply the filter in a non-motion-compensated (MC) MB by declaring it as MC + FIL but with a zero vector.


TABLE 19.3

Number of Pixels per Line and the Number of Lines for Each Picture Format

Picture Format   Number of Pixels      Number of Lines       Number of Pixels         Number of Lines
                 for Luminance (dx)    for Luminance (dy)    for Chrominance (dx/2)   for Chrominance (dy/2)
Sub-QCIF         128                   96                    64                       48
QCIF             176                   144                   88                       72
CIF              352                   288                   176                      144
4CIF             704                   576                   352                      288
16CIF            1408                  1152                  704                      576

19.3.1 Overview of H.263 Video Coding

The basic configuration of the video source coding algorithm of H.263 is based on H.261. Several important features that are different from H.261 include the following new options: unrestricted motion vectors, syntax-based arithmetic coding, advanced prediction, and PB-frames. All these features can be used together or separately for improving the coding efficiency. The H.263 video standard can be used for both 625-line and 525-line television standards. The source coder operates on noninterlaced pictures at a picture rate of about 30 pictures/s. The pictures are coded as luminance and two color-difference components (Y, Cb, and Cr). The source coder is based on a CIF. Actually, there are five standardized formats, which include sub-QCIF, QCIF, CIF, 4CIF, and 16CIF. The details of the formats are shown in Table 19.3.

It is noted that for each format, the chrominance is a quarter the size of the luminance picture, i.e., the chrominance pictures are half the size of the luminance picture in both horizontal and vertical directions. This is defined by the ITU-R 601 format. For CIF, the number of pixels per line is compatible with sampling the active portion of the luminance and color-difference signals from a 525- or 625-line source at 6.75 and 3.375 MHz, respectively. These frequencies have a simple relationship to those defined by the ITU-R 601 format.

19.3.2 Technical Features of H.263

The H.263 encoder structure is similar to the H.261 encoder with the exception that there is no loop filter in the H.263 encoder. The main components of the encoder include block transform, MC prediction, block quantization, and VLC. Each picture is partitioned into GOBs. A GOB contains a multiple of 16 lines, k*16 lines, depending on the picture format (k = 1 for sub-QCIF, QCIF; k = 2 for 4CIF; k = 4 for 16CIF). Each GOB is divided into MBs that are the same as in H.261; each MB consists of four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks. Compared to H.261, H.263 has several new technical features for the enhancement of coding efficiency for very low bit rate applications. These new features include picture-extrapolating motion vectors (or unrestricted motion vector mode), motion compensation with half-pixel accuracy, advanced prediction (which includes variable block size motion compensation and overlapped block motion compensation), syntax-based arithmetic coding, and PB-frame mode.

19.3.2.1 Half-Pixel Accuracy

In H.263 video coding, half-pixel accuracy motion compensation is used. The half-pixel values are found using bilinear interpolation as shown in Figure 19.3.


FIGURE 19.3 Half-pixel prediction by bilinear interpolation. A, B, C, and D are integer pixel positions and a, b, c, and d are half-pixel positions, with

a = A
b = (A + B + 1)/2
c = (A + C + 1)/2
d = (A + B + C + D + 2)/4

where "/" indicates division by truncation.

Note that H.263 uses sub-pixel accuracy for motion compensation instead of using a loop filter to smooth the anchor frames as in H.261. This is also done in other coding standards such as MPEG-1 and MPEG-2, which also use half-pixel accuracy for motion compensation. In MPEG-4 video, quarter-pixel accuracy for motion compensation has been adopted as a tool for version 2.
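The interpolation rule of Figure 19.3 can be written directly as the following sketch; ref() is a hypothetical accessor for integer-position pixels, and the divisions truncate as in the figure.

/* Return the (possibly half-pixel) prediction sample at position (x2, y2)
   given in half-pixel units (assumed nonnegative here). */
unsigned char ref(const unsigned char *frame, int stride, int x, int y);

int half_pel_sample(const unsigned char *frame, int stride, int x2, int y2)
{
    int x = x2 >> 1, y = y2 >> 1;            /* integer part              */
    int hx = x2 & 1, hy = y2 & 1;            /* half-pixel fractions      */
    int A = ref(frame, stride, x,     y);
    int B = ref(frame, stride, x + 1, y);
    int C = ref(frame, stride, x,     y + 1);
    int D = ref(frame, stride, x + 1, y + 1);

    if (!hx && !hy) return A;                /* a = A                     */
    if ( hx && !hy) return (A + B + 1) / 2;  /* b = (A + B + 1)/2         */
    if (!hx &&  hy) return (A + C + 1) / 2;  /* c = (A + C + 1)/2         */
    return (A + B + C + D + 2) / 4;          /* d = (A + B + C + D + 2)/4 */
}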

19.3.2.2 Unrestricted Motion Vector Mode

Usually the motion vectors are limited to the coded picture area of the anchor frames. In the unrestricted motion vector mode, the motion vectors are allowed to point outside the pictures. When the values of the motion vectors exceed the boundary of the anchor frame in the unrestricted motion vector mode, the picture-extrapolating method is used: the values of reference pixels outside the picture boundary take the values of the boundary pixels. An extension of the motion vector range is also applied in the unrestricted motion vector mode. In the default prediction mode, the motion vectors are restricted to the range [−16, 15.5]. In the unrestricted mode, the maximum range for motion vectors is extended to [−31.5, 31.5] under certain conditions.
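The picture-extrapolation rule amounts to clamping the reference coordinates to the picture, as in the sketch below, so that pixels outside the picture take the values of the nearest boundary pixels.

/* Fetch a reference pixel with boundary extrapolation (unrestricted MV mode). */
static int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

unsigned char ref_unrestricted(const unsigned char *frame,
                               int width, int height, int x, int y)
{
    x = clamp(x, 0, width - 1);      /* coordinates outside the picture ...     */
    y = clamp(y, 0, height - 1);     /* ... are moved to the nearest edge pixel */
    return frame[y * width + x];
}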

19.3.2.3 Advanced Prediction Mode

Generally, the decoder will accept no more than one motion vector per MB for the baseline algorithm of the H.263 video coding standard. However, in the advanced prediction mode, the syntax allows up to four motion vectors to be used per MB. The decision of using one or four vectors is indicated by the MB type and CBP for chrominance (MCBPC) code word for each MB. How to make this decision is the task of the encoding process.

The following example gives the steps of motion estimation and coding mode selection for the advanced prediction mode in the encoder.

Step 1: Integer pixel motion estimation

SAD_N(x, y) = Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} |original − previous|    (19.1)

where SAD is the sum of absolute differences, (x, y) lies within the search range, N is equal to 16 for a 16 × 16 block, and N is equal to 8 for an 8 × 8 block.

SAD_4×8 = Σ SAD_8(x, y)  (summed over the four 8 × 8 blocks)    (19.2)

SAD_inter = min(SAD_16(x, y), SAD_4×8)    (19.3)

Step 2: Intra/inter mode decision
If A < (SAD_inter − 500), this MB is coded as an intra-MB;

� 2007 by Taylor & Francis Group, LLC.

Page 503: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

otherwise, it is coded as an inter-MB, where SAD_inter is determined in step 1, and

A = Σ_{i=0}^{15} Σ_{j=0}^{15} |original − MB_mean|,    MB_mean = (1/256) Σ_{i=0}^{15} Σ_{j=0}^{15} original    (19.4)

If this MB is determined to be coded as an inter-MB, go to step 3.

Step 3: Half-pixel search
In this step, the half-pixel search is performed for both 16 × 16 and 8 × 8 blocks as shown in Figure 19.3.

Step 4: Decision on 16 × 16 or four 8 × 8 (one motion vector or four motion vectors per MB)
If SAD_4×8 < SAD_16 − 100, four motion vectors per MB will be used, one of the motion vectors being used for all pixels in one of the four luminance blocks in the MB; otherwise, one motion vector will be used for all pixels in the MB.

Step 5: Differential coding of the motion vectors for each of the 8 × 8 luminance blocks is performed as in Figure 19.4.

When it has been decided to use four motion vectors, the MVD_CHR motion vector for both chrominance blocks is derived by calculating the sum of the four luminance vectors and dividing by 8. The component values of the resulting sixteenth-pixel resolution vectors are modified toward the positions indicated in Table 19.4.
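The decisions of steps 1 through 4 can be summarized in the following sketch. The helpers sad16(), sad8x8_sum(), and mb_deviation() are assumed to return SAD_16 at the best match, SAD_4×8 (the sum of the four 8 × 8 SADs), and the value A of Equation 19.4; the half-pixel refinement of step 3 is assumed to have been applied before step 4.

/* Intra/inter and one-vs.-four motion vector decisions (steps 1-4 above). */
long sad16(void);          /* SAD_16(x, y) at the best match              */
long sad8x8_sum(void);     /* SAD_4x8 = sum of the four 8 x 8 SADs        */
long mb_deviation(void);   /* A = sum of |original - MB_mean| over the MB */

typedef enum { MB_INTRA, MB_INTER_1MV, MB_INTER_4MV } MbDecision;

MbDecision decide_mb_mode(void)
{
    long s16  = sad16();
    long s4x8 = sad8x8_sum();
    long sad_inter = (s16 < s4x8) ? s16 : s4x8;       /* Equation 19.3     */

    if (mb_deviation() < sad_inter - 500)             /* step 2: intra test */
        return MB_INTRA;

    if (s4x8 < s16 - 100)                             /* step 4: 4 MV test  */
        return MB_INTER_4MV;
    return MB_INTER_1MV;
}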

Another advanced prediction mode is overlapped motion compensation for luminance. Actually, this idea is also used by MPEG-4, which has been described in Chapter 18. In the overlapped motion compensation mode, each pixel in an 8 × 8 luminance block is a weighted sum of three values divided by 8 with rounding. The three values are obtained by motion compensation with three motion vectors: the motion vector of the current

FIGURE 19.4 Differential coding of motion vectors. The motion vector differences are

MVDx = MVx − Px,  MVDy = MVy − Py

with the predictors

Px = Median(MV1x, MV2x, MV3x),  Py = Median(MV1y, MV2y, MV3y)

where MV1, MV2, and MV3 are the candidate neighboring motion vectors, and Px = Py = 0 if the MB is intracoded or the block is outside the picture boundary.


TABLE 19.4

Modification of Sixteenth Pixel Resolution Chrominance Vector Components

Sixteenth-pixel position (/16):  0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
Resulting position (/2):         0  0  0  1  1  1  1  1  1  1   1   1   1   1   2   2

luminance block and two of four remote vectors. These remote vectors include the motion vector of the block to the left or right of the current block and the motion vector of the block above or below the current block. The remote motion vectors from other GOBs are used in the same way as remote motion vectors inside the current GOB. For each pixel to be coded in the current block, the remote motion vectors of the blocks at the two nearest block borders are used, i.e., for the upper half of the block, the motion vector corresponding to the block above the current block is used, while for the lower half of the block, the motion vector corresponding to the block below the current block is used. Similarly, the left half of the block uses the motion vector of the block at the left side of the current block and the right half uses the one at the right side of the current block. To make this more clear, let (MV0x, MV0y) be the motion vector for the current block, (MV1x, MV1y) be the motion vector for the block either above or below, and (MV2x, MV2y) be the motion vector of the block either to the left or right of the current block. Then the value of each pixel, p(x, y), in the current 8 × 8 luminance block is given by

p(x, y) = [q(x, y) · H0(x, y) + r(x, y) · H1(x, y) + s(x, y) · H2(x, y) + 4]/8    (19.5)

where

q(x, y) = p(x + MV0x, y + MV0y),  r(x, y) = p(x + MV1x, y + MV1y),  s(x, y) = p(x + MV2x, y + MV2y)    (19.6)

H0 is the weighting matrix for prediction with the current block motion vector, H1 is the weighting matrix for prediction with the top or bottom block motion vector, and H2 is the weighting matrix for prediction with the left or right block motion vector.

This applies to the luminance block only. The values of H0, H1, and H2 are shown in Figure 19.5.

FIGURE 19.5 Weighting matrices for overlapped motion compensation.

H0 (current block):     H1 (top/bottom block):  H2 (left/right block):
4 5 5 5 5 5 5 4         2 2 2 2 2 2 2 2         2 1 1 1 1 1 1 2
5 5 5 5 5 5 5 5         1 1 2 2 2 2 1 1         2 2 1 1 1 1 2 2
5 5 6 6 6 6 5 5         1 1 1 1 1 1 1 1         2 2 1 1 1 1 2 2
5 5 6 6 6 6 5 5         1 1 1 1 1 1 1 1         2 2 1 1 1 1 2 2
5 5 6 6 6 6 5 5         1 1 1 1 1 1 1 1         2 2 1 1 1 1 2 2
5 5 6 6 6 6 5 5         1 1 1 1 1 1 1 1         2 2 1 1 1 1 2 2
5 5 5 5 5 5 5 5         1 1 2 2 2 2 1 1         2 2 1 1 1 1 2 2
4 5 5 5 5 5 5 4         2 2 2 2 2 2 2 2         2 1 1 1 1 1 1 2


It should be noted that the above coding scheme is not optimized in the selection of the mode decision since the decision depends only on the values of the prediction residues. Optimized mode decision techniques that include the above possibilities for prediction have been considered in [weigand 1996].

19.3.2.4 Syntax-Based Arithmetic Coding

As in other video coding standards, H.263 uses VLC/VLD to remove the redundancy in the video data. The basic principle of VLC is to encode a symbol with a specific table based on the syntax of the coder. If the symbol is mapped to an entry of the table in a table look-up operation, then the binary code word specified by that entry is sent to a bitstream buffer for transmission to the decoder. In the decoder, the inverse operation, VLD, is performed to reconstruct the symbol by a table look-up operation based on the same syntax of the coder. The tables in the decoder must be the same as those used in the encoder for encoding the current symbol. To obtain better performance, the tables are generated in a statistically optimized way (such as with a Huffman coder) using a large number of training sequences. This VLC/VLD process implies that each symbol must be encoded into a fixed integral number of bits. An optional feature of H.263 is to use arithmetic coding to remove the restriction of a fixed integral number of bits for symbols. This syntax-based arithmetic coding mode may result in bit rate reductions.

19.3.2.5 PB-Frames

The PB-frame is a new feature of H.263 video coding. A PB-frame consists of two pictures, one P-picture and one B-picture, coded as one unit as shown in Figure 19.6. Since H.261 does not have B-pictures, the concept of a B-picture comes from the MPEG video coding standards. In a PB-frame, the P-picture is predicted from the previously decoded I- or P-picture, and the B-picture is bidirectionally predicted both from the previously decoded I- or P-picture and from the P-picture in the PB-frame unit, which is currently being decoded.

Several detailed issues are addressed at the MB level in the PB-frame mode. If an MB in a PB-frame is intracoded, the P-MB in the PB unit is intracoded and the B-MB

in the PB unit is intercoded. The motion vector of the intercoded PB-MB is used for the B-MB only.

An MB in a PB-frame contains 12 blocks for the 4:2:0 format, six (four luminance blocks and two chrominance blocks) from the P-frame and six from the B-frame. The data for the six P-blocks are transmitted first and then the data for the six B-blocks.

FIGURE 19.6 Prediction in PB-frames mode. (The P-picture of the PB-frame is predicted from the previous I- or P-picture, and the B-picture is predicted from both.)


Different parts of a B-block in a PB-frame can be predicted with different modes. For pixels where the backward vector points inside the coded P-MB, bidirectional prediction is used. For all other pixels, forward prediction is used.

19.4 H.263 Video Coding Standard Version 2

19.4.1 Overview of H.263 Version 2

The H.263 version 2 [h263+] video coding standard, also known as H.263+, was approved in January of 1998 by the ITU-T. H.263 version 2 includes a number of new optional features based on the H.263 video coding standard. These new optional features are added to broaden the application range of H.263 and to improve its coding efficiency. The main features are flexible video format, scalability, and backward-compatible supplemental enhancement information. Among these new optional features, five features are intended to improve the coding efficiency and three features are proposed to address the needs of mobile video and other noisy transmission environments. The scalability features provide the capability of generating layered bitstreams, namely spatial scalability, temporal scalability, and signal-to-noise ratio (SNR) scalability, similar to those defined by the MPEG-2 video coding standard. There are also other modes of H.263 version 2 that provide some enhancement functions. We will describe these features in the following section.

19.4.2 New Features of H.263 Version 2

H.263 version 2 includes a number of new features. In the following we briefly describe the key techniques used for these features.

19.4.2.1 Scalability

The scalability function allows for encoding the video sequences in a hierarchical way that partitions the pictures into one basic layer and one or more enhancement layers. The decoders have the option to decode only the base layer bitstream to obtain lower quality reconstructed pictures or to further decode the enhancement layers to obtain higher quality decoded pictures. There are three types of scalability in H.263: temporal scalability, SNR scalability, and spatial scalability.

Temporal scalability is achieved by using B-pictures as the enhancement layer. All three types of scalability are similar to the ones in the MPEG-2 video coding standard. The B-pictures are predicted from either or both a previous and a later decoded picture in the base layer, as shown in Figure 19.7.

In SNR scalability, the pictures are first encoded with coarse quantization in the base layer. The differences, or coding error pictures, between a reconstructed picture and its original in the base layer encoder are then encoded in the enhancement layer and sent to the decoder, providing an enhancement of SNR. In the enhancement layer there are two types of pictures. If a picture in the enhancement layer is only predicted from the base layer, it is referred to as an EI (enhancement I)-picture. It is a bidirectionally predicted picture if it uses both an earlier enhancement layer picture and a temporally simultaneous base layer reference picture for prediction. Note that the prediction from the reference layer uses no motion vectors. However, EP (enhancement P)-pictures use motion vectors when predicted from their temporally earlier reference picture in the same layer. Also, if more


FIGURE 19.7 Temporal scalability. (B-pictures in the enhancement layer are predicted from the surrounding I- and P-pictures of the base layer.)

than two layers are used, the reference may be the lower layer instead of the base layer (see Figure 19.8).

In spatial scalability, lower-resolution pictures are encoded in the base layer or lower layer. The differences, or error pictures, between the up-sampled decoded base layer pictures and the original picture are encoded in the enhancement layer and sent to the decoder, providing the spatial enhancement pictures. As in MPEG-2, spatial interpolation filters are used for the spatial scalability. There are also two types of pictures in the enhancement layer: EI and EP. If a decoder is able to perform spatial scalability, it may also need to be able to use a custom picture format. For example, if the base layer is sub-QCIF (128 × 96), the enhancement layer picture would be 256 × 192, which does not belong to a standard picture format (see Figure 19.9).

The scalability in H.263 can be performed with multiple layers. In the case of multilayer scalability, the picture layer used for upward prediction in an EI or EP picture may be an I, P, EI, or EP picture, or may be the P part of a PB or improved PB-frame in the base layer, as shown in Figure 19.10.

19.4.2.2 Improved PB-Frames

The difference between the PB-frame and the improved PB-frame is that bidirectional prediction is used for B-MBs in the PB-frame, while in the improved PB-frame, B-MBs can be coded in three prediction modes: bidirectional, forward, and backward prediction. This means that in forward prediction or backward prediction, only one motion vector is used for a 16 × 16 MB instead of the two motion vectors used for a 16 × 16 MB in bidirectional prediction. In the very low bit rate case, this mode can improve the coding efficiency by saving bits for coding motion vectors.

FIGURE 19.8 Signal-to-noise ratio (SNR) scalability. (I- and P-pictures in the base layer; EI- and EP-pictures in the enhancement layer.)


FIGURE 19.9 Spatial scalability. (I- and P-pictures in the base layer; EI- and EP-pictures in the enhancement layer.)

19.4.2.3 Advanced Intracoding

The advantage of intracoding is that it prevents error propagation because intracoding does not depend on previously decoded picture data. However, the problem with intracoding is that more bits are needed since the temporal correlation between frames is not exploited. The idea of advanced intracoding (AIC) is to address this problem. The coding efficiency of intracoding is improved by the use of the following three methods:

1. Intrablock prediction using neighboring intrablocks for the same color component (Y, Cb, or Cr): a particular intracoded block may be predicted from the block above or to the left of the current block being decoded, or from both. The main purpose of these predictions is to exploit the correlation between neighboring blocks. For example, the first row of AC coefficients may be predicted from those in the block above, the first column of AC coefficients may be predicted from those in the block to the left, and the DC value may be predicted as an average from the blocks above and to the left.

2. Modified IQ for intracoefficients: IQ of the intra DC coefficient is modified to allow a varying quantization step size. IQ of all intra AC coefficients is performed without a dead zone in the quantizer reconstruction spacing.

3. A separate VLC for intracoefficients: to improve intracoding, a separate VLC table is used for all intra DC and intra AC coefficients. The price paid for this modification is the use of more tables.

FIGURE 19.10 Multilayer scalability. (A base layer with I- and P-pictures and two enhancement layers containing EI-, EP-, and B-pictures.)


FIGURE 19.11 Positions of filtered pixels. (The filtered pixels A, B, C, and D straddle a block boundary, with A and B in block 1 and C and D in block 2; the filtering applies to both vertical and horizontal block edges.)

19.4.2.4 Deblocking Filter

The deblocking filter (DF) is used to further improve the decoded picture quality by smoothing the block artifacts. Its function in improving picture quality is similar to the overlapped block motion compensation. The filter operations are performed across 8 × 8 block edges using a set of four pixels in both the horizontal and vertical directions at the block boundaries, as shown in Figure 19.11. In the figure, the filtering process is applied to the edges. The edge pixels, A, B, C, and D, are replaced by A1, B1, C1, and D1 by the following operations:

B1 = clip(B + d1)    (19.7a)
C1 = clip(C − d1)    (19.7b)
A1 = A − d2    (19.7c)
D1 = D + d2    (19.7d)
d = (A − 4B + 4C − D)/8    (19.7e)
d1 = f(d, S)    (19.7f)
d2 = clipd1((A − D)/4, d1/2)    (19.7g)

where clip() clips the value to the range 0–255, clipd1(x, d) clips x to the range from −d to +d, and the value S is a function of the quantization step QUANT, as defined in Table 19.5.

TABLE 19.5

The Value S as a Function of Quantization Step (QUANT)

QUANT 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

S 1 1 2 2 3 3 4 4 4 5 5 6 6 7 7 7

QUANT 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

S 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12

� 2007 by Taylor & Francis Group, LLC.

Page 510: read.pudn.comread.pudn.com/downloads335/ebook/1470705/ImageAndVideoCompression.pdfIMAGE PROCESSING SERIES Series Editor: Phillip A. Laplante, Pennsylvania State University Published

FIGURE 19.12 The plot of the function f(d, S). (The function ramps up to a maximum of S at |d| = S and falls back to zero at |d| = 2S.)

The function f(d, S) is defined as

f(d, S) = SIGN(d) * MAX(0, abs(d) − MAX(0, 2*(abs(d) − S)))    (19.8)

This function is described by Figure 19.12. From the figure, it is seen that this function is used to control the amount of distortion introduced by the filtering. The filter has an effect only if d is smaller than 2S. Therefore, some features such as an isolated pixel, a corner, etc. will be preserved during the nonlinear filtering, since for those features the value d may exceed 2S. The function f(d, S) is also designed to ensure that a small mismatch between encoder and decoder will remain small and will not be allowed to propagate over multiple pictures. For example, if the filter were simply switched on or off, a mismatch of only +1 or −1 for d could cause the filter to be switched on at the encoder and off at the decoder, or vice versa. It should be noted that the DF proposed here is an optional selection. It is the result of a large number of simulations; it may be effective for some sequences but may not be effective for all kinds of video sequences.
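Equations 19.7 and 19.8 can be put together in the following sketch for one set of edge pixels; the rounding of the integer divisions is simplified here relative to the standard text.

/* Deblocking filter applied to edge pixels A, B (block 1) and C, D (block 2). */
#include <stdlib.h>

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }
static int clipd1(int x, int d) { d = abs(d); return x < -d ? -d : (x > d ? d : x); }

static int f(int d, int S)                        /* Equation 19.8 */
{
    int mag = abs(d);
    int inner = 2 * (mag - S);
    if (inner < 0) inner = 0;
    int v = mag - inner;
    if (v < 0) v = 0;                             /* no effect once |d| >= 2S */
    return (d >= 0) ? v : -v;
}

void deblock_edge(int *A, int *B, int *C, int *D, int S)
{
    int d  = (*A - 4 * *B + 4 * *C - *D) / 8;     /* Equation 19.7e */
    int d1 = f(d, S);                             /* Equation 19.7f */
    int d2 = clipd1((*A - *D) / 4, d1 / 2);       /* Equation 19.7g */

    *B = clip255(*B + d1);                        /* Equation 19.7a */
    *C = clip255(*C - d1);                        /* Equation 19.7b */
    *A = *A - d2;                                 /* Equation 19.7c */
    *D = *D + d2;                                 /* Equation 19.7d */
}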

19.4.2.5 Slice-Structured Mode

A slice contains a video picture segment. In the coding syntax, a slice is defined as a slice header followed by consecutive MBs in scanning order. The slice-structured (SS) mode is designed to address the needs of mobile video and other unreliable transmission environments. This mode contains two submodes: the rectangular slice (RS) submode and the arbitrary slice ordering (ASO) submode. In the rectangular submode, a slice contains a rectangular region of a picture, such that the slice header specifies the width. The MBs in this slice are in scan order within the rectangular region. In the ASO submode, the slices may appear in any order within the bitstream. The arbitrary arrangement of slices in the picture may provide an environment for obtaining better error concealment. The reason is that the damaged areas caused by packet loss may be isolated from each other and can be easily concealed with the correctly decoded neighboring blocks. In this submode, there is usually no data dependency that can cross the slice boundaries, except for the DF mode, because the slices may not be decoded in the normal scan order.

19.4.2.6 Reference Picture Selection

With the optional reference picture selection (RPS) mode, the encoder is allowed to use a modified interframe prediction method. In this method, additional picture memories are used. The encoder may select one of the picture memories to suppress the temporal error propagation due to interframe coding. The information indicating which picture is selected for prediction is included in the encoded bitstream, which is allowed by the syntax. The strategy used by the encoder to select the picture to be used for prediction is open for algorithm design. This mode can use a backward channel message that is sent from the decoder to the encoder to inform the encoder which parts of which pictures have been correctly decoded.


The encoder can use the message from the backward channel to decide which picture will provide better prediction. From the above description of the RPS mode, it becomes evident that this mode is useful for improving the performance over unreliable channels.

19.4.2.7 Independent Segmentation Decoding

The independent segmentation decoding (ISD) mode is another option of H.263 video coding that can be used in unreliable transmission environments. In this mode, each video picture segment is decoded without the presence of any data dependencies across slice boundaries or across GOB boundaries, i.e., with complete independence from all other video picture segments and all data outside the same video picture segment location in the reference pictures. This independence includes no use of motion vectors outside the current video picture segment for motion prediction or of remote motion vectors for overlapped motion compensation in the advanced prediction mode, no DF operation, and no linear interpolation across the boundaries of the current video picture segment.

19.4.2.8 Reference Picture Resampling

The reference picture resampling (RPR) mode allows a previously coded picture to be resampled, or warped, before it is used as a reference picture. The idea of using this mode is similar to the idea of global motion, which is expected to obtain better motion estimation and compensation performance. The warping is defined by four motion vectors for the corners of the reference picture, as shown in Figure 19.13.

For the current picture with horizontal size H and vertical size V, four conceptual motion vectors, MVOO, MVOV, MVHO, and MVHV, are defined for the upper left, lower left, upper right, and lower right corners of the picture, respectively. These motion vectors, as warping parameters, have to be coded with VLC and included in the bitstream. These vectors are used to describe how to move the corners of the current picture to map them onto the corresponding corners of the previously decoded picture, as shown in Figure 19.13. The motion compensation is performed using bilinear interpolation in the decoder with the warping parameters.

19.4.2.9 Reduced-Resolution Update

When encoding a video sequence with highly active scenes, the encoder may have a problem providing sufficient subjective picture quality at low bit rate coding. The reduced-resolution update (RRU) mode is expected to be used in this case for improving the coding performance. This mode allows the encoder to send update information for a picture that is encoded at a reduced resolution to create a final image at the higher resolution. At the encoder, the pictures in the sequence are first down-sampled to a quarter size (half in both horizontal and vertical directions) and then the resulting low-resolution pictures are encoded as shown in Figure 19.14.

FIGURE 19.13 Reference picture resampling. (Conceptual motion vectors such as MVOO, MVOV, and MVHO move the corners of the current picture onto the corresponding corners of the reference picture.)


FIGURE 19.14 Block diagram of the encoder with reduced-resolution update (RRU) mode. (The input sequence is down-sampled before encoding; the prediction loop contains a decoder and an up-sampling stage, and the VLC produces the bitstream.)

The decoder with this mode is more complicated than one without it. The block diagram of the decoding process with the RRU mode is shown in Figure 19.15.

The decoder with RRU mode has to deal with several new issues. First, the reconstructed pictures are up-sampled to the full size for display. However, the reference pictures have to be extended to an integer multiple of 32 × 32 MBs if necessary. The pixel values in the extended areas take the values of the original border pixels. Second, the motion vectors for 16 × 16 MBs in the encoder are used for the up-sampled 32 × 32 MBs in the decoder. Therefore, an additional procedure is needed to reconstruct the motion vectors for each up-sampled 16 × 16 MB, including the chrominance MBs. Third, bilinear interpolation is used for up-sampling in the decoder loop. Finally, at the boundaries of the reconstructed picture, a block boundary filter is used along the edges of the 16 × 16 reconstructed blocks at the encoder as well as at the decoder. Two kinds of block boundary filters have been proposed. One is the DF described in Section 19.4.2.4. The other is defined as follows. If two pixels, A and B, are neighboring pixels with A in block 1 and B in block 2, then the filter is designed as

A1 = (3*A + B + 2)/4    (19.9a)

B1 = (A + 3*B + 2)/4    (19.9b)

where A1 and B1 are the pixels after filtering and "/" is division with truncation.
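A minimal sketch of this block boundary filter is given below, assuming nonnegative 8-bit pixel values so that Python's integer division matches division with truncation; the function and variable names are illustrative and not taken from the standard.

def rru_boundary_filter(a, b):
    """Filter a pair of neighboring pixels straddling a block boundary
    in RRU mode, following Equations 19.9a and 19.9b.

    a belongs to block 1 and b to block 2; '//' is integer division,
    which for nonnegative pixel values is division with truncation.
    """
    a1 = (3 * a + b + 2) // 4
    b1 = (a + 3 * b + 2) // 4
    return a1, b1

# Example: a hard edge (200 | 100) is softened to (175, 125).
print(rru_boundary_filter(200, 100))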

FIGURE 19.15 Block diagram of decoder with reduced-resolution update (RRU) mode (VLD, IQ, 8 × 8 IDCT, up-sampling to 16 × 16, 16 × 16 prediction, motion vector reconstruction, motion compensation, full-size frame memory, and full-size reconstruction).


19.4.2.10 Alternative Inter VLC and Modified Quantization

The alternative inter VLC (AIV) mode is developed to improve the coding efficiency of inter-picture coding for pictures containing significant scene changes. This efficiency improvement is obtained by allowing some VLC codes originally designed for intra-picture coding to be used for inter-picture coefficients. The idea is very intuitive and simple. When a rapid scene change occurs in the video sequence, inter-picture prediction becomes difficult. This results in large prediction differences, which are similar to intra-picture data. Therefore, using the intra-picture VLC tables instead of the inter-picture tables may yield better results. However, there is no syntax definition for this mode. In other words, the encoder may use the intra VLC table for encoding an inter-block without informing the decoder. After receiving all coefficient codes of a block, the decoder will first decode these code words with the inter VLC tables. If the addressing of coefficients stays inside the 64 coefficients of a block, the VLD will accept the results even if some coding mismatch exists. Only if coefficients outside the block are addressed will the code words be interpreted according to the intra VLC table.

The modified quantization mode is designed to provide several features that can improve the coding efficiency. First, with this mode more flexible control of the quantizer step can be specified in the dequantization field. The dequantization field is no longer a 2-bit fixed-length field; it is a variable-length field that can be either 2 or 6 bits depending on the first bit. Second, in this mode, the quantization parameter of the chrominance coefficients is different from the quantization parameter of the luminance coefficients. The chrominance fidelity can be improved by specifying a smaller quantization step for chrominance than for luminance. Finally, this mode allows the extension of the range of coefficient values. This provides a more accurate representation of any possible true coefficient value with the accuracy allowed by the quantization step. However, the range of quantized coefficient levels is restricted to those which can reasonably occur, to improve the detectability of errors and minimize decoding complexity.
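The decoder-side AIV rule described above can be sketched as follows; it operates on (run, level) interpretations that hypothetical inter and intra VLC decoders would produce, so the function and data are illustrative rather than part of the H.263 syntax.

def select_aiv_interpretation(inter_pairs, intra_pairs):
    """Sketch of the AIV decoding rule: the code words of a block are first
    interpreted with the inter VLC table; that result is accepted as long as
    the run-level addressing stays within the 64 coefficients of the block,
    and only otherwise are the code words reinterpreted with the intra table.

    inter_pairs / intra_pairs: (run, level) lists produced by the two tables.
    """
    positions_used = sum(run + 1 for run, _ in inter_pairs)
    if positions_used <= 64:
        return inter_pairs        # accept even if some coding mismatch exists
    return intra_pairs            # coefficients outside the block were addressed

# Toy example: the inter interpretation would address position 70 (> 64),
# so the intra interpretation is chosen instead.
inter_try = [(59, 3), (9, -1)]    # consumes 60 + 10 = 70 coefficient positions
intra_try = [(3, 3), (9, -1)]
print(select_aiv_interpretation(inter_try, intra_try) is intra_try)   # True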

19.4.2.11 Supplemental Enhancement Information

Supplemental information may be included in the bitstream in the picture layer to signal enhanced display capabilities or to provide tagging information for external usage. This supplemental enhancement information includes full-picture freeze/freeze-release request, partial-picture freeze/freeze-release request, resizing partial-picture freeze request, full-picture snapshot tag, partial-picture snapshot tag, video time segment start/end tag, progressive refinement segment start/end tag, and chroma key information. The full-picture freeze request is used to indicate that the contents of the entire earlier displayed video picture will be kept and not updated by the contents of the current decoded picture. The picture freeze will be kept under this request until the full-picture freeze-release request occurs in the current or later picture-type information. The partial-picture freeze request indicates that the contents of a specified rectangular area of the earlier displayed video picture are frozen until the release request is received or a timeout occurs. The resizing partial-picture freeze request is used to change the specified rectangular area for the partial picture. One use of this information is to keep the contents of a picture in the corner of the display unchanged for a period, for commercial use or some other purpose. All information given by the tags indicates that the current picture is labeled as either a still-image snapshot or a subsequence of video data for external usage. The progressive refinement segment tag is used to indicate the display period of the pictures with better quality. The chroma keying information is used to request transparent and semitransparent pixels in the decoded


video pictures [chen 1997]. One application of chroma key is to simply describe the shapeinformation of objects in a video sequence.

19.5 H.263++ Video Coding and H.26L

H.263++ is the next version of H.263, which considers adding more optional enhancements to H.263. It is the extension of H.263 version 2 and is currently scheduled to be completed late in the year 2000. H.26L is a project to seek more efficient video coding algorithms that will be much better than the current H.261 and H.263 standards, where the L stands for long term. The algorithms for H.26L can be fundamentally different from the current DCT-with-motion-compensation framework that is used for H.261, H.262 (MPEG-2), and H.263. The expected improvements over the current standards include several aspects: higher coding efficiency, more functionality, low complexity permitting software implementation, and enhanced error robustness. H.26L addresses very low bit rate, real-time, and low end-to-end delay applications. The potential application targets include Internet video phones, sign language or lip-reading communications, video storage and retrieval services, multipoint communication, and other visual communication systems. H.26L is currently scheduled for approval in the year 2002.

19.6 Summary

In this chapter, the video coding standards for low bit rate applications have been introduced. These standards include H.261, H.263, H.263 version 2, and the versions under development, H.263++ as well as H.26L. H.261 and H.263 are extensively used for video conferencing and other multimedia applications at low bit rates. In H.263 version 2, new negotiable coding options are developed for special applications. Among these options, five options, the AIC mode, alternative inter VLC mode, modified quantization mode, DF mode, and improved PB-frame mode, are intended to improve coding efficiency. Three modes, the SS mode, RPS mode, and independent segment decoding mode, are used to meet the needs of mobile video applications. The others provide the functionality of scalability, such as spatial, temporal, and SNR scalability. H.26L is a future standard to meet the requirements of very low bit rate, real-time, low end-to-end delay applications, and other advanced performance.

Exercises

1. What are the enhancements of H.263 over H.261? Describe the applications of each enhanced tool of H.263.

2. Compared with MPEG-1 and MPEG-2, which features of H.261 and H.263 are used to improve coding performance at low bit rates? Explain the reasons.

3. What is the difference between spatial scalability and the RRU mode in H.263 video coding?

4. Conduct a project to compare the results of using the DFs in the coding loop and out of the coding loop. Which method will cause less drift if a large number of pictures are contained between two consecutive I-pictures?


References

[chen 1997] T. Chen, C.T. Swain, and B.G. Haskell, Coding of sub-regions for content-based scalable video, IEEE Transactions on Circuits and Systems for Video Technology, 7, 1, 256–260, February 1997.

[h261] ITU-T Recommendation H.261, Video Codec for Audiovisual Services at p×64 kbit/s, March 1993.

[h263] ITU-T Recommendation H.263, Video Coding for Low Bit Rate Communication, Draft H.263, May 2, 1996.

[h263+] ITU-T Recommendation H.263, Video Coding for Low Bit Rate Communication, Draft H.263, January 27, 1998.

[weigand 1996] T. Weigand, Rate-distortion optimized mode selection for very low bit-rate video coding and the emerging H.263 standard, IEEE Transactions on Circuits and Systems for Video Technology, 6, 2, 182–190, April 1996.


20 A New Video Coding Standard: H.264/AVC

Many video coding standards have been developed during the past two decades. In this chapter, we introduce a recently developed video coding standard, H.264 or MPEG-4 Part 10 Advanced Video Coding (AVC) [h264], which has been developed and standardized collaboratively by the joint video team (JVT) of ISO/IEC MPEG and ITU-T VCEG. The main objective of H.264 is high coding efficiency. The test results have shown that it marks an important milestone among video coding standards in terms of coding efficiency improvement.

20.1 Introduction

Several video coding standards have been introduced in the previous chapters, including MPEG-1/2/4 and H.261 as well as H.263. Recently, the JVT of ISO/IEC MPEG and ITU-T VCEG (Video Coding Expert Group) has developed a new video coding standard, which is referred to formally as ITU-T Recommendation H.264 and ISO/IEC MPEG-4 (Part 10) Advanced Video Coding (referred to in short as H.264/AVC). The work on H.264/AVC actually started in early 1998 when the VCEG issued a call for proposals for a project called H.26L. The target of H.26L was to greatly improve the coding efficiency over any existing video coding standard. The first draft of H.26L was completed in October 1999. The JVT was formed in December 2001, with the mission to finalize the new coding standard based on H.26L. The draft of the new video coding standard was submitted for formal approval as the final committee draft (FCD) in March 2003 and promoted to final draft international standard (FDIS) in June 2003. The current MPEG-4 AVC standard itself is ISO/IEC 14496-10:2004. The FRExt enhancements have also received final approval as ISO/IEC 14496-10:2004/AMD 1.

H.264/AVC mainly targets high coding efficiency. Based on the conventional block-based motion-compensated (MC) hybrid video coding concepts, H.264/AVC provides approximately a 50% bit rate saving for equivalent perceptual quality relative to the performance of earlier standards. This has been shown by extensive simulation results. The superior coding performance of H.264/AVC is obtained because many new features, such as enhanced prediction capability and smaller block size motion compensation, are incorporated. The details of these features are described in the following sections. With high coding efficiency, H.264/AVC can provide technical solutions for many applications, including broadcasting over different media, video storage on optical and magnetic devices, high definition (HD) DVD, and others. It will not be easy to replace a current existing standard such as MPEG-2 with H.264/AVC in some applications, such as digital television. However, it may be used for new application areas, such as HD-DVD, mobile video transmission, and others. On the other hand, to achieve the high coding efficiency, H.264/AVC has to use many new tools or


modified tools from existing standards, which substantially increases the complexity of the codec; the complexity is about four times higher for the decoder and nine times higher for the encoder compared with the MPEG-2 video coding standard. However, with the fast advances of semiconductor technology, silicon solutions can alleviate the problem of high complexity.

20.2 Overview of H.264/AVC Codec Structure

To address the variety of applications and networks, the H.264/AVC codec design consists of two layers: the video coding layer (VCL) and the network abstraction layer (NAL). The layered structure of the H.264/AVC video encoder is shown in Figure 20.1.

As shown in Figure 20.1, the input video source is first compressed in the VCL into a bitstream. The function of the VCL is to efficiently compress the video content. The NAL is a new concept, which is designed for efficient transmission of the compressed bitstream in different network or storage environments, including all current and future protocols and network architectures. These applications include broadcasting over terrestrial, cable, and satellite networks, and streaming over IP networks, wireless, and ISDN channels. In this layer, header information is added to the coded bitstream for handling a variety of transport layers or storage media. The interface of the NAL is designed to enable a seamless integration of the coded video data with all possible protocols and network architectures.

The bitstream can be in one of two formats: the NAL unit (NALU) stream or the byte stream. The NALU stream format consists of a sequence of syntax structures called NALUs. The format of an NALU is shown in Figure 20.2.

In the header of an NALU, the first bit is a 0 bit, and the next 2 bits are used to indicate whether the NALU contains a sequence or picture parameter set or a slice of a reference picture. The next 5 bits are used to indicate the type of the NALU, which corresponds to the type of data being carried in that NALU. In total, 32 types of NALUs are allowed. The 32 types of NALUs can be classified into two categories: VCL NALUs and non-VCL NALUs. The VCL NALUs carry the data corresponding to the VCL, whereas the non-VCL NALUs carry information such as supplemental enhancement information (SEI), sequence and picture parameter sets, access unit delimiters, and others. The details can be found in the specification of H.264/AVC [h264].
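A minimal sketch of splitting this one-byte header into its three fields is shown below; the field names forbidden_zero_bit, nal_ref_idc, and nal_unit_type follow the usual H.264/AVC terminology, and the example value is chosen for illustration only.

def parse_nalu_header(header_byte):
    """Split the one-byte NALU header: a 1-bit forbidden zero bit, a 2-bit
    field (nal_ref_idc) signaling parameter sets or reference-picture slices,
    and a 5-bit NALU type (nal_unit_type), allowing 32 types in total."""
    forbidden_zero_bit = (header_byte >> 7) & 0x01   # must be 0
    nal_ref_idc = (header_byte >> 5) & 0x03
    nal_unit_type = header_byte & 0x1F               # one of 32 NALU types
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type

# Example: 0x67 = 0b01100111 -> (0, 3, 7); type 7 is a sequence parameter set.
print(parse_nalu_header(0x67))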

In the NALU stream, the NALUs are decoded in decoding order. The byte stream can be constructed from the NALU stream by ordering the NALUs in decoding order and adding a start code to each NALU, together with zero or more zero-valued bytes, to form a stream of bytes. The NALU stream can be extracted from the byte stream by removing the start codes, which have a unique start code prefix pattern within this byte stream. The NALU is a syntax structure containing an indication of the type of data to follow

FIGURE 20.1 Layered structure of the H.264/AVC video encoder: the video coding layer produces coded macroblocks and coded slices/partitions with data partitioning and control data, and the network abstraction layer maps them to transports such as H.320, MP4FF, H.323/IP, MPEG-2, etc.


FIGURE 20.2 Network abstraction layer NAL unit (NALU) format: a header byte (1 bit + 2 bits + 5 bits) followed by the VCL or non-VCL data.

and bytes containing that data in the form of a raw byte sequence payload (RBSP) interspersed as necessary with emulation prevention bytes. The emulation prevention byte is a byte equal to 0x03 that may be present within an NALU. The presence of emulation prevention bytes ensures that no sequence of consecutive byte-aligned bytes in the NALU contains a start code prefix, which is a unique sequence of three bytes equal to 0x000001 embedded in the byte stream as a prefix to each NALU. The location of a start code prefix can be used by a decoder to identify the beginning of a new NALU and the end of a previous NALU. Emulation of start code prefixes is prevented within NALUs by the inclusion of emulation prevention bytes. The NALU specifies a generic format for use in both packet-oriented and bitstream systems. The format of NALUs for both packet-oriented transport and bitstream delivery is identical, except that each NALU can be preceded by a start code prefix in a bitstream-oriented transport.
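The following small sketch, based on the description above, locates NALUs in a byte stream via the 0x000001 start code prefix and strips emulation prevention bytes from an NALU; it is illustrative only and omits details such as four-byte start codes and trailing-bit handling.

def split_byte_stream(stream):
    """Split a byte stream into NALUs by locating each 0x000001 start code
    prefix; each NALU runs from the end of its prefix to the next prefix."""
    prefix = b"\x00\x00\x01"
    positions = []
    i = stream.find(prefix)
    while i != -1:
        positions.append(i)
        i = stream.find(prefix, i + 1)
    nalus = []
    for n, pos in enumerate(positions):
        start = pos + len(prefix)
        end = positions[n + 1] if n + 1 < len(positions) else len(stream)
        nalus.append(stream[start:end])
    return nalus

def extract_rbsp(nalu_bytes):
    """Remove emulation prevention bytes: inside an NALU the pattern
    0x00 0x00 0x03 carries an inserted 0x03 that prevents an accidental
    0x000001 start code prefix from appearing in the payload."""
    rbsp = bytearray()
    zeros = 0
    for b in nalu_bytes:
        if zeros >= 2 and b == 0x03:
            zeros = 0                 # discard the emulation prevention byte
            continue
        rbsp.append(b)
        zeros = zeros + 1 if b == 0x00 else 0
    return bytes(rbsp)

print(split_byte_stream(b"\x00\x00\x01\x67\xaa\x00\x00\x01\x68\xbb"))
print(extract_rbsp(b"\x67\x00\x00\x03\x01\xff"))   # -> b'\x67\x00\x00\x01\xff'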

Compared to other existing video coding standards, the basic coding structure of H.264/AVC is similar; it is a structure with motion-compensated (MC) transform coding (TC). The block diagram of the H.264/AVC video encoder is shown in Figure 20.3.

FIGURE 20.3 (See color insert following page 288.) Block diagram of the H.264 encoder (intra/inter prediction with intraframe prediction, motion estimation and motion compensation, 4 × 4 transform, quantization (FQ), entropy coding (VLC/CAVLC/CABAC), rate control and buffer, and a decoder loop with inverse quantization (IQ), 4 × 4 inverse transform, and deblocking filter).


Besides many common tools, H.264/AVC includes many highlighted features that greatly improve the coding efficiency and increase the capability of error robustness and the flexibility for operation over a variety of network environments. Features for improving coding efficiency can be classified into two parts: the first improves the accuracy of prediction for the picture to be encoded, and the second covers the methods of transform and entropy coding. Several tools have been adopted in H.264/AVC to improve inter- and intra-prediction, which are briefly summarized as follows.

Variable block size motion compensation with small block sizes is used: in total, seven block sizes are available for motion compensation in H.264/AVC, among which the smallest block size for luma motion compensation can be as small as 4 × 4.

Quarter-pel accurate motion compensation is adopted in H.264/AVC. Quarter-pel accurate motion compensation has been used in the advanced profile of MPEG-4 Part 2, but H.264/AVC further reduces the complexity of the interpolation process.

Multiple reference pictures for motion compensation and weighted prediction are used to predict the P- and B-pictures. The number of reference pictures can be up to 15 for level 3.0 or lower and four reference pictures for levels higher than 3.0. When multiple reference pictures are used for motion compensation prediction, the contributions of the predictions from different references can be weighted and offset by amounts specified by the encoder. This can greatly improve coding efficiency for scenes that contain fades.

Directional spatial prediction for intracoding is adopted to further improve coding efficiency. In this technique, the intracoded regions are predicted with reference to previously coded areas, which can be selected from different spatial directions. In such a way, the edges of the previously decoded areas of the current picture can be extrapolated into the current intracoded regions.

Skip mode in P-pictures and direct mode in B-pictures are used to alleviate the problem of using too many bits for coding motion vectors in interframe coding. In these modes, the reconstructed signal is obtained directly from the reference frame with motion vectors derived from previously encoded information by exploiting either the spatial (for skip mode) or temporal (for direct mode) correlation of the motion vectors between adjacent macroblocks (MBs) or pictures. In such a way, bit savings for coding motion vectors can be achieved.

The use of loop deblocking filters (DFs) is another feature that reduces block artifacts and improves both objective and subjective video quality. The difference from MPEG-1/2 is that in H.264/AVC the DF is brought within the motion compensation loop, so that it can be used for improving the interframe prediction and therefore improving the coding efficiency.

H.264/AVC uses a small transform block size of 4 × 4 instead of the 8 × 8 used in most video coding standards. The merit of using the small transform block size is that the picture can be encoded in a more locally adaptive fashion, which reduces coding artifacts such as ringing noise. However, using a small transform block size may cause coding performance degradation because the correlations over larger areas may not be exploited for certain pictures. H.264/AVC uses two ways to alleviate this problem: one is to use a hierarchical transform to extend the effective block size of non-active chroma information to an 8 × 8 block, and the other is to allow the encoder to select a special intracoding type, which enables the extension of the length of the luma transform for non-active areas to a 16 × 16 block size. As mentioned earlier, the basis functions of the integer transform used in H.264/AVC do not have equal norm. To solve this problem, the quantization table size has been increased.


Two very powerful entropy coding methods, context-adaptive variable-length coding (CAVLC) and context-adaptive binary arithmetic coding (CABAC), are used in H.264/AVC to further improve the coding performance.

In H.264/AVC, several tools have been adopted for increasing the capability of error robustness.

Flexible slice size allows the encoder to adaptively select the slice size for increasing the capability of error robustness.

Flexible macroblock ordering (FMO) allows partitioning the MBs into slices in a flexible order. Because each slice is an independently decodable unit, the FMO can significantly enhance error robustness by managing the spatial relationship between the MBs in the slice.

There are also several features that are used to increase the flexibility for operation over a variety of network environments.

The parameter set structure provides a more flexible way to protect the key header information and increase the error robustness.

The NALU syntax structure allows for carrying video content in a manner appropriate for each specific network in a customized way.

Arbitrary slice ordering (ASO) is used to improve end-to-end delay in real-time applications, particularly for applications on Internet protocol networks.

Switching P (SP)- and switching I (SI)-slices are new slice types. They are specially encoded slices that allow efficient switching between video bitstreams and efficient random access for video decoders. This feature can be used for efficiently switching a decoder to decode different bitstreams with different bit rates, recovery from errors, and trick modes.

An overview of the H.264/AVC video coding standard can be found in [wiegand 2003] and the detailed specification can be found in [h264]. The technical details of the above tools will be described in the following sections.

20.3 Technical Description of H.264/AVC Coding Tools

In Section 20.2, we briefly described the features of the H.264/AVC video coding standard. In this section, we introduce the technical details of some of those important features.

20.3.1 Instantaneous Decoding Refresh Picture

It is well known that in the previous MPEG video coding standards, the input video sequence is organized into groups of pictures (GOPs). Each GOP consists of three types of frames or pictures: intracoded (I) frames or pictures, predictive-coded (P) frames or pictures, and bidirectionally predictive-coded (B) frames or pictures. As in MPEG-2, we use the word "picture" instead of the word "frame" to provide a more general discussion, because a picture can either be a frame or a field. It should be noted that there is no I-, P-, B-picture concept in H.264/AVC. There are only slice types, which can be I-, P-, or B-slices. Therefore, strictly speaking, there is no such thing as an I-picture in the H.264/AVC video standard, and the term is not used. However, a picture can contain I-, P-, or B-slices in any combination.

To satisfy requirements for some applications, the H.264/AVC video coding standard has specified a new picture type, the instantaneous decoding refresh (IDR) picture. Its exact definition is a coded picture in which all slices are I- or SI-slices, and which causes the decoding process to mark all reference pictures as "unused for reference" immediately after decoding the IDR


picture. This means that after the decoding of an IDR picture, all following coded pictures in decoding order can be decoded without interprediction from any picture decoded before the IDR picture. The first picture of each coded video sequence is an IDR picture.

Based on this definition, the primary difference between an IDR picture of H.264 and an I-picture of MPEG-2 is that for H.264, after sending an IDR picture, the encoder cannot use any pictures that preceded the IDR picture (in decoding order) as references for the interprediction of any pictures that follow the IDR picture (in decoding order). So the presence of an IDR picture in H.264/AVC is roughly similar to the presence of an MPEG-2 GOP header in which the closed_gop flag is set to 1. The closed_gop flag is a 1-bit flag, which indicates the nature of the predictions used in the first consecutive B-pictures (if any) immediately following the first coded I-frame following the group of pictures header. The closed_gop flag is set to "1" to indicate that these B-pictures have been encoded using only backward prediction or intracoding. The presence of an H.264/AVC "I-picture" that is not an IDR picture but contains all I-slices is similar either to an MPEG-2 I-picture without a GOP header or to the presence of an MPEG-2 GOP header in which the closed_gop flag is equal to 0. Also, the presence of an IDR picture in H.264/AVC causes a reset of the PicOrderCount and frame_num counters of the decoding process, while an I-picture that is not an IDR picture does not. Therefore, it should be noted that an IDR picture is a more severe event than an MPEG-2 I-picture, as it prohibits open-GOP behavior, which means that the references for interprediction have to be within a GOP. In an open GOP, the reference pictures from the previous GOP at the current GOP boundary can be exploited. For example, the GOP is open when B-pictures at the start of a GOP rely on I- or P-pictures from the immediately previous GOP.

20.3.2 Switching I-Slices and Switching P-Slices

In addition to the new concept of the IDR picture introduced in Section 20.3.1, H.264 has additional new types of slices: SP- and SI-slices [karczewisz 2003]. The main purpose of SP- and SI-slices is to enable efficient switching between video streams and efficient random access for video decoders. Video streaming is an important application over IP networks and 3G wireless networks. However, due to varying network conditions, the effective bandwidth available to a user may vary accordingly. Therefore, the video server should scale the bit rate of the compressed video streams to accommodate the bandwidth variations. There are several ways to achieve bitstream scaling, such as video transcoding, but the simplest way for real-time applications is to generate several separate pre-encoded bitstreams for the same video sequence with different bit rates, and of course at different quality levels. The server can then dynamically switch from the higher-rate bitstream to the lower-rate bitstream when the network bandwidth drops. This is described in Figure 20.4. In Figure 20.4, we assume that each frame is encoded as a single slice type and predicted from one reference. Also assume that stream A is coded at a higher bit rate and stream B is coded at a lower bit rate. After decoding P-slices A1 and A2 in stream A, the decoder wants to switch to stream B and decode B3, B4, and so on. In this case, it is obvious that B3 has to be coded as an I-slice. If B3 is coded as a P-slice, then the decoder will not have the correct decoded reference pictures required to reconstruct B3, because B3 is predicted from the decoded frame B2, which does not exist in stream A. Therefore, the bitstream switching can be accomplished by inserting an I-slice at regular intervals in the coded sequence to create switching points. However, an I-slice does not exploit any temporal redundancy and it likely requires many more bits to be coded than a P-slice. This would result in a peak in the coded bitstream at each switching point. To address this problem, the SP-slices are proposed to support switching without the increased bit rate penalty of I-slices.


FIGURE 20.4 A decoder is decoding stream A and wants to switch to decoding stream B (P-slices A1–A5 and B1–B5, with an I-slice at the switching point).

The idea behind the SP-slice is as follows. Assume we encode a video sequence with different encoding parameters and generate multiple independent streams with different bit rates. For simplicity, assume we have two streams A and B as in Figure 20.4. We use the same scenario for bitstream switching. After decoding P-slices A1 and A2 in stream A, the decoder wants to switch to stream B and decode B3, B4, and so on. The SP-slices are placed at the switching points as shown in Figure 20.5.

The key point is that the SP-slice AB3 is encoded as a P-slice with B3 as the input picture and reconstructed A2 as the predictive reference. The encoding and decoding procedure of AB3 is shown in Figure 20.6.

It is clear that the SP-slice will not result in a peak in the bitstream, since it is coded using MC prediction as a P-slice, which is more efficient than intracoding. From Figure 20.5, it is shown that the SP-slice AB3 can be decoded using reference frame A2. It should be noted that the decoder output picture B3 is identical whether decoding B2 is followed by B3 or A2 is followed by AB3. If we want to switch the bitstream in the other direction, another SP-slice, BA3, would be required. But this is still more efficient than encoding frames A3 and B3 as I-slices. However, the SI-slice may be used for switching from one sequence to a completely different sequence, where the MC prediction is not efficient due to significant scene changes.

It should be indicated that the SP- and SI-slices are not only used for stream switching,they can also be used for error-resilience video coding. The feature of SP- and SI-slices canbe exploited in the adaptive intra refresh mechanism.

FIGURE 20.5 Switching streams using switching P (SP)-slices (P-slices A1–A5 and B1–B5 with SP-slice AB3 at the switching point).


FIGURE 20.6 (a) Switching P (SP)-slice encoding and (b) switching P (SP)-slice decoding (picture B3 is transformed and quantized against an MC prediction from reconstructed picture A2 to produce SP-slice AB3; the decoder inverts the process to obtain reconstructed picture B3).

20.3.3 Transform and Quantization

The previous video coding standards, such as MPEG-1/2, MPEG-4 Part 2, JPEG, H.261, and H.263, all use the 8 × 8 discrete cosine transform (DCT) as the basic transform. In H.264/AVC, the block size used in TC is 4 × 4. Several questions should be answered about why H.264/AVC chose a 4 × 4 integer transform. The first question concerns the selection of a block size of 4 × 4 instead of the 8 × 8 used in most previous video coding standards. In general, a larger block size can be better for exploiting global correlations to increase the coding efficiency. On the other hand, a smaller block size can be better for exploiting adaptivity according to the local activity in the content, and it is obvious that the complexity of implementation is greatly reduced. In addition, the smaller block size matches more adaptively the variable block size motion compensation used in H.264/AVC, where the smallest block size for motion compensation is 4 × 4.

In H.264 video, three transforms are used for three different purposes: a 4 × 4 Hadamard transform for the 4 × 4 luma DC coefficients in intra MBs predicted in 16 × 16 mode, a 2 × 2 transform for the 2 × 2 chroma DC coefficients in any MB, and a 4 × 4 integer transform for the 4 × 4 blocks of the luma residual data. The matrices of 4 × 4 (luma) and 2 × 2 (chroma) DC coefficients are formed as in Figure 20.7.

As shown in Figure 20.7, for the 16 × 16 intramode there are in total sixteen 4 × 4 blocks. After the 4 × 4 transform, the 16 luma DC coefficients are extracted to form a 4 × 4 block, which is

FIGURE 20.7 Matrix formation for DC coefficients of luma (16 × 16 intramode only) and chroma (Cb and Cr).


coded by a 4 × 4 Hadamard TC. Similarly, the chroma DC coefficients are used to form a 2 × 2 block, which is coded by a 2 × 2 Hadamard TC. The function of these two DC transforms is to remove the spatial redundancy among the sixteen neighboring 4 × 4 blocks. This two-level transform is referred to as a hierarchical transformation, which aims at higher coding efficiency and lower complexity. The two transforms used to code the luma and chroma DC coefficients are as follows:

       | 1  1  1  1 |
  HL = | 1  1 -1 -1 |         HC = | 1  1 |
       | 1 -1 -1  1 |              | 1 -1 |
       | 1 -1  1 -1 |

The most important transform used in H.264/AVC is the 4 × 4 integer transform, which is referred to as the high correlation transform (HCT) [cham 1983; hallapuro 2002]. The 4 × 4 HCT is applied to 4 × 4 predicted residual blocks. The forward transform is represented in matrix format as follows:

         | 1  1  1  1 |
  [Hf] = | 2  1 -1 -2 |
         | 1 -1 -1  1 |
         | 1 -2  2 -1 |

This matrix is an integer approximation of the 4 × 4 DCT. The inverse transform is represented by

         | 1   1    1   1/2 |
  [Hi] = | 1  1/2  -1  -1   |
         | 1 -1/2  -1   1   |
         | 1  -1    1  -1/2 |

It can be seen that the HCT used in H.264/AVC is not orthonormal (its basis vectors do not have equal norm) due to the integer approximation of the DCT, which can be seen from the following:

                | 1  1  1  1 |   | 1   1    1   1/2 |   | 4  0  0  0 |
  [Hf] · [Hi] = | 2  1 -1 -2 | · | 1  1/2  -1  -1   | = | 0  5  0  0 |
                | 1 -1 -1  1 |   | 1 -1/2  -1   1   |   | 0  0  4  0 |
                | 1 -2  2 -1 |   | 1  -1    1  -1/2 |   | 0  0  0  5 |

Therefore, in the inverse transform at the decoder, all scale factors resulting from this operation have to be compensated by the quantization process.

H.264 uses a scalar quantizer. As mentioned earlier, the integer transform is used to avoid division and floating-point arithmetic, so a rescaling function has to be added to the inverse quantization. The detailed procedure can be found in [hallapuro 2002].
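The following small numpy check, a sketch for illustration only, reproduces the product above and shows that the result is diagonal but not the identity, which is why the scale factors 4 and 5 must be folded into the quantization and rescaling steps.

import numpy as np

# Forward and inverse 4x4 integer transform matrices as given in the text.
Hf = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]], dtype=float)

Hi = np.array([[1,  1.0,  1,  0.5],
               [1,  0.5, -1, -1.0],
               [1, -0.5, -1,  1.0],
               [1, -1.0,  1, -0.5]])

# The product is diagonal but not the identity: diag(4, 5, 4, 5).
print(Hf @ Hi)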

20.3.4 Intraframe Coding with Directional Spatial Prediction

In the new video coding standard, H.264/AVC, a new intraframe coding technique based on directional spatial prediction has been adopted. The basic idea of this technique is to predict the MBs to be intracoded from previously coded regions selected from the proper spatial direction in the same frame. The merit of directional spatial prediction is the ability


FIGURE 20.8 Eight predictive directions for intra 4 × 4 prediction in H.264/AVC.

to extrapolate the edges of previously decoded parts of the current picture into the MBs to be coded. This can greatly improve the accuracy of the prediction and improve the coding efficiency. For the 4 × 4 intramode, in addition to DC prediction, there are in total eight prediction directions, as shown in Figure 20.8. For the 16 × 16 intramode, there are four prediction modes: vertical, horizontal, DC, and plane prediction. For the technical details, please refer to [wiegand 2003].
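As an illustration of the idea, the sketch below forms three representative 4 × 4 predictions (vertical, horizontal, and DC) from the reconstructed neighbors above and to the left of the block; it is a simplified example and does not reproduce the exact boundary-availability rules or the remaining directional modes of the standard.

import numpy as np

def intra4x4_predict(above, left, mode):
    """Form a 4x4 intra prediction from neighboring reconstructed pixels:
    'vertical' extrapolates the row of four pixels above the block downward,
    'horizontal' extrapolates the four pixels to the left rightward, and
    'dc' predicts every pixel with the rounded mean of the eight neighbors."""
    above = np.asarray(above, dtype=int)   # pixels above the block
    left = np.asarray(left, dtype=int)     # pixels left of the block
    if mode == "vertical":
        return np.tile(above, (4, 1))
    if mode == "horizontal":
        return np.tile(left.reshape(4, 1), (1, 4))
    if mode == "dc":
        dc = (above.sum() + left.sum() + 4) // 8
        return np.full((4, 4), dc, dtype=int)
    raise ValueError("only three of the modes are sketched here")

print(intra4x4_predict([100, 110, 120, 130], [90, 95, 105, 115], "dc"))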

20.3.5 Adaptive Block Size Motion Compensation

In many video coding standards, an MB consisting of a 16 × 16 block of luma pixels and two corresponding blocks of chroma pixels is used as the basic processing unit of the video decoding process. An MB can be further partitioned for interprediction. The selection of the block size used for interprediction partitions is a compromise between the bit savings provided by using motion compensation with smaller blocks and the increased number of bits needed for coding motion vectors. In MPEG-4 there is an advanced motion compensation mode. In this mode, the interprediction process can be performed with adaptive selection of 16 × 16 or 8 × 8 blocks. The purpose of the adaptive selection of the matching block size is to further enhance coding efficiency. The coding performance may be improved at low bit rates because the bits for coding the prediction difference can be greatly reduced at the limited extra cost of additional motion vectors. Of course, if the cost of coding motion vectors becomes too high, the mode using the small block size will not be selected, so the decision made in the encoder should be very careful. If the 8 × 8 prediction is chosen, four motion vectors, one for each of the four 8 × 8 luminance blocks in an MB, will be transmitted. The motion vectors for coding the two chrominance blocks are then obtained by taking the average of these four motion vectors and dividing the average value by a factor of 2. As each motion vector for the 8 × 8 luminance block has half-pixel accuracy, the motion vector for the chrominance block may have sixteenth-pixel accuracy. The issues of the motion estimation process in the encoder and the selection of whether to use interprediction for each region of the video content are not specified in the standards. The encoding issues are usually described in the informative parts of the standards. In the recently developed MPEG and ITU joint standard, H.264/AVC, the 16 × 16 MB is further partitioned into even smaller blocks as shown in Figure 20.9.

In Figure 20.9, it can be seen that in total eight kinds of blocks can be used for adaptive selection of motion estimation/compensation. With optimal selection of the motion compensation mode in the encoder, coding efficiency can be greatly improved for some sequences. Of course, this is again an encoding issue, and an optimal mode selection algorithm is needed; a simple cost-based sketch of such a selection is shown below.
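The following sketch illustrates one possible (non-normative) encoder-side mode decision over the seven luma partition sizes of Figure 20.9, minimizing a simple Lagrangian cost SAD + lambda * (motion vector rate); the SAD figures, the per-vector bit estimate, and lambda are made-up inputs for illustration.

# Hypothetical encoder-side sketch: pick the macroblock partitioning that
# minimizes cost = SAD + lambda * (bits spent on motion vectors).
PARTITIONS = {            # partition size -> number of motion vectors per MB
    "16x16": 1, "16x8": 2, "8x16": 2, "8x8": 4,
    "8x4": 8, "4x8": 8, "4x4": 16,
}

def choose_partition(sad_per_mode, bits_per_mv, lam):
    """Return the partition with the lowest Lagrangian cost."""
    def cost(mode):
        return sad_per_mode[mode] + lam * bits_per_mv * PARTITIONS[mode]
    return min(sad_per_mode, key=cost)

# Made-up SAD values: smaller partitions predict better but pay more MV bits.
sad = {"16x16": 900, "16x8": 760, "8x16": 780, "8x8": 600,
       "8x4": 560, "4x8": 570, "4x4": 520}
print(choose_partition(sad, bits_per_mv=6, lam=4))   # -> "8x8" for these numbers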


FIGURE 20.9 Macroblock (MB) partitioning in H.264: MB modes 16 × 16, 16 × 8, 8 × 16, and 8 × 8, and 8 × 8 sub-macroblock modes 8 × 8, 8 × 4, 4 × 8, and 4 × 4.

20.3.6 Motion Compensation with Multiple References

As mentioned in Section 20.3.1, in most standards three picture types, I-, P-, and B-pictures, have been defined. Also, usually no more than two reference frames have been used for motion compensation. In the recently developed new standard, H.264/AVC, a proposal for using more than two reference frames has been adopted. A comparison of H.264/AVC with MPEG-2/4 regarding the reference frames is shown in Figure 20.10.

The number of reference frames in H.264 can be up to 15 frames. The major reason for using multiple reference frames is to improve the coding efficiency; it is obvious that a better match may be found by using multiple reference frames than by using at most two frames in the motion estimation. Such an example is shown in Figure 20.11.

In contrast, MPEG-2/MPEG-4 Part 2 motion estimation cannot always obtain the better reference.

20.3.7 Entropy Coding

The H.264/AVC video standard specifies two types of entropy coding: CAVLC and CABAC. The major function of both schemes is to improve the coding performance; of the two schemes, the former has less complexity and the latter is a more complicated algorithm. In the previous video coding standards, such as MPEG-2 and MPEG-4 Part 2, a fixed variable-length coding (VLC) method is used for coding each syntax element or set of syntax elements. The VLCs are designed from the statistical characteristics of each syntax element under the assumption that the statistical characteristics closely match the video data to be coded and that they are stationary. However, this is not true in practice; for example, the statistical behavior of the predictive residues in an MC coder is nonstationary and highly depends on the video content and the accuracy of the prediction model. In the CAVLC of H.264/AVC, a total of 32 different VLCs are used. Most of these VLCs are tables; however, some VLCs enable simple online calculation of any code word without the need to store the code tables.

Further, as CAVLC is simpler than CABAC, it is the baseline entropy coding for H.264/AVC. In the CAVLC scheme, inter-symbol redundancies are exploited by switching

FIGURE 20.10 (See color insert following page 288.) Comparison of reference frames between MPEG-2/4 and H.264 (I-, P-, and B-picture sequences).


FIGURE 20.11 (See color insert following page 288.) An example to explain the benefit of using multiple reference frames; note that a better reference can be obtained by using multiple reference pictures for video sequences with periodic changes.

VLC tables for different syntax components depending on the history of transmitted coding symbols. The basic coding tool in CAVLC is the Exp-Golomb code (Exponential Golomb code). The Exp-Golomb codes are VLCs which consist of a prefix part (1, 01, 001, ...) and a suffix part that is a set of bits (x0, x1x0, x2x1x0, ...), where xi is a binary bit. The code word structure is represented in Tables 20.1 through 20.3.

It can be seen from Table 20.1 that the structure of a code word can be represented as

[M 0s][1][INFO],

where INFO is an M-bit suffix part carrying information. Each Exp-Golomb code word can be constructed from its index code_num as follows:

M = floor( log2(code_num + 1) )

INFO = code_num + 1 − 2^M

There are three ways of mapping in the Exp-Golomb coder. The first is the unsigned directmapping, ue(v), which is used for coding MB type, reference frame index and others. In this

TABLE 20.1

Code Word Structure with Prefix and Suffix

Code Word                      Range
1                              0
01x0                           1–2
001x1x0                        3–6
0001x2x1x0                     7–14
00001x3x2x1x0                  15–30
000001x4x3x2x1x0               31–62


TABLE 20.2

Exp-Golomb Code Words

Code Word     Code_num
1             0
010           1
011           2
00100         3
00101         4
00110         5

mapping, code_num = v. The second is the signed mapping, se(v), which is used for the motion vector difference, delta quantizer parameter (QP), and others. The mapping is described in Table 20.3; the relation between the syntax element value (v) and code_num is

code_num = 2|v|, for v < 0;

code_num = 2|v| − 1, for v > 0.

The third mapping, me(v), is the mapped symbol mapping. In this mapping, the parameter v is mapped to code_num according to a table specified in the standard [h264]. The basic principle of these mappings is to produce code words according to the statistics, i.e., shorter code words are used to encode the components with higher probability and longer code words are used to code the components with smaller probability.
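A small sketch of the ue(v) and se(v) mappings described above is given below; the outputs can be checked against Tables 20.2 and 20.3 (the function names are illustrative).

def exp_golomb_ue(code_num):
    """Unsigned Exp-Golomb code word: M zeros, a 1, then the M-bit INFO field,
    with M = floor(log2(code_num + 1)) and INFO = code_num + 1 - 2**M."""
    m = (code_num + 1).bit_length() - 1
    info = code_num + 1 - (1 << m)
    suffix = format(info, "b").zfill(m) if m > 0 else ""
    return "0" * m + "1" + suffix

def exp_golomb_se(v):
    """Signed mapping se(v): code_num = 2|v| for v <= 0 and 2|v| - 1 for v > 0."""
    code_num = 2 * abs(v) - 1 if v > 0 else 2 * abs(v)
    return exp_golomb_ue(code_num)

# ue: 0 -> '1', 1 -> '010', 2 -> '011', 3 -> '00100' (cf. Table 20.2)
print([exp_golomb_ue(n) for n in range(4)])
# se: +1 -> '010', -1 -> '011', +2 -> '00100' (cf. Table 20.3)
print(exp_golomb_se(1), exp_golomb_se(-1), exp_golomb_se(2))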

CAVLC is the method used to encode the predictive residual, i.e., the zigzag-ordered 4 × 4 (and 2 × 2) blocks of transform coefficients (TCOEFF). In CAVLC, there is no end of block (EOB) code such as in MPEG-2. For a given 4 × 4 block after prediction, transformation, and quantization, the statistical distribution shows that only a few coefficients have significant values and many coefficients have magnitude equal to 1. An example of a typical block is shown below:

0 3 −1 0

0 −1 1 0

1 0 0 0

0 0 0 0

TABLE 20.3

Mapping for Signed Exp-Golomb Code Words

Code_num     Syntax Element Value (v)
0            0
1            1
2            −1
3            2
4            −2
5            3


After zigzag reordering, the coefficients can be given as

0, 3, 0, 1, −1, −1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0.

In CAVLC, the number of nonzero quantized coefficients (denoted TotalCoeffs), and the actual values and positions of the coefficients, are encoded separately. The coefficients with magnitude equal to 1 at the end of the scan are called trailing 1's (T1s). For this example, TotalCoeffs = 5 and T1s = 3.

At the first step of encoding, a coeff_token is used to code the number of coefficients and trailing 1's. There are four look-up tables used for encoding coeff_token, described as Num-VLC0, Num-VLC1, Num-VLC2, and Num-FLC (three VLC tables and an FLC). To use the correlation between neighboring blocks (context-based adaptation), the choice of table depends on the numbers of nonzero coefficients in the upper and left-hand previously coded blocks, NU and NL. The parameter N is defined as follows: if blocks U and L are both available (i.e., in the same coded slice), N = (NU + NL)/2; if only block U is available, N = NU; if only block L is available, N = NL; if neither is available, N = 0. After N is decided, the look-up table for coding coeff_token can be decided:

If N = 0 or 1, Num-VLC0 is selected; if N = 2 or 3, Num-VLC1 is selected; if N = 4, 5, 6, or 7, Num-VLC2 is selected; finally, if N ≥ 8, Num-FLC is selected.

When you check the tables in the standard, you can find that Num-VLC0 is used for small numbers of coefficients; short codes are assigned to low values of TotalCoeffs (0 and 1) and long codes are used for large values of TotalCoeffs. Num-VLC1 is used for medium numbers of coefficients (TotalCoeffs values around 2–4 are assigned relatively short codes), Num-VLC2 is used for higher numbers of coefficients, and the FLC assigns a fixed 6-bit code to every value of TotalCoeffs.

The second step is to encode the sign of each T1. At this step, 1 bit is used to code the sign of each T1, in order starting from the highest frequency.

The third step is to encode the levels of the remaining nonzero coefficients. The choice of VLC table to encode each level adapts depending on the value of each successively coded level, and the encoding is in reverse order, i.e., starting from the highest frequency toward the DC coefficient. For this step, there are seven VLC tables to choose from, from Level_VLC0 for coding low level values, to Level_VLC1 for encoding slightly higher values, and so on. The way the look-up table is selected depends on threshold values; for example, Level_VLC0 is initially used unless there are more than 10 nonzero coefficients and fewer than three trailing ones, in which case the coding starts with Level_VLC1. Then the highest-frequency nonzero coefficient is encoded. If the value of this coefficient is larger than a predefined threshold, the coder moves up to the next VLC table. In this way, the choice of level table is matched to the values of the recently encoded coefficients. The thresholds are listed in Table 20.4; the first threshold is zero, which means that the table is always incremented after the first coefficient level has been encoded.

TABLE 20.4

Thresholds for Determining Whether to Increment Level Table Number

Current Variable-Length Coding (VLC) Table     Threshold to Increment Table
VLC0                                           0
VLC1                                           3
VLC2                                           6
VLC3                                           12
VLC4                                           24
VLC5                                           48
VLC6                                           N/A (highest table)


The fourth step is to encode the total number of zeros before the last coefficient. TotalZeros is the sum of all zeros preceding the highest nonzero coefficient in the zigzag or alternative reordered array. The reason a separate VLC table is used to encode TotalZeros is that many blocks contain a number of nonzero coefficients at the start of the array, and this approach means that zero-runs at the start of the array need not be encoded.

The fifth step is to encode each run of zeros, run_before. The run_before is the number of consecutive zero-valued quantized TCOEFF preceding each nonzero coefficient, processed in reverse scan order starting from the last nonzero-valued coefficient. For each block, run_before thus specifies the zero-runs before the last nonzero coefficient.

Now we use the above example to explain how to perform the encoding.

. Reordered block: 0, 3, 0, 1, −1, −1, 0, 1, 0, . . .

. TotalCoeffs = 5

. TotalZeros = 3

. T1s = 3 (in fact there are four trailing ones, but only three can be encoded as a special case)

The encoding procedure is described in Table 20.5. The final result of the transmitted bitstream for this block is 000010001110010111101101. For typical test conditions and test sequences, CAVLC can obtain a 2%–7% saving in bit rate compared with a conventional VLC scheme based on a single Exp-Golomb code.
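To make the worked example above reproducible, the following sketch derives the CAVLC symbols (TotalCoeffs, T1s, TotalZeros, and the run_before values) from the zigzag-ordered coefficients; it covers only the symbol extraction, not the table-based assignment of code words, and the function names are illustrative.

def cavlc_symbols(zigzag_coeffs):
    """Derive TotalCoeffs, T1s, TotalZeros and run_before for one block."""
    coeffs = list(zigzag_coeffs)
    while coeffs and coeffs[-1] == 0:      # trim zeros after the last nonzero
        coeffs.pop()
    nonzero = [c for c in coeffs if c != 0]
    total_coeffs = len(nonzero)
    t1s = 0                                 # trailing +/-1's, at most three
    for c in reversed(nonzero):
        if abs(c) == 1 and t1s < 3:
            t1s += 1
        else:
            break
    total_zeros = len(coeffs) - total_coeffs
    # run_before values in reverse scan order; the last one needs no code.
    runs, run, seen = [], 0, False
    for c in reversed(coeffs):
        if c != 0:
            if seen:
                runs.append(run)
            seen, run = True, 0
        else:
            run += 1
    if seen:
        runs.append(run)
    return total_coeffs, t1s, total_zeros, runs

# The block of the example: (5, 3, 3, [1, 0, 0, 1, 1]), matching Table 20.5.
print(cavlc_symbols([0, 3, 0, 1, -1, -1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]))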

CAVLC is a simple and efficient entropy coding method. However, it cannot provide adaptation to the actual conditional symbol statistics, which limits its performance. Furthermore, symbols with probabilities higher than 0.5 cannot be efficiently coded with CAVLC, as such symbols would have to be coded with a fractional number of bits, whereas a VLC is limited to an accuracy of 1 bit/symbol. As mentioned earlier, the other entropy coding scheme adopted in H.264/AVC is CABAC. CABAC achieves better performance than CAVLC, with a 10%–15% average bit rate saving, at the cost of increased complexity. The main reasons for the better performance include the following factors. The first is that CABAC selects a probability model for each

TABLE 20.5

Encoding Procedure of Context-Adaptive Variable-Length Coding (CAVLC)

Element            Value                               Code
coeff_token        TotalCoeffs = 5, T1s = 3            0000100
T1 sign (4)        +                                   0
T1 sign (3)        −                                   1
T1 sign (2)        −                                   1
Level (1)          +1 (use Level_VLC0)                 1
Level (0)          +3 (use Level_VLC1)                 0010
TotalZeros         3                                   111
run_before(4)      ZerosLeft = 3; run_before = 1       10
run_before(3)      ZerosLeft = 2; run_before = 0       1
run_before(2)      ZerosLeft = 2; run_before = 0       1
run_before(1)      ZerosLeft = 2; run_before = 1       01
run_before(0)      ZerosLeft = 1; run_before = 1       No code required; last coefficient


syntax element according to the element's context. The second is that CABAC adapts the probability estimation based on local statistics. Finally, CABAC uses arithmetic coding, which can reach fractional bit accuracy.

The CABAC encoder consists of four steps: binarization, context modeling, arithmetic coding, and probability updating. In the binarization step, the nonbinary-valued syntax elements, such as TCOEFF and motion vectors, are uniquely mapped to a binary string, the so-called bin string; binary-valued syntax elements bypass this step. The reason the binarization step is needed is to reduce the alphabet size of the syntax elements, which results in fast and accurate estimation of conditional probabilities and subsequently minimizes the computational complexity involved in performing each elementary operation of probability estimation and the subsequent arithmetic coding. In this step, four basic binarization schemes and their derivatives are used: the unary, truncated unary (TU), kth-order Exp-Golomb (EGk), and fixed-length (FL) binarization schemes. From these four basic binarization schemes, three more binarization schemes are derived by concatenation. The first is a concatenation of a 4-bit FL prefix, as a representation of the luminance-related part of the coded block pattern (CBP), and a TU suffix with S = 2, as a representation of the chrominance part of the CBP. The second and third concatenation schemes are derived from the TU and EGk binarizations. The details of these schemes can be found in [marpe 2003].
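A minimal sketch of three of the basic binarization schemes named above (unary, truncated unary, and fixed-length) is shown below; EGk and the concatenated schemes are omitted, and the exact bit conventions are simplified for illustration.

def unary(v):
    """Unary binarization: v ones followed by a terminating zero."""
    return "1" * v + "0"

def truncated_unary(v, s):
    """Truncated unary (TU) with cut-off S: as unary, except the terminating
    zero is dropped for the largest value v == S."""
    return "1" * v if v == s else "1" * v + "0"

def fixed_length(v, bits):
    """Fixed-length (FL) binarization of v with the given number of bits."""
    return format(v, "b").zfill(bits)

# Example bin strings that would be fed to the binary arithmetic coder.
print(unary(3))                # 1110
print(truncated_unary(2, 2))   # 11
print(fixed_length(5, 4))      # 0101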

The context model is the conditional probability model for one or more bins of the binarized symbols. At the context modeling step, a context model is assigned to the given symbol from a selection of available models, depending on the statistics of recently coded symbols.

The third step is the arithmetic encoding. At this step, an arithmetic coder is used to encode each bin according to the selected probability model. The encoding is performed by recursive subdivision of an interval to fractional accuracy; in general, the initial interval is the range from 0 to 1.

Finally, the selected context model is updated based on the actually coded value. Moreover, since the statistical model determines the code and its efficiency, it is very important to choose an adequate model that exploits the statistical dependencies of recently coded symbols.

20.3.8 Loop Filter

As in other video coding standards, block artifacts can be introduced in H.264/AVC video coding since its coding scheme is block based. The most significant block artifact in H.264 is caused by the 4 × 4 integer transform in intra- and interframe predictive residue coding followed by quantization. The coarse quantization of the TCOEFF can result in visible discontinuities at the block boundaries. Also, for intercoded blocks, the MC reference may not be perfect, and if there are not enough bits to code the predictive residues, this will cause edge discontinuities in the blocks to be compensated. There are two ways to reduce the block artifacts. The first is post-filtering, which operates on the display buffer outside of the coding loop. Post-filtering is optional for the decoder and is not a normative part of the standard. In H.264/AVC, the DF is a normative part of the standard; it has to be in both the encoder and the decoder. The DF is applied in the coding loop to every decoded MB to reduce blocking artifacts. In the coding loop, the filter operation is applied after the inverse transform, before reconstructing and storing the MB for future predictions in the encoder and before reconstructing and displaying the MB in the decoder. The use of the DF aims to reach two main goals. The first goal is to smooth block edges and improve the appearance of decoded images, particularly


FIGURE 20.12 Boundaries in a macroblock (MB) to be filtered: vertical and horizontal edges of the 16 × 16 luma MB and of the 8 × 8 chroma MB (dark lines represent the block boundaries to be filtered).

at higher compression ratios. The second goal is to reduce the predictive residue for the MC prediction of further frames in the encoder. It should be noted that intraprediction is carried out using unfiltered reconstructed MBs to form the prediction, although intracoded MBs are themselves filtered. Filtering is applied to the vertical and horizontal edges of the 4 × 4 blocks in an MB, as shown in Figure 20.12.

In Figure 20.12, the block boundaries in an MB are filtered in the following orders. First,the vertical boundaries are filtered from left to right, and then the vertical boundaries arefiltered from top to bottom. This is the same for both luma and chroma. Each filteringoperation is applied to a set of pixels at either side of the block boundary; total eight pixelsacross vertical or horizontal boundaries of the block are involved as shown in Figure 20.13.Figure 20.13 shows four pixels on either side of a vertical or horizontal boundary inadjacent blocks p and r (p0, p1, p2, p3 and q0, q1, q2, q3). Depending on the current quantizer,the coding modes of neighboring blocks, and the gradient of image samples across theboundary, several outcomes are possible, ranging from no filtering at all to filtering of alleight pixels.

The decision of whether the filtering operation would be conducted depends on theboundary strength and the gradient of image samples across the boundary. The boundarystrength parameter BS is derived based on MB type, motion vectors, reference picture ID,and MB coding parameters. Bs is defined as follows.

If pixels of p and q are intracoded and they are MB boundary, Bs is assigned to 4, whichmeans that the strongest filtering is needed; if pixels of p and r are intracoded but they arenot the MB boundary, Bs is assigned to 3. If neither pixels of p or q are intracoded but thesepixels contain coded coefficients, then Bs is equal to 2; if pixels of p and q have differentreference pictures or a different number of reference or different motion vector values, Bs isset to 1; finally, neither pixels p or q are intracoded; neither p or q contain codedcoefficients and p and q have same reference picture as well as identical motion vectors,Bs is equal to 0.
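The rules above can be collected into a small decision function. The sketch below assumes hypothetical per-block descriptors (intra flag, coded-coefficient flag, reference index, motion vector); it paraphrases the conditions as stated here, and the normative conditions, including the exact motion vector difference test, are given in [h264].

#include <stdlib.h>

/* Hypothetical descriptor of a 4 x 4 block on one side of an edge. */
struct blk {
    int intra;          /* 1 if the block is intracoded */
    int coded_coeffs;   /* 1 if the block contains coded coefficients */
    int ref_id;         /* reference picture used for prediction */
    int mvx, mvy;       /* motion vector, quarter-pel units */
};

/* Boundary strength for the edge between blocks p and q. */
static int boundary_strength(const struct blk *p, const struct blk *q,
                             int is_mb_boundary)
{
    if (p->intra || q->intra)
        return is_mb_boundary ? 4 : 3;   /* strongest filtering for intra */
    if (p->coded_coeffs || q->coded_coeffs)
        return 2;
    if (p->ref_id != q->ref_id ||
        abs(p->mvx - q->mvx) >= 4 ||     /* differ by a full sample or more */
        abs(p->mvy - q->mvy) >= 4)
        return 1;
    return 0;                            /* no filtering for this edge */
}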

FIGURE 20.13 Boundary-adjacent pixels involved in the filtering operation: samples p3, p2, p1, p0 and q0, q1, q2, q3 on either side of a vertical or horizontal boundary.


The value of Bs indicates the strength of the filtering process performed on the block boundary, including a selection among the three filtering modes. From the rules for assigning Bs, it can be seen that the filtering is stronger at places where significant blocking distortion is likely, such as the boundary of an intracoded MB or a boundary between blocks that contain coded coefficients.

In the case Bs = 0, the filtering process is not conducted for the current 4 × 4 block boundary. For Bs > 0, the filtering operation is conducted, and the strength of the filtering depends on the differences between boundary pixels and on the threshold values α and β. Block artifacts are most visible in very smooth areas, where the pixel values do not change much across block boundaries; therefore, the decision to filter is based on the pixel differences across the boundary. The values of α and β are defined in [list 2003]; in general, α and β increase with the average QP of the two neighboring blocks p and q. When Bs > 0 and |p0 − q0| < α, |p1 − p0| < β, and |q1 − q0| < β, the samples (p2, p1, p0, q0, q1, q2) are filtered. The reason for switching the filtering on only when the differences or gradients are small can be described as follows. When QP is small, a difference or gradient across the boundary is likely to be due to image features rather than blocking effects; in this case, the gradient should be preserved, and so the thresholds α and β are low. When QP is larger, blocking distortion is likely to be more significant; α and β are then higher, so filtering is switched on more often.
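The on/off test just described can be written compactly as below; α and β are the QP-dependent thresholds tabulated in [list 2003] and are simply passed in as parameters here (a sketch of the test described in the text, not the normative pseudo-code).

#include <stdlib.h>

/* Per-edge filter decision: Bs and the three sample differences are
 * checked against the QP-dependent thresholds alpha and beta. */
static int filter_this_edge(int bs, int p1, int p0, int q0, int q1,
                            int alpha, int beta)
{
    return bs > 0 &&
           abs(p0 - q0) < alpha &&
           abs(p1 - p0) < beta &&
           abs(q1 - q0) < beta;
}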

The detailed filtering operation is as follows. For 0 < Bs < 4, a 4-tap linear filter is applied with inputs p1, p0, q0, and q1, producing filtered outputs P0 and Q0. In addition, if |p2 − p0| is less than the threshold β, a 4-tap linear filter is applied with inputs p2, p1, p0, and q0, producing filtered output P1. If |q2 − q0| is less than the threshold β, a 4-tap linear filter is applied with inputs q2, q1, q0, and p0, producing filtered output Q1. It should be noted that p1 and q1 are filtered only for luma data, never for chroma.

For Bs = 4, if |p2 − p0| < β and |p0 − q0| < round(α/4), then P0 is produced by 5-tap filtering of p2, p1, p0, q0, and q1; P1 is produced by 4-tap filtering of p2, p1, p0, and q0; and P2 (luma only) is produced by 5-tap filtering of p3, p2, p1, p0, and q0; otherwise P0 is produced by 3-tap filtering of p1, p0, and q1. Similarly, if |q2 − q0| < β and |p0 − q0| < round(α/4), then Q0 is produced by 5-tap filtering of q2, q1, q0, p0, and p1; Q1 is produced by 4-tap filtering of q2, q1, q0, and p0; and Q2 (luma only) is produced by 5-tap filtering of q3, q2, q1, q0, and p0; otherwise Q0 is produced by 3-tap filtering of q1, q0, and p1.
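As a concrete illustration of the Bs = 4 strong filter on the p side, the following sketch uses the commonly cited integer tap weights (a 5-tap filter for P0, a 4-tap filter for P1, a 5-tap filter for P2, and the 3-tap fallback); consult [h264] for the normative equations and the clipping details omitted here.

/* Strong (Bs = 4) deblocking of the p side of one edge, luma case.
 * Inputs are the boundary samples; outputs replace p0, p1, and p2.
 * Weights follow the commonly cited form of the H.264 strong filter. */
static void strong_filter_p_side(int p3, int p2, int p1, int p0,
                                 int q0, int q1,
                                 int strong_condition_holds,
                                 int *P0, int *P1, int *P2)
{
    if (strong_condition_holds) {
        *P0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3;  /* 5 taps */
        *P1 = (p2 + p1 + p0 + q0 + 2) >> 2;                   /* 4 taps */
        *P2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3;      /* 5 taps */
    } else {
        *P0 = (2 * p1 + p0 + q1 + 2) >> 2;                    /* 3-tap fallback */
        *P1 = p1;                                             /* unchanged */
        *P2 = p2;
    }
}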

20.3.9 Error-Resilience Tools

Although coding efficiency is the most important aspect in the design of any video coding scheme, the transmission of compressed video through noisy channels has always been a key consideration. This is evident from the many error-resilience tools available in video coding standards such as MPEG-2 and MPEG-4 Part 2, some of which appear only in H.264/AVC.

The first category of error-resilience tools is localization. These tools remove the spatial and temporal dependency between segments of the video to prevent error propagation. It is well known that video compression efficiency is achieved by exploiting the redundancy in both the spatial and temporal dimensions of the video. Because of the high correlation within and among neighboring frames, predictive coding schemes are employed to exploit this redundancy. Although predictive coding schemes can reach high compression ratios, they are highly susceptible to the propagation of errors. Localization techniques essentially break the predictive coding loop so that if an error does occur, it is not likely to affect other parts of the video. Obviously, a high degree of localization leads to lower compression efficiency. There are two


methods for localizing errors in a coded video: spatial localization and temporal localization. Spatial localization is supported in MPEG-2 and H.264/AVC using slices, and in MPEG-4 using video packets; inserting resynchronization markers provides spatial localization of errors. Temporal localization is usually implemented to prevent error propagation by inserting intracoded MBs, which decreases the temporal dependency in the coded video sequence. Although this is not a tool specific to error resilience, the technique is widely adopted and recognized as being useful for this purpose. A higher percentage of intracoded blocks reduces coding efficiency but also reduces the impact of error propagation on subsequently coded frames. In the most extreme case, all blocks in every frame are coded as intrablocks; there is then no temporal propagation of errors, but a significant increase in bit rate can be expected. The selection of intracoded blocks may be cyclic, in which case the intracoded blocks are selected according to a predetermined pattern, or the intracoded blocks may be chosen randomly or adaptively according to content characteristics. A simple cyclic scheme is sketched below.
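The sketch below shows one hypothetical cyclic refresh pattern: each frame forces a different column of MBs to intra mode, so every MB position is refreshed once every mb_cols frames (all names here are illustrative and not part of any standard).

/* Cyclic intra refresh: force the MB to intra mode when its column
 * matches the column selected for the current frame. */
static int force_intra(int mb_x, int mb_y, int frame_no, int mb_cols)
{
    (void)mb_y;                        /* the pattern is column based */
    return mb_x == (frame_no % mb_cols);
}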

The second category of error-resilience tools is data partitioning. It is well known that not every bit in a compressed video bitstream is of equal importance. Some bits belong to segments carrying vital information, such as picture types, quantization values, etc. When coded video bitstreams are transported over error-prone channels, errors in such segments cause a much longer-lasting and more severe degradation of the decoded video than errors in other segments. Therefore, data partitioning techniques have been developed to group coded bits according to their importance to the decoding, so that different groups can be protected more effectively using unequal protection techniques. For example, during bitstream transmission over a single-channel system, the more important partitions can be better protected with stronger channel codes than the less important partitions. Alternatively, with a multichannel system, the more important partitions can be transmitted over the more reliable channel. This kind of tool is defined in MPEG-2 and MPEG-4 Part 2 video but not in H.264/AVC.

The third category is redundant coding. This category of techniques tries to enhance error resilience by adding redundancy to the coded video. The redundancy may be added explicitly, as with concealment motion vectors, or implicitly in the coding scheme, as in reversible variable-length codes (RVLC) and multiple description (MD) coding.

All these strategies for error resilience indirectly lead to an increase in bit rate and a loss of coding efficiency, with some incurring more overhead than others. In the following, we describe each tool in terms of the benefit it provides for error-resilient transmission, as well as its impact on coding efficiency.

In H.264/AVC, several new error-resilience tools, which differ from those of the previous standards MPEG-2 and MPEG-4 Part 2, have been adopted. These tools include FMO, ASO, and redundant slices. The idea of FMO is to specify a pattern that allocates the MBs of a picture to one or several slice groups not in normal scanning order but in a flexible way, so that spatially consecutive MBs are assigned to different slice groups. Each slice group is transmitted separately. If a slice group is lost, the image pixels in spatially neighboring MBs that belong to other, correctly received slice groups can be used for efficient error concealment. The allowed FMO patterns range from rectangular patterns to regular scattered patterns, such as checkerboards, or completely random scatter patterns; a checkerboard mapping is sketched below. Furthermore, the idea of FMO can be extended to the slice level: in some profiles of the H.264/AVC standard, the slices can be sent in an arbitrary order to increase error resilience. The slices can also be bundled into slice groups, which may contain one or more slices. The exact number of slice groups is specified by a parameter in the picture parameter set.
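As one concrete example of a scattered FMO pattern, the following sketch maps each MB to one of two slice groups in a checkerboard fashion; the mapping function and its name are illustrative only, since the standard expresses such maps through the slice group map types carried in the picture parameter set.

/* Checkerboard macroblock-to-slice-group map with two slice groups:
 * neighboring MBs always fall into different groups, so a lost group
 * can be concealed from the surviving one. */
static int slice_group_of_mb(int mb_x, int mb_y)
{
    return (mb_x + mb_y) & 1;   /* 0 or 1, alternating like a checkerboard */
}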


20.4 Profiles and Levels of H.264/AVC

In this section, we give a brief description of the H.264/AVC profiles. As in MPEG-2 and MPEG-4, a profile defines a set of coding tools or algorithms that are used in generating bitstreams compliant with that profile. If a decoder is claimed to conform to a specific profile, it must support all tools and algorithms in that profile.

20.4.1 Profiles of H.264/AVC

In total, seven profiles have been defined in H.264/AVC so far: Baseline, Main, Extended, High, High 10, High 4:2:2, and High 4:4:4. Work is in progress to replace the High 4:4:4 Profile with a better one, the Advanced High 4:4:4 Profile. In the following, we briefly introduce each of these profiles.

The Baseline Profile supports the following features of H.264/AVC:

. Supports I- and P-slice types, but no B-slice type.

. NALU streams do not contain coded slices of a non-IDR picture.

. Sequence parameter sets contain parameters such that every coded picture of the coded video sequence is a coded frame containing only frame MBs.

. Supports the 4:2:0 chroma format with 8 bit luma and chroma pixels.

. The TCOEFF decoding process and the picture construction process before the DF process shall not use the transform bypass operation.

. Uses only the flat quantization matrix; that is, all entries in the matrix are 16.

. Weighted prediction shall not be applied to P- and SP-slices, and the default weighted prediction specified in [h264] shall be applied to B-slices.

. Entropy coding uses Exp-Golomb codes or CAVLC and does not support CABAC.

. Supports FMO.

. Uses only the 4 × 4 transform, the same quantization matrix specified at the sequence level, and no quantization offset.

. No interlacing support.

. No SP/SI-slices and no slice data partitioning.

Also, several flags and their combinations are used to define the conformance of a bitstream to the Baseline Profile. The details can be found in the specification [h264].

The Main Profile supports the following features:

. Supports I-, P-, and B-slices.

. NALU streams do not contain coded slices of a non-IDR picture.

. Does not support ASO and FMO.

. Supports the 4:2:0 chroma format with 8 bit luma and chroma pixels.

. Supports interlacing.

. All slices of a picture belong to the same slice group.

. No slice data partitioning.

. Uses only the 4 × 4 transform, the same quantization matrix specified at the sequence level, and no quantization offset.


The features of the Extended Profile include

. Supports I-, P-, and B-slices, and interlaced tools.

. Supports the 4:2:0 chroma format with 8 bit luma and chroma pixels.

. Entropy coding uses Exp-Golomb codes or CAVLC and does not support CABAC.

. Supports FMO and ASO.

. Supports SI- and SP-slices and data partitioning.

. Uses only the 4 × 4 transform, the same quantization matrix specified at the sequence level, and no quantization offset.

In summary, the Main Profile supports all features except SP/SI-slices, slice data partitioning, FMO, ASO, and redundant pictures, while the Extended Profile supports all features except CABAC. Therefore, these three profiles are not subsets of one another and target different applications. The Baseline Profile is used for videophone, mobile communication, and low-delay applications. The Main Profile targets interlaced video, broadcast, and packaged media applications. The Extended Profile mainly targets streaming video and wireless transmission applications. Compared with MPEG-2 video, the Main Profile of H.264/AVC can provide a 50%–100% improvement in coding efficiency, but at about 2.5–4 times the decoder complexity.

Several profiles have also been defined in H.264/AVC for professional applications. These profiles include the High Profile, High 10 Profile, High 4:2:2 Profile, and Advanced 4:4:4 Profile.

The High Profile is a superset of the Main Profile. The main difference from the Main Profile is that the High Profile supports both the 8 × 8 and 4 × 4 transforms, which can improve coding performance at high bit rates for high-definition television (HDTV) sequences. The High Profile also has tools that allow the quantization parameter offset to be changed differently for the two chroma components, for better subjective quality.

The High 10 Profile is defined for 10 bit video sequences. The applications of this profile include the encoding of medical image sequences and other high-quality image sequences.

The Advanced 4:4:4 Profile is under development and is intended to replace the current High 4:4:4 Profile with better coding performance and more functionality. The main difference compared with the original High 4:4:4 Profile is that the three color components are encoded with the same tools, which can improve the coding performance for the chroma components.

20.4.2 Levels of H.264/AVC

As discussed in Section 20.4.1, a profile specifies a subset of the entire bitstream syntax of the standard for certain applications. However, within the limitations imposed by a given profile, a very large variation in the processing power and memory size of encoders and decoders may still be required. These variations depend on factors such as the picture size, the frame rate, the maximum bit rate, and the maximum number of reference frames. Therefore, within a profile, levels are defined that specify a set of constraints on these values. Table A.20.1 shows the levels specified in the H.264/AVC standard. A decoder compliant with a specified profile and level must be able to decode bitstreams compliant with that profile and level as well as bitstreams of lower levels. The details of the level definitions can be found in the H.264 specification and are summarized as follows.


TABLE A.20.1
Level Limits

Level    Typical Picture Size                         Typical      Max Video    Vertical MV Component     Max Number of    Max Number of Motion
Number                                                Frame Rate   Bit Rate     Range (Luma Frame         Reference        Vectors per Two
                                                                   (kbits/s)    Samples)                  Frames           Consecutive MBs
1        Quarter-common intermediate format (QCIF)    15           64           [−64, +63.75]             4                —
1b       QCIF                                         30           128          [−64, +63.75]             4                —
1.1      Quarter video graphics array (QVGA)          10 / 30      192          [−128, +127.75]           3 / 9            —
         (320 × 240) / QCIF
1.2      Common intermediate format (CIF)             15           384          [−128, +127.75]           6                —
1.3      CIF                                          30           768          [−128, +127.75]           6                —
2        CIF                                          30           2,000        [−128, +127.75]           6                —
2.1      HHR (352 × 480 / 352 × 576)                  30 / 25      4,000        [−256, +255.75]           7                —
2.2      SD (720 × 480 / 720 × 576)                   30 / 25      4,000        [−256, +255.75]           6                —
3        SD (720 × 480 / 720 × 576) /                 30 / 25 /    10,000       [−256, +255.75]           6                32
         video graphics array (VGA) (640 × 480)       30
3.1      1280 × 720p / super video graphics           30 / 56      14,000       [−512, +511.75]           5                16
         array (SVGA) (800 × 600)
3.2      1280 × 720p / 4VGA (1280 × 960)              60 / 45      20,000       [−512, +511.75]           4                16
4        HD (1280 × 720p / 1920 × 1080i / 2K × 1K)    60 / 30 / 30 20,000       [−512, +511.75]           9 / 4 / 4        16
4.1      High definition (HD) formats                 60 / 30      50,000       [−512, +511.75]           9 / 4            16
         (1280 × 720 / 1920 × 1080)
4.2      1920 × 1080                                  60           50,000       [−512, +511.75]           4                16
5        2K × 1K / 16VGA                              72 / 30      135,000      [−512, +511.75]           14 / 5           16
5.1      2K × 1K / 4K × 2K                            120 / 30     240,000      [−512, +511.75]           16 / 5           16

HHR: half horizontal resolution. SD: standard definition.


20.5 Summary

In this chapter, the new video coding standard, the MPEG-4 Part 10 AVC standard, or H.264, which was jointly developed by the JVT of MPEG and ITU-T VCEG, has been introduced. H.264/AVC is an efficient, state-of-the-art video compression standard whose coding efficiency is about two times better than that of MPEG-2. H.264/AVC has been planned for many applications including HD-DVD, DTV for satellite and wireless networks, IPTV, and many others.

Exercises

1. Indicate at least three new tools that give H.264/AVC better coding performance. Explain why. If possible, conduct computer simulations to verify it.

2. What are the entropy coding schemes used in H.264/AVC? Explain each of them.

3. Describe the principle of the DF of H.264/AVC. Conduct a simulation experiment to compare the subjective quality of decoded images with and without the DF.

4. What are the new tools, different from those of previous MPEG video standards, for increasing the error robustness of the H.264 video coding scheme? Give an explanation.

5. Describe the principle of the integer transform in H.264. Why is the integer transform adopted by H.264 video? What is the problem with the integer transform, and how is this problem solved in H.264?

6. Describe the intraprediction algorithm used in H.264.

References

[cham 1983] W.K. Cham, Family of order-4 four-level orthogonal transforms, Electronics Letters, 19, 21, 869–871, October 1983.

[h264] ITU-T Rec. H.264/ISO/IEC 14496-10, Advanced video coding for generic audiovisual services, February 28, 2005.

[hallapuro 2002] A. Hallapuro, M. Karczewicz, and H. Malvar, Low complexity transform and quantization—Part I: Basic implementation, JVT-B38, Joint Video Team of ISO/IEC MPEG and ITU-T VCEG, January 2002.

[karczewisz 2003] M. Karczewicz and R. Kurceren, The SP- and SI-frames design for H.264/AVC, IEEE Transactions on Circuits and Systems for Video Technology, 13, 7, 637–644, July 2003.

[list 2003] P. List, A. Joch, J. Lainema, G. Bjøntegaard, and M. Karczewicz, Adaptive deblocking filter, IEEE Transactions on Circuits and Systems for Video Technology, 13, 614–619, July 2003.

[marpe 2003] D. Marpe, H. Schwarz, and T. Wiegand, Context-adaptive binary arithmetic coding in the H.264/AVC video compression standard, IEEE Transactions on Circuits and Systems for Video Technology, 13, 620–636, July 2003.

[wiegand 2003] T. Wiegand, G.J. Sullivan, G. Bjontegaard, and A. Luthra, Overview of the H.264/AVC video coding standard, IEEE Transactions on Circuits and Systems for Video Technology, 13, 7, 560–576, July 2003.


21 MPEG System: Video, Audio, and Data Multiplexing

In this chapter, we present the methods and standards for multiplexing and synchronizing the MPEG coded video, audio, and other data into a single bitstream, or into multiple bitstreams, for storage and transmission.

21.1 Introduction

ISO/IEC MPEG has completed work on the ISO/IEC 11172, 13818, and 14496 standards, known as MPEG-1, MPEG-2, and MPEG-4 (Part 2 as well as Part 10), respectively, which deal with the coding of digital audio and video signals. As mentioned in the previous chapters, the MPEG-1, 2, and 4 standards are designed as generic standards and as such are suitable for use in a wide range of audiovisual applications. The coding parts of the standards convert the digital visual, audio, and data signals into compressed formats represented as binary bits. The task of the MPEG system layer is to multiplex and synchronize the coded audio, video, and data into a single bitstream or multiple bitstreams. In other words, the compressed digital video, audio, and data are first represented in binary formats referred to as bitstreams, and the function of the system layer is then to mix the bitstreams from video, audio, and data together. For this purpose, several issues have to be addressed by the system part of the standard:

. Distinguishing different data, such as audio, video, or other data

. Allocating bandwidth during muxing

. Reallocating or decoding the different data during demuxing

. Protecting the bitstreams in error-prone media and detecting the errors

. Dynamically multiplexing several bitstreams

Additional requirements for the system should include extensibility issues such as

. New service extensions should be possible

. Existing decoders should recognize and ignore data they cannot understand

. The syntax should have extension capacity

It should also be noted that all system-timing signals are included in the bitstream. This is a major difference from traditional analog systems, in which the


timing signals are transmitted separately. In this chapter, we introduce the concept of systems and give detailed explanations of existing standards such as MPEG-2. However, we will not go through the standard page by page to explain the syntax; we pay more attention to the core parts of the standard and to the parts that often cause confusion during implementation. One of the key issues is system timing. For MPEG-4, we give a presentation of the current status of the system part of the standard.

21.2 MPEG-2 System

The MPEG-2 system standard is also referred to as ITU-T Rec. H.222.0/ISO/IEC 13818-1 [mpeg2 system]. The ISO document gives a very detailed description of this standard. A simplified overview of this system is shown in Figure 21.1.

The MPEG-2 system coding is specified in two forms: the transport stream and the program stream. Each form is optimized for a different set of applications. The audio and video data are first encoded by an audio and a video encoder, respectively. The coded data are compressed bitstreams that follow the syntax rules specified by the video coding standard 13818-2 and the audio coding standard 13818-3. The compressed audio and video bitstreams are then packetized into packetized elementary streams (PES). The video PES and audio PES are coded by the system coding into the transport stream or the program stream according to the requirements of the application.

The system coding provides a coding syntax that is necessary and sufficient to synchronize the decoding and presentation of the video and audio information; at the same time, it also has to ensure that data buffers in the decoders do not overflow or underflow. Of course, buffer regulation is also considered by the buffer control or rate control mechanism in the encoder. The video, audio, and data information is multiplexed according to the system syntax by inserting time stamps for decoding, presenting, and delivering the coded audio, video, and other data. It should be noted that both the program stream and the transport stream are packet-oriented multiplexes. Before we explain these streams, we first give a set of parameter definitions used in the system documents. Then, we describe the overall picture regarding the basic multiplexing approach for single video and audio elementary streams.

FIGURE 21.1 Simplified overview of the system layer scope: the video and audio encoders produce video data and audio data, which are packetized into video PES and audio PES; the PS Mux and TS Mux then produce program streams and transport streams, respectively, all within the extent of the system specification.


21.2.1 Major Technical Definitions in MPEG-2 System Document

In this section, the technical definitions that are often used in the system document are provided. The major packet- and stream-related definitions are given.

Access unit: a coded representation of a presentation unit. In the case of audio, an access unit is the coded representation of an audio frame. In the case of video, an access unit indicates all the coded data for a picture, and any stuffing that follows it, up to but excluding the start of the next access unit. In other words, the access unit begins with the first byte of the first start code. At the end of a sequence, all bytes between the last byte of the coded picture and the sequence end code belong to the access unit.

DSM-CC: digital storage media command and control.

Elementary stream (ES): a generic term for one of the coded video, coded audio, or other coded bitstreams in PES packets. One ES is carried in a sequence of PES packets with one and only one stream identification. This implies that one ES can only carry the same type of data, such as audio or video.

Packet: a packet consists of a header followed by a number of contiguous bytes from an elementary data stream.

Packet identification (PID): a unique integer value used to associate the ESs of a program in a single or multiprogram transport stream. It is a 13 bit field, which indicates the type of data stored in the packet payload.

PES packet: the data structure used to carry ES data. It contains a PES packet header followed by the PES packet payload.

PES: a PES consists of PES packets, all of whose payloads consist of data from a single ES, and all of which have the same stream identification. Specific semantic constraints apply.

PES packet header: the leading fields in a PES packet, up to and excluding the PES packet data byte fields. Its function is explained in the syntax description section.

System target decoder (STD): a hypothetical reference model of a decoding process used to describe the semantics of the MPEG-2 system-multiplexed bitstream.

Program-specific information (PSI): PSI includes normative data that is used for demultiplexing of programs in the transport stream by decoders. One case of PSI, the nonmandatory network information table, is privately defined.

System header: the leading fields of program stream packets.

Transport stream packet header: the leading fields of transport stream packets.

The following definitions are related to the timing information:

Time stamp: a term that indicates the time of a specific action, such as the arrival of a byte or the presentation of a presentation unit.

System clock reference (SCR): a time stamp in the program stream from which decoder timing is derived.

Elementary stream clock reference (ESCR): a time stamp in the PES from which decoders of PES may derive timing information.

Decoding time stamp (DTS): a time stamp that may be present in a PES packet header and is used to indicate the time when an access unit is decoded in the system target decoder.


Program clock reference (PCR): a time stamp in the transport stream from which decoder timing is derived.

Presentation time stamp (PTS): a time stamp that may be present in a PES packet header and is used to indicate the time at which a presentation unit is presented in the system target decoder.

21.2.2 Transport Streams

The transport stream is a stream definition designed for communicating or storing one or more programs of coded video, audio, and other kinds of data in lossy or noisy environments where significant errors may occur. A transport stream combines one or more programs with one or more time bases into a single stream. There are, however, some difficulties with constructing and delivering a transport stream containing multiple programs with independent time bases such that the overall bit rate is variable. As in other standards, the transport stream may be constructed by any method that results in a valid stream; the standard just specifies the system coding syntax. In this way, all compliant decoders can decode bitstreams generated according to the standard syntax, but the standard does not specify how the encoder generates the bitstreams. It is possible to generate transport streams containing one or more programs from elementary coded data streams, from program streams, or from other transport streams, which may themselves contain one or more programs. An important feature of the transport stream is that it is designed in such a way that the following operations become possible with minimum effort. These operations include several transcoding requirements:

. Retrieve the coded data from one program within the transport stream, decode it, and present the decoded results. In this operation, the transport stream is directly demultiplexed and decoded. The data in the transport stream is constructed in two layers: a system layer and a compression layer. The system decoder decodes the transport stream and demultiplexes it into the compressed video and audio streams, which are further decoded into the video and audio data by the video decoder and the audio decoder, respectively. It should be noted that non-audio/video data is also allowed. The functions of the transport decoder include demultiplexing, depacketization, and others, such as error detection, that will be explained later in detail. This procedure is shown in Figure 21.2.

. Extract the transport stream packets of one program within the transport stream and produce as output a new transport stream that contains only that one program. This operation can be seen as system layer transcoding that converts a

FIGURE 21.2 Example of transport demultiplexing and decoding: a transport stream containing a single program or multiple programs arrives over a channel, passes through a channel-specific decoder to the transport stream demultiplexer and decoder, which drives the clock control and feeds the video decoder and audio decoder to produce the decoded video and audio.


transport stream containing multiple programs into a transport stream containing only a single program. In this case, the re-multiplexing operation may require correction of the PCR values to account for changes in the PCR locations in the bitstream.

. Extract the transport stream packets of one or more programs from one or more transport streams and produce as output a new transport stream. This is another kind of transcoding that converts selected programs of one transport stream into a different one.

. Extract the contents of one program from the transport stream and produce as output a program stream. This is a transcoding that converts the transport program into a program stream for certain applications.

. Convert a program stream into a transport stream that can be used in a lossy communication environment.

To answer the question of how the transport stream is defined so as to make the above transcoding operations simple and efficient, we describe the technical details of the system specification in the following sections.

21.2.2.1 Structure of Transport Streams

As described earlier, the task of the transport stream coding layer is to allow one or more programs to be combined into a single stream. Data from each ES are multiplexed together with timing information, which is used for synchronization and presentation of the ESs during decoding. Therefore, the transport stream consists of access units of one or more programs, such as audio, video, and data ESs. The transport stream has a layered structure. All the bits in the transport stream are packetized into transport packets. The size of a transport packet is chosen to be 188 bytes, of which 4 bytes are used as the transport stream packet header. In the first layer, the header of a transport packet indicates whether the transport packet has an adaptation field. If there is no adaptation field, the transport payload may either consist of only PES packets or consist of both PES packets and PSI packets. Figure 21.3 illustrates the case containing only PES packets.

If the transport stream carries both PES and PSI packets, then the structure of the transport stream is as shown in Figure 21.4.

If the transport stream packet header indicates that the transport stream packet includes an adaptation field, then the construction shown in Figure 21.5 results.

FIGURE 21.3 Structure of a transport stream containing only packetized elementary stream (PES) packets: each 188-byte transport packet consists of a transport header followed by a PES header and video data, a PES header and audio data, or a continuation of video data.


FIGURE 21.4 Structure of a transport stream containing both packetized elementary stream (PES) packets and program specific information (PSI) packets: each 188-byte transport packet carries a transport header followed by a PES header and video data, a PES header and audio data, or a PSI header, PSI data, and CRC.

In Figure 21.5, the appearance of the optional fields depends on the flag settings. The function of the adaptation field will be explained in the syntax section. Before going ahead, though, we give a brief explanation regarding the size of the transport stream packet. Specifically, why is a packet size of 188 bytes chosen? There are several reasons. First, the transport packet size needs to be large enough that the overhead due to the transport headers is not too significant. Second, the size should not be so large that the packet-based error correction code becomes inefficient. Finally, 188 bytes is also compatible with the ATM cell payload: with 47 usable payload bytes per cell (48 bytes minus 1 byte of adaptation-layer overhead), one transport stream packet fits exactly into four ATM cells. So the size of 188 bytes is not a theoretical solution but a practical, compromise solution.

21.2.2.2 Transport Stream Syntax

As we indicated, the transport stream has a layered structure. To explain the transport stream syntax, we start from the transport stream packet header. Because the header part is very important and is the highest layer of the stream, we describe it in more detail. For the rest, we do not repeat the standard document; we just point out the important parts that may cause confusion for readers. The details of the other parts that are not covered here can be found in the MPEG standard document [mpeg2 system].

FIGURE 21.5 Structure of the transport stream packet with an adaptation field: the transport header is followed by the adaptation field and the transport payload. The adaptation field carries the adaptation field length, discontinuity indicator, random-access indicator, ES priority indicator, five flags, an optional field (PCR, OPCR, splice countdown, transport private data length, adaptation field extension length, three flags, and a further optional field), and stuffing bytes.


21.2.2.2.1 Transport Stream Packet Header

This header contains four bytes that are assigned as eight parts:

Syntax                            No. of Bits    Mnemonic
sync_byte                               8         bslbf
transport_error_indicator               1         bslbf
payload_unit_start_indicator            1         bslbf
transport_priority                      1         bslbf
PID                                    13         uimsbf
transport_scrambling_control            2         bslbf
adaptation_field_control                2         bslbf
continuity_counter                      4         uimsbf

The mnemonics in the above table mean: bslbf = bitstream, left bit first; uimsbf = unsigned integer, most significant bit first.
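Before going through the semantics of each field, the following minimal sketch shows one way the four header bytes can be unpacked into these fields; it is a sketch only, not the normative parsing process, and the example packet bytes are made up.

/* Parse the 4-byte MPEG-2 transport stream packet header
 * (188-byte packets, sync byte 0x47). */
#include <stdint.h>
#include <stdio.h>

struct ts_header {
    uint8_t  sync_byte;                     /* always 0x47 */
    uint8_t  transport_error_indicator;
    uint8_t  payload_unit_start_indicator;
    uint8_t  transport_priority;
    uint16_t pid;                           /* 13 bits */
    uint8_t  transport_scrambling_control;  /* 2 bits */
    uint8_t  adaptation_field_control;      /* 2 bits */
    uint8_t  continuity_counter;            /* 4 bits */
};

static int parse_ts_header(const uint8_t b[4], struct ts_header *h)
{
    h->sync_byte                    = b[0];
    h->transport_error_indicator    = (b[1] >> 7) & 0x01;
    h->payload_unit_start_indicator = (b[1] >> 6) & 0x01;
    h->transport_priority           = (b[1] >> 5) & 0x01;
    h->pid                          = (uint16_t)(((b[1] & 0x1F) << 8) | b[2]);
    h->transport_scrambling_control = (b[3] >> 6) & 0x03;
    h->adaptation_field_control     = (b[3] >> 4) & 0x03;
    h->continuity_counter           = b[3] & 0x0F;
    return h->sync_byte == 0x47;            /* basic sanity check */
}

int main(void)
{
    const uint8_t pkt[4] = {0x47, 0x41, 0x00, 0x10}; /* PUSI = 1, PID = 0x100 */
    struct ts_header h;
    if (parse_ts_header(pkt, &h))
        printf("PID = 0x%04X, PUSI = %u, CC = %u\n",
               h.pid, h.payload_unit_start_indicator, h.continuity_counter);
    return 0;
}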

21.2.2.2.2 Semantics of the Transport Stream Packet Header

. The sync_byte is a fixed 8 bit field whose value is 0100 0111 (hexadecimal 47, decimal 71).

. The transport_error_indicator is a 1 bit flag; when it is set to 1, it indicates that at least 1 uncorrectable bit error exists in the associated transport stream packet. It is not reset to 0 until the bit values in error have been corrected. This flag is useful for error concealment purposes because it indicates the error location. When an error exists, either resynchronization or another concealment method can be used.

. The payload_unit_start_indicator is a 1 bit flag that is used to indicate whether the transport stream packet carries PES packets or PSI data. If it carries PES packets, then a PES header starts in this transport packet. If it contains PSI data, then a PSI table starts in this transport packet.

. The transport_priority is a 1 bit flag used to indicate that the associated packet is of greater priority than other packets having the same PID that do not have this bit set to 1. The original idea of adding a flag to indicate the priority of packets comes from video coding. The video elementary bitstream contains mostly bits that are converted from discrete cosine transform (DCT) coefficients. The priority indicator can set a partitioning point that divides the data into a more important part and a less important part. The important part includes the header information and low-frequency coefficients, and the less important part includes only the high-frequency coefficients, which have less effect on the decoding and on the quality of the reconstructed pictures.

. The PID is a 13 bit field that provides information for multiplexing and demultiplexing by uniquely identifying the bitstream to which a packet belongs.

. The transport_scrambling_control is a 2 bit field. Here 00 indicates that the packet is not scrambled; the other three values (01, 10, and 11) indicate that the packet is scrambled by a user-defined scrambling method. It should be noted that the transport packet header and the adaptation field (when present) are not scrambled. In other words, only the payload of a transport packet can be scrambled.

. The adaptation_field_control is a 2 bit indicator used to signal whether an adaptation field is present in the transport packet. The value 00 is reserved for future use; 01 indicates no adaptation field; 10 indicates that there is only an adaptation field and no payload; finally, 11 indicates that there is an adaptation field followed by a payload in the transport stream packet.

. The continuity_counter is a 4 bit counter that increases with each transport stream packet having the same PID.

From the header of the transport stream packet we obtain information about the bits that follow. There are two possibilities: if the adaptation field control value is 10 or 11, then the bits following the header form the adaptation field; otherwise the bits are payload. The information contained in the adaptation field is described as follows.

21.2.2.2.3 Adaptation Field

The structure of the adaptation field data is shown in Figure 21.5. The functionality of these fields is basically related to the timing and decoding of the elementary bitstream. Some important fields are explained below:

. Adaptation-field-length is an 8 bit field specifying the number of bytes immediately following it in the adaptation field, including stuffing bytes.

. Discontinuity indicator is a 1 bit flag which, when set to 1, indicates that the discontinuity state is true for the current transport packet; when this flag is set to 0, the discontinuity state is false. The discontinuity indicator is used to indicate two types of discontinuity: system time base discontinuities and continuity counter discontinuities. In the first type, this transport stream packet is a packet of a PID designated as a PCR-PID, and the next PCR represents a sample of a new system time clock (STC) for the associated program. In the second type, the transport stream packet may be of any PID type. If the transport stream packet is not designated as a PCR-PID, the continuity counter may be discontinuous with respect to the previous packet with the same PID, or a system time base discontinuity may occur. For those PIDs that are not designated as PCR-PIDs, the discontinuity indicator may be set to 1 in the next transport stream packet with the same PID, but will not be set to 1 in three consecutive transport stream packets with the same PID.

. Random-access-indicator is a 1 bit flag indicating that the current and subsequent transport stream packets with the same PID contain information to aid random access at this point. Specifically, when this flag is set to 1, the next PES packet in the payload of a transport stream packet with the current PID will contain the first byte of a video sequence header or the first byte of an audio frame.

. ES priority indicator is used for data partitioning applications in the ES. If this flag is set to 1, the payload contains high-priority data such as header information or low-order DCT coefficients of the video data, and the packet should be highly protected.

. PCR-flag and OPCR-flag: if these flags are set to 1, the adaptation field contains PCR data and original PCR data, respectively. These data are coded in two parts.

. Splicing-point-flag: when this flag is set to 1, it indicates that a splice-countdown field is present to specify the occurrence of a splicing point. The splice point is used to smoothly splice two bitstreams into one stream. SMPTE has developed a standard for seamless splicing of two streams [smpte pt20]. We will describe the function of splicing later.


. Transport-private-flag: this flag is used to indicate whether the adaptation field contains private data.

. Adaptation-field-extension-flag: this flag is used to indicate whether the adaptation field contains the extension field that gives more detailed splicing information.

21.2.2.2.4 Packetized Elementary Stream

It is noted that the ES data is carried in PES packets. A PES packet consists of a PES packet header followed by packet data, or payload. The PES packet header begins with a 32 bit start code that also identifies the stream or stream type to which the packet data belongs. The first byte of each PES packet header is located at the first available payload location of a transport stream packet. The PES packet header may also contain decoding time stamps (DTS), PTS, the ES clock reference (ESCR), and other optional fields such as DSM trick mode information. The PES packet data field contains a variable number of contiguous bytes from one ES. Readers can learn this part of the syntax in the same way as described for the transport packet header and adaptation field.

21.2.2.2.5 Program-Specific Information

PSI includes both MPEG-2 system compliant data and private data. In the transport stream, the PSI is classified into four table structures: the program association table, the program map table, the conditional access (CA) table, and the network information table. The network information table is private data, and the other three are MPEG-2 system compliant data. The program association table provides the information on the program number and the PID value of the transport stream packets. The program map table specifies the PID values for the components of one or more programs. The CA table provides the association between one or more CA systems, their entitlement management messages (EMM), and any special parameters associated with them. The EMM are private CA information that specifies the authorization levels or the services of specific decoders. They may be addressed to a single decoder or to groups of decoders. The network information table is optional and its contents are private. Its contents provide physical network parameters, such as FDM frequencies, transponder numbers, etc.

21.2.3 Transport Streams Splicing

The operation of bitstream splicing is switching from one source to another according to the requirements of the application. Splicing is the most common operation performed in TV stations today [hurst 1997]. Examples include inserting commercials into programming and editing, inserting, or replacing a segment in an existing stream, such as inserting local commercials or news into a network feed. The most important problem in bitstream splicing is managing the buffer fullness at the decoder. Usually, the encoded bitstream satisfies the buffer regulation through a buffer control algorithm at the encoder, so that during decoding the bitstream does not cause the decoder buffer to overflow or underflow. A typical example of the buffer fullness trajectory at the decoder is shown in Figure 21.6. However, after bitstream splicing, the buffer regulation is no longer guaranteed, depending on the selection of the splicing point and the bit rate of the new bitstream. It is therefore necessary to have a rule for selecting the splicing point, and it is instructive to check a candidate splice with a simple buffer simulation such as the one sketched below.
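The following toy simulation illustrates the buffer bookkeeping involved: bits arrive at a constant rate, each coded picture is removed instantaneously at its decode time, and underflow or overflow is flagged. All numbers are made up for illustration and do not correspond to any real stream or normative constraint.

/* VBV-style decoder buffer fullness check with a constant-rate input
 * and instantaneous picture removal at each decode instant. */
#include <stdio.h>

int main(void)
{
    const double bitrate = 4e6;           /* bits per second (illustrative) */
    const double frame_period = 1.0 / 30.0;
    const double buffer_size = 1.8e6;     /* decoder buffer size in bits */
    const double picture_bits[] = {6e5, 1e5, 1.2e5, 3e5, 1e5, 1.1e5};
    double fullness = 1.2e6;              /* fullness at the first decode time */
    size_t i;

    for (i = 0; i < sizeof picture_bits / sizeof picture_bits[0]; i++) {
        if (picture_bits[i] > fullness)
            printf("picture %zu: buffer underflow\n", i);
        fullness -= picture_bits[i];          /* picture removed instantly */
        fullness += bitrate * frame_period;   /* arrivals until next decode */
        if (fullness > buffer_size) {
            printf("picture %zu: buffer would overflow, input must pause\n", i);
            fullness = buffer_size;
        }
        printf("after picture %zu: fullness = %.0f bits\n", i, fullness);
    }
    return 0;
}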

The committee on packetized television technology, PT20 of SMPTE (Society of Motion Picture and Television Engineers), has proposed a standard that deals with splice points for MPEG-2 transport streams [smpte pt20]. In this standard, two techniques have been proposed for selecting splicing points: seamless splicing and non-seamless splicing. The seamless splicing approach can provide clean and instant switching of bitstreams, but it requires careful selection of splicing points in the video bitstreams. The


FIGURE 21.6 Typical buffer fullness trajectory at the decoder: buffer fullness versus time, showing the start-up delay, the decoder buffer size, and the decoding of I, B, and P pictures.

non-seamless splicing approach inserts a drain time, a period of time between the end of the old stream and the start of the new stream, to avoid overflow of the decoder buffer. The drain time ensures that the new stream begins with an empty buffer. However, the decoder has to freeze the final presented picture of the old stream and wait for a start-up delay while the new stream initially fills the buffer. The difference between seamless splicing and non-seamless splicing is shown in Figure 21.7.

In the SMPTE-proposed standard [smpte pt20], optional indicator data in the PID streams (all the packets with the same PID within a transport stream) are used to provide

FIGURE 21.7 Difference between seamless splicing and non-seamless splicing: (a) the video buffer verifier (VBV) buffer behavior of seamless splicing, in which the new stream begins at the moment the old stream ends and the last old picture is decoded exactly one frame before the first new picture is decoded; (b) the VBV buffer behavior of non-seamless splicing, in which the new stream begins after a drain time and the new-stream start-up delay must be at least 55 ms.


important information about the splice for applications such as inserting commercial programs. The proposed standard defines a syntax that may be carried in the adaptation field of the transport stream packets. The syntax provides a way to convey two kinds of information. The first is splice point information, which consists of four splicing parameters: drain-time, in-point-flag, ground-id, and picture-param-type. The second is splice point indicators, which provide a method for signaling application-specific information. One such application example is the insertion indicator for commercial advertisements. This indicator includes flags to indicate that the original stream is obtained from the network and that the splice point is the time point where the network feed is going out or coming back in. Other fields give information about whether the insertion is scheduled and how long it is expected to last, as well as an ID code. The details about splicing can be found in the proposed standard [smpte pt20].

Although the standard provides a tool for bitstream splicing, there are still some difficulties in performing bitstream splicing in practice. One problem is that the selection of a splicing point has to take into account that the bitstream contains video encoded with a predictive coding scheme; therefore, the new stream should begin from an anchor picture. Other problems include uneven timing frames and the splicing of bitstreams with different bit rates. In such cases, one needs to be aware of any consequences related to buffer overflow and underflow.

21.2.4 Program Streams

The program stream is defined for the multiplexing of audio, video, and other data into a single stream for communication or storage applications. The essential difference between the program stream and the transport stream is that the transport stream is designed for applications with noisy media, such as terrestrial broadcasting. Because the program stream is designed for applications in relatively error-free environments, such as digital video disk (DVD) and digital storage applications, the overhead in the program stream is less than that in the transport stream.

A program stream contains one or more elementary streams. The data from the ESs are organized in the form of PES packets. The PES packets from different ESs are multiplexed together. The structure of a program stream is shown in Figure 21.8.

A program stream consists of packs. A pack begins with a pack header followed by PES packets. The pack header is used to carry timing and bit rate information. It begins with a 32 bit start code followed by the SCR information, the program muxing rate, and stuffing bits. The SCR indicates the intended arrival time, at the input of the decoder, of the byte that contains the last bit of the SCR base. The program muxing rate is a 22 bit integer that specifies the rate at which bytes arrive at the decoder; the value of this rate may vary from pack to pack. The stuffing bits are inserted by the encoder to meet channel requirements. The pack header may contain a system header, which may be repeated optionally. The system header contains a summary of the system parameters, such as the header length, rate bound, audio bound, video bound, stream id, and other system parameters. The rate bound indicates the maximum rate in any pack of the program stream and may be used to assess whether the decoder is capable of decoding the entire stream. The audio bound and video bound indicate the maximum numbers of audio and video streams in the program stream. There are some other flags that give additional system information. A PES packet consists of a PES packet header followed by packet data; the PES packets have the same structure as in the transport stream.

A special type of PES packet is the program stream map; it is present when the stream id value is 0xBC. The program stream map provides a description of the ESs in the program


FIGURE 21.8 Structure of the program stream: a sequence of packs, each consisting of a pack header (pack start code, '01', SCR, program mux rate, reserved bits, pack stuffing length and pack stuffing bytes, optionally followed by a system header) and a series of PES packets. The system header carries the system header start code, header length, rate bound, audio bound, fixed flag, CSPS flag, audio lock flag, video lock flag, video bound, packet rate restriction flag, reserved bits, and an N-loop of (stream id, P-STD buffer bound scale, P-STD buffer size bound).

stream and their relationship to one another. The data structure of the program stream map is shown in Figure 21.9.

Other special types of PES packets include the program stream directory and the program element descriptors. The major information contained in the program stream directory includes the number of access units, the packet stream id, and the PTS. The program and program element descriptors provide the coding information about the ESs. There are a total of 17 descriptors, including the video descriptor, audio descriptor, and hierarchy descriptor. For details on these descriptors, the reader is referred to the standard document [mpeg2 system].

21.2.5 Timing Model and Synchronization

The principal function of the MPEG system layer is to define the syntax and semantics of the bitstreams that allow the system decoder to perform two operations on multiple ESs: demultiplexing and resynchronization. Therefore, the system encoder has to add timing information to the program streams or transport streams during the process of

FIGURE 21.9 Data structure of program stream map. (Fields: packet start code prefix, map stream id, program stream map length, current next indicator, program stream map version, program stream info length, N-loop descriptors, elementary stream map length, N-loops of stream type, elementary stream id, elementary stream info length, and N-loop descriptors, followed by CRC32.)

multiplexing the coded video, audio, and data ESs into a single stream or multiple streams. System, video, and audio all have a timing model in which the end-to-end delay from the signal input to an encoder to the signal output from a decoder is a constant. The delay is the sum of encoding, encoder buffering, multiplexing, transmission or storage, demultiplexing, decoding buffering, decoding, and presentation delays. The buffering delays could be variable, while the sum of the total delays should be constant.

In the program stream, the timing information for a decoding system is the SCR, whereas in the transport stream, the timing information is given by the PCR. The SCR and PCR are time stamps that are used to encode the timing information of the bitstream itself. The 27 MHz SCR is the kernel time base for the entire system. The PCR is 90 kHz, which is 1/300 of the SCR. In the transport stream, the PCR is encoded with 33 bits and is contained in the adaptation field of the transport stream. The PCR can be extended to the SCR with an additional 9 bits in the adaptation field. For the program stream, the SCR is directly encoded with 42 bits and it is located in the pack header of the program stream. The synchronization among multiple ESs is accomplished with a PTS in the program and transport streams. The PTS is 90 kHz and is represented with a 33 bit number coded in three separate parts contained in the PES packet header. In the case of audio, if a PTS is present, it will refer to the first access unit commencing in the PES packet. An audio access unit starts in a PES packet if the first byte of the audio access unit is present in the PES packet. In the case of video, if a PTS occurs in the PES packet header, it refers to the access unit containing the first picture start code (PSC) that commences in this PES packet. A PSC commences in the PES packet if the first byte of the PSC is present in the PES packet. In an MPEG-2 system, the SCR is specified to satisfy the following conditions:

27 MHz − 810 Hz ≤ SCR ≤ 27 MHz + 810 Hz

Change rate of SCR ≤ 75 × 10⁻³ Hz/s
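As a small worked example of the base-plus-extension arithmetic described above (the 90 kHz base is 1/300 of the 27 MHz clock), the following sketch combines a 33 bit base and a 9 bit extension into a 27 MHz count and converts it to seconds; the function names are ours.

# Sketch of the PCR/SCR base-plus-extension arithmetic described in the text:
# the 33 bit base counts at 90 kHz and the 9 bit extension at 27 MHz,
# so one base tick spans 300 extension ticks.
def clock_value_27mhz(base_90khz: int, extension: int) -> int:
    """Combine a 33 bit 90 kHz base and a 9 bit extension into 27 MHz units."""
    assert 0 <= base_90khz < 2**33 and 0 <= extension < 300
    return base_90khz * 300 + extension

def to_seconds(value_27mhz: int) -> float:
    return value_27mhz / 27_000_000

pcr = clock_value_27mhz(base_90khz=900_000, extension=150)   # 10 s plus a little
print(to_seconds(pcr))   # about 10.0000056 s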

In the encoder, the SCR or PCR is encoded in the bitstream at intervals of up to 100 ms in the transport stream and up to 700 ms in the program stream. As such, they can be used to reconstruct the STC in the decoder with sufficient accuracy for all identified applications. The decoder has its own STC with the same frequency, 90 kHz for the transport stream and 27 MHz for the program stream. In a correctly constructed MPEG-2 system bitstream, each SCR arrives at the decoder precisely at the time indicated by the value of that SCR. If the decoder's clock frequency matches the one in the encoder, the decoding and presentation of video and audio will automatically proceed at the same rate as in the encoder, and the end-to-end delay will be constant. However, the STC in the decoder may not exactly match the one in the encoder because the two oscillators are independent. Therefore, a decoder's system clock frequency may not match the encoder's system clock frequency that is sampled and indicated in the SCR. One method is to use a free-running 27 MHz clock in the decoder; the mismatch between the encoder's STC and the decoder's STC is then handled by skipping or repeating frames. Another method to handle the mismatch is to use the received SCRs (which occur at least once in the intervals of 100 ms for the transport stream and 700 ms for the program stream). In this way, the decoder's STC is a slave to the encoder's STC. This can be implemented with a phase-locked loop (PLL) as shown in Figure 21.10.
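A software analogue of the PLL in Figure 21.10 may help clarify the slaving idea: each received SCR or PCR is compared with the locally counted STC, and the filtered error trims the nominal 27 MHz frequency. This is only an illustrative first-order loop with an arbitrary gain, not the hardware VCXO arrangement described later in the text.

# Illustrative first-order software PLL for STC recovery: the received PCR is
# compared with the locally counted STC and the error, after low-pass-like
# filtering, trims the nominal 27 MHz frequency. Gain value is arbitrary.
class StcRecovery:
    def __init__(self, nominal_hz=27_000_000, gain=0.05):
        self.freq_hz = float(nominal_hz)
        self.gain = gain
        self.stc = None            # local count in 27 MHz units

    def on_pcr(self, pcr_value, elapsed_s):
        if self.stc is None:
            self.stc = pcr_value   # first PCR loads the counter ("Load" in Figure 21.10)
            return
        self.stc += self.freq_hz * elapsed_s           # advance local STC
        error = pcr_value - self.stc                   # jitter/drift indication
        self.freq_hz += self.gain * error / elapsed_s  # filtered frequency correction
        self.stc = pcr_value                           # track the sender

stc = StcRecovery()
stc.on_pcr(pcr_value=0, elapsed_s=0.0)           # loads the counter
stc.on_pcr(pcr_value=2_700_000, elapsed_s=0.1)   # exactly on time: no correction
print(round(stc.freq_hz))                        # 27000000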

The synchronization among multiple ESs can be achieved by adjusting the decoding of streams to a common master time base rather than by adjusting the decoding of one stream to match that of another. The master time base may be one of the many decoder clocks, the clock of the data source, or some external clock. Each program in a transport stream, which may contain multiple programs, may have its own time base. The time bases of different programs within a transport stream may be different.

FIGURE 21.10 System time clock (STC) recovery using phase-locked loop (PLL). (The received SCR or PCR loads a counter; the difference between the received value and the counter output passes through a low-pass filter and gain stage driving a voltage-controlled oscillator, whose 27 MHz system clock frequency increments the counter and serves as the system time clock.)

In digital video systems, the 13.5 MHz sampling rate of the luminance signal and the 6.75 MHz sampling rate of the chrominance signals of CCIR 601 digital video are all synchronized to the 27 MHz system time clock. The National Television Systems Committee (NTSC) or phase alternating line (PAL) TV signals are also phase-locked to the same 27 MHz clock, such that the horizontal and vertical synchronization signals and the color burst clock are all locked to the 27 MHz system time clock.

In TV studio applications, all the TV studio equipment is synchronized to the same time base, a composite horizontal and vertical synchronization signal, to perform seamless video source switching and editing. It should be noted that this time base is definitely not synchronized to the PCRs from the various remote encoder sites. The 27 MHz local decoder STC is locked to the same studio composite horizontal and vertical synchronization signal. The 33 bit video STC counter is initialized by the latest video PTS and then counts on the 90 kHz clock derived from the 27 MHz system clock in the decoder. If the 27 MHz system clock in the decoder is synchronized with the system clock on the transmitting end, the STC counter will always be the same as the incoming PTS numbers. However, there may be some mismatch between the system clocks. As each new PTS arrives, the PTS is compared with the STC counter. If the PTS is larger than the STC plus half of the frame duration, the 27 MHz decoder clock is too slow and the bit buffer may overflow. In this case, the decoder should skip some of the current data and search for the next anchor frame so that decoding can continue. If the PTS is less than the STC minus half of the frame duration, the bit buffer may underflow; decoding will then halt and the current frame will be displayed repeatedly. The audio decoder is also locked to the same 27 MHz system clock, and similar skipping and repeating of audio data is used to handle the mismatch.
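The skip/repeat decision just described can be condensed into a small decision function; the 90 kHz tick values and the example frame duration (3003 ticks, roughly one 29.97 Hz frame period) are illustrative, and the function name is ours.

# Sketch of the PTS-versus-STC comparison used to decide whether to skip or
# repeat frames (values in 90 kHz ticks; 3003 ticks is about one 29.97 Hz frame).
def display_action(pts: int, stc: int, frame_ticks: int = 3003) -> str:
    if pts > stc + frame_ticks // 2:
        return "skip"     # decoder clock too slow; bit buffer may overflow
    if pts < stc - frame_ticks // 2:
        return "repeat"   # decoder clock too fast; bit buffer may underflow
    return "display"

print(display_action(pts=90_000, stc=90_000))   # display
print(display_action(pts=95_000, stc=90_000))   # skip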

In low cost consumer set-top box (STB) applications, a simple free-running 27 MHz decoder system clock with the frame skipping and repeating scheme can provide quite good results. In fact, a frame skip or repeat may happen only once in a 2 or 4 h period with a free-running 27 MHz crystal clock. The STC counter is set by the latest PTS and then counts on the 90 kHz STC derived from the free-running 27 MHz crystal clock. The same skipping or repeating display control as in the TV studio is used.

For a more complex STC solution, a PLL with a VCXO (voltage-controlled crystal oscillator) in the decoder is used to synchronize to the incoming PCR data. The 33 bit decoder PCR counter is initialized by the latest PCR data, and then the 27 MHz system clock is calibrated. If the decoder's system clock is synchronized with the encoder's remote 27 MHz system clock, every incoming PCR value will be the same as the decoder's PCR counter or will differ only by small errors from PCR jitter. The difference between the decoder's PCR counter and the incoming PCR data indicates this frequency jitter or drift. As long as the decoder's

27 MHz system clock is locked to the PCR data, the STC counter will be initialized by the latest PTS and then calibrated using the 90 kHz clock. The same skipping and repeating frame scheme is used again, but the 27 MHz system clock in the decoder is synchronized with the incoming PCR. As long as the decoder's 27 MHz clock is locked to the encoder's 27 MHz clock, there will be no skipping or repeating of frames. However, if the PCR PLL is not working properly, the skipping or repeating of frames will occur more often than when the free-running 27 MHz system clock is used.

Finally, it should be noted that the PTS-DTS flags field is used to indicate whether the PTS alone, or both the PTS and the DTS, are present in the PES packet header. The DTS is a 33 bit number coded in three separate fields in the PES packet header. It is used to indicate the decoding time.
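To illustrate the coding of a 33 bit PTS or DTS in three separate fields, the sketch below splits a value into 3 + 15 + 15 bit pieces and reassembles it; the marker bits and the 4 bit prefix of the actual PES packet header syntax are deliberately left out, so this shows only the split itself.

# Sketch of splitting a 33 bit PTS/DTS into the three pieces (3 + 15 + 15 bits)
# carried in the PES packet header; marker bits and the 4 bit prefix of the
# real syntax are intentionally omitted.
def split_pts(pts: int):
    assert 0 <= pts < 2**33
    return (pts >> 30) & 0x7, (pts >> 15) & 0x7FFF, pts & 0x7FFF

def join_pts(hi3: int, mid15: int, lo15: int) -> int:
    return (hi3 << 30) | (mid15 << 15) | lo15

pts = (1 << 32) + 12345
assert join_pts(*split_pts(pts)) == pts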

21.3 MPEG-4 System

This section describes the specification of the MPEG-4 system, or ISO/IEC 14496-1.

21.3.1 Overview and Architecture

The specification of the MPEG-4 system [mpeg4 system] is used to define the requirements for the communication of interactive audiovisual scenes. An example of such a scene is shown in Figure 21.11 [mpeg4 system]. The overall operation of this system can be summarized as follows. At the encoder, the audio, visual, and other data information is first compressed and supplemented with synchronization timing information. The compressed data with timing information is then passed to a delivery layer that multiplexes these data into one or more coded binary streams for storage or transmission. At the decoder, these streams are first demultiplexed and decompressed. The reconstructed audio and visual objects are then composed according to the scene description and synchronization information. The composed audiovisual objects (AVO) are then presented to the end user. The important feature of the MPEG-4 standard is that the end user may have the option to interact with this presentation because the compression is performed on an object or content basis. The interaction information can be processed locally or transmitted back to the encoder. The scene information is contained in the bitstreams and used in the decoding processes.

The system part of the MPEG-4 standard specifies the overall architecture of a general receiving terminal. Figure 21.12 shows the basic architecture of the receiving terminal. The major elements of this architecture are the delivery layer, the sync layer (SL), and the compression layer. The delivery layer consists of the FlexMux and the TransMux. At the encoder, the coded ESs, which include the coded video, audio, and other data with the synchronization and scene description information, are multiplexed into FlexMux streams. The FlexMux streams are delivered over the network to the TransMux of the delivery layer. The function of the TransMux is not within the scope of the system standard, and it can be any of the existing transport protocols such as the MPEG-2 transport stream, RTP/UDP/IP, AAL5/ATM, and H.223 mux.

Only the interface to the TransMux layer is part of the standard. Usually, this interface is the DMIF application interface (DAI), which is specified not in the system part but in part 6 of the MPEG-4 standard. The DAI specifies the data that need to be exchanged between the SL and the delivery layer. The DAI also defines the interface for the signaling information required for session and channel setup as well as teardown. For some simple applications,

FIGURE 21.11 An example of MPEG-4 audiovisual scene [mpeg4 system]. (The audiovisual objects of the scene include a 2-D background, a sprite, voice, 3-D objects, and an audiovisual presentation. A video compositor projection plane and an audio compositor assemble the objects in a scene coordinate system for a hypothetical viewer, driving the display and speaker; multiplexed downstream and upstream control/data, user events, and user input support interaction.)

the full functionality of the system specification is not required. A simple multiplexing tool, FlexMux, with low delay and low overhead is defined in the system part of MPEG-4. The FlexMux tool is a flexible multiplexer that accommodates the interleaving of SL-packetized streams with varying instantaneous bit rates. A FlexMux packet has a variable length and may contain one or more SL packets. The FlexMux tool also provides identification for the SL packets, indicating which ES each one comes from. FlexMux packets with data from different SL-packetized streams can therefore be arbitrarily interleaved.

The SL specifies the syntax for packetizing the ESs into SL packets. An SL packet contains an SL packet header and an SL packet payload. The SL packet header provides the information for continuity checking in case of data loss and also carries the timing and synchronization information as well as fragmentation and random access information. The SL packet does not contain its own length information; therefore, SL packets must be framed by the FlexMux tool. At the decoder, the SL packets are demultiplexed back into ESs in the SL. At the same time, the timing and synchronization information as well as the fragmentation and random access information are extracted for synchronizing the decoding process and subsequently for the composition of the ESs.
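Because an SL packet carries no length field of its own, the multiplexer has to frame it. The toy length-prefixed wrapper below, with an invented channel index, is only meant to show why such external framing is necessary; it is not the normative FlexMux packet syntax.

# Hypothetical length-prefixed framing to illustrate why SL packets, which
# carry no length field, must be framed by the multiplexer. Not the normative
# FlexMux packet syntax.
import struct

def frame(channel: int, sl_packet: bytes) -> bytes:
    return struct.pack(">BH", channel, len(sl_packet)) + sl_packet

def deframe(data: bytes):
    pos = 0
    while pos < len(data):
        channel, length = struct.unpack_from(">BH", data, pos)
        pos += 3
        yield channel, data[pos:pos + length]
        pos += length

muxed = frame(1, b"video SL packet") + frame(2, b"audio SL packet")
print(list(deframe(muxed)))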

At the compression layer, the encoded ESs are decoded. The decoded information is then used for the reconstruction of the audiovisual information. The reconstruction operation includes composition, rendering, and presentation using the timing and synchronization information.

FIGURE 21.12 The MPEG-4 system terminal architecture. (From the transmission/storage medium, multiplexed streams enter the delivery layer, whose TransMux options include (PES) MPEG-2 TS, (RTP) UDP/IP, AAL2/ATM, H.223/PSTN, and DAB Mux, together with FlexMux demultiplexing. Across the DMIF application interface, SL-packetized streams enter the sync layer, and across the elementary stream interface the elementary streams reach the compression layer, which recovers AV object data, scene description information, and object descriptors for composition and rendering and for display and user interaction; upstream information flows back toward the encoder.)

21.3.2 Systems Decoder Model

The systems decoder model (SDM) is a conceptual model that is used to describe the behavior of decoders complying with MPEG-4 systems. It may be used by the encoder to predict how the decoder or receiving terminal will behave in terms of buffer management and synchronization during the process of decoding, reconstructing, and composing the AVO. The SDM includes a system timing model and a system buffer model. These models specify the interfaces for accessing demultiplexed data streams, the decoding buffers for each ES, the behavior of the ES decoders, the composition memory for decoded data from each decoder, and the output behavior of the composition memory toward the compositor. The SDM is shown in Figure 21.13.

FIGURE 21.13 Block diagram of systems decoder model. (A demultiplexer at the DMIF application interface (DAI) feeds decoding buffers DB1 through DBn; across the elementary stream interface, each buffer feeds decoder 1 through decoder n, whose outputs go to composition memories 1 through n and then to the compositor.)

The timing model defines the mechanisms that allow a decoder or receiving terminal to process time-dependent objects. This model also allows the decoder or receiving terminal to establish mechanisms to maintain synchronization both across and within particular media types as well as with user interaction events. To facilitate these functions at the decoder or receiving terminal, the timing model requires that the transmitted data streams contain implicit or explicit timing information. There are two sets of timing information defined in the MPEG-4 system. One indicates the periodic values of the encoder clock, which are used to convey the encoder's time base to the decoder or receiving terminal, whereas the other is the desired presentation timing for each audiovisual object. For real-time applications, the end-to-end delay from the encoder input to the decoder output is constant. The delay is equal to the sum of the delays due to the encoding process, buffering, and multiplexing at the encoder, the delay due to the delivery layer and demultiplexing, and the decoder buffering and decoding delays at the decoder.

The buffer model is used by the encoder to monitor and control the buffer resources that are needed for decoding each ES at the decoder. The information about the buffer requirements is transmitted to the decoder by descriptors at the beginning of the decoding process. The decoder can then decide whether or not it is capable of handling this particular bitstream. In summary, the buffer model allows the encoder to schedule data transmission and to specify when the bits may be removed from these buffers. The decoder can then choose proper buffers so that the buffers will not overflow or underflow during the decoding process.
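In the spirit of the buffer model, the sketch below checks whether a schedule of arrivals and removals keeps a decoding buffer within its signaled size; the buffer size, the event format, and the function name are invented for the example.

# Illustrative decoding-buffer occupancy check in the spirit of the buffer
# model: bits arrive from the delivery layer and are removed at decoding
# times; the encoder must schedule arrivals so that neither bound is violated.
def check_buffer(events, buffer_size_bits: int) -> bool:
    """events: list of (bits_arriving, bits_removed) per time slot."""
    occupancy = 0
    for arriving, removed in events:
        occupancy += arriving
        if occupancy > buffer_size_bits:
            return False            # overflow
        occupancy -= removed
        if occupancy < 0:
            return False            # underflow
    return True

print(check_buffer([(4000, 3000), (2000, 3000)], buffer_size_bits=5000))  # True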

21.3.3 Scene Description

In multimedia applications, a scene may consist of AVO that include objects of natural video, audio, texture, two-dimensional (2-D) or three-dimensional (3-D) graphics, and synthetic video. As MPEG-4 is the first object-based coding standard, reconstructing or composing a scene from multiple audiovisual objects is quite new. The decoder not only needs the ESs for the individual AVO but also the synchronization timing information and the scene structure. This information is called the scene description, and it specifies the temporal and spatial relationships between the objects or scene structures. The information of the scene description can be defined at the encoder or interactively determined by the end user and is transmitted with the coded objects to the decoder. The scene description only describes the scene structure. The action of assembling these AVO into a scene is called composition.

FIGURE 21.14 Hierarchical graph representation of an audiovisual scene [mpeg4 system]. (The scene node branches into a person with voice and sprite, a 2-D background, furniture consisting of a globe and a desk, and an audiovisual presentation.)

The action of transmitting these objects from a common representation space to a specific presentation device is called rendering.

The MPEG-4 system defines the syntax and semantics of a bitstream that can be used to describe the relationships of the objects in space and time. However, for visual data, the system standard does not specify the composition algorithms; only for audio data is the composition process specified in a normative manner. To allow the operations of authoring, editing, and interaction of visual objects at the decoder, the scene descriptions are coded independently of the audiovisual media. This allows the decoder to modify the scene according to the requirements of the end user. Two kinds of user interaction are provided in the system specification. One is client-side interaction, which involves object manipulations requested in the end user's terminal. The manipulation includes the modification of attributes of scene objects according to specified user actions. The other type of manipulation is server-side interaction, which the standard does not deal with.

The scene description is a hierarchical structure that can be represented as a graph. The example of the audiovisual scene in Figure 21.11 can be represented as in Figure 21.14. The scene description is represented by a parametric approach, the binary format for scenes (BIFS). The description consists of an encoded hierarchical tree of nodes with attributes and other information. In this tree, the leaf nodes correspond to the elementary AVO, whereas the intermediate nodes carry the information for grouping, transformation, and other operations.
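A scene graph of the kind shown in Figure 21.14 can be mirrored by a small tree data structure in which intermediate nodes group their children and leaf nodes stand for elementary AVO; the class below is purely illustrative and is not BIFS syntax, and the node names simply echo the figure.

# Illustrative scene tree in the spirit of Figure 21.14: intermediate nodes
# group and transform, leaf nodes stand for elementary audiovisual objects.
class SceneNode:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def leaves(self):
        if not self.children:
            yield self.name
        for child in self.children:
            yield from child.leaves()

scene = SceneNode("scene", [
    SceneNode("person", [SceneNode("voice"), SceneNode("sprite")]),
    SceneNode("2-D background"),
    SceneNode("furniture", [SceneNode("globe"), SceneNode("desk")]),
    SceneNode("audiovisual presentation"),
])
print(list(scene.leaves()))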

21.3.4 Object Description Framework

The ESs carry data for audio or visual objects as well as for the scene description itself. The purpose of the object description framework is to provide the link between the ESs and the audiovisual scene description. The object description framework consists of a set of descriptors that allow identifying, describing, and appropriately associating ESs to each other and to AVO used in the scene description. Each object descriptor is a collection of one or more ES descriptors that are associated to a single audiovisual object or a scene description. Object descriptors are themselves conveyed in ESs. Each object descriptor is assigned an identifier (object descriptor id), which is unique within a defined name scope. This identifier is used to associate AVO in the scene description with a particular object

descriptor, and thus the ESs related to that particular object. ES descriptors include information about the source of the stream data, in the form of a unique numeric identifier (the ES id) or a URL pointing to a remote source for the stream. ES descriptors also include information about the encoding format, configuration information for the decoding process, and the SL packetization, as well as quality of service requirements for the transmission of the stream and intellectual property identification. Dependencies between streams can also be signaled within the ES descriptors. This functionality may be used, for example, in scalable audio or visual object representations to indicate the logical dependency of an enhancement layer stream on a base layer stream. It can also be used to describe alternative representations for the same content (e.g., the same speech content in various languages).
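The relationship described in the last two paragraphs, an object descriptor bundling the ES descriptors of the streams behind one audiovisual object, can be captured by two plain data classes; apart from the identifiers named in the text (object descriptor id, ES id, URL, dependency), the field names here are illustrative.

# Illustrative data classes for the object description framework: an object
# descriptor groups the ES descriptors of the streams that make up one
# audiovisual object (or the scene description itself).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ESDescriptor:
    es_id: int
    url: Optional[str] = None                # remote source instead of local routing
    decoder_config: bytes = b""              # encoding format / decoder setup info
    depends_on_es_id: Optional[int] = None   # e.g., enhancement layer -> base layer

@dataclass
class ObjectDescriptor:
    object_descriptor_id: int
    es_descriptors: List[ESDescriptor] = field(default_factory=list)

od = ObjectDescriptor(1, [ESDescriptor(es_id=101),
                          ESDescriptor(es_id=102, depends_on_es_id=101)])
print([d.es_id for d in od.es_descriptors])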

The object description framework provides the hooks to implement intellectual property management and protection (IPMP) systems. IPMP descriptors are carried as part of an object descriptor stream. IPMP ESs carry time-variant IPMP information that can be associated with multiple object descriptors. The IPMP system itself is a nonnormative component that provides IPMP functions for the terminal. The IPMP system uses the information carried by the IPMP ESs and descriptors to make protected IS 14496 content available to the terminal. An application may choose not to use an IPMP system, thereby offering no management and protection features.

21.4 Summary

In this chapter, the MPEG system issues have been discussed. Two typical systems, the MPEG-2 system and the MPEG-4 system, have been introduced. The major task of the system layer is to multiplex and demultiplex video, audio, and other data into a single bitstream with synchronization timing information. For MPEG-4 systems, some additional issues are addressed, such as the interface with network applications.

Exercises

1. What are the two major system streams provided by the MPEG-2 system? Describe some application examples for these two streams and explain the reasons.

2. The MPEG-2 system bitstream is a self-contained bitstream that facilitates synchronous playback of video, audio, and related data. What kinds of timing signals are contained in the bitstream to achieve the goal of synchronization?

3. How does the MPEG-2 system deal with different system clocks between the encoder and decoder? Describe what a system may do when the decoder clock is running too slow or too fast.

4. Why is the 27 MHz system clock in MPEG-2 represented in two parts, a 33 bit base plus a 9 bit extension?

5. What is bitstream splicing of a transport stream? Give several application examples of bitstream splicing and indicate the problems that may arise.

6. Describe the differences between the MPEG-2 system and the MPEG-4 system.

References

[hurst 1997] Norm Hurst, Splicing—high definition broadcasting technology, year 1 demonstration, Meeting talk, 1997.

[mpeg2 system] ISO/IEC 13818-1: 1996, Information Technology—Generic Coding of Moving Pictures and Associated Audio Information.

[mpeg4 system] ISO/IEC 14496-1: 1998, Information Technology—Coding of Audio-Visual Objects.

[smpte pt20] Proposed SMPTE Standard for Television—Splice Points for MPEG-2 Streams, PT20.02, April 4, 1997.

Color plates: FIGURE 1.2(a) A picture of boy and girl. FIGURE 1.8(a) The bridge in Vancouver: original [Courtesy of Minhuai Shi]. FIGURE 1.9(a) Christmas in Winorlia: original. FIGURE 15.2 Two-dimensional (2-D) view of xvYCC. FIGURE 15.3 Gamut coverage of sRGB, sYCC, and xvYCC color spaces. FIGURE 15.4 (a) An example of progressive video and (b) an example of interlaced video. FIGURE 20.3 Block diagram of H.264 encoder. FIGURE 20.10 Comparison on reference frames between MPEG-2/4 and H.264. FIGURE 20.11 An example to explain the benefit of using multiple reference frames; note that a better reference can be obtained by using multiple reference pictures for video sequences with periodic changes.