
The Essence of Image and Video Compression
1E8: Introduction to Engineering

Introduction to Image and Video Processing

Dr. Anil C. Kokaram,

Electronic and Electrical Engineering Dept.,

Trinity College, Dublin 2, Ireland,

[email protected]

1 Overview

This handout covers the basics of Image and Video compression as follows

1. What is compression and why is it needed?

2. The simplest possible compression scheme: Run Length Encoding

3. Representing signals by sums of sines and cosines [The Fourier Transform]

4. Transform compression and JPEG

5. Motion estimation and predicting pictures in a sequence

6. Video Compression



2 The need for compression

• Consider a typical television image. It consists of 720 pixels in each row, and there are 576 rows. A 4:2:2 (broadcast standard) video frame (as you would get from your Digital Set Top box, or DVD) represents colour as below.

[Figure: chroma subsampling patterns 4:2:0, 4:1:1 and 4:2:2]

• In one frame there are 720 × 576 + 360 × 576 + 360 × 576 = 829,440 pixels. As each pixel is represented by one byte, that is 829,440 bytes. At 25 frames/sec this means a bandwidth of 829,440 × 25 ≈ 20 MB/sec is required to transmit the VIDEO ALONE! This means about 75 GBytes to store one hour of movie. This is the RAW DATA bandwidth.
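The arithmetic can be checked in a few lines. This is a sketch using the standard CCIR 601 4:2:2 figures (720 × 576 luma plus two 360 × 576 chroma channels, one byte per sample, 25 frames/sec):

```python
# Raw data rate of 4:2:2 standard-definition video, one byte per sample.
LUMA_W, LUMA_H = 720, 576
CHROMA_W = LUMA_W // 2                 # 4:2:2: Cb and Cr halved horizontally

bytes_per_frame = LUMA_W * LUMA_H + 2 * CHROMA_W * LUMA_H
mb_per_sec = bytes_per_frame * 25 / 1e6
gb_per_hour = mb_per_sec * 3600 / 1000

print(bytes_per_frame)        # 829440 bytes per frame
print(round(mb_per_sec, 1))   # 20.7 MB/sec
print(round(gb_per_hour, 1))  # 74.6 GB per hour
```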

• The available bandwidth for a single Digital television channel is at best 6 Mbit/sec. This is about 30 times smaller than the 20 MB/sec needed. DVD can store at most 4 GB; how does one fit 2 hours of movie on a DVD?

• Your digital mobile phone can handle maybe 1 Mbit/sec absolute TOPS. That is 180 times smaller than required for video.

• Imagine you are a film and TV archive (like www.ina.fr or the BBC or RTÉ). You need to keep a record of 24 hours of programming on 100s of channels daily for up to 50 years (in the case of the BBC). Hmm... there is not enough space in a town to stack up the CDs needed to store that!

• So a mechanism is needed to represent images with fewer bytes than the raw data.

1E8 Introduction to Engineering · Anil Kokaram · www.mee.tcd.ie/~sigmedia


3 Towards compression

• I don't really need 720 × 576 pixels for my 1-inch mobile screen, do I? So I can keep only every 4th pixel on every 4th line (subsampling), for instance, and yield a 180 × 144 picture instead. So now I can show the same picture for 1/16 the storage. Not good enough. Besides, 180 × 144 pictures look really crap on a TV set.

Format    Active Resolution    MB/sec
CCIR 601, 30 frames/sec, 4:3 Aspect Ratio, 4:2:2:
QCIF      176 × 120             1.3
CIF       352 × 240             5.1
Full      720 × 480            20.7
CCIR 601, 25 frames/sec, 4:3 Aspect Ratio, 4:2:2:
QCIF      176 × 144             1.3
CIF       352 × 288             5.1
Full      720 × 576            20.7

• What if I start to think about mathematical models for pictures...? Then I can send/store the parameters of my model instead of the actual pictures, and if my model is simple, I can store fewer parameters than pixels and get some compression. Hmmm. But pictures look pretty complicated. In fact most interesting pictures tend to be different from other pictures. Otherwise why look?

• It turns out that you can make some generic statements about images and image sequences.

1. In small, local regions, pixel intensity and colour tends to be the same or at least slowly varying. For small... think 8 × 8 blocks of pixels.

2. You can construct any picture by adding together a weighted set of pre-defined primitive pictures. These primitive pictures are in fact the 2D equivalent of sines and cosines.

3. In a video sequence consecutive pictures tend to look the same except for the moving bits.

• We'll use these ideas now.


4 Run Length Encoding

• Consider that you want to transmit a fax as an image. There are just 2 colours: 0 = black and 1 = white. Let's say your image is as below (the letter H in a binary image).

0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 1 1 0 1 1 0 0
0 1 1 0 1 1 0 0
0 1 1 1 1 1 0 0
0 1 1 1 1 1 0 0
0 1 1 0 1 1 0 0
0 0 0 0 0 0 0 0

• Instead of sending every single pixel, since there tend to be long lengths of consecutive repeated pixels (i.e. long runs), we could send a 0 (for instance) followed by the number of times it is repeated.

• So instead of sending or storing 0 0 0 0 0 0 0 0 for instance, you would store 0 8, the first number being the colour, and the second being the number of times that colour occurred consecutively. Instead of storing 8 bytes, we have stored just 2. We have encoded some raw data of 8 zeros as just 2 bytes. We have achieved a compression factor of 8/2 = 4!

• In typical RLE schemes, you do not account for all possible runs. Instead you only allow for runs of length say 0 to 32, for instance. Then a run of length 64 would need to be encoded as 2 runs of length 32.

• Let's say for our RLE scheme we allow a maximum run length of 8, and the data is either 0 or 1. The image example then can be represented by . . .
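The scheme just described, (value, run-length) pairs with a capped maximum run, can be sketched as follows (the function name and output format are illustrative, not part of any standard):

```python
def rle_encode(bits, max_run=8):
    """Encode a sequence of 0/1 values as (value, run_length) pairs,
    splitting any run longer than max_run into several runs."""
    runs = []
    i = 0
    while i < len(bits):
        j = i
        # Extend the run while the value repeats and the cap is not hit.
        while j < len(bits) and bits[j] == bits[i] and j - i < max_run:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

# First row of the letter-H image: eight zeros become a single run.
print(rle_encode([0] * 8))                    # [(0, 8)]
# Third row: 0 1 1 0 1 1 0 0
print(rle_encode([0, 1, 1, 0, 1, 1, 0, 0]))   # [(0, 1), (1, 2), (0, 1), (1, 2), (0, 2)]
```

Sixteen pixels of the third row's pattern collapse to five pairs, which is where the compression comes from on mostly-flat data.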

• But what about a real/grayscale image? Hmm. RLE might get inefficient if the data is not mostly flat!

10 32 22 12
10 20 20 10
10 30 20 10
 8 31 20 15


5 Signal Transforms

What if it were possible to change the image in some reversible process, so that we created a result that was easier to compress? In other words we take our data and transform it in some clever way to make RLE work better.

This is related to another idea.

Suppose I had a photo album/dictionary of all the possible images in the world ever made in the past and ever will be possible in the future. And suppose I gave you a copy of this dictionary in which each image was assigned a number.

Then instead of having to send you the raw data, I would just send you the number of the image in the dictionary, and you could look it up and you'd have the picture!

This dictionary would be very large since pictures come in many flavours. To make a smaller dictionary, you can instead choose images which when added together make up the picture you want to send or store.

So now to send a picture, the transmitting end has to work out which set of images could be added together to give the picture. Then the transmitter sends the indexes of those images to the receiver. The receiver then looks up the pictures and adds them together to give the received image.

About 200 years ago1, a guy called Fourier spotted that you could actually do this with any signal. He was working on 1D signals but the same applies to 2D ones.

1 No electricity, no computers, no cinema, no television, no hot baths, no baths, no showers. Lice in your hair all the time, no soap, no nylon, no jeans, no flushing toilets, no sewage system ...


5.1 Representing signals with waves

The brilliant discovery of Fourier was that any 1D signal can be represented by a weighted sum of sines and cosines. So to make a triangle wave for instance, all you need to do is to add a bunch of sines and cosines together of different frequencies and different amplitudes.

[Figure: a sawtooth wave built up by summing its first few sinusoidal components (plotted over 0 to 5 seconds), and the resulting amplitude-versus-frequency plot, whose first harmonics have amplitudes 2/π and −1/π.]

And he came up with a mathematical formula that says which frequencies and which amplitudes were needed to synthesise a particular signal.

Since we all know what sines and cosines look like, we can summarise this signal decomposition with a graph of Amplitude versus Frequency. That graph will tell us how much of each frequency


should be added together. This is the Frequency Spectrum of a signal.

Given this graph, Fourier also worked out how to reconstruct the original signal. He discovered a completely reversible transform: The Fourier Transform. It converts or transforms a signal from the time domain into a frequency domain. For audio signals like music, this sorta makes intuitive sense; for images and other signals it's less intuitive but no less useful.

150 years later2 (in the 1960s) people3 worked out how to use this for Digital signals and how it could be automated with computers. Then Fourier's idea really became super-useful.

You see: we can think of the sines and cosines at different frequencies as our dictionary, and the amplitudes as a weight attached to each one. So to transmit some data all you need to do is to work out frequencies and amplitudes and send that instead of the actual raw data. The signals in this special dictionary are called basis functions and the corresponding amplitudes needed are called coefficients.

So it's a bit like saying: instead of sending the sawtooth wave (in the example above), send instead the graph of amplitude versus frequency. That graph is a whole lot smaller, but it contains all the same information.

Think of this. Suppose I have a music signal which is a pure sine wave lasting 10 secs at 50 Hertz that is represented by a digital signal sampled at 44.1 KHz. This means that my data record is 441 K samples long. Say we're using 16-bit audio; that's 441K × 2 bytes. Instead of transmitting all 882 Kbytes: how could I send the same signal with just 3 bytes?
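The point can be seen numerically: the spectrum of a pure sinusoid is zero everywhere except at one frequency, so frequency, amplitude and phase (a handful of bytes) describe the whole record. A sketch using numpy:

```python
import numpy as np

fs = 44100                       # sample rate (Hz)
t = np.arange(fs * 10) / fs      # 10 seconds of samples: 441000 of them
x = np.sin(2 * np.pi * 50 * t)   # the 50 Hz sine wave

X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / fs)

peak = np.argmax(np.abs(X))      # index of the single dominant component
print(freqs[peak])               # ~50.0: the only significant frequency
# Frequency + amplitude + phase are all that is needed to rebuild x.
```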

2 People were sorting out the showers, baths, electricity, lice in the meantime.
3 A guy with the funny name of Tukey.


5.2 Image Transforms

With 2D signals things are a bit trickier. 2D sines and cosines look a bit like a wave in a wave tank, or a wave in your bath, or a wave in the sea. Except the wave is a wave in intensity or brightness. The equation for working out how much of each wave you need to make a picture is also a bit tricky. Furthermore, each wave is represented by a complex number. Urgh?

[Figure: a 2D cosine wave shown as an image and as a surface plot: an intensity wave directed at an angle off horizontal, with a frequency in cycles per pel in that direction and a phase lag.]

Instead, electrical/signal processing engineers have come up with a simpler4 Transform that uses only Cosine waves. This transform, known as the Discrete Cosine Transform, results in only real numbers. It is the basis of JPEG.

4 Not really


5.3 JPEG for First Year Undergraduates

• JPEG is based on Transforming 8 × 8 blocks of pixels using the 2D DCT. For a signal of 8 samples, the 8 possible DCT basis functions (the dictionary) are as below.

[Figure: the eight basis functions of the 8-point DCT, plotted in two panels: rows 1 to 4 and rows 5 to 8.]

• The 64 2D DCT basis functions and the 2D DCT of a block in Lenna are shown below.


• Now we can see that the effect of Transforming a block of pixels is to reduce its overall energy. It's flatter in the DCT space. This means that we have less information to transmit. Here is what happens if we take every 8 × 8 block in Lenna and transform it with the 2D DCT.


Now we’re almost there ...

• You can see that in the Transformed images, there are many coefficients that are almost zero. So why transmit or store them at all? If we wanted to reconstruct the image exactly, we would need all these tiny values, but because we know that the Human Visual System can tolerate defects in pictures, we know that maybe we can throw away the small coefficients and keep the big ones and still have a reasonable looking picture.

• In fact, in JPEG what is done is to quantise the coefficients with varying degrees of accuracy. So the top left hand corner coefficient is quantised with 32 levels say, while the bottom right hand corner is quantised to 2 levels. This is because low frequency information is more important than high frequency for visual perception.

• When you set the Quality setting for JPEG in Adobe Photoshop, you are changing the quantisation levels. For low quality, you throw away more information, i.e. you quantise more coarsely. For high quality you keep more information, so you quantise finely.

• After that step JPEG uses RLE to encode each block of coefficients in a zig-zag scan.
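The zig-zag scan visits the 8 × 8 coefficients roughly from low to high frequency, so the many near-zero high-frequency values fall into long runs that RLE handles well. A sketch (the toy coefficient block and the (zero-run, value) output format are illustrative, not the exact JPEG bitstream syntax):

```python
import numpy as np

def zigzag_order(n=8):
    """(row, col) visiting order of the JPEG-style zig-zag scan."""
    def key(rc):
        r, c = rc
        d = r + c                    # anti-diagonal index
        return (d, r if d % 2 else c)  # odd diagonals run down, even run up
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def zigzag_rle(coeffs):
    """Scan quantised coefficients in zig-zag order and run-length encode
    the zeros as (zero_run, value) pairs, ending with an end-of-block."""
    out, run = [], 0
    for r, c in zigzag_order(coeffs.shape[0]):
        v = int(coeffs[r, c])
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    out.append('EOB')                # EOB stands for all trailing zeros
    return out

# Toy block: a big DC value, two low-frequency terms, zeros everywhere else.
block = np.zeros((8, 8), dtype=int)
block[0, 0], block[0, 1], block[1, 0] = 52, -3, 2
print(zigzag_rle(block))             # [(0, 52), (0, -3), (0, 2), 'EOB']
```

61 of the 64 entries collapse into the single 'EOB' symbol, which is why coarse quantisation followed by zig-zag RLE is so effective.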


• Problems: blocking artefacts and mosquito noise.


6 Video Compression

• All the best codecs for media are based on transforming the data in some way. JPEG2000 is based on a new kind of transform, the Wavelet Transform, discovered only in the late 1980s. Compression of audio (.mp3) is based on the 1D DCT. MPEG (Motion Picture Experts Group) is used for compression of video for DVD or DTV [MPEG1, 2, 4]. Ireland was a major player in establishing the MPEG-4 standard.

Intel Indeo, Apple Quicktime, Divx are all based on MPEGGy ideas.

MPEG is again based on the 8-point DCT, just like JPEG, except ...

• In video most consecutive pictures look the same. So if I knew what one picture looked like, then in theory I could build all the others by slightly adjusting that one. This is called prediction.

• But things move around in video, so we have to estimate that motion to work out how to shift the pixels around in order to create the next image.

6.0.1 On Motion Compensated Prediction

To understand how prediction can help with video compression, consider the top row of figure 2, which shows a sequence of images from the Suzie sequence. It is QCIF (176 × 144) resolution at a frame rate of 30 frames/sec.

We have already seen that Transform coding of images yields significant levels of compression, e.g. JPEG. Therefore a first step at compressing a sequence of data is to consider each picture separately. Consider using the 2D DCT of 8 × 8 blocks. The DCT coefficients for each frame of Suzie are shown in the second row of figure 2. The use of the DCT on the raw image data compresses the original 8 bits/pel data to about 0.8 bits/pel on each frame. Note that, for demonstration purposes, the DCT coefficients have NOT been quantised using the standard JPEG Quantisation matrix.

We know that most images in a sequence are mostly the same as the frames nearby except with different object locations. Thus we can propose that the image sequence obeys a simple predictive model (discussed in previous lectures) as follows:

I_n(x) = I_{n-1}(x + d_n) + e(x)    (1)

where e(x) is some small prediction error that is due to a combination of noise and "model mismatch". Thus we can measure the prediction error at each pixel in a frame as

e(x) = I_n(x) − I_{n-1}(x + d_n)    (2)

This is the motion compensated prediction error, sometimes referred to as the Displaced Frame Difference (DFD). The only model parameter required to be estimated is the motion vector d_n. Assume for the moment that we use some process to estimate these vectors. We will look at that later.

Figure 1 illustrates how motion compensation can be applied to predict any frame from any previous frame using motion estimation. The figure shows block based motion vectors being used to match every



Figure 1: Explaining how motion compensation works.

block in frame n with the block that is most similar in frame n−1. The difference between the corresponding pixels in these blocks according to equation 2 is the prediction error.

In MPEG, the situation shown in figure 1 (where frame n is predicted by a motion compensated version of frame n−1) is called Forward Prediction. The block that is to be constructed, i.e. in frame n, is called the Target Block. The frame that is supplying the prediction is called the Reference Picture, and the resulting data used for the motion compensation (i.e. the displaced block in frame n−1) is the Prediction Block.

6.0.2 Image prediction

The fourth row of Figure 2 shows the prediction error of each frame of the Suzie sequence starting from the first frame as a reference. A three-level Block Matcher was used with 8 × 8 blocks and a motion threshold for motion detection at the highest resolution level; the accuracy of the search was a fraction of a pixel. Each DFD frame is the difference between frame n and a motion compensated frame n−1, given the original frame n−1.

Again, we can compress this sequence of 'transformed' images (including the first I frame) using the DCT of 8 × 8 blocks. Now the amount of data needed is about 0.4 bits/pel. Substantial compression has been achieved over attempting to compress each image separately. Of course, you will have deduced that this was going to be the case because there is much less information content in the DFD frames than in the original picture data.

To confirm that it is indeed motion compensated prediction that is contributing most of the benefit, the


Figure 2: Frames 50-53 of the Suzie sequence processed by various means. From top to bottom row: Original Frames; DCT of Top Row; Non-motion compensated DFD; Motion Compensated DFD with backward prediction; DCT of previous row.



Figure 3: A typical Group of Pictures (GOP) in MPEG2

3rd row of figure 2 shows the non-motion compensated frame difference (FD), I_n(x) − I_{n-1}(x), between the frames of Suzie. There is substantially more energy in these FD frames than in the DFD frames, hence the higher bit rate.

6.0.3 Problems with occlusion

A closer look at the DFD frame sequence in row 4 of Figure 2 shows that in frames 52 and 53 (in particular) there are some areas that show very high DFD. This is explained by observing the behaviour of Suzie in the top row. In those frames her head moves such that she uncovers or occludes some area of the background. The phone handset also uncovers a portion of her swinging hair. In the situation of uncovering, the data in some parts of frame n simply does not exist in frame n−1. Thus the DFD must be high. However, the data that is uncovered in frame n typically is also exposed in frame n+1. Therefore, if we could look into the next frame as well as the previous frame we probably would be able to find a good match for any block whether it is occluded or uncovered.

Using such Bi-directional prediction gives much better image fidelity. This idea is used in MPEG-2. It uses both backward prediction for some frames (P frames) and bidirectional prediction for others (B frames).

The sequencing is shown in Figure 3. Typically MPEG2 encodes images in the following order: IBBPBBPBBPBBPI ... I-frames (Intra-coded frames) are encoded just like JPEG, i.e. without any motion compensation. This allows the codec to cope with varying image content... think what would happen if you tried to predict every image in a movie from the first frame. It's not going to work, is it? So I-frames are slipped in every 12 frames or so to give a new reference frame for prediction of the next 12 frames.

6.1 Sledgehammer motion estimation: Block Matching

The most popular and to some extent the most robust technique to date for motion estimation is Block Matching (BM).

Two basic assumptions are made in this technique.


1. Constant translational motion over small blocks (say 8 × 8 or 16 × 16) in the image. This is the same as saying that there is a minimum object size that is larger than the chosen block size.

2. There is a maximum (pre-determined) range for the horizontal and vertical components of the motion vector at each pixel site. This is the same as assuming a maximum velocity for the objects in the sequence. This restricts the range of vectors to be considered and thus reduces the cost of the algorithm.

The image in frame n is divided into blocks, usually of the same size, N × N. Each block is considered in turn and a motion vector is assigned to each. The motion vector is chosen by matching the block in frame n with a set of blocks of the same size at locations defined by some search pattern in the previous frame.

Given a possible vector v = [v_x, v_y], we can define the DFD between a pixel in the current frame and its motion compensated pixel in the previous frame as

DFD(x, v) = I_n(x) − I_{n-1}(x + v)    (3)

Define the Mean Absolute Error of the DFD between the block in the current frame and that in the previous frame as

MAE(v) = (1/N²) Σ_{x ∈ Block} |DFD(x, v)|    (4)

We can use Mean Squared Error (MSE) as well, but MAE is more robust to noise.

The block matching algorithm then proceeds as follows at each image block.

1. Pre-determine a set of candidate vectors v to be tested as the motion vector for the current block.

2. For each v calculate the MAE.

3. Choose the motion vector for the block as that v which yields the minimum MAE.

The set of vectors v in effect yields a set of candidate motion compensated blocks in the previous frame n−1 for evaluation. The separation of the candidate blocks in the search space determines the smallest vector that can be estimated. For integer accurate motion estimation the position of each block coincides with the image grid. For fractional accuracy, blocks need to be extracted between locations on the image grid. This requires some interpolation. In most cases Bilinear interpolation is sufficient.
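The three steps can be sketched directly for an integer-accurate full search (the function name, the synthetic frames and the default block size and search range are illustrative assumptions):

```python
import numpy as np

def full_search(cur, prev, r0, c0, N=8, w=4):
    """Full-search block matching: find the (dr, dc) within +/-w that
    minimises the mean absolute DFD for the N x N block at (r0, c0)."""
    block = cur[r0:r0 + N, c0:c0 + N]
    best, best_v = np.inf, (0, 0)
    for dr in range(-w, w + 1):
        for dc in range(-w, w + 1):
            r, c = r0 + dr, c0 + dc
            if r < 0 or c < 0 or r + N > prev.shape[0] or c + N > prev.shape[1]:
                continue                 # candidate falls outside the frame
            cand = prev[r:r + N, c:c + N]
            mae = np.mean(np.abs(block.astype(int) - cand.astype(int)))
            if mae < best:
                best, best_v = mae, (dr, dc)
    return best_v, best

# Synthetic check: shift a random frame down 2 and left 1, so the block
# content in the current frame sits at offset (-2, +1) in the previous one.
rng = np.random.default_rng(0)
prev = rng.integers(0, 256, (32, 32))
cur = np.roll(prev, shift=(2, -1), axis=(0, 1))
v, err = full_search(cur, prev, 8, 8, N=8, w=4)
print(v, err)    # (-2, 1) 0.0 -- the exact shift, away from wrap-around edges
```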

Figure 4 shows the search space used in a full motion search technique. The current block is compared to every block of the same size in an area of size (N + 2w) × (N + 2w). The search5 space is chosen by deciding on the maximum displacement allowed: in Figure 4 the maximum displacement estimated is ±w for both horizontal and vertical components.

The technique arises from a direct solution of equation 1. The BM solution can be seen to minimize the Mean Absolute DFD (or Mean Square DFD) with respect to v, over the N × N block. The chosen displacement d satisfies the model equation 1 in some 'average' sense.

5 There are (2w + 1) × (2w + 1) searched locations.



Figure 4: Motion estimation via Block Matching. The marked positions in frame n−1 are searched for a match with the N × N block in frame n. One block to be examined is shown shaded.

6.1.1 Computation

The Full Motion Search is computationally demanding. Given a maximum expected displacement of ±w pels, there are (2w + 1)² searched blocks (assuming integer displacements only). Each block considered requires on the order of N² operations to calculate the MAE. This implies N²(2w + 1)² operations per block for an integer accurate motion estimate. Several reduced search techniques have been introduced which lessen this burden. They attempt to reduce the operations required either by reducing the locations searched or by reducing the number of pixels sampled in each block. However, reduced searches may find local minima in the DFD function and yield spurious matches.

6.1.2 Three step search

The simplest mechanism for reducing the computational burden of Full Search BM is to reduce the number of motion vectors that are evaluated. The Three-step search is a hierarchical search strategy that evaluates first 9, then 8, and finally again 8 motion vectors to refine the motion estimate in three successive steps. At each step the distance between the evaluated blocks is reduced. The next search is centred on the position of the best matching block in the previous search. It can be generalised to more steps to refine the motion estimate further. Figure 5 shows the searched blocks in frame n−1 for this process.
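The strategy can be sketched as follows (an illustration with step sizes 4, 2, 1; for simplicity the centre point is re-evaluated at each step, and the MAE helper and test frames are assumptions, not part of any standard):

```python
import numpy as np

def three_step_search(cur, prev, r0, c0, N=8):
    """Three-step block matching: evaluate a 3 x 3 pattern of vectors at
    step size 4, then refine around the best match at steps 2 and 1."""
    def mae(v):
        dr, dc = v
        r, c = r0 + dr, c0 + dc
        if r < 0 or c < 0 or r + N > prev.shape[0] or c + N > prev.shape[1]:
            return np.inf            # candidate falls outside the frame
        return np.mean(np.abs(cur[r0:r0+N, c0:c0+N].astype(int)
                              - prev[r:r+N, c:c+N].astype(int)))

    best = (0, 0)
    for step in (4, 2, 1):
        candidates = [(best[0] + i * step, best[1] + j * step)
                      for i in (-1, 0, 1) for j in (-1, 0, 1)]
        best = min(candidates, key=mae)   # recentre on the best match
    return best

# Synthetic check: shift a random frame so the true vector is (-4, 4).
rng = np.random.default_rng(1)
prev = rng.integers(0, 256, (32, 32))
cur = np.roll(prev, shift=(4, -4), axis=(0, 1))
print(three_step_search(cur, prev, 12, 12))   # (-4, 4)
```

Only 27 candidate vectors are evaluated here instead of the (2w + 1)² = 81 a full ±4 search would need, and the gap widens rapidly for larger search ranges.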

6.1.3 Cross Search

The cross search is another variant on the subsampled motion vector visiting strategy. It changes the geometry of the search pattern to a × or + pattern. Figure 5 shows the searched blocks in frame n−1 for this process. If the best match is found at the centre of the search pattern or the boundary of the search window, then the search step is reduced.



Figure 5: Illustration of searched locations (central pixel of the searched block is shown) in Three-step BM (left) and Cross-search BM (right). The search window extent is shown in red for Cross-search. The best matches at each search level are circled in blue.

6.1.4 Problems

The BM algorithm is noted for being a robust estimator of motion since noise effects tend to be averaged out over the block operations. However, if there is no textural information in the two blocks compared, then noise dominates the matching process and causes spurious motion estimates.

This problem can be isolated by comparing the best match found (MAE_min) to the 'no motion' match (MAE_0). If these matches are sufficiently different then the motion estimate is accepted; otherwise no motion is assumed. A threshold acts on the ratio r = MAE_0/MAE_min. The error measure used is the MAE. If r < t, where t is some threshold chosen according to the noise level suspected, then no motion is assumed. This algorithm verifies the validity of the motion estimate once motion is detected.

The main disadvantages of Block Matching are the heavy computation involved (although these are bytewise manipulations) and the motion averaging effect of the blocks. If the blocks chosen are too large then many differently moving objects may be enclosed by one block and the chosen motion vector is unlikely to match the motion of any of the objects. The advantages are that it is very simple to implement6 and it is robust to noise due to the averaging over the blocks.

There are many more useful motion estimators than this. These others do give you motion better matched to what is actually going on in the scene. But we will not look at them here.

6.2 Video codec issues

DVD and DTV both use MPEG-2, and the core is exactly as described here. MPEG-2 became a standard around 1992, and just 4 years later Digital Television was a reality. This is quite amazing considering that the advances in research in video compression that made this possible were only really about 15 years old at the time. Compare that to the 200 years it took Fourier to be really appreciated!

6 It has been implemented on Silicon for video coding applications.

1E8 Introduction to Engineering 18 Anil Kokaram www.mee.tcd.ie/~sigmedia



Mobile phone video communications will use MPEG-4 (established around 1998). Unfortunately that is going through some teething troubles at the moment.

Sadly, the creation of MPEG standards is not as simple as motion estimation, DFD, DCT, quantisation and transmission. When you actually start to think about putting together codecs, the following issues arise.

Compression There are at least three fundamentally different types of multimedia data sources: pictures, audio and text. Different compression techniques are needed for each data type. Each piece of data has to be identified with unique codewords for transmission.

Sequencing The compressed data from each source is scanned into a sequence of bits. This sequence is then packetised for transport. The problem here is to identify each different part of the bitstream uniquely to the decoder, e.g. header information, DCT coefficient information.

Multiplexing The audio and video data (for instance) have to be decoded at the same time (or approximately the same time) to create a coherent signal at the receiver. This implies that the transmitted elementary data streams should be somehow combined so that they arrive at the correct time at the decoder. The challenge is therefore to allow for identifying the different parts of the multiplexed stream and to insert information about the timing of each elementary data stream.

Media The compressed and multiplexed data has to be stored on some DSM (Digital Storage Media) and then later (or live) broadcast to receivers across air or other links. Access to different media channels (including DSM) is governed by different constraints, and this must somehow be allowed for in the standards description.

Errors Errors in the received bitstream invariably occur. The receiver must cope with errors such that the system performance is robust to errors or degrades in some graceful way.

Bandwidth The bandwidth available for the multimedia transmission is limited. The transmission system must ensure that the bandwidth of the bitstream does not exceed these limits. This problem is called Rate Control and applies both to the control of the bitrate of the elementary data streams and of the multiplexed stream.

Multiplatform The coded bitstream may need to be decoded on many different types of device with varying processor speeds and storage resources. It would be interesting if the transmission system could provide a bitstream which could be decoded to varying extents by different devices. Thus a low capacity device could receive a lower quality picture, while a high capacity device would receive further features and higher picture quality. This concept, applied to the construction of a suitable bitstream format, is called Scalability.
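The Rate Control problem in the list above can be pictured with a toy leaky-bucket model: coded frames of varying size enter the encoder's output buffer, the channel drains the buffer at a fixed rate, and the controller must keep the buffer from overflowing (by quantising more coarsely) or running dry. This is a sketch of the concept only, not the mechanism any MPEG standard prescribes:

```python
def rate_control(frame_bits, channel_bits_per_frame, buffer_size):
    """Toy leaky-bucket buffer model for rate control.
    Returns the buffer occupancy (in bits) after each frame,
    or raises OverflowError if the buffer limit is exceeded."""
    occupancy = 0
    history = []
    for bits in frame_bits:
        occupancy += bits                    # coded frame enters the buffer
        occupancy -= channel_bits_per_frame  # channel drains at a fixed rate
        occupancy = max(occupancy, 0)        # the buffer cannot drain below empty
        if occupancy > buffer_size:
            raise OverflowError("buffer overflow: quantise more coarsely")
        history.append(occupancy)
    return history
```

A rising occupancy history tells the encoder that recent frames are too expensive for the channel, which is exactly the feedback a real rate controller uses to adjust the quantiser.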

What we have covered here is the core of the standard used for image and video compression. This just says how the data itself is compressed. If you open up an .avi or .mpg file, you will not see this data in that same form. It has to be encoded into symbols, and timing and copyright information embedded at the very least. This makes the design of codecs a tricky business. But it is certainly true that without standards, there would be no business in video communications.

Finally, note that none of the compression standards actually describes how you do the things you have to do. A standard just describes how to represent bits and package them. So you can use cleverer DCTs or cleverer motion estimators to get better speed and performance. That is why one manufacturer's codec could be better than another's even though they both create compressed video according to the same standard.
