
[IEEE First International Conference on Distributed Frameworks for Multimedia Applications, Besançon, France, 6-9 Feb. 2005]

Modelling, Analysis and Parallel Implementation of an On-line Video Encoder

Ismail Assayad, Philippe Gerner, Sergio Yovine, Valerie Bertin

Abstract

Video encoding is a fundamental component of a wide range of real-time multimedia applications. In this paper we present the fine-grain MPEG-4 parallelism and describe a modelling, mapping and scheduling approach that produces code for an industrial video encoder on SMP platforms.

Keywords: MPEG-4, Real-time Encoding, Mapping and Scheduling, SMP.

1. Introduction

The MPEG committee has defined widely used standards for digital video encoding, providing high-quality images and high compression rates. However, real-time MPEG encoding also demands high computational power, usually far beyond what traditional sequential computers can provide. Fortunately, algorithms compliant with the MPEG standard can be parallelized. In this work, we propose (1) a parallel model for an MPEG-4 video encoding algorithm, which expresses the maximal parallelism originally hidden in the standard's block-diagram specification, and (2) a parallel implementation based on the exploitation of fine-grain parallelism at the macroblock level. A detailed description of the MPEG-4 encoder's phases and of the dependencies between macroblock-level computations is first introduced at the beginning of this paper. The parallel model is then described in a formal language whose full definition is not within the scope of this paper. For ease of exposition, however, we briefly describe the constructs of this language which are used here to express both the parallelism and the dependencies between the MPEG-4 encoder's computations at the macroblock level. Starting from this parallel model, we first generate an efficient non-executable parallel specification for SMP platforms. To generate this specification, we propose an efficient mapping and scheduling heuristic, based on results

VERIMAG, Centre Equation, 2 av. de Vignate, 38610 Gieres, France. Email: {Ismail.Assayad, Philippe.Gerner, Sergio.Yovine}@imag.fr

ST MICROELECTRONICS, rue Jean Monnet, 38921 Crolles, France. Email: [email protected]

from automatic loop parallelization techniques. By applying two transformations, our heuristic extracts parallel loops without inter-iteration dependencies from the model. By applying a third model transformation, corresponding to a greedy dynamic load-balancing strategy for these loops on an SMP architecture, the heuristic obtains the parallel non-executable specification. Then, an implementation is automatically produced by translating this specification into an MPI-based, message-passing parallel program for the SMP architecture. Speedup analysis of our mapping and scheduling algorithm on a CREW-PRAM architecture model, and experimental performance results, show the usefulness of our approach for SMPs. Moreover, this approach, which we have designed to be integrated into an SoC compilation chain [5], has the advantage of being flexible, since it can be applied to generate an MPEG-4 parallel implementation even if some features, such as the rate control mechanism, are changed or new ones are added to the proposed MPEG-4 model.

The paper is organized as follows. Sec. 2 gives basic definitions and discusses related work. Sec. 3 presents a parallel specification for the MPEG compression algorithm. Sec. 4 describes our approach for code generation starting from the proposed parallel model. Sec. 5 discusses the experimental results. Finally, some conclusions and future work are outlined in Sec. 6.

2. Definitions

In this work, we are specifically interested in the MPEG-4 Video Standard [1], which is widely used in real-time multimedia applications such as videoconferencing. Hereinafter we use the term MPEG to refer to the MPEG-4 standard.

A complete description of the MPEG compression scheme is beyond the scope of this paper. For details on MPEG see, e.g., [1]. Nevertheless, we give some general definitions in this introduction and in Section 3 which are useful for understanding related work and ours.

Each MPEG frame is divided into 16 × 16 groups of pixels called macroblocks, each of which is encoded separately. Each macroblock contains a section of the luminance component and the spatially corresponding chrominance components. For example, a 16 × 16 macroblock is

Proceedings of the First International Conference on Distributed Frameworks for Multimedia Applications (DFMA’05) 0-7695-2273-4/05 $ 20.00 IEEE


composed of six 8 × 8 blocks: four luminance blocks and two chrominance blocks.
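As an illustration, a macroblock of this shape and the macroblock-grid arithmetic can be sketched in C as follows (our names and layout, not the paper's data structures; the six-block layout corresponds to 4:2:0 chrominance subsampling):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sketch: a 16x16 macroblock holds four 8x8 luminance blocks
 * plus one 8x8 Cb and one 8x8 Cr chrominance block: six 8x8 blocks. */
typedef struct {
    int16_t luma[4][8][8]; /* Y: a 2x2 grid of 8x8 blocks = 16x16 pixels */
    int16_t cb[8][8];      /* chrominance, subsampled to one 8x8 block   */
    int16_t cr[8][8];
} Macroblock;

/* Number of macroblock columns/rows covering a given frame resolution. */
int mb_cols(int width)  { return (width  + 15) / 16; }
int mb_rows(int height) { return (height + 15) / 16; }
```

For instance, the 640×480 frames used in the experiments of Sec. 5 form a grid of 40 × 30 macroblocks.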

The standard defines four types of frame encoding: I, P, B, and D. D-frames cannot be mixed with frames of other types and are only used in special-purpose sequences. I-frames are encoded independently of other frames in the sequence, while P- and B-frames are encoded as differences from one or two previous (or subsequent) reference frames. These differences are obtained via motion estimation and compensation techniques. Because of this, the encoder, when compressing a B- or a P-frame, needs access not only to the current frame but also to one or two more.

The frames in an MPEG sequence are encoded inside Groups of Pictures (GOPs). When encoding MPEG videos, the temporal redundancy between images is exploited to save bits. This is achieved by motion-vector searches in past/future frames. For that purpose, MPEG sequences are divided into sets of frames, typically of 4 to 20 frames each, called Groups of Pictures. This is the coarsest approach to parallelism. Each frame has a decoding timestamp relative to the beginning of its corresponding GOP.

2.1. Related work

The best method of parallelization and scheduling depends on whether the encoding is on-line or off-line.

In off-line encoding, the data stream does not arrive from a live source; rather, an entire video sequence is already available. Such an approach is useful for production systems, such as encoding a large video sequence to disk. In this approach, one can partition the video data in the temporal domain and let each processor independently encode frames, GOPs or a sequence of GOPs. In [2, 4], the authors present parallel MPEG-1 and MPEG-2 encoder algorithms, respectively, where a video sequence is divided into several sets of GOPs which are then assigned to compute nodes. Similar work is done in [8, 9], where each GOP in the video sequence is distributed among five processing elements of the multicomputer. In [10], the authors give a specification of the MPEG-2 encoder parallelism at frame level using timed extensions of Petri nets, and experimentally show that the performance of the encoding algorithm can be improved by a factor of nearly 50% by using two processors.

In on-line encoding, data arrives from a live source, as in videoconferencing, at a fixed rate, e.g., 30 frames/sec, and must be compressed on-line at that speed. To avoid any delay in this context, each incoming frame must be processed in real time. Little work has been done in this direction. Akramullah et al. [3] presented a data-parallel approach which focuses on the motion estimation stage of the MPEG-2 video encoder for a two-dimensional grid target with a distributed-memory processor topology. In order to avoid

communication between processors when the search window of the motion estimation would move to the boundary of a processor's local data, the authors allocate redundant data to each processor to form the search window. The aim of the method is to avoid data migration between processors.

2.2. Our approach

For real-time encoders, each frame in the input queue must be processed in real time. This constraint is due to the synchronization latency between the process responsible for (periodically) inserting frames into the input queue and the process which consumes frames from the output queue. In this context, encoders cannot exploit parallelism at frame or GOP level. This is why we are interested in effectively exploiting the potential fine-grain parallelism in the MPEG-4 encoder, even though the latter involves some new features which introduce data dependencies between the treatment of different small frame blocks, and which limit the parallelism inside the encoder to the macroblock level. In this paper, we give an MPEG-4 parallel model which faithfully expresses the parallelism hidden in the compression standard, and a distribution/scheduling algorithm to generate the parallel MPEG-4 implementation from the model.

3. Parallel model of MPEG-4

Since P- and B-pictures depend on other nearby pictures, frame-grain parallel encoding for real-time applications, such as videoconferencing, leads to many serializing dependencies, even if the encoder has a large (GOP) frame-size window (i.e., from 10 to 15 frames). Also, the MPEG standard algorithm limits the parallelism at fine-grain level.

The MPEG-4 encoding consists of the stages described in the following subsection [1].

3.1. Description of phases and dependencies

Motion estimation (ME). The motion estimation can be fully done in parallel. There are several classes of block-matching algorithms in the standard: full, hierarchical and gradient search algorithms. These algorithms can also be controlled, for instance, by the search window size or the number of passes in a hierarchical search. The choice of an algorithm directly influences the target frame size and rate. This phase returns the best motion vector found, together with a set of distortion reports.

Macroblock coding choice (Choice). This phase uses results from the motion estimation phase (motion vector and distortion reports) to choose the macroblock type and the quantization step. The macroblock type determines whether a


macroblock will be coded or not, the compression type of the macroblock (inter/intra) and the prediction mode. A rate control mechanism is also included in this phase. It computes a distortion value for the current frame based on the set of estimators provided by the ME phase. This value is then used for the next frame to determine the quantization factor for each macroblock.

Motion vector prediction (MVP). There are no data dependencies between executions of this stage at the macroblock level; each macroblock is dealt with independently, and therefore this stage can be done in parallel. Nevertheless, MVP uses some data computed for neighboring macroblocks in the ME stage. We explain these data dependencies, and show which motion estimation results on neighboring macroblocks must be waited for before starting the motion vector prediction for the current macroblock.

In the two modes of motion vector prediction, i.e., macroblock-basis prediction and block-basis prediction, the neighboring macroblocks are used as predictors for the motion vector of the current block or macroblock. This choice of neighbors for spatial-correlation information reduces the bit rate for the motion vectors. As described in the MPEG-4 standard, when the block in consideration is near the picture boundary, not all predictors are available, and in such cases only the available predictors are used. As a consequence, data dependencies exist between the motion vector prediction stage of a macroblock and the motion estimation stage of some neighboring macroblocks. The figure below shows the neighboring macroblocks whose motion vectors (computed in the ME phase) are used by MVP to predict the motion vector of macroblock (x,y) (macroblock-basis prediction).

[Figure: the macroblock grid around MVP(x,y), showing ME(x−1,y−1), ME(x,y−1) and ME(x+1,y−1) in the row above and ME(x−1,y) to the left; arrows mark the ME results used by MVP(x,y).]
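The boundary rule can be sketched in C (our code; the `MV` type and the row-major `me` array are assumptions). Following the dependencies (1,(x-1,y)), (1,(x,y-1)), (1,(x+1,y-1)) -> (2,(x,y)) of the encode graph, only the neighbours that actually exist are collected:

```c
#include <assert.h>

typedef struct { int x, y; } MV;

/* Collect the motion vectors already computed by ME for the left, top and
 * top-right neighbours of macroblock (x,y); near the picture boundary only
 * the available predictors are used. Returns how many were collected. */
int collect_predictors(const MV *me, int W, int x, int y, MV out[3]) {
    int n = 0;
    if (x - 1 >= 0)              out[n++] = me[y * W + (x - 1)];       /* left      */
    if (y - 1 >= 0)              out[n++] = me[(y - 1) * W + x];       /* top       */
    if (x + 1 < W && y - 1 >= 0) out[n++] = me[(y - 1) * W + (x + 1)]; /* top-right */
    return n;
}
```

For the top-left macroblock (0,0) no predictor is available; for an interior macroblock all three are.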

Motion vector difference (MVD). No data dependencies exist between executions of this stage for the macroblocks of a video object plane (VOP); each macroblock can be dealt with independently. For a given macroblock, however, this stage must be executed sequentially after the motion estimation and motion vector prediction stages.

Block prediction and difference (Prd). If the chosen macroblock type is I, a predictor is subtracted before the discrete cosine transformation phase. This phase can be executed in parallel at the macroblock level.

Discrete cosine transformation (DCT). The difference between the predictor and the current macroblock is DCT-ized, block after block. In this process, the high-frequency components, which are insignificant to human visual perception, are discarded. This phase can be fully executed in parallel at the macroblock level.

[Figure 1. Data dependencies for DC prediction in Quant: Quant(x,y) uses the inverse-quantized coefficients QuantI(x−1,y), QuantI(x−1,y−1) and QuantI(x,y−1).]

Quantization of DCT-ized blocks (Quant). This phase includes DC coefficient prediction for type-I macroblocks and quantization of the DCT-ized coefficients.

Inverse quantization (QuantI). Inverse quantization of the quantized blocks. This phase can be fully executed in parallel at the macroblock level. Figure 1 shows the previous neighboring macroblocks whose inverse-quantized coefficients are used in the DC coefficient prediction of the preceding phase.

Inverse DCT (DCTI). Inverse discrete cosine transformation of the inverse-quantized blocks. This phase can be fully executed in parallel at the macroblock level.

Block addition (ADD). This phase consists in the reconstruction of the frame blocks by adding the inverse DCT-ized coefficients to the coefficients of the reference-frame predictor block. After the reconstruction of the frame planes from the encoded data, they are extended by border pixels (extend_rec).

Macroblock layer (MBL). Data, transformed using DCT and quantized as described before, is scanned in a direction depending on the DC prediction, called the zigzag order. This generally produces long runs of zeros, which are then exploited by run-length coding to yield the compressed data, including header and block-layer generation. This phase is thus composed of three sub-phases: the zigzag scan (Zigzag) and the run-length encoding (RLE), which can be fully executed in parallel at the macroblock level, and the layer generation (Layer), the output sub-phase, which is done sequentially.
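To make the Zigzag and RLE sub-phases concrete, here is a small C sketch of both on one 8×8 block (our code, using the classical zigzag pattern; as noted above, the standard also selects alternate scans depending on the DC prediction direction):

```c
#include <assert.h>

/* Build the classical 8x8 zigzag scan order by walking the anti-diagonals
 * d = row + col and alternating the walking direction on each diagonal. */
void build_zigzag(int order[64]) {
    int i = 0;
    for (int d = 0; d < 15; d++) {
        if (d % 2 == 0)      /* even diagonal: walk bottom-left to top-right */
            for (int r = (d < 8 ? d : 7); r >= 0 && d - r < 8; r--)
                order[i++] = r * 8 + (d - r);
        else                 /* odd diagonal: walk top-right to bottom-left  */
            for (int c = (d < 8 ? d : 7); c >= 0 && d - c < 8; c--)
                order[i++] = (d - c) * 8 + c;
    }
}

/* Run-length encode the scanned coefficients as (run-of-zeros, level)
 * pairs; returns the number of pairs written. */
int rle(const short blk[64], const int order[64], int run[64], short lvl[64]) {
    int n = 0, zeros = 0;
    for (int i = 0; i < 64; i++) {
        short v = blk[order[i]];
        if (v == 0) { zeros++; continue; }
        run[n] = zeros; lvl[n] = v; n++; zeros = 0;
    }
    return n;
}
```

Since both routines touch only their own block, they match the claim that Zigzag and RLE parallelize fully at the macroblock level.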

3.2. Main Features of the Language

Before presenting the MPEG-4 model, we briefly describe the features of our language that we use to model the MPEG-4 fine-grain parallelism. The language uses a declarative parallelism approach: the default relation between two computation units is concurrency (expressed by the construct ||), and the programmer adds explicit precedences (indicated by arrows in the figures below) to suppress this concurrency. The language is similar to a data-flow-style specification language, with the following differences:


1. The granularity of computation units is not fixed. The smallest grain is the assignment. The other units are: for loops with their body, a block (like a C block), etc. Computation units can be naturally nested, like C blocks. A computation unit can also be legacy C code.

2. There can be implicit dependencies between computation units, coming from data dependencies, as in data-flow languages; but the programmer can also write explicit dependencies between computation units. Explicit dependencies are necessary for programming a precise order between computations that is not naturally expressed in a data-flow model.

3. A forall construct allows the programmer to declare several concurrent tasks that share code. This construct is similar to the Fortran 95 forall, with the difference that it is possible to have (implicit or explicit) dependencies between iterations. Since sequences are allowed in the body of a forall, this construct provides a convenient duplication mechanism for tasks which are not fully data-flow. Besides, it retains the C-style loop expression, which is convenient for C programmers, both for writing new code and for adapting legacy code (transforming for loops into forall loops).

4. Information about the concrete execution model (e.g., message passing, shared memory, PRAM, number of processors) and about the implementation model (e.g., OpenMP, Pthreads, MPI) can be given in the program; both appear in steps 3 and 4 of the code generation, see Sec. 4.

3.3. MPEG-4 model

In this section we present a specification of MPEG-4 in our language which describes all the concurrency existing in the compression standard, originally hidden in the block-diagram model [1]. In this model we choose to use a fixed frame-type pattern: the distance between two I-frames, noted ISpace in the model, is fixed at compile time.

The specification is composed essentially of forall nodes and sequential C block nodes, and is annotated with the dependencies of the MPEG algorithm. In order to index the nodes, we label them with numbers in brackets, as indicated in the figure of the IR below. For instance, we denote by (1,(x,y)) the computation corresponding to the execution of the ME (motion estimation) phase on the frame macroblock at position (x,y). The arrows indicate dependencies between these computations.

There are three types of dependencies: (1) data dependencies resulting from the MPEG-4 standard specification, (2) functional dependencies necessary for the correct functioning of the application, and (3) dependencies resulting from the implementation decision of encoding frames one after another.

The graph corresponding to the main encoding procedure is:

||
  forall(fr): encode
    if (fr % ISpace + 1 != 0) {
      (3,fr-1) -> (3,fr)
      (7,fr-1) -> (1,fr)
    }
    (6,fr-1) -> (6,fr)
    (1,fr-1) -> (1,fr)

  forall(fr): extend_rec (7)
    (5,fr) -> (7,fr)
    (7,fr-1) -> (7,fr)

Input and output buffers with one-frame capacity are used, thus introducing the two following dependencies: (1,fr-1) -> (1,fr) and (7,fr-1) -> (7,fr), where fr is the frame index. The remaining dependencies are data dependencies. The graph corresponding to the encode procedure is given in Fig. 2. All the dependencies in the figure are data dependencies. For instance, (1,(x,y)) -> (3,(x,y)) is a data dependency which expresses that the ME phase on macroblock (x,y) must finish before the Choice phase starts on the same macroblock. The mb_encode (5) procedure (Fig. 3) is composed of seven forall loops (Prd, DIFF, DCT, Quant, QuantI, DCTI, ADD) whose bodies are not shown here.

Finally, the specification corresponding to the node MBLis the following:

MBL
||
  forall(x,y): Zigzag (6.1)
    (5.4,(x,y)) -> (6.1,(x,y))

  forall(x,y): RLE (6.2)
    (6.1,(x,y)) -> (6.2,(x,y))

  forall(x,y): Layer (6.3)
    (4,(x,y)) -> (6.3,(x,y))
    (6.2,(x,y)) -> (6.3,(x,y))
    (6.3,(x,y)) -> (6.3,(x+1,y))

(6.3,(x,y)) -> (6.3,(x+1,y)) is a functional dependency due to the fact that the generated headers and blocks are sequentially written to the output bitstream. The remaining dependencies are data dependencies.

4. Parallel implementation

We derive a parallel implementation for an SMP architecture running MPI. To do this, the model goes through successive transformation phases which produce models that are refinements of it. The last transformation generates executable code. The sequence of transformations is the following:

Elimination of backward inter-forall loop dependencies. There are some producer/consumer-style data dependencies between the Quant forall iterations and the QuantI ones (see Fig. 1). To eliminate these inter-forall dependencies, we merge the Quant and the QuantI computations into a


encode
||
  forall(x,y): if (ptype==INTER) ME (1)

  forall(x,y): if (ptype==INTER) MVP (2)
    (1,(x,y)) -> (2,(x,y))
    (1,(x-1,y)) -> (2,(x,y))
    (1,(x,y-1)) -> (2,(x,y))
    (1,(x+1,y-1)) -> (2,(x,y))

  forall(x,y): Choice (3)
    (1,(x,y)) -> (3,(x,y))

  forall(x,y): if (ptype==INTER) MVD (4)
    (1,(x,y)) -> (4,(x,y))
    (2,(x,y)) -> (4,(x,y))

  mb_encode (5)

  MBL (6)

Figure 2. The encode procedure

mb_encode
||
  forall(x,y): Prd (5.1)
    (1,(x,y)) -> (5.1,(x,y))
    (3,(x,y)) -> (5.1,(x,y))

  forall(x,y): DIFF (5.2)
    (5.1,(x,y)) -> (5.2,(x,y))

  forall(x,y): DCT (5.3)
    (5.2,(x,y)) -> (5.3,(x,y))

  forall(x,y): Quant (5.4)
    (1,(x,y)) -> (5.4,(x,y))
    (3,(x,y)) -> (5.4,(x,y))
    if (mtype == I) {
      (5.5,(x-1,y)) -> (5.4,(x,y))
      (5.5,(x-1,y-1)) -> (5.4,(x,y))
      (5.5,(x,y-1)) -> (5.4,(x,y))
      (5.3,(x,y)) -> (5.4,(x,y))
    }

  forall(x,y): QuantI (5.5)
    (1,(x,y)) -> (5.5,(x,y))
    (3,(x,y)) -> (5.5,(x,y))
    (5.4,(x,y)) -> (5.5,(x,y))

  forall(x,y): DCTI (5.6)
    (5.5,(x,y)) -> (5.6,(x,y))

  forall(x,y): ADD (5.7)
    (5.6,(x,y)) -> (5.7,(x,y))

Figure 3. The mb_encode procedure

new forall computation called quantization and inverse quantization (noted QiQ), with a new set of resulting dependencies. We summarize these new dependencies in the following matrix, where each bullet at position (x,y) corresponds to the execution of both the quantization and the inverse quantization on the macroblock at position (x,y):

[Dependency matrix: each bullet at (x,y) depends on the bullets at (x−1,y), (x−1,y−1) and (x,y−1), so the mutually independent bullets line up along anti-diagonals.]

This matrix corresponds to the dependencies in the following QiQ pseudo-code:

forall(0 <= x < W)
  forall(0 <= y < H)
    QiQ(x,y) = F[QiQ(x-1,y), QiQ(x-1,y-1), QiQ(x,y-1)];

Elimination of intra-forall loop dependencies. The next transformation aims at obtaining a new model without the intra-forall dependencies exhibited in the preceding code. We denote by QiQ(x,y) the execution of both the quantization and the inverse quantization on the macroblock at position (x,y). To transform this code, we take advantage of the affine dependencies [7]. We thus obtain the new model by replacing the initial forall QiQ node by a sequence of three nested loop nodes: for each of them, the outermost loop is a for node and the inner loop is a dependency-free forall node. That is, we obtain the new QiQ pseudo-code given in Fig. 4, without intra-forall loop dependencies.

The figure below illustrates the transformation.

[Figure: macroblock (x,y) depends on (x−1,y), (x−1,y−1) and (x,y−1), which lie on the anti-diagonals d−1 and d−2; all macroblocks on the same anti-diagonal d are independent.]

Concerning ME, MVP, Choice, MVD, Prd, DIFF, DCT, DCTI, ADD, Zigzag and RLE, no transformation is needed, since each of them is a forall loop without intra-loop dependencies. The Layer and extend_rec phases are sequential due to implementation and functional dependencies.

Bounding parallelism. We divide each dependency-free forall loop into chunks corresponding to rows of macroblocks, and we dynamically assign chunks to processors on a first-come, first-served basis. We take the size of the forall as the number of chunks for the QiQ phase, and the number of macroblock rows for the ME, MVP, Choice, MVD, Prd, DIFF, DCT, DCTI, ADD, Zigzag and RLE phases. This solution is illustrated in


for (0 <= d <= min(H,W)-1)
  forall (0 <= y <= d)
    QiQ(d-y,y) = F[QiQ(d-y-1,y), QiQ(d-y-1,y-1), QiQ(d-y,y-1)];

for (min(H,W) <= d <= max(H,W)-1)
  forall (0 <= y <= min(H,W)-1)
    QiQ(d-y,y) = F[QiQ(d-y-1,y), QiQ(d-y-1,y-1), QiQ(d-y,y-1)];

for (max(H,W) <= d <= H+W-1)
  forall (d-max(H,W)+1 <= y <= min(H,W)-1)
    QiQ(d-y,y) = F[QiQ(d-y-1,y), QiQ(d-y-1,y-1), QiQ(d-y,y-1)];

Figure 4. QiQ code with dependency-free forall loops
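The schedule of Fig. 4 can be sketched in C as a single wavefront over anti-diagonals (a toy stand-in for QiQ, ours; the bounds check replaces the three explicit loop nests of Fig. 4, which enumerate exactly the same (d, y) pairs):

```c
#include <assert.h>

#define W 5
#define H 4

int step[H][W]; /* the diagonal (parallel step) at which each cell runs */

/* Wavefront schedule: all cells with x + y == d are mutually independent,
 * so for each d the y-loop is a dependency-free forall. The dependencies
 * of (x,y), namely (x-1,y), (x-1,y-1), (x,y-1), lie on diagonals d-1 and
 * d-2, which have already been executed. */
void qiq_wavefront(void) {
    for (int d = 0; d <= H + W - 2; d++)
        for (int y = 0; y < H; y++) {   /* forall candidate */
            int x = d - y;
            if (x < 0 || x >= W) continue;
            step[y][x] = d;             /* here: QiQ(x,y) = F[...] */
        }
}
```

The recorded steps confirm that every cell runs strictly after all three of its dependencies.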

figure 6, where the forall loop shown in Fig. 5 is transformed into a nested loop. The inner loop is a for loop and corresponds to the sequential execution, on one processor, of the input data of a row of macroblocks. This transformation is done for every function f (ME, MVP, Choice, MVD, Prd, DIFF, DCT, DCTI, ADD, Zigzag and RLE) except QiQ.

forall (0 <= x < W, 0 <= y < H)
  f(x,y);

Figure 5. The forall pseudo-code

is transformed to:

forall (0 <= y < H)
  for (0 <= x < W)
    f(x,y);

Figure 6. The transformed forall

Mapping strategy. Our strategy consists in exploiting intra-forall loop parallelism. The mapping phase consists in transforming the || operator combining the forall loops into a ";" operator and assigning a set of processors to each group of partitioned forall loops obtained from the partitioning phase. As the for loops combined by these forall operators are dependency-free, we assign to each group the whole set of processors, or a subset of them if the architecture size is bigger than the number of parallel for loops in the group.

Scheduling strategy. We use a dynamic scheduling strategy. The for loops of each forall group are mapped to processors on a first-come, first-served basis. This approach allows faster processors to request more jobs than slower processors, since this greedy dynamic scheduling permits ready processors to execute new tasks as soon as they become available.
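The effect of the first-come, first-served policy can be illustrated with a toy simulation in C (our code, not the paper's implementation; `row_cost` models the per-row encoding time of each processor):

```c
#include <assert.h>

#define NPROC 3
#define NROWS 8

int rows_done[NPROC]; /* rows encoded by each processor */

/* Greedy dynamic schedule: each of the NROWS macroblock rows is handed to
 * whichever processor becomes ready first, so a faster processor (smaller
 * per-row cost) naturally requests more rows than a slower one. */
void greedy_schedule(const int row_cost[NPROC]) {
    int busy_until[NPROC] = {0};
    for (int r = 0; r < NROWS; r++) {
        int p = 0;
        for (int q = 1; q < NPROC; q++)   /* pick the earliest-ready proc */
            if (busy_until[q] < busy_until[p]) p = q;
        busy_until[p] += row_cost[p];     /* processor p encodes row r    */
        rows_done[p]++;
    }
}
```

With per-row costs of 1, 2 and 4 time units, the fastest processor ends up encoding most of the rows, which is the load-balancing behaviour described above.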

Code generation. An MPI implementation is derived from the specification. To do this, we use a master/slave mechanism where one processor is the master, in charge of the scheduling control, and the other processors are slaves, in charge of the computations.

[Figure: variation of the average relative gain in compression time, G(n) (in %), with the number of processors n, for n = 3 to 23.]

To evaluate the performance of the parallel encoder, we measured the relative gain in compression time

G(n) = (T1 − Tn) / T1,

where Tn is the execution time on n processors and T1 is that of the sequential encoder, for a bench of one hundred frames with a resolution of 640×480, on an SMP cluster. The motion estimation algorithm used by the encoder in these experiments is a three-pass hierarchical search algorithm with a motion vector range of 64 pixels.
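As a helper (ours, for illustration), the relative gain G(n) = (T1 − Tn)/T1 can be computed as a percentage; the figures quoted below correspond to Tn = 0.85·T1 and Tn = 0.30·T1:

```c
#include <assert.h>

/* Relative gain in compression time, G(n) = (T1 - Tn) / T1, in percent,
 * where t1 is the sequential time and tn the time on n processors. */
double gain_percent(double t1, double tn) {
    return 100.0 * (t1 - tn) / t1;
}
```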

The results show that adding processing units to the architecture has a positive impact on the gain in video encoding time. The gain increases systematically for architecture sizes from three to sixteen processors. Indeed, the parallel compression time is smaller than the sequential time by nearly 15% for two processors, and the gain reaches nearly 70% for sixteen processors. The variation in the average relative gain is less significant (and even null) when the number of processors is greater than or equal to a threshold of seventeen processors. Even if the gain is not impressive for small architectures, it remains important for hard real-time encoders, where each frame is subject to a strict encoding deadline constraint.

Proceedings of the First International Conference on Distributed Frameworks for Multimedia Applications (DFMA’05) 0-7695-2273-4/05 $ 20.00 IEEE


5. Speedup analysis

The speedup result above is sub-linear due to the complex dependencies and the sequential phases in the MPEG-4 algorithm. In this section, we compare the experimental results above with the (best) expected parallel time of any greedy scheduling policy on a PRAM architecture model, in order to evaluate the goodness of our mapping and scheduling approach.

Given a precedence graph and a greedy scheduling algorithm of this graph on the PRAM-CREW architecture model [6] (this model is widely used to evaluate parallel algorithms and is well adapted for shared-memory multiprocessors), we define the following values:

- The parallel time T_∞, which corresponds to the critical path of the graph;

- The parallel time T_p, which corresponds to the parallel execution time obtained by the scheduled algorithm on p processors;

- The arithmetic work T_1, which corresponds to the sequential execution time on one processor.
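For a greedy schedule of a precedence graph, the classical bound T_p ≤ T_1/p + T_∞ relates these three quantities. A small numeric sketch (illustrative function names and values):

```c
/* Upper bound on the parallel time of a greedy schedule on p processors,
 * given the work t1 (sequential time) and the critical path t_inf:
 * Tp <= T1/p + T_inf. */
double greedy_time_bound(double t1, double t_inf, int p)
{
    return t1 / p + t_inf;
}

/* Corresponding lower bound on the speedup T1 / Tp. */
double speedup_lower_bound(double t1, double t_inf, int p)
{
    return t1 / greedy_time_bound(t1, t_inf, p);
}
```

With work 100, critical path 5 and 10 processors, the parallel time is at most 15, i.e. a speedup of at least 100/15 ≈ 6.7 rather than the ideal 10.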

We denote by Φ the set of the following computation stages given in the MPEG model detailed in Sec. 3: ME, MVP, Choice, MVD, Prd, DIFF, QiQ, DCTI, ADD, Zigzag, RLE. So, we have:

T_1 = Σ_{φ ∈ Φ} T_1^φ + T_Layer + T_extend_rec

T_∞ = Σ_{φ ∈ Φ} T_∞^φ + T_Layer + T_extend_rec

To avoid ambiguity we also denote by T_∞^φ and T_p^φ the quantities T_∞ and T_p respectively for the phase φ when necessary.

We evaluate the speedup S_p = T_1 / T_p for each phase φ in the set Φ.

Forall nodes of the QiQ phase. For the three for nodes associated with QiQ (see Fig. 4):

T_1^QiQ = 3 · (W/16) · (H/16) · τ_Q

T_∞^QiQ = 3 · τ_Q

where W and H are the frame width and height and τ_Q is the maximal execution time of the three QiQ inner forall bodies. The development of the computation above, using the greedy scheduling bound, leads to the following two inequalities for the PRAM-CREW model:

T_p^QiQ ≤ T_1^QiQ / p + T_∞^QiQ

S_p^QiQ ≥ p / (1 + p · T_∞^QiQ / T_1^QiQ)

We note that if the number of available processors p is much smaller than the average parallelism T_1^QiQ / T_∞^QiQ, i.e., p · T_∞^QiQ ≪ T_1^QiQ, we have T_p^QiQ ≈ T_1^QiQ / p, i.e., S_p^QiQ ≈ p.
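The average parallelism of a macroblock-level forall can be made concrete: there is one iteration per 16×16 macroblock (hypothetical helper name below). For the 640×480 frames used in the experiments this gives 40 × 30 = 1200 iterations, far above the 3 to 23 processors tested, so the parallel-slackness condition holds in practice.

```c
/* Number of macroblock-level iterations of a forall node, i.e. the
 * average parallelism available: one iteration per 16x16 macroblock. */
int macroblock_parallelism(int width, int height)
{
    return (width / 16) * (height / 16);
}
```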

Forall nodes other than QiQ ones. For a forall node associated with the MPEG computation stage φ in the set Φ (see Fig. 5), the parallel time T_p^φ on the PRAM architecture model is:

T_p^φ ≤ T_1^φ / p + τ_φ

where τ_φ is the maximal execution time of the forall body of φ. Under the parallel slackness assumption we note, as above, that T_p^φ ≈ T_1^φ / p. As a consequence, the speedup for each of the forall nodes is nearly linear.

Speedup for all the compression algorithm. Finally, we deduce the parallel time T_p of the encoder algorithm on a PRAM architecture model by adding the execution times of the sequential phases (Layer, extend_rec). The following two inequalities are obtained:

T_p ≤ Σ_{φ ∈ Φ} T_p^φ + T_Layer + T_extend_rec

T_p ≥ Σ_{φ ∈ Φ} T_1^φ / p + T_Layer + T_extend_rec

so that, for any number of processors p, the parallel time remains bounded below by T_Layer + T_extend_rec.

From this analysis we conclude that:

S_p = T_1 / T_p ≤ T_1 / (T_Layer + T_extend_rec)

That is, the best achievable speedup for the MPEG-4 encoder by exploiting fine grain parallelism on a PRAM architecture model is sub-linear. Finally, we note that the experimental results are coherent with this result, which states that the relative gain curve lies under a hyperbola branch (G_p = 1 − 1/S_p).
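The bound above is Amdahl-like: the sequential phases cap the achievable speedup, and the measured gain relates to the speedup by G_p = 1 − 1/S_p. A quick numeric sketch (hypothetical function names, illustrative values):

```c
/* Best achievable speedup when sequential phases of total duration
 * t_seq_phases cannot be parallelized: S <= T1 / (T_Layer + T_extend_rec). */
double max_speedup(double t1, double t_seq_phases)
{
    return t1 / t_seq_phases;
}

/* Relative gain as a function of speedup: G = 1 - 1/S. */
double gain_from_speedup(double s)
{
    return 1.0 - 1.0 / s;
}
```

For example, if the sequential phases account for 4 time units out of 100 of total work, no schedule can exceed a 25× speedup, and a speedup of 4 corresponds to a relative gain of 75%.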

6. Conclusion

We presented a parallel model of the MPEG-4 video encoder. We applied a set of transformations on that model to generate a specification of the parallel encoder for SMP machines corresponding to a dynamic data partitioning at macroblock level. Finally, we translated that specification to an MPI message passing program and evaluated (both experimentally and analytically) its performances.

Our implementation is an efficient on-line parallel MPEG-4 encoder over an SMP architecture, since the fine grain parallelism inside the standard is exhibited and exploited by the code generation process. Moreover, our approach has the advantage of flexibility, since our MPEG-4 model and code generation process ease the addition of components: adding custom non-standard encoding feature components to the model, such as filtering, impacts neither the partitioning/scheduling heuristic nor the MPI program generation.

Currently, we are working on the evaluation of a performance-quality-rate control mechanism, which includes a controller aware of the number of quality changes [11], on a wireless videophone application. To ease the systematic development and evaluation of such new ideas, we are also working on the design of an automated compilation chain in which the programmer provides the original parallel specification and the compiler performs the transformation steps (given as input) one after another and generates the target distributed program.

References

[1] Information technology – Coding of audio-visual objects – Part 2: Visual. ISO/IEC 14496-2:2001.

[2] I. Ahmad, S. M. Akramullah, M. L. Liou, and M. Kafil. A scalable off-line MPEG-2 video encoding scheme using a multiprocessor system. Parallel Computing, 27(6):823–846, 2001.

[3] S. M. Akramullah, I. Ahmad, and M. L. Liou. A data-parallel approach for real-time MPEG-2 video encoding. Journal of Parallel and Distributed Computing, 30(2):129–146, November 1995.

[4] D. Barbosa, J. P. Kitajima, and W. M. Jr. Parallelizing MPEG video encoding using multiprocessors. In Proceedings of the XII Brazilian Symposium on Computer Graphics and Image Processing, Campinas, SP, Brazil, October 1999.

[5] V. Bertin, J.-M. Daveau, P. Guillaume, T. Lepley, D. Pilat, C. Richard, M. Santana, and T. Thery. FlexCC2: An optimizing retargetable C compiler for DSP processors. In EMSOFT'02, pages 282–398, 2002.

[6] S. Cook. A taxonomy of problems that have fast parallel algorithms. Information and Control, 64:2–22, 1985.

[7] A. Darte, Y. Robert, and F. Vivien. Scheduling and Automatic Parallelization. Birkhäuser, Boston, 2000.

[8] T. Olivares, P. Cuenca, F. Quiles, and A. Garrido. Parallelization of the MPEG coding algorithm over a multicomputer. A proposal to evaluate its interconnection network. In IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, PACRIM, volume 1, pages 113–116, New York, USA, September 1997.

[9] K. Shen, L. A. Rowe, and E. J. Delp. A parallel implementation of an MPEG1 encoder: Faster than real-time! In Computers and Graphics, 1995.

[10] V. Valero, F. L. Pelayo, F. Cuartero, and D. Cazorla. Specification and analysis of the MPEG-2 video encoder with timed-arc Petri nets. In R. Cleaveland and H. Garavel, editors, Electronic Notes in Theoretical Computer Science, volume 66. Elsevier, 2002.

[11] C. Wust, L. Steffens, R. Bril, and W. Verhaegh. QoS control strategies for high-quality video processing. In 16th Euromicro Conference on Real-Time Systems (ECRTS'04), Sicily, Italy, 2004.
