
Performance and Memory Profiling for Embedded System Design

Heiko Hubert, Benno Stabernack, Kai-Immo Wels
Image Processing Department, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut
Einsteinufer 37, 10587 Berlin, Germany
{huebert, stabernack, wels}@hhi.fraunhofer.de

Abstract- The design of embedded hardware/software systems is often subject to strict requirements concerning various aspects, including real-time performance, power consumption and die area. Especially for data-intensive applications, such as multimedia systems, the number of memory accesses is a dominant factor for these aspects. In order to meet the requirements and design a well-adapted system, the software parts need to be optimized and an adequate hardware architecture needs to be designed. For complex applications this design space exploration can be rather difficult and requires in-depth analysis of the application and its implementation alternatives. Tools are required which aid the designer in the design, optimization and scheduling of hardware and software. We present a profiling tool for fast and accurate performance and memory access analysis of embedded systems and show how it can be applied within the design flow. The concept has been proven in the design of a mixed hardware/software system for H.264/AVC video decoding.

Keywords- profiling, embedded hardware/software systems, design space exploration, scheduling

I. INTRODUCTION

The design of an embedded system often starts from a software description of the system in C language. For example, the designer writes an executable specification based on a reference implementation of the application, e.g. from standardization organizations or the open-source community. This software code is often not optimized in any manner, because it mainly serves the purpose of functional and conformance testing. It therefore has to be transformed into an efficient system, including hardware and software components. The design of the system requires the following steps: system architecture design, hardware/software partitioning, software optimization, design of hardware accelerators and system scheduling. All these steps require detailed information about the performance of the different parts of the application. Besides the arithmetic demands of the application, memory accesses can have a huge influence on performance and power consumption. This is especially the case for data-intensive applications, such as multimedia systems, due to the huge amount of data to be transferred in these applications. The problem is aggravated further if the given data bandwidth is not used efficiently.

In order to reduce the overall data traffic, those parts of the code which require a high amount of data transfers have to be identified and optimized. The above-mentioned applications contain up to 100,000 lines of source code. Therefore tools are required which help the designer identify the critical parts of the software. Several analysis tools exist: timing analysis is provided by gprof or VTune, and memory access analysis is part of the ATOMIUM [2] tool suite. However, all these tools provide only approximate results for either timing or memory accesses. A highly accurate memory analysis can be done with a hardware (HDL) simulator, if an HDL model of the processor is available. However, such an analysis implies a long simulation time.

In order to achieve a fast and accurate solution, we developed a specialized profiler, called Memtrace [3], for obtaining performance and memory access statistics. This paper describes the tool with all its features. We show how the provided profiling results can be used during the design and optimization of embedded hardware/software systems. As a case study, Memtrace is applied during the design of a mixed hardware/software system for H.264/AVC video decoding. Starting from a software implementation, it is shown how the software is optimized, an efficient hardware architecture is developed, and the system tasks are scheduled based on the profiling results.

II. MEMTRACE: A PERFORMANCE AND MEMORY PROFILER

A. Tool Architecture

Memtrace is a non-intrusive profiler which analyzes the memory accesses and real-time performance of an application without the need for instrumentation code. The analysis is controlled by information about variables and functions in the user application, which is automatically extracted from the application. Furthermore, the user can specify the system parameters, e.g. the processor type and the memory system. During the analysis, Memtrace utilizes the instruction set simulator ARMulator [1] for executing the application. The ARMulator provides Memtrace with the information required for the analysis, e.g. the program counter, the clock cycle counter and the memory accesses. Memtrace creates detailed results on memory accesses and timing for each function and variable in the code.


Figure 1. Performance analysis tool: Memtrace profiles the performance and memory accesses of a user application.

B. Analysis Workflow

The performance analysis with Memtrace is carried out in three steps: the initialization, the performance analysis and the postprocessing of the results.

During initialization Memtrace extracts the names of all functions and variables of the application. During this process user variables and functions are separated from standard library functions, such as printf() or malloc(). This is achieved by comparing the symbol table of the executable with the ones of the user library and object files. The results are written to the analysis specification file. The specification file can be edited by the user, e.g. for adding user-defined memory areas, such as the stack and heap variables, for additional analysis. Furthermore the user can define a so-called "split function", which instructs Memtrace to produce snapshot results each time the split function is called. This can be used, e.g. in video processing, for generating separate profiling results for each processed frame. Additionally the user can control whether the analysis results of a function, e.g. clock cycles, should include the results of called functions (accumulated) or only reflect the function's own results (self). Typically auxiliary functions, e.g. C library or simple arithmetic functions, are accumulated to the calling functions.

In the second step the performance analysis is carried out, based on the analysis specification and the system specification, as shown in Figure 1. The system specification includes the processor, cache and memory type definitions. The Memtrace backend connects to the instruction set simulator for the simulation of the user application and writes the analysis results of the functions and variables to files; see Section II.C for more details. If a split function has been specified, these files include tables for each call of the split function. TABLE I shows exemplary results for function profiling. The output files serve as a database for the third step, where user-defined data is extracted from these tables.

TABLE I. 32-BIT EXEMPLARY RESULT TABLE FOR FUNCTIONS

f ca cyl Is Id 18 st s8 pm cm BI BC BDfl 8 215 75 22 7 52 3 42 5 123 92 02 2 295 39 35 3 14 9 17 9 55 153 87f3 2 432 78 68 4 10 2 31 17 143 289 0

Abbreviations are: f: function; ca: calls, yl: bus (clock) cycles; ls: all load/store accesses fromthe core; Id: all loads; 18: byte and half-word loads; st: all stores; s8: byte and half-word stores;

pm: page misses; cm: cache miss; BI: bus idle cycles, BC: core bus cycles, BD: DMA bus cycles

In the third step a postprocessing of the results can be performed. Memtrace allows the generation of user-defined tables which contain specific results of the analysis, e.g. the load memory accesses for each function. Furthermore, the results of several functions can be accumulated in groups for comparing the results of entire application modules. The user-defined tables are written to files in a tab-separated format. Thus they can be further processed, e.g. by spreadsheet programs for creating diagrams.

C. Tool Backend - Interface to the ISS

Memtrace communicates with the Instruction Set Simulator (ISS) via its backend, as depicted in Figure 2. The backend is implemented as a dynamic link library (DLL), which connects to the ISS. Currently only the ARM instruction set simulator ARMulator is supported. The backend is automatically called by the ISS during simulation. During the startup phase, the backend creates a list of all functions and marks the user and split functions found in the analysis specification file. For each function a data structure is created, which contains the function's start address and variables for collecting the analysis results. Finally two pointers, called currentFunction and evaluatedFunction, are initialized. The first pointer indicates the currently executed function and, if this function should not be evaluated separately, the second pointer indicates the calling function to which the results of the current function should be added.

Figure 2. Interface between the Memtrace backend and the ISS

Each time the program counter changes, Memtrace checks whether the program execution has changed from one function to another. If so, the cycle count of the evaluatedFunction is recalculated and the call count of the currentFunction is incremented. Finally the pointers to the currentFunction and evaluatedFunction are updated. If the currentFunction is a split function, the differential results from the last call of this function up to the current call are printed to the result files.
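The following minimal C sketch illustrates this bookkeeping. It is an illustration only: the data structure, the callback name on_pc_changed() and the helpers find_function_by_address() and emit_snapshot_tables() are assumptions made for this sketch and do not reflect the actual Memtrace or ARMulator API.

```c
/* Illustrative sketch of a backend callback on program-counter changes.
 * All names and structures are assumptions; the real Memtrace backend
 * uses the ARMulator callback interface. */
typedef struct FunctionInfo {
    const char   *name;
    unsigned int  start_addr;     /* first instruction address            */
    int           is_user;        /* evaluate separately (vs. accumulate) */
    int           is_split;       /* "split function": emit snapshots     */
    unsigned long calls;
    unsigned long cycles;         /* accumulated bus (clock) cycles       */
} FunctionInfo;

FunctionInfo *find_function_by_address(unsigned int pc); /* hypothetical lookup helper */
void emit_snapshot_tables(void);                         /* hypothetical result writer */

static FunctionInfo *currentFunction;    /* function the PC is in          */
static FunctionInfo *evaluatedFunction;  /* function results are booked to */
static unsigned long lastCycleCount;

/* Called by the ISS whenever the program counter changes. */
void on_pc_changed(unsigned int pc, unsigned long cycleCount)
{
    FunctionInfo *f = find_function_by_address(pc);
    if (f == currentFunction)
        return;                                   /* still in the same function */

    /* book the cycles spent since the last switch to the evaluated function */
    evaluatedFunction->cycles += cycleCount - lastCycleCount;
    lastCycleCount = cycleCount;

    f->calls++;
    currentFunction = f;
    /* library/auxiliary functions stay booked to their caller */
    evaluatedFunction = f->is_user ? f : evaluatedFunction;

    if (f->is_split)
        emit_snapshot_tables();   /* differential results since the last split call */
}
```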


For each access that occurs on the data bus (to the data cache or TCM), the memory access counters of the evaluatedFunction are incremented. Depending on the information provided by the ARMulator, it is determined whether a load or a store access was performed and which bit width (8/16 or 32 bit) was used. Furthermore the ARMulator indicates whether a cache miss occurred. Page hits and misses are calculated by comparing the address of the current with the previous memory access and incorporating the page structure of the memory.
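A small sketch of how such an access handler could look is given below. The counter structure, the callback signature and the fixed page size are assumptions made for illustration; the real backend obtains this information through the ARMulator interface.

```c
/* Illustrative sketch of booking one data-bus access per function. */
typedef struct {
    unsigned long loadStore, loads, loads8, stores, stores8;
    unsigned long pageMisses, cacheMisses;
} AccessCounters;

static AccessCounters *evaluated;              /* counters of the evaluated function */
static unsigned int lastAddr = 0xFFFFFFFFu;
#define PAGE_SIZE 2048u                        /* assumed DRAM page size in bytes */

void on_data_access(unsigned int addr, int is_store, int width_bits, int cache_miss)
{
    evaluated->loadStore++;                            /* ls: all load/store accesses */
    if (is_store) { evaluated->stores++; if (width_bits < 32) evaluated->stores8++; }
    else          { evaluated->loads++;  if (width_bits < 32) evaluated->loads8++;  }
    if (cache_miss) evaluated->cacheMisses++;          /* cm, as reported by the ISS */

    /* pm: page miss if this access falls into a different memory page
     * than the previous access */
    if ((addr / PAGE_SIZE) != (lastAddr / PAGE_SIZE))
        evaluated->pageMisses++;
    lastAddr = addr;
}
```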

For each bus cycle (on the external memory bus), Memtrace checks whether it was an idle cycle, a core access or a DMA access and increments the appropriate counter of the evaluatedFunction.

At the end of the simulation the results of the last evaluatedFunction are updated and the results of the last call of the split function and the accumulated results are printed to the result files.

D. Memtrace Frontend

Memtrace comes with two frontends, a command-line interface and a graphical user interface (GUI). The command-line interface is well suited for use in batch files, for example for profiling a set of system configurations or input data. The GUI version allows easy and fast access to all features of the tool and is especially helpful for the quick generation of result diagrams.

Figure 3. Memtrace GUI frontend

E. Portability to Other Processor Architectures

The current version of Memtrace is targeted only at the ARM processor family, as it uses the ISS from ARM (ARMulator). However, the interface of the profiler, as described above, is rather simple and could be ported to other processor architectures if an instruction set simulator is available which allows debugging access to its memory busses. Our plans for future work include Memtrace backends for other processor architectures.

As long as other backends are not available, the ARM-based profiling results may serve as a rough estimate of the results on other RISC processor architectures. Since all processors of the ARM family can be profiled, a wide variety of architectural features is covered, including variations of pipeline length, instruction bit-width, availability of DSP/SIMD instructions, MMUs, cache size and organization, tightly coupled memories, bus width and detailed memory timing options. For a profiling estimation of a non-ARM processor, an ARM processor with a similar feature set should be chosen. TABLE II lists common embedded processors which have similarities with ARM processors. They have a basic feature set in common, including a 32-bit Harvard architecture with caches, a 5- to 8-stage pipeline and a RISC instruction set. It has to be mentioned, however, that some of the processors provide specific features which may have a significant influence on the performance, for example the custom instruction extensions of the ARC and Tensilica Xtensa processors.

TABLE II. 32-BIT EMBEDDED RISC PROCESSORS

Processor               Pipeline   Registers(a)  Instr./Data Cache, TCM(a)   Special Features
ARM9E                   5-stage    16            128k/128k, yes/yes          coprocessor interf.
ARM11                   8-stage    16            64k/64k, yes/yes            SIMD, branch pred., 64-bit bus, coprocessor interf.
ARC600                  5-stage    32 (~60)      32k/32k, 512k/16k           custom instr., extend. reg. file
ARC700                  7-stage    32 (~60)      64k/64k, 512k/256k          custom instr., branch pred., extend. reg. file, 64-bit bus
Tensilica Xtensa7       5-stage    64 or more    32k/32k, 256k/256k          custom instr., windowed regs., up to 128-bit bus
Tensilica Diamond 232L  5-stage    32            16k/16k                     windowed regs.
Lattice Mico32          6-stage    32            32k/32k                     -
Altera NIOS II          5-6-stage  32            64k/64k, yes/yes            direct-map. cache, custom instr.
Xilinx MicroBlaze v5    5-stage    32            64k/64k, yes/yes            direct-map. cache, coprocessor interf.
MIPS 4KE                5-stage    32            64k/64k, yes/yes            coprocessor interf.
openRISC OR1200         5-stage    32            64k/64k                     direct-map. cache, custom instr.
LEON3                   7-stage    520           1M/1M, yes/yes              windowed regs., coprocessor interf.

(a) Many features are customizable; the maximum value is given.

III. MEMTRACE WITHIN THE DESIGN FLOW

This chapter describes how the profiler can be applied during the design of embedded systems. Figure 4 shows a typical design flow for such hardware/software systems. Starting from a functionally verified system description in software, this software is profiled with an initial system specification in order to measure the performance and to see if the (real-time) requirements are met. If not, an iterative cycle of software and hardware partitioning, optimization and scheduling starts. In this process detailed profiling results are crucial for all steps in the design cycle.


Figure 4. Typical embedded system design flow

A. Hardware/Software Partitioning and Design Space Exploration

For the definition of a starting point of a system architecture, an initial design space exploration should be performed. This includes a variation of the following parameters:

* processor type

* cache size and organization

* tightly coupled memories

* bus timing

* external memory system and timing (DRAM, SRAM)

* hardware accelerators, DMA controller

Memtrace can be run in batch mode, and thus different system configurations can be tested and profiled. In this way the influence of the system architecture on the performance can be evaluated. This initial profiling also reveals the hot spots of the software. The most time-consuming functions are good candidates for either software optimization or hardware acceleration. Especially computationally intensive functions are well suited for hardware acceleration in a coprocessor. With support of a DMA controller even the burden of data transfers can be taken from the processor. Control-intensive functions are better suited for software implementation, as a hardware implementation would lead to a complex state machine, which requires a long design time and often does not allow parallelization. In order to get a first idea of the influence of hardware acceleration, a (well-educated) guess of an acceleration factor can be defined for each hardware candidate function. This factor is used by Memtrace in order to manipulate the original profiling results.
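The sketch below shows, purely as an illustration, how such educated-guess factors might be applied to the profiled cycle counts to obtain a first-order estimate. The simple linear scaling and the assumption of non-overlapping candidates are ours, not a description of Memtrace's internal mechanism.

```c
/* Illustrative what-if estimate: scale the profiled cycles of hardware-candidate
 * functions by an assumed acceleration factor. */
typedef struct {
    const char   *name;
    unsigned long profiledCycles;   /* cycles measured in the software profile */
    double        hwSpeedupGuess;   /* educated guess, 1.0 = no acceleration   */
} HwCandidate;

unsigned long estimate_total_cycles(const HwCandidate *c, int n,
                                    unsigned long totalProfiledCycles)
{
    unsigned long estimate = totalProfiledCycles;
    for (int i = 0; i < n; i++) {
        /* cycles saved by accelerating this candidate */
        unsigned long saved = c[i].profiledCycles -
                              (unsigned long)(c[i].profiledCycles / c[i].hwSpeedupGuess);
        estimate -= saved;          /* assumes the candidates do not overlap */
    }
    return estimate;
}
```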

B. Software Profiling and Optimization

After a partitioning into hardware and software is found, the software part can be optimized. Numerous techniques exist that can be applied for optimizing software, such as loop unrolling, loop-invariant code motion, common subexpression elimination or constant folding and propagation. For computationally intensive parts arithmetic optimizations or SIMD instructions can be applied, if such instructions are available in the processor. If the performance of the code is significantly influenced by memory accesses, as is mainly the case in video applications, the number of accesses has to be reduced or the accesses have to be accelerated. The profiler gives a detailed overview of the memory accesses and thereby allows identifying the influence of the memory accesses. One optimization mechanism is the conversion of byte (8-bit) to word (32-bit) memory accesses. This can be applied if adjacent bytes in memory are required concurrently or within a short time period, for example pixel data of an image during image processing. A further mechanism is the usage of tightly coupled memories (TCMs) for storing frequently used data. For finding the most frequently accessed data areas, the memory access statistics of Memtrace can be used. In [1] these techniques are described in more detail.
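The byte-to-word conversion can be illustrated with the following minimal sketch. It assumes word-aligned, little-endian pixel data and is meant as an example of the technique, not as code taken from the decoder.

```c
/* Byte-to-word access conversion: one 32-bit load instead of four byte loads.
 * Alignment and strict-aliasing concerns are ignored in this sketch. */
#include <stdint.h>

/* original version: four separate 8-bit loads from memory */
static inline void load4_bytes(const uint8_t *src, uint8_t px[4])
{
    px[0] = src[0]; px[1] = src[1]; px[2] = src[2]; px[3] = src[3];
}

/* optimized version: one 32-bit load, pixels unpacked in registers */
static inline void load4_word(const uint8_t *src, uint8_t px[4])
{
    uint32_t w = *(const uint32_t *)src;   /* src assumed 4-byte aligned, little-endian */
    px[0] = (uint8_t)(w);
    px[1] = (uint8_t)(w >> 8);
    px[2] = (uint8_t)(w >> 16);
    px[3] = (uint8_t)(w >> 24);
}
```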

C. Hardware/Software Profiling and Scheduling

Besides the software profiling and optimization, a system simulation including the hardware accelerators needs to be carried out in order to evaluate the overall performance. Usually hardware components are developed in a hardware description language (HDL) and tested with an HDL simulator. This task requires long development and simulation times. Therefore HDL modelling is not suitable for the early design cycles, where exhaustive testing of different design alternatives is important. Furthermore, if the system performance is data dependent, a large set of input data should also be tested to get reliable profiling results. Therefore a simulation and profiling environment is required which allows short modification and simulation times.

For this purpose, we used the instruction set simulator and extended it with simulators for the hardware components of the system. The ARMulator provides an extension interface which allows the definition of a system bus and peripheral bus components. It already comes with a bus simulator, which reflects the industry-standard AMBA bus, and a timing model for access times to memory-mapped bus components, such as memories and peripheral modules, see Figure 5.

Figure 5. Environment for hardware/software cosimulation and profiling


1) Coprocessors

We supplemented this system with a simple template for coprocessors, including local registers and memories and a cycle-accurate timing. The functionality of the coprocessor can be defined as standard C code; thus a software function can be simulated as a hardware accelerator by copying the software code into the coprocessor template. The timing parameter can be used to define the delay of the coprocessor between activation and result availability, i.e. the execution time of the task as it would be in real hardware. This value can either be obtained from reference implementations found in the literature or by an educated guess of a hardware engineer. Furthermore, often multiple hardware implementations of a task with different execution times (and hardware costs) are possible. In the proposed profiling environment, simply by varying the timing parameter and viewing its influence on the overall performance, a good trade-off between hardware cost and speed-up can be found quickly.
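A minimal sketch of such a coprocessor model is given below. The structure, the register and memory sizes and the function names (coproc_start(), coproc_done(), and coproc_behaviour() standing for the copied software function) are assumptions for illustration, not the actual Memtrace template.

```c
/* Illustrative coprocessor simulation model: the functionality is plain C
 * (copied from the software function), the latency is a user-defined parameter. */
void coproc_behaviour(unsigned int *mem, unsigned int *regs); /* copied SW function (hypothetical) */

typedef struct {
    unsigned int latencyCycles;     /* assumed execution time in hardware            */
    unsigned int busyUntil;         /* bus-cycle count at which the result is ready  */
    unsigned int localMem[256];     /* local input/output memory of the coprocessor  */
    unsigned int regs[8];           /* local control/parameter registers             */
} CoprocModel;

/* called when the CPU (or DMA) writes the "start" register */
void coproc_start(CoprocModel *c, unsigned int currentCycle)
{
    coproc_behaviour(c->localMem, c->regs);   /* run the original C code as "hardware" */
    c->busyUntil = currentCycle + c->latencyCycles;
}

/* called when the CPU polls the "done" register */
int coproc_done(const CoprocModel *c, unsigned int currentCycle)
{
    return currentCycle >= c->busyUntil;
}
```

Varying latencyCycles and re-profiling directly shows how much accelerator speed actually buys at the system level.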

2) DMA Controller

For data-intensive applications, data transfers have a tremendous influence on the overall performance. In order to efficiently outsource tasks into hardware accelerators, the burden of data transfers also has to be taken from the CPU. This job can be performed by a DMA controller. The Memtrace hardware profiling environment includes a highly efficient DMA controller with the following features:

* multi-channel (parameterizable number of channels)
* 1D and 2D transfers
* activation FIFO (non-blocking, autonomous transfers)
* internal memory for temporary storage between read and write
* burst transfer mode

Thus the designer is enabled to determine the influence of different DMA modes, as sketched below, in order to find an appropriate trade-off between DMA controller complexity and required CPU activity.
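The following sketch models a 2D transfer descriptor as it could be used by such a controller. The field names and the purely functional memcpy-based model are illustrative assumptions derived from the feature list above, not a register-level description of the actual DMA controller.

```c
/* Illustrative functional model of a 2D DMA transfer descriptor. */
#include <string.h>
#include <stdint.h>

typedef struct {
    uint32_t src, dst;          /* byte offsets into the simulated memory       */
    uint32_t width;             /* bytes per line (e.g. one macroblock row)     */
    uint32_t height;            /* number of lines; height == 1 -> 1D transfer  */
    uint32_t srcStride;         /* distance between lines in the source image   */
    uint32_t dstStride;         /* distance between lines in the destination    */
} DmaDescriptor;

/* one channel processing a descriptor taken from its activation FIFO */
void dma_execute(uint8_t *mem, const DmaDescriptor *d)
{
    for (uint32_t line = 0; line < d->height; line++)
        memcpy(mem + d->dst + (size_t)line * d->dstStride,
               mem + d->src + (size_t)line * d->srcStride,
               d->width);
}
```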

3) Scheduling

After the software and hardware tasks have been defined, a scheduling of these tasks is required. For increasing the overall performance, a high degree of parallelization should be accomplished between hardware and software tasks. In order to find an appropriate scheduling for parallel tasks the following information is required:

* dependencies between tasks

* the execution time of each task

* data transfer overhead

Especially for data-intensive applications the overhead for data transfers can have a huge influence on the performance. It might even happen that the speed-up of a hardware accelerator is cancelled out by the overhead for transferring data to and from the accelerator.
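As a simple rule of thumb (our formulation, not an equation from the paper), offloading a task only pays off if the accelerator time plus all transfer and synchronization overhead stays below the original software execution time:

```latex
t_{hw} + t_{transfer} + t_{sync} \;<\; t_{sw}
```

where t_hw is the coprocessor execution time, t_transfer the time for moving input and output data (by CPU load/store or DMA), t_sync the control and cache-maintenance overhead, and t_sw the execution time of the original software implementation.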

The overhead for data transfers to the coprocessors depends on the bus usage. Furthermore, side effects on other functions may occur if bus congestion arises or when cache flushing is required in order to ensure cache coherency. In order to find these side effects, detailed profiling of the system performance and the bus usage is required. Memtrace provides these results; for example, Figure 6 shows the bus usage for each function depending on the access time of the memory.

Figure 6. Bus usage for each function, depending on the memory type (bus access and bus idle cycles for SRAM and DRAM)

4) HDL Simulation

In a later design phase, when the hardware/software partitioning is fixed and an appropriate system architecture is found, the hardware components need to be developed in a hardware description language and tested using an HDL simulator, such as ModelSim. Finally, the entire system needs to be verified, including hardware and software components. For this purpose the instruction set simulator and the HDL simulator have to be connected. The codesign environment PeaCE [4] allows the connection of the ModelSim simulator and the ARMulator.

IV. APPLICATION EXAMPLE: H.264/AVC VIDEO DECODER FOR MOBILE TV TERMINALS

The proposed design methodology has been applied to the design of a video decoder as part of a mobile digital TV receiver. Starting from an executable specification of the video decoder, namely the (unoptimized) reference software, at first a pure optimized software implementation and then an ASIC incorporating hardware accelerators and a customized processor have been developed.

A. DVB-H and H.264/AVC Video Compression

The receiver is compliant with DVB-H, a new standard for broadcasting digital audio and video content to mobile devices. The content is encoded using highly efficient compression methods, namely HE-AAC for audio data and the H.264/AVC [5] codec for video content. DVB-H focuses on high mobility and low power consumption of the receivers. The most demanding part of the receiver in terms of computational requirements is the H.264/AVC video decoder.

The H.264/AVC video compression standard is similar to its predecessors; however, it adds various new coding features and refinements of existing mechanisms, which lead to a two to three times higher coding efficiency compared to MPEG-2. However, the computational demands and required data accesses have also increased significantly. Figure 7 depicts the block diagram of an H.264/AVC decoder.


Figure 7. Block diagram of an H.264/AVC decoder

The bitstream parsing and entropy decoding interpret the encoded symbols and are highly control-flow dominated. The symbols contain control information and data for the following components. The inter and intra prediction modes are used to predict image data from previous frames or neighboring blocks, respectively. Both methods require filtering operations, whereby the inter prediction is more computationally demanding. The residuals of the prediction are received as transformed and quantized coefficients. The applied transformation, which can be considered a simplified discrete cosine transformation (DCT), is based on integer arithmetic and is computationally demanding. The reconstructed image is post-processed by a deblocking filter for reducing blocking artifacts at block edges. The deblocking filter includes the calculation of the filter strength, which is control-flow dominated, and the actual 3- to 5-tap filtering, which requires many arithmetic operations. Each of these components allows various modes of operation, which are chosen by symbols in the bitstream. This involves a high degree of control flow in the decoder.

The H.264/AVC baseline decoder has been profiled with Memtrace using a system specification typical for mobile embedded systems, comprising an ARM946E-S processor core, a data and an instruction cache (16 kByte each) and an external DRAM as main memory. The execution time for each module of the decoder has been evaluated, as depicted in Figure 8. The results show that the distribution over the modules differs significantly between I- and P-frames. Whereas in I-frames the deblocking has the most influence on the overall performance, in P-frames the motion compensation is the dominant part.

B. Design and Optimizations

Based on the acquired profiling results, several software and hardware architectural optimizations are applied. Our first target is a pure software version of the video decoder for the implementation of a DVB-H terminal on a PDA. In a second step an embedded hardware/software system is developed.

1) Software Implementation and Optimizations

Following Amdahl's law, those parts of the software which take up most of the execution time should be considered for optimization first. Figure 8 shows that motion compensation, loop filter, inverse transformation and memory-related functions are those candidates. Exploring the results of the functions corresponding to the motion compensation, it can be seen that the function motionCompChroma() requires the most execution time. This function performs the motion compensation for the chrominance pixels, which is mainly based on bilinear interpolation. Focusing on the read memory accesses performed in motionCompChroma(), as given in the second column of TABLE III, it shows that more than 30% are byte or half-word accesses (third column). This is due to the fact that the pixel values have a size of one byte each.

Figure 8. Profiling results for the H.264/AVC software decoder

Since the interpolation is applied iteratively on adjacent pixels, the source code can be optimized by reading 4 adjacent bytes at once. This leads to a reduction of the execution time of the function by almost 30%. The speed-up of the function leads to a reduction of the execution time for processing a P-frame by about 5%.

TABLE III. PROFILING RESULTS FOR THE MOTIONCOMPCHROMA() FUNCTION

                      Clock Cycles   All Loads   Loads 8/16
before optimization   13,149,109     309,368     104,784
after optimization     9,355,709     196,746      34,584

Further speed-up of the software could be achieved by applying well-known software optimization techniques and those proposed in [3] to the functions identified by the profiler. The resulting software decoder has been tested on an Intel PXA270-based PDA within the DVB scenario. The required processor clock frequency for H.264/AVC decoding is about 420 MHz (320x240 pixel resolution, 384 kbit/s).

Considering the dynamic power consumption of CMOS circuits, given in equation (1), the rather high system frequency leads to a high power consumption.

\[ P_{dynamic} = \sum_{k=1}^{M} C_k \cdot f_k \cdot V_{DD}^{2} \qquad (1) \]

For achieving a lower power consumption, methods need to be applied which allow the reduction of the system frequency, which in turn also allows a lower supply voltage (voltage scaling). Hardware accelerators can be used for this purpose. However, their influence on the switched capacitance has to be considered and reduced by mechanisms like clock gating. Furthermore, the memory architecture needs to be adapted (reduced) to the specific application requirements.
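To illustrate the leverage of equation (1) with assumed example numbers (not measured values from this design): halving the clock frequency and scaling the supply voltage from 1.8 V down to 1.2 V reduces the dynamic power to roughly a fifth,

```latex
\frac{P'_{dynamic}}{P_{dynamic}}
  = \frac{f/2}{f}\cdot\left(\frac{1.2\,\mathrm{V}}{1.8\,\mathrm{V}}\right)^{2}
  \approx 0.5 \cdot 0.44 \approx 0.22
```

assuming the switched capacitance stays constant.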


2) Memory System

Besides the processing power of the CPU, the memory and bus architecture determine the overall performance of the system. Namely the cache size and architecture, the speed and usage of a tightly coupled (on-chip) memory (TCM), the width of the memory bus, the bandwidth of the off-chip memory and a DMA controller are the most influential factors. Adjusting these factors requires a trade-off between hardware cost, power consumption and performance. The H.264/AVC decoder has been simulated with different cache sizes in order to find an appropriate size for the DVB-H terminal scenario (QVGA image resolution). It has been evaluated how the required decoding time changes when either the instruction cache size or the data cache size is increased, see Figure 9.

Figure 9. Influence of the instruction (I) and data (D) cache sizes on the execution time of the H.264/AVC decoder

The results show that increasing the instruction cache size from 4 kByte up to 32 kByte has a minor influence on the overall performance. However, adding a data cache of 4 kByte to the system decreases the decoding time to less than 20%. Further increasing the data cache size does not yield a dramatic performance increase. Therefore a data and instruction cache size of 4 kByte each is a good trade-off between performance and die area. The data cache increases the performance by decreasing the number of accesses to the external memory. This is especially efficient for data areas with frequent accesses to the same memory location, e.g. the stack. However, for randomly accessed data areas, e.g. lookup tables, a fast on-chip memory (SRAM) is more appropriate. As the H.264/AVC decoder requires about 1.1 MByte of data memory (at QVGA video resolution), only small parts of the used data structures (less than 3% with 32 kByte of SRAM) can be stored in the on-chip memory. In order to find a useful partitioning of data areas between on-chip and off-chip memory, it is required to profile the accesses to each data area of the decoder. Since a data cache is instantiated, accesses to these memories only happen if cache misses occur. Therefore, the cache misses have been analyzed separately for each data area in the code, including global variables, heap variables and the stack. Data areas with many cache misses are stored in on-chip memory.
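In practice, such a placement can be expressed directly in the source code. The sketch below is only an illustration: the section name ".tcm_data", the GCC-style attribute and the example data structures are assumptions, and the actual mapping to on-chip RAM is done in the linker script of the target toolchain.

```c
/* Illustrative placement of a frequently missed data area into on-chip memory
 * (TCM/SRAM). The section name and attribute syntax are toolchain-dependent;
 * the linker script must map ".tcm_data" to the on-chip RAM address range. */
#include <stdint.h>

/* randomly accessed lookup table: poor cache behaviour, good TCM candidate */
__attribute__((section(".tcm_data")))
static uint8_t clipTable[1024];

/* frequently reused local buffer: profits from the data cache instead and
 * can stay in external memory */
static uint8_t mbBuffer[16 * 16];
```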

3) Hardware/Software Partitioning

In order to further increase the system efficiency and decrease power consumption and hardware costs, the CPU can be enhanced by coprocessors. Again, the hot spots in the software code should be considered, namely the loop filter, the motion compensation and the integer transformation. These are the foremost candidates for hardware implementation. All these components are demanding on the arithmetic rather than on the control-flow level. Therefore they are well suited for hardware implementation as coprocessors, which can be controlled by the main CPU. In order to ease the burden of providing the coprocessors with data, a DMA controller can be applied, allowing memory transfers concurrently with the processing of the CPU. The coprocessors should be equipped with local memory for storing input and output data for processing at least one macroblock at a time, preventing fragmented DMA transfers. As the video data is stored in the memory in a two-dimensional fashion, the DMA controller should feature 2D memory transfers. The output of the video data to a display, which is required by a DVB-H terminal, further increases the problem of the high amount of data transfers.

4) Hardware/Software Interconnection and Scheduling

After the software optimization is performed and the hardware accelerators are developed, a scheduling of the entire system is required. The scheduling is static and controlled by the software. The hardware accelerators are introduced step by step into the system. Starting from the pure software implementation, at first the software functions are replaced by their hardware counterparts. This also requires the transfer of input data to and output data from the coprocessors. These data transfers are at first executed by load/store operations of the processor and in a next step replaced by DMA transfers. This may also require flushing the cache or cache lines, which can decrease the performance of other software functions. In a final step the parallelization of the hardware and software tasks takes place. All decisions taken in these steps are based on detailed profiling results.

The following example shows how the hardware accelerator for the deblocking is inserted into the software decoder. The hardware accelerator only includes the filtering process of the deblocking stage; the filter strength calculation is performed in software, because it is rather control intensive and therefore more suitable for software implementation. The filter processes the luminance and chrominance data for one macroblock at a time. It requires the pixel data and filter parameters as input and provides filtered image data as output; this sums up to about 340 32-bit words of data transfer. Figure 10 shows the results for the pure software implementation, when using the filter accelerator with data transfers managed by the processor, and when additionally using the DMA controller. As can be seen, if the data is transferred by the processor, the performance gain of the accelerator is cancelled out by the data transfers; only in conjunction with the DMA controller can the coprocessor be used efficiently.

Figure 10. Clock cycle comparison of different deblocking implementations (pure SW, HW with CPU load/store transfers, HW with DMA transfers)


C. Hardware/Software System Implementation

The profiling and implementation results of the previous sections lead to a mixed hardware/software implementation of the video decoder, which is given in Figure 11. An application processor is extended with a companion chip for the acceleration of the video decoding. The companion chip contains the hardware accelerators for H.264/AVC decoding. TABLE IV shows a comparison of the required cycle times of the accelerators with their software counterparts.

TABLE IV. COMPARISON OF THE EXECUTION TIME IN HARDWARE AND SOFTWARE(a)

Implementation   Deblocking          Pixel Interpolation   Inverse Transform
Software         3000-7000 cycles    100-700 cycles        320 cycles
Hardware         232 cycles          16-34 cycles          30 cycles

(a) Memory transfers are not included in these cycle counts.

Furthermore, a so-called SIMD engine is available on the chip, which is a 32-bit RISC processor enhanced with special SIMD instructions. The 32-bit system bus connecting the processor core with the main memory and coprocessor components is augmented with a DMA controller, which supports the main processor by performing the memory transfers to the coprocessor units. A video output unit is implemented which directly drives a connected display or video DAC. To avoid a heavy load on the mentioned system bus due to transfers from a frame buffer to the video output interface, an extra frame buffer memory and the video output unit are provided on a separate video bus system. The data transfers between these bus systems are also performed by the DMA controller. The main control functionality of the decoder can either run on the application processor or on the RISC core of the companion chip.

Figure 11. SOC architecture of the DVB-H/DMB companion chip

To fully evaluate the proposed concept, the complete SOC architecture has been implemented as an ASIC design using UMC's L180 1P6M GII logic technology, see Figure 12. The maximum clock frequency of the design is 120 MHz, whereas 50 MHz should be sufficient for the DVB-H scenario. An evaluation board for the chip is currently under development. It allows full functional verification as well as exhaustive performance testing and power measurements, separately for the memory, core and IO supply voltages.

Figure 12. ASIC layout

V. CONCLUSIONS AND FUTURE WORK

The design of an efficient system for applications with high demands on real-time performance requires the selection of an appropriate system architecture and of the incorporated hardware and software components. For this decision a detailed knowledge of the computational demands of the application is mandatory. Furthermore, for data-intensive applications the influence of memory accesses also has to be taken into account. We presented a profiling tool which provides this information and have shown how it can be integrated into the design flow. The tool aids the designer in taking the right decisions during each step of the design, including the hardware/software partitioning, the optimization of the components and the system scheduling. We have applied this methodology to the development of a software solution and a hardware/software system for real-time video decoding.

Our future work includes the retargeting of the profiler backend to other processors. Many processor simulators already offer profiling capabilities, e.g. the LisaTek tool suite; however, their results are not as detailed as the Memtrace results. Furthermore, we plan to integrate power models for cache and memory accesses and for instruction execution in order to allow power consumption estimation. These models will be based on existing power models of caches and memories and on measurement results of the presented ASIC design.

REFERENCES

[1] RealView ARMulator ISS User Guide, Version 1.4, Ref: DUI0207C, January 2004, http://www.arm.com
[2] J. Bormans, K. Denolf, S. Wuytack, L. Nachtergaele, and I. Bolsens, "Integrating system-level low power methodologies into a real-life design flow," Ninth Int. Workshop on Power and Timing Modeling, Optimization and Simulation, pp. 19-28, Kos Island, Greece, Oct. 1999.
[3] H. Hubert, B. Stabernack, and H. Richter, "Tool-aided performance analysis and optimization of an H.264 decoder for embedded systems," Eighth IEEE International Symposium on Consumer Electronics (ISCE 2004), Reading, England, Sept. 2004.
[4] S. Ha, C. Lee, Y. Yi, S. Kwon, and Y.-P. Joo, "Hardware-software codesign of multimedia embedded systems: the PeaCE approach," 12th IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications, Sydney, Australia, Vol. 1, pp. 207-214, Aug. 2006.
[5] International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, Doc. JVT-G050, March 2003.
