PAC Duo SoC Performance Analysis with ESL Design Methodology

I-Yao Chuang, Chi-Wen Chang, Tso-Yi Fan, Jen-Chieh Yeh, Kung-Ming Ji, Jui-Liang Ma, An-Yeu Wu and Shih-Yin Lin*

Abstract - PAC Duo System on Chip (SoC) is a high-performance, low-power multimedia tri-core SoC designed at the SoC Technology Center (STC) of the Industrial Technology Research Institute (ITRI). Design complexity increases continuously as we integrate more components and try to evaluate system performance or explore the architecture further. In this paper, we present a system-level virtual platform and simulation environment for performance profiling and evaluation based on electronic system-level (ESL) design methodology. Through fine-tuning of functionality, timing, and simulation speed, the resulting virtual platform achieves high accuracy, with a cycle-count error of less than 10% against RTL simulation, while simulating 80 to 150 times faster than RTL. With this methodology, system function evaluation and performance profiling can be easily realized. We also show experimental results for various multimedia applications compared with RTL simulation and further demonstrate how the virtual platform successfully predicts the real chip performance on the evaluation board.

Index Terms - parallel architecture core (PAC), system on chip (SoC), electronic system-level (ESL), transaction-level modeling (TLM), performance evaluation.

I. INTRODUCTION

PACDSP is a 5-way VLIW DSP that has a scalable clustered datapath and an optimized instruction set with 8-bit/16-bit SIMD operations. To eliminate the data conflicts at the register file that are common in most VLIW processors, we use a distributed and ping-pong register file in PACDSP, which supports data bandwidth comparable to that of centralized register files but reduces silicon area by 76.8% and shortens access time by 46.9%. The five VLIW ways are further grouped into scalar and two DSP clusters. Each cluster in PACDSP can be turned off independently to save power. The code density of the VLIW is significantly improved through variable-length operation encoding, NOP removal, and embedded code replication mechanisms. The program sequencer can align VLIW packets with different numbers of operations, each of which is itself variable-length encoded. As a result, PACDSP achieves a BDTIsimMark2000 score of 8.8/MHz (released in February 2009), which outperforms most popular licensable cores (2.1/MHz for ARM9E, 6.4/MHz for Teaklite III, 8.1/MHz for Ceva-X 1620, 4.7/MHz for ZSP400, and 7.9/MHz for ZSP500).

1 All authors are with SoC Technology Center, Industrial Technology Research Institute, Hsinchu, 31040 Taiwan.

An-Yeu Wu is also with Department/Graduate Institute of Electrical Engineering, National Taiwan University, Taipei, 10617 Taiwan.

978-1-4244-3870-9/09/$25.00 ©2009 IEEE

PAC Duo SoC is designed with a heterogeneous multi-core architecture, composed of an ARM926 and two PACDSPs with some special function blocks, for example the Embedded Multimedia DMA (EMDMA), which is optimized for data transfer in multimedia codec applications, especially H.264 decoding. The SoC also includes a matrix of three buses (AXI, AHB, and APB) to separate the loading of the DSPs, the CPU, and slow-speed components. The complexity and flexibility of this multi-core SoC design pose many verification challenges [1]. An FPGA prototyping board helps, but it is available only after the RTL code is ready and offers limited observability of internal signals and states, while RTL simulation is too slow for real applications. Raising the abstraction level of the design minimizes modeling effort, increases simulation speed, and makes early verification and performance prediction possible.

Recently, SystemC has become a standard in system-level design. It is one of the leading C/C++ design environments and is an open-source simulation environment [2]. In addition, transaction-level modeling (TLM) becomes easier and more efficient because the entire system can be simulated within a single simulation engine. SystemC allows simulation at different abstraction levels, starting from a very high-level functional description and continuing, through refinement over time, to synthesizable RTL style, and it can even combine different levels in one model. The SystemC simulation kernel also handles parallel execution and provides the functions required to model hardware timing and concurrency [2].

Using SystemC to create an ESL virtual platform and profile system performance is also an active research topic. In [3], the authors use statistical models to calculate execution cycles for bus traffic analysis and system performance estimation. However, the main drawback of static analysis is the lack of dynamic information such as bus contention, arbitration, and dynamic scheduling. Dynamic analysis is based on dynamic simulation of all the hardware (HW) and software (SW) components, and a trade-off between accuracy and simulation time can be achieved by properly choosing the abstraction level for bus, memory, and IP block modeling [4]. For a real SoC platform, an ESL counterpart can be built to perform further analysis using real applications, which may be extremely time-consuming in RTL simulation and whose internal details may be difficult to observe on an FPGA development board. Beyond measuring system performance, it also provides a good way to find the bottleneck of the system and the direction for improvement [5-7].

In this paper, we propose a system-level performance evaluation methodology on an ESL virtual platform for our PAC (Parallel Architecture Core) Duo SoC. The virtual platform is built and refined against the behavior and timing of the RTL platform to achieve high accuracy while keeping good simulation speed by adopting a proper abstraction level for core, bus, memory, and IP modeling. Experiments show the system performance and the accuracy for various multimedia applications, and the platform successfully predicts the real chip performance when compared with results measured on the evaluation board.

The rest of this paper is organized as follows. Sections II and III describe the PAC Duo platform and the construction of the virtual platform. In Sec. IV, we introduce how to achieve function and timing accuracy while optimizing the simulation speed. Section V shows the performance comparison of various applications against RTL simulation and against the real chip on the evaluation board. Finally, we draw some conclusions in Sec. VI.

II. PAC Duo PLATFORM

The PAC Duo system is a heterogeneous multi-core architecture composed of an ARM926 and two PACDSPs, which are 32-bit fixed-point digital signal processors with a five-way VLIW pipeline targeted at mobile devices [8]. The PACDSPs are generally responsible for multimedia data processing, and the ARM926 is used for flow control and other general applications. The Embedded Multimedia DMA (EMDMA) is designed and optimized for data transfer in multimedia codec applications, especially H.264 decoding. The system also includes a matrix of three buses (AXI, AHB, and APB) to separate the loading of the DSPs, the CPU, and slow-speed components. Many peripherals are also implemented in the system. Fig. 1 shows the block diagram of our PAC Duo platform.

Fig. 1. PAC Duo virtual platform.

III. PAC Duo VIRTUAL PLATFORM

The main data stream in which we are most interested flows among the PACDSPs, the DMA, and the memories. We therefore do not construct high-level models for all IP. Fig. 2 shows our targeted PAC Duo virtual platform.

In the PAC Duo virtual platform, several key IP blocks are modeled at different abstraction levels. First, the ARM and PACDSPs are modeled as cycle-accurate instruction set simulators (ISS). The bus system model is a cycle-accurate state machine triggered by certain clock sources. The bus system notifies these IP of events through the communication interface. The other IP, on the other hand, are modeled cycle-approximately.

Fig. 2. PAC Duo virtual platform.

Since the functionality of memory (read/write) is simple, it is modeled as a storage array at a high abstraction level. However, different kinds of memory have different timing, so we need to implement a latency calculation function. Our platform has one DDR2, one SDRAM, and two SRAM models. The interface between memory controller and memory is based on OSCI TLM at the PV-T (Programmer's View with Timing) level of abstraction [2], and we model the latency calculation in the transport call-back function. In this way, the memory controller and memory communicate in units of a transaction or burst, which dramatically increases simulation speed without sacrificing timing accuracy. Moreover, the communication, computation, and timing of the memory are modeled separately. The benefit is that the memory module can be reused by changing only the latency calculation.
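The separation of storage and timing described above can be sketched in a few lines. This is a minimal illustration in Python rather than the SystemC/TLM code of the actual platform; the class name `MemoryModel`, the `transport` signature, and the latency figures are hypothetical and are not taken from the PAC Duo models.

```python
from typing import Callable, List, Sequence, Tuple


def sram_latency(addr: int, beats: int) -> int:
    """Hypothetical SRAM timing: one cycle per beat."""
    return beats


def sdram_latency(addr: int, beats: int) -> int:
    """Hypothetical SDRAM timing: fixed 6-cycle setup plus one cycle per beat."""
    return 6 + beats


class MemoryModel:
    """Word-addressed storage array with a pluggable latency function."""

    def __init__(self, size_words: int, latency_fn: Callable[[int, int], int]):
        self.data = [0] * size_words
        self.latency_fn = latency_fn

    def transport(self, is_write: bool, addr: int, beats: int,
                  payload: Sequence[int] = ()) -> Tuple[List[int], int]:
        """Serve one whole burst per call; return (read data, latency in cycles)."""
        if addr < 0 or addr + beats > len(self.data):
            raise IndexError("burst falls outside the memory")
        if is_write:
            if len(payload) != beats:
                raise ValueError("payload length must equal the burst length")
            self.data[addr:addr + beats] = list(payload)
            read_data: List[int] = []
        else:
            read_data = self.data[addr:addr + beats]
        # Timing is computed separately from the data movement above.
        return read_data, self.latency_fn(addr, beats)


# Same storage class, two timing behaviours.
sdram = MemoryModel(1024, sdram_latency)
sram = MemoryModel(1024, sram_latency)
_, write_cycles = sdram.transport(True, 16, 4, [1, 2, 3, 4])
read_back, read_cycles = sdram.transport(False, 16, 4)
_, sram_cycles = sram.transport(False, 0, 4)
```

Because the whole burst is handled in one call, the simulator processes one event per transaction instead of one per beat, and a different memory type only needs a different latency function.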

The virtual platform is globally synchronous because all IP are connected to the bus system, which communicates with the IP in units of a transfer based on the bus protocol. Despite its cycle accuracy, this bus model is still faster than the RTL model, since it is not pin-accurate. Thus, detailed run-time traffic on the bus system can be observed, and communication congestion is easily explored with the transfer as the basic unit.

IV. VIRTUAL PLATFORM VALIDATION

The PAC Duo virtual platform is built on the existing SoC platform, and a bottom-up methodology is adopted for function verification and timing correction. Fig. 3 illustrates the virtual platform validation flow. We first focus on the correctness of function and behavior for the IP models and the whole system, and then perform timing annotation and adjustment for all the IP models, buses, and bridges. After the timing accuracy goal is achieved, simulation speed is optimized.

Version   Simulation time (s)   Speed-up   Timing accuracy (%)   Performance (cycle/s)
Ver. 1    179.24                152.33x    92.48                 54,123
Ver. 2    332                   81.98x     98.84                 31,233
Ver. 3    235.74                116.82x    97.50                 43,388

• First is the timing accuracy of the IP models, where many timing mismatches are found when comparing with the RTL simulation. The actual delays are extracted from RTL simulation and annotated to the models through parameterization of two kinds of delay: the processing delay between two consecutive transactions, and the delay from the start of a transaction to the end of its response. For example, the DDR2 memory model includes many timing parameters for operation latency.

• Second is the accuracy of the bus model, where additional delay stages must be inserted to compensate for the lack of timing detail for specific traffic. Moreover, the timing of the bridge between the AHB and AXI buses is corrected by ensuring the same behavior as its RTL counterpart, including handling of unaligned-address transactions and support for downsizing of the bus width (64-bit AXI to 32-bit AHB).

At the end of this stage, both function correctness andtiming accuracy are achieved.
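The two kinds of annotated delay described above can be illustrated with a small scheduling sketch. It assumes one possible reading of the text: the response delay is the time from the start of a transaction to the end of its response, and the processing delay is an idle gap the model needs before it can start the next transaction. The function name `schedule` and the delay values are hypothetical, not parameters extracted from the PAC Duo RTL.

```python
from typing import List, Tuple


def schedule(request_cycles: List[int], processing_delay: int,
             response_delay: int) -> List[Tuple[int, int]]:
    """Return the (start, end) cycle of each transaction, served in order."""
    timeline = []
    next_free = 0  # earliest cycle at which the model can accept a transaction
    for requested in request_cycles:
        start = max(requested, next_free)
        end = start + response_delay          # start of transaction to end of response
        next_free = end + processing_delay    # gap before the next transaction
        timeline.append((start, end))
    return timeline


# Two back-to-back requests and one late request, with assumed delays.
timeline = schedule([0, 0, 20], processing_delay=2, response_delay=5)
```

With both delays exposed as parameters, matching the RTL becomes a matter of adjusting two numbers per IP model rather than rewriting its behavior.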

D. The optimization for simulation speed (Stage 4)

As timing accuracy increases, simulation speed suffers because more timing events need to be handled. Since our goal is to evaluate system performance, reliable timing accuracy is a key factor that must be preserved. Nevertheless, some effort from both the HW and SW views is still made to improve simulation speed. From the HW view, decreasing the debugging level of the virtual platform, increasing the hardware optimization level, and replacing the ARM cycle-accurate model with an instruction-accurate model can enhance simulation speed. From the SW view, simulation efficiency can be improved by setting a proper optimization level for application compilation.

E. PAC Duo virtual platform validation and refinement

Table I summarizes the function and timing refinement process of the PAC Duo virtual platform using the H.264 video decoding application as an example.

TABLE I. PAC DUO VP FUNCTION AND TIMING REFINEMENT

* Bit stream: QVGA 5-frames

Version 1 is the simulation result of the virtual platform at the end of stage 2, which has correct functionality but no timing correction, and Version 2 is the virtual platform at the end of stage 3, when the timing adjustment is completed. Version 3 is the result of stage 4, in which simulation speed optimization is performed. The speed-up is the improvement in simulation time with respect to RTL simulation. The timing accuracy is obtained by comparing the total execution cycles with those of RTL simulation. The performance index is the number of execution cycles that can be simulated per second. Version 1 has the highest simulation speed-up, around 150x over RTL, and the worst accuracy, because all IP block models have no timing information and delay correction of the bus and bridge is not

B. System integration and verification (Stage 2)

After the function verification of all the IP models is done, these IP can be easily integrated via the bus library into a system virtual platform. To check the functionality of the whole system, functional tests and real applications are used. Before the applications are ported to the system, some generic system functional tests should be applied first:

• Memory Map Checking: to scan all the models via the ARM core to ensure the correctness of the entire memory address space.

• Interconnection Checking: to ensure all the connections between models and buses, including the backdoor access (i.e., DMA and DDR2 memory).

• Memory Access Checking: to scan all the memories (e.g., SRAM, SDRAM, DDR2, etc.) with different transfer types and sizes.

• System Interrupt Checking: to check whether the processor can enter the ISR and return to the previous state.
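As an illustration of the memory map check in the list above, the following sketch verifies that address regions do not overlap and that probing each region decodes to the expected model. It is a stand-alone Python approximation, not the ARM test program used on the platform, and every region name, base address, and size in `MEMORY_MAP` is made up for the example.

```python
# Hypothetical address map: (name, base address, size in bytes).
MEMORY_MAP = [
    ("SRAM0", 0x0000_0000, 0x0001_0000),
    ("SDRAM", 0x1000_0000, 0x0400_0000),
    ("DDR2", 0x2000_0000, 0x0800_0000),
    ("APB_TIMER", 0x4000_0000, 0x0000_1000),
]


def find_overlaps(memory_map):
    """Return pairs of base-adjacent regions whose address ranges overlap."""
    regions = sorted(memory_map, key=lambda region: region[1])
    overlaps = []
    for (name_a, base_a, size_a), (name_b, base_b, _) in zip(regions, regions[1:]):
        if base_a + size_a > base_b:
            overlaps.append((name_a, name_b))
    return overlaps


def decode(memory_map, addr):
    """Return the name of the first region containing addr, or None if unmapped."""
    for name, base, size in memory_map:
        if base <= addr < base + size:
            return name
    return None


def scan(memory_map):
    """Probe the first and last byte of every region and check the decoded owner."""
    return all(decode(memory_map, addr) == name
               for name, base, size in memory_map
               for addr in (base, base + size - 1))


map_is_clean = not find_overlaps(MEMORY_MAP) and scan(MEMORY_MAP)
```

Sorting by base address means any overlap in the map shows up between at least one pair of neighbouring regions, so a single pass is enough to detect a broken map.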

After all the generic system functions are verified, various applications are ported to the system. The same programs used for the FPGA prototyping board and the real chip evaluation board can be transparently re-used on the ESL virtual platform.

C. System timing annotation and adjustment (Stage 3)

Next we turn our focus to the timing. Two portions of the platform are the main sources of timing error.

[Fig. 3 content: Stage 1, function verification for all of the IP models; Stage 2, system integration and verification; Stage 3, system timing annotation and adjustment; Stage 4, the optimization for simulation speed.]

A. Function verification for all of the IP (Stage 1)

The virtual platform consists of several independent IP models whose stability and correctness may affect the overall quality and reliability of the virtual platform. Direct tests and random tests are therefore used to qualify the IP models. Furthermore, the following test items should be covered for each IP.

• Bus Interface: to validate the transformation of read/write transactions between IP models and buses, i.e., compliance with the TLM API.

• Registers: all of the implemented configuration registers can be correctly accessed by the SW through the memory map and respond correctly.

• Behaviors: to validate the kernel function or algorithm of the IP model (e.g., the 1D/2D transformation of the DMA and the YUV/RGB data processing of the LCD controller).

Fig. 3. The virtual platform validation flow.

considered. In other words, it has high simulation speed at the expense of accuracy. Version 2 has the highest timing accuracy and the longest simulation time. Version 3 maintains high accuracy while the simulation speed is also optimized.
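The speed-up and performance index defined for Table I follow directly from wall-clock times and cycle counts. The sketch below states the two definitions as functions; the input values are hypothetical round numbers chosen for readability and are not measurements from the paper.

```python
def speed_up(rtl_seconds: float, vp_seconds: float) -> float:
    """Improvement of simulation time with respect to RTL simulation."""
    return rtl_seconds / vp_seconds


def performance_index(simulated_cycles: int, vp_seconds: float) -> float:
    """Execution cycles that the virtual platform simulates per second."""
    return simulated_cycles / vp_seconds


# Hypothetical run: the same workload on RTL and on the virtual platform.
rtl_seconds = 20_000.0
vp_seconds = 200.0
simulated_cycles = 10_000_000

run_speed_up = speed_up(rtl_seconds, vp_seconds)
run_performance = performance_index(simulated_cycles, vp_seconds)
```

Both metrics depend on the wall-clock time of the virtual platform, which is why adding timing detail in stage 3 lowers them and the optimizations of stage 4 partly recover them.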

V. EXPERIMENTAL RESULTS

A. Application Porting

Some typical multimedia applications, including JPEG and MP3 decoding, are ported to evaluate system performance. Table II shows the results for these applications. The decoding buffer is the memory where the decoded results are stored. The execution cycles are the cycle count of the actual hardware execution, and the timing accuracy is obtained by comparing execution cycles. Note that for the H.264 application, the DMA is used for the major data movement, so two cases are run, with the DMA configured as either general-purpose or dedicated to multimedia applications. The timing accuracy is above 98% for all applications, so further performance analysis, such as the DMA configuration mentioned above, can be conducted with confidence. Next, we evaluate the real chip performance using the virtual platform.

TABLE II. VARIOUS APPLICATIONS PORTING ON VIRTUAL PLATFORM

Applications                       Decoding Buffer   Execution Cycles   Timing Accuracy
JPEG Decoder                       SDRAM             24,122,138         99.61%
MP3 Decoder                        DDR2              18,367,113         99.99%
H.264 Decoder (General DMA*)       SDRAM             40,945,561         99.09%
H.264 Decoder (Multimedia DMA**)   DDR2              8,381,137          98.26%

* Bit stream: QVGA 5-frames  ** Bit stream: VGA 2-frames

B. Performance Evaluation

For virtual platform (VP) performance evaluation, we use H.264 video decoding to predict the performance, and the results are later compared with those measured on the evaluation board (EVB). Fig. 4 is the die photo of our PAC Duo SoC with outlines of the important function blocks. The comparisons are shown in Table III.

Fig. 4. Die Photo of PAC Duo SoC.

Two configurations are set according to the AXI bus frequency, since H.264 decoding is mainly performed by the PACDSP, DMA, and DDR2, which are all on the AXI bus. In Case 1, the AXI and AHB/APB bus frequencies are 312MHz and 24MHz, respectively. In Case 2, the AXI, AHB, and APB bus frequencies are 408MHz, 102MHz, and 24MHz, respectively. Each IP has the same operating frequency as the bus to which it is connected. The decoding buffer and execution cycles are as explained above. Note that the execution cycles for both VP and EVB are counted by the timer on the APB running at 24MHz. The timing accuracy is obtained by comparing the VP execution cycles with those of the EVB. The timing accuracy is above 98% for all frequency settings, and as a result, the PAC Duo virtual platform is a reliable tool for performance analysis and architecture exploration.

TABLE III. REAL CHIP PERFORMANCE EVALUATION

System Configurations   Decoding Buffer   Execution Cycles (VP)   Execution Cycles (EVB)   Timing Accuracy
Case 1 (AXI: 312MHz)    DDR2              1,003,226               987,916                  98.45%
Case 1 (AXI: 312MHz)    SDRAM             2,315,754               2,301,326                99.37%
Case 2 (AXI: 408MHz)    DDR2              621,760                 618,635                  99.49%
Case 2 (AXI: 408MHz)    SDRAM             853,240                 850,250                  99.65%

* H.264 decoding, bit stream size QVGA, decoding 5 frames
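The paper does not spell out the formula behind the timing accuracy column. A relative-error definition, accuracy = 100 x (1 - |VP - EVB| / EVB), is assumed in the sketch below; it reproduces the four percentages reported in Table III to two decimal places. The function names are illustrative, and the conversion to milliseconds relies only on the stated 24MHz APB timer.

```python
def timing_accuracy(vp_cycles: int, ref_cycles: int) -> float:
    """Assumed definition: 100 * (1 - relative cycle-count error), in percent."""
    return 100.0 * (1.0 - abs(vp_cycles - ref_cycles) / ref_cycles)


def timer_cycles_to_ms(cycles: int, timer_hz: float = 24e6) -> float:
    """Convert cycles counted by the 24MHz APB timer into milliseconds."""
    return cycles / timer_hz * 1000.0


# (configuration, buffer, VP cycles, EVB cycles, reported accuracy in %) from Table III.
TABLE_III = [
    ("Case 1", "DDR2", 1_003_226, 987_916, 98.45),
    ("Case 1", "SDRAM", 2_315_754, 2_301_326, 99.37),
    ("Case 2", "DDR2", 621_760, 618_635, 99.49),
    ("Case 2", "SDRAM", 853_240, 850_250, 99.65),
]

for case, buffer_name, vp, evb, reported in TABLE_III:
    computed = timing_accuracy(vp, evb)
    print(f"{case} {buffer_name}: computed {computed:.2f}%, reported {reported:.2f}%")
```

In every row the VP cycle count is slightly higher than the EVB measurement, so under this definition the virtual platform errs on the pessimistic side for this workload.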

VI. CONCLUSIONS

In this paper, we have introduced our PAC Duo and its virtual platform. Through fine-tuning of function, timing, and simulation speed, the virtual platform achieves high accuracy, with an error rate of less than 10%, and runs around 80 to 150 times faster than traditional RTL simulation. With such a fast and accurate virtual platform, we can evaluate system performance for various applications and estimate real chip performance easily. It also provides a good stepping stone for future work, including system bottleneck analysis, architecture exploration, and early SW development for the next-generation multi-core PAC platform.

REFERENCES

[1] International Technology Roadmap for Semiconductors (ITRS), 2007 Edition, http://www.itrs.net

[2] Transaction Level Modeling (TLM) Library, Open SystemC Initiative (OSCI), http://www.systemc.org

[3] Y.-S. Cho, E.-J. Choi, and K.-R. Cho, "Modeling and analysis of the system bus latency on the SoC platform", in Proc. Int. Workshop on System-Level Interconnect Prediction, pp. 67-74, 2006.

[4] G. Schirner and R. Dömer, "Quantitative analysis of transaction level models for the AMBA bus", in Proc. Design, Automation and Test in Europe (DATE), pp. 230-235, 2006.

[5] K. Lee and Y. Yoon, "Architecture exploration for performance improvement of SoC chip based on AMBA system", in Proc. Int. Conference on Convergence Information Technology (ICCIT), pp. 739-744, 2007.

[6] T.-H. Tsai, Y.-N. Pan, and C.-H. Lin, "An Electronic System Level Design and Performance Evaluation for Multimedia Applications", in Proc. Int. Conference on Embedded Software and Systems (ICESS), pp. 621-624, 2008.

[7] C. Haubelt, T. Schlichter, J. Keinert, and M. Meredith, "SystemCoDesigner: Automatic design space exploration and rapid prototyping from behavioral models", in Proc. Design Automation Conference (DAC), pp. 580-585, 2008.

[8] T.-J. Lin, C.-N. Liu, S.-Y. Tseng, Y.-H. Chu, and A.-Y. Wu, "Overview of ITRI PAC project - from VLIW DSP processor to multi-core computing platform", in Proc. IEEE Int. Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 188-191, 2008.
