prototyping multiprocessor system-on-chip applications: a platform

May 2007 (vol. 8, no. 5), art. no. 0704-o5002 1541-4922 © 2007 IEEE Published by the IEEE Computer Society

Rapid System Prototyping

Prototyping Multiprocessor System-on-Chip Applications: A Platform-Based Approach

Benaoumeur Senouci, Aimen Bouchhima, Frédéric Rousseau, and Frédéric Pétrot • TIMA Laboratory

Ahmed Jerraya • CEA-LETI Minatec

A new multiprocessor system-on-chip prototyping flow based on the Portable Operating System Interface (Posix) standard and a multiprocessor hardware platform lets you quickly prototype Posix-based applications.

Modern MPSoCs (multiprocessor systems on chip) contain a huge amount of software and rely on complex hardware components. As application complexity grows, programmable multiprocessor platforms are becoming more desirable. In fact, chips with several processors (such as general-purpose processors and digital signal processors) are emerging in the industry, both for low-end applications such as audio codecs and for high-end applications such as video encoders.

Reconfigurable hardware platforms recently emerged as effective solutions to validate and prototype MPSoC designs early in a design flow. Such prototyping platforms make simultaneous hardware and software development possible and enable early software design and debugging,1–3 thus allowing for early software and hardware integration—which is the critical step in MPSoC system designs. Applications running on these new multiprocessor platforms usually require sophisticated multitasking operating systems to execute the system parts mapped to software. These operating systems provide a suitable abstraction, allowing easy development of application software.4 Using a standard API at this level makes this process even more effective and enhances software portability and reuse across different operating systems. However, the same portability doesn’t apply from a hardware perspective, where changes to the underlying configurable hardware architecture are still seen as a major source of “nonportability” and usually lead to long, tedious redesign cycles.

MPSoC designers have recently introduced the concept of hardware-dependent software (HdS) to tackle such strong coupling between hardware and software within the lower system software layers.1

Here, we describe our experience in MPSoC prototyping using a multiprocessor operating system kernel that implements the Portable Operating System Interface (Posix) API standard on top of a reconfigurable multiprocessor platform using the HdS concept. (We chose Posix as the API owing to its wide acceptance and availability in many runtime environments.5) Investigating the complexity of hardware and software integration in multiprocessor system design, a process still not completely mastered in MPSoC design flows, helped us understand the duality between the low-level software layer (HdS) and the underlying hardware platform in the context of MPSoC design.

Hardware-software boundary: Defining HdS

When introducing standards such as Posix threads in embedded software development, the aim is to make the applications portable from one platform to another.2 However, it still isn’t always obvious whether software running on one platform will run on another. Different platform specificities (such as memory maps, the processor family, multiprocessor booting strategies, or on-chip commutation) might require different tuning or tradeoffs, forcing designers to redesign a major part of their embedded software.


The HdS concept aims to tackle the disadvantages of such low-level programming practices by dividing the embedded software code into two parts: code that depends on the hardware architecture (the HdS) and code that is implementation independent (the hardware-independent software). The exact meaning of HdS depends on the context in which you use it, but HdS generally includes those low-level software functionalities whose implementations depend directly on the underlying hardware architecture.1,6 This includes, for instance, device drivers, digital signal processor-specific algorithms, and parts of the operating system (interrupt management, context related operations, semaphores, and so on). The hardware-independent software comprises application, middleware, and operating system software (see figure 1). We assume that the application software comprises a set of concurrent tasks and the middleware software represents dedicated communication libraries. The operating system software provides a useful abstraction interface between applications and target architectures by simplifying the control code required to coordinate processes.

Figure 1. The different layers of typical multiprocessor system-on-chip embedded software.

A generic MPSoC prototyping flow

Using an HdS-based approach and the Posix standard, we propose a generic MPSoC prototyping flow, as figure 2 shows. The shaded parts present the steps in which we’re interested—in particular, extracting platform specifications for redesigning certain parts of the HdS code.

Figure 2. A generic MPSoC prototyping flow. The shaded parts highlight the steps in which we’re interested.


To prototype an application on top of a multiprocessor architecture, we must parallelize the code into several threads. The entry point of the flow is then a parallel application composed of several concurrent Posix threads. A Posix thread is created by a call to

int pthread_create(pthread_t *thread, pthread_attr_t *attr,
                   void *(*start_function)(void *arg), void *arg)

This call starts a thread whose behavior is given by start_function, called with arg as its argument. The attr structure contains thread attributes, such as stack size, stack address, and scheduling policies. Such attributes are particularly useful when dealing with embedded systems or SoCs, in which the memory map isn’t standardized. The value returned in the thread pointer is a unique identifier for the thread.
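As a minimal sketch of this call on a host Posix system (not Mutek itself), the following creates one thread with an explicit stack size and waits for its result. The names start_function and run_one_thread and the 256-Kbyte stack size are illustrative assumptions, not values taken from the platform.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical thread body: doubles the integer it is given. */
static void *start_function(void *arg)
{
    int *value = arg;
    *value *= 2;
    return arg;                     /* handed back to the joining thread */
}

/* Create one thread with an explicit stack size, wait for it, and
 * return the value it computed (or -1 on any pthread error). */
int run_one_thread(int input)
{
    pthread_t thread;
    pthread_attr_t attr;
    void *result = NULL;
    int value = input;
    int rc;

    if (pthread_attr_init(&attr) != 0)
        return -1;

    /* Assumed stack size; an embedded port would derive it from the memory map. */
    if (pthread_attr_setstacksize(&attr, 256 * 1024) != 0) {
        pthread_attr_destroy(&attr);
        return -1;
    }

    rc = pthread_create(&thread, &attr, start_function, &value);
    pthread_attr_destroy(&attr);    /* attributes are copied at creation time */
    if (rc != 0)
        return -1;

    if (pthread_join(thread, &result) != 0)
        return -1;

    assert(result == &value);       /* the thread returned its own argument */
    return *(int *)result;
}
```

Calling run_one_thread(21) should yield 42 on a conforming implementation. The attribute object is what lets an embedded port place or size each stack explicitly.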

The application threads communicate with different communication primitives. We usually distinguish two types of parallel programming models suited to multiprocessor architectures: the shared-memory model and the message-passing model. In our case study, we used the shared-memory model at the implementation level, derived from a symmetric multiprocessor (SMP) kernel.

Mutek: A Posix-compliant multiprocessor kernel

Mutek is an open source project that implements the Posix API standard.1 It’s a lightweight implementation of Posix threads, which lets you design an operating system kernel for multiprocessor embedded systems. Figure 3 depicts Mutek’s internal architecture.

Figure 3. The internal architecture of Mutek, an open source project that implements a Posix-compliant multiprocessor kernel.

Scheduler organizations

The scheduler manages several lists of threads, and our Mutek kernel can implement it in one of three ways:

SMP. The processors all share one unique scheduler, which is protected by a lock. The different threads can run on any processor, which leads to task migration.

NON_SMP_Centralized-Scheduler. The processors all share one unique scheduler, which is protected by a lock, but every thread is statically assigned to a given processor and can run only on that processor.

NON_SMP_Distributed-Scheduler. There are as many schedulers as processors. Every thread is assigned to a given processor and can run only on that processor. This allows better parallelism by replicating the scheduler, which is a key resource. This implementation requires communication between schedulers.
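The difference between the three organizations can be summarized as two questions: how many scheduler instances exist, and which processors may run a given thread. The sketch below encodes only that; the type and function names (sched_org, thread_desc, scheduler_count, may_run_on) are hypothetical and don't come from the Mutek sources.

```c
#include <assert.h>
#include <stdbool.h>

/* The three scheduler organizations described in the text. */
enum sched_org { ORG_SMP, ORG_CENTRALIZED, ORG_DISTRIBUTED };

/* Hypothetical thread descriptor: `cpu` is the static assignment,
 * which is ignored in the SMP organization. */
struct thread_desc {
    int id;
    int cpu;
};

/* Number of scheduler instances (each with its own lock and thread lists). */
int scheduler_count(enum sched_org org, int ncpus)
{
    assert(ncpus > 0);
    return org == ORG_DISTRIBUTED ? ncpus : 1;
}

/* May processor `cpu` pick thread `t` for execution? */
bool may_run_on(enum sched_org org, const struct thread_desc *t, int cpu)
{
    if (org == ORG_SMP)
        return true;            /* any processor: thread migration */
    return t->cpu == cpu;       /* static assignment in both non-SMP cases */
}
```

With four processors, only the distributed organization reports four schedulers, and only the SMP organization lets a thread assigned to processor 2 run on processor 3.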


Kernel protection and thread migration

Processors access the scheduler through critical sections (that is, through common or shared operating system resources) and under the protection of locks. Lock granularity largely determines the balance between the overhead introduced by the locking mechanism and the opportunity to increase parallelism among different processors.

The SMP version of Mutek allows thread migration. Intuitively, when a CPU finishes the threads currently allocated on it for scheduling, it can resume executing a preempted thread that was previously executed on another processor. In that way, the system is dynamically balanced, reducing the mean response time.

Thread synchronization

When multiple processors require access to shared data, synchronization among threads is required. Mutek performs this synchronization using different primitives:

Mutual exclusion locks (mutex). A mutex allows exclusive access to shared resources such as global data. Threads attempting to access an object locked by a mutex will be blocked until the thread holding the object releases it.

Condition variables. Using condition variables, a thread can wait until (or indicate that) a predicate becomes true. A condition variable requires a mutex to protect the data associated with the predicate.

Semaphores. In Posix.1b, named and unnamed semaphores have been “tuned” specifically for threads. The semaphore is initialized to a certain value and decremented. Threads may wait to acquire a semaphore. If the semaphore’s current value is greater than 0, it is decremented and the wait call returns. If the value is 0 or less, the thread is blocked until the semaphore is available.
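The first two primitives can be illustrated with a short host-side sketch using the standard Posix calls. Worker threads update a shared counter under a mutex, and the caller waits on a condition variable until the predicate "all workers have finished" becomes true. The names and the constants NUM_WORKERS and INCREMENTS are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 4
#define INCREMENTS  1000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  all_done = PTHREAD_COND_INITIALIZER;
static long counter;            /* shared data protected by `lock` */
static int  finished;           /* predicate data protected by `lock` */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&lock);      /* exclusive access to counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    finished++;
    if (finished == NUM_WORKERS)
        pthread_cond_signal(&all_done); /* predicate became true */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Start the workers, wait for the predicate, and return the final count. */
long run_workers(void)
{
    pthread_t tid[NUM_WORKERS];
    long total;

    counter = 0;
    finished = 0;
    for (int i = 0; i < NUM_WORKERS; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }

    pthread_mutex_lock(&lock);
    while (finished < NUM_WORKERS)      /* re-check: wakeups may be spurious */
        pthread_cond_wait(&all_done, &lock);
    total = counter;
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(tid[i], NULL);
    return total;
}
```

The wait sits inside a while loop because a condition variable signals that the predicate might have changed, not that it is true.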

Memory coherency

If the architecture’s interconnect is a shared bus, using a snoopy cache algorithm is sufficient to ensure cache coherence. This has the advantage of avoiding any processor slowdown owing to memory traffic.

Processor identification number

Processors generally provide a specialized register allowing their identification within the system. Each processor is assigned a number at boot time. This identification number is needed because some start-up actions, such as clearing the BSS (block started by symbol) section and creating the scheduler, should occur only once.

Implementation

Here we describe the configurable multiprocessor platform we used to implement our MPSoC prototyping flow and explain how HdS let us port the kernel on the target hardware platform.

The prototyping platform

We used the ARM (Advanced RISC Machines) Integrator/AP prototyping platform (see www.arm.com/documentation/Boards_and_Firmware/index.html), which consists of three main parts.

The motherboard. The motherboard is composed of four core processor modules (which are mountable on a stack), one logic module (an FPGA—field programmable gate array), and a system bus implemented as an AMBA/AHB (Advanced Microcontroller Bus Architecture/Advanced High-Performance Bus). Figure 4 gives a simplified block diagram of the Integrator platform.


Figure 4. The ARM Integrator platform.

Core modules. Each core module is built around one ARM core processor without caches, and each contains 256 Kbytes of local synchronous static RAM (SSRAM), which only that ARM processor can access. An adjacent 128 Mbytes of synchronous dynamic RAM (SDRAM) can be directly accessed by all master processors on Integrator via the system bus. A system-bus bridge adapts each ARM core interface to the AHB protocol and lets the processor access the system bus.

Memory architecture. The ARM Integrator platform implements a distributed shared memory (DSM) architecture. Two memory types exist in each core module: a static memory of limited capacity (the SSRAM) and a dynamic one of greater capacity (the SDRAM). Only the local processor can access the SSRAM, while all bus masters (processors) at alias address locations can access the local SDRAM (on the same core module).

HdS adaptation layer for Mutek porting

As we mentioned earlier, the gap between the embedded software and the Integrator platform brings about the need for a software adaptation layer (HdS) as figure 5 shows. Here we detail the techniques we used to tune Mutek kernel specificities to the Integrator platform constraints. More particularly, we focus on multiprocessor booting, memory mapping, synchronization, and context switching.

Figure 5. HdS adaptation for the Mutek operating system.


Multiprocessor booting. As a multiprocessor kernel, the Mutek boot code should pay close attention to the synchronization of the different concurrent processors to ensure the coherency of the operating system’s initialization phase. At this stage, the operating system should allocate and initialize its different vital data structures (including its task queue). It should also clear the BSS section (noninitialized global and static C variables) according to the American National Standards Institute’s C standard. Given that the system needs only one processor to perform this initialization process, the other processors must wait until the process ends. Therefore, the system needs an identification mechanism that can differentiate the running processors.

In Mutek, the get_proc_id function provides the processor identification mechanism by returning the current processor’s ID. Implementing this function depends on the hardware architecture and is thus part of the HdS layer. In the case of the Integrator platform, the processor ID is available via a special status register (CM_STAT) on each core module that appears at the same shared logical address (0x10000010).

Figure 6 shows the specific implementation of get_proc_id for the Integrator platform. The kernel assumes that the processors are labeled from 0 to (n–1), where n is the number of available processors. The figure also shows the algorithm used to ensure synchronization among different processors. Only the master processor (ID = 0) is allowed to carry out kernel initialization. The other processors (core_modules 1, 2, and 3) enter a busy loop waiting for the master processor to finish. The shared variable scheduler_created is declared as volatile to ensure that the compiler won’t optimize away the waiting loop.

/* get_proc_id function */
unsigned int get_proc_id(void)
{
    unsigned int r;
    /* CM_STAT register; volatile so the hardware read is never cached */
    r = *((volatile unsigned char *) 0x10000010);
    return r;
}

/* Boot-time synchronization */
volatile int scheduler_created = 0;

if (get_proc_id() == 0) {
    /* core_module 0 (processor 0) does the kernel initialization:
       scheduler and main thread creation (application entry) */
    scheduler_created = 1;
} else {
    /* core_modules 1, 2, and 3 wait for the master processor
       (core_module 0) to create the scheduler and set scheduler_created */
    while (scheduler_created == 0)
        ;
    /* then wait for a thread to be reenabled */
}

Figure 6. Multiprocessor booting.
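The boot protocol of figure 6 can be exercised on a host machine with one Posix thread standing in for each core module. This is only a simulation sketch under stated assumptions: the processor ID is passed as an argument instead of being read from CM_STAT, scheduler_state is a hypothetical stand-in for the kernel data structures, and a C11 atomic flag replaces the volatile variable, because volatile alone doesn't order memory accesses between threads on a host system.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define NUM_PROCS 4

static atomic_int scheduler_created;   /* set once by the master processor */
static int scheduler_state;            /* stand-in for kernel data structures */
static int observed[NUM_PROCS];        /* what each processor saw after boot */

/* Boot code run by every simulated processor. */
static void *boot(void *arg)
{
    int id = *(int *)arg;              /* replaces get_proc_id() */

    if (id == 0) {
        scheduler_state = 42;          /* kernel initialization (master only) */
        atomic_store(&scheduler_created, 1);
    } else {
        while (atomic_load(&scheduler_created) == 0)
            sched_yield();             /* busy-wait for the master */
    }
    observed[id] = scheduler_state;    /* initialization must be visible here */
    return NULL;
}

/* Returns 1 if every processor saw the initialized state, 0 otherwise. */
int simulate_boot(void)
{
    pthread_t tid[NUM_PROCS];
    int ids[NUM_PROCS];
    int ok = 1;

    atomic_store(&scheduler_created, 0);
    scheduler_state = 0;
    for (int i = 0; i < NUM_PROCS; i++) {
        int rc;
        ids[i] = i;
        rc = pthread_create(&tid[i], NULL, boot, &ids[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_PROCS; i++)
        pthread_join(tid[i], NULL);
    for (int i = 0; i < NUM_PROCS; i++)
        if (observed[i] != 42)
            ok = 0;
    return ok;
}
```

Whatever order the threads start in, no secondary processor proceeds before the master has published the initialized state.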

Memory mapping. Figure 7 shows the global memory mapping system. After reset, all processors must jump to a specific shared address alias in the DSM address space. This address is set when the first initialization routine (__init) starts inside the Mutek kernel. The binary image (*.axf) is physically loaded into the local SDRAM of core module 0 (CM_0) at this specific shared address (using the Multi-ICE connector).


Figure 7. Project memory mapping.

Multiprocessor synchronization. Mutek relies heavily on low-level semaphore primitives to synchronize the different concurrent processors that compete for common operating system resources. From an implementation viewpoint, we can distinguish between two implementations of semaphores: CPU based (software) and FPGA based (hardware).

For a CPU-based implementation, binary semaphores on general-purpose CPUs are based on an atomic read and (conditional) write of a shared variable.7 These existing mechanisms can be integrated in shared-memory multiprocessors (the SMP) to synchronize applications running on multiple homogeneous CPUs. Figure 8 shows the implementation of the two functions (SEM_LOCK and SEM_UNLOCK) using the ARM atomic swap (SWP) instruction. SEM_LOCK tests the semaphore variable and acquires it when it is free; SEM_UNLOCK releases it.

void SEM_LOCK (unsigned int semaddr)
{
    __asm {
    Tryagain:
        /* Request the semaphore (SWP)      */
        /* Is it free?                      */
        /*   YES: we have the lock          */
        /*   NO:  branch to Tryagain        */
    };
}

void SEM_UNLOCK (unsigned int semaddr)
{
    __asm {
        /* Release the semaphore: Semaddr <= '1' */
    };
}

Figure 8. CPU-based semaphore implementation.
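The swap-based protocol of figure 8 can be sketched portably with C11 atomics, where atomic_exchange plays the role of the SWP instruction. This is an illustration, not the Mutek code: the names swap_lock, swap_unlock, and spin_demo are hypothetical, and the convention that 1 means free and 0 means taken is an assumption based on the release step in the figure.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define SEM_FREE   1u           /* assumed encoding: release writes 1 */
#define SEM_TAKEN  0u
#define NUM_CPUS   4
#define ITERATIONS 10000

static atomic_uint sem_word;    /* the shared semaphore variable */
static long shared_counter;     /* resource protected by the semaphore */

/* Atomically swap in TAKEN; we own the lock only if the old value was FREE. */
void swap_lock(atomic_uint *sem)
{
    while (atomic_exchange_explicit(sem, SEM_TAKEN,
                                    memory_order_acquire) != SEM_FREE)
        sched_yield();          /* host-friendly; bare metal would just retry */
}

/* Release the semaphore by writing FREE back. */
void swap_unlock(atomic_uint *sem)
{
    atomic_store_explicit(sem, SEM_FREE, memory_order_release);
}

static void *cpu_task(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        swap_lock(&sem_word);
        shared_counter++;       /* critical section */
        swap_unlock(&sem_word);
    }
    return NULL;
}

/* Run NUM_CPUS contending threads and return the final counter value. */
long spin_demo(void)
{
    pthread_t tid[NUM_CPUS];

    atomic_init(&sem_word, SEM_FREE);
    shared_counter = 0;
    for (int i = 0; i < NUM_CPUS; i++) {
        int rc = pthread_create(&tid[i], NULL, cpu_task, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_CPUS; i++)
        pthread_join(tid[i], NULL);
    return shared_counter;
}
```

If the swap were not atomic, two threads could both read FREE and enter the critical section together, and the final count would fall short of NUM_CPUS × ITERATIONS.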


We can use the platform’s FPGA to implement more efficient synchronization mechanisms that are independent of the CPU family (the semaphore engine). As such, these new mechanisms are easily portable across shared and distributed memory multiprocessor configurations. The architecture can thus implement semaphores that don’t lock the system bus, which leaves other processors or threads free to access the memory system.7

The semaphore engine uses a standard read of a memory mapped register Sem_addr. We define a simple control structure within the FPGA (logic module) that updates the register after a read operation. Figure 9 shows the semaphore engine’s implementation on the ARM platform.

Figure 9. The semaphore engine for Mutek.

The basic semantics of the SEM_LOCK and SEM_UNLOCK APIs for accessing the lock are implemented identically for all system processors (see figure 10).

void SEM_LOCK (unsigned int semaddr)
{
    while (Sem_addr != 0);
}

void SEM_UNLOCK (unsigned int semaddr)
{
    Sem_addr = 0;
}

Figure 10. A CPU-independent semaphore API.
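The API in figure 10 only works because a read of the register has a side effect in hardware. A small single-threaded software model makes that explicit. The model assumes one plausible reading of "updates the register after a read operation": a read returns the current value and then marks the semaphore as taken. The structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Software model of the memory-mapped semaphore register (Sem_addr). */
struct sem_engine {
    unsigned int reg;           /* 0 = free, nonzero = taken */
};

/* Models a bus read: returns the old value, then the control logic
 * sets the register so that later readers see the semaphore as taken. */
unsigned int engine_read(struct sem_engine *e)
{
    unsigned int old;

    assert(e != NULL);
    old = e->reg;
    e->reg = 1;
    return old;
}

/* Models a bus write of 0, as done by SEM_UNLOCK. */
void engine_unlock(struct sem_engine *e)
{
    assert(e != NULL);
    e->reg = 0;
}

/* One iteration of the SEM_LOCK polling loop: 1 if the lock was obtained. */
int engine_try_lock(struct sem_engine *e)
{
    return engine_read(e) == 0;
}
```

In this model, the first reader obtains 0 and owns the semaphore, and every later read returns a nonzero value until the owner writes 0 back. Because the test and the update happen in the FPGA on a single read, no CPU-specific atomic instruction is needed.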


Context management. The context switch code written in assembly language assures the commutation of the processors between threads. A context switch stores the current processor state (general purpose registers and status register) in a memory location (on the stack) and loads a new processor context from another location in the memory that corresponds to the new thread to be executed. This context switch is a part of the HdS layer. In Mutek, the scheduler_commute kernel routine is responsible for thread commutation.

This routine performs two different low-level calls to the commute function, which depends on the target processor; the function’s implementation, which is part of the HdS layer, varies accordingly. Figure 11 shows an example of context switch implementation for two different processors: the ARM architecture and the MIPS R3000 architecture.

/* Context switch routine for ARM architecture */
STMIA R0!, {R0-R14}     ; save the old context registers
MRS   R5, cpsr          ; we get the cpsr
MRS   R4, spsr          ; and the spsr
…
LDMIA R1, {R0, R14}     ; load the new context registers
MOV   PC, LR            ; and we branch

(a)

/* Context switch routine for MIPS R3000 architecture */
SW $at, 4*1($a0)
…                       # save the old context registers
SW $ra, 4*31($a0)
LW $at, 4*1($a1)
…                       # load the new context registers
LW $ra, 4*31($a1)

(b)

Figure 11. An example of context switch implementation for two different processors: (a) the ARM architecture and (b) the MIPS R3000 architecture.
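The save-then-load pattern of figure 11 can be tried on a host system with the ucontext interface (getcontext, makecontext, swapcontext), which saves the current register context in one structure and resumes another. This is a sketch of the technique only, not the Mutek commute function; the names thread_body and run_context_demo and the 64-Kbyte stack are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

static ucontext_t main_ctx;             /* saved context of the caller */
static ucontext_t thread_ctx;           /* saved context of the "thread" */
static char thread_stack[64 * 1024];    /* assumed stack size */
static int trace[4];                    /* order in which code ran */
static int trace_len;

static void thread_body(void)
{
    trace[trace_len++] = 2;
    swapcontext(&thread_ctx, &main_ctx);    /* save thread, resume caller */
    trace[trace_len++] = 4;
    /* returning here resumes the context named by uc_link */
}

/* Switches back and forth twice; returns the trace as a decimal number. */
int run_context_demo(void)
{
    trace_len = 0;
    if (getcontext(&thread_ctx) != 0)
        return -1;
    thread_ctx.uc_stack.ss_sp = thread_stack;
    thread_ctx.uc_stack.ss_size = sizeof thread_stack;
    thread_ctx.uc_link = &main_ctx;
    makecontext(&thread_ctx, thread_body, 0);

    trace[trace_len++] = 1;
    swapcontext(&main_ctx, &thread_ctx);    /* save caller, start the thread */
    trace[trace_len++] = 3;
    swapcontext(&main_ctx, &thread_ctx);    /* resume the thread where it left */

    assert(trace_len == 4);
    return trace[0] * 1000 + trace[1] * 100 + trace[2] * 10 + trace[3];
}
```

Each swapcontext call does in portable C what the assembly routines do by hand: it stores the current registers and stack pointer in the first structure and loads them from the second. The trace records that the two flows interleave in the order 1, 2, 3, 4.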

Environment setting. We used ADS (ARM Developer Suite v1.2) to build our project, with the ARM CC compiler and ARM Link as the linker. The output file is a .axf file targeted at the ARM platform.

Validation: M-JPEG video encoder

We validated our approach using a video decoder for a stream of JPEG images (known as M-JPEG or Motion JPEG). Figure 12 shows its task graph, which is composed of eight parallel threads that communicate with each other over hardware or software channels.


Figure 12. M-JPEG task graph (the circles are the threads).

Kahn network communication layer. The video decoder application is a graph of communicating threads in the form of a Kahn process network. In this formalism, the threads communicate with each other via circular first-in, first-out (FIFO) channels (C0 … C10). Our implementation of this communication library provides for different communication schemes (software-software, software-hardware, hardware-software, and hardware-hardware). In this case, we use software-software communication, building FIFOs on top of the Posix standard and protecting them using semaphores.
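A software-software channel of this kind can be sketched as a circular buffer guarded by two Posix counting semaphores, one counting free slots and one counting filled slots. The sketch below isn't the authors' communication library: the names (fifo_init, fifo_write, fifo_read, pipeline_sum), the depth of eight slots, and the restriction to one producer and one consumer per channel are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define FIFO_DEPTH 8                    /* assumed channel depth */

struct fifo {
    int   slot[FIFO_DEPTH];             /* circular buffer */
    int   head;                         /* next slot to read (consumer only) */
    int   tail;                         /* next slot to write (producer only) */
    sem_t free_slots;                   /* counts empty slots */
    sem_t used_slots;                   /* counts filled slots */
};

/* sem_wait can be interrupted by a signal; retry in that case. */
static void wait_on(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR)
        ;
}

void fifo_init(struct fifo *f)
{
    int rc;

    f->head = 0;
    f->tail = 0;
    rc = sem_init(&f->free_slots, 0, FIFO_DEPTH);
    assert(rc == 0);
    rc = sem_init(&f->used_slots, 0, 0);
    assert(rc == 0);
    (void)rc;
}

/* Blocking write: waits while the channel is full. */
void fifo_write(struct fifo *f, int value)
{
    wait_on(&f->free_slots);
    f->slot[f->tail] = value;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    sem_post(&f->used_slots);
}

/* Blocking read: waits while the channel is empty. */
int fifo_read(struct fifo *f)
{
    int value;

    wait_on(&f->used_slots);
    value = f->slot[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    sem_post(&f->free_slots);
    return value;
}

struct producer_args { struct fifo *f; int count; };

static void *producer(void *arg)
{
    struct producer_args *a = arg;
    for (int i = 1; i <= a->count; i++)
        fifo_write(a->f, i);            /* tokens 1..count */
    return NULL;
}

/* One producer thread feeds the caller, which sums the tokens it reads. */
long pipeline_sum(int count)
{
    struct fifo f;
    struct producer_args args = { &f, count };
    pthread_t tid;
    long sum = 0;
    int rc;

    fifo_init(&f);
    rc = pthread_create(&tid, NULL, producer, &args);
    assert(rc == 0);
    (void)rc;
    for (int i = 0; i < count; i++)
        sum += fifo_read(&f);
    pthread_join(tid, NULL);
    sem_destroy(&f.free_slots);
    sem_destroy(&f.used_slots);
    return sum;
}
```

The blocking read gives the Kahn semantics (a thread can't test a channel for emptiness, it can only wait for a token), and the bounded depth makes writes block as well. A channel with several producers or consumers would need an extra lock around the head and tail updates.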

Application mapping. The designer maps the parallelized code on the given multiprocessor architecture. This includes mapping the software parts (the concurrent threads, operating system, and HdS) onto the system memory and mapping the concurrent threads on the top of the multiprocessor platform architecture.

In platform-based design approaches, mapping the different functions (threads) on the hardware architecture is the key process that correlates the function to the architecture. In our case (as figure 13 shows), the abstract architecture model consists of four ARM CPU/SMP_OS units, an AMBA bus unit, and a global memory (DSM). The software architecture is built around an OS/SMP, allowing a dynamic threads-scheduling policy. The four ARM CPUs share a common view of the DSM via the system bus. Several software threads can share a CPU unit and request services from it.

Figure 13. M-JPEG threads mapping using the dynamic scheduler.


A thread can be executed on any CPU of the platform (thread migration), allowing an efficient load balance of the different software tasks. This helps balance the system, reducing its mean response time.

Experimentation results

In this SMP configuration, the operating system kernel’s memory footprint is approximately 11 Kbytes. This was the result of compiling approximately 100 C source files using the ARM CC compiler, with -O3 as the optimization option. Table 1 shows the software’s code size.

Table 1. The software’s executable code size.

Code                    Size
HdS                     472 bytes
Mutek kernel            11 Kbytes per symmetric multiprocessor
Multiprocessor boot     360 bytes per processor
Communication library   896 bytes

Compared to our previous experience with custom and application-specific operating system generation, Mutek has a larger footprint (four times larger). However, when the number of processors scales up, the memory footprint of the application-specific operating system (which implements distributed scheduling only) increases accordingly.

Table 2 shows the number of cycles necessary to perform typical Posix functions obtained from our implementation.

Table 2. The number of cycles necessary to perform typical Posix operations.

Operation            No. of cycles
Context switch       1,462
Thread creation      2,750
Semaphore request    162
Semaphore release    78

In addition to functional tests, we also ran tests to quantify the Semaphore Engine’s performance and to check our expectation that the hardware implementation would outperform the software one (see table 3). The test sequence was a SEM_LOCK (request) followed by a SEM_UNLOCK (release).

Table 3. Clock cycles for semaphore request and release.

Operation     CPU-based cycles (software)   FPGA-based cycles (hardware)
SEM_LOCK      162                           54
SEM_UNLOCK    78                            41

The software semaphore implementation’s average access time for SEM_LOCK was 162 clock cycles, compared to 54 clock cycles for the hardware Semaphore Engine, a ratio of 3 to 1 (162/54 = 3). For SEM_UNLOCK, the figures are 78 and 41 cycles, a ratio of about 1.9 to 1. We measured the Semaphore Engine’s performance with just one processor on the system bus.


Using a reconfigurable hardware platform (the ARM Integrator) and the Posix threads for applications development let us validate several multiprocessor applications by developing a prototype for each application. The validation process’s critical step is the HdS design process, which depends on the platform’s configuration instance.

We estimate that developing and debugging the new HdS layer for the Integrator platform required approximately three designer-months. Note that this is only for a particular configuration of the programmable hardware platform. Of course, for subsequent configurations of the same platform, the effort should be considerably less because the designers will be able to reuse the predesigned parts. Also, they won’t have the same learning curve the second time around.

Creating an operating system service (semaphore) based on FPGA, and thus eliminating the difference between a CPU and FPGA from the developer’s viewpoint, requires codesigning the operating system’s hardware and software to extend system services across the FPGA-CPU boundary. An attractive goal of such a hardware-software codesign is improving application portability on several mixed multiprocessor platforms. This FPGA-based design of certain CPU-dependent operating system services (such as synchronization, processor identification, and interrupt control) promises to overcome HdS design problems.

Future work will focus on developing methods and tools that can automate this design step to further shorten design and validation time and enable effective design space exploration.

References

1. B. Senouci et al., “Fast Prototyping of Posix Based Applications on a Multiprocessor SoC Architecture: Hardware Dependant Software Oriented Approach,” (http://doi.ieeecomputersociety.org/10.1109/RSP.2006.17), Proc. Workshop on Rapid System Prototyping, IEEE CS Press, 2006, pp. 69–75.

2. K. Keutzer et al., “System Level Design: Orthogonalization of Concerns and Platform-Based Design,” IEEE Trans. Computer-Aided Design of Circuits and Systems, vol. 19, no. 12, 2000, pp. 1523–1543.

3. N. Ohba and K. Takano, “An SoC Design Methodology Using FPGAs and Embedded Microprocessors,” Proc. 41st Design Automation Conf., DAC, 2004, pp. 747–752.

4. V. Mooney III and J. Lee, “Hardware/Software Partitioning of Operating Systems: Focus on Deadlock Detection and Avoidance,” IEE Proc. Computers and Digital Techniques, vol. 152, no. 2, 2005, pp. 167–182.

5. I. Augé et al., "Platform Based Design from Parallel C Specifications," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 12, 2005, pp. 1811-1826.

6. S. Yoo and A.A. Jerraya, “Introduction to Hardware Abstraction Layers for SoC,” Embedded Software for SoC, A.A. Jerraya et al., eds., Kluwer Academic, 2003, pp. 179–186.

7. D. Andrews, D.L. Neihaus, and D. Ashenden, “Programming Models for Hybrid FPGA/CPU Computational Components,” (http://doi.ieeecomputersociety.org/10.1109/MC.2004.1260732), Computer, Jan. 2004, pp. 118–120.

Benaoumeur Senouci is a PhD student at the TIMA Laboratory, working with the System Level Synthesis group. His research interests concern hardware-platform-based design and prototyping of multiprocessor systems on chip with a particular focus on MPSoC’s embedded software. He received his Master 2 Research degree in computer science and integrated system design from the Institut National Polytechnique de Grenoble. Contact him at TIMA Laboratory–SLS Group, 46 Ave. Félix Viallet, 38031 Grenoble Cedex, France; [email protected].

Aimen Bouchhima is a postdoctoral researcher at the TIMA Laboratory. His research interests include embedded software design and validation, high-level hardware and software modeling, and simulation and multiprocessor system-on-chip design flows. He received his PhD in microelectronics from the Institut National Polytechnique de Grenoble. Contact him at TIMA Laboratory–SLS Group, 46 Ave. Félix Viallet, 38031 Grenoble Cedex, France; [email protected].

Frédéric Rousseau is an assistant professor at the University of Grenoble and a researcher at the TIMA Laboratory. His research interest concerns system-on-chip design and architecture—in particular, the design and validation of hardware and software interfaces. He received his PhD in computer science from the University of Evry. Contact him at TIMA Laboratory–SLS Group, 46 Ave. Félix Viallet, 38031 Grenoble Cedex, France; [email protected].

Frédéric Pétrot is a professor of computer architecture at the Institut National Polytechnique de Grenoble. His main research interests concern computer-aided design of VLSI circuits and system architecture, with a particular emphasis on system integration, kernels, and multiprocessor systems on chip. He received his PhD in computer science from Université Pierre et Marie Curie, Paris. Contact him at TIMA Laboratory–SLS Group, 46 Ave. Félix Viallet, 38031 Grenoble Cedex, France; [email protected].

Ahmed Amine Jerraya is the head of Design Programs for the Design and System Division of CEA/LETI (Commissariat à l’Énergie Atomique / Laboratoire d’Électronique et de Technologie de l’Information). He received his Docteur d'Etat degree in computer science from the University of Grenoble. Contact him at CEA/LETI/DCIS, Minatec, 17 rue des Martyrs, 38054 Grenoble, France; [email protected]; www-leti.cea.fr.


Related Links

"Energy-Efficient Thread-Level Speculation," IEEE Micro (http://doi.ieeecomputersociety.org/10.1109/MM.2006.11)

"Cross Layer Design to Multi-thread a Data-Pipelining Application on a Multi-processor on Chip," Proc. ASAP 06 (http://doi.ieeecomputersociety.org/10.1109/ASAP.2006.24)

"Automatic Phase Detection for Stochastic On-Chip Traffic Generation," Proc. CODES+ISSS 06 (http://doi.ieeecomputersociety.org/10.1145/1176254.1176277)

Cite this article:

Benaoumeur Senouci, Aimen Bouchhima, Frédéric Rousseau, Frédéric Pétrot, and Ahmed Jerraya, "Prototyping Multiprocessor System-on-Chip Applications: A Platform-Based Approach," IEEE Distributed Systems Online, vol. 8, no. 5, 2007, art. no. 0705-o5002.

