intel x86-based digital signal processing - rtcrtcgroup.com/whitepapers/files/intelwp1.pdf ·...

cwcembedded.com

Te c h n o l o g y W h i t e P a p e r

Intel® x86-Based Digital Signal Processing

Using the Intel® Architecture in High-Per formanceMili tary Embedded Signal Processing Applications

Executive Summary

Historically, processors from the PowerPC® family, now known as Power Architecture® processors, have been the dominant choice for implementing Digital Signal Processing (DSP) in high-performance embedded military applications that take advantage of open-system Commercial Off-the-Shelf (COTS) products. Today, however, beginning with Intel Core™ i7 dual-core processors, the low-power, high-performance advantages of the Intel architecture processor technology can be used for the first time to design DSP engines for the rugged deployed COTS signal processing space.

Situation Analysis

Since the 1990s, processors from the PowerPC family, also known as Power Architecture, have been the dominant choice for open-system COTS boards used in high-performance embedded military DSP applications. These applications include radar, signal intelligence, sonar, and image processing.

In the early 1990s, systems were implemented largely with specialized processors such as the Intel® i860 processor, the Texas Instruments 320C40, and the Analog Devices SHARC®. These processors were popular because of their floating-point performance. At this time, two companies that would later become part of Curtiss-Wright Controls Embedded Computing -- Dy 4 Systems and Ixthos -- entered the signal processing market with VME form factor products based on both the 320C40 and the SHARC.

In the late 1990s, Analog Devices and Texas Instruments introduced follow-on processors, the TigerSHARC® and 320C6701, respectively. Both were unsuccessful, partly due to lack of software compatibility with their predecessors. The PowerPC processor from the Apple®/IBM®/Motorola® alliance, intended for personal computer use to compete with Intel x86-based processors, was also introduced

™

cwcembedded.com

at this time. Its reduced instruction set computer (RISC) architecture was touted to be the future of high-performance microprocessors. But it was the introduction of the AltiVec™ instruction unit in the Motorola PowerPC 7400 (G4) that changed the signal processing landscape.

Signal processing experts soon realized that the floating-point–capable AltiVec unit could greatly accelerate the inner-loop processing found in common functions such as fast Fourier transforms (FFTs). The ability to perform up to four simultaneous floating-point multiplies and additions was revolutionary. The PowerPC with AltiVec has had a long run in the military market with a continuous succession of faster processors, ending with the MPC8640/8641. From the early 2000s until now, system developers have enjoyed a steadily evolving series of COTS products that offer more features and higher performance, while providing a high degree of software compatibility.

However, in April 2008 Apple® Computer Inc. announced plans to acquire P.A. Semi Inc., manufacturer of the latest Power Architecture family, and to discontinue development of the high-performance processor line, which included the dual-core 1682. The promise of P.A. Semi’s products had been an industry-leading performance per watt, and they were poised to be designed into many systems that had previously used Freescale™ and IBM processors.

Although it was reported that Apple had stated intentions to support the 1682 processor for three years, Curtiss-Wright Controls immediately discontinued development of products based on the P.A. Semi processor.

Selection Criteria for Processors Used in Development of Next-Generation Signal Processing Products

In order to select a processor family that Curtiss-Wright Controls could employ in its next generation of signal processing board products, the company conducted extensive research into the needs of its signal processing customers as well as its own needs as an open-systems COTS board-level provider.

When Curtiss-Wright Controls surveyed its DSP customers, several “care-abouts” emerged, as well as other attributes important to producing a commercially viable product line (see Table 1). Ninety-four percent of survey respondents used the AltiVec floating-point unit. Survey respondents said that memory performance, and the closely related inter-processor bandwidth, were even more important than the board’s raw FLOPs rating. This indicated that many users were doing processing on streaming dataflows where there was sensitivity to the computer’s memory performance. Users also showed a strong preference for utilizing floating-point arithmetic, which is well known for simplifying algorithm development, versus the more complex effort required to work in fixed-point format with its attendant management of overflow and underflow conditions.

Table 1: Processor Selection Criteria Based on Survey Data from Signal Processing Developer Customers

Processor Attribute Customer Criteria

Performance

� Single precision floating-point � Memory bandwidth � Data processing � Inter-processor bandwidth � Double precision floating-point

Power Consumption

� VME board power to be ≤80W � VPX board power to be ≤120W � Desirable to be configurable for

optimizing power vs. performance

Functional Density Must fit into limited space envelope of a rugged board

RoadmapPreference for mainstream technology with a strong roadmap vs. the fastest exotic solution available

cwcembedded.com

Curtiss-Wright Controls’ most recent multiprocessor DSP engine, the CHAMP-AV6, is based on the Freescale MPC8641 processor. While a follow-on to the MPC8641 with more cores and better performance per watt would have been a strong contender, no direct successor to the MPC8641 is planned, and Freescale has decided not to include the AltiVec unit in its next generation of high-performance processors. The QorIQ™ P4080 processor, announced last year, is an excellent choice for a single board computer (SBC), with its eight cores, integrated memory controllers, and Serial RapidIO® (SRIO) interface. Unfortunately, the lack of AltiVec severely impairs its floating-point performance. The processor core still features a regular floating-point capability but it is not a vector processor, which is required to attain the needed performance for signal processing applications.

In contrast, Intel has continued to develop the floating-point capability of its processors. Intel processors feature a vector-processing unit generically known as Streaming Single Instruction, Multiple Data (SIMD) Extensions (SSEs), first introduced in the Intel® Pentium® III processor. Since then, Intel has continually added features and new instructions, culminating in the current implementation, Intel® Streaming SIMD Extensions (Intel® SSE 4.2.) Like AltiVec, SSE is a 128-bit wide processing unit, capable of simultaneously operating on four 32-bit floating-point values. SSE also features support for double-precision floating point, a feature that was never included in AltiVec. In multi-core Intel processors, each core has its own SSE unit, so the raw floating-point performance scales with the number of cores.

It is important to remember that Intel x86 processors are classic CISC processors. Successive generations of Intel processors continue to dispatch more instructions per clock. Since many more instructions per clock cycle get done, and the code density is higher, Intel processors can perform more than twice the useful work per clock cycle as a Freescale RISC processor.

As a result, beginning with the dual-core Intel Core™ i7 processors, the low-power, high-performance advantages of the Intel architecture processor technology can be used for the first time to design

products such as DSP engines for the rugged deployed COTS signal processing space.

How the Intel Architecture Addresses Signal Processing Performance Needs

The latest generation of Intel architecture processors is produced on 32nm process technology and is based on the Intel microarchitecture, formerly codenamed Westmere. Westmere includes many features that suit high-performance and power-efficient execution of signal processing workloads (see Figure 1).

Figure 1: Intel microarchitectureFront End

Fetch and Predecode

Instruction Queue

Instruction Decode

Execution

Rename/Allocation

Reserve/Schedule

Execution Units

ReorderRetire

Memory Cache

L1 InstructionCache

Unified L2Cache

L1 DataCache

SharedLast-levelCache

For example, the front end fetches and decodes instruction streams through the L1 cache and can provide a sustained throughput of four decoded micro-operations (µops) per cycle. Up to 16 bytes of instructions can be fetched per cycle and the unit can support decoding instruction streams for two hardware threads in alternate cycles, known as symmetrical multithreading (SMT), the latest version of Intel® Hyper-Threading Technology (Intel® HT Technology).

The µop stream then enters the first stage of the Intel® Wide Dynamic Execution unit, an out-of-order superscalar execution core that can issue up to 6 µops per cycle and can have up to 128 µops in flight at any

cwcembedded.com

moment. The reservation station schedules which µops are issued for execution and an entry per µop in the reorder buffer of the retirement unit ensures that the processing of the operations and architectural states are updated according to the specified program order.

Figure 2: Intel microarchitecture Execution Units

RESE

RVAT

ION

STA

TIO

N

Port 0

Port 1

Port 2

Port 3

Port 4

Port 5

Integer ALU/ShiftFP Multiply

SSE ALU/Shuffle

Integer ALU/MULFP ADD

SSE MUL/Shift

LOAD

Store Address

Store DATA

Integer ALU/ShiftSSE ALU/Shuffle

Of the six issued µops per cycle, three can be related to computational operations and three to memory operations, up to 128-bits each (see Figure 2). For signal processing workloads this means that if SIMD operations are used, the architecture can support up to 12 compute and three memory I/O operations in a single cycle.

To support high-instruction throughput, the Intel microarchitecture contains a sophisticated memory sub-system. In a quad-core processor, each core contains a first-level instruction cache (32KB 4-way), a first-level data cache (32KB 8-way), a second-level unified cache (256KB 8-way), and a third-level cache of up to 8MB 16-way that is shared among all the processor cores. With two or three DDR3 memory controllers the processor can provide a peak memory bandwidth of 17.1 or 25.6GB/s. This high-throughput capability is required to support the multi-gigabit rates for the

processing of the sample streams in military signal processing applications such as radar.

Table 2: Common SSE Instructions Used in Signal Processing Applications

Sub-Group Instructions Usability Description

Load and Store Floating-point (single/double) Precision

MOVAPS,MOVAPD,MOVHPS,MOVHPD,MOVLPS,MOVLPD

Data movement instructions. Load and store packed single- and double-precision floating-point value from/to memory/registers. Used extensively in FFT and other signal processing algorithms.

Arithmetic Floating-point (single/double) Precision

ADDPS/PD,SUBPS/PD,MULPS/PD,SQRTPS/PD,ADDSUBPS,ADDSUBPD

These arithmetic instructions are heavily used in signal processing algorithms such as FFTs.

Floating-point Round

ROUNDPS,ROUNDSS,ROUNDPD,ROUNDSD

Efficiently rounds the scalar and packed single- and double- precision operands to integers, with enhanced support for various language requirements.

Packed Test and Set PTEST

Faster branching from SIMD decisions to support conditionally vectorized code.

Accelerated Searching and Pattern Recognition of Large Data Sets

POPCNT

Calculates the number of bits set to one in the given operand. Often used for schedulers and buffer/memory management.

Thread Synchronization

MONITOR,MWAIT

Places processor in an optimized state until a write to the monitored address range occurs.

Memory BarriersSFENCE,LFENCE,MFENCE

Insures a performance-efficient way of load and store memory ordering. Benefits multitasking programming where tasks execute in the out-of-order core. Often used in inter-thread/inter-process communication mechanisms such as queues and shared memory.

Support for the efficient implementation of high-throughput signal processing is based on SSE instructions, which are extensions to the standard Intel Instruction Set Architecture (ISA). Including the latest generation, Intel® SSE 4.2, there are more than 300 SSE instructions. SSE operations work from a set of 16

cwcembedded.com

128-bit wide XMMx registers, capable of simultaneous operation on four packed floating-point values, as well as other formats (see Table 2).One of the most common signal processing algorithms is the FFT. The FFT implementation shown in Figure 3 is a version that is included in the Intel® Integrated Performance Primitives (Intel® IPP) library.

This example uses 32-bit single-precision complex floating-point samples. The FFT is implemented for different sizes and the number of cycles per sample has been measured. The profiled results use a single thread on an Intel microarchitecture core running at 2.67GHz. Note that these results are for a 2009 processor, not the Intel Core™ i7 processor used in the new Curtiss-Wright Controls platform, which is based on Intel’s new 32nm processor technology.

The Intel® IPP implementation of an N point FFT uses a complex multiplication taking six operations (2MUL & 2ADD) and a complex addition taking two operations (2ADD) for each point. Since a MUL takes four operations, this amounts to 8N.log2N FLOPs. By calculating the number of FLOPs per cycle the sustained GigaFLOP (GFLOP) performance can be derived. A single core is capable of 20 to 30GFLOPS for FFT execution, which is up to more than 90% of theoretical capability (see Figure 4).

Effective implementation of signal processing algorithms requires efficient use of all resources on the processor platform, so the ability to parallelize algorithms across multiple cores in a linear manner is essential. Parallelized scaling across the multiple cores of an Intel® microarchitecture-based platform can be done for common operations used in signal processing such as complex multiplication, or for more computationally-intense algorithms. A threading model can be used to implement the complex multiplication algorithm with parallel execution. The input data is divided into blocks, and each block, or number of blocks, depending on data size, is executed in full in separate parallel threads. This method assumes no interdependence between blocks. Some processor architecture parameters must be taken into account to optimize performance, including cache size, cache line alignment, and thread affinity, and inter-thread dependencies must be minimized or avoided.

A single quad-core Intel® Xeon® processor-based platform can be used to execute the complex floating-point multiplications. The results depicted in Figure 5 show the expected linear performance scaling from one to four threads, as additional cores and SSE vector units are employed in the algorithm. The 8-thread case demonstrates that additional efficiency can be obtained from the hyper-threading feature of the cores, even though the floating-point calculation resources of the core remain the same between the 4-thread and 8-thread case.

Figure 3: FFT Performance - Cycles per SampleFFT Performance Cycles/Sample

Cycl

es

FTT Size

0 256512

7680

8

6

10

14

12

4

2

10241280

15361792

20482560

28163072

33282304

35843840

40964352

Figure 4: FFT Performance - GFLOPSFFT GFLOPS

GFL

OPS

FTT Size

2560

20

15

25

35

30

10

5

10244096

16384

65536

Figure 5: Performance Scaling for Multiple CoresComplex Floating-PointMultiplication GFLOPS

GFL

OPS

per

Thre

ad

Number of Threads

01 Thread 2 Threads 4 Threads 8 Threads

20

15

25

30

10

5

cwcembedded.com

Intel Core™ i7 Processor Power Management

The discussion of processor performance in a rugged embedded application is entirely academic without also considering power consumption and cooling. Readers may understandably have a “>100W” impression of Intel processors from the perspective of gaming computers with massive CPU coolers. Perhaps less well known are the advances Intel has made in low-power operation. The scope of technologies devoted to power reduction is large, but the combination of internal architecture, low-leakage 32nm and 45nm high-K/metal gate transistors, and low-voltage capabilities means that system developers have a catalog of embedded-oriented devices to choose from, with maximum thermal design power ranging from 18W and under to 45W. With the reduction of the number of necessary companion chips, the Intel Core™ i7 processors have become viable for multiprocessor board level products that will offer large gains in performance/watt relative to the prior generation of Power Architecture/AltiVec designs.

In conjunction with excellent FLOPs/watt metrics, Intel Core™ i7 processors have very useful capabilities for tailoring power consumption and monitoring silicon temperature. Intel SpeedStep® technology allows fine-grained control over the processor’s operating frequency. Recalling that processor power consumption has a square-law relationship with voltage, the ability of the Intel Core™ i7 processor to direct the core power supply to reduce voltage at lower frequencies further provides power savings gains. Intel Core™ i7 processors feature a multipoint temperature sensor feature that allows users to more confidently “dial up” performance, since the system can provide a precise indication that thermal limits are not being exceeded. This ability affords developers the opportunity to fine-tune system performance to minimize power consumption, especially in hot environments such as next to a jet engine. It will also be useful for systems that run in a reduced operational mode, such as a targeting radar that does not need to be fully active until a pilot engages that function.

Platform Longevity and Software Issues

Intel architecture has evolved significantly over the past few years, incorporating several micro-architectural, platform, and ISA enhancements that have improved the execution of signal processing workloads. The advantages of consolidating the workloads for signal processing applications on a single standard architecture are many: the support of a large ecosystem, a reduction in design complexity, rapid time-to-market, and software reuse. The latest Intel microarchitecture allows the design process to be significantly streamlined and cost-optimized.

Intel architecture processors are evolving at a regular rate. Known as the “Tick/Tock” model, the major evolutions of the microarchitecture are interleaved with process technology evolution. The “Tick” represents the introduction of the new architecture. The “Tock” represents a reduction in the process technology. In 2011 the new architecture will implement the Intel® Advanced Vector Extensions (Intel® AVX) ISA. Intel AVX SIMD registers are 256-bits wide, so the theoretical floating-point performance per clock will roughly double. Backwards software compatibility with earlier SSE instructions will be maintained.

The products on Intel’s embedded roadmap have a lifetime of at least seven years. In addition to a strong product roadmap, Intel’s next-generation processors will be largely software-compatible with previous generations. Since the software in Curtiss-Wright Controls customers’ applications lasts longer than is required in many non-military embedded systems, this forward software compatibility is a major advantage to the Intel processor platform for signal processing developers. In the signal processing application space there are custom chips with tremendous performance, but they are point solutions and none of them have a long lineage. A long lineage and stability over time, such as Intel’s processors possess, are necessary for a major COTS board vendor, as well as for that vendor’s customers.

cwcembedded.com

Curtiss-Wright Controls Product Development Plans

Curtiss-Wright Controls already has a family of SBCs based on Intel processors. However, switching from Freescale to Intel processors in multiprocessor DSP board products will be a significant undertaking. As a leading vendor of advanced signal processing boards and systems for the rugged deployed aerospace and defense market, Curtiss-Wright Controls believes that the cost of this transition, for the company and for customers, will reap long-term benefits since Intel’s strong roadmap of future processors will fuel a series of processing products for many years. A long lineage and stability over time – both characteristics of Intel processors – are necessary for a major COTS board vendor, as well as the vendor’s customers.

2007-09

TICK TOCK

45nm

Intelmicroarchitecture

formerlycodenamedPenryn


formerlycodenamedNehalem

R R

2010-11

TICK TOCK

32nm


formerlycodenamedWestmere


formerlycodenamed

Sandy Bridge

R R

2012-13

TICK TOCK

22nm


formerlycodenamedIvy Bridge


formerlycodenamedHaswell

R R

~1.5-1.7X SIMDfloating-point

performance withIntel Advanced

Vector Extensions

R

2nd generation high-k + metalgate transistors

Integrated MemoryController

Figure 6: Intel’s “Tick-Tock” Silicon Release Cadence

Curtiss-Wright Controls’ first multiprocessor DSP board products will be based on a dual-core Intel®Core™ i7 processor. The first two products based on the Intel microarchitecture are the CHAMP-AV5 6U VME64x DSP engine and the SVME/DMV-1905 SBC.

Figure 7: Curtiss-Wright Controls CHAMP-AV5 6U VME DSP board

cwcembedded.com

© C

opyr

ight

201

0, C

urtis

s-Wrig

ht C

ontro

lsA

ll Ri

ghts

Rese

rved

. MKT

-EC

-Inte

l DSP

-081

110v

2

Utilizing two 2.53GHz dual-core Intel Core™ i7 processors, the CHAMP-AV5 delivers performance rated up to 81GFLOPs. With 4MB of cache and two hardware threads per core, the Intel Core™ i7 processor can process larger vectors at peak rates significantly greater than was possible with previous AltiVec-based systems. The CHAMP-AV5 is pin-compatible with the Curtiss-Wright Controls’ MPC7447/7448-based CHAMP-AV4, enabling performance upgrades of existing qualified systems without necessitating a change of chassis, backplane, or power supplies. An extensive suite of software includes support for Wind River® VxWorks® and Wind River® GPP Linux® operating environments. Additional software support includes Inter-Processor Communications (IPC) and Curtiss-Wright Controls’ Continuum Vector™ SSE-optimized signal processing library.

The SVME/DMV-1905 SBC complements Curtiss-Wright Controls’ new CHAMP-AV5 and brings the low-power, high-performance advantages of Intel® architecture to demanding, harsh environment compute applications. Supplying 8GB of flash and up to 8GB of SDRAM, the SVME/DMV-1905 is ideal for handling applications with demanding storage, data logging and sensor processing needs.

Intel, Intel Core™, Intel SpeedStep, Xeon, Pentium and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.*Other names and brands may be claimed as the property of others.

Portions of the Intel content originally appeared in “Using Intel Architecture for Implementing SDR in Wireless Base Stations”, from the 2009 SDR Forum (now the Wireless Innovation Forum).

intel x86-based digital signal processing - rtcrtcgroup.com/whitepapers/files/intelwp1.pdf ·...

Documents