© 2005 Mercury Computer Systems, Inc.
• Yael Steinsaltz, [email protected]
• Scott Geaghan, [email protected]
• Myra Jean Prelle, [email protected]
• Brian Bouzas, [email protected]
• Michael Pepe, [email protected]
Leveraging Multicomputer Frameworks for Use in Multi-Core Processors
High Performance Embedded Computing Workshop
September 21, 2006
Outline
• Introduction
• Channelizer Problem
• Preliminary Results
• Summary
Multi-Core Processors
• Multi-core processors vary in architecture, from 2-4 identical cores (Intel Xeon, Freescale 8641) to a single Manager with several Workers on one die (IBM Cell Broadband Engine™ (BE) processor).
• Focusing on the IBM Cell BE processor and following the standard presented at www.data-re.org, we implemented an API called the ‘Multi-Core Framework’ (MCF).
• MCF is applicable across architectures as long as one process acts as a Manager; more established APIs would work as well.
Multi-Core Framework
• MCF is based on Mercury's prior implementation of www.data-re.org, a product named “Parallel Acceleration System” or PAS.
• Distributed data flows in a Manager-Worker fashion enabling concurrent I/O and parallel processing.
• Function Offload model, in which the user programs both the Manager and the Workers; MCF simplifies development.
• LS memory is used efficiently (< 5% for MCF kernel).
• Runs tasks on SPE without Linux® overhead (thread create is bypassed).
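The Manager-Worker function offload pattern described above can be illustrated with a minimal, generic sketch. It does not use the MCF API: the names `manager`, `worker_task`, `num_workers` and `tile_size` are hypothetical, and ordinary Python threads stand in for SPE workers.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_task(tile):
    """Stand-in for worker-side math: scale every sample in one tile."""
    return [2.0 * x for x in tile]


def manager(data, num_workers=4, tile_size=8):
    """Split data into tiles, offload them to workers, gather results in order."""
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as the input tiles
        results = list(pool.map(worker_task, tiles))
    # Flatten the per-tile results back into one output sequence
    return [x for tile in results for x in tile]


result = manager([float(i) for i in range(20)])
```

The manager owns partitioning and collection, and each worker only sees one tile at a time. That division of labor is the part of the pattern this sketch is meant to show.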
Data Movement
• Multi-buffered strip mining of N-dimensional data sets between a large main memory (XDR) and small worker memories.
• Supports overlap and duplication when distributing data, as well as different partitioning schemes.
• Data re-organization enables easy transfer of data between local stores.
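As a rough one-dimensional illustration of strip mining with overlap, the sketch below cuts a large array into overlapping strips and copies each into one of two small buffers in alternation. It is not MCF code: `strips`, `process_strip_mined` and `kernel` are hypothetical names, list copies stand in for DMA transfers, and the loop is sequential, so it shows only the buffer alternation and not true overlap of transfer with compute.

```python
def strips(total_len, strip_len, overlap):
    """Yield (start, stop) index pairs for strips that share `overlap` samples."""
    step = strip_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than strip_len")
    start = 0
    while start + strip_len <= total_len:
        yield start, start + strip_len
        start += step


def process_strip_mined(main_memory, strip_len, overlap, kernel):
    """Run `kernel` on each strip, alternating between two small local buffers."""
    buffers = [[0.0] * strip_len, [0.0] * strip_len]  # stand-ins for local store
    outputs = []
    for i, (start, stop) in enumerate(strips(len(main_memory), strip_len, overlap)):
        buf = buffers[i % 2]               # ping-pong buffer selection
        buf[:] = main_memory[start:stop]   # stand-in for a DMA "get"
        outputs.append(kernel(buf))        # compute on the local copy only
    return outputs


data = list(range(16))
sums = process_strip_mined(data, strip_len=8, overlap=6, kernel=sum)
```

With `strip_len=8` and `overlap=6`, each strip advances by two samples, which is the same 4:1 ratio as a 75% overlap.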
Objective and Motivation
• Objective: Develop a Cell BE-based real-time signal acquisition system composed of frequency channelizers and signal detectors in a single ~6U slot.
• Motivation: Benchmark computational density across PPCs, FPGAs and the Cell BE for a typical streaming application.
The Channelizer Problem
• FM3TR Signal (Hopping, Multi-Waveform, Multiband)
• Channelization using a 16K real FFT with 75% overlap of the input (computation is signal independent).
• Simple threshold for detection of the active channels (computation is data dependent).
Channelizer Problem
• The signal acquisition system separates a wide radio frequency band into a set of narrow frequency bands.
• Implementation specifications:
  - 4:1 overlap buffer: 16K sample buffer -> 8K complex FFT.
  - Blackman window (embedded multipliers).
  - Log-magnitude threshold: adjustable register and comparator to determine detections.
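The processing chain on this slide (overlapped framing, Blackman window, FFT, log-magnitude threshold) can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation benchmarked here: the pure-Python `fft` stands in for the optimized math library calls, the demo uses a 64-point FFT instead of 16K to stay quick, and `detect`, `threshold_db` and the 15 dB value are hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def blackman(n):
    """Classic Blackman window coefficients (0.42, 0.5, 0.08)."""
    return [0.42 - 0.5 * math.cos(2 * math.pi * i / (n - 1))
            + 0.08 * math.cos(4 * math.pi * i / (n - 1)) for i in range(n)]


def detect(samples, fft_len, threshold_db):
    """Return, per overlapped frame, the bins whose log-magnitude exceeds the threshold."""
    hop = fft_len // 4                 # 75% overlap: advance by a quarter frame
    window = blackman(fft_len)
    detections = []
    for start in range(0, len(samples) - fft_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + fft_len], window)]
        spectrum = fft(frame)[:fft_len // 2]   # real input: keep the positive-frequency bins
        hits = [k for k, v in enumerate(spectrum)
                if 20 * math.log10(abs(v) + 1e-12) > threshold_db]
        detections.append(hits)
    return detections


# Small demo: a real tone centered on bin 8 of a 64-point FFT.
N = 64
tone = [math.cos(2 * math.pi * 8 * i / N) for i in range(4 * N)]
hits = detect(tone, fft_len=N, threshold_db=15.0)
```

The framing, windowing and FFT do the same work whatever the input contains, while the length of each `hits` list depends on the signal, which mirrors the signal-independent versus data-dependent split in the problem statement.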
Data Flow and Work Distribution
[Figure: data flow and work distribution. Labels: Manager thread of execution; input data; channelizer workers; channelizer output; High Speed Alarm (HSA) worker; HSA output; unused processing elements. Teams perform data-parallel math.]
Data Flow – Re-org Channels
[Figure: re-org channels. Labels: XDR; channelizer team local store; HSA team local store (LS).]
Development Time and Hardware Use
• PPC: 22 PPCs needed for the channelizer and 7 PPCs for the HSA; about 2 man-months of development.
• FPGA: one half of a Virtex-II Pro P70 FPGA (quarter board); about 8 man-months, as all the math had to be developed, using some Xilinx cores.
• Cell BE: a single processor (half board); about 4 man-weeks (using the same math and SAL calls as the PPC code).
Data Rates Tested
• PPC implementation accepted data at 70, 80 and 105 MHz (MS/sec), and is easily scalable.
• FPGA implementation met data rates at 70 and 80 MHz (MS/sec).
• Cell BE implementation met data rates at 70, 80 and 105 MHz (MS/sec).
Windowing wasn’t implemented on the Cell BE because the local store was too small to hold the weights. Adding it would take an extra 2-3 weeks of design modification to the data organization and channels. (Times were measured with a multiply by a constant, to be true to performance.)
Math only started to impact data rates when using fewer than 4 SPEs for the FFT; adding more SPEs didn’t result in added speed.
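A back-of-the-envelope calculation helps put these data rates and the local store constraint in perspective. The sketch below uses the 16K FFT length and 75% overlap from the problem statement and the 256 KB SPE local store of the Cell BE. Treating the window weights as single-precision (4 bytes each) is an assumption, and `frame_budget` is a hypothetical helper name.

```python
FFT_LEN = 16 * 1024               # 16K-point FFT
HOP = FFT_LEN // 4                # 75% overlap: each frame advances by a quarter of its length
LOCAL_STORE_BYTES = 256 * 1024    # SPE local store size on the Cell BE
BYTES_PER_WEIGHT = 4              # assumption: single-precision window weights


def frame_budget(sample_rate_hz):
    """Return (FFT frames per second, microseconds available per frame)."""
    frames_per_sec = sample_rate_hz / HOP
    return frames_per_sec, 1e6 / frames_per_sec


# Budgets for the three tested rates, keyed by rate in MS/sec
budgets = {rate: frame_budget(rate * 1e6) for rate in (70, 80, 105)}

# Share of one local store that a full table of window weights would occupy
weights_fraction = FFT_LEN * BYTES_PER_WEIGHT / LOCAL_STORE_BYTES

for rate, (fps, usec) in budgets.items():
    print(f"{rate} MS/sec: {fps:.0f} frames/sec, {usec:.1f} us per frame")
print(f"window weights: {weights_fraction:.0%} of local store")
```

Under these assumptions, a full weights table alone would take a quarter of one local store before any input or output buffers are counted, and at the highest tested rate a new 16K frame is due roughly every 39 microseconds.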
Summary
• Adapting an existing library to a new architecture while keeping a similar API makes porting applications efficient.
• Hardware footprint (6U slots) is comparable to FPGA use.
• The small size of the SPE local store is a significant contributor in determining whether an application will port easily or require additional work.
• Mercury is fully cognizant of the architecture and works to reduce code size while benefiting from the large I/O bandwidth and fast processing capability of the Cell BE.