high performance compute platform based on multi-core …packet accelerator network power management...

TI Information – Selective Disclosure

High Performance Compute Platform Based on Multi-core DSP for Seismic Modeling and Imaging

Presenter: Murtaza Ali, Texas Instruments

Contributors:

Murtaza Ali, Eric Stotzer, Xiaohui Li, Texas Instruments

William Symes, Jan Odegard, Rice University



Outline

• Introduction to TI Multi-core DSP

• Brief review of IWAVE based seismic signal modeling

• Details and challenges of implementation

• Results and conclusions



A New Paradigm in High Performance Computing

• Industry-best floating point performance

– 16 Gflops/W

• Standard programming model

– supports MPI and OpenMP

• Wide range of applications

– from embedded systems to server blades

• Full ecosystem support

– Off-the-shelf PCIe and ATCA cards

– O/S and application software

Supported by a full set of development tools and Code Composer Studio IDE



Shannon (TMS320C6678) – Block Diagram

[Block diagram: 8 x C66x DSP CorePac, each with L1 and L2; Memory Subsystem (Multicore Shared Memory Controller (MSMC), 4MB Shared Memory, 64b DDR3); Multicore Navigator; TeraNet; Network CoProcessors (Crypto, Packet Accelerator); IP Interfaces (GbE Switch, 2 x SGMII); Peripherals & IO (SRIO x4, PCIe x2, EMIF 16, TSIP 2x, I2C, SPI, UART); System Elements (Power Management, Debug, EDMA, SysMon)]

• Multi-Core KeyStone SoC

• Fixed/Floating CorePac

– 8 CorePac @ 1.25 GHz

– 0.5MB L2/core, 4.0 MB Shared L2

– 320G MAC, 160G FLOP, 60G DFLOPS

– 10W

• Navigator

– Hardware Queue Manager with DMA

• Multicore Shared Memory Controller

– Low latency, high bandwidth memory access

• Network Coprocessor

– IPv4/IPv6 Network interface solution

– IPSec, SRTP, Encryption fully offloaded

• HyperLink

– 50G Baud Expansion Port

– Transparent to Software
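The headline figures on this slide are consistent with simple per-core arithmetic. The sketch below is a back-of-the-envelope check only. It assumes 16 single-precision floating-point operations and 32 fixed-point MACs per core per cycle; these per-cycle counts are not stated on the slide and are inferred from the quoted totals.

```python
# Back-of-the-envelope check of the peak figures quoted on the slide.
# ASSUMPTION: the per-cycle counts below are inferred from the slide totals,
# they are not quoted from the slide itself.
CORES = 8
CLOCK_GHZ = 1.25
FLOPS_PER_CYCLE = 16   # assumed single-precision flops per core per cycle
MACS_PER_CYCLE = 32    # assumed fixed-point MACs per core per cycle
POWER_W = 10


def peak_giga_ops(cores, clock_ghz, ops_per_cycle):
    """Peak rate in giga-operations per second for the whole device."""
    return cores * clock_ghz * ops_per_cycle


peak_gflops = peak_giga_ops(CORES, CLOCK_GHZ, FLOPS_PER_CYCLE)
peak_gmacs = peak_giga_ops(CORES, CLOCK_GHZ, MACS_PER_CYCLE)
gflops_per_watt = peak_gflops / POWER_W

print(peak_gflops, peak_gmacs, gflops_per_watt)
```

Under these assumptions the result matches the 160G FLOP, 320G MAC and 16 Gflops/W figures quoted in the slides.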


C66x – Core Architecture

• 8-issue VLIW architecture

– Can issue 8 instructions per cycle

• 2 data paths

– 4 units per data path

– L, S, D, M

• 64 registers (32 bit)

– 32 per data path

– Can be arranged in dual (64 bit) or quad (128 bit) registers

– Cross connect available

• Single Instruction Multiple Data (SIMD) available

– Dual or quad multiplies


TI DSP SW Resources

• Multicore Software Development Kit

– Peripheral drivers

– Demos for quick start

• OpenMP – alpha version released, example code available

• Linear Algebra Library (BLAS, LAPACK)

– Working with UT Austin to port “libflame” (LAPACK equivalent) to Shannon

• Optimized Libraries

– DSPLIB (math functions), ImageLib

– Medical Imaging SW Toolkit – Ultrasound, Optical Coherence, 3D Rendering


Shannon PCIe Development Cards

• 1 Tera-flop, 120 W, available 1Q12

• 512 Gflops, 50 W, available now


Seismic Modeling

• IWAVE: a framework to enable efficient and scalable finite difference simulation on a regular grid

– includes seismic modeling and imaging

– Implements different wave equation updates

– Used for modeling and imaging

– Open source from Rice University

Typical iteration in forward sweep (essential part in modeling)

• wave equation update

• source addition

• boundary condition

Typical iteration in backward sweep (essential part in imaging), Reverse Time Migration (RTM)

• wave equation update

• receiver addition

• boundary condition

• imaging after iterations complete

Focus of our current study
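The three steps of a forward-sweep iteration listed above can be sketched in a few lines. This is a toy 1-D, second-order pressure/velocity scheme written for illustration; it is not IWAVE code, and the function names, grid sizes, material values and the Ricker-style source are all assumptions.

```python
import math


def ricker(t, f0=25.0, t0=0.04):
    """Ricker-style source wavelet (illustrative choice of source)."""
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def forward_sweep(nx=101, nt=200, dx=10.0, dt=0.001,
                  c=2000.0, rho=1000.0, src_ix=50):
    """Toy 1-D forward sweep on a staggered grid (hypothetical parameters)."""
    kappa = rho * c * c          # bulk modulus
    p = [0.0] * nx               # pressure at integer grid points
    v = [0.0] * (nx - 1)         # velocity at half grid points
    for it in range(nt):
        # 1. wave equation update: velocity from the pressure gradient ...
        for i in range(nx - 1):
            v[i] -= dt / (rho * dx) * (p[i + 1] - p[i])
        # ... then pressure from the velocity divergence
        for i in range(1, nx - 1):
            p[i] -= dt * kappa / dx * (v[i] - v[i - 1])
        # 2. source addition at one grid point
        p[src_ix] += ricker(it * dt)
        # 3. boundary condition: zero pressure at both ends
        p[0] = 0.0
        p[-1] = 0.0
    return p


pressure = forward_sweep()
```

With the default values the Courant number c*dt/dx is 0.2, so the explicit update stays stable. A backward sweep would have the same loop shape, with receiver data injected in step 2 instead of the source.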


Inside wave update

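[Figure: data flow of the wave update on the x, y, z grid. The velocity components vx, vy, vz are differenced to give dvxdx, dvydy, dvzdz; a linear combination of these feeds the "Update px", "Update py" and "Update pz" blocks, which also take epx/mpx, epy/mpy, epz/mpz and lax, lay, laz as inputs. The pressure components px, py, pz are differenced to give dpxdx, dpydy, dpzdz, which feed the "Update vx", "Update vy" and "Update vz" blocks together with evx/mvx, evy/mvy, evz/mvz and lax, lay, laz.]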

• Based on velocity-stress PDE

• First order hyperbolic system

• 10th order finite difference method
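To make the "10th order" stencil concrete, the sketch below computes the coefficients of a 10th-order first-derivative approximation. It assumes a staggered grid, where the derivative is taken halfway between samples; the slides do not show the stencil, so this form and the function name are assumptions. Exact rational arithmetic is used so the result can be checked against the order conditions.

```python
from fractions import Fraction


def staggered_fd_coefficients(order):
    """Coefficients c_1..c_M (M = order / 2) for the staggered-grid stencil

        f'(x) ~ (1/h) * sum_m c_m * [f(x + (m - 1/2) h) - f(x - (m - 1/2) h)]

    computed from the standard closed-form product formula.
    """
    half = order // 2
    coeffs = []
    for m in range(1, half + 1):
        a = 2 * m - 1
        prod = Fraction(1)
        for n in range(1, half + 1):
            if n != m:
                b = 2 * n - 1
                prod *= Fraction(b * b, abs(b * b - a * a))
        sign = 1 if m % 2 == 1 else -1
        coeffs.append(sign * prod / a)
    return coeffs


coeffs10 = staggered_fd_coefficients(10)
for m, c in enumerate(coeffs10, start=1):
    print(m, c, float(c))
```

For order 10 this gives five coefficients, so each derivative reads ten input samples per output cell. That ratio of loads to arithmetic is why the kernels are sensitive to memory access.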


Kernel Implementations

• Identified four kernels to optimize for the core instruction architecture

– Differential in x-direction (first dimension)

– Differential in y- or z-direction (orthogonal dimension)

– Update in x-direction

– Update in y- or z-direction


[Figure: Optimization trade-off at kernel level, plotting compute resource against memory access (load/store); labels: cache friendly (first dimension), load/store friendly]

;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     8*       8*
;*      .M units                     5        7
;*      .X cross paths               3        2
;*      .T address paths             8*       8*
……………………..
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 8  Schedule found with 4 iterations in parallel


Kernel Results

• Kernels take between 1 and 3 cycles per cell

• Summing the kernel numbers shows a capability of over 200 M cells/sec on an 8-core DSP running at 1 GHz

• Initial benchmarks carried out with all data kept in DDR3 memory

– OpenMP used to parallelize across cores

• Assignment is based on z direction

– Need a better data movement strategy over DDR3

– Analyze performance bottlenecks


[Figure: OpenMP threads running on each core, Core #0 to Core #7]
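The z-direction assignment and the 200 M cells/sec estimate can both be illustrated with a little arithmetic. The sketch below splits the z planes into one contiguous slab per core, similar in spirit to a static loop schedule, and converts the throughput estimate into a per-cell cycle budget. The function names and the even-split policy are assumptions, not taken from the implementation described in the slides.

```python
def split_z_planes(nz, n_cores=8):
    """Return one contiguous (start, stop) range of z planes per core."""
    base, extra = divmod(nz, n_cores)
    slabs = []
    start = 0
    for core in range(n_cores):
        # The first `extra` cores take one additional plane.
        size = base + (1 if core < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


def cycles_per_cell_budget(n_cores, clock_hz, cells_per_sec):
    """Total core cycles available per updated cell at a given throughput."""
    return n_cores * clock_hz / cells_per_sec


slabs = split_z_planes(100)
budget = cycles_per_cell_budget(8, 1e9, 200e6)
print(slabs)
print(budget)
```

For 8 cores at 1 GHz, 200 M cells/sec corresponds to a budget of 40 core cycles per cell across all kernels combined.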


Data Movement Strategy

• C66x architecture allows 3-D data movement using DMA

• Allows defining strides in two directions

• Some limitations exist on the sizes of strides, limiting the shape

– May limit sub-domain definition

– A tall sub-domain will be most useful

• DMAs can be linked

– Multiple data transfers can be initiated

– Continued without core intervention

• Compute can be overlapped with data movement

– Need double buffering
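The double-buffering idea can be sketched as a ping-pong loop over two buffers. This is a sequential illustration of the control flow only: `dma_fetch` and `compute` are hypothetical callables standing in for a DMA transfer and a kernel, and on hardware the fetch of the next tile would run in the background while the current tile is processed.

```python
def process_with_double_buffering(tiles, dma_fetch, compute):
    """Ping-pong between two buffers: fetch tile k+1 while computing tile k."""
    results = []
    if not tiles:
        return results
    buffers = [None, None]
    buffers[0] = dma_fetch(tiles[0])      # prime the first buffer
    for k in range(len(tiles)):
        cur = k % 2
        nxt = 1 - cur
        if k + 1 < len(tiles):
            # On hardware this transfer would be started here and complete
            # in the background; in this sketch it simply runs first.
            buffers[nxt] = dma_fetch(tiles[k + 1])
        results.append(compute(buffers[cur]))
    return results


# Usage: "fetching" copies a tile, "computing" sums it.
demo_tiles = [[1, 2], [3, 4], [5, 6]]
demo_out = process_with_double_buffering(demo_tiles, list, sum)
print(demo_out)
```

Two buffers are needed because the buffer being filled must not be the one the kernel is reading. This halves the on-chip memory available to each tile.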


3-D differential calculation strategy

• Kernel operates on 4 lines simultaneously

• Operate on a 4 x 4 x nx data set as the core computation strategy

• Determine x-differentials on the set of 16 lines

• Add y-differentials on a horizontal plane of 4 x nx, four times

• Add z-differentials on a vertical plane of 4 x nx, four times


[Figure: Total data set needed, showing the regions used for the x-differential, y-differential and z-differential]
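The size of the total data set for one 4 x 4 x nx block can be estimated by counting lines. The sketch below assumes a stencil that reaches 4 samples to one side and 5 to the other along each axis, as a 10-point staggered stencil would; the slides do not state these halo extents, so they and the function names are assumptions.

```python
def lines_needed(block_y=4, block_z=4, reach_lo=4, reach_hi=5):
    """Number of x-lines read to update a block_y x block_z bundle of lines.

    The footprint is cross-shaped: the block itself (used for the
    x-differentials) plus halo lines along y and along z.
    """
    core = block_y * block_z
    y_halo = (reach_lo + reach_hi) * block_z   # extra lines for y-differentials
    z_halo = (reach_lo + reach_hi) * block_y   # extra lines for z-differentials
    return core + y_halo + z_halo


def footprint_bytes(nx, n_lines, bytes_per_value=4, x_halo=9):
    """Bytes occupied by n_lines lines of nx values plus an x halo."""
    return n_lines * (nx + x_halo) * bytes_per_value


n_lines = lines_needed()
size = footprint_bytes(256, n_lines)
print(n_lines, size)
```

Under these assumptions a single field needs 88 lines for 16 output lines, which illustrates why data reuse between neighboring blocks matters for DDR3 traffic.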


Example of Data Movement

[Figure: memory hierarchy used for data movement, showing DDR, MSMC SRAM (shared by all cores), L2 (384K SRAM / 128K cache), L1 (16K SRAM / 16K cache) and CPU]
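A 3-D strided transfer from a large array in DDR into a small on-chip buffer can be modeled as three nested counts with two strides. The sketch below is a software model only; the parameter names are illustrative and are not the device's DMA register definitions.

```python
def strided_copy_3d(src, offset, a_count, b_count, c_count,
                    b_stride, c_stride):
    """Gather c_count frames of b_count rows of a_count contiguous elements.

    src is a flat list; b_stride and c_stride are element offsets between
    consecutive rows and consecutive frames in the source.
    """
    dst = []
    for c in range(c_count):
        for b in range(b_count):
            start = offset + c * c_stride + b * b_stride
            dst.extend(src[start:start + a_count])
    return dst


# Usage: a 6 x 5 x 4 grid stored with x fastest; each value is its own index.
nx, ny, nz = 6, 5, 4
grid = list(range(nx * ny * nz))
x0, y0, z0 = 1, 2, 1
block = strided_copy_3d(grid, offset=(z0 * ny + y0) * nx + x0,
                        a_count=3, b_count=2, c_count=2,
                        b_stride=nx, c_stride=nx * ny)
print(block)
```

Because the strides grow with the grid dimensions, a hardware limit on stride size translates directly into a limit on the sub-domain shapes that a single transfer can fetch.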


Results

• After implementing DMA data movement, performance improved from 45 to 59 M cells/sec on a single 8-core C6678 multi-core DSP

• Performance limited by data transfers over DDR3

– Performance rose only to 63 M cells/sec with all computation disabled

– Theoretical DDR3 bandwidth-limited performance is 120 M cells/sec @ 1330 MHz DDR3

– Currently we are operating at about 50% of DDR3 bandwidth
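The ratios behind these results follow directly from the quoted numbers. The sketch below recomputes them. The implied memory traffic per cell additionally assumes that "1330 MHz" is the transfer rate on a 64-bit (8-byte) bus; that reading is an assumption and not stated on the slide.

```python
# Throughput figures quoted on the slide, in M cells/sec.
BEFORE_DMA = 45.0
AFTER_DMA = 59.0
NO_COMPUTE = 63.0
DDR3_LIMIT = 120.0

dma_gain_pct = 100.0 * (AFTER_DMA - BEFORE_DMA) / BEFORE_DMA
fraction_of_limit = AFTER_DMA / DDR3_LIMIT
transfer_only_fraction = NO_COMPUTE / DDR3_LIMIT

# ASSUMPTION: 1330 M transfers/sec on an 8-byte-wide DDR3 interface.
peak_bytes_per_sec = 1330e6 * 8
implied_bytes_per_cell = peak_bytes_per_sec / (DDR3_LIMIT * 1e6)

print(round(dma_gain_pct, 1), round(fraction_of_limit, 3),
      round(transfer_only_fraction, 3), round(implied_bytes_per_cell, 1))
```

The measured 59 M cells/sec is close to the 63 M cells/sec reached with computation disabled, which supports the conclusion that data transfer, not arithmetic, sets the current limit.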



Future Activity

• Continued performance analysis

– Current measurements done with DDR3 clock rate of 1330 MHz

– Device capable of handling 1600 MHz -> 20% improvement

– Further optimize parameters for maximum data transfer utilization

• Extend analysis to boards with multiple DSPs on PCIe

– MPI based message passing

– Side region data exchange

• Integrate with IWAVE framework

– Framework can run on host with main computes being handled by DSP board(s)

• Add more complicated wave equation update

– Elastic modeling
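The side region data exchange planned for the multi-DSP case can be illustrated without any message-passing library. The sketch below simulates it on plain lists: each subdomain carries `halo` ghost cells at both ends, and neighbors copy their edge interior cells into each other's ghost cells. In a real system these copies would be MPI sends and receives; the function name and the 1-D layout are assumptions made for brevity.

```python
def exchange_side_regions(subdomains, halo):
    """Fill ghost cells of adjacent subdomains from their neighbors.

    Each subdomain is a list laid out as
    [halo ghost cells | interior cells | halo ghost cells],
    and the interior is assumed to hold at least `halo` cells.
    """
    for left, right in zip(subdomains, subdomains[1:]):
        # Last interior cells of `left` fill the low ghost cells of `right`.
        right[:halo] = left[-2 * halo:-halo]
        # First interior cells of `right` fill the high ghost cells of `left`.
        left[-halo:] = right[halo:2 * halo]
    return subdomains


# Usage: two subdomains with a halo of 2; zeros mark unfilled ghost cells.
dom_a = [0, 0, 1, 2, 3, 4, 0, 0]
dom_b = [0, 0, 5, 6, 7, 8, 0, 0]
exchange_side_regions([dom_a, dom_b], halo=2)
print(dom_a, dom_b)
```

Only interior cells are read and only ghost cells are written, so the order in which neighbor pairs are processed does not affect the result.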

