TI Information – Selective Disclosure
High Performance Compute Platform Based on Multi-core DSP for Seismic Modeling and Imaging
Presenter: Murtaza Ali, Texas Instruments
Contributors:
Murtaza Ali, Eric Stotzer, Xiaohui Li, Texas Instruments
William Symes, Jan Odegard, Rice University
Outline
• Introduction to TI Multi-core DSP
• Brief review of IWAVE-based seismic signal modeling
• Details and challenges of implementation
• Results and conclusions
A New Paradigm in High Performance Computing
• Industry-best floating point performance
– 16 Gflops/W
• Standard programming model
– supports MPI and OpenMP
• Wide range of applications
– from embedded systems to server blades
• Full ecosystem support
– Off-the-shelf PCIe and ATCA cards
– O/S and application software
Supported by a full set of development tools and Code Composer Studio IDE
Shannon (TMS320C6678) – Block Diagram
[Block diagram: 8 x CorePac (C66x DSP, each with L1 and L2); Memory Subsystem (Multicore Shared Memory Controller (MSMC), 4 MB shared memory, 64-bit DDR3); Multicore Navigator; TeraNet; System Elements (Power Management, Debug, EDMA, SysMon); Network CoProcessors (Crypto, Packet Accelerator); IP Interfaces (GbE Switch, 2 x SGMII); Peripherals & IO (SRIO x4, PCIe x2, EMIF 16, TSIP 2x, I2C, SPI, UART); HyperLink 50.]
• Multi-Core KeyStone SoC
• Fixed/Floating CorePac
– 8 CorePac @ 1.25 GHz
– 0.5 MB L2/core, 4.0 MB shared L2
– 320G MAC, 160G FLOP, 60G DFLOPS
– 10 W
• Navigator
– Hardware Queue Manager with DMA
• Multicore Shared Memory Controller
– Low latency, high bandwidth memory access
• Network Coprocessor
– IPv4/IPv6 network interface solution
– IPSec, SRTP, encryption fully offloaded
• HyperLink
– 50G Baud expansion port
– Transparent to software
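The peak figures on this slide can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only the numbers quoted above (8 cores, 1.25 GHz, 320G MAC, 160G FLOP, 10 W); the variable and function names are illustrative and not part of any TI tool.

```python
# Figures quoted on the slide for the 8-core C6678.
CORES = 8
CLOCK_HZ = 1.25e9     # 8 CorePac @ 1.25 GHz
PEAK_MACS = 320e9     # 320G MAC
PEAK_FLOPS = 160e9    # 160G FLOP
POWER_W = 10.0        # 10 W


def per_core_per_cycle(peak_rate, cores=CORES, clock_hz=CLOCK_HZ):
    """Operations each core must complete per clock cycle to reach peak_rate."""
    return peak_rate / (cores * clock_hz)


macs_per_cycle = per_core_per_cycle(PEAK_MACS)
flops_per_cycle = per_core_per_cycle(PEAK_FLOPS)
gflops_per_watt = PEAK_FLOPS / 1e9 / POWER_W

print(f"MACs per core per cycle:  {macs_per_cycle:.0f}")
print(f"FLOPs per core per cycle: {flops_per_cycle:.0f}")
print(f"Gflops per watt:          {gflops_per_watt:.0f}")
```

The last value reproduces the 16 Gflops/W quoted on the "A New Paradigm in High Performance Computing" slide.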
C66x – Core Architecture
• 8-issue VLIW architecture
– Can issue 8 instructions per cycle
• 2 data paths
– 4 units per data path
– L, S, D, M
• 64 registers (32 bit)
– 32 per data path
– Can be arranged in dual (64 bit) or quad (128 bit) registers
– Cross connect available
• Single Instruction Multiple Data (SIMD) available
– Dual or quad multiplies
TI DSP SW Resources
• Multicore Software Development Kit
– Peripheral drivers
– Demos for quick start
• OpenMP – alpha version released, example code available
• Linear Algebra Library (BLAS, LAPACK)
– Working with UT Austin to port “libflame” (LAPACK equivalent) to Shannon
• Optimized Libraries
– DSPLIB (math functions), ImageLib
– Medical Imaging SW Toolkit – Ultrasound, Optical Coherence, 3D Rendering
Shannon PCIe Development Cards
• 1 Tera-flop, 120 W, available 1Q12
• 512 Gflops, 50 W, available now
Seismic Modeling
• IWAVE: a framework to enable efficient and scalable finite difference simulation on regular grids
– Includes seismic modeling and imaging
– Implements different wave equation updates
– Used for modeling and imaging
– Open source from Rice University
• Typical iteration in forward sweep (essential part in modeling)
– Wave equation update
– Source addition
– Boundary condition
• Typical iteration in backward sweep (essential part in imaging), Reverse Time Migration (RTM)
– Wave equation update
– Receiver addition
– Boundary condition
– Imaging after iterations complete
Focus of our current study
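The three steps of a forward-sweep iteration listed above can be sketched as a toy time loop. This is not IWAVE code: it is a 1-D acoustic example with a second-order staggered-grid update, and the grid sizes, material constants, the `ricker` source, and the zero-pressure boundary are all assumptions chosen only to make the loop runnable.

```python
import math


def ricker(t, f0=25.0, t0=0.04):
    """Ricker wavelet used as a stand-in source signature (assumed, not from the talk)."""
    a = (math.pi * f0 * (t - t0)) ** 2
    return (1.0 - 2.0 * a) * math.exp(-a)


def forward_sweep(nx=101, nt=200, dx=10.0, dt=1e-3, c=2000.0, rho=1000.0, src_ix=50):
    """Toy 1-D acoustic forward sweep; returns the final pressure field."""
    kappa = rho * c * c        # bulk modulus
    p = [0.0] * nx             # pressure at integer grid points
    v = [0.0] * (nx - 1)       # velocity at half grid points
    for it in range(nt):
        # 1) wave equation update: velocity from the pressure gradient,
        #    then pressure from the velocity divergence
        for i in range(nx - 1):
            v[i] -= dt / (rho * dx) * (p[i + 1] - p[i])
        for i in range(1, nx - 1):
            p[i] -= dt * kappa / dx * (v[i] - v[i - 1])
        # 2) source addition
        p[src_ix] += dt * ricker(it * dt)
        # 3) boundary condition: zero pressure at both ends
        p[0] = 0.0
        p[-1] = 0.0
    return p


pressure = forward_sweep()
print(max(abs(x) for x in pressure))
```

With these defaults the Courant number c*dt/dx is 0.2, so the explicit update stays stable.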
Inside wave update
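[Figure: data flow inside the wave update along x, y, and z. Pressure update: dvxdx, dvydy, and dvzdz are computed from vx, vy, and vz, combined linearly, and used to update px, py, and pz (with arrays epx/mpx, epy/mpy, epz/mpz). Velocity update: dpxdx, dpydy, and dpzdz are computed from px, py, and pz and used to update vx, vy, and vz (with arrays evx/mvx, evy/mvy, evz/mvz). The arrays lax, lay, and laz appear as additional inputs.]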
• Based on velocity-stress PDE
• First order hyperbolic system
• 10th order finite difference method
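To make the "10th order finite difference" bullet concrete, the sketch below derives the stencil weights for a first derivative by matching Taylor series terms. It assumes a staggered grid (the derivative is evaluated halfway between samples), which the slide does not state explicitly; the function name is illustrative.

```python
from fractions import Fraction


def staggered_fd_coefficients(order):
    """Weights c_1..c_M (M = order // 2) such that
    f'(x) ~ (1/h) * sum_k c_k * (f(x + (2k-1)h/2) - f(x - (2k-1)h/2)).
    """
    m = order // 2
    # Taylor matching: sum_k c_k (2k-1)^(2j-1) = 1 if j == 1 else 0, for j = 1..M.
    a = [
        [Fraction((2 * k - 1) ** (2 * j - 1)) for k in range(1, m + 1)]
        + [Fraction(1 if j == 1 else 0)]
        for j in range(1, m + 1)
    ]
    # Gauss-Jordan elimination in exact rational arithmetic.
    for col in range(m):
        pivot = next(r for r in range(col, m) if a[r][col] != 0)
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [x / a[col][col] for x in a[col]]
        for r in range(m):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [a[r][m] for r in range(m)]


coeffs = staggered_fd_coefficients(10)
for k, ck in enumerate(coeffs, start=1):
    print(f"c{k} = {ck} ~ {float(ck):.8f}")
```

A 10th-order stencil of this kind reads five samples on each side of the evaluation point, which is what drives the halo sizes and memory traffic discussed on the later slides.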
Kernel Implementations
• Identified four kernels to optimize for the core instruction architecture
– Differential in x-direction (first dimension)
– Differential in y- or z-direction (orthogonal dimension)
– Update in x-direction
– Update in y- or z-direction
[Figure: optimization trade-off at kernel level, with axes "Compute resource" and "Memory access (load/store)"; labels "Cache friendly (first dimension)" and "Load/store friendly".]
;* .L units 0 0
;* .S units 0 0
;* .D units 8* 8*
;* .M units 5 7
;* .X cross paths 3 2
;* .T address paths 8* 8*
……………………..
;*
;* Searching for software pipeline schedule at ...
;* ii = 8 Schedule found with 4 iterations in parallel
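The compiler feedback above shows why the loop settles at ii = 8: the initiation interval cannot be smaller than the busiest resource's per-iteration usage. The sketch below recomputes that bound from the rows shown. It assumes the two columns are the two data-path sides and that each listed resource can be used once per cycle per side; the elided rows are not included.

```python
# Per-iteration resource usage copied from the feedback listing (two columns,
# assumed to be the two data-path sides). '*' in the listing marks saturated rows.
usage = {
    ".L units": (0, 0),
    ".S units": (0, 0),
    ".D units": (8, 8),
    ".M units": (5, 7),
    ".X cross paths": (3, 2),
    ".T address paths": (8, 8),
}


def resource_bound(usage):
    """Smallest initiation interval allowed by the listed resources."""
    return max(max(sides) for sides in usage.values())


def bottlenecks(usage):
    """Resources whose usage equals the bound, i.e. the limiting ones."""
    bound = resource_bound(usage)
    return sorted(name for name, sides in usage.items() if max(sides) == bound)


print(resource_bound(usage), bottlenecks(usage))
```

The limiting rows are the .D units and .T address paths, both tied to loads and stores, which matches the load/store side of the trade-off figure.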
Kernel Results
• Kernels take between 1 and 3 cycles per cell
• Summing up the kernel numbers shows a capability of over 200 M cells/sec on an 8-core DSP running at 1 GHz
• Initial benchmarks carried out with all data kept in DDR3 memory
– OpenMP used to parallelize across cores
• Assignment is based on z direction
– Need a better data movement strategy over DDR3
– Need to analyze performance bottlenecks
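The z-direction assignment across cores can be pictured as splitting the z-planes into contiguous ranges, one per OpenMP thread. The helper below is a hypothetical illustration of such a split, similar in spirit to a static loop schedule; it is not taken from the benchmark code.

```python
def partition_z(nz, n_cores=8):
    """Split z-planes 0..nz-1 into contiguous, near-equal [start, stop) ranges."""
    base, extra = divmod(nz, n_cores)
    ranges = []
    start = 0
    for core in range(n_cores):
        count = base + (1 if core < extra else 0)  # first `extra` cores take one more plane
        ranges.append((start, start + count))
        start += count
    return ranges


z_ranges = partition_z(100)
for core, (lo, hi) in enumerate(z_ranges):
    print(f"Core #{core}: z planes {lo}..{hi - 1}")
```

Each core then sweeps its own slab of planes, so neighboring cores only share the planes within a stencil half-width of the slab boundary.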
[Figure: Core #0 through Core #7, with OpenMP threads running on each core.]
Data Movement Strategy
• C66x architecture allows 3-D data movement using DMA
• Allows defining strides in two directions
• Some limitations exist on the sizes of strides, limiting the shape
– May limit sub-domain definition
– A tall sub-domain will be most useful
• DMAs can be linked
– Multiple data transfers can be initiated
– Transfers continue without core intervention
• Compute can be overlapped with data movement
– Need double buffering
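A 3-D transfer with two strides can be modeled in software as a gather from a flat array: elements along the first dimension are contiguous, and one stride each steps to the next line and the next plane. The function below is a plain-Python model of that addressing pattern, with made-up parameter names; it does not use or describe the actual DMA programming interface.

```python
def dma_3d_gather(src, base, count_x, count_y, count_z, stride_y, stride_z):
    """Copy a count_z x count_y x count_x block from flat list `src` into a
    contiguous list. x is contiguous; stride_y and stride_z are the two strides."""
    dst = []
    for k in range(count_z):
        for j in range(count_y):
            start = base + k * stride_z + j * stride_y
            dst.extend(src[start:start + count_x])
    return dst


# Example grid: nx=8, ny=6, nz=5, stored with flat index x + nx * (y + ny * z).
nx, ny, nz = 8, 6, 5
grid = list(range(nx * ny * nz))

# Sub-block starting at (x=2, y=1, z=1) with extent 4 x 3 x 2.
base = 2 + nx * (1 + ny * 1)
block = dma_3d_gather(grid, base, count_x=4, count_y=3, count_z=2,
                      stride_y=nx, stride_z=nx * ny)
print(block[:4], len(block))
```

In this model a limit on the stride values translates directly into a limit on nx and nx * ny, which is one way to read the remark that stride limits constrain the sub-domain shape.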
3-D differential calculation strategy
• Kernel operates on 4 lines simultaneously
• Operate on a 4 x 4 x nx data set as the core computation strategy
• Determine x-differentials on the set of 16 lines
• Add y-differentials on a horizontal plane of 4 x nx, four times
• Add z-differentials on a vertical plane of 4 x nx, four times
[Figure: total data set needed, showing the x-differential, y-differential, and z-differential regions.]
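A rough count of the "total data set needed" for one 4 x 4 x nx tile follows from the stencil reach. The sketch assumes a symmetric reach of `halo` lines on each side in y and z (5 is used as an assumed value for a 10th-order stencil), and that x-direction halos lie along the lines themselves; both are assumptions rather than figures from the slide.

```python
def lines_needed(tile=4, halo=5):
    """Number of x-lines (each of length ~nx) touched to update a tile x tile bundle."""
    x_lines = tile * tile            # the 16 lines being updated
    y_extra = tile * 2 * halo        # per horizontal plane: halo lines on both sides in y
    z_extra = tile * 2 * halo        # per vertical plane: halo lines on both sides in z
    return x_lines + y_extra + z_extra


total = lines_needed()
lines_per_output = total / (4 * 4)
print(total, lines_per_output)
```

Under these assumptions the tile reads several input lines per output line, which is consistent with the deck's later finding that DDR3 traffic, not compute, limits performance.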
Example of Data Movement
[Figure: memory hierarchy used for data movement: DDR; MSMC SRAM (shared by all cores); L2 (384K SRAM / 128K cache); L1 (16K SRAM / 16K cache); CPU.]
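The double buffering mentioned on the Data Movement Strategy slide can be sketched as a ping-pong schedule: while one on-chip buffer is being computed on, the next block is fetched into the other. The code below only models the ordering of operations in sequential Python; `fetch` and `compute` are hypothetical callbacks, and on real hardware the fetch would be an asynchronous DMA that overlaps with the compute.

```python
def process_blocks(blocks, fetch, compute):
    """Ping-pong schedule over two buffers; returns (results, log of operations)."""
    buffers = [None, None]
    results = []
    log = []
    if not blocks:
        return results, log
    buffers[0] = fetch(blocks[0])                  # prime the first buffer
    log.append(("fetch", 0, 0))
    for i in range(len(blocks)):
        cur = i % 2
        nxt = 1 - cur
        if i + 1 < len(blocks):
            buffers[nxt] = fetch(blocks[i + 1])    # hardware: start DMA and return
            log.append(("fetch", i + 1, nxt))
        results.append(compute(buffers[cur]))      # hardware: runs while DMA is in flight
        log.append(("compute", i, cur))
        # hardware: wait for DMA completion here before the buffers swap roles
    return results, log


results, log = process_blocks([[1, 2], [3, 4], [5, 6]], fetch=list, compute=sum)
print(results)
for entry in log:
    print(entry)
```

Each block needs two on-chip buffers of its size, so the usable block size is roughly half of whatever SRAM is set aside at the chosen level of the hierarchy.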
Results
• After implementing DMA data movement, performance went from 45 to 59 M cells/sec on a single 8-core C6678 multi-core DSP
• Performance is limited by data transfers over DDR3
– Performance only went up to 63 M cells/sec when all computes were disabled
– Theoretical DDR3-bandwidth-limited performance is 120 M cells/sec @ 1330 MHz DDR3
– Currently we are operating at about 50% of DDR3 bandwidth
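The relationships between these figures can be checked with simple ratios. The throughput numbers below are the ones quoted in the talk; the peak-bandwidth line additionally assumes a 64-bit DDR3 interface moving 8 bytes per transfer at 1330 million transfers per second, which is an interpretation of "1330 MHz" and not something the slides state.

```python
# Throughput figures quoted in the talk, in million cells per second.
KERNEL_ONLY = 200.0   # from kernel cycle counts, 8 cores at 1 GHz
BEFORE_DMA = 45.0
WITH_DMA = 59.0
NO_COMPUTE = 63.0     # data movement only, computes disabled
DDR3_LIMIT = 120.0    # theoretical bandwidth-limited rate

cycle_budget = 8 * 1e9 / (KERNEL_ONLY * 1e6)      # core cycles available per cell
dma_gain = WITH_DMA / BEFORE_DMA - 1.0            # relative gain from DMA
utilization = WITH_DMA / DDR3_LIMIT               # fraction of the DDR3 limit reached
transfer_only_utilization = NO_COMPUTE / DDR3_LIMIT

# Assumption: 8 bytes per transfer at 1330e6 transfers per second.
peak_bytes_per_s = 8 * 1330e6
implied_bytes_per_cell = peak_bytes_per_s / (DDR3_LIMIT * 1e6)

print(f"cycle budget per cell at 200 M cells/s: {cycle_budget:.0f}")
print(f"gain from DMA: {dma_gain:.0%}")
print(f"fraction of DDR3 limit: {utilization:.0%} (transfers only: {transfer_only_utilization:.0%})")
print(f"implied DDR3 traffic per cell: {implied_bytes_per_cell:.1f} bytes")
```

The measured rate with computes enabled sits close to the transfer-only rate, which supports the slide's conclusion that data movement, not the kernels, is the current limit.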
Future Activity
• Continued performance analysis
– Current measurements done with a DDR3 clock rate of 1330 MHz
– Device capable of handling 1600 MHz -> 20% improvement
– Further optimize parameters for maximum data transfer utilization
• Extend analysis to PCIe boards with multiple DSPs
– MPI based message passing
– Side region data exchange
• Integrate with IWAVE framework
– Framework can run on host with main computes being handled by DSP board(s)
• Add more complicated wave equation update
– Elastic modeling