An Introduction to 3L Diamond
on Sundance Hardware
Some slides have extra information as notes.
What is 3L Diamond?
Diamond is a set of tools and other components that work together with the TI C compiler and linker to support applications using multiprocessor hardware.
Sundance hardware is well suited to Diamond’s way of dealing with multiprocessors, and the combination provides the most rapid way to get your application running efficiently.
Why Diamond?
The first response of many people when offered Diamond is:
“We do not need any extra software.
Code Composer Studio provides everything we need to write multiprocessor applications.”
Is this really true?
The Hardware
The structure of Sundance hardware is a good place to start. Sundance provides modular hardware that allows you to build complex multiprocessor systems. Modules include an FPGA that is used to implement interprocessor links that allow pairs of processors to communicate. These include comports and SDBs.
A Sundance Module
C6000 DSP
FPGA
Flash ROM
External Memory
Comports
JTAG out
JTAG in
Reset
SDBs
Typical Hardware
[Diagram: a host PC and a network of Sundance modules (SMT395 DSP, two SMT374 DSP, SMT362 DSP, SMT390 ADC) joined by comport and SDB links.]
Scaling
Sundance hardware scales:
• There are no shared resources
• Adding processors adds communication
• No contention for shared memory or buses
How to Develop Applications
Given hardware like this, the first thought will be that Code Composer from TI is ideal for developing applications.
We shall now investigate this thought.
Code Composer Studio
•A good platform for single-processor work.
•No real support for multiprocessors.
  • CCS is really a single-processor system
  • You have to treat each processor separately.
•You build separate programs for each processor as follows:
Building with CCS
[Diagram: for each processor, source files are compiled into object files by the Texas Instruments compiler, and the linker combines them into a separate executable (.out). Three executables are shown.]
Problem: Specification
•You have to divide your application into separate programs for each processor.
  • Modularity should be driven by the program structure.
  • You should not use the hardware structure.
•Difficult to use several developers:
  • only one program for each processor
•Difficult to test components
  • hard to make each processor work in isolation
How do you load the application?
•You have to load using JTAG
•JTAG is very slow (0.2MB/s)
•You have all the parts of your application as separate .out files, one for each processor.
•You have to load these, one at a time.
  • it is very easy to load the wrong processor
  • it is very easy to forget to load a processor
  • instructions for your users are complicated
Problem: Loading
•Customers need CCS (or load from ROM)
•Difficult to develop your own host program
•You can’t use JTAG from a program.
•You must use a separate mechanism to allow processors to communicate.
  • This means you have to maintain two unrelated networks:
    • JTAG chain for loading
    • I/O network for communication
Problem: Host integration
•Host communication is with JTAG.
  • very slow
  • very difficult to add your own host code
•Need to use other devices
  • need to write host driver code
  • how to start the host code & DSP code?
Problem: Communication
•How do the processors communicate?
  • No support for Sundance peripherals
  • Need to write device drivers
    • Learn device details
    • Manage EDMA
    • Deal with EDMA coherency problems
    • Manage interrupts
    • Learn the tricks to make them run fast
Problem: Message routing
If two processors want to exchange data but there is no direct connection between them, the data will have to be routed through intermediate nodes.
• How do you do this?
• How do you construct routing tables?
  • by hand?
  • build in knowledge of the processor network?
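To give a feel for what building routing tables by hand involves, here is a minimal sketch in C that derives a next-hop table from a map of direct links using a breadth-first search. It is an illustration only, not Diamond's routing code; the names build_routes, MAX_NODES and NO_ROUTE are made up for this example.

```c
#include <assert.h>

#define MAX_NODES 8
#define NO_ROUTE  (-1)

/* links[a][b] != 0 means processors a and b share a direct link.
   Fills next_hop[src][dst] with the neighbour of src that lies on a
   shortest path to dst, or NO_ROUTE when dst is unreachable or is src. */
void build_routes(int n, int links[MAX_NODES][MAX_NODES],
                  int next_hop[MAX_NODES][MAX_NODES])
{
    assert(n > 0 && n <= MAX_NODES);
    for (int src = 0; src < n; src++) {
        int queue[MAX_NODES], head = 0, tail = 0;
        int first[MAX_NODES];      /* first hop used to reach each node */
        int seen[MAX_NODES] = {0};

        for (int i = 0; i < n; i++)
            first[i] = NO_ROUTE;
        seen[src] = 1;
        queue[tail++] = src;

        /* Breadth-first search outwards from src. */
        while (head < tail) {
            int cur = queue[head++];
            for (int nb = 0; nb < n; nb++) {
                if (!links[cur][nb] || seen[nb])
                    continue;
                seen[nb] = 1;
                /* A direct neighbour is its own first hop; anything
                   further away inherits the first hop of its parent. */
                first[nb] = (cur == src) ? nb : first[cur];
                queue[tail++] = nb;
            }
        }
        for (int dst = 0; dst < n; dst++)
            next_hop[src][dst] = first[dst];
    }
}
```

Every change to the network means filling in the link map again and rebuilding the table, and the result still has to be compiled into each processor's program.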
Problem: Deadlock
A problem with all message routing systems is deadlocking.
Deadlock occurs when a transfer from one processor to another has to wait for a transfer between another pair of processors, but that second transfer is itself waiting for the first to complete!
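The circular wait can be pictured as a chain of "is waiting for" relations that loops back on itself. The C sketch below checks such a chain for a loop. It is a hypothetical illustration (has_deadlock and the waits_for array are invented here) and says nothing about how Diamond itself avoids deadlock.

```c
#include <assert.h>

#define MAX_TRANSFERS 16
#define NOT_WAITING   (-1)

/* waits_for[i] is the index of the transfer that transfer i is blocked
   on, or NOT_WAITING if transfer i can proceed.
   Returns 1 if some chain of waits never ends, 0 otherwise. */
int has_deadlock(int n, const int waits_for[])
{
    assert(n >= 0 && n <= MAX_TRANSFERS);
    for (int start = 0; start < n; start++) {
        int cur = start;
        /* A chain without a loop reaches NOT_WAITING within n steps. */
        for (int steps = 0; steps <= n && cur != NOT_WAITING; steps++)
            cur = waits_for[cur];
        if (cur != NOT_WAITING)
            return 1;   /* still blocked: the chain must contain a loop */
    }
    return 0;
}
```

With two transfers that each wait for the other (waits_for = {1, 0}) the function reports a deadlock; with a simple chain such as {1, 2, NOT_WAITING} it does not.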
Deadlock prevention options
•Use a proven deadlock-free system.
•Make the user stop the program and change parameters each time a deadlock happens.
•Hope it never happens.
The most common technique is:
•Be completely unaware deadlock can happen.
Problem: The Cache
•There are problems with cache coherency
  • The cache cannot maintain coherence between:
    • external memory
    • EDMA transfers
•Transfers must handle cache coherency
  • you cannot turn the cache off
  • cache errors are very hard to find
•You have to sort out all these problems.
Why loading may fail
JTAG loading assumes the cache is clear. This is not true with Sundance hardware. After reset, a bootloader is loaded from ROM and executed. This initialises the processor and configures the FPGA to implement the inter-processor communication links. The code for the bootloader gets into the cache. JTAG loads behind the cache, leading to inconsistencies that prevent programs running.
Problem: Making changes
•How do you change the network?
  • Rewrite sections of your code
  • Are there enough EDMA channels?
  • only 4 external interrupt lines for synchronisation
  • what if you use more than 4 devices?
    • host comport (2 devices)
    • comport to another processor (2 devices)
    • SDB to another processor (2 devices)
    • that is already 6 devices
Problem: Changing Devices
•How do you change processors?
  • different device addresses
  • different memory sizes
  • different memory addresses
  • different initialisation requirements
•With CCS: rewrite sections of your code.
Problem: Choosing devices
• Comports
• Sundance Digital Bus (SDB)
• Rocket I/O
•You need to learn how to use them.
•You need to write & maintain device drivers.
•You need to change your code to use them.
Before you start coding…
•Be certain you know how to partition the problem.
•Be certain you know how much memory you need.
•Be certain you know which modules you need.
•Be certain of the system topology.
… because it will be very hard to change.
The advantage of CCS
•You have complete control of everything…
•… because you have to do everything yourself
… and this takes a lot of time and experience.
CCS: Summary
•CCS works well with single processors
•It was not designed for multiple processors
•You have to do all the hard work
•Knowledge gets built into the application:
  • processor types
  • memory layout
  • I/O devices being used
  • connections between processors
•It is very hard to make significant changes.
Diamond
•Originally designed in 1987
  • tried and tested
  • proven model
•Designed for multiprocessor systems
•Designed for simplicity
•Designed for efficiency
  • during development
  • during execution
Some advantages of Diamond
•Easy to use
•Gives you flexibility: late binding
  • easy to change topology
  • easy to change modules
•Reduces housekeeping
  • memory usually allocated for you
  • interrupts handled for you
  • loading managed for you
  • communication details managed for you
  • processor issues handled for you
What Diamond is not
•Diamond is not a compiler
  • we use the standard TI compiler and linker
•Diamond is not a simulator or an interpreter
  • real, optimised code is generated
•Diamond is not DSP/BIOS
  • it has its own optimised kernel, designed for multiprocessor operation
  • it does not have or need a large API
Building with Diamond
•You partition the application into tasks:
  • modularity determined by the needs of the application; you ignore processors here.
•Diamond adds an extra configuration step.
•The configurer:
  • can see the whole application
  • can optimise communication and device access.
  • builds a single output file; nothing can get lost.
  • arranges to load from this single file.
Building with CCS
[Diagram repeated for comparison: each processor's source files are compiled and linked by the Texas Instruments tools into its own executable (.out).]
Building with Diamond
[Diagram: the source files for each task are compiled and linked by the Texas Instruments compiler and linker into a relocatable task file (.tsk). The 3L Diamond configurer then combines the task files into a single application file (.app).]
With Diamond…
•The application is in a single file.
  • Nothing can get lost.
  • You cannot get loading wrong.
•Loading is easy
  • load from the host
  • no need for ROM during development
  • development is fast
Diamond…
•is designed for multiprocessor systems.
•has its own small, efficient microkernel.
•has a small but effective API.
•is optimised for target hardware:
  • it knows about different modules
  • it automatically inserts optimised device drivers
  • it handles interrupts
  • it handles memory and the cache
•is very good at communication
•leaves you free to concentrate on your code.
Sundance TIMs
[Diagram labels: C6000 DSP, EMIF, Memory, ROM, FPGA, Comport Links, SDB Links.]
Dual-Processor Module
Memory
C6000 DSP
Comports
Memory
C6000 DSP
FPGA
SDBs
Internal comports
Identical to two separate modules; there are no shared resources.
The Diamond Model
Diamond builds applications from independent tasks that send data to other tasks using channels.
This model is based upon CSP: Communicating Sequential Processes.
CSP: Communicating Sequential Processes
[Diagram: tasks connected by channels.]
Forget about processors
A Diamond application is…
•Tasks
  • complete C programs
  • start at a main function
  • fully linked (but relocatable)
  • input & output ports for connecting channels
    • unlimited number of ports
  • Multi-threaded
•Channels
  • data transfer mechanisms
  • transfer data from one task to one other
  • blocking: both ends wait for completion
Channels
•Many possible implementations
  • memcpy – between tasks on one processor
  • I/O – between adjacent processors
    • comports
    • SDBs
    • Rapid IO links
  • Routed I/O – between remote processors
    • software routing
    • guaranteed deadlock-free
    • any task can communicate with any other task
Diamond will choose the best implementation.
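As a rough picture of that choice, the C sketch below picks a transport from the number of links separating two tasks. The function choose_transport and its enum are assumptions made for illustration; the configurer's real decision logic is not described in these slides.

```c
#include <assert.h>

typedef enum { VIA_MEMCPY, VIA_DIRECT_LINK, VIA_ROUTED } transport;

/* hops is the number of inter-processor links between the two tasks:
   0 = same processor, 1 = adjacent processors, more = remote processors. */
transport choose_transport(int hops)
{
    assert(hops >= 0);
    if (hops == 0)
        return VIA_MEMCPY;        /* both tasks on one processor */
    if (hops == 1)
        return VIA_DIRECT_LINK;   /* e.g. a comport or SDB between neighbours */
    return VIA_ROUTED;            /* passed on through intermediate processors */
}
```

The point of the model is that the task code is the same in all three cases; only the placement of the tasks differs.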
The Hardware
Module
C6000
FPGA
EMIF
comports
SDBs
Host PC
A Sundance Network
[Diagram: a host PC and a network of C62, C64 and C67 processors.]
Ideal Hardware
•No shared resources
  • Simplifies hardware
  • Simplifies software
  • Scales: more processors = more power
•Connected by communication links
  • Add processors = add bandwidth
•Designing multiprocessor hardware:
  • Speak to 3L first.
Tasks & Channels
Map onto hardware
A simple task
[Diagram: the AddOne task. Words come in on DATA_IN (input channel) and incremented words go out on DATA_OUT (output channel). The task is drawn with input ports 0 to 2 and output ports 0 to 2.]
A simple task
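#include <chan.h>

INPUT_PORT(0, DATA_IN)      /* DATA_IN names input port 0 */
OUTPUT_PORT(0, DATA_OUT)    /* DATA_OUT names output port 0 */

main()
{
    int n;

    for (;;) {
        chan_in_word (&n, &DATA_IN);      /* wait for the next word */
        chan_out_word(n+1, &DATA_OUT);    /* send it on, incremented */
    }
}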
Team Working
•Tasks are self-contained
•They are developed separately
•Communication between tasks:
  • is a contract
  • allows test systems to be built
•Ideal for team working
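One way to picture a test system: because a task only sees its channels, its logic can be exercised against stand-in channels. The C sketch below uses a hypothetical in-memory FIFO (mock_chan, mock_in_word and mock_out_word are invented names, not Diamond calls) to check the AddOne behaviour. Unlike a real Diamond channel, it buffers words rather than blocking both ends.

```c
#include <assert.h>

#define CAP 16

/* Hypothetical stand-in for a channel: a small FIFO of words. */
typedef struct {
    int buf[CAP];
    int head;    /* index of the oldest word */
    int count;   /* number of words waiting */
} mock_chan;

void mock_out_word(int w, mock_chan *c)
{
    assert(c->count < CAP);               /* the sketch never blocks */
    c->buf[(c->head + c->count) % CAP] = w;
    c->count++;
}

void mock_in_word(int *w, mock_chan *c)
{
    assert(c->count > 0);
    *w = c->buf[c->head];
    c->head = (c->head + 1) % CAP;
    c->count--;
}

/* The body of the AddOne loop, run once for each word waiting on in. */
void add_one_drain(mock_chan *in, mock_chan *out)
{
    while (in->count > 0) {
        int n;
        mock_in_word(&n, in);
        mock_out_word(n + 1, out);
    }
}
```

A test feeds known words into the input side, runs the task body, and compares what arrives on the output side, without needing the rest of the application.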
Design Flow
•Network
  • Tasks
  • Channels
•Code tasks
•Compile & Link
•Configuration File
•Configure
•Load & Run
[Diagram labels shown as the steps build up: Sources, Tasks, configuration file, application file, processor network.]
Running an application
Demonstration Hardware
SMT365
SMT370
SMT374
SMT361
Only the SMT365 and the SMT361 will be used in the examples.
A Correlator Example
[Diagram: the tasks Example2, Correlator, UI, Disp_raw, Disp_cor and MainCtrl, connected by control channels and data channels.]
Code Each Task
OUTPUT_PORT(2, COR_DATA)
INPUT_PORT (1, COR_RESULT)
. . .
main()
{
    printf("3L Diamond Example\n");
    for (;;) {
        . . .
        chan_out_message(BYTES, Data, &COR_DATA);
        chan_in_message(BYTES, Result, &COR_RESULT);
        . . .
    }
}
Configuration
Write a configuration file to:
• Describe the hardware
  • processors
  • connections between processors
• Describe the software
  • tasks
  • channels connecting tasks
• Map the software onto the hardware
  • place tasks on processors
Task names
TASK example2
TASK mainctrl
TASK disp_raw
TASK disp_cor
TASK UI
TASK correlator
Task ports
TASK example2 INS=3 OUTS=7
TASK mainctrl INS=1 OUTS=1
TASK disp_raw INS=2 OUTS=0
TASK disp_cor INS=2 OUTS=0
TASK UI INS=1 OUTS=1
TASK correlator INS=1 OUTS=1
Task stack & heap
TASK example2 INS=3 OUTS=7 DATA=500K
TASK mainctrl INS=1 OUTS=1 DATA=200K
TASK disp_raw INS=2 OUTS=0 DATA=200K
TASK disp_cor INS=2 OUTS=0 DATA=200K
TASK UI INS=1 OUTS=1 DATA=200K
TASK correlator INS=1 OUTS=1 DATA=32K
Task starting priorities
TASK example2 urgent INS=3 OUTS=7 DATA=500K
TASK mainctrl INS=1 OUTS=1 DATA=200K
TASK disp_raw INS=2 OUTS=0 DATA=200K
TASK disp_cor INS=2 OUTS=0 DATA=200K
TASK UI urgent INS=1 OUTS=1 DATA=200K
TASK correlator priority=2 INS=1 OUTS=1 DATA=32K
! The starting priority is 1 unless explicitly stated.
Channel creation
!       channel  output port    input port
!       =======  ===========    ==========
CONNECT C1       UI[0]          example2[0]
CONNECT C2       example2[5]    mainctrl[0]
CONNECT C3       mainctrl[0]    example2[2]
CONNECT C4       example2[0]    disp_raw[0]
CONNECT C5       example2[1]    disp_raw[1]
CONNECT C6       example2[2]    correlator[0]
CONNECT C7       correlator[0]  example2[1]
CONNECT C8       example2[3]    disp_cor[0]
CONNECT C9       example2[4]    disp_cor[1]
CONNECT C10      example2[6]    UI[0]
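The CONNECT list has to agree with the INS and OUTS counts declared for each task. As a cross-check, the C sketch below encodes the task declarations and connections from these slides and counts any mismatches. The names (task_decl, connection, count_wiring_errors) are hypothetical, and the rule that each port carries only one channel is an assumption; this is not the configurer's own checking.

```c
#include <assert.h>
#include <string.h>

typedef struct { const char *name; int ins, outs; } task_decl;
typedef struct { const char *from; int out_port;
                 const char *to;   int in_port; } connection;

/* TASK declarations, copied from the configuration file. */
static const task_decl tasks[] = {
    {"example2", 3, 7}, {"mainctrl", 1, 1}, {"disp_raw", 2, 0},
    {"disp_cor", 2, 0}, {"UI", 1, 1},       {"correlator", 1, 1},
};

/* CONNECT statements C1 to C10, in order. */
static const connection conns[] = {
    {"UI", 0, "example2", 0},         {"example2", 5, "mainctrl", 0},
    {"mainctrl", 0, "example2", 2},   {"example2", 0, "disp_raw", 0},
    {"example2", 1, "disp_raw", 1},   {"example2", 2, "correlator", 0},
    {"correlator", 0, "example2", 1}, {"example2", 3, "disp_cor", 0},
    {"example2", 4, "disp_cor", 1},   {"example2", 6, "UI", 0},
};

enum { NTASKS = sizeof tasks / sizeof tasks[0],
       NCONNS = sizeof conns / sizeof conns[0] };

static const task_decl *find_task(const char *name)
{
    for (int i = 0; i < NTASKS; i++)
        if (strcmp(tasks[i].name, name) == 0)
            return &tasks[i];
    return 0;
}

/* Counts unknown task names, port numbers outside the declared INS/OUTS
   range, and ports used by more than one connection (assumed rule). */
int count_wiring_errors(void)
{
    int errors = 0;
    for (int i = 0; i < NCONNS; i++) {
        const task_decl *src = find_task(conns[i].from);
        const task_decl *dst = find_task(conns[i].to);
        if (!src || conns[i].out_port < 0 || conns[i].out_port >= src->outs)
            errors++;
        if (!dst || conns[i].in_port < 0 || conns[i].in_port >= dst->ins)
            errors++;
        for (int j = 0; j < i; j++) {
            if (strcmp(conns[i].from, conns[j].from) == 0 &&
                conns[i].out_port == conns[j].out_port)
                errors++;
            if (strcmp(conns[i].to, conns[j].to) == 0 &&
                conns[i].in_port == conns[j].in_port)
                errors++;
        }
    }
    return errors;
}
```

For the example as given, every declared port is used exactly once: example2 uses output ports 0 to 6 and input ports 0 to 2, matching OUTS=7 and INS=3.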
The processor & placement
PROCESSOR Root SMT365_8_1
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root
Processor types
Diamond supports all of the Sundance TIMs. The ProcType utility will display them all.
A note about memory
• With CCS you need to:
  • specify memory explicitly.
  • know which “sections” are used by the compiler
  • allocate memory explicitly at the start
• Diamond can do all memory allocation
  • available memory determined automatically
  • no linker command files
  • but, you can tell Diamond how to use memory
    • this is an optimisation once the code is working.
    • ignore it until the program’s needs are understood.
Building & Running
•Compile each task with the command: 3L C
•Link each task with the command: 3L T
•Configure with the command: 3L A
•Execute with the command: 3L X
Making it run faster
Use a second processor
We shall use TIM1 (SMT365) and TIM4 (SMT361) connected by comports 0 & 3 respectively.
Use a second processor
PROCESSOR Root SMT365_8_1
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root
Use a second processor
PROCESSOR Root SMT365_8_1
PROCESSOR Node SMT361
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root
Use a second processor
PROCESSOR Root SMT365_8_1
PROCESSOR Node SMT361
WIRE W1 Root[CP:0] Node[CP:3]
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root
Use a second processor
PROCESSOR Root SMT365_8_1
PROCESSOR Node SMT361
WIRE W1 Root[CP:0] Node[CP:3]
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Node
Notes
•The two tasks have not changed in any way.
•Their connections have not changed.
•No need to recompile them or relink them.
•All we changed to move the tasks onto a second processor was the configuration file.
•We just built a new application by running the configuration command again (3L A).
•Loading the two processors is automatic.
Making it go even faster
Use the FPGA on the SMT365
PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root
The FPGA is already being used
•The FPGA is also used to support functions on the SMT365 DSP.
•Attaching the FPGA to its processor allows the configurer to include all the necessary logic to support the needed functions.
Use the FPGA
PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root
Use the FPGA
PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root
WIRE W1 Root[SDB:0] F[SDB_DEVICE:0]
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator Root
Use the FPGA
PROCESSOR Root SMT365_8_1
PROCESSOR F FPGA ATTACH=Root
WIRE W1 Root[SDB:0] F[SDB_DEVICE:0]
…
PLACE mainctrl Root
PLACE example2 Root
PLACE disp_raw Root
PLACE disp_cor Root
PLACE UI Root
PLACE correlator F
FPGA Tasks
•Placing a task on an FPGA instructs the configurer to look for an FPGA version of the task.
•This can be written using:
  • VHDL
  • Xilinx System Generator
  • Handel-C (Celoxica)
  • Any other method you like.
Building with FPGA
•The configurer will construct a Xilinx project for the FPGA
•It will call the Xilinx tools to build a complete bitstream.
•The bitstream will be included in the single application file.
•The FPGA will be configured automatically as the application is loaded.
Conclusion
•Diamond does a lot of the work for you.
•Diamond allows you to change your mind and alter processors and topology.
•Diamond gives a structured model for developing efficient applications.
•The Diamond model is the same for any number and any combination of processors: DSP or FPGA.
•Diamond simplifies developing multiprocessor applications.