SHARC
2
The SHARC
• Developed by Analog Devices
• Optimized for demanding DSP and imaging applications.
• 32-bit floating point, with 40-bit extended-precision floating-point capability.
• Large on-chip memory.
• Ideal for scalable multi-processing applications.
3
Harvard Architecture
• Program memory can store data.
• Able to read or write data at one memory location while simultaneously fetching an instruction from another.
• 2 buses:
  1. Data memory bus.
  2. Program memory bus.
• Either two separate memories or a single dual-port memory.
4
Super Harvard Architecture
• Many processors employ a Harvard architecture by integrating two separate memories or caches into the processor chip.
• The SHARC is unique in that its internal memory is capable of holding a large program as well as a large amount of data. This is what makes it “Super.”
5
DSP
• Digital Signal Processor.
• High speed, low overhead data movement and rapid computations required.
• Usually has small on-board ROM and RAM and a single-cycle multiplier.
• Designed to run streaming (serial-in, serial-out) signal-processing applications very fast.
6
DSP Computations
• The inner product of two vectors is a common computation for determining energy or correlation.
• The following C code is an example: for (n = 0; n < length; n++) result += x[n] * y[n];
• The processor that executes this loop in the least instruction time will have the best performance.
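The loop above can be wrapped in a complete function. A minimal C sketch (the function name inner_product is illustrative, not from the slides); on a DSP each loop iteration ideally maps to one multiply-accumulate instruction:

```c
#include <stddef.h>

/* Inner product of two vectors: the multiply-accumulate loop from the
   slide, wrapped in a function. Energy is the special case x == y. */
float inner_product(const float *x, const float *y, size_t length)
{
    float result = 0.0f;              /* running total (the accumulator) */
    for (size_t n = 0; n < length; n++)
        result += x[n] * y[n];        /* multiply, then accumulate */
    return result;
}
```

For example, inner_product on {1, 2, 3} and {4, 5, 6} gives 4 + 10 + 18 = 32.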
7
SHARC DSP
• The SHARC incorporates features aimed at optimizing such loops.
• High-Speed Floating Point Capability
• Extended Floating Point
• These features are DSP specific.
• As a result, performance on non-DSP applications may be less than optimal.
8
Floating Point and Extended Floating Point
• The SHARC supports 32-bit floating-point, 40-bit extended floating-point and fixed-point arithmetic.
• Floating-point computations take no additional clock cycles.
• Data is automatically truncated when moved from the 40-bit internal registers to 32-bit memory, and zero-padded when moved from memory into the registers.
• Not precise enough for some scientific algorithms, but provides an excellent signal-to-noise ratio.
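A rough model of the truncation and zero-padding described above, offered as an illustrative sketch rather than Analog Devices' specification. It assumes the 40-bit register value is held in the low 40 bits of a uint64_t and that a 32-bit word occupies the upper 32 of those 40 bits, so the 8 extra bits extend the mantissa at the least-significant end. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: the 8 extra bits of the 40-bit format sit below the
   32-bit word (they extend the mantissa at the LSB end). */
#define EXTRA_BITS 8u

/* Memory -> register: append 8 zero bits below the 32-bit word. */
uint64_t zero_pad_to_40(uint32_t word32)
{
    return (uint64_t)word32 << EXTRA_BITS;
}

/* Register -> memory: discard the 8 least-significant bits. */
uint32_t truncate_to_32(uint64_t reg40)
{
    return (uint32_t)((reg40 >> EXTRA_BITS) & 0xFFFFFFFFu);
}
```

Under this model, a round trip from memory to a register and back is lossless, while any extra precision accumulated in the low 8 bits of a register is lost when the result is stored.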
9
SHARC’s Internal Memory
• Makes SHARC unique.
• Size
  – Allows many complex functions to be performed on-chip, eliminating the need to move data between internal and external memory.
  – Memory size is significantly larger than that of most other high-speed computational devices.
• Dual-block, dual-port
  – Optimizes the Harvard architecture by allowing instructions to be fetched while data memory accesses are performed.
10
Multiply and Accumulate Instructions on the SHARC
• Like most DSPs, the SHARC can compute a product and add it to a running total in a single clock cycle.
• The SHARC’s “super instruction” can perform a multiply-accumulate while simultaneously adding, subtracting, or averaging the data in two other registers.
• These instructions give the SHARC its 120 MFLOPS rating.
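A behavioral model of one super instruction, written in C rather than SHARC assembly. It shows what is computed in a single cycle, not how the hardware does it; the struct, field and function names are hypothetical.

```c
/* Hypothetical container for the three results of one super instruction. */
typedef struct {
    float acc;   /* running multiply-accumulate total */
    float sum;   /* result of the parallel add */
    float diff;  /* result of the parallel subtract */
} super_result;

/* One cycle's worth of work: multiply a by b and add the product to the
   accumulator, while also forming the sum and difference of c and d.
   (An averaging variant would compute (c + d) / 2 instead.) */
super_result super_instruction(float acc, float a, float b, float c, float d)
{
    super_result r;
    r.acc  = acc + a * b;   /* multiply-accumulate */
    r.sum  = c + d;         /* parallel add */
    r.diff = c - d;         /* parallel subtract */
    return r;
}
```

If the multiply, add and subtract are counted as three floating-point operations per cycle, then a 40 MHz clock (the rate of the original ADSP-21060, stated here as background rather than taken from the slides) gives 3 × 40 = 120 MFLOPS, consistent with the rating above.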
11
Zero-Overhead Looping on the SHARC
• A single instruction placed before the loop performs the loop set-up, informing the SHARC that a loop is coming.
• The instruction also includes the iteration count and termination condition.
• This keeps the pipeline full during loop execution and allows the termination condition to be tested in parallel with the loop body.
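To make the benefit concrete, here is a rough cycle-count model comparing a software-controlled loop with a zero-overhead hardware loop. The overhead figure is a hypothetical parameter, not a measured SHARC number: the software loop is assumed to spend `overhead` extra cycles per iteration on updating the counter, testing it and branching, while the hardware loop pays one set-up instruction and nothing per iteration.

```c
/* Software loop: every iteration pays for the body plus the
   (hypothetical) counter-update/test/branch overhead. */
unsigned long software_loop_cycles(unsigned long iterations,
                                   unsigned long body, unsigned long overhead)
{
    return iterations * (body + overhead);
}

/* Zero-overhead loop: one set-up instruction before the loop, after which
   each iteration costs only its body. */
unsigned long hardware_loop_cycles(unsigned long iterations, unsigned long body)
{
    const unsigned long setup = 1;   /* the single loop set-up instruction */
    return setup + iterations * body;
}
```

With a one-cycle multiply-accumulate body, 100 iterations and an assumed 3 cycles of overhead, the software loop takes 400 cycles and the hardware loop 101. The shorter the loop body, the more the per-iteration overhead dominates.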
12
DAGs on the SHARC
• Data Address Generators (DAGs) are integer computation units that manage the index registers used to address memory.
• They allow the SHARC to fetch a value and update the index in one operation.
• If the updated index exceeds a limit, the DAG adjusts it so that it wraps around.
• This occurs in the same clock cycle as the read or write.
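The wrap-around update can be modeled in software as a post-modify followed by a modulo adjustment. A minimal sketch, assuming a buffer that starts at `base` and holds `length` entries; the function and parameter names are hypothetical, and on the SHARC this update is done by the DAG hardware in the same cycle as the access.

```c
#include <assert.h>
#include <stddef.h>

/* Advance `index` by `modify` (which may be negative) and wrap the result
   back into the range [base, base + length). */
size_t dag_post_modify(size_t index, ptrdiff_t modify, size_t base, size_t length)
{
    assert(length > 0 && index >= base && index < base + length);
    ptrdiff_t len = (ptrdiff_t)length;
    ptrdiff_t offset = (ptrdiff_t)(index - base) + modify;  /* offset from base */
    offset %= len;           /* in C99 the remainder can be negative ... */
    if (offset < 0)
        offset += len;       /* ... so wrap values below the base */
    return base + (size_t)offset;
}
```

Used as the write pointer of a circular buffer, this update lets the newest sample overwrite the oldest without moving any data: stepping past the last entry (index 107 in an 8-entry buffer at 100) wraps back to 100.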
13
DAG Capabilities
• Circular Buffering
  – Rather than actually moving data in and out of a vector, circular buffers are used.
  – By updating the index modulo the buffer length, the oldest entry can be conveniently replaced by the newest entry.
• Bit-Reverse Addressing
  – The bit pattern of a vector index is reversed.
  – Done automatically by the SHARC.
  – Required for the Fast Fourier Transform (FFT), which is often critical to DSP applications.
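A software equivalent of bit-reverse addressing, for illustration. The SHARC's DAGs can generate this ordering in hardware; the function names below are hypothetical. `bit_reverse` reverses the lowest `bits` bits of an index, and `bit_reverse_permute` uses it to reorder a 2^bits-element array the way a radix-2 FFT requires.

```c
#include <stddef.h>

/* Reverse the lowest `bits` bits of `index`; e.g. with bits = 3,
   index 1 (001b) becomes 4 (100b). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned b = 0; b < bits; b++) {
        reversed = (reversed << 1) | (index & 1u);  /* shift LSB into result */
        index >>= 1;
    }
    return reversed;
}

/* Reorder n = 2^bits floats into bit-reversed order, in place. */
void bit_reverse_permute(float *data, unsigned bits)
{
    size_t n = (size_t)1 << bits;
    for (size_t i = 0; i < n; i++) {
        size_t j = bit_reverse((unsigned)i, bits);
        if (j > i) {                 /* swap each pair only once */
            float tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

For an 8-point FFT, the input order 0, 1, 2, 3, 4, 5, 6, 7 becomes 0, 4, 2, 6, 1, 5, 3, 7.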
14
SHARC DSP
• What makes the SHARC unique?
  – It also has some features not directly related to optimizing numeric computations:
    • Pipelining
    • Handling branches
• Why has this not emerged sooner?
  – Technology has only recently made it economical to integrate a general-purpose computing device on a single chip.
15
SHARC’s Pipeline
• 3 stages:
  1. Instruction fetch
  2. Decode
  3. Execution
• Each instruction takes three clock cycles to propagate through the pipeline.
• Because the stages overlap, the processor still completes one instruction per clock cycle, even though each instruction requires three clock cycles.
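An idealized simulation of the three-stage pipeline (no stalls or branches) shows why throughput is one instruction per cycle despite the three-cycle latency. The function name and output format are illustrative only.

```c
#include <stdio.h>

#define STAGES 3   /* fetch, decode, execute */

/* Print which instruction occupies each stage on every clock cycle and
   return the total number of cycles needed for `count` instructions. */
unsigned run_pipeline(unsigned count)
{
    static const char *stage_names[STAGES] = {"fetch", "decode", "execute"};
    unsigned cycles = count + STAGES - 1;   /* fill latency + one per instruction */
    for (unsigned cycle = 0; cycle < cycles; cycle++) {
        printf("cycle %u:", cycle + 1);
        for (unsigned s = 0; s < STAGES; s++) {
            /* Instruction i (0-based) is in stage s during cycle i + s. */
            if (cycle >= s && cycle - s < count)
                printf("  %s=I%u", stage_names[s], cycle - s + 1);
        }
        printf("\n");
    }
    return cycles;
}
```

Five instructions finish in 7 cycles: the first needs three cycles to reach the end of the pipeline, after which one instruction completes every cycle.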
16
SHARC’s Handling Branches: Delayed Branching
• When a branch instruction is encountered, the two instructions that have already been fetched and decoded behind it are executed before the branch takes effect.
• This keeps the pipeline full and avoids junking those two instructions and reloading the pipeline.
• Beneficial in situations such as loops of only a few instructions, where the ratio of wasted clock cycles to instructions would otherwise be significant.
17
SHARC’s Handling Branches: Non-delayed Branching
• Traditional branching.
• If the instructions cannot be reordered to take advantage of delayed branching, non-delayed branching saves program space.
• Uses only one word of storage.
• However, it takes three cycles, because the pipeline must be reloaded.
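The trade-off between the two branch types can be put into a simple idealized cost model for the three-stage pipeline. A non-delayed taken branch discards the two instructions behind it (two lost cycles, so the branch effectively takes three). A delayed branch executes those two instructions, so no cycles are lost, provided the code can be arranged so they do useful work. The function names are hypothetical, and every instruction is assumed to take one cycle.

```c
/* Idealized cost of a taken branch in a 3-stage pipeline. */
enum { PIPELINE_STAGES = 3 };

/* Cycles lost per taken branch: none if delayed, stages - 1 otherwise. */
unsigned wasted_cycles(int delayed)
{
    return delayed ? 0u : (unsigned)(PIPELINE_STAGES - 1);
}

/* Total cycles for a loop of `body` single-cycle instructions (including
   the branch) executed `iterations` times. */
unsigned long loop_cycles(unsigned long iterations, unsigned long body, int delayed)
{
    return iterations * (body + wasted_cycles(delayed));
}
```

For a 4-instruction loop run 10 times, the model gives 60 cycles with non-delayed branching versus 40 with delayed branching: the shorter the loop, the larger the share of wasted cycles.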
18
Multi-processing
• SHARC is uniquely equipped for multi-processing.
• Its link ports provide very powerful multi-processing capabilities.
• Two main programming models, chosen according to the application.
• Adapts well to different multi-processing architectures.
19
Multi-processing: SHARC Links
• The SHARC has 6 link ports, each able to transport data at rates up to 40 Mbytes/s.
• The links are designed for point-to-point connections.
• Data can be transmitted in either direction but not both simultaneously.
20
Multi-processing Program Model: MIMD
• Multiple instruction, multiple data.
• Good for applications that require multiple instruction threads to execute concurrently.
• Processors operate individually.
  – Each processor executes different code.
• Typically used for image reconstruction and multi-channel DSP.
21
Multi-processing Program Model: SIMD
• Single instruction, multiple data.
• Works best when all processors execute identical instruction sequences.
• Requires no overhead for inter-processor synchronization.
• Typically used for synthetic aperture radar and automatic target recognition.
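A sketch of the SIMD idea in plain C: every "processor" runs the identical routine on its own slice of the data, and the partial results are then combined. Here the processors are simulated by a loop on one CPU, and NUM_PROCESSORS and the function names are hypothetical; the inner product is reused from the earlier example as the shared workload.

```c
#include <stddef.h>

#define NUM_PROCESSORS 4   /* hypothetical number of SHARCs */

/* The identical instruction sequence every processor executes. */
static float partial_inner_product(const float *x, const float *y, size_t count)
{
    float acc = 0.0f;
    for (size_t n = 0; n < count; n++)
        acc += x[n] * y[n];
    return acc;
}

/* Split [0, length) into NUM_PROCESSORS nearly equal slices, "run" the
   same code on each slice, then combine the partial sums. */
float simd_inner_product(const float *x, const float *y, size_t length)
{
    float total = 0.0f;
    for (size_t p = 0; p < NUM_PROCESSORS; p++) {
        size_t start = p * length / NUM_PROCESSORS;
        size_t end = (p + 1) * length / NUM_PROCESSORS;
        total += partial_inner_product(x + start, y + start, end - start);
    }
    return total;
}
```

Because every slice is processed by the same instruction sequence, no processor has to wait on another's control flow; only the final combining step needs the partial results.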
22
Multi-processing Architectures: Cluster Design
• Groups of up to 6 SHARCs in a cluster.
• Most common method for joining multiple SHARCs.
• All processors, global I/O and global memory connected to a common “Cluster bus.”
• Each SHARC can “drive” the bus.
23
Multi-processing Architectures: Mesh Design
• All SHARCs are joined by their link ports and connected to a common bus.
• In SIMD mode, a single master SHARC drives the bus.
• In MIMD mode, the mesh architecture cannot function if the data is larger than the available on-chip memory.
• Offers better scalability over a wider range of applications.
24
How optimal is the SHARC for non-DSP Applications?
• It is obviously geared for DSP applications.
• While it may fare better than other processors, it still lags behind those designed specifically for non-DSP applications.