
Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com

ENVISION. ACCELERATE. ARRIVE.

A high performance, low-power, scaleable platform for next-generation on-board payload data processing applications

Ray McConnell ([email protected]), CTO, ClearSpeed Technology Plc.


Company background

• Public fabless semiconductor company
• Offices – Bristol (UK HQ) and San Jose (US)
• >$100 million and 300 man-years investment
• 86 patents granted/applied for
• 82 employees and growing
• 2003: introduced CS301, a prototype
• 2004: announced first commercial processor
• 2005: CSX600 processor and Advance™ board start shipping
• Internationally recognised as “disruptive technology”
• Solves the power/heat challenges of HPC


Outline

• New limits, new frontiers: next-generation DSP requires next-generation IP
  – 10 years means 2 orders of magnitude (100x)
• Description of ClearSpeed’s Acceleration Technology
  – Processors
  – IP
• Applications, Libraries and Performance
• Software Tools: SDK
  – Roadmap
  – Cn: Extending C for data parallel programming
  – Heterogeneous system level debugging and profiling

• Summary


Value proposition:

• Provide full access to a COTS scaleable processor:
  – IP modules available
  – Full verification for Right First Time
  – Or complete processor design
• With quality heterogeneous SDK tools and libraries
  – Ported to the Eclipse platform
  – Recent integration for technical demonstration with Intel EXOCHI
    • 30x speed-up of a Monte-Carlo FSI code on a high-end Intel platform
• BAE Systems (US-SSA) first licensee
  – Rad hardening on IBM 90nm


BAE Systems Sign Technology License Agreement

• ClearSpeed Technology and BAE Systems Sign Technology License Agreement

In a significant expansion of its strategy, ClearSpeed has licensed the design of its next generation processor to BAE Systems for use in the most demanding of environments – space

• Bristol, UK – September 4, 2007 – ClearSpeed Technology (LSE:CSD), the world leader in coprocessors for compute acceleration, and BAE Systems, a leading global defense and aerospace company, today announced that they have signed an agreement under which BAE Systems will license the design of ClearSpeed’s next generation processor for inclusion in satellite systems. In addition, BAE Systems will gain access to ClearSpeed’s software development kit (SDK) for the purpose of adding new libraries relevant to satellite system design and deployment.

• The agreement gives BAE Systems an embedded technology for deployment in the harsh environments that satellite systems must endure, by meeting stringent criteria for performance, energy efficiency, ease of programming, accuracy and reliability, each of which ClearSpeed met or exceeded. This combination of features, including fault tolerance, has been developed by ClearSpeed in order to meet the intense demands of acceleration in high performance computing (HPC).

“To deploy and achieve sustained operation of systems in space sets the most challenging requirements,” said George Nossaman, director of Advanced Digital Systems for BAE Systems. “Our ability to deliver ClearSpeed’s coprocessors to our space customers, with this technology’s unique combination of very high performance, power efficiency and full programmability, will enable a major increase in mission capability for key civil, commercial and defensive space systems.”

• As the base of its leadership in HPC acceleration, ClearSpeed is focused on delivering technology with a key set of attributes which are also relevant to areas of embedded processing. The agreement with BAE Systems marks ClearSpeed’s first extension into the embedded space.

• “Licensing our technology to a limited number of key partners has always been planned as part of our strategy,” said Tom Beese, chief executive officer of ClearSpeed Technology. “To be chosen by BAE Systems for use in satellite systems is a clear demonstration of the versatility and robustness of our core technology in meeting the most intensive needs in multiple markets.”


• Multi-Threaded Array Processing
  – Programmed in familiar languages
  – Hardware multi-threading
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set
  – Bi-endian for compatibility
• Array of 96 Processor Elements (PEs)
  – Each is a Very Long Instruction Word (VLIW) core, not just an ALU
• Cn is the natural language
  – Single “poly” data type modifier
  – Rich expressive semantics

MTAP processor core

[Figure: CSX600 MTAP processor core block diagram. Labels: Mono Controller; Poly Controller; Instruction Cache; Data Cache; Control and Debug; PE 0, PE 1 … PE 95; Programmable I/O to DRAM; Peripheral Network; System Network.]
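One way to picture the “poly” modifier is as one instance of a variable on every PE, with the controller broadcasting scalar ("mono") values to all of them. The plain C sketch below emulates that on a host; it is not Cn and not a ClearSpeed API. The names `poly_float`, `poly_broadcast` and `poly_axpy_where_positive` are hypothetical, and the per-PE enable flag used for the conditional is the usual SIMD-array approach, assumed here rather than taken from the slides.

```c
#include <stddef.h>

#define NUM_PES 96  /* PE count of the CSX600, from the slide */

/* Hypothetical stand-in for a Cn "poly float": one value per PE. */
typedef struct { float lane[NUM_PES]; } poly_float;

/* Broadcast a mono (scalar) value to every PE. */
static void poly_broadcast(poly_float *dst, float mono)
{
    for (size_t pe = 0; pe < NUM_PES; pe++)
        dst->lane[pe] = mono;
}

/* y = a*x + y on every PE where x > 0. PEs where the condition is
   false are left untouched, emulating a per-PE enable flag (assumed). */
static void poly_axpy_where_positive(poly_float *y, float a,
                                     const poly_float *x)
{
    for (size_t pe = 0; pe < NUM_PES; pe++) {
        int enabled = x->lane[pe] > 0.0f;
        if (enabled)
            y->lane[pe] = a * x->lane[pe] + y->lane[pe];
    }
}
```

On the real hardware the loop over `pe` would not exist: all PEs would execute the same instruction in the same cycle.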


Processing Elements

Each PE is a VLIW core:

• Multiple configurable execution units
• 4-stage floating point adder
• 4-stage floating point multiplier
• Divide/square root unit
• Fixed-point MAC 16x16 → 32+64
• Integer ALU with shifter
• Load/store
• High-bandwidth, 5-port register file (3r, 2w)
• Closely coupled 6 KB SRAM for data
• High bandwidth per PE DMA (PIO)
• Per PE address generators
• Complete pointer model, including parallel pointer chasing and vectors of addresses

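[Figure: PE block diagram. Labels: PE n with 32/64-bit IEEE 754 FP Mul, FP Add, Div/Sqrt, MAC and ALU units; Register File, 128 Bytes; PE SRAM, 6 KBytes; Programmed I/O; PIO Collection & Distribution; neighbour links to PE n–1 and PE n+1; datapath widths of 32, 64 and 128 bits.]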
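The "Fixed-point MAC 16x16 → 32+64" entry can be read as a 16-bit by 16-bit multiply producing a 32-bit product that is added into a 64-bit accumulator. The C sketch below illustrates that reading only; the signedness of the operands and the function names `mac16` and `dot16` are assumptions, not details given on the slide.

```c
#include <stddef.h>
#include <stdint.h>

/* One MAC step: a signed 16x16 multiply gives a 32-bit product,
   which is added to a 64-bit accumulator. */
static int64_t mac16(int64_t acc, int16_t a, int16_t b)
{
    /* Largest magnitude is (-32768)*(-32768) = 2^30, so the
       product always fits in 32 bits. */
    int32_t product = (int32_t)a * (int32_t)b;
    return acc + (int64_t)product;
}

/* Dot product of two 16-bit sample vectors built from the MAC step. */
static int64_t dot16(const int16_t *x, const int16_t *y, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = mac16(acc, x[i], y[i]);
    return acc;
}
```

The wide accumulator is what lets long filters or correlations sum many products without intermediate scaling.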


ClearConnect Inter/Intra Chip/Block Connect

• Memory Transaction Switched Network (NoC)
  – Block to block
  – Chip to chip
  – Configurable width
  – Configurable 64-bit addressing
  – Can be extended into FPGA
• Flexible late-stage layout
  – Localised timing
  – Straightforward synthesis
  – T-Switch to optimise bandwidths
• Lower power
  – No long RC wires


First Generation CS301 (2003)

[Figure: CS301 die photo. Labels: PE Array; Control SRAM; Bus.]

• Multi-Threaded Array Processor
  – 25.6 GFLOPS, SP IEEE
  – 2.5 W worst-case, 1.8 W typical
  – 200 MHz
  – 64 PEs, 4 KBytes each
• Scratchpad memory
  – 128 KBytes of SRAM
• IBM 130nm
• ClearConnect bus
  – 64-bit full duplex
  – 1.6 GByte/s each direction
  – 2x 800 MByte/s bridge ports
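The CS301 headline figure can be checked with simple arithmetic. The sketch below assumes each PE completes one floating point add and one floating point multiply per cycle (2 flops per cycle); that assumption is not stated on the slide, but it reproduces the quoted 25.6 GFLOPS from 64 PEs at 200 MHz. The function names are illustrative.

```c
/* Peak GFLOPS for an array of PEs. flops_per_cycle is an assumption:
   2 models one FP add plus one FP multiply per PE per cycle. */
static double peak_gflops(int num_pes, double clock_mhz, int flops_per_cycle)
{
    return num_pes * clock_mhz * 1e6 * flops_per_cycle / 1e9;
}

/* Energy efficiency in GFLOPS per watt. */
static double gflops_per_watt(double gflops, double watts)
{
    return gflops / watts;
}
```

With these inputs, `peak_gflops(64, 200.0, 2)` gives 25.6, and dividing by the quoted 2.5 W worst-case power gives roughly 10 GFLOPS per watt at peak.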


Present CSX600 Processor (2005)

• Array of 96 Processor Elements; 64-bit and 32-bit floating point
• 210-250 MHz… key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPUs
  – Hence around one quarter of the chip is floating point hardware
• ~1 TB/sec internal bandwidth
  – At the register file
• 128 million transistors: IBM 130nm
• Low power, approx. 10 watts

[Figure: ClearSpeed CSX600]
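Two of the CSX600 figures follow from short calculations. The sketch below works through the "one quarter of the chip" estimate from the logic and FPU shares, and shows one way a register-file bandwidth of the order of 1 TB/sec could arise. The 5 ports come from the PE slide; the 8-byte port width and the choice of 250 MHz are assumptions made here, not figures from the deck.

```c
/* Fraction of the die that is floating point hardware:
   share of die that is logic times share of logic that is FPU. */
static double fp_fraction(double logic_share, double fpu_share_of_logic)
{
    return logic_share * fpu_share_of_logic;
}

/* Aggregate register-file bandwidth in bytes per second.
   bytes_per_port is an assumption (8 = 64-bit ports). */
static double regfile_bandwidth(int num_pes, int ports,
                                int bytes_per_port, double clock_hz)
{
    return (double)num_pes * ports * bytes_per_port * clock_hz;
}
```

`fp_fraction(0.47, 0.50)` is 0.235, i.e. just under one quarter. Under the stated assumptions, `regfile_bandwidth(96, 5, 8, 250e6)` is 960 GB/s, which is of the same order as the slide's ~1 TB/sec.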


CSX600 SoC


The ClearSpeed Advance™ X620 accelerator board

• Dual ClearSpeed CSX600 coprocessors
• R∞ ≈ 77 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – Hardware also supports 32-bit floating point and integer calculations
• Single PCI slot (PCI-X or PCIe 8-lane)
  – Multiple boards can be used together for greater performance
• About 1 GByte/s to/from card from host
  – 1 GByte of memory on the board
• Drivers today for Linux (Red Hat and SUSE) and Windows
• 8.8 ounces, 8 inches long, 25 watts for entire card (at socket)
• No extra cooling or space required
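The board figures can be related to one another with a little arithmetic. The sketch below is illustrative only: it assumes 2 flops per PE per cycle (one add plus one multiply), which the slides do not state, and the function names are hypothetical.

```c
/* Peak GFLOPS of a board. flops_per_cycle = 2 is an assumption
   (one FP add plus one FP multiply per PE per cycle). */
static double board_peak_gflops(int chips, int pes_per_chip,
                                double clock_mhz, int flops_per_cycle)
{
    return chips * pes_per_chip * clock_mhz * 1e6 * flops_per_cycle / 1e9;
}

/* Sustained performance as a fraction of peak. */
static double efficiency(double sustained_gflops, double peak)
{
    return sustained_gflops / peak;
}

/* Flops the board must perform per byte moved over the host link
   for the link not to become the bottleneck. */
static double flops_per_byte_needed(double sustained_gflops,
                                    double link_gbytes_per_s)
{
    return sustained_gflops / link_gbytes_per_s;
}
```

Under that assumption the dual-CSX600 peak is 96 GFLOPS at 250 MHz, so 77 GFLOPS of DGEMM is about 80% of peak (about 95% if the parts run at 210 MHz). At 25 watts that is about 3 GFLOPS per watt, and with about 1 GByte/s to the host a kernel needs on the order of 77 flops per byte transferred to keep the board busy, which is why dense matrix multiply suits it.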


Scaling to Next Generation Processor (2007)

• Processor Elements:
  – 64-bit and 32-bit floating point
• Complete Accelerator System
  – PCIe I/O
  – 2x Low Power DDR2 ECC SDRAM
  – 250+ MHz… key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPUs
  – Hence around one quarter of the chip is floating point hardware
• ~2 TB/sec internal bandwidth
  – At the register file
• 256 million transistors: IBM 90nm
• Low power, approx. 12 watts


ClearSpeed development environment

• Cn optimising compiler
  – C with poly extension for SIMD control
  – Uses ACE CoSy compiler development system
• Assembler, linker
• Simulators
  – Fast high-level and slower timing-accurate versions
• Debuggers: gdb, csgdb
  – A port of the GNU debugger gdb for x86, and csgdb, which can run on ClearSpeed’s hardware, together give a consistent host and CSX600 view
• Profiling: csprof
  – Visualises an accelerated application’s performance while running on both a multi-core host and either ClearSpeed’s Advance board or the simulator. Intimately integrates with debuggers.
• Libraries (LAPACK, RNG, FFT, more...)
• High-level APIs
  – Stable, with advanced under development
• Documentation, training materials
• Available for Windows and Linux (Red Hat 4 and SLES 9)


Open Accelerator Model (SMP + Accelerators)


[Figure: Open Accelerator Model software stack. Labels: Application; CPU; Operating System; Device driver (CSAPI Kernel); CSAPI User; CSAPI; ClearStack; Streaming; Open Accelerator; Cn/OpenMP SIMD; CSX Cores.]

With care, an application can mix and match any ClearSpeed API.

Can also support an EXOCHI-style shared memory model.


ClearSpeed development environment: Roadmap

• Present
  – Optimising Cn compiler (now)
  – System level profiling and debug (now)
  – Move to Eclipse development environment (Beta)
    • Can now access the Parallel Tools Platform: http://www.eclipse.org/ptp/
• Future
  – ClearStack (near)
    • Advanced APIs (Beta)
    • C++ Object Migration base class (Beta, with partner)
  – Heterogeneous programming environment (next)
    • “Exploiting Loop-Level Parallelism for SIMD Arrays using OpenMP”
    • IWOMP 2007 (Beijing)
  – Fortran (later)


Growing list of Accelerated math libraries

• There are five categories of math function that ClearSpeed provides:
  – A subset of L3 BLAS
    • Matrix operations
  – A subset of LAPACK
    • Linear algebra routines and solvers for LU, QR etc.
  – A set of random number generators
  – A set of vector math functions (sin, exp, log etc.)
  – A set of Fast Fourier Transform routines
  – More in development
• ClearSpeed’s L3 BLAS and LAPACK library includes breakthrough “heterogeneous execution” plug-and-play technology
  – Uses multiple x86 cores simultaneously with the ClearSpeed accelerator board
  – Accelerates Matlab out of the box
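The idea behind "heterogeneous execution" can be illustrated by dividing the columns of a matrix multiply between the host cores and the accelerator in proportion to their throughput, so that both finish at about the same time. The sketch below shows only that proportional split; it is not ClearSpeed's scheduling algorithm, and `split_t`, `split_columns` and the throughput inputs are hypothetical.

```c
#include <stddef.h>

/* Hypothetical result type: how many columns each side computes. */
typedef struct {
    size_t host_cols;
    size_t accel_cols;
} split_t;

/* Divide n output columns in proportion to sustained throughput.
   Assumes host_gflops + accel_gflops > 0. */
static split_t split_columns(size_t n, double host_gflops,
                             double accel_gflops)
{
    split_t s;
    double share = accel_gflops / (host_gflops + accel_gflops);
    s.accel_cols = (size_t)(share * (double)n + 0.5); /* round to nearest */
    if (s.accel_cols > n)
        s.accel_cols = n;
    s.host_cols = n - s.accel_cols;
    return s;
}
```

For example, with an accelerator sustaining 77 GFLOPS and a host assumed to sustain 23 GFLOPS, 1000 columns would be split 770 to the board and 230 to the host. A real implementation would also have to account for transfer time over the host link.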


CSX600 Processor FFT Performance

• 1D 128 forward single complex: 1.21M FFT/sec
• 1D 128 forward double complex: 600k FFT/sec
• 1D 256 forward single complex: 600k FFT/sec
• 1D 256 forward double complex: 311k FFT/sec
• 1D 1024 forward single complex: 157k FFT/sec
• 1D 1024 forward double complex: 78k FFT/sec
• Performance is limited by the low-power external memory interface.
• More computing on the data is recommended to redress this balance.
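The FFT rates can be converted into approximate GFLOPS and into the memory traffic they imply. The sketch below uses the conventional benchmark operation count of 5·N·log2(N) flops per complex transform, and assumes each transform's data is streamed in once and out once; both are modelling choices made here, not figures from the slide.

```c
/* log2 of a power of two, computed with shifts. */
static int ilog2_pow2(int n)
{
    int p = 0;
    while (n > 1) {
        n >>= 1;
        p++;
    }
    return p;
}

/* Approximate GFLOPS using the 5*N*log2(N) convention. */
static double fft_gflops(int n, double ffts_per_sec)
{
    return 5.0 * n * ilog2_pow2(n) * ffts_per_sec / 1e9;
}

/* GBytes/s needed to read each input and write each output once.
   bytes_per_complex: 8 for single precision, 16 for double. */
static double fft_stream_gbytes(int n, int bytes_per_complex,
                                double ffts_per_sec)
{
    return 2.0 * n * bytes_per_complex * ffts_per_sec / 1e9;
}
```

For the 1024-point single precision case (157k FFT/sec) this gives about 8.0 GFLOPS and about 2.6 GBytes/s of streaming traffic. Applying the same model to the other five rates gives roughly 2.5 GBytes/s each time, which is consistent with the slide's point that the external memory interface, not the arithmetic, sets the limit.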


Vector Math functions speed comparison (CSX600)

Typical speedup of ~8X over the fastest x86 processors

[Chart: 64-bit function operations per second (billions), axis from 0.0 to 2.5, by function name (Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, Inv). Series: 2.6 GHz dual-core Opteron; 3 GHz dual-core Woodcrest; ClearSpeed Advance card.]


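A Cn example: radix-2 FFT

/* xy holds interleaved (real, imaginary) pairs; w holds interleaved
   twiddle-factor pairs, read as c = w[2*ia] and s = w[2*ia+1]. */
void cn_fft(poly float *xy, poly float *w, short n)
{
    poly short n1, n2, ie, ia, i, j, k, l;
    poly float xt, yt, c, s;

    n2 = n;
    ie = 1;
    for (k = n; k > 1; k = (k >> 1)) {      /* one pass per stage */
        n1 = n2;
        n2 = n2 >> 1;
        ia = 0;
        for (j = 0; j < n2; j++) {          /* one twiddle per j */
            c = w[2*ia];
            s = w[2*ia+1];
            ia = ia + ie;
            for (i = j; i < n; i += n1) {   /* butterflies for this twiddle */
                l = i + n2;
                xt = xy[2*l] - xy[2*i];
                xy[2*i] = xy[2*i] + xy[2*l];
                yt = xy[2*l+1] - xy[2*i+1];
                xy[2*i+1] = xy[2*i+1] + xy[2*l+1];
                xy[2*l] = (c*xt + s*yt);
                xy[2*l+1] = (c*yt - s*xt);
            }
        }
        ie = ie << 1;                       /* twiddle stride doubles each stage */
    }
}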
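For readers without the Cn tool chain, the same loop nest can be tried on a host as plain C by dropping the poly qualifiers, so that it transforms a single sequence. The sketch below is such a port, not ClearSpeed code. The slide does not show how w is built: with the subtraction order used in the listing, a table holding (-cos θ, -sin θ) for θ = 2π·ia/n yields a forward DFT with outputs in bit-reversed order, and that table convention is an assumption made here so the result can be checked. `fft_scalar` and `bit_reverse` are hypothetical names.

```c
/* Scalar port of the loop nest: same arithmetic, no poly qualifiers. */
static void fft_scalar(float *xy, const float *w, int n)
{
    int n1, n2 = n, ie = 1, ia, i, j, k, l;
    float xt, yt, c, s;

    for (k = n; k > 1; k >>= 1) {
        n1 = n2;
        n2 >>= 1;
        ia = 0;
        for (j = 0; j < n2; j++) {
            c = w[2*ia];
            s = w[2*ia+1];
            ia += ie;
            for (i = j; i < n; i += n1) {
                l = i + n2;
                xt = xy[2*l] - xy[2*i];
                xy[2*i] += xy[2*l];
                yt = xy[2*l+1] - xy[2*i+1];
                xy[2*i+1] += xy[2*l+1];
                xy[2*l]   = c*xt + s*yt;
                xy[2*l+1] = c*yt - s*xt;
            }
        }
        ie <<= 1;
    }
}

/* Reorder n complex values from bit-reversed to natural order. */
static void bit_reverse(float *xy, int n)
{
    for (int i = 0, j = 0; i < n; i++) {
        if (i < j) {
            float re = xy[2*i], im = xy[2*i+1];
            xy[2*i] = xy[2*j];   xy[2*i+1] = xy[2*j+1];
            xy[2*j] = re;        xy[2*j+1] = im;
        }
        /* Increment j as a bit-reversed counter. */
        int bit = n >> 1;
        while (bit > 0 && (j & bit)) {
            j ^= bit;
            bit >>= 1;
        }
        j |= bit;
    }
}
```

As a small check, for n = 4 the assumed table is w = {-1, 0, 0, -1}. Transforming the real sequence 1, 2, 3, 4 and then calling `bit_reverse` gives 10, -2+2i, -2, -2-2i, which matches the 4-point DFT of that input.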


csgdb/ddd debugger and csprof integration

• x86 gdb port enables standard GUIs
  – multi core
  – multi thread
  – single step, breakpoint, watchpoint
• csgdb port consistent with x86 gdb
  – enables standard GUIs with the CSX600
  – single step, breakpoint, watchpoint
• csgdb port is multi-everything
  – Card, processor, thread, PE
  – Profiler control integrated via new csgdb command set
• Visualize all the state in everything
• Visualize all data movement on the PCI bus


Profiling details of host and board system level activity

[Figure: Advance Accelerator Board with two CSX600 pipelines, connected to the host CPU(s); four callouts as listed below.]

HOST/BOARD INTERACTION
• View host/board interactions
• Provides performance information for data transfer operations
• Trace cluster node/board interaction
• See overlap of host compute and board compute

CSX600 PIPELINE
• View detailed instruction issue information
• Visualize overlap of executing instructions
• Optimize code at the instruction level
• View instruction level performance bottlenecks
• Get accurate instruction timing

CSX600 SYSTEM
• View system level trace
• Visually inspect the overlap of compute and I/O
• Visualize cache utilization
• View branch trace of code executing
• Find and analyse performance bottlenecks
• Get accurate event timing

HOST CODE PROFILING
• Visually inspect host code executing
• Supports multiple threads and processes
• Time specific code sections
• See overlap of host threads executing
• Platform and processor agnostic trace collection


Profile of complete LINPACK run (x86 view)

Overview of system performance during LINPACK run

Profiling of x86 source code inside LINPACK

CSX600 interaction displayed with x86 code profile


Multiple CSX600 DGEMM calls (CSX600 view)

View the DGEMM calls on the CSX600 processor

Much higher level of detail available from the profiler

Each call ties up with the host view of card execution


Pipeline view of CSX600 DGEMM inner loop (CSX600 view)

Profile the code running at the instruction level

See the pipeline performance for each instruction

Tune the instruction scheduling for the application code


Summary

• European technology

• Genuine co-processor technology

• Scaleable technology

• COTS processors and boards available now

• IP available for licensing

• Potential for Rad-Hardening on 90-65nm process
