Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com 1
ENVISION. ACCELERATE. ARRIVE.
A high performance, low-power, scaleable platform for next-generation on-board payload data processing applications
Ray McConnell ([email protected]), CTO, ClearSpeed Technology Plc.
Company background
• Public fabless semiconductor company
• Offices – Bristol (UK HQ) and San Jose (US)
• >$100 million and 300 man-years investment
• 86 patents granted/applied for
• 82 employees and growing
• 2003: introduced CS301, a prototype
• 2004: announced first commercial processor
• 2005: CSX600 processor and Advance™ board start shipping
• Internationally recognised as “disruptive technology”
• Solves the power/heat challenges of HPC
Outline
• New limits
  – New frontiers, next generation DSP require next generation IP
  – 10 years means 2 orders of magnitude (100x)
• Description of ClearSpeed’s Acceleration Technology
  – Processors
  – IP
• Applications, Libraries and Performance
• Software Tools - SDK
  – Roadmap
  – Cn: Extending C for data parallel programming
  – Heterogeneous system level debugging and profiling
• Summary
Value proposition
• Provide full access to COTS scaleable processor:
  – IP modules available
  – Full verification for Right First Time
  – Or complete processor design
• With quality Heterogeneous SDK tools and libraries
  – Ported to the Eclipse platform
  – Recent integration for technical demonstration with Intel EXOCHI
    • 30x speed up of a Monte-Carlo FSI code on high end Intel platform
• BAE Systems (US-SSA) first licensee
  – Rad Hardening on IBM 90nm
BAE Systems Sign Technology License Agreement
• ClearSpeed Technology and BAE Systems Sign Technology License Agreement
In a significant expansion of its strategy, ClearSpeed has licensed the design of its next generation processor to BAE Systems for use in the most demanding of environments – space
• Bristol, UK – September 4, 2007 – ClearSpeed Technology (LSE:CSD), the world leader in coprocessors for compute acceleration, and BAE Systems, a leading global defense and aerospace company, today announced that they have signed an agreement under which BAE Systems will license the design of ClearSpeed’s next generation processor for inclusion in satellite systems. In addition, BAE Systems will gain access to ClearSpeed’s software development kit (SDK) for the purpose of adding new libraries relevant to satellite system design and deployment.
• The agreement gives BAE Systems an embedded technology for deployment in the harsh environments that satellite systems must endure, by meeting stringent criteria for performance, energy efficiency, ease of programming, accuracy and reliability, each of which ClearSpeed met or exceeded. This combination of features, including fault tolerance, has been developed by ClearSpeed in order to meet the intense demands of acceleration in high performance computing (HPC).
“To deploy and achieve sustained operation of systems in space sets the most challenging requirements,” said George Nossaman, director of Advanced Digital Systems for BAE Systems. “Our ability to deliver ClearSpeed’s coprocessors to our space customers, with this technology’s unique combination of very high performance, power efficiency and full programmability, will enable a major increase in mission capability for key civil, commercial and defensive space systems.”
• As the base of its leadership in HPC acceleration, ClearSpeed is focused on delivering technology with a key set of attributes which are also relevant to areas of embedded processing. The agreement with BAE Systems marks ClearSpeed’s first extension into the embedded space.
• “Licensing our technology to a limited number of key partners has always been planned as part of our strategy,” said Tom Beese, chief executive officer of ClearSpeed Technology. “To be chosen by BAE Systems for use in satellite systems is a clear demonstration of the versatility and robustness of our core technology in meeting the most intensive needs in multiple markets.”
• Multi-Threaded Array Processing
  – Programmed in familiar languages
  – Hardware multi-threading
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set
  – Bi-endian for compatibility
• Array of 96 Processor Elements (PEs)
  – Each is a Very Long Instruction Word (VLIW) core, not just an ALU
• Cn is the natural language
  – Single “poly” data type modifier
  – Rich expressive semantics
MTAP processor core
[Figure: CSX600 MTAP block diagram. Labels: Mono Controller, Poly Controller, Instruction Cache, Data Cache, Control and Debug, PE 0, PE 1 … PE 95, Programmable I/O to DRAM, Peripheral Network, System Network]
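The “poly” modifier gives a variable one instance per PE, while ordinary (mono) values exist once and are broadcast to the array. A toy Python model of that idea follows, assuming the 96 PEs quoted for the CSX600. The class name Poly, the where method, and the enable-mask treatment of conditionals are illustrative assumptions and not part of any ClearSpeed API.

```python
NUM_PES = 96  # PE count from the slide; everything else here is hypothetical


class Poly:
    """Toy model of a 'poly' value: one element per PE, updated in lockstep."""

    def __init__(self, values):
        self.values = list(values)
        assert len(self.values) == NUM_PES

    @classmethod
    def broadcast(cls, mono_value):
        # A mono (scalar) value is copied to every PE.
        return cls([mono_value] * NUM_PES)

    def __add__(self, other):
        other = other if isinstance(other, Poly) else Poly.broadcast(other)
        return Poly(a + b for a, b in zip(self.values, other.values))

    def __mul__(self, other):
        other = other if isinstance(other, Poly) else Poly.broadcast(other)
        return Poly(a * b for a, b in zip(self.values, other.values))

    def where(self, enabled, new):
        # Models a conditional on poly data: only enabled PEs take the update.
        return Poly(n if e else o
                    for e, n, o in zip(enabled, new.values, self.values))


pe_id = Poly(range(NUM_PES))          # each PE holds its own index
scaled = pe_id * 2.0 + 1.0            # one instruction stream, 96 operands
even = [i % 2 == 0 for i in range(NUM_PES)]
result = pe_id.where(even, scaled)    # odd PEs keep their old value
```

In this model the even-numbered PEs end up holding 2*i + 1 and the odd-numbered PEs keep i, which is the lockstep-with-masking behaviour the sketch is meant to illustrate.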
Processing Elements
Each PE is a VLIW core:
• Multiple configurable execution units
• 4-stage floating point adder
• 4-stage floating point multiplier
• Divide/square root unit
• Fixed-point MAC 16x16 → 32+64
• Integer ALU with shifter
• Load/store
• High-bandwidth, 5-port register file (3r, 2w)
• Closely coupled 6 KB SRAM for data
• High bandwidth per PE DMA (PIO)
• Per PE address generators
• Complete pointer model, including parallel pointer chasing and vectors of addresses
[Figure: PE n block diagram. Labels: 32/64-bit IEEE 754 FP Mul, FP Add, Div/Sqrt, MAC, ALU; Register File, 128 Bytes; PE SRAM, 6 KBytes; Programmed I/O; PIO Collection & Distribution; links to PE n–1 and PE n+1; data path widths of 32, 64 and 128 bits]
ClearConnect Inter/Intra Chip/Block Connect
• Memory Transaction Switched Network (NoC)
  – Block to Block
  – Chip to Chip
  – Configurable width
  – Configurable 64-bit addressing
  – Can be extended into FPGA
• Flexible late stage layout
  – Localised Timing
  – Straightforward synthesis
  – T-Switch to optimise bandwidths
• Lower Power
  – No Long RC wires
First Generation CS301 (2003)
[Figure labels: PE Array, Control SRAM, Bus]
• Multi-Threaded Array Processor
  – 25.6 GFLOPS, SP IEEE
  – 2.5W worst-case, 1.8W typical
  – 200MHz
  – 64 PEs, 4 KBytes each
• Scratchpad memory
  – 128 KBytes of SRAM
• IBM 130nm
• ClearConnect bus
  – 64-bit full duplex
  – 1.6 GByte/s each direction
  – 2x 800 MByte/s bridge ports
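The headline CS301 figures are consistent with simple peak-rate arithmetic. The sketch below assumes each PE completes one floating point multiply and one add per cycle (2 flops per cycle) and that the 64-bit bus moves one word per 200 MHz clock in each direction. Both are assumptions, not statements from the slide, and the function names are illustrative.

```python
def peak_gflops(num_pes, clock_mhz, flops_per_cycle=2):
    # flops_per_cycle=2 assumes one multiply plus one add per PE per cycle.
    return num_pes * clock_mhz * 1e6 * flops_per_cycle / 1e9


def bus_gbytes_per_s(width_bits, clock_mhz):
    # Assumes one full-width transfer per clock in a single direction.
    return (width_bits / 8) * clock_mhz * 1e6 / 1e9


cs301_peak = peak_gflops(64, 200)           # 64 PEs at 200 MHz
cs301_bus = bus_gbytes_per_s(64, 200)       # 64-bit bus at 200 MHz
cs301_gflops_per_watt = cs301_peak / 2.5    # against the 2.5 W worst case
```

Under these assumptions the peak comes out at 25.6 GFLOPS and the bus at 1.6 GByte/s per direction, matching the slide, with roughly 10 GFLOPS per watt at the worst-case power.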
Present CSX600 Processor (2005)
• Array of 96 Processor Elements; 64-bit and 32-bit floating point
• 210-250 MHz… key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPUs
  – Hence around one quarter of the chip is floating point hardware
• ~1 TB/sec internal bandwidth
  – At the register file
• 128 million transistors: IBM 130nm
• Low Power, Approx 10 Watts
[Figure: ClearSpeed CSX600]
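The area and bandwidth bullets above can be reproduced with a few lines of arithmetic. This sketch assumes the 250 MHz end of the quoted clock range, that all five register-file ports (3 read, 2 write) are 64 bits wide and busy every cycle, and that each PE completes one multiply and one add per cycle. None of these assumptions is stated on the slide.

```python
NUM_PES = 96
CLOCK_HZ = 250e6          # upper end of the 210-250 MHz range on the slide

# Area: 47% of the die is logic, and about half of that logic is FPUs.
fpu_die_fraction = 0.47 * 0.50

# Register-file bandwidth, assuming 5 ports of 64 bits each, used every cycle.
PORTS = 5
PORT_BYTES = 8
regfile_tbytes_per_s = NUM_PES * PORTS * PORT_BYTES * CLOCK_HZ / 1e12

# Peak rate, assuming one multiply and one add per PE per cycle.
peak_gflops = NUM_PES * 2 * CLOCK_HZ / 1e9
gflops_per_watt = peak_gflops / 10.0   # against "approx 10 Watts"
```

With these inputs the FPU share is 23.5% of the die ("around one quarter"), the register-file bandwidth is 0.96 TB/s ("~1 TB/sec"), and the peak is 48 GFLOPS, or about 4.8 GFLOPS per watt.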
CSX600 SoC
The ClearSpeed Advance™ X620 accelerator board
• Dual ClearSpeed CSX600 coprocessors
• R∞ ≈ 77 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – Hardware also supports 32-bit floating point and integer calculations
• Single PCI slot (PCI-X or PCIe 8-lane)
  – Multiple boards can be used together for greater performance
• About 1 GByte/s to/from card from host
  – 1 GByte of memory on the board
• Drivers today for Linux (Red Hat and SUSE) and Windows
• 8.8 ounces, 8 inches long, 25 watts for entire card (at socket)
• No extra cooling or space required
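To put the 77 GFLOPS DGEMM figure in context, the sketch below compares it with a computed peak and with the 25 W card power. The 210 MHz board clock and the 2 flops per PE per cycle are assumptions taken from the low end of the CSX600 clock range and a multiply-plus-add model. They are not stated on this slide.

```python
chips, pes, flops_per_cycle = 2, 96, 2
clock_hz = 210e6      # assumed board clock, low end of the 210-250 MHz range

peak_gflops = chips * pes * flops_per_cycle * clock_hz / 1e9
dgemm_gflops = 77.0   # R-infinity quoted on the slide
board_watts = 25.0    # whole card, at the socket

dgemm_efficiency = dgemm_gflops / peak_gflops
gflops_per_watt = dgemm_gflops / board_watts
```

Under these assumptions the board peak is 80.64 GFLOPS, DGEMM reaches about 95% of it, and the card delivers about 3.1 GFLOPS per watt on 64-bit matrix multiply.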
Scaling to Next Generation Processor (2007)
• Processor Elements
  – 64-bit and 32-bit floating point
• Complete Accelerator System
  – PCIe I/O
  – 2x Low Power DDR2 ECC SDRAM
  – 250+ MHz… key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPUs
  – Hence around one quarter of the chip is floating point hardware
• ~2 TB/sec internal bandwidth
  – At the register file
• 256 million transistors: IBM 90nm
• Low Power, Approx 12 Watts
ClearSpeed development environment
• Cn optimising compiler
  – C with poly extension for SIMD control
  – Uses ACE CoSy compiler development system
• Assembler, linker
• Simulators
  – Fast high level and slower timing accurate versions
• Debuggers: gdb, csgdb
  – A port of the GNU debugger gdb for x86, and csgdb, which can run on ClearSpeed’s hardware; together they give a consistent host and CSX600 view.
• Profiling: csprof
  – Visualises an accelerated application’s performance while running on both a multi-core host and either ClearSpeed’s Advance board or the simulator. Intimately integrates with debuggers.
• Libraries (LAPACK, RNG, FFT, more..)
• High level APIs
  – Stable, with advanced under development
• Documentation, training materials
• Available for Windows and Linux (Red Hat 4 and SLES 9)
Open Accelerator Model (SMP + Accelerators)
[Figure: software stack diagram. Labels: Application, CPU, CSAPI User, Cn/OpenMP SIMD, ClearStack, Open Accelerator, Streaming, Operating System Device driver (CSAPI Kernel), CSX Cores]
With care, an application can mix and match any ClearSpeed API
Can also support an EXOCHI-style shared memory model
ClearSpeed development environment: Roadmap
• Present
  – Optimising Cn Compiler (now)
  – System level profiling and debug (now)
  – Move to Eclipse development environment (Beta)
    • Can now access the Parallel Tools Platform: http://www.eclipse.org/ptp/
• Future
  – ClearStack (near)
    • Advanced APIs (Beta)
    • C++ Object Migration base class (Beta, with partner)
  – Heterogeneous programming environment (next)
    • “Exploiting Loop-Level Parallelism for SIMD Arrays using OpenMP”
    • IWOMP 2007 (Beijing)
  – Fortran (later)
Growing list of Accelerated math libraries
• There are five categories of math function that ClearSpeed provides:
  – A subset of L3 BLAS
    • Matrix operations
  – A subset of LAPACK
    • Linear algebra routines and solvers for LU, QR etc.
  – A set of random number generators
  – A set of vector math functions (sin, exp, log etc.)
  – A set of Fast Fourier Transform routines
  – More in development
• ClearSpeed’s L3 BLAS and LAPACK library includes breakthrough “heterogeneous execution” plug-and-play technology
  – Uses multiple x86 cores simultaneously with the ClearSpeed accelerator board
  – Accelerates Matlab out of the box
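The slide does not describe how the “heterogeneous execution” split is made. One plausible scheme, shown here purely as an illustration, divides the columns of B in C = A·B between host and accelerator in proportion to their sustained rates. The function names and the rates used in the example are hypothetical, and both halves run sequentially here rather than on separate devices.

```python
def matmul(a, b):
    """Plain triple-loop product of two row-major list-of-lists matrices."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def split_point(n_cols, host_rate, accel_rate):
    """Number of columns of B given to the host, proportional to its rate."""
    return round(n_cols * host_rate / (host_rate + accel_rate))


def hetero_matmul(a, b, host_rate, accel_rate):
    cut = split_point(len(b[0]), host_rate, accel_rate)
    b_host = [row[:cut] for row in b]     # columns computed by the host cores
    b_accel = [row[cut:] for row in b]    # columns sent to the accelerator
    # In a real system these two products would run concurrently.
    c_host = matmul(a, b_host)
    c_accel = matmul(a, b_accel)
    return [h + x for h, x in zip(c_host, c_accel)]


a = [[1, 2], [3, 4]]
b = [[1, 0, 2, 1], [0, 1, 1, 3]]
c = hetero_matmul(a, b, host_rate=10.0, accel_rate=30.0)  # hypothetical rates
```

Because each output column depends only on the matching column of B, the two partial results can simply be concatenated, which is what makes a rate-proportional split attractive for Level 3 BLAS calls.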
CSX600 Processor FFT Performance
• 1D 128 forward single complex 1.21M FFT/sec
• 1D 128 forward double complex 600k FFT/sec
• 1D 256 forward single complex 600k FFT/sec
• 1D 256 forward double complex 311k FFT/sec
• 1D 1024 forward single complex 157k FFT/sec
• 1D 1024 forward double complex 78k FFT/sec
• Performance is limited by the low power external memory interface.
• More computing is recommended to redress this balance.
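The FFT rates can be converted into approximate floating point and memory rates. The sketch below uses the conventional 5·N·log2(N) flop count for a complex FFT and assumes each transform is read from and written to external memory exactly once. Both are modelling assumptions, not figures from the slide.

```python
import math

# (points, precision, transforms per second) as listed on the slide
RATES = [
    (128, "single", 1.21e6), (128, "double", 600e3),
    (256, "single", 600e3), (256, "double", 311e3),
    (1024, "single", 157e3), (1024, "double", 78e3),
]
BYTES_PER_COMPLEX = {"single": 8, "double": 16}


def fft_flops(n):
    # Conventional operation count for a length-n complex FFT.
    return 5 * n * math.log2(n)


def summarize(n, precision, per_second):
    gflops = fft_flops(n) * per_second / 1e9
    # One read and one write of the whole data set per transform (assumed).
    gbytes = 2 * n * BYTES_PER_COMPLEX[precision] * per_second / 1e9
    return gflops, gbytes


table = {(n, p): summarize(n, p, r) for n, p, r in RATES}
```

Under these assumptions the 1024-point single precision case works out to about 8 GFLOPS, and every row implies roughly 2.4 to 2.6 GByte/s of external traffic, which is consistent with the remark that the memory interface is the limit.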
Vector Math functions speed comparison (CSX600)
Typical speedup of ~8X over the fastest x86 processors
[Bar chart: 64-bit function operations per second (billions), axis 0.0 to 2.5, by function name: Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, Inv. Series: 2.6 GHz dual-core Opteron, 3 GHz dual-core Woodcrest, ClearSpeed Advance card]
A Cn example radix-2 FFT
/* xy: interleaved (real, imaginary) samples; w: interleaved twiddle table */
void cn_fft(poly float *xy, poly float *w, short n)
{
    poly short n1, n2, ie, ia, i, j, k, l;
    poly float xt, yt, c, s;

    n2 = n;
    ie = 1;
    for (k = n; k > 1; k = (k >> 1)) {      /* one pass per FFT stage */
        n1 = n2;
        n2 = n2 >> 1;
        ia = 0;
        for (j = 0; j < n2; j++) {
            c = w[2*ia];                     /* twiddle pair for this group */
            s = w[2*ia+1];
            ia = ia + ie;
            for (i = j; i < n; i += n1) {    /* butterflies n2 apart */
                l = i + n2;
                xt = xy[2*l] - xy[2*i];
                xy[2*i] = xy[2*i] + xy[2*l];
                yt = xy[2*l+1] - xy[2*i+1];
                xy[2*i+1] = xy[2*i+1] + xy[2*l+1];
                xy[2*l] = (c*xt + s*yt);
                xy[2*l+1] = (c*yt - s*xt);
            }
        }
        ie = ie << 1;
    }
}
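For readers without a Cn toolchain, the same loop structure can be tried in Python on a single data set (what one PE would compute). This is a sketch, not a port: it uses the textbook decimation-in-frequency butterfly (upper element minus lower) with a forward twiddle table that is assumed here, so signs may not match the slide line for line. The helpers twiddle_table, bit_reverse and naive_dft are illustrative names.

```python
import cmath
import math


def twiddle_table(n):
    """Interleaved (cos, sin) of 2*pi*k/n for k = 0 .. n/2 - 1 (assumed layout)."""
    w = []
    for k in range(n // 2):
        angle = 2.0 * math.pi * k / n
        w += [math.cos(angle), math.sin(angle)]
    return w


def fft_dif(xy, w, n):
    """In-place radix-2 DIF FFT on interleaved (re, im) data; n a power of 2.

    Results are left in bit-reversed order.
    """
    n2, ie, k = n, 1, n
    while k > 1:
        n1, n2, ia = n2, n2 >> 1, 0
        for j in range(n2):
            c, s = w[2 * ia], w[2 * ia + 1]
            ia += ie
            for i in range(j, n, n1):
                l = i + n2
                xt = xy[2 * i] - xy[2 * l]
                xy[2 * i] += xy[2 * l]
                yt = xy[2 * i + 1] - xy[2 * l + 1]
                xy[2 * i + 1] += xy[2 * l + 1]
                # Multiply the difference by exp(-2*pi*1j*ia/n).
                xy[2 * l] = c * xt + s * yt
                xy[2 * l + 1] = c * yt - s * xt
        ie <<= 1
        k >>= 1


def bit_reverse(i, bits):
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def naive_dft(x):
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


n = 8
samples = [complex(m, -0.5 * m * m) for m in range(n)]
xy = [part for z in samples for part in (z.real, z.imag)]
fft_dif(xy, twiddle_table(n), n)
# Undo the bit-reversed ordering so that spectrum[k] is frequency bin k.
spectrum = [complex(xy[2 * bit_reverse(k, 3)], xy[2 * bit_reverse(k, 3) + 1])
            for k in range(n)]
```

In the Cn version every variable is poly, so each PE runs this same loop nest on its own local data and the array completes many independent FFTs at once.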
csgdb/ddd debugger and csprof integration
• x86 gdb port enables standard GUIs
  – multi core
  – multi thread
  – single step, breakpoint, watchpoint
• csgdb port consistent with x86 gdb
  – enables standard GUIs with the CSX600
  – single step, breakpoint, watchpoint
• csgdb port is multi-everything
  – Card, processor, thread, PE
  – Profiler control integrated via new csgdb command set.
• Visualize all the state in everything
• Visualize all data movement on the PCI bus
Profiling details of host and board system level activity
[Figure labels: Host CPU(s), Advance Accelerator Board, CSX600, Pipeline]

HOST/BOARD INTERACTION
• View host/board interactions
• Provides performance information for data transfer operations
• Trace cluster node/board interaction
• See overlap of host compute and board compute

CSX600 PIPELINE
• View detailed instruction issue information
• Visualize overlap of executing instructions
• Optimize code at the instruction level
• View instruction level performance bottlenecks
• Get accurate instruction timing

CSX600 SYSTEM
• View system level trace
• Visually inspect the overlap of compute and I/O
• Visualize cache utilization
• View branch trace of code executing
• Find and analyse performance bottlenecks
• Get accurate event timing

HOST CODE PROFILING
• Visually inspect host code executing
• Supports multiple threads and processes
• Time specific code sections
• See overlap of host threads executing
• Platform and processor agnostic trace collection
Profile of complete LINPACK run (x86 view)
Overview of system performance during LINPACK run
Profiling of x86 source code inside LINPACK
CSX600 Interaction displayed with x86 code profile
Multiple CSX600 DGEMM calls (CSX600 view)
View the DGEMM calls on the CSX600 processor
Much higher level of detail available from the profiler
Each call ties up with the host view of card execution
Pipeline view of CSX600 DGEMM inner loop (CSX600 view)
Profile the code running at the instruction level
See the pipeline performance for each instruction
Tune the instruction scheduling for the application code
Summary
• European technology
• Genuine co-processor technology
• Scaleable technology
• COTS processors and boards available now
• IP available for licensing
• Potential for Rad-Hardening on 90-65nm process