ascenium: a continuously reconfigurable architecture

Ascenium: A ContinuouslyReconfigurable Architecture

Robert MyklandFounder/CTO

robert@ascenium.comAugust, 2005

Ascenium:A Continuously Reconfigurable Processor

Continuously reconfigurable approach provides:– The computational efficiency of direct logic implementations

in ASICs– All the flexibility of microprocessors (pure software)

Unique architecture enables the array to beeffectively targeted by an ANSI Standard C compiler– CPU logic is continuously redefined to match the compiler’s

needs for each instruction (≈20 cycles) Massive parallelism even in programs with little or no

inherent instruction-level parallelism– Dozens of lines of sequential C code are executed every

instruction

Ascenium Block Diagram

ReconfigurableLogic Unit (RLU)

ConfigControl In Out

MemoryControl

Memory

Ascenium

No instruction set RLU is combinatorial

logic that settles Memory is accessed in

statically scheduled blockoperations

Two instruction streams1) Memory2) Configuration

Whole loops are oftenexecuted as one RLUinstruction

12 265

Ascenium Basic Operation

11 145

11 155

12 165ALU

ControlRegister

Memory

ReconfigurableLogic Unit (RLU)

ConfigControl In Out

MemoryControl

Memory

Conventional Ascenium

RISC Instructions

Clock =

12 165

PadRing

500 mW @ 200 MHz 52 mm2

OnchipMemory

256KB onchip memory 19.2 GB/s peak onchip memory bandwidth

Inst.Pipe

DataPipe

200 MHz DDR II = 3.2 GB/s external memory bandwidth

A128X16 Rough Layout

RLU 130 nm process 128 16-bit logic elements 64 MACs/clock cycle

Input A (16 bits) Input B (16 bits) Input C (16 bits)

½ 16-Bit FullMultiplier -

Accumulator

Fast Carry

Wide Logic

Iterator X (16 bits) Iterator Y (16 bits) Iterator Z (16 bits)

Output X (16 bits) Output Y (16 bits)

LUT Y (16 bits)LUT X (16 bits)

Logic Element

16-Bit Add

Array Connectivity

Compiler Block Diagram

GCC BasedCompiler

(open source)

Sourcecode

C, C++,Java

ProprietaryAscenium

CodeGenerator(develop)

Library (open source)

AsceniumExecutable

Bytecodefiles

BytecodeBytecode

Linker(open source)

Bytecode

Code Generator Diagram

LLVMPARSER

GENERICINT.REP. (GIR)

GLOBALTHREADING

GIROPTS.

ASCENIUMINT.REP. (AIR)

FUNC.ASSIGNMENT& SCORING

ARRAY INST.PACKING &COMP.

DATADIRECTIVES

FETCHORDERASSY.

AIROPTS.

OUTPUT

indicates whereoptimizations occur

Generic Intermediate Representation (GIR)

PARAMA

PARAMB

PARAMA

FFT Inner Loop C Codei1 = i0 + n2;i2 = i1 + n2;i3 = i2 + n2;r1 = x[2 * i0] + x[2 * i2];r2 = x[2 * i0] - x[2 * i2];t = x[2 * i1] + x[2 * i3];x[2 * i0] = r1 + t;r1 = r1 - t;s1 = x[2 * i0 + 1] + x[2 * i2 + 1];s2 = x[2 * i0 + 1] - x[2 * i2 + 1];t = x[2 * i1 + 1] + x[2 * i3 + 1];x[2 * i0 + 1] = s1 + t;s1 = s1 - t;

x[2 * i2] = (r1 * co2 + s1 * si2) >>15;x[2 * i2 + 1] = (s1 * co2-r1*si2)>>15;t = x[2 * i1 + 1] - x[2 * i3 + 1];r1 = r2 + t;r2 = r2 - t;t = x[2 * i1] - x[2 * i3];s1 = s2 - t;s2 = s2 + t;x[2 * i1] = (r1 * co1 + s1 * si1)>>15;x[2 * i1 + 1] = (s1 * co1-r1 * si1)>>15;x[2 * i3] = (r2 * co3 + s2 * si3)>>15;x[2 * i3 + 1] = (s2 * co3-r2 *si3)>>15;

FFT Inner Loop Virtual Circuit

16>>15

16>>15KK

16>>15LL

16>>15MM

16>>15NN

X0 X2 X0 X2 X1 X3 Y0 Y2 Y0 Y2 Y1 Y3 Y1 Y3 X1 X3

C2 S2 C2 S2 C1 S1 C1 S1 C3 S3 C3

Y0o X2o Y2o X1o Y1o X3o Y3o

r1 r2 t1 s1 s2 t2 t3 t4

r3 s3 r4 r5 s4 s5

FFT Loop Array Instruction

K LM N

YAA BB CC DD

GG HHX0o

Y0oX2oY2o

A CD F I J GG HH

B E GH K LM N

AA BB CC DDX Y IILL

FFT Loop Data Directives

1. Fetch the array instruction and data directives2. Load the array instruction3. Read x[t] in four 64-bit wide banks 32 times4. Write x[t] in one 256-bit wide bank 32 times,

changing banks every 8 operations5. Repeat steps 3 and 4 three more times6. At the same time as 3 –5, read in w coefficients7. Pass 1: w every clock; 2: w every 4 clocks, then

every 16, then every 648. Stop

DSP Benchmark Results(Simulated, Hand-Coded)

6,827 ns512 nsReal Block FIR Filter6,827 ns512 nsComplex Block FIR Filter6,080 ns512 nsDCT x 24

292 ns

512 ns

A128X16 core,130nm, 250MHz,250mW

6,827 nsIIR Filter NR = 2,048

8,920 ns256-Point Radix-4 FFT

TMS320C64x core,130nm, 300MHz*,250mW

Benchmark

* The 130nm C64x core will operate up to 720 MHz, but burns more powerat higher frequencies.

Ascenium Status

Patents filed

Chip architecture complete, polishing ongoing

Prototype compiler complete & demo-able

Development plan in place

Raising seed round

Summary

Great performance and performance/mW:>10x programmable DSPs>4x performance/mW of FPGAs

Great time-to-market:-- Normal GPP software development cycle

Great ease-of-use:-- Programmers writing standard, non-optimized

code in C/C++ get all the performance

ascenium: a continuously reconfigurable architecture

Documents

inter-coarse-grained reconfigurable architecture...

a reconfigurable and extendable digital architecture for

plasticine: a reconfigurable architecture for … ›...

cryptarray a scalable and reconfigurable architecture for

cygraph: a reconfigurable architecture for parallel ... ·...

architecture reconfigurable pour un equipement radio

continuously adaptive datacenter architecture

a restful based architecture for reconfigurable...

générateur d ’architecture pour plate-forme...

a new reconfigurable machine tool architecture for

reconfigurable open architecture computing hardware (roach...

architecture de puissance distribuée reconfigurable

dynamic partial reconfigurable hardware architecture for

onboard processing expandable reconfigurable architecture...

fault tolerant fpga reconfigurable hardware architecture -...

team morphing architecture reconfigurable computational...

northwestern university architecture generation of ... ·...

software development environment for reconfigurable ... ·...

reconfigurable computing architecture for linux - … ·...

reconfigurable computing. this class is about reconfigurable...