ascenium: a continuously reconfigurable architecture

Ascenium: A Continuously Reconfigurable Architecture Robert Mykland Founder/CTO [email protected] August, 2005


Page 1: Ascenium: A Continuously Reconfigurable Architecture

Ascenium: A Continuously Reconfigurable Architecture

Robert Mykland, Founder/CTO

[email protected]
August, 2005

Page 2: Ascenium: A Continuously Reconfigurable Architecture

Ascenium: A Continuously Reconfigurable Processor

Continuously reconfigurable approach provides:
– The computational efficiency of direct logic implementations in ASICs
– All the flexibility of microprocessors (pure software)

Unique architecture enables the array to be effectively targeted by an ANSI Standard C compiler
– CPU logic is continuously redefined to match the compiler's needs for each instruction (≈20 cycles)

Massive parallelism even in programs with little or no inherent instruction-level parallelism
– Dozens of lines of sequential C code are executed every instruction

Page 3: Ascenium: A Continuously Reconfigurable Architecture

Ascenium Block Diagram

Diagram (Ascenium): Reconfigurable Logic Unit (RLU); Config Control; In; Out; Memory Control; Memory

No instruction set
RLU is combinatorial logic that settles
Memory is accessed in statically scheduled block operations
Two instruction streams: 1) Memory 2) Configuration
Whole loops are often executed as one RLU instruction

Page 4: Ascenium: A Continuously Reconfigurable Architecture

Ascenium Basic Operation

Conventional: ALU; Control; Register File; Memory
Ascenium: Reconfigurable Logic Unit (RLU); Config Control; In; Out; Memory Control; Memory

RISC Instructions (counter values shown at each clock)

Clock   Conventional   Ascenium
1       0              0
2       1              0
3       1              5
4       2              10
5       2              15
6       3              20
7       3              25
8       4              30
9       4              30
10      5              30
11      5              30
12      6              30
13      6              30
14      7              30
15      7              30
16      8              30
17      8              30
18      9              30
19      9              35
20      10             40
21      10             45
22      11             145
23      11             155
24      12             165
25      12             265

Page 5: Ascenium: A Continuously Reconfigurable Architecture

A128X16 Rough Layout

Layout blocks: Pad Ring; Onchip Memory; Inst. Pipe; Data Pipe; RLU

130 nm process
128 16-bit logic elements
64 MACs/clock cycle
500 mW @ 200 MHz
52 mm²
256KB onchip memory
19.2 GB/s peak onchip memory bandwidth
200 MHz DDR II = 3.2 GB/s external memory bandwidth
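The bandwidth and throughput figures on this slide can be cross-checked with simple arithmetic. The sketch below is not from the original deck: the function names are hypothetical, and the 64-bit (8-byte) external bus width is an assumption, chosen because it reproduces the stated 3.2 GB/s for 200 MHz DDR II.

```c
#include <assert.h>
#include <stdint.h>

/* Peak bytes per second for a bus:
 * clock rate x transfers per clock x bytes per transfer. */
uint64_t peak_bytes_per_sec(uint64_t clock_hz, unsigned transfers_per_clock,
                            unsigned bus_bytes)
{
    assert(clock_hz > 0 && transfers_per_clock > 0 && bus_bytes > 0);
    return clock_hz * transfers_per_clock * bus_bytes;
}

/* Bytes moved per clock cycle for a given bandwidth. */
uint64_t bytes_per_clock(uint64_t bytes_per_sec, uint64_t clock_hz)
{
    assert(clock_hz > 0);
    return bytes_per_sec / clock_hz;
}

/* Multiply-accumulate operations per second. */
uint64_t macs_per_sec(uint64_t clock_hz, unsigned macs_per_clock)
{
    assert(clock_hz > 0);
    return clock_hz * macs_per_clock;
}
```

Under these assumptions, 200 MHz x 2 transfers per clock x 8 bytes gives 3.2 GB/s, the 19.2 GB/s onchip figure corresponds to 96 bytes per clock at 200 MHz, and 64 MACs/clock at 200 MHz is 12.8 billion MACs per second.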

Page 6: Ascenium: A Continuously Reconfigurable Architecture

Logic Element

Inputs: Input A (16 bits); Input B (16 bits); Input C (16 bits)
Internal blocks: ½ 16-Bit Full Multiplier-Accumulator; Fast Carry; Wide Logic; 16-Bit Add; LUT X (16 bits); LUT Y (16 bits)
Iterators: Iterator X (16 bits); Iterator Y (16 bits); Iterator Z (16 bits)
Outputs: Output X (16 bits); Output Y (16 bits)

Page 7: Ascenium: A Continuously Reconfigurable Architecture

Array Connectivity

Page 8: Ascenium: A Continuously Reconfigurable Architecture

Compiler Block Diagram

Diagram blocks:
Source code: C, C++, Java
GCC Based Compiler (open source)
Library (open source)
Linker (open source)
Proprietary Ascenium Code Generator (develop)
Ascenium Executable

Other labels: Bytecode files; Bytecode (three occurrences)

Page 9: Ascenium: A Continuously Reconfigurable Architecture

Code Generator Diagram

LLVM PARSER
GENERIC INT. REP. (GIR)
GLOBAL THREADING
GIR OPTS.
ASCENIUM INT. REP. (AIR)
FUNC. ASSIGNMENT & SCORING
ARRAY INST. PACKING & COMP.
DATA DIRECTIVES
FETCH ORDER ASSY.
AIR OPTS.
OUTPUT

indicates where optimizations occur

Page 10: Ascenium: A Continuously Reconfigurable Architecture

Generic Intermediate Representation (GIR)

Graph nodes: PARAMA, PARAMB, LOADI, STORE, RET, +, CALL, PARAMA, RET, PARAMA, RET, LOADI, *

Page 11: Ascenium: A Continuously Reconfigurable Architecture

FFT Inner Loop C Code

i1 = i0 + n2;
i2 = i1 + n2;
i3 = i2 + n2;
r1 = x[2 * i0] + x[2 * i2];
r2 = x[2 * i0] - x[2 * i2];
t = x[2 * i1] + x[2 * i3];
x[2 * i0] = r1 + t;
r1 = r1 - t;
s1 = x[2 * i0 + 1] + x[2 * i2 + 1];
s2 = x[2 * i0 + 1] - x[2 * i2 + 1];
t = x[2 * i1 + 1] + x[2 * i3 + 1];
x[2 * i0 + 1] = s1 + t;
s1 = s1 - t;
x[2 * i2] = (r1 * co2 + s1 * si2) >> 15;
x[2 * i2 + 1] = (s1 * co2 - r1 * si2) >> 15;
t = x[2 * i1 + 1] - x[2 * i3 + 1];
r1 = r2 + t;
r2 = r2 - t;
t = x[2 * i1] - x[2 * i3];
s1 = s2 - t;
s2 = s2 + t;
x[2 * i1] = (r1 * co1 + s1 * si1) >> 15;
x[2 * i1 + 1] = (s1 * co1 - r1 * si1) >> 15;
x[2 * i3] = (r2 * co3 + s2 * si3) >> 15;
x[2 * i3 + 1] = (s2 * co3 - r2 * si3) >> 15;
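The slide shows only the loop body, without declarations. As a minimal sketch of how it might be made compilable, the block below wraps the same statements in a function. The function name, the `int16_t` data type, the `int` temporaries, and the reading of the `>> 15` as Q15 twiddle scaling are all assumptions and are not stated in the deck.

```c
#include <assert.h>
#include <stdint.h>

/* One radix-4 butterfly on interleaved data: x[2*i] is taken to be the real
 * part and x[2*i + 1] the imaginary part of point i (an assumption).
 * The statements follow the slide; the wrapper and types are hypothetical.
 * Note: >> on a negative int is implementation-defined in C; this sketch
 * assumes an arithmetic shift, as on common compilers. */
void fft_radix4_butterfly(int16_t *x, int i0, int n2,
                          int co1, int si1, int co2, int si2,
                          int co3, int si3)
{
    int i1, i2, i3, r1, r2, s1, s2, t;

    assert(x != 0 && i0 >= 0 && n2 > 0);

    i1 = i0 + n2;
    i2 = i1 + n2;
    i3 = i2 + n2;

    r1 = x[2 * i0] + x[2 * i2];
    r2 = x[2 * i0] - x[2 * i2];
    t = x[2 * i1] + x[2 * i3];
    x[2 * i0] = (int16_t)(r1 + t);
    r1 = r1 - t;
    s1 = x[2 * i0 + 1] + x[2 * i2 + 1];
    s2 = x[2 * i0 + 1] - x[2 * i2 + 1];
    t = x[2 * i1 + 1] + x[2 * i3 + 1];
    x[2 * i0 + 1] = (int16_t)(s1 + t);
    s1 = s1 - t;

    x[2 * i2] = (int16_t)((r1 * co2 + s1 * si2) >> 15);
    x[2 * i2 + 1] = (int16_t)((s1 * co2 - r1 * si2) >> 15);
    t = x[2 * i1 + 1] - x[2 * i3 + 1];
    r1 = r2 + t;
    r2 = r2 - t;
    t = x[2 * i1] - x[2 * i3];
    s1 = s2 - t;
    s2 = s2 + t;
    x[2 * i1] = (int16_t)((r1 * co1 + s1 * si1) >> 15);
    x[2 * i1 + 1] = (int16_t)((s1 * co1 - r1 * si1) >> 15);
    x[2 * i3] = (int16_t)((r2 * co3 + s2 * si3) >> 15);
    x[2 * i3 + 1] = (int16_t)((s2 * co3 - r2 * si3) >> 15);
}
```

Excluding index arithmetic, one call performs 12 multiplies, 22 additions or subtractions, and 6 shifts. With every cosine twiddle set to 16384 (0.5 in Q15) and every sine twiddle set to 0, the three non-DC outputs are the 4-point DFT terms scaled by one half, which makes a convenient sanity check.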

Page 12: Ascenium: A Continuously Reconfigurable Architecture

FFT Inner Loop Virtual Circuit

Inputs: X0 X2 X0 X2 X1 X3 Y0 Y2 Y0 Y2 Y1 Y3 Y1 Y3 X1 X3
Coefficients: C2 S2 C2 S2 C1 S1 C1 S1 C3 S3 C3 S3
16-bit add/subtract nodes: A (+), B (-), C (+), D (+), E (-), F (+), G (-), H (-), I (-), J (-), K (+), L (-), M (-), N (+)
16-bit multiply nodes: O, P, Q, R, S, T, U, V, W, X, Y, Z
32-bit add nodes: AA, BB, CC, DD, EE, FF
16-bit add nodes: GG, HH
16-bit >>15 shift nodes: II, JJ, KK, LL, MM, NN
Intermediate signals: r1 r2 t1 s1 s2 t2 t3 t4 r3 s3 r4 r5 s4 s5
Outputs: X0o Y0o X2o Y2o X1o Y1o X3o Y3o

Page 13: Ascenium: A Continuously Reconfigurable Architecture

FFT Loop Array Instruction

[Figure: placement of the virtual-circuit nodes A through NN on the array, with inputs X0 to X3 and Y0 to Y3, coefficients C1 to C3 and S1 to S3, and outputs X0o to X3o and Y0o to Y3o.]

Page 14: Ascenium: A Continuously Reconfigurable Architecture

FFT Loop Data Directives

1. Fetch the array instruction and data directives
2. Load the array instruction
3. Read x[t] in four 64-bit wide banks 32 times
4. Write x[t] in one 256-bit wide bank 32 times, changing banks every 8 operations
5. Repeat steps 3 and 4 three more times
6. At the same time as 3–5, read in w coefficients
7. Pass 1: w every clock; 2: w every 4 clocks, then every 16, then every 64
8. Stop
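The step counts above can be turned into a rough cycle estimate. This is a sketch under stated assumptions, not a figure from the slide: it assumes the directives describe a 256-point radix-4 FFT on 16-bit values, that each block read takes one clock, and that writes and coefficient reads overlap fully with the reads. All function names are hypothetical.

```c
#include <assert.h>

/* Number of radix-4 passes for an n-point FFT (n must be a power of 4). */
unsigned radix4_passes(unsigned n)
{
    unsigned passes = 0;
    assert(n >= 4);
    while (n > 1) {
        assert(n % 4 == 0);
        n /= 4;
        passes++;
    }
    return passes;
}

/* 16-bit words delivered per pass by `reads` block reads across `banks`
 * banks that are each `bank_bits` wide. */
unsigned words_per_pass(unsigned reads, unsigned banks, unsigned bank_bits)
{
    return reads * banks * bank_bits / 16;
}

/* Clocks to stream all passes, ASSUMING one block read per clock with
 * writes and coefficient reads fully overlapped. */
unsigned fft_stream_clocks(unsigned n_points, unsigned reads_per_pass)
{
    return radix4_passes(n_points) * reads_per_pass;
}

/* Convert a clock count to nanoseconds at a given clock rate in MHz. */
double clocks_to_ns(unsigned clocks, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return clocks * 1000.0 / clock_mhz;
}
```

With these assumptions, 32 reads of four 64-bit banks deliver 512 16-bit words, which is 256 complex points per pass; four passes take 128 clocks, or 512 ns at the 250 MHz clock quoted for the A128X16 on the benchmark slide. Fetching and loading the array instruction (steps 1 and 2) is not counted.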

Page 15: Ascenium: A Continuously Reconfigurable Architecture

DSP Benchmark Results (Simulated, Hand-Coded)

Benchmark                   TMS320C64x core           A128X16 core
                            130nm, 300MHz*, 250mW     130nm, 250MHz, 250mW
Real Block FIR Filter       6,827 ns                  512 ns
Complex Block FIR Filter    6,827 ns                  512 ns
DCT x 24                    6,080 ns                  512 ns
IIR Filter NR = 2,048       6,827 ns                  292 ns
256-Point Radix-4 FFT       8,920 ns                  512 ns

* The 130nm C64x core will operate up to 720 MHz, but burns more power at higher frequencies.
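The ratios behind this comparison can be computed directly from the reported times. A minimal sketch, using the Real Block FIR Filter (6,827 ns on the C64x, 512 ns on the A128X16) and DCT x 24 (6,080 ns, 512 ns) results as examples; the function name is hypothetical.

```c
#include <assert.h>

/* Speedup of a candidate over a baseline, given execution times in the
 * same unit: values above 1.0 mean the candidate is faster. */
double speedup(double baseline_ns, double candidate_ns)
{
    assert(baseline_ns > 0.0 && candidate_ns > 0.0);
    return baseline_ns / candidate_ns;
}
```

For these two benchmarks the ratios are roughly 13.3 and 11.9, with both cores listed at 250 mW.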

Page 16: Ascenium: A Continuously Reconfigurable Architecture

Ascenium Status

Patents filed

Chip architecture complete, polishing ongoing

Prototype compiler complete & demo-able

Development plan in place

Raising seed round

Page 17: Ascenium: A Continuously Reconfigurable Architecture

Summary

Great performance and performance/mW:
-- >10x programmable DSPs
-- >4x performance/mW of FPGAs

Great time-to-market:
-- Normal GPP software development cycle

Great ease-of-use:
-- Programmers writing standard, non-optimized code in C/C++ get all the performance