ascenium: a continuously reconfigurable architecture
Post on 05-Feb-2022
3 Views
Preview:
TRANSCRIPT
Ascenium: A ContinuouslyReconfigurable Architecture
Robert MyklandFounder/CTO
robert@ascenium.comAugust, 2005
Ascenium:A Continuously Reconfigurable Processor
Continuously reconfigurable approach provides:– The computational efficiency of direct logic implementations
in ASICs– All the flexibility of microprocessors (pure software)
Unique architecture enables the array to beeffectively targeted by an ANSI Standard C compiler– CPU logic is continuously redefined to match the compiler’s
needs for each instruction (≈20 cycles) Massive parallelism even in programs with little or no
inherent instruction-level parallelism– Dozens of lines of sequential C code are executed every
instruction
Ascenium Block Diagram
ReconfigurableLogic Unit (RLU)
ConfigControl In Out
MemoryControl
Memory
Ascenium
No instruction set RLU is combinatorial
logic that settles Memory is accessed in
statically scheduled blockoperations
Two instruction streams1) Memory2) Configuration
Whole loops are oftenexecuted as one RLUinstruction
25
12 265
Ascenium Basic Operation
1
0 0
2
1 0
3
1 5
4
2 10
5
2 15
6
3 20
7
3 25
8
4 30
9
4 30
10
5 30
11
5 30
12
6 30
13
6 30
14
7 30
15
7 30
16
8 30
17
8 30
18
9 30
19
9 35
20
10 40
21
10 45
22
11 145
23
11 155
24
12 165ALU
ControlRegister
File
Memory
ReconfigurableLogic Unit (RLU)
ConfigControl In Out
MemoryControl
Memory
Conventional Ascenium
RISC Instructions
Clock =
12 165
24
PadRing
500 mW @ 200 MHz 52 mm2
OnchipMemory
256KB onchip memory 19.2 GB/s peak onchip memory bandwidth
Inst.Pipe
DataPipe
200 MHz DDR II = 3.2 GB/s external memory bandwidth
A128X16 Rough Layout
RLU 130 nm process 128 16-bit logic elements 64 MACs/clock cycle
Input A (16 bits) Input B (16 bits) Input C (16 bits)
½ 16-Bit FullMultiplier -
Accumulator
Fast Carry
Wide Logic
Iterator X (16 bits) Iterator Y (16 bits) Iterator Z (16 bits)
Output X (16 bits) Output Y (16 bits)
LUT Y (16 bits)LUT X (16 bits)
Logic Element
16-Bit Add
Array Connectivity
Compiler Block Diagram
GCC BasedCompiler
(open source)
Sourcecode
C, C++,Java
ProprietaryAscenium
CodeGenerator(develop)
Library (open source)
AsceniumExecutable
Bytecodefiles
BytecodeBytecode
Linker(open source)
Bytecode
Code Generator Diagram
LLVMPARSER
GENERICINT.REP. (GIR)
GLOBALTHREADING
GIROPTS.
ASCENIUMINT.REP. (AIR)
FUNC.ASSIGNMENT& SCORING
ARRAY INST.PACKING &COMP.
DATADIRECTIVES
FETCHORDERASSY.
AIROPTS.
OUTPUT
indicates whereoptimizations occur
Generic Intermediate Representation (GIR)
PARAMA
PARAMB
LOADI
STORE
RET
+
CALL
PARAMA
RET
PARAMA
RET
LOADI
*
FFT Inner Loop C Codei1 = i0 + n2;i2 = i1 + n2;i3 = i2 + n2;r1 = x[2 * i0] + x[2 * i2];r2 = x[2 * i0] - x[2 * i2];t = x[2 * i1] + x[2 * i3];x[2 * i0] = r1 + t;r1 = r1 - t;s1 = x[2 * i0 + 1] + x[2 * i2 + 1];s2 = x[2 * i0 + 1] - x[2 * i2 + 1];t = x[2 * i1 + 1] + x[2 * i3 + 1];x[2 * i0 + 1] = s1 + t;s1 = s1 - t;
x[2 * i2] = (r1 * co2 + s1 * si2) >>15;x[2 * i2 + 1] = (s1 * co2-r1*si2)>>15;t = x[2 * i1 + 1] - x[2 * i3 + 1];r1 = r2 + t;r2 = r2 - t;t = x[2 * i1] - x[2 * i3];s1 = s2 - t;s2 = s2 + t;x[2 * i1] = (r1 * co1 + s1 * si1)>>15;x[2 * i1 + 1] = (s1 * co1-r1 * si1)>>15;x[2 * i3] = (r2 * co3 + s2 * si3)>>15;x[2 * i3 + 1] = (s2 * co3-r2 *si3)>>15;
FFT Inner Loop Virtual Circuit
16+A
16-B
16+C
16+D
16-E
16+F
16-G
16-H
16-I
16-J
16+K
16-L
16-M
16+N
16XO
16XP
16XQ
16XR
16XS
16XT
16XU
16XV
16XW
16XX
16XY
16XZ
32+
AA
32+
BB
32+
CC
32+
DD
32+
EE
32+
FF
16+
GG
16+
HH
16>>15
II
16>>15
JJ
16>>15KK
16>>15LL
16>>15MM
16>>15NN
X0 X2 X0 X2 X1 X3 Y0 Y2 Y0 Y2 Y1 Y3 Y1 Y3 X1 X3
C2 S2 C2 S2 C1 S1 C1 S1 C3 S3 C3
X0o
Y0o X2o Y2o X1o Y1o X3o Y3o
S3
r1 r2 t1 s1 s2 t2 t3 t4
r3 s3 r4 r5 s4 s5
FFT Loop Array Instruction
AX0X2
Y0Y2
B
CX1X3
Y1Y3
D
E
F
GH
I J
K LM N
C2 S2
OP Q
R
C1
S
S1
TV
U
W
C3 S3
XZ
YAA BB CC DD
EE FF
GG HHX0o
Y0o
II
X2o
JJ
Y2o
KK
X1o
Y1o
LL
MM
X3o
Y3o
NN
X0X2
Y0Y2
X1X3
Y1Y3
C2 S2
C1 S1
C3 S3
X0o
Y0oX2oY2o
X1o
Y1oX3
o
Y3o
A CD F I J GG HH
B E GH K LM N
OP Q
R ST
VU
AA BB CC DDX Y IILL
W Z
EE FF
JJ
KKNN
MM
FFT Loop Data Directives
1. Fetch the array instruction and data directives2. Load the array instruction3. Read x[t] in four 64-bit wide banks 32 times4. Write x[t] in one 256-bit wide bank 32 times,
changing banks every 8 operations5. Repeat steps 3 and 4 three more times6. At the same time as 3 –5, read in w coefficients7. Pass 1: w every clock; 2: w every 4 clocks, then
every 16, then every 648. Stop
DSP Benchmark Results(Simulated, Hand-Coded)
6,827 ns512 nsReal Block FIR Filter6,827 ns512 nsComplex Block FIR Filter6,080 ns512 nsDCT x 24
292 ns
512 ns
A128X16 core,130nm, 250MHz,250mW
6,827 nsIIR Filter NR = 2,048
8,920 ns256-Point Radix-4 FFT
TMS320C64x core,130nm, 300MHz*,250mW
Benchmark
* The 130nm C64x core will operate up to 720 MHz, but burns more powerat higher frequencies.
Ascenium Status
Patents filed
Chip architecture complete, polishing ongoing
Prototype compiler complete & demo-able
Development plan in place
Raising seed round
Summary
Great performance and performance/mW:>10x programmable DSPs>4x performance/mW of FPGAs
Great time-to-market:-- Normal GPP software development cycle
Great ease-of-use:-- Programmers writing standard, non-optimized
code in C/C++ get all the performance
top related