How to teach a billion transistor chip a new trick Dr. ir. Bart Kienhuis University Leiden, LIACS Computer Systems Group http://www.liacs.nl ICT-Kenniscongres 10 April 2006


How to teach a billion transistor chip a new trick

Dr. ir. Bart Kienhuis

University Leiden, LIACS

Computer Systems Group

http://www.liacs.nl

ICT-Kenniscongres 10 April 2006


Outline

- Focus is programming ICs for stream-based applications
  - FPGAs
- Stream-based applications
  - Multimedia
  - Imaging
  - Bio-informatics
  - Classical DSP
- Increasing demand for high-performance embedded compute power
- We know how to build billion-transistor FPGAs, but cannot program them efficiently

[Figure: applications (C / Matlab / Java) on one side and a billion-transistor FPGA on the other, linked by the question "programming?"]


Stream Based Applications

- Imaging
  - HDTV 1080x720 @ 100 Hz
  - 77 million samples per second
  - Suppose 100 operations per sample
  - 8 billion operations per second
- Classical DSP
  - LOFAR, one antenna
  - Sampling rate 200*2 = 400 million samples per second
  - 100 operations per sample (first stage)
  - 40 billion operations per second
  - 10,000 antennas are expected = 40*10^13 operations per second
  - Next, beamforming is performed and statistics are obtained, but on a decimated signal

[Figure: a real-time stream of samples with an operation applied to each sample; 10 – 1000 million samples per second, 10 – 10.00 [sic] operations per sample]
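The numbers on this slide follow from one multiplication: sample rate times operations per sample. A minimal Python sketch of that arithmetic is given here; the function and variable names are illustrative and not taken from the talk.

```python
def operations_per_second(samples_per_second, operations_per_sample):
    """Required compute rate of a stream: sample rate times work per sample."""
    return samples_per_second * operations_per_sample


# Imaging: 1080x720 samples per frame at 100 Hz, 100 operations per sample.
hdtv_samples = 1080 * 720 * 100
hdtv_ops = operations_per_second(hdtv_samples, 100)

# Classical DSP: one LOFAR antenna, 200*2 million samples per second,
# 100 operations per sample in the first stage.
lofar_samples = 200 * 2 * 10**6
lofar_ops = operations_per_second(lofar_samples, 100)

# 10,000 antennas are expected.
lofar_total_ops = lofar_ops * 10_000

if __name__ == "__main__":
    print(f"HDTV : {hdtv_samples / 1e6:.2f} Msamples/s, {hdtv_ops / 1e9:.2f} Gops/s")
    print(f"LOFAR: {lofar_samples / 1e6:.0f} Msamples/s, {lofar_ops / 1e9:.0f} Gops/s per antenna")
    print(f"LOFAR: {lofar_total_ops:.1e} ops/s for 10,000 antennas")
```

The slide rounds the HDTV figures to 77 million samples and 8 billion operations per second.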


Compute Requirements

Trend: ferocious appetite for more embedded compute power

[Figure: compute requirements in billion MAC/s (axis 0 to 2500) per year, 2000 to 2006. Market requirements (Voice over IP, 2.5G, Video over IP, MPEG4, 3G Wireless/WCDMA, HDTV, future broadband standards) are plotted against general purpose DSP/CPU performance; the difference is marked "Increasing Gap".]

Source: TI, Xilinx. 1 MAC = 8-bit multiply-accumulate.


Intrinsic Computational Efficiency (ICE)

[Figure: computational efficiency [MOPS/W], log scale from 10^0 to 10^6, versus feature size [µm] (2, 1, 0.5, 0.25, 0.13, 0.07). Microprocessors (i386SX, i486DX, P5, 68040, microsparc, Supersparc, 601, 604, Ultrasparc, P6, 604e, 21164a, Turbosparc, 21364, 7400) are plotted against the intrinsic computational efficiency of silicon; the region between them is labeled "Playing field".]

Source: E. Roza, "System-on-chip: what are the limits?", IEE Electronics & Communication Engineering Journal, vol. 13, no. 6, Dec. 2001, pp. 249-255.


Xilinx Virtex-II Pro

- Heterogeneous Architecture
  - Multiple CPUs
  - Distributed Memory Blocks
  - Programmable Logic (IP cores)
- Commercially Available Platform
  - Virtex-4 is released and available
  - Virtex-5 is on the drawing board
- FPGAs are becoming more important in Embedded System Design
  - Flexible, since they can easily be re-programmed
  - More functionality at lower cost
  - Replacing a larger part of the ASIC market

[Figure: Virtex-II Pro die as a heterogeneous architecture, with the PowerPCs and the reconfigurable logic labeled]


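Building an FPGA: Logic First

- A 4-input lookup table (LUT) can implement any function of 4 inputs.
- For example, a 1-bit adder needs 2 LUTs, both fed by A, B and Ci: one for the sum S = A⊕B⊕Ci and one for the carry-out Co = A.B + A.Ci + B.Ci.

[Figure: a 4-input LUT drawn as a 16 words x 1 bit memory; the inputs [A,B,C,D] form the address and the stored bit is the output F. A second diagram shows the 1-bit adder with inputs A, B, Ci and outputs S, Co.]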
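The lookup-table idea can be sketched in a few lines of Python: a LUT is a 16-entry, 1-bit memory addressed by its four inputs, and a 1-bit full adder uses one LUT for the sum and one for the carry-out. This is an illustration only; the names `make_lut`, `lut_read` and `full_adder` are assumptions, and the fourth input is simply left unused.

```python
def make_lut(func):
    """Fill a 16 words x 1 bit memory with the truth table of a 4-input function."""
    table = []
    for address in range(16):
        a = (address >> 3) & 1
        b = (address >> 2) & 1
        c = (address >> 1) & 1
        d = address & 1
        table.append(func(a, b, c, d) & 1)
    return table


def lut_read(table, a, b, c, d):
    """Evaluate the LUT: the four inputs form the memory address."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]


# Two LUTs for a 1-bit full adder; the fourth input is ignored.
SUM_LUT = make_lut(lambda a, b, ci, _unused: a ^ b ^ ci)
CARRY_LUT = make_lut(lambda a, b, ci, _unused: (a & b) | (a & ci) | (b & ci))


def full_adder(a, b, ci):
    """Return (S, Co) by reading the two lookup tables."""
    return lut_read(SUM_LUT, a, b, ci, 0), lut_read(CARRY_LUT, a, b, ci, 0)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for ci in (0, 1):
                print(a, b, ci, "->", full_adder(a, b, ci))
```

Reprogramming the chip corresponds to writing different truth tables into the same memories, which is why any 4-input function fits in one LUT.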


Arithmetic, Distributed RAM

- Make LUT RAM a user resource.
- Fast carry ripple to neighbor.

[Figure: logic cell built around the 4-input LUT (16 words x 1 bit memory); labels: Din, WE, Carry, Cin, Cout, FF, CE, RST, M]
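The "fast carry ripple to neighbor" can be pictured as a chain of 1-bit cells in which each cell hands its carry-out to the next one. The Python sketch below models that chain in software; it is an illustration under that assumption, not a description of the actual Xilinx carry logic, and all names are hypothetical.

```python
def to_bits(value, width):
    """Little-endian list of `width` bits (bit 0 first)."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Add two equal-length bit lists; each loop iteration is one logic cell."""
    sum_bits = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)          # sum output of this cell
        carry = (a & b) | (carry & (a ^ b))     # carry handed to the neighbor
    return sum_bits, carry


if __name__ == "__main__":
    s, cout = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
    print(from_bits(s), cout)
```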


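Add Interconnect

- Group logic cells to reduce overhead.
- Add horizontal (H) and vertical (V) routing channels with switchboxes.
- Add input and output MUXing between logic and routing.

[Figure: a block of grouped logic cells with bus-width labels 4 and 40, attached to the routing channels]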


Build an Array

[Figure: four copies of the logic block with its interconnect, tiled into an array]


FPGA Technology

- Virtex-4 already has over a billion transistors!
- FPGAs can take most advantage of Moore’s law
  - More and more logic available on the same die
- CPUs in programmable logic
  - Hardcore =
  - Softcore =


FPGA, the last 10 years

[Figure: capacity, speed, price and memory bandwidth of FPGAs on a log scale (1 to 1000) by year, 1/91 to 1/04; data points labeled Virtex-II Pro and Spartan-3]

Source: Xilinx Research, Kees Vissers


FPGA Programming

- An FPGA is flexible in hardware and software…
  - How to teach a billion transistor chip a new trick
  - The industry's solution doesn’t work
- Programming means:
  - Write “C” programs for the microprocessors
  - Write “VHDL” programs for the IP cores and interconnection network
- “C” programs and “VHDL” programs need to work together
  - High performance
  - Error proof
  - Nightmare to debug
- Programming is currently done manually with lots of engineers

[Figure: Virtex-II Pro FPGA annotated with four “C” labels and one VHDL label]


Software Productivity

[Figure: logic transistors per chip (K) and productivity in lines of code per man-month (Loc/MM), both on log scales, 1981 to 2009. Logic Tr./Chip: 58% / yr. compound complexity growth rate. Loc/MM: 8-10% / yr. compound productivity growth rate. The widening difference between the two curves is the productivity gap.]

We know how to build billion-transistor chips, but do not know how to program them efficiently.
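To get a feeling for how quickly the two growth rates drift apart, the compound growth can be written out in a few lines of Python. This is a sketch with illustrative names; only the rates (58% and 8-10% per year) come from the slide.

```python
def growth(rate_per_year, years):
    """Compound growth factor after `years` at a fixed yearly rate."""
    return (1.0 + rate_per_year) ** years


def gap_factor(years, complexity_rate=0.58, productivity_rate=0.10):
    """How much faster chip complexity has grown than productivity."""
    return growth(complexity_rate, years) / growth(productivity_rate, years)


if __name__ == "__main__":
    for years in (1, 5, 10, 20):
        low = gap_factor(years, productivity_rate=0.10)
        high = gap_factor(years, productivity_rate=0.08)
        print(f"after {years:2d} years the gap has grown {low:,.0f}x to {high:,.0f}x")
```

Even with the more optimistic 10% productivity growth, the gap grows by a factor of more than 30 per decade.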


Research at LIACS (COMPAAN)

How ICT Research at LIACS tackles the Productivity Gap (Leiden University, STW PROGRESS)

[Figure: tool flow. An application in Matlab/C/Java, written as a parameterized nested loop program, is converted by Compaan into a Kahn Process Network, which Laura / Espam map onto an FPGA.]

Parameterized Nested Loop Program:

  for j = 1:1:N,
    [x(j)] = S1( );
  end
  for i = 1:1:K,
    [y(i)] = S2( );
  end
  for j = 1:1:N,
    for i = 1:1:K,
      [y(i), x(j)] = func( y(i), x(j) );
    end
  end
  for i = 1:1:K,
    [Out(i)] = Sink( y(i) );
  end

[Figure: process network with the nodes S1, S2, F1, F2, F3, F4 and Sink]
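A Kahn Process Network is a set of processes that communicate only through FIFO channels with blocking reads. The Python sketch below runs the nested loop program from this slide as four such processes (S1, S2, one process for func, and Sink) and compares the result with a plain sequential run. It is a hand-written illustration, not the network Compaan derives: the slide's figure shows func split over F1 to F4, and the body of `func` used here is a made-up stand-in.

```python
import queue
import threading


def func(y, x):
    """Hypothetical stand-in for the slide's func: accumulate x into y."""
    return y + x, x


def run_sequential(xs, ys):
    """The nested loop program executed as ordinary sequential code."""
    x, y = list(xs), list(ys)
    for j in range(len(x)):
        for i in range(len(y)):
            y[i], x[j] = func(y[i], x[j])
    return y


def run_network(xs, ys):
    """The same computation as processes connected by unbounded FIFOs."""
    n, k = len(xs), len(ys)
    ch_x, ch_y, ch_out = queue.Queue(), queue.Queue(), queue.Queue()
    result = []

    def s1():                       # source of the x tokens
        for value in xs:
            ch_x.put(value)

    def s2():                       # source of the y tokens
        for value in ys:
            ch_y.put(value)

    def f():                        # compute process, blocking reads only
        y = [ch_y.get() for _ in range(k)]
        for _ in range(n):
            x = ch_x.get()
            for i in range(k):
                y[i], x = func(y[i], x)
        for value in y:
            ch_out.put(value)

    def sink():
        for _ in range(k):
            result.append(ch_out.get())

    threads = [threading.Thread(target=body) for body in (s1, s2, f, sink)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result


if __name__ == "__main__":
    print(run_network([1, 2, 3], [10, 20]))
    print(run_sequential([1, 2, 3], [10, 20]))
```

Because every process only blocks on reads from its own input channels, the output does not depend on how the threads are scheduled, which is the property that makes this model attractive for mapping onto parallel hardware.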


Toolbox

- Toolbox to change the number of processes
- Unrolling (More Parallelism)
- Merging (Less Parallelism)
- Better use of Resources
- Re-timing
- C-Slowing
- Exploration
- Track Technology changes

[Figure: an application partitioned into processes P1, P2 and P3, with arrows toward less parallelism and toward more parallelism]
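As a rough picture of what unrolling and merging do to the number of processes, the sketch below distributes the iterations of one process round-robin over several copies and interleaves them back again. This is a simplification for illustration only; the talk does not describe how the toolbox performs these transformations, and the function names are hypothetical.

```python
def unroll(iterations, factor):
    """Split one process's iterations over `factor` processes, round-robin."""
    return [iterations[p::factor] for p in range(factor)]


def merge(partitions):
    """Inverse of unroll: interleave the partitions into a single process."""
    merged = []
    longest = max((len(part) for part in partitions), default=0)
    for step in range(longest):
        for part in partitions:
            if step < len(part):
                merged.append(part[step])
    return merged


if __name__ == "__main__":
    work = list(range(7))
    parts = unroll(work, 3)
    print(parts)          # three processes, each with a share of the work
    print(merge(parts))   # back to one process
```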


Results: Motion JPEG (1)

  for k = 1:1:NumFrames,
    for j = 1:1:VNumBlocks,
      for i = 1:1:HNumBlocks,
        [ Block ] = VideoInMain();
        [ Block ] = DCT( Block );
        [ Block ] = Q( Block );
        [ Packets ] = VLE( Block );
        [ ] = VideoOut( Packets );
      end
    end
  end

[Figure: generated process network; the channels carry Block( j , i )]

Source: DATE04, ‘System Design using Kahn Process Networks: The Compaan/Laura Approach’


Motion JPEG (2)

- From Matlab to FPGA implementation
  - Automatically derived a multiprocessor solution
  - Free choice between microprocessor (MicroBlaze) or hardware (IP core)
- Automatically generated
  - “C” program for all microprocessors
  - VHDL program for interconnections and IP core integration
- Correct by construction
- Seamless integration into the FPGA development environment (EDK)

[Figure: process network of five nodes connected by FIFOs: ND_1 VideoIn, ND_2 DCT, ND_3 Q, ND_4 VLE, ND_5 VideoOut. Values shown: 10837, 135036, 68972, 54210, 2795. Further labels: ND_2DCT, 400, "Difference: 337".]

- Manual: 3 months
- Compaan: 3 weeks


Results: QR Case study

  for k = 1:1:21,
    for j = 1:1:7,
      [r(j,j), rr(j,j), a, b, d(k)] = bcell( r(j,j), rr(j,j), x(k,j), d(k) );
      for i = j+1:1:7,
        [r(j,i), x(k,i)] = icell( r(j,i), x(k,i), a, b );
      end
    end
  end

QR decomposition (Beam former/Radar)

[Figure: bar chart from the exploration, performance axis 0 to 2000 labeled "MFLOPS", annotation "x 28"; bars for QR, QR Skew, QR Skew +2st, QR Skew +4st, QR Skew +6st, QR Skew +7st and QR Skew +10st]

- Manual: 1 year
- Compaan: 3 days
- 60 MFlops out-of-the-box
- 2 GFlops, with Compaan/Laura

Source: FPL04, ‘Increasing pipelined IP core utilization in Process Networks using Exploration’
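The loop bounds of the QR program fix how often each cell fires, which is the workload the exploration has to spread over the pipelined IP cores. The Python sketch below simply walks the same iteration space and counts the calls; the parameter names `num_updates` and `size` are illustrative and stand for the constants 21 and 7 in the slide's program.

```python
def count_firings(num_updates=21, size=7):
    """Count bcell and icell invocations of the QR loop nest."""
    bcell = 0
    icell = 0
    for k in range(1, num_updates + 1):
        for j in range(1, size + 1):
            bcell += 1                         # one boundary cell per (k, j)
            for i in range(j + 1, size + 1):
                icell += 1                     # internal cells right of the diagonal
    return bcell, icell


if __name__ == "__main__":
    b, i = count_firings()
    print(f"bcell fires {b} times, icell fires {i} times")
```

The counts match the closed forms K*N for bcell and K*N*(N-1)/2 for icell, with K = 21 and N = 7.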


Multidisciplinary Work

- Mathematics
  - Mathematical Modeling (Polyhedral Mathematics)
  - Efficient Solvers (Parametric Integer Programming)
- Electrical Engineering
  - Processor Design
  - FPGA Technology
  - Electronics Design Automation
- Computer Science
  - Models of Computation
  - Compiler Technology
  - Software Engineering

Compaan Project: 6 years, 6 PhDs, 2 Postdocs


Conclusions

- Stream-based applications require a heterogeneous multiprocessor architecture
  - Streams of 10 – 1000 million samples per second
  - 10 – 100 giga operations per second
- The latest FPGAs are heterogeneous multiprocessor architectures
  - Over a billion transistors and still counting
- The problem is that we know how to build FPGAs, but programming them is very hard
- Programming/compiler technology is a strategic ICT component to reduce the productivity gap
- Showed the Compaan project at LIACS, which provides a solution for stream-based applications


Thanks to

- Prof. Ed Deprettere
- Dr. Todor Stefanov
- Dr. Sven Verdoolaege
- Alexandru Turjan
- Claudiu Zissulescu
- Hristo Nikolov
- Sjoerd Meijer
- Jerome Lemaitre
- Bin Jiang
- Wei Zhong

For more information: http://www.liacs.nl/~kienhuis