a 1024-core 70gflops/w floating point manycore microprocessor · a 1024-core 70gflops/w floating...

15
A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva, Lexington, MA

Upload: ngokhuong

Post on 19-May-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor

Andreas Olofsson, Roman Trogan, Oleg RaikhmanAdapteva, Lexington, MA

Page 2: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

The Past, Present, & Future of Computing

2

BIGCPU

BIGCPU

BIGCPU

BIGCPU

PE

SIMD

PE PE PE

BIGCPU

MINICPU

MINICPU

MINICPU

MINICPU

MINICPU

MINICPU

MINICPU

MINICPU

MINICPU

MIMD

BIGCPU

PE PE PE PE

PE PE PE PE

MINICPU

MINICPU

MINICPU

PAST PRESENT FUTURE

Page 3: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

Adapteva’s Manycore Architecture

C/C++ Programmable Incredibly Scalable 70 GFLOPS/W

3

Page 4: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

Routing Architecture

4

Page 5: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

E64G400 Specifications (Jan-2012)

• 64-Core Microprocessor• 100 GFLOPS performance• 800 MHz Operation• 8GB/sec IO bandwidth• 1.6 TB/sec on chip memory BW• 0.8 TB/sec network on chip BW• 64 Billion Messages/sec • 2 Watt total chip power• 2MB on chip memory• 10 mm2 total silicon area• 324 ball 15x15mm plastic BGA

5

IO Pads

Link Logic

Core

Page 6: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

Lab Measurements

0

10

20

30

40

50

60

70

80

0 200 400 600 800 1000 1200

GFLOPS/W

MHz

Energy Efficiency

ENERGY EFFICIENCY

ENERGY EFFICIENCY (28nm)

6

Page 7: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

Epiphany Performance Scaling

1664

2561024

4096

1

4

16

64

256

1,024

4,096

16,384

GFLOPS

# CoresOn‐demand scaling from 0.25W to 64 WattOn‐demand scaling from 0.25W to 64 Watt

7

Page 8: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

Hold on...the title said 1024 cores!

a

• We can build it any time!• Waiting for customer• LEGO approach to design• No global timing paths• Guaranteed by design• Generate any array in 1 day• ~130 mm2 silicon area 1024 Cores 1 Core

8

Page 9: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

What about 64-bit Floating Point?

Double Precision2 FLOPS/CYCLE

64KB SRAM0.237mm^2

600MHz

Single Precision2 FLOPS/CYCLE

64KB SRAM0.215mm^2

700MHz

9

Page 10: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

Epiphany Latency Specifications

Spec

Nearest Neighbor Core to Core Write <4ns

Nearest Neighbor Core to Core Read <15ns

Farthest On‐chip Core to Core Write <15ns

Farthest On‐chip Core to Core Read <30ns

Chip to Chip Write <60ns

Smallest Message that Achieve Peak Bandwidth 8 Bytes

MaximumMessages Per Second 64 Billion

10

Page 11: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

What about the programming model?

• Whatever fits within a XX-KB memory constraint

• Fork/join task scheduling demonstrated

• Message passing (like MPI) demonstrated

• Master/slave openCL framework in progress

• Others?

11

Page 12: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

Generic Processing Example

• Algorithm:• Fast convolution

• FFT done on input streams• Point-wise multiply, then inverse FFT

• Multicore FFT routines provided by Adapteva• The magic happens at the backend

• Key enablers:• Embarrassment of riches (like FPGA)• Ease of use (like microprocessor)• High bandwidth, flexibility

FFT

FFT

A

B

IFFT

“CUSTOMER SECRETS”

12

Page 13: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

Technology Status (Shipping)

BittWare Design Wins

Shipping Evaluation Systems Worldwide

Epiphany 16 core Chips

Normal Ant

13

Page 14: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

The Big Dream For 2012

14

Process Technology

28nm

#CPUs/Chip 1024

#Chips/Node 16

#Nodes/System 48

#CPUs/Rack 786,432

Performance: ~1PFLOP

Max Power: ~30KW

Page 15: A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor · A 1024-core 70GFLOPS/W Floating Point Manycore Microprocessor Andreas Olofsson, Roman Trogan, Oleg Raikhman Adapteva,

• It’s possible to make CPU’s energy efficient.• 70 GFLOPS/Watt is possible in 2012• There is an alternative to SIMD GPGPUs.• It’s possible to get 60-80% peak efficiency from C-code.• Parallel programming is still too hard!• It’s possible to put 1 Million CPU Cores in a single rack

Conclusions

15