Page 1

High-Performance Reconfigurable Computing Group

Department of Electrical and Computer Engineering
University of Toronto

Why Put FPGAs in your CPU Socket?

Paul Chow

Page 2

What are we talking about?

1. Start with a motherboard with multiple CPU sockets

2. Plug FPGAs into some of those sockets

This achieves the minimum latency between the FPGA and the CPU, such that FPGA-to-CPU latency is the same as CPU-to-CPU latency.

December 11, 2013 FPT 2013

Page 3

Why me?

• Not many have been able to touch in-socket accelerators

• Used in the Toronto Molecular Dynamics Machine project

• Worked closely with the group at Xilinx Labs who were developing the technology in collaboration with Intel

• Disclaimer – some fact checking via Google, but some recollections based on fading memories

Page 4

Benefits

• Avoids the major problem of accelerators
  – Time to move data takes longer than doing the computation on the host

• With lower latency, can accelerate finer-grain tasks

• Easier path to certification for data centers
  – “Just” swapping a CPU chip with an FPGA chip

• New architectures for processor interconnection and moving data onto CPUs (stay tuned for more)

Page 5

Why would AMD/Intel do this?

• Make platforms more open

• Adding FPGAs allows platforms to access use cases not serviced by just CPUs
  – Not replacing CPUs, but for applications where FPGAs are needed
  – Sell more CPUs

• Will still try to displace FPGAs eventually, but learn about new requirements
  – FPGA companies still happy with short-term business!

Page 6

THE 1ST GENERATION IN-SOCKET ACCELERATORS

Page 7

AMD

• Torrenza initiative (2006) promoted accelerators using HyperTransport

• HyperTransport
  – AMD’s processor bus
  – Point-to-point so scalable
  – Cache coherency

[Figure: multi-socket board with two CPUs and two FPGAs, each with its own memory]

Page 8

In a HyperTransport CPU socket


Page 10

In a HyperTransport HTX Socket

• Not restricted to form factor of the CPU

• Can build board to connect to HT

• More area to put other stuff, like memory

[Figure: two CPUs with HTX slots; each HTX slot holds an FPGA board with its own memory]

Page 11

FPGA in an HTX socket

Page 12

Intel

• Still using Front-Side Bus
  – Not scalable
  – Intel QuickAssist Technology for accelerators

[Figure: CPUs and FPGAs attached to the MCH (FSB switch), which connects to memory]

Page 13

FPGAs in an FSB socket

Inside the Intel Caneland

Page 14

How it works: Cache-based communication

[Figure: X86 and FPGA, each with a cache, and host RAM; arrows numbered 1 to 5 correspond to the steps below; a few cache lines (GPRs) carry the signalling]

• Five steps: X86 to FPGA data transfer (i.e. X86 initiates communication)
  – 1) X86 writes data to memory
  – 2) GPR request (X86 writes into the FPGA's cache address range; the content is the memory address where the data in step 1 was placed)
  – 3) FPGA receives the cache update and initiates a DMA read (from where X86 put the data in step 1)
  – 4) Data from the host's main memory is transferred to the FPGA, where it is consumed
  – 5) GPR acknowledge (FPGA writes into X86's cache address range; the content is a 1-bit flag that toggles every time data is written, to signal that the original GPR request in step 2 has been processed)

• FPGA to X86 data transfer is similar (i.e. FPGA initiates communication)
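The five steps above can be sketched as a small software model. This is only an illustration under stated assumptions: the `Link` structure, the buffer sizes, and the function names `x86_send` and `fpga_service` are hypothetical and do not come from the slides or from any Xilinx or Intel API. Plain memory copies stand in for the cache-line writes and the DMA read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes, chosen only for the illustration. */
#define HOST_RAM_BYTES ((size_t)256)
#define FPGA_BUF_BYTES ((size_t)64)

/* Hypothetical model of the state shared by the X86 and the FPGA. */
typedef struct {
    uint8_t host_ram[256];    /* host main memory */
    size_t  gpr_request_addr; /* GPR in the FPGA's cache address range */
    size_t  gpr_request_len;
    int     request_pending;  /* models "the FPGA received a cache update" */
    uint8_t gpr_ack_flag;     /* GPR in the X86's cache address range (1-bit toggle) */
    uint8_t fpga_buf[64];     /* data consumed by the FPGA */
    size_t  fpga_len;
} Link;

/* Steps 1 and 2: write the data to host memory, then post its address
 * as a GPR request. Returns -1 if a request is still outstanding or the
 * data does not fit. */
static int x86_send(Link *l, size_t addr, const void *data, size_t len) {
    assert(l != NULL && data != NULL);
    if (l->request_pending || len > FPGA_BUF_BYTES || addr > HOST_RAM_BYTES - len) {
        return -1;
    }
    memcpy(&l->host_ram[addr], data, len); /* step 1 */
    l->gpr_request_addr = addr;            /* step 2 */
    l->gpr_request_len = len;
    l->request_pending = 1;
    return 0;
}

/* Steps 3 to 5: react to the request, read the data from host memory
 * (a memcpy stands in for the DMA read), then toggle the acknowledge flag.
 * Returns 1 if a request was serviced, 0 if there was nothing to do. */
static int fpga_service(Link *l) {
    assert(l != NULL);
    if (!l->request_pending) {
        return 0;
    }
    memcpy(l->fpga_buf, &l->host_ram[l->gpr_request_addr], l->gpr_request_len); /* steps 3, 4 */
    l->fpga_len = l->gpr_request_len;
    l->request_pending = 0;
    l->gpr_ack_flag ^= 1u;                                                      /* step 5 */
    return 1;
}
```

In this model the X86 side would detect completion by comparing `gpr_ack_flag` with the value it saw before sending, which is why a single toggling bit is enough.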

Page 15

QPI: THE NEXT GENERATION

Page 16

Intel QPI

• FSB was a bus
  – What we played with – more later

• QuickPath Interconnect (QPI)
  – Point-to-point for scalability

[Figure: CPU and FPGA linked by QPI, each with its own memory banks and PCIe]

Page 17

What’s Different?

[Figure: the QPI platform (CPU and FPGA, each with memory banks and PCIe) next to the FSB platform (CPUs and FPGAs on the MCH/FSB switch with shared memory)]

FSB platform:

• FSB form factor limits local memory for FPGA

• Cannot provide other I/O easily – used another layer in stack

• Smaller FPGAs – V5

QPI platform:

• Two memory banks per CPU socket – FPGA can access DIMMs, lots of local memory

• Larger FPGAs – V7

• PCIe slot per socket can be used for an I/O card

Page 18

How does it work?

• Caching Agent
  – Holds the cache and uses (consumes) cache lines

• Home Agent
  – Memory controller that serves up physical address space cache lines

• CPU is both Caching Agent and Home Agent

• FPGA can have either or both, depending on requirements

Page 19

Compute Acceleration

• Utilize coherency provided by Caching Agent

• FPGA application accesses the same address space as the host

• Easier programming using shared-memory model

Page 20

Custom memories

[Figure: CPU and FPGA sockets, each with PCIe; the CPU has memory attached and the FPGA has Flash attached]

Bring Flash memory into the QPI memory space, or some other funky memory type or behavior – include a Home Agent

Page 21

But there’s more…

[Figure: CPU and FPGA sockets, each with memory and PCIe; an SFP card for high-speed I/O (N x 10G, high-speed cable) sits in the PCIe slot]

Utilize the PCIe slot to build I/O for FPGA

Page 22

Streaming Data Processing

• Data streaming in via network links filtered in FPGA

• FPGA transfers only important data to CPU for further processing
  – Do not have to transfer all data to CPU memory and then have CPU filter the data
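As a rough software analogue of the filtering described above, the sketch below forwards only the records that pass a test, so the rest never reach CPU memory. The `Record` type, the threshold rule, and the function names are hypothetical; the slides do not say what filter is applied, and a real design would implement it in FPGA logic rather than C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record arriving on a network link. */
typedef struct {
    uint32_t id;
    int32_t  value;
} Record;

/* Hypothetical filter rule: a record is "important" if its value
 * exceeds the threshold. */
static int is_important(const Record *r, int32_t threshold) {
    return r->value > threshold;
}

/* Copies the records that pass the filter into to_cpu (capacity cap),
 * stopping early if to_cpu fills up, and returns how many were forwarded. */
static size_t filter_stream(const Record *in, size_t n, int32_t threshold,
                            Record *to_cpu, size_t cap) {
    assert(in != NULL && to_cpu != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n && kept < cap; i++) {
        if (is_important(&in[i], threshold)) {
            to_cpu[kept++] = in[i];
        }
    }
    return kept;
}
```

The saving is the difference between `n` and the returned count: only the forwarded records cross the link to the CPU.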

Page 23

Expand QPI across systems

[Figure: four CPU + FPGA platforms, each with memory, PCIe, and SFP for high-speed I/O]

Shared memory across QPI platforms

Page 24

The “Inverted Cluster”

[Figure: four CPU + FPGA platforms, each with memory, PCIe, and SFP for high-speed I/O]

Network connections go via the FPGAs, and the CPUs are slaves to the FPGAs – lower-latency network stack

Page 25

QPI vs PCIe Gen 3

            QPI                             PCIe Gen 3
Latency     About half of PCIe Gen 3        500 ns
            for a transfer of about 1 KB
Bandwidth   7 GB/s                          x8 = 8 GB/s
Standard    Proprietary                     Open

• Use QPI if you really need minimum latency

• Risk with QPI is proprietary bus

• Note that Convey started with FSB and now uses PCIe Gen 3

Page 26

Where are they today? (a)

• 1st generation had several attempts at developing commodity systems
  – None exist today
  – Difficult technology to build
  – No easy programming model

• Intel developed AAL (accelerator abstraction layer)
  – Provides virtual memory access from the FPGA
  – Large page table managed by host AAL driver
  – Host processes can reserve accelerator by first loading page table
  – Available for QPI systems

Page 27

Where are they today? (b)

• Xilinx
  – Not targeting commodity sales
  – Pursuing customers interested in customized QPI

• Altera (Pactron) announced at April IDF
  – No longer on Pactron web site!

Page 28

In an Achronix FPGA!

http://www.achronix.com/applications/hpc.html

Page 29

Heterogeneous Computing

• HSA Foundation
  – Heterogeneous System Architecture
  – Building a heterogeneous compute software ecosystem built on open, royalty-free industry standards and open-source software
  – Make processing elements work together seamlessly

Page 30

USING THE XILINX INTEL FSB PLATFORM – A CASE STUDY

The Accelerated Computing Platform (ACP)

Page 31

The Accelerated Computing Platform

• Developed by Xilinx

• Sold through Nallatech

• Commodity platform to drive down cost

• COTS server-grade motherboard

• FPGA in Xeon socket readily available

• FSB latency and bandwidth between FPGA, Xeon and Memory

Page 32

FSB Configuration Options

Intel’s Caneland MP Xeon platform

[Figure: platform block diagram (source: Nallatech). Labels: FSB, 8.5 GB/s (peak); North Bridge; System Memory, 21 GB/s (peak); South Bridge; 2x PCIe x8, 4 GB/s; 10 GB/s; two switches; 4x PCIe x8 slots; 1x PCIe x4 slot; 2x PCIe x4 slots; 4x SATA]

Page 33

Supported Xeon 7300 System Platforms

• ACP M2 is targeted to Intel 7300 MP server platforms

• Design mechanically validated for Intel SKU S7000FC4UR

Page 34

ACP M2: A Flexible, Modular Architecture

• M2 Compute Module
  – Supports 2 large Virtex-5 FPGAs
    • Can accommodate any FF1738 packaged parts
    • Enables up to 660K LCs per compute module
  – Design allows two (2) Compute modules to be combined in a single stack if desired
    • Enables up to 1,320K LCs per CPU socket
    • Subject to socket power limits

• M2 Base Module
  – The foundation module that attaches to the 7300 platform socket 604
  – 1066 MHz design in an FPGA!!
  – Features a Virtex-5 LX110 which configures as a persistent FSB Bridge
  – Configures and feeds the Compute modules under program control

Page 35

ACP M2 Stack Topology

[Figure: ACP M2 compute stack. The M2 Base Module holds the ACP M2 Base FPGA (FSB Bridge) with Flash, SRAM, and configuration memories; the M2 Compute Module and an optional second M2 Compute Module each hold two ACP M2 Compute FPGAs (FF1738) with DDR2 SRAM. Link labels: 1,066 MHz FSB, 8.5 GB/s, 105 ns; 500 MHz DDR LVDS, 10 GB/s, 5 ns; 300 MHz DDR, 2.4 GB/s each, 5 ns]

Page 36

Programming Models

That is great! But how do we program this?

[Figure: Intel quad-core Xeon (four X86 cores) and two Xilinx Virtex-5 stacks (ACP0, ACP1), each on an FSB to the MCH, which connects to system memory]

Page 37

The Flow

[Figure: design flow diagram, with the annotations “Also a system simulation” and “HLS can do this”]

Page 38

Communication Middleware

[Figure: Xilinx FPGA block diagram. A packet network connects HW Engines (each with HW MPI), a MicroBlaze (SW MPI), and three bridges: MPI FSB Bridge to the FSB interface, MPI LVDS Bridge to the LVDS interface, and MPI MGT Bridge to the MGT interface. Each line is two FSLs (one in each direction)]

Page 39

Achieving Portability with MPI

• Portability is achieved by using a Middleware abstraction layer. MPI natively provides software portability

• Provide a Hardware Middleware to enable hardware portability. The MPE provides the portable hardware interface to be used by a hardware accelerator

[Figure: layer stacks. Software Environment: SW Application, SW Middleware, SW OS, Host-specific Hardware. Heterogeneous Environment: Host (SW Application, SW Middleware, SW OS, Host-specific Hardware) and FPGA (HW Application, HW Middleware, HW OS)]

Page 40

MPI Ring Communication Pattern

int main(int argc, char **argv) {
    int x, my_rank, size;

    MPI_Init(…);
    MPI_Comm_rank(…, &my_rank);
    MPI_Comm_size(…, &size);

    if (my_rank == 0) {
        /* Rank 0 starts the ring and receives the final value */
        x = 1;
        MPI_Send(&x, 1, MPI_INT, 1, …);
        MPI_Recv(&x, 1, MPI_INT, size-1, …);
    } else if (my_rank == size-1) {
        /* Last rank closes the ring by sending back to rank 0 */
        MPI_Recv(&x, 1, MPI_INT, my_rank-1, …);
        x++;
        MPI_Send(&x, 1, MPI_INT, 0, …);
    } else {
        /* Middle ranks increment and pass the value on */
        MPI_Recv(&x, 1, MPI_INT, my_rank-1, …);
        x++;
        MPI_Send(&x, 1, MPI_INT, my_rank+1, …);
    }

    MPI_Finalize();
    return 0;
}

[Figure: ranks R0 to R4 connected in a ring; MPI size = 5]
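To see what value comes back to rank 0, the ring program above can be traced without MPI. The sketch below is a sequential stand-in, not MPI code: `ring_token` is a hypothetical helper that replays the sends and receives in the order the ring forces them to happen.

```c
#include <assert.h>

/* Replays the ring sequentially: rank 0 injects x = 1, every other rank
 * receives from its predecessor and adds 1, and the last rank sends the
 * value back to rank 0. Returns the value rank 0 finally receives. */
static int ring_token(int size) {
    assert(size >= 2);          /* the ring needs at least two ranks */
    int x = 1;                  /* rank 0: x = 1, send to rank 1 */
    for (int rank = 1; rank < size; rank++) {
        x++;                    /* this rank receives from rank-1 and increments */
        /* it then sends to rank+1, or to rank 0 if it is the last rank */
    }
    return x;                   /* rank 0 receives from rank size-1 */
}
```

With the five ranks shown on the slide, `ring_token(5)` returns 5: the value starts at 1 and is incremented once by each of the four other ranks.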

Page 41

Mapping Ranks to Heterogeneous Computing Elements

[Figure: the ring of ranks R0 to R4 mapped onto computing elements; a HW Engine label appears next to R4]

Page 42

Ring Communication Example

[Figure: Intel quad-core Xeon (four X86 cores), ACP0 and ACP1 connected through the MCH to system memory; arrows numbered 1 to 5 trace the ring]

FPGA-FPGA communication through FSB without X86 intervention

Page 43

ACP0 – M2 Base FPGA

[Figure: XC5VLX110 block diagram. Intel FSB, Xilinx FSB interface, MPI FSB Bridge, and a MicroBlaze with GPIO driving LEDs. Each line is two FSLs (one in each direction)]

Page 44

ACP1 – M2 Base FPGA

[Figure: XC5VLX110 block diagram. Intel FSB, Xilinx FSB interface, MPI FSB Bridge, Router, Init, a MicroBlaze with GPIO driving LEDs, and two FSL-LVDS links (to/from compute FPGA 0 and to/from compute FPGA 1). Each line is two FSLs (one in each direction)]

Page 45

ACP1 – M2 Compute 0 and 1 FPGAs

[Figure: XC5VLX330 block diagram. Router, Init, a MicroBlaze with GPIO driving LEDs, and two FSL-LVDS links (to/from Base FPGA and to/from other compute FPGA)]

Page 46

PERFORMANCE TESTING

Page 47

Configurations

Send round-trip messages between two MPI tasks (black squares in the figure).

X86 has Xeon cores using software MPI; FPGA has hardware engines (HW) using the MPE.

Δt = round_trip_time/(2*num_samples)
Latency = Δt for a small message size
BW = message_size/Δt

Measurements here are done using only FSB-Base modules

[Figure: four configurations: Xeon-Xeon, Xeon-HW, Intra-FPGA HW-HW, Inter-FPGA HW-HW]
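The three formulas above translate directly into code. The sketch below assumes that `round_trip_time` is the total elapsed time for `num_samples` round trips, which the slide does not state explicitly; the function names are hypothetical.

```c
#include <assert.h>

/* One-way time per message: dt = round_trip_time / (2 * num_samples).
 * round_trip_time_s is taken to be the total time, in seconds, for
 * num_samples complete round trips (an assumption, see above). */
static double one_way_time(double round_trip_time_s, long num_samples) {
    assert(num_samples > 0);
    return round_trip_time_s / (2.0 * (double)num_samples);
}

/* Bandwidth in bytes per second: BW = message_size / dt. */
static double bandwidth(double message_size_bytes, double dt_s) {
    assert(dt_s > 0.0);
    return message_size_bytes / dt_s;
}
```

For a small message, the value returned by `one_way_time` is reported as the latency; for larger messages the same value feeds `bandwidth`.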

Page 48

Preliminary Performance Numbers

                                  Xeon-Xeon   Xeon-HW   HW-HW (intra-FPGA)   HW-HW (inter-FPGA)
Latency [μs] (64-byte transfer)   1.9         2.78      0.39                 3.5
Bandwidth [MB/s]                  1000        410       531                  400

On-chip network using 32-bit channels and clocked at 133 MHz

MPI using Rendezvous Protocol

Xilinx driver performance numbers: Latency = 0.5 μs (64-byte transfer), Bandwidth = 2 GB/s

MPI Ready Protocol achieves about 1/3 of the Rendezvous latency. For Xeon-HW it is 1 μs (only 2X slower than Xilinx driver transfer latency)

128-bit on-chip channels will quadruple the HW bandwidth (to approx. 2 GB/s) and also reduce latency

Other performance enhancements possible

Page 49

Performance Improvements

• Ready protocol
  – No synchronization overhead as in Rendezvous

• Tiny message protocol
  – Lower latency for small messages (40 Bytes or less)

• From 32 to 128 bits wide data path
  – 32 bits @ 133 MHz = 532 MB/s
  – 128 bits @ 133 MHz = 2.128 GB/s

• Zero copy transfers
  – No intermediate copy to preallocated buffers (↑BW)
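The data-path figures above follow from width times clock rate. A minimal sketch of that arithmetic, assuming one transfer per clock cycle and 1 MB = 10^6 bytes (the function name is hypothetical):

```c
#include <assert.h>

/* Peak bandwidth in MB/s (1 MB = 10^6 bytes) of a channel that moves
 * width_bits on every cycle of a clock running at clock_mhz. */
static double channel_peak_mb_per_s(unsigned width_bits, double clock_mhz) {
    assert(width_bits % 8 == 0);   /* whole bytes only */
    return (double)(width_bits / 8) * clock_mhz;
}
```

A 32-bit channel at 133 MHz gives 4 bytes x 133 = 532 MB/s, and a 128-bit channel gives 16 bytes x 133 = 2128 MB/s, which is the fourfold increase quoted on the slide.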

Page 50

Latency (Point-to-Point)

CPU-initiated ping-pong transfers (FPGA hardware: 128 bits @ 133 MHz)

Page 51

Bandwidth

Ping-pong test hw, 128-bits @ 133 MHz

Page 52

BUILDING A LARGE HPC APPLICATION

Page 53

Molecular Dynamics

• Simulate motion of molecules at atomic level

• Highly compute-intensive

• Understand protein folding

• Computer-aided drug design

Page 54

The TMD Machine

• The Toronto Molecular Dynamics Machine

• Use multi-FPGA system to accelerate MD

• Built using an MPI programming model

• Principal algorithm developer: Chris Madill, Ph.D. candidate (now done!) in Biochemistry
  – Writes C++ using MPI, not Verilog/VHDL

• Have used three platforms – portability

• Plus scalability and maintainability

Page 55

Platform Evolution

Network of Five V2Pro PCI Cards (2006)

• First to integrate hardware acceleration

• Simple LJ fluids only

Network of BEE2 Multi-FPGA Boards (2007)

• Added electrostatic terms

• Added bonded terms

FPGA portability and design abstraction facilitated ongoing migration.

Page 56

2010 – Xilinx/Nallatech ACP

Stack of 5 large Virtex-5 FPGAs + 1 FPGA for FSB PHY interface

Quad socket Xeon Server

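Page 57

Origin of Computational Complexity

Nonbonded terms, O(n^2):

Electrostatics: $U = \frac{1}{2} \sum_{\mathbf{n}} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{q_i q_j}{|\mathbf{r}_{ij} + \mathbf{n}|}$

Lennard-Jones: $V(r) = 4\epsilon \left[ \left( \frac{\sigma}{r} \right)^{12} - \left( \frac{\sigma}{r} \right)^{6} \right]$

Bonded terms, O(n):

Bond stretch: $U_b = \sum_i k_i^b (r_i - r_{0i})^2$

Angle: $U_a = \sum_i k_i^a (\theta_i - \theta_{0i})^2$

Torsion: $U_t = \sum_i \begin{cases} k_i \left[ 1 + \cos(n_i \phi_i - \gamma_i) \right], & n_i \neq 0 \\ k_i (\phi_i - \gamma_i)^2, & n_i = 0 \end{cases}$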

Page 58

Typical MD Simulator

[Figure: four identical nodes; on each, CPU i runs Process i, which computes the Bonded, Nonbonded, and PME terms on Data i]

Page 59

TMD Machine Architecture

[Figure: Input, Scheduler, Output, and Visualizer blocks; Atom Managers; Bond Engines; Long range Electrostatics Engines; Short range Nonbond Engines. The blocks communicate with MPI calls such as MPI::Send(&msg, size, dest …);]

Page 60

Target Platform for MD

[Figure: quad-socket platform (Socket0 to Socket3), each socket on an FSB at 8.5 GB/s @ 1066 MHz: one quad-core Xeon and three FPGA stacks holding 12 NBE FPGAs in total plus 2 PME FPGAs with memory (MEM); system memory; 72.5 GB/s label]

Initial Breakdown of CPU Time [pie chart: Short range Nonbonded, Long range Electrostatic, Bonds]

• 12 short range nonbond FPGAs; 2-3 pipelines/NBE FPGA; each runs 15-30x CPU: NBE 360-1080x

• 2 PME FPGAs with fast memory and fibre optic interconnects: PME 420x

• Bonds on quad-core Xeon server: Bonds 1x
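The 360-1080x range quoted for the nonbond engines is consistent with multiplying the three factors on the slide: number of FPGAs, pipelines per FPGA, and speedup per pipeline. The sketch below checks that arithmetic; it assumes the 15-30x figure applies to each pipeline, and the `Range` type and function name are hypothetical.

```c
#include <assert.h>

/* Closed integer range [low, high]. */
typedef struct {
    int low;
    int high;
} Range;

/* Aggregate speedup = FPGAs x pipelines per FPGA x speedup per pipeline,
 * evaluated at both ends of the ranges. */
static Range aggregate_speedup(int fpgas, Range pipelines, Range per_pipeline) {
    assert(fpgas > 0);
    Range r = { fpgas * pipelines.low * per_pipeline.low,
                fpgas * pipelines.high * per_pipeline.high };
    return r;
}
```

With 12 FPGAs, 2 to 3 pipelines each, and 15 to 30x per pipeline, the result is 12 x 2 x 15 = 360 at the low end and 12 x 3 x 30 = 1080 at the high end.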

Page 61

Performance Modeling

Problem: Difficult to mathematically predict the expected speedup a priori due to the contentious nature of many-to-many communications.

Solution: Measure the non-deterministic behaviour using Jumpshot on the software version and back-annotate the deterministic behaviour.

• Make use of existing tools!

Page 62

Single Timestep Profile

Timestep = 108 ms (327 506 atoms)

Page 63

Performance

• Significant overlap between all force calculations.

• 108.02 ms is equivalent to between 80 and 88 InfiniBand-connected cores at U of T’s supercomputer, SciNet.

• 160-176 hyperthreaded cores

• Can we do better?
  – 140 with hardware bond engines – change engine from SW to HW, no architectural change

Page 64

Final Performance Equivalent for MD

                                FPGA/CPU                Supercomputer              Scaling Factor
Space                           5U                      17.5*2U                    1/7
Cooling                         N/A                     Share of 735-ton chiller   ∞?
Capital Cost                    $15000*                 $120000                    1/8
Annual Electricity Cost         $241 (Assuming 500W)    $6758                      1/30
Performance (Core Equivalent)   140 Cores               1*140 Cores                140x

*Current system is a prototype. Cost is based on projections for next-generation system.
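The scaling factors in the table can be checked with a few divisions. The sketch below uses only the numbers on the slide; the function names are hypothetical, and the 8760 hours per year used for the energy figure is an assumption (continuous operation).

```c
#include <assert.h>

/* How many times smaller the FPGA system's figure is than the
 * supercomputer's, i.e. the N in a "1/N" scaling factor. */
static double times_smaller(double fpga_value, double supercomputer_value) {
    assert(fpga_value > 0.0);
    return supercomputer_value / fpga_value;
}

/* Energy in kWh used over a year (8760 hours) by a constant load. */
static double annual_kwh(double watts) {
    return watts / 1000.0 * 8760.0;
}
```

Space comes out at 35U / 5U = 7 and capital cost at 120000 / 15000 = 8, matching the table. The electricity ratio is 6758 / 241, about 28, which the slide rounds to 1/30. A 500 W load running all year uses 4380 kWh, so the $241 figure implies a price of roughly 5.5 cents per kWh.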

Page 65

TMD Perspective

• Still comparing apples to oranges.

• Individually, hardware engines are able to sustain calculations hundreds of times faster than traditional CPUs.

• Communication costs degrade overall performance.

• FPGA platform is using older CPUs and older communication links than SciNet.

• Migrating the FPGA portion to a SciNet compatible platform will further increase the relative performance and provide a more accurate CPU/FPGA comparison.

Page 66

Conclusion

• In-socket accelerators
  – Use for absolute minimum latency
  – Cache coherency for easier programming
  – Proprietary bus so at mercy of vendor
  – “Exotic” technology
  – Use only if you really, really need it!

Page 67

Acknowledgements

SOCRN, emSYSCAN

Page 68

Questions?