the next cloud fabric - university of south carolina · target: accelerate ranking as a service...

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud Doug Burger Director, Hardware, Devices, & Experiences MSR NExT November 15, 2015


Page 1:

Catapult: A Reconfigurable Fabric for Petaflop Computing in the Cloud

Doug Burger

Director, Hardware, Devices, & Experiences

MSR NExT

November 15, 2015

Page 2:

The Cloud is a Growing Disruptor for HPC

Disruption

Homogeneity

Moore’s Law

Economics

Page 3:

A 2-3 Horse Race

Page 4:

Hyperscale Cloud Fabrics

[Figure: datacenter network diagram with top-of-rack (ToR) switches connected to CS switches]

Page 5:

Accelerator Constraints of the Cloud

Efficiency (ASICs)

Homogeneity

Page 6:

Catapult Project History

• December 9, 2010 – initial meeting
  • Christmas break 2010: feasible to accelerate ranking?

• January 12, 2011 – Meeting with Bing leadership

• 2011 – v0: ported the then-current Bing ranking stack, built BFB board

• 2012 – v1: developed distributed architecture

• 2013 – Took v1 to scale, Bing pilot

• 2014 – v2: developed new architecture, commenced work with Azure

• 2015 – Mainstreamed: production and expansion
  • Intel announced Altera acquisition, $16.7B

Page 7:

Microsoft Open Compute Server

• Two 8-core Xeon 2.1 GHz CPUs

• 64 GB DRAM

• 4 HDDs, 2 SSDs

• 10 Gb Ethernet

• No cable attachments to server

Microsoft Confidential 7

Page 8:

Catapult V1 Accelerator Card


• Altera Stratix V D5
  • 172.6K ALMs, 2014 M20Ks
  • 457 KLEs
  • 1 KLE == ~12K gates
  • M20K is a 2.5KB SRAM

• PCIe Gen 2 x8, 8GB DDR3

• 20 Gb network among FPGAs

[Figure: accelerator card with labels Stratix V, 8GB DDR3, PCIe Gen3 x8]
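The per-device figures on this slide can be turned into rough totals with a little arithmetic. The sketch below is only a back-of-the-envelope check: it assumes that "2.5KB" means 2.5 KiB (a 20 Kbit block) and takes the slide's "~12K gates per KLE" rule of thumb at face value. The function names are hypothetical.

```python
# Figures quoted on the slide.
M20K_BLOCKS = 2014
KLES = 457
GATES_PER_KLE = 12_000          # slide: 1 KLE == ~12K gates (approximate)

# ASSUMPTION: "2.5KB" is read as 2.5 KiB, i.e. 2560 bytes per M20K block.
M20K_BYTES = int(2.5 * 1024)


def on_chip_sram_bytes(blocks=M20K_BLOCKS, block_bytes=M20K_BYTES):
    """Total block-RAM capacity if every M20K is used as plain storage."""
    return blocks * block_bytes


def approx_gate_count(kles=KLES, gates_per_kle=GATES_PER_KLE):
    """Rough gate-equivalent count implied by the slide's rule of thumb."""
    return kles * gates_per_kle


if __name__ == "__main__":
    print(f"on-chip SRAM: {on_chip_sram_bytes() / 2**20:.2f} MiB")
    print(f"gate equivalents: {approx_gate_count() / 1e6:.2f} M")
```

Under these assumptions the block RAM adds up to roughly 5 MB, which is small next to the 8GB of DDR3 on the card.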

Page 9:

6x8 Torus in a 2x24 Server Layout
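A 6x8 torus gives 48 nodes, each with four wrap-around neighbors. A minimal sketch of the addressing follows; the (row, col) coordinate scheme and the function names are assumptions made for illustration, and the slide does not say how torus coordinates map onto the physical 2x24 server layout.

```python
ROWS, COLS = 6, 8  # torus dimensions from the slide


def torus_neighbors(row, col, rows=ROWS, cols=COLS):
    """Return the four neighbors of (row, col), wrapping at the edges."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


def hop_distance(a, b, rows=ROWS, cols=COLS):
    """Minimum hop count between two nodes, using the wrap-around links."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, rows - dr) + min(dc, cols - dc)


if __name__ == "__main__":
    print(torus_neighbors(0, 0))
    print(hop_distance((0, 0), (3, 4)))  # farthest node from (0, 0)
```

With wrap-around links, the worst-case distance in this topology is 3 + 4 = 7 hops.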

Page 10:

1,632 server pilot deployed in production BN datacenter

Page 11:

Target: Accelerate Ranking as a Service

Selection-as-a-Service (SaaS)

• Find all docs that contain query terms

• Filter and select candidate documents for ranking

Ranking-as-a-Service (RaaS)

• Compute relevance scores for each selected doc

• Sort the scores and return the results

[Figure: a Query enters Selection as a Service (SaaS 1 to SaaS 48); Selected Documents flow to Ranking as a Service (RaaS 1 to RaaS 48), which returns the 10 blue links. Both tiers are drawn with blocks labeled IFM 1 to IFM 44.]
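The split between selection and ranking can be sketched as two small functions. This is a toy illustration, not Bing's implementation: the inverted index, the scoring function, and the reading of "contain query terms" as "contain every query term" are all assumptions.

```python
def select(index, query_terms):
    """Selection: ids of documents that contain every query term.

    `index` is a hypothetical inverted index: term -> set of doc ids.
    """
    postings = [index.get(term, set()) for term in query_terms]
    return set.intersection(*postings) if postings else set()


def rank(doc_ids, score_fn, top_k=10):
    """Ranking: score each selected doc, sort, and return the best ids."""
    scored = [(score_fn(doc_id), doc_id) for doc_id in doc_ids]
    # Highest score first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


if __name__ == "__main__":
    index = {"fpga": {1, 2, 3}, "cloud": {2, 3, 4}}   # toy data
    toy_scores = {2: 0.4, 3: 0.9}                     # toy relevance scores
    candidates = select(index, ["fpga", "cloud"])
    print(rank(candidates, lambda d: toy_scores.get(d, 0.0)))  # [3, 2]
```

Keeping the two stages behind separate interfaces is what lets the scoring step be swapped for an accelerator without touching selection.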

Page 12:

FPGA Accelerator for Bing Ranking

Document + Query -> Score

• FE: Feature Extraction
  • Document features
  • Hand-coded Verilog
  • ~4K features

• FFE: Free-Form Expressions
  • ~2K synthetic features
  • FFE #1 = (2*NumberOfOccurrences_0 + NumberOfOccurrences_1)(2 * NumberOfTuples_0_1)

• MLS: Machine Learning Scoring

[Figure: decision trees over features (FE7, FFE3, FFE2, FE9) with threshold tests (≤ T1 / > T1, ≤ T2 / > T2, ≤ T3 / > T3) and score leaves]

[Figure: query processing stages: Query Augmentation, Query Understanding, Document Selection, Document Ranking, Caption Generation, Page Assembly]

[Figure: 12-Stage Pipeline across FPGA 0 to FPGA 11]

Demonstrated ~2x throughput gain and stability justifying production
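The FE, FFE and MLS stages can be illustrated in software as feature lookup, derived-feature arithmetic, and threshold-tree scoring. The sketch below is heavily hedged: the operator between the two parenthesized groups of FFE #1 did not survive in the transcript, so the division used here is an assumption; summing tree outputs is likewise an assumed form of scoring; and all thresholds, leaf values, and feature values are invented for illustration.

```python
def ffe_1(f):
    """Synthetic feature FFE #1.

    ASSUMPTION: the two groups are combined by division; the transcript
    does not show the operator. Raises ZeroDivisionError on zero tuples.
    """
    numerator = 2 * f["NumberOfOccurrences_0"] + f["NumberOfOccurrences_1"]
    return numerator / (2 * f["NumberOfTuples_0_1"])


def tree_score(node, f):
    """Walk one decision tree.

    A node is either a float leaf (the score) or a tuple
    (feature_name, threshold, subtree_if_le, subtree_if_gt).
    """
    while not isinstance(node, float):
        name, threshold, if_le, if_gt = node
        node = if_le if f[name] <= threshold else if_gt
    return node


def mls_score(trees, f):
    """ASSUMPTION: the final score is the sum of the per-tree scores."""
    return sum(tree_score(tree, f) for tree in trees)


if __name__ == "__main__":
    # Hypothetical FE output for one (document, query) pair.
    features = {
        "NumberOfOccurrences_0": 7,
        "NumberOfOccurrences_1": 4,
        "NumberOfTuples_0_1": 1,
    }
    features["FFE1"] = ffe_1(features)
    trees = [
        ("FFE1", 5.0, 0.1, ("NumberOfOccurrences_0", 10, 0.6, 0.9)),
        ("NumberOfOccurrences_1", 3, 0.2, 0.3),
    ]
    print(features["FFE1"], mls_score(trees, features))
```

Each stage consumes only the previous stage's output, which is the property that allows the stages to be spread across separate FPGAs in a pipeline.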

Page 13:

Pilot Results (FPGA vs. Software)

[Figure: Average Latency vs. Throughput, HW vs. SW; axis ticks run 0 to 10 and 0 to 4000]

[Figure: 95% Latency vs. Throughput, HW vs. SW; axis ticks run 0 to 20 and 0 to 4500]

Bing’s latency target at ~2X throughput

Page 14:

Catapult V1 Shell Architecture

[Figure: shell block diagram; recoverable labels listed below, grouping approximate]

• FPGA: PCIe core (Gen2 x8, Gen3 capable), PCIe DMA, inter-FPGA router, Xcvr config, four SLIII cores, two DDR3 cores, RSU, JTAG, local application

• Board: two 4GB SO-DIMMs, 256 Mb NAND, voltage regulator, status LEDs

• Voltage labels: 12V, 1.5V, 0.85V

• Host: driver, reconfig

• Buffers: 2 x 16 RAMs, 64 slots, 32 B – 64 KB / slot, I/O

• Width labels: 120, 120; 4, 4, 4, 4; 4

Page 15:

Production issues at scale

• Build system

• License servers, availability of source, build machines

• Scale-out qualification of IP

• Clean interfaces for high-productivity development environment

• Shell/driver/application versioning and deployment
  • Backwards compatibility

• Health monitoring and failure diagnostics
  • Continuous reporting of interface health, soft error rate, etc.

• Debugging (esp. on livesite)
  • Flight Data Recorder to replay bug-generating condition

• System integrity testing - many servers/vendors

• Scalability of verification

• In situ updates to drivers, golden image, shell

• Supply chain management
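The "Flight Data Recorder" item suggests keeping a bounded history of recent events so that the conditions leading up to a failure can be inspected or replayed. One common way to do that in software is a fixed-size ring buffer; the sketch below is only that general idea, and the class name, fields, and capacity are hypothetical rather than a description of the Catapult mechanism.

```python
from collections import deque


class FlightDataRecorder:
    """Keep the most recent `capacity` events; older ones are dropped."""

    def __init__(self, capacity=1024):
        # deque with maxlen discards the oldest entry on overflow.
        self._events = deque(maxlen=capacity)

    def record(self, cycle, source, payload):
        """Append one event: a timestamp, where it came from, and its data."""
        self._events.append((cycle, source, payload))

    def dump(self):
        """Return the retained events, oldest first, for offline replay."""
        return list(self._events)


if __name__ == "__main__":
    fdr = FlightDataRecorder(capacity=3)
    for cycle in range(5):
        fdr.record(cycle, "router", f"packet-{cycle}")
    print(fdr.dump())  # only the last three events remain
```

The bounded buffer keeps the recording cost constant, which matters when it has to run continuously on live machines.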

Page 16:

Azure SmartNIC

• Announced at ONS

• Use an FPGA for reconfigurable functions
  • FPGAs are already used in Bing (Catapult)
  • Roll out hardware as we do software

• Programmed using Generic Flow Tables (GFT)
  • Language for programming SDN to hardware
  • Uses connections and structured actions as primitives

• SmartNIC can also do Crypto, QoS, storage acceleration, and more …
  • 40Gb bidirectional AES demo

[Figure: block diagram with labels Host, CPU, NIC ASIC, FPGA, ToR]
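The phrase "connections and structured actions as primitives" describes a match-action style of programming: a packet's connection is looked up in a table, and the stored list of actions is applied. The sketch below illustrates that general pattern only. It is not the GFT language or API; `FlowKey`, `FlowTable`, `rewrite_dst`, and the slow-path policy callback are all hypothetical names.

```python
from typing import NamedTuple


class FlowKey(NamedTuple):
    """A connection, identified here by the usual 5-tuple (an assumption)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str


def rewrite_dst(new_ip):
    """Build an action that rewrites the packet's destination address."""
    def action(packet):
        return {**packet, "dst_ip": new_ip}
    return action


class FlowTable:
    """Cache per-connection action lists; ask a slow path on a miss."""

    def __init__(self, slow_path):
        self._flows = {}
        self._slow_path = slow_path  # callable: FlowKey -> list of actions

    def process(self, key, packet):
        actions = self._flows.get(key)
        if actions is None:
            # First packet of a connection: consult policy, then cache.
            actions = self._slow_path(key)
            self._flows[key] = actions
        for action in actions:
            packet = action(packet)
        return packet


if __name__ == "__main__":
    table = FlowTable(lambda key: [rewrite_dst("10.0.0.9")])
    key = FlowKey("1.1.1.1", "2.2.2.2", 1234, 80, "tcp")
    print(table.process(key, {"dst_ip": "2.2.2.2", "payload": b"hi"}))
```

Only the first packet of a connection pays for the policy decision; later packets hit the cached entry, which is the part that lends itself to hardware offload.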

Page 17:

Page 18:

FPGAs “versus” GPUs

             CPUs              GPUs                FPGAs
Language     C/C++             CUDA                Verilog -> OpenCL (?)
Performance  400 Gflops        6 Tflops -> 10T     100G -> 1T -> 4T
Efficiency   5 Gflops/W        -> 20 Gflops/W      40-50 G/W -> 80-100 G/W
Scale        2M+ and growing   1s -> 10s -> 100s   10Ks -> 100Ks -> 1M+
DRAM BW      85 GB/s           2x240 GB/s          10GB/s -> 20GB/s -> 200-500GB/s

Page 19:

Large-Scale Reconfigurable Computing for HPC

[Figure: datacenter network (ToR and CS switches) with workloads placed on the fabric: Deep Learning, Bing Ranking HW, HPC / MPI Offload, Deep Compression, Bing Ranking SW]

Page 20:

Conclusions

• We are at the dawn of a new era

• Programmable logic playing a central role in systems at massive scale

• “A new kind of computer”

• Will enable new applications and services to be cost effective

• Will change system architecture, both in server and at cloud scale