Thinking Outside the Die: Architecting the ML Accelerator of the Future

Sean Lie, Co-founder & Chief HW Architect, Cerebras


TRANSCRIPT

Page 1: Thinking Outside the Die

Sean Lie, Co-founder & Chief HW Architect, Cerebras

Thinking Outside the Die: Architecting the ML Accelerator of the Future

Page 2: Thinking Outside the Die


Cerebras Systems

Building and deploying a new class of computer system

Designed to accelerate AI and change the future of AI work

Founded in 2016

350+ Engineers

Offices

Silicon Valley | San Diego | Toronto | Tokyo

Customers

North America | Asia | Europe

Page 3: Thinking Outside the Die


1800x more compute

In just 2 years

Tomorrow, multi-trillion parameter models

Exponential Growth of Neural Networks

[Chart: "Memory and compute requirements"; total training compute (PFLOP-days) vs. model memory requirement (GB), both on log scales from 1 to 100,000. 2018: BERT Base (110M), BERT Large (340M). 2019: GPT-2 (1.5B), Megatron-LM (8B), T5 (11B). 2020+: T-NLG (17B), GPT-3 (175B), MT-NLG (530B), MSFT-1T (1T)]

Page 4: Thinking Outside the Die


A lot of innovation in the last few years

For example…

• Process: 16nm, 12nm, 7nm

• Memory: HBM, HBM2

• Precision: float16, bfloat16

• Datapath: Systolic array, Tensor cores

ML Acceleration Improvements in Industry

Low precision numerics

Dense GEMM datapath

Floating-point formats (exponent bits, mantissa bits):

FP32: 8b exponent, 23b mantissa

FP16: 5b exponent, 10b mantissa

BFLOAT16: 8b exponent, 7b mantissa
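As a concrete illustration of the precision trade-off (a minimal sketch, not Cerebras code): BFLOAT16 keeps FP32's 8-bit exponent and truncates the mantissa to 7 bits, so it can be emulated by zeroing the low 16 bits of an FP32 value.

    import struct

    # Bit layouts referenced above (sign + exponent + mantissa):
    #   FP32:     1 + 8 + 23
    #   FP16:     1 + 5 + 10
    #   BFLOAT16: 1 + 8 + 7   (FP32's range, much less precision)

    def to_bfloat16(x: float) -> float:
        """Truncate an FP32 value to BFLOAT16 precision (round toward zero)."""
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

    print(to_bfloat16(3.14159265))  # ~3.140625: only 7 mantissa bits survive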

Page 5: Thinking Outside the Die


For example…

• Process: 16nm, 12nm, 7nm

• Memory: HBM, HBM2

• Precision: float16, bfloat16

• Datapath: Systolic array, Tensor Core

Process Technology Gains

Moore's Law is not dead!

[Chart: transistor density (transistors/mm²) vs. year, 2006-2022, log scale from 1.0E+06 to 1.0E+08: G80 (90nm), GTX480 (40nm), P100 (16nm), V100 (12nm), A100 (7nm)]

Page 6: Thinking Outside the Die


For example…

• Process: 16nm, 12nm, 7nm

• Memory: HBM, HBM2

• Precision: float16, bfloat16

• Datapath: Systolic array, Tensor Core

Architecture and Microarchitecture Gains

Architecture Matters!

[Chart: peak FLOPS per transistor vs. year, 2006-2022, log scale from 1.0E+02 to 1.0E+04: G80, GTX480, P100, V100, A100]

Page 7: Thinking Outside the Die


In the last 2 years…

What our industry delivered:

• 2x: Process technology

• 3x: Architecture and µarch

• 300x: Cluster scale-out

What ML needs:

• BERT → GPT-3

• 1800x more compute

Meeting the Need

Cluster Scale-out Dominated Performance Gains

Google TPUs in one of its data centers. Image credit: Google (cloud.google.com/tpu)

Page 8: Thinking Outside the Die


Last 2 years

2x: Process technology

3x: Architecture and µarch

300x: Cluster scale-out

The Answer is Just Scale-out, Right?

Scale-out is needed…

But not practical

And not very scalable

Next 2 years?

2x: Process technology

3x: Architecture and µarch

300x: Cluster scale-out

100k chip clusters?

Page 9: Thinking Outside the Die


Distribution complexity scales dramatically with cluster size

Massive models need massive memory, massive compute, and massive communication.

On giant clusters of small devices, all three become intertwined, distributed problems.

This forces inefficient, fine-grained partitioning and coordination of memory, compute, and communication across thousands of devices.

Limits of Existing Scale-out

Page 10: Thinking Outside the Die


Last 2 years

2x: Process technology

3x: Architecture and µarch

300x: Cluster scale-out

Next 2 years

2x: Process technology

3x: Architecture and µarch

1-10x: Cluster scale-out

Traditional Scale-out is Not Enough

1k-10k chip clusters

Path to 60x gains

But we need 1800x

Page 11: Thinking Outside the Die


We must find ways to:

• Improve process gains

• Improve architecture gains

• Improve scale-out

Imagine more balanced scaling:

• 2x → 20-100x: Process technology

• 3x → 30-150x: Architecture and µarch

• 300x: Scalable cluster scale-out

The ML Demand Challenge

Requires thinking outside the box with co-design

But is it possible?

Page 12: Thinking Outside the Die


Process: Amplifying Moore's Law

Page 13: Thinking Outside the Die


[Chart: die size (mm²) vs. year, 1960-2030, from 0 to 900 mm²: 4004, 8080, 80286, 80486, Pentium, G80, GTX480, P100, V100, A100]

Transistor density improvement continues!

Historically, amplified with bigger chips

But die sizes are not getting bigger

Because of challenges in…

1. Yield

2. Lithography

3. Power and cooling

Building Bigger Chips

Page 14: Thinking Outside the Die


Extending Beyond a Single Die in Industry

Largest traditional chip: a single reticle chip

Chiplets and MCM: 2-5x larger

Advanced interposer: 10-20x larger

Page 15: Thinking Outside the Die


Cerebras WSE-2 56x Larger than Traditional Chip

The Largest Chip Ever Built

46,225 mm2 silicon

2.6 Trillion transistors

850,000 AI optimized cores

40 Gigabytes on-chip memory

20 Petabyte/s memory bandwidth

220 Petabit/s fabric bandwidth

7nm Process technology at TSMC

Page 16: Thinking Outside the Die

• Impossible to yield a full wafer with zero defects

• Silicon and process defects are inevitable, even in a mature process

1. Solving Yield Challenges

[Diagram: wafer and die grid with defect locations marked]

Page 17: Thinking Outside the Die

Co-designed with chip architecture

Redundancy

• Uniform small core architecture enables redundancy to address yield at very low cost

• Design includes redundant cores and redundant fabric links

• Redundant cores replace defective cores

• Extra links reconnect the fabric to restore the logical 2D mesh (a toy repair sketch follows below)
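Below is a minimal sketch of the repair idea under assumed parameters (the spare-core counts and the repair policy are illustrative, not the actual WSE mechanism): defective cores are skipped and spares take their place, so software still sees a full logical mesh.

    # Toy repair of one row of the core array: skip defective physical cores,
    # assuming a few spare cores are fabricated at the end of each row.
    PHYS_COLS = 10      # physical cores per row (hypothetical)
    SPARE_COLS = 2      # spares per row (hypothetical)
    LOGICAL_COLS = PHYS_COLS - SPARE_COLS

    def repair_row(defective: set[int]) -> list[int]:
        """Return a logical-to-physical column map for one row."""
        usable = [c for c in range(PHYS_COLS) if c not in defective]
        if len(usable) < LOGICAL_COLS:
            raise ValueError("more defects than spare cores in this row")
        # Redundant fabric links "jump over" the skipped columns, so the
        # logical 2D mesh remains contiguous.
        return usable[:LOGICAL_COLS]

    print(repair_row({3, 7}))  # [0, 1, 2, 4, 5, 6, 8, 9]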

Page 18: Thinking Outside the Die

• The standard fabrication process requires dies to be independent

• A scribe line separates each die

• The scribe line is used as a mechanical barrier for die cutting and for test structures

2. Solving Lithography Limitations

Page 19: Thinking Outside the Die

• Add wires across the scribe line, in partnership with TSMC

• Extend the 2D mesh across die

• Same connectivity between cores and across scribe lines creates a homogeneous array

Cross-Die Wires

Co-designed with fab and chip architecture

Page 20: Thinking Outside the Die


Advantages of On-chip Fabric

• Same bandwidth on die as across die

• Short wires enable high bandwidth at low cost

• Spanning <1mm on silicon is highly efficient

• On-chip bandwidth is “easy”

• Shifts complexity to system and package

                  Reticle    Bandwidth              Power
                  mm²        TB/s    GB/s/mm²       pJ/bit    W
Nvidia A100       826        0.6     0.7            10        60
Cerebras WSE-2    525        3.2     6.1            0.15      4.8
Ratio                        5.3x    8.4x           66.6x     12.5x

Nvidia estimates based on PCIe 5.0 SERDES design in 7nm

Inter-reticle IO

Co-designed with package

Page 21: Thinking Outside the Die

Concentrated high density exceeds traditional power & cooling capabilities

• Power delivery: current density too high for power plane distribution in a PCB

• Heat removal: heat density too high for direct air cooling

3. Solving Power and Cooling

Page 22: Thinking Outside the Die

• Power delivery

• Current flows in the 3rd dimension, perpendicular to the wafer

• Heat removal

• Water carries heat from the wafer through a cold plate

Using the 3rd Dimension

Co-designed with system

Page 23: Thinking Outside the Die


The Co-designed System

Engine Block

Page 24: Thinking Outside the Die


The Co-designed System

Cerebras CS-2

Power Supplies

Heat Exchanger

Water Pumps

Page 25: Thinking Outside the Die

Page 26: Thinking Outside the Die

Built from the ground up to run the WSE-2

Single chip equivalent to 56x traditional chips

Combined with process shrinks…

Imagine more balanced scaling:

✅ 2x → 20-100x: Process technology

3x → 30-150x: Architecture and µarch

300x: Scalable cluster scale-out

Cerebras CS-2 System

Cluster scale performance on a single system

Page 27: Thinking Outside the Die

Architecture: Designing From the Ground Up for Neural Networks

Page 28: Thinking Outside the Die

Neural networks are expressed as GEMMs

• Leading to high-density GEMM datapaths

Not a path to significant gains because…

• Physical and power limits cap the number of FMACs

• Coarse granularity leads to fitting overheads

We need to change the rules

What if… we could run fewer FLOPS?

Driving More Compute Traditionally

Dense GEMM datapath

Page 29: Thinking Outside the Die

Neural Networks are Naturally Sparse

Natural sparsity arises from common ML techniques in neural networks

• e.g., nonlinearities that zero activations (see the sketch below)

• ReLU, Dropout, …

Natural sparsity also arises in the training backward pass, even when the forward pass is dense

• The backward pass accounts for 2/3 of compute

• Nonlinear derivatives that zero deltas

• Hard Sigmoid, Max Pool, …

[Diagram: dense network vs. sparse network]
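To make the first point concrete, here is a tiny NumPy sketch (illustrative numbers, not a benchmark): a ReLU zeroes roughly half of a zero-mean pre-activation tensor, and every one of those zeros is a wasted FLOP for a dense-only datapath.

    import numpy as np

    rng = np.random.default_rng(0)
    pre_act = rng.standard_normal((1024, 1024)).astype(np.float32)
    act = np.maximum(pre_act, 0.0)        # ReLU zeroes the negative half

    sparsity = np.mean(act == 0.0)
    print(f"activation sparsity after ReLU: {sparsity:.1%}")   # ~50%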

Page 30: Thinking Outside the Die

Neural Networks Can be Made Sparse

Technique                     Sparsity   FLOP reduction   Reference
Fixed Sparse Training         90%        8x               Lottery Ticket [MIT CSAIL]
Dynamic Sparse Training       80%        2x               Rig the Lottery [Google Brain, DeepMind]
Scaling-up Sparse Training    90%+       10x+             Pruning scaling laws [MIT CSAIL]
Monte Carlo DropConnect       50%        2x               DropConnect in Bayesian Nets [Nature]

ML Community has invented various sparsity techniques

Sparsity Research Shows 10x+ Opportunity

Page 31: Thinking Outside the Die


Existing GEMM architectures are dense-only

Fundamental design limitations:

1. Memory bandwidth limitation

2. Structured computation

What if we could solve these?

Architecture and ML Co-design for Sparsity

Page 32: Thinking Outside the Die


Traditional memory bandwidth is low

• Central shared memory is slow & far away

• Requires high data reuse and caching

Distributed memory has full bandwidth

• All memory is fully distributed with cores

• Cores get full performance without caching

1. Full Memory Bandwidth

Co-designed with system and ML

Page 33: Thinking Outside the Die

Full Performance on All BLAS Levels

Level           BLAS   Form               Memory intensity
Matrix-Matrix   GEMM   C ← αAB + βC       0.005 Byte/FLOP
Matrix-Vector   GEMV   y ← αAx + βy       1 Byte/FLOP
Vector-Vector   DOT    α ← (x, y)         2 Byte/FLOP
Vector-Scalar   AXPY   y ← αx + y         3 Byte/FLOP

[Figure: GPU vs. Cerebras operating points across this Byte/FLOP range: a bandwidth jump!]

Sparse GEMM is one AXPY per non-zero weight
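A minimal NumPy sketch of that statement (illustrative, not kernel code): each non-zero weight W[i, j] contributes one AXPY, scaling the activation row X[j, :] and accumulating into the output row Y[i, :], so zero weights cost nothing if they are skipped.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 128)).astype(np.float32)
    W[rng.random(W.shape) < 0.9] = 0.0        # ~90% unstructured sparsity
    X = rng.standard_normal((128, 32)).astype(np.float32)   # activations

    Y = np.zeros((64, 32), dtype=np.float32)
    for i, j in zip(*np.nonzero(W)):          # one AXPY per non-zero weight
        Y[i, :] += W[i, j] * X[j, :]

    assert np.allclose(Y, W @ X, atol=1e-4)   # matches the dense GEMM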

Page 34: Thinking Outside the Die


2. Architecture for Unstructured Sparsity

Dataflow scheduling in hardware

• Data and control received from fabric

• Triggers instruction lookup to set up tensor control state machine

• State machine schedules datapath cycles to consume fabric input

• Datapath output is written back to memory or fabric output

[Diagram: core dataflow trigger: data and control from the fabric trigger an instruction lookup (e.g., fmac [z]=[z],[w],a); the tensor control state machine drives the FMAC datapath with [z] and [w] operands from memory and registers, and results are written back to memory or the fabric output]

Page 35: Thinking Outside the Die


Dataflow sparsity harvesting

• Sender filters out sparse zero data

• Receiver skips unnecessary processing

Enabled by fine-grained execution datapaths

• Small cores with independent instructions

• Optimized for dynamic non-uniform work

Co-designed with ML and compiler/kernels

Core Designed for Sparsity

[Diagram: same core dataflow pipeline as on the previous slide]
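A toy sketch of the harvesting idea described above (not the hardware protocol; the function names are made up): the sender transmits only non-zero values with their indices, and the receiver does work only when data arrives.

    def send_nonzero(vector):
        """Sender side: filter out zeros before putting data on the fabric."""
        for idx, val in enumerate(vector):
            if val != 0.0:
                yield idx, val               # only non-zeros use bandwidth

    def receive_and_accumulate(stream, weights, acc):
        """Receiver side: each arrival triggers an FMAC against a weight."""
        for idx, val in stream:
            acc[idx] += weights[idx] * val   # work scales with non-zeros

    weights = [0.5, -1.0, 2.0, 0.25]
    acc = [0.0] * 4
    receive_and_accumulate(send_nonzero([0.0, 3.0, 0.0, -2.0]), weights, acc)
    print(acc)                               # [0.0, -3.0, 0.0, -0.5]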

Page 36: Thinking Outside the Die


The Wafer is the Matrix-Multiply array

• High-capacity local memory stores all activations across compute fabric

• Large compute core array receives sparse weight stream and triggers multiplies with activations

• Massive memory bandwidth feeds operands to the datapath at full performance

• High BW interconnect enables partial sum accumulation across wafer at full performance

• No matrix blocking or partitioning required

Sparse MatMul Kernel

[Diagram: Wafer Scale Engine core array; sparse weights stream onto the array, activations are laid out along feature and sequence dimensions, and partial sums accumulate around a ring]

Same flow supports dense and sparse

Page 37: Thinking Outside the Die


Sparsity reduces time-to-accuracy

• WSE runs AXPY at full performance

• Limited only by low fixed overheads

• Minimized by high bandwidth interconnect

• Reduced as networks grow larger

• Accelerates all unstructured sparsity

• Fully dynamic and fine-grained

• Even fully random patterns

Near-linear sparsity acceleration

[Chart: measured speedup vs. unstructured sparsity factor (% of zeros) on a GPT-3 layer (12k x 12k MatMul); the measured speedup closely tracks the linear ideal from 0% up to 90% sparsity (~10x)]

Demonstrated Unstructured Sparsity Speedup
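A simple analytical model of why the curve stays close to linear (our own sketch; the overhead value below is made up, not a measured number): with a small fixed overhead per layer, speedup at sparsity s is roughly 1 / (1 - s + overhead), which approaches the ideal 1 / (1 - s) as the useful work grows.

    def speedup(s, overhead=0.01):
        # fraction of original work remaining, plus a fixed overhead term
        return 1.0 / ((1.0 - s) + overhead)

    for s in (0.0, 0.5, 0.75, 0.9):
        print(f"{s:.0%} sparse -> ~{speedup(s):.1f}x (ideal {1 / (1 - s):.1f}x)")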

Page 38: Thinking Outside the Die


Unstructured sparsity acceleration enables new level of ML co-design

Sparsity opportunity likely increases with model size

Combined with other architecture improvements…

Imagine more balanced scaling:

✅ 2x → 20-100x: Process technology

✅ 3x → 30-150x: Architecture and µarch

300x: Scalable cluster scale-out

Beyond FLOPS

Enabling existing and novel sparse ML techniques

Page 39: Thinking Outside the Die

Scale-out: Inherently Scalable Clustering

Page 40: Thinking Outside the Die


Several existing scale-out techniques

Challenges to Existing Scale-out

Data Parallel: multiple samples at a time (Sample 1 … Sample N split across Device 1 and Device 2)

• Parameter memory limits

Pipelined Model Parallel: multiple layers at a time

• Communication overhead

• N² activation memory

Tensor Model Parallel: multiple splits at a time

• Communication overhead

• Complex partitioning

All share the same limitation: memory tied to compute

Page 41: Thinking Outside the Die


Thinking outside the device

Cluster Level Co-design

Page 42: Thinking Outside the Die


Scale model size and training speed independently

The Cluster is the ML Accelerator

Disaggregation of memory and compute

[Diagram: CS-2 compute connected to external model memory (weights, gradients) and to a training database (samples, labels); activations are kept local on the CS-2 while weights are streamed one layer at a time]

Page 43: Thinking Outside the Die


CS-2

850,000 compute cores in single chip

Page 44: Thinking Outside the Die

MemoryX Technology

[Diagram: a CS-2 paired with MemoryX]

Up to 120 trillion parameters on a single CS-2

Page 45: Thinking Outside the Die

SwarmX Interconnect Technology

[Diagram: MemoryX connected through the SwarmX interconnect to multiple CS-2s]

Near-linear performance scaling up to 192 CS-2s

Page 46: Thinking Outside the Die


Built for extreme-scale neural networks:

• Weights stored externally, off-wafer

• Weights streamed onto the wafer to compute each layer

• Only activations are resident on the wafer

• Execute one layer at a time

Decoupling the weight optimizer compute

• Gradients streamed out of the wafer

• Weight update occurs in MemoryX

Weight Streaming Execution Model

[Diagram: MemoryX (weight memory + optimizer compute) streams weights to the CS-2 and receives gradients back; a dataset server feeds training samples to the CS-2; a toy code sketch of this flow follows]
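The following is a runnable toy sketch of this execution model (illustrative only; the variable names and the two plain linear layers are ours, not the Cerebras software stack). The memory_x dict plays the role of MemoryX, holding weights and applying the optimizer off-wafer, while the "wafer" keeps only activations resident and sees one layer's weights at a time.

    import numpy as np

    rng = np.random.default_rng(0)
    memory_x = {0: rng.standard_normal((8, 16)) * 0.1,   # layer 0 weights
                1: rng.standard_normal((16, 4)) * 0.1}   # layer 1 weights
    x = rng.standard_normal((32, 8))                     # batch of samples
    y = rng.standard_normal((32, 4))                     # labels
    lr = 0.01

    # Forward pass: stream weights in layer by layer, keep activations local.
    acts = [x]
    for layer in (0, 1):
        w = memory_x[layer]                  # weights streamed onto the wafer
        acts.append(acts[-1] @ w)            # plain linear layers, for brevity

    # Backward pass: stream gradients out, apply the optimizer "in MemoryX".
    delta = 2.0 * (acts[-1] - y) / len(x)    # dLoss/dOutput for an MSE loss
    for layer in (1, 0):
        w = memory_x[layer]                  # streamed in again for backward
        grad_w = acts[layer].T @ delta       # gradient streamed off the wafer
        delta = delta @ w.T                  # delta for the earlier layer
        memory_x[layer] -= lr * grad_w       # weight update happens off-wafer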

Page 47: Thinking Outside the Die


• Efficient performance scaling by removing latency-sensitive communication

• Coarse-grained pipelining

• Forward/delta/gradient are fully pipelined

• All streams within an iteration have no inter-layer dependencies

• Fine-grained pipelining

• Overlapping of weight update and forward pass covers inter-iteration dependency

Solving the Latency Problem

[Timeline: iteration N backward pass overlapped with iteration N+1 forward pass. On the CS-2, per-layer gradients (…, L2, L1) are followed by per-layer forwards (L1, L2, …); in MemoryX, each layer's weight update and weight stream-out (L3, L2, L1) overlap with the CS-2's compute on other layers]

Page 48: Thinking Outside the Die


Purpose-built to support large neural network execution:

• 4TB – 2.4PB capacity

• 200 billion – 120 trillion weights with optimizer state

• DRAM and flash hybrid storage

• Internal compute for weight update/optimizer

• Handles intelligent pipelining to mask latency

Scalable to extreme model sizes

Capacity scaling independent from compute

MemoryX Technology
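As a back-of-the-envelope check of the capacity range above, assume roughly 20 bytes of weight plus optimizer state per parameter (our assumption; it is consistent with both endpoints quoted on this slide):

    BYTES_PER_PARAM = 20   # assumed weight + optimizer state per parameter
    for capacity_bytes, label in ((4e12, "4 TB"), (2.4e15, "2.4 PB")):
        params = capacity_bytes / BYTES_PER_PARAM
        print(f"{label}: ~{params / 1e12:.2f} trillion parameters")
    # 4 TB: ~0.20 trillion (200 billion); 2.4 PB: ~120 trillion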

Page 49: Thinking Outside the Die


• Data parallel training across CS-2s

• Weights are broadcast to all CS-2s

• Gradients are reduced on the way back

• Multi-system scaling with the same execution model as single system

• Same system architecture

• Same network execution flow

• Same software user interface

SwarmX Fabric Connects Multiple CS-2s

Scalable to extreme model sizes

Compute scaling independent from capacity

[Diagram: MemoryX (weight memory + optimizer compute) connected through the SwarmX fabric to multiple CS-2s; weights fan out to every CS-2 and gradients flow back]
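A toy sketch of that data-parallel flow (illustrative only; the stand-in loss and system count are ours): the same weights are broadcast to every CS-2, each system computes gradients on its own slice of the batch, and the fabric reduces the gradients on the way back to MemoryX, where the single weight update happens.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 4))                      # weights in MemoryX
    batches = [rng.standard_normal((16, 8)) for _ in range(4)]   # 4 CS-2s

    def local_gradient(w_bcast, x):
        """Stand-in for one CS-2's gradient on its local batch."""
        return x.T @ (x @ w_bcast) / len(x)   # grad of a simple quadratic loss

    grads = [local_gradient(w, x) for x in batches]   # broadcast w, compute
    reduced = np.sum(grads, axis=0)                   # SwarmX-style reduction
    w -= 0.01 * reduced                               # one update in MemoryX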

Page 50: Thinking Outside the Die


Weight sparsity induced in MemoryX

• Sparse weights streamed to all CS-2s

• Sparse gradients reduced on the way back

• Sparse weight updates on sparse matrix

No change to the weight streaming model

Same flow supports dense and sparse

Streaming Sparse Weights

[Diagram: same topology as the previous slide, with a sparsify step in MemoryX (optimizer + sparsity); sparse weights are broadcast through SwarmX, sparse gradients are reduced on the way back, and each CS-2 performs sparse compute]

Page 51: Thinking Outside the Die


Near-Linear Performance Scaling

[Chart: projected speedup vs. number of CS-2s (1 to 256, log scale), shown for NLP model sizes from 10B to 100T parameters; scaling is near-linear]

Model Size

Megatron-LM 8B

T5 11B

T-NLG 17B

GPT-3 175B

MT-NLG 530B

MSFT-1T 1T

Projections based on Scaling Laws for Neural Language Models [OpenAI]

Page 52: Thinking Outside the Die


Cluster level purpose-built memory, compute, interconnect

Disaggregation avoids traditional scale-out complexity

Imagine more balanced scaling:

✅ 2x → 20-100x: Process technology

✅ 3x → 30-150x: Architecture and µarch

✅ 300x: Scalable cluster scale-out

The Cluster is the ML Accelerator

Page 53: Thinking Outside the Die


Is it possible?

The Grand ML Demand Challenge

[Figure: the same "Memory and compute requirements" chart as before (total training compute in PFLOP-days vs. model memory requirement in GB, log-log scale). 2018: BERT Base (110M), BERT Large (340M). 2019: GPT-2 (1.5B), Megatron-LM (8B), T5 (11B). 2020+: T-NLG (17B), GPT-3 (175B), MT-NLG (530B), MSFT-1T (1T). A "???" marker asks what comes next]

Page 54: Thinking Outside the Die


Architecting the ML Accelerator of the Future

Thinking outside the die

Pushing beyond FLOPS

Cluster is the ML Accelerator

Changing the rules through co-design

Page 55: Thinking Outside the Die


[email protected]

Thank You