HotChips 2019 Tutorial
Cloud TPU: Codesigning Architecture and Infrastructure
Clifford Chao, Brennan Saeta


Page 1: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU: Codesigning Architecture and Infrastructure
Clifford Chao, Brennan Saeta

Page 2: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

How to Codesign: Leverage expertise across disciplines and stakeholders for optimal results.

Page 3: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

But First... Why Codesign?

Gordon Moore: "No exponential is forever"

● Past: take general-purpose compute and shrink technology to improve performance
● Current reality: many devices not scaling well anymore. Running out of tricks!
● Solution: many gains left to be had through specializing compute to specific applications. Codesign!

Page 4: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Codesign: TPU Architecture

Page 5: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Brief Historical Perspective: TPU v1 (Google I/O 2016)

Google internally deployed in 2015 (not Cloud).

Inference workloads, including AlphaGo (Go matches against Lee Sedol).

ISCA paper in 2017, describing technical details.

Page 6: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Board (Google I/O 2017)

● For inference and training workloads
● 180 teraflops of computation, 64 GB of HBM memory, 2400 GB/s memory BW

Page 7: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Board

● 180 teraflops of computation, 64 GB of HBM memory, 2400 GB/s memory BW
● Architecture co-designed with ML research, systems, and data center teams.

TPU Codesign
● ML research: computational requirements for cutting-edge models.
● Systems: power delivery, board space.
● Data Center: cooling, buildability.

Page 8: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Host Connection

[Diagram: Host Server connected to the Cloud TPU board over 32 lanes of PCIe Gen3.]

Page 9: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Host Connection

[Diagram: Host Server connected over PCIe Gen3 x32.]

TPU Codesign
● ML research: input/output data feed rates.
● Systems: board layout.
● Data Center: wiring and serviceability.

Page 10: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Chip Architecture

[Chip diagram: two cores, each with a Scalar Unit, a Vector Unit, a Matrix Unit (MXU), and 8 GB of HBM.]

Page 11: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Chip Architecture

[Same chip diagram as Page 10.]

Page 12: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Chip Architecture

[Same chip diagram.]

22.5 TFLOPS per core
● 2 cores per chip
● 4 chips per board
● Scalar Unit
● Vector Unit
● Matrix Unit
● Mostly float32

Page 13: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Chip Architecture

[Same chip diagram.]

22.5 TFLOPS per core
● 2 cores per chip
● 4 chips per 180 TFLOP Cloud TPU
● Scalar Unit
● Vector Unit
● Matrix Unit
● Mostly float32

TPU Codesign
● ML research: types of operations to accelerate.
● Software: proper flexibility and programmability.
● Data center: performance metrics.
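(The headline number follows directly from the per-core figure: 22.5 TFLOPS per core × 2 cores per chip × 4 chips per board = 180 TFLOPS per Cloud TPU v2 device.)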

Page 14: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Chip Architecture

[Same chip diagram.]

Page 15: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Chip Architecture

Matrix Unit (MXU)
● 128 x 128 systolic array
● float32 accumulate
● bfloat16 multiplies

Page 16: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Matrix Unit Systolic Array

Computing Y = WX
W = 3x3 matrix, X = 3-elem vector, Y = 3-elem vector

[Diagram: the nine weights W11..W33 sit stationary in the array; the inputs X11..X33 stream through it one diagonal wavefront per clock; each cell multiplies its weight by the passing input and hands the running sum to its neighbour, so finished dot products accumulate toward the array edge.]

Pages 17-23 step the systolic-array animation forward one clock at a time (computing Y = WX with W = 3x3 and batch-size(X) = 3): each cycle the inputs advance one step through the array, each cell multiplies its stationary weight by the arriving input and adds the partial sum handed to it by its neighbour, and finished dot products begin to emerge, starting with Y11 = W11X11 + W12X12 + W13X13, then Y21, Y12, and so on.

Page 24: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Matrix Unit Systolic Array

Computing Y = WX with W = 3x3, batch-size(X) = 3.

Once the pipeline has drained, all nine outputs have been produced:
Y11 = W11X11 + W12X12 + W13X13   Y21 = W11X21 + W12X22 + W13X23   Y31 = W11X31 + W12X32 + W13X33
Y12 = W21X11 + W22X12 + W23X13   Y22 = W21X21 + W22X22 + W23X23   Y32 = W21X31 + W22X32 + W23X33
Y13 = W31X11 + W32X12 + W33X13   Y23 = W31X21 + W32X22 + W33X23   Y33 = W31X31 + W32X32 + W33X33

Page 25: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Matrix Unit Systolic Array

Computing Y = WX with W = 3x3, batch-size(X) = 3.

Very high throughput for matrix multiplications.
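To make the dataflow concrete, here is a toy NumPy simulation of a weight-stationary systolic array (a sketch only: the function name, orientation, and timing model are ours, chosen for clarity; the real MXU is a 128x128 hardware array, not a Python loop):

import numpy as np

def systolic_matmul(W, X):
    # Computes Y[b, i] = sum_j W[i, j] * X[b, j] (the slides' Y = WX over a batch),
    # i.e. Y = X @ W.T. Cell (r, c) holds the stationary weight W[c, r]; activations
    # stream through the rows, skewed by one cycle per row, and partial sums flow
    # down the columns, emerging from the bottom row as finished dot products.
    n = W.shape[0]
    batch = X.shape[0]
    down = np.zeros((n, n))                    # partial sums leaving each cell this cycle
    Y = np.zeros((batch, n))
    for t in range(batch + 2 * (n - 1)):       # enough cycles to drain the pipeline
        new_down = np.zeros((n, n))
        for r in range(n):
            for c in range(n):
                b = t - r - c                  # which batch element reaches cell (r, c) now
                if 0 <= b < batch:
                    acc_in = down[r - 1, c] if r > 0 else 0.0
                    new_down[r, c] = acc_in + W[c, r] * X[b, r]
                    if r == n - 1:             # bottom row: a finished dot product pops out
                        Y[b, c] = new_down[r, c]
        down = new_down
    return Y

W = np.arange(1.0, 10.0).reshape(3, 3)
X = np.arange(1.0, 10.0).reshape(3, 3)         # batch of three 3-element input vectors
assert np.allclose(systolic_matmul(W, X), X @ W.T)

Each input element is touched once per cell it passes through, so every operand fetched from memory feeds many multiply-accumulates; that reuse is what the energy discussion on the next pages is about.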

Page 26: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture


Systolic Arrays and Energy Efficiency

From Mark Horowitz’s ISSCC 2014 Keynote: “Computing’s Energy Problem (and what we can do about it)”.

Energy for control logic, SRAM, and register accesses needed by matrix multiply dominates in conventional processors.


Page 27: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture


Systolic Arrays and Energy Efficiency

This doesn’t even count DRAM energy cost, which is orders of magnitude larger (1300-2600pJ).

Now consider doing 16,384 operations per clock in a systolic array, which has high reuse, so only the arithmetic energy (the red portion of Horowitz's chart) scales up.

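(For scale: 128 × 128 = 16,384 multiply-accumulate cells in the MXU, and each operand read from memory is reused across an entire row or column of the array instead of being re-fetched for every operation.)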

Page 28: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Systolic Arrays and Energy Efficiency

For this figure, the DRAM energy cost is even bigger (1300-2600 pJ).

Now consider doing 16,384 or 65,536 operations per clock in a systolic array.

Each operand gets reused 128 or 256 times within the systolic array.

TPU Codesign
● ML research: latency and BW requirements.
● Software: controllability.
● Systems: thermal limits.

Page 29: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Numerics

Neural-network researchers showed that inference did not need float32 precision.
Vanhoucke, V., Senior, A. and Mao, M.Z., 2011. Improving the speed of neural networks on CPUs. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop (Vol. 1, p. 4).

TPUv1 deployed with int8 hardware and support for int16 through software. int16 intended as "insurance"; used mostly in LSTM-based models.

Training has traditionally been done in floating point.

Q: Does training require full float32 resolution?
A: Not everywhere. But where matters.

Page 30: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Numerics

float32: Single-precision IEEE floating point. 1 sign bit, 8 exponent bits, 23 mantissa (significand) bits. Range: ~1e-38 to ~3e38.

float16: Half-precision IEEE floating point. 1 sign bit, 5 exponent bits, 10 mantissa (significand) bits. Range: ~5.96e-8 to 65504.

bfloat16: Brain Floating Point format. 1 sign bit, 8 exponent bits, 7 mantissa (significand) bits. Range: ~1e-38 to ~3e38.
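A quick way to see the relationship: bfloat16 is essentially the top 16 bits of a float32, so it keeps float32's sign and 8-bit exponent (hence the same dynamic range) and gives up 16 of the 23 mantissa bits. A minimal NumPy sketch of that idea (truncation shown for simplicity; real conversions round, and the helper name is ours):

import numpy as np

def to_bfloat16_precision(x):
    # Keep only the top 16 bits of each float32: sign + 8 exponent + 7 mantissa bits.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

print(to_bfloat16_precision(3.0e30))           # ~2.99e+30: in range (would overflow float16's 65504)
print(to_bfloat16_precision([1.0001, 1.001]))  # [1. 1.]: nearby values collapse onto the 7-bit mantissa grid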

Page 31: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Numerics

[Same float32 / float16 / bfloat16 format comparison as Page 30.]

TPU Codesign
● ML research: trains without loss scaling, unlike float16 [Micikevicius 2017].
● ASIC: small mantissa reduces multiplier power, area.
● Software: same dynamic range as float32, same Inf/NaN behavior.

Page 32: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v3 (Google I/O 2018)

Page 33: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v3 Board

● 420 teraflops of computation, 128 GB of HBM memory
● Liquid cooling

Page 34: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v3 Chip Architecture

[Chip diagram: two cores, each with a Scalar Unit, a Vector Unit, two Matrix Units (MXUs), and 16 GB of HBM.]

Page 35: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v3 Chip Architecture

[Same chip diagram as Page 34.]

TPU Codesign
● ML Research: latest computational requirements.
● Systems: liquid cooling.
● Data Center: space provisioning, network requirements.

Page 36: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Datacenter-Scaled Tiled Architecture

Three ways to spend on computer architectural resources:
● Compute
● Memory
● Custom interconnect

Allows us to scale the system by connecting many boards together into a super-computing pod to work on a single workload.

Page 37: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Datacenter-Scaled Tiled Architecture

First, about the custom interconnect itself:
● 2-D torus network topology
● Custom protocol allows lower overhead than generic protocols (e.g. RoCE, Infiniband)
● Enables algorithms like parallel in-network reduction to a single node (see the toy sketch below)
● Router integrated into the TPU chip
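As a toy illustration of what in-network reduction buys, here is a Python sketch of a ring-style shift-and-add all-reduce (purely conceptual, with names of our choosing; the real interconnect is a 2-D torus with a custom protocol, and the reduction happens in hardware as data moves between chips):

def ring_allreduce(values):
    # Each node starts with one value. Every step, each node forwards the value it
    # most recently received to its ring neighbour and adds what arrives; after
    # n - 1 steps every node holds the full sum, with no central gather step.
    n = len(values)
    acc = list(values)   # running sum held at each node
    msg = list(values)   # value each node will forward next step
    for _ in range(n - 1):
        received = [msg[(i - 1) % n] for i in range(n)]   # one hop around the ring
        acc = [a + r for a, r in zip(acc, received)]
        msg = received
    return acc

print(ring_allreduce([1.0, 2.0, 3.0, 4.0]))   # [10.0, 10.0, 10.0, 10.0]

Gradient aggregation across cores (e.g. via the CrossShardOptimizer shown later in the talk) is this kind of cross-replica sum, run over the torus.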

Now, let’s use the interconnect to scale the system...


Page 38: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Single board

Cloud TPU v2 Board

Page 39: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v2 Pod

● 11.5 pflops
● 4 TB HBM
● 2D torus topology
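(Consistent with the board specs: 11.5 PFLOPS / 180 TFLOPS per board ≈ 64 boards, and 64 boards × 64 GB of HBM = 4 TB.)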

Page 40: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture


Page 41: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture


Page 42: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v3 Pod

● >100 pflops
● 32 TB HBM
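(Again consistent with the board specs: 32 TB / 128 GB of HBM per board = 256 boards, and 256 × 420 TFLOPS ≈ 107 PFLOPS.)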

Page 43: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU v3 Pod

TPU Codesign
● ML research: size and scope of models.
● Data Center: cooling, space and network provisioning, power delivery.
● Software: programmability for parallelism infrastructure.

Page 44: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Example Slicing of TPU Pod Hardware

Page 45: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU Sizes

https://cloud.google.com/tpu/docs/types-zones

Page 46: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture


Page 47: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Actual Models Map to TPU Architecture

Matrix Unit → High-FLOPS Ops
● Matrix multiplication (dense layers)
● Convolutions

Vector Units → Element-wise Ops
● Non-linearities
● Activations
● Dropout
● Pooling
● Parameter gradient updates

Scalar Unit
● TPU control
● Data flow
● Memory addressing (concats)
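As a toy illustration of that mapping (a hypothetical model of our own; in practice the op-to-unit placement is decided by the XLA compiler, not by the user):

import tensorflow as tf
l = tf.keras.layers

def toy_model(images):
    x = l.Conv2D(32, 3)(images)   # Matrix Unit: convolutions lower to high-FLOPS matmuls
    x = l.ReLU()(x)               # Vector Unit: element-wise non-linearity
    x = l.MaxPool2D()(x)          # Vector Unit: pooling
    x = l.Dropout(0.1)(x)         # Vector Unit: element-wise masking
    x = l.Flatten()(x)            # Scalar Unit / data movement: addressing and reshaping
    return l.Dense(10)(x)         # Matrix Unit: dense-layer matmul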

Page 48: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Linear Performance Scaling

Page 49: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

import tensorflow as tf                            # TF 1.x APIs, as used on the slide
from tensorflow.contrib.cluster_resolver import TPUClusterResolver

l = tf.keras.layers
softmax = tf.losses.sparse_softmax_cross_entropy

# model_fn: traced once, compiled by XLA, and run on the TPU.
def model_fn(images, labels, mode):
    conv = l.Conv2D(32, 5)(images)
    flat = l.Flatten()(conv)
    logits = l.Dense(10)(flat)
    loss = softmax(labels, logits)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.3)
    # Sums gradients across all TPU cores over the interconnect.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss)
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

# input_fn: the tf.data pipeline, which runs on the host CPU.
def input_fn(params):
    ds = tf.data.TFRecordDataset(path)             # path, parse_function, FLAGS defined elsewhere
    ds = ds.map(parse_function)
    ds = ds.cache().repeat()
    ds = ds.shuffle(buffer_size=50000)
    ds = ds.batch(params['batch_size'])
    return ds

cr = TPUClusterResolver()
rc = tf.contrib.tpu.RunConfig(cluster=cr, model_dir=FLAGS.model_dir)
est = tf.contrib.tpu.TPUEstimator(model_fn=model_fn, config=rc,
                                  use_tpu=True, train_batch_size=1024)
est.train(input_fn=input_fn)

model_fn runs on the TPU; input_fn runs on the CPU.

Page 50: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

(Same model_fn / input_fn code as Page 49.)

Zero code changes when scaling from a single device to a whole pod!

● ML research: influence abstractions.
● Compilers: JIT code-gen & parallel IRs.
● APIs: carefully constructed for distribution.

Page 51: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU System Architecture

[Diagram: a user VM running Python drives multiple Host + TPU pairs, reading training data from Cloud Storage.]

Page 52: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Cloud TPU System Architecture

[Diagram: a user VM running Python drives multiple Host + TPU Device pairs, reading training data from Cloud Storage.]

System Codesign
● Storage systems: high bandwidth disks, network, etc.
● Accelerators: infeed & asynchronous abstractions.
● Software: high throughput, software pipelining, horizontal scaling.

Page 53: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Streaming Data to Cloud TPUs at High Speed

Cloud Storage: a distributed file system optimized for extreme scale and low cost.

Compute Engine: stream data from Compute Engine VM(s) local file system or memory for maximum flexibility.

Cloud Bigtable: scalable, low-latency wide-row structured storage system.

Page 54: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Results

Page 55: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

MLPerf 0.6: TopLine* Comparison (mlperf.org)

* TopLine: fastest convergence time (regardless of resource consumption)

End-to-end time in seconds; number of chips in parentheses.

Benchmark       TPUv3          Tesla V100     Speedup
Resnet-50       (1024)    77   (1536)    80       4%
SSD             (1024)    72    (480)   134      86%
Mask-RCNN        (128)  2136    (192)  1107     -48%
GNMT             (512)   127    (384)   108     -15%
Transformer     (1024)    51    (480)    95      86%

Page 56: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Performance & Optimization Tips

Page 57: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Performance & Optimization Tips

On-device execution
● Leverage MXU size (see the sketch below)
● Avoid reshapes (XLA helps a lot!)
● Scale out for trivial data-parallel improvements (interconnect awesomeness!)
● More advanced parallelism available
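One rule of thumb that falls out of the 128x128 MXU: feature dimensions and per-core batch sizes that are multiples of 128 tend to keep the array full. A hedged sketch (the specific sizes are illustrative only; exact padding and tiling are decided by XLA, so consult the Cloud TPU performance guide for real guidance):

import tensorflow as tf

NUM_CORES = 8                                # one Cloud TPU device: 4 chips x 2 cores
PER_CORE_BATCH = 128                         # multiple of 128 keeps MXU tiles full
GLOBAL_BATCH = NUM_CORES * PER_CORE_BATCH    # 1024, as in the TPUEstimator example earlier

def mxu_friendly_block(x):
    # Hidden sizes chosen as multiples of 128 so matmuls map cleanly onto 128x128 tiles.
    x = tf.keras.layers.Dense(512, activation=tf.nn.relu)(x)
    return tf.keras.layers.Dense(256)(x)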

Page 58: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Performance & Optimization Tips

Input pipelines
● Many parallel, large, streaming reads
● Software pipelining critical (sketched below)
● Tradeoffs between storage, bandwidth, and compute to tune the input pipeline
  ○ Leverage CPU parallelism
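A hedged sketch of what those bullets look like as a tf.data pipeline (the helper name, file_pattern, parse_fn, and tuning constants are ours; exact APIs vary slightly across TensorFlow versions):

import tensorflow as tf

def make_input_fn(file_pattern, parse_fn):
    def input_fn(params):
        files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
        # Many parallel, large, streaming reads (e.g. from Cloud Storage).
        ds = files.interleave(
            tf.data.TFRecordDataset,
            cycle_length=16,
            num_parallel_calls=tf.data.experimental.AUTOTUNE)
        # Leverage host CPU parallelism for decode / augmentation.
        ds = ds.map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        ds = ds.shuffle(10000).repeat()
        ds = ds.batch(params['batch_size'], drop_remainder=True)
        # Software pipelining: prepare the next batch while the TPU works on the current one.
        return ds.prefetch(tf.data.experimental.AUTOTUNE)
    return input_fn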


Use the profiler!

Page 59: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Performance & Optimization Tips

From: https://cloud.google.com/tpu/docs/cloud-tpu-tools#profile_tab

Page 60: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Codesign Summary

1. Necessary due to the end of Moore's Law
2. Architect infrastructure for usability and scale!
3. Leverage all areas of expertise

Page 61: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

We already co-design:

1. Data centers
2. Power & cooling systems
3. Board topology
4. On-chip compute units
5. Accelerator-accelerator networks
6. Numerics
7. Memory & cache hierarchies
8. Compilers
9. ML frameworks & APIs
10. Datacenter networks
11. Data storage hardware & software
12. File formats
13. Accelerator-host interfaces & bandwidth
14. Cluster scheduling systems
15. ML algorithms
16. Model partitioning & parallelism research
17. Security

Page 62: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

What else can we co-design?

Page 63: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Co-design the programming language too

Check out Swift for TensorFlow at:
https://github.com/tensorflow/swift

Please join our open design reviews on Fridays! Sign up at: [email protected] (Google Group)

Page 64: HotChips 2019 Tutorial Cloud TPU: Codesigning Architecture

Thank you!