
Page 1

Intelligence Processors

Simon Knowles, CTO

CWTEC 3rd October 2017

[Figure: DeepVoice2]

Page 2

Intelligence

“The capacity for rational inference, based on imperfect knowledge, adapted with experience”

Machines acting under uncertainty require intelligence:

• Talking with humans
• Moving among humans
• Learning to do human work

Page 3

A canonical intelligent agent is a sequence-to-sequence translator.

Learning is inference (of model structure and parameters).

Inference is stochastic optimization.

[Diagram: Intelligence Machine — a Noisy Inference Engine coupled to a Knowledge Model, with Sensor Data in and Actor Data out]
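To make “inference is stochastic optimization” concrete, a minimal sketch (mine, not from the talk): the parameters of a small model are inferred from noisy observations by stochastic gradient descent.

import numpy as np

# Minimal sketch (not from the talk): learning as inference of parameters,
# inference as stochastic optimization. Fit w of a linear model to noisy data
# by stochastic gradient descent on one randomly chosen sample per step.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(1000, 2))                 # "sensor data"
y = X @ true_w + 0.1 * rng.normal(size=1000)   # imperfect (noisy) knowledge

w = np.zeros(2)                                # model parameters to be inferred
for step in range(5000):
    i = rng.integers(len(X))                   # stochastic: pick one sample
    err = X[i] @ w - y[i]
    w -= 0.01 * 2 * err * X[i]                 # descend the squared-error gradient

print(w)                                       # converges towards true_w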

Page 4

Knowledge models and inference algorithms are naturally and jointly representable as graphs…

• Vertices embody code and state.

• Edges pass data.

• Graph structure is static.

[Figure: AlexNet graph]
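A toy rendering of that graph abstraction (my own sketch; it assumes nothing about Poplar's actual API): vertices carry code and private state, edges pass data, and the whole structure is fixed before anything runs.

# Toy static dataflow graph (illustrative only; not Poplar's API).
# Vertices embody code and state; edges pass data; structure is fixed up front.
class Vertex:
    def __init__(self, fn, state=None):
        self.fn, self.state, self.inputs = fn, state, []

    def run(self, values):
        ins = [values[v] for v in self.inputs]   # gather data from incoming edges
        return self.fn(self.state, ins)

class Graph:
    def __init__(self):
        self.vertices, self.edges = [], []

    def add_vertex(self, fn, state=None):
        v = Vertex(fn, state); self.vertices.append(v); return v

    def add_edge(self, src, dst):
        self.edges.append((src, dst)); dst.inputs.append(src)

    def run(self):
        values = {}                              # data carried on edges
        for v in self.vertices:                  # assumes vertices added in topological order
            values[v] = v.run(values)
        return values

g = Graph()
a = g.add_vertex(lambda s, ins: 3.0)                     # source vertex
b = g.add_vertex(lambda s, ins: s * ins[0], state=2.0)   # stateful scaling vertex
g.add_edge(a, b)
print(g.run()[b])                                        # 6.0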

Page 5

MI is a new (heavy) compute workload

• Computation on graphs: massive exposed parallelism.
• Sparse, high-dimensional data: representational layering, embedding dimension mismatch.
• Low precision compute: approximate inference on stochastically-learned models (see the sketch after this list).
• Structural invariance priors: convolution, recurrence.
• Static graph structure: allows a compiler to partition work, allocate memory, and schedule messages.
• Entropy: search required when cost gradients are not computable.
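On the low-precision point, a minimal sketch (mine, not Graphcore's arithmetic) of stochastic rounding, the usual trick for keeping low-precision accumulation unbiased when updates are smaller than one representable step:

import math, random

# Illustrative stochastic rounding (not Graphcore's implementation): round to a
# coarse grid, up or down with probability equal to the fractional remainder,
# so the *expected* value is exact even for updates far below one grid step.
def stochastic_round(x, step, rng):
    q = x / step
    lo = math.floor(q)
    return step * (lo + (rng.random() < q - lo))

rng = random.Random(0)
grid = 1.0 / 256            # pretend stored values have this precision
update = grid / 10          # each update is a tenth of a grid step

nearest, stochastic = 1.0, 1.0
for _ in range(1000):
    nearest = round((nearest + update) / grid) * grid     # nearest rounding: update is always lost
    stochastic = stochastic_round(stochastic + update, grid, rng)

print(nearest, stochastic)  # nearest stays at 1.0; stochastic ends near 1.39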

Page 6

Processors are designed for workloads

• CPU: scalar. Designed for office apps, evolved for web servers.
• GPU: vector. Designed for graphics, evolved for HPC.
• IPU: graph. Designed for intelligence.

Page 7

Don’t invent a new processor architecture without 20-year utility

• In 2005 SVMs (support vector machines) were hot

• In 2010 RFs (random forests) were hot

• In 2015 NNs (neural networks) were hot

• In 2020 ...?

Page 8

Massively parallel compute on static graphs favours a pure distributed machine

• Static partitioning of work and memory

• Local threads hide only local latencies (arithmetic, memory, branch)

• Many narrow-vector engines are more versatile than a few fat-vector engines

• Arbitrary communication patterns over a stateless “exchange”

Scalable general parallel computer

[Diagram: many processor (P) + memory (M) tiles, each with a router (R), all connected through a stateless exchange]

Page 9

What might limit the performance of an MI processor?

• Compute: rate of arithmetic
• Memory data bottleneck: bandwidth and latency for parameters and activations
• Memory address bottleneck: rate of scatter/gather addressing for sparse data
• Entropy: rate of generation of random numbers
• Power: ...

Page 10

Arithmetic power

• fpu16.32: 2 pJ/cycle, 250/mm²
• fpu32: 8 pJ/cycle, 60/mm²
• fpu64: 32 pJ/cycle, 15/mm²

50 W/GHz/cm² ⇒ ~33% active max @ 1.5 GHz for an 8 cm² die @ 200 W ⇒ dark silicon
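The three FPU lines are mutually consistent, and the dark-silicon fraction follows directly (my own working, not on the slide): 2 pJ × 250/mm², 8 pJ × 60/mm² and 32 pJ × 15/mm² each come to roughly 0.5 nJ per cycle per mm², i.e. about 50 W/GHz/cm². An 8 cm² die of such logic clocked at 1.5 GHz would draw 50 × 1.5 × 8 = 600 W fully active, so a 200 W budget lets only about a third of it switch at any one time.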

Page 11

Memory power

• DDR4 modules: 320 pJ/B, 256 GB, 64 GB/s @ 20 W
• HBM2 on interposer: 64 pJ/B, 16 GB, 900 GB/s @ 60 W
• SRAM on chip: 10 pJ/B, 256 MB, 6 TB/s @ 60 W
• SRAM on chip, distributed (1000 × 256 kB): 1 pJ/B, 60 TB/s @ 60 W

Memory power density is ~25% of logic power density.
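As a sanity check (my arithmetic; the slide's figures are rounded), energy per byte is just power divided by bandwidth: 20 W / 64 GB/s ≈ 310 pJ/B, 60 W / 900 GB/s ≈ 67 pJ/B, 60 W / 6 TB/s = 10 pJ/B, and 60 W / 60 TB/s = 1 pJ/B.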

Page 12

GPUs use dark silicon to serve multiple markets; IPUs use it to localize memory...

GPU plan:
• core area: 35% fpu + 10% ram (used for ML); 55% dark, used exclusively for other applications (graphics, HPC)
• power: 93% fpu, 7% ram

IPU plan:
• core area: 25% fpu, 75% ram
• power: 57% fpu, 43% ram

Page 13

Silicon efficiency is the full use of available power

• Keep memory local

• Serialise communication and compute

• Re-compute what you can’t remember

Page 14

Design budget vs. operating point

[Chart: concurrent compute and communication... versus serialized compute and communication..., shown against the design budget and the operating point]

Serializing compute and exchange lets each phase use the full power budget in turn, rather than the two sharing it concurrently.

Page 15

Bulk Synchronous Parallel

• Compute phase: stateful codelets execute on local memory state.
• Exchange phase: memory-to-memory data movement; no compute, no concurrency hazards.

[Diagram: an array of processor (P) + memory (M) tiles alternating between compute phases and a stateless exchange phase, separated by SYNC barriers]
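A toy rendering of one BSP superstep (my sketch, not Graphcore's runtime): every tile computes on its own local memory, a barrier follows, then data moves memory-to-memory with no compute and no concurrency hazards.

# Toy bulk-synchronous-parallel superstep (illustrative; not Graphcore's runtime).
def bsp_step(tiles, wiring, codelet):
    # Compute phase: each tile's codelet runs only on that tile's local state.
    outbox = {t: codelet(state) for t, state in tiles.items()}
    # SYNC: in this sequential simulation the barrier is implicit, since every
    # compute finishes before any exchange begins.
    # Exchange phase: pure memory-to-memory movement, no compute at all.
    for dst, src in wiring.items():
        tiles[dst]["inbox"] = outbox[src]
    return tiles

# Example: 4 tiles in a ring; each adds its last message to its value and
# passes value + 1 clockwise.
tiles = {i: {"value": i, "inbox": 0} for i in range(4)}
wiring = {i: (i - 1) % 4 for i in range(4)}          # tile i receives from tile i-1

def codelet(state):
    state["value"] += state["inbox"]                 # consume last superstep's message
    return state["value"] + 1                        # message to send this superstep

for _ in range(3):
    tiles = bsp_step(tiles, wiring, codelet)
print({i: t["value"] for i, t in tiles.items()})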

Page 16

Re-compute what you can’t remember

[Diagram: DNN tensors per layer (weights w_i, activations z_i, gradients g_i) flowing through the forward pass, backward pass, and weight update]

Keeping only 1-in-N activations trades most activation storage for one repeat of forward-pass compute.
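A framework-agnostic sketch of the 1-in-N idea (mine; simplified tanh layers, and it assumes the layer count divides evenly into segments): keep only every Nth activation on the forward pass, then recompute each segment from its checkpoint when the backward pass reaches it.

import numpy as np

# Illustrative 1-in-N activation checkpointing (not Graphcore's implementation).
# Forward: keep only every Nth activation. Backward: recompute the dropped
# activations of each segment from its checkpoint, then backprop through it.
def forward(layers, x, keep_every):
    checkpoints = {0: x}
    for i, w in enumerate(layers):
        x = np.tanh(x @ w)                       # layer: z_{i+1} = tanh(z_i @ w_i)
        if (i + 1) % keep_every == 0:
            checkpoints[i + 1] = x               # remember 1-in-N activations
    return x, checkpoints

def backward(layers, checkpoints, grad_out, keep_every):
    # Assumes len(layers) is a multiple of keep_every, so every segment start
    # lands on a stored checkpoint.
    grads = [None] * len(layers)
    for seg_end in range(len(layers), 0, -keep_every):
        seg_start = seg_end - keep_every
        # Repeat the forward pass for this segment from its checkpoint.
        acts = [checkpoints[seg_start]]
        for i in range(seg_start, seg_end):
            acts.append(np.tanh(acts[-1] @ layers[i]))
        # Ordinary backprop through the recomputed segment.
        g = grad_out
        for i in reversed(range(seg_start, seg_end)):
            pre = acts[i - seg_start] @ layers[i]
            g = g * (1 - np.tanh(pre) ** 2)          # through the tanh nonlinearity
            grads[i] = acts[i - seg_start].T @ g     # dL/dw_i
            g = g @ layers[i].T                      # gradient w.r.t. the previous activation
        grad_out = g
    return grads

rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(6)]
x = rng.normal(size=(4, 8))
out, ckpts = forward(layers, x, keep_every=3)        # stores 1-in-3 activations
grads = backward(layers, ckpts, np.ones_like(out), keep_every=3)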

Page 17

Graphcore’s “Colossus” (in honour of Tommy Flowers)

• Designed ground-up for MI, both training and deployment.

• Large 16 nm custom chip, clusterable, 2 per PCIe card.

• >1000 truly independent processors per chip; all-to-all non-blocking exchange.

• All model state remains on chip; no directly-attached DRAM.

• Mixed-precision floating-point stochastic arithmetic.

• DNN performance well beyond Volta and TPU2; efficient without large batches.

• Unprecedented flexibility for non-DNN models; thrives on sparsity.

• Program in TensorFlow (other frameworks to follow) or Poplar™ for close-to-metal work.

• Early-access cards and appliances end-2017.

Page 18

Thank you. [email protected]