TRANSCRIPT
Intelligence Processors
Simon Knowles, CTO
CWTEC 3rd October 2017
[Figure: DeepVoice2]
Intelligence
“The capacity for rational inference, based on imperfect knowledge, adapted with experience”
Machines acting under uncertainty require intelligence:
• Talking with humans
• Moving among humans
• Learning to do human work
A canonical intelligent agent is a sequence-to-sequence translator.
Learning is inference (of model structure and parameters).
Inference is stochastic optimization.
[Figure: an intelligence machine — a noisy inference engine coupled to a knowledge model, translating sensor data to actor data]
Knowledge models and inference algorithms are naturally and jointly representable as graphs…
• Vertices embody code and state.
• Edges pass data.
• Graph structure is static.
[Figure: AlexNet as a graph]
MI is a new (heavy) compute workload
Computation on graphs
• massive exposed parallelism.
Sparse, high-dimensional data
• representational layering, embedding dimension mismatch.
Low precision compute
• approximate inference on stochastically-learned models.
Structural invariance priors
• convolution, recurrence.
Static graph structure
• allows compiler to partition work, allocate memory, and schedule messages.
Entropy
• search required when cost gradients are not computable.
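The low-precision point above is usually paired with stochastic rounding, which keeps quantization error zero in expectation so that many small weight updates are not systematically lost. A minimal sketch (illustrative only, not the IPU's actual arithmetic; grid step and seed are arbitrary):

```python
import random

def stochastic_round(x, step=0.25):
    """Round x to a coarse grid, up or down with probability
    proportional to proximity, so the error is unbiased on average
    (round-to-nearest would always send 0.1 to 0.0)."""
    lo = (x // step) * step
    frac = (x - lo) / step          # position between grid points, in [0, 1)
    return lo + step if random.random() < frac else lo

random.seed(0)
vals = [stochastic_round(0.1) for _ in range(10_000)]
mean = sum(vals) / len(vals)
print(round(mean, 2))               # ≈ 0.1: the expected value survives quantization
```

Each individual result is still a coarse grid value (0.0 or 0.25 here); only the average is preserved, which is what matters when accumulating gradient updates.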
Processors are designed for workloads
• CPU — scalar: designed for office apps, evolved for web servers
• GPU — vector: designed for graphics, evolved for HPC
• IPU — graph: designed for intelligence
Don’t invent a new processor architecture without 20-year utility
• In 2005 SVMs were hot
• In 2010 RFs were hot
• In 2015 NNs were hot
• In 2020 ...?
Massively parallel compute on static graphs favours a pure distributed machine
• Static partitioning of work and memory
• Local threads hide only local latencies (arithmetic, memory, branch)
• Many narrow-vector engines are more versatile than a few fat-vector engines
• Arbitrary communication patterns over a stateless “exchange”
Scalable general parallel computer
[Figure: many P/M/R tiles communicating over a stateless exchange]
What might limit the performance of an MI processor?
Compute
• Rate of arithmetic
Memory data bottleneck
• Bandwidth and latency for parameters and activations
Memory address bottleneck
• Rate of scatter/gather addressing for sparse data
Entropy
• Rate of generation of random numbers
Power
• ...
Arithmetic Power
• fpu64: 32pJ/cycle, 15/mm²
• fpu32: 8pJ/cycle, 60/mm²
• fpu16.32: 2pJ/cycle, 250/mm²
At 50W/GHz/cm², an 8cm² die with a 200W budget can keep only ~33% of its logic active at 1.5GHz ⇒ dark silicon.
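The ~33% figure follows directly from the slide's numbers; a back-of-envelope check (values taken from the slide, nothing else assumed):

```python
# Fully-active logic dissipates POWER_DENSITY watts per GHz per cm².
POWER_DENSITY = 50.0   # W / GHz / cm²
DIE_AREA = 8.0         # cm²
CLOCK = 1.5            # GHz
BUDGET = 200.0         # W, package power budget

# Power if every transistor switched every cycle:
full_active_power = POWER_DENSITY * CLOCK * DIE_AREA   # 600 W
# Fraction of the logic that the budget can actually keep busy:
max_active_fraction = BUDGET / full_active_power
print(f"{max_active_fraction:.0%}")   # 33% — the rest must stay dark
```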
Memory Power
• DDR4 modules: 320pJ/B, 256GB, 64GB/s @ 20W
• HBM2 on interposer: 64pJ/B, 16GB, 900GB/s @ 60W
• SRAM on chip: 10pJ/B, 256MB, 6TB/s @ 60W
• Distributed SRAM on chip: 1pJ/B, 1000 × 256kB, 60TB/s @ 60W
Memory power density is ~25% of logic power density.
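Each memory row above is internally consistent: power is roughly energy-per-byte times bandwidth. A quick check using the slide's figures (1pJ/B = 1e-12 J/B):

```python
# (energy per byte in J, bandwidth in B/s) for each technology on the slide
tech = {
    "DDR4":             (320e-12,  64e9),
    "HBM2":             ( 64e-12, 900e9),
    "on-chip SRAM":     ( 10e-12,  6e12),
    "distributed SRAM": (  1e-12, 60e12),
}
for name, (epb, bw) in tech.items():
    # Power (W) = J/B × B/s
    print(f"{name}: {epb * bw:.0f} W")
# DDR4 comes out near 20 W and the rest near 58-60 W, matching the slide.
```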
GPUs use dark silicon to serve multiple markets; IPUs use it to localize memory...
GPU plan — core area: 35% fpu, 10% ram, 55% dark (used exclusively for other applications: graphics, HPC); power: 93% fpu, 7% ram
IPU plan — core area: 25% fpu, 75% ram (all used for ML); power: 57% fpu, 43% ram
Silicon efficiency is the full use of available power
• Keep memory local
• Serialise communication and compute
• Re-compute what you can’t remember
[Figure: concurrent vs serialized compute and communication, shown against the design power budget and operating point]
Bulk Synchronous Parallel
• Compute phase: stateful codelets execute on local memory state
• Exchange phase: memory-to-memory data movement, no compute, no concurrency hazards
[Figure: BSP execution — tiles of processors (P) with local memory (M) alternate between compute and exchange phases, separated by a SYNC across the stateless exchange]
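The BSP schedule can be sketched with ordinary threads and a barrier (a toy model only; on the IPU the compiler schedules these phases statically rather than synchronizing at run time):

```python
import threading

N_TILES = 4
N_STEPS = 3

local = [1.0] * N_TILES            # each tile's local memory
inbox = [0.0] * N_TILES            # exchange destination buffers
barrier = threading.Barrier(N_TILES)

def tile(i):
    for _ in range(N_STEPS):
        # Compute phase: stateful codelet runs on local memory only.
        local[i] = local[i] * 2.0
        barrier.wait()             # SYNC: all tiles finish compute
        # Exchange phase: pure data movement, no compute
        # (here, a ring shift to the next tile).
        inbox[(i + 1) % N_TILES] = local[i]
        barrier.wait()             # SYNC: all tiles finish exchange
        local[i] = inbox[i]

threads = [threading.Thread(target=tile, args=(i,)) for i in range(N_TILES)]
for t in threads: t.start()
for t in threads: t.join()
print(local)                       # each tile holds 8.0 after three doubling steps
```

Because every write to `inbox` happens strictly between the two barriers, the exchange phase has no concurrency hazards, which is exactly the property the slide is claiming.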
Re-compute what you can’t remember
[Figure: DNN tensors — forward pass produces activations z_i from weights w_i, backward pass produces gradients g_i, followed by the weight update]
Keeping only 1-in-N activations trades most activation storage for one repeat of forward pass compute.
Graphcore’s “Colossus” (in honour of Tommy Flowers)
• Designed ground-up for MI, both training and deployment.
• Large 16nm custom chip, cluster-able, 2 per PCIe card.
• >1000 truly independent processors per chip; all-to-all non-blocking exchange.
• All model state remains on chip; no directly-attached DRAM.
• Mixed-precision floating-point stochastic arithmetic.
• DNN performance well beyond Volta and TPU2; efficient without large batches.
• Unprecedented flexibility for non-DNN models; thrives on sparsity.
• Program in TensorFlow (other frameworks to follow) or Poplar™ for close-to-metal.
• Early access cards and appliances end-2017.
Thank you. [email protected]