TRANSCRIPT
Intelligence Processors
Simon Knowles, CTO
CWTEC 3rd October 2017
[Figure: DeepVoice2]
Intelligence
“The capacity for rational inference, based on imperfect knowledge, adapted with experience”
Machines acting under uncertainty require intelligence:
• Talking with humans
• Moving among humans
• Learning to do human work
A canonical intelligent agent is a sequence-to-sequence translator.
Learning is inference (of model structure and parameters).
Inference is stochastic optimization.
[Figure: an intelligence machine — a noisy inference engine coupled to a knowledge model, translating sensor data to actor data]
Knowledge models and inference algorithms are naturally and jointly representable as graphs…
• Vertices embody code and state.
• Edges pass data.
• Graph structure is static.
[Figure: AlexNet as a graph]
MI is a new (heavy) compute workload
Computation on graphs
• massive exposed parallelism.
Sparse, high-dimensional data
• representational layering, embedding dimension mismatch.
Low precision compute
• approximate inference on stochastically-learned models.
Structural invariance priors
• convolution, recurrence.
Static graph structure
• allows compiler to partition work, allocate memory, and schedule messages.
Entropy
• search required when cost gradients are not computable.
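The low-precision point above is usually paired with stochastic rounding, which keeps quantization error zero in expectation so that many small weight updates are not systematically lost. A minimal sketch (illustrative only, not the IPU's actual arithmetic; grid step and seed are arbitrary):

```python
import random

def stochastic_round(x, step=0.25):
    """Round x to a coarse grid, up or down with probability
    proportional to proximity, so the error is unbiased on average
    (round-to-nearest would always send 0.1 to 0.0)."""
    lo = (x // step) * step
    frac = (x - lo) / step          # position between grid points, in [0, 1)
    return lo + step if random.random() < frac else lo

random.seed(0)
vals = [stochastic_round(0.1) for _ in range(10_000)]
mean = sum(vals) / len(vals)
print(round(mean, 2))               # ≈ 0.1: the expected value survives quantization
```

Each individual result is still a coarse grid value (0.0 or 0.25 here); only the average is preserved, which is what matters when accumulating gradient updates.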
Processors are designed for workloads
• CPU — scalar: designed for office apps, evolved for web servers
• GPU — vector: designed for graphics, evolved for HPC
• IPU — graph: designed for intelligence
Don’t invent a new processor architecture without 20-year utility
• In 2005 SVMs were hot
• In 2010 RFs were hot
• In 2015 NNs were hot
• In 2020 ...?
Massively parallel compute on static graphs favours a pure distributed machine
• Static partitioning of work and memory
• Local threads hide only local latencies (arithmetic, memory, branch)
• Many narrow-vector engines are more versatile than a few fat-vector engines
• Arbitrary communication patterns over a stateless “exchange”
Scalable general parallel computer
[Figure: many P/M/R tiles communicating over a stateless exchange]
What might limit the performance of an MI processor?
Compute
• Rate of arithmetic
Memory data bottleneck
• Bandwidth and latency for parameters and activations
Memory address bottleneck
• Rate of scatter/gather addressing for sparse data
Entropy
• Rate of generation of random numbers
Power
• ...
Arithmetic Power
• fpu64: 32pJ/cycle, 15/mm²
• fpu32: 8pJ/cycle, 60/mm²
• fpu16.32: 2pJ/cycle, 250/mm²
At 50W/GHz/cm², an 8cm² die with a 200W budget can keep only ~33% of its logic active at 1.5GHz ⇒ dark silicon.
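The ~33% figure follows directly from the slide's numbers; a back-of-envelope check (values taken from the slide, nothing else assumed):

```python
# Fully-active logic dissipates POWER_DENSITY watts per GHz per cm².
POWER_DENSITY = 50.0   # W / GHz / cm²
DIE_AREA = 8.0         # cm²
CLOCK = 1.5            # GHz
BUDGET = 200.0         # W, package power budget

# Power if every transistor switched every cycle:
full_active_power = POWER_DENSITY * CLOCK * DIE_AREA   # 600 W
# Fraction of the logic that the budget can actually keep busy:
max_active_fraction = BUDGET / full_active_power
print(f"{max_active_fraction:.0%}")   # 33% — the rest must stay dark
```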
Memory Power
• DDR4 modules: 320pJ/B, 256GB, 64GB/s @ 20W
• HBM2 on interposer: 64pJ/B, 16GB, 900GB/s @ 60W
• SRAM on chip: 10pJ/B, 256MB, 6TB/s @ 60W
• Distributed SRAM on chip: 1pJ/B, 1000 × 256kB, 60TB/s @ 60W
Memory power density is ~25% of logic power density.
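Each memory row above is internally consistent: power is roughly energy-per-byte times bandwidth. A quick check using the slide's figures (1pJ/B = 1e-12 J/B):

```python
# (energy per byte in J, bandwidth in B/s) for each technology on the slide
tech = {
    "DDR4":             (320e-12,  64e9),
    "HBM2":             ( 64e-12, 900e9),
    "on-chip SRAM":     ( 10e-12,  6e12),
    "distributed SRAM": (  1e-12, 60e12),
}
for name, (epb, bw) in tech.items():
    # Power (W) = J/B × B/s
    print(f"{name}: {epb * bw:.0f} W")
# DDR4 comes out near 20 W and the rest near 58-60 W, matching the slide.
```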
GPUs use dark silicon to serve multiple markets; IPUs use it to localize memory...
GPU plan — core area: 35% fpu, 10% ram, 55% dark (used exclusively for other applications: graphics, HPC); power: 93% fpu, 7% ram
IPU plan — core area: 25% fpu, 75% ram (all used for ML); power: 57% fpu, 43% ram
Silicon efficiency is the full use of available power
• Keep memory local
• Serialise communication and compute
• Re-compute what you can’t remember
[Figure: concurrent vs serialized compute and communication, shown against the design power budget and operating point]
Bulk Synchronous Parallel
• Compute phase: stateful codelets execute on local memory state
• Exchange phase: memory-to-memory data movement, no compute, no concurrency hazards
[Figure: BSP execution — tiles of processors (P) with local memory (M) alternate between compute and exchange phases, separated by a SYNC across the stateless exchange]
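The BSP schedule can be sketched with ordinary threads and a barrier (a toy model only; on the IPU the compiler schedules these phases statically rather than synchronizing at run time):

```python
import threading

N_TILES = 4
N_STEPS = 3

local = [1.0] * N_TILES            # each tile's local memory
inbox = [0.0] * N_TILES            # exchange destination buffers
barrier = threading.Barrier(N_TILES)

def tile(i):
    for _ in range(N_STEPS):
        # Compute phase: stateful codelet runs on local memory only.
        local[i] = local[i] * 2.0
        barrier.wait()             # SYNC: all tiles finish compute
        # Exchange phase: pure data movement, no compute
        # (here, a ring shift to the next tile).
        inbox[(i + 1) % N_TILES] = local[i]
        barrier.wait()             # SYNC: all tiles finish exchange
        local[i] = inbox[i]

threads = [threading.Thread(target=tile, args=(i,)) for i in range(N_TILES)]
for t in threads: t.start()
for t in threads: t.join()
print(local)                       # each tile holds 8.0 after three doubling steps
```

Because every write to `inbox` happens strictly between the two barriers, the exchange phase has no concurrency hazards, which is exactly the property the slide is claiming.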
Re-compute what you can’t remember
[Figure: DNN tensors — forward pass produces activations z_i from weights w_i, backward pass produces gradients g_i, followed by the weight update]
Keeping only 1-in-N activations trades most activation storage for one repeat of forward pass compute.
Graphcore’s “Colossus” (in honour of Tommy Flowers)
• Designed ground-up for MI, both training and deployment.
• Large 16nm custom chip, cluster-able, 2 per PCIe card.
• >1000 truly independent processors per chip; all-to-all non-blocking exchange.
• All model state remains on chip; no directly-attached DRAM.
• Mixed-precision floating-point stochastic arithmetic.
• DNN performance well beyond Volta and TPU2; efficient without large batches.
• Unprecedented flexibility for non-DNN models; thrives on sparsity.
• Program in TensorFlow (other frameworks to follow) or Poplar™ for close-to-metal.
• Early access cards and appliances end-2017.
Thank you. [email protected]