Why Simplicity Matters: A Hardware Perspective
by Andreas Olofsson, March 27, 2015, Erlang Factory (San Francisco)
Adapteva, Erlang Factory 2015
The Free Lunch is Over!
[Chart: trends in ILP, power (PWR), frequency (FREQ), and transistor count (XTORS)]
Communication
Robotics
IoT
Datacenters/HPC
Life without Mooooooore will be boring!
Chip Hardware Design 101
CMOS NAND GATE
SRAM STORAGE
Power ~= VDD^2 * F * CAP
The cost of HW may approach zero, but will never be zero....
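To make the power relation on this slide concrete, here is a minimal sketch that evaluates Power ~= VDD^2 * F * CAP at two operating points. The capacitance, voltages, and clock rates are hypothetical values picked only to show the scaling; they are not figures from the talk.

```python
def dynamic_power_watts(vdd_volts, freq_hz, cap_farads):
    """Approximate switching power: P ~= VDD^2 * F * CAP."""
    return vdd_volts ** 2 * freq_hz * cap_farads


# Hypothetical operating points, chosen only to illustrate the scaling.
CAP = 1e-9  # 1 nF of switched capacitance (assumed)
p_nominal = dynamic_power_watts(1.0, 1e9, CAP)   # 1.0 V at 1 GHz
p_scaled = dynamic_power_watts(0.5, 0.5e9, CAP)  # half the voltage, half the clock

print(f"nominal: {p_nominal:.3f} W")
print(f"scaled:  {p_scaled:.3f} W ({p_nominal / p_scaled:.0f}x lower)")
```

Because voltage enters squared, halving both VDD and F cuts power by about 8x while giving up only 2x in clock rate, which is one way to read the case for many slow, simple cores.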
A 5 minute introduction to Modern HW design
SYNTAX                  EXAMPLE                                                           MEANING
module ... endmodule    module flipflop (d,clk,q); input d, clk; output q; ... endmodule  Basic unit of hierarchy. Can be instantiated many times.
wire                    wire a;                                                           Declares a physical wire
reg                     reg [7:0];                                                        Declares a state variable
assign                  assign a = b & c;                                                 Continuous assignment
always @                always @ (clk) begin mystate <= 1; end                            Act on an event
if...else (and case())  if(a) out2=c; else out2=d;                                        Control flow
&,|,~,*,+,-, etc        assign a[31:0] = b[31:0] + c[31:0];                               Boolean operations
The Future of Computing
Constraint --> Result
1. Performance limits --> Massive parallelism
2. Amdahl's law --> New algorithms + languages
3. Thermal density --> Slow clocks (1MHz-1GHz)
4. Failure rate --> Distributed systems
5. IO --> Bandwidth limited
6. Density + cost --> 3D chip stacking
7. Energy Efficiency --> Heterogeneity
8. Productivity --> Multiple Languages
9. Development cost --> Collaboration/complexity
10. Latency --> Locality
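Item 2 in the list above cites Amdahl's law. As a rough illustration of why it pushes toward new algorithms, the sketch below evaluates the standard formula speedup = 1 / ((1 - p) + p / N). The 95% parallel fraction is an assumed example value, not a number from the talk.

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: upper bound on speedup when only part of a program scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


# Assumed workload: 95% of the runtime parallelizes perfectly.
for cores in (16, 64, 1024):
    print(f"{cores:5d} cores -> {amdahl_speedup(0.95, cores):.1f}x")
```

With a 5% serial portion the speedup can never reach 20x, however many cores are added, so the remaining gains have to come from shrinking the serial part.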
The Architecture of the Future?
[Diagram labels: Sequencer, FPU; SMP (coherent, shared); DISTRIBUTED; SIMD (lockstep)]
Epiphany Manycore Processor
Epiphany 64-core 2011 Processor
[Chip diagram with ELINK (LVDS) interfaces]
● 64 RISC cores
● 800MHz
● 100GFLOPS
● 2MB SRAM
● 1.6TB/s local memory BW
● 102GB/s bisection BW
● 7.2 GB/s IO
● 15x15mm BGA
● < 2 Watts (i.e. 50 GFLOPS/W)
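The headline numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes each core retires 2 floating-point operations per cycle (for example one multiply-add); that per-cycle figure is an assumption used to reproduce the ~100 GFLOPS number, since the slide does not state it.

```python
CORES = 64
CLOCK_HZ = 800e6
FLOPS_PER_CYCLE = 2   # assumption: one multiply-add per core per cycle
POWER_WATTS = 2.0     # slide quotes "< 2 Watts", so this is the upper bound

peak_gflops = CORES * CLOCK_HZ * FLOPS_PER_CYCLE / 1e9
gflops_per_watt = peak_gflops / POWER_WATTS

print(f"peak:       {peak_gflops:.1f} GFLOPS")
print(f"efficiency: {gflops_per_watt:.1f} GFLOPS/W at {POWER_WATTS:.0f} W")
```

Under that assumption the peak works out to roughly 100 GFLOPS and about 50 GFLOPS/W at the 2 W ceiling, in line with the bullets above.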
It's all about silicon efficiency...
Intel Haswell: 14.5 mm^2
Intel Atom: 5.6 mm^2
AMD Jaguar: 3.1 mm^2
ARM A15: 1.62 mm^2
ARM A7: 0.45 mm^2
Epiphany: 0.13 mm^2
100 Epiphany CPU cores fit in the space of one Intel Haswell CPU core!
There is Plenty of Room at the Bottom
“There is STILL plenty of room at the bottom”
Tianhe-2: 33 PFLOPS, $390M, 24MW. Insanity!
33 PFLOPS =~ 16 28nm Epiphany wafers
● Moving one electron at VDDmin:
● Emin = q*VDDmin/2 = q*(2*ln(2)*kT/q)/2 = kT*ln(2)
● At 300K, Emin = 0.29e-20 J
● Minimum sized CMOS inverter at 28nm:
● E = C*VDD^2 =~ 0.2e-15 J, 5 orders of magnitude larger!
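The energy floor quoted on this slide is easy to reproduce. The sketch below evaluates kT*ln(2) at 300 K using the SI value of the Boltzmann constant and compares it with the slide's 0.2e-15 J inverter figure; nothing here goes beyond the two numbers on the slide and the physical constant.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K (SI defined value)
TEMPERATURE_K = 300.0

e_min = K_BOLTZMANN * TEMPERATURE_K * math.log(2)  # kT*ln(2)
e_inverter = 0.2e-15                               # J, figure from the slide

ratio = e_inverter / e_min
print(f"Emin     = {e_min:.2e} J")
print(f"inverter = {e_inverter:.1e} J")
print(f"gap      = {ratio:.1e}x ({math.log10(ratio):.1f} orders of magnitude)")
```

The computed gap is a little under 10^5, which matches the slide's rounded "5 orders of magnitude".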
Yes...but does it work?
● 25x over GPU and CPU on 'bcrypt' (OpenWall, Russia)
● 25x over Intel Xeon on FFTs/DSP (Ericsson)
● 25x over Intel Xeon in HPC application by UK customer
● 85% of peak performance by students at ANU
The Parallella Project
● An open parallel computing platform
● Launched in 2012 at $99
● Open source SW/HW!
● Dual-core ARM A9 processor
● FPGA logic
● 1GB RAM, USB, HDMI, GigE
● 16/64 Epiphany coprocessors
● 50 Gbit/sec IO, 25/100 GFLOPS
● 10,000 shipped (20,000 built)
● 200 Universities
Some perspective...
● 1993 CM-5
● 1024 processors
● 136 GFLOPS/100KW
● #1 in 1993 Top500 List
● Price: ~$30M

● 2014 Parallella-64
● 66 processors
● 100 GFLOPS/5W
● #1 in energy efficiency
● Price: $199*
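To put the two machines on the same scale, the sketch below turns the slide's figures into GFLOPS per watt and dollars per GFLOPS. It uses only the numbers quoted above; the dictionary layout and field names are illustrative.

```python
# Figures as quoted on the slide (peak GFLOPS, watts, price in USD).
machines = {
    "CM-5 (1993)":          {"gflops": 136.0, "watts": 100_000.0, "usd": 30_000_000.0},
    "Parallella-64 (2014)": {"gflops": 100.0, "watts": 5.0,       "usd": 199.0},
}

for name, m in machines.items():
    m["gflops_per_watt"] = m["gflops"] / m["watts"]
    m["usd_per_gflops"] = m["usd"] / m["gflops"]
    print(f"{name}: {m['gflops_per_watt']:.5f} GFLOPS/W, "
          f"${m['usd_per_gflops']:,.2f} per GFLOPS")

cm5 = machines["CM-5 (1993)"]
p64 = machines["Parallella-64 (2014)"]
efficiency_gain = p64["gflops_per_watt"] / cm5["gflops_per_watt"]
print(f"energy efficiency gain: ~{efficiency_gain:,.0f}x")
```

On these figures the energy efficiency improves by roughly four orders of magnitude, and the cost per GFLOPS falls from about $220,000 to about $2.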
The Epiphany Roadmap
[Roadmap figure; timeline axis: 2013, 2015, 2016, 2018]
● 64 CPUs, 32KB/core, 100 GFLOPS, 0.1W-2W
● 1K CPUs, 64KB/core, 2 TFLOPS, 1W-40W
● 1024 CPUs, 64KB/core, 2 TFLOPS, 0.2W-10W
● 16K-64K CPUs, 1MB/core (3D), ~20 TFLOPS, 0.2W-20W
Roadmap anchored by our 28nm 64-core chip data
A 1024 core strawman processor
● 1024 CPU cores
● 1GHz operation
● 64 MB local memory
● 2 TFLOPS performance
● 32 TB/s local memory BW
● 1 Tera-messages/sec
● 2.5 Tbps IO
Programming a 1024 core processor
Faulty cores
Cooperating Program (messages)
NODE: 64KB, 1GHz RISC Core, 2 GFLOPS/core
Physical Constraints:
● 1.5ns/hop latency
● 10pJ / FLOP
● 30pJ / off chip read/write
● 10pJ / on chip end2end
(give or take....)
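The physical constraints above translate directly into budgets a programmer has to work within. The sketch below assumes the 1024 cores are laid out as a 32x32 2D mesh with messages routed along rows and columns; that layout is an assumption for illustration, while the per-hop latency, per-FLOP energy, and per-core GFLOPS come from the slide.

```python
HOP_LATENCY_NS = 1.5       # from the slide
ENERGY_PER_FLOP_PJ = 10.0  # from the slide
GFLOPS_PER_CORE = 2.0      # from the slide
MESH_SIDE = 32             # assumption: 1024 cores as a 32x32 2D mesh


def hops(src, dst):
    """Hop count between two (row, col) cores when routing along rows and columns."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


cores = MESH_SIDE ** 2
worst_hops = hops((0, 0), (MESH_SIDE - 1, MESH_SIDE - 1))
worst_latency_ns = worst_hops * HOP_LATENCY_NS

# Power if every core computes at peak and all energy goes into FLOPs.
compute_watts = cores * GFLOPS_PER_CORE * 1e9 * ENERGY_PER_FLOP_PJ * 1e-12

print(f"neighbour message:  {hops((0, 0), (0, 1)) * HOP_LATENCY_NS:.1f} ns")
print(f"corner to corner:   {worst_hops} hops, {worst_latency_ns:.0f} ns")
print(f"compute power:      {compute_watts:.1f} W at peak")
```

Under these assumptions a corner-to-corner message costs about 60x the latency of a nearest-neighbour one, and peak compute alone draws around 20 W, which is why locality and data movement dominate the programming model.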
The future of programming is HARD
● Minimize code size
● Minimize data movement and communication
● Minimize energy
● Minimize heat density
● Minimize failures
● Minimize congestion
....AND MINIMIZE EXECUTION TIME
Parallel Computing Needs Your Help!
● Parallel Standard Libraries:
– It's about time we have open libraries with parallelism built in
– “PAL” (github.com/parallella/pal)
● VMs:
– When will we have Erlang running on Epiphany?
● Spread the word...
– Parallel programming can be easy
– Show, don't tell.
● A Standard Language?
– We need a “C/JAVA/BASIC/PYTHON” of parallel computing?