edcc14 keynote, newcastle 15may14


DESCRIPTION

Keynote at the European Dependable Computing Conference, Newcastle, 15may14.

TRANSCRIPT

Page 1: EDCC14 Keynote, Newcastle 15may14

Where Did All The Errors Go?

European Dependable Computing Conference http://conferences.ncl.ac.uk/edcc2014/

Newcastle, 13-16may14

Prof. Ian Phillips Principal Staff Engineer

ARM Ltd [email protected]

Visiting Prof. at ...

Contribution to Industry Award 2008

Opinions expressed are my own ...

Links to Pdf and SlideCast @ http://ianp24.blogspot.com

Page 2: EDCC14 Keynote, Newcastle 15may14

When we think of Computing we think of ... HPC and Mainframe ... maybe Desktop ... but not really Laptop or (Heaven forbid) Pocketable?

Page 3: EDCC14 Keynote, Newcastle 15may14

The Visible Face of Computing Today

Essential but not Vital ... All want Reliable

Page 4: EDCC14 Keynote, Newcastle 15may14

The Invisible Face of Computing Today

Unrecognised but Vital ... All need Dependable

Page 5: EDCC14 Keynote, Newcastle 15may14

So What is Computing ...

A mechanism for the algebraic manipulation of Data ...
... State (s) and Time (t) are usually factors in this.

It can include phenomena ranging from human thinking to calculations with a narrower meaning. (Wikipedia)

IN (x): Enumerated Phenomena => y = F(x, t, s) => OUT (y): Processed Data/Information

Usually used to animate analogies (models) of real-world situations
... frequently fast enough to be used as a stabilising factor in a loop (Real-time).

... Not prescriptive about the choice of Implementation Technology!
... Nor prescriptive about Programmability!
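A minimal sketch of the y = F(x, t, s) view of computing, written in Python. The class name `Integrator` and the sample values are illustrative assumptions, not taken from the talk; the point is only that the output depends on the input x, on time t, and on carried state s.

```python
from dataclasses import dataclass


@dataclass
class Integrator:
    """Hypothetical example of y = F(x, t, s): a running integral of x over time."""
    s: float = 0.0        # state carried between calls
    last_t: float = 0.0   # time of the previous sample

    def step(self, x: float, t: float) -> float:
        """Return y for input x at time t, updating the state s."""
        dt = t - self.last_t
        self.s += x * dt   # state accumulates the input over elapsed time
        self.last_t = t
        return self.s      # y depends on x, t and s


f = Integrator()
y1 = f.step(x=2.0, t=1.0)
y2 = f.step(x=2.0, t=3.0)   # same x, but different t and s, so a different y
```

The same input value gives different outputs on the two calls, which is why state and time appear alongside x in the definition.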

Page 6: EDCC14 Keynote, Newcastle 15may14

Hipparchos’s Antikythera - c87BC

Early-Mechanical Computation

Hipparchos c.190 BC – c.120 BC. Ancient Greek Astronomer, Philosopher and Mathematician

A Machine for Calculating Planetary Positions
Technology: Metal, Hand-Cut Gears, Analogue
Found in the Mediterranean in 1900 (It is believed there may have been tens of them)

Page 7: EDCC14 Keynote, Newcastle 15may14

Orrery c1700 ... Planet Motion Computer

Inventor: George Graham (1674-1751). English Clock-Maker. Single-Task, Continuous Time, Analogue Mechanical Computing (With backlash!)

Mechanical Technology

Page 8: EDCC14 Keynote, Newcastle 15may14

Babbage's Difference Engine - 1837 (Constructed 2000)

Late-Mechanical Computation

A Machine for Computing Polynomial Tables
Technology: Metal, Precision Gears, Digital (base 10)
Beyond the gear-cutting technology of the day

Page 9: EDCC14 Keynote, Newcastle 15may14

Amsler’s Planimeter - c1856

Mechanical Computation

Planimeter 2014 !

A Machine for Calculating the Area of an arbitrary 2D shape
Technology: Precision Mechanics, Analogue
Available today ... Electronically enhanced

Page 10: EDCC14 Keynote, Newcastle 15may14

Univ. of Manchester’s “Baby” - 1947 (Reconstruction)

Electronic Computation

General Purpose (Programmable) Computing Machine
Technology: Electronics (valves), Digital (base 2)
Available today ... Micro-Electronically enhanced (Mainframe <=> Laptop)

Page 11: EDCC14 Keynote, Newcastle 15may14

Electronic System (Cyber-physical System) - 2014

System-Level Computation

Image Input => Compute (Image Processing) => Data-File

Incorporating DIGIC5+ (ARM)
‘Classic’ Computer

Technologies: Digital Electronics; Software; Memory; Optics; Analogue Electronics; Sensors/Transducers; Mechanics; Micro-Motors; Displays; Discharge Tube; Robotic Assembly; Plastic, Metal, Glass

... Many Technologies working, seamlessly, to Enhance Human Memory

1: aka Cyber-Physical System (Geek-Talk!)

Page 12: EDCC14 Keynote, Newcastle 15may14

Computing Technologies in Business Context

Businesses have to be Competitive, Money Making Machines today ...

They sell things that Customers want to buy
Supporting the End-Customer's needs ... Who may be several ‘layers’ above their business.

Focus on their Core Competencies in a Globally Competitive Market
Avoid Commoditisation by Differentiation:
Cost and Quality (by improving Process) ..and..
Improved Business-Models (which make the Money) ..and..
New/Improved Technology (which is Expensive and/or Risky)

Product Development is a Cost (Risk) to be Minimised
Technology (HW, SW, Mechanics, Optics, Graphene, etc) just enables Options!
New-Technology may cost more (including risk) than it delivers in Product Value!
Over-Design costs ... Cannot afford the Precautionary Principle!

... Because successful End-Products fund their entire (RD&I) Value-Chains ...
Their Technologies will be an economic necessity in (all) lower volume markets!

Page 13: EDCC14 Keynote, Newcastle 15may14

Business Opportunities Drive Technology Developments ...

... And 21c Products are increasingly ‘Intelligent’

[Chart: Millions of Units against year, 1970 to 2030. 1st Era: Select work-tasks; 2nd Era: Broad-based computing for specific tasks; 3rd Era: Computing as part of our lives]

... Old Compute Markets remain; but are no-longer the Technology Drivers!

Page 14: EDCC14 Keynote, Newcastle 15may14

What Dependable Computing do we Expect?

“To be trusted to do or provide what is needed” (Merriam-Webster)

How often can ...

An Anti-lock Braking system be unavailable?
Your Mobile Phone crash/restart?
An Autopilot be unavailable?
... As often as it likes: As long as it is available when you need it!

The Power Grid crash/restart?
An Engine Management unit get stuck at Full Throttle?
A spurious Cash Transaction occur in your Bank Account?
... Never!

A PC crash before it is unusable?
A Weather forecast be incorrect before it matters?
... Surprisingly often: Humans are inclined to blame themselves.

... Dependability is Subjective; Application, User and Context dependent (Quality)

Page 15: EDCC14 Keynote, Newcastle 15may14

End-Products are about Function, not about Technology You can’t tell which bits are done in Hardware and which in Software? Hardware Module ? Software Module ?

Hardware + Software Module ?

... So where are the Dependability Vulnerabilities located?

Page 16: EDCC14 Keynote, Newcastle 15may14

Is Hardware (Logic) Dependable? 1/3

Boolean Mathematics is Dependable; but implementation depends on reliably mapping its equations to the physical world through Logic-Gates (For HW and SW!)

[Figure: CMOS NAND gate schematic with inputs A and B, supply +V and output OUT]

CMOS has been a reliable Boolean mapping for 30 years, but ...

Today’s 20nm transistors have larger variability, and there are more of them on a chip (Typically 500M in 2012)

At 70degC, Vtn=130mV (sigma ~25mV); around 1 in 5 million transistors have Vt<0 (Can’t be turned off)

That’s ~100 transistors/chip that don’t switch off
And another hundred that only turn on weakly (low drive/slow)
And they will always be randomly placed!

... So today’s chips shouldn’t work?
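The Vt<0 estimate on this slide can be sanity-checked with a short calculation. This sketch assumes that threshold voltage is normally distributed (an assumption, not stated in the talk) and uses the slide's figures of 130mV mean, 25mV sigma and 500M transistors. Under that assumption the one-sided tail is roughly 1 in 10 million, the same order of magnitude as the slide's round figure; the slide may rest on slightly different assumptions.

```python
import math


def normal_tail_below(threshold, mean, sigma):
    """P(X < threshold) for X ~ Normal(mean, sigma)."""
    z = (threshold - mean) / sigma
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


VT_MEAN = 0.130            # volts, from the slide
VT_SIGMA = 0.025           # volts, from the slide
TRANSISTORS = 500_000_000  # per chip, from the slide

# Probability that one transistor has Vt < 0 (Gaussian assumption).
p_stuck_on = normal_tail_below(0.0, VT_MEAN, VT_SIGMA)

# Expected number of such transistors on one chip.
expected_per_chip = TRANSISTORS * p_stuck_on

print(f"P(Vt < 0) = {p_stuck_on:.2e}")
print(f"expected per chip = {expected_per_chip:.0f}")
```

The result is very sensitive to sigma: a small change in the assumed spread moves the tail probability by orders of magnitude, which is part of the argument the talk goes on to make.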

Page 17: EDCC14 Keynote, Newcastle 15may14

Is Hardware (Logic) Dependable? 2/3

Mitigating this we have ...

Transistors: Not all ...
Are at 70 degC even if the die is (local variation)
Are Minimum Size ... Increasing ‘area’ reduces variability
Are on Critical Paths ... And ‘chains’ of gates perform closer to average!
Non-Functionality is (easily) Observable ... The effects can be very subtle.

CMOS Logic:
Is very robust and will continue to work with extreme transistors
Leaky Gates and Faster Transitions are not usually failure criteria
The chance of a second extreme transistor on a single Critical Path is of the order of <1:1,000,000

Memory:
Circuits are much more sensitive to Vt/gm variation ...
But spare rows/columns are part of SRAM designs and allow lots of defects to be ‘repaired’
AND >75% of typical SoC die area is memory, so ...
Most of the sensitive area has a repair strategy! ..and... The rest is inherently more robust!
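The spare rows/columns idea mentioned for SRAM can be sketched in software. This is a toy model under stated assumptions: the class `RepairableMemory` and its methods are hypothetical names, and real devices do the remapping in hardware (typically fixed at test time), not in a Python dictionary.

```python
class RepairableMemory:
    """Hypothetical memory array with spare rows that replace defective ones."""

    def __init__(self, rows, spare_rows):
        self.rows = rows
        # Spare rows sit physically after the normal rows.
        self.free_spares = list(range(rows, rows + spare_rows))
        self.remap = {}   # defective logical row -> spare physical row
        self.cells = {}   # physical row -> stored value

    def mark_defective(self, row):
        """Redirect a defective row to a spare; return False if none are left."""
        if row in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[row] = self.free_spares.pop(0)
        return True

    def _physical(self, row):
        if not 0 <= row < self.rows:
            raise IndexError(row)
        return self.remap.get(row, row)

    def write(self, row, value):
        self.cells[self._physical(row)] = value

    def read(self, row):
        return self.cells.get(self._physical(row))


mem = RepairableMemory(rows=8, spare_rows=2)
repaired = mem.mark_defective(3)   # row 3 found faulty during test
mem.write(3, "data")               # user still addresses row 3
```

The user of the memory never sees the defect; the repair budget (two spares here) bounds how many defects can be hidden this way.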

Page 18: EDCC14 Keynote, Newcastle 15may14

Is Hardware (Logic) Dependable? 3/3

But we haven't included ...
Internally and Externally generated synchronous supply noise? (Greater susceptibility at lower voltages)
High-energy particles? (Greater susceptibility at smaller geometries)
Wear-out (Vt/Gain drift)? (Greater susceptibility at smaller geometries)
Temperatures greater than 70degC (140degC is not uncommon)
Limitations of Verification and Test (Limited exploration of state-space)

We are repeatedly multiplying tiny-improbables by large-numbers ... And many of the values are only guesses!
We have no real idea about the reliability/dependability of modern Systems or Components

We only know that as process geometries shrink, Susceptibility will get worse ...
Chips will get ever more complex (and more chips will be used in more complex Systems)
Transistors will get smaller and Designers will erode safety margins to get performance

... Despite this, Chips and Systems do Yield today more than we would rightly expect ...
... So we must be utilising Unknown Safety Factors!
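The point about multiplying tiny improbabilities by large numbers can be made concrete. In this sketch the per-transistor failure probabilities are hypothetical guesses and failures are assumed independent; only the 500M transistor count comes from the talk. A tenfold error in the guess moves the answer from about 5% to about 39% to over 99%.

```python
import math


def p_any_failure(p_each, n):
    """Probability that at least one of n independent items fails."""
    # 1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n * math.log1p(-p_each))


N_TRANSISTORS = 500_000_000  # figure from the talk

# Hypothetical per-transistor failure probabilities, each a factor of 10 apart.
for p in (1e-10, 1e-9, 1e-8):
    print(f"p = {p:.0e}: P(at least one failure) = {p_any_failure(p, N_TRANSISTORS):.3f}")
```

When the inputs are only known to within an order of magnitude, the computed chip-level figure says very little, which is the slide's complaint.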

Page 19: EDCC14 Keynote, Newcastle 15may14

Is Software Dependable? 1/3

Demonstrating the limitations of achieving Quality through Test ...

All Software Crashes!
Software providers seldom guarantee the functionality of their product
Quality is tested-in; and improved by bug-fixes/patches in the field (To what level?)
So software Reuse offers improved Quality and Productivity (But over what?)

Residual Errors ... No code has zero residual errors!!
Well structured and tested Source-Code has ~5 errors per 1,000 lines of code (E-KLOC)
Commercial code is typically ~5x worse than this
No Useful Correlation between residual-errors and their system-impact severity
Only the Heuristic, that ‘most of them are harmless’.

Formal-Methods are better; but cost is high if you need a clean-sheet design.
Even Perfect-Software would have to work with an Imperfect-Platform
Don’t underestimate the Commercial Importance of TTM and Cost !!!
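The residual-error figures quoted on this slide scale as follows. The function name and the one-million-line code base are hypothetical; the densities (~5 per KLOC, and ~5x worse for commercial code) are the slide's own.

```python
def residual_errors(lines_of_code, errors_per_kloc):
    """Estimated residual errors for a code base of the given size."""
    return lines_of_code / 1000 * errors_per_kloc


WELL_TESTED_PER_KLOC = 5                        # from the slide
COMMERCIAL_PER_KLOC = 5 * WELL_TESTED_PER_KLOC  # "~5x worse", from the slide

code_base = 1_000_000  # hypothetical size in lines of code

print(residual_errors(code_base, WELL_TESTED_PER_KLOC))
print(residual_errors(code_base, COMMERCIAL_PER_KLOC))
```

Even at the better density, a code base of this size would carry thousands of residual errors, with no useful way to tell in advance which ones matter.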

Page 20: EDCC14 Keynote, Newcastle 15may14

Is Software Dependable? 2/3

Hardware and Software Design are indistinguishable ...

Hardware (Verilog Language)?

    // A master-slave type D-Flip Flop
    module flop (data, clock, clear, q, qb);
      input data, clock, clear;
      output q, qb;
      // primitive #delay instance-name (output, input1, input2, .....),
      nand #10 nd1 (a, data, clock, clear),
               nd2 (b, ndata, clock),
               nd4 (d, c, b, clear),
               nd5 (e, c, nclock),
               nd6 (f, d, nclock),
               nd8 (qb, q, f, clear);
      nand #9  nd3 (c, a, d),
               nd7 (q, e, qb);
      not  #10 inv1 (ndata, data),
               inv2 (nclock, clock);
    endmodule

Software (C Language)?

    #include <stdio.h>
    #include <time.h>
    /* Use the PC's timer to check processing time */
    int main(void)
    {
        clock_t time, deltime;
        long junk, i;
        float secs;
    LOOP:
        printf("input loop count: ");
        scanf("%ld", &junk);
        time = clock();
        for (i = 0; i < junk; i++)
            deltime = clock() - time;
        secs = (float) deltime / CLOCKS_PER_SEC;
        printf("for %ld loops, #tics = %ld %f\n", junk, (long) deltime, secs);
        goto LOOP;
        ...

Target Platform HW ----- & ----- SW

Target Architecture Info

Compilers HW ----------- SW

Configuration Files HW -------------- SW

Page 21: EDCC14 Keynote, Newcastle 15may14

Is Software Dependable? 3/3

Somebody will see the bugs! (The Open Source Delusion)

OpenSSL HeartBleed bug (1)
“It is now very clear that OpenSSL development could benefit from dedicated full-time, properly funded developers”
“OSF typically receives only $2,000 a year in donations”

Update was received just before a Public Holiday
Editor was a known and high-quality source
Code was reviewed informally and released

Editor was conflicted with day-job, family and holiday pressure (2)
Too few resources to do a proper job.

This was an E-KLOC error ... Not a Formatting error, nor a Functional error
It was a System error (an omission in a non-functional aspect of the code).

... Was the ‘fault’ with the software Source (OpenSSL Software Foundation (OSF)) ?
... Or a User Community too-ready to believe in the Quality of Open Source software?

1: http://www.wired.com/2014/04/heartbleedslesson/
2: http://veridicalsystems.com/blog/of-money-responsibility-and-pride/
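The kind of omission described here, a missing check that leaves the function working for well-formed input, can be illustrated with a simplified sketch. This is not OpenSSL's code: the function names, the buffer layout and the "SECRET-KEY" contents are all hypothetical. It only shows how an echo routine that trusts a claimed length behaves correctly in every functional test yet leaks adjacent data when the claim is false.

```python
def heartbeat_unchecked(buffer: bytes, payload_len: int, claimed_len: int) -> bytes:
    """Echo claimed_len bytes without comparing it to the real payload length."""
    return buffer[:claimed_len]   # may run past the payload


def heartbeat_checked(buffer: bytes, payload_len: int, claimed_len: int) -> bytes:
    """Same echo, but discard requests whose claimed length exceeds the payload."""
    if claimed_len > payload_len:
        return b""                # malformed request: send nothing back
    return buffer[:claimed_len]


payload = b"hat"
buffer = payload + b"SECRET-KEY"  # hypothetical adjacent memory contents

honest = heartbeat_unchecked(buffer, len(payload), 3)   # passes any functional test
leak = heartbeat_unchecked(buffer, len(payload), 10)    # returns data it should not
safe = heartbeat_checked(buffer, len(payload), 10)
```

Both versions return the same answer for honest requests, so tests of the intended function cannot tell them apart.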

Page 22: EDCC14 Keynote, Newcastle 15may14

Designing the Computing System ...
... is about creating a Model of Behaviour to meet Non-Functional Constraints

Create Functional-Model (1) on a ‘Generic’ Platform

Translate to Functional-Model on an ‘Optimal’ Platform

[Diagram labels: functions F1, F2, F3, F4, F5; ‘Optimal’ Platform with HW1, HW2, HW3, HW4, Hardware Interface, RTOS/Drivers, Thread, Bus(es), Processor(s) and functions (F1), (F2), (F3), (F5)]

1: This includes a Model of Execution such as a Java VM.

Page 23: EDCC14 Keynote, Newcastle 15may14

Typical 2014 Computing Platform ... ... is just 137.2 x 70.5 x 5.9 mm

Page 24: EDCC14 Keynote, Newcastle 15may14

Typical 2014 Computing Platform

Exynos 5422: Eight 32 bit CPUs (big.LITTLE):
• Four big (2.1GHz ARM A15) for heavy tasks;
• Four small (1.5GHz ARM A7) for lighter tasks.
+ Nine Mali GPU cores ...

... A ~30 Core Heterogeneous Multi-Processor ... In your Shirt Pocket!

... 21 significant ‘Chips’

Page 25: EDCC14 Keynote, Newcastle 15may14

2010: Apple’s A4 SIP Package (Cross-section)

IC Packaging Technology
The processor is the centre rectangle. The silver circles beneath it are solder balls.
Two rectangles above are RAM die, offset to make room for the wirebonds.
Putting the RAM close to the processor reduces latency, making RAM faster, and reduces power consumption ... But increases cost.

Memory: Unknown
Processor: Samsung/Apple (ARM Processor)
Packaging: Unknown (SIP Technology)

Source ... http://www.ifixit.com

[Figure labels: Processor SOC Die; 2 Memory Dies; Glue; Memory ‘Package’; 4-Layer Platform ‘Package’; Steve Jobs WWDC 2010]

Page 26: EDCC14 Keynote, Newcastle 15may14

2013: Samsung Solid-State Memory

Smart Memory Interface (eMMC)
16-128Gb in a single package: 8Gb/die, stacked 2-16 die/package
Handles errors in the bulk-data store
Package just 1.4mm thick! (11.5x13x1.4mm) ... Smaller than a postage stamp

Page 27: EDCC14 Keynote, Newcastle 15may14

2012: Nvidia’s Tegra 3 Processor Unit (Around 1B transistors)

NB: The Tegra 3 is similar to the Apple A4

Page 28: EDCC14 Keynote, Newcastle 15may14

Component and Sub-Systems from Global Enterprise ...
... Global Teams contributing Specialist Knowledge & Knowhow

Apple ID’d 159 Tier-1 Suppliers ... Thousands of Engineers Globally

Est. 10x Tier-2 Suppliers ... Including Virtual Components (1) and Sub-Systems (ARM and other IP Providers)

Multiple Technologies ...
Hardware, Software, Optics, Mechanics, Acoustics, RF, Plastics, etc
Manufacturing, Test, Qualification, etc.
Methods, Tools, Training, etc

Tens of thousands of Engineers Globally ...
More than 90% of Technology and Methods are Reused (productivity)!

1: Virtual Components do not appear on BOM

Page 29: EDCC14 Keynote, Newcastle 15may14

Moore’s Law: A Technology Opportunity...

[Chart: Approximate Process Geometry (10nm to 100um), Transistors/Chip (M) and Transistor/PM (K); ITRS’99]

http://en.wikipedia.org/wiki/Moore’s_law

Page 30: EDCC14 Keynote, Newcastle 15may14

Moore’s Law: An Increasing Design Problem...

[Chart: Approximate Process Geometry (10nm to 100um), Transistors/Chip (M) and Transistor/PM (K); ITRS’99]

http://en.wikipedia.org/wiki/Moore’s_law

Page 31: EDCC14 Keynote, Newcastle 15may14

Designer Productivity has become the Technology Driver

The Product Possibilities offered by utilising the Billions of Affordable and Aesthetically Encapsulate-able Transistors are Commercially Beguiling!

But the only way to utilise these possibilities in a reasonable time, with a reasonable team and at a reasonable cost, is huge amounts of Reuse of Design and Technology ...
Hardware, Software and other Technologies; Methods and Tools
In-Company: Sourced and Evolved from Predecessor Products
Ex-Companies: Sourced from businesses with lesser-known(?) Histories, but Specialist Knowledge
Reuse Improves Quality; as objects are designed more carefully, and bug-fixes are incremental
But this is a ‘trend towards zero-defects’, not a ‘zero-defects’ approach.

... Reuse Methods do seem to be good-enough for Commercial Applications!

... ‘Rigorous clean-sheet approaches’ will be orders of magnitude higher cost, so use of Commercial Techniques for Dependable Systems is inevitable!

... The Available Components and Sub-Systems are unreliable; “get over it!”

Page 32: EDCC14 Keynote, Newcastle 15may14

ARM: brings the Right Horse to the Right Course ...

... Delivering ~5x speed (Architecture + Process + Clock)

About 50MTr

About 50KTr

Page 33: EDCC14 Keynote, Newcastle 15may14

... Which means: 24 Processors in 6 Families ...

Page 34: EDCC14 Keynote, Newcastle 15may14

... CoreLink for Heterogeneous Multi-Processing ...

[Block diagram labels: four Quad Cortex-A15 clusters, each with L2 cache; ACE; CoreLink™ CCN-504 Cache Coherent Network; Snoop Filter; 8-16MB L3 cache; two CoreLink™ DMC-520 (x72 DDR4-3200); PHY; NIC-400 Network Interconnect; Flash; GPIO; USB; SATA; PCIe; 10-40GbE; DPI; Crypto; DSPs; AHB; Interrupt Control; IO Virtualisation with System MMU]

Heterogeneous processors – CPU, GPU, DSP and accelerators
Up to 4 cores per cluster
Up to 4 coherent clusters
Virtualized Interrupts
Integrated L3 cache
Uniform System memory
Up to 18 AMBA interfaces for I/O coherent accelerators and IO
Dual channel DDR3/4 x72
Peripheral address space

Page 35: EDCC14 Keynote, Newcastle 15may14

… Tools, Libraries and Partners to Realize the Opportunity

Technology to build Electronic System solutions:
Software, Drivers, OS-Ports, Tools, Utilities to create efficient systems with optimized software solutions
Diverse Physical Components, including CPU and GPU processors designed for specific tasks
Interconnect System IP delivering coherency and the quality of service required for lowest memory bandwidth
Optimised Cell-Libraries for highly optimized SoC implementations

Well Connected to Partners in the Life-Cycle:
For complementary tools and methods required by System Developers

Global Technology, Global Partners:
>900 Licences; Millions of Developers

Page 36: EDCC14 Keynote, Newcastle 15may14

Where Do All The Errors Go?

We Can’t Design it Right
HW is SW; and Coding errors remain. State-space too big for simulation exploration. Can’t model or explore whole Systems and they are too complex for Formal methods

We Can’t Make it Right
Chips are subject to Process Imperfections and Variability. Chips and Systems are subject to Verification and Test Escapes. Boolean math is absolute; logic cells are not

We Can’t Keep it Right
Chips are susceptible to Supply Transients, Wear-Out and High-Energy particles.

... And it all gets worse as processes shrink and complexity grows

... Yet we DO make Complex Electronic Systems that work!
... What is the explanation? (can we quantify it and use it?)
... Or are we just being Harbingers of an Ever-Threatening Doom ?

Page 37: EDCC14 Keynote, Newcastle 15may14

Facing the Unavoidable Truth

Dependable on Undependable ...

System-Level Dependability is what matters ...

Dependable Systems need to Reuse Components and Sub-Systems (Physical and Virtual) for Productivity; and the only affordable ones are of Commercial quality!
Clean-Sheet design is off-the-table for almost all complex products!
... the possible exception being the (diminishing) cost-no-object market!

The Only Place to implement System-Level Dependability is in the System ‘Layer’!

Dependability of Component and Sub-Systems may be enhanced, which will help with the System-Level task; but they cannot achieve System-Level Dependability by themselves!

... I believe this is the only viable Strategy for creation of Dependable Systems

Page 38: EDCC14 Keynote, Newcastle 15may14

Toolbox to help us “Get over It”...

The only universal interpretation of Fail-Safe is Fail-Functional!
Probably impossible for the General Case; but may be possible for Specific Critical Cases.

So the identification of Failure and the initiation of appropriate Response must be in the highest System-Layer; Above the Functional-Integration-Layer.
This can include the ‘zero-case’ (In the event that it is all non-critical)
Recognising the differing requirements for Failure Survival (All cases are not equal)

Components and Sub-Systems may have protection built in, to increase their Reliability
(How probable are they to fail? How many/What type of defects can be tolerated?)

We need a Toolbox (equivalent of ‘Spare Rows and Columns’) for the System-Level
Memory Chip providers build in Repair mechanisms to overcome process limitations
Memory Systems providers Overcome memory limitations by handling Files not Addresses.
Redundancy (Double/Triple) is a black-box implementation strategy for logic blocks
Defensive Programming is a technique for building checking into software ...
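Triple redundancy, one of the toolbox items listed above, can be sketched as a majority voter wrapped around three replicas of a block. The function names and the deliberately faulty replica are hypothetical, and the sketch assumes the replicas fail independently, which real designs have to work to ensure.

```python
from collections import Counter


def majority_vote(results):
    """Return the value most replicas agree on, or raise if there is no majority."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: replicas disagree")
    return value


def run_redundant(replicas, x):
    """Run every replica on the same input and vote on the outputs."""
    return majority_vote([replica(x) for replica in replicas])


def good(x):
    return x * x


def faulty(x):
    return x * x + 1   # hypothetical defective replica


result = run_redundant([good, faulty, good], 7)
```

The voter treats each replica as a black box, which is why the approach suits reused components whose internals cannot be changed. The voter itself then becomes the part that must be trusted.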

Page 39: EDCC14 Keynote, Newcastle 15may14

Conclusions

Systems are what End-Customers buy; they expect them to be Dependable Enough.
A subjective level which is Application, State and Context dependent.

Commercial Components and Sub-Systems (HW/SW) are the building blocks
Commercial use has given us the Technologies which we are economically bound to use
They work better than we would rightly expect, but we cannot quantify their quality
We can improve their Quality/Reliability/Dependability; but 100% is an asymptotic goal!

Dependable Systems must be based on Less-Dependable Components
So: System Dependability must be handled by the System-Level Software (Top-Level); only it can determine the expected action and appropriate corrective action for everything in its domain.
And: Because Dependability is Application and State Dependent, then it can only be handled by a Methodology ... Not every System state needs the same Dependability.
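One way to picture state-dependent handling at the top level is a policy table that maps the current system state and the failed component to a response. Everything in this sketch is a hypothetical illustration: the states, components and responses are invented, and the talk does not prescribe this structure.

```python
from enum import Enum


class Response(Enum):
    IGNORE = "ignore"                    # the 'zero-case': non-critical here
    RESTART = "restart"
    FAIL_FUNCTIONAL = "fail-functional"  # keep a degraded service running


# Hypothetical policy: (system state, failed component) -> response.
POLICY = {
    ("parked", "abs"): Response.IGNORE,
    ("driving", "abs"): Response.FAIL_FUNCTIONAL,
    ("driving", "infotainment"): Response.RESTART,
}


def respond(state, component):
    """Top-level decision; unlisted cases default to the most cautious response."""
    return POLICY.get((state, component), Response.FAIL_FUNCTIONAL)


print(respond("parked", "abs"))
print(respond("driving", "abs"))
```

The same component failure gets a different response in different states, and the decision lives in one place above the functional layers.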

... The Commercial Imperative won’t wait for the ‘right way’ ... before it produces systems that People Depend on!

Page 40: EDCC14 Keynote, Newcastle 15may14

The END ...