survival of vlsi design – coping with device variability...

57
Survival of VLSI design – coping with device variability and uncertainty Kevin Nowka Sr Mgr VLSI Systems IBM Austin Research Laboratory Acknowledgements: Sani Nassif, Anne Gattiker (IBM Austin Research) Chandu Visweswariah, David Frank (IBM Watson Research), Lars Liebmann, Dan Maynard (IBM Server & Technology Group)

Upload: others

Post on 09-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Survival of VLSI design – coping with device variability and uncertaintyKevin NowkaSr Mgr VLSI SystemsIBM Austin Research Laboratory

Acknowledgements:Sani Nassif, Anne Gattiker (IBM Austin Research)Chandu Visweswariah, David Frank (IBM Watson Research), Lars Liebmann, Dan Maynard (IBM Server & Technology Group)

Slide 2Texas A&M 23 Oct 2007

Motivation for overcoming variation (or at least coping)?

What is at stake? The VLSI economy

Using greaterthan 10k ofthese..

to make these..

to make these..

to make these..

Very Large Scale Integration is:

Slide 3Texas A&M 23 Oct 2007

The VLSI Economy

1900 1920 1940 1960 1980 2000 2020

MechanicalElectro-mechanical

Vacuum tubeDiscrete transistor

Integrated circuit

Year

1,000,000,000,000

1,000,000,000

1,000,000

1,000

1

0.001

0.000001

Com

puta

tions

/ se

c

after Kurzweil, 1999 & Moravec, 1998

What $1000 buys

VLSI era6-orders magnitude

Oct 1981IBM PC8088 CPU, 64K RAM, 160K floppy drivelist price $2,880.

Slide 4Texas A&M 23 Oct 2007

The Secrets to this Success

Resilient CMOS VLSI Devices & InterconnectSimple Design Processes

Physical Abstraction with small number of rulesSimple design and design migrationComposable designs

Functional AbstractionResulting predictable functional & timing behavior

Cell-based design, place & route, static timing

Scaled Lithography (and Manufacturing Process Improvements)

Lithography improvements and the application of Dennard Scaling Rules enabling Moore’s Law

Slide 5Texas A&M 23 Oct 2007

65nm technology and beyond

Is the VLSI Economy in jeopardy because of “variability?”

What is variability?What are the important sources of variability?What are the effects on VLSI design?How are fundamental design processes impacted?How can we cope?

Slide 6Texas A&M 23 Oct 2007

What is “variability”

Intending to build this….

And sometimes (or someplaces) getting this..

And sometime (or some places) getting this

Slide 7Texas A&M 23 Oct 2007

Variability and UncertaintyVariability: known quantitative relationship between design behavior (eg. current, delay, power, noise-margin, leakage, …) and a source

Relationship can be accurately modeled, simulated, and compensated.eg. Conductor thickness as function of interconnect density.

Uncertainty: sources unknown or model too difficult/costly to generate or simulate

must be “budgeted” with some type of worst case analysiseg. Vt as a function of dopant dose and placement

Lack of modeling resources often transforms variability to uncertainty.

eg: deterministic circuit switching activity factor

Slide 8Texas A&M 23 Oct 2007

Some Classes of VLSI VariabilityPhysical

Changes in characteristics of devices and wires (manufacturing & aging). Time scale: 109 sec (years).

FunctionalChanges in characteristics due to application cycles or workload changes. Time scale: 107 to 10−6 sec (execution time)

EnvironmentalChanges in supply voltage, temperature, local noise coupling. Time scale: 10−3 to 10−9 sec (clock tick).

InformationalLack of knowledge about design due to inadequate modeling. Time scale: ignorance cannot be measured in units of time.

Slide 9Texas A&M 23 Oct 2007

Lithography induced variabilitySubwavelength lithography

Using 193nm light to create <30nm features

Imperfect Process ControlCritical Dimensions are sensitive to:

focusdose (intensity and time)resist sensitivity (chemical variations)layer thicknesses

Intensity affected by interferencestrongly dependent on layer thicknesses.Anti-reflection coatings help

Errors in Alignment, Rotation and Magnification:Result in either global or local shape-dependent device variations.

29.5nm lines/spaces

Slide 10Texas A&M 23 Oct 2007

Mask Complexity Continues to Escalate

Exacerbated by increasing use of resolution enhancement techniques (RETs)

altPSM – Alternating phase shift maskSRAF – Sub-resolution assist feature

MBOPC – Model-based optical proximity correctionRBOPC – Rules-based optical proximity correction

0

10

20

30

40

250nm 180nm 130nm 90nm 65nm

Technology Node

Mas

k Le

vels

with

RET

altPSMSRAFsMBOPCRBOPC

Model-based OPC

Sub-resolutionassist

features

Design

Post-OPC

Wafer Image

Slide 11Texas A&M 23 Oct 2007

Lithography induced variability

Imperfect Process Control (cont’d)Pattern sensitivity.

Interference effects from neighboring shapes.

Predominantly in same planeSome buried feature interference for interconnect

[T. Brunner, ICP 2003]

Slide 12Texas A&M 23 Oct 2007

Line-edge roughness

Sources of line-edge variationFluctuations in the total dose due to finite number of quanta

Shot noiseFluctuations in the photon absorption positions

Nanoscale nonuniformities in the resist composition

With decreasing feature size, a larger percentage of Lpoly has LER randomness

Impact delay and leakage power

80 Ao

80 Ao

90nm

32nm

CD=90nm!9nm

8nm = 25%

Source: D. Frank, VLSI Tech 99

Significant gate length uncertainty

Slide 13Texas A&M 23 Oct 2007

Courtesy Anne Gattiker, IBM

Physical Variation Effects: Circuit Performance

11% slower than mean13% faster than mean

On the same die!

Slide 14Texas A&M 23 Oct 2007

Variation Effects: Not just ring oscillators….Real Microprocessors

Multicore design -- Core-0 was found to be ~15% slower than other parts.Models predict all parts of the design are identical.

Core-0

Core-1

Cache

Slide 15Texas A&M 23 Oct 2007

Random Dopant Fluctuation

Threshold Voltage is dependant upon the doping within a device channel area.

The number of dopant atoms in the depletion layer of a MOSFET has been scaling roughly as Leff1.5.Statistical variation in the number of dopants, N, varies as N1/2, causing increasing Vt uncertainty for small N.

Source: D. Frank, et al, VLSI Tech 99, D. Frank, H. Wong IWCE, May 2000]

>200mV Vt Shift

Slide 16Texas A&M 23 Oct 2007

Random Dopant Fluctuation Effect

Performance, power, and leakage variation

Source: K. Agarwal, VLSI 2006

>200mV Vt Shift: ~100x leakage

Slide 17Texas A&M 23 Oct 2007

NBTI and Hot-carrier-induced VariationNegative Bias Temperature Instability

At high negative bias and elevated temperature the pFETVt gradually shifts more and more negative (reducing the pFET current).

The mechanism is thought to be the breaking of hydrogen-silicon bonds at the Si/SiO2 interface, creating surface traps and injecting positive hydrogen-related species into the oxide.Associated with the average NBTI shift, there are also random shifts, which even for identical use conditions and devices, will cause mismatch shifts due to random variations in the number and spatial distribution of the charges/interface states formed.

There are also other charge trapping and hot-carrier defect generation mechanisms that cause long-term Vt shifts in both nFETs and pFETs. Long-term Vt shifts are parameter variations that must be accounted for in the design of circuits.

N. Rohrer, ISSCC 2006

Slide 18Texas A&M 23 Oct 2007

Gate Oxide Thickness FluctuationGate oxide variation

Exponential effect on gate tunneling currentsAffects device threshold, butsignificantly less important Vt variation factor than random-dopant fluctuation

0 1 2 3 4 5 6 7

x 10-8

0

1000

2000

3000

4000

5000

6000

7000

800010s0 Pull Down historgram

Current [A]

Bin

Cou

nt

1.1nm oxide is ~6 atomic layers.across a 300mm wafer (>109 atomic layers)

Slide 19Texas A&M 23 Oct 2007

Back-end Variability -- CMPChemical/Mechanical polishing

Introduces large systematic intra-layer interconnect thicknessAdditional inter-layer interconnect thickness effects as well

Copper

Oxide

Dishing Erosion

CMP Variation

Topography variation translated into focus variation for lines which

results in width variation

Slide 20Texas A&M 23 Oct 2007

Measured Variation: interconnect performanceNormalized metal resistance data over 3 months

1.0

3.0

1.5

2.0

2.5

Wafer means change over timeSome real outliers Source: Chandu Visweswariah,

C2S2 Robust Circuits Wkshp, 7/28/06

Slide 21Texas A&M 23 Oct 2007

Normalized single-level capacitance distribution

0

50

100

150

200

2501.

0

1.05

3

1.10

5

1.15

8

1.21

1

1.26

5

1.31

8

1.37

1

1.42

3

1.47

6

1.52

9

1.58

2

1.63

5

1.68

8

Mor

e

Variability is enormous! Source: Chandu Visweswariah,C2S2 Robust Circuits Wkshp, 7/28/06

Slide 22Texas A&M 23 Oct 2007

Functional Variation

Workload variability – utilization of design based on changing workload requirements

time

Processorutilization

Source: J. Fredrich, ACEED ‘07~5% to ~95%

Slide 23Texas A&M 23 Oct 2007

Environmental Variation – Supply voltage

Supply variation due to input variation (eg. battery lifecycle) and self-generated and coupled supply noiseSupply variation affects performance, power, reliability

Power supply droop mapSource: Sani Nassif, IBM

VD

D

Chip Y (mm) Chi

p X

Source: P. Restle, ICCAD06, IBM

>10% dynamic supply droop

Slide 24Texas A&M 23 Oct 2007

Environmental Variation – Thermal

Thermal variation due to ambient fluctuation and self-heatingThermal variation affects performance, reliability

Die thermal map

Source: Sani Nassif, IBMSource: J. Friedrich, ACEED 2007

~30C dynamic temperature variation

Slide 25Texas A&M 23 Oct 2007

45nm technology and beyond

Is the VLSI Economy in jeopardy because of “variability?”

What is variability?What are the important sources of variability?What are the effects on VLSI design?How are fundamental design processes impacted?How can we cope?

Slide 26Texas A&M 23 Oct 2007

Revisiting….the Secrets to Success

Resilient CMOS VLSI Devices & InterconnectSimple Design Processes

Physical Abstraction with small number of rulesSimple design concepts and design migrationComposable designs

Functional AbstractionResulting predictable functional & timing behavior

Cell-based design, place & route, static timing

Scaled Lithography (and Manufacturing Process Improvements)

Lithography improvements and the application of Dennard Scaling Rules enabling Moore’s Law

Slide 27Texas A&M 23 Oct 2007

Technology Resiliency

Defects were the major yield detractors for technology in the early days, yield and area were the major tradeoffs.

λ = 365nm

λ = 248nm

λ = 193nm

0.01

0.1

1

'86 '88 '90 '92 '94 '96 '98 '00 '02 '04 '06 '08 '10

ISQED ’03, Dan Maynard, “Productivity Optimization Techniques for the Proactive Semiconductor Manufacturer”

Slide 28Texas A&M 23 Oct 2007

The Resiliency Problemλ = 365nm

λ = 248nm

λ = 193nm

0.01

0.1

1

'86 '88 '90 '92 '94 '96 '98 '00 '02 '04 '06 '08 '10

Contact ResistanceContact Resistance

1Ω 1MΩ

CircuitOK

CircuitNot OK

100Ω

Distribution of“Good” contacts

Distribution of“Bad” contact

Near futuredistribution

Fails that looklike opens!

Other factors, like the environment, make the failure region fuzzy and broad!

With scaling, variability – both random and systematic – has emerged as a source of performance and yield loss.

This can be viewed as the merger of failure modes due to structural (topological), and parametric (variability) defects.In the very near future, we will have to deal with circuits where a non-trivial portion of the devices simply do not work!

Slide 29Texas A&M 23 Oct 2007

What has changed?Resiliency & redundancy cannot be ignored.

Need to start design assuming partial functionality!

Mead-Conway design is dead…Physical abstraction is broken – ground-rule explosion Physical abstraction is broken – composability in

jeopardyFunctional abstraction is broken – increasingly difficult

to treat these as “logic devices”Transistor performance determined by new features

and phenomena, â large variety in behaviors (not easily bounded).

Key Factor: Variability

Slide 30Texas A&M 23 Oct 2007

1980: Abstraction – the great enabler

With abundant performance, it became possible to abstract design to a few simple rules. Thus came the age of “chip computer science” and equality for all designers!

λ = 365nm

λ = 248nm

λ = 193nm

0.01

0.1

1

'86 '88 '90 '92 '94 '96 '98 '00 '02 '04 '06 '08 '10

Slide 31Texas A&M 23 Oct 2007

Physical Abstraction 2003: Abstract this!Technology has become so complex it is not well represented by “rules”.

Rules developed to deal with defects

Insufficient for capturing systematic, statistical variability relations

Maybe “migratable design” was just a dream after all…..

λ = 365nm

λ = 248nm

λ = 193nm

0.01

0.1

1

'86 '88 '90 '92 '94 '96 '98 '00 '02 '04 '06 '08 '10

100

150

200

250

300

350

500nm 350nm 250nm 180nm 130nm 90nm

Technology

# Design Rules

Slide 32Texas A&M 23 Oct 2007

What has changed?Resiliency & redundancy cannot be ignored.

Need to start design assuming partial functionality!Mead-Conway design is dead…

Physical abstraction is broken – ground-rule explosion Physical abstraction is broken – composability in jeopardyFunctional abstraction is broken – increasingly difficult to treat these as “logic devices”Transistor performance determined by new features and phenomena, â large variety in behaviors (not easily bounded).

Key Factor: Variability

Slide 33Texas A&M 23 Oct 2007

Litho and Physical Abstraction ca. 1990

1990

Before the advent of deep sub-wavelength lithography, the salient properties of a transistor were determined by geometries very local to the device itself!

λ = 365nm

λ = 248nm

λ = 193nm

0.01

0.1

1

'86 '88 '90 '92 '94 '96 '98 '00 '02 '04 '06 '08 '10

Slide 34Texas A&M 23 Oct 2007

Litho and Physical Abstraction ca. 2000

2000

As scaling required resolution enhancement and optical proximity correction, the number of shapes that determine the final outcome increased.

λ = 365nm

λ = 248nm

λ = 193nm

0.01

0.1

1

'86 '88 '90 '92 '94 '96 '98 '00 '02 '04 '06 '08 '10

Slide 35Texas A&M 23 Oct 2007

Litho and Physical Abstraction ca. 2010

Cell youplace heredetermines behavior of this cell

In the very near future, so much of what is around the device is needed that the notion of arbitrarily composable design is not valid any longer!

λ = 365nm

λ = 248nm

λ = 193nm

0.01

0.1

1

'86 '88 '90 '92 '94 '96 '98 '00 '02 '04 '06 '08 '10

Slide 36Texas A&M 23 Oct 2007

What has changed?Resiliency & redundancy cannot be ignored.

Need to start design assuming partial functionality!Mead-Conway design is dead…

Physical abstraction is broken – ground-rule explosion Physical abstraction is broken – composability in jeopardyFunctional abstraction is broken – increasingly difficult to treat these as “logic devices”Transistor performance determined by new features and phenomena, â large variety in behaviors (not easily bounded).

Key Factor: Variability

Slide 37Texas A&M 23 Oct 2007

Functional Abstraction: 0.25um

n+STI STIp

n+

TransistorSource

TransistorGate

TransistorDrain

Conventional Silicon Substrate

Electron Flow

Ig~0

Ids=0, Vg<VtIon, Vg>Vt

Resulting rather simple timing delay relations – static timingTout = Tin + delay(cell output load, interconnect, Vdd, Temp, Process).Modest number of corner analysis cases required (long-path – SS, hold-time -- FF, power corner, noise corner, reliability/electomigrationcorner)

Stupidity screens – functional verifications, slew-violations, x-talk…footnote: analog and array designers exempt from simple abstraction

It’s a switch! It turns on and turns off

Slide 38Texas A&M 23 Oct 2007

Functional Abstraction Broken:

Pow

er D

ensi

ty (W

/cm

2 )

0.010.110.001

0.01

0.1

1

10

100

1000

Gate Length (microns)

Active Power

Passive Power

1994 2005

Gate Length (microns)

Gate Leakage

-Vt variability increasing passive power contribution-Thin oxide and high supply driving gate leakage power

Source: E. Nowak, et al

Just when is it off?

Slide 39Texas A&M 23 Oct 2007

Functional Abstraction: 65nm

polyt Lw1~N

1~δV

Now add variability – like Vt shifts…

And everything is a distribution!

Slide 40Texas A&M 23 Oct 2007

What has changed?Resiliency & redundancy cannot be ignored.

Need to start design assuming partial functionality!Mead-Conway design is dead…

Physical abstraction is broken – ground-rule explosion Physical abstraction is broken – composability in jeopardyFunctional abstraction is broken – increasingly difficult to treat these as “logic devices”Transistor performance determined by new features and phenomena, â large variety in behaviors (not easily bounded).

Key Factor: Variability

Slide 41Texas A&M 23 Oct 2007

So now what?Back to days of the “Hero Designer?”Or Cope? -- fix the incomplete technology specification, modify the abstractions, validate the models, and change the design practices.Just how many Hero Designers are there in VLSI?

Slide 42Texas A&M 23 Oct 2007

Coping – part 1Know thine enemy: “You can fix what you can’t measure”

Build structures to measure variation effects and causes – density & pattern sensitivities, CAA, threshold variation, matching….Capture significant variation effects in modelsIn-situ variation sensing thru on-die monitor circuits – thermal, performance ROs, supply, aging, …

Slide 43Texas A&M 23 Oct 2007

Variations for Design Rule Exploration

canon canon_jogs dp1 pc1 pcc1

pclp_len pclp_w1a tip1 rx1 rxx1

4 versions 4 versions

3 versions,repeated 3 times 2 versions2 versions

repeated 4 times

canonical rx jogs 2X dummy pc space 1-sided pc corner 2-sided pc corner

pc landing pad length pc landing pad to rx active pc extends past dummy

1-sided rx corner 2-sided rx corner

Slide 44Texas A&M 23 Oct 2007

Example Results

d

49

50

51

52

53

1 2 3 4 5 6d

Frequency (MHz)

~4%

impact

(arbitrary units)

3X space

Slow

Fast

Experimental hardwareSchematically identicalLayout variation20% frequency variation

Slide 45Texas A&M 23 Oct 2007

Targeted structures -- IV spatial variationSpatial Leakage Distribution

DUT Array

Right LSSD bank

Row SenseCurrent Steering

Left LSSD bank

DUT Array

Bottom LSSD bank and column drivers

Top LSSD bank and column drivers

DUT Array

Right LSSD bank

Row SenseCurrent Steering

Left LSSD bank

DUT Array

Bottom LSSD bank and column drivers

Top LSSD bank and column drivers

~100k device array structure

Slide 46Texas A&M 23 Oct 2007

Use of In-situ ring-oscillator structures

12 ring oscillators distributed across the die.

Chip map with Ring Oscillator locations

1

2

3

4

5

6

7

8

9

10

11

12

Courtesy Anne Gattiker, IBM

Slide 47Texas A&M 23 Oct 2007

1 vs. 14 vs. 17 vs. 110 vs. 1

2 vs. 15 vs. 18 vs. 111 vs. 1

3 vs. 16 vs. 19 vs. 112 vs. 1

Measured Speed Variations

~200%

~50%

Slide 48Texas A&M 23 Oct 2007

Coping – part 2Fix the abstraction and the design process

Use modeled behavior to drive physical and functional abstraction

Incorporate sensitivities into physical abstraction – eg. Raise the level of physical abstraction for cells Incorporate sensitivities into timing abstraction – eg. Statistical Static Timing

Variation aware DA (placement, routing, buffer insert…)Recognize that rampant variability = defective

Test for the tails – At Speed Scan TestsCut out the tails – eg. SRAMs with Vt-induced Vmin issues should be mapped out with redundant row/columns With 80 cores can’t you just turn the worst one or two into decoupling capacitors?

Slide 49Texas A&M 23 Oct 2007

Statistical Static TimingPath-based SSTA

Conduct a nominal timing analysisSelect a representative set of critical pathsModel the delay of each path as a function of random variables (the underlying sources of variation)Predict the parametric yield curve, as well as generate diagnostics (integration of a feasible region in parameter space)

EinsStat (IBM tool) models all timing arcs and produces all timing results in the canonical 1st order form:

Slide 50Texas A&M 23 Oct 2007

Current Modeling EnvironmentLots of variability characterization dataNumerous variability modeling toolsLittle commonality!

chip powersupply variation model

Package electrical variation modelleakage estimate & modeling

Measurement of spatial variations of device performance

die temperature variation modeling

0 1 2 3 4 5 6 7

x 10-8

0

1000

2000

3000

4000

5000

6000

7000

800010s0 Pull Down historgram

Current [A]

Bin

Cou

nt

Slide 51Texas A&M 23 Oct 2007

Coping – part 3“Bob and weave” – Adapt design for variation

If it’s functional then adapt… to spatial/temporal variation

split/multiple suppliesbody biasDVFSthermal throttlingpower and performance efficiency-based job scheduling

Does variation-induced timing variation warrant fundamental shift from synchronous systems to inherent timing adaptation?

Is 2X die-to-die, 50% within die variation sufficient?If half of this is systematic and nullible, where do we spend our effort?

Slide 52Texas A&M 23 Oct 2007

FailOK Adapt Degrade

Adaptation Required vs. Variation Sigma

SigmaCircuit delay

exceedsspecification

Circuit delay beyond fixingby adaptation

Circuit doesnot invertany longer

time

outp

ut v

olta

ge

nominal

time

outp

ut v

olta

ge

nominal

time

outp

ut v

olta

ge

nominal

Performanceindistinguishable

from a “stuckat 1” fault!

A B C

Del

ay perf spec

Slide 53Texas A&M 23 Oct 2007

0

5

10

15

20

25

30 50 70 90 110

AB

CProcess σ point

Technology Trend For a Simple Buffer

Simplest possible circuit (if this fails, everything else will).Performed analysis for 90nm, 65nm and 45nm.Clear trend in sigma!

Nominal VDDDelay@150%

VDD@+15%Delay@150%

No edgepropagation

Slide 54Texas A&M 23 Oct 2007

0

1

2

3

4

5

6

7

8

9

30 50 70 90 110

Technology Trend for an SRAM

SRAM is known to be a more sensitive circuit… (lower σ).But, circuit optimized for each technology. (No redundancy included)

Much lower σ values + similar trend in sigma!

A

B

CProcess σ point

Nominal VCSDelay@150%

VCS@+15%Delay@150%

Cannot writeto cell

Mb Array Reqmts

Slide 55Texas A&M 23 Oct 2007

Impact of A/B/C Sigma on Chip DesignThe values of sigma determine:

Whether to build adaptation into the chipWhether to include redundancy in the chipThe size of “yieldable” components on the chip

Such activities are already routine in the design of SRAM.But such techniques are not well developed for standard logic design…Different technology sensitivities of SRAM vs. logic make the problem difficult

Sigma

SRA

M S

ize

No

redu

ndan

cy

2% re

dund

ancy

Sigma

# La

tch

Bits

NormalLatches

ResilientLatches

Slide 56Texas A&M 23 Oct 2007

Ultimate VisionGet to the point where site-specific hardware-derived models are ubiquitously available… Enable accurate model to hardware correlation and sophisticated design adaptation.

Parameter History

Performance History

Model toHardware Matching

Model toHardware Matching

Chip

Teststructure

Wafer0

102030405060

Lot 1 Lot 2 Lot 3 Lot 4 Lot 5 Lot 6

FmaxPowerYield

0.7

0.8

0.9

1

1.1

1.2

Lot 1 Lot 2 Lot 3 Lot 4 Lot 5 Lot 6

Target

VTHDLPSRO

SimulationParameters

& Tools

adapt

Slide 57Texas A&M 23 Oct 2007

Summary Trends and ChallengesTrends/Challenges

Variability increasing as Design/Manufacturing interface complexity rising.

More design rules, more 2nd order effects, more systematic variations, more correction steps…

Current techniques are insufficientAbstractions no longer good enoughPredictability is poor

Ability to confidently bound performance is degrading.Frequent model/hardware mismatch.

Required ActionBetter, targeted measurements through characterization structuresHardware-driven variation-enabled modeling

Corners not sufficient any more – statistical timing

Technology aware circuit and PD toolsVariation tolerance in design

Technology aware physical design, redundancy, adaptation.