
Random Stuff

Centaur Technology Inc.

G. Glenn Henry

Quick Background

Our Security Functions

Centaur Build Methodology

Physical Design Example

Quick Background

We’re Centaur Technology Inc. (Austin, TX)
  We design x86 processors
  Have been alive for 11 yrs, have shipped processors for 8.5
  We operate independently, but are owned by VIA
  We are a tiny group, but shipping millions of processors/yr

Our processors are software & bus compatible with Intel x86
But are unique vs. Intel & AMD (re design & target market):
  + lower cost (price)
  + lower power consumption
  + smaller chip footprint
  + unique integrated security features
  – generally, lower performance
This fits some rapidly growing “new” markets for x86

Parent company is VIA Technologies (Taiwan)
  They manufacture, market & sell our processor designs
  They develop all other PC platform chips (including chip sets for Intel & AMD processors), etc.

C5J (aka VIA Esther, VIA C7-M)

90nm IBM SOI technology
400 MHz–2.0 GHz
31.2 mm2
26.2 M transistors
First shipped 8/2005
Lowest power/MHz: 3.5W @ 1 GHz TDP, 20W @ 2 GHz TDP
64KB 4-way D-cache
64KB 4-way I-cache
128KB, 32-way exclusive L2
P4 instructions (incl SSE2 & SSE3)
P-M power mngt features+
P-M bus and new VIA “V4” bus (400-800 MHz)
+2-way SMP support
Exclusive security features
Unique nanoBGA package


[Figure: die size comparison. 90nm Intel Pentium M (Dothan): 84 mm2. 90nm VIA C7-M: 31 mm2 (our die cost).]

[Figure: C5J die, 6.9 mm. Labeled blocks: 128 KB 32-way L2; 64 KB 4-way L1-D; 64 KB 4-way L1-I; SSE 1, 2 & 3, MMX; x87 FP; ROM; Br pred; I-unit; Fetch, Decode & Translate; DCU; Bus & APIC; Security; PLLs etc; fuses.]

Our Security Strategy

Provide comprehensive set of data security functions
  …That are very secure
  …That are world’s fastest (for a single chip)
These goals require that the functions
  …Be integrated tightly into the processor core
    Processor silicon & implementation is fastest hdw
    Only hdw can be “trusted” (no viruses, etc.)
  …Require no operating system support/involvement
    Available via non-privileged x86 instructions
    Hardware must manage multi-tasking considerations
Available in all of our processors, for free
  We believe data security should be built into all processors
  It’s easy to do & small (effectively free)
  It’s our hobby

Our Security Implementation

[Chart: security functions by processor generation. Columns: Hardware RNG, Encryption, Secure Hash. Rows: C5XL (shipped 1/2003), C5P (shipped 1/2004), C5J (shipped 8/2005), CN (future).]

Hardware RNG
  Hardware RNG unit
  2 units
  (can also feed entropy to hardware SHA to get faster high quality)
Encryption
  Full AES (FIPS-197) standard in hdw, fastest in world!
  ECB, CBC, CFB, OFB modes in hdw
  + CBC/CFB-MAC modes, + CTR mode
  + unaligned support, + faster
  RSA hdw assist (Montgomery multiply)
  (faster/better using built-in hdw hash functions)
Secure Hash
  Full SHA-1 & -256 (FIPS-180-1) standard in hdw, fastest in world!
CN (future): xxx xxx

Centaur Hardware RNG

[Block diagram. 2 duplicate RNGs (A and B) in different physical areas (& rotated). Labels on each RNG: adj DC bias; asynch; clocked; whitener. Output path: select A, B, or both; 1-of-n bit selector; 32 byte hardware collection buffer, 1 byte per delivery; SSE store bus. x86 “store-rand” instruction: up to 8-byte delivery per store request, status in EAX.]

RNG “Typical” Performance

“Randomness” too hard to describe here, but here’s some basics…
Key requirements for “truly random” (per Schneier)
  Unbiased statistical distribution: determined by statistics
  Unpredictability: determined by modeling
  Unreproducibility: only hardware need apply
Many statistical tests defined & used (& argued about)
  Collections of many different statistical analyses
    FIPS-140-2: useless (4 tests, broken, 20,000 bit sample!)
    Diehard (18 tests): oriented to software RNGs, 10Mb sample
    NIST (16 tests): we think the best (much overlap with Diehard)
    Ent, etc.: everyone has one, everyone has their favorite
  Individual tests
    entropy: important & widely reported, but it’s not randomness
    chi2: heavily used, especially for huge samples, our favorite
    Maurer, etc.: everyone has their favorite
  Many different evaluation approaches
    threshold value, fixed ranges, probability analysis (p-value)
Much analysis & interpretation needed to make sense here
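
The two individual tests named above, per-byte entropy and chi2, can be sketched in a few lines of Python. This is only an illustration of the textbook definitions (Shannon entropy of the byte histogram, and the chi-square statistic against a uniform byte distribution); the function names are made up here, and this is not the testbed software mentioned in the talk.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (maximum 8.0)."""
    n = len(data)
    counts = Counter(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def chi_square(data: bytes) -> float:
    """Chi-square statistic of the byte histogram against a uniform distribution."""
    expected = len(data) / 256
    counts = Counter(data)
    return sum((counts.get(v, 0) - expected) ** 2 / expected for v in range(256))

# A repeating 0..255 ramp has perfect entropy and a perfect chi2 score,
# yet it is completely predictable.
ramp = bytes(range(256)) * 4
print(byte_entropy(ramp), chi_square(ramp))
```

The ramp example is one way to see why entropy “is not randomness”: histogram-based scores say nothing about unpredictability or unreproducibility.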

RNG “Typical” Performance

Performance & randomness varies by part; these are “typical”
We have done extensive analysis
  Many terabytes of data
  Massive sample sizes (terabyte)
  Hundreds of chips
  Our own testbed software
  Analysis & report by external group
    www.cryptography.com/research/evaluations.html
Here’s an embarrassingly simple summary…

                                             Randomness
Setting            Speed      Entropy      1 MB sample    Max sample size
                   (Mb/s)     (byte)       random? (1)    for random (2)
white8             1.7        7.9999+      Y              50 MB-10 GB
white4             3.4        7.999        Y–N            0-10 MB
raw                28–240     7.5-7.95     N              –
hashed raw (AES)   150–1,000  7.9999+      Y              1 TB up
 (3)

1. Passes standard test collections: FIPS, NIST, Diehard
2. “Good” chi2 results
3. Many variations: SHA, random seed size, etc.

Centaur AES Encryption Features

Full FIPS-197 implemented in hardware
  Encrypt & decrypt
  128b, 192b, & 256b keys
  128b data blocks
Multiple operating modes in hardware
  ECB, CBC, CFB, OFB
  CBC/CFB-MAC & CTR modes
Optional extended key generation in hardware
  For 128b key (both E & D) only
Various “experimentation” options supported
  Round count 1-16, intermediate round results, etc.
Accessed via new application-level x86 instructions
  No OS support needed
  Hardware provides inherent multitasking
US export licenses in place

Centaur AES Hardware

[Block diagram. Data enters from the SSE load bus in 16-byte blocks and leaves on the SSE store bus. Labels: input 0, input 1, key, ctrl; block startup + CBC, CFB, OFB, etc.; S-box; row-shift; column mix; key add; round fwd; blk-blk fwd; block finish + CBC, CFB, OFB, etc.; out 0, out 1; round key generation; round key; extended key RAM 16x16B; shared logic.]

Can pipeline 2 blks in ECB
0.3 mm2 total!
Everything runs at processor clock speed

Centaur AES Performance

AES instruction performance (approx.)
  128-bit key & block size; usual instruction timing assumptions = data in cache, no interrupts, aligned, key done, etc.
Approximate clocks w/ 128b extended keys already loaded
  ECB, 1 block: 17 clocks
  ECB, large block count: 11.8/blk
  CBC/CFB/etc, 1 block: 37
  CBC/etc, large block count: 22.5/blk
Additional extended key generation/load time (128b key)
  Hardware generated: 38
  Loaded from memory: 53
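
As a rough cross-check, the clocks-per-block figures can be turned into peak throughput. The sketch below assumes a 2.0 GHz clock (the top C5J speed quoted earlier) and the 128-bit block size; the function name is hypothetical.

```python
def throughput_gbps(clocks_per_block: float, clock_hz: float, block_bits: int = 128) -> float:
    """Peak throughput in Gb/s for a cipher that needs clocks_per_block per block."""
    return block_bits / clocks_per_block * clock_hz / 1e9

CLOCK_HZ = 2.0e9  # assumption: C5J running at its 2.0 GHz top speed

ecb = throughput_gbps(11.8, CLOCK_HZ)   # ECB, large block count
cbc = throughput_gbps(22.5, CLOCK_HZ)   # CBC/etc, large block count
print(f"ECB ~{ecb:.1f} Gb/s, CBC ~{cbc:.1f} Gb/s")
```

The ECB estimate comes out a little under 22 Gb/s, which is in line with the 21.5 Gb/s measured for 8 KB of data on the 2.0-GHz C5J.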

AES Performance

Measured Performance
  P4 = Gladman library AES, C5J = replaced routine with AES inst
  ECB mode (other modes slower, but same advantage over P4)
  Same memory size (512MB), same bus speeds (533 MHz)

data size    2.53-GHz P4    2.0-GHz C5J
8 KB         0.56 Gb/s      21.5 Gb/s
64 KB        0.56           19.5
1 Mb         0.56           5.45
10 MB        0.56           5.23
(larger sizes on the C5J are bus limited)

Another example: Gladman reports (his site) using his library (ECB)

data size    1.2-GHz C5P (earlier part)
16 Kb        15.2 Gb/s

C5J Montgomery Multiplier Features

Goal: speed up RSA’s modular exponentiation
  c = m^e mod n is dominated by repeated d = m x y mod(n) ops, where m, y, n are thousands of bits long!
This multiply is “always” done using the “Montgomery Multiply” algorithm
  Uses special number space to make d’ = a’ x b’ mod(m) much faster by eliminating divide
  But initial & result values must be transformed to/from Montgomery number space
  In real usage, the transformation overhead is relatively small
Our hardware directly performs “Montgomery Multiply”
  About as fast as an ordinary multiply!
  For up to 32Kb numbers!
New application-level x86 MontMul instruction
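
For readers unfamiliar with the algorithm, here is a minimal software sketch of Montgomery multiplication and its use in modular exponentiation. It uses Python big integers rather than 32-bit hardware words, and all names are illustrative; it shows the textbook algorithm, not the MontMul instruction's actual interface or microcode. It needs Python 3.8+ for `pow(n, -1, R)`.

```python
def montgomery_setup(n: int, word_bits: int = 32):
    """Return (R, n_prime) for an odd modulus n, where R = 2^k > n."""
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    k = -(-n.bit_length() // word_bits) * word_bits  # round up to whole words
    R = 1 << k
    n_prime = (-pow(n, -1, R)) % R                   # n * n_prime == -1 (mod R)
    return R, n_prime

def montmul(a: int, b: int, n: int, R: int, n_prime: int) -> int:
    """Montgomery product a*b*R^-1 mod n: multiplies, masks and shifts, no divide."""
    t = a * b
    u = ((t & (R - 1)) * n_prime) & (R - 1)          # makes t + u*n divisible by R
    r = (t + u * n) >> (R.bit_length() - 1)
    return r - n if r >= n else r

def modexp(m: int, e: int, n: int) -> int:
    """c = m^e mod n by square-and-multiply in Montgomery number space."""
    R, n_prime = montgomery_setup(n)
    x = (m * R) % n                                  # transform m into Montgomery space
    acc = R % n                                      # Montgomery form of 1
    for bit in bin(e)[2:]:
        acc = montmul(acc, acc, n, R, n_prime)
        if bit == "1":
            acc = montmul(acc, x, n, R, n_prime)
    return montmul(acc, 1, n, R, n_prime)            # transform result back

print(modexp(4, 13, 497) == pow(4, 13, 497))
```

Only the two transformations at the start and end of `modexp` involve a true modular reduction, which is why the transformation overhead is small for long exponents.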

Centaur Montgomery Multiplier

[Block diagram. Operands A[j], B[i], M[j], T[j] and U arrive from the SSE load bus in 16-byte blocks; results return on the SSE store bus. Two 32 x 32 multipliers feed 64-bit adders; the sum is split into Bits 31:0 and the Hi 33b (Bits 64:32), which feeds T[j-1]. Temp regs hold intermediate values.]

32b x 32b mod(32b) = 4 clks (2 clk pipelined)
Ucode sequences loads & stores
Usable with any size data (256 to 32Kb, 128b steps)
Hack of existing multipliers

Centaur MontMul Performance

Compared to GMP library
  Perform c = m^e mod n (m, e, n chosen randomly)
  An example (speeds vary slightly based on values)
  Note: this is most of RSA time, but not the whole thing
  Same hardware as for AES chart

mod size (bits)    2.53-GHz P4    2.0-GHz C5J
512                340 exp/s      1800 exp/s
1024               50             243
1536               15.6           78
2048               7.1            35

Centaur SHA Features

FIPS-180-1 completely implemented in hardware
  SHA-1 (160-bit result)
  SHA-256 (256-bit result)
Instruction timing
  SHA-1: 251 clks
  SHA-256: 262
  where n is the number of 64B blocks to be compressed
Measured performance (Gb/s)
  Same hardware as for AES chart, GPL SHA SW (Devine)

               2.53-GHz P4         2.0-GHz C5J
data size      SHA-1    SHA-256    SHA-1    SHA-256
10 B           0.07     0.04       0.38     0.35
100 B          0.43     0.24       2.41     2.24
1,000 B        0.59     0.33       3.81     3.60
1,000,000 B    0.62     0.34       2.97     2.97   (bus limited)
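
The instruction timings can be related to the measured numbers with a small calculation. The sketch below assumes the quoted clock counts are per 64-byte block (the slide's formula did not survive intact), a 2.0 GHz clock, and standard SHA padding (one 0x80 byte plus an 8-byte length); the function names are hypothetical.

```python
def padded_blocks(message_bytes: int) -> int:
    """Number of 64-byte blocks after SHA-1/SHA-256 padding (0x80 byte + 8-byte length)."""
    return (message_bytes + 9 + 63) // 64

def sha_gbps(message_bytes: int, clocks_per_block: float, clock_hz: float = 2.0e9) -> float:
    """Estimated message throughput in Gb/s, ignoring instruction and memory overhead."""
    clocks = padded_blocks(message_bytes) * clocks_per_block
    return message_bytes * 8 / clocks * clock_hz / 1e9

# Assumed per-block costs: SHA-1 251 clocks, SHA-256 262 clocks.
for size in (10, 100, 1000):
    print(size, round(sha_gbps(size, 251), 2), round(sha_gbps(size, 262), 2))
```

Under these assumptions the 1,000 B estimates land just under 4 Gb/s, slightly above the measured 3.81 and 3.60 Gb/s, which is consistent with the measurements including some fixed overhead.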

C5J SHA Hardware

[Block diagram. Data arrives from the SSE load bus (next 64b data) and the digest returns on the SSE store bus. Labels: data scheduler (16 x 32b regs); function generators; 5-way add; initial digest (160b); accumulating digest; regs; final sha-256 add.]

SHA-1: 2 clks/32b rnd (5)
SHA-256: 3 clks/round


Build Process


The Centaur Process

[Flow diagram. Boxes and the labels attached to them:
  Design Process: arch, logic, circuit, layout; takes requirements and technology as inputs
  The Processor Source: verilog, schematics, layout, global wires, microcode, models
  Design Verification (full chip): Compatible? CPI? Returns bugs
  Timing Process (full chip): Power? MHz? Returns timing data
  Physical Build & Tapeout Process: returns fails; sends mask data to Foundry
  Foundry: returns silicon
  System Verification: Power? Bugs? MHz? Feedback to process
  Manufacturing Engineering: test vectors, test programs, test fixtures, packages, FA, silicon debug, qual, MHz calibration, etc.; mfg requirements; feedback to process
  "Release to manufacturing": to VIA Manufacturing, Marketing & Sales
  Tech data & support: hdw & sw compatibility, benchmarks, power data, supported hdw, etc.]

Centaur Build Methodology

Our challenges!
  Complex logic with lots of architectural interconnections
  2-GHz & aggressive power/size objectives
  Relatively few designers (30 logic & circuit)
  Strong schedule pressure (must do it fast)
  Industry tools not sufficient (oriented to APR methodology)
Our Basic Approach
  Hundreds of top-level stand-alone “blocks”
    Allows parallel development of “one-person” blocks
    Facilitates fast “build” time (chip assembly, timing, etc.)
    Facilitates use of optimum process for particular logic
  Hook blocks together with top-level routing, clocks, etc.
    Significant “content” added in top-level build
  Full-chip timing with fast iterations
  Fast full-chip build iterations
  Develop our own tools & methodology to accomplish above

Centaur Chip Physical Build Process

[Flow diagram: source files feed per-type layout flows, which feed full-chip integration.]

Sources
  processor.v: defines the top-level blocks & the connecting global wiring
  APR blocks: Verilog sources for each physical APR (control logic)
  datapath stacks: Verilog source (special format) for each physical datapath stack; Verilog for control logic to be placed in buffer section
  circuit elements: schematics for each physical custom element/block; plus timing models, size models, RTL behaviorals, etc.
  processor.mc: on-chip microcode
Flows
  verilog-to-layout APR flow (standard cell library)
  verilog-to-layout stack flow (stack element library)
  custom layout flow (custom blocks)
  microcode flow (ROMs)
Full-chip integration & build process
  global wiring definitions
  routing, RC repeaters, clock tree, power/grd, via add, cap fill

C5J Die

62 full custom blocks (299 instances): 3.12 mm2, 4.82 mm2 (I/O), 1.32 M xistors
63 datapath stacks: 6.62 mm2, 12.38 m routing, 3.38 M xistors
20 APR blocks: 2.32 mm2, 5.89 m routing, 1.18 M xistors
10 bit-cell arrays (18 instances): 8.02 mm2, 20.02 M xistors
2 ROMs: 0.39 mm2, 0.48 M xistors
Global wiring interconnecting all top-level blocks: 21,512 nets, 22.73 m routing
Perf optimized routing widths & spacing
Global RC repeaters: automatic insertion tool, 3,500 x 7 bfrs inserted
On-die decoupling caps: automatic insertion tool
Power/ground grids: both hand-drawn & automatic
Clock distribution network: hand-drawn, 49 top-level elements/395 nets
I/O drivers

Underlying Source Statistics

Verilog lines as written (small)
  (no behaviorals, no comments, no clocks, no “top” chip)
  APR logic      112K lines    129K cells
  Stack logic     41K lines    172K cells
  Note: this is “single instance” as written; much of this gets instantiated multiple times

Schematic “pages” as written (large)
  Primitive (inv, nand2, nor2, etc.)     110
  Standard cells                         712
  Datapath elements                     1308
  Full customs                          1332
                                        ----
  Total                                 3462

Circuit library size      avail    used
  Clock regens              445     277
  Std cell                  547     435
  G datapath elements       493     271
  W datapath elements       248     147
                           ----    ----
  Total                    1733    1130

C5J Security Components (metal 1-4 only)

[Layout figure. Labeled regions: 128b-wide AES engine; key RAM; common control logic; RNG buffers; SHA sch & ALU; stacks (stk); APR (control for all stacks); custom; 32b data “bfr” section. Also labeled: clock repeaters; 7 RC bfrs; global clk meanders; decoupling caps.]

Note: global interconnects not shown

“Fast Build & Timing”

Every 1-5 days: full-chip “Release”
  APRs synthesized, placed & RCs estimated
  Stacks “cracked”, placed & RCs estimated
  Full-chip timing done with estimated RCs
  Takes < 1 day for full-chip timing report
Every 5-10 days: full-chip physical build
  APRs routed
  Stacks routed
  Global chip routed
  Global chip layout produced
  APRs, stack & global route RC extraction
    RCs feed back to calibrate estimated RCs
This goes on continuously, picking up new Releases as needed
Our experience at other companies: much slower

Basic “Release” Process

[Flow diagram. Labels:
  Inputs: verilog source modules (processor.v, APR blks, DP stacks); floorplan, clock tree, wire control, etc. (auto or by hand); pwr lib, timing lib, tech lib, phy lib; element timing models; tech file; I/O constraints
  APR build: merge, flatten; substitute, expand, rename; split; synthesis & place (shape from floorplan); RC estimator; gen
  Stack build: split (ctl, dp); synthesis; stack place; RC estimator; gen
  Global build: build ctl nl; global nl; RC
  Netlists (nl) and RCs feed Full-Chip Timing
  Cycle labels: 1-5 day cycle; 5-10 day cycle]

RTL Design Rules

APR Blocks
  Element instantiation OK
    Registers (req’d: synthesis can’t infer them correctly)
    Clock buffers & distribution (req’d: synthesis clocks are slow!!)
    Occasional logic (this has diminished over time)
    The instantiated elements are really macros
      Auto expanded to right size, number bits, etc. in the flow
  Wires & continuous assignment OK
    Including operators like ?, +, < etc.
  Nothing else! (no procedural stuff)
    No if/else, no case, no loops, no “always”, no “at”, etc.
    No timing information/control
    Synthesis generates bad logic for these
      Unexpected/superfluous elements, registers where not expected, timing doesn’t work, etc.
Stacks
  Component instantiation & wires only!
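
Rules like these are easy to check mechanically. The following is a hypothetical lint sketch, not a Centaur tool: it strips comments from a Verilog source string and reports any procedural keyword, event control (“at”, i.e. `@`), or numeric delay it finds. The keyword list is an assumption based on the rules above.

```python
import re

# Assumed set of forbidden constructs: procedural keywords, event control (@),
# and numeric delays such as "#5". Parameter lists like "#(5)" are allowed.
FORBIDDEN = re.compile(
    r"\b(always|initial|if|else|casex|casez|case|for|while|repeat|forever)\b|@|#\s*\d"
)

def check_apr_rules(source: str) -> list:
    """Return the forbidden constructs found in a Verilog source string."""
    text = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)  # strip block comments
    text = re.sub(r"//[^\n]*", " ", text)                   # strip line comments
    return [m.group(0) for m in FORBIDDEN.finditer(text)]

ok = "assign add1NS = (T[2]) & ~shaDone_P;\nrregs #(5) state (.q(a), .d(b), .clk(ph1c));"
bad = "always @(posedge clk) if (en) q <= d;"
print(check_apr_rules(ok))
print(check_apr_rules(bad))
```

A real checker would need a proper Verilog parser (string literals and attributes are ignored here), but a regular-expression pass is enough to show the idea.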


APR RTL Example As Written

assign idleNS = (T[0] | T[8]) | shaDone_P;
assign funcNS = (T[1] | T[3] | T[6] | T[10]) & ~shaDone_P;
assign add1NS = (T[2]) & ~shaDone_P;
assign add2NS = (T[5]) & ~shaDone_P;
assign faddNS = (T[4] | T[7] | T[9]) & ~shaDone_P;
rregs #(5) state (.q   ({idleState, funcState, add1State, add2State, faddState}),
                  .d   ({idleNS, funcNS, add1NS, add2NS, faddNS}),
                  .clk (ph1c)
);
// ------------------
sha2cnst sha2cnst(.in   (iteration[5:0]),
                  .ksel (shKSel),
                  .algo (sha1_P),
                  .out  (KsubI));
// ------------------
wire [6:0] nextIteration;
assign nextIteration = (shaDone_P | idleState) ? 7'b0000000 :
                       shIterationStall ? iteration : iteration + 1;


Stack RTL Example

Datapath Section

/*------------------- KeyGen XOR --------------------------*/
wire [31:0] aesKeyGenXorOut2_L;
zdxor #(32,15) keyg1 (.out (aesKeyGenXorOut2_L),
                      .in0 (aesWord2I_LB),
                      .in1 (aesKeyGenXorOut1_LB));

zinv #(32,60) kgen2 (aesKeyGenXorOut2_LB, aesKeyGenXorOut2_L);

wire [31:0] aesKeyGenXorOut2_MB;
wire [31:0] aesKeyGenXorOut2_M;
zregi_en #(32,10) keyg2 (.q   (aesKeyGenXorOut2_MB),
                         .d   (aesKeyGenXorOut2_L),
                         .clk (EPH1),
                         .en  (aesDynEn_K));
zinv #(32,10) keyg2i (aesKeyGenXorOut2_M, aesKeyGenXorOut2_MB);

Buffer Section

rregsi #(2,20) bf_kk (.qb  (aesKeyMuxSel_M),
                      .d   (aesKeyMuxSel_LB),
                      .clk (evph1));

Stack Placement Tool Output (32-bit AES stack)

[Figure: three views of the placed stack. Annotations: buffer section added; inter-element routing (m2-6); global wires added.]


Sample Timing Report “Path”

time     path element                                                                          delta    load cap  wire     rise/fall
0.875ns  eeph1aesdp2 ^  aesdp2/eph1buf_aesdp2/                                                 0.050ns  0.2423pF  0.000ns  0.000ns
0.925ns  aesdp2/eph1 ^  aesdp2/sc_c0ph1_48/                                                    0.160ns  0.0321pF  0.000ns  0.000ns
1.085ns  aesdp2/keyg2_ph1 ^  aesdp2/gxregi_x4_10……………………                                      0.063ns  0.0035pF  0.000ns  0.004ns
1.148ns  aesdp2/aesdp2_dp_aeskeygenxorout2_mb10 v                                              0.000ns  0.0035pF  0.000ns  0.004ns
1.148ns  aesdp2/aesdp2_dp_keyg2i_stack_bit10_i0 v  aesdp2/ginv_10…………………………………               0.026ns  0.0209pF  0.000ns  0.044ns
1.173ns  aesdp2/aesdp2_dp_aeskeygenxorout2_m10 ^                                               0.000ns  0.0209pF  0.000ns  0.045ns
1.174ns  aesdp2/aesdp2_dp_invk_stack_bit10_i0 ^  aesdp2/gemux3i_19…………………………                 0.045ns  0.0336pF  0.000ns  0.031ns
1.219ns  aesdp2/aesdp2_dp_key_mb10 v                                                           0.000ns  0.0336pF  0.000ns  0.031ns
1.219ns  aesdp2/aesdp2_dp_kml_stack_bit10_i0 v  aesdp2/ginv_31…………………………………                  0.017ns  0.0188pF  0.000ns  0.013ns
1.236ns  aesdp2/aesdp2_dp_key_m10 ^                                                            0.001ns  0.0188pF  0.001ns  0.014ns
1.236ns  aesdp2/aesdp2_dp_mixcoldec_xorout_stack_bit10_in0 ^  aesdp2/gxor8_10………………………………    0.095ns  0.0170pF  0.000ns  0.029ns
1.331ns  aesdp2/aesdp2_dp_decout_m10 v                                                         0.000ns  0.0170pF  0.000ns  0.030ns
1.332ns  aesdp2/aesdp2_dp_mcmux_stack_bit10_i2 v  aesdp2/gmux3i_10…………………………                 0.030ns  0.0089pF  0.000ns  0.017ns
1.362ns  aesdp2/aesdp2_dp_mcout_mb10 ^                                                         0.000ns  0.0089pF  0.000ns  0.017ns
1.362ns  aesdp2/aesdp2_dp_invm_stack_bit10_i0 ^  aesdp2/ginv_31………………………………                  0.030ns  0.1101pF  0.000ns  0.053ns
1.391ns  aesdp2/aesdp2_dp_mcout_m10 v                                                          0.012ns  0.1101pF  0.012ns  0.078ns
1.403ns  aesdp2/aesdp2_dp_pipemux0_stack_bit10_i1 v  aesdp2/gmux2i_16……………………………            0.048ns  0.0249pF  0.000ns  0.030ns
1.451ns  aesdp2/aesdp2_dp_aesword2i_kb10 ^                                                     0.001ns  0.0249pF  0.001ns  0.032ns
1.452ns  aesdp2/aesdp2_dp_byte1_indx_pb2 ^

Local reg clock-to-next reg input = 1.452-1.085 = 367ps
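
The 367 ps figure can be reproduced from the report itself. The sketch below copies the two arrival times and the per-row “delta” values from the sample path (starting at the register clock pin) and checks that the sum of the deltas agrees with the difference of the arrival times to within the report's 1 ps rounding. The variable names are illustrative.

```python
# Arrival times (ns) copied from the sample path.
launch_clock_ns = 1.085    # aesdp2/keyg2_ph1: clock at the launching register
capture_input_ns = 1.452   # aesdp2/aesdp2_dp_byte1_indx_pb2: next register input

delay_ps = round((capture_input_ns - launch_clock_ns) * 1000)

# "delta" column (ns) for every row from the launching register onward.
deltas_ns = [0.063, 0.000, 0.026, 0.000, 0.045, 0.000, 0.017, 0.001,
             0.095, 0.000, 0.030, 0.000, 0.030, 0.012, 0.048, 0.001]
summed_ps = round(sum(deltas_ns) * 1000)

print(delay_ps, summed_ps)  # the two differ only by rounding in the report
```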

Random Circuit Topics

Clocking is very difficult & very critical
  Very aggressive skew goals
    “0” ps clock skew across all top-level blocks
    <20 ps skew worst case within a block
    These are met in our designs, ignoring on-chip silicon variations
  Multiple clock domains required (for bus & various power states)
  Many “early”, “late”, etc. versions of the clocks needed
  Clocks must be gated (for power management)
Our clocking methodology is proprietary, but…
  Hand-routed global clock tree (continually changing)
  Our own tools to generate clock shields tuned to surroundings
  Tunable “repeaters” (via fuse & via metal)
  Hand instantiated clock elements within blocks
    Many selectable clocks (xx ps for each reg)
  Auto-generated clock grids within APRs & stacks
  Fuse adjustable PLL characteristics (duty cycle, etc.)
Power/ground distribution critical
  Extensive analysis & “management” required

Random Circuit Topics (cont)

Robust circuit design req’d across 12 “corner” models
  54 formal corners identified, we choose the most critical “12”
  Covers variations in: Temp, V, N xistor, P xistor
  Automated element simulation done across these models
  Full-chip timing is done using 2 of these corners (hi V, lo V)
Extensive use of dynamic logic
  Precharge in phase 1, evaluate in phase 2
  Registers, adders, comparators, arrays, etc.
  Customs, stacks (& APRs)
Two stack-element libraries
  With different bit pitches
Element libraries have several versions of same function
  Usually, at least “fast/big/hot” & “slow/small/cool”
  Example: C5J has 2 different “vanilla” 32-bit adders
    Fast (dynamic): 180 ps, 37.9 high
    Slow (static): 250 ps, 16.9 high
  Note: 25 total adders in library, instantiated 65 total times

Random Circuit Topics (cont)

Several families of registers available
  Differ in function, speed, size & performance
  Std cell, datapath & custom versions
  Each comes in many drive strengths (sizes)
  Many have built-in functions
    muxes, and/or logic, xors, compares, etc.
    These provide speed/size/power improvements vs. separate elements

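Examples using C5J stack elements

[Figure: register examples. Element labels: k-reg 10; k-reg + dynamic cmp-eq 60; static cmp-eq 20; inv 54; x-reg 10; normal reg; fast reg. Delays shown: 82 ps (data-to-out); 90 + 32 + 17 = 139 ps; 88 ps; 32 ps. Other values shown: 5.0, 9.5, 4.6, 3.8, 3.8, 1.4. Signal widths: 26b inputs, 1b outputs.]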

C5J Security Component Sizes (mm2)

[Layout figure with per-block areas: 0.080, 0.080, 0.080, 0.091, 0.014, 0.014, 0.034, 0.069, 0.046, 0.021. Sample scale: 227.]

Total = 0.529 mm2 + 0.014 for 2 RNG’s (elsewhere) = 0.54 (a few cents, but for this chip it’s really free)

C5J Security Component Sizes

[Layout figure repeated, with the same per-block areas in mm2.]

Note: We had so much spare room on die that we didn’t spend any effort making this smaller. We estimate at least 30% smaller if we tried hard!

(If we had only known about all this space when we started…)

[Figure: annotated layout of a 32-bit AES stack. Annotations:]
  S-box ROM: (2 x 256 x 8 bit) x 4 bytes, 200 ps access (dynamic)
  Row-shift muxes (wires to other 32b stacks not visible)
  Column multiply (& key xor): made out of 2-, 3-, 4-, 5-, 6-, 7- & 8-input xors
  Startup, CBC, etc. muxes & registers (two groups)
  Register rows (three)
  (extra stuff at bottom for key generation)
