
WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines

Wei Dong, Peng Li, Xiaoji Ye

Department of ECE, Texas A&M University

{weidong, pli, yexiaoji}@neo.tamu.edu

DAC 2008


Multi-Core Implications

Multi-core shift is changing the landscape of computing

New challenges & opportunities for EDA

– Free ride of single-threaded EDA applications on Moore’s Law is coming to an end

Question: How to fully exploit increasingly parallel hardware and achieve good runtime scaling?

[Images: multi-core processors, courtesy of Intel, AMD and IBM]


Why Parallel Transient Simulation?

SPICE-like transient simulation is key to a wide range of ICs

– Memories, custom digital, analog/RF/mixed-signal

Long simulation time presents a significant bottleneck in design

– CPU time > days, weeks (e.g. transistor-level PLL simulation)

– Can lead to insufficient verification, non-optimal design, chip failure

Natural target for parallelization!



Prior Work

Fine-grained parallelization

– Parallel matrix solves, device model evaluations

– The efficiency of parallel matrix solvers deteriorates quickly

Parallel waveform relaxation [White et al. '87, Reichelt et al., ICCAD'03]

– Limited convergence property

Domain decomposition [Wever et al., HICSS'96]

– Can create dense problems

– Applicability highly application dependent

[Figure: normalized runtime vs. number of cores/threads (1 to 8) for an SPD matrix and an unsymmetric matrix]

Performance of a public parallel matrix solver on an 8-processor server


Our Strategies

Exploit coarse-grained & application-level parallelisms

– Lessons learned before [T. Mattson, Intel]

– >100 parallel languages/environments developed in the 90's!

– Only a few with significant domain knowledge became successful

– Develop simulation algorithms parallelizable by construction

Goals/Benefits

– Reduce parallel overhead by applying domain knowledge

– Create rich parallelisms for multi-/many-core platforms (pairing with fine-grained methods)

– Ease of parallel programming, debugging and code reuse

– Do not jeopardize accuracy & convergence


Proposed Approach

Time-domain MNA formulation

Nonlinear DAEs

$$f(x(t)) + \frac{d}{dt} q(x(t)) - u(t) = 0$$

$x(t)$: vector of unknowns

$f(x(t))$: static nonlinearities

$q(x(t))$: dynamic nonlinearities

$u(t)$: inputs

How to parallelize along the time axis?

Data dependency

[Figure: time points t1 to t5 under one-step integration and under two-step integration]


Waveform Pipelining (WavePipe)

[Figure: WavePipe overview. Backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) extend from the current/base position on the time axis T. The resulting time-point solves are scheduled onto threads T1, T2, T3, T4, ... of a multi-/many-core machine at the granularity of waveform pipelining, with fine-grained parallel assists (parallel matrix solve/device evaluation) underneath.]


Outline

Motivation

Overview

Parallel backward pipelining

Parallel forward pipelining

Experimental results

Summary



Parallel Backward Pipelining

Move backwards in time

Create additional independent computing tasks along the T axis

Why useful?

– Employed under variable-stepsize multi-step numerical integration

– Contributes to a larger future time step

[Figure: backward pipelining and forward pipelining relative to the current position on the time axis T]


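Variable-Stepsize Multi-Step Gear's Method

Gear's integration formula

$$x_{n+1} = \sum_{k=1}^{p} \alpha_k x_{n+1-k} + \beta_0 \dot{x}_{n+1}$$

$p$: order of numerical integration

$x_i$: circuit response at time point $i$

$\alpha_k, \beta_0$: coefficients

Two-step Gear's method [Shichman, Trans. Circuit Theory, 1970]

$$x_{n+1} = \frac{(h_{n+1}+h_n)^2}{h_n(2h_{n+1}+h_n)}\, x_n - \frac{h_{n+1}^2}{h_n(2h_{n+1}+h_n)}\, x_{n-1} + \frac{h_{n+1}(h_{n+1}+h_n)}{2h_{n+1}+h_n}\, \dot{x}_{n+1}$$

$h_{n+1} = t_{n+1} - t_n$, $h_n = t_n - t_{n-1}$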
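The variable-step two-step Gear coefficients can be checked numerically. The following is a minimal sketch, assuming the form x[n+1] = a1*x[n] + a2*x[n-1] + b0*xdot[n+1]; the function name `gear2_coefficients` and its argument names are hypothetical and not from the slides. With equal steps it should recover the familiar fixed-step Gear2 coefficients 4/3, -1/3 and 2h/3.

```python
from fractions import Fraction


def gear2_coefficients(h_next, h_prev):
    """Coefficients (a1, a2, b0) of the variable-step two-step Gear formula
    x[n+1] = a1*x[n] + a2*x[n-1] + b0*xdot[n+1],
    with h_next = t[n+1] - t[n] and h_prev = t[n] - t[n-1]."""
    denom = 2 * h_next + h_prev
    a1 = (h_next + h_prev) ** 2 / (h_prev * denom)
    a2 = -(h_next ** 2) / (h_prev * denom)
    b0 = h_next * (h_next + h_prev) / denom
    return a1, a2, b0


# Equal steps: exact rational arithmetic makes the comparison unambiguous.
h = Fraction(1, 10)
a1, a2, b0 = gear2_coefficients(h, h)

# Unequal steps: a1 + a2 must still equal 1 so that constants are reproduced.
c1, c2, c0 = gear2_coefficients(Fraction(1, 3), Fraction(1, 7))
```

Exact fractions are used here only to make the check exact; a simulator would use floating point.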



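Local Truncation Error (LTE)

Numerical integration error incurred "locally" at each point

– All the previous solutions are assumed to be accurate

LTEs in Gear's methods

Two-step

$$\mathrm{LTE}_{n+1} = \frac{h_{n+1}^2 (h_{n+1}+h_n)^2}{6(2h_{n+1}+h_n)}\, x^{(3)} \approx \frac{h_{n+1}^2 (h_{n+1}+h_n)^2}{2h_{n+1}+h_n}\, DD_3$$

Three-step

$$\mathrm{LTE}_{n+1} = \frac{T_1^2 T_2^2 T_3^2}{24(T_1 T_2 + T_1 T_3 + T_2 T_3)}\, x^{(4)} \approx \frac{T_1^2 T_2^2 T_3^2}{T_1 T_2 + T_1 T_3 + T_2 T_3}\, DD_4$$

$T_i = t_{n+1} - t_{n+1-i}$

Divided differences

$$\frac{d^k x}{dt^k} \approx k!\, DD_k, \qquad DD_k = \frac{DD_{k-1}(t_{n+1}) - DD_{k-1}(t_n)}{\sum_{i=1}^{k} h_{n+2-i}}$$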
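The divided-difference LTE estimate for the two-step method can be sketched as follows. This is an illustrative sketch only: `divided_difference` and `gear2_lte` are hypothetical helper names, and the LTE is treated as a magnitude (no sign convention is assumed). For x(t) = t^3 the third divided difference is exactly 1, which gives a simple sanity check.

```python
def divided_difference(ts, xs):
    """Highest-order Newton divided difference of the samples (ts[i], xs[i])."""
    dd = list(xs)
    n = len(ts)
    for order in range(1, n):
        # dd[i] currently holds the order-1 lower difference starting at ts[i]
        dd = [(dd[i + 1] - dd[i]) / (ts[i + order] - ts[i])
              for i in range(n - order)]
    return dd[0]


def gear2_lte(h_next, h_prev, dd3):
    """Two-step Gear LTE magnitude estimate from the third divided difference."""
    return h_next ** 2 * (h_next + h_prev) ** 2 / (2 * h_next + h_prev) * abs(dd3)


# Four samples of x(t) = t**3 on a non-uniform grid.
ts = [0.0, 1.0, 3.0, 4.0]
xs = [t ** 3 for t in ts]
dd3 = divided_difference(ts, xs)

# Step from t=1 to t=3 (h_next = 2) after a step from t=0 to t=1 (h_prev = 1).
lte = gear2_lte(2.0, 1.0, dd3)
```

For this cubic the estimate equals the actual one-step error of the two-step formula, since all derivatives above the third vanish.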



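LTE-Based Time Step Control (Gear2)

Control the time step to meet an LTE tolerance

$$h_{n+1} = k h_n, \qquad \frac{k^2 (k+1)^2}{2k+1} = \frac{\text{bound}}{h_n^3\, DD_3}$$

LTE's dependency on $h_n$ & $h_{n+1}$

$$\frac{\partial\, \mathrm{LTE}}{\partial h_{n+1}} = DD_3\, \frac{2h_{n+1}(h_{n+1}+h_n)\left[(2h_{n+1}+h_n)^2 - h_{n+1}(h_{n+1}+h_n)\right]}{(2h_{n+1}+h_n)^2} > 0$$

$$\frac{\partial\, \mathrm{LTE}}{\partial h_n} = DD_3\, \frac{h_{n+1}^2 (h_{n+1}+h_n)(3h_{n+1}+h_n)}{(2h_{n+1}+h_n)^2} > 0$$

Key observation

– Smaller $h_n$ allows a greater $h_{n+1}$, if $DD_3$ is nonincreasing

– Exploit for parallel computing

[Figure: time axis T with the previous step $h_n$ and the next step $h_{n+1}$ to be determined]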
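The key observation can be reproduced with a small step-size solver. This is a hedged sketch, not the authors' implementation: it assumes the Gear2 LTE estimate h_next^2 (h_next + h_prev)^2 / (2 h_next + h_prev) * |DD3|, writes h_next = k * h_prev, and finds k by bisection. The name `next_step` and the bisection strategy are assumptions made for illustration; DD3 is assumed nonzero.

```python
def next_step(h_prev, dd3, bound, k_max=1e6):
    """Largest h_next = k*h_prev whose estimated Gear2 LTE stays within bound.

    Solves k^2 (k+1)^2 / (2k+1) = bound / (h_prev^3 |dd3|) for k by bisection;
    the left-hand side increases monotonically for k > 0.
    """
    target = bound / (h_prev ** 3 * abs(dd3))

    def growth(k):
        return k ** 2 * (k + 1) ** 2 / (2 * k + 1)

    lo, hi = 0.0, 1.0
    while growth(hi) < target and hi < k_max:
        hi *= 2.0                      # expand the bracket until it contains k
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if growth(mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo * h_prev


bound = 4.0 / 3.0
h_after_full_step = next_step(1.0, 1.0, bound)   # previous step h_n = 1.0
h_after_short_step = next_step(0.5, 1.0, bound)  # previous step h_n = 0.5
```

With the same bound and the same DD3, the shorter previous step permits a longer next step, which is the effect backward pipelining relies on.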



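Parallel Backward Pipelining

Serial Gear2

[Figure: time axis with points t1, t2, t3, t4 and steps h2, h3, h4]

Double-threaded Gear2

– Initial conditions @ t1 & t2

– Thread 1: t3 (h3 ← h2); Thread 2: back to t3'

– Thread 1: t4 (h4 ← h3'); Thread 2: back to t4'

[Figure: time axis with points t1, t2, t3', t3, t4', t4 and steps h2, h3, h3', h4, h4', solved by Thread 1 and Thread 2]

Balance between efficiency and robustness: $h_i' = \gamma h_i$, $0 < \gamma < 1$

Extensible to multi-step methods (e.g. Gear3)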
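One double-threaded backward-pipelining cycle might look like the sketch below. Several things here are assumptions for illustration: the scalar test equation x' = LAM * x stands in for a circuit, the backward point is placed at t3' = t3 - gamma*h3, and the names `gear2_solve` and `backward_cycle` are hypothetical. Both solves depend only on the two history points, so they are independent tasks. In CPython the threads do not run this arithmetic truly in parallel; the sketch only shows the task structure.

```python
import math
from concurrent.futures import ThreadPoolExecutor

LAM = -1.0  # assumed test dynamics x' = LAM * x (not from the slides)


def gear2_solve(t_prev, x_prev, t_cur, x_cur, t_new):
    """One variable-step Gear2 solve of x' = LAM*x at t_new from two history points."""
    h_prev = t_cur - t_prev
    h_new = t_new - t_cur
    denom = 2 * h_new + h_prev
    a1 = (h_new + h_prev) ** 2 / (h_prev * denom)
    a2 = -(h_new ** 2) / (h_prev * denom)
    b0 = h_new * (h_new + h_prev) / denom
    # x_new = a1*x_cur + a2*x_prev + b0*LAM*x_new is linear, so solve directly.
    return (a1 * x_cur + a2 * x_prev) / (1.0 - b0 * LAM)


def backward_cycle(history, h, gamma):
    """Solve the base point t_cur + h and a backward point gamma*h before it."""
    (t_prev, x_prev), (t_cur, x_cur) = history
    t_base = t_cur + h
    t_back = t_base - gamma * h
    with ThreadPoolExecutor(max_workers=2) as pool:
        base = pool.submit(gear2_solve, t_prev, x_prev, t_cur, x_cur, t_base)
        back = pool.submit(gear2_solve, t_prev, x_prev, t_cur, x_cur, t_back)
        # The new history ends with a short step gamma*h before the base point.
        return [(t_back, back.result()), (t_base, base.result())]


history = [(0.0, 1.0), (0.1, math.exp(-0.1))]
new_history = backward_cycle(history, h=0.1, gamma=0.5)
```

The next cycle would then see a previous step of gamma*h rather than h, which under LTE-based control allows a larger following step.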



Parallel Forward Pipelining

Move forwards in time

Exploit predictive computing along the forward T direction

Question

– How to resolve data dependency & ensure accuracy?

[Figure: backward pipelining and forward pipelining relative to the current position on the time axis T]


Parallel Forward Pipelining

Ex: double-threaded

– Init. condition @ t1 & t2

– Time point t3 (h3 ← h2)

– FE (forward Euler) estimate sol@t3

– Solve sol@t3 & sol@t4

– FE estimate sol@t5

– Solve sol@t5 & sol@t6

[Figure: time axis with points t1 to t6 and steps h2 to h6, solved alternately by Thread 1 and Thread 2]
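A double-threaded forward-pipelining cycle can be sketched on a scalar nonlinear test equation. Everything specific here is an assumption for illustration: the dynamics x' = -x^3, the helper names `gear2_newton` and `forward_cycle`, and the single re-solve that stands in for inter-thread communication. Thread 1 solves t3 from converged history; Thread 2 solves t4 at the same time, starting from a forward Euler (FE) estimate of the solution at t3, and then iterates once more on the converged value.

```python
from concurrent.futures import ThreadPoolExecutor


def f(x):
    return -x ** 3          # assumed test dynamics x' = -x**3


def df(x):
    return -3.0 * x ** 2


def gear2_newton(t_prev, x_prev, t_cur, x_cur, t_new, guess, tol=1e-12):
    """Newton solve of the Gear2 equation at t_new; returns (x, iterations)."""
    h_prev, h_new = t_cur - t_prev, t_new - t_cur
    denom = 2 * h_new + h_prev
    a1 = (h_new + h_prev) ** 2 / (h_prev * denom)
    a2 = -(h_new ** 2) / (h_prev * denom)
    b0 = h_new * (h_new + h_prev) / denom
    x = guess
    for it in range(1, 51):
        g = x - a1 * x_cur - a2 * x_prev - b0 * f(x)
        dx = -g / (1.0 - b0 * df(x))
        x += dx
        if abs(dx) < tol:
            return x, it
    raise RuntimeError("Newton did not converge")


def forward_cycle(p1, p2, h3, h4):
    (t1, x1), (t2, x2) = p1, p2
    t3, t4 = t2 + h3, t2 + h3 + h4
    x3_est = x2 + h3 * f(x2)                       # FE estimate of sol@t3
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut3 = pool.submit(gear2_newton, t1, x1, t2, x2, t3, x2)
        fut4 = pool.submit(gear2_newton, t2, x2, t3, x3_est, t4, x3_est)
        x3, _ = fut3.result()
        x4_pred, _ = fut4.result()
    # Iterate on the converged initial condition, warm-started from x4_pred.
    x4, _ = gear2_newton(t2, x2, t3, x3, t4, x4_pred)
    return (t3, x3), (t4, x4), x4_pred


p1 = (0.0, 1.0)
p2 = (0.1, 1.2 ** -0.5)     # exact solution x(t) = (1 + 2t)**-0.5
(t3, x3), (t4, x4), x4_pred = forward_cycle(p1, p2, 0.1, 0.1)
```

The final value at t4 satisfies the same Gear2 equation a serial run would solve; the predicted value only supplies a starting point for the last Newton pass.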



Complications

Time steps for forward points may not be estimated accurately

– Data dependency on initial conditions

– Apply a damping factor (β<1.0) for time step estimation

– Revoke forward results in thread scheduling cycle (covered later)

Forward points based on inaccurate initial conditions

– Addressed by inter-thread communication

– Tradeoffs provided by fine/coarse grained communications

[Figure: forward pipelining from the base position on the time axis T, with the open questions "h=?" and "Accuracy?"]


Coarse Grained Inter-thread Communication

[Figure: timelines of Thread 1 (time point 1) and Thread 2 (time point 2). Each thread runs FE estimation, then a Newton loop of one or more iterations until convergence. Once Thread 1 converges, Thread 2 iterates on the converged initial condition.]


Fine Grained Inter-thread Communication

Communicate at the granularity of NR iterations

Beneficial to large circuits

[Figure: thread timelines, each running FE estimation followed by NR iterations 1, 2, 3 until convergence, with communication between threads at each NR iteration]


Multi-threaded WavePipe

Combine backward with forward waveform pipelining

Ex: 4T (1-backward-2-forward) WavePipe

[Figure: one thread scheduling cycle starting from the initial solutions. T1 solves the standard (base Gear2) point, T2 the backward point, T3 the forward point and T4 the 2nd forward point; each thread performs FE, Newton and time step computation.]


Thread Scheduling

The work done over an overestimated step is discarded

[Figure: standard 4-thread WavePipe (1-backward-2-forward scheme) along the time axis, with backward, forward and 2nd forward points. Without step size overestimation, each cycle starts from the initial conditions and completes. With step size overestimation, a cycle only partially completes before the next cycle starts.]
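The discard rule at the end of a scheduling cycle could be sketched as below. This is an assumption-laden illustration, not the scheduler from the slides: `PointResult`, its `lte_ok` flag and `commit_cycle` are hypothetical names, and the sketch assumes that every point after the first overestimated step is also dropped because it was computed from that point.

```python
from dataclasses import dataclass


@dataclass
class PointResult:
    t: float        # time point solved by one thread in this cycle
    lte_ok: bool    # True if the step that produced it met the LTE tolerance


def commit_cycle(results):
    """Keep results in time order up to the first overestimated step."""
    accepted = []
    for result in sorted(results, key=lambda r: r.t):
        if not result.lte_ok:
            break               # this point and all later ones are discarded
        accepted.append(result)
    return accepted


cycle = [
    PointResult(t=1.0, lte_ok=True),    # standard point
    PointResult(t=0.8, lte_ok=True),    # backward point
    PointResult(t=1.5, lte_ok=False),   # forward point, step overestimated
    PointResult(t=2.0, lte_ok=True),    # 2nd forward point, depends on t=1.5
]
accepted_times = [r.t for r in commit_cycle(cycle)]
```

Here the cycle partially completes: the backward and standard points are kept and the next cycle would restart from them.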



Experimental Setup

An 8-processor Linux server with four dual-core processors

WavePipe implemented in C/C++ using Pthreads (Gear2)

Compare with

– Reference serial SPICE-like (Gear2) transient simulation

– Low-level parallel matrix solve (SuperLU) and device evaluation

Test circuits

Index  Circuit            Size    Time Points  Serial Run Time (s)
1      VCO                20      86,023       37.59
2      Power Amplifier    8       113,972      30.12
3      DB mixer           27      134,612      48.11
4      Ring Oscillator    61      110,037      206.37
5      Frequency Divider  17      44,795       18.49
6      Digital Adder      112     2,558        8.93
7      RLC mesh 1         13,097  664          2,704.08
8      RLC mesh 2         27,670  143          2,659.35


Experimental Results – Accuracy & Profiling

3-T (1 backward + 1 forward) WavePipe vs. serial (DB mixer)

Real-time threading profiling (mesh ckt)



Experimental Results – 2T Speedups

2T 1-backward & 2T 1-forward

Circuit            2T 1-backward       2T 1-forward
                   T(s)     Speedup    T(s)     Speedup
VCO                27.3     1.38       23.1     1.63
Power Amplifier    21.7     1.39       18.1     1.66
DB mixer           36.9     1.30       30.8     1.56
Ring Oscillator    149.3    1.38       121.9    1.69
Frequency Divider  15.3     1.21       12.6     1.47
Digital Adder      7.3      1.22       6.0      1.49
RLC mesh 1         2245.1   1.20       1814.6   1.49
RLC mesh 2         2159.3   1.23       1742.2   1.53
Average                     1.29X               1.57X
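The averages in the last row can be reproduced from the table itself. The short sketch below takes the serial run times from the test-circuit table and the 2T run times and reported speedups from this table; it assumes the average is the arithmetic mean of the reported per-circuit speedups. The variable names are illustrative.

```python
from statistics import mean

# (circuit, serial T(s), 1-backward T(s), reported speedup,
#  1-forward T(s), reported speedup), all values copied from the tables
ROWS = [
    ("VCO", 37.59, 27.3, 1.38, 23.1, 1.63),
    ("Power Amplifier", 30.12, 21.7, 1.39, 18.1, 1.66),
    ("DB mixer", 48.11, 36.9, 1.30, 30.8, 1.56),
    ("Ring Oscillator", 206.37, 149.3, 1.38, 121.9, 1.69),
    ("Frequency Divider", 18.49, 15.3, 1.21, 12.6, 1.47),
    ("Digital Adder", 8.93, 7.3, 1.22, 6.0, 1.49),
    ("RLC mesh 1", 2704.08, 2245.1, 1.20, 1814.6, 1.49),
    ("RLC mesh 2", 2659.35, 2159.3, 1.23, 1742.2, 1.53),
]

# Mean of the reported speedups per scheme.
backward_avg = mean(row[3] for row in ROWS)
forward_avg = mean(row[5] for row in ROWS)

# Speedup recomputed as serial time divided by 2-thread time.
recomputed = [(row[0], row[1] / row[2], row[1] / row[4]) for row in ROWS]
```

Each reported speedup agrees with serial time divided by two-thread time to within rounding.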



Experimental Results – 3T Speedups

3T 1-backward-1-forward & 3T 2-forward

Circuit            3T 1-back-1-forward  3T 2-forward
                   T(s)     Speedup     T(s)     Speedup
VCO                20.3     1.85        19.6     1.92
DB mixer           27.6     1.74        26.3     1.83
Ring Oscillator    112.4    1.84        107.2    1.93
Digital Adder      5.4      1.65        5.1      1.75
RLC mesh 1         1679.6   1.61        1559.0   1.73
RLC mesh 2         1589.3   1.67        1487.4   1.79
Average                     1.73X                1.83X


Experimental Results – 4T Speedups

4T 1-backward-2-forward & 4T 3-forward

Circuit            4T 1-back-2-forward  4T 3-forward
                   T(s)     Speedup     T(s)     Speedup
VCO                16.8     2.24        16.1     2.33
DB mixer           22.7     2.12        21.6     2.23
Ring Oscillator    94.7     2.18        91.0     2.27
Digital Adder      4.5      2.03        4.2      2.13
RLC mesh 1         1390.2   1.95        1324.6   2.04
RLC mesh 2         1330.8   2.00        1265.4   2.10
Average                     2.09X                2.19X


Experimental Results – Runtime Scaling

2-4 threads


Experimental Results

Low-level scheme

– Parallel matrix solve & device model evaluation

Proposed scheme

– 1-4 threads: WavePipe

– 8 threads: 3-forward WavePipe + parallel matrix solve & model evaluation


Summary

Multi-core challenges & opportunities for EDA

Application-level coarse-grained parallelism for transient simulation

– Parallelize at a granularity of single time-point circuit solution

– Inherent low inter-core communication overhead

– Maintain accuracy & convergence

– Ease of implementation and code reuse

Rich sets of parallelisms for multi-core or many-core systems

– New parallel opportunities orthogonal to fine-grained schemes

– Pair with parallel matrix solve, device evaluation and low-level parallel programming assists


Thanks
