WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines
Wei Dong, Peng Li, Xiaoji Ye
Department of ECE, Texas A&M University
{weidong, pli, yexiaoji}@neo.tamu.edu



DAC 2008

Multi-Core Implications

The multi-core shift is changing the landscape of computing
New challenges & opportunities for EDA
– The free ride of single-threaded EDA applications on Moore’s Law is coming to an end
Question: How can we fully exploit increasingly parallel hardware and achieve good runtime scaling?

[Figures: multi-core processors. Courtesy Intel, AMD, IBM]


Why Parallel Transient Simulation?

SPICE-like transient simulation is key to a wide range of ICs
– Memories, custom digital, analog/RF/mixed-signal
Long simulation time presents a significant bottleneck in design
– CPU time > days, weeks (e.g. transistor-level PLL simulation)
– Can lead to insufficient verification, non-optimal design, chip failure
Natural target for parallelization!


Prior Work

Fine-grained parallelization
– Parallel matrix solves, device model evaluations
– The efficiency of parallel matrix solvers deteriorates quickly
Parallel waveform relaxation [White et al. ’87, Reichelt et al., ICCAD’03]
– Limited convergence property
Domain decomposition [Wever et al., HICSS’96]
– Can create dense problems
– Applicability highly application dependent

[Figure: Performance of a public parallel matrix solver on an 8-processor server. Normalized runtime (0.2 to 1) vs. number of cores/threads (1 to 8), for an SPD matrix and an unsymmetric matrix]


Our Strategies

Exploit coarse-grained & application-level parallelism
– Lessons learned before [T. Mattson, Intel]
– >100 parallel languages/environments developed in the 90’s!
– Only a few, those with significant domain knowledge, were successful
– Develop simulation algorithms parallelizable by construction
Goals/Benefits
– Reduce parallel overhead by applying domain knowledge
– Create rich parallelism for multi-/many-core platforms (pairing with fine-grained methods)
– Ease of parallel programming, debugging and code reuse
– Do not jeopardize accuracy & convergence


Proposed Approach

Time-domain MNA formulation (nonlinear DAEs)
$f(x(t)) + \frac{d}{dt} q(x(t)) + u(t) = 0$
– $x(t)$: vector of unknowns
– $f(x(t))$: static nonlinearities
– $q(x(t))$: dynamic nonlinearities
– $u(t)$: inputs
How to parallelize along the time axis?
– Data dependency
[Figure: time points t1 to t5 under one-step integration and under two-step integration]
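To make the data dependency along the time axis concrete, here is a minimal sketch of serial one-step integration of a scalar DAE of the form f(x) + d/dt q(x) + u(t) = 0. Backward Euler stands in for the integration method, and the functions `f`, `q`, `u` and all numeric values are hypothetical choices for illustration, not taken from the slides. The sign convention for u(t) is also an assumption.

```python
def backward_euler_step(x_prev, t_next, h, f, df, q, dq, u, tol=1e-12, max_iter=50):
    """Solve f(x) + (q(x) - q(x_prev)) / h + u(t_next) = 0 for x with Newton's method."""
    x = x_prev  # the previous time point is the initial guess
    for _ in range(max_iter):
        residual = f(x) + (q(x) - q(x_prev)) / h + u(t_next)
        jacobian = df(x) + dq(x) / h
        dx = -residual / jacobian
        x += dx
        if abs(dx) < tol:
            return x
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical scalar "circuit": nonlinear conductance, linear capacitor, DC input.
def f(x):
    return x + 0.5 * x ** 3


def df(x):
    return 1.0 + 1.5 * x ** 2


def q(x):
    return 1e-3 * x


def dq(x):
    return 1e-3


def u(t):
    return -1.5  # f(1.0) = 1.5, so the steady state is x = 1.0


def simulate(t_end=0.05, h=1e-4, x0=0.0):
    """March from t = 0 to t_end; every point needs the previous one first."""
    x = x0
    for n in range(round(t_end / h)):
        x = backward_euler_step(x, (n + 1) * h, h, f, df, q, dq, u)
    return x
```

The loop in `simulate` is the serial chain that waveform pipelining has to break: each call consumes the result of the one before it.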


Waveform Pipelining (WavePipe)

[Diagram: on the T axis, backward pipelining (multi-step numerical integration) lies behind the current/base position and forward pipelining (predictive computing) lies ahead of it. Labels: granularity of waveform pipelining; schedule; solves assigned to threads T1, T2, T3, T4, …; fine-grained parallel assists (parallel matrix solve/device evaluation); multi-/many-core machine]


Outline

Motivation

Overview

Parallel backward pipelining

Parallel forward pipelining

Experimental results

Summary


Parallel Backward Pipelining

Move backwards in time
Create additional independent computing tasks along the T axis
Why useful?
– Employed under variable-stepsize multi-step numerical integration
– Contributes to a larger future time step

[Diagram: backward pipelining (multi-step numerical integration) behind the current position on the T axis; forward pipelining (predictive computing) ahead of it]


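Variable-Stepsize Multi-Step Gear’s Method

Gear’s integration formula
$x_{n+1} = \sum_{k=1}^{p} \alpha_k x_{n+1-k} + \beta_0 \dot{x}_{n+1}$
– $p$: order of numerical integration
– $x_i$: circuit response at time point $i$
– $\alpha_k, \beta_0$: coefficients

Two-step Gear’s method [Shichman, Trans. Circuit Theory, 1970]
$x_{n+1} = \frac{(h_{n+1}+h_n)^2}{h_n (2h_{n+1}+h_n)} x_n - \frac{h_{n+1}^2}{h_n (2h_{n+1}+h_n)} x_{n-1} + \frac{h_{n+1}(h_{n+1}+h_n)}{2h_{n+1}+h_n} \dot{x}_{n+1}$
with $h_{n+1} = t_{n+1} - t_n$ and $h_n = t_n - t_{n-1}$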
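As a quick check on the variable-step two-step Gear formula, the sketch below computes its three coefficients and applies them to the linear test equation x' = λx, where the implicit step has a closed form. The names `gear2_coefficients`, `gear2_linear_step` and `integrate_decay`, the coefficient names `alpha1`, `alpha2`, `beta0`, and the test problem are assumptions made for illustration.

```python
import math


def gear2_coefficients(h_next, h_cur):
    """Coefficients (alpha1, alpha2, beta0) of variable-step two-step Gear.

    x_{n+1} = alpha1 * x_n + alpha2 * x_{n-1} + beta0 * x'_{n+1},
    with h_next = t_{n+1} - t_n and h_cur = t_n - t_{n-1}.
    """
    denom = 2.0 * h_next + h_cur
    alpha1 = (h_next + h_cur) ** 2 / (h_cur * denom)
    alpha2 = -h_next ** 2 / (h_cur * denom)
    beta0 = h_next * (h_next + h_cur) / denom
    return alpha1, alpha2, beta0


def gear2_linear_step(x_prev, x_cur, h_next, h_cur, lam):
    """One implicit Gear2 step for x' = lam * x (closed form for a linear ODE)."""
    alpha1, alpha2, beta0 = gear2_coefficients(h_next, h_cur)
    return (alpha1 * x_cur + alpha2 * x_prev) / (1.0 - beta0 * lam)


def integrate_decay(steps, lam=-1.0):
    """Integrate x' = lam * x, x(0) = 1, over a list of (variable) step sizes."""
    x_prev = 1.0
    h_cur = steps[0]
    t = h_cur
    x_cur = math.exp(lam * t)  # exact start-up value for the second point
    for h_next in steps[1:]:
        x_prev, x_cur = x_cur, gear2_linear_step(x_prev, x_cur, h_next, h_cur, lam)
        t += h_next
        h_cur = h_next
    return t, x_cur
```

With equal steps the coefficients reduce to the familiar fixed-step values 4/3, -1/3 and 2h/3, which is a convenient sanity check on the reconstruction.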


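Local Truncation Error (LTE)

Numerical integration error incurred “locally” at each point
– All the previous solutions are assumed to be accurate
LTEs in Gear’s methods
– Two-step: $LTE_{n+1} = \frac{h_{n+1}^2 (h_{n+1}+h_n)^2}{6 (2h_{n+1}+h_n)} x^{(3)} \approx \frac{h_{n+1}^2 (h_{n+1}+h_n)^2}{2h_{n+1}+h_n} DD_3$
– Three-step: $LTE_{n+1} = \frac{T_1^2 T_2^2 T_3^2}{24 (T_1 T_2 + T_1 T_3 + T_2 T_3)} x^{(4)} \approx \frac{T_1^2 T_2^2 T_3^2}{T_1 T_2 + T_1 T_3 + T_2 T_3} DD_4$
Divided differences
$\frac{d^k x}{dt^k} \approx k! \, DD_k$, $DD_k(t_{n+1}) = \frac{DD_{k-1}(t_{n+1}) - DD_{k-1}(t_n)}{\sum_{i=1}^{k} h_{n+2-i}}$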
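The two-step LTE expression can be checked numerically. The sketch below estimates DD3 with Newton divided differences and compares the LTE estimate with the actual error of one Gear2 step on x(t) = t³, for which the third derivative is constant and the estimate is exact. The function names and the test function are illustrative assumptions.

```python
def divided_difference(ts, xs):
    """Newton divided difference of order len(ts) - 1 over the points (ts, xs)."""
    coeffs = list(xs)
    for order in range(1, len(ts)):
        for i in range(len(ts) - 1, order - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (ts[i] - ts[i - order])
    return coeffs[-1]


def gear2_lte(h_next, h_cur, dd3):
    """Magnitude of the two-step Gear LTE estimate for the step h_next."""
    return h_next ** 2 * (h_next + h_cur) ** 2 / (2.0 * h_next + h_cur) * abs(dd3)


def gear2_one_step_error(t_prev, t_cur, t_next, x, dx):
    """Error of one Gear2 step started from exact values of x at t_prev and t_cur."""
    h_next, h_cur = t_next - t_cur, t_cur - t_prev
    denom = 2.0 * h_next + h_cur
    alpha1 = (h_next + h_cur) ** 2 / (h_cur * denom)
    alpha2 = -h_next ** 2 / (h_cur * denom)
    beta0 = h_next * (h_next + h_cur) / denom
    x_num = alpha1 * x(t_cur) + alpha2 * x(t_prev) + beta0 * dx(t_next)
    return x_num - x(t_next)


def cube(t):
    return t ** 3


def cube_derivative(t):
    return 3.0 * t ** 2
```

For a general waveform the estimate is only approximate, since DD3 is itself computed from already-integrated points.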


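LTE-Based Time Step Control (Gear2)

Control the time step to meet an LTE tolerance
$h_{n+1} = k h_n$, $\frac{k^2 (k+1)^2}{2k+1} h_n^3 DD_3 \le \text{bound}$
LTE’s dependency on $h_n$ & $h_{n+1}$
$\frac{\partial LTE}{\partial h_{n+1}} = \frac{2 DD_3 \, h_{n+1} (h_{n+1}+h_n) \left[ (2h_{n+1}+h_n)^2 - h_{n+1}(h_{n+1}+h_n) \right]}{(2h_{n+1}+h_n)^2} > 0$
$\frac{\partial LTE}{\partial h_n} = \frac{DD_3 \, h_{n+1}^2 (h_{n+1}+h_n)(3h_{n+1}+h_n)}{(2h_{n+1}+h_n)^2} > 0$
Key observation
– Smaller $h_n$ ⇒ greater $h_{n+1}$, if $DD_3$ is nonincreasing
– Exploit for parallel computing

[Figure: steps $h_n$ and $h_{n+1}$ on the T axis, with $h_{n+1}$ to be determined]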
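The key observation can be reproduced with a few lines of arithmetic. This sketch solves the Gear2 LTE condition for the largest admissible ratio k = h_{n+1}/h_n by bisection, which works because the LTE grows monotonically with k. The function names, the bisection approach and the numeric values are assumptions for illustration; the slides do not say how the bound is solved.

```python
def lte_factor(k):
    """k-dependent part of the Gear2 LTE when h_next = k * h_cur."""
    return k ** 2 * (k + 1.0) ** 2 / (2.0 * k + 1.0)


def next_step(h_cur, dd3, bound, iterations=100):
    """Largest h_next = k * h_cur with lte_factor(k) * h_cur**3 * |dd3| <= bound."""
    scale = h_cur ** 3 * abs(dd3)
    if scale == 0.0:
        raise ValueError("DD3 is zero: the LTE bound does not limit the step")
    lo, hi = 0.0, 1.0
    while lte_factor(hi) * scale < bound:  # grow the bracket until it violates the bound
        hi *= 2.0
    for _ in range(iterations):            # bisection on the monotone LTE
        mid = 0.5 * (lo + hi)
        if lte_factor(mid) * scale <= bound:
            lo = mid
        else:
            hi = mid
    return lo * h_cur
```

Calling `next_step(0.05, 1.0, 1e-4)` and `next_step(0.1, 1.0, 1e-4)` with the same DD3 shows the effect: the shorter previous step admits the longer next step.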


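Parallel Backward Pipelining

Serial Gear2
[Diagram: time points t1, t2, t3, t4 with steps h2, h3, h4]
Double-threaded Gear2
– Initial conditions @ t1 & t2
– Tr1: t3 (h3, h2); Tr2: back to t3’
– Tr1: t4 (h4, h3’); Tr2: back to t4’
[Diagram: Thread 1 advances through t3 and t4 while Thread 2 fills in the backward points t3’ and t4’; steps h2, h3, h3’, h4, h4’ on the time axis]
Balance between efficiency and robustness: $h_i' = \alpha h_i$, $0 < \alpha < 1$
Extensible to multi-step methods (e.g. Gear3)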
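A minimal sketch of one double-threaded backward pipelining cycle follows, on the hypothetical test problem x' = λx. Both solves depend only on the initial conditions at t1 and t2, so they can be submitted as independent tasks. The placement of the backward point at t3’ = t3 - α·h3 is an assumption about the diagram, as are the function names and the value α = 0.5. Python threads are used only to show the task structure, not to demonstrate speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor

LAM = -1.0  # hypothetical linear test problem x' = LAM * x


def gear2_solve(t_prev, x_prev, t_cur, x_cur, t_new):
    """Implicit variable-step Gear2 solve at t_new from the points t_prev, t_cur."""
    h_new, h_cur = t_new - t_cur, t_cur - t_prev
    denom = 2.0 * h_new + h_cur
    alpha1 = (h_new + h_cur) ** 2 / (h_cur * denom)
    alpha2 = -h_new ** 2 / (h_cur * denom)
    beta0 = h_new * (h_new + h_cur) / denom
    return (alpha1 * x_cur + alpha2 * x_prev) / (1.0 - beta0 * LAM)


def backward_pipelined_cycle(t1, x1, t2, x2, h3, alpha=0.5):
    """Solve the standard point t3 and an assumed backward point t3' concurrently."""
    t3 = t2 + h3
    t3_back = t3 - alpha * h3  # assumption: h3' = alpha * h3 is measured back from t3
    with ThreadPoolExecutor(max_workers=2) as pool:
        standard = pool.submit(gear2_solve, t1, x1, t2, x2, t3)       # thread 1
        backward = pool.submit(gear2_solve, t1, x1, t2, x2, t3_back)  # thread 2
        return (t3_back, backward.result()), (t3, standard.result())


if __name__ == "__main__":
    points = backward_pipelined_cycle(0.0, 1.0, 0.01, math.exp(-0.01), 0.01)
    for t, x in points:
        print(f"t = {t:.3f}  x = {x:.8f}  exact = {math.exp(LAM * t):.8f}")
```

Under this reading, the next step to t4 would be taken from the pair (t3’, t3), whose short spacing is what permits a larger h4 under the LTE bound.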


Parallel Forward Pipelining

Move forwards in time
Exploit predictive computing along the forward T direction
Question
– How to resolve data dependency & ensure accuracy

[Diagram: backward pipelining (multi-step numerical integration) behind the current position on the T axis; forward pipelining (predictive computing) ahead of it]


Parallel Forward Pipelining

Ex: double threaded
– Init. condition @ t1 & t2
– Time point t3 (h3, h2)
– FE estimate sol@t3
– Time point t4 (h4, h3)
– Solve sol@t3 & sol@t4
– Time point t5 (h5, h4)
– FE estimate sol@t5
– Time point t6 (h6, h5)
– Solve sol@t5 & sol@t6

[Diagram: Thread 1 and Thread 2 on the time axis, points t1 to t6 with steps h2 to h6]
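The double-threaded forward scheme can be sketched on the hypothetical test problem x' = λx. One thread solves t3 implicitly while the other predicts sol@t3 with forward Euler (FE) and uses the prediction as an initial condition for t4. The assignment of tasks to threads, the function names and the final correction step are assumptions for illustration. For this linear problem the correction is a full re-solve; in a nonlinear circuit it would be further Newton iterations starting from the predicted solution.

```python
import math
from concurrent.futures import ThreadPoolExecutor

LAM = -1.0  # hypothetical linear test problem x' = LAM * x


def rhs(x):
    return LAM * x


def gear2_solve(x_prev, x_cur, h_new, h_cur):
    """Implicit variable-step Gear2 solve for the linear test problem."""
    denom = 2.0 * h_new + h_cur
    alpha1 = (h_new + h_cur) ** 2 / (h_cur * denom)
    alpha2 = -h_new ** 2 / (h_cur * denom)
    beta0 = h_new * (h_new + h_cur) / denom
    return (alpha1 * x_cur + alpha2 * x_prev) / (1.0 - beta0 * LAM)


def forward_pipelined_cycle(x1, x2, h2, h3, h4):
    """Return (x3, predicted x4, corrected x4) for one two-thread cycle."""

    def base_point():
        return gear2_solve(x1, x2, h3, h2)      # thread 1: solve at t3

    def forward_point():
        x3_est = x2 + h3 * rhs(x2)              # thread 2: FE estimate of sol@t3
        return gear2_solve(x2, x3_est, h4, h3)  # predictive solve at t4

    with ThreadPoolExecutor(max_workers=2) as pool:
        fut3, fut4 = pool.submit(base_point), pool.submit(forward_point)
        x3, x4_est = fut3.result(), fut4.result()
    x4 = gear2_solve(x2, x3, h4, h3)            # iterate on the converged x3
    return x3, x4_est, x4


if __name__ == "__main__":
    h = 0.01
    x3, x4_est, x4 = forward_pipelined_cycle(1.0, math.exp(-h), h, h, h)
    print(x3, x4_est, x4, math.exp(-3 * h))
```

The gap between `x4_est` and `x4` is the inaccuracy that the inter-thread communication in the deck is meant to remove.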


Complications

Time steps for forward points may not be estimated accurately
– Data dependency on initial conditions
– Apply a damping factor (β<1.0) for time step estimation
– Revoke forward results in thread scheduling cycle (covered later)
Forward points based on inaccurate initial conditions
– Addressed by inter-thread communication
– Tradeoffs provided by fine/coarse grained communications

[Diagrams: forward pipelining ahead of the base position on the T axis, one annotated "h=?" and one annotated "Accuracy?"]
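The damping and revocation ideas can be illustrated with the Gear2 LTE expression. In this sketch a forward thread shrinks its estimated step by a damping factor β, and the result is kept only if the step still meets the LTE bound once the converged DD3 is known. The value β = 0.8, the function names and the acceptance test are hypothetical; the slides state only that β < 1.0 and that overestimated forward results are revoked.

```python
def gear2_lte(h_next, h_cur, dd3):
    """Magnitude of the two-step Gear LTE estimate for the step h_next."""
    return h_next ** 2 * (h_next + h_cur) ** 2 / (2.0 * h_next + h_cur) * abs(dd3)


def plan_forward_step(h_estimate, beta=0.8):
    """Damp a predicted forward step; beta < 1.0 guards against overestimation."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return beta * h_estimate


def forward_result_valid(h_forward, h_cur, dd3_converged, bound):
    """Keep a forward result only if its step meets the LTE bound after the fact."""
    return gear2_lte(h_forward, h_cur, dd3_converged) <= bound
```

With h_cur = 0.1, DD3 = 1 and a bound of 1e-4, an undamped estimate of 0.035 fails the check, while the damped step 0.8 × 0.035 = 0.028 passes.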


Coarse-Grained Inter-thread Communication

[Diagram: Thread 1 (time point 1), Thread 2 (time point 2) and Thread 3 (time point 3) each run FE estimation, then a Newton loop of one or more iterations, until convergence. Later threads iterate on the converged initial condition.]


Fine-Grained Inter-thread Communication

Communicate at the granularity of NR iterations
Beneficial to large circuits

[Diagram: Thread 1 (time point 1), Thread 2 (time point 2) and Thread 3 (time point 3) each run FE estimation, then NR iterations 1, 2, 3, until convergence, communicating between NR iterations]


Multi-threaded WavePipe

Combine backward with forward waveform pipelining
Ex: 4T (1-backward-2-forward) WavePipe
– T1: standard (base Gear2 point)
– T2: backward
– T3: forward
– T4: 2nd forward

[Diagram: starting from the initial solutions, one thread scheduling cycle in which each of T1 to T4 performs a time step consisting of FE estimation followed by Newton iterations]


Thread Scheduling

The work done over an overestimated step is discarded
Standard 4-thread WavePipe (1-backward-2-forward scheme)

[Diagram, without step size overestimation: a cycle starts from the initial conditions, the backward, forward and 2nd forward points are solved, the cycle completes, and the next cycle starts]
[Diagram, with step size overestimation: a cycle starts from the initial conditions and only partially completes; the next cycle then starts and completes]


Experimental Setup

An 8-processor Linux server with four dual-core processors
WavePipe implemented in C/C++ using Pthreads (Gear2)
Compare with
– Reference serial SPICE-like (Gear2) transient simulation
– Low-level parallel matrix solve (SuperLU) and device evaluation

Test circuits

Index  Circuit            Size     Time Points  Serial Run Time (s)
1      VCO                20       86,023       37.59
2      Power Amplifier    8        113,972      30.12
3      DB mixer           27       134,612      48.11
4      Ring Oscillator    61       110,037      206.37
5      Frequency Divider  17       44,795       18.49
6      Digital Adder      112      2,558        8.93
7      RLC mesh 1         13,097   664          2,704.08
8      RLC mesh 2         27,670   143          2,659.35


Experimental Results – Accuracy & Profiling

[Figure: 3-T (1 backward + 1 forward) WavePipe vs. serial (DB mixer)]
[Figure: Real-time threading profiling (mesh ckt)]


Experimental Results – 2T Speedups

2T 1-backward & 2T 1-forward

Circuit            2T 1-backward        2T 1-forward
                   T(s)     Speedup     T(s)     Speedup
VCO                27.3     1.38        23.1     1.63
Power Amplifier    21.7     1.39        18.1     1.66
DB mixer           36.9     1.30        30.8     1.56
Ring Oscillator    149.3    1.38        121.9    1.69
Frequency Divider  15.3     1.21        12.6     1.47
Digital Adder      7.3      1.22        6.0      1.49
RLC mesh 1         2245.1   1.20        1814.6   1.49
RLC mesh 2         2159.3   1.23        1742.2   1.53
Average                     1.29X                1.57X
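The speedup columns follow from the serial run times in the test-circuit table: speedup is the serial run time divided by the parallel run time. The short sketch below recomputes them from the numbers in the two tables. Because the tabulated run times are rounded to 0.1 s, the recomputed values agree with the reported ones only to about two decimal places. The dictionary and function names are illustrative.

```python
# Serial run times (s) from the test-circuit table.
SERIAL = {
    "VCO": 37.59, "Power Amplifier": 30.12, "DB mixer": 48.11,
    "Ring Oscillator": 206.37, "Frequency Divider": 18.49,
    "Digital Adder": 8.93, "RLC mesh 1": 2704.08, "RLC mesh 2": 2659.35,
}

# Two-thread run times (s): (1-backward, 1-forward).
TWO_THREAD = {
    "VCO": (27.3, 23.1), "Power Amplifier": (21.7, 18.1),
    "DB mixer": (36.9, 30.8), "Ring Oscillator": (149.3, 121.9),
    "Frequency Divider": (15.3, 12.6), "Digital Adder": (7.3, 6.0),
    "RLC mesh 1": (2245.1, 1814.6), "RLC mesh 2": (2159.3, 1742.2),
}


def speedups(serial, parallel):
    """Speedup per circuit: serial run time divided by parallel run time."""
    return {name: serial[name] / parallel[name] for name in serial}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


backward = speedups(SERIAL, {name: t[0] for name, t in TWO_THREAD.items()})
forward = speedups(SERIAL, {name: t[1] for name, t in TWO_THREAD.items()})

if __name__ == "__main__":
    for name in SERIAL:
        print(f"{name:18s} {backward[name]:.2f} {forward[name]:.2f}")
    print(f"{'Average':18s} {mean(backward.values()):.2f} {mean(forward.values()):.2f}")
```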


Experimental Results – 3T Speedups

3T 1-backward-1-forward & 3T 2-forward

Circuit            3T 1-back-1-forward  3T 2-forward
                   T(s)     Speedup     T(s)     Speedup
VCO                20.3     1.85        19.6     1.92
Power Amplifier    16.2     1.86        15.4     1.96
DB mixer           27.6     1.74        26.3     1.83
Ring Oscillator    112.4    1.84        107.2    1.93
Frequency Divider  11.2     1.65        10.7     1.73
Digital Adder      5.4      1.65        5.1      1.75
RLC mesh 1         1679.6   1.61        1559.0   1.73
RLC mesh 2         1589.3   1.67        1487.4   1.79
Average                     1.73X                1.83X


Experimental Results – 4T Speedups

4T 1-backward-2-forward & 4T 3-forward

Circuit            4T 1-back-2-forward  4T 3-forward
                   T(s)     Speedup     T(s)     Speedup
VCO                16.8     2.24        16.1     2.33
Power Amplifier    13.8     2.18        13.2     2.28
DB mixer           22.7     2.12        21.6     2.23
Ring Oscillator    94.7     2.18        91.0     2.27
Frequency Divider  9.2      2.01        8.7      2.16
Digital Adder      4.5      2.03        4.2      2.13
RLC mesh 1         1390.2   1.95        1324.6   2.04
RLC mesh 2         1330.8   2.00        1265.4   2.10
Average                     2.09X                2.19X
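One way to read the three speedup tables together is through parallel efficiency, defined here as speedup divided by thread count. This metric is an addition for illustration and is not reported in the slides; the inputs are the average speedups from the 2T, 3T and 4T tables.

```python
# Average speedups reported in the 2T, 3T and 4T tables, keyed by thread count.
AVERAGE_SPEEDUP = {
    2: {"1-backward": 1.29, "1-forward": 1.57},
    3: {"1-backward-1-forward": 1.73, "2-forward": 1.83},
    4: {"1-backward-2-forward": 2.09, "3-forward": 2.19},
}


def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear speedup achieved with the given thread count."""
    return speedup / threads


def best_efficiency_by_threads(table):
    """Efficiency of the fastest scheme at each thread count."""
    return {
        threads: parallel_efficiency(max(schemes.values()), threads)
        for threads, schemes in table.items()
    }


if __name__ == "__main__":
    for threads, eff in sorted(best_efficiency_by_threads(AVERAGE_SPEEDUP).items()):
        print(f"{threads} threads: efficiency {eff:.3f}")
```

By this measure efficiency falls as threads are added, which is consistent with the deck pairing WavePipe with fine-grained parallelism at 8 threads rather than adding more pipelined time points.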


Experimental Results – Runtime Scaling

[Figure: runtime scaling, 2-4 threads]


Experimental Results

Low-level scheme
– Parallel matrix solve & device model evaluation
Proposed scheme
– 1-4 threads: WavePipe
– 8 threads: 3-forward WavePipe + parallel matrix solve & model evaluation


Summary

Multi-core challenges & opportunities for EDA
Application-level coarse-grained parallelism for transient simulation
– Parallelize at the granularity of a single time-point circuit solution
– Inherently low inter-core communication overhead
– Maintain accuracy & convergence
– Ease of implementation and code reuse
Rich sets of parallelism for multi-core or many-core systems
– New parallel opportunities orthogonal to fine-grained schemes
– Pair with parallel matrix solve, device evaluation and low-level parallel programming assists


Thanks