Chapter 1
Introduction
The advent of synthesis systems for Very Large Scale Integrated Circuits (VLSI)
and automated design environments for Application Specific Integrated Circuits (ASICs)
has allowed digital systems designers to place large numbers of gates on a single IC in
record time. Generation of test patterns for these circuits to ensure that they are fault free,
however, still consumes considerable time. Currently, up to one third of the design time for
ASICs is spent generating tests [1].
Many algorithms have been developed to automate the test generation process
[2],[3],[4], but the test generation problem has been shown to be NP-complete [5]. This
thesis deals with the application of parallel processing techniques to Automatic Test Pattern
Generation (ATPG) to address this problem.
1.1 Motivation
There are two basic approaches to solving the Automatic Test Pattern Generation
(ATPG) problem: algorithmic test pattern generation, and statistical or pseudorandom test
pattern generation. In the algorithmic approach, a test is generated for each fault in the
circuit using a specific ATPG algorithm. Most of these algorithms can be proven to be
complete. That is, they are guaranteed to find a test for a fault if a test does exist. However,
this process may involve a search of the entire solution space which is computationally
expensive.
Statistical or pseudorandom test pattern generation, on the other hand, selects test
patterns at random, or using some heuristic, and determines the faults that are detected by
these patterns using fault simulation. Test patterns are selected and added to the test set if
they detect any previously undetected faults. This process continues until some required
fault coverage or computation time limit is reached. This method finds tests for the easy-
to-detect faults very quickly, but becomes less and less efficient as the easy-to-detect faults
are removed from the fault list and only the hard-to-detect faults are left. In many cases, the
required fault coverage cannot be achieved without excessive computation times.
An efficient combined method for solving the ATPG problem uses statistical
methods to find tests for the easy-to-detect faults on the fault list and switches to an
algorithmic method to find tests for the hard-to-detect faults which remain. Using either this
method or the purely algorithmic method, a significant portion of the computation time will
be spent generating tests for the hard-to-detect faults algorithmically. Therefore, finding a
method to speed up this process should reduce the overall computation time considerably.
Much research has been done in increasing the efficiency of algorithms for ATPG
through heuristics [6],[7],[8]. However, the overall gains that can be achieved through these
improvements are limited and will not be adequate for future needs. This statement can be
justified by two facts. First, no system currently presented in the literature has been proven
on circuits that contain combinational logic blocks larger than 3 or 4 thousand gates.
Second, most sequential ATPG techniques are based on combinational ATPG algorithms
[9],[10]. These systems require that multiple passes be made through the ATPG process in
order to generate a test for a single fault. Therefore, any excessive runtimes will be
multiplied by this process and achieving fast combinational ATPG becomes even more
important.
An alternate approach to heuristics in reducing computation times is to use parallel
processing techniques. Parallel processing machines are becoming available for general use
and are being used to solve other problems in Computer Aided Design [11]. Most of these
readily available parallel processors are distributed memory machines due to cost and
scalability factors. Operating systems are also being developed that allow simple networks
of workstations to be used as distributed memory parallel computing environments [12].
Previous efforts to parallelize the ATPG problem can be placed in one of five
categories: fault partitioning, heuristic parallelization, search space partitioning,
algorithmic partitioning, and topological partitioning [13],[14]. These techniques, which
will be more fully detailed in Chapter 2, usually require each processing node in a
distributed memory system to contain the entire circuit description. However, the
increasing size of VLSI circuits has caused the amount of memory required to process these
circuits to grow rapidly. Topological partitioning techniques can be used to distribute the
circuit database across several processors thereby increasing the size of the largest circuit
that can be processed on a given distributed memory configuration.
For example, results from the EST system [15] have shown that the memory
requirements for processing the ISCAS ‘85 [16] benchmark circuit C7552 can be over 9
MBytes. This circuit contains only 3512 gates, which is relatively small compared to
state-of-the-art VLSI devices. At this rate, a circuit of only 10,000 gates could take as much as
25 MBytes of memory to process. To process circuits this large or larger, typical
commercially available distributed memory multicomputers must be able to take advantage
of the memory space spread across all of their nodes. Thus, topological partitioning of
the database across several processors will be required to perform ATPG on these larger
circuits.
Previous research in topological partitioning for ATPG [17],[18] has focused on the
D-algorithm [2]. The initial effort contained in [17] was directed toward a shared memory
parallel processor, hence, the parallelism exploited was fairly fine grained. The results of
the effort to port this system to a distributed memory multicomputer were mixed [18].
Some speedup was obtained, but the large number of messages required even for simple
circuits significantly limited the speedup. The parallelism exploited in these two systems
was limited to a parallelized implication procedure and fault partitioning which was used
to keep idle processors busy on other faults. The fact that the D-algorithm, which has been
shown to be very inefficient for some classes of circuits, was used in these systems also
increased the overall runtimes and limited the speedup possible. One of the most promising
results presented in [18], however, was that topological partitioning resulted in significant
reductions in the memory required in each processing node. For these reasons, research into
topological partitioning with a more efficient ATPG algorithm such as PODEM [3] was
undertaken.
1.2 Goals
This research focused on a system that is based on topological partitioning of the
circuit-under-test across several processing nodes. The goals of this research included
expanding on previous work [17],[18], and extending it to a more efficient base ATPG
algorithm. Analytical models of the topologically partitioned ATPG process were
developed to help predict the performance that could be expected. These models, once
validated through experimentation, were then used to predict the performance of the ATPG
system. The models were also used to determine the communications latency required on a
multicomputer to efficiently utilize this technique to achieve speedups. Another goal of this
research was to develop parallelizations of the base ATPG algorithm to increase speedup.
Investigations of how these parallelization methods could be used in conjunction with other
parallelization methods such as fault or search space partitioning were also undertaken.
Finally, this research outlined the additional work that will be required to make topological
partitioning a valid addition to ATPG systems for large scale designs.
1.3 Organization
This dissertation is divided into seven chapters, including this introduction. Chapter 2
contains background material which includes a brief review of serial and parallel ATPG
algorithms. A discussion of the ES-KIT distributed memory multicomputer and the ES-TGS
parallel ATPG system used in this research is also included in Chapter 2. Chapter 3
describes the implementation details and results of the serial Topological Partitioning Test
Generation System (TOP-TGS) developed for this research. Chapter 4 details the analytical
model of the serial ATPG process and topologically partitioned ATPG. Predicted results
are also developed in this chapter and compared to the actual results presented in Chapter
3. Chapter 5 details the algorithmic parallelizations developed for the TOP-TGS system
and presents their results. Chapter 6 presents the results of using multiple parallelizations
in the TOP-TGS system. Finally, Chapter 7 presents conclusions and future work. The
future work section includes a discussion of how many of the heuristics presented in the
literature could be implemented in a topologically partitioned ATPG system.
Chapter 2
Background
This chapter presents the background material for the thesis. A brief presentation of
serial ATPG algorithms is included to familiarize the reader with the ATPG problem. Next
a discussion of the techniques available to parallelize ATPG is presented. Finally, the
distributed memory multicomputer, ES-KIT, and the parallel test generation system, ES-
TGS, that this work was based upon are presented.
2.1 Serial ATPG
Most parallel ATPG algorithms, including the ones to be presented here, are based
upon widely known serial ATPG algorithms. For a detailed discussion of ATPG
algorithms, the interested reader is referred to [19] and [20]. For this research, we will
consider only algorithms designed to generate tests for single stuck-at faults. These are
physical faults that cause a node in the circuit to behave as if it were stuck at a logic 0 or a
logic 1 level. The single stuck-at fault model is a simplification of the types of faults found
in real circuits, but empirical evidence shows that for most common implementation
technologies, it provides very high coverage of physical faults [21].
Automatic Test Pattern Generation can be thought of as the process of searching
through the entire space of possible input patterns for a circuit in an attempt to find one
which causes the output to differ depending on whether or not a circuit contains a specific
fault. The size of the search space is 2^n, where n is the number of inputs to the circuit.
Because the search space is so large, many techniques have been developed to guide the
search process. Most of the search techniques in popular use today fall into the class of
algorithms called path runners. Path runners attempt to detect a fault by sensitizing it, and
then sensitizing a path between the faulty node and a primary output. Sensitizing a fault
consists of setting the value on the faulty node opposite to the stuck-at value, i.e., setting a logic
‘1’ on a node being tested for a stuck-at ‘0’ fault. Sensitizing a path consists of setting logic
values along a path in the circuit from the faulty node to the primary outputs such that a
change of the logic value on the node is observable at the primary output. For example, if
an AND gate is in the sensitized path, setting all of its inputs not in the path to a logic ‘1’
will result in its output value following the value of the input in the path. In order for an
algorithm to be complete, it must search all paths and combinations of paths in the circuit
from the faulty node to the primary outputs.
The major difference between the path sensitization algorithms presented in
[2],[3],[4] is the method and order by which the actual logic values are assigned to nodes
in the circuit. The D-algorithm [2], attempts to set logic values on nodes in the circuit by
assigning values to nodes which precede them in the circuit topology. The PODEM [3]
algorithm attempts to assign values to nodes in the circuit by assigning values to the
circuit’s primary inputs only. Because typical circuits have fewer inputs than internal
nodes, frequently by orders of magnitude, the search space enumerated by PODEM is much
smaller than the D-algorithm. Because this reduction of the search space makes PODEM
much more efficient than the D-algorithm, PODEM is the basis for most follow-on
algorithms that have been developed for ATPG [5],[6],[7],[8],[15]. For this reason,
PODEM was also selected as the base algorithm for this work whereas the D-algorithm
served as the basis for previous work in topological partitioning [17],[18]. A brief example
of the PODEM algorithm will now be presented to familiarize the reader with this
technique.
Figure 2.1 contains a diagram of the circuit that will be used for the purposes of this
discussion. The PODEM algorithm consists of three major processes: selection of the next
objective, backtracing that objective to an unassigned primary input to determine the value
that should be assigned to it, and assigning that value to the input and implying all node
values in the circuit affected by that assignment. This latter simulation-like process is called
forward implication.
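
The interaction of these three processes can be summarized in the following sketch. All type and function names here are hypothetical, chosen only to mirror the description above; the helper routines (imply, backtrack) are sketched later in this section:

    // Illustrative sketch of the PODEM search loop (hypothetical names).
    // A test is found when the assignments on 'inputs' sensitize the fault
    // and propagate a D or D-bar to a primary output.
    bool podem(Circuit& ckt, const Fault& fault, std::stack<Assignment>& inputs) {
        while (!test_found(ckt, fault)) {
            if (test_impossible(ckt, fault)) {      // fault node clobbered, or
                if (!backtrack(ckt, inputs))        // D-frontier has disappeared
                    return false;                   // search space exhausted
                continue;
            }
            Objective obj = select_objective(ckt, fault); // sensitize the fault,
                                                          // then extend the D-frontier
            Assignment a = backtrace(ckt, obj);  // map the objective to an
            inputs.push(a);                      // unassigned primary input
            imply(ckt, a);                       // forward implication of the input
        }
        return true;   // test vector: assignments on the stack, others don't-care
    }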
For example in Figure 2.1, consider the fault of line J stuck-at a logical ‘1’. The first
objective selected might be to sensitize the fault by setting node J to ‘0’. Since both of the
inputs to the OR gate which drives J need to be set to ‘0’ in order to set J to a ‘0’, setting
one of them to this value becomes the next objective. The next step would be to backtrace
this objective to a primary input. Figure 2.2 illustrates this process. Assume for the sake of
discussion that line G is chosen to be set first. Typically, testability measures such as
controllability and observability measures are used to assist in making these types of
decisions. In order to set line G to a ‘0’, either of the inputs to the AND gate that drives it
must be set to ‘0’. Input A is selected and it along with its value are pushed on a stack that
is used to hold the input search space.
The next step is to actually assign a value of ‘0’ to input A and simulate the effect
of this assignment on the rest of the circuit. This process is called forward implication.
Assigning a ‘0’ to A will cause the AND gate to drive node G to a ‘0’. No other nodes in
the circuit will be affected by this assignment.
Since the objective of setting node J to a ‘0’ has not been satisfied, it will be
backtraced again. This backtrace will determine that node I must be set to ‘0’ and this must
be accomplished by setting input E to ‘0’ and node H to ‘0’. Node H may be set to ‘0’ by
setting input C to ‘0’. This assignment is pushed on the stack and implication is performed.
Next, input E is pushed on the stack with its value and forward implication is done. This
implication will result in node J taking on the required ‘0’ value. This ‘0’ value represents
the value node J would have in the fault-free circuit, but node J will assume a value of ‘1’
Figure 2.1 Circuit under test. [Schematic: primary inputs A through F, internal nodes G, H, I, J, K, primary output L; node J is marked s-a-1.]
in the presence of a stuck-at ‘1’ fault. This set of values is represented using the “D”
notation of [2]. A node with a value of ‘D’ represents a ‘1’ in the good circuit and a ‘0’ in
the faulty circuit. A node with a value of ‘D̄’ (“D-bar”) represents a value of ‘0’ in the good
circuit and a ‘1’ in the faulty circuit. Since node J is ‘0’ in the good circuit and ‘1’ in the faulty, its
value will be represented by a ‘D̄’. Figure 2.3 shows the state of the circuit and the input
stack after the assignments A = ‘0’, C = ‘0’, and E = ‘0’.
The final step in generating a test for node J stuck-at ‘1’ is to make the value on node
Figure 2.2 Backtracing objective J = 0. [Schematic: the first objective sets line J to ‘0’; it is backtraced to line G set to ‘0’, and then to line A set to ‘0’.]
Figure 2.3 Circuit state and input stack. [Schematic: A = ‘0’, C = ‘0’, E = ‘0’ on the input stack; G = H = I = ‘0’ and J = D̄.]
J visible on the primary output. This process is called propagation and it involves
sensitizing a path from the node to the output. This path is sensitized by setting all inputs
to gates on the path to their non-controlling values. The non-controlling values for an AND
or NAND gate is ‘1’ and for an OR or NOR gate is ‘0’. XOR and XNOR gates do not have
a controlling value, so either a ‘0’ or ‘1’ may work. All gates which have a ‘D’ or ‘D’ on
one of their inputs and an unknown, ‘X’, on their outputs are on potential paths. These gates
constitute what is known as the D-frontier. In the example circuit, the gate OR gate that
drive node L is on the D-frontier at this point. Node K must be set to the non-controlling
‘0’ value so that the value on the output L follows the value of node J. This task may be
accomplished by setting input F to a ‘0’. The final circuit state and input stack are illustrated
in Figure 2.4.
A test vector is generated by simply removing all input assignments from the
stack. All inputs not in the stack are assigned don’t-cares in the test vector. In the example
circuit, the vector “0X0X00” would be the test vector for fault J stuck-at ‘1’.
If, during the assignments of input values, an assignment is made that makes a test
no longer possible, the last input assignment is popped off of the stack and the alternate
value is tried. This process is called backtracking and it continues until a circuit state where
Figure 2.4 Circuit state and input stack. [Schematic: F = ‘0’ added, giving K = ‘0’ and L = D̄; the input stack A, C, E, F (all ‘0’) now constitutes the test.]
a test is again possible is reached or the stack becomes empty. Situations that may cause a
test to be impossible include setting the faulty node to the same value as the stuck-at fault,
and the disappearance of the D-frontier.
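
Continuing the hypothetical interface of the earlier sketches, the backtracking step might look like the following; each stack entry is assumed to remember whether its alternate value has been tried:

    // Illustrative backtracking step (hypothetical names).
    bool backtrack(Circuit& ckt, std::stack<Assignment>& inputs) {
        while (!inputs.empty()) {
            Assignment a = inputs.top();
            inputs.pop();
            unimply(ckt, a);                // return the affected nodes to 'X'
            if (!a.alternate_tried) {
                a.value = invert(a.value);  // try the opposite logic value
                a.alternate_tried = true;
                inputs.push(a);
                imply(ckt, a);
                return true;                // resume the search from this state
            }
        }
        return false;                       // stack empty: no test exists
    }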
For example, consider the circuit of Figure 2.1 with a stuck-at ‘0’ fault on node K.
The first objective would be to set node K to a ‘1’. The first backtrace could lead to the
assignment of a logical ‘0’ on input F. Then in order to set node K to a ‘1’, node I must be
assigned a ‘1’. This task may be accomplished by assigning E=‘1’. However, when node I
is set to ‘1’, node J becomes ‘1’ and node L is forced to a ‘1’ regardless of the value on node
K. This assignment makes a test impossible with this input vector because the fault cannot
be observed at the primary output. Figure 2.5 illustrates this situation.
Backtracking will occur at this point and the alternate assignment of node E=’0’
will be tried. This assignment will also cause a test to be impossible because node K will
then be set to ‘0’, the same value as the stuck-at fault. Since assignments of both logic
values to node E did not result in a test, backtracking to the previous assignment must
occur. Thus, node F is now assigned a value of ‘1’. A test can now be found by assigning
a ‘0’ value to nodes E, C, and A as shown in Figure 2.6.
Backtracking produces an ordered search of the solution space and implicitly prunes
the search tree when inconsistent states are encountered, such as
Figure 2.5 Circuit state and input stack. [Schematic: with F = ‘0’ and E = ‘1’ on the stack, I = J = ‘1’ forces L to ‘1’ regardless of K = D; the fault cannot be observed, so no test is possible.]
assigning a value of ‘0’ to input F in the previous example. By checking the two alternate
assignments of input E and determining that they are both inconsistent, the entire portion
of the search tree below F=‘0’ may be pruned.
2.2 Parallel ATPG
This section provides a brief discussion of the methods that have been used to
parallelize the ATPG process. For a more detailed presentation of parallel ATPG
techniques, the reader is referred to [14]. These techniques can be divided up into 5
categories [13],[14]:
1) Fault Partitioning
2) Heuristic Parallelization
3) Search Space Partitioning
4) Functional (Algorithmic) Partitioning
5) Topological Partitioning
The simplest way to parallelize the ATPG problem is to divide up the fault list among
multiple processors. Each processor then generates tests for each fault on its portion of the
fault list until all faults have been detected. This scheme results in each processor having a
completely separate task in that it performs the entire test generation procedure on its own.
This method of parallelization has been termed fault partitioning. If the fault list is divided
up carefully, each processor will have roughly the same amount of work to do and they will
all finish in about the same time. In practice, optimal partitioning of the fault list is not easy
Figure 2.6 Circuit state and input stack. [Schematic: after backtracking to F = ‘1’ and assigning E, C, A = ‘0’, the circuit has K = D and L = D, completing the test; the earlier F = ‘0’ branch is marked "test not possible".]
to do a priori, so the scheduling can be done dynamically with each processor requesting a
new fault from a master scheduler whenever it is idle. Dynamic scheduling requires
increased communications overhead due to the requests from idle processors for new faults
to process. The fault partitioning method is very suitable for coarse grained parallel systems
because synchronization is only necessary when a new fault is needed from the remaining
fault list.
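
The dynamic form of the scheme can be pictured with the following sketch. The master/worker split is as described above, but the names are invented and, for brevity, the master is shown with shared memory primitives rather than the remote method invocations a distributed memory machine would actually use:

    // Illustrative dynamic fault scheduling for fault partitioning.
    class FaultScheduler {
        std::deque<Fault> faults;    // the remaining fault list
        std::mutex lock;             // stands in for message-based arbitration
    public:
        explicit FaultScheduler(std::deque<Fault> f) : faults(std::move(f)) {}
        // Called by a worker whenever it becomes idle; false when no faults remain.
        bool next_fault(Fault& out) {
            std::lock_guard<std::mutex> guard(lock);
            if (faults.empty())
                return false;
            out = faults.front();
            faults.pop_front();
            return true;
        }
    };

    // Each worker holds the entire circuit and runs serial ATPG on its faults.
    void worker(FaultScheduler& master, Circuit& ckt) {
        Fault f;
        while (master.next_fault(f)) {   // one synchronization per fault
            std::stack<Assignment> inputs;
            podem(ckt, f, inputs);
        }
    }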
The biggest disadvantage of fault partitioning is that the setup time will be large.
The entire ATPG program and circuit database must be loaded into each processor’s
memory across the message fabric. If the total amount of work that can be divided up
among the processors is large (i.e. the fault list is long) then the percentage of time spent
on setup can be kept small and this scheme has promise. If the circuit has a small number
of faults, or fault classes, then the speedup will be limited, as the discussion above suggests.
In any case, this method does not scale well because of the large setup time. Also,
performance of this method is poor if there are only a few hard-to-detect faults which
account for most of the processing time. Because processors cannot cooperate in generating
a test for the same fault, one or two processors could take hours to generate a test for these
hard-to-detect faults while the others stand idle. Many typical circuits have only a few hard-
to-detect faults and fall into this category. Results for systems which use this technique
show that linear speedup is possible only for a small number of processors, usually less than
ten [22],[23]. Clearly, this method of parallelization is less than optimum although it has
the benefit of being the simplest to implement.
Because ATPG is an NP-complete problem [5], heuristics are used to guide the
search process. Research has indicated that many heuristics will produce a test for a given
fault within some computation time limit when other heuristics have failed to do so [24].
These complementary heuristics can be used in a multiprocessor system to aid in the ATPG
process. There are two basic strategies for heuristic parallelization: a variation of the fault
partitioning scheme discussed above, and concurrent parallel heuristics [25].
In the variation of the fault partitioning method, called uniform partitioning, the
fault list is divided up among the processors and each generates tests for the faults on its
own portion of the list. In generating the tests, however, multiple heuristics are used in
sequential order to attempt to generate a test. If a heuristic fails to generate a test within a
time limit, that heuristic is discontinued and the next one in the list is begun. This scheme
has the same advantages and disadvantages as the fault partitioning scheme discussed
above. However, it will be slightly better in some cases because the multiple heuristics will
shorten the test generation time for hard-to-detect faults.
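
The per-fault heuristic sequencing of uniform partitioning can be sketched as follows; the names and the time-limit mechanism are assumptions of the sketch:

    // Illustrative uniform-partitioning inner loop: try each heuristic in
    // turn, abandoning an attempt when its time allotment expires.
    bool generate_test(Circuit& ckt, const Fault& f,
                       const std::vector<Heuristic>& heuristics,
                       std::chrono::seconds limit_per_heuristic) {
        for (const Heuristic& h : heuristics) {
            if (podem_with_heuristic(ckt, f, h, limit_per_heuristic))
                return true;          // this heuristic found a test in time
        }
        return false;                 // every heuristic timed out on this fault
    }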
In the concurrent parallel heuristic method, the system is required to have (m x n)
processors, where n is the number of different heuristics available. If m is equal to one, each
processor computes a test for the same fault using one of the n heuristics. Whenever a
processor succeeds in generating a test for the fault, it sends a “stopwork” message to the
other processors in the cluster and they stop processing that fault. A new fault is selected
from the fault list and the process begins again. If m is greater than one, the processors are
clustered into groups of n and each cluster works on a separate fault. In this case, the system
is actually using a combination of the fault partitioning and heuristic parallelization
schemes. The concurrent parallel heuristic method has the potential to achieve greater
speedups than the uniform partitioning method due to possible anomalies in the ordering of
the heuristics for different faults.
The main disadvantage of the heuristic techniques discussed previously is that the
processors that are working on the same fault with a different heuristic are not guaranteed
to be searching disjoint portions of the search space. That is, all of the heuristics may lead
the ATPG program down the same path towards a non-solution.
A better way to parallelize work on a single fault is to divide up the search space
into disjoint pieces and evaluate them concurrently. This approach is a parallel
implementation of the branch and bound method which involves concurrent evaluation of
subproblems [26],[27]. This technique is called OR parallelism and its application to ATPG
is presented in detail in [28],[29]. Search space partitioning involves dividing up the search
space such that subproblems skipped by one processor are evaluated by another. The search
spaces for the processors are therefore disjoint and are spread across the solution space as
far as possible to maximize the area of the current search. This organization increases the
chances of finding a valid solution quickly.
The process of dividing up a search tree is illustrated in Figure 2.7. The search space
belonging to processor X is divided up into 2 parts for processors X and Y. Notice that the
processors are in fact always working on different problems (i.e. disjoint search spaces) and
that the place where each processor will backtrack to is different. If processor X finds a
conflict, it will backtrack and try an alternate value for input A. Processor Y will backtrack
and try an alternate value for input C in case of a conflict. This approach keeps the current
search space as large as possible which tends to make the search more efficient.
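
One way to realize this division is sketched below: the stack of input assignments is copied to the idle processor with the shallowest untried alternate flipped, so that the two processors backtrack at different points, as in Figure 2.7. Representing the stack as a vector, and all names, are assumptions of the sketch:

    // Illustrative OR-parallel division of the current search space.
    std::vector<Assignment> split_for_idle_processor(std::vector<Assignment>& mine) {
        for (std::size_t i = 0; i < mine.size(); ++i) {
            if (!mine[i].alternate_tried) {
                // Copy the assignments down to the split point, flipping the
                // branch that is being given away.
                std::vector<Assignment> theirs(mine.begin(), mine.begin() + i + 1);
                theirs[i].value = invert(theirs[i].value);
                theirs[i].alternate_tried = true;  // the idle processor backtracks
                                                   // only above this point
                mine[i].alternate_tried = true;    // and this processor never
                return theirs;                     // revisits the branch it gave away
            }
        }
        return {};    // no untried alternates remain: nothing to split
    }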
A major problem with search space partitioning is that it also requires a long setup
time. Each processor must have the entire circuit database and ATPG program loaded into
it. On the other hand, processors are dedicated to only one task which does not change and
the tasks are completely independent. This fact makes the overhead due to communications
Figure 2.7 Division of search tree. [Schematic: the search tree of processor X is divided in two; after the split, processor X backtracks at input A while processor Y backtracks at input C.]
very low and results in greater efficiency. Search space division is therefore most
appropriate for circuits that contain a small number of hard-to-detect faults which take up
a great deal of computation time. It is also ideally suited to message passing systems
because of its coarse grained parallelism.
There is another technique that can be used to allow more than one processor to
work simultaneously on finding a test for a single fault. This technique is called functional
partitioning. Functional partitioning refers to the process of dividing up an algorithm into
independent subtasks. These independent subtasks can then be executed on separate
processors in parallel. This method of parallelization is also known as algorithmic or AND
parallelism.
Most serial ATPG algorithms are difficult to parallelize functionally. The few
subtasks that can be identified, such as fault sensitization and path sensitization are not
independent. That is, action taken to perform one of these processes may change the circuit
state such that it has a side effect or causes an inconsistency in another process. Justification
of two goals cannot, in general, be done simultaneously. One way to allow parallelism in
justification is to perform justification for goals in different faults simultaneously. This
parallelism is an adaptation of the fault partitioning scheme already discussed.
In all of the parallel algorithms discussed thus far, each processor has to have access
to the entire circuit database. This requirement may be a problem for large circuits because
each node may not have enough memory to hold the entire circuit database. Also, loading
the database into memory in a message passing system takes time. Topological partitioning
of the circuit into separate partitions and instantiating each on a different processor would
help alleviate this problem.
Researchers have been investigating topological partitioning for parallel logic
simulation for some time. Although logic simulation is a different problem, it has some
similarities to algorithmic ATPG. A discussion of circuit partitioning for parallel logic
simulation is included in [30]. The objective of the partitioning scheme is to reduce the
communications necessary between partitions as much as possible while maximizing the
amount of work that can be done concurrently within the partitions. This paper analyzes 6
different partitioning schemes: random partitioning, natural partitioning, partitioning by
gate level, partitioning by element strings, and partitioning by fanin and fanout cones. Fanin
cones are an attempt to place all gates connected to a single primary input (even through
other gates) in the same group. Fanout cones are constructed the same way using primary
outputs.
The results presented in [30] indicate that for simulation, random partitioning
scores the best in maximizing concurrency, but worst in interprocessor communications.
This condition would make random partitioning a bad choice for most systems. Partitioning
by fanin and fanout cones offers the best trade off between concurrency and interprocessor
communications with fanout cones being slightly better. This result is most likely due to
the fact that fanout cones closely fit the flow of activity in the circuit during logic
simulation. An analysis of circuit partitioning techniques for ATPG is the focus of section
3.3 of this work.
Another issue in circuit partitioning for ATPG is the number of gates in each
partition: the so-called block size. As the number of gates assigned to a block decreases, the
amount of work that can be done between communications steps becomes smaller. Hence,
the parallelism becomes more fine grained. The minimum block size will also affect how
the problem scales with increasing numbers of processors. As more processors are added
to the system, the block size will get smaller and efficiency will decrease.
An investigation of the amount of parallelism theoretically available in
topologically partitioned parallel ATPG was undertaken in [31]. This work attempted to
find an upper bound on the amount of parallelism present in conflict-free test generation.
Two phases of the test generation process using the PODEM algorithm, backtracing and
forward implication, were parallelized. Each gate was assumed to be placed on its own
individual processor. The objective then was to measure the maximum number of
operations that could be performed in parallel by the individual gate processors.
The methods that were used to accomplish this objective are best illustrated using an
example.
Consider the example circuit of Figure 2.1 again. In setting the objective of J=‘0’,
there are several paths that must be backtraced such as the J->G->A path and the J->I->
H->C path. If these paths could be backtraced at the same time, then parallelism would be
present. If each backtrace operation is assumed to take place in the same amount of time,
then the backtraces will propagate through the circuit on a level by level basis. At each gate,
backtraces are generated on each input as required. Using this method, conflicts may arise
at points of reconvergence of fanout. This point is where the authors use the conflict-free
assumption. The correct values to be placed on reconvergence points are precomputed off-
line and the conflict is avoided. Figure 2.8 illustrates the process of parallel backtracing for
the objective J=‘0’.
Notice that in this case, the maximum number of backtrace operations that occur
during the same period of time, or “time-step”, is 2. This measure would be the maximum
amount of parallelism available in this step of the ATPG process. Also note that the
objective values required on the individual lines are not assigned to them during the
Figure 2.8 Multiple parallel backtrace. [Schematic: backtraces for the objective J = ‘0’ propagate level by level through three time-steps, with at most two backtrace operations per time-step; the objective values A = C = E = ‘0’ are shown but not yet implied.]
backtrace procedure. These values must be set through forward implication as is done in
the serial PODEM case. The difference is that in this parallel implementation, the
implication procedure is parallelized. Thus in the case shown in Figure 2.8, the values
A = ‘0’, C = ‘0’, and E = ‘0’ would be implied at the same time. Implications would then be
performed back through the circuit in parallel in a manner similar to backtraces. Analysis
of Figure 2.8 shows that the maximum parallelism present during parallel implication of
the above input assignments would be 2 as well.
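
The level-by-level measurement used in [31] can be paraphrased as the following sketch; the circuit interface is hypothetical, and reconvergent fanout is ignored in keeping with the conflict-free assumption:

    // Illustrative measurement of backtrace parallelism, level by level.
    // Each element of 'frontier' is one pending backtrace operation; the
    // widest level reached is the maximum parallelism for this objective.
    int max_backtrace_parallelism(Circuit& ckt, Objective start) {
        std::vector<Objective> frontier{start};
        int max_width = 0;
        while (!frontier.empty()) {
            if (static_cast<int>(frontier.size()) > max_width)
                max_width = static_cast<int>(frontier.size());
            std::vector<Objective> next_level;
            for (const Objective& o : frontier)
                for (const Objective& in : required_input_objectives(ckt, o))
                    next_level.push_back(in);   // all of these backtraces occur
            frontier = std::move(next_level);   // in the same "time-step"
        }
        return max_width;
    }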
The authors of [31] use this technique to analyze the maximum and average amount
of parallelism present in the ISCAS ‘85 [16] benchmark circuits. The analysis was done on
a conventional workstation using a simulation based technique. The average amount of
parallelism they found was less than they expected. For forward implication, most circuits
had an average parallelism of 4 to 7, although some circuits had values higher than this. For
backtracing, most circuits had average parallelism values of 1.5 to 3.5.
This method of analysis of the parallelism present in topologically partitioned ATPG
has several drawbacks which do not allow a valid conclusion to be drawn concerning the
performance of topological partitioning. First, the assumption of one gate per processor is
unrealistic. Second, as shown in this thesis, there are other methods of parallelism available
for use with topological partitioning. Finally, the work in [31] completely ignores the
practical aspects, such as synchronization protocols and communications latency, of
implementing this type of system on an actual multicomputer. The authors do acknowledge
that this technique may have benefit when used with other parallelization methods and that
it has the important characteristic of allowing larger circuits to be processed on a given
distributed memory multicomputer.
2.3 Hardware Considerations
This research utilized a distributed memory MIMD machine known as the ES-KIT
88K. It is assumed that the reader is familiar with the typical characteristics, such as
message passing and connectivity, of parallel machines of this type. Only a brief discussion
of the effect of the characteristics of the machine on the programming of the application
will be undertaken in this section. This section will be followed by a discussion of the actual
hardware used in this research. Finally, the software system that formed the basis for this
work will be presented.
Distributed memory machines have local memory for each processor but no
globally accessible memory. Processors must send messages across some interconnection
medium, also called a message fabric, to share data. It may take hundreds or even thousands
of instructions to package a message for transmission so communications costs are much
higher than for shared memory. Also, message transfer time depends on the “distance”
between communicating processors. Distance between processors is a measure of the
length of the communications channel and the number of other processors which must pass
the message along for it to be transferred. There are a number of interconnection strategies
used on message passing systems [32]. Each one involves a trade-off of distance between
processors and the number of connections per processor.
Because communication time is distance dependent, data location in message
passing systems is as critical as, if not more critical than, in shared memory systems.
Determination of which processors perform certain tasks is much more important in
distributed memory systems than in shared memory systems. Processes that must
communicate frequently must be instantiated on processors that are “close” to each other.
Therefore, algorithms must be designed for the specific communications topology of the
target machine. Algorithms designed for one machine may not perform satisfactorily on
another [33]. Programs on a message passing system will, in general, use built-in system
calls to send and receive messages. Data must be explicitly moved from one processor to
another using the send and receive mechanism. Synchronization between processors must
also take place using messages and is therefore more time consuming than in shared
memory systems. For this reason, algorithms for message passing machines must use more
coarse grained parallelism. Coarse grained parallelism implies that many instructions must
be processed between synchronization events. Setup time is much longer on message
passing systems because all of the program code and data, such as the circuit topology
information, must be loaded across the message fabric. New processes are harder to spawn
for this same reason. Therefore, setting up one processor as a master is more difficult.
Proper load balancing among the processors is also harder to achieve. In general,
algorithms for message passing systems are more difficult to design well, but the programs
are themselves easier to implement and debug because data consistency is more easily
maintained [33].
2.3.1 Experimental Systems Kit
The parallel processing machine available for use in this research is the
Experimental Systems-KIT (ES-KIT) 88K processor developed by the Microelectronics
and Computer Technology Corporation (MCC). The ES-KIT was developed by MCC
under a Defense Advanced Research Projects Agency (DARPA) grant to facilitate
experimentation into new parallel computer architectures and application specific
computing nodes. The ES-KIT system includes the 88K processor, described below, and
the ESP runtime system, described in the next section. The description of the ES-KIT
system is brief and limited to the characteristics which influence applications design. A
more detailed description of the ES-KIT system can be found in [34]. Further, the ESP
system and ES-KIT applications are implemented in the C++ language [35]. It is assumed
that the reader is knowledgeable in this language.
2.3.1.1 ES-KIT Hardware
The 88K processor is a distributed memory parallel architecture based on a 16 node,
4X4 2 dimensional mesh. A Sun 3/140 running the BSD 4.1 operating system acts as a host
for the 88K hardware. The Sun communicates with the 88K processor through a VME bus
interface board. The message fabric of the 88K processor is based on Symult Systems'
Asynchronous Message Routing Device (AMRD). Each node in the mesh has its own
AMRD. The use of the AMRDs eliminates the need for store-and-forward message passing.
The use of AMRDs means that the processor on each node is not involved in passing
messages that are not addressed directly to it. The message fabric is in general capable of
passing messages at a rate of 20 MB per second, but the software overhead of packing and
unpacking messages at either end prevents this rate from being achieved.
Each node is a general purpose computing system based on Motorola's 88000
processor family. The nodes consist of four boards which communicate with each other
across an internal bus based on the 88K standard. The four boards consist of the Message
Interface Board, the Processor Board, the Memory Module, and the Bus Terminator
Module. The boards are connected through a unique set of stacking connectors which allow
the nodes to be built on top of each other. A typical installation consists of two nodes
stacked on top of each other. The node stacks are arranged on top of Mother Board modules
that provide power and ground connections and contain the AMRDs. A 16 node
configuration consists of four Mother Boards mounted in a plane, each one containing two
stacks with two nodes per stack.
The processor board consists of one 20 MHz Motorola 88100 RISC
microprocessor, and two 88200 cache modules. The two 88200's provide separate paths for
instructions and data. The Memory Module provides 8MB of dynamic RAM. It is possible
to have up to four Memory Modules per node for a total of 32MB of memory, but the
standard configuration is only 8MB. The Message Interface Module consists of one 88100
processor, 128 KB of data memory, 128 KB of instruction memory, and interface logic to
provide a path from memory to the node's AMRD. The 88100 processor on the MIM takes
care of all processing necessary to package a message and send it out through the AMRD.
The MIM processor off-loads a significant amount of message processing from the 88100
in the Processor Module. Finally, the Bus Terminator Module provides the electrical
termination for the high speed lines in the 88K bus, general purpose services such as the
system clock, and a UART interface to the outside world for debugging and repair of the
node hardware.
2.3.1.2 ESP Runtime System
The ESP (Extensible Software System) run-time environment is as important and
complex as the 88K processor. The environment is written in C++ and is intended to
maximize flexibility in the types of configurations that can be used in a parallel processing
system. The environment consists of four major components: the ISSD, the mail daemon,
the shadow process, and the actual ESP kernels.
The Inter Service Support Daemon (ISSD) is the heart of the system in that it is the
first process invoked by the user and it constructs the rest of the run-time environment. The
ISSD is the major interface between the ESP environment and the outside world. The ISSD
controls the starting and terminating of applications programs, and all communication with
peripheral devices including screen and disk IO. The ISSD begins by reading the
configuration file to determine what ESP components are to be invoked. The configuration
file is created by the user and contains instructions as to how many of the various
components are to be constructed, where they are to run, and how they are connected. The
minimum configuration file must contain the invocation instructions for a single ISSD, one
mail daemon, and one or more ESP kernels. The ISSD also invokes the public service
objects (PSO's) such as the application manager and the kernel librarian which are
necessary to run any application. The ISSD always runs on the Sun host machine, but the
PSO's can run on any of the 88K nodes.
The mail daemon is responsible for routing messages between individual or groups
of ESP kernels. Each mail daemon is connected to all of the kernels in its group, every other
mail daemon in the configuration, and the ISSD. This connectivity allows messages to be
passed from kernel to kernel with the minimum handling possible. The configuration must
contain a minimum of one mail daemon for each type of ESP kernel in the configuration.
The shadow process runs on the host Sun where the ISSD is located. The shadow
process is responsible for reading the application source code files and managing the
terminal IO for the application. The shadow process is the next ESP component invoked by
the user after the ISSD.
Finally, the ESP kernel is the work horse of the ESP system in that it actually runs
the application. The kernel performs memory management, message packing and
unpacking, and task switching for the applications. The kernel runs on top of a rudimentary
OS in the 88K processor. The message passing portion of the kernel utilizes an MCC
developed protocol which utilizes the 88100 processor in the MIM on the 88K processor.
2.3.1.3 Object Oriented Programming in ESP
The ESP system uses the C++ object oriented paradigm as its abstraction for
parallel processing. Applications written for the 88K processor to run in ESP must be
programmed in C++. C++ incorporates the ideas of objects, data encapsulation, and
inheritance. C++ objects or classes are instantiated on different nodes and communicate
with each other through method invocation and return values. Each object has its own local
data contained within the node and that data can only be manipulated by method calls.
There is no global or 'public' data allowed in ESP. There were five major changes made to
the C++ language to implement the distributed processing environment of ESP. These
changes consisted of overloading the pointer-to-member function ‘->()’, redefining the
return values available for methods, overloading the 'new' function, eliminating the 'main'
routine, and incorporating the concept of futures.
All objects that are to have methods that are available for remote invocation must
be derived from the object remote_base. This object was developed by MCC and includes
several features necessary to implement remote method invocation, the first of which is a
handle. A handle is a pointer to an instance of an object and contains all of the information
needed to address an object in a distributed system. This information is contained in four
parts: a node number where the object actually resides, a class number for the object, an
application number and the actual instance number of the object. Handles can be passed
between objects or the address information can be passed and a new handle constructed to
point to that object.
The second feature included in remote_base is the overloaded pointer-to-member
function. In regular C++, the method invocation:
    object_instance->method(arg1, arg2, ...)

is implemented as a subroutine call. In ESP C++, the object may reside on a remote
node. Therefore, the method call must be invoked through message passing. Overloading
of the ‘->()’ function for remote methods handles this process. Methods for an object
derived from remote_base are defined to be remote by declaring them in the public section
of the object specification. When a method on a remote object is invoked, the kernel reads
the argument list to determine its length. It then copies the argument list into the message
buffer with the length of the argument list in bytes appended to it. Finally, the kernel uses
the handle of the object to instruct the MIM processor where to send the message. When
the receiving object receives the message, it invokes the proper method with the argument
list. If a value is to be returned by the method, one of the return macros defined in the ESP
programming environment must be used. The return macros instruct the kernel that the
return is to a remote object and that it must be packaged as a message. Macros are available
for returning most of the common data types such as integers, doubles, characters, and
strings. There is also a pointer return macro, but its functionality is different than in regular
C++. This macro is necessary because pointers on remote nodes are meaningless in the ESP
environment. If a pointer return is specified, the kernel packages the entire object to which
the pointer refers and sends it back to the node that invoked the method. The kernel on the
invoking node then copies the returned object into its memory space and returns a pointer
to this copy to the invoking object. In this way, any structure or object the size of which can
be determined at compile time can be returned from a remote method invocation.
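
As a concrete illustration, a remote object might be written as follows. The names remote_base and INT_RETURN are the ESP facilities described above; the class itself and its methods are invented for this example:

    // Illustrative ESP remote object (the class is hypothetical).
    class counter : public remote_base {
        int count;                 // local data: reachable only through methods
    public:
        counter() : count(0) {}
        void add(int by) {         // public methods may be invoked remotely
            count += by;           // via a message to this object's node
        }
        void get_count() {         // the return may cross nodes, so an ESP
            INT_RETURN(count);     // return macro must be used
        }
    };

Through a handle, an invocation such as counter_handle->add(5) looks like an ordinary C++ call; the overloaded ‘->()’ packages it as a message to the node on which the object resides.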
In C++, the new operator is used to allocate memory space for instances of objects.
In the ESP environment, the new operator is overloaded to allow arguments to be passed to
it. These arguments specify which node an object is to be instantiated upon. The syntax for
a call to the overloaded new function is:

    object_type* object_pointer = new{node, relationship} object_type();
The node, relationship pair is used to specify the location of the object. For example
if the variable homenode is specified to be (1,1), the call new{homenode,SAMEAS} will
create the object on node 1,1. Options for the relationship variable include: SAMEAS,
DIFFERENT, NEAR, FAR, and NEXT. The NEXT relationship does not need a node
specifier and it allows the kernel to select the node for the object using its own criteria. At
this point, the criterion used is the amount of memory left on each node. The object is created
on the node with the most free memory. Other algorithms that take into account load
balancing and communications costs are under development by MCC. Until they are
available, the user must be careful to take these factors into consideration and specify where
each large object is to be created in order to optimize the application.
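
A hypothetical use of the overloaded new operator, following the syntax above, might look like this; the object type, variable names, and the exact form of the NEXT call are invented for illustration:

    // Illustrative object placement with the overloaded 'new'.
    node homenode(1, 1);                              // assumed node-identifier type
    worker_t* w0 = new{homenode, SAMEAS} worker_t();  // on node (1,1) itself
    worker_t* w1 = new{homenode, NEAR} worker_t();    // on a node close to (1,1)
    worker_t* w2 = new{NEXT} worker_t();              // kernel picks the node with
                                                      // the most free memory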
In ESP C++, there is no 'main' routine. When the shadow program loads the first
object in the application, its constructor is invoked after it is loaded. This constructor must
do the work necessary to start the application. This may be as simple as calling another
routine within the same object to take over control, or as complex as creating all other
objects and directly performing the necessary algorithm. The former approach is
recommended as it is more 'correct' and it allows the kernel to complete construction of the
initial object and alter the stack size for that object if necessary.
When a method on a remote object is invoked, the processing takes place on the
remote node. The invoking method is then free to perform some other calculation if it does
not need the result of the remote method. If the invoking method does need the result of the
remote method, it must block until that result is returned. Controlling when the object
blocks for the return result is done by using futures. Futures were introduced as a part of
Multilisp [36], and allow lazy evaluation of return values. Note that the only parallel
processing that occurs is between the time that the remote method is invoked and the future
is evaluated. This fact demonstrates the value of the future abstraction for methods that
return values. Of course, remote methods that do not return values always run in parallel
with the invoking method.
One notable characteristic of objects in ESP is that, to ensure “correctness”, only one
method in a specific object may be invoked at a time. This includes methods that are
blocked waiting for a future or return value. Thus, if two objects each invoke a method in
the other and wait for the return value, deadlock is possible. This situation is illustrated in
Figure 2.9.
In order to avoid this deadlock situation, two objects that wish to pass values to each
other in an asynchronous fashion must use another method which involves more overhead.
This process, which is illustrated in Figure 2.10, involves sending a request to the other
object for the value and then having the other object send the value back using a second
method invocation. This method ensures deadlock free operation, but carries a significant
performance penalty as shown in the following section.
2.3.1.4 System Performance Characterization
This section presents a general overview of the measured performance
characteristics of the 88K processor. This data is necessary in order to analyze the
performance of the applications used in this research and explain some of the
implementation decisions that were made for the applications. A more detailed description
of the experiments used to gather this data and their results can be found in Appendix A of
this thesis.
The software overhead for message passing in the ES-KIT environment is
significant and reduces the maximum communications rate available on the 88K processor
to less than 1 MByte per second. Experiments have shown that additional message traffic
on the connections between nodes can reduce this rate by an order of magnitude. Other
Figure 2.9 Deadlock in synchronous communication among ESP objects.

    // Object 1
    object1::get_result2() {
        // some code
        result1 = object2->return_result2();
        // some more code
    }
    object1::return_result1() {
        // some code
        INT_RETURN(result1);
    }

    // Object 2
    object2::get_result1() {
        // some code
        result2 = object1->return_result1();
        // some more code
    }
    object2::return_result2() {
        // some code
        INT_RETURN(result2);
    }
experiments have been carried out to determine the rate of information transfer that could
be realized while reading or writing to the host file system. A maximum transfer rate of
approximately 50 Kbytes per second was achieved. Further, a maximum block size of only
32 Kbytes can be read at one time from a file. These statistics severely limit the file I/O
performance of the ES-KIT 88K processor and dictate some major compromises in
application design.
The most significant performance measurement for determining the scalability of
topological partitioned ATPG on a given machine is communications latency.
Measurements have indicated that using the communications method shown in Figure 2.9,
approximately 3200 “communications loops” are possible per second. This translates to a
one-way communications latency of 156 µs. However, if the deadlock free method of
Figure 2.10 is used, only an average of 2360 “communications loops” are possible per
second. This rate is equivalent to a one-way communications latency of 212 µs. Because
this communications scheme must be used to ensure deadlock free operation, this increase
in communications latency has an impact on application performance.
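
These latency figures follow directly from the measured loop rates: a loop is one round trip, so the one-way latency is half the loop period.

    // Deriving the one-way latencies quoted above:
    //   synchronous:  1 / 3200 loops/s = 312.5 us per loop -> ~156 us one way
    //   asynchronous: 1 / 2360 loops/s = 423.7 us per loop -> ~212 us one way
    double one_way_latency_us(double loops_per_second) {
        return 1.0e6 / loops_per_second / 2.0;
    }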
Figure 2.10 Deadlock free asynchronous communication among ESP objects.

    // Object 1
    object1::get_result2() {
        // some code
        object2->send_result2();
        // some more code
    }
    object1::send_result1() {
        object2->store_result1(result1);
    }
    object1::store_result2(int temp) {
        result1 = temp;
    }

    // Object 2
    object2::get_result1() {
        // some code
        object1->send_result1();
        // some more code
    }
    object2::send_result2() {
        object1->store_result2(result2);
    }
    object2::store_result1(int temp) {
        result2 = temp;
    }
2.4 Software Considerations
This section provides a brief description of the parallel test generation system that
was used as the basis for this work. A detailed discussion of the implementation and
performance is contained in Appendix B of this thesis.
2.4.1 The Test Generation System (TGS)
The test generation system used in this work is derived from the Test Generation
System developed at the University of Southern California. This system includes an
implementation of the PODEM [3] test generation algorithm and a critical path tracing fault
simulator [37]. The system was originally programmed in the PASCAL language. It was
converted to the C language by researchers at Mississippi State University and is currently
distributed in the Lager digital design toolset.
The first required modification to the TGS system was to modify the PODEM test
generator to handle exclusive-or (XOR) gates. The TGS implementation of PODEM was a
direct translation of the flow charts included in [3]. However, even though this paper
presents PODEM as an improved ATPG algorithm for circuits containing XOR gates, the
flow charts do not detail how XOR gates are to be handled during backtracing and objective
selection. Simple heuristics were devised and added to the TGS PODEM ATPG system to
handle XOR gates. After these modifications were implemented and tested, the modified
PODEM algorithm was returned to MSU for inclusion in future Lager toolset releases.
The next modification made to the TGS system was to convert it from the C
language to C++. This process involved encapsulating the various functions of the TGS
system into C++ objects. The PODEM ATPG algorithm was incorporated into a
test_generator object. The critical path tracing fault simulation algorithm was incorporated
into a fault_simulator object. The remainder of the “control” type functionality was
incorporated into a master object. A reader and a writer object were added to the system to
read in the circuit description and write out the test vector file respectively. These objects
were added to increase performance in light of the 88K processor’s IO performance as
discussed in the previous section.
2.4.2 The ES-KIT Test Generation System (ES-TGS)
Once the TGS system was converted to C++, it was ported to the ESP environment.
This port involved changing all communications between objects into ESP compatible
remote method invocations and the non-trivial task of debugging the system and adjusting
it to ESP’s many idiosyncrasies. Once this task was completed, the system was parallelized
using fault partitioning. Results of the parallelization effort including speedups achieved
for various versions on the ISCAS ‘85 circuits can be found in Appendix B. The PODEM
portion of the ESP compatible system was used as the starting point for the Topological
Partitioning Test Generation System detailed in the next chapter.
Chapter 3
Topologically Partitioned Test Pattern Generation
System
This chapter presents the Topologically Partitioned Test Pattern Generation System
(TOP-TGS) developed for the 88K processor. The algorithms used to generate the
topological partitions are presented first. This section is followed by a discussion of the
implementation of the TOP-TGS system on the 88K processor. Finally, the results section
presents a comparison of the performance of the TOP-TGS system on circuits partitioned
using the various partitioning algorithms described herein.
3.1 Circuit Partitioning
This section describes the complete circuit partitioning system. This system
consists of various algorithms for generating “blocks” which are then combined into
partitions.
The generation of partitions begins with a circuit netlist which is placed in the
proper “PODEM” format. The PODEM format is one of the intermediate formats used in
the TGS system. It consists of a simple ASCII file with one line per gate where each line
contains the gate name, gate type, the number of inputs and outputs to the gate, and a list
of the line numbers of the gates that are inputs to that gate. In addition, the first line in the
file contains the total number of gates in the circuit followed by the number of primary
inputs and primary outputs. In this format, a primary input is simply represented as a gate
of type ‘inpt’ with 0 inputs and a primary output is any gate with 0 outputs.
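For illustration, a small five-gate circuit might appear in this format as follows. This
example is invented to show the layout described above; the gate names, type mnemonics,
and field spacing are assumptions and do not come from an actual TGS file:

    5 2 1
    A    inpt  0 1
    B    inpt  0 1
    G1   nand  2 2   1 2
    G2   not   1 1   3
    G3   and   2 0   3 4

Here the first line declares 5 gates, 2 primary inputs, and 1 primary output. Gates A and B
are primary inputs (type ‘inpt’ with 0 inputs), and G3 is a primary output because it has 0
outputs; its inputs are the gates on lines 3 and 4.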
The next step in the partition generation process consists of performing testability
analysis on the circuit. Testability analysis is a process which assigns controllability and
observability measures to each node in the circuit. These values are used to aid the ATPG
process. For this system, the SCOAP [38] testability analysis algorithm was chosen.
Comparison of the performance of SCOAP versus other testability measures performed
early in the research indicated that it outperformed the other measures for a majority of the
ISCAS ‘85 [16] circuits.
Once the testability measures are computed, the circuit netlist is ready for the
partitioning program. As stated previously, this program first divides the circuit up into
blocks and then combines these blocks into partitions. Finally, the circuit database is written
back out to a file in the binary format used by the TOP-TGS system with the testability and
partition information included.
3.1.1 Partitioning System Goals
The two requirements for an effective topological partitioning scheme for ATPG
are reduction of the amount of memory used on each node and limiting of the
communications overhead between the processors. Typically there is a strong interaction
between these two factors, and reducing one can cause an increase in the other.
Reducing the memory requirements on each node is more complex than simply
dividing the gates up evenly among the available processors. In most current ATPG
algorithms there are occasions where information about the characteristics or state of gates
in other partitions is required. Duplicating this information in both partitions is sometimes
more beneficial than sending messages between partitions each time it is required. For
example, in PODEM, when backtracing an objective value to the boundary of a partition,
it is necessary to know the controllabilities and output states of the gates driving the current
objective gate. This information is used to select the next objective node and value. If this
information is not duplicated in both partitions, several messages will have to be passed to
gather this information before a decision can be made. This process is illustrated in Figure
3.1.
Similarly, the process of selecting a D-frontier gate in PODEM involves knowledge
of the observability and state of gates on the output of the gates in the D frontier. This
process is shown in Figure 3.2.
Figure 3.1 Backtracing across partition boundaries (messages are required to request and
receive the state and 0-controllability (ccy0) information of gates in other partitions, and to
pass the new objective to the partition that owns it).
Figure 3.2 Propagating D frontier across partition boundaries (messages are required to
request and receive the state and observability (coy) information of gates in other partitions,
and to pass the D frontier to the partition that owns it).
Gates at the interface between partitions can be duplicated in both partitions to
eliminate some of the messages necessary for backtracing and propagation. This
duplication increases the memory required on each partition, but the data structures for the
“extra” gates can be much smaller because only the state and the controllability and
observability information is needed. The controllability and observability information in
both partitions is initialized when the circuit database is loaded. One additional message for
each overlapped gate is required during the database initialization process. The state
information is updated in both partitions automatically during forward implication without
need for any additional messages.
Although duplicating the gate structures on both sides of a partition boundary will
reduce the communications overhead for crossing the boundary, the most effective way to
reduce message traffic is to reduce the number of times a partition boundary must be
crossed during ATPG. This reduction is accomplished by minimizing the number of ‘high
traffic’ connections between partitions. A high traffic connection is one which is traversed
a large number of times during ATPG. An example of a high traffic connection might be a
portion of a reconvergent fanout loop such as illustrated in Figure 3.3. In the PODEM
algorithm, the circuit database is traversed from inputs towards outputs and vice versa.
Therefore, the optimum partitions are most likely ones that cut the circuit in a longitudinal,
or input to output direction rather than ones that cut the circuit transversely across circuit
levels.
Figure 3.3 Placement of gates in partitions (moving gate G from partition 2 to partition 1
will probably reduce message traffic during ATPG).
3.1.2 Block Generation
Block generation is the process of dividing up the circuit into subpartitions or
“blocks” of related gates. Dividing up the circuit into blocks is done using one of four
algorithms: fanin cones, fanout cones, input paths, or output paths. The number of blocks
is determined by the circuit topology and the type of partitioning selected. For fanin cones
and input paths, the number of blocks is equal to the number of primary inputs in the circuit.
For fanout cones or output paths, the number of subpartitions is equal to the number of
primary outputs.
The algorithms used for fanin and fanout cone partitioning are presented in [30] and
will not be detailed here. The main characteristic of these algorithms to note is that the
affinity of a specific gate to a cone is increased as the subpartition, or block, increases in
size. This fact usually causes one block to grow in size rapidly and encompass all the gates
surrounding it. Once this block has reached its maximum size, the remaining gates must be
placed in other blocks. Thus one block will be localized to one area of the circuit, but the
remaining blocks may be scattered all over the remaining circuit area.
Partitioning by input paths is accomplished by determining the total number of
paths from each primary input to each gate. Primary inputs are then assigned one to each
block. Each gate is placed in the same block as the primary input to which it has the most
paths. If a block reaches its maximum size, no other gates may be assigned to it. The
maximum size of a block is set to be equal to the total number of gates in the circuit divided
by the total number of full partitions desired.
Partitioning by output paths is accomplished similarly using the primary outputs of
the circuit as starting points. Notice that in this algorithm, the affinity of a gate to a specific
block is determined before the actual block assignment starts and does not change. This fact
results in much more balanced blocks.
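The path-based assignment rule just described can be sketched as follows. This is an
illustrative reconstruction, not the actual partitioning program: the precomputed path-count
table and all names are assumptions.

    #include <vector>

    // Sketch of input-paths block assignment: each gate joins the block of the
    // primary input to which it has the most paths, unless that block is full.
    // paths[g][p] = number of paths from primary input p to gate g (precomputed).
    std::vector<int> assign_blocks(const std::vector<std::vector<int>>& paths,
                                   int max_block_size)
    {
        int num_gates  = static_cast<int>(paths.size());
        int num_inputs = num_gates ? static_cast<int>(paths[0].size()) : 0;
        std::vector<int> block_of_gate(num_gates, -1);
        std::vector<int> block_size(num_inputs, 0);   // one block per primary input

        for (int g = 0; g < num_gates; ++g) {
            int best = -1;
            for (int p = 0; p < num_inputs; ++p) {
                if (block_size[p] >= max_block_size) continue;   // block is full
                if (best < 0 || paths[g][p] > paths[g][best]) best = p;
            }
            if (best >= 0) { block_of_gate[g] = best; ++block_size[best]; }
        }
        return block_of_gate;   // output paths works the same way from the POs
    }

Partitioning by output paths uses the identical loop with a path count from each gate to each
primary output instead.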
3.1.3 Partition Generation
After block generation is completed, the next step in the partitioning process is to
combine the blocks into full partitions. This step is done using either a greedy algorithm or
a simulated annealing approach [39]. The cost vector includes a factor for the size of the
partitions, and the interconnections between partitions. The size factor is simply the sum of
the absolute value of the difference between the number of gates in each partition and the
balanced partition size. The interconnection factor is the total number of interconnections
between each pair of partitions. The size and interconnection factor are multiplied by
weight factors and then summed to determine the total cost. The weight factors can be set
by the user. For most circuits and most partitioning methods, the cost vector is simple
enough that the greedy algorithm generates a minimal solution. For others, the simulated
annealing results in a better solution. This result obviously depends on the shape of the
solution space; simulated annealing would probably perform better if the cost vector were
made even more complex by the addition of other factors.
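The cost computation itself is simple. The following sketch illustrates it under assumed
names and data layout (the actual TGS cost code is not shown in this chapter; the weights
are the user-set parameters mentioned above):

    #include <cstdlib>
    #include <vector>

    // Sketch of the partition cost: a weighted sum of (a) total deviation of
    // the partition sizes from the balanced size and (b) total interconnections
    // between each pair of partitions. Lower cost is better.
    double partition_cost(const std::vector<int>& part_size,
                          const std::vector<std::vector<int>>& interconnect,
                          int balanced_size, double w_size, double w_conn)
    {
        double size_factor = 0.0, conn_factor = 0.0;
        for (int s : part_size)
            size_factor += std::abs(s - balanced_size);      // size factor
        for (std::size_t i = 0; i < interconnect.size(); ++i)
            for (std::size_t j = i + 1; j < interconnect.size(); ++j)
                conn_factor += interconnect[i][j];           // interconnection factor
        return w_size * size_factor + w_conn * conn_factor;  // user-set weights
    }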
In addition to fanin-out cones and input-output paths, the partitioning system can
generate two other partition types that do not require generation and combining of blocks.
These methods were implemented simply for comparison with the other methods and
include random partitioning and partitioning by gate levels. Random partitioning consists
of placing gates in partitions according to the output of a random number generator. The
uniform distribution of the random numbers ensures that the number of gates in each
partition is close to being equal.
Partitioning the circuit by gate levels involves placing gates into partitions
depending on their level, or distance from a primary input. The size of each partition is
limited to the balanced partition size. This limit is strictly imposed and may result in
splitting a level. That is, all of the gates in a specific circuit level may not be placed in the
same partition.
3.2 TOP-TGS System Implementation
This section describes the scalar version of the topologically partitioned PODEM
ATPG system. This system was implemented to measure and compare the memory
utilization and message overhead of the ATPG process utilizing the various partitioning
algorithms. The next section describes the architecture of the system in terms of the C++
objects which comprise it, and the following section describes the operation of the system
during the various ATPG phases.
3.2.1 TOP-TGS System Architecture
The overall architecture of this system is shown in Figure 3.4. This system consists
of four types of objects: a master object, test_generator objects, and reader and writer objects. The
reader and writer objects perform the I/O functions for the TOP-TGS system. The reader
object reads in the circuit database and distributes the partitions to the test_generator
objects. The writer object is responsible for storing the test vectors resulting from ATPG
and keeping track of the fault coverage. The vectors are stored in the writer object because
the file I/O performance of the 88K processors is too slow to allow the vectors to be written
to a file as they are generated.
Figure 3.4 Topological PODEM ATPG system architecture (a master object, a reader and
a writer object, and multiple test generator objects).
The master object controls the operation of the test generator objects. This control
includes selection of the next fault for ATPG and initiation of the various ATPG phases
such as objective selection and implication. The test generator objects perform all of the
operations necessary to carry out these phases of the PODEM algorithm. There is no fault
simulation in this system and ATPG is performed on every fault in the circuit. This method
was used only for experimentation into the efficiency of the ATPG algorithm. A real
system which uses this algorithm would include fault simulation similar to the one found
in the ES-TGS system (see Appendix B).
3.2.2 TOP-TGS System Operation
The master object begins the ATPG process by setting up the fault list and selecting
the first fault for ATPG. It then sends this fault to all of the test_generator objects so that
they may determine which one among them has the faulty gate in their database. This task
is performed by calling the fault_transform(FAULTREC target_fault) method in each
test_generator with the specific target fault. Once it has been determined which
test_generator has the target fault, the master starts the test generation loop. This loop
consists of identifying the next objective, backtracing the objective to a primary input,
implying a value on that primary input, and determining the status of the objective. If the
objective has been satisfied, then the next objective is selected and the loop continues. If
the objective has not been satisfied, then it is backtraced to another primary input and a
value is implied there. If the objective value has been incorrectly set, then backtracking
must occur. As stated previously, the master object only controls the performance of these
tasks while the test generators actually perform them. The process by which each phase is
actually carried out will be discussed in the following sections.
Because some of the communications from the test_generator back to the master
object must be synchronous, (section 2.3.1.3), all communications from the master to the
test_generator must be asynchronous as shown in Figure 2.10. As discussed before, this
requirement ensures that the system will be deadlock free.
3.2.2.1 Setting Initial Objective
The initial objective in the test generation process is to sensitize the fault. This task
is performed by setting the value on the faulty line opposite the stuck-at value. Once the
test generator which has the faulty gate in its database has been identified, the master
invokes the setnextobjective() method in that test generator. This routine checks to see if
the fault has been sensitized. If it has not, then setting it opposite the stuck-at value becomes
the next objective. This objective is then sent back to the master so that backtracing may
begin.
If the fault is located on the output of a gate, and it has been set opposite the stuck-at
value, then it has been completely sensitized. During implication, a ‘D’ or ‘D̄’ will be
placed on it as appropriate and the next objective will be calculated by propagating the D
frontier.
However, if the fault is located on a gate input, then sensitization is not complete
until the effect of that fault has been propagated to the gate output. Sensitization is
accomplished by setting the gate’s other inputs to the non-controlling value. The non-
controlling value for an AND/NAND gate is a ‘1’. The non-controlling value for an OR/
NOR gate is a ‘0’. An XOR gate has no controlling value, so the other input may be set to
any value. Once the non-controlling values have been set on the fault free inputs, the gate
output will take on the proper ‘D’ or ‘D̄’ value during implication. Propagation of the D
frontier can then be started.
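The non-controlling-value rule above is simple enough to show directly. The following
helper is a sketch; the gate-type encoding is an assumption, not the TGS representation:

    // Returns the non-controlling logic value for a gate type, per the rules
    // above. XOR has no controlling value, so either assignment is acceptable.
    enum GateType { GATE_AND, GATE_NAND, GATE_OR, GATE_NOR, GATE_XOR };

    int non_controlling_value(GateType type)
    {
        switch (type) {
        case GATE_AND:
        case GATE_NAND: return 1;   // '0' controls AND/NAND
        case GATE_OR:
        case GATE_NOR:  return 0;   // '1' controls OR/NOR
        case GATE_XOR:  return 0;   // arbitrary: no controlling value exists
        }
        return 0;
    }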
3.2.2.2 Backtracing
Once the next objective, initial or subsequent, has been received by the master,
backtracing through the circuit to a primary input must be done. A detailed description of
the backtracing process can be found in [3].
Backtracing involves propagation of an objective value back through the circuit
topology until a primary input is reached. For example, suppose the current objective is to
set the output of an AND gate to ‘1’. Because an objective is only chosen for circuit nodes
with unknown, ‘X’, values, the output of the AND gate must be an ‘X’. Thus, all inputs to
this AND gate must be currently set at ‘1’ or ‘X’. An input to the AND gate which currently
has an ‘X’ value is selected as the next objective node with an objective value of ‘1’.
Likewise, for an objective of a ‘0’ on the output of an OR gate, one of the ‘X’ inputs would
be selected as the next objective node with a ‘0’ objective value.
When there is more than one input to a specific gate that has an ‘X’ value that can
be selected as the next objective, a simple heuristic is used to make that selection. If the
objective value on the gate’s output can be satisfied by setting one input to the controlling
value, the input easiest to control is selected. In this case, the easiest to control means the
input with the lowest controllability measure as calculated by SCOAP [38]. If the objective
value on the gate output can only be satisfied by setting all inputs to the non-controlling
value, the hardest to control input with an ‘X’ value is selected first. These rules are valid
in the AND/NAND and OR/NOR case. For the XOR gate, the work in [3] does not detail
how backtracing is performed. For XORs, the author implemented a heuristic that is based
upon the types of gates driving the XOR’s inputs. The details of this heuristic are not
important, except to note that it performs correctly and efficiently although it has not been
shown to be optimal in all cases.
Backtracing is started by the master, which issues a call to the backtrace(int
gate_number, int objective_value) method in the test generator which has the next
objective. The objective is backtraced through that test generator’s circuit database using
the above semantics until a primary input is reached or until a partition boundary is
encountered. A message is sent to the test generator with the gate on the other side of the
boundary to continue the backtrace procedure. This message is simply a call to the
backtrace() method in that test generator with the proper arguments. This process continues
until a primary input is reached. The number of the primary input and the value to be
assigned to it are then sent to the master for implication.
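A sketch of this distributed backtrace is shown below. All of the helper names
(is_local(), select_input(), owner_of(), and so on) are invented for the sketch; only the
backtrace(int, int) call itself is described in the text above.

    // Sketch: backtrace an objective toward a primary input, hopping to the
    // test generator that owns the next gate whenever a partition boundary is
    // crossed. Helper functions are illustrative stubs.
    struct test_generator;
    bool is_local(int gate);
    bool is_primary_input(int gate);
    int  select_input(int gate, int objective);   // SCOAP-guided 'X' input choice
    int  input_objective(int gate, int input, int objective); // value needed there
    test_generator* owner_of(int gate);
    void notify_master_imply(int primary_input, int value);

    struct test_generator {
        void backtrace(int gate, int objective)
        {
            while (is_local(gate) && !is_primary_input(gate)) {
                int next  = select_input(gate, objective); // easiest/hardest rule
                objective = input_objective(gate, next, objective);
                gate      = next;
            }
            if (is_primary_input(gate))
                notify_master_imply(gate, objective);       // PI reached: imply it
            else
                owner_of(gate)->backtrace(gate, objective); // cross the boundary
        }
    };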
3.2.2.3 Implication
The implication phase and the procedure that the master uses to determine when
implication has completed are the most complicated parts of the scalar TOP-TGS system.
Implication is also the only portion of the test generation process that is implicitly
parallelized. Implication is a simulation-like process which is begun when a test generator
sends a primary input number and value back from a backtrace procedure. The master sends
that input number and value to the test generator that has that input number in its circuit
database through the imply(gate_number, input_value) method invocation. The test
generator uses this value to imply, or simulate, the values on the outputs of all gates in its
database driven by that gate number. The values on the outputs of the gates driven by any
gate whose output has changed in the previous iteration are then implied. This local
implication loop is continued until no other gate output in the local circuit database has
changed. The test generator then looks through the database for any gate whose output has
changed and that drives a gate in another test generator’s database. This information is sent
to that test generator through an invocation of its imply() method. This cycle of local
implies followed by transmission of values to other test generators is continued until the
implication phase is completed, thus making the global circuit state consistent. This process
is inherently parallel in that a single test generator can send out imply() calls to more than
one other test generator at a time.
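A minimal sketch of this local implication loop is given below, assuming a
worklist-based event queue. All helper names (fanout(), owner_id(), and the master
semaphore calls explained in the next paragraphs) are assumptions, and the direct method
call stands in for what is really an asynchronous ESP remote invocation:

    #include <queue>
    #include <vector>

    struct test_generator;
    bool is_local(int gate);
    int  owner_id(int gate);
    test_generator* generator(int id);
    const std::vector<int>& fanout(int gate);    // gates driven by 'gate'
    int  evaluate(int gate);                     // recompute a gate's output
    int  get_value(int gate);
    void set_value(int gate, int value);
    void master_inc_semaphore(int generator_id); // synchronous call to master
    void master_dec_semaphore(int generator_id);

    struct test_generator {
        int id;
        void imply(int gate, int value)
        {
            std::queue<int> changed;
            set_value(gate, value);
            changed.push(gate);
            while (!changed.empty()) {                     // local fixpoint loop
                int g = changed.front(); changed.pop();
                for (int d : fanout(g)) {
                    if (!is_local(d)) {
                        master_inc_semaphore(owner_id(d)); // count it as busy
                        generator(owner_id(d))->imply(d, get_value(g)); // remote
                    } else if (evaluate(d) != get_value(d)) {
                        set_value(d, evaluate(d));         // output changed
                        changed.push(d);                   // propagate further
                    }
                }
            }
            master_dec_semaphore(id);                      // this generator is idle
        }
    };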
The difficulty in handling implication in this fashion is determining when
implication is completed. If one could watch the activity on the message fabric and activity
on the nodes, then completion could be detected when no outstanding imply() messages
exist and no node is running the imply() method. This type of monitoring is not possible,
so the master uses counting semaphores to determine when each test generator has satisfied
all of its outstanding imply() calls and is thus idle. For example, if test_generator #1 sends
two imply messages to test_generator #2, it will tell the master to increment the counting
semaphore for test_generator #2 twice. Each time test_generator #2 completes the imply()
procedure, it will tell the master to decrement its counting semaphore. Therefore, after the
two calls to imply() in test_generator #2, its counting semaphore will be back to zero
(assuming it started there), indicating that it is now idle. When all counting semaphores for
the test generators are zero, the master knows that the implication process is finished.
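On the master's side, the bookkeeping reduces to one counter per test generator. A
sketch follows; the class and method names are invented for illustration:

    #include <vector>

    // Sketch: the master tracks one counter per test_generator. A generator's
    // counter is incremented once per imply() sent to it and decremented once
    // per imply() it completes; all-zero counters mean implication is done.
    class implication_tracker {
        std::vector<int> count;
    public:
        explicit implication_tracker(int n_generators) : count(n_generators, 0) {}
        void increment(int tg) { ++count[tg]; }
        bool decrement(int tg)              // returns true when all are idle
        {
            --count[tg];
            for (int c : count)
                if (c != 0) return false;
            return true;                    // global implication phase finished
        }
    };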
One important synchronization issue in this process is that the master must
increment a test_generator object’s counting semaphore before that test_generator
completes the imply() method call and issues a call to decrement the semaphore. If these
steps do not occur in the proper order, the master may think that the global implication is
completed while that test_generator is still working on it. In order to avoid this problem,
the incrementing and decrementing of the counting semaphore is done using the
synchronous communications protocol of Figure 2.9. In the above example, test_generator
#1 will send a message to the master to increment the counting semaphore for
test_generator #2 and wait for a return value from the master to signify that this task has
been accomplished. Only then will it send a message to test_generator #2 to actually
perform the implication. This protocol ensures correct operation, but unfortunately, it adds
a significant amount of overhead to the implication process. Further research may be
necessary to find a more efficient solution to the problem of determining when global
implication is complete.
When implication is completed, the master gets the status of the objective from the
appropriate test generator. If the objective has been met, the master selects a new objective
through a call to the setnextobjective() procedure in the appropriate test generator. If the
objective has not yet been met, the master calls the backtrace() procedure in the test
generator that has the current objective. This backtrace will lead to another primary input
and value which will then be implied. If the objective value has been set to opposite what
is required, then backtracking must occur.
3.2.2.4 Setting D-Frontier
After the faulty line has been sensitized, a path must be created from the faulty node
to a primary output so that the value on the node may be observed. This path will be formed
by propagating the D frontier from the faulty node to a primary output. The D frontier is
the set of gates with a ‘D’ or ‘D̄’ on the inputs and an ‘X’ or unknown on the outputs. The
objective will be to set the other inputs to a gate on the D frontier to the non-controlling
value so that the ‘D’ is propagated from the inputs to the outputs. This process is called
propagation and is continued until a ‘D’ has reached a primary output, at which time a test
has been generated. The gate on the D frontier with the best (lowest) SCOAP observability
is selected as the next objective. Since this gate may be in any of the test generator’s
databases, all test_generators must participate in the decision as to which gate will be
selected for propagation. If the test generator that has the target fault in its database
determines that the sensitization process has been completed, it calls the
setDfrontier(gate, observability) method in itself with null arguments. The setDfrontier()
method calculates the gate on the D frontier with the best observability in its own database.
This gate becomes the new objective. This test_generator then calls the setDfrontier()
method in the test generator with the next highest address using the new objective gate and
observability as arguments. The second test_generator then finds the gate with the best
observability on its own D frontier. If this gate has a better observability than the gate that
was sent to it, this gate becomes the new objective. The second test_generator then calls the
setDfrontier() method in the test generator with the next highest address using the new
objective. The test generator at the top of the address list will call the test generator at the
bottom of the list, thus forming a loop of method invocations. If a test generator in the list
receives a setDfrontier() call with its own D frontier gate number, then it knows that a loop
has been completed and it has the next objective. This objective is then sent back to the
master so that backtracing may begin. If any test generator has a ‘D’ on a primary output,
it knows that a test has been generated and this fact is sent to the master. If no test generator
has a gate on the D frontier, a test is not possible with this input combination and that fact
is sent back to the master. This result would cause backtracking to occur.
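The circular-loop selection can be sketched as follows. The field and helper names are
invented, and the empty-frontier (ERROR_NOGBLDFRONTIER) and test-found cases are
compressed into a comment to keep the sketch short:

    // Sketch of the "circular loop" D-frontier selection. Each generator keeps
    // the candidate it last forwarded; receiving that same candidate back means
    // the loop closed and this generator holds the next objective.
    struct test_generator;
    void best_local_dfrontier(int* gate, int* observability); // lowest SCOAP coy
    void report_objective_to_master(int gate);

    struct test_generator {
        test_generator* next_in_loop;  // generator with the next highest address
        int forwarded_gate = -1;

        void setDfrontier(int gate, int observability)
        {
            if (gate >= 0 && gate == forwarded_gate) {
                report_objective_to_master(gate); // candidate survived the loop
                return;
            }
            int lg, lobs;
            best_local_dfrontier(&lg, &lobs);
            if (lg >= 0 && (gate < 0 || lobs < observability)) {
                gate = lg;                        // local gate is more observable
                observability = lobs;
            }
            // If gate is still -1 after a full loop, no D frontier exists
            // anywhere and set_Dfrontier_status(ERROR_NOGBLDFRONTIER) is sent.
            forwarded_gate = gate;
            next_in_loop->setDfrontier(gate, observability);
        }
    };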
Another method that could be used to calculate the D frontier would be for the
master to broadcast a message to every test_generator to report the gate number and
observability of its D frontier objective. The master could then select the new objective
directly. Although the “broadcast” method was not tested, the “circular loop” method
described above has two apparent advantages which led to its selection.
First, the loop method results in less message traffic on average. The broadcast
method always results in 2N messages, where N is the number of partitions of the circuit
and thus of test_generator objects. The loop method results in N + 2 messages in the best
case, where the first test_generator in the loop has the objective, and 2N + 1 messages in the
worst case, where the last test_generator in the loop has the objective.
Second, the loop method off-loads most of the work in this step from the master to
the test_generator objects. This fact may become important later in the parallel TOP-TGS
systems as the master object becomes the computational bottleneck.
3.2.2.5 Backtracking
As discussed previously, backtracking is the process of moving back up the search
tree when the current input assignment can not generate a test. There are two conditions
which, when present, indicate that a test is not possible under the current circuit state. The
first condition occurs when the faulty node has been set to the same value as the stuck-at
value. In this case, the fault has not been sensitized because the node value will be the same
in both the faulty and fault-free circuits. The second condition occurs when the D frontier
disappears. In this case, there is no ‘D’ value on a primary output and there is no gate in the
circuit with a ‘D’ or ‘D̄’ value on one of its inputs and an ‘X’ on its output. This fact
indicates that there is no possible path that can be sensitized to allow the value on the faulty
node to be observed.
The actual process of backtracking is handled by the master object. Backtracking is
started by the set_objective_status(int status) and set_Dfrontier_status(int status) methods
in the master. The set_objective_status() routine is called by the test_generator with the
faulty node in its database after each implication phase until the fault is sensitized. If the
faulty node has been set to the incorrect value, set_objective_status() is called again with
the ERROR_OBJWRONGSET flag as the argument. This call signals the master to begin
the backtrack process.
The set_Dfrontier_status() routine is called by the test_generator which has the next
objective in the D frontier. The test_generator with the next objective is determined using
the loop method as described in section 3.2.2.4. If none of the test_generators has a gate on
the D frontier, the last test_generator in the loop will call the set_Dfrontier_status() method
with the ERROR_NOGBLDFRONTIER flag. This call will also signal the master to
begin backtracking.
Regardless of how the backtracking process is begun, it proceeds in the same
manner. It begins by popping the last input assignment off of the stack and checking to see
if the alternate value has been tried. This check is performed by determining if a flag in the
stack structure has been set. If the alternate value has been tried, an unknown ‘X’ is
assigned to that input and implication is performed. This process restores the circuit state
to the point it was before that input assignment was made and removes the bottom node
from the search tree.
The process of popping input assignments off of the stack and removing them from
the search tree is continued until one is found in which the alternate value has not been tried.
When this situation occurs, the value of the assignment is inverted, its flag is set indicating
that the alternate value has been tried, and it is pushed back on the stack. The new value is
then placed on the input and implication is performed. When the implication has been
completed, the master invokes the setnextobjective() method in the appropriate
test_generator. The result of this call will be either a return call to set_objective_status() or
set_Dfrontier_status() with the next objective. If either of these methods is called with the
error flags, backtracking will continue. Otherwise, the test generation process continues as
it did before backtracking began.
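A sketch of this backtrack loop is shown below. Each stack entry records a primary
input assignment and a flag noting whether its alternate value has already been tried; the
names and types are illustrative, not the actual master implementation:

    #include <vector>

    struct Assignment { int input; int value; bool alternate_tried; };
    void imply_on_owner(int primary_input, int value);   // remote imply() call
    const int VALUE_X = 2;                               // unknown logic value

    // Returns false when the stack empties, i.e., the fault is redundant.
    bool backtrack(std::vector<Assignment>& stack)
    {
        while (!stack.empty()) {
            Assignment a = stack.back();
            stack.pop_back();
            if (a.alternate_tried) {
                imply_on_owner(a.input, VALUE_X); // undo: restore the 'X' state
                continue;                         // keep removing tree nodes
            }
            a.value = 1 - a.value;                // try the alternate value
            a.alternate_tried = true;
            stack.push_back(a);
            imply_on_owner(a.input, a.value);
            return true;                          // resume test generation
        }
        return false;                             // fault proven redundant
    }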
If at any time during backtracking the stack becomes empty, then the fault has been
proven redundant and work on it is stopped. The master also keeps track of the number of
backtracks that have been performed and stops work after a fixed number. The master then
marks the current fault as hard-to-detect. This process is normally done in many ATPG
systems to keep runtimes reasonable in circuits with a large number of hard-to-detect or
redundant faults.
3.3 Scalar TOP-TGS System Results
This section presents results of the TOP-TGS system on circuits that have been
partitioned using the various methods. Several of the ISCAS ‘85 benchmark circuits were
used to test this system. Table 3.1 shows the characteristics of the circuits for which data is
presented. These characteristics include the number of primary inputs and outputs, the total
number of gates, the total number of faults, and the number of levels in the circuit.
For this experiment, all of the faults in the circuits were targeted which means that
ATPG was performed on each fault in the circuit. Fault collapsing to reduce the size of the
fault list was not done and fault simulation was not used to remove detected faults from the
fault list.
3.3.1 Comparison of Partitioning Methods
The results for performing ATPG on circuits that were topologically partitioned
using the methods presented in the previous section are shown in Table 3.2. The results are
presented for 2, 4 and 8 partitions. In all cases, each partition resided on a separate
processor. The first column contains the average number of messages passed between
partitions in order to perform ATPG on each fault. The second column contains the average
time taken to perform ATPG on each fault. The third column is the percentage of memory
necessary to hold the circuit database on the individual processors compared to the
uniprocessor implementation. It is calculated using the following formula:
%Memory = [ max(M_1, M_2, ..., M_N) / M_uni ] × 100    (3.1)
where M_uni is the amount of memory required for the circuit database in the uniprocessor
case, and M_1 ... M_N are the amounts of memory required for the circuit database on each
of the N processors. Using the maximum of the memory used on each individual processor
gives an indication of the memory required in the worst case.
Table 3.1 Characteristics of Benchmark Circuits

    Circuit   No. of PIs   No. of POs   No. of Gates   No. of Levels   No. of Faults
    C432          36            7            160             18              864
    C499          41           31            202             12              998
    C880          60           26            383             18             1760
Table 3.2 TOP-TGS Results on Benchmark Circuits
(Msgs = total messages per fault; Time = total time per fault in ms; %Mem = percent
memory; MemEff = memory efficiency. An asterisk marks the best value in each column
for each circuit and processor count.)

Circuit C432
                       2 Processors                4 Processors                8 Processors [1]
Partition Type    Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff
Random            1442  1176  95.4  52.8      3330  1424  77.0  35.7      4608  1468  51.0  28.9
Fanin Cones       *755  *483  91.3  61.3      1267   674  69.4  50.6      1802   791  48.0  42.7
Fanout Cones       806   586  95.4  61.4      1580   818  67.3  43.4      2336   979  49.3  36.4
Input Paths        839   571 *73.0  69.8      1356  *625 *54.1 *56.6      2124  *778  45.9 *48.9
Output Paths       809   527  92.9 *71.8     *1158   652  69.9  51.7     *1617   844  62.8  41.5
Gate Level         828   595  82.7  63.2      1423   784  62.2  43.4      2097   967 *42.3  32.7

Circuit C499
                       2 Processors                4 Processors                8 Processors
Partition Type    Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff
Random            1172   962  93.8  53.8      2170  1045  72.4  37.3      3250  1144  48.6  30.0
Fanin Cones        778  *475  70.8  70.6     *1028  *522  51.0  51.8     *1368  *637 *33.9  43.8
Fanout Cones       831   528  85.6  64.5      1224   720  55.6  45.8      1692   801  47.3  41.4
Input Paths        845   602  81.5  61.4      1070   632  70.4  51.8      1534   714  54.7  44.2
Output Paths      *759   478  70.0 *72.3      1159   545 *49.0 *55.6      1725   638  37.0 *47.8
Gate Level         841   664 *66.7  60.3      1217   691  74.9  46.9      1907   948  57.2  33.0

Circuit C880
                       2 Processors                4 Processors                8 Processors
Partition Type    Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff
Random             162   176  92.6  54.1       278   182  73.4  38.9       391   189  48.8  29.3
Fanin Cones        116    84  73.0  77.7       148    97  59.4  62.3       207   118  39.1  55.2
Fanout Cones      *106   *79  99.5 *87.9      *121   *85  98.2 *88.0      *151  *100  95.7 *86.5
Input Paths        122    94  68.0  74.0       157    98  54.2  63.1       213   119  40.9  51.8
Output Paths       118    84 *61.9  81.9       160    89 *41.3  66.4       210   103 *27.3  59.4
Gate Level         123   101  76.6  74.2       169   114  55.8  52.4       245   145  44.5  40.5

[1] Note: only 7 processors for Fanout Cones and Output Paths for circuit C432.
The fourth column
is the total memory efficiency of the multiprocessor implementation. Memory efficiency is
the ratio of the memory used in the uniprocessor case to the total memory used across all
processors in the multiprocessor case. It is calculated as follows:

MemoryEfficiency = [ M_uni / Σ_{i=1}^{N} M_i ] × 100    (3.2)
In each column, the entry with the best value is marked with an asterisk for each circuit. For
example, in the two processor case of circuit C432, fanin cones had the least number of total
messages per fault and the lowest run time per fault. However, input paths has the lowest
percent of memory and output paths has the highest memory efficiency. The overall goals
of topological partitioning in this system are to reduce the amount of memory required on
each node while maintaining the overhead caused by message passing at a minimum. The
message passing overhead may be measured indirectly by the increase in total runtime per
fault. Therefore, the percent of memory used and the total runtime are the most important
entries for selecting the optimum partitioning method. For circuits C499 and C880, the
output paths method of partitioning has the best overall performance. It typically results in
the lowest percent memory and runtime or is very close to the lowest. It also has high
overall memory efficiency. For circuit C880, fanout cones would seem to be the optimum
because it has the best runtimes for all three processor configurations, but its percent
memory is always over 95 percent. Output paths has the lowest percent of memory for all
processor configurations and yet has runtimes close to that of fanout cones.
For circuit C432, output paths performs adequately in terms of runtime, but it does
not result in a significant reduction in percent memory required. This result is caused by the
fact that C432 only has 7 primary outputs whereas the other circuits have at least 26. In fact,
since C432 only has 7 outputs, the maximum number of partitions and hence processors, is
7 and the results presented in the 8 processor column are really for 7 processors. Even in
the 2 and 4 processor case, there are only 7 subpartitions that can be combined into full
partitions, and the method does not seem to lead to optimum results. For circuits such as C432
which have significantly fewer primary outputs than inputs, partitioning by input paths is
recommended.
One fact to notice in the results is that although message passing has the largest
impact on runtime, the partitioning method that results in the least messages per fault does
not always have the lowest runtime. This fact indicates that message passing is not the only
impact on runtime for the multiprocessor case versus the uniprocessor case. The
implication procedure is the only process that is significantly different between the two
implementations. In the multiprocessor case, all of the gates in each processor’s memory
must be processed during each implication iteration. Therefore, poor memory efficiency
will adversely affect the time taken by implication. This fact can be seen in the results for
input paths and output paths for 8 processors and circuit C499. Input paths generates 1534
messages per fault with a runtime of 714 ms while output paths generates 1725 messages
with a runtime of 638 ms. However, partitioning by output paths has a higher memory
efficiency of 47.8% compared to 44.2%. This reduction of duplicated gates allows
partitioning by output paths to produce a lower runtime. Another factor that can affect
runtimes is the fact that in the multiprocessor case, separate processors can perform
implications in parallel. This result assumes that the circuit topology and the way it is
partitioned causes this parallelism to take place. This effect is difficult to measure, however,
and appears to be a minor contributor to runtime.
3.3.2 Conclusions
Several obvious conclusions can be made from these results. Random partitioning
performs the worst. It resulted in the longest runtimes and the largest percentage of memory
even though the actual number of gates in each partition was fairly well balanced. This
result is due to the fact that random partitioning results in the most duplication of gates
between partitions as indicated by its poor memory efficiency. Partitioning by gate level
performed better than random in all categories, but usually worse than any of the other
methods in terms of runtime and total messages. This result is as expected because
partitioning by gate level divides the circuit transversely while the algorithm moves
through the circuit longitudinally (from inputs to outputs). Thus partitioning by gate level
should result in more messages and therefore longer runtimes than the other methods.
Partitioning by fanin and fanout cones often resulted in the best runtime results, but
the percent memory used was often over 90% of the uniprocessor case. This result is caused
by the fact that fanin and fanout cone partitioning can result in one partition growing much
faster than the others leading to unbalanced partition sizes.
The two new methods developed for this research, partitioning by output paths and
input paths, resulted in the highest reduction in percent memory required in most cases
while generating runtimes close to the best case. For most circuits, partitioning by output
paths is recommended. However, for circuits with many more primary inputs than outputs,
partitioning by input paths is recommended. In the subsequent experiments undertaken for
this thesis, the partitioning method that performed the best for each circuit was used unless
specifically noted. In the case of circuits C499 and C880, the optimum method was
partitioning by output paths, and for circuit C432, partitioning by input paths.
The next chapter presents an analysis of the topologically partitioned ATPG process
in order to estimate its performance on parallel machines with lower communications
latency than the ES-KIT 88K.
Chapter 4
Analysis of Topologically Partitioned ATPG
This chapter presents an analysis of the complexity of the topologically partitioned
PODEM ATPG process in terms of a binary search. This analysis involves an identification
of the operations required to construct the search tree and a prediction of the number of each
type of operation required to generate a test for a specific fault. The computational
complexity and communications requirements for each type of operation in a topologically
partitioned ATPG system will be discussed. Predictions of the effect of increasing
communications cost will be made and compared with experimental data. Finally, the
results will be analyzed and used to predict the performance of the ATPG system on a
multiprocessor with lower communications costs.
4.1 The ATPG Search Process
This section presents a discussion of the PODEM algorithm as a binary search.
Recall from the description of PODEM in Chapter 2 that PODEM represents the search
space as a binary tree. Nodes in the tree, other than leaf nodes, represent primary inputs,
and edges in the tree represent logic value assignments to these nodes. Leaf nodes in the
tree represent either successful generation of a test vector, or determination that a test is not
possible with the given input assignments. The latter case is sometimes called an
inconsistency. Figure 4.1 shows an example circuit and a search tree for a specific fault. In
this case, the search tree is for a stuck-at-0 fault on the input of gate 3. Notice that a wrong
decision somewhere in the test generation process has resulted in an inconsistent
assignment of A=’0’, B=’1’, C=’0’, and A=’0’, B=’1’, C=’1’ being tried. At this point,
nodes were removed from the tree and more were added in another portion of the search
space.
Adding a node to the search tree requires the selection of a primary input for
assignment, and the actual assignment of the value to the input followed by implication of
the new circuit state. This process requires three separate operations.
First, a new intermediate goal required for a successful test must be selected. This
goal may be the sensitizing of the fault by setting the value of the faulty node opposite the
faulty value, or propagating the effect of the fault by setting the inputs of a gate along the
sensitized path to the non-controlling value. This process is called selecting the next
objective, or simply an objective operation.
Second, an input assignment necessary to set the objective value on the objective
node must be selected. This process is accomplished by a backtrace operation. The
backtrace operation consists of continuously moving the objective from a single gate’s
output to one of its inputs until a primary input is encountered.
Finally, the actual assignment to the primary input must be made and the new circuit
state computed. The forward implication operation accomplishes this process. Thus, adding
a node to the tree requires an objective, backtrace, and implication operation in that order.
On the other hand, removal of a node from the tree, called a backtrack, requires
either an objective operation followed by an implication, or only an implication. If the node
to be removed is a leaf node, an objective operation is required to determine that a test is
no longer possible. This step is followed by implication of the alternate value for the last
input assignment in the tree. If the node removed is not a leaf node, then the only operation
required is the implication of the alternate value for the input.

Figure 4.1 Circuit under test and search tree.
As an example, consider the trees shown in Figure 4.2. Tree A is a representation
of test generation which did not require any backtracking. Notice that the test involves
assignment of two input values and that generation of the test required 2 backtraces, 2
implys, and 3 objective operations. Generation of a test without backtracks represents the
lower bound on the complexity of the ATPG process and the number of individual
operations required. In general, ATPG without backtracking requires n backtraces, n
implys, and n+1 objectives, where n is the number of inputs specified in the test vector.
Tree B is a representation of attempted test generation on a redundant fault. In this
case, two input assignments must be made each time before a test is found to not be
possible. The tree contains the maximum number of internal nodes and leaf nodes for a tree
with two inputs specified. In this case, 3 backtraces, 7 implys and 7 objective operations
are required to construct the tree. In addition, 4 backtracks are required. This shape of tree
represents the worst case in terms of number of operations required. The next section
presents the calculation of the number of backtracks and operations necessary to construct
a tree of this type in the general case of n inputs specified.
Figure 4.2 State space search trees and their operational requirements (o = objective,
i = imply, b = backtrace, bt = backtrack).
4.2 Operations Requirements for Balanced Redundant Trees
Figure 4.3 shows a worst case search tree for a redundant fault with 3 inputs
specified. The tree has been annotated to show the operations required for each node.
Notice that the tree has 2^n − 1 internal nodes, 2^n leaf nodes, and a total of n + 1 levels. The
levels are numbered l = 0, for the root level, through l = n, for the leaf node level. Each level
has 2^l nodes. The total number of operations can be calculated by determining the number
of operations required in each level and summing across all levels.
For example, consider the number of backtracks required. Each level from l = 1 to n
requires 2^l − 1 backtracks. Thus the total number of backtracks, α, is:

α = Σ_{l=1}^{n} (2^l − 1) = 2(2^n − 1) − n    (4.1)
Similarly, the number of implications can be calculated by noting that each of the
levels from l = 1 to l = n requires 3(2^(l−1)) − 1 implications. Then the number of
implications, β, is given by:
β = Σ_{l=1}^{n} [3(2^(l−1)) − 1] = 3(2^n − 1) − n    (4.2)

Figure 4.3 State space search tree for worst case redundant fault (o = objective, i = imply,
b = backtrace, bt = backtrack).
The number of backtraces, χ, required for levels l = 1 to l = n is 2^(l−1) per level, or:

χ = Σ_{l=1}^{n} 2^(l−1) = 2^n − 1    (4.3)
Finally, the number of objectives required per level l = 1 to l = n is also 2^(l−1). The leaf
node level also requires an additional 2^n objectives, one for each leaf node, to determine if
a test is possible at this point. Then the number of objective operations, δ, is given by:

δ = Σ_{l=1}^{n} 2^(l−1) + 2^n = 2^(n+1) − 1    (4.4)
Table 4.1 contains a summary of the number of operations required for n = 1, 2, 3, and
4 as well as the general case. The next section explains how these formulas can be used to
estimate the number of operations required by a non-uniform search tree. Such a tree is
typically constructed when searching for a hard-to-detect fault which results in a test being
found within a specific backtrack limit. From this point forward, a balanced search tree
such as discussed in this section for n inputs specified will be denoted by the symbol ⟨n⟩.
Table 4.1 Operational Requirements vs. Number of Inputs Specified

    # inputs (n)         α              β             χ            δ
         1               1              2             1            3
         2               4              7             3            7
         3              11             18             7           15
         4              26             41            15           31
         n         2(2^n−1)−n     3(2^n−1)−n      2^n−1      2^(n+1)−1
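As a quick consistency check of (4.1) through (4.4), take the n = 3 tree of Figure 4.3:

    α = (2^1 − 1) + (2^2 − 1) + (2^3 − 1) = 1 + 3 + 7 = 11
    β = (3·2^0 − 1) + (3·2^1 − 1) + (3·2^2 − 1) = 2 + 5 + 11 = 18
    χ = 2^0 + 2^1 + 2^2 = 7
    δ = 7 + 2^3 = 15

which matches the n = 3 row of Table 4.1.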
4.3 Operations Requirements for Arbitrary Unbalanced Trees
For a circuit with n inputs, the entire search space for a specific fault is bounded by
⟨n⟩. However, test generation typically requires far fewer operations than are required to
search ⟨n⟩. This reduction is due to the fact that a test for a specific fault usually requires
only i < n inputs to be specified. Also, a backtrack limit is frequently set in ATPG such that
if a test for a fault is not found within the specified number of backtracks, the fault is
marked as undetected and dropped from the fault list. This backtrack limit prevents the test
generation process from taking exponential time in the worst case. A method has been
developed to estimate the operations required to generate an arbitrarily shaped unbalanced
search tree in which m backtracks are required and i inputs are specified.
Consider the case of a fault for which test generation requires 11 backtracks and a
total of 5 inputs specified. The backtrack number of 11 is sufficient to generate a tree of
⟨3⟩, and this would leave an extra 2 inputs to be specified. Thus, a search tree such as
shown in Figure 4.4A may result for this fault. In this case, the operational requirements
may be calculated by adding the requirements of ⟨3⟩ with those required to generate the
two extra nodes. The general case of a fault which requires m = 2(2^n − 1) − n backtracks
and a total of i inputs specified is shown in Figure 4.4B.
Most typical faults do not generate trees of this shape, but experimental data has
shown that using this type of tree as an estimator yields good results. For a fault with an
arbitrary number of backtracks, m, equation (4.1) must be solved for n. This yields:

n = log_2(m + n + 2) − 1    (4.5)

This equation is of course transcendental and must be solved numerically for n. Once this
is done, the operations requirement can be found by calculating the requirements for ⟨n⟩
using (4.2), (4.3), and (4.4) and adding those to the requirements to generate the
extra (i − n) nodes.
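Because (4.5) converges rapidly under simple fixed-point iteration, a few lines suffice to
solve it numerically. The following sketch illustrates the computation; the tolerance and
iteration limit are arbitrary choices, not values from the original system:

    #include <cmath>

    // Sketch: solve the transcendental equation n = log2(m + n + 2) - 1 of
    // (4.5) for n by fixed-point iteration, given the backtrack count m.
    double solve_n(double m)
    {
        double n = std::log2(m + 2.0);              // initial guess
        for (int k = 0; k < 100; ++k) {
            double next = std::log2(m + n + 2.0) - 1.0;
            if (std::fabs(next - n) < 1e-9)
                break;
            n = next;
        }
        return n;   // e.g., m = 11 gives n = 3, matching Table 4.1
    }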
Table 4.2 shows the comparison of the actual vs. estimated operations required by
a number of faults in the C432, C499, and C3540 benchmark circuits. The data shows that
the above estimation method can predict the number of operations required to find a test for
a single fault within 7.0%.
4.4 Communications Requirements for Topologically Partitioned ATPG
The next step in the process of analyzing the performance of topologically
partitioned ATPG is to determine the communications cost. This calculation can be
performed by combining the estimated number of operations required for ATPG with the
estimated communications costs of each operation type. The equations presented in the
previous section can be used to estimate the total number of operations required for ATPG
on an entire circuit. However, this technique requires an estimate of the average number of
inputs specified in the test vector set, i_e, and an average number of backtracks required to
generate the test vector set, m_e. Equation (4.5) can then be used to calculate n_e, from
which β_e, χ_e, and δ_e can be determined using (4.2), (4.3), and (4.4).
Figure 4.4A State space search tree (the ⟨3⟩ tree plus two extra input assignments).
Figure 4.4B General case state space search tree (the ⟨n⟩ tree plus (i − n) extra input
assignments a_1 ... a_(i−n)).
Given these numbers, all that remains is to determine the average communications
required for an implication operation, a backtrace operation, and an objective operation. Let
i_comm, b_comm, and o_comm denote these quantities. The total number of messages
required for ATPG can be calculated by:

M_comm = f × (i_comm β_e + b_comm χ_e + o_comm δ_e)    (4.6)

where f is the number of faults in the circuit under consideration.
Table 4.2 Actual vs. Estimated Operational Requirements

                   # inputs            actual              estimated
    Fault #        specified    α     β     χ     δ      β     χ     δ
    C432 - 15         12       247   380   134   261    379   132   260
    C432 - 27         21       247   389   143   270    388   141   269
    C432 - 145        11       103   165    62   115    163    60   115
    C432 - 149        14       104   169    65   119    166    63   118
    C499 - 280        34         8    45    37    43     44    36    43
    C432 - 171        11        33    60    27    45     59    26    45
    C432 - 175        14        34    64    30    49     63    29    49
    C432 - 197        11         9    24    15    21     23    14    21
    C499 - 282        34         1    35    34    36     35    34    36
    C499 - 480        40       247   410   163   289    407   160   288
    C499 - 793        37       247   407   160   286    404   157   285
    C499 - 871        35       247   403   157   284    402   155   283
    C499 - 873        29       247   398   152   287    396   149   277
    C3540 - 2000      14       182   286   104   197    284   102   197
    C3540 - 2002      11       232   358   126   244    357   124   245
    C3540 - 2004      11       207   324   115   221    321   114   222
    C3540 - 2016      14       247   382   136   263    381   134   262
    C3540 - 2017      12        94   152    58   107    150    56   107
    C3540 - 2047      19        19    47    28    39     46    27    39

Of course, the difficulty of this method is estimating the values of i_e and m_e and the
communications requirements, i_comm, b_comm, and o_comm. The values of i_e and m_e are
dependent upon the circuit under consideration and the ATPG algorithm used. The values
of i_comm, b_comm, and o_comm are dependent not only on those factors, but also on the
algorithm used to partition the circuit and the number of partitions the circuit was divided
into.
One way to estimate these parameters is to use the method presented in [41] where
the values derived from partial ATPG are used to estimate the requirements for complete
ATPG. That is, the topologically partitioned ATPG system is used to generate tests for a
small number of faults and the values of i_e, m_e, i_comm, b_comm, and o_comm are
measured. These numbers are then used to estimate the communications costs of complete
test generation for all faults using (4.6).
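Putting the pieces together, this estimate can be sketched as follows. The accounting
for the extra (i_e − n_e) nodes, one objective, backtrace, and imply each, follows the
node-addition rule of Section 4.1; the function and variable names are illustrative:

    #include <cmath>

    double solve_n(double m);   // fixed-point solution of (4.5), sketched earlier

    // Sketch: estimate the total message count of (4.6) from the measured
    // averages i_e, m_e and per-operation message costs i_comm, b_comm, o_comm.
    double estimate_Mcomm(double faults, double i_e, double m_e,
                          double i_comm, double b_comm, double o_comm)
    {
        double n_e   = solve_n(m_e);
        double p     = std::pow(2.0, n_e);
        double extra = i_e - n_e;                       // nodes beyond <n_e>
        double beta_e  = 3.0 * (p - 1.0) - n_e + extra; // (4.2) + extra implys
        double chi_e   = p - 1.0 + extra;               // (4.3) + extra backtraces
        double delta_e = 2.0 * p - 1.0 + extra;         // (4.4) + extra objectives
        return faults * (i_comm * beta_e + b_comm * chi_e + o_comm * delta_e);
    }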
The graphs in Figure 4.5, Figure 4.6, and Figure 4.7 show the results of using this
method to estimate the total number of messages required for topologically partitioned
ATPG on the C432, C499, and C880 circuits. In each case, the circuits are partitioned onto
two processors. Estimates were made of the total communications requirements in terms of
the total number of messages after ATPG was performed on 20%, 40%, 60%, 80%, and
100% of the faults in the circuit. The actual measured number of messages required is also
shown on the graphs.
Figure 4.5 Estimated number of messages required for C432.
Figure 4.6 Estimated number of messages required for C499.
Figure 4.7 Estimated number of messages required for C880.
(Each graph plots the actual and predicted total number of messages required against the
percentage of faults processed.)
Notice that for circuit C880, the estimated number of messages required is within
3% of the actual number required even after only 20% of the faults were processed. This
result is caused by the fact that the values of i_e, m_e, i_comm, b_comm, and o_comm are
nearly constant and close to the final values throughout the test generation process.
For the C499 and C432 circuits, however, the initial estimates for the total number
of messages required varied up to 42% from the actual number after the first 20% of the
faults were processed. In these circuits, the values of i_comm, b_comm, and o_comm were
fairly constant and accurate throughout the test generation process, but the values for i_e
and m_e varied widely from the final values. This variation in i_e and m_e caused the
calculated values of β_e, χ_e, and δ_e to be inaccurate initially.
In the C432 circuit, the first portion of the fault list contains a much higher
percentage of hard-to-detect faults than does the entire list. This fact causes the initial
operations estimates, and hence the communications requirements estimates, to be higher
than actual. In C499, however, the opposite is true. The initial portion of the fault list
contains a lower percentage of hard-to-detect faults than the entire list and thus the
communications requirements estimates were initially lower than actual. In both cases
though, the estimated number of messages required became closer to actual as the ATPG
process continued. Both estimates were within 7% by the time all faults had been
processed.
4.5 Estimating the Effects of Changing Communications Latency
Once the number of messages required for topologically partitioned ATPG has been
determined, the total increase in test generation time due to communications costs can be
determined. Because this version of the TOP-TGS system is essentially scalar, the time
necessary to send the required messages can be added directly to the computation time. The
computation time itself is difficult to calculate, but it can be determined directly by
measuring the time required for test generation by the serial (non distributed) version of the
TGS system running on the same platform. The total test generation time for the distributed
system, τ_dist, can be calculated by:

τ_dist = τ_serial + τ_comm + τ_ov    (4.7)
where τ_serial is the non-distributed TGS execution time, τ_comm is the total message
communications time, and τ_ov is the additional execution time required by the
computational overhead of the distributed ATPG algorithm. A more detailed analysis of the
computations which comprise τ_ov will be presented in the next section. The
communications time, τ_comm, for a given parallel processor can be calculated as:
Figure 4.6 Estimated number of messages required for C499. [Graph: actual vs. predicted
total number of messages required vs. percentage of faults processed.]
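As a rough illustrative check (the message count here is read off the order of magnitude in
Figure 4.5, not a measured constant): if complete ATPG for C432 on two processors requires
on the order of Μcomm ≈ 7×10^5 messages, then τcomm ≈ 7×10^5 × 0.212 ms ≈ 150 seconds
of pure communications time on the ES-KIT 88K.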
The overhead time, τov, is impossible to calculate a priori because it depends on the
fault under consideration, the circuit topology, and the partitioning method used. However,
because it is possible to calculate τcomm and to measure τdist and τserial, τov can be
determined using (4.7). When (4.8) is substituted into (4.7), the topologically partitioned
ATPG runtimes, τdist, can then be calculated for various communications latencies,
∆comm.
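Written out explicitly, the substitution is simply (this restates (4.7) and (4.8), with a latency
scaling factor k introduced for the experiments that follow):

    τdist(k) = τserial + τov + k Μcomm ∆comm

so the predicted runtime is linear in k with slope Μcomm ∆comm.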
Figure 4.7 Estimated number of messages required for C880. [Graph: actual vs. predicted
total number of messages required vs. percentage of faults processed.]

In order to verify this method of predicting how τdist changes with ∆comm, an
experiment to measure the effect directly was needed. Because it is not possible to
change the actual communications latency of the ES-KIT 88K directly, some means of
simulating the effect of changing ∆comm was necessary. Again, because this version of the
TOP-TGS system is scalar, the communications latency between processors simply
represents the time from the point where useful computation halts on one processor to the
point where it begins on another. Therefore, it is possible to effectively increase this time
by performing useless computations on the receiving processor for some time ∆spin after
the message is received. If ∆spin = ∆comm, then the communications latency is
effectively doubled. This process is illustrated in Figure 4.8.
Figure 4.8 Simulating increased communications latency. [Diagram: two message timelines
between processors. Without the spin loop, the delay between send_message() on one
processor and receive_message on the other is ∆comm; with a loop of SPIN_NO floating
point multiplies inserted after receive_message, the effective delay becomes
∆eff = ∆comm + ∆spin = 2∆comm.]
Notice that this method is only valid for a scalar algorithm. In a parallel algorithm,
the useless computations could delay useful computations in another portion of the
algorithm; in a true parallel system with a genuinely longer latency, those computations
would instead be performed in parallel with the communications delay.
The useless computation placed in the TOP-TGS system to increase the
communications latency consisted of a loop of floating point multiply operations on
operands stored in memory. Experimentation determined that each multiply of this type
takes 1.80009 µs to complete; thus a loop of 117 of these operations is required to make
∆spin = ∆comm. One pass through the 117 multiply operations therefore doubles the
effective communications latency (a scaling factor of two), two passes triple it, and so on.
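A minimal sketch of this instrumentation is shown below; the names spin_delay,
MULTIPLIES_PER_DELTA, and spin_sink are illustrative rather than the actual TOP-TGS
identifiers, and only the constants 117 and 1.80009 µs come from the measurements above.

    // Hypothetical sketch of the spin delay used to scale communications latency.
    // 117 memory-operand FP multiplies at ~1.80009 us each approximate one
    // ES-KIT 88K communications latency (~0.212 ms).
    const int MULTIPLIES_PER_DELTA = 117;

    volatile double spin_sink = 1.000001;  // memory operand; volatile defeats optimization

    // Called at the start of each remote method: burns (scale - 1) latencies of
    // useless floating point work so the effective latency becomes scale * delta_comm.
    void spin_delay(int scale) {
        for (int pass = 1; pass < scale; ++pass)
            for (int i = 0; i < MULTIPLIES_PER_DELTA; ++i)
                spin_sink = spin_sink * 1.000001;  // one FP multiply per iteration
    }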
These spin loops were added to the TOP-TGS system at the beginning of each
remote method. These methods are the ones that must be invoked when sending messages
between processors. The TOP-TGS system was further modified to perform ATPG on a
single fault multiple times for timing purposes. This modification was made so that
runtimes could be calculated and measured for single faults and compared for greater
accuracy. The faults tested were chosen to represent a mix of easy and hard faults that were
distributed throughout the circuit topology.
The graphs in Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show a
comparison between the predicted and measured test generation times for representative
faults in circuits 74181, C432, C499, and C880. In all cases, the predicted and measured
times increased linearly with the communications scaling factor. In some cases the slopes
of the predicted and measured increases differed slightly, probably due to inaccuracies in
measuring the actual communications latency, as discussed in Chapter 2. The results verify
that (4.7) is an excellent predictor of the scaling of τdist with changes in ∆comm.
4.6 Distributed ATPG Overhead
If the TOP-TGS system were run on an ideal multiprocessor with instantaneous
communications between processors, i.e. ∆comm = 0, then the runtime would be:

    τdist = τserial + τov                                            (4.9)

Thus τov is a significant factor in the distributed runtime: it represents excess
computation time that must be overcome by any parallelism added to the algorithm, even
on an ideal multiprocessor with ∆comm = 0.
There are three ways in which τov may be determined. The first is to measure the
serial execution time, the distributed execution time, and the total number of messages
required for ATPG; τov can then be calculated using (4.7) and (4.8). The second method is
to use the linear relationship between the actual runtimes and the communications scaling
factor to find the intercept of the runtime curve at a scaling factor of zero. Using the data
and graphs presented in the previous section, these two methods result in values of τov that
agree to within 5%.

Figure 4.9 Test generation times vs. communications scaling factor, 74181. [Two graphs:
measured and predicted runtimes (ms) vs. communications scaling factor for faults #1,
#181, and #379.]
The third method of calculating τov is to determine the types of computations which
constitute this overhead and the number of times they are executed in the distributed versus
the serial algorithm; τov can then be estimated using the execution rate of the host
processor. This method is by far the most difficult, but it is very useful in that it helps
identify areas of inefficiency in the distributed algorithm that may be improved.

Figure 4.10 Test generation times vs. communications scaling factor, C432. [Three graphs:
measured and predicted runtimes (ms) vs. communications scaling factor for faults #1,
#501, #145, #731, and #619.]

There are two areas of inefficiency that have been identified thus far in the
TOP-TGS system: increased numbers of gate evaluations during the calculation of the
D frontier, and increased numbers of gate evaluations during the forward implication
process.
As discussed in section 3.2.2.4, the calculation of the D frontier is a global process
that involves all of the processors. In the current implementation, the calculation is done
sequentially with the processors communicating in a ring. Each processor calculates its D
frontier value and compares it to the value sent to it by the previous processor. If its value
is better than the one sent to it, the processor forwards its value to the next processor in the
ring. If its D frontier is not better, the processor forwards the value it received to the next
processor. The first processor in the ring to receive its own value back holds the globally
best D frontier. Under this scheme, the worst case occurs when the last processor in the
ring holds the best D frontier; each processor must then calculate the D frontier twice,
which involves evaluating every gate in the circuit twice for each D frontier (or objective)
operation, whereas the serial system evaluates each gate only once. The overhead from this
inefficiency, τov1, is therefore approximately:

    τov1 ≅ Ng δ Io R88                                               (4.10)

where Ng is the number of gates in the circuit, δ is the total number of objective operations
from (4.4), Io is the number of instructions required for each gate evaluation in the
D frontier, and R88 is the single-instruction execution time of the 88100 processor.

Figure 4.11 Test generation times vs. communications scaling factor, C499. [Two graphs:
measured and predicted runtimes (ms) vs. communications scaling factor for faults #1,
#480, #621, #931, and #961.]

Figure 4.12 Test generation times vs. communications scaling factor, C880. [Three graphs:
measured and predicted runtimes (ms) vs. communications scaling factor for faults #41,
#801, #1201, #1521, and #1741.]
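The following sketch illustrates the ring selection just described; all names (DFVal,
compute_local_dfrontier, send_to_successor) are hypothetical rather than the actual
TOP-TGS interface, and the two helper routines are assumed to be supplied by the runtime.

    // One candidate circulating around the ring.
    struct DFVal {
        int owner;   // processor that produced this candidate
        int gate;    // best local D-frontier gate in that partition
        int score;   // heuristic cost of the candidate; lower is better
    };

    DFVal compute_local_dfrontier(int my_id);   // evaluates every local gate (assumed)
    void  send_to_successor(const DFVal& v);    // forwards to the next ring member (assumed)

    // Invoked when the ring token arrives. Returns true once this processor's own
    // candidate has survived a full trip, i.e. it holds the global D frontier.
    bool on_ring_token(int my_id, const DFVal& incoming) {
        if (incoming.owner == my_id)
            return true;                             // selection complete
        DFVal mine = compute_local_dfrontier(my_id); // re-evaluates the local gates
        send_to_successor(mine.score < incoming.score ? mine : incoming);
        return false;
    }

In the worst case every processor executes compute_local_dfrontier() twice per objective,
which is exactly the doubling of gate evaluations that (4.10) charges to τov1.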
Examination of the symbol table for the test generator object reveals that the code
for the setDfrontier() method begins at address 0x12DC and ends at 0x1620. This address
range constitutes 836 bytes, or approximately 209 32-bit instructions for the 88100. The gate
evaluation portion of this routine probably accounts for about 20% of this code, or 42
instructions. The manufacturer of the 88100 claims an average execution rate of 1.36 clock
cycles per integer instruction, or 14.7 million instructions per second for an 88100 with a 20
MHz clock. This figure corresponds to an R88 of 68.0 ns.
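Spelling out that arithmetic as a check (a restatement of the figures above, not new data):
R88 = 1.36 cycles ÷ 20 MHz = 68 ns, i.e. 1/(14.7×10^6) s, so each D-frontier gate
evaluation costs roughly Io R88 = 42 × 68 ns ≈ 2.9 µs, and by (4.10) a circuit of Ng gates
adds about Ng × 2.9 µs of overhead per objective operation.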
The increase in gate evaluations during forward implication in the distributed
ATPG algorithm vs. the serial ATPG algorithm is due to the fact that the two algorithms
process the gates in different order. The serial ATPG algorithm processes the gates in level
order from inputs to outputs whereas the distributed algorithm processes gates as their
inputs become active. The levelized processing of the serial algorithm guarantees that all
inputs to a gate have reached their current state values before the gate is evaluated. This
insures that each gate is evaluated at most once in every forward implication operation.
This levelized processing is not possible in the distributed system and this results in excess
gate evaluation steps when changing inputs to gates do not arrive at the same time.
As an example, consider the AND gate of Figure 4.13. In the serial forward
implication process, the inputs to the AND gate are both set to ‘1’ before the gate is
evaluated and its output is set to ‘1’. However, in the distributed case, if the imply messages
#1 and #2, arrive at different times, two evaluations are necessary. The first evaluation with
the inputs at ‘1’, ‘X’, results in an output of ‘X’ and the second with the inputs at ‘1’,’1’
τov1
NgδIoR88≅
69
results in the output becoming ‘1’.
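The double evaluation is easy to reproduce in isolation. The sketch below (illustrative code,
not TOP-TGS source) evaluates a two-input AND over the values {'0','1','X'} once per
arriving input event, mirroring the distributed case:

    #include <iostream>

    // Three-valued AND: a controlling '0' dominates; '1' AND '1' is '1';
    // anything else is unknown.
    char and2(char a, char b) {
        if (a == '0' || b == '0') return '0';
        if (a == '1' && b == '1') return '1';
        return 'X';
    }

    int main() {
        char in[2] = {'X', 'X'};
        int evaluations = 0;
        // Imply messages #1 and #2 arrive at different times; the gate is
        // re-evaluated on each event.
        for (int msg = 0; msg < 2; ++msg) {
            in[msg] = '1';
            char out = and2(in[0], in[1]);
            ++evaluations;
            std::cout << "after message " << msg + 1 << ": output = " << out << '\n';
        }
        std::cout << "evaluations = " << evaluations << '\n';  // 2, vs. 1 in level order
    }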
The number of gate evaluations performed by the serial ATPG system, ges, and by
the distributed system, ged, can both be measured, and the difference ∆ge = ged − ges
determined. The overhead due to excess gate evaluations during forward implication, τov2,
can then be calculated as:

    τov2 ≅ ∆ge γ Ii R88                                              (4.11)

where Ii is the number of instructions required for each gate evaluation, and γ is the average
number of inputs per gate. It is necessary to include the γ factor because the implication
procedure loops through the instructions once for each gate input. For most of the circuits
under consideration, γ = 2.
The code for the imply() procedure begins at address 0x199C and ends at 0x230C.
This address range constitutes 2416 bytes, or approximately 604 instructions. The gate
evaluation portion of this routine accounts for about 90% of the code, or 543 instructions.
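As a quick consistency check (my arithmetic on the figures above): each excess gate
evaluation costs γ Ii R88 = 2 × 543 × 68 ns ≈ 73.9 µs, so, for example, the 74181-1 entry of
Table 4.3 with ∆ge = 66 gives τov2 ≈ 66 × 73.9 µs ≈ 4.9 ms, matching the tabulated value
of 4.87 ms.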
Figure 4.13 Serial vs. distributed forward implication. [Diagram: a two-input AND gate
whose inputs cross partition boundaries; in the serial algorithm both inputs reach '1' before
the single evaluation, while in the distributed algorithm implication messages #1 and #2
arrive separately, forcing two evaluations.]

Table 4.3 shows the calculated values of τov1, τov2, and τovcalc = τov1 + τov2 for
the faults analyzed in the previous section. The calculated overhead, τovcalc, is compared
to the measured value, τovmeas, by the factor f = τovmeas/τovcalc. The factor f varies from
1.33 to 3.88. While this difference may seem large, the calculated and measured values of
the overhead time agree fairly closely given the gross approximations involved in
calculating the overhead time.
The variance in the ratios suggests that there may be additional excess calculations
being performed in the distributed ATPG system that were not taken into account by (4.10)
and (4.11). However, the variance may simply be due to variation in R88 caused by such
factors as cache miss rates.

Table 4.3 Calculated vs. Measured Overhead Times (times in ms)

Fault #        δ     τov1     ∆ge    τov2   τovcalc  τovmeas      f
74181-1        6     1.40      66    4.87      6.27    12.62   2.07
74181-181      8     1.87     153   11.30     13.71    19.20   1.46
74181-379     25     5.85     682   50.36     56.20    74.71   1.33
C432-1        11     6.62      19    1.40      8.02    31.16   3.88
C432-731     115    69.30    3066   226.0     295.3    653.4   2.21
C432-145       9     5.40      28    2.06      7.48    24.75   3.70
C432-501     260   156.7     7156   528.0     684.7   1368.9   1.99
C432-619     180   108.5     1761   130.0     238.5    654.9   2.91
C499-1        23    18.58      32    2.30     20.88    64.12   3.07
C499-480     289   233.5    12110   894.0    1127.0   1848.1   1.65
C499-621      42    33.94     182   13.44     47.38    88.73   1.87
C499-931     275   222.3     3188   235.4     457.7   1230.2   2.68
C499-961      14    11.31      42    3.10     14.41    25.55   1.77
C880-41       16    21.40      60    4.43     25.83    75.36   2.92
C880-801       4     5.35       2    0.15      5.49    11.18   2.04
C880-1201     17    22.77      55    4.06     26.83    73.78   2.75
C880-1521      9    12.05      63    4.65     16.70    36.99   2.21
C880-1741      3     4.01       6    0.43      4.45     9.67   2.71
4.7 Effects of Increasing the Number of Partitions
The results presented in the previous sections were for circuits divided into 2
partitions. This section presents results for circuits divided into 4 partitions.
The distributed ATPG algorithm is consistent in the sense that it always follows the
same path through the search space independent of the partitioning of the circuit. Therefore,
the number of operations required to generate the search tree will remain the same.
However, the number of messages required to perform each operation should increase. This
increase in messages will increase the communications cost, τcomm. Figure 4.14 shows the
graphs of the estimated number of messages for the C432, C499, and C880 circuits when
divided into 4 partitions.
Notice that the curves have the same shape as those for 2 partitions presented in
section 4.4. The estimated curve for C432 initially falls above the actual curve because of
the high percentage of hard-to-detect faults at the beginning of the fault list. The estimated
curve for C499 is initially below the actual curve because of the high percentage of hard-
to-detect faults at the end of the fault list. The estimated curve for C880 falls within a few
percent of the actual values throughout the ATPG process. One additional fact to notice is
the increase in the total number of messages required for ATPG with 4 partitions versus 2
partitions. In the case of C432, the 4 partition case requires almost 67% more messages than
the 2 partition case; the C499 and C880 circuits require 52% and 42% more total messages
respectively. These increases in the total number of messages required increase the
communications cost, τcomm, and thus the distributed ATPG runtime, τdist, directly.
The graphs in Figure 4.15 show how the actual and predicted ATPG times scale with
increasing communications latency, ∆comm, for selected faults. The graphs also show that
the ATPG times themselves have increased over the 2 partition case because of the
increased communications costs. The overhead time, τov, however, has not increased
significantly and can still be estimated using (4.10) and (4.11), as shown in Table 4.4.
Figure 4.14 Total message requirements for 4 processors. [Three graphs: actual vs.
predicted total number of messages required vs. percentage of faults processed for C432,
C499, and C880.]
Figure 4.15 Test generation time vs. communications scaling factor for 4 partitions. [Three
graphs: measured and predicted runtimes (ms) vs. communications scaling factor for
selected faults in C432, C499, and C880.]
Table 4.4 Calculated vs. Measured Overhead Times (4 partitions; times in ms)

Fault #        δ     τov1     ∆ge     τov2   τovcalc  τovmeas      f
C432-145       9     5.42      30     2.21      7.63    27.65   3.62
C432-501     260   156.7    10249   756.8     913.5   1324.3   1.45
C432-619     115    69.30    4243   313.3     382.6    669.0   1.75
C499-1        23    18.50      58     4.28     22.78    79.61   3.49
C499-480     289   233.6    14012  1034.0    1267.6   1483.0   1.17
C499-931     275   222.3     7206   532.1     754.3   1248.0   1.65
C880-41       16    21.40      54     3.98     25.38    69.88   2.75
C880-1741      3     4.01       6     0.443     4.45     9.52   2.14
Chapter 5
Algorithmic Enhancements
This chapter details the algorithmic enhancements made to the serial TOP_TGS
system in order to speed up the ATPG process for a single fault across topological
partitions. The enhancements fall into two general categories. The first category is
efficiency enhancements which attempt to increase the number of computations that are
performed between communications steps. These enhancements increase the grain size of
the algorithm which can increase performance in a system with large communications
latencies such as the ES-KIT 88K. These types of enhancements are not parallelizations of
the base algorithm, but they can lessen the impact of communications overhead incurred by
topological partitioning and thus increase the benefit of the parallelizations actually added
to the system.
The second category consists of enhancements which attempt to increase the
amount of work that is done simultaneously on separate processors. These enhancements
are parallelizations of the basic algorithm which can achieve speedups over the
uniprocessor system.
The first two of the following sections discuss packetizing implication invocations
and optimistic backtracing which fall into the category of efficiency enhancements. The
next two sections detail the multiple backtrace and optimistic propagation procedures
which are parallelization enhancements. The last section presents the results of the various
enhancements on the benchmark circuits.
5.1 Packetized Implication
During the implication procedure within one partition, the outputs of gates at the
partition boundary can take on new values. These new values must be transmitted to the
test_generator objects which have the gates on the other side of the boundary so that the
global implication process can be completed. The original TOP_TGS system was
implemented such that this process was performed one gate at a time as the new values on
the boundary gates were generated. This process meant that when implication changed a
value on a boundary gate, it would stop, send a message to the master to increment the
receiving test_generator's counting semaphore, wait for a reply, and then send the single
new value to the receiving test_generator via a call to its imply() method. This process is
clearly time consuming, and efficiency could be enhanced if these values
were grouped into packets and sent together for implication. This enhancement was the
first one made to the TOP_TGS system.
The majority of the modifications for this enhancement were made to the imply()
method itself. The method was modified to take a list, or packet, of gate numbers and values
for implication as an argument. An overall loop was added to imply all of the values in the
packet across the partition in serial order. Because only combinational logic is handled by
the TOP_TGS system, the final circuit state after all of the implications is independent of
the order in which the individual implications were done.
The process by which the implication calls for the changed boundary gate outputs
are sent out required the most modification. In order to packetize these calls, no outgoing
imply() invocations are issued until the local implication of the entire input packet is
complete. This modification necessitated a method of determining which boundary gates
had undergone a change on their outputs since the start of implication within the partition.
This determination was made by adding a previous state variable to each gate structure and
saving the current state of the output to it at the start of local implication. The state of the
output after local implication can then be compared to the previous state to determine if a
state change necessitating an imply() call has occurred. The addition of this previous state
variable to the gate structure did add a small amount of memory overhead (6%) to the
partition database.
Once the local implication of the input packet is performed, the imply() calls
generated by the gates on the partition boundary are packetized according to their
destination. Because the ESP environment requires all arrays that are sent as arguments to
methods to be of fixed size, several packets may have to be sent to a single test_generator.
The counting semaphore for each test_generator is thus incremented by the master as
determined by the number of packets to be sent to that test_generator. The packets are then
sent to the imply() methods in the appropriate test_generator objects one at a time.
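A compact sketch of this packetizing step is shown below; the names (GateValue,
flush_boundary_changes, send_imply_packet) are illustrative rather than the actual
ESP/TOP_TGS interface, and send_imply_packet is assumed to increment the destination's
counting semaphore via the master before invoking its imply() method.

    #include <cstddef>
    #include <map>
    #include <vector>

    const std::size_t PACKET_SIZE = 10;     // experimentally chosen packet size

    struct GateValue { int gate; char value; };

    // Assumed runtime call: bumps the destination's counting semaphore through
    // the master, then invokes the destination test_generator's imply() method.
    void send_imply_packet(int dest, const std::vector<GateValue>& packet);

    // Group the changed boundary values by destination test_generator and ship
    // them in fixed-size packets instead of one message per gate.
    void flush_boundary_changes(const std::multimap<int, GateValue>& changes) {
        std::vector<GateValue> packet;
        for (auto it = changes.begin(); it != changes.end(); ) {
            const int dest = it->first;
            packet.clear();
            for (; it != changes.end() && it->first == dest; ++it) {
                packet.push_back(it->second);
                if (packet.size() == PACKET_SIZE) {   // packet full: send it now
                    send_imply_packet(dest, packet);
                    packet.clear();
                }
            }
            if (!packet.empty())
                send_imply_packet(dest, packet);      // send the final partial packet
        }
    }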
The major disadvantage of this method of packetizing is that the parallelization of
the implication procedure is reduced. This reduction is caused by the fact that the imply()
calls are issued after local implication is completed, not in parallel with the actual
implication itself.
There is an obvious trade-off in determining the size of the implication packets. Too
large a packet size results in wasted time, as large messages with little information in them
are sent between test_generators. Too small a packet size results in multiple messages
being sent between test_generators where one larger message would have sufficed. The
optimum packet size was determined by experimentation to be roughly equal to the average
number of values passed between test generators during implication; it is somewhat
dependent on the size of the circuit, the number of partitions, and the partitioning method
used. For the circuits under consideration here, a packet size of 10 performed adequately
and resulted in an average 5% decrease in ATPG runtimes.
5.2 Optimistic Backtracing/Objective Satisfaction
The second algorithmic enhancement which falls into the category of efficiency
enhancements is the optimistic backtracing procedure. This procedure attempts to increase
the amount of work done when a partition receives a backtrace message. Recall from the
discussions in previous sections that a backtrace operation is used to determine what input
assignments need to be made to satisfy a specific objective. This objective would consist
of specifying a certain value to be present on a circuit node. Also recall that backtracing is
accomplished by systematically moving that objective back through the circuit topology to
the primary inputs. If, during this process, the objective crosses a partition boundary, a
message must be sent from one test_generator object to another to continue the backtrace
operation. Optimistic backtracing can increase the amount of work done when this type of
message is received.
Suppose a test_generator receives a backtrace message for a specific gate on its
boundary and successfully backtraces it to a primary input in its own circuit partition. The
next step in the ATPG process would be to perform a global implication of that input
assignment and check the status of the objective. In most cases, the objective will not yet
be satisfied and in fact, backtracing may proceed along the same path and cross the partition
boundary at the same gate. This situation will occur if the gate output at the partition
boundary is still at an ‘X’ value.
The optimistic backtracing process uses the fact that the output of the gate at the
partition boundary where the backtrace was originally received still has an ‘X’ value as a
signal. This signal tells the process that another backtrace from the partition boundary can
be done in an attempt to find another local input assignment.
The local backtracing and implication continues until either the objective at the
partition boundary is satisfied, or the backtrace operation encounters another partition
boundary. When either of these occurs, the input assignments made locally are sent to the
master object to be pushed on the global stack, and global implication is performed to make
the circuit state consistent.
As an example, consider the circuit of Figure 5.1. For a stuck-at '1' fault on the
input to Pnot driven by C1, the first objective is to drive that node to a '0'. This objective
will sensitize the fault. A backtrace invocation will be sent from test_generator #3 to
test_generator #2 to backtrace the objective of setting the output node of C1 to a '0'.
Test_generator #2 will backtrace this objective toward input S2 until the boundary with
test_generator #1 is encountered. This operation will result in a backtrace invocation being
sent to test_generator #1 to set S2 to a '1'. This input assignment will be made and global
implication will be performed. After implication, the output of C1 will still be an 'X'. The
next backtrace issued to test_generator #2 for the objective of setting C1 to '0' will result
in the optimistic backtrace procedure being used.
With input S2 set at '1', the backtrace of C1, '0' will lead to the assignment of input
A3 to a '1'. Because this is a local input to test_generator #2, a local implication will be
done to see if the C1, '0' objective has been satisfied. Since the output of C1 will still be at
'X', another local backtrace is done, which will lead to the assignment of a '0' to input B3.
Local implication of this input assignment will lead to a '1' value on the output of B2 and
thus a '0' on the output of C1. The original objective has now been satisfied, and the
optimistic backtrace procedure will make the global state consistent. This process involves
sending the input assignments B3, '0' and A3, '1' to the master object to be pushed on the
global stack, and packetizing and sending out imply() calls for any changes on the outputs
of boundary gates. When implication is finished, the test generation process will continue
by selecting a new global objective, which in this case will be to set any unknown inputs of
Pnot to a '1' to propagate the fault.
Figure 5.1 Optimistic backtracking/objective satisfaction in the 74181 circuit. [Diagram:
gates Pnot, C1, B1, B2, and A1 and inputs S2, S3, B3, and A3 spread across partitions 1-3,
with the backtrace operations and input assignments of steps #1-#3 marked.]

The modifications necessary to incorporate this functionality into the test_generator
object were fairly extensive. First, the backtrace and imply procedures had to be configured
so that they could be invoked locally without generating message traffic. This was
accomplished by changing the public methods imply() and backtrace() to the private
methods local_imply() and local_backtrace(). Next, a public imply() method was added
which allows the global implication to be performed correctly. This imply() method is, in
fact, a public stub routine which calls local_imply() with the same arguments.

The public backtrace() method actually carries out the optimistic backtrace
procedure. It consists of a loop of calls to local_backtrace() and local_imply() which
continues until either a partition boundary is encountered or the objective is satisfied. A
pseudocode representation of this method is shown in Figure 5.2.
The send_inputs() routine sends the local input assignments to the master object to
be pushed on the global stack. The send_imply() routine starts the global implication
process. This task is performed exactly as in the previous system: the imply() calls are
packetized, the counting semaphores for the test_generators are incremented by the master,
and the imply() calls are issued to the respective test_generators.
    backtrace(objective_gate, objective_value) {
        do {
            local_objective_gate  = objective_gate;   // set local variables
            local_objective_value = objective_value;
            local_backtrace();              // backtrace to local input or partition boundary
            local_push(input_assignment);   // save input assignment
            local_imply(input_assignment);  // do local implication of input assignment
        } while (partition_boundary_not_found &&
                 local_objective_gate.output != 'X');

        send_inputs();   // send local input assignments to master
        send_imply();    // send out values on boundary gates for implication
    }

Figure 5.2 Pseudocode representation of the optimistic backtrace procedure.

This enhancement typically resulted in a 10% improvement in runtime on the
benchmark circuits. However, this result is greatly dependent on circuit topology and the
target fault under consideration. A more detailed explanation of the results is contained in
section 5.5.
5.3 Multiple Backtrace
The multiple backtrace procedure is an algorithmic enhancement that falls into the
category of parallelizations. The goal of the multiple backtrace procedure is to issue as
many backtrace operations as possible so that multiple test_generator objects will be
processing them in parallel.
The concept of multiple backtrace was first presented in the FAN algorithm [4].
However, FAN is a serial algorithm and the goal of using multiple backtrace in it was to
reduce the number of conflicts when backtracing through reconvergent fanout paths. This
reduction was accomplished by backtracing all paths through reconvergent fanout
simultaneously and then counting the number of times each logic value, either ‘0’, or ‘1’,
was backtraced through a fanout stem. The objective value on that fanout stem would then
be assigned the logic value with the largest count.
The use of multiple backtrace for the parallelization of topologically partitioned
ATPG was presented in [31]. However, the scheme presented in [31] is unrealistic in its
assumption of only one gate per processor. This assumption was made to measure the upper
limit on the number of simultaneous backtrace operations which could be expected on the
benchmark circuits. Further, this work does not address the application of multiple
backtracing to a realistic topological partitioning scheme which requires many gates to be
placed in a single partition, as was done for the TOP-TGS system. It also does not address
the practical issues that arise when implementing multiple backtrace on a real
multiprocessor system.
As an example of the multiple backtrace procedure, consider the section of the
74181 circuit shown in Figure 5.3. In generating a test for a stuck-at '1' fault on the output
of gate Pnot, the first objective is to sensitize the fault by placing a '0' on that node. This
objective requires that all of the inputs to Pnot be set to a '1' value. Because all of the inputs
to Pnot are driven by gates in other partitions, backtrace operations can be sent to them
simultaneously to be backtraced in parallel.
The actual implementation of the multiple backtrace process in the TOP_TGS
system differs slightly from the description above. In the example of Figure 5.3, the
backtrace of the objective Pnot, '0' would first lead to a backtrace call to test_generator
#2 with the objective C1, '1'. This value would then be locally implied by test_generator
#3, followed by another local backtrace of Pnot, '0'. This backtrace would then lead to the
objective C3, '1' being sent to test_generator #1. The value '1' would be locally implied
on the output of C3 by test_generator #3, followed by another local backtrace operation.
This process would continue until the original objective of Pnot, '0' had been satisfied by
the local implications in test_generator #3. Each backtrace call sent to the other
test_generators during this process would be satisfied by an optimistic backtrace procedure
within that test_generator. Once all of the backtrace calls had been satisfied, all input
assignments would be pushed onto the global stack and implication would be performed to
make the global circuit state consistent. The next global objective would then be selected
and the test generation process would continue.

Figure 5.3 Multiple backtrace in the 74181 circuit. [Diagram: gates Pnot, C1, C3, C5, and
C7 spread across partitions 1-4, with backtrace operations driving each input of Pnot
to '1'.]
This method of performing the multiple backtrace process was chosen for its ease
of implementation and the fact that it helps keep the originating test_generator object busy
as well as the test_generators receiving the backtrace calls.
Like optimistic backtracing, the multiple backtrace process is performed by the
backtrace() method. In this case, the method contains a loop which continues to be
executed until the original objective which was sent to it has been satisfied. The loop begins
by calling local_backtrace() to backtrace the objective until either a local primary input or
a partition boundary is encountered. If a local input is found, the objective value is assigned
to it and a local implication is performed. If a partition boundary is encountered, a backtrace
call is issued to the appropriate test_generator and a local implication of the value on the
boundary gate’s output is done. Once the initial objective has been met, the local input
assignments made during the process are sent to the master to be pushed on the global stack.
Imply() calls are then issued for the boundary gate outputs that have changed after the
proper counting semaphores have been incremented. A pseudocode representation of the
multiple backtrace routine is shown in Figure 5.4.
Because all backtrace operations now use this multiple backtrace process, some
test_generators may be issuingimply() calls while others are still in the backtrace phase.
This fact could result in the master thinking that all test_generators are idle when actually
some test_generators are still backtracing. This result would cause the master object to
incorrectly begin the next phase of the ATPG process before the present phase is
completed. In order to avoid this problem, the system was modified so that the counting
semaphore for a specific test_generator is incremented whenever it is sent a backtrace() or
imply() call and decremented whenever one of these operations completes. Therefore, the
master will not determine that all test_generators are idle until all outstanding backtrace()
and imply() calls have been processed.
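A small sketch of this bookkeeping on the master (illustrative names, not the actual master
object's interface) makes the idle test concrete:

    #include <vector>

    class Master {
        std::vector<int> outstanding;   // one counter per test_generator
    public:
        explicit Master(int num_tgs) : outstanding(num_tgs, 0) {}
        // Called when a backtrace() or imply() is sent to test_generator tg.
        void sent_call(int tg)      { ++outstanding[tg]; }
        // Called when that test_generator reports the call complete (mark_idle).
        void call_completed(int tg) { --outstanding[tg]; }
        // The next ATPG phase may begin only when no calls are outstanding.
        bool all_idle() const {
            for (int c : outstanding)
                if (c != 0) return false;
            return true;
        }
    };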
The results of using multiple backtracing on the benchmark circuits were mixed.
Overall, this enhancement increased processor utilization and thus decreased runtimes in
most cases. However, in some situations the multiple backtrace
procedure was less efficient. This situation most often occurs when one local input
assignment would satisfy the global objective itself or result in a circuit state where a test
was not possible. The fact that this situation has occurred may not be known until the entire
circuit state is made consistent throughimply() calls. Much of the work done in these
situations could be unnecessary or even incorrect. Reconvergent fanout or excessive
numbers of XOR gates in the circuit topology are the most likely causes of these situations
and this occurrence can seriously impact the performance increases realized by this
algorithmic enhancement. Section 5.5 presents a more detailed discussion of the results of
the multiple backtrace process.
    // The counting semaphore for this test_generator was incremented by the master
    // before this call was received; this marks this test_generator as busy.
    backtrace(objective_gate, objective_value) {
        do {                                     // main loop
            local_objective_gate  = objective_gate;   // set local variables
            local_objective_value = objective_value;
            local_backtrace();                   // backtrace to local input or partition
                                                 // boundary; this sets current_objective
            if (current_objective == primary_input) {
                local_push(input_assignment);    // save input assignment
                local_imply(input_assignment);   // do local implication of input assignment
            } else {
                local_imply(boundary_gate.output);   // locally imply objective value on
                                                     // boundary gate
            }
        } while (local_objective_gate.output != 'X');   // loop until objective is satisfied

        send_inputs();                   // send local input assignments to master
        send_imply();                    // send out values on boundary gates for implication
        mark_idle(this_test_generator);  // decrement this test generator's semaphore
    }

Figure 5.4 Pseudocode representation of the multiple backtrace procedure.
5.4 Optimistic Propagation
One of the major limits to the available parallelism in the multiple backtrace
procedure is the fact that once the global objective has been satisfied in the local partition,
the entire system must work together to make the global state consistent and compute a new
global objective. The global implication and objective selection process tends to be more
serial in nature than the backtracing process and, therefore, the more often it is performed,
the lower the speedup will be. The optimistic propagation procedure is an attempt to
decrease the number of times the global implication/objective selection process is
performed.
Optimistic propagation is the process of speculating on what the next global
objective will be and proceeding with work on satisfying that objective. This procedure is
based on the theory of Speculative Computation [40], which states that if an idle processor
is used to perform work that may or may not be necessary, speedup will be achieved only
if the work turns out to be useful. This result, of course, assumes that the speculative work
does not take place ahead of required work that other processors are waiting for.
As an example, consider the stuck-at '0' fault on the output of gate E3 as shown in
Figure 5.5. The first global objective would be to sensitize the fault by setting the output of
E3 to a ‘1’. This objective would require a ‘1’ on all inputs to E3. The first backtrace
operation would be local to test_generator #2 and would lead to the local input assignments
necessary to set the output of gate C1 to ‘1’. The next backtrace would lead to a backtrace
message being sent to test_generator #1 with the objective, C3, ‘1’. After local implication
of this boundary gate value, the next backtrace would send the objective, C4, ‘1’ to
test_generator #1. Local implication of this boundary gate output in test_generator #2
would lead to the satisfaction of the global objective as the output of E3 would become ‘1’.
Without the addition of the optimistic propagation procedure, the system would normally
stop and make the global state consistent at this point and then select a new global
objective. The optimistic propagation procedure, however, would speculate on what the
next objective for propagation would be and begin processing it.
In the example case, the next objective would be selected in the local partition of
test generator #2 to propagate the ‘D’ value on the output of E3 through gate F1. This
objective would require all inputs of F1 not on the propagation path to be set to the
non-controlling '0' value. Local backtraces targeting the outputs of gates E1 and E2 would
be performed, followed by a local backtrace to set the output of gate E4 to a '0' as well. These
local backtraces could lead to backtrace calls to other partitions. This process of selecting
the next local objective for propagation would continue until the D frontier has reached a
partition boundary. At this point, the system will stop and make the global state consistent,
and select a new global objective.
Figure 5.5 Optimistic propagation in the 74181 circuit. [Diagram: gates E1-E4, F1, C1, C3,
and C4 across partitions 1 and 2, showing the backtrace operations, local backtraces, and
optimistic backtraces that propagate the 'D' on the output of E3 through F1.]
There are several situations which could make the work done in propagating the
local D frontier to the partition boundary unnecessary and, hence, lead to it being
speculative. First, the input assignments set by other partitions in satisfying their backtrace
calls could actually set up the conditions necessary to propagate the D frontier in the local
partition. Second, these input assignments could result in propagation of a D frontier in
another partition. This remote D frontier may then be at a more observable point, or even
at a primary output, and this fact will not be known until a new global objective is selected.
Finally, the remote input assignments could result in a circuit state where a test is no longer
possible. Because of the possibility of these situations occurring, there is a trade-off in how
much work can be done between performances of global implication/objective selections.
If too many operations are done between global consistency checks, there is a larger chance
that some of the work will be unnecessary or incorrect. Too few operations between
consistency checks, on the other hand, limits the amount of parallelism that can be extracted.
As in the previous algorithmic enhancements, most of the modifications necessary
to implement the optimistic propagation process were made to the backtrace() routine. The
name of this routine was changed to satisfy_obj(), and its major function became the
satisfaction of all local objectives necessary to propagate the D frontier to a partition
boundary.
At the beginning of the ATPG process, satisfy_obj() is called in the test_generator
with the target fault in its database. This invocation of satisfy_obj() will call the
local_setnextobj() routine to set up the next local objective. Local_setnextobj() is simply a
local version of the setnextobj() method, and in this case it will return the next objective that
will sensitize the fault. Satisfy_obj() will then perform all of the local_backtrace() and
local_imply() calls necessary to satisfy this objective. Once the current objective has been
satisfied, local_setnextobj() will be invoked to select the next objective in the local
partition. Local_setnextobj() will continue to return local objectives until the fault has been
sensitized. Once the fault has been sensitized, local_setnextobj() will call
local_setDfrontier() to return the next objective. Local_setDfrontier() will return the next
objective necessary to propagate the D frontier through the next gate.
The satisfy_obj() method will continue looping in this manner until the D frontier
has been propagated to the partition boundary. At this point, all of the local input
assignments made are sent to the master to be pushed onto the global stack, and the global
state is made consistent by issuing the necessary imply() calls. A pseudocode
representation of the satisfy_obj() routine is shown in Figure 5.6.
As with the multiple backtrace procedure, results of the optimistic propagation
procedure are greatly dependent on the topology of the circuit and the target fault under
consideration. In some situations, all of the speculative work is later required and is
therefore useful. Runtimes in these cases are much lower using the optimistic propagation
procedure. In other situations, the speculative work is incorrect and runtimes are increased.
This incorrect work frequently occurs when processing a hard fault that usually results in a
    // The counting semaphore for this test_generator was incremented by the master
    // before this call was received; this marks this test_generator as busy. It is not
    // necessary to send this routine a starting objective because it uses the global
    // objective as a starting point and then computes its own local objectives.
    satisfy_obj() {
        do {                                     // main loop
            local_setnextobj();                  // set the next local objective
            local_backtrace();                   // backtrace to local input or partition
                                                 // boundary; this sets current_objective
            if (current_objective == primary_input) {
                local_push(input_assignment);    // save input assignment
                local_imply(input_assignment);   // do local implication of input assignment
            } else {
                local_imply(boundary_gate.output);   // locally imply objective value on
                                                     // boundary gate
            }
        } while (objective_gate_is_not_boundary);   // loop until D frontier reaches a
                                                    // partition boundary

        send_inputs();                         // send local input assignments to master
        send_imply();                          // send out values on boundary gates for implication
        set_idle_status(this_test_generator);  // decrement this test generator's semaphore
    }

Figure 5.6 Pseudocode representation of the optimistic propagation procedure.
large number of backtracks. In this situation, frequent global consistency checks are
beneficial for determining whether a wrong decision, one which should result in a
backtrack, has been made. Performing speculative work between consistency checks here
simply results in
useless computations which increase the time between backtracks and hence increase the
ATPG time. It is also possible for the optimistic propagation process to increase the number
of backtracks necessary to perform ATPG on any given fault. This situation occurs when
the optimistic propagation routine continues to work on propagating an unprofitable D
frontier when a global objective selection procedure would have selected another D frontier
gate for propagation. This increase in the number of backtracks is especially detrimental
because the processing of backtracks requires an extra implication operation which has
very limited parallelism. The next section presents a detailed discussion of the results of
this algorithm enhancement.
5.5 Analysis of Results
This section presents the results of the various algorithmic enhancements to the
TOP_TGS system. The results of adding each enhancement to the system will be presented
using the same example faults used in the previous chapter. Then the results of doing ATPG
on all of the faults in the circuit with each enhancement will be presented.
5.5.1 Optimistic Backtracing Results
Table 5.1 compares the results of the TOP_TGS system with the optimistic
backtracing enhancement added, dubbed version 1, against the unenhanced distributed
TOP_TGS system. Included in the comparison are the runtime for each fault, τ, the total
number of messages required for ATPG, Μcomm, the total number of backtracks, α, and the
total number of inputs specified in the test vector, ie. The results show that although a
decrease in runtime of up to 20% was achieved on some faults, other faults experienced
up to a 38% increase in runtime.
Because the optimistic backtrace procedure is not a parallelization enhancement,
the increase or decrease in runtimes is caused directly by an increase or decrease in either
communications costs or operational complexity. The communications cost for ATPG is
measured directly by Μcomm; the operational complexity can be determined from α and ie,
as shown in the previous chapter.
Table 5.1 Comparison of Results for Version 1 System (runtimes in ms)

               Distributed System            Version 1 System
Fault No.      τdist      α   ie   Μcomm     τver1      α   ie   Μcomm
74181-1         31.0      0    7     61       28.6      0    5     59
74181-181       44.97     0    7     94       34.8      0    7     69
74181-379      146.97    10   14    257      201.73    28   14    365
C432-1          70.73     0   10    126       65.95     0   10    121
C432-145      1149.0    103   11   1627     1198.9    109   13   1704
C432-501        56.57     0    8    106       59.78     0    8    111
C432-619      2336.9    247   11   3010     2368.8    247   12   3216
C432-731      1281.98   157   22   1915     1438.45   166   22   2303
C499-1         147.12     0   22    284       58.56     0   22     86
C499-480      3065.1    247   40   3908     3069.78   247   38   4016
C499-621       208.74     0   41    369      157.12     0   41    262
C499-931      2230.89   247   26   2184     2277.9    247   26   2896
C499-961        64.37     0   13    118      220.66    16   32    292
C880-41        143.49     0   15    188       99.30     0   15    121
C880-801        24.73     0    3     33       15.69     0    3     17
C880-1201      145.86     0   16    190      143.72     0   16    201
C880-1521       73.48     0    8     89       59.50     0    8     69
C880-1741       19.58     0    2     23       16.83     0    2     16

For example, consider fault #C432-501. Both versions of the TOP_TGS system
required no backtracks for ATPG on this fault and specified 8 inputs in the test vector, so
the operational complexity in both cases is equivalent. However, the number of messages
required actually increased from 106 to 111, thereby increasing the runtime from 56.57 ms
to 59.78 ms. Likewise, fault #C880-1741 has the same operational complexity in
both cases, but a decrease in the number of messages from 23 to 16 reduced the runtime
from 19.58 ms to 16.83 ms. In the case of faults such as C499-961, however, the dominant
factor was the increase in operational complexity, which also increased the number of
messages required. For this fault, α increased from 0 to 16 backtracks and ie increased
from 13 to 32 inputs specified; these increases raised the runtime from 64.37 ms to
220.66 ms.
This change in operational complexity is caused by the fact that the optimistic
backtrace process can change which primary inputs are specified in the test vector. The
change in primary inputs specified can lead the enhanced ATPG algorithm to search
different portions of the search space than the serial algorithm, and the altered search can
turn out to be more or less efficient. The data in Table 5.1 seems to indicate that more faults
experience longer ATPG times than shorter ones; if that were the case, ATPG over all faults
would take longer for the version 1 system than for the unenhanced distributed system. The
data presented in section 5.5.4 will show, however, that this is not the case.
5.5.2 Multiple Backtracing Results
Table 5.2 contains a comparison of the results of the version 2 TOP_TGS system
and the unenhanced distributed system. The version 2 system contains both the multiple
backtrace parallelization and the optimistic backtracing enhancement.
Because the multiple backtrace enhancement is in fact a method of parallelization,
it is expected that some faults will experience a decrease in runtime without a decrease in
the total amount of work being done. This decrease would be due to the effect of
parallelism.
Examination of the results for fault #C499-961 shows that the runtime decreased
from 64.37 ms to 48.44 ms. This decrease was realized in spite of the fact that the number
of inputs specified increased from 13 to 34, indicating that the operational complexity of the
ATPG process increased by a factor of over 2.6. The number of gate evaluations for this
fault also increased, from 134 to 520, indicating that the speedup occurred in spite of an
increase in overhead processing as well.

The results for fault #C432-1 also indicate a speedup, from 70.73 ms to 46.39 ms.
The operational complexity for this fault increased from 10 to 14 inputs specified, while the
overhead decreased slightly, with the number of gate evaluations dropping from 197 to
170.
Table 5.2 Comparison of Results for Version 2 System (runtimes in ms)

               Distributed System            Version 2 System
Fault No.      τdist      α   ie   Μcomm     τver2      α   ie   Μcomm
74181-1         31.0      0    7     61       26.06     0    5     56
74181-181       44.97     0    7     94       27.07     0    7     63
74181-379      146.97    10   14    257      114.87    12   14    231
C432-1          70.73     0   10    126       46.39     0   14    125
C432-145      1149.0    103   11   1627     2034.7    247   14   3045
C432-501        56.57     0    8    106       33.04     0    8     67
C432-619      2336.9    247   11   3010       47.27     0   12    109
C432-731      1281.98   157   22   1915      446.61    62   22    806
C499-1         147.12     0   22    284       39.56     0   22     84
C499-480      3065.1    247   40   3908     2401.2    247   37   3376
C499-621       208.74     0   41    369      223.28    32   41    405
C499-931      2230.89   247   26   2184       79.81     1   40    188
C499-961        64.37     0   13    118       48.44     0   34    133
C880-41        143.49     0   15    188       94.05     0   15    122
C880-801        24.73     0    3     33       16.21     0    3     20
C880-1201      145.86     0   16    190       76.40     0   16    134
C880-1521       73.48     0    8     89       42.12     0    8     52
C880-1741       19.58     0    2     23       18.00     0    2     16
As was the case for the version 1 system, some faults experienced an increase in
runtimes. These increases were due mainly to an increase in operational complexity. Again
the increase in operational complexity was caused by the algorithmic changes leading the
ATPG process into a nonproductive area of the search space.
The results for fault #C432-145 show that the number of backtracks increased from
103 to 247 as the fault became hard to detect. This increase in α led to an increase in
runtime from 1149.0 ms to 2034.7 ms.
5.5.3 Optimistic Propagation Results
Finally, Table 5.3 contains a comparison of the results of the version 3 TOP_TGS
system and the unenhanced distributed version. The version 3 system contains all of the
enhancements of the previous versions as well as the optimistic propagation enhancement.
The results are similar to those for the version 2 system. Both fault #C432-1 and
fault #C499-961 show the apparent effect of speedup due to parallelism. Several faults also
had increases in runtime due to increases in operational complexity.
One interesting result to note is the fact that all of the example faults for circuit
C432 were processed with zero backtracks by the version 3 system while the unenhanced
version required a total of 507 backtracks for these faults. While this may seem to be a
promising result, there may in fact be other faults on the fault list which experience an
increase in the number of required backtracks. An experiment where ATPG is performed
on the entire fault list is required to determine if the total number of backtracks did indeed
decrease. The next section presents the results of such an experiment.

Table 5.3 Comparison of Results for Version 3 System (runtimes in ms)

               Distributed System            Version 3 System
Fault No.      τdist      α   ie   Μcomm     τver3      α   ie   Μcomm
74181-1         31.0      0    7     61       56.13     0   13    103
74181-181       44.97     0    7     94       23.19     0    7     51
74181-379      146.97    10   14    257       78.64    12   10    169
C432-1          70.73     0   10    126       37.32     0   14    106
C432-145      1149.0    103   11   1627       77.28     0   14    154
C432-501        56.57     0    8    106       28.67     0    8     54
C432-619      2336.9    247   11   3010       54.28     0   16    131
C432-731      1281.98   157   22   1915      107.03     0   22    197
C499-1         147.12     0   22    284       80.04     0   41    206
C499-480      3065.1    247   40   3908     2444.7    247   38   3530
C499-621       208.74     0   41    369     2130.5    247   35   2359
C499-931      2230.89   247   26   2184       79.59     1   40    188
C499-961        64.37     0   13    118       48.36     0   34    134
C880-41        143.49     0   15    188       38.04     0   13     42
C880-801        24.73     0    3     33       15.05     0    3     21
C880-1201      145.86     0   16    190       55.55     0   15     94
C880-1521       73.48     0    8     89       28.96     0    9     42
C880-1741       19.58     0    2     23       16.43     0    2     17
5.5.4 Results of ATPG on the Entire Fault List
Figure 5.7 contains the graphs of the runtimes for the benchmark circuits on the
various TGS systems. These runtimes are for complete test generation where every fault is
targeted. Version 'S' is the serial, uniprocessor, PODEM ATPG system. Version 0 is the
serial distributed TOP_TGS system. Versions 1 through 3 are the enhanced TOP_TGS
systems discussed in the previous sections.
The graphs show a large increase in test generation time from the serial version to
the distributed version. This increase in runtimes is due to the communications time and
overhead processing incurred by the distributed system. These were the factors analyzed in
Chapter 4. Note that the serial system requires no messages to perform ATPG. Graphs of
the number of messages required for ATPG are shown in Figure 5.8.
Figure 5.7 Complete ATPG runtimes. [Four graphs: runtimes (sec) vs. TOP_TGS version
(S, 0, 1, 2, 3) for 74181, C432, C499, and C880.]

All of the circuits show a steady improvement in the runtimes for the enhanced
TOP_TGS systems, with the exception of C499. The runtimes for this circuit decreased for
versions 1 and 2 relative to the unenhanced system, but the runtime for version 3 increased
substantially. Examination of the number of messages required for C499 in Figure 5.8
reveals that Μcomm also increases greatly for version 3. This increase in Μcomm, and thus
in the total runtime, is caused by the increase in aborted faults for the version 3 system
over the other versions. Figure 5.9 contains a graph of the number of aborted faults for
circuits C432 and C499. Notice that the number of aborted faults for the version 3 system
on C499 increases almost 50% over the unenhanced version. This increase in aborted faults
increases the total number of backtracks,α, because every aborted fault requires 247
backtracks to process. The increase inα, in turn, increases the operational complexity of
the ATPG process thereby increasingΜcomm.
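Written out, the relationship between aborted faults and backtracks is a simple product (the symbol ΔNabort is notation introduced here for illustration, not from the text):

\[ \Delta\alpha \approx 247 \times \Delta N_{\mathrm{abort}} \]

so, for example, 10 additional aborted faults contribute roughly 2,470 additional backtracks to α, before counting any change in the backtracks of faults that still complete.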
Figure 5.7 Complete ATPG runtimes (runtime in seconds vs. TOP_TGS version; one panel each for 74181, C432, C499, and C880).

Figure 5.8 Complete ATPG message requirements (total number of messages vs. TOP_TGS version; one panel each for 74181, C432, C499, and C880).

The graphs in Figure 5.7 indicate that the best results were obtained for the C880
circuit. The overall runtime for this circuit was actually less for the version 3 system than
the serial uniprocessor system. The improvement in runtime was realized in spite of the fact
that the version 3 system required 53,662 total messages to perform ATPG. The version 3
system also experienced an increase in operational complexity of almost 100% as the total
number of inputs specified increased from 8391 to 15723 and an increase in overhead as
the total number of gate evaluations increased from 303,573 to 324,754. These facts are a
strong indicator that a significant amount of parallelization is present in the version 3
system for C880 and decreasing the amount of excess overhead processing and
communications latency should yield useful speedups on this circuit.
The results for the 74181 circuit, while not as promising, indicate that speedup
could be realized for this circuit if overhead processing and communications latency were
reduced. However, the results for the C432 and C499 circuits were not as promising. Any
decreases in runtime due to parallelization were smaller and were reduced or eliminated in
some cases by increases in overhead processing or operational complexity.

Figure 5.9 Total aborted faults (number of aborted faults vs. TOP_TGS version, for circuits C432 and C499).
For C432, the operational complexity did decrease as α decreased from 33,978 to
19,488 backtracks, while the number of inputs specified increased slightly from 8391 to
9829. The overhead increased as the number of gate evaluations increased from 2,225,243
to 2,337,844, but this is an equivalent percentage increase to that experienced by C880. The
fact that the runtime for the version 3 system was still almost 4 times that of the serial
system indicates that not as much parallelism is present in this circuit for the version 3
system as in C880.
For circuit C499, the operational complexity increased a great deal as α increased
from 13,850 to 28,023 backtracks and the number of inputs specified increased from 32,066
to 32,692. The overhead processing also increased as the number of gate evaluations
increased from 1,411,712 to 3,142,372. As a direct result, the version 3 runtime was over
4 times the serial runtime.
It is doubtful that any decrease in communications latency or overhead processing will
reduce the version 3 runtimes for C432 and C499 enough to result in speedups over the
serial system. What is needed is a method to extract more parallelism from these circuits,
perhaps by using a more intelligent partitioning algorithm.
5.5.5 Results for Larger Numbers of Processors
This section presents a brief review of the results of increasing the number of
processors used in the ATPG process with the version 3 system. As in the previous section,
these results were obtained by performing test generation on all faults in the fault list.
Table 5.4 lists the pertinent statistics for the benchmark circuits for the version 3
system with 2 and 4 partitions (processors). Two circuits, C499 and C880, experienced an
apparent increase in parallelism from the 2 partition case to the 4 partition case. For C499,
the runtime decreased from 325.8 sec to 268.9 sec in spite of an increase in operational
complexity indicated by an increase in α from 19,513 to 21,990 backtracks and in ie from
32,692 to 34,719 inputs specified. Communications time also increased as Μcomm
increased from 403,093 to 725,274. Overhead decreased slightly as the number of gate
evaluations fell from 3,142,372 to 2,842,683, but this is not enough to account for the entire
decrease in runtime.
The runtime for circuit C880 increased slightly from 56.2 sec to 60.4 sec, but this
is much less than would be expected from the increases in operational complexity,
communications time, and overhead processing.
Circuit C432 experienced an increase in runtime from 298.5 sec to 408.2 sec. This
was caused mainly by the large increase in operational complexity as α increased from
26,898 to 33,079 and ie decreased slightly. The increase in α was caused by an increase of
27 aborted faults, which also caused an increase in Μcomm from 393,381 to 1,011,463 and
an increase in the number of gate evaluations from 2,337,844 to 3,358,566.
In all cases, the changes in the number of aborted faults and the number of
backtracks required were caused by the fact that changes in the way the gates are
partitioned, and even in the order in which messages are received by a partition, can change
the area of the solution space in which the ATPG algorithm searches for a test. This effect
was discussed in section 5.5.1. Because the order in which messages are received can affect
the input ordering, the version 3 TOP_TGS system is almost non-deterministic: the results
differ slightly from run to run, as well as across partition types and numbers of partitions.
The next chapter details the effort to increase the amount of parallelism used in the
TOP_TGS system by adding additional methods of parallelism.
Table 5.4 Comparison of Results for 2 and 4 Partitions

                       2 processors                                 4 processors
Circuit    τ       α       ie      Μcomm     ge          τ       α       ie      Μcomm      ge
74181      10.85   56      3583    21037     144673      15.89   742     3057    47671      89855
C432       298.5   26898   9829    393381    2337844     408.2   33079   9294    1011463    3358566
C499       325.9   19513   32692   403093    3142372     268.9   21990   34719   725274     2842683
C880       56.21   83      16385   53662     324754      60.40   200     16784   120206     345092

(τ = runtime in seconds; α = backtracks; ie = inputs specified; Μcomm = messages; ge = gate evaluations)
Chapter 6
Fault Partitioning
This chapter details the results of adding fault partitioning as an additional
parallelism to the TOP_TGS system. In spite of the enhancements and parallelizations
discussed in the previous chapter, there are frequently times in which all of the work in
ATPG is being done by one processor alone. In fact, optimization of the circuit partitioning
algorithm to reduce communications requirements will force this result directly. If, during
these processor idle times, other useful work were available to the processor, speedup
would increase. The amount of this speedup is determined by how much idle time is in fact
available and how well it is used. Performing ATPG simultaneously on several faults for
which test generation is required would reduce this idle time. The use of fault
partitioning to accomplish this goal was first proposed in [17]. The next section details the
changes to the TOP_TGS system necessary to implement the fault partitioning scheme.
6.1 Fault Partitioning Implementation
The majority of the changes necessary to implement fault partitioning in the
TOP_TGS system were made in the master and test_generator objects. In both of these
objects, the necessary data structures were changed from single values to arrays with one
element per fault. Examples include the global primary input stack and backtrack limit in
the master, and the individual gate output values and global objective in the test_generator.
Each of the routines which performs a specific test generation task was modified to include
a fault number as an argument. This fault number is used as an index into the arrays
discussed above as the task is performed. There were a number of methods in both the
master and test_generator objects that had to be modified in this way.
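A minimal sketch of this conversion is shown below. It is illustrative only; the array bounds, the objective representation, and the member names (MAX_FAULTS, MAX_GATES, gate_values) are assumptions, since the text does not give the actual declarations.

    // With fault partitioning, every piece of per-fault ATPG state becomes an
    // array indexed by the fault number that each routine now receives.
    const int MAX_FAULTS = 8;        // assumed limit on simultaneous faults
    const int MAX_GATES  = 4096;     // assumed circuit size limit

    struct objective { int gate_no; char value; };    // assumed representation

    class test_generator {
        char      gate_values[MAX_FAULTS][MAX_GATES]; // per-fault gate outputs
        objective global_objective[MAX_FAULTS];       // per-fault objective
    public:
        void initialize(int fault_no);   // reset the state for one fault
        void imply(int fault_no);        // each task routine takes the fault
        void backtrace(int fault_no);    // number and indexes the arrays above
    };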
There were two other modifications required in the TOP_TGS system to implement
fault partitioning. The first modification concerned the initialization process. Initialization
is required before the ATPG process begins and after test generation is completed on each
fault. Initialization consists of setting all gate outputs back to an ‘X’ value, setting the
global objective to NULL, and emptying the global input stack. In the original TOP_TGS
system, initialization was handled by simply invoking the initialize() method in each
test_generator and waiting for it to return. This simple method in fact serialized the
initialization process, but the impact on performance was negligible. In the new system
with fault partitioning, this method is unacceptable because it would prevent the master
object from processing the faults during this time and would in fact lead to deadlock.
Deadlock is possible because both the master and test_generator would have routines that
use the synchronous communications protocol discussed in section 2.3.1.3.
In order to correct this problem, the initialization procedure was changed to an
asynchronous process. The master issues initialize(int fault_no) calls to all of the
test_generators. Each test_generator initializes the variables in the array for that fault
number and then signals the master when it is done. This signaling is done through a new
method in the master called mark_done_init(int test_gen_no, int fault_no). Once all of the
test_generators have signaled that they have completed initialization for that fault number,
the ATPG process is started.
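A sketch of the resulting handshake follows. The initialize() and mark_done_init() methods are those named above; the counter, the test_generator array, and start_atpg() are assumed bookkeeping.

    // Master side: issue all initialize() calls without waiting, then start
    // ATPG on the fault only after every test_generator has signaled back.
    void master::start_initialization(int fault_no) {
        init_count[fault_no] = 0;                    // assumed per-fault counter
        for (int tg = 0; tg < num_test_gens; tg++)
            test_gens[tg]->initialize(fault_no);     // asynchronous invocation
    }

    // Invoked asynchronously by each test_generator when it finishes.
    void master::mark_done_init(int test_gen_no, int fault_no) {
        init_count[fault_no]++;
        if (init_count[fault_no] == num_test_gens)
            start_atpg(fault_no);                    // assumed ATPG kickoff
    }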
The other modification necessary for fault partitioning was the changes to the
master necessary to perform selection of the next fault. Ideally, the next fault selected
should be one which will provide work for the processor with the most idle time. This could
be done by selecting a fault which is located in the database of that processor. However, in
order to simplify implementation, the current method used is to simply select the next fault
on the list when a processor becomes idle. This compromise is not believed to have a
significant impact on performance, but it might decrease efficiency somewhat.
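The current policy amounts to a simple linear scan of the fault list, sketched below (the fault record fields and variable names are hypothetical):

    // When a processor becomes idle, hand it the next untested, undetected
    // fault from the master fault list.
    int master::select_next_fault() {
        while (next_fault < num_faults) {
            fault &f = fault_list[next_fault++];
            if (!f.detected && !f.tested) {
                f.tested = 1;            // mark so the fault is not selected twice
                return f.number;
            }
        }
        return -1;                       // no faults left to target
    }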
6.2 Results
The graphs in Figure 6.1, Figure 6.2, Figure 6.3, and Figure 6.4 present the
runtimes for the benchmark circuits versus the number of faults that are processed in
parallel. Data is presented for both the 2 partition case and the 4 partition case. Note that in
all cases, the runtimes decrease continuously as the number of parallel faults increases.
However, the slopes tend to level out at about 4 parallel faults. This reduction in speedup
is due to the fact that eventually, the majority of the idle time in the processors is being used
effectively.
In the case of circuit 74181, the runtime for 2 partitions actually increases as the
number of parallel faults increases from 4 to 6. This increase is due to the fact that, in this
case, the total number of backtracks, α, required for ATPG increased. Recall that in the version 3
system, the order in which inputs are placed in the global stack can change, which can cause
the system to search nonproductive areas in the solution space. The addition of more
parallel faults can change the timing and thus the order of the inputs. This change can
reduce the efficiency of the search. This reduction in efficiency accounts for the increase in
α and thus the increase in runtime.
The change in input ordering can also change the number of aborted faults
encountered during test generation. In the 4 partition case for C432, this effect accounts for
the majority of the dramatic decrease in runtime from 408.2 sec to 295.1 sec as the number
of parallel faults increased from 1 to 2. The addition of 1 parallel fault decreased the number
of aborted faults from 121 to 95, which in turn decreased the runtime. Further decreases in
runtime as the number of parallel faults increased beyond 2 were due to decreased
processor idle time as the number of aborted faults stabilized at around 95.

Figure 6.1 Runtimes vs. number of parallel faults for circuit 74181 (runtime in seconds; 2 and 4 partitions).
For circuits C880 and C499, all of the reduction in runtime was due to the use of
processor idle time, as the number of aborted faults and α were fairly constant. Overall, the
addition of fault partitioning decreased the runtimes by 26% to 43%. The reductions were
larger in the 4 partition case, which is logical because 4 processors would have more
processor idle time.
More research will be required to determine if a better fault selection scheme will
have a significant impact on runtime, but the results presented in this section suggest that
adding additional parallelisms to the system is an effective method of reducing runtimes.
For the problem of ATPG on a single hard-to-detect fault, the addition of search space
parallelism [28] should be as effective.

Figure 6.2 Runtimes vs. number of parallel faults for circuit C432 (runtime in seconds; 2 and 4 partitions).
Figure 6.3 Runtimes vs. number of parallel faults for circuit C499 (runtime in seconds; 2 and 4 partitions).
Figure 6.4 Runtimes vs. number of parallel faults for circuit C880 (runtime in seconds; 2 and 4 partitions).
Chapter 7
Conclusions and Future Work
The goal of this research was to develop a topological partitioning ATPG system
based upon the PODEM algorithm and analyze its results. The analysis was intended to aid
in determining the effect of decreasing communications latencies on the performance of
the ATPG system. The following sections present conclusions based on the results
and recommendations for future work.
7.1 Conclusions
The major task necessary for this research was the development of the topologically
partitioned ATPG system based upon the PODEM algorithm. This task required the porting
of the original TGS system to the ESP environment, characterization of the performance
of the ES-KIT 88K processor (see Appendix A), and development of a fault partitioning
parallel version of the TGS system (see Appendix B). Development of the TOP_TGS
system from the ES-TGS system required the development of over 5,000 lines of new code.
All of these systems were developed in a parallel processing environment which was still
in its early state of development. The environment was therefore imperfect in its
implementation and it did not include any debugging support.
Development of a topologically partitioned ATPG system also included
development of algorithms to partition the circuit topologically across processing nodes.
Two new partitioning algorithms were developed and proven to be more efficient than
those previously developed. Criteria used for algorithm selection included memory
efficiency and message passing overhead.
Three algorithmic enhancements were developed to increase the efficiency of the
ATPG process and introduce parallelism. The optimistic backtrace procedure increased the
amount of work that was done between backtrace messages. This increase in work per
message decreased the number of messages required for test generation and thereby
decreased the ATPG runtimes. The multiple backtrace procedure introduced parallelism
into the ATPG process by allowing multiple backtrace operations to be processed
simultaneously. Finally, the optimistic propagation procedure introduced more parallelism
by allowing local implications and objective selections to be processed in parallel with
backtrace operations. Implementation of these enhancements to the TOP_TGS system
involved generating over 7500 lines of new code.
The results of the addition of these parallelizations were mixed. Some circuits
experienced a dramatic decrease in runtimes such that the topologically partitioned ATPG
system was faster than the original serial version. In these cases, decreasing the
communications latency by running the system on a higher performance machine, and
decreasing the amount of overhead processing in the distributed algorithm should yield
useful speedups on 2 to 4 processors.
Other circuits exhibited an increase in computational complexity caused by an
increase in aborted faults which reduced or negated the benefits of the parallelizations. The
increase in backtracks was caused by the fact that the enhancements caused the ATPG
algorithm to search unproductive portions of the solution space. In all cases, faults with any
backtracks did not achieve as much improvement as those without backtracks. This result
was caused by the serial nature of the backtracking process. It is believed that decreasing
the likelihood of backtracking by improving the partitioning algorithm and the ATPG
search heuristics will have a large beneficial impact on the runtimes for these circuits.
Included in this work was an analysis of the ATPG process as a binary search. The
result of this analysis was a unique model of the search process which can be used to predict
the number of different types of operations that must be performed to generate a test.
Experiments were performed that verified the accuracy of this model and proved that it can
be used to predict the complexity of ATPG on the entire fault list given the results for a few
faults. This model was then used to predict the change in performance of the topological
partitioning ATPG system given a change in communications latency.
The model was also used to determine the increase in processing time of the
distributed ATPG system that was due to increased overhead. This analysis showed that the
overhead was a significant contributor to the increased runtimes of the distributed system
versus the serial system and it will have to be reduced in order for the speedups to become
useful.
The final contribution of this research was to develop a system which could be used
as the basis of future research and to identify in detail the avenues of investigation that must
be undertaken to continue development. The next section details this future work.
7.2 Future Work
The initial work towards improving the results of the topologically partitioned
ATPG system should be directed towards the partitioning algorithm. Improving the
partitioning algorithm could have a great impact on the number of messages required for
ATPG as well as the operational complexity of test generation. Keeping all nodes in a
reconvergent fanout loop within one partition should reduce the number of bad
assumptions made in the optimistic propagation procedure and reduce the amount of
speculative work that is done incorrectly. It may also decrease the number of backtracks
required for test generation.
One method which may improve the partitioning algorithm by keeping
reconvergent fanout loops in the same partition is to use the supergate partitioning method
developed in [17]. Another technique which can be used to improve the partitioning
algorithm is to use a simulated ATPG approach in calculating the cost vectors used in the
optimization portion of the algorithm. This approach is based on data suggesting that
measuring the number of messages required to do ATPG on a small number of faults will
yield a good approximation of the number of messages required for full ATPG. Simulated
ATPG involves simulating a small number of ATPG operations such as implication and
backtracing and determining the number of messages required to perform them. These
numbers can then be used to estimate the number of messages required for full ATPG
which is used as a measure of the quality of the current partition.
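A sketch of how such a cost estimate might look follows. All of the names here (estimate_message_cost, simulate_backtrace, simulate_implication) are hypothetical; the dissertation only outlines the idea.

    // Score a candidate partition by simulating a few ATPG operations and
    // counting the partition-boundary crossings (i.e., messages) they need.
    long estimate_message_cost(const partition &p,
                               const fault sample_faults[], int num_samples) {
        long messages = 0;
        for (int i = 0; i < num_samples; i++) {
            messages += simulate_backtrace(p, sample_faults[i]);
            messages += simulate_implication(p, sample_faults[i]);
        }
        return messages;   // lower estimated message count => better partition
    }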
The second area which deserves some work in the topologically partitioned ATPG
system is the overhead processing. The overhead involved in the calculation of the D
frontier could possibly be reduced by using a polling method of communication instead of
the ring method used now. In the polling method, the master would simply broadcast a
message to all of the test_generators requesting the current D frontier value. The
test_generators would calculate their D frontiers and send them back to the master. The
master would then make the decision as to which processor had the D frontier. This method
would reduce the extra calculation of the D frontier necessary in the ring method, and the
calculations of the individual frontiers would be done in parallel. It would, however,
increase the processing load on the master, possibly causing it to become a bottleneck.
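A sketch of the polling protocol follows; request_d_frontier() and the reply path are hypothetical method names patterned on the description above.

    // Master broadcasts a D frontier request; each test_generator computes
    // its local D frontier in parallel and replies asynchronously.
    void master::poll_d_frontier() {
        replies_outstanding = num_test_gens;         // assumed counter
        for (int tg = 0; tg < num_test_gens; tg++)
            test_gens[tg]->request_d_frontier();     // parallel local computation
    }

    // Reply handler: once all replies are in, the master decides which
    // processor actually holds the D frontier.
    void master::d_frontier_reply(int tg_no, int local_d_frontier) {
        local_frontiers[tg_no] = local_d_frontier;   // assumed reply table
        if (--replies_outstanding == 0)
            choose_d_frontier();                     // assumed decision routine
    }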
Reduction of the overhead involved in the implication procedure is a more difficult
task. There is no apparent way to process the gates in a levelized order in the distributed
system without incurring a large increase in communications overhead. It may be possible
to make the implication procedure more intelligent by determining controlling and non-
controlling values and using them to determine when gate evaluation is really necessary.
However, this approach would have to be added to the serial system for comparison, and
would increase its speed as well. The benefits to the distributed system may be higher than
for the serial system because of the large number of gate evaluations involved.
There are many improved ATPG algorithms that have been developed subsequent
to PODEM, but as discussed in Chapter 2, most of them are PODEM based and can be
added to the TOP_TGS system. For the sake of comparison, the improvements would also
have to be added to the serial system. However, it is hoped that the improvement to the
distributed system, in decreasing bad decisions and backtracks, would be greater.
The algorithmic improvements generally fall into three categories, those that can be
added to the system at no cost, those that will certainly incur additional cost, and those that
may or may not incur costs depending on the implementation. Here the reference to cost
most frequently means extra communications overhead in the distributed system. Table 7.1
summarizes some of these algorithmic improvements and the categories into which they
fall.
The use of headlines and basis nodes as pseudo primary inputs during ATPG as in
FAN and TOPS could be added to the TOP_TGS system without incurring any additional
message overhead. The same could be said for the use of search strategy switching. The
addition of these techniques would then most certainly increase the speed of the distributed
system.
The X-path check and E-frontier techniques would definitely add additional
message requirements to the TOP_TGS system. Their addition could result in speedup or
slowdown depending on how much improvement they made versus the increase in message
overhead.
The variable cost improvements such as redundancy analysis and improved
implication may result in increased overhead if they are allowed to generate additional
message traffic. However, the improvements in these categories could be implemented to
work only within the partition boundaries and not generate additional messages. This
restriction would decrease the usefulness of the improvements, but it may provide a better
trade-off in utility versus message overhead.
More research is necessary into the addition of other methods of parallelization to
the TOP_TGS system. Although the results of adding fault partitioning to the system were
not as good as hoped, the addition of search space partitioning may prove to be more
beneficial. It would certainly help to speed up processing of hard-to-detect faults, which now
seem to be the biggest problem for the TOP_TGS system. It may also be more profitable to
try fault partitioning on the unenhanced version of the TOP_TGS system. This version has
few inherent parallelisms, and the increased processor idle time may be used more
effectively by fault partitioning. Fault partitioning may also be more effective in the
enhanced TOP_TGS system if the partitioning algorithm is improved.

Table 7.1 Algorithmic Improvements

No Cost                                 Additional Cost              Variable Cost
headlines {FAN}, [4]                    X-path check {PODEM}, [3]    redundancy analysis {EST}, [15]
basis nodes {TOPS}, [6]                 E-frontier {EST}, [15]       improved implication {FAN, Socrates}, [4], [7]
search strategy switching {S3}, [41]                                 unique sensitization {FAN, Socrates}, [4], [7]
multiple backtrace {FAN}, [4]
dominators {TOPS}, [6]
Finally, investigations into hardware acceleration of the topologically partitioned
ATPG process should be undertaken. A parallel reduction network (PRN) [42] could be
used to accelerate processing of certain critical synchronization points. The PRN could be
used to calculate the D frontier and distribute the results back to the processors. The PRN
could also be used in determining the idle status of the test_generator objects. Both of these
tasks are among the most message-intensive parts of the ATPG process, and handling them
with a PRN could greatly increase the speed of test generation.
References
[1] Levitt, Marc E., ASIC Testing Upgraded, IEEE Spectrum, pp. 26-29, May 1992.
[2] Roth, J. Paul, Diagnosis of Automata Failures: A Calculus and a Method, IBM Journal of Research and Development, Vol. 10, pp. 278-291, July 1966.
[3] Goel, Prabhakar, An Implicit Enumeration Algorithm to Generate Tests for Combinational Logic Circuits, IEEE Transactions on Computers, Vol. C-30, No. 3, pp. 215-222, March 1981.
[4] Fujiwara, Hideo, Takeshi Shimono, On the Acceleration of Test Generation Algorithms, IEEE Transactions on Computers, Vol. C-32, No. 12, pp. 1137-1144, December 1983.
[5] Fujiwara, Hideo, S. Toida, The Complexity of Fault Detection: An Approach to Design for Testability, Proceedings of the 12th International Symposium on Fault Tolerant Computing, pp. 101-108, June 1982.
[6] Kirkland, Tom, M. Ray Mercer, A Topological Search Algorithm for ATPG, Proceedings of the 24th ACM/IEEE Design Automation Conference, pp. 502-508, 1987.
[7] Schulz, Michael H., Erwin Trischler, Thomas M. Sarfert, SOCRATES: A Highly Efficient Automatic Test Pattern Generation System, IEEE International Test Conference, pp. 1016-1026, September 1987.
[8] Waicukauski, John A., Paul A. Shupe, David J. Giramma, Arshad Matin, ATPG for Ultra-Large Structured Designs, Proceedings of the 1990 IEEE International Test Conference, pp. 44-51, September 1990.
[9] Mallela, S., S. Wu, A Sequential Circuit Test Generation System, Proceedings International Test Conference, pp. 57-61, November 1985.
[10] Ma, H. K. T., S. Devadas, A. R. Newton, Test Generation for Sequential Circuits, IEEE Transactions on Computer-Aided Design, Vol. 7, No. 10, pp. 1081-1093, October 1988.
[11] Tham, Kit Yoke, Parallel Processing of CAD Applications, IEEE Design and Test of Computers, pp. 13-17, October 1987.
[12] Grimshaw, A., An Introduction to Parallel Object-Oriented Programming with MENTAT, Computer Science Report No. TR-91-07, University of Virginia.
[13] Patil, Srinivas, Prith Banerjee, A Parallel Branch and Bound Algorithm for Test Generation, IEEE Transactions on Computer-Aided Design, Vol. 9, No. 3, pp. 313-322, March 1990.
[14] Klenke, Robert H., Ronald D. Williams, James H. Aylor, Parallel Processing Techniques for Automatic Test Pattern Generation, IEEE Computer, pp. 71-84, January 1992.
[15] Giraldi, John, Michael L. Bushnell, EST: The New Frontier in Automatic Test Pattern Generation, Proceedings of the 27th ACM/IEEE Design Automation Conference, pp. 667-672, June 1990.
[16] Brglez, Franc, Hideo Fujiwara, A Neutral Netlist of Ten Combinational Benchmark Circuits and a Target Translator in FORTRAN, IEEE International Symposium on Circuits and Systems, Special Session on ATPG, June 1985.
[17] Jokl, James A., Topologically-Partitioned Automatic Test Pattern Generation, Ph.D. Dissertation, University of Virginia, September 1991.
[18] Bell, Robert H., Jr., Robert H. Klenke, James H. Aylor, Ronald D. Williams, Results of a Parallel Automatic Test Pattern Generation System for a Distributed-Memory Multicomputer, Proceedings of the Fifth Annual IEEE ASIC Conference, September 1992 (scheduled to appear).
[19] Kirkland, Tom, M. Ray Mercer, Algorithms for Automatic Test Pattern Generation, IEEE Design and Test of Computers, pp. 43-55, June 1988.
[20] Fujiwara, Hideo, Logic Testing and Design for Testability, The MIT Press, Cambridge, Massachusetts, 1986.
[21] Hayes, John P., Fault Modeling, IEEE Design and Test, pp. 37-44, April 1985.
[22] Chandra, Susheel J., Janak H. Patel, Test Generation in a Parallel Processing Environment, IEEE International Conference on Computer Design, pp. 11-14, 1988.
[23] Patil, Srinivas, Prith Banerjee, Fault Partitioning Issues in an Integrated Parallel Test Generation System, IEEE International Test Conference, pp. 718-726, September 1989.
[24] Chandra, Susheel J., Janak H. Patel, Experimental Evaluation of Testability Measure for Test Generation, IEEE Transactions on Computer-Aided Design, Vol. 8, No. 1, pp. 93-97, January 1989.
[25] Chandra, Susheel J., Janak H. Patel, Test Generation in a Parallel Processing Environment, IEEE International Conference on Computer Design, pp. 11-14, 1988.
[26] Wah, Benjamin W., Guo-jie Li, Chee Fen Yu, Multiprocessing of Combinatorial Search Problems, IEEE Computer, pp. 93-108, June 1985.
[27] Rao, V. Nageshwara, Vipin Kumar, Parallel Depth First Search, Part I: Implementation, International Journal of Parallel Processing, Vol. 16, No. 6, pp. 479-499, June 1987.
[28] Patil, Srinivas, Prith Banerjee, A Parallel Branch and Bound Algorithm for Test Generation, Proceedings of the 26th ACM/IEEE Design Automation Conference, pp. 339-344, June 1989.
[29] Patil, Srinivas, Prith Banerjee, A Parallel Branch and Bound Algorithm for Test Generation, IEEE Transactions on Computer-Aided Design, Vol. 9, No. 3, pp. 313-322, March 1990.
[30] Smith, Steven P., Bill Underwood, M. Ray Mercer, An Analysis of Several Approaches to Circuit Partitioning for Parallel Logic Simulation, IEEE International Conference on Computer Design, pp. 664-667, 1987.
[31] Bollinger, S. Wayne, Scott F. Midkiff, An Investigation of Circuit Partitioning for Parallel Test Generation, IEEE VLSI Test Symposium, April 1992.
[32] Feng, Tse-Yun, A Survey of Interconnection Networks, IEEE Computer, pp. 12-27, December 1981.
[33] Karp, Alan H., Programming for Parallelism, IEEE Computer, pp. 43-57, May 1987.
[34] Smith, K. Stuart, Robert J. Smith, II, Guy S. Caldwell, Claudia Porter, William J. Leddy, Arjun Khanna, Arunodaya Chatterjee, Ying T. Hung, Douglas W. Hahn, Wayne P. Allen, Experimental Systems Project at MCC, MCC Technical Report Number ACA-ESP-089-89, March 2, 1989.
[35] Stroustrup, Bjarne, The C++ Programming Language, Addison-Wesley Publishing Company, Reading, Massachusetts, 1987.
[36] Halstead, Robert H., Jr., Guo-jie Li, Chee Fen Yu, Parallel Symbolic Computing, IEEE Computer, pp. 35-43, August 1986.
[37] Abramovici, M., P. R. Menon, D. T. Miller, Critical Path Tracing - An Alternative to Fault Simulation, Proceedings of the 20th ACM/IEEE Design Automation Conference, pp. 214-220, June 1983.
[38] Goldstein, Lawrence H., Controllability/Observability Analysis of Digital Circuits, IEEE Transactions on Circuits and Systems, Vol. CAS-26, No. 9, pp. 685-693, September 1979.
[39] Kirkpatrick, S., C. D. Gelatt, Jr., M. P. Vecchi, Optimization by Simulated Annealing, Science, Vol. 220, No. 4598, pp. 671-680, May 13, 1983.
[40] Burton, F. Warren, Speculative Computation, Parallelism, and Functional Programming, IEEE Transactions on Computers, Vol. C-34, No. 9, pp. 1190-1193, December 1985.
[41] Min, Hyoung Bok, Strategies for High Performance Test Generation, Ph.D. Dissertation, University of Texas at Austin, December 1990.
[42] Reynolds, Paul F., An Efficient Framework for Parallel Simulations, International Journal of Computer Simulation (to appear).
Appendix: A
The ES-KIT Parallel Processing System*
The following appendix briefly describes the Experimental Systems Kit (ES-KIT)
parallel processor and the object-oriented programming environment that runs on it. The
ES-KIT is a distributed memory parallel processing environment developed to facilitate
experimentation into new parallel computer architectures and application specific
computing nodes. The system includes the ES-KIT 88K parallel processor and the
Extensible Software Platform (ESP) software system. A more detailed description of the
ES-KIT system can be found in [34].
A.1 ES-KIT Hardware
The ES-KIT 88K parallel processor is a 16 node 2D mesh architecture. Each node
in the mesh consists of a Processor Module, a Memory Module, a Message Interface
Module (MIM) and a Bus Terminator Module. The Processor Module contains the node’s
central processor, which consists of a Motorola 88100 RISC microprocessor and two 88200
MMUs. The Memory Module houses the node’s main memory space, which consists of
8 MB of RAM in the standard configuration (a maximum of 32 MB is possible). The
Message Interface Module contains another 88100 processor which performs all
computation necessary to interface between the Processor Module and the mesh
communications subsystem which is based on an Asynchronous Message Routing Device
(AMRD). The AMRD allows bidirectional communication on the mesh with all routing
functions being handled independently of the node’s processors. The Bus Terminator
Module includes the system clock and several UARTs used for debugging as well as
termination for the 88K high speed bus over which the boards that make up a node
communicate. A Sun 3/140 running the UNIX BSD 4.1 operating system acts as a host for
the 88K parallel processor.
A.2 ES-KIT Software
The Extensible Software Platform is written in C++ and provides complete support
for object-oriented parallel programming. The environment consists of four major
components, the Interservice Support Daemon (ISSD), the mail daemon, the shadow
process, and the ESP kernels. ESP runs on the ES-KIT 88K processor or a network of Sun3
workstations.
The ISSD is the heart of the total run-time system. It is the major interface between
the ES-KIT environment and the outside world. The ISSD controls the starting and the
termination of applications and all communication with peripheral devices including
terminal and disk I/O. It allows a variety of node configurations that include a
heterogeneous combination of 88K nodes and Sun 3 systems.
The ESP kernel is the portion of the system that actually runs the applications. It
performs memory management, message packing and unpacking, and task switching for
the objects. Additional functionality is provided by several Public Service Objects (PSOs)
which run as user processes on top of the kernel. The kernel runs as a standard UNIX
process on the SUNs and on top of a rudimentary OS on the ES-KIT 88K processor.
The mail daemon is responsible for routing messages between individual or groups
of ESP kernels. Each daemon is connected to all of the kernels in its group, every other
mail daemon in the configuration and the ISSD. This connectivity allows messages to be
passed from kernel to kernel with minimum handling. The shadow process runs on the Sun
host and is responsible for reading application source/object code and managing terminal
I/O.
The ESP supports the C++ programming language. Some language modifications
are made to account for the distributed nature of the hardware. The software system allows
objects to be located on arbitrary nodes and supports the concept of remote method
invocation. Remote method invocation is similar to a remote procedure call. Objects
which will be placed on remote nodes must be derived from a C++ class called
“remote_base”. Each object derived from remote_base can be referenced by its ‘handle’,
or address, for the purpose of method invocation. Message passing is used to implement
remote method invocation and return values.
The concept of ‘futures’ allows the caller of a remote method to remain unblocked
while the method is running on a different node. If the caller needs the result of the method
at a later time, it can declare the return value as a future which allows it to carry out some
computation while the result is being generated. Futures are implemented in ESP as a C++
class.
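A sketch of how these pieces fit together follows. Only remote_base, handles, and the notion of a future are described above; run_podem(), do_other_work(), and the future<int> spelling are hypothetical, and the actual ESP syntax may differ.

    // A class whose instances may be placed on remote nodes derives from
    // remote_base and is referenced through its handle.
    class test_generator : public remote_base {
    public:
        int run_podem(int fault_no);                 // hypothetical remote method
    };

    void run_overlapped(test_generator *tg_handle, int fault_no) {
        // Declaring the return value as a future keeps the caller unblocked
        // while the method executes on another node.
        future<int> result = tg_handle->run_podem(fault_no);
        do_other_work();                             // overlaps remote computation
        int backtracks = result;                     // blocks only if not yet ready
    }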
A.3 System Characterization
Experiments were conducted to characterize and study the performance of the ES-
Kit system (both hardware and software). The main purpose of the characterization
experiment was to establish a reference that would guide a programmer in the design and
coding of algorithms that make efficient use of the hardware and software resources
provided by the machine.
The performance metrics studied included:
• Message speed linearity
• Message speed under load
• Object-oriented paradigm overhead
• File I/O
The following sections describe each of the above experiments in detail.
A.3.1 Message Speed Linearity
The purpose of this experiment was to study the linearity of the message fabric in
terms of the time taken to transport messages of increasing size. A sender/receiver pair of
nodes was set up with the sender sending messages (a byte stream) to the receiver upon
request. The exchange was timed by the receiver node and included the time for the
placement of the request which remained constant throughout the experiment. The time
to set up the message was not included. In order to smooth out variations in the
measurements, each experiment was run several times and an average value was taken. The
results are shown in Figure A.1.
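A sketch of the receiver's measurement loop follows; every name here is hypothetical, and only the structure of the experiment comes from the text.

    // Time a request/transfer exchange of msg_size bytes, averaged over
    // several repetitions to smooth out variations.
    double mean_transport_time(sender *s, int msg_size, int repetitions) {
        double total = 0.0;
        for (int i = 0; i < repetitions; i++) {
            double start = read_clock();       // assumed timer routine
            s->request_bytes(msg_size);        // place the request (constant cost)
            receive_message();                 // wait for the byte stream
            total += read_clock() - start;
        }
        return total / repetitions;
    }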
Figure A.1 Message speed linearity (message transport time vs. message size).

The graph shows a fairly linear response by the message fabric. The spikes are due
to the fragmentation of the message memory and the consequent need for coalescing. The
fragmentation occurs because the experiments were run in a loop and not as new processes.
The graph also shows that the average message rate is approximately 1 Mbyte/sec.

A.3.2 Message Speed under Load

In the previous experiment, the sender-receiver pair operated on a traffic-free mesh.
In this experiment, however, multiple sender-receiver pairs were created to produce traffic
on the mesh. The traffic caused contention for the communication links as the sender and
receiver of a pair were placed on separate nodes. The sender-receiver pairs exchanged
10,000 byte messages. The time measurement was similar to the previous experiment.

Table A.1 presents the results of the experiment in which 8 sender-receiver pairs
were set up as shown in Figure A.2. The solid arrows in the figure represent the actual
hardware connections between the nodes while the dotted arrows represent the logical
connections established as a part of this experiment. For example, node 0 sends to node 15,
node 1 to node 14, and so on. Table A.2 presents the results of a similar experiment with
16 sender-receiver pairs. Again, the sender and receiver of a pair lie on distinct nodes. As
expected, both tables show a reduction in the average byte transfer rate. One point to note
is that the transfer rates are fairly uniform across all sender-receiver pairs.

Table A.1 8 Sender-Receiver pairs

Sender/Recvr    Mean time, sec (15 runs)    Rate (KB/sec)
7 - 0           0.0275                      363.6
6 - 1           0.0281                      355.8
5 - 2           0.0282                      354.6
4 - 3           0.0282                      354.6

Table A.2 16 Sender-Receiver pairs

Sender/Recvr    Mean time, sec (15 runs)    Rate (KB/sec)
15 - 0          0.0371                      269.5
14 - 1          0.0387                      270.5
13 - 2          0.0363                      275.4
12 - 3          0.0357                      280.3

Figure A.2 Traffic pattern on mesh (nodes 0-15 in a 4x4 grid; solid arrows show the physical connections, dotted arrows the logical sender-receiver connections).

A.3.3 Object-Oriented Paradigm Overhead

The ES-Kit supports an object-oriented paradigm that allows an object to invoke
methods of class objects located on remote nodes. This feature is allowed only if the user-
defined classes are derived from the remote_base class. The aim of this experiment was to
measure the overhead of invoking methods of classes derived from remote_base by
comparing the time for method invocation against the time for an ordinary procedure call.
The experiment involved making 100 procedure calls and 100 method invocations in
separate loops and measuring the times for each.
The system will allow approximately 2.5 million procedure calls/sec as would be
expected from a 15 MIPS processor, but it allows only approximately 2500 method
invocations/sec. This result suggests that the ES-Kit supports coarse-grained parallelism
with large objects. Object methods on the ES-KIT should be designed to encompass large
sections of the total functionality of the object because of the large overhead inherent in
method invocation.
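A sketch of the comparison follows; the timer and the do-nothing routines are hypothetical.

    // Time 100 ordinary procedure calls against 100 remote method
    // invocations on a remote_base-derived object.
    void measure_invocation_overhead(test_object *remote_obj) {
        const int N = 100;

        double t0 = read_clock();              // assumed timer routine
        for (int i = 0; i < N; i++)
            plain_procedure();                 // ordinary local call
        double per_call = (read_clock() - t0) / N;

        t0 = read_clock();
        for (int i = 0; i < N; i++)
            remote_obj->null_method();         // remote method invocation
        double per_invocation = (read_clock() - t0) / N;
        // Reported rates: ~2.5 million calls/sec vs. ~2500 invocations/sec.
    }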
A.3.4 File I/O
Nodes on the ES-Kit access files on the host via the ISSD interface. This
connection can become a potential bottleneck for applications involving extensive file I/O
from the nodes. The performance of the file system in reading from and writing to ISSD
files on the host is presented in Figures A.3A and A.3B. The experiments were run for
increasing byte sizes, with multiple runs per size to smooth out variations in the results. The
results indicate that file I/O on the ES-KIT takes place at an average rate of approximately
50 KBytes/sec. This data suggests that file I/O on the ES-KIT will be time consuming, and
applications should be designed to minimize the necessity of file operations.
Figure A.3A File read times (read time vs. transfer size). Figure A.3B File write times (write time vs. transfer size).

* (this appendix was coauthored by Sanjay Srinivasan)
Appendix: B
ES-TGS Parallel Fault Partitioning System
This section describes the ES-KIT TGS system called ES-TGS. ES-TGS is derived
from the TGS system developed by researchers at the University of Southern California and
Mississippi State University. The TGS system uses the PODEM algorithm for test
generation and the Critical Path Tracing algorithm [37] for fault simulation. The parallel
ES-TGS system is based on the fault partitioning method of parallelization.
B.1 Scalar ES-TGS System
The first step in adapting the TGS system for use in the ESP environment was to
convert the system to the C++ language and encapsulate the functionality into objects. The
architecture of the resulting system is shown in Figure B.1. The system is composed of 5
objects: the test generator object, the fault simulator object, the master object, and the
reader and writer objects.

Figure B.1 Scalar ES-TGS system architecture (reader, writer, master, fault simulator, and test generator objects).

The master object controls the entire application. It begins by instantiating and
initializing the other objects. It then builds the master fault list from the circuit description.
Once this process is completed, the master enters a loop in which it selects a candidate fault
for test generation from the undetected faults on the fault list and sends the fault to the test
generator for ATPG. The resulting test vector is sent to the fault simulator, which
determines the list of detected faults to be marked off of the master fault list. The master
then selects another candidate fault to repeat the process. This loop continues until the
desired fault coverage of 100% is reached, or until every remaining fault has either been
marked as detected or has had ATPG attempted on it at least once.
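A sketch of that loop follows; aside from the objects described above, the method names are hypothetical.

    // Scalar master loop: generate a test, fault simulate it, mark detected
    // faults, and repeat until the stopping condition is met.
    void master::run() {
        int fault_no;
        while ((fault_no = select_next_fault()) != -1) {   // assumed selector
            test_vector v = test_gen->generate(fault_no);  // PODEM on one fault
            if (v.found) {                                 // a test was generated
                fault_list d = fault_sim->simulate(v);     // critical path tracing
                mark_detected(d);                          // cross faults off the list
                writer->store(v);                          // keep the useful vector
            }
            if (coverage() >= 1.0)                         // 100% fault coverage
                break;
        }
    }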
The test generator object performs the PODEM test generation algorithm. It
receives a fault from the master object, generates a test for the fault if possible, and then
returns the resulting test back to the master. The test generator uses the SCOAP [38]
testability measure to guide the ATPG process. A backtrack limit is included for redundant
or hard-to-detect faults.
The fault simulator object uses the critical path tracing algorithm. It receives the
test vector produced by the test generator from the master object. It performs fault
simulation on the vector and returns the list of detected faults to the master.
The reader and writer objects were added to the TGS system to deal with the limited
I/O performance of the ES-KIT. The reader object was added to read in the circuit netlist
database from the host file system into memory and distribute it to all of the objects that
need it. This operation eliminates the performance penalty of multiple reads from the host
disk. The reader object is no longer needed after all of the other objects have the netlist and,
at that time, it is deleted. The writer object stores the successful test vectors from the master
object and writes them out to disk when the ATPG process is complete. This organization
eliminates the performance penalty of having one disk access in each of the test generation/
fault simulation loops performed by the master object.
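The batching pattern used by the writer object amounts to the following: accumulate vectors in memory during the run and pay the host-file-system cost once at the end. A minimal sketch with hypothetical names, not the ES-TGS source:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Accumulates test vectors in memory and writes them to the host
    // file system in a single pass when ATPG completes.
    class Writer {
    public:
        void store(const std::string& vec) { vectors_.push_back(vec); }

        // Called once, after the last fault has been processed.
        bool flush(const char* path) {
            FILE* f = std::fopen(path, "w");
            if (!f) return false;
            for (const std::string& v : vectors_)
                std::fprintf(f, "%s\n", v.c_str());
            return std::fclose(f) == 0;
        }
    private:
        std::vector<std::string> vectors_;
    };

    int main() {
        Writer w;
        w.store("01001101");  // vectors arrive one at a time from the master
        w.store("11100010");
        return w.flush("tests.out") ? 0 : 1;
    }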
The scalar version of the ES-TGS system has been run on the ISCAS '85 [16]
benchmark circuits in the Sun3-based ESP environment. The test vectors generated were
verified to be the same as those of the original C version. The application has also been run
on the ES-KIT 88K processor; however, the larger ISCAS circuits required too much memory
to be run on a single 88K node. Table B.1 summarizes the runtimes for the original TGS
system running on Sun3 and Sun4 processors and for the ES-TGS system running on the
Sun3 and on the ES-KIT 88K processor. Note that the runtimes include only the time taken
for the ATPG portion of the application and do not include file I/O or fault collapsing times.
The data shows that there appears to be little or no performance penalty for running
in the ESP environment. This conclusion can be drawn from a comparison of the runtimes
of the original C version and the ESP C++ version running on the same Sun3 platform.
However, there is little communication between the objects in the scalar ES-TGS system,
so the communication overhead of the ESP environment has not yet become a factor. The
increase in performance between the ESP Sun3 version and the ES-KIT 88K version is due
to the increased performance of the 88K CPU. The same is true for the runtimes of the Sun3
versus Sun4 C versions. The runtimes for circuit C17 in the ESP environment, on both the
Sun3 and the 88K, are dominated by setup time, which includes the time to read the circuit
database from the host file system.

ISCAS '85    Sun3 C     ES-KIT C++      ES-KIT C++     Sun4 C
Circuit      Version    Sun3 Version    88K Version    Version
C17          1.14       19.4            3.19           0.15
C432         276.8      213.2           60.1           33.6
C499         220.3      167.8           44.5           27.3
C880         327.3      304.4           108.8          39.7
C1355        958.9      656.5           171.6          120.8
C1908        2428.9     1662.3          278.3          338.8
C3540        13285.0    11971.0         2142.3         2911.6

Table B.1 Scalar ES-TGS ATPG times (sec)
B.2 Parallel ES-TGS Version 1
The first step in parallelization of the ES-TGS system was simply to instantiate
multiple test generator and fault simulator objects. In the simplest version, there is only one
object on each processing node, either a test generator or a fault simulator. Half of the
available processing nodes are used to perform test generation and half are used to perform
fault simulation. The master object performs essentially the same function as in the scalar
system with a few minor differences. When a test generator becomes (or starts out) idle, the
master selects an untested and undetected fault from the fault list. This fault is marked as
tested and transmitted to the idle test generator. When a successful test vector is generated,
it is returned to the master where it is stored in a queue along with the fault under test. If
there is an idle fault simulator, the master sends that test vector to the idle fault simulator
for fault simulation. The master then selects another fault and restarts the test generator.
When a fault simulator returns its list of detected faults, the master object marks those faults
as detected. If the vector detects some new faults, it is sent to the writer object for storage.
The master then looks for a test vector in the queue to send to the fault simulator. If it finds
one, the vector is transmitted and the fault simulator is restarted; if not, the fault simulator
remains idle until the next vector is generated by a test generator. Notice that in this
system, the master object must manage all of the intermediate steps in the process including
those necessary for object initialization. The throughput required of the master object keeps
its processing node very busy. The writer object is also busy storing vectors. For this
reason, it was found that the writer object and the master object could not reside on the same
node; if they did, one of the objects' incoming message queues would grow too large. This
requirement that the writer object and the master object reside on individual nodes
effectively eliminates two nodes of any configuration from the list of available processing
nodes. Thus, our system of 16 nodes could have at most 14 test generator and/or fault
simulator objects in this version of ES-TGS. The overall architecture of a system consisting
of 7 nodes is shown in Figure B.2. In this system, the reader and writer objects and the
master object would reside on individual nodes and the test generator and fault simulator
objects would reside on the remaining processing nodes.
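The master's bookkeeping can be sketched as a pair of message handlers. This is a simplified, hypothetical rendering of the behavior described above, not the ES-TGS source; the send_*() calls stand in for ESP messages and are left undefined.

    #include <deque>
    #include <vector>

    struct Vec { std::vector<int> bits; int fault_id; };

    // Skeleton of the Version 1 master's state and dispatch logic.
    struct Master {
        std::deque<Vec> vector_queue;        // vectors awaiting fault simulation
        std::vector<int> idle_simulators;    // node ids of idle fault simulators
        std::vector<bool> detected, tested;  // global fault list state

        void send_fault(int node, int fault_id);
        void send_vector(int node, const Vec& v);
        void send_to_writer(const Vec& v);
        int  next_untested_fault();          // returns -1 if none remain

        // A test generator produced a vector: queue it or dispatch it,
        // then immediately restart the generator on a new fault.
        void on_vector(int gen_node, const Vec& v) {
            if (!idle_simulators.empty()) {
                int sim = idle_simulators.back();
                idle_simulators.pop_back();
                send_vector(sim, v);
            } else {
                vector_queue.push_back(v);
            }
            int f = next_untested_fault();
            if (f >= 0) { tested[f] = true; send_fault(gen_node, f); }
        }

        // A fault simulator finished: update the fault list, keep the
        // vector if it detected anything new, and refill the simulator.
        void on_detected(int sim_node, const Vec& v,
                         const std::vector<int>& new_faults) {
            for (int f : new_faults) detected[f] = true;
            if (!new_faults.empty()) send_to_writer(v);
            if (!vector_queue.empty()) {
                send_vector(sim_node, vector_queue.front());
                vector_queue.pop_front();
            } else {
                idle_simulators.push_back(sim_node);  // idle until next vector
            }
        }
    };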
One decision which needs to be made in this system is how many of the available
processing nodes should be test generators and how many should be fault simulators. The
results obtained with the static scheduling version of the ES-TGS Version 1 system
indicated that an answer was impossible to determine a priori and depended a great deal on
how many hard-to-detect faults were present in a circuit. Some circuits have a preponderance of
hard-to-detect faults and require the majority of the time to be spent in test generation,
while others require the majority of the time to be spent in fault simulation. Most circuits,
however, lie somewhere in between: they require mostly fault simulation at the beginning
of the ATPG process and mostly test generation in its latter part. In order to address this
problem, the Version 1 system was
modified to instantiate one fault simulator and one test generator object on each of the
available processing nodes in the system. Since there is no multitasking in the ES-KIT
operating system, a processing node can only be performing one function at a time. A
dynamic scheduling system using the test vector queue in the master object is used to
determine how many processing nodes will perform each type of task. If the test vector
queue in the master becomes empty, an idle fault simulator node (if there are any) is
selected to become a test generator. If the test vector queue is full, an idle test generator is
selected to become a fault simulator. This system results in higher overall utilization of the
processing nodes, but the task switching adds still more computation load to the already
overworked master object. Therefore, the overall runtimes were only slightly smaller. A
minimal sketch of this switching rule is shown below.
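The rule can be expressed in a few lines of C++ (names and the queue-full threshold are illustrative only, not taken from the ES-TGS source):

    #include <cstddef>

    enum class Role { TestGen, FaultSim };

    // Decide the next role for a node that has just gone idle, based on
    // the master's test vector queue. QUEUE_FULL is an illustrative bound.
    inline Role next_role(Role current, std::size_t queue_len) {
        const std::size_t QUEUE_FULL = 32;
        if (queue_len == 0 && current == Role::FaultSim)
            return Role::TestGen;    // nothing to simulate: generate tests
        if (queue_len >= QUEUE_FULL && current == Role::TestGen)
            return Role::FaultSim;   // queue is full: drain it
        return current;              // otherwise keep the current role
    }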
Figure B.2 Parallel ES-TGS system architecture (Version 1).

Table B.2 summarizes the results of the ES-TGS Version 1 system without
task switching on several of the ISCAS ‘85 benchmark circuits. Note that the number of
processing nodes is the number of nodes which are used for test generation or fault
simulation. In each case, there are two more nodes in the configuration, one for the reader
and writer objects, and one for the master object.
These results show that the Version 1 system attains reasonable speedup only for 2
to 4 processing nodes.

               ATPG Times (sec)                      Speedup
ISCAS '85      No. of Processing Nodes               No. of Processing Nodes
Circuit        2        4        8        10         2      4      8      10
C432           30.99    18.98    16.72    17.85      1.93   3.16   3.59   3.36
C499           28.18    19.43    18.99    18.58      1.61   2.34   2.39   2.44
C880           46.47    41.43    44.95    46.54      1.34   2.63   2.42   2.34
C1355          119.74   71.19    64.65    61.39      1.43   2.41   2.65   2.85
C1908          237.14   138.22   114.65   110.56     1.17   2.01   2.43   2.52
C3540          1550.38  791.70   451.56   406.93     1.86   2.70   4.74   5.26

Table B.2 ES-TGS Version 1 ATPG times
B.3 Parallel ES-TGS Version 2
The main limit to speedup in the Version 1 system was the fact that the master
object was overloaded. In order to address this fact, a “mini-master” object was created to
off-load some of the intermediate processing requirements. One mini-master object is
created on each processing node along with one test generator/fault simulator pair. The
mini-master handles all intermediate processing between the test generator/fault simulator
objects and the master object including the initialization steps. Figure B.3 shows the
architecture of the resulting Version 2 system for a configuration of 5 nodes. In this
system, the reader and writer objects again reside on node 1,1 and the master object resides
on node 1,2. The remaining processing nodes contain one test generator object, one fault
simulator object and one mini-master object.
During the ATPG phase, the mini-master receives a fault from the master. It gives
this fault to the test generator to perform ATPG. When the test generator is successful, it
returns a valid test vector to the mini-master. The mini-master then sends this vector to the
fault simulator. When the fault simulator completes, it sends the resulting list of detected
faults directly to the master. This step is necessary because the master object still maintains
the global fault list. The direct transmission from the fault simulator to the master
eliminates the redundant messages which would be required if the list were sent to the mini-
master first and then on to the master. If the master determines that the vector is good (i.e.,
it detects some as-yet-undetected faults), it signals the mini-master, which then transmits
the vector directly to the writer object for storage.
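Expressed in outline form, the mini-master's per-fault cycle is sketched below. The names are hypothetical and each call would be an ESP message in the real system; note that the detected-fault list bypasses the mini-master on its way to the master.

    #include <optional>
    #include <vector>

    struct Vec { std::vector<int> bits; };

    struct MiniMaster {
        // Local objects on the same node (message endpoints in reality).
        std::optional<Vec> (*generate)(int fault);          // test generator
        std::vector<int>   (*simulate)(const Vec&);         // fault simulator
        void (*report_to_master)(const std::vector<int>&);  // detected faults
        void (*send_to_writer)(const Vec&);                 // good vectors

        // Handle one fault received from the master.
        void on_fault(int fault) {
            std::optional<Vec> v = generate(fault);
            if (!v) return;                     // aborted or redundant fault
            std::vector<int> detected = simulate(*v);
            // The simulator's result goes directly to the master, which
            // owns the global fault list; the master later tells us whether
            // the vector was "good" so we can forward it to the writer.
            report_to_master(detected);
        }
        void on_good_vector(const Vec& v) { send_to_writer(v); }
    };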
Table B.3 shows the results for the Version 2 system on the ISCAS ‘85 benchmark
circuits. Notice that the runtimes for larger numbers of processors scale much better than
those of the Version 1 system. In some cases, the speedup is greater than the number of processing
nodes. This is possible because the master processor node is not included in the processing
node count.
Figure B.3 Parallel ES-TGS system architecture (Version 2).

               ATPG Times (sec)                              Speedup
ISCAS '85      No. of Processing Nodes                       No. of Processing Nodes
Circuit        2        4        8        12       14        2      4      8      12     14
C432           32.98    12.57    6.94     6.11     5.77      1.01   2.59   4.70   5.33   5.65
C499           20.76    7.72     4.50     4.06     4.07      1.58   4.24   7.28   8.07   8.05
C880           22.36    9.88     9.92     10.37    10.70     1.25   2.83   2.82   2.70   2.61
C1355          145.24   55.01    28.04    23.39    24.03     1.16   3.07   6.03   7.23   6.96
C1908          262.32   112.06   70.60    59.21    56.37     1.13   2.61   4.23   5.05   5.30
C3540          2038.04  704.72   306.70   206.40   181.20    1.03   2.98   6.85   10.2   11.6

Table B.3 ES-TGS Version 2 ATPG times
Figure B.4 shows a comparison of the speedup for the Version 2 system versus the
Version 1 system on the C3540 ISCAS '85 benchmark circuit. This benchmark is a medium-
sized circuit with 1669 gates and 3428 single stuck-at faults. The data shows that the
Version 2 system achieves reasonable speedup on this circuit for up to 14 processors.
Speedups for the other ISCAS '85 benchmark circuits are not quite as good, but in the
general case, the Version 2 system performs much better than the Version 1 system and
achieves good speedup for 4 to 8 processors.

Figure B.4 Version 1 vs. Version 2 speedup for circuit C3540.
B.4 Parallel ES-TGS Version 3
The final optimization that was made to the ES-TGS system was to eliminate the
need for the master object during the ATPG phase. This change was accomplished by
dividing up the fault list among the mini-master objects. The mini-masters manage their
own portion of the fault list completely independently. The mini-master selects an untested
fault from its own fault list and sends it to the test generator. The resultant vector is sent to
the fault simulator. The list of detected faults is then marked off of the mini-master’s
portion of the fault list. The portion of the list of detected faults which is not contained on
the mini-master’s own fault list is not transmitted to the other mini-masters. This approach
will result in some redundant work as the other mini-masters generate tests for these already
detected faults. The resulting set of test vectors is also larger. This redundant work is the
major limit to speedup in this Version 3 system. This system is modeled after the static
scheduling system without communication presented in [23]. The runtimes in this system
are also increased somewhat by the fact that some mini-master objects finish their fault list
ahead of the others. However, this inefficiency seems to be a minor one compared to the
one resulting from the generation of the redundant vectors. Table B.4 contains the results
for the Version 3 system on the ISCAS ‘85 benchmark circuits. Note that in this system
there is one more node in the configuration available for placing a mini-master and test
generator/fault simulator group. This configuration is possible because the master object is
used only for initialization and can be placed on the same node as the reader and writer
objects.

               ATPG Times (sec)                              Speedup
ISCAS '85      No. of Processing Nodes                       No. of Processing Nodes
Circuit        2        4        8        12       14        2      4      8      12     14
C432           22.90    18.02    12.84    11.17    9.71      1.28   1.63   2.29   2.64   3.03
C499           15.38    11.21    9.86     7.98     7.49      1.27   1.74   1.98   2.45   2.61
C880           27.07    19.98    13.44    10.30    9.47      1.20   1.63   2.43   3.17   3.45
C1355          115.82   83.71    52.25    43.34    38.74     1.23   1.69   2.72   3.27   3.66
C1908          193.66   160.94   130.36   108.30   85.44     1.32   1.60   1.97   2.37   3.01
C3540          1298.73  813.79   471.50   399.97   317.85    1.60   2.55   4.40   5.19   6.53

Table B.4 ES-TGS Version 3 ATPG times
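The static split of the fault list described above is straightforward to express. The following is a minimal sketch of an interleaved partitioning; the helper name is hypothetical, and the actual ES-TGS partitioning may differ in detail.

    #include <cstddef>
    #include <vector>

    // Deal the collapsed fault list out to P mini-masters round-robin.
    // Each mini-master processes its share with no further communication,
    // which is why a test generated in one partition may redundantly
    // cover faults that belong to other partitions.
    std::vector<std::vector<int>> partition_faults(
            const std::vector<int>& fault_ids, std::size_t P) {
        std::vector<std::vector<int>> parts(P);
        for (std::size_t i = 0; i < fault_ids.size(); ++i)
            parts[i % P].push_back(fault_ids[i]);
        return parts;
    }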
Figure B.5 is a comparison of the speedup obtained for the Version 2 system versus
the Version 3 system for circuit C3540. The results for this circuit indicate that the Version
3 system does not perform as well as the Version 2 system. This result is, in general, the
case for all of the ISCAS '85 benchmark circuits. However, on circuits which have very
few hard-to-detect faults, such as C880, where the Version 2 system performs poorly, the
performance of the Version 3 system is much closer.

Figure B.5 Version 2 vs. Version 3 speedup for circuit C3540.
As stated previously, the performance of the Version 3 system is limited by the
extra work done in generating the redundant vectors. Figure B.6 is a comparison of the
number of test vectors generated by both systems for circuit C3540. Notice that for 14
processors, the Version 3 system generates over twice as many test vectors as the Version
2 system.

Figure B.6 Version 2 vs. Version 3 test set size for circuit C3540.
B.5 Conclusions
Our initial task was to develop a parallel ATPG system for a given distributed
environment. Our methodology began with the characterization of the environment to
determine the optimum grain size the environment supports. We then attempted to
implement the simplest form of parallelism, namely fault partitioning, to gather data on the
system using a real application. A successful public domain scalar ATPG system was used
as a starting point to reduce coding time and ensure algorithm correctness. This ATPG
system was then ported to the target system and run in a scalar mode to determine operating
system overhead, which turned out to be insignificant (compared to Unix). Once this was
completed, parallelization of the ATPG process was undertaken.
The Version 1 system, although simple to implement, is limited by the amount of
processing the master object must perform to manage the entire ATPG process. Simple
analysis of the Version 1 system using Amdahl's law,

$$ S_N = \frac{1}{f + \frac{1 - f}{N}} $$

where $S_N$ is the speedup, $f$ is the fraction of code that is sequential, and $N$ is the
number of processing nodes, indicates that approximately 10% to 33% of the code is
sequential. This result corresponds to a maximum speedup of only 2.5 to 4.0. This analysis
assumes that
the serial portion of the code is the only source of inefficiency and ignores the small amount
of extra work performed in this system generating redundant vectors. The sequential
portion of the code is the management of the ATPG process and the maintaining of the fault
list by the master object. The different figures for the percentage of sequential execution
and hence speedup are caused by the difference in the difficulty of test generation between
the various circuits. A circuit which has many hard-to-detect faults will have significant
processing to be done by the test generator objects in proportion to the work done by the
master in maintaining the fault list. In this case the speedup will not be as limited as in the
case where the faults are all easy and the processing done by the master dominates.
The Version 2 system attempted to reduce the serial portion of the code by off-
loading the management of the intermediate steps of the ATPG process to a mini-master
object on each node. The removal of a relatively small portion of the serial code from the
master yielded a significant increase in the best-case speedup, 11.6, which corresponds to
1.6% of the code being executed serially. The worst-case speedup for the Version 2 system,
2.61, is little better than that of the Version 1 system and corresponds to 33.5% serial code.
In both systems the worst case occurred for circuit C880, which has no hard-to-detect
faults and hence keeps the master object constantly busy managing the fault list.
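The serial fractions quoted above follow from inverting Amdahl's law. Solving for $f$ in terms of a measured speedup $S_N$ gives

$$ f = \frac{N / S_N - 1}{N - 1} $$

so that, with $N = 14$, the best-case speedup of 11.6 yields $f = (14/11.6 - 1)/13 \approx 0.016$, and the worst-case speedup of 2.61 yields $f = (14/2.61 - 1)/13 \approx 0.336$.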
Analysis of the Version 3 system can be done by assuming that the only inefficiency
is due to the extra work done in generating redundant vectors. Since the mini-masters each
have their own portion of the fault list which they process independently, there is no serial
code in the ATPG process. The time taken to generate the tests for a circuit by this system,
given $N$ processing nodes, is given by:

$$ T_N = \frac{T_1 \times C_{calc}}{N} $$
where $T_N$ is the execution time for $N$ processors, $T_1$ is the execution time for the
serial case, and $C_{calc}$ is a factor which takes into account the inefficiency introduced
by the redundant work. Speedup is then given by:

$$ S_N = \frac{T_1}{T_N} = \frac{T_1}{T_1 \times C_{calc} / N} = \frac{N}{C_{calc}} $$

and $C_{calc}$ can be calculated by:

$$ C_{calc} = \frac{N}{S_N} $$
Since $C_{calc}$ is a measure of the redundant work, it should be related to the number of
processors, $N$, and it should also be possible to estimate it from the number of test vectors
generated by the parallel ATPG system. If we define a new factor, $C_{meas}$, as the ratio
of the number of test vectors generated by the parallel system to the number generated by
the serial system, we have:

$$ C_{meas} = \frac{V_N}{V_1} \approx C_{calc} $$
where $V_N$ is the number of test vectors generated with $N$ processing nodes. Table B.5
is a tabulation of $C_{calc}$ and $C_{meas}$ for increasing numbers of processing nodes for
several of the ISCAS '85 benchmark circuits. For example, for circuit C3540 with $N$ = 14,
Table B.4 gives $S_{14}$ = 6.53, so $C_{calc}$ = 14/6.53 ≈ 2.14, which corresponds closely
to the measured vector ratio $C_{meas}$ = 2.36. The data shows a strong correspondence
between $C_{calc}$ and $C_{meas}$. Any differences can be attributed to the additional
inefficiency in the Version 3 system due to imperfect load balancing, which this analysis
ignores. In fact, values of $C_{calc}$ calculated using the average per-processor ATPG time,
rather than the maximum ATPG execution time (as was done for Table B.5), show a much
stronger correlation with $C_{meas}$.
The Version 3 system achieved significantly lower speedups than the Version 2
system in most cases. The exceptions were the circuits with few hard-to-detect faults, such
as C880, on which the Version 2 system performed most poorly. The large amount of
redundant work done by the Version 3 system is the main limit to speedup. This redundant
work can be reduced or even eliminated by having each processor communicate its list of
detected faults to all of the other processors each time fault simulation is performed.
However, this approach produces significant communication traffic and has not been shown
to increase speedup [8]. Dynamic load balancing can be used to address the load balancing
problem, but in general, a Version 3 type system will not scale to large numbers of
processors.
               No. of Processing Nodes
               2               4               8               12              14
ISCAS '85
Circuit        Ccalc  Cmeas    Ccalc  Cmeas    Ccalc  Cmeas    Ccalc  Cmeas    Ccalc  Cmeas
C432           1.61   1.47     2.45   2.31     3.49   3.02     4.54   3.43     4.62   3.57
C499           1.57   1.53     2.30   2.43     4.04   3.46     4.90   4.00     5.36   4.12
C880           1.66   1.38     2.45   1.94     3.29   2.61     3.78   3.02     4.05   3.21
C1355          1.62   1.48     2.36   1.95     2.94   2.45     3.66   2.80     3.82   3.02
C1908          1.51   1.34     2.50   2.01     4.06   2.81     5.06   3.35     4.65   3.50
C3540          1.25   1.31     1.56   1.62     1.81   1.98     2.31   2.22     2.14   2.36

Table B.5 Comparison of Ccalc vs. Cmeas