Chapter 1
Introduction
The advent of synthesis systems for Very Large Scale Integrated Circuits (VLSI)
and automated design environments for Application Specific Integrated Circuits (ASICs)
has allowed digital systems designers to place large numbers of gates on a single IC in
record time. Generation of test patterns for these circuits to ensure that they are fault free,
however, still consumes considerable time. Currently, up to one third of the design time for
ASICs is spent generating tests [1].
Many algorithms have been developed to automate the test generation process
[2],[3],[4], but the test generation problem has been shown to be NP-complete [5]. This
thesis deals with the application of parallel processing techniques to Automatic Test Pattern
Generation (ATPG) to address this problem.
1.1 Motivation
There are two basic approaches to solving the Automatic Test Pattern Generation
(ATPG) problem: algorithmic test pattern generation, and statistical or pseudorandom test
pattern generation. In the algorithmic approach, a test is generated for each fault in the
circuit using a specific ATPG algorithm. Most of these algorithms can be proven to be
complete. That is, they are guaranteed to find a test for a fault if a test does exist. However,
this process may involve a search of the entire solution space which is computationally
expensive.
Statistical or pseudorandom test pattern generation, on the other hand, selects test
patterns at random, or using some heuristic, and determines the faults that are detected by
these patterns using fault simulation. Test patterns are selected and added to the test set if
they detect any previously undetected faults. This process continues until some required
fault coverage or computation time limit is reached. This method finds tests for the easy-
to-detect faults very quickly, but becomes less and less efficient as the easy-to-detect faults
are removed from the fault list and only the hard-to-detect faults are left. In many cases, the
required fault coverage cannot be achieved without excessive computation times.
An efficient combined method for solving the ATPG problem uses statistical
methods to find tests for the easy-to-detect faults on the fault list and switches to an
algorithmic method to find tests for the hard-to-detect faults which remain. Using either this
method or the purely algorithmic method, a significant portion of the computation time will
be spent generating tests for the hard-to-detect faults algorithmically. Therefore, finding a
method to speed up this process should reduce the overall computation time considerably.
Much research has been done in increasing the efficiency of algorithms for ATPG
through heuristics [6],[7],[8]. However, the overall gains that can be achieved through these
improvements are limited and will not be adequate for future needs. This statement can be
justified by two facts. First, no system currently presented in the literature has been proven
on circuits that contain combinational logic blocks larger than 3 or 4 thousand gates.
Second, most sequential ATPG techniques are based on combinational ATPG algorithms
[9],[10]. These systems require that multiple passes be made through the ATPG process in
order to generate a test for a single fault. Therefore, any excessive runtimes will be
multiplied by this process and achieving fast combinational ATPG becomes even more
important.
An alternate approach to heuristics in reducing computation times is to use parallel
processing techniques. Parallel processing machines are becoming available for general use
and are being used to solve other problems in Computer Aided Design [11]. Most of these
readily available parallel processors are distributed memory machines due to cost and
scalability factors. Operating systems are also being developed that allow simple networks
of workstations to be used as distributed memory parallel computing environments [12].
Previous efforts to parallelize the ATPG problem can be placed in one of five
categories: fault partitioning, heuristic parallelization, search space partitioning,
algorithmic partitioning, and topological partitioning [13],[14]. These techniques, which
will be more fully detailed in Chapter 2, usually require each processing node in a
distributed memory system to contain the entire circuit description. However, the
increasing size of VLSI circuits has caused the amount of memory required to process these
circuits to grow rapidly. Topological partitioning techniques can be used to distribute the
circuit database across several processors thereby increasing the size of the largest circuit
that can be processed on a given distributed memory configuration.
For example, results from the EST system [15] have shown that the memory
requirements for processing the ISCAS ‘85 [16] benchmark circuit C7552 can be over 9
MBytes. This circuit contains only 3512 gates, which is relatively small compared to
state-of-the-art VLSI devices. At this rate, a circuit of only 10,000 gates could take as much as
25 MBytes of memory to process. To process circuits this large or larger, typical
commercially available distributed memory multicomputers must be able to take advantage
of the memory space spread across all of their nodes. Thus, topological partitioning of
the database across several processors will be required to perform ATPG on these larger
circuits.
Previous research in topological partitioning for ATPG [17],[18] has focused on the
D-algorithm [2]. The initial effort contained in [17] was directed toward a shared memory
parallel processor, hence, the parallelism exploited was fairly fine grained. The results of
the effort to port this system to a distributed memory multicomputer were mixed [18].
Some speedup was obtained, but the large number of messages required even for simple
circuits significantly limited the speedup. The parallelism exploited in these two systems
was limited to a parallelized implication procedure and fault partitioning which was used
to keep idle processors busy on other faults. The fact that the D-algorithm, which has been
shown to be very inefficient for some classes of circuits, was used in these systems also
increased the overall runtimes and limited the speedup possible. One of the most promising
results presented in [18], however, was that topological partitioning resulted in significant
reductions in the memory required in each processing node. For these reasons, research into
topological partitioning with a more efficient ATPG algorithm such as PODEM [3] was
undertaken.
1.2 Goals
This research focused on a system that is based on topological partitioning of the
circuit-under-test across several processing nodes. The goals of this research included
expanding on previous work [17],[18], and extending it to a more efficient base ATPG
algorithm. Analytical models of the topologically partitioned ATPG process were
developed to help predict the performance that could be expected. These models, once
validated through experimentation, were then used to predict the performance of the ATPG
system. The models were also used to determine the communications latency required on a
multicomputer to efficiently utilize this technique to achieve speedups. Another goal of this
research was to develop parallelizations of the base ATPG algorithm to increase speedup.
Investigations of how these parallelization methods could be used in conjunction with other
parallelization methods such as fault or search space partitioning were also undertaken.
Finally, this research outlined the additional work that will be required to make topological
partitioning a valid addition to ATPG systems for large scale designs.
1.3 Organization
This dissertation is divided into seven chapters, including this introduction. Chapter 2
contains background material which includes a brief review of serial and parallel ATPG
algorithms. A discussion of the ES-KIT distributed memory multicomputer and the ES-TGS
parallel ATPG system used in this research is also included in Chapter 2. Chapter 3
describes the implementation details and results of the serial Topological Partitioning Test
Generation System (TOP-TGS) developed for this research. Chapter 4 details the analytical
model of the serial ATPG process and topologically partitioned ATPG. Predicted results
are also developed in this chapter and compared to the actual results presented in Chapter
3. Chapter 5 details the algorithmic parallelizations developed for the TOP-TGS system
and presents their results. Chapter 6 presents the results of using multiple parallelizations
in the TOP-TGS system. Finally, Chapter 7 presents conclusions and future work. The
future work section includes a discussion of how many of the heuristics presented in the
literature could be implemented in a topologically partitioned ATPG system.
Chapter 2
Background
This chapter presents the background material for the thesis. A brief presentation of
serial ATPG algorithms is included to familiarize the reader with the ATPG problem. Next
a discussion of the techniques available to parallelize ATPG is presented. Finally, the
distributed memory multicomputer, ES-KIT, and the parallel test generation system, ES-
TGS, that this work was based upon are presented.
2.1 Serial ATPG
Most parallel ATPG algorithms, including the ones to be presented here, are based
upon widely known serial ATPG algorithms. For a detailed discussion of ATPG
algorithms, the interested reader is referred to [19] and [20]. For this research, we will
consider only algorithms designed to generate tests for single stuck-at faults. These are
physical faults that cause a node in the circuit to behave as if it were stuck at a logic 0 or a
logic 1 level. The single stuck-at fault model is a simplification of the types of faults found
in real circuits, but empirical evidence shows that for most common implementation
technologies, it provides very high coverage of physical faults [21].
Automatic Test Pattern Generation can be thought of as the process of searching
through the entire space of possible input patterns for a circuit in an attempt to find one
which causes the output to differ depending on whether or not a circuit contains a specific
fault. The size of the search space is 2^n, where n is the number of inputs to the circuit.
Because the search space is so large, many techniques have been developed to guide the
search process. Most of the search techniques in popular use today fall into the class of
algorithms called path runners. Path runners attempt to detect a fault by sensitizing it, and
then sensitizing a path between the faulty node and a primary output. Sensitizing a fault
consists of setting the value on the faulty node opposite to the stuck-at value, i.e., setting a logic
‘1’ on a node being tested for a stuck-at ‘0’ fault. Sensitizing a path consists of setting logic
values along a path in the circuit from the faulty node to the primary outputs such that a
change of the logic value on the node is observable at the primary output. For example, if
an AND gate is in the sensitized path, setting all of its inputs not in the path to a logic ‘1’
will result in its output value following the value of the input in the path. In order for an
algorithm to be complete, it must search all paths and combinations of paths in the circuit
from the faulty node to the primary outputs.
The major difference between the path sensitization algorithms presented in
[2],[3],[4] is the method and order by which the actual logic values are assigned to nodes
in the circuit. The D-algorithm [2], attempts to set logic values on nodes in the circuit by
assigning values to nodes which precede them in the circuit topology. The PODEM [3]
algorithm attempts to assign values to nodes in the circuit by assigning values to the
circuit’s primary inputs only. Because typical circuits have fewer inputs than internal
nodes, frequently by orders of magnitude, the search space enumerated by PODEM is much
smaller than the D-algorithm. Because this reduction of the search space makes PODEM
much more efficient than the D-algorithm, PODEM is the basis for most follow-on
algorithms that have been developed for ATPG [5],[6],[7],[8],[15]. For this reason,
PODEM was also selected as the base algorithm for this work whereas the D-algorithm
served as the basis for previous work in topological partitioning [17],[18]. A brief example
of the PODEM algorithm will now be presented to familiarize the reader with this
technique.
Figure 2.1 contains a diagram of the circuit that will be used for the purposes of this
discussion. The PODEM algorithm consists of three major processes: selection of the next
objective, backtracing that objective to an unassigned primary input to determine the value
that should be assigned to it, and assigning that value to the input and implying all node
values in the circuit affected by that assignment. This latter simulation-like process is called
forward implication.
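
The interaction of these three processes can be summarized in the following sketch. All type and function names here are hypothetical, chosen only to mirror the description above; the helper routines (imply, backtrack) are sketched later in this section:

    // Illustrative sketch of the PODEM search loop (hypothetical names).
    // A test is found when the assignments on 'inputs' sensitize the fault
    // and propagate a D or D-bar to a primary output.
    bool podem(Circuit& ckt, const Fault& fault, std::stack<Assignment>& inputs) {
        while (!test_found(ckt, fault)) {
            if (test_impossible(ckt, fault)) {      // fault node clobbered, or
                if (!backtrack(ckt, inputs))        // D-frontier has disappeared
                    return false;                   // search space exhausted
                continue;
            }
            Objective obj = select_objective(ckt, fault); // sensitize the fault,
                                                          // then extend the D-frontier
            Assignment a = backtrace(ckt, obj);  // map the objective to an
            inputs.push(a);                      // unassigned primary input
            imply(ckt, a);                       // forward implication of the input
        }
        return true;   // test vector: assignments on the stack, others don't-care
    }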
For example in Figure 2.1, consider the fault of line J stuck-at a logical ‘1’. The first
objective selected might be to sensitize the fault by setting node J to ‘0’. Since both of the
inputs to the OR gate which drives J need to be set to ‘0’ in order to set J to a ‘0’, setting
one of them to this value becomes the next objective. The next step would be to backtrace
this objective to a primary input. Figure 2.2 illustrates this process. Assume for the sake of
discussion that line G is chosen to be set first. Typically, testability measures such as
controllability and observability measures are used to assist in making these types of
decisions. In order to set line G to a ‘0’, either of the inputs to the AND gate that drives it
must be set to ‘0’. Input A is selected and it along with its value are pushed on a stack that
is used to hold the input search space.
The next step is to actually assign a value of ‘0’ to input A and simulate the effect
of this assignment on the rest of the circuit. This process is called forward implication.
Assigning a ‘0’ to A will cause the AND gate to drive node G to a ‘0’. No other nodes in
the circuit will be affected by this assignment.
Since the objective of setting node J to a ‘0’ has not been satisfied, it will be
backtraced again. This backtrace will determine that node I must be set to ‘0’ and this must
be accomplished by setting input E to ‘0’ and node H to ‘0’. Node H may be set to ‘0’ by
setting input C to ‘0’. This assignment is pushed on the stack and implication is performed.
Next, input E is pushed on the stack with its value and forward implication is done. This
implication will result in node J taking on the required ‘0’ value. This ‘0’ value represents
the value node J would have in the fault-free circuit, but node J will assume a value of ‘1’
Figure 2.1 Circuit under test. [Schematic: primary inputs A through F, internal nodes G, H, I, J, K, primary output L; node J is marked s-a-1.]
in the presence of a stuck-at ‘1’ fault. This set of values is represented using the “D”
notation of [2]. A node with a value of ‘D’ represents a ‘1’ in the good circuit and a ‘0’ in
the faulty circuit. A node with a value of ‘D̄’ (“D-bar”) represents a value of ‘0’ in the good
circuit and a ‘1’ in the faulty circuit. Since node J is ‘0’ in the good circuit and ‘1’ in the faulty, its
value will be represented by a ‘D̄’. Figure 2.3 shows the state of the circuit and the input
stack after the assignments A = ‘0’, C = ‘0’, and E = ‘0’.
The final step in generating a test for node J stuck-at ‘1’ is to make the value on node
Figure 2.2 Backtracing objective J = 0. [Schematic: the first objective sets line J to ‘0’; it is backtraced to line G set to ‘0’, and then to line A set to ‘0’.]
Figure 2.3 Circuit state and input stack. [Schematic: A = ‘0’, C = ‘0’, E = ‘0’ on the input stack; G = H = I = ‘0’ and J = D̄.]
J visible on the primary output. This process is called propagation and it involves
sensitizing a path from the node to the output. This path is sensitized by setting all inputs
to gates on the path to their non-controlling values. The non-controlling values for an AND
or NAND gate is ‘1’ and for an OR or NOR gate is ‘0’. XOR and XNOR gates do not have
a controlling value, so either a ‘0’ or ‘1’ may work. All gates which have a ‘D’ or ‘D’ on
one of their inputs and an unknown, ‘X’, on their outputs are on potential paths. These gates
constitute what is known as the D-frontier. In the example circuit, the gate OR gate that
drive node L is on the D-frontier at this point. Node K must be set to the non-controlling
‘0’ value so that the value on the output L follows the value of node J. This task may be
accomplished by setting input F to a ‘0’. The final circuit state and input stack are illustrated
in Figure 2.4.
A test vector is generated by simply removing all input assignments from the
stack. All inputs not in the stack are assigned don’t-cares in the test vector. In the example
circuit, the vector “0X0X00” would be the test vector for fault J stuck-at ‘1’.
If, during the assignments of input values, an assignment is made that makes a test
no longer possible, the last input assignment is popped off of the stack and the alternate
value is tried. This process is called backtracking and it continues until a circuit state where
Figure 2.4 Circuit state and input stack. [Schematic: F = ‘0’ added, giving K = ‘0’ and L = D̄; the input stack A, C, E, F (all ‘0’) now constitutes the test.]
a test is again possible is reached or the stack becomes empty. Situations that may cause a
test to be impossible include setting the faulty node to the same value as the stuck-at fault,
and the disappearance of the D-frontier.
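
Continuing the hypothetical interface of the earlier sketches, the backtracking step might look like the following; each stack entry is assumed to remember whether its alternate value has been tried:

    // Illustrative backtracking step (hypothetical names).
    bool backtrack(Circuit& ckt, std::stack<Assignment>& inputs) {
        while (!inputs.empty()) {
            Assignment a = inputs.top();
            inputs.pop();
            unimply(ckt, a);                // return the affected nodes to 'X'
            if (!a.alternate_tried) {
                a.value = invert(a.value);  // try the opposite logic value
                a.alternate_tried = true;
                inputs.push(a);
                imply(ckt, a);
                return true;                // resume the search from this state
            }
        }
        return false;                       // stack empty: no test exists
    }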
For example, consider the circuit of Figure 2.1 with a stuck-at ‘0’ fault on node K.
The first objective would be to set node K to a ‘1’. The first backtrace could lead to the
assignment of a logical ‘0’ on input F. Then in order to set node K to a ‘1’, node I must be
assigned a ‘1’. This task may be accomplished by assigning E=‘1’. However, when node I
is set to ‘1’, node J becomes ‘1’ and node L is forced to a ‘1’ regardless of the value on node
K. This assignment makes a test impossible with this input vector because the fault cannot
be observed at the primary output. Figure 2.5 illustrates this situation.
Backtracking will occur at this point and the alternate assignment of node E=’0’
will be tried. This assignment will also cause a test to be impossible because node K will
then be set to ‘0’, the same value as the stuck-at fault. Since assignments of both logic
values to node E did not result in a test, backtracking to the previous assignment must
occur. Thus, node F is now assigned a value of ‘1’. A test can now be found by assigning
a ‘0’ value to nodes E, C, and A as shown in Figure 2.6.
Backtracking produces an ordered search of the solution space and implicitly prunes
the search tree when inconsistent states are encountered, such as
Figure 2.5 Circuit state and input stack. [Schematic: with F = ‘0’ and E = ‘1’ on the stack, I = J = ‘1’ forces L to ‘1’ regardless of K = D; the fault cannot be observed, so no test is possible.]
assigning a value of ‘0’ to input F in the previous example. By checking the two alternate
assignments of input E and determining that they are both inconsistent, the entire portion
of the search tree below F=‘0’ may be pruned.
2.2 Parallel ATPG
This section provides a brief discussion of the methods that have been used to
parallelize the ATPG process. For a more detailed presentation of parallel ATPG
techniques, the reader is referred to [14]. These techniques can be divided up into 5
categories [13],[14]:
1) Fault Partitioning
2) Heuristic Parallelization
3) Search Space Partitioning
4) Functional (Algorithmic) Partitioning
5) Topological Partitioning
The simplest way to parallelize the ATPG problem is to divide up the fault list among
multiple processors. Each processor then generates tests for each fault on its portion of the
fault list until all faults have been detected. This scheme results in each processor having a
completely separate task in that it performs the entire test generation procedure on its own.
This method of parallelization has been termed fault partitioning. If the fault list is divided
up carefully, each processor will have roughly the same amount of work to do and they will
all finish in about the same time. In practice, optimal partitioning of the fault list is not easy
Figure 2.6 Circuit state and input stack. [Schematic: after backtracking to F = ‘1’ and assigning E, C, A = ‘0’, the circuit has K = D and L = D, completing the test; the earlier F = ‘0’ branch is marked "test not possible".]
to do a priori, so the scheduling can be done dynamically with each processor requesting a
new fault from a master scheduler whenever it is idle. Dynamic scheduling requires
increased communications overhead due to the requests from idle processors for new faults
to process. The fault partitioning method is very suitable for coarse grained parallel systems
because synchronization is only necessary when a new fault is needed from the remaining
fault list.
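
The dynamic form of the scheme can be pictured with the following sketch. The master/worker split is as described above, but the names are invented and, for brevity, the master is shown with shared memory primitives rather than the remote method invocations a distributed memory machine would actually use:

    // Illustrative dynamic fault scheduling for fault partitioning.
    class FaultScheduler {
        std::deque<Fault> faults;    // the remaining fault list
        std::mutex lock;             // stands in for message-based arbitration
    public:
        explicit FaultScheduler(std::deque<Fault> f) : faults(std::move(f)) {}
        // Called by a worker whenever it becomes idle; false when no faults remain.
        bool next_fault(Fault& out) {
            std::lock_guard<std::mutex> guard(lock);
            if (faults.empty())
                return false;
            out = faults.front();
            faults.pop_front();
            return true;
        }
    };

    // Each worker holds the entire circuit and runs serial ATPG on its faults.
    void worker(FaultScheduler& master, Circuit& ckt) {
        Fault f;
        while (master.next_fault(f)) {   // one synchronization per fault
            std::stack<Assignment> inputs;
            podem(ckt, f, inputs);
        }
    }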
The biggest disadvantage of fault partitioning is that the setup time will be large.
The entire ATPG program and circuit database must be loaded into each processor’s
memory across the message fabric. If the total amount of work that can be divided up
among the processors is large (i.e. the fault list is long) then the percentage of time spent
on setup can be kept small and this scheme has promise. If the circuit has a small number
of faults, or fault classes, then the speedup will be limited, as the discussion above suggests.
In any case, this method does not scale well because of the large setup time. Also,
performance of this method is poor if there are only a few hard-to-detect faults which
account for most of the processing time. Because processors cannot cooperate in generating
a test for the same fault, one or two processors could take hours to generate a test for these
hard-to-detect faults while the others stand idle. Many typical circuits have only a few hard-
to-detect faults and fall into this category. Results for systems which use this technique
show that linear speedup is possible only for a small number of processors, usually less than
ten [22],[23]. Clearly, this method of parallelization is less than optimum although it has
the benefit of being the simplest to implement.
Because ATPG is an NP-complete problem [5], heuristics are used to guide the
search process. Research has indicated that many heuristics will produce a test for a given
fault within some computation time limit when other heuristics have failed to do so [24].
These complementary heuristics can be used in a multiprocessor system to aid in the ATPG
process. There are two basic strategies for heuristic parallelization: a variation of the fault
partitioning scheme discussed above, and concurrent parallel heuristics [25].
In the variation of the fault partitioning method, called uniform partitioning, the
fault list is divided up among the processors and each generates tests for the faults on its
own portion of the list. In generating the tests, however, multiple heuristics are used in
sequential order to attempt to generate a test. If a heuristic fails to generate a test within a
time limit, that heuristic is discontinued and the next one in the list is begun. This scheme
has the same advantages and disadvantages as the fault partitioning scheme discussed
above. However, it will be slightly better in some cases because the multiple heuristics will
shorten the test generation time for hard-to-detect faults.
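
The per-fault heuristic sequencing of uniform partitioning can be sketched as follows; the names and the time-limit mechanism are assumptions of the sketch:

    // Illustrative uniform-partitioning inner loop: try each heuristic in
    // turn, abandoning an attempt when its time allotment expires.
    bool generate_test(Circuit& ckt, const Fault& f,
                       const std::vector<Heuristic>& heuristics,
                       std::chrono::seconds limit_per_heuristic) {
        for (const Heuristic& h : heuristics) {
            if (podem_with_heuristic(ckt, f, h, limit_per_heuristic))
                return true;          // this heuristic found a test in time
        }
        return false;                 // every heuristic timed out on this fault
    }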
In the concurrent parallel heuristic method, the system is required to have (m x n)
processors, where n is the number of different heuristics available. If m is equal to one, each
processor computes a test for the same fault using one of the n heuristics. Whenever a
processor succeeds in generating a test for the fault, it sends a “stopwork” message to the
other processors in the cluster and they stop processing that fault. A new fault is selected
from the fault list and the process begins again. If m is greater than one, the processors are
clustered into groups of n and each cluster works on a separate fault. In this case, the system
is actually using a combination of the fault partitioning and heuristic parallelization
schemes. The concurrent parallel heuristic method has the potential to achieve greater
speedups than the uniform partitioning method due to possible anomalies in the ordering of
the heuristics for different faults.
The main disadvantage of the heuristic techniques discussed previously is that the
processors that are working on the same fault with a different heuristic are not guaranteed
to be searching disjoint portions of the search space. That is, all of the heuristics may lead
the ATPG program down the same path towards a non-solution.
A better way to parallelize work on a single fault is to divide up the search space
into disjoint pieces and evaluate them concurrently. This approach is a parallel
implementation of the branch and bound method which involves concurrent evaluation of
subproblems [26],[27]. This technique is called OR parallelism and its application to ATPG
is presented in detail in [28],[29]. Search space partitioning involves dividing up the search
space such that subproblems skipped by one processor are evaluated by another. The search
spaces for the processors are therefore disjoint and are spread across the solution space as
far as possible to maximize the area of the current search. This organization increases the
chances of finding a valid solution quickly.
The process of dividing up a search tree is illustrated in Figure 2.7. The search space
belonging to processor X is divided up into 2 parts for processors X and Y. Notice that the
processors are in fact always working on different problems (i.e. disjoint search spaces) and
that the place where each processor will backtrack to is different. If processor X finds a
conflict, it will backtrack and try an alternate value for input A. Processor Y will backtrack
and try an alternate value for input C in case of a conflict. This approach keeps the current
search space as large as possible which tends to make the search more efficient.
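
One way to realize this division is sketched below: the stack of input assignments is copied to the idle processor with the shallowest untried alternate flipped, so that the two processors backtrack at different points, as in Figure 2.7. Representing the stack as a vector, and all names, are assumptions of the sketch:

    // Illustrative OR-parallel division of the current search space.
    std::vector<Assignment> split_for_idle_processor(std::vector<Assignment>& mine) {
        for (std::size_t i = 0; i < mine.size(); ++i) {
            if (!mine[i].alternate_tried) {
                // Copy the assignments down to the split point, flipping the
                // branch that is being given away.
                std::vector<Assignment> theirs(mine.begin(), mine.begin() + i + 1);
                theirs[i].value = invert(theirs[i].value);
                theirs[i].alternate_tried = true;  // the idle processor backtracks
                                                   // only above this point
                mine[i].alternate_tried = true;    // and this processor never
                return theirs;                     // revisits the branch it gave away
            }
        }
        return {};    // no untried alternates remain: nothing to split
    }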
A major problem with search space partitioning is that it also requires a long setup
time. Each processor must have the entire circuit database and ATPG program loaded into
it. On the other hand, processors are dedicated to only one task which does not change and
the tasks are completely independent. This fact makes the overhead due to communications
Figure 2.7 Division of search tree. [Schematic: the search tree of processor X is divided in two; after the split, processor X backtracks at input A while processor Y backtracks at input C.]
very low and results in greater efficiency. Search space division is therefore most
appropriate for circuits that contain a small number of hard-to-detect faults which take up
a great deal of computation time. It is also ideally suited to message passing systems
because of its coarse grained parallelism.
There is another technique that can be used to allow more than one processor to
work simultaneously on finding a test for a single fault. This technique is called functional
partitioning. Functional partitioning refers to the process of dividing up an algorithm into
independent subtasks. These independent subtasks can then be executed on separate
processors in parallel. This method of parallelization is also known as algorithmic or AND
parallelism.
Most serial ATPG algorithms are difficult to parallelize functionally. The few
subtasks that can be identified, such as fault sensitization and path sensitization are not
independent. That is, action taken to perform one of these processes may change the circuit
state such that it has a side effect or causes an inconsistency in another process. Justification
of two goals cannot, in general, be done simultaneously. One way to allow parallelism in
justification is to perform justification for goals in different faults simultaneously. This
parallelism is an adaptation of the fault partitioning scheme already discussed.
In all of the parallel algorithms discussed thus far, each processor has to have access
to the entire circuit database. This requirement may be a problem for large circuits because
each node may not have enough memory to hold the entire circuit database. Also, loading
the database into memory in a message passing system takes time. Topological partitioning
of the circuit into separate partitions and instantiating each on a different processor would
help alleviate this problem.
Researchers have been investigating topological partitioning for parallel logic
simulation for some time. Although logic simulation is a different problem, it has some
similarities to algorithmic ATPG. A discussion of circuit partitioning for parallel logic
simulation is included in [30]. The objective of the partitioning scheme is to reduce the
communications necessary between partitions as much as possible while maximizing the
amount of work that can be done concurrently within the partitions. This paper analyzes 6
different partitioning schemes: random partitioning, natural partitioning, partitioning by
gate level, partitioning by element strings, and partitioning by fanin and fanout cones. Fanin
cones are an attempt to place all gates connected to a single primary input (even through
other gates) in the same group. Fanout cones are constructed the same way using primary
outputs.
The results presented in [30] indicate that for simulation, random partitioning
scores the best in maximizing concurrency, but worst in interprocessor communications.
This condition would make random partitioning a bad choice for most systems. Partitioning
by fanin and fanout cones offers the best trade off between concurrency and interprocessor
communications with fanout cones being slightly better. This result is most likely due to
the fact that fanout cones closely fit the flow of activity in the circuit during logic
simulation. An analysis of circuit partitioning techniques for ATPG is the focus of section
3.3 of this work.
Another issue in circuit partitioning for ATPG is the number of gates in each
partition: the so-called block size. As the number of gates assigned to a block decreases, the
amount of work that can be done between communications steps becomes smaller. Hence,
the parallelism becomes more fine grained. The minimum block size will also affect how
the problem scales with increasing numbers of processors. As more processors are added
to the system, the block size will get smaller and efficiency will decrease.
An investigation of the amount of parallelism theoretically available in
topologically partitioned parallel ATPG was undertaken in [31]. This work attempted to
find an upper bound on the amount of parallelism present in conflict-free test generation.
Two phases of the test generation process using the PODEM algorithm, backtracing and
forward implication, were parallelized. Each gate was assumed to be placed on its own
individual processor. The objective then was to measure the maximum number of
operations that could be performed in parallel by the individual gate processors.
The methods that were used to accomplish this objective are best illustrated using an
example.
Consider the example circuit of Figure 2.1 again. In setting the objective of J=‘0’,
there are several paths that must be backtraced such as the J->G->A path and the J->I->
H->C path. If these paths could be backtraced at the same time, then parallelism would be
present. If each backtrace operation is assumed to take place in the same amount of time,
then the backtraces will propagate through the circuit on a level by level basis. At each gate,
backtraces are generated on each input as required. Using this method, conflicts may arise
at points of reconvergence of fanout. This point is where the authors use the conflict-free
assumption. The correct values to be placed on reconvergence points are precomputed off-
line and the conflict is avoided. Figure 2.8 illustrates the process of parallel backtracing for
the objective J=‘0’.
Notice that in this case, the maximum number of backtrace operations that occur
during the same period of time, or “time-step”, is 2. This measure would be the maximum
amount of parallelism available in this step of the ATPG process. Also note that the
objective values required on the individual lines are not assigned to them during the
Figure 2.8 Multiple parallel backtrace. [Schematic: backtraces for the objective J = ‘0’ propagate level by level through three time-steps, with at most two backtrace operations per time-step; the objective values A = C = E = ‘0’ are shown but not yet implied.]
backtrace procedure. These values must be set through forward implication as is done in
the serial PODEM case. The difference is that in this parallel implementation, the
implication procedure is parallelized. Thus in the case shown in Figure 2.8, the values
A = ‘0’, C = ‘0’, and E = ‘0’ would be implied at the same time. Implications would then be
performed back through the circuit in parallel in a manner similar to backtraces. Analysis
of Figure 2.8 shows that the maximum parallelism present during parallel implication of
the above input assignments would be 2 as well.
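
The level-by-level measurement used in [31] can be paraphrased as the following sketch; the circuit interface is hypothetical, and reconvergent fanout is ignored in keeping with the conflict-free assumption:

    // Illustrative measurement of backtrace parallelism, level by level.
    // Each element of 'frontier' is one pending backtrace operation; the
    // widest level reached is the maximum parallelism for this objective.
    int max_backtrace_parallelism(Circuit& ckt, Objective start) {
        std::vector<Objective> frontier{start};
        int max_width = 0;
        while (!frontier.empty()) {
            if (static_cast<int>(frontier.size()) > max_width)
                max_width = static_cast<int>(frontier.size());
            std::vector<Objective> next_level;
            for (const Objective& o : frontier)
                for (const Objective& in : required_input_objectives(ckt, o))
                    next_level.push_back(in);   // all of these backtraces occur
            frontier = std::move(next_level);   // in the same "time-step"
        }
        return max_width;
    }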
The authors of [31] use this technique to analyze the maximum and average amount
of parallelism present in the ISCAS ‘85 [16] benchmark circuits. The analysis was done on
a conventional workstation using a simulation based technique. The average amount of
parallelism they found was less than they expected. For forward implication, most circuits
had an average parallelism of 4 to 7, although some circuits had values higher than this. For
backtracing, most circuits had average parallelism values of 1.5 to 3.5.
This method of analysis of the parallelism present in topologically partitioned ATPG
has several drawbacks which do not allow a valid conclusion to be drawn concerning the
performance of topological partitioning. First, the assumption of one gate per processor is
unrealistic. Second, as shown in this thesis, there are other methods of parallelism available
for use with topological partitioning. Finally, the work in [31] completely ignores the
practical aspects, such as synchronization protocols and communications latency, of
implementing this type of system on an actual multicomputer. The authors do acknowledge
that this technique may have benefit when used with other parallelization methods and that
it has the important characteristic of allowing larger circuits to be processed on a given
distributed memory multicomputer.
2.3 Hardware Considerations
This research utilized a distributed memory MIMD machine known as the ES-KIT
88K. It is assumed that the reader is familiar with the typical characteristics, such as
message passing and connectivity, of parallel machines of this type. Only a brief discussion
of the effect of the characteristics of the machine on the programming of the application
will be undertaken in this section. This section will be followed by a discussion of the actual
hardware used in this research. Finally, the software system that formed the basis for this
work will be presented.
Distributed memory machines have local memory for each processor but no
globally accessible memory. Processors must send messages across some interconnection
medium, also called a message fabric, to share data. It may take hundreds or even thousands
of instructions to package a message for transmission so communications costs are much
higher than for shared memory. Also, message transfer time depends on the “distance”
between communicating processors. Distance between processors is a measure of the
length of the communications channel and the number of other processors which must pass
the message along for it to be transferred. There are a number of interconnection strategies
used on message passing systems [32]. Each one involves a trade-off of distance between
processors and the number of connections per processor.
Because communication time is distance dependent, data location in message
passing systems is as critical as, if not more critical than, in shared memory systems.
Determination of which processors perform certain tasks is much more important in
distributed memory systems than in shared memory systems. Processes that must
communicate frequently must be instantiated on processors that are “close” to each other.
Therefore, algorithms must be designed for the specific communications topology of the
target machine. Algorithms designed for one machine may not perform satisfactorily on
another [33]. Programs on a message passing system will, in general, use built-in system
calls to send and receive messages. Data must be explicitly moved from one processor to
another using the send and receive mechanism. Synchronization between processors must
also take place using messages and is therefore more time consuming than in shared
memory systems. For this reason, algorithms for message passing machines must use more
coarse grained parallelism. Coarse grained parallelism implies that many instructions must
be processed between synchronization events. Setup time is much longer on message
passing systems because all of the program code and data, such as the circuit topology
information, must be loaded across the message fabric. New processes are harder to spawn
for this same reason. Therefore, setting up one processor as a master is more difficult.
Proper load balancing among the processors is also harder to achieve. In general,
algorithms for message passing systems are more difficult to design well, but the programs
are themselves easier to implement and debug because data consistency is more easily
maintained [33].
2.3.1 Experimental Systems Kit
The parallel processing machine available for use in this research is the
Experimental Systems-KIT (ES-KIT) 88K processor developed by the Microelectronics
and Computer Technology Corporation (MCC). The ES-KIT was developed by MCC
under a Defense Advanced Research Projects Agency (DARPA) grant to facilitate
experimentation into new parallel computer architectures and application specific
computing nodes. The ES-KIT system includes the 88K processor, described below, and
the ESP runtime system, described in the next section. The description of the ES-KIT
system is brief and limited to the characteristics which influence applications design. A
more detailed description of the ES-KIT system can be found in [34]. Further, the ESP
system and ES-KIT applications are implemented in the C++ language [35]. It is assumed
that the reader is knowledgeable in this language.
2.3.1.1 ES-KIT Hardware
The 88K processor is a distributed memory parallel architecture based on a 16 node,
4X4 2 dimensional mesh. A Sun 3/140 running the BSD 4.1 operating system acts as a host
for the 88K hardware. The Sun communicates with the 88K processor through a VME bus
interface board. The message fabric of the 88K processor is based on Symult Systems'
Asynchronous Message Routing Device (AMRD). Each node in the mesh has its own
AMRD. The use of the AMRDs eliminates the need for store-and-forward message passing.
The use of AMRDs means that the processor on each node is not involved in passing
messages that are not addressed directly to it. The message fabric is in general capable of
passing messages at a rate of 20 MB per second, but the software overhead of packing and
unpacking messages at either end prevents this rate from being achieved.
Each node is a general purpose computing system based on Motorola's 88000
processor family. The nodes consist of four boards which communicate with each other
across an internal bus based on the 88K standard. The four boards consist of the Message
Interface Board, the Processor Board, the Memory Module, and the Bus Terminator
Module. The boards are connected through a unique set of stacking connectors which allow
the nodes to be built on top of each other. A typical installation consists of two nodes
stacked on top of each other. The node stacks are arranged on top of Mother Board modules
that provide power and ground connections and contain the AMRDs. A 16 node
configuration consists of four Mother Boards mounted in a plane, each one containing two
stacks with two nodes per stack.
The processor board consists of one 20 MHz Motorola 88100 RISC
microprocessor, and two 88200 cache modules. The two 88200's provide separate paths for
instructions and data. The Memory Module provides 8MB of dynamic RAM. It is possible
to have up to four Memory Modules per node for a total of 32MB of memory, but the
standard configuration is only 8MB. The Message Interface Module consists of one 88100
processor, 128 KB of data memory, 128 KB of instruction memory, and interface logic to
provide a path from memory to the node's AMRD. The 88100 processor on the MIM takes
care of all processing necessary to package a message and send it out through the AMRD.
The MIM processor off-loads a significant amount of message processing from the 88100
in the Processor Module. Finally, the Bus Terminator Module provides the electrical
termination for the high speed lines in the 88K bus, general purpose services such as the
system clock, and a UART interface to the outside world for debugging and repair of the
node hardware.
2.3.1.2 ESP Runtime System
The ESP (Extensible Software System) run-time environment is as important and
complex as the 88K processor. The environment is written in C++ and is intended to
maximize flexibility in the types of configurations that can be used in a parallel processing
system. The environment consists of four major components: the ISSD, the mail daemon,
the shadow process, and the actual ESP kernels.
The Inter Service Support Daemon (ISSD) is the heart of the system in that it is the
first process invoked by the user and it constructs the rest of the run-time environment. The
ISSD is the major interface between the ESP environment and the outside world. The ISSD
controls the starting and terminating of applications programs, and all communication with
peripheral devices including screen and disk IO. The ISSD begins by reading the
configuration file to determine what ESP components are to be invoked. The configuration
file is created by the user and contains instructions as to how many of the various
components are to be constructed, where they are to run, and how they are connected. The
minimum configuration file must contain the invocation instructions for a single ISSD, one
mail daemon, and one or more ESP kernels. The ISSD also invokes the public service
objects (PSO's) such as the application manager and the kernel librarian which are
necessary to run any application. The ISSD always runs on the Sun host machine, but the
PSO's can run on any of the 88K nodes.
The mail daemon is responsible for routing messages between individual or groups
of ESP kernels. Each mail daemon is connected to all of the kernels in its group, every other
mail daemon in the configuration, and the ISSD. This connectivity allows messages to be
passed from kernel to kernel with the minimum handling possible. The configuration must
contain a minimum of one mail daemon for each type of ESP kernel in the configuration.
The shadow process runs on the host Sun where the ISSD is located. The shadow
process is responsible for reading the application source code files and managing the
terminal IO for the application. The shadow process is the next ESP component invoked by
the user after the ISSD.
Finally, the ESP kernel is the work horse of the ESP system in that it actually runs
the application. The kernel performs memory management, message packing and
unpacking, and task switching for the applications. The kernel runs on top of a rudimentary
OS in the 88K processor. The message passing portion of the kernel utilizes an MCC
developed protocol which utilizes the 88100 processor in the MIM on the 88K processor.
2.3.1.3 Object Oriented Programming in ESP
The ESP system uses the C++ object oriented paradigm as its abstraction for
parallel processing. Applications written for the 88K processor to run in ESP must be
programmed in C++. C++ incorporates the ideas of objects, data encapsulation, and
inheritance. C++ objects or classes are instantiated on different nodes and communicate
with each other through method invocation and return values. Each object has its own local
data contained within the node and that data can only be manipulated by method calls.
There is no global or 'public' data allowed in ESP. There were five major changes made to
the C++ language to implement the distributed processing environment of ESP. These
changes consisted of overloading the pointer-to-member function ‘->()’, redefining the
return values available for methods, overloading the 'new' function, eliminating the 'main'
routine, and incorporating the concept of futures.
All objects that are to have methods that are available for remote invocation must
be derived from the object remote_base. This object was developed by MCC and includes
several features necessary to implement remote method invocation, the first of which is a
handle. A handle is a pointer to an instance of an object and contains all of the information
needed to address an object in a distributed system. This information is contained in four
parts: a node number where the object actually resides, a class number for the object, an
application number and the actual instance number of the object. Handles can be passed
between objects or the address information can be passed and a new handle constructed to
point to that object.
The second feature included in remote_base is the overloaded pointer-to-member
function. In regular C++, the method invocation:
    object_instance->method(arg1, arg2, ...)

is implemented as a subroutine call. In ESP C++, the object may reside on a remote
node. Therefore, the method call must be invoked through message passing. Overloading
of the ‘->()’ function for remote methods handles this process. Methods for an object
derived from remote_base are defined to be remote by declaring them in the public section
of the object specification. When a method on a remote object is invoked, the kernel reads
the argument list to determine its length. It then copies the argument list into the message
buffer with the length of the argument list in bytes appended to it. Finally, the kernel uses
the handle of the object to instruct the MIM processor where to send the message. When
the receiving object receives the message, it invokes the proper method with the argument
list. If a value is to be returned by the method, one of the return macros defined in the ESP
programming environment must be used. The return macros instruct the kernel that the
return is to a remote object and that it must be packaged as a message. Macros are available
for returning most of the common data types such as integers, doubles, characters, and
strings. There is also a pointer return macro, but its functionality is different than in regular
C++. This macro is necessary because pointers on remote nodes are meaningless in the ESP
environment. If a pointer return is specified, the kernel packages the entire object to which
the pointer refers and sends it back to the node that invoked the method. The kernel on the
invoking node then copies the returned object into its memory space and returns a pointer
to this copy to the invoking object. In this way, any structure or object the size of which can
be determined at compile time can be returned from a remote method invocation.
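
As a concrete illustration, a remote object might be written as follows. The names remote_base and INT_RETURN are the ESP facilities described above; the class itself and its methods are invented for this example:

    // Illustrative ESP remote object (the class is hypothetical).
    class counter : public remote_base {
        int count;                 // local data: reachable only through methods
    public:
        counter() : count(0) {}
        void add(int by) {         // public methods may be invoked remotely
            count += by;           // via a message to this object's node
        }
        void get_count() {         // the return may cross nodes, so an ESP
            INT_RETURN(count);     // return macro must be used
        }
    };

Through a handle, an invocation such as counter_handle->add(5) looks like an ordinary C++ call; the overloaded ‘->()’ packages it as a message to the node on which the object resides.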
In C++, the new operator is used to allocate memory space for instances of objects.
In the ESP environment, the new operator is overloaded to allow arguments to be passed to
it. These arguments specify which node an object is to be instantiated upon. The syntax for
a call to the overloaded new function is:

    object_type* object_pointer = new{node, relationship} object_type();
The node, relationship pair is used to specify the location of the object. For example
if the variable homenode is specified to be (1,1), the call new{homenode,SAMEAS} will
create the object on node 1,1. Options for the relationship variable include: SAMEAS,
DIFFERENT, NEAR, FAR, and NEXT. The NEXT relationship does not need a node
specifier and it allows the kernel to select the node for the object using its own criteria. At
this point, the criterion used is the amount of memory left on each node. The object is created
on the node with the most free memory. Other algorithms that take into account load
balancing and communications costs are under development by MCC. Until they are
available, the user must be careful to take these factors into consideration and specify where
each large object is to be created in order to optimize the application.
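
A hypothetical use of the overloaded new operator, following the syntax above, might look like this; the object type, variable names, and the exact form of the NEXT call are invented for illustration:

    // Illustrative object placement with the overloaded 'new'.
    node homenode(1, 1);                              // assumed node-identifier type
    worker_t* w0 = new{homenode, SAMEAS} worker_t();  // on node (1,1) itself
    worker_t* w1 = new{homenode, NEAR} worker_t();    // on a node close to (1,1)
    worker_t* w2 = new{NEXT} worker_t();              // kernel picks the node with
                                                      // the most free memory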
In ESP C++, there is no 'main' routine. When the shadow program loads the first
object in the application, its constructor is invoked after it is loaded. This constructor must
do the work necessary to start the application. This may be as simple as calling another
routine within the same object to take over control, or as complex as creating all other
objects and directly performing the necessary algorithm. The former approach is
recommended as it is more 'correct' and it allows the kernel to complete construction of the
initial object and alter the stack size for that object if necessary.
When a method on a remote object is invoked, the processing takes place on the
remote node. The invoking method is then free to perform some other calculation if it does
not need the result of the remote method. If the invoking method does need the result of the
remote method, it must block until that result is returned. Controlling when the object
blocks for the return result is done by using futures. Futures were introduced as a part of
Multilisp [36], and allow lazy evaluation of return values. Note that the only parallel
processing that occurs is between the time that the remote method is invoked and the future
is evaluated. This fact demonstrates the value of the future abstraction for methods that
return values. Of course, remote methods that do not return values always run in parallel
with the invoking method.
One notable characteristic of objects in ESP is that, to ensure “correctness”, only one
method in a specific object may be invoked at a time. This includes methods that are
blocked waiting for a future or return value. Thus, if two objects each invoke a method in
the other and wait for the return value, deadlock is possible. This situation is illustrated in
Figure 2.9.
In order to avoid this deadlock situation, two objects that wish to pass values to each
other in an asynchronous fashion must use another method which involves more overhead.
This process, which is illustrated in Figure 2.10, involves sending a request to the other
object for the value and then having the other object send the value back using a second
method invocation. This method ensures deadlock free operation, but carries a significant
performance penalty as shown in the following section.
2.3.1.4 System Performance Characterization
This section presents a general overview of the measured performance
characteristics of the 88K processor. This data is necessary in order to analyze the
performance of the applications used in this research and explain some of the
implementation decisions that were made for the applications. A more detailed description
of the experiments used to gather this data and their results can be found in Appendix A of
this thesis.
The software overhead for message passing in the ES-KIT environment is
significant and reduces the maximum communications rate available on the 88K processor
to less than 1 MByte per second. Experiments have shown that additional message traffic
on the connections between nodes can reduce this rate by an order of magnitude. Other
Figure 2.9 Deadlock in synchronous communication among ESP objects.

    // Object 1
    object1::get_result2() {
        // some code
        result1 = object2->return_result2();
        // some more code
    }
    object1::return_result1() {
        // some code
        INT_RETURN(result1);
    }

    // Object 2
    object2::get_result1() {
        // some code
        result2 = object1->return_result1();
        // some more code
    }
    object2::return_result2() {
        // some code
        INT_RETURN(result2);
    }
experiments have been carried out to determine the rate of information transfer that could
be realized while reading or writing to the host file system. A maximum transfer rate of
approximately 50 Kbytes per second was achieved. Further, a maximum block size of only
32 Kbytes can be read at one time from a file. These statistics severely limit the file I/O
performance of the ES-KIT 88K processor and dictate some major compromises in
application design.
The most significant performance measurement for determining the scalability of
topological partitioned ATPG on a given machine is communications latency.
Measurements have indicated that using the communications method shown in Figure 2.9,
approximately 3200 “communications loops” are possible per second. This translates to a
one-way communications latency of 156 µs. However, if the deadlock free method of
Figure 2.10 is used, only an average of 2360 “communications loops” are possible per
second. This rate is equivalent to a one-way communications latency of 212 µs. Because
this communications scheme must be used to ensure deadlock free operation, this increase
in communications latency has an impact on application performance.
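
These latency figures follow directly from the measured loop rates: a loop is one round trip, so the one-way latency is half the loop period.

    // Deriving the one-way latencies quoted above:
    //   synchronous:  1 / 3200 loops/s = 312.5 us per loop -> ~156 us one way
    //   asynchronous: 1 / 2360 loops/s = 423.7 us per loop -> ~212 us one way
    double one_way_latency_us(double loops_per_second) {
        return 1.0e6 / loops_per_second / 2.0;
    }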
Figure 2.10 Deadlock free asynchronous communication among ESP objects.

    // Object 1
    object1::get_result2() {
        // some code
        object2->send_result2();
        // some more code
    }
    object1::send_result1() {
        object2->store_result1(result1);
    }
    object1::store_result2(int temp) {
        result1 = temp;
    }

    // Object 2
    object2::get_result1() {
        // some code
        object1->send_result1();
        // some more code
    }
    object2::send_result2() {
        object1->store_result2(result2);
    }
    object2::store_result1(int temp) {
        result2 = temp;
    }
2.4 Software Considerations
This section provides a brief description of the parallel test generation system that
was used as the basis for this work. A detailed discussion of the implementation and
performance is contained in Appendix B of this thesis.
2.4.1 The Test Generation System (TGS)
The test generation system used in this work is derived from the Test Generation
System developed at the University of Southern California. This system includes an
implementation of the PODEM [3] test generation algorithm and a critical path tracing fault
simulator [37]. The system was originally programmed in the PASCAL language. It was
converted to the C language by researchers at Mississippi State University and is currently
distributed in the Lager digital design toolset.
The first required modification to the TGS system was to modify the PODEM test
generator to handle exclusive-or (XOR) gates. The TGS implementation of PODEM was a
direct translation of the flow charts included in [3]. However, even though this paper
presents PODEM as an improved ATPG algorithm for circuits containing XOR gates, the
flow charts do not detail how XOR gates are to be handled during backtracing and objective
selection. Simple heuristics were devised and added to the TGS PODEM ATPG system to
handle XOR gates. After these modifications were implemented and tested, the modified
PODEM algorithm was returned to MSU for inclusion in future Lager toolset releases.
The next modification made to the TGS system was to convert it from the C
language to C++. This process involved encapsulating the various functions of the TGS
system into C++ objects. The PODEM ATPG algorithm was incorporated into a
test_generator object. The critical path tracing fault simulation algorithm was incorporated
into a fault_simulator object. The remainder of the “control” type functionality was
incorporated into a master object. A reader and a writer object were added to the system to
read in the circuit description and write out the test vector file respectively. These objects
were added to increase performance in light of the 88K processor’s IO performance as
discussed in the previous section.
2.4.2 The ES-KIT Test Generation System (ES-TGS)
Once the TGS system was converted to C++, it was ported to the ESP environment.
This port involved changing all communications between objects into ESP compatible
remote method invocations and the non-trivial task of debugging the system and adjusting
it to ESP’s many idiosyncrasies. Once this task was completed, the system was parallelized
using fault partitioning. Results of the parallelization effort including speedups achieved
for various versions on the ISCAS ‘85 circuits can be found in Appendix B. The PODEM
portion of the ESP compatible system was used as the starting point for the Topological
Partitioning Test Generation System detailed in the next chapter.
Chapter 3
Topologically Partitioned Test Pattern Generation
System
This chapter presents the Topologically Partitioned Test Pattern Generation System
(TOP-TGS) developed for the 88K processor. The algorithms used to generate the
topological partitions are presented first. This section is followed by a discussion of the
implementation of the TOP-TGS system on the 88K processor. Finally, the results section
presents a comparison of the performance of the TOP-TGS system on circuits partitioned
using the various partitioning algorithms described herein.
3.1 Circuit Partitioning
This section describes the complete circuit partitioning system. This system
consists of various algorithms for generating “blocks” which are then combined into
partitions.
The generation of partitions begins with a circuit netlist which is placed in the
proper “PODEM” format. The PODEM format is one of the intermediate formats used in
the TGS system. It consists of a simple ASCII file with one line per gate where each line
contains the gate name, gate type, the number of inputs and outputs to the gate, and a list
of the line numbers of the gates that are inputs to that gate. In addition, the first line in the
file contains the total number of gates in the circuit followed by the number of primary
inputs and primary outputs. In this format, a primary input is simply represented as a gate
of type ‘inpt’ with 0 inputs and a primary output is any gate with 0 outputs.
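For illustration, a small five-gate circuit might appear in this format as follows. This
example is invented to show the layout described above; the gate names, type mnemonics,
and field spacing are assumptions and do not come from an actual TGS file:

    5 2 1
    A    inpt  0 1
    B    inpt  0 1
    G1   nand  2 2   1 2
    G2   not   1 1   3
    G3   and   2 0   3 4

Here the first line declares 5 gates, 2 primary inputs, and 1 primary output. Gates A and B
are primary inputs (type ‘inpt’ with 0 inputs), and G3 is a primary output because it has 0
outputs; its inputs are the gates on lines 3 and 4.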
The next step in the partition generation process consists of performing testability
analysis on the circuit. Testability analysis is a process which assigns controllability and
observability measures to each node in the circuit. These values are used to aid the ATPG
process. For this system, the SCOAP [38] testability analysis algorithm was chosen.
Comparison of the performance of SCOAP versus other testability measures performed
early in the research indicated that it outperformed the other measures for a majority of the
ISCAS ‘85 [16] circuits.
Once the testability measures are computed, the circuit netlist is ready for the
partitioning program. As stated previously, this program first divides the circuit up into
blocks and then combines these blocks into partitions. Finally, the circuit database is written
back out to a file in the binary format used by the TOP-TGS system with the testability and
partition information included.
3.1.1 Partitioning System Goals
The two requirements for an effective topological partitioning scheme for ATPG
are reduction of the amount of memory used on each node and limiting of the
communications overhead between the processors. Typically there is a strong interaction
between these two factors, and reducing one can cause an increase in the other.
Reducing the memory requirements on each node is more complex than simply
dividing the gates up evenly among the available processors. In most current ATPG
algorithms there are occasions where information about the characteristics or state of gates
in other partitions is required. Duplicating this information in both partitions is sometimes
more beneficial than sending messages between partitions each time it is required. For
example, in PODEM, when backtracing an objective value to the boundary of a partition,
it is necessary to know the controllabilities and output states of the gates driving the current
objective gate. This information is used to select the next objective node and value. If this
information is not duplicated in both partitions, several messages will have to be passed to
gather this information before a decision can be made. This process is illustrated in Figure
3.1.
Similarly, the process of selecting a D-frontier gate in PODEM involves knowledge
of the observability and state of gates on the output of the gates in the D frontier. This
process is shown in Figure 3.2.
Figure 3.1 Backtracing across partition boundaries (messages are required to request and
receive the state and 0-controllability (ccy0) information of gates in other partitions, and to
pass the new objective to the partition that owns it).
Figure 3.2 Propagating D frontier across partition boundaries (messages are required to
request and receive the state and observability (coy) information of gates in other partitions,
and to pass the D frontier to the partition that owns it).
Gates at the interface between partitions can be duplicated in both partitions to
eliminate some of the messages necessary for backtracing and propagation. This
duplication increases the memory required on each partition, but the data structures for the
“extra” gates can be much smaller because only the state and the controllability and
observability information is needed. The controllability and observability information in
both partitions is initialized when the circuit database is loaded. One additional message for
each overlapped gate is required during the database initialization process. The state
information is updated in both partitions automatically during forward implication without
need for any additional messages.
Although duplicating the gate structures on both sides of a partition boundary will
reduce the communications overhead for crossing the boundary, the most effective way to
reduce message traffic is to reduce the number of times a partition boundary must be
crossed during ATPG. This reduction is accomplished by minimizing the number of ‘high
traffic’ connections between partitions. A high traffic connection is one which is traversed
a large number of times during ATPG. An example of a high traffic connection might be a
portion of a reconvergent fanout loop such as illustrated in Figure 3.3. In the PODEM
algorithm, the circuit database is traversed from inputs towards outputs and vice versa.
Therefore, the optimum partitions are most likely ones that cut the circuit in a longitudinal,
or input to output direction rather than ones that cut the circuit transversely across circuit
levels.
Figure 3.3 Placement of gates in partitions (moving gate G from partition 2 to partition 1
will probably reduce message traffic during ATPG).
3.1.2 Block Generation
Block generation is the process of dividing up the circuit into subpartitions or
“blocks” of related gates. Dividing up the circuit into blocks is done using one of four
algorithms: fanin cones, fanout cones, input paths, or output paths. The number of blocks
is determined by the circuit topology and the type of partitioning selected. For fanin cones
and input paths, the number of blocks is equal to the number of primary inputs in the circuit.
For fanout cones or output paths, the number of subpartitions is equal to the number of
primary outputs.
The algorithms used for fanin and fanout cone partitioning are presented in [30] and
will not be detailed here. The main characteristic of these algorithms to note is that the
affinity of a specific gate to a cone is increased as the subpartition, or block, increases in
size. This fact usually causes one block to grow in size rapidly and encompass all the gates
surrounding it. Once this block has reached its maximum size, the remaining gates must be
placed in other blocks. Thus one block will be localized to one area of the circuit, but the
remaining blocks may be scattered all over the remaining circuit area.
Partitioning by input paths is accomplished by determining the total number of
paths from each primary input to each gate. Primary inputs are then assigned one to each
block. Each gate is placed in the same block as the primary input to which it has the most
paths. If a block reaches its maximum size, no other gates may be assigned to it. The
maximum size of a block is set to be equal to the total number of gates in the circuit divided
by the total number of full partitions desired.
Partitioning by output paths is accomplished similarly using the primary outputs of
the circuit as starting points. Notice that in this algorithm, the affinity of a gate to a specific
block is determined before the actual block assignment starts and does not change. This fact
results in much more balanced blocks.
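The path-based assignment rule just described can be sketched as follows. This is an
illustrative reconstruction, not the actual partitioning program: the precomputed path-count
table and all names are assumptions.

    #include <vector>

    // Sketch of input-paths block assignment: each gate joins the block of the
    // primary input to which it has the most paths, unless that block is full.
    // paths[g][p] = number of paths from primary input p to gate g (precomputed).
    std::vector<int> assign_blocks(const std::vector<std::vector<int>>& paths,
                                   int max_block_size)
    {
        int num_gates  = static_cast<int>(paths.size());
        int num_inputs = num_gates ? static_cast<int>(paths[0].size()) : 0;
        std::vector<int> block_of_gate(num_gates, -1);
        std::vector<int> block_size(num_inputs, 0);   // one block per primary input

        for (int g = 0; g < num_gates; ++g) {
            int best = -1;
            for (int p = 0; p < num_inputs; ++p) {
                if (block_size[p] >= max_block_size) continue;   // block is full
                if (best < 0 || paths[g][p] > paths[g][best]) best = p;
            }
            if (best >= 0) { block_of_gate[g] = best; ++block_size[best]; }
        }
        return block_of_gate;   // output paths works the same way from the POs
    }

Partitioning by output paths uses the identical loop with a path count from each gate to each
primary output instead.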
3.1.3 Partition Generation
After block generation is completed, the next step in the partitioning process is to
combine the blocks into full partitions. This step is done using either a greedy algorithm or
a simulated annealing approach [39]. The cost vector includes a factor for the size of the
partitions, and the interconnections between partitions. The size factor is simply the sum of
the absolute value of the difference between the number of gates in each partition and the
balanced partition size. The interconnection factor is the total number of interconnections
between each pair of partitions. The size and interconnection factor are multiplied by
weight factors and then summed to determine the total cost. The weight factors can be set
by the user. For most circuits and most partitioning methods, the cost vector is simple
enough that the greedy algorithm generates a minimal solution. For others, the simulated
annealing results in a better solution. This result obviously depends on the shape of the
solution space; simulated annealing would probably perform better if the cost vector were
made even more complex by the addition of other factors.
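The cost computation itself is simple. The following sketch illustrates it under assumed
names and data layout (the actual TGS cost code is not shown in this chapter; the weights
are the user-set parameters mentioned above):

    #include <cstdlib>
    #include <vector>

    // Sketch of the partition cost: a weighted sum of (a) total deviation of
    // the partition sizes from the balanced size and (b) total interconnections
    // between each pair of partitions. Lower cost is better.
    double partition_cost(const std::vector<int>& part_size,
                          const std::vector<std::vector<int>>& interconnect,
                          int balanced_size, double w_size, double w_conn)
    {
        double size_factor = 0.0, conn_factor = 0.0;
        for (int s : part_size)
            size_factor += std::abs(s - balanced_size);      // size factor
        for (std::size_t i = 0; i < interconnect.size(); ++i)
            for (std::size_t j = i + 1; j < interconnect.size(); ++j)
                conn_factor += interconnect[i][j];           // interconnection factor
        return w_size * size_factor + w_conn * conn_factor;  // user-set weights
    }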
In addition to fanin-out cones and input-output paths, the partitioning system can
generate two other partition types that do not require generation and combining of blocks.
These methods were implemented simply for comparison with the other methods and
include random partitioning and partitioning by gate levels. Random partitioning consists
of placing gates in partitions according to the output of a random number generator. The
uniform distribution of the random numbers ensures that the number of gates in each
partition is close to being equal.
Partitioning the circuit by gate levels involves placing gates into partitions
depending on their level, or distance from a primary input. The size of each partition is
limited to the balanced partition size. This limit is strictly imposed and may result in
splitting a level. That is, all of the gates in a specific circuit level may not be placed in the
same partition.
3.2 TOP-TGS System Implementation
This section describes the scalar version of the topologically partitioned PODEM
ATPG system. This system was implemented to measure and compare the memory
utilization and message overhead of the ATPG process utilizing the various partitioning
algorithms. The next section describes the architecture of the system in terms of the C++
objects which comprise it, and the following section describes the operation of the system
during the various ATPG phases.
3.2.1 TOP-TGS System Architecture
The overall architecture of this system is shown in Figure 3.4. This system consists
of four types of objects: a master object, test_generator objects, and reader and writer objects. The
reader and writer objects perform the I/O functions for the TOP-TGS system. The reader
object reads in the circuit database and distributes the partitions to the test_generator
objects. The writer object is responsible for storing the test vectors resulting from ATPG
and keeping track of the fault coverage. The vectors are stored in the writer object because
the file I/O performance of the 88K processors is too slow to allow the vectors to be written
to a file as they are generated.
Figure 3.4 Topological PODEM ATPG system architecture (a master object, a reader and
a writer object, and multiple test generator objects).
The master object controls the operation of the test generator objects. This control
includes selection of the next fault for ATPG and initiation of the various ATPG phases
such as objective selection and implication. The test generator objects perform all of the
operations necessary to carry out these phases of the PODEM algorithm. There is no fault
simulation in this system and ATPG is performed on every fault in the circuit. This method
was used only for experimentation into the efficiency of the ATPG algorithm. A real
system which uses this algorithm would include fault simulation similar to the one found
in the ES-TGS system (see Appendix B).
3.2.2 TOP-TGS System Operation
The master object begins the ATPG process by setting up the fault list and selecting
the first fault for ATPG. It then sends this fault to all of the test_generator objects so that
they may determine which one among them has the faulty gate in their database. This task
is performed by calling the fault_transform(FAULTREC target_fault) method in each
test_generator with the specific target fault. Once it has been determined which
test_generator has the target fault, the master starts the test generation loop. This loop
consists of identifying the next objective, backtracing the objective to a primary input,
implying a value on that primary input, and determining the status of the objective. If the
objective has been satisfied, then the next objective is selected and the loop continues. If
the objective has not been satisfied, then it is backtraced to another primary input and a
value is implied there. If the objective value has been incorrectly set, then backtracking
must occur. As stated previously, the master object only controls the performance of these
tasks while the test generators actually perform them. The process by which each phase is
actually carried out will be discussed in the following sections.
Because some of the communications from the test_generator back to the master
object must be synchronous, (section 2.3.1.3), all communications from the master to the
test_generator must be asynchronous as shown in Figure 2.10. As discussed before, this
requirement ensures that the system will be deadlock free.
3.2.2.1 Setting Initial Objective
The initial objective in the test generation process is to sensitize the fault. This task
is performed by setting the value on the faulty line opposite the stuck-at value. Once the
test generator which has the faulty gate in its database has been identified, the master
invokes the setnextobjective() method in that test generator. This routine checks to see if
the fault has been sensitized. If it has not, then setting it opposite the stuck-at value becomes
the next objective. This objective is then sent back to the master so that backtracing may
begin.
If the fault is located on the output of a gate, and it has been set opposite the stuck-at
value, then it has been completely sensitized. During implication, a ‘D’ or ‘D̄’ will be
placed on it as appropriate and the next objective will be calculated by propagating the D
frontier.
However, if the fault is located on a gate input, then sensitization is not complete
until the effect of that fault has been propagated to the gate output. Sensitization is
accomplished by setting the gate’s other inputs to the non-controlling value. The non-
controlling value for an AND/NAND gate is a ‘1’. The non-controlling value for an OR/
NOR gate is a ‘0’. An XOR gate has no controlling value, so the other input may be set to
any value. Once the non-controlling values have been set on the fault free inputs, the gate
output will take on the proper ‘D’ or ‘D̄’ value during implication. Propagation of the D
frontier can then be started.
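The non-controlling-value rule above is simple enough to show directly. The following
helper is a sketch; the gate-type encoding is an assumption, not the TGS representation:

    // Returns the non-controlling logic value for a gate type, per the rules
    // above. XOR has no controlling value, so either assignment is acceptable.
    enum GateType { GATE_AND, GATE_NAND, GATE_OR, GATE_NOR, GATE_XOR };

    int non_controlling_value(GateType type)
    {
        switch (type) {
        case GATE_AND:
        case GATE_NAND: return 1;   // '0' controls AND/NAND
        case GATE_OR:
        case GATE_NOR:  return 0;   // '1' controls OR/NOR
        case GATE_XOR:  return 0;   // arbitrary: no controlling value exists
        }
        return 0;
    }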
3.2.2.2 Backtracing
Once the next objective, initial or subsequent, has been received by the master,
backtracing through the circuit to a primary input must be done. A detailed description of
the backtracing process can be found in [3].
Backtracing involves propagation of an objective value back through the circuit
topology until a primary input is reached. For example, suppose the current objective is to
set the output of an AND gate to ‘1’. Because an objective is only chosen for circuit nodes
with unknown, ‘X’, values, the output of the AND gate must be an ‘X’. Thus, all inputs to
this AND gate must be currently set at ‘1’ or ‘X’. An input to the AND gate which currently
has an ‘X’ value is selected as the next objective node with an objective value of ‘1’.
Likewise, for an objective of a ‘0’ on the output of an OR gate, one of the ‘X’ inputs would
be selected as the next objective node with a ‘0’ objective value.
When there is more than one input to a specific gate that has an ‘X’ value that can
be selected as the next objective, a simple heuristic is used to make that selection. If the
objective value on the gate’s output can be satisfied by setting one input to the controlling
value, the input easiest to control is selected. In this case, the easiest to control means the
input with the lowest controllability measure as calculated by SCOAP [38]. If the objective
value on the gate output can only be satisfied by setting all inputs to the non-controlling
value, the hardest to control input with an ‘X’ value is selected first. These rules are valid
in the AND/NAND and OR/NOR case. For the XOR gate, the work in [3] does not detail
how backtracing is performed. For XORs, the author implemented a heuristic that is based
upon the types of gates driving the XOR’s inputs. The details of this heuristic are not
important, except to note that it performs correctly and efficiently although it has not been
shown to be optimal in all cases.
Backtracing is started by the master, which issues a call to the backtrace(int
gate_number, int objective_value) method in the test generator which has the next
objective. The objective is backtraced through that test generator’s circuit database using
the above semantics until a primary input is reached or until a partition boundary is
encountered. A message is sent to the test generator with the gate on the other side of the
boundary to continue the backtrace procedure. This message is simply a call to the
backtrace() method in that test generator with the proper arguments. This process continues
until a primary input is reached. The number of the primary input and the value to be
assigned to it are then sent to the master for implication.
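A sketch of this distributed backtrace is shown below. All of the helper names
(is_local(), select_input(), owner_of(), and so on) are invented for the sketch; only the
backtrace(int, int) call itself is described in the text above.

    // Sketch: backtrace an objective toward a primary input, hopping to the
    // test generator that owns the next gate whenever a partition boundary is
    // crossed. Helper functions are illustrative stubs.
    struct test_generator;
    bool is_local(int gate);
    bool is_primary_input(int gate);
    int  select_input(int gate, int objective);   // SCOAP-guided 'X' input choice
    int  input_objective(int gate, int input, int objective); // value needed there
    test_generator* owner_of(int gate);
    void notify_master_imply(int primary_input, int value);

    struct test_generator {
        void backtrace(int gate, int objective)
        {
            while (is_local(gate) && !is_primary_input(gate)) {
                int next  = select_input(gate, objective); // easiest/hardest rule
                objective = input_objective(gate, next, objective);
                gate      = next;
            }
            if (is_primary_input(gate))
                notify_master_imply(gate, objective);       // PI reached: imply it
            else
                owner_of(gate)->backtrace(gate, objective); // cross the boundary
        }
    };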
3.2.2.3 Implication
The implication phase and the procedure that the master uses to determine when
implication has completed are the most complicated parts of the scalar TOP-TGS system.
Implication is also the only portion of the test generation process that is implicitly
parallelized. Implication is a simulation-like process which is begun when a test generator
sends a primary input number and value back from a backtrace procedure. The master sends
that input number and value to the test generator that has that input number in its circuit
database through the imply(gate_number, input_value) method invocation. The test
generator uses this value to imply, or simulate, the values on the outputs of all gates in its
database driven by that gate number. The values on the outputs of the gates driven by any
gate whose output has changed in the previous iteration are then implied. This local
implication loop is continued until no other gate output in the local circuit database has
changed. The test generator then looks through the database for any gate whose output has
changed and that drives a gate in another test generator’s database. This information is sent
to that test generator through an invocation of its imply() method. This cycle of local
implies followed by transmission of values to other test generators is continued until the
implication phase is completed, thus making the global circuit state consistent. This process
is inherently parallel in that a single test generator can send out imply() calls to more than
one other test generator at a time.
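A minimal sketch of this local implication loop is given below, assuming a
worklist-based event queue. All helper names (fanout(), owner_id(), and the master
semaphore calls explained in the next paragraphs) are assumptions, and the direct method
call stands in for what is really an asynchronous ESP remote invocation:

    #include <queue>
    #include <vector>

    struct test_generator;
    bool is_local(int gate);
    int  owner_id(int gate);
    test_generator* generator(int id);
    const std::vector<int>& fanout(int gate);    // gates driven by 'gate'
    int  evaluate(int gate);                     // recompute a gate's output
    int  get_value(int gate);
    void set_value(int gate, int value);
    void master_inc_semaphore(int generator_id); // synchronous call to master
    void master_dec_semaphore(int generator_id);

    struct test_generator {
        int id;
        void imply(int gate, int value)
        {
            std::queue<int> changed;
            set_value(gate, value);
            changed.push(gate);
            while (!changed.empty()) {                     // local fixpoint loop
                int g = changed.front(); changed.pop();
                for (int d : fanout(g)) {
                    if (!is_local(d)) {
                        master_inc_semaphore(owner_id(d)); // count it as busy
                        generator(owner_id(d))->imply(d, get_value(g)); // remote
                    } else if (evaluate(d) != get_value(d)) {
                        set_value(d, evaluate(d));         // output changed
                        changed.push(d);                   // propagate further
                    }
                }
            }
            master_dec_semaphore(id);                      // this generator is idle
        }
    };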
The difficulty in handling implication in this fashion is determining when
implication is completed. If one could watch the activity on the message fabric and activity
on the nodes, then completion could be detected when no outstanding imply() messages
exist and no node is running the imply() method. This type of monitoring is not possible,
so the master uses counting semaphores to determine when each test generator has satisfied
all of its outstanding imply() calls and is thus idle. For example, if test_generator #1 sends
two imply messages to test_generator #2, it will tell the master to increment the counting
semaphore for test_generator #2 twice. Each time test_generator #2 completes the imply()
procedure, it will tell the master to decrement its counting semaphore. Therefore, after the
two calls to imply() in test_generator #2, its counting semaphore will be back to zero
(assuming it started there), indicating that it is now idle. When all counting semaphores for
the test generators are zero, the master knows that the implication process is finished.
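On the master's side, the bookkeeping reduces to one counter per test generator. A
sketch follows; the class and method names are invented for illustration:

    #include <vector>

    // Sketch: the master tracks one counter per test_generator. A generator's
    // counter is incremented once per imply() sent to it and decremented once
    // per imply() it completes; all-zero counters mean implication is done.
    class implication_tracker {
        std::vector<int> count;
    public:
        explicit implication_tracker(int n_generators) : count(n_generators, 0) {}
        void increment(int tg) { ++count[tg]; }
        bool decrement(int tg)              // returns true when all are idle
        {
            --count[tg];
            for (int c : count)
                if (c != 0) return false;
            return true;                    // global implication phase finished
        }
    };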
One important synchronization issue in this process is that the master must
increment a test_generator object’s counting semaphore before that test_generator
completes the imply() method call and issues a call to decrement the semaphore. If these
steps do not occur in the proper order, the master may think that the global implication is
completed while that test_generator is still working on it. In order to avoid this problem,
the incrementing and decrementing of the counting semaphore is done using the
synchronous communications protocol of Figure 2.9. In the above example, test_generator
#1 will send a message to the master to increment the counting semaphore for
test_generator #2 and wait for a return value from the master to signify that this task has
been accomplished. Only then will it send a message to test_generator #2 to actually
perform the implication. This protocol ensures correct operation, but unfortunately, it adds
a significant amount of overhead to the implication process. Further research may be
necessary to find a more efficient solution to the problem of determining when global
implication is complete.
When implication is completed, the master gets the status of the objective from the
appropriate test generator. If the objective has been met, the master selects a new objective
through a call to the setnextobjective() procedure in the appropriate test generator. If the
objective has not yet been met, the master calls the backtrace() procedure in the test
generator that has the current objective. This backtrace will lead to another primary input
and value which will then be implied. If the objective value has been set to opposite what
is required, then backtracking must occur.
3.2.2.4 Setting D-Frontier
After the faulty line has been sensitized, a path must be created from the faulty node
to a primary output so that the value on the node may be observed. This path will be formed
by propagating the D frontier from the faulty node to a primary output. The D frontier is
the set of gates with a ‘D’ or ‘D̄’ on the inputs and an ‘X’ or unknown on the outputs. The
objective will be to set the other inputs to a gate on the D frontier to the non-controlling
value so that the ‘D’ is propagated from the inputs to the outputs. This process is called
propagation and is continued until a ‘D’ has reached a primary output, at which time a test
has been generated. The gate on the D frontier with the best (lowest) SCOAP observability
is selected as the next objective. Since this gate may be in any of the test generator’s
databases, all test_generators must participate in the decision as to which gate will be
selected for propagation. If the test generator that has the target fault in its database
determines that the sensitization process has been completed, it calls the
setDfrontier(gate, observability) method in itself with null arguments. The setDfrontier()
method calculates the gate on the D frontier with the best observability in its own database.
This gate becomes the new objective. This test_generator then calls the setDfrontier()
method in the test generator with the next highest address using the new objective gate and
observability as arguments. The second test_generator then finds the gate with the best
observability on its own D frontier. If this gate has a better observability than the gate that
was sent to it, this gate becomes the new objective. The second test_generator then calls the
setDfrontier() method in the test generator with the next highest address using the new
objective. The test generator at the top of the address list will call the test generator at the
bottom of the list, thus forming a loop of method invocations. If a test generator in the list
receives a setDfrontier() call with its own D frontier gate number, then it knows that a loop
has been completed and it has the next objective. This objective is then sent back to the
master so that backtracing may begin. If any test generator has a ‘D’ on a primary output,
it knows that a test has been generated and this fact is sent to the master. If no test generator
has a gate on the D frontier, a test is not possible with this input combination and that fact
is sent back to the master. This result would cause backtracking to occur.
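The circular-loop selection can be sketched as follows. The field and helper names are
invented, and the empty-frontier (ERROR_NOGBLDFRONTIER) and test-found cases are
compressed into a comment to keep the sketch short:

    // Sketch of the "circular loop" D-frontier selection. Each generator keeps
    // the candidate it last forwarded; receiving that same candidate back means
    // the loop closed and this generator holds the next objective.
    struct test_generator;
    void best_local_dfrontier(int* gate, int* observability); // lowest SCOAP coy
    void report_objective_to_master(int gate);

    struct test_generator {
        test_generator* next_in_loop;  // generator with the next highest address
        int forwarded_gate = -1;

        void setDfrontier(int gate, int observability)
        {
            if (gate >= 0 && gate == forwarded_gate) {
                report_objective_to_master(gate); // candidate survived the loop
                return;
            }
            int lg, lobs;
            best_local_dfrontier(&lg, &lobs);
            if (lg >= 0 && (gate < 0 || lobs < observability)) {
                gate = lg;                        // local gate is more observable
                observability = lobs;
            }
            // If gate is still -1 after a full loop, no D frontier exists
            // anywhere and set_Dfrontier_status(ERROR_NOGBLDFRONTIER) is sent.
            forwarded_gate = gate;
            next_in_loop->setDfrontier(gate, observability);
        }
    };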
Another method that could be used to calculate the D frontier would be for the
master to broadcast a message to every test_generator to report the gate number and
observability of its D frontier objective. The master could then select the new objective
directly. Although the “broadcast” method was not tested, the “circular loop” method
described above has two apparent advantages which led to its selection.
First, the loop method results in less message traffic on average. The broadcast
method always results in 2N messages, where N is the number of partitions of the circuit
and thus of test_generator objects. The loop method results in N + 2 messages in the best
case, where the first test_generator in the loop has the objective, and 2N + 1 messages in the
worst case, where the last test_generator in the loop has the objective.
Second, the loop method off-loads most of the work in this step from the master to
the test_generator objects. This fact may become important later in the parallel TOP-TGS
systems as the master object becomes the computational bottleneck.
3.2.2.5 Backtracking
As discussed previously, backtracking is the process of moving back up the search
tree when the current input assignment can not generate a test. There are two conditions
which, when present, indicate that a test is not possible under the current circuit state. The
first condition occurs when the faulty node has been set to the same value as the stuck-at
value. In this case, the fault has not been sensitized because the node value will be the same
in both the faulty and fault-free circuits. The second condition occurs when the D frontier
disappears. In this case, there is no ‘D’ value on a primary output and there is no gate in the
circuit with a ‘D’ or ‘D̄’ value on one of its inputs and an ‘X’ on its output. This fact
indicates that there is no possible path that can be sensitized to allow the value on the faulty
node to be observed.
The actual process of backtracking is handled by the master object. Backtracking is
started by the set_objective_status(int status) and set_Dfrontier_status(int status) methods
in the master. The set_objective_status() routine is called by the test_generator with the
faulty node in its database after each implication phase until the fault is sensitized. If the
faulty node has been set to the incorrect value, set_objective_status() is called again with
the ERROR_OBJWRONGSET flag as the argument. This call signals the master to begin
the backtrack process.
The set_Dfrontier_status() routine is called by the test_generator which has the next
objective in the D frontier. The test_generator with the next objective is determined using
the loop method as described in section 3.2.2.4. If none of the test_generators has a gate on
the D frontier, the last test_generator in the loop will call the set_Dfrontier_status() method
with the ERROR_NOGBLDFRONTIER flag. This call will also signal the master to
begin backtracking.
Regardless of how the backtracking process is begun, it proceeds in the same
manner. It begins by popping the last input assignment off of the stack and checking to see
if the alternate value has been tried. This check is performed by determining if a flag in the
stack structure has been set. If the alternate value has been tried, an unknown ‘X’ is
assigned to that input and implication is performed. This process restores the circuit state
to the point it was before that input assignment was made and removes the bottom node
from the search tree.
The process of popping input assignments off of the stack and removing them from
the search tree is continued until one is found in which the alternate value has not been tried.
When this situation occurs, the value of the assignment is inverted, its flag is set indicating
that the alternate value has been tried, and it is pushed back on the stack. The new value is
then placed on the input and implication is performed. When the implication has been
completed, the master invokes the setnextobjective() method in the appropriate
test_generator. The result of this call will be either a return call to set_objective_status() or
set_Dfrontier_status() with the next objective. If either of these methods is called with the
error flags, backtracking will continue. Otherwise, the test generation process continues as
it did before backtracking began.
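A sketch of this backtrack loop is shown below. Each stack entry records a primary
input assignment and a flag noting whether its alternate value has already been tried; the
names and types are illustrative, not the actual master implementation:

    #include <vector>

    struct Assignment { int input; int value; bool alternate_tried; };
    void imply_on_owner(int primary_input, int value);   // remote imply() call
    const int VALUE_X = 2;                               // unknown logic value

    // Returns false when the stack empties, i.e., the fault is redundant.
    bool backtrack(std::vector<Assignment>& stack)
    {
        while (!stack.empty()) {
            Assignment a = stack.back();
            stack.pop_back();
            if (a.alternate_tried) {
                imply_on_owner(a.input, VALUE_X); // undo: restore the 'X' state
                continue;                         // keep removing tree nodes
            }
            a.value = 1 - a.value;                // try the alternate value
            a.alternate_tried = true;
            stack.push_back(a);
            imply_on_owner(a.input, a.value);
            return true;                          // resume test generation
        }
        return false;                             // fault proven redundant
    }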
If at any time during backtracking the stack becomes empty, then the fault has been
proven redundant and work on it is stopped. The master also keeps track of the number of
backtracks that have been performed and stops work after a fixed number. The master then
marks the current fault as hard-to-detect. This process is normally done in many ATPG
systems to keep runtimes reasonable in circuits with a large number of hard-to-detect or
redundant faults.
3.3 Scalar TOP-TGS System Results
This section presents results of the TOP-TGS system on circuits that have been
partitioned using the various methods. Several of the ISCAS ‘85 benchmark circuits were
used to test this system. Table 3.1 shows the characteristics of the circuits for which data is
presented. These characteristics include the number of primary inputs and outputs, the total
number of gates, the total number of faults, and the number of levels in the circuit.
For this experiment, all of the faults in the circuits were targeted which means that
ATPG was performed on each fault in the circuit. Fault collapsing to reduce the size of the
fault list was not done and fault simulation was not used to remove detected faults from the
fault list.
3.3.1 Comparison of Partitioning Methods
The results for performing ATPG on circuits that were topologically partitioned
using the methods presented in the previous section are shown in Table 3.2. The results are
presented for 2, 4 and 8 partitions. In all cases, each partition resided on a separate
processor. The first column contains the average number of messages passed between
partitions in order to perform ATPG on each fault. The second column contains the average
time taken to perform ATPG on each fault. The third column is the percentage of memory
necessary to hold the circuit database on the individual processors compared to the
uniprocessor implementation. It is calculated using the following formula:
%Memory = [ max(M_1, M_2, ..., M_N) / M_uni ] × 100    (3.1)
where M_uni is the amount of memory required for the circuit database in the uniprocessor
case, and M_1 ... M_N are the amounts of memory required for the circuit database on each
of the N processors. Using the maximum of the memory used on each individual processor
gives an indication of the memory required in the worst case.
Table 3.1 Characteristics of Benchmark Circuits

    Circuit   No. of PIs   No. of POs   No. of Gates   No. of Levels   No. of Faults
    C432          36            7            160             18              864
    C499          41           31            202             12              998
    C880          60           26            383             18             1760
Table 3.2 TOP-TGS Results on Benchmark Circuits
(Msgs = total messages per fault; Time = total time per fault in ms; %Mem = percent
memory; MemEff = memory efficiency. An asterisk marks the best value in each column
for each circuit and processor count.)

Circuit C432
                       2 Processors                4 Processors                8 Processors [1]
Partition Type    Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff
Random            1442  1176  95.4  52.8      3330  1424  77.0  35.7      4608  1468  51.0  28.9
Fanin Cones       *755  *483  91.3  61.3      1267   674  69.4  50.6      1802   791  48.0  42.7
Fanout Cones       806   586  95.4  61.4      1580   818  67.3  43.4      2336   979  49.3  36.4
Input Paths        839   571 *73.0  69.8      1356  *625 *54.1 *56.6      2124  *778  45.9 *48.9
Output Paths       809   527  92.9 *71.8     *1158   652  69.9  51.7     *1617   844  62.8  41.5
Gate Level         828   595  82.7  63.2      1423   784  62.2  43.4      2097   967 *42.3  32.7

Circuit C499
                       2 Processors                4 Processors                8 Processors
Partition Type    Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff
Random            1172   962  93.8  53.8      2170  1045  72.4  37.3      3250  1144  48.6  30.0
Fanin Cones        778  *475  70.8  70.6     *1028  *522  51.0  51.8     *1368  *637 *33.9  43.8
Fanout Cones       831   528  85.6  64.5      1224   720  55.6  45.8      1692   801  47.3  41.4
Input Paths        845   602  81.5  61.4      1070   632  70.4  51.8      1534   714  54.7  44.2
Output Paths      *759   478  70.0 *72.3      1159   545 *49.0 *55.6      1725   638  37.0 *47.8
Gate Level         841   664 *66.7  60.3      1217   691  74.9  46.9      1907   948  57.2  33.0

Circuit C880
                       2 Processors                4 Processors                8 Processors
Partition Type    Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff     Msgs  Time  %Mem MemEff
Random             162   176  92.6  54.1       278   182  73.4  38.9       391   189  48.8  29.3
Fanin Cones        116    84  73.0  77.7       148    97  59.4  62.3       207   118  39.1  55.2
Fanout Cones      *106   *79  99.5 *87.9      *121   *85  98.2 *88.0      *151  *100  95.7 *86.5
Input Paths        122    94  68.0  74.0       157    98  54.2  63.1       213   119  40.9  51.8
Output Paths       118    84 *61.9  81.9       160    89 *41.3  66.4       210   103 *27.3  59.4
Gate Level         123   101  76.6  74.2       169   114  55.8  52.4       245   145  44.5  40.5

[1] Note: only 7 processors for Fanout Cones and Output Paths for circuit C432.
The fourth column
is the total memory efficiency of the multiprocessor implementation. Memory efficiency is
the ratio of the memory used in the uniprocessor case to the total memory used across all
processors in the multiprocessor case. It is calculated as follows:

MemoryEfficiency = [ M_uni / Σ_{i=1}^{N} M_i ] × 100    (3.2)
In each column, the entry with the best value is marked with an asterisk for each circuit. For
example, in the two processor case of circuit C432, fanin cones had the least number of total
messages per fault and the lowest run time per fault. However, input paths has the lowest
percent of memory and output paths has the highest memory efficiency. The overall goals
of topological partitioning in this system are to reduce the amount of memory required on
each node while maintaining the overhead caused by message passing at a minimum. The
message passing overhead may be measured indirectly by the increase in total runtime per
fault. Therefore, the percent of memory used and the total runtime are the most important
entries for selecting the optimum partitioning method. For circuits C499 and C880, the
output paths method of partitioning has the best overall performance. It typically results in
the lowest percent memory and runtime or is very close to the lowest. It also has high
overall memory efficiency. For circuit C880, fanout cones would seem to be the optimum
because it has the best runtimes for all three processor configurations, but its percent
memory is always over 95 percent. Output paths has the lowest percent of memory for all
processor configurations and yet has runtimes close to that of fanout cones.
For circuit C432, output paths performs adequately in terms of runtime, but it does
not result in a significant reduction in percent memory required. This result is caused by the
fact that C432 only has 7 primary outputs whereas the other circuits have at least 26. In fact,
since C432 only has 7 outputs, the maximum number of partitions and hence processors, is
7 and the results presented in the 8 processor column are really for 7 processors. Even in
the 2 and 4 processor case, there are only 7 subpartitions that can be combined into full
partitions, and the method does not seem to lead to optimum results. For circuits such as C432
which have significantly fewer primary outputs than inputs, partitioning by input paths is
recommended.
One fact to notice in the results is that although message passing has the largest
impact on runtime, the partitioning method that results in the least messages per fault does
not always have the lowest runtime. This fact indicates that message passing is not the only
impact on runtime for the multiprocessor case versus the uniprocessor case. The
implication procedure is the only process that is significantly different between the two
implementations. In the multiprocessor case, all of the gates in each processor’s memory
must be processed during each implication iteration. Therefore, poor memory efficiency
will adversely affect the time taken by implication. This fact can be seen in the results for
input paths and output paths for 8 processors and circuit C499. Input paths generates 1534
messages per fault with a runtime of 714 ms while output paths generates 1725 messages
with a runtime of 638 ms. However, partitioning by output paths has a higher memory
efficiency of 47.8% compared to 44.2%. This reduction of duplicated gates allows
partitioning by output paths to produce a lower runtime. Another factor that can affect
runtimes is the fact that in the multiprocessor case, separate processors can perform
implications in parallel. This result assumes that the circuit topology and the way it is
partitioned causes this parallelism to take place. This effect is difficult to measure, however,
and appears to be a minor contributor to runtime.
3.3.2 Conclusions
Several obvious conclusions can be made from these results. Random partitioning
performs the worst. It resulted in the longest runtimes and the largest percentage of memory
even though the actual number of gates in each partition was fairly well balanced. This
result is due to the fact that random partitioning results in the most duplication of gates
between partitions as indicated by its poor memory efficiency. Partitioning by gate level
performed better than random in all categories, but usually worse than any of the other
methods in terms of runtime and total messages. This result is as expected because
partitioning by gate level divides the circuit transversely while the algorithm moves
through the circuit longitudinally (from inputs to outputs). Thus partitioning by gate level
should result in more messages and therefore longer runtimes than the other methods.
Partitioning by fanin and fanout cones often resulted in the best runtime results, but
the percent memory used was often over 90% of the uniprocessor case. This result is caused
by the fact that fanin and fanout cone partitioning can result in one partition growing much
faster than the others leading to unbalanced partition sizes.
The two new methods developed for this research, partitioning by output paths and
input paths, resulted in the highest reduction in percent memory required in most cases
while generating runtimes close to the best case. For most circuits, partitioning by output
paths is recommended. However, for circuits with many more primary inputs than outputs,
partitioning by input paths is recommended. In the subsequent experiments undertaken for
this thesis, the partitioning method that performed the best for each circuit was used unless
specifically noted. In the case of circuits C499 and C880, the optimum method was
partitioning by output paths, and for circuit C432, partitioning by input paths.
The next chapter presents an analysis of the topologically partitioned ATPG process
in order to estimate its performance on parallel machines with lower communications
latency than the ES-KIT 88K.
Chapter 4
Analysis of Topologically Partitioned ATPG
This chapter presents an analysis of the complexity of the topologically partitioned
PODEM ATPG process in terms of a binary search. This analysis involves an identification
of the operations required to construct the search tree and a prediction of the number of each
type of operation required to generate a test for a specific fault. The computational
complexity and communications requirements for each type of operation in a topologically
partitioned ATPG system will be discussed. Predictions of the effect of increasing
communications cost will be made and compared with experimental data. Finally, the
results will be analyzed and used to predict the performance of the ATPG system on a
multiprocessor with lower communications costs.
4.1 The ATPG Search Process
This section presents a discussion of the PODEM algorithm as a binary search.
Recall from the description of PODEM in Chapter 2 that PODEM represents the search
space as a binary tree. Nodes in the tree, other than leaf nodes, represent primary inputs,
and edges in the tree represent logic value assignments to these nodes. Leaf nodes in the
tree represent either successful generation of a test vector, or determination that a test is not
possible with the given input assignments. The latter case is sometimes called an
inconsistency. Figure 4.1 shows an example circuit and a search tree for a specific fault. In
this case, the search tree is for a stuck-at-0 fault on the input of gate 3. Notice that a wrong
decision somewhere in the test generation process has resulted in an inconsistent
assignment of A=’0’, B=’1’, C=’0’, and A=’0’, B=’1’, C=’1’ being tried. At this point,
nodes were removed from the tree and more were added in another portion of the search
space.
Adding a node to the search tree requires the selection of a primary input for
assignment, and the actual assignment of the value to the input followed by implication of
the new circuit state. This process requires three separate operations.
First, a new intermediate goal required for a successful test must be selected. This
goal may be the sensitizing of the fault by setting the value of the faulty node opposite the
faulty value, or propagating the effect of the fault by setting the inputs of a gate along the
sensitized path to the non-controlling value. This process is called selecting the next
objective, or simply an objective operation.
Second, an input assignment necessary to set the objective value on the objective
node must be selected. This process is accomplished by a backtrace operation. The
backtrace operation consists of continuously moving the objective from a single gate’s
output to one of its inputs until a primary input is encountered.
Finally, the actual assignment to the primary input must be made and the new circuit
state computed. The forward implication operation accomplishes this process. Thus, adding
a node to the tree requires an objective, backtrace, and implication operation in that order.
On the other hand, removal of a node from the tree, called a backtrack, requires
either an objective operation followed by an implication, or only an implication. If the node
to be removed is a leaf node, an objective operation is required to determine that a test is
no longer possible. This step is followed by implication of the alternate value for the last
input assignment in the tree. If the node removed is not a leaf node, then the only operation
required is the implication of the alternate value for the input.

Figure 4.1 Circuit under test and search tree.
As an example, consider the trees shown in Figure 4.2. Tree A is a representation
of test generation which did not require any backtracking. Notice that the test involves
assignment of two input values and that generation of the test required 2 backtraces, 2
implys, and 3 objective operations. Generation of a test without backtracks represents the
lower bound on the complexity of the ATPG process and the number of individual
operations required. In general, ATPG without backtracking requires n backtraces, n
implys, and n+1 objectives, where n is the number of inputs specified in the test vector.
Tree B is a representation of attempted test generation on a redundant fault. In this
case, two input assignments must be made each time before a test is found to not be
possible. The tree contains the maximum number of internal nodes and leaf nodes for a tree
with two inputs specified. In this case, 3 backtraces, 7 implys and 7 objective operations
are required to construct the tree. In addition, 4 backtracks are required. This shape of tree
represents the worst case in terms of number of operations required. The next section
presents the calculation of the number of backtracks and operations necessary to construct
a tree of this type in the general case of n inputs specified.
Figure 4.2 State space search trees and their operational requirements (o = objective,
i = imply, b = backtrace, bt = backtrack).
4.2 Operations Requirements for Balanced Redundant Trees
Figure 4.3 shows a worst case search tree for a redundant fault with 3 inputs
specified. The tree has been annotated to show the operations required for each node.
Notice that the tree has 2^n − 1 internal nodes, 2^n leaf nodes, and a total of n + 1 levels. The
levels are numbered l = 0, for the root level, through l = n, for the leaf node level. Each level
has 2^l nodes. The total number of operations can be calculated by determining the number
of operations required in each level and summing across all levels.
For example, consider the number of backtracks required. Each level from l = 1 to n
requires 2^l − 1 backtracks. Thus the total number of backtracks, α, is:

α = Σ_{l=1}^{n} (2^l − 1) = 2(2^n − 1) − n    (4.1)
Similarly, the number of implications can be calculated by noting that each of the
levels from l = 1 to l = n requires 3(2^(l−1)) − 1 implications. Then the number of
implications, β, is given by:
β = Σ_{l=1}^{n} [3(2^(l−1)) − 1] = 3(2^n − 1) − n    (4.2)

Figure 4.3 State space search tree for worst case redundant fault (o = objective, i = imply,
b = backtrace, bt = backtrack).
The number of backtraces, χ, required for levels l = 1 to l = n is 2^(l−1) per level, or:

χ = Σ_{l=1}^{n} 2^(l−1) = 2^n − 1    (4.3)
Finally, the number of objectives required per level l = 1 to l = n is also 2^(l−1). The leaf
node level also requires an additional 2^n objectives, one for each leaf node, to determine if
a test is possible at this point. Then the number of objective operations, δ, is given by:

δ = Σ_{l=1}^{n} 2^(l−1) + 2^n = 2^(n+1) − 1    (4.4)
Table 4.1 contains a summary of the number of operations required for n = 1, 2, 3, and
4 as well as the general case. The next section explains how these formulas can be used to
estimate the number of operations required by a non-uniform search tree. Such a tree is
typically constructed when searching for a hard-to-detect fault which results in a test being
found within a specific backtrack limit. From this point forward, a balanced search tree
such as discussed in this section for n inputs specified will be denoted by the symbol ⟨n⟩.
Table 4.1 Operational Requirements vs. Number of Inputs Specified

    # inputs (n)         α              β             χ            δ
         1               1              2             1            3
         2               4              7             3            7
         3              11             18             7           15
         4              26             41            15           31
         n         2(2^n−1)−n     3(2^n−1)−n      2^n−1      2^(n+1)−1
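As a quick consistency check of (4.1) through (4.4), take the n = 3 tree of Figure 4.3:

    α = (2^1 − 1) + (2^2 − 1) + (2^3 − 1) = 1 + 3 + 7 = 11
    β = (3·2^0 − 1) + (3·2^1 − 1) + (3·2^2 − 1) = 2 + 5 + 11 = 18
    χ = 2^0 + 2^1 + 2^2 = 7
    δ = 7 + 2^3 = 15

which matches the n = 3 row of Table 4.1.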
4.3 Operations Requirements for Arbitrary Unbalanced Trees
For a circuit with n inputs, the entire search space for a specific fault is bounded by
⟨n⟩. However, test generation typically requires far fewer operations than are required to
search ⟨n⟩. This reduction is due to the fact that a test for a specific fault usually requires
only i < n inputs to be specified. Also, a backtrack limit is frequently set in ATPG such that
if a test for a fault is not found within the specified number of backtracks, the fault is
marked as undetected and dropped from the fault list. This backtrack limit prevents the test
generation process from taking exponential time in the worst case. A method has been
developed to estimate the operations required to generate an arbitrarily shaped unbalanced
search tree in which m backtracks are required and i inputs are specified.
Consider the case of a fault for which test generation requires 11 backtracks and a
total of 5 inputs specified. The backtrack number of 11 is sufficient to generate a tree of
⟨3⟩, and this would leave an extra 2 inputs to be specified. Thus, a search tree such as
shown in Figure 4.4A may result for this fault. In this case, the operational requirements
may be calculated by adding the requirements of ⟨3⟩ with those required to generate the
two extra nodes. The general case of a fault which requires m = 2(2^n − 1) − n backtracks
and a total of i inputs specified is shown in Figure 4.4B.
Most typical faults do not generate trees of this shape, but experimental data has
shown that using this type of tree as an estimator yields good results. For a fault with an
arbitrary number of backtracks, m, equation (4.1) must be solved for n. This yields:

n = log_2(m + n + 2) − 1    (4.5)

This equation is of course transcendental and must be solved numerically for n. Once this
is done, the operations requirement can be found by calculating the requirements for ⟨n⟩
using (4.2), (4.3), and (4.4) and adding those to the requirements to generate the
extra (i − n) nodes.
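Because (4.5) converges rapidly under simple fixed-point iteration, a few lines suffice to
solve it numerically. The following sketch illustrates the computation; the tolerance and
iteration limit are arbitrary choices, not values from the original system:

    #include <cmath>

    // Sketch: solve the transcendental equation n = log2(m + n + 2) - 1 of
    // (4.5) for n by fixed-point iteration, given the backtrack count m.
    double solve_n(double m)
    {
        double n = std::log2(m + 2.0);              // initial guess
        for (int k = 0; k < 100; ++k) {
            double next = std::log2(m + n + 2.0) - 1.0;
            if (std::fabs(next - n) < 1e-9)
                break;
            n = next;
        }
        return n;   // e.g., m = 11 gives n = 3, matching Table 4.1
    }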
Table 4.2 shows the comparison of the actual vs. estimated operations required by
a number of faults in the C432, C499, and C3540 benchmark circuits. The data shows that
the above estimation method can predict the number of operations required to find a test for
a single fault within 7.0%.
4.4 Communications Requirements for Topologically Partitioned ATPG
The next step in the process of analyzing the performance of topologically
partitioned ATPG is to determine the communications cost. This calculation can be
performed by combining the estimated number of operations required for ATPG with the
estimated communications costs of each operation type. The equations presented in the
previous section can be used to estimate the total number of operations required for ATPG
on an entire circuit. However, this technique requires an estimate of the average number of
inputs specified in the test vector set, i_e, and an average number of backtracks required to
generate the test vector set, m_e. Equation (4.5) can then be used to calculate n_e, from
which β_e, χ_e, and δ_e can be determined using (4.2), (4.3), and (4.4).
Figure 4.4A State space search tree (the ⟨3⟩ tree plus two extra input assignments).
Figure 4.4B General case state space search tree (the ⟨n⟩ tree plus (i − n) extra input
assignments a_1 ... a_(i−n)).
Given these numbers, all that remains is to determine the average communications
required for an implication operation, a backtrace operation, and an objective operation. Let
i_comm, b_comm, and o_comm denote these quantities. The total number of messages
required for ATPG can be calculated by:

M_comm = f × (i_comm β_e + b_comm χ_e + o_comm δ_e)    (4.6)

where f is the number of faults in the circuit under consideration.
Table 4.2 Actual vs. Estimated Operational Requirements

                   # inputs            actual              estimated
    Fault #        specified    α     β     χ     δ      β     χ     δ
    C432 - 15         12       247   380   134   261    379   132   260
    C432 - 27         21       247   389   143   270    388   141   269
    C432 - 145        11       103   165    62   115    163    60   115
    C432 - 149        14       104   169    65   119    166    63   118
    C499 - 280        34         8    45    37    43     44    36    43
    C432 - 171        11        33    60    27    45     59    26    45
    C432 - 175        14        34    64    30    49     63    29    49
    C432 - 197        11         9    24    15    21     23    14    21
    C499 - 282        34         1    35    34    36     35    34    36
    C499 - 480        40       247   410   163   289    407   160   288
    C499 - 793        37       247   407   160   286    404   157   285
    C499 - 871        35       247   403   157   284    402   155   283
    C499 - 873        29       247   398   152   287    396   149   277
    C3540 - 2000      14       182   286   104   197    284   102   197
    C3540 - 2002      11       232   358   126   244    357   124   245
    C3540 - 2004      11       207   324   115   221    321   114   222
    C3540 - 2016      14       247   382   136   263    381   134   262
    C3540 - 2017      12        94   152    58   107    150    56   107
    C3540 - 2047      19        19    47    28    39     46    27    39

Of course, the difficulty of this method is estimating the values of i_e and m_e and the
communications requirements, i_comm, b_comm, and o_comm. The values of i_e and m_e are
dependent upon the circuit under consideration and the ATPG algorithm used. The values
of i_comm, b_comm, and o_comm are dependent not only on those factors, but also on the
algorithm used to partition the circuit and the number of partitions the circuit was divided
into.
One way to estimate these parameters is to use the method presented in [41] where
the values derived from partial ATPG are used to estimate the requirements for complete
ATPG. That is, the topologically partitioned ATPG system is used to generate tests for a
small number of faults and the values of i_e, m_e, i_comm, b_comm, and o_comm are
measured. These numbers are then used to estimate the communications costs of complete
test generation for all faults using (4.6).
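Putting the pieces together, this estimate can be sketched as follows. The accounting
for the extra (i_e − n_e) nodes, one objective, backtrace, and imply each, follows the
node-addition rule of Section 4.1; the function and variable names are illustrative:

    #include <cmath>

    double solve_n(double m);   // fixed-point solution of (4.5), sketched earlier

    // Sketch: estimate the total message count of (4.6) from the measured
    // averages i_e, m_e and per-operation message costs i_comm, b_comm, o_comm.
    double estimate_Mcomm(double faults, double i_e, double m_e,
                          double i_comm, double b_comm, double o_comm)
    {
        double n_e   = solve_n(m_e);
        double p     = std::pow(2.0, n_e);
        double extra = i_e - n_e;                       // nodes beyond <n_e>
        double beta_e  = 3.0 * (p - 1.0) - n_e + extra; // (4.2) + extra implys
        double chi_e   = p - 1.0 + extra;               // (4.3) + extra backtraces
        double delta_e = 2.0 * p - 1.0 + extra;         // (4.4) + extra objectives
        return faults * (i_comm * beta_e + b_comm * chi_e + o_comm * delta_e);
    }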
The graphs in Figure 4.5, Figure 4.6, and Figure 4.7 show the results of using this
method to estimate the total number of messages required for topologically partitioned
ATPG on the C432, C499, and C880 circuits. In each case, the circuits are partitioned onto
two processors. Estimates were made of the total communications requirements in terms of
the total number of messages after ATPG was performed on 20%, 40%, 60%, 80%, and
100% of the faults in the circuit. The actual measured number of messages required is also
shown on the graphs.
Figure 4.5 Estimated number of messages required for C432.
Figure 4.6 Estimated number of messages required for C499.
Figure 4.7 Estimated number of messages required for C880.
(Each graph plots the actual and predicted total number of messages required against the
percentage of faults processed.)
Notice that for circuit C880, the estimated number of messages required is within
3% of the actual number required even after only 20% of the faults were processed. This
result is caused by the fact that the values of i_e, m_e, i_comm, b_comm, and o_comm are
nearly constant and close to the final values throughout the test generation process.
For the C499 and C432 circuits, however, the initial estimates for the total number
of messages required varied up to 42% from the actual number after the first 20% of the
faults were processed. In these circuits, the values of i_comm, b_comm, and o_comm were
fairly constant and accurate throughout the test generation process, but the values for i_e
and m_e varied widely from the final values. This variation in i_e and m_e caused the
calculated values of β_e, χ_e, and δ_e to be inaccurate initially.
In the C432 circuit, the first portion of the fault list contains a much higher
percentage of hard-to-detect faults than does the entire list. This fact causes the initial
operations estimates, and hence the communications requirements estimates, to be higher
than actual. In C499, however, the opposite is true. The initial portion of the fault list
contains a lower percentage of hard-to-detect faults than the entire list and thus the
communications requirements estimates were initially lower than actual. In both cases
though, the estimated number of messages required became closer to actual as the ATPG
process continued. Both estimates were within 7% by the time all faults had been
processed.
4.5 Estimating the Effects of Changing Communications Latency
Once the number of messages required for topologically partitioned ATPG has been
determined, the total increase in test generation time due to communications costs can be
determined. Because this version of the TOP-TGS system is essentially scalar, the time
necessary to send the required messages can be added directly to the computation time. The
computation time itself is difficult to calculate, but it can be determined directly by
measuring the time required for test generation by the serial (non distributed) version of the
TGS system running on the same platform. The total test generation time for the distributed
system, τ_dist, can be calculated by:

τ_dist = τ_serial + τ_comm + τ_ov    (4.7)
where τ_serial is the non-distributed TGS execution time, τ_comm is the total message
communications time, and τ_ov is the additional execution time required by the
computational overhead of the distributed ATPG algorithm. A more detailed analysis of the
computations which comprise τ_ov will be presented in the next section. The
communications time, τ_comm, for a given parallel processor can be calculated as:
Figure 4.6 Estimated number of messages required for C499. [Graph: actual vs. predicted
total number of messages required vs. percentage of faults processed.]
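As a rough illustrative check (the message count here is read off the order of magnitude in
Figure 4.5, not a measured constant): if complete ATPG for C432 on two processors requires
on the order of Μcomm ≈ 7×10^5 messages, then τcomm ≈ 7×10^5 × 0.212 ms ≈ 150 seconds
of pure communications time on the ES-KIT 88K.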
The overhead time, τov, is impossible to calculate a priori because it depends on the
fault under consideration, the circuit topology, and the partitioning method used. However,
because it is possible to calculate τcomm and to measure τdist and τserial, τov can be
determined using (4.7). When (4.8) is substituted into (4.7), the topologically partitioned
ATPG runtimes, τdist, can then be calculated for various communications latencies,
∆comm.
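Written out explicitly, the substitution is simply (this restates (4.7) and (4.8), with a latency
scaling factor k introduced for the experiments that follow):

    τdist(k) = τserial + τov + k Μcomm ∆comm

so the predicted runtime is linear in k with slope Μcomm ∆comm.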
Figure 4.7 Estimated number of messages required for C880. [Graph: actual vs. predicted
total number of messages required vs. percentage of faults processed.]

In order to verify this method of predicting how τdist changes with ∆comm, an
experiment to measure the effect directly was needed. Because it is not possible to
change the actual communications latency of the ES-KIT 88K directly, some means of
simulating the effect of changing ∆comm was necessary. Again, because this version of the
TOP-TGS system is scalar, the communications latency between processors simply
represents the time from the point where useful computation halts on one processor to the
point where it begins on another. Therefore, it is possible to effectively increase this time
by performing useless computations on the receiving processor for some time ∆spin after
the message is received. If ∆spin = ∆comm, then the communications latency is
effectively doubled. This process is illustrated in Figure 4.8.
Figure 4.8 Simulating increased communications latency. [Diagram: two message timelines
between processors. Without the spin loop, the delay between send_message() on one
processor and receive_message on the other is ∆comm; with a loop of SPIN_NO floating
point multiplies inserted after receive_message, the effective delay becomes
∆eff = ∆comm + ∆spin = 2∆comm.]
Notice that this method is only valid for a scalar algorithm. In a parallel algorithm,
the useless computations could delay useful computations in another portion of the
algorithm; in a true parallel system with a genuinely longer latency, those computations
would instead be performed in parallel with the communications delay.
The useless computation placed in the TOP-TGS system to increase the
communications latency consisted of a loop of floating point multiply operations on
operands stored in memory. Experimentation determined that each multiply of this type
takes 1.80009 µs to complete; thus a loop of 117 of these operations is required to make
∆spin = ∆comm. One pass through the 117 multiply operations therefore doubles the
effective communications latency (a scaling factor of two), two passes triple it, and so on.
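A minimal sketch of this instrumentation is shown below; the names spin_delay,
MULTIPLIES_PER_DELTA, and spin_sink are illustrative rather than the actual TOP-TGS
identifiers, and only the constants 117 and 1.80009 µs come from the measurements above.

    // Hypothetical sketch of the spin delay used to scale communications latency.
    // 117 memory-operand FP multiplies at ~1.80009 us each approximate one
    // ES-KIT 88K communications latency (~0.212 ms).
    const int MULTIPLIES_PER_DELTA = 117;

    volatile double spin_sink = 1.000001;  // memory operand; volatile defeats optimization

    // Called at the start of each remote method: burns (scale - 1) latencies of
    // useless floating point work so the effective latency becomes scale * delta_comm.
    void spin_delay(int scale) {
        for (int pass = 1; pass < scale; ++pass)
            for (int i = 0; i < MULTIPLIES_PER_DELTA; ++i)
                spin_sink = spin_sink * 1.000001;  // one FP multiply per iteration
    }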
These spin loops were added to the TOP-TGS system at the beginning of each
remote method. These methods are the ones that must be invoked when sending messages
between processors. The TOP-TGS system was further modified to perform ATPG on a
single fault multiple times for timing purposes. This modification was made so that
runtimes could be calculated and measured for single faults and compared for greater
accuracy. The faults tested were chosen to represent a mix of easy and hard faults that were
distributed throughout the circuit topology.
The graphs in Figure 4.9, Figure 4.10, Figure 4.11, and Figure 4.12 show a
comparison between the predicted and measured test generation times for representative
faults in circuits 74181, C432, C499, and C880. In all cases, the predicted and measured
times increased linearly with the communications scaling factor. In some cases the slopes
of the predicted and measured increases differed slightly, probably due to inaccuracies in
measuring the actual communications latency, as discussed in Chapter 2. The results verify
that (4.7) is an excellent predictor of the scaling of τdist with changes in ∆comm.
4.6 Distributed ATPG Overhead
If the TOP-TGS system were run on an ideal multiprocessor with instantaneous
communications between processors, i.e. ∆comm = 0, then the runtime would be:

    τdist = τserial + τov                                            (4.9)

Thus τov is a significant factor in the distributed runtime: it represents excess
computation time that must be overcome by any parallelism added to the algorithm, even
on an ideal multiprocessor with ∆comm = 0.
There are three ways in which τov may be determined. The first is to measure the
serial execution time, the distributed execution time, and the total number of messages
required for ATPG; τov can then be calculated using (4.7) and (4.8). The second method is
to use the linear relationship between the actual runtimes and the communications scaling
factor to find the intercept of the runtime curve at a scaling factor of zero. Using the data
and graphs presented in the previous section, these two methods result in values of τov that
agree to within 5%.

Figure 4.9 Test generation times vs. communications scaling factor, 74181. [Two graphs:
measured and predicted runtimes (ms) vs. communications scaling factor for faults #1,
#181, and #379.]
The third method of calculating τov is to determine the types of computations which
constitute this overhead and the number of times they are executed in the distributed versus
the serial algorithm; τov can then be estimated using the execution rate of the host
processor. This method is by far the most difficult, but it is very useful in that it helps
identify areas of inefficiency in the distributed algorithm that may be improved.

Figure 4.10 Test generation times vs. communications scaling factor, C432. [Three graphs:
measured and predicted runtimes (ms) vs. communications scaling factor for faults #1,
#501, #145, #731, and #619.]

There are two areas of inefficiency that have been identified thus far in the
TOP-TGS system: increased numbers of gate evaluations during the calculation of the
D frontier, and increased numbers of gate evaluations during the forward implication
process.
As discussed in section 3.2.2.4, the calculation of the D frontier is a global process
that involves all of the processors. In the current implementation, the calculation is done
sequentially with the processors communicating in a ring. Each processor calculates its D
frontier value and compares it to the value sent to it by the previous processor. If its value
is better than the one sent to it, the processor forwards its value to the next processor in the
ring. If its D frontier is not better, the processor forwards the value it received to the next
processor. The first processor in the ring to receive its own value back holds the globally
best D frontier. Under this scheme, the worst case occurs when the last processor in the
ring holds the best D frontier; each processor must then calculate the D frontier twice,
which involves evaluating every gate in the circuit twice for each D frontier (or objective)
operation, whereas the serial system evaluates each gate only once. The overhead from this
inefficiency, τov1, is therefore approximately:

    τov1 ≅ Ng δ Io R88                                               (4.10)

where Ng is the number of gates in the circuit, δ is the total number of objective operations
from (4.4), Io is the number of instructions required for each gate evaluation in the
D frontier, and R88 is the single-instruction execution time of the 88100 processor.

Figure 4.11 Test generation times vs. communications scaling factor, C499. [Two graphs:
measured and predicted runtimes (ms) vs. communications scaling factor for faults #1,
#480, #621, #931, and #961.]

Figure 4.12 Test generation times vs. communications scaling factor, C880. [Three graphs:
measured and predicted runtimes (ms) vs. communications scaling factor for faults #41,
#801, #1201, #1521, and #1741.]
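The following sketch illustrates the ring selection just described; all names (DFVal,
compute_local_dfrontier, send_to_successor) are hypothetical rather than the actual
TOP-TGS interface, and the two helper routines are assumed to be supplied by the runtime.

    // One candidate circulating around the ring.
    struct DFVal {
        int owner;   // processor that produced this candidate
        int gate;    // best local D-frontier gate in that partition
        int score;   // heuristic cost of the candidate; lower is better
    };

    DFVal compute_local_dfrontier(int my_id);   // evaluates every local gate (assumed)
    void  send_to_successor(const DFVal& v);    // forwards to the next ring member (assumed)

    // Invoked when the ring token arrives. Returns true once this processor's own
    // candidate has survived a full trip, i.e. it holds the global D frontier.
    bool on_ring_token(int my_id, const DFVal& incoming) {
        if (incoming.owner == my_id)
            return true;                             // selection complete
        DFVal mine = compute_local_dfrontier(my_id); // re-evaluates the local gates
        send_to_successor(mine.score < incoming.score ? mine : incoming);
        return false;
    }

In the worst case every processor executes compute_local_dfrontier() twice per objective,
which is exactly the doubling of gate evaluations that (4.10) charges to τov1.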
Examination of the symbol table for the test generator object reveals that the code
for the setDfrontier() method begins at address 0x12DC and ends at 0x1620. This address
range constitutes 836 bytes, or approximately 209 32-bit instructions for the 88100. The gate
evaluation portion of this routine probably accounts for about 20% of this code, or 42
instructions. The manufacturer of the 88100 claims an average execution rate of 1.36 clock
cycles per integer instruction, or 14.7 million instructions per second for an 88100 with a 20
MHz clock. This figure corresponds to an R88 of 68.0 ns.
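Spelling out that arithmetic as a check (a restatement of the figures above, not new data):
R88 = 1.36 cycles ÷ 20 MHz = 68 ns, i.e. 1/(14.7×10^6) s, so each D-frontier gate
evaluation costs roughly Io R88 = 42 × 68 ns ≈ 2.9 µs, and by (4.10) a circuit of Ng gates
adds about Ng × 2.9 µs of overhead per objective operation.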
The increase in gate evaluations during forward implication in the distributed
ATPG algorithm vs. the serial ATPG algorithm is due to the fact that the two algorithms
process the gates in different order. The serial ATPG algorithm processes the gates in level
order from inputs to outputs whereas the distributed algorithm processes gates as their
inputs become active. The levelized processing of the serial algorithm guarantees that all
inputs to a gate have reached their current state values before the gate is evaluated. This
insures that each gate is evaluated at most once in every forward implication operation.
This levelized processing is not possible in the distributed system and this results in excess
gate evaluation steps when changing inputs to gates do not arrive at the same time.
As an example, consider the AND gate of Figure 4.13. In the serial forward
implication process, the inputs to the AND gate are both set to ‘1’ before the gate is
evaluated and its output is set to ‘1’. However, in the distributed case, if the imply messages
#1 and #2, arrive at different times, two evaluations are necessary. The first evaluation with
the inputs at ‘1’, ‘X’, results in an output of ‘X’ and the second with the inputs at ‘1’,’1’
τov1
NgδIoR88≅
69
results in the output becoming ‘1’.
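The double evaluation is easy to reproduce in isolation. The sketch below (illustrative code,
not TOP-TGS source) evaluates a two-input AND over the values {'0','1','X'} once per
arriving input event, mirroring the distributed case:

    #include <iostream>

    // Three-valued AND: a controlling '0' dominates; '1' AND '1' is '1';
    // anything else is unknown.
    char and2(char a, char b) {
        if (a == '0' || b == '0') return '0';
        if (a == '1' && b == '1') return '1';
        return 'X';
    }

    int main() {
        char in[2] = {'X', 'X'};
        int evaluations = 0;
        // Imply messages #1 and #2 arrive at different times; the gate is
        // re-evaluated on each event.
        for (int msg = 0; msg < 2; ++msg) {
            in[msg] = '1';
            char out = and2(in[0], in[1]);
            ++evaluations;
            std::cout << "after message " << msg + 1 << ": output = " << out << '\n';
        }
        std::cout << "evaluations = " << evaluations << '\n';  // 2, vs. 1 in level order
    }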
The number of gate evaluations performed by the serial ATPG system, ges, and by
the distributed system, ged, can both be measured, and the difference ∆ge = ged − ges
determined. The overhead due to excess gate evaluations during forward implication, τov2,
can then be calculated as:

    τov2 ≅ ∆ge γ Ii R88                                              (4.11)

where Ii is the number of instructions required for each gate evaluation, and γ is the average
number of inputs per gate. It is necessary to include the γ factor because the implication
procedure loops through the instructions once for each gate input. For most of the circuits
under consideration, γ = 2.
The code for the imply() procedure begins at address 0x199C and ends at 0x230C.
This address range constitutes 2416 bytes, or approximately 604 instructions. The gate
evaluation portion of this routine accounts for about 90% of the code, or 543 instructions.
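As a quick consistency check (my arithmetic on the figures above): each excess gate
evaluation costs γ Ii R88 = 2 × 543 × 68 ns ≈ 73.9 µs, so, for example, the 74181-1 entry of
Table 4.3 with ∆ge = 66 gives τov2 ≈ 66 × 73.9 µs ≈ 4.9 ms, matching the tabulated value
of 4.87 ms.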
Figure 4.13 Serial vs. distributed forward implication. [Diagram: a two-input AND gate
whose inputs cross partition boundaries; in the serial algorithm both inputs reach '1' before
the single evaluation, while in the distributed algorithm implication messages #1 and #2
arrive separately, forcing two evaluations.]

Table 4.3 shows the calculated values of τov1, τov2, and τovcalc = τov1 + τov2 for
the faults analyzed in the previous section. The calculated overhead, τovcalc, is compared
to the measured value, τovmeas, by the factor f = τovmeas/τovcalc. The factor f varies from
1.33 to 3.88. While this difference may seem large, the calculated and measured values of
the overhead time agree fairly closely given the gross approximations involved in
calculating the overhead time.
The variance in the ratios suggests that there may be additional excess calculations
being performed in the distributed ATPG system that were not taken into account by (4.10)
and (4.11). However, the variance may simply be due to variation in R88 caused by such
factors as cache miss rates.

Table 4.3 Calculated vs. Measured Overhead Times (times in ms)

Fault #        δ     τov1     ∆ge    τov2   τovcalc  τovmeas      f
74181-1        6     1.40      66    4.87      6.27    12.62   2.07
74181-181      8     1.87     153   11.30     13.71    19.20   1.46
74181-379     25     5.85     682   50.36     56.20    74.71   1.33
C432-1        11     6.62      19    1.40      8.02    31.16   3.88
C432-731     115    69.30    3066   226.0     295.3    653.4   2.21
C432-145       9     5.40      28    2.06      7.48    24.75   3.70
C432-501     260   156.7     7156   528.0     684.7   1368.9   1.99
C432-619     180   108.5     1761   130.0     238.5    654.9   2.91
C499-1        23    18.58      32    2.30     20.88    64.12   3.07
C499-480     289   233.5    12110   894.0    1127.0   1848.1   1.65
C499-621      42    33.94     182   13.44     47.38    88.73   1.87
C499-931     275   222.3     3188   235.4     457.7   1230.2   2.68
C499-961      14    11.31      42    3.10     14.41    25.55   1.77
C880-41       16    21.40      60    4.43     25.83    75.36   2.92
C880-801       4     5.35       2    0.15      5.49    11.18   2.04
C880-1201     17    22.77      55    4.06     26.83    73.78   2.75
C880-1521      9    12.05      63    4.65     16.70    36.99   2.21
C880-1741      3     4.01       6    0.43      4.45     9.67   2.71
4.7 Effects of Increasing the Number of Partitions
The results presented in the previous sections were for circuits divided into 2
partitions. This section presents results for circuits divided into 4 partitions.
The distributed ATPG algorithm is consistent in the sense that it always follows the
same path through the search space independent of the partitioning of the circuit. Therefore,
the number of operations required to generate the search tree will remain the same.
However, the number of messages required to perform each operation should increase. This
increase in messages will increase the communications cost, τcomm. Figure 4.14 shows the
graphs of the estimated number of messages for the C432, C499, and C880 circuits when
divided into 4 partitions.
Notice that the curves have the same shape as those for 2 partitions presented in
section 4.4. The estimated curve for C432 initially falls above the actual curve because of
the high percentage of hard-to-detect faults at the beginning of the fault list. The estimated
curve for C499 is initially below the actual curve because of the high percentage of hard-
to-detect faults at the end of the fault list. The estimated curve for C880 falls within a few
percent of the actual values throughout the ATPG process. One additional fact to notice is
the increase in the total number of messages required for ATPG with 4 partitions versus 2
partitions. In the case of C432, the 4 partition case requires almost 67% more messages than
the 2 partition case; the C499 and C880 circuits require 52% and 42% more total messages
respectively. These increases in the total number of messages required increase the
communications cost, τcomm, and thus the distributed ATPG runtime, τdist, directly.
The graphs in Figure 4.15 show how the actual and predicted ATPG times scale with
increasing communications latency, ∆comm, for selected faults. The graphs also show that
the ATPG times themselves have increased over the 2 partition case because of the
increased communications costs. The overhead time, τov, however, has not increased
significantly and can still be estimated using (4.10) and (4.11), as shown in Table 4.4.
Figure 4.14 Total message requirements for 4 processors. [Three graphs: actual vs.
predicted total number of messages required vs. percentage of faults processed for C432,
C499, and C880.]
Figure 4.15 Test generation time vs. communications scaling factor for 4 partitions. [Three
graphs: measured and predicted runtimes (ms) vs. communications scaling factor for
selected faults in C432, C499, and C880.]
Table 4.4 Calculated vs. Measured Overhead Times (4 partitions; times in ms)

Fault #        δ     τov1     ∆ge     τov2   τovcalc  τovmeas      f
C432-145       9     5.42      30     2.21      7.63    27.65   3.62
C432-501     260   156.7    10249   756.8     913.5   1324.3   1.45
C432-619     115    69.30    4243   313.3     382.6    669.0   1.75
C499-1        23    18.50      58     4.28     22.78    79.61   3.49
C499-480     289   233.6    14012  1034.0    1267.6   1483.0   1.17
C499-931     275   222.3     7206   532.1     754.3   1248.0   1.65
C880-41       16    21.40      54     3.98     25.38    69.88   2.75
C880-1741      3     4.01       6     0.443     4.45     9.52   2.14
Chapter 5
Algorithmic Enhancements
This chapter details the algorithmic enhancements made to the serial TOP_TGS
system in order to speed up the ATPG process for a single fault across topological
partitions. The enhancements fall into two general categories. The first category is
efficiency enhancements which attempt to increase the number of computations that are
performed between communications steps. These enhancements increase the grain size of
the algorithm which can increase performance in a system with large communications
latencies such as the ES-KIT 88K. These types of enhancements are not parallelizations of
the base algorithm, but they can lessen the impact of communications overhead incurred by
topological partitioning and thus increase the benefit of the parallelizations actually added
to the system.
The second category consists of enhancements which attempt to increase the
amount of work that is done simultaneously on separate processors. These enhancements
are parallelizations of the basic algorithm which can achieve speedups over the
uniprocessor system.
The first two of the following sections discuss packetizing implication invocations
and optimistic backtracing which fall into the category of efficiency enhancements. The
next two sections detail the multiple backtrace and optimistic propagation procedures
which are parallelization enhancements. The last section presents the results of the various
enhancements on the benchmark circuits.
5.1 Packetized Implication
During the implication procedure within one partition, the outputs of gates at the
partition boundary can take on new values. These new values must be transmitted to the
test_generator objects which have the gates on the other side of the boundary so that the
global implication process can be completed. The original TOP_TGS system was
implemented such that this process was performed one gate at a time as the new values on
the boundary gates were generated. This process meant that when implication changed a
value on a boundary gate, it would stop, send a message to the master to increment the
receiving test_generator's counting semaphore, wait for a reply, and then send the single
new value to the receiving test_generator via a call to its imply() method. This process is
clearly time consuming, and efficiency could be enhanced if these values
were grouped into packets and sent together for implication. This enhancement was the
first one made to the TOP_TGS system.
The majority of the modifications for this enhancement were made to the imply()
method itself. The method was modified to take a list, or packet, of gate numbers and values
for implication as an argument. An overall loop was added to imply all of the values in the
packet across the partition in serial order. Because only combinational logic is handled by
the TOP_TGS system, the final circuit state after all of the implications is independent of
the order in which the individual implications were done.
The process by which the implication calls for the changed boundary gate outputs
are sent out required the most modification. In order to packetize these calls, no outgoing
imply() invocations are issued until the local implication of the entire input packet is
complete. This modification necessitated a method of determining which boundary gates
had undergone a change on their outputs since the start of implication within the partition.
This determination was made by adding a previous state variable to each gate structure and
saving the current state of the output to it at the start of local implication. The state of the
output after local implication can then be compared to the previous state to determine if a
state change necessitating an imply() call has occurred. The addition of this previous state
variable to the gate structure did add a small amount of memory overhead (6%) to the
partition database.
Once the local implication of the input packet is performed, the imply() calls
generated by the gates on the partition boundary are packetized according to their
destination. Because the ESP environment requires all arrays that are sent as arguments to
methods to be of fixed size, several packets may have to be sent to a single test_generator.
The counting semaphore for each test_generator is thus incremented by the master as
determined by the number of packets to be sent to that test_generator. The packets are then
sent to the imply() methods in the appropriate test_generator objects one at a time.
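A compact sketch of this packetizing step is shown below; the names (GateValue,
flush_boundary_changes, send_imply_packet) are illustrative rather than the actual
ESP/TOP_TGS interface, and send_imply_packet is assumed to increment the destination's
counting semaphore via the master before invoking its imply() method.

    #include <cstddef>
    #include <map>
    #include <vector>

    const std::size_t PACKET_SIZE = 10;     // experimentally chosen packet size

    struct GateValue { int gate; char value; };

    // Assumed runtime call: bumps the destination's counting semaphore through
    // the master, then invokes the destination test_generator's imply() method.
    void send_imply_packet(int dest, const std::vector<GateValue>& packet);

    // Group the changed boundary values by destination test_generator and ship
    // them in fixed-size packets instead of one message per gate.
    void flush_boundary_changes(const std::multimap<int, GateValue>& changes) {
        std::vector<GateValue> packet;
        for (auto it = changes.begin(); it != changes.end(); ) {
            const int dest = it->first;
            packet.clear();
            for (; it != changes.end() && it->first == dest; ++it) {
                packet.push_back(it->second);
                if (packet.size() == PACKET_SIZE) {   // packet full: send it now
                    send_imply_packet(dest, packet);
                    packet.clear();
                }
            }
            if (!packet.empty())
                send_imply_packet(dest, packet);      // send the final partial packet
        }
    }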
The major disadvantage of this method of packetizing is that the parallelization of
the implication procedure is reduced. This reduction is caused by the fact that the imply()
calls are issued after local implication is completed, not in parallel with the actual
implication itself.
There is an obvious trade-off in determining the size of the implication packets. Too
large a packet size results in wasted time, as large messages with little information in them
are sent between test_generators. Too small a packet size results in multiple messages
being sent between test_generators where one larger message would have sufficed. The
optimum packet size was determined by experimentation to be roughly equal to the average
number of values passed between test generators during implication; it is somewhat
dependent on the size of the circuit, the number of partitions, and the partitioning method
used. For the circuits under consideration here, a packet size of 10 performed adequately
and resulted in an average 5% decrease in ATPG runtimes.
5.2 Optimistic Backtracing/Objective Satisfaction
The second algorithmic enhancement which falls into the category of efficiency
enhancements is the optimistic backtracing procedure. This procedure attempts to increase
the amount of work done when a partition receives a backtrace message. Recall from the
discussions in previous sections that a backtrace operation is used to determine what input
assignments need to be made to satisfy a specific objective. This objective would consist
of specifying a certain value to be present on a circuit node. Also recall that backtracing is
accomplished by systematically moving that objective back through the circuit topology to
the primary inputs. If, during this process, the objective crosses a partition boundary, a
message must be sent from one test_generator object to another to continue the backtrace
operation. Optimistic backtracing can increase the amount of work done when this type of
message is received.
Suppose a test_generator receives a backtrace message for a specific gate on its
boundary and successfully backtraces it to a primary input in its own circuit partition. The
next step in the ATPG process would be to perform a global implication of that input
assignment and check the status of the objective. In most cases, the objective will not yet
be satisfied and in fact, backtracing may proceed along the same path and cross the partition
boundary at the same gate. This situation will occur if the gate output at the partition
boundary is still at an ‘X’ value.
The optimistic backtracing process uses the fact that the output of the gate at the
partition boundary where the backtrace was originally received still has an ‘X’ value as a
signal. This signal tells the process that another backtrace from the partition boundary can
be done in an attempt to find another local input assignment.
The local backtracing and implication continues until either the objective at the
partition boundary is satisfied, or the backtrace operation encounters another partition
boundary. When either of these occurs, the input assignments made locally are sent to the
master object to be pushed on the global stack, and global implication is performed to make
the circuit state consistent.
As an example, consider the circuit of Figure 5.1. For a stuck-at '1' fault on the
input to Pnot driven by C1, the first objective is to drive that node to a '0'. This objective
will sensitize the fault. A backtrace invocation will be sent from test_generator #3 to
test_generator #2 to backtrace the objective of setting the output node of C1 to a '0'.
Test_generator #2 will backtrace this objective toward input S2 until the boundary with
test_generator #1 is encountered. This operation will result in a backtrace invocation being
sent to test_generator #1 to set S2 to a '1'. This input assignment will be made and global
implication will be performed. After implication, the output of C1 will still be an 'X'. The
next backtrace issued to test_generator #2 for the objective of setting C1 to '0' will result
in the optimistic backtrace procedure being used.
With input S2 set at '1', the backtrace of C1, '0' will lead to the assignment of input
A3 to a '1'. Because this is a local input to test_generator #2, a local implication will be
done to see if the C1, '0' objective has been satisfied. Since the output of C1 will still be at
'X', another local backtrace is done, which will lead to the assignment of a '0' to input B3.
Local implication of this input assignment will lead to a '1' value on the output of B2 and
thus a '0' on the output of C1. The original objective has now been satisfied, and the
optimistic backtrace procedure will make the global state consistent. This process involves
sending the input assignments B3, '0' and A3, '1' to the master object to be pushed on the
global stack, and packetizing and sending out imply() calls for any changes on the outputs
of boundary gates. When implication is finished, the test generation process will continue
by selecting a new global objective, which in this case will be to set any unknown inputs of
Pnot to a '1' to propagate the fault.
Figure 5.1 Optimistic backtracking/objective satisfaction in the 74181 circuit. [Diagram:
gates Pnot, C1, B1, B2, and A1 and inputs S2, S3, B3, and A3 spread across partitions 1-3,
with the backtrace operations and input assignments of steps #1-#3 marked.]

The modifications necessary to incorporate this functionality into the test_generator
object were fairly extensive. First, the backtrace and imply procedures had to be configured
so that they could be invoked locally without generating message traffic. This was
accomplished by changing the public methods imply() and backtrace() to the private
methods local_imply() and local_backtrace(). Next, a public imply() method was added
which allows the global implication to be performed correctly. This imply() method is, in
fact, a public stub routine which calls local_imply() with the same arguments.

The public backtrace() method actually carries out the optimistic backtrace
procedure. It consists of a loop of calls to local_backtrace() and local_imply() which
continues until either a partition boundary is encountered or the objective is satisfied. A
pseudocode representation of this method is shown in Figure 5.2.
The send_inputs() routine sends the local input assignments to the master object to
be pushed on the global stack. The send_imply() routine starts the global implication
process. This task is performed exactly as in the previous system: the imply() calls are
packetized, the counting semaphores for the test_generators are incremented by the master,
and the imply() calls are issued to the respective test_generators.
    backtrace(objective_gate, objective_value) {
        do {
            local_objective_gate  = objective_gate;   // set local variables
            local_objective_value = objective_value;
            local_backtrace();              // backtrace to local input or partition boundary
            local_push(input_assignment);   // save input assignment
            local_imply(input_assignment);  // do local implication of input assignment
        } while (partition_boundary_not_found &&
                 local_objective_gate.output != 'X');

        send_inputs();   // send local input assignments to master
        send_imply();    // send out values on boundary gates for implication
    }

Figure 5.2 Pseudocode representation of the optimistic backtrace procedure.

This enhancement typically resulted in a 10% improvement in runtime on the
benchmark circuits. However, this result is greatly dependent on circuit topology and the
target fault under consideration. A more detailed explanation of the results is contained in
section 5.5.
5.3 Multiple Backtrace
The multiple backtrace procedure is an algorithmic enhancement that falls into the
category of parallelizations. The goal of the multiple backtrace procedure is to issue as
many backtrace operations as possible so that multiple test_generator objects will be
processing them in parallel.
The concept of multiple backtrace was first presented in the FAN algorithm [4].
However, FAN is a serial algorithm and the goal of using multiple backtrace in it was to
reduce the number of conflicts when backtracing through reconvergent fanout paths. This
reduction was accomplished by backtracing all paths through reconvergent fanout
simultaneously and then counting the number of times each logic value, either ‘0’, or ‘1’,
was backtraced through a fanout stem. The objective value on that fanout stem would then
be assigned the logic value with the largest count.
The use of multiple backtrace for the parallelization of topologically partitioned
ATPG was presented in [31]. However, the scheme presented in [31] is unrealistic in its
assumption of only one gate per processor. This assumption was made to measure the upper
limit on the number of simultaneous backtrace operations which could be expected on the
benchmark circuits. Further, this work does not address the application of multiple
backtracing to a realistic topological partitioning scheme which requires many gates to be
placed in a single partition, as was done for the TOP-TGS system. It also does not address
the practical issues that arise when implementing multiple backtrace on a real
multiprocessor system.
As an example of the multiple backtrace procedure, consider the section of the
74181 circuit shown in Figure 5.3. In generating a test for a stuck-at '1' fault on the output
of gate Pnot, the first objective is to sensitize the fault by placing a '0' on that node. This
objective requires that all of the inputs to Pnot be set to a '1' value. Because all of the inputs
to Pnot are driven by gates in other partitions, backtrace operations can be sent to them
simultaneously to be backtraced in parallel.
The actual implementation of the multiple backtrace process in the TOP_TGS
system differs slightly from the description above. In the example of Figure 5.3, the
backtrace of the objective Pnot, '0' would first lead to a backtrace call to test_generator
#2 with the objective C1, '1'. This value would then be locally implied by test_generator
#3, followed by another local backtrace of Pnot, '0'. This backtrace would then lead to the
objective C3, '1' being sent to test_generator #1. The value '1' would be locally implied
on the output of C3 by test_generator #3, followed by another local backtrace operation.
This process would continue until the original objective of Pnot, '0' had been satisfied by
the local implications in test_generator #3. Each backtrace call sent to the other
test_generators during this process would be satisfied by an optimistic backtrace procedure
within that test_generator. Once all of the backtrace calls had been satisfied, all input
assignments would be pushed onto the global stack and implication would be performed to
make the global circuit state consistent. The next global objective would then be selected
and the test generation process would continue.

Figure 5.3 Multiple backtrace in the 74181 circuit. [Diagram: gates Pnot, C1, C3, C5, and
C7 spread across partitions 1-4, with backtrace operations driving each input of Pnot
to '1'.]
This method of performing the multiple backtrace process was chosen for its ease
of implementation and the fact that it helps keep the originating test_generator object busy
as well as the test_generators receiving the backtrace calls.
Like optimistic backtracing, the multiple backtrace process is performed by the
backtrace() method. In this case, the method contains a loop which continues to be
executed until the original objective which was sent to it has been satisfied. The loop begins
by calling local_backtrace() to backtrace the objective until either a local primary input or
a partition boundary is encountered. If a local input is found, the objective value is assigned
to it and a local implication is performed. If a partition boundary is encountered, a backtrace
call is issued to the appropriate test_generator and a local implication of the value on the
boundary gate’s output is done. Once the initial objective has been met, the local input
assignments made during the process are sent to the master to be pushed on the global stack.
Imply() calls are then issued for the boundary gate outputs that have changed after the
proper counting semaphores have been incremented. A pseudocode representation of the
multiple backtrace routine is shown in Figure 5.4.
Because all backtrace operations now use this multiple backtrace process, some
test_generators may be issuingimply() calls while others are still in the backtrace phase.
This fact could result in the master thinking that all test_generators are idle when actually
some test_generators are still backtracing. This result would cause the master object to
incorrectly begin the next phase of the ATPG process before the present phase is
completed. In order to avoid this problem, the system was modified so that the counting
semaphore for a specific test_generator is incremented whenever it is sent a backtrace() or
imply() call and decremented whenever one of these operations completes. Therefore, the
master will not determine that all test_generators are idle until all outstanding backtrace()
and imply() calls have been processed.
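A small sketch of this bookkeeping on the master (illustrative names, not the actual master
object's interface) makes the idle test concrete:

    #include <vector>

    class Master {
        std::vector<int> outstanding;   // one counter per test_generator
    public:
        explicit Master(int num_tgs) : outstanding(num_tgs, 0) {}
        // Called when a backtrace() or imply() is sent to test_generator tg.
        void sent_call(int tg)      { ++outstanding[tg]; }
        // Called when that test_generator reports the call complete (mark_idle).
        void call_completed(int tg) { --outstanding[tg]; }
        // The next ATPG phase may begin only when no calls are outstanding.
        bool all_idle() const {
            for (int c : outstanding)
                if (c != 0) return false;
            return true;
        }
    };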
The results of using multiple backtracing on the benchmark circuits were mixed.
Overall, this enhancement increased processor utilization and thus decreased runtimes in
most cases. However, in some situations the multiple backtrace
procedure was less efficient. This situation most often occurs when one local input
assignment would satisfy the global objective itself or result in a circuit state where a test
was not possible. The fact that this situation has occurred may not be known until the entire
circuit state is made consistent throughimply() calls. Much of the work done in these
situations could be unnecessary or even incorrect. Reconvergent fanout or excessive
numbers of XOR gates in the circuit topology are the most likely causes of these situations
and this occurrence can seriously impact the performance increases realized by this
algorithmic enhancement. Section 5.5 presents a more detailed discussion of the results of
the multiple backtrace process.
    // The counting semaphore for this test_generator was incremented by the master
    // before this call was received; this marks this test_generator as busy.
    backtrace(objective_gate, objective_value) {
        do {                                     // main loop
            local_objective_gate  = objective_gate;   // set local variables
            local_objective_value = objective_value;
            local_backtrace();                   // backtrace to local input or partition
                                                 // boundary; this sets current_objective
            if (current_objective == primary_input) {
                local_push(input_assignment);    // save input assignment
                local_imply(input_assignment);   // do local implication of input assignment
            } else {
                local_imply(boundary_gate.output);   // locally imply objective value on
                                                     // boundary gate
            }
        } while (local_objective_gate.output != 'X');   // loop until objective is satisfied

        send_inputs();                   // send local input assignments to master
        send_imply();                    // send out values on boundary gates for implication
        mark_idle(this_test_generator);  // decrement this test generator's semaphore
    }

Figure 5.4 Pseudocode representation of the multiple backtrace procedure.
5.4 Optimistic Propagation
One of the major limits to the available parallelism in the multiple backtrace
procedure is the fact that once the global objective has been satisfied in the local partition,
the entire system must work together to make the global state consistent and compute a new
global objective. The global implication and objective selection process tends to be more
serial in nature than the backtracing process and, therefore, the more often it is performed,
the lower the speedup will be. The optimistic propagation procedure is an attempt to
decrease the number of times the global implication/objective selection process is
performed.
Optimistic propagation is the process of speculating on what the next global
objective will be and proceeding with work on satisfying that objective. This procedure is
based on the theory of Speculative Computation [40], which states that if an idle processor
is used to perform work that may or may not be necessary, speedup will be achieved only
if the work turns out to be useful. This result, of course, assumes that the speculative work
does not take place ahead of required work that other processors are waiting for.
As an example, consider the stuck-at '0' fault on the output of gate E3 as shown in
Figure 5.5. The first global objective would be to sensitize the fault by setting the output of
E3 to a ‘1’. This objective would require a ‘1’ on all inputs to E3. The first backtrace
operation would be local to test_generator #2 and would lead to the local input assignments
necessary to set the output of gate C1 to ‘1’. The next backtrace would lead to a backtrace
message being sent to test_generator #1 with the objective, C3, ‘1’. After local implication
of this boundary gate value, the next backtrace would send the objective, C4, ‘1’ to
test_generator #1. Local implication of this boundary gate output in test_generator #2
would lead to the satisfaction of the global objective as the output of E3 would become ‘1’.
Without the addition of the optimistic propagation procedure, the system would normally
stop and make the global state consistent at this point and then select a new global
objective. The optimistic propagation procedure, however, would speculate on what the
next objective for propagation would be and begin processing it.
In the example case, the next objective would be selected in the local partition of
test generator #2 to propagate the ‘D’ value on the output of E3 through gate F1. This
objective would require all inputs of F1 not on the propagation path to be set to the
non-controlling '0' value. Local backtraces targeting the outputs of gates E1 and E2 would
be performed, followed by a local backtrace to set the output of gate E4 to a '0' as well. These
local backtraces could lead to backtrace calls to other partitions. This process of selecting
the next local objective for propagation would continue until the D frontier has reached a
partition boundary. At this point, the system will stop and make the global state consistent,
and select a new global objective.
Figure 5.5 Optimistic propagation in the 74181 circuit. [Diagram: gates E1-E4, F1, C1, C3,
and C4 across partitions 1 and 2, showing the backtrace operations, local backtraces, and
optimistic backtraces that propagate the 'D' on the output of E3 through F1.]
There are several situations which could make the work done in propagating the
local D frontier to the partition boundary unnecessary and, hence, lead to it being
speculative. First, the input assignments set by other partitions in satisfying their backtrace
calls could actually set up the conditions necessary to propagate the D frontier in the local
partition. Second, these input assignments could result in propagation of a D frontier in
another partition. This remote D frontier may then be at a more observable point, or even
at a primary output, and this fact will not be known until a new global objective is selected.
Finally, the remote input assignments could result in a circuit state where a test is no longer
possible. Because of the possibility of these situations occurring, there is a trade-off in how
much work can be done between performances of global implication/objective selections.
If too many operations are done between global consistency checks, there is a larger chance
that some of the work will be unnecessary or incorrect. Too few operations between
consistency checks, on the other hand, limits the amount of parallelism that can be extracted.
As in the previous algorithmic enhancements, most of the modifications necessary
to implement the optimistic propagation process were made to the backtrace() routine. The
name of this routine was changed to satisfy_obj(), and its major function became the
satisfaction of all local objectives necessary to propagate the D frontier to a partition
boundary.
At the beginning of the ATPG process, satisfy_obj() is called in the test_generator
with the target fault in its database. This invocation of satisfy_obj() will call the
local_setnextobj() routine to set up the next local objective. Local_setnextobj() is simply a
local version of the setnextobj() method, and in this case it will return the next objective that
will sensitize the fault. Satisfy_obj() will then perform all of the local_backtrace() and
local_imply() calls necessary to satisfy this objective. Once the current objective has been
satisfied, local_setnextobj() will be invoked to select the next objective in the local
partition. Local_setnextobj() will continue to return local objectives until the fault has been
sensitized. Once the fault has been sensitized, local_setnextobj() will call
local_setDfrontier() to return the next objective. Local_setDfrontier() will return the next
objective necessary to propagate the D frontier through the next gate.
The satisfy_obj() method will continue looping in this manner until the D frontier
has been propagated to the partition boundary. At this point, all of the local input
assignments made are sent to the master to be pushed onto the global stack, and the global
state is made consistent by issuing the necessary imply() calls. A pseudocode
representation of the satisfy_obj() routine is shown in Figure 5.6.
As with the multiple backtrace procedure, results of the optimistic propagation
procedure are greatly dependent on the topology of the circuit and the target fault under
consideration. In some situations, all of the speculative work is later required and is
therefore useful. Runtimes in these cases are much lower using the optimistic propagation
procedure. In other situations, the speculative work is incorrect and runtimes are increased.
This incorrect work frequently occurs when processing a hard fault that usually results in a
    // The counting semaphore for this test_generator was incremented by the master
    // before this call was received; this marks this test_generator as busy. It is not
    // necessary to send this routine a starting objective because it uses the global
    // objective as a starting point and then computes its own local objectives.
    satisfy_obj() {
        do {                                     // main loop
            local_setnextobj();                  // set the next local objective
            local_backtrace();                   // backtrace to local input or partition
                                                 // boundary; this sets current_objective
            if (current_objective == primary_input) {
                local_push(input_assignment);    // save input assignment
                local_imply(input_assignment);   // do local implication of input assignment
            } else {
                local_imply(boundary_gate.output);   // locally imply objective value on
                                                     // boundary gate
            }
        } while (objective_gate_is_not_boundary);   // loop until D frontier reaches a
                                                    // partition boundary

        send_inputs();                         // send local input assignments to master
        send_imply();                          // send out values on boundary gates for implication
        set_idle_status(this_test_generator);  // decrement this test generator's semaphore
    }

Figure 5.6 Pseudocode representation of the optimistic propagation procedure.
large number of backtracks. In this situation, frequent global consistency checks are
beneficial for determining whether a wrong decision, one which should result in a
backtrack, has been made. Performing speculative work between consistency checks here
simply results in
useless computations which increase the time between backtracks and hence increase the
ATPG time. It is also possible for the optimistic propagation process to increase the number
of backtracks necessary to perform ATPG on any given fault. This situation occurs when
the optimistic propagation routine continues to work on propagating an unprofitable D
frontier when a global objective selection procedure would have selected another D frontier
gate for propagation. This increase in the number of backtracks is especially detrimental
because the processing of backtracks requires an extra implication operation which has
very limited parallelism. The next section presents a detailed discussion of the results of
this algorithm enhancement.
5.5 Analysis of Results
This section presents the results of the various algorithmic enhancements to the
TOP_TGS system. The results of adding each enhancement to the system will be presented
using the same example faults used in the previous chapter. Then the results of doing ATPG
on all of the faults in the circuit with each enhancement will be presented.
5.5.1 Optimistic Backtracing Results
Table 5.1 compares the results of the TOP_TGS system with the optimistic
backtracing enhancement added, dubbed version 1, against the unenhanced distributed
TOP_TGS system. Included in the comparison are the runtime for each fault, τ, the total
number of messages required for ATPG, Μcomm, the total number of backtracks, α, and the
total number of inputs specified in the test vector, ie. The results show that although a
decrease in runtime of up to 20% was achieved on some faults, other faults experienced
up to a 38% increase in runtime.
Because the optimistic backtrace procedure is not a parallelization enhancement,
the increase or decrease in runtimes is caused directly by an increase or decrease in either
communications costs or operational complexity. The communications cost for ATPG is
measured directly by Μcomm; the operational complexity can be determined from α and ie,
as shown in the previous chapter.
Table 5.1 Comparison of Results for Version 1 System (runtimes in ms)

               Distributed System            Version 1 System
Fault No.      τdist      α   ie   Μcomm     τver1      α   ie   Μcomm
74181-1         31.0      0    7     61       28.6      0    5     59
74181-181       44.97     0    7     94       34.8      0    7     69
74181-379      146.97    10   14    257      201.73    28   14    365
C432-1          70.73     0   10    126       65.95     0   10    121
C432-145      1149.0    103   11   1627     1198.9    109   13   1704
C432-501        56.57     0    8    106       59.78     0    8    111
C432-619      2336.9    247   11   3010     2368.8    247   12   3216
C432-731      1281.98   157   22   1915     1438.45   166   22   2303
C499-1         147.12     0   22    284       58.56     0   22     86
C499-480      3065.1    247   40   3908     3069.78   247   38   4016
C499-621       208.74     0   41    369      157.12     0   41    262
C499-931      2230.89   247   26   2184     2277.9    247   26   2896
C499-961        64.37     0   13    118      220.66    16   32    292
C880-41        143.49     0   15    188       99.30     0   15    121
C880-801        24.73     0    3     33       15.69     0    3     17
C880-1201      145.86     0   16    190      143.72     0   16    201
C880-1521       73.48     0    8     89       59.50     0    8     69
C880-1741       19.58     0    2     23       16.83     0    2     16

For example, consider fault #C432-501. Both versions of the TOP_TGS system
required no backtracks for ATPG on this fault and specified 8 inputs in the test vector, so
the operational complexity in both cases is equivalent. However, the number of messages
required actually increased from 106 to 111, thereby increasing the runtime from 56.57 ms
to 59.78 ms. Likewise, fault #C880-1741 has the same operational complexity in
both cases, but a decrease in the number of messages from 23 to 16 reduced the runtime
from 19.58 ms to 16.83 ms. In the case of faults such as C499-961, however, the dominant
factor was the increase in operational complexity, which also increased the number of
messages required. For this fault, α increased from 0 to 16 backtracks and ie increased
from 13 to 32 inputs specified; these increases raised the runtime from 64.37 ms to
220.66 ms.
This change in operational complexity is caused by the fact that the optimistic
backtrace process can change which primary inputs are specified in the test vector. The
change in primary inputs specified can lead the enhanced ATPG algorithm to search
different portions of the search space than the serial algorithm, and the altered search can
turn out to be more or less efficient. The data in Table 5.1 seems to indicate that more faults
experience longer ATPG times than shorter ones; if that were the case, ATPG over all faults
would take longer for the version 1 system than for the unenhanced distributed system. The
data presented in section 5.5.4 will show, however, that this is not the case.
5.5.2 Multiple Backtracing Results
Table 5.2 contains a comparison of the results of the version 2 TOP_TGS system
and the unenhanced distributed system. The version 2 system contains both the multiple
backtrace parallelization and the optimistic backtracing enhancement.
Because the multiple backtrace enhancement is in fact a method of parallelization,
it is expected that some faults will experience a decrease in runtime without a decrease in
the total amount of work being done. This decrease would be due to the effect of
parallelism.
Examination of the results for fault #C499-961 shows that the runtime decreased
from 64.37 ms to 48.44 ms. This decrease was realized in spite of the fact that the number
of inputs specified increased from 13 to 34, indicating that the operational complexity of the
ATPG process increased by a factor of over 2.6. The number of gate evaluations for this
fault also increased, from 134 to 520, indicating that the speedup occurred in spite of an
increase in overhead processing as well.

The results for fault #C432-1 also indicate a speedup, from 70.73 ms to 46.39 ms.
The operational complexity for this fault increased from 10 to 14 inputs specified, while the
overhead decreased slightly, with the number of gate evaluations dropping from 197 to
170.
Table 5.2 Comparison of Results for Version 2 System (runtimes in ms)

               Distributed System            Version 2 System
Fault No.      τdist      α   ie   Μcomm     τver2      α   ie   Μcomm
74181-1         31.0      0    7     61       26.06     0    5     56
74181-181       44.97     0    7     94       27.07     0    7     63
74181-379      146.97    10   14    257      114.87    12   14    231
C432-1          70.73     0   10    126       46.39     0   14    125
C432-145      1149.0    103   11   1627     2034.7    247   14   3045
C432-501        56.57     0    8    106       33.04     0    8     67
C432-619      2336.9    247   11   3010       47.27     0   12    109
C432-731      1281.98   157   22   1915      446.61    62   22    806
C499-1         147.12     0   22    284       39.56     0   22     84
C499-480      3065.1    247   40   3908     2401.2    247   37   3376
C499-621       208.74     0   41    369      223.28    32   41    405
C499-931      2230.89   247   26   2184       79.81     1   40    188
C499-961        64.37     0   13    118       48.44     0   34    133
C880-41        143.49     0   15    188       94.05     0   15    122
C880-801        24.73     0    3     33       16.21     0    3     20
C880-1201      145.86     0   16    190       76.40     0   16    134
C880-1521       73.48     0    8     89       42.12     0    8     52
C880-1741       19.58     0    2     23       18.00     0    2     16
As was the case for the version 1 system, some faults experienced an increase in
runtimes. These increases were due mainly to an increase in operational complexity. Again
the increase in operational complexity was caused by the algorithmic changes leading the
ATPG process into a nonproductive area of the search space.
The results for fault #C432-145 show that the number of backtracks increased from
103 to 247 as the fault became hard to detect. This increase in α led to an increase in
runtime from 1149.0 ms to 2034.7 ms.
5.5.3 Optimistic Propagation Results
Finally, Table 5.3 contains a comparison of the results of the version 3 TOP_TGS
system and the unenhanced distributed version. The version 3 system contains all of the
enhancements of the previous versions as well as the optimistic propagation enhancement.
The results are similar to those for the version 2 system. Both fault #C432-1 and
fault #C499-961 show the apparent effect of speedup due to parallelism. Several faults also
had increases in runtime due to increases in operational complexity.
One interesting result to note is the fact that all of the example faults for circuit
C432 were processed with zero backtracks by the version 3 system while the unenhanced
version required a total of 507 backtracks for these faults. While this may seem to be a
promising result, there may in fact be other faults on the fault list which experience an
increase in the number of required backtracks. An experiment where ATPG is performed
on the entire fault list is required to determine if the total number of backtracks did indeed
decrease. The next section presents the results of such an experiment.

Table 5.3 Comparison of Results for Version 3 System (runtimes in ms)

               Distributed System            Version 3 System
Fault No.      τdist      α   ie   Μcomm     τver3      α   ie   Μcomm
74181-1         31.0      0    7     61       56.13     0   13    103
74181-181       44.97     0    7     94       23.19     0    7     51
74181-379      146.97    10   14    257       78.64    12   10    169
C432-1          70.73     0   10    126       37.32     0   14    106
C432-145      1149.0    103   11   1627       77.28     0   14    154
C432-501        56.57     0    8    106       28.67     0    8     54
C432-619      2336.9    247   11   3010       54.28     0   16    131
C432-731      1281.98   157   22   1915      107.03     0   22    197
C499-1         147.12     0   22    284       80.04     0   41    206
C499-480      3065.1    247   40   3908     2444.7    247   38   3530
C499-621       208.74     0   41    369     2130.5    247   35   2359
C499-931      2230.89   247   26   2184       79.59     1   40    188
C499-961        64.37     0   13    118       48.36     0   34    134
C880-41        143.49     0   15    188       38.04     0   13     42
C880-801        24.73     0    3     33       15.05     0    3     21
C880-1201      145.86     0   16    190       55.55     0   15     94
C880-1521       73.48     0    8     89       28.96     0    9     42
C880-1741       19.58     0    2     23       16.43     0    2     17
5.5.4 Results of ATPG on the Entire Fault List
Figure 5.7 contains the graphs of the runtimes for the benchmark circuits on the
various TGS systems. These runtimes are for complete test generation where every fault is
targeted. Version 'S' is the serial, uniprocessor, PODEM ATPG system. Version 0 is the
serial distributed TOP_TGS system. Versions 1 through 3 are the enhanced TOP_TGS
systems discussed in the previous sections.
The graphs show a large increase in test generation time from the serial version to
the distributed version. This increase in runtimes is due to the communications time and
overhead processing incurred by the distributed system. These were the factors analyzed in
Chapter 4. Note that the serial system requires no messages to perform ATPG. Graphs of
the number of messages required for ATPG are shown in Figure 5.8.
Figure 5.7 Complete ATPG runtimes. [Four graphs: runtimes (sec) vs. TOP_TGS version
(S, 0, 1, 2, 3) for 74181, C432, C499, and C880.]

All of the circuits show a steady improvement in the runtimes for the enhanced
TOP_TGS systems, with the exception of C499. The runtimes for this circuit decreased for
versions 1 and 2 relative to the unenhanced system, but the runtime for version 3 increased
substantially. Examination of the number of messages required for C499 in Figure 5.8
reveals that Μcomm also increases greatly for version 3. This increase in Μcomm, and thus
in the total runtime, is caused by the increase in aborted faults for the version 3 system
over the other versions. Figure 5.9 contains a graph of the number of aborted faults for
circuits C432 and C499. Notice that the number of aborted faults for the version 3 system
on C499 increases almost 50% over the unenhanced version. This increase in aborted faults
increases the total number of backtracks,α, because every aborted fault requires 247
backtracks to process. The increase inα, in turn, increases the operational complexity of
the ATPG process thereby increasingΜcomm.
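Written out, the relationship between aborted faults and backtracks is a simple product (the symbol ΔNabort is notation introduced here for illustration, not from the text):

\[ \Delta\alpha \approx 247 \times \Delta N_{\mathrm{abort}} \]

so, for example, 10 additional aborted faults contribute roughly 2,470 additional backtracks to α, before counting any change in the backtracks of faults that still complete.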
Figure 5.7 Complete ATPG runtimes (runtime in seconds vs. TOP_TGS version; one panel each for 74181, C432, C499, and C880).

Figure 5.8 Complete ATPG message requirements (total number of messages vs. TOP_TGS version; one panel each for 74181, C432, C499, and C880).

The graphs in Figure 5.7 indicate that the best results were obtained for the C880
circuit. The overall runtime for this circuit was actually less for the version 3 system than
the serial uniprocessor system. The improvement in runtime was realized in spite of the fact
that the version 3 system required 53,662 total messages to perform ATPG. The version 3
system also experienced an increase in operational complexity of almost 100% as the total
number of inputs specified increased from 8391 to 15723 and an increase in overhead as
the total number of gate evaluations increased from 303,573 to 324,754. These facts are a
strong indicator that a significant amount of parallelization is present in the version 3
system for C880 and decreasing the amount of excess overhead processing and
communications latency should yield useful speedups on this circuit.
The results for the 74181 circuit, while not as promising, indicate that speedup
could be realized for this circuit if overhead processing and communications latency were
reduced. However, the results for the C432 and C499 circuits were not as promising. Any
decreases in runtime due to parallelization were smaller and were reduced or eliminated in
some cases by increases in overhead processing or operational complexity.

Figure 5.9 Total aborted faults (number of aborted faults vs. TOP_TGS version, for circuits C432 and C499).
For C432, the operational complexity did decrease as α decreased from 33,978 to
19,488 backtracks, while the number of inputs specified increased slightly from 8391 to
9829. The overhead increased as the number of gate evaluations increased from 2,225,243
to 2,337,844, but this is an equivalent percentage increase to that experienced by C880. The
fact that the runtime for the version 3 system was still almost 4 times that of the serial
system indicates that not as much parallelism is present in this circuit for the version 3
system as in C880.
For circuit C499, the operational complexity increased a great deal as α increased
from 13,850 to 28,023 backtracks and the number of inputs specified increased from 32,066
to 32,692. The overhead processing also increased as the number of gate evaluations
increased from 1,411,712 to 3,142,372. As a direct result, the version 3 runtime was over
4 times the serial runtime.
It is doubtful that any decrease in communications latency or overhead processing will
reduce the version 3 runtimes for C432 and C499 enough to result in speedups over the
serial system. What is needed is a method to extract more parallelism from these circuits,
perhaps by using a more intelligent partitioning algorithm.
5.5.5 Results for Larger Numbers of Processors
This section presents a brief review of the results of increasing the number of
processors used in the ATPG process with the version 3 system. As in the previous section,
these results were obtained by performing test generation on all faults in the fault list.
Table 5.4 lists the pertinent statistics for the benchmark circuits for the version 3
system with 2 and 4 partitions (processors). Two circuits, C499 and C880, experienced an
apparent increase in parallelism from the 2 partition case to the 4 partition case. For C499,
the runtime decreased from 325.8 sec to 268.9 sec in spite of an increase in operational
complexity indicated by an increase in α from 19,513 to 21,990 backtracks and in ie from
32,692 to 34,719 inputs specified. Communications time also increased as Μcomm
increased from 403,093 to 725,274. Overhead decreased slightly as the number of gate
evaluations fell from 3,142,372 to 2,842,683, but this is not enough to account for the entire
decrease in runtime.
The runtime for circuit C880 increased slightly from 56.2 sec to 60.4 sec, but this
is much less than would be expected from the increases in operational complexity,
communications time, and overhead processing.
Circuit C432 experienced an increase in runtime from 298.5 sec to 408.2 sec. This
was caused mainly by the large increase in operational complexity as α increased from
26,898 to 33,079 and ie decreased slightly. The increase in α was caused by an increase of
27 aborted faults, which also caused an increase in Μcomm from 393,381 to 1,011,463 and
an increase in the number of gate evaluations from 2,337,844 to 3,358,566.
In all cases, the changes in the number of aborted faults and the number of
backtracks required were caused by the fact that changes in the way the gates are
partitioned, and even in the order in which messages are received by a partition, can change
the area of the solution space in which the ATPG algorithm searches for a test. This effect
was discussed in section 5.5.1. Because the order in which messages are received can affect
the input ordering, the version 3 TOP_TGS system is almost non-deterministic: the results
differ slightly from run to run, as well as across partition types and numbers of partitions.
The next chapter details the effort to increase the amount of parallelism used in the
TOP_TGS system by adding additional methods of parallelism.
Table 5.4 Comparison of Results for 2 and 4 Partitions

                       2 processors                                 4 processors
Circuit    τ       α       ie      Μcomm     ge          τ       α       ie      Μcomm      ge
74181      10.85   56      3583    21037     144673      15.89   742     3057    47671      89855
C432       298.5   26898   9829    393381    2337844     408.2   33079   9294    1011463    3358566
C499       325.9   19513   32692   403093    3142372     268.9   21990   34719   725274     2842683
C880       56.21   83      16385   53662     324754      60.40   200     16784   120206     345092

(τ = runtime in seconds; α = backtracks; ie = inputs specified; Μcomm = messages; ge = gate evaluations)
Chapter 6
Fault Partitioning
This chapter details the results of adding fault partitioning as an additional
parallelism to the TOP_TGS system. In spite of the enhancements and parallelizations
discussed in the previous chapter, there are frequently times in which all of the work in
ATPG is being done by one processor alone. In fact, optimization of the circuit partitioning
algorithm to reduce communications requirements will force this result directly. If, during
these processor idle times, other useful work were available to the processor, speedup
would increase. The amount of this speedup is determined by how much idle time is in fact
available and how well it is used. Performing ATPG simultaneously on several faults for
which test generation is required would reduce this idle time. The use of fault
partitioning to accomplish this goal was first proposed in [17]. The next section details the
changes to the TOP_TGS system necessary to implement the fault partitioning scheme.
6.1 Fault Partitioning Implementation
The majority of the changes necessary to implement fault partitioning in the
TOP_TGS system were made in the master and test_generator objects. In both of these
objects, the necessary data structures were changed from single values to arrays with one
element per fault. Examples include the global primary input stack and backtrack limit in
the master, and the individual gate output values and global objective in the test_generator.
Each of the routines which performs a specific test generation task was modified to include
a fault number as an argument. This fault number is used as an index into the arrays
discussed above as the task is performed. There were a number of methods in both the
master and test_generator objects that had to be modified in this way.
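A minimal sketch of this conversion is shown below. It is illustrative only; the array bounds, the objective representation, and the member names (MAX_FAULTS, MAX_GATES, gate_values) are assumptions, since the text does not give the actual declarations.

    // With fault partitioning, every piece of per-fault ATPG state becomes an
    // array indexed by the fault number that each routine now receives.
    const int MAX_FAULTS = 8;        // assumed limit on simultaneous faults
    const int MAX_GATES  = 4096;     // assumed circuit size limit

    struct objective { int gate_no; char value; };    // assumed representation

    class test_generator {
        char      gate_values[MAX_FAULTS][MAX_GATES]; // per-fault gate outputs
        objective global_objective[MAX_FAULTS];       // per-fault objective
    public:
        void initialize(int fault_no);   // reset the state for one fault
        void imply(int fault_no);        // each task routine takes the fault
        void backtrace(int fault_no);    // number and indexes the arrays above
    };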
There were two other modifications required in the TOP_TGS system to implement
fault partitioning. The first modification concerned the initialization process. Initialization
is required before the ATPG process begins and after test generation is completed on each
fault. Initialization consists of setting all gate outputs back to an ‘X’ value, setting the
global objective to NULL, and emptying the global input stack. In the original TOP_TGS
system, initialization was handled by simply invoking the initialize() method in each
test_generator and waiting for it to return. This simple method in fact serialized the
initialization process, but the impact on performance was negligible. In the new system
with fault partitioning, this method is unacceptable because it would prevent the master
object from processing the faults during this time and would in fact lead to deadlock.
Deadlock is possible because both the master and test_generator would have routines that
use the synchronous communications protocol discussed in section 2.3.1.3.
In order to correct this problem, the initialization procedure was changed to an
asynchronous process. The master issues initialize(int fault_no) calls to all of the
test_generators. Each test_generator initializes the variables in the array for that fault
number and then signals the master when it is done. This signaling is done through a new
method in the master called mark_done_init(int test_gen_no, int fault_no). Once all of the
test_generators have signaled that they have completed initialization for that fault number,
the ATPG process is started.
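A sketch of the resulting handshake follows. The initialize() and mark_done_init() methods are those named above; the counter, the test_generator array, and start_atpg() are assumed bookkeeping.

    // Master side: issue all initialize() calls without waiting, then start
    // ATPG on the fault only after every test_generator has signaled back.
    void master::start_initialization(int fault_no) {
        init_count[fault_no] = 0;                    // assumed per-fault counter
        for (int tg = 0; tg < num_test_gens; tg++)
            test_gens[tg]->initialize(fault_no);     // asynchronous invocation
    }

    // Invoked asynchronously by each test_generator when it finishes.
    void master::mark_done_init(int test_gen_no, int fault_no) {
        init_count[fault_no]++;
        if (init_count[fault_no] == num_test_gens)
            start_atpg(fault_no);                    // assumed ATPG kickoff
    }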
The other modification necessary for fault partitioning was the changes to the
master necessary to perform selection of the next fault. Ideally, the next fault selected
should be one which will provide work for the processor with the most idle time. This could
be done by selecting a fault which is located in the database of that processor. However, in
order to simplify implementation, the current method used is to simply select the next fault
on the list when a processor becomes idle. This compromise is not believed to have a
significant impact on performance, but it might decrease efficiency somewhat.
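The current policy amounts to a simple linear scan of the fault list, sketched below (the fault record fields and variable names are hypothetical):

    // When a processor becomes idle, hand it the next untested, undetected
    // fault from the master fault list.
    int master::select_next_fault() {
        while (next_fault < num_faults) {
            fault &f = fault_list[next_fault++];
            if (!f.detected && !f.tested) {
                f.tested = 1;            // mark so the fault is not selected twice
                return f.number;
            }
        }
        return -1;                       // no faults left to target
    }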
6.2 Results
The graphs in Figure 6.1, Figure 6.2, Figure 6.3, and Figure 6.4 present the
runtimes for the benchmark circuits versus the number of faults that are processed in
parallel. Data is presented for both the 2 partition case and the 4 partition case. Note that in
all cases, the runtimes decrease continuously as the number of parallel faults increases.
However, the slopes tend to level out at about 4 parallel faults. This reduction in speedup
is due to the fact that eventually, the majority of the idle time in the processors is being used
effectively.
In the case of circuit 74181, the runtime for 2 partitions actually increases as the
number of parallel faults increases from 4 to 6. This increase is due to the fact that, in this
case, the total number of backtracks, α, required for ATPG increased. Recall that in the version 3
system, the order in which inputs are placed in the global stack can change, which can cause
the system to search nonproductive areas in the solution space. The addition of more
parallel faults can change the timing and thus the order of the inputs. This change can
reduce the efficiency of the search. This reduction in efficiency accounts for the increase in
α and thus the increase in runtime.
The change in input ordering can also change the number of aborted faults
encountered during test generation. In the 4 partition case for C432, this effect accounts for
the majority of the dramatic decrease in runtime from 408.2 sec to 295.1 sec as the number
of parallel faults increased from 1 to 2. The addition of 1 parallel fault decreased the number
of aborted faults from 121 to 95, which in turn decreased the runtime. Further decreases in
runtime as the number of parallel faults increased beyond 2 were due to decreased
processor idle time as the number of aborted faults stabilized at around 95.

Figure 6.1 Runtimes vs. number of parallel faults for circuit 74181 (runtime in seconds; 2 and 4 partitions).
For circuits C880 and C499, all of the reduction in runtime was due to the use of
processor idle time, as the number of aborted faults and α were fairly constant. Overall, the
addition of fault partitioning decreased the runtimes by 26% to 43%. The reductions were
larger in the 4 partition case, which is logical because 4 processors would have more
processor idle time.
More research will be required to determine if a better fault selection scheme will
have a significant impact on runtime, but the results presented in this section suggest that
adding additional parallelisms to the system is an effective method of reducing runtimes.
For the problem of ATPG on a single hard-to-detect fault, the addition of search space
parallelism [28] should be as effective.

Figure 6.2 Runtimes vs. number of parallel faults for circuit C432 (runtime in seconds; 2 and 4 partitions).
Figure 6.3 Runtimes vs. number of parallel faults for circuit C499 (runtime in seconds; 2 and 4 partitions).
Figure 6.4 Runtimes vs. number of parallel faults for circuit C880 (runtime in seconds; 2 and 4 partitions).
Chapter 7
Conclusions and Future Work
The goal of this research was to develop a topological partitioning ATPG system
based upon the PODEM algorithm and analyze its results. The analysis was intended to aid
in determining the effect of decreasing communications latencies on the performance of
the ATPG system. The following sections present conclusions based on the results
and recommendations for future work.
7.1 Conclusions
The major task necessary for this research was the development of the topologically
partitioned ATPG system based upon the PODEM algorithm. This task required the porting
of the original TGS system to the ESP environment, characterization of the performance
of the ES-KIT 88K processor (see Appendix A), and development of a fault partitioning
parallel version of the TGS system (see Appendix B). Development of the TOP_TGS
system from the ES-TGS system required the development of over 5,000 lines of new code.
All of these systems were developed in a parallel processing environment which was still
in its early state of development. The environment was therefore imperfect in its
implementation and it did not include any debugging support.
Development of a topologically partitioned ATPG system also included
development of algorithms to partition the circuit topologically across processing nodes.
Two new partitioning algorithms were developed and proven to be more efficient than
those previously developed. Criteria used for algorithm selection included memory
efficiency and message passing overhead.
Three algorithmic enhancements were developed to increase the efficiency of the
ATPG process and introduce parallelism. The optimistic backtrace procedure increased the
amount of work that was done between backtrace messages. This increase in work per
message decreased the number of messages required for test generation and thereby
decreased the ATPG runtimes. The multiple backtrace procedure introduced parallelism
into the ATPG process by allowing multiple backtrace operations to be processed
simultaneously. Finally, the optimistic propagation procedure introduced more parallelism
by allowing local implications and objective selections to be processed in parallel with
backtrace operations. Implementation of these enhancements to the TOP_TGS system
involved generating over 7500 lines of new code.
The results of the addition of these parallelizations were mixed. Some circuits
experienced a dramatic decrease in runtimes such that the topologically partitioned ATPG
system was faster than the original serial version. In these cases, decreasing the
communications latency by running the system on a higher performance machine, and
decreasing the amount of overhead processing in the distributed algorithm should yield
useful speedups on 2 to 4 processors.
Other circuits exhibited an increase in computational complexity caused by an
increase in aborted faults which reduced or negated the benefits of the parallelizations. The
increase in backtracks was caused by the fact that the enhancements caused the ATPG
algorithm to search unproductive portions of the solution space. In all cases, faults with any
backtracks did not achieve as much improvement as those without backtracks. This result
was caused by the serial nature of the backtracking process. It is believed that decreasing
the likelihood of backtracking by improving the partitioning algorithm and the ATPG
search heuristics will have a large beneficial impact on the runtimes for these circuits.
Included in this work was an analysis of the ATPG process as a binary search. The
result of this analysis was a unique model of the search process which can be used to predict
the number of different types of operations that must be performed to generate a test.
Experiments were performed that verified the accuracy of this model and proved that it can
be used to predict the complexity of ATPG on the entire fault list given the results for a few
faults. This model was then used to predict the change in performance of the topological
partitioning ATPG system given a change in communications latency.
The model was also used to determine the increase in processing time of the
distributed ATPG system that was due to increased overhead. This analysis showed that the
overhead was a significant contributor to the increased runtimes of the distributed system
versus the serial system and it will have to be reduced in order for the speedups to become
useful.
The final contribution of this research was to develop a system which could be used
as the basis of future research and to identify in detail the avenues of investigation that must
be undertaken to continue development. The next section details this future work.
7.2 Future Work
The initial work towards improving the results of the topologically partitioned
ATPG system should be directed towards the partitioning algorithm. Improving the
partitioning algorithm could have a great impact on the number of messages required for
ATPG as well as the operational complexity of test generation. Keeping all nodes in a
reconvergent fanout loop within one partition should reduce the number of bad
assumptions made in the optimistic propagation procedure and reduce the amount of
speculative work that is done incorrectly. It may also decrease the number of backtracks
required for test generation.
One method which may improve the partitioning algorithm by keeping
reconvergent fanout loops in the same partition is to use the supergate partitioning method
developed in [17]. Another technique which can be used to improve the partitioning
algorithm is to use a simulated ATPG approach in calculating the cost vectors used in the
optimization portion of the algorithm. This approach is based on data suggesting that
measuring the number of messages required to do ATPG on a small number of faults will
yield a good approximation of the number of messages required for full ATPG. Simulated
ATPG involves simulating a small number of ATPG operations such as implication and
backtracing and determining the number of messages required to perform them. These
numbers can then be used to estimate the number of messages required for full ATPG
which is used as a measure of the quality of the current partition.
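A sketch of how such a cost estimate might look follows. All of the names here (estimate_message_cost, simulate_backtrace, simulate_implication) are hypothetical; the dissertation only outlines the idea.

    // Score a candidate partition by simulating a few ATPG operations and
    // counting the partition-boundary crossings (i.e., messages) they need.
    long estimate_message_cost(const partition &p,
                               const fault sample_faults[], int num_samples) {
        long messages = 0;
        for (int i = 0; i < num_samples; i++) {
            messages += simulate_backtrace(p, sample_faults[i]);
            messages += simulate_implication(p, sample_faults[i]);
        }
        return messages;   // lower estimated message count => better partition
    }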
The second area which deserves some work in the topologically partitioned ATPG
system is the overhead processing. The overhead involved in the calculation of the D
frontier could possibly be reduced by using a polling method of communication instead of
the ring method used now. In the polling method, the master would simply broadcast a
message to all of the test_generators requesting the current D frontier value. The
test_generators would calculate their D frontiers and send them back to the master. The
master would then make the decision as to which processor had the D frontier. This method
would reduce the extra calculation of the D frontier necessary in the ring method, and the
calculations of the individual frontiers would be done in parallel. It would, however,
increase the processing load on the master, possibly causing it to become a bottleneck.
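A sketch of the polling protocol follows; request_d_frontier() and the reply path are hypothetical method names patterned on the description above.

    // Master broadcasts a D frontier request; each test_generator computes
    // its local D frontier in parallel and replies asynchronously.
    void master::poll_d_frontier() {
        replies_outstanding = num_test_gens;         // assumed counter
        for (int tg = 0; tg < num_test_gens; tg++)
            test_gens[tg]->request_d_frontier();     // parallel local computation
    }

    // Reply handler: once all replies are in, the master decides which
    // processor actually holds the D frontier.
    void master::d_frontier_reply(int tg_no, int local_d_frontier) {
        local_frontiers[tg_no] = local_d_frontier;   // assumed reply table
        if (--replies_outstanding == 0)
            choose_d_frontier();                     // assumed decision routine
    }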
Reduction of the overhead involved in the implication procedure is a more difficult
task. There is no apparent way to process the gates in a levelized order in the distributed
system without incurring a large increase in communications overhead. It may be possible
to make the implication procedure more intelligent by determining controlling and non-
controlling values and using them to determine when gate evaluation is really necessary.
However, this approach would have to be added to the serial system for comparison, and
would increase its speed as well. The benefits to the distributed system may be higher than
for the serial system because of the large number of gate evaluations involved.
There are many improved ATPG algorithms that have been developed subsequent
to PODEM, but as discussed in Chapter 2, most of them are PODEM based and can be
added to the TOP_TGS system. For the sake of comparison, the improvements would also
have to be added to the serial system. However, it is hoped that the improvement to the
distributed system, in decreasing bad decisions and backtracks, would be greater.
The algorithmic improvements generally fall into three categories, those that can be
added to the system at no cost, those that will certainly incur additional cost, and those that
may or may not incur costs depending on the implementation. Here the reference to cost
most frequently means extra communications overhead in the distributed system. Table 7.1
summarizes some of these algorithmic improvements and the categories into which they
fall.
The use of headlines and basis nodes as pseudo primary inputs during ATPG as in
FAN and TOPS could be added to the TOP_TGS system without incurring any additional
message overhead. The same could be said for the use of search strategy switching. The
addition of these techniques would then most certainly increase the speed of the distributed
system.
The X-path check and E-frontier techniques would definitely add additional
message requirements to the TOP_TGS system. Their addition could result in speedup or
slowdown depending on how much improvement they made versus the increase in message
overhead.
The variable cost improvements such as redundancy analysis and improved
implication may result in increased overhead if they are allowed to generate additional
message traffic. However, the improvements in these categories could be implemented to
work only within the partition boundaries and not generate additional messages. This
restriction would decrease the usefulness of the improvements, but it may provide a better
trade-off in utility versus message overhead.
More research is necessary into the addition of other methods of parallelization to
the TOP_TGS system. Although the results of adding fault partitioning to the system were
not as good as hoped, the addition of search space partitioning may prove to be more
beneficial. It would certainly help to speed up processing of hard-to-detect faults, which now
seem to be the biggest problem for the TOP_TGS system. It may also be more profitable to
try fault partitioning on the unenhanced version of the TOP_TGS system. This version has
few inherent parallelisms, and the increased processor idle time may be used more
effectively by fault partitioning. Fault partitioning may also be more effective in the
enhanced TOP_TGS system if the partitioning algorithm is improved.

Table 7.1 Algorithmic Improvements

No Cost                                 Additional Cost              Variable Cost
headlines {FAN}, [4]                    X-path check {PODEM}, [3]    redundancy analysis {EST}, [15]
basis nodes {TOPS}, [6]                 E-frontier {EST}, [15]       improved implication {FAN, Socrates}, [4], [7]
search strategy switching {S3}, [41]                                 unique sensitization {FAN, Socrates}, [4], [7]
multiple backtrace {FAN}, [4]
dominators {TOPS}, [6]
Finally, investigations into hardware acceleration of the topologically partitioned
ATPG process should be undertaken. A parallel reduction network (PRN) [42] could be
used to accelerate processing of certain critical synchronization points. The PRN could be
used to calculate the D frontier and distribute the results back to the processors. The PRN
could also be used in determining the idle status of the test_generator objects. Both of these
tasks are among the most message-intensive parts of the ATPG process, and handling them
with a PRN could greatly increase the speed of test generation.
References
[1] Levitt, Marc E., ASIC Testing Upgraded, IEEE Spectrum, pp. 26-29, May 1992.
[2] Roth, J. Paul, Diagnosis of Automata Failures: A Calculus and a Method, IBM Journal of Research and Development, Vol. 10, pp. 278-291, July 1966.
[3] Goel, Prabhakar, An Implicit Enumeration Algorithm to Generate Tests for Combinational Logic Circuits, IEEE Transactions on Computers, Vol. C-30, No. 3, pp. 215-222, March 1981.
[4] Fujiwara, Hideo, Takeshi Shimono, On the Acceleration of Test Generation Algorithms, IEEE Transactions on Computers, Vol. C-32, No. 12, pp. 1137-1144, December 1983.
[5] Fujiwara, Hideo, S. Toida, The Complexity of Fault Detection: An Approach to Design for Testability, Proceedings of the 12th International Symposium on Fault Tolerant Computing, pp. 101-108, June 1982.
[6] Kirkland, Tom, M. Ray Mercer, A Topological Search Algorithm for ATPG, Proceedings of the 24th ACM/IEEE Design Automation Conference, pp. 502-508, 1987.
[7] Schulz, Michael H., Erwin Trischler, Thomas M. Sarfert, SOCRATES: A Highly Efficient Automatic Test Pattern Generation System, IEEE International Test Conference, pp. 1016-1026, September 1987.
[8] Waicukauski, John A., Paul A. Shupe, David J. Giramma, Arshad Matin, ATPG for Ultra-Large Structured Designs, Proceedings of the 1990 IEEE International Test Conference, pp. 44-51, September 1990.
[9] Mallela, S., S. Wu, A Sequential Circuit Test Generation System, Proceedings International Test Conference, pp. 57-61, November 1985.
[10] Ma, H. K. T., S. Devadas, A. R. Newton, Test Generation for Sequential Circuits, IEEE Transactions on Computer-Aided Design, Vol. 7, No. 10, pp. 1081-1093, October 1988.
[11] Tham, Kit Yoke, Parallel Processing of CAD Applications, IEEE Design and Test of Computers, pp. 13-17, October 1987.
[12] Grimshaw, A., An Introduction to Parallel Object-Oriented Programming with MENTAT, Computer Science Report No. TR-91-07, University of Virginia.
[13] Patil, Srinivas, Prith Banerjee, A Parallel Branch and Bound Algorithm for Test Generation, IEEE Transactions on Computer-Aided Design, Vol. 9, No. 3, pp. 313-322, March 1990.
[14] Klenke, Robert H., Ronald D. Williams, James H. Aylor, Parallel Processing Techniques for Automatic Test Pattern Generation, IEEE Computer, pp. 71-84, January 1992.
[15] Giraldi, John, Michael L. Bushnell, EST: The New Frontier in Automatic Test Pattern Generation, Proceedings of the 27th ACM/IEEE Design Automation Conference, pp. 667-672, June 1990.
[16] Brglez, Franc, Hideo Fujiwara, A Neutral Netlist of Ten Combinational Benchmark Circuits and a Target Translator in FORTRAN, IEEE International Symposium on Circuits and Systems, Special Session on ATPG, June 1985.
[17] Jokl, James A., Topologically-Partitioned Automatic Test Pattern Generation, Ph.D. Dissertation, University of Virginia, September 1991.
[18] Bell, Robert H., Jr., Robert H. Klenke, James H. Aylor, Ronald D. Williams, Results of a Parallel Automatic Test Pattern Generation System for a Distributed-Memory Multicomputer, Proceedings of the Fifth Annual IEEE ASIC Conference, September 1992 (scheduled to appear).
[19] Kirkland, Tom, M. Ray Mercer, Algorithms for Automatic Test Pattern Generation, IEEE Design and Test of Computers, pp. 43-55, June 1988.
[20] Fujiwara, Hideo, Logic Testing and Design for Testability, The MIT Press, Cambridge, Massachusetts, 1986.
[21] Hayes, John P., Fault Modeling, IEEE Design and Test, pp. 37-44, April 1985.
[22] Chandra, Susheel J., Janak H. Patel, Test Generation in a Parallel Processing Environment, IEEE International Conference on Computer Design, pp. 11-14, 1988.
[23] Patil, Srinivas, Prith Banerjee, Fault Partitioning Issues in an Integrated Parallel Test Generation System, IEEE International Test Conference, pp. 718-726, September 1989.
[24] Chandra, Susheel J., Janak H. Patel, Experimental Evaluation of Testability Measure for Test Generation, IEEE Transactions on Computer-Aided Design, Vol. 8, No. 1, pp. 93-97, January 1989.
[25] Chandra, Susheel J., Janak H. Patel, Test Generation in a Parallel Processing Environment, IEEE International Conference on Computer Design, pp. 11-14, 1988.
[26] Wah, Benjamin W., Guo-jie Li, Chee Fen Yu, Multiprocessing of Combinatorial Search Problems, IEEE Computer, pp. 93-108, June 1985.
[27] Rao, V. Nageshwara, Vipin Kumar, Parallel Depth First Search, Part I: Implementation, International Journal of Parallel Processing, Vol. 16, No. 6, pp. 479-499, June 1987.
[28] Patil, Srinivas, Prith Banerjee, A Parallel Branch and Bound Algorithm for Test Generation, Proceedings of the 26th ACM/IEEE Design Automation Conference, pp. 339-344, June 1989.
[29] Patil, Srinivas, Prith Banerjee, A Parallel Branch and Bound Algorithm for Test Generation, IEEE Transactions on Computer-Aided Design, Vol. 9, No. 3, pp. 313-322, March 1990.
[30] Smith, Steven P., Bill Underwood, M. Ray Mercer, An Analysis of Several Approaches to Circuit Partitioning for Parallel Logic Simulation, IEEE International Conference on Computer Design, pp. 664-667, 1987.
[31] Bollinger, S. Wayne, Scott F. Midkiff, An Investigation of Circuit Partitioning for Parallel Test Generation, IEEE VLSI Test Symposium, April 1992.
[32] Feng, Tse-Yun, A Survey of Interconnection Networks, IEEE Computer, pp. 12-27, December 1981.
[33] Karp, Alan H., Programming for Parallelism, IEEE Computer, pp. 43-57, May 1987.
[34] Smith, K. Stuart, Robert J. Smith, II, Guy S. Caldwell, Claudia Porter, William J. Leddy, Arjun Khanna, Arunodaya Chatterjee, Ying T. Hung, Douglas W. Hahn, Wayne P. Allen, Experimental Systems Project at MCC, MCC Technical Report Number ACA-ESP-089-89, March 2, 1989.
[35] Stroustrup, Bjarne, The C++ Programming Language, Addison-Wesley Publishing Company, Reading, Massachusetts, 1987.
[36] Halstead, Robert H., Jr., Guo-jie Li, Chee Fen Yu, Parallel Symbolic Computing, IEEE Computer, pp. 35-43, August 1986.
[37] Abramovici, M., P. R. Menon, D. T. Miller, Critical Path Tracing - An Alternative to Fault Simulation, Proceedings of the 20th ACM/IEEE Design Automation Conference, pp. 214-220, June 1983.
[38] Goldstein, Lawrence H., Controllability/Observability Analysis of Digital Circuits, IEEE Transactions on Circuits and Systems, Vol. CAS-26, No. 9, pp. 685-693, September 1979.
[39] Kirkpatrick, S., C. D. Gelatt, Jr., M. P. Vecchi, Optimization by Simulated Annealing, Science, Vol. 220, No. 4598, pp. 671-680, May 13, 1983.
[40] Burton, F. Warren, Speculative Computation, Parallelism, and Functional Programming, IEEE Transactions on Computers, Vol. C-34, No. 9, pp. 1190-1193, December 1985.
[41] Min, Hyoung Bok, Strategies for High Performance Test Generation, Ph.D. Dissertation, University of Texas at Austin, December 1990.
[42] Reynolds, Paul F., An Efficient Framework for Parallel Simulations, International Journal of Computer Simulation (to appear).
Appendix: A
The ES-KIT Parallel Processing System*
The following appendix briefly describes the Experimental Systems Kit (ES-KIT)
parallel processor and the object-oriented programming environment that runs on it. The
ES-KIT is a distributed memory parallel processing environment developed to facilitate
experimentation into new parallel computer architectures and application specific
computing nodes. The system includes the ES-KIT 88K parallel processor and the
Extensible Software Platform (ESP) software system. A more detailed description of the
ES-KIT system can be found in [34].
A.1 ES-KIT Hardware
The ES-KIT 88K parallel processor is a 16 node 2D mesh architecture. Each node
in the mesh consists of a Processor Module, a Memory Module, a Message Interface
Module (MIM) and a Bus Terminator Module. The Processor Module contains the node’s
central processor, which consists of a Motorola 88100 RISC microprocessor and two 88200
MMUs. The Memory Module houses the node’s main memory space, which consists of
8 MB of RAM in the standard configuration (a maximum of 32 MB is possible). The
Message Interface Module contains another 88100 processor which performs all
computation necessary to interface between the Processor Module and the mesh
communications subsystem which is based on an Asynchronous Message Routing Device
(AMRD). The AMRD allows bidirectional communication on the mesh with all routing
functions being handled independently of the node’s processors. The Bus Terminator
Module includes the system clock and several UARTs used for debugging as well as
termination for the 88K high speed bus over which the boards that make up a node
communicate. A Sun 3/140 running the UNIX BSD 4.1 operating system acts as a host for
the 88K parallel processor.
A.2 ES-KIT Software
The Extensible Software Platform is written in C++ and provides complete support
for object-oriented parallel programming. The environment consists of four major
components, the Interservice Support Daemon (ISSD), the mail daemon, the shadow
process, and the ESP kernels. ESP runs on the ES-KIT 88K processor or a network of Sun3
workstations.
The ISSD is the heart of the total run-time system. It is the major interface between
the ES-KIT environment and the outside world. The ISSD controls the starting and the
termination of applications and all communication with peripheral devices including
terminal and disk I/O. It allows a variety of node configurations that include a
heterogeneous combination of 88K nodes and Sun 3 systems.
The ESP kernel is the portion of the system that actually runs the applications. It
performs memory management, message packing and unpacking, and task switching for
the objects. Additional functionality is provided by several Public Service Objects (PSOs)
which run as user processes on top of the kernel. The kernel runs as a standard UNIX
process on the SUNs and on top of a rudimentary OS on the ES-KIT 88K processor.
The mail daemon is responsible for routing messages between individual or groups
of ESP kernels. Each daemon is connected to all of the kernels in its group, every other
mail daemon in the configuration and the ISSD. This connectivity allows messages to be
passed from kernel to kernel with minimum handling. The shadow process runs on the Sun
host and is responsible for reading application source/object code and managing terminal
I/O.
The ESP supports the C++ programming language. Some language modifications
are made to account for the distributed nature of the hardware. The software system allows
objects to be located on arbitrary nodes and supports the concept of remote method
invocation. Remote method invocation is similar to a remote procedure call. Objects
which will be placed on remote nodes must be derived from a C++ class called
“remote_base”. Each object derived from remote_base can be referenced by its ‘handle’,
or address, for the purpose of method invocation. Message passing is used to implement
remote method invocation and return values.
The concept of ‘futures’ allows the caller of a remote method to remain unblocked
while the method is running on a different node. If the caller needs the result of the method
at a later time, it can declare the return value as a future which allows it to carry out some
computation while the result is being generated. Futures are implemented in ESP as a C++
class.
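A sketch of how these pieces fit together follows. Only remote_base, handles, and the notion of a future are described above; run_podem(), do_other_work(), and the future<int> spelling are hypothetical, and the actual ESP syntax may differ.

    // A class whose instances may be placed on remote nodes derives from
    // remote_base and is referenced through its handle.
    class test_generator : public remote_base {
    public:
        int run_podem(int fault_no);                 // hypothetical remote method
    };

    void run_overlapped(test_generator *tg_handle, int fault_no) {
        // Declaring the return value as a future keeps the caller unblocked
        // while the method executes on another node.
        future<int> result = tg_handle->run_podem(fault_no);
        do_other_work();                             // overlaps remote computation
        int backtracks = result;                     // blocks only if not yet ready
    }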
A.3 System Characterization
Experiments were conducted to characterize and study the performance of the ES-
Kit system (both hardware and software). The main purpose of the characterization
experiment was to establish a reference that would guide a programmer in the design and
coding of algorithms that make efficient use of the hardware and software resources
provided by the machine.
The performance metrics studied included:
• Message speed linearity
• Message speed under load
• Object-oriented paradigm overhead
• File I/O
The following sections describe each of the above experiments in detail.
A.3.1 Message Speed Linearity
The purpose of this experiment was to study the linearity of the message fabric in
terms of the time taken to transport messages of increasing size. A sender/receiver pair of
nodes was set up with the sender sending messages (a byte stream) to the receiver upon
request. The exchange was timed by the receiver node and included the time for the
placement of the request which remained constant throughout the experiment. The time
to set up the message was not included. In order to smooth out variations in the
measurements, each experiment was run several times and an average value was taken. The
results are shown in Figure A.1.
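A sketch of the receiver's measurement loop follows; every name here is hypothetical, and only the structure of the experiment comes from the text.

    // Time a request/transfer exchange of msg_size bytes, averaged over
    // several repetitions to smooth out variations.
    double mean_transport_time(sender *s, int msg_size, int repetitions) {
        double total = 0.0;
        for (int i = 0; i < repetitions; i++) {
            double start = read_clock();       // assumed timer routine
            s->request_bytes(msg_size);        // place the request (constant cost)
            receive_message();                 // wait for the byte stream
            total += read_clock() - start;
        }
        return total / repetitions;
    }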
Figure A.1 Message speed linearity (message transport time vs. message size).

The graph shows a fairly linear response by the message fabric. The spikes are due
to the fragmentation of the message memory and the consequent need for coalescing. The
fragmentation occurs because the experiments were run in a loop and not as new processes.
The graph also shows that the average message rate is approximately 1 Mbyte/sec.

A.3.2 Message Speed under Load

In the previous experiment, the sender-receiver pair operated on a traffic-free mesh.
In this experiment, however, multiple sender-receiver pairs were created to produce traffic
on the mesh. The traffic caused contention for the communication links as the sender and
receiver of a pair were placed on separate nodes. The sender-receiver pairs exchanged
10,000 byte messages. The time measurement was similar to the previous experiment.

Table A.1 presents the results of the experiment in which 8 sender-receiver pairs
were set up as shown in Figure A.2. The solid arrows in the figure represent the actual
hardware connections between the nodes while the dotted arrows represent the logical
connections established as a part of this experiment. For example, node 0 sends to node 15,
node 1 to node 14, and so on. Table A.2 presents the results of a similar experiment with
16 sender-receiver pairs. Again, the sender and receiver of a pair lie on distinct nodes. As
expected, both tables show a reduction in the average byte transfer rate. One point to note
is that the transfer rates are fairly uniform across all sender-receiver pairs.

Table A.1 8 Sender-Receiver pairs

Sender/Recvr    Mean time, sec (15 runs)    Rate (KB/sec)
7 - 0           0.0275                      363.6
6 - 1           0.0281                      355.8
5 - 2           0.0282                      354.6
4 - 3           0.0282                      354.6

Table A.2 16 Sender-Receiver pairs

Sender/Recvr    Mean time, sec (15 runs)    Rate (KB/sec)
15 - 0          0.0371                      269.5
14 - 1          0.0387                      270.5
13 - 2          0.0363                      275.4
12 - 3          0.0357                      280.3

Figure A.2 Traffic pattern on mesh (nodes 0-15 in a 4x4 grid; solid arrows show the physical connections, dotted arrows the logical sender-receiver connections).

A.3.3 Object-Oriented Paradigm Overhead

The ES-Kit supports an object-oriented paradigm that allows an object to invoke
methods of class objects located on remote nodes. This feature is allowed only if the user-
defined classes are derived from the remote_base class. The aim of this experiment was to
measure the overhead of invoking methods of classes derived from remote_base by
comparing the time for method invocation against the time for an ordinary procedure call.
The experiment involved making 100 procedure calls and 100 method invocations in
separate loops and measuring the times for each.
The system will allow approximately 2.5 million procedure calls/sec as would be
expected from a 15 MIPS processor, but it allows only approximately 2500 method
invocations/sec. This result suggests that the ES-Kit supports coarse-grained parallelism
with large objects. Object methods on the ES-KIT should be designed to encompass large
sections of the total functionality of the object because of the large overhead inherent in
method invocation.
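A sketch of the comparison follows; the timer and the do-nothing routines are hypothetical.

    // Time 100 ordinary procedure calls against 100 remote method
    // invocations on a remote_base-derived object.
    void measure_invocation_overhead(test_object *remote_obj) {
        const int N = 100;

        double t0 = read_clock();              // assumed timer routine
        for (int i = 0; i < N; i++)
            plain_procedure();                 // ordinary local call
        double per_call = (read_clock() - t0) / N;

        t0 = read_clock();
        for (int i = 0; i < N; i++)
            remote_obj->null_method();         // remote method invocation
        double per_invocation = (read_clock() - t0) / N;
        // Reported rates: ~2.5 million calls/sec vs. ~2500 invocations/sec.
    }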
A.3.4 File I/O
Nodes on the ES-Kit access files on the host via the ISSD interface. This
connection can become a potential bottleneck for applications involving extensive file I/O
from the nodes. The performance of the file system in reading from and writing to ISSD
files on the host is presented in Figures A.3A and A.3B. The experiments were run for
increasing byte sizes, with multiple runs per size to smooth out variations in the results. The
results indicate that file I/O on the ES-KIT takes place at an average rate of approximately
50 KBytes/sec. This data suggests that file I/O on the ES-KIT will be time consuming, and
applications should be designed to minimize the necessity of file operations.
Figure A.3A File read times (read time vs. transfer size). Figure A.3B File write times (write time vs. transfer size).

* (this appendix was coauthored by Sanjay Srinivasan)
Appendix: B
ES-TGS Parallel Fault Partitioning System
This section describes the ES-KIT TGS system called ES-TGS. ES-TGS is derived
from the TGS system developed by researchers at the University of Southern California and
Mississippi State University. The TGS system uses the PODEM algorithm for test
generation and the Critical Path Tracing algorithm [37] for fault simulation. The parallel
ES-TGS system is based on the fault partitioning method of parallelization.
B.1 Scalar ES-TGS System
The first step in adapting the TGS system for use in the ESP environment was to
convert the system to the C++ language and encapsulate the functionality into objects. The
architecture of the resulting system is shown in Figure B.1. The system is composed of 5
objects: the test generator object, the fault simulator object, the master object, and the
reader and writer objects.

Figure B.1 Scalar ES-TGS system architecture (reader, writer, master, fault simulator, and test generator objects).

The master object controls the entire application. It begins by instantiating and
initializing the other objects. It then builds the master fault list from the circuit description.
Once this process is completed, the master enters a loop in which it selects a candidate fault
for test generation from the undetected faults on the fault list and sends the fault to the test
generator for ATPG. The resulting test vector is sent to the fault simulator, which
determines the list of detected faults to be marked off of the master fault list. The master
then selects another candidate fault to repeat the process. This loop continues until the
desired fault coverage of 100% is reached, or until every remaining fault has either been
marked as detected or has had ATPG attempted on it at least once.
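A sketch of that loop follows; aside from the objects described above, the method names are hypothetical.

    // Scalar master loop: generate a test, fault simulate it, mark detected
    // faults, and repeat until the stopping condition is met.
    void master::run() {
        int fault_no;
        while ((fault_no = select_next_fault()) != -1) {   // assumed selector
            test_vector v = test_gen->generate(fault_no);  // PODEM on one fault
            if (v.found) {                                 // a test was generated
                fault_list d = fault_sim->simulate(v);     // critical path tracing
                mark_detected(d);                          // cross faults off the list
                writer->store(v);                          // keep the useful vector
            }
            if (coverage() >= 1.0)                         // 100% fault coverage
                break;
        }
    }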
The test generator object performs the PODEM test generation algorithm. It
receives a fault from the master object, generates a test for the fault if possible, and then
returns the resulting test back to the master. The test generator uses the SCOAP [38]
testability measure to guide the ATPG process. A backtrack limit is included for redundant
or hard-to-detect faults.
The fault simulator object uses the critical path tracing algorithm. It receives the
test vector produced by the test generator from the master object. It performs fault
simulation on the vector and returns the list of detected faults to the master.
The reader and writer objects were added to the TGS system to deal with the limited
I/O performance of the ES-KIT. The reader object was added to read in the circuit netlist
database from the host file system into memory and distribute it to all of the objects that
need it. This operation eliminates the performance penalty of multiple reads from the host
disk. The reader object is no longer needed after all of the other objects have the netlist and,
at that time, it is deleted. The writer object stores the successful test vectors from the master
object and writes them out to disk when the ATPG process is complete. This organization
eliminates the performance penalty of having one disk access in each of the test generation/
fault simulation loops performed by the master object.
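The batching pattern used by the writer object amounts to the following: accumulate vectors in memory during the run and pay the host-file-system cost once at the end. A minimal sketch with hypothetical names, not the ES-TGS source:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Accumulates test vectors in memory and writes them to the host
    // file system in a single pass when ATPG completes.
    class Writer {
    public:
        void store(const std::string& vec) { vectors_.push_back(vec); }

        // Called once, after the last fault has been processed.
        bool flush(const char* path) {
            FILE* f = std::fopen(path, "w");
            if (!f) return false;
            for (const std::string& v : vectors_)
                std::fprintf(f, "%s\n", v.c_str());
            return std::fclose(f) == 0;
        }
    private:
        std::vector<std::string> vectors_;
    };

    int main() {
        Writer w;
        w.store("01001101");  // vectors arrive one at a time from the master
        w.store("11100010");
        return w.flush("tests.out") ? 0 : 1;
    }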
The scalar version of the ES-TGS system has been run on the ISCAS '85 [16]
benchmark circuits in the Sun3-based ESP environment. The test vectors generated were
verified to be the same as those of the original C version. The application has also been run
on the ES-KIT 88K processor; however, the larger ISCAS circuits required too much memory
to be run on a single 88K node. Table B.1 summarizes the runtimes for the original TGS
system running on Sun3 and Sun4 processors and for the ES-TGS system running on the
Sun3 and on the ES-KIT 88K processor. Note that the runtimes include only the time taken
for the ATPG portion of the application and do not include file I/O or fault collapsing times.
The data shows that there appears to be little or no performance penalty for running
in the ESP environment. This conclusion can be drawn from a comparison of the runtimes
of the original C version and the ESP C++ version running on the same Sun3 platform.
However, there is little communication between the objects in the scalar ES-TGS system,
so the communication overhead of the ESP environment has not yet become a factor. The
increase in performance between the ESP Sun3 version and the ES-KIT 88K version is due
to the increased performance of the 88K CPU. The same is true for the runtimes of the Sun3
versus Sun4 C versions. The runtimes for circuit C17 in the ESP environment, on both the
Sun3 and the 88K, are dominated by setup time, which includes the time to read the circuit
database from the host file system.

ISCAS '85    Sun3 C     ES-KIT C++      ES-KIT C++     Sun4 C
Circuit      Version    Sun3 Version    88K Version    Version
C17          1.14       19.4            3.19           0.15
C432         276.8      213.2           60.1           33.6
C499         220.3      167.8           44.5           27.3
C880         327.3      304.4           108.8          39.7
C1355        958.9      656.5           171.6          120.8
C1908        2428.9     1662.3          278.3          338.8
C3540        13285.0    11971.0         2142.3         2911.6

Table B.1 Scalar ES-TGS ATPG times (sec)
B.2 Parallel ES-TGS Version 1
The first step in parallelization of the ES-TGS system was simply to instantiate
multiple test generator and fault simulator objects. In the simplest version, there is only one
object on each processing node, either a test generator or a fault simulator. Half of the
available processing nodes are used to perform test generation and half are used to perform
fault simulation. The master object performs essentially the same function as in the scalar
system with a few minor differences. When a test generator becomes (or starts out) idle, the
master selects an untested and undetected fault from the fault list. This fault is marked as
tested and transmitted to the idle test generator. When a successful test vector is generated,
it is returned to the master where it is stored in a queue along with the fault under test. If
there is an idle fault simulator, the master sends that test vector to the idle fault simulator
for fault simulation. The master then selects another fault and restarts the test generator.
When a fault simulator returns its list of detected faults, the master object marks those faults
as detected. If the vector detects some new faults, it is sent to the writer object for storage.
The master then looks for a test vector in the queue to send to the fault simulator. If it finds
one, the vector is transmitted and the fault simulator is restarted; if not, the fault simulator
remains idle until the next vector is generated by a test generator. Notice that in this
system, the master object must manage all of the intermediate steps in the process including
those necessary for object initialization. The throughput required of the master object keeps
its processing node very busy. The writer object is also busy storing vectors. For this
reason, it was found that the writer object and the master object could not reside on the same
node; if they did, one of the objects' incoming message queues would grow too large. This
requirement that the writer object and the master object reside on individual nodes
effectively eliminates two nodes of any configuration from the list of available processing
nodes. Thus, our system of 16 nodes could have at most 14 test generator and/or fault
simulator objects in this version of ES-TGS. The overall architecture of a system consisting
of 7 nodes is shown in Figure B.2. In this system, the reader and writer objects and the
master object would reside on individual nodes and the test generator and fault simulator
objects would reside on the remaining processing nodes.
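The master's bookkeeping can be sketched as a pair of message handlers. This is a simplified, hypothetical rendering of the behavior described above, not the ES-TGS source; the send_*() calls stand in for ESP messages and are left undefined.

    #include <deque>
    #include <vector>

    struct Vec { std::vector<int> bits; int fault_id; };

    // Skeleton of the Version 1 master's state and dispatch logic.
    struct Master {
        std::deque<Vec> vector_queue;        // vectors awaiting fault simulation
        std::vector<int> idle_simulators;    // node ids of idle fault simulators
        std::vector<bool> detected, tested;  // global fault list state

        void send_fault(int node, int fault_id);
        void send_vector(int node, const Vec& v);
        void send_to_writer(const Vec& v);
        int  next_untested_fault();          // returns -1 if none remain

        // A test generator produced a vector: queue it or dispatch it,
        // then immediately restart the generator on a new fault.
        void on_vector(int gen_node, const Vec& v) {
            if (!idle_simulators.empty()) {
                int sim = idle_simulators.back();
                idle_simulators.pop_back();
                send_vector(sim, v);
            } else {
                vector_queue.push_back(v);
            }
            int f = next_untested_fault();
            if (f >= 0) { tested[f] = true; send_fault(gen_node, f); }
        }

        // A fault simulator finished: update the fault list, keep the
        // vector if it detected anything new, and refill the simulator.
        void on_detected(int sim_node, const Vec& v,
                         const std::vector<int>& new_faults) {
            for (int f : new_faults) detected[f] = true;
            if (!new_faults.empty()) send_to_writer(v);
            if (!vector_queue.empty()) {
                send_vector(sim_node, vector_queue.front());
                vector_queue.pop_front();
            } else {
                idle_simulators.push_back(sim_node);  // idle until next vector
            }
        }
    };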
One decision which needs to be made in this system is how many of the available
processing nodes should be test generators and how many should be fault simulators. The
results obtained with the static scheduling version of the ES-TGS Version 1 system
indicated that an answer was impossible to determine a priori and depended a great deal on
how many hard-to-detect faults were present in a circuit. Some circuits have a preponderance of
hard-to-detect faults and require the majority of the time to be spent in test generation,
while others require the majority of the time to be spent in fault simulation. Most circuits,
however, lie somewhere in between: they require mostly fault simulation at the beginning
of the ATPG process and mostly test generation in its latter part. In order to address this
problem, the Version 1 system was
modified to instantiate one fault simulator and one test generator object on each of the
available processing nodes in the system. Since there is no multitasking in the ES-KIT
operating system, a processing node can only be performing one function at a time. A
dynamic scheduling system using the test vector queue in the master object is used to
determine how many processing nodes will perform each type of task. If the test vector
queue in the master becomes empty, an idle fault simulator node (if there are any) is
selected to become a test generator. If the test vector queue is full, an idle test generator is
selected to become a fault simulator. This system results in higher overall utilization of the
processing nodes, but the task switching adds still more computation load to the already
overworked master object. Therefore, the overall runtimes were only slightly smaller. A
minimal sketch of this switching rule is shown below.
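The rule can be expressed in a few lines of C++ (names and the queue-full threshold are illustrative only, not taken from the ES-TGS source):

    #include <cstddef>

    enum class Role { TestGen, FaultSim };

    // Decide the next role for a node that has just gone idle, based on
    // the master's test vector queue. QUEUE_FULL is an illustrative bound.
    inline Role next_role(Role current, std::size_t queue_len) {
        const std::size_t QUEUE_FULL = 32;
        if (queue_len == 0 && current == Role::FaultSim)
            return Role::TestGen;    // nothing to simulate: generate tests
        if (queue_len >= QUEUE_FULL && current == Role::TestGen)
            return Role::FaultSim;   // queue is full: drain it
        return current;              // otherwise keep the current role
    }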
Figure B.2 Parallel ES-TGS system architecture (Version 1).

Table B.2 summarizes the results of the ES-TGS Version 1 system without
task switching on several of the ISCAS ‘85 benchmark circuits. Note that the number of
processing nodes is the number of nodes which are used for test generation or fault
simulation. In each case, there are two more nodes in the configuration, one for the reader
and writer objects, and one for the master object.
These results show that the Version 1 system attains reasonable speedup only for 2
to 4 processing nodes.

               ATPG Times (sec)                      Speedup
ISCAS '85      No. of Processing Nodes               No. of Processing Nodes
Circuit        2        4        8        10         2      4      8      10
C432           30.99    18.98    16.72    17.85      1.93   3.16   3.59   3.36
C499           28.18    19.43    18.99    18.58      1.61   2.34   2.39   2.44
C880           46.47    41.43    44.95    46.54      1.34   2.63   2.42   2.34
C1355          119.74   71.19    64.65    61.39      1.43   2.41   2.65   2.85
C1908          237.14   138.22   114.65   110.56     1.17   2.01   2.43   2.52
C3540          1550.38  791.70   451.56   406.93     1.86   2.70   4.74   5.26

Table B.2 ES-TGS Version 1 ATPG times
B.3 Parallel ES-TGS Version 2
The main limit to speedup in the Version 1 system was the fact that the master
object was overloaded. In order to address this fact, a “mini-master” object was created to
off-load some of the intermediate processing requirements. One mini-master object is
created on each processing node along with one test generator/fault simulator pair. The
mini-master handles all intermediate processing between the test generator/fault simulator
objects and the master object including the initialization steps. Figure B.3 shows the
architecture of the resulting Version 2 system for a configuration of 5 nodes. In this
system, the reader and writer objects again reside on node 1,1 and the master object resides
on node 1,2. The remaining processing nodes contain one test generator object, one fault
simulator object and one mini-master object.
During the ATPG phase, the mini-master receives a fault from the master. It gives
this fault to the test generator to perform ATPG. When the test generator is successful, it
returns a valid test vector to the mini-master. The mini-master then sends this vector to the
fault simulator. When the fault simulator completes, it sends the resulting list of detected
faults directly to the master. This step is necessary because the master object still maintains
the global fault list. The direct transmission from the fault simulator to the master
eliminates the redundant messages which would be required if the list were sent to the mini-
master first and then on to the master. If the master determines that the vector is good (i.e.,
it detects some as-yet-undetected faults), it signals the mini-master, which then transmits
the vector directly to the writer object for storage.
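Expressed in outline form, the mini-master's per-fault cycle is sketched below. The names are hypothetical and each call would be an ESP message in the real system; note that the detected-fault list bypasses the mini-master on its way to the master.

    #include <optional>
    #include <vector>

    struct Vec { std::vector<int> bits; };

    struct MiniMaster {
        // Local objects on the same node (message endpoints in reality).
        std::optional<Vec> (*generate)(int fault);          // test generator
        std::vector<int>   (*simulate)(const Vec&);         // fault simulator
        void (*report_to_master)(const std::vector<int>&);  // detected faults
        void (*send_to_writer)(const Vec&);                 // good vectors

        // Handle one fault received from the master.
        void on_fault(int fault) {
            std::optional<Vec> v = generate(fault);
            if (!v) return;                     // aborted or redundant fault
            std::vector<int> detected = simulate(*v);
            // The simulator's result goes directly to the master, which
            // owns the global fault list; the master later tells us whether
            // the vector was "good" so we can forward it to the writer.
            report_to_master(detected);
        }
        void on_good_vector(const Vec& v) { send_to_writer(v); }
    };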
Table B.3 shows the results for the Version 2 system on the ISCAS ‘85 benchmark
circuits. Notice that the runtimes for larger numbers of processors scale much better than
those of the Version 1 system. In some cases, the speedup is greater than the number of processing
nodes. This is possible because the master processor node is not included in the processing
node count.
Figure B.3 Parallel ES-TGS system architecture (Version 2).

               ATPG Times (sec)                              Speedup
ISCAS '85      No. of Processing Nodes                       No. of Processing Nodes
Circuit        2        4        8        12       14        2      4      8      12     14
C432           32.98    12.57    6.94     6.11     5.77      1.01   2.59   4.70   5.33   5.65
C499           20.76    7.72     4.50     4.06     4.07      1.58   4.24   7.28   8.07   8.05
C880           22.36    9.88     9.92     10.37    10.70     1.25   2.83   2.82   2.70   2.61
C1355          145.24   55.01    28.04    23.39    24.03     1.16   3.07   6.03   7.23   6.96
C1908          262.32   112.06   70.60    59.21    56.37     1.13   2.61   4.23   5.05   5.30
C3540          2038.04  704.72   306.70   206.40   181.20    1.03   2.98   6.85   10.2   11.6

Table B.3 ES-TGS Version 2 ATPG times
Figure B.4 shows a comparison of the speedup for the Version 2 system versus the
Version 1 system on the C3540 ISCAS '85 benchmark circuit. This benchmark is a medium-
sized circuit with 1669 gates and 3428 single stuck-at faults. The data shows that the
Version 2 system achieves reasonable speedup on this circuit for up to 14 processors.
Speedups for the other ISCAS '85 benchmark circuits are not quite as good, but in the
general case, the Version 2 system performs much better than the Version 1 system and
achieves good speedup for 4 to 8 processors.

Figure B.4 Version 1 vs. Version 2 speedup for circuit C3540.
B.4 Parallel ES-TGS Version 3
The final optimization that was made to the ES-TGS system was to eliminate the
need for the master object during the ATPG phase. This change was accomplished by
dividing up the fault list among the mini-master objects. The mini-masters manage their
own portion of the fault list completely independently. The mini-master selects an untested
fault from its own fault list and sends it to the test generator. The resultant vector is sent to
the fault simulator. The list of detected faults is then marked off of the mini-master’s
portion of the fault list. The portion of the list of detected faults which is not contained on
the mini-master’s own fault list is not transmitted to the other mini-masters. This approach
will result in some redundant work as the other mini-masters generate tests for these already
detected faults. The resulting set of test vectors is also larger. This redundant work is the
major limit to speedup in this Version 3 system. This system is modeled after the static
scheduling system without communication presented in [23]. The runtimes in this system
are also increased somewhat by the fact that some mini-master objects finish their fault list
ahead of the others. However, this inefficiency seems to be a minor one compared to the
one resulting from the generation of the redundant vectors. Table B.4 contains the results
for the Version 3 system on the ISCAS ‘85 benchmark circuits. Note that in this system
there is one more node in the configuration available for placing a mini-master and test
generator/fault simulator group. This configuration is possible because the master object is
used only for initialization and can be placed on the same node as the reader and writer
objects.

               ATPG Times (sec)                              Speedup
ISCAS '85      No. of Processing Nodes                       No. of Processing Nodes
Circuit        2        4        8        12       14        2      4      8      12     14
C432           22.90    18.02    12.84    11.17    9.71      1.28   1.63   2.29   2.64   3.03
C499           15.38    11.21    9.86     7.98     7.49      1.27   1.74   1.98   2.45   2.61
C880           27.07    19.98    13.44    10.30    9.47      1.20   1.63   2.43   3.17   3.45
C1355          115.82   83.71    52.25    43.34    38.74     1.23   1.69   2.72   3.27   3.66
C1908          193.66   160.94   130.36   108.30   85.44     1.32   1.60   1.97   2.37   3.01
C3540          1298.73  813.79   471.50   399.97   317.85    1.60   2.55   4.40   5.19   6.53

Table B.4 ES-TGS Version 3 ATPG times
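The static split of the fault list described above is straightforward to express. The following is a minimal sketch of an interleaved partitioning; the helper name is hypothetical, and the actual ES-TGS partitioning may differ in detail.

    #include <cstddef>
    #include <vector>

    // Deal the collapsed fault list out to P mini-masters round-robin.
    // Each mini-master processes its share with no further communication,
    // which is why a test generated in one partition may redundantly
    // cover faults that belong to other partitions.
    std::vector<std::vector<int>> partition_faults(
            const std::vector<int>& fault_ids, std::size_t P) {
        std::vector<std::vector<int>> parts(P);
        for (std::size_t i = 0; i < fault_ids.size(); ++i)
            parts[i % P].push_back(fault_ids[i]);
        return parts;
    }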
Figure B.5 is a comparison of the speedup obtained for the Version 2 system versus
the Version 3 system for circuit C3540. The results for this circuit indicate that the Version
3 system does not perform as well as the Version 2 system. This result is, in general, the
case for all of the ISCAS '85 benchmark circuits. However, on circuits which have very
few hard-to-detect faults, such as C880, where the Version 2 system performs poorly, the
performance of the Version 3 system is much closer.

Figure B.5 Version 2 vs. Version 3 speedup for circuit C3540.
As stated previously, the performance of the Version 3 system is limited by the
extra work done in generating the redundant vectors. Figure B.6 is a comparison of the
number of test vectors generated by both systems for circuit C3540. Notice that for 14
processors, the Version 3 system generates over twice as many test vectors as the Version
2 system.

Figure B.6 Version 2 vs. Version 3 test set size for circuit C3540.
B.5 Conclusions
Our initial task was to develop a parallel ATPG system for a given distributed
environment. Our methodology began with the characterization of the environment to
determine the optimum grain size the environment supports. We then attempted to
implement the simplest form of parallelism, namely fault partitioning, to gather data on the
system using a real application. A successful public domain scalar ATPG system was used
as a starting point to reduce coding time and ensure algorithm correctness. This ATPG
system was then ported to the target system and run in a scalar mode to determine operating
system overhead, which turned out to be insignificant (compared to Unix). Once this was
completed, parallelization of the ATPG process was undertaken.
The Version 1 system, although simple to implement, is limited by the amount of
processing the master object must perform to manage the entire ATPG process. Simple
analysis of the Version 1 system using Amdahl's law,

$$ S_N = \frac{1}{f + \frac{1 - f}{N}} $$

where $S_N$ is the speedup, $f$ is the fraction of code that is sequential, and $N$ is the
number of processing nodes, indicates that approximately 10% to 33% of the code is
sequential. This result corresponds to a maximum speedup of only 2.5 to 4.0. This analysis
assumes that
the serial portion of the code is the only source of inefficiency and ignores the small amount
of extra work performed in this system generating redundant vectors. The sequential
portion of the code is the management of the ATPG process and the maintaining of the fault
list by the master object. The different figures for the percentage of sequential execution
and hence speedup are caused by the difference in the difficulty of test generation between
the various circuits. A circuit which has many hard-to-detect faults will have significant
processing to be done by the test generator objects in proportion to the work done by the
master in maintaining the fault list. In this case the speedup will not be as limited as in the
case where the faults are all easy and the processing done by the master dominates.
The Version 2 system attempted to reduce the serial portion of the code by off-
loading the management of the intermediate steps of the ATPG process to a mini-master
object on each node. The removal of a relatively small portion of the serial code from the
master yielded a significant increase in the best-case speedup, 11.6, which corresponds to
1.6% of the code being executed serially. The worst-case speedup for the Version 2 system,
2.61, is little better than that of the Version 1 system and corresponds to 33.5% serial code.
In both systems the worst case occurred for circuit C880, which has no hard-to-detect
faults and hence keeps the master object constantly busy managing the fault list.
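The serial fractions quoted above follow from inverting Amdahl's law. Solving for $f$ in terms of a measured speedup $S_N$ gives

$$ f = \frac{N / S_N - 1}{N - 1} $$

so that, with $N = 14$, the best-case speedup of 11.6 yields $f = (14/11.6 - 1)/13 \approx 0.016$, and the worst-case speedup of 2.61 yields $f = (14/2.61 - 1)/13 \approx 0.336$.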
Analysis of the Version 3 system can be done by assuming that the only inefficiency
is due to the extra work done in generating redundant vectors. Since the mini-masters each
have their own portion of the fault list which they process independently, there is no serial
code in the ATPG process. The time taken to generate the tests for a circuit by this system,
given $N$ processing nodes, is given by:

$$ T_N = \frac{T_1 \times C_{calc}}{N} $$
where $T_N$ is the execution time for $N$ processors, $T_1$ is the execution time for the
serial case, and $C_{calc}$ is a factor which takes into account the inefficiency introduced
by the redundant work. Speedup is then given by:

$$ S_N = \frac{T_1}{T_N} = \frac{T_1}{T_1 \times C_{calc} / N} = \frac{N}{C_{calc}} $$

and $C_{calc}$ can be calculated by:

$$ C_{calc} = \frac{N}{S_N} $$
Since $C_{calc}$ is a measure of the redundant work, it should be related to the number of
processors, $N$, and it should also be possible to estimate it from the number of test vectors
generated by the parallel ATPG system. If we define a new factor, $C_{meas}$, as the ratio
of the number of test vectors generated by the parallel system to the number generated by
the serial system, we have:

$$ C_{meas} = \frac{V_N}{V_1} \approx C_{calc} $$
where $V_N$ is the number of test vectors generated with $N$ processing nodes. Table B.5
is a tabulation of $C_{calc}$ and $C_{meas}$ for increasing numbers of processing nodes for
several of the ISCAS '85 benchmark circuits. For example, for circuit C3540 with $N$ = 14,
Table B.4 gives $S_{14}$ = 6.53, so $C_{calc}$ = 14/6.53 ≈ 2.14, which corresponds closely
to the measured vector ratio $C_{meas}$ = 2.36. The data shows a strong correspondence
between $C_{calc}$ and $C_{meas}$. Any differences can be attributed to the additional
inefficiency in the Version 3 system due to imperfect load balancing, which this analysis
ignores. In fact, values of $C_{calc}$ calculated using the average per-processor ATPG time,
rather than the maximum ATPG execution time (as was done for Table B.5), show a much
stronger correlation with $C_{meas}$.
The Version 3 system achieved significantly lower speedups than the Version 2
system in most cases. The exceptions were the circuits with few hard-to-detect faults, such
as C880, on which the Version 2 system performed most poorly. The large amount of
redundant work done by the Version 3 system is the main limit to speedup. This redundant
work can be reduced or even eliminated by having each processor communicate its list of
detected faults to all of the other processors each time fault simulation is performed.
However, this approach produces significant communication traffic and has not been shown
to increase speedup [8]. Dynamic load balancing can be used to address the load balancing
problem, but in general, a Version 3 type system will not scale to large numbers of
processors.
               No. of Processing Nodes
               2               4               8               12              14
ISCAS '85
Circuit        Ccalc  Cmeas    Ccalc  Cmeas    Ccalc  Cmeas    Ccalc  Cmeas    Ccalc  Cmeas
C432           1.61   1.47     2.45   2.31     3.49   3.02     4.54   3.43     4.62   3.57
C499           1.57   1.53     2.30   2.43     4.04   3.46     4.90   4.00     5.36   4.12
C880           1.66   1.38     2.45   1.94     3.29   2.61     3.78   3.02     4.05   3.21
C1355          1.62   1.48     2.36   1.95     2.94   2.45     3.66   2.80     3.82   3.02
C1908          1.51   1.34     2.50   2.01     4.06   2.81     5.06   3.35     4.65   3.50
C3540          1.25   1.31     1.56   1.62     1.81   1.98     2.31   2.22     2.14   2.36

Table B.5 Comparison of Ccalc vs. Cmeas