

Semiconductor Defect Classification using Hyperellipsoid Clustering Neural Networks and Model Switching

Keisuke Kameyama
Interdisciplinary Graduate School of Sci. and Eng.
Tokyo Institute of Technology, Yokohama 226-8502, Japan

Yukio Kosugi
Frontier Collaborative Research Center
Tokyo Institute of Technology, Yokohama 226-8503, Japan

    Abstract

An automatic defect classification (ADC) system for visual inspection of semiconductor wafers, using a neural network classifier, is introduced. The proposed Hyperellipsoid Clustering Network (HCN), employing a Radial Basis Function (RBF) in the hidden layer, is trained with additional penalty conditions for recognizing unfamiliar inputs as originating from an unknown defect class. Also, by using a dynamic model alteration method called Model Switching, a reduced-model classifier which enables efficient classification is obtained. In the experiments, the effectiveness of the unfamiliar input recognition was confirmed, and a classification rate sufficiently high for use in the semiconductor fab was obtained.

    1. Introduction

Visual inspection plays an important role in the manufacturing processes of semiconductors. The disorders found on the wafer surface, such as the one shown in Fig. 1, are commonly referred to as defects. The motive for defect classification is to find out the process stages and the sources that are causing them. Early detection of the sources of defects is essential in order to maintain high product yield and quality.

By replacing the review process typically conducted by human experts, it is also aimed to improve both the stability and speed of inspection. In the literature, it is reported that the classification accuracies of human experts are typically 60-80% [1]. If this stage of visual inspection could be automated, it would greatly contribute to enhancing the productivity of the semiconductor fab.

The task of classifying the defect image features has several specific conditions inherent to the particular problem. Most distinctive among them is the fact that the user does not have the freedom of collecting a sufficient number, or an appropriate selection, of training images. Also, the numbers of training samples per class are extremely unbalanced.

Figure 1. A defect found on a semiconductor wafer (scale bar: 5 um).

When the number of samples for a defect class is small, approaches whose decisions rely on all samples, such as radial basis function (RBF) networks [10][12] or the joint use of nonparametric estimation of the probability distribution function by Parzen's method [11] and Bayes classification, perform well. However, for a class with many samples, these methods are computationally costly. In this case, instead of using all the training samples for classification, methods based on distances from class-cluster prototypes, such as the nearest neighbor algorithm [2] and learning vector quantization [9], and those based on class borders, such as multilayer perceptrons (MLP) [14] and support vector machines [15], are computationally more efficient. So-called reduced variants of the above nonparametric methods, such as the generalized RBF networks [12] and reduced Parzen classifiers [3], are also methods depending on the distances from the prototypes.


In this work, a three-layered neural network named the Hyperellipsoid Clustering Network (HCN), having hidden layer units of RBF type, will be used. In addition to the parameter adjustment by the backpropagation (BP) method [14], a model alteration method called Model Switching (MS) [7], which allows the map acquired by training to be inherited by the new model, is used during the training process for efficiently obtaining an appropriate reduced model.

The second requirement of the system is to classify the known defect classes without fail, and not to make wild guesses on unfamiliar defects. Such cases should be pointed out as unclassifiable and left for the human expert to see. Since the training set will usually provide answers in only a small portion of the feature space, inputs in the remaining open space should be treated as unknown. For recognizing unfamiliar inputs, the HCN was trained with additional penalty conditions, so that the sizes of the hyperellipsoid kernels are kept small, to tightly enclose the clusters formed by the training samples.

In Sec. 2, the HCN will be introduced, together with its training method and the output interpretation method for recognition of unfamiliar inputs. In Sec. 3, the idea of Model Switching for allowing dynamic model alteration during training will be reviewed. The defect classes and the outline of the automatic defect classification (ADC) system will be explained in Sec. 4. In Sec. 5, the network and the ADC system will be evaluated by applying them to the classification of the defect image sets, and the paper will be concluded in Sec. 6.

    2. Hyperellipsoid clustering network (HCN)

The three-layered network model used for classifying the feature vectors is illustrated in Fig. 2. The network has L inputs, N hidden units and O output units. The potential of the n-th hidden layer unit is defined as

u_n(x) = r_n^2 - \|H_n (x - m_n)\|^2,   (1)

with the following parameters to be adjusted in the training:

r_n : radius parameter.
m_n : center vector.
H_n : weight matrix.

The transfer function of the hidden layer unit is the well-known sigmoid function. Thus, the output of unit n is

h_n = \sigma(u_n) = 1 / (1 + e^{-u_n}).   (2)

Figure 2. The Hyperellipsoid Clustering Network (HCN): the input vector x enters the input layer (units 1, ..., l, ..., L); the hidden layer (units 1, ..., n, ..., N) applies hyperellipsoid discriminants with sigmoid transfer, with parameters (H_n, m_n, r_n); the linear output layer (units 1, ..., k, ..., O), with connection weights w_k, produces the output vector y.

Figure 3. An example of the kernel functions h_1 and h_2 made by the joint use of (hyper)ellipsoid discriminants and sigmoid functions, plotted over the (x_1, x_2) plane with outputs in [0, 1.0].

A unit in the output layer takes the fan-out of the hidden layer units and calculates the weighted sum with no bias as

y_k = w_k^T h = \sum_{n=1}^{N} w_{kn} h_n,   (3)

where w_k = (w_{k1}, ..., w_{kN})^T is the weight vector of the k-th output unit, and h = (h_1, ..., h_N)^T. The weight vector is also modified in the training process.

By employing the discriminant in Eq. (1), the discrimination boundary in the feature space will always be a hyperellipsoid. Since the unit potential u_n in Eq. (1) depends on the distance between the input x and the center vector m_n, the network is an RBF network. However, in contrast with the popular Gaussian RBF network [12], various profiles of the kernel function are possible by controlling the gain [4] of the sigmoid function with the radius parameter r_n, as shown in Fig. 3. This network model, using the hyperellipsoid discriminant and the sigmoid function in the hidden layer, will be referred to as the Hyperellipsoid Clustering Network (HCN).
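To make the forward computation of Eqs. (1)-(3) concrete, the following is a minimal NumPy sketch of the HCN forward pass. The function name hcn_forward and the array layouts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hcn_forward(x, H, m, r, W):
    """Sketch of a Hyperellipsoid Clustering Network forward pass.

    x : (L,) input feature vector
    H : (N, L, L) weight matrices H_n
    m : (N, L) center vectors m_n
    r : (N,) radius parameters r_n
    W : (O, N) output connection weights w_k
    """
    N = m.shape[0]
    h = np.empty(N)
    for n in range(N):
        d = H[n] @ (x - m[n])      # mapped offset H_n (x - m_n)
        u = r[n] ** 2 - d @ d      # hyperellipsoid potential, Eq. (1)
        h[n] = sigmoid(u)          # hidden unit output, Eq. (2)
    y = W @ h                      # linear output layer, no bias, Eq. (3)
    return y, h
```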


The training method used in the HCN is based on the batched BP law with momentum terms [14]. The error criterion is defined as

E = \sum_{p=1}^{P} E_p,   E_p = \frac{1}{2} \| t_p - y_p \|^2,   (4)

with P, E_p, t_p and y_p denoting the cardinality of the training set, the error for the p-th training pair, the p-th training (target) output vector and the p-th output vector, respectively.

For enabling a tight bounding by hyperellipsoids to implement the recognition of the unfamiliar inputs, the volume of the hyperellipsoids should be kept small as long as it does not harm the achievement of training. This can be done by setting some penalty term to restrict the radius of the hyperellipsoids. The distance from the center to the edge of the hyperellipsoid in the direction of the i-th principal component can be written as |r_n| / \sqrt{\lambda_i}, where \lambda_i is the i-th eigenvalue of the matrix H_n^T H_n, which is always positive. Thus, a penalty to suppress the absolute value of the radius parameter r_n can be considered effective. Also, a term to prevent the eigenvalues from becoming too small was necessary. This second restriction was implemented indirectly by preventing the Euclidean norm of the matrix H_n from becoming too small. Consequently, the modification measures to the weight matrix H_n and the radius parameter r_n were formulated as

\Delta H_n = \Delta^{BP} H_n + \alpha H_n / \|H_n\|   (5)

and

\Delta r_n = \Delta^{BP} r_n - \beta \, \mathrm{sgn}(r_n),   (6)

with the terms \Delta^{BP} H_n and \Delta^{BP} r_n denoting the modification measures by the plain BP training. Parameters \alpha and \beta denote the penalty term gains.
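A minimal sketch of how these penalized updates could be realized, assuming the reconstructed forms of Eqs. (5) and (6) above; the plain-BP updates are taken as given, and alpha and beta play the role of the penalty term gains.

```python
import numpy as np

def penalized_updates(dH_bp, dr_bp, H, r, alpha, beta, eps=1e-8):
    """Augment plain-BP updates with the two penalty terms (Eqs. 5-6).

    dH_bp, dr_bp : updates computed by plain backpropagation
    alpha        : gain keeping the Euclidean norm ||H_n|| from collapsing
    beta         : gain shrinking the radius magnitude |r_n|
    """
    norm = np.linalg.norm(H)
    dH = dH_bp + alpha * H / (norm + eps)  # prevent ||H_n|| from becoming too small
    dr = dr_bp - beta * np.sign(r)         # suppress the absolute radius |r_n|
    return dH, dr
```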

The network will be trained to respond with a class-specific unit vector. Since the output is the weighted sum of the kernel functions of the hidden layer units, it can be justified to reject an output vector that does not have a significant winner. In such a case, the input pattern should be classified as originating from an unknown class. Therefore, the output interpretation

class(x) = argmax_k y_k   if max_k y_k >= \theta,
           unknown        otherwise,   (7)

will be used, with \theta being the membership threshold.
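The interpretation of Eq. (7) amounts to a thresholded argmax; a minimal sketch:

```python
import numpy as np

def classify(y, theta):
    """Interpret the output vector per Eq. (7): return the winning class
    index if its output clears the membership threshold theta,
    otherwise report the input as 'unknown'."""
    k = int(np.argmax(y))
    return k if y[k] >= theta else "unknown"
```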

    3. Model switching

As a method for obtaining a reduced network model in the learning process, a model alteration scheme called Model Switching (MS) [7] is employed. MS is a framework for dynamic model alteration during BP training for improvement of the training efficiency, by avoiding local minima and reducing the redundancy in the network model.

Definition 1 (Model Switching) On altering the neural network model, methods which determine the moment or the occasion of model alteration by taking into account both of the two following factors:

1. The nature and fitness of the new model and the initial map candidate within the new model.

2. The status of the immediate model and map.

will be referred to as Model Switching (MS).

In this work, MS will be used to reduce the number of hidden layer units in the HCN, in which the training is initially started with a model having the same number of hidden units as training samples. Pruning algorithms [13], which are also attempts to reduce the network size, mostly limit the occasion of model reduction to after the convergence of the training error. With MS, however, the occasion can be set at any time, as long as the fitness of the candidate for the initial map within the new model is met. When only model reduction is used in MS, only the first factor in Def. 1 needs to be considered.

The process of training by BP with MS is shown in Fig. 4. For each training epoch of BP, the fitness of the switchable candidates will be evaluated, and switching will take place when the fitness of a candidate exceeds a given threshold I_F0.
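The loop of Fig. 4 can be sketched as follows. This is a schematic, not the authors' code: the BP epoch, the candidate generation and the fitness evaluation are passed in as callables, since their internals are described in the surrounding sections.

```python
def train_with_model_switching(net, data, bp_epoch, candidates_fn, fitness_fn,
                               fitness_threshold, target_error, max_epochs=10000):
    """Schematic of BP training with Model Switching (Fig. 4).

    bp_epoch(net, data)      -> training error E after one batched-BP epoch
    candidates_fn(net)       -> switchable model-map candidate set C_MS
                                (here: all unit-fusion reductions)
    fitness_fn(net, c, data) -> fitness index I_F(f_N, f_N^i)
    """
    for _ in range(max_epochs):
        error = bp_epoch(net, data)
        if error < target_error:                 # trained: E < E_0
            break
        scored = [(fitness_fn(net, c, data), c) for c in candidates_fn(net)]
        if scored:
            best_fit, best = max(scored, key=lambda t: t[0])
            if best_fit > fitness_threshold:     # switch: model size reduction
                net = best                       # otherwise: no switching
    return net
```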

The candidate set of the new model and map was made by using the unit reduction method of unit fusion [6]. Unit fusion selects a pair of units in a layer and replaces them by a single unit. On replacement, the connection weights to the new unit are determined so that the map of the old network will be inherited by the new network.

Let us put that units indexed a and b will be fused to make a single unit c. The weighted sum of the inputs from units a and b, and the unity bias, to the subsequent layer unit can be written as

S = w_a h_a + w_b h_b + w_0
  = w_a (\bar{h}_a + \hat{h}_a) + w_b (\bar{h}_b + \hat{h}_b) + w_0,   (8)


Figure 4. BP training with Model Switching. Starting from the immediate network f_N, its parameters are modified by BP; the switchable model-map candidate set C_MS is determined, and the fitness index I_F(f_N, f_N^i) is evaluated for all f_N^i in C_MS. If max_i{I_F(f_N, f_N^i)} > I_F0, the network is switched to f_N^k with k = argmax_i{I_F(f_N, f_N^i)} (model size reduction); otherwise no switching takes place. Training ends when E < E_0.

where w, h, \bar{h} and \hat{h} are the connection weight, the unit response, the average unit response and the varying portion of the response, respectively. Generally, we can put

\hat{h}_b \approx \rho_{ab} (\sigma_b / \sigma_a) \hat{h}_a,   (9)

with \sigma and \rho_{ab} denoting the standard deviation of the unit output and the output similarity of the unit pair, respectively, both evaluated over all the training inputs. From Eqs. (8) and (9), we have

S \approx ( w_a + \rho_{ab} (\sigma_b / \sigma_a) w_b ) h_a + w_0 + w_b ( \bar{h}_b - \rho_{ab} (\sigma_b / \sigma_a) \bar{h}_a ),   (10)

implying that the connection weights should be changed as

w_c' = w_a + \rho_{ab} (\sigma_b / \sigma_a) w_b   (11)

and

w_0' = w_0 + w_b ( \bar{h}_b - \rho_{ab} (\sigma_b / \sigma_a) \bar{h}_a ),   (12)

where the primes denote the connection weights after the fusion.

Since no bias unit is used in the hidden layer of the HCN, only the compensation in Eq. (11) will be used. As unit fusion can be applied to all unit pairs in the hidden layer, N(N-1)/2 switching candidates exist. The one which is most fit will be selected by evaluating the fitness index I_F.
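A sketch of the weight compensation of Eq. (11) applied to the output layer of the HCN, assuming the reconstructed form of Eq. (11) above and that the per-unit output statistics (standard deviations and pairwise similarities) have already been measured over the training set:

```python
import numpy as np

def fuse_units(W, sigma, rho, a, b):
    """Fuse hidden units a and b into one, compensating the outgoing
    weights per Eq. (11) so that the network map is (approximately)
    inherited by the reduced network.

    W     : (O, N) output-layer weight matrix
    sigma : (N,) standard deviations of the hidden unit outputs
    rho   : (N, N) output similarities of the unit pairs
    """
    W_new = np.delete(W, b, axis=1)       # drop unit b's outgoing weights
    a_new = a if a < b else a - 1         # index of unit a after deletion
    # w'_a = w_a + rho_ab * (sigma_b / sigma_a) * w_b   (Eq. 11)
    W_new[:, a_new] = W[:, a] + rho[a, b] * (sigma[b] / sigma[a]) * W[:, b]
    return W_new
```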

The fitness of the new map will be a function of the degree of map inheritance and the closeness of the two kernels to be fused in the feature space, giving priority to the fusion of kernels that are placed close together. For evaluating the degree of map inheritance, a measure named Map Distance will be used.

Definition 2 (Map Distance) The map distance between two mapping vector functions f and f', trained with the training vector set {x_p}, is defined as

D(f, f') = \frac{1}{P} \sum_{p=1}^{P} \| f(x_p) - f'(x_p) \|,   (13)

where P is the number of training pairs.

The fitness of the candidates will be evaluated by the fitness index function

I_F(f_N, f_N^{ij}) = ( 1 - D(f_N, f_N^{ij}) / D_{max} ) ( 1 - \| m_i - m_j \| / \sqrt{L} ),   (14)

where f_N^{ij}, L and D_{max} denote the map obtained by fusing the i-th and j-th units, the dimension of the feature space, and the maximum possible map distance, respectively. It is assumed that all the feature elements are bounded to the [0, 1] domain. On actual evaluation of the map distance, the theorem approximating the map distance generated by the fusion of hidden layer units [7] was used.
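Definition 2 translates directly into code. The sketch below evaluates the map distance exactly over the training inputs, rather than via the approximation theorem of [7]:

```python
import numpy as np

def map_distance(f, g, X):
    """Map distance of Definition 2 (Eq. 13): the mean output
    discrepancy of two maps f and g over the P training inputs X.
    f and g are callables returning output vectors."""
    return float(np.mean([np.linalg.norm(f(x) - g(x)) for x in X]))
```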

4. The automatic defect classifier system (ADC) [8]

4.1. Defect classes

In this work, we will try to classify the physical defect classes that provide the most information for locating the cause of the defects. The physical defect classes dealt with in this work, and their common appearances, are listed in the following:

A. Foreign objects (FO)

This class includes defects such that external objects are found on the wafer. Defects of the FO class tend to appear as small and dark-colored regions, typically of near-circular shape.

B. Embedded foreign objects (EO)

This is the class of defects where one or more processed film layers have been stacked over a foreign object.


Figure 5. The flow of data in the ADC system: the defect image and the defectless reference image are combined (AND) to form the defect mask; shape feature extraction on the mask and color quantization of the defect region yield the shape features and color ratios, which are fed to the HCN classifier to produce the defect class.

EO class defects appear slightly larger and more irregularly shaped than those of the FO class, because the patterns of the heaped area in the covering layers are deformed by the embedded object. In addition to the characteristic dark color of the particle itself, other colors can be observed as well. Defects of the FO and EO classes can appear quite similar, and are sometimes hard to distinguish even for an expert.

C. Pattern failure (PF)

This class covers all kinds of defects that have pattern deformations without any external objects present. Defects of the PF class can also be caused by insufficient exposure or etching; thus they can have a wide variety of sizes and shapes. Since the defect is usually an extra region or a lack in the pattern of a layer, the color of the defect region tends to be one of those observed in the normal patterns.

4.2. Feature extraction

A. Shape features

The flow of data in the ADC system is shown in Fig. 5. After subtraction of the defectless reference pattern from the defect image and further graylevel thresholding, the defect mask is made. From the defect mask, the shape features of defect size and roundness are calculated.

B. Color features

The color of the defect region is characterized by quantizing the color of each pixel to one of the prototype colors.

The prototype colors are determined beforehand by applying the Median Cut Algorithm [5] to defectless images of the layer to be inspected. Also, typical defect colors are manually added as prototype colors. The ratios of the quantized colors in the defect region were used as the color feature vector of the defect.

Figure 6. An artificially generated cluster data of four classes in the unit square (axes x_1, x_2 in [0, 1]). (a) Training set (P = 100). (b) Test set (P = 1000).

In the experiments in Sec. 5, the feature dimension was 12, including the 2 shape features and 10 color features, all normalized to the unity range.
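Putting Sec. 4.2 together, a rough sketch of the feature extraction follows. The roundness formula, the crude perimeter estimate and the nearest-prototype quantization are illustrative assumptions (the paper does not spell them out), and the normalization to the unity range is omitted.

```python
import numpy as np

def extract_features(defect_img, reference_img, mask_threshold, prototypes):
    """Sketch of the ADC feature extraction (Fig. 5): a defect mask from
    image subtraction and graylevel thresholding, two shape features
    (size and roundness), and the ratios of quantized prototype colors
    inside the mask. Images are (H, W, 3) arrays; prototypes is (K, 3)."""
    # Defect mask: graylevel difference against the defectless reference.
    gray = np.abs(defect_img.mean(axis=2) - reference_img.mean(axis=2))
    mask = gray > mask_threshold

    size = mask.sum()  # defect area in pixels
    # Roundness as 4*pi*area / perimeter^2 (1.0 for a circle), with the
    # perimeter crudely estimated from mask transition pixels.
    edges = (mask ^ np.roll(mask, 1, axis=0)) | (mask ^ np.roll(mask, 1, axis=1))
    perimeter = max(edges.sum(), 1)
    roundness = 4.0 * np.pi * size / perimeter ** 2

    # Color ratios: quantize each defect pixel to its nearest prototype color.
    pixels = defect_img[mask].astype(float)  # (n_pixels, 3)
    dists = np.linalg.norm(pixels[:, None, :] - prototypes[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    ratios = np.bincount(labels, minlength=len(prototypes)) / max(len(labels), 1)

    return np.concatenate([[size, roundness], ratios])
```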

    5. Experiment

A. Membership thresholding on artificial cluster data

The effect of membership thresholding and MS was evaluated using the artificial four-class data in a 2D domain shown in Fig. 6. Three types of networks and training strategies were tried. All networks were trained to the target error E_0, to respond with class-specific unit vectors.

1. MLP with (input-hidden-output) = (2-4-4) units.

2. HCN with (input-hidden-output) = (2-100-4) units.

3. HCN trained by BP with MS for model reduction during training. Initial model: (2-100-4).

The change in the recognition rate for the test set, and the ratio of the area within the input domain which was pointed out as being of unknown class, were evaluated by changing the membership threshold \theta in Eq. (7). Ideally, the recognition rate will be maintained high even when a large portion of the input domain is judged as unknown (rejected). The result is shown in Fig. 7. It is clear that by reducing the model of the HCN by MS, a larger portion of the input domain is properly rejected without losing the classification ability for the test set.


Figure 7. The change in the recognition rate (vertical axis, 0.7 to 1) against the ratio of the rejected input domain (horizontal axis, 0 to 0.6) when the membership threshold is changed (annotated points include \theta = 0.2 and \theta = 0.9), for the MLP, the HCN, and the HCN with MS.

Table 1. The classification rate and the confusion matrix for the HCN evaluated by the leave-one-out method. In each cell, the second number (bold typeface in the original) is for the case when membership thresholding was used.

True class               | Est. FO | Est. EO | Est. PF | Unknown | Correct (%) | Error (%)
Foreign Object (FO)      | 32 / 32 |  0 / 0  |  1 / 0  |  0 / 1  | 97.0 / 97.0 |  3.0 / 0.0
Embedded Object (EO)     |  2 / 1  | 32 / 30 |  2 / 0  |  0 / 5  | 88.9 / 83.3 | 11.1 / 2.8
Pattern Failure (PF)     |  2 / 0  |  0 / 0  | 22 / 21 |  0 / 3  | 91.7 / 87.5 |  8.3 / 0.0
Average rates (weighted) |         |         |         |         | 92.5 / 89.2 |  7.5 / 1.1

B. Leave-one-out evaluation with HCN using MS

A collection of defect images obtained from the same process layer of a product was used for evaluating the ADC system. The set consisted of 33 FO class, 36 EO class and 24 PF class images. The class information for all the images was provided by an expert inspector. The classification rates were evaluated by the leave-one-out method [3]. An HCN with unit configuration (12-93-3), initialized by placing each kernel at a training input, was trained using MS. The model typically converged to reduced models with 9 to 14 hidden layer units.
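The leave-one-out protocol [3] can be sketched as follows, with the HCN training (including MS) and the thresholded classification of Eq. (7) passed in as callables, since their internals are covered above; the names are illustrative.

```python
def leave_one_out(features, labels, train_fn, classify_fn, theta):
    """Leave-one-out evaluation: each of the P samples is held out once,
    the classifier is trained on the remaining P-1 samples, and the
    held-out sample is classified.

    features : (P, D) array of feature vectors
    labels   : (P,) array of class labels
    train_fn(X, y)          -> trained network
    classify_fn(net, x, th) -> predicted class or 'unknown' (Eq. 7)
    """
    predictions = []
    P = len(labels)
    for i in range(P):
        keep = [j for j in range(P) if j != i]       # hold sample i out
        net = train_fn(features[keep], labels[keep])
        predictions.append(classify_fn(net, features[i], theta))
    return predictions
```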

The results are shown in Table 1. It is found that by employing the membership thresholding, the non-diagonal elements (errors) in the confusion matrix could be reduced drastically. The obtained classification rate is considered to be comparable to those of human experts. By reducing the network model by MS, the computation required for using the network was also reduced by 85-90% compared with the initial network model.

    6. Conclusion

    An ADC system for visual inspection of semiconductor

    wafers, using a neural network classifier was introduced.

    The Hyperellipsoid Clustering Network was introduced,

    and the training rule with cost terms for recognizing unfa-

    miliar inputs as originating from an unknown defect class

    was given. Further, by using BP training with Model

    Switching, a reduced-model classifier which enables an effi-

    cient classification was obtained. The defect classes and the

    descriptions of the extracted image features was defined. In

    the experiments, the effectiveness of the unfamiliar input

    recognition was confirmed, and a classification rate compa-

    rable to those of human experts were obtained.

    References

[1] P. B. Chou, A. R. Rao, M. C. Struzenbecker, F. Y. Wu, and V. H. Brecher. Automatic defect classification for semiconductor manufacturing. Machine Vision and Applications, 9(4):201-214, 1997.

[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

[4] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.

[5] P. Heckbert. Color image quantization for frame buffer display. Computer Graphics, 16(3):297-307, 1982.

[6] K. Kameyama and Y. Kosugi. Neural network pruning by fusing hidden layer units. Transactions of IEICE, E74(12):4198-4204, 1991.

[7] K. Kameyama and Y. Kosugi. Model switching by channel fusion for network pruning and efficient feature extraction. Proceedings of the International Joint Conference on Neural Networks 1998, pages 1861-1866, 1998.

[8] K. Kameyama, Y. Kosugi, T. Okahashi, and M. Izumita. Automatic defect classification in visual inspection of semiconductors using neural networks. IEICE Transactions on Information and Systems, E81-D(11):1261-1271, 1998.

[9] T. Kohonen. Self-Organization and Associative Memory. Springer, 1988.

[10] J. E. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989.

[11] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065-1076, 1962.

[12] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78:1481-1497, 1990.

[13] R. Reed. Pruning algorithms: a survey. IEEE Transactions on Neural Networks, 4(5):740-747, 1993.

[14] D. Rumelhart, J. L. McClelland, and the PDP Research Group. Parallel Distributed Processing. MIT Press, 1986.

[15] V. N. Vapnik. Statistical Learning Theory. Wiley, 1999.