

Semiconductor Defect Classification using Hyperellipsoid Clustering Neural Networks and Model Switching

Keisuke Kameyama
Interdisciplinary Graduate School of Sci. and Eng.
Tokyo Institute of Technology, Yokohama 226-8502, Japan

Yukio Kosugi
Frontier Collaborative Research Center
Tokyo Institute of Technology, Yokohama 226-8503, Japan

    Abstract

An automatic defect classification (ADC) system for visual inspection of semiconductor wafers, using a neural network classifier, is introduced. The proposed Hyperellipsoid Clustering Network (HCN), employing a Radial Basis Function (RBF) in the hidden layer, is trained with additional penalty conditions for recognizing unfamiliar inputs as originating from an unknown defect class. Also, by using a dynamic model alteration method called Model Switching, a reduced-model classifier which enables efficient classification is obtained. In the experiments, the effectiveness of the unfamiliar input recognition was confirmed, and a classification rate sufficiently high for use in the semiconductor fab was obtained.

    1. Introduction

Visual inspection plays an important role in the manufacturing processes of semiconductors. The disorders found on the wafer surface, such as the one shown in Fig. 1, are commonly referred to as defects. The motive for defect classification is to find out the process stages and the sources that are causing them. Early detection of the sources of defects is essential in order to maintain high product yield and quality.

By replacing the review process typically conducted by human experts, it is also aimed to improve both the stability and speed of inspection. In the literature, it is reported that the classification accuracies of human experts are typically 60-80% [1]. If this stage of visual inspection could be automated, it would greatly contribute to enhancing the productivity of the semiconductor fab.

The task of classifying the defect image features has several specific conditions inherent to the particular problem. Most distinctive among them is the fact that the user does not have the freedom of collecting a sufficient number, or an appropriate selection, of training images. Also, the numbers of training samples per class are extremely unbalanced.

Figure 1. A defect found on a semiconductor wafer (scale bar: 5 um).

When the number of samples for a defect class is small, approaches whose decisions rely on all samples, such as radial basis function (RBF) networks [10][12] or the joint use of nonparametric estimation of the probability distribution function by Parzen's method [11] and Bayes classification, perform well. However, for a class with many samples, these methods are computationally costly. In this case, instead of using all the training samples for classification, methods based on distances from class-cluster prototypes, such as the nearest neighbor algorithm [2] and learning vector quantization [9], and those based on class borders, such as multilayer perceptrons (MLP) [14] and support vector machines [15], are computationally more efficient. So-called reduced variants of the above nonparametric methods, such as the generalized RBF networks [12] and reduced Parzen classifiers [3], are also methods depending on the distances from the prototypes.


In this work, a three-layered neural network named the Hyperellipsoid Clustering Network (HCN), having hidden layer units of RBF type, will be used. In addition to the parameter adjustment by the backpropagation (BP) method [14], a model alteration method called Model Switching (MS) [7], which allows the map acquired by training to be inherited by the new model, is used during the training process for efficiently obtaining an appropriate reduced model.

The second requirement of the system is to classify the known defect classes without fail, and not to make wild guesses on unfamiliar defects. Such cases should be pointed out as unclassifiable and left for the human expert to see. Since the training set will usually provide answers in only a small portion of the feature space, inputs in the remaining open space should be treated as unknown. For recognizing unfamiliar inputs, the HCN was trained with additional penalty conditions, so that the sizes of the hyperellipsoid kernels are kept small, to tightly enclose the clusters formed by the training samples.

In Sec. 2, the HCN will be introduced, together with its training method and the output interpretation method for recognition of unfamiliar inputs. In Sec. 3, the idea of Model Switching for allowing dynamic model alteration during training will be reviewed. The defect classes and the outline of the automatic defect classification (ADC) system will be explained in Sec. 4. In Sec. 5, the network and the ADC system will be evaluated by applying them to the classification of the defect image sets, and the paper will be concluded in Sec. 6.

    2. Hyperellipsoid clustering network (HCN)

The three-layered network model used for classifying the feature vectors is illustrated in Fig. 2. The network has L inputs, N hidden units and O output units. The potential of the n-th hidden layer unit is defined as

u_n(x) = r_n^2 - \|H_n (x - m_n)\|^2,   (1)

with the following parameters to be adjusted in the training:

r_n : radius parameter.
m_n : center vector.
H_n : weight matrix.

The transfer function of the hidden layer unit is the well-known sigmoid function. Thus, the output of unit n is

h_n = \sigma(u_n) = 1 / (1 + e^{-u_n}).   (2)

Figure 2. The Hyperellipsoid Clustering Network (HCN): the input vector x enters the input layer (units 1, ..., l, ..., L); the hidden layer (units 1, ..., n, ..., N) applies hyperellipsoid discriminants with sigmoid transfer, with parameters (H_n, m_n, r_n); the linear output layer (units 1, ..., k, ..., O), with connection weights w_k, produces the output vector y.

Figure 3. An example of the kernel functions h_1 and h_2 made by the joint use of (hyper)ellipsoid discriminants and sigmoid functions, plotted over the (x_1, x_2) plane with outputs in [0, 1.0].

A unit in the output layer takes the fan-out of the hidden layer units and calculates the weighted sum with no bias as

y_k = w_k^T h = \sum_{n=1}^{N} w_{kn} h_n,   (3)

where w_k = (w_{k1}, ..., w_{kN})^T is the weight vector of the k-th output unit, and h = (h_1, ..., h_N)^T. The weight vector is also modified in the training process.

By employing the discriminant in Eq. (1), the discrimination boundary in the feature space will always be a hyperellipsoid. Since the unit potential u_n in Eq. (1) depends on the distance between the input x and the center vector m_n, the network is an RBF network. However, in contrast with the popular Gaussian RBF network [12], various profiles of the kernel function are possible by controlling the gain [4] of the sigmoid function with the radius parameter r_n, as shown in Fig. 3. This network model, using the hyperellipsoid discriminant and the sigmoid function in the hidden layer, will be referred to as the Hyperellipsoid Clustering Network (HCN).
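To make the forward computation of Eqs. (1)-(3) concrete, the following is a minimal NumPy sketch of the HCN forward pass. The function name hcn_forward and the array layouts are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hcn_forward(x, H, m, r, W):
    """Sketch of a Hyperellipsoid Clustering Network forward pass.

    x : (L,) input feature vector
    H : (N, L, L) weight matrices H_n
    m : (N, L) center vectors m_n
    r : (N,) radius parameters r_n
    W : (O, N) output connection weights w_k
    """
    N = m.shape[0]
    h = np.empty(N)
    for n in range(N):
        d = H[n] @ (x - m[n])      # mapped offset H_n (x - m_n)
        u = r[n] ** 2 - d @ d      # hyperellipsoid potential, Eq. (1)
        h[n] = sigmoid(u)          # hidden unit output, Eq. (2)
    y = W @ h                      # linear output layer, no bias, Eq. (3)
    return y, h
```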


The training method used in the HCN is based on the batched BP law with momentum terms [14]. The error criterion is defined as

E = \sum_{p=1}^{P} E_p,   E_p = \frac{1}{2} \| t_p - y_p \|^2,   (4)

with P, E_p, t_p and y_p denoting the cardinality of the training set, the error for the p-th training pair, the p-th training (target) output vector and the p-th output vector, respectively.

For enabling a tight bounding by hyperellipsoids to implement the recognition of the unfamiliar inputs, the volume of the hyperellipsoids should be kept small as long as it does not harm the achievement of training. This can be done by setting some penalty term to restrict the radius of the hyperellipsoids. The distance from the center to the edge of the hyperellipsoid in the direction of the i-th principal component can be written as |r_n| / \sqrt{\lambda_i}, where \lambda_i is the i-th eigenvalue of the matrix H_n^T H_n, which is always positive. Thus, a penalty to suppress the absolute value of the radius parameter r_n can be considered effective. Also, a term to prevent the eigenvalues from becoming too small was necessary. This second restriction was implemented indirectly by preventing the Euclidean norm of the matrix H_n from becoming too small. Consequently, the modification measures to the weight matrix H_n and the radius parameter r_n were formulated as

\Delta H_n = \Delta^{BP} H_n + \alpha H_n / \|H_n\|   (5)

and

\Delta r_n = \Delta^{BP} r_n - \beta \, \mathrm{sgn}(r_n),   (6)

with the terms \Delta^{BP} H_n and \Delta^{BP} r_n denoting the modification measures by the plain BP training. Parameters \alpha and \beta denote the penalty term gains.
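A minimal sketch of how these penalized updates could be realized, assuming the reconstructed forms of Eqs. (5) and (6) above; the plain-BP updates are taken as given, and alpha and beta play the role of the penalty term gains.

```python
import numpy as np

def penalized_updates(dH_bp, dr_bp, H, r, alpha, beta, eps=1e-8):
    """Augment plain-BP updates with the two penalty terms (Eqs. 5-6).

    dH_bp, dr_bp : updates computed by plain backpropagation
    alpha        : gain keeping the Euclidean norm ||H_n|| from collapsing
    beta         : gain shrinking the radius magnitude |r_n|
    """
    norm = np.linalg.norm(H)
    dH = dH_bp + alpha * H / (norm + eps)  # prevent ||H_n|| from becoming too small
    dr = dr_bp - beta * np.sign(r)         # suppress the absolute radius |r_n|
    return dH, dr
```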

The network will be trained to respond with a class-specific unit vector. Since the output is the weighted sum of the kernel functions of the hidden layer units, it can be justified to reject an output vector that does not have a significant winner. In such a case, the input pattern should be classified as originating from an unknown class. Therefore, the output interpretation

class(x) = argmax_k y_k   if max_k y_k >= \theta,
           unknown        otherwise,   (7)

will be used, with \theta being the membership threshold.
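The interpretation of Eq. (7) amounts to a thresholded argmax; a minimal sketch:

```python
import numpy as np

def classify(y, theta):
    """Interpret the output vector per Eq. (7): return the winning class
    index if its output clears the membership threshold theta,
    otherwise report the input as 'unknown'."""
    k = int(np.argmax(y))
    return k if y[k] >= theta else "unknown"
```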

    3. Model switching

As a method for obtaining a reduced network model in the learning process, a model alteration scheme called Model Switching (MS) [7] is employed. MS is a framework for dynamic model alteration during BP training for improvement of the training efficiency, by avoiding local minima and reducing the redundancy in the network model.

Definition 1 (Model Switching) On altering the neural network model, methods which determine the moment or the occasion of model alteration by taking into account both of the two following factors:

1. The nature and fitness of the new model and the initial map candidate within the new model.

2. The status of the immediate model and map.

will be referred to as Model Switching (MS).

In this work, MS will be used to reduce the number of hidden layer units in the HCN, in which the training is initially started with a model having the same number of hidden units as training samples. Pruning algorithms [13], which are also attempts to reduce the network size, mostly limit the occasion of model reduction to after the convergence of the training error. With MS, however, the occasion can be set at any time, as long as the fitness of the candidate for the initial map within the new model is met. When only model reduction is used in MS, only the first factor in Def. 1 needs to be considered.

The process of training by BP with MS is shown in Fig. 4. For each training epoch of BP, the fitness of the switchable candidates will be evaluated, and switching will take place when the fitness of a candidate exceeds a given threshold I_F0.
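The loop of Fig. 4 can be sketched as follows. This is a schematic, not the authors' code: the BP epoch, the candidate generation and the fitness evaluation are passed in as callables, since their internals are described in the surrounding sections.

```python
def train_with_model_switching(net, data, bp_epoch, candidates_fn, fitness_fn,
                               fitness_threshold, target_error, max_epochs=10000):
    """Schematic of BP training with Model Switching (Fig. 4).

    bp_epoch(net, data)      -> training error E after one batched-BP epoch
    candidates_fn(net)       -> switchable model-map candidate set C_MS
                                (here: all unit-fusion reductions)
    fitness_fn(net, c, data) -> fitness index I_F(f_N, f_N^i)
    """
    for _ in range(max_epochs):
        error = bp_epoch(net, data)
        if error < target_error:                 # trained: E < E_0
            break
        scored = [(fitness_fn(net, c, data), c) for c in candidates_fn(net)]
        if scored:
            best_fit, best = max(scored, key=lambda t: t[0])
            if best_fit > fitness_threshold:     # switch: model size reduction
                net = best                       # otherwise: no switching
    return net
```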

The candidate set of the new model and map was made by using the unit reduction method of unit fusion [6]. Unit fusion selects a pair of units in a layer and replaces them by a single unit. On replacement, the connection weights to the new unit are determined so that the map of the old network will be inherited by the new network.

Let us put that units indexed a and b will be fused to make a single unit c. The weighted sum of the inputs from units a and b, and the unity bias, to the subsequent layer unit can be written as

S = w_a h_a + w_b h_b + w_0
  = w_a (\bar{h}_a + \hat{h}_a) + w_b (\bar{h}_b + \hat{h}_b) + w_0,   (8)


Figure 4. BP training with Model Switching. Starting from the immediate network f_N, its parameters are modified by BP; the switchable model-map candidate set C_MS is determined, and the fitness index I_F(f_N, f_N^i) is evaluated for all f_N^i in C_MS. If max_i{I_F(f_N, f_N^i)} > I_F0, the network is switched to f_N^k with k = argmax_i{I_F(f_N, f_N^i)} (model size reduction); otherwise no switching takes place. Training ends when E < E_0.

where w, h, \bar{h} and \hat{h} are the connection weight, the unit response, the average unit response and the varying portion of the response, respectively. Generally, we can put

\hat{h}_b \approx \rho_{ab} (\sigma_b / \sigma_a) \hat{h}_a,   (9)

with \sigma and \rho_{ab} denoting the standard deviation of the unit output and the output similarity of the unit pair, respectively, both evaluated over all the training inputs. From Eqs. (8) and (9), we have

S \approx ( w_a + \rho_{ab} (\sigma_b / \sigma_a) w_b ) h_a + w_0 + w_b ( \bar{h}_b - \rho_{ab} (\sigma_b / \sigma_a) \bar{h}_a ),   (10)

implying that the connection weights should be changed as

w_c' = w_a + \rho_{ab} (\sigma_b / \sigma_a) w_b   (11)

and

w_0' = w_0 + w_b ( \bar{h}_b - \rho_{ab} (\sigma_b / \sigma_a) \bar{h}_a ),   (12)

where the primes denote the connection weights after the fusion.

Since no bias unit is used in the hidden layer of the HCN, only the compensation in Eq. (11) will be used. As unit fusion can be applied to all unit pairs in the hidden layer, N(N-1)/2 switching candidates exist. The one which is most fit will be selected by evaluating the fitness index I_F.
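A sketch of the weight compensation of Eq. (11) applied to the output layer of the HCN, assuming the reconstructed form of Eq. (11) above and that the per-unit output statistics (standard deviations and pairwise similarities) have already been measured over the training set:

```python
import numpy as np

def fuse_units(W, sigma, rho, a, b):
    """Fuse hidden units a and b into one, compensating the outgoing
    weights per Eq. (11) so that the network map is (approximately)
    inherited by the reduced network.

    W     : (O, N) output-layer weight matrix
    sigma : (N,) standard deviations of the hidden unit outputs
    rho   : (N, N) output similarities of the unit pairs
    """
    W_new = np.delete(W, b, axis=1)       # drop unit b's outgoing weights
    a_new = a if a < b else a - 1         # index of unit a after deletion
    # w'_a = w_a + rho_ab * (sigma_b / sigma_a) * w_b   (Eq. 11)
    W_new[:, a_new] = W[:, a] + rho[a, b] * (sigma[b] / sigma[a]) * W[:, b]
    return W_new
```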

The fitness of the new map will be a function of the degree of map inheritance and the closeness of the two kernels to be fused in the feature space, giving priority to the fusion of kernels that are placed close together. For evaluating the degree of map inheritance, a measure named Map Distance will be used.

Definition 2 (Map Distance) The map distance between two mapping vector functions f and f', trained with the training vector set {x_p}, is defined as

D(f, f') = \frac{1}{P} \sum_{p=1}^{P} \| f(x_p) - f'(x_p) \|,   (13)

where P is the number of training pairs.

The fitness of the candidates will be evaluated by the fitness index function

I_F(f_N, f_N^{ij}) = ( 1 - D(f_N, f_N^{ij}) / D_{max} ) ( 1 - \| m_i - m_j \| / \sqrt{L} ),   (14)

where f_N^{ij}, L and D_{max} denote the map obtained by fusing the i-th and j-th units, the dimension of the feature space, and the maximum possible map distance, respectively. It is assumed that all the feature elements are bounded to the [0, 1] domain. On actual evaluation of the map distance, the theorem approximating the map distance generated by the fusion of hidden layer units [7] was used.
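Definition 2 translates directly into code. The sketch below evaluates the map distance exactly over the training inputs, rather than via the approximation theorem of [7]:

```python
import numpy as np

def map_distance(f, g, X):
    """Map distance of Definition 2 (Eq. 13): the mean output
    discrepancy of two maps f and g over the P training inputs X.
    f and g are callables returning output vectors."""
    return float(np.mean([np.linalg.norm(f(x) - g(x)) for x in X]))
```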

4. The automatic defect classifier system (ADC) [8]

4.1. Defect classes

In this work, we will try to classify the physical defect classes that provide the most information for locating the cause of the defects. The physical defect classes dealt with in this work, and their common appearances, are listed in the following:

A. Foreign objects (FO)

This class includes defects such that external objects are found on the wafer. Defects of the FO class tend to appear as small and dark-colored regions, typically of near-circular shape.

B. Embedded foreign objects (EO)

This is the class of defects where one or more processed film layers have been stacked over a foreign object.


Figure 5. The flow of data in the ADC system: the defect image and the defectless reference image are combined (AND) to form the defect mask; shape feature extraction on the mask and color quantization of the defect region yield the shape features and color ratios, which are fed to the HCN classifier to produce the defect class.

EO class defects appear slightly larger and more irregularly shaped than those of the FO class, because the patterns of the heaped area in the covering layers are deformed by the embedded object. In addition to the characteristic dark color of the particle itself, other colors can be observed as well. Defects of the FO and EO classes can appear quite similar, and are sometimes hard to distinguish even for an expert.

C. Pattern failure (PF)

This class covers all kinds of defects that have pattern deformations without any external objects present. Defects of the PF class can also be caused by insufficient exposure or etching; thus they can have a wide variety of sizes and shapes. Since the defect is usually an extra region or a lack in the pattern of a layer, the color of the defect region tends to be one of those observed in the normal patterns.

4.2. Feature extraction

A. Shape features

The flow of data in the ADC system is shown in Fig. 5. After subtraction of the defectless reference pattern from the defect image and further graylevel thresholding, the defect mask is made. From the defect mask, the shape features of defect size and roundness are calculated.

B. Color features

The color of the defect region is characterized by quantizing the color of each pixel to one of the prototype colors.

The prototype colors are determined beforehand by applying the Median Cut Algorithm [5] to defectless images of the layer to be inspected. Also, typical defect colors are manually added as prototype colors. The ratios of the quantized colors in the defect region were used as the color feature vector of the defect.

Figure 6. An artificially generated cluster data of four classes in the unit square (axes x_1, x_2 in [0, 1]). (a) Training set (P = 100). (b) Test set (P = 1000).

In the experiments in Sec. 5, the feature dimension was 12, including the 2 shape features and 10 color features, all normalized to the unity range.
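Putting Sec. 4.2 together, a rough sketch of the feature extraction follows. The roundness formula, the crude perimeter estimate and the nearest-prototype quantization are illustrative assumptions (the paper does not spell them out), and the normalization to the unity range is omitted.

```python
import numpy as np

def extract_features(defect_img, reference_img, mask_threshold, prototypes):
    """Sketch of the ADC feature extraction (Fig. 5): a defect mask from
    image subtraction and graylevel thresholding, two shape features
    (size and roundness), and the ratios of quantized prototype colors
    inside the mask. Images are (H, W, 3) arrays; prototypes is (K, 3)."""
    # Defect mask: graylevel difference against the defectless reference.
    gray = np.abs(defect_img.mean(axis=2) - reference_img.mean(axis=2))
    mask = gray > mask_threshold

    size = mask.sum()  # defect area in pixels
    # Roundness as 4*pi*area / perimeter^2 (1.0 for a circle), with the
    # perimeter crudely estimated from mask transition pixels.
    edges = (mask ^ np.roll(mask, 1, axis=0)) | (mask ^ np.roll(mask, 1, axis=1))
    perimeter = max(edges.sum(), 1)
    roundness = 4.0 * np.pi * size / perimeter ** 2

    # Color ratios: quantize each defect pixel to its nearest prototype color.
    pixels = defect_img[mask].astype(float)  # (n_pixels, 3)
    dists = np.linalg.norm(pixels[:, None, :] - prototypes[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    ratios = np.bincount(labels, minlength=len(prototypes)) / max(len(labels), 1)

    return np.concatenate([[size, roundness], ratios])
```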

    5. Experiment

A. Membership thresholding on artificial cluster data

The effect of membership thresholding and MS was evaluated using the artificial four-class data in a 2D domain shown in Fig. 6. Three types of networks and training strategies were tried. All networks were trained to the target error E_0, to respond with class-specific unit vectors.

1. MLP with (input-hidden-output) = (2-4-4) units.

2. HCN with (input-hidden-output) = (2-100-4) units.

3. HCN trained by BP with MS for model reduction during training. Initial model: (2-100-4).

The change in the recognition rate for the test set, and the ratio of the area within the input domain which was pointed out as being of unknown class, were evaluated by changing the membership threshold \theta in Eq. (7). Ideally, the recognition rate will be maintained high even when a large portion of the input domain is judged as unknown (rejected). The result is shown in Fig. 7. It is clear that by reducing the model of the HCN by MS, a larger portion of the input domain is properly rejected without losing the classification ability for the test set.


Figure 7. The change in the recognition rate (vertical axis, 0.7 to 1) against the ratio of the rejected input domain (horizontal axis, 0 to 0.6) when the membership threshold is changed (annotated points include \theta = 0.2 and \theta = 0.9), for the MLP, the HCN, and the HCN with MS.

Table 1. The classification rate and the confusion matrix for the HCN evaluated by the leave-one-out method. In each cell, the second number (bold typeface in the original) is for the case when membership thresholding was used.

True class               | Est. FO | Est. EO | Est. PF | Unknown | Correct (%) | Error (%)
Foreign Object (FO)      | 32 / 32 |  0 / 0  |  1 / 0  |  0 / 1  | 97.0 / 97.0 |  3.0 / 0.0
Embedded Object (EO)     |  2 / 1  | 32 / 30 |  2 / 0  |  0 / 5  | 88.9 / 83.3 | 11.1 / 2.8
Pattern Failure (PF)     |  2 / 0  |  0 / 0  | 22 / 21 |  0 / 3  | 91.7 / 87.5 |  8.3 / 0.0
Average rates (weighted) |         |         |         |         | 92.5 / 89.2 |  7.5 / 1.1

B. Leave-one-out evaluation with HCN using MS

A collection of defect images obtained from the same process layer of a product was used for evaluating the ADC system. The set consisted of 33 FO class, 36 EO class and 24 PF class images. The class information for all the images was provided by an expert inspector. The classification rates were evaluated by the leave-one-out method [3]. An HCN with unit configuration (12-93-3), initialized by placing each kernel at a training input, was trained using MS. The model typically converged to reduced models with 9 to 14 hidden layer units.
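The leave-one-out protocol [3] can be sketched as follows, with the HCN training (including MS) and the thresholded classification of Eq. (7) passed in as callables, since their internals are covered above; the names are illustrative.

```python
def leave_one_out(features, labels, train_fn, classify_fn, theta):
    """Leave-one-out evaluation: each of the P samples is held out once,
    the classifier is trained on the remaining P-1 samples, and the
    held-out sample is classified.

    features : (P, D) array of feature vectors
    labels   : (P,) array of class labels
    train_fn(X, y)          -> trained network
    classify_fn(net, x, th) -> predicted class or 'unknown' (Eq. 7)
    """
    predictions = []
    P = len(labels)
    for i in range(P):
        keep = [j for j in range(P) if j != i]       # hold sample i out
        net = train_fn(features[keep], labels[keep])
        predictions.append(classify_fn(net, features[i], theta))
    return predictions
```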

The results are shown in Table 1. It is found that by employing the membership thresholding, the non-diagonal elements (errors) in the confusion matrix could be reduced drastically. The obtained classification rate is considered to be comparable to those of human experts. By reducing the network model by MS, the computation required for using the network was also reduced by 85-90% compared with the initial network model.

    6. Conclusion

    An ADC system for visual inspection of semiconductor

    wafers, using a neural network classifier was introduced.

    The Hyperellipsoid Clustering Network was introduced,

    and the training rule with cost terms for recognizing unfa-

    miliar inputs as originating from an unknown defect class

    was given. Further, by using BP training with Model

    Switching, a reduced-model classifier which enables an effi-

    cient classification was obtained. The defect classes and the

    descriptions of the extracted image features was defined. In

    the experiments, the effectiveness of the unfamiliar input

    recognition was confirmed, and a classification rate compa-

    rable to those of human experts were obtained.

    References

[1] P. B. Chou, A. R. Rao, M. C. Struzenbecker, F. Y. Wu, and V. H. Brecher. Automatic defect classification for semiconductor manufacturing. Machine Vision and Applications, 9(4):201-214, 1997.

[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.

[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, 1990.

[4] R. Hecht-Nielsen. Neurocomputing. Addison-Wesley, 1990.

[5] P. Heckbert. Color image quantization for frame buffer display. Computer Graphics, 16(3):297-307, 1982.

[6] K. Kameyama and Y. Kosugi. Neural network pruning by fusing hidden layer units. Transactions of IEICE, E74(12):4198-4204, 1991.

[7] K. Kameyama and Y. Kosugi. Model switching by channel fusion for network pruning and efficient feature extraction. Proceedings of the International Joint Conference on Neural Networks 1998, pages 1861-1866, 1998.

[8] K. Kameyama, Y. Kosugi, T. Okahashi, and M. Izumita. Automatic defect classification in visual inspection of semiconductors using neural networks. IEICE Transactions on Information and Systems, E81-D(11):1261-1271, 1998.

[9] T. Kohonen. Self-Organization and Associative Memory. Springer, 1988.

[10] J. E. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281-294, 1989.

[11] E. Parzen. On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33:1065-1076, 1962.

[12] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78:1481-1497, 1990.

[13] R. Reed. Pruning algorithms: a survey. IEEE Transactions on Neural Networks, 4(5):740-747, 1993.

[14] D. Rumelhart, J. L. McClelland, and the PDP Research Group. Parallel Distributed Processing. MIT Press, 1986.

[15] V. N. Vapnik. Statistical Learning Theory. Wiley, 1999.