UNIT : III SOFT COMPUTING II SEMESTER (MCSE 205) PREPARED BY ARUN PRATAP SINGH

OIST Bhopal — M.Tech CSE, II Semester (RGPV Bhopal)


UNSUPERVISED LEARNING IN NEURAL NETWORK :


In machine learning, the problem of unsupervised learning is that of trying to find hidden

structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no

error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning

from supervised learning and reinforcement learning.

Unsupervised learning is closely related to the problem of density

estimation in statistics.[1] However, unsupervised learning also encompasses many other

techniques that seek to summarize and explain key features of the data. Many methods employed

in unsupervised learning are based on data mining methods used to preprocess data.

Approaches to unsupervised learning include:

• clustering (e.g., k-means, mixture models, hierarchical clustering),

• hidden Markov models,

• blind signal separation using feature extraction techniques for dimensionality reduction (e.g., principal component analysis, independent component analysis, non-negative matrix factorization, singular value decomposition).
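As a concrete illustration of the clustering approach, here is a minimal k-means sketch (Lloyd's algorithm). The function name and toy data are illustrative only, not taken from any particular library:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) sketch for illustration."""
    rng = np.random.default_rng(seed)
    # initialize centers at k distinct data points
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to the nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # move each center to the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

Because no labels or reward signal are used, the algorithm discovers the grouping structure purely from the geometry of the data.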

Among neural network models, the self-organizing map (SOM) and adaptive resonance

theory (ART) are commonly used unsupervised learning algorithms. The SOM is a topographic

organization in which nearby locations in the map represent inputs with similar properties. The

ART model allows the number of clusters to vary with problem size and lets the user control the

degree of similarity between members of the same cluster by means of a user-defined constant

called the vigilance parameter. ART networks are also used for many pattern recognition tasks,

such as automatic target recognition and seismic signal processing. The first version of ART was

"ART1", developed by Carpenter and Grossberg (1988).

COUNTERPROPAGATION NETWORK:

The counterpropagation network is a hybrid network. It consists of an outstar network and a

competitive filter network. It was developed in 1986 by Robert Hecht-Nielsen. It is guaranteed to

find the correct weights, unlike regular backpropagation networks that can become trapped in

local minima during training.

Each input-layer neurode connects to each neurode in the hidden layer. The hidden layer is a

Kohonen network which categorizes the pattern that was input. The output layer is an outstar array

which reproduces the correct output pattern for the category.

Training is done in two stages. The hidden layer is first taught to categorize the patterns and the

weights are then fixed for that layer. Then the output layer is trained. Each pattern that will be


input needs a unique node in the hidden layer, which is often too large to work on real world

problems.

The counterpropagation update algorithm updates a net that consists of an input, a hidden, and an output layer. In this case the hidden layer is called the Kohonen layer and the output layer is called the Grossberg layer. At the beginning of the algorithm the output of the input neurons is equal to the input vector. The input vector is normalized to length one. Now the progression of the Kohonen layer starts.

This means that the neuron with the highest net input is identified. The activation of this winner neuron is set to 1, and the activation of all other neurons in this layer is set to 0. Now the output of all output neurons is calculated. There is only one neuron of the hidden layer with both the activation and the output set to 1.

Because the activation and the output of each output neuron is the weighted sum over the outputs of the hidden neurons, the output of each output neuron equals the weight of the link between the winner neuron and that output neuron. This update function makes sense only in combination with the CPN learning function.
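The forward pass just described can be sketched in a few lines. The weight matrices are assumed to be already trained, and the names are illustrative:

```python
import numpy as np

def cpn_forward(x, w_kohonen, w_grossberg):
    """Counterpropagation forward pass sketch.

    w_kohonen: (hidden x input) Kohonen layer weights.
    w_grossberg: (output x hidden) Grossberg layer weights.
    """
    x = x / np.linalg.norm(x)          # normalize the input to length one
    net = w_kohonen @ x                # net input of each Kohonen neuron
    winner = np.argmax(net)            # neuron with the highest net input wins
    hidden = np.zeros(len(net))
    hidden[winner] = 1.0               # winner activation 1, all others 0
    # output = weights of the links between the winner and the output neurons
    return w_grossberg @ hidden
```

Because exactly one hidden neuron is active, the matrix product simply reads out the winner's column of the Grossberg weight matrix.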


ARCHITECTURE OF COUNTER PROPAGATION NETWORK :


ASSOCIATIVE MEMORY:


Bidirectional associative memory (BAM) is a type of recurrent neural network. BAM was

introduced by Bart Kosko in 1988. There are two types of associative memory, auto-associative

and hetero-associative. BAM is hetero-associative, meaning given a pattern it can return another

pattern which is potentially of a different size. It is similar to the Hopfield network in that they are

both forms of associative memory. However, Hopfield nets return patterns of the same size.


Procedure-

Learning

Imagine we wish to store two associations, A1:B1 and A2:B2.

A1 = (1, 0, 1, 0, 1, 0), B1 = (1, 1, 0, 0)

A2 = (1, 1, 1, 0, 0, 0), B2 = (1, 0, 1, 0)

These are then transformed into the bipolar forms:

X1 = (1, -1, 1, -1, 1, -1), Y1 = (1, 1, -1, -1)

X2 = (1, 1, 1, -1, -1, -1), Y2 = (1, -1, 1, -1)

From there, we calculate M = X1^T Y1 + X2^T Y2, where ^T denotes the transpose (each term is the outer product of a bipolar input pattern with its associated output pattern). So,

M =
[ 2  0  0 -2]
[ 0 -2  2  0]
[ 2  0  0 -2]
[-2  0  0  2]
[ 0  2 -2  0]
[-2  0  0  2]

Recall

To retrieve the association A1, we multiply it by M to get (4, 2, -2, -4), which, when run through a

threshold, yields (1, 1, 0, 0), which is B1. To find the reverse association, multiply this by the

transpose of M.
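The worked example above can be reproduced in a few lines of NumPy (a sketch of the procedure, not a full BAM implementation):

```python
import numpy as np

# Bipolar forms of the stored associations A1:B1 and A2:B2
X = np.array([[1, -1, 1, -1, 1, -1],
              [1,  1, 1, -1, -1, -1]])
Y = np.array([[1,  1, -1, -1],
              [1, -1,  1, -1]])

M = X.T @ Y                 # M = X1^T Y1 + X2^T Y2 (sum of outer products)

A1 = np.array([1, 0, 1, 0, 1, 0])
raw = A1 @ M                # -> (4, 2, -2, -4)
B1 = (raw > 0).astype(int)  # threshold -> (1, 1, 0, 0), which is B1
```

Recalling the reverse association works the same way, multiplying by M.T instead of M.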


HOPFIELD NETWORK:

A Hopfield network is a form of recurrent artificial neural network invented by John Hopfield in

1982. Hopfield nets serve as content-addressable memory systems with binary threshold nodes.

They are guaranteed to converge to a local minimum, but convergence to a false pattern (wrong

local minimum) rather than the stored pattern (expected local minimum) can occur. Hopfield

networks also provide a model for understanding human memory.

Structure-

The units in Hopfield nets are binary threshold units, i.e. the units only take on two different values

for their states and the value is determined by whether or not the units' input exceeds their

threshold. Hopfield nets normally have units that take on values of 1 or -1, and this convention will be used here. However, other literature might use units that take values of 0 and 1.

Every pair of units i and j in a Hopfield network has a connection that is described by the connectivity weight w_ij. In this sense, the Hopfield network can be formally described as a complete undirected graph G = (V, f), where V is a set of McCulloch–Pitts neurons and f : V^2 → R is a function that links pairs of units to a real value, the connectivity weight.

The connections in a Hopfield net typically have the following restrictions:

w_ii = 0 for all i (no unit has a connection with itself)

w_ij = w_ji for all i, j (connections are symmetric)

The requirement that weights be symmetric is typically used, as it will guarantee that the energy

function decreases monotonically while following the activation rules, and the network may exhibit


some periodic or chaotic behaviour if non-symmetric weights are used. However, Hopfield found

that this chaotic behavior is confined to relatively small parts of the phase space, and does not

impair the network's ability to act as a content-addressable associative memory system.

[Figure: a Hopfield net with four nodes]

Updating-

Updating one unit (node in the graph simulating the artificial neuron) in the Hopfield network is performed using the following rule:

s_i ← +1 if Σ_j w_ij s_j ≥ θ_i, and −1 otherwise,

where:

w_ij is the strength of the connection weight from unit j to unit i (the weight of the connection),

s_j is the state of unit j, and

θ_i is the threshold of unit i.

Updates in the Hopfield network can be performed in two different ways:

Asynchronous: Only one unit is updated at a time. This unit can be picked at random, or a

pre-defined order can be imposed from the very beginning.

Synchronous: All units are updated at the same time. This requires a central clock to the

system in order to maintain synchronization. This method is less realistic, since biological or

physical systems lack a global clock that keeps track of time.


Neurons attract or repel each other

The weight between two units has a powerful impact upon the values of the neurons. Consider the connection weight w_ij between two neurons i and j. If w_ij > 0, the updating rule implies that:

when s_j = 1, the contribution of j in the weighted sum is positive; thus s_i is pulled by j towards its value s_j = 1;

when s_j = −1, the contribution of j in the weighted sum is negative; then again, s_i is pulled by j towards its value s_j = −1.

Thus, the values of neurons i and j will converge if the weight between them is positive. Similarly, they will diverge if the weight is negative.

Training-

Training a Hopfield net involves lowering the energy of states that the net should "remember".

This allows the net to serve as a content addressable memory system, that is to say, the network

will converge to a "remembered" state if it is given only part of the state. The net can be used to

recover from a distorted input to the trained state that is most similar to that input. This is called

associative memory because it recovers memories on the basis of similarity. For example, if we

train a Hopfield net with five units so that the state (1, 0, 1, 0, 1) is an energy minimum, and we

give the network the state (1, 0, 0, 0, 1) it will converge to (1, 0, 1, 0, 1). Thus, the network is

properly trained when the states which the network should remember are local minima of the energy function.

Learning rules-

There are various different learning rules that can be used to store information in the memory of

the Hopfield Network. It is desirable for a learning rule to have both of the following two properties:

Local: A learning rule is local if each weight is updated using information available to neurons

on either side of the connection that is associated with that particular weight.

Incremental: New patterns can be learned without using information from the old patterns that

have been also used for training. That is, when a new pattern is used for training, the new

values for the weights only depend on the old values and on the new pattern.[1]

These properties are desirable, since a learning rule satisfying them is more biologically plausible.

For example, since the human brain is always learning new concepts, one can reason that human

learning is incremental. A learning system that would not be incremental would generally be

trained only once, with a huge batch of training data.


Hebbian learning rule for Hopfield networks

The Hebbian Theory was introduced by Donald Hebb in 1949, in order to explain "associative

learning", in which simultaneous activation of neuron cells leads to pronounced increases in

synaptic strength between those cells. It is often summarized as "Neurons that fire together, wire

together. Neurons that fire out of sync, fail to link".

The Hebbian rule is both local and incremental. For Hopfield networks, it is implemented in the following manner when learning n binary patterns:

w_ij = (1/n) Σ_{μ=1..n} ε_i^μ ε_j^μ

where ε_i^μ represents bit i from pattern μ.

If the bits corresponding to neurons i and j are equal in pattern μ, then the product ε_i^μ ε_j^μ will be positive. This would, in turn, have a positive effect on the weight w_ij, and the values of i and j will tend to become equal. The opposite happens if the bits corresponding to neurons i and j are different.
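Putting the Hebbian rule and the update rule together, a small Hopfield net can be sketched as follows, using the five-unit pattern from the training example above in ±1 form (zero thresholds are assumed for simplicity):

```python
import numpy as np

def train_hebbian(patterns):
    """Hebbian rule: w_ij = (1/n) * sum over patterns of e_i * e_j."""
    W = (patterns.T @ patterns) / patterns.shape[0]
    np.fill_diagonal(W, 0)            # w_ii = 0: no self-connections
    return W

def recall(W, state, steps=20):
    """Asynchronous updates in a fixed order until the state settles."""
    state = state.copy()
    for _ in range(steps):
        for i in range(len(state)):
            state[i] = 1 if W[i] @ state >= 0 else -1
    return state

stored = np.array([[1, -1, 1, -1, 1]])    # (1, 0, 1, 0, 1) in bipolar form
W = train_hebbian(stored.astype(float))
probe = np.array([1, -1, -1, -1, 1])      # distorted input (1, 0, 0, 0, 1)
result = recall(W, probe)                 # converges back to the stored pattern
```

The distorted probe falls into the basin of attraction of the stored state, which is exactly the content-addressable behaviour described above.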

The Storkey learning rule

This rule was introduced by Amos Storkey in 1997 and is both local and incremental. Storkey also showed that a Hopfield network trained using this rule has a greater capacity than a corresponding network trained using the Hebbian rule.[3] The weight matrix of an attractor neural network is said to follow the Storkey learning rule if it obeys:

w_ij^ν = w_ij^(ν−1) + (1/n) ε_i^ν ε_j^ν − (1/n) ε_i^ν h_ji^ν − (1/n) ε_j^ν h_ij^ν

where h_ij^ν = Σ_{k=1..n, k≠i,j} w_ik^(ν−1) ε_k^ν is a form of local field [1] at neuron i.

This learning rule is local, since the synapses take into account only neurons at their sides. The

rule makes use of more information from the patterns and weights than the generalized Hebbian

rule, due to the effect of the local field.


ADAPTIVE RESONANCE THEORY:

Basic ART architecture


Grossberg competitive network

Grossberg Network-

The L1-L2 connections are instars, which perform a clustering (or categorization)

operation. When an input pattern is presented, it is multiplied (after normalization) by

the L1-L2 weight matrix.

A competition is performed at Layer 2 to determine which row of the weight matrix is

closest to the input vector. That row is then moved toward the input vector.

After learning is complete, each row of the L1-L2 weight matrix is a prototype

pattern, which represents a cluster (or a category) of input vectors.

ART Networks –

Learning of ART networks also occurs in a set of feedback connections from Layer 2

to Layer 1. These connections are outstars which perform pattern recall.

When a node in Layer 2 is activated, this reproduces a prototype pattern (the

expectation) at layer 1.

Layer 1 then performs a comparison between the expectation and the input pattern.

When the expectation and the input pattern are NOT closely matched, the orienting

subsystem causes a reset in Layer 2.

The reset disables the current winning neuron, and the current expectation is removed.

A new competition is then performed in Layer 2, while the previous winning neuron is

disabled.

[Figure: Grossberg network — Input → Layer 1 (retina; normalization) → Layer 2 (visual cortex; contrast enhancement), with the adaptive weights forming the long-term memory (LTM) and the activations the short-term memory (STM)]


The new winning neuron in Layer 2 projects a new expectation to Layer 1, through the

L2-L1 connections.

This process continues until the L2-L1 expectation provides a close enough match to the

input pattern.

ART Architecture –

Bottom-up weights bij

Top-down weights tij

› Store class template

Input nodes

› Vigilance test

› Input normalisation

Output nodes

› Forward matching

Long-term memory

› ANN weights

Short-term memory

› ANN activation pattern


• The basic ART system is an unsupervised learning model. It typically consists of

• a comparison field and a recognition field composed of neurons,

• a vigilance parameter, and

• a reset module

• Comparison field

• The comparison field takes an input vector (a one-dimensional array of values)

and transfers it to its best match in the recognition field. Its best match is the

single neuron whose set of weights (weight vector) most closely matches the

input vector.

• Recognition field

• Each recognition field neuron outputs a negative signal proportional to that

neuron's quality of match to the input vector to each of the other recognition field

neurons and inhibits their output accordingly. In this way the recognition field

exhibits lateral inhibition, allowing each neuron in it to represent a category to

which input vectors are classified.

• Vigilance parameter

• After the input vector is classified, a reset module compares the strength of the

recognition match to a vigilance parameter. The vigilance parameter has

considerable influence on the system.

• Reset Module

• The reset module compares the strength of the recognition match to the vigilance

parameter.


• If the vigilance threshold is met, then training commences.

ART Algorithm –

ART Types :

• ART-1

• Binary input vectors

• Unsupervised NN that can be complemented with external changes to the

vigilance parameter

• ART-2

• Real-valued input vectors

• ART-3

• Parallel search of compressed or distributed pattern recognition codes in a

NN hierarchy.

• Search process leads to the discovery of appropriate representations of a

non stationary input environment.

• Chemical properties of the synapse emulated in the search process


The ART-1 Network :

Applications of ART :

• Mobile robot control

• Facial recognition

• Land cover classification

• Target recognition

• Medical diagnosis

• Signature verification


Learning model :

The basic ART system is an unsupervised learning model. It typically consists of a comparison

field and a recognition field composed of neurons, a vigilance parameter (threshold of

recognition), and a reset module. The comparison field takes an input vector (a one-dimensional

array of values) and transfers it to its best match in the recognition field. Its best match is the

single neuron whose set of weights (weight vector) most closely matches the input vector. Each

recognition field neuron outputs a negative signal (proportional to that neuron’s quality of match

to the input vector) to each of the other recognition field neurons and thus inhibits their output. In

this way the recognition field exhibits lateral inhibition, allowing each neuron in it to represent a

category to which input vectors are classified. After the input vector is classified, the reset module

compares the strength of the recognition match to the vigilance parameter. If the vigilance

parameter is overcome, training commences: the weights of the winning recognition neuron are


adjusted towards the features of the input vector. Otherwise, if the match level is below the

vigilance parameter the winning recognition neuron is inhibited and a search procedure is carried

out. In this search procedure, recognition neurons are disabled one by one by the reset function

until the vigilance parameter is overcome by a recognition match. In particular, at each cycle of

the search procedure the most active recognition neuron is selected and then switched off if its

activation is below the vigilance parameter (note that it thus releases the remaining recognition

neurons from its inhibition). If no committed recognition neuron’s match overcomes the vigilance

parameter, then an uncommitted neuron is committed and its weights are adjusted towards

matching the input vector. The vigilance parameter has considerable influence on the system:

higher vigilance produces highly detailed memories (many, fine-grained categories), while lower

vigilance results in more general memories (fewer, more-general categories).
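A drastically simplified sketch of this search-and-vigilance loop for binary inputs, with fast learning, is shown below. The choice function is reduced to an overlap count, so this illustrates the idea rather than the full ART-1 dynamics:

```python
import numpy as np

def art1_cluster(patterns, vigilance=0.7):
    """Simplified ART-1-style clustering of binary vectors (illustrative)."""
    prototypes = []          # committed category prototypes
    labels = []
    for p in patterns:
        p = np.asarray(p, dtype=int)
        assigned = None
        # search committed categories, most active (largest overlap) first
        order = sorted(range(len(prototypes)),
                       key=lambda j: -np.sum(p & prototypes[j]))
        for j in order:
            # vigilance test: fraction of the input matched by the prototype
            match = np.sum(p & prototypes[j]) / max(np.sum(p), 1)
            if match >= vigilance:
                prototypes[j] = p & prototypes[j]   # fast learning: AND
                assigned = j
                break                               # resonance: stop searching
        if assigned is None:
            prototypes.append(p.copy())             # commit an uncommitted node
            assigned = len(prototypes) - 1
        labels.append(assigned)
    return labels, prototypes
```

Raising the vigilance parameter makes the match test harder to pass, so more categories are committed (finer-grained memories), exactly as described above.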

Training :

There are two basic methods of training ART-based neural networks: slow and fast. In the slow

learning method, the degree of training of the recognition neuron’s weights towards the input

vector is calculated to continuous values with differential equations and is thus dependent on the

length of time the input vector is presented. With fast learning, algebraic equations are used to

calculate degree of weight adjustments to be made, and binary values are used. While fast

learning is effective and efficient for a variety of tasks, the slow learning method is more

biologically plausible and can be used with continuous-time networks (i.e. when the input vector

can vary continuously).

SUPPORT VECTOR MACHINE:

In machine learning, support vector machines (SVMs, also support vector networks)

are supervised learning models with associated learning algorithms that analyze data and

recognize patterns, used for classification and regression analysis. Given a set of training

examples, each marked as belonging to one of two categories, an SVM training algorithm builds

a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points

in space, mapped so that the examples of the separate categories are divided by a clear gap that

is as wide as possible. New examples are then mapped into that same space and predicted to

belong to a category based on which side of the gap they fall on.

A Support Vector Machine (SVM) performs classification by constructing an N-dimensional hyperplane that optimally separates the data into two categories. SVM models are closely related to neural networks. In fact, an SVM model using a sigmoid kernel function is equivalent to a two-layer perceptron neural network.

Support Vector Machine (SVM) models are a close cousin to classical multilayer perceptron neural networks. Using a kernel function, SVMs are an alternative training method for polynomial, radial basis function and multi-layer perceptron classifiers in which the weights of the network are found by solving a quadratic programming problem with linear constraints, rather than by solving a non-convex, unconstrained minimization problem as in standard neural network training.


In the parlance of SVM literature, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors. The figure below presents an overview of the SVM process.

A Two-Dimensional Example

Before considering N-dimensional hyperplanes, let’s look at a simple 2-dimensional example. Assume we wish to perform a classification, and our data has a categorical target variable with two categories. Also assume that there are two predictor variables with continuous values. If we plot the data points using the value of one predictor on the X axis and the other on the Y axis we might end up with an image such as shown below. One category of the target variable is represented by rectangles while the other category is represented by ovals.


In this idealized example, the cases with one category are in the lower left corner and the cases with the other category are in the upper right corner; the cases are completely separated. The SVM analysis attempts to find a 1-dimensional hyperplane (i.e. a line) that separates the cases based on their target categories. There are an infinite number of possible lines; two candidate lines are shown above. The question is which line is better, and how do we define the optimal line.

The dashed lines drawn parallel to the separating line mark the distance between the dividing line and the closest vectors to the line. The distance between the dashed lines is called the margin. The vectors (points) that constrain the width of the margin are the support vectors. The following figure illustrates this.

An SVM analysis finds the line (or, in general, hyperplane) that is oriented so that the margin between the support vectors is maximized. In the figure above, the line in the right panel is superior to the line in the left panel.
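The margin-maximization idea can be illustrated with a toy linear SVM trained by the Pegasos sub-gradient method. This is a sketch under simplifying assumptions, not the quadratic-programming solver used by DTREG or other SVM packages:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Pegasos-style hinge-loss training; labels y must be +1/-1."""
    Xa = np.hstack([X, np.ones((len(X), 1))])   # fold the bias in as a feature
    rng = np.random.default_rng(seed)
    w = np.zeros(Xa.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(Xa)):
            t += 1
            eta = 1.0 / (lam * t)               # decaying step size
            if y[i] * (Xa[i] @ w) < 1:          # point violates the margin
                w = (1 - eta * lam) * w + eta * y[i] * Xa[i]
            else:                                # shrink step widens the margin
                w = (1 - eta * lam) * w
    return w

def predict(w, X):
    Xa = np.hstack([X, np.ones((len(X), 1))])
    return np.sign(Xa @ w)
```

The regularization term lam plays the role of margin width: the shrink step pushes the weights towards a wider margin, while the hinge term keeps the support vectors on the correct side.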


If all analyses consisted of two-category target variables with two predictor variables, and the cluster of points could be divided by a straight line, life would be easy. Unfortunately, this is not generally the case, so SVM must deal with (a) more than two predictor variables, (b) separating the points with non-linear curves, (c) handling the cases where clusters cannot be completely separated, and (d) handling classifications with more than two categories.

Flying High on Hyperplanes

In the previous example, we had only two predictor variables, and we were able to plot the points on a 2-dimensional plane. If we add a third predictor variable, then we can use its value for a third dimension and plot the points in a 3-dimensional cube. Points on a 2-dimensional plane can be separated by a 1-dimensional line. Similarly, points in a 3-dimensional cube can be separated by a 2-dimensional plane.

As we add additional predictor variables (attributes), the data points can be represented in N-dimensional space, and a (N-1)-dimensional hyperplane can separate them.

When Straight Lines Go Crooked

The simplest way to divide two groups is with a straight line, flat plane or an N-dimensional hyperplane. But what if the points are separated by a nonlinear region such as shown below?


In this case we need a nonlinear dividing line.

Rather than fitting nonlinear curves to the data, SVM handles this by using a kernel function to map the data into a different space where a hyperplane can be used to do the separation.

The kernel function may transform the data into a higher dimensional space to make it possible to perform the separation.


Ideally an SVM analysis should produce a hyperplane that completely separates the feature vectors into two non-overlapping groups. However, perfect separation may not be possible, or it may result in a model with so many feature vector dimensions that the model does not generalize well to other data; this is known as overfitting.


The Kernel Trick

Many kernel mapping functions can be used – probably an infinite number. But a few kernel functions have been found to work well for a wide variety of applications. The default and recommended kernel function is the Radial Basis Function (RBF).

Kernel functions supported by DTREG:

Linear: u’*v


Polynomial: (gamma*u’*v + coef0)^degree


Radial basis function: exp(-gamma*|u-v|^2)
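These three kernels translate directly into code (u and v are feature vectors; gamma, coef0 and degree follow the formulas above):

```python
import numpy as np

def linear_kernel(u, v):
    return np.dot(u, v)                                  # u'*v

def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * np.dot(u, v) + coef0) ** degree      # (gamma*u'*v + coef0)^degree

def rbf_kernel(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))         # exp(-gamma*|u-v|^2)
```

Note that the RBF kernel of a vector with itself is always 1, and it decays towards 0 as the two vectors move apart; gamma controls how quickly.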

To allow some flexibility in separating the categories, SVM models have a cost parameter, C, that controls the trade-off between allowing training errors and forcing rigid margins. It creates a soft margin that permits some misclassifications. Increasing the value of C increases the cost of misclassifying points and forces the creation of a more accurate model that may not generalize well. DTREG provides a grid search facility that can be used to find the optimal value of C.

Finding Optimal Parameter Values

The accuracy of an SVM model is largely dependent on the selection of the model parameters. DTREG provides two methods for finding optimal parameter values, a grid search and a pattern search. A grid search tries values of each parameter across the specified search range using geometric steps. A pattern search (also known as a “compass search” or a “line search”) starts at the center of the search range and makes trial steps in each direction for each parameter. If the


fit of the model improves, the search center moves to the new point and the process is repeated. If no improvement is found, the step size is reduced and the search is tried again. The pattern search stops when the search step size is reduced to a specified tolerance.

Grid searches are computationally expensive because the model must be evaluated at many points within the grid for each parameter. For example, if a grid search is used with 10 search intervals and an RBF kernel function is used with two parameters (C and Gamma), then the model must be evaluated at 10*10 = 100 grid points. An Epsilon-SVR analysis has three parameters (C, Gamma and P) so a grid search with 10 intervals would require 10*10*10 = 1000 model evaluations. If cross-validation is used for each model evaluation, the number of actual SVM calculations would be further multiplied by the number of cross-validation folds (typically 4 to 10). For large models, this approach may be computationally infeasible.

A pattern search generally requires far fewer evaluations of the model than a grid search. Beginning at the geometric center of the search range, a pattern search makes trial steps with positive and negative step values for each parameter. If a step is found that improves the model, the center of the search is moved to that point. If no step improves the model, the step size is reduced and the process is repeated. The search terminates when the step size is reduced to a specified tolerance. The weakness of a pattern search is that it may find a local rather than global optimal point for the parameters.
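The pattern-search procedure just described can be sketched in a few lines. Here the "model fit" is stood in for by a toy loss function; in DTREG the evaluation would be a cross-validated SVM error, and the starting point, steps and tolerance below are illustrative:

```python
def pattern_search(loss, center, step, tol=1e-4, shrink=0.5):
    """Compass/pattern search: try +/- step on each parameter;
    move the center on improvement, shrink the steps otherwise."""
    best = loss(center)
    while max(step) > tol:
        improved = False
        for i in range(len(center)):
            for delta in (+step[i], -step[i]):
                trial = list(center)
                trial[i] += delta
                f = loss(trial)
                if f < best:
                    best, center, improved = f, trial, True
        if not improved:
            step = [s * shrink for s in step]
    return center, best

# Toy "model error" with its minimum at C=10, gamma=0.5.
loss = lambda p: (p[0] - 10) ** 2 + (p[1] - 0.5) ** 2
params, err = pattern_search(loss, center=[1.0, 1.0], step=[4.0, 0.5])
print(params)  # approximately [10, 0.5]
```

On this convex toy function the search converges to the optimum; on a real cross-validation surface it may stop at a local optimum, which is exactly the weakness noted above.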

DTREG allows you to use both a grid search and a pattern search. In this case the grid search is performed first. Once the grid search finishes, a pattern search is performed over a narrow search range surrounding the best point found by the grid search. Hopefully, the grid search will find a region near the global optimum point and the pattern search will then find the global optimum by starting in the right region.

Classification With More Than Two Categories

The idea of using a hyperplane to separate the feature vectors into two groups works well when there are only two target categories, but how does SVM handle the case where the target variable has more than two categories? Several approaches have been suggested, but two are the most popular: (1) "one against many", where each category is split out and all of the other categories are merged; and (2) "one against one", in which k(k-1)/2 models are constructed, where k is the number of categories. DTREG uses the more accurate (but more computationally expensive) technique of "one against one".
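The k(k-1)/2 count comes from training one binary model per unordered pair of classes; prediction is then done by voting. A schematic illustration, in which the pairwise "models" are trivial stand-ins rather than real SVMs:

```python
from itertools import combinations

classes = ["A", "B", "C", "D"]          # k = 4 target categories
pairs = list(combinations(classes, 2))  # one binary model per pair
print(len(pairs))                       # 4*3/2 = 6 models

def predict(x, pairwise_models):
    """One-against-one prediction: each pairwise model casts one vote,
    and the class with the most votes wins."""
    votes = {c: 0 for c in classes}
    for (a, b), model in pairwise_models.items():
        votes[model(x)] += 1
    return max(votes, key=votes.get)

# Stand-in models: each always prefers the first class of its pair.
models = {(a, b): (lambda x, a=a: a) for a, b in pairs}
print(predict(0, models))  # "A" wins with 3 votes
```

With real SVMs, each pairwise model would be trained only on the examples of its two classes, which is why this scheme can be more accurate than one-against-many despite training more models.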

Optimal Fitting Without Over-Fitting

As described under Finding Optimal Parameter Values above, the accuracy of an SVM model is largely dependent on the selection of the kernel parameters such as C, Gamma and P, and DTREG tunes them using a grid search, a pattern search, or the two combined.

Page 58: Soft Computing Unit-3 by Arun Pratap Singh

PREPARED BY ARUN PRATAP SINGH 57

57

To avoid over-fitting, cross-validation is used to evaluate the fit provided by each parameter value set tried during the grid or pattern search process.
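The cross-validation used for each candidate parameter set can be sketched as a generic k-fold split; DTREG's internals may differ, and the scoring function here is a dummy:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal, non-overlapping folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_error(train_and_score, data, k=5):
    """Average held-out error across k folds for one parameter setting."""
    folds = kfold_indices(len(data), k)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [d for i, d in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        errors.append(train_and_score(train, test))
    return sum(errors) / k

# Dummy scorer that just reports the held-out fraction of the data.
data = list(range(10))
err = cross_val_error(lambda tr, te: len(te) / len(data), data, k=5)
print(err)  # 0.2, since each fold holds out 1/5 of the data
```

Each grid or pattern-search point would call something like `cross_val_error` with a real train-and-score function, which is what multiplies the total SVM training count by the number of folds.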

The following figure, by Florian Markowetz, illustrates how different parameter values may cause under- or over-fitting.


KOHONEN SELF-ORGANIZING MAPS :

Kohonen's networks are one of the basic types of self-organizing neural networks. The ability to self-organize provides new possibilities: adaptation to previously unknown input data. It seems to be the most natural way of learning, the one used in our brains, where no patterns are defined in advance; the patterns take shape during the learning process, which is combined with normal operation. "Kohonen network" is in fact a name for a whole group of networks that use a self-organizing, competitive learning method. We present signals at the network's inputs and then choose the winning neuron, the one whose weight vector matches the input vector best. The precise scheme of competition and the subsequent modification of synaptic weights may take various forms; there are many sub-types based on competition, which differ in the exact self-organizing algorithm.

Architecture of self-organizing maps : The structure of the neural network is a crucial matter. A single neuron is a simple mechanism and is not able to do much by itself; only a collection of neurons makes complicated operations possible. Because we know little about the actual rules governing the functioning of the human brain, many different architectures have been created that try to imitate the structure and behaviour of the human nervous system. Most often a one-way, one-layer network architecture is used. This is determined by the fact that all neurons must participate in the competition with equal rights, so each of them must have as many inputs as the whole system.


Figures: a neural network, and a 2-D map of neurons.


Stages of operations: The functioning of a self-organizing neural network is divided into three stages: construction, learning, and identification.

A system that is supposed to realize the functioning of a self-organizing network should consist of a few basic elements. The first of them is a matrix of neurons stimulated by input signals. Those signals should describe some attributes of the events that occur in the surroundings; thanks to that description the net is able to group those events. Information about events is translated into impulses which stimulate neurons. The group of signals transferred to each neuron does not have to be identical, and even their number may vary; however, they have to satisfy one condition: they must unambiguously define those events.

Another part of the net is a mechanism which determines the degree of similarity between every neuron's weights and the input signal, and which designates the unit with the best match: the winner. At the beginning the weights are small random numbers; it is important that no symmetry occurs. During learning, those weights are modified so as to best reflect the internal structure of the input data. However, there is a risk that neurons could lock onto some values before the groups are correctly recognized; in that case the learning process should be repeated with different weights.

Finally, it is absolutely necessary for the self-organizing process that the net be able to adapt the weight values of the winning neuron and its neighbours according to the response strength. The net topology can be defined in a very simple way by specifying the neighbours of every neuron. Let us call the unit whose response to a stimulation is maximal the image of this stimulation. Then we can say that the net is in order if the topological relations between input signals and their images are identical.

Algorithm of learning : The name of the whole class of networks came from the algorithm called self-organizing Kohonen maps, described in Kohonen's publication "Self-Organizing Map". Kohonen proposed two kinds of neighbourhood function: rectangular and Gaussian. The first (in the standard formulation, reconstructed here since the original figures are missing) is:

h(i, c) = 1 if d(i, c) <= lambda, otherwise 0

and the second:

h(i, c) = exp( -d(i, c)^2 / (2 * lambda^2) )

where d(i, c) is the distance on the map between neuron i and the winning neuron c, and "lambda" is the radius of the neighbourhood, which decreases over time. Using Kohonen's method gives better results than the "Winner Takes All" method: the organization of the net is better (the neuron organization represents the distribution of the input data more faithfully) and the convergence of the algorithm is faster. The price is that a single iteration takes a few times longer, because the weights of many neurons, not only the winner's, have to be modified.
