data clustering by cuckoo optimization algorithm

Data clustering by Cuckoo optimization algorithm
Hadi M. Abachi
Faculty of Computer Science, Iran University of Science & Technology

Upload: hadi-abachi

Post on 14-Apr-2017



2

K-means problems?

▪ K-means converges to a local optimum of the squared-error function rather than the global optimum, because the result depends on the random initialization of the centers.

▪ For large data sets, finding the optimal clustering is an NP-hard problem.

▪ Solution: stochastic global optimization algorithms such as genetic algorithms, ant colony optimization, PSO, cuckoo, …
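
The dependence on initialization can be seen on a toy example. The sketch below is a plain Lloyd-style K-means on 1-D data; the data set and the two sets of initial centers are made up for illustration and do not come from the slides.

```python
def kmeans_1d(points, centers, iters=100):
    """Plain Lloyd iterations on 1-D data; returns (centers, squared error)."""
    centers = list(centers)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: (p - centers[k]) ** 2)
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group
        # (an empty group keeps its old center).
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse


points = [0, 1, 10, 11, 100, 104]                # illustrative data
_, sse_bad = kmeans_1d(points, [0, 100, 104])    # two centers start in the far group
_, sse_good = kmeans_1d(points, [0, 10, 100])    # one center starts in each group
print(sse_bad, sse_good)                         # prints 101.0 9.0
```

Both runs converge, but the first one stays at a squared error of 101 while the second reaches 9, which is the kind of gap a global optimizer is meant to close.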

3

Cuckoo search algorithm

▪ A global optimization method based on the behavior of cuckoos was proposed by Yang & Deb (2009).

▪ The original “cuckoo search (CS) algorithm” is based on the following ideas:

▪ How cuckoos lay their eggs in host nests.

▪ How, if not detected and destroyed, the eggs are hatched into chicks by the hosts.

▪ How a search algorithm based on such a scheme can be used to find the global optimum of a function.

4

Behavior of Cuckoo breeding

▪ CS was inspired by the obligate brood parasitism of some cuckoo species, which lay their eggs in the nests of host birds.

▪ Some cuckoos have evolved in such a way that female parasitic cuckoos can imitate the colors and patterns of the eggs of a few chosen host species.

▪ This reduces the probability of the eggs being abandoned and, therefore, increases the cuckoos' reproductivity.

5

Behavior of Cuckoo breeding (Cont.)

If host birds discover the eggs are not their own, they will either throw them away or simply abandon their nests and build new ones.

Parasitic cuckoos often choose a nest where the host bird has just laid its own eggs.

In general, the cuckoo eggs hatch slightly earlier than their host eggs.

6

Behavior of Cuckoo breeding (Cont.)

Once the first cuckoo chick has hatched, its first instinctive action is to evict the host eggs by blindly propelling them out of the nest.

This increases the cuckoo chick's share of the food provided by its host bird.

Moreover, studies show that a cuckoo chick can imitate the call of host chicks to gain more feeding opportunities.

7

Characteristics of Cuckoo search

Each egg in a nest represents a solution, and a cuckoo egg represents a new solution.

The aim is to employ the new and potentially better solutions (cuckoos) to replace not-so-good solutions in the nests.

In the simplest form, each nest has one egg.

The algorithm can be extended to more complicated cases in which each nest has multiple eggs representing a set of solutions.

8

Characteristics of Cuckoo search (cont)

▪ CS is based on three idealized rules:

▪ Each cuckoo lays one egg at a time and dumps it in a randomly chosen nest.

▪ The best nests with high-quality eggs (solutions) carry over to the next generations.

▪ The number of available host nests is fixed, and a host can discover an alien egg with probability p ∈ [0, 1].

▪ In this case, the host bird can either throw the egg away or abandon the nest and build a completely new nest in a new location.
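
The three rules map onto a short loop. The following is a minimal sketch, not the reference implementation of Yang & Deb: it assumes one egg per nest, minimizes a cost function, and uses a Gaussian perturbation in place of a Lévy flight to keep the example short. The function name, the step size 0.5, and the default parameters are illustrative assumptions.

```python
import random


def cuckoo_search(f, dim, n_nests=15, p=0.25, iters=200, lo=-5.0, hi=5.0, seed=0):
    """Minimize f over [lo, hi]^dim with one egg (solution) per nest."""
    rng = random.Random(seed)

    def new_nest():
        return [rng.uniform(lo, hi) for _ in range(dim)]

    nests = [new_nest() for _ in range(n_nests)]
    cost = [f(x) for x in nests]
    for _ in range(iters):
        # Rule 1: one new egg, derived from a random nest, is dumped in a
        # randomly chosen nest and replaces its egg only if it is better.
        parent = nests[rng.randrange(n_nests)]
        egg = [min(hi, max(lo, x + rng.gauss(0.0, 0.5))) for x in parent]
        target = rng.randrange(n_nests)
        egg_cost = f(egg)
        if egg_cost < cost[target]:
            nests[target], cost[target] = egg, egg_cost
        # Rule 2: the best nest always carries over to the next generation.
        best = min(range(n_nests), key=cost.__getitem__)
        # Rule 3: every other host discovers the alien egg with probability p
        # and abandons the nest for a new one at a random location.
        for i in range(n_nests):
            if i != best and rng.random() < p:
                nests[i] = new_nest()
                cost[i] = f(nests[i])
    best = min(range(n_nests), key=cost.__getitem__)
    return nests[best], cost[best]


def sphere(x):
    """Illustrative test function with its global minimum 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_cost = cuckoo_search(sphere, dim=2)
```

Because the best nest is never abandoned and an egg is replaced only by a better one, the best cost found so far never gets worse from one generation to the next.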

9

Lévy Flights

▪ In nature, animals search for food in a random or quasi-random manner.

▪ Generally, the foraging path of an animal is effectively a random walk because the next move is based on both the current location/state and the transition probability to the next location.

▪ The chosen direction implicitly depends on a probability, which can be modeled mathematically.

▪ A Lévy flight is a random walk in which the step-lengths are distributed according to a heavy-tailed probability distribution.

▪ After a large number of steps, the distance from the origin of the random walk tends to a stable distribution.
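
One common way to draw heavy-tailed step lengths is Mantegna's algorithm. The slides do not say which sampler is used, so the sketch below is an assumption; the exponent `beta = 1.5` is a frequently used default, not a value from the slides.

```python
import math
import random


def levy_step(beta=1.5, rng=random):
    """Draw one heavy-tailed step via Mantegna's algorithm (assumed sampler)."""
    # Scale of the numerator so that u / |v|**(1/beta) approximates a
    # symmetric Levy-stable variable with index beta.
    sigma_u = (
        math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
        / (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))
    ) ** (1 / beta)
    u = rng.gauss(0.0, sigma_u)
    v = rng.gauss(0.0, 1.0)
    return u / abs(v) ** (1 / beta)


def levy_walk(n_steps, beta=1.5, seed=0):
    """1-D random walk whose increments are Levy steps."""
    rng = random.Random(seed)
    position, path = 0.0, [0.0]
    for _ in range(n_steps):
        position += levy_step(beta, rng)
        path.append(position)
    return path


path = levy_walk(1000)
```

Most steps are small, with occasional very long jumps; this mix of local and long-range moves is what makes Lévy flights attractive for global search.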

10

11

▪ The habitat array holds the values of the input variables, which are floating-point numbers. Eq. (1) shows the habitat array:

$\text{habitat} = [x_1, x_2, x_3, \ldots, x_{N_{var}}]$ (1)

▪ The profit of a habitat is obtained by evaluating the profit function $f_p$ at that habitat, as shown in Eq. (2):

$\text{profit} = f_p(\text{habitat}) = f_p(x_1, x_2, x_3, \ldots, x_{N_{var}})$ (2)
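
In code, a habitat is simply a list of floats and the profit function maps it to a number. The profit function below (a negated sum of squares, so that the maximum is at the origin) is a hypothetical stand-in; the slides later use clustering accuracy as the profit.

```python
def profit(habitat):
    """Hypothetical profit function f_p: larger is better, maximum 0 at the origin."""
    return -sum(x * x for x in habitat)


habitat = [0.5, -1.0, 2.0]   # [x1, x2, x3] with N_var = 3
print(profit(habitat))       # prints -5.25
```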

12

▪ To begin the optimization, the algorithm generates a habitat matrix of size Npop × Nvar and assigns a random number of eggs to each habitat. In nature, each cuckoo lays between 5 and 20 eggs; these numbers are used as the lower and upper limits on the eggs of each cuckoo in each iteration. Cuckoos also lay their eggs within a maximum distance from their habitat. This maximum range is called the Egg Laying Radius (ELR). Eq. (3) defines it, where α is an integer that regulates the maximum value of the ELR, and $var_{hi}$ and $var_{low}$ are the upper and lower limits of the variables.

▪ $\mathrm{ELR} = \alpha \times \dfrac{\text{number of current cuckoo's eggs}}{\text{total number of eggs}} \times (var_{hi} - var_{low})$ (3)
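
The ELR of a cuckoo grows with its share of all eggs: it is α times (the cuckoo's eggs divided by the total number of eggs) times the variable range. A small sketch with made-up egg counts and bounds:

```python
def egg_laying_radius(alpha, eggs_of_cuckoo, total_eggs, var_hi, var_low):
    """ELR = alpha * (eggs of this cuckoo / total eggs) * (var_hi - var_low)."""
    return alpha * (eggs_of_cuckoo / total_eggs) * (var_hi - var_low)


eggs = [5, 10, 20]                  # illustrative egg counts for three cuckoos
total = sum(eggs)
radii = [egg_laying_radius(1, e, total, var_hi=5.0, var_low=-5.0) for e in eggs]
```

With α = 1, the radii of all cuckoos add up to the full variable range (here 10), so a cuckoo with four times as many eggs searches a radius four times as large.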

13

▪ The parameters a and Q help cuckoos search new areas; they are drawn for each cuckoo as follows:

▪ a ∼ U[0, 1]

▪ Q ∼ U[−w, w]

▪ The parameter a is a random value between 0 and 1. The parameter w constrains the deviation from the goal habitat. A value of w = π/6 appears to be enough for the population to converge to the global maximum profit.
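
A sketch of one migration step, assuming a 2-D search space so that the deviation Q can be applied as a plane rotation: the cuckoo covers a fraction a of the distance to the goal habitat, along a direction turned by the angle Q. The function name and the 2-D restriction are assumptions made for illustration.

```python
import math
import random


def migrate(position, goal, w=math.pi / 6, rng=random):
    """Move a 2-D cuckoo toward the goal: fraction a of the way, deviated by Q."""
    a = rng.random()           # a ~ U[0, 1]
    q = rng.uniform(-w, w)     # Q ~ U[-w, w], in radians
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    # Rotate the direction to the goal by q, then scale it by a.
    rx = math.cos(q) * dx - math.sin(q) * dy
    ry = math.sin(q) * dx + math.cos(q) * dy
    return (position[0] + a * rx, position[1] + a * ry)


rng = random.Random(0)
new_position = migrate((4.0, 3.0), (0.0, 0.0), rng=rng)
```

With |Q| ≤ π/6 the move never increases the distance to the goal, because the squared distance after the move is d²(1 + a² − 2a cos Q) and 2 cos Q > 1 ≥ a.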

14

Main steps in the proposed COA.

• Algorithm 1: Main steps in the proposed COA
• 1. Initialize cuckoo habitats with some random points on the profit function (accuracy).
• 2. Dedicate some eggs to each cuckoo.
• 3. Define the ELR for each cuckoo:
• $\mathrm{ELR} = \alpha \times \dfrac{\text{number of current cuckoo's eggs}}{\text{total number of eggs}} \times (var_{hi} - var_{low})$
• 4. Let cuckoos lay eggs inside their corresponding ELR.
• 5. Kill those eggs that are recognized by host birds.
• 6. Let eggs hatch and the chicks grow.
• 7. Evaluate the habitat of each newly grown cuckoo.
• 8. Limit the maximum number of cuckoos in the environment and kill those that live in the worst habitats.
• 9. Cluster the cuckoos, find the best group, and select the goal habitat.
• 9.1 Clustering with the K-means method: minimize $J = \sum_{j=1}^{K} \sum_{i=1}^{n} \lVert x_i^{(j)} - c_j \rVert^2$
• 10. Let the new cuckoo population immigrate toward the goal habitat.
• 11. If the stop condition is satisfied, stop; if not, go to 2.
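
The steps of Algorithm 1 can be sketched compactly. This is a simplified illustration under several assumptions, not the author's implementation: step 9 skips K-means and takes the single best habitat as the goal, step 5 removes the least profitable 10% of each cuckoo's eggs, eggs are laid uniformly within ±ELR per coordinate, migration uses only the fraction a (no angular deviation), and a fixed iteration budget serves as the stop condition. All names and defaults are illustrative.

```python
import random


def coa(profit, n_var, var_low, var_hi, n_pop=5, max_pop=10, alpha=1.0,
        iters=50, seed=0):
    """Simplified COA sketch that maximizes profit over [var_low, var_hi]^n_var."""
    rng = random.Random(seed)

    def clip(x):
        return min(var_hi, max(var_low, x))

    # Step 1: random initial habitats.
    cuckoos = [[rng.uniform(var_low, var_hi) for _ in range(n_var)]
               for _ in range(n_pop)]
    for _ in range(iters):  # Step 11: fixed budget as the stop condition (assumption).
        # Step 2: each cuckoo gets between 5 and 20 eggs.
        eggs = [rng.randint(5, 20) for _ in cuckoos]
        total = sum(eggs)
        chicks = []
        for habitat, n_eggs in zip(cuckoos, eggs):
            # Step 3: egg laying radius of this cuckoo.
            elr = alpha * (n_eggs / total) * (var_hi - var_low)
            # Step 4: lay eggs within +/- ELR of the habitat (per coordinate).
            laid = [[clip(x + rng.uniform(-elr, elr)) for x in habitat]
                    for _ in range(n_eggs)]
            # Step 5: hosts destroy the least profitable 10% (assumption).
            laid.sort(key=profit, reverse=True)
            chicks.extend(laid[:int(0.9 * n_eggs)])
        # Steps 6-8: chicks grow up; only the max_pop best habitats survive.
        cuckoos = sorted(cuckoos + chicks, key=profit, reverse=True)[:max_pop]
        # Step 9 (simplified): the best habitat is the goal, no K-means grouping.
        goal = cuckoos[0]
        # Step 10: each cuckoo covers a random fraction a of the way to the goal.
        moved = []
        for habitat in cuckoos:
            a = rng.random()
            moved.append([clip(x + a * (g - x)) for x, g in zip(habitat, goal)])
        cuckoos = moved
    return max(cuckoos, key=profit)


def neg_sphere(habitat):
    """Illustrative profit with its maximum 0 at the origin."""
    return -sum(x * x for x in habitat)


best_habitat = coa(neg_sphere, n_var=2, var_low=-5.0, var_hi=5.0)
```

The goal habitat does not move during migration, so the best profit in the population never decreases between iterations.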

15

Proposed data clustering method

▪ Question:

▪ Find the optimal assignment of N objects with M attributes to K clusters, such that the sum of squared Euclidean distances between each object and the center of the cluster it belongs to is minimized.
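
Written as a formula, the question reads as follows. The symbols are introduced here for illustration and are not taken from the slides: $x_i \in \mathbb{R}^M$ is object $i$, $s_i \in \{1, \ldots, K\}$ is the cluster it is assigned to, and $c_k$ is the center of cluster $k$.

```latex
\min_{s,\,c} \; \sum_{k=1}^{K} \; \sum_{i \,:\, s_i = k} \lVert x_i - c_k \rVert^2,
\qquad
\lVert x_i - c_k \rVert^2 = \sum_{m=1}^{M} \left( x_{im} - c_{km} \right)^2
```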

16

1) The algorithm generates R cuckoo agents to build solutions, where R lies between the minimum and maximum number of cuckoos.

2) Each agent has an initially empty solution string S of length N, where each element corresponds to one of the test samples. The value of an element is the number of the cluster to which that sample is assigned, a value between 1 and K.

3) A cost is calculated for each solution string with a cost function, and the solution with the minimum cost is selected as the best clustering.

Note: across iterations, the cuckoo population changes in size but remains bounded.

17

Illustration example

▪ 1) Create the input dataset.

▪ Table 1 shows a dataset with N = 8 objects and M = 10 attributes, which we want to group into K = 3 clusters:

18

▪ 2) R cuckoo agents are created; they form the initial cuckoo population:

▪ $S_i = (K - 1) \times \mathrm{Rand}(1, N) + 1$

▪ (N random numbers are generated for each solution string. Each number should be between 1 and K, as shown in Table 2.)
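
A sketch of how such an initial population could be generated. The expression (K − 1) · rand + 1 yields real values in [1, K); rounding them to the nearest integer to obtain cluster numbers is an assumption made here, as is the function name.

```python
import random


def random_solution(n_objects, k, rng=random):
    """One solution string: a cluster number in 1..k for each object."""
    # (k - 1) * rand + 1 is real-valued in [1, k); rounding to the nearest
    # integer (an assumption) turns it into a label between 1 and k.
    return [round((k - 1) * rng.random() + 1) for _ in range(n_objects)]


rng = random.Random(0)
R, N, K = 5, 8, 3            # R agents, N objects, K clusters
population = [random_solution(N, K, rng) for _ in range(R)]
```

With this rounding the labels 1 and K are drawn about half as often as the inner labels; `rng.randint(1, k)` would give a uniform choice if that matters.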

19

▪ 3) A fitness is calculated for each solution.

▪ 4) Create the cluster matrix:

▪ 5) Calculate the cluster sizes:

▪ The number of elements in category i is called li, where the index i takes a value between 1 and K. (Here the element count is l1 = 3 for category K1, l2 = 3 for category K2, and l3 = 2 for category K3.)
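
Grouping the objects by cluster number and counting them can be sketched as below. The solution string is hypothetical (Table 2 is not reproduced in this transcript); it was chosen so that the cluster sizes match the counts 3, 3, and 2 mentioned above.

```python
def cluster_members(solution, k):
    """Map each cluster number 1..k to the (1-based) objects assigned to it."""
    members = {c: [] for c in range(1, k + 1)}
    for obj, c in enumerate(solution, start=1):
        members[c].append(obj)
    return members


solution = [1, 2, 1, 3, 2, 1, 2, 3]      # hypothetical solution string, N = 8, K = 3
members = cluster_members(solution, 3)
sizes = {c: len(objs) for c, objs in members.items()}
print(members)   # prints {1: [1, 3, 6], 2: [2, 5, 7], 3: [4, 8]}
print(sizes)     # prints {1: 3, 2: 3, 3: 2}
```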

20

▪ 6) One element of each category is selected as its cluster center: the algorithm generates K random numbers, one per category, each between 1 and the li of that category.

▪ 7) The cluster centers are determined by these random numbers, and the squared Euclidean distance from each object to the center of its cluster is calculated.

▪ 8) Calculate the fitness value:

▪ Fitness(Ih)=

21

(In this table, the objects with zero fitness are the cluster centers.)

9) Calculate the cost value: Cost value=
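
Steps 6 to 9 can be sketched as follows. The exact fitness and cost formulas are not preserved in this transcript, so the code makes two labeled assumptions: the fitness of an object is its squared Euclidean distance to the randomly chosen center of its cluster, and the cost of a solution is the sum of all fitness values. The data and names are illustrative.

```python
import random


def fitness_and_cost(data, solution, k, rng=random):
    """Per-object fitness and total cost of one solution string."""
    fitness = [0.0] * len(data)
    for c in range(1, k + 1):
        idx = [i for i, s in enumerate(solution) if s == c]
        if not idx:
            continue                      # an empty cluster contributes nothing
        # Step 6: a randomly chosen member acts as the cluster center.
        center = data[rng.choice(idx)]
        # Steps 7-8 (assumed form): squared Euclidean distance to the center.
        for i in idx:
            fitness[i] = sum((a - b) ** 2 for a, b in zip(data[i], center))
    # Step 9 (assumed form): the cost is the sum of the fitness values.
    return fitness, sum(fitness)


data = [(0, 0), (1, 0), (0, 2), (5, 5), (6, 5)]   # illustrative 2-D objects
solution = [1, 1, 1, 2, 2]
fitness, cost = fitness_and_cost(data, solution, k=2, rng=random.Random(0))
```

Each object picked as a center has fitness zero, which matches the remark that the zero-fitness objects in the table are the cluster centers.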

22

Local Search Procedure

– Many meta-heuristic algorithms use some form of local search procedure to improve the solutions they discover.

– In this work, after the costs are calculated, we apply a further optimization step. Since the cluster with the most members is the most likely to have the largest error, we generate new random cluster numbers for H% of the members of the largest cluster and calculate the cost again for this new solution.
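
A sketch of this perturbation step. The function name and the way H% is rounded down to a whole number of objects are assumptions.

```python
import random


def local_search(solution, k, h_percent, rng=random):
    """Reassign H% of the largest cluster's members to random clusters."""
    sizes = {c: solution.count(c) for c in range(1, k + 1)}
    largest = max(sizes, key=sizes.get)
    idx = [i for i, c in enumerate(solution) if c == largest]
    n_change = int(len(idx) * h_percent / 100)    # rounded down (assumption)
    new_solution = list(solution)
    for i in rng.sample(idx, n_change):
        new_solution[i] = rng.randint(1, k)       # may draw the old cluster again
    return new_solution


solution = [1, 1, 1, 1, 2, 3, 1, 2]               # cluster 1 is the largest (5 members)
candidate = local_search(solution, k=3, h_percent=40, rng=random.Random(0))
```

The cost of `candidate` would then be recalculated; keeping it only when the cost improves is one natural acceptance rule, though the slides do not state which rule is used.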

23

Fuzzy Cuckoo Optimization

– After the fitness values are calculated, a fuzzy system determines a fuzzy value for each string:

– 1) Here we use a Mamdani fuzzy model with 243 different rules.

– 2) The input values are the calculated fitness values, and the output of the system is the desired cost.

24

[Figure: input values and output values of the fuzzy system]

25

▪ 3) Calculate the fuzzy cost values for all given solutions:

▪ $\text{Costvalue}(I_h) = w(I_h) \times \text{fitness}(I_h)$

▪ (w is the fuzzy weight computed for the fitness value from the input diagram and the Mamdani model.)

26

Result of FCOA & COA

ER= * 100

27

Comparison Table

28

Any questions?

Thanks