Transcript
Page 1: Transfer Learning, Soft Distance-Based Bias, and the Hierarchical BOA

Transfer Learning, Soft Distance-Based Bias, and the Hierarchical BOA

Martin Pelikan
Missouri Estimation of Distribution Algorithms Laboratory (MEDAL)
University of Missouri, St. Louis, MO
E-mail: [email protected]
WWW: http://martinpelikan.net/

Mark W. Hauschild
Missouri Estimation of Distribution Algorithms Laboratory (MEDAL)
University of Missouri, St. Louis, MO
E-mail: [email protected]

Pier Luca Lanzi
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Milano, Italy
E-mail: [email protected]
WWW: http://www.pierlucalanzi.net/

Background
• Model-directed optimizers (MDOs), such as estimation of distribution algorithms, learn and use models to solve difficult optimization problems scalably and reliably.
• MDOs often provide more than the solution; they provide a set of models that reveal information about the problem. Why not use that information in future runs?

Purpose
• Combine prior models with a problem-specific distance metric to solve new problem instances with increased speed, accuracy, and reliability.
• Focus on the hBOA algorithm and additively decomposable functions, although the approach can be generalized to other MDOs and other problem classes.
• Extend previous work, mainly to demonstrate that:
  • Previous MDO runs on smaller problems can be used to bias runs on larger problems.
  • Previous MDO runs for one problem class can be used to bias runs for another problem class.

Hierarchical Bayesian optimization algorithm, hBOA
• Models allow hBOA to learn and use problem structure.
• To build models, hBOA uses Bayesian metrics that combine the current population and prior knowledge (via prior distributions of structures and parameters).
[Diagram of the hBOA loop: current population → selected population → Bayesian network model → new population.]

Basic idea of the approach
• Outline of the approach:
  1. Define distances between problem variables.
  2. Mine probabilistic models from previous runs for model regularities with respect to distances.
  3. Use the obtained information to bias model building when solving new problem instances.
• Apply the approach to hBOA and additively decomposable functions (ADFs).
• Distance metric developed for ADFs, but other metrics can be used.
• Bias incorporated by modifying prior probabilities of network structures when building the models.
• Strength of bias can be controlled using a parameter κ > 0 (a sketch of the mining and biasing steps follows below).
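The mining and biasing steps can be pictured with a short sketch. This is a minimal illustration, not the exact prior used in the MEDAL reports: it assumes previous models are available as sets of undirected edges, and that the bias enters the Bayesian scoring metric as an additive log-prior term scaled by κ; the names mine_edge_stats, log_prior_bonus, and prior_models are ours.

import math
from collections import defaultdict

def mine_edge_stats(prior_models, dist, n):
    """For each distance d, estimate how often a pair of variables at
    distance d was connected by an edge in models from previous runs.
    prior_models: list of models, each a set of frozenset({i, j}) edges.
    dist: precomputed variable distance matrix (e.g., the ADF metric)."""
    total = defaultdict(int)  # (pair, model) combinations per distance
    hits = defaultdict(int)   # how many of those contained the edge
    for i in range(n):
        for j in range(i + 1, n):
            d = dist[i][j]
            total[d] += len(prior_models)
            hits[d] += sum(frozenset((i, j)) in m for m in prior_models)
    return {d: hits[d] / total[d] for d in total}

def log_prior_bonus(i, j, dist, p_by_dist, kappa, eps=1e-6):
    """Hypothetical log-prior term added to the network scoring metric
    for a candidate edge (i, j); kappa > 0 controls the bias strength."""
    p = max(p_by_dist.get(dist[i][j], 0.0), eps)  # avoid log(0)
    return kappa * math.log(p)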

Distance metric for ADFs
• ADF: f(X_1, ..., X_n) = Σ_i f_i(S_i), where {S_i} are subsets of variables and {f_i} are arbitrary functions.
• Create a graph for the ADF with one node per variable by connecting variables that are in the same subproblem.
• The number of edges along the shortest path between two nodes defines their distance; for disconnected variables, the distance is equal to the number of variables.
• Other distance metrics can be used (e.g., for QAP and scheduling).

Selected results
• Problem classes:
  • Nearest-neighbor NK landscapes.
  • Spin glasses (2D and 3D).
  • MAXSAT for transformed graph coloring.
  • Minimum vertex cover for random graphs.
• Use bias from smaller problems on bigger problems (compared to the bias from problems of the same size).
  [Plots: NK landscapes, MVC, spin glass (2D).]
• Use bias from another problem class.
  [Plots: NK landscapes, MVC (easier), MVC (harder).]
• Summary of results (many results omitted here):
  • Significant speedups obtained in all test cases.
  • Bias based on runs on smaller problems and on other problem classes yields significant speedups as well.
  • This demonstrates the flexibility, practicality, and potential of the approach.
  • The approach can be adapted to other MDOs (LTGA, ECGA, DSMGA, ...).

[Figure 9: Speedups obtained on NK landscapes and 2D spin glasses without using local search. (a) NK landscapes with nearest neighbors (NK, n = 50, 60, 70; k = 5); (b) 2D ±J Ising spin glass (SG 2D, n = 64, 100, 144). Panels plot multiplicative speedup w.r.t. CPU time and percentage of instances with improved execution time against kappa (strength of bias, 1–10), and average CPU speedup (multiplicative) against problem size n for Kappa = 2, 4, 6, 8, 10; the base case corresponds to no speedup.]

[Figure 10: Speedups obtained on all test problems except for MAXSAT with bias based on models obtained from problems of smaller size, compared to the base case with bias based on the same problem size. (a) NK landscapes with nearest neighbors, n = 200, k = 5 (bias from n = 150 vs. n = 200); (b) minimum vertex cover, n = 200, c = 2; (c) minimum vertex cover, n = 200, c = 4; (d) 2D ±J Ising spin glass, n = 20 × 20 = 400 (bias from n = 324 vs. n = 400); (e) 3D ±J Ising spin glass, n = 7 × 7 × 7 = 343 (bias from n = 216 vs. n = 343). All panels plot multiplicative speedup w.r.t. CPU time against kappa (strength of bias, 1–10).]



[Figure 11: Speedups obtained on NK landscapes and minimum vertex cover with the bias from models on another problem class, compared to the base case (the base case corresponds to 10-fold cross-validation using the models obtained on problems from the same class). (a) NK landscapes with nearest neighbors, n = 200, k = 5; (b) minimum vertex cover, n = 200, c = 2; (c) minimum vertex cover, n = 200, c = 4. Series: models from NK, models from MVC (c = 2), models from MVC (c = 4); all panels plot multiplicative speedup w.r.t. CPU time against kappa (strength of bias, 1–10).]

of experiments; the results of these experiments can be found in fig. 11. Specifically, the figure presents the results of using bias from the NK landscapes while solving the minimum vertex cover instances for the two sets of graphs (with c = 2 and c = 4), the results of using bias from the minimum vertex cover instances for both classes of graphs when solving the NK landscapes, and also the results of using bias from the minimum vertex cover on another class of graphs (with different c). It can be seen that significant speedups were obtained even when using models from another problem class, although in some cases the speedups obtained when using the same problem class were significantly better. Some combinations of problem classes discussed in this paper were omitted because of the high computational demands of testing every pair of problems against each other. Despite that, the results presented here indicate that even the bias based on problems from a different class can be beneficial.

5.3.5 Combining Distance-Based Bias and Sporadic Model Building

While most efficiency enhancement techniques provide significant speedups in their own right (Hauschild and Pelikan, 2008; Hauschild et al., 2012; Lozano et al., 2001; Mendiburu et al., 2007; Ocenasek, 2002; Pelikan et al., 2008; Sastry et al., 2006), their combinations are usually significantly more powerful because the speedups often multiply. For example, with a speedup of 4 due to sporadic model building (Pelikan et al., 2008) and a speedup of 20 due to parallelization using 20 computational nodes (Lozano et al., 2001; Ocenasek and Schwarz, 2000), combining the two techniques can be expected to yield speedups of about 80; obtaining comparable speedups using parallelization on its own would require at least 4 times as many computational nodes. How will the distance-based bias from prior hBOA runs combine with other efficiency enhancements?

To illustrate the benefits of combining the distance-based bias discussed in this paper with other efficiency enhancement techniques, we combined the approach with sporadic model building and tested the two efficiency enhancement techniques separately and in combination. Fig. 12 shows the results of these experiments. The results show that even in cases where the speedup due to distance-based bias is only moderate, the overall speedup increases substantially due to the combination of the two techniques. Nearly multiplicative speedups can be observed in most scenarios.

6 Thoughts on Extending the Approach to Other Model-Directed Optimizers

6.1 Extended Compact Genetic Algorithm

The extended compact genetic algorithm (ECGA) (Harik, 1999) uses marginal product models (MPMs) to guide optimization. An MPM splits variables into disjoint linkage groups. The distribution of variables in each group is encoded by a probability table that specifies a probability for each combination of values of these variables. The overall probability distribution is given by the product of the distributions of the individual linkage groups. In comparison with Bayesian network models, instead of connecting variables that influence each other strongly using directed edges, ECGA puts such variables into the same linkage group.
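As a rough illustration (our own sketch, not code from Harik (1999)), an MPM can be represented as a partition of variable indices into linkage groups with one probability table per group; sampling draws each group independently, so the overall distribution is the product of the group distributions. The table values below are hypothetical.

import random

# An MPM: a partition of variable indices into disjoint linkage groups,
# plus one probability table per group mapping each combination of the
# group's values to its probability.
mpm = [
    ((0, 1), {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}),
    ((2,),   {(0,): 0.7, (1,): 0.3}),
]

def sample(mpm, n):
    """Sample one candidate solution: each linkage group is drawn
    independently, so the overall distribution is their product."""
    x = [0] * n
    for group, table in mpm:
        combos = list(table)
        values = random.choices(combos, weights=[table[c] for c in combos])[0]
        for var, val in zip(group, values):
            x[var] = val
    return x

print(sample(mpm, 3))  # e.g., [1, 1, 0]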


More information
• Download MEDAL Reports No. 2012004 and 2012007:
  http://medal-lab.org/files/2012004.pdf
  http://medal-lab.org/files/2012007.pdf

Acknowledgments
• NSF grants ECS-0547013 and IIS-1115352.
• ITS at the University of Missouri in St. Louis.
• University of Missouri Bioinformatics Consortium.
• Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

http://medal-lab.org

in hBOA for additively decomposable problems based on a problem-specific distance metric. However, note that the framework can be applied to many other model-directed optimization techniques and the distance function can be defined in many other ways. To illustrate this, we outline how this approach can be extended to several other model-directed optimization techniques in section 6.

4 Distance-Based Bias

4.1 Additively Decomposable Functions

For many optimization problems, the objective function (fitness function) can be expressed in the form of an additively decomposable function (ADF) of m subproblems:

f(X_1, ..., X_n) = \sum_{i=1}^{m} f_i(S_i),    (5)

where (X_1, ..., X_n) are the problem's decision variables, f_i is the ith subfunction, and S_i ⊆ {X_1, X_2, ..., X_n} is the subset of variables contributing to f_i (subsets {S_i} can overlap). While there may often exist multiple ways of decomposing the problem using additive decomposition, one would typically prefer decompositions that minimize the sizes of the subsets {S_i}. As an example, consider the following objective function for a problem with 6 variables:

f_example(X_1, X_2, X_3, X_4, X_5, X_6) = f_1(X_1, X_2, X_5) + f_2(X_3, X_4) + f_3(X_2, X_5, X_6).

In the above objective function, there are three subsets of variables, S_1 = {X_1, X_2, X_5}, S_2 = {X_3, X_4}, and S_3 = {X_2, X_5, X_6}, and three subfunctions {f_1, f_2, f_3}, each of which can be defined arbitrarily.
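To make the example concrete, a minimal sketch of how f_example could be represented and evaluated follows; the subsets S_i come from the text, while the particular subfunctions are hypothetical, since f_1, f_2, and f_3 may be defined arbitrarily.

# The example ADF. The subsets S_i come from the text; the subfunction
# values are hypothetical, since f1, f2, f3 are arbitrary.
subsets = [(0, 1, 4), (2, 3), (1, 4, 5)]  # S1={X1,X2,X5}, S2={X3,X4}, S3={X2,X5,X6}

def f1(x): return sum(x)        # hypothetical subfunctions on 0/1 variables
def f2(x): return x[0] ^ x[1]
def f3(x): return int(all(x))
subfunctions = [f1, f2, f3]

def f_example(x):
    """Evaluate the ADF as the sum of the subfunction contributions."""
    return sum(fi(tuple(x[v] for v in Si))
               for fi, Si in zip(subfunctions, subsets))

print(f_example([1, 0, 1, 1, 0, 1]))  # f1=1, f2=0, f3=0 -> 1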

It is of note that the difficulty of ADFs is not fully determined by the order (size) of subproblems, but also by the definition of the subproblems and their interaction. In fact, there exist a number of NP-complete problems that can be formulated as ADFs with subproblems of order 2 or 3, such as MAXSAT for 3-CNF formulas. On the other hand, one may easily define ADFs with subproblems of order n that can be solved by a simple bit-flip hill climbing in low-order polynomial time.

4.2 Measuring Variable Distances for ADFs

The definition of a distance between two variables of an ADF used in this paper is based on the work of Hauschild and Pelikan (2008) and Hauschild et al. (2012). Given an additively decomposable problem with n variables, we define the distance between two variables using a graph G of n nodes, one node per variable. For any two variables X_i and X_j in the same subset S_k, that is, X_i, X_j ∈ S_k, we create an edge in G between the nodes X_i and X_j. See fig. 2 for an example of an ADF and the corresponding graph. Denoting by l_{i,j} the number of edges along the shortest path between X_i and X_j in G, we define the distance between two variables as

D(X_i, X_j) = { l_{i,j}  if a path between X_i and X_j exists,
              { n        otherwise.                                (6)

Fig. 2 illustrates the distance metric on a simple example. The above distance measure makes variables in the same subproblem close to each other, whereas for the remaining variables, the distances correspond to the length of the chain of subproblems that relates the two variables. The distance is maximal for variables that are completely independent (the value of one variable does not influence the contribution of the other variable in any way).
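As an illustration, the metric of eq. (6) can be computed by breadth-first search over the graph G. The following minimal sketch assumes the decomposition is given as a list of variable-index subsets; the function name adf_distances is ours, not the paper's.

from collections import deque

def adf_distances(n, subsets):
    """Distance matrix per eq. (6): shortest-path length l_{i,j} in the
    graph connecting variables that share a subproblem; n if no path."""
    # Connect all pairs of variables within each subset S_k.
    adj = [set() for _ in range(n)]
    for S in subsets:
        for i in S:
            for j in S:
                if i != j:
                    adj[i].add(j)
    # Breadth-first search from each variable; unreachable pairs keep n.
    D = [[n] * n for _ in range(n)]
    for src in range(n):
        D[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if D[src][v] == n:
                    D[src][v] = D[src][u] + 1
                    queue.append(v)
    return D

# The example ADF from section 4.1: S1={X1,X2,X5}, S2={X3,X4}, S3={X2,X5,X6}
D = adf_distances(6, [(0, 1, 4), (2, 3), (1, 4, 5)])
print(D[0][5])  # X1 and X6 linked via subproblems S1 and S3: distance 2
print(D[0][2])  # X1 and X3 are disconnected: distance n = 6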

Since interactions between problem variables are encoded mainly in the subproblems of the additive problem decomposition, the above distance metric should typically correspond closely to the likelihood of dependencies between problem variables in probabilistic models discovered by EDAs. Specifically, variables located closer with respect to the metric should be more likely to interact with each other. Fig. 3 illustrates this on two ADFs discussed later in this paper: the NK landscape with nearest-neighbor interactions and the two-dimensional Ising spin glass (for a description of these problems, see section 5.1). The figure analyzes probabilistic models discovered by hBOA in 10 independent runs on each of the 1,000 random instances for each problem and problem size. For a range of distances d between problem variables, the figure

