information theory, fitness and sampling semantics colin johnson / university of kent john woodward / university of stirling


Post on 24-Dec-2015



Schtick

• That we can use the set of ideas around entropy, information theory and algorithmic complexity as a way of assigning fitness in program synthesis systems.

•This involves the idea of an information distance between sampling semantics vectors and problem target definitions across a set of training cases.

• In particular we describe how to assign fitness to subprograms without putting them in the context of a whole program.

Semantics in GP etc.

[Diagram: Program Text → Canonical form of I/O mapping (Canonical-representation Semantics)]

Why do we Care about Semantics?

• In the end, problems cash out as input-output behaviour.

• By having an understanding of program semantics, we can:

  • avoid duplicating programs with different representations but the same I/O behaviour in the population

  • choose points for crossover/mutation in a more informed way

  • build new frameworks (such as geometric semantic GP) that manipulate program meanings.

Semantics in GP etc.

[Diagram: Program Text → Canonical form of I/O mapping (Canonical-representation Semantics); Program Text → Vector of outputs on training set (Sampling Semantics)]

Sampling Semantics

•Sampling semantics (O’Neill, Nguyen, et al.) are a data-driven way of defining a semantic representation for any kind of function.

•The sampling semantics of a function over a particular (ordered) training set T is simply a vector of the outputs of that function over T.

•This emphasis on the set of outputs (rather than just, say, a sum of errors) allows us to define metrics on pairs of population members e.g. to define how close they are in meaning.
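As a minimal sketch of the idea (the function names, the two-input Boolean training set and the choice of Hamming distance as the metric are assumptions made for illustration, not taken from the slides), a sampling semantics vector and a distance between two programs might look like this:

```python
from itertools import product

def sampling_semantics(func, training_set):
    """Vector of outputs of func over an ordered training set."""
    return tuple(func(*case) for case in training_set)

def hamming(u, v):
    """Number of training cases on which two semantics vectors disagree."""
    return sum(a != b for a, b in zip(u, v))

# Ordered training set: (0,0), (0,1), (1,0), (1,1).
training_set = list(product([0, 1], repeat=2))

xor_sem = sampling_semantics(lambda a, b: a ^ b, training_set)
or_sem = sampling_semantics(lambda a, b: a | b, training_set)
# A differently written program with the same input-output behaviour as XOR.
alt_xor_sem = sampling_semantics(lambda a, b: (a | b) & (1 - (a & b)), training_set)

print(xor_sem, or_sem, hamming(xor_sem, or_sem))
print(hamming(xor_sem, alt_xor_sem))  # 0: same meaning, different text
```

Two programs with different text but identical vectors are treated as the same point in semantic space, which is what makes duplicate detection and distance-based operators possible.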

What do we really want?

•GP assigns fitness on the basis of counting how many fitness cases are solved by each program in the population, or by summing up the total error.

•This is the wrong thing to measure.

•We want to measure whether sub-programs add information/structure that will make it easier for later parts of the program to solve the problem.

The Semantics of Wrong Programs

•Much of computer science is interested in reasoning about correct programs (or, reasoning about whether programs are correct).

•But, most programs are wrong most of the time during development.

•We need ideas that help us to reason about wrong programs, and their relationship to the target specification.

•Can we measure how much problem-specific structure a program fragment is generating?

Similarity Measures

• When are two things similar? For example, two programs, or the output from a program and the target value?

•Clearly, bitwise difference is not the most important thing.

Information Distance

• Instead of pointwise distance, consider information distance (Vitányi et al.).

•An example information distance is the length of the shortest program required to transform one thing into the other.

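• The program that outputs 10101010101010101010 against the target 01010101010101010101 is “better” than one that outputs 01000010110100010110, even though the latter is fitter on a conventional measure. (The first is the target with every bit flipped, so a very short program repairs it; the second matches the target in 12 of 20 positions.)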
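The contrast in the last bullet can be checked directly. The sketch below uses the three strings from the slide; the helper names are hypothetical, and "flip every bit" is used here only as an informal stand-in for a short transforming program:

```python
target = "01" * 10                  # 01010101010101010101
complement = "10" * 10              # 10101010101010101010
irregular = "01000010110100010110"

def hamming(x, y):
    """Pointwise (conventional) error: number of differing positions."""
    return sum(p != q for p, q in zip(x, y))

def difference(x, y):
    """Bit string marking the positions where x and y differ."""
    return "".join("1" if p != q else "0" for p, q in zip(x, y))

def flip_all(x):
    """A one-rule repair program: invert every bit."""
    return "".join("1" if c == "0" else "0" for c in x)

print(hamming(complement, target))   # 20: worst possible pointwise score
print(hamming(irregular, target))    # 8: better pointwise score
print(difference(complement, target))
print(difference(irregular, target))
print(flip_all(complement) == target)
```

The difference string for the complemented output is all ones, so a single rule describes the repair; the difference string for the other output has no equally obvious short description.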

Information Distance Fitness

•Combine the idea of information distance and sampling semantics to get a new notion of fitness.

•The fitness of a program fragment is the length of the smallest program required to transform the sampling semantics vector into the target vector.

•A computationally grounded notion of “wrong” should be grounded in how much computation is needed to make the program “right”.
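One way to write this definition down is sketched below. The notation is introduced here for illustration and is not from the slides: $T = (x_1, \dots, x_n)$ is the ordered training set, $s_T(p)$ the sampling semantics vector of fragment $p$, $t$ the target vector, and $|q|$ the length of a program $q$.

```latex
s_T(p) = \bigl(p(x_1), \dots, p(x_n)\bigr),
\qquad
\mathrm{fitness}(p) = \min \bigl\{\, |q| \;:\; q\bigl(s_T(p)\bigr) = t \,\bigr\}
```

Lower values are better under this reading: a fragment whose outputs already equal the target needs only a trivial transforming program.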

Programs by Accumulation

•Rather than the GP notion of a population of complete programs, we will find it easier to work with a set of program fragments.

•Let us call these fragments “theories”.

• Good theories represent partial solutions to all (or many) training cases, not complete solutions to some training cases. (We “cut horizontally” rather than “cut vertically”.)

•We can compare theories by their information distance to the target.

Assigning Quality to Program Fragments

• Most GP research to date assigns fitness to whole programs. That is, we need a complete program before we can assign fitness, and we don’t assign fitness directly to substructures.

• In machine learning (e.g. C4.5), we assign “fitness” to combinations of features using ideas such as information gain. That is, we assign a fitness to a partial “program”.
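For readers unfamiliar with information gain, here is a small sketch of how a single feature can be scored against target labels without building a complete classifier. The function names and the toy data are assumptions for illustration; this is the textbook entropy-based definition, not code from C4.5 or from the authors.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]     # splits the labels perfectly
uninformative = [0, 1, 0, 1]   # tells us nothing about the labels

print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

The score is attached to the feature alone, which is the sense in which a partial “program” receives a fitness.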

Which is best: f1 or f2?

Compressibility

•One way to describe strings or mappings is in terms of their algorithmic information complexity, such as the Kolmogorov complexity.

• Roughly speaking, this is the length of the shortest possible program required to compute the string.

  • So, for example, 1010101010101010 can be described by a shorter program than 1001010101111010.

• Kolmogorov complexity is not computable, but we can approximate it by running a compression algorithm on the string.
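A minimal sketch of the compression-based approximation, assuming Python's standard zlib module as the compressor (the slides mention gzip later; both use DEFLATE). The strings are made much longer than the 16-bit examples above because, on very short inputs, fixed header overhead dominates the compressed length.

```python
import random
import zlib

def compressed_length(bits: str) -> int:
    """Length in bytes of the zlib-compressed string: a rough upper-bound proxy for complexity."""
    return len(zlib.compress(bits.encode("ascii"), 9))

regular = "10" * 512                       # 1024 characters, period 2
rng = random.Random(0)                     # fixed seed so the example is repeatable
irregular = "".join(rng.choice("01") for _ in range(1024))

print(compressed_length(regular))
print(compressed_length(irregular))
```

The periodic string compresses to far fewer bytes than the pseudo-random one, mirroring the claim that it has a shorter description.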

Which is best: f1 or f2?

• The difference between f1 and parity is more compressible: we can find a shorter description of it.

Compression-based Program Synthesis (TDFcomp)

• Choose a set of functions F.

• Create a construction set C, initially containing all of the input variables.

• LOOP:

  • create a number (500) of sub-programs by applying functions from F to members of C

  • calculate the difference between the output from these sub-programs and the target on all inputs (Hamming)

  • choose the most compressible difference and add the relevant sub-program to C (gzip)

• UNTIL C contains a program that is a solution to the whole problem
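The loop above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the authors' TDFcomp code: the function set, the use of zlib in place of gzip, skipping candidates whose outputs are already in C, the iteration cap and every name below are choices made here. No claim is made that it reproduces the quality values shown in the runs.

```python
import itertools
import random
import zlib

# Function set F (an assumed choice; the slides do not list F in full).
FUNCTIONS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "XNOR": lambda a, b: 1 - (a ^ b),
}

def quality(outputs, target):
    """Compressed length of the pointwise difference to the target; lower is better."""
    diff = "".join(str(o ^ t) for o, t in zip(outputs, target))
    return len(zlib.compress(diff.encode("ascii"), 9))

def synthesise(n_vars=4, n_candidates=500, max_iterations=20, seed=0):
    rng = random.Random(seed)
    cases = list(itertools.product([0, 1], repeat=n_vars))
    # Even parity: output 1 when the number of set input bits is even.
    target = tuple(1 - sum(case) % 2 for case in cases)
    # Construction set C: (expression, sampling semantics vector) pairs.
    construction = [(f"v_{i}", tuple(case[i] for case in cases))
                    for i in range(n_vars)]
    for _ in range(max_iterations):  # assumption: cap the loop so it always ends
        seen = {outputs for _, outputs in construction}
        candidates = []
        for _ in range(n_candidates):
            name = rng.choice(sorted(FUNCTIONS))
            (ea, oa), (eb, ob) = rng.choice(construction), rng.choice(construction)
            outputs = tuple(FUNCTIONS[name](a, b) for a, b in zip(oa, ob))
            if outputs not in seen:  # assumption: skip semantics already in C
                candidates.append((f"({name} {ea} {eb})", outputs))
        if not candidates:
            break
        # Choose the candidate whose difference to the target compresses best.
        best = min(candidates, key=lambda c: quality(c[1], target))
        construction.append(best)
        if best[1] == target:
            return construction, True
    return construction, False

construction, solved = synthesise()
for expression, outputs in construction[4:]:
    print(expression)
print("solved:", solved)
```

Whether a given seed reaches a perfect solution within the cap depends on how the compressor breaks ties on such short strings, so the sketch reports success rather than assuming it.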

Example: 4-bit even parity

• run:

• 0: (XNOR v_3 v_2) with quality 26

• 1: (XNOR v_2 v_3) with quality 26

• 2: (XNOR v_2 v_3) with quality 26

• 3: (XOR v_2 v_3) with quality 26

• 4: (XOR v_3 v_2) with quality 26

• 5: (AND v_2 v_1) with quality 27

• 6: (XNOR v_1 v_2) with quality 27

• 7: (XOR v_2 v_1) with quality 27

• 8: (1st v_1 v_0) with quality 28

• 9: (OR v_2 v_0) with quality 28

• 10: (OR v_2 v_3) with quality 28

• ...

Typical Run (2)

• run:

• ***************** Iteration 1

• (XNOR v_2 v_3) with quality 26

• ***************** Iteration 2

• (XOR (XNOR v_2 v_3) v_1) with quality 24

• ***************** Iteration 3

• (XOR (XOR (XNOR v_2 v_3) v_1) v_0) with quality 23

• ######################################

• 1 perfect solution found, which is:

• (XOR (XOR (XNOR v_2 v_3) v_1) v_0) with quality 23

• BUILD SUCCESSFUL (total time: 1 second)

Typical Run (2)

• run:

• ***************** Iteration 1

• (XNOR v_3 v_2) with quality 26

• ***************** Iteration 2

• (XNOR (XNOR v_3 v_2) v_1) with quality 24

• ***************** Iteration 3

• (XOR (XNOR (XNOR v_3 v_2) v_1) v_0) with quality 23

• ***************** Iteration 4

• (NOT2 v_0 (XOR (XNOR (XNOR v_3 v_2) v_1) v_0)) with quality 23

• ######################################

• 14 perfect solutions found, which are:

• (NOT2 v_0 (XOR (XNOR (XNOR v_3 v_2) v_1) v_0)) with quality 23

• ....

...and traditional GP for contrast!

• run:

• 0 2.0 XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0))) XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0)) OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2))))) AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2)) XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3))) XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0))))))

• 1 2.0 XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0))) XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0)) OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2))))) AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2)) XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3))) XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0))))))

• 2 2.0 XOR(NAND(OR(OR(NAND(OR(d3 d1) AND(d0 d0)) XOR(AND(d3 d0) OR(d0 d0))) XOR(XOR(NAND(d1 d3) XOR(d0 d2)) XOR(XOR(d2 d0) XOR(d2 d0)))) AND(AND(OR(OR(d2 d1) XOR(d3 d0)) OR(AND(d2 d3) XOR(d3 d0))) OR(NAND(NAND(d2 d0) AND(d2 d2)) XOR(OR(d0 d1) OR(d0 d2))))) AND(OR(NAND(XOR(NAND(d2 d2) AND(d2 d0)) NAND(XOR(d1 d1) XOR(d0 d1))) XOR(XOR(AND(d1 d2) OR(d0 d2)) XOR(OR(d3 d0) OR(d1 d2)))) NAND(OR(XOR(NAND(d3 d1) XOR(d1 d2)) AND(OR(d1 d2) AND(d3 d3))) XOR(NAND(NAND(d3 d0) OR(d3 d1)) XOR(XOR(d3 d2) AND(d1 d0))))))

• 3 1.0 XOR(XOR(XOR(XOR(XOR(OR(d2 d3) AND(d3 d3)) OR(NAND(d0 XOR(d2 d2)) AND(d0 d2))) OR(OR(NAND(d3 d3) OR(d2 d3)) AND(XOR(d2 d3) OR(d3 d2)))) XOR(OR(NAND(NAND(d1 d3) NAND(d3 d0)) NAND(AND(d1 d3) XOR(d1 d3))) NAND(OR(XOR(d1 d3) NAND(d0 d2)) OR(AND(d2 d0) XOR(d0 d1))))) NAND(XOR(AND(OR(OR(d2 d3) OR(d0 d1)) OR(AND(d1 d0) AND(d3 d3))) AND(AND(XOR(d0 d3) AND(d0 d1)) XOR(AND(d2 d1) OR(d3 d0)))) NAND(NAND(AND(d2 XOR(d3 d0)) XOR(OR(d1 d3) d2)) AND(OR(AND(d2 d2) AND(d1 d1)) XOR(AND(d3 d2) XOR(d3 d3))))))

• 4 0.0 XOR(XOR(XOR(XOR(XOR(OR(d2 d3) AND(d3 d3)) OR(NAND(d0 d2) AND(d0 d2))) OR(OR(NAND(d3 d3) OR(d2 d3)) AND(XOR(d2 d3) OR(d3 d2)))) XOR(OR(NAND(NAND(XOR(d3 d0) d3) NAND(d3 d0)) NAND(AND(d1 d3) XOR(d1 d3))) NAND(OR(XOR(d1 d3) NAND(d0 d2)) OR(AND(d2 d0) XOR(d0 d1))))) NAND(XOR(AND(OR(OR(d2 d3) OR(d0 d1)) OR(AND(OR(d1 d2) d0) AND(d3 d3))) AND(AND(XOR(d0 d3) AND(d0 d1)) XOR(AND(d2 d1) OR(d3 d0)))) NAND(NAND(d1 XOR(OR(d1 d3) d1)) AND(OR(AND(d2 d2) AND(d1 d1)) XOR(AND(d3 d2) XOR(d3 d3))))))

• BUILD SUCCESSFUL (total time: 3 seconds)

...is this a fair example?

•Perhaps this isn’t the fairest of examples.

•The parity problem has the advantage that, once you have combined two variables with the XOR or XNOR operator, you have extracted all of the information out of them.

• In other problems, this is not the case; e.g. in the multiplexer problem you need to use the address bits more than once.

• ...but, there are ways of dealing with this.
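The difference between the two problems can be illustrated with a small check of whether a reduced set of features still determines the target. The helper name, the 6-bit multiplexer layout (two address bits followed by four data bits) and the choice to combine the address bits with XOR are assumptions made for this sketch.

```python
from itertools import product

def determines(features, target, cases):
    """True if equal feature values always imply equal target values."""
    seen = {}
    for case in cases:
        key = features(case)
        if seen.setdefault(key, target(case)) != target(case):
            return False
    return True

# 4-bit even parity with v_2 and v_3 replaced by their XOR.
parity_cases = list(product([0, 1], repeat=4))
even_parity = lambda v: 1 - sum(v) % 2
parity_ok = determines(lambda v: (v[0], v[1], v[2] ^ v[3]),
                       even_parity, parity_cases)

# 6-bit multiplexer (assumed layout: v[0], v[1] address; v[2..5] data)
# with the two address bits replaced by their XOR.
mux_cases = list(product([0, 1], repeat=6))
multiplexer = lambda v: v[2 + v[0] + 2 * v[1]]
mux_ok = determines(lambda v: (v[0] ^ v[1],) + v[2:],
                    multiplexer, mux_cases)

print(parity_ok)  # True: the XOR keeps everything parity needs from v_2, v_3
print(mux_ok)     # False: the address bits are still needed individually
```

For parity the combined variable can stand in for the pair; for the multiplexer the original address bits have to remain available.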

The Big Picture

• GP is measuring the wrong thing:

  • we want to measure how (algorithmically) complex the “gap” is between the current program fragment and the target, not the error between the current program (fragment) and the target

•We have shown a way to give a fitness value to small components of a program during program synthesis, rather than having to always evaluate full programs.

•Can we do more to remove the “bio-inspired” from the methods and replace it with computational/informational concepts?

Questions/Comments