Predictive Science
TRANSCRIPT
Predictive Science: a Tautology
Peter Nordin
The Answer is:
The asymmetry of similarity
- What thing is this like?
- And what is this like?
A heuristic measure of amount of information: Shannon's guessing game
1. Pony?
2. Cow?
3. Dog?
...
345. Pegasus!
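A minimal sketch of the guessing game as an information measure, in Python (illustrative; the hypothesis ordering is an assumption). The later the correct answer turns up, the more bits it carries:

import math

# Toy guessing game: hypotheses are tried from most to least plausible.
hypotheses = ["pony", "cow", "dog", "horse", "unicorn", "pegasus"]

def guess_rank(target):
    """Number of guesses needed until the target is named."""
    return hypotheses.index(target) + 1

rank = guess_rank("pegasus")
print(f"found on guess #{rank}: about {math.log2(rank):.2f} bits")
# An answer found only on guess #345 carries about log2(345) ~ 8.4 bits:
print(f"guess #345: about {math.log2(345):.1f} bits")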
Science is Prediction
- When does the next solar eclipse in Europe occur?
- The next solar eclipse in Europe will happen on August 12, 2026.
Science is Compression
The Model, Science, and Prediction
The Turkey and the issue with inductive predictions (1)
The Turkey and the issue with inductive predictions (2)
Mandatory Reading
All Real Science is
Predictive Science
- Predict when the sun will set tomorrow
- Predict if you will be sick or well by taking this medicine
- Predict what will happen in this project if this methodology is used
How to predict anything:
1. Collect facts
2. Find a short model fitting all the facts
3. Extrapolate that model into the future; the model's probability is given by its length
4. Meta loop: collect and include facts about your model-finding adventures, go to step 2, and use the result for planning
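A toy sketch of steps 1-3 in Python (an illustration, not the author's procedure): the facts are noisy samples, the candidate models are polynomials, and "length of model" is approximated by a crude two-part MDL score; the data and scoring constants are assumptions.

import numpy as np

# Step 1: collect facts (here, noisy samples of a hidden linear law).
rng = np.random.default_rng(0)
x = np.arange(20.0)
y = 3.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

# Step 2: score candidate models by fit cost plus parameter cost (bits).
def mdl_score(degree):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    fit_cost = 0.5 * x.size * np.log2(np.mean(resid ** 2) + 1e-12)
    param_cost = 0.5 * (degree + 1) * np.log2(x.size)
    return fit_cost + param_cost, coeffs

best_score, best_coeffs = min((mdl_score(d) for d in range(6)),
                              key=lambda t: t[0])

# Step 3: extrapolate the shortest adequate model into the future.
print("predicted y(25) =", np.polyval(best_coeffs, 25.0))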
Companies and
Prediction
- A company is a collection of people predicting risk from actions
- No risk, no gain
Recent progress
Recent advances: Universal Learning Algorithms
There is a theoretically optimal way of predicting the future, given the past. It can be used to define an optimal (though noncomputable) rational agent that maximizes its expected reward in almost arbitrary environments sampled from computable probability distributions.
Recent advances: All Scientists
Physicists, economists, and other scientists make predictions based on observations. So does everybody in daily life. Did you know that there is a theoretically optimal way of predicting? Every scientist should know about it.
Normally we do not know the true conditional probability distribution p(next event | past). But assume we do know that p is in some set P of distributions. Choose a fixed weight w_q for each q in P such that the w_q add up to 1 (for simplicity, let P be countable). Then construct the Bayes mix M(x) = Sum_q w_q q(x), and predict using M instead of the optimal but unknown p.
How wrong is it to do that? The recent exciting work of Marcus Hutter (funded through Juergen Schmidhuber's SNF research grant "Unification of Universal Induction and Sequential Decision Theory") provides general and sharp loss bounds: let L_M(n) and L_p(n) be the total expected losses of the M-predictor and the p-predictor, respectively, for the first n events. Then L_M(n) - L_p(n) is at most of the order of sqrt[L_p(n)]. That is, M is not much worse than p. And in general, no other predictor can do better than that! In particular, if p is deterministic, then the M-predictor soon won't make any errors any more.
If P contains ALL computable distributions, then M becomes the celebrated enumerable universal prior. That is, after decades of somewhat stagnating research, we now have sharp loss bounds for Ray Solomonoff's universal (but incomputable) induction scheme (1964, 1978).
Alternatively, reduce M to what you get if you just add up the weighted estimated future finance-data probabilities generated by 1000 commercial stock-market prediction software packages. If only one of them happens to work fine (but you do not know which), you should still get rich.
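A minimal sketch of the Bayes mix in Python (not from the slides; the finite class of Bernoulli coins, the uniform weights, and the variable names are illustrative assumptions). For the log-loss used here, M's regret against the true p is bounded by log2(1/w_p):

import numpy as np

# Toy Bayes mix M(x) = Sum_q w_q q(x): P is a finite grid of
# Bernoulli(theta) coins with uniform weights w_q.
rng = np.random.default_rng(1)
thetas = np.linspace(0.05, 0.95, 19)
w = np.full(len(thetas), 1.0 / len(thetas))       # weights w_q sum to 1

true_theta = 0.7                                  # the unknown p is in P
xs = (rng.random(2000) < true_theta).astype(int)

loss_M = loss_p = 0.0
posterior = w.copy()
for x in xs:
    lik = np.where(x == 1, thetas, 1.0 - thetas)  # each q's probability of x
    m = float(posterior @ lik)                    # M's predictive probability
    p = true_theta if x == 1 else 1.0 - true_theta
    loss_M -= np.log2(m)
    loss_p -= np.log2(p)
    posterior *= lik
    posterior /= posterior.sum()                  # Bayes update of the mix

print(f"log-loss of M minus log-loss of true p: {loss_M - loss_p:.2f} bits")
# Stays bounded (at most about log2(19) ~ 4.25 bits for uniform weights).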
Intelligence
- Is compression
- If used for prediction
Art?
Theory Pyramid
Undecidable stuff etc.
Optimal Cognition
Algorithmic Information Theory
Optimal prediction
Experimental planning
Turing-complete representations
Bayes etc.
Multivariate distribution stats
Single-variable distribution stats
Agent
Formal Agent Model
Gödel machine
Artificial Intelligence
- The information-theoretic, statistical, and philosophical foundations of Artificial Intelligence
Universal AI
Universal Artificial Intelligence = Decision Theory + Universal Induction
Decision Theory = Probability + Utility Theory
Universal Induction = Ockham + Bayes + Turing
Pieces of the puzzle
- Philosophical issues: the common principle behind their solution is Occam's simplicity principle. Based on Occam's and Epicurus' principles, Bayesian probability theory, and Turing's universal machine, Solomonoff developed a formal theory of induction.
- The sequential/online setup considered in this presentation, placed into the wider machine-learning context.
What is AI?
- Informal definition of (artificial) intelligence?
- Intelligence measures an agent's ability to achieve goals in a wide range of environments.
- Emergent: features such as the ability to learn and adapt, or to understand, are implicit in the above definition, as these capacities enable an agent to succeed in a wide range of environments.
- The science of Artificial Intelligence is concerned with the construction of intelligent systems/artifacts/agents and their analysis.
The Hierarchy
- Induction → Prediction → Decision → Action
- Having or acquiring or learning or inducing a model of the environment an agent interacts with allows the agent to make predictions and utilize them in its decision process of finding a good next action.
- Induction infers general models from specific observations/facts/data, usually exhibiting regularities or properties or relations in the latter.
- Example. Induction: find a model of the world economy.
- Prediction: use the model for predicting the future stock market.
- Decision: decide whether to invest assets in stocks or bonds.
- Action: trading large quantities of stocks influences the market.
Sequence
- Example 2: Digits of a Computable Number
- Extend 14159265358979323846264338327950288419716939937?
- Looks random?! Frequency estimate: n = length of sequence, k_i = number of occurrences of digit i. The probability of the next digit being i is estimated as k_i/n. Asymptotically, k_i/n → 1/10 (seems to be true).
- But we have the strong feeling that (i.e. with high probability) the next digit will be 5, because the previous digits were the expansion of pi.
- Conclusion: We prefer answer 5, since we see more structure in the sequence than just random digits.
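A quick check of the frequency estimate, using only the digit string from the slide:

from collections import Counter

digits = "14159265358979323846264338327950288419716939937"
counts = Counter(digits)
n = len(digits)
for d in sorted(counts):
    print(d, f"{counts[d] / n:.2f}")   # each k_i/n hovers near 1/10
# The frequencies look uniform, yet the string is maximally structured:
# a short program prints pi, so the sequence carries very little information.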
Sequence 2
- Example 3: Number Sequences
- Sequence: x1, x2, x3, x4, x5, ... = 1, 2, 3, 4, ?, ...
- x5 = 5, since x_i = i for i = 1..4.
- x5 = 29, since x_i = i^4 - 10i^3 + 35i^2 - 49i + 24.
- Conclusion: We prefer 5, since the linear relation involves fewer arbitrary parameters than the 4th-order polynomial.
- Sequence: 2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,?
- 61, since this is the next prime.
- 60, since this is the order of the next simple group.
- Conclusion: We prefer answer 61, since primes are a more familiar concept than simple groups. (See the On-Line Encyclopedia of Integer Sequences.)
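A two-line check (illustrative) that both hypotheses reproduce 1, 2, 3, 4 and then diverge at x5:

def poly(i):
    return i ** 4 - 10 * i ** 3 + 35 * i ** 2 - 49 * i + 24

print([i for i in range(1, 6)])          # linear rule:    [1, 2, 3, 4, 5]
print([poly(i) for i in range(1, 6)])    # 4th-order rule: [1, 2, 3, 4, 29]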
Occam?
- Occam's Razor to the Rescue
- Is there a unique principle which allows us to formally arrive at a prediction which coincides (always?) with our intuitive guess, or, even better, which is (in some sense) most likely the best or correct answer?
- Yes! Occam's razor: use the simplest explanation consistent with past data (and use it for prediction). Works! For the examples presented and for many more. Actually, Occam's razor can serve as a foundation of machine learning in general, and is even a fundamental principle (or maybe even the mere definition) of science.
- Problem: not a formal/mathematical objective principle. What is simple for one may be complicated for another.
Blue Emeralds?
- Grue Emerald Paradox
- Hypothesis 1: All emeralds are green.
- Hypothesis 2: All emeralds found until year 2010 are green; thereafter all emeralds are blue.
- Which hypothesis is more plausible? H1! Justification?
- Occam's razor: take the simplest hypothesis consistent with the data. It is the most important principle in machine learning and science.
View on probabilities
- Uncertainty and Probability
- The aim of probability theory is to describe uncertainty. Sources/interpretations of uncertainty:
- Frequentist: probabilities are relative frequencies (e.g., the relative frequency of tossing heads).
- Objectivist: probabilities are real aspects of the world (e.g., the probability that some atom decays in the next hour).
- Subjectivist: probabilities describe an agent's degree of belief (e.g., it is (im)plausible that extraterrestrials exist).
What we need
- Kolmogorov complexity
- Universal Distribution
- Inductive Learning
Principle of Indifference (Epicurus)
- Keep all hypotheses that are consistent with the facts
Occam's Razor
- Among all hypotheses consistent with the facts, choose the simplest
- Newton's rule #1 for doing natural philosophy:
- "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances"
Question
- What does "simplest" mean?
- How to define simplicity?
- Can a thing be simple under one definition and not under another?
Bayes' Rule
- P(H|D) = P(D|H) * P(H) / P(D)
- P(H) is often considered the initial degree of belief in H
- In essence, Bayes' rule is a mapping from the prior probability P(H) to the posterior probability P(H|D), determined by the data D
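A minimal sketch of that mapping for a binary hypothesis (the numbers are made up for illustration):

def posterior(prior_h, lik_d_given_h, lik_d_given_not_h):
    """Bayes' rule: P(H|D) = P(D|H) * P(H) / P(D)."""
    p_d = lik_d_given_h * prior_h + lik_d_given_not_h * (1 - prior_h)
    return lik_d_given_h * prior_h / p_d

# Prior belief 0.3; the data is twice as likely under H as under not-H.
print(posterior(0.3, 0.8, 0.4))   # posterior rises to ~0.46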
How to get P(H)
- By the law of large numbers, we can get P(H|D) if we use many examples
- But we want as much information as possible from only a limited number of data points
- P(H) may be unknown, uncomputable; it may not even exist
- Can we find a single probability distribution to use as the prior in every case, with approximately the same result as if we had used the real distribution?
Hume on Induction
- Induction is impossible, because we can only reach conclusions by using known data and methods.
- So the conclusion is logically already contained in the starting configuration.
Only one algorithm?
Solomonoff's Theory of Induction
- Maintain all hypotheses consistent with the data
- Incorporate Occam's Razor: assign the simplest hypotheses the highest probability
- Use Bayes' rule
Kolmogorov Complexity
- k(s) is the length of the shortest program which, on no input, prints out s
- k(s) <= n + c for any string s of length n
- k(s) is objective (programming-language independent) by the Invariance Theorem
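k(s) itself cannot be computed (as noted later), but any off-the-shelf compressor yields an upper bound, k(s) <= compressed length + c, which already separates structured from random-looking strings. A sketch using zlib (the example strings are assumptions):

import random
import zlib

structured = b"01" * 500                                  # very regular
random_ish = bytes(random.Random(0).randrange(256) for _ in range(1000))

for name, s in [("structured", structured), ("random-ish", random_ish)]:
    print(f"{name}: {len(s)} bytes -> {len(zlib.compress(s, 9))} compressed")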
Universal Distribution
- P(s) = 2^(-k(s))
- We use k(s) to describe the complexity of an object. By Occam's Razor, the simplest should have the highest probability.
Problem: Sum_s P(s) > 1
- For every n, there exists an n-bit string s with k(s) ~ log n, so P(s) = 2^(-log n) = 1/n
- Summing over n: 1/2 + 1/3 + ... > 1, so P is not a probability distribution
Levin's improvement
- Use prefix-free programs
- A set of programs, no one of which is a prefix of any other
- Kraft's inequality: let l_1, l_2, ... be a sequence of natural numbers. There is a prefix code with this sequence as the lengths of its binary code words iff Sum_n 2^(-l_n) <= 1
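A small sketch checking both directions of the statement on toy codes (the example codes are assumptions):

def kraft_sum(lengths):
    return sum(2.0 ** -l for l in lengths)

def is_prefix_free(codes):
    return not any(a != b and b.startswith(a) for a in codes for b in codes)

codes = ["0", "10", "110", "111"]                   # a prefix-free code
print(is_prefix_free(codes), kraft_sum(len(c) for c in codes))  # True 1.0
print(kraft_sum([1, 1, 2]))  # 1.25 > 1: no prefix-free code has these lengths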
Multiplicative domination
- Levin proved that there exists a constant c such that c * P(s) >= p(s) for every computable distribution p, where c depends on p but not on s
- If the true prior distribution is computable, then using the single fixed universal distribution P is almost as good as using the true distribution itself
- Turing's thesis: the universal Turing machine can compute all intuitively computable functions
- Kolmogorov's thesis: Kolmogorov complexity gives the shortest description length among all description lengths that can be effectively approximated, according to intuition
- Levin's thesis: the universal distribution gives the largest probability among all distributions that can be effectively approximated, according to intuition
Universal Bet
- Street gambler Bob tosses a coin and offers:
- Next is heads (1): Bob gives Alice $2
- Next is tails (0): Alice pays Bob $1
- Is Bob honest?
- Side bet: flip the coin 1000 times, record the result as a string s
- Alice pays $1, Bob pays Alice $2^(1000-k(s))
- Good offer: Sum_{|s|=1000} 2^(-1000) * 2^(1000-k(s)) = Sum_{|s|=1000} 2^(-k(s)) <= 1 (by Kraft's inequality for the prefix-free k), so Bob's expected payout never exceeds Alice's $1 stake
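A toy simulation of the side bet (illustrative). k(s) is uncomputable, so the zlib-compressed length stands in as an upper bound on k(s); this underestimates the exponent, which only makes the demo conservative:

import random
import zlib

def payout_exponent(bits):
    s = "".join(map(str, bits)).encode()
    k_upper = 8 * len(zlib.compress(s, 9))   # crude upper bound on k(s), bits
    return len(bits) - k_upper               # log2 of Bob's payout to Alice

rng = random.Random(0)
fair = [rng.randrange(2) for _ in range(1000)]
rigged = [0] * 1000            # a cheated coin leaves a compressible record
print("fair coin:   payout about 2 **", payout_exponent(fair))    # negative
print("rigged coin: payout about 2 **", payout_exponent(rigged))  # huge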
Notice
- The Kolmogorov complexity of a string is non-computable
Conclusion
- Kolmogorov complexity: optimal effective description of objects
- Universal Distribution: optimal effective probability of objects
- Both are objective and absolute
The most neutral possible prior
- Suppose we want a prior so neutral that it never rules out a model
- Possible, if we limit ourselves to computable models
- Mix all (computable) priors p_i, with weights w_i that decline fairly fast: m(x) = Sum_i w_i p_i(x)
- Then this mixture multiplicatively dominates all priors
- Though neutral priors will mean slow learning
- Such m(x) are universal priors
The most neutral possible coding language
- Universal programming languages (Java, Matlab, UTMs, etc.)
- K(x) = length of the shortest program in Java, Matlab, or on a UTM that generates x (K is uncomputable)
- Invariance theorem: for any languages L1, L2 there exists a constant c such that for all x, |K_L1(x) - K_L2(x)| <= c
- Mathematically justifies talk of K(x), not K_Java(x) or K_Matlab(x)
So does this mean that choice of language doesn't matter?
- Not quite!
- c can be large
- And, for any L1 and c0, there exist an L2 and x such that |K_L1(x) - K_L2(x)| >= c0
- The problem of the one-instruction code for the entire data set
- But Kolmogorov complexity can be made concrete
Compact Universal Turing machines
- 210 bits (lambda calculus); 272 bits (combinators)
- Not much room to hide, here!
Neutral priors and Kolmogorov complexity
- A key result: K(x) = -log2 m(x) + O(1), where m is a universal prior
- Analogous to Shannon's source coding theorem
- And for any computable q: K(x) <= -log2 q(x) + O(1), for typical x drawn from q(x)
- Any data x that is likely under any sensible probability distribution has low K(x)
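A rough numerical illustration of the second bound (assumptions: q is a biased coin with p = 0.9, and zlib serves as a weak stand-in for K, so expect some slack above the ideal):

import math
import random
import zlib

rng = random.Random(2)
p, n = 0.9, 8000
x = "".join("1" if rng.random() < p else "0" for _ in range(n))

h = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))   # entropy per symbol
packed = int(x, 2).to_bytes((n + 7) // 8, "big")       # 1 bit per symbol
print(f"-log2 q(x) ~ n*H(p) = {n * h:.0f} bits")
print(f"zlib upper bound:     {8 * len(zlib.compress(packed, 9))} bits")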
Prediction by simplicity
- Find the shortest program/explanation for the current corpus (binary string)
- Predict using that program
- Strictly, use a weighted sum of explanations, weighted by brevity
Prediction is possible (Solomonoff, 1978)
Summed error has a finite bound:
- s_j is the summed squared error between prediction and true probability on item j
- So prediction converges (faster than 1/(n log n)) for corpus size n
- Computability assumptions only (no stationarity needed)
Summary so far
- Simplicity/Occam: close and deep connections with Bayes
- Defines a universal prior (i.e., based on simplicity)
- Can be made concrete
- General prediction results
- A convenient dual framework to Bayes, when codes are easier than probabilities
Methods
Infrastructure