Full Bayesian inference (Learning)


Page 1: Full Bayesian inference (Learning)

Full Bayesian inference (Learning)

Peter Antal, [email protected]

February 20, 2018, A.I.

Page 2: Full Bayesian inference (Learning)

The data age

Learning paradigms
◦ Learning as inference

◦ Bayesian learning, full Bayesian inference, Bayesian model averaging

◦ Model identification, maximum likelihood learning

Probably Approximately Correct learning


Page 3: Full Bayesian inference (Learning)

The "data" age: can we automate analysis/learning?

Universal (statistical) predictor?

Universal learning architectures?

Self-improving super-intelligence?

Phases of AI: expert, supervised, autonomous


Page 4: Full Bayesian inference (Learning)

Serendipity-driven ("experimental") science
◦ 1000 years ago
◦ description of natural phenomena

Hypothesis-driven ("analytical") science
◦ last few hundred years
◦ falsification, Newton's Laws, Maxwell's Equations

Computation-driven ("computational") science
◦ last few decades
◦ simulation of complex phenomena

Data-driven ("hypothesis-free") research
◦ must move from data to information to knowledge


Jim Gray: Evolution of science: 4 paradigms

Hey, Tony, Stewart Tansley, and Kristin M. Tolle, eds. The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research, 2009.

Page 5: Full Bayesian inference (Learning)

Moore's law: storage and computation (grids, GPUs, quantum chips, ...)

Semantic web technologies: linked open data

Artificial intelligence methods: learning, reasoning, decision making

[Figure: three converging factors: computational hardware, semantic technologies, artificial intelligence]

Page 6: Full Bayesian inference (Learning)

1965, Gordon Moore, co-founder of Intel: "The number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years" ... "for at least ten years"


•10 µm – 1971

•6 µm – 1974

•3 µm – 1977

•1.5 µm – 1982

•1 µm – 1985

•800 nm – 1989

•600 nm – 1994

•350 nm – 1995

•250 nm – 1997

•180 nm – 1999

•130 nm – 2001

•90 nm – 2004

•65 nm – 2006

•45 nm – 2008

•32 nm – 2010

•22 nm – 2012

•14 nm – 2014

•10 nm – 2017

•7 nm – ~2019

•5 nm – ~2021

2012: single-atom transistor (~0.1 nm, i.e. 1 Å)


Page 9: Full Bayesian inference (Learning)

Linking Open Data cloud diagram 2017, by Andrejs Abele, John P. McCrae, Paul Buitelaar, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/


Page 11: Full Bayesian inference (Learning)

"Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an 'intelligence explosion,' and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make, provided that the machine is docile enough to tell us how to keep it under control." I. J. Good (1965)

[Diagram: many narrow AI systems (N-AI-1, N-AI-2, ..., N-AI-k, "Artificial Narrow Intelligence") versus Artificial General Intelligence (AGI)]

Page 12: Full Bayesian inference (Learning)

[Timeline: Narrow AI (present), General AI, Super AI / strong AI]

1950, Turing: "Computing Machinery and Intelligence" (learning)

Page 13: Full Bayesian inference (Learning)

The possibility of learning is an empirical observation.

"The most incomprehensible thing about the world is that it is at all comprehensible." (Albert Einstein)

"No theory of knowledge should attempt to explain why we are successful in our attempt to explain things." (K. R. Popper: Objective Knowledge, 1972)

Page 14: Full Bayesian inference (Learning)

Epicurus' (342? B.C. - 270 B.C.) principle of multiple explanations, which states that one should keep all hypotheses that are consistent with the data.

The principle of Occam's razor (William of Occam, 1285 - 1349, sometimes spelt Ockham) states that when inferring causes, entities should not be multiplied beyond necessity. This is widely understood to mean: among all hypotheses consistent with the observations, choose the simplest. In terms of a prior distribution over hypotheses, this is the same as giving simpler hypotheses higher a priori probability and more complex ones lower probability.
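A tiny, purely illustrative sketch of such a prior, e.g. P(h) proportional to 2^(-L(h)) for some description length L(h); the hypothesis names and lengths below are made-up assumptions, not part of the slides:

# Hypothetical description lengths (in bits) for three hypotheses.
description_length = {"h_simple": 2, "h_medium": 5, "h_complex": 12}

# Simpler hypotheses (shorter descriptions) get higher prior probability.
unnormalised = {h: 2.0 ** (-bits) for h, bits in description_length.items()}
z = sum(unnormalised.values())
prior = {h: w / z for h, w in unnormalised.items()}
print(prior)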

Page 15: Full Bayesian inference (Learning)

What is the probability that the sun will rise tomorrow?
◦ P[{sun rises tomorrow} | {it has risen k times previously}] = (k + 1)/(k + 2)
◦ (k: ... Laplace inferred the number of days by saying that the universe was created about 6000 years ago, based on a young-earth creationist reading of the Bible ...)
◦ https://en.wikipedia.org/wiki/Sunrise_problem

Rule of succession
◦ https://en.wikipedia.org/wiki/Rule_of_succession

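A minimal sketch of the rule of succession above; the helper name and the day count used for Laplace's example are illustrative assumptions:

def rule_of_succession(successes, trials):
    # Posterior predictive probability of success under a uniform
    # Beta(1, 1) prior on the unknown success probability:
    # (successes + 1) / (trials + 2).
    return (successes + 1) / (trials + 2)

# Laplace's sunrise example: the sun has risen on every one of k days.
k = 6000 * 365  # illustrative: roughly 6000 years of daily sunrises, as described above
print(rule_of_succession(k, k))  # about 0.9999995, close to but never exactly 1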


Page 19: Full Bayesian inference (Learning)

Russell & Norvig: Artificial Intelligence, ch. 20

Page 20: Full Bayesian inference (Learning)

Russell & Norvig: Artificial Intelligence

Page 21: Full Bayesian inference (Learning)

Russell & Norvig: Artificial Intelligence

Page 22: Full Bayesian inference (Learning)

Russell & Norvig: Artificial Intelligence

Page 25: Full Bayesian inference (Learning)


[Plot: sequential likelihood of a given data set under hypotheses h1, h2, h3, h4, h5; y-axis 0 to 1, x-axis 1 to 12 observations]
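The plot summarises the sequential Bayesian updating example from Russell & Norvig, ch. 20 (cited above). As a hedged sketch of full Bayesian inference by model averaging: the lime fractions and prior below follow that candy-bag example, but treat the exact numbers as assumptions of this illustration.

import numpy as np

# Five hypotheses about the bag: fraction of lime candies (assumed values).
lime_fraction = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # h1 .. h5
posterior     = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # prior P(h_i)

observations = ["lime"] * 10                             # observed data d_1 .. d_10
for d in observations:
    likelihood = lime_fraction if d == "lime" else 1.0 - lime_fraction
    posterior = posterior * likelihood                   # P(h_i) * P(d | h_i)
    posterior = posterior / posterior.sum()              # normalise to P(h_i | data)

# Full Bayesian (model-averaged) prediction for the next candy:
# P(next is lime | data) = sum_i P(lime | h_i) * P(h_i | data)
p_next_lime = float(np.sum(lime_fraction * posterior))
print(posterior, p_next_lime)

No single model is identified; the prediction averages over all five hypotheses weighted by their posteriors, which is the idea summarised on the final slide.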

Page 26: Full Bayesian inference (Learning)

Simplest form: learn a function from examples

f is the target function

An example is a pair (x, f(x))

Problem: find a hypothesis h such that h ≈ f, given a training set of examples

(This is a highly simplified model of real learning:
◦ ignores prior knowledge
◦ assumes examples are given)


Page 32: Full Bayesian inference (Learning)

Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples)

E.g., curve fitting:

Ockham’s razor: prefer the simplest hypothesis consistent with data
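A minimal numerical sketch of this trade-off; the synthetic linear target, the noise level, and the degrees compared are assumptions of this illustration, not part of the slides:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)
y = 2.0 * x + rng.normal(scale=0.1, size=x.size)   # noisy samples of a linear target f

for degree in [1, 7]:                              # a simple vs. a maximally flexible hypothesis h
    coeffs = np.polyfit(x, y, degree)
    train_error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_error)

# The degree-7 polynomial interpolates the 8 points (training error near 0),
# but Ockham's razor prefers the simpler, almost-as-consistent line.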

Page 33: Full Bayesian inference (Learning)

How do we know that h ≈ f?
1. Use theorems of computational/statistical learning theory

2. Try h on a new test set of examples

(use same distribution over example space as training set)

Learning curve = % correct on test set as a function of training set size
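A sketch of how such a learning curve can be computed; the synthetic two-Gaussian data and the nearest-centroid classifier are illustrative assumptions, not the slides' method:

import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Two Gaussian classes in 2-D (purely synthetic data).
    half = n // 2
    x0 = rng.normal(loc=-1.0, size=(half, 2))
    x1 = rng.normal(loc=+1.0, size=(n - half, 2))
    X = np.vstack([x0, x1])
    y = np.array([0] * half + [1] * (n - half))
    return X, y

def nearest_centroid_accuracy(Xtr, ytr, Xte, yte):
    # Fit: one centroid per class; predict: the closer centroid wins.
    c0 = Xtr[ytr == 0].mean(axis=0)
    c1 = Xtr[ytr == 1].mean(axis=0)
    pred = (np.linalg.norm(Xte - c1, axis=1) <
            np.linalg.norm(Xte - c0, axis=1)).astype(int)
    return float(np.mean(pred == yte))          # fraction correct on the test set

X_test, y_test = make_data(1000)                # fixed held-out test set
for m in [4, 8, 16, 32, 64, 128, 256]:          # growing training set sizes
    X_train, y_train = make_data(m)
    print(m, nearest_centroid_accuracy(X_train, y_train, X_test, y_test))

Plotting test accuracy against m gives the learning curve described above.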

Page 34: Full Bayesian inference (Learning)

Example from concept learning:

X: i.i.d. samples
n: sample size
H: hypotheses (with a subset H_bad of "bad" hypotheses, defined on the next slides)

Probably Approximately Correct (PAC) learning:

A single estimate of the expected error for a given hypothesis is convergent, but can we estimate the errors for all hypotheses uniformly well?

Page 35: Full Bayesian inference (Learning)

Assume that the true hypothesis f is an element of the hypothesis space H.

Define the error of a hypothesis h as its misclassification rate:
error(h) = P(h(x) ≠ f(x))

Hypothesis h is approximately correct if error(h) < ε (ε is the "accuracy").

For h ∈ H_bad: error(h) > ε.

Page 36: Full Bayesian inference (Learning)

H can be partitioned into the approximately correct hypotheses H_ε (error(h) < ε) and the bad ones H_bad (error(h) > ε).

By definition, for any h ∈ H_bad the probability of an error on a single example is larger than ε, thus the probability of no error (agreeing with f) on a single example is less than (1 - ε).

Page 37: Full Bayesian inference (Learning)

Thus for n i.i.d. samples and a fixed h_b ∈ H_bad:

P(D_n : h_b(x) = f(x) on all n examples) ≤ (1 - ε)^n

For the event that some h_b ∈ H_bad survives (i.e., is consistent with all n examples), the union bound gives

P(D_n : ∃ h_b ∈ H_bad, h_b(x) = f(x) on all n examples) ≤ |H_bad| (1 - ε)^n ≤ |H| (1 - ε)^n

Page 38: Full Bayesian inference (Learning)

To make the probability of this failure at most δ (i.e., to be approximately correct with probability at least 1 - δ), we require

|H| (1 - ε)^n ≤ δ

Using (1 - ε) ≤ e^(-ε) and expressing the sample size as a function of the accuracy ε and the confidence δ, we get a bound on the sample complexity:

n ≥ (1/ε) (ln|H| + ln(1/δ))

Page 39: Full Bayesian inference (Learning)

How many distinct concepts/decision trees are there with n Boolean attributes?

= the number of Boolean functions

= the number of distinct truth tables with 2^n rows, i.e. 2^(2^n)

E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 distinct functions/trees.
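A small sketch that plugs these numbers into the sample-complexity bound n ≥ (1/ε)(ln|H| + ln(1/δ)); the particular ε and δ values are arbitrary example choices:

import math

def pac_sample_bound(ln_hypothesis_count, eps, delta):
    # Smallest integer n with (1/eps)(ln|H| + ln(1/delta)) <= n,
    # taking ln|H| directly to avoid handling huge integers.
    return math.ceil((ln_hypothesis_count + math.log(1.0 / delta)) / eps)

k = 6                          # Boolean attributes
ln_H = (2 ** k) * math.log(2)  # |H| = 2^(2^k) Boolean functions
print(2 ** (2 ** k))           # 18446744073709551616 distinct functions, as on the slide
print(pac_sample_bound(ln_H, eps=0.1, delta=0.05))   # about 474 examples suffice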

Page 40: Full Bayesian inference (Learning)

In practice, the target f is typically not inside the hypothesis space H: the total real error can be decomposed into "bias + variance".

“bias”: expected error/modelling error

“variance”: estimation/empirical selection error

For a given sample size, the error is decomposed as a function of model complexity:

[Figure: modelling error, statistical (model selection) error, and total error as functions of model complexity]
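A hedged sketch of estimating the two components empirically by refitting polynomials to resampled noisy data; the sine target, noise level, and degrees are assumptions of this illustration:

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 20)
f = np.sin(2 * np.pi * x)            # "true" target, outside the polynomial hypothesis space

def fit_predictions(degree, repeats=200, noise=0.3):
    # Predictions of repeatedly refitted degree-d polynomials on fresh noisy samples.
    preds = []
    for _ in range(repeats):
        y = f + rng.normal(scale=noise, size=x.size)
        preds.append(np.polyval(np.polyfit(x, y, degree), x))
    return np.array(preds)

for degree in [1, 3, 9]:
    P = fit_predictions(degree)
    bias2    = np.mean((P.mean(axis=0) - f) ** 2)   # modelling error ("bias")
    variance = np.mean(P.var(axis=0))               # estimation error ("variance")
    print(degree, bias2, variance)

Low-degree fits show high bias and low variance; high-degree fits the reverse, matching the decomposition above.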

Page 41: Full Bayesian inference (Learning)

Normative predictive probabilistic inference
◦ performs Bayesian model averaging

◦ implements learning through model posteriors

◦ avoids model identification(!)

Model identification is hard:
◦ Probably Approximately Correct learning

◦ Bias-variance dilemma
