Concept Learning
Examples
Word meanings
Edible foods
Abstract structures (e.g., irony)
(Figure: novel objects labeled "glorch" and "notglorch")
Supervised Approach To Concept Learning
Both positive and negative examples provided
Typical models (both in ML and Cog Sci) circa 2000 required both positive and negative examples
(Figure: scatter of positive (+) and negative (−) training examples)
Contrast With Human Learning Abilities
Learning from positive examples only
Learning from a small number of examples
E.g., word meanings
E.g., learning appropriate social behavior
E.g., instruction on some skill
What would it mean to learn from a small number of positive examples?
(Figure: a small set of positive examples (+) only)
Tenenbaum (1999)
Two dimensional continuous feature space
Concepts defined by axis-parallel rectangles
e.g., feature dimensions
cholesterol level
insulin level
e.g., concept
healthy
(Figure: three positive examples of "healthy" plotted in the cholesterol-insulin plane)
Learning Problem
Given a set of n examples, X = {x1, x2, x3, …, xn}, which are instances of the concept…
Will some unknown example Y also be an instance of the concept?
Problem of generalization
Hypothesis (Model) Space
H: all rectangles on the plane, parameterized by (l1, l2, s1, s2)
h: one particular hypothesis
Note: |H| = ∞
Consider all hypotheses in parallel, in contrast to the non-Bayesian approach of maintaining only the best hypothesis at any point in time.
Prediction Via Model Averaging
Will some unknown input y be in the concept given examples X = {x1, x2, x3, …, xn}?
Q: y is a positive example of the concept; domain(Q) = {true, false}
P(Q | X) = ∫ P(Q, h | X) dh                      (marginalization)
P(Q, h | X) = P(Q | h, X) P(h | X)               (chain rule)
P(Q | h, X) = P(Q | h) = 1 if y is in h, 0 otherwise   (conditional independence; deterministic concepts)
P(h | X) ∝ P(X | h) P(h)                         (Bayes rule: likelihood × prior)
Prior p(h)
Prior should be location invariant
Uninformative prior: depends only on rectangle area
Expected size prior
Other possibilities too…
Likelihood Function p(X | h)
X = set of n examples
Size principle: P(X | h) = (1/|h|)ⁿ if all n examples fall inside h, and 0 otherwise (|h| = size of the hypothesis, e.g., rectangle area) — smaller hypotheses assign higher likelihood to the same examples
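To make the pieces above concrete, here is a minimal numerical sketch (not from the paper) that combines a discretized rectangle hypothesis space, a prior, the size-principle likelihood, and model averaging to estimate P(y in concept | X). The grid resolution, the 1/area "uninformative" prior, and the exponential "expected size" prior are illustrative assumptions, not the exact choices in Tenenbaum (1999).

```python
# Minimal sketch of generalization from positive examples with axis-parallel
# rectangle concepts. Grid resolution and prior forms are assumptions made
# for illustration, not the exact choices in the paper.
import numpy as np

corners = np.linspace(0.0, 24.0, 13)   # candidate lower-left corners (l1, l2)
sides = np.linspace(2.0, 24.0, 12)     # candidate side lengths (s1, s2)

def generalization_prob(X, y, prior="uninformative", sigma=8.0):
    """Approximate P(y in concept | X) by averaging over all rectangles."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    num = den = 0.0
    for l1 in corners:
        for l2 in corners:
            for s1 in sides:
                for s2 in sides:
                    # Hypothesis h = rectangle [l1, l1+s1] x [l2, l2+s2];
                    # it must contain every observed example.
                    inside = ((X[:, 0] >= l1) & (X[:, 0] <= l1 + s1) &
                              (X[:, 1] >= l2) & (X[:, 1] <= l2 + s2))
                    if not inside.all():
                        continue
                    area = s1 * s2
                    lik = area ** (-n)                # size principle: (1/|h|)^n
                    if prior == "uninformative":
                        p_h = 1.0 / area              # scale-invariant choice
                    else:
                        p_h = np.exp(-(s1 + s2) / sigma)  # "expected size" flavor
                    w = lik * p_h
                    den += w
                    if l1 <= y[0] <= l1 + s1 and l2 <= y[1] <= l2 + s2:
                        num += w
    return num / den

X = [(10, 10), (11, 12), (12, 11)]                    # tight cluster of positives
print(generalization_prob(X, y=(11.5, 11.0)))         # near the examples: high
print(generalization_prob(X, y=(20.0, 20.0)))         # far from them: much lower
```

As more examples arrive, or as their range tightens, the posterior concentrates on small rectangles and generalization outside the observed range drops — the qualitative pattern the experiments below probe.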
Generalization Gradients
MIN: smallest hypothesis consistent with data
Weak Bayes: instead of using the size principle, assumes examples are produced by a process independent of the true class
Dark line = 50% prob.
Experimental Design
Subjects shown n dots on screen that are “randomly chosen examples from some rectangle of healthy levels”
n drawn from {2, 3, 4, 6, 10, 50}
Dots varied in horizontal and vertical range
r drawn from {.25, .5, 1, 2, 4, 8} units in a 24 unit window
Task
draw the ‘true’ rectangle around the dots
Experimental Results
Number Game
Experimenter picks integer arithmetic concept C
E.g., prime number
E.g., number between 10 and 20
E.g., multiple of 5
Experimenter presents positive examples drawn at random from C, say, in range [1, 100]
Participant asked whether some new test case belongs in C
Empirical Predictive Distributions
Hypothesis Space
Even numbers
Odd numbers
Squares
Multiples of n
Ends in n
Powers of n
All numbers
Intervals [n, m] for n>0, m<101
Powers of 2, plus 37
Powers of 2, except for 32
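As a sketch of how the number-game model works, the snippet below builds a reduced version of this hypothesis space, applies the size-principle likelihood with a uniform prior, and averages over hypotheses to get a predictive probability. The hypothesis list and the uniform prior are simplifying assumptions; the full model also uses informative prior weights over hypotheses.

```python
# Sketch of the number game with a reduced hypothesis space and uniform prior.
NUMS = range(1, 101)

def make_hypotheses():
    hyps = {"even": {x for x in NUMS if x % 2 == 0},
            "odd": {x for x in NUMS if x % 2 == 1},
            "squares": {x * x for x in range(1, 11)},
            "all numbers": set(NUMS)}
    for n in range(3, 11):
        hyps[f"multiples of {n}"] = {x for x in NUMS if x % n == 0}
    for n in range(2, 11):
        hyps[f"powers of {n}"] = {n ** k for k in range(1, 8) if n ** k <= 100}
    for n in range(10):
        hyps[f"ends in {n}"] = {x for x in NUMS if x % 10 == n}
    return hyps

def posterior(X, hyps):
    """P(h | X): uniform prior, size-principle likelihood (1/|h|)^n."""
    scores = {name: (len(h) ** -len(X) if all(x in h for x in X) else 0.0)
              for name, h in hyps.items()}
    z = sum(scores.values())          # assumes some hypothesis contains X
    return {name: s / z for name, s in scores.items()}

def predictive(y, X, hyps):
    """P(y in concept | X) = sum of P(h | X) over hypotheses containing y."""
    return sum(p for name, p in posterior(X, hyps).items() if y in hyps[name])

hyps = make_hypotheses()
print(predictive(32, [16], hyps))            # 16 alone: broad generalization
print(predictive(32, [16, 8, 2, 64], hyps))  # powers of 2 dominate the posterior
```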
Observation: 16
Likelihood function: size principle
Prior: intuition

Observations: 16, 8, 2, 64
Likelihood function: size principle
Prior: intuition
Posterior Distribution After Observing 16
Model Vs. Human Data
(Figure: model predictions vs. human data)
Summary of Tenenbaum (1999)
Method
Pick prior distribution (includes hypothesis space)
Pick likelihood function (size principle)
Leads to predictions for generalization as a function of r (range) and n (number of examples)
Claims people generalize optimally given assumptions about priors and likelihood
Bayesian approach provides best description of how people generalize on rectangle task.
Explains how people can learn from a small number of examples, and only positive examples.
Important Ideas in Bayesian Models
Generative theory captures the process that produces observations
Prior
Likelihood
Consideration of multiple hypotheses in parallel
Potentially infinite hypothesis space
Inference
Role of priors diminishes with amount of evidence
Prediction via model (hypothesis) averaging
Explaining away
Learning
just another form of inference
trade-off between model simplicity and fit to data: Bayesian Occam's razor
Ockham's Razor
If two hypotheses are equally consistent with the data, prefer the simpler one.
Simplicity
can accommodate fewer observations
smoother
fewer parameters
restricts predictions more (“sharper” predictions)
Examples: 1st vs. 4th order polynomial; small rectangle vs. large rectangle in the Tenenbaum model
(Figure: predictions of hypotheses H0 and H1)
Ockham: medieval philosopher and monk; razor: tool for cutting (metaphorical)
Motivating Ockham's Razor
Aesthetic considerations
A theory with mathematical beauty is more likely to be right (or believed) than an ugly one, given that both fit the same data.
Past empirical success of the principle
Develop inference techniques (e.g., Bayesian reasoning) that automatically incorporate Ockham's razor
Two theories H1 and H2: compared via their priors and their likelihoods
Ockham's Razor with Priors
Jeffreys (1939) probability text: more complex hypotheses should have lower priors
Requires a numerical rule for assessing complexity
e.g., number of free parameters; e.g., Vapnik–Chervonenkis (VC) dimension
Subjective vs. Objective Priors
Subjective or informative prior: specific, definite information about a random variable
Objective or uninformative prior: vague, general information
Philosophical arguments for certain priors as uninformative
Maximum entropy / least commitment
e.g., interval [a, b]: uniform
e.g., interval [0, ∞) with mean 1/λ: exponential distribution
e.g., mean μ and std deviation σ: Gaussian
Independence of measurement scale
e.g., Jeffreys prior 1/(θ(1−θ)) for θ in [0, 1] expresses the same belief whether we talk about θ or log θ
Ockham’s Razor Via Likelihoods
Coin flipping example
H1: coin has two heads
H2: coin has a head and a tail
Consider 5 flips producing HHHHH
H1 could produce only this sequence
H2 could produce HHHHH, but also HHHHT, HHHTH, ..., TTTTT
P(HHHHH | H1) = 1, P(HHHHH | H2) = 1/32
H2 pays the price of a lower likelihood because it can accommodate a greater range of observations
H1 is more readily rejected by observations
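The arithmetic behind these likelihoods, spelled out for concreteness (standard probability, not shown on the slide):

\[
P(\mathrm{HHHHH} \mid H_2) = (1/2)^5 = 1/32,
\qquad
\frac{P(\mathrm{HHHHH} \mid H_1)}{P(\mathrm{HHHHH} \mid H_2)} = \frac{1}{1/32} = 32,
\]

so the data favor H1 by a factor of 32 before any prior is taken into account.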
Simple and Complex Hypotheses
(Figure: likelihoods of the simple hypothesis H1 and the complex hypothesis H2 over possible observations)
Bayes Factor
BIC is an approximation to the Bayes factor
A.k.a. likelihood ratio
Note: “model” and “hypothesis” are generally interchangeable
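For reference, the standard definition being invoked here (a fact of Bayesian statistics rather than something spelled out on the slide):

\[
\mathrm{BF} = \frac{P(D \mid H_1)}{P(D \mid H_2)},
\qquad
\frac{P(H_1 \mid D)}{P(H_2 \mid D)} = \mathrm{BF} \times \frac{P(H_1)}{P(H_2)},
\]

i.e., the posterior odds equal the Bayes factor times the prior odds; in the coin example above the Bayes factor is 32 in favor of H1.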
Hypothesis Classes Varying in Complexity
E.g., 1st, 2nd, and 3rd order polynomials
Hypothesis class is parameterized by w
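A small sketch of how such a comparison can be run in practice: fit polynomial classes of increasing order and score them with BIC, the approximation to the (log) Bayes factor mentioned above. The data-generating function, noise level, and sample size are arbitrary choices for illustration.

```python
# Sketch: Bayesian Occam's razor via BIC for polynomial hypothesis classes.
# The true curve, noise level, and sample size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.5 * x - 0.5 * x ** 2 + rng.normal(scale=0.1, size=x.size)  # true order: 2

def bic(order):
    """BIC = n*log(mean squared residual) + k*log(n), with k = order + 1 weights w."""
    n, k = x.size, order + 1
    w = np.polyfit(x, y, order)             # least-squares fit within the class
    resid = y - np.polyval(w, x)
    return n * np.log(np.mean(resid ** 2)) + k * np.log(n)

for order in (1, 2, 3):
    print(order, round(bic(order), 2))      # lowest BIC balances fit and complexity
```

The first-order class underfits (poor likelihood), the third-order class fits slightly better but pays the k·log n complexity penalty, so the second-order class typically comes out best.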
Rissanen (1976): Minimum Description Length
Prefer models that can communicate the data in the smallest number of bits.
The preferred hypothesis H for explaining data D minimizes:
(1) length of the description of the hypothesis
(2) length of the description of the data with the help of the chosen theory
L: length
MDL & BayesL: some measure of length (complexity)
MDL: prefer hypothesis that min. L(H) + L(D|H)
Bayes rule implies MDL principle P(H|D) = P(D|H)P(H) / P(D) –log P(H|D) = –log P(D|H) – log P(H) + log P(D) = L(D|H) + L(H) + const
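Tying this back to the coin example above (an illustration, not from the slides): measured in bits,

\[
L(D \mid H_2) = -\log_2 \tfrac{1}{32} = 5 \text{ bits},
\qquad
L(D \mid H_1) = -\log_2 1 = 0 \text{ bits},
\]

so if the two hypotheses take comparable numbers of bits to describe, MDL prefers H1, matching the likelihood argument.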
Relativity Example
Explain the deviation in Mercury's orbit at perihelion with respect to the prevailing theory
E: Einstein's theory; F: fudged Newtonian theory
α = true deviation; a = observed deviation
Relativity Example (Continued)
Subjective Ockham's razor: result depends on one's belief about P(α | F)
Objective Ockham's razor
For the Mercury example, the RHS is 15.04
Applies to generic situation