A(n) (extremely) brief/crude introduction to minimum description length principle
jdu
2006-04
Outline
• Conceptual/non-technical introduction
• Probabilities and Codelengths
• Crude MDL
• Refined MDL
• Other topics
Introduction
• Example: data compression
– Description methods
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Introduction
• Example: regression
– Model selection and overfitting
– Complexity of the model vs. Goodness of fit
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Introduction
• Models vs. Hypotheses
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Introduction
• Crude 2-part version of MDL
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Probabilities and Codelengths
• Let X be a finite or countable set
– A code C(x) for X
  • 1-to-1 mapping from X to $\bigcup_{n>0}\{0,1\}^n$
  • $L_C(x)$: number of bits needed to encode x using C
– P: probability distribution defined on X
  • P(x): the probability of x
  • A sequence of (usually iid) observations $x_1, x_2, \ldots, x_n$: $x^n$
Probabilities and Codelengths
• Prefix codes: as examples of uniquely decodable codes
– no code word is a prefix of any other
a 0
b 111
c 1011
d 1010
r 110
! 100
Source: http://www.cs.princeton.edu/courses/archive/spring04/cos126/
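The prefix property of the table above can be checked mechanically. The following is a minimal sketch; the function names and the test message are illustrative assumptions, not part of the slides.

```python
# Code words copied from the table above.
CODE = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}

def is_prefix_free(code):
    """True if no code word is a prefix of a different code word."""
    words = list(code.values())
    return not any(u != v and v.startswith(u) for u in words for v in words)

def encode(text, code):
    """Concatenate the code words of the symbols in text."""
    return "".join(code[ch] for ch in text)

def decode(bits, code):
    """Left-to-right decoding; unambiguous because the code is prefix-free."""
    inverse = {w: s for s, w in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("bit string ends in the middle of a code word")
    return "".join(out)

message = "abracadabra!"  # assumed example input
bits = encode(message, CODE)
print(len(bits), decode(bits, CODE) == message)
```

Because no code word is a prefix of another, the decoder never needs to look ahead or use separators between code words.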
Probabilities and Codelengths
• Expected codelength of a code C
$$E_P(L_C(x)) = \sum_{x \in X} P(x) L_C(x)$$
– Lower bound:
$$H(P) = -\sum_{x \in X} P(x) \log_2 P(x)$$
• Optimal code
– if it has minimum expected codelength over all uniquely decodable codes
– How to design one given P?
  • Huffman coding
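As a rough numerical illustration of the entropy lower bound, the sketch below compares the expected codelength of a prefix code with the entropy of a distribution. The distribution is an assumption: it is taken from the symbol frequencies of an example message, and the helper names are hypothetical.

```python
from collections import Counter
from math import log2

# A prefix code over six symbols (assumed here for illustration).
CODE = {"a": "0", "b": "111", "c": "1011", "d": "1010", "r": "110", "!": "100"}

# Assumed distribution P: empirical symbol frequencies of an example message.
message = "abracadabra!"
P = {s: c / len(message) for s, c in Counter(message).items()}

def expected_codelength(P, code):
    """E_P(L_C(x)) = sum over x of P(x) * L_C(x), in bits."""
    return sum(p * len(code[s]) for s, p in P.items())

def entropy(P):
    """H(P) = -sum over x of P(x) * log2 P(x), in bits."""
    return -sum(p * log2(p) for p in P.values() if p > 0)

print(expected_codelength(P, CODE), entropy(P))
```

The expected codelength can never drop below the entropy for a uniquely decodable code; here the two values are close but not equal.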
Probabilities and Codelengths
• Huffman coding
Source: http://star.itc.it/caprile/teaching/algebra-superiore-2001/
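The figure for this slide is not available in the transcript. As a stand-in, here is a minimal sketch of the standard Huffman construction (repeatedly merge the two least probable subtrees); the function name and the example distribution are assumptions.

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code; returns a dict symbol -> bit string."""
    # Heap entries: (probability, tie breaker, {symbol: partial code word}).
    # The integer tie breaker keeps the dicts from ever being compared.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    if counter == 1:
        # A single symbol still needs one bit to be written down.
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)  # least probable subtree
        p1, _, c1 = heapq.heappop(heap)  # second least probable subtree
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # assumed example
code = huffman_code(probs)
print({s: len(w) for s, w in code.items()})
```

For this dyadic example the codelengths equal -log2 P(x) exactly, so the expected codelength meets the entropy bound.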
Probabilities and Codelengths
• How to design code for {1, 2, …, M}?
– Assuming a uniform distribution: 1/M for each number
– ~log M bits
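One way to realize such a uniform code is a fixed-width binary representation. This is a sketch under that assumption; the function names are hypothetical.

```python
def uniform_encode(k, M):
    """Encode k in {1, ..., M} with a fixed width of ceil(log2 M) bits."""
    if not 1 <= k <= M:
        raise ValueError("k must lie in {1, ..., M}")
    width = max(1, (M - 1).bit_length())  # ceil(log2 M), at least one bit
    return format(k - 1, f"0{width}b")

def uniform_decode(bits):
    """Invert uniform_encode."""
    return int(bits, 2) + 1

print(uniform_encode(5, 6))
```

Every number gets the same codelength, which matches -log2(1/M) up to rounding.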
Probabilities and Codelengths
• How to design code for all the positive integers?
– For each k
  • Describe it with $\lceil \log k \rceil$ 0s
  • Followed by a 1
  • Then encode k using the uniform code for $\{1, \ldots, 2^{\lceil \log k \rceil}\}$
  • In total, ~ 2 log k + 1 bits
– Can be refined…
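The construction for the positive integers can be sketched as follows, assuming the unary prefix consists of ceil(log2 k) zeros and the uniform part is a fixed-width binary number of the same length. Function names are hypothetical.

```python
def encode_int(k):
    """Prefix code for a positive integer k using 2 * ceil(log2 k) + 1 bits."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    n = (k - 1).bit_length()  # equals ceil(log2 k) for k >= 1
    body = format(k - 1, f"0{n}b") if n else ""
    # n zeros announce the width, a 1 ends the announcement, then n bits for k.
    return "0" * n + "1" + body

def decode_int(bits, pos=0):
    """Decode one integer starting at pos; returns (k, next position)."""
    n = 0
    while bits[pos] == "0":
        n += 1
        pos += 1
    pos += 1  # skip the terminating 1
    k = int(bits[pos:pos + n], 2) + 1 if n else 1
    return k, pos + n

stream = "".join(encode_int(k) for k in (1, 2, 3, 10))
print(stream)
```

Since the zeros tell the decoder how many bits follow, several integers can be concatenated and read back without separators.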
Probabilities and Codelengths
• Let P be a probability distribution over X, then there exists a code C for X such that:
$$L_C(x) = \lceil -\log P(x) \rceil$$
• Let C be a uniquely decodable code over X, then there exists a probability distribution P such that:
$$L_C(x) = -\log P(x)$$
– For a sequence $x^n$:
$$L_C(x^n) = -\log P(x^n)$$
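Both directions of this correspondence can be illustrated numerically through the Kraft inequality. The sketch below uses an assumed example distribution and an assumed set of codelengths; the function names are hypothetical.

```python
from math import ceil, log2

def shannon_lengths(P):
    """Codelengths ceil(-log2 P(x)); these always satisfy the Kraft inequality."""
    return {x: ceil(-log2(p)) for x, p in P.items()}

def kraft_sum(lengths):
    """Sum of 2^(-L(x)); at most 1 for any uniquely decodable code."""
    return sum(2.0 ** -l for l in lengths.values())

def implied_distribution(lengths):
    """P(x) = 2^(-L(x)); sums to less than 1 if the code wastes code words."""
    return {x: 2.0 ** -l for x, l in lengths.items()}

P = {"w": 0.4, "x": 0.3, "y": 0.2, "z": 0.1}  # assumed example distribution
print(shannon_lengths(P), kraft_sum(shannon_lengths(P)))

# Assumed codelengths of a complete prefix code over six symbols.
lengths = {"a": 1, "b": 3, "c": 4, "d": 4, "r": 3, "!": 3}
print(sum(implied_distribution(lengths).values()))
```

A Kraft sum of exactly 1 means the implied distribution is a proper one; a smaller sum gives a defective distribution whose total mass is below 1.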
Probabilities and Codelengths
• Codelength revisited
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– A sequence: $X_1, X_2, \ldots, X_N$
– Special case: 0-th order: Bernoulli model (biased coin)
  • Maximum Likelihood estimator
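For the Bernoulli case the maximum likelihood estimator is the fraction of ones. The sketch below represents a sequence as a string of '0'/'1' characters, which is an assumption of this example, as are the function names.

```python
from math import log2

def bernoulli_mle(x):
    """ML estimate of P(X = 1): the relative frequency of ones."""
    return x.count("1") / len(x)

def bernoulli_codelength(x, theta):
    """-log2 P(x | theta) in bits for an iid Bernoulli(theta) model."""
    n1 = x.count("1")
    n0 = len(x) - n1
    bits = 0.0
    if n1:
        bits -= n1 * log2(theta)      # fails if theta == 0 but ones occur
    if n0:
        bits -= n0 * log2(1 - theta)  # fails if theta == 1 but zeros occur
    return bits

x = "1101"  # assumed example data
theta_hat = bernoulli_mle(x)
print(theta_hat, bernoulli_codelength(x, theta_hat))
```

No other value of theta gives a shorter codelength for the observed sequence than the ML estimate.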
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– Special case: first order Markov chain B(1)
  • MLE
Crude MDL
• Preliminary: k-th order Markov chain on X={0,1}
– $2^k$ parameters
  • theta[1|000…000] = n[1|000…000]/n[000…000]
  • theta[1|000…001]
  • …
  • theta[1|111…110]
  • theta[1|111…111]
– Log likelihood function: …
– MLE: …
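The counting estimator theta[1|context] = n[1|context]/n[context] can be sketched as below. This is an illustration only: it assumes sequences are strings of '0'/'1', conditions on the first k symbols rather than modelling them, and uses hypothetical function names. The elided log likelihood and MLE formulas of the slide are not reproduced here.

```python
from collections import Counter
from math import log2

def markov_mle(x, k):
    """Return {context: theta[1|context]} estimated by counting."""
    n_ctx, n_one = Counter(), Counter()
    for i in range(k, len(x)):
        ctx = x[i - k:i]          # the k symbols preceding position i
        n_ctx[ctx] += 1
        if x[i] == "1":
            n_one[ctx] += 1
    return {ctx: n_one[ctx] / n_ctx[ctx] for ctx in n_ctx}

def neg_log_likelihood(x, k, theta):
    """-log2 P(x_{k+1}, ..., x_n | first k symbols, theta) in bits."""
    bits = 0.0
    for i in range(k, len(x)):
        p1 = theta[x[i - k:i]]
        bits -= log2(p1 if x[i] == "1" else 1 - p1)
    return bits

x = "0101010101"  # assumed example data
for k in (0, 1):
    print(k, neg_log_likelihood(x, k, markov_mle(x, k)))
```

On this alternating sequence the 0-th order model needs one bit per symbol, while the first order model predicts every transition with certainty.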
Crude MDL
• Question: Given data $D = x^n$, find the Markov chain that best explains D.
– We do not want to restrict ourselves to chains of fixed order
  • How to avoid overfitting?
  • Obviously, an (n-1)-th order Markov model would always fit the data the best
Crude MDL
• two-part MDL revisited
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Crude MDL
• Description length of data given hypothesis
Crude MDL
• Description length of hypothesis
– The code should not change with the sample size n.
– Different codes will lead to preferences of different hypotheses
– How to design a code that
  • Leads to good inferences with small, practically relevant sample sizes?
Crude MDL
• An "intuitive" and "reasonable" code for k-th order Markov chain
– First describe k using 2 log k + 1 bits
– Then describe the $d = 2^k$ parameters
  • Assume n is given in advance
  – For each theta in the MLE {theta[1|000…000], …, theta[1|111…111]}, the best precision we can achieve by counting is 1/(n+1)
  – Describe each theta with log(n+1) bits
– L(H) = 2 log k + 1 + d log(n+1)
– L(H) + L(D|H) = 2 log k + 1 + d log(n+1) – log P(D|k, theta)
– For a given k, only the MLE theta needs to be considered
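The two-part codelength above can be turned into a small order-selection routine. This is a sketch under stated assumptions, not the presenter's implementation: sequences are strings of '0'/'1', the first k symbols are conditioned on rather than encoded, and the integer code is applied to k + 1 so that order 0 can be described. All function names are hypothetical.

```python
from collections import Counter
from math import log2

def data_bits(x, k):
    """L(D|H) = -log2 P(D | k, MLE theta), conditioning on the first k symbols."""
    n_ctx, n_one = Counter(), Counter()
    for i in range(k, len(x)):
        ctx = x[i - k:i]
        n_ctx[ctx] += 1
        n_one[ctx] += x[i] == "1"
    bits = 0.0
    for ctx, total in n_ctx.items():
        ones = n_one[ctx]
        for count in (ones, total - ones):
            if count:
                bits -= count * log2(count / total)
    return bits

def hypothesis_bits(k, n):
    """L(H) = (bits to describe the order) + d * log2(n + 1), with d = 2^k."""
    # Assumption: describe k + 1 with the 2 * ceil(log2 m) + 1 bit integer code;
    # k.bit_length() equals ceil(log2(k + 1)).
    order_bits = 2 * k.bit_length() + 1
    return order_bits + (2 ** k) * log2(n + 1)

def best_order(x, max_k):
    """Order k in {0, ..., max_k} minimizing L(H) + L(D|H)."""
    return min(range(max_k + 1),
               key=lambda k: hypothesis_bits(k, len(x)) + data_bits(x, k))

print(best_order("01" * 100, 4), best_order("001" * 100, 4))
```

Higher orders fit these periodic sequences equally well, but each extra order doubles the parameter cost d log(n+1), so the smallest order that captures the pattern wins.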
Crude MDL
• Good news
– We have found a principled manner to encode data D using H
• Bad news
– We have not found clear guidelines to design codes for H
Refined MDL
• Universal codes and universal distributions
– maximum likelihood code depends on the data
  • How to describe the data in an unambiguous manner?
– Design a code such that for every possible observation, its codelength corresponds to its ML? - impossible
Refined MDL
• Worst-case regret
• Optimal universal model
Refined MDL
• Normalized maximum likelihood (NML)
• Minimizing -logNML
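The formulas on this slide are not preserved in the transcript. As a hedged illustration of the standard definition, the NML distribution divides the maximized likelihood of a sequence by the sum of maximized likelihoods over all sequences of the same length. The sketch below does this for the Bernoulli model; the function names and the string representation of sequences are assumptions.

```python
from itertools import product
from math import comb, log2

def ml_prob(n1, n):
    """Maximized Bernoulli likelihood of a length-n sequence with n1 ones."""
    theta = n1 / n
    return theta ** n1 * (1 - theta) ** (n - n1)

def complexity(n):
    """log2 of the sum of maximized likelihoods over all 2^n sequences."""
    # Sequences with the same number of ones share the same maximized likelihood.
    return log2(sum(comb(n, m) * ml_prob(m, n) for m in range(n + 1)))

def nml_codelength(x):
    """-log2 NML(x) = -log2 P(x | MLE) + complexity(n), in bits."""
    n, n1 = len(x), x.count("1")
    return -log2(ml_prob(n1, n)) + complexity(n)

total = sum(2 ** -nml_codelength("".join(s)) for s in product("01", repeat=4))
print(complexity(2), total)
```

The normalizing term is the same for every sequence of a given length, so each sequence pays the same extra cost relative to its own maximum likelihood codelength.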
Refined MDL
• Complexity of a model
– The more sequences that can be fit well by an element of M, the larger M’s complexity
– Would it lead to a "right" balance between complexity and fit?
  • Hopefully…
Refined MDL
• General refined MDL
Source: Grünwald et al. (2005) Advances in Minimum Description Length: Theory and Applications.
Other topics
• Mixture code
• Resolvability
• …
References
• Barron, A.; Rissanen, J. & Yu, B. (1998), 'The minimum description length principle in coding and modeling', IEEE Transactions on Information Theory 44(6), 2743-2760.
• Grünwald, P.D.; Myung, I.J. & Pitt, M.A. (2005), Advances in Minimum Description Length: Theory and Applications (Neural Information Processing), The MIT Press.
• Hall, P. & Hannan, E.J. (1988), 'On stochastic complexity and nonparametric density estimation', Biometrika 75(4), 705-714.