Statistical Computing, University of Notre Dame, Notre Dame, IN, USA (Fall 2018, N. Zabaras)
Introduction to Probability and Statistics (Continued)
Prof. Nicholas Zabaras
Center for Informatics and Computational Science
https://cics.nd.edu/
University of Notre Dame
Notre Dame, Indiana, USA
Email: nzabaras@gmail.com
URL: https://www.zabaras.com/
August 29, 2018
Contents

Markov and Chebyshev Inequalities

The Law of Large Numbers, Central Limit Theorem, Monte Carlo Approximation of Distributions, Estimating π, Accuracy of Monte Carlo Approximation, Approximating the Binomial with a Gaussian, Approximating the Poisson Distribution with a Gaussian, Application of the CLT to Noise Signals

Information Theory, Entropy, KL Divergence, Jensen's Inequality, Mutual Information, Maximal Information Coefficient
References

• Following closely Chris Bishop's PRML book, Chapter 2
• Kevin Murphy, Machine Learning: A Probabilistic Perspective, Chapter 2
• Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
• Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
• Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.
Markov and Chebyshev Inequalities

Markov's inequality: if X is a non-negative integrable random variable, then for any a > 0:

Pr[X ≥ a] ≤ E[X]/a

Indeed: E[X] = ∫₀^∞ x p(x) dx ≥ ∫_a^∞ x p(x) dx ≥ a ∫_a^∞ p(x) dx = a Pr[X ≥ a].

This generalizes to any non-negative function f of the random variable X:

Pr[f(X) ≥ a] ≤ E[f(X)]/a

Using f(X) = (X − E[X])² and a = ε², with σ² = Var[X], we derive Chebyshev's inequality:

Pr[|X − E[X]| ≥ ε] ≤ σ²/ε²

In terms of standard deviations (ε = 2σ), we can restate this as:

Pr[|X − E[X]| ≥ 2σ] ≤ 1/4

Thus the probability of X being more than 2σ away from E[X] is at most ¼.
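The 2σ bound above can be checked empirically with a short Monte Carlo sketch (Python/NumPy here rather than the MATLAB used in the course; the exponential distribution is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from Exponential(1); Chebyshev holds for any distribution with finite variance.
x = rng.exponential(scale=1.0, size=1_000_000)
mu, sigma = x.mean(), x.std()

# Empirical probability of landing 2 or more standard deviations from the mean.
p_tail = np.mean(np.abs(x - mu) >= 2 * sigma)
print(p_tail)  # far below the Chebyshev bound of 0.25
```

For a specific distribution the actual tail probability is usually much smaller than 1/4; Chebyshev is a worst-case bound over all distributions with that variance.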
The Law of Large Numbers (LLN)

Let Xᵢ for i = 1, 2, ..., N be independent and identically distributed (i.i.d.) random variables with finite mean E(Xᵢ) = μ and variance Var(Xᵢ) = σ².

Let

X̄_N = (1/N) Σ_{i=1}^N Xᵢ

Note that

E[X̄_N] = (1/N) Σ_{i=1}^N μ = μ,  Var[X̄_N] = (1/N²) Σ_{i=1}^N σ² = σ²/N

Weak LLN:

lim_{N→∞} Pr(|X̄_N − μ| ≥ ε) = 0, ∀ε > 0

Strong LLN:

lim_{N→∞} X̄_N = μ almost surely

This means that, with probability one, the average of any realizations x₁, x₂, ... of the random variables X₁, X₂, ... converges to the mean.
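The convergence of the sample mean can be seen in a few lines (a NumPy sketch; the uniform distribution on [0, 1], with true mean 0.5, is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# i.i.d. draws from U[0, 1]; true mean mu = 0.5, variance 1/12.
mu = 0.5
sample_means = {n: rng.uniform(0.0, 1.0, size=n).mean() for n in (10, 1_000, 100_000)}
for n, xbar in sample_means.items():
    print(n, xbar, abs(xbar - mu))  # the error shrinks roughly like 1/sqrt(n)
```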
Statistical Inference: Parametric & Non-Parametric Approach

Assume that we have a set of observations

S = {x₁, x₂, ..., x_N},  x_j ∈ ℝⁿ

The problem is to infer the underlying probability distribution that gives rise to the data S.

Parametric problem: the underlying probability density has a specified known form that depends on a number of parameters. The problem of interest is to infer those parameters.

Non-parametric problem: no analytical expression for the probability density is available. The description consists of defining the dependency or independence of the data. This leads to numerical exploration.

A typical situation for the parametric model is when the distribution is the PDF of a random variable X: Ω → ℝⁿ.
Example of the Law of Large Numbers

Assume that we sample

S = {x₁, x₂, ..., x_N},  x_j ∈ ℝⁿ

We consider a parametric model with x_j realizations of X ~ N(x₀, Γ), where we take both the mean x₀ and the covariance Γ as unknowns.

The probability density of X is:

π(x | x₀, Γ) = (2π)^(−n/2) det(Γ)^(−1/2) exp(−½ (x − x₀)ᵀ Γ⁻¹ (x − x₀))

Our problem is to estimate x₀ and Γ.
Empirical Mean and Empirical Covariance

From the law of large numbers, we calculate:

x₀ = E[X] ≈ (1/N) Σ_{j=1}^N x_j = x̄

To compute the covariance matrix, note that if X₁, X₂, ... are i.i.d., so are f(X₁), f(X₂), ... for any function f: ℝⁿ → ℝᵏ.

Then we can compute:

Γ = cov(X) = E[(x − E[X])(x − E[X])ᵀ] ≈ (1/N) Σ_{j=1}^N (x_j − x̄)(x_j − x̄)ᵀ = Γ̄

The above formulas define the empirical mean and the empirical covariance.
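These two estimators take a few lines in NumPy (an illustrative sketch with standard-normal data, so the true mean is 0 and the true covariance is the identity):

```python
import numpy as np

rng = np.random.default_rng(2)

# N i.i.d. 2-D observations, one per row.
X = rng.normal(size=(50_000, 2))

x_bar = X.mean(axis=0)                         # empirical mean
Gamma = (X - x_bar).T @ (X - x_bar) / len(X)   # empirical covariance
print(x_bar)
print(Gamma)
```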
The Central Limit Theorem

Let (X₁, X₂, ..., X_N) be independent and identically distributed (i.i.d.) continuous random variables, each with expectation μ and variance σ².

Define:

Z_N = (X₁ + X₂ + ... + X_N − Nμ)/(σ√N) = (X̄_N − μ)√N/σ,  X̄_N = (1/N) Σ_{i=1}^N Xᵢ

As N → ∞, the distribution of Z_N converges to the distribution of a standard normal random variable:

lim_{N→∞} P(Z_N ≤ x) = ∫_{−∞}^x (1/√(2π)) e^(−t²/2) dt

Thus, for N large,

X̄_N ~ N(μ, σ²/N) as N → ∞

This is somewhat of a justification for why assuming Gaussian noise is common.
The CLT and the Gaussian Distribution

As an example, assume N variables (X₁, X₂, ..., X_N), each of which has a uniform distribution over [0, 1], and consider the distribution of the sample mean (X₁ + X₂ + ... + X_N)/N. For large N, this distribution tends to a Gaussian. The convergence as N increases can be rapid.

[Figure: histograms of the sample mean of N uniform variables for N = 1, 2, 10. MatLab code.]
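The histograms in the figure can be reproduced with a short sketch (NumPy here in place of the MATLAB code the deck points to); with N = 10 the sample means already match the CLT prediction N(1/2, 1/(12N)) closely:

```python
import numpy as np

rng = np.random.default_rng(3)

# 10000 realizations of the mean of N uniform variables on [0, 1].
N = 10
means = rng.uniform(0.0, 1.0, size=(10_000, N)).mean(axis=1)

# CLT prediction: approximately N(1/2, 1/(12 N)).
print(means.mean(), means.std(), np.sqrt(1 / (12 * N)))
```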
The CLT and the Gaussian Distribution

We plot a histogram of x̄_j = (1/N) Σ_{i=1}^N x_ij, j = 1:10000, where x_ij ~ Beta(1, 5).

As N → ∞, the distribution tends towards a Gaussian.

[Figure: histograms of the sample mean for N = 1, 5, 10. Run centralLimitDemo from PMTK.]
Accuracy of Monte Carlo Approximation

In the Monte Carlo approximation of the mean μ = E[f(x)] using the sample mean f̄, we have:

f̄ − μ ~ N(0, σ²/S),  σ² = E[f²(x)] − E[f(x)]² ≈ (1/S) Σ_{s=1}^S (f(x_s) − f̄)² ≡ σ̄²

We can now derive the following error bars (using central intervals):

Pr(μ − 1.96 σ̄/√S ≤ f̄ ≤ μ + 1.96 σ̄/√S) = 0.95

The number of samples needed to drop the error within ε is then:

1.96 σ̄/√S ≤ ε ⟹ S ≥ 4σ̄²/ε²
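The error-bar recipe above is easy to exercise numerically (a sketch; the test function f(x) = x² with x ~ N(0, 1), so μ = 1 exactly, is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(4)

# MC estimate of mu = E[f(x)] for f(x) = x**2 with x ~ N(0, 1), so mu = 1 exactly.
S = 10_000
f = rng.normal(size=S) ** 2

f_bar = f.mean()
sigma_bar = f.std(ddof=1)
half_width = 1.96 * sigma_bar / np.sqrt(S)   # 95% central interval half-width
print(f_bar, "+/-", half_width)

# Samples needed to bring the error below eps (using 1.96^2 ~ 4):
eps = 0.01
S_needed = int(np.ceil(4 * sigma_bar**2 / eps**2))
print(S_needed)
```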
Monte Carlo Approximation of Distributions

Computing the distribution of y = x², where p(x) is uniform. The MC approximation is shown on the right: take samples from p(x), square them, and then plot the histogram.

[Figure: p(x) and the MC approximation of p(y). Run changeofVarsDemo1d from PMTK.]
Accuracy of Monte Carlo Approximation

The accuracy of MC increases with the number of samples. Histograms (on the left) and PDFs computed with kernel density estimation (on the right), for 10 and 100 samples drawn from N(x | 1.5, 0.25²). The actual distribution is shown in red.

[Figure. Run mcAccuracyDemo from PMTK.]
Example of CLT: Estimating π by MC

Use the CLT to approximate π. Here x, y are uniform random variables on [−r, +r], with r = 2 and p(x) = p(y) = 1/(2r):

π r² = ∫∫ I(x² + y² ≤ r²) dx dy
 ⟹ π = (4r²/r²) ∫∫ I(x² + y² ≤ r²) p(x) p(y) dx dy = 4 E[I(x² + y² ≤ r²)]

The MC estimate is:

π̄ ≈ (4/S) Σ_{s=1}^S I(x_s² + y_s² ≤ r²),  x_s, y_s ~ U[−r, r]

We find π̄ = 3.1416 with standard error 0.09.

[Figure: samples inside and outside the circle of radius r. Run mcEstimatePi from PMTK.]
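The estimator and its standard error can be sketched directly (NumPy in place of the PMTK MATLAB demo; sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)

# x, y uniform on [-r, r]; the disk indicator has mean pi r^2 / (2r)^2 = pi/4.
r, S = 2.0, 1_000_000
x = rng.uniform(-r, r, size=S)
y = rng.uniform(-r, r, size=S)
inside = (x**2 + y**2 <= r**2)

pi_hat = 4.0 * inside.mean()
std_err = 4.0 * inside.std(ddof=1) / np.sqrt(S)
print(pi_hat, "+/-", std_err)
```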
CLT: The Binomial Tends to a Gaussian

One consequence of the CLT is that the binomial distribution

Bin(m | N, θ) = (N choose m) θᵐ (1 − θ)^(N−m)

which is a distribution over the sum m = x₁ + x₂ + ... + x_N of N observations of the random binary variable x, will tend to a Gaussian as N → ∞:

m/N ~ N(θ, θ(1 − θ)/N) as N → ∞

[Figure: the binomial distribution Bin(m | N = 10, θ = 0.25). Matlab Code.]
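The quality of the Gaussian approximation can be checked pointwise (a sketch; N = 100 and θ = 0.25 are illustrative values):

```python
import numpy as np
from math import comb

# Compare Bin(m | N, theta) with the Gaussian N(m | N*theta, N*theta*(1-theta)).
N, theta = 100, 0.25
mean, var = N * theta, N * theta * (1 - theta)

m = np.arange(N + 1)
binom = np.array([comb(N, k) * theta**k * (1 - theta) ** (N - k) for k in m])
gauss = np.exp(-((m - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

print(np.abs(binom - gauss).max())  # pointwise gap, small for large N
```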
Poisson Process

Consider that we count the number of photons from a light source. Let N(t) be the number of photons observed in the time interval [0, t]; N(t) is an integer-valued random variable. We make the following assumptions:

a. Stationarity: let Δ₁ and Δ₂ be any two time intervals of equal length, and n any non-negative integer. Assume that

Prob. of n photons in Δ₁ = Prob. of n photons in Δ₂

b. Independent increments: let Δ₁, Δ₂, ..., Δ_n be non-overlapping time intervals and k₁, k₂, ..., k_n non-negative integers. Denote by A_j the event defined as

A_j = {k_j photons arrive in the time interval Δ_j}

Assume that these events are mutually independent, i.e.

P(A₁ ∩ A₂ ∩ ... ∩ A_n) = P(A₁) P(A₂) ... P(A_n)

c. Negligible probability of coincidence: assume that the probability of two or more events at the same time is negligible. More precisely, N(0) = 0 and

lim_{h→0} P(N(h) > 1)/h = 0
Poisson Process

If these assumptions hold, then for a given time t, N(t) is a Poisson process:

P(N(t) = n) = ((λt)ⁿ/n!) e^(−λt),  λ > 0, n = 0, 1, 2, ...

Let us fix t = T = observation time and define a random variable N = N(T). With the parameter θ = λT, we then denote:

N ~ Poisson(θ):  P(N = n) = (θⁿ/n!) e^(−θ)
D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007
S. Ghahramani, Fundamentals of Probability, 1996.
Poisson Process

Consider the Poisson (discrete) distribution

P(N = n) = Poisson(n | θ) = (θⁿ/n!) e^(−θ),  n = 0, 1, 2, ...

The mean and the variance are both equal to θ:

E[N] = Σ_{n=0}^∞ n Poisson(n | θ) = θ,  Var[N] = θ
Approximating a Poisson Distribution With a Gaussian

Theorem: a random variable X ~ Poisson(θ) can be considered as the sum of n independent random variables Xᵢ ~ Poisson(θ/n).ᵃ

According to the Central Limit Theorem, when n is large enough,

Take Xᵢ ~ Poisson(θ/n) ⟹ X = Σ_{i=1}^n Xᵢ ~ N(μ, σ²)

Then, based on the Theorem, X is a draw from Poisson(θ), and from the CLT it also follows a Gaussian for large n, with:

μ = E[X] = n (θ/n) = θ,  σ² = Var[X] = n (θ/n) = θ

Thus X ~ N(θ, θ) approximately.

The approximation of a Poisson distribution with a Gaussian for large θ is a result of the CLT!

ᵃ For a proof that the sum of independent Poisson random variables is a Poisson distribution, see this document.
Approximating a Poisson Distribution with a Gaussian

[Figure: Poisson distributions (dots) vs their Gaussian approximations (solid line) for mean θ = 5, 10, 15, 20. The higher the θ, the smaller the distance between the two distributions. See this MatLab implementation.]
Kullback-Leibler Distance Between Two Densities

Let us consider the following two distributions:

Poisson(n | θ) = (θⁿ/n!) e^(−θ),  Gaussian(x | θ, θ) = (1/√(2πθ)) exp(−(x − θ)²/(2θ))

We often use the Kullback-Leibler distance to define the distance between two distributions. In particular, in approximating the Poisson distribution with a Gaussian, we have:

KL(Poisson(· | θ), Gaussian(· | θ, θ)) = Σ_{n=0}^∞ Poisson(n | θ) log( Poisson(n | θ) / Gaussian(n | θ, θ) )
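The sum above is easy to evaluate numerically; a sketch (working in logs to avoid overflow, with the truncation point of the Poisson support chosen generously):

```python
from math import exp, lgamma, log, pi

def kl_poisson_gaussian(theta, n_max=400):
    """KL distance of Poisson(theta) from N(theta, theta), summed over n."""
    kl = 0.0
    for n in range(n_max):
        log_p = -theta + n * log(theta) - lgamma(n + 1)                  # log Poisson pmf
        log_q = -((n - theta) ** 2) / (2 * theta) - 0.5 * log(2 * pi * theta)
        kl += exp(log_p) * (log_p - log_q)
    return kl

print(kl_poisson_gaussian(1.0), kl_poisson_gaussian(20.0))  # shrinks as theta grows
```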
Approximating a Poisson Distribution With a Gaussian

[Figure: the KL distance of the Poisson distribution from its Gaussian approximation as a function of the mean θ, on a logarithmic scale. The horizontal line indicates where the KL distance has dropped to 1/10 of its value at θ = 1. See the following MatLab implementation.]
Application of the CLT: Noise Signals

Consider discrete sampling where the output is noise of length n. The noise vector x ∈ ℝⁿ is a realization of X: Ω → ℝⁿ.

We estimate the mean and the variance of the noise in a single measurement as follows:

x̄₀ = (1/n) Σ_{j=1}^n x_j,  s² = (1/n) Σ_{j=1}^n (x_j − x̄₀)²

To improve the signal-to-noise ratio, we repeat the measurement and average the noise vector signals:

x̄ = (1/N) Σ_{k=1}^N x^(k) ∈ ℝⁿ

The average noise is a realization of a random variable:

X̄ = (1/N) Σ_{k=1}^N X^(k)

If X^(1), X^(2), ... are i.i.d., X̄ is asymptotically a Gaussian by the CLT, and its variance is var(X̄) = s²/N. We need to repeat the experiment until the noise level 2s/√N drops below a given tolerance.
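The σ/√N reduction from averaging can be seen directly (a sketch with illustrative sizes, mirroring the Gaussian noise vectors of the next slide):

```python
import numpy as np

rng = np.random.default_rng(6)

sigma, n = 1.0, 50            # one noise vector: n samples with std sigma
stds = {}
for N in (1, 5, 25, 100):     # number of averaged measurements
    avg = rng.normal(0.0, sigma, size=(N, n)).mean(axis=0)
    stds[N] = avg.std()
    print(N, stds[N])         # shrinks roughly like sigma / sqrt(N)
```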
Noise Reduction By Averaging

Gaussian noise vectors of size 50 are used, with σ² = 1; s is the std of a single noise vector.

[Figure: averaged noise signals for N = 1, 5, 10, 25, with estimated noise level 2s/√N. See the following MatLab implementation.]

D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007
Introduction to Information Theory

Information theory is concerned with

representing data in a compact fashion (data compression or source coding), and

transmitting and storing it in a way that is robust to errors (error correction or channel coding).

Compactly representing data requires allocating short codewords to highly probable bit strings and reserving longer codewords for less probable bit strings. E.g., in natural language, common words ("a", "the", "and") are much shorter than rare words.

D. MacKay, Information Theory, Inference and Learning Algorithms (Video Lectures)
Introduction to Information Theory

Decoding messages sent over noisy channels requires a good probability model of the kinds of messages that people tend to send.

We need models that can predict which kinds of data are likely and which unlikely.

• David MacKay, Information Theory, Inference and Learning Algorithms, 2003 (available online)
• Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Wiley, 2006.
• Viterbi, A. J. and J. K. Omura (1979). Principles of Digital Communication and Coding. McGraw-Hill.
Introduction to Information Theory

Consider a discrete random variable x. How much information ("degree of surprise") is received when we observe (learn) a specific value of this variable?

Observing a highly probable event provides little additional information.

If we have two events x and y that are unrelated, then the information gained from observing both of them should be h(x, y) = h(x) + h(y).

Two unrelated events will be statistically independent, so p(x, y) = p(x)p(y).
Entropy

From h(x, y) = h(x) + h(y) and p(x, y) = p(x)p(y), it is easily shown that h(x) must be given by the logarithm of p(x), so we have

h(x) = −log₂ p(x) ≥ 0

Low-probability events correspond to high information content. The units of h(x) are bits ("binary digits").

When transmitting a random variable, the average amount of transmitted information is the entropy of X:

H[X] = −Σ_{k=1}^K p(X = k) log₂ p(X = k)
Noiseless Coding Theorem (Shannon)

Example 1 (coding theory): x is a discrete random variable with 8 possible states; how many bits are needed to transmit the state of x?

If all states are equally likely:

H[x] = −8 × (1/8) log₂(1/8) = 3 bits

Example 2: consider a variable having 8 possible states {a, b, c, d, e, f, g, h}, for which the respective (non-uniform) probabilities are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64).

The entropy in this case is smaller than for the uniform distribution:

H[x] = −(1/2)log₂(1/2) − (1/4)log₂(1/4) − (1/8)log₂(1/8) − (1/16)log₂(1/16) − (4/64)log₂(1/64) = 2 bits

average code length = 1×(1/2) + 2×(1/4) + 3×(1/8) + 4×(1/16) + 6×4×(1/64) = 2 bits

Note: shorter codes for the more probable events vs longer codes for the less probable events.

Shannon's Noiseless Coding Theorem (1948): the entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
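The two entropies in the examples above can be reproduced with a small sketch:

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

uniform = [1 / 8] * 8
skewed = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

print(entropy_bits(uniform))  # 3.0 bits
print(entropy_bits(skewed))   # 2.0 bits
```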
Alternative Definition of Entropy

Consider a set of N identical objects that are to be divided amongst a set of bins, such that there are nᵢ objects in the i-th bin. Consider the number of different ways of allocating the objects to the bins.

In the i-th bin there are nᵢ! ways of reordering the objects (microstates), and so the total number of ways of allocating the N objects to the bins is given by (the multiplicity)

W = N! / ∏ᵢ nᵢ!

The entropy is defined as

H = (1/N) ln W = (1/N) ln N! − (1/N) Σᵢ ln nᵢ!

We now consider the limit N → ∞, using ln N! ≈ N ln N − N and ln nᵢ! ≈ nᵢ ln nᵢ − nᵢ:

H = −lim_{N→∞} Σᵢ (nᵢ/N) ln(nᵢ/N) = −Σᵢ pᵢ ln pᵢ

pᵢ is the probability of an object being assigned to the i-th bin. The occupation numbers pᵢ correspond to macrostates.
Alternative Definition of Entropy

Interpret the bins as the states xᵢ of a discrete random variable X, where p(X = xᵢ) = pᵢ. The entropy of the random variable X is then

H[p] = −Σᵢ p(xᵢ) ln p(xᵢ)

Distributions p(x) that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy.
Maximum Entropy: Uniform Distribution

The maximum-entropy configuration can be found by maximizing H using a Lagrange multiplier to enforce the normalization constraint on the probabilities. Thus we maximize

H̃ = −Σᵢ p(xᵢ) ln p(xᵢ) + λ(Σᵢ p(xᵢ) − 1)

We find p(xᵢ) = 1/M, where M is the number of possible states, and H = ln M.

To verify that the stationary point is indeed a maximum, we can evaluate the second derivative of the entropy, which gives

∂²H̃ / (∂p(xᵢ) ∂p(xⱼ)) = −Iᵢⱼ (1/pᵢ)

where Iᵢⱼ are the elements of the identity matrix.

For any discrete distribution with M states, we have H[x] ≤ ln M. Here we used Jensen's inequality (for the concave function ln):

H[x] = Σᵢ p(xᵢ) ln(1/p(xᵢ)) ≤ ln( Σᵢ p(xᵢ) (1/p(xᵢ)) ) = ln M
Example: Biosequence Analysis

Recall the DNA sequence logo example from earlier. The height of each bar is defined to be 2 − H, where H is the entropy of that distribution and 2 (= log₂ 4) is the maximum possible entropy.

Thus a bar of height 0 corresponds to a uniform distribution (H = log₂ 4), whereas a bar of height 2 corresponds to a deterministic distribution.

[Figure: sequence logo, bits vs sequence position. seqlogoDemo from PMTK.]
Binary Variable

Consider a binary random variable X ∈ {0, 1}, and write p(X = 1) = θ and p(X = 0) = 1 − θ.

Hence the entropy becomes (the binary entropy function):

H[X] = −θ log₂ θ − (1 − θ) log₂(1 − θ)

The maximum value of 1 occurs when the distribution is uniform, θ = 0.5.

[Figure: H(X) vs p(X = 1). MatLab function bernoulliEntropyFig from PMTK.]
Differential Entropy

Divide x into bins of width Δ. Assuming p(x) is continuous, for each such bin there must exist an xᵢ such that

∫_{iΔ}^{(i+1)Δ} p(x) dx = p(xᵢ)Δ  (the probability of falling in bin i)

H_Δ = −Σᵢ p(xᵢ)Δ ln(p(xᵢ)Δ) = −Σᵢ p(xᵢ)Δ ln p(xᵢ) − ln Δ

The ln Δ term is omitted since it diverges as Δ → 0 (indicating that infinitely many bits are needed to describe a continuous variable). The remaining term gives the differential entropy:

lim_{Δ→0} [ −Σᵢ p(xᵢ)Δ ln p(xᵢ) ] = −∫ p(x) ln p(x) dx  (can be negative)
Differential Entropy

For a density defined over multiple continuous variables, denoted collectively by the vector x, the differential entropy is given by

H[x] = −∫ p(x) ln p(x) dx

Differential entropy (unlike the discrete entropy) can be negative.

When doing a variable transformation x → y, use p(x)dx = p(y)dy; e.g., if x = Ay, then p(y) = p(x)|A| and

H[y] = −∫ p(y) ln p(y) dy = −∫ p(x) ln( p(x)|A| ) dx = H[x] − ln|A|
Differential Entropy and the Gaussian Distribution

The distribution that maximizes the differential entropy with constraints on the first two moments is a Gaussian. We maximize

H̃ = −∫ p(x) ln p(x) dx + λ₁(∫ p(x) dx − 1) + λ₂(∫ x p(x) dx − μ) + λ₃(∫ (x − μ)² p(x) dx − σ²)

(normalization, given mean, given std). Using calculus of variations, setting δH̃ = 0:

δH̃ = ∫ [ −ln p(x) − 1 + λ₁ + λ₂ x + λ₃ (x − μ)² ] δp(x) dx = 0 ⟹ p(x) = exp( −1 + λ₁ + λ₂ x + λ₃ (x − μ)² )

Using the constraints, we find

p(x) = (1/√(2πσ²)) exp( −(x − μ)²/(2σ²) )

Evaluating the differential entropy of the Gaussian, we obtain

H[x] = ½(1 + ln(2πσ²))  (for a d-dimensional multivariate Gaussian: H[x] = ½ ln( (2πe)^d det Σ ))

Note H[x] < 0 for σ² < 1/(2πe).
Kullback-Leibler Divergence and Cross Entropy

Consider some unknown distribution p(x), and suppose that we have modeled it using an approximating distribution q(x).

If we use q(x) to construct a coding scheme for the purpose of transmitting values of x to a receiver, then the average additional information needed to specify x is:

KL(p||q) = −∫ p(x) ln q(x) dx − ( −∫ p(x) ln p(x) dx ) = −∫ p(x) ln( q(x)/p(x) ) dx

(we transmit with the code built from q(x), but average with the exact probability p(x)).

The cross entropy is defined as:

H(p, q) = −∫ p(x) ln q(x) dx
KL Divergence and Cross Entropy

The cross entropy H(p, q) = −∫ p(x) ln q(x) dx is the average number of bits needed to encode data coming from a source with distribution p when we use model q to define our codebook.

H(p) = H(p, p) is the expected number of bits using the true model.

The KL divergence

KL(p||q) = −∫ p(x) ln( q(x)/p(x) ) dx

is the average number of extra bits needed to encode the data, because we used distribution q to encode the data instead of the true distribution p.

The "extra number of bits" interpretation makes it clear that KL(p||q) ≥ 0, with KL(p||q) = 0 iff q = p.

The KL distance is not a symmetric quantity: KL(p||q) ≠ KL(q||p).
KL Divergence Between Two Gaussians

Consider p(x) = N(x | μ, σ²) and q(x) = N(x | m, s²):

KL(p||q) = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx

The first term can be computed using the moments and normalization condition of a Gaussian, and the second term follows from the differential entropy of a Gaussian:

−∫ N(x | μ, σ²) ln q(x) dx = ½ ln(2πs²) + (σ² + (μ − m)²)/(2s²),  ∫ p(x) ln p(x) dx = −½(1 + ln(2πσ²))

Finally we obtain:

KL(p||q) = ½ [ ln(s²/σ²) + (σ² + (μ − m)²)/s² − 1 ]
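The closed form can be sanity-checked against a numerical integration of the KL integrand on a fine grid (a sketch; the parameter values are arbitrary):

```python
import numpy as np

def kl_gauss(mu, sigma, m, s):
    """Closed-form KL( N(mu, sigma^2) || N(m, s^2) )."""
    return np.log(s / sigma) + (sigma**2 + (mu - m) ** 2) / (2 * s**2) - 0.5

# Numerical check via a Riemann sum (illustrative parameter values):
mu, sigma, m, s = 0.0, 1.0, 1.0, 2.0
x = np.linspace(-20.0, 20.0, 400_001)
p = np.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
q = np.exp(-((x - m) ** 2) / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
numeric = np.sum(p * np.log(p / q)) * (x[1] - x[0])

print(kl_gauss(mu, sigma, m, s), numeric)
```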
KL Divergence Between Two Gaussians

Consider now p(x) = N(x | μ, Σ) and q(x) = N(x | m, L), with x ∈ ℝᴰ:

KL(p||q) = ∫ p(x) ln p(x) dx − ∫ p(x) ln q(x) dx

−∫ p(x) ln p(x) dx = ½ ln|Σ| + (D/2)(1 + ln 2π)

−∫ p(x) ln N(x | m, L) dx = ½ ln|L| + (D/2) ln 2π + ½ E[(x − m)ᵀ L⁻¹ (x − m)]
 = ½ ln|L| + (D/2) ln 2π + ½ ( Tr(L⁻¹Σ) + (μ − m)ᵀ L⁻¹ (μ − m) )

using E[(x − m)ᵀ L⁻¹ (x − m)] = Tr(L⁻¹Σ) + (μ − m)ᵀ L⁻¹ (μ − m). Combining:

KL(p||q) = ½ [ ln(|L|/|Σ|) + Tr(L⁻¹Σ) + (μ − m)ᵀ L⁻¹ (μ − m) − D ]
Jensen's Inequality

For a convex function f, Jensen's inequality gives (it can be proven easily by induction):

f( Σ_{i=1}^M λᵢ xᵢ ) ≤ Σ_{i=1}^M λᵢ f(xᵢ),  λᵢ ≥ 0, Σᵢ λᵢ = 1

For M = 2 this is equivalent to our requirement for convexity, f″(x) > 0.

Assume f″(x) > 0 (strict convexity) for any x. Jensen's inequality is then shown as follows. By Taylor expansion about x₀,

f(x) = f(x₀) + f′(x₀)(x − x₀) + ½ f″(x*)(x − x₀)² ≥ f(x₀) + f′(x₀)(x − x₀)

Set x₀ = λa + (1 − λ)b. For x = a and x = b:

f(a) ≥ f(x₀) + f′(x₀)(a − x₀),  f(b) ≥ f(x₀) + f′(x₀)(b − x₀)

Multiplying by λ and (1 − λ) and adding (the f′ terms cancel):

λ f(a) + (1 − λ) f(b) ≥ f(x₀) = f( λa + (1 − λ)b )
Jensen's Inequality

Conversely, assume Jensen's inequality. We should show that f is convex, i.e. f′(b) ≥ f′(a) for b > a.

Set a = x − 2ε, b = x + 2ε > a, ε > 0. Using Jensen's inequality with λ = 0.5:

½ f(a) + ½ f(b) ≥ f( ½ a + ½ b ) = f(x)

Since x = b − 2ε = a + 2ε, this rearranges to

½ [ f(b) − f(b − 2ε) ] ≥ ½ [ f(a + 2ε) − f(a) ]

For ε small, we thus have:

( f(b) − f(b − 2ε) )/(2ε) ≥ ( f(a + 2ε) − f(a) )/(2ε),  or  f′(b) ≥ f′(a),  so f(·) is convex.
Jensen's Inequality

Setting λᵢ = p(xᵢ) in Jensen's inequality

f( Σᵢ λᵢ xᵢ ) ≤ Σᵢ λᵢ f(xᵢ),  λᵢ ≥ 0, Σᵢ λᵢ = 1

for a discrete random variable results in:

f(E[x]) ≤ E[f(x)]

We can generalize this result to continuous random variables:

f( ∫ x p(x) dx ) ≤ ∫ f(x) p(x) dx

We will use this shortly in the context of the KL distance.

We often use Jensen's inequality for concave functions (e.g. ln x). In that case, be sure to reverse the inequality!
Jensen's Inequality: Example

As another example of Jensen's inequality, consider the arithmetic and geometric means of a set of positive real variables:

x̄_A = (1/M) Σ_{i=1}^M xᵢ,  x̄_G = ( ∏_{i=1}^M xᵢ )^(1/M)

Using Jensen's inequality for f(x) = ln x (concave), i.e. ln E[x] ≥ E[ln x], we can show:

ln x̄_G = (1/M) ln ∏_{i=1}^M xᵢ = Σ_{i=1}^M (1/M) ln xᵢ ≤ ln( Σ_{i=1}^M (1/M) xᵢ ) = ln x̄_A  ⟹  x̄_G ≤ x̄_A
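The AM-GM consequence is easy to check numerically (a sketch; the positive test data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# Jensen with the concave f = ln: geometric mean <= arithmetic mean.
x = rng.uniform(0.1, 10.0, size=1_000)
arith = x.mean()
geom = np.exp(np.log(x).mean())   # exp of the mean log = geometric mean
print(geom, "<=", arith)
```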
The Kullback-Leibler Divergence

Using Jensen's inequality for continuous random variables,

f( ∫ x p(x) dx ) ≤ ∫ f(x) p(x) dx

with the convex function −ln, we can show:

KL(p||q) = −∫ p(x) ln( q(x)/p(x) ) dx ≥ −ln ∫ p(x) ( q(x)/p(x) ) dx = −ln ∫ q(x) dx = 0

Thus we derive the following information inequality:

KL(p||q) ≥ 0, with KL(p||q) = 0 if and only if p(x) = q(x)
Principle of Insufficient Reason

An important consequence of the information inequality is that the discrete distribution with the maximum entropy is the uniform distribution.

More precisely, H(X) ≤ log|𝒳|, where |𝒳| is the number of states of X, with equality iff p(x) is uniform. To see this, let u(x) = 1/|𝒳|. Then

0 ≤ KL(p||u) = Σₓ p(x) log( p(x)/u(x) ) = Σₓ p(x) log p(x) + log|𝒳| = −H(X) + log|𝒳|

This principle of insufficient reason argues in favor of using uniform distributions when there are no other reasons to favor one distribution over another.
The Kullback-Leibler Divergence

Data compression is in some way related to density estimation. The Kullback-Leibler divergence measures the distance between two distributions and is zero when the two densities are identical.

Suppose the data are generated from an unknown p(x) that we try to approximate with a parametric model q(x|θ). Suppose we have observed training points xₙ ~ p(x), n = 1, ..., N. Then:

KL(p||q) = −∫ p(x) ln( q(x)/p(x) ) dx ≈ (1/N) Σ_{n=1}^N [ −ln q(xₙ|θ) + ln p(xₙ) ]

(a sample-average approximation of the mean).
The KL Divergence Vs. MLE

KL(p||q) ≈ (1/N) Σ_{n=1}^N [ −ln q(xₙ|θ) + ln p(xₙ) ]

Note that only the first term is a function of θ. Thus minimizing KL(p||q) is equivalent to maximizing the likelihood function for θ under the distribution q.

So the MLE estimate minimizes the KL divergence to the empirical distribution

p_emp(x) = (1/N) Σ_{n=1}^N δ(x − xₙ)

arg min_q KL(p_emp||q) = arg min_q [ −∫ p_emp(x) ln q(x|θ) dx + const ] = arg max_q (1/N) Σ_{n=1}^N ln q(xₙ|θ)
Conditional Entropy

For a joint distribution, the conditional entropy is

H[y|x] = −∫∫ p(y, x) ln p(y|x) dy dx

This represents the average information needed to specify y if we already know the value of x.

Using p(y, x) = p(y|x)p(x) and substituting inside the log in H[x, y] = −∫∫ p(x, y) ln p(x, y) dy dx, the conditional entropy satisfies the relation

H[x, y] = H[y|x] + H[x]

where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential entropy of p(x).
Conditional Entropy for Discrete Variables

Consider the conditional entropy for discrete variables:

H[y|x] = −Σᵢ Σⱼ p(yᵢ, xⱼ) ln p(yᵢ|xⱼ)

To understand further the meaning of conditional entropy, let us consider the implications of H[y|x] = 0. We have:

H[y|x] = −Σᵢ Σⱼ p(yᵢ|xⱼ) ln p(yᵢ|xⱼ) p(xⱼ) = 0

From this we can conclude that, for each xⱼ such that p(xⱼ) > 0, the following must hold:

p(yᵢ|xⱼ) ln p(yᵢ|xⱼ) = 0

Since p ln p = 0 ⟹ p = 0 or p = 1, and since p(yᵢ|xⱼ) is normalized, there is only one yᵢ such that p(yᵢ|xⱼ) = 1, with all other p(·|xⱼ) = 0. Thus y is a function of x.
Mutual Information

If the variables are not independent, we can gain some idea of whether they are "close" to being independent by considering the KL divergence between the joint distribution and the product of the marginals:

Mutual information:  I[x, y] ≡ KL( p(x, y) || p(x)p(y) ) = −∫∫ p(x, y) ln( p(x)p(y)/p(x, y) ) dx dy ≥ 0

I[x, y] = 0 iff x, y are independent.

The mutual information is related to the conditional entropy through

I[x, y] = H[x] − H[x|y] = H[y] − H[y|x]

Indeed: I[x, y] = −∫∫ p(x, y) ln( p(y)/p(y|x) ) dx dy = H[y] − H[y|x].
Mutual Information

The mutual information represents the reduction in the uncertainty about x once we learn the value of y (and conversely):

I[x, y] = H[x] − H[x|y] = H[y] − H[y|x]

In a Bayesian setting, p(x) is the prior, p(x|y) the posterior, and I[x, y] represents the reduction in uncertainty in x once we observe y.
Note that $H[x,y] \leq H[x] + H[y]$. This is easy to prove noticing that

$$I[x,y] = H[y] - H[y|x] \geq 0 \quad (\text{KL divergence})$$

and (chain rule)

$$H[x,y] = H[y|x] + H[x]$$

from which

$$H[x,y] = -\iint p(x,y) \ln p(x,y)\,dy\,dx \leq -\iint p(x,y)\big[\ln p(x) + \ln p(y)\big]\,dy\,dx = H[x] + H[y].$$

The equality here holds only if $x, y$ are independent:

$$H[y|x] = H[y] \iff I[x,y] = 0 \iff p(x,y) = p(x)p(y).$$
Mutual Information for Correlated Gaussians

Consider two correlated Gaussians as follows:

$$\begin{pmatrix} x \\ y \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \begin{pmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{pmatrix} \right)$$

For each of these variables we can write:

$$H[X] = H[Y] = \frac{1}{2}\ln\left(2\pi e \sigma^2\right)$$

The joint entropy is also given similarly as

$$H[X,Y] = \frac{1}{2}\ln\left[(2\pi e)^2 \det\Sigma\right] = \frac{1}{2}\ln\left[(2\pi e)^2 \sigma^4 (1-\rho^2)\right]$$

Thus:

$$I[x,y] = H[X] + H[Y] - H[X,Y] = \frac{1}{2}\log\frac{1}{1-\rho^2}$$

Note:
$\rho = 0$ (independent $X, Y$): $I[x,y] = 0$
$\rho = \pm 1$ (linearly correlated $X, Y$): $I[x,y] \to \infty$
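The closed-form entropies above combine numerically as expected; a minimal check (Python/NumPy, with arbitrary example values for ρ and σ):

```python
import numpy as np

rho, sigma = 0.8, 1.5        # arbitrary example values
Sigma = sigma ** 2 * np.array([[1.0, rho],
                               [rho, 1.0]])

# entropy of a d-dimensional Gaussian: H = (1/2) ln[(2*pi*e)^d det(Sigma)]
H_x = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
H_y = H_x
H_xy = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(Sigma))

I = H_x + H_y - H_xy
print(I, -0.5 * np.log(1 - rho ** 2))   # both equal ~0.5108
```

Note that σ cancels: the mutual information depends only on the correlation ρ.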
Pointwise Mutual Information

A quantity which is closely related to the MI is the pointwise mutual information or PMI. For two events (not random variables) $x$ and $y$, this is defined as

$$PMI(x,y) := -\log\frac{p(x)p(y)}{p(x,y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$

This measures the discrepancy between these events occurring together compared to what would be expected by chance. Clearly the MI, $I[X,Y]$, of $X$ and $Y$ is just the expected value of the PMI.

This is the amount we learn from updating the prior $p(x)$ into the posterior $p(x|y)$, or equivalently, updating the prior $p(y)$ into the posterior $p(y|x)$.
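A small numerical illustration (Python/NumPy; the joint table is a hypothetical example): the PMI matrix holds one value per event pair, some of which can be negative, and its expectation under p(x,y) recovers the MI:

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])     # hypothetical joint table
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)

pmi = np.log(p_xy / (p_x * p_y))    # one PMI value per event pair (x_i, y_j)
I = np.sum(p_xy * pmi)              # MI is the expected PMI under p(x,y)

print(pmi)   # off-diagonal entries are negative: those pairs co-occur
print(I)     # less often than chance would predict
```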
Mutual Information

For continuous random variables, it is common to first discretize or quantize them into bins, and to compute how many values fall in each histogram bin (Scott 1979). The number of bins used, and the location of the bin boundaries, can have a significant effect on the results.

One can also estimate the MI directly, without performing density estimation (Learned-Miller, 2004). Another approach is to try many different bin sizes and locations, and to compute the maximum MI achieved.
Scott, D. (1979). On optimal and data-based histograms. Biometrika 66(3), 605–610.
Learned-Miller, E. (2004). Hyperspacings and the estimation of information theoretic quantities. Technical Report 04-104, U. Mass. Amherst Comp. Sci. Dept.
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334, 1518–1524.
Speed, T. (2011, December). A correlation for the 21st century. Science 334, 1502–1503.
*Use the MATLAB function miMixedDemo from Kevin Murphy's PMTK
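The binning approach described above can be sketched as follows (Python/NumPy; the 30-bin grid and sample size are arbitrary choices) for two correlated Gaussians, where the exact answer −½ ln(1 − ρ²) is available for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.8, 200_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

# quantize into a 30x30 histogram and plug the empirical joint into the MI formula
counts, _, _ = np.histogram2d(x, y, bins=30)
p_xy = counts / counts.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
mask = p_xy > 0
I_hat = np.sum(p_xy[mask] * np.log(p_xy[mask] / np.outer(p_x, p_y)[mask]))

print(I_hat, -0.5 * np.log(1 - rho ** 2))   # estimate vs exact value
```

The estimate is biased: quantization discards information, while the plug-in estimator inflates MI for sparse histograms, which is why the bin count and locations matter.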
Maximal Information Coefficient

This statistic, appropriately normalized, is known as the maximal information coefficient (MIC). We first define:

$$m(x,y) = \frac{\max_{G \in \mathcal{G}(x,y)} I\big(X(G); Y(G)\big)}{\log \min(x,y)}$$

Here $\mathcal{G}(x,y)$ is the set of 2d grids of size $x \times y$, and $X(G), Y(G)$ represents a discretization of the variables onto this grid. (The maximization over bin locations is performed efficiently using dynamic programming.)

Now define the MIC as

$$MIC := \max_{x,y:\, xy < B} m(x,y)$$
The ๐๐ผ๐ถ is defined as:
๐ต is some sample-size dependent bound on the number
of bins we can use and still reliably estimate the
distribution (Reshef et al. suggest ๐ต ~ ๐0.6).
๐๐ผ๐ถ lies in the range [0, 1], where 0 represents no
relationship between the variables, and 1 represents a
noise-free relationship of any form, not just linear.
Maximal Information Coefficient
( , )max ( ); ( )( , )
log min( , )
G x y X G Y Gm x y
x y
G
, :max ( , )
x y xy BMIC m x y
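A simplified sketch of m(x,y) and the MIC (Python/NumPy; function names are ours): unlike Reshef et al., who also optimize bin locations via dynamic programming, this version searches only equal-width grids, so it underestimates the true MIC; it nonetheless separates a noise-free nonlinear relationship from pure noise:

```python
import numpy as np

def grid_mi(x, y, nx, ny):
    # plug-in MI (in bits) of the variables quantized on an nx-by-ny grid
    counts, _, _ = np.histogram2d(x, y, bins=(nx, ny))
    p = counts / counts.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / np.outer(px, py)[mask]))

def mic(x, y, B=None):
    # MIC: max over grids with nx*ny <= B of I(X(G);Y(G)) / log2 min(nx, ny)
    B = B or int(len(x) ** 0.6)      # Reshef et al. suggest B ~ N^0.6
    best = 0.0
    for nx in range(2, B + 1):
        for ny in range(2, B // nx + 1):
            best = max(best, grid_mi(x, y, nx, ny) / np.log2(min(nx, ny)))
    return best

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 1000)
mic_fn = mic(x, x ** 2)                        # noise-free nonlinear relation
mic_noise = mic(x, rng.uniform(-1, 1, 1000))   # no relation
print(mic_fn, mic_noise)
```

The normalization by log2 min(nx, ny) keeps each grid's score in [0, 1], since MI cannot exceed the entropy of the coarser axis.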
Correlation Coefficient vs MIC

The data consists of 357 variables measuring a variety of social, economic, etc. indicators, collected by WHO.

On the left, we see the correlation coefficient (CC) plotted against the MIC for all 63,566 variable pairs. On the right, we see scatter plots for particular pairs of variables.
*See mutualInfoAllPairsMixed and miMixedDemo from PMTK3
Correlation Coefficient vs MIC

The point marked C has a low CC and a low MIC. From the corresponding scatter plot we see that there is no relationship between these two variables.

The points marked D and H have high CC (in absolute value) and high MIC, and we see from the scatter plots that they represent nearly linear relationships.
The points ๐ธ, ๐น, and ๐บ have low ๐ถ๐ถ but high ๐๐ผ๐ถ. They
correspond to non-linear (and sometimes, as in ๐ธ and ๐น,
one-to-many) relationships between the variables.
Statistics (such as ๐๐ผ๐ถ) based on mutual information can
be used to discover interesting relationships between
variables in a way that correlation coefficients cannot.
Correlation Coefficient Vs MIC