Basics of Information Theory
Mohammad Reza Karimi
Spring 2020
Entropy
Let p = (p_1, \dots, p_k) be a distribution over k objects. The entropy of p is defined as
H(p) = -\sum_{i=1}^{k} p_i \log p_i.
There are several intuitions about (Shannon’s) entropy, most importantly
the compression idea, but we describe another interesting one.
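As a quick illustration (my own sketch, not part of the slides), the definition in Python:

```python
import math

def entropy(p, base=2):
    """Shannon entropy H(p) = -sum_i p_i log p_i (terms with p_i = 0 contribute 0)."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

# The biased coin used later in the slides:
print(entropy([0.75, 0.25]))  # ~0.811 bits
# A fair coin attains the maximum entropy over two outcomes:
print(entropy([0.5, 0.5]))    # 1.0 bit
```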
Let X1, . . . , Xn be independent draws from the distribution p. Define
Yi = − log p(Xi). It is easy to check that E[Yi] = H(p). We now use
the law of large numbers:
\forall \varepsilon > 0: \quad \lim_{n \to \infty} P\left( \left| \frac{1}{n} \sum_{i=1}^{n} Y_i - H(p) \right| \le \varepsilon \right) = 1.
But we know that
\frac{1}{n} \sum_{i=1}^{n} Y_i = -\frac{1}{n} \log p(X_1, \dots, X_n),
resulting in
2^{-n(H+\varepsilon)} \le p(X_1, \dots, X_n) \le 2^{-n(H-\varepsilon)},
with high probability!
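A small simulation (my own sketch, not from the slides) makes this concentration visible: draw n samples from p and compare -(1/n) log2 p(X_1, ..., X_n) with H(p).

```python
import math, random

p = [0.75, 0.25]                    # the biased coin used in the slides
H = -sum(pi * math.log2(pi) for pi in p)

def empirical_rate(n, rng=random.Random(0)):
    """Return -(1/n) log2 p(X_1, ..., X_n) for one i.i.d. sample of length n."""
    draws = rng.choices(range(len(p)), weights=p, k=n)
    return -sum(math.log2(p[x]) for x in draws) / n

for n in (10, 100, 1000, 10000):
    print(n, round(empirical_rate(n), 3), "vs H(p) =", round(H, 3))
# The empirical rate approaches H(p) ~ 0.811 as n grows.
```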
Another way to state this is that
For large n, the distribution over sequences is essentially a uniform distribution over one group of sequences (Group A: those whose probability is about 2^{-nH}), and essentially zero on the rest (Group B).
We call the sequences in Group A the "typical sequences".
As an example, take a biased coin with distribution (3/4, 1/4). We toss this coin 1000 times. Now the "most probable outcome" of this experiment is the sequence of all heads. For this sequence we have
\frac{1}{n} \sum Y_i = -\frac{1}{1000} \log_2 (3/4)^{1000} = -\log_2 \frac{3}{4} \approx 0.415,
but the entropy of the distribution is
-\frac{3}{4} \log_2 \frac{3}{4} - \frac{1}{4} \log_2 \frac{1}{4} \approx 0.811.
These two numbers are far apart, and this makes the "all heads" sequence not a typical sequence.
One thing to keep in mind is that the size of the typical set is determined by the entropy: the higher the entropy, the larger the set of typical sequences. In the extreme case of an unbiased coin, every sequence would be typical. By a simple calculation, one can show that the size of the typical set is approximately equal to 2^{nH(p)}, so
Entropy is a measure of the volume of the typical set,
another intuition!
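For small n this can be checked by brute force. A sketch (my own, with an arbitrary choice of epsilon) that counts the typical binary sequences and compares the count with 2^{nH(p)}:

```python
import math
from itertools import product

p = [0.75, 0.25]
H = -sum(pi * math.log2(pi) for pi in p)
n, eps = 20, 0.1

typical = 0
for seq in product(range(2), repeat=n):          # enumerate all 2^n sequences
    rate = -sum(math.log2(p[x]) for x in seq) / n
    if abs(rate - H) <= eps:
        typical += 1

print("typical sequences:", typical)
print("2^(nH)           :", round(2 ** (n * H)))
# The two counts agree up to a factor of 2^{±n*eps}, i.e. both grow like 2^{nH(p)}.
```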
Kullback-Leibler Divergence
Let x be a string of length n over the alphabet {1, \dots, k}. For x we can compute the frequency of each alphabet letter and put all these frequencies in a vector (p_1, \dots, p_k). We call this vector the "type of x" and write it as P_x. For example, if x = 1314231 and the alphabet is {1, \dots, 4}, then P_x = (3/7, 1/7, 2/7, 1/7).
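A minimal sketch (mine, not from the slides) of computing the type of a string:

```python
from collections import Counter
from fractions import Fraction

def type_of(x, alphabet):
    """Empirical distribution (type) P_x of the string x over the given alphabet."""
    counts = Counter(x)
    n = len(x)
    return tuple(Fraction(counts[a], n) for a in alphabet)

print(type_of("1314231", "1234"))  # (3/7, 1/7, 2/7, 1/7), as in the example above
```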
Exercise 1. Show that the number of sequences having a certain type P is approximately equal to 2^{nH(P)}.
Now take an arbitrary distribution Q over the alphabet. The question
that we can ask now is what is the probability of observing a sequence
of type P under the assumption that the sequence is generated by Q.
Interestingly, we can compute this probability, and in the limit (n → ∞) it is approximately equal to. . .
2^{-n \, \mathrm{KL}(P \| Q)},
where
\mathrm{KL}(P \| Q) = \sum_{i=1}^{k} p_i \log \frac{p_i}{q_i}
is called the Kullback-Leibler divergence of P from Q.
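A sketch of the definition in code (my own), reused below for the coin example:

```python
import math

def kl(p, q, base=2):
    """KL(P || Q) = sum_i p_i log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

print(kl([0.75, 0.25], [0.5, 0.5]))  # ~0.189 bits, as computed on the next slide
```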
As an example, let us say that we toss a fair coin 1000 times, and ask what is the probability of getting a sequence in which 3/4 of the outcomes are heads and 1/4 of the outcomes are tails. We can see that
\mathrm{KL}\big((\tfrac{3}{4}, \tfrac{1}{4}) \,\|\, (\tfrac{1}{2}, \tfrac{1}{2})\big) = \frac{3}{4} \log \frac{3}{2} + \frac{1}{4} \log \frac{1}{2} \approx 0.189,
and by the result above we see that the probability is about 2^{-189} \approx 10^{-57}.
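A quick numerical check of these numbers (my own sketch, reusing the kl helper above):

```python
import math

def kl(p, q, base=2):
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

d = kl([0.75, 0.25], [0.5, 0.5])
print(round(d, 3))                       # 0.189 bits
print(round(-1000 * d * math.log10(2)))  # decimal exponent of 2^(-1000*d): about -57
```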
This means that the higher the divergence of the "candidate distribution" P from the "true distribution" Q, the lower the probability of observing an outcome of type P. That is why we can see the KL divergence as a measure of "distance" between distributions.
Properties of KL Divergence
• Always KL(P‖Q) ≥ 0.
• For product distributions, KL is additive. That is,
KL(P1 ⊗ P2‖Q1 ⊗Q2) = KL(P1‖Q1) + KL(P2‖Q2).
• The Pinsker Inequality:
d_TV(P, Q)^2 ≤ 2 KL(P‖Q).
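A small numerical sanity check of this inequality (my own sketch; d_TV is computed here as half the L1 distance, which equals the sup-over-events form used later):

```python
import math, random

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def tv(p, q):
    # total variation distance: sup_A |P(A) - Q(A)| = (1/2) * sum_i |p_i - q_i|
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

rng = random.Random(0)
for _ in range(1000):
    raw_p = [rng.random() for _ in range(4)]
    raw_q = [rng.random() for _ in range(4)]
    p = [x / sum(raw_p) for x in raw_p]
    q = [x / sum(raw_q) for x in raw_q]
    assert tv(p, q) ** 2 <= 2 * kl(p, q) + 1e-12
print("Pinsker-type inequality held in all random trials")
```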
Application: Testing a Coin
We are given a coin, but we don't know whether it is biased or not. We only know that the probability of heads is either 1/2 or 1/2 + ε.
Question 1. How many times should we toss this coin so that we can tell whether it is the biased coin or not?
Let us denote by X = (X_1, \dots, X_n) the results of the tosses. Suppose that we have a decision rule \psi with
\psi : X \mapsto \{B, U\}.
Then, the probability of \psi making a mistake is
P[\mathrm{error}] = \tfrac{1}{2} P_B[\psi(X) \ne B] + \tfrac{1}{2} P_U[\psi(X) \ne U].
The following theorem tells us what is the best we can achieve:
Theorem 1. We have
\inf_{\psi} \big\{ P_B[\psi(X) \ne B] + P_U[\psi(X) \ne U] \big\} = 1 - d_{TV}(P_B, P_U).
Proof. Let A \subset \Omega be the set \{X : \psi(X) = B\}. Note that a classifier is identified by its acceptance set A. We have
P_B[\psi(X) \ne B] + P_U[\psi(X) \ne U] = P_B(A^c) + P_U(A) = 1 - P_B(A) + P_U(A).
Taking the infimum gives
\inf_{\psi} \cdots = \inf_{A} \big\{ 1 - P_B(A) + P_U(A) \big\} = 1 - \sup_{A} \big( P_B(A) - P_U(A) \big) = 1 - d_{TV}(P_B, P_U).
Now, using the Pinsker inequality, we see that the sum of the two error probabilities is bounded below by
1 - \sqrt{2 \, \mathrm{KL}(P_B \| P_U)}.
It remains to compute the KL divergence. Note that both P_B and P_U are product probability distributions.
Denote by p_B the distribution of a single biased coin toss and p_U likewise. Then
\mathrm{KL}(P_B \| P_U) = n \cdot \mathrm{KL}(p_B \| p_U)
= n \cdot \Big( (\tfrac{1}{2} + \varepsilon) \log \frac{\tfrac{1}{2} + \varepsilon}{\tfrac{1}{2}} + (\tfrac{1}{2} - \varepsilon) \log \frac{\tfrac{1}{2} - \varepsilon}{\tfrac{1}{2}} \Big)
\approx n \cdot \big( (\tfrac{1}{2} + \varepsilon)(2\varepsilon) + (\tfrac{1}{2} - \varepsilon)(-2\varepsilon) \big) \qquad (\text{using } \log(1+x) \approx x)
= n \cdot 4\varepsilon^2 = O(n\varepsilon^2).
So for example, if we want to have P[error] ≤ 1/4, we should have
n = Ω(ε−2).
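A sketch (my own, not from the slides) that makes the n = Ω(ε⁻²) scaling concrete: compute the exact per-toss KL divergence and evaluate the resulting lower bound on the error sum as a function of n.

```python
import math

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b), natural log."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

eps = 0.01
d = kl_bernoulli(0.5 + eps, 0.5)   # ~ 2*eps^2 nats for small eps
for n in (100, 1000, 10000, 100000):
    # lower bound on the sum of the two error probabilities from the slides:
    # 1 - sqrt(2 * KL(P_B || P_U)), with KL(P_B || P_U) = n * d
    bound = 1 - math.sqrt(2 * n * d)
    print(n, "lower bound:", round(max(bound, 0.0), 3))
# The bound only drops to 0 once n is on the order of 1/eps^2.
```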
Information theory in loss functions
Mohammad Reza Karimi, Anastasia Makarova
Spring 2020
Part 1: Basics of Information Theory
Learned Concepts
1. Shannon Entropy
2. Typical Sequence
3. Kullback-Leibler Divergence
4. Total Variation Distance
After break
1. Cross-entropy Loss (CE)
2. Relation to MLE and KL
3. Demo for intuition
4. Handling imbalance with CE
Maximum Likelihood Estimation (Recap)
Data: D = (X, Y) = \{(x_i, y_i)\}_{i=1}^n, drawn i.i.d.; w denotes the model parameters and P(y \mid x, w) the predicted likelihood.
\hat{w}_{MLE} = \arg\max_w P(Y, X \mid w) = \arg\max_w P(Y \mid X, w)
= \arg\max_w \prod_{i=1}^n P(y_i \mid x_i, w)
= \arg\min_w -\sum_{i=1}^n \log P(y_i \mid x_i, w).
MLE and Cross-Entropy (discrete case)
• Definition (cross-entropy): H(P, Q) = E_P[-\log Q] = -\sum_i p_i \log q_i.
• P = p(y \mid x) is the true distribution; for labelled data it is the one-hot encoding p(y \mid x_i) = 1 if y = y_i and 0 otherwise. \hat{p}_w = p(y \mid x, w) is the predictive distribution.
• Cross-entropy for a point x_i:
  H_i(p, \hat{p}_w) = -\sum_y p(y \mid x_i) \log \hat{p}_w(y \mid x_i) = -\log \hat{p}_w(y_i \mid x_i),
  and for the dataset D the loss is the average \frac{1}{n} \sum_{i=1}^n H_i(p, \hat{p}_w), i.e., exactly the negative log-likelihood from the previous slide.
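A minimal sketch (mine) of the cross-entropy for one-hot targets, showing that it reduces to the negative log-probability of the true class:

```python
import math

def cross_entropy(p_true, p_pred):
    """H(p, q) = -sum_y p(y) log q(y); with one-hot p this is -log q(true class)."""
    return -sum(p * math.log(q) for p, q in zip(p_true, p_pred) if p > 0)

# 3 classes, true class is index 1 (one-hot target):
target = [0.0, 1.0, 0.0]
pred = [0.2, 0.7, 0.1]
print(cross_entropy(target, pred))   # equals -log(0.7)
print(-math.log(pred[1]))            # same value
```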
KL Divergence and Cross-Entropy
• KL(P \| \hat{P}_w) = E_P[\log(P / \hat{P}_w)], so
  H(P, \hat{P}_w) = H(P) + KL(P \| \hat{P}_w).
  Since H(P) does not depend on w, minimizing the cross-entropy over w is exactly the same as minimizing KL(P \| \hat{P}_w).
• Is H(P, \hat{P}_w) the same as the reversed H(\hat{P}_w, P) (reversed KL) for classification?
  P(y \mid x_i) is a one-hot encoding where all mass is on y = y_i, so H(P, \hat{P}_w) is computed only at the point where P(y = y_i \mid x_i) = 1, and \hat{P}_w is ignored at the other points y \in Y. Alternatively, the reversed H(\hat{P}_w, P) does not penalize points where P(y \mid x) > 0 but \hat{P}_w(y \mid x) \to 0.
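A quick numerical check (my own) of the identity H(p, q) = H(p) + KL(p‖q):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(round(cross_entropy(p, q), 6))
print(round(entropy(p) + kl(p, q), 6))   # same number: H(p, q) = H(p) + KL(p||q)
```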
Eduapp question
[Slide: a hand-drawn example distribution, with the cross-entropy CE(x) = -\sum_y p(y \mid x) \log \hat{p}(y \mid x, w) to be evaluated on it.]
Eduapp
How to avoid that surprise?
CE in the wild (NN case)
[Slide: sketch of a network whose outputs \hat{p}(y_i \mid x_i, w) are produced by a softmax layer.]
Because the softmax output assigns \hat{p}(y \mid x, w) > 0 to every class, the log in the cross-entropy never blows up: softmax functions as hypothesis models are conservative enough.
Why CE? 3-class classification (cat / dog / rabbit)
[Slide: a table comparing Model 1 and Model 2 outputs \hat{p} against one-hot targets, with rows such as (0.3, 0.4, 0.3) and (0.2, 0.1, 0.7), together with the per-point CE values and the average L = \frac{1}{n}\sum_i CE_i.]
Both models have the same classification error (1/3 of the points misclassified), but their cross-entropies differ: the model that puts more probability on the correct classes gets the lower CE.
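To illustrate the point, a sketch with hypothetical predictions (the exact numbers on the slide are not recoverable; these are stand-ins): two models with the same classification error but different cross-entropy.

```python
import math

def avg_cross_entropy(targets, preds):
    """Average of -log(probability assigned to the true class)."""
    return sum(-math.log(p[t]) for t, p in zip(targets, preds)) / len(targets)

def error_rate(targets, preds):
    return sum(max(range(len(p)), key=p.__getitem__) != t
               for t, p in zip(targets, preds)) / len(targets)

targets = [0, 1, 2]                                              # true classes of 3 points
model1 = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.4, 0.3, 0.3]]     # hypothetical outputs
model2 = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.4, 0.3, 0.3]]   # hypothetical outputs

for name, preds in (("model 1", model1), ("model 2", model2)):
    print(name, "error:", round(error_rate(targets, preds), 3),
          "CE:", round(avg_cross_entropy(targets, preds), 3))
# Both misclassify only the last point (error 1/3), but model 2's CE is much lower.
```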
Why CE?
CE is more sensitive to the predicted probabilities \hat{p} than the misclassification error, which only looks at which class receives the largest probability.
Why not MSE? Homework: compare the gradient step (with respect to w) for the CE loss with the gradient step for the MSE loss.
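Not the homework solution, but a numerical hint (my own sketch, for a single logistic unit): the gradients of CE and MSE with respect to the pre-activation behave very differently when the prediction is confidently wrong.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

y = 1.0  # true label
for z in (-6.0, -2.0, 0.0, 2.0):         # z = w^T x, the pre-activation
    p = sigmoid(z)
    grad_ce = p - y                       # d/dz of CE loss -[y log p + (1-y) log(1-p)]
    grad_mse = (p - y) * p * (1 - p)      # d/dz of MSE loss (p - y)^2 / 2
    print(f"z={z:+.1f}  p={p:.3f}  dCE/dz={grad_ce:+.3f}  dMSE/dz={grad_mse:+.4f}")
# For a confidently wrong prediction (z = -6, p ~ 0) the MSE gradient nearly vanishes,
# while the CE gradient stays large -- one reason CE is preferred for classification.
```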
Visualization
https://www.desmos.com/calculator/zytm2sf56e
Here p is continuous, though in classification it is discrete.
Try different configurations: (i) p is unimodal, (ii) p has two modes, (iii) H(p, q) vs. H(q, p),
to see how CE and KL change.
Weighted CE Loss
What if D = \{(x_i, y_i)\}_{i=1}^n is imbalanced? E.g., in binary (logistic) classification one class may contain far more examples than the other, |D_{+1}| \gg |D_{-1}|. The gradient \nabla_w L(w) = \frac{1}{n}\sum_i \nabla_w \ell_i(w) is then dominated by the majority class. Use a weighted CE loss to deal with the imbalance.
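A sketch (mine) of a class-weighted binary cross-entropy, with weights chosen inversely proportional to the class frequencies:

```python
import math

def weighted_bce(labels, probs, w_pos, w_neg):
    """Class-weighted binary cross-entropy; labels are 0/1, probs are P(y=1|x,w)."""
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0, 0, 0, 0, 0, 0, 0]                     # heavily imbalanced toy data
probs  = [0.6, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.1]
n_pos, n_neg = sum(labels), len(labels) - sum(labels)
w_pos = len(labels) / (2 * n_pos)                      # inverse-frequency weights
w_neg = len(labels) / (2 * n_neg)

print("unweighted:", round(weighted_bce(labels, probs, 1.0, 1.0), 3))
print("weighted:  ", round(weighted_bce(labels, probs, w_pos, w_neg), 3))
```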
Learned Concepts
1. Basics of Information Theory
2. KL divergence and Cross-Entropy
3. Imbalance in data
Eduapp: the question