
Page 1: Information Theory Metrics

Information Theory Metrics

Giancarlo Schrementi

Page 2: Expected Value

Expected Value

• Example: Die Roll

(1/6)*1+(1/6)*2+(1/6)*3+(1/6)*4+(1/6)*5+(1/6)*6 = 3.5

The equation gives you the value that you can expect an event to take.

EV = \sum_{i \in X} p(i)\, v(i)

Page 3: A Template Equation

A Template Equation

• Expected Value forms a template for many of the equations in Information Theory

• Notice it has three parts: a summation over all possible events, the probability of each event, and the value of that event.

• It can be thought of as a weighted average

EV = \sum_{i=1}^{N} p(i)\, v(i)
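To make the template concrete, here is a minimal Python sketch of the weighted-average form; the function name and the list representation of the distribution are this example's choices, not anything from the slides.

```python
def expected_value(probs, values):
    """Weighted average: sum of p(i) * v(i) over all events i."""
    return sum(p * v for p, v in zip(probs, values))

# The die roll from the previous slide: six equally likely faces.
print(expected_value([1/6] * 6, [1, 2, 3, 4, 5, 6]))  # ~3.5
```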

Page 4: Entropy as an EV

Entropy as an EV

• Notice it has the same three parts.

• The value here is the negative log base 2 of the probability, which can be seen as the amount of information that an event transmits.

• Since it is a logarithm of the probability, less likely occurrences are more informative than highly likely ones.

• So Entropy can be thought of as the expected amount of information that we will receive.

H(X) = -\sum_{i \in X} p(i)\, \lg p(i)
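As an illustration, a minimal Python sketch of this sum; the example distributions (a fair coin and a fair die) are assumptions of this example.

```python
from math import log2

def entropy(probs):
    """H(X) = -sum of p(i) * lg(p(i)); zero-probability events are skipped."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit for a fair coin
print(entropy([1/6] * 6))    # ~2.585 bits for a fair die
```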

Page 5: Mutual Information

Mutual Information

• Our three friends show up again

• The value this time is the joint probability divided by the product of the two marginal probabilities.

• This equation is definitely an expected value equation, but what does that value tell us?

I(X;Y) = \sum_{i,j \in X,Y} p(i,j)\, \lg\!\left(\frac{p(i,j)}{p(i)\,p(j)}\right)

Page 6: MI's Value

MI’s Value

• The bottom of the fraction tells us the probability the two events would have of occurring together if they were independent.

• The top is the actual probability of the two events occurring together, which will be different if they are dependent in some way.

• Dividing the two tells us how many times more likely one is than the other.

• Thus the value tells us how much more likely the joint event is than it would be if the two were independent, giving us a measure of dependence.

• The lg scales this ratio into bits: the information one event tells us about the other.

\lg\!\left(\frac{p(i,j)}{p(i)\,p(j)}\right)

Page 7: Mutual Information as an EV

Mutual Information as an EV

• Mutual Information can be looked at as the expected number of bits that one event tells us about another event in the distribution. This can be thought of as how many fewer bits are needed to encode the next event because of your knowledge of the prior event.

• An MI of zero means the two variables are independent: knowing one event tells us nothing about the other.

• It is also symmetric in X and Y and is always non-negative.

I(X;Y) = \sum_{i,j \in X,Y} p(i,j)\, \lg\!\left(\frac{p(i,j)}{p(i)\,p(j)}\right)
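A rough Python sketch of this expected-value reading of MI, computed from a joint distribution table; the dictionary representation and the two coin examples are assumptions of this illustration, not part of the slides.

```python
from math import log2
from collections import defaultdict

def mutual_information(joint):
    """I(X;Y) from a dict mapping (i, j) -> p(i, j)."""
    px, py = defaultdict(float), defaultdict(float)
    for (i, j), p in joint.items():
        px[i] += p
        py[j] += p
    return sum(p * log2(p / (px[i] * py[j]))
               for (i, j), p in joint.items() if p > 0)

# Two perfectly correlated fair coins: one tells us a full bit about the other.
print(mutual_information({("H", "H"): 0.5, ("T", "T"): 0.5}))              # 1.0

# Two independent fair coins: knowing one tells us nothing, so MI is 0.
print(mutual_information({(a, b): 0.25 for a in "HT" for b in "HT"}))      # 0.0
```

The second call illustrates the zero case from the bullets above, and swapping the roles of i and j leaves the sum unchanged, which is the symmetry just noted.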

Page 8: Kullback-Leibler Divergence

Kullback-Leibler Divergence

• Once again it looks like an EV equation.

• P and Q are two distributions, and P(x) is the probability of event x in P.

• The division in the value part tells us how much more or less likely an event is in distribution P than it would be in Q.

• The lg scales this to be in bits, but what does that tell us?

KL(p,q) = \sum_{x} p(x)\, \lg\!\left(\frac{p(x)}{q(x)}\right)
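A minimal sketch of this sum in Python; the dictionary representation and the loaded-die numbers are invented for illustration, and the code assumes q(x) > 0 wherever p(x) > 0.

```python
from math import log2

def kl_divergence(p, q):
    """KL(p, q) = sum of p(x) * lg(p(x)/q(x)) over events x."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

fair   = {f: 1/6 for f in range(1, 7)}
loaded = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

print(kl_divergence(loaded, fair))  # > 0: the loaded die looks different from the fair one
print(kl_divergence(fair, fair))    # 0.0: identical distributions diverge by nothing
```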

Page 9: What does KL tell us?

What does KL tell us?

• The value part tells us what the informational content difference is between event x in distribution P and event x in distribution Q.

• So KL tells us the expected difference in informational content of events in the two distributions.

• This can be thought of as the difference in the number of bits needed to encode event x in the two distributions.

KL(p,q) = \sum_{x} p(x)\, \lg\!\left(\frac{p(x)}{q(x)}\right)

Page 10: Other KL Observations

Other KL Observations

• As it is written above, it is not symmetric and does not obey the triangle inequality, but it is non-negative.

• If p and q are the same, the equation results in zero.

• Kullback and Leibler actually define it as the sum of the equation above plus its counterpart with p and q switched; it then becomes symmetric.

KL(p,q) = \sum_{x} p(x)\, \lg\!\left(\frac{p(x)}{q(x)}\right)
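The symmetric form mentioned in the last bullet is just the two directions added together; a small self-contained sketch, reusing the invented die distributions from the earlier example:

```python
from math import log2

def kl_divergence(p, q):
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

def symmetric_kl(p, q):
    """Kullback and Leibler's symmetrized form: KL(p, q) + KL(q, p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

fair   = {f: 1/6 for f in range(1, 7)}
loaded = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}

print(symmetric_kl(fair, loaded) == symmetric_kl(loaded, fair))  # True: order no longer matters
```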

Page 11: Comparing MI and KL

Comparing MI and KL

• The two equations are noticeably similar.

• In fact, you can express MI as the KL divergence between p(i,j) and p(i)p(j).

• This tells us that MI is computing the divergence between the true joint distribution and the distribution the events would follow if they were completely independent; essentially it is computing KL with respect to a reference point.

• KL divergence gives us a measure of how different two distributions are; MI gives us a measure of how different a distribution is from what it would be if its events were independent.

KL(p,q) = \sum_{x} p(x)\, \lg\!\left(\frac{p(x)}{q(x)}\right)

I(X;Y) = \sum_{i,j \in X,Y} p(i,j)\, \lg\!\left(\frac{p(i,j)}{p(i)\,p(j)}\right)
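A quick numerical check of this identity, as a sketch; the joint table is made up for the example.

```python
from math import log2
from collections import defaultdict

def kl_divergence(p, q):
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

# A made-up joint distribution over two variables, with some dependence built in.
joint = {("a", "c"): 0.4, ("a", "d"): 0.1, ("b", "c"): 0.2, ("b", "d"): 0.3}

# Marginals, and the distribution the pair would follow if the variables were independent.
px, py = defaultdict(float), defaultdict(float)
for (i, j), p in joint.items():
    px[i] += p
    py[j] += p
independent = {(i, j): px[i] * py[j] for (i, j) in joint}

mi = sum(p * log2(p / independent[ij]) for ij, p in joint.items() if p > 0)
print(mi, kl_divergence(joint, independent))  # the same number, computed two ways
```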

Page 12: Sample Application: Chunk Parser

Sample Application: Chunk Parser

• Chunk parsers attempt to divide a sentence into its logical/grammatical units. In this case it is done recursively: split the sentence into two chunks, then split each of those into two chunks, and so on (a sketch of this splitting follows below).

• MI and KL are both measures that can tell us how statistically related two words are to each other.

• If two words are highly related to each other, there is not likely to be a chunk boundary between them; likewise, if they are largely unrelated, the gap between them is a good location to suppose a chunk boundary.
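A hedged sketch of that recursive splitting idea: the `relatedness` argument stands in for whatever word-pair score is used (for example the point-wise MI or KL defined on the next slide); nothing here is the actual parser from the cited papers.

```python
def chunk(words, relatedness):
    """Recursively split a word sequence in two at its weakest adjacent link."""
    if len(words) < 2:
        return words
    # Score every adjacent pair and cut where the two neighbours are least related.
    scores = [relatedness(words[i], words[i + 1]) for i in range(len(words) - 1)]
    cut = scores.index(min(scores)) + 1
    return [chunk(words[:cut], relatedness), chunk(words[cut:], relatedness)]
```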

Page 13: MI and KL in this Task

MI and KL in this Task

• Here we are only concerned with the values at a single point, not the expected value over all points, so the summation and the probability weight are missing, giving us a 'point-wise' calculation.

• These calculations are across a bigram <xy> in a text, where x and y are two words.

• p(<xy>) is the probability of the bigram occurring.

• Note in the KL equation what P and Q are: Q is p(y), which can be thought of as the prior, and P is p(y|x), which can be thought of as the posterior. This gives you the information difference that knowledge of x brings to the probability of the right-hand word y.

MI(\langle xy\rangle) = \lg\!\left(\frac{p(\langle xy\rangle)}{p(x)\,p(y)}\right)

KL(\langle xy\rangle) = \lg\!\left(\frac{p(y \mid x)}{p(y)}\right)
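A sketch of both point-wise scores estimated from raw bigram counts; the toy corpus and the unsmoothed relative-frequency estimates are assumptions of this example.

```python
from math import log2
from collections import Counter

words = "the dog saw the cat and the dog ran".split()
bigrams = list(zip(words, words[1:]))
unigram, bigram = Counter(words), Counter(bigrams)
n_uni, n_bi = len(words), len(bigrams)

def pointwise_mi(x, y):
    """lg( p(<xy>) / (p(x) p(y)) )"""
    p_xy = bigram[(x, y)] / n_bi
    return log2(p_xy / ((unigram[x] / n_uni) * (unigram[y] / n_uni)))

def pointwise_kl(x, y):
    """lg( p(y|x) / p(y) ), with p(y|x) estimated as count(<xy>) / count(x)."""
    p_y_given_x = bigram[(x, y)] / unigram[x]
    return log2(p_y_given_x / (unigram[y] / n_uni))

print(pointwise_mi("the", "dog"), pointwise_kl("the", "dog"))
```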

Page 14: Results of the Comparison

Results of the Comparison

• First, note that these equations can result in negative numbers if the top probability is lower than the bottom one. Also, MI is no longer symmetrical because p(<xy>) is different from p(<yx>).

• This asymmetry is useful for language analysis because most languages are direction dependent and so we want our metrics to be sensitive to that.

• They both provide an accurate dependence measure, but the KL measure provides more exaggerated values the further it gets from zero. This is because p(y|x) can get much higher than p(<xy>) will ever be.

• This makes KL more useful in this context: relationships stand out more and are thus easier to distinguish from noise.

MI(\langle xy\rangle) = \lg\!\left(\frac{p(\langle xy\rangle)}{p(x)\,p(y)}\right)

KL(\langle xy\rangle) = \lg\!\left(\frac{p(y \mid x)}{p(y)}\right)

Page 15: One Final Variant

One Final Variant

• These two equations are variants of mutual information designed to tell you how much a word tells you about the words that could occur to its left or to its right.

• Notice that it is an expected value over all the bigrams in which the word occurs on the right or on the left, weighted by the conditional probability of that bigram given the word in question.

• This can be used to give you an estimate of the handedness of a word or how much the word restricts what can occur to the left or the right of it.

\sum_{y \in \{Yx\}} p(\langle yx\rangle \mid x)\, \lg\!\left(\frac{p(\langle yx\rangle)}{p(x)\,p(y)}\right)

\sum_{y \in \{xY\}} p(\langle xy\rangle \mid x)\, \lg\!\left(\frac{p(\langle xy\rangle)}{p(x)\,p(y)}\right)
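A rough sketch of the right-handed variant (how much a word constrains the word to its right), built on the same kind of bigram counts as the previous sketch; the estimation details are this example's assumptions.

```python
from math import log2
from collections import Counter

words = "the dog saw the cat and the dog ran".split()
bigrams = list(zip(words, words[1:]))
unigram, bigram = Counter(words), Counter(bigrams)
n_uni, n_bi = len(words), len(bigrams)

def right_handedness(x):
    """Expected point-wise MI over bigrams <xy>, weighted by p(<xy> | x)."""
    total = 0.0
    for (a, y), c in bigram.items():
        if a != x:
            continue
        p_bigram_given_x = c / unigram[x]   # p(<xy> | x)
        p_xy = c / n_bi                     # p(<xy>)
        pmi = log2(p_xy / ((unigram[x] / n_uni) * (unigram[y] / n_uni)))
        total += p_bigram_given_x * pmi
    return total

print(right_handedness("the"))  # expected bits "the" gives about its right-hand neighbour
```

The left-handed variant is the mirror image: iterate over the bigrams that end in x instead.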

Page 16: References

References

• S. Kullback and R. A. Leibler. "On Information and Sufficiency." Annals of Mathematical Statistics 22(1):79-86, March 1951.

• Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, Giancarlo Schrementi. "On Statistical Parameter Setting." Proceedings of the First Workshop on Psycho-computational Models of Human Language Acquisition (COLING-2004). Geneva, Switzerland. August 28-29, 2004.

• Damir Ćavar, Paul Rodrigues, Giancarlo Schrementi. "Syntactic Parsing Using Mutual Information and Relative Entropy." Proceedings of the Midwest Computational Linguistics Colloquium (MCLC). Bloomington, IN, USA. July 2004.

• http://en.wikipedia.org/wiki/Mutual_information

• http://en.wikipedia.org/wiki/Kullback-Leibler_divergence