Natural Language Processing (5)

Zhao Hai 赵海

Department of Computer Science and Engineering

Shanghai Jiao Tong University

zhaohai@cs.sjtu.edu.cn

Outline

Lexicons and Lexical Analysis

Collocation

Hypothesis Testing

T Test

Mutual Information

Hypothesis Testing (1/5)

One difficulty that we have glossed over so far is that high frequency and low variance can be accidental.

For example, if the two constituent words of a frequent bigram like new companies are frequently occurring words (as new and companies are), then we expect the two words to co-occur a lot just by chance, even if they do not form a collocation.

What we really want to know is whether two words occur together more often than chance.

Assessing whether or not something is a chance event is one of the classical problems of statistics. It is usually couched in terms of hypothesis testing.

We formulate a null hypothesis H0 that there is no association between the words beyond chance occurrences, compute the probability p that the event would occur if H0 were true, and then reject H0 if p is too low (typically if beneath a significance level of p < 0.05, 0.01, 0.005, or 0.001) and retain H0 as possible otherwise.

Hypothesis Testing (4/5)

We need to formulate a null hypothesis which states what should be true if two words do not form a collocation.

For such a free combination of two words we will assume that each of the words w1 and w2 is generated completely independently of the other, and so their chance of coming together is simply given by:

$$P(w_1 w_2) = P(w_1)\,P(w_2)$$

The model implies that the probability of co-occurrence is just the product of the probabilities of the individual words.

This is a rather simplistic model, and not empirically accurate, but for now we adopt independence as our null hypothesis.

The T Test (1)

We need a statistical test that tells us how probable or improbable it is that a certain constellation will occur.

A test that has been widely used for collocation discovery is the t test.

The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean μ.

The T Test (2)

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$$

where x̄ is the sample mean, s² is the sample variance, N is the sample size, and μ is the mean of the distribution.

If the t statistic is large enough we can reject the null hypothesis.

The t test looks at the difference between the observed (x̄) and expected (μ) means, scaled by the variance of the data.

It tells us how likely one is to get a sample of that mean and variance (or a more extreme mean and variance) assuming that the sample is drawn from a normal distribution with mean μ.

For instance, our null hypothesis is that the mean height of a population of men is 158cm.

We are given a sample of 200 men with x̄ = 169 and s² = 2600 and want to know whether this sample is from the general population (the null hypothesis) or whether it is from a different population of taller men.

This gives us the following t according to the above formula:

$$t = \frac{169 - 158}{\sqrt{2600 / 200}} \approx 3.05$$

We can also find out exactly how large t has to be by looking up the table of the t distribution.

If we look up the value of t that corresponds to a significance level of α = 0.005, we will find 2.576. Since the t we got is larger than 2.576, we can reject the null hypothesis with 99.5% confidence.

So we can say that the sample is not drawn from a population with mean 158cm, and our probability of error is less than 0.5%.
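The height example can be checked with a few lines of Python. This is a minimal sketch: the function name `t_statistic` and the constant `CRITICAL_VALUE` are illustrative names, and the critical value 2.576 is simply the one quoted in the text rather than something computed here.

```python
import math

def t_statistic(sample_mean, sample_var, n, mu):
    """One-sample t statistic: (x_bar - mu) / sqrt(s^2 / N)."""
    return (sample_mean - mu) / math.sqrt(sample_var / n)

# Values from the height example: x_bar = 169, s^2 = 2600, N = 200, mu = 158.
t = t_statistic(sample_mean=169.0, sample_var=2600.0, n=200, mu=158.0)

CRITICAL_VALUE = 2.576  # value quoted in the text for alpha = 0.005
reject_null = t > CRITICAL_VALUE

print(f"t = {t:.2f}, reject H0: {reject_null}")
```

The comparison against a fixed critical value mirrors the table lookup described above; a statistics library could compute the p-value directly instead.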

To see how to use the t test for finding collocations, let us compute the t value for new companies.

We think of the text corpus as a long sequence of N bigrams, and the samples are then indicator random variables that take on the value 1 when the bigram of interest occurs, and are 0 otherwise.

Using maximum likelihood estimates, we can compute the probabilities of new and companies as follows. In the corpus, new occurs 15,828 times, companies 4,675 times, and there are 14,307,668 tokens overall.

$$P(\text{new}) = \frac{15828}{14307668} \qquad P(\text{companies}) = \frac{4675}{14307668}$$

The T Test (9)

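The null hypothesis is that occurrences of new and companies are independent:

$$H_0: P(\text{new companies}) = P(\text{new})\,P(\text{companies}) = \frac{15828}{14307668} \times \frac{4675}{14307668} \approx 3.615 \times 10^{-7}$$

If the null hypothesis is true, then the process of randomly generating bigrams of words and assigning 1 to the outcome new companies and 0 to any other outcome can be treated as a Bernoulli trial with p = 3.615 × 10⁻⁷.

The mean for this distribution is μ = 3.615 × 10⁻⁷, and the variance is σ² = p(1 − p), which is approximately p. The approximation holds since for most bigrams p is small.

It turns out that there are actually 8 occurrences of new companies among the 14,307,668 bigrams in our corpus. So, for the sample, we have that the sample mean is:

$$\bar{x} = \frac{8}{14307668} \approx 5.591 \times 10^{-7}$$

Now we have everything we need to apply the t test:

$$t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} \approx \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14307668}} \approx 0.999932$$

This t value of 0.999932 is not larger than 2.576, the critical value for α = 0.005. So we cannot reject the null hypothesis that new and companies occur independently and do not form a collocation.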
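The same calculation can be sketched in Python for any bigram. The function name `bigram_t_score` is an illustrative choice; the counts are the ones given in the text, and the sample variance is taken as the Bernoulli variance x̄(1 − x̄), which is approximately x̄ for rare bigrams.

```python
import math

def bigram_t_score(count_w1, count_w2, count_bigram, n):
    """t score of a bigram under the independence null hypothesis."""
    mu = (count_w1 / n) * (count_w2 / n)  # expected P(w1 w2) under H0
    x_bar = count_bigram / n              # observed (MLE) P(w1 w2)
    s2 = x_bar * (1.0 - x_bar)            # Bernoulli variance, approximately x_bar
    return (x_bar - mu) / math.sqrt(s2 / n)

# new: 15,828 occurrences; companies: 4,675; new companies: 8; N = 14,307,668
t_new_companies = bigram_t_score(15828, 4675, 8, 14307668)

print(f"t = {t_new_companies:.4f}")
```

Because the function only needs four counts, it can be applied to every bigram in a frequency table and the results sorted to obtain a ranking.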

The above table shows t values for ten bigrams that occur exactly 20 times in the corpus.

For the top five bigrams, we can reject the null hypothesis that the component words occur independently for α = 0.005, so these are good candidates for collocations.

The bottom five bigrams fail the test for significance, so we will not regard them as good candidates for collocations.

Note that a frequency-based method would not be able to rank the ten bigrams since they occur with exactly the same frequency.

We can see that the t test takes into account the number of co-occurrences of the bigram relative to the frequencies of the component words.

If a high proportion of the occurrences of both words (Ayatollah Ruhollah, videocassette recorder) or at least a very high proportion of the occurrences of one of the words (unsalted) occurs in the bigram, then its t value is high.

This criterion makes intuitive sense.

The analysis in the table includes some stop words (Note: A stop word is a word that is common and frequently used, such as the, a, for, of, etc.) – without stop words, it is actually hard to find examples that fail significance. It turns out that most bigrams attested in a corpus occur significantly more often than chance.

For 824 out of the 831 bigrams that occurred 20 times in our corpus the null hypothesis of independence can be rejected.

But we would only classify a fraction as true collocations. The reason for this surprisingly high proportion of possibly dependent bigrams is that language itself, if compared with a random word generator, is very regular, so that few completely unpredictable events happen.

The t test and other statistical tests are most useful as a method for ranking collocations. The level of significance itself is less useful.

Entropy (1/5)

The entropy (or self-information) is the average uncertainty of a single random variable:

$$H(p) = H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$

Note: Let p(x) be the probability mass function of a random variable X, over a discrete set of symbols (or alphabet) $\mathcal{X}$:

$$p(x) = P(X = x), \quad x \in \mathcal{X}$$

Entropy (2/5)

Entropy measures the amount of information in a random variable. It is normally measured in bits (hence the log to the base 2), but using any other base yields only a linear scaling of results. For example, suppose you are reporting the result of rolling an 8-sided die. Then the entropy is:

$$H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = -\log_2 \frac{1}{8} = \log_2 8 = 3 \text{ bits}$$
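The die example can be reproduced with a short Python sketch. The function name `entropy` and the second, non-uniform distribution are illustrative assumptions added here for comparison.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution (list of probabilities)."""
    # Terms with p = 0 contribute nothing, by the convention 0 * log 0 = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair_die = [1 / 8] * 8
h_die = entropy(fair_die)                        # uniform over 8 outcomes

# A hypothetical non-uniform distribution has lower entropy than a
# uniform one over the same number of outcomes (here 4 outcomes).
h_skewed = entropy([0.5, 0.25, 0.125, 0.125])
h_uniform4 = entropy([0.25] * 4)

print(h_die, h_skewed, h_uniform4)
```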

Entropy (3/5)

The joint entropy of a pair of discrete random variables X, Y is the amount of information needed on average to specify both their values. It is defined as:

$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x, y)$$

Entropy (4/5)

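The conditional entropy of a discrete random variable Y given another X, for X, Y ~ p(x, y), expresses how much extra information you still need to supply on average to communicate Y given that the other party knows X:

$$\begin{aligned}
H(Y \mid X) &= \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x) \\
            &= \sum_{x \in \mathcal{X}} p(x) \left[ -\sum_{y \in \mathcal{Y}} p(y \mid x) \log p(y \mid x) \right] \\
            &= -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(y \mid x)
\end{aligned}$$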

Entropy (5/5)

There is a chain rule for entropy:

H(X, Y) = H(X) + H(Y|X)

H(X1, …, Xn) = H(X1) + H(X2|X1) + … + H(Xn|X1, X2, …, Xn-1)

By this chain rule:

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y), therefore,

H(X) - H(X|Y) = H(Y) - H(Y|X)
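The chain rule can be verified numerically on a small example. The sketch below uses a hypothetical joint distribution (the dictionary `joint` and all function names are assumptions made for illustration) and checks that H(X, Y) = H(X) + H(Y|X).

```python
import math
from collections import defaultdict

# Hypothetical joint distribution p(x, y) over X in {a, b} and Y in {0, 1}.
joint = {("a", 0): 0.5, ("a", 1): 0.25, ("b", 0): 0.125, ("b", 1): 0.125}

def marginal_x(joint):
    """Marginal p(x), obtained by summing p(x, y) over y."""
    px = defaultdict(float)
    for (x, _y), p in joint.items():
        px[x] += p
    return dict(px)

def joint_entropy(joint):
    """H(X, Y) in bits."""
    return -sum(p * math.log2(p) for p in joint.values() if p > 0)

def conditional_entropy_y_given_x(joint):
    """H(Y|X) in bits, using p(y|x) = p(x, y) / p(x)."""
    px = marginal_x(joint)
    return -sum(p * math.log2(p / px[x]) for (x, _y), p in joint.items() if p > 0)

h_x = -sum(p * math.log2(p) for p in marginal_x(joint).values())
h_xy = joint_entropy(joint)
h_y_given_x = conditional_entropy_y_given_x(joint)

print(h_xy, h_x + h_y_given_x)  # the two values should agree
```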

Mutual Information (1/7)

This difference is called the mutual information between X and Y:

I(X; Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

It is the reduction in uncertainty of one random variable due to knowing about another. In other words, it is the amount of information one random variable contains about another.

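[Figure: diagram of the relationship between entropy and mutual information, with regions labeled H(X), H(Y), H(X, Y), H(X|Y), H(Y|X), and I(X; Y).]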

Mutual information is a symmetric, non-negative measure of the common information in the two variables.

People often think of mutual information as a measure of dependence between variables.

However, it is actually better to think of it as a measure of independence because:

• It is 0 only when two variables are independent, but

• For two dependent variables, mutual information grows not only with the degree of dependence, but also according to the entropy of the variables.

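$$\begin{aligned}
I(X; Y) &= H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) \\
        &= \sum_{x} p(x) \log \frac{1}{p(x)} + \sum_{y} p(y) \log \frac{1}{p(y)} + \sum_{x, y} p(x, y) \log p(x, y) \\
        &= \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}
\end{aligned}$$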

Since H(X|X) = 0, note that: H(X) = H(X) - H(X|X) = I(X; X). This illustrates both why entropy is also called self-information, and how the mutual information between two totally dependent variables is not constant but depends on their entropy.
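A small Python sketch makes both properties concrete: mutual information is 0 for independent variables, and for totally dependent variables it equals the entropy of the variable. The function name `mutual_information` and the three toy distributions are assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X; Y) in bits for a joint distribution given as {(x, y): p}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Independent variables: p(x, y) = p(x) p(y) everywhere.
independent = {(x, y): 0.25 for x in "ab" for y in "cd"}
mi_independent = mutual_information(independent)

# Totally dependent variables (Y is a copy of X): I(X; X) = H(X).
copy_coin = {("a", "a"): 0.5, ("b", "b"): 0.5}   # H(X) = 1 bit
copy_die = {(i, i): 1 / 8 for i in range(8)}      # H(X) = 3 bits
mi_coin = mutual_information(copy_coin)
mi_die = mutual_information(copy_die)

print(mi_independent, mi_coin, mi_die)
```

Both copied variables are perfectly dependent, yet their mutual information differs because their entropies differ.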

An information-theoretically motivated measure for discovering interesting collocations is pointwise mutual information (Church et al. (1991), Church & Hanks (1989) and Hindle (1990)).

Fano (1961) originally defined mutual information between particular events x’ and y’; in our case the events are occurrences of particular words.

This type of mutual information is roughly a measure of how much one word tells us about the other.
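For reference, pointwise mutual information between two particular events is usually written as follows. This is the standard textbook form, stated here as an aid to the reader rather than recovered from the slide; the base-2 logarithm is an assumption that matches the use of bits elsewhere in these notes.

```latex
I(x', y') = \log_2 \frac{P(x' y')}{P(x')\,P(y')}
          = \log_2 \frac{P(x' \mid y')}{P(x')}
          = \log_2 \frac{P(y' \mid x')}{P(y')}
```

The second and third forms show why the measure can be read as how much knowing one word changes the probability of the other.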

About Definitions

The definition of mutual information used here is common in corpus linguistic studies, but is less common in Information Theory. It is important to check what a mathematical concept is a formalization of.

We will see that pointwise mutual information is of limited utility for acquiring the types of linguistic properties considered here.

Mutual Information Exp. (1/3)

These two types of mutual information are quite different

creatures.

When we apply this definition to the 10 collocations from the

previous table, we get the same ranking as with the t test. See the

following table:

As usual, we use maximum likelihood estimates to compute the probabilities, for example:

The amount of information we have about the occurrence of Ayatollah at position i in the corpus increases by 18.38 bits if we are told that Ruhollah occurs at position i + 1. In other words, we can be much more certain that Ruhollah will occur next if we are told that Ayatollah is the current word.
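A pointwise mutual information score can be computed from raw counts with maximum likelihood estimates, as in the Python sketch below. The function name `pmi` is illustrative. The bigram count of 20 and the corpus size come from the text, but the individual word counts (42 for Ayatollah, 20 for Ruhollah) are assumptions: the table did not survive, and these values were chosen because they are consistent with the reported 18.38 bits.

```python
import math

def pmi(count_w1, count_w2, count_bigram, n):
    """Pointwise mutual information in bits, from counts via MLE probabilities."""
    p_w1 = count_w1 / n
    p_w2 = count_w2 / n
    p_bigram = count_bigram / n
    return math.log2(p_bigram / (p_w1 * p_w2))

N = 14307668  # corpus size given in the text

# Assumed counts (not from the source): Ayatollah = 42, Ruhollah = 20.
pmi_ayatollah_ruhollah = pmi(count_w1=42, count_w2=20, count_bigram=20, n=N)

print(f"{pmi_ayatollah_ruhollah:.2f} bits")
```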

Mutual Information Fails: χ2 Test (1/5)

Unfortunately, this measure of “increased information” is in many cases not a good measure of what an interesting correspondence between two events is. Consider the two examples in the following table of counts of word correspondences between French and English sentences in the Hansard corpus, an aligned corpus of debates of the Canadian parliament. Let’s look at two French words and their English glosses:

chambre: room, house

communes: common

Mutual Information Fails: χ2 Test (2/5)

• Mutual information gives a higher score to (communes, house)

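Note: The χ2 test is Pearson’s chi-square test. The χ2 statistic sums the squared differences between observed and expected values in all cells of the table, scaled by the magnitude of the expected values:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where O_ij is the observed value for cell (i, j) and E_ij is the expected value.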

The reason that house frequently appears in translations of French sentences containing chambre and communes is that the most common use of house is the phrase House of Commons, which corresponds to Chambre des communes in French.

But it is easy to see that communes is a worse match for house than chambre since most occurrences of house occur without communes on the French side.

The χ2 test is able to infer the correct correspondence whereas mutual information gives preference to the incorrect pair (communes, house).

The higher mutual information value for communes reflects the fact that communes causes a larger decrease in uncertainty.

In contrast, the χ2 is a direct test of probabilistic dependence, which in this context we can interpret as the degree of association between two words and hence as a measure of their quality as translation pairs and collocations.
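Pearson's χ2 statistic for a 2×2 table of co-occurrence counts can be sketched in Python as follows. The function name `chi_square_2x2` is illustrative, and the counts are hypothetical: they are not the Hansard figures, which are not reproduced in these notes. The threshold 3.841 is the standard critical value for one degree of freedom at α = 0.05.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 contingency table of observed counts."""
    observed = [[o11, o12], [o21, o22]]
    n = o11 + o12 + o21 + o22
    row_totals = [o11 + o12, o21 + o22]
    col_totals = [o11 + o21, o12 + o22]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts of aligned sentence pairs:
#   o11: French word and English word both present
#   o12: French word present, English word absent
#   o21: French word absent, English word present
#   o22: neither word present
chi2_associated = chi_square_2x2(30, 10, 20, 940)
chi2_unrelated = chi_square_2x2(10, 90, 90, 810)

CRITICAL_VALUE = 3.841  # one degree of freedom, alpha = 0.05
print(chi2_associated > CRITICAL_VALUE, chi2_unrelated > CRITICAL_VALUE)
```

In the second table every observed count equals its expected count, so the statistic is 0 and the null hypothesis of independence cannot be rejected.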

Frequency Matters (1/5)

The table shows a second problem with using mutual information for finding collocations. It compares statistics over corpora of different sizes.

Frequency Matters (2/5)

We show ten bigrams that occur exactly once in the first 1000 documents of the reference corpus and their mutual information score based on the 1000 documents.

The right half of the table shows the mutual information score based on the entire reference corpus (about 23,000 documents).

The larger corpus of 23,000 documents makes some better estimates possible, which in turn leads to a slightly better ranking.

The bigrams marijuana growing and new converts (arguably collocations) have moved up and Reds survived (definitely not a collocation) has moved down.

However, what is striking is that even after going to a much larger corpus (about 23 times as many documents), 6 of the bigrams still only occur once.

As a consequence, they have inaccurate maximum likelihood estimates and artificially inflated mutual information scores.

None of these 6 is a collocation.

None of the measures works very well for low-frequency events.

But there is evidence that sparseness is a particularly difficult problem for mutual information.

When Mutual Information Works (1/5)

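Consider two extreme cases: perfect dependence of the occurrences of the two words and perfect independence of the occurrences of the two words.

For perfect dependence (they only occur together) we have:

$$I(x, y) = \log \frac{P(xy)}{P(x)\,P(y)} = \log \frac{P(x)}{P(x)\,P(y)} = \log \frac{1}{P(y)}$$

That is, among perfectly dependent bigrams, as they get rarer, their mutual information increases.

For perfect independence (the occurrence of one does not give us any information about the occurrence of the other) we have:

$$I(x, y) = \log \frac{P(xy)}{P(x)\,P(y)} = \log \frac{P(x)\,P(y)}{P(x)\,P(y)} = \log 1 = 0$$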

We can say that mutual information is a good measure of independence. Values close to 0 indicate independence (independent of frequency).

But it is a bad measure of dependence because for dependence the score depends on the frequency of the individual words.

Other things being equal, bigrams composed of low-frequency words will receive a higher score than bigrams composed of high-frequency words.

That is the opposite of what we would want a good measure to do since higher frequency means more evidence and we would prefer a higher rank for bigrams for whose interestingness we have more evidence.
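The frequency effect is easy to see numerically. The Python sketch below compares two perfectly dependent bigrams that differ only in how often they occur, plus one perfectly independent pair; all probabilities are hypothetical, and `pmi_from_probs` is an illustrative name.

```python
import math

def pmi_from_probs(p_xy, p_x, p_y):
    """Pointwise mutual information in bits."""
    return math.log2(p_xy / (p_x * p_y))

# Perfect dependence: the two words only ever occur together,
# so P(xy) = P(x) = P(y) and the score reduces to log2(1 / P(y)).
pmi_frequent = pmi_from_probs(1e-3, 1e-3, 1e-3)
pmi_rare = pmi_from_probs(1e-6, 1e-6, 1e-6)

# Perfect independence: P(xy) = P(x) P(y), so the score is 0 at any frequency.
pmi_independent = pmi_from_probs(0.5 * 0.25, 0.5, 0.25)

print(pmi_frequent, pmi_rare, pmi_independent)
```

Both dependent pairs are equally strongly associated, yet the rarer one scores roughly twice as high, which is the ranking behaviour criticized above.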

One solution that has been proposed for this is to use a cutoff and to only look at words with a frequency of at least 3.

However, such a move does not solve the underlying problem, but only ameliorates its effects.

Since pointwise mutual information does not capture the intuitive notion of an interesting collocation very well, it is often left unused in practical applications even where it is made available.

Summary (1/5)

There are actually different definitions of the notion of collocation.

For instance, a collocation can be defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988).

Summary (2/5)

The following criteria are typical of linguistic treatments of collocations. Non-compositionality is the main one we have relied on here.

Non-compositionality. The meaning of a collocation is not a straightforward composition of the meanings of its parts. Either the meaning is completely different from the free combination (such as idioms) or there is a connotation or added element of meaning that cannot be predicted from the parts.

Summary (3/5)

Non-substitutability. We cannot substitute near-synonyms for the components of a collocation. For example, we can’t say yellow wine instead of white wine even though yellow is as good a description of the color of white wine as white is (it is kind of a yellowish white).

Summary (4/5)

Non-modifiability. Many collocations cannot be freely modified with additional lexical material or through grammatical transformations. This is especially true for frozen expressions like idioms. For example, we can’t modify frog in to get a frog in one’s throat (meaning to have an uncomfortable or hoarse throat) to produce to get an ugly frog in one’s throat, although usually nouns like frog can be modified by adjectives like ugly.

Summary (5/5)

A nice way to test whether a combination is a collocation is to translate it into another language.

If we cannot translate the combination word by word, then that is evidence that we are dealing with a collocation. For example, translating make a decision into French one word at a time, we get faire une décision, which is incorrect. The correct translation is prendre une décision.

References

K. W. Church and P. Hanks. 1990. Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, Vol. 16, No. 1.

T. Fontenelle et al. 1994. Survey of Collocation Extraction Tools. Technical Report, University of Liege, Liege, Belgium.

J. Hodges et al. 1996. An Automated System that Assists in the Generation of Document Indexes. Natural Language Engineering No. 2.
