a comparison of document, sentence, and term event spaces

A Comparison of Document, Sentence, and Term Event Spaces

Catherine Blake

School of Information and Library Science

University of North Carolina at Chapel Hill

North Carolina, NC [email protected]

Classic Information Retrieval

[Diagram: Document → Representation; Information Need → Query; Representation and Query → Match]

• Matching
– Exact match = Boolean Model
– Weighted match = Vector Model

Term Weighting

• Goal: Favor discriminating terms
• Commonly used: TF x IDF
• IDF(t_i) = log2(N) - log2(n_i) + 1
– N = total number of documents in the corpus
– t_i = a term (typically a stemmed word)
– n_i = number of documents that contain at least one occurrence of the term t_i

Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21.

Salton, G. & Buckley, C. (1988) Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-23.
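The IDF definition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments; the function names `idf` and `tf_idf` are hypothetical, and using a raw term count as TF is an assumption, since the slide does not say which TF variant is meant.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """IDF(t_i) = log2(N) - log2(n_i) + 1, following the slide's definition."""
    if docs_with_term <= 0 or docs_with_term > total_docs:
        raise ValueError("need 0 < n_i <= N")
    return math.log2(total_docs) - math.log2(docs_with_term) + 1


def tf_idf(term_count: int, total_docs: int, docs_with_term: int) -> float:
    """TF x IDF weight; TF is taken to be the raw term count (an assumption)."""
    return term_count * idf(total_docs, docs_with_term)
```

With this form, a term that occurs in every document gets the minimum value of 1 (for example, `idf(8, 8)` is 1.0), and the value grows by 1 each time the document frequency halves.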

Practical Motivations

• Systems moving toward sub-document retrieval
– Document Summarization: Why not use Inverse Sentence Frequency (ISF)?
– Question Answering: Why not use Inverse Term Frequency (ITF)?
• Calculating IDF is problematic
– How many documents are needed for stable IDF estimates?
• Corpora have changed since initial experiments
– # documents
– Vocabulary size
– # terms per document

Theoretical Motivations

• TF x IDF combines two different event spaces
– TF: number of terms
– IDF: number of documents
– Are the limits of these spaces really the same?
• Foundational theories use the term space
– Zipf's Law (Zipf, 1949)
– Shannon's Theory (Shannon, 1948)

Goal : Compare and Contrast

1. Raw term comparison
2. Zipf Law comparison
3. Direct IDF, ISF, and ITF comparison
4. Abstract versus full-text comparison
5. IDF Sensitivity

Corpora

• Full text scientific articles in chemistry
• Initial corpus:
– 103,262 articles
– Published in 27 journals over the last 4 years
– Two journals excluded due to formatting inconsistencies
• These experiments:
– 100,830 articles
– 16,538,655 sentences
– 526,025,066 total unstemmed terms
– 2,001,730 distinct unstemmed terms
– 1,391,763 distinct stemmed terms (Porter algorithm)

Journal   # Docs  % Corpus  Avg Length  Terms (Million)  % Terms
ACHRE4       548       0.5        4923              2.7        1
ANCHAM      4012       4.0        4860             19.5        4
BICHAW      8799       8.7        6674             58.7       11
BIPRET      1067       1.1        4552              4.9        1
BOMAF6      1068       1.1        4847              5.2        1
CGDEFU       566       0.5        3741              2.1       <1
CMATEX      3598       3.6        4807             17.3        3
ESTHAG      4120       4.1        5248             21.6        4
IECRED      3975       3.9        5329             21.2        4
INOCAJ      5422       5.4        6292             34.1        6
JACSAT     14400      14.3        4349             62.6       12
JAFCAU      5884       5.8        4185             24.6        5
JCCHFF       500       0.5        5526              2.8        1
JCISD8      1092       1.1        4931              5.4        1
JMCMAR      3202       3.2        8809             28.2        5
JNPRDF      2291       2.2        4144              9.5        2
JOCEAH      7307       7.2        6605             48.3        9
JPCAFH      7654       7.6        6181             47.3        9
JPCBFK      9990       9.9        5750             57.4       11
JPROBS       268       0.3        4917              1.3       <1
MAMOBX      6887       6.8        5283             36.4        7
MPOHBP        58       0.1        4868              0.3       <1
NALEFD      1272       1.3        2609              3.3        1
OPRDFK       858       0.8        3616              3.1        1
ORLEF7      5992       5.9        1477              8.8        2

Example IDF, ISF, ITF

            Document                 Sentence                 Term
Term        Abstract  Non-Abs   All  Abstract  Non-Abs   All  Abstract  Non-Abs   All
the              1.0      1.0   1.0       1.3      1.4   1.4       4.6      9.4   5.2
chemist         11.1      6.0   5.7      13.6     12.8  12.6      22.8     17.6  17.6
synthesis       14.3     11.2  10.8      17.1     18.0  17.6      26.4     22.6  22.5
eletrochem      17.5     15.3  15.0      20.3     22.6  22.4      29.6     27.0  27.5

IDF(t_i) = log2(N) - log2(n_i) + 1
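To make the three event spaces concrete, the sketch below computes IDF, ISF, and ITF for a toy corpus. The slide gives only the IDF formula, so this assumes ISF and ITF take the same form with sentences and term occurrences as the counting unit. The helper names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency(total: int, count: int) -> float:
    """log2(total) - log2(count) + 1, the same form as the slide's IDF."""
    return math.log2(total) - math.log2(count) + 1


def event_space_weights(documents):
    """documents: list of documents; each is a list of sentences (term lists)."""
    doc_freq, sent_freq, term_freq = Counter(), Counter(), Counter()
    n_docs = n_sents = n_terms = 0
    for doc in documents:
        n_docs += 1
        doc_terms = set()
        for sentence in doc:
            n_sents += 1
            n_terms += len(sentence)
            term_freq.update(sentence)       # every occurrence counts
            sent_freq.update(set(sentence))  # once per sentence
            doc_terms.update(sentence)
        doc_freq.update(doc_terms)           # once per document
    return {
        term: {
            "idf": inverse_frequency(n_docs, doc_freq[term]),
            "isf": inverse_frequency(n_sents, sent_freq[term]),
            "itf": inverse_frequency(n_terms, term_freq[term]),
        }
        for term in term_freq
    }


# Toy corpus (illustrative only): 2 documents, 4 sentences, 11 terms.
toy_corpus = [
    [["the", "synthesis", "of", "the", "compound"], ["the", "yield"]],
    [["the", "chemist"], ["a", "synthesis"]],
]
weights = event_space_weights(toy_corpus)
```

Because there are more sentences than documents, and more term occurrences than sentences, the same term generally receives a larger value as the event space becomes finer, which matches the ordering seen in the example table.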

1) Raw term comparison

• Document vs Sentence Frequency (log scales)

[Figure: Average Sentence Frequency (log scale) plotted against Document Frequency (log scale)]


• Document vs Term Frequency (log scales)

[Figure: Average Term Frequency (log scale) plotted against Document Frequency (log scale). Inset labeled "Luhn". Image Source: Van Rijsbergen, 1979]


• Sentence vs Term Frequency (log scales)

[Figure: Average Term Frequency (log scale) plotted against Sentence Frequency (log scale)]

2) Zipf Law comparison

• Zipf's Law: The frequency of terms in a corpus conforms to a power law distribution K/j^θ, where θ is close to 1 (Zipf, 1949)
• Term distributions followed a power law
• θ differed between the event spaces
– Average θ in document space = -1.65
– Average θ in sentence space = -1.73
– Average θ in term space = -1.73
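One common way to estimate a power-law exponent is a least-squares fit on log-log axes. The sketch below fits log(frequency) against log(rank); the slides do not state how θ was estimated or which axes were fitted, so this is an assumed procedure and the function names are hypothetical. The negative values reported above suggest that fitted slopes were reported directly, which is what this returns.

```python
import math


def log_log_slope(xs, ys):
    """Least-squares slope of log10(y) against log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    return sxy / sxx  # requires at least two distinct x values


def zipf_slope(term_counts):
    """Slope of log(frequency) vs log(rank); about -theta for K / j**theta."""
    freqs = sorted(term_counts.values(), reverse=True)
    ranks = range(1, len(freqs) + 1)
    return log_log_slope(ranks, freqs)
```

For counts that follow K/j exactly, the fitted slope is -1 up to floating-point error. Applying the same function to counts taken from the document, sentence, or term space gives one slope per event space.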

2) Example Document Distribution

[Figure: MAMOBX, log-log plot against Word Occurrences (log); series: Actual, theta=1]

2) θ Comparison of all journals

[Figure: Sentence or Term Slope plotted against Document Slope for each journal; series: Sentence, Term; labeled point: JACSAT]

3) Direct IDF vs ISF comparison

• Linear fit: y = 1.0662x + 5.5724, R² = 0.9974

[Figure: ISF plotted against IDF (1 to 18); series: Avg, Min, Max]

3) Direct IDF vs ITF comparison

• Linear fit: y = 1.0721x + 10.452, R² = 0.9972

[Figure: ITF plotted against IDF (1 to 18); series: Avg, Min, Max]

3) Direct ISF vs ITF comparison

• Linear fit: y = 1.0144x + 4.6937, R² = 0.9996

[Figure: ITF plotted against ISF (1 to 25); series: Avg, Min, Max]
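The linear fits and R² values reported for these comparisons can be reproduced in form with ordinary least squares. The sketch below is a generic implementation with a hypothetical function name. The sample values are the "All" columns for the four terms in the example table, so they will not reproduce the reported coefficients, which were fitted over the full vocabulary.

```python
def linear_fit(xs, ys):
    """Ordinary least squares: returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# "All" columns from the example table: the, chemist, synthesis, eletrochem.
idf_all = [1.0, 5.7, 10.8, 15.0]
isf_all = [1.4, 12.6, 17.6, 22.4]
slope, intercept, r_squared = linear_fit(idf_all, isf_all)
```

A slope near 1 with a positive intercept, as in the reported fits, would mean the two measures differ mainly by a constant offset.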

4) Abstract versus full-text

[Figure: Average abstract / non-abstract IDF plotted against Global IDF (1 to 18); series: Abstract, Non-Abstract]

5) IDF Sensitivity

[Figure: Average IDF of Stemmed Terms plotted against IDF of Total Corpus (1 to 18); one series per sample size: 10, 20, 30, 40, 50, 60, 70, 80, 90 % of Total Corpus]
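A sensitivity check of this kind can be sketched as follows: draw a sample of documents, recompute IDF on the sample, and average the sample IDF of terms grouped by their rounded IDF in the total corpus. This is a simplified illustration with hypothetical function names. It uses a simple random sample, whereas the conclusions mention a stratified sample, and the binning by rounded global IDF is an assumption based on the figure's axes.

```python
import math
import random
from collections import Counter, defaultdict


def idf_table(documents):
    """documents: list of term lists; returns {term: IDF} over those documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return {t: math.log2(n_docs) - math.log2(n) + 1 for t, n in doc_freq.items()}


def sample_idf_by_global_bin(documents, fraction, seed=0):
    """Average sample IDF of terms, grouped by rounded IDF in the full corpus."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = max(1, round(fraction * len(documents)))
    sample = rng.sample(documents, k)
    global_idf = idf_table(documents)
    sample_idf = idf_table(sample)
    bins = defaultdict(list)
    for term, value in sample_idf.items():
        bins[round(global_idf[term])].append(value)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}
```

If sample IDF values were perfectly stable, each bin's average would equal its global value minus a constant that depends only on the sample size, because log2(N) shrinks with the sample. Deviations from that pattern indicate sensitivity to corpus size.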

5) IDF Sensitivity

[Figure: Average Local IDF plotted against Global IDF (1 to 18); one series per journal: ACHRE4, ANCHAM, BICHAW, BIPRET, BOMAF6, CGDEFU, CMATEX, ESTHAG, IECRED, INOCAJ, JACSAT, JAFCAU, JCCHFF, JCISD8, JMCMAR, JNPRDF, JOCEAH, JPCAFH, JPCBFK, JPROBS, MAMOBX, MPOHBP, NALEFD, OPRDFK, ORLEF7]

Conclusions

• Raw document frequencies differ from sentence and term frequencies
– the differences occur around the areas of important terms
– this makes it difficult to perform a linear transformation from the document space to a sub-document space
• Raw term frequencies correlate well with the sentence frequencies
• IDF, ISF and ITF are highly correlated

Conclusions

• IDF values are surprisingly stable
– with respect to random samples at 10% of the total corpus
– average IDF values based on only a 20% random stratified sample correlated almost perfectly with the IDF of the total corpus
• Journal-based IDF samples did not correlate well with the global IDF
• Language used in abstracts is systematically different from the language used in the body of a full-text scientific document
