A Comparison of Document, Sentence, and Term Event Spaces

A Comparison of Document, Sentence, and Term Event Spaces Catherine Blake School of Information and Library Science University of North Carolina at Chapel Hill North Carolina, NC 27599-3360 [email protected]


Post on 17-Jan-2016


Page 1: A Comparison of Document, Sentence, and Term Event Spaces

A Comparison of Document, Sentence, and Term Event Spaces

Catherine Blake
School of Information and Library Science
University of North Carolina at Chapel Hill
North Carolina, NC 27599-3360
[email protected]

Page 2: A Comparison of Document, Sentence, and Term Event Spaces

Classic Information Retrieval

[Diagram: a Document is mapped to a Representation and an Information Need is mapped to a Query; the Representation and the Query are then matched.]

Matching
- Exact match = Boolean Model
- Weighted match = Vector Model

Page 3: A Comparison of Document, Sentence, and Term Event Spaces

Term Weighting

• Goal: Favor discriminating terms
• Commonly used: TF x IDF
• IDF(ti) = log2(N) – log2(ni) + 1
  – N = total number of documents in the corpus
  – ti = a term (typically a stemmed word)
  – ni = number of documents that contain at least one occurrence of the term ti
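The IDF formula above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation; the function name `idf` and its parameter names are assumptions chosen for readability.

```python
import math


def idf(num_docs: int, docs_with_term: int) -> float:
    """IDF(t_i) = log2(N) - log2(n_i) + 1, following the slide's definition."""
    if docs_with_term <= 0 or docs_with_term > num_docs:
        raise ValueError("expected 0 < n_i <= N")
    return math.log2(num_docs) - math.log2(docs_with_term) + 1


# A term that occurs in every document gets the minimum weight.
print(idf(1024, 1024))  # 1.0
# A term that occurs in a single document out of 1024 gets log2(1024) + 1.
print(idf(1024, 1))     # 11.0
```

The "+ 1" keeps the weight of a term that appears in every document at 1 rather than 0, so such a term still contributes to a TF x IDF product.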

Sparck Jones, K. (1972) A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11-21.

Salton,G. & Buckley,C. (1988) Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24 (5):513-23

Page 4: A Comparison of Document, Sentence, and Term Event Spaces

Practical Motivations

• Systems moving toward sub-document retrieval
  – Document summarization: why not use Inverse Sentence Frequency (ISF)?
  – Question answering: why not use Inverse Term Frequency (ITF)?
• Calculating IDF is problematic
  – How many documents are needed for stable IDF estimates?
• Corpora have changed since initial experiments
  – # documents
  – Vocabulary size
  – # terms per document

Page 5: A Comparison of Document, Sentence, and Term Event Spaces

Theoretical Motivations

• TF x IDF combines two different event spaces
  – TF: number of terms
  – IDF: number of documents
  – Are the limits of these spaces really the same?
• Foundational theories use the term space
  – Zipf’s Law (Zipf, 1949)
  – Shannon’s Theory (Shannon, 1948)

Page 6: A Comparison of Document, Sentence, and Term Event Spaces

Goal : Compare and Contrast

1. Raw term comparison
2. Zipf Law comparison
3. Direct IDF, ISF, and ITF comparison
4. Abstract versus full-text comparison
5. IDF Sensitivity

Page 7: A Comparison of Document, Sentence, and Term Event Spaces

Corpora

• Full text scientific articles in chemistry
• Initial corpus:
  – 103,262 articles
  – Published in 27 journals over the last 4 years
  – Two journals excluded due to formatting inconsistencies
• These experiments:
  – 100,830 articles
  – 16,538,655 sentences
  – 526,025,066 total unstemmed terms
  – 2,001,730 distinct unstemmed terms
  – 1,391,763 distinct stemmed terms (Porter algorithm)

Page 8: A Comparison of Document, Sentence, and Term Event Spaces

Journal   # Docs   % Corpus   Avg Length   Terms (millions)   % Terms
ACHRE4       548        0.5         4923                2.7         1
ANCHAM      4012        4.0         4860               19.5         4
BICHAW      8799        8.7         6674               58.7        11
BIPRET      1067        1.1         4552                4.9         1
BOMAF6      1068        1.1         4847                5.2         1
CGDEFU       566        0.5         3741                2.1        <1
CMATEX      3598        3.6         4807               17.3         3
ESTHAG      4120        4.1         5248               21.6         4
IECRED      3975        3.9         5329               21.2         4
INOCAJ      5422        5.4         6292               34.1         6
JACSAT     14400       14.3         4349               62.6        12
JAFCAU      5884        5.8         4185               24.6         5
JCCHFF       500        0.5         5526                2.8         1
JCISD8      1092        1.1         4931                5.4         1
JMCMAR      3202        3.2         8809               28.2         5
JNPRDF      2291        2.2         4144                9.5         2
JOCEAH      7307        7.2         6605               48.3         9
JPCAFH      7654        7.6         6181               47.3         9
JPCBFK      9990        9.9         5750               57.4        11
JPROBS       268        0.3         4917                1.3        <1
MAMOBX      6887        6.8         5283               36.4         7
MPOHBP        58        0.1         4868                0.3        <1
NALEFD      1272        1.3         2609                3.3         1
OPRDFK       858        0.8         3616                3.1         1
ORLEF7      5992        5.9         1477                8.8         2

Page 9: A Comparison of Document, Sentence, and Term Event Spaces

Example IDF, ISF, ITF

Term          Document (IDF)              Sentence (ISF)              Term (ITF)
              Abstract  Non-Abs   All     Abstract  Non-Abs   All     Abstract  Non-Abs   All
the                1.0      1.0   1.0          1.3      1.4   1.4          4.6      9.4   5.2
chemist           11.1      6.0   5.7         13.6     12.8  12.6         22.8     17.6  17.6
synthesis         14.3     11.2  10.8         17.1     18.0  17.6         26.4     22.6  22.5
eletrochem        17.5     15.3  15.0         20.3     22.6  22.4         29.6     27.0  27.5

IDF(ti) = log2(N) – log2(ni) + 1
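To make the three event spaces concrete, here is a small Python sketch that applies the same formula to a toy corpus. It assumes, for illustration only, that ISF counts sentences containing the term out of all sentences, and that ITF counts occurrences of the term out of all term occurrences; the slides do not spell out these definitions. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def inverse_frequency(total_events: int, events_with_term: int) -> float:
    """log2(N) - log2(n_i) + 1, with N and n_i counted in one event space."""
    return math.log2(total_events) - math.log2(events_with_term) + 1


def event_space_weights(documents):
    """Return {term: (IDF, ISF, ITF)}.

    `documents` is a list of documents; each document is a list of
    sentences; each sentence is a list of terms.
    """
    doc_freq, sent_freq, term_freq = Counter(), Counter(), Counter()
    num_docs = num_sents = num_terms = 0
    for doc in documents:
        num_docs += 1
        doc_terms = set()
        for sentence in doc:
            num_sents += 1
            num_terms += len(sentence)
            term_freq.update(sentence)        # every occurrence counts
            sent_freq.update(set(sentence))   # once per sentence
            doc_terms.update(sentence)
        doc_freq.update(doc_terms)            # once per document
    return {
        term: (
            inverse_frequency(num_docs, doc_freq[term]),
            inverse_frequency(num_sents, sent_freq[term]),
            inverse_frequency(num_terms, term_freq[term]),
        )
        for term in term_freq
    }


# Toy corpus: 2 documents, 4 sentences, 12 term occurrences.
toy = [
    [["the", "synthesis", "of", "the", "salt"], ["the", "yield"]],
    [["the", "chemist"], ["a", "new", "synthesis"]],
]
weights = event_space_weights(toy)
for term in ("the", "synthesis", "chemist"):
    idf, isf, itf = weights[term]
    print(f"{term:10s} IDF={idf:.2f} ISF={isf:.2f} ITF={itf:.2f}")
```

Even in this toy corpus the weight of a term grows from the document space to the sentence space to the term space, because the number of events N grows faster than the count for any one term.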

Page 10: A Comparison of Document, Sentence, and Term Event Spaces

1) Raw term comparison

• Document vs Sentence Frequency (log scales)

[Figure: Average Sentence Frequency (log scale, 1.0E+0 to 1.0E+8) plotted against Document Frequency (log scale, 1.0E+00 to 1.0E+06).]

Page 11: A Comparison of Document, Sentence, and Term Event Spaces

1) Raw term comparison

• Document vs Term Frequency (log scales)

[Figure: Average Term Frequency (log scale, 1.0E+0 to 1.0E+8) plotted against Document Frequency (log scale, 1.00E+00 to 1.00E+06).]

Page 12: A Comparison of Document, Sentence, and Term Event Spaces

Luhn

Image Source: Van Rijsbergen, 1979

Page 13: A Comparison of Document, Sentence, and Term Event Spaces

1) Raw term comparison

• Sentence vs Term Frequency (log scales)

[Figure: Average Term Frequency (log scale, 1.0E+0 to 1.0E+8) plotted against Sentence Frequency (log scale, 1.0E+00 to 1.0E+07).]

Page 14: A Comparison of Document, Sentence, and Term Event Spaces

2) Zipf Law comparison

• Zipf’s Law: The frequency of terms in a corpus conforms to a power law distribution K/j^θ, where θ is close to 1 (Zipf, 1949)
• Term distributions followed a power law
• θ differed between the event spaces
  – Average θ in document space = -1.65
  – Average θ in sentence space = -1.73
  – Average θ in term space = -1.73
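One common way to estimate a power-law exponent is a least-squares fit on log-log axes. The sketch below assumes the rank-frequency form K/j^θ, with j as the rank of a term, and fits the slope of log(frequency) against log(rank). The fitting procedure behind the θ values reported here is not stated, so this is an illustration rather than a reproduction; `zipf_slope` and the synthetic tokens are hypothetical.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic tokens whose frequencies are exactly K / j with K = 120:
# 120, 60, 40, 30, 24, 20 occurrences for ranks 1 to 6.
tokens = []
for rank in range(1, 7):
    tokens.extend([f"t{rank}"] * (120 // rank))

slope = zipf_slope(tokens)
print(round(slope, 6))  # slope of an exact K / j distribution
```

For frequencies that follow K/j exactly, the fitted slope is -1, which corresponds to θ = 1 in the notation above. A fitted slope of -1.65 or -1.73 describes a distribution that falls off more steeply.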

Page 15: A Comparison of Document, Sentence, and Term Event Spaces

2) Example Document Distribution

[Figure: document-space distribution for journal MAMOBX on log-log axes. X-axis: Word Occurrences (log), 1.E+0 to 1.E+4; y-axis: 1.0E+0 to 1.0E+5. Series: Actual, theta=1.]

Page 16: A Comparison of Document, Sentence, and Term Event Spaces

2) θ Comparison of all journals

[Figure: Sentence or Term Slope (y-axis, -1.85 to -1.55) plotted against Document Slope (x-axis, -1.80 to -1.50) for each journal. Series: Sentence, Term. Labeled point: JACSAT.]

Page 17: A Comparison of Document, Sentence, and Term Event Spaces

3) Direct IDF vs ISF comparison

[Figure: ISF (y-axis, 0 to 30) plotted against IDF (x-axis, 1 to 18). Series: Avg, Min, Max. Fitted line: y = 1.0662x + 5.5724, R² = 0.9974.]

Page 18: A Comparison of Document, Sentence, and Term Event Spaces

3) Direct IDF vs ITF comparison

[Figure: ITF (y-axis, 0 to 30) plotted against IDF (x-axis, 1 to 18). Series: Avg, Min, Max. Fitted line: y = 1.0721x + 10.452, R² = 0.9972.]

Page 19: A Comparison of Document, Sentence, and Term Event Spaces

3) Direct ISF vs ITF comparison

[Figure: ITF (y-axis, 0 to 30) plotted against ISF (x-axis, 1 to 25). Series: Avg, Min, Max. Fitted line: y = 1.0144x + 4.6937, R² = 0.9996.]
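The three fitted lines reported in this section (IDF to ISF, IDF to ITF, and ISF to ITF) can be checked against each other: chaining the IDF to ISF fit with the ISF to ITF fit should land close to the direct IDF to ITF fit. The Python sketch below does that arithmetic with the reported coefficients. The function names are hypothetical, and the check says nothing about individual terms, only about the fitted lines.

```python
def isf_from_idf(idf: float) -> float:
    """Reported fit: ISF = 1.0662 * IDF + 5.5724."""
    return 1.0662 * idf + 5.5724


def itf_from_idf(idf: float) -> float:
    """Reported fit: ITF = 1.0721 * IDF + 10.452."""
    return 1.0721 * idf + 10.452


def itf_from_isf(isf: float) -> float:
    """Reported fit: ITF = 1.0144 * ISF + 4.6937."""
    return 1.0144 * isf + 4.6937


# Compare the direct IDF -> ITF line with the chained IDF -> ISF -> ITF line.
for idf in (1, 10, 18):
    direct = itf_from_idf(idf)
    chained = itf_from_isf(isf_from_idf(idf))
    print(f"IDF={idf:2d} direct={direct:.2f} chained={chained:.2f} "
          f"gap={abs(direct - chained):.3f}")
```

Over the plotted IDF range of 1 to 18 the two routes differ by less than 0.1, so the three fits are mutually consistent.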

Page 20: A Comparison of Document, Sentence, and Term Event Spaces

4) Abstract versus full-text

[Figure: Average abstract/Non-abstract IDF (y-axis, 0 to 18) plotted against Global IDF (x-axis, 1 to 18). Series: Abstract, Non-Abstract.]

Page 21: A Comparison of Document, Sentence, and Term Event Spaces

5) IDF Sensitivity

[Figure: Average IDF of Stemmed Terms (y-axis, 0 to 18) plotted against IDF of Total Corpus (x-axis, 1 to 18). One series per sample size: 10, 20, 30, 40, 50, 60, 70, 80, and 90% of Total Corpus.]

Page 22: A Comparison of Document, Sentence, and Term Event Spaces

5) IDF Sensitivity

[Figure: Average Local IDF (y-axis, 0 to 15) plotted against Global IDF (x-axis, 1 to 18). One series per journal: ACHRE4, ANCHAM, BICHAW, BIPRET, BOMAF6, CGDEFU, CMATEX, ESTHAG, IECRED, INOCAJ, JACSAT, JAFCAU, JCCHFF, JCISD8, JMCMAR, JNPRDF, JOCEAH, JPCAFH, JPCBFK, JPROBS, MAMOBX, MPOHBP, NALEFD, OPRDFK, ORLEF7.]

Page 23: A Comparison of Document, Sentence, and Term Event Spaces

Conclusions

• Raw document frequencies differ from sentence and term frequencies
  – around the areas of important terms
  – difficult to perform a linear transformation from the document to a sub-document space
• Raw term frequencies correlate well with the sentence frequencies
• IDF, ISF and ITF are highly correlated

Page 24: A Comparison of Document, Sentence, and Term Event Spaces

Conclusions

• IDF values are surprisingly stable
  – with respect to random samples at 10% of the total corpus
  – average IDF values based on only a 20% random stratified sample correlated almost perfectly with the IDF of the total corpus
• Journal-based IDF samples did not correlate well with the global IDF
• Language used in abstracts is systematically different from the language used in the body of a full-text scientific document
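The stability check described in these conclusions can be sketched as follows: compute IDF on the full corpus and on a random sample, then correlate the two sets of values over the terms seen in the sample. This Python sketch uses simple random sampling and a synthetic corpus, both of which are assumptions made for illustration; the experiments reported here used stratified samples of real articles, and the helper names are hypothetical.

```python
import math
import random
from collections import Counter


def idf_table(docs):
    """Map each term to log2(N) - log2(n_i) + 1 over the given documents."""
    num_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log2(num_docs) - math.log2(n) + 1
            for t, n in doc_freq.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_correlation(docs, fraction, seed=0):
    """Correlate IDF from a simple random sample with IDF from all docs."""
    rng = random.Random(seed)
    sample = rng.sample(docs, max(1, round(len(docs) * fraction)))
    global_idf = idf_table(docs)
    sample_idf = idf_table(sample)
    shared = sorted(sample_idf)  # every sampled term also occurs globally
    return pearson([global_idf[t] for t in shared],
                   [sample_idf[t] for t in shared])


# Synthetic corpus: document i contains term "w<k>" when k divides i,
# so "w1" is in every document and "w20" in only one document in twenty.
corpus = [[f"w{k}" for k in range(1, 21) if i % k == 0]
          for i in range(1, 201)]
r = sample_correlation(corpus, 0.2)
print(f"correlation between 20% sample IDF and global IDF: {r:.3f}")
```

Replacing the random sample with all documents from a single journal would give the journal-based comparison, which the conclusions report as correlating poorly with the global IDF.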