![Page 1: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/1.jpg)
The Case for Corpus Profiling
Anne De Roeck
(Udo Kruschwitz, Nick Webb, Abduelbaset Goweder, Avik Sarkar, Paul Garthwaite, Dawei Song)
Centre for Research in Computing, The Open University, Walton Hall,
Milton Keynes, MK7 6AA, UK.
![Page 3: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/3.jpg)
Fact or Factoid: Hyperlinks
• Hyperlinks do not significantly improve recall and precision in diverse domains, such as the TREC test data (Savoy and Picard 1999, Hawking et al 1999).
• Hyperlinks do significantly improve recall and precision in narrow domains and Intranets (Chen et al 1999, Kruschwitz 2001).
![Page 6: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/6.jpg)
Fact or Factoid: Stemming
• Stemming does not improve effectiveness of retrieval (Harman 1991)
• Stemming improves performance for morphologically complex languages (Popovič and Willett 1992)
• Stemming improves performance on short documents (Krovetz 1993)
![Page 7: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/7.jpg)
Fact or Factoid: Long or Short.
• Stemming improves performance on short documents (Krovetz 1993)
• Short keyword based queries behave differently from long structured queries (Fujii and Croft 1999)
• Keyword based retrieval works better on long texts (Jurafsky and Martin 2000)
![Page 11: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/11.jpg)
Fact
• Performance of IR and NLP techniques depends on the characteristics of the dataset.
• Performance will vary with task, technique and language.
• Datasets really are significantly different.
  – Vital Statistics
  – Sparseness
![Page 12: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/12.jpg)
Description
Contents of the documents
AP Copyrighted AP Newswire stories from 1989.
DOE Short abstracts from the Department of Energy.
FR Issues of the Federal Register (1989), reporting source actions by government agencies.
PAT U.S. Patent Documents for the years 1983-1991.
SJM Copyrighted stories from the San Jose Mercury News (1991).
WSJ Copyrighted stories from the Wall Street Journal (1987-1989).
ZF Information from the Computer Select disks for 1989/1990, copyrighted by Ziff-Davis Publishing Co.
OU The Open University intranet and extranet web-pages.
![Page 13: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/13.jpg)
Vital Stats
| Data Set | No. of Docs | Corpus Length | Av. Doc Length | No. of Terms | Av. Terms per Doc | Shortest Doc | Longest Doc |
|---|---|---|---|---|---|---|---|
| AP | 242,918 | 114,438,101 | 471.1 | 347,966 | 238.25 | 9 | 2,944 |
| DOE | 226,086 | 26,882,774 | 119.0 | 179,310 | 72.90 | 1 | 373 |
| FR | 45,820 | 62,805,175 | 1,370.7 | 157,313 | 292.65 | 2 | 387,476 |
| PAT | 6,711 | 32,151,785 | 4,790.91 | 146,943 | 653.05 | 73 | 74,964 |
| SJM | 90,257 | 39,546,073 | 438.15 | 178,571 | 223.60 | 21 | 10,393 |
| WSJ | 98,732 | 41,560,108 | 420.94 | 159,726 | 204.26 | 7 | 7,992 |
| ZF | 293,121 | 115,956,732 | 395.59 | 295,326 | 168.42 | 19 | 75,030 |
| OU | 53,681 | 39,807,404 | 744.36 | 304,468 | 219.87 | 1 | 15,430 |
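The columns of the Vital Stats table can be computed in a single pass over a tokenised collection. The sketch below is illustrative only: the function name `vital_stats` and the dictionary keys are hypothetical, documents are assumed to be pre-tokenised lists of strings, and "Av. Terms per Doc" is assumed to mean the average number of distinct terms per document, which the slides do not state.

```python
def vital_stats(docs):
    """Summarise a collection given as a list of token lists.

    Assumption: 'terms' are distinct tokens, and the per-document
    term count is the number of distinct tokens in that document.
    """
    if not docs:
        raise ValueError("collection is empty")
    lengths = [len(doc) for doc in docs]
    vocabulary = set()
    distinct_per_doc = []
    for doc in docs:
        types = set(doc)
        vocabulary |= types
        distinct_per_doc.append(len(types))
    n_docs = len(docs)
    return {
        "no_of_docs": n_docs,
        "corpus_length": sum(lengths),
        "av_doc_length": sum(lengths) / n_docs,
        "no_of_terms": len(vocabulary),
        "av_terms_per_doc": sum(distinct_per_doc) / n_docs,
        "shortest_doc": min(lengths),
        "longest_doc": max(lengths),
    }


# Toy collection; whitespace tokenisation is a simplification.
texts = ["the cat sat on the mat", "a dog", "the dog sat"]
stats = vital_stats([t.lower().split() for t in texts])
```

Real collections would need a tokeniser that matches the one used for the experiments being compared, since every column depends on it.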
![Page 14: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/14.jpg)
Type to Token Ratios
| Length of text | AP | DOE | FR | PAT | SJM | WSJ | ZF | TIPSTER OVERALL |
|---|---|---|---|---|---|---|---|---|
| 100 | 1.333 | 1.515 | 1.492 | 1.315 | 1.428 | 1.282 | 1.47 | 1.405 |
| 200 | 1.626 | 1.562 | 1.666 | 1.538 | 1.612 | 1.55 | 1.68 | 1.605 |
| 400 | 1.877 | 1.762 | 2.051 | 2.259 | 1.869 | 1.886 | 1.941 | 1.949 |
| 800 | 2.144 | 2.067 | 2.572 | 3.065 | 2.035 | 2.072 | 2.305 | 2.323 |
| 1600 | 2.797 | 2.315 | 3.047 | 4.266 | 2.476 | 2.584 | 2.758 | 2.892 |
| 3200 | 3.062 | 2.824 | 3.841 | 5.169 | 3.013 | 3.225 | 3.285 | 3.488 |
| 6400 | 3.561 | 3.575 | 5.437 | 6.009 | 3.557 | 3.83 | 4.238 | 4.315 |
| 16000 | 4.563 | 4.737 | 8.583 | 9.744 | 4.153 | 4.566 | 5.289 | 5.948 |
| 20000 | 4.972 | 5.196 | 9.199 | 11.031 | 4.463 | 4.988 | 5.383 | 6.462 |
| 100000 | 9.14 | 10.451 | 15.453 | 20.764 | 8.463 | 9.413 | 12.017 | 12.243 |
| 1000000 | 30.573 | 30.157 | 50.571 | 62.637 | 26.377 | 30.909 | 38.105 | 38.476 |
| 10000000 | 106.845 | 94.778 | 144.866 | 134.017 | 102.149 | 116.183 | 121.798 | 117.234 |
![Page 15: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/15.jpg)
Type to Token Ratios
| Length of text | TIPSTER OVERALL | OU | Bengali | Arabic | Brown Corpus |
|---|---|---|---|---|---|
| 100 | 1.405 | 1.47 | 1.204819 | 1.19 | 1.449 |
| 200 | 1.605 | 1.694 | 1.388889 | 1.342 | 1.613 |
| 400 | 1.949 | 2.247 | 1.67364 | 1.423 | 2.424 |
| 800 | 2.323 | 2.622 | 1.864802 | 1.578 | 2.439 |
| 1600 | 2.892 | 3.053 | 2.288984 | 1.774 | 2.576 |
| 3200 | 3.488 | 3.673 | 2.775369 | 2.082 | 3.674 |
| 6400 | 4.315 | 4.312 | 3.309204 | 2.357 | 4.702 |
| 16000 | 5.948 | 6.24 | 4.663363 | 2.771 | 5.928 |
| 20000 | 6.462 | 6.944 | 5.20969 | 2.875 | 6.341 |
| 100000 | 12.243 | 12.41 | 6.074628 | | |
| 1000000 | 38.476 | 36.127 | 10.81093 | 8.252 | 20.408 |
| 10000000 | 117.234 | 82.064 | | | |
![Page 16: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/16.jpg)
Assumption
• Successful (statistical?) techniques can be successfully ported to other languages.
  – Western European languages
  – Japanese, Chinese, Malay, …
• WordSmith: Effective use requires 5M word corpus (Garside 2000)
![Page 17: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/17.jpg)
Type to Token ratio
| Text length | Bengali (CIIL) | English (Brown) | Arabic (Al-Hayat) |
|---|---|---|---|
| 100 | 1.204 | 1.449 | 1.190 |
| 1600 | 2.288 | 2.576 | 1.774 |
| 6400 | 3.309 | 4.702 | 2.357 |
| 16000 | 4.663 | 5.928 | 2.771 |
| 20000 | 5.209 | 6.341 | 2.875 |
| 1000000 | 10.811 | 20.408 | 8.252 |
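A ratio of this kind can be tracked at increasing text lengths with one running set of seen types. The sketch below is a minimal illustration, not the procedure used for the slides: the function name `tokens_per_type` is hypothetical, and it assumes the reported figures are tokens divided by types, which is suggested (but not stated) by the fact that the values exceed 1 and grow with text length.

```python
def tokens_per_type(tokens, lengths):
    """Return {n: n / number of distinct tokens in the first n tokens}.

    Only lengths that the text actually reaches are reported.
    """
    wanted = set(lengths)
    seen = set()
    ratios = {}
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        if n in wanted:
            ratios[n] = n / len(seen)
    return ratios


# Toy text: 8 tokens, 4 distinct types.
sample = "a b a c a b d a".split()
ratios = tokens_per_type(sample, lengths=(2, 4, 8))
```

Under this reading, a slower-growing ratio (as in the Arabic column) indicates that new word forms keep appearing, i.e. sparser data at a given text length.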
![Page 19: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/19.jpg)
Cargo Cult Science?
• Richard Feynman (1974)
“It's a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty--a kind of leaning over backwards. For example, if you're doing an experiment, you should report everything that you think might make it invalid--not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you've eliminated by some other experiment, and how they worked--to make sure the other fellow can tell they have been eliminated.”
![Page 20: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/20.jpg)
Cargo Cult Science?
• Richard Feynman (1974)
“Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can--if you know anything at all wrong, or possibly wrong--to explain it.”
“In summary, the idea is to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgement in one particular direction or another.”
![Page 21: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/21.jpg)
Cargo Cult Science?
• The role of data in the outcome of experiments should be clarified
• Why?
• How?
![Page 24: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/24.jpg)
Why explore role of data?
• Methodological: Replicability
  – Barbu and Mitkov (2001): anaphora resolution
  – Donaway et al (2000): automatic summarisation
• Epistemological: Theory induction
  – What is the relationship between data properties and technique performance?
• Practical: Application
  – What is the relationship between two sets of data?
  – What is this dataset (language?) like?
![Page 25: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/25.jpg)
How explore role of data?
• One way: Profiling for Bias
• Assumption: Collection will be biased w.r.t. technique & task
• Find measures that reflect bias
• Verify effects experimentally
![Page 26: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/26.jpg)
How explore role of data?
• Profile standard collections
  – Adds to past experiments
• Profile new data
  – Gauge distance to known collections
  – Estimate effectiveness of techniques
![Page 28: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/28.jpg)
Why Profile for Bias?
• And by the way, the others think it is vital. (Machine Learning, Data Mining, Pattern Matching etc.)
• And so did we! (or do we?)
![Page 29: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/29.jpg)
Profiling: An Abandoned Agenda?
• Sparck-Jones (1973) “Collection properties influencing automatic term classification performance.” Information Storage and Retrieval. Vol 9
• Sparck-Jones (1975) “A performance yardstick for test collections.” Journal of Documentation. 31:4
![Page 30: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/30.jpg)
What has changed?
• Sparck-Jones (1973)
  – Is a collection useably classifiable?
    – Number of query terms which can be used for matching.
  – Is a collection usefully classifiable?
    – Number of useful, linked terms in document or collection
  – Is a collection classifiable?
    – Size of vocabulary and rate of incidence
![Page 31: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/31.jpg)
![Page 32: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/32.jpg)
Profiling: An Abandoned Agenda
• Term weighting formula tailored to query
  – Salton 1972
• Stop word identification relative to collection/query
  – Wilbur & Sirotkin 1992; Yang & Wilbur 1996
• Effect of collection homogeneity on language model quality
  – Rose & Haddock 1997
![Page 35: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/35.jpg)
What has changed?
• Proliferation of (test) collections
• More data per collection
• Increased application need
• Sparseness is only one kind of bias
• Better (ways of computing) measures?
![Page 37: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/37.jpg)
Profiling Measures
• Requirements: measures should be
  – relevant to NLP techniques given task
  – fine grained
  – cheap to implement
• Need to agree a framework
• Fixed points:
  – Collections?
  – Properties?
  – Measures?
![Page 38: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/38.jpg)
Profiling Measures
• Simple starting point:
  – Vital Statistics
  – Zipf (sparseness; idiosyncrasy)
  – Type to token ratio (sparseness, specialisation)
  – Manual sampling (quality; content)
• Refine?
  – Homogeneity?
  – Burstiness?
  – (Words and Genre?)
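As one possible reading of the "Zipf" starting point, a collection can be summarised by the slope of its rank-frequency curve on log-log axes, together with the share of terms that occur only once. Both function names below (`zipf_slope`, `hapax_share`) are hypothetical, and the choice of an unweighted least-squares fit is an assumption; the slides do not say how the Zipf profile was computed.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct terms")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def hapax_share(tokens):
    """Fraction of distinct terms that occur exactly once."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(counts)


# Frequencies 12, 6, 4, 3 follow f(rank) = 12 / rank exactly.
toy = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
slope = zipf_slope(toy)
```

A slope near -1 matches the classic Zipfian shape; a large hapax share is one simple indicator of sparseness.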
![Page 39: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/39.jpg)
Profiling Measures
• Homogeneity (or how strong is evidence defeating homogeneity assumption)
• Term Distribution Models (Words!)
• Frequentist vs non-frequentist
• Very frequent terms (!!)
![Page 40: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/40.jpg)
Very Frequent Terms
• Lots of them
• Reputedly “noise-like” (random? homogeneous?)
• Present in most datasets (comparison)
• Stop word identification relative to collection/query is independently relevant
  – Wilbur & Sirotkin 1992; Yang & Wilbur 1996
![Page 41: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/41.jpg)
Homogeneity
• Homogeneity Assumption
  – Bag of Words
  – Function word distribution
  – Content word distribution
• Measure of Heterogeneity as dataset profile
  – Kilgarriff & others 1992 onwards
  – Measure distance between corpora
  – Identify genre
![Page 42: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/42.jpg)
Heterogeneity Measures
• χ² (Kilgarriff 1997; Rose & Haddock 1997)
• G² (Rose & Haddock 1997; Rayson & Garside 2000)
• Correlation, Mann-Whitney (Kilgarriff 1996)
• Log-likelihood (Rayson & Garside 2000)
• Spearman’s S (Rose & Haddock 1997)
• Kullback-Leibler divergence (Cavaglia 2002)
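Two of the listed measures are easy to sketch from frequency counts. The code below is illustrative rather than a reproduction of any cited implementation: `g2` uses the standard two-corpus log-likelihood form (observed counts against counts expected from the pooled relative frequency), and `kl_divergence` applies add-one smoothing over the joint vocabulary, which is an assumption made here so that unseen terms do not produce infinite values.

```python
import math
from collections import Counter


def g2(a, b, size_a, size_b):
    """Log-likelihood for one term seen a times in corpus A, b times in B."""
    pooled = (a + b) / (size_a + size_b)
    total = 0.0
    for observed, size in ((a, size_a), (b, size_b)):
        expected = size * pooled
        if observed > 0:  # a zero count contributes nothing
            total += observed * math.log(observed / expected)
    return 2 * total


def kl_divergence(tokens_a, tokens_b):
    """D(A || B) in nats, add-one smoothed over the joint vocabulary."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(count_a) | set(count_b)
    n_a = len(tokens_a) + len(vocab)
    n_b = len(tokens_b) + len(vocab)
    divergence = 0.0
    for term in vocab:
        p = (count_a[term] + 1) / n_a
        q = (count_b[term] + 1) / n_b
        divergence += p * math.log(p / q)
    return divergence
```

Note that Kullback-Leibler divergence is asymmetric, so `kl_divergence(x, y)` and `kl_divergence(y, x)` generally differ; which direction is appropriate depends on which collection is treated as the reference.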
![Page 43: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/43.jpg)
Measuring Heterogeneity
• Divide corpus using 5000 word chunks in random halves
• Frequency list for each half
• Calculate χ² for term frequency distribution differences between halves
• Normalise for corpus length
• Iterate over successive random halves
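The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the names `chi_square` and `heterogeneity` are hypothetical, the comparison is restricted to the `n_terms` most frequent terms, and the normalisation step is assumed to be division by the degrees of freedom (number of terms minus one), since the slide only says "normalise for corpus length".

```python
import random
from collections import Counter


def chi_square(freq_a, freq_b, terms):
    """Pearson chi-square over a 2 x len(terms) table of term counts."""
    n_a = sum(freq_a[t] for t in terms)
    n_b = sum(freq_b[t] for t in terms)
    total = n_a + n_b
    stat = 0.0
    for t in terms:
        row = freq_a[t] + freq_b[t]
        for observed, n in ((freq_a[t], n_a), (freq_b[t], n_b)):
            expected = row * n / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def heterogeneity(tokens, chunk_size=5000, n_terms=500, iterations=10, seed=0):
    """Mean chi-square per degree of freedom over random half splits."""
    rng = random.Random(seed)
    pieces = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    terms = [t for t, _ in Counter(tokens).most_common(n_terms)]
    if len(pieces) < 2 or len(terms) < 2:
        raise ValueError("need at least two chunks and two terms")
    scores = []
    for _ in range(iterations):
        rng.shuffle(pieces)  # a fresh random assignment to halves
        mid = len(pieces) // 2
        half_a = Counter(tok for piece in pieces[:mid] for tok in piece)
        half_b = Counter(tok for piece in pieces[mid:] for tok in piece)
        scores.append(chi_square(half_a, half_b, terms) / (len(terms) - 1))
    return sum(scores) / len(scores)


# Every 4-token chunk is identical, so the halves never differ.
uniform = ["a", "b", "c", "a"] * 100
# Two chunks with disjoint vocabulary: maximally different halves.
split = ["a"] * 200 + ["b"] * 200
```

With this normalisation, `heterogeneity(uniform, chunk_size=4, n_terms=3)` is 0 and `heterogeneity(split, chunk_size=200, n_terms=2)` is 400; for data that really is homogeneous, a chi-square statistic divided by its degrees of freedom would be expected to sit near 1.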
![Page 44: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/44.jpg)
Measuring Heterogeneity
• Kilgarriff registers values of χ² statistic
• High value indicates high heterogeneity
• Finds high heterogeneity in all texts
![Page 45: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/45.jpg)
Defeating the Homogeneity Assumption
• Assume word distribution is homogeneous (bag of words)
• Explore chunk sizes
  – Chunk size 1 -> homogeneous (random)
  – Chunk size 5000 -> heterogeneous (Kilgarriff 1997)
• χ² test (statistic + p-value)
  – Defeat assumption with statistical relevance
  – Register differences between datasets
• Focus on frequent terms (!)
![Page 46: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/46.jpg)
Homogeneity detection at a level of statistical significance
• p-value: evidence for/against the hypothesis
  – < 0.1 -- weak evidence against
  – < 0.01 -- strong evidence against
  – < 0.001 -- very strong evidence against
  – < 0.05 -- significant (moderate evidence against the hypothesis)
• Indication of statistically significant non-homogeneity
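The thresholds above can be applied mechanically when reading the results tables. The helper below is a hypothetical convenience, not part of the original study: the function name and the returned labels are illustrative, and the label `"none"` for p-values of 0.1 or more is an assumption, since the slide only lists the cases with evidence against homogeneity.

```python
def evidence_against_homogeneity(p_value):
    """Map a p-value to the evidence labels listed on the slide."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value < 0.001:
        return "very strong"
    if p_value < 0.01:
        return "strong"
    if p_value < 0.05:
        return "moderate (significant)"
    if p_value < 0.1:
        return "weak"
    return "none"  # assumed label: homogeneity is not rejected
```

The checks run from the strictest threshold outwards so that each p-value receives the strongest label it qualifies for.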
![Page 47: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/47.jpg)
Dividing a Corpus
• docDiv: place documents in random halves
  – term distribution across documents
• halfdocDiv: place half documents in random halves
  – term distribution within the same document
• chunkDiv: place chunks (between 1 and 5000 words) in random halves
  – term distribution between text chunks (genre?)
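The three division schemes can be sketched as functions that each return two token lists. The function names mirror the slide's labels but the bodies are one possible reading: in particular, `half_doc_div` is assumed to cut every document at its midpoint and send the two halves to opposite sides, and `chunk_div` is assumed to ignore document boundaries when cutting chunks. Neither detail is spelled out on the slide.

```python
import random


def _flatten(parts):
    return [token for part in parts for token in part]


def _random_halves(units, rng):
    """Shuffle units and split them into two halves of tokens."""
    units = list(units)
    rng.shuffle(units)
    mid = len(units) // 2
    return _flatten(units[:mid]), _flatten(units[mid:])


def doc_div(docs, rng):
    """Whole documents go to one half or the other."""
    return _random_halves(docs, rng)


def half_doc_div(docs, rng):
    """Each document contributes one of its halves to each side."""
    side_a, side_b = [], []
    for doc in docs:
        mid = len(doc) // 2
        first, second = doc[:mid], doc[mid:]
        if rng.random() < 0.5:  # randomise which half goes where
            first, second = second, first
        side_a.extend(first)
        side_b.extend(second)
    return side_a, side_b


def chunk_div(docs, chunk_size, rng):
    """Fixed-size chunks of the concatenated text go to random halves."""
    tokens = _flatten(docs)
    chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    return _random_halves(chunks, rng)


docs = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
rng = random.Random(0)
```

Each pair of halves can then be passed to any of the heterogeneity measures; with `chunk_size=1`, `chunk_div` reduces to a random bag-of-words split.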
![Page 48: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/48.jpg)
Results DocDiv
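Number of Terms (N). Cells: statistic / p-value.

| Dataset | 10 | 20 | 50 | 100 | 500 | 1000 | 7000 | 20000 |
|---|---|---|---|---|---|---|---|---|
| AP | 2.107 / 0.1216 | 1.576 / 0.2139 | 2.583 / 0.0003 | 2.290 / 0 | 2.732 / 0 | 2.601 / 0 | 2.441 / 0 | 2.435 / 0 |
| DOE | 1.172 / 0.463 | 1.450 / 0.160 | 1.755 / 0.0259 | 1.983 / 0 | 1.838 / 0 | 1.786 / 0 | 1.795 / 0 | 1.872 / 0 |
| FR | 54.524 / 0 | 41.715 / 0 | 72.093 / 0 | 66.787 / 0 | 51.387 / 0 | 61.266 / 0 | 39.043 / 0 | 23.534 / 0 |
| PAT | 21.074 / 0 | 29.315 / 0 | 62.494 / 0 | 55.353 / 0 | 50.265 / 0 | 44.824 / 0 | 32.056 / 0 | 22.468 / 0 |
| SJM | 3.595 / 0.1193 | 2.768 / 0.0077 | 3.231 / 0 | 2.976 / 0 | 3.012 / 0 | 2.959 / 0 | 2.560 / 0 | 2.511 / 0 |
| WSJ | 2.358 / 0.178 | 2.663 / 0.0019 | 2.364 / 0 | 2.335 / 0 | 2.623 / 0 | 2.749 / 0 | 2.831 / 0 | 2.917 / 0 |
| ZF | 11.947 / 0 | 8.133 / 0 | 6.907 / 0 | 6.576 / 0 | 6.122 / 0 | 5.634 / 0 | 4.595 / 0 | 4.576 / 0 |
| OU | 232.913 / 0 | 158.520 / 0 | 94.749 / 0 | 67.293 / 0 | 32.663 / 0 | 25.181 / 0 | 14.224 / 0 | 8.297 / 0 |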
![Page 49: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/49.jpg)
Results HalfDocDiv
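Number of Terms (N). Cells: statistic / p-value.

| Dataset | 10 | 20 | 50 | 100 | 500 | 1000 | 7000 | 20000 |
|---|---|---|---|---|---|---|---|---|
| AP | 1.774 / 0.087 | 1.473 / 0.117 | 1.369 / 0.057 | 1.271 / 0.066 | 1.171 / 0.021 | 1.187 / 0.0001 | 1.147 / 0 | 1.136 / 0 |
| DOE | 0.728 / 0.655 | 0.931 / 0.533 | 1.054 / 0.438 | 1.043 / 0.372 | 1.061 / 0.195 | 1.027 / 0.285 | 1.014 / 0.271 | 1.01 / 0.182 |
| FR | 7.905 / 0.001 | 9.549 / 0 | 11.627 / 0 | 11.642 / 0 | 8.847 / 0 | 8.166 / 0 | 6.543 / 0 | 5.336 / 0 |
| PAT | 20.360 / 0 | 15.568 / 0 | 16.017 / 0 | 11.886 / 0 | 7.694 / 0 | 6.243 / 0 | 5.102 / 0 | 4.611 / 0 |
| SJM | 1.323 / 0.3860 | 1.569 / 0.3919 | 1.320 / 0.4436 | 1.469 / 0.1069 | 1.332 / 0 | 1.297 / 0 | 1.240 / 0 | 1.242 / 0 |
| WSJ | 1.563 / 0.279 | 1.618 / 0.248 | 1.342 / 0.203 | 1.298 / 0.260 | 1.236 / 0.017 | 1.210 / 0.0007 | 1.178 / 0 | 1.150 / 0 |
| ZF | 1.948 / 0.1288 | 1.858 / 0.116 | 1.709 / 0.0283 | 1.609 / 0.0240 | 1.559 / 0 | 1.598 / 0 | 1.536 / 0 | 1.556 / 0 |
| OU | 7.721 / 0.033 | 6.103 / 0.0025 | 8.091 / 0 | 8.216 / 0 | 6.366 / 0 | 5.502 / 0 | 4.223 / 0 | 3.087 / 0 |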
![Page 50: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/50.jpg)
Results ChunkDiv (5)
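Number of Terms (N). Cells: statistic / p-value.

| Dataset | 10 | 20 | 50 | 100 | 500 | 1000 | 7000 | 20000 |
|---|---|---|---|---|---|---|---|---|
| AP | 0.628 / 0.7516 | 0.836 / 0.6375 | 0.871 / 0.677 | 0.984 / 0.484 | 0.990 / 0.535 | 1.007 / 0.523 | 1.018 / 0.1595 | 1.012 / 0.179 |
| DOE | 1.141 / 0.3946 | 1.225 / 0.3461 | 1.151 / 0.2505 | 1.050 / 0.3540 | 1.038 / 0.4229 | 1.002 / 0.462 | 1.008 / 0.431 | 1.008 / 0.3667 |
| FR | 0.754 / 0.650 | 0.961 / 0.504 | 0.967 / 0.54 | 1.033 / 0.405 | 1.016 / 0.4174 | 1.025 / 0.335 | 1.022 / 0.2281 | 1.013 / 0.211 |
| PAT | 1.284 / 0.2451 | 1.457 / 0.091 | 1.255 / 0.2273 | 1.153 / 0.1862 | 1.051 / 0.226 | 1.007 / 0.429 | 1.008 / 0.330 | 1.020 / 0.077 |
| SJM | 1.204 / 0.429 | 1.175 / 0.375 | 1.226 / 0.293 | 1.127 / 0.268 | 0.979 / 0.608 | 1.004 / 0.454 | 1.012 / 0.262 | 1.010 / 0.181 |
| WSJ | 0.834 / 0.573 | 1.008 / 0.492 | 0.778 / 0.822 | 0.924 / 0.679 | 0.957 / 0.682 | 0.984 / 0.6202 | 1.000 / 0.498 | 1.01 / 0.252 |
| ZF | 0.861 / 0.5781 | 0.791 / 0.704 | 0.939 / 0.636 | 0.913 / 0.703 | 0.994 / 0.525 | 1.012 / 0.394 | 1.007 / 0.393 | 1.016 / 0.1258 |
| OU | 1.242 / 0.3395 | 1.257 / 0.271 | 1.165 / 0.234 | 1.023 / 0.424 | 1.081 / 0.118 | 1.054 / 0.142 | 1.042 / 0.034 | 1.033 / 0.005 |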
![Page 51: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/51.jpg)
Results: ChunkDiv (100)
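Number of Terms (N). Cells: statistic / p-value.

| Dataset | 10 | 20 | 50 | 100 | 500 | 1000 | 7000 | 20000 |
|---|---|---|---|---|---|---|---|---|
| AP | 0.824 / 0.6023 | 1.105 / 0.3560 | 1.412 / 0.0735 | 1.607 / 0.0019 | 1.471 / 0 | 1.372 / 0 | 1.3004 / 0 | 1.3026 / 0 |
| DOE | 1.102 / 0.3937 | 1.864 / 0.0280 | 1.646 / 0.0231 | 1.511 / 0.0317 | 1.354 / 0.0299 | 1.414 / 0 | 1.4013 / 0 | 1.424 / 0 |
| FR | 1.006 / 0.5071 | 1.441 / 0.229 | 1.608 / 0.076 | 1.803 / 0.025 | 1.924 / 0 | 1.834 / 0 | 1.782 / 0 | 1.746 / 0 |
| PAT | 4.181 / 0.0232 | 3.051 / 0.0025 | 2.682 / 0.0007 | 2.420 / 0 | 2.252 / 0 | 2.104 / 0 | 1.977 / 0 | 1.876 / 0 |
| SJM | 0.995 / 0.4720 | 1.117 / 0.3851 | 1.146 / 0.3203 | 1.180 / 0.2463 | 1.410 / 0 | 1.402 / 0 | 1.317 / 0 | 1.291 / 0 |
| WSJ | 1.112 / 0.3741 | 1.213 / 0.324 | 1.198 / 0.2426 | 1.230 / 0.0937 | 1.196 / 0.0383 | 1.283 / 0 | 1.2902 / 0 | 1.319 / 0 |
| ZF | 1.576 / 0.4152 | 1.283 / 0.366 | 1.709 / 0.011 | 2.190 / 0 | 1.41 / 0 | 1.673 / 0 | 1.315 / 0 | 1.884 / 0 |
| OU | 6.231 / 0.0004 | 5.657 / 0 | 4.870 / 0 | 4.278 / 0 | 3.310 / 0 | 2.733 / 0 | 2.261 / 0 | 1.865 / 0 |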
![Page 52: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/52.jpg)
Results
• docDiv:
  – heterogeneity across most documents, except:
    • AP and DOE (20 terms or fewer)
• halfdocDiv:
  – tests sensitive to certain document types
    • DOE very homogeneous
    • PAT and OU very heterogeneous
• chunkDiv:
  – chunk length vs. document boundary?
  – similar behaviour of WSJ and SJM
(Intranet data gives extreme results. How transferable is corpus based training?)
![Page 53: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/53.jpg)
Pro
• Heterogeneity test reasonable profiling measure?
  – sensitive to document types
    • e.g. different behaviour for halfdocDiv
  – cheap to implement
  – relation between measure and p-value
![Page 54: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/54.jpg)
Drawbacks
• Frequency based
• Coarse grained
• Not homogeneous = bursty
• Bursty in what way?
• Useful for applications?
![Page 55: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/55.jpg)
Profiling by Measuring Burstiness
• Pioneer’s agenda: Clumps!
  – Sparck-Jones & Needham 1964
• Models
  – Two Poisson (Church 2000)
  – K-mixtures (Katz 1996)
  – Exponential mixtures (Sarkar et al 2006)
![Page 56: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/56.jpg)
Sarkar Burstiness Model
• Model gaps (not term occurrence)
• Mixture of exponential distributions
• Between-burst (1/λ₁, or λ₁′)

• Within-burst (1/λ₂, or λ₂′)
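A two-component exponential mixture over gaps can be written down directly. The sketch below is an illustration under assumptions, not the published model code: the symbol λ is a reconstruction of characters lost from the slide text, `p` is taken to be the probability that a gap is a between-burst gap, and the function names are hypothetical. Under these assumptions the density is f(x) = p·λ₁·exp(-λ₁x) + (1-p)·λ₂·exp(-λ₂x), with mean gaps 1/λ₁ between bursts and 1/λ₂ within bursts.

```python
import math
import random


def gap_density(x, p, lam1, lam2):
    """Mixture density of a gap of length x (x >= 0)."""
    return p * lam1 * math.exp(-lam1 * x) + (1 - p) * lam2 * math.exp(-lam2 * x)


def mean_gap(p, lam1, lam2):
    """Expected gap: p / lam1 + (1 - p) / lam2."""
    return p / lam1 + (1 - p) / lam2


def sample_gaps(n, p, lam1, lam2, rng):
    """Draw n gaps: pick a component, then an exponential gap from it."""
    return [
        rng.expovariate(lam1 if rng.random() < p else lam2)
        for _ in range(n)
    ]


# Illustrative parameters: long gaps between bursts (mean 100 words),
# short gaps within bursts (mean 10 words), equally likely.
p, lam1, lam2 = 0.5, 0.01, 0.1
gaps = sample_gaps(20000, p, lam1, lam2, random.Random(0))
```

Simulated gaps like these are useful for checking that an estimation procedure recovers known parameters before it is applied to real term positions.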
![Page 57: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/57.jpg)
Burstiness Model
• First occurrence
• No occurrence: censoring
![Page 58: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/58.jpg)
Burstiness Model
• Bayesian estimation
  – posterior ∝ prior × likelihood
  – choose uninformative prior
  – estimate posterior using Gibbs Sampling (MCMC)
  – WinBUGS software
  – 1000 iteration burn-in
  – further 5000 iterations for estimate
![Page 59: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/59.jpg)
Burstiness Model
• Word behaviour hypotheses
– Small λ₁′, small λ₂′: frequently occurring (function?) word

– Large λ₁′, small λ₂′: bursty (content?) word

– Small λ₁′, large λ₂′: frequent but well spaced (function?) word

– Large λ₁′, large λ₂′: infrequent scattered (function?) word

– p′: proportion of times term does not occur in a burst
– 1−p′: proportion of times term appears in a burst
![Page 60: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/60.jpg)
![Page 61: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/61.jpg)
![Page 62: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/62.jpg)
Very frequent function words
![Page 63: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/63.jpg)
Less frequent function words
![Page 64: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/64.jpg)
Style indicative terms
![Page 65: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/65.jpg)
Content terms
![Page 66: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/66.jpg)
What now?
• Experimental verification.
• Other aspects:
  – Coverage (narrow or broad)
  – Lay-out and meta data
  – Language
  – Links and mark-up
![Page 68: The Case for Corpus Profiling](https://reader035.vdocuments.site/reader035/viewer/2022062409/56815118550346895dbf366e/html5/thumbnails/68.jpg)
Conclusions
• NLP/IR papered over the elephant in the room
• Dataset profiling can be a useful way of augmenting known results
• Profiles have to be relative to task
• Measures have to be informative
• Finding effective profiling measures is a substantial, difficult, essential research agenda