![Page 1: Statistical Measures for Corpus Profiling Michael P. Oakes University of Sunderland Corpus Profiling Workshop, 2008](https://reader035.vdocuments.site/reader035/viewer/2022062714/56649d485503460f94a241a0/html5/thumbnails/1.jpg)
Statistical Measures for Corpus Profiling
Michael P. Oakes
University of Sunderland
Corpus Profiling Workshop, 2008.
Contents
• Why study differences between corpora? (Kilgarriff, 2001)
• Case study in parsing (Sekine, 1997).
• Words and “countable linguistic features”.
• Overall differences between corpora and contributions of individual features:
  – Information theory
  – Chi-squared test
  – Factor Analysis
• “Gold standard” comparison of measures (Kilgarriff, 2001).
Why study differences between corpora?
• Kilgarriff (2001), “Comparing Corpora”, Int. J. Corpus Linguistics 6(1), pp. 97-133.
• Taxonomise the field: how does a new corpus stand in relation to existing ones?
• If an interesting finding is found for one corpus, for what other corpora does it hold?
• Is a new corpus sufficiently different from ones you have already got to be worth acquiring?
• Difficulty of porting an existing NLP system to a new corpus: time and cost are measurable.
Different Text Types
• Englishes of the world, e.g. US vs. UK (Hofland and Johansson, 1982)
• Social differentiation e.g. gender, age, social class (Rayson, Leech and Hodges 1997), diachronic, geographical location.
• Stylometry, e.g. disputed authorship
• Genre analysis, e.g. science fiction, e-shop (Santini, 2006)
• Sentiment analysis (Westerveld, 2008).
• Relevant vs. non-relevant documents? Probabilistic IR.
• Statistical techniques exist to discriminate between these text types. Here the interest is in the types of language per se, rather than their amenability to NLP tools.
Words and countable linguistic features
• Bits of words e.g. 2-grams (Kjell, 1994)
• Words (many studies)
• Linguistic features for Factor Analysis (Biber, 1995) e.g. questions, past participles.
• Phrase rewrite rules (Sekine 1997, Baayen, van Halteren and Tweedie, 1996).
• Any countable feature characteristic of one corpus as opposed to another.
• Not hapax legomena, Semitisms in the New Testament.
Domain independence of parsing (Sekine, 1997)
• Used 8 genres from the Brown Corpus, chosen to give equal amounts of fiction (KLNP) and non-fiction (ABEJ).
• Characterised domains by production rules which fire.
• From this data produced a matrix of Cross Entropy of grammar across domains.
• Then average linking of the domains based on the matrix of cross entropy gave intuitively reasonable results.
• Evaluated the effect of (training / test) corpus difference on parser performance.
• Discussed size of the training corpus.
$$H(X, q) = -\sum_{x} p(x)\,\log_2 q(x)$$
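The cross-entropy formula above can be sketched in a few lines of Python. This is a minimal illustration rather than Sekine's implementation: the rule counts are hypothetical, and the additive smoothing (the `alpha` parameter) is an assumption introduced here so that a rule unseen in one domain does not give a zero probability.

```python
import math
from collections import Counter

def rule_distribution(rule_counts, vocabulary, alpha=1.0):
    """Relative frequencies over a shared rule vocabulary.

    Additive smoothing with `alpha` is an assumption of this sketch, used so
    that every rule has a non-zero probability in every domain.
    """
    total = sum(rule_counts.values()) + alpha * len(vocabulary)
    return {r: (rule_counts.get(r, 0) + alpha) / total for r in vocabulary}

def cross_entropy(p, q):
    """H(X, q) = -sum_x p(x) * log2 q(x), in bits."""
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

# Hypothetical production-rule counts for two domains (not Sekine's data).
domain_a = Counter({"PP -> IN NP": 84, "NP -> NN PX": 54, "S -> S": 51, "S -> NP VP": 43})
domain_b = Counter({"NP -> PRP": 95, "PP -> IN NP": 58, "S -> NP VP": 58, "S -> S": 54})

rules = set(domain_a) | set(domain_b)
p_a = rule_distribution(domain_a, rules)
p_b = rule_distribution(domain_b, rules)

h_aa = cross_entropy(p_a, p_a)  # domain A scored against its own model
h_ab = cross_entropy(p_a, p_b)  # domain A scored against the model of domain B
print(f"H(A, A) = {h_aa:.3f} bits, H(A, B) = {h_ab:.3f} bits")
```

Scoring a domain against its own distribution gives the lowest possible value, so the gap between `h_ab` and `h_aa` indicates how poorly the grammar of B predicts the rules of A.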
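| Broad Text Category | Genre | Texts in Brown | Texts in LOB |
|---|---|---|---|
| Press | A Reportage | 44 | 44 |
| Press | B Editorial | 27 | 27 |
| Press | C Reviews | 17 | 17 |
| General Prose | D Religion | 17 | 17 |
| General Prose | E Skills, Trades, Hobbies | 36 | 38 |
| General Prose | F Popular Lore | 48 | 44 |
| General Prose | G Belles Lettres, Biographies, Essays | 75 | 77 |
| General Prose | H Miscellaneous | 30 | 30 |
| General Prose | J Academic Prose | 80 | 80 |
| Fiction | K General Fiction | 29 | 29 |
| Fiction | L Mystery and Detective | 24 | 24 |
| Fiction | M Science Fiction | 6 | 6 |
| Fiction | N Adventure and Western | 29 | 29 |
| Fiction | P Romance and Love Story | 29 | 29 |
| Fiction | R Humour | 9 | 9 |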
Sekine characterised domains by production rules which fire
| Domain A | Domain B |
|---|---|
| PP → IN NP (8.40%) | NP → PRP (9.52%) |
| NP → NN PX (5.42%) | PP → IN NP (5.79%) |
| S → S (5.06%) | S → NP VP (5.77%) |
| S → NP VP (4.28%) | S → S (5.37%) |
| NP → DT NNX (3.81%) | NP → DT NNX (3.90%) |
Sekine: Cross-Entropy of Grammar Across Domains
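| T/M | A | B | E | J | K | L | N | P |
|---|---|---|---|---|---|---|---|---|
| A | 5.13 | 5.35 | 5.41 | 5.45 | 5.51 | 5.52 | 5.53 | 5.55 |
| B | 5.47 | 5.19 | 5.50 | 5.51 | 5.55 | 5.58 | 5.60 | 5.60 |
| E | 5.50 | 5.48 | 5.20 | 5.48 | 5.58 | 5.59 | 5.58 | 5.61 |
| J | 5.39 | 5.37 | 5.35 | 5.15 | 5.52 | 5.57 | 5.58 | 5.61 |
| K | 5.32 | 5.25 | 5.31 | 5.41 | 4.95 | 5.14 | 5.15 | 5.17 |
| L | 5.32 | 5.25 | 5.31 | 5.41 | 4.95 | 5.14 | 5.15 | 5.17 |
| N | 5.29 | 5.25 | 5.28 | 5.43 | 5.10 | 5.06 | 4.89 | 5.12 |
| P | 5.43 | 5.36 | 5.40 | 5.55 | 5.23 | 5.21 | 5.21 | 5.00 |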
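As a rough illustration of average linking over a cross-entropy matrix, the sketch below clusters four of the domains (A, B, K, N) using the values from the table above. The matrix is not symmetric, so the two directions are averaged to obtain a distance; that symmetrisation is an assumption of this sketch, and the slides do not say how Sekine handled it.

```python
# Cross-entropy values for four domains, copied from the table above.
cross = {
    "A": {"A": 5.13, "B": 5.35, "K": 5.51, "N": 5.53},
    "B": {"A": 5.47, "B": 5.19, "K": 5.55, "N": 5.60},
    "K": {"A": 5.32, "B": 5.25, "K": 4.95, "N": 5.15},
    "N": {"A": 5.29, "B": 5.25, "K": 5.10, "N": 4.89},
}

def symmetric_distance(a, b):
    """Average of both directions (an assumption of this sketch)."""
    return (cross[a][b] + cross[b][a]) / 2

def average_linkage(items):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    mean pairwise distance is smallest. Returns the merges in order."""
    clusters = [(item,) for item in items]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(symmetric_distance(a, b) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], round(d, 4)))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

merges = average_linkage(["A", "B", "K", "N"])
for left, right, distance in merges:
    print(left, "+", right, "at", distance)
```

On these four domains the two fiction genres (K, N) merge first and the two press genres (A, B) second, which matches the fiction versus non-fiction split described earlier.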
Overall differences between corpora and contributions of individual features.
• Vocabulary richness (e.g. type/token ratio, Yule’s K Characteristic, V2/N) is a characteristic of the entire corpus. It places all corpora on a single linear scale.
• The techniques we will look at (chi-squared, information theoretic and factor analysis) can both give a value for the overall difference between two corpora, and quantify the contributions made by individual features.
Measures of Vocabulary Richness
• Yule’s K characteristic: K = 10000 * (M2 - M1) / (M1 * M1); M1 = number of tokens; M2 = (V1 * 1²) + (V2 * 2²) + (V3 * 3²) …, where Vi is the number of word types occurring exactly i times.
• Gerson 35.9, Kempis 59.7, De Imitatione Christi 84.2
• Heaps’ Law: Vocabulary size as a function of text size, M = kT^b. Parameters k and b could discriminate texts, and allow them to be plotted in two dimensions.
• Entropy is a form of vocabulary richness (but high individual contributions from both common and rare words).
$$\text{Entropy} = -\sum_{i} p(i)\,\log_2 p(i)$$
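Two of the measures above, Yule's K and entropy, can be computed directly from a token list. The following is a minimal sketch on a toy sentence; the function names and the whitespace tokenisation are choices made here for illustration, not part of the original slides.

```python
import math
from collections import Counter

def yules_k(tokens):
    """K = 10000 * (M2 - M1) / (M1 * M1), with M1 = number of tokens and
    M2 = sum over i of V_i * i^2, where V_i = number of types seen i times."""
    freqs = Counter(tokens)
    m1 = len(tokens)
    spectrum = Counter(freqs.values())  # maps frequency i to V_i
    m2 = sum(v_i * i * i for i, v_i in spectrum.items())
    return 10000 * (m2 - m1) / (m1 * m1)

def entropy(tokens):
    """Entropy of the word distribution in bits: -sum_i p(i) * log2 p(i)."""
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in Counter(tokens).values())

# Toy text with naive whitespace tokenisation (illustration only).
tokens = "the cat sat on the mat and the dog sat on the log".split()
print(f"Yule's K = {yules_k(tokens):.1f}")
print(f"Entropy  = {entropy(tokens):.3f} bits")
```

A higher K indicates more repetition, that is, a less rich vocabulary; entropy moves in the opposite direction, rising as the word distribution becomes more even.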
The chi-squared test (Oakes and Farrow, 2006): (O - E)² / E values for three words in five balanced corpora (Σ (O - E)²/E = 414916.8)

| Word | Australian | British | US | Indian | NZ |
|---|---|---|---|---|---|
| A | 12.68 | 1.36 | 2.55 | 76.65 | 8.33 |
| Commonwealth | 399.63 | 31.20 | 32.95 | 19.84 | 2.16 |
| zzzzooop | - | - | - | - | - |
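The per-cell (O - E)² / E values in a table like the one above come from a word-by-corpus contingency table. The sketch below shows the arithmetic on hypothetical counts for two corpora; it assumes every word occurs at least once and every corpus is non-empty, so that no expected value is zero.

```python
def chi_squared_contributions(counts):
    """counts maps each word to its list of frequencies, one per corpus.

    Returns, for each word, the list of (O - E)^2 / E values per corpus,
    where E = row total * column total / grand total.
    """
    n_corpora = len(next(iter(counts.values())))
    col_totals = [sum(row[j] for row in counts.values()) for j in range(n_corpora)]
    grand_total = sum(col_totals)
    cells = {}
    for word, row in counts.items():
        row_total = sum(row)
        contributions = []
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / grand_total
            contributions.append((observed - expected) ** 2 / expected)
        cells[word] = contributions
    return cells

# Hypothetical counts in two corpora of 1000 tokens each (not the slide's data).
counts = {"commonwealth": [30, 10], "other": [970, 990]}
cells = chi_squared_contributions(counts)
overall = sum(sum(row) for row in cells.values())
print(cells["commonwealth"])           # contribution of one word per corpus
print(f"chi-squared = {overall:.3f}")  # overall difference between the corpora
```

Summing all cells gives the overall chi-squared value for the difference between the corpora, while the individual cells show which words in which corpus contribute most to it.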
Measures from Information Theory (Dagan et al., 1997)
• Kullback Leibler (KL) divergence (also called relative entropy) used as a measure of semantic similarity by Dagan et al., 1997.
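• Meaning in coding theory: the average number of extra bits needed to code events drawn from p using a code optimised for q.
• Problems: the value is infinite if any word has frequency 0 in corpus B and frequency > 0 in corpus A, and the measure is not symmetrical.
• Dagan et al. (1997): Information Radius, which is symmetrical and always finite.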
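$$D(p \,\|\, q) = \sum_{i} p_i \log_2 \frac{p_i}{q_i}$$

$$\mathrm{IRad}(p, q) = D\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + D\!\left(q \,\Big\|\, \frac{p+q}{2}\right)$$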
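Both formulas translate almost directly into code. The sketch below is an illustration on two hypothetical three-word distributions; it shows the infinite KL value when a word is missing from the second corpus, and that the Information Radius stays finite and symmetrical.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log2(p_i / q_i), in bits.

    Returns infinity when some word has p_i > 0 but q_i = 0.
    """
    total = 0.0
    for word, p_i in p.items():
        if p_i == 0:
            continue  # terms with p_i = 0 contribute nothing
        q_i = q.get(word, 0.0)
        if q_i == 0:
            return math.inf
        total += p_i * math.log2(p_i / q_i)
    return total

def information_radius(p, q):
    """IRad(p, q) = D(p || m) + D(q || m), where m = (p + q) / 2."""
    words = set(p) | set(q)
    mean = {w: (p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in words}
    return kl_divergence(p, mean) + kl_divergence(q, mean)

# Hypothetical relative word frequencies in two corpora.
p = {"the": 0.5, "she": 0.5}
q = {"the": 0.5, "he": 0.5}

print(kl_divergence(p, q))       # inf: "she" never occurs in q
print(information_radius(p, q))  # finite, and the same as information_radius(q, p)
```

Because the mean distribution is non-zero wherever p or q is non-zero, the logarithm inside each term is always defined, which is what removes the infinity problem.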
Information Radius
• L (Fiction: detective) and P (Fiction: romance): 0.180
• A (Press reportage) and B (Press editorial): 0.257
• J (Academic prose) and P (Fiction: romance): 0.572
Detective versus Romantic Fiction
| Word | Detective | Romance | Word | Detective | Romance |
|---|---|---|---|---|---|
| The | .00821 | -.00732 | Her | .00819 | -.00522 |
| Of | .00308 | -.00277 | She | .00784 | -.00535 |
| A | .00280 | -.00257 | You | .00453 | -.00345 |
| Was | .00180 | -.00172 | To | .00235 | -.00229 |
| It | .00161 | -.00148 | Be | .00128 | -.00110 |
| He | .00157 | -.00148 | They | .00126 | -.00097 |
| On | .00110 | -.00099 | Would | .00121 | -.00097 |
| Been | .00106 | -.00089 | Are | .00087 | -.00056 |
| Man | .00089 | -.00061 | Your | .00084 | -.00062 |
| Money | .00065 | -.00034 | Love | .00081 | -.00039 |
Factor Analysis
• Decathlon analogy: running, jumping and throwing.
• Biber (1988): groups of countable features which consistently co-occur in texts are said to define a “linguistic dimension”.
• Such features are said to have positive loadings with respect to that dimension, but dimensions can also be defined by features which are in “complementary distributions”, i.e. negatively loaded.
• Example: at one pole is “many pronouns and contractions”, near which lie conversational texts and panel discussions. At the other pole, “few pronouns and contractions”, lie scientific texts and fiction.
Evaluation of Measures (Kilgarriff 2001)
• Reference corpus made up of known proportions of two corpora: 100% A, 0% B; 90% A, 10% B; 80% A, 20% B …
• This gives a set of “gold standard” judgements: subcorpus 1 is more like subcorpus 2 than subcorpus 3, etc.
• Compare machine ranking of corpora with the gold standard ranking using Spearman’s rank correlation coefficient.
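The final comparison step can be sketched as follows. This is a minimal illustration of Spearman's rank correlation using the standard formula for rankings without ties; the two rankings are hypothetical, not Kilgarriff's results.

```python
def spearman_rho(gold_ranks, machine_ranks):
    """Spearman's rank correlation for two rankings of the same n items.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which assumes no tied ranks.
    """
    n = len(gold_ranks)
    d_squared = sum((g - m) ** 2 for g, m in zip(gold_ranks, machine_ranks))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical rankings of five corpus pairs, from most to least similar.
gold = [1, 2, 3, 4, 5]     # order implied by the known mixing proportions
machine = [1, 3, 2, 4, 5]  # order produced by some similarity measure

print(spearman_rho(gold, machine))
```

A value of 1 means the measure reproduces the gold standard order exactly, and a value of -1 means it reverses it, so measures can be compared by how close their coefficient is to 1.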
Conclusions
• Some measures allow comparisons of entire corpora, others enable the identification of typical features.
• Different measures allow different kinds of maps: vocabulary richness allows ranking of corpora on a linear scale, Heaps’ Law a 2D map of two parameters. Information theoretic measures give the (dis)similarity between two corpora, best viewed using clustering. With Factor Analysis, you don’t know what the dimensions are until you’ve done it.
• Such maps make it possible to draw contours of application success.