Comparable Corpora Within and Across Languages, Word Frequency Lists and the KELLY Project
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass
Universities of Leeds and Sussex
Malta, May 2010 Kilgarriff: BUCC 2
Two corpora are comparable iff
roughly the same text types, subject matter, proportions
• Applicable where
– Different languages
– Same language
• Comparable = similar
– Any corpus is entirely similar to itself
Comparing Corpora
• Input
– Word freq list for c1
– Word freq list for c2
• For the top 500 words, compute the sum of (observed − expected)² / expected
• Chi-square-based
– Discriminates well
– Better than Spearman rank correlation, cross-entropy
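A minimal Python sketch of the chi-square-based measure. The slide does not say how the expected values are obtained; this sketch assumes they come from pooling the two lists and sharing each word's pooled count out in proportion to corpus size. The function name `chi_square_distance` and that pooling step are illustrative assumptions, not the author's code.

```python
from collections import Counter


def chi_square_distance(freq1, freq2, top_n=500):
    """Sum of (observed - expected)^2 / expected over the top_n pooled words.

    freq1, freq2: dicts mapping word -> raw count in corpus 1 and corpus 2.
    Assumption: expected counts are the pooled count split by corpus size.
    """
    n1 = sum(freq1.values())
    n2 = sum(freq2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both frequency lists must be non-empty")

    pooled = Counter(freq1) + Counter(freq2)
    score = 0.0
    for word, total in pooled.most_common(top_n):
        # One term per corpus for each word.
        for observed, size in ((freq1.get(word, 0), n1), (freq2.get(word, 0), n2)):
            expected = total * size / (n1 + n2)
            score += (observed - expected) ** 2 / expected
    return score
```

Under these assumptions, two identical lists score 0, and the score grows as the lists diverge.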
1990s work
• Then
– Very few corpora
– Purely theoretical interest
• Now
– Web: lots of corpora, created to spec
– Compare…
– first question to ask about a new corpus
(Monolingual) Word Lists
• Define a syllabus
• Which words get used in
– Learning-to-read books (NS children)
– NNS language learner textbooks
– Dictionaries
– Language testing
• NS: educational psychologists
• NNS: proficiency levels
Should be corpus-based
• Most aren't
– Corpora are quite new
• Easy to do better
• People will use them
– Maybe also governments
How
• Take your corpus
• Count
• Voilà
Complications
• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy
All are slightly different issues for each lg
What is a word; delimiters
• Found between spaces
– Not for Chinese: segmentation
• English
– co-operate, widely-held, farmer's, can't
• Norwegian, Swedish
– Compounding, separable verbs
• Arabic, Italian
– Clitics, al, ...
...
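To make the delimiter problem concrete, here is a small Python sketch contrasting "found between spaces" with one possible tokenisation policy for the English examples. The sample sentence and the regular expression are illustrative assumptions; they are one choice among many, not the policy used in the talk.

```python
import re

# Invented sample sentence built from the slide's English examples.
sentence = "The farmer's widely-held view: we can't co-operate."

# Naive policy: a word is whatever is found between spaces,
# so punctuation stays attached to the neighbouring word.
by_spaces = sentence.split()

# One alternative policy: letters, optionally joined by internal
# hyphens or apostrophes, form a single token; punctuation is dropped.
WORD = re.compile(r"[A-Za-z]+(?:[-'][A-Za-z]+)*")
by_regex = WORD.findall(sentence)

print(by_spaces)
print(by_regex)
```

Whether "farmer's" should count as one item or two ("farmer" plus a clitic) is exactly the kind of decision each language's list needs a stated policy for.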
Words and lemmas
• Word form (in text)
– invading
• Lemma (dictionary headword)
– invade, for the forms invade, invades, invaded, invading
• Lemmatisation
– Chinese: none; English: simple
– Middling: Swedish, Norwegian, Italian, Greek
– Tough: Russian, Polish, Arabic
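As a sketch of what counting by lemma rather than by word form involves, the Python below folds the forms of "invade" onto one headword. The lookup table `LEMMA_OF` is a hypothetical stand-in for a real lemmatiser, which would supply this mapping for the whole vocabulary.

```python
from collections import Counter

# Hypothetical form-to-lemma table (assumption: a lemmatiser provides this).
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}


def lemma_frequencies(tokens):
    """Count tokens by lemma; unknown forms fall back to the lower-cased form."""
    return Counter(LEMMA_OF.get(tok.lower(), tok.lower()) for tok in tokens)


counts = lemma_frequencies("They invaded once and are invading again".split())
print(counts["invade"])
```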
Word Families
• Derivational morphology
– efficient/efficiently
– access/accessible/accessibility
– available/availability/unavailable
• ‘Word families’ tradition
– e.g. Coxhead, Academic Word List
• Pedagogy: one item to learn
• But
– Where do families end?
– Different meanings
Grammatical classes
• brush (verb) and brush (noun)
– Same item or different?
– (both in same word family)
• Required
– (short) list of word classes
– POS-tagger
– Will make mistakes
Marginal cases
• Numbers
– twelve, seventeenth, fifties
• Closed sets
– Days of week, months
• Countries
– Capitals, nationalities, currencies, adjectives, languages
• regional/dialects, political groups, religions
– easter, christmas, islam, republican
• policies always needed
Multiwords
• "according to"
– Linguistically a word
• but
– Multiword frequency list: top item is "of the"
– Can't use freqs (alone) to select multiwords
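The point about raw frequency can be seen in a toy Python sketch that counts adjacent word pairs. The sample text is invented for illustration; in it, as in real corpora, a grammatical sequence like "of the" outranks a genuine multiword like "according to".

```python
from collections import Counter


def bigram_frequencies(tokens):
    """Count every pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Invented sample text (assumption), already lower-cased and tokenised.
tokens = (
    "according to the head of the school the end of the term "
    "is the start of the holidays"
).split()

freqs = bigram_frequencies(tokens)
top = freqs.most_common(1)
print(top)
print(freqs[("according", "to")])
```

Selecting multiwords therefore needs something beyond the count itself, for example an association score or lexicographic judgement.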
Homonymy
• bank (river) and bank (money)
• Word sense disambiguation
– We can't do it (with decent accuracy)
– We can't give freqs for senses
• Lists of words, not meanings
– Sometimes disconcerting
Corpora
• A fairly arbitrary sample of a lg
• To limit arbitrariness of wordlist
– Make it big and diverse
• WACKY corpora
– From web
– Can do for any language
– ??? Comparable ???
– Web language: less formal
Comparing corpora
• Corpora: new
– We are all beginners
• Best way to get sense of a corpus
– Compare with another
– Keywords of each vs. other
• Case studies
• Sketch Engine functions
Comparing frequency lists
• Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10¹²) words of English
• that’s 1,000,000,000,000
• Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
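The comparison steps above can be sketched in Python as follows. The slide does not say how the ratio is normalised; this sketch assumes relative frequencies (count divided by the total of each list), and the function name and parameters are illustrative. The defaults mirror the slide's top 50,000 items and 50 words at each extreme.

```python
def compare_lists(freq_a, freq_b, top_n=50_000, k=50):
    """Return (only_in_a, highest_ratio, lowest_ratio) for two frequency dicts.

    only_in_a: words in the top_n of list A but not in the top_n of list B.
    highest_ratio / lowest_ratio: the k shared words with the highest / lowest
    A:B ratio of relative frequency (assumed normalisation).
    """
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())
    top_a = set(sorted(freq_a, key=freq_a.get, reverse=True)[:top_n])
    top_b = set(sorted(freq_b, key=freq_b.get, reverse=True)[:top_n])

    only_in_a = top_a - top_b
    ratios = {
        w: (freq_a[w] / size_a) / (freq_b[w] / size_b)
        for w in top_a & top_b
        if freq_b[w] > 0
    }
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    return only_in_a, ranked[:k], ranked[-k:]


# Toy counts, invented for illustration only.
web = {"the": 100, "browser": 50, "url": 40, "she": 10}
bnc = {"the": 100, "she": 60, "council": 35, "browser": 5}
only_web, web_high, web_low = compare_lists(web, bnc, top_n=4, k=1)
```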
Web-high (155 terms)
• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English (incl Spanish influence – los)
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein
Web-low
• Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
• Pronouns and past tense verbs
– Fiction
• Masc vs fem
• Yesterday
– Probably daily newspapers
• Constancy of ratios:
– He/him/himself
– She/her/herself
Corpus Factory
• Many languages
• General corpus, 100m+ words
• Fast
• High quality
• Comparable across languages
Gather Seed words
• Wikipedia (Wiki) corpora
– many domains
– free
– 265 languages covered, more to come
• Extract text from Wiki
– Wikipedia 2 Text
• Tokenise the text
– Morphology of the language is important
– Can use existing word tokeniser tools
Web Corpus Statistics
Language    Unique URLs collected  After filtering  After de-duplication  Web corpus size  Words
Dutch       97,584                 22,424           19,708                739 MB           108.6 m
Hindi       71,613                 20,051           13,321                424 MB           30.6 m
Telugu      37,864                 6,178            5,131                 107 MB           3.4 m
Thai        120,314                23,320           20,998                1.2 GB           81.8 m
Vietnamese  106,076                27,728           19,646                1.2 GB           149 m
Evaluation
• For each of the languages, two corpora available: Web and Wiki
– Dutch: also a carefully designed lexicographic corpus
• Hypothesis: Wiki corpora are ‘informational’
– Informational --> typical written
– Interactional --> typical spoken
Evaluation
• 1st and 2nd person pronouns are strong indicators of interactional language
– English: I me my mine you your yours we us our
• For each language
– Take the ten commonest 1st and 2nd person pronouns
– For each, calculate ratio: web:wiki
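A hedged Python sketch of the ratio calculation. It assumes the ratio is taken between frequencies normalised per million words, since the web and Wiki corpora differ in size; the slide does not state the normalisation. All function names and the toy counts are illustrative.

```python
def per_million(count, corpus_size):
    """Frequency per million words."""
    return count * 1_000_000 / corpus_size


def web_wiki_ratios(web_counts, web_size, wiki_counts, wiki_size, pronouns):
    """web:wiki ratio of normalised frequency for each pronoun."""
    ratios = {}
    for p in pronouns:
        wiki_rate = per_million(wiki_counts.get(p, 0), wiki_size)
        if wiki_rate == 0:
            continue  # ratio undefined when the pronoun is absent from Wiki
        ratios[p] = per_million(web_counts.get(p, 0), web_size) / wiki_rate
    return ratios


def summarise(ratios):
    """Average, minimum and maximum ratio across the pronouns."""
    values = list(ratios.values())
    return sum(values) / len(values), min(values), max(values)


# Toy counts, invented for illustration only.
ratios = web_wiki_ratios(
    {"i": 300, "you": 200}, 1_000_000,
    {"i": 100, "you": 50}, 1_000_000,
    ["i", "you"],
)
summary = summarise(ratios)
```

A ratio above 1 means the pronoun is relatively more frequent in the web corpus, which is what the hypothesis predicts if Wiki text is more 'informational'.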
Results: ratios, web:wiki
Language Average Min Max
Dutch 2.98 2.03 10.03
Hindi 5.36 1.85 11.50
Telugu 4.96 0.54 7.34
Thai 2.40 0.63 7.87
Vietnamese 3.82 1.81 19.41
KELLY
• EU lifelong learning project
• Goal: wordcards
– Word in one lg on one side, other on other
– Language learning
• 9 languages, 36 pairs
– Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish
• Partners in 6 countries
Method
• Prepare monolingual lists
• Translate
– Each into 8 target languages
– Professional translation services
• Integrate, finalise
• Produce cards
• Goal for each set
– 9000 pairs at 6 levels
Stages
• Sort out corpora, tagging
• Automatically generate M1 lists
– names, numbers, countries ...
– keywords vis-a-vis other corpora
• Review, compare, prepare M2 lists
• Translate
• Use translations: M3 lists
• Finalise
review - how?
• points system
– 2 points for each of 6 levels
– 12 points for most freq words
• deduct points for words in over-represented areas
• add in words from other corpora
Translation database
• On the web
• All translations entered into it
• Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
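The first query type can be sketched in Python over an in-memory list of translation records. The record layout (source language, source word, target language, target word), the language codes and the sample rows are assumptions made for illustration; the actual database schema is not described in the slides.

```python
from collections import Counter


def frequent_targets(records, target_lang, min_times):
    """Words in target_lang used as translations more than min_times times.

    records: iterable of (source_lang, source_word, target_lang, target_word).
    """
    counts = Counter(
        tgt_word
        for (_src_lang, _src_word, tgt_lang, tgt_word) in records
        if tgt_lang == target_lang
    )
    return {word for word, n in counts.items() if n > min_times}


# Invented sample rows (assumption), not data from the project.
records = [
    ("en", "get", "sv", "få"),
    ("en", "receive", "sv", "få"),
    ("en", "may", "sv", "få"),
    ("en", "house", "sv", "hus"),
]
often_used = frequent_targets(records, "sv", 2)
```

With the slide's threshold, the call would be `frequent_targets(records, "sv", 6)`.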
Using the translations database
• Find words not in M2 lists that need adding
• Multiwords
– English "look for"
– Probably the translation of a high-freq word in several of the 8 other lgs
– So: add it to English list
• Homonyms: could be similar
Monolingual master lists (M3)
• Based on a WAC corpus
• Input from other same-lg corpora
• And from translations from 8 lgs
– Useful words which might not be hi-freq
– added words/multiwords must be above a lower freq threshold
• Target 9000
Numbers
• Target: 9000 per list
• M2 lists
– Estimate: 5000-6000 needed
• We add 3000-4000 multiwords and other 'back-translations'
Current status
• M1 lists prepared
• Lists checked, compared with other lists
– Corpus-based and other
• M2 lists prepared
• Translation underway
Big problems
• Multiwords (as anticipated)
• Homonymy (as anticipated)
• orange banana alphabet elbow, Hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
• Relation between
– Competence for communicating
– The corpora at our disposal
Word lists are useful, but
• ...are they scientific?
– A tiny bit, occasionally
• ...could they be scientific?
– Yes
– article of faith
• By the end of KELLY, we'll have a clearer idea how
And now for something completely different: DANTE
Lexical database for English
• Detailed Accurate Extensive of English
• Highly corpus-driven
• 3 yr project
• 18 expert lexicographers
• Led by Sue Atkins
– BNC, FrameNet, Euralex, COBUILD...
• English side, New English-Irish dictionary
• Available for NLP research imminently