Corpus-based word frequency lists
![Page 1: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/1.jpg)
Corpus-based word frequency lists
Adam Kilgarriff
Kivik 2013 Kilgarriff: Word lists, KELLY
![Page 2: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/2.jpg)
Word lists and the KELLY project
• EU lifelong learning project
• Goal: word cards
– Word in one lg on one side, other lg on the other
– Language learning
• 9 languages, 36 pairs
– Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
• Partners in 6 countries
http://kellyproject.eu
![Page 3: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/3.jpg)
(Monolingual) Word Lists
• Define a syllabus
• Which words get used in
– Learning-to-read books (NS children)
– NNS language learner textbooks
– Dictionaries
– Language testing
• NS: educational psychologists
• NNS: proficiency levels
![Page 4: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/4.jpg)
Should be corpus-based
• Most aren't
– Corpora are quite new
• Easy to do better
• People will use them
– Maybe also governments
![Page 5: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/5.jpg)
How
• Take your corpus
• Count
• Voilà
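The "take your corpus, count" recipe can be sketched in a few lines of Python. This is only a minimal illustration: the regex tokeniser, the function name `word_frequencies` and the sample sentence are assumptions made for the example, not part of the KELLY pipeline.

```python
from collections import Counter
import re

def word_frequencies(text: str) -> Counter:
    """Count lower-cased word forms found by a naive ASCII-letter tokeniser."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Hypothetical one-line "corpus" used only for illustration.
sample = "The cat sat on the mat. The dog sat too."
freqs = word_frequencies(sample)

# Print the three most frequent word forms with their counts.
for word, count in freqs.most_common(3):
    print(word, count)
```

Even this toy version already makes decisions about what counts as a word, which is where the complications start.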
![Page 6: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/6.jpg)
Complications
• What is a word
• Words and lemmas
• Grammatical classes
• Numbers, names...
• Multiwords
• Homonymy
All are slightly different issues for each lg
![Page 7: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/7.jpg)
What is a word; delimiters
• Found between spaces
– Not for Chinese: segmentation
• English
– co-operate, widely-held, farmer's, can't
• Norwegian, Swedish
– Compounding, separable verbs
• Arabic, Italian
– Clitics, al, ...
• ...
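To see why the delimiter question matters, the sketch below runs the slide's English examples through two naive tokenisers. Both functions are illustrative assumptions; neither is the tokeniser used in the project.

```python
import re

examples = "co-operate widely-held farmer's can't"

def split_on_spaces(text):
    """Treat anything between spaces as one word."""
    return text.split()

def split_on_non_letters(text):
    """Treat every run of ASCII letters as a word, so hyphens and apostrophes split."""
    return re.findall(r"[A-Za-z]+", text)

print(split_on_spaces(examples))       # four tokens
print(split_on_non_letters(examples))  # eight tokens
```

The same text yields four "words" under one policy and eight under the other, so any frequency list depends on the tokenisation policy chosen.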
![Page 8: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/8.jpg)
Words and lemmas
• Word form (in text)
– invading
• Lemma (dictionary headword)
– invade, for forms invade, invades, invaded, invading
• Lemmatisation
– Chinese: none; English: simple
– Middling: Swedish, Norwegian, Italian, Greek
– Tough: Russian, Polish, Arabic
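The difference between counting word forms and counting lemmas can be sketched as follows. The lookup table `LEMMA_OF` is a hand-written, hypothetical stand-in for a real lemmatiser, and the token list is invented for the example.

```python
from collections import Counter

# Hypothetical hand-written lookup; a real project would use a lemmatiser.
LEMMA_OF = {
    "invade": "invade",
    "invades": "invade",
    "invaded": "invade",
    "invading": "invade",
}

def lemma_frequencies(tokens):
    """Count tokens after mapping each known form to its lemma."""
    return Counter(LEMMA_OF.get(token, token) for token in tokens)

tokens = ["invading", "armies", "invaded", "and", "invade", "again"]
form_counts = Counter(tokens)             # each form counted separately
lemma_counts = lemma_frequencies(tokens)  # forms pooled under the headword

print(form_counts["invade"], lemma_counts["invade"])
```

Pooling the forms moves the headword up the list, which is why the quality of lemmatisation for each language affects the final ranking.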
![Page 9: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/9.jpg)
Word Families
• Derivational morphology
– efficient/efficiently
– access/accessible/accessibility
– available/availability/unavailable
• ‘Word families’ tradition
– e.g. Coxhead, Academic Word List
• Pedagogy: one item to learn
• But
– Where do families end?
– Different meanings
![Page 10: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/10.jpg)
Grammatical classes
• brush (verb) and brush (noun)
– Same item or different?
– (both in same word family)
• Required
– (short) list of word classes
– POS-tagger: will make mistakes
![Page 11: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/11.jpg)
Marginal cases
• Numbers
– twelve, seventeenth, fifties
• Closed sets
– Days of week, months
• Countries
– Capitals, nationalities, currencies, adjectives, languages
• regional/dialects, political groups, religions
– easter, christmas, islam, republican
• policies always needed
![Page 12: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/12.jpg)
Multiwords
• "according to"
– Linguistically a word, but
• Multiword frequency list: top item is "of the"
– Can't use freqs (alone) to select multiwords
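The point about raw frequency can be illustrated with a bigram count. The text below is invented for the example and the function name is an assumption; the sketch only shows that the most frequent two-word sequence tends to be a grammatical string such as "of the" rather than a lexical unit such as "according to".

```python
from collections import Counter

def bigram_frequencies(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical text, written so that both kinds of bigram occur.
text = ("according to the report the end of the year saw the rise of the "
        "party according to the head of the committee")
tokens = text.split()
counts = bigram_frequencies(tokens)

print(counts.most_common(1))
print(counts[("according", "to")])
```

Ranking by count alone puts "of the" first, so some additional criterion (an association measure, or evidence from translations) is needed to pick out true multiwords.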
![Page 13: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/13.jpg)
Homonymy
• bank (river) and bank (money)
• Word sense disambiguation
– We can't do it (with decent accuracy)
– We can't give freqs for senses
• Lists of words, not meanings
– Sometimes disconcerting
![Page 14: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/14.jpg)
Method
• Prepare monolingual lists
• Translate
– Each into 8 target languages
– Professional translation services
• Integrate, finalise
• Produce cards
• Goal for each set
– 9000 pairs at 6 levels
![Page 15: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/15.jpg)
Stages
• Sort out corpora, tagging
• Automatically generate M1 lists
– names, numbers, countries ...
– keywords vis-a-vis other corpora
• Review, compare, prepare M2 lists
• Translate
• Use translations: M3 lists
• Finalise
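The slides do not say how "keywords vis-a-vis other corpora" were scored. One common choice, swapped in here purely as an illustration, is a smoothed ratio of per-million frequencies in the focus corpus and a reference corpus. The function names and the `smoothing` parameter are assumptions.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size

def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Smoothed frequency ratio; values well above 1 suggest a focus-corpus keyword."""
    focus = per_million(focus_count, focus_size)
    reference = per_million(ref_count, ref_size)
    return (focus + smoothing) / (reference + smoothing)

# A word equally frequent in both corpora scores 1.0.
print(keyness(10, 1_000_000, 10, 1_000_000))
# A word absent from the reference corpus scores high without dividing by zero.
print(keyness(100, 1_000_000, 0, 1_000_000))
```

Words with high scores are over-represented in the focus corpus, which is one way of spotting vocabulary that reflects the corpus's bias rather than the language at large.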
![Page 16: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/16.jpg)
Corpora
• A fairly arbitrary sample of a lg
• To limit arbitrariness of wordlist
– Make it big and diverse
• WaC, TenTen corpora
– From web
– Can do for any language
• ??? Comparable ???
![Page 17: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/17.jpg)
review - how?
• points system
– 2 points for each of 6 levels
– 12 points for most freq words
• deduct points for words in over-represented areas
• add in words from other corpora
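One possible reading of the points system is sketched below: the most frequent of six bands starts at 12 points and each lower band gets 2 fewer. The band numbering, the function names and the size of the deduction (`penalty`) are assumptions; the slides give only the outline.

```python
def base_points(band: int) -> int:
    """Band 1 = most frequent words (12 points) ... band 6 = least frequent (2 points)."""
    if not 1 <= band <= 6:
        raise ValueError("band must be between 1 and 6")
    return 2 * (7 - band)

def reviewed_points(band: int, over_represented: bool, penalty: int = 2) -> int:
    """Deduct a (hypothetical) penalty for words from over-represented areas."""
    deduction = penalty if over_represented else 0
    return max(0, base_points(band) - deduction)

print([base_points(b) for b in range(1, 7)])
print(reviewed_points(1, over_represented=True))
```

Words added from other corpora would enter the same scale with points assigned by the reviewers; that step is not modelled here.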
![Page 18: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/18.jpg)
Translation database
• On the web
– http://kelly.sketchengine.co.uk
• All translations entered into it
• Queries like
– All Swedish words used as translations more than six times
– All 1:1:1:1... 'simple cases'
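A query of the first kind can be sketched over an in-memory list of translation records. The row layout, the toy rows and the function name are all assumptions for illustration; they do not reflect the actual schema or interface of the KELLY database.

```python
from collections import Counter

# Toy rows: (source_lang, source_word, target_lang, target_word).
ROWS = [
    ("en", "get", "sv", "få"),
    ("en", "receive", "sv", "få"),
    ("en", "obtain", "sv", "få"),
    ("en", "sun", "sv", "sol"),
    ("en", "sun", "it", "sole"),
]

def frequent_targets(rows, target_lang, more_than):
    """Words of target_lang used as a translation more than `more_than` times."""
    counts = Counter(tgt for _, _, lang, tgt in rows if lang == target_lang)
    return {word: n for word, n in counts.items() if n > more_than}

# The slide's query would use more_than=6; the toy data needs a lower threshold.
print(frequent_targets(ROWS, "sv", more_than=2))
```

A word that is the target of many different source words is a candidate for closer review, since it may be a homonym or a very general term.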
![Page 19: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/19.jpg)
Using the translations database
• Find words not in M2 lists that need adding
• Multiwords
– English "look for"
– Probably the translation of a high-freq word in several of the 8 other lgs
– So: add it to English list
• Homonyms: could be similar
![Page 20: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/20.jpg)
Matches across 9 languages
• Set of symmetrical relations across all 36 pairs
– music, library, sun, hospital, theory
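The figure of 36 pairs follows from choosing 2 of the 9 languages, and a full match means every pair translates to each other in both directions. The sketch below checks both; the data structures and the three-language toy example are assumptions for illustration.

```python
from itertools import combinations

LANGUAGES = ["Arabic", "Chinese", "English", "Greek", "Italian",
             "Norwegian", "Polish", "Russian", "Swedish"]

# Unordered language pairs: 9 * 8 / 2.
language_pairs = list(combinations(LANGUAGES, 2))
print(len(language_pairs))

def is_symmetric_match(words, translations):
    """True if every pair of words translates to each other in both directions.

    words: dict mapping language -> word
    translations: set of (source_lang, source_word, target_lang, target_word)
    """
    return all(
        (a, words[a], b, words[b]) in translations
        and (b, words[b], a, words[a]) in translations
        for a, b in combinations(words, 2)
    )

# Toy data for three of the nine languages.
words = {"English": "music", "Swedish": "musik", "Italian": "musica"}
translations = {
    (a, words[a], b, words[b])
    for a in words for b in words if a != b
}
print(is_symmetric_match(words, translations))
```

Removing any single direction from `translations` makes the check fail, which is why only a small set of words such as those listed match across all pairs.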
![Page 21: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/21.jpg)
Monolingual master lists (M3)
• Based on a WaC corpus
• Input from other same-lg corpora
• And from translations from 8 lgs
– Useful words which might not be hi-freq
– Added words/multiwords must be above a lower freq threshold
• Target 9000
![Page 22: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/22.jpg)
Big problems
• Multiwords (as anticipated)
• Homonymy (as anticipated)
• orange, banana, alphabet, elbow, Hello
– Worse than anticipated
– Lists from spoken corpora, learner corpora, needed
– Relation between
  • Competence for communicating
  • The corpora at our disposal
![Page 23: Corpus-based word frequency lists](https://reader036.vdocuments.site/reader036/viewer/2022062408/568135aa550346895d9d192f/html5/thumbnails/23.jpg)
Word lists are useful, but
• ...are they scientific?
– A tiny bit, occasionally
• ...could they be scientific?
– Yes …
– At least, more so