exploring variation in lexis and genre in the sketch engine adam kilgarriff lexical computing...

29
Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd., UK Supported by EU Project PRESEMT

Upload: homer-paul

Post on 19-Jan-2018

216 views

Category:

Documents


0 download

DESCRIPTION

My big question  How to compare corpora  How else can corpus methods/corpus linguistics be scientific  Roles  How do varieties contrast  How do corpora contrast  When we don’t know if they are different  Find bugs in corpus construction LSB, Brussels, Jan 2013Kilgarriff: Variation 3

TRANSCRIPT

Page 1: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Exploring Variation in Lexis and Genre in the Sketch EngineAdam Kilgarriff

Lexical Computing Ltd., UK

Supported by EU Project PRESEMT

Page 2: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

How to compare language varieties

Qualitative Quantitative

Quantitative means corpus Corpus represents variety Compare corpora

LSB, Brussels, Jan 2013Kilgarriff: Variation

2

Page 3: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

My big question

How to compare corpora How else can corpus methods/corpus linguistics be

scientific Roles

How do varieties contrast How do corpora contrast

When we don’t know if they are different Find bugs in corpus construction

LSB, Brussels, Jan 2013Kilgarriff: Variation

3

Page 4: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Corpus comparison

Qualitative Quantitative

LSB, Brussels, Jan 2013Kilgarriff: Variation

4

Page 5: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Qualitative

Take keyword lists [a-z]{3,} Lemma if lemmatisation identical, else word C1 vs C2, top 100/200 C2 vs C1, top 100/200 study

LSB, Brussels, Jan 2013Kilgarriff: Variation

5

Page 6: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Qualitative: example, OCC and OEC

OEC: general reference corpus OCC: writing for children

Look at fiction onlyTop 200 keywords (each way)

what are they?

LSB, Brussels, Jan 2013Kilgarriff: Variation

6

Page 7: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

LSB, Brussels, Jan 2013Kilgarriff: Variation

7

Page 8: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Paper Getting to know your corpus

Technology The Sketch Engine http://www.sketchengine.co.uk

LSB, Brussels, Jan 2013Kilgarriff: Variation

8

Page 9: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

The Sketch Engine

Not 25, but 10’s not bad Word sketches

One-page, corpus-driven summaries of a word’s grammatical and collocational behaviour

First used: Macmillan English Dictionary

Dictionary-making UK: almost a monopoly

OUP, CUP, Macmillan, Collins National/supranational institutes including INL

Linguistic research Teaching: languages, translation

LSB, Brussels, Jan 2013Kilgarriff: Variation

9

Page 10: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Do it

Sketch Engine does the grunt work It’s ever so interesting

LSB, Brussels, Jan 2013Kilgarriff: Variation

10

Page 11: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Quantitative

Methods, evaluation Kilgarriff 2001, Comparing Corpora, Int J Corp Ling Then:

not many corpora to compare Now:

Many Ad hoc, from web

First question: is it any good, how does it compare

Let’s make it easy: offer it in Sketch Engine

LSB, Brussels, Jan 2013Kilgarriff: Variation

11

Page 12: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Original method

C1 and C2: Same size, by design Put together, find 500 highest freq words

For each of these words Freqs: f1 in C1, f2 in C2, mean=(f1+f2)/2 (f1-f2)2/mean (chi-square statistic)

Sum Divide by 500: CBDF

LSB, Brussels, Jan 2013Kilgarriff: Variation

12

Page 13: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Evaluated

Known-similarity corpora Shows it worked Used to set parameter (500) CBDF better than alternative measures tested

LSB, Brussels, Jan 2013Kilgarriff: Variation

13

Page 14: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Adjustments for SkE

Problem: non-identical tokenisation Some awkward words: can’t undermine stats as one corpus has zero

Solution commonest 5000 words in each corpus intersection only commonest 500 in intersection

LSB, Brussels, Jan 2013Kilgarriff: Variation

14

Page 15: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Adjustments for SkE

Corpus size highly variable Chi-square not so dependable Also not consistent with our keyword lists

Link to keyword lists – link quant to qual Keyword lists

nf = normalised (per million) frequencies Keyword lists: nf1+k/nf2+k Default value for k=100 We use: if nf1>nf2, nf1+k/nf2+k, else nf2+k/nf1+k

Evaluated on Known-Sim Corpora as good as/better than chi-square

LSB, Brussels, Jan 2013Kilgarriff: Variation

15

Page 16: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

LSB, Brussels, Jan 2013Kilgarriff: Variation

16

Page 17: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

What’s missing

Heterogeneity “how similar is BNC to WSJ”?

We need to know heterogeneity before we can interpret

The leading diagonal 2001 paper: randomising halves

Inelegant and inefficient Depended on standard size of document

LSB, Brussels, Jan 2013Kilgarriff: Variation

17

Page 18: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

New definition, method

Heterogeneity (def) Distance between most different partitions

Cluster to find ‘most different partitions’ Bottom-up clustering

until largest cluster has over one third of data Rest: the other partition

Problem nxn distance matrix where n > 1 million Solution: do it in steps

LSB, Brussels, Jan 2013Kilgarriff: Variation

18

Page 19: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Summary

Corpus comparison Qualitative: use keywords Quantitative

On beta Heterogeneity (to complete the task) to follow (soon)

LSB, Brussels, Jan 2013Kilgarriff: Variation

19

Page 20: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Going multilingual

Bilingual word sketches How do we connect between languages? Three flavours

Parallel corpus Dictionary

With comparable corpus Manual

LSB, Brussels, Jan 2013Kilgarriff: Variation

20

Page 21: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

In Sketch Engine

Parallel 1000+ language pairs Credit: Joerg Tiedermann, OPUS project

LSB, Brussels, Jan 2013Kilgarriff: Variation

21

Page 22: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

LSB, Brussels, Jan 2013Kilgarriff: Variation

22

Page 23: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Going live

For parallel corpora http://beta.sketchengine.co.uk

For manual clustering http://corpdev.sketchengine.co.uk/sortable

LSB, Brussels, Jan 2013Kilgarriff: Variation

23

Page 24: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Simple maths for keywords

This word is twice as common in this text type as that

N freq Freq per mFocus Corp 2m 80 40Ref corp 15m 300 20ratio 2

LSB, Brussels, Jan 2013Kilgarriff: Variation

24

Page 25: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Intuitive Nearly right but:

How well matched are corpora Not here

Burstiness Not here

Can’t divide by zero Commoner vs. rarer words

LSB, Brussels, Jan 2013Kilgarriff: Variation

25

Page 26: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

You can’t divide by zero

Standard solution: add one

Problem solved

fc rc ratio buggle 10 0 ? stort 100 0 ? nammikin 1000 0 ?

fc rc ratio buggle 11 1 11 stort 101 1 101 nammikin 1001 1 1001

LSB, Brussels, Jan 2013 Kilgarriff: Variation 26

Page 27: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

High ratios more common for rarer words

fc rc ratio interesting?

spug 10 1 10 no

grod 1000 100 10 yes

LSB, Brussels, Jan 2013 Kilgarriff: Variation 27

•some researchers: grammar, grammar words•some researchers: lexis content wordsNo right answerSlider?

Page 28: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Solution: don’t just add 1, add n

n=1

n=100

word fc rc fc+n rc+n Ratio Rankobscurish 10 0 11 1 11.00 1middling 200 100 201 101 1.99 2common 12000 10000 12001 10001 1.20 3

word fc rc fc+n rc+n Ratio Rankobscurish 10 0 110 100 1.10 3middling 200 100 300 200 1.50 1common 12000 10000 12100 10100 1.20 2

LSB, Brussels, Jan 2013 Kilgarriff: Variation 28

Page 29: Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd.,…

Solution n=1000

Summary

word fc rc fc+n rc+n Ratio Rankobscurish 10 0 1010 1000 1.01 3middling 200 100 1200 1100 1.09 2common 12000 10000 13000 11000 1.18 1

word fc rc n=1 n=100 n=1000obscurish 10 0 1st 2nd 3rdmiddling 200 100 2nd 1st 2ndcommon 12000 10000 3rd 3rd 1stLSB, Brussels, Jan 2013 Kilgarriff: Variation 29