
Sketch Engine

A corpus query tool

Simon Smith & Adam Kilgarriff


Plan for these sessions

• Today

– lecture overview of SkE

• Homework

– Chinese collocation survey

– SkE walkthrough

• Next class

– Practical exercises


Plan for today

• Short review of corpus basics

• 4 ages of corpus research

– From pre-computer age, to SkE

• Functions of SkE

• Demonstration of SkE in use


Taiwan, Dec 2006

Sketch Engine

Developer: Pavel Rychly, Brno

Designer: Adam Kilgarriff

Users: OUP, Chambers, CUP

Universities for teaching and research

ELT textbook authors

Demo: http://www.sketchengine.co.uk/

• Self-registration for free account

Kilgarriff, Lexical Computing


Quiz

• What’s a (linguistic) corpus?

• What does the Latin word mean?

• What are corpora?

• What’s the BNC?

• How big is the British National Corpus?

• The BNC is

– Monolingual

– Static

– Synchronic

– Balanced

– English

• What other kinds are there?


5 major uses for linguistic corpora

• Language learning and teaching

• Theoretical research on Language and Linguistics

• Literary research and analysis

• Language technology

• Lexicography

• (=dictionary making)

– Cobuild, Longman, …

– All learner dictionaries now use corpora


How do you make a dictionary? (what resources…?)

• Use your own intuitions

• Ask all your friends for their intuitions

• Consult other dictionaries

• Read thousands of books

– and take lots of notes

• Use a corpus


Four ages of corpus research (in lexicography)


Pre-computer

KWIC concordance (KWIC=?)

Collocational tools

(what’s a collocation?)

Word Sketch (using Sketch Engine)


Age 1: Pre-computer

First Oxford English Dictionary (1860):

• 20 million index cards

– a word (usually rare) and a citation


Age 2: KWIC Concordance

1 arity, which will be used to take a party of under-privileged children to D

2 from outside. You are invited to a party and after a couple of drinks you d

3 tion, we believe politicians of all parties will listen to our views. &equo

4 ould be reaching agreement with all parties concerned, as to which events,

5 lack people. I have certainly been party to one or two discussions amongst

6 . These should be discussed by both parties before entering into the relatio

7 presents They had hosted a cocktail party at Kensington palace, for example

8 akes. By midnight the end-of-course party is in full swing, but most cadet

9 e should be a right for the injured party to terminate the contract. A mana

10 by the Safran Peoples ' Liberation Party. This presents the powerful neigh

11 s. Ahead I could see the rest of my party plodding towards the final slope t

12 cial ethic. The two main political parties - the Tories and the Liberals -

13 ritish successes in Perth The small party of British players competing in th

14 to help control. One member of the party went to summon the rescue team and

15 rket society fashion magazine. The party was held at his flat which was a l

16 security and secrecy than any Tory Party Conference : it seems that bootleg


Age 2: KWIC Concordances

From 1980

Computerised

COBUILD project was innovator

the coloured-pens method
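
A KWIC (key word in context) concordance like the one above can be approximated in a few lines. This is only a minimal sketch, not how SkE or COBUILD built theirs: the function name `kwic`, the `width` parameter, and the sample text (fragments adapted from the concordance lines on the slide) are all assumptions made for illustration.

```python
import re

def kwic(text, keyword_pattern, width=30):
    """Return one line per hit: left context, keyword, right context.

    The left context is right-justified to `width` characters so that
    the keyword starts in the same column on every line.
    """
    lines = []
    for m in re.finditer(keyword_pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}} {m.group(0)} {right}")
    return lines

# Hypothetical sample text, adapted from fragments on the slide.
sample = (
    "You are invited to a party and after a couple of drinks. "
    "The two main political parties - the Tories and the Liberals. "
    "One member of the party went to summon the rescue team."
)

# Match both "party" and "parties".
lines = kwic(sample, r"\bpart(?:y|ies)\b")
for line in lines:
    print(line)
```

Sorting the lines by the word to the right or left of the keyword is a natural next step, and is the kind of sorting a concordancer offers.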


Senses of party:

1 political association

2 social event

3 group of people

4 person in an agreement/dispute

5 to be party to something...

1 arity, which will be used to take a party of under-privileged children to D

2 from outside. You are invited to a party and after a couple of drinks you d

3 tion, we believe politicians of all parties will listen to our views. &equo

4 ould be reaching agreement with all parties concerned, as to which events,

5 lack people. I have certainly been party to one or two discussions amongst

6 . These should be discussed by both parties before entering into the relatio

7 presents They had hosted a cocktail party at Kensington palace, for example

8 akes. By midnight the end-of-course party is in full swing, but most cadet

9 e should be a right for the injured party to terminate the contract. A mana

10 by the Safran Peoples ' Liberation Party. This presents the powerful neigh

11 s. Ahead I could see the rest of my party plodding towards the final slope t

12 cial ethic. The two main political parties - the Tories and the Liberals -

13 ritish successes in Perth The small party of British players competing in th

14 to help control. One member of the party went to summon the rescue team and

15 rket society fashion magazine. The party was held at his flat which was a l

16 security and secrecy than any Tory Party Conference : it seems that bootleg

The coloured pens method


Age 2: limitations

as corpora get bigger:

too much data

• 50 lines for a word: read all

• 500 lines: could read all, takes a long time

• 5000 lines: no


Why do corpora keep getting bigger? (anyone?)

• Because they can

– Price of storage

– Speed of access

• Representativeness

– A small corpus may have many examples of common words

– but…… ?


Lexical distribution

• What’s the most common word in English?

• What % does it make up of a whole corpus?

• The 100 most common words make up __% of all the words in a corpus?

• The 7500 most common words make up __%

• Answers:

– 45% and 90%

• So:

– you need massive corpora, if you want to really represent rare words properly
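
The coverage figures above can be checked on any corpus you have to hand. The sketch below assumes a naive lowercase tokenizer and a made-up toy sentence; the function name `coverage` is hypothetical, and the percentages it gives for a tiny text will not match the slide's figures, which describe a large corpus.

```python
import re
from collections import Counter

def coverage(tokens, top_n):
    """Share of all tokens accounted for by the top_n most frequent word types."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)

# Toy text, invented for illustration only.
toy_text = "the cat sat on the mat and the dog sat on the log"
tokens = re.findall(r"[a-z']+", toy_text.lower())

for n in (1, 3, 100):
    print(f"top {n:>3} types cover {coverage(tokens, n):.0%} of tokens")
```

On a real corpus, run the same function with `top_n=100` and `top_n=7500` and compare with the answers on the slide.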


Limitation of KWIC analysis

• As corpora get bigger: too much data

– 50 lines for a word: read all

– 500 lines: could read all, takes a long time

– 5000 lines: no

• Instead, create a statistical summary of word usage

– Show most common collocates


Age 3: Collocation listing, to find common patterns (Problems here?)

Right collocates of save (>5 hits):

word        freq    word        freq

forests        6    life          36

$1.2           6    dollars        8

lives         37    costs          7

enormous       6    thousands      6

annually       7    face           9

jobs          20    estimated      6

money         64    your           7
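
A collocation listing of this kind is essentially a frequency count over the words to the right of the keyword, cut off at a minimum number of hits. The sketch below is one possible reading: the slide does not say how wide the window was, so `window` is an assumed parameter, and the function name `right_collocates` and the toy sentence are invented for illustration.

```python
from collections import Counter

def right_collocates(tokens, node_forms, window=1, min_freq=1):
    """Count words appearing up to `window` tokens to the right of any node form.

    Returns (word, freq) pairs with freq >= min_freq, most frequent first.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok in node_forms:
            counts.update(tokens[i + 1:i + 1 + window])
    return [(w, f) for w, f in counts.most_common() if f >= min_freq]

# Toy token list, invented for illustration only.
toy = "save money and save your money to save lives and save money".split()

hits = right_collocates(toy, {"save", "saves", "saved", "saving"},
                        window=1, min_freq=2)
print(hits)
```

Note that nothing here knows about grammar: a function word such as your is counted just like money, which is exactly the problem the next slide raises.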


Limitations of collocation listing

• Some items are not genuine collocates

– your appears only because it is adjacent to save

• The collocates can belong to any part of speech

– It would be better if they were classified into POS

– and the role they play in the sentence

• Thus,

– for arrest in “The police were quick to arrest a number of suspects on the spot”

• We would like to see

– Keyword: arrest

– Subject: police

– Object: suspect(s)

– Modifier: on the spot

• We would not be especially interested in to, a and number

– These non-collocates happen to be close to the keyword


Age 4: The word sketch, from Sketch Engine

A corpus-derived one-page summary of a word’s grammatical and collocational behaviour


Age 4: The word sketch

Large well-balanced corpus

Parse to find subjects, objects, heads, modifiers etc

One list for each grammatical relation

Statistics to sort each list
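
The recipe above (parse, one list per grammatical relation, sort by a statistic) can be sketched on already-parsed data. Everything here is an assumption for illustration: the input is a hand-made list of (relation, headword, collocate) triples rather than real parser output, the function name `word_sketch` is hypothetical, and the score is a simplified Dice-style association measure (14 + log2 of the Dice coefficient), chosen because the slides do not say which statistic is used.

```python
from collections import Counter, defaultdict
from math import log2

def word_sketch(triples, headword):
    """Build one sorted collocate list per grammatical relation for headword.

    triples: iterable of (relation, headword, collocate) tuples.
    Returns {relation: [(collocate, freq, score), ...]}, best score first.
    """
    triples = list(triples)
    pair_freq = Counter(triples)
    head_freq = Counter((rel, h) for rel, h, _ in triples)
    coll_freq = Counter((rel, c) for rel, _, c in triples)

    sketch = defaultdict(list)
    for (rel, h, c), f_xy in pair_freq.items():
        if h != headword:
            continue
        # Simplified Dice-style score (an assumption, not SkE's documented formula).
        dice = 2 * f_xy / (head_freq[(rel, h)] + coll_freq[(rel, c)])
        sketch[rel].append((c, f_xy, round(14 + log2(dice), 2)))

    for rows in sketch.values():
        rows.sort(key=lambda row: row[2], reverse=True)
    return dict(sketch)

# Hand-made triples, invented for illustration only.
triples = [
    ("subject", "arrest", "police"),
    ("subject", "arrest", "police"),
    ("subject", "arrest", "officer"),
    ("object", "arrest", "suspect"),
    ("object", "arrest", "suspect"),
    ("object", "arrest", "man"),
    ("object", "see", "man"),
    ("object", "see", "man"),
]

sketch = word_sketch(triples, "arrest")
for relation, rows in sketch.items():
    print(relation, rows)
```

The point of the score is visible even in the toy data: man occurs as an object of arrest, but it is just as happy with see, so it ranks below suspect.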


Functions of SkE

• KWIC concordance

– Sorting, filtering etc

• Word sketch

• Automatic thesaurus

• Sketch difference

– discriminate near-synonyms
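
The idea behind a sketch difference can be illustrated by lining up the collocates of two near-synonyms and ordering them by which word they prefer. This is a rough sketch only: the function name `sketch_difference` is hypothetical, the counts for strong and powerful are invented, and raw frequency difference stands in for whatever score the real tool uses.

```python
def sketch_difference(colls_a, colls_b):
    """Compare two collocate->frequency dicts.

    Returns (collocate, freq_a, freq_b) rows ordered from
    'prefers word A' through 'shared' to 'prefers word B'.
    """
    all_collocates = set(colls_a) | set(colls_b)
    rows = [(c, colls_a.get(c, 0), colls_b.get(c, 0)) for c in all_collocates]
    # Sort by (freq_b - freq_a), then alphabetically to break ties.
    return sorted(rows, key=lambda row: (row[2] - row[1], row[0]))

# Invented counts, for illustration only.
strong = {"tea": 12, "wind": 9, "argument": 5}
powerful = {"car": 10, "engine": 8, "argument": 6}

rows = sketch_difference(strong, powerful)
for collocate, f_strong, f_powerful in rows:
    print(f"{collocate:<10} strong={f_strong:<3} powerful={f_powerful}")
```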


Corpora in language learning

• Teachers:

– Use corpora to develop materials

– Confirm their own intuitions about L2

– Project info on screen, in-class

• Students can be set research tasks

– Data-driven language learning (DDL)

– At home, or in computer classrooms

– Problem?

• Answer: pretty high motivation needed


Lexical approach to SLA

• What other “approaches” or “methodologies” are there?

• Brain stores linguistic knowledge

– vocab (+features) and a lot of grammar rules

– that’s basically what we have to learn

• Lewis (1993) and Schmitt (2000) say

– the vocab is stored in chunks and collocations

– kith is stored with kin

– scotch is stored with rumour, and snake, and whisky

• Saying strong car or powerful tea or broken house gives away non-native speakers


From www.teachingenglish.org - a lexical approach activity, based on a story text


More collocations?

• Let’s look them up

• Before next time

– Sign up for Sketch Engine (you should have already!)

– Take the pre-test, using your real name

– Do the walkthrough

– mcu.edu.tw/~ssmith/walkthrough