Page 1: Principles of corpus construction Matthew Brook ODonnell University of Liverpool - Corpus Linguistics Summer Institute 2008

Principles of corpus construction

Matthew Brook O’Donnell

University of Liverpool - Corpus Linguistics Summer Institute 2008

Aims

• What is a corpus?
• What principles guide the construction, development and selection of a corpus?
• When and how to build a corpus
• Can the web be used for building corpora?
• Workshop: Build a small corpus of web texts

What is a corpus?

John Sinclair (1933-2007)

A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.

(Sinclair 2004)

Corpus

• Authentic language data
• Electronic/machine readable form
• Designed and collected according to sampling procedures
• Representative of language
• For linguistic investigation

But first…

Let’s talk about food! How could we compile a representative list (=CORPUS) of food/dishes from around the world?

How can we group these foods?

1. Where they come from

How can we group these foods?

1. Where they come from

2. Their main component

Where do you eat it?

How can we group these foods?

1. Where they come from

2. Their main component

3. Where you usually eat it

3. ‘Fast food potential’ corpus

Takeaway

Restaurant

How can we group these foods?

1. Where they come from

2. Their main component

3. Where it is usually eaten

4. What you use to eat it

1+2. Continental & Main Component Corpus

Europe

Asia

America (North & South)

Fish Meat Veg.
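
The two grouping criteria combined on this slide form a grid, and counting items per cell shows where a collection is thin. A minimal sketch in Python, assuming the slide crosses continents with main components; the dishes and their classifications are hypothetical examples, not taken from the slides.

```python
from collections import Counter
from itertools import product

CONTINENTS = ["Europe", "Asia", "America (North & South)"]
COMPONENTS = ["Fish", "Meat", "Veg."]

# Hypothetical dishes, classified by hand for illustration only.
dishes = [
    ("fish and chips", "Europe", "Fish"),
    ("ratatouille", "Europe", "Veg."),
    ("sushi", "Asia", "Fish"),
    ("beef rendang", "Asia", "Meat"),
    ("hamburger", "America (North & South)", "Meat"),
    ("ceviche", "America (North & South)", "Fish"),
]


def cell_counts(items):
    """Count how many items fall into each (continent, component) cell."""
    return Counter((continent, component) for _, continent, component in items)


def empty_cells(items):
    """List the cells of the grid that no item has been assigned to yet."""
    counts = cell_counts(items)
    return [cell for cell in product(CONTINENTS, COMPONENTS) if counts[cell] == 0]


for cell in empty_cells(dishes):
    print("no dish yet for", cell)
```

The same check carries over to texts: replace the dishes with documents and the two criteria with whatever classification the corpus design uses.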

Corpus

• Authentic language data
• Electronic/machine readable form
• Designed and collected according to sampling procedures
• Representative of language
• For linguistic investigation

Language corpora

• Different types of corpus
• Corpus size
• Sample size
• Representativeness - sampling
• Classification criteria

Types of corpus

• Sample Corpus: a fixed sample of text, often used as a reference corpus for comparison

• Monitor Corpus: a corpus which develops and is added to or filtered depending on the researcher’s needs

• Mini-corpus: a small corpus (e.g. to be compared with a reference corpus)

• Multilingual Corpus: a corpus in a variety of languages

• Comparable Corpus: texts in 2 languages or 2 varieties but not matched up

• Parallel Corpus: texts are translations of each other, e.g. Canadian Hansard, corpus of versions of Plato, Bible

• Translation Corpus: 2 or more sets of texts classified as either originals or translations, the purpose being to identify features of translation (Manchester: Baker)

• Diachronic Corpus: Helsinki, LOB v. FLOB

• Learner Corpus: texts are written by language learners

Corpus Size – Is bigger better?

1st Generation Corpora = 1 Million Words
– BROWN, LOB
– ICE corpora

2nd Generation Sample Corpora = 100 Million Words
– BNC, ANC

Monitor Corpora
– Bank of English (450+ million and growing!)

Specialized corpora
– Depends on source and scope of problem under investigation

Sample Size

‘Personally I would like to see ‘whole text’ as a default condition, thus classifying sample corpora as one of the categories of special corpora... To me the use of small samples is just a remnant of the early restraints on corpus building, and the advantages of whole texts can be set out in powerful argument. The use of samples of constant size gains only a spurious air of scientific method, since it confers no benefit on the corpus, and is as practical as Genghis Khan’s fabled policy of having all his soldiers the same height.’

(Sinclair 1995: 27-28)
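
The contrast Sinclair draws between constant-size samples and whole texts can be made concrete with a small sketch. It assumes naive whitespace tokenisation and a sample size of 2,000 words (the size used for the BROWN samples); the function names and the numbered dummy document are hypothetical and chosen only for illustration.

```python
def tokenize(text):
    """Split on whitespace; a deliberately naive tokeniser (assumption)."""
    return text.split()


def fixed_size_sample(text, size=2000, start=0):
    """Return a constant-size sample: `size` tokens beginning at `start`."""
    tokens = tokenize(text)
    return tokens[start:start + size]


def whole_text(text):
    """Return every token of the text, Sinclair's preferred default."""
    return tokenize(text)


# Hypothetical document of 5,000 tokens, numbered so the cut-off is visible.
document = " ".join(f"w{i}" for i in range(5000))

sample = fixed_size_sample(document, size=2000)
full = whole_text(document)

print(len(sample), len(full))   # 2000 5000
print(sample[-1], full[-1])     # w1999 w4999
```

Whatever happens after token 2,000 (here, the last 3,000 tokens) is invisible to any analysis run on the fixed-size sample, which is the loss the whole-text policy avoids.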

Sampling

Population
– Production
– Reception

Classifying Texts

Internal Criteria
– Topic (aboutness)
– Register/Style

External Criteria (situational parameters)

Brits treat English with such disdain

By Mr M. Rasheed Iqbal

Published: June 30 2007 03:00 | Last updated: June 30 2007 03:00

From Mr M. Rasheed Iqbal.

Sir, I agree with Henry von Blumenthal (Letters, June 23). It is very discouraging to hear news presenters saying "gonna" and "wanna" on the BBC news.

We were brought up to speak English properly and it is disappointing to see Brits treat the language with such disdain.

I hope the BBC will stem the tide and pull up its socks.

M. Rasheed Iqbal, National Bank of Dubai, Deira, Dubai, UAE

Copyright The Financial Times Limited 2007

Mode

Primary Channel – Written

Format – Published (print & web)

Setting – Public

Tenor

Addressee
– Plurality – individual (editor) / plural (readers)
– Presence – absent
– Interactiveness – written correspondence/response
– Shared knowledge – readers of same publication

Addressor
– Demographic: Male, from Dubai, works in Bank, educated, at least bilingual?
– Acknowledgement: Self-identified in text

Field

Factuality – responding to actual event (TV broadcast), expressing personal opinion

Purposes – complain, express viewpoint, condemn slipping standards, correct perceived decline

Topics – use of British English on BBC, value of language education in former era
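
The Mode, Tenor and Field analysis of the letter can be stored as structured metadata, so that texts can later be selected by external criteria. The following is a minimal sketch in Python; the class names, attribute names and the `select_by_mode` helper are hypothetical, and the values are copied (in part) from the three slides above.

```python
from dataclasses import dataclass


@dataclass
class Mode:
    primary_channel: str
    format: str
    setting: str


@dataclass
class Tenor:
    addressee: dict
    addressor: dict


@dataclass
class Field:
    factuality: str
    purposes: list
    topics: list


@dataclass
class TextRecord:
    title: str
    mode: Mode
    tenor: Tenor
    field: Field


letter = TextRecord(
    title="Brits treat English with such disdain",
    mode=Mode("written", "published (print & web)", "public"),
    tenor=Tenor(
        addressee={
            "plurality": "individual (editor) / plural (readers)",
            "presence": "absent",
            "interactiveness": "written correspondence/response",
            "shared_knowledge": "readers of same publication",
        },
        addressor={
            "demographic": "male, from Dubai, works in bank, educated, at least bilingual?",
            "acknowledgement": "self-identified in text",
        },
    ),
    field=Field(
        factuality="responding to actual event, expressing personal opinion",
        purposes=["complain", "express viewpoint"],
        topics=["use of British English on BBC"],
    ),
)


def select_by_mode(records, primary_channel, setting):
    """Keep records whose mode matches the requested channel and setting."""
    return [
        r for r in records
        if r.mode.primary_channel == primary_channel and r.mode.setting == setting
    ]


print([r.title for r in select_by_mode([letter], "written", "public")])
```

Keeping the situational parameters separate from the text itself means the selection criteria stay external: a text is chosen for how it was produced and received, not for the words it happens to contain.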

When and How to build a corpus

1. DON’T! – use one of the available corpora
   a. If interested in differences in conversational language in British English (age, sex, class etc. differences)… USE British National Corpus
   b. Combine and subsample existing corpora to match your

2. Repurpose existing archive/collection
   a. Any electronic texts available – results of surveys, DA/CA transcripts

3. Build your own!
   a. OCR, download, extract from PDF
   b. TYPE IT IN!!!!

Using the web as a source for corpora

Web as corpus: Advantages

Massive (and expanding) amounts of electronic text

Whole texts

Wide reach of text-types/topics/genres

Much in the public domain

Google (etc) as corpus query tool

Web as corpus: Disadvantages

So much text that balance is difficult to achieve

Copyright is difficult to ascertain for many documents

Pages containing large amounts of extraneous material (menus, formatting, graphics)

Explosion of information… who is writing it? Who is reading it?
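
The problem of extraneous material can be reduced, though not solved, by discarding markup that rarely carries running text. A minimal sketch using Python's standard `html.parser` module; the set of skipped tags and the sample page are assumptions made for illustration, and real pages will usually need more careful cleaning.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, ignoring anything inside tags listed in SKIP."""

    # Assumed list of elements that usually hold menus, styling or scripts.
    SKIP = {"script", "style", "nav", "header", "footer", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when we are outside every skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible running text of an HTML page, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page: a menu, a stylesheet and a footer around one paragraph.
page = """<html><head><title>Letters</title><style>p {color: red}</style></head>
<body><nav><a href="/">Home</a> <a href="/news">News</a></nav>
<p>Sir, I agree with Henry von Blumenthal.</p>
<footer>Copyright notice</footer></body></html>"""

print(extract_text(page))
```

Only the paragraph survives; the menu links, stylesheet, page title and footer are dropped. Questions of balance, copyright and authorship remain and cannot be handled by markup stripping.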