The Translational English Corpus: A practical approach to corpus building

Upload: kristopher-brown

Post on 23-Dec-2015




Outline

• TEC and new developments
– EDT Corpus
– Humanities Corpus

• Corpus design
– Representativeness
– Balance
– Size

• Corpus building
– Identifying material
– Scanning/Converting texts
– Tagging & Annotation


Translational English Corpus

A corpus of contemporary English translations: written texts translated into English from a variety of source languages

http://www.llc.manchester.ac.uk/ctis/research/english-corpus/


Subcorpora [figure not recovered]


Languages

[Bar chart: number of books in each language for fiction and (auto)biography. Language labels legible on the axis: French, German, Spanish, Portuguese, Norwegian, Catalan, Latin American Spa..., Slovene, Tamil, Finnish, Hebrew, Vietnamese. Vertical axis: 0 to 30. Bar values in descending order: 24, 23, 15, 13, 9, 6, 6, 5, 4, 4, followed by bars of 3, 2 and 1.]


TEC Tools

Set of software tools for the investigation of a wide range of issues to do with the language of translated texts.

Header File: contains metadata such as the title of the text, author, publisher, etc.

Text File: contains the actual data to be analysed.

– Sub-corpus Selection: Allows you to select particular text files or groups of text files to search.

– Sort Tool: Allows you to sort concordances to the left or right, and specify the number of words between the search keywords.

– Corpus Tree Viewer: Allows you to “grow” a tree for various keywords. The size of the text reflects frequency of occurrence in the corpus.
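To picture what sorting concordances to the left or right involves, here is a minimal keyword-in-context sketch in Python. It is an illustration only and not the TEC Tools implementation: the function names `concordance`, `sort_left` and `sort_right`, the `width` parameter and the sample sentence are all assumptions made for this example.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    # Simple tokeniser: lowercased word characters, keeping internal apostrophes.
    tokens = re.findall(r"\w+(?:['’]\w+)?", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, token, right))
    return hits

def sort_right(hits):
    """Sort concordance lines by the words following the keyword."""
    return sorted(hits, key=lambda hit: hit[2])

def sort_left(hits):
    """Sort concordance lines by the words preceding the keyword, nearest first."""
    return sorted(hits, key=lambda hit: hit[0][::-1])

# Invented sample text, not taken from the corpus.
sample = "She said that he left. He said nothing. They said that it rained."
lines = sort_right(concordance(sample, "said"))
for left, keyword, right in lines:
    print(f"{' '.join(left):>20}  [{keyword}]  {' '.join(right)}")
```

Sorting to the right groups the lines by what follows the keyword, so recurring patterns such as "said that" end up next to each other.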


TEC Database

An electronic database of all material (to be) included in the TEC for the subcorpora of fiction and (auto)biography.

The entry for each book includes not only most of the information that is included in the header file, but also images of the covers of the books.


English Discourses on Translation Corpus

• A corpus of discourses on translation for the investigation of the way in which translation/translators are conceptualised in society at different historical periods.

• No time, language or genre restriction: any material is included as long as it is written in English.

• Two types of material
– Peritextual: material that accompanies the translation, e.g. prefaces, introductions, afterwords, etc.
– Epitextual: published material (broadsheet and mainstream newspapers, literary magazines, etc.)

• Link with TEC


Humanities Corpus

A corpus of translations into English of works by theorists in the humanities, e.g. philosophers, sociologists, literary theorists, etc.

Temporality: translations date from 1900 onwards, but the source texts do not have a time restriction.

* Multiple translations of the same book.


What is a Corpus?


Corpus Design

What is a corpus?

‘A collection of texts held in machine-readable form and capable of being analysed automatically or semi-automatically’ (Baker 1995)…

…and has certain characteristics:

– Representativeness

– Balance

– Size


Representativeness

“a corpus is thought to be representative of the language variety it is supposed to represent, if the findings based on its contents can be generalised to the said language variety” (Leech 1991).

A corpus may focus on a particular genre/language/author/translator, etc.

Decisions about criteria for selection of texts


TEC Design

Material: English translations (whole texts)

Genres: Fiction, (auto)biography, in-flight magazines, news articles

Time of publication: Late 80s onwards

Place of publication: UK and USA


Balance

“a balanced corpus covers a wide range of texts which are supposed to be representative of the language variety under question” (McEnery et al. 2006).

Also, ‘internal’ balance, e.g.

– Gender balance

– Source language balance

– Genre balance
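One way to inspect internal balance is to count how the texts in a corpus are distributed over a metadata field such as source language or genre. The sketch below assumes each text has a small metadata record; the field names and the four records are invented for illustration and are not TEC data.

```python
from collections import Counter

def proportions(records, field):
    """Share of texts per value of a metadata field, e.g. 'genre'."""
    counts = Counter(record[field] for record in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Hypothetical metadata records, one per text in the corpus.
records = [
    {"source_language": "French", "genre": "fiction", "translator_gender": "female"},
    {"source_language": "German", "genre": "fiction", "translator_gender": "male"},
    {"source_language": "French", "genre": "biography", "translator_gender": "female"},
    {"source_language": "Tamil", "genre": "fiction", "translator_gender": "male"},
]

for field in ("genre", "source_language", "translator_gender"):
    print(field, proportions(records, field))
```

Counting texts is only one possible measure; counting words per category would weight long books more heavily.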


Balance [figure not recovered]


Corpus Size

A corpus needs to be adequate for the purposes for which it is intended.

A bigger corpus is not necessarily more useful than a smaller one.

Factors that affect corpus size:

– Purpose of the corpus

– Availability of data

– Copyright


• Research questions (purpose of the corpus)

– Specialised corpora and corpora intended for morphosyntactic studies tend to be smaller than general corpora and corpora intended for lexical studies. Static corpora are also smaller than dynamic ones.

• Availability of data

– The availability of suitable data (especially in machine-readable form), as well as the ease with which they can be identified, may affect the size of a corpus.


• Copyright

– Copyright clearance can impede corpus development as well as the accessibility and availability of a corpus to a wide audience.

– Copyright law varies internationally.

– Fair dealing: no permission needed for short extracts not exceeding 400 words for prose (or a total of 800 words in a series of extracts, none exceeding 300 words).

– Out of copyright material: author’s / translator’s lifetime + 70 years (UK).

– If you’re in doubt, seek permission! (McEnery et al. 2006)
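The fair dealing word limits quoted above can be written as a small check. The sketch below encodes only those thresholds (400 words for a single prose extract; 800 words in total for a series, with no extract over 300 words). It is not a statement of the law, and the function name `within_fair_dealing` is an assumption made for this example.

```python
def within_fair_dealing(extract_lengths):
    """Check prose extract word counts against the limits quoted above.

    extract_lengths: list of word counts, one per extract taken from a work.
    """
    if not extract_lengths:
        return True
    if len(extract_lengths) == 1:
        # A single extract may run to 400 words.
        return extract_lengths[0] <= 400
    # A series of extracts: at most 800 words in total, none over 300 words.
    return sum(extract_lengths) <= 800 and max(extract_lengths) <= 300

print(within_fair_dealing([400]))            # single extract at the limit
print(within_fair_dealing([300, 300, 200]))  # series totalling 800 words
print(within_fair_dealing([350, 100]))       # one extract in the series is too long
```

Anything that falls outside these limits, or any case of doubt, goes back to the last point on the slide: seek permission.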


Communication with publishers

We're delighted to learn of your interesting project, and pleased to grant you general permission to use all book reviews and blogs on our site. We'll be grateful if you can include a link to the site in the pieces you use.

But also…

…We don't feel comfortable posting the entirety of both titles to your database, but would be willing to make half of both books available to your research center…We typically charge a fee of $150 per title for use of such a large portion.

…University Press is pleased to grant you non-exclusive, English language, world rights to reprint limits of fair use (under 300 words)…


Corpus Building

• Identifying material

• Scanning

• Converting texts

• Corpus tagging and annotation

• Ready to be used


Identifying Material

• Possible sources
– Publishers’ websites
– Search engines e.g. Farrar, Straus and Giroux, NYTimes
– Publishing houses specialising in translation
– Databases
– National databases e.g. Three Percent, LTI Korea
– Internet, archives, etc.

• Problems
– Search engine not well-designed e.g. The Telegraph
– Need for specific material
– In some cases, not indicated whether it is a translation or not
– For reviews: not always related to translation


Scanning and Converting Texts

• Scanning
– Flat-bed scanner / Document feeder
– Paper and print quality
– Scanner settings: Resolution and Colour vs Greyscale

• OCR (Optical Character Recognition) Process
– Language support
– Accuracy
– Font type
– Document format

• Text File
– Spelling errors
– Character recognition errors (e.g. Tm instead of I’m)
– Save as .txt file
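Recurring character recognition errors can be corrected with a replacement table before the text is saved. The sketch below is one possible approach, not the procedure used for the TEC: the table holds only the "Tm" example from the slide, and the names `OCR_FIXES`, `fix_ocr` and `save_txt` are assumptions made for this example.

```python
import re
from pathlib import Path

# Known misreadings mapped to corrections. Only the example from the slide is
# listed; a real table would grow as errors are spotted during proofreading.
OCR_FIXES = {"Tm": "I’m"}

def fix_ocr(text, fixes=OCR_FIXES):
    """Replace whole-word OCR misreadings listed in the fixes table."""
    pattern = re.compile(r"\b(" + "|".join(re.escape(key) for key in fixes) + r")\b")
    return pattern.sub(lambda match: fixes[match.group(1)], text)

def save_txt(path, text):
    """Save the corrected text as a UTF-8 plain-text file."""
    Path(path).write_text(text, encoding="utf-8")

print(fix_ocr("Tm sure that Tm right."))
```

Matching on word boundaries keeps the replacement from touching longer words that merely contain the same letters.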


Corpus Tagging and Annotation

Adds value to a corpus, makes it easier to extract information and prepares texts to be used with corpus software.

Factors that affect the extent of tagging/annotation (Olohan 2004):

• Purpose of the corpus

• Corpus software

• Accessibility of the corpus

• Technical expertise of the researcher


Header File [figure not recovered]


Text File [figure not recovered]


Corpus Annotation

• POS (Part-of-Speech) Tagging
– Marks up a word in a corpus as corresponding to a particular part of speech, based on both its definition and its context. E.g. John_NP0 loves_VVZ Mary_NP0 ._.

• Lemmatisation
– Reduces the inflectional variants of words to their respective lemmas, i.e. as they appear in a dictionary. E.g. is, are, am -> BE

• Parsing
– Marks the syntactic structure of each sentence. E.g. (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))
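The word_TAG notation and the lemma example above can be read programmatically. The sketch below splits a tagged line into (word, tag) pairs and looks lemmas up in a small table. It assumes whitespace-separated word_TAG tokens; the names `parse_tagged`, `LEMMAS` and `lemmatise` are invented for this example, and the table holds only the forms listed on the slide.

```python
def parse_tagged(line):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    # rsplit on the last underscore so that '._.' yields ('.', '.').
    return [tuple(token.rsplit("_", 1)) for token in line.split()]

# Lemma table limited to the example above; a real lemmatiser needs a full lexicon.
LEMMAS = {"is": "BE", "are": "BE", "am": "BE"}

def lemmatise(word):
    """Return the lemma for a word form, or the word itself if it is not listed."""
    return LEMMAS.get(word.lower(), word)

pairs = parse_tagged("John_NP0 loves_VVZ Mary_NP0 ._.")
print(pairs)
print([lemmatise(word) for word in ["is", "are", "am"]])
```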


Searching a tagged corpus [figure not recovered]
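The example shown on this slide is not recoverable from the transcript. As a stand-in, the sketch below shows one simple way a tagged text can be searched: by retrieving every word whose tag starts with a given prefix. It assumes whitespace-separated word_TAG tokens, and the function name `words_with_tag` is invented for this example.

```python
def words_with_tag(tagged_text, tag_prefix):
    """Return the words whose tag starts with tag_prefix."""
    hits = []
    for token in tagged_text.split():
        # Split on the last underscore: everything before it is the word.
        word, _, tag = token.rpartition("_")
        if tag.startswith(tag_prefix):
            hits.append(word)
    return hits

tagged = "John_NP0 loves_VVZ Mary_NP0 ._."
print(words_with_tag(tagged, "NP"))  # words tagged NP0
print(words_with_tag(tagged, "VV"))  # words whose tag begins with VV
```

Searching by tag rather than by word form is what makes queries such as "all verbs following a proper noun" possible without listing every verb.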


Ready to be Used

• Develop and use your own software

• Use existing corpus tools

– TEC Tools: For more information about how to use TEC Tools with local corpora, you can download the tutorial from the TEC webpage.

– WordSmith Tools: A collection of corpus linguistics tools

– ParaConc: A bilingual or multilingual concordancer

– …


“When a corpus is created, a compromise has often to be reached between ideal design criteria and practical constraints. However, while opportunistic choices may be justified, the limitations and distortions they introduce in the makeup of a corpus should not be forgotten when evaluating the results” (Zanettin 2011).


Thank you!

TEC website

http://www.llc.manchester.ac.uk/ctis/research/english-corpus/

TEC Email Address

[email protected]


References

Baker, Mona (1995) ‘Corpora in Translation Studies: An overview and some suggestions for future research’, Target 7(2): 223-243.

Leech, Geoffrey (1991) ‘The State of the Art in Corpus Linguistics’, in Karin Aijmer and Bengt Altenberg (eds) English Corpus Linguistics: Linguistic studies in honour of Jan Svartvik, London: Longman, pp. 8-29.

McEnery, Tony, Richard Xiao and Yukio Tono (2006) Corpus-based Language Studies, London and New York: Routledge.

Olohan, Maeve (2004) Introducing Corpora in Translation Studies, London and New York: Routledge.

Zanettin, Federico (2011) ‘Translation and Corpus Design’, SYNAPS 26: 14-23.