korpus je soubor počítačově uložených textů (v případě...

The Czech National Corpus: Looking Back, Looking Forward

Věra Schmiedtová
Institute of the Czech National Corpus, Charles University
[email protected]

The Institute of the Czech National Corpus (ICNC) was founded in 1994 as part of the Faculty of Arts, Charles University in Prague, but the Czech National Corpus project took off only two years later, in 1996, when the Institute's first grant allowed it to begin its activity.

Although the ICNC was an integral part of Charles University, the Faculty of Arts had no financial means to support its operation. From the very beginning the Institute has been able to exist only due to special-purpose grants provided by the Czech Science Foundation (Grantová agentura České Republiky) or due to special-purpose financial contributions made by the Ministry of Education of the Czech Republic. In its initial stages the Institute’s existence was also dependent on sponsorship from Czech banks, especially the Commercial Bank. Why should we mention this? It is important to remember that the existence of the ICNC has never been simple, nor is its future going to be entirely free of uncertainty.

At the outset the Institute’s primary aim was to compile a material database for a new monolingual dictionary of the Czech language. The material is ready (text databases, frequency lists, etc.), but the compilation of a new monolingual dictionary of contemporary Czech has not started yet.

In any case the Institute has set itself a number of other targets. It compiles various types of corpora which make it possible to trace the development of Czech since the 13th century. The main focus, however, is on the synchronic stage of both the written and the spoken language. To this end the Institute organizes intensive data collection in a number of locations all over the country, i.e. the recording and transcription of spontaneous speech.

To date, the Institute has built the following types of corpus which are accessible to users via the Internet:

1. Corpora of synchronic written language;

Two representative corpora have been made public, SYN2000 (100 million words) and SYN2005 (100 million words). One other corpus, the non-representative SYN2006pub (300 million words), has also been released to the public. There are plans to make SYN2009pub (500 million words) accessible to everybody (Křen 2009) in the course of 2009. This means that in the course of this year a billion word forms of contemporary written Czech will be available in electronic form.

2. Diachronic corpus;

(It includes texts from the end of the 13th century onwards, numbering more than 3 600 000 word forms.)

3. Corpora of spontaneous spoken language;

There are four corpora of spontaneous spoken language available now: Pražský mluvený korpus (PMK) [Prague Spoken Corpus], Brněnský mluvený korpus (BMK) [Brno Spoken Corpus], and the corpora ORAL2006 (Kopřivová-Waclawičová 2006) and ORAL2008 (Waclawičová-Křen 2008).

4. Parallel corpora;


The InterCorp project was initiated in 2005; participating in its compilation are linguistic departments of the Faculty of Arts, Charles University, the Academy of Sciences of the Czech Republic (Akademie věd ČR) and some departments at Masaryk University in Brno. The project includes 23 mostly European languages contrasted with Czech. At present these corpora comprise some 4 million word forms in fiction texts (Čermák 2009).

5. Single-author corpora;

(The Institute has produced single-author corpora of two outstanding 20th-century Czech writers: Karel Čapek and Bohumil Hrabal. One of the corpora has already resulted in a frequency analysis of Čapek’s language, Slovník Karla Čapka [The Dictionary of Karel Čapek] (2007) (for detailed reports see Čermák 2008; Křen 2008); a similar analysis of Bohumil Hrabal’s language, Slovník Bohumila Hrabala [The Dictionary of Bohumil Hrabal], will be published during 2009.)

We may take a closer look at the corpora now:

Corpora of synchronic written language

The SYN2000 Corpus contains some 100 million word forms. It consists of whole texts, selected on the basis of research into written language reception so as to cover as wide a spectrum of genres of the Czech language as possible. Most of its texts originate from between 1990 and 1999. The corpus also includes important works of Czech literature that appeared before 1990, such as the novel Krakatit (1924) by Karel Čapek (1890-1938) or the novel Zbabělci [Cowards] (1958) by Josef Škvorecký (born 1924). Inclusion of such earlier texts is conditioned by the author’s year of birth not preceding 1880.

The corpus is lemmatized and part-of-speech tagged. It can be searched using the corpus manager Bonito, which lets the user see a word in context, search for a string of words, search by part-of-speech tags and canonical word forms, re-sort a concordance, see information on the origin and type of text in which the word occurred, save selected concordance lines to the hard disk of the computer used, see statistical data and create subcorpora.
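To make the search functions just listed more concrete, the following is a minimal Python sketch of a keyword-in-context (concordance) query over lemmatized, tagged tokens. It is not Bonito's implementation or query syntax; the Token type, the simplified tag labels (NOUN, VERB, CONJ), the concordance function and the five-token sample are all assumptions made for illustration only.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # word form as it appears in the text
    lemma: str  # canonical (dictionary) form
    tag: str    # simplified part-of-speech label (hypothetical tagset)


def concordance(tokens, lemma=None, tag=None, width=3):
    """Return keyword-in-context triples for tokens matching lemma and/or tag."""
    lines = []
    for i, tok in enumerate(tokens):
        if lemma is not None and tok.lemma != lemma:
            continue
        if tag is not None and tok.tag != tag:
            continue
        # Up to `width` word forms on either side of the matched token.
        left = " ".join(t.form for t in tokens[max(0, i - width):i])
        right = " ".join(t.form for t in tokens[i + 1:i + 1 + width])
        lines.append((left, tok.form, right))
    return lines


# Toy, hand-annotated sample (not taken from any ICNC corpus).
SAMPLE = [
    Token("Psi", "pes", "NOUN"),
    Token("štěkají", "štěkat", "VERB"),
    Token("a", "a", "CONJ"),
    Token("pes", "pes", "NOUN"),
    Token("spí", "spát", "VERB"),
]

for left, key, right in concordance(SAMPLE, lemma="pes", width=2):
    print(f"{left:>15} | {key} | {right}")
```

Searching by lemma finds both "Psi" and "pes", which is the practical benefit of lemmatization for a highly inflected language such as Czech; searching by tag alone would return every token of that part of speech.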

Composition of the SYN2000 Corpus: 60 % journalistic texts, 25 % technical texts, 15 % fiction.

The SYN2005 Corpus is also a synchronic corpus representative of present-day written Czech, containing 100 million word forms (tokens). It differs from the previous corpus in many respects. Their comparison is attempted in articles by Křen and Hlaváčová (2006, 2008). None of the texts included in the SYN2005 Corpus was previously used in the SYN2000 Corpus, so the two corpora are disjoint, and together they contain some 200 million words (tokens).

Composition of the SYN2005 Corpus: 40 % fiction, 27 % technical texts, 33 % journalistic texts.

The representativeness (balanced composition) of the SYN2005 Corpus is based on a recent study of written language reception; its composition therefore differs considerably from that of the SYN2000 Corpus. A comparison of the two corpora according to the main categories of texts is shown in the following table:

                      SYN2005   SYN2000
fiction               40 %      15 %
technical texts       27 %      25 %
journalistic texts    33 %      60 %

Other differences can be found within the main genres: while the thematic subcategories of technical writing have changed only a little, the composition of journalistic writing has changed substantially. All journalistic texts date from 2000-2004, with each year having the same proportion of journalistic texts. Compared to the SYN2000 Corpus, though, the representation of individual titles is different. Technical writing dates from 1990-2004, and fiction texts are occasionally even older than that, but in both cases care was taken to include as few earlier texts as possible.

Unlike the SYN2000 Corpus, this corpus uses an improved version of lemmatization and part-of-speech tagging. The actual system of part-of-speech tags has remained the same, with one addition, that of verbal aspect.

The SYN2006PUB Corpus is a synchronic corpus of journalistic writing of 300 million word forms (tokens). It exclusively contains journalistic texts from November 1989 to the end of 2004, i.e. from the same period as covered by the SYN2000 Corpus and the SYN2005 Corpus. As regards the actual texts included, there is no overlap between the three corpora whatsoever. Altogether the three corpora, SYN2000, SYN2005 and SYN2006PUB, contain some 500 million words (tokens).

Compared to the SYN2005 Corpus, the lemmatization and part-of-speech tagging have again been upgraded, although the difference is not as marked as that between the SYN2000 and SYN2005 corpora.

Corpora of spoken Czech

ORAL2006 and ORAL2008 contain transcripts of conversations taking place exclusively in informal situations. ORAL2008 is the first spoken corpus of the ICNC which is fully balanced in terms of the basic sociolinguistic variables of the speakers (gender, age, education, and the area of residence in childhood, i.e. at a time when their idiolect was formed; these areas are defined according to the traditional classification of regional dialects).

The ORAL2008 Corpus is based on the same type of data as ORAL2006; however, none of the transcripts included in ORAL2008 are part of the ORAL2006 Corpus.

The corpora are composed of recordings made in different parts of the Czech Lands (but not in Moravia, a region of the Czech Republic where a different group of Czech dialects is spoken). All these recordings were made in informal settings; the speakers knew each other well and were on friendly terms. They were not informed in advance about the purpose of the recording; they were told only after it had been made. All of them subsequently agreed to the recordings being used for the purposes of the Czech National Corpus. The recording, transcription and annotation were carried out in keeping with the general principles applied in the compilation of the previous spoken corpora of the Czech National Corpus, especially the ORAL2006 Corpus. All corpora use an identical system of annotating the three basic binary sociolinguistic categories of speakers.
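What annotating speakers by three binary categories makes possible can be sketched in a few lines of Python. The category names and value labels below are assumptions (the text does not say which three categories are binary or where their boundaries lie), and counting speakers per cell is only one possible notion of balance; a real corpus might balance by number of words instead.

```python
from collections import Counter
from itertools import product

# Hypothetical binary categories; the actual categories and boundaries used
# in the ORAL corpora are not specified in this text.
CATEGORIES = {
    "gender": ("female", "male"),
    "age": ("younger", "older"),
    "education": ("non-tertiary", "tertiary"),
}


def cell_counts(speakers):
    """Count speakers in each of the 2 x 2 x 2 combinations of categories."""
    counts = Counter(
        (s["gender"], s["age"], s["education"]) for s in speakers
    )
    # List empty cells explicitly so that gaps in coverage are visible.
    return {cell: counts.get(cell, 0) for cell in product(*CATEGORIES.values())}


def is_balanced(counts, tolerance=0):
    """True when the largest and smallest cell differ by at most `tolerance`."""
    values = list(counts.values())
    return max(values) - min(values) <= tolerance
```

With three binary categories there are eight cells; a sample with one speaker in each cell passes the check, and dropping any one speaker makes it fail.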

ICNC publishing activities

The list of the ICNC’s publishing achievements is quite extensive. The SYN2000 Corpus was the source for Frekvenční slovník češtiny [A Frequency Dictionary of Czech] (2004) (for a report see Čermák-Křen 2005), while the PMK was the source for Frekvenční slovník mluvené češtiny [A Frequency Dictionary of Spoken Czech] (2007). The Institute has so far published eight volumes in the series Studie z korpusové lingvistiky [Corpus Linguistics Studies], presenting monographs based on the ICNC’s corpora. The first volume to appear in the Corpus Lexicography Series was The Dictionary of Karel Čapek mentioned above. It is to be followed by The Dictionary of Bohumil Hrabal later this year. The Institute also intends to publish in this series Slovník totalitního jazyka [A Dictionary of Totalitarian Language], for the purposes of which a corpus of texts from this period has already been compiled (see Schmiedtová 2006).

The ICNC’s future is assured until 2011, when the Ministry of Education’s research grant which provides for the Institute’s operation expires. Before the end of this research grant period the Institute plans to go on developing all lines of research. It aims to publish more monographs in the Corpus Linguistics Studies series. (In the immediate future there are plans for monographs on: Valency of abstract nouns; Contemporary declension of one specific type of noun; The perfect in present-day Czech.)

This year the ICNC is going to publish a book entitled Statistika češtiny [The Statistics of the Czech Language], and has organized a conference on parallel corpora.

Conclusion

Inevitably the Institute’s activities and work on such labour-intensive projects are very costly. Therefore we hope that the Institute will continue to receive financial support that will enable it to carry on with the work, particularly as the value and importance of these projects increase if language development can be traced systematically and continuously.

Brief information on some of the books published by the ICNC

Frekvenční slovník češtiny [A Frequency Dictionary of Czech]
František Čermák - Michal Křen (eds.)

It was published in November 2004 by the NLN Publishing House in Prague. Based on the sufficiently large SYN2000 Corpus, carefully balanced so as to be representative of contemporary written language, analyzed automatically with subsequent extensive manual corrections, the dictionary is guaranteed to offer highly reliable data. The main body of the dictionary includes: 50 000 most frequent common nouns provided with their frequencies, frequency-based ranking and also their typical distribution in the main genres (fiction, technical and journalistic texts) expressed in percentages; 2 000 most frequent proper names; 1 000 most frequent abbreviations. The appendices provide information on the most frequent punctuation marks, the letters in Czech texts, and the proportion of word forms in text accounted for by the lemmas listed in the dictionary. The dictionary comes with a CD which allows browsing the word list, sorting and searching it according to various criteria and, naturally, saving the selected entries for further analysis.
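The kind of entry described above (frequency, frequency-based rank, and distribution across the three main genres in percentages) can be illustrated with a short Python sketch. It is an assumption-laden toy, not the procedure used for the dictionary: the function name, the genre labels and the sample data are hypothetical, percentages are computed from raw counts without correcting for genre size, and ties are ordered by plain string comparison rather than Czech collation.

```python
from collections import Counter, defaultdict

GENRES = ("fiction", "technical", "journalistic")


def frequency_entries(occurrences):
    """Build dictionary-style entries from (lemma, genre) occurrences.

    Returns a list of (rank, lemma, frequency, genre_percentages) sorted by
    descending frequency; ties are broken by string order and receive
    consecutive ranks (a simplification).
    """
    totals = Counter()
    by_genre = defaultdict(Counter)
    for lemma, genre in occurrences:
        totals[lemma] += 1
        by_genre[lemma][genre] += 1

    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    entries = []
    for rank, (lemma, freq) in enumerate(ordered, start=1):
        # Share of this lemma's occurrences found in each genre, in percent.
        shares = {g: round(100 * by_genre[lemma][g] / freq, 1) for g in GENRES}
        entries.append((rank, lemma, freq, shares))
    return entries


# Toy occurrences (invented for illustration, not corpus data).
OCCURRENCES = [
    ("být", "fiction"), ("být", "fiction"),
    ("být", "technical"), ("být", "journalistic"),
    ("vláda", "journalistic"), ("vláda", "journalistic"),
]

for rank, lemma, freq, shares in frequency_entries(OCCURRENCES):
    print(rank, lemma, freq, shares)
```

In this toy sample "být" is spread over all three genres while "vláda" occurs only in journalistic texts, which is the sort of contrast the percentage columns are meant to expose.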

Frekvenční slovník mluvené češtiny [A Frequency Dictionary of Spoken Czech]
František Čermák (ed.)

The first dictionary ever to present authentic spoken Czech in contrast to both standard and spoken Czech. It is derived from the Prague Spoken Corpus, which is based on sociolinguistically representative recordings. The included CD contains the whole corpus complete with corpus extraction software.

Jak využívat Český národní korpus [How to use the Czech National Corpus]
František Čermák - Renata Blatná (eds.)

This manual for users interested in working with corpora provides information on how to search in a corpus and make use of the data collected from it. It is intended as a study guide for those who want to explore Czech in a different way from that used in traditional textbooks. It includes exercises in phonetics, morphology, lexicon and word combinations.

Slovník Karla Čapka [The Dictionary of Karel Čapek]
František Čermák (ed.)

This volume opens a new series of corpus-based monographs on language called Korpusová lexikografie [Corpus Lexicography]. It will present a succession of dictionaries relating to a specific crucial historical period or single-author dictionaries presenting the language of important and well-known personalities of national culture who significantly shaped the era in which they lived and the language of that time.

A list of monographs derived from the ICNC corpora:

Corpus linguistics: The state of the art and model approaches
Collocations
Multiword prepositions in contemporary Czech
The valency of Czech adjectives
Aspectual morphology of the Czech verb
Morphology of spoken Czech: Frequency analysis
Czech in the spoken corpus
Regulation of language and the Concept of minimal intervention
The valency of the Czech substantives

References:

Čermák, F.-Křen, M. (2005): New Generation Corpus-Based Frequency Dictionaries: The Case of Czech. In: International Journal of Corpus Linguistics, 10, 4, 453-467. John Benjamins, Amsterdam - Philadelphia 2005. ISSN 1384-6655

Čermák, F. (2008): An Author's Dictionary: The Case of Karel Čapek. In: Proceedings of the XIII Euralex International Congress, Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra, Barcelona 2008, pp. 323-332

Čermák, F. (2009): Parallel Corpora: The Case of InterCorp, Corpus Linguistics Conference 2009, Liverpool, Great Britain

Čermák, F. (2009): Spoken Corpora Design. In: International Journal of Corpus Linguistics, 14, 1, 113-122. John Benjamins, Amsterdam - Philadelphia

Kocek, J.-Kopřivová, M.-Schmiedtová, V. (2000): The Czech National Corpus. In: Proceedings of the 9th EURALEX International Congress, Heid U., Evert S., Lehmann E., Rohrer Ch. (eds.), Stuttgart 2000, pp. 127-132

Kopřivová, M.-Waclawičová, M. (2006): Representativeness of Spoken Corpora on the Example of the New Spoken Corpora of the Czech Language. In: Proceedings of the international conference "Corpus linguistics - 2006". St. Petersburg University Press, St. Petersburg, pp. 174-181

Křen, M. (2006): Frequency Dictionary of Czech: A Detailed Processing Description. In: Insight into Slovak and Czech Corpus Linguistics, Šimková, Mária (ed.), pp. 16 - 25. Veda, Bratislava 2006.

Křen, M. (2006): SYN2000 vs. SYN2005: Comparing the Large Synchronic Corpora of Czech. In: Труды международной конференции "Корпусная лингвистика - 2006", с. 182 - 189. Издательство СПбГУ, Санкт-Петербург 2006.

Křen, M. (2008): Compilation of the Dictionary of Karel Čapek. In: Corpus Linguistics, Computer Tools, and Applications - State of the Art, Barbara Lewandowska-Tomaszczyk (ed.), pp. 469 - 481. Peter Lang, Frankfurt am Main.

Křen, M.-Hlaváčová, J. (2008): Corpus as a Means for Study of Lexical Usage Changes. In: Proceedings of the 13th EURALEX International Congress, Bernal, Elisenda - DeCesaris, Janet (eds.), pp. 437 - 447. Barcelona 2008.

Křen, M. (2009): The SYN Concept: Towards One-Billion Corpus of Czech, Corpus Linguistics Conference 2009, Liverpool, Great Britain

Schmiedtová, V. (2006): What did the totalitarian language in the former socialist Czechoslovakia look like? Conference of The Slavic Linguistics Society 2006, http://www.indiana.edu/~sls2006/page2/page2.html; available for download at http://ucnk.ff.cuni.cz/english/stahni.php#schmiedt

Waclawičová, M. (2007): Spoken Corpus ORAL2006, Information It Provides and General Characteristics of Spoken Text. In: Computer Treatment of Slavic and East European Languages, eds. J. Levická, R. Garabík. Tribun, Bratislava, Brno 2007, pp. 283-289


Waclawičová, M.-Křen, M. (2008): ORAL2008: New Balanced Corpus of Spoken Czech. In: Труды международной конференции "Корпусная лингвистика - 2008", pp. 105 - 112. Издательство СПбГУ, Санкт-Петербург 2008. ISBN 978-5-288-04769-5

Publications based on ICNC corpora:

Cvrček, V. 2008: Regulace jazyka a Koncept minimální intervence. Nakladatelství Lidové noviny, Praha, ISBN 978-80-7106-600-2

Čermák, F. (ed.) 2007: Slovník Karla Čapka. Nakladatelství Lidové noviny, Praha, ISBN 978-80-7106-915-7

Čermák, F. (ed.) 2007: Frekvenční slovník mluvené češtiny. Karolinum, Praha, ISBN 978-80-246-1425-0

Čermák, F. - Blatná, R. (eds.) 2005/2007: Jak využívat Český národní korpus. Nakladatelství Lidové noviny, Praha. ISBN 80-7106-736-9

Čermák, F. - Blatná, R. (eds.) 2006: Korpusová lingvistika: Stav a modelové přístupy. Nakladatelství Lidové noviny, Praha. ISBN 80-7106-861-6

Čermák, F. - Šulc, M. (eds.) 2006: Kolokace. Nakladatelství Lidové noviny, Praha 2006, ISBN 80-7106-863-2

Čermák, F. - Křen, M. (eds.) 2004: Frekvenční slovník češtiny. Nakladatelství Lidové noviny, Praha, ISBN 80-7106-676-1

Čermáková, A. 2009: Valence českých substantiv. Nakladatelství Lidové noviny, Praha 2009, ISBN 978-80-7106-426-8

Esvan, F. 2007: Vidová morfologie českého slovesa. Nakladatelství Lidové noviny, Praha, ISBN 978-80-7106-913-300

Kopřivová, M. - Waclawičová, M. (eds.) 2008: Čeština v mluveném korpusu. Nakladatelství Lidové noviny, Praha, ISBN 978-80-7106-982-9

Kopřivová, M. 2006: Valence českých adjektiv. Nakladatelství Lidové noviny, Praha, ISBN 80-7106-862-4

Šonková, J. 2008: Morfologie mluvené češtiny: Frekvenční analýza. Nakladatelství Lidové noviny, Praha, ISBN 978-80-7106-956-0

Appendix

Corpora of written language (contemporary)

corpus       size (number of words)   lemmatization   tags   corpus characterization
SYN2006PUB   300 mil.                 Yes             Yes    corpus of newspaper texts from 1989 to 2004
SYN2005      100 mil.                 Yes             Yes    genre-balanced corpus, predominantly texts from 2000 to 2004
SYN2000      100 mil.                 Yes             Yes    genre-balanced corpus, predominantly texts from 1989 to 1999
FSC2000      100 mil.                 Yes             Yes    adapted SYN2000, a reference source for the Frequency Dictionary of the Czech Language
KSK          800 000                  No              No     transcribed private letters from 1990 to 2004
ORWELL       80 000                   Yes             Yes    manually tagged corpus of the novel "1984" by Orwell

Corpora of spoken language (contemporary)

corpus       size (number of words)   lemmatization   tags   corpus characterization
ORAL2008     1 mil.                   No              No     sociolinguistically balanced corpus of informal spoken Czech
ORAL2006     1 mil.                   No              No     corpus of informal spoken Czech
PMK          675 000                  No              No     Prague Spoken Corpus
BMK          490 000                  No              No     Brno Spoken Corpus

Corpora of written language (diachronic)

corpus       size (number of words)   lemmatization   tags   corpus characterization
DIAKORP      1.6 mil.                 No              No     diachronic corpus

Parallel corpora

corpus       size (number of words)                          lemmatization   tags       corpus characterization
InterCorp    31 mil. Czech part, 34.5 other languages        Yes (partly)    (partly)   parallel corpora
