Language and Knowledge Technologies for News Collections in Croatia

Bojana Dalbelo Bašić, Marko Tadić

University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr

ITN2008, Dubrovnik, 2008-05-21

Talk overview
- who are we?
- what are we doing?
- text collections used for research
- applicable language technologies
- applicable knowledge technologies

Who are we?
- University of Zagreb, Croatia
- two faculties in a joint mission: to build systems that develop and enable the use of language resources and tools for Croatian

Who are we? 2
Faculty of Humanities and Social Sciences
- Institute/Department of Linguistics
- Department of Information Sciences
- basic computational linguistic tasks for Croatian
- compiling and processing large language resources: Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
- digitization of Croatian lexicographic heritage: 60+ dictionaries digitized so far
- tagger, lemmatizer, chunker, parser
- NERC system, gazetteers (e.g. Croatian (sur)names)

Who are we? 3
Faculty of Electrical Engineering and Computing
- Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab (Knowledge Technologies Laboratory)
- the group deals with:
  - text preprocessing techniques for Croatian for machine learning procedures
  - dimensionality reduction and document clustering in the vector space model + visualisation
  - automatic indexing of documents
  - intelligent, language-specific and non-specific information retrieval and extraction

What are we doing?
working jointly on several research projects:
- AIDE: Automatic Indexing with Descriptors from Eurovoc, 2006-2008 (cooperation with the Government of the Republic of Croatia, HIDRA); Institute of Linguistics/FFZG & ZEMRIS/FER
- Computational Linguistic Models and Language Technologies for Croatian (rmjt.ffzg.hr), 2007-2011; national research programme, prof. Marko Tadić
- Sources for Croatian Heritage and Croatian European Identity, 2007-2011; national research programme, prof. Damir Boras
- CADIAL: Computer Aided Document Indexing for Accessing Legislation, 2007-2009; joint Flemish-Croatian project, prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić

What are we doing? 2
Composition of the programme RMJT:
- P1: Croatian language resources and their annotation (project leader: Marko Tadić)
- P2: Computational syntax of Croatian (project leader: Zdravko Dovedan)
- P3: Lexical semantics in building Croatian WordNet (project leader: Ida Raffaelli)
- P4: Information technology in translating Croatian and language e-learning (project leader: Sanja Seljan)
- P5: Knowledge discovery in textual data (project leader: Bojana Dalbelo Bašić)
participation in an FP7 project: CLARIN (LR & LT as a research infrastructure for e-SSH)

Text collections used for research
we have done research on different kinds of texts, but predominantly in the journalistic genre
- Croatian National Corpus (hnk.ffzg.hr)
  - 101.2 million tokens in size
  - newspaper articles: 37% (ca. 37 million tokens)
  - magazine articles: 16% (ca. 16 million tokens)
- Croatian-English Parallel Corpus
  - 3.5 million tokens from Croatia Weekly
  - newspaper articles: 100%, bilingual
- special text collections
  - database of Vjesnik articles: 2000-2003, >90,000 articles
  - Narodne novine collection: 1998-2008, >10,000 texts, >15 million tokens
  - Parallel corpus of Southeast European Times: 2007-, >25,000 articles, >4 million tokens, in 10 languages

Applicable language technologies
morphological processing
- important for inflectionally rich languages, e.g. a Croatian noun has 14 word-forms (7 cases, 2 numbers):

        Sg           Pl
    N:  student      studenti
    G:  studenta     studenata
    D:  studentu     studentima
    A:  studenta     studente
    V:  studentu     studenti
    L:  studentu     studentima
    I:  studentom    studentima

- unlike an English noun, which has 2 (4?) word-forms (2 numbers + possessive?):

    Sg: student     Poss: (student’s)
    Pl: students    Poss: (students’)

- rich inflection is present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...
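To make the 14-slot paradigm concrete, here is a minimal Python sketch that stores the word-forms of "student" exactly as listed on this slide and counts both the paradigm slots and the distinct surface strings. The names `PARADIGM`, `paradigm_slots` and `distinct_forms` are illustrative and not part of any tool mentioned in the talk.

```python
# Paradigm of the Croatian noun "student" as given on the slide:
# case -> (singular form, plural form)
PARADIGM = {
    "N": ("student", "studenti"),
    "G": ("studenta", "studenata"),
    "D": ("studentu", "studentima"),
    "A": ("studenta", "studente"),
    "V": ("studentu", "studenti"),
    "L": ("studentu", "studentima"),
    "I": ("studentom", "studentima"),
}


def paradigm_slots(paradigm):
    """Number of case/number slots (7 cases x 2 numbers)."""
    return sum(len(forms) for forms in paradigm.values())


def distinct_forms(paradigm):
    """Distinct surface strings; several slots share the same form."""
    return sorted({form for forms in paradigm.values() for form in forms})


print(paradigm_slots(PARADIGM))        # 14
print(len(distinct_forms(PARADIGM)))   # 8
```

Several slots share one surface string (e.g. "studentu"), so a surface form alone does not always identify the case, which is one reason morphological processing is needed.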

Applicable language technologies 2
- recognizing which lexeme(s) a word-form (WF) belongs to
  - helps avoid the problem of data sparseness in many text processing tasks: information retrieval, text mining, document classification, document indexing
- query processing
  - search engines are not “inflectionally sensitive”
  - speakers of inflectionally rich languages use the normal/base form = lemma
  - e.g. www.google.hr input: noun in nominative singular
  - did you know that accusative and genitive are more frequent in Croatian?
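The effect of lemma-based query processing can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' system: `LEMMA_OF` is a hypothetical hand-written lookup table standing in for a real morphological lexicon, and the three tiny documents are invented for the example.

```python
from collections import defaultdict

# Hypothetical word-form -> lemma table (a real system would consult a
# morphological lexicon instead of a hand-written dictionary).
LEMMA_OF = {
    form: "student"
    for form in ["student", "studenta", "studentu", "studentom",
                 "studenti", "studenata", "studentima", "studente"]
}


def lemmatize(token):
    """Map a token to its lemma; unknown tokens are only lower-cased."""
    token = token.lower()
    return LEMMA_OF.get(token, token)


def build_index(docs):
    """Inverted index keyed by lemma instead of by surface form."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[lemmatize(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing any word-form of the query lemma."""
    return sorted(index.get(lemmatize(query), set()))


# Invented toy documents containing inflected forms only.
DOCS = {
    1: "Nagrada studentu",    # dative singular
    2: "Prosvjed studenata",  # genitive plural
    3: "Novi zakon",
}
index = build_index(DOCS)
print(search(index, "student"))  # [1, 2]
```

A query in the nominative singular finds both documents even though neither contains the string "student" as a separate token, which is the behaviour an "inflectionally sensitive" search engine would need.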

Applicable language technologies 3

Applicable language technologies 6
Named Entity Recognition and Classification (NERC)
- NEs bring exact information from the outside world into the world of text
- they represent answers to the basic journalistic questions: who?, where?, when?, how much?
- types of NEs (according to the MUC conferences): person, organization, location, date, time, currency and measurements, percentage
- system that works for Croatian with >90% precision
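As a rough illustration of what tagging NE types involves, the following Python sketch combines two regular expressions (for dates and percentages) with a small gazetteer lookup. It is a toy under stated assumptions and not the Croatian NERC system mentioned above: the gazetteer entries, the patterns and the function name `tag_entities` are all hypothetical.

```python
import re

# Hypothetical mini-gazetteers; a real system would use far larger lists
# plus contextual rules or a trained model.
GAZETTEER = {
    "location": {"Zagreb", "Dubrovnik"},
    "organization": {"HIDRA"},
}

# Hypothetical surface patterns for two of the MUC-style types.
PATTERNS = [
    ("percentage", re.compile(r"\b\d+(?:[.,]\d+)?\s?%")),
    ("date", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]


def tag_entities(text):
    """Return a sorted list of (entity string, type) pairs found in text."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.group(), label))
    for label, names in GAZETTEER.items():
        for name in names:
            if re.search(rf"\b{re.escape(name)}\b", text):
                found.append((name, label))
    return sorted(found)


TEXT = "The talk was given in Dubrovnik on 2008-05-21; precision exceeded 90%."
for entity, label in tag_entities(TEXT):
    print(label, entity)
```

A lookup like this only finds names in their listed form; for Croatian the gazetteer would have to be combined with morphological processing so that inflected names are also matched.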

Applicable language technologies 7
- system that works for Croatian with >90% precision

Applicable language technologies 8
semantic networks as language resources
- covering the general lexicon and NEs in a language
- WordNet: words are linked by meaning
  - synonyms, antonyms, hypo-/hyperonyms, meronyms…
- realized as ontologies or taxonomies
- allow for words and/or NEs:
  - synonymy/antonymy search
  - evoking upper levels in the taxonomy, e.g. activating the region/state/continent when a city is mentioned, or a company when a director is in focus
  - explicit social networking connections between NEs
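The idea of evoking upper levels of a taxonomy when a city is mentioned can be sketched as a walk up a child-to-parent mapping. The mapping `PARENT` below is a hypothetical toy taxonomy written for this example, not data from Croatian WordNet.

```python
# Hypothetical toy taxonomy: each entity points to its parent node.
PARENT = {
    "Dubrovnik": "Dalmatia",
    "Dalmatia": "Croatia",
    "Croatia": "Europe",
}


def upper_levels(entity, parent=PARENT):
    """Return the chain of ancestors of an entity, nearest first."""
    chain = []
    seen = {entity}  # guards against accidental cycles in the mapping
    current = parent.get(entity)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent.get(current)
    return chain


print(upper_levels("Dubrovnik"))  # ['Dalmatia', 'Croatia', 'Europe']
```

With such a chain, an article mentioning only the city could also be retrieved by a query for the region, state or continent.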

Applicable L&K technologies
- covering the general lexicon and NEs in a language
- WordNet: words are linked by meaning
  - synonyms, antonyms, hypo-/hyperonyms, meronyms…
- realized as ontologies or taxonomies
- allow for words and/or NEs:
  - explicit social networking connections between NEs
- semantic processing: roles in sentences (agent, patient, instrument, etc.)
- event detection: from verbal frames and scenarios
- connection with geo-data

Applicable knowledge technologies
automatic document indexing
- eCADIS system
  - developed for Croatian legal docs
  - applicable to any document collection
  - uses machine learning techniques
  - automatically attaches keywords (descriptors) from a controlled thesaurus to a document
  - the descriptors describe the document content
  - integrates the corpus and document analysis

CADIS system

eCADIS system
- integrates the information from the whole document collection
- greyed n-grams are statistically relevant in the corpus, i.e. collocations
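The slide does not say which statistic marks an n-gram as relevant. As one common possibility, the Python sketch below scores bigrams by pointwise mutual information (PMI); the choice of PMI, the `min_count` threshold and the toy token list are assumptions made for illustration only.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=2):
    """Score bigrams seen at least min_count times by PMI (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Invented toy corpus; "narodne novine" recurs and is picked up as a candidate.
tokens = "narodne novine objavile zakon a narodne novine objavile odluku".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

On a realistic corpus the counts would be taken over the whole document collection, which matches the slide's point that the system integrates corpus-level information.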

eCADIS system
- automatic suggestion of relevant descriptors, hence the automatic indexing

eCADIS system
- compare it to manually attached descriptors…

Applicable knowledge technologies
automatic document classification
- uses a series of combined classifiers (3,500 in total)
- results represented in a vector-space model
- dimensionality reduction: matrices could be huge (Vjesnik: 90,000 x 600,000)
- features selected: types, lemmas, collocations, NEs, …
- evaluated by the F1 measure (combination of precision and recall)
- F1 > 90% in most cases
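For reference, F1 is the harmonic mean of precision and recall. The minimal Python sketch below computes all three for one class from a set of predicted and a set of correct (gold) document ids; the function name and the example ids are illustrative, not results from the talk.

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall and F1 for one class, given sets of item ids."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: 4 documents predicted for a class, 5 truly belong to it,
# 3 of the predictions are correct.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(p, r, round(f1, 3))  # 0.75 0.6 0.667
```

The same function can compare automatically suggested descriptors with manually attached ones, treating the manual set as gold.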

Applicable knowledge technologies
visualisation of classification (Croatia Weekly, English side)
legend: go = economy, ks = culture/sport, te = tourism/ecology, po = politics
- visualisation of classification between pages
- visualisation of classification between culture (lower right) and sport (upper left)
- visualisation of classification for documents that differentiate between home policy (blue, upward) and foreign policy (blue, downward)

