Language and Knowledge Technologies for News Collections in Croatia

Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr
ITN2008 Dubrovnik 2008-05-21

TRANSCRIPT

Page 1: Language and Knowledge Technologies for News Collections in Croatia

ITN2008 Dubrovnik 2008-05-21

Language and Knowledge Technologies for News Collections in Croatia

Bojana Dalbelo Bašić, Marko Tadić
University of Zagreb, Faculty of Electrical Engineering and Computing / Faculty of Humanities and Social Sciences
bojana.dalbelo@fer.hr, marko.tadic@ffzg.hr

Page 2: Language and Knowledge Technologies for News Collections in Croatia

Talk overview
- who are we?
- what are we doing?
- text collections used for research
- applicable language technologies
- applicable knowledge technologies

Page 3: Language and Knowledge Technologies for News Collections in Croatia

Who are we?
- University of Zagreb, Croatia
- two faculties in a joint mission
  - build systems that develop and enable the use of language resources and tools for Croatian

Page 4: Language and Knowledge Technologies for News Collections in Croatia

Who are we? (2)
Faculty of Humanities and Social Sciences
- Institute/Department of Linguistics
- Department of Information Sciences
- basic computational linguistic tasks for Croatian
  - compiling and processing large language resources: Croatian National Corpus, Croatian Morphological Lexicon, Croatian WordNet, Croatian Dependency Treebank
  - digitization of Croatian lexicographic heritage: 60+ dictionaries digitized so far
  - tagger, lemmatizer
  - chunker, parser
  - NERC system, gazetteers (e.g. Croatian (sur)names)

Page 5: Language and Knowledge Technologies for News Collections in Croatia

Who are we? (3)
Faculty of Electrical Engineering and Computing
- Department of Electronics, Microelectronics, Computer and Intelligent Systems / KTLab
- Knowledge Technologies Laboratory
- the group deals with
  - text preprocessing techniques for Croatian for machine learning procedures
  - dimensionality reduction and document clustering in the vector space model + visualisation
  - automatic indexing of documents
  - intelligent, language-specific and non-specific information retrieval and extraction

Page 6: Language and Knowledge Technologies for News Collections in Croatia

What are we doing?
working jointly on several research projects
- AIDE: Automatic Indexing with Descriptors from Eurovoc (cooperation with the Government of the Republic of Croatia, HIDRA), Institute of Linguistics/FFZG & ZEMRIS/FER, 2006-2008
- Computational Linguistic Models and Language Technologies for Croatian (rmjt.ffzg.hr), 2007-2011, national research programme, prof. Marko Tadić
- Sources for Croatian Heritage and Croatian European Identity, 2007-2011, national research programme, prof. Damir Boras
- CADIAL: Computer Aided Document Indexing for Accessing Legislation, joint Flemish-Croatian project, 2007-2009, prof. Marie-Francine Moens & prof. Bojana Dalbelo Bašić

Page 7: Language and Knowledge Technologies for News Collections in Croatia

What are we doing? (2)
Composition of the RMJT programme
- P1: Croatian language resources and their annotation, project leader: Marko Tadić
- P2: Computational syntax of Croatian, project leader: Zdravko Dovedan
- P3: Lexical semantics in building Croatian WordNet, project leader: Ida Raffaelli
- P4: Information technology in translating Croatian and language e-learning, project leader: Sanja Seljan
- P5: Knowledge discovery in textual data, project leader: Bojana Dalbelo Bašić
participation in an FP7 project
- CLARIN: LR & LT as a research infrastructure for e-SSH

Page 8: Language and Knowledge Technologies for News Collections in Croatia

Text collections used for research
we have done research on different kinds of texts, but predominantly in the journalistic genre
- Croatian National Corpus (hnk.ffzg.hr)
  - 101.2 million tokens in size
  - newspaper articles: 37% (ca 37 million tokens)
  - magazine articles: 16% (ca 16 million tokens)
- Croatian-English Parallel Corpus
  - 3.5 million tokens from Croatia Weekly
  - newspaper articles: 100%, bilingual
- special text collections
  - database of Vjesnik articles: 2000-2003, >90,000 articles
  - Narodne novine collection: 1998-2008, >10,000 texts, >15 million tokens
  - Parallel corpus of Southeast European Times: 2007-, >25,000 articles, >4 million tokens, in 10 languages

Page 9: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies
morphological processing
- important for inflectionally rich languages, e.g. a Croatian noun has 14 word-forms (7 cases, 2 numbers):

    Case  Singular   Plural
    N     student    studenti
    G     studenta   studenata
    D     studentu   studentima
    A     studenta   studente
    V     studente   studenti
    L     studentu   studentima
    I     studentom  studentima

- unlike an English noun, which has 2 (4?) word-forms (2 numbers + possessive?):

    Sg: student    Poss: (student’s)
    Pl: students   Poss: (students’)

- present in all Slavic languages (excl. Bulgarian), German, Greek, Baltic languages, Finnish, ...
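
The paradigm above can be written down as a small table to make the counts concrete. The sketch below is illustrative only: `CASES`, `PARADIGM`, `slots` and `distinct_forms` are names introduced here, and the vocative singular is entered in its standard form `studente`. It counts the case/number slots and the distinct surface strings that fill them.

```python
# Paradigm of the Croatian noun "student": 7 cases x 2 numbers.
CASES = ("N", "G", "D", "A", "V", "L", "I")

PARADIGM = {
    "sg": ("student", "studenta", "studentu", "studenta",
           "studente", "studentu", "studentom"),
    "pl": ("studenti", "studenata", "studentima", "studente",
           "studenti", "studentima", "studentima"),
}

# One entry per (number, case) slot of the paradigm.
slots = [(number, case, form)
         for number, forms in PARADIGM.items()
         for case, form in zip(CASES, forms)]

# Several slots share the same surface string, so there are fewer distinct forms.
distinct_forms = sorted({form for _, _, form in slots})

print(len(slots))           # 14 case/number slots
print(len(distinct_forms))  # 8 distinct surface strings
```

Because several slots share one surface string, a single word-form can be ambiguous between several case/number readings, which is part of what morphological processing has to resolve.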

Page 10: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies 2
- recognizing which lexeme(s) a word-form (WF) belongs to
  - helps us avoid the problem of data sparseness in many text processing tasks:
    - information retrieval
    - text mining
    - document classification
    - document indexing
- query processing
  - search engines are not “inflectionally sensitive”
  - speakers of inflectionally rich languages use the normal/base form = lemma
    - e.g. www.google.hr input: noun in nominative singular
  - did you know that accusative and genitive are more frequent in Croatian?
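
The effect of lemma-based query processing can be illustrated with a toy inverted index. This is a minimal sketch, not the system described in the talk: the `FORM_TO_LEMMA` table, the sample sentences and the function names are assumptions made for the example, and a real lemmatizer would draw on a resource such as a morphological lexicon rather than a hand-written dictionary.

```python
from collections import defaultdict

# Hypothetical word-form -> lemma table (a real system would use a full lexicon).
FORM_TO_LEMMA = {
    "student": "student", "studenta": "student", "studentu": "student",
    "studenti": "student", "studentima": "student", "studenata": "student",
}

# Invented sample documents.
DOCS = {
    1: "profesor je pohvalio studenta",   # "the professor praised the student"
    2: "studenti su položili ispit",      # "the students passed the exam"
    3: "ispit je bio težak",              # "the exam was hard"
}

def lemmatize(token):
    """Return the lemma if known, otherwise the token itself."""
    return FORM_TO_LEMMA.get(token, token)

def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index

surface_index = build_index(DOCS, lambda t: t)  # indexes raw word-forms
lemma_index = build_index(DOCS, lemmatize)      # indexes lemmas

query = "student"  # nominative singular, as users tend to type it
print(sorted(surface_index.get(query, set())))           # []
print(sorted(lemma_index.get(lemmatize(query), set())))  # [1, 2]
```

The surface index misses both documents because neither contains the nominative singular, while the lemma index finds the accusative and the plural occurrences.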

Page 11: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies 3

Page 12: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies 4

Page 13: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies 5

Page 14: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies 6
Named Entity Recognition and Classification (NERC)
- NEs introduce exact information from the outer world into the world-of-text
- they represent answers to the basic journalistic questions: who?, where?, when?, how much?
- types of NEs (according to the MUC conferences)
  - person
  - organization
  - location
  - date
  - time
  - currency and measurements
  - percentage
- system that works for Croatian with >90% precision
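
To make the NE types concrete, the following sketch tags a few of them with a gazetteer lookup and regular expressions. It is a toy illustration under stated assumptions, not the NERC system reported above: the gazetteer entries, the patterns and the `tag_entities` function are invented for the example.

```python
import re

# Hypothetical mini-gazetteer; real gazetteers hold many thousands of entries.
GAZETTEER = {
    "Zagreb": "LOCATION",
    "Dubrovnik": "LOCATION",
    "University of Zagreb": "ORGANIZATION",
    "Marko Tadić": "PERSON",
}

# Simple illustrative patterns for two numeric NE types.
PATTERNS = [
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
    ("PERCENTAGE", re.compile(r"\b\d+(?:\.\d+)?\s?%")),
]

def tag_entities(text):
    """Return sorted (start, end, type, surface) tuples; longer gazetteer entries win."""
    found = []
    # Longest entries first so "University of Zagreb" wins over "Zagreb".
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(re.escape(name), text):
            overlaps = any(m.start() < e and s < m.end() for s, e, _, _ in found)
            if not overlaps:
                found.append((m.start(), m.end(), GAZETTEER[name], m.group()))
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group()))
    return sorted(found)

text = "Marko Tadić of the University of Zagreb spoke in Dubrovnik on 2008-05-21."
for start, end, label, surface in tag_entities(text):
    print(label, surface)
```

A lookup of this kind cannot handle inflected or unseen names, which is one reason a production system for Croatian needs morphological processing and more than a word list.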

Page 15: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies 7
- system that works for Croatian with >90% precision

Page 16: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies 8
semantic networks as language resources
- covering the general lexicon and NEs in a language
- WordNet: words are linked by meaning
  - synonyms, antonyms, hypo-/hyperonyms, meronyms…
- realized as ontologies or taxonomies
- allow for words and/or NEs
  - synonymy/antonymy search
  - evoking upper levels in the taxonomy
    - e.g. activating the region/state/continent when a city is mentioned, or a company when a director is in focus
  - explicit social networking connections between NEs
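
Evoking the upper levels of a taxonomy when a city is mentioned amounts to walking up parent links. A minimal sketch, assuming a hand-made fragment of a taxonomy; the `PARENT` table and the `upper_levels` function are hypothetical and do not reflect the structure of Croatian WordNet.

```python
# Hypothetical taxonomy fragment: each node points to its parent.
PARENT = {
    "Dubrovnik": "Dalmatia",
    "Dalmatia": "Croatia",
    "Croatia": "Europe",
    "Zagreb": "Croatia",
}

def upper_levels(node, parent=PARENT):
    """Walk up the taxonomy from `node` and return all ancestors, nearest first."""
    chain = []
    seen = {node}
    while node in parent:
        node = parent[node]
        if node in seen:  # guard against accidental cycles in the data
            break
        seen.add(node)
        chain.append(node)
    return chain

print(upper_levels("Dubrovnik"))  # ['Dalmatia', 'Croatia', 'Europe']
```

An article mentioning Dubrovnik could then also be retrieved by a query for the region, the state or the continent.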

Page 17: Language and Knowledge Technologies for News Collections in Croatia

Applicable L&K technologies

Page 18: Language and Knowledge Technologies for News Collections in Croatia

Applicable L&K technologies

Page 21: Language and Knowledge Technologies for News Collections in Croatia

Applicable language technologies 8
semantic networks as language resources
- covering the general lexicon and NEs in a language
- WordNet: words are linked by meaning
  - synonyms, antonyms, hypo-/hyperonyms, meronyms…
- realized as ontologies or taxonomies
- allow for words and/or NEs
  - synonymy/antonymy search
  - evoking upper levels in the taxonomy
    - e.g. activating the region/state/continent when a city is mentioned, or a company when a director is in focus
  - explicit social networking connections between NEs
- semantic processing: roles in sentences (agent, patient, instrument etc.)
- event detection: from verbal frames and scenarios
- connection with geo-data

Page 22: Language and Knowledge Technologies for News Collections in Croatia

Applicable knowledge technologies
automatic document indexing
- eCADIS system developed for Croatian legal docs
- applicable to any document collection
- uses machine learning techniques
- automatically attaches keywords (descriptors) from a controlled thesaurus to a document
  - the descriptors represent a description of the document content
- integrates corpus and document analysis
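
As a rough picture of what attaching descriptors from a controlled thesaurus involves, the sketch below ranks descriptors by how many of their indicator terms occur in a document. This is deliberately not the eCADIS method, which the slides describe only as using machine learning techniques: the `THESAURUS` entries, the scoring rule and `suggest_descriptors` are all assumptions for illustration.

```python
# Hypothetical controlled thesaurus: descriptor -> indicator terms.
THESAURUS = {
    "fisheries": {"fish", "fishing", "vessel", "catch"},
    "taxation": {"tax", "vat", "levy", "rate"},
    "education": {"school", "student", "teacher", "curriculum"},
}

def suggest_descriptors(text, thesaurus=THESAURUS, top_n=2):
    """Rank descriptors by the number of indicator terms found in the text."""
    tokens = set(text.lower().split())
    scores = {d: len(terms & tokens) for d, terms in thesaurus.items()}
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [d for d, score in ranked[:top_n] if score > 0]

doc = "the act sets the vat rate and a levy on each fishing vessel"
print(suggest_descriptors(doc))  # ['taxation', 'fisheries']
```

A learned indexer would replace the hand-written term sets with models trained on documents that already carry manually assigned descriptors.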

Page 23: Language and Knowledge Technologies for News Collections in Croatia

CADIS system

Page 24: Language and Knowledge Technologies for News Collections in Croatia

Page 25: Language and Knowledge Technologies for News Collections in Croatia

eCADIS system
- integrates the information from the whole document collection
- greyed n-grams are statistically relevant in the corpus, i.e. collocations
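
One common way to decide whether an n-gram is statistically relevant in a corpus is an association measure such as pointwise mutual information (PMI). The slides do not say which measure eCADIS uses, so PMI is an assumption here, as are the toy corpus and the `bigram_pmi` function.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """PMI (base 2) for adjacent word pairs seen at least `min_count` times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Invented toy corpus, already tokenized by whitespace.
corpus = ("narodne novine objavile zakon . zakon objavile narodne novine . "
          "novine su objavile zakon").split()

for pair, score in sorted(bigram_pmi(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs whose words co-occur more often than their individual frequencies would predict receive a high score and are candidates for collocations.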

Page 26: Language and Knowledge Technologies for News Collections in Croatia

eCADIS system
- automatic suggestion of relevant descriptors, hence the automatic indexing

Page 27: Language and Knowledge Technologies for News Collections in Croatia

eCADIS system
- compare it to manually attached descriptors…

Page 28: Language and Knowledge Technologies for News Collections in Croatia

Applicable knowledge technologies
automatic document classification
- uses a series of classifiers, combined
  - 3500 classifiers
- results represented in a vector-space model
- dimensionality reduction
  - matrices could be huge (Vjesnik: 90,000 x 600,000)
- features selected
  - types
  - lemmas
  - collocations
  - NEs
  - …
- evaluated by F1 measure (combination of precision/recall)
  - F1 > 90% in most cases
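
The F1 measure is the harmonic mean of precision P and recall R, F1 = 2PR / (P + R). The sketch below computes it for a single category from sets of document ids. The function name and the example sets are hypothetical and are not results from the talk.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for one category from sets of document ids."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical output of one binary classifier (e.g. "politics" vs. the rest).
predicted = {1, 2, 3, 4, 5}           # documents the classifier assigned to the category
gold = {2, 3, 4, 5, 6, 7, 8, 9}       # documents that truly belong to it

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"P={p:.2f} R={r:.2f} F1={f1:.3f}")  # P=0.80 R=0.50 F1=0.615
```

Because F1 is a harmonic mean, it stays low unless precision and recall are both high, so a classifier cannot score well by favouring one at the expense of the other.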

Page 29: Language and Knowledge Technologies for News Collections in Croatia

Applicable knowledge technologies
visualisation of classification between pages
Croatia Weekly, English side
  go = economy
  ks = culture/sport
  te = tourism/ecology
  po = politics

Page 30: Language and Knowledge Technologies for News Collections in Croatia

Applicable knowledge technologies
visualisation of classification between culture (lower right) and sport (upper left)
Croatia Weekly, English side
  go = economy
  ks = culture/sport
  te = tourism/ecology
  po = politics

Page 31: Language and Knowledge Technologies for News Collections in Croatia

Applicable knowledge technologies
visualisation of classification for documents that differentiate between home policy (blue upward) and foreign policy (blue downward)
Croatia Weekly, English side
  go = economy
  ks = culture/sport
  te = tourism/ecology
  po = politics
