
Project meeting, Zagreb, 2007-11-12

Computer Aided Document Indexing for Accessing Legislation

Joint Flemish-Croatian project

5th project meeting, Zagreb, 2009-03-12

CADIAL introduction

joint Flemish-Croatian project supported by
  Government of Flanders (grant KRO/009/06)
  Ministry of Science, Education and Sports of the Republic of Croatia

aim of the project
  transfer of expert knowledge
    from: Department of Computer Science at the K. U. Leuven
    to: Croatian governmental agency HIDRA

why
  ensuring the infrastructure for durable public access to Croatian legislative documents in national and multilingual European context

CADIAL introduction 2

project partners
  Department of Computer Science, Katholieke Universiteit Leuven, Belgium
    Prof. Marie-Francine Moens (promotor)
  Faculty of Electrical Engineering and Computing (FER), University of Zagreb, Croatia
    Prof. Bojana Dalbelo Bašić (partner leader)
  Faculty of Humanities and Social Sciences (FFZG), University of Zagreb, Croatia
    Prof. Marko Tadić
  Croatian Information Documentation Referral Agency (HIDRA), Croatia
    Neda Erceg; Maja Cvitaš, M.Sc.

CADIAL introduction 3

expected results
  publicly accessible textual database of 15,000 indexed Croatian legislative documents
  system for enriched computer aided document indexing (eCADIS) developed at the University of Zagreb
  intelligent web-based search engine for accessing this database

all expected results achieved

CADIAL introduction 4

wider social impact of the project
  public web-service that ensures the accessibility and transparency of legislative documentation of the Republic of Croatia
  accessible also to users from abroad through the existing multilingual versions of Eurovoc
  direct contribution to the alignment of the Republic of Croatia with EU standards

EUROVOC

official documentation thesaurus of
  EU bodies
  majority of EU parliaments

http://europa.eu/eurovoc

6625 descriptors (keywords)
6 hierarchical levels
21 top level categories

descriptors translated (1:1) to 22 EU official languages + Croatian
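The structure described on this slide (a hierarchy of descriptors, each with a 1:1 label in every language) can be sketched as a small data model. This is only an illustration, not Eurovoc's actual data format: the class and method names and the identifiers `d1` and `d2` are hypothetical, and the two sample entries and their parent-child relation are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Descriptor:
    """One thesaurus descriptor with exactly one label per language."""
    ident: str                      # hypothetical identifier, not a real Eurovoc ID
    labels: Dict[str, str]          # language code -> label
    parent: Optional[str] = None    # identifier of the broader descriptor
    children: List[str] = field(default_factory=list)


class Thesaurus:
    def __init__(self) -> None:
        self.by_id: Dict[str, Descriptor] = {}
        self.by_label: Dict[Tuple[str, str], str] = {}  # (lang, label) -> id

    def add(self, desc: Descriptor) -> None:
        """Register a descriptor; its parent must have been added before it."""
        self.by_id[desc.ident] = desc
        for lang, label in desc.labels.items():
            self.by_label[(lang, label.lower())] = desc.ident
        if desc.parent is not None:
            self.by_id[desc.parent].children.append(desc.ident)

    def translate(self, label: str, src: str, dst: str) -> Optional[str]:
        """Map a label in language src to the label in language dst."""
        ident = self.by_label.get((src, label.lower()))
        return None if ident is None else self.by_id[ident].labels.get(dst)

    def level(self, ident: str) -> int:
        """Depth in the hierarchy; top-level categories are level 1."""
        depth = 1
        while self.by_id[ident].parent is not None:
            ident = self.by_id[ident].parent
            depth += 1
        return depth


# Illustrative sample data only.
t = Thesaurus()
t.add(Descriptor("d1", {"en": "agriculture", "hr": "poljoprivreda"}))
t.add(Descriptor("d2", {"en": "fisheries", "hr": "ribarstvo"}, parent="d1"))
print(t.translate("ribarstvo", "hr", "en"), t.level("d2"))
```

Because every descriptor carries one label per language, a query phrased with a Croatian descriptor can be mapped to its English label (or the reverse) by a single dictionary lookup.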

Conversion of documents

Marko Tadić

University of Zagreb, Faculty of Humanities and Social Sciences, Department of Linguistics

[email protected]

Conversion

web-site of the official journal of the Republic of Croatia, Narodne novine (http://www.nn.hr)
  e-text identical to the official version
  downloaded in HTML format
    generated with at least six different programs
    six different internal HTML structures (tags, divs…)

expected XML format
  uniform for all versions of input HTML
  conversion must be consistent, yet flexible

two subtasks
  conversion with user-defined script (html2xml tool)
  validation of converted XML documents (DTD developed after manual investigation of documents)

ended with 11,248 documents with more to come
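The two subtasks above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the html2xml tool itself: the slides show neither the tool's script syntax nor the project DTD, so the `RULES` table, the target element names (`document`, `title`, `para`), the HTML class names and the `is_valid` check are all hypothetical. The structural check stands in for DTD validation, which the Python standard library does not provide.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical user-defined rules: (HTML tag, class attribute) -> XML element.
RULES = {
    ("h1", None): "title",
    ("p", None): "para",
    ("div", "naslov"): "title",
    ("div", "tekst"): "para",
}


class Html2Xml(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = []  # open HTML tags as (tag, mapped XML element or None)

    def handle_starttag(self, tag, attrs):
        target = RULES.get((tag, dict(attrs).get("class")))
        elem = ET.SubElement(self.root, target) if target else None
        if elem is not None:
            elem.text = ""
        self.stack.append((tag, elem))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed tags like <br>.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Text goes to the innermost mapped element that is still open.
        for _tag, elem in reversed(self.stack):
            if elem is not None:
                elem.text += data
                break


def convert(html):
    """Convert one HTML variant into the uniform XML tree."""
    parser = Html2Xml()
    parser.feed(html)
    parser.close()
    for elem in parser.root:
        elem.text = " ".join(elem.text.split())  # normalise whitespace
    return parser.root


def is_valid(doc):
    """Stand-in for DTD validation: a document is one title, then paragraphs."""
    names = [child.tag for child in doc]
    return (doc.tag == "document" and len(names) >= 2
            and names[0] == "title" and all(n == "para" for n in names[1:]))


variant_a = convert("<h1>Zakon</h1><p>Članak 1.<br> Tekst</p>")
variant_b = convert('<div class="naslov">Zakon</div>'
                    '<div class="tekst">Članak 1. Tekst</div>')
print(ET.tostring(variant_a, encoding="unicode"))
print(is_valid(variant_a), is_valid(variant_b))
```

Keeping the mapping in a data table rather than in code is one way to stay "consistent, yet flexible": a new HTML variant only needs new rules, while the output format and the validation step remain unchanged.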

Text preprocessing

Jan Šnajder

KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems

Faculty of Electrical Engineering and Computing, University of Zagreb

[email protected]

Text preprocessing

word-form lemma tag
belgija belgija N=fpg
belgijama belgija N=fpd
belgijama belgija N=fpl
belgijama belgija N=fpi
belgije belgija N=fsg
belgije belgija N=fpn
belgije belgija N=fpa
belgije belgija N=fpv
belgiji belgija N=fsd
belgiji belgija N=fsl
belgijo belgija N=fsv
belgijom belgija N=fsi
belgiju belgija N=fsa
belgijska belgijski Aspnpn
belgijska belgijski Aspnpa
belgijska belgijski Aspnpv
belgijska belgijski Aspfsn
belgijska belgijski Aspfsv
belgijske belgijski Aspmpa
belgijske belgijski Aspfsg
belgijske belgijski Aspfpn
belgijske belgijski Aspfpa
belgijske belgijski Aspfpv
belgijski belgijski Aspmsn
belgijski belgijski Aspmsa
belgijski belgijski Aspmsv
belgijski belgijski Aspmpn
belgijski belgijski Aspmpv
belgijskih belgijski Aspmpg
belgijskih belgijski Aspnpg
belgijskih belgijski Aspfpg
Belgijanac belgijanac N=msn
Belgijanac belgijanac N=msa
Belgijanaca belgijanac N=mpg
Belgijanca belgijanac N=msg
Belgijance belgijanac N=mpa
Belgijancem belgijanac N=msi
Belgijanci belgijanac N=mpn
Belgijanci belgijanac N=mpv
Belgijancima belgijanac N=mpd
Belgijancima belgijanac N=mpl
Belgijancima belgijanac N=mpi
Belgijancu belgijanac N=msd
Belgijancu belgijanac N=msl
Belgijanče belgijanac N=msv
belgijanaka belgijanka N==pg
Belgijanci belgijanka N==sd
Belgijanci belgijanka N==sl
Belgijanci belgijanka N=fsd
Belgijanci belgijanka N=fsl
Belgijanka belgijanka N==sn
Belgijanka belgijanka N==pg
Belgijanka belgijanka N=fsn
Belgijanka belgijanka N=fpg
Belgijankama belgijanka N==pd
Belgijankama belgijanka N==pl
Belgijankama belgijanka N==pi
Belgijankama belgijanka N=fpd
Belgijankama belgijanka N=fpl
Belgijankama belgijanka N=fpi
Belgijanke belgijanka N==sg
Belgijanke belgijanka N==pn
Belgijanke belgijanka N==pa
Belgijanke belgijanka N==pv
Belgijanke belgijanka N=fsg
Belgijanke belgijanka N=fpn
Belgijanke belgijanka N=fpa
Belgijanke belgijanka N=fpv
Belgijanki belgijanka N==sd
Belgijanki belgijanka N==sl
Belgijanki belgijanka N==sl
Belgijanki belgijanka N=fsd
Belgijanki belgijanka N=fsl
Belgijanko belgijanka N==sv
Belgijanko belgijanka N=fsv
Belgijankom belgijanka N==si
Belgijankom belgijanka N=fsi
Belgijanku belgijanka N==sa
Belgijanku belgijanka N=fsa
Belgijac belgijac N=msa
Belgijaca belgijac N=mpg
Belgijca belgijac N=msg
Belgijce belgijac N=msv
Belgijce belgijac N=mpa
Belgijcem belgijac N=msi
Belgijci belgijac N=mpn
Belgijci belgijac N=mpv
Belgijcima belgijac N=mpd
Belgijcima belgijac N=mpl
Belgijcima belgijac N=mpi
Belgijcom belgijac N=msi
Belgijcu belgijac N=msd
Belgijcu belgijac N=msl
Belgijče belgijac N=msv

Text preprocessing 2

before indexing: morphological normalization
  Croatian language is morphologically complex
  inflectional + derivational normalization
  used in
    document classification
    automatic indexing
    search engine

lexicon-based normalization procedure
  advantage: good normalization performance
  drawback: limited coverage

two approaches
  A: Croatian Morphological Lexicon (HML)
  B: Automatically acquired lexicon (Molex)
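A lexicon-based normalization step can be sketched as a lookup from word-forms to lemmas. The entries below are taken from the lexicon excerpt shown on the previous slide; the function names, the lowercasing of input, and the rule for ambiguous forms (take the alphabetically first lemma) are assumptions made for this illustration and are not described in the slides.

```python
from collections import defaultdict

# (word-form, lemma, tag) triples in the style of the lexicon excerpt.
ENTRIES = [
    ("belgije", "belgija", "N=fsg"),
    ("belgiju", "belgija", "N=fsa"),
    ("belgijskih", "belgijski", "Aspmpg"),
    ("Belgijanci", "belgijanac", "N=mpn"),
    ("Belgijanci", "belgijanka", "N=fsd"),
]


def build_lexicon(entries):
    """Index the triples by lowercased word-form; a form may have several lemmas."""
    lexicon = defaultdict(set)
    for form, lemma, _tag in entries:
        lexicon[form.lower()].add(lemma)
    return lexicon


def normalize(tokens, lexicon):
    """Replace each token by a lemma; tokens outside the lexicon are only lowercased."""
    result = []
    for token in tokens:
        lemmas = lexicon.get(token.lower())
        # Ambiguous form: arbitrary but deterministic choice (an assumption).
        result.append(min(lemmas) if lemmas else token.lower())
    return result


lexicon = build_lexicon(ENTRIES)
print(normalize(["Belgiju", "belgijskih", "Zagreb"], lexicon))
```

The token `Zagreb` is not in the sample lexicon and passes through unchanged, which is the "limited coverage" drawback named on the slide.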

Text preprocessing 3

HML (Tadić & Fulgosi 2003, http://hml.ffzg.hr)
  111,943 lemmas, 3.9+ million word-forms
  assembled manually, almost error-free

Molex (Šnajder et al. 2008)
  inflectional + derivational normalization
  inflectional lexicon acquired from raw corpus
  70,000 lemmas acquired from 50 million corpus
  good coverage, good normalization quality

both lexica converted to FSA
  fast access, low memory requirements

publication
  Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko: Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing & Management, vol. 44, no. 5, 1720-1731, 2008.
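Why a finite-state representation gives fast access and low memory use can be seen in a small sketch. The slides do not say which FSA construction the project uses; the code below assumes the simplest case, an unminimized trie (a deterministic acyclic automaton that shares prefixes only). The class name and its methods are hypothetical.

```python
class TrieLexicon:
    """Deterministic acyclic automaton (plain trie) mapping word-forms to lemmas."""

    def __init__(self):
        self.transitions = [{}]  # state number -> {character: next state}
        self.outputs = {}        # accepting state -> lemma

    def add(self, form, lemma):
        state = 0
        for ch in form:
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                # Create a new state only where the prefix is not shared yet.
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][ch] = nxt
            state = nxt
        self.outputs[state] = lemma

    def lookup(self, form):
        """Follow one transition per character: time linear in the word length."""
        state = 0
        for ch in form:
            state = self.transitions[state].get(ch)
            if state is None:
                return None
        return self.outputs.get(state)


fsa = TrieLexicon()
for form in ("belgije", "belgiji", "belgijom"):
    fsa.add(form, "belgija")
print(fsa.lookup("belgijom"), len(fsa.transitions))
```

The three forms have 22 characters in total but need only 11 states, because the prefix `belgij` is stored once. A minimized automaton would also share common suffixes, which matters for an inflectional lexicon where many lemmas take the same endings.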

Document clustering

Artur Šilić

KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems

Faculty of Electrical Engineering and Computing, University of Zagreb

[email protected]

Document clustering

experiment on clustering performed on NN9225 corpus
  9225 documents from Narodne novine

clusters evaluated with
  1. Eurovoc thesaurus (top level categories)
  2. Source groups defined by HIDRA (according to the field of competence of official bodies)

Evaluation

Source groups by HIDRA (25)
  Social activities and human rights
  Social politics
  Finance
  Economy and trade
  Construction and urbanism
  Industry and technology
  Information, documentation and media
  Communication and information technology
  Energy
  Agriculture, forestry and fisheries
  Politics and public administration
  Environment
  Education
  Defense, interior affairs and national security
  International business
  Culture and national heritage
  Law and judiciary
  Transportation
  Labor and employment
  Tourism
  Health care
  Science and research
  Science and research, natural and applied sciences
  Science and research, technical sciences
  Local and regional government

Top level Eurovoc categories (27)

Document clustering

visualisation performed with PCA

Document clustering

[Figure: "Cluster - Category Overlapping". x-axis: Cluster (1 to 21); y-axis: Percentage of overlapping documents (0 to 100); series: Eurovoc - ig, Hidra - ig, Eurovoc - chi2, Hidra - chi2. Category labels as listed on the chart: Finance; Geography; Politics; Transport; Agriculture, forestry and fisheries; Agriculture, forestry and fisheries; Finance; Trade; Transport; Transport; Social questions; Politics; Employment and working conditions; Agriculture, forestry and fisheries; Employment and working conditions; Production, technology and research; Trade; Social questions; Social questions; Law; Social questions]

Overlapping of clusters and categories
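One plausible way to compute a cluster-category overlap of the kind plotted here is sketched below. The slides do not define the measure, so this is an assumption: for each cluster, take the category carried by the largest number of its documents and report that share as a percentage. The function name and the sample data are made up for the illustration.

```python
from collections import Counter, defaultdict


def cluster_category_overlap(clusters, categories):
    """clusters: one cluster id per document.
    categories: one collection of category labels per document.
    Returns {cluster: (best category, percentage of the cluster carrying it)}."""
    sizes = Counter(clusters)
    counts = defaultdict(Counter)
    for cluster, cats in zip(clusters, categories):
        counts[cluster].update(set(cats))  # count each label once per document
    result = {}
    for cluster, size in sizes.items():
        if not counts[cluster]:
            result[cluster] = (None, 0.0)  # no document in the cluster is labelled
            continue
        # Highest count wins; ties are broken alphabetically for determinism.
        label, hits = min(counts[cluster].items(), key=lambda kv: (-kv[1], kv[0]))
        result[cluster] = (label, 100.0 * hits / size)
    return result


# Made-up assignments for five documents in two clusters.
clusters = [1, 1, 1, 2, 2]
categories = [{"Finance"}, {"Finance", "Trade"}, {"Law"}, {"Transport"}, {"Transport"}]
for cluster, (label, pct) in sorted(cluster_category_overlap(clusters, categories).items()):
    print(cluster, label, round(pct, 1))
```

Running the same computation once with Eurovoc top-level categories and once with HIDRA source groups as the labels would give two percentages per cluster that can be compared side by side.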

Conclusion on clustering

content-based labels (Eurovoc descriptors) are better at separating the data with respect to K-means clusters

publication
  Šilić, Artur; Moens, Marie-Francine; Žmak, Lovro; Dalbelo Bašić, Bojana: Comparing Document Classification Schemes Using K-Means Clustering. Lecture Notes in Artificial Intelligence. 5117 (2008)

eCADIS: a system for automatic indexing of documents with Eurovoc

Frane Šarić

KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems

Faculty of Electrical Engineering and Computing, University of Zagreb

[email protected]

CADIS and eCADIS system

CADIS: Computer Aided Document Indexing System
  a workstation that speeds up the human document indexing (AIDE project)

eCADIS: enhanced CADIS
  application of machine learning techniques
  automatic suggestion of relevant descriptors, i.e. automatic indexing

eCADIS: two parallel windows
  document window
  Eurovoc browser window

Eurovoc browser window

Document window

Leuven, 2007-05-22

eCADIS features

list of n-grams
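A list of n-grams for a document can be produced with a few lines of code. The slide only names the feature; how eCADIS tokenizes, normalizes, filters or ranks n-grams is not stated, so the sketch below simply counts contiguous word n-grams up to a chosen length. The function name and the `n_max` parameter are hypothetical.

```python
import re
from collections import Counter


def ngrams(text, n_max=3):
    """Count all contiguous word n-grams of length 1..n_max in the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


sample = "official journal of the Republic of Croatia, official journal"
for gram, count in ngrams(sample, n_max=2).most_common(3):
    print(count, gram)
```

For a morphologically rich language the counting would normally run over lemmas rather than raw word-forms, so that inflected variants of the same phrase are counted together.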

eCADIS features

automatic suggestion of relevant descriptors, i.e. automatic indexing

eCADIS with documents in English

Comparison
  manually attached descriptors
  vs. automatically assigned by eCADIS
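A comparison of this kind is conventionally summarized with precision, recall and F1 over the two descriptor sets. The slide does not say which measure was used, so that choice is an assumption here, and the descriptor names in the example are invented.

```python
def compare(manual, automatic):
    """Return (precision, recall, F1) of automatic descriptors against manual ones."""
    manual, automatic = set(manual), set(automatic)
    common = manual & automatic
    precision = len(common) / len(automatic) if automatic else 0.0
    recall = len(common) / len(manual) if manual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if common else 0.0
    return precision, recall, f1


# Invented descriptor sets for one document.
manual = {"fisheries", "trade policy", "customs tariff", "import"}
automatic = {"trade policy", "import", "export"}
precision, recall, f1 = compare(manual, automatic)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers how many suggested descriptors an indexer would accept, and recall how many of the indexer's descriptors the system found. In a computer-aided setting, where a human confirms the suggestions, recall is often the more important of the two.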

Intelligent web-based search engine

Jure Mijić

KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems

Faculty of Electrical Engineering and Computing, University of Zagreb

[email protected]

Intelligent web-based search engine

search engine object oriented model
  uses Text Mining Tools library (KTLab)

features
  morphological normalization
  support for structured documents
  two search procedures
    phrase searching
    language modelling
  searching the document text and document title
  searching the document for Eurovoc descriptors and non-descriptors (in English or Croatian)
  easy development and implementation of new procedures
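The language-modelling search procedure can be illustrated with a query-likelihood ranker. The slides name the approach but give no details, so the smoothing method (Jelinek-Mercer), the weight `lam` and the function name are assumptions. The documents are assumed to be already tokenized and morphologically normalized, and the three sample documents are made up.

```python
import math
from collections import Counter


def rank(query, docs, lam=0.5):
    """Rank documents by log P(query | document) with Jelinek-Mercer smoothing.
    query: list of terms; docs: {document id: list of terms}."""
    collection = Counter()
    for terms in docs.values():
        collection.update(terms)
    total = sum(collection.values())
    scores = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        score = 0.0
        for term in query:
            p_doc = tf[term] / len(terms) if terms else 0.0
            p_col = collection[term] / total if total else 0.0
            p = lam * p_doc + (1 - lam) * p_col
            if p == 0.0:
                continue  # term unseen in the whole collection: ignore it
            score += math.log(p)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Made-up, already lemmatized documents.
docs = {
    "d1": ["zakon", "o", "porez"],
    "d2": ["zakon", "o", "ribarstvo"],
    "d3": ["odluka", "o", "cijena"],
}
print(rank(["zakon", "ribarstvo"], docs))
```

Mixing the document model with the collection model keeps a document from scoring zero when it lacks one query term, so `d1` (which contains only `zakon`) still ranks above `d3` (which contains neither term).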

Intelligent web-based search engine

search engine model scheme

Intelligent web-based search engine

evaluation of the search engine performance
  INEX workshop, December 2008, Dagstuhl, Germany
    Ad Hoc Track
  Wikipedia collection used
    660,000 structured documents in XML format
    4.6 GB in size
  focused retrieval ranking
    29th place out of 76 runs
  article retrieval ranking
    9th place out of 76 runs


Project summary

Bojana Dalbelo Bašić

KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems

Faculty of Electrical Engineering and Computing, University of Zagreb

[email protected]

Project summary

activities
  5 project meetings
  Jure Mijić: 3-month stay at KU Leuven
  Frane Šarić: 4-month stay at KU Leuven
  2 invited lectures at KU Leuven
  MIPRO 2008: invited lecture about CADIAL at „Local government“ section
  participation at INEX workshop
  participation at ITI 2007, ITI 2008, ACL 2008, CICLing 2009
  participation at ITN2008

award
  VIDI e-innovation for eCADIS

Project summary

published research results
  1 book chapter
  1 invited lecture
  2 journal papers + 1 accepted for publication
  6 conference papers + 1 accepted for publication

book about CADIAL in preparation
  accepted for publication

Project summary: references

book chapters
  Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / (eds.) J. Van Nieuwenhove & P. Popelier. Die Keure, Brugge. 2008.

invited lectures
  Dalbelo Bašić, Bojana. Collocation Extraction Using Genetic Programming // Department of Computer Science, Catholic University Leuven, 2008.
  Šnajder, Jan; Tadić, Marko. Morphological Normalization // Department of Computer Science, Catholic University Leuven, 2008.
  Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing System for Accessing Legislation // Instituut voor Constitutioneel Recht K.U. Leuven, 2007.

journal papers
  Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing & Management, vol. 44, no. 5, 1720-1731, 2008.
  Moens, Marie-Francine. Information Extraction: The Power of Words and Pictures // Journal of computing and information technology, vol. 15, no. 4, 295-304, 2007.

Project summary: references 2

conference papers
  Šnajder, Jan; Dalbelo Bašić, Bojana; Petrović, Saša; Sikiric, Ivan. Evolving New Lexical Association Measures Using Genetic Programming // Proceedings of ACL-08: HLT, Short Papers. Columbus, Ohio: Association for Computational Linguistics, 181-184, 2008.
  Šilić, Artur; Moens, Marie-Francine; Žmak, Lovro; Dalbelo Bašić, Bojana. Comparing Document Classification Schemes Using K-Means Clustering // Lecture Notes in Artificial Intelligence, vol. 5117, no. 1, 615-624, 2008.
  Agić, Željko; Tadić, Marko; Dovedan, Zdravko. Investigating Language Independence in HMM PoS/MSD-Tagging // Proceedings of the 30th International Conference on Information Technology Interfaces / (eds.) Lužar-Stiffler, Vesna; Hljuz Dobrić, Vesna; Bekić, Zoran. Zagreb: SRCE University Computer Centre, University of Zagreb, 2008. 657-662.
  Mijić, Jure; Dalbelo Bašić, Bojana; Šnajder, Jan. Building a Search Engine Model with Morphological Normalization Support // Proceedings of the ITI 2008 30th Int. Conf. on Information Technology Interfaces / (eds.) Lužar-Stiffler, Vesna; Hljuz Dobrić, Vesna; Bekić, Zoran. Zagreb: SRCE, 2008. 619-624.
  Šnajder, Jan; Dalbelo Bašić, Bojana. Higher-order Functional Representation of Croatian Inflectional Morphology // Proceedings of the Sixth International Conference on Formal Approaches to South Slavic and Balkan Languages / (eds.) Tadić, Marko; Dimitrova-Vulchanova, Mila; Koeva, Svetla. Zagreb: Croatian Language Technologies Society, 2008. 121-130.
  Dalbelo Bašić, Bojana; Dovedan, Zdravko; Raffaelli, Ida; Seljan, Sanja; Tadić, Marko. Computational Linguistic Models and Language Technologies for Croatian // ITI 2007 Proceedings of the 29th International Conference on Information Technology Interfaces / (eds.) Lužar-Stiffler, Vesna; Hljuz Dobrić, Vesna. Zagreb: SRCE, 2007. 521-528.
  Šilić, Artur; Šarić, Frane; Dalbelo Bašić, Bojana; Šnajder, Jan. TMT: Object-Oriented Text Classification Library // ITI 2007 Proceedings of the 29th International Conference on Information Technology Interfaces / (eds.) Lužar-Stiffler, Vesna; Hljuz Dobrić, Vesna. Zagreb: SRCE, 2007. 559-566.

Thanks go to

Neda Erceg and Maja Cvitaš (HIDRA)

Flemish government for recognising the importance of this idea

Croatian government for cofinancing the project

Promotor: Prof. Marie-Francine Moens, KU Leuven

CADIAL perspectives

Marie-Francine Moens

Dept. of Computer Science, Katholieke Universiteit Leuven

[email protected]

Possible CADIAL perspectives

eCADIS for other languages
  now only Croatian and English (β-version) covered
  usable for other languages as it is, but without the linguistic module
    less efficient
    no list of lemmas, but types
    poor statistics for n-grams
  cooperation with language technology
  training the automatic indexing system for other languages
    automatic suggestions of relevant descriptors in other languages

Possible CADIAL perspectives

new multilingual intelligent search engine
  associative terms = plain language equivalents of Eurovoc descriptors
  associative models built by machine learning
  querying available using also associative terms, i.e. near-synonymy
  cross-lingual search
    beside Croatian also for Dutch and English

possible additional features for search engine
  “light” NLP of query text
    chunking, NERC, lemmatisation, partial shallow parsing
    first in Croatian, but extendable to other languages
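Querying with associative terms amounts to a form of query expansion. The slide proposes learning the associations by machine learning; the sketch below skips that step and assumes a ready-made table from plain-language terms to descriptors. The table contents and the function name are invented for the illustration.

```python
def expand_query(terms, associations):
    """Add the descriptors associated with each query term, keeping order
    and dropping duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *associations.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Invented associations: plain-language term -> thesaurus descriptors.
associations = {
    "jobs": ["employment"],
    "pay": ["remuneration of work", "employment"],
}
print(expand_query(["jobs", "pay"], associations))
```

Combined with the 1:1 translations of descriptors, an expansion of this kind would let a query in one language reach documents indexed in another, which is the cross-lingual search the slide envisages.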

Thank you for your attention!