Project meeting, Zagreb, 2007-11-12
Computer Aided Document Indexing for Accessing Legislation
Joint Flemish-Croatian project
5th project meeting, Zagreb, 2009-03-12
CADIAL introduction
joint Flemish-Croatian project supported by
  Government of Flanders (grant KRO/009/06)
  Ministry of Science, Education and Sports of the Republic of Croatia
aim of the project: transfer of expert knowledge
  from: Department of Computer Science at the K. U. Leuven
  to: Croatian governmental agency HIDRA
why: ensuring the infrastructure for durable public access to Croatian legislative documents in national and multilingual European context
CADIAL introduction 2
project partners
Department of Computer Science, Katholieke Universiteit Leuven, Belgium Prof. Marie-Francine Moens (promotor)
Faculty of Electrical Engineering and Computing (FER), University of Zagreb, Croatia Prof. Bojana Dalbelo Bašić (partner leader)
Faculty of Humanities and Social Sciences (FFZG), University of Zagreb, Croatia Prof. Marko Tadić
Croatian Information Documentation Referral Agency (HIDRA), Croatia Neda Erceg; Maja Cvitaš, M.Sc.
CADIAL introduction 3
expected results
publicly accessible textual database of 15,000 indexed Croatian legislative documents
system for enriched computer aided document indexing (eCADIS) developed at the University of Zagreb
intelligent web-based search engine for accessing this database
all expected results achieved
CADIAL introduction 4
wider social impact of the project
public web-service that ensures the accessibility and transparency of legislative documentation of the Republic of Croatia
accessible also to users from abroad by usage of existing multilingual versions of Eurovoc
direct contribution to the alignment of the Republic of Croatia with EU standards
EUROVOC: official documentation thesaurus of EU bodies and the majority of EU parliaments
http://europa.eu/eurovoc
  6625 descriptors (keywords)
  6 hierarchical levels
  21 top level categories
  descriptors translated (1:1) to 22 EU official languages + Croatian
Conversion of documents
Marko Tadić
University of Zagreb, Faculty of Humanities and Social Sciences
Department of Linguistics
Conversion
web-site of the official journal of the Republic of Croatia, Narodne novine (http://www.nn.hr)
  e-text identical to the official version
  downloaded in HTML format
    generated with at least six different programs
    six different internal HTML structures (tags, divs…)
expected XML format uniform for all versions of input HTML
  conversion must be consistent, yet flexible
two subtasks
  conversion with user-defined script (html2xml tool)
  validation of converted XML documents (DTD developed after manual investigation of documents)
ended with 11,248 documents with more to come
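The two subtasks above (conversion to one uniform XML format, then validation) can be illustrated with a minimal Python sketch. This is not the html2xml tool itself: the element names `document`, `title`, `body` and `p`, the class `TextBlocks`, and the functions `html_to_xml` and `is_valid` are assumptions made for illustration, and the structural check only stands in for real DTD validation, which the Python standard library does not provide.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TextBlocks(HTMLParser):
    """Collect text blocks from block-level tags, whatever program made the HTML."""

    BLOCK_TAGS = {"p", "div", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def _flush(self):
        # Join buffered text, collapse whitespace, keep non-empty blocks only.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def html_to_xml(html):
    """Map any HTML variant to the same (hypothetical) XML layout."""
    parser = TextBlocks()
    parser.feed(html)
    parser.close()
    doc = ET.Element("document")
    if parser.blocks:
        # Assumption: the first text block is the document title.
        ET.SubElement(doc, "title").text = parser.blocks[0]
        body = ET.SubElement(doc, "body")
        for block in parser.blocks[1:]:
            ET.SubElement(body, "p").text = block
    return doc


def is_valid(doc):
    """Stand-in for DTD validation: document = (title, body), body = p+."""
    if doc.tag != "document" or [child.tag for child in doc] != ["title", "body"]:
        return False
    body = doc.find("body")
    return len(body) > 0 and all(p.tag == "p" for p in body)


variant_a = "<html><body><h1>Zakon o radu</h1><p>Članak 1.</p><p>Članak 2.</p></body></html>"
variant_b = "<div class='naslov'>Zakon o radu</div><div>Članak 1.</div><div>Članak 2.</div>"
xml_a = html_to_xml(variant_a)
xml_b = html_to_xml(variant_b)
```

Both input variants yield the same XML tree, which is the "uniform for all versions of input HTML" property; a document that fails `is_valid` would be flagged for manual inspection.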
Text preprocessing
Jan Šnajder
KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems
Faculty of Electrical Engineering and Computing, University of Zagreb
[email protected]
Text preprocessing
belgija belgija N=fpg
belgijama belgija N=fpd
belgijama belgija N=fpl
belgijama belgija N=fpi
belgije belgija N=fsg
belgije belgija N=fpn
belgije belgija N=fpa
belgije belgija N=fpv
belgiji belgija N=fsd
belgiji belgija N=fsl
belgijo belgija N=fsv
belgijom belgija N=fsi
belgiju belgija N=fsa
belgijska belgijski Aspnpn
belgijska belgijski Aspnpa
belgijska belgijski Aspnpv
belgijska belgijski Aspfsn
belgijska belgijski Aspfsv
belgijske belgijski Aspmpa
belgijske belgijski Aspfsg
belgijske belgijski Aspfpn
belgijske belgijski Aspfpa
belgijske belgijski Aspfpv
belgijski belgijski Aspmsn
belgijski belgijski Aspmsa
belgijski belgijski Aspmsv
belgijski belgijski Aspmpn
belgijski belgijski Aspmpv
belgijskih belgijski Aspmpg
belgijskih belgijski Aspnpg
belgijskih belgijski Aspfpg
Belgijanac belgijanac N=msn
Belgijanac belgijanac N=msa
Belgijanaca belgijanac N=mpg
Belgijanca belgijanac N=msg
Belgijance belgijanac N=mpa
Belgijancem belgijanac N=msi
Belgijanci belgijanac N=mpn
Belgijanci belgijanac N=mpv
Belgijancima belgijanac N=mpd
Belgijancima belgijanac N=mpl
Belgijancima belgijanac N=mpi
Belgijancu belgijanac N=msd
Belgijancu belgijanac N=msl
Belgijanče belgijanac N=msv
belgijanaka belgijanka N==pg
Belgijanci belgijanka N==sd
Belgijanci belgijanka N==sl
Belgijanci belgijanka N=fsd
Belgijanci belgijanka N=fsl
Belgijanka belgijanka N==sn
Belgijanka belgijanka N==pg
Belgijanka belgijanka N=fsn
Belgijanka belgijanka N=fpg
Belgijankama belgijanka N==pd
Belgijankama belgijanka N==pl
Belgijankama belgijanka N==pi
Belgijankama belgijanka N=fpd
Belgijankama belgijanka N=fpl
Belgijankama belgijanka N=fpi
Belgijanke belgijanka N==sg
Belgijanke belgijanka N==pn
Belgijanke belgijanka N==pa
Belgijanke belgijanka N==pv
Belgijanke belgijanka N=fsg
Belgijanke belgijanka N=fpn
Belgijanke belgijanka N=fpa
Belgijanke belgijanka N=fpv
Belgijanki belgijanka N==sd
Belgijanki belgijanka N==sl
Belgijanki belgijanka N==sl
Belgijanki belgijanka N=fsd
Belgijanki belgijanka N=fsl
Belgijanko belgijanka N==sv
Belgijanko belgijanka N=fsv
Belgijankom belgijanka N==si
Belgijankom belgijanka N=fsi
Belgijanku belgijanka N==sa
Belgijanku belgijanka N=fsa
Belgijac belgijac N=msa
Belgijaca belgijac N=mpg
Belgijca belgijac N=msg
Belgijce belgijac N=msv
Belgijce belgijac N=mpa
Belgijcem belgijac N=msi
Belgijci belgijac N=mpn
Belgijci belgijac N=mpv
Belgijcima belgijac N=mpd
Belgijcima belgijac N=mpl
Belgijcima belgijac N=mpi
Belgijcom belgijac N=msi
Belgijcu belgijac N=msd
Belgijcu belgijac N=msl
Belgijče belgijac N=msv
Text preprocessing 2
before indexing: morphological normalization
  Croatian language is morphologically complex
  inflectional + derivational normalization
used in
  document classification
  automatic indexing
  search engine
lexicon-based normalization procedure
  advantage: good normalization performance
  drawback: limited coverage
two approaches
  A: Croatian Morphological Lexicon (HML)
  B: Automatically acquired lexicon (Molex)
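A minimal Python sketch of lexicon-based normalization, using a handful of (word-form, lemma) pairs taken from the lexicon excerpt in this deck. The function names `build_lexicon` and `normalize` are hypothetical, and lowercasing the lookup key is a simplifying assumption; neither HML nor Molex is claimed to work this way internally.

```python
from collections import defaultdict

# (word-form, lemma) pairs copied from the lexicon excerpt in this deck.
PAIRS = [
    ("belgije", "belgija"),
    ("belgijom", "belgija"),
    ("belgijskih", "belgijski"),
    ("Belgijanci", "belgijanac"),
    ("Belgijanci", "belgijanka"),
    ("Belgijcima", "belgijac"),
]


def build_lexicon(pairs):
    """Map each lowercased word-form to the set of lemmas it can belong to."""
    lexicon = defaultdict(set)
    for form, lemma in pairs:
        lexicon[form.lower()].add(lemma)
    return dict(lexicon)


def normalize(token, lexicon):
    """Return (sorted lemmas, covered). Unknown tokens fall back to themselves."""
    lemmas = lexicon.get(token.lower())
    if lemmas is None:
        return [token.lower()], False
    return sorted(lemmas), True


lexicon = build_lexicon(PAIRS)
covered_example = normalize("Belgije", lexicon)
ambiguous_example = normalize("Belgijanci", lexicon)
uncovered_example = normalize("Zagreba", lexicon)
```

The first lookup shows the advantage (an inflected form maps cleanly to its lemma), the second shows that one word-form can belong to more than one lemma, and the third shows the drawback named above: a form missing from the lexicon is simply not normalized.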
Text preprocessing 3
HML (Tadić & Fulgosi 2003, http://hml.ffzg.hr)
  111,943 lemmas, 3.9+ million word-forms
  assembled manually, almost error-free
Molex (Šnajder et al. 2008)
  inflectional + derivational normalization
  inflectional lexicon acquired from raw corpus
  70,000 lemmas acquired from 50 million corpus
  good coverage, good normalization quality
both lexica converted to FSA
  fast access, low memory requirements
publication
  Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko: Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing & Management, vol. 44, no. 5, 1720-1731, 2008.
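Why an automaton gives fast access with little memory can be pictured with a trie, which is a simple acyclic finite-state automaton that shares common prefixes. The Python sketch below is only an illustration under that assumption: the class `TrieLexicon` and the function `count_states` are hypothetical names, and a production FSA would typically also be minimised so that common suffixes are shared.

```python
class TrieLexicon:
    """Word-form -> lemmas lookup stored as a prefix-sharing automaton."""

    FINAL = ""  # key that marks an accepting state; never a real character

    def __init__(self):
        self.root = {}

    def add(self, form, lemma):
        node = self.root
        for ch in form:
            node = node.setdefault(ch, {})
        node.setdefault(self.FINAL, set()).add(lemma)

    def lookup(self, form):
        """Follow one transition per character; cost depends on len(form) only."""
        node = self.root
        for ch in form:
            node = node.get(ch)
            if node is None:
                return set()
        return set(node.get(self.FINAL, ()))


def count_states(node):
    """Count automaton states (dict nodes), including the start state."""
    return 1 + sum(
        count_states(child) for key, child in node.items() if key != TrieLexicon.FINAL
    )


trie = TrieLexicon()
for form in ("belgije", "belgiji", "belgiju"):
    trie.add(form, "belgija")
```

The three forms have 21 characters in total, but the automaton needs only 10 states because the prefix "belgij" is stored once.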
Document clustering
Artur Šilić
KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems
Faculty of Electrical Engineering and Computing, University of Zagreb
Document clustering
experiment on clustering performed on NN9225 corpus
  9225 documents from Narodne novine
clusters evaluated with
  1. Eurovoc thesaurus (top level categories)
  2. Source groups defined by HIDRA (according to the field of competence of official bodies)
Evaluation
Source groups by HIDRA (25)
  Social activities and human rights
  Social politics
  Finance
  Economy and trade
  Construction and urbanism
  Industry and technology
  Information, documentation and media
  Communication and information technology
  Energy
  Agriculture, forestry and fisheries
  Politics and public administration
  Environment
  Education
  Defense, interior affairs and national security
  International business
  Culture and national heritage
  Law and judiciary
  Transportation
  Labor and employment
  Tourism
  Health care
  Science and research
  Science and research, natural and applied sciences
  Science and research, technical sciences
  Local and regional government
Top level Eurovoc categories (27)
Document clustering
visualisation performed with PCA
Overlapping of clusters and categories
[Figure: chart "Cluster - Category Overlapping". x-axis: Cluster (1 to 21); y-axis: Percentage of overlapping documents (0 to 100); series: Eurovoc - ig, Hidra - ig, Eurovoc - chi2, Hidra - chi2. Category labels as they appear on the chart, in order: Finance; Geography; Politics; Transport; Agriculture, forestry and fisheries; Agriculture, forestry and fisheries; Finance; Trade; Transport; Transport; Social questions; Politics; Employment and working conditions; Agriculture, forestry and fisheries; Employment and working conditions; Production, technology and research; Trade; Social questions; Social questions; Law; Social questions]
Conclusion on clustering
content-based labels (Eurovoc descriptors) are better in separating the data with respect to K-means clusters
publication
  Šilić, Artur; Moens, Marie-Francine; Žmak, Lovro; Dalbelo Bašić, Bojana: Comparing Document Classification Schemes Using K-Means Clustering. Lecture Notes in Artificial Intelligence. 5117 (2008)
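One plausible way to read "percentage of overlapping documents" is, for each cluster, the share of its documents that carry the cluster's most frequent category. The Python sketch below computes that figure under this assumption; it is not the evaluation procedure from the cited paper, the function name `cluster_category_overlap` is hypothetical, cluster assignments are assumed to come from a K-means run done elsewhere, and each document is simplified to a single category label.

```python
from collections import Counter, defaultdict


def cluster_category_overlap(cluster_ids, categories):
    """For each cluster return (dominant category, % of its documents in it)."""
    by_cluster = defaultdict(Counter)
    for cluster_id, category in zip(cluster_ids, categories):
        by_cluster[cluster_id][category] += 1

    result = {}
    for cluster_id, counts in by_cluster.items():
        label, hits = counts.most_common(1)[0]
        result[cluster_id] = (label, 100.0 * hits / sum(counts.values()))
    return result


# Toy data: seven documents, two clusters, invented category assignments.
cluster_ids = [0, 0, 0, 0, 1, 1, 1]
categories = ["Finance", "Finance", "Finance", "Law",
              "Transport", "Transport", "Transport"]
overlap = cluster_category_overlap(cluster_ids, categories)
```

Running the same computation once with Eurovoc categories and once with HIDRA source groups as `categories` would give two overlap profiles that can be compared cluster by cluster.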
eCADIS: a system for automatic indexing of documents with Eurovoc
Frane Šarić
KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems
Faculty of Electrical Engineering and Computing, University of Zagreb
[email protected]
CADIS and eCADIS system
CADIS: Computer Aided Document Indexing System
  a workstation that speeds up the human document indexing (AIDE project)
eCADIS: enhanced CADIS
  application of machine learning techniques
  automatic suggestion of relevant descriptors, i.e. automatic indexing
eCADIS: two parallel windows
  document window
  Eurovoc browser window
Eurovoc browser window
Document window
eCADIS features
list of n-grams
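A list of n-grams for a document can be produced by counting every run of 1 to n consecutive tokens. The Python sketch below shows the idea only; the function name `ngram_counts` and the default `n_max=3` are assumptions, and it does not reproduce how eCADIS selects or ranks the n-grams it displays.

```python
from collections import Counter


def ngram_counts(tokens, n_max=3):
    """Count all n-grams of length 1..n_max over a token sequence."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


tokens = "ministry of science ministry of education".split()
counts = ngram_counts(tokens)
frequent = [gram for gram, freq in counts.items() if freq > 1]
```

In practice the tokens would first be morphologically normalized, so that inflected variants of the same phrase are counted together.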
eCADIS features
automatic suggestion of relevant descriptors, i.e. automatic indexing
eCADIS with documents in English
comparison: manually attached descriptors vs. descriptors automatically assigned by eCADIS
Intelligent web-based search engine
Jure Mijić
KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems
Faculty of Electrical Engineering and Computing, University of Zagreb
Intelligent web-based search engine
search engine
  object oriented model
  uses Text Mining Tools library (KTLab)
features
  morphological normalization
  support for structured documents
  two search procedures
    phrase searching
    language modelling
  searching the document text and document title
  searching the document for Eurovoc descriptors and non-descriptors (in English or Croatian)
  easy development and implementation of new procedures
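The two search procedures can be sketched in a few lines of Python. The sketch assumes a standard query-likelihood language model with Jelinek-Mercer smoothing, which is one common choice and not necessarily the model used in the project; the function names `phrase_match`, `query_log_likelihood` and `rank`, the smoothing weight `lam`, and the toy documents are all hypothetical. Tokens are assumed to be morphologically normalized already.

```python
import math
from collections import Counter


def phrase_match(doc_tokens, phrase_tokens):
    """True if the phrase occurs as a contiguous token sequence in the document."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """log P(query | doc), mixing document and collection term frequencies."""
    doc_counts = Counter(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / len(doc) if doc else 0.0
        p_col = collection_counts[term] / collection_len
        p = lam * p_doc + (1 - lam) * p_col
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Return document names ordered from best to worst match."""
    collection_counts = Counter(t for tokens in docs.values() for t in tokens)
    collection_len = sum(collection_counts.values())
    scored = [
        (query_log_likelihood(query, tokens, collection_counts, collection_len, lam), name)
        for name, tokens in docs.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


docs = {
    "a": ["zakon", "o", "radu"],
    "b": ["zakon", "o", "porezu", "na", "dohodak"],
    "c": ["pravilnik", "o", "radu", "i", "radu"],
}
ranking = rank(["radu"], docs)
```

Smoothing with the collection model keeps a document from being scored as impossible just because one query term is missing from it, which is why document "b" still receives a finite score here.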
Intelligent web-based search engine
search engine model scheme
Intelligent web-based search engine
evaluation of the search engine performance
  INEX workshop, December 2008, Dagstuhl, Germany
  Ad Hoc Track
  Wikipedia collection used
    660,000 structured documents in XML format
    4.6 GB in size
  focused retrieval ranking: 29th place out of 76 runs
  article retrieval ranking: 9th place out of 76 runs
Project summary
Bojana Dalbelo Bašić
KTLab, Dept. of Electronics, Microelectronics, Computer and Intelligent Systems
Faculty of Electrical Engineering and Computing, University of Zagreb
Project summary
activities
  5 project meetings
  Jure Mijić: 3 month stay at KU Leuven
  Frane Šarić: 4 month stay at KU Leuven
  2 invited lectures at KU Leuven
  MIPRO 2008: invited lecture about CADIAL at „Local government“ section
  participation at INEX workshop
  participation at ITI2007, ITI2008, ACL 2008, CICLing 2009
  participation at ITN2008
  VIDI e-innovation award for eCADIS
Project summary
published research results
  1 book chapter
  1 invited lecture
  2 journal papers + 1 accepted for publication
  6 conference papers + 1 accepted for publication
  book about CADIAL in preparation, accepted for publication
Project summary: references
book chapters
  Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing for Accessing Legislation // Toegang tot de wet / (eds.) J. Van Nieuwenhove & P. Popelier. Die Keure, Brugge. 2008.
invited lectures
  Dalbelo Bašić, Bojana. Collocation Extraction Using Genetic Programming // Department of Computer Science, Catholic University Leuven, 2008.
  Šnajder, Jan; Tadić, Marko. Morphological Normalization // Department of Computer Science, Catholic University Leuven, 2008.
  Dalbelo Bašić, Bojana; Tadić, Marko; Moens, Marie-Francine. Computer Aided Document Indexing System for Accessing Legislation // Instituut voor Constitutioneel Recht K.U. Leuven, 2007.
journal papers
  Šnajder, Jan; Dalbelo Bašić, Bojana; Tadić, Marko. Automatic Acquisition of Inflectional Lexica for Morphological Normalisation // Information Processing & Management, vol. 44, no. 5, 1720-1731, 2008.
  Moens, Marie-Francine. Information Extraction: The Power of Words and Pictures // Journal of computing and information technology, vol. 15, no. 4, 295-304, 2007.
Project summary: references 2
conference papers
Šnajder, Jan; Dalbelo Bašić, Bojana; Petrović, Saša; Sikiric, Ivan. Evolving New Lexical Association Measures Using Genetic Programming // Proceedings of ACL-08: HLT, Short Papers. Columbus, Ohio : Association for Computational Linguistics, 181-184, 2008.
Šilić, Artur; Moens, Marie-Francine; Žmak, Lovro; Dalbelo Bašić, Bojana. Comparing Document Classification Schemes Using K-Means Clustering // Lecture Notes in Artificial Intelligence, vol. 5117, no. 1, 615-624, 2008.
Agić, Željko; Tadić, Marko; Dovedan, Zdravko. Investigating Language Independence in HMM PoS/MSD-Tagging // Proceedings of the 30th International Conference on Information Technology Interfaces / (eds.) Lužar-Stiffler, Vesna; Hljuz Dobrić, Vesna; Bekić, Zoran. Zagreb : SRCE University Computer Centre, University of Zagreb, 2008. 657-662.
Mijić, Jure; Dalbelo Bašić, Bojana; Šnajder, Jan. Building a Search Engine Model with Morphological Normalization Support // Proceedings of the ITI 2008 30th Int. Conf. on Information Technology Interfaces / (eds.) Lužar-Stiffler, Vesna; Hljuz Dobrić, Vesna; Bekić, Zoran. Zagreb : SRCE, 2008. 619-624.
Šnajder, Jan; Dalbelo Bašić, Bojana. Higher-order Functional Representation of Croatian Inflectional Morphology // Proceedings of the Sixth International Conference on Formal Approaches to South Slavic and Balkan Languages / (eds.) Tadić, Marko; Dimitrova-Vulchanova, Mila; Koeva, Svetla. Zagreb : Croatian Language Technologies Society, 2008. 121-130.
Dalbelo Bašić, Bojana; Dovedan, Zdravko; Raffaelli, Ida; Seljan, Sanja; Tadić, Marko. Computational Linguistic Models and Language Technologies for Croatian // ITI 2007 Proceedings of the 29th International Conference on INFORMATION TECHNOLOGY INTERFACES / (eds.) Lužar-Stiffler, Vesna ; Hljuz Dobrić, Vesna. Zagreb : SRCE, 2007. pp. 521-528.
Šilić, Artur; Šarić, Frane; Dalbelo Bašić, Bojana; Šnajder, Jan. TMT: Object-Oriented Text Classification Library // ITI 2007 Proceedings of the 29th International Conference on INFORMATION TECHNOLOGY INTERFACES / (eds.) Lužar-Stiffler, Vesna ; Hljuz Dobrić, Vesna. Zagreb : SRCE, 2007. pp. 559-566.
Thanks go to
  Neda Erceg and Maja Cvitaš (HIDRA)
  Flemish government for recognising the importance of this idea
  Croatian government for cofinancing the project
  Promotor: Prof. Marie-Francine Moens, KU Leuven
CADIAL perspectives
Marie-Francine Moens
Dept. of Computer Science, Katholieke Universiteit Leuven
Possible CADIAL perspectives
eCADIS for other languages
  now only Croatian and English (β-version) covered
  usable for other languages as it is, but without the linguistic module
    less efficient
    no list of lemmas, but types
    poor statistics for n-grams
  cooperation with language technology
  training the automatic indexing system for other languages
  automatic suggestions of relevant descriptors in other languages
Possible CADIAL perspectives
new multilingual intelligent search engine
  associative terms = plain language equivalents of Eurovoc descriptors
  associative models built by machine learning
  querying available using also associative terms, i.e. near-synonymy
  cross-lingual search
    beside Croatian also for Dutch and English
possible additional features for search engine
  “light” NLP of query text
    chunking, NERC, lemmatisation, partial shallow parsing
  first in Croatian, but extendable to other languages
Thank you for your attention!