
Automatic Indexing of Bibliographic Metadata: The AgroTagger use case

Fabrizio Celli – Food and Agriculture Organization of the UN - 27th March 2014


Before Starting…

• AGROVOC is FAO's 30-year-old multilingual vocabulary, containing more than 32,000 concepts in 22 languages (http://aims.fao.org/standards/agrovoc/about )

• AGRIS (http://agris.fao.org/ ) is a database of more than 7 million bibliographic references in agriculture
– A collaborative network of more than 150 institutions from 65 countries
– AGRIS bibliographic metadata are enhanced by AGROVOC descriptors, which is very important in the context of adopting Linked Open Data (LOD) technologies (http://agris.fao.org/content/about )

• Both are exposed as RDF



Outline

• Disambiguation
• How does it work?
• Use Case 1: indexing AGRIS resources
• Use Case 2: crawling the Web


Disambiguation

• At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to enhance bibliographic resources

• The name AgroTagger may refer to different tools:
– MIMOS-hosted IIT Kanpur AgroTagger: a tool developed in collaboration with the Indian Institute of Technology Kanpur (IITK) in 2010, built on top of the popular Keyword Extraction Engine (KEA, http://www.nzdl.org/Kea/ )


Disambiguation (2)

– A Web application developed by MIMOS in collaboration with IITK and FAO (http://kt.mimos.my/AgroTagger/)
  • It is built on top of the IITK tagging service
  • It generates keywords as RDF triples
  • It builds a tag cloud showing the most commonly extracted keywords
  • More information on AIMS: http://aims.fao.org/agrotagger



Disambiguation (3)

• «AgroTagger» also refers to a command-line application based on MAUI (https://code.google.com/p/maui-indexer/)

• There is neither a graphical interface nor a Web service on top of the application

• It is a Java API

• This is the AgroTagger presented here!


MAUI

• Maui is named after the Polynesian mythological hero and demigod, who would transform himself into different kinds of birds to perform many of his exploits

• Similarly, the Maui algorithm combines two software tools named after New Zealand native birds: Kea (the keyphrase extraction algorithm) and Weka (the machine learning toolkit used to create the topic indexing model from documents with topics assigned by people and to apply it to new documents)

• Maui automatically identifies main topics in text documents


How does it work?

• The purpose of the application is to index some Web resources (i.e. URLs) with the AGROVOC thesaurus

• The application can accept two different inputs:
– A text file with a list of URLs
– The output file of an Apache Nutch Web crawler (which contains a list of discovered URLs, but in a specific format)

• The output is a set of connections between input URLs and some extracted AGROVOC URIs
– It can be a simple text file or a set of triples (N-Triples serialization)
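The N-Triples output option can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `to_ntriples`, the choice of `dcterms:subject` as the linking predicate, the example source URL and the placeholder AGROVOC concept identifier. The slides do not state which predicate the AgroTagger actually uses.

```python
# Illustrative sketch only: the predicate and all URIs below are assumptions,
# not the documented AgroTagger output format.

def to_ntriples(connections, predicate="http://purl.org/dc/terms/subject"):
    """Serialize {source_url: [agrovoc_uri, ...]} as N-Triples text.

    Each connection becomes one line: <subject> <predicate> <object> .
    """
    lines = []
    for source_url, agrovoc_uris in connections.items():
        for uri in agrovoc_uris:
            lines.append(f"<{source_url}> <{predicate}> <{uri}> .")
    return "\n".join(lines)


# Hypothetical input: one Web resource linked to one placeholder concept URI.
example = {
    "http://example.org/report.pdf": [
        "http://aims.fao.org/aos/agrovoc/c_0000001",  # placeholder identifier
    ],
}

print(to_ntriples(example))
```

Because every connection becomes one independent line, output in this form could be appended to incrementally or loaded into a triplestore in bulk.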


[Diagram: input (a text file with a list of URLs of Web resources) → AgroTagger → output. Image source: http://www.w3.org/2001/sw/grddl-wg/tut7/images/]


How does it work?

• For each URL in the input file
– Download the resource
– Run the MAUI indexer trained with AGROVOC (the application was trained with 780 bibliographic resources manually indexed by FAO cataloguers)
– Update the output file with discovered connections (source URL -> set of AGROVOC URIs)
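The loop above can be sketched as follows. This is not the AgroTagger Java API: `index_urls`, `download` and the `extract_uris` callable are hypothetical names, and `extract_uris` merely stands in for the trained MAUI indexer, whose real interface is not shown in these slides.

```python
# Illustrative sketch of the per-URL loop; all names here are hypothetical.
from urllib.request import urlopen


def download(url, timeout=30):
    """Fetch the raw bytes of a Web resource."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def index_urls(urls, extract_uris, fetch=download):
    """Map each reachable URL to a sorted list of extracted AGROVOC URIs.

    extract_uris is a stand-in for the trained indexer: it takes the
    downloaded content and returns an iterable of concept URIs.
    """
    connections = {}
    for url in urls:
        url = url.strip()
        if not url:
            continue  # ignore blank lines in the input file
        try:
            content = fetch(url)
        except (OSError, ValueError):
            continue  # skip unreachable or malformed URLs
        connections[url] = sorted(set(extract_uris(content)))
    return connections
```

Passing the fetcher and the extractor in as parameters keeps the loop itself testable without network access or a trained model.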


Use Case 1: indexing AGRIS resources


AGRIS

• A collection of more than 7 million bibliographic references in agriculture

• AGRIS records come with AGROVOC descriptors

• An RDF-aware system
– The AGRIS database is exposed as RDF
– AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)


The problem

• Some AGRIS records have not been indexed with AGROVOC keywords

• When AGROVOC keywords are not available, an AGRIS record cannot be interlinked to external sources of information


The solution

Not yet implemented!


An example

• In 2012 AGRIS received 28,582 bibliographic records from the World Bank

• All records came with a full-text link, but with no associated keywords

• By running the AgroTagger, we were able to assign from 4 to 10 AGROVOC keywords to each World Bank resource

• We did a manual, random evaluation of the quality of the output, with good results!


AgroTagger output


Use Case 2: crawling the Web


The setting

• Objective: discovering Web resources in agriculture and interlinking them to AGRIS records

• Tools:
– Apache Nutch crawler
– AgroTagger Java API

• Final Goal: when the system displays an AGRIS record, a list of related Web resources should be available to the user


The algorithm

• The Apache Nutch Web crawler, after some tuning, crawls the Web starting from a list of preselected URLs
– The output of the crawler (a list of discovered URLs) is given to the AgroTagger

• The AgroTagger assigns some AGROVOC URIs to each URL discovered by the crawler

• AGRIS records are interlinked to these URLs if they have at least 5 AGROVOC URIs in common (the number has to be tuned)
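The matching rule in the last bullet can be sketched as a set intersection with a tunable threshold. The function name `interlink`, the dictionary-based data layout and the example identifiers are assumptions made for illustration; the slides do not describe how the actual matching algorithm is implemented.

```python
# Illustrative sketch of the "at least N common AGROVOC URIs" rule.
# Function name, data layout and identifiers are hypothetical.

def interlink(agris_records, web_resources, min_common=5):
    """Return (record_uri, web_url, n_common) for every qualifying pair.

    Both arguments map an identifier to an iterable of AGROVOC URIs.
    A pair qualifies when it shares at least min_common concept URIs.
    """
    links = []
    for record_uri, record_concepts in agris_records.items():
        record_set = set(record_concepts)
        for url, url_concepts in web_resources.items():
            common = record_set & set(url_concepts)
            if len(common) >= min_common:
                links.append((record_uri, url, len(common)))
    return links


# Tiny hypothetical example with a lowered threshold.
records = {"agris:R1": ["c_1", "c_2", "c_3"]}
resources = {
    "http://example.org/a": ["c_1", "c_2", "c_9"],
    "http://example.org/b": ["c_7"],
}
print(interlink(records, resources, min_common=2))
```

This sketch compares every record with every resource, which is fine for a demonstration but would be slow at the scale of millions of records; an inverted index from AGROVOC URI to resources is one possible way to avoid the full pairwise comparison.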


First test: some numbers

• A first test started from the URL: http://ageconsearch.umn.edu/

• 101,000 distinct Web resources were discovered by the Web crawler and associated with AGROVOC URIs by the AgroTagger

• An algorithm tried to match AGRIS data to these resources
– E.g. the resource «http://www.waeaonline.org/WEForum/WEF-Vol.9-No.2-Fall2010.pdf» was associated with the AGRIS record «http://agris.fao.org/aos/records/US7938594»


First test: some numbers (2)

Number of AGRIS records | Common AGROVOC URIs between AGRIS and the output of the Crawler | Number of associations
900 K                   | 3                                                                | 17 million
530 K                   | 4                                                                | 1.9 million
2.3 million             | 5                                                                | 1.27 million


Future

• Other qualitative/quantitative tests
• Optimization of the algorithm to run faster
• Tuning of the physical infrastructure
• Complete automation of procedures (e.g. the output goes directly to a triplestore)
• Reach the final goal: when the system displays an AGRIS record, a list of related Web resources is available to the user


Thank you!
