Automatic Indexing of Bibliographic Metadata: The AgroTagger use case

Fabrizio Celli – Food and Agriculture Organization of the UN - 27th March 2014

DESCRIPTION

The webinar will present the keyword extractor AgroTagger. AgroTagger is a tool based on MAUI that uses the AGROVOC thesaurus as its set of allowable keywords. It reads the full text of publications and extracts the related AGROVOC keywords.

TRANSCRIPT

Page 1: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Automatic Indexing of Bibliographic Metadata: The AgroTagger use case

Fabrizio Celli – Food and Agriculture Organization of the UN - 27th March 2014

Page 2: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Before Starting…

• AGROVOC is FAO's 30-year-old multilingual vocabulary, containing more than 32 000 concepts in 22 languages (http://aims.fao.org/standards/agrovoc/about)

• AGRIS (http://agris.fao.org/) is a database of more than 7 million bibliographic references in agriculture

– A collaborative network of more than 150 institutions from 65 countries

– AGRIS bibliographic metadata are enhanced by AGROVOC descriptors, which is very important in the context of adopting LOD technologies (http://agris.fao.org/content/about)

• Both are exposed as RDF

Page 3: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Outline

• Disambiguation

• How does it work?

• Use Case 1: indexing AGRIS resources

• Use Case 2: crawling the Web

Page 4: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Disambiguation

• At a high level of abstraction, AgroTagger is a keyword extractor that uses the AGROVOC thesaurus to enhance bibliographic resources

• The name AgroTagger may refer to different tools:

– MIMOS-hosted IIT Kanpur Agrotagger: a tool developed in collaboration with the Indian Institute of Technology Kanpur (IITK) in 2010, built on top of the popular keyphrase extraction algorithm KEA (http://www.nzdl.org/Kea/)

Page 5: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Disambiguation (2)

– A Web application developed by MIMOS in collaboration with IITK and FAO (http://kt.mimos.my/AgroTagger/)

  • It is built on top of the IITK tagging service

  • It generates keywords as RDF triples

  • It builds a tag cloud showing the most commonly extracted keywords

• More information on AIMS: http://aims.fao.org/agrotagger

Page 6: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Disambiguation (3)

• «AgroTagger» also refers to a command line application based on MAUI (https://code.google.com/p/maui-indexer/)

• There is neither a graphical interface nor a Web service on top of the application

• It is a Java API

• This is the AgroTagger described in this presentation!

Page 7: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

MAUI

• Maui is named after the Polynesian mythological hero and demi-god, who would transform himself into different kinds of birds to perform many of his exploits

• Similarly, the Maui algorithm assimilates two software tools named after New Zealand native birds: Kea (the keyphrase extraction algorithm) and Weka (the machine learning toolkit, used to create the topic indexing model from documents with topics assigned by people and to apply it to new documents)

• Maui automatically identifies main topics in text documents

Page 8: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

How does it work?

• The purpose of the application is to index Web resources (i.e. URLs) with the AGROVOC thesaurus

• The application can accept two different inputs:

– A text file with a list of URLs

– The output file of an Apache Nutch Web crawler (which contains a list of discovered URLs, but in a specific format)

• The output is a set of connections between the input URLs and the extracted AGROVOC URIs

– It can be a simple text file or a set of triples (N-Triples serialization)
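To make the input and output shapes above concrete, here is a minimal Python sketch (not the AgroTagger code itself, which is a Java API). It reads a plain list of URLs and serializes URL-to-concept connections as N-Triples. The function names, the `dcterms:subject` predicate and the concept identifiers are assumptions chosen for illustration; the slides do not say which predicate the tool emits.

```python
from typing import Dict, Iterable, List

# Assumption: the slides do not name the predicate used in the output triples;
# dcterms:subject is used here purely for illustration.
SUBJECT_PREDICATE = "http://purl.org/dc/terms/subject"


def read_url_list(lines: Iterable[str]) -> List[str]:
    """Return the non-empty, stripped URLs from a plain-text input file."""
    return [line.strip() for line in lines if line.strip()]


def to_ntriples(connections: Dict[str, List[str]]) -> str:
    """Serialize {source URL: [AGROVOC URIs]} as one N-Triples statement per link."""
    statements = []
    for url, concept_uris in connections.items():
        for concept_uri in concept_uris:
            statements.append(f"<{url}> <{SUBJECT_PREDICATE}> <{concept_uri}> .")
    return "\n".join(statements) + ("\n" if statements else "")


if __name__ == "__main__":
    urls = read_url_list(["http://example.org/paper.pdf\n", "\n"])
    # Placeholder concept identifiers, not real indexing results.
    demo = {
        urls[0]: [
            "http://aims.fao.org/aos/agrovoc/c_111",
            "http://aims.fao.org/aos/agrovoc/c_222",
        ]
    }
    print(to_ntriples(demo), end="")
```

Each output line is a complete statement, so files produced for different batches of URLs can simply be concatenated before loading.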

Page 9: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

[Diagram: input (a text file with a list of URLs of Web resources) → AgroTagger → output]

Page 10: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

How does it work?

• For each URL in the input file

– Download the resource

– Run the MAUI indexer trained with AGROVOC (the application was trained with 780 bibliographic resources manually indexed by FAO cataloguers)

– Update the output file with discovered connections (source URL -> set of AGROVOC URIs)
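The three steps above can be sketched as a simple loop. This is an illustrative Python outline under stated assumptions, not the actual Java implementation: `download` and `extract_concepts` are hypothetical callables standing in for the HTTP fetch and for the trained MAUI indexer, and skipping unreachable resources is a design choice made here, not something the slides specify.

```python
from typing import Callable, Dict, Iterable, List


def index_urls(
    urls: Iterable[str],
    download: Callable[[str], str],
    extract_concepts: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map each source URL to the AGROVOC URIs extracted from its text.

    `download` and `extract_concepts` are hypothetical stand-ins: the first
    fetches the text of a resource, the second plays the role of the trained
    MAUI indexer. URLs that fail to download are skipped.
    """
    connections: Dict[str, List[str]] = {}
    for url in urls:
        try:
            text = download(url)
        except OSError:
            continue  # unreachable resource: leave it out of the output
        concepts = extract_concepts(text)
        if concepts:
            connections[url] = concepts
    return connections


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without network access or a model.
    texts = {"http://example.org/a": "maize yield under irrigation"}

    def fake_download(url: str) -> str:
        if url not in texts:
            raise OSError("unreachable")
        return texts[url]

    def fake_extract(text: str) -> List[str]:
        return ["urn:example:maize"] if "maize" in text else []

    print(index_urls(["http://example.org/a", "http://example.org/missing"],
                     fake_download, fake_extract))
```

Passing the downloader and the indexer in as parameters keeps the loop testable without a trained model; in the real tool both steps live inside the Java API.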

Page 11: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Use Case 1: indexing AGRIS resources

Page 12: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

AGRIS

• A collection of more than 7 million bibliographic references in agriculture

• AGRIS records come with AGROVOC descriptors

• An RDF-aware system

– The AGRIS database is exposed as RDF

– AGROVOC is the backbone to interlink to external sources of information (statistics, distribution maps, country profiles, germplasm data…)

Page 13: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Page 14: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

The problem

• Some AGRIS records have not been indexed with AGROVOC keywords

• When AGROVOC keywords are not available, an AGRIS record cannot be interlinked to external sources of information

Page 15: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

The solution

Not yet implemented!

Page 16: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

An example

• In 2012 AGRIS received 28,582 bibliographic records from the World Bank

• All records came with a full-text link, but with no associated keywords

• Running the AgroTagger, we were able to assign from 4 to 10 AGROVOC keywords to each World Bank resource

• We manually evaluated the quality of a random sample of the output, with good results!

Page 17: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

AgroTagger output

Page 18: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Use Case 2: crawling the Web

Page 19: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

The setting

• Objective: discovering Web resources in agriculture and interlinking them to AGRIS records

• Tools:

– Apache Nutch crawler

– AgroTagger Java API

• Final goal: when the system displays an AGRIS record, a list of related Web resources should be available to the user

Page 20: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

The algorithm

• The Apache Nutch Web crawler, after tuning, crawls the Web starting from a list of preselected URLs

– The output of the crawler (a list of discovered URLs) is given to the AgroTagger

• The AgroTagger assigns some AGROVOC URIs to each URL discovered by the crawler

• AGRIS records are interlinked to these URLs if they have at least 5 AGROVOC URIs in common (the number has to be tuned)
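The interlinking rule in the last bullet amounts to a set intersection with a threshold. The sketch below is a brute-force Python illustration, assuming both sides are already available as mappings from identifier to a set of AGROVOC URIs; the function name, data layout and identifiers are hypothetical, and the slides do not describe how the production matching is implemented.

```python
from typing import Dict, List, Set, Tuple


def interlink(
    agris_records: Dict[str, Set[str]],
    web_resources: Dict[str, Set[str]],
    min_common: int = 5,
) -> List[Tuple[str, str, int]]:
    """Return (AGRIS record, Web URL, shared count) for every pair that
    shares at least `min_common` AGROVOC URIs."""
    links = []
    for record_uri, record_concepts in agris_records.items():
        for url, url_concepts in web_resources.items():
            shared = len(record_concepts & url_concepts)
            if shared >= min_common:
                links.append((record_uri, url, shared))
    return links


if __name__ == "__main__":
    # Placeholder identifiers; the threshold is lowered to 2 for the toy data.
    agris = {"agris:rec1": {"c1", "c2", "c3"}}
    web = {
        "http://example.org/x": {"c2", "c3", "c4"},
        "http://example.org/y": {"c9"},
    }
    print(interlink(agris, web, min_common=2))
```

Comparing every record with every resource is quadratic, so at the scale of millions of AGRIS records an inverted index from AGROVOC URI to documents would likely be needed; that optimization is left out to keep the rule itself visible.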

Page 21: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

First test: some numbers

• A first test started from the URL http://ageconsearch.umn.edu/

• 101,000 distinct Web resources were discovered by the Web crawler and associated with AGROVOC URIs by the AgroTagger

• An algorithm tried to match AGRIS data to these resources

– E.g. the resource «http://www.waeaonline.org/WEForum/WEF-Vol.9-No.2-Fall2010.pdf» was associated with the AGRIS record «http://agris.fao.org/aos/records/US7938594»

Page 22: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

First test: some numbers (2)

Number of AGRIS records | Common AGROVOC URIs between AGRIS and the output of the crawler | Number of associations

900 K | 3 | 17 million

530 K | 4 | 1.9 million

2.3 million | 5 | 1.27 million

Page 23: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Future

• Other qualitative/quantitative tests

• Optimization of the algorithm to run faster

• Tuning of the physical infrastructure

• Complete automation of procedures (e.g. the output goes directly to a triplestore)

• Reach the final goal: when the system displays an AGRIS record, a list of related Web resources is available to the user

Page 24: Automatic Indexing of Bibliographic Metadata: The AgroTagger Usecase

Thank you!