Structured affiliations extraction from scientific literature

D. Tkaczyk, B. Tarnawski and Ł. Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw

24 June 2015

Introduction

CERMINE system

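[Figure: an article page and a bibliographic reference annotated with the extracted fields. Document: title, authors, affiliations, emails, abstract, keywords. Reference: author, title, source, year, pages, URL, volume.]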

CERMINE analyses born-digital scientific articles and extracts:
- document metadata, e.g. title, authors, abstract, keywords, publication date, ...
- a list of parsed bibliographic references
- the full text with its section hierarchy

This presentation

The presentation focuses on the following tasks:
- extracting a list of authors from a paper
- extracting a list of affiliations from a paper
- establishing relations between extracted authors and affiliations
- detecting institution, address and country in affiliation strings

Motivation

CERMINE can be used to:
- extract high-quality metadata from large PDF collections when it is missing or fragmentary
- provide intelligent user interfaces for metadata acquisition

Requirements

The metadata extraction system should be:
- comprehensive,
- automatic,
- modular,
- open and widely available,
- easily applicable,
- flexible and able to adapt to new layouts,
- well tested.

Architecture and Implementation

The workflow

[Figure: the extraction workflow. A PDF file passes through structure extraction, classification, splitting and association, affiliation parsing and XML record generation. The output is an XML record in which authors are linked to affiliations tagged with institution, address and country.]

Layout Extraction

1. Character extraction: iText library
2. Page segmentation: Docstrum
3. Reading order resolving: bottom-up, heuristic-based

Content Classification

- general classification (labels: metadata, references, body and other)
- metadata classification (labels: abstract, bib_info, type, title, affiliation, author, keywords, correspondence, dates and editor)
- SVM classifiers with 83 and 62 features, respectively: geometrical, lexical, sequential, formatting, heuristics
- the best SVM parameters were found automatically by maximizing the mean F-score on a validation dataset
- the classifiers are trained on 2,551 and 2,716 documents, respectively
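The parameter selection step can be illustrated with a small sketch. It assumes that "mean F-score" means the unweighted average of per-label F-scores, which the slide does not specify, and it uses made-up validation labels and parameter names. Training the SVM itself is left out.

```python
from collections import Counter


def mean_f_score(true_labels, predicted_labels):
    """Unweighted mean of per-label F-scores (assumed meaning of 'mean F-score')."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == predicted:
            tp[truth] += 1
        else:
            fp[predicted] += 1
            fn[truth] += 1
    scores = []
    for label in labels:
        denominator = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def select_best(candidates, true_labels):
    """Pick the parameter setting whose validation predictions score highest."""
    return max(candidates, key=lambda params: mean_f_score(true_labels, candidates[params]))


# Hypothetical validation zones and predictions for two parameter settings.
truth = ["metadata", "body", "body", "references", "other", "metadata"]
candidates = {
    ("C=1", "gamma=0.1"): ["metadata", "body", "other", "references", "other", "body"],
    ("C=8", "gamma=0.5"): ["metadata", "body", "body", "references", "other", "body"],
}
best = select_best(candidates, truth)
print(best, round(mean_f_score(truth, candidates[best]), 3))
```

In a real setting each candidate's predictions would come from a classifier trained with that setting and applied to the validation set.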

The output so far

TrueViz XML format:
- hierarchical structure containing: pages, zones, lines, words, characters
- all elements have bounding boxes
- reading order is given
- zones have labels

<Page>...
  <PageID Value="0"/>
  <Zone>...
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
    <Line>...
      <Word>...
        <Character>...
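As an illustration of how such output could be consumed, the sketch below reads zone labels and bounding boxes with Python's standard XML parser. The sample is a completed, hypothetical version of the elided fragment (lines, words and characters are omitted), and the function and key names are illustrative, not part of CERMINE.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample built from the elements shown on the slide.
SAMPLE = """
<Page>
  <PageID Value="0"/>
  <Zone>
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
  </Zone>
</Page>
"""


def read_zones(xml_text):
    """Return one dict per zone with its id, label, next-zone id and corner points."""
    page = ET.fromstring(xml_text.strip())
    zones = []
    for zone in page.iter("Zone"):
        corners = [
            (float(vertex.get("x")), float(vertex.get("y")))
            for vertex in zone.find("ZoneCorners").iter("Vertex")
        ]
        zones.append({
            "id": zone.find("ZoneID").get("Value"),
            "label": zone.find("Category").get("Value"),
            "next": zone.find("ZoneNext").get("Value"),
            "bbox": corners,
        })
    return zones


zones = read_zones(SAMPLE)
print(zones[0]["label"], zones[0]["bbox"])
```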

Authors and Affiliations Extraction

- authors are split based on a list of separators
- affiliation indexes are found using a list of index symbols and superscripts
- association is done by the detected indexes

- affiliations are already assigned to authors
- the first line is assumed to be the author
- the email is found by a regexp
- the remaining part is treated as the affiliation
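Both cases can be sketched with a few regular expressions. This is a simplified illustration, not CERMINE's implementation. The separator list, the index symbols, the email pattern and all names and sample strings are assumptions, and superscript detection is replaced by matching index characters at the end of a name.

```python
import re

# Assumed separator list and index symbols; the real lists are not given on the slide.
SEPARATORS = re.compile(r"\s*(?:,|;|&|\band\b)\s*")
INDEX_CHARS = "0-9*†‡"
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_authors(line):
    """Split an author line and peel trailing index characters off each name."""
    authors = []
    for part in SEPARATORS.split(line):
        part = part.strip()
        if part:
            match = re.match(rf"^(.*?)([{INDEX_CHARS}]*)$", part)
            authors.append((match.group(1).strip(), list(match.group(2))))
    return authors


def split_affiliations(lines):
    """Map a leading index character to the affiliation text that follows it."""
    affiliations = {}
    for line in lines:
        match = re.match(rf"^([{INDEX_CHARS}])\s*(.+)$", line.strip())
        if match:
            affiliations[match.group(1)] = match.group(2)
    return affiliations


def associate(author_line, affiliation_lines):
    """Case 1: link authors to affiliations through shared indexes."""
    affiliations = split_affiliations(affiliation_lines)
    return {
        name: [affiliations[i] for i in indexes if i in affiliations]
        for name, indexes in split_authors(author_line)
    }


def parse_author_block(block):
    """Case 2: first line is the author, an email is cut out, the rest is the affiliation."""
    lines = [line.strip() for line in block.strip().splitlines() if line.strip()]
    author, rest = lines[0], " ".join(lines[1:])
    email = EMAIL.search(rest)
    return {
        "author": author,
        "email": email.group(0) if email else None,
        "affiliation": EMAIL.sub("", rest).strip(" ,;"),
    }


linked = associate("M.K.1, J.I.2, T.W", ["1 University of A", "2 Institute of B"])
block = parse_author_block("Jane Doe\nInstitute of B, Warsaw, Poland\njane.doe@example.org")
print(linked)
print(block)
```

Treating every index as a single character keeps the sketch short. Multi-character indexes or names that legitimately end in a digit would need extra handling.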

Affiliation Parsing

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5A blok D, 02-106 Warsaw, Poland

- affiliation parsing detects institution, address, country
- the implementation is based on a CRF token classifier with features:
  - the classified word itself,
  - whether the token is a number, all uppercase word, all lowercase word, a lowercase word that starts with an uppercase letter,
  - whether the token is contained by dictionaries of countries or words commonly appearing in institutions or addresses,
  - the features of two preceding and two following tokens.
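A minimal sketch of how such token features might be computed is shown below. It covers only feature extraction, not the CRF model. The tokenizer, the tiny dictionaries and the feature names are assumptions made for illustration.

```python
import re

# Tiny hypothetical dictionaries; the real ones are not shown on the slide.
COUNTRIES = {"poland", "germany", "france"}
INSTITUTION_WORDS = {"university", "institute", "centre", "department"}
ADDRESS_WORDS = {"ul.", "street", "avenue", "road"}


def tokenize(text):
    """Split into words (optionally ending with a dot) and single punctuation marks."""
    return re.findall(r"\w+\.?|[^\w\s]", text)


def own_features(token):
    """Features describing one token on its own."""
    lower = token.lower()
    return {
        "word": token,
        "is_number": token.isdigit(),
        "all_upper": token.isalpha() and token.isupper(),
        "all_lower": token.isalpha() and token.islower(),
        "capitalized": token[:1].isupper() and token[1:].islower(),
        "in_countries": lower in COUNTRIES,
        "in_institution_words": lower in INSTITUTION_WORDS,
        "in_address_words": lower in ADDRESS_WORDS,
    }


def token_features(tokens, i, window=2):
    """Own features plus those of up to `window` tokens on each side, keyed by offset."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            for name, value in own_features(tokens[j]).items():
                features[f"{offset:+d}:{name}"] = value
    return features


tokens = tokenize("University of Warsaw, ul. Pawińskiego 5A, 02-106 Warsaw, Poland")
features = token_features(tokens, tokens.index("Poland"))
print(features["+0:in_countries"], features["-2:word"])
```

Each token's feature dictionary would then be passed, in sequence order, to a CRF library of choice together with the institution, address and country labels.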

Evaluation

Datasets

- GROTOAP2: the evaluation and training of the zone classifiers
- GROTOAP2-affiliations: the evaluation and training of the affiliation parser
- PubMed Central Open Access Subset: the evaluation of the entire workflow

[Figure: building the datasets from PubMed Central pairs of PDF and NLM XML files, using CERMINE tools, zone text matching and rules.]

Zone Classification

2,551 documents from GROTOAP2, containing:
- 355,779 zones
- 68,557 metadata zones

5-fold cross-validation

               metadata   other labels   precision   recall
metadata         66,372          2,185      96.8 %   97.0 %
other labels      2,052        285,170           -        -

               affiliation   other labels   precision   recall
affiliation          3,496            185      95.0 %   95.3 %
other labels           173         64,703           -        -

Affiliation Parsing

8,267 affiliations from PubMed Central

5-fold cross-validation

Token classification:

              address   country   institution   precision   recall
address        44,481        12         1,225      96.8 %   97.3 %
country            50     8,108             8      99.6 %   99.3 %
institution     1,434        18        92,457      98.7 %   98.5 %
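The precision and recall figures in the token classification table can be recomputed from the counts. The sketch below assumes that rows hold the true labels and columns the predicted labels. The slide does not state the orientation, but the reported percentages are reproduced under this reading.

```python
LABELS = ["address", "country", "institution"]

# Token counts from the table; rows = true label, columns = predicted label (assumed).
CONFUSION = [
    [44481, 12, 1225],
    [50, 8108, 8],
    [1434, 18, 92457],
]


def precision_recall(matrix, k):
    """Precision divides by the column sum for label k, recall by the row sum."""
    predicted_as_k = sum(row[k] for row in matrix)
    truly_k = sum(matrix[k])
    return matrix[k][k] / predicted_as_k, matrix[k][k] / truly_k


for k, label in enumerate(LABELS):
    precision, recall = precision_recall(CONFUSION, k)
    print(f"{label:12s} precision {100 * precision:.1f} %  recall {100 * recall:.1f} %")
```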

Affiliation metadata extraction:

- institution recognized in 92.4% of cases
- address recognized in 92.2% of cases
- country recognized in 99.5% of cases
- 92.1% of affiliations entirely correctly parsed

Workflow Evaluation

1,943 documents from PMC

Evaluated tasks:
- extracting author strings
- extracting affiliation strings
- determining author-affiliation relations
- determining author-affiliation relations, if authors and affiliations were extracted flawlessly

[Figure: bar chart of F1 scores (0 to 100) for four systems (CERMINE, GROBID, ParsCit, PDFX) on four tasks: authors, affiliations, relations (total), relations (perfect input).]

Summary

System Features

- CERMINE extracts metadata and content from scholarly articles in PDF format
- the system is based on a modular workflow
- the implementation uses machine learning and heuristics
- the default system is trained on large and diverse datasets
- the source code is open and available on GitHub
- CERMINE is available as a web service and as RESTful services

System Usage

- Java + Maven
- JAR file
- RESTful services:

$ curl -X POST --data-binary @article.pdf \
    --header "Content-Type: application/binary" \
    http://cermine.ceon.pl/extract.do

$ curl -X POST --data "affiliation=the text of the affiliation" \
    http://cermine.ceon.pl/parse.do
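The same two calls could be made from Python's standard library, roughly as sketched below. The endpoint URLs and the content type are taken from the curl commands above. The function names are illustrative, and whether the service is still reachable at this address is an assumption. Nothing is sent until `send` is called.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://cermine.ceon.pl"  # address from the slide; availability not guaranteed


def build_extract_request(pdf_bytes):
    """POST the raw PDF bytes, mirroring `curl --data-binary @article.pdf`."""
    return urllib.request.Request(
        BASE_URL + "/extract.do",
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def build_parse_request(affiliation):
    """POST a form field named `affiliation`, mirroring `curl --data`."""
    body = urllib.parse.urlencode({"affiliation": affiliation}).encode("utf-8")
    return urllib.request.Request(BASE_URL + "/parse.do", data=body, method="POST")


def send(request, timeout=60):
    """Perform the HTTP call and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


request = build_parse_request("ICM, University of Warsaw")
print(request.get_method(), request.full_url, request.data)
```

Unlike `curl --data`, `urlencode` percent-encodes the field value, which is the safer choice for affiliation strings containing commas or ampersands.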

[Figure: scatter plot of processing time in seconds (y axis, 0 to 40) against the number of pages (x axis, 0 to 40).]

Links

- CERMINE web service: http://cermine.ceon.pl
- CERMINE source code: https://github.com/CeON/CERMINE
- GROTOAP2: http://cermine.ceon.pl/grotoap2/
- GROTOAP2-affiliations: http://cermine.ceon.pl/grotoap2/affiliations/


Thank you!

linkedin.com/in/bolikowski

twitter.com/bolikowski


© 2015 ICM, University of Warsaw. This document is distributed under the CC BY 4.0 license, see: http://creativecommons.org/licenses/by/4.0/
