Structured affiliations extraction from scientific literature
D. Tkaczyk, B. Tarnawski and Ł. Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw
24 June 2015
Introduction
CERMINE system
[Figure: fields extracted by CERMINE. Document metadata: TITLE, AUTHORS, AFFILIATIONS, EMAILS, ABSTRACT, KEYWORDS. Parsed reference: AUTHOR, TITLE, SOURCE, YEAR, PAGES, URL, VOLUME.]
CERMINE analyses born-digital scientific articles and extracts:
- document metadata, e.g. title, authors, abstract, keywords, publication date, ...
- a list of parsed bibliographic references
- full text with sections hierarchy
This presentation
The presentation focuses on the following tasks:
- extracting a list of authors from a paper
- extracting a list of affiliations from a paper
- establishing relations between extracted authors and affiliations
- detecting institution, address and country in affiliation strings
Motivation
CERMINE can be used to:
- extract high-quality metadata from large PDF collections, when it is missing or fragmentary
- provide intelligent user interfaces for metadata acquisition
Requirements
The metadata extraction system should be:
- comprehensive,
- automatic,
- modular,
- open and widely available,
- easily applicable,
- flexible and able to adapt to new layouts,
- well tested.
Architecture and Implementation
The workflow
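[Figure: workflow diagram. A PDF file passes through structure extraction, classification, splitting and association, affiliation parsing, and XML record generation. Intermediate examples shown on the slide: an author line "M.K.1, J.I.2, T.W", indexed affiliation lines "1 University of" and "2 Institute of", and a parsed affiliation "Institute of ... / Warsaw, 027... / Poland".]
Resulting XML record (truncated on the slide):
<author> <aff>1</aff></author> <aff id="1"> <inst>Instit... <addr>Wars.. <country>P...</aff>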
Layout Extraction
1. Character extraction: iText library
2. Page segmentation: Docstrum
3. Reading order resolving: bottom-up, heuristic-based
Content Classification
- general classification (labels: metadata, references, body and other)
- metadata classification (labels: abstract, bib_info, type, title, affiliation, author, keywords, correspondence, dates and editor)
- SVM with 83 and 62 features, respectively: geometrical, lexical, sequential, formatting, heuristics
- the best SVM parameters were found automatically by maximizing mean F-score on a validation dataset
- classifiers are trained on 2,551 and 2,716 documents, respectively
The output so far
TrueViz XML format:
- hierarchical structure containing: pages, zones, lines, words, characters
- all elements have bounding boxes
- reading order is given
- zones have labels
<Page>...
<PageID Value="0"/>
<Zone>...
<ZoneID Value="0"/>
<ZoneCorners>
<Vertex x="55.320" y="34.295"/>
<Vertex x="235.704" y="58.295"/>
</ZoneCorners>
<ZoneNext Value="1"/>
<Category Value="TITLE"/>
<Line>...
<Word>...
<Character>...
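As a minimal sketch of how such a file might be read, the Python snippet below parses a small hand-written sample that reuses only the element names shown in the fragment above (Page, Zone, ZoneID, ZoneCorners, Vertex, ZoneNext, Category). The sample content, its second zone and the helper name `read_zones` are assumptions made for illustration; files produced by CERMINE also contain lines, words and characters, which are left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names follow the slide's fragment,
# but the second zone and all values in it are invented for illustration.
SAMPLE = """
<Page>
  <PageID Value="0"/>
  <Zone>
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
  </Zone>
  <Zone>
    <ZoneID Value="1"/>
    <ZoneCorners>
      <Vertex x="55.320" y="70.000"/>
      <Vertex x="235.704" y="82.000"/>
    </ZoneCorners>
    <Category Value="AUTHOR"/>
  </Zone>
</Page>
"""


def read_zones(xml_text):
    """Return (zone_id, label, bounding_box) tuples for the zones of one page."""
    page = ET.fromstring(xml_text.strip())
    zones = []
    for zone in page.iter("Zone"):
        zone_id = int(zone.find("ZoneID").get("Value"))
        label = zone.find("Category").get("Value")
        xs = [float(v.get("x")) for v in zone.iter("Vertex")]
        ys = [float(v.get("y")) for v in zone.iter("Vertex")]
        # Bounding box as (min_x, min_y, max_x, max_y).
        zones.append((zone_id, label, (min(xs), min(ys), max(xs), max(ys))))
    return zones


zones = read_zones(SAMPLE)
for zone_id, label, box in zones:
    print(zone_id, label, box)
```

Keeping the label and the bounding box together per zone is what the later steps rely on: zones labelled as author or affiliation can be picked out and passed on.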
Authors and Affiliations Extraction
- authors are split based on a list of separators
- affiliation indexes are found using a list of index symbols and superscript
- association is done by detected indexes

When affiliations are already assigned to authors:
- the first line is assumed to be the author
- the email is found by a regexp
- the remaining part is treated as the affiliation
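The two cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the separator list, the set of index symbols, the email pattern and all function names are hypothetical, since the slides do not give the lists CERMINE actually uses. Superscript detection is replaced here by a plain text match on trailing or leading index characters, and the sample names and affiliations are invented.

```python
import re

# Assumed conventions; the real separator and index lists are not on the slide.
AUTHOR_SEPARATORS = re.compile(r"\s*(?:,|;|\band\b)\s*")
INDEX_AT_END = re.compile(r"^(?P<name>.*?\D)\s*(?P<index>[0-9*†‡]+)$")
INDEX_AT_START = re.compile(r"^(?P<index>[0-9*†‡]+)\s*(?P<text>.+)$")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_authors(author_line):
    """Split an author line and peel a trailing index off each name."""
    authors = []
    for part in AUTHOR_SEPARATORS.split(author_line):
        part = part.strip()
        if not part:
            continue
        match = INDEX_AT_END.match(part)
        if match:
            authors.append((match.group("name").strip(), match.group("index")))
        else:
            authors.append((part, None))
    return authors


def index_affiliations(affiliation_lines):
    """Map each leading index to the affiliation text that follows it."""
    indexed = {}
    for line in affiliation_lines:
        match = INDEX_AT_START.match(line.strip())
        if match:
            indexed[match.group("index")] = match.group("text")
    return indexed


def associate(author_line, affiliation_lines):
    """Pair every author with the affiliation that carries the same index."""
    indexed = index_affiliations(affiliation_lines)
    return [(name, indexed.get(index)) for name, index in split_authors(author_line)]


def split_author_block(block):
    """Second case: first line is the author, the rest is email plus affiliation."""
    lines = [line.strip() for line in block.splitlines() if line.strip()]
    author, rest = lines[0], " ".join(lines[1:])
    email_match = EMAIL.search(rest)
    email = email_match.group(0) if email_match else None
    affiliation = EMAIL.sub("", rest).strip(" ,;")
    return author, email, affiliation


pairs = associate("M.K.1, J.I.2, T.W", ["1 University of A", "2 Institute of B"])
block = split_author_block(
    "Jan Kowalski\nInstitute of B, Warsaw, Poland\njan.kowalski@example.org"
)
print(pairs)
print(block)
```

An author without a detected index, such as the third one in the sample, is simply left unassociated. Authors carrying several comma-separated indexes would need extra handling that this sketch does not attempt.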
Affiliation Parsing
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5A blok D, 02-106 Warsaw, Poland
Affiliation parsing detects institution, address and country.
The implementation is based on a CRF token classifier with features:
- the classified word itself,
- whether the token is a number, all uppercase word, all lowercase word, a lowercase word that starts with an uppercase letter,
- whether the token is contained by dictionaries of countries or words commonly appearing in institutions or addresses,
- the features of two preceding and two following tokens.
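The feature list above might be turned into code roughly as follows. This is a sketch of the feature extraction step only, not of the CRF itself. The tokenizer, the feature names, the tiny dictionaries and the function names are all assumptions introduced for illustration; the input string is the example affiliation from the slide.

```python
import re

# Tiny illustrative dictionaries; the real ones are assumed to be much larger.
COUNTRIES = {"poland", "germany", "france"}
INSTITUTION_WORDS = {"university", "institute", "centre", "department"}
ADDRESS_WORDS = {"ul", "street", "road", "avenue"}


def tokenize(text):
    """Assumed tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_features(token):
    """Features of one token on its own (feature names are hypothetical)."""
    features = {"W=" + token}
    if token.isdigit():
        features.add("IsNumber")
    elif token.isalpha():
        if token.isupper():
            features.add("IsAllUpperCase")
        elif token.islower():
            features.add("IsAllLowerCase")
        elif token[0].isupper() and token[1:].islower():
            features.add("IsCapitalized")
    lowered = token.lower()
    if lowered in COUNTRIES:
        features.add("InCountryDict")
    if lowered in INSTITUTION_WORDS:
        features.add("InInstitutionDict")
    if lowered in ADDRESS_WORDS:
        features.add("InAddressDict")
    return features


def sequence_features(tokens, window=2):
    """Add the features of up to `window` neighbours on each side of a token."""
    own = [token_features(token) for token in tokens]
    result = []
    for i in range(len(tokens)):
        features = set(own[i])
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                # Neighbour features are prefixed with their relative position.
                features.update(f"{offset:+d}:{name}" for name in own[j])
        result.append(features)
    return result


tokens = tokenize(
    "Interdisciplinary Centre for Mathematical and Computational Modelling, "
    "University of Warsaw, ul. Pawińskiego 5A blok D, 02-106 Warsaw, Poland"
)
feats = sequence_features(tokens)
print(sorted(feats[-1]))
```

Each token ends up with a set of string-valued features, which is the kind of input a linear-chain CRF toolkit typically accepts. The classifier then assigns one of the labels institution, address or country to every token.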
Evaluation
Datasets
- GROTOAP2: the evaluation and training of the zone classifiers
- GROTOAP2-affiliations: the evaluation and training of the affiliation parser
- PubMed Central Open Access Subset: the evaluation of the entire workflow
[Figure: dataset construction diagram with the elements: PubMed Central, NLM records, CERMINE tools, zone text matching, rules.]
Zone Classification
2,551 documents from GROTOAP2, containing:
- 355,779 zones
- 68,557 metadata zones
5-fold cross-validation
               metadata   other labels   precision   recall
metadata         66,372          2,185      96.8 %   97.0 %
other labels      2,052        285,170           -        -

               affiliation   other labels   precision   recall
affiliation          3,496            185      95.0 %   95.3 %
other labels           173         64,703           -        -
Affiliation Parsing
8,267 affiliations from PubMed Central
5-fold cross-validation
Token classification:
              address   country   institution   precision   recall
address        44,481        12         1,225      96.8 %   97.3 %
country            50     8,108             8      99.6 %   99.3 %
institution     1,434        18        92,457      98.7 %   98.5 %
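The precision and recall columns of the token classification table can be recomputed from the counts. The sketch below assumes that rows hold the true labels and columns the predicted labels; the slide does not state the orientation, but under this reading the computed values match the reported percentages. The function name is hypothetical.

```python
LABELS = ["address", "country", "institution"]

# Counts copied from the token classification table,
# rows and columns in the order of LABELS.
CONFUSION = [
    [44481, 12, 1225],
    [50, 8108, 8],
    [1434, 18, 92457],
]


def precision_recall(matrix, k):
    """Precision and recall of class k (rows: true labels, columns: predictions)."""
    true_positive = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    truly_k = sum(matrix[k])                        # row sum
    return true_positive / predicted_as_k, true_positive / truly_k


for k, label in enumerate(LABELS):
    precision, recall = precision_recall(CONFUSION, k)
    print(f"{label}: precision {100 * precision:.1f} %, recall {100 * recall:.1f} %")
```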
Affiliation metadata extraction:
- institution recognized in 92.4% of cases
- address recognized in 92.2% of cases
- country recognized in 99.5% of cases
- 92.1% of affiliations entirely correctly parsed
Workflow Evaluation
1,943 documents from PMC
Evaluated tasks:
- extracting author strings
- extracting affiliation strings
- determining author-affiliation relations
- determining author-affiliation relations, if authors and affiliations were extracted flawlessly
[Figure: bar chart of F1 (axis from 0 to 100) for the tasks: authors, affiliations, relations (total), relations (perfect input). Systems compared: CERMINE, GROBID, ParsCit, PDFX.]
Summary
System Features
- CERMINE extracts metadata and content from scholarly articles in PDF format
- the system is based on a modular workflow
- the implementation uses machine learning and heuristics
- the default system is trained on large and diverse datasets
- the source code is open and available on GitHub
- CERMINE is available as a web service and as RESTful services
System Usage
- Java + Maven
- JAR file
- RESTful services:
$ curl -X POST --data-binary @article.pdf \
    --header "Content-Type: application/binary" \
    http://cermine.ceon.pl/extract.do

$ curl -X POST --data "affiliation=the text of the affiliation" \
    http://cermine.ceon.pl/parse.do
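The same two calls could be issued from Python's standard library, sketched below. The endpoint URLs and the Content-Type header come from the curl commands; the function names, the timeout and the assumption that responses are UTF-8 text are hypothetical. The slides also cannot confirm whether the service is still reachable at this address. Unlike `curl --data`, `urlencode` percent-encodes the affiliation text, which is the usual form for such a request.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://cermine.ceon.pl"


def build_extract_request(pdf_bytes):
    """Prepare a POST of raw PDF bytes to extract.do (first curl call)."""
    return urllib.request.Request(
        BASE_URL + "/extract.do",
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def build_parse_request(affiliation):
    """Prepare a form-encoded POST to parse.do (second curl call)."""
    body = urllib.parse.urlencode({"affiliation": affiliation}).encode("utf-8")
    return urllib.request.Request(BASE_URL + "/parse.do", data=body, method="POST")


def send(request, timeout=60):
    """Send a prepared request; the response is assumed to be UTF-8 text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the requests does not touch the network; only send() does.
extract_request = build_extract_request(b"%PDF-1.4 placeholder bytes")
parse_request = build_parse_request("University of Warsaw")
print(extract_request.get_method(), extract_request.full_url)
print(parse_request.get_method(), parse_request.full_url)

# Example call, left commented out because it needs network access:
# with open("article.pdf", "rb") as pdf:
#     print(send(build_extract_request(pdf.read())))
```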
[Figure: scatter plot of processing time in seconds (axis from 0 to 40) against the number of pages (axis from 0 to 40).]
Links
- CERMINE web service: http://cermine.ceon.pl
- CERMINE source code: https://github.com/CeON/CERMINE
- GROTOAP2: http://cermine.ceon.pl/grotoap2/
- GROTOAP2-affiliations: http://cermine.ceon.pl/grotoap2/affiliations/
Thank you!
linkedin.com/in/bolikowski
twitter.com/bolikowski
© 2015 ICM, University of Warsaw. This document is distributed under the CC BY 4.0 license, see: http://creativecommons.org/licenses/by/4.0/