Structured affiliations extraction from the scientific literature

Structured affiliations extraction from scientific literature. D. Tkaczyk, B. Tarnawski and Ł. Bolikowski, Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw. 24 June 2015.


Page 1: Structured affiliations extraction from the scientific literature

Structured affiliations extraction from scientific literature

D. Tkaczyk, B. Tarnawski and Ł. Bolikowski
Interdisciplinary Centre for Mathematical and Computational Modelling
University of Warsaw

24 June 2015

Page 2: Structured affiliations extraction from the scientific literature

Introduction

Page 3: Structured affiliations extraction from the scientific literature

CERMINE system

[Figure labels: TITLE, AUTHORS, AFFILIATIONS, EMAILS, ABSTRACT, KEYWORDS; for a reference: AUTHOR, TITLE, SOURCE, YEAR, PAGES, URL, VOLUME]

CERMINE analyses born-digital scientific articles and extracts:
- document metadata, e.g. title, authors, abstract, keywords, publication date, ...
- a list of parsed bibliographic references
- full text with sections hierarchy

Page 4: Structured affiliations extraction from the scientific literature

This presentation

The presentation focuses on the following tasks:
- extracting a list of authors from a paper
- extracting a list of affiliations from a paper
- establishing relations between extracted authors and affiliations
- detecting institution, address and country in affiliation strings

Page 5: Structured affiliations extraction from the scientific literature

Motivation

CERMINE can be used to:

- extract high-quality metadata from large PDF collections, when it is missing or fragmentary
- provide intelligent user interfaces for metadata acquisition

Page 6: Structured affiliations extraction from the scientific literature

Requirements

The metadata extraction system should be:
- comprehensive,
- automatic,
- modular,
- open and widely available,
- easily applicable,
- flexible and able to adapt to new layouts,
- well tested.

Page 7: Structured affiliations extraction from the scientific literature

Architecture and Implementation

Page 8: Structured affiliations extraction from the scientific literature

The workflow

[Figure: the workflow takes a PDF file and produces an XML record. Steps shown: structure extraction, classification, splitting and association, affiliation parsing, XML record generation. Example data shown: an author line "M.K.1, J.I.2, T.W", indexed affiliations "1 University of" and "2 Institute of", a parsed affiliation "Institute of ... / Warsaw, 027... / Poland", and an output fragment with <author>, <aff>, <inst>, <addr> and <country> elements.]

Page 9: Structured affiliations extraction from the scientific literature

Layout Extraction

1. Character extraction: iText library
2. Page segmentation: Docstrum
3. Reading order resolving: bottom-up, heuristic-based

Page 10: Structured affiliations extraction from the scientific literature

Content Classification

- general classification (labels: metadata, references, body and other)
- metadata classification (labels: abstract, bib_info, type, title, affiliation, author, keywords, correspondence, dates and editor)
- SVM with 83 and 62 features: geometrical, lexical, sequential, formatting, heuristics
- the best SVM parameters were found automatically by maximizing mean F-score on a validation dataset
- classifiers are trained on 2,551 and 2,716 documents, respectively
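Selecting parameters by mean F-score can be sketched as below. This is an illustration only, not CERMINE's code: the function names, the `(C, gamma)` parameter tuples and the toy labels are assumptions, and the SVM itself is left out. Each candidate setting is represented just by the labels it predicted on a validation set.

```python
from typing import Dict, List, Sequence, Tuple


def f_score(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F-score (harmonic mean of precision and recall) for a single label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # also avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f_score(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of the per-label F-scores over the gold labels."""
    labels = sorted(set(gold))
    return sum(f_score(gold, pred, label) for label in labels) / len(labels)


def select_best(gold: Sequence[str],
                candidates: Dict[Tuple[float, float], List[str]]) -> Tuple[float, float]:
    """Return the parameter setting whose predictions maximize mean F-score."""
    return max(candidates, key=lambda params: mean_f_score(gold, candidates[params]))


# Toy validation set; the (C, gamma) values are hypothetical.
gold = ["metadata", "body", "references", "body", "other", "metadata"]
candidates = {
    (1.0, 0.1): ["metadata", "body", "body", "body", "other", "other"],
    (10.0, 0.01): ["metadata", "body", "references", "body", "other", "metadata"],
}
best = select_best(gold, candidates)
print("best parameters:", best)
```

Averaging per-label scores without weighting means that rare labels count as much as frequent ones, which matters when most zones belong to the body.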

Page 11: Structured affiliations extraction from the scientific literature

The output so far

TrueViz XML format:

- hierarchical structure containing: pages, zones, lines, words, characters
- all elements have bounding boxes
- reading order is given
- zones have labels

    <Page>...
      <PageID Value="0"/>
      <Zone>...
        <ZoneID Value="0"/>
        <ZoneCorners>
          <Vertex x="55.320" y="34.295"/>
          <Vertex x="235.704" y="58.295"/>
        </ZoneCorners>
        <ZoneNext Value="1"/>
        <Category Value="TITLE"/>
        <Line>...
          <Word>...
            <Character>...
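Reading zone labels and bounding boxes from such a file can be sketched with Python's standard XML parser. The sample below is an assumption: it completes the fragment from the slide with closing tags and omits the line, word and character levels, so a real TrueViz file will contain more than this. The function name `read_zones` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample based on the fragment shown on the slide.
SAMPLE = """
<Page>
  <PageID Value="0"/>
  <Zone>
    <ZoneID Value="0"/>
    <ZoneCorners>
      <Vertex x="55.320" y="34.295"/>
      <Vertex x="235.704" y="58.295"/>
    </ZoneCorners>
    <ZoneNext Value="1"/>
    <Category Value="TITLE"/>
  </Zone>
</Page>
"""


def read_zones(xml_text):
    """Return one dict per zone: id, label, next zone in reading order, corners."""
    page = ET.fromstring(xml_text.strip())
    zones = []
    for zone in page.iter("Zone"):
        corners = [
            (float(vertex.get("x")), float(vertex.get("y")))
            for vertex in zone.findall("ZoneCorners/Vertex")
        ]
        zones.append({
            "id": zone.find("ZoneID").get("Value"),
            "label": zone.find("Category").get("Value"),
            "next": zone.find("ZoneNext").get("Value"),
            "corners": corners,
        })
    return zones


for zone in read_zones(SAMPLE):
    print(zone["id"], zone["label"], zone["corners"])
```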

Page 12: Structured affiliations extraction from the scientific literature

Authors and Affiliations Extraction

- authors are split based on a list of separators
- affiliation indexes are found using a list of index symbols and superscripts
- association is done by the detected indexes

- affiliations are already assigned to authors
- the first line is assumed to be the author
- the email is found by a regexp
- the remaining part is treated as the affiliation
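Both cases above can be sketched in a few lines. This is a simplified illustration, not the CERMINE implementation: the separator list, the index symbols and the email pattern are assumptions, superscript detection is replaced by looking for trailing or leading index characters in plain text, and every index is assumed to be a single character.

```python
import re

# Assumed separator and index-symbol lists; the real lists are not on the slide.
AUTHOR_SEPARATORS = re.compile(r"\s*(?:,|;|&|\band\b)\s*")
INDEX_SYMBOLS = "0123456789*†‡§"
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def split_author(raw):
    """Split 'M.K.1' into the name 'M.K.' and the index list ['1']."""
    text = raw.strip()
    name = text.rstrip(INDEX_SYMBOLS)
    return name.strip(), list(text[len(name):])


def split_affiliation(raw):
    """Split '1 University of' into the index '1' and the text 'University of'."""
    text = raw.strip()
    rest = text.lstrip(INDEX_SYMBOLS)
    return text[:len(text) - len(rest)], rest.strip()


def associate(author_line, affiliation_lines):
    """Case 1: link authors to affiliations through the detected indexes."""
    by_index = dict(split_affiliation(line) for line in affiliation_lines)
    result = []
    for raw in AUTHOR_SEPARATORS.split(author_line.strip()):
        if not raw:
            continue
        name, indexes = split_author(raw)
        result.append((name, [by_index[i] for i in indexes if i in by_index]))
    return result


def parse_author_block(lines):
    """Case 2: first line is the author, the email is found by a regexp,
    the remaining part is treated as the affiliation."""
    author = lines[0].strip()
    rest = " ".join(line.strip() for line in lines[1:])
    match = EMAIL.search(rest)
    email = match.group(0) if match else None
    affiliation = EMAIL.sub("", rest).strip(" ,;")
    return {"author": author, "email": email, "affiliation": affiliation}


print(associate("M.K.1, J.I.2, T.W", ["1 University of", "2 Institute of"]))
print(parse_author_block(["Jan Kowalski",
                          "Institute of Examples, Warsaw, Poland",
                          "jan@example.org"]))
```

The author line and indexed affiliations come from the workflow figure; the second example (name, institute, email) is made up.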

Page 13: Structured affiliations extraction from the scientific literature

Affiliation Parsing

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, ul. Pawińskiego 5A blok D, 02-106 Warsaw, Poland

- affiliation parsing detects institution, address, country
- the implementation is based on a CRF token classifier with features:
  - the classified word itself,
  - whether the token is a number, an all-uppercase word, an all-lowercase word, or a lowercase word that starts with an uppercase letter,
  - whether the token is contained in dictionaries of countries or of words commonly appearing in institutions or addresses,
  - the features of two preceding and two following tokens.
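The feature set listed above can be sketched as a function that turns an affiliation string into one feature dictionary per token, ready to be fed to a CRF library. The tokenizer, the feature names and the three tiny dictionaries are assumptions made for illustration; the CRF training step is not shown.

```python
import re

# Hypothetical, tiny dictionaries; real ones would be much larger.
COUNTRIES = {"poland", "germany", "france"}
INSTITUTION_WORDS = {"university", "institute", "centre", "department"}
ADDRESS_WORDS = {"ul", "street", "road", "blok"}


def tokenize(affiliation):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", affiliation)


def token_features(token):
    """Features of a single token, following the list on the slide."""
    lower = token.lower()
    return {
        "word": token,
        "is_number": token.isdigit(),
        "is_all_upper": token.isalpha() and token.isupper(),
        "is_all_lower": token.isalpha() and token.islower(),
        "is_capitalized": token[:1].isupper() and token[1:].islower(),
        "is_country": lower in COUNTRIES,
        "is_institution_word": lower in INSTITUTION_WORDS,
        "is_address_word": lower in ADDRESS_WORDS,
    }


def sequence_features(tokens, window=2):
    """For each token, merge its own features with those of the `window`
    preceding and following tokens, prefixed by the relative offset."""
    own = [token_features(token) for token in tokens]
    rows = []
    for i in range(len(tokens)):
        row = {}
        for offset in range(-window, window + 1):
            j = i + offset
            if 0 <= j < len(tokens):
                for name, value in own[j].items():
                    row[f"{offset:+d}:{name}"] = value
        rows.append(row)
    return rows


tokens = tokenize("University of Warsaw, 02-106 Warsaw, Poland")
rows = sequence_features(tokens)
print(tokens)
print(sorted(rows[0]))
```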

Page 14: Structured affiliations extraction from the scientific literature

Evaluation

Page 15: Structured affiliations extraction from the scientific literature

Datasets

- GROTOAP2: the evaluation and training of the zone classifiers
- GROTOAP2-affiliations: the evaluation and training of the affiliation parser
- PubMed Central Open Access Subset: the evaluation of the entire workflow

[Figure: dataset creation diagram. Labels: PubMed Central, PDF, <NLM>, CERMINE tools, zone text matching, rules]

Page 16: Structured affiliations extraction from the scientific literature

Zone Classification

2,551 documents from GROTOAP2, containing:
- 355,779 zones
- 68,557 metadata zones

5-fold cross-validation

                metadata   other labels   precision   recall
metadata          66,372          2,185      96.8 %   97.0 %
other labels       2,052        285,170           -        -

                affiliation   other labels   precision   recall
affiliation           3,496            185      95.0 %   95.3 %
other labels            173         64,703           -        -

Page 17: Structured affiliations extraction from the scientific literature

Affiliation Parsing

8,267 affiliations from PubMed Central

5-fold cross-validation

Token classification:

               address   country   institution   precision   recall
address         44,481        12         1,225      96.8 %   97.3 %
country             50     8,108             8      99.6 %   99.3 %
institution      1,434        18        92,457      98.7 %   98.5 %

Affiliation metadata extraction:

- institution recognized in 92.4% of cases
- address recognized in 92.2% of cases
- country recognized in 99.5% of cases
- 92.1% of affiliations entirely correctly parsed
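The per-label precision and recall in the token classification table can be recomputed from the confusion matrix. The sketch below assumes that rows are true labels and columns are predicted labels; the slide does not state this, but under that reading the computed values match the reported ones after rounding to one decimal place.

```python
LABELS = ["address", "country", "institution"]

# Token confusion matrix from the slide.
# Assumption: rows are true labels, columns are predicted labels.
CONFUSION = [
    [44481, 12, 1225],
    [50, 8108, 8],
    [1434, 18, 92457],
]


def precision_recall(matrix, k):
    """Precision and recall for the label at position k."""
    true_positive = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    return true_positive / predicted_as_k, true_positive / actually_k


for k, label in enumerate(LABELS):
    precision, recall = precision_recall(CONFUSION, k)
    print(f"{label:12s} precision {100 * precision:.1f} %  recall {100 * recall:.1f} %")
```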

Page 18: Structured affiliations extraction from the scientific literature

Workflow Evaluation

1,943 documents from PMC

evaluated tasks:
- extracting author strings
- extracting affiliation strings
- determining author-affiliation relations
- determining author-affiliation relations, if authors and affiliations are extracted flawlessly

[Figure: F1 (0 to 100) for the tasks: authors, affiliations, relations (total), relations (perfect input). Systems compared: CERMINE, GROBID, ParsCit, PDFX]

Page 19: Structured affiliations extraction from the scientific literature

Summary

Page 20: Structured affiliations extraction from the scientific literature

System Features

- CERMINE extracts metadata and content from scholarly articles in PDF format
- the system is based on a modular workflow
- the implementation uses machine learning and heuristics
- the default system is trained on large and diverse datasets
- the source code is open and available on GitHub
- CERMINE is available as a web service and RESTful services

Page 21: Structured affiliations extraction from the scientific literature

System Usage

- Java + Maven
- JAR file
- RESTful services:

$ curl -X POST --data-binary @article.pdf --header "Content-Type: application/binary" http://cermine.ceon.pl/extract.do

$ curl -X POST --data "affiliation=the text of the affiliation" http://cermine.ceon.pl/parse.do
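The same two calls can be sketched in Python with the standard library. The endpoint URLs, the `affiliation` form field and the content type come from the curl commands above; the function names are hypothetical, the format of the response is not described on the slide, and the service may no longer be reachable at this address.

```python
import urllib.parse
import urllib.request

BASE_URL = "http://cermine.ceon.pl"


def build_extract_request(pdf_bytes):
    """POST the raw PDF bytes to the extraction endpoint."""
    return urllib.request.Request(
        BASE_URL + "/extract.do",
        data=pdf_bytes,
        headers={"Content-Type": "application/binary"},
        method="POST",
    )


def build_parse_request(affiliation):
    """POST one affiliation string as a form field to the parsing endpoint."""
    body = urllib.parse.urlencode({"affiliation": affiliation}).encode("ascii")
    return urllib.request.Request(BASE_URL + "/parse.do", data=body, method="POST")


def send(request, timeout=60):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (requires network access and a running service):
#   with open("article.pdf", "rb") as handle:
#       print(send(build_extract_request(handle.read())))
#   print(send(build_parse_request("the text of the affiliation")))
```

Building the request separately from sending it keeps the part that depends on the network in one small function.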

[Figure: scatter plot of processing time in seconds (axis 0 to 40) against the number of pages (axis 0 to 40)]

Page 22: Structured affiliations extraction from the scientific literature

Links

- CERMINE web service: http://cermine.ceon.pl
- CERMINE source code: https://github.com/CeON/CERMINE
- GROTOAP2: http://cermine.ceon.pl/grotoap2/
- GROTOAP2-affiliations: http://cermine.ceon.pl/grotoap2/affiliations/

Page 23: Structured affiliations extraction from the scientific literature

Thank you!

linkedin.com/in/bolikowski

twitter.com/bolikowski


© 2015 ICM, University of Warsaw. This document is distributed under the CC BY 4.0 license, see: http://creativecommons.org/licenses/by/4.0/