Reducing Costs and Expanding XML Submissions with PDF to JATS Conversion
by Keishi KATOH (加藤圭志)
DIGITAL COMMUNICATIONS Co., Ltd.
Agenda
JATS-Con 2012 Copyright ©2012 DIGITAL COMMUNICATIONS
About J-STAGE
  Service overview
  Positioning of the bibliographic XML creation tool
Bibliographic XML creation tool
  Tool workflow
  Conversion from PDF to JATS XML
  Demonstration of the tool
Conversion results analysis and future improvements
Brief introduction to J-STAGE and the bibliographic XML creation tool
About J-STAGE
J-STAGE = “Japan Science and Technology Information Aggregator, Electronic”
  One of the major e-journal publishing platforms in Japan, provided by the Japan Science and Technology Agency (JST)
  1,684 titles, 2.4M articles (Oct 2012)
  www.jstage.jst.go.jp
J-STAGE3, the new platform, was launched in May 2012
  With JATS XML submission (full text / bibliographic info)
Service positioning of J-STAGE
Copyright ©2012 Japan Science and Technology Agency. The brand names and product names are registered trademarks of their respective companies.
Bibliographic XML creation tool in J-STAGE
[Diagram labels: Academic Society; Article PDF; Bibliographic XML creation tool (marked "Here"); JATS bib XML; J-STAGE registration system; J-STAGE public system; Internet (users access from the Internet)]
Reasons for the tool
Is XML easy?
  The XML spec is simple
  The JATS tag suite is easily understood
    Domain-specific, lightweight tag set
    Easy structures and attributes
  Easily created from the author's data!!
Difficulty for authors to create papers in XML format
  Many different tools are used for writing papers
  Printing / production process from writing to publishing
    Printing company's capability to work with XML
    Higher skills are required to use XML
Why from PDF?
Various tools and formats in publication
  For writing: Word, TeX…
  For printing:
    DTP tools: InDesign, FrameMaker
    Automated publishing systems: 3B2/APP, AH Formatter
  For distributing: PDF, HTML, XML…
Almost all academic societies have PDFs
Conversion workflow
Workflow with two phases
Phase 1: Template pattern creation
Phase 2: Registration of PDF and conversion to XML
[Diagram: Phase 1 (template pattern creation): sample article PDF, automatic analysis, template pattern. Phase 2 (XML conversion): article PDFs, automatic analysis, XML conversion, JATS XML files.]
Details are shown in a demonstration
Sources & Outputs
Source: PDF ver. 1.3 to 1.5
  Fonts are embedded (not a rasterized or scanned PDF)
  Without security permission flags
Output: valid JATS XML
  Compliant with J-STAGE's XML submission guideline
  Bibliographic elements
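The source requirements above could be pre-checked before a PDF is uploaded. The following is only a minimal sketch, not the tool's actual check: the function names are hypothetical, the version is read from the `%PDF-x.y` file header (a document catalog may override it), and encryption is detected with a crude byte search for an `/Encrypt` entry. Checking for embedded fonts or scanned pages needs a real PDF parser and is left out.

```python
import re

# Version range stated on the slide (PDF 1.3 to 1.5).
SUPPORTED_VERSIONS = {"1.3", "1.4", "1.5"}


def pdf_header_version(data: bytes):
    """Return the version from the '%PDF-x.y' header, or None if absent."""
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None


def precheck(data: bytes):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    version = pdf_header_version(data)
    if version is None:
        problems.append("not a PDF (no %PDF header)")
    elif version not in SUPPORTED_VERSIONS:
        problems.append(f"unsupported PDF version {version}")
    # Heuristic only: an encrypted PDF carries an /Encrypt entry in its trailer.
    if b"/Encrypt" in data:
        problems.append("PDF appears to be encrypted or permission-restricted")
    return problems
```

In use, `precheck(open(path, "rb").read())` would be called once per file, and files with a non-empty result would be set aside for manual handling.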
Demonstration
Demo contents
Create new template
  Select sample PDF for template
  Set page margin
Setting of template pattern
  Select the ‘block’
  Assign ‘pseudo-JATS’ elements to blocks
  About Japanese-English contents
PDF conversion using template pattern
  Converting process
  XML editing (empty template)
practices in 30 sec
山 mountain
木 tree
鳥 bird
魚 fish
亀 tortoise
Create a new template
Go to the "Create new template" function
Select a sample PDF and submit
Set the page margin
Analyzing PDF
[Figure labels: header / footer region; contents region; contents flow order (continuing to the next page)]
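The idea of separating the header / footer region from the contents region and then following the contents flow order can be sketched roughly as below. This is an illustration under stated assumptions, not the analyzer's real algorithm: `Fragment` is a hypothetical record of extracted text with a top-left origin (y grows downward), the page margin is given as two numbers, and the layout is assumed to be single-column.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical piece of extracted text with its position on a page."""
    page: int
    x: float
    y: float  # distance from the top of the page
    text: str


def content_in_flow_order(fragments, page_height, top_margin, bottom_margin):
    """Drop header/footer fragments, then order the rest for reading."""
    lowest = page_height - bottom_margin
    # Anything above the top margin or below the bottom margin is header/footer.
    body = [f for f in fragments if top_margin <= f.y <= lowest]
    # Single-column assumption: read page by page, top to bottom, left to right.
    return sorted(body, key=lambda f: (f.page, f.y, f.x))
```

Multi-column pages would need an extra column-detection step before sorting, which is one of the future improvements listed later in the talk.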
Template settings
Select a ‘block’ for extracting information
Assign a pseudo-JATS item to the block
Selecting block
Block type
  Paragraphs with heading
  Paragraphs only
Selecting methods
  Font name, size, bold/italic
  Text pattern
  Page range, region on the page
A block continues until a block matched by another selection setting begins
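The selection methods above (font name, size, bold, text pattern) and the rule that a block continues until another setting's block starts can be illustrated with a small sketch. All names here (`Line`, `Rule`, `split_into_blocks`) are hypothetical, page range and region conditions are omitted, and the tool's real matching logic is not documented in the slides.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    """Hypothetical extracted text line with its font attributes."""
    text: str
    font: str
    size: float
    bold: bool = False


@dataclass
class Rule:
    """One selection setting; a condition left as None is not checked."""
    name: str
    font: Optional[str] = None
    size: Optional[float] = None
    bold: Optional[bool] = None
    pattern: Optional[str] = None

    def matches(self, line: Line) -> bool:
        return (
            (self.font is None or self.font == line.font)
            and (self.size is None or self.size == line.size)
            and (self.bold is None or self.bold == line.bold)
            and (self.pattern is None or re.search(self.pattern, line.text) is not None)
        )


def split_into_blocks(lines, rules):
    """Group lines into (rule name, [texts]) blocks in document order."""
    blocks = []
    for line in lines:
        rule = next((r for r in rules if r.matches(line)), None)
        if rule is not None and (not blocks or blocks[-1][0] != rule.name):
            blocks.append((rule.name, [line.text]))  # a different setting starts a new block
        elif blocks:
            blocks[-1][1].append(line.text)  # otherwise the current block continues
        # lines before the first matching rule are ignored
    return blocks
```

For example, a rule `Rule("title", size=16)` followed by `Rule("abstract", pattern=r"^Abstract")` would collect every 16-point line into the title block and everything from the "Abstract" heading onward into the abstract block, until the next rule matches.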
Assign a pseudo-JATS item
A pseudo-JATS item denotes something that is not a single JATS XML element
  trans-title and title
  kwd-group and kwd
Items for English and Japanese
Configure pseudo-JATS item
Content region
  Whole block
  Select by condition
  With heading
  With inline heading
Pseudo-JATS specific setting
  Dividing keywords
  contrib-author to institution
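As an illustration of the "dividing keywords" setting with an inline heading, the sketch below turns one extracted keyword block into a JATS `kwd-group` containing one `kwd` per keyword. The function name, the heading pattern, and the separator set are assumptions made for this example; the tool's actual settings are not described in the slides.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def keywords_to_kwd_group(block_text, lang,
                          heading_pattern=r"^\s*Key\s?words?\s*[:：]\s*"):
    """Build a <kwd-group> element from a block such as 'Keywords: a, b; c.'"""
    # Remove the inline heading, then split on common Latin and CJK separators.
    body = re.sub(heading_pattern, "", block_text, flags=re.IGNORECASE)
    parts = [p.strip() for p in re.split(r"[,;，；、]", body)]
    group = ET.Element("kwd-group", {XML_LANG: lang})
    for part in parts:
        if part:
            ET.SubElement(group, "kwd").text = part.rstrip(".")
    return group
```

Calling `ET.tostring(keywords_to_kwd_group("Keywords: PDF, JATS", "en"), encoding="unicode")` would give a single `kwd-group` element holding two `kwd` children. This shows why one pseudo-JATS item maps to more than one JATS element.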
Preview of conversion
Preview with the design of the J-STAGE public system
Some XML structure information
Workflow with two phases (again)
Phase 1: Template pattern creation
Phase 2: Registration of PDF and conversion to XML
Convert and edit articles
Upload PDFs and select the template
Wait a few seconds
Check and edit extracted data
Get XML!!
Conversion results
Conversion accuracy with 10 journals, about 10 articles each
Language: J = Japanese, E = English

                     Automatic recognition rate
Journal  Language    Avg     Min     Max      Number of articles
EL       J/E         91%     58%     100%     10
JO       J/E         97%     89%     100%     10
JE       J/E         98%     95%     99%      10
CL       E           93%     86%     100%     10
TR       E           90%     50%     100%     10
JI       J/E         91%     83%     96%      8
NI       J           91%     83%     100%     10
BU       J/E         93%     75%     98%      8
AD       E           100%    97%     100%     7
PJ       E           98%     90%     100%     9
Errata / essays are excluded from the evaluation.
Recognition failures occurred in references and keywords
$$\text{Automatic recognition rate (\%)} = 100 \times \frac{\text{Number of items extracted automatically}}{\text{Number of items in paper image to be recognized automatically}}$$
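The recognition rate defined above is straightforward to compute per article. The sketch below uses hypothetical function names, and it assumes that the Avg / Min / Max columns are taken over per-article rates, which the slide does not state explicitly.

```python
def recognition_rate(extracted: int, expected: int) -> float:
    """100 x (items extracted automatically) / (items that should be recognized)."""
    if expected <= 0:
        raise ValueError("expected must be a positive item count")
    return 100.0 * extracted / expected


def summarize(per_article):
    """Return (average, minimum, maximum) over per-article rates.

    per_article is a list of (extracted, expected) pairs, one per article.
    Averaging per-article rates is an assumption about how the table was built.
    """
    rates = [recognition_rate(e, n) for e, n in per_article]
    return sum(rates) / len(rates), min(rates), max(rates)
```

With this definition, a single article with many missed references can pull the minimum down sharply while the journal average stays high, which matches the wide Min to Max spread seen for some journals.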
Future improvements
Improvement of PDF analyzer engine
  Recognition of text blocks
    Columns and sequence of text flow
  Reconstruction algorithms with text content
    Dehyphenation and space insertion
JATS context recognizing ability
  Template setting pattern
  Additional bibliographic elements
For full text into JATS XML
  Extract images, vector graphics
  Equations
*Details are undecided at this time.
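Dehyphenation and space insertion, listed among the improvements above, can be pictured with a deliberately naive sketch. This is not the planned algorithm, whose details are undecided; the function name is hypothetical, and the rule used (join when a line ends in a letter plus hyphen and the next line starts with a lowercase letter) will wrongly merge genuine compounds such as "light-" / "weight".

```python
import re


def dehyphenate(lines):
    """Join extracted text lines into one string.

    A word split as 'implemen-' / 'tation' is merged; all other line
    breaks become a single space.
    """
    out = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if not out:
            out = line
        elif re.search(r"[A-Za-z]-$", out) and line[0].islower():
            out = out[:-1] + line  # drop the hyphen and join the two halves
        else:
            out += " " + line  # ordinary line break: insert a space
    return out
```

A dictionary lookup on the joined word would be one way to decide whether the hyphen should be kept, at the cost of maintaining word lists for each language.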
Conclusion
A bibliographic XML creation tool is provided.
  Easy settings, easy editing
  But it needs more improvement
Utilization trend of the bibliographic XML creation tool
  From access analysis, some societies are using the tool at their publication interval (monthly / bi-monthly)
  790 articles from 33 journals were registered in 4 months
Contacts
J-STAGE services: Japan Science and Technology Agency
Technical questions: DIGITAL COMMUNICATIONS Co., Ltd.
Antenna House, Inc., International sales
[email protected]
+1 302-427-2456