NLM CONVERSION TO BUILD “ATOMIC” PHYSICS CONTENT IN AN AGILE FASHION
JATS-CON, April 2, 2014
OSA – The Optical Society & DCL – Data Conversion Laboratory, Inc.

Page 2: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

OSA: a scholarly publisher with 19 current and legacy journals and 300+ conference proceedings

Page 3: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

OSA Governance: Build more-flexible products and services!

How?

• Break 1917–2012 content into “well-polished” atomic pieces following an industry standard

• Develop infrastructure to manage and enrich content, to build new products and services in an agile fashion

• Budget allocated for five-year strategic plan

Page 4: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Some evidence of success

With content converted to NLM XML, we have developed:

• Enhanced article: Interactive HTML

• Derivative products: ImageBank

• Business Intelligence: New insights into author, topic, funding, and other trends

Page 5: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Citation data

Page 6: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Equation data

Page 7: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Page 8: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Legacy content (750,000 journal pages)

We expected this . . .

Page 9: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

This . . . not so much

JOURNAL AS COMIC BOOK

SCHOOL YEARBOOK

Page 10: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

1. Most confusing: Articles skipping pages, sometimes in two directions

Page 11: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

2. Most shocking: Legacy PDF not matching legacy print

Print

Legacy PDF for same article

Page 12: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

3. Most pervasive: nonscientific content tacked onto research articles

These are not the authors

Page 13: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Project specifications: two extremes

1. Hand the project over to the trusted vendor and be done with it

2. Spend up to a year doing heavy content analysis and spec creation

Page 14: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Data Conversion Laboratory

• We convert content from any format to any format.

• Expertise with JATS and most industry-standard DTDs and schemas

• Established in 1981; a pioneer in the data conversion industry

• Over a billion pages converted

• Expertise in complex conversion projects: STM publishing, eBooks, technical documents, educational publishing, and library digitization

• Projects range from one book to entire libraries and legacy collections

• Infrastructure for large-scale projects, with automated tracking, quality assurance, and customer reporting for every item

• Industries include publishing, technical societies, aerospace, government, defense, health sciences, libraries, and universities

• Publish DCLNews, a monthly newsletter devoted to XML and electronic publishing topics, going to 7,000 subscribers

Page 15: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Thoughts on Managing a Large Legacy Conversion Effort

1) Phased Approach

2) Flexibility and Collaboration

3) Keep it Simple

4) Keep Monitoring Quality

Page 16: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

1) Phased Approach

Why?

• Varied sources (PDF, XML, SGML)

• Content that changed over time

• Very large input corpus going back to 1917

• Allow for the quick, phased release of new OSA products

Strategy for OSA materials

• Focus on one source type at a time but keep the big picture in mind

• Convert newest material first

• Review and decide on conversion nuances as they come up

Page 17: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Source Material Challenges

XML

• OSA Proprietary DTD

• NLM v2.3 DTD

PDF

• PDF Normal

• PDF Image

SGML

• Multiple DTDs

Page 18: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

2) Build Flexibility and Collaboration into the Conversion Process

• Develop an overall specification, with allowance for change as new scenarios are uncovered

• Software development sprints to incorporate changes

• Close collaboration with OSA to manage new situations affecting completed work and work in process

Page 19: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Tools Used to Retain Flexibility

• Client-Vendor collaboration for decision making

• Hub and Spoke processing

• Handling of conversion anomalies

• Quality assurance reviews

• Learning databanks

Page 20: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

3) There’s a Lot of Detail – Keep It Simple

• Fitting structures into the existing JATS tagging structure

• CALS to HTML table conversion

• MathML line break retention

• Cross-reference ranges

• Rendering limitations

• Unexpected content scenarios

Page 21: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Cross-Reference Ranges

• Bibliographic

• Figure
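A range citation names only its two endpoints, so the targets in between have to be recovered before each one can be linked. The following is a minimal sketch in Python, not the conversion software used on the project. It assumes that targets carry sequential ids made of a prefix plus a number; the `b` prefix echoes the ref ids quoted in the Schematron messages later in the deck, while the `f` prefix for figures and the function name `expand_range` are hypothetical.

```python
import re

# Matches "3-7" or "3–7" (ASCII hyphen or en dash), ignoring surrounding spaces.
RANGE = re.compile(r"^\s*(\d+)\s*[-\u2013]\s*(\d+)\s*$")


def expand_range(text, prefix):
    """Return the id of every target covered by a range such as '3-7'.

    `prefix` is an assumed id convention, e.g. 'b' for references.
    """
    match = RANGE.match(text)
    if match is None:
        raise ValueError(f"not a numeric range: {text!r}")
    first, last = int(match.group(1)), int(match.group(2))
    if last < first:
        raise ValueError(f"descending range: {text!r}")
    return [f"{prefix}{n}" for n in range(first, last + 1)]


print(expand_range("3-7", "b"))
print(expand_range("2\u20134", "f"))
```

Rejecting descending ranges rather than silently swapping the endpoints keeps suspicious source text visible for review.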

Page 22: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Rendering Limitations

• No CSS support for table character alignment

[Figure: PDF rendering vs. HTML rendering]

Page 23: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Unexpected Content Scenarios

• Missing text: printed-page problems

Page 24: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Unexpected Content Scenarios (cont.)

• Jumping pages
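Articles whose pages jump, sometimes in both directions, can be flagged automatically once each article's pages are listed in reading order. The sketch below is a minimal illustration in Python under that assumption; the function name `page_jumps` and the sample page numbers are hypothetical and not taken from the project.

```python
def page_jumps(pages):
    """Return (from_page, to_page) pairs where reading order is not consecutive.

    `pages` is assumed to be the printed page numbers of one article,
    listed in the order a reader would follow them.
    """
    return [(a, b) for a, b in zip(pages, pages[1:]) if b != a + 1]


# Hypothetical article: continues forward on page 801, then back on page 795.
print(page_jumps([792, 793, 801, 802, 795]))
```

A forward jump shows up as a pair with a larger second number and a backward jump as a pair with a smaller one, so both directions are reported by the same check.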

Page 25: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Unexpected Content Scenarios (cont.)

• Special characters with no corresponding Unicode
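The slides do not say how such characters were handled. One way to locate them after conversion, assuming unmapped glyphs surface as private-use code points, unassigned code points, or the replacement character U+FFFD, is a scan like the Python sketch below. That assumption and the function name `suspicious_chars` are illustrative only.

```python
import unicodedata


def suspicious_chars(text):
    """Return (offset, code point) pairs for characters that need review.

    Flags private-use ("Co") and unassigned ("Cn") code points, plus
    U+FFFD, the Unicode replacement character.
    """
    hits = []
    for offset, ch in enumerate(text):
        if ch == "\ufffd" or unicodedata.category(ch) in {"Co", "Cn"}:
            hits.append((offset, f"U+{ord(ch):04X}"))
    return hits


# U+E000 is the first code point of the Private Use Area.
print(suspicious_chars("a\ue000b"))
```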

Page 26: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Unexpected Content Scenarios (cont.)

• Non-standard Structure

<body>
     <boxed-text>
          <sec>
               <title>Optical Activities in Industry</title>
               <p>66 Summer Street, North Brookfield, Mass. Mr. Cooke welcomes news and comments for this column which should be sent to him at the above address</p>
               <p><inline-graphic xlink:href="ao-8-4-792-i001"/></p>
          </sec>
     </boxed-text>

Page 27: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Unexpected Content Scenarios (cont.)

• White space filler

Page 28: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

4) Keep Checking Quality – Don’t Get Too Far Ahead

• Visual review

• OSA Schematron

• Reporting stylesheets

• OCR and hyphenation spellchecker software

• QA software

• Learning databanks

Page 29: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Visual Review

• Correct entities are used

• Math displays correctly

• Table alignment is accurate

• Images correspond to the source

Page 30: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

OSA Schematron

• The Schematron includes over 300 checks

Warning: ALERT [LJF:RGCO250]: ref 'b10': unpublished materials must have @publication-type='other' ($unpublished and @publication-type != 'communication' and @publication-type != 'other' / warning) [report]

Warning: ALERT [LJF:JBCO140]: no tables found but title reads 'Figures and Tables' (matches(title, 'Table') and not(exists(table-wrap)) / warning) [report]

ERROR [LJF:RGCO250]: ref 'b14': journal citation contains more than one article-title (count(article-title) > 1) [report]
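To make one of these checks concrete, the sketch below re-expresses the `count(article-title) > 1` rule in Python with the standard library. It is a stand-in for illustration, not OSA's Schematron. The sample XML is invented, and the choice of `mixed-citation` as the citation element is an assumption, since the slides show only the message text and the `@publication-type` attribute.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the ids and the attribute name echo the slide.
SAMPLE = """<ref-list>
  <ref id="b14">
    <mixed-citation publication-type="journal">
      <article-title>First title</article-title>
      <article-title>Second title</article-title>
    </mixed-citation>
  </ref>
  <ref id="b15">
    <mixed-citation publication-type="journal">
      <article-title>Only title</article-title>
    </mixed-citation>
  </ref>
</ref-list>"""


def duplicate_title_errors(root):
    """Report journal citations that hold more than one article-title child."""
    errors = []
    for ref in root.iter("ref"):
        for citation in ref:
            if citation.get("publication-type") != "journal":
                continue
            # findall() with a bare tag name looks at direct children only,
            # mirroring the XPath count(article-title).
            if len(citation.findall("article-title")) > 1:
                errors.append(
                    f"ERROR: ref '{ref.get('id')}': journal citation "
                    "contains more than one article-title"
                )
    return errors


errors = duplicate_title_errors(ET.fromstring(SAMPLE))
for line in errors:
    print(line)
```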

Page 31: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

DCL QA Software

• Highlight any discrepancies between the specifications and the tagging

• Identify suspicious paragraph starts

• Flag missing external files associated with the XML

• Find missing cross references to specified structures such as Tables and Figures
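The last of these checks, finding tables and figures that nothing in the text refers to, can be sketched in a few lines of Python. This is an illustration only, not DCL's QA software. It assumes JATS-style markup in which `fig` and `table-wrap` elements carry an `id` and `xref` elements point to them through a space-separated `rid` attribute; the sample article is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: f1 is cited in the text, f2 and t1 are not.
SAMPLE = """<article><body>
  <p>See <xref ref-type="fig" rid="f1">Fig. 1</xref>.</p>
  <fig id="f1"/>
  <fig id="f2"/>
  <table-wrap id="t1"/>
</body></article>"""


def uncited_floats(root):
    """Return ids of figures and tables that no xref points to."""
    cited = set()
    for xref in root.iter("xref"):
        # rid may list several targets separated by spaces.
        cited.update(xref.get("rid", "").split())
    return [
        element.get("id")
        for element in root.iter()
        if element.tag in ("fig", "table-wrap") and element.get("id") not in cited
    ]


print(uncited_floats(ET.fromstring(SAMPLE)))
```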

Page 32: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Hyphenation Spellchecker
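The slide gives no detail about this tool, so the following Python sketch only illustrates the general problem: a word split by a hyphen at a line end may be an ordinary word ("conver-" + "sion") or a genuinely hyphenated compound ("well-" + "polished"). The sketch assumes a word list is available to decide between the two. The tiny `LEXICON`, the function name `rejoin`, and the `[?]` review marker are all hypothetical.

```python
import re

# Hypothetical word list; a real one would be far larger.
LEXICON = {"conversion", "wavelength", "well-polished"}

# A word fragment, a hyphen at the end of a line, then the next fragment.
LINE_BREAK = re.compile(r"(\w+)-\n(\w+)")


def rejoin(text, lexicon=LEXICON):
    """Rejoin words hyphenated across line breaks, flagging unknown cases."""

    def fix(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        hyphenated = f"{left}-{right}"
        if joined.lower() in lexicon:
            return joined          # ordinary word split by the typesetter
        if hyphenated.lower() in lexicon:
            return hyphenated      # compound that keeps its hyphen
        return f"{hyphenated}[?]"  # unknown: leave for a human reviewer

    return LINE_BREAK.sub(fix, text)


print(rejoin("NLM conver-\nsion of well-\npolished pieces"))
```

Marking unknown cases instead of guessing keeps the decision with a proofreader, in line with the deck's emphasis on ongoing quality review.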

Page 33: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Reporting Stylesheets

• Provides easier review of metadata components for a set of articles

Page 34: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

OCR Tools

• Modified fonts, used during proofreading, that help distinguish similar-looking characters: “O” vs “0”, “Z” vs “2”, “1” vs “l”
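The fonts described here support human proofreading. As a complementary idea, and purely as an assumption rather than a description of DCL's tools, tokens that mix digits with the look-alike letters named on the slide can also be flagged by a script. The Python sketch below uses hypothetical names (`LOOKALIKES`, `suspect_numbers`) and an invented sample sentence.

```python
import re

# Letter-to-digit pairs taken from the slide: O/0, Z/2, l/1.
LOOKALIKES = {"O": "0", "Z": "2", "l": "1"}

# A token made only of digits and the look-alike letters.
NUMERIC_LIKE = re.compile(r"[0-9OZl]+")


def suspect_numbers(text):
    """Return (token, suggested reading) pairs for likely misread numbers."""
    suspects = []
    for token in text.split():
        core = token.strip(".,;:()")
        if (
            NUMERIC_LIKE.fullmatch(core)
            and any(ch.isdigit() for ch in core)
            and any(ch in LOOKALIKES for ch in core)
        ):
            suggestion = "".join(LOOKALIKES.get(ch, ch) for ch in core)
            suspects.append((core, suggestion))
    return suspects


print(suspect_numbers("Published in 19l7, vol. 1O, pages 750,000."))
```

The suggestion is only a hint for the proofreader, since a token such as a chemical label may legitimately mix letters and digits.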

Page 35: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Learning Databanks

Ongoing updates made based on feedback and newly determined rules and structures

• Conversion software

• QA software

• Schematron

• Spellchecker and hyphenation software

• Editorial guidelines

• Image creation

Page 36: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Conclusions

OSA has nearly completed a large backfile conversion project in close coordination with DCL. The project, which is based around NLM markup, has allowed OSA to enhance its publishing platform, build derivative products, and significantly improve its ability to gather business intelligence from a deep journal backfile. We offer the following lessons learned:

• With large content projects, plan ahead but prepare to work in an agile fashion

• The content owner should stay engaged throughout the project to align real-time decisions with business aims

• Owner–vendor collaboration—when the right partners are involved—improves morale, attention to detail, and decision-making

Page 37: NLM Conversion to Build “Atomic” Physics Content in an Agile Fashion

Scott Dineen
Sr. Director Publishing Production & Technol.
The Optical Society
[email protected]

Devorah Ashlem
Senior Project Manager
Data Conversion Laboratory
[email protected]