sherborn: lyal - digitising legacy taxonomic literature: processes, products and using the output

Digitising legacy taxonomic literature: processes, products and using the output

Chris Lyal, The Natural History Museum, London

What are we trying to achieve with digitisation?

In the first place:
• improving access;
• improving security for originals;
• commercial benefit.

Digital copies accessible through the internet are a means of achieving these objectives.

How to find digitised taxonomic information

But:
• How does one locate a digital item?

– multiple libraries, not all searchable through Google etc

– search terms sometimes rigid (loss of benefit of library systems like Dewey and browsing)

– unit size being searched for (volume cf author + date)

• Creating digital versions of traditional libraries is a vital step but not maximally efficient


• First pass of solutions:
– Expose digital content to web searches
– Enable searches for smaller entities
  • Article
  • Text element (e.g. treatment)
– Enable searches for key index terms, e.g. author, taxon name
  • Author ‘easy’ if publication author (part of metadata);
  • if ‘other author’, subject to OCR issues
– Taxon names also cause OCR issues


• OCR efficiency relatively poor (SI’s 99.995% rarely reached)
– Can be improved with some techniques (ABLE project)
– Still may require interpretation (author name abbreviations, genus abbreviations etc.)
– (Born digital provides much more reliable search)
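The interpretation step mentioned above (OCR misreads of taxon names, genus abbreviations) can be sketched as follows. This is a minimal illustration, not the ABLE project's method: the reference list `KNOWN_NAMES`, the similarity cutoff and both function names are assumptions made for the example, and a real system would draw its names from a nomenclator or taxonomic database.

```python
import difflib
import re

# Hypothetical reference list; in practice this would come from a
# nomenclator or taxonomic database.
KNOWN_NAMES = ["Curculio glandium", "Curculio nucum", "Sitophilus oryzae"]


def match_ocr_name(ocr_text, known=KNOWN_NAMES, cutoff=0.8):
    """Return the known name most similar to an OCR string, or None."""
    matches = difflib.get_close_matches(ocr_text, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


def expand_genus_abbreviation(name, context_genus):
    """Expand 'C. nucum' to 'Curculio nucum' when the initial fits the
    genus most recently spelled out in the text (context_genus)."""
    m = re.match(r"^([A-Z])\.\s+(\S.*)$", name)
    if m and context_genus.startswith(m.group(1)):
        return f"{context_genus} {m.group(2)}"
    return name  # leave unchanged if the abbreviation cannot be resolved
```

For example, `match_ocr_name("Curcu1io glandium")` resolves the digit-for-letter misread to the listed name, while a string unlike any listed name returns `None` and can be passed to a human for checking.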

What do we really want to build?

• We have been building forward from the past
– digital analogues of traditional publications
– extraction of, e.g., specimen data
• not back from the future
– What properties do we require of future taxonomic information access?
– How do we apply this to legacy literature?


Search:
– Requires single search
– Wide range of search terms
– Simple / Boolean search

Retrieval:
– Article
– Subsets of taxonomic publications
– All descriptions of a taxon and its children
– Repurposable downloads
– Excludes the stuff we don’t want


(Some) subsets of Articles and Treatments:
– Hierarchy
– Taxon name + author + date + nomenclatural/taxonomic act
– Original description citation [name][author][date][reference]
– Subsequent taxonomic / nomenclatural changes citation
– Diagnosis, description
– Biological associations
– Specimen data
– Character statements


Ideally:
• retrieve such data without manually scanning the whole paper and copying out the required data
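As an illustration of retrieving the [name][author][date][reference] structure without manual copying, the sketch below parses one citation string into separate fields. The input layout ("Name Author, Year: reference"), the `OriginalCitation` type and the `parse_citation` function are assumptions for the example; "Aus bus" is the conventional placeholder name, and real legacy citations vary far more than this pattern allows (for instance, authors whose names begin with a lower-case particle are not handled).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class OriginalCitation:
    name: str
    author: str
    year: int
    reference: str


# Assumed layout: "<Name> <Author>, <Year>: <reference>"
CITATION_RE = re.compile(
    r"^(?P<name>[A-Z][a-z]+(?: [a-z]+){0,2}) "  # genus plus up to two epithets
    r"(?P<author>[^,]+), "
    r"(?P<year>\d{4}): "
    r"(?P<reference>.+)$"
)


def parse_citation(text: str) -> Optional[OriginalCitation]:
    """Split a citation string into fields, or return None if it does not fit."""
    m = CITATION_RE.match(text.strip())
    if m is None:
        return None
    return OriginalCitation(m["name"], m["author"], int(m["year"]), m["reference"])
```

Strings that do not fit the assumed layout return `None` rather than a partial record, so they can be routed to manual markup.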


If we can do this, then:

Dynamic linking of new content with extant systems:
– automatic population and updating of taxonomic catalogues, faunal lists, EoL etc.
– compare classifications in different publications

Retrieval of normalised data for re-use
– (cf Endnote or Mendeley)

Population of ZooBank
– Automatic assessment of availability

What do we really want to build? (ZooBank)

• The publication is obtainable in numerous identical copies – metadata
• Publication: if non-paper, deposited in at least 5 major publicly-accessible libraries – metadata
• Publication: not excluded by Article 9 – metadata
• Name: published using the Latin alphabet – metadata
• Name: in the case of species-group names, agrees in gender with the genus name – markup + algorithm
• Name: in the case of family-group names, has a permitted ending – markup + list
• Name: in the case of family-group names, has an ending appropriate for the rank given – markup + list + algorithm
• Name: in the case of family-group names, is based on the genus name stated – markup + algorithm
• Name: not already registered – markup + ZooBank search
• Name: contains more than one letter – markup + algorithm
• Genus in which new species-group name is placed (if applicable) – markup
• The name is not published as a synonym but as a valid name – markup
• Valid genus name on which new family-group name is based – markup
• Type species of new genus-group name (including original combination, author and date) – markup
• Description of taxon, or bibliographic reference to a description, is part of publication – markup + algorithm
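Several of the "markup + list" and "markup + algorithm" checks above are simple enough to sketch in code once the name, rank and stated genus have been marked up. The following is a rough illustration only, not ZooBank functionality: the function names are hypothetical, the endings are the standard suffixes for family-group ranks, and the genus check is a crude prefix comparison that ignores proper Latin stem formation.

```python
# Standard family-group endings, keyed by rank (the "list" in "markup + list").
FAMILY_GROUP_ENDINGS = {
    "superfamily": "oidea",
    "family": "idae",
    "subfamily": "inae",
    "tribe": "ini",
    "subtribe": "ina",
}


def has_permitted_ending(name):
    """Family-group name ends in one of the permitted suffixes."""
    return any(name.lower().endswith(e) for e in FAMILY_GROUP_ENDINGS.values())


def ending_matches_rank(name, rank):
    """Family-group name carries the suffix appropriate to its stated rank."""
    ending = FAMILY_GROUP_ENDINGS.get(rank.lower())
    return ending is not None and name.lower().endswith(ending)


def is_based_on_genus(name, genus, min_shared=3):
    """Crude heuristic (an assumption, not a rule of the Code): the name and
    the stated genus share a leading run of at least min_shared letters."""
    shared = 0
    for a, b in zip(name.lower(), genus.lower()):
        if a != b:
            break
        shared += 1
    return shared >= min_shared


def has_more_than_one_letter(name):
    """Name contains more than one letter."""
    return sum(ch.isalpha() for ch in name) > 1
```

Checks that need external lookups or linguistic knowledge (registration status via a ZooBank search, gender agreement) are deliberately left out of this sketch.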

What do we need to do?

• User-needs assessment: clarity on what to retrieve

• Overview of necessary system


[Figure: overview diagram of biodiversity data domains, their associated data standards, and the relationships between them. The layout is not recoverable from the extracted text; the box contents are listed below.]

– Ecology. Data: TDWG OBS; SEEK; LTER & ...
– Taxon Concepts. Data: TDWG TNC
– Names (including Synonyms). Data: [Linnaean Core]
– Specimens. Data: TDWG ABCD; TDWG Darwin Core; TDWG Image / Morphobank; TDWG OBS
– Literature. Data: emerging TDWG Lit standard; taXMLit; taxonX; other single use standards; Relationship to MorphBank?
– Distribution & Geospatial. Data: TDWG IMG; TDWG GIG, OGC & other external standards
– Barcodes & Sequences. Data: various standards
– Morphology, other data. Data: various standards (or none)
– Collections. Data: NCD
– Characters. Data: TDWG SDD; TDWG IMG; Morphbank
– Phylogenetic & other analyses (various standards)
– Other taxon level generalizations. Data: TDWG ISIG; TDWG SPM; Pollinators
– Common to All: Data Source; Time; Agents (people); GUIDs

Labels on the connecting arrows and remaining text: Identification; Taxon Level; Informs & Includes; Included in; Informed by; Provides; Adds to; Implies; Defines; vouchers & material for analyses.

Interoperable links to library standards

What do we need to do?

• Bibliography
• Author / agent database with synonyms
• Repositories for:
– Original texts (e.g. BHL)
– marked-up texts
• XML markup schema(s)
– mark-up atomisation cf data retrieval and integration
• Links / interoperability between different systems
• Nomenclator (ZooBank)
• Taxonomic databases (linked to ZooBank)
• Effective search system
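To make the link between mark-up atomisation and data retrieval concrete, the sketch below pulls structured records out of a small marked-up text. The element names (`treatment`, `taxonName`, `author`, `year`, `description`) and the sample content are invented for illustration and do not follow TaxPub, taxonX or taXMLit; the point is only that each atomised element becomes a field that can be retrieved and integrated without reading the whole text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; element names are illustrative, not a real schema.
SAMPLE = """<publication>
  <treatment>
    <taxonName rank="species">Aus bus</taxonName>
    <author>Smith</author>
    <year>1900</year>
    <description>Body elongate, dark brown.</description>
  </treatment>
  <treatment>
    <taxonName rank="species">Aus cus</taxonName>
    <author>Jones</author>
    <year>1905</year>
    <description>Body oval, pale.</description>
  </treatment>
</publication>"""


def extract_treatments(xml_text):
    """Return one dict per <treatment>, with a field per marked-up element."""
    root = ET.fromstring(xml_text)
    records = []
    for treatment in root.iter("treatment"):
        name_el = treatment.find("taxonName")
        records.append({
            "name": name_el.text if name_el is not None else None,
            "rank": name_el.get("rank") if name_el is not None else None,
            "author": treatment.findtext("author"),
            "year": treatment.findtext("year"),
            "description": treatment.findtext("description"),
        })
    return records
```

The finer the atomisation, the more fields can be extracted this way, at the cost of more mark-up effort per page.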

What do we need to do?

• Bibliography (TL-2; ViBRANT bibliography of life; CiteBank)
– Library and taxonomic sector standards
– Standard citations
– Abbreviations
– De-duplication
– Location of resource
– free
– open-source
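The abbreviation and de-duplication requirements can be illustrated with a small sketch that reduces each bibliographic record to a normalised key. This is an assumption-laden example, not how TL-2, the ViBRANT bibliography or CiteBank work: the abbreviation table, the stop-word list, the record fields and the sample records are all hypothetical.

```python
import re

# Hypothetical abbreviation table; a real one would follow an agreed
# library or taxonomic-sector standard.
ABBREVIATIONS = {"j": "journal", "zool": "zoology", "proc": "proceedings", "soc": "society"}
STOP_WORDS = {"of", "the", "and"}


def normalise_title(title):
    """Lower-case, drop punctuation and stop words, expand abbreviations."""
    words = re.findall(r"[a-z]+", title.lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words if w not in STOP_WORDS)


def citation_key(record):
    """Key under which two records are treated as the same publication."""
    return (
        record["author"].strip().lower(),
        record["year"],
        normalise_title(record["journal"]),
        record["volume"],
    )


def deduplicate(records):
    """Keep the first record seen for each citation key."""
    unique = {}
    for record in records:
        unique.setdefault(citation_key(record), record)
    return list(unique.values())
```

With this key, a record citing "J. Zool." and another citing "Journal of Zoology" for the same author, year and volume collapse to a single entry, while a different volume stays separate.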

Options, needs and current activities

• XML schemas: e.g., TaxPub, taxonX and taXMLit

• ViBRANT:
– developing a new workflow for legacy literature
– seeking to increase the automated component of the process
– developing workflow for new literature

• Manual and automatic data retrieval demonstrated by INOTAXA, Plazi

Options, needs and current activities

• ZooBank:
– need to consider the properties required to be part of a larger picture
• Taxonomic databases:
– fragmented with non-standard terminology and content;
– Catalogue of Life not tailored to this particular vision;
– Need to be standardised for content and properties to be part of a coherent system.

Agree what we want to build
• And what we expect it to deliver

Identify components
• databases
• data types
• interoperability

Prioritise the content
