OCR implementation in The Caribbean Plants Digitization Project

A project to image and catalog over 150,000 Caribbean specimens at the New York Botanical Garden. Presented by: Stephen Gottschalk, The New York Botanical Garden.


Post on 16-Feb-2016



TRANSCRIPT

Page 1: OCR implementation in The Caribbean Plants Digitization Project

OCR implementation in The Caribbean Plants Digitization Project

A project to image and catalog over 150,000 Caribbean specimens at the New York Botanical Garden

*Legend: estimated number of specimens per country

The New York Botanical Garden

Presented by: Stephen Gottschalk

Page 2: OCR implementation in The Caribbean Plants Digitization Project

NYBG’s Caribbean Collections

More than 100 expeditions sponsored by the Garden since 1895.

Notable and prolific collections by current and former Garden staff, including the Garden’s founder, Nathaniel Lord Britton.

Approximately 75% of the specimen data could be digitized from field books at NYBG and other institutions, or from published itineraries that provide the same information.

Page 3: OCR implementation in The Caribbean Plants Digitization Project

Caribbean Project workflow summary:

Curation and rapid barcoding of specimens

Specimen imaging

Optical Character Recognition (OCR) and data parsing

Field book entries

Manual keying of specimen data

Specimen Catalog Record

Page 4: OCR implementation in The Caribbean Plants Digitization Project

Sample ideal fieldbook:

Plant family

Plant description

Collection locality

No. of duplicates

Determination

Collection no.

Collection date

Habitat

Page 5: OCR implementation in The Caribbean Plants Digitization Project


Sample fieldbook - the product:

Page 6: OCR implementation in The Caribbean Plants Digitization Project

Sample Caribbean fieldbooks, less than ideal:

Vol. 132, J. A. Shafer, 1909

Vol. 69, Van Hermann, 1904

Page 7: OCR implementation in The Caribbean Plants Digitization Project

OCR assists in attaching fieldbook records:

OCR-derived fields

Fieldbook entries

User input

IRN
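The attachment step can be pictured as a lookup keyed on values parsed from the OCR text. The sketch below is only an illustration: it assumes the OCR-derived fields are collector and collection number, and the record shapes, field names, and sample values are hypothetical rather than the project's actual schema.

```python
def attach_fieldbook(specimens, entries):
    """Attach the matching fieldbook entry (or None) to each specimen record."""
    # Index fieldbook entries by the assumed key: (collector, collection number).
    index = {(e["collector"], e["number"]): e for e in entries}
    return [
        {**s, "fieldbook": index.get((s["collector"], s["number"]))}
        for s in specimens
    ]


# Hypothetical data, for illustration only.
entries = [
    {"collector": "J. A. Shafer", "number": "101",
     "locality": "example locality A", "date": "1909"},
]
specimens = [
    {"irn": 1, "collector": "J. A. Shafer", "number": "101"},
    {"irn": 2, "collector": "J. A. Shafer", "number": "999"},
]

attached = attach_fieldbook(specimens, entries)
for record in attached:
    print(record["irn"], record["fieldbook"])
```

Specimens without a matching entry keep a fieldbook value of None, so they can be routed to manual keying.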

Page 8: OCR implementation in The Caribbean Plants Digitization Project

Using OCR to populate fields:

Python script finds line of query term

User detects pattern to update fields

Query raw OCR to extract records of a given label type

Example:

SELECT * FROM OCR_all WHERE label LIKE "*New*Yor*Bot*Gar*Exp*Cub*";
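A runnable sketch of the same kind of query, using Python's built-in sqlite3 module. The slide's asterisk wildcards are written here as %, the standard SQL LIKE wildcard. The irn column and the sample label texts are hypothetical; only the table name OCR_all and the label column come from the slide.

```python
import sqlite3


def find_label_type(conn, pattern):
    """Return (irn, label) rows whose raw OCR text matches a LIKE pattern."""
    return conn.execute(
        "SELECT irn, label FROM OCR_all WHERE label LIKE ? ORDER BY irn",
        (pattern,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OCR_all (irn INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO OCR_all VALUES (?, ?)",
    [
        # Hypothetical OCR output, including one damaged header.
        (1, "New York Botanical Garden Exploration of Cuba Col. J. A. Shafer"),
        (2, "New Yor< Botanlcal Gardcn Exploratlon of Cuba"),
        (3, "Plantae Cubenses Wrightianae"),
    ],
)

# Short fragments joined by wildcards tolerate OCR errors between them.
rows = find_label_type(conn, "%New%Yor%Bot%Gar%Exp%Cub%")
for irn, label in rows:
    print(irn, label)
```

Matching on short fragments rather than the full header means a label is still retrieved when OCR errors fall outside the fragments, as in the second sample row.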

Page 9: OCR implementation in The Caribbean Plants Digitization Project

Using OCR to populate fields:

Example (Python script finds line of query term): return the line containing “Col”
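A minimal sketch of such a script, assuming the raw OCR for one label is available as a multi-line string. The function name and the sample text are hypothetical; the original script is not shown in the slides.

```python
def lines_containing(ocr_text, term, ignore_case=False):
    """Return the stripped lines of an OCR text block that contain a query term."""
    needle = term.lower() if ignore_case else term
    matches = []
    for line in ocr_text.splitlines():
        haystack = line.lower() if ignore_case else line
        if needle in haystack:
            matches.append(line.strip())
    return matches


# Hypothetical raw OCR text for one label.
sample = (
    "NEW YORK BOTANICAL GARDEN\n"
    "Exploration of Cuba\n"
    "Col. J. A. Shafer 1909\n"
)

collector_lines = lines_containing(sample, "Col")
print(collector_lines)
```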

Page 10: OCR implementation in The Caribbean Plants Digitization Project

Using OCR to populate fields:

Example (user detects pattern to update fields): length of string; find position of “j. a.”; find “sha”; find “afer”

J. A. Shafer collections!
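The string measurements named on this slide can be sketched in a few lines of Python. The cue strings come from the slide; the function name and the sample lines are hypothetical.

```python
def cue_positions(line, cues=("j. a.", "sha", "afer")):
    """Return the line length and the position of each cue (-1 if absent)."""
    lowered = line.lower()
    result = {"length": len(line)}
    for cue in cues:
        result[cue] = lowered.find(cue)
    return result


clean = cue_positions("Col. J. A. Shafer")
damaged = cue_positions("Col. J. A. Sbafer")  # hypothetical OCR error: "h" read as "b"
print(clean)
print(damaged)
```

Overlapping fragments such as “sha” and “afer” mean that a single misread character still leaves at least one cue in place for the user to spot.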

Page 11: OCR implementation in The Caribbean Plants Digitization Project


Avoid false positives:

F. S. Earle – no!
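One way to reduce false positives is to require more than one cue before assigning a collector. The slide does not say which pattern wrongly matched the F. S. Earle labels, so the rule below (at least two of the three Shafer cues) and the sample lines are assumptions, offered only as a sketch.

```python
SHAFER_CUES = ("j. a.", "sha", "afer")


def looks_like_shafer(line, cues=SHAFER_CUES, min_hits=2):
    """Accept a line only if at least min_hits of the cues occur in it."""
    lowered = line.lower()
    hits = sum(1 for cue in cues if cue in lowered)
    return hits >= min_hits


# Hypothetical OCR lines.
candidates = [
    "Col. J. A. Sbafer",         # damaged, but two cues survive
    "Col. F. S. Earle",          # no cues
    "Shaded ravine near river",  # a single cue ("sha") is not enough
]
decisions = [looks_like_shafer(line) for line in candidates]
print(list(zip(candidates, decisions)))
```

Raising min_hits trades recall for precision, so damaged labels that keep only one cue would then need manual review.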

Page 12: OCR implementation in The Caribbean Plants Digitization Project

Consider pattern training and a second OCR pass:

Wright Labels, 162 total, generally low quality:

[Chart. Axis titles: “Percentage correctly OCR’d” (0.0% to 90.0%) and “OCR Pattern Training Used”. Labels: "Plantæ", "Cubenses", "Wrightianæ", Full String.]

Page 13: OCR implementation in The Caribbean Plants Digitization Project

Consider pattern training and a second OCR pass:

Zanoni Labels, 114 total, generally typed:

[Chart. Axis titles: “Percentage correctly OCR’d” (0.0% to 100.0%) and “OCR Pattern Training Used” (built-in, trained once, trained mult, trained other, trained both). Labels: "Moscoso"; "Rafael"; "Zanoni"; Full Heading: Jardin Botanico Nacional "Dr. Rafael M. Moscoso"; heading with " and . punctuation stripped: Jardin Botanico Nacional Dr Rafael M Moscoso.]
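The percentages in these charts could be computed along the following lines: count the labels whose OCR text contains a target string, optionally after stripping the " and . characters from both sides. This is a sketch under assumptions; the function names and the sample OCR outputs are hypothetical, and the slides do not describe how the figures were actually calculated.

```python
def strip_punct(text, chars='".'):
    """Remove the given punctuation characters from a string."""
    return text.translate(str.maketrans("", "", chars))


def hit_rate(ocr_texts, target, normalize=None):
    """Percentage of OCR texts that contain the target string."""
    if not ocr_texts:
        return 0.0
    if normalize is not None:
        target = normalize(target)
        ocr_texts = [normalize(text) for text in ocr_texts]
    hits = sum(1 for text in ocr_texts if target in text)
    return 100.0 * hits / len(ocr_texts)


heading = 'Jardin Botanico Nacional "Dr. Rafael M. Moscoso"'

# Hypothetical OCR outputs for four labels.
ocr_texts = [
    'Jardin Botanico Nacional "Dr. Rafael M. Moscoso"',  # perfect
    'Jardin Botanico Nacional Dr. Rafael M Moscoso"',    # punctuation lost
    'Jardin Botanico Naciona1 "Dr. Rafael M. Moscoso"',  # letter misread
    'Jardin Botanico Nacional "Dr Rafael M. Moscoso',    # punctuation lost
]

exact_rate = hit_rate(ocr_texts, heading)
stripped_rate = hit_rate(ocr_texts, heading, normalize=strip_punct)
print(exact_rate, stripped_rate)
```

Comparing the two rates separates punctuation errors from letter errors, which is the distinction the stripped-heading series in the chart appears to draw.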

Page 14: OCR implementation in The Caribbean Plants Digitization Project

Closing thoughts:

• OCR plus human parsing works well with very little programming.

• Works well for large, self-contained data sets but maybe not for partial or changing data sets; automation would be helpful for addressing this.

• Allows for creation of “digital” fieldbooks (i.e., order by collector, collection number and place).
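A “digital” fieldbook of this kind amounts to sorting the catalog records. The sketch below assumes each record is a dictionary with collector, number, and place keys; those names and the sample records are hypothetical. Collection numbers are sorted numerically so that 99 comes before 100, with any suffix such as “a” used as a tie-breaker.

```python
import re


def number_key(number):
    """Sort key for collection numbers such as '99', '100', or '12a'."""
    match = re.match(r"(\d+)(.*)", number.strip())
    if match:
        return (int(match.group(1)), match.group(2))
    # Numbers without leading digits sort after all numeric ones.
    return (float("inf"), number)


def digital_fieldbook(records):
    """Order records by collector, then collection number, then place."""
    return sorted(
        records,
        key=lambda r: (r["collector"], number_key(r["number"]), r["place"]),
    )


# Hypothetical records.
records = [
    {"collector": "Shafer", "number": "100", "place": "example place B"},
    {"collector": "Shafer", "number": "99", "place": "example place A"},
    {"collector": "Britton", "number": "12a", "place": "example place C"},
]

ordered = digital_fieldbook(records)
for record in ordered:
    print(record["collector"], record["number"], record["place"])
```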

Page 15: OCR implementation in The Caribbean Plants Digitization Project

Acknowledgements

National Science Foundation

Barbara Thiers, Jacquelyn Kallunki, Michael Bevans, Anthony Kirchgessner, Melissa Tulig, Benito Santos, Nicole Tarnowsky, Tom Zanoni, Benjamin Saracco, Stephen Sinon, Vinson Doyle, Jessica Allen, Sarah Dutton, Lane Gibbons, Elizabeth Kiernan, Brandy Watts, Charles Zimmerman

Visit the Virtual Herbarium: http://sciweb.nybg.org/science2/vii2.asp