Making Textual Information More Accessible

Holly Miller, Florida Institute of Technology

Post on 16-Aug-2015

About me

• Biochemist
• Curious about Information
• Librarian
• Informatician/Project Director/Library Director
• Asst. Dean, Scholarly Content & Faculty Engagement

DOMO

Every minute:
• Facebook users share nearly 2.5 million pieces of content.
• Twitter users tweet nearly 300,000 times.
• Email users send over 200 million messages.

50 million articles published by 2009

Jinha (2010) Learned Publishing 23:258.

Cancer - 694,372 articles in the last 5 years

Climate change – 88,565

Species extinction – 8,453

Median # of articles read in a year – 264*

[Chart: Number of Articles, log scale from 100 to 1,000,000. Cancer: 694,372; Climate Change: 88,565; Species Extinction: 8,453; # of articles read/year: 264]

*Nature News (2014) Scientists may be reaching a peak in reading habits
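To put the figures above in perspective, here is a minimal sketch of the arithmetic: how many years it would take to read each topic's five-year output at the median rate of 264 articles per year. The function and variable names are illustrative, and the calculation assumes, unrealistically, that nothing new is published in the meantime.

```python
# Article counts for the last 5 years and the median reading rate, as quoted on the slides.
ARTICLES_LAST_5_YEARS = {
    "Cancer": 694_372,
    "Climate change": 88_565,
    "Species extinction": 8_453,
}
MEDIAN_READ_PER_YEAR = 264


def years_to_read(article_count, per_year=MEDIAN_READ_PER_YEAR):
    """Years needed to read article_count articles at a fixed yearly rate."""
    return article_count / per_year


for topic, count in ARTICLES_LAST_5_YEARS.items():
    print(f"{topic}: about {years_to_read(count):,.0f} years of reading")
```

Even the smallest topic comes out at decades of reading for a single person, which is the point of the next slide.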

Too much information

Example: Species Identification

Names offer a logical way to search for and index content

Names are one of biology’s Controlled Vocabularies

How to do it? In the past….

Georges Louis Leclerc, comte de Buffon. Histoire naturelle : générale et particulière (Oiseaux), 1799-1808

FindIT - Scientific Name Recognition Algorithm

The OCR Problem

Epitonium foliaceicostwm Orbigny Wrinkled-ribbed Wentletrap Southeast Florida to the Lesser Antilles.
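One common way to cope with OCR errors like the one in the name above is approximate matching against a list of known names. The sketch below is an illustration only, not how FindIT handles the problem: it uses Python's standard `difflib`, a tiny hypothetical reference list, and assumes the intended spelling is Epitonium foliaceicostum.

```python
import difflib

# Hypothetical reference list; a real name index would hold far more entries.
KNOWN_NAMES = [
    "Epitonium foliaceicostum",
    "Epitonium lamellosum",
    "Phyllodesmium acanthorhinum",
]


def correct_ocr_name(candidate, known_names=KNOWN_NAMES, cutoff=0.9):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(candidate, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# The OCR output from the slide has "w" where the assumed original has "u".
print(correct_ocr_name("Epitonium foliaceicostwm"))
```

The high cutoff is a deliberate choice in this sketch: a looser threshold would start merging distinct species that share a genus.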

Phyllodesmium acanthorhinum

Source: http://ab.co/1ByZcIb Photographer: Robert Bolland

Machine Learning for Species Identification

Reptilia and Batrachia. (1885-1902) by Albert C.L.G. Günther

NetiNeti: Name Extraction from Textual Information-Name Extraction for Taxonomic Indexing

The fluorescent sea slug Phyllodesmium acanthorhinum is more than just a pretty collection of colors: the creature bridged the gap for scientists trying to understand the relationship between sea slugs that feed on hydroids and those that dine on corals.

Source: http://ab.co/1ByZcIb Photographer: Robert Bolland

Akella et al. BMC Bioinformatics 2012, 13:211 http://www.biomedcentral.com/1471-2105/13/211

Named Entity Recognition (NER)

to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations…

The fluorescent sea slug Phyllodesmium acanthorhinum is more than just a pretty collection of colors:

Adjective noun unknown

How does NetiNeti work?

• Text is tokenized (broken into chunks)
• Prefiltering step
• Probability that token is a name is calculated (structure and context)
• Training (positive and negative examples)
• Features (letter combinations, # of vowels, part of speech)
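The steps listed above can be sketched as a toy pipeline. This is not NetiNeti's code: the prefilter rule, the three features, the small Naive Bayes classifier, and the training pairs are all assumptions chosen for illustration, and the sketch looks only at word structure, not at context or part of speech.

```python
import math
import re
from collections import defaultdict

VOWELS = set("aeiou")


def tokenize(text):
    """Break text into word tokens, dropping punctuation and digits."""
    return re.findall(r"[A-Za-z]+", text)


def candidates(tokens):
    """Prefilter: keep 'Capitalized lowercase' word pairs, the shape of a binomial name."""
    for first, second in zip(tokens, tokens[1:]):
        if first[0].isupper() and first[1:].islower() and second.islower():
            yield first, second


def features(pair):
    """Toy structural features: word endings and a coarse vowel ratio."""
    genus, species = pair
    vowel_ratio = sum(c in VOWELS for c in species) / len(species)
    return {
        "genus_end": genus[-2:],
        "species_end": species[-2:],
        "vowel_bucket": round(10 * vowel_ratio),
    }


class NaiveBayes:
    """Minimal Naive Bayes over the features above, with add-one smoothing."""

    def __init__(self):
        self.label_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)

    def train(self, examples):
        for pair, label in examples:
            self.label_counts[label] += 1
            for name, value in features(pair).items():
                self.feature_counts[(label, name, value)] += 1

    def probability_name(self, pair):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            for name, value in features(pair).items():
                seen = self.feature_counts.get((label, name, value), 0)
                score += math.log((seen + 1) / (count + 2))
            log_scores[label] = score
        # Normalize the two log scores into a probability for the "name" label.
        top = max(log_scores.values())
        weights = {label: math.exp(s - top) for label, s in log_scores.items()}
        return weights[True] / sum(weights.values())


# Hypothetical training pairs: True = scientific name, False = not a name.
TRAINING = [
    (("Homo", "sapiens"), True),
    (("Epitonium", "lamellosum"), True),
    (("Aplysia", "californica"), True),
    (("Danaus", "plexippus"), True),
    (("The", "fluorescent"), False),
    (("Florida", "to"), False),
    (("Named", "entity"), False),
    (("Machine", "learning"), False),
]

model = NaiveBayes()
model.train(TRAINING)

sentence = ("The fluorescent sea slug Phyllodesmium acanthorhinum is more "
            "than just a pretty collection of colors:")
for pair in candidates(tokenize(sentence)):
    print(" ".join(pair), round(model.probability_name(pair), 2))
```

With these eight training pairs, "Phyllodesmium acanthorhinum" scores as more likely a name than "The fluorescent", mainly because of its Latin-looking "-um" endings. A real system would need far more training data and the contextual features named on the slide.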

The fluorescent sea slug Phyllodesmium acanthorhinum is more than just a pretty collection of colors:

name not a name

How well does NetiNeti work?

http://gnrd.globalnames.org/

Connecting Biodiversity Literature to EOL

Questions?

The language of birds. London: Saunders and Otley, 1837. biodiversitylibrary.org/page/47512020 via Flickr

Thank You!