Post on 19-Jan-2016


The Pragmatics of Ontology and Heterogeneous Data Sources

The Ins and Outs of CTSAsearch

David Eichmann

School of Library and Information Science

University of Iowa

Research Networking

• Programmatic support for discovery and use of research and scholarly information regarding people and resources.

• Research networking systems are essentially special-purpose institutional knowledge management systems.

Representative RN Systems

• Profiles (Harvard)
• VIVO (VIVO Consortium)
• Loki (Iowa)
• SciVal Experts (aka Pure – Elsevier)
• A number of others

Why Bother with VIVO (the ontology)?

• Words in a profile are just sequences of characters carrying no meaning
  – Try asking Google Scholar what grant funded a given hit…
• With structure and relationship comes meaning, aka semantics
  – Enter the Semantic Web!

Connecting the Dots

• The real challenge here is translating information that already exists in scattered sources
  – Research networking tools
  – Citation databases (e.g., PubMed)
  – Award databases (e.g., NIH RePORTER)
  – Curated archives (e.g., GenBank)
  – Locked up in text (the research literature)

CTSAsearch – version 1

• 10 SPARQL endpoints
• 19 institutions
• 124,945 individuals
• Handling the queries proved challenging for some sites

CTSAsearch – version 1

 subclass           | count
--------------------+---------
 NonFacultyAcademic | 2592383
 FacultyMember      | 26826
 NonAcademic        | 15268
 EmeritusFaculty    | 2134
 EmeritusProfessor  | 2070
 Postdoc            | 1226
 Librarian          | 232
 Student            | 89
 GraduateStudent    | 71
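A per-subclass tally like this one could be gathered from each SPARQL endpoint with an aggregate query and then summed across sites. The following is a minimal sketch under stated assumptions, not the harvester CTSAsearch actually used: the choice of the VIVO core namespace as a filter, and the helper names `build_count_query`, `parse_counts`, and `merge_counts`, are illustrative. Only the standard SPARQL 1.1 JSON results layout is relied on.

```python
import json
from collections import Counter

# Assumption for illustration: person subclasses live in the VIVO core
# namespace and are asserted directly with rdf:type.
VIVO_CORE = "http://vivoweb.org/ontology/core#"


def build_count_query(namespace: str = VIVO_CORE) -> str:
    """Return a SPARQL 1.1 query that counts individuals per class."""
    return (
        "SELECT ?subclass (COUNT(DISTINCT ?person) AS ?count)\n"
        "WHERE {\n"
        "  ?person a ?subclass .\n"
        f'  FILTER(STRSTARTS(STR(?subclass), "{namespace}"))\n'
        "}\n"
        "GROUP BY ?subclass\n"
        "ORDER BY DESC(?count)"
    )


def parse_counts(results_json: str) -> dict:
    """Convert a SPARQL JSON results document to {local class name: count}."""
    doc = json.loads(results_json)
    counts = {}
    for row in doc["results"]["bindings"]:
        iri = row["subclass"]["value"]
        local_name = iri.rsplit("#", 1)[-1]  # keep the part after '#'
        counts[local_name] = int(row["count"]["value"])
    return counts


def merge_counts(per_site: list) -> Counter:
    """Sum the per-site dictionaries into one tally."""
    total = Counter()
    for counts in per_site:
        total.update(counts)  # Counter.update adds counts, it does not replace
    return total
```

Sending the query to an endpoint is left out on purpose; any HTTP client that requests `application/sparql-results+json` would return a document that `parse_counts` can read.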

CTSAsearch – version 2

• 10 SPARQL endpoints (19 institutions)
• 15 VIVO sites
  – Harvested with customized crawler
• 14 Profile sites
  – Harvested with customized crawler

CTSAsearch – version 2

 subclass           | count
--------------------+---------
 NonFacultyAcademic | 2592885
 FacultyMember      | 55499
 NonAcademic        | 15430
 Student            | 11074
 GraduateStudent    | 10951
 EmeritusFaculty    | 3096
 EmeritusProfessor  | 2072
 Postdoc            | 1410
 Librarian          | 264

CTSAsearch – architecture

• 1 VIVO-based SPARQL harvester
• 2(!) VIVO-based crawlers
• 1 Profiles-based crawler
• 2 Platform-specific HTML crawlers
• 1 CSV-based loader

CTSAsearch – architecture

CTSAsearch – current

• 45,456,417 VIVO-derived triples
• 48,569,115 Profiles-derived triples

Recent Work

• Cross-linkage across sites
  – Resolving ‘stubs’
  – Formation of a single ecosystem
• Macro concerns
  – Institution-scale analytics
  – Pondering reflection

Current “profile”

CTSAsearch/Polyglot – version x

• Temporary SPARQL endpoint:
  – http://marengo.info-science.uiowa.edu:2020
• Shared visualization widgets
  – Intended for embedding in institutional sites
• Community-wide sameAs assertions
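One way to use community-wide sameAs assertions is to collapse them into identity clusters, so that every profile IRI for the same person resolves to a single group. The sketch below uses a union-find structure for this. It is an illustration under assumptions, not the CTSAsearch implementation: the function name `same_as_clusters` is hypothetical, and the input is assumed to be a list of IRI pairs already extracted from `owl:sameAs` triples.

```python
def same_as_clusters(pairs):
    """Group IRIs connected by sameAs links into sorted clusters."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root, then walk to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two clusters

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)

    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])
```

Because sameAs is symmetric and transitive, a chain of pairwise assertions between three sites ends up as one three-member cluster, even if no single site asserts all of the links.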

Pattuelli’s Spectrum of Relationships (2012)

http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf

Pattuelli’s Spectrum of Relationships (2012)

RNTools

LinkedIn

http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf

Pattuelli’s Spectrum of Relationships (2012)

• Ontologies used
  – foaf (Friend of a Friend)
  – rel (Relationship)
  – mo (Music)
• Echoes of Trigg’s link taxonomy
  – Trigg, R. 1983. A Network-Based Approach to Text Handling for the Online Scientific Community. Ph.D. dissertation, Department of Computer Science, University of Maryland, technical report TR-1346

Connecting the Dots – Take 2

Figure courtesy of Melissa Haendel, OHSU

PubMed Central Open Access

• 886,172 papers (as of 1/1/15)
• 423,764 with acknowledgements
• 994,931 sentences
• 4,329,972 parses

The Simple Cases

• PMCID: 3008610
• SeqNum: 2
• SentNum: 6
• Sentence: EK analysed the data.
• POS: [EK/NNP, analysed/VBD, the/DT, data/NNS, ./.]
• Parse:
    [S
      [NP EK/NNP ]
      [VP analysed/VBD
        [NP the/DT data/NNS ]
      ] ./. ]
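To work with a bracketed parse like this one programmatically, it helps to load it into a tree. The following is a minimal sketch that assumes the exact surface format shown on the slide (whitespace-separated tokens, `[LABEL` opening a phrase, a bare `]` closing it, and `word/TAG` leaves). The function names and the tuple layout are choices made for this illustration, not part of the original pipeline.

```python
def parse_brackets(text):
    """Parse '[S [NP EK/NNP ] ... ]' into nested (label, children) tuples.

    Phrases become (label, [children]); leaves become (tag, word).
    """
    tokens = text.split()
    position = 0

    def read_phrase():
        nonlocal position
        label = tokens[position][1:]  # drop the leading '['
        position += 1
        children = []
        while tokens[position] != "]":
            if tokens[position].startswith("["):
                children.append(read_phrase())
            else:
                # Split on the last '/' so './.' yields word '.' and tag '.'
                word, _, tag = tokens[position].rpartition("/")
                children.append((tag, word))
                position += 1
        position += 1  # consume the closing ']'
        return (label, children)

    return read_phrase()


def leaf_words(node):
    """Return the words under a node in sentence order."""
    label, content = node
    if isinstance(content, list):
        return [word for child in content for word in leaf_words(child)]
    return [content]
```

A word that itself begins with `[` would confuse this reader; a production parser would need an escaping rule, which the slide does not describe.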

And the Not So Simple…

• PMCID: 4159542
• Sentence: We thank Sheila Harvey, Clinical Trials Unit Manager at ICNARC, and Ruth Canter, Trials Administrator at ICNARC, for their assistance in chasing completed surveys; Dr Kevin Gunning for early advice and project development; Drs Neill K. J. Adhikari and Gordon D. Rubenfeld for feedback and discussion of analysis plan; Dr Chris AKY Chong for his valuable comments on the initial draft of this manuscript; and our Responders: Addenbrooke’s Hospital ( Dr Kevin Gunning ), Airedale General Hospital ( Dr John Scriven ), Alexandra Hospital ( Dr Tracey Leach ), Arrowe Park Hospital ( Dr Lawrence Wilson ), Barnet Hospital ( Dr AH Wolff ), …
• The full sentence is 8,245 characters long

Extract Entities/Relationships with Syntactic Queries

• [S [NP:Author NN:Author ] [VP NN [NP:Person ] [PP ] , [PP ] ] ]
• S <1NP:Author <2[VP <1/thank/ <2(NP) <3(PP) ]
  – For the sentence having this pattern, match the object noun phrase and the next prepositional phrase
• NP <#2 <1(NNP) <2(NNP)
  – For the noun phrase, extract two proper nouns
• PP <#2 <1DT <2(NP)
  – For the prepositional phrase, match the noun phrase
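The idea behind these syntactic queries can be approximated in ordinary code: test the shape of the parse tree, then pull out the sub-phrases of interest. The sketch below hard-codes one "thank" pattern over a tree of nested tuples. It is not the query engine from the slides, and its reading of the operators (`<n` as "n-th child", `<#n` as "has n children") is an assumption. The example sentence and the name in it are made up for illustration.

```python
def is_phrase(node):
    """Phrases are (label, [children]); leaves are (tag, word)."""
    return isinstance(node[1], list)


def words(node):
    """Return the words under a node in sentence order."""
    if is_phrase(node):
        return [w for child in node[1] for w in words(child)]
    return [node[1]]


def match_thanks(tree):
    """Match S -> NP, VP(thank, NP(NNP NNP), PP ...) and extract the pieces."""
    label, kids = tree
    if label != "S" or len(kids) < 2:
        return None
    subject, vp = kids[0], kids[1]
    if subject[0] != "NP" or vp[0] != "VP" or not is_phrase(vp):
        return None
    if len(vp[1]) < 3:
        return None
    verb, obj, pp = vp[1][:3]
    if is_phrase(verb) or verb[1].lower() != "thank":
        return None
    if obj[0] != "NP" or pp[0] != "PP" or not (is_phrase(obj) and is_phrase(pp)):
        return None
    # The object NP must consist of exactly two proper nouns.
    if len(obj[1]) != 2 or any(tag != "NNP" for tag, _ in obj[1]):
        return None
    inner = [child for child in pp[1] if child[0] == "NP"]
    if not inner:
        return None
    return {"first": obj[1][0][1], "last": obj[1][1][1],
            "pp": " ".join(words(inner[0]))}


# Hypothetical sentence: "We thank Jane Doe for helpful discussions."
EXAMPLE = ("S", [
    ("NP", [("PRP", "We")]),
    ("VP", [("VBP", "thank"),
            ("NP", [("NNP", "Jane"), ("NNP", "Doe")]),
            ("PP", [("IN", "for"),
                    ("NP", [("JJ", "helpful"), ("NNS", "discussions")])])]),
    (".", "."),
])
```

A declarative pattern language, as on the slide, scales better than hand-written matchers like this once there are many patterns to maintain, which is presumably why the queries are expressed as data.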

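Person Results Snippet

 ID | Title | First Name | Middle Name | Last Name
----+-------+------------+-------------+-----------
 76 |       | Hans       |             | Matrin
 77 |       | Jeff       |             | Vieira
 78 |       | P.         |             | ZAMORE
 79 | Prof. | Eric       |             | Schon
 80 |       | Carlos     |             | Lois
 81 |       | Andrea     |             | Möll
 82 |       | Elena      |             | Govorkova
 83 |       | K.         | M.          | Pollard
 84 | Dr.   | Michael    |             | Berton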
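Filling the title, first, middle, and last name columns from an extracted noun phrase can be sketched with a simple positional rule. This is an illustration only: the title list and the function name `split_name` are assumptions, and the rule (first token, last token, everything between as middle) will mislabel multi-word surnames and other naming conventions.

```python
# Assumed, non-exhaustive list of honorifics, compared without trailing dots.
TITLES = {"dr", "drs", "prof", "professor", "mr", "mrs", "ms"}


def split_name(text):
    """Split a personal name into title, first, middle, and last parts."""
    tokens = text.split()
    title = ""
    if tokens and tokens[0].rstrip(".").lower() in TITLES:
        title = tokens.pop(0)
    first = tokens[0] if tokens else ""
    last = tokens[-1] if len(tokens) > 1 else ""
    middle = " ".join(tokens[1:-1])  # empty when there are fewer than 3 tokens
    return {"title": title, "first": first, "middle": middle, "last": last}
```

Applied to strings such as "Prof. Eric Schon" or "K. M. Pollard", this yields the same column split as the corresponding rows in the table.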

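Relationships for Person 77

 PMCID   | Category      | PP
---------+---------------+-----------------------------------------------------------
 4006053 | Support       | the kind gift of rKSHV.219
 4006053 | Support       | the kind gift of rKSHV.219 and for helpful discussions
 4006053 | Collaboration | helpful discussions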

Relationships for Person 79

 PMCID   | Category      | PP
---------+---------------+-------------------------------------
 2801706 | Resource      | the rabbit polyclonal antibody
 2801706 | Resource      | the ECFP and EYFP plasmids
 4013013 | Collaboration | his helpful advice and discussions

Category Frequencies

 Category              | Count
-----------------------+--------
 Collaboration         | 47,052
                       | 46,327
 Technique             | 33,598
 Resource              | 8,894
 Support               | 6,836
 Event                 | 3,744
 Project               | 854
 Place Name            | 229
 Publication Component | 210
 Place                 | 186
 Organization          | 93

Next Steps

• Continue slogging through extraction pattern definition
• Define patterns for
  – funding declarations
  – chairs, fellowships, etc.
• Merge data into CTSAsearch visualizations
• Align current category scheme with Melissa Haendel’s current draft ontology for CASRAI taxonomy and then merge with VIVO-ISF

In the Next Year

• Joint work with Melissa Haendel (OHSU) on administrative supplement to OHSU’s CTSA bridging RNs and NIH’s SciENcv
  – Map SciENcv data model to VIVO-ISF
  – Enable bi-directional data exchange
  – Integrate clinical/trial data sources
  – Integrate SciENcv, ORCID data into CTSAsearch
  – Multi-granularity search and visualization

Questions?

• Email: david-eichmann@uiowa.edu
