The Pragmatics of Ontology and Heterogeneous Data Sources
The Ins and Outs of CTSAsearch
David Eichmann
School of Library and Information Science
University of Iowa
Research Networking
• Programmatic support for discovery and use of research and scholarly information regarding people and resources.
• Research networking systems are essentially special-purpose institutional knowledge management systems.
Representative RN Systems
• Profiles (Harvard)
• VIVO (VIVO Consortium)
• Loki (Iowa)
• SciVal Experts (aka Pure – Elsevier)
• A number of others
Why Bother with VIVO (the ontology)?
• Words in a profile are just sequences of characters carrying no meaning
  – Try asking Google Scholar what grant funded a given hit…
• With structure and relationship comes meaning, aka semantics
  – Enter the Semantic Web!
Connecting the Dots
• The real challenge here is translating information that already exists in scattered sources
  – Research networking tools
  – Citation databases (e.g., PubMed)
  – Award databases (e.g., NIH RePORTER)
  – Curated archives (e.g., GenBank)
  – Locked up in text (the research literature)
CTSAsearch – version 1
• 10 SPARQL endpoints
• 19 institutions
• 124,945 individuals
• Handling the queries proved challenging for some sites
CTSAsearch – version 1
      subclass      |  count
--------------------+---------
 NonFacultyAcademic | 2592383
 FacultyMember      |   26826
 NonAcademic        |   15268
 EmeritusFaculty    |    2134
 EmeritusProfessor  |    2070
 Postdoc            |    1226
 Librarian          |     232
 Student            |      89
 GraduateStudent    |      71
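A subclass/count breakdown like the one above amounts to grouping individuals by their asserted type. The following is a minimal sketch of that grouping over an in-memory list of triples, not the query CTSAsearch actually ran. The `person:` identifiers and the sample data are hypothetical, and the `vivo:` and `rdf:` prefixes are used only as abbreviations.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Hypothetical sample triples (subject, predicate, object); not CTSAsearch data.
triples = [
    ("person:1", RDF_TYPE, "vivo:FacultyMember"),
    ("person:2", RDF_TYPE, "vivo:FacultyMember"),
    ("person:3", RDF_TYPE, "vivo:Postdoc"),
    ("person:3", "rdfs:label", "Example, Pat"),
    ("person:4", RDF_TYPE, "vivo:Librarian"),
]


def count_by_class(triples):
    """Count distinct subjects per asserted class, most frequent first."""
    members = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            # A set avoids double-counting a subject typed twice with the same class.
            members.setdefault(obj, set()).add(subject)
    counts = Counter({cls: len(subjects) for cls, subjects in members.items()})
    return counts.most_common()


for cls, n in count_by_class(triples):
    print(f"{cls:<20} | {n:>7}")
```

Against a live endpoint the same grouping would normally be expressed as a SPARQL aggregate; the in-memory version only illustrates the shape of the computation.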
CTSAsearch – version 2
• 10 SPARQL endpoints (19 institutions)
• 15 VIVO sites
  – Harvested with customized crawler
• 14 Profile sites
  – Harvested with customized crawler
CTSAsearch – version 2
      subclass      |  count
--------------------+---------
 NonFacultyAcademic | 2592885
 FacultyMember      |   55499
 NonAcademic        |   15430
 Student            |   11074
 GraduateStudent    |   10951
 EmeritusFaculty    |    3096
 EmeritusProfessor  |    2072
 Postdoc            |    1410
 Librarian          |     264
CTSAsearch – architecture
• 1 VIVO-based SPARQL harvester
• 2(!) VIVO-based crawlers
• 1 Profiles-based crawler
• 2 Platform-specific HTML crawlers
• 1 CSV-based loader
CTSAsearch – current
• 45,456,417 VIVO-derived triples
• 48,569,115 Profiles-derived triples
Recent Work
• Cross-linkage across sites
  – Resolving ‘stubs’
  – Formation of a single ecosystem
• Macro concerns
  – Institution-scale analytics
  – Pondering reflection
Current “profile”
CTSAsearch/Polyglot – version x
• Temporary SPARQL endpoint:
  – http://marengo.info-science.uiowa.edu:2020
• Shared visualization widgets
  – Intended for embedding in institutional sites
• Community-wide sameAs assertions
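Community-wide sameAs assertions imply equivalence classes: if profile A is the same as B and B is the same as C, all three describe one person. One common way to compute those classes is union-find. The sketch below assumes that approach, which the slides do not confirm, and every URI in it is a made-up placeholder.

```python
def same_as_clusters(pairs):
    """Group URIs into the equivalence classes implied by sameAs pairs."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving keeps chains short
            uri = parent[uri]
        return uri

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two classes

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


# Hypothetical sameAs assertions between profile URIs at different sites.
pairs = [
    ("http://vivo.example.edu/individual/n123", "http://profiles.example.org/profile/456"),
    ("http://profiles.example.org/profile/456", "http://loki.example.edu/person/789"),
    ("http://vivo.example.edu/individual/n999", "http://profiles.example.org/profile/111"),
]

clusters = same_as_clusters(pairs)
for cluster in clusters:
    print(sorted(cluster))
```

Treating sameAs as symmetric and transitive is what turns pairwise links between sites into a single cross-site identity.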
Pattuelli’s Spectrum of Relationships (2012)
http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf
Pattuelli’s Spectrum of Relationships (2012)
RNTools
http://www.oclc.org/content/dam/research/grants/reports/2012/pattuelli2012.pdf
Pattuelli’s Spectrum of Relationships (2012)
• Ontologies used
  – foaf (Friend of a Friend)
  – rel (Relationship)
  – mo (Music)
• Echoes of Trigg’s link taxonomy
  – Trigg, R. 1983. A Network-Based Approach to Text Handling for the Online Scientific Community. Ph.D. dissertation, Department of Computer Science, University of Maryland, technical report TR-1346
Connecting the Dots – Take 2
Figure courtesy of Melissa Haendel, OHSU
PubMed Central Open Access
• 886,172 papers (as of 1/1/15)
• 423,764 with acknowledgements
• 994,931 sentences
• 4,329,972 parses
The Simple Cases
• PMCID: 3008610
• SeqNum: 2
• SentNum: 6
• Sentence: EK analysed the data.
• POS: [EK/NNP, analysed/VBD, the/DT, data/NNS, ./.]
• Parse:
  [S
    [NP EK/NNP ]
    [VP analysed/VBD
      [NP the/DT data/NNS ]
    ] ./. ]
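The bracketed parse above is easy to load into a tree for further processing. The sketch below is one possible reader for that notation, with the representation chosen here as an assumption: internal nodes become `(label, [children])` and tagged tokens such as `EK/NNP` become `(word, tag)` pairs. It is not the parser used to produce the slides' data.

```python
def parse_brackets(text):
    """Parse '[LABEL child ... ]' notation into (label, [children]) tuples.

    Leaves written as 'word/TAG' become (word, tag) pairs.
    """
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # skip the opening '['
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                # Split on the last '/', so './.' yields ('.', '.').
                word, _, tag = tokens[pos].rpartition("/")
                children.append((word, tag))
                pos += 1
        pos += 1  # skip the closing ']'
        return (label, children)

    return node()


tree = parse_brackets(
    "[S [NP EK/NNP ] [VP analysed/VBD [NP the/DT data/NNS ] ] ./. ]"
)
print(tree)
```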
And the Not So Simple…
• PMCID: 4159542
• Sentence: We thank Sheila Harvey, Clinical Trials Unit Manager at ICNARC, and Ruth Canter, Trials Administrator at ICNARC, for their assistance in chasing completed surveys; Dr Kevin Gunning for early advice and project development; Drs Neill K. J. Adhikari and Gordon D. Rubenfeld for feedback and discussion of analysis plan; Dr Chris AKY Chong for his valuable comments on the initial draft of this manuscript; and our Responders: Addenbrooke’s Hospital ( Dr Kevin Gunning ), Airedale General Hospital ( Dr John Scriven ), Alexandra Hospital ( Dr Tracey Leach ), Arrowe Park Hospital ( Dr Lawrence Wilson ), Barnet Hospital ( Dr AH Wolff ), …
• The full sentence is 8,245 characters long
Extract Entities/Relationships with Syntactic Queries
• [S [NP:Author NN:Author ] [VP NN [NP:Person ] [PP ] , [PP ] ] ]
• S <1NP:Author <2[VP <1/thank/ <2(NP) <3(PP) ]
  – For the sentence having this pattern, match the object noun phrase and the next prepositional phrase
• NP <#2 <1(NNP) <2(NNP)
  – For the noun phrase, extract two proper nouns
• PP <#2 <1DT <2(NP)
  – For the prepositional phrase, match the noun phrase
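The queries above are written in the project's own pattern language, which is not reproduced here. As a rough illustration of the same idea in plain Python, the sketch below walks a hand-built tree and looks for a verb phrase of the form "thank NP PP", returning the proper nouns of the NP and the noun phrase inside the PP. The tree encoding, the function names, and the example sentence are all assumptions made for illustration.

```python
def is_node(item):
    """Internal nodes are (label, [children]); leaves are (word, tag)."""
    return isinstance(item[1], list)


def words(item):
    """Flatten a node or leaf into its list of words."""
    if not is_node(item):
        return [item[0]]
    return [w for child in item[1] for w in words(child)]


def match_thanks(sentence):
    """Return (person, reason) when the VP is 'thank NP PP', else None."""
    label, children = sentence
    if label != "S":
        return None
    for child in children:
        if not (is_node(child) and child[0] == "VP"):
            continue
        parts = child[1]
        if (len(parts) >= 3
                and not is_node(parts[0]) and parts[0][0].lower() == "thank"
                and is_node(parts[1]) and parts[1][0] == "NP"
                and is_node(parts[2]) and parts[2][0] == "PP"):
            # Proper nouns of the object NP form the person's name.
            person = " ".join(w for w, tag in parts[1][1] if tag == "NNP")
            # The noun phrase(s) inside the PP describe what is acknowledged.
            reason = " ".join(
                w for c in parts[2][1] if is_node(c) for w in words(c)
            )
            return (person, reason)
    return None


# Hypothetical sentence: "We thank Jane Doe for the data."
sentence = ("S", [
    ("NP", [("We", "PRP")]),
    ("VP", [
        ("thank", "VBP"),
        ("NP", [("Jane", "NNP"), ("Doe", "NNP")]),
        ("PP", [("for", "IN"), ("NP", [("the", "DT"), ("data", "NNS")])]),
    ]),
    (".", "."),
])

print(match_thanks(sentence))
```

A declarative pattern language keeps each rule to a line or two, which matters when many patterns are needed to cover the variety of acknowledgement phrasing.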
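Person Results Snippet
 ID | Title | First Name | Middle Name | Last Name
----+-------+------------+-------------+-----------
 76 |       | Hans       |             | Matrin
 77 |       | Jeff       |             | Vieira
 78 |       | P.         |             | ZAMORE
 79 | Prof. | Eric       |             | Schon
 80 |       | Carlos     |             | Lois
 81 |       | Andrea     |             | Möll
 82 |       | Elena      |             | Govorkova
 83 |       | K.         | M.          | Pollard
 84 | Dr.   | Michael    |             | Berton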
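Relationships for Person 77
  PMCID  |   Category    | PP
---------+---------------+----------------------------------------------------------------
 4006053 | Support       | the kind gift of rKSHV.219
 4006053 | Support       | the kind gift of rKSHV.219 and for helpful discussions
 4006053 | Collaboration | helpful discussions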
Relationships for Person 79
  PMCID  |   Category    | PP
---------+---------------+------------------------------------
 2801706 | Resource      | the rabbit polyclonal antibody
 2801706 | Resource      | the ECFP and EYFP plasmids
 4013013 | Collaboration | his helpful advice and discussions
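The slides do not say how a prepositional phrase is assigned its category. Purely as an illustration, the sketch below assumes an ordered list of keyword rules in which the first match wins. The `RULES` table is hypothetical and was chosen only so that it reproduces the rows shown for persons 77 and 79.

```python
# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("Support", ("gift",)),
    ("Resource", ("antibody", "plasmid")),
    ("Collaboration", ("discussion", "advice")),
]


def categorize(phrase):
    """Return the first category whose keyword occurs in the phrase, else None."""
    text = phrase.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return None


# Phrases taken from the relationship rows for persons 77 and 79.
for phrase in [
    "the kind gift of rKSHV.219",
    "the kind gift of rKSHV.219 and for helpful discussions",
    "helpful discussions",
    "the rabbit polyclonal antibody",
    "the ECFP and EYFP plasmids",
    "his helpful advice and discussions",
]:
    print(f"{categorize(phrase):<13} | {phrase}")
```

Rule order matters here: the phrase that mentions both a gift and discussions falls under Support only because that rule is checked first.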
Category Frequencies
       Category        | Count
-----------------------+--------
 Collaboration         | 47,052
                       | 46,327
 Technique             | 33,598
 Resource              |  8,894
 Support               |  6,836
 Event                 |  3,744
 Project               |    854
 Place Name            |    229
 Publication Component |    210
 Place                 |    186
 Organization          |     93
Next Steps
• Continue slogging through extraction pattern definition
• Define patterns for
  – funding declarations
  – chairs, fellowships, etc.
• Merge data into CTSAsearch visualizations
• Align current category scheme with Melissa Haendel’s current draft ontology for CASRAI taxonomy and then merge with VIVO-ISF
In the Next Year
• Joint work with Melissa Haendel (OHSU) on administrative supplement to OHSU’s CTSA bridging RNs and NIH’s SciENcv
  – Map SciENcv data model to VIVO-ISF
  – Enable bi-directional data exchange
  – Integrate clinical/trial data sources
  – Integrate SciENcv, ORCID data into CTSAsearch
  – Multi-granularity search and visualization