A Tale of Two Data Catalogs

Posted on 11-Aug-2014

Category: Data & Analytics

DESCRIPTION

This presentation describes two studies undertaken to build two separate data catalogs: the first for NIH-funded datasets and the second for institutional datasets created within an academic medical center. To inform the creation of an NIH data catalog, the first study aimed to a) develop a set of minimal metadata elements for describing datasets, and b) identify datasets in NIH-funded research articles that give no indication that their data has been shared in a data repository. This study served as the foundation for developing an index of all NIH-funded datasets and showed which repositories researchers use most often to share their data. The second study, spurred by the first, involved interviewing institutional faculty members and researchers to learn how they collect data, what challenges they face when collecting data, whether they have thought about sharing data, and what they would find most useful in an institutional data catalog. The results informed the workflows, metadata creation, and requirements for building a data catalog within the medical center. Interview responses were also used to inform the data services provided by the health sciences library, including education, research consultations, and clinical quality improvement initiatives. Both studies offer examples of how a librarian working in the health sciences can contribute to, and participate in, data-related services within their institution.

TRANSCRIPT

1

DATA CATALOGS

Table of Contents

I. NIH Data Discovery Index

• Methodology
• Findings
• Questions raised

II. Institutional Data Interviews

• Methodology
• Findings

III. Outcomes

• Benefits to the library

By: Charles Dickens KEVIN READ

2

It was the best of times…

3

NIH Big Data to Knowledge (BD2K)

Facilitating Broad Use of Biomedical Big Data

4

NIH Data Discovery Index

Datasets are CITABLE

Datasets are DISCOVERABLE

Datasets are LINKED TO THE LITERATURE

Datasets are PART OF THE RESEARCH ECOSYSTEM

NIH Data Sharing Repositories

http://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html

Searching for NIH-funded unidentified datasets in PubMed and PMC

6

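[Figure: filtering diagram that narrows "NIH-funded articles for 2011" down to "Remaining articles with unidentified datasets". Exclusion labels shown: Non-PMC Articles; Non-research Articles; Molecular Sequence Data MH; SI Field; XML. Counts shown: 113,089; 88,592; 78,901; 75,441; 71,913; 71,680; 69,857.]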
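The exclusion steps named on this slide can be read as a sequence of filters applied one after another. The following is only a minimal sketch of that idea, not the method actually used in the study: the record fields (`in_pmc`, `is_research`, `mesh`, `si`, `xml_repo_keywords`) and the sample records are hypothetical, and the order of the steps is an assumption.

```python
from typing import Callable

# Each step pairs a label from the slide with a predicate that returns True
# when an article should be EXCLUDED. All field names are hypothetical.
EXCLUSION_STEPS: list[tuple[str, Callable[[dict], bool]]] = [
    ("Non-PMC Articles", lambda a: not a.get("in_pmc", False)),
    ("Non-research Articles", lambda a: not a.get("is_research", False)),
    ("Molecular Sequence Data MH",
     lambda a: "Molecular Sequence Data" in a.get("mesh", [])),
    ("SI Field", lambda a: bool(a.get("si"))),
    ("XML", lambda a: bool(a.get("xml_repo_keywords"))),
]


def run_funnel(articles: list[dict]) -> tuple[list[dict], list[tuple[str, int]]]:
    """Apply each exclusion step in order and record how many articles remain."""
    remaining = list(articles)
    counts = [("start", len(remaining))]
    for label, should_exclude in EXCLUSION_STEPS:
        remaining = [a for a in remaining if not should_exclude(a)]
        counts.append((label, len(remaining)))
    return remaining, counts


# Made-up records: one article is removed at each step, one survives.
SAMPLE = [
    {"pmid": "1", "in_pmc": False, "is_research": True},
    {"pmid": "2", "in_pmc": True, "is_research": False},
    {"pmid": "3", "in_pmc": True, "is_research": True,
     "mesh": ["Molecular Sequence Data"]},
    {"pmid": "4", "in_pmc": True, "is_research": True, "si": ["placeholder"]},
    {"pmid": "5", "in_pmc": True, "is_research": True,
     "xml_repo_keywords": ["PDB"]},
    {"pmid": "6", "in_pmc": True, "is_research": True},
]

if __name__ == "__main__":
    left, step_counts = run_funnel(SAMPLE)
    for label, count in step_counts:
        print(f"{label}: {count} remaining")
```

Articles that survive every step correspond to the "remaining articles with unidentified datasets" on the slide, which were the candidates for manual review.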

7

[Figure: bar chart "Excluded Articles", axis 0 to 1,600. Labels shown: PMC Acknowledgements; SI Field. Repositories: ClinicalTrials.gov, PDB, GEO, GenBank, PubChem, RefSeq, ISRCTN, OMIM.]

8

9

PMC Acknowledgements

[Figure: bar chart "Excluded keywords", axis 0 to 800. Repositories: PDB, ClinicalTrials.gov, GenBank, GEO, IRD, MGI, DIP, FlyBase, dbGaP, SRA, WormBase, MPD, NURSA, RGD, ICPSR, VectorBase.]

10

XML Keyword

[Figure: bar chart "Excluded keywords", axis 0 to 600. Repositories: GenBank, PDB, GEO, dbSNP, ClinicalTrials.gov, RGD, FlyBase, SRA, DIP, dbGaP, WormBase, MGI, BioGRID, VectorBase, Multiple Keywords.]

FlyBase:GeneNetwork:Mouse Genome Informatics:Neuroscience Information Framework:Rat Genome Database:WormBase:Zebrafish Model Organism Database

GenBank:PDB

NIH-sponsored data repositories now added to PubMed and PMC search indexes

11

383

What category of dataset was used for the research described in the article?

Were live human or animal subjects used in the collection of the data?

What were the subject(s) of study (from which or whom the data was collected)?

If new dataset(s) were created, what type(s) of data were collected?

What existing dataset(s), if any, were used?

How many datasets are there in each article?
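The annotation questions above map naturally onto a small record structure. The sketch below is illustrative only: the class and field names are hypothetical and do not come from the study's actual annotation form, and the example values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetAnnotation:
    """One dataset found in an article (hypothetical field names)."""
    category: str                      # category of dataset used
    live_subjects: bool                # live human or animal subjects used?
    subjects: list[str]                # subject(s) of study
    new_data_types: list[str] = field(default_factory=list)     # if new data was created
    existing_datasets: list[str] = field(default_factory=list)  # if existing data was reused


@dataclass
class ArticleAnnotation:
    """All datasets an annotator identified in one article."""
    article_id: str
    datasets: list[DatasetAnnotation] = field(default_factory=list)

    @property
    def dataset_count(self) -> int:
        # Answers "How many datasets are there in each article?"
        return len(self.datasets)


# Invented example record, for illustration only.
example = ArticleAnnotation(
    article_id="example-article",
    datasets=[
        DatasetAnnotation(
            category="new",
            live_subjects=True,
            subjects=["mice"],
            new_data_types=["Physiological"],
        ),
        DatasetAnnotation(
            category="new",
            live_subjects=False,
            subjects=["tissue samples"],
            new_data_types=["Image"],
        ),
    ],
)

if __name__ == "__main__":
    print(example.article_id, "has", example.dataset_count, "datasets")
```

Keeping one record per dataset, rather than one per article, makes the per-article count a derived value instead of something an annotator has to enter separately.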

12

13

Measuring blood pressure in mice

Measuring left hemisphere of brain for growth factor

Staining and imaging

Analysis of images using software

Results

14

Average number of datasets per article:

2.92

15

% of datasets that use live subjects

54%

Human 51%

Animal 49%

16

% of new data

87%

17

% of data created using pre-existing datasets

13%
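Summary figures like the ones on these slides (average datasets per article, share of datasets using live subjects, share of new versus pre-existing data) are simple ratios over the annotated datasets. The sketch below shows the arithmetic on invented records; the keys `live_subjects` and `is_new` are hypothetical, and the toy numbers are not the study's data.

```python
def summarize(articles: list[list[dict]]) -> dict:
    """Compute summary ratios.

    `articles` is a list of articles, each given as a list of dataset dicts
    with boolean 'live_subjects' and 'is_new' keys (hypothetical names).
    """
    if not articles:
        raise ValueError("need at least one annotated article")
    datasets = [d for article in articles for d in article]
    total = len(datasets)
    if total == 0:
        raise ValueError("need at least one annotated dataset")
    live = sum(1 for d in datasets if d["live_subjects"])
    new = sum(1 for d in datasets if d["is_new"])
    return {
        "avg_datasets_per_article": round(total / len(articles), 2),
        "pct_live_subjects": round(100 * live / total),
        "pct_new_data": round(100 * new / total),
        "pct_pre_existing": round(100 * (total - new) / total),
    }


# Toy input: two articles with three datasets and one dataset respectively.
TOY = [
    [
        {"live_subjects": True, "is_new": True},
        {"live_subjects": True, "is_new": True},
        {"live_subjects": False, "is_new": True},
    ],
    [
        {"live_subjects": False, "is_new": False},
    ],
]

if __name__ == "__main__":
    print(summarize(TOY))
```

Note that the percentages are taken over datasets while the average is taken over articles, so the two denominators differ.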

18

It was the worst of times…

Data Types

19

Image

Genetic or Genomic

Chemical

Biochemical

Electrical (Electrophysiological)

Optical – non-image

Behavioral

Computational Simulation or model

Magnetic Resonance – non-image

Structural

Physiological

Questionnaire/Survey

Clinical Measures

Geospatial

INSUFFICIENT

Inter-rater Reliability:

20

[Figure: chart "Total number of datasets found per 25 articles", axis 0 to 800. Series: Total # of datasets (High); Total # of datasets (Low).]

43%

How do we define a data set?

21

Dataset

How do we define a data set?

22

Datasets

How do we define a data set?

23

Datasets

Where in the collection/processing pipeline should data be described?

24

Book of the Second

Understanding institutional data challenges via interviews

26

Institutional Data Catalog

• Organize and describe institutional research data

• Promote collaboration within the institution

• Promote a culture of sharing and transparency

27

Methodology

• Literature review
• Identified researchers/PIs using active grant system
• Analyzed datasets in researcher papers before interviews
– Used NIH Data Discovery Index method

Understand your researchers

BASIC SCIENCE RESEARCHERS

CLINICAL RESEARCHERS

Data Interviews

Challenges Organizing Data – Basic Science Researchers

• Postdocs or students leave with data
• Lack of standards/procedures
• Size of data
• Messiness/Disconnect between datasets
• Too challenging

[Figure: bar chart of the categories above, axis 0 to 7.]

Data Interviews

Challenges Preserving Data – Basic Science Researchers

• Storage expense
• Changes in software
• Lack of IT resources
• Lack of preservation procedures (readme, plans, postdoc etc.)
• Data in multiple storage locations
• Storage space

[Figure: bar chart of the categories above, axis 0 to 6.]

Data Interviews

Challenges Organizing Data – Clinical Researchers

• Data quality
• Messiness/Disconnect between datasets
• Poor data output formats
• Can't search data
• Data loss
• Team miscommunication on who's using data

[Figure: bar chart of the categories above, axis 0 to 6.]

Data Interviews

Experience with Data Sharing

• Collaboration only
• unknown parties
• data repository
• general public
• primary results only
• Do not share

[Figure: bar chart of the categories above, with series Basic Science and Clinical, axis 0 to 9.]

33

Only the best of times…

How the library benefitted from this exercise

34

Identified group to pilot institutional data catalog – Population Health

35

Acquired new opportunities for teaching data management

36

Developing a lab tool for basic scientists to manage metadata

37

Developed a better understanding of researcher needs and challenges

38

Acknowledgements

BD2K Project

• Lou Knecht, Jim Mork, Kathel Dunn, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Dr. Donald Lindberg

Annotators

• Preeti Kochar, Helen Ochej, Susan Schmidt, Melissa Yorks, Shari Mohary, Olga Printseva, Janice Ward, Oleg Rodionov, Sally Davidson, Jennie Larkin, Peter Lyster, Matt McAuliffe, Greg Farber, Betsy Humphreys, Jerry Sheehan, Mike Huerta, Lou Knecht, Suzy Roy, Swapna Abhyankar, Olivier Bodenreider, Karen Gutzman, Dina Demner Fusman, Laritza Rodriguez, Sonya Shooshan, Samantha Tate, Matthew Simpson, Tracy Edinger, Olubumi Akiwumi, Mary Ann Hantakas, Corinn Sinnott

40

Images

• Ponderings for All Things Blog. 2010. Available from: http://ponderingsofallthings.blogspot.com/2010/05/tale-of-two-cities-charles-dickens.html
• Reading Charles Dickens Blog. Manette in Bastille. 2012. Available from: http://readingcharlesdickens.com/wp-content/uploads/2012/07/Manette-in-Bastille-253x300.jpg
• Grandma’s Graphics. Old Scrooge sat busy in his counting-house. 2000. Available from: http://www.grandmasgraphics.com/graphics/childrens/childrens379_2000.jpg
• Sungardas Blog. Apple to Orange. 2010. Available from: http://blog.sungardas.com/wp-content/uploads/Apple-to-Orange.jpg
• Patel R. Questions?. Flickr. 2007. Available from: https://www.flickr.com/photos/23679420@N00/545653437/
• Biomedical Engineering Laboratory. Wikimedia. 2012. Available from: http://upload.wikimedia.org/wikipedia/commons/a/a3/Biomedical_Engineering_Laboratory.jpg

41

kevin.read@nyumc.org Questions?
