RDAP 15: Research Data Integration in the Purdue Libraries
Research Data Integration in the Purdue Libraries
Lisa Zilinski, Carnegie Mellon University
Amy Barton, Purdue University
Tao Zhang, Purdue University
Line Pouchard, Purdue University
Pete Pascuzzi, Purdue University
April 22, 2015, Minneapolis, MN
Data @ Purdue Libraries
The panel
• Amy Barton, Metadata Specialist
• Tao Zhang, @jimmieego, Digital User Experience Specialist
• Line Pouchard, @linepouchard, Computational Science Information Specialist
• Pete Pascuzzi, Molecular Biosciences Information Specialist
Moderator
• Lisa Zilinski, @l_zilinski, Carnegie Mellon University, Data Services Librarian
Other players
• Liaison Librarians
• Geographic Information Systems (GIS) Specialist
• Information Literacy Specialist
• Scholarly Communication Specialist
• Digital Data Repository Specialist
• Data Specialist
Village People 1979 Photograph: Lynn Goldsmith/Corbis
Today’s discussion
• How does metadata interface with data in the context of a data repository?
• How do users perceive and experience data management tools and resources?
• To what extent are issues in Big Data curation different from “small” data curation?
• How can a subject-specialist incorporate data into liaison responsibilities?
METADATA DOCUMENT, INDEX, DISCOVER, ACCESS PURDUE UNIVERSITY LIBRARIES
Amy Barton Assistant Professor of Library Science, Metadata Specialist
OBJECTIVES WHAT I’D LIKE TO SHARE WITH YOU…
• My role and responsibilities in the Purdue Libraries
• Two examples of metadata intersecting with Research Data @ Purdue Libraries
• Conceptual model representing how libraries and domain expertise interplay in a research project
METADATA SPECIALIST ROLE REPRESENTATION
Services
Research Education
My Background
• B.A. in English & Communication, MLS Technology Track, Indiana University
• Entrepreneurial position, Metadata Specialist in the context of Research Data
Data
Research:
• Metadata development for a data repository
• Metadata application in research projects
Services:
• Metadata consultation
• Metadata development
Education:
• Graduate student lectures, educational materials, workshops
METADATA SPECIALIST ENGAGEMENT
Participate in national and international discussions about the emerging and dynamic role of metadata in providing access to information resources. For example: • DataCite Metadata Working Group • Research Data Alliance Metadata Directory Working Group • MetaArchive Metadata Working Group (Co-chair)
THE PURDUE UNIVERSITY RESEARCH REPOSITORY A BRIEF OVERVIEW:
The Purdue University Research Repository (PURR) is a research collaboration and data management solution for Purdue researchers and their collaborators.
Dedicated data repository
Data management plan support
• Funded projects with PIs from Purdue
  - 100 GB
  - 10 years or life of grant
• Just trying things out, or don't need much space
  - 10 GB
  - 3 years
Project collaboration & project management
Publication of and access to datasets with unique Digital Object Identifier (DOI)
DataCite DOI
METADATA DEVELOPMENT METADATA STANDARDS FOR PURR
• Metadata Encoding and Transmission Standard (METS)
  - Wrapper
• DCMI Metadata Terms (dcterms)
  - Descriptive metadata
  - User-contributed → Metadata consultation/assignment
• Metadata Object Description Schema (MODS)
  - Dataset producer contact information
  - Access condition → embargoed or publicly available
• Preservation Metadata: Implementation Strategies (PREMIS)
  - Preservation metadata - MetaArchive
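To make the "wrapper" role of METS concrete, here is a minimal Python sketch that places dcterms descriptive metadata inside a METS descriptive-metadata section. It is an illustration only, not PURR's actual implementation: the function name, the dataset identifier, and the sample values are hypothetical, and the MODS, PREMIS, and structural-map sections that a complete record would carry are left out.

```python
import xml.etree.ElementTree as ET

# Published namespace URIs for METS and DCMI Metadata Terms.
METS_NS = "http://www.loc.gov/METS/"
DCTERMS_NS = "http://purl.org/dc/terms/"

ET.register_namespace("mets", METS_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_mets_record(dataset_id, descriptive):
    """Wrap dcterms descriptive metadata in a minimal METS document.

    Hypothetical helper for illustration; a real record would also need
    a structMap and, in this stack, MODS and PREMIS sections.
    """
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": dataset_id})
    dmd_sec = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": "dmd1"})
    md_wrap = ET.SubElement(dmd_sec, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "DC"})
    xml_data = ET.SubElement(md_wrap, f"{{{METS_NS}}}xmlData")
    for term, value in descriptive.items():
        element = ET.SubElement(xml_data, f"{{{DCTERMS_NS}}}{term}")
        element.text = value
    return mets


# Sample values are invented for the example.
record = build_mets_record(
    "example-dataset-001",
    {
        "title": "Example dataset",
        "creator": "Example Researcher",
        "accessRights": "publicly available",
    },
)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The descriptive terms sit inside `mets:xmlData`, so the same wrapper could carry other schemas in additional sections without changing the outer structure.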
IMPORTANCE OF GOOD METADATA IMPORTANCE OF GOOD DESCRIPTIVE METADATA
DATA DISCOVERABILITY, VALIDATION, REPLICATION & REUSE… PUBLICATION – DATA LINKING, CITATION & IMPACT FACTOR!
PRESERVATION & DISASTER RECOVERY
METADATA IN THE CONTEXT OF RESEARCH HOW DISCIPLINARY & LIBRARY EXPERTISE INTERPLAY IN RESEARCH
Amnesty International Urgent Action Bulletin Project • Digitization & digital collection development • Research dataset development • Research database development
“As a record of real-time information about human rights violations and advocacy across the globe, the documents have an unusually strong public interest component.”
Ann Clark
CONTROLLED VOCABULARY COMBINED 3 CONTROLLED VOCABULARIES FOR CODING & DESCRIPTIVE METADATA
http://www.huridocs.org/resource/micro-thesauri/
http://www2.witness.org/vocab/topics/
http://uhri.ohchr.org/search/guide
RESEARCH DATA NVIVO & CODING
We are developing our data collaboration process in both a library science and a social science context, which both enables qualitative use of the documents and also embeds theoretically and practically informed coding that can later help the researcher build both numeric and text-based data using the digital collections.
METADATA-CENTRIC DEVELOPMENT OF MERGED MASTER METADATA TEMPLATE
Controlled Vocabulary
NVivo Classifications
Digitization Control Sheet
RESEARCH DATABASE SOLR DATABASE FOR FACETED SEARCHING
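As a rough illustration of what faceted searching against a Solr index involves, the Python sketch below builds a request URL using Solr's standard `q`, `fq`, `facet`, and `facet.field` query parameters. The base URL, core name, and field names (`country`, `year`) are assumptions made up for this example; the slides do not describe the project's actual schema.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_facet_query(base_url, text, facet_fields, filters=None):
    """Build a Solr select URL that requests facet counts.

    base_url, field names, and filters are hypothetical placeholders.
    Filter values are quoted naively; values containing double quotes
    would need escaping.
    """
    params = [("q", text), ("rows", "10"), ("wt", "json"), ("facet", "true")]
    for field in facet_fields:
        # One facet.field parameter per field to count.
        params.append(("facet.field", field))
    for field, value in (filters or {}).items():
        # fq narrows the result set without affecting relevance scoring.
        params.append(("fq", f'{field}:"{value}"'))
    return f"{base_url}/select?{urlencode(params)}"


url = build_facet_query(
    "http://localhost:8983/solr/urgent_actions",  # hypothetical core
    "torture",
    ["country", "year"],
    {"country": "Chile"},
)
query = parse_qs(urlsplit(url).query)
print(url)
```

Selecting a facet value in a search interface usually amounts to adding another `fq` filter and re-running the same query.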
Components:
• Conceptualization
• Data collection
• Data processing
• Collaboration
• Data product
• Data curation
[Diagram labels: Collection Funded Project; Scholarly Collaboration Cloud; Data Processing; Digital Programs; Data Management & Curation Expertise; Data Management; Metadata Development; Research Database; Dataset Publication; Scholarly Dissemination; Digital Collection, Preservation & Curation]
A Draft Conceptual Model for Libraries Expertise Conjoining with Domain Expertise to Apply Active Research to Produce Research Data
Color code:
• Light blue = Research Channel (throughput)
• Dark blue = Domain expertise & library expertise
• Lavender = Library services
• Pink = The collaboration of domain expertise, library science expertise, and library services…
Tao Zhang Digital User Experience Specialist Purdue University Libraries [email protected]
User Experience Design and Research for Data Services
• M.S. in Human-Computer Interaction, Ph.D. in Human Factors • User-centered design and research for Purdue Libraries website
MY BACKGROUND
Services
Research Education
Data
Research:
• User research and analytics
• Information architecture
• User experience of data tools
Services:
• User experience design
• User evaluations
UX: All aspects of an end-user’s interaction with a system or service
• Task performance
• Usability
• Satisfaction
• Engagement
UX Design: Make things work in ways that people enjoy
• Interface
• Workflow
UX Research:
• Understand the user
• Evaluate systems with the user
USER EXPERIENCE = UX
UX & DATA SERVICES
[Diagram: the conceptual model shown earlier (Collection Funded Project; Scholarly Collaboration Cloud; Data Processing; Digital Programs; Metadata Development; Research Database; Dataset Publication; Scholarly Dissemination; Digital Collection, Preservation & Curation; Data Management & Curation Expertise; Data Management), together with the labels Data Curation Profiles (DCP) Toolkit and Data Management Planning (DMP) Tool]
INTEGRATING UX INTO DATA SERVICES
• User study: Researchers’ workflow and data needs
• Assessment: Usability and user acceptance of data curation/management tools
• User-Centered Design: Inputs from stakeholders and users, use case planning, and iterative design
PROJECTS
• Assessing the Data Curation Profiles (DCP) Toolkit
• Designing the interface of Data Management Planning (DMP) Tool
The DCP Toolkit Engaging researchers in discussion about data:
• Interview protocol • Capture information about dataset across lifecycle • Explore how data are used and managed • Identify data curation needs • DCP profiles generated from interview
Preparation, Interviews, Constructing DCP
ASSESSING DCP TOOLKIT
Source: Davis (1989) Perceived Usability
ASSESSMENT METHOD
Technology Acceptance Model
Survey measured: • 28 external variables • Perceived usefulness and ease of use • Intention to use • Open-ended user feedback
RESULTS
Data analysis
• Likert ratings of external variables -> Exploratory Factor Analysis -> Regression Analysis
• Qualitative analysis of open-ended responses
Factors
Applicability, Time, Complexity, Experience and Share, Training and Help, Extensibility, Interviewee Requirements
RESULTS
Perceived Usefulness affected by: + Applicability + Experience and Share + Training and Help
Perceived Ease of Use affected by:
+ Applicability - Complexity, Interviewee Requirements
Intention to Use affected by:
+ Applicability, Training and Help, Extensibility - Time, Complexity
RESULTS
Finding the right balance of UX and utility • Time required for librarians and researchers • Depth of information in DCP
Applicability
• Adapting structure and format to contexts • Making decisions based on results
Extending the DCPT
• Compact, lite version • Focus on particular data types and fields • Community building based on DCPs
UX DESIGN FOR DMPTOOL V2
Step-by-step wizard for generating DMP
Create | edit | re-use | share | save | generate
Open to community
Links to institutional resources
Directorate information & updates
http://dmptool.org
DMPTOOL 2
Everything the DMPTool does, plus … • Plan co-ownership
• Self-service administrative functions – Create and edit DMP templates
– Customize guidance and resource links
– Maintain individual and institutional profiles • Refactored UI
• Optional plan review for lifecycle management
• And more!
EXPANDED USER ROLES
Dashboard and privileges personalized to role
• Owner
• Co-owner
• Institutional reviewer
• Institutional administrator
• …
PROVIDE DMP DETAILS
Easy access to customized
• Instructions
• Suggested answers
• Institutional resources
PREVIEW DMP
Download DMP as • PDF • Plain text • RTF
REVIEW DMPS
• Optional or mandatory
• Comments can be passed back and forth between
  - Owners
  - Co-owners
  - Reviewers
• Important for administrative oversight
UX DESIGN FOR DMPTOOL 2
Research
• Stakeholders: expand growth, streamline operation, achieve sustainability
• Users: focus groups, feedback from broader user community
Plan
• Use cases for funding agencies and institution policies
• Prototypes evaluated by stakeholders and beta testers
Design
• Wireframes for layout and interaction
• Multiple iterations
Line Pouchard, PhD
Purdue Libraries, Research Data
RESEARCH DATA ACCESS AND PRESERVATION SUMMIT
04/22/2015
Issues in Big Data Curation
BIG DATA @ PURDUE LIBRARIES
Services
Research
Education
Data
DEFINITIONS OF DATA CURATION
• Data curation is a term used to indicate management activities required to maintain research data long-term such that it is available for reuse and preservation (Wikipedia)
• The active and ongoing management of data through its life cycle of interest and usefulness to scholarship, science, and education. Data curation activities enable data discovery and retrieval, maintain its quality, add value, and provide for reuse over time, and this new field includes authentication, archiving, management, preservation, retrieval, and representation (GSLIS)
BIG DATA LIFECYCLE
Assure
ISSUES IN BIG DATA CURATION
• Storage
• Data preparation & clean up
• Quality
• Discoverability
• Selection for preservation
• Privacy and ethics
• Reproducibility
QUESTIONS INFORMING CURATION ACTIVITIES: PLAN, ACQUIRE, PREPARE
Volume
• Plan: What is an estimate of volume & growth rate?
• Acquire: What is the most suited storage (databases, NoSQL, cloud)?
• Prepare: How do we prepare datasets for analysis (removing blanks and duplicates, splitting columns, adding/removing headers)?
Variety
• Plan: Are the data sensitive? What provisions are made to accommodate sensitive data?
• Acquire: What are the data formats and steps needed to integrate them?
• Prepare: What transformations are needed to aggregate data? Do we need to create a pipeline?
Velocity
• Plan: Is bandwidth sufficient to accommodate input rates?
• Acquire: Will datasets be aggregated into series? Will metadata apply to individual datasets or to series?
• Prepare: What type of naming format is needed to keep track of incoming and derived datasets?
Veracity
• Plan: What are the data sources? What allows us to trust them?
• Acquire: Who collects the data? Do they have the tools and skills to ensure continuity?
• Prepare: Are the wrangling steps sufficiently documented to foster trust in the analysis?
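The preparation steps named under Volume (removing blanks and duplicates, splitting columns) can be sketched in a few lines of Python using only the standard library. This is a toy illustration, not a tool described in the talk: the function name, the column names, and the sample data are all invented, and a real pipeline would also record each step so the wrangling stays documented.

```python
import csv
import io


def prepare_rows(raw_text, split_column, new_columns, delimiter=";"):
    """Drop blank and duplicate rows, then split one column into several.

    Hypothetical helper; assumes well-formed CSV with a header row.
    """
    reader = csv.DictReader(io.StringIO(raw_text))
    seen = set()
    prepared = []
    for row in reader:
        values = {key: (value or "").strip() for key, value in row.items()}
        if not any(values.values()):
            continue  # every cell is empty: a blank row
        fingerprint = tuple(values.items())
        if fingerprint in seen:
            continue  # exact duplicate of an earlier row
        seen.add(fingerprint)
        # Replace the combined column with one column per part.
        parts = values.pop(split_column).split(delimiter)
        for name, part in zip(new_columns, parts):
            values[name] = part.strip()
        prepared.append(values)
    return prepared


# Invented sample data: one blank row and one duplicate row.
RAW = """site,reading
A,2015-04-22;3.1
,
A,2015-04-22;3.1
B,2015-04-23;2.7
"""

rows = prepare_rows(RAW, "reading", ["date", "value"])
for row in rows:
    print(row)
```

Keeping each step as a small, named operation makes it easier to document the wrangling, which is the Veracity question in the same column.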
QUESTIONS INFORMING CURATION ACTIVITIES: ANALYSE, PRESERVE, DISCOVER
Volume
• Analyse: Are adequate compute power and analysis methods available?
• Preserve: Should raw data be preserved? What storage space is needed in the long-term?
• Discover: What part of the data (derived, raw, software code) will be made accessible to searches?
Variety
• Analyse: Are the various analytical methods compatible with the different datasets?
• Preserve: Are there different legal considerations for each data source? Are there conflicts with privacy and confidentiality?
• Discover: What search methods best suit this data: keyword-based, geo-spatial searches, metadata-based, semantic searches?
Velocity
• Analyse: At what time point does the analytical feedback need to inform decisions?
• Preserve: When does data become obsolete?
• Discover: What degree of search latency is tolerable?
Veracity
• Analyse: What kind of access to scripts, software, and procedures is needed to ensure transparency and reproducibility?
• Preserve: What are the trade-offs if only derived products and no raw data are preserved?
• Discover: Providing well-documented data in open access allows scrutiny. How is veracity supported with sensitive and private data?
COLLABORATIONS • Collaborations on multi-disciplinary proposals and projects
• Levels of collaboration
• Developing customized Data Management Plans
• Organizing your data
• Describing your data
• Sharing your data
• Publishing your datasets
• Preserving your data
• Education on best practices
A BIG DATA PROJECT AT PURDUE
Dr. Yung-Hsiang Lu, PI
CURATION ISSUES IN CAM2 PROJECT • Data access and re-use
- policies of video streams and CCTV - Sparse legal framework – except UK - Few policies available
• Data ownership • Data storage • Data organization
- naming scheme - metadata
• Protect metadata storage – where the intellectual property lies
• Data information literacy skills for Big Data
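One way to think about the naming scheme item above is as a function that builds names from a few metadata fields and a parser that recovers them. The Python sketch below is purely hypothetical: the pattern (source identifier, UTC timestamp, raw or derived stage) is an assumption chosen for illustration and is not the scheme used in the CAM2 project.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: <source>_<UTC timestamp>_<stage>.<extension>
NAME_PATTERN = re.compile(
    r"^(?P<source>[a-z0-9]+)_(?P<stamp>\d{8}T\d{6}Z)_"
    r"(?P<stage>raw|derived)\.(?P<ext>[a-z0-9]+)$"
)


def make_name(source, captured_at, stage, ext):
    """Build a file name that encodes source, capture time, and stage."""
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    name = f"{source.lower()}_{stamp}_{stage}.{ext}"
    if NAME_PATTERN.match(name) is None:
        raise ValueError(f"name does not follow the scheme: {name}")
    return name


def parse_name(name):
    """Recover the metadata fields encoded in a file name."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the scheme: {name}")
    return match.groupdict()


# Invented example values.
captured = datetime(2015, 4, 22, 14, 30, 0, tzinfo=timezone.utc)
name = make_name("cam042", captured, "raw", "jpg")
fields = parse_name(name)
print(name, fields)
```

Because the name can be parsed back into fields, incoming and derived files can be matched to their metadata records without opening the files themselves.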
CONCLUSION
• Big Data looks very different than small data in maintenance and storage
• Curation primarily focuses on different areas than small data
• Planning from the beginning is crucial: without planning, curation will fall short
• Collaborations are more important
• Facilitating access is where efforts need to focus, not storing the data.
THE END
SOLUTION WHAT DID WE GET?
• Approximately 2.25 PB of IBM GPFS
• Hardware provided by a pair of Data Direct Networks SFA12k arrays, one in each of MATH and FREH datacenters
• 160 Gb/sec to each datacenter
• 5x Dell R620 servers in each datacenter
MOLECULAR BIOSCIENCES INFORMATION SPECIALIST
Pete E. Pascuzzi Assistant Professor, Libraries Assistant Professor of Biochemistry (by courtesy)
BACKGROUND
• B.A. in Biology and Chemistry
• Ph.D. in Biochemistry
• Postdoctoral training in genomics and bioinformatics
• Joined Purdue Libraries in 2013 as part of a cluster hire in Systems Biology
Services
Research Education
Data
OVERVIEW EDUCATION
• Introduction to R and Bioconductor • Data Management for the Life Sciences • Embedded lectures • Workshops • Graduate student consultations
RESEARCH • Disciplinary faculty research collaborations • Libraries’ research
SERVICES
• Data management planning • Subject specialist for Purdue University Research Repository
EDUCATION RESEARCH DATA MANAGEMENT
DATA SHARING MANDATES DATA MANAGEMENT PLANS DATA ARCHIVING METADATA STANDARDS
EDUCATION INTRODUCTION TO R AND BIOCONDUCTOR
• Introductory bioinformatics course • R is a computer language for statistical computing and visualization • Bioconductor is the R project that supports bioinformatics • Eight-week summer course
EDUCATION INTRODUCTION TO R AND BIOCONDUCTOR
Course Syllabus • Basic R programming • Select Bioconductor packages for bioinformatic analysis • Manipulation of genome-scale annotation data • Manipulation and searching genomic sequence data • Visualization of genome-scale data • Exploratory data analysis • RNA-seq analysis
EDUCATION INTRODUCTION TO R AND BIOCONDUCTOR
Hidden Agenda • Data management • Directory organization • File naming • Data discovery and acquisition (using metadata) • Documentation of analysis (creating metadata) • Reading and reformatting of genome-scale, tab-delimited text files • Data and database structures • Etc.
EDUCATION GRADUATE STUDENT CONSULTATIONS
Pros • Ongoing collaborations with four research groups through graduate students • Weekly meetings to assist students with data issues • Serve on two graduate student thesis advisory committees • Exposure to range of research questions (human cancer, eye development in fruit fly, RNA processing in yeast, tomato ripening, drug discovery, . . .) • Informal means to gather information on researchers’ needs • Avenue to collaborative research publications and grants • Good PR for Libraries.
Cons • Time consuming (~ 1 hour/week/student) • Collaborations can become exploitive, i.e. student wants you to do the work! • Discussions of data quality are uncomfortable • Can require domain-specific training • Graduate student thesis advisory committee meetings can be painful
RESEARCH DISCIPLINARY FACULTY RESEARCH COLLABORATIONS
[Figure: collaboration activities placed along a Commitment scale from Low to High. Labels: Acquisition, Reformatting, Analysis, Visualization, Training, Proposal Development, Consultation]
CONCLUSIONS
Contact information
• Amy Barton, [email protected]
• Tao Zhang, [email protected]
• Line Pouchard, [email protected]
• Pete Pascuzzi, [email protected]
• Lisa Zilinski, [email protected]