Research Data Management Fundamentals for MSU Engineering Students

Research Data Management: Data documentation, organization, storage and sharing. Define a question; Gather information; Form a hypothesis; Test the hypothesis; Analyze the data; Interpret the data; Publish results; Retest. Aaron Collie, Digital Curation Librarian, [email protected]

Upload: aaron-collie

Post on 19-Jun-2015

DESCRIPTION

Presented October 16 2014

TRANSCRIPT

1. Research Data Management: Data documentation, organization, storage and sharing. Aaron Collie, Digital Curation Librarian, [email protected]

2. Data Management. Isn't that trivial? Not so much.
  • Data is a primary output of research; it is very expensive to produce high quality data.
  • Data may be collected in nanoseconds, but it takes the expert application of research protocol and design to generate quality data.
  Image credit: CC-BY-SA-3.0 Rob Lavinsky

3. To put that into perspective, consider data as the product of an industry. Data is the output of a process that generates higher orders of understanding.
  Hierarchy, top to bottom: Wisdom, Knowledge, Information, Data.
  "Understanding is hierarchical!" Russell Ackoff

4. Data Industries
  • In the academic sector that industry is called scholarly communication: Data -> Research Article.
  • In the private sector that industry is called research & development: Data -> New Product.

5. Industry is changing
  • Multiauthor Papers: Onward and Upward - ScienceWatch Newsletter. (n.d.). Retrieved October 4, 2013, from http://archive.sciencewatch.com/newsletter/2012/201207/multiauthor_papers/
  • The demise of the lone author : Article : History of the Journal Nature. (n.d.). Retrieved October 4, 2013, from http://www.nature.com/nature/history/full/nature06243.html

6. Science is always changing
  • Thousand years ago: science was empirical, describing natural phenomena.
  • Last few hundred years: theoretical branch, using models and generalizations.
  • Last few decades: a computational branch, simulating complex phenomena.
  • Today: data exploration (eScience), unifying theory, experiment, and simulation.
      • Data captured by instruments or generated by simulator
      • Processed by software
      • Information/Knowledge stored in computer
      • Scientist analyzes database / files using data management and statistics
  Slide credit: Gray, J. & Szalay, A. (11 January 2007). eScience Talk at NRC-CSTB meeting. http://research.microsoft.com/en-us/um/people/gray/talks/NRC-CSTB_eScience.ppt

7. Research is now a team sport
  Image credit: (cc) SpoiltCat

8. This has been noticed.
  • "... must include a supplementary document of no more than two pages labeled 'Data Management Plan'."
  • "... expects the timely release and sharing of final research data"
  • "... should describe how the project team will manage and disseminate data generated by the project"
  • "... requires that data be submitted to and archived by designated national data centers."
  • "NASA promotes the full and open sharing of all data"
  • "IMLS encourages sharing of research data."

9. But why are we really here?
  • Impetus: NSF has mandated that all grant applications submitted after January 18th, 2011 must include a supplemental "Data Management Plan".
  • Effect: The original NSF mandate has had a domino effect, and many funders now require or state guidelines for data management of grant funded research.
  • Response: Data management has not traditionally received a full treatment in (many) graduate and doctoral curricula; intervention is necessary.

10. Positive reinforcement.
  • National Science Foundation Data Management Plan mandate (January 18, 2011)
  • Presidential Memorandum on Managing Government Records (August 24, 2012)
      • Managing Government Records Directive: All permanent electronic records in Federal agencies will be managed electronically to the fullest extent possible for eventual transfer and accessioning by NARA in an electronic format.

11. Positive reinforcement (cont.)
  • White House policy memo (February 22, 2013)
      • Increasing Access to the Results of Federally Funded Scientific Research: Federal agencies with more than $100M in R&D expenditures must develop plans to make the published results of federally funded research freely available to the public within one year of publication.
  • OSTP policy memo (March 20, 2014)
      • Improving the Management of and Access to Scientific Collections: directs each Federal agency that owns, maintains, or otherwise financially supports permanent scientific collections to develop a draft scientific-collections management and access policy within six months.

12. How does this apply to you?
  • Data management is now an expected job skill, especially in the research fields (RDM).
  • Studies show that data management is not typically a significant part of undergraduate or graduate curricula.
  • We have a causality dilemma!

13. What's in it for you?
  • Better organization for your classes
      • Course Management: Angel / Desire2Learn
      • Bibliographic Management: Zotero / EndNote / Mendeley
      • File Management: Google Drive / Git / File-system
  • Direct application to your career
      • Data management is an unnamed practice
      • Start now so you can list this skill on your resume or CV
  • Academia is changing: big data is here

14. Course Management: http://help.d2l.msu.edu/

15. Bibliographic Management: http://classes.lib.msu.edu/

16. File Management: http://tech.msu.edu/storage/

17. Storage @ MSU
  • http://www.egr.msu.edu/decs/ : DECS provides MANY services, specifically designed for the College of Engineering: storage, equipment, software, hosting.
  • http://afs.msu.edu : AFS space provides 1GB of networked storage.
  • https://wiki.hpcc.msu.edu : HPCC provides 50GB (home) and 1TB (research group) networked storage.
  • http://www.cstat.msu.edu/ : CSTAT offers statistical consulting and training among other services.

18. RDM Systems
  Diagram layers: File Storage, File Content, File Format, File System.
  • File Systems: Hierarchical
  • Database Systems: Hierarchical, Relational, or Object Oriented
  • Asset Management Systems: Combination of Database and File System

19. Data Management (overview)
  • Storage Architecture: Storage Options; Single points of failure; Backup Strategy
  • File Management: File Organization; File Naming; File Formats
  • Documentation Practices: Project Documentation; Process Documentation; Data Documentation
  • Access Management: Sharing Data; Publishing Data; Archiving Data
  Image credits: (cc) Alan Cleaver, (cc) Will Scullin

20. Storage Architecture
  • Storage Options
  • Single points of failure
  • Backup Strategy

21. Storage Architecture: Storage Options
  • Optical Storage: CD-ROM; DVD-ROM; Blu-ray Discs
  • Solid-State Storage: USB Flash Drives; Memory Cards; Internal Device Storage
  • Magnetic Storage: Internal Hard Drives; External Hard Drives; Tape Drives
  • Networked Storage: Server and Web Storage; Managed Networked Storage; Cloud Storage; Tape Libraries

22. Storage Architecture: Single points of failure
  Good practices for avoiding single points of failure:
  • Use managed networked storage whenever possible
  • Move data off of portable media
  • Never rely on one copy of data
  • Do not rely on CD or DVD copies to be readable
  • Be wary of software lifespans (e.g. Angel)
  Storage types by term:
  • Limited (task term): Optical Media (CD, DVD, Blu-ray); Portable Flash Media (USB Flash Drives, Memory Cards, Internal Memory)
  • Short (project term): Magnetic Storage (Internal HD, External HD); Networked Storage (Server/Web Space, Cloud Storage)
  • Long (life term): Networked Storage (Managed Network); Magnetic Storage (Tape Drives)

23. Storage Architecture: Backup Strategy
  Good practices for creating a backup strategy:
  • Make 3 copies
      • E.g. original + external/local + external/remote
      • E.g. original + 2 formats on 2 drives in 2 locations
  • Geographically distribute and secure
      • Local vs. remote, depending on needed recovery time
  • Know what resources are available to you: personal computer, external hard drives, departmental, or university servers may be used

24. Data Management (overview, repeated from slide 19)

25. File Management
  • File Organization
  • File Naming
  • File Formats

26. File Management: File Organization
  Create a file plan:
  • Better chance you will use a standard method when the time comes
  • Simple organization is intuitive to team members and colleagues
  • Reduces unsynchronized copies in personal drives and email attachments

27. File Management: File Naming
  Utilize a file naming convention:
  • Create logical sequences for sorting through many files and versions
  • Identify what you're searching for by filename by using a primary term
  • If not using a version control system, implement simple versioning
  • It's sort of like a tweet: a file name should not exceed 255 characters for most modern operating systems
  Example file names using simple version control, with the primary term of each:
  • lakeLansing_waltM_fieldNotes_20091012_v002.doc (primary term: location)
  • OrgChart2009_petersK_20090101_d001.svg (primary term: content)
  • 20110117_sharpeW_krillMicrograph_backscatter3_v002.tif (primary term: date)
  • borgesJ_collocation_20080414.xml (primary term: person)

28. File Management: File Formats
  Make an informed decision in selecting file formats:
  • It is important to choose platform and vendor-independent file formats to ensure the best chance for future compatibility
  • Open formats are often (but not always) supported broadly by a community rather than individually by a company or vendor
  Format genres, rated Great / Not Bad / Avoid:
  • TEXT. Great: .txt; .odt; .xml; .html. Not Bad: .pdf; .rtf; .docx. Avoid: .doc
  • AUDIO. Great: .flac; .wav. Not Bad: .ogg; .mp3. Avoid: .wma; .ra; .ram; compression
  • VIDEO. .mp2/.mp4, MKV .wmv; .mov; .avi; compression
  • IMAGE. .tif; .png; .svg; .jpg .gif; .psd; compression
  • DATA. Great: .sql; .csv; .xml. Not Bad: .xlsx. Avoid: .xls; proprietary DB formats

29. Data Management (overview, repeated from slide 19)

30. Documentation Practices
  • Project Documentation
  • Process Documentation
  • Data Documentation

31. Documentation Practices: Project Documentation
  Good practice for documenting project information:
  • Oftentimes a team effort
  • At minimum, store documentation in a readme.txt file
  • Include name of project, people, roles & contact information
  • Include executive summary or abstract for basic context
  • Include an inventory of servers, directories, data, lab equipment, and other resources
  • A great start for project documentation is a project charter

32. Documentation Practices: Process Documentation
  Good practices for documenting processes:
  • Sometimes an individual effort, sometimes collaborative
  • Protocols, software or code settings, code commentary
  • Workflow descriptions (text) or diagrams (image)
  • Include example scripts, inputs, outputs if applicable
  • A great start for process documentation is a lab notebook
  Example of R code commentary:
      # Cumulative normal density
      pnorm(c(-1.96, 0, 1.96))

33. Documentation Practices: Data Documentation
  Good practices for documenting data:
  • Use standard methods of documentation where they exist
      • Metrics/Measurements
      • Code Book
      • Metadata Standard
  Example: ~1.57 x 10^7 K = Temperature of the sun (center). Here "~1.57 x 10^7" is the measure/metric, "K" is the unit, and "Temperature of the sun (center)" is the metadata.

34. Data Management (overview, repeated from slide 19)
  Image credit: (cc) Alan Cleaver

35. Access Management
  • Sharing Data
  • Publishing Data
  • Archiving Data

36. Access Management: Sharing Data
  Good practices for sharing or distributing data:
  • Basics
      • Synchronization, Versioning, Access Restrictions (and logs)
      • Collaborative tools can save time and effort (and help with scale)
  • Intellectual property
      • Data itself is not protected by copyright law in the U.S.
      • Expressions of data (forms, reports, visuals) can be copyrightable
      • Data can be licensed similarly to software
  • Ethics
      • Human subjects (e.g. IRB restrictions)
      • Private/sensitive information

37. Access Management: Publishing Data
  Good practices for publishing data:
  • Not Publishing
  • Self Publishing (Web Site)
      • Create and add data citations to personal websites
  • Journal (Supplementary Material)
      • Publish data with a journal that will provide a persistent link to your dataset (e.g. DOI, handle)
  • Archive/Repository
      • Institutional (see above example)
      • Disciplinary (e.g. article & data)

38. Access Management: Archiving Data
  Good practices for archiving research data:
  • LOCKSS!
  • Archive documentation with data
  • Write costs for data management and archiving into your research budgets (and in some cases, proposals)
  • Define access policies including restrictions or embargoes
  • Understand requirements for submission of data prior to project completion

39. Data Management (overview, repeated from slide 19)

40. Resources at the Library
  • Research Data Management Guidance: face-to-face consulting on RDM strategies, [email protected]
  • Volkening, Engineering Librarian, [email protected]
  • Aaron Collie, Digital Curation Librarian, [email protected]

41. Questions?
  • Store: Three Copies on Three Disks in Three Locations
  • Organize: If you make a plan, you just might follow it.
  • Document: What would my colleagues need to know to understand this data?
  • Share: Data makes an impact
  Slides are HERE: http://tiny.cc/yvdpqw
  Aaron Collie, Digital Curation Librarian, [email protected]
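
The file naming convention described on slide 27 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_filename` and `next_version`, the underscore separator, and the letters-and-digits-only check are hypothetical choices, not part of the presentation. Only the 255-character ceiling and the sample name come from the slides.

```python
import re
from datetime import date

MAX_NAME_LENGTH = 255  # ceiling mentioned on slide 27


def build_filename(primary, others, when, version, ext):
    """Join name parts with underscores, primary term first.

    Hypothetical helper: the part order and the vNNN padding mirror the
    slide's first example, but are assumptions rather than a standard.
    """
    parts = [primary, *others, when.strftime("%Y%m%d"), f"v{version:03d}"]
    for part in parts:
        # Assumption: restrict parts to letters and digits so names sort
        # predictably and stay portable across operating systems.
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"unsafe characters in name part: {part!r}")
    name = "_".join(parts) + "." + ext.lstrip(".")
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError("file name exceeds 255 characters")
    return name


def next_version(filename):
    """Return the same name with its _vNNN counter incremented by one."""
    match = re.search(r"_v(\d+)(?=\.[^.]+$)", filename)
    if match is None:
        raise ValueError(f"no version suffix in {filename!r}")
    digits = match.group(1)
    bumped = str(int(digits) + 1).zfill(len(digits))
    return filename[:match.start(1)] + bumped + filename[match.end(1):]


first = build_filename("lakeLansing", ["waltM", "fieldNotes"],
                       date(2009, 10, 12), 1, "doc")
print(first)                # lakeLansing_waltM_fieldNotes_20091012_v001.doc
print(next_version(first))  # lakeLansing_waltM_fieldNotes_20091012_v002.doc
```

Writing the date as YYYYMMDD and zero-padding the version number keeps an alphabetical directory listing in chronological and version order, which is the "logical sequences for sorting" point made on the slide.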