Data Science for NSF Data Science Workshop 2015, Dr. Brand Niemann, Director and Senior Data Scientist/Data Journalist, Semantic Community, August 24, 2015


Page 1

Data Science for NSF Data Science Workshop 2015

Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community

Data Science
NSF Data Science Workshop 2015

August 24, 2015

Page 4

Workshop Knowledge Base

• Content
  • Overview
  • Agenda
  • Mentors, Observers, Ethnographers & Organizers
  • Posters
  • Team Assignments
  • Team Work Products
  • IGERT PhD Program in Big Data and Data Science at the UW
• Results:
  • White Papers (Only 3 and Review Criteria Met?)
  • Interviews (?)
• Audit (See Next Slide for Details):
  • Mine
  • Science
  • Questions
  • Publish

Page 5

Data Mining - Science - Questions - Publication Process

• Data Mining Process:
  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment
• Data Science Process:
  • Data Preparation
  • Data Ecosystem
  • Data Story
• Data Science Questions:
  • How was the data collected?
  • Where is the data stored?
  • What are the data results?
  • Why should we believe the data results?
• Data Science Data Publication:
  • Knowledge Base
  • Spreadsheet Index
  • Web & PDF Tables to Spreadsheet
  • Data Browser
  • Dynamically Linked Adjacent Visualizations
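The four Data Science Questions can double as the columns of a Spreadsheet Index. The Python sketch below is only an illustration of that idea: the class name `DatasetAudit`, its field names, and the helper functions are hypothetical and not part of the workshop materials. Answers that the slides do not give are left blank rather than guessed.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class DatasetAudit:
    """One row of a spreadsheet index; one column per audit question."""
    name: str
    how_collected: str   # How was the data collected?
    where_stored: str    # Where is the data stored?
    what_results: str    # What are the data results?
    why_believe: str     # Why should we believe the data results?


def to_spreadsheet(rows):
    """Serialize audit rows as CSV text that any spreadsheet tool can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(DatasetAudit)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


def unanswered(row):
    """Return the audit questions that are still blank for a data set."""
    return [key for key, value in asdict(row).items()
            if key != "name" and not value.strip()]


# Only the storage location comes from the slides; the rest is left open.
example = DatasetAudit(
    name="GenBank",
    how_collected="",
    where_stored="ftp://ftp.ncbi.nlm.nih.gov/",
    what_results="",
    why_believe="",
)
```

Calling `unanswered(example)` lists the questions that still need an answer before the data set can be published, and `to_spreadsheet([example])` produces the index as CSV text.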

Page 6

Workshop White Paper Conclusions

• Genomic Data Science: Problems regarding the speed, cost and hardware that are required for analyzing and sharing the big genomic data are among the major challenges in Genomic Data Science. On the other hand, the area of genomics provides Data Science with not only great challenges but also great promises.

• Big Data: From correlation to causation: For data science and big data analytics to become more useful towards examining causal relations nowadays, I argue that we need to draw on the substantial knowledge base created in the economics and social science fields over the years in order to infer interesting causal effects as simply analyzing large amounts of data does not necessarily help us make better data-driven decisions.

• Shape mapping in genome-wide association studies: Genome-wide association studies (GWAS) interrogate large-scale whole genome to characterize the complex genetic architecture for biological traits. Currently, the major focuses of GWASs are the associations between single-nucleotide polymorphisms (SNPs) and traits such as human diseases.

Page 7

NIH Data Commons

• FAIR Principles:
  • Findable
  • Accessible
  • Interoperable
  • Reusable
• Cloud:
  • Data
  • Software
  • Results
• Federal Science Policy:
  • OSTP Public Access to Scientific Data Memo (February 2013)
  • New Program: Big-Data-to-Knowledge (2013)
  • New Position: Associate Director of Data Science (2014)
  • Digital Enterprise (2015): Data Commons
    • Metadata
    • Open APIs
    • Digital Objects
    • Containers

Federal Big Data Working Group Meetup, August 17, 2015: A NIH – Semantic Medline Data Science Data Publication Commons

Page 8

https://datascience.nih.gov/commons

The Commons Framework is:
• Discoverability: Search and Find
• Open APIs: Data and Tools
• Unique IDs: for Digital Research Objects
• Containers: For Packaging Applications
• Computing Platform: Cloud & HPC
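Two of the framework elements, Unique IDs and Discoverability, can be illustrated with a small in-memory catalog. This Python sketch is an assumption-laden illustration, not the NIH Commons design: the `DigitalObject` and `Catalog` names are hypothetical, and a name-based UUID derived from the URL merely stands in for whatever identifier scheme a real Commons would assign.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class DigitalObject:
    """A digital research object described by a title, a URL, and free-text tags."""
    title: str
    url: str
    tags: tuple = ()

    @property
    def object_id(self):
        # Assumption: a deterministic UUID (version 5) computed from the URL,
        # so the same URL always yields the same identifier.
        return str(uuid.uuid5(uuid.NAMESPACE_URL, self.url))


class Catalog:
    """Minimal 'search and find' over registered digital objects."""

    def __init__(self):
        self._objects = {}

    def register(self, obj):
        """Store the object under its unique ID and return that ID."""
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        """Look an object up by its unique ID."""
        return self._objects[object_id]

    def search(self, term):
        """Case-insensitive substring match on titles and tags."""
        term = term.lower()
        return [obj for obj in self._objects.values()
                if term in obj.title.lower()
                or any(term in tag.lower() for tag in obj.tags)]
```

For example, registering `DigitalObject("GenBank", "http://www.ncbi.nlm.nih.gov/genbank/", ("genomics",))` returns an ID that can later be passed to `get`, and `search("genom")` finds the same object by tag.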

Page 9

OSTP/NSF Data Science Meetup of Meetups

• Week of November 2nd:
  • NSF Data Science/Big Data Principal Investigators (About 300)
  • NSF Data Hubs (4)
  • Organizers of Largest Data Science/Big Data Meetups (About 65)
• Pipeline for Return on Investment:
  • PIs put their data, tools and research results in the Data Hubs
  • Data Hubs provide those data, tools, and research results to the world, but especially to the Data Science/Big Data Meetups
  • Data Science/Big Data Meetups collaborate with PIs and Data Hubs to increase usage and feedback

Page 10

We Already Do This!

• Semantic Community:
  • Provides a Community Sandbox that is like a GitHub, Data Hub, Data Commons, etc.
    • Metadata (MindTouch)
    • Open APIs (MindTouch)
    • Digital Objects (MindTouch)
    • Containers (Spotfire)
  • Organizes the Federal Big Data Working Group Meetup
  • Supports Agencies and Programs in Crowdsourcing Their Data Sets
  • Mentors Data Scientists (Tutorials and MOOCs) and Entrepreneurs (Eastern Foundry)
• Federal Big Data Working Group Meetup:
  • Federal: Supports the Federal Big Data Initiative, but not endorsed by the Federal Government or its Agencies;
  • Big Data: Supports the Federal Digital Government Strategy which is "treating all content as data", so big data = all your content;
  • Working Group: Data Science Teams composed of Federal Government and Non-Federal Government experts producing big data products; and
  • Meetup: The world's largest network of local groups to revitalize local community and help people around the world self-organize, like MOOCs (Massive Open Online Courses), now embraced by the White House.

Page 11

The Journey to Data and Meetup 1

• Since the three white papers from the NSF Data Science Workshop did not describe any actual work with data sets, I decided to use their content to find a data set. The first reference in the first white paper was GenBank. When I shared that, I got a response from a member of the OSTP/NSF Data Science Meetup of Meetups planning team who works directly with it:
  • I do a lot of genomics data stuff (I work for NCBI, which is the largest genomic database in the world [we make GenBank, which is the first citation in the genomics data challenge summary]).
  • I think I might be able to help focus the genomics data challenge a bit more.

Page 12

The Journey to Data and Meetup 2

• I responded:
  • I looked at: http://www.ncbi.nlm.nih.gov/genbank/
  • And found: http://www.ncbi.nlm.nih.gov/guide/data-software/
  • And wondered where it would be good to start?
  • This is like a Data Commons that Vivien Bonazzi talked about at our last meetup: A NIH – Semantic Medline Data Science Data Publication Commons (Click See All).
  • I could build a searchable index in a spreadsheet and Spotfire with your guidance like I have done for other NIH data sets.

Page 13

The Journey to Data and Meetup 3

• He responded:
  • We also have bigger databases (in terms of data size) like SRA, dbGaP and GEO.
  • Here’s a third party attempt at normalizing the SRA metadata:
    • http://www.bioconductor.org/packages...tml/SRAdb.html
  • We also provide a run selector tool for visualization in SRA, if you go to the send to menu.
  • We’ve also done some hackathoning with such data:
    • https://github.com/DCGenomics?tab=repositories
  • To come full circle, the RNA_mapping repo here may be the preamble for a collaboration with the NIH Data Science Data Commons:
    • https://github.com/NCBI-Hackathons

Page 14

http://semanticommunity.info/%40api/deki/files/35592/NSFNCBI.txt?origin=mt-web

My Data Mining Notes in Notepad That Helped Structure What Follows Next

Page 15

http://www.ncbi.nlm.nih.gov/

The majority of NCBI data are available for downloading, either directly from the NCBI FTP site or by using software tools to download custom datasets.

Page 16

NCBI Download: FTP and Aspera

http://www.ncbi.nlm.nih.gov/public/
ftp://ftp.ncbi.nlm.nih.gov/
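Browsing the FTP site can also be scripted. The Python sketch below uses only the standard-library `ftplib` module and the host name shown on the slide. It assumes the server still accepts anonymous FTP logins, and the `genbank` directory name in the usage note is an assumption rather than something stated in the slides. The function names are hypothetical.

```python
from ftplib import FTP

NCBI_FTP_HOST = "ftp.ncbi.nlm.nih.gov"  # host name taken from the slide


def ftp_url(directory=""):
    """Build the ftp:// URL for a directory on the NCBI FTP site."""
    directory = directory.strip("/")
    if directory:
        return f"ftp://{NCBI_FTP_HOST}/{directory}/"
    return f"ftp://{NCBI_FTP_HOST}/"


def list_directory(directory="", limit=20, timeout=30):
    """Return up to `limit` entry names from a directory (needs network access)."""
    with FTP(NCBI_FTP_HOST, timeout=timeout) as ftp:
        ftp.login()              # no arguments means an anonymous login
        if directory:
            ftp.cwd(directory)   # change into the requested directory
        return ftp.nlst()[:limit]
```

With a network connection, `list_directory("genbank")` would return the first few file names in that directory, which is a quick way to see what is available before committing to a large download.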

Page 18

For downloading purposes, please keep in mind that the uncompressed GenBank flatfiles are approximately 735GB (sequence files only); the ASN.1 data are approximately 600GB.

http://www.ncbi.nlm.nih.gov/news/08-19-2015-genbank-release-209/
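To put those sizes in perspective, a back-of-the-envelope transfer-time estimate helps. The Python sketch below assumes decimal gigabytes and a link that is fully used for the whole transfer; the 100 Mbit/s figure in the usage note is a hypothetical bandwidth, not a number from the release notes, and real downloads will be slower.

```python
def transfer_hours(size_gb, link_mbps):
    """Hours needed to move size_gb (decimal GB) over a link_mbps link.

    Assumes the link runs at its full nominal rate with no protocol overhead.
    """
    bits = size_gb * 1e9 * 8                 # gigabytes -> bits
    seconds = bits / (link_mbps * 1e6)       # megabits per second -> bits per second
    return seconds / 3600
```

Under these assumptions, `transfer_hours(735, 100)` comes to a little over 16 hours for the uncompressed flatfiles and `transfer_hours(600, 100)` to a little over 13 hours for the ASN.1 data.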

Page 20

http://www.ncbi.nlm.nih.gov/genbank/

Page 21

http://www.ncbi.nlm.nih.gov/genbank/samplerecord/

This is very complicated big data that requires subject matter expertise and big data science expertise and tools.

Is there another way? Yes, and I found it by chance!
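To give a feel for the record format, here is a minimal Python sketch that reads only the top-level header fields of one flat-file record. It relies on the fixed-column layout of the format (keyword in the first 12 columns, value after that, continuation lines indented). The `SAMPLE` text is an abridged, illustrative excerpt modeled on the NCBI sample record, so its values are assumptions to be checked against the page, and `parse_header` is a hypothetical helper, not an NCBI tool. The feature table and sequence, where most of the complexity lives, are deliberately ignored.

```python
SAMPLE = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
FEATURES             Location/Qualifiers
"""


def parse_header(flatfile_text):
    """Collect the top-level header fields of one flat-file record.

    Parsing stops at FEATURES (or the // terminator); indented
    sub-keywords such as ORGANISM are skipped.
    """
    fields = {}
    current = None
    for line in flatfile_text.splitlines():
        if not line.strip():
            continue                       # ignore blank lines
        if line.startswith("FEATURES") or line.startswith("//"):
            break
        keyword, value = line[:12].strip(), line[12:].strip()
        if not line[0].isspace():          # keyword in column 1: a new field
            current = keyword
            fields[current] = value
        elif current and not keyword:      # blank keyword area: continuation line
            fields[current] += " " + value
        else:
            current = None                 # indented sub-keyword: skip its lines
    return fields
```

Calling `parse_header(SAMPLE)` yields a dictionary keyed by `LOCUS`, `DEFINITION`, `ACCESSION`, and `SOURCE`, with the two-line definition joined into one string. A production workflow would more likely use an established parser than hand-written code like this.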

Page 22

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4383886/

Somehow I found this page, which has links to the Web Site, Table 1, and Supplementary Data.

We believe that our database will contribute to the future establishment of personalized medicine and increase our understanding of genetic factors underlying diseases.

So can SemMed generate such a catalog!

Page 23

http://bmi-tokai.jp/VaDE/

Page 24

http://bmi-tokai.jp/VaDE/all-gwas-snp/

25,758 Records and 19 MB!
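The contrast with the roughly 735 GB of uncompressed GenBank flatfiles mentioned on an earlier slide can be made concrete with simple arithmetic. The Python sketch below assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the function names are hypothetical.

```python
def bytes_per_record(total_mb, n_records):
    """Average record size in bytes for a table of total_mb decimal megabytes."""
    return total_mb * 1e6 / n_records


def size_ratio(big_gb, small_mb):
    """How many times larger a big_gb data set is than a small_mb one."""
    return big_gb * 1e9 / (small_mb * 1e6)
```

With the figures from the slides, `bytes_per_record(19, 25758)` is roughly 740 bytes per record, and `size_ratio(735, 19)` shows the distilled table is nearly 39,000 times smaller than the raw flatfiles.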

Page 27

VaDE Supplementary Data

http://nar.oxfordjournals.org/content/suppl/2014/10/30/gku1037.DC1

See Next Slide

Page 28

http://nar.oxfordjournals.org/content/suppl/2014/10/30/gku1037.DC1/Table_S1_nagai3.docx

Page 29

http://www.ncbi.nlm.nih.gov/pubmed/19573626

These results reveal a novel function of Maitake beta-glucan that enhances the granulopoiesis and mobilization of granulocytes and their progenitors by stimulating G-CSF production. This finding presents opportunities to develop new therapeutic strategies against the immunosuppression caused by chemotherapies in cancer patients.

We had beta-Glucan results from Data Science and Semantic Medline at our August 17th Meetups!

Page 32

Conclusions and Recommendations

• There are at least three phases and products in genomic data science:
  • Raw data to a Commons like GenBank with data, software, and results.
  • Distilled associations for personalized medicine like VarySysDB Disease Edition (VaDE).
  • Data Science Data Publications for students, researchers, medical doctors, data scientists, and the public.
• Tasks in process:
  • Build a Knowledge Base with a searchable index in a wiki, spreadsheet and Spotfire, like I have done for other NIH data sets, that follows the Commons Framework: Discoverability: Search and Find; Open APIs: Data and Tools; Unique IDs: for Digital Research Objects; Containers: For Packaging Applications; and Computing Platform: Cloud & HPC. See Slide 10 for Mapping Our Commons to the NIH Commons.
  • Build data science data publications of the multiple content types and formats.
  • Submit this as a White Paper for the NSF Data Science Workshop and Federal Big Data Working Group Meetup.