national consortium for data science

A Public-Private Partnership to Advance Data Science Ashok Krishnamurthy PhD Deputy Director, RENCI University of North Carolina, Chapel Hill The National Consortium for Data Science (NCDS)

Upload: orau

Post on 16-Jul-2015

Category:

Data & Analytics


TRANSCRIPT

Page 1: National Consortium for Data Science

A Public-Private Partnership to Advance Data Science

Ashok Krishnamurthy PhD Deputy Director, RENCI University of North Carolina, Chapel Hill

The National Consortium for Data Science (NCDS)

Page 2: National Consortium for Data Science

RENCI FOCUS: DATA SCIENCE TO ENABLE RESEARCH, INNOVATION, AND ECONOMIC DEVELOPMENT

HEALTH & BIO SCIENCES

ENVIRONMENTAL SCIENCES

SOCIAL SCIENCES

Driving Technology Development

Funding Students

COMPUTING • HPC • Performance Optimization

DATA SCIENCE • iRODS

NETWORKING • Software Defined Networks • NIaaS

SOFTWARE • Open Source Development • Hydroshare • SRS

VISUALIZATION • InfoViz • SciViz • GeoViz

FOUNDATIONAL TECHNOLOGIES

Page 3: National Consortium for Data Science

What is NCDS?

NCDS is a public-private partnership to advance data science.

Mission: Leadership in data science research & education; help industry use the power of data to drive economic growth

Vision: Focused multi-sector, multidisciplinary data science community to solve big data challenges and drive the field forward

Goals
• Engage broad communities of data experts
• Coordinate data science research priorities that span disciplines and industries
• Facilitate development of education & training programs
• Support development of technical, ethical & policy standards
• Apply NCDS expertise to data challenges in science, business and government

Page 4: National Consortium for Data Science

Why a Consortium?

Time: Consortium can plant a stake in the ground quickly: significant funding and full-time staff not essential to launch.

Flexibility: Able to try different models, different key projects, different core foci to see what works best and to respond to changing and varied needs and interests.

Participation: Shared vision, ability to have your voice heard, define the issues to be tackled.

Community: Consortium is a way of building a community that can eventually become the foundation for a center (a physical place).

Page 5: National Consortium for Data Science

NCDS Components

Data Lab & Observatory
• Shared, distributed infrastructure housing large organized data; serves as a platform for data R&D and data science education (Graduate certificates, MS)

Working Groups
• Year-long deep dive into topics of interest to members
• Produces position papers, workshops, software, events, etc.

Data Fellows Program
• Seed grants for faculty to work on consortium-approved projects; NCDS review panel evaluates proposals
• Industry internships for graduate students
• Visiting industry data scientists at member universities

Data Science Events
• Leadership Summits (Spring)
• Data Matters Short Courses (Summer)
• Student Career events (Fall/Spring)
• Invited lectures and outreach (ongoing)

Page 6: National Consortium for Data Science

Accomplishments: 2013-2015

Organizational:
• Bylaws passed, steering committee, kickoff featuring Dr. Eric Green (Director, NHGRI) and US Rep. David Price, 10 paid memberships so far.

Programmatic:
• NCDS Leadership Summit (April 2013)
• Faculty Fellows (5 in 2013-14, 3 in 2014-15)
• Student-Industry-Faculty career awareness event (April 2014, 2015)
• Data Innovation Showcase (May 2014, 2015)
• Data Matters short course series (June 2014, 2015)
• Working groups (3, 2014-15)
• Observatory active with data sets (since June 2014)

Upcoming:
• IoT and Big Data Workshops: Industrial IoT (July 2015), Smart Cities (Sept. 2015), Mobile and Environmental Health (Nov. 2015)
• North Carolina Data Science & Analytics Initiative (March 2015-2018)

Page 7: National Consortium for Data Science

NCDS Members

Page 8: National Consortium for Data Science

NCDS Data Science Partnerships

Key: Infrastructure that adapts to problems

• Secure Research Workspace/Secure Medical Research Workspace: Secured virtual environments
• DataBridge: Social media-like discovery of useful data sets
• Genomic Medical Workflow Engine: Informatics and HPC in High Throughput Sequencing
• iRODS: Policy-driven data management
• ExoGENI/ADAMANT: Federated Infrastructure as a Service

Page 9: National Consortium for Data Science

Secure (Medical) Research Workspace

1. Safeguard Protected Health Information (PHI) data

2. Enable medical and translational research

Key Technology Across Domains

A secure “virtual desktop” where researchers can work with sensitive data

Page 10: National Consortium for Data Science

The secure research space

[Diagram labels: Research Appliance and Data Warehouse (repeated for several sites), Federation Management]

• Institution specific management of data policies and security
• Managed, secured, peer-peer data sharing/data federation
• Centralized management of networking and base-level security

Version 1.0 in production at UNC

Page 11: National Consortium for Data Science

The DataBridge: A Social Network for Data

Construct a multi-dimensional sociometric network for data.

3 Challenges

1. Develop similarity metrics applicable to scientific data sets

2. Perform community detection on the resulting set of similarities

3. Provide query interfaces on a resulting multi-dimensional network
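The three challenges on this slide can be illustrated with a small sketch. It is not the DataBridge implementation: it assumes each data set is described by a set of keywords, uses Jaccard similarity as a stand-in similarity metric, and treats connected components of a thresholded similarity graph as a very simple form of community detection. The data set names, keywords, and the 0.4 threshold are all made up for illustration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two keyword sets: |A & B| / |A | B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(datasets: dict, threshold: float) -> list:
    """Challenge 1: score every pair, keep pairs at or above the threshold."""
    edges = []
    for (n1, k1), (n2, k2) in combinations(sorted(datasets.items()), 2):
        score = jaccard(k1, k2)
        if score >= threshold:
            edges.append((n1, n2, score))
    return edges


def communities(names, edges) -> list:
    """Challenge 2 (simplified): connected components via union-find."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b, _ in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


def related(name: str, groups: list) -> list:
    """Challenge 3: a minimal query, 'which data sets share my community?'"""
    for group in groups:
        if name in group:
            return [n for n in group if n != name]
    return []


# Hypothetical data sets and keywords, for illustration only.
DATASETS = {
    "tide_gauges": {"coastal", "water-level", "timeseries"},
    "storm_surge": {"coastal", "water-level", "forecast"},
    "exome_variants": {"genomics", "sequencing", "clinical"},
    "rna_seq": {"genomics", "sequencing", "expression"},
}

EDGES = similarity_edges(DATASETS, threshold=0.4)
GROUPS = communities(DATASETS, EDGES)
print(GROUPS)
print(related("tide_gauges", GROUPS))
```

A real system would likely combine several similarity metrics (hence "multi-dimensional") and use a proper community detection algorithm; the sketch only shows how the three steps feed one another.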

Page 12: National Consortium for Data Science

The DataBridge: Shining a Light on Data

Maximize the usefulness of data for scientific research

Facilitate searching for collaborators

Enable data set publication as a means of communication

Assist scientists in discovering “interesting” data sets by automatically forming communities of data

Page 13: National Consortium for Data Science

ADAMANT: Pegasus/ExoGENI

Network Infrastructure-as-a-Service (NIaaS) for workflow-driven compute applications.

Workflow triggering adaptive infrastructure

Tools for workflow integration with adaptive infrastructure (ExoGENI).

Pegasus workflows using ExoGENI

Adapt to application demands
• Compute: Add/free compute nodes
• Storage: Allocate dynamic network storage
• Network: Dynamic bandwidth provisioned networks between compute and storage

Integrate data movement into NIaaS
• On-ramps: Dynamic bandwidth provisioned networks to static data repositories.

Target Applications
• Montage Galactic plane ensemble: Astronomy mosaics
• CyberShake: Probabilistic Seismic Hazard Analysis
• MapSeq: High-Throughput Sequencing

Page 14: National Consortium for Data Science

Dynamic Workflows

1. Start Workflow

2. Create Compute Nodes

3. Compute Intensive Workflow Step

4. Destroy Compute Nodes

5. End Workflow

Dynamic Slice
• Few compute nodes for beginning steps
• Add compute nodes for parallel compute intensive step
• Dynamically provision network between cloud sites
• Free extraneous compute nodes after compute step
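The dynamic-slice pattern on this slide (start small, add compute nodes for the compute-intensive step, free them afterwards) can be sketched as a toy simulation. This does not call Pegasus or ExoGENI; the `Slice` class, the `scaled` helper, and the node counts are hypothetical stand-ins chosen only to show the lifecycle.

```python
from contextlib import contextmanager


class Slice:
    """Hypothetical stand-in for a dynamically provisioned set of compute nodes."""

    def __init__(self, base_nodes: int):
        self.nodes = base_nodes
        self.log = []

    def add_nodes(self, count: int) -> None:
        self.nodes += count
        self.log.append(f"add {count} -> {self.nodes}")

    def free_nodes(self, count: int) -> None:
        self.nodes -= count
        self.log.append(f"free {count} -> {self.nodes}")


@contextmanager
def scaled(slice_: Slice, extra: int):
    """Hold extra nodes only for the duration of a compute-intensive step."""
    slice_.add_nodes(extra)
    try:
        yield slice_
    finally:
        # Nodes are released even if the step raises.
        slice_.free_nodes(extra)


def run_workflow(tasks: list, base_nodes: int = 2, extra: int = 6) -> Slice:
    """Walk the five steps: start, create, compute, destroy, end."""
    s = Slice(base_nodes)
    s.log.append(f"start with {s.nodes}")
    with scaled(s, extra):
        # Assume each node handles one task per round (ceiling division).
        rounds = -(-len(tasks) // s.nodes)
        s.log.append(f"compute {len(tasks)} tasks in {rounds} rounds")
    s.log.append(f"end with {s.nodes}")
    return s


RESULT = run_workflow(list(range(20)))
for line in RESULT.log:
    print(line)
```

Using a context manager keeps the "destroy compute nodes" step tied to the "create" step, so extra capacity is not left allocated when the compute-intensive step fails.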

Page 15: National Consortium for Data Science

NCGenes

Funding: National Human Genome Research Institute ($1.6M/year for 4 years)

Focus: Investigating the clinical utility of whole exome sequencing.

Methods: Coding portion of genome is captured and examined for variations that can be used in diagnosis or treatment of patient’s condition.
• Sequence also examined for actionable important medical findings.

Sequencing: UNC High Throughput Sequencing Facilities

NCGENES is Part of the NHGRI Clinical Sequencing Exploratory Research Consortium

RENCI provides:
• Software to manage the process, from patient enrollment through sample processing and sequencing, analysis, and data review.
• Data processing and management infrastructure.
• Annotation and analysis of variants.
• Display of results to clinical analysis teams.

Page 16: National Consortium for Data Science

NC Genes

[Workflow diagram with numbered steps 1 through 21. Labels: Initial patient enrollment and analysis; Identified Blood Sample; Coded Blood Sample; Coded DNA; Coded Capture Library; BSP; Wet lab; HTSF; CLIA lab; Seqware; VarDB; Data Store; NC GENES Data Mart (in CDW-H); WebCIS; Dx process; Bin process; Research process; Pull; Push; Notify; Data; Report; Conclusions. Roles: Molecular Pathologist; Molecular analyst; Clinician Researcher; ELSI Researcher; Project Operations.]

Page 17: National Consortium for Data Science

iRODS: Opportunity

Solves data sharing pain points for research enterprise

Already has substantial market share and traction
• BGI, Sanger, Broad, NASA, other federal agencies

A major component for existing sponsored research
• NSF DFC (DataNet Federation Consortium)

Key to major national and international collaborations
• NCDS, iRODS Consortium

Basis of commercial partnerships
• DDN

Creates a leadership position in data space

Facilitates winning grants
• Direct as well as touch

Page 18: National Consortium for Data Science

iRODS 4.0

• Plugin Architecture
• Binary Distribution
• Resource Composition
• www.irods.org

Page 19: National Consortium for Data Science

iRODS Meets Software-Defined Networks

Tying together the network and the data management components of the infrastructure through policies.

[Diagram labels: Policies; Data Management; Network Management; Application Management; Storage Resource Management; Computational Resource Management]

• iRODS policy-driven Data Management
• Policies to direct and optimize network traffic
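The idea of one set of policies driving both data management and network management can be sketched as follows. This is not the iRODS rule language or any iRODS or SDN API; the `Event` type, the two policy functions, the action strings, and the 100 GB threshold are hypothetical and exist only to show a single event fanning out to actions in more than one component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical description of something that happened to a data object."""
    kind: str       # e.g. "ingest" or "read"
    path: str
    size_gb: float


def replicate_on_ingest(event: Event) -> list:
    """Data-management policy: every ingested object gets a replica."""
    if event.kind == "ingest":
        return [("data", f"replicate {event.path} to archive")]
    return []


def reserve_bandwidth_for_large_ingest(event: Event, threshold_gb: float = 100.0) -> list:
    """Network-management policy: large ingests trigger a bandwidth request."""
    if event.kind == "ingest" and event.size_gb >= threshold_gb:
        return [("network", f"reserve bandwidth for {event.path}")]
    return []


POLICIES = [replicate_on_ingest, reserve_bandwidth_for_large_ingest]


def apply_policies(event: Event, policies=POLICIES) -> list:
    """Run every policy against the event and collect (component, action) pairs."""
    actions = []
    for policy in policies:
        actions.extend(policy(event))
    return actions


SMALL = apply_policies(Event("ingest", "/zone/home/a.csv", 0.5))
LARGE = apply_policies(Event("ingest", "/zone/home/genome.bam", 250.0))
READ = apply_policies(Event("read", "/zone/home/a.csv", 0.5))
print(SMALL)
print(LARGE)
print(READ)
```

Keeping each policy as a small independent function means a network policy can be added or changed without touching the data-management policies, which is the coupling-through-policies idea the slide describes.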

Page 20: National Consortium for Data Science

iRODS Consortium
• Membership model for adoption, growth, and support of iRODS
• Members: RENCI, The DICE Center, DataDirect Networks, Seagate, The Wellcome Trust Sanger Institute, EMC Corporation

Membership benefits
• Free and prioritized support, training, and documentation
• Voting and participation in determining:
  - Release roadmaps, testing, certification, standards
• Co-marketing and branding
• Involvement in planning and governance

Strategic partnerships
• Dual support models with consortium
• Proprietary extensions and kits
• New development and technical directions

Page 21: National Consortium for Data Science

iRODS Consortium Members

Page 22: National Consortium for Data Science

Data-driven Hazards Research

• Data Analysis: NOAA NOS gauge data, USGS data, US DHS/FEMA collected high-water mark, meteorological forecasts from NOAA’s NCEP and NHC

• Statistical Forecasting: Very large pre-existing datasets; provides early guidance information, available about 10 minutes after official NHC forecast storm advisory

• US DHS-funded research activity through the DHS Coastal Hazards Center of Excellence at the University of North Carolina at Chapel Hill

• Winner, DHS Science & Technology Impact Award, 2012

Storm Surge Forecasting with the ADCIRC storm surge and tide model

Collaborations with: U Delaware, Oklahoma, Nat’l Hurricane Center, Notre Dame OPeNDAP.org, Cornell, UNC, Applied Research Associates, USACE

Page 23: National Consortium for Data Science

Data Science Graduate Education

Modular courses for 11-month program
• Graduate Certificate in Data Science (Half time)
• MS in Data Science (Full time)

Page 24: National Consortium for Data Science

Workforce Training

Page 25: National Consortium for Data Science

Data Matters Courses for 2015

Page 26: National Consortium for Data Science

Conclusion

Developing Data Science Will:

• Develop the next generation of data science experts and leaders
• Create strategies, practices and scientific methods for understanding data
• Enable more collaborations among data and domain scientists, business, academia and government
• Assist those who are struggling to collect, analyze, manage and use data
• Establish methodologies for measuring the value and impact of data

Page 27: National Consortium for Data Science

THANK YOU! Ashok Krishnamurthy

[email protected]