national consortium for data science
A Public-Private Partnership to Advance Data Science
Ashok Krishnamurthy, PhD, Deputy Director, RENCI, University of North Carolina, Chapel Hill
The National Consortium for Data Science (NCDS)
2
RENCI FOCUS: DATA SCIENCE TO ENABLE RESEARCH, INNOVATION, AND ECONOMIC DEVELOPMENT
HEALTH & BIO SCIENCES
ENVIRONMENTAL SCIENCES
SOCIAL SCIENCES
Driving Technology Development
Funding Students
FOUNDATIONAL TECHNOLOGIES
COMPUTING • HPC • Performance Optimization
DATA SCIENCE • iRODS
NETWORKING • Software Defined Networks • NIaaS
SOFTWARE • Open Source Development • Hydroshare • SRS
VISUALIZATION • InfoViz • SciViz • GeoViz
3
What is NCDS? NCDS is a public-private partnership to advance data science
Mission: Leadership in data science research & education; help industry use the power of data to drive economic growth
Vision: A focused multi-sector, multidisciplinary data science community to solve big data challenges and drive the field forward
Goals:
• Engage broad communities of data experts
• Coordinate data science research priorities that span disciplines and industries
• Facilitate development of education & training programs
• Support development of technical, ethical & policy standards
• Apply NCDS expertise to data challenges in science, business and government
4
Why a Consortium?
Consortium can plant a stake in the ground quickly: significant funding and full-time staff not essential to launch.
Able to try different models, different key projects, different core foci to see what works best and to respond to changing and varied needs and interests.
Shared vision, ability to have your voice heard, define the issues to be tackled.
A consortium is a way of building a community that can eventually become the foundation for a center (a physical place).
Time • Flexibility • Participation • Community
5
NCDS Components
Data Lab & Observatory
• Shared, distributed infrastructure housing large organized data
• Serves as a platform for data R&D and data science education (graduate certificates, MS)
Working Groups
• Year-long deep dive into topics of interest to members
• Produce position papers, workshops, software, events, etc.
Data Fellows Program
• Seed grants for faculty to work on consortium-approved projects; NCDS review panel evaluates proposals
• Industry internships for graduate students
• Visiting industry data scientists at member universities
Data Science Events
• Leadership Summits (Spring)
• Data Matters Short Courses (Summer)
• Student Career events (Fall/Spring)
• Invited lectures and outreach (ongoing)
6
Accomplishments: 2013-2015
Organizational:
Bylaws passed, steering committee, kickoff featuring Dr. Eric Green (Director, NHGRI) and US Rep. David Price, 10 paid memberships so far.
Programmatic:
NCDS Leadership Summit (April 2013); Faculty Fellows (5 in 2013-14, 3 in 2014-15); Student-Industry-Faculty career awareness event (April 2014, 2015); Data Innovation Showcase (May 2014, 2015); Data Matters short course series (June 2014, 2015); Working groups (3, 2014-15); Observatory active with data sets (since June 2014).
Upcoming:
• IoT and Big Data Workshops: Industrial IoT (July 2015), Smart Cities (Sept. 2015), Mobile and Environmental Health (Nov. 2015)
• North Carolina Data Science & Analytics Initiative (March 2015-2018)
7
NCDS Members
8
NCDS Data Science Partnerships
Key: Infrastructure that adapts to problems
Secure Research Workspace / Secure Medical Research Workspace: Secured virtual environments
DataBridge: Social media-like discovery of useful data sets
Genomic Medical Workflow Engine: Informatics and HPC in High Throughput Sequencing
iRODS: Policy-driven data management
ExoGENI/ADAMANT: Federated Infrastructure as a Service
9
Secure (Medical) Research Workspace
1. Safeguard Protected Health Information (PHI) data
2. Enable medical and translational research
Key Technology Across Domains
A secure “virtual desktop” where researchers can work with sensitive data
10
The secure research space
[Diagram: Research Appliances and Data Warehouses connected through Federation Management]
• Institution-specific management of data policies and security
• Federation Management: managed, secured, peer-to-peer data sharing/data federation
• Centralized management of networking and base-level security
Version 1.0 in production at UNC
11
The DataBridge: A Social Network for Data
Construct a multi-dimensional sociometric network for data.
3 Challenges
1. Develop similarity metrics applicable to scientific data sets
2. Perform community detection on the resulting set of similarities
3. Provide query interfaces on a resulting multi-dimensional network
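The three DataBridge challenges (similarity metrics, community detection, query interfaces) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data set names and metadata fields are hypothetical, Jaccard similarity stands in for whatever metrics DataBridge actually uses, and thresholded connected components are a deliberately simple substitute for a real community detection algorithm.

```python
from itertools import combinations

# Hypothetical metadata for four data sets; names and fields are illustrative only.
DATASETS = {
    "surge_2011": {"keywords": {"storm", "surge", "coast"}, "variables": {"water_level", "wind"}},
    "tide_gauges": {"keywords": {"tide", "coast", "surge"}, "variables": {"water_level"}},
    "exome_a": {"keywords": {"exome", "variant"}, "variables": {"gene", "allele"}},
    "exome_b": {"keywords": {"exome", "clinical", "variant"}, "variables": {"gene", "allele", "phenotype"}},
}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_network(datasets, dimensions):
    """Challenge 1: one similarity layer per dimension, {dimension: {(x, y): similarity}}."""
    return {
        dim: {
            (x, y): jaccard(datasets[x][dim], datasets[y][dim])
            for x, y in combinations(sorted(datasets), 2)
        }
        for dim in dimensions
    }

def communities(edges, nodes, threshold):
    """Challenge 2 (simplified): connected components over edges >= threshold, via union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path compression
            n = parent[n]
        return n

    for (x, y), sim in edges.items():
        if sim >= threshold:
            parent[find(x)] = find(y)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))

def similar_to(network, name, dimension, threshold):
    """Challenge 3: query one dimension for data sets similar to `name`, best match first."""
    hits = []
    for (x, y), sim in network[dimension].items():
        if name in (x, y) and sim >= threshold:
            hits.append((y if x == name else x, sim))
    return sorted(hits, key=lambda h: -h[1])

network = build_network(DATASETS, ["keywords", "variables"])
keyword_communities = communities(network["keywords"], DATASETS, threshold=0.4)
```

With this toy metadata, the keyword layer groups the two coastal data sets together and the two exome data sets together. Keeping one similarity layer per metadata dimension is what makes the network "multi-dimensional": the same query can be asked against keywords, variables, or any other layer.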
12
The DataBridge: Shining a Light on Data
Maximize the usefulness of data for scientific research
Facilitate searching for collaborators
Enable data set publication as a means of communication
Assist scientists in discovering “interesting” data sets by automatically forming communities of data
13
ADAMANT: Pegasus/ExoGENI
Network Infrastructure-as-a-Service (NIaaS) for workflow-driven compute applications.
Workflow triggering adaptive infrastructure
Tools for integrating workflows with adaptive infrastructure (ExoGENI).
Pegasus workflows using ExoGENI
Adapt to application demands
• Compute: Add/free compute nodes
• Storage: Allocate dynamic network storage
• Network: Dynamic bandwidth-provisioned networks between compute and storage
Integrate data movement into NIaaS
• On-ramps: Dynamic bandwidth-provisioned networks to static data repositories
Target Applications
• Montage Galactic plane ensemble: Astronomy mosaics
• CyberShake: Probabilistic Seismic Hazard Analysis
• MapSeq: High-Throughput Sequencing
14
Dynamic Workflows
1. Start Workflow
2. Create Compute Nodes
3. Compute Intensive Workflow Step
4. Destroy Compute Nodes
5. End Workflow
Dynamic Slice
• Few compute nodes for beginning steps
• Add compute nodes for parallel compute intensive step
• Dynamically provision network between cloud sites
• Free extraneous compute nodes after compute step
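The dynamic-slice idea (few nodes for the early steps, more for the parallel compute-intensive step, then freeing the extras) can be sketched as a small simulation. This is only a toy model under stated assumptions: the `Slice` class, the step names, and the one-node-per-task rule with a `max_nodes` cap are all hypothetical, and nothing here calls the real Pegasus or ExoGENI interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Slice:
    """Toy stand-in for a dynamic slice; not the ExoGENI API."""
    nodes: int = 0
    log: list = field(default_factory=list)

    def resize(self, target):
        """Add or free compute nodes until the slice holds `target` nodes."""
        if target > self.nodes:
            self.log.append(f"add {target - self.nodes} node(s)")
        elif target < self.nodes:
            self.log.append(f"free {self.nodes - target} node(s)")
        self.nodes = target

# Hypothetical workflow: (step name, number of parallel tasks in that step).
WORKFLOW = [("stage_in", 1), ("align", 8), ("merge", 1)]

def run(workflow, slice_, max_nodes=4):
    """Resize the slice before each step, then release everything at the end."""
    peak = 0
    for name, tasks in workflow:
        slice_.resize(min(tasks, max_nodes))  # one node per task, capped at max_nodes
        slice_.log.append(f"run {name} on {slice_.nodes} node(s)")
        peak = max(peak, slice_.nodes)
    slice_.resize(0)  # end of workflow: free all remaining nodes
    return peak

s = Slice()
peak_nodes = run(WORKFLOW, s)
for entry in s.log:
    print(entry)
```

The point of the design is that the workflow, not an operator, drives provisioning: each step declares its demand and the slice grows or shrinks to match, so the large allocation is held only for the duration of the compute-intensive step.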
NCGenes
Funding: National Human Genome Research Institute ($1.6M/year for 4 years)
Focus: Investigating the clinical utility of whole exome sequencing.
Methods: Coding portion of genome is captured and examined for variations that can be used in diagnosis or treatment of patient’s condition.
• Sequence also examined for actionable important medical findings.
Sequencing: UNC High Throughput Sequencing Facilities
15
NCGENES is Part of the NHGRI Clinical Sequencing Exploratory Research Consortium
RENCI provides:
• Software to manage the process, from patient enrollment through sample processing and sequencing, analysis, and data review.
• Data processing and management infrastructure.
• Annotation and analysis of variants.
• Display of results to clinical analysis teams.
16
NC Genes
[Workflow diagram: initial patient enrollment and analysis, numbered steps 1 to 21, with Pull, Push, Notify, Data, Report and Conclusions flows between components]
• Roles: Molecular Pathologist, Clinician Researcher, ELSI Researcher, Molecular analyst, Project Operations
• Processes: Dx process, Bin process, Research process
• Samples: Identified Blood Sample, Coded Blood Sample, Coded DNA, Coded Capture Library
• Systems and facilities: BSP, Wet lab, HTSF, CLIA lab, Seqware, VarDB, NC GENES Data Mart (in CDW-H), WebCIS, Data Store
17
iRODS: Opportunity
Solves data sharing pain points for research enterprise
Already has substantial market share and traction
• BGI, Sanger, Broad, NASA, other federal agencies
A major component for existing sponsored research
• NSF DFC (DataNet Federation Consortium)
Key to major national and international collaborations
• NCDS, iRODS Consortium
Basis of commercial partnerships
• DDN
Creates a leadership position in data space
Facilitates winning grants
• Direct as well as touch
18
iRODS 4.0
• Plugin Architecture
• Binary Distribution
• Resource Composition
• www.irods.org
19
iRODS Meets Software-Defined Networks
Tying together the network and the data management components of the infrastructure – through policies.
[Diagram: Policies at the center, linked to Data Management, Network Management, Application Management, Storage Resource Management, and Computational Resource Management]
• iRODS policy-driven Data Management
• Policies to direct and optimize network traffic
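To make "policy-driven" concrete, here is a minimal sketch of one policy set deciding both storage handling and network routing for a data object. It is an illustration only: the `DataObject` fields, the policy names, the thresholds, and the plan keys are all hypothetical, and the code does not use the iRODS rule language or any iRODS API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataObject:
    path: str
    size_gb: float
    sensitive: bool

# Each policy is (name, predicate, action). All names and thresholds are hypothetical.
POLICIES = [
    ("encrypt-sensitive", lambda o: o.sensitive, {"storage": "encrypted"}),
    ("bulk-transfer", lambda o: o.size_gb >= 100, {"network": "dedicated-circuit"}),
    ("replicate-all", lambda o: True, {"replicas": 2}),
]

DEFAULTS = {"storage": "standard", "network": "shared", "replicas": 1}

def plan_for(obj, policies=POLICIES):
    """Evaluate every policy against the object and merge the matching actions.

    Policies are applied in order; a later match overrides an earlier one
    for the same key.
    """
    plan = dict(DEFAULTS)
    fired = []
    for name, applies, action in policies:
        if applies(obj):
            plan.update(action)
            fired.append(name)
    return plan, fired

genome = DataObject("/zone/home/lab/run42.bam", size_gb=250.0, sensitive=True)
plan, fired = plan_for(genome)
print(plan, fired)
```

Because storage and network decisions come out of the same policy evaluation, a single rule change (for example, lowering the bulk-transfer threshold) adjusts data placement and traffic routing together, which is the coupling the slide describes.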
20
iRODS Consortium
• Membership model for adoption, growth, and support of iRODS
• Members: RENCI, The DICE Center, DataDirect Networks, Seagate, The Wellcome Trust Sanger Institute, EMC Corporation
Membership benefits
• Free and prioritized support, training, and documentation
• Voting and participation in determining release roadmaps, testing, certification, standards
• Co-marketing and branding
• Involvement in planning and governance
Strategic partnerships
• Dual support models with consortium
• Proprietary extensions and kits
• New development and technical directions
21
iRODS Consortium Members
22
Data-driven Hazards Research
• Data Analysis: NOAA NOS gauge data, USGS data, US DHS/FEMA-collected high-water marks, and meteorological forecasts from NOAA’s NCEP and NHC
• Statistical Forecasting: Very large pre-existing datasets; provides early guidance information, available about 10 minutes after official NHC forecast storm advisory
• US DHS-funded research activity through the DHS Coastal Hazards Center of Excellence at the University of North Carolina at Chapel Hill
• Winner, DHS Science & Technology Impact Award, 2012
Storm Surge Forecasting with the ADCIRC storm surge and tide model
Collaborations with: U Delaware, Oklahoma, Nat’l Hurricane Center, Notre Dame, OPeNDAP.org, Cornell, UNC, Applied Research Associates, USACE
23
Data Science Graduate Education
Modular courses for an 11-month program
• Graduate Certificate in Data Science (Half time)
• MS in Data Science (Full time)
24
Workforce Training
25
Data Matters Courses for 2015
26
Conclusion
Developing Data Science Will:
• Develop the next generation of data science experts and leaders
• Create strategies, practices and scientific methods for understanding data
• Enable more collaborations among data and domain scientists, business, academia and government
• Assist those who are struggling to collect, analyze, manage and use data
• Establish methodologies for measuring the value and impact of data
THANK YOU! Ashok Krishnamurthy