komatsoulis internet2 global forum 2015
TRANSCRIPT
George A. Komatsoulis, Ph.D.
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
U.S. Department of Health and Human Services
NIH Data Initiatives: Harnessing Big (and small) Data to Improve Health
NIH: The Nation's Medical Research Agency
Mission: “To seek fundamental knowledge about the nature and behavior of living systems and the application of that knowledge to enhance health, lengthen life, and reduce illness and disability.”
Composed of 27 Institutes and Centers
Annual Budget = $30.3B
80% of NIH budget goes to about 50,000 grants
The Current “Big Data” problem in biology
All of this has happened before…
[Figure: timeline with axis running from 1960 to 2020]
… and other scientific fields have similar problems
Sensor Stream = 500 EB/day; Stores 69 TB/day
Collection = 14 EB/day; Store 1 PB/day
Total Data = 14 PB; Store an average of 3.3 TB/day for 10 years!
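The gap between what instruments generate and what is actually stored can be made concrete with a little arithmetic. The following is a minimal sketch, assuming decimal (SI) prefixes (1 PB = 10^3 TB, 1 EB = 10^6 TB); the slide does not say which convention was intended, and the function names are purely illustrative.

```python
# Terabytes per unit, assuming decimal (SI) prefixes. This is an assumption;
# the slide does not state whether decimal or binary prefixes were meant.
TB_PER_UNIT = {"TB": 1, "PB": 10**3, "EB": 10**6}


def to_tb(value, unit):
    """Convert a (value, unit) pair to terabytes."""
    return value * TB_PER_UNIT[unit]


def reduction_factor(generated, stored):
    """How many times more data is generated per day than is stored per day."""
    return to_tb(*generated) / to_tb(*stored)


# Figures quoted on the slide: (generated per day, stored per day).
examples = {
    "Sensor Stream": ((500, "EB"), (69, "TB")),
    "Collection": ((14, "EB"), (1, "PB")),
}

for name, (generated, stored) in examples.items():
    factor = reduction_factor(generated, stored)
    print(f"{name}: roughly 1 part in {factor:,.0f} is stored")
```

Under these assumptions, both examples keep only a tiny fraction of the raw stream, which is the point of the comparison with biomedical data.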
Although biomedicine has some unique challenges
NIH BD2K: Big Data to Knowledge
Launched to support biomedical data science research
Support for multiple facets of data science:
BD2K Centers
Data and Software Discovery
Standards and Interoperability
Training and Workforce Development
The Commons
Led by Dr. Phil Bourne, NIH Associate Director for Data Science
Standard Model of Biomedical Computation
Public Data Repositories
Local Data
University
Locally Developed Software
Publicly Available Software
Local storage and compute resources
The Commons: A shared virtual space
Is scalable and exploits new computing models
Is more cost effective given digital growth
Simplifies sharing digital research objects such as data, software, metadata and workflows
Makes digital research objects more FAIR: Findable, Accessible, Interoperable and Reusable
DOES NOT replace existing, well-curated databases
Phil Bourne, 2014
The Commons: Conceptual Framework
The Commons
Digital Objects (with identifiers)
Search (Indexed Metadata and API)
Computing Platform
Open APIs
Software Encapsulation
Commons Federation (Infrastructure)
BD2K Centers
DDICC (Search)
Existing Resources
Indexes Methods
Content
Investigator
Works In
Searches
Commons Federation (Infrastructure)
Conformant Provider A
Conformant Provider B
Conformant Provider C
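The layers named in the framework (digital objects with identifiers, and search over indexed metadata) can be illustrated with a toy in-memory index. This is only a sketch under stated assumptions: the class names, the `example:` identifier strings, and the metadata fields are hypothetical and are not taken from any Commons specification.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    identifier: str   # hypothetical persistent identifier
    kind: str         # e.g. "data", "software", "metadata", "workflow"
    metadata: dict = field(default_factory=dict)


class MetadataIndex:
    """Toy inverted index: lower-cased metadata word -> set of identifiers."""

    def __init__(self):
        self._objects = {}
        self._index = {}

    def register(self, obj):
        """Store the object and index every word of its metadata values."""
        self._objects[obj.identifier] = obj
        for value in obj.metadata.values():
            for word in str(value).lower().split():
                self._index.setdefault(word, set()).add(obj.identifier)

    def resolve(self, identifier):
        """Look an object up directly by its identifier."""
        return self._objects[identifier]

    def search(self, query):
        """Return objects whose metadata contains every word of the query."""
        words = query.lower().split()
        if not words:
            return []
        hits = set.intersection(*(self._index.get(w, set()) for w in words))
        return [self._objects[i] for i in sorted(hits)]


index = MetadataIndex()
index.register(DigitalObject("example:obj-1", "data",
                             {"title": "Tumor exome variants"}))
index.register(DigitalObject("example:obj-2", "software",
                             {"title": "Variant caller", "language": "Python"}))

print([o.identifier for o in index.search("tumor exome")])
```

The sketch matches whole words only; a real discovery index would need stemming, controlled vocabularies, and access control, none of which are modeled here.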
The Commons: Business Model
Researcher
Discovery Index
The Commons
Cloud Provider C
Cloud Provider B
Cloud Provider A
NIH
Provides Digital Objects
Retrieves/Uses Digital Objects
Option: Fund Providers to Support NIH Directed Resources
Indexes Commons
Provide Credits
Uses Credits
Finds Objects
Commons Implemented as a federation of ‘conformant’ cloud providers and HPC environments
Funded primarily by providing credits to investigators
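The credit flow described here (NIH provides credits, and investigators spend them at a conformant provider of their choice) can be sketched as a simple ledger. This is an illustrative sketch only: the class, the investigator name, and the credit amounts are hypothetical, and the talk does not describe the actual accounting mechanism.

```python
class CreditLedger:
    """Toy ledger for credits spendable only at conformant providers."""

    def __init__(self, conformant_providers):
        self.providers = set(conformant_providers)
        self.balances = {}   # investigator -> remaining credits
        self.usage = {}      # provider -> credits spent there

    def provide_credits(self, investigator, amount):
        """Credit an investigator's account (the 'Provide Credits' arrow)."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balances[investigator] = self.balances.get(investigator, 0) + amount

    def use_credits(self, investigator, provider, amount):
        """Spend credits at one provider (the 'Uses Credits' arrow)."""
        if provider not in self.providers:
            raise ValueError(f"{provider} is not a conformant provider")
        if amount <= 0:
            raise ValueError("amount must be positive")
        if self.balances.get(investigator, 0) < amount:
            raise ValueError("insufficient credits")
        self.balances[investigator] -= amount
        self.usage[provider] = self.usage.get(provider, 0) + amount


ledger = CreditLedger({"Cloud Provider A", "Cloud Provider B", "Cloud Provider C"})
ledger.provide_credits("investigator-1", 100)
ledger.use_credits("investigator-1", "Cloud Provider A", 30)
ledger.use_credits("investigator-1", "Cloud Provider B", 50)
print(ledger.balances, ledger.usage)
```

Because the investigator chooses where each credit is spent, providers compete for usage, which is the mechanism the model relies on.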
Potential Advantages of this Model
Cost effective: Only pay for IT support used
Drives competition: Better services at lower cost
Supports data sharing by driving science into the Commons
Facilitates public-private partnership
Scalable to most categories of data expected in the next 5 years.
Potential Disadvantages of this Model
Novelty: Never been tried, so we don’t have data about likelihood of success
Cost Models: Predicated on stable or declining prices among providers. True for the last several years, but we can’t guarantee that it will continue, particularly if there is significant consolidation in industry
Service Providers: Predicated on service providers willing to make the investment to become conformant. Market research suggests 3-5 providers within 2-3 months of program launch
Persistence: The model is ‘Pay As You Go’, which means if you stop paying it stops going. Giving investigators an unprecedented level of control over what lives (or dies) in the Commons
Business Model in Practice
Investigator
Reseller of Cloud Services
The Commons
Cloud Provider C
Cloud Provider B
Cloud Provider A
Investigator Institution
Directs reseller to distribute credits
Instructs provider to put credits on investigator account
Review
NIH
Approves Credit Request
Requests Credits
Uses credits
Distributes Credits To Investigator
[Diagram: the steps are numbered 1 to 7]
What does it mean to be conformant?
Minimum set of requirements for:
Business relationships (reseller, investigators)
Interfaces (upload, download, manage, compute)
Capacity (storage, compute)
Networking and Connectivity
Information Assurance
Authentication and authorization
Still need to work out details of how to manage approval of conformance
A conformant cloud ≠ an IaaS provider
Draft specification out for comment among vendors
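One way to picture a "minimum set of requirements" is as a checklist in which a provider must satisfy every category. The sketch below is a minimal illustration: the category names come from the slide, but the data layout and the sample provider are hypothetical, and the actual approval process was still undecided at the time of the talk.

```python
# Requirement categories as listed on the slide.
REQUIRED_CATEGORIES = (
    "business relationships",
    "interfaces",
    "capacity",
    "networking and connectivity",
    "information assurance",
    "authentication and authorization",
)


def missing_requirements(capabilities):
    """Return the required categories a provider has not demonstrated.

    `capabilities` maps a category name to True/False (hypothetical layout).
    """
    return [c for c in REQUIRED_CATEGORIES if not capabilities.get(c, False)]


def is_conformant(capabilities):
    """A provider is conformant only if no required category is missing."""
    return not missing_requirements(capabilities)


# Hypothetical provider that offers raw infrastructure and nothing else.
bare_iaas = {"capacity": True, "networking and connectivity": True}

print(is_conformant(bare_iaas))
print(missing_requirements(bare_iaas))
```

The sample provider fails the check despite offering storage, compute, and networking, which mirrors the point that a conformant cloud is not the same thing as an IaaS provider.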
Pilot of the Commons Business Model
Phase 0: Build the plumbing
Phase 1: Pilot the model on a small number of investigators experienced with cloud computing, probably within the context of BD2K awards
Phase 2: Open the Commons credit process to grantees from a subset of NIH Institutes and Centers
Phase 3: Open the process to all NIH grantees
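The phased rollout can be read as a widening eligibility rule. The following is a minimal sketch with hypothetical field names and a hypothetical set of participating Institutes and Centers; the talk does not say which ones would take part in Phase 2. Phase 1's "probably within the context of BD2K awards" is modeled as a hard requirement purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Grantee:
    institute: str            # funding Institute or Center (hypothetical code)
    cloud_experienced: bool   # has prior experience with cloud computing
    has_bd2k_award: bool      # holds a BD2K award


def eligible_for_credits(grantee, phase, participating_ics=frozenset()):
    """Return True if the grantee could request credits in the given phase."""
    if phase == 0:
        return False  # infrastructure is being built; no credits issued yet
    if phase == 1:
        # Assumption: the pilot is restricted to cloud-experienced BD2K awardees.
        return grantee.cloud_experienced and grantee.has_bd2k_award
    if phase == 2:
        return grantee.institute in participating_ics
    if phase == 3:
        return True  # open to all NIH grantees
    raise ValueError(f"unknown phase: {phase}")


pilot_user = Grantee("IC-1", cloud_experienced=True, has_bd2k_award=True)
later_user = Grantee("IC-2", cloud_experienced=False, has_bd2k_award=False)

for phase in range(4):
    print(phase,
          eligible_for_credits(pilot_user, phase, {"IC-1"}),
          eligible_for_credits(later_user, phase, {"IC-1"}))
```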
NCI GDC and Cloud Pilots
QA/QC
Validation
Aggregation
Authoritative NCI Reference Data Set
Data Coordinating Center
NCI Genomic Data Commons (under development)
NCI Clouds
High Performance Computing
Search/Retrieve
Download
Analysis
Secure Computational Capacity, Pre-loaded Data (×3)
NCI Genomics Consortium
NCI Genomic Data Repositories
Acknowledgements
NIH Office of ADDS: Vivien Bonazzi, Ph.D.; Philip Bourne, Ph.D.; Michelle Dunn, Ph.D.; Mark Guyer, Ph.D.; Jennie Larkin, Ph.D.; Leigh Finnegan; Beth Russell
NCBI: Dennis Benson, Ph.D.; Alan Graeff; David Lipman, MD; Jim Ostell, Ph.D.; Don Preuss; Steve Sherry