genomics and computation in precision medicine march 2017
Insights from Genomics and Computational Biology that inform Precision Medicine
Warren Kibbe, PhD
@wakibbe
March 6th, 2017
1. Background
2. Data in Biomedicine
3. Data Sharing
4. Data Commons
5. Genomics and Computation
Thanks to many folks for slides, but especially Jerry Lee
In 2016 there were an estimated 1,700,000 new cancer cases and 600,000 cancer deaths
- American Cancer Society
Cancer remains the second most common cause of death in the U.S.
- Centers for Disease Control and Prevention
In 2016 there were an estimated 15,500,000 cancer survivors in the U.S.
Understanding Cancer
Precision medicine will lead to fundamental understanding of the complex interplay between genetics, epigenetics, nutrition, environment and clinical presentation and direct effective, evidence-based prevention and treatment.
Cancer is a grand challenge
Requires:
• Deep biological understanding
• Advances in scientific methods
• Advances in instrumentation
• Advances in technology
• Data and computation
Cancer Research and Care generate detailed data that is critical to creating a learning health system for cancer
The primary goal of TCGA is to identify disease subtypes and generate a comprehensive catalog of all cancer genes and pathways
Pilot phase started in 2006 -- 3 tumor types, 500 T/N pairs each
Extended in 2009 -- 20 x 500 T/N + 15 x (50-150) T/N
Freely available genomic data enable researchers anywhere around the world to make and validate important discoveries
Biological Scales
Digital technology is changing rapidly
Cancer therapy is changing
Application of Cancer Genomics is changing
Biology and Medicine are now data intensive enterprises
Scale is rapidly changing
Technology, data, computing and IT are pervasive in the lab, the clinic, in the home, and across the population
“...an advantage of machine learning is that it can be used even in cases where it is infeasible or difficult to write down explicit rules to solve a problem...”
https://www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf
“there is great potential for new insights to come from the combined analysis of cancer proteomic and genomic data, as proteomic data can now reproducibly provide information about protein levels and activities that are difficult or impossible to infer from genomic data alone”
Douglas R. Lowy, MD Acting Director of the National Cancer Institute, National Institutes of Health
5/25/2016
Genomics is the beginning of precision medicine, not the end!
Plumbing, architecture, genomics
Cancer Research Data Ecosystem – Cancer Moonshot BRP
[Diagram labels: Well characterized research data sets; Cancer cohorts; Patient data; EHR, Lab Data, Imaging, PROs, Smart Devices, Decision Support; Learning from every cancer patient; Active research participation; Research information donor; Clinical Research; Observational studies; Proteogenomics; Imaging data; Clinical trials; Discovery; Patient engaged Research; Surveillance; Big Data; Implementation research; SEER; GDC]
Data Commons Framework
Data Commons Structure
Data Contributors and Consumers
Researchers, Patients, Clinicians, Institutions
NCI Thesaurus, caDSR
NLM UMLS, RxNorm, LOINC, SNOMED
Cancer Data Sharing & Data Commons
• Support open science
• Support data reusability
• Aligned with Cancer Moonshot
• Part of Precision Medicine
• Reduce Health Disparities
• Improve patient access to clinical trials
• Work toward a learning National Cancer Data Ecosystem
Reduce the risk, improve early detection, outcomes and survivorship in cancer
Cancer Data Sharing Efforts
Signature Efforts / Data
BRCA Challenge; Somatic variant sharing
Isolated genetic variants; No raw sequencing data
Precision medicine questions; Somatic variant sharing
Panel gene resequencing; Clinical response
Clinical trial; Public-private partnerships
Comprehensive genomics; Detailed clinical phenotype data
Clinical trial access; Clinical/genomic data aggregation
EHR data; Clinical sequencing
Clinical oncology standards; EHR data; Clinical sequencing
Changing the conversation around data sharing
How do we find data, software, standards?
How can we make data, annotations, software, metadata accessible?
How do we reuse data standards?
How do we make more data machine readable?
NIH Data Commons
NCI Genomic Data Commons
National Cancer Data Ecosystem
Data Commons co-locate data, storage and computing infrastructure, and frequently used tools for analyzing and sharing data to create an interoperable resource for the research community.*
*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson, A Case for Data Commons Towards Data Science as a Service, to appear.
Source of image: Interior of one of Google’s Data Centers, www.google.com/about/datacenters/.
Basic Ingredients of a Cancer Research Data Ecosystem
• Open Science. Supporting Open Access, Open Data, Open Source, and Data Liquidity for the cancer community
• Standardization through CDEs and Case Report Forms
• Interoperability by exposing existing knowledge through appropriate integration of ontologies, vocabularies and taxonomies
• Consistency of curation and capture of evidence (tissue types, recurrence, therapy – including genomic, epigenomic, etc. context)
• Sustainable models for informatics infrastructure, services, data, metadata
GDC as an example of a new architecture for storing and sharing cancer data
The Cancer Genomic Data Commons (GDC) is an existing effort to standardize and simplify submission of genomic data to NCI and follow the principles of FAIR – Findable, Accessible, Attributable, Interoperable, Reusable, and Provide Recognition.
The GDC is part of the NIH Big Data to Knowledge (BD2K) initiative and an example of the NIH Commons
Genomic Data Commons
Microattribution, nanopublications, tracking the use of data, annotation of data, use of algorithms, supports the data/software/metadata life cycle to provide credit and analyze impact of data, software, analytics, algorithm, curation and knowledge sharing
Force11 white paper: https://www.force11.org/group/fairgroup/fairprinciples
The NCI Genomic Data Commons
• Unify fragmentary repositories
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
• Harmonization of raw sequence both from existing and new cancer research programs
• Application of state-of-the-art methods of generating derived genomic data
• Provide the foundation for:
  • Identification of high- and low-frequency cancer drivers
  • Defining genomic determinants of response to therapy
  • Clinical trial cohorts sharing targeted genetic lesions
NCI Genomic Data Commons
The GDC went live on June 6, 2016 with approximately 4.1 PB of data.
577,878 files about 14,194 cases (patients), in 42 cancer types, across 29 primary disease sites, 400 clinical data elements
10 major data types, ranging from Raw Sequencing Data, Raw Microarray Data, to Copy Number Variation, Simple Nucleotide Variation and Gene Expression.
Data are derived from 17 different experimental strategies, with the major ones being RNA-Seq, WXS, WGS, miRNA-Seq, Genotyping Array and Expression Array.
Foundation Medicine announced the release of 18,000 genomic profiles to the GDC at the Cancer Moonshot Summit, June 29th, 2016
The Multiple Myeloma Research Foundation announced it would be releasing its CoMMpass study of more than 1000 cases of Multiple Myeloma on Sept 29, 2016.
Content in the Genomic Data Commons
Current
• TCGA: 11,353 cases
• TARGET: 3,178 cases
Coming soon
• Foundation Medicine: 18,000 cases
• Cancer studies in dbGaP: ~4,000 cases
• Multiple Myeloma RF: ~1,000 cases
Planned (1-3 years)
• NCI-MATCH: ~3,000 cases
• Clinical Trial Sequencing Program: ~3,000 cases
• Cancer Driver Discovery Program: ~5,000 cases
• Human Cancer Model Initiative: ~1,000 cases
• APOLLO – NCI, VA and DoD: ~8,000 cases
Total: ~58,000 cases
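The ~58,000 figure can be checked by adding up the per-program case counts listed on the slide. The short sketch below does that arithmetic; the dictionary keys are just labels copied from the slide, and the values marked "~" there are approximate, so the total is approximate too.

```python
# Case counts as listed on the slide; values shown with "~" there are approximate.
cases = {
    "TCGA": 11_353,
    "TARGET": 3_178,
    "Foundation Medicine": 18_000,
    "Cancer studies in dbGaP": 4_000,
    "Multiple Myeloma RF": 1_000,
    "NCI-MATCH": 3_000,
    "Clinical Trial Sequencing Program": 3_000,
    "Cancer Driver Discovery Program": 5_000,
    "Human Cancer Model Initiative": 1_000,
    "APOLLO": 8_000,
}

total = sum(cases.values())
rounded_total = round(total, -3)  # round to the nearest thousand

print(f"Sum of listed cases: {total:,}")
print(f"Rounded to nearest thousand: {rounded_total:,}")
```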
GDC Data Harmonization
Multiple data types and levels of processing
(data type → 1° processing → 2° processing → 3° processing)
Exome-seq → Genome alignment → Mutations → Oncogene vs. Tumor suppressor
Whole genome-seq → Genome alignment → Mutations + structural variants → Translocations
RNA-seq → Genome alignment → Digital gene expression → Relative RNA levels, Alternative splicing
Copy number → Data segmentation → Copy number calls → Gene amplification/deletion
GDC Data Harmonization
Open Source, Dockerized Pipelines
Mutect2 pipeline
GDC Data Harmonization
Multiple pipelines needed to recover all variants
Recovery rate (% true positives) A0F0
SomaticSniper 81.1% 76.5%
VarScan 93.9% 84.3%
MuSE 93.1% 87.3%
All Three 96.4% 91.2%
GDC variant calling pipelines: Wash U, Baylor, Broad
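The point of the recovery-rate table, that the union of several callers recovers more validated variants than any single caller, can be illustrated with a small sketch. The variant tuples, caller names, and `recovery_rate` helper below are all hypothetical toy values, not GDC data and not the GDC pipeline code.

```python
def recovery_rate(calls, truth):
    """Fraction of validated (true) variants that appear in a call set."""
    return len(calls & truth) / len(truth)


# Hypothetical toy data: each variant is (chromosome, position, alt_allele).
truth = {
    ("chr1", 100, "A"),
    ("chr1", 250, "T"),
    ("chr2", 75, "G"),
    ("chr3", 900, "C"),
    ("chr7", 42, "T"),
}

# Hypothetical call sets from three callers; each misses different variants.
caller_calls = {
    "caller_a": {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr2", 75, "G"), ("chr9", 5, "A")},
    "caller_b": {("chr1", 100, "A"), ("chr2", 75, "G"), ("chr3", 900, "C")},
    "caller_c": {("chr1", 100, "A"), ("chr1", 250, "T"), ("chr3", 900, "C")},
}

rates = {name: recovery_rate(calls, truth) for name, calls in caller_calls.items()}

# Combining callers: take the union of all call sets.
union_calls = set().union(*caller_calls.values())
union_rate = recovery_rate(union_calls, truth)

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
print(f"all three (union): {union_rate:.0%}")
```

In this toy set the union recovers more true variants than any one caller but still misses one, and it also inherits every caller's false positives (the `chr9` call), which is why combining pipelines usually goes together with downstream filtering.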
Development of the NCI Genomic Data Commons (GDC)
To Foster the Molecular Diagnosis and Treatment of Cancer
GDC
Bob Grossman, PI, Univ. of Chicago
Ontario Inst. Cancer Res.
Leidos
Institute of Medicine, Towards Precision Medicine, 2011
Discovery of Cancer Drivers With 2% Prevalence
Lung adeno. +2,900
Colorectal +1,200
Ovarian +500
Lawrence et al, Nature 2014
Power Calculation for Cancer Driver Discovery
Need to resequence >100,000 tumors to identify all cancer drivers at >2% prevalence
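To give a feel for why low-prevalence drivers demand large cohorts, here is a deliberately simplified binomial sketch. It is not the published power calculation: it only asks how many tumors must be sequenced so that a gene mutated in a fraction `p` of tumors shows up in at least `k` of them with a given probability. It ignores background mutation rates and multiple-testing correction, so it understates the sample sizes a real analysis needs. The function names and the choice of `k` and `power` are assumptions for illustration.

```python
from math import comb


def prob_at_least(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing >= k mutated tumors."""
    below = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - below


def min_samples(p, k, power, n_max=100_000):
    """Smallest n with P(X >= k) >= power, or None if not reached by n_max."""
    for n in range(k, n_max + 1):
        if prob_at_least(n, p, k) >= power:
            return n
    return None


# Illustrative settings only: require at least 3 mutated tumors, 90% of the time.
for prevalence in (0.10, 0.05, 0.02):
    n = min_samples(prevalence, k=3, power=0.9)
    print(f"prevalence {prevalence:.0%}: need at least {n} tumors (toy model)")
```

Even in this toy model the required cohort grows quickly as prevalence falls; once a driver signal has to be distinguished from background mutation across thousands of genes, the numbers rise much further.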
The NCI Cancer Genomics Cloud Pilots
Understanding how to meet the research community’s need to
analyze large-scale cancer genomic and clinical data
NCI Cancer Genomics Cloud Pilots
Democratize access to NCI-generated genomic and related data, and to create a cost-effective way to provide scalable computational capacity to the cancer research community.
Cloud Pilots provide:
• Access to large genomic data sets without need to download
• Access to popular pipelines and visualization tools
• Ability for researchers to bring their own tools and pipelines to the data
• Ability for researchers to bring their own data and analyze in combination with existing genomic data
• Workspaces, for researchers to save and share their data and results of analyses
Three NCI Genomics Cloud Pilots
Broad Institute
• PI: Gad Getz
• Google Cloud
• Firehose in the cloud including Broad best practices workflows
• http://firecloud.org
Institute for Systems Biology
• PI: Ilya Shmulevich
• Google Cloud
• Leverage Google infrastructure; Novel query and visualization
• http://cgc.systemsbiology.net/
Seven Bridges Genomics
• PI: Deniz Kural
• Amazon Web Services
• Interactive data exploration; > 30 public pipelines
• http://www.cancergenomicscloud.org
[Timeline phases: Selection, Design/Build I, Design/Build II, Evaluation, Extension. Dates shown: Jan 2014, Sept 2014, April 2015, Jan 2016, Sept 2016]
Broad Institute Cloud Pilot
• Targeted at users performing analyses at scale
• Modeled after their Firehose analysis infrastructure developed for the TCGA program
• Users can upload their own data and tools and/or run the Broad’s best practice tools and pipelines on pre-loaded data
Institute for Systems Biology Cloud Pilot
PI / Biologist: web access
Computational Research Scientist: Python, R, SQL
Algorithm Developer: ssh, programmatic access
[Diagram labels: ISB-CGC Web App; Google Cloud Console; Google APIs; ISB-CGC APIs; Compute Engine VMs; Cloud Storage; BigQuery; Genomics; Local Storage; ISB-CGC Hosted Data; Controlled-Access Data; Open-Access Data; User Data]
• Closely tied with Google Cloud Platform tools including BigQuery, App Engine, Cloud Datalab, Google Genomics, and Compute Engine
• Aggregated TCGA data in BigQuery allows fast SQL-like queries across the entire dataset
• Web interface allows scientists to interactively compare and define cohorts
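The idea of defining a cohort with a SQL-like query over aggregated data can be sketched locally. The example below uses Python's built-in `sqlite3` module as a stand-in for BigQuery, with two hypothetical tables (`clinical`, `mutations`) and made-up rows; the actual ISB-CGC table names and schemas are not given in these slides and will differ.

```python
import sqlite3

# In-memory database standing in for a hosted, aggregated dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clinical (case_id TEXT PRIMARY KEY, disease TEXT, age_at_diagnosis INTEGER)"
)
conn.execute("CREATE TABLE mutations (case_id TEXT, gene TEXT)")

# Hypothetical toy rows, not real patient data.
conn.executemany(
    "INSERT INTO clinical VALUES (?, ?, ?)",
    [
        ("case-001", "BRCA", 52),
        ("case-002", "BRCA", 61),
        ("case-003", "LUAD", 67),
        ("case-004", "BRCA", 45),
    ],
)
conn.executemany(
    "INSERT INTO mutations VALUES (?, ?)",
    [
        ("case-001", "TP53"),
        ("case-001", "PIK3CA"),
        ("case-002", "TP53"),
        ("case-003", "TP53"),
        ("case-003", "KRAS"),
        ("case-004", "PIK3CA"),
    ],
)


def cohort(conn, disease, gene):
    """Case IDs with the given disease code and a mutation in the given gene."""
    rows = conn.execute(
        """
        SELECT DISTINCT c.case_id
        FROM clinical AS c
        JOIN mutations AS m ON m.case_id = c.case_id
        WHERE c.disease = ? AND m.gene = ?
        ORDER BY c.case_id
        """,
        (disease, gene),
    )
    return [row[0] for row in rows]


brca_tp53 = cohort(conn, "BRCA", "TP53")
print(brca_tp53)
```

The same join-and-filter pattern is what makes a columnar, aggregated store attractive: a cohort definition becomes one declarative query rather than a download-and-parse step per file.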
Seven Bridges Genomics Cloud Pilot
• Built upon the SBG commercial cloud-based genomics platform
• Graphical query interface to identify hosted data of interest
• Includes a native implementation of the Common Workflow Language specification and graphical interface for creating user-defined workflows
Workspace – isolated environment for collaborative analysis
Data + Methods → Results
• sample data and metadata (e.g. BAMs, tissue type)
• algorithms (e.g. mutation calling)
• Wiring logic (e.g. use the exome capture BAM)
• executions and results (e.g. run mutation caller v41 on this exact BAM and track results)
Slide courtesy of Broad Institute
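The workspace model above (data, methods, wiring logic, and tracked executions) can be sketched as a small data structure. This is only an illustration of the concept under assumed names: `Sample`, `Method`, `Execution`, and `Workspace` are hypothetical classes, not the FireCloud data model or API, and the file path and caller are invented.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Sample:
    """Sample data and metadata, e.g. BAM paths and tissue type."""
    sample_id: str
    files: Dict[str, str]
    metadata: Dict[str, str]


@dataclass(frozen=True)
class Method:
    """A versioned algorithm, e.g. a mutation caller."""
    name: str
    version: str
    func: Callable[[str], str]


@dataclass(frozen=True)
class Execution:
    """Record of exactly which method version ran on which input."""
    sample_id: str
    method_name: str
    method_version: str
    input_path: str
    result: str


@dataclass
class Workspace:
    samples: Dict[str, Sample] = field(default_factory=dict)
    executions: List[Execution] = field(default_factory=list)

    def run(self, sample_id: str, method: Method, input_key: str) -> Execution:
        sample = self.samples[sample_id]
        # Wiring logic: choose which of the sample's files feeds the method.
        input_path = sample.files[input_key]
        result = method.func(input_path)
        execution = Execution(sample_id, method.name, method.version, input_path, result)
        self.executions.append(execution)  # keep the record for provenance
        return execution


# Hypothetical usage with a placeholder "caller" that just labels its input.
ws = Workspace()
ws.samples["S1"] = Sample(
    sample_id="S1",
    files={"exome_capture_bam": "/data/S1.exome.bam"},
    metadata={"tissue_type": "tumor"},
)
caller = Method(name="mutation_caller", version="v41", func=lambda path: f"calls for {path}")
record = ws.run("S1", caller, input_key="exome_capture_bam")
print(record)
```

Keeping the execution record separate from the data and the method is what lets collaborators see exactly which version produced which result.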
GDC Acknowledgements
NCI Center for Cancer Genomics: Lou Staudt, Zhining Wang, Martin Ferguson, JC Zenklusen, Daniela Gerhard, Deb Steverson
Univ. of Chicago: Bob Grossman, Allison Heath, Mike Ford, Zhenyu Zhang
Ontario Institute for Cancer Research: Vincent Ferretti, Francois Gerthoffert, JunJun Zhang
Leidos Biomedical Research: Mark Jensen, Sharon Gaheen, Himanso Sahni
NCI CBIIT: Tony Kerlavage, Tanja Davidsen
Cancer Genomics Project Teams
CGC Pilot Team Principal Investigators
• Gad Getz, Ph.D - Broad Institute - http://firecloud.org
• Ilya Shmulevich, Ph.D - ISB - http://cgc.systemsbiology.net/
• Deniz Kural, Ph.D - Seven Bridges - http://www.cancergenomicscloud.org
NCI Project Officer & CORs
• Anthony Kerlavage, Ph.D – Project Officer
• Juli Klemm, Ph.D – COR, Broad Institute
• Tanja Davidsen, Ph.D – COR, Institute for Systems Biology
• Ishwar Chandramouliswaran, MS, MBA – COR, Seven Bridges Genomics
GDC Principal Investigator
• Robert Grossman, Ph.D - University of Chicago
• Allison Heath, Ph.D - University of Chicago
• Vincent Ferretti, Ph.D - Ontario Institute for Cancer Research
NCI Leadership Team
• Doug Lowy, M.D.
• Lou Staudt, M.D., Ph.D.
• Stephen Chanock, M.D.
• George Komatsoulis, Ph.D.
• Warren Kibbe, Ph.D.
Center for Cancer Genomics Partners
• JC Zenklusen, Ph.D.
• Daniela Gerhard, Ph.D.
• Zhining Wang, Ph.D.
• Liming Yang, Ph.D.
• Martin Ferguson, Ph.D.
Integrated data sets, interoperable resources, harmonized data are necessary for and enable biologically informed cancer predictive models
www.cancer.gov www.cancer.gov/espanol
NIH Genomic Data Sharing Policy
https://gds.nih.gov/
Went into effect January 25, 2015
NCI guidance: http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data
Requires public sharing of genomic data sets