genomics and computation in precision medicine march 2017

Insights from Genomics and

Computational Biology that inform

Precision MedicineWarren Kibbe, PhD

[email protected]

@wakibbe

March 6th, 2017

2

1. Background2. Data in Biomedicine3. Data Sharing4. Data Commons5. Genomics and

Computation

Thanks to many folks for slides, but especially Jerry Lee

3

In 2016 there were an estimated1,700,000 new cancer cases

and600,000 cancer deaths

- American Cancer Society

Cancer remains the second most common cause of death in the U.S.

- Centers for Disease Control and Prevention

4

In 2016 there are an estimated

15,500,000 cancer survivors in the U. S.

5

Understanding Cancer

Precision medicine will lead to fundamental understanding of the complex interplay between genetics, epigenetics, nutrition, environment and clinical presentation and direct effective, evidence-based prevention and treatment.

6

Cancer is a grand challenge

Deep biological understanding

Advances in scientific methods

Advances in instrumentation

Advances in technology

Data and computation

Cancer Research and Care generatedetailed data that is critical to create a learning health system for cancer

Requires:

The primary goal of TCGA is to identify disease subtypes and generate a comprehensive catalog of all cancer genes and pathways

Pilot phase started in 2006 -- 3 tumor types 500 T/N pairs eachExtended in 2009 -- 20 x 500 T/N + 15 x (50-150) T/N

Freely available genomic data enable researchers anywhere around the world to make and validate important discoveries

Biological Scales

14

Digital technology is changing rapidly

15

Cancer therapy is changing

16

Application of Cancer Genomics is changing

19

Biology and Medicine are now data intensive enterprises

Scale is rapidly changingTechnology, data, computing and IT are pervasive in the lab, the clinic, in the home, and across the population

22

“...an advantage of machine learning is that it can be used even in cases where it is infeasible or difficult to write down explicit rules to solve a problem...”

https://www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf

25

" there is great potential for new insights to come from the combined analysis of cancer proteomic and genomic data, as proteomic data can now reproducibly provide information about protein levels and activities that are difficult or impossible to infer from genomic data alone ”

Douglas R. Lowy, MD Acting Director of the National Cancer Institute, National Institutes of Health

5/25/2016

26

Genomics is the beginning of precision medicine, not the end!

27

Plumbing, architecture, genomics

Cancer Research Data Ecosystem – Cancer Moonshot BRP

Well characterized research data

sets

Cancer cohorts Patient data

EHR, Lab Data, Imaging, PROs, Smart Devices,

Decision Support

Learning from everycancer patient

Active researchparticipation

Research informationdonor

Clinical ResearchObservational studies

ProteogenomicsImaging dataClinical trials

Discovery Patient engaged Research

SurveillanceBig Data

Implementation research

SEERGDC

Data Commons Framework

30

Data Commons Structure

Data Contributors and Consumers

Researchers PatientsCliniciansInstitutions

NCI ThesauruscaDSR

NLM UMLSRxNormLOINC

SNOMED

31

Cancer Data Sharing& Data Commons

• Support open science • Support data reusability• Aligned with Cancer Moonshot• Part of Precision Medicine• Reduce Health Disparities• Improve patient access to clinical

trials• Work toward a learning National

Cancer Data Ecosystem

Reduce the risk, improve early detection, outcomes and survivorship in cancer

Cancer Data Sharing EffortsSignature Efforts DataBRCA ChallengeSomatic variant sharing

Isolated genetic variantsNo raw sequencing data

Precision medicine questionsSomatic variant sharing

Panel gene resequencingClinical response

Clinical trialPublic-private partnerships

Comprehensive genomicsDetailed clinical

phenotype data

Clinical trial accessClinical/genomic data aggregation

EHR dataClinical sequencing

Clinical oncology standardsEHR dataClinical sequencing

33

Changing the conversation around data sharing

How do we find data, software, standards? How can we make data, annotations, software, metadata accessible? How do we reuse data standards? How do we make more data machine readable?

NIH Data CommonsNCI Genomic Data Commons

National Cancer Data Ecosystem

Data Commons co-locate data, storage and computing infrastructure, and frequently used tools for analyzing and sharing data to create an

interoperable resource for the research community.*Robert L. Grossman, Allison Heath, Mark Murphy, Maria Patterson, A Case for Data Commons Towards Data Science as a Service, to appear. Source of image: Interior of one of Google’s Data Center, www.google.com/about/datacenters/.

34

Basic Ingredients of a Cancer Research Data Ecosystem

• Open Science. Supporting Open Access, Open Data, Open Source, and Data Liquidity for the cancer community

• Standardization through CDEs and Case Report Forms

• Interoperability by exposing existing knowledge through appropriate integration of ontologies, vocabularies and taxonomies

• Consistency of curation and capture of evidence (tissue types, recurrence, therapy – including genomic, epigenomic, etc context)

• Sustainable models for informatics infrastructure, services, data, metadata

35

GDC as an example of a new architecture for storing and sharing cancer data

36

The Cancer Genomic Data Commons (GDC) is an existing effort to standardize and simplify submission of genomic data to NCI and follow the principles of FAIR – Findable, Accessible, Attributable, Interoperable, Reusable, and Provide Recognition.

The GDC is part of the NIH Big Data to Knowledge (BD2K) initiative and an example of the NIH Commons

Genomic Data Commons

Microattribution, nanopublications, tracking the use of data, annotation of data, use of algorithms, supports

the data /software /metadata life cycle to provide credit and analyze impact of data, software, analytics,

algorithm, curation and knowledge sharing

Force11 white paperhttps://www.force11.org/group/fairgroup/fairprinciples

https://www.force11.org/group/fairgroup/fairprinciples



37

The NCI Genomic Data Commons• Unify fragmentary repositories• Support the receipt, quality control, integration, storage, and

redistribution of standardized genomic data sets derived from cancer research studies

• Harmonization of raw sequence both from existing and new cancer research programs

• Application of state-of-the-art methods of generating derived genomic data

• Provide the foundation for: • Identification of high- and low-frequency cancer drivers• Defining genomic determinants of response to therapy• Clinical trial cohorts sharing targeted genetic lesions

NCI Genomic Data Commons The GDC went live on June 6, 2016 with approximately 4.1 PB of data.

577,878 files about 14194 cases (patients), in 42 cancer types, across 29 primary disease sites, 400 clinical data elements

10 major data types, ranging from Raw Sequencing Data, Raw Microarray Data, to Copy Number Variation, Simple Nucleotide Variation and Gene Expression.

Data are derived from 17 different experimental strategies, with the major ones being RNA-Seq, WXS, WGS, miRNA-Seq, Genotyping Array and Expression Array.

Foundation Medicine announced the release of 18,000 genomic profiles to the GDC at the Cancer Moonshot Summit, June 29th, 2016

The Multiple Myeloma Research Foundation announced it would be releasing its CoMMpass study of more than 1000 cases of Multiple Myeloma on Sept 29, 2016.

39

Content in the Genomic Data Commons TCGA

11,353 cases TARGET 3,178 cases

Current

Foundation Medicine18,000 cases Cancer studies in dbGaP ~4,000 cases Multiple Myeloma RF~1,000 cases

Coming soon

NCI-MATCH~3,000 cases Clinical Trial Sequencing Program ~3,000 cases

Planned (1-3 years)

Cancer Driver Discovery Program ~5,000 cases Human Cancer Model Initiative ~1,000 cases APOLLO – NCI, VA and DoD

~8,000 cases

~58,000 cases

Exome-seq

Whole genome-seq

RNA-seq

Copy number

Genomealignment

Genomealignment

Genomealignment

Datasegmentation

1° processing

Mutations

Mutations +structural variants

Digital geneexpression

Copy numbercalls

2° processingOncogene vs.

Tumor suppressor

Translocations

Relative RNA levelsAlternative splicing

Gene amplification/ deletion

3° processing

GDC Data HarmonizationMultiple data types and levels of processing

Mutect2pipeline

GDC Data HarmonizationOpen Source, Dockerized Pipelines

Recoveryrate

(% true positives) A0F0

SomaticSniper 81.1% 76.5%VarScan 93.9% 84.3%

MuSE 93.1% 87.3%All Three 96.4% 91.2%

GDC variant callingpipelinesWash UBaylorBroad

GDC Data HarmonizationMultiple pipelines needed to recover all variants

Development of the NCI Genomic Data Commons (GDC)To Foster the Molecular Diagnosis and Treatment of Cancer

GDC

Bob Grossman PIUniv. of Chicago

Ontario Inst. Cancer Res.Leidos

Institute of MedicineTowards Precision Medicine

2011

Discovery of Cancer Drivers With 2% Prevalence

Lung adeno.+ 2,900

Colorectal+ 1,200

Ovarian+ 500

Lawrence et al, Nature 2014

Power Calculation for Cancer Driver Discovery

Need to resequence >100,000 tumors to identify all cancer drivers at >2% prevalence

The NCI Cancer Genomics Cloud Pilots

Understanding how to meet the research community’s need to

analyze large-scale cancer genomic and clinical data

48

NCI Cancer Genomics Cloud Pilots

Democratize access to NCI-generated genomic and related data, and to

create a cost-effective way to provide scalable computational capacity to the cancer research

community.

Cloud Pilots provide:• Access to large genomic data sets without need to download• Access to popular pipelines and visualization tools• Ability for researchers to bring their own tools and pipelines to the data• Ability for researchers to bring their own data and analyze in combination with existing genomic

data• Workspaces, for researchers to save and share their data and results of analyses

49

• PI: Gad Getz• Google Cloud• Firehose in the cloud including Broad best practices workflows• http://firecloud.org

Broad Institute

• PI: Ilya Shmulevich• Google Cloud• Leverage Google infrastructure; Novel query and visualization• http://cgc.systemsbiology.net/

Institute for Systems Biology

• PI: Deniz Kural• Amazon Web Services• Interactive data exploration; > 30 public pipelines• http://www.cancergenomicscloud.org

Seven Bridges Genomics

Three NCI Genomics Cloud Pilots

Selection Design/Build I

Design/Build II Evaluation Extension

Sept 2016Jan 2016April 2015Sept 2014Jan 2014

http://firecloud.org/

http://cgc.systemsbiology.net/

http://www.cancergenomicscloud.org/

Broad Institute Cloud Pilot

• Targeted at users performing analyses at scale

• Modeled after their Firehose analysis infrastructure developed for the TCGA program

• Users can upload their own data and tools and/or run the Broad’s best practice tools and pipelines on pre-loaded data

Institute for Systems Biology Cloud Pilot

51

PI / Biologistweb access

Computational Research ScientistPython, R, SQL

Algorithm Developerssh, programmatic access

ISB-CGC Web App

Google Cloud Console

Google APIs

ISB-CGC APIs

Compute Engine VMs

Cloud Storage

BigQuery

Genomics

Local Storage

ISB-CGC Hosted

DataControlled-Access DataOpen-Access Data User Data

• Closely tied with Google Cloud Platform tools including BigQuery, App Engine, Cloud Datalab, Google Genomics, and Compute Engine

• Aggregated TCGA data in BigQuery allows fast SQL-like queries across the entire dataset• Web interface allows scientists to interactively compare and define cohorts

Seven Bridges Genomics Cloud Pilot

• Built upon the SBG commercial cloud-based genomics platform

• Graphical query interface to identify hosted data of interest

• Includes a native implementation of the Common Workflow Language specification and graphical interface for creating user-defined workflows

Workspace – isolated environment for collaborative analysis

Data + Methods → Results

sample data and metadata (e.g.

BAMs, tissue type)

algorithms (e.g. mutation

calling)

Wiring logic(e.g. use the exome

capture BAM)

executions and results(e.g. run mutation caller v41 on this exact bam and track

results)

Slide courtesy of Broad Institute

GDC AcknowledgementsNCI Center for Cancer Genomics Univ. of Chicago

Bob GrossmanAllison Heath

Mike FordZhenyu Zhang

Ontario Institute for Cancer Research

Lou StaudtZhining Wang

Martin FergusonJC Zenklusen

Daniela GerhardDeb Steverson

Vincent Ferretti'Francois Gerthoffert

JunJun Zhang

Leidos Biomedical ResearchMark Jensen

Sharon GaheenHimanso Sahni

NCI NCI CBIITTony KerlavageTanya Davidsen

CGC Pilot Team Principal Investigators • Gad Getz, Ph.D - Broad Institute - http://firecloud.org • Ilya Shmulevich, Ph.D - ISB - http://cgc.systemsbiology.net/ • Deniz Kural, Ph.D - Seven Bridges – http://www.cancergenomicscloud.org

NCI Project Officer & CORs• Anthony Kerlavage, Ph.D –Project Officer• Juli Klemm, Ph.D – COR, Broad Institute• Tanja Davidsen, Ph.D – COR, Institute for Systems Biology • Ishwar Chandramouliswaran, MS, MBA – COR, Seven Bridges Genomics

GDC Principal Investigator• Robert Grossman, Ph.D - University of Chicago• Allison Heath, Ph.D - University of Chicago• Vincent Ferretti, Ph.D - Ontario Institute for Cancer Research

Cancer Genomics Project Teams

NCI Leadership Team• Doug Lowy, M.D.• Lou Staudt, M.D., Ph.D.• Stephen Chanock, M.D.• George Komatsoulis, Ph.D.• Warren Kibbe, Ph.D.

Center for Cancer Genomics Partners• JC Zenklusen, Ph.D.• Daniela Gerhard, Ph.D.• Zhining Wang, Ph.D.• Liming Yang, Ph.D.• Martin Ferguson, Ph.D.








56

Integrated data sets, interoperable resources, harmonized data are necessary for and enable biologically informed cancer predictive models

57

Questions?

Warren Kibbe, Ph.D.

[email protected]

@wakibbe

mailto:[email protected]

www.cancer.gov www.cancer.gov/espanol

59

NIH Genomic Data Sharing Policy

https://gds.nih.gov/ Went into effect January 25, 2015

NCI guidance:http://

www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data

Requires public sharing of genomic data sets

https://gds.nih.gov/

https://gds.nih.gov/

http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data



genomics and computation in precision medicine march 2017

Health & Medicine