intel precision medicine apr 2015
TRANSCRIPT
Taking Precision
Medicine
Mainstream
Ketan Paranjape
GM Life Sciences
Intel Corp.
1
www.intel.com/healthcare/bigdata
Health & Life Sciences at IntelWhere information and care meet
Big Data Analytics
in Health and Life Sciences
Today: Many disparate
data types, streams…
Future: Integrated
computing and data
2
Genomics
Clinical
Claims &
transaction
s
Meds &
labs
Patient
experience
Personal
data
Health & Life Sciences at IntelWhere information and care meet
Vision for Precision Medicine1 Patient visit 2
Genes causing disease and key pathways identified
3Gene targeted drugs identified, Treatment begins in earnest
Imaging, parallelism, statistical work, accelerated algorithms
Various algorithms, big data, parallelism, accelerated algorithms, statistical and math puzzles, in memory processing
+ imaging + ML + drug sensitivity assays + mechanistic learning and
systems + correlation work and knowledge systems
1-4 days MonthsWeeks
PRIMARY ANALYSIS SECONDARY ANALYSIS, DNA/RNA PIPELINE + MORE PRECISION MEDICINE
HC
HC
HC
HC
HC
HC
HC
HC
HC
HC
Multiple sample compute starts here
Joint genotyping
Variantstore
Pop/DisstudyIndividual Multiple individual
Predicted Actionable VariantsData Driven Association
Clinically Actionable Variants
Knowledge DtabaseClinical trail groups
Data curation
3GOAL :: Precision Medicine in a Day by 2020 !!
$1K-$5K $10K+$5K-$10K
Health & Life Sciences at IntelWhere information and care meet
Life Sciences World Map
4
$11B Japan investments
$100M ETRI Korea
Ontario LS Investments
Genomics France
Beijing Genomics Institute (PRC)
$1.58B line of credit
Geisinger Health
100,000 patients
NY Genome Center (US)
$200M investments
Mt Sinai (US)
Genomics England
100,000 patients
Genomics Qatar
400,000 patients
Broad Institute (US) Wellcome Trust Sanger (UK)
Charité Germany
100,000 patients
Moffitt Health
100,000 patients
We are barely scratching the surface …
Health & Life Sciences at IntelWhere information and care meet
Challenges in Life Sciences
5
Big Data in Life
Sciences
• Sequencer advances – 4x data in 50% less time
.5TB/device/day
• 4D molecular imaging produces 2TB/device/day
• Fragmented software ecosystem, lots of open source
Burdens of Data
Management
• Store, manage, share, ingest and move PBs of research
& clinical data
• Need to reliably ‘snapshot’ pipelines with archive to tiered
storage
Innovation Drives
Change
• Rapid iteration of algorithms far outpace IT, requiring
flexibility, agility
• Most applications do not fully leverage available
infrastructure
Converged
Infrastructure
• Workloads converging between local and cloud-based
HPC/Big Data
• Advanced orchestration required to maximize throughput
& efficiency
At the Intersection
of Transformative
Forces
1018 Enabling extreme-
scale computing on
massive data sets
Helping enterprises
build open,
interoperable clouds
Contributing code
and fostering the
ecosystem
Health & Life Sciences at IntelWhere information and care meet
Intel Partnerships and Ecosystem Enablement
to resolve challenges
Need for Balanced Compute Infrastructure
*Other names and brands may be claimed as the property of others.
Health & Life Sciences at IntelWhere information and care meet
Optimizing Top Applications and PipelinesIntel working with industry experts worldwide
• Genomics, Molecular Dynamics
and Molecular Imaging
applications targeting both Intel®
Xeon® processors and Xeon® Phi™
coprocessors
• Fine- and coarse-grained
optimization at the node and cluster
level
• Work with code authors to release
optimizations, disseminate best
practices
ABySS*
BLAST*
Bowtie*
TopHat*
Cufflinks*
BWA*
GATK*
Picard*
SAMtools*
MPI-HMMER*
Velvet*
*Some names and brands may be claimed as the property of others.
AMBER*
CAS-Soft Sphere*
CAS-IPE*
CP2K*
CPMD*
DLPOLY*
GAMESS*
Gaussian*
GROMACS*
LAMMPS*
NAMD*
NWChem*
Quantum Espresso*
VASP*
7
Health & Life Sciences at IntelWhere information and care meet
Profiling: Single Instance Run of GATKGATK: Genome Analysis Toolkit (The Broad Institute)
• # of Machines = 1
• # of cores/Machine = 24
• Temporary Storage – RAID0 2x4TB HDD
• Input Dataset: G15512.HCC1954.1, coverage:
65x
Average CPU utilization is very low. Most cores not being usedAverage I/O bandwidth is very low. Application not I/O bound
Average memory footprint is small. Application not using memory available in newer systems
There is a lot of room to improve
• Open Source Distribution:
https://01.org/workflow-profiler
Health & Life Sciences at IntelWhere information and care meet
GATK 3.0 with The Broad Institute
• Pair HMM Acceleration using Intel® AVX
resulted in 970x speedup
− Computation kernel and bottleneck in
GATK Haplotype Caller
− AVX enables 8 floating point SIMD
operations in parallel
9
Health & Life Sciences at IntelWhere information and care meet
Compression Libraries Tuned for Genomics• Challenge:
− Data compression is a significant performance limiter for
genomics analytics
− With post-analytics archive of large datasets, very good
compression ratios are required
− Within an active workflow, transient data needs very fast
compression with “acceptable” compression ratio, especially
SAM and BAM formats
• Solution:
− igzip is a library for performing high-speed DEFLATE/gzip
compression
− For BAM & SAM files, the compression ratio of igzip is very close
to zlib -1
− igzip is ~4X the speed of zlib* (at fastest settings)
For more information including technical benchmark specification details: https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data
10
Speedup and Compression Ratio delta of igzip0c* vs. gzip/Zlib -1
Compression Performance (Cycles/Byte) and Compression Ratio
*Some names and brands may be claimed as the property of others.
Health & Life Sciences at IntelWhere information and care meet
BLAST – Basic Local Alignment Search Tool
National Center for Biotechnology Information (NCBI)Xeon BLASTp vs. GPU-BLASTp Xeon + Tesla k40x
• Application:• blastn v.29. Basic Local Alignment Search Tool searching for
alignment in nucleotide query sequences against a known
nucleotide db volume set.
• Availability: • blastn v.29:
ftp:://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST
• Highlights: − throughput for this offload model has a wide sweet spot for a
sufficiently large query set.
• Results: − Simulation rate with Xeon + Phi™ heterogeneous model is up
to 1.4X
• Code Optimization Strategy:− Xeon: GAT and OFS parallelized (48T)
− KNC: GAT and OFS parallelized (180T)
11
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark
and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance
of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to
http://www.intel.com/performance
1
1.3
1.4
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1 Node
Speedup (Higher is Better)
• 2S Intel® Xeon® processor E5-2697v2 (BLASTn v.29 Baseline)
• 2S E5-2697v2 + Intel® Xeon Phi™ 7120A OFS serial
• 2S E5-2697v2 + Intel® Xeon Phi™ 7120A OFS parallelized
Health & Life Sciences at IntelWhere information and care meet
GROMACS
Application: GROMACS 5.0-RC1Description:
− GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and the most popular Molecular Dynamics packages
− Workload: 512K H2O with RF method
Availability: − VERSION 5.0-rc1 is available from http://www.gromacs.org/Downloads &
− ftp://ftp.gromacs.org/pub/gromacs/gromacs-5.0-rc1.tar.gz
− Recipe: https://software.intel.com/en-us/articles/gromacs-for-intel-xeon-phi-coprocessor
Results: − Highly optimized for Intel® Xeon® Processors (AVX-intrinsics)
− Able to run full simulation on Intel® Xeon Phi™ coprocessor natively + host processor using a symmetric model
− Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors
12
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark
and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance
of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to
http://www.intel.com/performance
1.0x
1.56x
1.79x
2S Intel® Xeon® processor E5-2697 v2
2S Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon
Phi™ coprocessor 7120P/X
2S Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon
Phi™ coprocessor 7120P/X
Health & Life Sciences at IntelWhere information and care meet
Genomics Data Processing Pipeline
Lustre
*Some names and brands may be claimed as the property of others.
Health & Life Sciences at IntelWhere information and care meet
HPC Appliances for Life Sciences• Challenge: Experiment processing takes 7 days with current
infrastructure. Delays treatment for sick patients
• Solution: Dell Next Generation Sequencing Appliance
− Single Rack Solution; 9 Teraflops, Lustre File Storage; Intel SW tools
• Benefits: RNA-Seq processing reduced to 4 hour
• Includes everything you need for NGS - compute, storage, software,
networking, infrastructure, installation, deployment, training, service &
supportDell HSS (Lustre)(up to 360TB)
Dell NSS (NFS)(up to 180TB)
Infrastructure: Dell PE, PC & F10
M420 (Compute)(up to 32 nodes)
2U Plenum
Actual placement in racks may vary.
NSS-HA Pair
NSS User Data
HSS Metadata
Pair
HSS OSS Pair
HSS User Data
** 2-socket Intel(R) Xeon(R) CPU E5-2687W / 3.1 GHz*Other names and brands may be claimed as the property of others.
Health & Life Sciences at IntelWhere information and care meet
HPC Appliances for Life Sciences
• Challenge: Set up personalized medicine competency – DNA seq. analysis, Finite
Element Analysis, Natural Language Processing, Image Processing, Computational
Fluid Dynamics
• Solution: Dell Next Generation Sequencing Appliance
− 400 cores derived from 40 Intel® Xeon® E5-2680v2 2.8 GHz Ivy Bridge processors in 20 Dell®
PowerEdge M620 nodes providing 9 teraflops
− Room for 12 additional nodes in the Dell M1000e blade enclosure
− 270 terabytes of usable Intel® Enterprise Edition for Lustre* software scalable file system in
a 60-drive Dell® PowerVault MD3260 with 60-drive MD3060e expansion enclosure
− Deployed with Intel® Parallel Studio XE development environment
− InfiniBand backplane and multiple 10 gigabit/ sec uplinks to 1 petabyte replicated grid storage
and the network backbone
• Software:
− CentOS Linux 6.5 64-bit; Bright Cluster Manager® 7.0
− MPI Library: OpenMPI 1.8.1; Intel® MKL; GCC; IEEL v2.1.0.0
− SLURM (Simple Linux Utility for Resource Management) v14.03.0-2462U Plenum
Actual placement in racks may vary.
NSS-HA Pair
NSS User Data
HSS Metadata
Pair
HSS OSS Pair
HSS User Data
** 2-socket Intel(R) Xeon(R) CPU E5-2687W / 3.1 GHz*Other names and brands may be claimed as the property of others.
Health & Life Sciences at IntelWhere information and care meet
IBM, CLC bio Genomics Sequencing Analytics
Solution• Challenge: Need for processing power and storage
capacity in order to correlate the variants in the genome with the relevant patient symptoms
• Solution: IBM®, CLC Genomics server SW, Genomics Workbench client SW; Small (48 Cores, 192 GB), Medium, Large (192 Cores, 768 GB) Analytics Solutions
• Benefits:
– Reference Mapping for 37x coverage human genome – ~9hr (1 node) to ~30mins (37 nodes)
– Variant Calling and annotation for 37x coverage – ~40 hrs (1 node) to ~3hrs (23 nodes)
• Infrastructure– IBM System x® 3550 M4, E5-2650; 48 CPU cores and 192 GBs of memory to 192 CPU
cores and 768 GBs of memory
– IBM Storwize® V7000
– CLC Genomics Server 5.0.2 , Workbench 6.0.1
– 7x 3TB SAS 6 Gbps HDD (16 TB usable)
http://www-148.ibm.com/bin/newsletter/tool/landingPage.cgi?lpId=6155
Health & Life Sciences at IntelWhere information and care meet
BIONANO Appliance
http://www.thinkmate.com/systems/solutions/bionanogenomics#specs17
System Specs
•2x Ten-Core Intel® Xeon® Processor
E5-2680 V2 2.80GHz 25MB Cache (115W)
•8x 16GB (128GB Total) PC3-14900 1866MHz
DDR3 ECC Registered DIMM
•2x 1.0TB SATA 6.0Gb/s 7200RPM 2.5"
Seagate Constellation.
•6x Intel® Xeon Phi Co-Processor 5110P
1.053GHz - 8GB - 60 Cores
•2x Intel 10-Gigabit Ethernet Ports
via Intel® X540 Chipset - 10GBase-T (RJ-45)
http://www.thinkmate.com/systems/solutions/bionanogenomics#spec
s
Health & Life Sciences at IntelWhere information and care meet
Genomics & Clinical Analytics Appliances
18
2U Plenum
Actual placement in racks may vary.
NSS-HA Pair
NSS User Data
HSS Metadata Pair
HSS OSS Pair
HSS User Data
Health & Life Sciences at IntelWhere information and care meet
Aspera* & Intel optimized solutions for Science DMZ
– Moving the World’s Data at Maximum Speed
• Challenge: Enterprise to Cloud Transmission of Terabyte Payloads
• Solution:
− Demonstrated effective throughput of 73.3Gbps
− Equivalent to downloading 254 whole human genomes per
hour (7.3x speed-up), as compared with baseline Aspera using
commodity 10GbE at 35 whole human genomes per hour
• Benefits with Intel® Xeon® E5 v3 product family:
− AES-NI encryption provides 2x faster inline data encryption,
securely transporting sensitive workloads from enterprise-to-cloud.
− Intel® DPDK, Intel® XL710 40GbE, Intel® NVMe SSDs and Intel®
Xeon® E5 v3 processors significantly boost overall system
performance by moving data closer to the processor, avoiding
unnecessary memory copies, and reducing protocol-related
latencies.
19
Infrastructure and Data Characteristics:
Aspera High-Throughput Transport, featuring Intel pre-production system (Intel® Server Board S2600WT) with
two Intel® Xeon® processor E5-2697 v3 (45M Cache, 2.30 GHz, Intel® Hyper-Threading Technology enabled),
128GB DDR4 memory (2133 MHz), Intel® Communications Chipset 89xx Series, 2x dual-port Intel® XL710
Ethernet Controller (40GbE), 5x Intel® DC P3700 PCIe NVMe Solid-State Drives (800GB), Intel® DPDK 1.7,
Intel® DDIO, Intel® AES-NI GCM encryption, Fedora 20 with custom kernel. Aspera A4 (based on Aspera
fasp* 4). Source: Aspera testing as of August 2014.
Health & Life Sciences at IntelWhere information and care meet
Charite “Real-time” Cancer Analysis – Matching proper
therapies to patients using in-memory techniques
• Challenge: Real-time analysis of cancer
patients using in-memory SAP HANA
Oncolyzer database running on Intel®
Xeon® family infrastructure. (3.5M Data
points per Patient, Up to 20 TB of
data/patient)
• Solution: Using structured and unstructured
data to collect and analyze tables used to
take up to two days -- now takes seconds
• Benefits: Improves medical quality in
disruptive way for Patient, Doctor, Hospital,
Research
http://moss.ger.ith.intel.com/sites/SAP/SAP%20account%20team%20documents/Marketing/SAP%20HANA/SAPHANA_Charite_case_study_HI.PDF
Health & Life Sciences at IntelWhere information and care meet
High Throughput Science: Embracing Cloud-based
Analytics for Computational Chemistry Simulation • Challenge: Sustaining 50000+ compute
cores for large scale simulations, for less
than a week; CapEX v. OpX
• Solution: Novartis leveraged software from
AWS partner, Cycle Computing, and
MolSoft to provision a fully secured cluster
of 30,000 CPUs, powered by the Intel®
Xeon® processor E5 family.
− Completed screening of 3.2 million
compounds in approximately 9 hrs,
compared to 4 -14 days on existing
resources.
Powerof60.com
Health & Life Sciences at IntelWhere information and care meetIntel Confidential
• Solution: Intel Distribution for
Hadoop (IDH), Map Reduce, Hbase,
Hive
• Benefits: Ability to compare 14
million proteins and more, reducing
the processing time from days to
hours.
• Project Characteristics:
SLA: reducing processing time
from 30 days to less then a day
and scale to 4x4 million samples
comparison
Data: Multi-Terabyte database
Problem Statement:
Comparing all pairs of 14 million proteins,
using BLAST search.
Big Data, Bioinformatics
Team websiteBlast
Program
Genome data
Proteins comparison
High performance scalable
Hadoop/Hbase cluster
Health & Life Sciences at IntelWhere information and care meet
Drug Information Network Analysis
for Treatment & Chemoinformatics
Integration of heterogeneous data sets
reveals relationships among drug compounds,
enabling insight into treatment efficacy
Business Outcomes• Faster, cheaper discovery of new drugs via focusing research
• Improving patient outcomes by reducing toxic drug reactions
• Reduce cost of treatment by predicting novel efficacy of existing
compounds
25 Pharmacological
Data sets
41 Million
Compounds
• Data preparation
• Feature engineering
• Network/graph modeling
• Visualization
• Machine Learning
• Query
Advanced Analytics
• Medical Research
• Toxicity
• Drug interactions
• Efficacy prediction
Health & Life Sciences at IntelWhere information and care meet
Accelerating Secondary Genomic Analysis Using
Intel Reference Architecture• Challenge: How to accelerate Secondary Analysis and
Interpretation in Whole Genome Sequencing
• Solution: Intel Genomics Reference Architecture
(IGRA) utilizing Big Data Technologies (Cloudera
CDH5 Hadoop Distro/Impala/Hive/Python) and Intel
Hardware and Infrastructure. (E5-2650, 8 GB*16, 300
GB SSD, E10G42BFSR)
• Benefits:
− 732x acceleration in Query
− 242x acceleration in statistical function (histogram of indel
lengths)
− 15x acceleration in Annotation
− 11x acceleration in Intersect
* 1000 genome data set, comparison of VCFtools vs. IGRA, cumulative runtime for 23 chromosomes
24
Health & Life Sciences at IntelWhere information and care meet
High-Performance Interconnect (InfiniBand)
Intel® True Scale Fabric
• Challenge: Can high performance interconnect
technology (InfiniBand) keep up with increase in
number of processor cores?
• Workloads: VASP, WIEN2K
• Benchmarks: MVAPICH (MPI over InfiniBand), IMPI
(Intel MPI)
• Results:
− Scale-up research – 5 to 10x speed up when
scaling from single node to 16 nodes
− Intel® True Scale Fabric QDR-40 shows
excellent price/performance results
25
High Performance Scale-out Storage
Challenge:
• Challenge: Need to accelerate and optimize
“time to results” clinical trial simulation
environment; resource allocation and job
prioritization was manual/ad-hoc
• Solution: Scale-Out” architecture: SAS Visual
Analytics, Enterprise Miner, Grid Manager, Red
Hat Enterprise Linux, Xeon E5 servers (HP)
• Benefits:
− Clinical trial simulation exercises reduced
from hours to < 5 minutes; registration
decisions accelerated with multi-hundred
million USD impact
Health & Life Sciences at IntelWhere information and care meet
Genomic Assembly of Non-Human Species
• Challenge: Complete assembly of non-human
genome on existing clusters
• Solution: Intel Xeon E7 4800 V3 + SSDs
(S3700)
• Benefits:
− 29.92 hours (on their cluster) vs. 4.03 hours
(on HSW-EX) for a dataset of 73GB.
High Performance Scale-out Storage
Challenge:
• Challenge: 10-15TB data added weekly, small
fraction of overall storage capacity and need a
system to scale, be flexible and efficient
• Solution: HPC-class storage, powered by Intel®
Enterprise Edition for Lustre software
• Benefits:
− Openess, global namespace
− Performance of upwards of 1 TB/s
− Virtually unlimited file system and per file
sizes, and management simplicity
26
www.intel.com/healthcare/bigdataIntel Health & Life Sciences
Where information and care meet
Intel/Intermountain Collaboration
An Idea For New CDS Applications Combining
Clinical, Genetic/Genomic, and Family Health History
Data
• Goal - Promote widespread use of clinical decision
support that will help clinicians/counselors in
assessing risk and assist genetic counselors in
ordering genetic tests.
• Build a scalable CDS that leverages standardized data
that includes:
− Family Health History
− Clinical and Screening
− Genomic data
• The solution will:
• Be agnostic to data collection tools. The solution Be
scaled to different clinical domains (grow beyond
Breast Cancer) and other healthcare institutions.
• Be standards based where they exist
• Work across all EHRs, but starting with Cerner.
• Leverage Intel technologies (infrastructure, Intel Data
Platform etc.).
• Be flexible to incorporate other data sources (e.g.
Imaging data, personal device data)
Health & Life Sciences at IntelWhere information and care meet
Policy – United States, European
Union Snapshot of US, EU Recommendations
Develop an ICT-enabled European Strategy for Personalised
Medicine
2014-2020
Driving research to unleash the potential of ICT at the point-of-care
EU R&D initiatives must address:
Interoperability of technical standards for managing and sharing sequence data in research and clinical samples;
Development of hardware, software and workflow algorithms to accelerate cost efficient analysis of genetic abnormalities that cause cancer and other complex diseases;
Research to ensure convergence of Big Data and Cloud Computing infrastructure to meet the requirements of High Performance Computing and data throughout the life sciences and healthcare value chains
The eHealth Action Plan 2020 should include Personalised Medicine as a
priority
Gain knowledge of the challenges and barriers (technical, organizational, legal and political) to the adoption of ICT in support of Personalised Medicine leveraged by genomic information;
Evaluate how to change workflows and education requirements to facilitate adoption of ICT mediated personalized medicine in clinical practice;
Expand collaboration with other regions of the world in matters of common interest, e.g. by leveraging the eHealth MoU with the United States of America;
Study, evaluate and disseminate technology neutral risk assessment frameworks for data privacy and security, covering the entire ICT enabled Personalised Medicine delivery chain;
Develop effective methods for enabling the use of medical information for public health and research
Health & Life Sciences at IntelWhere information and care meet
29
Genes causing it
identified & disease
pathways determined
Precision
medicine regime
Precision Medicine: All in a day by 2050 2020
Intel Confidential — Do Not Forward