genome analysis pipelines, big data style
TRANSCRIPT
®© 2015 MapR Technologies 1
Allen Day, PhD // Chief Scientist @ MapR.com 2016.04.12, Big Data Everywhere
Agenda
• Presentation Motivations
  – Data inertia, data local computing
• Highlights of BigData solutions ecosystem
  – MapR, NoSQL, Spark
• Biotech Analytics Use Cases
  – Transition from sensors to insights: population DBs
    • NoSQL performance
  – Cost savings
    • NoSQL cost structure
  – Legacy tools integration
    • Spark wrappers
Data Inertia
• Newton’s 1st Law of Motion (Law of Inertia)
• “An object at rest stays at rest … unless acted upon by an unbalanced force”
• Force required to transport data increases with data size and device latency
  – CPU < CPU caches < RAM < Disk/SSD < Network
  – Each tier to the right is bigger; each tier to the left is faster
Data Inertia + Exponential Data Growth => Data Local “BigData” Computing
• Traditional algorithm design moves data to the executing program
  – High Perf Cluster + Storage Network (HPC+SAN)
• Key insight: the program is proportionally much smaller than the data, and thus easier to move
• Modern algorithm design moves the executing program to the data
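To make the contrast concrete, here is a minimal sketch (not MapR or Spark code) of the two designs on a toy cluster whose partitions are plain Python lists. The names `Node`, `move_data_to_program`, and `move_program_to_data` are hypothetical, and network traffic is approximated by simply counting the items that would cross the wire.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """A hypothetical storage node holding one partition of the data."""
    partition: List[int]


def move_data_to_program(nodes: List[Node]) -> tuple:
    """Traditional style: pull every record to one place, then compute."""
    shipped = [x for node in nodes for x in node.partition]  # all data is shipped
    return sum(shipped), len(shipped)  # (result, items moved)


def move_program_to_data(nodes: List[Node], fn: Callable[[List[int]], int]) -> tuple:
    """Data-local style: run fn beside each partition, ship back only small results."""
    partials = [fn(node.partition) for node in nodes]  # one number per node is shipped
    return sum(partials), len(partials)  # (result, items moved)


nodes = [Node(list(range(i * 1000, (i + 1) * 1000))) for i in range(4)]
total_a, moved_a = move_data_to_program(nodes)
total_b, moved_b = move_program_to_data(nodes, sum)
print(total_a == total_b, moved_a, moved_b)
```

Both designs compute the same total, but in this toy setup the data-local version ships one partial result per node instead of every record.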
Some BigData Tools
What is Spark?
• Spark is a parallel computing framework that allows a job to run on 1000s of computers as easily as on 1. No code changes required.
• Makes good use of RAM and SSD storage
What is HBase?
• HBase is a non-relational (NoSQL), distributed database modeled on Google’s BigTable.
• Provides highly scalable sustained and random access to very large data sets
MapR Converged Platform for BigData
Cost-Effective ETL (Novartis)
The Problem
• Key step in data ingest for R&D handled by enterprise data warehouse (EDW)
  – Video, Proteomics, NGS, Metagenomics
• EDW at maximum capacity
  – Multiple rounds of software optimization already done
  – Data still growing
• Insight limiting (= career limiting) bottleneck
Three Options
1. No more insights / candidates
2. Increase EDW size
  – Expensive
  – Known to not scale well
3. Find a more scalable solution
Original Flow – ELTL
Raw data: • Public/private • Compounds • Expression data • Genotype data • EHR data • …
[Diagram labels: Extract, Load; Data Warehouse; Transform, Load; Knowledge graph; Downstream Analysis (R&D)]
Simplified Analysis – EDW Strategy
• Majority of EDW storage consumed by ELTL processing
  – Caused by minority of code (raw data transformations)
• Increasing EDW capacity yields sub-linear performance
  – Poor division of labor
With ETL Offload
Raw data: • Public/private • Compounds • Expression data • Genotype data • EHR data • …
[Diagram labels: Extract, Load; MapR; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D)]
Simplified Analysis – MapR Strategy
• Lower cost per TB of increased ETL capacity by replacing EDW with MapR
• Scale-out architecture
  – Linear spend gives linear performance increase
• Strategic advantage
  – Next-gen architecture for implementing new use cases
  – Insights/time (and career) acceleration
Additionally…
Raw data: • Public/private • Compounds • Expression data • Genotype data • EHR data • …
[Diagram labels: Extract, Load; MapR: Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D)]
New Use Cases are Enabled
Raw data: • Public and private • Compounds • Expression data • Genotype data • EHR data • …
[Diagram labels: Extract, Load; MapR; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D); New Use Cases]
NoSQL: Scalable Population DBs
Catalog genetic variants => find QTLs
• Current public human cohort proposals: 100K-1M individuals, >400% CAGR
• Seed and livestock companies, same trend
• Px/Dx biomarkers for PGx, reproductive medicine, biometrics, etc.
• Idea is to catalog genetic variants, find QTLs
• Well studied problem, let’s take a look
Genome × Phenome Analysis
[Figure: sparse matrix with genotype rows 𝛿1, 𝛿3, 𝛿5 (axis label: SPARSE, Billion+ Genotypes) and phenotype columns ϕ1, ϕ3, ϕ5 (axis label: SPARSE, Billion+ Phenotypes)]
For given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix
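As a rough illustration of that counting rule, the sketch below builds the sparse count matrix as a dictionary keyed by (SNP, phenotype). The observation records and the function name `count_matrix` are invented for this example; a population-scale version would live in a distributed store rather than in memory.

```python
from collections import Counter

# Hypothetical observations: one (individual, SNP, phenotype) triple per record.
observations = [
    ("ind1", "snp1", "pheno1"),
    ("ind2", "snp1", "pheno1"),
    ("ind2", "snp3", "pheno5"),
    ("ind3", "snp5", "pheno3"),
    ("ind4", "snp1", "pheno1"),
]


def count_matrix(records):
    """Return a sparse matrix as {(snp, phenotype): count}; absent keys mean zero."""
    counts = Counter()
    for _individual, snp, phenotype in records:
        counts[(snp, phenotype)] += 1
    return counts


matrix = count_matrix(observations)
print(matrix[("snp1", "pheno1")])  # co-occurrences of snp1 with pheno1
print(matrix[("snp5", "pheno1")])  # never observed together, so zero
```

Only the observed pairs are stored, which is what keeps a matrix with billions of rows and columns tractable when nearly all cells are empty.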
Associate QTLs to variants via Genome × Phenome Matrix Factorization
[Figure: the 𝛿 × ϕ matrix, annotated with Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector); caption: Factorize w/ Spark & MapR]
• Row Eigenvectors of X represent
  – Sets of related phenotypes (by SNP)
• Column Eigenvectors of Y represent
  – Sets of related SNPs (by phenotype)
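The slides do not say which factorization was run on Spark. As a stand-in, the sketch below uses plain power iteration to find the leading singular vector pair of a small sparse count matrix; the function name `leading_singular_pair` and the toy data are assumptions, and the matrix is assumed to be non-empty.

```python
import math


def leading_singular_pair(matrix, rows, cols, iterations=100):
    """Power iteration on a sparse {(row, col): value} matrix.

    Returns (sigma, u, v): the largest singular value with its left (row)
    and right (column) singular vectors, as dicts keyed by label.
    """
    v = {c: 1.0 for c in cols}
    u = {r: 0.0 for r in rows}
    sigma = 0.0
    for _ in range(iterations):
        # u = A v, then normalize to unit length
        u = {r: 0.0 for r in rows}
        for (r, c), value in matrix.items():
            u[r] += value * v[c]
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        u = {r: x / norm_u for r, x in u.items()}
        # v = A^T u; its length is the current singular-value estimate
        v = {c: 0.0 for c in cols}
        for (r, c), value in matrix.items():
            v[c] += value * u[r]
        sigma = math.sqrt(sum(x * x for x in v.values()))
        v = {c: x / sigma for c, x in v.items()}
    return sigma, u, v


# Toy counts: snp1 and snp3 share pheno1 and pheno3; snp5 only touches pheno5.
toy = {
    ("snp1", "pheno1"): 2, ("snp1", "pheno3"): 2,
    ("snp3", "pheno1"): 2, ("snp3", "pheno3"): 2,
    ("snp5", "pheno5"): 1,
}
sigma, u, v = leading_singular_pair(
    toy, ["snp1", "snp3", "snp5"], ["pheno1", "pheno3", "pheno5"]
)
print(round(sigma, 6), u, v)
```

In this toy case the leading row vector puts equal weight on snp1 and snp3 and essentially none on snp5, which is the sense in which a factor groups related SNPs; the column vector does the same for phenotypes.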
[Figure: the 𝛿 × ϕ matrix, annotated with Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector)]
Moreover…
• This is a generalized GWAS: it’s PheWAS
• NB: These calculations are a mixed I/O workload: they require high-throughput sustained reads and low-latency random access
• Proven MapR-DB use case: Aadhaar biometric system, biometrics for 1B humans
Furthermore…
[Figure: the 𝛿 × ϕ matrix from the previous slides]
If we change the labels…
[Figure: the same matrix with rows doc1, doc3, doc5 and columns user1, user3, user5; axis labels: INTERESTS, BEHAVIORS]
We have the core of Google / Facebook / Twitter Ad Revenue Engine
Spark: Porting Legacy Pipelines
Alignment
[Diagram: DNA Reads and Reference Sequences feed Align(), which produces Aligned Reads for Downstream Applications…]
Possible Align() Outcomes
[Diagram: Unaligned DNA Reads and Reference Sequences feed Align(), which yields Single Location Reads, Multiple Location Reads, and Unlocatable Reads]
Many-to-Many Relationship Between Reads and Locations
Reads: • Read1 • Read2 • Read3 • Read4 • NULL
Locations: • LocationA • LocationB • LocationC • LocationD • LocationA • NULL • LocationE
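One way to hold such a many-to-many relationship is a pair of indexes, one per direction. In the sketch below the read-to-location pairing is invented for illustration (the connecting lines of the slide's diagram did not survive in the transcript), and `classify` is a hypothetical helper that names the three Align() outcomes.

```python
from collections import defaultdict

# Hypothetical alignment hits as (read, location) pairs; the pairing is
# invented for illustration and is not taken from the slide.
hits = [
    ("Read1", "LocationA"),
    ("Read2", "LocationB"),
    ("Read2", "LocationC"),
    ("Read3", "LocationA"),
]
all_reads = ["Read1", "Read2", "Read3", "Read4"]

# Index the relationship in both directions.
locations_by_read = defaultdict(set)
reads_by_location = defaultdict(set)
for read, location in hits:
    locations_by_read[read].add(location)
    reads_by_location[location].add(read)


def classify(read):
    """Name the Align() outcome for one read by how many locations it hit."""
    n = len(locations_by_read.get(read, ()))
    if n == 0:
        return "unlocatable"
    return "single" if n == 1 else "multiple"


outcomes = {read: classify(read) for read in all_reads}
print(outcomes)
print(sorted(reads_by_location["LocationA"]))
```

A read with no entry plays the role of the NULL location, and a location shared by several reads shows the "many" in the other direction.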
Parallelizing Alignment
[Diagram: Split() divides Unaligned DNA Reads into Part1, Part2, Part3; Align() maps each part to Locations; Concat(), Sort(), etc. produce Aligned DNA Reads]
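The Split() / Align() / Concat() / Sort() shape can be sketched in a few lines. Everything here is a toy under stated assumptions: `REFERENCE` is a made-up sequence, `align` is an exact-match stand-in for a real aligner, and a thread pool stands in for a cluster, so the sketch shows the structure rather than any real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

REFERENCE = "ACGTACGGTCA"  # hypothetical reference sequence


def split(reads, n_parts):
    """Split(): deal reads round-robin into n_parts lists."""
    return [reads[i::n_parts] for i in range(n_parts)]


def align(part):
    """Align(): toy exact-match aligner returning (position, read) pairs.

    A stand-in for a real aligner; unmatched reads get position -1.
    """
    return [(REFERENCE.find(read), read) for read in part]


def parallel_align(reads, n_parts=3):
    """Run Split(), per-part Align(), Concat(), then Sort() by position."""
    parts = split(reads, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        aligned_parts = list(pool.map(align, parts))
    merged = [hit for part in aligned_parts for hit in part]  # Concat()
    return sorted(merged)  # Sort()


reads = ["GGTC", "ACGT", "TTTT", "CGTA", "TCA"]
aligned = parallel_align(reads)
print(aligned)
```

Because each part is aligned independently, the result matches a serial run over all reads; only the final concatenate-and-sort step needs to see every part.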
Using HPC+SAN has Bottlenecks (GridEngine, Etc)
[Diagram: the Part1, Part2, Part3 pipeline, annotated with: Volume Read Bottleneck; Volume Write Bottleneck; Read & Write Bottleneck]
Using Spark Eliminates Bottlenecks
[Diagram labels: Split(); Align(); Concat(); Sort()]
Bottom Level: Integration with Legacy Tools
• No time today to look at code, but a deeper slideshow of doing this with Bowtie aligner:
• http://www.slideshare.net/allenday
• https://github.com/allenday/spark-genome-alignment-demo
[Diagram labels: Local I/O Container; Legacy Sub-process]
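The wrapper idea, feeding one partition of records to an unmodified command-line tool and collecting its output, can be sketched with the standard library alone. This is not the code from the linked demo: `LEGACY_COMMAND` is a stand-in child process that upper-cases its input, and `pipe_partition` is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in for a legacy command-line tool (an assumption for this sketch):
# a child Python process that upper-cases each line it reads from stdin.
LEGACY_COMMAND = [
    sys.executable, "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]


def pipe_partition(records, command=LEGACY_COMMAND):
    """Feed one partition to a sub-process via stdin and collect its stdout lines."""
    payload = "".join(record + "\n" for record in records)
    completed = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return completed.stdout.splitlines()


partitions = [["acgt", "ggta"], ["ttga"]]
results = [pipe_partition(part) for part in partitions]
print(results)
```

Spark's RDD `pipe` operation offers a comparable mechanism for streaming a partition through an external command; whether the linked demo uses it is not stated in the slides.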
Thanks! Questions?
@allenday, @mapr
linkedin.com/in/allenday slideshare.net/allenday