genome analysis pipelines, big data style

© 2015 MapR Technologies


Allen Day, PhD // Chief Scientist @ MapR.com 2016.04.12, Big Data Everywhere


Agenda
•  Presentation Motivations
   –  Data inertia, data local computing
•  Highlights of BigData solutions ecosystem
   –  MapR, NoSQL, Spark
•  Biotech Analytics Use Cases
   –  Transition from sensors to insights - population DBs
      •  NoSQL performance
   –  Cost savings
      •  NoSQL cost structure
   –  Legacy tools – integration
      •  Spark wrappers


Data Inertia
•  Newton’s 1st Law of Motion (Law of Inertia)
•  “An object at rest stays at rest … unless acted upon by an unbalanced force”
•  Force required to transport data increases with data size and device latency
   –  CPU < CPU caches < RAM < Disk/SSD < Network
[Figure: arrows labeled "bigger" and "faster" spanning the hierarchy]


Data Inertia + Exponential Data Growth => Data Local “BigData” Computing
•  Traditional algorithm design moves data to the executing program
   –  High Perf Cluster + Storage Network (HPC+SAN)
•  Key insight – the program is proportionally much smaller than the data, thus easier to move.
•  Modern algorithm design moves the executing program to the data
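A rough back-of-the-envelope sketch of why moving the program is cheaper than moving the data. All sizes and the link speed below are illustrative assumptions, not figures from the talk, and the model (latency plus size over bandwidth) ignores contention and protocol overhead.

```python
# Illustrative numbers only: every size and bandwidth here is an assumption.

def transfer_seconds(size_bytes, bandwidth_bytes_per_s, latency_s=0.0):
    """Time to ship a payload: fixed latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

GB = 10**9
MB = 10**6

data_size = 1000 * GB      # assumed: 1 TB of sequencing data
program_size = 50 * MB     # assumed: 50 MB analysis program
network_bw = 1.25 * GB     # assumed: 10 Gbit/s link, i.e. 1.25 GB/s

move_data = transfer_seconds(data_size, network_bw)        # data to the program
move_program = transfer_seconds(program_size, network_bw)  # program to the data
ratio = move_data / move_program

print(f"move data:    {move_data:.2f} s")
print(f"move program: {move_program:.2f} s")
print(f"moving the data costs about {ratio:.0f}x more")
```

Under these assumed numbers the data takes minutes to move while the program takes a fraction of a second; the gap widens as the data grows and the program does not.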


Some BigData Tools
What is Spark?
•  Spark is a parallel computing framework that allows a job to run on 1000s of computers as easily as 1. No code changes required.
•  Makes good use of RAM and SSD storage
What is HBase?
•  HBase is a non-relational (NoSQL), distributed database modeled on Google’s BigTable.
•  Provides highly scalable sustained and random access to very large data sets
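To make "sustained and random access" concrete, here is a toy in-memory sketch of the access model: rows are kept sorted by row key, so a single row can be fetched by key (random access) and a contiguous key range can be scanned in order (sustained access). The class `ToySortedTable`, its method names, and the sample row keys are hypothetical and are not the HBase client API.

```python
import bisect

class ToySortedTable:
    """Toy stand-in for a row-key-sorted table. NOT the HBase API."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        """Insert or update one cell, keeping row keys sorted."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random access: fetch one row by key (empty dict if absent)."""
        return self._rows.get(row_key, {})

    def scan(self, start_key, stop_key):
        """Sustained access: yield rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

# Hypothetical usage: row keys are made-up variant identifiers.
table = ToySortedTable()
table.put("rs0003", "gt", "G/G")
table.put("rs0001", "gt", "A/A")
table.put("rs0002", "gt", "A/G")

one_row = table.get("rs0002")
key_range = [key for key, _ in table.scan("rs0001", "rs0003")]
print(one_row, key_range)
```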


MapR Converged Platform for BigData


Cost-Effective ETL (Novartis)


The Problem
•  Key step in data ingest for R&D handled by enterprise data warehouse (EDW)
   –  Video, Proteomics, NGS, Metagenomics
•  EDW at maximum capacity
   –  Multiple rounds of software optimization already done
   –  Data still growing
•  Insight limiting (= career limiting) bottleneck


Three Options
1.  No more insights / candidates
2.  Increase EDW size
   –  Expensive
   –  Known to not scale well
3.  Find a more scalable solution


Original Flow – ELTL
Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …
[Figure: flow diagram; labels: Extract, Load; Data Warehouse; Transform, Load; Knowledge graph; Downstream Analysis (R&D)]


Simplified Analysis – EDW Strategy
•  Majority of EDW storage consumed by ELTL processing
   –  Caused by minority of code (raw data transformations)
•  Increasing EDW capacity yields sub-linear performance
   –  Poor division of labor


With ETL Offload
[Figure: flow diagram; labels: Extract, Load; Transform, Load; Knowledge graph; Data Warehouse; MapR]


Simplified Analysis – MapR Strategy
•  Lower cost per TB of increased ETL capacity by replacing EDW with MapR
•  Scale-out architecture – linear spend gives linear performance increase
•  Strategic advantage – next-gen architecture for implementing new use cases
   –  Insights/time (and career) acceleration


Additionally…
[Figure: flow diagram; labels: Extract, Load; Knowledge graph; Data Warehouse; MapR; Transform, Load]


New Use Cases are Enabled
Raw data:
•  Public and private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …
[Figure: flow diagram; labels: Extract, Load; Transform, Load; Knowledge graph; Data Warehouse; MapR; New Use Cases]


NoSQL: Scalable Population DBs


Catalog genetic variants => find QTLs
•  Current public human cohort proposals: 100K-1M individuals, >400% CAGR
•  Seed and livestock companies, same trend
•  Px/Dx biomarkers for PGx, reproductive medicine, biometrics, etc.
•  Idea is to catalog genetic variants, find QTLs
•  Well studied problem, let’s take a look


Genome × Phenome Analysis
[Figure: matrix with SNP labels 𝛿1, 𝛿3, 𝛿5 on one axis (SPARSE, Billion+ Genotypes) and phenotype labels ϕ1, ϕ3, ϕ5 on the other (SPARSE, Billion+ Phenotypes)]
For a given population, a given SNP 𝛿, and a given phenotype ϕ: count the number of occurrences as the value of the matrix
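The counting rule for the Genome × Phenome matrix can be sketched as follows. The toy cohort, the identifiers `d1`/`p1` (standing in for 𝛿 and ϕ), and the choice to store only non-zero cells keyed by (SNP, phenotype) are assumptions for illustration; the talk does not specify a storage layout.

```python
from collections import Counter

# Hypothetical toy cohort: each individual carries some SNPs and shows some phenotypes.
cohort = [
    {"snps": {"d1", "d3"}, "phenotypes": {"p1"}},
    {"snps": {"d1"}, "phenotypes": {"p1", "p3"}},
    {"snps": {"d5"}, "phenotypes": {"p5"}},
]

def cooccurrence_counts(individuals):
    """Count, per (SNP, phenotype) pair, how many individuals have both.

    Only non-zero cells are stored, which suits a sparse matrix.
    """
    counts = Counter()
    for person in individuals:
        for snp in person["snps"]:
            for phenotype in person["phenotypes"]:
                counts[(snp, phenotype)] += 1
    return counts

matrix = cooccurrence_counts(cohort)
for (snp, phenotype), n in sorted(matrix.items()):
    print(snp, phenotype, n)
```

Of the nine possible cells for three SNPs and three phenotypes, only four are non-zero here; at population scale the proportion of empty cells is what makes the matrix sparse.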


Associate QTLs to variants via Genome × Phenome Matrix Factorization
[Figure: the 𝛿 × ϕ matrix (𝛿1, 𝛿3, 𝛿5 by ϕ1, ϕ3, ϕ5) annotated with "Archetypal Genotypes (column Eigenvector)", "Archetypal Phenotypes (row Eigenvector)", and "Factorize w/ Spark & MapR"]
•  Row Eigenvectors of X represent
   –  Sets of related phenotypes (by SNP)
•  Column Eigenvectors of Y represent
   –  Sets of related SNPs (by phenotype)
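The slides do not say which factorization is used. As one possible illustration, the sketch below finds the leading singular vectors of a small count matrix by power iteration: one vector weights SNPs, the other weights phenotypes. The matrix values, the rows-as-SNPs orientation, and the function names are assumptions; a production job would use a distributed implementation rather than pure Python.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def leading_singular_triplet(m, iterations=200):
    """Power iteration on (m^T m) to get the top singular value and vectors."""
    mt = transpose(m)
    v = normalize([1.0] * len(m[0]))          # start vector over columns
    for _ in range(iterations):
        v = normalize(matvec(mt, matvec(m, v)))
    mv = matvec(m, v)
    sigma = math.sqrt(sum(x * x for x in mv))  # top singular value
    u = [x / sigma for x in mv]                # matching vector over rows
    return u, sigma, v

# Hypothetical counts: rows are SNPs (d1, d3, d5), columns are phenotypes (p1, p3, p5).
counts = [
    [2.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

u, sigma, v = leading_singular_triplet(counts)
print("SNP weights:      ", [round(x, 3) for x in u])
print("phenotype weights:", [round(x, 3) for x in v])
```

In this toy matrix the leading pair puts its weight on d1 and d3 together with p1 and p3, and essentially none on d5 or p5, which is the sense in which a factor groups related SNPs with related phenotypes.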


Moreover… This is a generalized GWAS: it’s PheWAS
[Figure: the same 𝛿 × ϕ matrix]
NB: These calculations are a mixed I/O workload: they require high-throughput sustained reads and low-latency random access
Proven MapR-DB use case: Aadhaar biometric system, biometrics of 1B humans

Furthermore…


If we change the labels…
[Figure: the same matrix with axes relabeled doc1, doc3, doc5 and user1, user3, user5, plus the labels INTERESTS and BEHAVIORS]
We have the core of Google / Facebook / Twitter Ad Revenue Engine


Spark: Porting Legacy Pipelines


[Figure: DNA Reads and Alignment Reference Sequences go into Align(), which produces Aligned Reads for Downstream Applications…]


Possible Align() Outcomes
[Figure: Unaligned DNA Reads and Reference Sequences go into Align(), which yields Single Location Reads, Multiple Location Reads, and Unlocatable Reads]


Many-to-Many Relationship Between Reads and Locations
Reads:
•  Read1
•  Read2
•  Read3
•  Read4
•  NULL
Locations:
•  LocationA
•  LocationB
•  LocationC
•  LocationD
•  LocationA
•  NULL
•  LocationE
[Figure: lines linking reads to locations]
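The linking lines of the figure are not recoverable from the text, so the pairing below is only one reading that matches the order of the two lists (Read2 at three locations, Read1 and Read3 sharing LocationA, Read4 unplaced, LocationE with no read). Under that assumption, the relationship can be held as (read, location) pairs and indexed both ways; `None` stands in for NULL and the function names are hypothetical.

```python
# Hypothetical pairing consistent with the list order on the slide.
alignments = [
    ("Read1", "LocationA"),
    ("Read2", "LocationB"),
    ("Read2", "LocationC"),
    ("Read2", "LocationD"),
    ("Read3", "LocationA"),
    ("Read4", None),        # read with no location
    (None, "LocationE"),    # location with no read
]

def index_pairs(pairs):
    """Build read -> locations and location -> reads lookups from pairs."""
    by_read, by_location = {}, {}
    for read, location in pairs:
        if read is not None:
            by_read.setdefault(read, [])
        if location is not None:
            by_location.setdefault(location, [])
        if read is not None and location is not None:
            by_read[read].append(location)
            by_location[location].append(read)
    return by_read, by_location

def classify(locations):
    """Map a read's location list onto the three Align() outcomes."""
    if not locations:
        return "unlocatable"
    return "single" if len(locations) == 1 else "multiple"

by_read, by_location = index_pairs(alignments)
outcomes = {read: classify(locs) for read, locs in by_read.items()}
print(outcomes)
print(by_location)
```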


Parallelizing Alignment
[Figure: Unaligned DNA Reads are divided into Part1, Part2, Part3, each aligned against Locations, then combined into Aligned DNA Reads; stages labeled Split(), Align(), Concat(), Sort(), Etc…]
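A minimal single-machine sketch of the Split() / Align() / Concat() / Sort() stages. The reference string, the reads, and the exact-substring "aligner" are toy assumptions (a real aligner tolerates mismatches and reports multiple hits), and a thread pool stands in for cluster workers.

```python
from concurrent.futures import ThreadPoolExecutor

REFERENCE = "ACGTACGTTAGC"  # hypothetical reference sequence

def split(reads, n_parts):
    """Split(): deal reads round-robin into n_parts partitions."""
    return [reads[i::n_parts] for i in range(n_parts)]

def align(part):
    """Align(): toy exact-match search; position -1 means no location found."""
    return [(REFERENCE.find(read), read) for read in part]

def concat(parts):
    """Concat(): flatten per-partition results into one list."""
    return [hit for part in parts for hit in part]

def run_pipeline(reads, n_parts=3):
    parts = split(reads, n_parts)
    # Each partition is aligned independently, so the work can run in parallel.
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        aligned_parts = list(pool.map(align, parts))
    # Sort(): order hits by position on the reference.
    return sorted(concat(aligned_parts))

reads = ["TAGC", "ACGT", "GTTA", "CGTA", "GGGG"]
result = run_pipeline(reads)
print(result)
```

Because Align() only needs its own partition plus the reference, the number of partitions can grow with the data without changing the stage functions.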


Using HPC+SAN has Bottlenecks (GridEngine, Etc)
[Figure: Part1, Part2, Part3 with callouts "Volume Read Bottleneck", "Volume Write Bottleneck", and "Read & Write Bottleneck"]


Using Spark Eliminates Bottlenecks
[Figure: pipeline stages Split(), Align(), Concat(), Sort()]


Bottom Level: Integration with Legacy Tools
[Figure: labels "Local I/O Container" and "Legacy Sub-process"]


•  No time today to look at code, but a more detailed slideshow and a demo repository show this with the Bowtie aligner:
•  http://www.slideshare.net/allenday
•  https://github.com/allenday/spark-genome-alignment-demo
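The general pattern of wrapping a legacy command-line tool as a sub-process, one call per data partition, might look like the sketch below. It is not taken from the linked demo: the "legacy tool" is a stand-in child Python process that upper-cases its input, and `LEGACY_CMD` and `run_legacy_on_partition` are hypothetical names. In a Spark job, a function like this could be applied per partition.

```python
import subprocess
import sys

# Stand-in for a legacy tool: a child process that upper-cases each stdin line.
# A real pipeline would put the aligner's command line here instead.
LEGACY_CMD = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line.upper())",
]

def run_legacy_on_partition(records, cmd=LEGACY_CMD):
    """Feed one partition to the sub-process via stdin and collect stdout lines."""
    payload = "".join(record + "\n" for record in records)
    completed = subprocess.run(
        cmd,
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with a non-zero status
    )
    return completed.stdout.splitlines()

# Hypothetical partitions of reads, each handled by its own sub-process call.
partitions = [["acgt", "ttag"], ["ggcc"]]
outputs = [run_legacy_on_partition(part) for part in partitions]
print(outputs)
```

Keeping the tool's input and output on local pipes means the legacy binary runs unmodified next to the data it processes.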



Thanks! Questions?

@allenday, @mapr


linkedin.com/in/allenday slideshare.net/allenday
