Genome Analysis Pipelines, Big Data Style

© 2015 MapR Technologies ® Allen Day, PhD // Chief Scientist @ MapR.com 2016.04.12, Big Data Everywhere

Upload: julius-remigio-cbip

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Genome Analysis Pipelines, Big Data Style

Allen Day, PhD // Chief Scientist @ MapR.com 2016.04.12, Big Data Everywhere

Page 2: Genome Analysis Pipelines, Big Data Style

Agenda

•  Presentation Motivations
   –  Data inertia, data local computing
•  Highlights of BigData solutions ecosystem
   –  MapR, NoSQL, Spark
•  Biotech Analytics Use Cases
   –  Transition from sensors to insights - population DBs
      •  NoSQL performance
   –  Cost savings
      •  NoSQL cost structure
   –  Legacy tools – integration
      •  Spark wrappers

Page 3: Genome Analysis Pipelines, Big Data Style

Data Inertia

•  Newton’s 1st Law of Motion (Law of Inertia)
•  “An object at rest stays at rest … unless acted upon by an unbalanced force”
•  Force required to transport data increases with data size and device latency
   –  CPU < CPU caches < RAM < Disk/SSD < Network

[Diagram labels along the hierarchy: bigger, faster]

Page 4: Genome Analysis Pipelines, Big Data Style

Data Inertia + Exponential Data Growth => Data Local “BigData” Computing

•  Traditional algorithm design moves data to the executing program
   –  High Perf Cluster + Storage Network (HPC+SAN)
•  Key insight – program proportionally much smaller than data, thus easier to move.
•  Modern algorithm design moves executing program to the data
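As a rough illustration of the key insight, the sketch below compares the idealized time to ship a dataset to a program with the time to ship the program to the data. The data size, program size, and 1 Gbit/s link speed are hypothetical values chosen only to make the arithmetic concrete; they are not figures from the talk.

```python
def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Idealized transfer time: size / bandwidth (ignores latency and overhead)."""
    return size_bytes / bandwidth_bytes_per_s


# Hypothetical numbers, for illustration only.
DATA_BYTES = 10 * 10**12         # 10 TB of data
PROGRAM_BYTES = 50 * 10**6       # 50 MB program
LINK_BYTES_PER_S = 125 * 10**6   # 1 Gbit/s link = 125 MB/s

move_data = transfer_seconds(DATA_BYTES, LINK_BYTES_PER_S)
move_program = transfer_seconds(PROGRAM_BYTES, LINK_BYTES_PER_S)

print(f"move data to program: {move_data / 3600:.1f} hours")
print(f"move program to data: {move_program:.1f} seconds")
```

With these assumed values the data takes hours to move while the program takes under a second, which is the asymmetry the slide relies on.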

Page 5: Genome Analysis Pipelines, Big Data Style

Some BigData Tools

What is Spark?
•  Spark is a parallel computing framework that allows a job to run on 1000s of computers as easily as 1. No code changes required.
•  Makes good use of RAM and SSD storage

What is HBase?
•  HBase is a non-relational (NoSQL), distributed database modeled on Google’s BigTable.
•  Provides highly scalable sustained and random access to very large data sets

Page 6: Genome Analysis Pipelines, Big Data Style

MapR Converged Platform for BigData

Page 7: Genome Analysis Pipelines, Big Data Style

Cost-Effective ETL (Novartis)

Page 8: Genome Analysis Pipelines, Big Data Style

The Problem

•  Key step in data ingest for R&D handled by enterprise data warehouse (EDW)
   –  Video, Proteomics, NGS, Metagenomics
•  EDW at maximum capacity
   –  Multiple rounds of software optimization already done
   –  Data still growing
•  Insight limiting (= career limiting) bottleneck

Page 9: Genome Analysis Pipelines, Big Data Style

Three Options

1.  No more insights / candidates
2.  Increase EDW size
    –  Expensive
    –  Known to not scale well
3.  Find a more scalable solution

Page 10: Genome Analysis Pipelines, Big Data Style

Original Flow – ELTL

Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …

[Diagram labels: Extract, Load; Transform, Load; Data Warehouse; Knowledge graph; Downstream Analysis (R&D)]

Page 11: Genome Analysis Pipelines, Big Data Style

Simplified Analysis – EDW Strategy

•  Majority of EDW storage consumed by ELTL processing
   –  Caused by minority of code (raw data transformations)
•  Increasing EDW capacity yields sub-linear performance
   –  poor division of labor

Page 12: Genome Analysis Pipelines, Big Data Style

With ETL Offload

Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …

[Diagram labels: Extract, Load; Transform, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D); MapR]

Page 13: Genome Analysis Pipelines, Big Data Style

Simplified Analysis – MapR Strategy

•  Lower cost per TB of increased ETL capacity by replacing EDW with MapR
•  Scale-out architecture – linear spend gives linear performance increase
•  Strategic advantage – next-gen architecture for implementing new use cases
   –  Insights/time (and career) acceleration

Page 14: Genome Analysis Pipelines, Big Data Style

Additionally…

Raw data:
•  Public/private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …

[Diagram labels: Extract, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D); MapR; Transform, Load]

Page 15: Genome Analysis Pipelines, Big Data Style

New Use Cases are Enabled

Raw data:
•  Public and private
•  Compounds
•  Expression data
•  Genotype data
•  EHR data
•  …

[Diagram labels: Extract, Load; Knowledge graph; Data Warehouse; Downstream Analysis (R&D); New Use Cases; MapR; Transform, Load]

Page 16: Genome Analysis Pipelines, Big Data Style

NoSQL: Scalable Population DBs

Page 17: Genome Analysis Pipelines, Big Data Style

Catalog genetic variants => find QTLs

•  Current public human cohort proposals: 100K-1M individuals, >400% CAGR
•  Seed and livestock companies, same trend
•  Px/Dx biomarkers for PGx, reproductive medicine, biometrics, etc.
•  Idea is to catalog genetic variants, find QTLs
•  Well studied problem, let’s take a look

Page 18: Genome Analysis Pipelines, Big Data Style

Genome × Phenome Analysis

[Figure: sparse matrix with rows 𝛿1, 𝛿3, 𝛿5 (axis label: SPARSE Billion + Genotypes) and columns ϕ1, ϕ3, ϕ5 (axis label: SPARSE Billion + Phenotypes)]

For given population, given SNP 𝛿, and given phenotype ϕ: count the number of occurrences as the value of the matrix
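The counting rule above can be sketched as follows. This is one reading of the slide, assuming each matrix cell counts the individuals who carry SNP 𝛿 and show phenotype ϕ; the function name, the input layout, and the toy population are hypothetical. A dictionary keyed by (SNP, phenotype) keeps the matrix sparse, since absent pairs are never stored.

```python
from collections import Counter


def build_count_matrix(individuals):
    """Return a sparse {(snp, phenotype): count} mapping.

    `individuals` is an iterable of (snps, phenotypes) pairs, one per person,
    where each element is the set of labels observed for that person.
    """
    counts = Counter()
    for snps, phenotypes in individuals:
        for snp in snps:
            for phenotype in phenotypes:
                counts[(snp, phenotype)] += 1
    return counts


# Hypothetical toy population: d* stand for SNPs, p* for phenotypes.
population = [
    ({"d1", "d3"}, {"p1"}),
    ({"d1"}, {"p1", "p5"}),
    ({"d5"}, {"p3"}),
]

matrix = build_count_matrix(population)
print(matrix[("d1", "p1")])  # two individuals carry d1 and show p1
print(matrix[("d5", "p1")])  # pair never observed: implicit zero
```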

Page 19: Genome Analysis Pipelines, Big Data Style

Associate QTLs to variants via Genome × Phenome Matrix Factorization

[Figure: genome × phenome matrix (rows 𝛿1, 𝛿3, 𝛿5; columns ϕ1, ϕ3, ϕ5), annotated Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector); Factorize w/ Spark & MapR]

•  Row Eigenvectors of X represent
   –  Sets of related phenotypes (by SNP)
•  Column Eigenvectors of Y represent
   –  Sets of related SNPs (by phenotype)
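The slide does not define X and Y or name the factorization used. One common way to write such a factorization, assumed here rather than taken from the talk, is a rank-k singular value decomposition of the count matrix M (m SNP rows, n phenotype columns):

```latex
M \approx U_k \Sigma_k V_k^{\top}, \qquad
M \in \mathbb{R}^{m \times n},\quad
U_k \in \mathbb{R}^{m \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k),\quad
V_k \in \mathbb{R}^{n \times k}

M M^{\top} u_i = \sigma_i^{2}\, u_i, \qquad
M^{\top} M\, v_i = \sigma_i^{2}\, v_i, \qquad i = 1, \dots, k
```

Under this reading, each u_i holds one weight per SNP and so describes a set of related SNPs (an archetypal genotype), while each v_i holds one weight per phenotype and describes a set of related phenotypes (an archetypal phenotype).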

Page 20: Genome Analysis Pipelines, Big Data Style

[Figure: genome × phenome matrix (rows 𝛿1, 𝛿3, 𝛿5; columns ϕ1, ϕ3, ϕ5), annotated Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector)]

Moreover… This is a generalized GWAS

Page 21: Genome Analysis Pipelines, Big Data Style

[Figure: genome × phenome matrix (rows 𝛿1, 𝛿3, 𝛿5; columns ϕ1, ϕ3, ϕ5), annotated Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector)]

Moreover… This is a generalized GWAS

it’s PheWAS

Page 22: Genome Analysis Pipelines, Big Data Style

[Figure: genome × phenome matrix (rows 𝛿1, 𝛿3, 𝛿5; columns ϕ1, ϕ3, ϕ5), annotated Archetypal Genotypes (column Eigenvector) and Archetypal Phenotypes (row Eigenvector)]

Moreover… This is a generalized GWAS

it’s PheWAS

NB: These calculations are a mixed I/O workload – they require high-throughput sustained reads and low-latency random access

Proven MapR-DB use case: Aadhaar biometric system, biometrics for 1B humans

Page 23: Genome Analysis Pipelines, Big Data Style

[Figure: matrix with rows 𝛿1, 𝛿3, 𝛿5 and columns ϕ1, ϕ3, ϕ5]

Furthermore…

Page 24: Genome Analysis Pipelines, Big Data Style

[Figure: the same matrix relabeled, with rows doc1, doc3, doc5 and columns user1, user3, user5]

If we change the labels…

Page 25: Genome Analysis Pipelines, Big Data Style

[Figure: matrix with rows doc1, doc3, doc5 and columns user1, user3, user5, annotated INTERESTS and BEHAVIORS]

We have the core of Google / Facebook / Twitter Ad Revenue Engine

Page 27: Genome Analysis Pipelines, Big Data Style

Spark: Porting Legacy Pipelines

Page 28: Genome Analysis Pipelines, Big Data Style

Alignment

[Diagram labels: Reference Sequences; DNA Reads; Aligned Reads; Downstream Applications…]

Page 29: Genome Analysis Pipelines, Big Data Style

Alignment

[Diagram labels: Reference Sequences; DNA Reads; Align(); Aligned Reads; Downstream Applications…]

Page 30: Genome Analysis Pipelines, Big Data Style

Possible Align() Outcomes

[Diagram labels: Unaligned DNA Reads; Reference Sequences; Align(); Single Location Reads; Multiple Location Reads; Unlocatable Reads]
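The three outcomes can be illustrated with a toy aligner. This is a minimal sketch, assuming exact substring matching against a single short reference; real aligners tolerate mismatches and use indexes, and the `align` and `classify` functions and the sequences below are hypothetical.

```python
def align(read: str, reference: str) -> list[int]:
    """Toy aligner: every 0-based offset where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def classify(positions: list[int]) -> str:
    """Map a list of hit offsets to one of the three outcomes on the slide."""
    if not positions:
        return "unlocatable"
    if len(positions) == 1:
        return "single location"
    return "multiple location"


reference = "ACGTACGTTTGA"
for read in ["TTGA", "ACGT", "GGGG"]:
    hits = align(read, reference)
    print(read, hits, classify(hits))
```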

Page 31: Genome Analysis Pipelines, Big Data Style

Many-to-Many Relationship Between Reads and Locations

[Diagram: a column of reads linked to a column of locations]

Reads:
•  Read1
•  Read2
•  Read3
•  Read4
•  NULL

Locations:
•  LocationA
•  LocationB
•  LocationC
•  LocationD
•  LocationA
•  NULL
•  LocationE
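A many-to-many relationship like this is often stored as a list of (read, location) pairs. The sketch below uses that layout; the specific pairing of reads to locations is hypothetical, because the links drawn on the slide are not recoverable from the transcript. `None` plays the role of NULL on either side.

```python
from collections import defaultdict

# Hypothetical pairing; only the labels come from the slide.
pairs = [
    ("Read1", "LocationA"),
    ("Read2", "LocationB"),
    ("Read2", "LocationC"),   # one read, several locations
    ("Read3", "LocationD"),
    ("Read3", "LocationA"),   # LocationA is shared by Read1 and Read3
    ("Read4", None),          # read with no location
    (None, "LocationE"),      # location with no read
]

locations_by_read = defaultdict(list)
reads_by_location = defaultdict(list)
for read, location in pairs:
    if read is not None and location is not None:
        locations_by_read[read].append(location)
        reads_by_location[location].append(read)

unlocated_reads = [read for read, location in pairs if location is None]
uncovered_locations = [location for read, location in pairs if read is None]

print(dict(locations_by_read))
print(dict(reads_by_location))
print(unlocated_reads, uncovered_locations)
```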

Page 32: Genome Analysis Pipelines, Big Data Style

Parallelizing Alignment

[Diagram labels: Unaligned DNA Reads; Split(); Part1, Part2, Part3; Align(); Locations (one per part); Concat(); Sort(); Etc…; Aligned DNA Reads]
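The Split(), Align(), Concat(), Sort() stages can be sketched on one machine as follows. This is only an illustration under stated assumptions: a thread pool stands in for cluster workers, the aligner is a toy exact matcher that reports the first hit only, and all function names and sequences are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def split(reads, n_parts):
    """Split(): deal reads round-robin into n_parts lists."""
    return [reads[i::n_parts] for i in range(n_parts)]


def align_part(part, reference):
    """Align(): toy exact matcher; returns (offset, read) for reads that are found."""
    hits = []
    for read in part:
        offset = reference.find(read)
        if offset != -1:
            hits.append((offset, read))
    return hits


def parallel_align(reads, reference, n_parts=3):
    parts = split(reads, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        per_part = list(pool.map(lambda p: align_part(p, reference), parts))
    merged = list(chain.from_iterable(per_part))  # Concat()
    return sorted(merged)                         # Sort() by reference offset


reference = "ACGTACGTTTGA"
reads = ["TTGA", "CGTA", "ACGT", "GGGG", "GTTT"]
print(parallel_align(reads, reference))
```

Each part is aligned independently, so the per-part work needs no coordination until the results are concatenated and sorted.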

Page 33: Genome Analysis Pipelines, Big Data Style

Using HPC+SAN has Bottlenecks (GridEngine, Etc)

[Diagram labels: Part1, Part2, Part3; Volume Read Bottleneck; Volume Write Bottleneck; Read & Write Bottleneck]

Page 34: Genome Analysis Pipelines, Big Data Style

Using Spark Eliminates Bottlenecks

[Diagram labels: Align(); Concat(); Sort(); Split()]

Page 35: Genome Analysis Pipelines, Big Data Style

Bottom Level: Integration with Legacy Tools

[Diagram labels: Local I/O Container; Legacy Sub-process]

Page 37: Genome Analysis Pipelines, Big Data Style

Bottom Level: Integration with Legacy Tools

•  No time today to look at code, but a deeper slideshow of doing this with Bowtie aligner:
   •  http://www.slideshare.net/allenday
   •  https://github.com/allenday/spark-genome-alignment-demo

[Diagram labels: Local I/O Container; Legacy Sub-process]
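The general shape of wrapping a legacy tool as a sub-process can be sketched as below: one partition's records go to the tool on standard input and its standard output is read back. This is not the code from the linked demo. The `run_legacy_tool` helper is hypothetical, and a small Python one-liner stands in for the legacy binary so the example runs anywhere; a real pipeline would substitute the aligner's own command line, which is not shown in the slides.

```python
import subprocess
import sys


def run_legacy_tool(command, records):
    """Send one partition's records to a sub-process on stdin; return its stdout lines."""
    completed = subprocess.run(
        command,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the tool exits with a non-zero status
    )
    return completed.stdout.splitlines()


# Stand-in for a legacy binary: upper-cases whatever it reads from stdin.
FAKE_LEGACY_TOOL = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]

partition = ["acgt", "ttga"]
print(run_legacy_tool(FAKE_LEGACY_TOOL, partition))
```

Spark's RDD `pipe` operation follows the same stdin/stdout pattern per partition; whether the talk's demo uses it is not stated in these slides.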

Page 38: Genome Analysis Pipelines, Big Data Style

Thanks! Questions?

@allenday, @mapr

[email protected]

linkedin.com/in/allenday slideshare.net/allenday