variantspark - a spark library for genomics

VariantSpark: a library for Genomics

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Lynn Langit

“Genomical” Big Data

Natalie Twine

Transformational Bioinformatics Team


Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson

Adrian White

Mia Champion, Gaetan Burgio

Collaborators

David Levy

News

Software

Dan Andrews

Kaitao Lai

Kaylene Simpson

Iva Nikolic

Ian Blair

Kelly Williams

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

Cited: 4

VariantSpark | Denis C. Bauer @allPowerde

Unsupervised ML : K-Means

www.cloudaccess.eu

1000 x 40 Million variants

Matrix *

k-means

Predict superpopulation

414 ethnic groups and superpopulations


* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
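
The clustering step on this slide can be illustrated with a minimal, single-machine sketch of k-means on a tiny genotype matrix. It is not VariantSpark's implementation: the toy data, the 0/1/2 genotype encoding, the function names, and the deterministic farthest-first initialisation are all assumptions made for illustration.

```python
# Toy k-means (Lloyd's algorithm) on a samples x variants genotype matrix.
# Assumption: genotypes are encoded as 0, 1 or 2 alternate alleles.

def squared_distance(a, b):
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first_centroids(rows, k):
    """Deterministic initialisation (an assumption of this sketch):
    start from the first row, then repeatedly add the row that is
    farthest from all centroids chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(
            rows,
            key=lambda r: min(squared_distance(r, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids

def kmeans(rows, k, iterations=10):
    """Return (labels, centroids) after a fixed number of iterations."""
    centroids = farthest_first_centroids(rows, k)
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Six hypothetical individuals, six variants, two obvious groups.
genotypes = [
    [0, 0, 0, 0, 2, 2],
    [0, 0, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 1],
    [2, 2, 2, 2, 0, 0],
    [2, 2, 1, 2, 0, 0],
    [2, 1, 2, 2, 0, 0],
]

labels, centroids = kmeans(genotypes, k=2)
print(labels)
```

At the scale quoted on the slide the matrix does not fit on one machine, which is why the distance and update steps are distributed over Spark rather than run in a loop like this.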

Comparing K-Means Implementations

[Figure: run time in seconds (axis 0 to 2000) by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing). Bar labels: 103, 75, 29, 28, 18, 4 min]


Supervised ML: Wide Random Forests



Genomic Research Workflow

https://www.projectmine.com/about/

Focus

Performance: Faster and More Accurate

VariantSpark is the only method to scale to 100% of the genome


Scaling to 50 M variables and 10 K samples


100K trees: 5 – 50h

AWS: ~$215.50

100K trees: 200 – 2000h

AWS: ~$8620.00

• YARN cluster (12 workers)
• 16 x Intel Xeon [email protected] CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6 GB / executor (0.75 TB)
• Synthetic dataset (mtry = 0.25)
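
The figures quoted above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the slide's numbers; the variable names are invented here, and the extracted text does not say which method each cost estimate belongs to, so the sketch simply compares the two.

```python
# Executor memory: 128 executors at 6 GB each.
executors = 128
memory_per_executor_gb = 6
total_memory_gb = executors * memory_per_executor_gb
total_memory_tb = total_memory_gb / 1024  # binary terabytes

# The two quoted AWS estimates and run-time ranges for 100K trees.
cost_low_usd = 215.50
cost_high_usd = 8620.00
hours_low = (5, 50)
hours_high = (200, 2000)

cost_ratio = cost_high_usd / cost_low_usd
hour_ratios = [slow / fast for fast, slow in zip(hours_low, hours_high)]

print(total_memory_gb, total_memory_tb)
print(cost_ratio, hour_ratios)
```

The memory total matches the 0.75 TB on the slide, and the cost ratio equals the run-time ratio at both ends of the range, which suggests the two estimates assume the same hourly cluster price.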

Whole Genome Range

GWAS Range

Databricks & VariantSpark via a Jupyter notebook

Solving Important Questions… Cancer genomics?

DEMO: Who is a Hipster?

• Quickly access a managed Spark cluster - AWS EC2 / spot instances

• Link to your data and perform whole genome analysis in real-time

VariantSpark & Databricks Notebooks


Jupyter Notebook


Joint-loci association test

Hipster-Index = ((2 + GT[B6]) * (1.5 + GT[R1])) + ((0.5 + GT[C2]) * (1 + GT[B2]))

Label = 1 if Hipster-Index > 10
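
As a worked example of the synthetic phenotype, the formula and threshold can be written out directly. This sketch assumes each GT value is an alternate-allele count of 0, 1 or 2 and that a sample's genotypes are held in a dictionary keyed by the four variant names; both are illustrative assumptions, not details taken from the demo notebook.

```python
def hipster_index(gt):
    """Hipster-Index exactly as written on the slide.

    gt maps the variant names B6, R1, C2 and B2 to genotype values
    (assumed here to be 0, 1 or 2 alternate alleles).
    """
    return ((2 + gt["B6"]) * (1.5 + gt["R1"])) + ((0.5 + gt["C2"]) * (1 + gt["B2"]))

def hipster_label(gt, threshold=10):
    """Label = 1 if Hipster-Index > threshold, else 0."""
    return 1 if hipster_index(gt) > threshold else 0

# Three hypothetical samples.
no_alt = {"B6": 0, "R1": 0, "C2": 0, "B2": 0}
all_alt = {"B6": 2, "R1": 2, "C2": 2, "B2": 2}
mixed = {"B6": 1, "R1": 1, "C2": 0, "B2": 1}

for sample in (no_alt, all_alt, mixed):
    print(hipster_index(sample), hipster_label(sample))
```

Because the index multiplies genotypes within each pair (B6 with R1, C2 with B2), no single variant determines the label on its own, which is what makes this a joint-loci test case.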

Genomic profile Label

Samples (n=2500)


Try it out: VariantSpark Notebook

https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html

