VariantSpark - a Spark library for genomics

VariantSpark: a library for Genomics Transformational Bioinformatics | Denis C. Bauer | @allPowerde Lynn Langit

Upload: lynn-langit

Post on 21-Jan-2018


Category:

Science



TRANSCRIPT

Page 1: VariantSpark - a Spark library for genomics

VariantSpark: a library for Genomics

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Lynn Langit

Page 2: VariantSpark - a Spark library for genomics

“Genomical” Big Data

Page 3: VariantSpark - a Spark library for genomics

Natalie Twine

Transformational Bioinformatics Team


Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson

Adrian White

Mia Champion, Gaetan Burgio

Collaborators

David Levy

News

Software

Dan Andrews

Kaitao Lai

Kaylene Simpson

Iva Nikolic

Ian Blair

Kelly Williams

Page 4: VariantSpark - a Spark library for genomics

BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)

Cited: 4

VariantSpark | Denis C. Bauer @allPowerde

Page 5: VariantSpark - a Spark library for genomics

Unsupervised ML : K-Means

www.cloudaccess.eu

1000 individuals x 40 million variants matrix *

k-means

Predict superpopulation

414 ethnic groups and superpopulations


* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
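The clustering step on this slide can be sketched in a few lines of plain Python. This is only an illustration of k-means on a genotype matrix, not VariantSpark's Spark-based implementation. It assumes genotypes are encoded as alternate-allele counts (0, 1, 2), uses a made-up toy matrix of 6 individuals by 5 variants, and picks initial centroids with a farthest-first rule purely to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two genotype vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centroids(rows, k):
    """Deterministic initialisation (an assumption of this sketch):
    start from the first row, then repeatedly add the row that is
    farthest from all centroids chosen so far."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        farthest = max(
            rows, key=lambda r: min(squared_distance(r, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(rows, k, max_iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_first_centroids(rows, k)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each individual joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [row for row, a in zip(rows, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical toy data: rows are individuals, columns are variants.
genotypes = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [2, 2, 1, 2, 2],
    [2, 1, 2, 2, 2],
    [2, 2, 2, 1, 2],
]

assignment, centroids = kmeans(genotypes, k=2)
print(assignment)
```

On real data the matrix has millions of columns, which is why a distributed implementation is needed; the cluster labels would then be compared against the known superpopulation of each individual.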

Page 6: VariantSpark - a Spark library for genomics

Comparing K-Means Implementations

[Bar chart: time in seconds (axis 0 to 2000) per method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), split by task (binary-conversion, clustering, pre-processing)]

103 75 29 28 18 4 min


Page 7: VariantSpark - a Spark library for genomics

Supervised ML: Wide Random Forests


Page 8: VariantSpark - a Spark library for genomics


Genomic Research Workflow

https://www.projectmine.com/about/

Focus

Page 9: VariantSpark - a Spark library for genomics

Performance – Faster and More Accurate

VariantSpark is the only method to scale to 100% of the genome


Page 10: VariantSpark - a Spark library for genomics

Scaling to 50 M variables and 10 K samples


100K trees: 5 – 50h

AWS: ~$215.50

100K trees: 200 – 2000h

AWS: ~$8620.00

• Yarn Cluster (12 workers)

• 16 x Intel Xeon [email protected] CPU

• 128 GB of RAM

• Spark 1.6.1 on YARN

• 128 executors

• 6GB / executor (0.75TB)

• Synthetic dataset (mtry = 0.25)

Whole Genome Range

GWAS Range
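The numbers on this slide can be cross-checked with a little arithmetic. The sketch below only uses figures quoted above; the variable names are illustrative, and the assumption that the 0.75TB figure counts 1024 GB per TB is mine. The slide does not state an hourly price, so only ratios are compared.

```python
# Memory: 128 executors at 6 GB each.
executors = 128
gb_per_executor = 6
total_gb = executors * gb_per_executor   # total executor memory in GB
total_tb = total_gb / 1024               # assumes 1 TB = 1024 GB

# Cost versus run time for 100K trees: the two quoted AWS costs
# and the two quoted hour ranges differ by the same factor.
cost_ratio = 8620.00 / 215.50
low_hours_ratio = 200 / 5
high_hours_ratio = 2000 / 50

print(total_gb, total_tb)
print(cost_ratio, low_hours_ratio, high_hours_ratio)
```

The matching ratios suggest that the quoted costs scale linearly with run time on the same cluster.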

Page 11: VariantSpark - a Spark library for genomics
Page 12: VariantSpark - a Spark library for genomics

Databricks & VariantSpark via a Jupyter notebook

Page 13: VariantSpark - a Spark library for genomics

Solving Important Questions… Cancer genomics?

Page 14: VariantSpark - a Spark library for genomics

DEMO: Who is a Hipster?

Page 15: VariantSpark - a Spark library for genomics

VariantSpark & Databricks Notebooks

• Quickly access a managed Spark cluster - AWS EC2 / spot instances

• Link to your data and perform whole genome analysis in real-time


Jupyter Notebook


Page 16: VariantSpark - a Spark library for genomics

Joint-loci association test

Hipster-Index = ((2 + GT[B6]) * (1.5 + GT[R1])) + ((0.5 + GT[C2]) * (1 + GT[B2]))

Label = 1 if Hipster-Index>10

[Figure: table with columns "Genomic profile" and "Label"; rows are samples (n=2500)]
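The synthetic label defined on this slide can be computed directly from the formula. The sketch below assumes that each GT value is an alternate-allele count (0, 1 or 2) at the loci named B6, R1, C2 and B2; the sample names and genotypes are made up for illustration and do not come from the demo data.

```python
def hipster_index(gt):
    """Hipster-Index as written on the slide; gt maps locus name to genotype."""
    return ((2 + gt["B6"]) * (1.5 + gt["R1"])) + ((0.5 + gt["C2"]) * (1 + gt["B2"]))


def hipster_label(gt, threshold=10):
    """Label = 1 if the index exceeds the threshold, else 0."""
    return 1 if hipster_index(gt) > threshold else 0


# Hypothetical samples, genotypes encoded as 0/1/2 (an assumption).
samples = {
    "sample_a": {"B6": 0, "R1": 0, "C2": 0, "B2": 0},
    "sample_b": {"B6": 2, "R1": 2, "C2": 2, "B2": 2},
    "sample_c": {"B6": 1, "R1": 1, "C2": 0, "B2": 1},
}

for name, gt in samples.items():
    print(name, hipster_index(gt), hipster_label(gt))
```

Because the index multiplies genotypes at pairs of loci, no single locus determines the label on its own, which is what makes this a joint-loci test case.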


Page 17: VariantSpark - a Spark library for genomics

Try it out: VariantSpark Notebook

https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html

Page 18: VariantSpark - a Spark library for genomics

VariantSpark: a library for Genomics

Transformational Bioinformatics | Denis C. Bauer | @allPowerde

Lynn Langit