VariantSpark - a Spark library for genomics
![Page 1: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/1.jpg)
VariantSpark: a library for Genomics
Transformational Bioinformatics | Denis C. Bauer | @allPowerde
Lynn Langit
![Page 2: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/2.jpg)
“Genomical” Big Data
![Page 3: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/3.jpg)
Natalie Twine
Transformational Bioinformatics Team
Denis Bauer, Oscar Luo, Rob Dunne, Piotr Szul, Aidan O’Brien, Laurence Wilson
Adrian White
Mia Champion, Gaetan Burgio
Collaborators
David Levy
News
Software
Dan Andrews
Kaitao Lai
Kaylene Simpson
Iva Nikolic
Ian Blair
Kelly Williams
![Page 4: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/4.jpg)
BMC Genomics 2015, 16:1052 PMID: 26651996 (IF=4)
Cited: 4
VariantSpark | Denis C. Bauer @allPowerde
![Page 5: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/5.jpg)
Unsupervised ML : K-Means
www.cloudaccess.eu
1000 x 40 million variants matrix * → k-means → predict superpopulation
414 ethnic groups and superpopulations
* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
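As a rough illustration of the clustering step on this slide, the sketch below runs plain k-means (Lloyd's algorithm) on a tiny genotype matrix. It is not VariantSpark's implementation: the 0/1/2 genotype encoding, the toy data, the `kmeans` helper, and the choice to seed centroids from the first `k` rows are all assumptions made to keep the example small and deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20):
    """Lloyd's algorithm; returns one cluster index per row.

    Assumption: centroids start at the first k rows so the run is
    deterministic. Real implementations usually use random or
    k-means++ initialisation.
    """
    centroids = [list(r) for r in rows[:k]]
    assignment = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach each individual to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [rows[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy matrix: rows are individuals, columns are variants,
# values are alternate-allele counts (0, 1 or 2). Hypothetical data.
genotypes = [
    [0, 0, 0, 2, 2, 2],
    [0, 0, 1, 2, 2, 1],
    [0, 1, 0, 2, 1, 2],
    [2, 2, 2, 0, 0, 0],
    [2, 1, 2, 0, 0, 1],
    [1, 2, 2, 0, 1, 0],
]

labels = kmeans(genotypes, k=2)
print(labels)
```

In the setting described on the slide, each row would be one of the 1000 individuals and each column one of the 40 million variants, with the resulting clusters compared against known superpopulation labels.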
![Page 6: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/6.jpg)
Comparing K-Means Implementations
[Figure: bar chart of time in seconds by method (Python, R, Hadoop, Adam, ADMIXTURE, VariantSpark), broken down by task (binary-conversion, clustering, pre-processing). Bar labels: 103, 75, 29, 28, 18, 4 min]
![Page 7: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/7.jpg)
Supervised ML: Wide Random Forests
![Page 8: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/8.jpg)
Genomic Research Workflow
https://www.projectmine.com/about/
Focus
![Page 9: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/9.jpg)
Performance – Faster and More Accurate
VariantSpark is the only method to scale to 100% of the genome
![Page 10: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/10.jpg)
Scaling to 50 M variables and 10 K samples
100K trees: 5 – 50 h
AWS: ~$215.50
100K trees: 200 – 2000 h
AWS: ~$8620.00
• YARN cluster (12 workers)
• 16 x Intel Xeon [email protected] CPU
• 128 GB of RAM
• Spark 1.6.1 on YARN
• 128 executors
• 6 GB / executor (0.75 TB)
• Synthetic dataset (mtry = 0.25)
Whole Genome Range
GWAS Range
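The figures on this slide can be cross-checked with a little arithmetic. The sketch below only re-derives numbers already quoted: the total executor memory, and the ratio between the two runtime ranges and the two AWS estimates. It assumes each AWS figure belongs to the runtime range printed directly above it, which the extracted text suggests but does not state.

```python
# Cluster memory: 128 executors at 6 GB each.
executors = 128
gb_per_executor = 6
total_gb = executors * gb_per_executor   # 768 GB
total_tb = total_gb / 1024               # matches the quoted 0.75 TB

# Runtime ranges (hours) and AWS estimates (USD) as quoted on the slide.
# Assumption: the pairing follows the order in which they appear.
short_hours, short_cost = (5, 50), 215.50
long_hours, long_cost = (200, 2000), 8620.00

# How much longer and how much more expensive the larger range is.
runtime_ratios = [long_h / short_h for short_h, long_h in zip(short_hours, long_hours)]
cost_ratio = long_cost / short_cost

print(total_tb, runtime_ratios, cost_ratio)
```

Under that pairing, both the runtime and the estimated cost grow by the same factor, which is consistent with a cost estimate that scales linearly with cluster hours.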
![Page 11: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/11.jpg)
![Page 12: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/12.jpg)
Databricks & VariantSpark via a Jupyter notebook
![Page 13: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/13.jpg)
Solving Important Questions… Cancer genomics?
![Page 14: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/14.jpg)
DEMO: Who is a Hipster?
![Page 15: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/15.jpg)
VariantSpark & Databricks Notebooks
• Quickly access a managed Spark cluster - AWS EC2 / spot instances
• Link to your data and perform whole genome analysis in real time
Jupyter Notebook
![Page 16: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/16.jpg)
Joint-loci association test
Hipster-Index = ((2 + GT[B6]) * (1.5 + GT[R1])) + ((0.5 + GT[C2]) * (1 + GT[B2]))
Label = 1 if Hipster-Index>10
[Figure: genomic profile and label for each sample; samples (n=2500)]
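The Hipster-Index formula on this slide translates directly into code. The sketch below is a minimal illustration, assuming each `GT[...]` value is a genotype encoded as 0, 1 or 2; the slide does not state the encoding, and the function names and example profiles are hypothetical.

```python
def hipster_index(gt):
    """Hipster-Index as written on the slide.

    gt maps the four loci (B6, R1, C2, B2) to a genotype value,
    assumed here to be 0, 1 or 2.
    """
    return ((2 + gt["B6"]) * (1.5 + gt["R1"])) + ((0.5 + gt["C2"]) * (1 + gt["B2"]))


def hipster_label(gt, threshold=10):
    """Label = 1 if Hipster-Index > 10, else 0."""
    return 1 if hipster_index(gt) > threshold else 0


# Hypothetical genomic profiles for three samples.
samples = [
    {"B6": 0, "R1": 0, "C2": 0, "B2": 0},
    {"B6": 1, "R1": 1, "C2": 1, "B2": 1},
    {"B6": 2, "R1": 2, "C2": 2, "B2": 2},
]

for gt in samples:
    print(gt, hipster_index(gt), hipster_label(gt))
```

Because the index multiplies loci together, the label depends on combinations of genotypes rather than on any single locus, which is what makes it a test case for a joint-loci association method.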
![Page 17: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/17.jpg)
Try it out: VariantSpark Notebook
https://databricks.com/blog/2017/07/26/breaking-the-curse-of-dimensionality-in-genomics-using-wide-random-forests.html
![Page 18: VariantSpark - a Spark library for genomics](https://reader033.vdocuments.site/reader033/viewer/2022050614/5a64799b7f8b9a31568b479d/html5/thumbnails/18.jpg)
VariantSpark: a library for Genomics
Lynn Langit