a workflow-driven discovery and training ecosystem for distributed analysis of biomedical big data
![Page 1: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/1.jpg)
A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data
İlkay ALTINTAŞ, Ph.D.
Chief Data Science Officer, San Diego Supercomputer Center
Founder and Director, Workflows for Data Science Center of Excellence
![Page 2: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/2.jpg)
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego
Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on “Big Data”, “versatile computing”, and “life sciences applications”
1985
today
Two discoveries in drug design from 1987 and 1991.
![Page 3: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/3.jpg)
Ross Walker Group
SDSC continues to be a leader in scientific computing and big data!
Gordon: First Flash-based Supercomputer for Data-intensive Apps
Comet: Serving the Long Tail of Science
27 standard racks = 1944 nodes = 46,656 cores = 249 TB DRAM = 622 TB SSD
~ 2 Pflop/s
• 36 GPU nodes
• 4 Large Memory nodes
• 7 PB Lustre storage
• High performance virtualization
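The system-wide totals for Comet imply per-node figures by simple division. The short check below uses only the totals listed on the slide; the per-node and per-rack values it prints are derived here, not quoted from the slide, and it assumes 1 TB = 1000 GB.

```python
# Totals as listed on the Comet slide.
RACKS = 27
NODES = 1944
CORES = 46_656
DRAM_TB = 249
SSD_TB = 622


def per_node(total, nodes=NODES):
    """Divide a system-wide total evenly across all nodes."""
    return total / nodes


cores_per_node = CORES // NODES              # integer division; exact here
nodes_per_rack = NODES // RACKS              # integer division; exact here
dram_gb_per_node = per_node(DRAM_TB * 1000)  # assumption: 1 TB = 1000 GB
ssd_gb_per_node = per_node(SSD_TB * 1000)    # assumption: 1 TB = 1000 GB

print("cores per node:", cores_per_node)
print("nodes per rack:", nodes_per_rack)
print("DRAM per node (GB, rounded):", round(dram_gb_per_node))
print("SSD per node (GB, rounded):", round(ssd_gb_per_node))
```

The rounded DRAM and SSD values are approximate because the slide's totals are themselves rounded.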
![Page 4: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/4.jpg)
SDSC Data Science Office
-- Expertise, Systems and Training for Data Science Applications --
SDSC Data Science Office (DSO)
SDSC DSO is a collaborative virtual organization at SDSC for collective lasting innovation in data science research, development and education.
DSO
SDSC Expertise and Strengths
• Big Data Platforms
• Training
• Industry
• Applications
![Page 5: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/5.jpg)
Life Sciences is an ongoing strategic application thrust at SDSC…
![Page 6: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/6.jpg)
Genomic Analysis is a Big Data and Big Compute Problem
BIG DATA
COMPUTING AT SCALE
Enables dynamic data-driven applications:
• Computer-Aided Drug Discovery
• Personalized Precision Medicine
• Vaccine Development
• Metagenomics
• …
Requires:
• Data management
• Data-driven methods
• Scalable tools for dynamic coordination and resource optimization
• Skilled interdisciplinary workforce
Team work and process management
![Page 7: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/7.jpg)
New era of data science!
Needs and Trends for the New Era Data Science
-- the Big Data Era Goals --
• More data-driven
• More dynamic
• More process-driven
• More collaborative
• More accountable
• More reproducible
• More interactive
• More heterogeneous
![Page 8: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/8.jpg)
• Volume: Scalable batch processing
• Velocity: Stream processing
• Variety: Extensible data storage, access and integration
Genomic Data Management and Processing in the Big Data Era has Unique Challenges!
![Page 9: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/9.jpg)
HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink
Lower levels: Storage and scheduling
Higher levels: Interactivity
These challenges push for new tools to tackle them.
![Page 10: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/10.jpg)
COORDINATION AND WORKFLOW MANAGEMENT
DATA INTEGRATION AND PROCESSING
DATA MANAGEMENT AND STORAGE
How do we use these new tools and combine them with existing domain-specific solutions in scientific computing and data science?
![Page 11: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/11.jpg)
Layer 1: Data Management and Storage
![Page 12: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/12.jpg)
Layer 2: Data Integration and Processing
HBase, Hive, Pig, Zookeeper, Giraph, Storm, Spark, MapReduce, YARN, MongoDB, Cassandra, HDFS, Flink
+ Application-specific libraries
![Page 13: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/13.jpg)
Most of the time, more than one analysis needs to take place…
And each analysis has multiple steps to integrate!
![Page 14: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/14.jpg)
Pipelining is a way to put the steps together.
Source: http://www.slideshare.net/BigDataCloud/big-data-analytics-with-google-cloud-platform
Source: https://www.mapr.com/blog/distributed-stream-and-graph-processing-apache-flink
Source: https://www.computer.org/csdl/mags/so/2016/02/mso2016020060.html
Source: http://www.slideshare.net/ThoughtWorks/big-data-pipeline-with-scala
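As a minimal sketch of what pipelining means here, the steps of an analysis can be modeled as functions chained so that each one consumes the previous step's output. The step names (`trim`, `drop_short`, `count_bases`) and the toy reads are hypothetical, chosen only for illustration; they are not taken from the slides or from any specific tool.

```python
from functools import reduce


def pipeline(*steps):
    """Compose steps left to right into one callable."""
    def run(data):
        return reduce(lambda acc, step: step(acc), steps, data)
    return run


# Hypothetical steps operating on toy DNA reads.
def trim(reads):
    """Strip surrounding whitespace and normalize to upper case."""
    return [r.strip().upper() for r in reads]


def drop_short(reads, min_len=4):
    """Keep only reads with at least min_len bases."""
    return [r for r in reads if len(r) >= min_len]


def count_bases(reads):
    """Count how often each base occurs across all reads."""
    counts = {}
    for read in reads:
        for base in read:
            counts[base] = counts.get(base, 0) + 1
    return counts


analyze = pipeline(trim, drop_short, count_bases)
result = analyze([" acgt ", "ac", "ggta\n"])
print(result)
```

Because each step only depends on its input, steps can be swapped, reordered, or reused in other pipelines without changing the rest.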
![Page 15: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/15.jpg)
Layer 3: Coordination and Workflow Management
![Page 16: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/16.jpg)
COORDINATION AND WORKFLOW MANAGEMENT
ACQUIRE, PREPARE, ANALYZE, REPORT, ACT…
kepler-project.org
![Page 17: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/17.jpg)
Workflows for Data Science Center of Excellence at SDSC
Building functional, operational and reproducible solution architectures using big data and HPC tools is what we do.
Focus on the question, not the technology!
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
WorDS.sdsc.edu
![Page 18: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/18.jpg)
bioKepler: A Kepler Module for Bio Big Data Analysis
Data-Parallel Bioinformatics: bioKepler.org
![Page 19: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/19.jpg)
Source: Larry Smarr, Calit2
Example from 2013: Inflammatory Bowel Disease (IBD)
Illumina HiSeq 2000 at JCVI
SDSC Gordon Data Supercomputer
• Metagenomic Sequencing
  • JCVI Produced
    • ~150 Billion DNA Bases From Seven of LS Stool Samples Over 1.5 Years
  • ~3 Trillion DNA Bases From NIH Human Microbiome Program Data Base
    • 255 Healthy People, 21 with IBD
• Supercomputing (W. Li, JCVI/HLI/UCSD):
  • ~20 CPU-Years on SDSC’s Gordon
  • ~4 CPU-Years on Dell’s HPC Cloud
• Produced Relative Abundance of
  • ~10K Bacteria, Archaea, Viruses in ~300 People
  • ~3 Million Filled Spreadsheet Cells
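Relative abundance, the quantity tabulated in this example, is conventionally a taxon's count divided by the total count in the same sample. The sketch below assumes that definition; the sample names, taxa and counts are invented for illustration and are not data from the study.

```python
def relative_abundance(counts):
    """Convert raw per-taxon counts for one sample into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        # An empty sample has no meaningful proportions; report zeros.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical read counts per taxon, one dictionary per person/sample.
samples = {
    "sample_A": {"Bacteroides": 600, "Escherichia": 300, "Methanobrevibacter": 100},
    "sample_B": {"Bacteroides": 50, "Escherichia": 150, "Methanobrevibacter": 0},
}

# One row per sample, one column per taxon: the "spreadsheet" layout.
table = {name: relative_abundance(counts) for name, counts in samples.items()}

for name, row in table.items():
    print(name, {taxon: round(value, 2) for taxon, value in row.items()})
```

With roughly 10K taxa and 300 people, a table of this shape has on the order of 3 million cells, which matches the figure on the slide.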
![Page 20: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/20.jpg)
Ongoing Research: Optimization of Heterogeneous Resource Utilization using bioKepler
• National Resources: (Gordon) (Comet) (Stampede) (Lonestar)
• Cloud Resources
• Local Cluster Resources
Optimized
Uses existing genomics tools and computing systems!
Computing is just one part of it…
…new methods needed!
![Page 21: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/21.jpg)
Needs of a Dynamic Ecosystem of Genomic Discovery
• Exploratory methods to see temporal changes and patterns in sequence data
• Efficient updates to analysis as quickly as new sequence data gets generated
• Regular reruns of annotations as reference databases evolve
• Integration of genomic data with other types of data, e.g., image, environmental, social graphs
• Dynamic ability to check quality and provenance of data and analysis
• Transparent support for computing platforms designed for genomic discovery and pattern analysis
• Workflow coordination and system integration
• People and culture to make it happen collaboratively!
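One way to support regular reruns of annotations as reference databases evolve is to store a fingerprint of the reference data with each result and rerun only when that fingerprint changes. This is a minimal sketch under the assumption that the reference database can be read as bytes; `AnnotationCache`, `annotate` and the toy database contents are hypothetical placeholders, not part of any tool named in the slides.

```python
import hashlib


def fingerprint(reference_bytes):
    """Return a stable digest identifying one version of the reference data."""
    return hashlib.sha256(reference_bytes).hexdigest()


class AnnotationCache:
    """Cache annotations and recompute them only when the reference changes."""

    def __init__(self, annotate):
        self._annotate = annotate
        self._results = {}  # sequence -> (reference fingerprint, annotation)
        self.runs = 0       # number of times the annotation actually ran

    def get(self, sequence, reference_bytes):
        fp = fingerprint(reference_bytes)
        cached = self._results.get(sequence)
        if cached is not None and cached[0] == fp:
            return cached[1]  # reference unchanged: reuse the old result
        self.runs += 1
        annotation = self._annotate(sequence, reference_bytes)
        self._results[sequence] = (fp, annotation)
        return annotation


def annotate(sequence, reference_bytes):
    """Toy stand-in: report whether the sequence occurs in the reference."""
    return "known" if sequence.encode() in reference_bytes else "unknown"


cache = AnnotationCache(annotate)
ref_v1 = b"ACGTACGT"
ref_v2 = b"ACGTACGTTTGA"             # a later release with one more entry

first = cache.get("TTGA", ref_v1)    # computed
again = cache.get("TTGA", ref_v1)    # served from the cache
updated = cache.get("TTGA", ref_v2)  # reference changed, so recomputed
print(first, again, updated, cache.runs)
```

Hashing the whole database is only practical for small references; a release tag or version number would serve the same purpose for large ones.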
![Page 22: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/22.jpg)
Examples from 2016: Apache Big Data Technologies in Life Sciences
• Lightning Fast Genomics with ADAM
  • Goal
    • Study genetic variations in populations at scale (e.g., 1000 Genomes Project)
  • Technology stack
    • Apache Avro (data serialization, schema definition)
    • Apache Parquet (compact columnar storage)
    • Apache Spark (distributed parallel processing)
    • Spark MLlib (machine learning, clustering)
  • Source: AMPLab, UC Berkeley (http://bdgenomics.org/)
• Compressive Structural Bioinformatics using MMTF
  • Goal
    • 100+ speedup of large-scale 3D structural analysis of the Protein Data Bank (PDB)
  • Technology stack
    • MMTF (Macromolecular Transmission Format, compact storage in Hadoop Sequence Files)
    • Apache Spark (in-memory, parallel distributed workflows using compressed data)
    • Spark ML (clustering)
  • Source: SDSC, UC San Diego (http://mmtf.rcsb.org/)
![Page 23: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/23.jpg)
Development of tools and technologies that enable models to bridge across diverse scales of biological organization, while leveraging all types and sources of data
NBCR Example: Distilling Medical Image Data for Biomedical Action
nbcr.ucsd.edu
![Page 24: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/24.jpg)
Identify gaps in multiscale modeling capabilities and develop new methods and tools that allow us to bridge across these gaps
Spatial and Temporal Scales
Spatial: Å, nm – μm, 0.1 mm – mm, cm
Temporal: fs – μs, μs – ms, ms – s, s – lifespan
Biological organization: Molecular & Macromolecular, Sub-Cellular, Cell, Tissue, Organ
Driving Biomedical Projects propel technology development across multi-scale modeling capability gaps, from simulation to data assembly & integration
![Page 25: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/25.jpg)
A challenge: Data Integration
Challenge to bridge across diverse scales of biological organization, to understand emergent behavior, and the molecular mechanisms underlying biological function & disease
![Page 26: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/26.jpg)
Integrated Multi-Scale Modeling Toolkits in NBCR
User Interface
Battling complexity while facilitating collaboration and increasing reproducibility.
Cyberinfrastructure Innovation Based on User Needs
Domain-specific tools, workflows, data and computing infrastructure.
Components for Multi-Scale Modeling
A handful of customizable and extensible tools, workflows, user interfaces and publishable research objects.
NBCR Products
Workflows
Scientific Tools
Past Experiments
• UI generation
• Logical workflow generation
• Uncertainty quantification
• Workflow execution
• Provenance tracking
• System integration
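To make provenance tracking concrete, the sketch below wraps workflow steps so that each execution records which step ran, a digest of its input, and a digest of its output. It is an illustration only and does not describe how Kepler or the NBCR tools implement provenance; the `tracked` decorator, the record fields and the toy steps are all hypothetical.

```python
import hashlib
import json
from functools import wraps

PROVENANCE = []  # one record per executed step, in execution order


def digest(value):
    """Short, stable digest of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


def tracked(step):
    """Decorator that logs input and output digests for a workflow step."""
    @wraps(step)
    def wrapper(value):
        result = step(value)
        PROVENANCE.append({
            "step": step.__name__,
            "input": digest(value),
            "output": digest(result),
        })
        return result
    return wrapper


@tracked
def normalize(values):
    """Scale values so that the largest becomes 1.0."""
    peak = max(values)
    return [v / peak for v in values]


@tracked
def threshold(values):
    """Keep values at or above 0.5."""
    return [v for v in values if v >= 0.5]


kept = threshold(normalize([2, 4, 8]))
print(kept)
for record in PROVENANCE:
    print(record)
```

Because the output digest of one step equals the input digest of the next, the records can be chained to reconstruct how a result was produced.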
![Page 27: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/27.jpg)
[Figure: bar charts of cell proliferation by compound (medium, no compound, Prima-1, stictic acid, and tested compounds); panel labels include "cancer cell with p53-R175H mutant" and "no p53".]
15 new reactivation compounds
reactivation compounds kill cells with p53 cancer mutant
BENEFITS:
• Increase reuse
• Reproducibility
• Scale execution, problem & solution
• Compare methods
• Train students
![Page 28: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/28.jpg)
Minimization Actor Equilibration Actor
AMBER GPU MD Workbench
![Page 29: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/29.jpg)
LEADERSHIP TEAM
• Rommie Amaro, PI, UCSD: Computational chemistry, biophysics
• Andrew McCammon, UCSD: Computational chemistry, biophysics, chemical physics
• Mark Ellisman, UCSD: Molecular & cellular biology
• Andrew McCulloch, UCSD: Bioengineering, biophysics
• Michel Sanner, TSRI: Drug discovery & molecular visualization
• Phil Papadopoulos, UCSD/SDSC: Computer engineering, cyberinfrastructure technology
• Ilkay Altintas, UCSD/SDSC: Workflows, provenance
• Michael Holst, UCSD: Math, physics
• Arthur Olson, TSRI: Computational chemistry, drug discovery, visualization
![Page 30: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/30.jpg)
Training at the interface
Challenge: how do we build the next generation of interdisciplinary scientists?
Data-to-Structural-Models Simulation-Based Drug Discovery
![Page 31: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/31.jpg)
Biomedical Big Data Training Collaboratory
http://biobigdata.ucsd.edu
• BBDTC website is up and evolving!
• BBDTC contains seven full, open biomedical training courses
• Four-course biomedical big data series is planned for Winter 2017
![Page 32: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/32.jpg)
Working with Industry Partners at SDSC
![Page 33: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/33.jpg)
SDSC Provides a Range of Strategies for Engaging with Industry
• Sponsored research agreements
• Service agreements for use of systems & consulting
• Focused centers of excellence (Big Data Systems, Predictive Analytics, Workflow Technologies)
• Training programs in Data Science & Analytics
• Industry Partners Program for “jumpstarting” collaborations
Working with industry helps companies be more competitive, drives innovation, and fosters a healthy ecosystem between the research and private sector.
![Page 34: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/34.jpg)
Example for Industrial Collaboration: Janssen R&D Rheumatoid Arthritis Study
• Janssen was interested in correlating genomic profile with response to TNFα inhibitor golimumab
• Sequenced 438 patients (full genome)
• SDSC assisted with re-alignment and variant calling using new/improved algorithms
• Needed analysis done in a reasonable timeframe (a few weeks)
![Page 35: A Workflow-Driven Discovery and Training Ecosystem for Distributed Analysis of Biomedical Big Data](https://reader031.vdocuments.site/reader031/viewer/2022022123/58a3aebc1a28ab9e6a8b63d5/html5/thumbnails/35.jpg)
Questions?
Ilkay Altintas, Ph.D.
Email: ialtintas@ucsd.edu