Scaling Agent-based Simulation of Contagion Diffusion
over Dynamic Networks on Petascale Machines
Keith Bisset, Jae-Seung Yeom, Ashwin Aji
[email protected] ndssl.vbi.vt.edu
Network Dynamics and Simulation Science Lab, Virginia Bioinformatics Institute
Virginia Tech
Contagion Diffusion
• The problem we are trying to solve
  – Contagion propagation across large interaction networks
  – ~300 million nodes, ~1.5/70 billion edges
• Examples
  – Infectious Disease
  – Norms and Fads (e.g., Smoking, Obesity)
  – Digital viruses (e.g., computer viruses, cell phone worms)
  – Human immune system modeling
EpiSimdemics
• EpiSimdemics is an individual-based modeling environment
  – Each individual is represented, based on a synthetic population of the US
  – Each interaction between two co-located individuals is represented
• Uses a people-location bipartite graph as the underlying network
• Planned: add a people-people graph for direct interactions
• Features
  – Time-dependent and location-dependent interactions
  – A scripting language to specify complex interventions
  – PTTS (probabilistic timed transition system) representation of disease and behavior
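As a rough illustration of the PTTS idea listed under Features, the sketch below models a disease as a probabilistic timed transition system: each state has a dwell time and a set of weighted successor states. The state names, dwell times, probabilities and the `run_ptts` helper are made up for illustration and are not taken from EpiSimdemics.

```python
import random

# Hypothetical disease model: state -> (dwell time in days, [(next state, probability), ...]).
# All names and numbers here are illustrative assumptions.
PTTS = {
    "exposed":      (2, [("symptomatic", 0.7), ("asymptomatic", 0.3)]),
    "symptomatic":  (5, [("recovered", 1.0)]),
    "asymptomatic": (3, [("recovered", 1.0)]),
    "recovered":    (None, []),  # absorbing state
}

def run_ptts(start, rng):
    """Return the (day, state) trajectory of one individual."""
    day, state = 0, start
    trajectory = [(day, state)]
    while PTTS[state][1]:
        dwell, successors = PTTS[state]
        day += dwell  # timed: stay in the state for its dwell time
        states, probs = zip(*successors)
        state = rng.choices(states, weights=probs)[0]  # probabilistic transition
        trajectory.append((day, state))
    return trajectory

trajectory = run_ptts("exposed", random.Random(0))
```

Passing a seeded `random.Random` keeps a run reproducible, which matters when comparing interventions across replicates.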
Example Person-Person Graph
Image courtesy SDSC
Charm Implementation
[Figure: main coordinates Person Managers (PM1, PM2, …, PMn), which hold person objects (P), and Location Managers (LM1, LM2, …, LMn), which hold location objects (L), distributed over PEs (PE0, PE1, …). Inputs: person data, visit data, location data.]
Processing steps of an iteration
• sendInteractors(): PMs send visit messages to LMs; done() is reported to main (sync. by Charm's CD, completion detection)
• computeInfection(): LMs process the visits and send messages back to PMs; done() is reported to main (sync. by Charm's CD)
• endOfDay(): main closes the iteration
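The iteration shown on this slide can be sketched sequentially as below. This is a plain Python stand-in for the message flow, not Charm++ code: the function names mirror the slide, while the data layout and the "any overlap with an infected visitor transmits" rule are simplifying assumptions.

```python
from collections import defaultdict

def send_interactors(people):
    """Person phase: each person emits one visit message per scheduled visit."""
    inbox = defaultdict(list)  # location id -> visit messages
    for pid, person in people.items():
        for loc, start, end in person["visits"]:
            inbox[loc].append((pid, start, end, person["infected"]))
    return inbox

def compute_infection(inbox):
    """Location phase: find susceptible visitors who overlap with an infected one."""
    newly_infected = set()
    for visits in inbox.values():
        for pid, start, end, infected in visits:
            if infected:
                continue
            for _, s2, e2, inf2 in visits:
                if inf2 and start < e2 and s2 < end:  # time intervals overlap
                    newly_infected.add(pid)  # assumed rule: any overlap transmits
                    break
    return newly_infected

def end_of_day(people, newly_infected):
    """Apply the location results to the person state."""
    for pid in newly_infected:
        people[pid]["infected"] = True

people = {
    1: {"infected": True,  "visits": [("school", 8, 15)]},
    2: {"infected": False, "visits": [("school", 9, 12)]},
    3: {"infected": False, "visits": [("office", 9, 17)]},
}
inbox = send_interactors(people)      # first phase, followed by a sync point
infected = compute_infection(inbox)   # second phase, followed by a sync point
end_of_day(people, infected)
```

In the parallel version each phase ends only once all in-flight messages have been delivered, which is the role the slide assigns to the two synchronization points.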
Data organization
• P-L graph explicit, defines communication
• P-P graph implicit, defines computation, 50x more edges
• Both graphs evolve over time
• US Population
  – 270 million people, 70 million locations
  – 1.5 billion edges in the P-L graph
  – ~75 billion edges in the P-P graph (potential interactions/step)
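To make the explicit/implicit distinction concrete, the sketch below derives a P-P graph from a toy P-L graph: any two people who share a location are a potential interaction. The data is invented, and overlap in time is ignored for brevity. On the slide's own figures, ~75 billion versus 1.5 billion edges gives the stated 50x ratio.

```python
from itertools import combinations
from collections import defaultdict

# Explicit P-L graph: (person, location) visit edges. Toy data, not from the slides.
pl_edges = [
    ("p1", "home_a"), ("p2", "home_a"),
    ("p1", "school"), ("p2", "school"), ("p3", "school"), ("p4", "school"),
    ("p3", "home_b"), ("p4", "home_b"),
]

visitors = defaultdict(set)
for person, location in pl_edges:
    visitors[location].add(person)

# Implicit P-P graph: every pair of people sharing a location may interact.
pp_edges = set()
for people in visitors.values():
    pp_edges.update(combinations(sorted(people), 2))

# Potential interactions grow quadratically with location size: k visitors -> k*(k-1)/2 pairs.
pairs_per_location = {loc: len(p) * (len(p) - 1) // 2 for loc, p in visitors.items()}
```

Because the P-P edges are never stored, only the P-L edges drive communication, while the per-location pair count drives computation.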
Complex, Layered Interventions
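| Intervention | Population | Compliance | When |
|---|---|---|---|
| Vaccination | Adult | 25% | Day 0 |
| Vaccination | Children | 60% | Day 0 |
| Vaccination | Crit Workers | 100% | Day 0 |
| School Closure / Reopen | | 60% | 1.0% Children diagnosed (by county) |
| Quarantine | Crit Workers | 100% | 1.0% adults diagnosed |
| Self Isolate | All | 20% | 2.5% adults diagnosed |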
# stay home when symptomatic.
intervention symptomatic
  set num_symptomatic++
  apply diagnose with prob=0.60
  schedule stayhome 3

trigger disease.symptom >= 2
  apply symptomatic

# vaccinate 25% of adults
intervention vaccinate_adult
  treat vaccine
  set num_vac_adult++

trigger person.age > 18
  apply vaccinate_adult with prob=0.25
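One way to read the vaccination script above is as a trigger (a condition on a person) that applies an intervention (a list of actions) with a given probability. The Python sketch below mimics that reading; it is not the EpiSimdemics scripting engine, and `apply_with_prob`, the population layout and the seed are assumptions made for illustration.

```python
import random

def apply_with_prob(action, person, prob, rng):
    """Apply an intervention action to a person with the given probability."""
    if rng.random() < prob:
        action(person)
        return True
    return False

counters = {"num_vac_adult": 0}

def vaccinate_adult(person):
    person["vaccinated"] = True      # stands in for "treat vaccine"
    counters["num_vac_adult"] += 1   # stands in for "set num_vac_adult++"

rng = random.Random(42)
population = [{"age": age, "vaccinated": False} for age in range(0, 100)]

for person in population:
    if person["age"] > 18:                       # trigger person.age > 18
        apply_with_prob(vaccinate_adult, person, 0.25, rng)

adults = [p for p in population if p["age"] > 18]
```

With a large population the vaccinated share of adults approaches the 25% compliance in the table; on a small sample like this one it will fluctuate.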
Effects of Interventions
Blue Waters Setup
• Charm++ SMP mode
• Gemini network layer
• 4 processes/node
• 3 compute threads and 1 communication thread per process
• Application-based message coalescing
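Message coalescing at the application level can be sketched as a per-destination buffer that is flushed when it fills up or when a phase ends. The `Coalescer` class, its `batch_size` parameter and the `send` callback are hypothetical names for this sketch, which shows the general technique rather than the EpiSimdemics implementation.

```python
from collections import defaultdict

class Coalescer:
    """Buffer small messages per destination and send them in batches."""

    def __init__(self, send, batch_size):
        self.send = send              # callback: send(destination, list_of_messages)
        self.batch_size = batch_size  # hypothetical tuning parameter
        self.buffers = defaultdict(list)

    def post(self, destination, message):
        buf = self.buffers[destination]
        buf.append(message)
        if len(buf) >= self.batch_size:
            self.flush(destination)

    def flush(self, destination):
        buf = self.buffers.pop(destination, [])
        if buf:
            self.send(destination, buf)

    def flush_all(self):
        for destination in list(self.buffers):
            self.flush(destination)

sent = []  # records (destination, batch) pairs in place of a real network send
coalescer = Coalescer(lambda dest, batch: sent.append((dest, batch)), batch_size=4)
for i in range(10):
    coalescer.post("LM1", i)
coalescer.post("LM2", "x")
coalescer.flush_all()  # e.g. at the end of a phase
```

Here eleven small messages become four network sends; the trade-off is added latency for messages that wait in a partially filled buffer.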
Weak Scaling
Location Granularity
• Location load depends on number of visits
• Location size follows power law
• Not apparent until running at scale
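The sketch below illustrates why a heavy-tailed location size only hurts at scale: once the ideal per-PE load drops below the size of the largest location, no assignment of whole locations can balance the load. The visit counts, the `cap` value and the splitting rule are assumptions for illustration; splitting presumes that visitors placed in different pieces do not need to interact (for example, separate rooms), which the slides do not spell out.

```python
def max_load_round_robin(sizes, num_pes):
    """Heaviest per-PE load when items are dealt out round-robin."""
    loads = [0] * num_pes
    for i, size in enumerate(sizes):
        loads[i % num_pes] += size
    return max(loads)

def split_large(sizes, cap):
    """Split every location larger than `cap` into pieces of at most `cap` visits."""
    pieces = []
    for size in sizes:
        while size > cap:
            pieces.append(cap)
            size -= cap
        if size:
            pieces.append(size)
    return pieces

# Hypothetical heavy-tailed visit counts: location of rank r gets about 10000 / r visits.
sizes = [10000 // rank for rank in range(1, 1001)]
num_pes = 256
ideal = sum(sizes) / num_pes

unsplit = max_load_round_robin(sizes, num_pes)
split = max_load_round_robin(split_large(sizes, cap=200), num_pes)
```

Comparing `unsplit` and `split` against `ideal` shows how far each layout is from a perfectly even distribution.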
Scaling for US Population
[Figure: efficiency (0 to 1.2) vs. number of compute PEs (100 to 1,000,000, log scale); series: RR-splitLoc, RR]
[Figure: speedup (0 to 50,000) vs. number of compute PEs (0 to 300,000); series: RR-splitLoc, RR]
Static Partitioning
• Round Robin
  – Random distribution
  – Low overhead
  – Works well for small geographic areas (metro area)
• Graph Partitioner
  – Metis based partitioning
  – Multi-constraint (two phases separated by sync)
  – Higher overhead
  – Helps as geographic area increases (state, national)
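The two strategies can be compared by counting how many visit messages cross a partition boundary. The sketch below does this on an invented toy graph; the locality-aware assignment simply keeps each town together and is a stand-in for a graph partitioner, not a call to Metis.

```python
def remote_messages(visits, person_part, location_part):
    """Count visit messages whose person and location live on different partitions."""
    return sum(1 for p, l in visits if person_part[p] != location_part[l])

# Toy data: 4 towns, each with 4 residents who only visit the 2 locations in their town.
towns = range(4)
people = [(t, i) for t in towns for i in range(4)]
locations = [(t, j) for t in towns for j in range(2)]
visits = [(p, l) for p in people for l in locations if p[0] == l[0]]

k = 4  # number of partitions

# Round robin: deal objects out in input order, ignoring who visits what.
rr_people = {p: i % k for i, p in enumerate(people)}
rr_locations = {l: i % k for i, l in enumerate(locations)}

# Locality-aware stand-in for a graph partitioner: keep each town together.
geo_people = {p: p[0] % k for p in people}
geo_locations = {l: l[0] % k for l in locations}

rr_cut = remote_messages(visits, rr_people, rr_locations)
geo_cut = remote_messages(visits, geo_people, geo_locations)
```

Both assignments put 4 people and 2 locations on every partition, so the difference in remote messages comes purely from which objects are placed together.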
Static Partitioning - Results
• SendInteractor(): person computation to generate visit messages
• AddVisitMessage(): location-side message receive handling
• ComputeInfections(): location computation of interaction among visitors
Message Volume
[Figure: message volume under Round Robin and Graph Partitioner; 256 nodes, 10 million people]
Graph Sparsification
Goal: Improve runtime of Graph Partitioning
Procedure
• Randomly remove edges from high degree nodes
• Partition sparse graph
• Use full graph for execution
[Figure: maximum number of remote messages per partition (1,000 to 10,000,000) vs. number of partitions (10 to 100,000) for MI; series: full, 200, 100, 20, 10]
[Figure: ratio of number of edges to the original number of edges (0 to 0.9) for CA, NY, MI, NC, IA, AR, WY; series: me200, me100, m20, me10]
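A minimal sketch of the procedure, assuming sparsification means capping the degree of high-degree nodes by random edge removal. For brevity only location nodes are capped here; `sparsify`, `max_degree` and the toy graph are hypothetical.

```python
import random
from collections import defaultdict

def sparsify(edges, max_degree, rng):
    """Keep at most `max_degree` randomly chosen edges per location."""
    by_location = defaultdict(list)
    for person, location in edges:
        by_location[location].append((person, location))
    sparse = []
    for location_edges in by_location.values():
        if len(location_edges) > max_degree:
            location_edges = rng.sample(location_edges, max_degree)
        sparse.extend(location_edges)
    return sparse

# Toy P-L graph: one very large location and many small ones.
edges = [(f"p{i}", "stadium") for i in range(1000)]
edges += [(f"p{i}", f"home{i // 4}") for i in range(1000)]

sparse_edges = sparsify(edges, max_degree=100, rng=random.Random(1))
ratio = len(sparse_edges) / len(edges)  # ratio of remaining edges to original edges
# A partitioner would run on sparse_edges; the simulation itself still uses edges.
```

Only the partitioner's input shrinks; since execution uses the full graph, simulation results are unaffected and the cost is at most a somewhat worse partition.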
Impact of GPU Acceleration on Execution Profile
[Figure: stacked execution time (seconds, 0 to 600) without GPU and with GPU, broken down into StartOfDay, Schedule, Infect, Barrier and EndOfDay; annotations: 70.9%, 7.7x]
Assumes 1 CPU core per GPU device; in practice, CPU cores outnumber GPU devices
GPU-CharmSimdemics Scenarios
• Scenario 1
  – All chares from all CPU processes offload simultaneously to GPU
  – GPUs (Kepler) maintain a task queue from different processes
  – Inefficient: CPUs will be idle waiting for GPU execution to complete
• Scenario 2
  – Chares from only some select CPU processes offload to GPU
  – 1:1 ratio can be maintained between “GPU” processes and GPUs
  – But “GPU” chares will finish sooner than “CPU” chares, i.e., load imbalance
  – Use LB methods of Charm++ to rebalance chares
[Figure: two nodes, each with two GPUs]
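The load imbalance in Scenario 2 and the effect of rebalancing can be sketched with a simple greedy assignment. This is not a Charm++ load balancer; the 4x speed factor, the chare count and the `rebalance` helper are assumptions chosen to keep the arithmetic easy to follow.

```python
import heapq

def makespan(assignment, speeds):
    """Time until the slowest process finishes: work units divided by speed."""
    return max(count / speed for count, speed in zip(assignment, speeds))

def rebalance(num_chares, speeds):
    """Greedily give each chare to the process that would finish it earliest."""
    counts = [0] * len(speeds)
    heap = [(1 / s, i) for i, s in enumerate(speeds)]  # finish time if given one more chare
    heapq.heapify(heap)
    for _ in range(num_chares):
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, ((counts[i] + 1) / speeds[i], i))
    return counts

# Hypothetical node: 1 "GPU" process that is 4x faster, and 3 "CPU" processes.
speeds = [4.0, 1.0, 1.0, 1.0]
num_chares = 70

equal = [num_chares // len(speeds)] * len(speeds)
equal[0] += num_chares % len(speeds)   # leftover chares go to the first process
balanced = rebalance(num_chares, speeds)
```

With an equal split the "GPU" process finishes early and waits for the "CPU" processes; moving chares toward the faster process shortens the time to the slowest finisher.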
Future Work
• Dynamic Load Balancing with semantic information
  – Prediction model based on past runs
  – Information from simulation state variables
  – Use dynamic interventions (more variable load)
• Try Charm++ Meta Load Balancer
• Further improvements to initial partitioning
  – Minimize message imbalance as well as edge-cut
• Message reduction
• Sequential replicates to amortize data load time
• Scale to global population: 10 billion people
Acknowledgements
NSF HSD Grant SES-0729441, NSF PetaApps Grant OCI-0904844, NSF NETS Grant CNS-0831633, NSF CAREER Grant CNS-0845700, NSF NetSE Grant CNS-1011769, NSF SDCI Grant OCI-1032677, DTRA R&D Grant HDTRA1-09-1-0017, DTRA Grant HDTRA1-11-1-0016, DTRA CNIMS Contract HDTRA1-11-D-0016-0001, DOE Grant DE-SC0003957, PSU/DOE 4345-VT-DOE-4261, US Naval Surface Warfare Center Grant N00178-09-D-3017 DEL ORDER 13, NIH MIDAS Grant 2U01GM070694-09, NIH MIDAS Grant 3U01FM070694-09S1, LLNL Fellowship SubB596713, DOI Contract D12PC00337
UIUC Parallel Programming Lab
NDSSL Faculty, Staff and Students