HPC/HTC and Cloud: Making them work together efficiently
Rajul Kumar
Northeastern University
Our group
Rajul Kumar
Northeastern University
Evan Weinberg
Boston University
Chris Hill
Massachusetts Institute of Technology
HPC and Cloud convergence
High Performance Computing (HPC)
• HPC users have effectively unbounded demand for compute resources
Cloud
• Overprovisioned to meet peak workloads, so it stays underutilized most of the time
Can we make HPC soak up these idle cycles without impacting the cloud workload?
Simple Case: Single node HTC jobs
• High Throughput Computing (HTC) jobs focus on efficient execution of loosely coupled tasks
• Backfilled HTC jobs get killed to release resources for the HPC workload
• The compute cycles already invested are lost, and the job must be rerun from scratch
Instead, suspend the virtual machine running the jobs when its resources are needed, and resume it when resources become available again
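The difference between killing and suspending a backfilled job can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `HtcJob` class, the `run` function, and the availability timeline are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class HtcJob:
    """A single-node HTC job that needs `total_work` time units of compute."""
    total_work: int
    done: int = 0          # useful work completed so far
    cycles_spent: int = 0  # all compute time consumed, including work later lost


def run(job, availability, policy):
    """Advance `job` through a timeline of resource availability.

    availability: sequence of booleans, True when the node is free for HTC.
    policy: "kill" discards progress on preemption, "suspend" keeps it.
    Returns the time step at which the job finishes, or None if it never does.
    """
    for step, free in enumerate(availability, start=1):
        if free:
            job.done += 1
            job.cycles_spent += 1
            if job.done == job.total_work:
                return step
        elif policy == "kill":
            job.done = 0  # preempted: all invested work is lost
        # Under "suspend" the VM is paused and job.done is preserved.
    return None


# Hypothetical timeline: the node is taken away during steps 4 and 5.
timeline = [True, True, True, False, False, True, True, True, True, True]
killed = HtcJob(total_work=5)
suspended = HtcJob(total_work=5)
print(run(killed, timeline, "kill"), killed.cycles_spent)            # 10 8
print(run(suspended, timeline, "suspend"), suspended.cycles_spent)   # 7 5
```

In this toy timeline the killed job finishes at step 10 after consuming 8 cycles, while the suspended job finishes at step 7 after consuming exactly the 5 cycles it needs.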
Implementation
[Diagram: HPC cluster and OpenStack cloud with a resource monitor; legend: HPC, HTC, Cloud]
Implementation
[Diagram: HPC cluster, OpenStack cloud, resource monitor, OpenVPN]
Implementation
[Diagram: control daemon, HPC cluster, OpenStack cloud, resource monitors, OpenVPN]
Implementation
[Diagram: control daemon, HPC cluster, OpenStack cloud, resource monitors, OpenVPN, HPC jobs]
HPC job arrives
Implementation
[Diagram: control daemon, HPC cluster, OpenStack cloud, resource monitors, OpenVPN]
HTC jobs moved to Cloud
Implementation
[Diagram: control daemon, HPC cluster, OpenStack cloud, resource monitors, OpenVPN]
Cloud utilization increases
Implementation
[Diagram: control daemon, HPC cluster, OpenStack cloud, resource monitors, OpenVPN]
HTC job suspended to release resources for cloud
Implementation
[Diagram: control daemon, HPC cluster, OpenStack cloud, resource monitors, OpenVPN]
Cloud utilization drops
Implementation
[Diagram: control daemon, HPC cluster, OpenStack cloud, resource monitors, OpenVPN]
HTC jobs resumed on cloud
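The sequence walked through on the Implementation slides (HTC jobs move to the cloud, are suspended when cloud utilization rises, and are resumed when it falls) can be sketched as a small decision loop. This is a toy model, not the authors' daemon: the `ControlDaemon` class, its method names, the two thresholds, and the gap between them are all assumptions made for illustration, since the slides give no concrete values.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds; the slides do not state concrete values.
SUSPEND_ABOVE = 0.80  # cloud utilization at which HTC VMs yield their resources
RESUME_BELOW = 0.50   # cloud utilization at which HTC VMs may run again


@dataclass
class ControlDaemon:
    """Toy model of the control daemon's suspend/resume decisions."""
    htc_vms: dict = field(default_factory=dict)  # VM name -> "ACTIVE" or "SUSPENDED"

    def on_hpc_job_arrival(self, vm_name):
        """An HPC job needs the cluster, so an HTC job moves to a cloud VM."""
        self.htc_vms[vm_name] = "ACTIVE"

    def on_cloud_utilization(self, utilization):
        """React to a reading from the cloud resource monitor.

        Returns the names of the VMs whose state changed.
        """
        if utilization >= SUSPEND_ABOVE:
            source, target = "ACTIVE", "SUSPENDED"
        elif utilization <= RESUME_BELOW:
            source, target = "SUSPENDED", "ACTIVE"
        else:
            return []  # between thresholds: leave VMs alone to avoid flapping
        changed = [vm for vm, state in self.htc_vms.items() if state == source]
        for vm in changed:
            self.htc_vms[vm] = target
        return changed


daemon = ControlDaemon()
daemon.on_hpc_job_arrival("htc-vm-1")
for reading in (0.30, 0.85, 0.65, 0.40):
    daemon.on_cloud_utilization(reading)
    print(reading, daemon.htc_vms)
```

In a real deployment the state change would presumably be carried out through the cloud's own suspend and resume server actions rather than a dictionary update; the dictionary here only stands in for that call.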
Modifications to Slurm
Slurm: a workload manager for HPC clusters
• Manages resource allocation and job scheduling
• Marks an unreachable node DOWN and removes its jobs
• Treats a suspended virtual node the same way, because it also stops responding
We modified Slurm to recognize suspended nodes and keep their job states intact
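The behavioural change described above can be modelled in a few lines. This sketch is not Slurm code and does not use any Slurm API: the `NodeState` enum, the `SUSPENDED` state, and the `handle_unreachable` function are hypothetical names chosen to contrast the stock handling with the modified one.

```python
from enum import Enum


class NodeState(Enum):
    UP = "UP"
    DOWN = "DOWN"
    SUSPENDED = "SUSPENDED"  # hypothetical state for a deliberately paused virtual node


def handle_unreachable(node, jobs, known_suspended, modified=True):
    """Decide what happens to a node that has stopped responding.

    jobs: dict mapping job id -> name of the node the job runs on.
    known_suspended: set of node names that were suspended on purpose.
    Returns (new node state, jobs dict after handling).
    """
    if modified and node in known_suspended:
        # Modified behaviour: the node is paused on purpose, so keep its jobs.
        return NodeState.SUSPENDED, dict(jobs)
    # Stock behaviour from the slide: mark the node DOWN and remove its jobs.
    remaining = {job: host for job, host in jobs.items() if host != node}
    return NodeState.DOWN, remaining


jobs = {101: "vnode1", 102: "vnode2"}
print(handle_unreachable("vnode1", jobs, {"vnode1"}, modified=False))
print(handle_unreachable("vnode1", jobs, {"vnode1"}, modified=True))
```

With `modified=False`, job 101 is dropped along with its node; with `modified=True`, the node is reported as suspended and both jobs stay in the table, ready to continue once the virtual machine is resumed.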
Future prospects
• Harden the system and use the full performance of the data center (hardware, network, etc.)
• Run multi-node jobs in a virtual environment
• Move jobs between virtual machines and bare-metal nodes
• Experiment with container frameworks
Conclusion
• A dynamic HPC/HTC cluster with minimal overhead and impact
• More productive use of the HPC/HTC cluster
• Higher resource utilization in the cloud
http://info.massopencloud.org