Beyond Monitoring:
Proactive Server Preservation in an HPC Environment
Chad Feller
University of Nevada, Reno
8 May 2012
Thesis Defense
Acknowledgements
My wife, Veronica
My kids
My good friend, Derek Eiler
My committee, Dr. Harris, Dr. Dascalu, Dr. Schlauch
Background
Monitoring systems
Increasingly sophisticated
Still large holes in capabilities
9/9/9
9/9/9
Power failure sequence kicks in
UPS caught outage
Generator started up
Temperature rising
UPS only powers servers
Power switches to generators
Temperature still rising
9/9/9
9/10/9
9/10/9
HPC
Computing Density
ILOM
9/10/9
9/10/9
7/20/11
Environmental Considerations
ILOM/IPMI
Sun Grid Engine
Linux
Architecture
Frontend
Main Loop
Local Testing
Global Testing
Global Testing
Demo
Conclusion
Developed a temperature monitoring system
Local Perspective
Global Perspective
Intelligent Response
Designed for HPC & Enterprise servers
Modular Implementation
Can be easily adapted to other hardware
Software can be leveraged to other environments
Tested
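Sketch: local vs. global perspective
The following is a minimal illustration of how a local and a global view of temperature could be combined into a response. It is not the thesis implementation. The names (assess, WARN_C, CRIT_C), the thresholds, and the action labels are all hypothetical, and the readings are assumed to be already collected as a mapping of node name to inlet temperature in degrees Celsius.

```python
from statistics import median

# Hypothetical thresholds in degrees Celsius (assumptions, not from the slides).
WARN_C = 35.0
CRIT_C = 45.0


def assess(temps):
    """Return (scope, actions) for a {node: temperature} mapping.

    scope   -- "global" if the room as a whole is warm, "local" if only
               individual nodes are warm, otherwise "normal"
    actions -- {node: "ok" | "drain" | "shutdown"}
    """
    if not temps:
        return "normal", {}

    # Global perspective: the median approximates the room-wide temperature.
    room = median(temps.values())
    room_warm = room >= WARN_C
    room_critical = room >= CRIT_C

    actions = {}
    for node, t in temps.items():
        # Local perspective: each node is also judged on its own reading.
        if room_critical or t >= CRIT_C:
            actions[node] = "shutdown"
        elif room_warm or t >= WARN_C:
            actions[node] = "drain"  # stop accepting new jobs
        else:
            actions[node] = "ok"

    if room_warm:
        scope = "global"
    elif any(a != "ok" for a in actions.values()):
        scope = "local"
    else:
        scope = "normal"
    return scope, actions
```

With one hot node among cool ones, assess reports a local event and acts only on that node. When most nodes are warm, it reports a global event and drains every node, including those whose own reading is still below the warning threshold.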