SSS Test Results
Scalability, Durability, Anomalies
Todd Kordenbrock
Technology Consultant
Scalable Computing Division
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Overview
• Effective System Performance Benchmark
• Scalability
  – Service Node
  – Cluster Size
• Durability
• Anomalies
The Setup
• The physical machine
  – dual-processor 3 GHz Xeon
  – 2 GB RAM
  – FC3 and VMWare 5
• The 4-node VMWare cluster
  – 1 service node
  – 4 compute nodes
  – OSCAR 1.0 on Redhat 9
• The 64 virtual node cluster
  – 16 WarehouseNodeMonitors running on each compute node
[Diagram: dual-processor Xeon host running VMWare; the service1 node runs the SystemMonitor, and compute1 through compute4 each host a set of NodeMonitors (NodeMon 1 through NodeMon n).]
Effective System Performance Benchmark
• Developed by the National Energy Research Scientific Computing Center
• System utilization test, NOT a throughput test
• Focused on O/S attributes
  – launch time, accounting, job scheduling
• Constructed to be processor-speed independent
• Low resource usage (besides network)
• Two variants: Throughput and Multimode
• The final result is the ESP Efficiency Ratio
ESP Efficiency Ratio
• Calculating the ESP Efficiency Ratio
  – CPUsecs = sum(job size * runtime * job count)
  – AMT = CPUsecs / system size
  – ESP Efficiency Ratio = AMT / observed runtime
ESP2 Efficiency (64 nodes)
• CPUsecs = 680251.75
• AMT = 680251.75 / 64 = 10628.93
• Observed runtime = 11586.7169
• ESP Efficiency Ratio = 10628.93 / 11586.7169 = 0.9173 (see the sketch below)
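As a worked illustration of the formula above, here is a minimal sketch in Python that reproduces the 64-node numbers; the function name and argument layout are my own, and only the totals (CPUsecs, system size, observed runtime) come from the slides.

    # Minimal sketch of the ESP Efficiency Ratio calculation.
    # Only the totals below come from the 64-node run reported above.
    def esp_efficiency_ratio(cpu_secs, system_size, observed_runtime):
        amt = cpu_secs / system_size          # AMT = CPUsecs / system size
        return amt / observed_runtime         # ratio = AMT / observed runtime

    cpu_secs = 680251.75                      # sum(job size * runtime * job count)
    ratio = esp_efficiency_ratio(cpu_secs, system_size=64,
                                 observed_runtime=11586.7169)
    print(f"ESP Efficiency Ratio = {ratio:.4f}")   # prints 0.9173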
Scalability
• Service Node Scalability (Load Testing)
  – Bamboo (Queue Manager)
  – Gold (Accounting)
• Cluster Size
  – Warehouse scalability (Status Monitor)
  – Maui scalability (Scheduler)
[Diagram: Scalable Systems Software component architecture: Usage Reports, User DB, Accounting, Scheduler, Job Manager & Monitor, System Monitor, Queue Manager, Checkpoint/Restart, Data Migration, Meta Scheduler, Node Configuration & Build Manager, Meta Monitor, Meta Manager, Resource Allocation Management, Application Environment, High Performance Communication & I/O, Access Control/Security Manager, File System, and User Utilities (interacts with all components).]
Bamboo Job Submission
[Chart: Bamboo job submission time in seconds per transaction for 1x1000, 10x100, 100x10, and 250x4 submission patterns.]
Gold Operations
[Chart: Gold operation time in seconds per transaction for Reservation, Withdraw, Balance, and User Query operations, measured for 1x1000 and 10x100 workloads.]
Warehouse Scalability
• Initial concerns
  – per-process file descriptor (socket) limits (see the sketch below)
  – time required to gather status from 1000s of nodes
• Discussed with Craig Steffen
  – had the same concerns
  – experienced file descriptor limits
  – resolved with a hierarchical configuration
  – no tests on large clusters, just simulations
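To make the file descriptor concern concrete, here is a minimal sketch (Python, POSIX only) that compares the per-process descriptor limit against the sockets a flat, single-level status monitor would need; the node counts and overhead value are hypothetical.

    import resource

    # Minimal sketch: compare the per-process file descriptor limit
    # against the sockets a flat (non-hierarchical) monitor would need,
    # assuming one socket per monitored node plus a small fixed overhead.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    overhead = 32  # stdio, log files, etc. (hypothetical)

    for nodes in (64, 1024, 4096, 16384):
        needed = nodes + overhead
        status = "ok" if needed <= soft else "exceeds soft limit"
        print(f"{nodes:6d} nodes -> {needed:6d} descriptors ({status}, soft limit {soft})")

A hierarchical configuration avoids the limit by fanning connections out across intermediate monitors, so no single process needs a socket per node.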
Maui Scalability
[Chart: Maui scheduling time in seconds per transaction for 1-way 1-second, 1-way hostname, and all-way hostname jobs, comparing 4-node and 64-node clusters with and without Gold.]
Scalability Conclusions
• Bamboo
• Gold
• Warehouse
• Maui
Durability
• What is durability?
• A few terms regarding starting and stopping
• Easy Tests
• Hard Tests
Durability and Other Terms
• Durability Testing - examines a software system's ability to react to and recover from failures and conditions external to the system itself.
• Warm Start/Stop - an orderly startup/shutdown of the SSS services on a particular node
• Cold Start/Stop – a warm start/stop paired with a system boot/shutdown on a particular node
Easy Tests
• Compute Node Warm Stop
  – 30 sec delay between stop and Maui notification
  – race condition
• Compute Node Warm Start
  – 10 sec delay between start and Maui notification
  – jobs in the queue do not get scheduled, new jobs do
• Compute Node Cold Stop
  – 30 sec delay between stop and Maui notification
  – race condition
More Easy Tests
• Single Node Job Failure
  – mpd to queue manager communication
• Resource Hog - stress (see the sketch below)
  – disk
  – memory
  – network
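A minimal sketch of the resource-hog runs, assuming the standard stress utility is installed on the node under test; the worker counts, sizes, and duration are hypothetical, and the network case is not shown since stress does not generate network load.

    import subprocess

    # Minimal sketch: drive the "stress" utility for disk and memory load.
    # Worker counts, sizes, and duration are hypothetical placeholders;
    # network load would come from a separate tool (e.g. a bulk transfer).
    def hog(kind, seconds=60):
        args = {
            "disk":   ["--hdd", "2"],                    # workers writing temp files
            "memory": ["--vm", "2", "--vm-bytes", "1G"], # workers dirtying memory
        }[kind]
        subprocess.run(["stress", *args, "--timeout", str(seconds)], check=True)

    for kind in ("disk", "memory"):
        hog(kind, seconds=60)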
More Easy Tests
• Resource Exhaustion
  – compute node
    • disk – no failures
  – service node
    • disk – Gold fails in the logging package
Hard Tests
• Compute Node Failure/Restore
  – current release of Warehouse fails to reconnect
• Service Node Failure/Restore
  – requires a restart of mpd on all compute nodes
• Compute Node Network Failure/Restore
  – 30 sec delay between failure and Maui notification
  – race condition
  – 20 sec delay between restore and Maui notification
More Hard Tests
• Service Node Network Failure/Restore
  – 30 sec delay between failure and Maui notification
  – race condition
  – 20 sec delay between restore and Maui notification
  – if the outage exceeds 10 sec, mpd can't reconnect to the compute nodes
Durability Conclusions
• Bamboo
• Gold
• Warehouse
• Maui
Anomalies Discovered
• Maui
  – Jobs in the queue do not get scheduled after a service node warm restart
  – If max runtime expires on the last job in the queue, repeated attempts are made to delete it; the account is charged actual runtime + max runtime
  – Otherwise, the last job in the queue doesn't get charged until another job is submitted
  – Maui loses connections to other services
More Anomalies
• Warehouse
  – warehouse_SysMon exits after ~8 hrs (current release)
  – warehouse_SysMon doesn't reconnect to power-cycled compute nodes (current release)
• Gold
  – “Quotation Create” pair fails with a missing column error
  – gquote succeeds, glsquote fails with a similar error
  – CPU usage spikes when the gold.db file gets large (>64 MB); an sqlite problem?
More Anomalies
• happynsm
  – /etc/init.d/nsmup needs a delay to allow the server time to initialize (see the sketch below)
  – Is NSM in use at this time?
• emng.py throws errors
  – After a few hundred jobs, errors begin showing up in /var/log/messages
  – Jobs continue to execute, but slowly, without events
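As an illustration of the nsmup start-up delay noted above, here is a minimal sketch of a poll-until-ready wait that an init script could use instead of a fixed sleep; the host, port, and timeout are hypothetical, since the NSM server's listening address is not given here.

    import socket
    import time

    # Minimal sketch: wait until a server accepts connections before
    # continuing, rather than sleeping for a fixed interval in
    # /etc/init.d/nsmup.  Host, port, and timeout are placeholders.
    def wait_for_server(host, port, timeout=30.0, interval=0.5):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((host, port), timeout=interval):
                    return True           # server is up and accepting connections
            except OSError:
                time.sleep(interval)      # not ready yet; poll again
        return False

    if not wait_for_server("localhost", 8642):
        raise SystemExit("NSM server did not come up in time")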
Conclusions
• Overall scalability is good. Warehouse needs to be tested on a large cluster.
• Overall durability is good. Some problems with warehouse have been resolved in the latest development release.
ToDo List
• Develop and execute tests for the BLCR module
• Retest on a larger cluster
• Get the latest release of all the software and retest
• Formalize this information into a report