SSS Test Results
Scalability, Durability, Anomalies
Todd Kordenbrock
Technology Consultant
Scalable Computing Division
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
Overview
• Effective System Performance Benchmark
• Scalability
  – Service Node
  – Cluster Size
• Durability
• Anomalies
The Setup
• The physical machine
  – dual-processor 3 GHz Xeon
  – 2 GB RAM
  – FC3 and VMWare 5
• The 4-node VMWare cluster
  – 1 service node
  – 4 compute nodes
  – OSCAR 1.0 on Redhat 9
• The 64 virtual node cluster
  – 16 WarehouseNodeMonitors running on each compute node
[Diagram: dual-processor Xeon host running VMWare; the service1 node runs the SystemMonitor, and compute1 through compute4 each host a set of NodeMonitors (NodeMon 1 through NodeMon n).]
Effective System Performance Benchmark
• Developed by the National Energy Research Scientific Computing Center
• System utilization test, NOT a throughput test
• Focused on O/S attributes
  – launch time, accounting, job scheduling
• Constructed to be processor-speed independent
• Low resource usage (besides network)
• Two variants: Throughput and Multimode
• The final result is the ESP Efficiency Ratio
ESP Efficiency Ratio
• Calculating the ESP Efficiency Ratio
  – CPUsecs = sum(job size * runtime * job count)
  – AMT = CPUsecs / system size
  – ESP Efficiency Ratio = AMT / observed runtime
ESP2 Efficiency (64 nodes)
• CPUsecs = 680251.75
• AMT = 680251.75 / 64 = 10628.93
• Observed runtime = 11586.7169
• ESP Efficiency Ratio = 10628.93 / 11586.7169 = 0.9173 (see the sketch below)
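As a worked illustration of the formula above, here is a minimal sketch in Python that reproduces the 64-node numbers; the function name and argument layout are my own, and only the totals (CPUsecs, system size, observed runtime) come from the slides.

    # Minimal sketch of the ESP Efficiency Ratio calculation.
    # Only the totals below come from the 64-node run reported above.
    def esp_efficiency_ratio(cpu_secs, system_size, observed_runtime):
        amt = cpu_secs / system_size          # AMT = CPUsecs / system size
        return amt / observed_runtime         # ratio = AMT / observed runtime

    cpu_secs = 680251.75                      # sum(job size * runtime * job count)
    ratio = esp_efficiency_ratio(cpu_secs, system_size=64,
                                 observed_runtime=11586.7169)
    print(f"ESP Efficiency Ratio = {ratio:.4f}")   # prints 0.9173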
Scalability
• Service Node Scalability (Load Testing)
  – Bamboo (Queue Manager)
  – Gold (Accounting)
• Cluster Size
  – Warehouse scalability (Status Monitor)
  – Maui scalability (Scheduler)
[Diagram: Scalable Systems Software component architecture: Usage Reports, User DB, Accounting, Scheduler, Job Manager & Monitor, System Monitor, Queue Manager, Checkpoint/Restart, Data Migration, Meta Scheduler, Node Configuration & Build Manager, Meta Monitor, Meta Manager, Resource Allocation Management, Application Environment, High Performance Communication & I/O, Access Control/Security Manager, File System, and User Utilities (interacts with all components).]
Bamboo Job Submission
[Chart: Bamboo job submission time in seconds per transaction for 1x1000, 10x100, 100x10, and 250x4 submission patterns.]
Gold Operations
[Chart: Gold operation time in seconds per transaction for Reservation, Withdraw, Balance, and User Query operations, measured for 1x1000 and 10x100 workloads.]
Warehouse Scalability
• Initial concerns
  – per-process file descriptor (socket) limits (see the sketch below)
  – time required to gather status from 1000s of nodes
• Discussed with Craig Steffen
  – had the same concerns
  – experienced file descriptor limits
  – resolved with a hierarchical configuration
  – no tests on large clusters, just simulations
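To make the file descriptor concern concrete, here is a minimal sketch (Python, POSIX only) that compares the per-process descriptor limit against the sockets a flat, single-level status monitor would need; the node counts and overhead value are hypothetical.

    import resource

    # Minimal sketch: compare the per-process file descriptor limit
    # against the sockets a flat (non-hierarchical) monitor would need,
    # assuming one socket per monitored node plus a small fixed overhead.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    overhead = 32  # stdio, log files, etc. (hypothetical)

    for nodes in (64, 1024, 4096, 16384):
        needed = nodes + overhead
        status = "ok" if needed <= soft else "exceeds soft limit"
        print(f"{nodes:6d} nodes -> {needed:6d} descriptors ({status}, soft limit {soft})")

A hierarchical configuration avoids the limit by fanning connections out across intermediate monitors, so no single process needs a socket per node.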
Maui Scalability
[Chart: Maui scheduling time in seconds per transaction for 1-way 1-second, 1-way hostname, and all-way hostname jobs, comparing 4-node and 64-node clusters with and without Gold.]
Scalability Conclusions
• Bamboo
• Gold
• Warehouse
• Maui
Durability
• What is durability?
• A few terms regarding starting and stopping
• Easy Tests
• Hard Tests
Durability and Other Terms
• Durability Testing - examines a software system's ability to react to and recover from failures and conditions external to the system itself.
• Warm Start/Stop - an orderly startup/shutdown of the SSS services on a particular node
• Cold Start/Stop – a warm start/stop paired with a system boot/shutdown on a particular node
Easy Tests
• Compute Node Warm Stop
  – 30 sec delay between stop and Maui notification
  – race condition
• Compute Node Warm Start
  – 10 sec delay between start and Maui notification
  – jobs in the queue do not get scheduled, new jobs do
• Compute Node Cold Stop
  – 30 sec delay between stop and Maui notification
  – race condition
More Easy Tests
• Single Node Job Failure
  – mpd to queue manager communication
• Resource Hog - stress (see the sketch below)
  – disk
  – memory
  – network
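A minimal sketch of the resource-hog runs, assuming the standard stress utility is installed on the node under test; the worker counts, sizes, and duration are hypothetical, and the network case is not shown since stress does not generate network load.

    import subprocess

    # Minimal sketch: drive the "stress" utility for disk and memory load.
    # Worker counts, sizes, and duration are hypothetical placeholders;
    # network load would come from a separate tool (e.g. a bulk transfer).
    def hog(kind, seconds=60):
        args = {
            "disk":   ["--hdd", "2"],                    # workers writing temp files
            "memory": ["--vm", "2", "--vm-bytes", "1G"], # workers dirtying memory
        }[kind]
        subprocess.run(["stress", *args, "--timeout", str(seconds)], check=True)

    for kind in ("disk", "memory"):
        hog(kind, seconds=60)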
More Easy Tests
• Resource Exhaustion
  – compute node
    • disk – no failures
  – service node
    • disk – Gold fails in the logging package
Hard Tests
• Compute Node Failure/Restore
  – current release of Warehouse fails to reconnect
• Service Node Failure/Restore
  – requires a restart of mpd on all compute nodes
• Compute Node Network Failure/Restore
  – 30 sec delay between failure and Maui notification
  – race condition
  – 20 sec delay between restore and Maui notification
More Hard Tests
• Service Node Network Failure/Restore
  – 30 sec delay between failure and Maui notification
  – race condition
  – 20 sec delay between restore and Maui notification
  – if the outage exceeds 10 sec, mpd can't reconnect to the compute nodes
Durability Conclusions
• Bamboo
• Gold
• Warehouse
• Maui
Anomalies Discovered
• Maui
  – Jobs in the queue do not get scheduled after a service node warm restart
  – If max runtime expires on the last job in the queue, repeated attempts are made to delete it; the account is charged actual runtime + max runtime
  – Otherwise, the last job in the queue doesn't get charged until another job is submitted
  – Maui loses connections to other services
More Anomalies
• Warehouse
  – warehouse_SysMon exits after ~8 hrs (current release)
  – warehouse_SysMon doesn't reconnect to power-cycled compute nodes (current release)
• Gold
  – “Quotation Create” pair fails with a missing column error
  – gquote succeeds, glsquote fails with a similar error
  – CPU usage spikes when the gold.db file gets large (>64 MB); an sqlite problem?
More Anomalies
• happynsm
  – /etc/init.d/nsmup needs a delay to allow the server time to initialize (see the sketch below)
  – Is NSM in use at this time?
• emng.py throws errors
  – After a few hundred jobs, errors begin showing up in /var/log/messages
  – Jobs continue to execute, but slowly, without events
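As an illustration of the nsmup start-up delay noted above, here is a minimal sketch of a poll-until-ready wait that an init script could use instead of a fixed sleep; the host, port, and timeout are hypothetical, since the NSM server's listening address is not given here.

    import socket
    import time

    # Minimal sketch: wait until a server accepts connections before
    # continuing, rather than sleeping for a fixed interval in
    # /etc/init.d/nsmup.  Host, port, and timeout are placeholders.
    def wait_for_server(host, port, timeout=30.0, interval=0.5):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((host, port), timeout=interval):
                    return True           # server is up and accepting connections
            except OSError:
                time.sleep(interval)      # not ready yet; poll again
        return False

    if not wait_for_server("localhost", 8642):
        raise SystemExit("NSM server did not come up in time")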
Conclusions
• Overall scalability is good. Warehouse needs to be tested on a large cluster.
• Overall durability is good. Some problems with warehouse have been resolved in the latest development release.
ToDo List
• Develop and execute tests for the BLCR module
• Retest on a larger cluster
• Get the latest release of all the software and retest
• Formalize this information into a report