
Keeping Your Software Ticking: Testing with Metronome and the NMI Lab

Upload: emil-reeves

Post on 25-Dec-2015


Page 1: Keeping Your Software Ticking Testing with Metronome and the NMI Lab

Keeping Your Software Ticking

Testing with Metronome and the NMI Lab


Background: Why (In a Slide!)

• Grid Software: Important to Science and Industry
• Quality of Grid Software: Not So Much
• Testing: Key to Quality
• Testing Distributed Software: Hard
• Testing Distributed Software Stacks: Harder
• Distributed Software Testing Tools: Nonexistent (before)

We Needed Help, We Built Something to Help Ourselves and Our Friends, We Think It Can Help Others


Background: What (In a Slide!)

• A Framework and Tool: Metronome
– Lightweight, built atop Condor, DAGMan, and other proven distributed computing tools
– Portable, open source
– Language/harness independent
– Assumes >1 user, >1 project, >1 environment needing resources at >1 site
– Encourages explicit, well-controlled build/test environments for reproducibility
– Central results repository
– Fault-tolerant
– Encourages build/test separation

• A Facility: The NMI Lab
– 200+ cores, 50+ platforms @ UW (Noah’s Ark; the Anti-Cluster)
– Built to use distributed resources at other sites, grids, etc.
– 200 users, dozens of registered projects (most of them “real”)
– 84k builds & tests managed by 1M Condor jobs, producing 6.5M tracked tasks in the DB

• A Team
– Subset of Condor Team: Becky Gietzel, Todd Miller, Ross Oldenburg, myself (more coming)

• A Community
– Working with TeraGrid, OSG, ETICS, and others towards a common international build/test infrastructure


Metronome Architecture (In a Slide!)

[Figure: Metronome architecture diagram. Components shown: Customer Source Code and Customer Build/Test Scripts (input), Spec File, Metronome, DAG, DAGMan, Condor Queue, build/test jobs, Distributed Build/Test Pool, results, MySQL Results DB, Web Status Pages, and Finished Binaries (output).]
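The diagram appears to show a spec file being turned into a DAG of build and test jobs, which DAGMan then feeds to the Condor queue. A minimal sketch of that idea follows, under stated assumptions: it is not Metronome's actual spec format or code, and the `SPEC` dictionary, platform names, and `.sub` submit-file names are all hypothetical. Only the `JOB` and `PARENT ... CHILD` statements follow DAGMan's input-file syntax. It uses `graphlib` (Python 3.9+) to confirm that every build is ordered before its tests.

```python
from graphlib import TopologicalSorter

# Hypothetical spec: one build per platform, then tests that depend on that build.
# Platform and test names are illustrative only, not Metronome's spec format.
SPEC = {
    "platforms": ["x86_rhel5", "x86_64_deb4", "ppc_aix53"],
    "tests": ["unit", "integration"],
}


def build_graph(spec):
    """Map each job name to the set of job names it depends on."""
    graph = {}
    for platform in spec["platforms"]:
        build = f"build_{platform}"
        graph[build] = set()  # builds have no prerequisites
        for test in spec["tests"]:
            # Each test runs only after the build for its own platform finishes.
            graph[f"test_{test}_{platform}"] = {build}
    return graph


def to_dagman(graph):
    """Render the graph as DAGMan JOB and PARENT ... CHILD statements.

    The "<job>.sub" submit-file names are placeholders.
    """
    lines = [f"JOB {job} {job}.sub" for job in sorted(graph)]
    for child in sorted(graph):
        for parent in sorted(graph[child]):
            lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines)


graph = build_graph(SPEC)
# A valid execution order: every build precedes the tests that depend on it.
order = list(TopologicalSorter(graph).static_order())
print(to_dagman(graph))
```

Keeping the dependency graph as plain data means the same spec could, in principle, be rendered for DAGMan or checked locally before anything is submitted.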


Why Is This Architecture Powerful?

• Fault tolerance, resource management.
• Real scheduler, not a toy or afterthought.
• Flexible workflow tools.
• Nothing to deploy in advance on worker nodes except Condor
– can harness “unprepared” resources.
• Advanced job migration capabilities
– critical for the goal of a common build/test infrastructure across projects, sites, and countries.


Example: NMI Lab / ETICS Site Federation with Condor-C


10k Foot View

• Past:
– humble beginnings: a ragtag crew of developers making building & testing easier for the projects around them (Condor, Globus, VDT, TeraGrid...)

• Present:
– now we have tax money, and users should have higher expectations
– good news: six months into a new 3-year funding cycle, our "professionalism" has improved since our humble beginnings (better hardware, better processes, better staffing)
– bad news: we’re still a bit ragtag (inconsistent tracking of support and development requests; inconsistent information on lab and resource improvements, issues, and resolutions; generally reactive to problems)
– we're clearly contributing to the community's build & test capabilities, but we’d like to deliver much more, especially with respect to testing.


10k Foot View: Future

• Maintain Metronome and the NMI Lab
– Continue to professionalize lab infrastructure; improve availability, stability, uptime
– Better monitoring -> more proactive response to issues
– Better scheduling of jobs; better use of VMs to respond to uneven x86 platform demand

• Enhance Metronome and the NMI Lab
– New features, new capabilities, though these may matter less than the clarity, usability, and fit & finish of existing features.


10k Foot View: Future

• Support Metronome and the NMI Lab
– more systematic support operation (ticketing, etc.)
– more utilization of basic testing capabilities by new users
– more utilization of advanced testing capabilities by existing users
– more & better information for users, admins, and pointy-haired bosses
  • better reporting on users, resources, usage, operations, etc.

• Nurture Distributed Software Testing Community
– to identify common build & test (B&T) needs to improve software quality
– to challenge us, and help us, to provide software & services that meet B&T needs
– Tuesday’s meeting was a good start, I hope…


Maslow’s Pyramid of Testing Needs


Testing Opportunities

• more resources == more possibilities (just like science)
– don’t just test under normal conditions; test the not-so-edge cases too (e.g., with CPU load!)
– test everywhere your users run, not just where you develop
– use old/exotic/unique resources you don’t own (NMI Lab, TeraGrid)

• “black box”
– run your existing test harness (Tinderbox, etc.) inside Metronome

• decoupled builds & tests
– run new tests on old builds
– cross-platform binary compatibility testing
– run quick smoke tests continuously, heavy tests nightly, and performance/scalability tests before release
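Because builds and tests are decoupled, a test run only needs the ID of a build that already exists, so the same plan can pair a new suite with an old build. A minimal sketch of the tiered cadence above, assuming hypothetical trigger names (`commit`, `nightly`, `release`), suite names, and build IDs; none of these are Metronome identifiers.

```python
from dataclasses import dataclass

# Hypothetical tiers matching the cadence on the slide: quick smoke tests on
# every change, heavy tests nightly, performance/scalability before a release.
TIERS = {
    "commit": ["smoke"],
    "nightly": ["smoke", "heavy"],
    "release": ["smoke", "heavy", "performance", "scalability"],
}


@dataclass(frozen=True)
class PlannedRun:
    build_id: str  # an already-finished build; planning tests never rebuilds
    suite: str


def plan_tests(trigger, build_ids):
    """Pair existing builds with every suite selected by this trigger."""
    if trigger not in TIERS:
        raise ValueError(f"unknown trigger: {trigger}")
    return [PlannedRun(b, s) for b in build_ids for s in TIERS[trigger]]


# Decoupling means an older build can be tested alongside the newest one.
runs = plan_tests("nightly", ["build-1041", "build-0987"])
```

Keeping the tier table as data makes the cadence easy to adjust without touching the planning logic.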


Testing Opportunities

• managed (static) vs. “unmanaged” (auto-updating) platforms
– isolate your changes from the OS vendors
– test your changes against a fixed target
– test your working code against a moving target

• root-level testing

• automated reports from testing tools
– Valgrind, Purify, Coverity, etc.

• cross-platform binary testing (build on A, test on B)
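"Build on A, test on B" means enumerating ordered (build host, test host) pairs. A minimal sketch follows, assuming a hypothetical `<arch>_<os>` platform-naming scheme and that pairs are only worth trying within one CPU architecture. The platform labels and the `arch` helper are illustrative, not NMI Lab platform names.

```python
from itertools import permutations

# Hypothetical platform labels of the form "<arch>_<os>"; substitute your own.
PLATFORMS = ["x86_rhel5", "x86_deb4", "ppc_aix53"]


def arch(platform):
    """Architecture prefix, e.g. "x86" from "x86_rhel5" (assumed naming scheme)."""
    return platform.split("_", 1)[0]


def cross_platform_pairs(platforms):
    """Ordered (build_on, test_on) pairs: different hosts, same architecture."""
    return [
        (a, b)
        for a, b in permutations(platforms, 2)  # ordered pairs, never a == b
        if arch(a) == arch(b)  # an x86 binary is not expected to run on ppc
    ]


pairs = cross_platform_pairs(PLATFORMS)
# pairs == [("x86_rhel5", "x86_deb4"), ("x86_deb4", "x86_rhel5")]
```

Ordered pairs matter here: a binary built on an older OS and run on a newer one is a different compatibility question from the reverse.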


Testing Opportunities

• Parameterized dependencies
– build with multiple library versions, compilers, etc.
– test against every Java VM, Maven, and Ant version around
– test against different DBs (MySQL, Postgres, Oracle, etc.), VM platforms (Xen, VMware, etc.), and batch systems
– make sure new versions of Condor, Globus, etc. don’t break your code

• Parallel scheduled testbeds
– cross-platform testing (A to B)
– deploy a software stack across many hosts; test the whole stack
– multi-site testing (US to Europe)
– network testing (cross-firewall, low-bandwidth, etc.)
– scalability testing
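The "parameterized dependencies" idea amounts to a Cartesian product over dependency axes. A minimal sketch, assuming hypothetical axes and version strings chosen only for illustration:

```python
from itertools import product

# Hypothetical dependency axes; the values are illustrative placeholders.
AXES = {
    "compiler": ["gcc-3.4", "gcc-4.1"],
    "java": ["1.4", "1.5", "1.6"],
    "db": ["mysql", "postgres"],
}


def dependency_matrix(axes):
    """Return one configuration dict per combination of axis values."""
    names = list(axes)  # dicts keep insertion order, so names align with values
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


configs = dependency_matrix(AXES)
# 2 compilers x 3 JVMs x 2 databases = 12 configurations
```

The matrix grows multiplicatively with each axis. Even these three short lists yield 12 build/test configurations, which is why a shared pool with many platforms makes this kind of testing practical.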


Upshot

• This is all work we’d like to help this community do.

• Start small -- automated builds are an excellent start.

• Think big -- what kinds of testing would pay dividends?

• Let us know what we can do to help make it happen.