
Guide to Developing an Effective STAT Test Strategy V4.0

STAT T&E Center of Excellence
2950 Hobson Way – Wright-Patterson AFB, OH 45433

Sarah Burke, Ph.D.
Michael Harman
Francisco Ortiz, Ph.D.
Aaron Ramert
Bill Rowell, Ph.D.
Lenny Truett, Ph.D.

21 December 2016

The goal of the STAT T&E COE is to assist in developing rigorous, defensible test strategies to more effectively quantify and characterize system performance and provide information that reduces risk. This and other COE products are available at www.AFIT.edu/STAT


TABLE OF CONTENTS

LIST OF ACRONYMS

EXECUTIVE SUMMARY

1. OVERVIEW

2. TEST AND EVALUATION IN DOD
   2.1 DT&E
   2.2 OT&E
   2.3 LFT&E
   2.4 IT&E
   2.5 T&E STRATEGY

3. STAT PROCESS COMPONENTS
   3.1 UNDERSTANDING THE REQUIREMENT(S)
   3.2 MISSION & TASK DECOMPOSITION AND RELATION
   3.3 SETTING TEST OBJECTIVES
   3.4 DEFINE RESPONSES
   3.5 FACTORS AND LEVELS
   3.6 CONSTRAINTS
   3.7 TEST DESIGN
   3.8 TEST EXECUTION PLANNING
   3.9 ANALYSIS

4. CONCLUSION

APPENDIX A. EXAMPLE APPLICATION OF SEQUENTIAL TESTING
   A.0 INTRODUCTION
   A.1 TEST RESULTS
   A.2 DOE STATISTICAL ANALYSIS OF RESULTS

APPENDIX B. EXAMPLE OF ANALYSIS OF RANDOMIZED BLOCK DESIGN

APPENDIX C. GLOSSARY OF STAT TERMS

APPENDIX D. LEARNING RESOURCES

REFERENCES


List of Figures

Figure 1. STAT in the Test & Evaluation Process Schematic
Figure 2. Test and Evaluation Strategy
Figure 3. A depiction of the flow from requirements to performance in the operational envelope
Figure 4. Test Design in the SE Process
Figure 5. Example Mission Decomposition
Figure 6. Assessment of Quantitative Objectives Over the Operational Envelope
Figure 7. Process Flow Diagram, Armed Escort Mission (Hutto, 2011)
Figure 8. Fishbone Diagram
Figure 9. Affinity Diagram (General Example)
Figure 10. Inter-relationship Diagram (General Example)
Figure 11. Continuous Versus Categorical Variables
Figure 12. Impact of Factor Type on Response
Figure 13. Illustrative Representation of the Randomized Blocking Procedure
Figure 14. An Example of Test Data Analysis and Interpretation


List of Acronyms

ACAT: Acquisition Category
ATO: Authorization to Operate
C&A: Certification & Accreditation
C: Control
CDD: Capability Development Document
COE: Center of Excellence
COI: Critical Operational Issues
CTI: Critical Technical Issues
CTP: Critical Technical Parameters
DoD: Department of Defense
DT&E: Developmental Test and Evaluation
DT: Developmental Testing
DTI: Developmental Test Issues
E: Easy to Change
FDS: Fraction of Design Space
H: Hold Constant
HTC: Hard to Change
IATT: Interim Authority to Test
IED: Improvised Explosive Device
IT&E: Integrated Test and Evaluation
IT: Information Technology
KPP: Key Performance Parameter
KSA: Key System Attributes
LFT&E: Live Fire Test and Evaluation
LRIP: Low Rate of Initial Production
MOE: Measure of Effectiveness
MOP: Measure of Performance
MOS: Measure of Suitability
N: Noise
OA: Operational Assessment
OSD: Office of the Secretary of Defense
OT&E: Operational Test and Evaluation
OT: Operational Testing
RMF: Risk Management Framework
RWR: Radar Warning Receiver
SCA: Security Controls Assessment
SEMP: System Engineering Management Plan
SME: Subject Matter Expert
STAT: Scientific Test & Analysis Techniques
T&E: Test and Evaluation
TEMP: Test and Evaluation Master Plan
TPM: Technical Performance Measures
VH: Very Hard to Change
VIF: Variance Inflation Factor


Change Summary

• V1.0: Original
• V2.0: Updated figures and terminology
• V3.0: Addition of learning resource appendix and errata
• V4.0: Full review of technical descriptions, updated images and sections, glossary of STAT terms, and new examples


Executive Summary

Scientific Test and Analysis Techniques (STAT) are deliberate, methodical processes and techniques that create traceability from requirements to analysis. All phases and types of testing (developmental, operational, integrated, and live fire) strive to deliver meaningful results, that is, results that are both defensible and decision-enabling, in an efficient manner. The incorporation of STAT provides a rigorous methodology to accomplish this goal.

The primary challenge in applying STAT to Department of Defense (DOD) testing is the broad scale and complexity of systems to be acquired together with the wide variety of missions and doctrines that drive system requirements. Although academic references and industrial examples are available in large quantities, much of this material lacks relevance to DOD test planning. Case studies, such as those present in test planning course curricula, are more application-oriented; however, even these rarely contain enough detail to be useful in practical settings. Another potential source of confusion arises from test documentation that not only fails to adequately define system requirements, but also omits conditions, such as the cyberspace environment, for the employment of these systems. Such difficulties must be overcome, usually at a high cost in both time and effort, to produce efficient and highly effective, rigorous test plans (including experimental designs) that instill a sufficient level of confidence in the inferred performance of the system under test.

This guide addresses the critical planning processes and procedures that are required for the effective integration of STAT into DOD test planning. Since many of the key aspects of such processes and procedures reside in the problem definition and mission decomposition phases of test planning, this guide provides greater emphasis on these areas. For further information concerning mathematical and statistical details relevant to the design of test, we direct the reader’s attention to specific references at various locations throughout this guide.

1. Overview

The STAT process is the application of the scientific method to the planning, design, execution, and analysis of experiments that address test objectives. Although STAT has traditionally been applied to physical systems, its use can be extended to software, software-intensive systems, simulations, and even test objectives that address cybersecurity requirements. STAT identifies and quantifies the risk associated with different experimental design choices and thus assists practitioners in developing efficient, rigorous test strategies that will yield defensible results. Freeman et al. (2014) and Coleman and Montgomery (1993) both outline considerations and processes for planning objective tests. The anticipated product of the sustained use of STAT is a more progressive, sequential testing approach that correctly leverages past test information while adhering to established systems engineering methodology.

The primary challenge in applying STAT to DOD testing is the broad scale and complexity of the systems, missions, and conditions. This is best addressed by breaking down the requirement, system, and/or mission into smaller pieces, which can then be readily translated into rigorously quantifiable statistical designs, a point that is expressed in the following excerpt from Montgomery (2013):

“One large comprehensive experiment is unlikely to answer the key questions satisfactorily. A single comprehensive experiment requires the experimenters to know the answers to lots of questions, and if they are wrong, the results will be disappointing. This leads to wasting time, materials, and other resources, and may result in never answering the original research question satisfactorily. A sequential approach employing a series of smaller experiments, each with a specific objective … is a better strategy.”

To support such a breakdown, the STAT methodology prescribes an iterative procedure that begins with the requirement and proceeds through the generation of test objectives, designs, and analysis plans, all of which may be directly traced back to the requirement. Critical questions at every stage help the planner keep the process on track. At the end, the design and analysis plan is reviewed to ensure that it supports the objective that began the process. Figure 1 provides a concise process flow diagram that summarizes the application of STAT to the test and evaluation process.

Figure 1. STAT in the Test & Evaluation Process Schematic

2. Test and Evaluation in DOD

The DOD exercises three formal and statutory categories of tests administered by the Office of the Secretary of Defense (OSD): developmental test & evaluation (DT&E), operational test & evaluation (OT&E), and live fire test & evaluation (LFT&E). The simultaneous execution and independent assessment of developmental and operational testing is called Integrated Test & Evaluation (IT&E).


Note: For complete definitions, refer to the Defense Acquisition Guidebook (DOD, 2015) and the Test and Evaluation Management Guide (DOD, 2012).

2.1 DT&E

DT&E verifies that a system is built correctly and meets the technical requirements specified in the contract. DT&E is conducted throughout the lifecycle of a system to inform the systems engineering process and acquisition decisions, to help manage design and programmatic risks, and to evaluate the combat capability of the system and its ability to provide timely and accurate information to the warfighters.

2.2 OT&E

OT&E validates a system in terms of the manner in which it executes its mission in a realistic operational environment. Every operational test evaluates a system against three major categories: effectiveness, suitability, and survivability.

Operational effectiveness is the extent to which a system is capable of accomplishing its intended mission(s) when used by representative personnel in the environment planned or expected for operational employment. A representative operational environment is considered to be a realistic arrangement of various elements, both dynamic and static, and both physical and conceptual, such as operators, maintainers, weather, terrain, organization, training, doctrine, tactics, survivability, vulnerability, and threat (including the growing cybersecurity threat).

Operational suitability is the degree to which a system can be placed in field use when limitations on the system's reliability, availability, compatibility, transportability, interoperability, wartime usage rates, maintainability, safety, human factors, manpower supportability, logistics supportability, documentation, environmental effects, and training requirements are considered. The evaluation of operational suitability informs decision makers about the ability of a system to sustain operations tempo over an extended period of time in the anticipated operating environment(s).

2.3 LFT&E

LFT&E provides an assessment of the lethality of a system with respect to its intended targets and its vulnerability to adversary countermeasures as it progresses through design and development. Anticipated threats to the user of the system as a result of vulnerabilities that are induced by system design shortcomings are a major consideration with regard to LFT&E. A sound LFT&E strategy proactively incorporates design changes resulting from testing and analysis before proceeding beyond the Low Rate of Initial Production (LRIP) phase of acquisition.

2.4 IT&E

IT&E is "the collaborative planning and collaborative execution of test phases and events to provide shared data in support of independent analysis, evaluation, and reporting by all stakeholders, particularly the development (both contractor and government) and operational test and evaluation communities" (OSD Memorandum, Definition of Integrated Testing). Integrated testing focuses the test strategy on designing, developing, and implementing a comprehensive, yet economical, test design for collaborative use among the various organizations participating in the test. The goal is to improve the quality of the overall system evaluation through synergistic interactions between the organizations.

2.5 T&E Strategy

Defense acquisition programs are designated by category and type. The acquisition category (ACAT) is a statutory set of cost thresholds maintained by the Defense Acquisition Authority. ACAT classifies acquisition programs by estimated total life cycle cost, thereby prescribing acquisition strategies according to the scale of life cycle costs. The test strategy, in turn, is developed to describe how T&E supports the acquisition strategy. Although test strategies may differ, they must all adhere to sound T&E processes and practices that promote thrift, timeliness, and system performance.

The Test and Evaluation Master Plan (TEMP) is the master document for the planning, management, and execution of the T&E program for a particular system. It describes the overall test program structure, strategy, schedule, resources, objectives, and evaluation frameworks.

Figure 2. Test and Evaluation Strategy

As indicated in Figure 2, T&E planning begins with the evaluation of the capability requirements, which are usually identified by individuals and organizations with intimate knowledge of, or direct involvement in, operational activities. The requirements are then sorted into technical and operational categories, where they are expressed in the form of measurable and testable technical performance measures (TPMs). The developmental and operational issues, objectives, and measures comprise the twin pillars of the T&E strategy, namely the Developmental and Operational Evaluation Frameworks. The evaluation frameworks describe how a system will be evaluated against its technical and operational requirements to inform programmatic, technical, and operational decisions.

Before a DoD Information Technology (IT) system can operate, it must undergo a formal Certification & Accreditation (C&A) process to verify that it is secure enough to operate in the expected cybersecurity environment. DoDI 8510.01, "Risk Management Framework (RMF) for DoD Information Technology (IT)," documents DoD guidance on implementing the National Institute of Standards and Technology's RMF – the current DoD C&A approach. The key part of this process is verifying that the necessary security controls are in place to allow the system both to begin testing (Interim Authorization to Test (IATT)) during the Engineering & Manufacturing Development phase and to fully operate (Authorization to Operate (ATO)) during the Production & Deployment phase. Planned cybersecurity testing during the T&E process is essential to inform the Security Controls Assessment (SCA) part of the RMF process.

The Cybersecurity Test and Evaluation Guidebook (DoD, 1 July 2015) outlines the following 6 Cybersecurity T&E phases:

1. Understand cybersecurity requirements
2. Characterize the cyber-attack surface
3. Cooperative vulnerability identification
4. Adversarial cybersecurity DT&E
5. Cooperative vulnerability and penetration assessment
6. Adversarial assessment

These phases occur from pre-Milestone A test planning through developmental test, to cybersecurity T&E after Milestone C. The first four phases primarily support DT&E while the last two phases support OT&E.

The overarching guidelines for cybersecurity T&E are:

• Planning and executing cybersecurity DT&E should occur early in the acquisition lifecycle.
• Test activities should integrate RMF security control assessments with tests of commonly exploited and emerging vulnerabilities early in the acquisition life cycle.
• The TEMP should detail how testing will provide the information needed to assess cybersecurity and inform acquisition decisions.
• The cybersecurity T&E phases support the development and testing of mission-driven cybersecurity requirements, which may require specialized systems engineering and T&E expertise.

The requirements are typically associated with two types of test issues: the critical technical issues (CTI), which are associated with the technical requirements, and the critical operational issues (COI), which are associated with the operational requirements. The CTIs and COIs guide technical and operational evaluations throughout the acquisition process in refining performance requirements, improving the system design, and enriching milestone decision reviews. They are typically formulated as questions for which multiple measures are often required to adequately determine an answer.

CTIs, also referred to as developmental test issues (DTIs), are the technical evaluation counterparts of the COIs. CTIs must be examined during DT&E to evaluate technical parameters, characteristics, and engineering specifications. They are normally resolved by demonstrating that a system has fulfilled key performance parameters (KPPs), key system attributes (KSAs), and critical technical parameters (CTPs). KPPs, KSAs, and CTPs are a subset of the TPMs that are derived from the requirements document and the System Engineering Management Plan (SEMP). They provide quantitative and qualitative information on how well a system, when performing the mission essential tasks specified in the requirements document, is designed and manufactured. The failure of a system to meet KPP and KSA thresholds may provide grounds for program reevaluation, further technical review, or even program termination.


COIs are operational effectiveness and operational suitability issues that must be examined in OT&E to evaluate the capability of a system to perform its mission. The resolution of the COI is based on the evaluation of measures of effectiveness (MOE) or measures of suitability (MOS) using standard criteria. MOEs and MOSs comprise the subset of TPMs that reflect the operational requirements derived from the KPPs and other requirements. MOEs indicate how well a system performs its mission under a given set of conditions while MOSs indicate how ready, supportable, survivable, and safe a system is to sustain effective performance in combat and peacetime operations. COIs address the overall system’s operational capability when operated by the warfighter in realistic operational mission environments.

3. STAT Process Components

In this section, we provide an overview of the key components in the STAT process, originally depicted in Figure 1. We review each of these topics in more detail throughout the remainder of this guide.

• Requirements
  o The information, purpose, and intent contained in the requirements drive the entire process.
  o All subsequent steps support selection of test points that will provide sufficient data to definitively address the original requirement.
• Mission/System Decomposition
  o Break the system or mission down into smaller segments.
  o Smaller segments make it easier to discern relevant response variables and the associated factors.
• Setting Objectives
  o Objectives are derived from the requirements and reflect the focus and purpose of testing.
  o These serve to further define the scope of testing.
  o Objectives should be unbiased, specific, measurable, and of practical consequence (Coleman and Montgomery, 1993).
• Response Variables
  o Responses are the measured output of a test event.
  o These are the dependent outputs of the system that are influenced by the independent or controlled variables, otherwise known as factors.
  o They are used to quantify performance and address the requirement.
• Factors and Levels
  o Design factors are the input (independent) variables for which there are inferred effects on the response, or dependent, variable(s).
  o Levels are the values set for a given factor throughout the experiment. Between each experimental run, the levels of each factor should be reset, even when the next run may have the same level for some factors.
  o Tests are designed to measure the response over different levels of a factor or factors.
  o Statistical methods are then used to determine the statistical significance of any changes in response over different factor levels.
  o Uncontrolled factors contribute to noise and are referred to as nuisance factors.
• Constraints
  o A constraint is anything that limits the design or test space.
  o Constraints may be resources, range procedures, operational limitations, and many others.
  o Limitations are significant in the choice of design, execution planning, execution, and analysis.
• Test Design
  o All the aforementioned considerations combine to inform the final test design.
  o The design provides the tester with an exact roadmap of what and how much to test.
  o The design provides the framework for future statistical tests of the significance of factor effects on the measured response(s).
• Test Execution
  o The planning accounts for, and requires, certain aspects of the execution to be accomplished in a particular manner.
  o Since variations from the test plan during execution may potentially impact the analysis, they must be approved (if the change is voluntary) and documented as data are collected during test execution.
• Analysis
  o This phase begins once all of the data have been collected and the statistical analysis starts.
  o The analysis must reflect how the experiment was actually executed (which may be different from the original test plan).
  o Conclusions as to the efficacy of the system during test are documented, organized, and then reported to decision makers.
  o The information collected from the previous test phases may either influence the design of the next phase of testing or inform future program decisions.
• Program Decisions
  o The final step is the decision concerning whether to proceed into the production phase or to continue development of the test article.
  o The decision phase represents the fulfillment of the entire test process.
  o The program utilizes data collected on the system to quantitatively comment on performance and adjudicate the acquisition process.

3.1 Understanding the Requirement(s)

Requirements are the starting and ending point for test and evaluation. Planning starts with mapping out a path to report on the requirement. Many choices are made along the way to address concerns like operating conditions, resource constraints, and range limitations. These choices help focus the process toward the test design and execution methods so the analysis will provide the right information to make a decision about the requirement. If the requirement is not understood clearly at the beginning of the test-planning process, the test team may find that it cannot produce the data needed to report on the requirement. Understanding what is written, what is missing, or what needs to be clarified in the requirement is the first step in effective planning (Harman, 2014). Figure 3 depicts the translation of requirements to performance measures which define the outcome of test events. One should construct requirements in a thoughtful manner that facilitates a meaningful analysis of the performance of the system under test (see Example 1 in Appendix A).


TIP: The most important ingredient in a recipe for test planning success is full and active test team participation. This is best accomplished through a formal STAT working group initiated early and maintained throughout the acquisition cycle.

Critical questions:

1. To what part of the mission does this requirement pertain?
2. Are specific factors described?
   a. Is there anything specifying performance or conditions?
   b. What are the operating conditions relevant to this requirement?
3. What analysis is implied?
   a. Is there a percentage of the time it must pass?
   b. Is the evaluation to be based on the overall average performance?
4. What remains to be clarified in the requirement?
   a. Can I effectively characterize the system?
   b. If not, where are the risk (un-testable) regions?

Takeaway: Question, understand, clarify, and document details in the requirements first.

Figure 3. A depiction of the flow from requirements to performance in the operational envelope.


3.2. Mission & Task Decomposition and Relation

Every test design must enable evaluation of top level requirements in realistic operational conditions. Because end-to-end, systems-of-systems testing in realistic operational conditions is very costly, it is usually impractical to test every possible mission scenario. So, how does one develop a test program that addresses all critical functions over a comprehensive set of operational conditions and provides rigorous and defensible results? This section discusses the process, depicted in Figure 4, of adapting higher-level requirements into testable criteria.

Figure 4. Test Design in the SE Process

After determining the requirements and the corresponding objectives, the next step of the process is to understand how the system must function in order to meet the objectives. These functions can then be decomposed into the components that are required to accomplish them. This will enable the straightforward determination of the responses, factors, and levels needed to evaluate those critical components. This process is depicted along the left side of the flow diagram in Figure 4. Decomposing a mission from top level objectives to segments to functions might look like the example illustrated in Figure 5. At the top level, the "kill targets" objective (1) is too broad and may have numerous factors over the mission space. The lower level segments (2) are more defined and will have fewer contributing factors. At another level below the segments, the functions (3) are further refined with an even smaller factor list. Early DT should focus on this lowest level to ensure performance is characterized, quantified, and verified.

Figure 5. Example Mission Decomposition

If one begins testing at the mission level, it is often very difficult to determine how individual system components influence performance. Even if one could identify a particular component as being the specific cause of observed behavior during test, the exact mechanism of this causal relationship may not be well understood. This issue is mitigated by a sequential, progressive test strategy that starts with low level test designs and culminates in top-level operational testing. Box and Liu (1999) and Box (1999) emphasize the importance of sequential testing and learning as follows:

• The fundamental starting point for any test (e.g., the "Component Testing" block of the flow diagram in Figure 4) should be to understand the performance of each critical component. The type and the extent of the component testing required is a function of criticality and previous knowledge.

• The next level of testing (see "Functional Testing" in Figure 4) is to evaluate individual functions. To the greatest extent possible, functional characteristics should be evaluated over the entire range of operational conditions. Testing over a subset of possible conditions greatly increases the risks of delaying failure discoveries until OT&E.

• Next, the combination of all of the functions should be evaluated in the system (see "System Testing" in Figure 4). The goal is to discover failures caused by the integration of all of the functions.

• Finally, operational testing should be conducted in order to evaluate the highest level system-of-systems measures and to validate all previous testing.

The goal of the preceding test strategy is to identify and correct failures as early as possible; the strategy does not necessarily draw a line between DT&E and OT&E. DT&E should be performed over the entire range of operational conditions. OT&E may then focus on the system-of-systems mission level testing, thus augmenting and/or verifying the DT&E results.

TIP - Whenever a fundamental failure of a component or a function is discovered during complex system-level testing, it is more difficult to isolate, more complex to determine the root cause, and more expensive to correct.


Critical questions:

1. Are all the steps necessary to meet the requirements included in the decomposition?
2. Are responses and factors easily defined for critical components?
3. Will DT testing cover the entire range of operational conditions?
4. Will early testing be used to inform more complex test designs?
5. Will risk areas (for failing to meet requirements) be identified before developing OT test plans?

Takeaway: Decompose missions, systems, and functions until obvious and meaningful responses and factors are revealed; then, develop a test strategy that builds on previous testing as complexity increases.

3.3. Setting Test Objectives

Planning cannot proceed without a set of clear test objectives. Objectives are derived from the requirements and serve to focus resources and designs toward addressing the requirement in a clear, quantitative, and unambiguous way. As indicated in Figure 4, the objective(s) (there may be more than one) are ideally measurable in terms of continuous values in order to facilitate statistical analysis. The objectives form a roadmap that can be referenced throughout the planning process to ensure that subsequent steps remain on course to produce a relevant design. Table 1 provides examples of requirements and their associated objectives.


Figure 6. Assessment of Quantitative Objectives Over the Operational Envelope.

Critical questions:

1. How does the requirement drive the focus of testing?
2. What is needed to address the requirement and its details? Is the desire to
   a. Characterize performance across a number of factors?
   b. Identify factors that impact the response(s) (screen)?
   c. Validate performance determined in other testing or simulation?
   d. Compare systems or configurations?
   e. Optimize performance in a specified region?
   f. Predict performance in a specified region?
3. Can you state the objective clearly and concisely?
   a. Is it specific?
   b. Is it meaningful to the Warfighter?


Table 1. Example Requirements and Objectives

Requirement: Demonstrate the system has an operational availability greater than 0.98 with 95% confidence.
Test Objective: Compare the operational availability of the system to the 0.98 requirement.

Requirement: Rate of fire of X rounds per minute (rpm) and a maximum rate of fire of Y rpm.
Test Objective: Characterize rate of fire across expected operational conditions and configurations to include accuracy.

Requirement: Radar range for detection at least X nm for patrol boat.
Test Objective: Characterize radar detection ranges for defined patrol boat types and sizes across varying weather conditions, radar modes, and boat speeds.

Requirement: Respond to a fire mission with first round fired within X seconds Y% of the time.
Test Objective: Characterize ability of the system to respond in the prescribed time under expected operational conditions and for prescribed targets.

Requirement: Probability of detection of X% in a mission duration of Y hours.
Test Objective: Quantify the probability of detection for prescribed targets at varying ranges, depths, and speeds using operational systems deployed across typical ranges and combinations.

Requirement: The Radar Warning Receiver (RWR) receives uncorrupted/unimpeded data from radar input.
Test Objective: Compare the data received by the RWR with the data sent from the radar.

Requirement: Evaluate Improvised Explosive Device (IED) probe detection capability for various terrain types and target depths (for a rapid acquisition initiative).
Test Objective: Develop an empirical model of the probe signal output as a function of terrain type (categorical) and target depth (continuous).

Requirement: Employ same ballistic tables if existing mortar installed in new vehicle exhibits same performance as existing mortar/vehicle.
Test Objective: Use impact accuracy to screen both vehicle types across a variety of factors to determine if a statistical similarity or difference exists.
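To make one of these pairings concrete, the sketch below fits the kind of empirical model called for by the IED-probe objective, with terrain type as a categorical factor and target depth as a continuous factor. It is a minimal sketch, assuming Python with pandas and statsmodels; the data values, column names, and terrain categories are notional and not taken from this guide.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Notional probe data: terrain type (categorical) and target depth (continuous).
df = pd.DataFrame({
    "terrain": ["sand", "sand", "clay", "clay", "gravel", "gravel", "sand", "clay", "gravel"],
    "depth_cm": [5, 15, 5, 15, 5, 15, 10, 10, 10],
    "signal": [8.2, 5.1, 6.9, 3.8, 7.5, 4.6, 6.7, 5.2, 6.0],
})

# C() marks terrain as categorical; depth_cm enters as a continuous term.
model = smf.ols("signal ~ C(terrain) + depth_cm", data=df).fit()
print(model.summary())

# Because depth is continuous, the model can predict at depths never run directly.
new_point = pd.DataFrame({"terrain": ["sand"], "depth_cm": [12]})
print(model.predict(new_point))
```

Because depth enters the model as a continuous term, the fitted model can estimate the response at depths that were never run directly, which is one of the benefits of continuous factors discussed in Section 3.5.1.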

Takeaway: Objectives should be unbiased, specific, measurable, and of practical consequence (Coleman and Montgomery, 1993).


3.4. Define Responses

A key to good test design is to fully understand the goals of the test and what response(s) should be measured. Responses are the measured output of a test event. There may be several responses measured for a given test and/or in support of a requirement. One tool to help in determining these measures is a process flow diagram that details each step of functionality. To illustrate, consider the following process flow map of an armed escort mission in Figure 7.

Figure 7. Process Flow Diagram, Armed Escort Mission (Hutto, 2011)

Steps and decisions are coded in blue while possible responses are in yellow. Within such a flow diagram, stated system performance measures should be traceable back to KPPs, measures of performance (MOPs), MOEs and MOSs. Moreover, measures that already exist in capability documents, particularly those metrics that were used in similar test programs or were developed by subject matter experts, should be documented in the flow diagram.

Critical questions:

1. Does the response relate to top level requirements (e.g., through KPP, MOE)?
   a. Is the response expressed as a continuous variable?
   b. If not, can it be converted to a continuous variable?
2. Is the response directly measurable?
3. Is it defined in a clear and unambiguous manner?

Takeaway: The response must be clear, concise, measurable, preferably continuous, and directly related to the requirement.


3.5. Factors and Levels

Factors are inputs to, or conditions for, a test event that potentially influence the variability in the response. Factors can be derived from prior testing, system knowledge, or insight into the underlying physics of the problem. Factors include configurations, physical and ambient conditions, and operator considerations (e.g., training, skills, limitations, shifts). Factors should not be ruled out without proper screening or technical analysis. Levels, which are the distinct values that are set for each factor in a design, may number anywhere from two (low and high, for example) to many values. Factors should be made continuous whenever possible to ensure the most information is contained in the design and to maximize the level of detail contained in the analysis. Categorical (non-continuous) factors also add additional runs to the test matrix and should thus be avoided whenever possible. We discuss the benefits of using continuous factors (and responses) in Section 3.5.1 on data type consideration. The fishbone diagram (Figure 8) is a convenient brainstorming tool whose purpose is to facilitate extracting causal factors. There are six broad categories often referred to as the 6 M's: Machines, Manpower, Materials, Measurements, Methods, and Mother Nature. These categories help to further identify potential factors that influence the response variable. The fishbone diagram is often used in a collaborative environment in which all potential reasonable sources are categorized into one of the 6 M's.

Figure 8. Fishbone Diagram

Other methods for brainstorming potential factors include affinity diagrams and inter-relationship diagrams, which can be used in conjunction with the fishbone diagram. An affinity diagram, which has been used successfully in industry, can be an effective tool to organize many items into natural groupings or when a group needs to come to a consensus. Creating the affinity diagram is a group exercise. Each member of the group writes down ideas, for example potential factors that may affect the response, on separate sticky notes or cards. All of the cards are then placed before the entire group. Then, with no communication between group members, each person examines the cards and groups similar items together. Some of these cards may be moved more than once and some may not belong in a group with others. A generic example of the results of this process is shown in Figure 9. One advantage of the affinity diagram process is that it encourages creative thinking from the group and can generate new ideas that may have been previously missed (Tague, 2004, pp. 96-99).

Figure 9. Affinity Diagram (General Example)

An inter-relationship diagram (Figure 10) can be used after generating a cause-and-effect or affinity diagram in order to further understand and explore links between items (or steps) in a complex solution (or process). For every item generated from the fishbone or affinity diagram, ask “does this item cause or influence any other item?” Draw arrows from each item to the items it influences. After completing this process for all of the items, analyze which items have the most arrows going in and out of them. In general, those that have mostly outgoing arrows are basic causes to investigate; those that have mostly incoming arrows are critical effects to address (Tague, 2004, pp. 444-446).
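The arrow-counting step described above is simple enough to automate. The sketch below, a plain-Python illustration with hypothetical item names, records each "A influences B" arrow and applies the rule of thumb from Tague (2004): items with mostly outgoing arrows are candidate basic causes, and items with mostly incoming arrows are candidate critical effects.

```python
from collections import Counter

# Hypothetical "A influences B" arrows from a fishbone/affinity exercise.
arrows = [
    ("operator training", "setup time"),
    ("operator training", "aim error"),
    ("ambient temperature", "sensor drift"),
    ("sensor drift", "aim error"),
    ("aim error", "miss distance"),
    ("setup time", "miss distance"),
]

out_deg = Counter(src for src, _ in arrows)   # arrows going out of each item
in_deg = Counter(dst for _, dst in arrows)    # arrows coming into each item
items = {name for pair in arrows for name in pair}

for item in sorted(items):
    o, i = out_deg[item], in_deg[item]
    role = "basic cause" if o > i else "critical effect" if i > o else "intermediate"
    print(f"{item}: out={o}, in={i} -> {role}")
```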


Figure 10. Inter-relationship diagram (General Example)

The test team further classifies each factor as:

• Control (C) – an experimental factor that can be controlled and varied during test that could impact the response as factor levels change.

• Hold Constant (H) – a factor that likely influences the response that is set to a constant value throughout all of the tests. The factor may not be of primary interest, may be too difficult to vary, or may have been previously shown to not be significant.

• Noise (N) – a factor that likely influences the response, but may not be possible (or desirable) to control in an experiment or in the field. Nevertheless, it may be possible to observe and record such factors during test.

The test team identifies these factors and agrees on how to treat each one for the test program. This is an iterative assessment in which potential factors are mapped to the responses, which themselves are mapped to test goals. The test team assigns a role (C, H, or N) to each factor and also determines whether it is easy (E), hard (H), or very hard (VH) to change. The impact of this latter classification on the design choice will be discussed in Section 3.7.

Critical questions:

1. Have all factors been considered?
   a. Do all the listed factors contribute to the response?
   b. How were potential factors determined?
   c. Were they determined via a collaborative brainstorming session?
   d. Has input from all subject matter experts been collected?
2. Are the factors clear and unambiguous?
   a. How many levels can (and should) be allocated for each factor?
   b. Will the number of levels support the objective of test?
   c. Will the number of levels adequately sample the design space?
3. How is each factor expressed?
   a. Are the factors expressed as continuous or categorical variables?
   b. Can categorical variables be converted to continuous variables?
4. Has each factor been classified as a hold constant, control, or noise factor?
   a. Can the factors be controlled?
   b. Is there any difficulty in controlling or setting the factor levels?
5. What are the factor priorities for testing?
   a. Which are to be controlled or held constant?
   b. Which will contribute to noise in the response?

3.5.1 Data Type Consideration

The data type chosen to represent factors (inputs) and responses (outputs) in an experiment may affect the resources needed to conduct an experiment and, consequently, the quality of the statistical analysis. The STAT T&E COE has observed, through a review of program TEMPs and test plans, that categorical data types are too often used when describing factor levels and responses. This may be due to vaguely-defined requirements and test objectives. It could also be that planners find it easier to conceptualize and plan experiments using generic settings and measures. Perhaps there is difficulty in measuring the inputs and outputs in great detail. For example, live fire testing could be destructive, making it impossible to precisely measure the impact point. As shown in Figure 11, it may prove easier to use a count of hits and misses instead of the measured miss distance from the target.

Figure 11. Continuous Versus Categorical Variables

Takeaway: Factors must be clear, concise, controllable, preferably continuous, and must relate directly to the response.


Another example would be testing the system during different times of the day and listing the factor settings as "day" and "night" instead of using some measure of illumination. Whatever the reason, any perceived savings in favoring categorical over continuous data types may be paid for in wasted resources and complications during analysis (Ortiz, 2014).

As mentioned previously, categorical factors add additional runs to the test and may also have an effect on the quality of the analysis (see Figure 12). In addition, categorical data types do not allow for prediction between levels. Suppose temperature is a factor of interest. Using low, medium, high as the levels for this factor, you can estimate the response for each of these levels. However, if you associate degrees to these levels and treat temperature as a continuous factor, you are now able to estimate the response at temperatures within the entire range of these levels.

In the case of responses, categorical data types contain considerably less information than continuous data types (38% to 60% less in some cases; Cohen, 1981). This loss of information makes it more difficult for statistical tests to detect changes in the presence of noise. More specifically, these tests will have a poor signal-to-noise ratio (the ratio of the desired difference in the response to detect, δ, to the magnitude of the process noise variability, σ). A poor signal-to-noise ratio will in turn require a greater number of replications/runs to achieve sufficient power.

Figure 12. Impact of Factor Type on Response

TIP – The fact that a requirement is stated as a probability should not dictate how the data are collected! Continuous responses can be converted to probabilities in the analysis.
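As a small illustration of this tip, the sketch below records a continuous response (miss distance) and converts it to an estimated probability of hit during analysis. It is a minimal sketch assuming Python with numpy; the simulated data and the 2 m hit radius are notional.

```python
import numpy as np

# Notional miss-distance data (continuous response), in meters.
rng = np.random.default_rng(1)
miss_distance_m = np.abs(rng.normal(loc=1.5, scale=0.8, size=30))

# Convert the continuous response to a probability in the analysis:
# a "hit" is defined here as a miss distance within a notional 2 m radius.
hit_radius_m = 2.0
p_hit = np.mean(miss_distance_m <= hit_radius_m)
print(f"Estimated P(hit) = {p_hit:.2f} from {miss_distance_m.size} runs")

# The full miss distances remain available for richer analyses (e.g., fitting
# a regression model or a distribution), which a hit/miss tally would not allow.
```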



Critical questions:

1. Is the factor or response a continuous variable or can it be converted to a continuous variable?
2. What is the estimated signal-to-noise ratio for each response?
3. Do I need to have an understanding of what is happening between categorical levels?
4. How many fewer runs are needed if a response were continuous rather than categorical? (See the sketch below.)
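The sketch below gives a rough answer to question 4 for one illustrative case: detecting a one-standard-deviation shift when the response is kept continuous versus reduced to hit/miss (where the same shift moves P(hit) from 0.50 to roughly 0.84). It assumes Python with statsmodels; the effect sizes, alpha, and power targets are illustrative only.

```python
from statsmodels.stats.power import TTestIndPower, NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

alpha, target_power = 0.05, 0.80

# Continuous response: detect a shift of one standard deviation (delta/sigma = 1).
n_continuous = TTestIndPower().solve_power(effect_size=1.0, alpha=alpha,
                                           power=target_power)

# Same shift seen only as hit/miss: P(hit) moves from 0.50 to about 0.84.
h = proportion_effectsize(0.84, 0.50)
n_binary = NormalIndPower().solve_power(effect_size=h, alpha=alpha,
                                        power=target_power)

print(f"Runs per condition, continuous response: {n_continuous:.0f}")
print(f"Runs per condition, hit/miss response:   {n_binary:.0f}")
```

Under these assumptions, the hit/miss version needs noticeably more runs per condition to reach the same power, which is the resource penalty described in this section.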

3.6. Constraints

An important step in the test planning process is to identify any restrictions on the test design or execution. Test design and execution restrictions can influence the design choice and analysis. Constraints affect which combinations of levels among different factors may appear in the design simultaneously, thus influencing the geometrical arrangement of the design points. Common constraints include the budget, the experimental region, and restrictions on randomization (which includes difficulty changing factor levels).

3.6.1. Budget Constraints

The test budget, in terms of both time and money, is an important consideration when planning an experiment. Planning a "should execute" design and then factoring in cost constraints to create the actual design enables the team to assess the risk imposed by budget constraints. This risk can be expressed as the loss of statistical power (refer to Section 3.7 for more information on power).

3.6.2. Restrictions on the Experimental Design Region

The test planning team must consider any restrictions on the experimental region. The team must decide on the experimental region of interest, which may be a subset of the region of operability. This experimental region need not be rectangular, which means that disallowed factor combinations may exist. For example, the high levels of two factors (say pressure and temperature) considered separately may be within the region of operability; however, the combination of both factors at high levels may be outside the region of operability. Prohibited combinations may stem from factor interactions on the response, safety considerations, undesired results occurring under certain combinations, and known scientific principles. It is important that all prohibited combinations be identified early in the planning process.
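A minimal sketch of how such a restriction might be recorded and enforced when building a candidate test matrix, assuming only Python's standard library; the factor names, levels, and the prohibited high/high combination are hypothetical.

```python
from itertools import product

# Hypothetical factor levels for a candidate test matrix.
temperature_c = [20, 60, 100]
pressure_kpa = [100, 400, 700]

def allowed(temp_c, pres_kpa):
    # Prohibited region: both factors at their high settings at the same time.
    return not (temp_c >= 100 and pres_kpa >= 700)

candidates = list(product(temperature_c, pressure_kpa))
design_region = [pt for pt in candidates if allowed(*pt)]

print(f"{len(candidates)} candidate points, {len(design_region)} allowed")
for temp_c, pres_kpa in design_region:
    print(temp_c, pres_kpa)
```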

3.6.3. Restrictions on Randomization

Randomization is one of the three principles of experimental design and refers to both the randomization of the order of test runs and the random allocation of test materials. Failure to randomize could result in factors being confounded with nuisance variables, meaning that the analysis will be unable to distinguish effects due to the factors from effects due to the nuisance variables. Therefore, any restrictions on randomization must be identified in the planning stages, as these may affect the design and analysis.

The ease with which factor levels can be varied must be considered as well. Are there any factors that are difficult or expensive to change many times? A decision to vary levels of such hard-to-change factors less often will represent a restriction on randomization, which may subsequently lead to a split plot design (Anderson-Cook, 2007). Another common restriction on randomization is blocking, a variance reduction technique used for dealing with nuisance variables: factors that could influence the response but that are not of interest. Blocking is discussed in more detail in Section 3.7.

In Section 3.7, we discuss one of the principles of DOE: replication, independent test runs of each factor level combination. Distinct from replication, repeated observations, also called sub-sampling or pseudo-replicates, are multiple measurements made at once for a given factor level combination. For example, suppose you have two factors with a and b levels, respectively, and n replicates. In a completely randomized design, all abn observations would be taken in a random order. Suppose instead that the n replicates are taken all at once for one of the levels of factor A and one of the levels of factor B. This may be done because both factors are difficult to change (or it is impractical to do so). There are analysis methods to account for this restriction on randomization, often done with a nested model. See Kutner et al. (2005, p. 1106) for details.
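The sketch below illustrates these ideas for a hypothetical two-factor test: it builds the factor-level combinations with replicates, produces a completely randomized run order, and then shows a blocked alternative in which each test day receives one full replicate and the order is randomized within each day. It assumes only Python's standard library; the factors, levels, and block structure are notional.

```python
import random
from itertools import product

random.seed(2016)  # fixed seed so the illustration is repeatable

altitudes = ["low", "high"]            # factor A (a = 2 levels)
speeds = ["slow", "medium", "fast"]    # factor B (b = 3 levels)
replicates = 2                         # n independent runs per combination

runs = list(product(altitudes, speeds)) * replicates   # a*b*n = 12 runs

# Completely randomized design: all 12 runs in one random order.
crd = runs[:]
random.shuffle(crd)

# Randomized block design: the test spans two days, so use day as a block.
# Each day gets one full replicate of the six combinations, and the run
# order is randomized separately within each day.
blocks = {"day 1": runs[:6], "day 2": runs[6:]}
for day_runs in blocks.values():
    random.shuffle(day_runs)

print("Completely randomized order:", crd)
print("Blocked (randomized within day):", blocks)
```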

Critical questions:

1. What is the budget for this test?
2. What is the total time allotted for this test?
3. How much time does each run take to complete?
4. How much does each run cost?
5. Are there restrictions on the operational space or design region?
6. What range restrictions are imposed?
7. How many disallowed combinations of factor levels are there?
8. What factors are hard or costly to change?
   a. Does the system configuration remain set for multiple runs before being changed?
   b. Are there factor levels that can only be changed in a linear manner (e.g., small to large)?
9. Is blocking necessary?
   a. Does the test span multiple days?
   b. Will different operators perform different test runs?

Takeaway: Detail, define, and document how constraints will limit or impact the design or execution.

3.7. Test Design

Many considerations impact the choice of design, several of which have appeared in previous sections of this guide, including:

• What is the test objective (e.g., screening, characterization, optimization)?
• Are the factors quantitative or categorical?
• How many factors are there?
• How many levels does each factor have?
• Are there any constraints?
• Are there any restrictions on randomization?

Table 2 lists some sample designs for various test objectives. Details of many of the designs listed in Table 2 can be found in Box, Hunter, and Hunter (2005), Montgomery (2009), and the NIST website.


Table 2. Design Types

Test Objective | Sample Designs*
Screening for Important Factors | Factorial, Fractional Factorial Designs, Definitive Screening Designs, Optimal Designs
Characterize a System or Process over a Region of Interest | Factorial, Fractional Factorial Designs, Response Surface Designs, Optimal Designs
Process Optimization | Response Surface Designs, Optimal Designs
Test for Problems (Errors, Faults, Software Bugs, Cybersecurity Vulnerabilities) | Combinatorial Designs, Orthogonal Arrays, Space Filling Designs
Reliability Assessment | Sampling Plans, Sequential Probability Ratio Test, Design of Experiments

* The design choices listed in this table are general guidelines. The design should be chosen to match your goals as well as account for any restrictions in the test execution as discussed in Section 3.6. For example, your goal may be to screen factors, but if there are restrictions on the design region, a factorial or fractional factorial design is not appropriate.

There are many considerations to weigh when choosing between candidate designs. Table 3 provides a list of criteria that can be used to evaluate and compare designs, thereby minimizing the risk and cost for a given design.

Table 3. Design Metrics

Criteria | Definition | Application | Additional Notes
Statistical Model Supported | The effects that can be estimated by the design (e.g., main effects, 2-factor interactions, quadratic effects, or higher order terms). | Corresponds to test objective. |
Confidence | The probability of concluding a factor has no effect on the response when it does not have an effect (true negative rate, equal to 1 – type I error rate). | Maximize. | Must be determined before the experiment is executed. Used to help determine the power of a design.
Power | The probability of concluding a factor has an effect on the response when in fact it does (true positive rate, equal to 1 – type II error rate). | Maximize. |
Correlation Coefficient between Model Parameter Estimates | Measures the strength and direction of the linear relation between two model parameters. | Minimize correlation between model parameters. | If model parameters are orthogonal, the estimated parameter for one model term is the same value whether or not the other model term is included in the model.
Variance Inflation Factor (VIF) | The VIF for a factor measures the degree of multicollinearity with the other factors. Multicollinearity occurs when the factors are correlated among one another. | If a factor is not linearly related to the other factors, the VIF is 1; otherwise, it is greater than 1. VIFs greater than 10 indicate serious problems with multicollinearity. | If there is multicollinearity, the values of the estimated model parameters change depending on which terms are included in the model.
Design Resolution | Metric of a screening design that indicates the degree of confounding in the design. In general, a design is resolution R if no p-factor effect is aliased with another effect containing less than R – p factors. | The higher the better, but also more expensive. | The higher the resolution, the less restrictive the assumptions on which higher-order interactions are negligible.
Prediction Variance | Variance of the predicted response. | Balance over regions of interest. | Important when the goal is prediction, but may be less of a priority for screening.
Fraction of Design Space (FDS) Plot | Summarizes the prediction variance across the design region. | Ideally, the curve is relatively flat with small values so that the prediction variance does not change drastically across the design space. | Useful to identify minimum, median, and maximum prediction variance across the design region.
Design Efficiency | Evaluation measure related to computer-generated optimal designs (may relate to parameter estimation or response prediction). | Maximize. | Several criteria available (D-optimal, I-optimal, etc.). Higher efficiency is better, but does not give a full picture of design properties.
Strength (applies to combinatorial and orthogonal array designs) | Indicates the level of interactions that are fully covered by the design. | Sets the design size and is used to scope or reduce risk for finding errors. |
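To make two of the Table 3 criteria concrete, the short Python sketch below (illustrative only, not from the guide) evaluates a candidate two-level factorial against the correlation between model parameter estimates and the variance inflation factors; the factor names and model terms are assumptions:

```python
import itertools
import numpy as np

# Candidate design: a 2^3 full factorial in coded units (-1, +1). Illustrative only.
design = np.array(list(itertools.product([-1, 1], repeat=3)), dtype=float)

# Model terms to support: main effects plus two-factor interactions.
A, B, C = design.T
X = np.column_stack([A, B, C, A * B, A * C, B * C])
terms = ["A", "B", "C", "AB", "AC", "BC"]

# Pairwise correlations between model columns (want these near zero).
corr = np.corrcoef(X, rowvar=False)

# Variance inflation factors: diagonal of the inverse correlation matrix.
# VIF = 1 for each term when the columns are orthogonal, as in this full factorial.
vif = np.diag(np.linalg.inv(corr))

for name, v in zip(terms, vif):
    print(f"{name}: VIF = {v:.2f}")
print("max |off-diagonal correlation|:",
      np.max(np.abs(corr - np.eye(len(terms)))))
```

The same calculation applied to a constrained or optimal design would reveal how far its parameter estimates depart from orthogonality.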

The first decision that one must make is choosing between a classical design and an optimal design. We recommend the “classical first” approach to experimental design because classical designs possess desirable properties, including optimality with respect to several criteria such as D-optimality (Myers et al., 2016, pp. 467). Optimal designs are most appropriate for tests in which the sample size is unusual, the design region is highly constrained, or there are several categorical factors at more than two levels.

To pick a final design, the candidate designs can be compared on these metrics. Table 4 shows an example of how design options can be evaluated. The number of points, the supported model, the confidence and power levels, and other critical design metrics can be varied to balance cost and risk for the test program. In addition, Myers et al. (2016, pp. 370) discuss 11 desirable properties of experimental designs, which are repeated below. These properties should be balanced to achieve the priorities of your experiment.

1. Result in a good fit of the model to the data.
2. Provide good model parameter estimates.
3. Provide a good distribution of prediction variance of the response throughout the region of interest.
4. Provide an estimate of “pure” experimental error.
5. Give sufficient information to allow a lack-of-fit test.
6. Provide a check on the homogeneous variance assumption.
7. Be insensitive (robust) to the presence of outliers in the data.
8. Be robust to errors in the control of design levels.
9. Allow models of increasing order to be constructed sequentially.
10. Allow for experiments to be done in blocks.
11. Be cost effective.

There are many considerations to keep in mind when selecting a design. However, it is worth the time and effort to choose the right design for your experiment. As Montgomery (2013) says, “A well-designed experiment is important because the results and conclusions that can be drawn from the experiment depend to a large extent on the manner in which the data were collected.” Choosing the right experiment allows you to leverage data to answer your objectives and requirements.

Takeaway: A design must be tailored to meet the test objectives and unique test circumstances.

Table 4: Alternative Test Design Choices for a Notional Program.
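As a hedged illustration of the kind of confidence/power trade reflected in a table like Table 4, the Python sketch below uses the statsmodels power routines for a simple two-level comparison. The signal-to-noise ratio, confidence level, and run counts are placeholder planning values, not values from any actual program:

```python
from statsmodels.stats.power import TTestIndPower

# Assumed planning values (placeholders): detect a shift of delta = 1.0 sigma
# in the response between two factor levels, at 80% confidence (alpha = 0.20).
snr = 1.0        # delta / sigma, the signal-to-noise ratio from the glossary
alpha = 0.20     # 1 - confidence
analysis = TTestIndPower()

for n_per_level in (4, 8, 16):
    power = analysis.power(effect_size=snr, nobs1=n_per_level,
                           alpha=alpha, ratio=1.0)
    print(f"n = {n_per_level} runs per level -> power = {power:.2f}")
```

Varying the run count, signal-to-noise ratio, and confidence level in this way is one simple means of documenting the cost-versus-risk trade among candidate designs.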


3.8. Test Execution Planning
Some issues to consider when planning the execution of a design are the three principles of experimental design: randomization, replication, and blocking.

Most runs in a test matrix are executed in random run order to protect against known (or unknown) uncontrollable factors as well as nuisance factors (noise). If a design has factors of interest that are hard to change, or the situation otherwise prevents complete randomization of the run order, a split-plot design can be used. A split-plot design is an experimental design with a restriction on randomization due to hard-to-change factors: the hard-to-change (whole-plot) factors are reset less frequently than the easy-to-change factors.

Replicating the design, or some portion of the runs within the design, not only provides an estimate of pure error (the component of the error sum of squares attributed to replicates), but also mitigates the potential increase in variance induced by outliers that may occur when the design is executed. Blocking is a planning and analysis technique that sets aside (blocks out) undesirable and known sources of nuisance variability so they do not hinder the estimation of other effects. Blocking is done when there is an expectation that the nuisance variation may introduce noise into the experimental data; the excessive presence of such noise may decrease the power to detect system performance responses to legitimate factors of interest. Figure 13 depicts a notional example in which test points are grouped together, or blocked, based upon a single effect; such an effect may be the test schedule, the type of equipment used, test locations, or operators. See Appendix B for an example of a blocked experiment analyzed (correctly) with the block effect and (incorrectly) without the block effect. If the effect of the blocking variable is of interest, the design should not be blocked with respect to this factor. Because of the restricted randomization of blocks, there is not a valid statistical test to analyze the effect of a blocking variable.1

Figure 13. Illustrative Representation of the Randomized Blocking Procedure.
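A minimal sketch of how a blocked run order might be generated, randomizing within each block as Figure 13 illustrates; the factors, levels, and block structure are illustrative assumptions, not from the guide:

```python
import itertools
import random

# Blocking the runs of a two-factor test by day (a nuisance variable).
# Factors, levels, and the number of days are illustrative placeholders.
levels_a = ["low", "high"]
levels_b = ["op1", "op2", "op3"]
blocks = ["day1", "day2"]

random.seed(2)
run_order = []
for block in blocks:
    # Each day (block) contains one full set of factor-level combinations;
    # the order is randomized *within* the block, not across blocks.
    runs = [{"day": block, "A": a, "B": b}
            for a, b in itertools.product(levels_a, levels_b)]
    random.shuffle(runs)
    run_order.extend(runs)

for i, run in enumerate(run_order, start=1):
    print(i, run)
```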

We have limited the discussion here to block effects treated as a fixed factor. There are many cases where the block effect could (or should) be treated as a random effect (for example, day of the week or a batch of raw material). Conclusions from an experiment with random blocks are valid across the population of blocks (e.g., across all days or all batches of material). Refer to Montgomery (2013, p. 151) for details on the analysis of an experiment with random blocks.

After the data has been collected, it must be analyzed. Any deviations from the test protocol must be recorded at the time of execution and accounted for in the analysis.

Critical questions:

1. Will the design be executed in random run order?
   a. If not, why?
   b. If randomization is restricted, must the design be analyzed as a split-plot design?
2. Are we interested in the blocking variable? If so, the design should not be blocked according to this factor; consider a split-plot design instead.
3. How will deviations from the test design plan be recorded and reported?

Takeaway: Try to execute the planned design, record any deviations, and use the appropriate analysis.

1 Note that many statistical software packages, including JMP, report an F-statistic and p-value for blocking factors. As mentioned above, however, the restrictions on randomization violate the assumptions of an F-test, making the p-value meaningless.


3.9. Analysis
It is critical to have a good plan for the analysis of the data. Are we characterizing, screening, validating, comparing, optimizing, and/or predicting performance? The requirement will guide us to the type of analysis we must perform. Is the requirement that the new system be better than the old system? Is the requirement set to a specific benchmark? If you have followed the guidelines for the planning and design of test documented in this manual, the analysis of your test data will be straightforward and easily traced back to the defined objectives.

Software will play a key role in the analysis of the test data. There are various software packages for classical design analysis (JMP, Design Expert, SPSS, and Minitab to name a few). Since capabilities may vary depending upon the software package, you may end up using several simultaneously. It is best to experiment with a variety of software packages in order to find those that best meet your needs.

In any analysis, it is important to know how to read and interpret the software’s output. For example, in a classical design, we first examine the F-statistic for the model to determine overall model significance. Next, we examine each individual effect’s t-statistic or F-statistic (and p-value) to determine whether that term contributes significantly to the response in the fitted model. The mean squared error and the degrees of freedom are also needed to interpret these statistics.
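For readers using open-source tools instead of the packages named above, the following hedged Python sketch shows where these quantities appear in a fitted model object; the data are simulated placeholders and the model is illustrative only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative only: simulated results from a small two-factor test in coded units.
rng = np.random.default_rng(3)
df = pd.DataFrame([(a, b) for a in (-1, 1) for b in (-1, 1) for _ in range(3)],
                  columns=["A", "B"])
df["y"] = 10 + 2.5 * df.A + 0.3 * df.B + rng.normal(scale=1.0, size=len(df))

model = smf.ols("y ~ A + B + A:B", data=df).fit()

print(model.fvalue, model.f_pvalue)      # overall model F-statistic and p-value
print(model.summary().tables[1])         # per-term estimates, t-statistics, p-values
print(model.mse_resid, model.df_resid)   # mean squared error and residual dof
```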

It is imperative to analyze the data as it was actually collected during the test, not the way it was planned to be collected. It is also imperative that if the test deviates from the planned protocol, then the analysis of the data must be adjusted to reflect the actual (not planned) test execution. For example, if a completely randomized design was planned but the runs were not randomized, the analysis must be that of a split plot or blocked design.

Figure 14 shows how an empirical response surface developed during DT can be used to predict responses in OT. The relative scarcity of controllable factors in OT might induce a response in the interior of the region (depicted by yellow dots). Even if these OT points were never actually tested, the response surface permits predictions that can later be validated with the OT points. Furthermore, failing combinations, the rate at which performance degrades across the test space, optimum operating conditions, and other analyses can be derived from the surface.

Figure 14. An Example of Test Data Analysis and Interpretation.
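A minimal sketch of the idea behind Figure 14: fit a second-order response surface to notional DT points, then predict at untested interior (OT-like) conditions. All values below are simulated placeholders, not program data:

```python
import numpy as np

# Hypothetical DT data: a response measured over a coded two-factor region.
rng = np.random.default_rng(4)
x1, x2 = np.meshgrid([-1, 0, 1], [-1, 0, 1])
x1, x2 = x1.ravel(), x2.ravel()
y = 50 - 4 * x1 + 3 * x2 - 5 * x1**2 + rng.normal(scale=0.5, size=x1.size)

# Fit a second-order (response surface) model by least squares.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict at interior OT-like conditions that were never tested in DT.
ot = np.array([[0.3, -0.6], [0.7, 0.2]])
Xot = np.column_stack([np.ones(len(ot)), ot[:, 0], ot[:, 1],
                       ot[:, 0] * ot[:, 1], ot[:, 0]**2, ot[:, 1]**2])
print(Xot @ beta)   # predicted responses for later validation against OT results
```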



Will the analysis address the requirement? Early in the planning the team defined the objectives, responses, and factors. If these were clearly and thoroughly understood, then the resulting designs should provide the right information.

Critical questions:

1. Will the analysis address the objective?
   a. Is there sufficient power to screen factors?
   b. Are the points sufficient to minimize prediction error?
   c. Are sufficient points included for assessing and fitting curvature?
2. Will the data inform the details in the requirement?
   a. Are a sufficient number of factors being controlled?
   b. Are the factors clear, concise, and controllable?
3. Were the test points selected as part of a formal design or larger test strategy?
4. Do the test points exhibit orthogonality or space-filling properties that will support the objective?
5. Will the data answer the questions posed by the requirement?
   a. Must it pass across all operational conditions?
   b. Must it pass in a defined region?
   c. Probability of success or percentage of time?
   d. Aggregate performance?
   e. Overall average?

Takeaway: Check that you are answering the right question with the analysis.



4. Conclusion
Thorough planning is fundamental to ensuring that sufficient rigor and traceability from requirements to analysis are incorporated into the test design. STAT is a collection of deliberate, methodical processes and techniques that embodies this design philosophy. STAT introduces organization and rigor into the testing process through a structured, iterative approach to test design, beginning with requirements and mission profiles and culminating in a realistic and executable design whose execution enables informed decisions on system acquisition and future planning to meet tomorrow’s threats.


Appendix A. Example Application of Sequential Testing

A.0 Introduction
Past testing has shown variance in the spurt characteristics resulting from ballistic impacts of fuel tanks. A thorough understanding of the variables influencing fuel spurt, when coupled with known threat flash/function characteristics, would greatly improve fire prediction and vulnerability assessments, and could potentially lead to the development of new vulnerability reduction technologies. This test program’s overall goal was to quantify spurt and internal fluid cavitation characteristics as a function of threat type, threat mass, impact velocity, fluid head pressure, shotline angle, material thickness, etc., for an impact on a replica fuel tank.

This test program was conducted in two phases using statistically designed test matrices. Phase I was used to screen for all potential factors that influence the response variables. This phase was conducted using water as a surrogate for fuel for reasons of convenience, cost, and schedule. The use of water also allowed for better test measurements, since fuel may ignite during the ballistic test event. The internal water cavity and external water spurt characteristics were observed and measured (e.g., spurt timing, cavitation size and timing, projectile characteristics, internal pressure). High-speed video from several angles was critical in gathering the desired information. At the completion of Phase I, the test data was analyzed to determine the significant factors, and a Phase II test matrix was developed. Initial Phase II testing was performed to determine if there is a physical correlation (i.e., Reynolds or Weber number) in spurt and cavity characteristics between the water and fuel results. After determining that the spurt and cavitation response in fuel was similar to water, but with an offset due to the speed of sound of the fluid, the rest of Phase II was completed with water and a cubical fragment to support the development of a response-surface model.

A.1 Test Results

A.1.1 Phase I Test Results
Phase I investigated the results of ball bearings of various sizes and velocities impacting a large tank of water to examine the resulting cavitation and spurt. These tests included multiple impact angles and tank configurations to see if these variables affected the cavity and/or spurt. Table A-1 shows the spurt and cavity timings for the Phase I tests, sorted by the primary spurt time. All of these tests were conducted with water as the fluid type.

Table A-1. Phase I Spurt and Cavitation Characteristics (Sorted by Primary Spurt Time)

Run Number | Test Number | Threat (grains) | Velocity (fps) | Cavitation Collapse (µsec) | Pre-Spurt (µsec) | Primary Spurt (µsec) | Secondary Spurt (µsec) | Tertiary Spurt (µsec)
2 | 29 | 42.2 | 4,002 | 10,174 | - | 12,201 | - | -
23 | 30 | 42.2 | 4,151 | 10,261 | - | 12,477 | - | -
9 | 45 | 42.2 | 5,863 | 15,350 | - | 13,839 | 31,919 | 58,157
30 | 46 | 42.2 | 6,048 | 15,680 | - | 14,078 | 35,037 | 58,874
12 | 41 | 54.8 | 6,260 | 18,718 | - | 18,580 | 36,260 | 65,220
14 | 13 | 69.7 | 3,833 | 20,400 | - | 18,640 | 35,599 | 58,719
33 | 42 | 54.8 | 6,332 | 19,759 | - | 19,120 | 37,120 | 64,160
13 | 27 | 69.7 | 4,107 | 25,565 | 1,039 | 19,199 | 26,399 | 37,439
25 | 8 | 42.2 | 3,932 | 15,600 | - | 19,440 | 66,881 | -
4 | 7 | 42.2 | 3,981 | 15,760 | - | 19,761 | 67,040 | 83,440
35 | 14 | 69.7 | 4,276 | 21,760 | - | 20,159 | 36,159 | 60,639
34 | 28 | 69.7 | 4,416 | 26,000 | 960 | 20,560 | 26,640 | 37,440
24 | 10 | 42.2 | 4,060 | 20,720 | - | 20,640 | 34,160 | 46,960
16 | 49 | 69.7 | 4,153 | NA | - | 20,681 | - | -
22 | 36 | 42.2 | 4,185 | 20,199 | - | 21,099 | 39,361 | 69,200
37 | 50 | 69.7 | 4,072 | NA | - | 21,198 | - | -
3 | 9 | 42.2 | 3,932 | 22,320 | - | 21,360 | 34,800 | 60,320
31 | 4 | 54.8 | 3,728 | 23,760 | 1,040 | 22,000 | 41,360 | 56,001
1 | 35 | 42.2 | 4,177 | 19,065 | - | 22,320 | 38,320 | 69,601
6 | 2 | 42.2 | 6,324 | 22,765 | 2,480 | 23,040 | 41,441 | 67,921
27 | 1 | 42.2 | 6,255 | 23,262 | 2,560 | 23,201 | 40,881 | 65,760
10 | 3 | 54.8 | 3,808 | 25,920 | 800 | 23,279 | 58,239 | 84,320
17 | 15 | 69.7 | 5,286 | 24,786 | - | 25,521 | 61,985 | 81,485
38 | 24 | 69.7 | 5,044 | 25,000 | - | 26,335 | 64,658 | 82,681
7 | 5 | 42.2 | 5,977 | 26,720 | - | 26,721 | 37,760 | 61,120
28 | 6 | 42.2 | 6,195 | 27,920 | - | 26,875 | 37,875 | 62,000
11 | 39 | 54.8 | 5,001 | 26,174 | 5,120 | 27,201 | 40,160 | 64,880
32 | 40 | 54.8 | 5,083 | 24,870 | 5,680 | 27,601 | 41,361 | 65,200
18 | 31 | 69.7 | 6,015 | 30,000 | - | 30,445 | 47,343 | 62,844
39 | 32 | 69.7 | 6,260 | 33,565 | - | 32,560 | 46,560 | 62,962

This data was input into the JMP analysis software and analyzed to determine which factors contributed to the spurt and cavity timing, and to what degree.

A.1.2 Phase II Initial Test Results
The Phase II initial tests investigated the effect of fuel versus water on the spurt timing and cavitation. These tests were also conducted over multiple variables to gain a better understanding of how the medium being shot through affected the results. In Table A-2, the results from the Phase II initial tests are sorted by the primary spurt time. This table omits the threat column, as all tests were conducted with a 69.7-grain sphere, and instead lists the fluid type, since comparing the tests with fuel to the tests with water was one of the primary goals of this phase.


Table A-2. Phase II Initial Spurt and Cavitation Characteristics (Sorted by Primary Spurt Time)

Run Number | Test Number | Fluid Type | Velocity (fps) | Cavitation Collapse (µsec) | Pre-Spurt (µsec) | Primary Spurt (µsec) | Secondary Spurt (µsec) | Tertiary Spurt (µsec)
15 | 16 | Fuel | 5,832 | 24,240 | - | 16,093 | 28,728 | 70,424
18 | 17 | Fuel | 4,010 | 19,796 | - | 16,093 | 27,198 | 67,697
12 | 10 | Fuel | 5,793 | 24,937 | - | 16,359 | 32,651 | 69,027
17 | 18 | Fuel | 5,805 | 26,866 | 7,448 | 16,492 | 27,997 | 69,293
4 | 4 | Water | 6,174 | 25,669 | - | 16,692 | 33,250 | 73,682
22 | 22 | Water | 5,995 | 28,462 | - | 17,090 | 29,659 | 74,014
13 | 14 | Fuel | 4,069 | 20,604 | - | 17,223 | 34,447 | 67,830
14 | 15 | Fuel | 3,975 | 20,705 | - | 17,357 | 34,846 | 68,362
6 | 6 | Water | 3,872 | 21,547 | - | 17,423 | 38,038 | 73,815
10 | 11 | Fuel | 5,751 | 26,401 | - | 17,423 | 29,327 | 69,160
19 | 20 | Water | 5,893 | 28,280 | 8,246 | 17,489 | 29,459 | 73,948
11 | 12 | Fuel | 5,960 | 27,398 | - | 18,021 | 29,393 | 69,359
23 | 23 | Water | 6,051 | 27,132 | - | 18,287 | 31,654 | 74,945
16 | 13 | Fuel | 3,941 | 20,814 | 7,847 | 18,487 | 34,048 | 67,365
7 | 7 | Water | 3,827 | 22,610 | - | 18,620 | 36,642 | 73,084
20 | 19 | Water | 3,980 | 22,119 | - | 18,819 | 37,506 | 73,682
9 | 9 | Fuel | 3,698 | 20,503 | 7,647 | 19,152 | 35,112 | 67,564
1 | 1 | Water | 3,942 | 23,142 | 7,781 | 20,549 | 38,570 | 73,150
8 | 8 | Water | 3,931 | 23,275 | 8,712 | 20,748 | 38,371 | 72,287
3 | 3 | Water | 5,915 | 29,260 | - | 30,656 | 50,407 | 74,879
21 | 21 | Water | 5,816 | 28,595 | - | 30,656 | 50,872 | 74,347
2 | 2 | Water | 5,926 | 29,326 | - | 31,455 | 49,144 | 74,081

The results from this test were input into the JMP analysis software to analyze which factors contributed to the spurt and cavitation characteristics and to determine the effects of the fluid type on the results. The purpose was to establish whether water could be used in the follow-up tests with a cubical fragment. Had the results shown that fuel was not comparable to water, then using fuel in the cubical fragment tests would have caused longer test turn-arounds, higher risk due to fire initiation, and cloudy internal high-speed camera views. Details of the fuel versus water comparison, as well as the spurt and cavitation characteristics, can be found in Section A.2.2.

A.1.3 Phase II Follow-Up Test Results

After the Phase II initial tests were conducted, a set of follow-up tests with a cubical fragment was completed. Based on the initial test results, these tests were all conducted with water. This allowed for quicker test turn-arounds, a reduction in risk due to fire initiation, and clearer internal high-speed camera views. Table A-3 shows the results from the Phase II follow-up tests, which were conducted with water and a 75-grain cubical fragment, sorted by the primary spurt time.


Table A-3. Phase II Follow-Up Spurt and Cavitation Characteristics (Sorted by Primary Spurt Time)

Run Number | Test # | Threat (grains) | Velocity (fps) | Cavitation Collapse (µsec) | Pre-Spurt 1 (µsec) | Pre-Spurt 2 (µsec) | Primary Spurt (µsec) | Secondary Spurt (µsec) | Tertiary Spurt (µsec)
25 | 10 | 75 | 3,363 | 17,527 | - | - | 17,741 | 38,190 | 65,051
26 | 24 | 75 | 3,962 | 19,950 | 8,550 | - | 19,237 | 37,834 | 74,385
11 | 14 | 75 | 3,816 | 19,879 | - | 17,955 | 19,451 | 37,905 | 73,886
22 | 2 | 75 | 3,909 | 19,783 | - | - | 19,493 | 38,784 | 73,225
7 | 7 | 75 | 4,317 | 19,950 | - | - | 19,594 | 38,190 | 72,533
29 | 10 | 75 | 3,912 | 19,166 | - | - | 19,665 | 37,905 | 68,543
6 | 12 | 75 | 3,352 | 19,451 | - | - | 19,737 | 38,190 | 64,553
21 | 22 | 75 | 3,538 | 20,046 | 8,484 | - | 19,897 | 37,976 | 65,448
20 | 15 | 75 | 3,629 | 20,221 | 8,383 | - | 20,099 | 38,279 | 73,730
12 | 18 | 75 | 4,128 | 21,090 | - | - | 20,164 | 37,834 | 74,385
19 | 20 | 75 | 3,506 | 20,746 | 9,166 | - | 20,402 | 38,178 | 65,549
3 | 5 | 75 | 3,853 | 21,400 | - | - | 21,090 | 38,689 | 72,248
30 | 20 | 75 | 3,765 | 21,161 | 8,835 | - | 21,233 | 39,045 | 74,029
14 | 12 | 75 | 3,868 | 21,802 | - | - | 21,233 | 36,551 | 73,744
27 | 23 | 75 | 4,768 | 23,370 | - | - | 21,304 | 38,475 | 75,454
24 | 6 | 75 | 4,853 | 22,586 | - | - | 21,447 | 40,613 | 71,179
1 (Pretest 3) | 1 | 75 | 3,996 | 22,740 | - | - | 22,000 | 38,400 | 69,400
31 | 22 | 75 | 4,247 | 22,871 | 8,408 | - | 23,014 | 39,544 | 72,248
17 | 19 | 75 | 4,276 | 22,672 | 9,696 | 18,483 | 23,533 | 36,461 | 73,629
13 | 16 | 75 | 4,858 | 23,085 | - | 18,168 | 24,012 | 35,768 | 75,098
15 | 21 | 75 | 4,667 | 24,367 | - | - | 25,008 | 39,401 | 73,601
5 | 9 | 75 | 4,901 | 26,077 | - | 17,741 | 25,009 | 34,841 | 72,604
8 | 8 | 75 | 5,057 | 23,940 | - | - | 25,293 | 35,340 | 74,385
2 (Pretest 4) | 4 | 75 | 4,856 | 25,203 | - | - | 25,488 | 37,288 | 64,488
18 | 17 | 75 | 4,856 | 23,897 | 8,787 | 19,291 | 25,654 | 36,158 | 72,619
23 | 3 | 75 | 4,842 | 22,428 | - | 19,392 | 26,361 | 37,673 | 75,548
28 | 6 | 75 | 4,858 | 22,871 | - | 18,525 | 26,933 | 37,763 | 75,596
10 | 13 | 75 | 4,891 | 25,935 | - | - | 27,503 | 37,763 | 75,525
16 | 11 | 75 | 4,833 | 25,211 | 8,383 | 18,988 | 29,593 | 38,077 | 73,730

The results from this phase of testing were input into the JMP analysis software to determine which factors were significant in the spurt and cavitation times. These results were also compared to the data from the previous phases of this series to see how the cubical fragments affect the spurt and cavitation timing in comparison to the spheres used in the previous two sets of tests. Results from this analysis can be found in Section A.2.3.

A.2 DOE Statistical Analysis of Results

A.2.1 Phase I
The objective of Phase I testing was to determine which factors had the largest effect on primary spurt and collapse times. When the time data for primary spurt and collapse were collected, it was determined that 10 test points with the thinnest material (0.0625-inch) and two points with the next thinnest material (0.09-inch) had experienced cracking, which did not allow the primary spurt timing to be determined. Most of these cases occurred at the highest velocity (6,000 fps), but some cracking also occurred at the lower speed (4,000 fps) as well as the highest elevation (45 degrees). However, because this was a three-level design, all effects were still able to be adequately analyzed. When performing initial testing, it is advisable to include more than two levels and to incorporate replicates to help overcome potential issues with incomplete data.

The data were analyzed in JMP 12 using stepwise regression with linear and quadratic terms for all main effects. The results for primary spurt and collapse from the prediction profiler are given in Figure A-1 and Figure A-2.

Figure A-1. Prediction Profiler for Primary Spurt

Figure A-2. Prediction Profiler for Cavity Collapse
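For readers without JMP, a hedged sketch of a comparable analysis in Python is shown below. It does not reproduce the JMP stepwise fit itself, and the file name and column names are hypothetical placeholders:

```python
import pandas as pd
import statsmodels.formula.api as smf

# A sketch of the Phase I style analysis (not the JMP stepwise fit itself):
# fit linear and quadratic main effects and keep terms with small p-values.
# 'phase1.csv' and the column names are hypothetical placeholders.
df = pd.read_csv("phase1.csv")   # columns: velocity, threat, thickness, primary_spurt

full = smf.ols(
    "primary_spurt ~ velocity + threat + thickness"
    " + I(velocity**2) + I(threat**2) + I(thickness**2)",
    data=df,
).fit()
print(full.summary())

# Manual backward elimination: drop the least significant term and refit,
# repeating until every remaining term is significant at the chosen level.
reduced = smf.ols("primary_spurt ~ velocity + threat + thickness", data=df).fit()
print(reduced.summary())
```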

The first and most obvious conclusion is that primary spurt and collapse are highly correlated. The significant factors and functional relationship were identical for both responses. This seems reasonable because the driver for primary spurt should be the collapse of the cavity.


The factors of actual velocity and threat showed a linear relation to primary spurt and collapse. This makes intuitive sense because these factors determine the momentum of the fragment: more momentum should penetrate deeper and make a larger cavity, therefore delaying the primary spurt. Panel thickness also showed a linear response, but in the opposite direction. Again, this makes intuitive sense because as the panel gets thicker, the velocity after penetration will be lower. The magnitude of the effect for panel thickness was lower, but it should be noted that these tests were conducted on thin material (0.0625- to 0.19-inch). This range was selected because it represents the range of material thicknesses currently used in wing tank walls. So, for current aircraft, panel thickness is not a large contributor, but if a thicker material were used, this effect would be more significant.

These three factors together contribute to the momentum that is imparted into the fluid. Considered together, this raises the possibility that the best factor might be the momentum at impact with the fluid, but this is not possible to measure directly. It would be a function of the velocity and threat size at impact with the fluid, which are both influenced by the material thickness and the penetration characteristics of the projectile. In this case, a reasonable prediction can be obtained using the initial velocity, initial threat size, and panel thickness, but this assumption should be reconsidered for each new set of test conditions.

Since these factors were well understood and all were related, the decision was made to include only velocity as a factor in the next phases of testing. Two factors that were found to be not significant were ullage pressure and fluid head. The values of these factors that were tested represented the range of conditions in current aircraft. If future systems use values that are much higher, additional testing should be considered. Due to the lack of an observable effect at representative levels, these factors were eliminated from future phases of testing.

The only factors that showed a quadratic relationship were elevation, shotline from wall, tank depth, and shot panel size. This was not expected and required additional investigation. The first detail to be examined was the fact that the elevation, shotline from wall, and shot panel size profiles were all tilted. This was an indication that the quadratic shape might be due to fitting one level that had a different relationship to the responses than the others. Upon further review of the video, an explanation was discovered. For an angled shotline, or shotlines near the wall, the cavity would impact the sides of the tank. This would cause the cavity to begin to collapse and would reduce the collapse and primary spurt timing. Cavity interaction with the walls is an important part of understanding the hydrodynamics, but it is beyond the scope of the current effort, so follow-on testing used factor settings that would avoid this confounding issue. A full explanation for the impact of tank depth could not be determined, so this factor was included in the next phases. Elevation was also included in additional testing to investigate any effects of angled impacts into the panels.

A.2.2 Phase II Initial Tests

The objective of this phase of testing was to determine if hydrodynamic results obtained using water as the fluid could be scaled to be representative of JP-8. The factors for this phase were fluid type, velocity, tank depth, and elevation; fluid type was initially analyzed as a categorical factor. The results from the prediction profiler for primary spurt and collapse are given in Figure A-3.


Figure A-3. Prediction Profiler for Primary Spurt and Cavity Collapse

Collapse and primary spurt are again very similar, but the effect of velocity is stronger on primary spurt timing. This was not explored further because it was not the objective of this phase, but it could be investigated in future testing. Elevation showed a statistically significant but small effect. This can be explained by noting that a shotline at an elevation of 45 degrees has a path through the material that is 1.41 times longer than a shotline at 0 degrees. Phase I demonstrated that increasing the material thickness slowed the impact velocity into the fluid and reduced the timing of primary spurt and collapse. Tank depth was not significant in this phase.

The model for both collapse and primary spurt showed excellent prediction capability, with an adjusted R² greater than 0.97 using just fluid type, actual velocity, and elevation. Even a simplified model with just fluid type and actual velocity has an adjusted R² above 0.92 for both responses.

When fluid type was analyzed as a categorical factor, it showed an offset between the fluid types for collapse and primary spurt. A common practice for converting results between different fluids is to use the ratio of the speeds of sound in the fluids, but this raised two issues. The first issue is that the offset in the fitted model did not increase as velocity increased; the interaction term between fluid and velocity was included in the model, but it was not significant. So as the velocity changed, and the timing of collapse and primary spurt increased, the ratio between the fluids changed.

The next issue for converting between the fluids was determining an appropriate ratio. The ratio of the speed of sound between JP-8 and water is a strong function of temperature, as shown in Figure A-4. The data were examined to determine the temperature of the fluid, and it was discovered that the fluid temperature varied by more than 15 degrees Celsius and that most of the fuel tests were conducted on colder days.


Figure A-4. Ratio of Speed of Sound Between Water and JP-8

To account for the differences in temperature, and still be able to scale the results as a function of speed of sound in the fluid, a new analysis method was developed. The categorical factor of fluid type was transformed into a continuous factor of speed of sound, which was determined from the fluid type and fluid temperature. The analysis was rerun and the prediction profiler is shown in Figure A-5.

Figure A-5. Prediction Profiler for Rerun of Primary Spurt and Cavity Collapse with Compensation in Scaling Based on Speed of Sound of Fluid Type
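A hedged Python sketch of this transformation is shown below; the speed-of-sound lookup values are rough, illustrative placeholders (not the curves used in the test program), and the shot records are notional:

```python
import numpy as np
import pandas as pd

# Replace the categorical factor 'fluid' with a continuous speed-of-sound factor.
# The lookup values below are rough placeholders, not the curves used in the study.
temps_c = np.array([0, 10, 20, 30, 40, 50])
sos_water = np.array([1403, 1447, 1482, 1509, 1529, 1543])   # m/s, approximate
sos_jp8   = np.array([1390, 1350, 1315, 1280, 1245, 1210])   # m/s, placeholder

def speed_of_sound(fluid, temp_c):
    """Interpolate an approximate speed of sound (m/s) for the given fluid."""
    table = sos_water if fluid == "Water" else sos_jp8
    return float(np.interp(temp_c, temps_c, table))

shots = pd.DataFrame({"fluid": ["Water", "Fuel"], "temp_c": [21.0, 8.0]})
shots["speed_of_sound"] = [
    speed_of_sound("Water" if f == "Water" else "JP-8", t)
    for f, t in zip(shots.fluid, shots.temp_c)
]
print(shots)   # 'speed_of_sound' can now enter the model as a continuous factor
```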

The results are very similar to the previous analysis, with one important difference: collapse and primary spurt could be predicted as a function of temperature. This could be an important discovery because the temperature of the fuel in aircraft has a very wide range. Temperatures in wing fuel tanks in transport aircraft can be very low, whereas other aircraft, such as the latest generation of fighters, use the fuel as a coolant and the temperatures can be much higher. The model fit and standard deviations using speed of sound were equal to or better than those of the model based on fluid type, indicating that the continuous factor of speed of sound in the fluid is an excellent predictor for collapse and primary spurt. This methodology could be valid for other fuel types, other responses, and a wide range of temperatures, but additional testing is needed for validation.

A.2.3 Phase II Follow-Up Tests
The objective of this phase was to evaluate the effect of fragment orientation on collapse and primary spurt timing. Previous phases used a sphere to avoid confounding the response with orientation. In this phase, cubical fragments were fired at two different orientations to vary the impact pitch. An elevation of 0 degrees would tend to produce a flat-face impact, and an elevation of 45 degrees would tend to produce an edge-on impact. The other factors included in this phase were velocity and tank depth. The pitch and yaw were recorded at impact with the tank wall.

The first step in reducing the data was to get the pitch and yaw measurements into a usable form. Since the fragments were cubical and all shots were fired at a 0-degree azimuth, it was assumed that positive and negative yaw would have an identical influence; therefore, the absolute value of yaw from flat face was used as the factor. Pitch was normalized around the edge-on configuration, meaning that a pitch of 0 degrees represents an edge-on impact and ±45 degrees represents a flat-face impact. The distribution of pitch and yaw is shown in Figure A-6. In this figure, all shots at an elevation of 45 degrees are highlighted.

Figure A-6. Distribution of Pitch and Yaw

The distribution of yaw shows that almost all shots had some yaw, with most cases between 5 degrees and 20 degrees. Despite attempts to fire all shots at 0 degrees, yaw was not highly controllable and resulted in a distribution of values. The distribution does not appear to vary much as a function of elevation. The pitch showed a similar distribution, with the shots at 45 degrees elevation tending to be around 0 degrees (defined as edge on) and the shots at 0 degrees elevation distributed around 45 degrees (defined as flat face). The implication of these results is that the data had a range of yaw and pitch in this phase, but these factors are difficult to control and are not good candidates for input factors.

The terms for yaw and pitch were included in the model, and the results from the prediction profiler are given in Figure A-7. Velocity, elevation, and tank depth were similar to the results from the previous phases. Yaw showed a tendency to delay collapse and primary spurt as the angle increased from 0 degrees (defined as flat face). This makes intuitive sense because the fragment should penetrate more easily as it becomes more edge on, and therefore impart more energy into the fluid. As seen in previous phases, more energy (e.g., from higher velocity, a larger threat, or thinner material) increases the collapse and spurt times. As the pitch varied from edge on (defined as 0 degrees) to more flat faced, the collapse and primary spurt times decreased. This is expected because a flatter-face impact would penetrate with less energy and therefore result in lower collapse and primary spurt times.

Figure A-7. Prediction Profiler for Primary Spurt and Cavity Collapse Including Pitch and Yaw Terms

When developing practical models, it is important to consider the relative magnitudes of the effect of each factor. Velocity has the largest magnitude, followed by elevation. While yaw and pitch are significant, they do not have a large effect on the responses for the range of conditions used in this testing. Furthermore, the actual orientation at impact is a distribution of values around the launch orientation. Therefore, the overall effect of using fragments instead of spheres is that fragment orientation slightly increases the variability of the response, and this effect can be incorporated into a simple model by slightly increasing the prediction intervals. The prediction profiler for a model with just velocity and elevation is shown in Figure A-8. Even though the model does not include specific terms for fragment orientation, it is useful for predicting collapse and primary spurt.

Figure A-8. Prediction Profiler for Primary Spurt and Cavity Collapse Only Showing Velocity and Elevation


Appendix B. Example of Analysis of Randomized Block Design

Consider the following example, as provided in Montgomery (2013). An experiment is performed to determine the effect of three washing solutions on bacteria growth. Only three trials can be performed each day. Because day could be a source of variability, a randomized block design is used. The data are given in Table B.1; consider the results of this experiment when the nuisance factor day is included in, or ignored by, the analysis.

Table B.1 Bacteria growth data

Solution | Day 1 | Day 2 | Day 3 | Day 4
1 | 13 | 22 | 18 | 39
2 | 16 | 24 | 17 | 44
3 | 5 | 4 | 1 | 22

The correct analysis for this experiment includes day as a block effect in order to determine the effect of the washing solution on the response. As seen in the output in Figure B.1, there is a significant difference in growth rate for the three washing solutions. The large F ratio for the block effect indicates that day did introduce variability into the response, so it was important to include this effect in the analysis.

Summary of Fit
Rsquare: 0.972166
Adj Rsquare: 0.948972
Root Mean Square Error: 2.939199
Mean of Response: 18.75
Observations (or Sum Wgts): 12

Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F
Solution | 2 | 703.5000 | 351.750 | 40.7170 | 0.0003*
Block | 3 | 1106.9167 | 368.972 | 42.7106 | 0.0002*
Error | 6 | 51.8333 | 8.639 | |
C. Total | 11 | 1862.2500 | | |

Figure B.1 JMP output for blocked experiment

Now suppose we had ignored the nuisance factor day in the analysis. The results of this analysis are shown in Figure B.2. Because we have ignored day, the block sum of squares has been folded into the error sum of squares. The error sum of squares is now 1158.75 (compared to 51.83 previously, when the block effect was included in the analysis). Because the variability due to the nuisance factor day has not been partitioned from the error sum of squares, the effect of solution is not identified as significant (p-value = 0.1182). By ignoring the block effect, the analysis fails to identify a significant effect of solution on the response.


Summary of Fit
Rsquare: 0.377769
Adj Rsquare: 0.239495
Root Mean Square Error: 11.34681
Mean of Response: 18.75
Observations (or Sum Wgts): 12

Analysis of Variance
Source | DF | Sum of Squares | Mean Square | F Ratio | Prob > F
Solution | 2 | 703.5000 | 351.750 | 2.7320 | 0.1182
Error | 9 | 1158.7500 | 128.750 | |
C. Total | 11 | 1862.2500 | | |

Figure B.2 Analysis Ignoring Block Effect

If a blocked experiment is executed (whether it was planned that way or not), the analysis of the data should reflect it. By ignoring a nuisance factor in the analysis, you may reach incorrect conclusions and not identify important relationships.
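For reference, a hedged Python sketch that reproduces both analyses of the Table B.1 data with statsmodels, as an open-source alternative to the JMP output shown above:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Re-analysis of the Table B.1 bacteria-growth data with day as a block factor.
data = pd.DataFrame({
    "solution": [1]*4 + [2]*4 + [3]*4,
    "day":      [1, 2, 3, 4] * 3,
    "growth":   [13, 22, 18, 39, 16, 24, 17, 44, 5, 4, 1, 22],
})

blocked = smf.ols("growth ~ C(solution) + C(day)", data=data).fit()
print(sm.stats.anova_lm(blocked, typ=2))   # solution and block sums of squares

# Ignoring the block folds the day-to-day variability into the error term.
unblocked = smf.ols("growth ~ C(solution)", data=data).fit()
print(sm.stats.anova_lm(unblocked, typ=2))
```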


Appendix C. Glossary of STAT Terms

Affinity diagram: A tool and process to organize many items into natural groupings, or used when a group needs to come to a consensus.

Aliasing: Degree to which a term in the fitted model is biased by active term(s) not in the fitted model.

Blocking: A planning and analysis technique that sets aside (partitions) both undesirable and known sources of nuisance variability so they do not hinder the estimation of other effects. A variance reduction technique used in design of experiments to account for nuisance variables.

Confidence: The probability of concluding a factor has no effect on the response when it does not have an effect (true negative rate, equal to 1 – type I error rate).

Confounding: See aliasing.

Control factor: An experimental factor that can be controlled and varied during test and that could impact the response as factor levels change.

Constraint: Anything that limits the design or test space (e.g., resources, range procedures, operational limitations, etc.).

D-optimal design: A computer-generated experimental design such that the determinant of the information matrix is maximized; equivalently, a design such that the volume of the confidence region of the parameter estimates is minimized. Popular for screening experiments due to favorable properties for identifying active effects. See Montgomery (2013) or Myers et al. (2016) for details.

Efficiency, design: Evaluation measure related to computer-generated optimal designs. Value depends on the optimality criterion used (e.g., D-optimality, I-optimality, etc.). Allows for comparing designs that have different sample sizes.

Factor: Input (independent) variable for which there are inferred effects on the response.

Fishbone diagram: A brainstorming tool whose purpose is to facilitate extracting causal factors.

Hard-to-change factor: A factor whose levels are difficult, time-consuming, or costly to change after every test run.

Hold-constant factor: A factor that likely influences the response but is set to a constant value throughout all of the tests. May not be of primary interest, may be too difficult to vary, or may have been previously shown to not be significant.

I-optimal design: A computer-generated experimental design such that the average prediction variance is minimized. Popular for designs where the goal is to optimize or predict values due to favorable prediction variance properties. See Montgomery (2013) or Myers et al. (2016) for details.

Inter-relationship diagram: A brainstorming tool that maps items (or steps) in a complex solution (or process). Often used after a fishbone diagram or affinity diagram to further understand or explore links between items.

Level (of a factor): The values set for a given factor throughout the experiment.

Multicollinearity: Exists when the factors or predictor variables are correlated among themselves.

Noise factor: A factor that likely influences the response but may not be possible (or desirable) to control in an experiment in the field.

Nuisance factor: A factor that could influence the response but which is not a factor of interest.

Optimal design: A computer-generated design found by optimizing a specified characteristic or property of the design. See also D-optimal design, I-optimal design.

Orthogonal design: A design in which the model terms are linearly independent of each other.

Random factor: A factor that has a large number of possible levels and whose levels are randomly selected from the population of factor levels. Analysis on a random factor provides inference on the entire population of factor levels (not just those observed in the experiment).

Power: The probability of concluding a factor has an effect on the response when in fact it does (true positive rate, equal to 1 – type II error rate).

Pure error: The component of the error sum of squares attributed to replicates.

Repeated observation: Multiple measurements made at once for a given factor level combination.

Replicate: An independent test run of each factor level combination.

Resolution: Metric of a screening design that indicates the degree of confounding in the design. In general, a design is resolution R if no p-factor effect is aliased with another effect containing less than R – p factors.

Resolution III design: A design such that main effects are not aliased with other main effects, but are aliased with two-factor interactions, and some two-factor interactions may be aliased with each other.

Resolution IV design: A design such that main effects are not aliased with other main effects or with any two-factor interaction, but two-factor interactions are aliased with each other.

Resolution V design: A design such that no main effect or two-factor interaction is aliased with any other main effect or two-factor interaction, but two-factor interactions are aliased with three-factor interactions.

Response: The measured, dependent output of a test event.

Screening: Identifying which factors are active (important/significant) in the model of the response and eliminating those that are unimportant.

Signal-to-noise ratio: The ratio of δ (the desired difference in the response to detect) over σ (the magnitude of the process noise variability).

Split-plot design: Experimental design with a restriction on randomization due to hard-to-change factors.

Strength: Indicates the level of interactions that are fully covered by the design.

Test and Evaluation Master Plan: The master document for the planning, management, and execution of the T&E program for a particular system. Describes the overall test program structure, strategy, schedule, resources, objectives, and evaluation frameworks.

Type I error: The probability of concluding a factor has an effect on the response when it does not have an effect (false positive rate, equal to 1 – confidence).

Type II error: The probability of concluding a factor does not have an effect on the response when it does have an effect (false negative rate, equal to 1 – power).


Appendix D. Learning Resources

Order of Learning: We recommend you take a formal design of experiments course first so you gain knowledge in a forum in which you can ask questions and learn the fundamentals. After that, access texts and online resources to broaden your knowledge, learn additional rigorous techniques, and seek ways to solve your specific problems. Remember that the key is to first understand the problem and then apply the methods, techniques, and designs that will address the stated objectives. The following sections detail resources for learning or researching STAT related topics. These are merely a few references and this list barely scratches the surface. When in doubt, ask your STAT expert.

1. Formal Courses
   a. US Air Force (sign up using your service-specific course portal, e.g., AcqNow or ATRRS)
      i. Design of Experiments
         1. Science of Test (SOT) 210: 2 days (for management)
         2. SOT 310: 5 days (for practitioners)
         3. SOT 410: 5 days (for practitioners)
      ii. Reliability
         1. Reliability (REL) 210: 5 days (for management)
         2. REL 310: 5 days (for practitioners)
         3. REL 410: 5 days (for practitioners)
   b. Defense Acquisition University
      i. Statistics: CLE-035 (online)
      ii. Reliability: CLE-301 (Reliability and Maintainability)
2. General websites
   a. STAT COE Website: www.AFIT.edu/STAT
   b. NIST Engineering Statistics Handbook: http://www.itl.nist.gov/div898/handbook/
   c. Test Science Knowledge Center: http://www.testscience.org
3. Statistics
   a. Stat Trek: http://stattrek.com/
   b. Online statistics book: http://onlinestatbook.com/2/index.html
   c. Statistics textbook: http://www.statsoft.com/Textbook
   d. Which statistical test to use: https://fhssrsc.byu.edu/SitePages/ANOVA,%20t-tests,%20Regression,%20and%20Chi%20Square.aspx
   e. Common mistakes: http://www.ma.utexas.edu/users/mks/statmistakes/TOC.html
4. Confidence Intervals
   a. 10 things to know: http://www.measuringusability.com/blog/ci-10things.php
   b. Video: http://www.youtube.com/playlist?list=PLvxOuBpazmsMdPBRxBTvwLv5Lhuk0tuXh
5. Sampling Plans/OC Curves
   a. NIST Engineering Statistics Handbook, chapter 6.2.3.2: http://www.itl.nist.gov/div898/handbook/
   b. http://src.alionscience.com/pdf/OC_Curves.pdf
   c. http://www.samplingplans.com/modern3.htm


6. Bayesian techniques: http://bayesian.org/video/gallery
7. Design of Experiments
   a. Textbook: Montgomery, D.C. (2013), Design and Analysis of Experiments (8th ed.), Hoboken, New Jersey: John Wiley & Sons.
   b. DOE Overview (ASQ): http://asq.org/learn-about-quality/data-collection-analysis-tools/overview/design-of-experiments-tutorial.html
   c. DOE Overview: https://www.moresteam.com/toolbox/design-of-experiments.cfm
   d. DOE Primer: http://www.isixsigma.com/tools-templates/design-of-experiments-doe/design-experiments-%e2%90%93-primer/
   e. DOE Basics: https://www.youtube.com/watch?v=tZWAYbKYVjM
   f. Stu Hunter DOE videos: https://www.youtube.com/playlist?list=PLgjnRjSuvqgdT6D1ujfIKonxHMXkfvDLF
8. Covering arrays/Combinatorics: http://csrc.nist.gov/groups/SNS/acts/index.html
9. Reliability/Reliability Growth
   a. Basics: http://reliawiki.org/index.php/Reliability_Growth_Planning
   b. Tools
      i. http://www.amsaa.army.mil/CRG_Tools.html
      ii. Reliability Analytics Toolkit
   c. Sequential Probability Ratio Test
      i. http://reliabilityanalyticstoolkit.appspot.com/sequential_reliability_testing_binomial_distribution
      ii. Slides: http://homepage.stat.uiowa.edu/~kcowles/s138_2006/SPRTSlides.pdf


References

Anderson-Cook, C. (2007), “When Should You Consider a Split-Plot Design?” Quality Progress, October 2007, 57-59.

Box, G.E.P. (1999), “Statistics as a Catalyst to Learning by Scientific Method Part II – A Discussion,” Journal of Quality Technology, 31, 16-29.

Box, G.E.P., Hunter, J.S., and Hunter, W.G. (2005), Statistics for Experimenters: Design, Innovation and Discovery (2nd ed.), Hoboken, New Jersey: John Wiley & Sons.

Box, G.E.P., and Liu, P.Y.T. (1999), “Statistics as a Catalyst to Learning by Scientific Method Part I – An Example,” Journal of Quality Technology, 31, 1-15.

Cohen, J. (1983), “The Cost of Dichotomization,” Applied Psychological Measurement, Vol. 7, No. 3, Summer 1983.

Coleman, D.E., and Montgomery, D.C. (1993), “A Systematic Approach to Planning for a Designed Industrial Experiment,” Technometrics, 35, 1-27.

Department of Defense (DOD), “Cybersecurity Test and Evaluation Management Guidebook,” Version 1.0, 1 July 2015.

Department of Defense (DOD), “Risk Management Framework (RMF) for DoD Information Technology (IT),” March 12, 2014, Incorporating Change 1, Effective May 24, 2016.

Department of Defense (DOD), “Defense Acquisition Guidebook,” Chapter 9, 2015, https://dag.dau.mil

Department of Defense (DOD), “Test and Evaluation Management Guide,” Sixth Edition, December 2012.

Freeman, L.J., Ryan, A.G., Kensler, J.L.K., Dickinson, R.M., and Vining, G.G. (2013), “A Tutorial on the Planning of Experiments,” Quality Engineering, 25, 315-332.

Gilmore, J.M. (2013), “A Statistically Rigorous Approach to Test and Evaluation (T&E),” International Test & Evaluation Association Journal, Vol. 34, September 2013.

Hamada, M. (2002), “The Advantages of Continuous Measurements over Pass/Fail Data,” Quality Engineering, Vol. 15, No. 2.

Harman, M. (2014), “Understanding Requirements for Effective Test Planning, Best Practice,” www.AFIT.edu/STAT, January 2014.

Hutto, G. (2011), Dragon Spear Test Planning Materials, 96th Test Wing, Eglin AFB, FL.

Kutner, M.H., Nachtsheim, C.J., Neter, J., and Li, W. (2005), Applied Linear Statistical Models, New York: McGraw-Hill Irwin.


Montgomery, D.C. (2013), Design and Analysis of Experiments (8th ed.), Hoboken, New Jersey: John Wiley & Sons.

Myers, R.H., Montgomery, D.C., and Anderson-Cook, C.M. (2016), Response Surface Methodology (4th ed.), Hoboken, New Jersey: John Wiley & Sons.

National Institute of Standards and Technology, NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/, April 2012.

Ortiz, F. (2014), “Dealing with Categorical Data Types in a Designed Experiment Part I: Why You Should Avoid Using Categorical Data Types, Best Practice,” www.AFIT.edu/STAT, Report 06-2014.

OSD Memorandum, Definition of Integrated Testing, 25 April 2008.

Tague, N.R. (2005), The Quality Toolbox, Milwaukee: ASQ Quality Press.