1.exploratory data analysis · 2018-01-11 · exploratory data analysis - detailed table of...

1. Exploratory Data Analysis



This chapter presents the assumptions, principles, and techniques necessary to gain insight into data via EDA--exploratory data analysis.

1. EDA Introduction
   1. What is EDA?
   2. EDA vs Classical & Bayesian
   3. EDA vs Summary
   4. EDA Goals
   5. The Role of Graphics
   6. An EDA/Graphics Example
   7. General Problem Categories

2. EDA Assumptions
   1. Underlying Assumptions
   2. Importance
   3. Techniques for Testing Assumptions
   4. Interpretation of 4-Plot
   5. Consequences

3. EDA Techniques
   1. Introduction
   2. Analysis Questions
   3. Graphical Techniques: Alphabetical
   4. Graphical Techniques: By Problem Category
   5. Quantitative Techniques
   6. Probability Distributions

4. EDA Case Studies
   1. Introduction
   2. By Problem Category

Detailed Chapter Table of Contents
References
Dataplot Commands for EDA Techniques


1. Exploratory Data Analysis - Detailed Table of Contents [1.]


1. EDA Introduction [1.1.]
   1. What is EDA? [1.1.1.]
   2. How Does Exploratory Data Analysis differ from Classical Data Analysis? [1.1.2.]
      1. Model [1.1.2.1.]
      2. Focus [1.1.2.2.]
      3. Techniques [1.1.2.3.]
      4. Rigor [1.1.2.4.]
      5. Data Treatment [1.1.2.5.]
      6. Assumptions [1.1.2.6.]
   3. How Does Exploratory Data Analysis Differ from Summary Analysis? [1.1.3.]
   4. What are the EDA Goals? [1.1.4.]
   5. The Role of Graphics [1.1.5.]
   6. An EDA/Graphics Example [1.1.6.]
   7. General Problem Categories [1.1.7.]
2. EDA Assumptions [1.2.]
   1. Underlying Assumptions [1.2.1.]
   2. Importance [1.2.2.]
   3. Techniques for Testing Assumptions [1.2.3.]
   4. Interpretation of 4-Plot [1.2.4.]
   5. Consequences [1.2.5.]
      1. Consequences of Non-Randomness [1.2.5.1.]
      2. Consequences of Non-Fixed Location Parameter [1.2.5.2.]
      3. Consequences of Non-Fixed Variation Parameter [1.2.5.3.]
      4. Consequences Related to Distributional Assumptions [1.2.5.4.]
3. EDA Techniques [1.3.]
   1. Introduction [1.3.1.]
   2. Analysis Questions [1.3.2.]
   3. Graphical Techniques: Alphabetic [1.3.3.]
      1. Autocorrelation Plot [1.3.3.1.]
         1. Autocorrelation Plot: Random Data [1.3.3.1.1.]
         2. Autocorrelation Plot: Moderate Autocorrelation [1.3.3.1.2.]
         3. Autocorrelation Plot: Strong Autocorrelation and Autoregressive Model [1.3.3.1.3.]
         4. Autocorrelation Plot: Sinusoidal Model [1.3.3.1.4.]
      2. Bihistogram [1.3.3.2.]
      3. Block Plot [1.3.3.3.]
      4. Bootstrap Plot [1.3.3.4.]
      5. Box-Cox Linearity Plot [1.3.3.5.]
      6. Box-Cox Normality Plot [1.3.3.6.]
      7. Box Plot [1.3.3.7.]




      8. Complex Demodulation Amplitude Plot [1.3.3.8.]
      9. Complex Demodulation Phase Plot [1.3.3.9.]
      10. Contour Plot [1.3.3.10.]
         1. DOE Contour Plot [1.3.3.10.1.]
      11. DOE Scatter Plot [1.3.3.11.]
      12. DOE Mean Plot [1.3.3.12.]
      13. DOE Standard Deviation Plot [1.3.3.13.]
      14. Histogram [1.3.3.14.]
         1. Histogram Interpretation: Normal [1.3.3.14.1.]
         2. Histogram Interpretation: Symmetric, Non-Normal, Short-Tailed [1.3.3.14.2.]
         3. Histogram Interpretation: Symmetric, Non-Normal, Long-Tailed [1.3.3.14.3.]
         4. Histogram Interpretation: Symmetric and Bimodal [1.3.3.14.4.]
         5. Histogram Interpretation: Bimodal Mixture of 2 Normals [1.3.3.14.5.]
         6. Histogram Interpretation: Skewed (Non-Normal) Right [1.3.3.14.6.]
         7. Histogram Interpretation: Skewed (Non-Symmetric) Left [1.3.3.14.7.]
         8. Histogram Interpretation: Symmetric with Outlier [1.3.3.14.8.]
      15. Lag Plot [1.3.3.15.]
         1. Lag Plot: Random Data [1.3.3.15.1.]
         2. Lag Plot: Moderate Autocorrelation [1.3.3.15.2.]
         3. Lag Plot: Strong Autocorrelation and Autoregressive Model [1.3.3.15.3.]
         4. Lag Plot: Sinusoidal Models and Outliers [1.3.3.15.4.]
      16. Linear Correlation Plot [1.3.3.16.]
      17. Linear Intercept Plot [1.3.3.17.]
      18. Linear Slope Plot [1.3.3.18.]
      19. Linear Residual Standard Deviation Plot [1.3.3.19.]
      20. Mean Plot [1.3.3.20.]
      21. Normal Probability Plot [1.3.3.21.]
         1. Normal Probability Plot: Normally Distributed Data [1.3.3.21.1.]
         2. Normal Probability Plot: Data Have Short Tails [1.3.3.21.2.]
         3. Normal Probability Plot: Data Have Long Tails [1.3.3.21.3.]
         4. Normal Probability Plot: Data are Skewed Right [1.3.3.21.4.]
      22. Probability Plot [1.3.3.22.]
      23. Probability Plot Correlation Coefficient Plot [1.3.3.23.]
      24. Quantile-Quantile Plot [1.3.3.24.]
      25. Run-Sequence Plot [1.3.3.25.]
      26. Scatter Plot [1.3.3.26.]
         1. Scatter Plot: No Relationship [1.3.3.26.1.]
         2. Scatter Plot: Strong Linear (positive correlation) Relationship [1.3.3.26.2.]
         3. Scatter Plot: Strong Linear (negative correlation) Relationship [1.3.3.26.3.]
         4. Scatter Plot: Exact Linear (positive correlation) Relationship [1.3.3.26.4.]
         5. Scatter Plot: Quadratic Relationship [1.3.3.26.5.]
         6. Scatter Plot: Exponential Relationship [1.3.3.26.6.]
         7. Scatter Plot: Sinusoidal Relationship (damped) [1.3.3.26.7.]
         8. Scatter Plot: Variation of Y Does Not Depend on X (homoscedastic) [1.3.3.26.8.]
         9. Scatter Plot: Variation of Y Does Depend on X (heteroscedastic) [1.3.3.26.9.]
         10. Scatter Plot: Outlier [1.3.3.26.10.]
         11. Scatterplot Matrix [1.3.3.26.11.]
         12. Conditioning Plot [1.3.3.26.12.]
      27. Spectral Plot [1.3.3.27.]
         1. Spectral Plot: Random Data [1.3.3.27.1.]
         2. Spectral Plot: Strong Autocorrelation and Autoregressive Model [1.3.3.27.2.]
         3. Spectral Plot: Sinusoidal Model [1.3.3.27.3.]
      28. Standard Deviation Plot [1.3.3.28.]
      29. Star Plot [1.3.3.29.]
      30. Weibull Plot [1.3.3.30.]
      31. Youden Plot [1.3.3.31.]
         1. DOE Youden Plot [1.3.3.31.1.]
      32. 4-Plot [1.3.3.32.]
      33. 6-Plot [1.3.3.33.]

   4. Graphical Techniques: By Problem Category [1.3.4.]
   5. Quantitative Techniques [1.3.5.]
      1. Measures of Location [1.3.5.1.]
      2. Confidence Limits for the Mean [1.3.5.2.]
      3. Two-Sample t-Test for Equal Means [1.3.5.3.]
         1. Data Used for Two-Sample t-Test [1.3.5.3.1.]
      4. One-Factor ANOVA [1.3.5.4.]
      5. Multi-factor Analysis of Variance [1.3.5.5.]
      6. Measures of Scale [1.3.5.6.]
      7. Bartlett's Test [1.3.5.7.]
      8. Chi-Square Test for the Standard Deviation [1.3.5.8.]
         1. Data Used for Chi-Square Test for the Standard Deviation [1.3.5.8.1.]
      9. F-Test for Equality of Two Standard Deviations [1.3.5.9.]
      10. Levene Test for Equality of Variances [1.3.5.10.]
      11. Measures of Skewness and Kurtosis [1.3.5.11.]
      12. Autocorrelation [1.3.5.12.]
      13. Runs Test for Detecting Non-randomness [1.3.5.13.]
      14. Anderson-Darling Test [1.3.5.14.]
      15. Chi-Square Goodness-of-Fit Test [1.3.5.15.]
      16. Kolmogorov-Smirnov Goodness-of-Fit Test [1.3.5.16.]
      17. Grubbs' Test for Outliers [1.3.5.17.]
      18. Yates Analysis [1.3.5.18.]
         1. Defining Models and Prediction Equations [1.3.5.18.1.]
         2. Important Factors [1.3.5.18.2.]
   6. Probability Distributions [1.3.6.]
      1. What is a Probability Distribution [1.3.6.1.]
      2. Related Distributions [1.3.6.2.]
      3. Families of Distributions [1.3.6.3.]
      4. Location and Scale Parameters [1.3.6.4.]
      5. Estimating the Parameters of a Distribution [1.3.6.5.]
         1. Method of Moments [1.3.6.5.1.]
         2. Maximum Likelihood [1.3.6.5.2.]
         3. Least Squares [1.3.6.5.3.]
         4. PPCC and Probability Plots [1.3.6.5.4.]
      6. Gallery of Distributions [1.3.6.6.]
         1. Normal Distribution [1.3.6.6.1.]
         2. Uniform Distribution [1.3.6.6.2.]
         3. Cauchy Distribution [1.3.6.6.3.]
         4. t Distribution [1.3.6.6.4.]
         5. F Distribution [1.3.6.6.5.]
         6. Chi-Square Distribution [1.3.6.6.6.]
         7. Exponential Distribution [1.3.6.6.7.]
         8. Weibull Distribution [1.3.6.6.8.]
         9. Lognormal Distribution [1.3.6.6.9.]
         10. Fatigue Life Distribution [1.3.6.6.10.]
         11. Gamma Distribution [1.3.6.6.11.]
         12. Double Exponential Distribution [1.3.6.6.12.]
         13. Power Normal Distribution [1.3.6.6.13.]
         14. Power Lognormal Distribution [1.3.6.6.14.]
         15. Tukey-Lambda Distribution [1.3.6.6.15.]
         16. Extreme Value Type I Distribution [1.3.6.6.16.]
         17. Beta Distribution [1.3.6.6.17.]
         18. Binomial Distribution [1.3.6.6.18.]
         19. Poisson Distribution [1.3.6.6.19.]
      7. Tables for Probability Distributions [1.3.6.7.]
         1. Cumulative Distribution Function of the Standard Normal Distribution [1.3.6.7.1.]
         2. Upper Critical Values of the Student's-t Distribution [1.3.6.7.2.]
         3. Upper Critical Values of the F Distribution [1.3.6.7.3.]
         4. Critical Values of the Chi-Square Distribution [1.3.6.7.4.]
         5. Critical Values of the t* Distribution [1.3.6.7.5.]
         6. Critical Values of the Normal PPCC Distribution [1.3.6.7.6.]

4. EDA Case Studies [1.4.]
   1. Case Studies Introduction [1.4.1.]
   2. Case Studies [1.4.2.]
      1. Normal Random Numbers [1.4.2.1.]
         1. Background and Data [1.4.2.1.1.]
         2. Graphical Output and Interpretation [1.4.2.1.2.]
         3. Quantitative Output and Interpretation [1.4.2.1.3.]
         4. Work This Example Yourself [1.4.2.1.4.]
      2. Uniform Random Numbers [1.4.2.2.]
         1. Background and Data [1.4.2.2.1.]
         2. Graphical Output and Interpretation [1.4.2.2.2.]
         3. Quantitative Output and Interpretation [1.4.2.2.3.]
         4. Work This Example Yourself [1.4.2.2.4.]
      3. Random Walk [1.4.2.3.]
         1. Background and Data [1.4.2.3.1.]
         2. Test Underlying Assumptions [1.4.2.3.2.]
         3. Develop A Better Model [1.4.2.3.3.]
         4. Validate New Model [1.4.2.3.4.]
         5. Work This Example Yourself [1.4.2.3.5.]
      4. Josephson Junction Cryothermometry [1.4.2.4.]
         1. Background and Data [1.4.2.4.1.]
         2. Graphical Output and Interpretation [1.4.2.4.2.]
         3. Quantitative Output and Interpretation [1.4.2.4.3.]
         4. Work This Example Yourself [1.4.2.4.4.]
      5. Beam Deflections [1.4.2.5.]
         1. Background and Data [1.4.2.5.1.]
         2. Test Underlying Assumptions [1.4.2.5.2.]
         3. Develop a Better Model [1.4.2.5.3.]
         4. Validate New Model [1.4.2.5.4.]
         5. Work This Example Yourself [1.4.2.5.5.]
      6. Filter Transmittance [1.4.2.6.]
         1. Background and Data [1.4.2.6.1.]
         2. Graphical Output and Interpretation [1.4.2.6.2.]
         3. Quantitative Output and Interpretation [1.4.2.6.3.]
         4. Work This Example Yourself [1.4.2.6.4.]
      7. Standard Resistor [1.4.2.7.]
         1. Background and Data [1.4.2.7.1.]
         2. Graphical Output and Interpretation [1.4.2.7.2.]
         3. Quantitative Output and Interpretation [1.4.2.7.3.]
         4. Work This Example Yourself [1.4.2.7.4.]
      8. Heat Flow Meter 1 [1.4.2.8.]
         1. Background and Data [1.4.2.8.1.]
         2. Graphical Output and Interpretation [1.4.2.8.2.]
         3. Quantitative Output and Interpretation [1.4.2.8.3.]
         4. Work This Example Yourself [1.4.2.8.4.]
      9. Fatigue Life of Aluminum Alloy Specimens [1.4.2.9.]
         1. Background and Data [1.4.2.9.1.]
         2. Graphical Output and Interpretation [1.4.2.9.2.]
      10. Ceramic Strength [1.4.2.10.]
         1. Background and Data [1.4.2.10.1.]
         2. Analysis of the Response Variable [1.4.2.10.2.]
         3. Analysis of the Batch Effect [1.4.2.10.3.]
         4. Analysis of the Lab Effect [1.4.2.10.4.]
         5. Analysis of Primary Factors [1.4.2.10.5.]
         6. Work This Example Yourself [1.4.2.10.6.]
   3. References For Chapter 1: Exploratory Data Analysis [1.4.3.]


1.1. EDA Introduction


Summary

What is exploratory data analysis? How and where did it originate? How does it differ from other data analysis approaches, such as classical and Bayesian? Is EDA the same as statistical graphics, and what role do statistical graphics play in EDA?

This section answers these and related questions and provides the necessary frame of reference for EDA assumptions, principles, and techniques.

Table of Contents for Section 1

1. What is EDA?
2. EDA versus Classical and Bayesian
   1. Models
   2. Focus
   3. Techniques
   4. Rigor
   5. Data Treatment
   6. Assumptions
3. EDA vs Summary
4. EDA Goals
5. The Role of Graphics
6. An EDA/Graphics Example
7. General Problem Categories


1.1.1. What is EDA?



Approach

Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

1. maximize insight into a data set;
2. uncover underlying structure;
3. extract important variables;
4. detect outliers and anomalies;
5. test underlying assumptions;
6. develop parsimonious models; and
7. determine optimal factor settings.

Focus

The EDA approach is precisely that--an approach--not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out.

Philosophy

EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques--all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue; EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow in favor of the more direct approach of allowing the data themselves to reveal their underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set: what we look for, how we look, and how we interpret. It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics per se.

History

The seminal work in EDA is Exploratory Data Analysis, Tukey (1977). Over the years it has benefitted from other noteworthy publications such as Data Analysis and Regression, Mosteller and Tukey (1977); Interactive Data Analysis, Hoaglin (1977); and The ABC's of EDA, Velleman and Hoaglin (1981), and it has gained a large following as "the" way to analyze a data set.

Techniques

Most EDA techniques are graphical in nature, with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to explore open-mindedly, and graphics give the analyst unparalleled power to do so, enticing the data to reveal their structural secrets while the analyst remains ready to gain some new, often unsuspected, insight into the data. Graphics draw this power from their combination with the natural pattern-recognition capabilities that we all possess.

The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

1. Plotting the raw data (such as data traces, histograms, bihistograms, probability plots, lag plots, block plots, and Youden plots).

2. Plotting simple statistics such as mean plots, standard deviation plots, box plots, and main effects plots of the raw data.

3. Positioning such plots so as to maximize our natural pattern-recognition abilities, such as using multiple plots per page.
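The three steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the handbook's own software (the handbook uses Dataplot). The helper names `lag_pairs`, `normal_plot_points`, and `plot_page` are hypothetical, the plotting positions `(i - 0.375) / (n + 0.25)` are one common approximation chosen here as an assumption, and `plot_page` assumes the third-party matplotlib package is installed.

```python
from statistics import NormalDist


def lag_pairs(y, lag=1):
    """Return the (y[i - lag], y[i]) pairs that form the points of a lag plot."""
    return list(zip(y[:-lag], y[lag:]))


def normal_plot_points(y):
    """Pair approximate standard-normal quantiles with the sorted data."""
    n = len(y)
    # Assumed plotting positions; other conventions exist.
    positions = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]
    quantiles = [NormalDist().inv_cdf(p) for p in positions]
    return list(zip(quantiles, sorted(y)))


def plot_page(y, path="eda_page.png"):
    """Place four simple raw-data plots on one page (assumes matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # write to a file; no display required
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(8, 6))
    axes[0, 0].plot(range(1, len(y) + 1), y)  # data trace in run order
    axes[0, 0].set_title("Data trace")
    axes[0, 1].scatter(*zip(*lag_pairs(y)))  # y[i] against y[i-1]
    axes[0, 1].set_title("Lag plot")
    axes[1, 0].hist(y, bins=10)
    axes[1, 0].set_title("Histogram")
    axes[1, 1].scatter(*zip(*normal_plot_points(y)))
    axes[1, 1].set_title("Normal probability plot")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `plot_page(measurements)` on a list of numbers writes one image with all four panels, so the eye can compare them at a glance.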


1.1.2. How Does Exploratory Data Analysis differ from Classical Data Analysis?




Data Analysis Approaches

EDA is a data analysis approach. What other data analysis approaches exist and how does EDA differ from these other approaches? Three popular data analysis approaches are:

1. Classical
2. Exploratory (EDA)
3. Bayesian

Paradigms for Analysis Techniques

These three approaches are similar in that they all start with a general science/engineering problem and all yield science/engineering conclusions. The difference is the sequence and focus of the intermediate steps.

For classical analysis, the sequence is

Problem => Data => Model => Analysis => Conclusions

For EDA, the sequence is

Problem => Data => Analysis => Model => Conclusions

For Bayesian, the sequence is

Problem => Data => Model => Prior Distribution => Analysis => Conclusions

Method of dealing with underlying model for the data distinguishes the 3 approaches

Thus for classical analysis, the data collection is followed by the imposition of a model (normality, linearity, etc.), and the analysis, estimation, and testing that follow are focused on the parameters of that model. For EDA, the data collection is not followed by a model imposition; rather it is followed immediately by analysis with a goal of inferring what model would be appropriate. Finally, for a Bayesian analysis, the analyst attempts to incorporate scientific/engineering knowledge/expertise into the analysis by imposing a data-independent distribution on the parameters of the selected model; the analysis thus consists of formally combining the prior distribution on the parameters with the collected data to jointly make inferences and/or test assumptions about the model parameters.

In the real world, data analysts freely mix elements of all of the above three approaches (and other approaches). The above distinctions were made to emphasize the major differences among the three approaches.

Further discussion of the distinction between the classical and EDA approaches

Focusing on EDA versus classical, these two approaches differ as follows:

1. Models
2. Focus
3. Techniques
4. Rigor
5. Data Treatment
6. Assumptions


1.1.2.1. Model



Classical

The classical approach imposes models (both deterministic and probabilistic) on the data. Deterministic models include, for example, regression models and analysis of variance (ANOVA) models. The most common probabilistic model assumes that the errors about the deterministic model are normally distributed--this assumption affects the validity of the ANOVA F tests.

Exploratory

The Exploratory Data Analysis approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest admissible models that best fit the data.


1.1.2.2. Focus




Classical

The two approaches differ substantially in focus. For classical analysis, the focus is on the model--estimating parameters of the model and generating predicted values from the model.

Exploratory

For exploratory data analysis, the focus is on the data--its structure, outliers, and models suggested by the data.


1.1.2.3. Techniques




Classical

Classical techniques are generally quantitative in nature. They include ANOVA, t tests, chi-squared tests, and F tests.

Exploratory

EDA techniques are generally graphical. They include scatter plots, character plots, box plots, histograms, bihistograms, probability plots, residual plots, and mean plots.


1.1.2.4. Rigor




Classical

Classical techniques serve as the probabilistic foundation of science and engineering; the most important characteristic of classical techniques is that they are rigorous, formal, and "objective".

Exploratory

EDA techniques do not share in that rigor or formality. EDA techniques make up for that lack of rigor by being very suggestive, indicative, and insightful about what the appropriate model should be.

EDA techniques are subjective and depend on interpretation, which may differ from analyst to analyst, although experienced analysts commonly arrive at identical conclusions.


1.1.2.5. Data Treatment




Classical

Classical estimation techniques have the characteristic of taking all of the data and mapping the data into a few numbers ("estimates"). This is both a virtue and a vice. The virtue is that these few numbers focus on important characteristics (location, variation, etc.) of the population. The vice is that concentrating on these few characteristics can filter out other characteristics (skewness, tail length, autocorrelation, etc.) of the same population. In this sense there is a loss of information due to this "filtering" process.

Exploratory

The EDA approach, on the other hand, often makes use of (and shows) all of the available data. In this sense there is no corresponding loss of information.
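The "filtering" effect can be illustrated with a small sketch. The two samples below are made up for illustration and are not from the handbook. Both are rescaled to mean 0 and standard deviation 1, so the two usual estimates cannot tell them apart, yet their skewness differs sharply. The helper names `standardize` and `skewness` are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(data):
    """Rescale a sample to mean 0 and (population) standard deviation 1."""
    m, s = fmean(data), pstdev(data)
    return [(v - m) / s for v in data]


def skewness(data):
    """Moment-based skewness: the mean of the cubed standardized values."""
    m, s = fmean(data), pstdev(data)
    return fmean([((v - m) / s) ** 3 for v in data])


# Illustrative samples: one symmetric, one with a long right tail.
symmetric = standardize([1, 2, 3, 4, 5, 6, 7, 8, 9])
skewed = standardize([1, 1, 1, 2, 2, 3, 4, 7, 15])

for name, sample in (("symmetric", symmetric), ("skewed", skewed)):
    print(f"{name}: mean={fmean(sample):.3f} "
          f"sd={pstdev(sample):.3f} skewness={skewness(sample):.3f}")
```

A histogram of each sample would show the difference directly, which is the point of showing all of the data rather than only a location and a variation estimate.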


1.1.2.6. Assumptions




Classical

The "good news" of the classical approach is that tests based on classical techniques are usually very sensitive--that is, if a true shift in location, say, has occurred, such tests frequently have the power to detect such a shift and to conclude that such a shift is "statistically significant". The "bad news" is that classical tests depend on underlying assumptions (e.g., normality), and hence the validity of the test conclusions becomes dependent on the validity of the underlying assumptions. Worse yet, the exact underlying assumptions may be unknown to the analyst, or if known, untested. Thus the validity of the scientific conclusions becomes intrinsically linked to the validity of the underlying assumptions. In practice, if such assumptions are unknown or untested, the validity of the scientific conclusions becomes suspect.

Exploratory

Many EDA techniques make few or no assumptions--they present and show the data--all of the data--as is, with fewer encumbering assumptions.


1.1.3. How Does Exploratory Data Analysis Differ from Summary Analysis?




Summary

A summary analysis is simply a numeric reduction of a historical data set. It is quite passive. Its focus is on the past. Quite commonly, its purpose is simply to arrive at a few key statistics (for example, mean and standard deviation) which may then either replace the data set or be added to the data set in the form of a summary table.

Exploratory

In contrast, EDA has as its broadest goal the desire to gain insight into the engineering/scientific process behind the data. Whereas summary statistics are passive and historical, EDA is active and futuristic. In an attempt to "understand" the process and improve it in the future, EDA uses the data as a "window" to peer into the heart of the process that generated the data. There is an archival role in the research and manufacturing world for summary statistics, but there is an enormously larger role for the EDA approach.


1.1.4. What are the EDA Goals?




Primary and Secondary Goals

The primary goal of EDA is to maximize the analyst's insight into a data set and into the underlying structure of a data set, while providing all of the specific items that an analyst would want to extract from a data set, such as:

1. a good-fitting, parsimonious model
2. a list of outliers
3. a sense of robustness of conclusions
4. estimates for parameters
5. uncertainties for those estimates
6. a ranked list of important factors
7. conclusions as to whether individual factors are statistically significant
8. optimal settings

Insight into the Data

Insight implies detecting and uncovering underlying structure in the data. Such underlying structure may not be encapsulated in the list of items above; such items serve as the specific targets of an analysis, but the real insight and "feel" for a data set comes as the analyst judiciously probes and explores the various subtleties of the data. The "feel" for the data comes almost exclusively from the application of various graphical techniques, the collection of which serves as the window into the essence of the data. Graphics are irreplaceable--there are no quantitative analogues that will give the same insight as well-chosen graphics.

To get a "feel" for the data, it is not enough for the analyst to know what is in the data; the analyst also must know what is not in the data, and the only way to do that is to draw on our own human pattern-recognition and comparative abilities in the context of a series of judicious graphical techniques applied to the data.


1.1.5. The Role of Graphics




Quantitative/Graphical

Statistics and data analysis procedures can broadly be split into two parts:

quantitative
graphical

Quantitative

Quantitative techniques are the set of statistical procedures that yield numeric or tabular output. Examples of quantitative techniques include:

hypothesis testing
analysis of variance
point estimates and confidence intervals
least squares regression

These and similar techniques are all valuable and are mainstream in terms of classical analysis.

Graphical

On the other hand, there is a large collection of statistical tools that we generally refer to as graphical techniques. These include:

scatter plots
histograms
probability plots
residual plots
box plots
block plots

EDA Approach Relies Heavily on Graphical Techniques

The EDA approach relies heavily on these and similar graphical techniques. Graphical procedures are not just tools that we could use in an EDA context; they are tools that we must use. Such graphical tools are the shortest path to gaining insight into a data set in terms of

testing assumptions
model selection
model validation
estimator selection
relationship identification
factor effect determination
outlier detection

If one is not using statistical graphics, then one is forfeiting insight into one or more aspects of the underlying structure of the data.


1.1.6. An EDA/Graphics Example




Anscombe Example

A simple, classic (Anscombe) example of the central role that graphics play in terms of providing insight into a data set starts with the following data set:

Data

    X      Y
10.00   8.04
 8.00   6.95
13.00   7.58
 9.00   8.81
11.00   8.33
14.00   9.96
 6.00   7.24
 4.00   4.26
12.00  10.84
 7.00   4.82
 5.00   5.68

Summary Statistics

If the goal of the analysis is to compute summary statistics plus determine the best linear fit for Y as a function of X, the results might be given as:

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816

The above quantitative analysis, although valuable, gives us only limited insight into the data.
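As a hedged illustration, the summary statistics above can be recomputed with ordinary least squares formulas in plain Python. The function name `linear_summary` is hypothetical, and the residual standard deviation is assumed to use n - 2 degrees of freedom, which is consistent with the value reported above.

```python
from math import sqrt

# Anscombe data set 1, copied from the table above.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]


def linear_summary(x, y):
    """Means, least-squares line, residual SD, and correlation for (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "resid_sd": sqrt(sse / (n - 2)),  # assumes n - 2 degrees of freedom
        "corr": sxy / sqrt(sxx * syy),
    }


summary = linear_summary(x, y)
for key, value in summary.items():
    print(f"{key} = {value:.3f}")
```

Seven numbers come out, and none of them says anything about the shape of the point cloud, which is the limitation the text goes on to describe.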

Scatter Plot

In contrast, the following simple scatter plot of the data

[Figure: scatter plot of Y versus X for the data set above]

suggests the following:

1. The data set "behaves like" a linear curve with some scatter;

2. there is no justification for a more complicated model (e.g., quadratic);

3. there are no outliers;

4. the vertical spread of the data appears to be of equal height irrespective of the X-value; this indicates that the data are equally-precise throughout and so a "regular" (that is, equi-weighted) fit is appropriate.

Three Additional Data Sets

This kind of characterization for the data serves as the core for getting insight/feel for the data. Such insight/feel does not come from the quantitative statistics; on the contrary, calculations of quantitative statistics such as intercept and slope should be subsequent to the characterization and will make sense only if the characterization is true. To illustrate the loss of information that results when the graphics insight step is skipped, consider the following three data sets [Anscombe data sets 2, 3, and 4]:

   X2     Y2      X3     Y3      X4     Y4
10.00   9.14   10.00   7.46    8.00   6.58
 8.00   8.14    8.00   6.77    8.00   5.76
13.00   8.74   13.00  12.74    8.00   7.71
 9.00   8.77    9.00   7.11    8.00   8.84
11.00   9.26   11.00   7.81    8.00   8.47
14.00   8.10   14.00   8.84    8.00   7.04
 6.00   6.13    6.00   6.08    8.00   5.25
 4.00   3.10    4.00   5.39   19.00  12.50
12.00   9.13   12.00   8.15    8.00   5.56
 7.00   7.26    7.00   6.42    8.00   7.91
 5.00   4.74    5.00   5.73    8.00   6.89

Quantitative Statistics for Data Set 2

A quantitative analysis on data set 2 yields

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237
Correlation = 0.816

which is identical to the analysis for data set 1. One might naively assume that the two data sets are "equivalent" since that is what the statistics tell us; but what do the statistics not tell us?

Quantitative Statistics for Data Sets 3 and 4

Remarkably, a quantitative analysis on data sets 3 and 4 also yields

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.236
Correlation = 0.816 (0.817 for data set 4)

which implies that in some quantitative sense, all four of the data sets are "equivalent". In fact, the four data sets are far from "equivalent" and a scatter plot of each data set, which would be step 1 of any EDA approach, would tell us that immediately.
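The near-identical statistics for all four data sets can be checked side by side. The sketch below is illustrative only. The name `fit_stats` is hypothetical, the data are copied from the tables above, and the residual standard deviation again assumes n - 2 degrees of freedom.

```python
from math import sqrt

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by data sets 1 to 3
DATASETS = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_stats(x, y):
    """Return (mean_y, intercept, slope, residual SD, correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = syy - slope * sxy  # residual sum of squares of the fitted line
    return my, intercept, slope, sqrt(sse / (n - 2)), sxy / sqrt(sxx * syy)


results = {k: fit_stats(x, y) for k, (x, y) in DATASETS.items()}
for k, stats in results.items():
    print(k, " ".join(f"{v:.3f}" for v in stats))
```

Each row of output agrees with the others to about two decimal places, even though the underlying point patterns are entirely different.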

Scatter Plots

[Figure: scatter plots of the four data sets]

Interpretation of Scatter Plots

Conclusions from the scatter plots are:

1. data set 1 is clearly linear with some scatter.
2. data set 2 is clearly quadratic.
3. data set 3 clearly has an outlier.
4. data set 4 is obviously the victim of a poor experimental design with a single point far removed from the bulk of the data "wagging the dog".
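The four scatter plots can be redrawn with a short script. This is a sketch under stated assumptions: the function name `scatter_grid` is hypothetical, it assumes the third-party matplotlib package is installed, and the overlaid line uses the rounded intercept 3 and slope 0.5 quoted in the text rather than a freshly fitted line.

```python
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "Data set 1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68]),
    "Data set 2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                          6.13, 3.10, 9.13, 7.26, 4.74]),
    "Data set 3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                          6.08, 5.39, 8.15, 6.42, 5.73]),
    "Data set 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                   [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                    5.25, 12.50, 5.56, 7.91, 6.89]),
}


def scatter_grid(datasets, path="anscombe.png"):
    """Draw one scatter plot per data set on a shared 2x2 page."""
    import matplotlib
    matplotlib.use("Agg")  # write to a file; no display required
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, figsize=(8, 6), sharex=True, sharey=True)
    for ax, (name, (x, y)) in zip(axes.flat, datasets.items()):
        ax.scatter(x, y)
        # The common fitted line y = 3 + 0.5 x reported for every data set.
        ax.plot([3, 20], [3 + 0.5 * 3, 3 + 0.5 * 20])
        ax.set_title(name)
        ax.set_xlabel("X")
        ax.set_ylabel("Y")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path
```

Calling `scatter_grid(ANSCOMBE)` writes the image. Sharing both axes across panels is a deliberate choice, since it lets the same line serve as a common reference against four visibly different patterns.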

Importance of Exploratory Analysis

These points are exactly the substance that provides and defines "insight" and "feel" for a data set. They are the goals and the fruits of an open exploratory data analysis (EDA) approach to the data. Quantitative statistics are not wrong per se, but they are incomplete. They are incomplete because they are numeric summaries which in the summarization operation do a good job of focusing on a particular aspect of the data (e.g., location, intercept, slope, degree of relatedness, etc.) by judiciously reducing the data to a few numbers. Doing so also filters the data, necessarily omitting and screening out other sometimes crucial information in the focusing operation. Quantitative statistics focus but also filter; and filtering is exactly what makes the quantitative approach incomplete at best and misleading at worst.

The estimated intercepts (= 3) and slopes (= 0.5) for data sets 2, 3, and 4 are misleading because the estimation is done in the context of an assumed linear model and that linearity assumption is the fatal flaw in this analysis.

The EDA approach of deliberately postponing the model selection until further along in the analysis has many rewards, not the least of which is the ultimate convergence to a much-improved model and the formulation of valid and supportable scientific and engineering conclusions.


1.1.7. General Problem Categories




Problem Classification

The following table is a convenient way to classify EDA problems.

Univariate and Control

UNIVARIATE

Data: A single column of numbers, Y.

Model: y = constant + error

Output:

1. A number (the estimated constant in the model).
2. An estimate of uncertainty for the constant.
3. An estimate of the distribution for the error.

Techniques: 4-Plot, Probability Plot, PPCC Plot

CONTROL

Data: A single column of numbers, Y.

Model: y = constant + error

Output: A "yes" or "no" to the question "Is the system out of control?".

Techniques: Control Charts

Comparative and Screening

COMPARATIVE

Data: A single response variable and k independent variables (Y, X1, X2, ... , Xk); primary focus is on one (the primary factor) of these independent variables.

Model: y = f(x1, x2, ..., xk) + error

Output: A "yes" or "no" to the question "Is the primary factor significant?".

Techniques: Block Plot, Scatter Plot, Box Plot

SCREENING

Data: A single response variable and k independent variables (Y, X1, X2, ... , Xk).

Model: y = f(x1, x2, ..., xk) + error

Output:

1. A ranked list (from most important to least important) of factors.
2. Best settings for the factors.
3. A good model/prediction equation relating Y to the factors.

Techniques: Block Plot, Probability Plot, Bihistogram

Optimization and Regression

OPTIMIZATION

Data: A single response variable and k independent variables (Y, X1, X2, ... , Xk).

Model: y = f(x1, x2, ..., xk) + error

Output: Best settings for the factor variables.

Techniques: Block Plot, Least Squares Fitting, Contour Plot

REGRESSION

Data: A single response variable and k independent variables (Y, X1, X2, ... , Xk). The independent variables can be continuous.

Model: y = f(x1, x2, ..., xk) + error

Output: A good model/prediction equation relating Y to the factors.

Techniques: Least Squares Fitting, Scatter Plot, 6-Plot

Time Series and Multivariate

TIME SERIES

Data: A column of time dependent numbers, Y. In addition, time is an independent variable. The time variable can be either explicit or implied. If the data are not equi-spaced, the time variable should be explicitly provided.

Model: y_t = f(t) + error. The model can be either time domain based or frequency domain based.

Output: A good model/prediction equation relating Y to previous values of Y.

Techniques: Autocorrelation Plot, Spectrum, Complex Demodulation Amplitude Plot, Complex Demodulation Phase Plot, ARIMA Models

MULTIVARIATE

Data: k factor variables (X1, X2, ..., Xk).

Model: The model is not explicit.

Output: Identify underlying correlation structure in the data.

Techniques: Star Plot, Scatter Plot Matrix, Conditioning Plot, Profile Plot, Principal Components, Clustering, Discrimination/Classification

Note that multivariate analysis is only covered lightly in this Handbook.




1.2. EDA Assumptions




Summary

The gamut of scientific and engineering experimentation is virtually limitless. In this sea of diversity is there any common basis that allows the analyst to systematically and validly arrive at supportable, repeatable research conclusions?

Fortunately, there is such a basis and it is rooted in the fact that every measurement process, however complicated, has certain underlying assumptions. This section deals with what those assumptions are, why they are important, how to go about testing them, and what the consequences are if the assumptions do not hold.

1. Underlying Assumptions
2. Importance
3. Testing Assumptions
4. Importance of Plots
5. Consequences


1.2.1. Underlying Assumptions




Assumptions Underlying a Measurement Process

There are four assumptions that typically underlie all measurement processes; namely, that the data from the process at hand "behave like":

1. random drawings;
2. from a fixed distribution;
3. with the distribution having fixed location; and
4. with the distribution having fixed variation.

Univariate or Single Response Variable

The "fixed location" referred to in item 3 above differs for different problem types. The simplest problem type is univariate; that is, a single variable. For the univariate problem, the general model

response = deterministic component + random component

becomes

response = constant + error

Assumptions for Univariate Model

For this case, the "fixed location" is simply the unknown constant. We can thus imagine the process at hand to be operating under constant conditions that produce a single column of data with the properties that

the data are uncorrelated with one another;
the random component has a fixed distribution;
the deterministic component consists of only a constant; and
the random component has fixed variation.

Extrapolation to a Function of Many Variables

The universal power and importance of the univariate model is that it can easily be extended to the more general case where the deterministic component is not just a constant, but is in fact a function of many variables, and the engineering objective is to characterize and model the function.

Residuals Will Behave According to Univariate Assumptions

The key point is that regardless of how many factors there are, and regardless of how complicated the function is, if the engineer succeeds in choosing a good model, then the differences (residuals) between the raw response data and the predicted values from the fitted model should themselves behave like a univariate process. Furthermore, the residuals from this univariate process fit will behave like:

random drawings;
from a fixed distribution;
with fixed location (namely, 0 in this case); and
with fixed variation.

Validation of Model

Thus if the residuals from the fitted model do in fact behave like the ideal, then testing of underlying assumptions becomes a tool for the validation and quality of fit of the chosen model. On the other hand, if the residuals from the chosen fitted model violate one or more of the above univariate assumptions, then the chosen fitted model is inadequate and an opportunity exists for arriving at an improved model.
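As a rough illustration of residuals serving as a univariate data set, the sketch below fits a straight line to simulated data and then looks at the location and spread of the residuals. The data, the true coefficients (2.0 and 0.5), and the choice of a straight-line model are all assumptions made for this example only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical process: a straight line plus random error (values invented here).
x = [float(i) for i in range(1, 41)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

# Straight-line least squares fit.
mean_x = statistics.fmean(x)
mean_y = statistics.fmean(y)
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
    (a - mean_x) ** 2 for a in x
)
intercept = mean_y - slope * mean_x

# Residuals = raw response minus the predicted value from the fitted model.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

# Crude numeric checks of fixed location (0) and fixed variation.
half = len(residuals) // 2
print("residual mean (should be ~0):", round(statistics.fmean(residuals), 6))
print("SD, first half :", round(statistics.stdev(residuals[:half]), 3))
print("SD, second half:", round(statistics.stdev(residuals[half:]), 3))
```

These numeric checks are only a stand-in for the graphical checks; in practice the residuals would be examined with the same plots used for any univariate data set.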


1.2.2. Importance




Predictability and Statistical Control

Predictability is an all-important goal in science and engineering. If the four underlying assumptions hold, then we have achieved probabilistic predictability--the ability to make probability statements not only about the process in the past, but also about the process in the future. In short, such processes are said to be "in statistical control".

Validity of Engineering Conclusions

Moreover, if the four assumptions are valid, then the process is amenable to the generation of valid scientific and engineering conclusions. If the four assumptions are not valid, then the process is drifting (with respect to location, variation, or distribution), unpredictable, and out of control. A simple characterization of such processes by a location estimate, a variation estimate, or a distribution "estimate" inevitably leads to engineering conclusions that are not valid, are not supportable (scientifically or legally), and are not repeatable in the laboratory.


1.2.3. Techniques for Testing Assumptions




Testing Underlying Assumptions Helps Assure the Validity of Scientific and Engineering Conclusions

Because the validity of the final scientific/engineering conclusions is inextricably linked to the validity of the underlying univariate assumptions, it naturally follows that there is a real necessity that each and every one of the above four assumptions be routinely tested.

Four Techniques to Test Underlying Assumptions

The following EDA techniques are simple, efficient, and powerful for the routine testing of underlying assumptions:

1. run sequence plot (Y_i versus i)
2. lag plot (Y_i versus Y_{i-1})
3. histogram (counts versus subgroups of Y)
4. normal probability plot (ordered Y versus theoretical ordered Y)

Plot on a Single Page for a Quick Characterization of the Data

The four EDA plots can be juxtaposed for a quick look at the characteristics of the data. The plots below are ordered as follows:

1. Run sequence plot - upper left
2. Lag plot - upper right
3. Histogram - lower left
4. Normal probability plot - lower right
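Drawing the panels requires a graphics package, but the coordinates behind each of the four plots can be sketched with the Python standard library alone. The example below is an illustration under stated assumptions: the data are simulated, the number of histogram bins is arbitrary, and the plotting positions (i - 0.5)/N used for the normal probability plot are one common convention chosen for this sketch, not necessarily the one used to produce the Handbook's plots.

```python
import random
import statistics

random.seed(1)
y = [random.gauss(10.0, 2.0) for _ in range(200)]  # hypothetical data
n = len(y)

# 1. Run sequence plot: Y(i) on the vertical axis versus i on the horizontal.
run_sequence = list(enumerate(y, start=1))

# 2. Lag plot: each pair is (Y(i-1), Y(i)), i.e. horizontal then vertical.
lag_pairs = list(zip(y[:-1], y[1:]))

# 3. Histogram: counts versus equal-width subgroups of Y.
n_bins = 10  # arbitrary choice for this sketch
lo, hi = min(y), max(y)
width = (hi - lo) / n_bins
counts = [0] * n_bins
for v in y:
    k = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
    counts[k] += 1

# 4. Normal probability plot: ordered Y versus theoretical ordered Y.
# The plotting positions (i - 0.5) / n are an assumption of this sketch.
std_normal = statistics.NormalDist()
theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
prob_plot = list(zip(theoretical, sorted(y)))

print("first run-sequence point:", run_sequence[0])
print("histogram counts:", counts)
print("lowest probability-plot point:", prob_plot[0])
```

Each list can be handed to any plotting tool and arranged in the two-by-two layout described above.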

Sample Plot: Assumptions Hold

This 4-plot reveals a process that has fixed location, fixed variation, is random, apparently has a fixed approximately normal distribution, and has no outliers.

Sample Plot: Assumptions Do Not Hold

If one or more of the four underlying assumptions do not hold, then it will show up in the various plots as demonstrated in the following example.

This 4-plot reveals a process that has fixed location, fixed variation, is non-random (oscillatory), has a non-normal, U-shaped distribution, and has several outliers.

1.2.4. Interpretation of 4-Plot




Interpretation of EDA Plots: Flat and Equi-Banded, Random, Bell-Shaped, and Linear

The four EDA plots discussed on the previous page are used to test the underlying assumptions:

1. Fixed Location: If the fixed location assumption holds, then the run sequence plot will be flat and non-drifting.

2. Fixed Variation: If the fixed variation assumption holds, then the vertical spread in the run sequence plot will be approximately the same over the entire horizontal axis.

3. Randomness: If the randomness assumption holds, then the lag plot will be structureless and random.

4. Fixed Distribution: If the fixed distribution assumption holds, in particular if the fixed normal distribution holds, then

   1. the histogram will be bell-shaped, and
   2. the normal probability plot will be linear.

Plots Utilized to Test the Assumptions

Conversely, the underlying assumptions are tested using the EDA plots:

Run Sequence Plot: If the run sequence plot is flat and non-drifting, the fixed-location assumption holds. If the run sequence plot has a vertical spread that is about the same over the entire plot, then the fixed-variation assumption holds.

Lag Plot: If the lag plot is structureless, then the randomness assumption holds.

Histogram: If the histogram is bell-shaped, the underlying distribution is symmetric and perhaps approximately normal.

Normal Probability Plot: If the normal probability plot is linear, the underlying distribution is approximately normal.

If all four of the assumptions hold, then the process is said, by definition, to be "in statistical control".


1.2.5. Consequences




What If Assumptions Do Not Hold?

If some of the underlying assumptions do not hold, what can be done about it? What corrective actions can be taken? The positive way of approaching this is to view the testing of underlying assumptions as a framework for learning about the process. Assumption-testing promotes insight into important aspects of the process that may not have surfaced otherwise.

Primary Goal is Correct and Valid Scientific Conclusions

The primary goal is to have correct, validated, and complete scientific/engineering conclusions flowing from the analysis. This usually includes intermediate goals such as the derivation of a good-fitting model and the computation of realistic parameter estimates. It should always include the ultimate goal of an understanding and a "feel" for "what makes the process tick". There is no more powerful catalyst for discovery than the bringing together of an experienced/expert scientist/engineer and a data set ripe with intriguing "anomalies" and characteristics.

Consequences of Invalid Assumptions

The following sections discuss in more detail the consequences of invalid assumptions:

1. Consequences of non-randomness
2. Consequences of non-fixed location parameter
3. Consequences of non-fixed variation
4. Consequences related to distributional assumptions


1.2.5.1. Consequences of Non-Randomness




Randomness Assumption

There are four underlying assumptions:

1. randomness;
2. fixed location;
3. fixed variation; and
4. fixed distribution.

The randomness assumption is the most critical but the least tested.

Consequences of Non-Randomness

If the randomness assumption does not hold, then

1. All of the usual statistical tests are invalid.
2. The calculated uncertainties for commonly used statistics become meaningless.
3. The calculated minimal sample size required for a pre-specified tolerance becomes meaningless.
4. The simple model: y = constant + error becomes invalid.
5. The parameter estimates become suspect and non-supportable.

Non-Randomness Due to Autocorrelation

One specific and common type of non-randomness is autocorrelation. Autocorrelation is the correlation between Y_t and Y_{t-k}, where k is an integer that defines the lag for the autocorrelation. That is, autocorrelation is a time dependent non-randomness. This means that the value of the current point is highly dependent on the previous point if k = 1 (or on the point k observations earlier if k is not 1). Autocorrelation is typically detected via an autocorrelation plot or a lag plot.
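The lag-k correlation described above can be estimated directly from the data. The sketch below uses a commonly used sample estimator (deviations from the overall mean, normalized by the total sum of squares); the function name `autocorrelation` and both simulated series are assumptions introduced for illustration.

```python
import random


def autocorrelation(y, k):
    """Sample autocorrelation between Y(t) and Y(t-k) for an integer lag k >= 0."""
    n = len(y)
    mean = sum(y) / n
    denom = sum((v - mean) ** 2 for v in y)
    numer = sum((y[t] - mean) * (y[t - k] - mean) for t in range(k, n))
    return numer / denom


random.seed(2)
noise = [random.gauss(0.0, 1.0) for _ in range(500)]

# Hypothetical autocorrelated series: each value carries over 0.9 of the previous one.
carried = [noise[0]]
for e in noise[1:]:
    carried.append(0.9 * carried[-1] + e)

print("lag-1, random series        :", round(autocorrelation(noise, 1), 3))
print("lag-1, autocorrelated series:", round(autocorrelation(carried, 1), 3))
```

For the random series the lag-1 value should be close to zero, while for the carried-over series it should be large and positive, matching the description of the current point depending strongly on the previous one.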

If the data are not random due to autocorrelation, then

1. Adjacent data values may be related.
2. There may not be n independent snapshots of the phenomenon under study.
3. There may be undetected "junk"-outliers.
4. There may be undetected "information-rich"-outliers.




1.2.5.2. Consequences of Non-Fixed Location Parameter




Location Estimate

The usual estimate of location is the mean

$$ \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i $$

from N measurements Y_1, Y_2, ... , Y_N.

Consequences of Non-Fixed Location

If the run sequence plot does not support the assumption of fixed location, then

1. The location may be drifting.

2. The single location estimate may be meaningless (if the process is drifting).

3. The choice of location estimator (e.g., the sample mean) may be sub-optimal.

4. The usual formula for the uncertainty of the mean:

$$ s(\bar{Y}) = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N(N-1)}} $$

where \bar{Y} is the sample mean, may be invalid and the numerical value optimistically small.

5. The location estimate may be poor.

6. The location estimate may be biased.
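A small simulation can make the first two consequences concrete. The sketch below is an assumption-laden illustration: both series are simulated, the drift rate of 0.05 per observation is invented, and comparing the two halves of the run is used only as a crude numeric stand-in for inspecting a run sequence plot.

```python
import math
import random
import statistics

random.seed(3)
n = 100

# Hypothetical data: one process with fixed location, one whose location drifts upward.
fixed = [random.gauss(5.0, 1.0) for _ in range(n)]
drifting = [random.gauss(5.0 + 0.05 * i, 1.0) for i in range(n)]


def mean_and_uncertainty(y):
    """Return the sample mean and the usual standard deviation of the mean, s / sqrt(N)."""
    return statistics.fmean(y), statistics.stdev(y) / math.sqrt(len(y))


for name, y in (("fixed", fixed), ("drifting", drifting)):
    ybar, s_ybar = mean_and_uncertainty(y)
    first = statistics.fmean(y[: n // 2])
    second = statistics.fmean(y[n // 2 :])
    print(f"{name:9s} mean={ybar:.2f} +/- {s_ybar:.2f}  "
          f"first half={first:.2f}  second half={second:.2f}")
```

For the drifting series the two half-run means differ by far more than the quoted uncertainty of the overall mean, so a single location estimate does not describe the process.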


1.2.5.3. Consequences of Non-Fixed Variation Parameter




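Variation Estimate

The usual estimate of variation is the standard deviation

$$ s_Y = \sqrt{\frac{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}{N-1}} $$

from N measurements Y_1, Y_2, ... , Y_N, where \bar{Y} is the sample mean.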

Consequences of Non-Fixed Variation

If the run sequence plot does not support the assumption of fixed variation, then

1. The variation may be drifting.

2. The single variation estimate may be meaningless (if the process variation is drifting).

3. The variation estimate may be poor.

4. The variation estimate may be biased.
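The same idea applies to variation. In the sketch below, which uses simulated data with an invented, steadily growing spread, the single overall standard deviation is compared with standard deviations computed on four consecutive segments of the run; the segment count is an arbitrary choice for this example.

```python
import random
import statistics

random.seed(4)

# Hypothetical data: the spread grows steadily from 1 to about 4 over the run.
n = 200
y = [random.gauss(0.0, 1.0 + 3.0 * i / n) for i in range(n)]

overall_sd = statistics.stdev(y)
quarter = n // 4
segment_sds = [statistics.stdev(y[j : j + quarter]) for j in range(0, n, quarter)]

print("overall SD :", round(overall_sd, 2))
print("segment SDs:", [round(s, 2) for s in segment_sds])
```

The overall value falls between the early and late segment values, so it describes neither part of the run well.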


1.2.5.4. Consequences Related to Distributional Assumptions




Distributional Analysis

Scientists and engineers routinely use the mean (average) to estimate the "middle" of a distribution. It is not so well known that the variability and the noisiness of the mean as a location estimator are intrinsically linked with the underlying distribution of the data. For certain distributions, the mean is a poor choice. For any given distribution, there exists an optimal choice--that is, the estimator with minimum variability/noisiness. This optimal choice may be, for example, the median, the midrange, the midmean, the mean, or something else. The implication of this is to "estimate" the distribution first, and then--based on the distribution--choose the optimal estimator. The resulting engineering parameter estimators will have less variability than if this approach is not followed.
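The dependence of the best location estimator on the distribution can be seen in a small simulation. The sketch below is illustrative only: it assumes uniformly distributed data, a sample size of 50, and 2000 repeated samples, and compares how much the mean, the median, and the midrange vary from sample to sample.

```python
import random
import statistics

random.seed(5)


def midrange(y):
    """Average of the smallest and largest observations."""
    return (min(y) + max(y)) / 2.0


estimators = {"mean": statistics.fmean, "median": statistics.median, "midrange": midrange}

# Hypothetical experiment: 2000 samples of size 50 from a uniform(0, 1) distribution.
samples = [[random.random() for _ in range(50)] for _ in range(2000)]

# Variability of each location estimator across the repeated samples.
spread = {
    name: statistics.stdev([est(s) for s in samples])
    for name, est in estimators.items()
}

for name, sd in sorted(spread.items(), key=lambda item: item[1]):
    print(f"{name:9s} standard deviation across samples = {sd:.4f}")
```

For uniform data the midrange varies least and the median most, so the mean is not the least noisy choice here; for other distributions the ordering changes.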

Case Studies

The airplane glass failure case study gives an example of determining an appropriate distribution and estimating the parameters of that distribution. The uniform random numbers case study gives an example of determining a more appropriate centrality parameter for a non-normal distribution.

Other consequences that flow from problems with distributional assumptions are:

Distribution

1. The distribution may be changing.
2. The single distribution estimate may be meaningless (if the process distribution is changing).
3. The distribution may be markedly non-normal.
4. The distribution may be unknown.
5. The true probability distribution for the error may remain unknown.

Model

1. The model may be changing.
2. The single model estimate may be meaningless.
3. The default model Y = constant + error may be invalid.
4. If the default model is insufficient, information about a better model may remain undetected.
5. A poor deterministic model may be fit.
6. Information about an improved model may go undetected.

Process

1. The process may be out-of-control.
2. The process may be unpredictable.
3. The process may be un-modelable.


1.3. EDA Techniques




Summary

After you have collected a set of data, how do you do an exploratory data analysis? What techniques do you employ? What do the various techniques focus on? What conclusions can you expect to reach?

This section provides answers to these kinds of questions via a gallery of EDA techniques and a detailed description of each technique. The techniques are divided into graphical and quantitative techniques. For exploratory data analysis, the emphasis is primarily on the graphical techniques.

1. Introduction
2. Analysis Questions
3. Graphical Techniques: Alphabetical
4. Graphical Techniques: By Problem Category
5. Quantitative Techniques: Alphabetical
6. Probability Distributions


1.3.1. Introduction



Graphical and Quantitative Techniques

This section describes many techniques that are commonly used in exploratory and classical data analysis. This list is by no means meant to be exhaustive. Additional techniques (both graphical and quantitative) are discussed in the other chapters. Specifically, the product comparisons chapter has a much more detailed description of many classical statistical techniques.

EDA emphasizes graphical techniques while classical techniques emphasize quantitative techniques. In practice, an analyst typically uses a mixture of graphical and quantitative techniques. In this section, we have divided the descriptions into graphical and quantitative techniques. This is for organizational clarity and is not meant to discourage the use of both graphical and quantitative techniques when analyzing data.

Use of Techniques Shown in Case Studies

This section emphasizes the techniques themselves: how the graph or test is defined, published references, and sample output. The use of the techniques to answer engineering questions is demonstrated in the case studies section. The case studies do not demonstrate all of the techniques.

Availability in Software

The sample plots and output in this section were generated with the Dataplot software program. Other general purpose statistical data analysis programs can generate most of the plots, intervals, and tests discussed here, or macros can be written to achieve the same result.

1.3.2. Analysis Questions




EDA Questions

Some common questions that exploratory data analysis is used to answer are:

1. What is a typical value?
2. What is the uncertainty for a typical value?
3. What is a good distributional fit for a set of numbers?
4. What is a percentile?
5. Does an engineering modification have an effect?
6. Does a factor have an effect?
7. What are the most important factors?
8. Are measurements coming from different laboratories equivalent?
9. What is the best function for relating a response variable to a set of factor variables?
10. What are the best settings for factors?
11. Can we separate signal from noise in time dependent data?
12. Can we extract any structure from multivariate data?
13. Does the data have outliers?

Analyst Should Identify Relevant Questions for his Engineering Problem

A critical early step in any analysis is to identify (for the engineering problem at hand) which of the above questions are relevant. That is, we need to identify which questions we want answered and which questions have no bearing on the problem at hand. After collecting such a set of questions, an equally important step, which is invaluable for maintaining focus, is to prioritize those questions in decreasing order of importance. EDA techniques are tied in with each of the questions. There are some EDA techniques (e.g., the scatter plot) that are broad-brushed and apply almost universally. On the other hand, there are a large number of EDA techniques that are specific and whose specificity is tied in with one of the above questions. Clearly if one chooses not to explicitly identify relevant questions, then one cannot take advantage of these question-specific EDA techniques.

EDA Approach Emphasizes Graphics

Most of these questions can be addressed by techniques discussed in this chapter. The process modeling and process improvement chapters also address many of the questions above. These questions are also relevant for the classical approach to statistics. What distinguishes the EDA approach is an emphasis on graphical techniques to gain insight as opposed to the classical approach of quantitative tests. Most data analysts will use a mix of graphical and classical quantitative techniques to address these problems.


1.3.3. Graphical Techniques: Alphabetic




This section provides a gallery of some useful graphical techniques. The techniques are ordered alphabetically, so this section is not intended to be read in a sequential fashion. The use of most of these graphical techniques is demonstrated in the case studies in this chapter. A few of these graphical techniques are demonstrated in later chapters.

Autocorrelation Plot: 1.3.3.1
Bihistogram: 1.3.3.2
Block Plot: 1.3.3.3
Bootstrap Plot: 1.3.3.4
Box-Cox Linearity Plot: 1.3.3.5
Box-Cox Normality Plot: 1.3.3.6
Box Plot: 1.3.3.7
Complex Demodulation Amplitude Plot: 1.3.3.8
Complex Demodulation Phase Plot: 1.3.3.9
Contour Plot: 1.3.3.10
DOE Scatter Plot: 1.3.3.11
DOE Mean Plot: 1.3.3.12
DOE Standard Deviation Plot: 1.3.3.13
Histogram: 1.3.3.14
Lag Plot: 1.3.3.15
Linear Correlation Plot: 1.3.3.16
Linear Intercept Plot: 1.3.3.17
Linear Slope Plot: 1.3.3.18
Linear Residual Standard Deviation Plot: 1.3.3.19
Mean Plot: 1.3.3.20
Normal Probability Plot: 1.3.3.21
Probability Plot: 1.3.3.22
Probability Plot Correlation Coefficient Plot: 1.3.3.23
Quantile-Quantile Plot: 1.3.3.24
Run Sequence Plot: 1.3.3.25
Scatter Plot: 1.3.3.26
Spectrum: 1.3.3.27
Standard Deviation Plot: 1.3.3.28
Star Plot: 1.3.3.29
Weibull Plot: 1.3.3.30
Youden Plot: 1.3.3.31
4-Plot: 1.3.3.32
6-Plot: 1.3.3.33

1.3.3.1. Autocorrelation Plot




Purpose: Check Randomness

Autocorrelation plots (Box and Jenkins, pp. 28-32) are a commonly used tool for checking randomness in a data set. This randomness is ascertained by computing autocorrelations for data values at varying time lags. If random, such autocorrelations should be near zero for any and all time-lag separations. If non-random, then one or more of the autocorrelations will be significantly non-zero.

In addition, autocorrelation plots are used in the model identification stage for Box-Jenkins autoregressive, moving average time series models.

Sample Plot: Autocorrelations should be near zero for randomness. Such is not the case in this example, and thus the randomness assumption fails.

This sample autocorrelation plot shows that the time series is not random, but rather has a high degree of autocorrelation between adjacent and near-adjacent observations.

Definition: r(h) versus h

Autocorrelation plots are formed by

Vertical axis: Autocorrelation coefficient

$$R_h = C_h / C_0$$

where $C_h$ is the autocovariance function

$$C_h = \frac{1}{N} \sum_{t=1}^{N-h} (Y_t - \bar{Y})(Y_{t+h} - \bar{Y})$$

and $C_0$ is the variance function

$$C_0 = \frac{1}{N} \sum_{t=1}^{N} (Y_t - \bar{Y})^2$$

Note: $R_h$ is between -1 and +1.

Note: Some sources may use the following formula for the autocovariance function

$$C_h = \frac{1}{N-h} \sum_{t=1}^{N-h} (Y_t - \bar{Y})(Y_{t+h} - \bar{Y})$$

Although this definition has less bias, the (1/N) formulation has some desirable statistical properties and is the form most commonly used in the statistics literature. See pages 20 and 49-50 in Chatfield for details.

Horizontal axis: Time lag h (h = 1, 2, 3, ...)

The above plot also contains several horizontal reference lines. The middle line is at zero. The other four lines are the 95 % and 99 % confidence bands. Note that there are two distinct formulas for generating the confidence bands.

1. If the autocorrelation plot is being used to test for randomness (i.e., there is no time dependence in the data), the following formula is recommended:

$$\pm \frac{z_{1-\alpha/2}}{\sqrt{N}}$$

where $N$ is the sample size, $z$ is the percent point function (the inverse of the cumulative distribution function) of the standard normal distribution, and $\alpha$ is the significance level. In this case, the confidence bands have a fixed width that depends on the sample size. This is the formula that was used to generate the confidence bands in the above plot.

2. Autocorrelation plots are also used in the model identification stage for fitting ARIMA models. In this case, a moving average model is assumed for the data and the following confidence bands should be generated:

$$\pm z_{1-\alpha/2} \sqrt{\frac{1}{N} \left( 1 + 2 \sum_{i=1}^{k} R_i^2 \right)}$$

where $k$ is the lag, $N$ is the sample size, $R_i$ is the autocorrelation coefficient at lag $i$, $z$ is the percent point function (the inverse of the cumulative distribution function) of the standard normal distribution, and $\alpha$ is the significance level. In this case, the confidence bands widen as the lag increases.
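The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the handbook's own implementation: the function names are hypothetical, the (1/N) autocovariance is used, and the lag-dependent band is assumed to sum squared autocorrelations through lag k (some references sum only through lag k-1).

```python
from math import sqrt
from statistics import NormalDist, fmean


def autocorrelation(y, max_lag):
    """Return [R_0, R_1, ..., R_max_lag] using the (1/N) autocovariance."""
    n = len(y)
    mean = fmean(y)
    dev = [v - mean for v in y]
    c0 = sum(d * d for d in dev) / n  # variance function C_0
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / n / c0  # C_h / C_0
        for h in range(max_lag + 1)
    ]


def randomness_band(n, alpha=0.05):
    """Half-width of the fixed band used when testing for randomness."""
    return NormalDist().inv_cdf(1 - alpha / 2) / sqrt(n)


def moving_average_band(r, k, n, alpha=0.05):
    """Half-width at lag k under an assumed moving average model.

    Assumption: squared autocorrelations are summed through lag k.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt((1 + 2 * sum(r[i] ** 2 for i in range(1, k + 1))) / n)


# A short trending series: adjacent values are strongly related.
y = list(range(1, 11))
r = autocorrelation(y, 3)
fixed = randomness_band(len(y))
widening = [moving_average_band(r, k, len(y)) for k in range(1, 4)]
print(r, fixed, widening)
```

For this ten-point trend the lag-1 autocorrelation exceeds the fixed 95 % band, so the randomness assumption would be rejected, and the lag-dependent bands grow with each lag.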

Questions

The autocorrelation plot can provide answers to the following questions:

1. Are the data random?
2. Is an observation related to an adjacent observation?
3. Is an observation related to an observation twice-removed? (etc.)
4. Is the observed time series white noise?
5. Is the observed time series sinusoidal?
6. Is the observed time series autoregressive?
7. What is an appropriate model for the observed time series?
8. Is the model

   Y = constant + error

   valid and sufficient?
9. Is the formula $s_{\bar{Y}} = s / \sqrt{N}$ valid?

Importance: Ensure validity of engineering conclusions

Randomness (along with fixed model, fixed variation, and fixed distribution) is one of the four assumptions that typically underlie all measurement processes. The randomness assumption is critically important for the following three reasons:

1. Most standard statistical tests depend on randomness. The validity of the test conclusions is directly linked to the validity of the randomness assumption.

2. Many commonly used statistical formulae depend on the randomness assumption, the most common being the formula for determining the standard deviation of the sample mean:

   $$s_{\bar{Y}} = s / \sqrt{N}$$

   where $s$ is the standard deviation of the data. Although heavily used, the results from using this formula are of no value unless the randomness assumption holds.

3. For univariate data, the default model is

   Y = constant + error

   If the data are not random, this model is incorrect and invalid, and the estimates for the parameters (such as the constant) become nonsensical and invalid.

In short, if the analyst does not check for randomness, then the validity of many of the statistical conclusions becomes suspect. The autocorrelation plot is an excellent way of checking for such randomness.
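The second reason can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the first-order autoregressive series with coefficient 0.9, the sample size, the replicate count, and all function names are hypothetical choices, not taken from the handbook. It compares the average value of s / sqrt(N) with the spread of the sample means actually observed across many simulated series.

```python
import random
from math import sqrt
from statistics import fmean, stdev


def simulate_series(n, phi, rng):
    """Y_t = phi * Y_(t-1) + e_t with standard normal e_t; phi = 0 gives random data."""
    y, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        y.append(prev)
    return y


def compare(n, phi, replicates, seed):
    """Return (average s / sqrt(N), observed standard deviation of the sample means)."""
    rng = random.Random(seed)
    means, formula = [], []
    for _ in range(replicates):
        y = simulate_series(n, phi, rng)
        means.append(fmean(y))
        formula.append(stdev(y) / sqrt(n))
    return fmean(formula), stdev(means)


random_formula, random_actual = compare(n=100, phi=0.0, replicates=500, seed=1)
ar_formula, ar_actual = compare(n=100, phi=0.9, replicates=500, seed=1)

print(f"random data:         formula {random_formula:.3f}, observed {random_actual:.3f}")
print(f"autocorrelated data: formula {ar_formula:.3f}, observed {ar_actual:.3f}")
```

For the random series the two numbers should roughly agree, while for the strongly autocorrelated series the formula should understate the observed variability of the mean by a wide margin.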

Examples

Examples of the autocorrelation plot for several common situations are given in the following pages.

1. Random (= White Noise)
2. Weak autocorrelation
3. Strong autocorrelation and autoregressive model
4. Sinusoidal model

Related Techniques

Partial Autocorrelation Plot
Lag Plot
Spectral Plot
Seasonal Subseries Plot

Case Study

The autocorrelation plot is demonstrated in the beam deflection data case study.

Software

Autocorrelation plots are available in most general purpose statistical software programs.


1.3.3.1.1. Autocorrelation Plot: Random Data



Autocorrelation Plot

The following is a sample autocorrelation plot.

Conclusions

We can make the following conclusions from this plot.

1. There are no significant autocorrelations.
2. The data are random.

Discussion

Note that with the exception of lag 0, which is always 1 by definition, almost all of the autocorrelations fall within the 95 % confidence limits. In addition, there is no apparent pattern (such as the first twenty-five being positive and the second twenty-five being negative). This absence of a pattern is what we expect to see if the data are in fact random.

A few lags slightly outside the 95 % and 99 % confidence limits do not necessarily indicate non-randomness. For a 95 % confidence interval, we might expect about one out of twenty lags to be statistically significant due to random fluctuations.
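The one-in-twenty expectation can be checked with a rough simulation. The sketch below is an assumption-laden illustration rather than part of the handbook: it generates standard normal white noise, uses the (1/N) autocorrelation with the fixed band z(1 - alpha/2) / sqrt(N), and the series length, number of lags, number of series, and seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def lag_autocorrelations(y, max_lag):
    """Return [R_1, ..., R_max_lag] using the (1/N) autocovariance."""
    n = len(y)
    mean = fmean(y)
    dev = [v - mean for v in y]
    c0 = sum(d * d for d in dev) / n
    return [
        sum(dev[t] * dev[t + h] for t in range(n - h)) / n / c0
        for h in range(1, max_lag + 1)
    ]


def fraction_outside(n, max_lag, n_series, seed, alpha=0.05):
    """Fraction of (series, lag) pairs whose autocorrelation falls outside the band."""
    rng = random.Random(seed)
    band = NormalDist().inv_cdf(1 - alpha / 2) / sqrt(n)
    outside = 0
    for _ in range(n_series):
        y = [rng.gauss(0.0, 1.0) for _ in range(n)]
        outside += sum(abs(r) > band for r in lag_autocorrelations(y, max_lag))
    return outside / (n_series * max_lag)


fraction = fraction_outside(n=200, max_lag=20, n_series=200, seed=2012)
print(f"fraction of lags outside the 95 % band: {fraction:.3f}")
```

The printed fraction should land in the neighborhood of 0.05, which is why an isolated excursion beyond the band in otherwise patternless data is not, by itself, evidence of non-randomness.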
