Descriptive Statistics
Summarizing, Simplifying – useful for comprehending data, and thus making meaningful interpretations, particularly in medium to large data sets.
Describing – useful for recognizing important characteristics of data; used in inferential statistics.
Important Characteristics of Data
Center – typical data value
Variation – spread in data
Distribution – shape of data distribution
Outliers – problems in data
Time – changes over time?
Graphical Summary Methods
Pie Chart – useful for qualitative or quantitative data
Bar Chart – useful for qualitative data; called a Pareto chart if the bars are ordered by height, from tallest to shortest
Graphical Summary Methods
Frequency Histogram
Useful for quantitative data.
A “connected bar plot” with bar height proportional to the frequency of the associated value or class (interval of values).
Graphical summary of a frequency distribution (sometimes called a frequency table).
Frequency Distribution (Table)
For Discrete Data: Lists data values and corresponding counts. The resulting histogram has a bar on each value with height proportional to its count.
For Continuous Data: Data is divided into classes (intervals of values) and the classes are listed along with the corresponding counts.
Definitions for Classes
Lower Class Limit – smallest value in a class
Upper Class Limit – largest value in a class
Class Width – distance between consecutive lower (or upper) class limits
Class Mark – midpoint of class (calculated as the mean of the lower and upper class limits)
Class Boundaries – values that eliminate the space between consecutive classes for plotting purposes
Constructing a Frequency Table
Select the class width (w). Approximated by the range divided by the desired number of classes (usually between 5 and 15 in medium-sized data sets):
$w \approx \dfrac{\max - \min}{\#\,\text{classes}}$
Select the lower class limit for the first class.
Construct class limits using w as the distance between consecutive lower or upper class limits.
Count the number of observations in each class.
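The construction steps can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: it assumes integer-valued data, starts the first class at the minimum, and rounds the width up so the last class always contains the maximum. The function name `frequency_table` and the `scores` data are hypothetical.

```python
def frequency_table(data, num_classes):
    """Return (class width, rows), where each row is (lower limit, upper limit, count).

    Assumes integer-valued data; the first lower class limit is the minimum.
    """
    lo, hi = min(data), max(data)
    # Range divided by the number of classes, rounded up to the next integer
    # so that num_classes classes are guaranteed to cover the maximum.
    width = (hi - lo) // num_classes + 1
    rows = []
    for i in range(num_classes):
        lower = lo + i * width      # consecutive lower limits are w apart
        upper = lower + width - 1   # upper limit for integer data
        count = sum(1 for x in data if lower <= x <= upper)
        rows.append((lower, upper, count))
    return width, rows


# Hypothetical exam scores, used only for illustration.
scores = [52, 55, 61, 64, 67, 70, 71, 73, 75, 78, 80, 82, 85, 88, 91, 94, 99]
width, table = frequency_table(scores, 5)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

Every observation lands in exactly one class, so the counts add up to the number of observations.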
Types of Histograms
Frequency – height of bar is count
Relative Frequency – height of bar is relative frequency (proportion/percentage/probability)
Cumulative Frequency – height of bar is cumulative count
Cumulative Relative Frequency – height of bar is cumulative relative frequency (percentile)
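The four bar heights can all be derived from the class counts. A short Python sketch, assuming a hypothetical list of counts (one per class, in class order):

```python
from itertools import accumulate

# Hypothetical class counts, listed in class order.
counts = [3, 4, 4, 4, 2]
n = sum(counts)

relative = [c / n for c in counts]                  # proportion in each class
cumulative = list(accumulate(counts))               # running total of counts
cumulative_relative = [c / n for c in cumulative]   # running proportion

for row in zip(counts, relative, cumulative, cumulative_relative):
    print("{:>3} {:.3f} {:>3} {:.3f}".format(*row))
```

The last cumulative count equals n and the last cumulative relative frequency equals 1.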
Other Types of Graphs
Dotplot
Each value is plotted as a dot along an x-axis. Dots representing equal values are stacked.
Stem-and-Leaf Plot
Each value is separated into a stem (such as the leftmost digit or digits) and a leaf (such as the rightmost digit or digits).
Stems are listed in order and leaves are plotted alongside the appropriate stem.
Ordered Stem-and-Leaf Plot
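A small Python sketch of an ordered stem-and-leaf plot, assuming non-negative integers with the last digit as the leaf and the remaining digits as the stem. The function name `stem_and_leaf` is hypothetical, and stems with no leaves are simply omitted in this version.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return ordered stem-and-leaf rows as strings (leaf = last digit)."""
    rows = defaultdict(list)
    for v in sorted(values):            # sorting keeps the leaves in order
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(rows.items())
    ]


for line in stem_and_leaf([31, 47, 23, 38, 25, 40, 31]):
    print(line)
```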
Other Types of Graphs
Scatter Diagram or Scatter Plot
Plot of paired (x,y) data with x on the horizontal axis and y on the vertical axis. Useful for seeing the relationship between x and y.
Time-Series Plot
A special scatter diagram which has time plotted on the horizontal axis.
Importance of Knowing the Distribution of Data
Distribution can affect the choice of an appropriate statistic to use.
Distribution can aid in determining the validity of many inferential statistics.
Common data distributions: bell (normal), bi-modal, right-skewed (chi-squared, exponential), left-skewed
Numerical Summary Methods
Measures of Center (Location) – the middle value or typical observation from a population.
Measures of Variability – the dispersion or spread in the population.
Measures of Relative Standing – the comparative value relative to the population.
Measures of Center
Mean (Arithmetic Mean)
The size of the population is denoted by N. The sample size is denoted by n.
Population Mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$
Sample Mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Measures of Center
Median
Middle value in the ordered data for odd n.
Mean of the 2 middle values for even n.
Commonly called the 50th percentile.
The location of the median in the ordered data set is: (n+1)÷2
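The mean and the median rule can be sketched in Python as below. The function names are illustrative. The location (n+1)÷2 counts positions from 1, so it is converted to Python's 0-based indexing.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by how many there are."""
    return sum(values) / len(values)


def median(values):
    """Middle value for odd n; mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Location (n+1)/2 in 1-based counting is index n//2 in 0-based counting.
        return ordered[mid]
    # Location (n+1)/2 falls halfway between positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([7, 1, 3, 9, 5]))   # odd n: middle of 1 3 5 7 9
print(median([1, 3, 5, 7]))      # even n: mean of 3 and 5
```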
Measures of Center
Mode – most common value (occurs most frequently)
Midrange – midway between the lowest and highest value
Trimmed Mean – mean of values remaining after an equal number of values are removed from each tail
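These three measures can be sketched in Python as follows. The names are illustrative, and the choice to return every tied value as a mode is an assumption of this sketch. If every value occurs once, all values are returned, which is the case usually described as "no mode".

```python
from collections import Counter


def modes(values):
    """Return all values tied for the highest count, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Midway between the lowest and highest value."""
    return (min(values) + max(values)) / 2


def trimmed_mean(values, k):
    """Mean after removing the k smallest and the k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data with one extreme value.
data = [2, 3, 3, 4, 5, 5, 40]
print(modes(data), midrange(data), trimmed_mean(data, 1))
```

In this example the extreme value 40 pulls the midrange far from the bulk of the data, while the trimmed mean ignores it.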
Skewness
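[Figure: three distribution shapes showing the relative positions of the mean, median and mode]
SYMMETRIC: Mode = Mean = Median
SKEWED LEFT (negatively): the mean and median typically lie to the left of the mode
SKEWED RIGHT (positively): the mean and median typically lie to the right of the mode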
Compare Measures of Center
                  Mean        Median  Mode                        Midrange    Trimmed Mean
Values utilized?  All         Middle  Most Common                 Extreme     All but extreme
Outliers?         Not Robust  Robust  Robust                      Not Robust  Robust
Exists?           Unique      Unique  May not exist or be unique  Unique      Unique
Data type?        Quan.       Quan.   Any                         Quan.       Quan.
Measures of Variation
Range – distance between minimum and maximum: Range = Max – Min
The range does not measure the overall variability in the data. A measure is needed which incorporates the variability of every value in the data. One way is to look at the deviations from the mean, $(x_i - \mu)$, for each $x_i$.
Measures of Variation
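Variance
The average squared difference of the observations from the mean.
Population Variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample Variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Calculator Formula for Sample Variance: $s^2 = \dfrac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$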
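The definitional and calculator formulas give the same sample variance. A minimal Python sketch checks this on a small hypothetical data set (the function names are illustrative):

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_calculator(values):
    """Calculator form: uses only n, the sum of x and the sum of x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


# Hypothetical data: mean 5, sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))          # 32 / 8
print(sample_variance(data))              # 32 / 7
print(sample_variance_calculator(data))   # (8*232 - 40**2) / (8*7) = 256/56
```

The calculator form needs only running totals, but with floating-point data it can lose precision when the mean is large relative to the spread.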
Measures of Variation
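Standard Deviation
The square root of the average squared difference of the observations from the mean.
Population Standard Deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample Standard Deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Calculator Formula for Sample Standard Deviation: $s = \sqrt{\dfrac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}}$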
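Python's standard `statistics` module provides both versions: `pstdev` divides by N and `stdev` divides by n - 1. A brief check on hypothetical data:

```python
import math
import statistics

# Hypothetical data: mean 5, sum of squared deviations 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sigma = statistics.pstdev(data)   # population standard deviation, sqrt(32/8)
s = statistics.stdev(data)        # sample standard deviation, sqrt(32/7)

print(sigma, s, math.sqrt(32 / 7))
```

The sample value is slightly larger because it divides by n - 1 instead of N.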
Empirical Rule
For data that is approximately bell-shaped in distribution, approximately:
68% of data values fall within 1 standard deviation of the mean,
95.4% of data values fall within 2 standard deviations of the mean,
99.7% of data values fall within 3 standard deviations of the mean.
[Figure 2-13: The Empirical Rule (applies to bell-shaped distributions). Horizontal axis marked at $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$. 68% within 1 standard deviation (34% on each side of the mean); 95% within 2 standard deviations (a further 13.5% on each side); 99.7% of data are within 3 standard deviations of the mean (a further 2.4% on each side); 0.1% in each tail beyond 3 standard deviations.]
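The percentages in the Empirical Rule come from the normal distribution. As a sketch, assuming an exactly normal distribution (real bell-shaped data will only approximate it), they can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {proportion_within(k):.1%}")
```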
Chebyshev’s Theorem
For data from any distribution, the proportion (or fraction) of values lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K is any positive number greater than 1.
For K = 2, at least 3/4 (75%) of all values lie within 2 standard deviations of the mean.
For K = 3, at least 8/9 (89%) of all values lie within 3 standard deviations of the mean.
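The bound is simple to evaluate for any K greater than 1. A small Python sketch using exact fractions (the function name `chebyshev_lower_bound` is illustrative):

```python
from fractions import Fraction


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires K > 1")
    return 1 - 1 / Fraction(k) ** 2


for k in (2, 3, 4):
    bound = chebyshev_lower_bound(k)
    print(f"K = {k}: at least {bound} ({float(bound):.1%})")
```

These are guaranteed minimums for any distribution, so they are lower than the Empirical Rule percentages, which apply only to bell-shaped data.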
Coefficient of Variation
Relates the standard deviation of a data set to its mean:
$CV = \dfrac{s}{\bar{x}} \times 100$
The CV is useful for comparing relative variation between two or more sets of data.
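As an illustration of comparing relative variation across data sets measured in different units, here is a Python sketch. The two data sets are hypothetical, and the sample standard deviation is used.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical measurements in different units.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(f"heights: {cv_heights:.2f}%  weights: {cv_weights:.2f}%")
```

Because the CV is unit-free, the two results can be compared directly: in this example the weights vary more, relative to their mean, than the heights do.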
Measures of Relative Position
Standard Score or Z-Score
$Z = \dfrac{x - \text{mean}}{\text{standard deviation}}$, that is, $z = \dfrac{x - \bar{x}}{s}$ or $z = \dfrac{x - \mu}{\sigma}$
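A z-score counts how many standard deviations a value lies above or below the mean. A minimal Python sketch with hypothetical numbers:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / sd


# Hypothetical exam: mean 70, standard deviation 10.
print(z_score(85, 70, 10))   # 1.5 standard deviations above the mean
print(z_score(60, 70, 10))   # 1 standard deviation below the mean
```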
Measures of Relative Position
Order Statistics
The order statistics, denoted by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, are the observed data values ordered from smallest to largest.
Measures of Relative Position
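Percentile
The kth percentile ($P_k$) separates the bottom k% of data from the top (100-k)% of data.
The location L of $P_k$ in the order statistics is:
$L = \begin{cases} \dfrac{nk}{100} + 0.5 & \text{if } \dfrac{nk}{100} \text{ is an integer} \\ \operatorname{Ceiling}\left(\dfrac{nk}{100}\right) & \text{if } \dfrac{nk}{100} \text{ is not an integer} \end{cases}$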
Finding the Value of the kth Percentile (Figure 2-15)
Start: Sort the data. (Arrange the data in order of lowest to highest.)
Compute $L = \left(\dfrac{k}{100}\right) n$, where n = number of values and k = percentile in question.
Is L a whole number?
Yes: The value of the kth percentile is midway between the Lth value and the next value in the sorted set of data. Find $P_k$ by adding the Lth value and the next value and dividing the total by 2.
No: Change L by rounding it up to the next larger whole number. The value of $P_k$ is the Lth value, counting from the lowest.
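The procedure for finding the kth percentile can be sketched in Python as below. The function name `percentile` is illustrative, and k is assumed to lie strictly between 0 and 100. Other software often uses different interpolation rules, so results may differ slightly from library functions.

```python
import math


def percentile(values, k):
    """Value of the kth percentile using the whole-number / round-up rule."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(values)        # sort lowest to highest
    n = len(ordered)
    L = n * k / 100                 # L = (k/100) * n
    if L.is_integer():
        L = int(L)
        # Midway between the Lth value and the next (positions count from 1).
        return (ordered[L - 1] + ordered[L]) / 2
    L = math.ceil(L)                # round up to the next whole number
    return ordered[L - 1]           # the Lth value, counting from the lowest


# Hypothetical data, n = 8.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(percentile(data, 25))   # L = 2: midway between the 2nd and 3rd values
print(percentile(data, 50))   # L = 4: midway between the 4th and 5th values
print(percentile(data, 90))   # L = 7.2, rounded up to 8: the 8th value
```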
Measures of Relative Position
Quartiles
The quartiles (Q1=P25, Q2=P50 and Q3=P75) separate the data into fourths.
Interquartile Range (IQR)
The distance between the first and third quartiles: IQR=Q3-Q1.
The IQR is a measure of variability which is less affected by outliers than the range, variance and standard deviation.
Box-and-Whisker Diagram (Boxplot)
Graphical display of the “5 Number Summary”: X(1) = Min, Q1 = P25, Q2 = P50, Q3 = P75, X(n) = Max
Inner & Outer Fences
Useful for identifying potential outliers in data.
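The slides do not define the fences. The Python sketch below assumes one common convention: inner fences 1.5 × IQR beyond the quartiles and outer fences 3 × IQR beyond them. The quartiles are passed in as inputs, and the function names and data are hypothetical.

```python
def fences(q1, q3):
    """Return (IQR, inner fences, outer fences) under the 1.5 / 3 x IQR convention."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # assumed convention
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # assumed convention
    return iqr, inner, outer


def potential_outliers(values, q1, q3):
    """Values falling outside the inner fences."""
    _, (low, high), _ = fences(q1, q3)
    return [x for x in values if x < low or x > high]


# Hypothetical data with quartiles Q1 = 4 and Q3 = 7.
data = [2, 4, 4, 4, 5, 5, 7, 9, 30]
print(fences(4, 7))
print(potential_outliers(data, 4, 7))
```

Values beyond the inner fences are only candidates for closer inspection. Whether a flagged value is an error or a genuine extreme observation still has to be judged from the data source.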
[Figure 2-17: Boxplots for bell-shaped, uniform and skewed distributions]
Percentile Rank
If x=Pk and k is the percentile rank of x, then k is approximately equal to:
$k \approx \dfrac{\#\,\text{values} < x}{n} \times 100$
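A short Python sketch of the percentile rank. It assumes the count is of values strictly less than x, which is one common convention; the slide's formula is too garbled to confirm whether ties are included.

```python
def percentile_rank(values, x):
    """Approximate percentile rank of x: percentage of values strictly below x."""
    below = sum(1 for v in values if v < x)   # assumption: strict inequality
    return 100 * below / len(values)


# Hypothetical data, n = 8.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(percentile_rank(data, 7))   # 6 of the 8 values are below 7
```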
Exploring
Measures of center: mean, median, and mode
Measures of variation: Standard deviation and range
Measures of relative location: order statistics, minimum, maximum, percentile
Unusual values: outliers
Distribution: histograms, stem-leaf plots, and boxplots