Descriptive Statistics Summarizing, Simplifying Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to large data sets. Describing Useful for recognizing important characteristics of data Used in inferential statistics

Upload: augustus-david-james

Post on 23-Dec-2015


Page 1: Descriptive Statistics  Summarizing, Simplifying  Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to

Descriptive Statistics

Summarizing, Simplifying: Useful for comprehending data, and thus making meaningful interpretations, particularly in medium to large data sets.

Describing: Useful for recognizing important characteristics of data. Used in inferential statistics.

Page 2

Important Characteristics of Data

Center – typical data value

Variation – spread in data

Distribution – shape of data distribution

Outliers – problems in data

Time – changes over time?

Page 3

Graphical Summary Methods

Pie Chart: Useful for qualitative or quantitative data.

Bar Chart: Useful for qualitative data. Called a Pareto chart if the bars are ordered by decreasing height.

Page 4

Graphical Summary Methods

Frequency Histogram: Useful for quantitative data. A “connected bar plot” with bar height proportional to the frequency of the associated value or class (interval of values).

Graphical summary of a frequency distribution (sometimes called a frequency table).

Page 5

Frequency Distribution (Table)

For Discrete Data: Lists data values and corresponding counts. The resulting histogram has a bar on each value with height proportional to its count.

For Continuous Data: Data are divided into classes (intervals of values), and the classes are listed along with the corresponding counts.

Page 6

Definitions for Classes

Lower Class Limit – smallest value in a class

Upper Class Limit – largest value in a class

Class Width – distance between consecutive lower (or upper) class limits

Class Mark – midpoint of class (calculated as the mean of the lower and upper class limits)

Class Boundaries – values that eliminate the space between consecutive classes for plotting purposes

Page 7

Constructing a Frequency Table

Select the class width (w): approximated by the range divided by the desired number of classes (usually between 5 and 15 in medium-sized data sets).

$$w \approx \frac{\max - \min}{\#\,\text{classes}}$$

Select the lower class limit for the first class.

Construct class limits using w as the distance between consecutive lower or upper class limits.

Count the number of observations in each class.
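The construction steps can be sketched in Python. This is a minimal illustration, not the slides' own procedure in every detail: it assumes the first lower limit is the minimum, uses w = range / (number of classes) without rounding, and treats classes as half-open intervals with the last one closed so the maximum is counted. The function name `frequency_table` and the sample data are hypothetical.

```python
def frequency_table(data, num_classes):
    """Return (lower, upper, count) rows for equal-width classes."""
    lo, hi = min(data), max(data)
    # Class width: range divided by the desired number of classes.
    width = (hi - lo) / num_classes
    if width == 0:
        raise ValueError("all values are equal; no classes to build")
    counts = [0] * num_classes
    for x in data:
        # Classes are [lower, upper); the maximum goes into the last class.
        idx = min(int((x - lo) / width), num_classes - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i])
            for i in range(num_classes)]


data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 12]  # made-up observations
table = frequency_table(data, 5)
for lower, upper, count in table:
    print(f"{lower:5.1f} to {upper:5.1f}: {count}")
```

In practice the width is often rounded up to a convenient number, which this sketch does not do.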

Page 8

Types of Histograms

Frequency – height of bar is count

Relative Frequency - height of bar is relative frequency (proportion/percentage/probability)

Cumulative Frequency – height of bar is cumulative count

Cumulative Relative Frequency – height of bar is cumulative relative frequency (percentile)
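The four bar heights can all be derived from one list of class counts. The sketch below assumes the counts are already available; `histogram_heights` and the example counts are hypothetical names and values chosen for illustration.

```python
from itertools import accumulate


def histogram_heights(counts):
    """Derive the four kinds of bar heights from class counts."""
    total = sum(counts)
    cumulative = list(accumulate(counts))
    return {
        "frequency": list(counts),
        "relative": [c / total for c in counts],
        "cumulative": cumulative,
        "cumulative_relative": [c / total for c in cumulative],
    }


heights = histogram_heights([1, 3, 1, 2, 3])  # made-up class counts
for name, values in heights.items():
    print(name, values)
```

The last cumulative relative frequency is always 1, since every observation has been counted by the final class.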

Page 9

Other Types of Graphs

Dotplot: Each value is plotted as a dot along an x-axis. Dots representing equal values are stacked.

Stem-and-Leaf Plot: Each value is separated into a stem (such as the leftmost digit or digits) and a leaf (such as the rightmost digit or digits). Stems are listed in order and leaves are plotted alongside the appropriate stem.

Ordered Stem-and-Leaf Plot
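An ordered stem-and-leaf plot can be sketched as follows, assuming non-negative integers with the last digit as the leaf and the remaining digits as the stem. The function name `stem_and_leaf` and the sample values are hypothetical; stems that have no leaves are simply omitted here, which a textbook plot might instead show as empty rows.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return ordered stem-and-leaf rows; leaf = last digit of each value."""
    rows = defaultdict(list)
    for v in sorted(values):          # sorting makes the leaves ordered
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return [f"{stem} | {' '.join(str(leaf) for leaf in rows[stem])}"
            for stem in sorted(rows)]


plot = stem_and_leaf([23, 25, 31, 18, 12, 25, 37])  # made-up data
print("\n".join(plot))
```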

Page 10

Other Types of Graphs

Scatter Diagram or Scatter Plot: Plot of paired (x,y) data with x on the horizontal axis and y on the vertical axis. Useful for seeing the relationship between x and y.

Time-Series Plot: A special scatter diagram which has time plotted on the horizontal axis.

Page 11

Importance of Knowing the Distribution of Data

Distribution can affect the choice of an appropriate statistic to use.

Distribution can aid in determining the validity of many inferential statistics.

Common data distributions: bell (normal), bi-modal, right-skewed (chi-squared, exponential), left-skewed

Page 12

Numerical Summary Methods

Measures of Center (Location): The middle value or typical observation from a population.

Measures of Variability: The dispersion or spread in the population.

Measures of Relative Standing: The comparative value relative to the population.

Page 13

Measures of Center

Mean (Arithmetic Mean): The size of the population is denoted by N. The sample size is denoted by n.

Population Mean

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Sample Mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Page 14

Measures of Center

Median: Middle value in the ordered data for odd n. Mean of the 2 middle values for even n. Commonly called the 50th percentile.

The location of the median in the ordered data set is: (n+1)÷2
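The two cases (odd and even n) can be sketched in Python. The function name `median` is a local helper written for illustration; the location (n+1)÷2 is a 1-based position, so it is shifted by one for Python's 0-based lists.

```python
def median(values):
    """Median via the (n + 1) / 2 location in the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2            # 1-based position; ends in .5 for even n
    if n % 2 == 1:
        return ordered[int(location) - 1]
    # Even n: the location falls between two values, so average them.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return (lower + upper) / 2


print(median([7, 1, 3]))      # odd n
print(median([7, 1, 3, 5]))   # even n
```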

Page 15

Measures of Center

Mode: Most common value (occurs most frequently).

Midrange: Midway between the lowest and highest value.

Trimmed Mean: Mean of the values remaining after an equal number of values are removed from each tail.
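These three measures can be sketched as small Python helpers. The names `modes`, `midrange`, and `trimmed_mean` and the sample data are hypothetical. One assumption to note: `modes` returns every value tied for the highest count, so when all values occur once it returns all of them, a case in which one would normally say no mode exists.

```python
from collections import Counter


def modes(values):
    """All values tied for the highest frequency, in sorted order."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def midrange(values):
    """Midway between the lowest and highest value."""
    return (min(values) + max(values)) / 2


def trimmed_mean(values, trim_each_tail):
    """Mean after removing `trim_each_tail` values from each end."""
    ordered = sorted(values)
    kept = ordered[trim_each_tail:len(ordered) - trim_each_tail]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)


data = [1, 2, 2, 3, 50]  # made-up data with one extreme value
print(modes(data), midrange(data), trimmed_mean(data, 1))
```

With this data the midrange is pulled toward the extreme value 50, while the trimmed mean is not.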

Page 16

Skewness

SYMMETRIC: Mode = Mean = Median

SKEWED LEFT (negatively): typically Mean < Median < Mode

SKEWED RIGHT (positively): typically Mode < Median < Mean

Page 17

Compare Measures of Center

                  Mean        Median  Mode                        Midrange    Trimmed Mean
Values utilized?  All         Middle  Most common                 Extreme     All but extreme
Outliers?         Not robust  Robust  Robust                      Not robust  Robust
Exists?           Unique      Unique  May not exist or be unique  Unique      Unique
Data type?        Quan.       Quan.   Any                         Quan.       Quan.
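The "Outliers?" row can be illustrated with a small experiment using Python's standard `statistics` module. The data and the added extreme value are made up for illustration.

```python
import statistics

values = [2, 3, 4, 5, 6]
with_outlier = values + [100]  # hypothetical extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

A single extreme value moves the mean a long way while the median barely changes, which is what "not robust" and "robust" mean in the table.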

Page 18

Measures of Variation

Range: Distance between minimum and maximum. Range = Max – Min

The range does not measure the overall variability in the data. A measure is needed which incorporates the variability of every value in the data. One way is to look at the deviations from the mean, $(x_i - \mu)$, for each $x_i$.

Page 19

Measures of Variation

Variance: The average squared difference of the observations from the mean.

Population Variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
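The only difference between the two variances is the divisor (N versus n − 1). A minimal Python sketch with hypothetical function names and made-up data:

```python
def population_variance(values):
    """Average squared deviation from the mean, dividing by N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Squared deviations from the sample mean, dividing by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean is 5, squared deviations sum to 32
print(population_variance(data))  # 32 / 8
print(sample_variance(data))      # 32 / 7
```

The standard deviation in each case is the square root of the corresponding variance.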

Page 20

Calculator Formula for Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}$$
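The calculator formula needs only the running sums of x and x², which is why it suits hand calculators. The sketch below checks it against the definitional form on made-up data; the function names are hypothetical.

```python
def sample_variance_definition(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Calculator form: (n * sum(x^2) - (sum x)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # sum x = 40, sum x^2 = 232
print(sample_variance_definition(data), sample_variance_shortcut(data))
```

With floating-point data whose mean is large relative to its spread, the shortcut form can lose precision because it subtracts two nearly equal numbers, so the definitional form is usually the safer choice in software.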

Page 21

Measures of Variation

Standard Deviation: The square root of the average squared difference of the observations from the mean.

Population Standard Deviation

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Sample Standard Deviation

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Page 22

Calculator Formula for Sample Standard Deviation

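$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{n \sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}{n(n - 1)}}$$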

Page 23

Empirical Rule

For data that are approximately bell-shaped in distribution:

68% of data values fall within 1 standard deviation of the mean,

95.4% of data values fall within 2 standard deviations of the mean,

99.7% of data values fall within 3 standard deviations of the mean.
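The percentages in the empirical rule match the exact normal distribution, for which the fraction within k standard deviations of the mean is erf(k/√2). The sketch below uses `math.erf` from the Python standard library; the helper name `normal_fraction_within` is hypothetical.

```python
import math


def normal_fraction_within(k):
    """Fraction of a normal distribution within k standard deviations."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sd: {normal_fraction_within(k):.1%}")
```

Real data that are only approximately bell-shaped will follow these percentages only approximately.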

Page 24

[Figure 2-13: The Empirical Rule (applies to bell-shaped distributions). Bell curve centered at $\bar{x}$.]

Page 25

[Figure 2-13: The Empirical Rule (applies to bell-shaped distributions). Axis marks: $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$. 68% of values lie within 1 standard deviation of the mean (34% on each side).]

Page 26

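[Figure 2-13: The Empirical Rule (applies to bell-shaped distributions). Axis marks: $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$. 68% of values lie within 1 standard deviation (34% on each side); 95% lie within 2 standard deviations (a further 13.5% on each side).]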

Page 27

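[Figure 2-13: The Empirical Rule (applies to bell-shaped distributions). Axis marks: $\bar{x} - 3s$, $\bar{x} - 2s$, $\bar{x} - s$, $\bar{x}$, $\bar{x} + s$, $\bar{x} + 2s$, $\bar{x} + 3s$. 68% of values lie within 1 standard deviation (34% on each side); 95% lie within 2 standard deviations (a further 13.5% on each side); 99.7% of data are within 3 standard deviations of the mean (a further 2.4% on each side), leaving 0.1% in each tail.]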

Page 28

Chebyshev’s Theorem

For data from any distribution, the proportion (or fraction) of values lying within K standard deviations of the mean is always at least $1 - 1/K^2$, where K is any number greater than 1.

At least 3/4 (75%) of all values lie within 2 standard deviations of the mean.

At least 8/9 (89%) of all values lie within 3 standard deviations of the mean.
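The bound can be compared against an actual data set. In the sketch below, the function names and the data are hypothetical, and the population standard deviation (`statistics.pstdev`) is used as an assumption about which standard deviation is meant.

```python
import statistics


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def fraction_within(values, k):
    """Observed fraction of values within k standard deviations of the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / len(values)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data: mean 5, sigma 2
for k in (1.5, 2, 3):
    print(k, chebyshev_lower_bound(k), fraction_within(data, k))
```

The observed fraction is usually well above the bound, since the theorem has to hold for every possible distribution.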

Page 29

Coefficient of Variation

Relates the standard deviation of a data set to its mean:

$$CV = \frac{s}{\bar{x}} \cdot 100$$

The CV is useful for comparing relative variation between two or more sets of data.
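Because the CV is unit-free, it allows a comparison between data sets measured on different scales. A sketch using the standard `statistics` module, with made-up heights and weights and a hypothetical helper name:

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


heights_cm = [160, 170, 180]  # made-up data: mean 170, s = 10
weights_kg = [50, 60, 70]     # made-up data: mean 60,  s = 10

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(f"heights: {cv_heights:.1f}%  weights: {cv_weights:.1f}%")
```

Both sets have the same standard deviation, yet the weights vary more relative to their mean.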

Page 30

Measures of Relative Position

Standard Score or Z-Score

$$Z = \frac{x - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma} \quad \text{or} \quad \frac{x - \bar{x}}{s}$$
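A z-score states how many standard deviations a value lies from the mean. The sketch below uses the sample form (x̄ and s) via the standard `statistics` module; the function name and data are hypothetical.

```python
import statistics


def z_scores(values):
    """Standard score of each value relative to the sample mean and s."""
    xbar = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - xbar) / s for x in values]


scores = z_scores([160, 170, 180])  # made-up data: mean 170, s = 10
print(scores)
```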

Page 31

Measures of Relative Position

Order Statistics

The order statistics, denoted by $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$, are the observed data values ordered from smallest to largest.

Page 32

Measures of Relative Position

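Percentile: The kth percentile ($P_k$) separates the bottom k% of data from the top (100−k)% of data. The location of $P_k$ in the order statistics is:

$$L = \begin{cases} \dfrac{nk}{100} + 0.5 & \text{if } \dfrac{nk}{100} \text{ is an integer} \\[2ex] \text{Ceiling}\left(\dfrac{nk}{100}\right) & \text{if } \dfrac{nk}{100} \text{ is not an integer} \end{cases}$$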

Page 33

Finding the Value of the kth Percentile (Figure 2-15)

Start: Sort the data. (Arrange the data in order of lowest to highest.)

Compute L = (k/100) n, where n = number of values and k = percentile in question.

Is L a whole number?

Yes: The value of the kth percentile is midway between the Lth value and the next value in the sorted set of data. Find Pk by adding the Lth value and the next value and dividing the total by 2.

No: Change L by rounding it up to the next larger whole number. The value of Pk is the Lth value, counting from the lowest.
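The flowchart translates directly into a short function. This sketch assumes k is a whole number strictly between 0 and 100, so that "the next value" always exists, and it uses integer arithmetic to decide whether L is a whole number. The name `percentile` is a local helper, not a library call; other software often uses different interpolation rules and may give slightly different answers.

```python
def percentile(values, k):
    """Value of the kth percentile, following the flowchart's procedure."""
    if not (isinstance(k, int) and 0 < k < 100):
        raise ValueError("k must be a whole number between 1 and 99")
    ordered = sorted(values)                # sort lowest to highest
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)   # L = k * n / 100
    if remainder == 0:
        # L is a whole number: midway between the Lth value and the next.
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round L up and take the Lth value (1-based position).
    return ordered[whole]


data = list(range(1, 11))  # made-up data: 1, 2, ..., 10
print(percentile(data, 25), percentile(data, 50), percentile(data, 90))
```

For k = 25 and n = 10, L = 2.5 rounds up to 3, so the result is the 3rd value; for k = 50, L = 5 is whole, so the result is midway between the 5th and 6th values.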

Page 34

Measures of Relative Position

Quartiles: The quartiles (Q1=P25, Q2=P50 and Q3=P75) separate the data into fourths.

Interquartile Range (IQR): The distance between the first and third quartiles: IQR=Q3-Q1.

The IQR is a measure of variability which is less affected by outliers than the range, variance and standard deviation.

Page 35

Box-and-Whisker Diagram (Boxplot)

Graphical display of the “5 Number Summary”: X(1) = Min, Q1 = P25, Q2 = P50, Q3 = P75, X(n) = Max

Inner & Outer Fences: Useful for identifying potential outliers in data.
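The five-number summary and the fences can be sketched as follows. The slides do not define the fences, so this sketch assumes the common convention of 1.5 × IQR beyond the quartiles for the inner fences and 3 × IQR for the outer fences. The quartiles are computed with the percentile procedure described earlier (round L up, or average two values when L is whole); all function names and the data are hypothetical.

```python
def percentile(values, k):
    """kth percentile for whole k in 1..99 (round-up / midpoint rule)."""
    ordered = sorted(values)
    whole, remainder = divmod(k * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[whole]


def five_number_summary(values):
    """(Min, Q1, Q2, Q3, Max)."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))


def fences(values, multiplier):
    """Fences at Q1 - m * IQR and Q3 + m * IQR (assumed convention)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one extreme value
summary = five_number_summary(data)
inner = fences(data, 1.5)
outer = fences(data, 3)
outliers = [x for x in data if x < inner[0] or x > inner[1]]
print(summary, inner, outer, outliers)
```

Values outside the inner fences are flagged as potential outliers; values outside the outer fences are more extreme still.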

Page 36

[Figure 2-17: Boxplots. Panel: Bell-Shaped]

Page 37

[Figure 2-17: Boxplots. Panels: Bell-Shaped, Uniform]

Page 38

[Figure 2-17: Boxplots. Panels: Bell-Shaped, Uniform, Skewed]

Page 39

Percentile Rank: If x=Pk and k is the percentile rank of x, then k is approximately equal to:

$$k \approx \frac{\#\,\text{values} < x}{n} \cdot 100$$
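The percentile rank counts how many values fall below x. A minimal sketch with a hypothetical function name and made-up data:

```python
def percentile_rank(values, x):
    """Approximate percentile rank: percentage of values below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


data = list(range(1, 11))  # made-up data: 1, 2, ..., 10
print(percentile_rank(data, 8))  # 7 of the 10 values are below 8
```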

Page 40

Exploring

Measures of center: mean, median, and mode

Measures of variation: Standard deviation and range

Measures of relative location: order statistics, minimum, maximum, percentile

Unusual values: outliers

Distribution: histograms, stem-leaf plots, and boxplots