MARE 250, Dr. Jason Turner
Descriptive Measures
Descriptive Measures – numbers that are used to describe datasets
Parts of Descriptive Statistics
Used to summarize raw data
Descriptive Measures
Measures of Center
Measures of Variation – how data are distributed around center
5-number summary – used to construct visual representation - Boxplot
Measures of Center
Measures of Central Tendency – indicate where the center, or most typical value, of a data set lies
Mean, Median, Mode
Measures of Center
Mean – of a dataset is the sum of the observations divided by the number of observations; Arithmetic Average
10 + 20 + 30 + 40 + 50 + 60 + 70 + 80 + 90 + 100 = 550
550 / 10 = 55
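The arithmetic above can be checked with a few lines of Python. This is only a sketch using the slide's ten observations; the variable names are illustrative.

```python
from statistics import mean

# The ten observations from the slide
observations = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Mean = sum of the observations / number of observations
total = sum(observations)
arithmetic_mean = total / len(observations)

print(total)            # 550
print(arithmetic_mean)  # 55.0

# The standard library gives the same answer
print(mean(observations) == arithmetic_mean)  # True
```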
Measures of Center
Median – the number that divides the bottom 50% of the data from the top 50%
1) Arrange data in increasing order
2) If the number of observations is ODD, the median is the observation exactly in the middle
3) If the number of observations is EVEN, the median is the mean of the middle two observations
Measures of Center
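Position of the median in the ordered data = (n+1)/2
10,20,30,40,50,60,70,80,90,100,110 (ODD, n = 11); position = (11+1)/2 = 6; Median = 60
10,20,30,40,50,60,70,80,90,100 (EVEN, n = 10); position = (10+1)/2 = 5.5, halfway between the 5th and 6th observations; Median = (50+60)/2 = 55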
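The odd/even procedure can be sketched as a small function. The name `median_of` is hypothetical, and the function assumes a non-empty list of numbers.

```python
def median_of(values):
    """Median by the sort-then-pick-the-middle procedure (non-empty input assumed)."""
    ordered = sorted(values)          # step 1: arrange in increasing order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # step 2: odd count, take the middle observation
        return ordered[middle]
    # step 3: even count, average the two middle observations
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_data = list(range(10, 111, 10))   # 10, 20, ..., 110 (n = 11)
even_data = list(range(10, 101, 10))  # 10, 20, ..., 100 (n = 10)

print(median_of(odd_data))   # 60
print(median_of(even_data))  # 55.0
```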
Measures of Center
Mode – the value that occurs with the greatest frequency in the data set
If no value occurs more than once – No Mode; 10,20,30,40,50,60,70,80,90,100
Otherwise – any value with the greatest frequency is a Mode; 10,20,30,40,50,50,60,70,80,90,100… Mode is 50
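A minimal sketch of the mode rule, counting how often each value occurs. The function name `modes` is hypothetical; it returns an empty list for the "No Mode" case and may return several values if there is a tie.

```python
from collections import Counter

def modes(values):
    """Return every value with the greatest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                     # no value occurs more than once: no mode
    return [value for value, count in counts.items() if count == highest]

no_repeat = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
one_repeat = [10, 20, 30, 40, 50, 50, 60, 70, 80, 90, 100]

print(modes(no_repeat))   # []
print(modes(one_repeat))  # [50]
```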
Measures of Center
The mode is useful if the distribution is skewed or bimodal (having two very pronounced values around which data are concentrated)
[Figure: plot with y-axis "Number of Individuals" (0 to 30)]
You are so totally skewed!
The mean is sensitive to extreme (very large or small) observations and the median is not
Therefore – you can determine how skewed your data is by looking at the relationship between median and mean
Mean is Greater than the Median – data are right (positively) skewed
Mean and Median are Equal – data are symmetric
Mean is Less Than the Median – data are left (negatively) skewed
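The mean-versus-median comparison can be sketched as follows. The three small data sets are made up for illustration, and the function name and labels are hypothetical. The comparison is a rule of thumb, not a formal test of skewness.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule-of-thumb skew direction from comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"         # a few large values pull the mean up
    if m < med:
        return "left-skewed"          # a few small values pull the mean down
    return "roughly symmetric"

# Made-up examples, not course data
high_extreme = [10, 20, 30, 40, 500]   # mean 120, median 30
low_extreme = [1, 80, 90, 100, 110]    # mean 76.2, median 90
balanced = [10, 20, 30, 40, 50]        # mean 30, median 30

print(skew_direction(high_extreme))  # right-skewed
print(skew_direction(low_extreme))   # left-skewed
print(skew_direction(balanced))      # roughly symmetric
```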
Resistance Measures
A resistant measure is not sensitive to the influence of a few extreme observations
Median – resistant measure of center
Mean – not resistant
Outliers DO NOT affect Median
Outliers DO affect Mean
Resistance Measures
Resistance of Mean can be improved by using – Trimmed Means – a specified percentage of the smallest and largest observations are removed before computing the mean
Will do something like this later when exploring the data and evaluating outliers…(their effects upon the mean)
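A trimmed mean might be computed along these lines. The sketch assumes the stated percentage is removed from each end of the sorted data (conventions differ between textbooks and software) and that the percentage is below 50. The function name and the data set with one extreme value are made up for illustration.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of observations from EACH end (assumed convention)."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)   # observations dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Made-up data: nine ordinary values and one extreme observation
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]

print(sum(data) / len(data))   # 145.0, ordinary mean pulled up by the outlier
print(trimmed_mean(data, 10))  # 55.0, after dropping 10 and 1000
```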
Measures of Variation
Measures of Variation (Spread) – amount of variability in the data set
Range, Standard Deviation, Variance
Range = Maximum Observation – Minimum Observation
10,20,30,40,50,60,70,80,90,100; Range = 100 - 10 = 90
Measures of Variation
Standard Deviation - (±SD) measures the variation by indicating how far (on average) the observations are from the mean
Large Dev. – far from mean
Small Dev. – close to mean
Measures of Variation
Variance - (measure used by statistical formulas) square of the standard deviation
“Equal Variance” is one of the assumptions of parametric means testing…(we will learn this later)
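The three measures of variation can be computed for the ten-value example as sketched below. The slides do not say whether the sample (n - 1) or population (n) formula is intended; this sketch assumes the sample versions, `statistics.stdev` and `statistics.variance`.

```python
from statistics import stdev, variance

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Range = maximum observation - minimum observation
data_range = max(data) - min(data)

# Sample standard deviation and sample variance (n - 1 in the denominator)
sd = stdev(data)
var = variance(data)

print(data_range)        # 90
print(round(sd, 2))      # 30.28
print(round(var, 2))     # 916.67
# The variance is the square of the standard deviation (up to rounding error)
print(abs(sd ** 2 - var) < 1e-9)  # True
```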
Measures of Variation
Three Standard Deviations Rule – almost all observations in any data set lie within three standard deviations to either side of the mean; “almost all” defined in 2-ways by stats nerds…
Measures of Variation
Three Standard Deviations Rule –
Chebychev’s Rule – at least 89% of the observations in any data set lie within 3 standard deviations of the mean
Empirical Rule – about 99.7% of the observations lie within 3 standard deviations of the mean, if the data are approximately bell-shaped
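The 89% figure comes from Chebychev's bound, 1 - 1/k², evaluated at k = 3. The sketch below computes that bound and, for comparison, the fraction of the ten-value example that actually falls within k standard deviations. Function names are hypothetical and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def chebychev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

def fraction_within(values, k):
    """Observed fraction of values within k sample standard deviations of the mean."""
    m, s = mean(values), stdev(values)
    inside = [v for v in values if abs(v - m) <= k * s]
    return len(inside) / len(values)

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(round(chebychev_lower_bound(3), 3))  # 0.889
print(fraction_within(data, 3))            # 1.0
```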
5 Number Summary
Percentiles – data set is divided into hundredths (100 equal parts)
Why? Percentiles are not sensitive to the influence of a few extreme observations (outliers)
5 Number Summary
Quartiles – data set is divided into quarters (4 equal parts); most typically used
Data set has 3 Quartiles: Q1, Q2, Q3
Q1 – is the number that divides the bottom 25% from top 75%
Q2 – is the median; divides the bottom 50% from the top 50%
Q3 – is the number that divides the bottom 75% from top 25%
5 Number Summary
Interquartile Range (IQR) – the difference between the first and third quartiles
IQR = Q3 – Q1
The IQR gives you the range of the middle 50% of the data
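Quartiles and the IQR can be sketched with `statistics.quantiles` (Python 3.8+). Several conventions for computing quartiles exist, and textbooks and software packages can give slightly different values. This sketch uses the function's default "exclusive" method, which is an assumption rather than the course's stated method.

```python
from statistics import quantiles

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = quantiles(data, n=4)

# IQR = Q3 - Q1, the range of the middle 50% of the data
iqr = q3 - q1

print(q1, q2, q3)  # 27.5 55.0 82.5
print(iqr)         # 55.0
```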
Outlier, Outlier
Outliers – observations that fall well outside the overall pattern of the data
Requires special attention
May be the result of:
Measurement or Recording Error
Observation from a different population
Unusual Extreme observation
Pants on Fire!
Must deal with outliers: (Yes, really!)
If error – can delete; otherwise judgment call
Can use quartiles and IQR to identify potential outliers
The Outer Limits
Lower and Upper Limits:
Lower limit – is the number that lies 1.5 IQRs below the first quartile
Lower Limit = Q1 - 1.5 * IQR
Upper limit – is the number that lies 1.5 IQRs above the third quartile
Upper Limit = Q3 + 1.5 * IQR
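The two limit formulas translate directly into code. In this sketch the quartile values Q1 = 30 and Q3 = 90 and the data set are made up for illustration, and the function names are hypothetical.

```python
def outlier_limits(q1, q3):
    """Lower and upper limits: 1.5 IQRs below Q1 and 1.5 IQRs above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(values, lower_limit, upper_limit):
    """Observations falling outside the limits."""
    return [v for v in values if v < lower_limit or v > upper_limit]

# Hypothetical quartiles and data
lower, upper = outlier_limits(q1=30, q3=90)
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 250]

print(lower, upper)                            # -60.0 180.0
print(potential_outliers(data, lower, upper))  # [250]
```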
The Outer Limits
If a value is outside the “Outer Limits” of a dataset it is an…
OUTLIER! OUTLIER!
5 Number Summary
5-Number Summary: Min, Q1, Q2, Q3, Max
Written in increasing order
Provides information on Center and Variation
Are used to construct Box-Plots
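A five-number summary for the ten-value example might be assembled as below. The function name is hypothetical, and the quartiles again rely on the default method of `statistics.quantiles` (Python 3.8+), so other software may report slightly different Q1 and Q3 values.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (Min, Q1, Q2, Q3, Max) in increasing order."""
    q1, q2, q3 = quantiles(values, n=4)
    return min(values), q1, q2, q3, max(values)

data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(five_number_summary(data))  # (10, 27.5, 55.0, 82.5, 100)
```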
Boxplot
Boxplot (Box-and-Whisker Diagram): based on the 5-number summary; provides a graphic display of the center and variation
[Figure: boxplot on an axis from 0 to 70, labeled Min, Q1, Q2, Q3, Max]
Boxplot
[Figure: boxplot on an axis from 0 to 70, with a potential outlier marked by an asterisk (*)]
Modified Boxplot – includes outliers
Note that Min & Max are determined after outliers are removed!
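The numbers behind a modified boxplot could be gathered as in the sketch below: outliers are set aside first, and the Min and Max are then taken from what remains. The function name, dictionary keys, and data are illustrative, and the quartile method is the default of `statistics.quantiles` (Python 3.8+).

```python
from statistics import quantiles

def modified_boxplot_stats(values):
    """Quartiles, whisker ends (Min/Max without outliers), and the outliers themselves."""
    q1, q2, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [v for v in values if lower_limit <= v <= upper_limit]
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "min": min(inside),      # smallest observation that is not an outlier
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(inside),      # largest observation that is not an outlier
        "outliers": outliers,    # plotted individually
    }

# Made-up data with one extreme observation
stats = modified_boxplot_stats([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 250])
print(stats["min"], stats["max"], stats["outliers"])  # 10 100 [250]
```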
Boxplot
Boxplots summarize information about the shape, dispersion, and center of your data
They can also help you spot outliers
Boxplot
Left edge of the box represents the first quartile (Q1), while the right edge represents the third quartile (Q3)
Box portion of the plot represents the interquartile range (IQR) - middle 50% of data
[Figure: boxplot on an axis from 0 to 70, labeled Lower Limit, Q1, Q2, Q3, Upper Limit]
Boxplot
The line drawn through the box represents the median of the data
The lines extending from the box are called whiskers
The whiskers extend outward to the lowest and highest observations that still fall within the Lower and Upper limits (that is, excluding outliers)
Boxplot
Extreme values, or outliers, are represented by dots
A value is considered an outlier if it is outside of the box (greater than Q3 or less than Q1) by more than 1.5 times the IQR
[Figure: boxplot on an axis from 0 to 70, with a potential outlier marked by an asterisk (*)]
Boxplot
Use the boxplot to assess the symmetry of the data:
If the data are fairly symmetric, the median line will be roughly in the middle of the IQR box and the whiskers will be similar in length
[Figure: boxplot example on an axis from 0 to 70]
Boxplot
Use the boxplot to assess the symmetry of the data:
If the data are skewed, the median may not fall in the middle of the IQR box, and one whisker will likely be noticeably longer than the other
[Figure: boxplot example on an axis from 0 to 70]