How can you best represent statistical information and draw
conclusions from it?
What is statistics?
Statistics is the branch of mathematics concerned with the collection, organization, display, and interpretation of data.
S.1 Organizing Data
How can data be shown in a table or in a graph, and how can you read such data?
What is categorical data?
When should you use a pie chart, and how is one made?
How do you organize a frequency distribution?
Data types: categorical and numeric
Categorical: any non-numeric data
  Use frequency distributions, bar charts, pie charts
Numeric: anything that can be measured and listed by number
  Use dotplots, stem-and-leaf plots, frequency distributions, histograms
Example
Does this data mean anything to you, and can you answer questions about it in its current form?
Leisure time activities
W T A W G T W W C W
T W A T T W G W W C
A W A W W W T W W T
W = walking, T = weight training, C = cycling, G = gardening, A = aerobics
Displaying Categorical Data
How can you display and interpret categorical data?
Categorical: anything that can't be measured and listed by number
Frequency distributions, bar charts, pie charts
Frequency Distribution
Displays all categories and a tally for each
Relative frequency: the fraction of the data that falls in the category (the percentage written as a decimal)

Leisure time activities
Category          Tally                Frequency   Relative Frequency
Walking           ||||| ||||| |||||    15          0.500
Weight training   ||||| ||             7           0.233
Cycling           ||                   2           0.067
Gardening         ||                   2           0.067
Aerobics          ||||                 4           0.133
Total                                  30          1
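The tally above can also be checked with a short script. This is a minimal sketch, not part of the original slides; the names `responses`, `labels`, and `frequency_distribution` are made up for illustration.

```python
from collections import Counter

# Leisure-time activity codes from the example (30 responses).
responses = "W T A W G T W W C W T W A T T W G W W C A W A W W W T W W T".split()
labels = {"W": "Walking", "T": "Weight training", "C": "Cycling",
          "G": "Gardening", "A": "Aerobics"}

def frequency_distribution(data):
    """Return {category: (frequency, relative frequency)}."""
    counts = Counter(data)
    total = len(data)
    return {cat: (n, n / total) for cat, n in counts.items()}

dist = frequency_distribution(responses)
for code, name in labels.items():
    freq, rel = dist[code]
    print(f"{name:<16}{freq:>4}{rel:>8.3f}")
```

The relative frequencies should add up to 1, which is a quick way to catch a tally mistake.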
Bar Chart
Graphs the frequency of categorical data
Bars DO NOT touch
Categories are on the x-axis
Frequencies are on the y-axis
[Bar chart of leisure time activities; x-axis categories: Walking, Wt Training, Cycling, Gardening, Aerobics]
Pie Charts (circle graphs)
Used when there are not too many categories
  Rule of thumb: 8 or fewer
Each “slice” is determined by the relative frequency
  Degrees in slice = relative frequency x 360
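As a rough illustration of the slice rule, the sketch below converts the leisure-activity frequencies into slice angles. The function name `slice_degrees` is an assumption made for this example.

```python
def slice_degrees(frequencies):
    """Map each category to its pie-slice angle: relative frequency x 360."""
    total = sum(frequencies.values())
    return {cat: freq / total * 360 for cat, freq in frequencies.items()}

# Frequencies from the leisure-time activities example.
activity_counts = {"Walking": 15, "Weight training": 7, "Cycling": 2,
                   "Gardening": 2, "Aerobics": 4}
angles = slice_degrees(activity_counts)
for cat, deg in angles.items():
    print(f"{cat:<16}{deg:6.1f} degrees")
```

Walking takes half the responses, so its slice is half the circle (180 degrees); all slices together should total 360.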
Homework: Worksheet 1
S-2 Displaying Numeric Data
EQ: How do you construct and read stem-and-leaf plots, dotplots, frequency distributions, and histograms?
Numeric: anything that can be measured and listed by number
  Dotplots, stem-and-leaf plots, frequency distributions, histograms
Dotplots
Simple way to represent small amounts of data
Each piece of data has its own dot
Dots stack vertically above the position on the x-axis
Depending on the data set, you may lose the exact value for each piece
512 615 524 632 645
575 592 716 618 521
682 675 549 523 651
[Dotplot of the data above; x-axis ticks labeled 5, 6, 7]
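Stem Plot
Works for a small to moderate set of data
Stems go in a vertical column
Stems may be split low and high (0-4 and 5-9)
Comparative or double stemplot: shows multiple data sets

Data set 1: 51 61 52 63 64 57 59 71 61 52 68 67 54 52 65
Data set 2: 51 61 52 73 54 57 59 71 61 52 68 67 74 52 65

Stemplot of data set 1
5 | 1 2 2 2 4 7 9
6 | 1 1 3 4 5 7 8
7 | 1

Comparative stemplot (data set 1 on the left, data set 2 on the right)
9 7 4 2 2 2 1 | 5 | 1 2 2 2 4 7 9
8 7 5 4 3 1 1 | 6 | 1 1 5 7 8
            1 | 7 | 1 3 4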
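One possible way to build a stemplot by program is sketched below. It assumes two-digit data, with the tens digit as the stem and the ones digit as the leaf; `stem_plot` is an illustrative name, not something from the slides.

```python
from collections import defaultdict

def stem_plot(values):
    """Return stem-and-leaf rows (tens digit | ones digits) as strings."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Stems with no data are still listed so gaps stay visible.
        row_leaves = " ".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem} | {row_leaves}".rstrip())
    return rows

data = [51, 61, 52, 63, 64, 57, 59, 71, 61, 52, 68, 67, 54, 52, 65]
print("\n".join(stem_plot(data)))
```

Sorting first means the leaves come out in increasing order on each stem, as in a finished hand-drawn plot.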
Histograms
A bar chart for numeric data
Center the rectangle over the indicated value on the x-axis; the bars touch
Can be drawn from the frequency or the relative frequency distribution

# of partners in local law firms   Frequency   Relative frequency
1                                  2           0.1
2                                  3           0.15
3                                  6           0.3
4                                  6           0.3
5                                  3           0.15
Totals                             20          1
Shapes of Histograms
Unimodal: has one peak
Bimodal: has two peaks
Multimodal: has more than two peaks
Types of Unimodal Curves
Symmetric
  Normal or bell shaped
  Heavy tailed: having long tails, larger standard deviation
  Light tailed: having short tails, smaller standard deviation
Skewed Curves
[Figure: skewed curves with the lower (left) tail and upper (right) tail labeled]
When there is an outlier to the right, the curve is skewed right
When there is an outlier to the left, the curve is skewed left
Skewness is judged by the tail not where the majority of the data lies.
Frequency Distributions
Continuous and Discrete Data
Discrete Data
  Individual data points
  The values always come from the set of integers or whole numbers
Continuous Data
  Data that may include decimals
Frequency Distributions
There are no natural breaks for continuous data, so we create our own
Ex. The fuel efficiency of a particular car ranges from 25.3 to 29.8 mpg; we decide to use an interval of 0.5
Note: Always start at an even increment lower than the lowest piece of data and go to an even increment higher than the highest piece of data

Interval #   Low    High
1            25.0   25.5
2            25.5   26.0
3            26.0   26.5
4            26.5   27.0
5            27.0   27.5
6            27.5   28.0
7            28.0   28.5
8            28.5   29.0
9            29.0   29.5
10           29.5   30.0
In which interval would you place 27.5 mpg?
$L_i \le x < H_i$
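The boundary question can be explored with a small sketch. It assumes each interval includes its low end and excludes its high end (low <= x < high), which is a common convention but an assumption here; `interval_number` and its parameters are illustrative names.

```python
import math

def interval_number(x, start=25.0, width=0.5, n_intervals=10):
    """Return the 1-based interval containing x, using low <= x < high."""
    # Rounding the ratio limits floating-point drift before taking the floor.
    index = math.floor(round((x - start) / width, 9))
    if not 0 <= index < n_intervals:
        raise ValueError(f"{x} is outside the table")
    return index + 1

print(interval_number(27.5))  # interval 6 (27.5 to 28.0) under this convention
print(interval_number(25.3))  # interval 1
print(interval_number(29.8))  # interval 10
```

Under the opposite convention (low < x <= high), 27.5 would land in interval 5 instead, so the rule has to be fixed before tallying.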
Homework: Numeric Data, Worksheet 2
Density Graphs
When data is unevenly distributed, you may want to use unequal groups or intervals
This may only be done if you graph the density

density = (relative frequency of class) / (class width)

Interval   Low    High   Width   Frequency   Relative frequency   Density
1          1      10     9       2           0.09091              0.01010
2          10     20     10      3           0.13636              0.01364
3          20     30     10      4           0.18182              0.01818
4          30     40     10      3           0.13636              0.01364
5          40     50     10      6           0.27273              0.02727
6          50     100    50      1           0.04545              0.00091
7          100    200    100     2           0.09091              0.00091
8          200    1000   800     1           0.04545              0.00006
Total                            22
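A minimal sketch of the density calculation follows, taking the class width as high minus low (an assumption of this example) and using the class limits and frequencies from the table. The name `densities` is hypothetical.

```python
# (low, high, frequency) for each class in the density table.
classes = [(1, 10, 2), (10, 20, 3), (20, 30, 4), (30, 40, 3),
           (40, 50, 6), (50, 100, 1), (100, 200, 2), (200, 1000, 1)]

def densities(rows):
    """Return (relative frequency, density) per class; density = rel. freq / width."""
    total = sum(freq for _, _, freq in rows)
    result = []
    for low, high, freq in rows:
        rel = freq / total
        result.append((rel, rel / (high - low)))
    return result

for (low, high, _), (rel, dens) in zip(classes, densities(classes)):
    print(f"{low:>4} to {high:<5}{rel:8.5f}{dens:10.5f}")
```

Multiplying each density by its class width gives back the relative frequency, so the areas of the bars in a density graph add up to 1.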
S-3 Describing the Center of a Data Set
EQ: What are the measures of central tendency and how can they be determined?
Center and Spread
Two of the most critical descriptors of a data set
Graphical methods such as those in the last chapter give a general impression of both
Numerical methods give precise values that can be compared in detail
The three M’s
Mean
  • Also known as the average
Median
  • Also called the middle
Mode
  • Most frequent
The Mean
Formula for the sample mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
• x = each piece of data
• $x_i$ = i indicates the position of the data within the original data set
• n = number of pieces of data in the data set
• ∑ = Greek letter sigma; means to add what follows
Always use more accuracy (more decimals) than any one piece of data has.
µ is used for the population mean
Greek letters are always used for population values
The Median
The middle value in a list of ordered values
Median has no symbol but is often abbreviated Med
If n is odd then the median is the exact middle number
If n is even then the median is the mean of the two middle numbers
Comparison and Contrast of the Mean and Median
The median divides the data into two equal parts: 50% of the data is on either side of the median
The mean is where the fulcrum would cause the “data scale” to balance if the values had weight
The mean is very sensitive to outliers
[Figure: balancing the “data scale”; panels: Normal/Bell curve (mean and median marked), Skewed Left, Skewed Right]
Trimmed Mean
Makes the mean less susceptible to outliers
  Order the data
  Remove the same number of pieces of data from each end
  Recalculate the mean
trim % x n = number of pieces to be removed from EACH end
A small to moderate trim is 5% to 25%
Trimmed Mean
Example: Find the 15% trimmed mean of:
3, 6, 8, 2, 9, 10, 7, 15, 4, 12, 20, 36, 15, 5, 3, 7, 10, 16, 17, 12
Order the numbers:
2, 3, 3, 4, 5, 6, 7, 7, 8, 9, 10, 10, 12, 12, 15, 15, 16, 17, 20, 36
20 items • .15 = 3, so remove 3 values from each end:
4, 5, 6, 7, 7, 8, 9, 10, 10, 12, 12, 15, 15, 16
Sum = 136, so the trimmed mean = 136/14 ≈ 9.714
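The trimming steps can be sketched in a few lines. This version takes the trim as a whole-number percent and rounds the count down when it is not a whole number; that rounding choice and the name `trimmed_mean` are assumptions of the example.

```python
def trimmed_mean(data, percent):
    """Mean after dropping percent% of the values from EACH end."""
    ordered = sorted(data)
    k = len(ordered) * percent // 100  # pieces removed per end (rounded down)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

values = [3, 6, 8, 2, 9, 10, 7, 15, 4, 12, 20, 36, 15, 5, 3, 7, 10, 16, 17, 12]
print(round(trimmed_mean(values, 15), 3))   # 136 / 14 -> 9.714
print(round(sum(values) / len(values), 3))  # untrimmed mean, for comparison
```

The untrimmed mean is pulled upward by the 36, which is the kind of outlier the trim is meant to discount.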
Weighted Mean
Similar to an arithmetic mean (the most common type of average), except that instead of each data point contributing equally to the final average, some data points contribute more than others.
Weighted Mean
             # of students   Class average
1st period   20              75
2nd period   35              79

$\text{weighted ave.} = \frac{20 \cdot 75 + 35 \cdot 79}{55} = \frac{4265}{55} \approx 77.55$
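The same calculation, written as a short sketch with the class sizes as weights. The function and variable names are illustrative only.

```python
def weighted_mean(values, weights):
    """Sum of value x weight, divided by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

class_averages = [75, 79]   # 1st and 2nd period
class_sizes = [20, 35]      # number of students, used as weights
print(round(weighted_mean(class_averages, class_sizes), 2))  # 4265 / 55 -> 77.55
```

The result sits closer to 79 than to 75 because the second period has more students.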
Homework worksheet 3
S-4 Spread
What are the quartiles, percentiles, and box plots?
Range
  Range = High - Low
Interquartile Range (IQR)
  IQR = upper quartile (Q3) – lower quartile (Q1)
  Lower quartile: the median of the lower half
  Upper quartile: the median of the upper half
  If n is odd, the exact median is excluded from the quartiles
  Used because it is resistant to outliers
  There is no special name for the population IQR
Boxplot
• Can be used for many types of summarizations
• IQR = Q3 – Q1
• Outlier = data more than 1.5 • IQR from the end of the box
• Extreme = data more than 3 • IQR from the end of the box
[Figure: modified boxplot; each of the four sections holds 25% of the data; outliers are drawn as closed circles and extreme outliers as open circles]
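The quartile and outlier rules can be sketched as follows, using the median-of-halves definition of the quartiles. The sample values and the names `quartiles` and `classify` are illustrative; the sketch assumes at least two data values.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (median excluded if n is odd)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    return median(lower), median(upper)

def classify(data):
    """Label each value 'ok', 'outlier' (> 1.5 IQR from the box) or 'extreme' (> 3 IQR)."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    labels = {}
    for x in data:
        distance = max(q1 - x, x - q3, 0)  # distance from the nearer end of the box
        if distance > 3 * iqr:
            labels[x] = "extreme"
        elif distance > 1.5 * iqr:
            labels[x] = "outlier"
        else:
            labels[x] = "ok"
    return labels

sample = [12, 15, 15, 16, 17, 17, 18, 18, 20, 25]  # illustrative values
print(quartiles(sample))      # (15, 18)
print(classify(sample)[25])   # outlier: 25 is 7 above Q3, and 1.5 * IQR = 4.5
```

Other textbooks and software use slightly different quartile definitions, so results from a calculator may differ a little from this hand method.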
Percentages and percentiles:
Percentage = (the score / total possible points) * 100
Percentile = (the position of the score within an ordered list / the total number of items) * 100
EX: 10 students took a 90 point test
Score:     60  65  68  74  75  80  81  81  84  90   (note: an ordered list)
Position:   1   2   3   4   5   6   7   8   9  10
What is the percent and the percentile for a score of 81?
Percent: 81/90 * 100 = 90%
Percentile: 7/10 * 100 = 70, so 81 is at the 70th percentile
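The two calculations in this example might be written as the sketch below. For tied scores it uses the position of the first occurrence, which matches the worked answer above but is otherwise an assumption; the function names are illustrative.

```python
def percent(score, total_points):
    """Score as a percentage of the total possible points."""
    return score * 100 / total_points

def percentile(score, ordered_scores):
    """Position of the score in the ordered list (1-based) over the list length, x 100."""
    position = ordered_scores.index(score) + 1  # first occurrence, as in the example
    return position * 100 / len(ordered_scores)

scores = [60, 65, 68, 74, 75, 80, 81, 81, 84, 90]
print(percent(81, 90))         # 90.0
print(percentile(81, scores))  # 70.0
```

The two numbers answer different questions: the percent describes the score itself, while the percentile describes where the score stands among the other scores.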
EXAMPLE: Given a stem and leaf plot
10 | 2 5 7
20 | 1 6
30 | 5 8 9 9
40 | 2 3 5 7 8
50 | 2
60 | 3 6
FIND:
•the median
•the first quartile
•the third quartile
•the interquartile range
•the mode
•the percentile for .271
•the value closest to the 60th percentile
S-5 Measures of Variability
How do the measures of variability help us to better understand what our data set might look like?
Range = high – low
Deviation from the mean = $x_i - \bar{x}$
  if positive then $x_i$ is larger than the mean
  if negative then $x_i$ is smaller than the mean
Mean deviation is the average of the absolute values of the deviations (the signed deviations always sum to zero)
Sample Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Sample Standard Deviation
“average distance” the items fall from the mean
$s = \sqrt{s^2}$
A small $s$ or $s^2$ indicates low variability
A high $s$ or $s^2$ indicates large variability
Population Variance (knowing all the data)
$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$
Population Standard Deviation
$\sigma = \sqrt{\sigma^2}$
compute to the same accuracy as the population
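To see how the sample and population formulas differ only in the divisor, here is a small sketch. The data values are hypothetical (not from the slides), and the standard-library calls at the end are included only as a cross-check.

```python
import math
import statistics

def sample_variance(data):
    """s^2: squared deviations from the mean, divided by n - 1."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - 1)

def population_variance(data):
    """sigma^2: squared deviations from the mean, divided by n."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data, not from the slides
print(math.sqrt(sample_variance(values)))      # sample standard deviation s
print(math.sqrt(population_variance(values)))  # population standard deviation: 2.0
print(statistics.stdev(values), statistics.pstdev(values))  # library cross-check
```

Dividing by n - 1 always gives a slightly larger result than dividing by n, and the gap shrinks as the data set grows.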
Uses of the IQR
Standard deviation can be approximated by SD ≈ IQR/1.35
If SD > IQR/1.35, it suggests heavier or longer tails than the normal curve
Example
20, 15, 12, 18, 17, 15, 17, 16, 18, 25
Reorder
12, 15, 15, 16, 17, 17, 18, 18, 20, 25
range =
iqr =
sd =
$\bar{x}$ =
Median = 17, Q1 = 15, Q3 = 18
continued
Find the mean deviation and the standard deviation
By hand
i      $x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
1      12
2      15
3      15
4      16
5      17
6      17
7      18
8      18
9      20
10     25
totals
Given 12, 15, 15, 16, 17, 17, 18, 18, 20, 25
Find the SD
  By IQR
  By calculator
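For checking hand work on this example, the sketch below fills in the deviation columns and compares the sample standard deviation with the IQR-based estimate. It takes Q1 = 15 and Q3 = 18 from the example; the variable names are illustrative.

```python
import math

data = [12, 15, 15, 16, 17, 17, 18, 18, 20, 25]
mean = sum(data) / len(data)

# One row per value: i, x_i, deviation, squared deviation.
rows = [(i, x, x - mean, (x - mean) ** 2) for i, x in enumerate(data, start=1)]
for i, x, dev, sq in rows:
    print(f"{i:>2}{x:>5}{dev:>7.1f}{sq:>8.2f}")

sum_sq = sum(sq for _, _, _, sq in rows)
sd_by_formula = math.sqrt(sum_sq / (len(data) - 1))  # sample standard deviation
sd_by_iqr = (18 - 15) / 1.35                          # Q3 - Q1 from the example
print(round(sd_by_formula, 2), round(sd_by_iqr, 2))
```

Here the formula gives a larger value than IQR/1.35, which is consistent with a longer tail caused by the value 25.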
Homework worksheet 5
S-6 Translation and Scale
What is the difference in the impact of translation and scale change on data?
In class project:
Hints for review #1
How many intervals should be used for a set of data?
The book recommends a number based on the # of pieces of data
Homework
TEST 1
S-7 Data Collection
How do you know which method of data collection is most appropriate?
Random Samples
What methods of data collection constitute collecting a random sample?
Sampling
Time and money usually do not permit a scientist to collect the opinion of, or measure the effect on, every person in the population. Instead, the scientist takes samples, which should include all groups, in order to make accurate statements about the entire population.
Simple Random Sample
Each object in the population has an equal chance of being selected for the sample
Each object in the sample is chosen independently of any other object in the sample
  Independent: choosing one has no bearing on the choice of the next object
Independent example
  All names are placed in a hat and 10 are chosen
Dependent example
  Two names are drawn and they each ask 4 people to participate with them
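Drawing names from a hat can be imitated with the standard library. This is a rough sketch; the student names and the function name `simple_random_sample` are hypothetical.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct members; every member is equally likely to be chosen."""
    rng = random.Random(seed)  # a seed makes the draw repeatable
    return rng.sample(population, k)

names = [f"student_{i}" for i in range(1, 31)]  # hypothetical names in the hat
chosen = simple_random_sample(names, 10, seed=1)
print(chosen)
```

Nobody's selection depends on who else was picked, which is the property that the dependent example (drawn names recruiting their own participants) lacks.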
Bias
When one group is over-represented in the sample
Causes:
  Basis of selection
  Who responds
  Who asks the questions or how they are asked
Stratified Sample
The population is divided into groups and a specified number are chosen from each group
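A stratified sample could be drawn along these lines. The grade-level groups, their sizes, and the name `stratified_sample` are all made up for the sketch.

```python
import random

def stratified_sample(groups, per_group, seed=None):
    """Draw per_group members at random from each group (stratum)."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_group) for name, members in groups.items()}

# Hypothetical population split by grade level.
population = {
    "freshmen": [f"F{i}" for i in range(40)],
    "sophomores": [f"S{i}" for i in range(35)],
    "juniors": [f"J{i}" for i in range(25)],
}
sample = stratified_sample(population, per_group=5, seed=1)
for group, members in sample.items():
    print(group, members)
```

Every group is guaranteed to appear in the sample, which a simple random sample of 15 from the whole population would not guarantee.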
River Project
The Normal Distribution
How does normally distributed data begin to relate statistics to probability?
The Normal Distribution
Most of the data falls close to the average, and only a few pieces of data fall at a distance from the mean. This configuration is often called a bell-shaped or normal curve.
Research has found that when data is normally distributed:
  68% of the data lies within one standard deviation of the mean
  95% of the data lies within two standard deviations (13.5% lies in the one to two SD range on each side)
  99.7% of the data lies within three standard deviations (2.35% lies in the two to three SD range on each side)
  0.15% of the data lies beyond three standard deviations on each side
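[Figure: normal curve with the x-axis marked at the mean and at one, two, and three standard deviations on either side]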
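The percentages quoted for the normal curve are rounded values, and they can be reproduced with Python's standard library (version 3.8 or later). A small sketch; the helper name `within` is an assumption of the example.

```python
from statistics import NormalDist

standard = NormalDist()  # mean 0, standard deviation 1

def within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard.cdf(k) - standard.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {within(k):.4f}")
print(f"between 1 and 2 SD, one side: {(within(2) - within(1)) / 2:.4f}")
print(f"beyond 3 SD, one side: {(1 - within(3)) / 2:.4f}")
```

The exact figures are slightly different from the rounded 68%, 95%, and 99.7%, which is why the pieces of the rule do not always add up perfectly by hand.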
Normal curves are symmetric about the mean. Some are narrow and some are wide; this is determined by the value of one standard deviation.
The area under a normal curve represents all the data: 100% or 1. The area under any section represents the percentage, and therefore the probability, that a given piece of data will fall in that region of the curve.
Normal distributions have a direct link to probability through something called z-scores. The z-score tells exactly how many full and partial standard deviations a particular piece of data falls from the mean. A negative number means the data is to the left of the mean; a positive number tells you the data is to the right of the mean.
The formula for z-scores is
$z = \frac{x - \bar{x}}{\sigma}$
The attached table gives the probability that a value has a z-score less than a given value (that is, that it falls to the left of a particular spot on the normal curve).
Examples: Find the z-score for each of the following:
a) 45 when $\bar{x}$ = 50 and $\sigma$ = 4
b) 56 when $\bar{x}$ = 60 and $\sigma$ = 10
c) between 20 and 60 when $\bar{x}$ = 50 and $\sigma$ = 10
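The three examples can be checked with the sketch below. It assumes the two numbers given in each problem are the mean and the standard deviation, and it uses `statistics.NormalDist` in place of the printed z-table; `z_score` is an illustrative name.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_a = z_score(45, 50, 4)    # -1.25
z_b = z_score(56, 60, 10)   # -0.4
z_low, z_high = z_score(20, 50, 10), z_score(60, 50, 10)  # -3.0 and 1.0

# Area to the left of z, which is what a z-table lists.
left = NormalDist().cdf
print(z_a, round(left(z_a), 4))
print(z_b, round(left(z_b), 4))
print(z_low, z_high, round(left(z_high) - left(z_low), 4))
```

For part c, the probability of falling between the two values is the area to the left of the higher z-score minus the area to the left of the lower one.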