probability and statistics representation of data measures of center for data simple analysis of...


Upload: kobe-caster

Post on 28-Mar-2015


Category:

Documents



TRANSCRIPT

Page 1: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data
Page 2

Probability and Statistics

Representation of Data
Measures of Center for Data
Simple Analysis of Data

Page 3

Overview

In this module you’ll be learning about the basics of statistics:

Statistical Displays: Data can be displayed graphically in different ways. You will learn how to choose displays by the type of data and the message to be delivered to the audience. Some “do not” examples will also be covered.

Measures of Center: A single number is commonly used to describe an entire set of data. You will explore the different types of “averages” and learn why you might choose one over another.

Analysis: This module covers the simple analysis of the data. You will look at what information can be obtained from the data and how to make comparisons of various data sets.

Page 4

Topics
• An introduction to statistics
• Types of data
• Displaying data
• How NOT to display data
• Arithmetic Mean
• Median
• Mode
• Weighted Mean
• Types of distributions
• Measures of center vs. variation

Page 5

Statistics: An Introduction

Statistics is the set of mathematical tools for collecting, organizing, and analyzing data, and then interpreting the information to make decisions.

Introduction to Statistics

The first page at this site gives an explanation of how statistics is used. Clicking on the “Continue” link on the bottom right of the page will take you to the section on “Revealing Patterns Using Descriptive Statistics”. It may be worthwhile to read the first page of this section to review some of the common terms/vocabulary used in describing data. Return to this presentation when you are ready.

(On the right of the web page, you will notice additional information that is beyond the scope of this module. Feel free to come back at a later time to explore further.)

Page 6

Displaying Data

Page 7

What is Data?

Data can be qualitative, describing distinct categories, or quantitative, describing numerical counts or measurements.

Qualitative data can be nominal, where no natural order exists between the categories, or ordinal, meaning an order does exist.

Quantitative data can be divided into continuous, when the data can take any value within a range, and discrete, when the data can take only separate, countable values (typically integers).

Types of Data

Read the text on the web site. Answer the nine questions at the bottom of the page to check your understanding of the topic. Return to this presentation when finished.

Page 8

Another explanation of Types of Data

You should now be able to classify data as qualitative nominal, qualitative ordinal, quantitative discrete or quantitative continuous, and are ready to explore how to display data. Continue to the next slide!

Types of Data (cont.)

If you are still unsure about recognizing qualitative and quantitative data, click on the link above to review how to distinguish between these two variables. When you have completed the “Progress check” at the bottom of the web page, return to this presentation.

Page 9

Common Graphs

Types of Graphs

You should have a basic idea of the types of graphs that can be created to display data. As you move on through the slides, you will learn how to create these graphs and how to determine which graph gives the best representation of the data you want to display.

Visit the above website for a brief description and representation of ten of the most common graphs.

Page 10

Math Dictionary

Types of Graphs (cont.)

There are various ways to display your data. The differences arise from the type of data and the information and/or message you want to deliver. Following is a list of the more traditional types of graphs.

• Bar graph
• Histogram
• Pie graph
• Line graph
• Box plot
• Scatter plot
• Line plot/Dot plot
• Pictograph
• Stem & Leaf plot

This web site is home to a mathematics dictionary. It has examples of the graphs listed on this slide. Return to this presentation when finished.

Page 11

Bar Graphs

One way to graphically represent data is by using a bar graph/chart. What type of data is best represented by a bar graph? What information about a data set should you be able to interpret from a bar graph?

Bar Graphs

Click on the link above to learn how to create a bar graph. After reading the information on bar graphs, answer the ten questions in the “Your Turn” section at the bottom of the page.

Page 12

Histograms

Histograms can be used to represent continuous data.

You should be able to identify data that is continuous and be able to create a histogram to represent that data.

Histograms

Read about histograms by following the link above. Check your understanding by answering the ten questions in the “Your turn” section at the bottom of the web page.

Page 13

Create a Histogram (video)

Histograms are best used when the data variable on the x-axis is quantitative. The bars most often represent a range of values.

Each bar could also represent an individual value. In this case, the histogram would more accurately be called a frequency distribution graph.
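To see how bars come to represent ranges of values, here is a minimal Python sketch of the counting step behind a histogram. The function name `bin_counts`, the bin width of 10, and the test scores are all made up for illustration; graphing software will usually pick the bins for you.

```python
from collections import Counter

def bin_counts(values, start, width):
    """Count how many values fall in each bin [low, high), starting at `start`."""
    counts = Counter((v - start) // width for v in values)
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }

# Hypothetical test scores, invented for illustration.
scores = [52, 58, 61, 67, 70, 71, 74, 78, 83, 85, 88, 95]
bins = bin_counts(scores, start=50, width=10)

# Print a rough text histogram: one '#' per data value in the bin.
for (low, high), count in bins.items():
    print(f"{low} to under {high}: {'#' * count}")
```

Each bin covers a range of values and the bins touch each other with no gaps, which is why a histogram suits quantitative data.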

Histograms

This video demonstrates how to take a data set and create a histogram.

Page 14

A Histogram is NOT a Bar Chart

Histograms and bar charts can look similar even though they display very different representations of the data. After reading the information on the web page linked above, you should be able to identify three differences between histograms and bar charts.

Histograms (cont.)

It is important to distinguish between a histogram and a bar chart. This is the first of several sites that will help you determine when to display data as one or the other. Read the information on the first page and then return to this presentation.

Page 15

Bar Charts and Histograms (includes a video)

Histograms (cont.)

This webpage has additional information on when to use a bar chart or histogram to display your data. You can also view the video which shows how to create a bar chart and histogram.

Page 16

Histograms vs. Bar Graphs

Can you answer the following questions?
• What type of data would be best represented by a histogram?
• What information should you be able to identify when data is represented in a bar graph?

Histograms (cont.)

Click on the link above to read more about the differences between histograms and bar charts. The information is set up as a conversation between a teacher and student reasoning through how each graph can be used to display specific types of data. When you have finished reading the discussion, please return to this presentation.

Page 17

Pie Charts

Pie charts represent data as a part-to-whole relationship.
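As a rough illustration of the part-to-whole idea, the Python sketch below turns category counts into the percent of the whole and the slice angle in degrees. The function name `pie_slices` and the survey numbers are invented for this example.

```python
def pie_slices(counts):
    """Return each category's share of the whole as (percent, degrees)."""
    total = sum(counts.values())
    return {
        label: (count * 100 / total, count * 360 / total)
        for label, count in counts.items()
    }

# Hypothetical survey results, invented for illustration.
favorite_fruit = {"apple": 10, "banana": 6, "cherry": 4}
slices = pie_slices(favorite_fruit)

for label, (percent, degrees) in slices.items():
    print(f"{label}: {percent:.0f}% of the pie, {degrees:.0f} degrees")
```

The percentages always add up to 100 and the angles to 360, so a pie chart only makes sense when the categories together make up one whole.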

Pie Charts

Read about how to create a pie chart and what type of data displays best in this format. Be sure to complete the questions at the bottom of the web page in the “Your turn” section and then return to this presentation!

Page 18

Pie Charts

You should be able to answer the following questions regarding pie charts:
• What is the best type of data to represent graphically in a pie chart?
• When interpreting information from a pie chart, what are three areas you should pay attention to in the representation?

Pie Charts (cont.)

This site looks at how NOT to use pie charts, along with showing many examples found in the news, in business reports and other media.

Page 19

What is a Scatterplot?

Main points:
• A scatterplot is used to graph the relationship between two quantitative variables or bivariate data;
• Scatterplots may show patterns: weak or strong, positive or negative correlations;
• Correlation does not indicate cause and effect.

Scatterplots

This site will introduce you to scatterplots. Click the blue “View Video” button to see how to make and read scatterplots. Once you have watched the video and read through the information on this webpage, return to this presentation.

Page 20

Scatterplots and Correlation

Explore further… At the above website, under the correlation graphs, is a link called More About Correlation. Here you will see how correlation is calculated. In most cases, you will use a calculator or software function for this; however, it’s beneficial to know how the correlation coefficient is derived.
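If you would like to see one such calculation spelled out, the Python sketch below computes the Pearson correlation coefficient, which we assume is the coefficient the linked page describes. The function name `pearson_r` and the study data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together, and how much each varies on its own.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(spread_x * spread_y)

# Hypothetical data: hours studied and test score for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
r = pearson_r(hours, scores)
print(f"r = {r:.2f}")
```

A value near +1 suggests a strong positive pattern and a value near -1 a strong negative one. Remember that even a strong correlation does not show cause and effect.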

Scatterplots (cont.)

This site presents another view of scatterplots and correlation. After reading this information, answer the nine questions in the “Your turn” section at the bottom of the page.

Page 21

Line Graph

Main ideas:
• Line graphs help to determine the relationship between two sets of values;
• Value sets represent an independent variable and a dependent variable;
• Line graphs are useful in showing trends and making predictions.

Line Graph

This website gives many examples of line graphs and explains what makes a line graph different from a scatter plot. Read through this information and then return to this presentation.

Page 22

Line Graphs

You should now be able to answer the following questions:
• What are the main differences between scatterplots and line graphs?
• What type of data is best represented in a line graph?

Line Graph (cont.)

Check your understanding of interpreting line graphs by answering the ten “Your turn” questions at the bottom of this webpage.

Page 23

Box Plots (YouTube video)

Vocabulary to understand box plots:
• Distribution
• Median
• Average (Mean)
• Extremes
• Quartiles
• Interquartile Range

Box Plot

This video introduces you to Box Plots, as it demonstrates how to create a box plot and defines the vocabulary terms listed below. When you have finished viewing the video, return to this presentation.

Page 24

Quartiles / Interquartile Range / Box and Whisker Plot

At this point, you should be able to:
• determine the lower, middle and upper quartiles of a data set;
• calculate interquartile range;
• construct a Box and Whisker Plot to represent the data;
• compare box plots from two data sets and make observations about the distributions.

Box Plot

This webpage gives another look at the breakdown of Box Plots. Once you have read through the information, try answering the ten “Your turn” questions at the end of the page. (Tip: It will be helpful to have scrap paper available)

Page 25

Boxplot (aka, Box and Whisker Plot)

Box Plot

If you need additional information to understand boxplots, click on the link above and “View Video”, which gives more details on how to read a boxplot. When you have finished reading through Boxplots Basics and How to Interpret a Boxplot, return to this presentation.

Page 26

Stemplots (aka, Stem and Leaf Plots)

• Used to display quantitative data
• Best used with small sets of data
• Shows shape of distribution
• Stem values can have any number of digits
• Leaves can only be represented by one digit
• Limitations displaying decimals
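The points above can be tried out with a small Python sketch that builds a stemplot for non-negative whole numbers. It assumes the last digit is the leaf and all earlier digits form the stem; the function name `stem_and_leaf` and the ages are invented for illustration. Note that this simple version skips stems that have no leaves, whereas a hand-drawn plot would normally still list them.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by stem (all digits but the last) and leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)

# Hypothetical ages, invented for illustration.
ages = [23, 25, 31, 34, 34, 38, 42, 47, 51]
plot = stem_and_leaf(ages)

for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Turning the printed plot on its side gives a rough picture of the shape of the distribution.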

Stem & Leaf Plot

Click the blue button to View Video and then read the information on stem and leaf plots. For additional explanation about this type of graph, proceed to the next slide.

Page 27

Stem and Leaf Plots

You should now know:
• under what circumstances stems should be split;
• how to organize decimal data in a stem and leaf plot;
• how to interpret data by looking at a stem plot.

Stem & Leaf Plot

This site provides additional details on “splitting the stems” and “splitting stems using decimal values”.

Page 28

Line Plot (YouTube Video)

Vocabulary:
• Clusters
• Gaps
• Outliers

Line Plot / Dot Plot

View this video on how to make a Line Plot then return to this presentation.

Page 29

Dot Plot vs. Line Plot (YouTube Video)

Line Plot / Dot Plot

This YouTube video does a good job describing the similarities and differences between a line plot and a dot plot. When you have watched it, return to this slide and click here to reinforce what you have learned about line and dot plots.

Page 30

Pictographs

In a Pictograph, symbols are used to display statistical data. Symbols can be misleading if they are not accurately proportioned or if they cannot be divided evenly to represent fractional parts.

Picture Graph / Pictograph

Read the information on Pictographs and then answer the nine “Your turn” questions at the bottom of this webpage.

Page 31

Comparing Graphs

Types of Graphs

Most data can be represented using multiple graphs. Decisions on the most appropriate display should be made based on what information you want the reader to draw from the graph.

Test your understanding of the graphs covered in this unit. At this website, read through the problems and decide which graph most clearly represents the data and what information is to be conveyed to the reader. Also, work through the five questions at the bottom of the page.

Page 32

Hans Rosling

An Advanced Display of Data

Probably one of the most informative and modern displays of data can be seen in the work of Hans Rosling. The link above shows a video of his TED talk in 2006. It is a 20-minute video, and it gets very interesting about 4 minutes in. Watch it all if you have time, but we recommend at least 10 minutes.

The point of this experience is not that we expect you to duplicate this extraordinary presentation, but that you appreciate the power of displaying data in a clear and understandable method. Any enhancement of the display should be for the purpose of clarity and not just distracting visuals.

Page 33

A data display can often mislead the reader. At times this may be intentional, when the creator wants to persuade the reader in some way. Other times it may be unintentional, when the creator tries to make the display more visually appealing and causes the reader to misinterpret the results.

How NOT to Display Data

The above link is by Salman Khan, founder of the Khan Academy. In his video he highlights the misleading visual displays of a line graph (5 min). Return to this presentation when finished.

Misleading Line Graphs by Khan Academy

Page 34

Misleading Graphs by Wikipedia

Typically the displays we see are technically accurate but they use visual “tricks” to mislead the reader who may not pay close attention to the details of the graphic display.

How NOT to Display Data

The above link, by Wikipedia, shows various ways a graphic display can mislead its intended audience. Return to this presentation when finished.

Page 35

Measures of Center

Page 36

Definitions of these terms

Different ways to measure the center of data
• Arithmetic mean (commonly called average or just mean)
• Median
• Mode
• Weighted mean

Measures of Center

If you are not familiar with the terms listed on this slide, follow the link above to familiarize yourself with them. (The above site includes other measures of center that are beyond the scope of this presentation.)

Page 37

Central Values

What is meant by “Measure of Center”? Sometimes we want to describe a group of data (numbers, values) by a single number. The advantage of this is the ability to more easily compare different groups of data. The disadvantage is when you describe a data set by a single number you lose the details and could mislead someone.

Measures of Center

The link above gives some simple examples of measures of center and compares the mean, median, and mode. Check your understanding with the ten questions at the bottom of the page before returning.

Page 38

The mean of a set of data is found by adding all the data values and dividing that answer by the number of data points (often referred to as “n”).

Strengths
• Its calculation includes all the data
• It is common and more likely understood by others
• It is often used in other statistical formulas
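The add-then-divide rule for the mean can be sketched in a couple of lines of Python. The function name `arithmetic_mean` and the quiz scores are made up for illustration.

```python
def arithmetic_mean(values):
    """Add all the data values, then divide by the number of data points (n)."""
    return sum(values) / len(values)

# Hypothetical quiz scores, invented for illustration.
quiz_scores = [7, 8, 8, 9, 10]
mean_score = arithmetic_mean(quiz_scores)  # the sum is 42 and n is 5
print(mean_score)
```

Notice that the result, 8.4, is not one of the five scores. The mean uses every data value but does not have to match any of them.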

Arithmetic Mean

Page 39

Weaknesses
• Sometimes you don’t know all the data points needed to calculate the mean (data may be in a graph only)
• The mean of an extremely large data set may be difficult to calculate by hand.
• It can be influenced by outliers, those values much larger or smaller than the rest of the data.
• It is often a value that is different from any of the data values

Arithmetic Mean

When best to use
• The mean is best used when your data is continuous and symmetrical.
• It is often needed for use in other statistical measures.

Page 40

How to Find the Mean

Visit the web site above to learn more about the arithmetic mean. After reading the lesson, be sure to check your understanding by answering the ten questions at the end.

In case you missed it, be sure to check out the “mean machine”. Run this virtual machine to see the relationship between the data points and the mean value.

Lessons on Arithmetic Mean

Page 41

Wikipedia defines median

The median of a set of data is found by arranging all the data in numerical order and then selecting the data point in the middle. If the data has an even number of values, the median is the mean of the two central values.

Strengths
• Requires little if any mathematical calculation
• It is hardly affected by outliers (large or small data points)
• It can be approximated from a frequency distribution or a distribution graph
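The two cases in the definition of the median (an odd or an even number of values) can be sketched in Python as follows. The function name `median_of` is made up for this example; Python's standard library also offers `statistics.median`.

```python
def median_of(values):
    """Sort the data, then take the middle value (or the mean of the two middle values)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: one value sits exactly in the middle.
        return ordered[middle]
    # Even number of values: average the two central values.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median_of([5, 1, 3]))     # odd count
print(median_of([4, 1, 3, 2]))  # even count
```

Sorting is the only real work here, which is why the median needs so little calculation.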

Median

The web site above gives a very detailed definition of median. (Many of the examples are beyond the scope of this presentation.)

Page 42

Weaknesses
• Arranging a large set of data in order can be very difficult.

Median

When best to use
• The median is usually preferred when the data distribution is skewed
• It is used with ordinal data when the mean cannot be used

Page 43

How to Find the Median Value

Visit the following web site to learn more about the median. After reading the lesson, be sure to check your understanding by answering the ten questions at the end.

Lessons on Median

Page 44

Mean / Median Applet

Comparing Mean & Median

The link above gives you the ability to see how the mean and median change as the data points change. The applet allows you to drag new data points onto the line or move the data points already on it. Take some time to play with this applet and see how the mean and median change and compare.

You can also check the box for “box plot” to see how a boxplot would look with the data that shows on the line.

When you have finished, jot down the patterns you have observed and then return to this presentation.

Page 45

Seeing Statistics

Use the link above for a more comprehensive lesson on the attributes of, and differences between, the mean and median. The link will take you to an introduction to the web interface. When you think you are familiar with how to navigate the system, click on the icon in the left column.

When the table of contents appears, click on lesson #3, “Describing the Center”. You can advance from one page to the next by clicking on the icon in the top right corner of the page.

Return to this presentation when you finish.

Comparing Mean & Median

Page 46

Wikipedia defines mode

The mode of a set of data is found by identifying the data element that occurs most often. Many people remember this by associating the word “most” with mode.

Mode

Strengths
• Depending on the display of the data or the size of the data, it is often easy to identify
• It is the ONLY measure of center you can use for non-numeric data (nominal data). Example: What is the best measure of center for the eye color of this group of people?
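Because the mode only needs counting, it works on words as well as numbers. The short Python sketch below uses `multimode` from the standard `statistics` module (Python 3.8 or later), which returns every value tied for most frequent. The eye-color list is invented for illustration.

```python
from statistics import multimode

# Hypothetical eye colors for a small group, invented for illustration.
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue"]
modes = multimode(eye_colors)
print(modes)

# A data set can also have more than one mode.
tied = multimode([1, 1, 2, 2, 3])
print(tied)
```

You could not compute a mean or a median of eye colors, but the mode is still well defined.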

The web site above gives a very detailed definition of mode. (Many of the examples are beyond the scope of this presentation.)

Page 47

Weaknesses
• Sometimes the data set could have two modes or even more.
• Often the data does not have any data element that is more numerous than any other.
• Sometimes the mode is nowhere near the center of the data.

Mode

When best to use
• It is the only measure of center valid with nominal data (Example: data on students’ eye color)
• It can support the validity of the mean and median if it has a similar value. If the data is perfectly normal, mean = median = mode

Page 48

How to Find the Mode Value

Visit the web site above to learn more about the mode. After reading the lesson, be sure to check your understanding by answering the ten questions at the end.

Mode

Page 49

Wikipedia defines weighted mean

• Sometimes certain values in a data set contribute more to a measure of center than other values. In this situation, we calculate a weighted mean.

Weighted Mean

The web site above gives a very detailed definition of weighted mean. (Many of the examples are beyond the scope of this presentation.)

Page 50

Dr. Math explains weighted mean (weighted average)

A simple example:
• Consider a university that teaches two classes. One class has 10 students, the other has 100 students. If you ask the university for the average (mean) class size, they respond with 55: (100 + 10) / 2. However, if you ask every student what size class they are in and take the mean of the answers, you get about 91.8: [(100 * 100) + (10 * 10)] / 110.
• The 100 students in the larger class carry more WEIGHT than the 10 students in the smaller class.
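The class-size example can be checked with a short Python sketch. The function name `weighted_mean` is made up for illustration; the numbers are the ones from the example, with each class size weighted by the number of students who experience it.

```python
def weighted_mean(values, weights):
    """Multiply each value by its weight, add up, and divide by the total weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

class_sizes = [100, 10]

# The university's view: every class counts once.
per_class = sum(class_sizes) / len(class_sizes)

# The students' view: each class size is weighted by how many students sit in it.
per_student = weighted_mean(class_sizes, weights=class_sizes)

print(per_class, round(per_student, 1))
```

Both answers are correct means. They differ because they answer different questions about the same data.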

Weighted Mean

The web site above gives examples of calculating a weighted mean or weighted average.

Page 51

Weighted Mean

Visit the web site above to learn more about the weighted mean. After reading the lesson, be sure to check your understanding by answering the ten questions at the end.

Weighted Mean

Page 52

Simple Analysis of Data

Page 53

Exploring Data

Read the first page in the link above to learn about the importance of thinking carefully about how to interpret what a data set can reveal.

Simple Exploration and Analysis of Data

Measures of center, like the mean, median, and mode, give useful information about a data set, but much is hidden by such single-number summaries.

To understand the information in a complex or large data set, it is important to examine the integrity of the data, to look out for interesting and useful patterns, and to summarize the data skillfully.

Patterns are likely to be found more easily in visual, rather than numerical, representations of the data.

A single number is unlikely to summarize the data effectively.

Page 54

Outliers

Outliers are numbers in a data set that are very different from all the others. Read the information in the link above to learn more about outliers. Then work the problems at the end to test your understanding.

Data Integrity - Outliers

Why do some data sets have outliers? Have these numbers been recorded wrongly? Do they correspond to bad mistakes, e.g. in measurements?

You should have found from the problems you worked that
• Outliers don’t have much effect on the median;
• Outliers can have a big effect on the mean.

• How can we tell when a “suspicious” number is a “genuine” outlier rather than just being at the limits of what is “normal”?

• What, if anything, should we do about outliers? If we ignore them, will they make problems for how we interpret the data?

We’ll discuss some of these issues in the pages ahead.
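To see the different effect of an outlier on the mean and on the median for yourself, you can run a small experiment like the Python sketch below. The salary figures are invented for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical salaries in thousands of dollars, invented for illustration.
salaries = [30, 32, 35, 36, 40]
with_outlier = salaries + [400]  # one very unusual value joins the data set

mean_shift = mean(with_outlier) - mean(salaries)
median_shift = median(with_outlier) - median(salaries)

print(f"mean moves by {mean_shift:.1f}, median moves by {median_shift:.1f}")
```

One extreme value drags the mean far from where most of the data sits, while the median barely moves.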

Page 55

Patterns of Data

Open the link above to read about various ways to describe and identify patterns in data sets.

Patterns of Data

To begin with, we will concentrate on finding ways to measure the spread of a data set.

This will give us two numbers (a measure of center and a measure of spread) to use when we summarize a data set. Two is better than one.

The interaction of center and spread is important. If a data set has small spread, the values will be clustered closely around the center, so the center will represent the data values well.

Page 56

Range

Discover what is meant by the range of a quantitative data set by reading the explanation in the link above and working through the activities.

Range

The range of a data set is the difference between the largest and smallest values. It is the simplest measure of the spread of the data.

Strength:
• Very simple to compute.

Weaknesses:
• Very sensitive to unreliable data or outliers; it is easy for an inaccurate measurement to be much bigger or smaller than the others; this could have a big and misleading impact on the range.
• Only uses two data values, so a lot of information about the data set is lost.
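Both the strength and the main weakness of the range show up in a tiny Python sketch. The function name `data_range` and the temperature readings are made up for illustration, including the deliberately mis-recorded value.

```python
def data_range(values):
    """Difference between the largest and the smallest value."""
    return max(values) - min(values)

# Hypothetical temperature readings, invented for illustration.
readings = [61, 64, 65, 66, 68]
careless = [61, 64, 65, 66, 168]  # the last reading was typed in wrongly

print(data_range(readings), data_range(careless))
```

A single bad measurement changes the range from 7 to 107, even though the other four readings are identical.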

Page 57

Quartile Definition and Computation

Click the link above to read about quartiles and how to compute them. Then check your understanding by working the problem at the end.

Quartiles

Half the values in a data set are at least as big as the median, and half the values are no bigger than the median.
• The first (or lower) quartile Q1 is essentially the median of the lower half of the data set;
• The third (or upper) quartile Q3 is essentially the median of the upper half of the data set.

Controversy:
• There is no generally accepted agreement about how precisely to compute upper and lower quartiles. In fact, different calculators or software packages will give different results for the quartiles of the same data set.
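The Python sketch below shows one way to compute quartiles as "medians of the halves" and then compares the result with a software function. It assumes one particular convention: when the number of values is odd, the middle value is left out of both halves. The function name `quartiles_by_halves` and the data are made up for illustration; `quantiles` is from Python's standard `statistics` module (Python 3.8 or later).

```python
from statistics import median, quantiles

def quartiles_by_halves(values):
    """Return (Q1, median, Q3), leaving the middle value out of both halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]
    upper_half = ordered[-half:]
    return median(lower_half), median(ordered), median(upper_half)

# Hypothetical data set, invented for illustration.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
q1, q2, q3 = quartiles_by_halves(data)
print("halves method:", q1, q2, q3)

# The library function supports more than one convention for the same data.
print("inclusive method:", quantiles(data, n=4, method="inclusive"))
print("exclusive method:", quantiles(data, n=4, method="exclusive"))
```

Running it shows the controversy directly: the inclusive method does not report the same first and third quartiles as the halves method, so always check which convention your calculator or software uses.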

Page 58

Interquartile Range

Open the link above to learn what interquartile range (IQR) means and how to compute it. Don’t miss the embedded YouTube video. Then read more here and check your understanding by answering the ten questions at the end.

Interquartile Range

The IQR tells you about the spread of a data set through focusing on the middle 50% of the data. Its value is the length of the box in a box and whisker plot.

Advantages of the IQR as a measure of spread:
• Easy to compute;
• Much less likely than the range to be affected by outliers.

When best to use:
• When you use the median as a measure of center, the IQR is a good measure of the spread of a distribution.

Page 59

IQR and Outliers

Visit the link above to find how to use the interquartile range to identify outliers in a data set. Then try your hand at the problems in this set.

Interquartile Range and Outliers

A standard convention is that a number in a data set is an outlier if it is at least 1.5 IQRs away from the median.

It is important to understand that the choice of 1.5 IQRs is not specified by any theory. It is an arbitrary convention – but it has worked well for many years.

An outlier can easily be spotted in a box and whisker plot: it lies at the end of a whisker that is more than one and a half times as long as the box.
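
A minimal Python sketch of the 1.5 IQR check follows. It assumes the common convention of measuring from the quartiles (values more than 1.5 IQRs below Q1 or above Q3 are flagged) and the "inclusive" quartile method of `statistics.quantiles`. The function name `iqr_outliers` and the data are made up for illustration.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 IQRs outside the box."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr    # anything below this is flagged
    high_fence = q3 + 1.5 * iqr   # anything above this is flagged
    return [v for v in values if v < low_fence or v > high_fence]

data = [1, 3, 4, 6, 7, 8, 10, 12, 40]  # made-up example values
print(iqr_outliers(data))
```

With Q1 = 4 and Q3 = 10 the IQR is 6, so the cut-offs are -5 and 19, and only the value 40 is flagged.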

Page 60: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Five number summary

The link above explains how a box and whisker plot provides five numbers that conveniently summarize a data set. Five summary numbers allow us a much richer analysis of a data set than a single measure of center. Practice problems here.

Five Number Summary

The five number summary of a data set is based on a box and whisker plot. It consists of

• The biggest value;

• The third quartile;

• The median;

• The first quartile;

• The smallest value.

It provides representative information about a data set that easily leads to a measure of center and a measure of spread at the same time as making outlier detection a simple matter of checking for over-long whiskers.
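
The five numbers can be computed together, as in the Python sketch below. The function name `five_number_summary` is hypothetical, the data are made up, and the quartiles again assume the "inclusive" method of `statistics.quantiles`. The result is listed from smallest to biggest.

```python
import statistics

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, biggest)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [1, 3, 4, 6, 7, 8, 10, 12, 15]  # made-up example values
smallest, q1, median, q3, biggest = five_number_summary(data)

print("smallest:", smallest)
print("Q1:      ", q1)
print("median:  ", median)
print("Q3:      ", q3)
print("biggest: ", biggest)
print("IQR (length of the box):", q3 - q1)
```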

Page 61: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Standard Deviation

So far we have measured the spread of a data set in ways (range, IQR) that associate well with the median measure of center. Read the link above to see how to measure spread in a way (standard deviation) that is compatible with the mean measure of center. Then work the problems at the end.

Standard Deviation

The deviation of a data value from the mean is just the difference between the two.

The variance of a data set is the average of the squares of the deviations (with a slight adjustment if the data consists of a selection from all possible values.)

The standard deviation of a data set is the square root of its variance.

It is usually better to avoid computing variance or standard deviation by hand. Many calculators have these computations pre-programmed, so it is easy to get the information once the data is entered.
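
To make the three definitions concrete, the Python sketch below works through them step by step on made-up data and then checks the result against the standard library. It assumes the "slight adjustment" refers to dividing by n - 1 instead of n when the data is a sample, which is what `statistics.variance` does.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example values

mean = sum(data) / len(data)
deviations = [x - mean for x in data]      # difference from the mean
squares = [d ** 2 for d in deviations]     # squared deviations

pop_variance = sum(squares) / len(data)            # average of the squares
pop_sd = math.sqrt(pop_variance)                   # square root of the variance
sample_variance = sum(squares) / (len(data) - 1)   # adjusted for a sample

print("mean:", mean)
print("variance:", pop_variance, "standard deviation:", pop_sd)
print("sample variance:", sample_variance)

# The library functions should agree with the hand computation.
print(statistics.pvariance(data), statistics.pstdev(data), statistics.variance(data))
```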

Page 62: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Standard Deviation Video

Here’s a video to help you understand the concept of standard deviation.

Standard Deviation and Outliers

Standard deviation behaves like the average distance of the data values from their mean. It is a measure of the spread of the data set:

• When the average distance is small, many of the data values will be clustered around the mean, and the spread will be small;

• When the average distance is large, many of the data values will be far from the mean, and the spread will be large.

We viewed a data value as an outlier when it was “far away” in IQR terms from the median. We shall also label a data value as an outlier when it is far away from the mean, as measured in terms of the standard deviation.

• One convention is that an outlier is at least 3 standard deviations from the mean – but this is a matter of debate, as is what you should do about outliers.
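
A small Python sketch of the 3 standard deviation convention is shown below. The function name `sd_outliers`, the threshold parameter, and the data are made up, and using the sample standard deviation (`statistics.stdev`) is an assumption; the population version would give slightly different cut-offs.

```python
import statistics

def sd_outliers(values, threshold=3.0):
    """Return values at least `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs(v - mean) >= threshold * sd]

# Made-up data: nineteen values near 10 and one extreme value.
data = [9, 10, 11, 10] * 4 + [9, 10, 11, 50]
print(sd_outliers(data))
```

One caution with this convention: in a very small data set a single extreme value inflates the standard deviation so much that it can never reach 3 standard deviations from the mean, which is one reason the rule is debated.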

Page 63: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Anscombe's Quartet

It is important to realize that very different data sets can have identical means and standard deviations. In other words, they have identical center and spread, but look very different. The link above is to Anscombe’s famous examples.

Distinguishing Between Data Sets

Our objective is to understand and analyze our data sets. Because of Anscombe’s and similar examples, simple number summaries cannot give complete answers to our questions.

It is necessary to look for patterns in other ways.

Page 64: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Challenging Problems

Before changing direction, here are some problems that you will need to think about carefully. If you get stuck, there are solutions posted.

Challenging Problems

Page 65: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Frequency Distributions

Consult the link above as well as this one to find out about frequency distributions. Make sure you confirm your understanding by working the problems.

Frequency Distributions

• Sometimes a value occurs more than once in a data set. Its frequency is the number of times it appears in the list of values.

• By putting the values into bins if appropriate and counting up the total frequency in each bin, it is not difficult to create a frequency table that can be represented as a histogram.
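
Both steps can be sketched in a few lines of Python. The data and the bin width of 5 are made up for illustration, and the helper name `binned_frequencies` is hypothetical.

```python
from collections import Counter

data = [2, 3, 3, 5, 7, 7, 7, 8, 12, 14]  # made-up example values

# Frequency of each individual value.
frequencies = Counter(data)
print(frequencies[7])  # how many times the value 7 appears

def binned_frequencies(values, width):
    """Count values per bin; each bin is labelled by its lower edge."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Frequency table with bins 0-4, 5-9 and 10-14.
table = binned_frequencies(data, width=5)
for lower_edge, count in table.items():
    # A crude text histogram: one star per data value in the bin.
    print(f"{lower_edge:>2}-{lower_edge + 4:<2} {'*' * count}")
```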

Page 66: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Relative Frequency Distributions

To create a relative frequency distribution, we proceed as for the frequency distributions, but scale each of the frequencies by dividing by the total count of data values. This scaled frequency is the relative frequency. Read the link above and here, working the problems.

Relative Frequency Distributions

• Relative frequency distributions are actually distributions of probabilities.

• Using relative frequency distributions allows us to compare data sets of different sizes on an equal basis.

• If we selected one data set of 100 measurements and another data set of 1000 measurements by taking samples from a massive collection, there shouldn’t be too much difference between how each value compares with the others, regardless of which data set we examine.

• However, we should expect the frequency (e.g. 49) of one value in the second data set to be roughly 10 times its frequency (e.g. 5) in the first set. On the other hand, the relative frequencies should be similar (e.g. 4.9% and 5.0%).
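
The comparison in the last bullet can be reproduced with a short Python sketch. The two data sets are artificial stand-ins (labels "a" and "b" instead of real measurements), and the function name `relative_frequencies` is hypothetical.

```python
from collections import Counter

def relative_frequencies(values):
    """Map each distinct value to its frequency divided by the total count."""
    total = len(values)
    return {value: count / total for value, count in Counter(values).items()}

# Artificial data sets of 100 and 1000 measurements.
small = ["a"] * 5 + ["b"] * 95
large = ["a"] * 49 + ["b"] * 951

rel_small = relative_frequencies(small)
rel_large = relative_frequencies(large)

print("frequency of 'a':         ", small.count("a"), "vs", large.count("a"))
print("relative frequency of 'a':", rel_small["a"], "vs", rel_large["a"])
```

The raw frequencies differ by roughly a factor of 10, while the relative frequencies (0.05 and 0.049) are close. In each data set the relative frequencies add up to 1.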

Page 67: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Describing Data Patterns

The link above describes some basic patterns that can occur in frequency distributions of data. These descriptions go beyond measuring center and spread, and focus as well on shape and unusual features.

Describing Data Patterns I

• Some distributions are symmetric around the mean – which must then coincide with the median. (Why is this the case?)

• Some distributions are skewed with a tail to the right or the left.

• Some distributions have more than one mode.

Page 68: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Data Pattern Video

See whether you can use your knowledge of the previous slide to answer the questions in the YouTube video linked above.

Describing Data Patterns II

• It turns out that distributions that are symmetric around the mean/median and that have a single mode that also coincides with the mean/median are the most important of all.

Page 69: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Normal Distribution

Consult the link above for basic facts about the normal distribution and how it arises. As always, work the problems at the end. Check out the YouTube video here for ways to test carefully and systematically whether or not your data really is normal.

Characteristics of a Normal Distribution

• The continuous normal distribution has a distinctive bell shape, with the mode, mean and median all at the same place. The bell is wider when the standard deviation is greater, but the basic form is always the same.

• Approximately 68% of all values are within 1 standard deviation of the mean, 95% are within 2 standard deviations of the mean, and 99.7% are within 3 standard deviations of the mean.

• Normal distributions are continuous distributions that tend to arise in measuring heights, sizes, pressures, temperatures, and so on.
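
The 68%, 95% and 99.7% figures can be checked numerically. The Python sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later); the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Mean 0 and standard deviation 1; the shares are the same
# for any other normal distribution.
normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    return normal.cdf(k) - normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```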

Page 70: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Standard Normal Distribution

Consult the link above for basic facts about the standard normal distribution and work the problems at the end.

Standard Normal Distribution

• A normal distribution is standard if it has mean 0 and standard deviation 1.

• All normally distributed data sets can be converted simply to a standard normal distribution. If the original data set has mean μ and standard deviation σ, the new data set created by replacing the old data values x by new data values z = (x-μ)/σ will be a standard normal distribution.

• It is common and useful to convert normal distributions to standard form; this makes it much easier to make comparisons.
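
The conversion z = (x-μ)/σ is sketched below in Python. The data are made up (and far too few to look bell-shaped); the point is only that the converted values have mean 0 and standard deviation 1. The function name `standardize` is hypothetical, and the population standard deviation is assumed for σ.

```python
import statistics

def standardize(values):
    """Replace each x by z = (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example values
z_scores = standardize(data)

print(z_scores)
print("mean of z:", statistics.mean(z_scores))
print("standard deviation of z:", statistics.pstdev(z_scores))
```

A z value of 2, for example, says that the original value 9 lies two standard deviations above the mean, whatever units the data were measured in.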

Page 71: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Poisson Distribution

Consult the Wikipedia link above to seek out basic facts about the Poisson distribution. It is a discrete probability distribution that “expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.”

Characteristics of a Poisson Distribution

• A Poisson distribution is a discrete distribution that takes on values that are 0 or a positive integer.

• The mean and variance of a Poisson distribution are always the same.

• When the mean/variance is small, the Poisson distribution is skewed with a long tail to the right.

• When the mean/variance is large, the Poisson distribution closely resembles the normal distribution with the same mean and variance.
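
The claim that the mean and variance coincide can be checked numerically. The Python sketch below uses the standard Poisson probability formula P(k) = e^(-λ) λ^k / k!, which is not stated on this slide, and cuts the sum off at k = 60, an arbitrary choice that is ample for a mean of 3. The function names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def mean_and_variance(lam, max_k=60):
    """Approximate mean and variance by summing over k = 0 .. max_k."""
    probs = [poisson_pmf(k, lam) for k in range(max_k + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, variance

mean, variance = mean_and_variance(3.0)
print("mean:", mean, "variance:", variance)

# With a small mean the probabilities fall away slowly to the right.
print([round(poisson_pmf(k, 3.0), 3) for k in range(10)])
```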

Page 72: Probability and Statistics Representation of Data Measures of Center for Data Simple Analysis of Data

Binomial Distribution

Consult the Wikipedia link above to fish for basic facts about the binomial distribution. It is a discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

Characteristics of a Binomial Distribution

• A binomial distribution B(n,p) is a discrete distribution that takes on values that are 0 or a positive integer no greater than n.

• The mean is np and variance is np(1-p).

• A binomial distribution is not symmetric, except when p = ½.

• When n is large and both np and n(1-p) are not too small, the binomial distribution closely resembles the normal distribution with the same mean and variance.
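
These properties can be checked with a short Python sketch. It uses the standard binomial probability formula P(k) = C(n,k) p^k (1-p)^(n-k), which is not stated on this slide, and `math.comb` (Python 3.8 or later). The function names and the choices n = 10, p = 0.3 are made up for illustration.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def mean_and_variance(n, p):
    """Mean and variance computed directly from the probabilities."""
    probs = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(probs))
    variance = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, variance

n, p = 10, 0.3
mean, variance = mean_and_variance(n, p)
print("mean:", mean, "expected np =", n * p)
print("variance:", variance, "expected np(1-p) =", n * p * (1 - p))

# Symmetry when p = 1/2: k successes are as likely as n - k successes.
symmetric = all(
    math.isclose(binomial_pmf(k, n, 0.5), binomial_pmf(n - k, n, 0.5))
    for k in range(n + 1)
)
print("symmetric for p = 1/2:", symmetric)
```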