stat301graphical and descriptive

Upload: sarveshmishra

Post on 06-Mar-2016

7 views

Category:

Documents


0 download

DESCRIPTION

stat

TRANSCRIPT

  • Objectives 1.1 Displaying data and distributions with graphs

    p Variables p Types of variables (CIS p40-41)

    p Distribution of a variable

    p Bar graphs for categorical variables (CIS p42) p Histograms for quantitative variables (CIS p43-46)

    p Interpreting histograms (CIS p56-57) p Scatterplots

    p Time series plots

    Further reading: http://onlinestatbook.com/2/graphing_distributions/graphing_distributions.html

  • Samples, Populations and variables p A sample subset of the population. The population and subset is

    made up of individuals (these are not necessarily human). p Example: Individuals can be individual people, companies, animals or

    any object of interest.

    q A variable is any characteristic of an individual/case. A variable varies (thus the name) among individuals.

    q Example: If an individual is a person, then variables of interest can be

    this persons height, blood pressure, ethnicity, mother tongue etc.

    Further reading:

    http://onlinestatbook.com/2/introduction/variables.html

  • Sampling: demonstration p Try the Applet:

    p http://onlinestatbook.com/2/introduction/sampling_demo.html

    p There are 100 animals in this population and you are `randomly sampling from them.

    p (note you will need to activate Java to use this Applet)

  • Population and Sample: Example 1 q As we mentioned earlier, statistics is the analysis of a population

    based on a sample. An example of are presidential elections. q The population of interest are all voting adults in the US. The

    `parameter within this population of interest is the proportion of voters who vote for each candidate, looking back the most recent election this was Obama or Romney.

    q Before election day, we did not know this proportion. But opinion polls were conducted to estimate it.

    q The individuals are the people in the sample. The variable of interest is the candidate each individual will vote for. The variable can take one of two outcomes: either Obama or Romney.

    q In an opinion poll, a representative sample of electorate is taken. This sample is usually several hundred or a couple of thousand people. This is a lot smaller than the 100 million plus who are eligible to vote. However, is a representative sample is taken we can still make reliable confidence statements about the outcome.

  • Example 1: cont. p If you read the newspapers during the general election you would

    have seen reports like (taken from mid-October 2012): p Gallup writes as the top of the page: 48% of voters said they would vote

    for Obama. p At the end of this page Gallup describe how they did the survey and

    quantify their statement For results based on 2723 voters, one can say with with 95% confidence that the maximum margin of error in the sample is 3 percentage points.

    p This means that Gallup predicted that the Obama vote should lie within the interval 483% = [45,51]%, with 95% confidence.

    This is an example of a confidence statement, it says with 95% confidence we believe the true proportion of Obama voters lie in this interval. But we cannot be certain about it. p The percentage that Obama got in the end was about 50.5%. So the

    interval did contain the correct percentage. Note: we cannot predict from the interval who would have won the election

  • Population and sample: Example 2

    p In a medical check-up, often, a blood sample is taken to check whether, for example, your iron level is okay.

    p The mean iron level in your entire body is the amount of iron in your body divided by the total amount of blood. Of course, in reality this cannot be calculated. So blood samples need to be taken.

    p Iron is not distributed evenly over the body, so the iron in each blood sample will vary. The amount of iron in each blood sample is said to be `random (this is what we mean by varying from sample to sample).

    p The variable of interest is the amount of iron in each blood sample. p Given a few blood samples, the doctor will want to estimate the

    average iron level overall the body.

  • Two types of variables p Variables can be either quantitative

    p Something that takes numerical values for which arithmetic operations, such as adding and averaging, make sense.

    p Example: How tall you are, your age, your blood iron level, the number of credit cards you own.

    p or categorical.

    p Something that falls into one of several categories (if it is categorical there is no ordering in the categories). What can be counted is the count or proportion of cases in each category.

    p Example: Your blood type (A, B, AB, O), your ethnicity, who you would vote for in an election.

  • Population and sample: Example 3

    p The vital statistics of calves. p Suppose we are interested in how the weights of baby calves evolve

    over time. The weights will vary from calf to calf. It makes sense to consider the mean weight of calves.

    p But the mean would be the mean weight of calves over the entire population of calves. There are millions of calves, and it is impossible to measure the weights of all calves.

    p Instead, we can monitor the mean weight over a sample of, say, 44 calves. But the mean based on the sample (called the sample mean) will vary from sample to sample. We cannot use this as the true mean, it would be inaccurate. However, it does contain information about the true mean.

    p This data set has a whole bunch of different variables: p Weights are numerical. p Treatment the calf is given is categorical.

  • Data in a Table Tag # TRT Igg TSP Wt 0 wt 0.5 Wt 1

    30 A C 5.6 106.5 103 104 34 B F 4 86.5 79 74 42 A C 5.1 79 76 71 43 D P 4.3 85.5 82 81 44 C F 4.2 90 83 82 45 D F 4.2 97 89 89 46 D P 4.7 91 91 89 48 A F 4.5 89.5 87 82 49 C F 4.4 85 82 83

    q The above is a snapshot from the calf data.

    q Tag # corresponds to an individual.

    q Igg and TSP correspond to treatments that each calf was given. These are categorical variables.

    q Wt0.5. Corresponds to the weight at week 0.5. This a numerical variable.

    q The data for each individual/case is called an observation.

  • Population and sample: Example 4

    p Suppose we are interested in M&Ms. In particular: p The type of M&M bag (Peanut, Chocolate or Peanut butter) q The number of M&Ms in a fun size bag. q Do the number of yellow M&Ms and blue M&Ms differ? q Is there a dependence between the number of blue and yellow M&Ms?

    q Looking at just one M&M bag wont be useful, as everything varies from bag to bag. We are interested in what happens on average over the entire population of M&Ms.

    q It is impossible to observe the population. Instead we try to make confident statements based on a sample.

    q In order that the sample is reliable, it needs to be a random sample. Every bag in the population is equally likely to be drawn. q In the piece of paper given to you, please give the above

    information. q After doing the counting and writing you are free to eat them.

  • Samples p What we have done is collected a sample. The true population is the

    entire set of M&Ms. For now, the sample are all bags of M&Ms in this room.

    p The quantity of interest that we are measuring in the M&M bag is called a variable. Here the variables of interest are the type of M&Ms, the total number of M&Ms in each bag, the number of yellow M&Ms in each bag and the number of blue M&Ms in each bag.

    p Put yourself into groups of 3. For each group calculate the average number of M&Ms.

    p The M&M website state that the mean number of M&Ms in a bag is 10.

    p How does your average compare to the population mean of 10? p Are the above the same or different why?

    p The total number of M&Ms in a bag is a variable and it is random. p Therefore the average number of M&Ms over three bags is random.

  • Distribution of a variable p We previously defined a variable as a quantity in a population that

    we are interested in. p Example: Heights of people, weights of people, favorite color.

    q A variable by definition is `random in the sense that it can vary from individual to individual. q However, it is not so random in the sense that it can take any values

    (the values are all over the place). In general certain outcomes are more likely than others. q Example 1: It is more likely that adult heights lie between 5 to 6 feet

    rather than 3 to 4 feet. q Example 2: The total number of M&Ms in a bag tend to around 12.

    q The frequency of a variable to take different outcomes is known as the distribution of the variable. The distribution tells us which outcomes are more or less likely.

  • Distributions p As in most areas of statistics, the distribution of a variable is rarely

    ever known. However we can get some idea of what it `looks like by taking a sample from a population and calculating the frequency for certain outcomes

    p Of course, a whole bunch of numbers is very hard to understand. p The best course of action is to make a `plot of the distribution. p Various plotting tools can be used:

    p Frequency Histograms (most popular) p Pie chart (good for a quick look)

    p The distribution of each variable will have certain characteristics, which we should look out for when we make a plot.

    p The next few slides will be on making distribution plots based on the data.

  • Ways to plot categorical data When the variable is categorical, the categories (recall a categorical variable is where ordering does not matter) in the graph can be ordered any way we want (alphabetical, by increasing value, by personal preference, etc.)

    p Bar graphs Each category is represented by a bar with length equal to a count or percent (starting from the x-axis).

    p Pie charts The slices represent the parts of one whole.

    More reading: http://onlinestatbook.com/2/graphing_distributions/graphing_qualitative.html

  • Example: Top 10 causes of death in the United States 2006

    For each individual who died in the United States in 2006, we record the cause of death. The table above is a summary of that information.

    Remember: x% means x/100. e.g., 50% = 50/100 = and 25% = 25/100 = .

    Rank Causes of death Counts % of top 10s % of total deaths 1 Heart disease 631,636 34% 26%

    2 Cancer 559,888 30% 23%

    3 Cerebrovascular 137,119 7% 6%

    4 Chronic respiratory 124,583 7% 5%

    5 Accidents 121,599 7% 5%

    6 Diabetes mellitus 72,449 4% 3%

    7 Alzheimers disease 72,432 4% 3%

    8 Flu and pneumonia 56,326 3% 2%

    9 Kidney disorders 45,344 2% 2%

    10 Septicemia 34,234 2% 1%

    All other causes 570,654 24%

  • 0

    100,000

    200,000

    300,000

    400,000

    500,000

    600,000

    700,000

    Top 10 causes of deaths in the United States 2006

    Bar graphs

    Each category is represented by one bar. The bars height shows the count (or percentage) for that particular category. Its bottom must be at value 0.

    The number of individuals who died of an accident in

    2006 is approximately 121,000.

  • Example: Child poverty before and after government interventionUNICEF, 2005

    This is similar to an experimental design, the influence of an intervention (factor).

    All countries show an improvement after intervention.

    Still the United States and Mexico have the highest rate of child poverty among OECD (Organization for Economic Cooperation and Development) nations. (22% and 28% are under 18, resp.). Their governments, in general, do the leastthrough taxes and subsidiesto remedy the problem.

    The poverty line is defined as 50% of national median income.

  • Ways to plot quantitative data

    p The plot you use depends on what aspect of the data you want to

    understand.

    p Histograms (for plotting the distribution) A histogram breaks the range of values of a variable into bins (intervals) and displays the count or percent of the observations that fall into each bin.

    The histogram and its close relative the density will be widely used in this class. Further reading: http://onlinestatbook.com/2/graphing_distributions/histograms.html

    There are various other methods for plotting quantitive data, for a full list see:

    http://onlinestatbook.com/2/graphing_distributions/quantitative.html

  • Histograms

    The range of values that a variable can take is divided into equal size intervals (bins).

    The histogram shows the number of individual data points that fall in each interval.

    The first bar represents all states with a Hispanic percent in their population between 0% and 4.99%. The height of the column shows how many states (27) have a percent in this range.

    The last bar represents all states with a Hispanic percent in their population between 40% and 44.99%. There is only one such state: New Mexico, at 42.1% Hispanics.

  • Relative Frequency Histograms q A relative frequency histogram is when there is a percentage of the

    total on the y-axis rather than just the count. q For example, in the previous example the y-axis would be the percentage of states rather than the number of states. q The shape of a relative frequency histogram is identical to the shape

    of a regular histogram, the only difference is that the y-axis gives you the percentage (or chance) compared to the total rather than the numbers.

    q It is a useful for reading off information: q Example: In the previous example we know that 27 states have a

    Hispanic population that makes up less than 5% of the states population. But this is the same as saying that 54% -- this number is more informative -- of the states in America have a Hispanic population that is less than 5%. Often percentages are preferred as it makes comparisons easier.

  • Interpreting histograms When describing the distribution of a quantitative variable, we look for the

    overall pattern and for striking deviations from that pattern. We can describe

    the overall pattern of a histogram by its shape, center, and spread.

    Histogram with a line connecting each column to outline the shape

    Histogram with a smoothed curve that may represent all cases (if there are many more than were observed)

  • Most common distribution shapes

    p A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

    Symmetric distribution

    Complex, multimodal distribution

    p Not all histograms have a simple overall shape. It can be difficult to identify the shape when the sample size is small..

    Skewed distribution

    p A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

  • Examples of Shapes p Distributions which tend to be nearly symmetric are:

    p Heights of a particular gender. p Lengths of bird bills. p Other biological lengths.

    p Distributions which tend to be skewed: p RIGHT SKEWED: The price of houses. Many which are moderately

    priced, but the expensive ones can be extremely expensive (right skewed). In addition house prices are not usually below zero dollars.

    p LEFT SKEWED: The gestation period of baby. Common knowledge tells us that the due data is at 40 weeks. But the baby rarely arrives on the due data, and unless it is induced the gestation period is random. However, a baby born several weeks earlier can still survive but a living baby cannot be born after 43 weeks.

    p Distribution with multiple modes: p Height of adult humans (male and females populations put together).

  • p Distribution with multiple modes: p The number of M&Ms in a bag (several modes because the of the

    different types). p Anything where several subpopulation, each with their own

    characteristics are out together. p Distributions which tend to be flat (no modes):

    p Numbers in a lottery p Examples of this type of distribution are rare.

  • Alaska Florida

    Outliers An important kind of observation is an outlier. Outliers are values that lie far outside the overall pattern of a distribution. Always look for outliers and try to explain them. Do not delete them!

    The overall pattern is fairly symmetrical, but 2 states have exceptional values. Alaska and Florida have unusual representations of elderly in their populations. The most extreme values are outliers only if they stand far away from the main body of values.

  • Reasons for outliers p Outliers are observations which lie `far from the rest of the

    observations. p There can be several explanations for them:

    p They could be extreme events, which happen rarely (such as a persons height being over 7 feet, it happens but the chance of it happening is very small).

    p It could simply due to measurement error. Someone simply mixed 7 feet up with 6 feet!

    p It could be due to an inputting error on the spread sheet.

    p Whatever the reason, outliers should be identified and explained. Outliers can have an influence on the data analysis, but if they are deemed genuine removing them should be avoided. There are methods for keeping them, but reducing their impact on the analysis.

    q Later we will give a rule of thumb for identifying outliers.

  • How should a good histogram look?

    The key is the number of bins/intervals, the width of the bins and the size of the groups contained in them.

    p Not many bins with either 0 or 1 counts

    p Not overly summarized so that you lose all the information

    p Not so detailed that it is no longer a summary of the data

    rule of thumb: start with 5 to 10 bins

    Look at the distribution and refine your bins

    (There isnt a unique or perfect solution)

    In the next slide we give examples when there are too many and too few bins.

  • Not summarized enough

    Too summarized

    Same data set: length of toucan beaks

  • Role of binwidth on the distribution

    q The length of a bin (the block) is known as a binwidth. p When we make a histogram it is from a sample of the population.

    This means that we do not know what the true distribution will look like.

    p Our aim is to choose a binwidth which will give a good representation of the true unknown population histogram.

    p As illustrated on the previous slide the shape of the histogram that we do plot depends on the binwidth that we use. p If the binwidth is too small, the histogram will appear too spikey with too

    many modes (peaks). This is probably not a feature of the distribution, but due to the fact that the sample is too small to have data in all the intervals.

    p If the binwidth is too large, the histogram will appear too flat. This will smooth out features such as modes that are in the distribution.

  • Binwidth: Demonstration p Go to the Applet:

    http://onlinestatbook.com/stat_sim/histogram/index.html

    p 5 classical data sets are given. Select a data set. See what happens to the histogram when you increase and decrease the binwidth

  • IMPORTANT NOTE: Your data are the way they are. Do not try to force them into a particular shape.

    It is a common misconception that if you have a large enough

    data set, the data will eventually turn out nice and symmetrical.

    This is NOT true. Different

    variables have different shapes.

    Histogram of dry days in 1995

  • Student Beers BAC

    1 5 0.1

    2 2 0.03

    3 9 0.19

    6 7 0.095

    7 3 0.07

    9 3 0.02

    11 4 0.07

    13 5 0.085

    4 8 0.12

    5 3 0.04

    8 5 0.06

    10 5 0.05

    12 6 0.1

    14 7 0.09

    15 1 0.01

    16 4 0.05

    Scatterplots (relationships) In a scatterplot, one axis is used to represent each of the variables, and the data are plotted as points on the graph, but not connected.

  • Lab Practice p Upload the M&M data into Statcrunch or JMP. p Make a histogram of the total number of M&Ms in a bag.

    p Make a plot with bandwidth 2 p Make a plot with bandwidth 6

    p You will see that with binwidth 2 it clearly looks bi-modal, and we can clearly see that it is a mixture of Peanut/PB and Chocolate M&Ms. On the other hand with binwidth 6, it is only vaguely bi-modal.

    p Upload the calf data into Statcrunch (make sure that you use choose comma in the Delimiter options). p Make a scatter plot of the week 0 weights against the week 1 weights. p Make a scatter plot of week 0 weights against week 8 weights.

    q It appears that the birth weight has an influence on the weight of the calf at week 1 (not too surprising). But the influence appears to reduce as the weeks go by. Later we shall show that birth weight has a significant (meaning that it is not a chance occurrence) but weak influence on weight at week 8.

  • Statistical Literacy p We will discuss the following two examples:

    p http://onlinestatbook.com/2/graphing_distributions/graphing_distributionsSA.html

  • Objectives

    1.2 Describing distributions with numbers p Measures of center: mean, median (CIS p59-61)

    p Mean versus median

    p The Box plot (CIS p49)

    p Measure of spread: standard deviation (CIS p62-66)

    p Changing the unit of measurement

    http://onlinestatbook.com/2/summarizing_distributions/summarizing_distributions.html

    Adapted from authors slides 2012 W.H. Freeman and Company

  • Numerical summaries of Data p Summarizing the data with a few number can simplify comparisons

    between samples.

    p Two simple ways to describe the data is: p A number which describes its center. p A number which describes its spread.

    p These numbers will not give a true description of the distribution of the data. p For example, using these numbers we cannot identify whether it is uni-

    modal or not. p But it does give us a useful summary. p The notions we describe now will play a very important in the rest of

    this course.

  • The mean or arithmetic average To calculate the average, or mean, add all values, then divide by the number of

    cases. It is the center of mass of the histogram. We often denote it with the

    symbol (say x bar).

    Sum of heights is 1598.3 and n = 25. Dividing by 25 gives 63.9 inches.

    58.2 64.059.5 64.560.7 64.160.9 64.861.9 65.261.9 65.762.2 66.262.2 66.762.4 67.162.9 67.863.9 68.963.1 69.663.9

    Measure of center 1: the mean

    x = 1598.325 = 63.9

    x

    The mean is the same as the center in a Balance scale and smallest squared deviation: http://onlinestatbook.com/2/summarizing_distributions/what_is_ct.html

  • When the center may not be meaningful.

    Here the shape of the distribution is wildly irregular. Why?

    Could we have more than one plant species or phenotype?

    x = 69.6

    The distribution of womens heights appears coherent and symmetrical. The mean is a good numerical summary.

    x = 63.9 Height of 25 women in a class

  • Measure of center 2: the median The median (M) is the midpoint of a distributionthe number such that half of the observations are smaller and half are larger.

    1. Sort observations by size. n = number of observations

    ______________________________

    1 1 0.62 2 1.23 3 1.64 4 1.95 5 1.56 6 2.17 7 2.38 8 2.39 9 2.5

    10 10 2.811 11 2.912 3.313 3.414 1 3.615 2 3.716 3 3.817 4 3.918 5 4.119 6 4.220 7 4.521 8 4.722 9 4.923 10 5.324 11 5.6

    n = 24 n/2 = 12

    Median = (3.3+3.4) /2 = 3.35

    2.b. If n is even, the median is the mean of the two middle observations.

    1 1 0.62 2 1.23 3 1.64 4 1.95 5 1.56 6 2.17 7 2.38 8 2.39 9 2.5

    10 10 2.811 11 2.912 12 3.313 3.414 1 3.615 2 3.716 3 3.817 4 3.918 5 4.119 6 4.220 7 4.521 8 4.722 9 4.923 10 5.324 11 5.625 12 6.1

    n = 25 (n+1)/2 = 26/2 = 13 Median = 3.4

    2.a. If n is odd, the median is observation (n+1)/2 down the list

  • Left skewed Right skewed

    Mean Median

    Mean Median

    Mean Median

    Comparing the mean and the median

    The mean and the median are the same only if the distribution is symmetrical. The median is a measure of center that is resistant to

    skewness and outliers. The mean is not. If there is a skew in the data the mean tends to be pulled towards the side of the skew.

    Mean and median for a symmetric distribution

    Mean and median for skewed distributions

    Mean is center of balance

    Median splits the area in half

  • Disease X:

    Mean and median are about the same.

    Symmetric distribution

    253.393.35

    nxM

    =

    =

    =

    Multiple myeloma: 253.342.32

    nxM

    =

    =

    =

    and a right-skewed distribution

    The mean is pulled in the direction of the skewness.

    Impact of skewed data

  • The mean, median and outliers.

    p When summarizing data often the mean and median if given. We saw in the examples above that outliers can have too much of an influence on the mean, whereas it will not have a large influence on the median. p Example: What is the average of the sequence: 0,1,1,2,2,100? p Answer: 17.66.. p Example: What is the median of the sequence: 0,1,1,2,2,100? p Answer: (1+2)/2=1.5.

    p Which number do you think describes best the `center of this data. p Here we see that the mean is `pulled towards the 100, even though

    most of the data takes quite small values. p For this example the mean does not give a good idea of the center. http://onlinestatbook.com/2/summarizing_distributions/mean_median_def.html

  • Mean and Median: Demonstration p Change the distribution by clicking on the distribution and moving it

    about and see how the mean and median changes. Go to the Applet:

    http://onlinestatbook.com/2/summarizing_distributions/balance.html

    p Each time you make a histogram, think up a variable which fits the plot.

  • What should you use, when, and why?

    Arithmetic mean or median? Currently taxation is an important issue in the news. In particular the impact that taxation will have on the `typical family. How we define `typical depends on its application:

    p Mean: Although income is likely to be right-skewed, the city government wants to know about the total tax base (the total income of the population), this is proportional to average income. Thus the mean is useful here.

    p Median: On the other hand the sociologist is interested in a typical family of middle income. She uses the median to measure this to, lessen the impact on the value of her summary statistic due to extreme incomes.

    p Often in a data summary both the mean and median are given, and it left to the reader to decide which is the most representative.

  • Lab Practice q Load the M&M data into Statcrunch.

    q Summarizing the data and groups within the data:

    q If you go to Stat, Summary Stat, Column, you can choose which variables you want to summarize and even summarize it according to type. For example, by choosing group in the Group by option we can find the mean number of M&Ms in a Peanut bag, Peanut butter and Plain chocolate bag.

    q You will see a `clear difference in the means of the peanut and plain chocolate. Is there an obvious difference in the means of the peanut butter and the peanut?

    q The summary option gives the mean, median and many other quantities which we will describe next.

  • Measuring spread p A measure of center alone can be misleading, see

    http://onlinestatbook.com/2/summarizing_distributions/variability.html

    p Real Example: Two nations may have the same mean income, but very different extremes in terms of wealth and poverty. p For example, one nation may have generous social security (which is

    paid for by high taxes for the wealthy). This would imply that the mean income is good, with very few in poverty but also not so many people who are super rich.

    p On the other hand a country may be a `rising economy. In other words, there are many in extreme poverty and also many who are extremely rich. This would give an average income which is good.

    p These differences can be nicely illustrated with a histogram. p Both countries have the same center of distribution but very different

    spreads in the income. p Demonstration which compares the different histogram/densities:

    http://onlinestatbook.com/2/summarizing_distributions/spread_sim.html

  • M = median = 3.4

    Q1= first quartile = 2.2

    Q3= third quartile = 4.35

    1 1 0.62 2 1.23 3 1.64 4 1.95 5 1.56 6 2.17 7 2.38 1 2.39 2 2.5

    10 3 2.811 4 2.912 5 3.313 3.414 1 3.615 2 3.716 3 3.817 4 3.918 5 4.119 6 4.220 7 4.521 1 4.722 2 4.923 3 5.324 4 5.625 5 6.1

    Measure of spread 1: the quartiles

    The first quartile, Q1, is the value in the

    sample that has 25% of the data at or

    below it ( it is the median of the lower

    half of the sorted data, excluding M).

    The third quartile, Q3, is the value in the

    sample that has 75% of the data at or

    below it ( it is the median of the upper

    half of the sorted data, excluding M).

  • M = median = 3.4

    Q3= third quartile = 4.35

    Q1= first quartile = 2.2

    25 6 6.124 5 5.623 4 5.322 3 4.921 2 4.720 1 4.519 6 4.218 5 4.117 4 3.916 3 3.815 2 3.714 1 3.613 3.412 6 3.311 5 2.910 4 2.89 3 2.58 2 2.37 1 2.36 6 2.15 5 1.54 4 1.93 3 1.62 2 1.21 1 0.6

    Largest = max = 6.1

    Smallest = min = 0.6

    Disease X0

    1

    2

    3

    4

    5

    6

    7

    Year

    s unt

    il dea

    th

    Five-number summary and boxplot BOXPLOT

    Interquartile range Q3 Q1

    4.35 2.2 = 2.15

  • This is the boxplot of the calve weights as they evolve over time (taken every week for 8 weeks, the weights are in pounds). We observe that from birth, here seems to be a small drop in weight (up to 2 weeks), then a steady increase, where both the median and spread (IQR) increase over time.

  • Identifying suspected outliers using quartiles and the Boxplot

    p The boxplot is often used to identify outliers.

    p IQR: The interquantile range is the difference between the first and third Quartile.

    q An observation is identified as a suspected outlier if it lies 1.5IQR above the third quartile or 1.5IQR below the first quartile. q A nice illustration is given in Figure 1.18, page 37 of IPS. Here the

    top line is the 1.5 above the third quartile (1.5 below the first quartile make no sense, since this means negative phone times). This allows us to identify phone calls which were `excessively long.

  • Statistician are obsesses with averages. The most common measure of spread is the variance which is the average spread about the mean:

    deviation value x= 1. First calculate the deviations.

    2s s=

    3. Finally, take the square root to get the standard deviation s.

    Measure of spread 2: the standard deviation

    Mean 1 s.d.

    x2 sum of squared deviations

    1s

    n=

    2. Then calculate the variance s2.

  • The standard deviation s is used to describe the variation around the mean. Like the mean, it is not resistant to skew or outliers.

    deviation value x= 1. First calculate the deviations.

    2s s=

    3. Finally, take the square root to get the standard deviation s.

    Measure of spread 2: A calculation

    2 sum of squared deviations1

    sn

    =

    2. Then calculate the variance s2.

    Numerical Example:

    q We observe: 1,3,4,5,8

    q The sample mean is 4.2

    q Deviation of data from mean is -3.2 -1.2 -0.2 0.8 3.8 q Squared deviation is 10.24 1.44 0.04 0.64 14.44 q Average of squares is: 6.7.

    q 6.7 is too large it almost covers the entire data set. This is because we squared the deviation. We standardize by taking the root s = 2.58 = 6.7

  • Variance and Standard Deviation p Why do we square the deviations?

    p The sum of the non-squared deviations from the mean is always zero. p The sum of the squared deviations of any set of observations from their

    mean is the smallest that the sum of squared deviations from any number can possibly be.

    p Why do we emphasize the standard deviation rather than the variance? p s has the same unit of measurement as the original observations, and so

    is a natural measure of spread. p s is the best measure of spread for Normal distributions.

    p Why do we average by dividing by n 1 rather than n in calculating the variance (a statistical quirk)? p The sum of the deviations is always zero, so only n 1 of the squared

    deviations can vary freely. Variance calculation demonstration: http://onlinestatbook.com/2/summarizing_distributions/variance_est.html

  • Properties of Standard Deviation p Usually, s > 0.

    s = 0 only when all observations have the same value and there is no spread. Eg. The data is 1, 1, 1, 1, 1. The sample mean is 1. The difference about the mean is 0, 0, 0, 0, 0. Thus s = 0.

    p s has the same units of measurement as the original observations.

    p s measures spread about the mean.

    p `Most of the data is between two standard deviations of the mean:

    By most we mean at least 65% and often more than 90% of the data.

    p The more standard deviations it is from the mean, the more `extreme it is. We calculate the number of standard deviations using the z-score (in a few slides time).

    p s is not resistant to outliers or skewness. That is, a few extreme values can change s considerably.

    2 and 2 .x s x s +

  • Disease X:

    Symmetric distribution

    253.391.48

    nxs

    =

    =

    =

    Multiple myeloma: 253.343.17

    nxs

    =

    =

    =

    and a right-skewed distribution

    Visualizing mean and std. dev.

    x2x s 2x s+

    x 2x s+x s x s+

    x s x s+

    3x s+

    Data mostly within 1 st. dev. of the mean and nearly all within 2 st. dev.

    Data to the left of the mean are more bunched together than data to the right of the mean.

  • Changing the unit of measurement Variables can be recorded in different units of measurement. Most often, one measurement unit is a linear transformation of another measurement unit: xnew = a + bxold.

    Temperatures can be expressed in degrees Fahrenheit or degrees Celsius. TempFahrenheit = 32 + (9/5)* TempCelsius a + bx.

    Linear transformations do not change the basic shape of a distribution (skew, symmetry, multimodal). But they do change the measures of center and spread:

    p Multiplying each observation by a positive number b multiplies both measures of center (mean, median) and spread (s, IQR) by b.

    p Adding the same number a (positive or negative) to each observation adds a to measures of center (mean, median) and to quartiles but it does not change measures of spread (s, IQR).

  • The new mean and standard deviation in a linear transformation

    p Consider the transformation Fahrenheit = 32+(9/5)Celsius. p The new mean after transformation is = 32 +(9/5)old mean. p The standard deviation after transformation is = (9/5)old standard deviation.

    q In general if the transformation is Y = a + bX p The new mean after transformation is = a +bold mean. p The standard deviation after transformation is = |b|old standard deviation (remember to use the absolute of b, the standard deviation can never be negative). To see why this is true, look at the real Fahrenheit to Celsius example given in: http://onlinestatbook.com/2/summarizing_distributions/transformations.html

  • Examples q Examples 1: 5 students are 19, 19, 20, 21, 22 years old. Their mean

    age is 20.2 years and their standard deviation is 1.3 q Question: In 10 years time what will be their mean and standard

    deviation? q Answer: a = 10 and b =1. Their mean age will be 20.2+10 = 30.2 but the

    amount of variation remains the same, 1.3 1 = 1.3 (the data has just shifted to the right).

    q Examples 2: 5 children are 0.5, 1.5, 2, 3.2, 3.8 years old. The mean

    age is 2.2 years and their standard deviation is 1.32. q Question: Suppose we convert years into months, what is their age and

    standard deviation? q Answer: a = 0 and b = 12. There are 12 months in a year. So the mean

    age in months is 2.2 12 = 26.4 and the standard deviation is 1.32 12 = 15.84.

    q Because the units have changed (from years to months), it `looks like the mean and standard have increased. Of this is not the case, it is simply that a different unit of measurement has been used.

  • p Example 3: Suppose we observe the data 1, 3, 4, 5, 6. It has the sample mean 3.8 and sample standard deviation 1.92. p Question: The data is transformed with the transformation y = -2x. What

    is the new sample mean and standard deviation? p Answer: The transformed data is -12, -10, -8, -6, -1. The new mean is -23.8 = -7.6 and the new standard deviation is |2|1.92 = 3.84. Observe that the average and spread has seemingly increased, because we have multiplied the data by factor 2. Furthermore, the spread of the data stays POSITIVE even though the data has been multiplied by a negative factor.

  • Lab practice p Load the calf weight data into Statcrunch. p The weights at birth are in pounds our aim is to convert them into kg

    and look at the corresponding mean and standard deviation. p We can convert them into kgs. Note there are 0.435kg to a point.

    p Go to Data, Transform Data, type 0.453*[include here the column The weight of interest by highlighting the column and pressing add column] q Click on calculate. This will create a new column (with the weight now in

    kgs) at the end of the table. q This gives you the equation for converting the pounds into kgs. q By making a plot of the weights in kgs and pounds we see both

    histograms have the same weight. Only the histogram for weights in kgs is narrower and closer to zero (which make sense, as the transformation squashes the data: 0.435pounds).

  • What unit to choose? p As illustrated on the previous slide there are so many variables which

    can be measured with different units of measurements. This makes comparisons extremely difficult: p Temperatures: Do we use Fahrenheit or Celsius? p Length: Do we use Miles, meters of kilometers? q Weight: Do we use Pounds, or kilograms? q When buying, dollars or Euros?

    q It is clear that there is a large amount of ambiguity on which unit of measurement one should use for this and many other situations. p In general we choose the unit of measurement which gives the least number

    of zeros. Why: Well if one asked the distance between University and your house, it is much easier to say 5km than 5000 meters.

    p It will sound silly, but always the units of measurement should be given whenever data is collected.

    p Why? Well here is good example: http://en.wikipedia.org/wiki/Mars_Climate_Orbiter - Cause_of_failure

    p In the Mars Mission, some of the software was written using imperial units, while others used metric units. The cost of this mix-up was millions of dollars and wasted time.

  • q We should just choose one unit of measurement to make comparisons. But even this can cause diplomatic problems. q In the European Union, one regularization (designed to make

    comparisons fair) was that all weights should be stated in kilograms. However, there were a whole bunch of people who objected to this, claiming it took away their freedom and wanted to use pounds instead (these people were nicknamed the metric martyrs). Fortunately, metrication of the UK is pretty much complete.

    q Even in the US it is slowly creeping through, though the resistance amongst the older generation is strong.

    q In statistics, there is a simple solution for making comparisons, which avoids selecting a particular unit of measurement. q This is by making a z-transform, which is described next.

  • A special linear transformation that we often use is the z-score of a measurement. The z-score standardizes measurements so they can be compared across different distributions. Specifically, using it makes them have a mean equal to 0 and a standard deviation equal to 1. z score = (x mean) /st. dev.

    q Example: Often we want to compare temperatures in different regions of the World. For example, the temperatures between Manchester (UK) and College Station (USA). It does not make much sense comparing their raw temperature, it is guaranteed that College Station is a lot hotter. But we can make a comparison against what is normal in that part of the world. For example, during September on average it is 18 degrees Celsius in Manchester with standard deviation (spread) 3 degrees, in College station on average it is 90 Fahrenheit with standard deviation 10 degrees.

    z-scores

  • q Today it is 15 degree Celsius in Manchester and 95 Fahrenheit in College Station. Which place is experiencing the more extreme weather relative to the usual weather? p By making a z-transform we can compare how many standard

    deviations from the mean each measurement is and thus how extreme each of the temperatures are: q Making a z-score of the Manchester temperature we see that the z-score

    is z=(15-18)/3 = -1 standard deviations. That is 1 standard deviation to the left of the mean.

    q Whereas the z-score of the College Station temperature we see that it is z=(95-90)/10 = 0.5 standard deviations to the right of the mean.

    q This suggests that the weather in Manchester today is more extreme than the weather in College Station.

    q Observe the z-transform we have made is free of units of measurements. Since it is centered about zero and has standard deviation one.

  • p Calculating the z-score: The depths of 212 earthquakes with magnitude at least 6.0 were recorded. The mean depth is 65.73km and and the standard deviation is 125.65km. One of the larger observations in the data set is 576.8, which is

    4.06 standard deviations above the mean. p Calculating depth: Suppose we have z = 0.363 (the observation is 0.363 standard deviations below the mean). We can obtain the original variable (depth) by reversing the z-score calculation.

    Example: z-scores

    576.8 65.73 4.06.125.65

    x xzs

    = = =

    125.65 ( 0.363) 65.73 20.11.x s z x= + = + =

    REMEMBER: If the observation is to the right of the mean, the z-score is positive. If the observation is to the left of the mean, the z-score is negative.

  • Calculation Practice p Suppose we transform the variable Y = 5 + 6X, where the variable X

    has mean 2 and standard deviation 3. p What is the mean and standard deviation of the transformed variable Y.

    p Consider the z-score Z = (X mean) / st. dev. p Show the z-score has mean zero and standard deviation one. p Sketch a histogram of X and Z, to illustrate the above.

    p Consider the data set 1, 3, 10, 11, 12. p Question: Calculate its mean and standard deviation

    Answer: It is 7.4 and 5.03 respectively. p Using the mean and standard deviation calculate the z-scores for this

    data set. The data set gets transformed to -1.27, -0.87, 0.52, 0.72, 0.91. This transformed data set has mean zero and standard deviation 1. In fact all data after being transformed with a z-transform has mean

    zero and standard deviation one. This is why we often say the data has been standardized.

  • Properties of z-transform p We now summarize important properties of the z-transform Suppose we observe the data x1, x2, . . . , xn which has sample mean

    x = 1nPn

    i=1 xi and standard deviation s =q

    1n1Pn

    i=1(xi x)2.The z-transform is

    zi =xi x

    s.

    The average and standard deviation of z1, . . . , zn is 0 and 1 respectively.To see why we use the properties of linear transforms. We observe that the

    z-transform can be written as zi = a + bxi, where a = x/s and b = 1/s. Werecall that the new average is simply xnew = a + bxold and the new standarddeviation is snew = |b|sold. Applying this to the z-transform gives

    z = a+ bxold = xs+

    1

    sxold = x

    s+

    1

    sx = 0

    sz = |b|sold = 1ssold =

    s

    s= 1.

  • Statistical literacy p Look at the following example, what do you think?

    http://onlinestatbook.com/2/summarizing_distributions/summarizing_distributionsSA.html

  • Accompanying problems associated with this Chapter

    p Quiz 2 p Quiz 3 p Homework 1