New York University · pages.stern.nyu.edu/~djuran/part1s.doc


Part I: Introduction to Data Analysis

The true foundation of theology is to ascertain the character of God. It is by the aid of Statistics that law in the social sphere can be ascertained and codified, and certain aspects of the character of God thereby revealed. The study of statistics is thus a religious service.
-- Florence Nightingale (1820-1910)

Statistical Thinking is understanding variation and how to deal with it. In this course we explore methods for moving as far as possible to the right on this continuum:

Ignorance --> Uncertainty --> Risk --> Certainty

We will also stress tools and skills for communicating variation and risk-related ideas and conclusions to other people. In other words, we view statistics as an important part of a managerial tool kit, aimed at making good decisions on the basis of solid scientific information.

Types of Data

Categorical vs. Numerical
Discrete vs. Continuous

Nominal Data are the weakest type of measurement for statistical methods. They can be numbers, but really are just names or labels (not quantities). Same as Categorical.

Ordinal Data are numbers that, by their size, rank or order observations on some basis. The intervals between these numbers, and their ratios, are meaningless.

Interval Data also rank observations according to some dimension, but the interval or distance between observations has a constant meaning. Readings on the Fahrenheit temperature scale are examples of interval data; the zero point is somewhat arbitrary, but a difference of, say, ten degrees means the same thing everywhere on the scale. We can do addition and subtraction with interval data, but not multiplication or division.

Ratio Data are the most useful type for statistical analysis. Ratio data are numbers which by their size rank observations in order of importance and between which intervals as well as ratios are meaningful. All types of arithmetic operations can be performed with ratio data.
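The distinction between interval and ratio data can be illustrated with a short sketch. The two temperature readings below are illustrative assumptions, not data from these notes; the point is only that a ratio of Fahrenheit readings changes when the same temperatures are expressed on a scale with a true zero (Kelvin), while a difference keeps its meaning.

```python
# Interval vs. ratio data: a ratio of Fahrenheit readings is not meaningful,
# because the zero point of the Fahrenheit scale is arbitrary.

def fahrenheit_to_kelvin(deg_f):
    """Convert Fahrenheit (interval scale) to Kelvin (ratio scale, true zero)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

warm_f, cool_f = 80.0, 40.0      # illustrative readings (assumed, not from the notes)

difference_f = warm_f - cool_f   # differences are meaningful on an interval scale
naive_ratio = warm_f / cool_f    # arithmetic works, but "twice as hot" is not a valid claim
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cool_f)

print(f"Difference: {difference_f:.0f} degrees F")
print(f"Naive Fahrenheit ratio: {naive_ratio:.2f}")
print(f"Ratio on the Kelvin scale: {kelvin_ratio:.3f}")
```

The Fahrenheit ratio and the Kelvin ratio disagree, which is why multiplication and division are reserved for ratio data.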

B01.1305 2 Prof. Juran


Example 1
1998 New York Yankees Roster

No.  Last        First     Position  Bats  Throws  Ht.   Wt.  Born
2    Jeter       Derek     Infield   R     R       6-3   185  6/26/74
11   Knoblauch   Chuck     Infield   R     R       5-9   170  7/7/68
14   Irabu       Hideki    Pitcher   R     R       6-4   240  5/5/69
18   Brosius     Scott     Infield   R     R       6-1   202  8/15/66
19   Sojo        Luis      Infield   R     R       5-11  175  1/3/66
20   Posada      Jorge     Catcher   S     R       6-2   205  8/17/71
20   Davis       X-Chili   Outfield  S     R       6-3   220  1/17/60
21   O'Neill     Paul      Outfield  L     L       6-4   215  2/25/63
22   Bush        Homer     Infield   R     R       5-10  175  11/12/72
24   Martinez    Tino      Infield   L     R       6-2   210  12/7/67
25   Girardi     Joe       Catcher   R     R       5-11  195  10/14/64
26   Hernandez   Orlando   Pitcher   R     R       6-2   190  10/11/69
26   Spencer     Shane     Outfield  R     R       5-11  210  2/20/72
27   Lloyd       Graeme    Pitcher   L     L       6-7   234  4/9/67
28   Curtis      Chad      Outfield  R     R       5-10  185  11/6/68
29   Stanton     Mike      Pitcher   L     L       6-1   215  6/2/67
31   Raines      Tim       Outfield  S     R       5-8   186  9/16/59
33   Wells       David     Pitcher   L     L       6-4   225  5/20/63
36   Cone        David     Pitcher   L     R       6-1   190  1/2/63
39   Strawberry  Darryl    Outfield  L     L       6-6   215  3/12/62
40   Holmes      X-Darren  Pitcher   R     R       6-0   202  4/25/66
42   Rivera      Mariano   Pitcher   R     R       6-2   168  11/29/69
43   Nelson      X-Jeff    Pitcher   R     R       6-8   235  11/17/66
46   Pettitte    Andy      Pitcher   L     L       6-5   235  6/15/72
51   Williams    Bernie    Outfield  S     R       6-2   205  9/13/68
54   Borowski    Joe       Pitcher   R     R       6-2   225  5/4/71
55   Mendoza     Ramiro    Pitcher   R     R       6-2   154  6/15/72
58   Jerzembeck  Mike      Pitcher   R     R       6-1   185  5/18/72


Example 2
a. Ballard Power Systems, Inc. stock has risen in price by $107 per share in five years.
b. Ballard Power Systems, Inc. stock has risen in price from $8 to $115 per share in five years.

Operational Definitions
An important concept, perhaps difficult to measure (e.g. the overall health of the U.S. equity market), is often operationalized with an easy-to-measure proxy (e.g. the Dow Jones Industrial Average).

Sampling
One of the fundamental principles of statistics is that we can learn a great deal about a complete population of data by looking at a smaller subset, or sample, from the population.

Statistics
We will learn to represent an entire population (or sample) of numbers with one or more statistics, which attempt to summarize many numbers with one number. The most important types of these summary measures are:
measures of central tendency (i.e. averages),
measures of dispersion (i.e. ranges, quartiles, standard deviations),
measures of association (i.e. coefficients of correlation and covariance),
measures of relative distance (i.e. z-stats and t-stats), and
measures of probability or risk (i.e. p-values).


Getting Started in Microsoft Excel

Frequency Distribution

Focus          Count
Consulting     276
Manufacturing  98
Financial      69
Other          57

Percentage Distribution

Focus          Count  Percent
Consulting     276    55.20%
Manufacturing  98     19.60%
Financial      69     13.80%
Other          57     11.40%
Total          500    100.00%
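As a rough sketch of how the percentage distribution follows from the frequency distribution, each count is divided by the total. The notes do this in Excel; the Python below is an assumed equivalent, and the variable names are illustrative.

```python
# Turn the frequency distribution of MBA graduate focuses into percentages.
counts = {"Consulting": 276, "Manufacturing": 98, "Financial": 69, "Other": 57}

total = sum(counts.values())
percentages = {focus: 100.0 * n / total for focus, n in counts.items()}

for focus, n in counts.items():
    print(f"{focus:<14}{n:>5}{percentages[focus]:>9.2f}%")
print(f"{'Total':<14}{total:>5}{sum(percentages.values()):>9.2f}%")
```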

Graphs and Charts

History

Johann Heinrich Lambert (1728-1777) was a Swiss-German scientist and mathematician. He is generally recognized as the inventor of the time series graph, in which the values of some variable of interest are plotted against the vertical axis and time is plotted on the horizontal axis.

William Playfair (1759-1823) was a Scottish political economist. He advocated the use of charts instead of tables of data, because "a man who has carefully investigated a printed table, finds, when done, that he has only a very faint and partial idea of what he has read". Playfair also invented the bar graph.

Florence Nightingale (1820-1910) was a British Army nurse in the Crimean War (1854). She used graphical tools to convince army officers to improve conditions in military hospitals. In 1860 she offered to fund a chair in applied statistics at Oxford, and was turned down.

Edward Tufte (1942- ) is a professor of political science, statistics, and computer science at Yale. He has written several excellent books about statistics and graphic design.

Personal Computers and Integrated Software, such as the Microsoft Excel, PowerPoint, and Word programs used by most students in this class, have greatly simplified the creation of graphs and their use in documents and multimedia presentations. An unfortunate side effect has been to limit people's creativity in creating graphs.


Types of Charts

Frequency Bar Chart
[Figure: bar chart "Focuses of 500 MBA Graduates"; horizontal axis: Focus (Consulting, Manufacturing, Financial, Other); vertical axis: Frequency, 0 to 300]

Percentage Bar Chart
[Figure: bar chart "Focuses of 500 MBA Graduates"; horizontal axis: Focus (Consulting, Manufacturing, Financial, Other); vertical axis: Percent, 0% to 60%]

Pie Chart
[Figure: pie chart "Focuses of 500 MBA Graduates"; Consulting 55%, Manufacturing 20%, Financial 14%, Other 11%]


Pareto Diagram

Defect Type  Cost        Cumulative Cost  Cumulative %
DeBurr       $8,181.25   $8,181.25        52.5%
Cut          $5,950.00   $14,131.25       90.7%
Engrave      $848.75     $14,980.00       96.2%
Grind        $446.25     $15,426.25       99.0%
Weld         $148.75     $15,575.00       100.0%

[Figure: Pareto diagram "5 Types of Manufacturing Defects"; horizontal axis: Defect Type (DeBurr, Cut, Engrave, Grind, Weld); left vertical axis: Cost ($), $0 to $14,000; right vertical axis: Cumulative %, 0% to 100%]
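The cumulative columns of a Pareto table can be reproduced with a short sketch: sort the categories from largest to smallest cost, keep a running total, and divide by the grand total. The code below is an assumed Python equivalent of the spreadsheet calculation, using the defect costs given above.

```python
# Build the Pareto table: categories sorted by cost, with cumulative cost and percent.
costs = {"DeBurr": 8181.25, "Cut": 5950.00, "Engrave": 848.75,
         "Grind": 446.25, "Weld": 148.75}

ordered = sorted(costs.items(), key=lambda item: item[1], reverse=True)
total = sum(costs.values())

rows = []
running = 0.0
for defect, cost in ordered:
    running += cost
    rows.append((defect, cost, running, 100.0 * running / total))

for defect, cost, cumulative, cumulative_pct in rows:
    print(f"{defect:<8}{cost:>12,.2f}{cumulative:>12,.2f}{cumulative_pct:>8.1f}%")
```

The first two categories account for roughly 90% of the total cost, which is the kind of concentration a Pareto diagram is meant to expose.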


Histogram
[Figure: histogram "Salary Distribution - 500 MBAs"; horizontal axis: Salary, in $10k classes from 0-10k to 170-180k; vertical axis: Frequency, 0 to 100]

Scatter Plot
[Figure: scatter plot "Education vs. Income"; horizontal axis: Education (Years), 0 to 25; vertical axis: Income ($), $0 to $140,000]


Here is a time-series graph from March, 1999, showing the growth of the Dow Jones Industrial Average during the 1990s. Note how the minimum value on the vertical axis has been set to accentuate the Dow's growth — a mild example of lying with charts.


Two graphs that appeared in the H.J. Heinz Company’s 2003 proxy statement:


Here is another example of lying with charts. The proportion of the number of titles in the Barnes and Noble database to the number in Amazon's is evidently 8,000,000 to 4,700,000, or about 170%. But this one-dimensional relationship is distorted in the two-dimensional graph. The area of Barnes and Noble's black bar is 2700 square centimeters, while the area of Amazon's gray bar is 800 square centimeters. This gives the visual impression that the proportion of titles is more like 340%. The distortion is augmented by the choice of color: Barnes and Noble looks bold, clear and strong, while Amazon looks washed-out, pale, and weak.


Here's an example of a graphical technique that you can't do with Excel. In this NEW YORK TIMES map of Kosovo, colors and shapes are used creatively to communicate complicated quantitative information simply and clearly (e.g. the volume and direction of refugee movements over time).


Two interactive web graphs, from MSNBC.COM and from Yahoo Finance:

Juran's version of the same Dow Jones Industrial Average data:
[Figure: time-series graph "5-year DJIA Daily Closing Values"; horizontal axis: Jun-98 to Dec-02; vertical axis: 0 to 14,000]


Juran's Suggestions for Good Charts

General
Label all axes with the variable name and units.
Don't use a legend for univariate charts (charts with only one variable).
Put the dependent variable on the vertical (Y) axis and the independent variable on the horizontal (X) axis. (We will discuss dependent and independent variables in greater detail later in the course.)
Let horizontal and vertical axes start at zero unless you have a good reason not to.
Keep your scales, colors, patterns, and symbols consistent.
Eschew fancy effects that do not contribute to the reader's understanding (e.g. 3-D effects, distracting colors or patterns, etc.).
Watch your ink-to-information ratio (see Tufte).
Keep it simple. Don't present data that aren't central to the point you are making.
Don't rely on the reader to infer the point of your chart; state your point explicitly in the text.

Pareto Charts
Let the left vertical axis show the values for the various categories, and be scaled so the maximum value corresponds to the total of all categories.
Let the right vertical axis show the cumulative percent, and be scaled so that the maximum value is 100%.

Histograms
Don't let Excel decide what values to use for the class boundaries (a.k.a. bin or bucket boundaries). Specify them yourself.
The proper number of classes is subjective; try to use between six and ten.
Don't use the upper class boundary as the category label on the X-axis. Use the class midpoint to avoid confusion.
The default Excel column chart has gaps between the columns; these make a histogram harder to read. Double-click on one of the columns, select "Options", and reduce the gap width to zero.
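The histogram suggestions (choose the class boundaries yourself, use six to ten classes, label each class by its midpoint) can be sketched outside Excel as well. The salary values and the `histogram` helper below are hypothetical, introduced only for illustration.

```python
# Histogram counts with explicitly chosen class boundaries, labelled by class midpoint.

def histogram(values, lower, width, n_classes):
    """Count values in n_classes equal-width classes [left, right) starting at `lower`.

    Returns a list of (class midpoint, count) pairs. Values outside the
    chosen classes are ignored.
    """
    counts = [0] * n_classes
    for v in values:
        k = int((v - lower) // width)
        if 0 <= k < n_classes:
            counts[k] += 1
    return [(lower + (k + 0.5) * width, counts[k]) for k in range(n_classes)]

# Hypothetical starting salaries in $1000s (not data from the notes).
salaries = [42, 48, 55, 61, 63, 67, 70, 72, 75, 78, 81, 84, 88, 93, 105]

# Seven classes of width 10, starting at 40: 40-50, 50-60, ..., 100-110.
bins = histogram(salaries, lower=40, width=10, n_classes=7)

for midpoint, count in bins:
    print(f"{midpoint:>6.0f}k  {'#' * count}")
```

Fixing `lower`, `width`, and `n_classes` by hand corresponds to specifying the bin boundaries yourself rather than accepting a default.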


Descriptive Statistics

Measures of Central Tendency

1) Average or Arithmetic Mean.

Example: The annual salaries (in $1000s) of the seven employees of a small government department are as follows:

48, 90, 46, 42, 40, 46, 49.

The mean is:

\[ \mu = \frac{48 + 90 + 46 + 42 + 40 + 46 + 49}{7} = \frac{361}{7} = 51.571 \]

The mean salary is therefore $51,571. We use the Greek letter mu (\(\mu\)) to symbolize the mean.

Notation: We will sometimes use a mathematical shorthand notation called Summation Notation. It is easy to use and should not scare anyone; ask for help if you need it. If we have 7 data points, we can abstractly write these numbers X1, X2, ..., X7 (where X1 = 48, X2 = 90, ..., X7 = 49). Then we write the average of N = 7 numbers as:

\[ \text{Average of } (X_1, X_2, \ldots, X_N) = \mu = \frac{X_1 + X_2 + \cdots + X_N}{N} \]

We can also write the average:

\[ \mu = \frac{1}{N} \sum_{i=1}^{N} X_i \]

where

\[ \sum_{i=1}^{N} X_i = X_1 + X_2 + \cdots + X_N . \]

Here 48 + 90 + ... + 49 = 361, so the average or mean is 361/7 = 51.571 or $51,571.
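The summation notation maps directly onto code: add up the values, then divide by how many there are. A minimal sketch using the seven salaries from the example (variable names are illustrative):

```python
# Arithmetic mean of the seven salaries (in $1000s).
salaries = [48, 90, 46, 42, 40, 46, 49]

N = len(salaries)      # number of data points
total = sum(salaries)  # the sum of X_i for i = 1..N
mu = total / N         # the mean

print(f"N = {N}, sum = {total}, mean = {mu:.3f}")
```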


2) Median

The median of a data set is the "middle" value: the value such that half of the population lies above it and half lies below it.

To find the median salary, first arrange the salaries in ascending order:

40, 42, 46, 46, 48, 49, 90.

The median salary is the middle value. In this case, it is $46,000, which (at least here) seems more representative of a typical salary than the mean value ($51,571).

This worked nicely because we had an odd number of observations. Suppose we want to find the median of the following:

48, 90, 46, 42, 40, 46, 49, 51.

For an even number of observations, the median is the average of the two middle values. In this case, that is the average of 46 and 48, or $47,000.

3) Mode

The mode of a data set is the "most popular" value, or the value with the highest frequency.

Example: The manager of a men's store observes that the 10 pairs of trousers sold yesterday have the following waist sizes (in inches): 31, 34, 36, 33, 28, 34, 30, 34, 32, 40. The mode of these waist sizes is 34 inches, and this fact is undoubtedly of more interest to the manager than are the facts that the mean waist size is 33.2 inches and the median is 33.5 inches.
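The median and mode examples can be checked with Python's standard `statistics` module. This is only one possible way to do the calculation; the notes themselves work by hand and in Excel.

```python
import statistics

# Median with an odd and an even number of observations (salaries in $1000s).
salaries_odd = [48, 90, 46, 42, 40, 46, 49]
salaries_even = salaries_odd + [51]

median_odd = statistics.median(salaries_odd)    # the single middle value
median_even = statistics.median(salaries_even)  # average of the two middle values

# Mode, mean and median of the trouser waist sizes (inches).
waists = [31, 34, 36, 33, 28, 34, 30, 34, 32, 40]
waist_mode = statistics.mode(waists)
waist_mean = statistics.mean(waists)
waist_median = statistics.median(waists)

print(median_odd, median_even)
print(waist_mode, waist_mean, waist_median)
```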


Measures of Dispersion

1) Range = maximum value - minimum value.

In the salary example above, the range is 90 - 40 = 50 (that is, $50,000).

2) Quartiles, Interquartile Range

Top 20 Domestic Film Grosses ($ not adjusted)

Rank  Movie                            Year  ($Millions)
1     Titanic                          1997  600.8
2     Star Wars                        1977  460.9
3     E.T.                             1982  434.9
4     Star Wars: Episode I             1999  431.1
5     Spider Man                       2002  403.7
6     Jurassic Park                    1993  357.1
7     Lord of the Rings: Two Towers    2002  337.9
8     Forrest Gump                     1994  329.7
9     Harry Potter: Sorcerer's Stone   2001  317.6
10    Lord of the Rings: Fellowship    2001  313.4
11    Lion King                        1994  312.9
12    Star Wars: Episode II            2002  310.7
13    Return of the Jedi               1983  309.1
14    Independence Day                 1996  306.1
15    Sixth Sense                      1999  293.5
16    Empire Strikes Back              1980  290.2
17    Home Alone                       1990  285.8
18    Shrek                            2001  267.7
19    Harry Potter: Chamber Secrets    2002  262.0
20    Jaws                             1975  260.0

As of 2003; source: http://www.the-movie-times.com/thrsdir/alltime.mv?domestic+ByDG

Note: If figures were corrected for inflation the picture would be very different. For example, Star Wars (1977) would have made $1.027 billion in today's dollars.

Quartiles are used to divide a data set into four pieces; they can be thought of as statistical dividing lines between these pieces. You will discover that there are differences in the way statisticians calculate these dividing lines; here we will illustrate the method used in the Excel QUARTILE function.

For a list of n numbers, first sort the numbers in increasing order and figure out how many data there are. In this case, n = 20. In the Excel method, the first quartile is the number that is three quarters of the way from the fifth observation (from the bottom) to the sixth. The fifth is 290.2 (Empire Strikes Back), the sixth is 293.5 (Sixth Sense), and the first quartile is:

\[ Q_1 = 290.2 + 0.75\,(293.5 - 290.2) = 292.675 \text{ million} \]

The second quartile is the number that is half way from the tenth observation (from the bottom) to the eleventh. The tenth is 312.9 (Lion King), the eleventh is 313.4 (Lord of the Rings: Fellowship), so the second quartile is:

\[ Q_2 = 312.9 + 0.5\,(313.4 - 312.9) = 313.15 \text{ million} \]

The third quartile is the number that is one quarter of the way from the fifteenth observation (from the bottom) to the sixteenth. The fifteenth is 357.1 (Jurassic Park), the sixteenth is 403.7 (Spider Man), so the third quartile is:

\[ Q_3 = 357.1 + 0.25\,(403.7 - 357.1) = 368.75 \text{ million} \]

The interquartile range is the difference between the third and first quartile:

368.75 - 292.675 = $76.075 million.

Percentiles are like quartiles, except they are dividing lines between hundredths of the data instead of fourths. The 25th percentile is the same as the 1st quartile, the 50th percentile is the same as the 2nd quartile, and the 75th percentile is the same as the 3rd quartile.
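The interpolation rule described above (quartile q sits at position 1 + q(n - 1)/4 in the sorted list) can be sketched as follows. The helper `quartile_inc` is a hypothetical name introduced for illustration; the cross-check uses the "inclusive" method of Python's `statistics.quantiles`, which applies the same rule.

```python
import statistics

def quartile_inc(values, q):
    """Quartile q (1, 2 or 3) by linear interpolation between neighbouring
    sorted observations, at one-based position 1 + q * (n - 1) / 4."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 4   # zero-based fractional position
    lower = int(pos)              # index of the observation just below
    frac = pos - lower            # how far to move toward the next observation
    if lower + 1 >= len(xs):
        return xs[lower]
    return xs[lower] + frac * (xs[lower + 1] - xs[lower])

# Top 20 domestic film grosses, $ millions (from the table above).
grosses = [600.8, 460.9, 434.9, 431.1, 403.7, 357.1, 337.9, 329.7, 317.6, 313.4,
           312.9, 310.7, 309.1, 306.1, 293.5, 290.2, 285.8, 267.7, 262.0, 260.0]

q1, q2, q3 = (quartile_inc(grosses, q) for q in (1, 2, 3))
iqr = q3 - q1
print(f"Q1 = {q1:.3f}, Q2 = {q2:.3f}, Q3 = {q3:.3f}, IQR = {iqr:.3f}")

# Cross-check with the standard library's inclusive quartiles.
check = statistics.quantiles(grosses, n=4, method="inclusive")
print([round(c, 3) for c in check])
```

Other software may use a different convention (for example, positions based on n + 1), so small differences between packages are expected.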


Quartiles can be used to create a type of chart called a Box Plot, or Box and Whisker Plot, as in this example from the MBA graduate data:

Notice that the box plot allows us to compare central tendency and dispersion across several variables in one chart. Here we can see how starting salaries vary across the different groups of students. Unfortunately, Excel can't help you with box plots very well (these were created in Minitab, a popular statistics software package).


3) Variance: The average of the squared deviations of values from the arithmetic mean.

Example: To calculate the variance of the above 7 governmental salaries, first calculate the mean; it is 51.571. Then for each number, calculate its deviation from the mean, so we get

48 - 51.571 = -3.57, 90 - 51.571 = 38.43, and so forth..., 49 - 51.571 = -2.57.

Add the squares of these together, and we get (-3.57)^2 + (38.43)^2 + ... + (-2.57)^2 = 1,783.71. Then dividing by 7 we get 254.82. The variance of the above salaries is 254.82 (in units of $1000s squared). Using summation notation this is:

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2 \]

(Beware of the units of the variance; it is in the original units squared.)

4) Standard deviation = \(\sigma = \sqrt{\sigma^2}\). (We use the Greek letter sigma, \(\sigma\), to represent the standard deviation.) This can be thought of as the "average" deviation from the mean. It is simply the square root of the variance:

\[ \sigma = \sqrt{254.82} = 15.96 \]

Because the salaries are in $1000s, the standard deviation is about $15,960.
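The variance and standard deviation steps can be traced in a few lines. This sketch divides by N (the population formula used in this example); the variable names are illustrative.

```python
import math

# Population variance and standard deviation of the seven salaries (in $1000s).
salaries = [48, 90, 46, 42, 40, 46, 49]

N = len(salaries)
mu = sum(salaries) / N

deviations = [x - mu for x in salaries]           # X_i - mu
sum_of_squares = sum(d ** 2 for d in deviations)  # sum of squared deviations
variance = sum_of_squares / N                     # divide by N, not N - 1
std_dev = math.sqrt(variance)

print(f"sum of squares = {sum_of_squares:.2f}")
print(f"variance = {variance:.2f} ($1000s squared)")
print(f"standard deviation = {std_dev:.2f} ($1000s)")
```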


Example: A school system employs teachers at salaries between $28,000 and $50,000. The teachers' union and the school board are negotiating the form of next year's salary increases.

1. If every teacher is given a flat $1000 raise, what will this do to the mean salary?

2. To the median salary?

3. To the range?

4. To the quartiles of the salary distribution?

5. What would a flat $1000 raise do to the standard deviation of teachers' salaries?

6. If, instead, each teacher receives a 5% raise, what will this do to the mean salary?

7. To the median salary?

8. Will the 5% raise increase the standard deviation of the salaries?
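One way to explore these questions is to try both kinds of raise on a small made-up salary list and compare the summary statistics. The six salaries and the `summarize` helper below are hypothetical, chosen only to fall in the stated $28,000 to $50,000 range; the quartiles use the inclusive interpolation rule.

```python
import statistics

def summarize(salaries):
    """Return the summary statistics asked about in the questions above."""
    q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
    return {
        "mean": statistics.mean(salaries),
        "median": statistics.median(salaries),
        "range": max(salaries) - min(salaries),
        "q1": q1,
        "q3": q3,
        "stdev": statistics.pstdev(salaries),
    }

# Hypothetical teacher salaries (not data from the notes).
salaries = [28000, 31000, 35000, 40000, 44000, 50000]

base = summarize(salaries)
flat = summarize([s + 1000 for s in salaries])  # flat $1000 raise
pct = summarize([s * 1.05 for s in salaries])   # 5% raise

for name in base:
    print(f"{name:<7}{base[name]:>12.2f}{flat[name]:>12.2f}{pct[name]:>12.2f}")
```

In this sketch the flat raise shifts the mean, median, and quartiles by $1000 and leaves the range and standard deviation unchanged, while the 5% raise multiplies every one of these statistics by 1.05.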


Population versus Sample

A population is usually a group we want to know something about:

e.g., all potential customers, all eligible voters, all the products coming off an assembly line, all items in inventory, etc....

A population parameter is a number relevant to the population that is of interest to us: e.g., the proportion (in the population) that would buy a product, the proportion of eligible voters who will vote for a candidate, the average number of M&M's in a pack....

A sample is a subset of the population that we actually do know about (by taking measurements of some kind): e.g., a group who fill out a survey, a group of voters that are polled, a number of randomly chosen items off the line....

A sample statistic is often the only practical estimate of a population parameter. In practice we will use sample statistics as proxies for population parameters, but it is important to remember the difference.

Sample Mean and Variance: To determine the average amount of money spent in the Central Mall, a Central City official randomly samples 12 people as they exit the mall. He asks them the amount of money spent and records the data. The official is trying to estimate mean and variance of the population from a sample of 12 data points. Here are the data for the 12 people:

Person  $ spent    Person  $ spent    Person  $ spent
1       $132       5       $123       9       $449
2       $334       6       $5         10      $133
3       $33        7       $6         11      $44
4       $10        8       $14        12      $1


Sample Means, Variances and Standard Deviations: A sample (x1, x2, ..., xn) has sample mean, sample variance, and sample standard deviation as follows:

Sample Mean

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Sample Variance

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Note: The denominator of the sample variance formula is n - 1, not n. This is because of the aforementioned distinction between population parameters and sample statistics. The n - 1 formula for \(s^2\) tends to give a better estimate of \(\sigma^2\), especially for small sample sizes.

Sample Standard Deviation

\[ s = \sqrt{s^2} \]

Example: The sample mean is

\[ \bar{x} = \frac{132 + 334 + \cdots + 1}{12} = \frac{1284}{12} = 107 \]

The sample variance is

\[ s^2 = \frac{(132 - 107)^2 + (334 - 107)^2 + \cdots + (1 - 107)^2}{11} = \frac{229{,}394}{11} = 20{,}854 \]

The sample standard deviation is

\[ s = \sqrt{20{,}854} = 144.41 \]

Without checking every shopper that has ever bought anything at the mall, we estimate that people's spending at the mall has a mean of $107 and a standard deviation of $144.41. These are just estimates of the population parameters, based on sample statistics.

In the later parts of this course we will be working almost exclusively with sample data, because in real life population data frequently are difficult, expensive, time-consuming, or impossible to obtain.
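The mall example can be reproduced with Python's `statistics` module, whose `variance` and `stdev` functions use the n - 1 denominator. This is a sketch of one way to do the calculation, not the method used in the notes.

```python
import statistics

# Dollars spent by the 12 sampled shoppers, in person order 1..12.
spent = [132, 334, 33, 10, 123, 5, 6, 14, 449, 133, 44, 1]

n = len(spent)
sample_mean = statistics.mean(spent)
sample_variance = statistics.variance(spent)  # divides by n - 1
sample_std_dev = statistics.stdev(spent)      # square root of the sample variance

print(f"n = {n}")
print(f"sample mean = {sample_mean:.2f}")
print(f"sample variance = {sample_variance:.2f}")
print(f"sample standard deviation = {sample_std_dev:.2f}")
```

For comparison, `statistics.pvariance` and `statistics.pstdev` divide by n and would be appropriate only if these 12 shoppers were the entire population.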


Other Measures of Dispersion

Example: Consider these summary statistics from month-end stock prices for 50 months (from November 1999 to December 2003):

Company                                Ticker  Mean     StDev
Minnesota Mining & Manufacturing (3M)  MMM     $105.41  $18.57
International Business Machines        IBM     $96.15   $15.96
Procter & Gamble                       PG      $80.04   $13.48
Disney                                 DIS     $25.82   $8.05
Microsoft                              MSFT    $30.97   $7.85
McDonald's                             MCD     $27.15   $6.92

One important consideration in investing is volatility, because high volatility (which is really the same thing we have been calling dispersion) implies high risk. We might infer from the standard deviations here that MMM, IBM, and PG stock prices are quite volatile compared with DIS, MSFT, and MCD.

However, we also note that the average prices for those three stocks are somewhat higher, and an investor might be more interested in volatility relative to the mean than in absolute terms. The coefficient of variation (CV), the standard deviation divided by the mean, might be useful here:

\[ CV = \frac{\text{StDev}}{\text{Mean}} \]

Using the CV, we would infer that Disney was in fact the most volatile stock over these 50 months:

Company                                Ticker  Mean     StDev   CV
Minnesota Mining & Manufacturing (3M)  MMM     $105.41  $18.57  0.176
International Business Machines        IBM     $96.15   $15.96  0.166
Procter & Gamble                       PG      $80.04   $13.48  0.168
Disney                                 DIS     $25.82   $8.05   0.312
Microsoft                              MSFT    $30.97   $7.85   0.253
McDonald's                             MCD     $27.15   $6.92   0.255
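The CV column can be recomputed from the means and standard deviations in the table. The sketch below assumes the CV is simply the standard deviation divided by the mean; the dictionary layout is illustrative.

```python
# Coefficient of variation (standard deviation / mean) for each stock.
# Values are (mean, standard deviation) of month-end prices from the table above.
stocks = {
    "MMM": (105.41, 18.57),
    "IBM": (96.15, 15.96),
    "PG": (80.04, 13.48),
    "DIS": (25.82, 8.05),
    "MSFT": (30.97, 7.85),
    "MCD": (27.15, 6.92),
}

cv = {ticker: stdev / mean for ticker, (mean, stdev) in stocks.items()}
most_volatile = max(cv, key=cv.get)

for ticker, value in cv.items():
    print(f"{ticker:<5}{value:.3f}")
print(f"Highest CV: {most_volatile}")
```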


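In finance, it is conventional to avoid these scaling issues by using the concept of return on investment, which is calculated from the prices at the beginning and end of a period as shown here:

\[ \text{Return} = \frac{P_{\text{end}} - P_{\text{begin}}}{P_{\text{begin}}} \]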

For example, 3M sold for $95.56 at the end of November 1999, and for $97.88 at the end of December 1999. We can say that 3M stock had a return of about 2.42% during the course of that month.
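A monthly return of this kind can be sketched as the price change divided by the starting price. This assumes a simple price return with no dividends, which may differ from the exact calculation behind the figure quoted above; `simple_return` is a hypothetical helper name.

```python
# Simple one-period return: (ending price - starting price) / starting price.
# Assumption: dividends are ignored and the quoted month-end prices are used as given.

def simple_return(price_begin, price_end):
    """Fractional return over one period."""
    return (price_end - price_begin) / price_begin

# 3M month-end prices for November and December 1999, as quoted in the text.
mmm_return = simple_return(95.56, 97.88)
print(f"3M return for the month: {mmm_return:.2%}")
```

With the rounded prices quoted in the text this comes out at roughly 2.43%, close to the 2.42% cited; the small gap presumably reflects rounding in the quoted prices.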

Using these percent return on investment data instead of raw stock price data, we can return to the use of standard deviation as a measure of risk:

Company                                Ticker  Mean    StDev
Minnesota Mining & Manufacturing (3M)  MMM     0.35%   9.77%
International Business Machines        IBM     0.39%   11.39%
Procter & Gamble                       PG      0.19%   7.69%
Disney                                 DIS     0.12%   9.79%
Microsoft                              MSFT    -0.08%  14.08%


Selected Bibliography

Bernstein, Peter L. (1996). Against the Gods: The Remarkable Story of Risk. New York: John Wiley and Sons.
Gonick, Larry and Woollcott Smith (1994). The Cartoon Guide to Statistics. HarperCollins. ISBN: 0062731025.
Levitt, Steven and Stephen Dubner (2005). Freakonomics: A Rogue Economist Explores the Hidden Side of Everything. New York: William Morrow. ISBN: 006073132X.
Lowenstein, Roger (2001). When Genius Failed: The Rise and Fall of Long-Term Capital Management. New York: Random House. ISBN: 0375758259.
Paulos, John Allen (2003). A Mathematician Plays the Stock Market. New York: Basic Books.
Taleb, Nassim (2001). Fooled by Randomness: The Hidden Role of Chance in the Markets and in Life. New York: Texere.
Tufte, Edward (1983). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
