elementary statistics

Elementary Statistics

Davis Lazarus, Assistant Professor

ISIM, The IIS University

Too few categories

[Figure: histogram, "Age of Spring 1998 Stat 250 Students", n = 92 students; x-axis: Age (in years); y-axis: Frequency (Count)]

Too many categories

[Figure: histogram, "GPAs of Spring 1998 Stat 250 Students", n = 92 students; x-axis: GPA; y-axis: Frequency (Count)]

[Figure: scatter plot of Y against X]

•Scatter Plot

•Scatter diagram

•Scattergram

Classes    Class boundaries   Tally Marks             Freq.   x
70 – 78    69.5 – 78.5        /////                    5      74
61 – 69    60.5 – 69.5        /////                    5      65
52 – 60    51.5 – 60.5                                 0      56
43 – 51    42.5 – 51.5        //                       2      47
34 – 42    33.5 – 42.5        ///// //                 7      38
25 – 33    24.5 – 33.5        ///// ///// ////        14      29
16 – 24    15.5 – 24.5        ///// ///// ///// //    17      20

A frequency distribution table lists categories of scores along with their corresponding frequencies.

The frequency for a particular category or class is the number of original scores that fall into that class.

The classes or categories refer to the groupings of a frequency table.

• The range is the difference between the highest value and the lowest value.

R = highest value – lowest value

The class width is the difference between two consecutive lower class limits or class boundaries.

The class limits are the smallest or the largest numbers that can actually belong to different classes.

• Lower class limits are the smallest numbers that can actually belong to the different classes.

• Upper class limits are the largest numbers that can actually belong to the different classes.

• The class boundaries are obtained by increasing the upper class limits and decreasing the lower class limits by the same amount so that there are no gaps between consecutive classes. The amount to be added or subtracted is ½ the difference between the upper limit of one class and the lower limit of the following class.

Essential Question :

• How do we construct a frequency distribution table?

Process of Constructing a Frequency Table

• STEP 1. Determine the range.

R = Highest Value – Lowest Value

• STEP 2. Determine the tentative number of classes (k).

k = 1 + 3.322 log N

• Always round off.

• Note: The number of classes should be between 5 and 20. The actual number of classes may be affected by convenience or other subjective factors.

• STEP 3. Find the class width by dividing the range by the number of classes. (Always round off.)

$c = \frac{R}{k}$, that is, class width = Range / number of classes

• STEP 4. Write the classes or categories starting with the lowest score. Stop when the class already includes the highest score.

• Add the class width to the starting point to get the second lower class limit. Add the class width to the second lower class limit to get the third, and so on. List the lower class limits in a vertical column and enter the upper class limits, which can be easily identified at this stage.

• STEP 5. Determine the frequency for each class by referring to the tally columns and present the results in a table.

When constructing frequency tables, the following guidelines should be followed.

• The classes must be mutually exclusive. That is, each score must belong to exactly one class.

• Include all classes, even if the frequency might be zero.

• All classes should have the same width, although it is sometimes impossible to avoid open-ended intervals such as “65 years or older”.

• The number of classes should be between 5 and 20.

Let’s Try!!!

• Time magazine collected information on all 464 people who died from gunfire in the Philippines during one week. Here are the ages of 50 men randomly selected from that population. Construct a frequency distribution table.

19 18 30 40 41 33 73 25
23 25 21 33 65 17 20 76
47 69 20 31 18 24 35 24
17 36 65 70 22 25 65 16
24 29 42 37 26 46 27 63
21 27 23 25 71 37 75 25
27 23

Using the table:

• What is the lower class limit of the highest class? Upper class limit of the lowest class?

• Find the class mark of the class 43 – 51.

• What is the frequency of the class 16 – 24?

Classes    Class boundaries   Tally Marks             Freq.   x
70 – 78    69.5 – 78.5        /////                    5      74
61 – 69    60.5 – 69.5        /////                    5      65
52 – 60    51.5 – 60.5                                 0      56
43 – 51    42.5 – 51.5        //                       2      47
34 – 42    33.5 – 42.5        ///// //                 7      38
25 – 33    24.5 – 33.5        ///// ///// ////        14      29
16 – 24    15.5 – 24.5        ///// ///// ///// //    17      20
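The five steps can be traced in code on the 50 ages listed in the exercise. The following is a minimal Python sketch and not part of the original slides: the function name `frequency_table` is illustrative, `log` in Step 2 is assumed to be base 10, and the data are assumed to be whole numbers.

```python
import math

ages = [
    19, 18, 30, 40, 41, 33, 73, 25,
    23, 25, 21, 33, 65, 17, 20, 76,
    47, 69, 20, 31, 18, 24, 35, 24,
    17, 36, 65, 70, 22, 25, 65, 16,
    24, 29, 42, 37, 26, 46, 27, 63,
    21, 27, 23, 25, 71, 37, 75, 25,
    27, 23,
]

def frequency_table(data):
    """Return (lower limit, upper limit, frequency, class mark) rows for integer data."""
    data_range = max(data) - min(data)                    # Step 1: R
    k = round(1 + 3.322 * math.log10(len(data)))          # Step 2: number of classes
    width = max(1, round(data_range / k))                 # Step 3: class width
    rows = []
    lower = min(data)                                     # Step 4: start at the lowest score
    while lower <= max(data):                             # stop once the highest score is covered
        upper = lower + width - 1
        freq = sum(lower <= x <= upper for x in data)     # Step 5: count scores in the class
        rows.append((lower, upper, freq, (lower + upper) / 2))
        lower += width
    return rows

table = frequency_table(ages)
for lower, upper, freq, mark in table:
    print(f"{lower} - {upper}   f = {freq}   x = {mark}")
```

For these ages the sketch gives R = 60, k = 7 and a class width of 9, so the classes run from 16 – 24 up to 70 – 78, listed from lowest to highest.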

The manager of Hudson Auto would like to have a better understanding of the cost of parts used in the engine tune-ups performed in the shop. She examines 50 customer invoices for tune-ups. The costs of parts, rounded off to the nearest dollar, are listed below.

91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73


Example 1

CUMULATIVE FREQUENCY DISTRIBUTION

• The less than cumulative frequency distribution (F<) is constructed by adding the frequencies from the lowest to the highest interval while the more than cumulative frequency distribution (F>) is constructed by adding the frequencies from the highest class interval to the lowest class interval.

Tabular Summary: Frequency Distribution of engine tune-ups

Cost ($)    Frequency   Relative     Cumulative Frequency
                        Frequency    less than   more than
50-59           2         0.04           2          50
60-69          13         0.26          15          48
70-79          16         0.32          31          35
80-89           7         0.14          38          19
90-99           7         0.14          45          12
100-109         5         0.10          50           5
Total          50         1.00

In the less than column, 15 = 2 + 13. In the more than column, 12 = 5 + 7.

45 tune-ups cost less than $100.

12 tune-ups cost more than $89.
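The tabular summary can be recomputed from the 50 invoice values. This is a small Python sketch under the assumption that the classes are the six ten-dollar intervals from 50-59 to 100-109; the variable names are illustrative. Summing the last three class frequencies gives 7 + 7 + 5 = 19 for the more than entry of the 80-89 class.

```python
from itertools import accumulate

costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]
classes = [(50, 59), (60, 69), (70, 79), (80, 89), (90, 99), (100, 109)]

# Frequency: number of costs falling inside each class
freq = [sum(lo <= c <= hi for c in costs) for lo, hi in classes]

# Relative frequency: class frequency divided by the total count
rel_freq = [f / len(costs) for f in freq]

# Less than (F<): running total from the lowest class upward
less_than = list(accumulate(freq))

# More than (F>): running total from the highest class downward
more_than = list(accumulate(reversed(freq)))[::-1]

for (lo, hi), f, rf, lt, mt in zip(classes, freq, rel_freq, less_than, more_than):
    print(f"{lo}-{hi}: f={f}, rel={rf:.2f}, F<={lt}, F>={mt}")
```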

Graphical Summary: Histogram

[Figure: histogram of tune-up parts cost; x-axis: Cost ($) in classes 50-59, 60-69, 70-79, 80-89, 90-99, 100-110; y-axis: Frequency, 2 to 18]

Unlike a bar graph, a histogram has no natural separation between rectangles of adjacent classes.

Ogive

[Figure: less than ogive and more than ogive of tune-up cost; x-axis: Tune-up Cost ($), 60 to 110; y-axis: Frequency, 10 to 50; the median is marked on the plot]

Stem-and-Leaf Display

 5 | 2 7
 6 | 2 2 2 2 5 6 7 8 8 8 9 9 9
 7 | 1 1 2 2 3 4 4 5 5 5 6 7 8 9 9 9
 8 | 0 0 2 3 5 8 9
 9 | 1 3 7 7 7 8 9
10 | 1 4 5 5 9

In each row, the number to the left of the bar is a stem and each digit to the right is a leaf.

A single digit is used to define each leaf

Leaf units may be 100, 10, 1, 0.1, and so on

Where the leaf unit is not shown, it is assumed to equal 1

In the above example, the leaf unit was 1

Leaf Unit = 0.1

Data: 8.6 11.7 9.4 9.1 10.2 11.0 8.8

 8 | 6 8
 9 | 1 4
10 | 2
11 | 0 7

Leaf Unit = 10

Data: 1806 1717 1974 1791 1682 1910 1838

16 | 8
17 | 1 9
18 | 0 3
19 | 1 7

The 82 in 1682 is rounded down to 80 and is represented as an 8.
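The leaf-unit rule can be expressed compactly in code. The sketch below is an illustration rather than the slides' own method: the function name `stem_and_leaf` is hypothetical, values are assumed to be non-negative, and digits below the leaf unit are dropped (rounded down), as in the 1682 example. `Decimal` is used so that values such as 8.6 divide cleanly by 0.1.

```python
from collections import defaultdict
from decimal import Decimal

def stem_and_leaf(data, leaf_unit=1):
    """Return {stem: [leaves]} for non-negative data, dropping digits below the leaf unit."""
    unit = Decimal(str(leaf_unit))
    display = defaultdict(list)
    for value in sorted(data):
        scaled = int(Decimal(str(value)) // unit)   # 1682 with unit 10 -> 168
        stem, leaf = divmod(scaled, 10)             # 168 -> stem 16, leaf 8
        display[stem].append(leaf)
    return dict(display)

tenths = stem_and_leaf([8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8], leaf_unit=0.1)
tens = stem_and_leaf([1806, 1717, 1974, 1791, 1682, 1910, 1838], leaf_unit=10)

for stem, leaves in tens.items():
    print(f"{stem:>3} | {' '.join(str(leaf) for leaf in leaves)}")
```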

Measures of Central Tendency: Arithmetic Mean, Weighted Mean, Geometric Mean, Median, Mode, Partition Values – Quartiles, Deciles and Percentiles

Measures of Dispersion: Range, Mean deviation, Standard deviation, Variance, Co-efficient of variation

Measures of Position: Quartile deviation

Mean: the average obtained by finding the sum of the numbers and dividing by the number of numbers in the sum.

Median: when the numbers are listed from highest to lowest or lowest to highest, the median is the number found in the middle. If there is an even number of data values, find the average of the middle two numbers.

Mode: the number that occurs most often.

• What is the “location” or “centre” of the data? (measures of location or central tendency)

• How do the data vary? (measures of variability or dispersion)

Mean is the most widely used measure of location and shows the central value of the data.

• all values are used

• unique

• sum of the deviations from the mean is 0

• affected by unusually large or small data values

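Population mean: $\mu = \frac{\sum X_i}{N}$

• µ is the population mean
• N is the population size
• $X_i$ is a particular population value
• $\sum$ indicates the operation of adding

Sample mean: $\bar{X} = \frac{\sum x_i}{n}$

• $\bar{X}$ is the sample mean
• n is the sample size
• $x_i$ is a particular sample value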

The Median is the midpoint of the values after they have been ordered from the smallest to the largest.

• There are as many values above the median as below it in the data array.

• For an even set of values, the median will be the arithmetic average of the two middle numbers and is found at the (n+1)/2 ranked observation.

• unique

• not affected by extremely large or small values

• good measure of location when such values occur

The Mode is another measure of location and represents the value of the observation that appears most frequently.

Data can have more than one mode. If it has two modes, it is referred to as bimodal, three modes, trimodal, and the like.

The Geometric Mean (GM) of a set of n numbers is defined as the nth root of the product of the n numbers.

$GM = \sqrt[n]{(X_1)(X_2)(X_3)\cdots(X_n)}$

GM is used to average percents, indexes, and relatives.

The Weighted Mean of a set of numbers X1, X2, ..., Xn, with corresponding weights w1, w2, ..., wn, is

$\bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n}$

The interest rates on three bonds were 5, 21, and 4 percent.

The arithmetic mean is (5 + 21 + 4) / 3 = 10.0

The geometric mean is

$GM = \sqrt[3]{(5)(21)(4)} = 7.49$

The GM gives a more conservative profit figure because it is not heavily weighted by the rate of 21%.

Example 1
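The bond example can be checked with Python's standard library. This is a short sketch, assuming Python 3.8 or later (where `statistics.geometric_mean` is available); the variable names are illustrative.

```python
import statistics

rates = [5, 21, 4]  # interest rates on the three bonds, in percent

arithmetic_mean = statistics.mean(rates)            # (5 + 21 + 4) / 3
geometric_mean = statistics.geometric_mean(rates)   # cube root of 5 * 21 * 4

print(f"arithmetic mean = {arithmetic_mean:.2f}")
print(f"geometric mean  = {geometric_mean:.2f}")
```

The geometric mean comes out below the arithmetic mean because the single large rate of 21% pulls a product less than it pulls a sum.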

Another use of GM is to determine the percent increase in sales, production or other business or economic series from one time period to another.

[Figure: chart "Growth in Sales 1999-2004"; x-axis: Year, 1999 to 2004; y-axis: Sales in Millions ($), 0 to 50]

$GM = \sqrt[n]{\frac{\text{Value at end of period}}{\text{Value at beginning of period}}} - 1$

Example 2

The total number of females enrolled in American colleges increased from 755,000 in 1992 to 835,000 in 2000.

$GM = \sqrt[8]{\frac{835{,}000}{755{,}000}} - 1 = 0.0127$

That is, the geometric mean rate of increase is 1.27%.

Example 3
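The enrollment calculation follows directly from the growth formula. The sketch below is illustrative; the function name `geometric_growth_rate` is not from the slides, and the number of periods is taken as the 8 years between 1992 and 2000.

```python
def geometric_growth_rate(start_value, end_value, periods):
    """Average rate of increase per period: (end / start) ** (1 / periods) - 1."""
    return (end_value / start_value) ** (1 / periods) - 1

rate = geometric_growth_rate(755_000, 835_000, 2000 - 1992)
print(f"geometric mean rate of increase = {rate:.2%}")
```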

Measures of Dispersion

•Range

• Mean Deviation

•Quartile Deviation

•Standard Deviation

•Variance

•Co-efficient of Variation

Dispersion refers to the spread or variability in the data.

Range = Largest value – Smallest value


The following represents the current year’s Return on Equity of the 25 companies in an investor’s portfolio.

-8.1  3.2  5.9  8.1 12.3
-5.1  4.1  6.3  9.2 13.3
-3.1  4.6  7.9  9.5 14.0
-1.4  4.8  7.9  9.7 15.0
 1.2  5.7  8.0 10.3 22.1

Highest value: 22.1 Lowest value: -8.1

Range = Highest value – lowest value = 22.1-(-8.1) = 30.2

Range Example

Mean Deviation

The arithmetic mean of the absolute values of the deviations from the arithmetic mean.

• All values are used in the calculation.

• It is not unduly influenced by large or small values.

• The absolute values are difficult to manipulate.

$MD = \frac{\sum |X - \bar{X}|}{n}$

The weights of a sample of crates containing books for the bookstore (in pounds ) are: 103, 97, 101, 106, 103

$\bar{X} = 102$

$MD = \frac{\sum |X - \bar{X}|}{n} = \frac{|103 - 102| + \cdots + |103 - 102|}{5} = \frac{1 + 5 + 1 + 4 + 1}{5} = 2.4$

Example 5
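The crate-weight calculation can be reproduced in a few lines. This is a minimal sketch; the function name `mean_deviation` is illustrative.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

weights = [103, 97, 101, 106, 103]  # crate weights in pounds
md = mean_deviation(weights)
print(f"mean = {sum(weights) / len(weights)}, MD = {md}")
```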

Standard deviation and Variance

Population Variance: the arithmetic mean of the squared deviations from the mean.

$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$

• X is the value of an observation in the population
• μ is the arithmetic mean of the population
• N is the number of observations in the population

Population Standard Deviation, σ: the square root of the variance.

$\sigma = \sqrt{\sigma^2}$

In Example 4, the variance and standard deviation are:

$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{(-8.1 - 6.62)^2 + (-5.1 - 6.62)^2 + \cdots + (22.1 - 6.62)^2}{25} = 42.227$

$\sigma = \sqrt{42.227} = 6.498$

Example 6
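The population figures for the 25 Return on Equity values can be verified with the standard library. This sketch treats the 25 companies as the whole population, as the slide does, and so uses `statistics.pvariance` and `statistics.pstdev` (divisor N); the variable names are illustrative.

```python
import statistics

roe = [
    -8.1, 3.2, 5.9, 8.1, 12.3,
    -5.1, 4.1, 6.3, 9.2, 13.3,
    -3.1, 4.6, 7.9, 9.5, 14.0,
    -1.4, 4.8, 7.9, 9.7, 15.0,
    1.2, 5.7, 8.0, 10.3, 22.1,
]

value_range = max(roe) - min(roe)      # highest value minus lowest value
mu = statistics.mean(roe)              # population mean
variance = statistics.pvariance(roe)   # mean of squared deviations (divisor N)
sigma = statistics.pstdev(roe)         # square root of the population variance

print(f"range = {value_range:.1f}, mean = {mu:.2f}")
print(f"variance = {variance:.3f}, standard deviation = {sigma:.3f}")
```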

Sample variance: $s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$

Sample standard deviation: $s = \sqrt{s^2}$

The hourly wages earned by a sample of five students are $7, $5, $11, $8, $6.

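$\bar{X} = \frac{\sum X}{n} = \frac{37}{5} = 7.40$

$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(7 - 7.4)^2 + \cdots + (6 - 7.4)^2}{5 - 1} = \frac{21.2}{5 - 1} = 5.30$

$s = \sqrt{s^2} = \sqrt{5.30} = 2.30$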

Example 7
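For a sample, the divisor is n - 1 rather than N. The sketch below recomputes the hourly-wage example with `statistics.variance` and `statistics.stdev`, which use the n - 1 divisor; the variable names are illustrative.

```python
import statistics

wages = [7, 5, 11, 8, 6]  # hourly wages of the five sampled students, in dollars

sample_mean = statistics.mean(wages)
sample_variance = statistics.variance(wages)   # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(wages)            # square root of the sample variance

print(f"mean = {sample_mean:.2f}, s^2 = {sample_variance:.2f}, s = {sample_sd:.2f}")
```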

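Example:

Data: X = {6, 10, 5, 4, 9, 8}; N = 6

  X     X − X̄    (X − X̄)²
  6      -1         1
 10       3         9
  5      -2         4
  4      -3         9
  9       2         4
  8       1         1
Total: 42         Total: 28

Mean: $\bar{X} = \frac{\sum X}{N} = \frac{42}{6} = 7$

Variance: $s^2 = \frac{\sum (X - \bar{X})^2}{N} = \frac{28}{6} = 4.67$

Standard Deviation: $s = \sqrt{s^2} = \sqrt{4.67} = 2.16$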

Empirical Rule:

For any symmetrical, bell-shaped distribution:

• About 68% of the observations will lie within 1s of the mean

• About 95% of the observations will lie within 2s of the mean

• Nearly all (99.7%) of the observations will be within 3s of the mean

Interpretation and Uses of the Standard Deviation

Quartiles: Q1, Q2, Q3 divide ranked data into four equal parts (25% in each part).

Deciles: D1, D2, D3, D4, D5, D6, D7, D8, D9 divide ranked data into ten equal parts (10% in each part).

Percentiles: the 99 percentiles divide ranked data into 100 equal parts.

Fractiles

Relative Standing

Percentiles

percentile of value x = ((number of values < x) / total number of values) * 100

(round the result to the nearest whole number)

Suppose that in a class of 25 people we have the following averages (listed in ascending order):

42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77, 78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98

If you received a 77, what percentile are you?

percentile of 77 = (12/25)*100 = 48
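The percentile formula translates directly into code. This is a small sketch; the function name `percentile_rank` is illustrative.

```python
def percentile_rank(values, x):
    """Percent of values strictly less than x, rounded to the nearest whole number."""
    below = sum(v < x for v in values)
    return round(100 * below / len(values))

grades = [42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77,
          78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98]

rank_77 = percentile_rank(grades, 77)
print(f"A grade of 77 is at percentile {rank_77}")
```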

Relative Standing

Quartiles

Instead of finding the percentile of a single data value as we did on the previous page, it is often useful to group the data into 4, or more, (nearly) equal groups. When grouping the data into four equal groupings, we call these groupings quartiles.

Let n = number of items in the data set

k = percent desired (ex. k= 25)

L = locator: the position of the value separating the first k percent of the data from the rest

L = (k/100) * n

Relative Standing

Let’s separate the 25 class grades into four quartiles.

•Step 1 – order the data in ascending order

42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77, 78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98

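Now find the 3 locators L25, L50, L75, rounding any fractional part up to the next integer:

L25 = (25/100) * 25 = 6.25 → 7

L50 = (50/100) * 25 = 12.5 → 13

L75 = (75/100) * 25 = 18.75 → 19

The values at these positions are the quartiles: Q1 = 70 (7th value), Q2 = 77 (13th value), Q3 = 84 (19th value).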

Relative Standing

Other measures of relative standing include

•Interquartile range (IQR) = Q3 - Q1 (measure of variation)

•Semi-interquartile range = (Q3 - Q1)/ 2 (measure of variation)

•Midquartile = (Q3 + Q1)/2 (measure of central tendency)

•10 – 90 percentile range = P90 - P10

For the data on the previous page we have:

IQR = 84 – 70 = 14

Semi IQR = (84 – 70)/2 = 7

Midquartile = (84 + 70)/2 = 77
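The locator method can be sketched as follows. The function name `value_at_locator` is illustrative. The rule for a whole-number locator (average that value and the next one) is an assumption, inferred from the box diagram example where Q1 = (73 + 74)/2; a fractional locator is rounded up to the next integer. The sketch assumes 0 < k < 100.

```python
import math

def value_at_locator(sorted_values, k):
    """Value separating the lowest k percent of sorted data, using L = (k/100) * n."""
    n = len(sorted_values)
    locator = k * n / 100
    if locator == int(locator):
        # Whole-number locator: average the L-th and (L+1)-th values (assumed convention).
        i = int(locator)
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    # Fractional locator: round up to the next integer (1-based position).
    return sorted_values[math.ceil(locator) - 1]

grades = [42, 59, 63, 67, 69, 69, 70, 73, 73, 74, 74, 74, 77,
          78, 78, 79, 80, 81, 84, 85, 87, 89, 91, 94, 98]

q1, q2, q3 = (value_at_locator(grades, k) for k in (25, 50, 75))
iqr = q3 - q1
semi_iqr = iqr / 2
midquartile = (q3 + q1) / 2
print(q1, q2, q3, iqr, semi_iqr, midquartile)
```

Running it on the 25 grades gives Q1 = 70 and Q3 = 84, so Q3 - Q1 = 14.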

Box Diagram

65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,

74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,

78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,

81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92

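[Figure: box diagram on a scale from 65 to 92 with gradations at 69, 73, 77, 81, 85, 89; the locators L25 and L75 mark Q1 and Q3, and M marks the median]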

To construct a box diagram to illustrate the extent to which the extreme data values lie beyond the interquartile range, draw a line with the low and high value highlighted at the two ends. Mark the gradations between these two extremes, then locate the quartile boundaries Q1, Med., and Q3 on this line. Construct a box about these values.

Q1 = (73 + 74)/2 = 73.5

Percentile of score a = (number of scores less than a / total number of scores) * 100

Relation between the different fractiles

• Q1 = P25

• Q2 = P50

• Q3 = P75

D1 = P10

D2 = P20

D3 = P30

• • •

D9 = P90

Interquartile Range: Q3 – Q1

Five pieces of data are needed to construct a box plot:

Minimum Value,

First Quartile, Q1

Median,

Third Quartile, Q3

Maximum Value.

Box plot: a graphical display, based on quartiles, that helps to picture a set of data.

The box represents the interquartile range, which contains the middle 50% of the values.

The whiskers represent the range; they extend from the box to the highest and lowest values, excluding outliers.

A line across the box indicates the median.

Based on a sample of 20 deliveries, Buddy’s Pizza

determined the following information. The minimum

delivery time was 13 minutes and the maximum 30 minutes.

The first quartile was 15 minutes, the median 18 minutes, and

the third quartile 22 minutes. Develop a box plot for the

delivery times.

Example 8

[Figure: box plot of the delivery times on a scale from 12 to 32 minutes, marking Min, Q1, Median, Q3, and Max; annotations mark 1.5 times the interquartile range]
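The slides mention 1.5 times the interquartile range without spelling out its use. A common convention, assumed here rather than taken from the slides, is to treat values more than 1.5 × IQR below Q1 or above Q3 as outliers. The sketch below applies that convention to the delivery-time summary; the function name `box_plot_fences` is illustrative.

```python
def box_plot_fences(q1, q3, factor=1.5):
    """Return (lower fence, upper fence) at factor * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

# Five-number summary of the delivery times, in minutes
minimum, q1, median, q3, maximum = 13, 15, 18, 22, 30

lower_fence, upper_fence = box_plot_fences(q1, q3)
has_outliers = minimum < lower_fence or maximum > upper_fence

print(f"IQR = {q3 - q1}, fences = ({lower_fence}, {upper_fence}), outliers: {has_outliers}")
```

Under this convention both the minimum and the maximum fall inside the fences, so the whiskers would run all the way from 13 to 30.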

Symmetric distribution: A distribution having the same shape on either side of the centre

Skewed distribution: One whose shapes on either side of the center differ; a nonsymmetrical distribution.

Can be positively or negatively skewed, or bimodal

Skewness: a measurement of the lack of symmetry of the distribution.

Relative Positions of the Mean, Median, and Mode in a Symmetric Distribution

Mode = Median = Mean

Relative Positions of the Mean, Median, and Mode in a Right Skewed or Positively Skewed Distribution


Mean > Median > Mode

The Relative Positions of the Mean, Median, and Mode in a Left Skewed or Negatively Skewed Distribution


Mean < Median < Mode

Using the twelve stock prices, we find the mean to be 84.42, standard deviation, 7.18, median, 84.5.

$sk = \frac{3(\bar{X} - \text{Median})}{s} = -0.035$

Example 9

The coefficient of skewness can range from -3.00 up to 3.00

A value of 0 indicates a symmetric distribution.
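The coefficient of skewness is a one-line calculation. The sketch below plugs in the rounded summary figures quoted for the twelve stock prices; the function name `pearson_skewness` is illustrative. With these rounded inputs the result is about -0.033, close to the -0.035 shown on the slide, which was presumably computed from unrounded values.

```python
def pearson_skewness(mean, median, std_dev):
    """Coefficient of skewness: sk = 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev

sk = pearson_skewness(mean=84.42, median=84.5, std_dev=7.18)
print(f"sk = {sk:.3f}")  # slightly negative: mean just below the median
```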

Kurtosis

• derived from the Greek word κυρτός, kyrtos or kurtos, meaning bulging

• measure of the "peakedness" of the probability distribution of a real-valued random variable

• higher kurtosis means more of the variance is due to infrequent extreme deviations, as opposed to frequent modestly-sized deviations.

A distribution with positive kurtosis is called leptokurtic, or leptokurtotic. In terms of shape, a leptokurtic distribution has a more acute "peak" around the mean (that is, a higher probability than a normally distributed variable of values near the mean) and "fat tails" (that is, a higher probability than a normally distributed variable of extreme values).

A distribution with negative kurtosis is called platykurtic, or platykurtotic. In terms of shape, a platykurtic distribution has a smaller "peak" around the mean (that is, a lower probability than a normally distributed variable of values near the mean) and "thin tails" (that is, a lower probability than a normally distributed variable of extreme values).

[Figure: normal distribution (mesokurtic) compared with another distribution (leptokurtic)]

[Figure: normal distribution (mesokurtic) compared with another distribution (platykurtic)]

Comparing Standard Deviations

[Figure: three data sets plotted on a common scale from 11 to 21]

Data A: Mean = 15.5, s = 3.338

Data B: Mean = 15.5, s = 0.9258

Data C: Mean = 15.5, s = 4.57

• Measures relative variation

• Always in percentage (%)

• Shows variation relative to mean

• Is used to compare two or more sets of data measured in different units

Co-efficient of variation

$CV = \frac{S}{\bar{X}} \cdot 100\%$

When the mean value is near zero, the coefficient of variation is sensitive to change in the standard deviation, limiting its usefulness.

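Stock A: Average price last year = $50; Standard deviation = $5

$CV = \frac{S}{\bar{X}} \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$

Stock B: Average price last year = $100; Standard deviation = $5

$CV = \frac{S}{\bar{X}} \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$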
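The two-stock comparison can be written as a short sketch; the function name `coefficient_of_variation` is illustrative. Both stocks have the same standard deviation, but relative to its mean price Stock A varies twice as much as Stock B.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"Stock A: CV = {cv_a:.0f}%, Stock B: CV = {cv_b:.0f}%")
```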
