descriptive statistics

Descriptive Statistics

Quantitative (variable)

Discrete (no. of customers, no. of claims)

Continuous (salary, price)

Qualitative (attribute)

Ordinal (customer satisfaction, efficiency of workers, bond rating)

Nominal (sex, nationality, eye color)

7/3/2013 2 Descriptive Statistics

Data

Primary

Secondary

Data

Time series (unemployment rate, GDP)

Cross-sectional (queue length in different SBI branches)


Definition

Primary Data

• Collected from the source directly

• Collected under the control and supervision of the investigator

Secondary Data

• Not collected by the investigator

• Derived from other sources


Methods of Collecting Primary Data

• Interview Method

• Questionnaire Method

• Observation Method


Diagram Presentation

Diagram

Line (time series): Simple, Multiple

Bar: Vertical (time series), Horizontal (cross-sectional), Component, Subdivided

Pie


• When data are collected in original form, they are called raw data.

• When the raw data are organized into a frequency distribution, the frequency is the number of values in a specific class of the distribution (grouped data).


Data Table: Compressive Strength of 80 Aluminum-Lithium Alloy Specimens

105 221 183 186 121 181 180 143

97 154 153 174 120 168 167 141

245 228 174 199 181 158 176 110

163 131 154 115 160 208 158 133

207 180 190 193 194 133 156 123

134 178 76 167 184 135 229 146

218 157 101 171 165 172 158 169

199 151 142 163 145 171 148 158

160 175 149 87 160 237 150 135

196 201 200 176 150 170 118 149

Stem-and-Leaf Display

Stem | Leaf                    | Frequency
   7 | 6                       |  1
   8 | 7                       |  1
   9 | 7                       |  1
  10 | 5 1                     |  2
  11 | 5 0 8                   |  3
  12 | 1 0 3                   |  3
  13 | 4 1 3 5 3 5             |  6
  14 | 2 9 5 8 3 1 6 9         |  8
  15 | 4 7 1 3 4 0 8 8 6 8 0 8 | 12
  16 | 3 0 7 3 0 5 0 8 7 9     | 10
  17 | 8 5 4 4 1 6 2 1 0 6     | 10
  18 | 0 3 6 1 4 1 0           |  7
  19 | 9 6 0 9 3 4             |  6
  20 | 7 1 0 8                 |  4
  21 | 8                       |  1
  22 | 1 8 9                   |  3
  23 | 7                       |  1
  24 | 5                       |  1
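A stem-and-leaf display like this one can be rebuilt from the data table. The sketch below is one possible way to do it, assuming the stem is every digit except the last and the leaf is the last digit; the function name `stem_and_leaf` is illustrative only. Leaves appear in the order the values are read row by row, so their order within a stem may differ from the display, while the frequencies agree.

```python
from collections import defaultdict

# Compressive strength values copied from the data table (80 specimens).
strengths = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]


def stem_and_leaf(values):
    """Map each stem (value // 10) to the list of its leaves (value % 10)."""
    stems = defaultdict(list)
    for value in values:
        stems[value // 10].append(value % 10)
    return dict(sorted(stems.items()))


display = stem_and_leaf(strengths)
for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(map(str, leaves))} ({len(leaves)})")
```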

Terms Associated with a Grouped Frequency Distribution

Class width = upper class boundary - lower class boundary


Class Mark or Mid-Value

Class marks are the midpoints of the class boundaries.

Class mark = 1/2 (upper class boundary + lower class boundary)


Frequency Density

FD = class frequency / class width

It gives the number of observations in a class of width one.

Use: when class widths are not equal, frequency density is plotted on the y-axis to draw the histogram.


Relative Frequency

RF = class frequency / total frequency
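The grouped-distribution terms above (class width, class mark, frequency density, relative frequency) can be computed together. The following is a minimal sketch; the class boundaries and frequencies are hypothetical values chosen only to show unequal class widths.

```python
# Hypothetical classes with unequal widths: (lower boundary, upper boundary, frequency).
classes = [(0, 10, 5), (10, 20, 12), (20, 40, 16), (40, 80, 7)]

total_frequency = sum(freq for _, _, freq in classes)

table = []
for lower, upper, freq in classes:
    width = upper - lower
    table.append({
        "class_mark": (lower + upper) / 2,             # midpoint of the boundaries
        "width": width,                                # upper minus lower boundary
        "frequency_density": freq / width,             # observations per unit width
        "relative_frequency": freq / total_frequency,  # share of all observations
    })

for row in table:
    print(row)
```

With unequal widths, the third class has the largest frequency (16) but not the largest frequency density, which is why the density goes on the y-axis of the histogram.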


Visualizing Data

• The three most commonly used graphs in research are:

• The histogram.

• The frequency polygon.

• The cumulative frequency graph or ogive.


Key Characteristics

Characteristic: Definition / Interpretation

Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?

Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?

Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?


Measures

Measure: Mean (raw data)
Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Excel formula: =AVERAGE(Data)
Pro: Familiar and uses all the sample information.
Con: Not appropriate when there are extreme values or open-ended classes.

Measure: Mean (grouped data)
Formula: $\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
Excel formula: =AVERAGE(Data)
Pro: Familiar and uses all the sample information.
Con: Not appropriate when there are extreme values or open-ended classes.


Measures

Measure: Median
Formula: Middle value in sorted array
Excel formula: =MEDIAN(Data)
Pro: Robust when extreme data values exist.
Con: Statistical procedures for the median are complex.

Measure: Mode
Formula: Most frequently occurring data value
Excel formula: =MODE(Data)
Pro: Useful for attribute data or discrete data with a small range.
Con: May not be unique, and is not helpful for continuous data.


Population vs Sample Characteristics

• A statistic is a descriptive measure derived from a sample (n items).

• A parameter is a descriptive measure derived from a population (N items).


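Calculation of Mean

Raw data:

Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$; N = population size

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; n = sample size

Grouped data:

Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{k} f_i x_i$; $N = \sum_{i=1}^{k} f_i$

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i$; $n = \sum_{i=1}^{k} f_i$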
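The raw-data and grouped-data means can be illustrated with a short sketch. The function names and the grouped example (class marks and frequencies) are assumptions made for illustration; the raw values are the six observations used in the median example.

```python
def mean_raw(values):
    """Arithmetic mean of raw data: sum of the values divided by their count."""
    return sum(values) / len(values)


def mean_grouped(class_marks, frequencies):
    """Mean of grouped data: sum of f_i * x_i divided by the sum of f_i."""
    weighted_total = sum(f * x for f, x in zip(frequencies, class_marks))
    return weighted_total / sum(frequencies)


raw = [11, 12, 15, 17, 21, 32]
marks = [5, 15, 25]   # hypothetical class marks
freqs = [2, 5, 3]     # hypothetical class frequencies

print(mean_raw(raw))               # 18.0
print(mean_grouped(marks, freqs))  # 16.0
```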


Example: Apartment Rents

Seventy efficiency apartments were randomly sampled in a small college town, and the monthly rent of each apartment was recorded.

Sample Mean

$\bar{x} = \frac{\sum x_i}{n} = \frac{34,356}{70} = 490.80$


Calculation of Median (n is even)

• Consider the following n = 6 data values: 11 12 15 17 21 32

• What is the median?

For even n, Median = $\frac{x_{n/2} + x_{n/2+1}}{2}$

n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4

M = (x3 + x4)/2 = (15 + 17)/2 = 16


Calculation of Median (n is odd)

• Consider the following n = 7 data values: 11 12 15 17 21 32 38

• What is the median?

For odd n, Median = $x_{(n+1)/2}$

(n+1)/2 = 8/2 = 4, so M = x4 = 17
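Both median rules, for even and odd n, fit in one small function. This is a minimal sketch with an illustrative function name; positions in the formulas count from 1, so the code subtracts 1 for Python's 0-based indexing.

```python
def median(values):
    """Median of raw data using the even-n and odd-n position rules."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the value at position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: the mean of the values at positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median([11, 12, 15, 17, 21, 32]))      # 16.0
print(median([11, 12, 15, 17, 21, 32, 38]))  # 17
```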


Trimmed Mean

Another measure, sometimes used when extreme values are present, is the trimmed mean.

It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.

For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
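A trimmed mean can be sketched as follows. The function name, the example data, and the choice to round the number of trimmed values down to a whole number are all assumptions; other rounding conventions exist.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end.

    Assumption: the count dropped per end is rounded down to an integer.
    """
    if not 0 <= percent < 50:
        raise ValueError("percent must be in the range [0, 50)")
    ordered = sorted(values)
    drop = (len(ordered) * percent) // 100  # values removed from each end
    kept = ordered[drop:len(ordered) - drop]
    return sum(kept) / len(kept)


# Hypothetical data: the integers 1 to 19 plus one extreme value.
data = list(range(1, 20)) + [1000]

print(sum(data) / len(data))  # 59.5, pulled up by the extreme value
print(trimmed_mean(data, 5))  # 10.5, after dropping 1 and 1000
```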


Mode

• A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data.

• It occurs when dissimilar populations are combined in one sample.


Percentiles and Quartiles

• Percentiles divide the sorted data into 100 groups and show how the data spread over the interval from the smallest to the largest value.

• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.

• Deciles divide the data into 10 groups.

• Quartiles divide the data into 4 groups.

In general, by the pth order quantile or fractile (Zp) we mean the value below which a proportion p of the total observations lie.


Percentiles and Quartiles

• Put p = 1/4, 2/4, 3/4 to get the quartiles.

• Put p = 1/10, 2/10, …, 9/10 to get the deciles.

• Put p = 1/100, 2/100, …, 99/100 to get the percentiles.

Step 1. Sort the observations.

Step 2. Calculate np, where n = number of observations.

Step 3. If np is not an integer, take the next integer as the position. If np is an integer, take both np and np + 1 as the positions and use the mean of the two values.

Third Quartile

Third quartile = 75th percentile

np = (75/100)(70) = 52.5, which is not an integer, so the position is 53

Third quartile = 525 (the 53rd value in sorted order)
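Steps 1 to 3 can be written as a short function. This is a sketch under stated assumptions: the name `quantile` and the choice to pass p as an integer fraction `numerator/denominator` (to avoid floating-point rounding when testing whether np is an integer) are illustrative, and the apartment rent values are not reproduced here, so the third-quartile call uses the integers 1 to 70 as stand-in data.

```python
def quantile(values, numerator, denominator):
    """Quantile of order p = numerator/denominator using the np position rule."""
    if not 0 < numerator < denominator:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(values)               # Step 1: sort the observations
    n = len(ordered)
    product = n * numerator                # Step 2: np, kept as an exact fraction
    if product % denominator != 0:
        # Step 3a: np is not an integer, so use the next integer position.
        position = product // denominator + 1
        return ordered[position - 1]
    # Step 3b: np is an integer, so average positions np and np + 1.
    position = product // denominator
    return (ordered[position - 1] + ordered[position]) / 2


stand_in = list(range(1, 71))  # placeholder for 70 sorted observations
print(quantile(stand_in, 3, 4))                  # position 53 -> 53
print(quantile([11, 12, 15, 17, 21, 32], 1, 2))  # 16.0, matches the median
```

With p = 1/2 the rule reproduces the median calculations for both even and odd n.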


Dispersion

• Describes how similar the observations in a set are to each other, or the degree of deviation (spread) of the data from their central value

– In general, the more spread out a distribution is, the larger the measure of dispersion will be


Measures of Dispersion

• There are five main measures of dispersion:

– Range

– Mean Deviation

– Mean squared deviation (variance)

– Root mean squared deviation (Standard Deviation)

– Inter-quartile range (IQR)



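Measures

Measure: Range
Formula: $x_{\max} - x_{\min}$
Excel formula: =MAX(Data)-MIN(Data)
Pro: Easy to calculate.
Con: Sensitive to extreme data values.

Measure: Mean Deviation
Formula: $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$
Excel formula: =ABS(expr)
Pro: Measures deviation accurately.
Con: Further algebraic treatment is not possible.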



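Measures

Measure: Population Variance
Formula: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
Excel formula: =VARP(array)
Pro: Important measure.
Con: Overestimates the error.

Measure: Sample Variance
Formula: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Excel formula: =VAR(array)
Pro: Important measure.
Con: Overestimates the error.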


REMEMBER

For grouped data:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i x_i^2 - \frac{n}{n-1}\bar{x}^2$
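The variance formulas, including the grouped-data shortcut, can be checked numerically. The sketch below uses illustrative function names; the raw values are the six observations from the median example, and the class marks and frequencies are hypothetical.

```python
import math


def population_variance(values):
    """Mean squared deviation from the mean, dividing by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the sample mean, dividing by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def grouped_sample_variance(marks, freqs):
    """Definition form for grouped data: sum of f_i (x_i - x_bar)^2 over n - 1."""
    n = sum(freqs)
    x_bar = sum(f * x for f, x in zip(freqs, marks)) / n
    return sum(f * (x - x_bar) ** 2 for f, x in zip(freqs, marks)) / (n - 1)


def grouped_sample_variance_shortcut(marks, freqs):
    """Shortcut form: (sum of f_i x_i^2 - n * x_bar^2) over n - 1."""
    n = sum(freqs)
    x_bar = sum(f * x for f, x in zip(freqs, marks)) / n
    return (sum(f * x * x for f, x in zip(freqs, marks)) - n * x_bar ** 2) / (n - 1)


raw = [11, 12, 15, 17, 21, 32]
marks, freqs = [5, 15, 25], [2, 5, 3]  # hypothetical grouped data

print(population_variance(raw), sample_variance(raw))  # 50.0 60.0
print(math.sqrt(sample_variance(raw)))                 # sample standard deviation
print(grouped_sample_variance(marks, freqs))
print(grouped_sample_variance_shortcut(marks, freqs))
```

The two grouped results agree up to floating-point rounding, which is the content of the identity above.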



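Measures

Measure: Population Standard Deviation
Formula: $\sigma = \sqrt{\sigma^2}$
Excel formula: =STDEVP(array)
Pro: Best measure.

Measure: Sample Standard Deviation
Formula: $s = \sqrt{s^2}$
Excel formula: =STDEV(array)
Pro: Best measure.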


Inter-quartile Range

• The inter-quartile range (IQR) is defined as the difference between the third and first quartiles

– The first quartile is the 25th percentile

– The third quartile is the 75th percentile

• IQR = Q3 - Q1

• The semi-interquartile range (SIR) is the IQR divided by two: SIR = (Q3 - Q1)/2


When To Use the SIR

• The IQR is the range for the middle 50% of the data; the SIR is half of that range

• The SIR is often used with skewed data as it is insensitive to the extreme scores

• The SIR is used with open-ended distributions


Interquartile Range

3rd Quartile (Q3) = 525

1st Quartile (Q1) = 445

Interquartile Range = Q3 - Q1 = 525 - 445 = 80



Coefficient of Variation (CV)

• Relative measure (unit free) used to compare variability when

(i) two variables measured in different units are compared

(ii) two variables in the same unit but with different means are compared

Relative measure = (absolute measure / average) × 100

$CV = \frac{s}{\bar{x}} \times 100$


Sample Variance, Standard Deviation, and Coefficient of Variation

Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = 2996.16$

Standard Deviation: $s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$

Coefficient of Variation: $\frac{s}{\bar{x}} \times 100\% = \frac{54.74}{490.80} \times 100\% = 11.15\%$

The standard deviation is about 11% of the mean.
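The arithmetic in this example can be reproduced from the two reported summary values. This is a small check, not a recomputation from the raw rents, which are not listed here; the variable names are illustrative.

```python
import math

sample_variance = 2996.16  # reported sample variance of the rents
sample_mean = 490.80       # reported sample mean of the rents

std_dev = math.sqrt(sample_variance)
cv_percent = std_dev / sample_mean * 100

print(round(std_dev, 2))     # 54.74
print(round(cv_percent, 2))  # 11.15
```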


Skewness

• Skew is a measure of asymmetry in the distribution of data

[Figure: three distribution shapes labeled Positive Skew, Negative Skew, and Normal (skew = 0)]


Measure of Skew

• Skewness is a unit-free measure of shape of any frequency distribution.

• The coefficient compares two samples measured in different units or one sample with a known reference distribution (e.g., symmetric normal distribution).

• Calculate the sample’s skewness coefficient


Nature of Skewness

• If $g_1 > 0$, the distribution has a positive skewness or is right skewed

• If $g_1 < 0$, the distribution has a negative skewness or is left skewed

• If $g_1 = 0$, the distribution is symmetrical
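The sign rules for $g_1$ can be tried out numerically. The formula for the coefficient is not shown here, so the sketch below assumes the common moment-based definition $g_1 = m_3 / m_2^{3/2}$, where $m_2$ and $m_3$ are the second and third central moments with divisor n; other definitions (for example, with small-sample corrections) give slightly different values but the same sign. The data sets are hypothetical.

```python
def skewness_g1(values):
    """Moment coefficient of skewness, g1 = m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 20]   # long tail to the right
left_skewed = [-20, -4, -3, -2, -1]  # mirror image, long tail to the left

print(skewness_g1(symmetric))     # 0.0
print(skewness_g1(right_skewed))  # positive
print(skewness_g1(left_skewed))   # negative
```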


Note: (1) with positive skew, mean > median; (2) with negative skew, mean < median.

Kurtosis

• Kurtosis is the relative length of the tails and the degree of concentration in the center.

• Consider three kurtosis prototype shapes.


Kurtosis

• When the distribution is normally distributed, its kurtosis equals 3 and it is said to be mesokurtic

• When the distribution is less spread out than normal, its kurtosis is greater than 3 and it is said to be leptokurtic

• When the distribution is more spread out than normal, its kurtosis is less than 3 and it is said to be platykurtic


z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value xi is from the mean. An observation’s z-score is a measure of the relative location of the observation in a data set.

$z_i = \frac{x_i - \bar{x}}{s}$

Excel’s STANDARDIZE function can be used to compute the z-score.


z-Scores: Standardized Values for Apartment Rents

Note: the z-score is the amount of deviation relative to the standard deviation.

z-score for the rent value 425:

$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$

Chebyshev’s Theorem

At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.

Chebyshev’s theorem requires z > 1, but z need not be an integer.


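Note: the multiplier is better written as k rather than z, where k > 1.

Chebyshev’s Theorem

• At least 75% of the data values must be within z = 2 standard deviations of the mean.

• At least 89% of the data values must be within z = 3 standard deviations of the mean.

• At least 94% of the data values must be within z = 4 standard deviations of the mean.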
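The percentages quoted for Chebyshev’s theorem follow directly from the bound 1 - 1/k². A minimal sketch, with an illustrative function name:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(k, f"{chebyshev_lower_bound(k):.2%}")
# 2 75.00%
# 3 88.89%
# 4 93.75%
```

The bound also works for non-integer multipliers, for example `chebyshev_lower_bound(1.5)`.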



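Empirical Rule

For data having a bell-shaped distribution:

• 68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.

• 95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.

• 99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.

Note: the interval within k standard deviations of the mean runs from x(bar) - ks to x(bar) + ks.

[Figure: bell-shaped curve centered at the mean, with the intervals of +/- 1, +/- 2, and +/- 3 standard deviations covering 68.26%, 95.44%, and 99.72% of the values]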


Detecting Outliers

An outlier is an unusually small or unusually large value in a data set.

A data value with a z-score less than -3 or greater than +3 might be considered an outlier.


Box Plot

A box plot is a graphical summary of data that can be used to identify outliers.

A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.


Box Plot

Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325

Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645

The lower limit is located 1.5(IQR) below Q1

The upper limit is located 1.5(IQR) above Q3.

There are no outliers (values less than 325 or greater than 645) in the apartment rent data.
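The box plot limits can be computed from the two quartiles alone. The sketch below uses the quartiles reported for the apartment rents; the function names are illustrative, and the values passed to `is_outlier` are chosen only to show both outcomes.

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits located 1.5 * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """True if the value falls outside the box plot limits."""
    lower, upper = box_plot_limits(q1, q3)
    return value < lower or value > upper


lower, upper = box_plot_limits(445, 525)
print(lower, upper)               # 325.0 645.0
print(is_outlier(615, 445, 525))  # False: inside the limits
print(is_outlier(700, 445, 525))  # True: above the upper limit
```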



Box Plot

• Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.

[Figure: box plot of the apartment rents on a horizontal axis running from 400 to 625]

Smallest value inside limits = 425

Largest value inside limits = 615



Weighted Mean

When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.

In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade.

When data vary in importance, the analyst must choose the weight that best reflects the importance of each value.


Weighted Mean

$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$

where:

xi = value of observation i

wi = weight for observation i
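The weighted mean formula can be applied to the GPA case mentioned earlier. The sketch below is illustrative; the grade points and credit hours are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


grade_points = [4, 3, 2]  # hypothetical grade values (A, B, C)
credit_hours = [3, 4, 3]  # hypothetical credit hours earned at each grade

print(weighted_mean(grade_points, credit_hours))  # 3.0
```

With equal weights the result reduces to the ordinary mean.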


Mean for Grouped Data

Sample Data: $\bar{x} = \frac{\sum f_i M_i}{n}$

Population Data: $\mu = \frac{\sum f_i M_i}{N}$

where:

fi = frequency of class i

Mi = midpoint of class i


Sample Mean for Grouped Data

$\bar{x} = \frac{34,525}{70} = 493.21$

This approximation differs by $2.41 from the actual sample mean of $490.80.



Variance for Grouped Data

For sample data: $s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$

For population data: $\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$


Sample Variance for Grouped Data

• Sample Variance

s2 = 208,234.29/(70 - 1) = 3,017.89

• Sample Standard Deviation

$s = \sqrt{3,017.89} = 54.94$

This approximation differs by only $.20 from the actual standard deviation of $54.74.

ACKNOWLEDGEMENT

1) Statistics for Management by Levin & Rubin ( Prentice Hall )

2) Business Statistics by Aczel and Sounderpandian ( Pearson )

3) Business Statistics by Anderson, Sweeney & Williams ( Cengage )

4) Applied Statistics in Business & Economics by Doane ( McGraw-Hill )
