descriptive statistics

Descriptive Statistics

Upload: spandana-achanta

Post on 19-Jul-2016


DESCRIPTION

Quantitative Methods

TRANSCRIPT

Page 1: Descriptive Statistics

Descriptive Statistics

Page 2: Descriptive Statistics

Quantitative (variable)

• Discrete (no. of customers, no. of claims)

• Continuous (salary, price)

Qualitative (attribute)

• Ordinal (customer satisfaction, efficiency of workers, bond rating)

• Nominal (sex, nationality, eye color)

7/3/2013 2 Descriptive Statistics

Page 3: Descriptive Statistics

Data

• Primary

• Secondary

Data

• Time series (unemployment rate, GDP)

• Cross sectional (queue length in different SBI branches)

Page 4: Descriptive Statistics

Definition

Primary Data

• Collected directly from the source

• Collected under the control and supervision of the investigator

Secondary Data

• Not collected by the investigator

• Derived from other sources

Page 5: Descriptive Statistics

Methods of Collecting Primary Data

• Interview Method

• Questionnaire Method

• Observation Method

Page 6: Descriptive Statistics

Diagram Presentation

Diagram

• Line (time series): simple, multiple

• Bar: vertical (time series), horizontal (cross sectional), component, subdivided

• Pie

Page 7: Descriptive Statistics

• When data are collected in original form, they are called raw data.

• When the raw data are organized into a frequency distribution, the frequency is the number of values in a specific class of the distribution (grouped data).

Page 8: Descriptive Statistics

Data Table: Compressive Strength of 80 Aluminum-Lithium Alloy Specimens

105 221 183 186 121 181 180 143

97 154 153 174 120 168 167 141

245 228 174 199 181 158 176 110

163 131 154 115 160 208 158 133

207 180 190 193 194 133 156 123

134 178 76 167 184 135 229 146

218 157 101 171 165 172 158 169

199 151 142 163 145 171 148 158

160 175 149 87 160 237 150 135

196 201 200 176 150 170 118 149

Page 9: Descriptive Statistics

Stem-And-Leaf

Stem | Leaf | Frequency
7 | 6 | 1
8 | 7 | 1
9 | 7 | 1
10 | 5 1 | 2
11 | 5 0 8 | 3
12 | 1 0 3 | 3
13 | 4 1 3 5 3 5 | 6
14 | 2 9 5 8 3 1 6 9 | 8
15 | 4 7 1 3 4 0 8 8 6 8 0 8 | 12
16 | 3 0 7 3 0 5 0 8 7 9 | 10
17 | 8 5 4 4 1 6 2 1 0 6 | 10
18 | 0 3 6 1 4 1 0 | 7
19 | 9 6 0 9 3 4 | 6
20 | 7 1 0 8 | 4
21 | 8 | 1
22 | 1 8 9 | 3
23 | 7 | 1
24 | 5 | 1
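The stem-and-leaf display can be rebuilt from the data table with a few lines of Python. This is only a sketch: the function name `stem_and_leaf` is made up for illustration, and the leaves are printed in sorted order rather than in the order they appear on the slide.

```python
from collections import defaultdict

# Compressive strength values copied from the data table (80 specimens).
data = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]


def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its list of leaves (last digit)."""
    table = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)
        table[stem].append(leaf)
    return dict(sorted(table.items()))


table = stem_and_leaf(data)
for stem, leaves in table.items():
    # One row per stem: stem | sorted leaves | frequency
    print(f"{stem:>2} | {' '.join(map(str, sorted(leaves)))} | {len(leaves)}")
```

The frequency column is simply the number of leaves on each stem, so the frequencies add up to the 80 observations.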

Page 10: Descriptive Statistics

Terms Associated with a Grouped Frequency Distribution

Class width = upper class boundary - lower class boundary

Page 11: Descriptive Statistics

Class Mark or Mid-Value

Class marks are the midpoints of the class boundaries.

Class mark = 1/2 (upper class boundary + lower class boundary)

Page 12: Descriptive Statistics

Frequency Density

FD = class frequency / class width

It gives the number of observations in a class of width one.

Use: when class widths are not equal, frequency density is plotted on the y-axis to draw the histogram.

Page 13: Descriptive Statistics

Relative Frequency

RF = class frequency / total frequency
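The four grouped-data terms (class width, class mark, frequency density, relative frequency) can be computed together. The sketch below uses three hypothetical classes with unequal widths; the boundaries and frequencies are invented purely for illustration.

```python
# Hypothetical classes: (lower boundary, upper boundary, class frequency).
classes = [(0, 10, 4), (10, 20, 10), (20, 50, 6)]

total = sum(freq for _, _, freq in classes)  # total frequency

rows = []
for lower, upper, freq in classes:
    width = upper - lower        # class width
    mark = (lower + upper) / 2   # class mark (mid-value)
    density = freq / width       # frequency density
    rel_freq = freq / total      # relative frequency
    rows.append((width, mark, density, rel_freq))
    print(f"[{lower}, {upper}): width={width}, mark={mark}, FD={density}, RF={rel_freq}")
```

Because the last class is three times as wide as the others, its frequency density is lower than its raw frequency suggests, which is why density rather than frequency goes on the y-axis of the histogram.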

Page 14: Descriptive Statistics

Visualizing Data

• The three most commonly used graphs in research are:

• The histogram.

• The frequency polygon.

• The cumulative frequency graph or ogive.

Page 15: Descriptive Statistics

Key Characteristics

Characteristic: Definition / Interpretation

Central Tendency: Where are the data values concentrated? What seem to be typical or middle data values?

Dispersion: How much variation is there in the data? How spread out are the data values? Are there unusual values?

Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?

Page 16: Descriptive Statistics

Measures

Measure: Mean (raw data)
Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Excel formula: =AVERAGE(Data)
Pro: Familiar and uses all the sample information.
Con: Not applicable to data with extreme values or open-ended classes.

Measure: Mean (grouped data)
Formula: $\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$
Excel formula: =AVERAGE(Data)
Pro: Familiar and uses all the sample information.
Con: Not applicable to data with extreme values or open-ended classes.

Page 17: Descriptive Statistics

Measures

Measure: Median
Formula: Middle value in sorted array
Excel formula: =MEDIAN(Data)
Pro: Robust when extreme data values exist.
Con: Statistical procedures for the median are complex.

Measure: Mode
Formula: Most frequently occurring data value
Excel formula: =MODE(Data)
Pro: Useful for attribute data or discrete data with a small range.
Con: May not be unique, and is not helpful for continuous data.

Page 18: Descriptive Statistics

Population vs Sample Characteristics

• A statistic is a descriptive measure derived from a sample (n items).

• A parameter is a descriptive measure derived from a population (N items).

Page 19: Descriptive Statistics

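Calculation of Mean

Raw data:

Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$; N = population size

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$; n = sample size

Grouped data:

Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{k} f_i x_i$; $N = \sum_{i=1}^{k} f_i$

Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{k} f_i x_i$; $n = \sum_{i=1}^{k} f_i$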
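The raw-data and grouped-data means differ only in whether each value is counted once or weighted by its class frequency. A minimal sketch, with hypothetical function names and illustrative numbers that are not taken from the slides:

```python
def mean_raw(values):
    """Mean of raw data: sum of the values divided by their count."""
    return sum(values) / len(values)


def mean_grouped(marks, freqs):
    """Mean of grouped data: sum of f_i * x_i divided by the sum of f_i."""
    return sum(f * x for f, x in zip(freqs, marks)) / sum(freqs)


# Illustrative raw values.
print(mean_raw([11, 12, 15, 17, 21, 32]))        # 108 / 6

# Illustrative class marks and class frequencies.
print(mean_grouped([5, 15, 35], [4, 10, 6]))     # 380 / 20
```

The same functions serve for a population or a sample; only the symbol attached to the result (mu or x-bar) changes.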

Page 20: Descriptive Statistics

Seventy efficiency apartments were randomly sampled in a small college town. The monthly rent prices for these apartments are listed below.

Sample Mean

Example: Apartment Rents


Page 21: Descriptive Statistics

Sample Mean

$\bar{x} = \frac{\sum x_i}{n} = \frac{34{,}356}{70} = 490.80$

Page 22: Descriptive Statistics

Calculation of Median (n is even)

• Consider the following n = 6 data values: 11 12 15 17 21 32

• What is the median?

For even n, Median = $\frac{x_{n/2} + x_{(n/2)+1}}{2}$

n/2 = 6/2 = 3 and n/2 + 1 = 6/2 + 1 = 4

M = (x3 + x4)/2 = (15 + 17)/2 = 16

The median falls between the two middle values: 11 12 15 [16] 17 21 32

Page 23: Descriptive Statistics

Calculation of Median (n is odd)

• Consider the following n = 7 data values: 11 12 15 17 21 32 38

• What is the median?

For odd n, Median = $x_{(n+1)/2}$

(n+1)/2 = 8/2 = 4, so the median is the 4th sorted value: M = x4 = 17

11 12 15 [17] 21 32 38
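Both median rules fit in one small function. This is a sketch with a hypothetical function name; Python lists are 0-based, so the 1-based positions from the slides are shifted down by one.

```python
def median(values):
    """Median of a list: middle value for odd n, mean of the two middle values for even n."""
    xs = sorted(values)
    n = len(xs)
    mid = n // 2
    if n % 2 == 1:
        return xs[mid]                   # position (n+1)/2 in 1-based terms
    return (xs[mid - 1] + xs[mid]) / 2   # positions n/2 and n/2 + 1 in 1-based terms


print(median([11, 12, 15, 17, 21, 32]))       # even n
print(median([11, 12, 15, 17, 21, 32, 38]))   # odd n
```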

Page 24: Descriptive Statistics

Trimmed Mean

Another measure, sometimes used when extreme values are present, is the trimmed mean.

It is obtained by deleting a percentage of the smallest and largest values from a data set and then computing the mean of the remaining values.

For example, the 5% trimmed mean is obtained by removing the smallest 5% and the largest 5% of the data values and then computing the mean of the remaining values.
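A trimmed mean can be sketched as follows. The function name and the data are hypothetical, and rounding the number of trimmed values down to a whole number is an assumption, since the slides do not say how to handle a fractional count.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end of the sorted data."""
    xs = sorted(values)
    k = len(xs) * percent // 100      # values to drop per end (rounded down, an assumption)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)


# Hypothetical data: 19 ordinary values and one extreme value.
data = list(range(1, 20)) + [1000]
print(sum(data) / len(data))          # ordinary mean, pulled up by 1000
print(trimmed_mean(data, 5))          # 5% trimmed mean drops 1 and 1000
```

With 20 values, 5% corresponds to one value at each end, so the extreme value no longer affects the result.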

Page 25: Descriptive Statistics

• A bimodal distribution refers to the shape of the histogram rather than the mode of the raw data.

• Occurs when dissimilar populations are combined in one sample. For example,

Mode


Page 26: Descriptive Statistics

Percentiles and Quartiles

• Percentiles divide the sorted data into 100 groups and show how the data are spread over the interval from the smallest to the largest value.

• For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you.

• Deciles divide the data into 10 groups.

• Quartiles divide the data into 4 groups.

In general, the pth order quantile or fractile (Zp) is the value below which a proportion p of the total observations lie.

Page 27: Descriptive Statistics

Percentiles and Quartiles

• Put p = 1/4, 2/4, 3/4 to get quartiles

• Put p = 1/10, 2/10, …, 9/10 to get deciles

• Put p = 1/100, 2/100, …, 99/100 to get percentiles

Step 1. Sort the observations.

Step 2. Calculate np, where n = number of observations.

Step 3. If np is not an integer, take the next integer as the position. Otherwise take both np and the next integer as the positions, and take the mean of the two values found there.
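The three steps translate almost directly into code. In this sketch the function name `quantile` is hypothetical, and p is passed as a `fractions.Fraction` so that the test "is np an integer?" is exact rather than subject to floating-point rounding.

```python
import math
from fractions import Fraction


def quantile(values, p):
    """p-th order quantile using the sort / np / position rule described in the steps."""
    xs = sorted(values)               # Step 1: sort the observations
    pos = len(xs) * p                 # Step 2: np, kept as an exact Fraction
    if pos.denominator != 1:          # Step 3a: np is not an integer
        return xs[math.ceil(pos) - 1]     # next integer position (1-based)
    i = int(pos)                      # Step 3b: np is an integer
    return (xs[i - 1] + xs[i]) / 2        # mean of positions np and np + 1


# Median (p = 1/2) of the two small data sets used earlier.
print(quantile([11, 12, 15, 17, 21, 32], Fraction(1, 2)))
print(quantile([11, 12, 15, 17, 21, 32, 38], Fraction(1, 2)))

# Third quartile of 70 hypothetical values 1..70: np = 52.5, so position 53.
print(quantile(range(1, 71), Fraction(3, 4)))
```

With p = 1/2 this rule reproduces the even-n and odd-n median formulas.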

Page 28: Descriptive Statistics

Third Quartile

Third quartile = 75th percentile

np = (75/100)(70) = 52.5, which is not an integer, so the position is 53

Third quartile = 525 (the 53rd value in the sorted rents)

Page 29: Descriptive Statistics

Dispersion

• Describes how similar a set of observations are to each other

or

the degree of deviation (spread) of a set of data from their central value

– In general, the more spread out a distribution is, the larger the measure of dispersion will be


Page 30: Descriptive Statistics

Measures of Dispersion

• There are five main measures of dispersion:

– Range

– Mean Deviation

– Mean squared deviation (variance)

– Root mean squared deviation (Standard Deviation)

– Inter-quartile range (IQR)


Page 31: Descriptive Statistics

Measures

Measure: Range
Formula: $x_{\max} - x_{\min}$
Excel formula: =MAX(Data)-MIN(Data)
Pro: Easy to calculate
Con: Sensitive to extreme data values.

Measure: Mean Deviation
Formula: $\frac{1}{n}\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert$
Excel formula: =ABS(expr)
Pro: Measures deviation accurately
Con: Further algebraic treatment is not possible

Page 32: Descriptive Statistics

Measures

Measure: Population Variance
Formula: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Excel formula: =VARP(array)
Pro: Important measure
Con: Overestimates the error

Measure: Sample Variance
Formula: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Excel formula: =VAR(array)
Pro: Important measure
Con: Overestimates the error

Page 33: Descriptive Statistics

REMEMBER

For grouped data:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n-1}\sum_{i=1}^{k} f_i x_i^2 - \frac{n}{n-1}\bar{x}^2$
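The two forms of the grouped-data sample variance (the definition and the computational shortcut) can be checked against each other numerically. The class marks and frequencies below are hypothetical.

```python
# Hypothetical class marks x_i and class frequencies f_i.
marks = [5, 15, 35]
freqs = [4, 10, 6]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, marks)) / n

# Definition: squared deviations from the mean, weighted by frequency.
by_definition = sum(f * (x - mean) ** 2 for f, x in zip(freqs, marks)) / (n - 1)

# Shortcut: sum of f_i * x_i^2 minus n times the squared mean.
by_shortcut = (sum(f * x * x for f, x in zip(freqs, marks)) - n * mean ** 2) / (n - 1)

print(by_definition, by_shortcut)
```

Both expressions reduce to 2480/19 for these numbers; the shortcut avoids computing each deviation separately.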

Page 34: Descriptive Statistics

Measures

Measure: Population Standard Deviation
Formula: $\sigma = \sqrt{\sigma^2}$
Excel formula: =STDEVP(array)
Pro: Best measure

Measure: Sample Standard Deviation
Formula: $s = \sqrt{s^2}$
Excel formula: =STDEV(array)
Pro: Best measure
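For readers working outside Excel, the measures of dispersion listed so far can be sketched with Python's standard `statistics` module. The data values are hypothetical; `pvariance` and `pstdev` divide by N (population), while `variance` and `stdev` divide by n - 1 (sample).

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values

data_range = max(data) - min(data)
mean = statistics.mean(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

pop_var = statistics.pvariance(data)     # divides by N
sample_var = statistics.variance(data)   # divides by n - 1
pop_sd = statistics.pstdev(data)         # square root of the population variance
sample_sd = statistics.stdev(data)       # square root of the sample variance

print(data_range, mean_deviation, pop_var, sample_var, pop_sd, sample_sd)
```

The sample variance is always a little larger than the population variance computed from the same values, because its divisor is smaller.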

Page 35: Descriptive Statistics

Inter-quartile Range

• The inter-quartile range (IQR) is defined as the difference between the third and first quartiles

– The first quartile is the 25th percentile

– The third quartile is the 75th percentile

• IQR = Q3 - Q1

• Dividing the IQR by two gives the semi-interquartile range: SIR = (Q3 - Q1)/2

Page 36: Descriptive Statistics

When To Use the SIR

• The inter-quartile range is the range for the middle 50% of the data; the semi-interquartile range (SIR) is half of it

• The SIR is often used with skewed data as it is insensitive to the extreme scores

• The SIR is used with open-ended distributions

Page 37: Descriptive Statistics

Interquartile Range

3rd Quartile (Q3) = 525

1st Quartile (Q1) = 445

Interquartile Range = Q3 - Q1 = 525 - 445 = 80

Example: Apartment Rents


Page 38: Descriptive Statistics

Coefficient of Variation (CV)

• Relative measure (unit free) used to compare variability when

(i) two variables with different units are compared

(ii) two variables with the same unit but different means are compared

Relative measure = (absolute measure / average) × 100

$CV = \frac{s}{\bar{x}} \times 100$

Page 39: Descriptive Statistics

Sample Variance, Standard Deviation, and Coefficient of Variation

Example: Apartment Rents

Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 2996.16$

Standard Deviation: $s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$

Coefficient of Variation: $\frac{s}{\bar{x}} \times 100\% = \frac{54.74}{490.80} \times 100\% = 11.15\%$

The standard deviation is about 11% of the mean.
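The coefficient of variation for the apartment rents can be reproduced from the two summary figures quoted in the example. The function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


# Summary figures from the apartment rent example.
cv = coefficient_of_variation(54.74, 490.80)
print(round(cv, 2))
```

Because the result is unit free, it can be compared with the CV of a variable measured in entirely different units.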

Page 40: Descriptive Statistics

Skewness

• Skew is a measure of symmetry in the distribution of data

[Figure: three distribution shapes labeled Positive Skew, Negative Skew, and Normal (skew = 0)]

Page 41: Descriptive Statistics

Measure of Skew

• Skewness is a unit-free measure of shape of any frequency distribution.

• The coefficient compares two samples measured in different units or one sample with a known reference distribution (e.g., symmetric normal distribution).

• Calculate the sample’s skewness coefficient


Page 42: Descriptive Statistics

Nature of Skewness

• If $g_1 > 0$, the distribution has a positive skewness or is right skewed

• If $g_1 < 0$, the distribution has a negative skewness or is left skewed

• If $g_1 = 0$, the distribution is symmetrical
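The transcript does not preserve the slide's formula for the skewness coefficient, so the sketch below assumes one common moment-based definition, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. Other definitions (for example, the bias-adjusted one used by some spreadsheet functions) give slightly different numbers but the same sign.

```python
def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (an assumed definition, see lead-in)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets illustrating the three cases.
print(skewness([1, 2, 3, 4, 5]))        # symmetric
print(skewness([1, 1, 1, 2, 10]))       # long right tail
print(skewness([-1, -1, -1, -2, -10]))  # long left tail
```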

spandana
Notes
1) Positive skew: mean > median 2) Negative skew: mean < median
Page 43: Descriptive Statistics

• Kurtosis is the relative length of the tails and the degree of concentration in the center.

• Consider three kurtosis prototype shapes.

Kurtosis


Page 44: Descriptive Statistics

Kurtosis

• When the distribution is normally distributed, its kurtosis equals 3 and it is said to be mesokurtic

• When the distribution is less spread out than normal, its kurtosis is greater than 3 and it is said to be leptokurtic

• When the distribution is more spread out than normal, its kurtosis is less than 3 and it is said to be platykurtic


Page 45: Descriptive Statistics

z-Scores

The z-score is often called the standardized value.

It denotes the number of standard deviations a data value xi is from the mean. An observation’s z-score is a measure of the relative location of the observation in a data set.

$z_i = \frac{x_i - \bar{x}}{s}$

Excel’s STANDARDIZE function can be used to compute the z-score.

spandana
Notes
Amount of deviation wrt std deviation
Page 46: Descriptive Statistics

z-Scores

Example: Apartment Rents

Standardized Values for Apartment Rents

$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$
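The standardized value for a rent of 425 can be reproduced directly from the sample mean and standard deviation. The function name is hypothetical.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


# Rent of 425 with the sample mean and standard deviation from the example.
z = z_score(425, 490.80, 54.74)
print(round(z, 2))
```

The negative sign shows that this rent lies below the mean, by a little more than one standard deviation.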

Page 47: Descriptive Statistics

Chebyshev’s Theorem

At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.

Chebyshev’s theorem requires z > 1, but z need not be an integer.

spandana
Notes
Mistake : replace z by k where k>1
Page 48: Descriptive Statistics

Chebyshev’s Theorem

At least 75% of the data values must be within z = 2 standard deviations of the mean.

At least 89% of the data values must be within z = 3 standard deviations of the mean.

At least 94% of the data values must be within z = 4 standard deviations of the mean.
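The three percentages follow from evaluating 1 - 1/z^2 at z = 2, 3 and 4. A short sketch, with a hypothetical function name:

```python
def chebyshev_bound(z):
    """Minimum proportion of values within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4):
    print(z, round(chebyshev_bound(z) * 100), "%")

# z need not be an integer.
print(chebyshev_bound(1.5))
```

The bound holds for any data set, whatever its shape, which is why it is weaker than the figures given for bell-shaped data.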

Page 49: Descriptive Statistics

Empirical Rule

For data having a bell-shaped distribution:

68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean.

95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean.

99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean.
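The percentages in the empirical rule are areas under the normal curve, and they can be computed with the error function in Python's `math` module: for a normal variable, the proportion within k standard deviations of the mean is erf(k / sqrt(2)). The function name below is hypothetical. The computed values round to 68.27%, 95.45% and 99.73%; the slightly smaller figures quoted on the slide are what one gets from a normal table rounded to four decimals.

```python
import math


def proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(k, round(proportion_within(k) * 100, 2))
```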

spandana
Notes
x(bar)-ks x(bar)+ks
Page 50: Descriptive Statistics

Empirical Rule

[Figure: bell curve over x with bands at the mean +/- 1, +/- 2 and +/- 3 standard deviations, covering 68.26%, 95.44% and 99.72% of the values]

Page 51: Descriptive Statistics

Detecting Outliers

An outlier is an unusually small or unusually large value in a data set.

A data value with a z-score less than -3 or greater than +3 might be considered an outlier.

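The z-score rule for outliers can be sketched as below. The function name and the data are hypothetical, and the sample standard deviation (divisor n - 1) is used to compute the z-scores.

```python
import statistics


def z_score_outliers(values, threshold=3):
    """Values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation
    return [x for x in values if abs((x - mean) / s) > threshold]


# Hypothetical data: 19 values near 50 and one unusually large value.
data = [50, 51, 49, 50, 52, 48, 50, 51, 49, 50,
        50, 51, 49, 50, 52, 48, 50, 51, 49, 120]
print(z_score_outliers(data))
```

A flagged value "might be considered" an outlier, as the slide puts it; whether it is an error or a genuine extreme observation still has to be checked against the source of the data.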

Page 52: Descriptive Statistics

Box Plot

A box plot is a graphical summary of data that helps identify outliers.

A key to the development of a box plot is the computation of the median and the quartiles Q1 and Q3.

Page 53: Descriptive Statistics

Box Plot

Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325

Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645

The lower limit is located 1.5(IQR) below Q1

The upper limit is located 1.5(IQR) above Q3.

There are no outliers (values less than 325 or greater than 645) in the apartment rent data.

Example: Apartment Rents

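The box plot limits for the apartment rents follow from Q1 and Q3 alone. A minimal sketch, with a hypothetical function name:

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits located 1.5 * IQR below Q1 and above Q3."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Quartiles from the apartment rent example.
lower, upper = box_plot_limits(445, 525)
print(lower, upper)

# Any value outside [lower, upper] would be flagged as an outlier.
print([x for x in (425, 615) if x < lower or x > upper])
```

The two values checked at the end are the smallest and largest rents quoted in the example, and both fall inside the limits.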

Page 54: Descriptive Statistics

Box Plot

• Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.

[Figure: box plot of the apartment rents on an axis from 400 to 625 in steps of 25]

Smallest value inside limits = 425

Largest value inside limits = 615

Example: Apartment Rents


Page 55: Descriptive Statistics

Weighted Mean

When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.

In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade.

When data vary in importance, the analyst must choose the weight that best reflects the importance of each value.


Page 56: Descriptive Statistics

Weighted Mean

$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$

where:

xi = value of observation i

wi = weight for observation i
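As an illustration of the weighted mean, the sketch below computes a grade point average in which credit hours act as the weights. The grades, credit hours and function name are all hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical grade points (x_i) and credit hours (w_i) for three courses.
grade_points = [4, 3, 2]
credit_hours = [3, 4, 3]
print(weighted_mean(grade_points, credit_hours))
```

When every weight is the same, the weighted mean reduces to the ordinary mean.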

Page 57: Descriptive Statistics

Mean for Grouped Data

Sample Data: $\bar{x} = \frac{\sum f_i M_i}{n}$

Population Data: $\mu = \frac{\sum f_i M_i}{N}$

where:

fi = frequency of class i

Mi = midpoint of class i

Page 58: Descriptive Statistics

Sample Mean for Grouped Data

Example: Apartment Rents


Page 59: Descriptive Statistics

Sample Mean for Grouped Data

Example: Apartment Rents

$\bar{x} = \frac{34{,}525}{70} = 493.21$

This approximation differs by $2.41 from the actual sample mean of $490.80.

Page 60: Descriptive Statistics

Variance for Grouped Data

For sample data: $s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$

For population data: $\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$

Page 61: Descriptive Statistics

Sample Variance for Grouped Data


Page 62: Descriptive Statistics

Sample Variance for Grouped Data

Example: Apartment Rents

• Sample Variance

s2 = 208,234.29/(70 – 1) = 3,017.89

• Sample Standard Deviation

$s = \sqrt{3{,}017.89} = 54.94$

This approximation differs by only $.20 from the actual standard deviation of $54.74.

Page 63: Descriptive Statistics

ACKNOWLEDGEMENT

1) Statistics for Management by Levin & Rubin ( Prentice Hall )

2) Business Statistics by Aczel and Sounderpandian ( Pearson )

3) Business Statistics by Anderson, Sweeney & Williams ( Cengage )

4) Applied Statistics in Business & Economics by Doane ( McGraw-Hill )