chapter 1: why study statistics? - ozan ekşi



Post on 04-May-2022


Page 1: CHAPTER 1: WHY STUDY STATISTICS? - Ozan Ekşi

CHAPTER 1: WHY STUDY STATISTICS?

Why Study Statistics?

• A population is a large (or infinite) set of elements that are of interest for a research question. A parameter is a specific characteristic of a population.

  - All the men living in Turkey can be a population. The average height of these men is then a population parameter.

• A sample is a subset of the population that we use to draw conclusions or make predictions about the parameters of the population (for inferences to be valid, sampling should be random). A statistic is a characteristic of the sample.

  - Instead of measuring the height of every man in Turkey, we can randomly select 5000 men from different locations in the country. This is our sample. We can then compute the average height of these men to estimate the average height of all men in Turkey. This average is our sample statistic.
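The difference between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population below is synthetic (heights drawn from an assumed bell-shaped distribution with an arbitrary mean of 175 cm), not real measurements, and the names `population` and `sample` are chosen for illustration.

```python
import random
import statistics

# Hypothetical population: 100,000 simulated heights in cm (not real data).
rng = random.Random(42)
population = [rng.gauss(175.0, 7.0) for _ in range(100_000)]

# Parameter: a characteristic of the whole population.
population_mean = statistics.fmean(population)

# Statistic: the same characteristic computed from a random sample of 5000.
sample = rng.sample(population, 5000)
sample_mean = statistics.fmean(sample)

print(f"population mean (parameter): {population_mean:.2f}")
print(f"sample mean (statistic):     {sample_mean:.2f}")
```

With random sampling the two numbers should be close, which is what makes the sample statistic a reasonable estimate of the population parameter.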

1Ozan Eksi, TOBB-ETU


Types of Statistics

• Inferential Statistics: This is what is explained above, i.e. using sample data for estimation and hypothesis testing (the tools that help us make statements and decisions under uncertainty and incomplete information).

• Descriptive Statistics: Graphical and numerical procedures that are used to present and summarize data. We can use descriptive statistics on either population or sample data.

Chapter Summary

• Terms reviewed in this chapter:

  - Population (Populasyon)
  - Parameter (Parametre)
  - Sample (Örneklem)
  - Inferential Statistics (Çıkarımsal İstatistik)
  - Estimation (Tahmin)
  - Descriptive Statistics (Betimleyici İstatistik)


CHAPTER 2: USING GRAPHS TO DESCRIBE DATA

Data, Variable, and Constant

Data are usually just a set of numbers representing the same kind of thing, such as body weight. That "thing" is called a variable (it is variable because the numbers vary from subject to subject). If the numbers are all the same, the thing is called a constant.

Classification of Variables

• Categorical (sometimes called Nominal) or Numerical

  - Categorical: (Yes or No), (Like, Dislike or Indifferent), ...

  - Numerical: (Discrete: outcome of a die roll, ...), (Continuous: height, time, ...)

• Qualitative or Quantitative

  - Qualitative: These variables are measured on a nominal or ordinal scale. On a nominal scale, numbers are assigned only to label the categories (Yes and No can be labeled as 0 and 1). On an ordinal scale, the numbers also indicate the rank order of the items (Like, Indifferent and Dislike can be labeled as 2, 1, 0). Such data may be recorded either as category names or as numeric codes

  - Quantitative: These variables are measured on an interval or ratio scale. Hence, the numeric values themselves matter


• Independent or Dependent

  - Independent: A variable that stands alone and is not changed by the other variables (e.g. someone's age)

  - Dependent: A variable that is explained by the independent variables


Tables And Graphs to Describe Categorical Variables

• The Frequency Distribution Table shows the number of occurrences (frequency) of each possible outcome

  - A relative frequency distribution is a frequency distribution with each frequency divided by the total number of observations

• Bar Chart, Pie Chart and Pareto Diagram are graphs that present the same information as the Frequency Distribution Table

  - Example: Hospital Patients by Unit

Frequency Distribution Table

Hospital Unit     Number of Patients
Cardiac Care      1,052
Emergency         2,245
Intensive Care      340
Maternity           552
Surgery           4,630

[Figure: bar chart "Hospital Patients by Unit". Horizontal axis: hospital unit; vertical axis: number of patients per year (0 to 5000).]

[Figure: pie chart "Hospital Patients by Unit". Surgery 53%, Emergency 25%, Cardiac Care 12%, Maternity 6%, Intensive Care 4%.]
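The percentages in the pie chart are the frequencies divided by the total number of patients. A minimal sketch of that calculation, using the counts from the hospital example (the variable names are illustrative):

```python
# Patient counts from the "Hospital Patients by Unit" example.
patients = {
    "Cardiac Care": 1052,
    "Emergency": 2245,
    "Intensive Care": 340,
    "Maternity": 552,
    "Surgery": 4630,
}

total = sum(patients.values())

# Relative frequency = frequency / total number of observations.
relative = {unit: count / total for unit, count in patients.items()}

for unit, share in relative.items():
    print(f"{unit:<15} {patients[unit]:>6,} {share:6.1%}")
```

Rounded to whole percentages, these shares match the labels on the pie chart.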


• Pareto Diagram: It is a special bar chart. Unlike ordinary bar and pie charts, a Pareto diagram presents the categories in order of frequency (descending or ascending), and a line represents the cumulative total

  - Ex: 400 defective items are examined for the cause of the defect

Frequency Distribution Table

Source of Manufacturing Error    Number of Defects
Bad Weld                          34
Poor Alignment                   223
Missing Part                      25
Paint Flaw                        78
Electrical Short                  19
Cracked Case                      21
Total                            400

Arranging Data

Source of Manufacturing Error    Number of Defects    % of Total Defects
Poor Alignment                   223                   55.75
Paint Flaw                        78                   19.50
Bad Weld                          34                    8.50
Missing Part                      25                    6.25
Cracked Case                      21                    5.25
Electrical Short                  19                    4.75
Total                            400                  100

[Figure: Pareto diagram "Cause of Manufacturing Defect". Bars (left axis): % of defects in each category, 0% to 60%; line (right axis): cumulative %, 0% to 100%. Categories in order: Poor Alignment, Paint Flaw, Bad Weld, Missing Part, Cracked Case, Electrical Short.]
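The numbers behind a Pareto diagram can be produced in two steps: sort the categories by frequency, then accumulate the percentages. The following is a sketch using the defect counts from the example; the variable names are illustrative.

```python
from itertools import accumulate

# Defect counts from the manufacturing example.
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked Case": 21,
}

total = sum(defects.values())

# Sort categories by frequency, largest first (the order of the bars).
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Percent of total for each bar, and the running total for the line.
percents = [100 * count / total for _, count in ordered]
cumulative = list(accumulate(percents))

for (cause, count), pct, cum in zip(ordered, percents, cumulative):
    print(f"{cause:<17} {count:>4} {pct:6.2f}% {cum:7.2f}%")
```

The first two categories alone account for about three quarters of all defects, which is the kind of conclusion a Pareto diagram is meant to make visible.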


Tables And Graphs to Describe Numerical Variables

• We again use a frequency distribution, just as with categorical variables. However, since numerical data do not come in categories, it is better to form artificial groups (class intervals) than to report the frequency of each individual data point

  - Ex: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature: 24, 35, 17, 21, 24, 37, 26, 46, 58, 30, 32, 13, 12, 38, 41, 43, 44, 27, 53, 27

The Ordered Data

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

The Frequency Distribution Table

Interval               Frequency   Relative Frequency   Percentage
10 but less than 20     3          .15                   15
20 but less than 30     6          .30                   30
30 but less than 40     5          .25                   25
40 but less than 50     4          .20                   20
50 but less than 60     2          .10                   10
Total                  20          1.00                 100
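The grouping step can be sketched in a few lines. The class width of 10 is the choice made in this example, not a general rule, and the variable names are illustrative.

```python
from collections import Counter

# Daily high temperatures from the insulation example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

width = 10  # class width used in the example; other widths are possible

# Map each value to the lower bound of its class, e.g. 24 -> 20.
counts = Counter((t // width) * width for t in temps)

n = len(temps)
for lower in sorted(counts):
    freq = counts[lower]
    print(f"{lower} but less than {lower + width}: "
          f"{freq:2d}  {freq / n:.2f}  {100 * freq / n:.0f}%")
```

Each class includes its lower bound and excludes its upper bound, so a value of exactly 30 falls in the class "30 but less than 40".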


• Note: In this example we used intervals of width 10 to classify the data. However, there is no fixed rule for this. The choice should be case specific

• Note: The best graph is always the one that displays the information in the clearest and most comprehensible way. There is no restriction on the type of graph you may use. However, remember that departing from standard graphs is risky, as it may confuse readers

Histogram

• It is a graph of the (numerical) data in a frequency distribution

Interval               Frequency
10 but less than 20    3
20 but less than 30    6
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2

[Figure: histogram "Daily High Temperature". Horizontal axis: temperature, 0 to 60 in steps of 10; vertical axis: frequency (0 to 7). The bars for the intervals starting at 10, 20, 30, 40 and 50 have heights 3, 6, 5, 4 and 2.]


The Cumulative Frequency Distribution & Ogive (graphing cumulative frequencies)

Interval               Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20     3           15           3                      15
20 but less than 30     6           30           9                      45
30 but less than 40     5           25          14                      70
40 but less than 50     4           20          18                      90
50 but less than 60     2           10          20                     100
Total                  20          100

[Figure: ogive "Daily High Temperature". Horizontal axis: interval endpoints 10 to 60; vertical axis: cumulative percentage (0 to 100).]
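The cumulative columns, and the points that an ogive connects, follow directly from the class frequencies. A small sketch, assuming the class frequencies 3, 6, 5, 4, 2 of the temperature example (names are illustrative):

```python
from itertools import accumulate

# Upper class bounds and frequencies for the temperature classes.
upper_bounds = [20, 30, 40, 50, 60]
frequencies = [3, 6, 5, 4, 2]

n = sum(frequencies)
cum_freq = list(accumulate(frequencies))
cum_pct = [100 * c / n for c in cum_freq]

# The ogive starts at 0% at the lower bound of the first class (10)
# and then passes through (upper bound, cumulative percentage).
ogive_points = [(10, 0.0)] + list(zip(upper_bounds, cum_pct))

for bound, pct in ogive_points:
    print(f"{bound:>3} {pct:6.1f}%")
```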


A line chart (time-series plot)

• It is used to show the values of a variable over time (time series data)

  - Time is measured on the horizontal axis

  - The variable of interest is measured on the vertical axis

• An Example:

[Figure: line chart "Magazine Subscriptions by Year". Horizontal axis: years 1990 to 2006; vertical axis: thousands of subscribers (0 to 350).]

• Cross-Sectional Data: Data collected by observing many subjects at the same point in time. They are usually collected for the purpose of comparison

• Time-Series Cross-Sectional Data: Data collected by observing many subjects at successive points in time


The shape of the distribution

• The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the center

[Figure: histogram "Symmetric Distribution". Horizontal axis: values 1 to 9; vertical axis: frequency.]

• The shape of the distribution is said to be skewed if the observations are not symmetrically distributed around the center

[Figure: histogram "Negatively Skewed Distribution". Horizontal axis: values 1 to 9; vertical axis: frequency.]

[Figure: histogram "Positively Skewed Distribution". Horizontal axis: values 1 to 9; vertical axis: frequency.]


Tables and Graphs to Describe Relationship Between Variables

• The graphs illustrated so far have involved only a single variable

• When two variables exist, other techniques are used:

  - Categorical (Qualitative) Variables: Cross tables (or contingency tables)

  - Numerical (Quantitative) Variables: Scatter plots

Cross Tables

• If there are r categories for the first variable (rows) and c categories for the second variable (columns), the table is called an r x c cross table

• Ex: 4 x 3 Cross Table for Investment Choices by Investor

Investment Category   Investor A   Investor B   Investor C   Total
Stocks                 46.5         55           27.5        129
Bonds                  32.0         44           19.0         95
CD                     15.5         20           13.5         49
Savings                16.0         28            7.0         51
Total                 110.0        147           67.0        324
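The Total row and column of a cross table are just sums over the columns and rows. A minimal sketch using the investment figures from the example (the dictionary layout is one possible representation, chosen for illustration):

```python
investors = ["Investor A", "Investor B", "Investor C"]

# Rows: investment category; columns: investors A, B, C.
table = {
    "Stocks":  [46.5, 55.0, 27.5],
    "Bonds":   [32.0, 44.0, 19.0],
    "CD":      [15.5, 20.0, 13.5],
    "Savings": [16.0, 28.0, 7.0],
}

row_totals = {category: sum(values) for category, values in table.items()}
col_totals = [sum(column) for column in zip(*table.values())]
grand_total = sum(row_totals.values())

for category, values in table.items():
    print(f"{category:<8}", *values, row_totals[category])
print(f"{'Total':<8}", *col_totals, grand_total)
```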


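Side-by-Side Bar Chart

[Figure: side-by-side bar chart "Comparing Investors". One group of bars per investment category (Stocks, Bonds, CD, Savings), one bar per investor (Investor A, Investor B, Investor C); value axis from 0 to 60.]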

Scatter Diagrams

They are used for paired observations taken from two numerical variables. One variable is measured on the vertical axis and the other variable is measured on the horizontal axis

Volume per day   Cost per day
23               125
26               140
29               146
33               160
38               167
42               170
50               188
55               195
60               200

[Figure: scatter diagram "Cost per Day vs. Production Volume". Horizontal axis: volume per day (0 to 70); vertical axis: cost per day (0 to 250).]


Chapter Summary

• Data (veri) in raw form are usually not easy to use for decision making. Some type of organization in the form of tables or graphs is needed

• Terms reviewed in this chapter:

  - Variables (Değişkenler)
  - Categorical (Kategorik)
  - Numerical (Sayısal)
  - Qualitative (Niteliksel)
  - Quantitative (Niceliksel)
  - Independent (Bağımsız)
  - Dependent (Bağımlı)
  - Ordinal scale (Sırasal Ölçek)
  - Ratio scale (Oransal Ölçek)
  - Interval scale (Aralıksal Ölçek)
  - Nominal scale (Sayısal Ölçek)
  - Line chart (Çizgisel Grafik)
  - Bar chart (Çubuk Grafik)
  - Pie chart (Dairesel Grafik)
  - Pareto diagram (Pareto Diyagramı)
  - Histogram (Histogram)
  - Ogive (a cumulative line graph)
  - The Cumulative Frequency Distribution (Kümülatif Frekans Dağılımı)
  - Time Series (Zaman Serisi)
  - Skewed (Çarpık Dağılım)
  - Scatter plot (Saçılım Grafiği)


CHAPTER 3: USING NUMERICAL MEASURES TO DESCRIBE DATA

Measures of Central Tendency

• Mean: Arithmetic average of the values (sum of the values divided by their number)

• Median: Midpoint of the ranked values

• Mode: Most frequently observed value in the data

  - Ex: Suppose the following bicycle prices: 2000, 100, 300, 100, 500

    * The mean is: (2000 + 100 + 300 + 100 + 500)/5 = 600

    * The median can be found after ranking: 2000, 500, 300, 100, 100; it is 300

    * The mode is 100

  - Even though the mean is the most commonly used measure of central tendency, it is sensitive to outliers; that is, it is strongly affected by very high or very low values in the data, even though these values may not be very informative

  - In such cases the median is often used, since the median is not sensitive to extreme values
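The bicycle example can be checked with Python's standard `statistics` module. The second half of the sketch is an illustrative addition: it drops the outlier (2000) to show how differently the mean and the median react.

```python
import statistics

prices = [2000, 100, 300, 100, 500]

mean = statistics.mean(prices)
median = statistics.median(prices)
mode = statistics.mode(prices)
print(mean, median, mode)

# Remove the outlier and recompute: the mean changes a lot,
# the median changes much less.
trimmed = [p for p in prices if p != 2000]
trimmed_mean = statistics.mean(trimmed)
trimmed_median = statistics.median(trimmed)
print(trimmed_mean, trimmed_median)
```

Without the outlier the mean falls from 600 to 250, while the median only moves from 300 to 200.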


• Note: the median is located at position \((n+1)/2\) in the ordered data

  - If the number of values is odd, the median is the middle number

  - If the number of values is even, the median is the average of the two middle numbers

• Formally, the mean (also called the arithmetic mean) is defined as follows:

  - If calculated from a population of N values, the mean is denoted by \(\mu\) and calculated as:

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]

  - If calculated from a sample of n values, the mean is denoted by \(\bar{x}\) and calculated as:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]


Mean and Median Depending on Shape of a Distribution

[Figure: three distribution shapes. Left-Skewed: Mean < Median. Symmetric: Mean = Median. Right-Skewed: Median < Mean.]

Measures of Variability

• Measures of variation give information on the spread or variability of the data values

  - Ex: Same center, different variation


• There are different measures of variability. The ones we are going to discuss are:

  - Range: Difference between the largest and the smallest observations

  - Interquartile Range: Eliminate the high- and low-valued observations and calculate the range of the middle 50% of the data

  - Variance: Average of the squared deviations of the values from the mean

  - Standard Deviation: Square root of the variance

  - Coefficient of Variation: Standard deviation divided by the mean (shows relative variation)

• Range: Difference between the largest and the smallest observations

  - Ex: Range = 14 - 1 = 13

  - However, it ignores the way in which the data are distributed and is sensitive to outliers

[Figure: two data sets plotted on a scale from 7 to 12; each has Range = 12 - 7 = 5.]


• Interquartile Range: Eliminate the high- and low-valued observations and calculate the range of the middle 50% of the data

  - The first quartile, Q1, is the value for which 25% of the observations are smaller and 75% are larger

  - Q2 is the same as the median (50% are smaller, 50% are larger)

  - Only 25% of the observations are greater than the third quartile, Q3

• Ex:

Minimum   Q1    Median (Q2)   Q3    Maximum
12        30    45            57    70

Each of the four segments between these values contains 25% of the observations.

Interquartile range = Q3 - Q1 = 57 - 30 = 27
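As a sketch of how quartiles might be computed in practice, the snippet below applies the standard library function `statistics.quantiles` to the daily high temperature data from Chapter 2. Note the assumption: several conventions exist for interpolating quartiles, and this function's default ("exclusive") method may give slightly different values from a textbook's hand rule.

```python
import statistics

# Daily high temperatures from the Chapter 2 example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

# n=4 splits the data into four equal parts and returns Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(temps, n=4)

iqr = q3 - q1                         # range of the middle 50% of the data
data_range = max(temps) - min(temps)  # largest minus smallest observation

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}, range={data_range}")
```

The interquartile range is much smaller than the range because it ignores the lowest and highest quarters of the data.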


• Variance: Average of the squared deviations of the values from the mean

  - Population mean and variance

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N}, \qquad \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \]

  - Sample mean and variance

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

• Standard Deviation: It is the square root of the variance. \(\sigma\) is the population standard deviation, and \(s\) is the sample standard deviation


• Ex: Sample data (\(x_i\)): 10, 12, 14, 15, 17, 18, 18, 24

  - The sample size is n = 8. The mean can be found by

\[ \bar{x} = \frac{10 + 12 + 14 + 15 + 17 + 18 + 18 + 24}{8} = 16 \]

  - The standard deviation can be found by

\[ s = \sqrt{\frac{(10-\bar{x})^2 + (12-\bar{x})^2 + \cdots + (24-\bar{x})^2}{n-1}} = \sqrt{\frac{(10-16)^2 + (12-16)^2 + \cdots + (24-16)^2}{8-1}} \]

\[ s = \sqrt{\frac{130}{7}} \approx 4.3095 \quad \text{(a measure of the average scatter around the mean)} \]

• You do not have to rank the data to find the variance or the standard deviation

• Both measures are used in hypothesis testing for a single distribution, but they cannot be used to compare the variability of different distributions
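The calculation for the sample 10, 12, 14, 15, 17, 18, 18, 24 can be verified step by step. Computed directly, the squared deviations from the mean of 16 sum to 130, so the sample standard deviation is the square root of 130/7. A short sketch, cross-checked against the standard `statistics` module:

```python
import math
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
squared_deviations = sum((x - mean) ** 2 for x in data)

# Sample variance divides by n - 1, not n.
sample_variance = squared_deviations / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(mean, squared_deviations, round(sample_sd, 4))

# The standard library functions use the same n - 1 definition.
assert math.isclose(sample_variance, statistics.variance(data))
assert math.isclose(sample_sd, statistics.stdev(data))
```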


• Coefficient of Variation: Shows variation relative to the mean, so it measures relative variation and can be used to compare two or more sets of data measured in different units

\[ CV = \left(\frac{s}{\bar{x}}\right) 100\% \]

• Ex:

  - Stock A:

    * Average price last year = $50

    * Standard deviation = $5

\[ CV_A = \left(\frac{5}{50}\right) 100\% = 10\% \]

  - Stock B:

    * Average price last year = $100

    * Standard deviation = $5

\[ CV_B = \left(\frac{5}{100}\right) 100\% = 5\% \]

  - Both stocks have the same standard deviation, but stock B is less variable relative to its price
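The stock comparison reduces to one division per stock. A minimal sketch; the helper name `coefficient_of_variation` is chosen here for illustration.

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Return the coefficient of variation as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)    # Stock A: sd $5, mean $50
cv_b = coefficient_of_variation(5, 100)   # Stock B: sd $5, mean $100

print(f"Stock A: {cv_a:.0f}%  Stock B: {cv_b:.0f}%")
```

Because the result is a percentage of the mean, it can be compared across series with different price levels or units.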


More About Standard Deviation of a Distribution

• Chebyshev's Theorem: For any distribution (not necessarily normal) with mean \(\mu\) and standard deviation \(\sigma\), and for any k > 1, the interval

\[ \mu \pm k\sigma \]

(i.e. within k standard deviations of the mean) contains at least

\[ 100\left[1 - \frac{1}{k^2}\right]\% \]

of the observations

  - Ex:

At least                    Within
(1 - 1/1.5^2) = 55.6%       k = 1.5   (μ ± 1.5σ)
(1 - 1/2^2)   = 75%         k = 2     (μ ± 2σ)
(1 - 1/3^2)   = 88.9%       k = 3     (μ ± 3σ)
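The Chebyshev bound is easy to tabulate for any k greater than 1. A small sketch; the function name `chebyshev_lower_bound` is illustrative.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Minimum percentage of observations within k standard deviations
    of the mean, valid for any distribution when k > 1."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 100 * (1 - 1 / k**2)


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}%")
```

These are guaranteed minimums; for a bell-shaped distribution the actual percentages are considerably higher.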


• If the data distribution is bell-shaped (normally distributed), then the interval

  - \(\mu \pm 1\sigma\) contains about 68% of the values in the population or the sample

  - \(\mu \pm 2\sigma\) contains about 95% of the values in the population or the sample

  - \(\mu \pm 3\sigma\) contains about 99.7% of the values in the population or the sample

[Figure: bell-shaped curve centered at μ, marking the intervals μ ± 1σ (68%), μ ± 2σ (95%) and μ ± 3σ (99.7%).]


Weighted Mean and Measures of Grouped Data

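• The weighted mean of a set of data is

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{\sum w_i} \]

where \(w_i\) is the weight of the ith observation

• It can be used when the data are already grouped into n classes, with \(w_i\) values in the ith class

• Suppose a data set contains values \(m_1, m_2, \ldots, m_K\), occurring with frequencies \(f_1, f_2, \ldots, f_K\)

  - Population mean and variance

\[ \mu = \frac{\sum_{i=1}^{K} f_i m_i}{N} \quad \text{where } N = \sum_{i=1}^{K} f_i, \quad \text{and} \quad \sigma^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N} \]

  - Sample mean and variance

\[ \bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n} \quad \text{where } n = \sum_{i=1}^{K} f_i, \quad \text{and} \quad s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n-1} \]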
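To illustrate the grouped-data formulas, the sketch below reuses the temperature classes from Chapter 2 (frequencies 3, 6, 5, 4, 2). Representing each class by its midpoint (15, 25, ..., 55) is an assumption made here for illustration; it is the usual choice when only the class counts are available.

```python
# Class midpoints (assumed representatives of the classes 10-20, ..., 50-60)
# and the class frequencies from the temperature example.
midpoints = [15, 25, 35, 45, 55]
frequencies = [3, 6, 5, 4, 2]

n = sum(frequencies)

# Sample mean of grouped data: frequency-weighted average of the midpoints.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Sample variance of grouped data: weighted squared deviations over n - 1.
grouped_var = sum(f * (m - grouped_mean) ** 2
                  for f, m in zip(frequencies, midpoints)) / (n - 1)

print(f"grouped mean = {grouped_mean}, grouped variance = {grouped_var:.2f}")
```

The grouped mean of 33 is close to, but not the same as, the mean of the 20 raw observations (32.4), because grouping replaces each value by its class midpoint.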


Measures of Relationships Between Variables

• The covariance measures the strength of the linear relationship between two variables

  - Population covariance:

\[ Cov(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} \]

  - Sample covariance:

\[ Cov(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} \]

  - Only concerned with the strength of the relationship

  - No causal effect is implied

• Interpreting Covariance:

  - Cov(x, y) > 0 ⇒ x and y tend to move in the same direction

  - Cov(x, y) < 0 ⇒ x and y tend to move in opposite directions

  - Cov(x, y) = 0 ⇒ there is no linear relation between x and y


• The Coefficient of Correlation measures the relative strength of the linear relationship between two variables. It is relative because, unlike the covariance, this measure is not affected by the magnitude of the data

  - Population correlation coefficient:

\[ \rho = \frac{Cov(x, y)}{\sigma_x \sigma_y} \]

  - Sample correlation coefficient:

\[ r = \frac{Cov(x, y)}{s_x s_y} \]

• It is unit free and ranges between -1 and 1. The closer to -1, the stronger the negative linear relationship; the closer to 1, the stronger the positive linear relationship. 0 indicates no linear relationship between the variables of interest

[Figure: scatter plots of Y against X illustrating r = -1, r = -.6, r = 0, r = +.3 and r = +1, plus a second pattern that also gives r = 0.]
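As a worked sketch, the sample covariance and correlation can be computed for the nine (volume per day, cost per day) pairs from the scatter diagram example in Chapter 2. The formulas are written out by hand so that each step matches the definitions; the variable names are illustrative.

```python
import math

# Paired observations from the cost versus production volume example.
volume = [23, 26, 29, 33, 38, 42, 50, 55, 60]
cost = [125, 140, 146, 160, 167, 170, 188, 195, 200]

n = len(volume)
mean_x = sum(volume) / n
mean_y = sum(cost) / n

# Sample covariance: sum of cross-deviations divided by n - 1.
s_xy = sum((x - mean_x) * (y - mean_y)
           for x, y in zip(volume, cost)) / (n - 1)

# Sample standard deviations of each variable.
s_x = math.sqrt(sum((x - mean_x) ** 2 for x in volume) / (n - 1))
s_y = math.sqrt(sum((y - mean_y) ** 2 for y in cost) / (n - 1))

# Sample correlation coefficient: covariance scaled by both deviations.
r = s_xy / (s_x * s_y)

print(f"covariance = {s_xy:.2f}, correlation = {r:.3f}")
```

The covariance depends on the units of both variables, whereas r is unit free; here r is close to +1, consistent with the upward pattern in the scatter diagram.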


Chapter Summary

• Terms reviewed in this chapter:

  - Mean (Ortalama)
  - Median (Medyan, Ortanca Değer)
  - Mode (Mod, Tepe Değeri)
  - Measure (Ölçü)
  - Range (Değişim Aralığı)
  - Variance (Varyasyon)
  - Interquartile Range (Yarı-çeyreklik Değişim Aralığı)
  - Coefficient of Variation (Varyasyon Katsayısı)
  - Standard Deviation (Standart Sapma)
  - Weighted Mean (Ağırlıklı Ortalama)
  - Covariance (Covaryasyon)
  - Correlation (Corelasyon)
