Introduction
STAC51: Categorical data Analysis
Mahinda Samarakoon
January 21, 2016
Table of contents
1 Introduction
Basic Concepts
Categorical data analysis is concerned with statistical methods for the analysis of categorical response (dependent) variables. Explanatory variables may be categorical, continuous, or both. For example, the explanatory variables can be income, education, gender, race, etc.
There are two types of categorical variables:
Types of variables
Nominal - unordered categories. Examples:
Major: Mathematics, Statistics or Computer Science
Favorite music: rock, classical, jazz, country, folk, pop
Criminal offense convictions: murder, robbery, assault
Ordinal - ordered categories, but the exact distances between categories are unknown. Examples:
Patient condition: excellent, good, fair, poor
Government spending: too high, about right, too low
Highest attained education level: HS, BS, MS, PhD
Binary variables
A binary variable is a special case of a categorical variable, taking only two values (categories), such as success and failure or true and false.
For binary variables the nominal-ordinal distinction is not important.
Interval variables
An interval variable is one that has meaningful distances between any two values.
Examples: annual income, height, weight, systolic blood pressure level.
Probability Distributions for Categorical Data
In categorical data analysis, the binomial distribution (and its generalization, the multinomial distribution) plays the role that the Normal distribution plays for continuous responses.
Recall that for a Bin(n, π) random variable Y,
$$P(Y = y) = p_Y(y) = \binom{n}{y} \pi^{y} (1 - \pi)^{n - y} \quad \text{for } y = 0, 1, \ldots, n,$$
and zero otherwise.
$E(Y) = n\pi$
$\mathrm{Var}(Y) = n\pi(1 - \pi)$
If $X_1, \ldots, X_n$ are i.i.d. Bernoulli random variables, i.e. $P(X_1 = 1) = \pi$ and $P(X_1 = 0) = 1 - \pi$, then $Y = X_1 + \cdots + X_n \sim \mathrm{Bin}(n, \pi)$. In other words, Y is the number of successes (i.e. 1's) in n independent Bernoulli trials.
Binomial Distribution
Example: According to published statistics, 8% of people ages 14-24 are school dropouts, i.e. persons who are not in regular school and who have not completed the 12th grade or any higher degree. Suppose you pick five people at random from this age group. What is the probability that exactly two of them will be school dropouts?
Solution: Let Y denote the number of school dropouts in this sample of 5 people. Then $Y \sim \mathrm{Bin}(n = 5, \pi = 0.08)$. The question asks for $P(Y = 2)$. Using the formula,
$$P(Y = 2) = \binom{5}{2} (0.08)^{2} (1 - 0.08)^{5 - 2} = 10 \times (0.08)^{2} \times (0.92)^{3} = 0.049836032.$$
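The hand calculation for the dropout example can be checked numerically. The following is a minimal sketch in R that evaluates the binomial formula directly and compares it with the base R function `dbinom`; the variable names are illustrative and do not come from the slides.

```r
# Dropout example: Y ~ Bin(n = 5, pi = 0.08)
n_people  <- 5
p_dropout <- 0.08

# P(Y = 2) from the formula choose(n, y) * pi^y * (1 - pi)^(n - y)
p_formula <- choose(n_people, 2) * p_dropout^2 * (1 - p_dropout)^(n_people - 2)

# The same probability from the built-in binomial pmf
p_builtin <- dbinom(2, size = n_people, prob = p_dropout)

print(c(formula = p_formula, dbinom = p_builtin))
```

Both values should agree with the result of the hand calculation.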
Multinomial Distribution
In some trials more than two outcomes are possible. Suppose each of n independent trials can result in any one of c categories. Let $y_{ij} = 1$ if the i-th trial results in category j, and zero otherwise.
Let $n_j = \sum_{i=1}^{n} y_{ij}$. Then $(n_1, n_2, \ldots, n_c)$ is an observed value (vector) from a multinomial distribution. The probability mass function of the multinomial distribution is given by
$$p(n_1, n_2, \ldots, n_c) = \left( \frac{n!}{n_1!\, n_2! \cdots n_c!} \right) \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_c^{n_c}, \tag{1}$$
where $\pi_j$ is the probability of an outcome in category j (for any trial).
Multinomial Distribution: Example
Suppose we have a bowl with 10 marbles: 2 red marbles, 3 green marbles, and 5 blue marbles. We randomly select 4 marbles from the bowl, with replacement. What is the probability of selecting 2 green marbles and 2 blue marbles?
Solution: Let $Y_1$, $Y_2$ and $Y_3$ denote the numbers of red, green and blue marbles selected, respectively. Then $(Y_1, Y_2, Y_3)$ has a multinomial distribution with $n = 4$, $\pi_1 = 0.2$, $\pi_2 = 0.3$ and $\pi_3 = 0.5$, and
$$P(Y_1 = 0, Y_2 = 2, Y_3 = 2) = \left( \frac{4!}{0!\, 2!\, 2!} \right) 0.2^{0} \times 0.3^{2} \times 0.5^{2} = 6 \times 0.0225 = 0.135.$$
R commands:
> dmultinom(x = c(0, 2, 2), size = 4, prob = c(0.2, 0.3, 0.5))
[1] 0.135
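Some properties of the Multinomial Distribution
If $(Y_1, Y_2, \ldots, Y_c)$ has a multinomial $(n, \pi_1, \pi_2, \ldots, \pi_c)$ distribution, then for each j:
$Y_j \sim \mathrm{Bin}(n, \pi_j)$
$\mu_j = E(Y_j) = n\pi_j$
$\mathrm{Var}(Y_j) = n\pi_j(1 - \pi_j)$
$\mathrm{Cov}(Y_j, Y_k) = E((Y_j - \mu_j)(Y_k - \mu_k)) = -n\pi_j\pi_k$ for $j \neq k$.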
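These moment formulas can be verified exactly for a small case by enumerating all outcomes. The sketch below reuses the marble setting (n = 4, probabilities 0.2, 0.3, 0.5) as an assumed example and relies only on base R (`expand.grid`, `dmultinom`); the object names are illustrative.

```r
n    <- 4
prob <- c(0.2, 0.3, 0.5)

# Enumerate every outcome (y1, y2, y3) with y1 + y2 + y3 = n
outcomes <- expand.grid(y1 = 0:n, y2 = 0:n)
outcomes <- outcomes[outcomes$y1 + outcomes$y2 <= n, ]
outcomes$y3 <- n - outcomes$y1 - outcomes$y2

# Exact multinomial probability of each outcome
pmf <- apply(outcomes, 1, function(y) dmultinom(y, size = n, prob = prob))

# Moments computed directly from the pmf
mean_y1 <- sum(outcomes$y1 * pmf)
mean_y2 <- sum(outcomes$y2 * pmf)
var_y1  <- sum((outcomes$y1 - mean_y1)^2 * pmf)
cov_y12 <- sum((outcomes$y1 - mean_y1) * (outcomes$y2 - mean_y2) * pmf)

# Compare with n*pi_1, n*pi_1*(1 - pi_1) and -n*pi_1*pi_2
print(c(mean_y1 = mean_y1, var_y1 = var_y1, cov_y12 = cov_y12))
print(c(n * prob[1], n * prob[1] * (1 - prob[1]), -n * prob[1] * prob[2]))
```

The negative covariance reflects the constraint that the counts sum to n: a larger count in one category leaves fewer trials for the others.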
Poisson Distribution
Sometimes, count data do not result from a fixed number of trials. An example is the number of accidents during a particular period in a particular city. Random variables of this type often have a Poisson distribution. The probability mass function of the Poisson distribution is given by
$$p(y) = \frac{e^{-\mu} \mu^{y}}{y!}, \quad y = 0, 1, \ldots \tag{2}$$
The parameter µ represents the mean of the distribution. That is, if $Y \sim \mathrm{Po}(\mu)$, then $E(Y) = \mu$. It can also be shown that $\mathrm{Var}(Y) = \mu$.
Poisson Distribution: Example
Births in a hospital occur randomly at an average rate of 1.8 births per hour. It is reasonable to assume that the number of births in any particular hour has a Poisson distribution with mean 1.8.
What is the probability of observing 4 births in a given hour at the hospital?
Solution: Let Y be the number of births in this interval. Then $Y \sim \mathrm{Po}(1.8)$, and so
$$P(Y = 4) = \frac{e^{-1.8}\, 1.8^{4}}{4!} = 0.0723.$$
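The births calculation can be reproduced in R. This is a small sketch that evaluates the Poisson formula directly and also calls the base R function `dpois`; the variable names are illustrative only.

```r
# Births example: Y ~ Po(mu = 1.8)
mu <- 1.8

# P(Y = 4) from the pmf exp(-mu) * mu^y / y!
p_formula <- exp(-mu) * mu^4 / factorial(4)

# The same probability from the built-in Poisson pmf
p_builtin <- dpois(4, lambda = mu)

print(round(c(formula = p_formula, dpois = p_builtin), 4))
```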
Poisson Approximation to the Binomial distribution
If n is large (n ≥ 100), π is small (usually π ≤ 0.01) and nπ ≤ 20, then we can use Poisson(µ = nπ) to approximate the binomial probabilities.
Poisson Approximation to the Binomial distribution: Example
Suppose that 1 in 5000 light bulbs is defective. Let Y denote the number of defective bulbs in a batch of 10000 bulbs.
What is the chance that at most three bulbs will be defective?
Solution: $Y \sim \mathrm{Bin}(n = 10000, \pi = 1/5000 = 0.0002)$.
$$\begin{aligned}
P(Y \le 3) &= P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3) \\
&= \binom{10000}{0} 0.0002^{0} (1 - 0.0002)^{10000 - 0} + \binom{10000}{1} 0.0002^{1} (1 - 0.0002)^{10000 - 1} \\
&\quad + \binom{10000}{2} 0.0002^{2} (1 - 0.0002)^{10000 - 2} + \binom{10000}{3} 0.0002^{3} (1 - 0.0002)^{10000 - 3} = \;?
\end{aligned}$$
Or we can use the Poisson approximation: $Y \overset{\text{approx}}{\sim} \mathrm{Po}(\mu = n\pi = 10000 \times 0.0002 = 2)$.
$$P(Y \le 3) = P(Y = 0) + P(Y = 1) + P(Y = 2) + P(Y = 3) \approx e^{-2} \frac{2^{0}}{0!} + e^{-2} \frac{2^{1}}{1!} + e^{-2} \frac{2^{2}}{2!} + e^{-2} \frac{2^{3}}{3!} = 0.8571235$$
Here are the R commands calculating P(Y ≤ 3) using the two distributions:
> pbinom(3, 10000, 0.0002)
[1] 0.8571415
> ppois(3, 2)
[1] 0.8571235
The Chi-squared Distribution
Another distribution that we often come across in categorical data analysis is the chi-squared distribution.
Definition: Let $Z_1, Z_2, \ldots, Z_\nu$ be ν iid random variables, each having a N(0, 1) distribution. Then the distribution of the random variable $Y = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$ is called a chi-squared distribution with ν degrees of freedom.
Some properties of the Chi-squared distribution
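1 If $Z \sim N(0, 1)$, then $E(Z^2) = \mathrm{Var}(Z) + (E(Z))^2 = 1 + 0^2 = 1$.
2 If $X \sim N(\mu, \sigma^2)$, then it can be shown that for any integer $p \ge 0$,
$$E(X - \mu)^{2p} = \frac{(2p)!}{p!\, 2^{p}} \sigma^{2p} \quad \text{and} \quad E(X - \mu)^{2p+1} = 0.$$
3 $\mathrm{Var}(Z^2) = E(Z^4) - (E(Z^2))^2 = \frac{(2 \times 2)!}{2!\, 2^{2}} \times 1^{2 \times 2} - 1^{2} = 3 - 1 = 2$.
4 If $Y = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$, where $Z_1, Z_2, \ldots, Z_\nu$ are iid N(0, 1), then $E(Y) = E(Z_1^2) + E(Z_2^2) + \cdots + E(Z_\nu^2) = \nu$.
5 If $Y = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2$, where $Z_1, Z_2, \ldots, Z_\nu$ are iid N(0, 1), then, by independence, $\mathrm{Var}(Y) = \mathrm{Var}(Z_1^2) + \mathrm{Var}(Z_2^2) + \cdots + \mathrm{Var}(Z_\nu^2) = 2\nu$.
6 If $Y_1 \sim \chi^2_{\nu_1}$ and $Y_2 \sim \chi^2_{\nu_2}$, and if $Y_1$ and $Y_2$ are independent, then $Y_1 + Y_2 \sim \chi^2_{\nu_1 + \nu_2}$.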
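Properties 3 to 5 can be checked numerically. The following sketch uses numerical integration (`integrate`) against the base R densities `dnorm` and `dchisq`; the choice ν = 5 is an arbitrary assumption made for illustration, and the results are accurate only up to the integration tolerance.

```r
# Fourth moment of a standard normal: E(Z^4) should be 3, so Var(Z^2) = 3 - 1 = 2
fourth_moment <- integrate(function(z) z^4 * dnorm(z), -Inf, Inf)$value
var_z_squared <- fourth_moment - 1^2

# Mean and variance of a chi-squared variable with nu degrees of freedom
nu <- 5  # arbitrary illustrative choice
mean_y   <- integrate(function(y) y   * dchisq(y, df = nu), 0, Inf)$value
second_y <- integrate(function(y) y^2 * dchisq(y, df = nu), 0, Inf)$value
var_y    <- second_y - mean_y^2

# Compare with the theoretical values nu and 2 * nu
print(c(var_z_squared = var_z_squared, mean_y = mean_y, var_y = var_y))
print(c(2, nu, 2 * nu))
```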
Inference for Proportions
Let Y be the number of successes (i.e. 1's) in n independent Bernoulli trials with success probability π. The probability of a success π is usually an unknown parameter, and we estimate it by the sample proportion of successes:
$$\hat{\pi} = \frac{Y}{n}. \tag{3}$$
Some properties of π̂
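1 $\hat{\pi}$ is an unbiased estimator of π (i.e. $E(\hat{\pi}) = \pi$).
2 $\mathrm{Var}(\hat{\pi}) = \frac{\pi(1 - \pi)}{n}$.
3 $\hat{\pi} \overset{P}{\to} \pi$ by the WLLN.
4 $\hat{\pi} \overset{\text{approx}}{\sim} N\left(\pi, \frac{\pi(1 - \pi)}{n}\right)$ for large n, by the CLT.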
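Properties 1, 2 and 4 can be illustrated with exact binomial calculations. In the sketch below, the values n = 200 and π = 0.3 and the cut-off count 66 are assumptions chosen only for illustration; `dbinom`, `pbinom` and `pnorm` are base R functions.

```r
n       <- 200   # assumed sample size
pi_true <- 0.3   # assumed true success probability

# Exact distribution of pi_hat = Y / n, where Y ~ Bin(n, pi_true)
y      <- 0:n
pmf    <- dbinom(y, size = n, prob = pi_true)
pi_hat <- y / n

# Property 1 (unbiasedness) and property 2 (variance)
mean_pi_hat <- sum(pi_hat * pmf)
var_pi_hat  <- sum((pi_hat - mean_pi_hat)^2 * pmf)
print(c(mean = mean_pi_hat, variance = var_pi_hat,
        theoretical_variance = pi_true * (1 - pi_true) / n))

# Property 4 (CLT): exact P(pi_hat <= 66/200) versus the normal approximation
p_exact  <- pbinom(66, size = n, prob = pi_true)
p_normal <- pnorm(66 / n, mean = pi_true, sd = sqrt(pi_true * (1 - pi_true) / n))
print(c(exact = p_exact, normal_approx = p_normal))
```

The two probabilities in the last line are close but not identical; the gap shrinks as n grows, and a continuity correction would reduce it further.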
Definition (Likelihood function): The likelihood function is the probability of the observed data, expressed as a function of the parameter value.
Definition (Maximum Likelihood Estimate): The maximum likelihood estimate (MLE) is the parameter value at which the likelihood function takes its maximum.
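To make the two definitions concrete, here is a minimal sketch for binomial data. The observed values (7 successes in 20 trials) are hypothetical, and the numerical search with `optimize` is only one possible way to locate the maximum. For the binomial model, the maximizer should agree with the sample proportion Y/n up to the optimizer's tolerance.

```r
# Hypothetical data: 7 successes in 20 Bernoulli trials
y_obs    <- 7
n_trials <- 20

# Log-likelihood of pi: log probability of the observed data as a function of pi
loglik <- function(p) dbinom(y_obs, size = n_trials, prob = p, log = TRUE)

# Maximize numerically over the open interval (0, 1)
fit <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)

mle_numeric <- fit$maximum
mle_closed  <- y_obs / n_trials   # sample proportion

print(c(numeric = mle_numeric, sample_proportion = mle_closed))
```

Working with the log-likelihood rather than the likelihood does not change the location of the maximum and is numerically more stable.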