Probability density function
• p : R → [0, ∞) such that
– for all x, p(x) ≥ 0 (a density value, unlike a probability, may exceed 1)
– the total area under p equals 1
• P({x : a ≤ x ≤ b}) is the definite integral of the density function between a and b
Normal density function
• symmetric about mean µ
• Z-score, z = (x-µ)/σ
• p(z) = exp(-z^2/2) times a constant, namely 1/sqrt(2π) for the standard normal
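The density formula and the "probability = area under the density" rule can be checked numerically. The sketch below is only illustrative: the helper names normal_pdf and integrate, and the choice of a midpoint rule, are assumptions made here and are not part of the slides.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: exp(-z^2/2) times the constant 1/(sigma*sqrt(2*pi))."""
    z = (x - mu) / sigma
    return math.exp(-z * z / 2) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the definite integral of f from a to b."""
    width = (b - a) / steps
    return width * sum(f(a + (i + 0.5) * width) for i in range(steps))

# P(-1 <= Z <= 1) as the area under the standard normal density.
area = integrate(normal_pdf, -1.0, 1.0)

# Same probability from the library's cumulative distribution, for comparison.
exact = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)

print(round(area, 4), round(exact, 4))
```

Both numbers should agree to several decimal places; the symmetry about the mean shows up as normal_pdf(x) == normal_pdf(-x) for the standard normal.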
Z-score
• Z-score of x is (x-µ)/σ
• Distance measured by number of standard deviations above or below the mean (to left or right of mean in relative histogram)
Cumulative distribution for standard normal
Integral cannot be solved analytically. Use Table A-2 at back of book, or use NORMDIST(x,0,1,TRUE) in Excel.
Φ(x) equals the probability that X ≤ x: Φ(x) = P({u : -infinity < u ≤ x})
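Outside Excel, the same cumulative probability can be obtained in Python. This is a minimal sketch: the function name phi is an assumption for illustration, while math.erf and statistics.NormalDist are standard-library tools.

```python
import math
from statistics import NormalDist

def phi(x):
    """Standard normal cumulative distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

standard_normal = NormalDist(mu=0.0, sigma=1.0)

for x in (-1.96, 0.0, 1.0, 1.96):
    # Both values play the role of NORMDIST(x,0,1,TRUE) and should agree.
    print(x, round(phi(x), 4), round(standard_normal.cdf(x), 4))
```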
Continuity correction
• Continuity correction is used when the continuous normal distribution is used to approximate the discrete binomial distribution.
Use Φ to approximate binomial probability in following examples
Suppose p = 0.1, n = 100, so µ = np = 10 and σ = sqrt(npq) = 3
• P(fewer than 30 successes) = P(k < 30)
• P(more than 30 successes) = P(k > 30)
• P(exactly 30 successes) = P(k = 30)
• P(at least 30 successes) = P(k ≥ 30)
• P(at most 30 successes) = P(k ≤ 30)
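One way to work the examples above, assuming the usual continuity correction of half a unit on either side of the integer k, is sketched below. The dictionary layout and variable names are choices made for this illustration only.

```python
import math
from statistics import NormalDist

n, p = 100, 0.1
q = 1 - p
mu = n * p                       # 10
sigma = math.sqrt(n * p * q)     # 3
Phi = NormalDist(mu, sigma).cdf  # normal CDF with the binomial's mean and stdev

k = 30
approx = {
    "P(k < 30)":  Phi(k - 0.5),                 # k <= 29, so stop at 29.5
    "P(k > 30)":  1 - Phi(k + 0.5),             # k >= 31, so start at 30.5
    "P(k = 30)":  Phi(k + 0.5) - Phi(k - 0.5),  # the strip from 29.5 to 30.5
    "P(k >= 30)": 1 - Phi(k - 0.5),             # start at 29.5
    "P(k <= 30)": Phi(k + 0.5),                 # stop at 30.5
}
for label, value in approx.items():
    print(label, value)
```

Because 30 lies more than six standard deviations above the mean of 10, the upper-tail values here are extremely small and the lower-tail values are very close to 1.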
Sampling distribution of the mean
• Sampling distribution of the mean is the distribution of sample means for all samples of size n.
• Sampling distribution of the mean is approximately normal for large sample sizes (CENTRAL LIMIT THEOREM)
• The mean of all sample means for samples of size n equals the population mean.
– for n=1, this is by definition of population mean
– inductively, one can establish the truth for any fixed sample size
Die roll example: sampling WITHOUT replacement, sample size of 2, compute average of sample variances
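The die roll computation can be reproduced by brute-force enumeration. The sketch below assumes ordered samples of size 2 from the faces 1 to 6 and the n-1 divisor for the sample variance; it also runs the with-replacement case for comparison. Variable names are illustrative.

```python
from itertools import permutations, product
from statistics import mean, pvariance, variance

faces = [1, 2, 3, 4, 5, 6]
population_variance = pvariance(faces)   # divides by N, giving 35/12

# All ordered samples of size 2, drawn without and with replacement.
without_replacement = list(permutations(faces, 2))   # 30 samples
with_replacement = list(product(faces, repeat=2))    # 36 samples

# statistics.variance uses the n-1 divisor (the usual sample variance).
avg_var_without = mean(variance(s) for s in without_replacement)
avg_var_with = mean(variance(s) for s in with_replacement)

print(population_variance, avg_var_without, avg_var_with)
```

Under these assumptions the with-replacement average matches the population variance 35/12 ≈ 2.92, while the without-replacement average is 3.5, larger by the factor N/(N-1) = 6/5.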
Sampling distribution of a statistic
• More generally, the sampling distribution of any statistic is the distribution of that statistic over all samples of size n.
• For instance, the sampling distribution of the sample standard deviation (resp. variance) is the distribution of all sample standard deviations (resp. variances) over all samples of size n.
• For more examples of this form, see my demo Python program demoSampleMeansVar.py, in class web pages. See also
– outputDemoSampleMeansVarWithReplacement.txt and
– outputDemoSampleMeansVarWithoutReplacement.txt
Histogram of the sample means (X1+X2+X3)/3 for ALL size 3 samples of bear ages: Mean: 43.52 Var: 372.01 StDev: 19.29 Max: 177 Min: 8. For independent draws, V[(X1+X2+X3)/3] = V[X]/3, so stdev[(X1+X2+X3)/3] = stdev(X)/sqrt(3). NOTE that bear age stdev (previous slide) divided by sqrt(3), where 3 is the sample size, equals 19.29.
Histogram of sample variances of ALL size 3 samples of bear ages: Mean: 1116.03 StDev: 1567.25 Max: 9520.33 Min: 0. Note that the mean of all sample variances equals the population variance.
Histogram of 10,000 repetitions of the following experiment: count number of 2’s when rolling die 100 times
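The experiment behind this histogram can be simulated in a few lines. This is a sketch, not the program used for the slide: the seed value and the function name count_twos are arbitrary choices made here.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)   # fixed seed (an arbitrary choice) so the run is repeatable

def count_twos(rolls=100):
    """Roll a fair die `rolls` times and count how many 2's appear."""
    return sum(1 for _ in range(rolls) if rng.randint(1, 6) == 2)

counts = [count_twos() for _ in range(10_000)]

# Binomial theory for comparison: n = 100 trials, p = 1/6.
n, p = 100, 1 / 6
print("simulated mean, stdev:  ", mean(counts), pstdev(counts))
print("theoretical mean, stdev:", n * p, (n * p * (1 - p)) ** 0.5)
```

The simulated mean and standard deviation should land close to np ≈ 16.67 and sqrt(npq) ≈ 3.73; a histogram of counts would show the roughly bell-shaped outline.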
Sampling distribution of the proportion
• Sampling distribution of the proportion is the distribution of all sample proportions for all samples of (equal and fixed) size n
• Proportion is number of “successes” over number of “trials” (number of people who plan to vote for candidate X over number of voters, number of red balls over number of balls in urn).
Properties of distribution of sample proportions
• sample proportions tend to target the value of the population proportion
– relative frequency of r red balls in sample of size n is P(r/n) = h(n,r;N,R)
• for sufficiently large sample sizes, the distribution of sample proportions approximates a normal distribution (recall that the density curve of the hypergeometric is approximately normal for sufficiently large sample size).
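The hypergeometric probabilities h(n,r;N,R) can be tabulated directly to see that the sample proportion targets the population proportion. The urn sizes below are hypothetical numbers chosen for illustration, and the function name hypergeom_pmf is an assumption.

```python
from math import comb

def hypergeom_pmf(n, r, N, R):
    """h(n, r; N, R): probability of r red balls in a sample of size n drawn
    without replacement from an urn of N balls, R of which are red."""
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)

# Hypothetical urn: N = 50 balls, R = 20 red, sample size n = 10.
N, R, n = 50, 20, 10
pmf = {r: hypergeom_pmf(n, r, N, R) for r in range(n + 1)}

# Expected value of the sample proportion r/n, weighted by its probability.
expected_proportion = sum((r / n) * prob for r, prob in pmf.items())

print(expected_proportion, R / N)   # both should be 0.4 up to rounding
```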
Unbiased statistic
• Unbiased statistic is a statistic that targets the population parameter
• sample mean, sample variance, sample proportion are unbiased statistics that target population mean, variance and proportion
• median, range, stdev are statistics that do not target corresponding population parameters.
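Whether a statistic targets its population parameter can be checked by averaging it over every possible sample. The sketch below uses a small, deliberately skewed hypothetical population and all ordered samples of size 3 drawn with replacement; the population values and the helper average_of are assumptions for illustration.

```python
from itertools import product
from statistics import mean, median, pstdev, pvariance, stdev, variance

population = [1, 2, 3, 4, 20]                   # hypothetical, deliberately skewed
samples = list(product(population, repeat=3))   # all 125 ordered samples, with replacement

def average_of(statistic):
    """Average a sample statistic over every sample of size 3."""
    return mean(statistic(s) for s in samples)

def sample_range(s):
    return max(s) - min(s)

print("mean:    ", average_of(mean), "vs", mean(population))
print("variance:", average_of(variance), "vs", pvariance(population))
print("median:  ", average_of(median), "vs", median(population))
print("stdev:   ", average_of(stdev), "vs", pstdev(population))
print("range:   ", average_of(sample_range), "vs", sample_range(population))
```

For this population the first two rows agree (6 and 50), while the averaged median, stdev and range differ from the population values, which is what "does not target" means here.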
Consequences of central limit theorem
• Sampling distribution of the mean (which by definition is the distribution of sample means for all samples of equal and fixed size n) is approximately normal for large sample sizes. The expected value of the sampling distribution for the mean is the population mean.
• Sampling distribution of the proportion is approximately normal for large sample sizes. The expected value of the sampling distribution of the proportion is the population proportion.
However ...
• Sampling distribution of the variance is NOT necessarily approximately normal. In fact, we saw such an example with bear age samples of size 3!
• Nevertheless, the average of the sample variances does equal the population variance, provided that the sampling is done with replacement.
Mean, variance of sample distribution
• Sampling with replacement (binomial):
– Let X be r.v. counting number of successes in sample of size n, where probability of success is p
– E[X] = np, V[X] = npq
– Let Y be r.v. counting relative frequency of success in sample of size n
– E[Y] = E[X/n] = np/n = p
– V[Y] = V[X/n] = npq/n^2 = pq/n = σ^2/n (here σ^2 = pq is the variance of a single trial)
– stdev(Y) = sqrt(pq/n) = σ/sqrt(n)
Mean, variance of sample distribution
• Sampling without replacement (hypergeometric):
– Let X be r.v. counting number of successes in sample of size n, where probability of success is p and N is the population size
– E[X] = np, V[X] = npq(N-n)/(N-1)
– Let Y be r.v. counting relative frequency of success in sample of size n, so Y = X/n
– E[Y] = E[X/n] = np/n = p
– V[Y] = V[X/n] = V[X]/n^2 = (npq/n^2)(N-n)/(N-1) = (pq/n)(N-n)/(N-1)
– stdev(Y) = stdev(X)/n = sqrt[(pq/n)(N-n)/(N-1)]
Due to complicated formulas for sampling without replacement, we ALWAYS use sampling WITH replacement!
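The variance formulas for the sample proportion, with and without replacement, can be checked against a direct computation from the probability functions. The population numbers below are hypothetical, and the helper names are assumptions made for this sketch.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """P(k successes in n trials), sampling WITH replacement."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def hypergeom_pmf(r, n, R, N):
    """P(r successes in a sample of n), sampling WITHOUT replacement
    from a population of N that contains R successes."""
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)

def variance_of_proportion(pmf, n):
    """V[Y] for Y = X/n, computed directly from the pmf of the count X."""
    mean_y = sum((k / n) * pmf(k) for k in range(n + 1))
    return sum((k / n - mean_y) ** 2 * pmf(k) for k in range(n + 1))

# Hypothetical population: N = 50 items, R = 20 successes, sample size n = 10.
N, R, n = 50, 20, 10
p = R / N
q = 1 - p

v_with = variance_of_proportion(lambda k: binomial_pmf(k, n, p), n)
v_without = variance_of_proportion(lambda r: hypergeom_pmf(r, n, R, N), n)

print("with replacement:   ", v_with, "formula", p * q / n)
print("without replacement:", v_without, "formula", (p * q / n) * (N - n) / (N - 1))
print("stdev(Y) without replacement:", sqrt(v_without))
```

The factor (N-n)/(N-1) is less than 1 whenever n > 1, so sampling without replacement gives the smaller variance; it approaches 1 when the population is much larger than the sample.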
Proportion of successes in n trials
• Let X be r.v. that counts the number of successes in n trials, where probability of success is p (absolute frequency)
• Let Y be r.v. that returns the proportion of successes in n trials (relative frequency)
• E[Y] = E[X/n] = E[X]/n = np/n = p
• V[Y] = V[X/n] = V[X]/n^2 = npq/n^2 = pq/n
NOTE: We compute Z-score (x-µ)/(σ/sqrt(n)) since we are working with the sampling distribution of the mean. Now compute Φ(Z-score).
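As a small worked case of this note, the sketch below computes the Z-score of a sample mean and then Φ of it. All numbers (population mean, standard deviation, sample size, observed sample mean) are hypothetical and chosen only to keep the arithmetic simple.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical numbers, chosen only for illustration.
mu, sigma = 100.0, 15.0   # population mean and standard deviation
n = 36                    # sample size
x_bar = 104.0             # observed sample mean

standard_error = sigma / sqrt(n)    # stdev of the sampling distribution of the mean
z = (x_bar - mu) / standard_error   # Z-score of the sample mean
prob = NormalDist().cdf(z)          # Phi(z), the normal approximation to P(sample mean <= x_bar)

print(z, round(prob, 4))
```

Dividing by σ alone instead of σ/sqrt(n) would give the Z-score of a single observation, not of a sample mean.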
Normality test: quantile plot
• sort data in increasing order x1,...,xn
• for each i compute z-scores zi=NORMINV(i/n,0,1), or rather, following the book, zi=NORMINV((2i-1)/(2n),0,1)
• determine if scatter plot (xi,zi) is linear
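The three steps above can be sketched without Excel. The data values below are hypothetical (they are not the temperature data from the slides), and using the Pearson correlation as a numeric stand-in for "is the scatter plot linear" is an assumption of this sketch, not the book's procedure.

```python
from statistics import NormalDist, mean

def normal_quantile_points(data):
    """Pair each sorted value x_i with z_i = NORMINV((2i-1)/(2n), 0, 1)."""
    xs = sorted(data)
    n = len(xs)
    zs = [NormalDist().inv_cdf((2 * i - 1) / (2 * n)) for i in range(1, n + 1)]
    return xs, zs

def correlation(xs, ys):
    """Pearson correlation; values near 1 suggest the scatter plot is close to linear."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data, for illustration only.
data = [97.1, 97.8, 98.0, 98.2, 98.4, 98.6, 98.6, 98.8, 99.0, 99.5]
xs, zs = normal_quantile_points(data)
print(round(correlation(xs, zs), 3))
```

Plotting the pairs (xi, zi) remains the primary check; the correlation is only a quick summary of how straight that plot is.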
[Figure: Quantile plot for temperatures. Vertical axis: Z-scores (-3 to 3); horizontal axis ticks run from 1 to 100. Series: Zscores, with linear trend line y = 0.0304x - 1.6244.]
Interarrival time, or distance between successive genomic motifs
What probability distribution is this?
Excel statistical functions
• QUARTILE
– Returns the quartile of a data set (quart = 0,1,2,3,4)
– Syntax: QUARTILE(array,quart)
• PERCENTILE
– Returns the k-th percentile of values in a range. For example, you can decide to examine candidates who score above the 90th percentile.
– Syntax: PERCENTILE(array,k). Here k is in interval [0,1], e.g. 0.9.
• STANDARDIZE
– Returns Z-score corresponding to x for a distribution with mean and stdev.
– Syntax: STANDARDIZE(x,mean,stdev)
• BINOMDIST
– Returns the individual term binomial distribution probability.
– Syntax: BINOMDIST(k,n,p,cumulative). Used for sampling with replacement.
– Probability (mass) function when cumulative = FALSE; cumulative distribution function when cumulative = TRUE
• HYPGEOMDIST
– Returns the probability of a given number of sample successes, given the sample size, population successes, and population size.
– Syntax: HYPGEOMDIST(r,n,R,N). Used for sampling without replacement.
• PERMUT
– Returns the number of ORDERED sequences of length k drawn from a set of size n. Permutations are different from combinations, for which order is not significant.
– Syntax: PERMUT(n,k)
• POISSON
– Returns the Poisson distribution. A common application of the Poisson distribution is predicting the number of events over a specific time, such as the number of cars arriving at a toll plaza in 1 minute.
– Syntax: POISSON(x,mean,cumulative)
• NORMDIST
– Returns the normal distribution for the specified mean and standard deviation.
– Syntax: NORMDIST(x,mean,stdev,cumulative)
• NORMSINV
– Returns the inverse of the standard normal cumulative distribution. The distribution has a mean of zero and a standard deviation of one.
– Syntax: NORMSINV(probability)
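For readers working outside Excel, several of the functions listed above have close standard-library counterparts in Python. The wrappers below are a rough sketch: the lowercase function names are made up here to mirror the Excel names, they follow the argument orders shown in the syntax lines above, and QUARTILE and PERCENTILE are left out because their interpolation rules need more care.

```python
from math import comb, exp, factorial, perm
from statistics import NormalDist

def standardize(x, mean, stdev):
    """Like STANDARDIZE(x,mean,stdev): the Z-score of x."""
    return (x - mean) / stdev

def binomdist(k, n, p, cumulative):
    """Like BINOMDIST(k,n,p,cumulative): sampling with replacement."""
    def term(j):
        return comb(n, j) * p**j * (1 - p) ** (n - j)
    return sum(term(j) for j in range(k + 1)) if cumulative else term(k)

def hypgeomdist(r, n, R, N):
    """Like HYPGEOMDIST(r,n,R,N): sampling without replacement."""
    return comb(R, r) * comb(N - R, n - r) / comb(N, n)

def poisson(x, mean, cumulative):
    """Like POISSON(x,mean,cumulative)."""
    def term(j):
        return exp(-mean) * mean**j / factorial(j)
    return sum(term(j) for j in range(x + 1)) if cumulative else term(x)

def normdist(x, mean, stdev, cumulative):
    """Like NORMDIST(x,mean,stdev,cumulative)."""
    dist = NormalDist(mean, stdev)
    return dist.cdf(x) if cumulative else dist.pdf(x)

def normsinv(probability):
    """Like NORMSINV(probability): inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(probability)

print(perm(5, 2))                    # like PERMUT(5,2) → 20
print(binomdist(2, 4, 0.5, False))   # → 0.375
print(round(normsinv(0.975), 2))     # → 1.96
```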