Post on 29-Mar-2015
Machine Learning & Bioinformatics 1
Molecular Biomedical Informatics Laboratory
Machine Learning and Bioinformatics
Machine Learning and Bioinformatics 2
Statistics
Machine Learning and Bioinformatics 3
Statistical test
In statistics, a result is called statistically significant if it is unlikely to have occurred by chance
A statistical test determines what outcomes of an experiment would lead to a rejection of the null hypothesis, helping to decide whether experimental results contain enough information to cast doubt on conventional wisdom
It answers the question
– assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the actually observed one?
– that probability is known as the P-value
Machine Learning and Bioinformatics 4
Similar to a criminal trial
A defendant is considered not guilty until his guilt is proven
– the prosecutor tries to prove the guilt of the defendant; only when there is enough incriminating evidence is the defendant convicted
At the start of the procedure, there are two hypotheses
– H0: “the defendant is not guilty”
– H1: “the defendant is guilty”
The first one is called the null hypothesis, and is accepted for the time being
The second one is called the alternative (hypothesis), which is the hypothesis one hopes to support
Machine Learning and Bioinformatics 5
The hypothesis of innocence is only rejected when an error is very unlikely, because one doesn’t want to convict an innocent defendant
Such an error is called an error of the first kind (i.e. the conviction of an innocent person), and the occurrence of this error is controlled to be rare
As a consequence of this asymmetric behavior, the probability of an error of the second kind (acquitting a person who committed the crime) is often rather large

                                      H0 is true (truly not guilty)    H1 is true (truly guilty)
Accept null hypothesis (acquittal)    Right decision                   Wrong decision (Type II error)
Reject null hypothesis (conviction)   Wrong decision (Type I error)    Right decision
Machine Learning and Bioinformatics 6
Philosopher’s beans
Few beans of this handful are white.
Most beans in this bag are white.
Therefore, probably, these beans were taken from another bag.
– this is a hypothetical inference
Terminology
– the beans in the bag are the population
– the handful are the sample
– the null hypothesis is that the sample originated from the population
Machine Learning and Bioinformatics 7
The criterion for rejecting the null hypothesis is the “obvious” difference in appearance (an informal difference in the mean)
Again, assuming that the null hypothesis is true, what is the probability of observing a difference that is at least as extreme as the actually observed one?
To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard
Machine Learning and Bioinformatics 8
Clairvoyant card game
A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called X.
As we try to find evidence of his clairvoyance
– the null hypothesis is that the person is not clairvoyant
– the alternative is, of course, that the person is (more or less) clairvoyant
null hypothesis?
Machine Learning and Bioinformatics 9
If the null hypothesis is valid, the only thing the test person can do is guess
– for every card, the probability (relative frequency) of any single suit appearing is ¼
If the alternative is valid, the test subject will predict the suit correctly with probability greater than ¼
Let p be the probability of guessing correctly; the hypotheses are then
– null hypothesis (H0): p = ¼ (just guessing)
– alternative hypothesis (H1): p > ¼ (true clairvoyant)
Machine Learning and Bioinformatics 10
What’s the decision?
When the test subject correctly predicts all 25 cards, we will consider him clairvoyant, and reject the null hypothesis. The same holds for 24 or 23 hits.
With only 5 or 6 hits, on the other hand, there is no cause to consider him so.
But what about 12 hits, or 17 hits?
– what is the critical number, c, of hits, at which point we consider the subject to be clairvoyant?
– how do we determine the critical value c?
It is obvious that with the choice c=25 we’re more critical than with c=10
Machine Learning and Bioinformatics 11
In practice, one decides how critical one will be
– one decides how often one accepts an error of the first kind (false positive or Type I error)
With c=25 the probability of such an error is very small: P(X = 25 | p = ¼) = (¼)^25 ≈ 10^−15
Being less critical, with c=10, yields a much greater probability of a false positive: P(X ≥ 10 | p = ¼) ≈ 0.07
Each of these is the p-value one would obtain if exactly c hits were observed
Machine Learning and Bioinformatics 12
The probability of Type I error
Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined
Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, we need P(X ≥ c | p = ¼) ≤ 0.01
– from all the numbers c with this property we choose the smallest, in order to minimize the probability of a Type II error (false negative)
– for the above example, we select c=13
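The critical value for the card game can be checked numerically. The following is a minimal sketch, assuming X follows a binomial distribution with n = 25 and p = ¼ under the null hypothesis; the function names binom_tail and critical_value are illustrative and not part of the original slides.

```python
from math import comb

def binom_tail(c, n=25, p=0.25):
    """P(X >= c) for X ~ Binomial(n, p).

    This is the Type I error rate if H0 is rejected whenever X >= c.
    """
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c, n + 1))

def critical_value(alpha, n=25, p=0.25):
    """Smallest c such that P(X >= c) <= alpha."""
    for c in range(n + 2):  # c = n + 1 gives an empty sum, so the loop always returns
        if binom_tail(c, n, p) <= alpha:
            return c

# Type I error rates for a lenient, a moderate and a strict threshold
for c in (10, 13, 25):
    print(f"c = {c:2d}: P(X >= c | H0) = {binom_tail(c):.3g}")

print("critical value for alpha = 1%:", critical_value(0.01))
```

Choosing the smallest admissible c keeps the rejection region as large as the chosen α allows, which is what keeps the Type II error rate down.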
Machine Learning and Bioinformatics 13
P-value vs. α
Machine Learning and Bioinformatics 14
Any questions about the figure in the last slide?
Machine Learning and Bioinformatics 15
Where does the distribution (the blue curve) come from?
Machine Learning and Bioinformatics 16
You have to choose the right one
The hardest part for many people
But please understand the basics, rather than just the practice
Machine Learning and Bioinformatics 17
Normal distribution
A continuous probability distribution, defined on the entire real line, that has a bell-shaped probability density function
The density is known as the Gaussian function: f(x) = 1/(σ√(2π)) · exp(−(x − μ)² / (2σ²))
μ is the mean or expectation (location of the peak); σ² is the variance; σ is known as the standard deviation
The distribution with μ=0 and σ²=1 is called the standard normal distribution or the unit normal distribution
Source: Normal distribution - Wikipedia, the free encyclopedia
Machine Learning and Bioinformatics 18
The normal distribution is considered the most prominent probability distribution in statistics
The normal distribution arises from the central limit theorem
– under mild conditions, the mean of a large number of random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution
Very tractable analytically, that is, a large number of results involving this distribution can be derived in explicit form
For these reasons, the normal distribution is commonly encountered in practice
– for example, the observational error in an experiment is usually assumed to follow a normal distribution
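The central limit theorem can be illustrated with a small simulation. This is only a sketch under stated assumptions: the uniform source distribution, the sample size of 50, the 2000 repetitions and the fixed seed are arbitrary choices for illustration. If the sample means are roughly normal, about 68% of them should fall within one standard error of the true mean and about 95% within two.

```python
import random

random.seed(0)           # fixed seed so the run is repeatable
n, trials = 50, 2000     # illustrative sample size and number of repetitions

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the mean of n draws
# has standard error sqrt(1 / (12 n)).
true_mean = 0.5
se = (1 / (12 * n)) ** 0.5

sample_means = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

within_1 = sum(abs(m - true_mean) <= se for m in sample_means) / trials
within_2 = sum(abs(m - true_mean) <= 2 * se for m in sample_means) / trials

print(f"within 1 SE: {within_1:.3f} (normal distribution: about 0.683)")
print(f"within 2 SE: {within_2:.3f} (normal distribution: about 0.954)")
```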
Slide 19, normal distribution probability density function: http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/2000px-Normal_Distribution_PDF.svg.png
Slide 20, normal distribution cumulative distribution function: http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Normal_Distribution_CDF.svg/2000px-Normal_Distribution_CDF.svg.png
Machine Learning and Bioinformatics 21
Z-test
For any test statistic of which the distribution under the null hypothesis can be approximated by a normal distribution
Because of the central limit theorem, many test statistics are approximately normally distributed for large samples
Many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known
– if the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student t-test may be more appropriate
Source: Z-test - Wikipedia, the free encyclopedia
Machine Learning and Bioinformatics 22
If T is a statistic that is approximately normally distributed under the null hypothesis
– estimate the expected value θ of T under the null hypothesis
– obtain an estimate s of the standard deviation of T
– calculate the standard score Z = (T − θ) / s
– one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and 2Φ(−|Z|), respectively
– Φ is the standard normal cumulative distribution function
Machine Learning and Bioinformatics 23
Z-test example
Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 and 12 points, respectively. Our interest is in the scores of 55 students in a particular school who received a mean score of 96
We can ask whether this mean score is significantly lower than the regional mean
– are the students in this school comparable to a simple random sample of 55 students from the region as a whole
– or are their scores surprisingly low?
Machine Learning and Bioinformatics 24
The standard error: SE = σ / √n = 12 / √55 ≈ 1.62
The z-score, which is the distance from the sample mean to the population mean in units of the standard error: z = (96 − 100) / 1.62 ≈ −2.47
Looking up the table of the standard normal distribution, the probability of observing a standard normal value ≤ −2.47 is about 0.0068
– this one-sided p-value is well below the usual significance levels (e.g. 0.01), so we reject the null hypothesis
If instead of a classroom, we considered a sub-region containing 900 students whose mean score was 99, nearly the same z-score and p-value would be observed (SE = 12 / √900 = 0.4, z = −2.5)
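The reading-test example can be reproduced in a few lines. This is a minimal sketch, assuming the population standard deviation is known; the function name z_test is illustrative, and Φ is taken from Python's statistics.NormalDist.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, pop_mean, pop_sd, n):
    """Return (z, one-tailed p, two-tailed p) for a sample mean.

    Assumes the population standard deviation pop_sd is known.
    """
    se = pop_sd / sqrt(n)              # standard error of the mean
    z = (sample_mean - pop_mean) / se  # standard score Z = (T - theta) / s
    phi = NormalDist().cdf             # standard normal CDF
    return z, phi(-abs(z)), 2 * phi(-abs(z))

# 55 students with mean 96, regional mean 100 and standard deviation 12
z, p_one, p_two = z_test(96, 100, 12, 55)
print(f"school:     z = {z:.2f}, one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")

# 900 students with mean 99: a smaller difference, but a larger sample
z2, p2_one, _ = z_test(99, 100, 12, 900)
print(f"sub-region: z = {z2:.2f}, one-tailed p = {p2_one:.4f}")
```

The second call shows why a difference of one point can be as significant as a difference of four: the standard error shrinks with the square root of the sample size.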
Machine Learning and Bioinformatics 25
Hyper-geometric distribution
A discrete probability distribution that describes the probability of k successes in n draws from a finite population of size N containing m successes without replacement
A random variable X follows the hyper-geometric distribution if its probability mass function is given by P(X = k) = C(m, k) · C(N − m, n − k) / C(N, n), where C(a, b) is the binomial coefficient
– N is the population size; m is the number of success states in the population; n is the number of draws; k is the number of successes
Source: Hypergeometric distribution - Wikipedia, the free encyclopedia
http://www.statsref.com/HTML/hypergeom.png
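The probability mass function translates directly into code. The following is a minimal sketch; the function name hypergeom_pmf and the urn numbers (50 marbles, 5 of them green, 10 drawn) are illustrative assumptions, not taken from the slides.

```python
from math import comb

def hypergeom_pmf(k, N, m, n):
    """P(X = k): k successes in n draws without replacement
    from a population of size N that contains m successes."""
    # Outside this range the event is impossible.
    if k < max(0, n - (N - m)) or k > min(n, m):
        return 0.0
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

# Hypothetical urn: 50 marbles, 5 green, draw 10 without replacement
N, m, n = 50, 5, 10
for k in range(0, min(n, m) + 1):
    print(f"P(X = {k}) = {hypergeom_pmf(k, N, m, n):.6f}")
```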
Machine Learning and Bioinformatics 27
Fisher’s exact test
Used in the analysis of contingency tables
Although in practice it is employed when sample sizes are small, it is valid for all sample sizes
It is called exact because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity
Fisher devised the test due to a boast
– try googling ‘lady tasting tea’
Source: Fisher's exact test - Wikipedia, the free encyclopedia
Machine Learning and Bioinformatics 28
The test is useful for categorical data that result from classifying objects in two different ways
It is used to examine the significance of the association (contingency) between the two kinds of classification
The numbers in the cells of the table form a hyper-geometric distribution under the null hypothesis of independence
             Men  Women  Total
Dieting        1      9     10
Non-dieting   11      3     14
Total         12     12     24
Machine Learning and Bioinformatics 29
Fisher’s exact test example
A sample of teenagers might be divided into
– male and female
– and those that are and are not currently dieting
Test whether the observed difference of proportions is significant
– what is the probability that the 10 dieters would be so unevenly distributed between the women and the men?
– if we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the 12 women, and only 1 from among the 12 men?
             Men  Women  Total
Dieting        1      9     10
Non-dieting   11      3     14
Total         12     12     24
Machine Learning and Bioinformatics 30
The probability follows the hyper-geometric distribution: p = C(a+b, a) · C(c+d, c) / C(n, a+c) = (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! n!)
– the exact probability of this particular arrangement of the data
– under the null hypothesis of independence, that is, that men and women are equally likely to be dieters
– assuming the given marginal totals
We can calculate the exact probability of any arrangement
Fisher showed that to generate a significance level, we need consider only the more extreme cases with the same marginal totals
             Men  Women  Total
Dieting      a    b      a+b
Non-dieting  c    d      c+d
Total        a+c  b+d    n
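For the dieting table (a = 1, b = 9, c = 11, d = 3) the calculation can be sketched as follows. The function names are illustrative, and the one-tailed p-value here sums the observed table and the tables that are more extreme in the same direction (fewer dieting men) with the same marginal totals.

```python
from math import comb

def table_probability(a, b, c, d):
    """Exact probability of one 2x2 table given its marginal totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

def fisher_one_tailed(a, b, c, d):
    """Sum the probabilities of the observed table and of every table
    with the same margins and an even smaller top-left cell."""
    p = 0.0
    while a >= 0 and d >= 0:
        p += table_probability(a, b, c, d)
        # Move one observation out of the top-left cell, keeping all margins fixed.
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
    return p

p_observed = table_probability(1, 9, 11, 3)
p_value = fisher_one_tailed(1, 9, 11, 3)
print(f"probability of the observed table: {p_observed:.6f}")
print(f"one-tailed p-value:                {p_value:.6f}")
```

Only one table is more extreme here (no dieting men at all), so the p-value is the sum of two terms.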
Machine Learning and Bioinformatics 31
Any questions so far?
Machine Learning and Bioinformatics 32
How do you choose the test, or how do you know the distribution?
Machine Learning and Bioinformatics 33
Distribution is “assumed”
Different tests may use the same distribution
One test statistic could be tested under different assumptions
Machine Learning and Bioinformatics 34
Overlap significance
Determine the degree of the overlap
– ; ;
The above statistics answer the degree but not the confidence of overlap
Consider the area outside the two leaves
Can you formulate a statistical test based on the hyper-geometric distribution?
Machine Learning and Bioinformatics 35
Suppose that we are drawing an area as large as the first leaf
What is the probability of obtaining an area with a larger overlap with the second leaf by chance?
– N is the size of the entire area
Notice that the p-value expresses our confidence when we claim that these two leaves overlap, but not the degree of the overlap
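One way to formulate this test is as a hyper-geometric tail probability. The sketch below makes several assumptions that are not in the slides: the area is treated as N discrete units (for example genes), the first leaf covers n units, the second covers m units, the observed overlap is k units, and "larger overlap" is read as "at least as large as the observed one". The function name and the numbers are hypothetical.

```python
from math import comb

def overlap_p_value(N, m, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, m, n).

    N: size of the entire area, m: size of the second leaf,
    n: size of the first leaf (the number of draws), k: observed overlap.
    """
    upper = min(n, m)  # the overlap cannot exceed either leaf
    favourable = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: 1000 units, leaves of 100 and 50 units, 15 shared.
# By chance alone the expected overlap would be 100 * 50 / 1000 = 5.
p = overlap_p_value(N=1000, m=50, n=100, k=15)
print(f"p-value for an overlap of at least 15: {p:.3g}")
```

The same calculation underlies gene set enrichment tests, where the two leaves are a list of selected genes and an annotated gene set.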
http://www.nature.com/nrc/journal/v7/n1/images/nrc2036-f1.jpg
Gene Ontology Enrichment Analysis
Machine Learning and Bioinformatics 37
Student’s t-test
The test statistic follows a Student’s t distribution if the null hypothesis holds
A Z-test is commonly applied when the test statistic follows a normal distribution and the value of a scaling term is known
When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic follows a Student’s t distribution
The t-statistic was introduced in 1908 by William Sealy Gosset (“Student” was his pen name)
Source: Student’s t-test - Wikipedia, the free encyclopedia
Machine Learning and Bioinformatics 38
Compared to normal distribution
The probability of seeing a normally distributed value far (i.e. more than a few standard deviations) from the mean drops off extremely rapidly
– thus, the normal distribution is not robust to the presence of outliers (data that are unexpectedly far from the mean, due to exceptional circumstances, observational error, etc.)
– data with outliers may be better described using a heavy-tailed distribution such as the Student’s t-distribution
If X1, …, Xn are independent normally distributed random variables with mean μ and variance σ²
Machine Learning and Bioinformatics 39
The sample mean X̄ follows a normal distribution
The ratio of the centered sample mean to its estimated standard error, t = (X̄ − μ) / (S / √n), where S is the sample standard deviation, follows the Student’s t-distribution with n−1 degrees of freedom
– this is useful to compare two sets of numerical data
The sum of the squares of the standardized variables (Xi − μ) / σ has the chi-squared distribution with n degrees of freedom
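The t statistic can be computed with the standard library alone. This is a minimal sketch of a one-sample test; the data values and the hypothesized mean are made up for illustration, and the function name one_sample_t is an assumption. Python's standard library has no Student's t cumulative distribution function, so the sketch compares against a tabulated critical value instead of computing a p-value.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """Return (t, degrees of freedom) for H0: the population mean equals mu0."""
    n = len(data)
    se = stdev(data) / sqrt(n)   # estimated standard error (sample SD uses n - 1)
    return (mean(data) - mu0) / se, n - 1

# Hypothetical measurements, tested against a hypothesized mean of 5.0
data = [5.1, 4.9, 5.6, 5.8, 6.0]
t, df = one_sample_t(data, mu0=5.0)
print(f"t = {t:.3f} with {df} degrees of freedom")

# 2.776 is the two-sided 5% critical value of the t distribution with
# 4 degrees of freedom, taken from a standard t table.
print("reject H0 at the 5% level:", abs(t) > 2.776)
```

With only five observations the heavy tails matter: the same statistic would exceed the normal critical value of 1.96, but it does not exceed the t critical value.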
Machine Learning and Bioinformatics 40
How many tests do you remember?
Machine Learning and Bioinformatics 41
That’s why we have
– Choosing the Correct Statistical Test in SAS, Stata and SPSS
– GraphPad - FAQ 1790 - Choosing a statistical test
– The testing process
– Common test statistics
But…
Machine Learning and Bioinformatics 42
Do not use them unless you understand the concepts introduced in these slides
Machine Learning and Bioinformatics 43
Chi-squared distribution
The chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables
Used in chi-squared tests for
– goodness of fit of an observed distribution to a theoretical one
– the independence of two criteria of classification
– confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation
– many other statistical tests also use this distribution, like Friedman’s analysis of variance by ranks
Machine Learning and Bioinformatics 44
A special case of the gamma distribution
If Z1, …, Zk are independent, standard normal random variables, then the sum of their squares Q = Z1² + … + Zk² is distributed according to the chi-squared distribution with k degrees of freedom
This is usually denoted as Q ~ χ²(k) or Q ~ χ²_k
Source: Chi-squared distribution - Wikipedia, the free encyclopedia
Machine Learning and Bioinformatics 45
Chi-squared tests
Also known as chi-square test or χ² test
Note the distinction between the test statistic and its distribution
The distribution is a chi-squared distribution when the null hypothesis is true, or asymptotically true
– the sampling distribution can be approximated to a chi-squared distribution as closely as desired by enlarging the sample size
Often the shorthand for Pearson’s chi-squared test, also known as
– the chi-squared goodness-of-fit test
– the chi-squared test for independence
Machine Learning and Bioinformatics 46
Pearson’s chi-squared test
The best-known of several chi-squared tests
Tests the frequency distributions of events
– the considered events must be mutually exclusive and have total probability 1
– e.g., tests the “fairness” of a die
Used to assess two types of comparison
– test of goodness of fit answers if an observed frequency distribution differs from a theoretical one
– test of independence answers if paired observations on two variables, expressed in a contingency table, are independent
Source: Pearson’s chi-squared test - Wikipedia
Machine Learning and Bioinformatics 47
Steps
Calculate the chi-squared test statistic, χ², which resembles a normalized sum of squared deviations between observed and theoretical frequencies
Determine the degrees of freedom, d, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of the fitted distribution
χ² is then compared to the critical value in the distribution to obtain a p-value
A test that does not rely on the approximation of χ² is Fisher’s exact test, which is more accurate in obtaining a significance level, especially with few observations
Machine Learning and Bioinformatics 48
Test for fit of a distribution
Suppose that there are N observations divided among n cells
A simple application is to test the hypothesis that, in the general population, values would occur in each cell with equal frequency
– the “theoretical frequency” for any cell (under the null hypothesis of a discrete uniform distribution) is Ei = N / n
– the reduction in the degrees of freedom is p=1, notionally because the observed frequencies Oi are constrained to sum to N
– the number of degrees of freedom is n − 1
The value of the test statistic is X² = Σi (Oi − Ei)² / Ei, where X² is Pearson’s cumulative test statistic, which asymptotically approaches a χ² distribution
Machine Learning and Bioinformatics 49
When testing whether observations are random variables whose distribution belongs to a given family of distributions, the “theoretical frequencies” are calculated using a distribution from that family
– the reduction in the degrees of freedom is calculated as p=s+1, where s is the number of parameters (co-variates) estimated in fitting the distribution
– for instance, when checking a normal distribution (where the parameters are mean and standard deviation), p=3
– the number of degrees of freedom is n − p
It should be noted that the degrees of freedom are not based on the number of observations as with a Student’s t distribution
– if testing for a fair, six-sided die, there would be five degrees of freedom because there are six categories
– the number of times the die is rolled will have absolutely no effect on the number of degrees of freedom
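The die example can be worked through as follows. This is a sketch under stated assumptions: the counts for 60 rolls are hypothetical, the function name chi_squared_gof is illustrative, and the decision uses a tabulated critical value (11.070 for 5 degrees of freedom at the 5% level) rather than an exact p-value.

```python
def chi_squared_gof(observed):
    """Pearson's X² against a discrete uniform distribution.

    Returns (X², degrees of freedom) with df = number of cells - 1.
    """
    total = sum(observed)
    cells = len(observed)
    expected = total / cells   # theoretical frequency E_i = N / n
    stat = sum((o - expected) ** 2 / expected for o in observed)
    return stat, cells - 1

# Hypothetical counts of faces 1..6 in 60 rolls of a die
rolls = [5, 8, 9, 8, 10, 20]
stat, df = chi_squared_gof(rolls)
print(f"X² = {stat:.2f} with {df} degrees of freedom")

# 11.070 is the upper 5% point of the chi-squared distribution with
# 5 degrees of freedom, taken from a standard table.
print("reject fairness at the 5% level:", stat > 11.070)
```

Doubling every count would double the statistic but leave the degrees of freedom at five, which is the point made on the slide.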
Machine Learning and Bioinformatics 50
Test of independence
An “observation” consists of the values of two outcomes and the null hypothesis is that the occurrence of these outcomes is statistically independent
Each observation is allocated to one cell of a two-dimensional array of cells (called a table) according to the values of the two outcomes
If there are r rows and c columns in the table, the value of the test statistic is X² = Σi Σj (Oi,j − Ei,j)² / Ei,j, where Ei,j = (total of row i) × (total of column j) / N is the theoretical frequency of cell (i, j) under independence
Fitting the model of “independence” reduces the number of degrees of freedom by p=r+c−1
The number of degrees of freedom is equal to the number of cells r×c, minus the reduction in degrees of freedom, p, which reduces to (r − 1)(c − 1)
             Men        Women      Total
Dieting      O1,1       O1,2       O1,1+O1,2
Non-dieting  O2,1       O2,2       O2,1+O2,2
Total        O1,1+O2,1  O1,2+O2,2  N
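Applied to the dieting counts from the Fisher’s exact test example (1 and 9 dieters, 11 and 3 non-dieters), the test of independence can be sketched as below. The function name is illustrative. The p-value line relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable, so it is valid only for 2×2 tables.

```python
from math import erfc, sqrt

def chi_squared_independence(table):
    """Pearson's X² for an r x c contingency table.

    Returns (X², degrees of freedom) with df = (r - 1)(c - 1).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # E_ij under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Rows: dieting, non-dieting. Columns: men, women.
stat, df = chi_squared_independence([[1, 9], [11, 3]])

# For df = 1 only: P(chi-squared > x) = erfc(sqrt(x / 2)).
p = erfc(sqrt(stat / 2))
print(f"X² = {stat:.3f}, df = {df}, p = {p:.5f}")
```

The expected count in the dieting row is only 5 per cell, so the chi-squared approximation is rough here; this is the kind of small table where Fisher’s exact test is preferred.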
Machine Learning and Bioinformatics 51
Summary
Statistical test
– criminal trial
– philosopher’s beans
– clairvoyant card game
P-value vs. α
You have to choose the right distribution
– normal distribution (z-test)
– hyper-geometric distribution (Fisher’s exact test)
Distinguish between distributions and tests
– different tests with the same distribution
  • overlap significance
  • enrichment analysis
– different distributions for the same test statistic
  • Student’s t-test
Chi-squared tests
– goodness of fit
– test of independence
Machine Learning & Bioinformatics 52
Feature selection
Test whether the selected features are significantly better than the others. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 1/8 (Tue).