Molecular Biomedical Informatics Laboratory
Machine Learning and Bioinformatics

Statistics


Statistical test

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance.

A statistical test determines what outcomes of an experiment would lead to a rejection of the null hypothesis, helping to decide whether experimental results contain enough information to cast doubt on conventional wisdom.

It answers the question
– assuming that the null hypothesis is true, what is the probability of observing a value for the test statistic that is at least as extreme as the one actually observed?
– that probability is known as the p-value

Similar to a criminal trial

A defendant is considered not guilty until his guilt is proven
– the prosecutor tries to prove the guilt of the defendant; only when there is enough incriminating evidence is the defendant convicted

At the start of the procedure, there are two hypotheses
– H0: “the defendant is not guilty”
– H1: “the defendant is guilty”

The first one is called the null hypothesis, and is accepted for the time being.
The second one is called the alternative (hypothesis), which is the hypothesis one hopes to support.

The hypothesis of innocence is rejected only when an error is very unlikely, because one does not want to convict an innocent defendant.
Such an error is called an error of the first kind (i.e. the conviction of an innocent person), and the occurrence of this error is controlled to be rare.
As a consequence of this asymmetric behavior, the probability of an error of the second kind (acquitting a person who committed the crime) is often rather large.

                                      H0 is true (truly not guilty)   H1 is true (truly guilty)
Accept null hypothesis (acquittal)    Right decision                  Wrong decision (Type II error)
Reject null hypothesis (conviction)   Wrong decision (Type I error)   Right decision

Philosopher’s beans

Few beans of this handful are white.
Most beans in this bag are white.
Therefore, probably, these beans were taken from another bag.
– this is a hypothetical inference

Terminology
– the beans in the bag are the population
– the handful are the sample
– the null hypothesis is that the sample originated from the population

The criterion for rejecting the null hypothesis is the “obvious” difference in appearance (an informal difference in the mean).
Again, assuming that the null hypothesis is true, what is the probability of observing a difference that is at least as extreme as the one actually observed?
To be a real statistical hypothesis test, this example requires the formalities of a probability calculation and a comparison of that probability to a standard.

Clairvoyant card game

A person (the subject) is tested for clairvoyance. He is shown the reverse of a randomly chosen playing card 25 times and asked which of the four suits it belongs to.
The number of hits, or correct answers, is called X.
As we try to find evidence of his clairvoyance
– the null hypothesis is that the person is not clairvoyant
– the alternative is, of course, that the person is (more or less) clairvoyant

Null hypothesis?

If the null hypothesis is valid, the only thing the test subject can do is guess
– for every card, the probability (relative frequency) of any single suit appearing is ¼

If the alternative is valid, the test subject will predict the suit correctly with probability greater than ¼.
Suppose that the probability of guessing correctly is p; the hypotheses are then
– null hypothesis (H0): p = ¼ (just guessing)
– alternative hypothesis (H1): p > ¼ (true clairvoyant)

What’s the decision?

When the test subject correctly predicts all 25 cards, we will consider him clairvoyant and reject the null hypothesis. The same holds for 24 or 23 hits.
With only 5 or 6 hits, on the other hand, there is no cause to consider him so.
But what about 12 hits, or 17 hits?
– what is the critical number, c, of hits, at which point we consider the subject to be clairvoyant?
– how do we determine the critical value c?

It is obvious that with the choice c=25 we’re more critical than with c=10.
In practice, one decides how critical one will be
– one decides how often one accepts an error of the first kind (false positive, or Type I error)

With c=25 the probability of such an error is very small: P(X = 25 | p = ¼) = (¼)^25 ≈ 10^−15.
Being less critical, with c=10, yields a much greater probability of a false positive: P(X ≥ 10 | p = ¼) ≈ 0.07.
These are p-values.

The probability of Type I error

Before the test is actually performed, the maximum acceptable probability of a Type I error (α) is determined.
Depending on this Type I error rate, the critical value c is calculated. For example, if we select an error rate of 1%, c must satisfy P(X ≥ c | p = ¼) ≤ 0.01
– from all the numbers c with this property we choose the smallest, in order to minimize the probability of a Type II error (false negative)
– for the above example, we select c=13
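The choice of c in the card game can be checked numerically. The following is a minimal sketch, assuming X follows a binomial distribution with 25 trials and success probability ¼ under the null hypothesis; the function names are illustrative and not part of the original slides.

```python
from math import comb


def binom_upper_tail(n: int, p: float, c: int) -> float:
    """Return P(X >= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c, n + 1))


def critical_value(n: int, p: float, alpha: float) -> int:
    """Return the smallest c such that P(X >= c | H0) <= alpha."""
    for c in range(n + 1):
        if binom_upper_tail(n, p, c) <= alpha:
            return c
    return n + 1  # no rejection region exists at this alpha


n_cards, p_guess = 25, 0.25  # 25 cards, four suits (null hypothesis: guessing)
c = critical_value(n_cards, p_guess, alpha=0.01)
print("critical value c:", c)
print("P(X >= 10 | H0):", binom_upper_tail(n_cards, p_guess, 10))
print("P(X >= 25 | H0):", binom_upper_tail(n_cards, p_guess, 25))
```

Choosing the smallest admissible c keeps the rejection region as large as the Type I error budget allows, which is what limits the Type II error.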

P-value vs. α

Any questions about the figure in the last slide?

Where does the distribution (the blue curve) come from?

You have to choose the right one
– the hardest part for many people

But please understand the basics, rather than the practice

Normal distribution

A continuous probability distribution, defined on the entire real line, that has a bell-shaped probability density function
– the density is known as the Gaussian function: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$

μ is the mean or expectation (location of the peak); σ² is the variance; σ is known as the standard deviation.
The distribution with μ=0 and σ²=1 is called the standard normal distribution or the unit normal distribution.
Source: Normal distribution - Wikipedia, the free encyclopedia

The normal distribution is considered the most prominent probability distribution in statistics.
The normal distribution arises from the central limit theorem
– under mild conditions, the mean of a large number of random variables independently drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution

Very tractable analytically, that is, a large number of results involving this distribution can be derived in explicit form.
For these reasons, the normal distribution is commonly encountered in practice
– for example, the observational error in an experiment is usually assumed to follow a normal distribution
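The central limit theorem mentioned above can be illustrated with a small simulation. This is only a sketch: the uniform source distribution, the sample sizes and the seed are arbitrary assumptions chosen for the illustration.

```python
import random
import statistics


def sample_means(n_draws: int, n_means: int, seed: int = 0) -> list:
    """Return n_means sample means, each over n_draws Uniform(0, 1) values."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.random() for _ in range(n_draws))
        for _ in range(n_means)
    ]


means = sample_means(n_draws=30, n_means=5000)
mu = statistics.fmean(means)
sd = statistics.stdev(means)

# For a normal distribution about 95% of values lie within 1.96 standard deviations.
within = sum(abs(m - mu) <= 1.96 * sd for m in means) / len(means)

print("mean of sample means:", mu)        # close to 0.5, the uniform mean
print("sd of sample means:", sd)          # close to sqrt(1/12) / sqrt(30)
print("fraction within 1.96 sd:", within)
```

Although each individual value is uniformly distributed, the sample means behave approximately like a normal variable.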

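Figure: probability density function of the normal distribution
http://upload.wikimedia.org/wikipedia/commons/thumb/7/74/Normal_Distribution_PDF.svg/2000px-Normal_Distribution_PDF.svg.png

Figure: cumulative distribution function of the normal distribution
http://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Normal_Distribution_CDF.svg/2000px-Normal_Distribution_CDF.svg.png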

Z-test

Source: Z-test - Wikipedia, the free encyclopedia

Applicable to any test statistic whose distribution under the null hypothesis can be approximated by a normal distribution.
Because of the central limit theorem, many test statistics are approximately normally distributed for large samples.
Many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known
– if the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student t-test may be more appropriate

If T is a statistic that is approximately normally distributed under the null hypothesis
– estimate the expected value θ of T under the null hypothesis
– obtain an estimate s of the standard deviation of T
– calculate the standard score Z = (T − θ) / s
– one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and 2Φ(−|Z|), respectively
– Φ is the standard normal cumulative distribution function
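The steps listed above can be sketched in a few lines. Φ is computed here from the complementary error function in the Python standard library; the function names and the sample inputs are assumptions made for the illustration.

```python
from math import erfc, sqrt


def std_normal_cdf(x: float) -> float:
    """Standard normal CDF, Phi(x), via the complementary error function."""
    return 0.5 * erfc(-x / sqrt(2.0))


def z_test(t: float, theta: float, s: float):
    """Return (Z, one-tailed p, two-tailed p) for statistic t.

    theta is the expected value of the statistic under the null hypothesis
    and s is an estimate of its standard deviation.
    """
    z = (t - theta) / s
    one_tailed = std_normal_cdf(-abs(z))
    return z, one_tailed, 2.0 * one_tailed


# Hypothetical inputs: an observed statistic two standard deviations above
# its expected value under the null hypothesis.
z, p_one, p_two = z_test(t=2.0, theta=0.0, s=1.0)
print(z, p_one, p_two)
```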

Z-test: example

Suppose that in a particular geographic region, the mean and standard deviation of scores on a reading test are 100 and 12 points, respectively.
Our interest is in the scores of 55 students in a particular school who received a mean score of 96.
We can ask whether this mean score is significantly lower than the regional mean
– are the students in this school comparable to a simple random sample of 55 students from the region as a whole
– or are their scores surprisingly low?

The standard error: $SE = \sigma / \sqrt{n} = 12 / \sqrt{55} \approx 1.62$
The z-score, which is the distance from the sample mean to the population mean in units of the standard error: $z = (96 - 100) / 1.62 \approx -2.47$
Looking up the table of the standard normal distribution, the probability of observing a standard normal value ≤ −2.47 is about 0.0068
– since 1 − 0.0068 = 0.9932, we reject the null hypothesis with 99.32% confidence

If instead of a classroom we considered a sub-region containing 900 students whose mean score was 99, nearly the same z-score and p-value would be observed.
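The reading-score example can be reproduced as follows. This is a sketch under the slide's assumptions (known regional mean 100 and standard deviation 12, one-tailed test for a lower mean); the helper name is illustrative.

```python
from math import erfc, sqrt


def lower_tail_z_test(sample_mean: float, pop_mean: float, pop_sd: float, n: int):
    """Return (standard error, z-score, one-tailed p-value P(Z <= z))."""
    se = pop_sd / sqrt(n)
    z = (sample_mean - pop_mean) / se
    p = 0.5 * erfc(-z / sqrt(2.0))  # Phi(z), the standard normal CDF at z
    return se, z, p


# School: 55 students with mean score 96.
school_se, school_z, school_p = lower_tail_z_test(96, 100, 12, 55)

# Sub-region: 900 students with mean score 99.
region_se, region_z, region_p = lower_tail_z_test(99, 100, 12, 900)

print(school_se, school_z, school_p)
print(region_se, region_z, region_p)
```

A difference of only one point becomes about as significant as a four-point difference once the sample is large enough, because the standard error shrinks with the square root of n.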

Hyper-geometric distribution

A discrete probability distribution that describes the probability of k successes in n draws, without replacement, from a finite population of size N containing m successes.
A random variable X follows the hyper-geometric distribution if its probability mass function is given by
$P(X = k) = \frac{\binom{m}{k} \binom{N-m}{n-k}}{\binom{N}{n}}$
– N is the population size; m is the number of success states in the population; n is the number of draws; k is the number of successes

Source: Hypergeometric distribution - Wikipedia, the free encyclopedia
Figure: http://www.statsref.com/HTML/hypergeom.png
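The probability mass function translates directly into code. A minimal sketch using the symbols N, m, n and k from the slide; the example numbers are hypothetical.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, m: int, n: int) -> float:
    """P(X = k): k successes in n draws without replacement from a
    population of size N that contains m successes."""
    if k < max(0, n - (N - m)) or k > min(n, m):
        return 0.0  # k is outside the support of the distribution
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)


# Hypothetical urn: 50 beans, 5 of them white, 10 beans drawn.
N, m, n = 50, 5, 10
for k in range(0, 6):
    print(k, hypergeom_pmf(k, N, m, n))
```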

Fisher’s exact test

Used in the analysis of contingency tables.
Although in practice it is employed when sample sizes are small, it is valid for all sample sizes.
It is called exact because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity.
Fisher devised the test due to a boast
– try googling ‘lady tasting tea’

Source: Fisher's exact test - Wikipedia, the free encyclopedia

The test is useful for categorical data that result from classifying objects in two different ways.
It is used to examine the significance of the association (contingency) between the two kinds of classification.
The numbers in the cells of the table form a hyper-geometric distribution under the null hypothesis of independence.

              Men   Women   Total
Dieting         1       9      10
Non-dieting    11       3      14
Total          12      12      24

Fisher’s exact test: example

A sample of teenagers might be divided into
– male and female
– and those that are and are not currently dieting

Test whether the observed difference of proportions is significant
– what is the probability that the 10 dieters would be so unevenly distributed between the women and the men?
– if we were to choose 10 of the teenagers at random, what is the probability that 9 of them would be among the 12 women, and only 1 from among the 12 men?

              Men   Women   Total
Dieting         1       9      10
Non-dieting    11       3      14
Total          12      12      24

The probability follows the hyper-geometric distribution
– the exact probability of this particular arrangement of the data
– on the null hypothesis of independence that men and women are equally likely to be dieters
– assuming the given marginal totals

We can calculate the exact probability of any arrangement.
Fisher showed that to generate a significance level, we need consider only the more extreme cases with the same marginal totals.

              Men   Women   Total
Dieting         a       b     a+b
Non-dieting     c       d     c+d
Total         a+c     b+d       n
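For the a, b, c, d table above, the hyper-geometric probability of one arrangement is $p = \binom{a+b}{a}\binom{c+d}{c} / \binom{n}{a+c}$. The sketch below evaluates it for the dieting data (1 dieting man, 9 dieting women, 11 non-dieting men, 3 non-dieting women) and sums the more extreme tables in one direction only (fewer dieting men). That one-sided choice and the function names are assumptions of this illustration.

```python
from math import comb


def table_probability(a: int, b: int, c: int, d: int) -> float:
    """Exact probability of one 2x2 table, given its marginal totals."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """Sum the probabilities of the observed table and of every table with a
    smaller value of a, keeping all marginal totals fixed."""
    total = 0.0
    while a >= 0 and d >= 0:
        total += table_probability(a, b, c, d)
        # Move one observation so that row and column totals stay unchanged.
        a, b, c, d = a - 1, b + 1, c + 1, d - 1
    return total


p_observed = table_probability(1, 9, 11, 3)
p_value = fisher_one_sided(1, 9, 11, 3)
print("probability of the observed table:", p_observed)
print("one-sided p-value:", p_value)
```

Here the only more extreme table is the one with no dieting men at all, so the p-value is the sum of two terms.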

Any questions so far?

How do you choose the test, or how do you know the distribution?

The distribution is “assumed”
– different tests may use the same distribution
– one test statistic could be tested under different assumptions

Overlap significance

Determine the degree of the overlap
– [three overlap statistics; the formulas are missing from the source]

The above statistics answer the degree but not the confidence of overlap.
Consider the area outside the two leaves.
Can you formulate a statistical test based on the hyper-geometric distribution?

Suppose that we are drawing an area as large as the first leaf.
What’s the probability of obtaining an area with a larger overlap with the second leaf by chance?
– N is the size of the entire area

Notice that the p-value answers the confidence when we claim that these two leaves overlap, but not the degree of the overlap.

Figure: http://www.nature.com/nrc/journal/v7/n1/images/nrc2036-f1.jpg
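One possible answer to the question on this slide is a hyper-geometric tail probability. The sketch below assumes the area is measured in discrete units (for example pixels, or genes in an enrichment analysis) and takes "by chance" to mean a random draw of n units from the N available. The p-value is the probability of an overlap at least as large as the observed one; the names are illustrative.

```python
from math import comb


def overlap_p_value(N: int, m: int, n: int, k: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N, m, n).

    N: size of the entire area
    m: size of the second leaf
    n: size of the first leaf (the number of units drawn)
    k: observed overlap between the two leaves
    """
    upper = min(m, n)  # the overlap cannot exceed the smaller leaf
    favourable = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical sizes: an area of 100 units, leaves of 20 and 10 units,
# with 5 units shared.
print(overlap_p_value(N=100, m=20, n=10, k=5))
```

Applied to the dieting table from the Fisher's exact test slides (N=24, m=12 women, n=10 dieters, k=9 dieting women), this tail probability coincides with the one-sided Fisher p-value, which shows two differently named tests sharing one distribution.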

Gene Ontology Enrichment Analysis

Student’s t-test

The test statistic follows a Student’s t distribution if the null hypothesis is supported.
A Z-test is commonly applied when the test statistic follows a normal distribution and the value of a scaling term is known.
When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic follows a Student’s t distribution.
The t-statistic was introduced in 1908 by William Sealy Gosset (“Student” was his pen name).
Source: Student’s t-test - Wikipedia, the free encyclopedia

Compared to normal distribution

The probability of seeing a normally distributed value far (i.e. more than a few standard deviations) from the mean drops off extremely rapidly
– thus, the normal distribution is not robust to the presence of outliers (data that are unexpectedly far from the mean, due to exceptional circumstances, observational error, etc.)
– data with outliers may be better described using a heavy-tailed distribution such as the Student’s t-distribution

If $X_1, \dots, X_n$ are independent normally distributed random variables with mean μ and variance σ²

The sample mean $\bar{X}$ follows a normal distribution (with mean μ and variance σ²/n).
The standardized sample mean $(\bar{X} - \mu) / (S / \sqrt{n})$, where S is the sample standard deviation, follows the Student’s t-distribution with n−1 degrees of freedom
– this is useful to compare two sets of numerical data

The sum of the squares of the standardized values $(X_i - \mu) / \sigma$ has the chi-squared distribution with n degrees of freedom.
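The t-statistic with n−1 degrees of freedom can be computed as below. This is a sketch of the one-sample case only; the data are made up, and turning t into a p-value needs the Student's t cumulative distribution function, which is not in the Python standard library, so the result would be compared against a t table or a statistics package.

```python
from math import sqrt
import statistics


def one_sample_t(data, mu0: float):
    """Return (t, degrees of freedom) for H0: the population mean equals mu0."""
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    t = (mean - mu0) / (s / sqrt(n))
    return t, n - 1


# Hypothetical measurements tested against a hypothesised mean of 3.
t_stat, dof = one_sample_t([2, 4, 4, 4, 6], mu0=3.0)
print(t_stat, dof)
```

Because the standard deviation is estimated from the same five values, the statistic is referred to the t-distribution with 4 degrees of freedom, not to the standard normal.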

How many tests do you remember?

That’s why we have
– Choosing the Correct Statistical Test in SAS, Stata and SPSS
– GraphPad - FAQ 1790 - Choosing a statistical test
– The testing process
– Common test statistics

But…

Do not use them unless you understand the concepts introduced in these slides

Chi-squared distribution

The chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables.
Used in chi-squared tests for
– goodness of fit of an observed distribution to a theoretical one
– the independence of two criteria of classification
– confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation
– many other statistical tests also use this distribution, like Friedman’s analysis of variance by ranks

A special case of the gamma distribution.
If $Z_1, \dots, Z_k$ are independent, standard normal random variables, then the sum of their squares
$Q = \sum_{i=1}^{k} Z_i^2$
is distributed according to the chi-squared distribution with k degrees of freedom.
This is usually denoted as $Q \sim \chi^2(k)$ or $Q \sim \chi^2_k$.
Source: Chi-squared distribution - Wikipedia, the free encyclopedia

Chi-squared tests

Also known as chi-square test or χ² test.
Note the distinction between the test statistic and its distribution.
The distribution is a chi-squared distribution when the null hypothesis is true, or asymptotically true
– the sampling distribution can be approximated by a chi-squared distribution as closely as desired by enlarging the sample size

Often used as shorthand for Pearson’s chi-squared test, also known as
– the chi-squared goodness-of-fit test
– the chi-squared test for independence

Pearson’s chi-squared test

Source: Pearson’s chi-squared test - Wikipedia
The best-known of several chi-squared tests.
Tests the frequency distributions of events
– the considered events must be mutually exclusive and have total probability 1
– e.g., tests the “fairness” of a die

Used to assess two types of comparison
– a test of goodness of fit answers whether an observed frequency distribution differs from a theoretical one
– a test of independence answers whether paired observations on two variables, expressed in a contingency table, are independent

Steps
– calculate the chi-squared test statistic, χ², which resembles a normalized sum of squared deviations between observed and theoretical frequencies
– determine the degrees of freedom, d, of that statistic, which is essentially the number of frequencies reduced by the number of parameters of the fitted distribution
– χ² is then compared to the critical value of the chi-squared distribution with d degrees of freedom to obtain a p-value

A test that does not rely on the approximation of χ² is Fisher’s exact test, which is more accurate in obtaining a significance level, especially with few observations.

Test for fit of a distribution

Suppose that there are N observations divided among n cells.
A simple application is to test the hypothesis that, in the general population, values would occur in each cell with equal frequency
– the “theoretical frequency” for any cell (under the null hypothesis of a discrete uniform distribution) is $E_i = N / n$
– the reduction in the degrees of freedom is p=1, notionally because the observed frequencies $O_i$ are constrained to sum to N
– the number of degrees of freedom is n−1

The value of the test statistic is $X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$, where X² is Pearson’s cumulative test statistic, which asymptotically approaches a χ² distribution.

When testing whether observations are random variables whose distribution belongs to a given family of distributions, the “theoretical frequencies” are calculated using a distribution from that family
– the reduction in the degrees of freedom is calculated as p=s+1, where s is the number of co-variates (fitted parameters) used in fitting the distribution
– for instance, when checking a normal distribution (where the parameters are mean and standard deviation), p=3
– the number of degrees of freedom is n−p

It should be noted that the degrees of freedom are not based on the number of observations, as with a Student’s t distribution
– if testing for a fair, six-sided die, there would be five degrees of freedom because there are six categories
– the number of times the die is rolled has no effect on the number of degrees of freedom
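The die example can be worked through numerically. The roll counts below are invented for the illustration; 11.07 is the tabulated upper 5% critical value of the chi-squared distribution with 5 degrees of freedom.

```python
def chi_squared_uniform(observed):
    """Pearson's statistic against a discrete uniform null hypothesis.

    Returns (X^2, degrees of freedom), with E_i = N / n for every cell and
    n - 1 degrees of freedom.
    """
    total = sum(observed)        # N, the number of observations
    cells = len(observed)        # n, the number of cells
    expected = total / cells     # theoretical frequency per cell
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, cells - 1


# Hypothetical counts for faces 1..6 over 60 rolls.
counts = [5, 8, 9, 8, 10, 20]
x2, dof = chi_squared_uniform(counts)

CRITICAL_5_PERCENT_DF5 = 11.07  # from a chi-squared table, df = 5
print(x2, dof, "reject H0" if x2 > CRITICAL_5_PERCENT_DF5 else "do not reject H0")
```

Rolling the die 600 times instead of 60 would change the expected counts but not the five degrees of freedom.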

Test of independence

An “observation” consists of the values of two outcomes, and the null hypothesis is that the occurrence of these outcomes is statistically independent.
Each observation is allocated to one cell of a two-dimensional array of cells (called a table) according to the values of the two outcomes.
If there are r rows and c columns in the table, the value of the test statistic is
$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}}$, where $E_{i,j} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}$
Fitting the model of “independence” reduces the number of degrees of freedom by p=r+c−1.
The number of degrees of freedom is equal to the number of cells, r×c, minus the reduction in degrees of freedom, p, which reduces to (r − 1)(c − 1).

              Men         Women       Total
Dieting       O1,1        O1,2        O1,1+O1,2
Non-dieting   O2,1        O2,2        O2,1+O2,2
Total         O1,1+O2,1   O1,2+O2,2   N
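As an illustration, the test of independence can be applied to the dieting table used for Fisher's exact test (1 and 9 dieting, 11 and 3 non-dieting). The sketch assumes the usual expected counts, row total × column total / N. The p-value shortcut is valid only for one degree of freedom, where the chi-squared variable is the square of a standard normal variable.

```python
from math import erfc, sqrt


def chi_squared_independence(table):
    """Return (X^2, degrees of freedom) for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: dieting, non-dieting. Columns: men, women.
x2, dof = chi_squared_independence([[1, 9], [11, 3]])

# With 1 degree of freedom, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(x2 / 2.0))
print(x2, dof, p_value)
```

The expected counts here are as small as 5, so this approximate p-value is best read next to the exact one from Fisher's test.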

Summary

Statistical test
– criminal trial
– philosopher’s beans
– clairvoyant card game

P-value vs. α

You have to choose the right distribution
– normal distribution (z-test)
– hyper-geometric distribution (Fisher’s exact test)

Distinguish between distributions and tests
– different tests with the same distribution
  • overlap significance
  • enrichment analysis
– different distributions for the same test statistic
  • Student’s t-test

Chi-squared tests
– goodness of fit
– test of independence


Feature selection

Test whether the selected features are significantly better than the others. Upload and test them in our simulation system. Finally, commit your best version and send TA Jang a report before 23:59 1/8 (Tue).
