
Statistical foundations of machine learning

INFO-F-422

Gianluca Bontempi

Département d’Informatique

Boulevard de Triomphe - CP 212

http://www.ulb.ac.be/di


Random experiment

We define a random experiment as any action or process that generates results or observations which cannot be predicted with certainty. Examples:

• tossing a coin,

• rolling dice,

• measuring the time to reach home


Probability space

• A random experiment is characterized by a sample space Ω, that is the set of all the possible outcomes ω of the experiment. This space can be either finite or infinite. For example, in the die experiment, Ω = {ω1, ω2, . . . , ω6}.

• The elements of the set Ω are called experimental outcomes. The outcome of an experiment need not be a number; for example, the outcome when a coin is tossed can be 'heads' or 'tails'.

• A subset of experimental outcomes is called an event. An example of event is the set of even values E = {ω2, ω4, ω6}.

• A single execution of a random experiment is called a trial. At each trial we observe one outcome ωi. We say that an event occurred during this trial if it contains the element ωi. For example, in the die experiment, if we observe the outcome ω4 the event "even" took place.


Events and set theory

Since events E are subsets, we can apply to them the terminology of set theory:

• E^c = {ω ∈ Ω : ω ∉ E} denotes the complement of E.

• E1 ∪ E2 = {ω ∈ Ω : ω ∈ E1 OR ω ∈ E2} refers to the event that occurs when E1 or E2 or both occur.

• E1 ∩ E2 = {ω ∈ Ω : ω ∈ E1 AND ω ∈ E2} refers to the event that occurs when both E1 and E2 occur.

• Two events E1 and E2 are mutually exclusive or disjoint if

E1 ∩ E2 = ∅

that is, each time E1 occurs, E2 does not occur.

• A partition of Ω is a set of disjoint sets Ej, j = 1, . . . , n such that

∪_{j=1}^n Ej = Ω


Class of events

• The class E of events is not an arbitrary collection of subsets of Ω.

• We want that, if E1 and E2 are events, the intersection E1 ∩ E2 and the union E1 ∪ E2 also be events.

• We do so because we will want to know not only the probabilities of various events, but also the probabilities of their unions and intersections.

• In the following we will consider only events that, in mathematical terms, form a Borel field.


Axiomatic definition of probability

• Let Ω be the certain event that occurs in every trial and let E1 + E2 denote the event that occurs when E1 or E2 or both occur.

• The axiomatic approach to probability consists in assigning to each event E a number Prob{E}, which is called the probability of the event E. This number is so chosen as to satisfy the following three conditions:

1. Prob{E} ≥ 0 for any E.

2. Prob{Ω} = 1

3. Prob{E1 + E2} = Prob{E1} + Prob{E2} if E1 ∩ E2 = ∅, that is, if E1 and E2 are mutually exclusive (or disjoint).

• These conditions are the axioms of the theory of probability (Kolmogoroff, 1933).

• It follows that

Prob{E} = ∑_{ω∈E} Prob{ω}

• All probabilistic conclusions are based directly or indirectly on the axioms and only the axioms.

• But how do we define these probability numbers?


Symmetrical definition of probability

Consider a random experiment where

• the sample space is made of N symmetric outcomes, i.e. we have no reason to expect or prefer one over the other;

• the number of outcomes which are favorable to the event E (i.e. if they occur then the event E takes place) is N_E.

Then, according to the principle of indifference (a term popularized by J.M. Keynes in 1921), we have

Prob{E} = N_E / N

Note that this number is determined without any experimentation and is based on symmetry assumptions. But what happens if symmetry does not hold?


Frequentist definition of probability

• Let us consider a random experiment and an event E.

• Let us repeat the experiment N times and let us compute the number of times N_E that the event E occurs. The quantity N_E/N is the relative frequency of E.

• It can be observed that the frequency converges to a fixed value for increasing N.

• This observation led Von Mises to use the notion of frequency as a foundation for the notion of probability.


Frequentist definition of probability

• It is based on the following definition (Von Mises):

Definition 1. The probability Prob{E} of an event E is the limit

Prob{E} = lim_{N→∞} N_E / N

where N is the number of observations (trials) and N_E is the number of times that E occurred.

• This definition appears reasonable and it is compatible with the axioms.

• According to this approach, the probability is not a property of a specific observation, but rather it is a property of the entire set of observations.

• In practice, in any physical experiment N is finite and the limit has to be accepted as a hypothesis, not as a number that can be determined experimentally.

• However, the frequentist interpretation is very important to show the links between theory and application.


Weak law of Large Numbers

• A link between the axiomatic and the frequentist approach is provided by the weak law of large numbers.

• It was shown by Bernoulli that

Theorem 1. For any ε > 0,

Prob{ |N_E/N − p| ≤ ε } → 1 as N → ∞

where p = Prob{E}.

• According to this theorem, the ratio N_E/N is close to p in the sense that, for any ε > 0, the probability that |N_E/N − p| ≤ ε tends to 1 as N → ∞.

• The weak law essentially states that for any nonzero margin specified, no matter how small, with a sufficiently large sample there will be a very high probability that the frequency will be close to the probability, that is, within the margin.


Probability and frequency

Note that the law of large numbers does NOT mean that N_E will be close to N · p. In fact,

Prob{N_E = Np} ≈ 1 / √(2πNp(1 − p)) → 0, as N → ∞

For instance, in a fair coin-tossing game this law does not imply that the absolute difference between the number of heads and tails should oscillate close to zero. On the contrary, it could happen that the absolute difference keeps growing (though more slowly than the number of tosses).


Probability and frequency

[Figure: left panel, relative frequency vs. number of trials; right panel, absolute difference (no. heads and tails) vs. number of trials.]
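Below is a minimal R sketch, added for illustration, of the simulation behind such a figure: it tracks the relative frequency of heads and the absolute difference between the counts of heads and tails as the number of fair-coin tosses grows.

# simulate fair-coin tosses and track the two plotted quantities
set.seed(0)
N <- 6e5
tosses <- rbinom(N, size = 1, prob = 0.5)    # 1 = head, 0 = tail
heads <- cumsum(tosses)
trials <- 1:N
rel.freq <- heads / trials                   # settles around 0.5
abs.diff <- abs(heads - (trials - heads))    # can keep growing
par(mfrow = c(1, 2))
plot(trials, rel.freq, type = "l", xlab = "Number of trials", ylab = "Relative frequency")
abline(h = 0.5, lty = 2)
plot(trials, abs.diff, type = "l", xlab = "Number of trials", ylab = "Absolute difference (no. heads and tails)")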


Some probabilistic notions

Definition 2 (Independent events). Two events E1 and E2 are independent if and only if

Prob{E1 ∩ E2} = Prob{E1} Prob{E2}   (1)

and we write E1 ⊥⊥ E2.

As an example of two independent events, think of two outcomes of a roulette wheel or of two coins tossed simultaneously.

Definition 3 (Conditional probability). If Prob{E1} > 0 then the conditional probability of E2 given E1 is

Prob{E2|E1} = Prob{E1 ∩ E2} / Prob{E1}   (2)


Exercises

1. Let E1 and E2 be two disjoint events with positive probability. Can they be independent?

2. Suppose that a fair die is rolled and that the number x appears. Let E1 be the event that the number x is even, E2 be the event that the number x is greater than or equal to 3, and E3 be the event that the number x is 4, 5 or 6. Are the events E1 and E2 independent? Are the events E1 and E3 independent?


Warnings

• For any fixed E1, the quantity Prob{·|E1} satisfies the axioms of probability. For instance, if E2, E3 and E4 are disjoint events we have that

Prob{E2 ∪ E3 ∪ E4|E1} = Prob{E2|E1} + Prob{E3|E1} + Prob{E4|E1}

• However, this does NOT generally hold for Prob{E1|·}, that is when we fix the term E1 on the left of the conditional bar. For two disjoint events E2 and E3, in general

Prob{E1|E2 ∪ E3} ≠ Prob{E1|E2} + Prob{E1|E3}

• It is generally NOT the case that Prob{E2|E1} = Prob{E1|E2}.

• The following properties hold: if E1 ⊂ E2 then

Prob{E2|E1} = Prob{E1}/Prob{E1} = 1   (E1 ⇒ E2)   (3)

Prob{E1|E2} = Prob{E1}/Prob{E2} ≥ Prob{E1}   (4)


Bayes' theorem

Let us consider a set of mutually exclusive and exhaustive events E1, E2, . . . , Ek, i.e. they form a partition of Ω.

Theorem 2 (Law of total probability). Let Prob{Ei}, i = 1, . . . , k denote the probabilities of the ith event Ei and Prob{E|Ei}, i = 1, . . . , k the conditional probability of a generic event E given that Ei has occurred. It can be shown that

Prob{E} = ∑_{i=1}^k Prob{E|Ei} Prob{Ei} = ∑_{i=1}^k Prob{E ∩ Ei}   (5)

Theorem 3 (Bayes' theorem). The conditional ("inverse") probability of any Ei, i = 1, . . . , k given that E has occurred is given by

Prob{Ei|E} = Prob{E|Ei} Prob{Ei} / ∑_{j=1}^k Prob{E|Ej} Prob{Ej} = Prob{E ∩ Ei} / Prob{E},   i = 1, . . . , k   (6)


Combined experiments

• So far we assumed that all the events belong to the same sample space.

• The most interesting use of probability concerns however combined random experiments whose sample space

Ω = Ω1 × Ω2 × · · · × Ωn

is the Cartesian product of several spaces Ωi, i = 1, . . . , n.

• For instance, if we want to study the probabilistic dependence between the height and the weight of a child, we have to define a joint sample space

Ω = {(w, h) : w ∈ Ωw, h ∈ Ωh}

made of all pairs (w, h), where Ωw is the sample space of the random experiment describing the weight and Ωh is the sample space of the random experiment describing the height.


Medical study

Let us consider a medical study about the relationship between the outcome of a medical test and the presence of a disease. We model this study as the combination of two random experiments:

1. the random experiment which models the state of the patient. Its sample space is Ωs = {H, S}, where H and S stand for healthy and sick patient, respectively.

2. the random experiment which models the outcome of the medical test. Its sample space is Ωo = {+, −}, where + and − stand for positive and negative outcome of the test, respectively.

Suppose that out of 1000 patients we observe the following counts, and the corresponding joint probabilities:

         Es = S   Es = H
Eo = +      9       99
Eo = −      1      891

⇒

         Es = S   Es = H
Eo = +    .009     .099
Eo = −    .001     .891

What is the probability of having a positive (negative) test outcome when the patient is sick (healthy)? What is the probability of being in front of a sick (healthy) patient when a positive (negative) outcome is obtained?


Medical study (II)

From the definition of conditional probability we derive

Prob{Eo = +|Es = S} = Prob{Eo = +, Es = S} / Prob{Es = S} = .009 / (.009 + .001) = .9

Prob{Eo = −|Es = H} = Prob{Eo = −, Es = H} / Prob{Es = H} = .891 / (.891 + .099) = .9

According to these figures, the test appears to be accurate. Do we have to expect a high probability of being sick when the test is positive? The answer is NO, as shown by

Prob{Es = S|Eo = +} = Prob{Eo = +, Es = S} / Prob{Eo = +} = .009 / (.009 + .099) ≈ .08

This example shows that sometimes humans tend to confound Prob{Es|Eo} with Prob{Eo|Es} and that the most intuitive response is not always the right one.
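As a complement (not part of the original slides), a short R sketch that reproduces these numbers from the joint probability table, with rows indexed by the test outcome and columns by the patient state:

# joint probabilities Prob{Eo, Es} of the medical study
J <- matrix(c(0.009, 0.099,
              0.001, 0.891),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("+", "-"), c("S", "H")))
J["+", "S"] / sum(J[, "S"])    # Prob{Eo = + | Es = S} = 0.9
J["-", "H"] / sum(J[, "H"])    # Prob{Eo = - | Es = H} = 0.9
J["+", "S"] / sum(J["+", ])    # Prob{Es = S | Eo = +} ≈ 0.08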


Array of joint/marginal probabilities

Let us consider the combination of two random experiments whose sample spaces are ΩA = {A1, · · · , An} and ΩB = {B1, · · · , Bm}, respectively. Assume that for each pair of events (Ai, Bj), i = 1, . . . , n, j = 1, . . . , m we know the joint probability value Prob{Ai, Bj}.

            B1              B2              · · ·   Bm              Marginal
A1          Prob{A1, B1}    Prob{A1, B2}    · · ·   Prob{A1, Bm}    Prob{A1}
A2          Prob{A2, B1}    Prob{A2, B2}    · · ·   Prob{A2, Bm}    Prob{A2}
...
An          Prob{An, B1}    Prob{An, B2}    · · ·   Prob{An, Bm}    Prob{An}
Marginal    Prob{B1}        Prob{B2}        · · ·   Prob{Bm}        Sum = 1

The joint probability array contains all the necessary information for computing all marginal and conditional probabilities.

Try to fill the table for the dependent and independent case.


Dependent/independent: example

Let us model the commute time to go back home for an ULB student living in St. Gilles as a random experiment. Suppose that its sample space is Ω = {LOW, MEDIUM, HIGH}. Consider also an (extremely :-) random experiment representing the weather in Brussels, whose sample space is Ω = {G = GOOD, B = BAD}. Suppose that the array of joint probabilities is

            G (in Bxl)      B (in Bxl)      Marginal
LOW         0.15            0.05            Prob{LOW} = 0.2
MEDIUM      0.1             0.4             Prob{MEDIUM} = 0.5
HIGH        0.05            0.25            Prob{HIGH} = 0.3
            Prob{G} = 0.3   Prob{B} = 0.7   Sum = 1

Is the commute time dependent on the weather in Bxl?

            G (in Rome)     B (in Rome)     Marginal
LOW         0.18            0.02            Prob{LOW} = 0.2
MEDIUM      0.45            0.05            Prob{MEDIUM} = 0.5
HIGH        0.27            0.03            Prob{HIGH} = 0.3
            Prob{G} = 0.9   Prob{B} = 0.1   Sum = 1

Is the commute time dependent on the weather in Rome?


Dependent/independent: example (II)

If Brussels' weather is good:

            LOW               MEDIUM           HIGH
Prob{·|G}   0.15/0.3 = 0.5    0.1/0.3 ≈ 0.33   0.05/0.3 ≈ 0.17

Else, if Brussels' weather is bad:

            LOW               MEDIUM           HIGH
Prob{·|B}   0.05/0.7 ≈ 0.07   0.4/0.7 ≈ 0.57   0.25/0.7 ≈ 0.36

The distribution of the commute time changes according to the value of Brussels' weather.


Dependent/independent: example (III)

If Rome's weather is good:

            LOW              MEDIUM           HIGH
Prob{·|G}   0.18/0.9 = 0.2   0.45/0.9 = 0.5   0.27/0.9 = 0.3

Else, if Rome's weather is bad:

            LOW              MEDIUM           HIGH
Prob{·|B}   0.02/0.1 = 0.2   0.05/0.1 = 0.5   0.03/0.1 = 0.3

The distribution of the commute time does NOT change according to the value of Rome's weather.
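A small R sketch (added here for illustration) performing the same check on both joint tables: it computes the conditional distributions of the commute time given the weather and compares the joint table with the product of its marginals.

# joint tables: rows = commute time, columns = weather (G, B)
bxl  <- matrix(c(0.15, 0.05, 0.10, 0.40, 0.05, 0.25), nrow = 3, byrow = TRUE,
               dimnames = list(c("LOW", "MEDIUM", "HIGH"), c("G", "B")))
rome <- matrix(c(0.18, 0.02, 0.45, 0.05, 0.27, 0.03), nrow = 3, byrow = TRUE,
               dimnames = list(c("LOW", "MEDIUM", "HIGH"), c("G", "B")))
cond <- function(J) sweep(J, 2, colSums(J), "/")   # Prob{time | weather}
cond(bxl)     # the two columns differ: dependence
cond(rome)    # the two columns coincide: independence
# equivalent check: is the joint the outer product of the marginals?
all.equal(rome, outer(rowSums(rome), colSums(rome)), check.attributes = FALSE)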


Marginal/conditional: example

Consider a probabilistic model of the day's weather based on three random descriptors (or features) where

1. the first represents the sky condition and takes value in the finite set {CLEAR, CLOUDY},

2. the second represents the barometer trend and takes value in the finite set {RISING, FALLING},

3. the third represents the humidity in the afternoon and takes value in {DRY, WET}.


Marginal/conditional: example (II)

Let the joint distribution be given by the table

E1 E2 E3 P(E1,E2,E3)

CLEAR RISING DRY 0.4

CLEAR RISING WET 0.07

CLEAR FALLING DRY 0.08

CLEAR FALLING WET 0.10

CLOUDY RISING DRY 0.09

CLOUDY RISING WET 0.11

CLOUDY FALLING DRY 0.03

CLOUDY FALLING WET 0.12

From the joint distribution we can calculate the marginal probabilities P(CLEAR, RISING) = 0.47 and P(CLOUDY) = 0.35 and the conditional value

P(DRY|CLEAR, RISING) = P(DRY, CLEAR, RISING) / P(CLEAR, RISING) = 0.40/0.47 ≈ 0.85
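The same computation can be scripted; the following R sketch (an addition, not from the slides) stores the joint distribution in a data frame and recovers the marginal and conditional values quoted above.

joint <- expand.grid(E1 = c("CLEAR", "CLOUDY"),
                     E2 = c("RISING", "FALLING"),
                     E3 = c("DRY", "WET"))
# probabilities in the row order generated by expand.grid (E1 varies fastest)
joint$P <- c(0.40, 0.09, 0.08, 0.03, 0.07, 0.11, 0.10, 0.12)
sum(joint$P)                                                       # 1
P.cr  <- sum(joint$P[joint$E1 == "CLEAR" & joint$E2 == "RISING"])  # P(CLEAR, RISING) = 0.47
P.cl  <- sum(joint$P[joint$E1 == "CLOUDY"])                        # P(CLOUDY) = 0.35
P.dcr <- joint$P[joint$E1 == "CLEAR" & joint$E2 == "RISING" & joint$E3 == "DRY"]
P.dcr / P.cr                                                       # 0.40/0.47 ≈ 0.85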


Random variables

• Machine learning and statistics are concerned with data. What then is the link between the notion of random experiment and data? The answer is provided by the concept of random variable.

• Consider a random experiment and the associated triple (Ω, {E}, Prob{·}). The outcome of an experiment need not be a number; for example, the outcome when a coin is tossed can be 'heads' or 'tails'. However, we often want to represent outcomes as numbers. A random variable is a function that associates a unique numerical value with every outcome of an experiment. The value of the random variable will vary from trial to trial as the experiment is repeated.

• Suppose that we have a mapping rule Ω → R such that we can associate to each experimental outcome ω a real value z(ω).

• We say that z is the value taken by the random variable z when the outcome of the random experiment is ω.

• Since there is a probability associated to each event E and we have a mapping from events to real values, a probability distribution can be associated to z.


Random variables

Definition 4. Given a random experiment (Ω, {E}, Prob{·}), a random variable z is the result of a mapping that assigns a number z to every outcome ω. This mapping must satisfy the following two conditions but is otherwise arbitrary:

• the set {z ≤ z} is an event for every z;

• the probabilities

Prob{z = ∞} = 0,   Prob{z = −∞} = 0

Given a random variable z ∈ Z and a subset I ⊂ Z, we define the inverse mapping

z⁻¹(I) = {ω ∈ Ω | z(ω) ∈ I}

where z⁻¹(I) ∈ {E} is an event, and let

Prob{z ∈ I} = Prob{z⁻¹(I)} = Prob{ω ∈ Ω | z(ω) ∈ I}


Probabilistic interpretation of uncertainty

• This course will assume that the variability of measurements can be represented by the probability formalism.

• A random variable is a numerical quantity, linked to some experiment involving some degree of randomness, that takes its value from some set of possible real values.

• Example: the experiment might be the rolling of two six-sided dice and the r.v. z might be the sum of the two numbers showing on the dice. In this case the set of possible values is {2, . . . , 12}.

• In the example of the commute time, the random experiment is a compact (and approximate) way of modeling the disparate set of causes which led to variability in the value of z.


Probability function of a discrete r.v.

The probability (mass) function of a discrete r.v. z is the combination of

1. the set Z of values that this r.v. can take (also called range or sample space),

2. the set of probabilities associated to each value of Z.

This means that we can attach to the random variable some specific mathematical function P(z) that gives for each z ∈ Z the probability that z assumes the value z:

Pz(z) = Prob{z = z}

This function must satisfy the two following conditions:

Pz(z) ≥ 0,   ∑_{z∈Z} Pz(z) = 1


Probability function of a discrete r.v. (II)

For a small number of possible values of z, the probability function can be presented in the form of a table. For example, if we plan to toss a fair coin twice, and the random variable z is the number of heads that eventually turn up, the probability function can be presented as follows:

Values of the random variable z     0      1      2
Associated probabilities            0.25   0.50   0.25


Parametric probability function

Suppose that

1. z is a discrete r.v. that takes its value in Z = {1, 2, 3},

2. the probability function of z is

Pz(z) = θ^(2z) / (θ² + θ⁴ + θ⁶)

where θ is some fixed nonzero real number.

Whatever the value of θ, Pz(z) ≥ 0 for z = 1, 2, 3 and Pz(1) + Pz(2) + Pz(3) = 1. Therefore z is a well-defined random variable, even if the value of θ is unknown. We call θ a parameter, that is some constant, usually unknown, involved in a probability function.


Expected value of a discrete r.v.

The expected value of a discrete random variable z is defined as

E[z] = µ = ∑_{z∈Z} z Pz(z)

• In English, "mean" is used as a synonym of "expected value".
• But the word "average" is NOT a synonym of "expected value".
• The expected value is not necessarily a value that belongs to Z.


Variance of a discrete r.v.

The variance of a discrete random variable z is defined as

Var[z] = σ² = E[(z − E[z])²] = ∑_{z∈Z} (z − E[z])² Pz(z)

• The variance is a measure of the dispersion of the probability function of the random variable around its mean.

Note that the following identity holds:

E[(z − E[z])²] ≡ E[z²] − (E[z])² = E[z²] − µ²
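A minimal R check, added for illustration, of these definitions on the two-coin example seen earlier, where z takes the values 0, 1, 2 with probabilities 0.25, 0.50, 0.25:

z <- c(0, 1, 2)                # values of the discrete r.v.
P <- c(0.25, 0.50, 0.25)       # associated probabilities
mu <- sum(z * P)               # E[z] = 1
sum((z - mu)^2 * P)            # Var[z] = 0.5
sum(z^2 * P) - mu^2            # same value via E[z^2] - (E[z])^2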


Examples of probability functions

[Figure: two discrete r.v. probability functions having the same mean but different variance.]


Std. deviation and moments of a discrete r.v.

The standard deviation of a discrete random variable z is defined as the positive square root of the variance:

Std[z] = √Var[z] = σ

Moment: for any positive integer r, the rth moment of the probability function is

µr = E[z^r] = ∑_z z^r Pz(z)

The skewness of a discrete random variable z is defined as

γ = E[(z − µ)³] / σ³

• Distributions with positive skewness have long tails to the right, and distributions with negative skewness have long tails to the left.


Joint probability

• Consider a probabilistic model described by n discrete random variables.

• A fully specified probabilistic model gives the joint probability function for every combination of the values of the n r.v.s.

• The model is specified by the values of the probabilities

Prob{z1 = z1, z2 = z2, . . . , zn = zn} = P(z1, z2, . . . , zn)

for every possible assignment of values z1, . . . , zn to the variables.


Independent variables

• Let x and y be two random variables.

• The two variables x and y are defined to be statistically independent if the joint probability Prob{x = x, y = y} = Prob{x = x} Prob{y = y}.

• If two variables x and y are independent, then the transformed r.v.s g(x) and h(y), where g and h are two given functions, are also independent.

• In qualitative terms, this means that we do not expect that the outcome of one variable will affect the other.

• Examples: think of two outcomes of a roulette wheel or of two coins tossed simultaneously.


Marginal and conditional probability

• Marginal probabilities for subsets of the n variables can be found by summing over all possible combinations of values for the other variables:

P(z1, . . . , zm) = ∑_{z_{m+1}} · · · ∑_{z_n} P(z1, . . . , zm, z_{m+1}, . . . , zn)

• Conditional probabilities for one subset of variables {xi : i ∈ S1}, given values for another disjoint subset {xj : j ∈ S2} where S1 ∩ S2 = ∅, are defined as ratios of marginal probabilities:

P(xi : i ∈ S1 | xj : j ∈ S2) = P(xi : i ∈ S1, xj : j ∈ S2) / P(xj : j ∈ S2)

• If x and y are independent then Px(x|y = y) = Px(x).

• Since independence is symmetric, this is equivalent to saying that Py(y|x = x) = Py(y).


Bernoulli trial

• A Bernoulli trial is a random experiment with two possible outcomes, often called "success" and "failure".

• The probability of success is denoted by p and the probability of failure by (1 − p).

• A Bernoulli random variable z is a discrete r.v. associated with the Bernoulli trial. It takes z = 0 with probability (1 − p) and z = 1 with probability p.

• The probability function of z can be written in the form

Prob{z = z} = Pz(z) = p^z (1 − p)^(1−z),   z = 0, 1

• Note that

E[z] = 1 · p + 0 · (1 − p) = p

and

Var[z] = p(1 − p)² + (1 − p)(0 − p)² = p(1 − p).


The Binomial distribution

• A binomial random variable is the number of successes in a fixed number N of independent Bernoulli trials with the same probability of success for each trial.

• Example: the number z of heads in N tosses of a coin.

• The probability function is given by

Prob{z = z} = Pz(z) = (N choose z) p^z (1 − p)^(N−z),   z = 0, 1, . . . , N

• The mean of the distribution is µ = Np and the variance is σ² = Np(1 − p).

• The Bernoulli distribution is a special case (N = 1) of the binomial distribution.

• For small p, the probability of having at least 1 success in N trials is proportional to N, as long as Np is small.
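A small R sketch (not in the original slides) illustrating these facts with dbinom, the binomial probability function available in base R:

N <- 10; p <- 0.3
z <- 0:N
P <- dbinom(z, size = N, prob = p)   # Prob{z = z}
sum(P)                               # 1
sum(z * P)                           # N*p = 3
sum((z - N * p)^2 * P)               # N*p*(1-p) = 2.1
# for small p, Prob{at least one success} = 1 - (1-p)^N ≈ N*p
p.small <- 0.001
1 - (1 - p.small)^N                  # ≈ 0.00996
N * p.small                          # 0.01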


Mean and variances of the distributions

Distribution    Mean          Variance
Bernoulli       p             p(1 − p)
Binomial        Np            Np(1 − p)
Geometric       p/(1 − p)     p/(1 − p)²
Poisson         λ             λ


Continuous random variable

• Continuous random variables take their value in some continuous range of values. Consider a real random variable z whose range is the set of real numbers. The following quantities can be defined:

Definition 5. The (cumulative) distribution function of z is the function

Fz(z) = Prob{z ≤ z}   (7)

Definition 6. The density function of a real random variable z is the derivative of the distribution function:

pz(z) = dFz(z)/dz   (8)

• Probabilities of continuous r.v.s are not allocated to specific values but rather to intervals of values. Specifically,

Prob{a < z < b} = ∫_a^b pz(z) dz,   ∫_Z pz(z) dz = 1


Mean, variance, ... of a continuous r.v.

Consider a continuous scalar r.v. having range (l, h) and density function p(z). We can define the

Expectation (mean):

µ = ∫_l^h z p(z) dz

Variance:

σ² = ∫_l^h (z − µ)² p(z) dz

Other quantities of interest are the moments:

µr = E[z^r] = ∫_l^h z^r p(z) dz

The moment of order r = 1 is the mean of z.


Chebyshev's inequality

• Let z be a generic random variable, discrete or continuous, having mean µ and variance σ².

• Chebyshev's inequality states that for any positive constant d

Prob{|z − µ| ≥ d} ≤ σ²/d²

• An experimental validation of Chebyshev's inequality can be found in the R script cheby.R (a minimal sketch of such a check follows).
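The script cheby.R is not reproduced here; what follows is only a minimal sketch of the kind of empirical check it could perform, using an exponential r.v. (mean 1, variance 1) as an example.

set.seed(0)
N <- 1e6
z <- rexp(N, rate = 1)     # mean mu = 1, variance sigma^2 = 1
mu <- 1; sigma2 <- 1
for (d in c(1, 2, 3)) {
  emp <- mean(abs(z - mu) >= d)                      # empirical Prob{|z - mu| >= d}
  cat("d =", d, ": empirical", emp, "<= bound", sigma2 / d^2, "\n")
}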


Uniform distribution

A random variable z is said to be uniformly distributed on the interval (a, b) (also z ∼ U(a, b)) if its probability density function is given by

p(z) = 1/(b − a) if a < z < b,   0 otherwise

[Figure: the density is constant at height 1/(b − a) between a and b.]

TP: compute the variance of U(a, b).
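As a hint for the TP, here is a small R sketch (an addition) that checks the result numerically for one choice of (a, b); the value it should match is the standard analytic variance (b − a)²/12.

a <- 2; b <- 5
mu <- (a + b) / 2
# numerical integration of the variance of U(a, b)
integrate(function(z) (z - mu)^2 / (b - a), lower = a, upper = b)$value   # 0.75
(b - a)^2 / 12                                                            # 0.75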


Normal distribution: the scalar case

A continuous scalar random variable x is said to be normally distributed with parameters µ and σ² (also x ∼ N(µ, σ²)) if its probability density function is given by

px(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²))

• The mean of x is µ; the variance of x is σ².

• The coefficient in front of the exponential ensures that ∫ p(x)dx = 1.

• The probability that an observation x from a normal r.v. is within 2 standard deviations from the mean is about 0.95.

• If µ = 0 and σ² = 1 the distribution is called standard normal. We will denote its distribution function Fz(z) = Φ(z).

• Given a normal r.v. x ∼ N(µ, σ²), the r.v. z = (x − µ)/σ has a standard normal distribution.


Standard distribution


Important relations

If x ∼ N(µ, σ²):

Prob{µ − σ ≤ x ≤ µ + σ} ≈ 0.683
Prob{µ − 1.282σ ≤ x ≤ µ + 1.282σ} ≈ 0.8
Prob{µ − 1.645σ ≤ x ≤ µ + 1.645σ} ≈ 0.9
Prob{µ − 1.96σ ≤ x ≤ µ + 1.96σ} ≈ 0.95
Prob{µ − 2σ ≤ x ≤ µ + 2σ} ≈ 0.954
Prob{µ − 2.57σ ≤ x ≤ µ + 2.57σ} ≈ 0.99
Prob{µ − 3σ ≤ x ≤ µ + 3σ} ≈ 0.997

Test these relations yourself by random sampling and simulation using R!


Test by using R

> N <- 1000000
> mu <- 1
> sigma <- 2
> DN <- rnorm(N, mu, sigma)
> # empirical coverage of the interval [mu - sigma, mu + sigma]
> print(sum(DN <= (mu + sigma) & DN >= (mu - sigma)) / N)
[1] 0.682668
> # empirical coverage of the interval [mu - 1.645*sigma, mu + 1.645*sigma]
> print(sum(DN <= (mu + 1.645 * sigma) & DN >= (mu - 1.645 * sigma)) / N)
[1] 0.899851
> qnorm(0.9)
[1] 1.281552
> qnorm(0.95)
[1] 1.644854
> qnorm(0.995)
[1] 2.575829


Upper critical points

Definition 7. The upper critical point of a continuous r.v. x is the smallest number x_α such that

1 − α = Prob{x ≤ x_α} = F(x_α) ⇔ x_α = F⁻¹(1 − α)

• We will denote with z_α the upper critical points of the standard normal density:

1 − α = Prob{z ≤ z_α} = F(z_α) = Φ(z_α)

• Note that the following relations hold:

z_α = Φ⁻¹(1 − α),   z_{1−α} = −z_α,   1 − α = (1/√(2π)) ∫_{−∞}^{z_α} e^(−z²/2) dz

• Here we list the most commonly used values of z_α:

α     0.1     0.075   0.05    0.025   0.01    0.005   0.001   0.0005
z_α   1.282   1.440   1.645   1.960   2.326   2.576   3.090   3.291

• In R, z_α = qnorm(1-alpha).


Bivariate discrete probability distribution

• Let us consider two discrete r.v.'s x ∈ X and y ∈ Y and their joint probability function Px,y(x, y).

• We define as marginal probability the quantity

Px(x) = ∑_{y∈Y} Px,y(x, y)

• We define as conditional probability the quantity

Py|x(y|x) = Px,y(x, y) / Px(x)

which is the probability that y takes the value y once x = x.

• If x and y are independent,

Px,y(x, y) = Px(x) Py(y),   P(y|x) = Py(y)


Bivariate continuous probability distribution

• Let us consider two continuous r.v.'s x and y and their joint density function px,y(x, y).

• We define as marginal density the quantity

px(x) = ∫_{−∞}^{∞} px,y(x, y) dy

• We define as conditional density the quantity

py|x(y|x) = px,y(x, y) / px(x)

which is, in loose terms, the probability that y is in dy about y assuming x = x.

• If x and y are independent,

px,y(x, y) = px(x) py(y),   p(y|x) = py(y)


Bivariate probability distribution (II)

Definition 8 (Conditional expectation). The conditional expectation of y given x = x is

E[y|x = x] = ∫ y py|x(y|x) dy = µ_{y|x}(x)   (9)

Definition 9 (Conditional variance).

Var[y|x = x] = ∫ (y − µ_{y|x}(x))² py|x(y|x) dy   (10)

Note that both these quantities are functions of x. If x is a random variable, it follows that the conditional expectation and the conditional variance are random, too.

Theorem 4. For two r.v.s x and y, assuming the expectations exist, we have that

Ex[Ey[y|x]] = Ey[y]   (11)

and

Var[y] = Ex[Var[y|x]] + Var[Ey[y|x]]   (12)

where Var[y|x] and Ey[y|x] are functions of x.


Bivariate discrete example

          y = 3            y = 10            y = 20            Marginal
x = 10    0.05             0.05              0.15              P(x = 10) = 0.25
x = 20    0.1              0.2               0.10              P(x = 20) = 0.4
x = 30    0.15             0.15              0.05              P(x = 30) = 0.35
          P(y = 3) = 0.3   P(y = 10) = 0.4   P(y = 20) = 0.3   Sum = 1

E[x] = 21,  Var[x] = 59,  E[y] = 10.9,  Var[y] = 43.89

E[y|x = 10] = 0.05 · 3/0.25 + 0.05 · 10/0.25 + 0.15 · 20/0.25 = 14.6
E[y|x = 20] = 0.1 · 3/0.4 + 0.2 · 10/0.4 + 0.1 · 20/0.4 = 10.75
E[y|x = 30] = 0.15 · 3/0.35 + 0.15 · 10/0.35 + 0.05 · 20/0.35 ≈ 8.43

E[y] = E[y|x = 10] P(x = 10) + E[y|x = 20] P(x = 20) + E[y|x = 30] P(x = 30)
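An R sketch (added for illustration) that recomputes these quantities from the joint table and verifies the relation E[y] = Ex[Ey[y|x]] of Theorem 4:

x <- c(10, 20, 30); y <- c(3, 10, 20)
J <- matrix(c(0.05, 0.05, 0.15,
              0.10, 0.20, 0.10,
              0.15, 0.15, 0.05), nrow = 3, byrow = TRUE)
Px <- rowSums(J); Py <- colSums(J)
sum(x * Px)                             # E[x] = 21
sum(y * Py)                             # E[y] = 10.9
Ey.x <- as.vector(J %*% y) / Px         # 14.6, 10.75, 8.43
sum(Ey.x * Px)                          # 10.9 = E[y]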


Bivariate continuous random experiment

[Figure: input/output training set plotted in the (x, y) plane.]


Bivariate density distribution


Linear combinations

• The expected value of a linear combination of r.v.s is simply the linear combination of their respective expected values:

E[ax + by] = aE[x] + bE[y]

i.e., expectation is a linear statistic.

• Since the variance is not a linear statistic, we have

Var[ax + by] = a²Var[x] + b²Var[y] + 2ab (E[xy] − E[x]E[y]) = a²Var[x] + b²Var[y] + 2ab Cov[x, y]

where

Cov[x, y] = E[(x − E[x])(y − E[y])] = E[xy] − E[x]E[y]

is called the covariance.

• The covariance measures whether the two variables vary simultaneously in the same way around their averages.


Covariance example

          y = 3            y = 10            y = 20            Marginal
x = 10    0.05             0.05              0.15              P(x = 10) = 0.25
x = 20    0.1              0.2               0.10              P(x = 20) = 0.4
x = 30    0.15             0.15              0.05              P(x = 30) = 0.35
          P(y = 3) = 0.3   P(y = 10) = 0.4   P(y = 20) = 0.3   Sum = 1

E[x] = 21,  Var[x] = 59,  E[y] = 10.9,  Var[y] = 43.89

xy            30     60    90     100    200        300    400    600
P(xy = xy)    0.05   0.1   0.15   0.05   0.15+0.2   0.15   0.10   0.05

E[xy] = 211 ⇒ Cov[x, y] = 211 − 21 · 10.9 = −17.9,  ρ(x, y) ≈ −0.35
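The same numbers can be obtained directly from the joint table; a minimal R sketch (not part of the slides):

x <- c(10, 20, 30); y <- c(3, 10, 20)
J <- matrix(c(0.05, 0.05, 0.15,
              0.10, 0.20, 0.10,
              0.15, 0.15, 0.05), nrow = 3, byrow = TRUE)
Ex <- sum(x * rowSums(J)); Ey <- sum(y * colSums(J))
Exy <- sum(outer(x, y) * J)             # E[xy] = 211
Cv  <- Exy - Ex * Ey                    # Cov[x, y] = -17.9
Vx  <- sum(x^2 * rowSums(J)) - Ex^2     # 59
Vy  <- sum(y^2 * colSums(J)) - Ey^2     # 43.89
Cv / sqrt(Vx * Vy)                      # rho ≈ -0.35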


Correlation

• The correlation coefficient is

ρ(x, y) = Cov[x, y] / √(Var[x] Var[y])

It is easily shown that −1 ≤ ρ(x, y) ≤ 1.

• Two r.v.s are called uncorrelated if

E[xy] = E[x]E[y]

• If x and y are two independent random variables then Cov[x, y] = 0 or, equivalently, E[xy] = E[x]E[y].

• If x and y are two independent random variables then also Cov[g(x), h(y)] = 0 or, equivalently, E[g(x)h(y)] = E[g(x)]E[h(y)].

• Independence ⇒ uncorrelation, but not vice versa for a generic distribution.

• Independence ⇔ uncorrelation if x and y are jointly Gaussian.


Correlation and causation

• The existence of a correlation different from zero does not necessarily mean that there is a causality relationship.

• Think of these examples:
  • salaries of professors and alcohol consumption;
  • the amount of Coke drunk per day and sport performance;
  • sleeping with shoes on and headache;
  • the number of firemen and the gravity of the disaster;
  • taking an expensive drug and cancer risk.

• According to Tufte, "Empirically observed covariation is a necessary but not sufficient condition for causality."


Linear combination of independent vars

• If the random variables x and y are independent, then

Var[x + y] = Var[x] + Var[y]

and

Var[ax + by] = a²Var[x] + b²Var[y]

• In general, if the random variables x1, x2, . . . , xk are independent, then

Var[∑_{i=1}^k ci xi] = ∑_{i=1}^k ci² σi²


TP

1. Let x and y be two discrete independent r.v.s such that

Px(−1) = 0.1, Px(0) = 0.8, Px(1) = 0.1

and

Py(1) = 0.1, Py(2) = 0.8, Py(3) = 0.1

If z = x + y, show that E[z] = E[x] + E[y].

2. Let x be a discrete r.v. which assumes the values −1, 0, 1, each with probability 1/3, and let y = x².

(a) Let z = x + y. Show that E[z] = E[x] + E[y].

(b) Demonstrate that x and y are uncorrelated but dependent random variables.


The sum of i.i.d. random variables

Suppose that z1, z2, . . . , zN are i.i.d. (identically and independently distributed) random variables, discrete or continuous, each having a probability distribution with mean µ and variance σ². Let us consider the two derived r.v.s, that is the sum

SN = z1 + z2 + · · · + zN

and the average

z̄ = (z1 + z2 + · · · + zN) / N

The following relations hold:

E[SN] = Nµ,   Var[SN] = Nσ²

E[z̄] = µ,   Var[z̄] = σ²/N

See the R script sum_rv.R.
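The script sum_rv.R is not reproduced here; the following is only a sketch of the kind of simulation it could contain, using U(0, 1) terms as an example.

set.seed(0)
N <- 20; R <- 1e5                 # N terms per sum, R replications
mu <- 0.5; sigma2 <- 1 / 12       # mean and variance of U(0, 1)
S <- replicate(R, sum(runif(N)))
mean(S); var(S)                   # ≈ N*mu = 10 and N*sigma2 ≈ 1.67
mean(S / N); var(S / N)           # ≈ mu = 0.5 and sigma2/N ≈ 0.0042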


Normal distribution: the multivariate case

Let z be a random vector (n × 1). The vector is said to be normally distributed with parameters µ (n × 1) and Σ (n × n) (also z ∼ N(µ, Σ)) if its probability density function is given by

pz(z) = 1 / ((√(2π))^n √det(Σ)) · exp( −(1/2) (z − µ)^T Σ⁻¹ (z − µ) )

It follows that

• the mean E[z] = µ = [µ1, . . . , µn]^T is an [n, 1]-dimensional vector, where µi = E[zi], i = 1, . . . , n,

• the [n, n] matrix

Σ = E[(z − µ)(z − µ)^T] =
    [ σ1²   σ12   . . .   σ1n ]
    [  ⋮             ⋱     ⋮  ]
    [ σ1n   σ2n   . . .   σn² ]

is the covariance matrix, where σi² = Var[zi] and σij = Cov[zi, zj]. This matrix is square and symmetric. It has n(n + 1)/2 parameters.


Normal multivariate distribution (II)

The quantity

∆² = (z − µ)^T Σ⁻¹ (z − µ)

which appears in the exponent of pz defines the Mahalanobis distance ∆ from z to µ. It can be shown that the surfaces of constant probability density

• are hyperellipsoids on which ∆² is constant;

• have principal axes given by the eigenvectors ui, i = 1, . . . , n of Σ, which satisfy

Σ ui = λi ui,   i = 1, . . . , n

where λi are the corresponding eigenvalues;

• the eigenvalues λi give the variances along the principal directions.


Normal multivariate distribution (III)

If the covariance matrix Σ is diagonal then

• the contours of constant density are hyperellipsoids with the principal directions aligned with the coordinate axes;

• the components of z are then statistically independent, since the distribution of z can be written as the product of the distributions for each of the components separately, in the form

pz(z) = ∏_{i=1}^n p(zi)

• the total number of independent parameters in the distribution is 2n;

• if σi = σ for all i, the contours of constant density are hyperspheres.


Bivariate normal distribution

Consider a bivariate normal density whose mean is µ = [µ1, µ2]^T and whose covariance matrix is

Σ = [ σ1²   σ12
      σ21   σ2² ]

The correlation coefficient is

ρ = σ12 / (σ1 σ2)

It can be shown that the general bivariate normal density has the form

p(z1, z2) = 1 / (2π σ1 σ2 √(1 − ρ²)) · exp{ −1/(2(1 − ρ²)) [ ((z1 − µ1)/σ1)² − 2ρ ((z1 − µ1)/σ1)((z2 − µ2)/σ2) + ((z2 − µ2)/σ2)² ] }


Bivariate normal distribution

Let Σ = [1.2919, 0.4546; 0.4546, 1.7081].

[Figure: surface plot of the bivariate normal density p(z1, z2) over the (z1, z2) plane.]
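A hedged R sketch of how such a density can be explored by sampling; it assumes the MASS package (not mentioned in the slides) for the sampler mvrnorm.

library(MASS)                         # provides mvrnorm (assumed installed)
Sigma <- matrix(c(1.2919, 0.4546, 0.4546, 1.7081), 2, 2)
set.seed(0)
Z <- mvrnorm(n = 1e5, mu = c(0, 0), Sigma = Sigma)
cov(Z)                    # ≈ Sigma
cor(Z)[1, 2]              # ≈ 0.4546 / sqrt(1.2919 * 1.7081)
eigen(Sigma)$vectors      # principal axes of the constant-density ellipses
eigen(Sigma)$values       # variances along the principal directions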


Bivariate normal distribution (prj)

[Figure: projection of the bivariate normal density on the (z1, z2) plane, with principal axes u1, u2 and the corresponding eigenvalues λ1, λ2 of Σ.]

See the R script s_gaussXYZ.R.


Marginal and conditional distributions

One of the important properties of the multivariate normal density is that all conditional and marginal probabilities are also normal. Using the relation

p(z2|z1) = p(z1, z2) / p(z1)

we find that p(z2|z1) is a normal distribution N(µ_{2|1}, σ²_{2|1}), where

µ_{2|1} = µ2 + ρ (σ2/σ1)(z1 − µ1)

σ²_{2|1} = σ2² (1 − ρ²)

Note that

• µ_{2|1} is a linear function of z1: if the correlation coefficient ρ is positive, the larger z1, the larger µ_{2|1};

• if there is no correlation between z1 and z2, we can ignore the value of z1 to estimate µ2.


[Figure: bivariate normal example in the (z1, z2) plane, with the marginal densities p(z1) and p(z2) and the conditional density p(z2|z1 = 0); rho = 0.905, Var[z2] = 1.05, Var[z2|z1] = 0.19.]


The central limit theorem

Theorem 5. Assume that z1, z2, . . . , zN are i.i.d. random variables, discrete or continuous, each having a probability distribution with finite mean µ and finite variance σ². As N → ∞, the standardized random variable

(z̄ − µ)√N / σ

which is identical to

(SN − Nµ) / (σ√N)

converges in distribution to a r.v. having the standardized normal distribution N(0, 1).

• This result holds regardless of the common distribution of the zi.

• This theorem justifies the importance of the normal distribution, since many r.v.s of interest are either sums or averages.

• See R script central.R.
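The script central.R is not reproduced here; a minimal sketch of a simulation illustrating the theorem with exponential (hence markedly non-normal) terms:

set.seed(0)
N <- 100; R <- 1e4
mu <- 1; sigma <- 1                          # mean and sd of Exp(1)
S <- replicate(R, sum(rexp(N, rate = 1)))
zstd <- (S - N * mu) / (sigma * sqrt(N))     # standardized sum
mean(zstd); sd(zstd)                         # ≈ 0 and 1
hist(zstd, freq = FALSE, breaks = 50)
curve(dnorm(x), add = TRUE)                  # compare with the N(0, 1) density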


The chi-squared distribution

For a positive integer N, a r.v. z has a χ²_N distribution if

z = x1² + · · · + xN²

where x1, x2, . . . , xN are i.i.d. random variables N(0, 1).

• The probability distribution is a gamma distribution with parameters (N/2, 1/2).

• E[z] = N and Var[z] = 2N.

• The distribution is called "a chi-squared distribution with N degrees of freedom".


The chi-squared distribution (II)

[Figure: density (left) and cumulative distribution (right) of the χ²_N distribution for N = 10.]

R script chisq.R.


Student's t-distribution

If x ∼ N(0, 1) and y ∼ χ²_N are independent, then the Student's t-distribution with N degrees of freedom is the distribution of the r.v.

z = x / √(y/N)

We denote this with z ∼ T_N.


Student's t-distribution

[Figure: density (left) and cumulative distribution (right) of the Student t-distribution for N = 10.]

R script s_stu.R.


Notation

• In order to clarify the distinction between random variables and their values, we will use the boldface notation for denoting a random variable (e.g. z) and the normal face notation for the eventually observed value (e.g. z = 11).

• The notation Pz(z) denotes the probability that the random variable z takes the value z. The suffix indicates that the probability relates to the random variable z. This is necessary since we often discuss probabilities associated with several random variables simultaneously.

• Example: z could be the age of a student before asking, and z = 22 could be its value after the observation.


Notation (II)

• In general terms, we will denote as the probability distribution of a random variable z any complete description of the probabilistic behavior of z.

• For example, if z is continuous, the density function p(z) or the distribution function could be examples of a probability distribution.

• Given a probability distribution Fz(z), the notation

F → {z1, z2, . . . , zN}

means that the dataset DN = {z1, z2, . . . , zN} is an i.i.d. random sample observed from the probability distribution Fz(·).
