TRANSCRIPT
Statistical foundations of machine learning
INFO-F-422
Gianluca Bontempi
Département d’Informatique
Boulevard de Triomphe - CP 212
http://www.ulb.ac.be/di
Apprentissage automatique – p. 1/78
Random experiment
We define as a random experiment any action or process that generates results or observations which cannot be predicted with certainty. Examples:
• tossing a coin,
• rolling dice,
• measuring the time to reach home
Probability space
• A random experiment is characterized by a sample space Ω, the set of all the possible outcomes ω of the experiment. This space can be either finite or infinite. For example, in the die experiment, Ω = {ω1, ω2, . . . , ω6}.
• The elements of the set Ω are called experimental outcomes. The outcome of an experiment need not be a number; for example, the outcome when a coin is tossed can be 'heads' or 'tails'.
• A subset of experimental outcomes is called an event. An example of an event is the set of even values E = {ω2, ω4, ω6}.
• A single execution of a random experiment is called a trial. At each trial we observe one outcome ωi. We say that an event occurred during this trial if it contains the element ωi. For example, in the die experiment, if we observe the outcome ω4 the event even took place.
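The notions above (sample space, event, trial) can be sketched in a few lines. This is an illustrative Python sketch, not part of the course's own R material; the encoding of outcomes as the integers 1..6 and the seed are arbitrary choices.

```python
import random

# A minimal sketch of the die experiment: outcomes as numbers 1..6,
# the event "even value" as a subset, a trial as one random draw.
random.seed(0)
omega = [1, 2, 3, 4, 5, 6]          # sample space
even = {2, 4, 6}                    # the event "even value"

outcome = random.choice(omega)      # one trial
occurred = outcome in even          # did the event take place?
print(outcome, occurred)
```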
Events and set theory
Since events E are subsets, we can apply to them the terminology of set theory:
• Ec = {ω ∈ Ω : ω ∉ E} denotes the complement of E.
• E1 ∪ E2 = {ω ∈ Ω : ω ∈ E1 OR ω ∈ E2} refers to the event that occurs when E1 or E2 or both occur.
• E1 ∩ E2 = {ω ∈ Ω : ω ∈ E1 AND ω ∈ E2} refers to the event that occurs when both E1 and E2 occur.
• Two events E1 and E2 are mutually exclusive or disjoint if
E1 ∩ E2 = ∅
that is, each time that E1 occurs, E2 does not occur.
• A partition of Ω is a set of disjoint sets Ej, j = 1, . . . , n such that
∪_{j=1}^n Ej = Ω
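The set-theoretic operations on events map directly onto set operations in code. A minimal Python sketch (illustrative only, using the die sample space from the earlier example):

```python
# Events as Python sets over the die sample space: complement, union,
# intersection, disjointness and partition all map onto set operations.
Omega = {1, 2, 3, 4, 5, 6}
E1 = {2, 4, 6}                        # even values
E2 = {1, 3, 5}                        # odd values

complement_E1 = Omega - E1            # E1^c
union = E1 | E2                       # E1 ∪ E2
intersection = E1 & E2                # E1 ∩ E2

disjoint = (intersection == set())            # mutually exclusive?
is_partition = disjoint and (union == Omega)  # do E1, E2 partition Ω?
print(complement_E1, disjoint, is_partition)
```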
Class of events
• The class E of events is not an arbitrary collection of subsets of Ω.
• We require that if E1 and E2 are events, then the intersection E1 ∩ E2 and the union E1 ∪ E2 are also events.
• We do so because we will want to know not only the probabilities of various events, but also the probabilities of their unions and intersections.
• In the following we will consider only events that, in mathematical terms, form a Borel field.
Axiomatic definition of probability
• Let Ω be the certain event that occurs in every trial and let E1 + E2 denote the event that occurs when E1 or E2 or both occur.
• The axiomatic approach to probability consists in assigning to each event E a number Prob{E}, called the probability of the event E. This number is chosen so as to satisfy the following three conditions:
1. Prob{E} ≥ 0 for any E.
2. Prob{Ω} = 1
3. Prob{E1 + E2} = Prob{E1} + Prob{E2} if E1 ∩ E2 = ∅, that is, if E1 and E2 are mutually exclusive (or disjoint).
• These conditions are the axioms of the theory of probability (Kolmogorov, 1933).
• It follows that
Prob{E} = Σ_{ω∈E} Prob{ω}
• All probabilistic conclusions are based directly or indirectly on the axioms and only the axioms.
• But how do we define these probability numbers?
Symmetrical definition of probability
Consider a random experiment where
• the sample space is made of N symmetric outcomes, i.e. we have no reason to expect or prefer one over the other,
• the number of outcomes which are favorable to the event E (i.e. if they occur then the event E takes place) is NE.
Then according to the principle of indifference (a term popularized by J.M. Keynes in 1921) we have
Prob{E} = NE / N
Note that this number is determined without any experimentation and is based on symmetry assumptions. But what happens if symmetry does not hold?
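The symmetric definition is just counting. A small Python sketch for a fair die (the events chosen, "even" and "5 or 6", are arbitrary examples):

```python
from fractions import Fraction

# Symmetrical (classical) probability for a fair die: N = 6 symmetric
# outcomes, N_E of which are favorable to the event of interest.
N = 6
N_even = 3                          # outcomes favorable to "even"
N_at_least_5 = 2                    # outcomes favorable to "5 or 6"

prob_even = Fraction(N_even, N)
prob_at_least_5 = Fraction(N_at_least_5, N)
print(prob_even, prob_at_least_5)   # 1/2 1/3
```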
Frequentist definition of probability
• Let us consider a random experiment and an event E.
• Let us repeat the experiment N times and count the number of times NE that the event E occurs. The quantity NE/N is the relative frequency of E.
• It can be observed that the frequency converges to a fixed value for increasing N.
• This observation led Von Mises to use the notion of frequency as a foundation for the notion of probability.
Frequentist definition of probability
• It is based on the following definition (Von Mises):
Definition 1. The probability Prob{E} of an event E is the limit
Prob{E} = lim_{N→∞} NE/N
where N is the number of observations (trials) and NE is the number of times that E occurred.
• This definition appears reasonable and is compatible with the axioms.
• According to this approach, the probability is not a property of a specific observation, but rather a property of the entire set of observations.
• In practice, in any physical experiment N is finite and the limit has to be accepted as a hypothesis, not as a number that can be determined experimentally.
• However, the frequentist interpretation is very important to show the links between theory and application.
Weak law of large numbers
• A link between the axiomatic and the frequentist approach is provided by the weak law of large numbers.
• It was shown by Bernoulli that
Theorem 1. For any ε > 0,
Prob{|NE/N − p| ≤ ε} → 1 as N → ∞
• According to this theorem, the ratio NE/N is close to p in the sense that, for any ε > 0, the probability that |NE/N − p| ≤ ε tends to 1 as N → ∞.
• The weak law essentially states that for any nonzero margin, no matter how small, with a sufficiently large sample there will be a very high probability that the frequency will be close to the probability, that is, within the margin.
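The concentration of the relative frequency around p can be observed by simulation. A minimal Python sketch for a fair coin (the sample sizes and the seed are arbitrary choices; the course itself uses R for such experiments):

```python
import random

# Simulating the weak law of large numbers for a fair coin (p = 0.5):
# the relative frequency N_E/N concentrates around p as N grows.
random.seed(42)
p = 0.5
freqs = {}
for N in (100, 10000, 1000000):
    N_E = sum(random.random() < p for _ in range(N))
    freqs[N] = N_E / N
print(freqs)
```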
Probability and frequency
Note that the law of large numbers does NOT mean that NE will be close to N · p. In fact,
Prob{NE = Np} ≈ 1/√(2πNp(1 − p)) → 0, as N → ∞
For instance, in a fair coin-tossing game this law does not imply that the absolute difference between the number of heads and tails should oscillate close to zero. On the contrary, it could happen that the absolute difference keeps growing (though more slowly than the number of tosses).
Probability and frequency
[Figure: left panel, relative frequency vs. number of trials (up to 6·10^5); right panel, absolute difference between the number of heads and tails vs. number of trials.]
Some probabilistic notions
Definition 2 (Independent events). Two events E1 and E2 are independent if and only if
Prob{E1 ∩ E2} = Prob{E1} Prob{E2} (1)
and we write E1 ⊥⊥ E2.
As an example of two independent events, think of two outcomes of a roulette wheel or of two coins tossed simultaneously.
Definition 3 (Conditional probability). If Prob{E1} > 0 then the conditional probability of E2 given E1 is
Prob{E2|E1} = Prob{E1 ∩ E2} / Prob{E1} (2)
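Both definitions can be checked by enumeration on a finite sample space. An illustrative Python sketch with two fair dice (the two events chosen here are arbitrary examples, and they happen to be independent):

```python
from fractions import Fraction
from itertools import product

# Conditional probability by enumeration on two fair dice:
# Prob{E2|E1} = Prob{E1 ∩ E2} / Prob{E1}.
outcomes = list(product(range(1, 7), repeat=2))   # 36 equally likely pairs
E1 = {o for o in outcomes if o[0] + o[1] == 7}    # "the sum is 7"
E2 = {o for o in outcomes if o[0] == 1}           # "the first die shows 1"

p_E1 = Fraction(len(E1), 36)
p_E2 = Fraction(len(E2), 36)
p_joint = Fraction(len(E1 & E2), 36)
p_E2_given_E1 = p_joint / p_E1

# Here the joint probability factorizes, so E1 and E2 are independent.
print(p_E1, p_E2_given_E1, p_joint == p_E1 * p_E2)
```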
Exercises
1. Let E1 and E2 be two disjoint events with positive probability. Can they be independent?
2. Suppose that a fair die is rolled and that the number x appears. Let E1 be the event that the number x is even, E2 be the event that the number x is greater than or equal to 3, and E3 be the event that the number x is a 4, 5 or 6. Are the events E1 and E2 independent? Are the events E1 and E3 independent?
Warnings
• For any fixed E1, the quantity Prob{·|E1} satisfies the axioms of probability. For instance if E2, E3 and E4 are disjoint events we have that
Prob{E2 ∪ E3 ∪ E4|E1} = Prob{E2|E1} + Prob{E3|E1} + Prob{E4|E1}
• However this does NOT generally hold for Prob{E1|·}, that is, when we fix the term E1 on the left of the conditional bar. For two disjoint events E2 and E3, in general
Prob{E1|E2 ∪ E3} ≠ Prob{E1|E2} + Prob{E1|E3}
• It is generally NOT the case that Prob{E2|E1} = Prob{E1|E2}.
• The following properties hold:
If E1 ⊂ E2 ⇒ Prob{E2|E1} = Prob{E1}/Prob{E1} = 1 (E1 ⇒ E2) (3)
⇒ Prob{E1|E2} = Prob{E1}/Prob{E2} ≥ Prob{E1} (4)
Bayes’ theorem
Let us consider a set of mutually exclusive and exhaustive events E1, E2, ..., Ek, i.e. they form a partition of Ω.
Theorem 2 (Law of total probability). Let Prob{Ei}, i = 1, . . . , k denote the probabilities of the ith event Ei and Prob{E|Ei}, i = 1, . . . , k the conditional probability of a generic event E given that Ei has occurred. It can be shown that
Prob{E} = Σ_{i=1}^k Prob{E|Ei} Prob{Ei} = Σ_{i=1}^k Prob{E ∩ Ei} (5)
Theorem 3 (Bayes’ theorem). The conditional (“inverse”) probability of any Ei, i = 1, . . . , k given that E has occurred is given by
Prob{Ei|E} = Prob{E|Ei} Prob{Ei} / Σ_{j=1}^k Prob{E|Ej} Prob{Ej} = Prob{E ∩ Ei} / Prob{E},  i = 1, . . . , k (6)
Combined experiments
• So far we assumed that all the events belong to the same sample space.
• The most interesting use of probability concerns however combined random experiments whose sample space
Ω = Ω1 × Ω2 × · · · × Ωn
is the Cartesian product of several spaces Ωi, i = 1, . . . , n.
• For instance if we want to study the probabilistic dependence between the height and the weight of a child we have to define a joint sample space
Ω = {(w, h) : w ∈ Ωw, h ∈ Ωh}
made of all pairs (w, h), where Ωw is the sample space of the random experiment describing the weight and Ωh is the sample space of the random experiment describing the height.
Medical study
Let us consider a medical study about the relationship between the outcome of a medical test and the presence of a disease. We model this study as the combination of two random experiments:
1. the random experiment which models the state of the patient. Its sample space is Ωs = {H, S} where H and S stand for healthy and sick patient, respectively.
2. the random experiment which models the outcome of the medical test. Its sample space is Ωo = {+, −} where + and − stand for positive and negative outcome of the test, respectively.
Suppose that out of 1000 patients we observe the following counts, and hence joint probabilities:

         Es = S   Es = H              Es = S   Es = H
Eo = +      9       99     ⇒  Eo = +   .009     .099
Eo = −      1      891        Eo = −   .001     .891

What is the probability of having a positive (negative) test outcome when the patient is sick (healthy)? What is the probability of being in front of a sick (healthy) patient when a positive (negative) outcome is obtained?
Medical study (II)
From the definition of conditional probability we derive
Prob{Eo = +|Es = S} = Prob{Eo = +, Es = S} / Prob{Es = S} = .009/(.009 + .001) = .9
Prob{Eo = −|Es = H} = Prob{Eo = −, Es = H} / Prob{Es = H} = .891/(.891 + .099) = .9
According to these figures, the test appears to be accurate. Do we have to expect a high probability of being sick when the test is positive? The answer is NO, as shown by
Prob{Es = S|Eo = +} = Prob{Eo = +, Es = S} / Prob{Eo = +} = .009/(.009 + .099) ≈ .08
This example shows that humans sometimes tend to confuse Prob{Es|Eo} with Prob{Eo|Es}, and that the most intuitive response is not always the right one.
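The figures of the medical study can be reproduced directly from the joint probability table. A minimal Python sketch (the dictionary encoding of the table is an arbitrary choice):

```python
# Reproducing the medical-study figures from the joint probability table.
joint = {('+', 'S'): 0.009, ('+', 'H'): 0.099,
         ('-', 'S'): 0.001, ('-', 'H'): 0.891}

p_S = joint[('+', 'S')] + joint[('-', 'S')]       # marginal Prob{Es = S}
p_plus = joint[('+', 'S')] + joint[('+', 'H')]    # marginal Prob{Eo = +}

p_plus_given_S = joint[('+', 'S')] / p_S          # accuracy on sick patients
p_S_given_plus = joint[('+', 'S')] / p_plus       # what a + result really tells us
print(round(p_plus_given_S, 2), round(p_S_given_plus, 2))  # 0.9 0.08
```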
Array of joint/marginal probabilities
Let us consider the combination of two random experiments whose sample spaces are ΩA = {A1, · · · , An} and ΩB = {B1, · · · , Bm} respectively. Assume that for each pair of events (Ai, Bj), i = 1, . . . , n, j = 1, . . . , m we know the joint probability value Prob{Ai, Bj}.

            B1             B2             · · ·   Bm             Marginal
A1          Prob{A1, B1}   Prob{A1, B2}   · · ·   Prob{A1, Bm}   Prob{A1}
A2          Prob{A2, B1}   Prob{A2, B2}   · · ·   Prob{A2, Bm}   Prob{A2}
...
An          Prob{An, B1}   Prob{An, B2}   · · ·   Prob{An, Bm}   Prob{An}
Marginal    Prob{B1}       Prob{B2}       · · ·   Prob{Bm}       Sum = 1

The joint probability array contains all the necessary information for computing all marginal and conditional probabilities.
Try to fill the table for the dependent and independent case.
Dependent/independent: example
Let us model the commute time to go back home for an ULB student living in St. Gilles as a random experiment. Suppose that its sample space is Ω = {LOW, MEDIUM, HIGH}. Consider also an (extremely :-) random experiment representing the weather in Brussels, whose sample space is Ω = {G = GOOD, B = BAD}. Suppose that the array of joint probabilities is

          G (in Bxl)      B (in Bxl)      Marginal
LOW       0.15            0.05            Prob{LOW} = 0.2
MEDIUM    0.1             0.4             Prob{MEDIUM} = 0.5
HIGH      0.05            0.25            Prob{HIGH} = 0.3
          Prob{G} = 0.3   Prob{B} = 0.7   Sum = 1

Is the commute time dependent on the weather in Bxl?

          G (in Rome)     B (in Rome)     Marginal
LOW       0.18            0.02            Prob{LOW} = 0.2
MEDIUM    0.45            0.05            Prob{MEDIUM} = 0.5
HIGH      0.27            0.03            Prob{HIGH} = 0.3
          Prob{G} = 0.9   Prob{B} = 0.1   Sum = 1

Is the commute time dependent on the weather in Rome?
Dependent/independent: example (II)
If Brussels’ weather is good

            LOW              MEDIUM           HIGH
Prob{·|G}   0.15/0.3 = 0.5   0.1/0.3 ≈ 0.33   0.05/0.3 ≈ 0.17

Else if Brussels’ weather is bad

            LOW              MEDIUM           HIGH
Prob{·|B}   0.05/0.7 ≈ 0.07  0.4/0.7 ≈ 0.57   0.25/0.7 ≈ 0.36

The distribution of the commute time changes according to the value of Brussels’ weather.
Dependent/independent: example (III)
If Rome’s weather is good

            LOW              MEDIUM           HIGH
Prob{·|G}   0.18/0.9 = 0.2   0.45/0.9 = 0.5   0.27/0.9 = 0.3

Else if Rome’s weather is bad

            LOW              MEDIUM           HIGH
Prob{·|B}   0.02/0.1 = 0.2   0.05/0.1 = 0.5   0.03/0.1 = 0.3

The distribution of the commute time does NOT change according to the value of Rome’s weather.
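The two joint tables can be tested for independence mechanically: the variables are independent iff every joint probability equals the product of its marginals. An illustrative Python sketch (the function name and tolerance are arbitrary):

```python
# Checking independence from a joint probability array:
# independent iff P(r, c) = P(r) P(c) for every cell, up to rounding.
def independent(joint, tol=1e-9):
    rows = sorted({r for r, c in joint})
    cols = sorted({c for r, c in joint})
    p_r = {r: sum(joint[(r, c)] for c in cols) for r in rows}
    p_c = {c: sum(joint[(r, c)] for r in rows) for c in cols}
    return all(abs(joint[(r, c)] - p_r[r] * p_c[c]) < tol
               for r in rows for c in cols)

brussels = {('LOW', 'G'): 0.15, ('LOW', 'B'): 0.05,
            ('MEDIUM', 'G'): 0.10, ('MEDIUM', 'B'): 0.40,
            ('HIGH', 'G'): 0.05, ('HIGH', 'B'): 0.25}
rome = {('LOW', 'G'): 0.18, ('LOW', 'B'): 0.02,
        ('MEDIUM', 'G'): 0.45, ('MEDIUM', 'B'): 0.05,
        ('HIGH', 'G'): 0.27, ('HIGH', 'B'): 0.03}
print(independent(brussels), independent(rome))  # False True
```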
Marginal/conditional: example
Consider a probabilistic model of the day’s weather based on three random descriptors (or features) where
1. the first represents the sky condition and takes value in the finite set {CLEAR, CLOUDY},
2. the second represents the barometer trend and takes value in the finite set {RISING, FALLING},
3. the third represents the humidity in the afternoon and takes value in {DRY, WET}.
Marginal/conditional: example (II)
Let the joint distribution be given by the table
E1 E2 E3 P(E1,E2,E3)
CLEAR RISING DRY 0.4
CLEAR RISING WET 0.07
CLEAR FALLING DRY 0.08
CLEAR FALLING WET 0.10
CLOUDY RISING DRY 0.09
CLOUDY RISING WET 0.11
CLOUDY FALLING DRY 0.03
CLOUDY FALLING WET 0.12
From the joint distribution we can calculate the marginal probabilities P(CLEAR, RISING) = 0.47 and P(CLOUDY) = 0.35 and the conditional value
P(DRY|CLEAR, RISING) = P(DRY, CLEAR, RISING) / P(CLEAR, RISING) = 0.40/0.47 ≈ 0.85
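These marginal and conditional values follow mechanically from the joint table. A minimal Python sketch (the tuple encoding of the three features is an arbitrary choice):

```python
# Marginals and conditionals from the three-feature joint distribution.
joint = {('CLEAR', 'RISING', 'DRY'): 0.40, ('CLEAR', 'RISING', 'WET'): 0.07,
         ('CLEAR', 'FALLING', 'DRY'): 0.08, ('CLEAR', 'FALLING', 'WET'): 0.10,
         ('CLOUDY', 'RISING', 'DRY'): 0.09, ('CLOUDY', 'RISING', 'WET'): 0.11,
         ('CLOUDY', 'FALLING', 'DRY'): 0.03, ('CLOUDY', 'FALLING', 'WET'): 0.12}

# Marginals: sum the joint probabilities over the unspecified features.
p_clear_rising = sum(p for (s, b, h), p in joint.items()
                     if s == 'CLEAR' and b == 'RISING')               # 0.47
p_cloudy = sum(p for (s, b, h), p in joint.items() if s == 'CLOUDY')  # 0.35

# Conditional: ratio of the joint to the marginal.
p_dry_given = joint[('CLEAR', 'RISING', 'DRY')] / p_clear_rising      # ≈ 0.85
print(p_clear_rising, p_cloudy, round(p_dry_given, 2))
```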
Random variables
• Machine learning and statistics are concerned with data. What then is the link between the notion of random experiments and data? The answer is provided by the concept of random variable.
• Consider a random experiment and the associated triple (Ω, E, Prob{·}). The outcome of an experiment need not be a number; for example, the outcome when a coin is tossed can be 'heads' or 'tails'. However, we often want to represent outcomes as numbers. A random variable is a function that associates a unique numerical value with every outcome of an experiment. The value of the random variable will vary from trial to trial as the experiment is repeated.
• Suppose that we have a mapping rule Ω → R such that we can associate with each experimental outcome ω a real value z(ω).
• We say that z is the value taken by the random variable z when the outcome of the random experiment is ω.
• Since there is a probability associated with each event E and we have a mapping from events to real values, a probability distribution can be associated with z.
Random variables
Definition 4. Given a random experiment (Ω, E, Prob{·}), a random variable z is the result of a mapping that assigns a number z to every outcome ω. This mapping must satisfy the following two conditions but is otherwise arbitrary:
• the set {z ≤ z} is an event for every z;
• the probabilities
Prob{z = ∞} = 0, Prob{z = −∞} = 0
Given a random variable z ∈ Z and a subset I ⊂ Z we define the inverse mapping
z⁻¹(I) = {ω ∈ Ω | z(ω) ∈ I}
where z⁻¹(I) ∈ E is an event, and let
Prob{z ∈ I} = Prob{z⁻¹(I)} = Prob{ω ∈ Ω | z(ω) ∈ I}
Probabilistic interpretation of uncertainty
• This course will assume that the variability of measurements can be represented by the probability formalism.
• A random variable is a numerical quantity, linked to some experiment involving some degree of randomness, that takes its value from some set of possible real values.
• Example: the experiment might be the rolling of two six-sided dice and the r.v. z might be the sum of the two numbers showing on the dice. In this case the set of possible values is {2, . . . , 12}.
• In the example of the commute time, the random experiment is a compact (and approximate) way of modeling the disparate set of causes which lead to variability in the value of z.
Probability function of a discrete r.v.
The probability (mass) function of a discrete r.v. z is the combination of
1. the set Z of values that this r.v. can take (also called range or sample space),
2. the set of probabilities associated to each value of Z.
This means that we can attach to the random variable some specific mathematical function P(z) that gives for each z ∈ Z the probability that z assumes the value z:
Pz(z) = Prob{z = z}
This function must satisfy the two following conditions:
Pz(z) ≥ 0
Σ_{z∈Z} Pz(z) = 1
Probability function of a discrete r.v. (II)
For a reduced number of possible values of z, the probability function can be presented in the form of a table. For example, if we plan to toss a fair coin twice, and the random variable z is the number of heads that eventually turn up, the probability function can be presented as follows:
Values of the random variable z 0 1 2
Associated probabilities 0.25 0.50 0.25
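The table above can be built by enumerating the four equally likely outcomes. A minimal Python sketch:

```python
from itertools import product

# z = number of heads in two tosses of a fair coin;
# each of the 4 outcomes has probability 1/4.
P = {}
for tosses in product('HT', repeat=2):
    z = tosses.count('H')
    P[z] = P.get(z, 0) + 0.25
print(sorted(P.items()))  # [(0, 0.25), (1, 0.5), (2, 0.25)]
```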
Parametric probability function
Suppose that
1. z is a discrete r.v. that takes its value in Z = {1, 2, 3},
2. the probability function of z is
Pz(z) = θ^{2z} / (θ² + θ⁴ + θ⁶)
where θ is some fixed nonzero real number.
Whatever the value of θ, Pz(z) ≥ 0 for z = 1, 2, 3 and Pz(1) + Pz(2) + Pz(3) = 1. Therefore z is a well-defined random variable, even if the value of θ is unknown. We call θ a parameter, that is, some constant, usually unknown, involved in a probability function.
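The claim that Pz is a valid probability function for any nonzero θ can be verified numerically. An illustrative Python sketch (the tested θ values are arbitrary):

```python
# Checking that Pz(z) = θ^(2z) / (θ² + θ⁴ + θ⁶) is a well-defined
# probability function on Z = {1, 2, 3} for several values of θ.
def P(z, theta):
    return theta ** (2 * z) / (theta ** 2 + theta ** 4 + theta ** 6)

for theta in (0.5, 1.0, 3.0, -2.0):
    probs = [P(z, theta) for z in (1, 2, 3)]
    assert all(p >= 0 for p in probs)          # non-negativity
    assert abs(sum(probs) - 1.0) < 1e-12       # normalization
print("valid probability function for every tested theta")
```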
Expected value of a discrete r.v.
The expected value of a discrete random variable z is defined as
E[z] = µ = Σ_{z∈Z} z Pz(z)
• In English, “mean” is used as a synonym of “expected value”.
• But the word “average” is NOT a synonym of “expected value”.
• The expected value is not necessarily a value that belongs to Z.
Variance of a discrete r.v.
The variance of a discrete random variable z is defined as
Var[z] = σ² = E[(z − E[z])²] = Σ_{z∈Z} (z − E[z])² Pz(z)
• The variance is a measure of the dispersion of the probability function of the random variable around its mean.
Note that the following identity holds:
E[(z − E[z])²] ≡ E[z²] − (E[z])² = E[z²] − µ²
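The identity can be checked numerically on any discrete distribution; here is a minimal Python sketch using the two-coin probability function from the earlier table:

```python
# Numerical check of the identity Var[z] = E[z²] − (E[z])² on the
# two-coin example (z = number of heads).
pmf = {0: 0.25, 1: 0.50, 2: 0.25}

mu = sum(z * p for z, p in pmf.items())
var_def = sum((z - mu) ** 2 * p for z, p in pmf.items())       # definition
var_identity = sum(z ** 2 * p for z, p in pmf.items()) - mu ** 2  # identity
print(mu, var_def, var_identity)  # 1.0 0.5 0.5
```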
Examples of probability functions
[Figure: two bar plots of probability functions P over the values 1:10.]
Two discrete r.v. probability functions having the same mean but different variance.
Std. deviation and moments of a discrete r.v.
The standard deviation of a discrete random variable z is defined as the positive square root of the variance:
Std[z] = √Var[z] = σ
Moment: for any positive integer r, the rth moment of the probability function is
µ_r = E[z^r] = Σ_z z^r Pz(z)
The skewness of a discrete random variable z is defined as
γ = E[(z − µ)³] / σ³
• Distributions with positive skewness have long tails to the right, and distributions with negative skewness have long tails to the left.
Joint probability
• Consider a probabilistic model described by n discrete random variables.
• A fully specified probabilistic model gives the joint probability function for every combination of the values of the n r.v.s.
• The model is specified by the values of the probabilities
Prob{z1 = z1, z2 = z2, . . . , zn = zn} = P(z1, z2, . . . , zn)
for every possible assignment of values z1, . . . , zn to the variables.
Independent variables
• Let x and y be two random variables.
• The two variables x and y are defined to be statistically independent if the joint probability satisfies Prob{x = x, y = y} = Prob{x = x} Prob{y = y}.
• If two variables x and y are independent, then the transformed r.v.s g(x) and h(y), where g and h are two given functions, are also independent.
• In qualitative terms, this means that we do not expect that the outcome of one variable will affect the other.
• Examples: think of two outcomes of a roulette wheel or of two coins tossed simultaneously.
Marginal and conditional probability
• Marginal probabilities for subsets of the n variables can be found by summing over all possible combinations of values for the other variables:
P(z1, . . . , zm) = Σ_{z_{m+1}} · · · Σ_{z_n} P(z1, . . . , zm, z_{m+1}, . . . , zn)
• Conditional probabilities for one subset of variables {xi : i ∈ S1}, given values for another disjoint subset {xj : j ∈ S2} where S1 ∩ S2 = ∅, are defined as ratios of marginal probabilities:
P({xi : i ∈ S1} | {xj : j ∈ S2}) = P({xi : i ∈ S1}, {xj : j ∈ S2}) / P({xj : j ∈ S2})
• If x and y are independent then Px(x|y = y) = Px(x).
• Since independence is symmetric, this is equivalent to saying that Py(y|x = x) = Py(y).
Bernoulli trial
• A Bernoulli trial is a random experiment with two possible outcomes, often called “success” and “failure”.
• The probability of success is denoted by p and the probability of failure by (1 − p).
• A Bernoulli random variable z is a discrete r.v. associated with the Bernoulli trial. It takes z = 0 with probability (1 − p) and z = 1 with probability p.
• The probability function of z can be written in the form
Prob{z = z} = Pz(z) = p^z (1 − p)^{1−z}, z = 0, 1
• Note that
E[z] = 1 · p + 0 · (1 − p) = p
and
Var[z] = p(1 − p)² + (1 − p)(0 − p)² = p(1 − p).
The Binomial distribution
• A binomial random variable is the number of successes in a fixed number N of independent Bernoulli trials with the same probability of success for each trial.
• Example: the number z of heads in N tosses of a coin.
• The probability function is given by
Prob{z = z} = Pz(z) = (N choose z) p^z (1 − p)^{N−z}, z = 0, 1, . . . , N
• The mean of the distribution is µ = Np and the variance is σ² = Np(1 − p).
• The Bernoulli distribution is a special case (N = 1) of the binomial distribution.
• For small p, the probability of having at least 1 success in N trials is proportional to N, as long as Np is small.
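The mean and variance formulas can be checked directly against the pmf. A minimal Python sketch (N = 10 and p = 0.3 are arbitrary illustrative choices):

```python
from math import comb

# Binomial pmf Prob{z = z} = C(N, z) p^z (1 − p)^(N−z),
# with mean Np and variance Np(1 − p).
N, p = 10, 0.3
pmf = [comb(N, z) * p ** z * (1 - p) ** (N - z) for z in range(N + 1)]

mean = sum(z * pz for z, pz in enumerate(pmf))
var = sum((z - mean) ** 2 * pz for z, pz in enumerate(pmf))
print(round(mean, 6), round(var, 6))  # 3.0 2.1
```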
Mean and variances of the distributions

Distribution   Mean        Variance
Bernoulli      p           p(1 − p)
Binomial       Np          Np(1 − p)
Geometric      p/(1 − p)   p/(1 − p)²
Poisson        λ           λ
Continuous random variable
• Continuous random variables take their value in some continuous range of values. Consider a real random variable z whose range is the set of real numbers. The following quantities can be defined:
Definition 5. The (cumulative) distribution function of z is the function
Fz(z) = Prob{z ≤ z} (7)
Definition 6. The density function of a real random variable z is the derivative of the distribution function:
pz(z) = dFz(z)/dz (8)
• Probabilities of continuous r.v.s are not allocated to specific values but rather to intervals of values. Specifically
Prob{a < z < b} = ∫_a^b pz(z) dz,  ∫_Z pz(z) dz = 1
Mean, variance, ... of a continuous r.v.
Consider a continuous scalar r.v. having range (l, h) and density function p(z). We can define
Expectation (mean):
µ = ∫_l^h z p(z) dz
Variance:
σ² = ∫_l^h (z − µ)² p(z) dz
Other quantities of interest are the moments:
µ_r = E[z^r] = ∫_l^h z^r p(z) dz
The moment of order r = 1 is the mean of z.
Chebyshev’s inequality
• Let z be a generic random variable, discrete or continuous, having mean µ and variance σ².
• Chebyshev’s inequality states that for any positive constant d
Prob{|z − µ| ≥ d} ≤ σ²/d²
• An experimental validation of Chebyshev’s inequality can be found in the R script cheby.R.
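The same kind of experimental validation as in cheby.R can be sketched in Python (this is an illustrative analogue, not the course script; the uniform distribution, thresholds and seed are arbitrary choices):

```python
import random

# For z uniform on (0, 1), µ = 0.5 and σ² = 1/12: the empirical
# frequency of |z − µ| ≥ d must stay below the Chebyshev bound σ²/d².
random.seed(1)
mu, var = 0.5, 1 / 12
sample = [random.random() for _ in range(100000)]

for d in (0.3, 0.4, 0.45):
    freq = sum(abs(z - mu) >= d for z in sample) / len(sample)
    bound = var / d ** 2
    assert freq <= bound            # the inequality holds empirically
    print(d, round(freq, 3), round(bound, 3))
```

Note that the bound is loose: for d = 0.3 the true probability is 0.4 while the bound is about 0.93.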
Uniform distribution
A random variable z is said to be uniformly distributed on the interval (a, b) (also z ∼ U(a, b)) if its probability density function is given by
p(z) = 1/(b − a) if a < z < b, 0 otherwise
[Figure: the density p(z), constant at 1/(b − a) between a and b and zero elsewhere.]
TP: compute the variance of U(a, b).
Normal distribution: the scalar case
A continuous scalar random variable x is said to be normally distributed with parameters µ and σ² (also x ∼ N(µ, σ²)) if its probability density function is
px(x) = (1/(√(2π)σ)) e^{−(x−µ)²/(2σ²)}
• The mean of x is µ; the variance of x is σ².
• The coefficient in front of the exponential ensures that ∫ p(x) dx = 1.
• The probability that an observation x from a normal r.v. is within 2 standard deviations of the mean is approximately 0.95.
• If µ = 0 and σ² = 1 the distribution is called the standard normal. We will denote its distribution function by Fz(z) = Φ(z).
• Given a normal r.v. x ∼ N(µ, σ²), the r.v. z = (x − µ)/σ has a standard normal distribution.
Standard distribution
[Figure: the density of the standard normal distribution.]
Important relations
For x ∼ N(µ, σ²):
Prob{µ − σ ≤ x ≤ µ + σ} ≈ 0.683
Prob{µ − 1.282σ ≤ x ≤ µ + 1.282σ} ≈ 0.8
Prob{µ − 1.645σ ≤ x ≤ µ + 1.645σ} ≈ 0.9
Prob{µ − 1.96σ ≤ x ≤ µ + 1.96σ} ≈ 0.95
Prob{µ − 2σ ≤ x ≤ µ + 2σ} ≈ 0.954
Prob{µ − 2.57σ ≤ x ≤ µ + 2.57σ} ≈ 0.99
Prob{µ − 3σ ≤ x ≤ µ + 3σ} ≈ 0.997
Test these relations yourself by random sampling and simulation using R!
Test by using R
> N <- 1000000
> mu <- 1
> sigma <- 2
> DN <- rnorm(N, mu, sigma)
> print(sum(DN <= (mu + sigma) & DN >= (mu - sigma))/N)
[1] 0.682668
> print(sum(DN <= (mu + 1.645*sigma) & DN >= (mu - 1.645*sigma))/N)
[1] 0.899851
> qnorm(0.9)
[1] 1.281552
> qnorm(0.95)
[1] 1.644854
> qnorm(0.995)
[1] 2.57
Upper critical points
Definition 7. The upper critical point of a continuous r.v. x is the smallest number xα such that
1 − α = Prob{x ≤ xα} = F(xα) ⇔ xα = F⁻¹(1 − α)
• We will denote by zα the upper critical points of the standard normal density:
1 − α = Prob{z ≤ zα} = F(zα) = Φ(zα)
• Note that the following relations hold:
zα = Φ⁻¹(1 − α), z_{1−α} = −zα, 1 − α = (1/√(2π)) ∫_{−∞}^{zα} e^{−z²/2} dz
• Here we list the most commonly used values of zα:

α    0.1    0.075  0.05   0.025  0.01   0.005  0.001  0.0005
zα   1.282  1.440  1.645  1.960  2.326  2.576  3.090  3.291

• In R, zα = qnorm(1-alpha)
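The same table can be recomputed with any inverse normal CDF; here is a minimal Python sketch where the standard library's `NormalDist.inv_cdf` plays the role of R's `qnorm` (the α values chosen are arbitrary):

```python
from statistics import NormalDist

# Upper critical points of the standard normal: z_alpha = Φ⁻¹(1 − α).
std = NormalDist()                  # mean 0, standard deviation 1
z = {alpha: std.inv_cdf(1 - alpha) for alpha in (0.1, 0.05, 0.025, 0.005)}
for alpha, z_alpha in z.items():
    print(alpha, round(z_alpha, 3))
```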
Bivariate discrete probability distribution
• Let us consider two discrete r.v.s x ∈ X and y ∈ Y and their joint probability function Px,y(x, y).
• We define the marginal probability as the quantity
Px(x) = Σ_{y∈Y} Px,y(x, y)
• We define the conditional probability as the quantity
Py|x(y|x) = Px,y(x, y) / Px(x)
which is the probability that y takes the value y once x = x.
• If x and y are independent,
Px,y(x, y) = Px(x) Py(y), P(y|x) = Py(y)
Bivariate continuous probability distribution
• Let us consider two continuous r.v.s x and y and their joint density function px,y(x, y).
• We define the marginal density as the quantity
px(x) = ∫_{−∞}^{∞} px,y(x, y) dy
• We define the conditional density as the quantity
py|x(y|x) = px,y(x, y) / px(x)
which is, in loose terms, the probability that y is in dy about y assuming x = x.
• If x and y are independent,
px,y(x, y) = px(x) py(y), p(y|x) = py(y)
Bivariate probability distribution (II)
Definition 8 (Conditional expectation). The conditional expectation of y given x = x is
E[y|x = x] = ∫ y py|x(y|x) dy = µ_{y|x}(x) (9)
Definition 9 (Conditional variance).
Var[y|x = x] = ∫ (y − µ_{y|x}(x))² py|x(y|x) dy (10)
Note that both these quantities are functions of x. If x is a random variable, it follows that the conditional expectation and the conditional variance are random, too.
Theorem 4. For two r.v.s x and y, assuming the expectations exist, we have that
Ex[Ey[y|x]] = Ey[y] (11)
and
Var[y] = Ex[Var[y|x]] + Var[Ey[y|x]] (12)
where Var[y|x] and Ey[y|x] are functions of x.
Bivariate discrete example

          y = 3           y = 10           y = 20           Marginal
x = 10    0.05            0.05             0.15             P(x = 10) = 0.25
x = 20    0.1             0.2              0.10             P(x = 20) = 0.4
x = 30    0.15            0.15             0.05             P(x = 30) = 0.35
          P(y = 3) = 0.3  P(y = 10) = 0.4  P(y = 20) = 0.3  Sum = 1

E[x] = 21, Var[x] = 59, E[y] = 10.9, Var[y] = 43.89,
E[y|x = 10] = 0.05 · 3/0.25 + 0.05 · 10/0.25 + 0.15 · 20/0.25 = 14.6
E[y|x = 20] = 0.1 · 3/0.4 + 0.2 · 10/0.4 + 0.1 · 20/0.4 = 10.75
E[y|x = 30] = 0.15 · 3/0.35 + 0.15 · 10/0.35 + 0.05 · 20/0.35 ≈ 8.43
E[y] = E[y|x = 10]P(x = 10) + E[y|x = 20]P(x = 20) + E[y|x = 30]P(x = 30)
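The final line, an instance of Theorem 4's equation (11), can be verified numerically from the joint table. A minimal Python sketch (the dictionary encoding is an arbitrary choice):

```python
# Verifying E[y] = Ex[E[y|x]] on the bivariate table above.
joint = {(10, 3): 0.05, (10, 10): 0.05, (10, 20): 0.15,
         (20, 3): 0.10, (20, 10): 0.20, (20, 20): 0.10,
         (30, 3): 0.15, (30, 10): 0.15, (30, 20): 0.05}

xs = sorted({x for x, y in joint})
p_x = {x: sum(p for (xx, y), p in joint.items() if xx == x) for x in xs}
cond_mean = {x: sum(y * p for (xx, y), p in joint.items() if xx == x) / p_x[x]
             for x in xs}

E_y = sum(y * p for (x, y), p in joint.items())          # direct
E_cond = sum(cond_mean[x] * p_x[x] for x in xs)          # via conditioning
print(round(E_y, 2), round(E_cond, 2))
```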
Bivariate continuous random experiment
[Figure: scatter plot of an input/output training set, with x on the horizontal axis (0 to 10) and y on the vertical axis (0 to 4).]
Bivariate density distribution
[Figure: plot of a bivariate density distribution.]
Linear combinations
• The expected value of a linear combination of r.v.s is simply the linear combination of their respective expected values:
E[ax + by] = aE[x] + bE[y]
i.e., expectation is a linear statistic.
• Since the variance is not a linear statistic, we have
Var[ax + by] = a²Var[x] + b²Var[y] + 2ab(E[xy] − E[x]E[y]) = a²Var[x] + b²Var[y] + 2ab Cov[x, y]
where
Cov[x, y] = E[(x − E[x])(y − E[y])] = E[xy] − E[x]E[y]
is called the covariance.
• Covariance measures whether the two variables vary simultaneously in the same way around their averages.
Covariance example

          y = 3           y = 10           y = 20           Marginal
x = 10    0.05            0.05             0.15             P(x = 10) = 0.25
x = 20    0.1             0.2              0.10             P(x = 20) = 0.4
x = 30    0.15            0.15             0.05             P(x = 30) = 0.35
          P(y = 3) = 0.3  P(y = 10) = 0.4  P(y = 20) = 0.3  Sum = 1

E[x] = 21, Var[x] = 59, E[y] = 10.9, Var[y] = 43.89,

xy            30    60   90    100   200        300   400   600
P(xy = xy)    0.05  0.1  0.15  0.05  0.15+0.2   0.15  0.10  0.05

E[xy] = 211 ⇒ Cov[x, y] = 211 − 21 · 10.9 = −17.9, ρ(x, y) = −0.35
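The covariance and correlation can be computed from the joint table without building the distribution of xy explicitly. A minimal Python sketch:

```python
from math import sqrt

# Cov[x, y] = E[xy] − E[x]E[y] and ρ = Cov / √(Var[x] Var[y])
# from the joint table of the example above.
joint = {(10, 3): 0.05, (10, 10): 0.05, (10, 20): 0.15,
         (20, 3): 0.10, (20, 10): 0.20, (20, 20): 0.10,
         (30, 3): 0.15, (30, 10): 0.15, (30, 20): 0.05}

E_x = sum(x * p for (x, y), p in joint.items())
E_y = sum(y * p for (x, y), p in joint.items())
E_xy = sum(x * y * p for (x, y), p in joint.items())
var_x = sum((x - E_x) ** 2 * p for (x, y), p in joint.items())
var_y = sum((y - E_y) ** 2 * p for (x, y), p in joint.items())

cov = E_xy - E_x * E_y
rho = cov / sqrt(var_x * var_y)
print(round(E_xy, 1), round(cov, 1), round(rho, 2))  # 211.0 -17.9 -0.35
```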
Correlation
• The correlation coefficient is
ρ(x, y) = Cov[x, y] / √(Var[x] Var[y])
It is easily shown that −1 ≤ ρ(x, y) ≤ 1.
• Two r.v.s are called uncorrelated if
E[xy] = E[x]E[y]
• If x and y are two independent random variables then Cov[x, y] = 0 or equivalently E[xy] = E[x]E[y].
• If x and y are two independent random variables then also Cov[g(x), h(y)] = 0 or equivalently E[g(x)h(y)] = E[g(x)]E[h(y)].
• Independence ⇒ uncorrelation, but not vice versa for a generic distribution.
• Independence ⇔ uncorrelation if x and y are jointly Gaussian.
Correlation and causation
• The existence of a correlation different from zero does not necessarily mean that there is a causality relationship.
• Think of these examples:
• Salaries of professors and alcohol consumption
• Amount of Cokes drunk per day and sport performance
• Sleeping with shoes on and headache
• The number of firemen and the gravity of the disaster
• Taking an expensive drug and cancer risk
• According to Tufte,
Empirically observed covariation is a necessary but not sufficient condition for causality.
Linear combination of independent vars
• If the random variables x and y are independent, then
Var[x + y] = Var[x] + Var[y]
and
Var[ax + by] = a²Var[x] + b²Var[y]
• In general, if the random variables x1, x2, . . . , xk are independent, then
Var[Σ_{i=1}^k ci xi] = Σ_{i=1}^k ci² σi²
TP
1. Let x and y be two discrete independent r.v.s such that
Px(−1) = 0.1, Px(0) = 0.8, Px(1) = 0.1
and
Py(1) = 0.1, Py(2) = 0.8, Py(3) = 0.1
If z = x + y, show that E[z] = E[x] + E[y].
2. Let x be a discrete r.v. which assumes the values −1, 0, 1, each with probability 1/3, and let y = x².
(a) Let z = x + y. Show that E[z] = E[x] + E[y].
(b) Demonstrate that x and y are uncorrelated but dependent random variables.
The sum of i.i.d. random variables
Suppose that z1, z2, . . . , zN are i.i.d. (identically and independently distributed) random variables, discrete or continuous, each having a probability distribution with mean µ and variance σ². Let us consider the two derived r.v., that is the sum
   SN = z1 + z2 + · · · + zN
and the average
   z̄ = (z1 + z2 + · · · + zN) / N
The following relations hold:
   E[SN] = Nµ, Var[SN] = Nσ²
   E[z̄] = µ, Var[z̄] = σ²/N
See the R script sum_rv.R.
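A quick simulation along the lines of sum_rv.R, written here as an illustrative Python sketch (the parameter values µ = 5, σ = 2, N = 20 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, reps = 20, 200_000
mu, sigma = 5.0, 2.0

# Draw reps independent samples of (z_1, ..., z_N), each z_i ~ N(mu, sigma^2).
Z = rng.normal(mu, sigma, size=(reps, N))
S = Z.sum(axis=1)       # the sum S_N
zbar = Z.mean(axis=1)   # the average

print(S.mean(), N * mu)              # E[S_N] = N*mu
print(S.var(), N * sigma**2)         # Var[S_N] = N*sigma^2
print(zbar.var(), sigma**2 / N)      # Var[average] = sigma^2 / N
```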
Apprentissage automatique – p. 63/78
Normal distribution: the multivariate case
Let z be a random vector (n × 1). The vector is said to be normally distributed with parameters µ(n×1) and Σ(n×n) (also z ∼ N(µ, Σ)) if its probability density function is given by
   pz(z) = 1 / ( (√(2π))^n √det(Σ) ) exp{ −(1/2) (z − µ)^T Σ^{−1} (z − µ) }
It follows that
• the mean E[z] = µ = [µ1, . . . , µn]^T is an [n, 1]-dimensional vector, where µi = E[zi], i = 1, . . . , n,
• the [n, n] matrix
   Σ(n×n) = E[(z − µ)(z − µ)^T] =
   [ σ1²  σ12  . . .  σ1n ]
   [ ...  ...  . . .  ... ]
   [ σ1n  σ2n  . . .  σn² ]
is the covariance matrix, where σi² = Var[zi] and σij = Cov[zi, zj]. This matrix is square and symmetric. It has n(n + 1)/2 parameters.
Apprentissage automatique – p. 64/78
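The density above can be evaluated directly from the formula. An illustrative Python sketch (the particular µ and Σ are arbitrary); at z = µ the exponent vanishes, so the density reduces to the normalizing constant:

```python
import numpy as np

def mvn_pdf(z, mu, Sigma):
    """Multivariate normal density, computed directly from the formula."""
    n = len(mu)
    diff = z - mu
    maha = diff @ np.linalg.solve(Sigma, diff)  # (z - mu)^T Sigma^{-1} (z - mu)
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# At z = mu the density equals 1 / ((2*pi)^(n/2) sqrt(det Sigma)).
print(mvn_pdf(mu, mu, Sigma))
print(1 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma))))  # same value for n = 2
```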
Normal multivariate distribution (II)
The quantity
   ∆² = (z − µ)^T Σ^{−1} (z − µ)
which appears in the exponent of pz is called the (squared) Mahalanobis distance from z to µ. It can be shown that the surfaces of constant probability density
• are hyperellipsoids on which ∆² is constant;
• have principal axes given by the eigenvectors ui, i = 1, . . . , n, of Σ, which satisfy
   Σ ui = λi ui,  i = 1, . . . , n
where the λi are the corresponding eigenvalues;
• the eigenvalues λi give the variances along the principal directions.
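The eigen-decomposition of Σ is a one-liner in practice. An illustrative Python sketch, reusing the covariance matrix from the bivariate figure later in these slides (which was evidently chosen so that the eigenvalues come out as 1 and 2):

```python
import numpy as np

Sigma = np.array([[1.2919, 0.4546],
                  [0.4546, 1.7081]])  # covariance matrix from the slides' figure

# Eigenvectors of Sigma are the principal axes of the constant-density
# ellipses; the eigenvalues are the variances along those axes.
lam, U = np.linalg.eigh(Sigma)
print(lam)  # eigenvalues, ascending: approximately [1.0, 2.0]

# Each column u_i of U satisfies Sigma @ u_i = lambda_i * u_i.
print(np.allclose(Sigma @ U, U * lam))  # True
```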
Apprentissage automatique – p. 65/78
Normal multivariate distribution (III)
If the covariance matrix Σ is diagonal then
• the contours of constant density are hyperellipsoids with the principal directions aligned with the coordinate axes;
• the components of z are statistically independent, since the distribution of z can be written as the product of the distributions of each of the components separately, in the form
   pz(z) = ∏_{i=1}^{n} p(zi)
• the total number of independent parameters in the distribution is 2n;
• if σi = σ for all i, the contours of constant density are hyperspheres.
Apprentissage automatique – p. 66/78
Bivariate normal distribution
Consider a bivariate normal density whose mean is µ = [µ1, µ2]^T and whose covariance matrix is
   Σ = [ σ1²  σ12 ]
       [ σ12  σ2² ]
The correlation coefficient is
   ρ = σ12 / (σ1 σ2)
It can be shown that the general bivariate normal density has the form
   p(z1, z2) = 1 / ( 2π σ1 σ2 √(1 − ρ²) ) exp{ −1 / (2(1 − ρ²)) [ ((z1 − µ1)/σ1)² − 2ρ ((z1 − µ1)/σ1)((z2 − µ2)/σ2) + ((z2 − µ2)/σ2)² ] }
Apprentissage automatique – p. 67/78
Bivariate normal distribution
Let Σ = [1.2919, 0.4546; 0.4546, 1.7081].
[Figure: surface plot of the density p(z1, z2) over the (z1, z2) plane.]
Apprentissage automatique – p. 68/78
Bivariate normal distribution (prj)
[Figure: contour of the bivariate density in the (z1, z2) plane, with principal axes u1, u2 and eigenvalues λ1, λ2.]
See the R script s_gaussXYZ.R.
Apprentissage automatique – p. 69/78
Marginal and conditional distributions
One of the important properties of the multivariate normal density is that all conditional and marginal probabilities are also normal. Using the relation
   p(z2|z1) = p(z1, z2) / p(z1)
we find that p(z2|z1) is a normal distribution N(µ2|1, σ²2|1), where
   µ2|1 = µ2 + ρ (σ2/σ1) (z1 − µ1)
   σ²2|1 = σ2² (1 − ρ²)
Note that
• µ2|1 is a linear function of z1: if the correlation coefficient ρ is positive, the larger z1, the larger µ2|1;
• if there is no correlation between z1 and z2, we can ignore the value of z1 to estimate µ2.
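The two conditioning formulas translate directly into code. An illustrative Python sketch (the function name and the particular µ, Σ are choices made here, not part of the course material):

```python
import numpy as np

def conditional_params(mu, Sigma, z1):
    """Mean and variance of z2 | z1 for a bivariate normal N(mu, Sigma)."""
    mu1, mu2 = mu
    s1, s2 = np.sqrt(Sigma[0, 0]), np.sqrt(Sigma[1, 1])
    rho = Sigma[0, 1] / (s1 * s2)
    mean = mu2 + rho * (s2 / s1) * (z1 - mu1)  # mu_{2|1}
    var = s2**2 * (1 - rho**2)                 # sigma^2_{2|1}
    return mean, var

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Here rho = 0.8, so observing z1 = 1 shifts the mean of z2 to 0.8
# and shrinks its variance from 1 to 1 - 0.8^2 = 0.36.
print(conditional_params(mu, Sigma, 1.0))
```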
Apprentissage automatique – p. 70/78
[Figure: scatter of the joint distribution of (z1, z2), the marginal densities p(z1) and p(z2), and the conditional density p(z2|z1 = 0).]
rho = 0.905 ; Var[z2] = 1.05 ; Var[z2|z1] = 0.19
Apprentissage automatique – p. 71/78
The central limit theorem
Theorem 5. Assume that z1, z2, . . . , zN are i.i.d. random variables, discrete or continuous, each having a probability distribution with finite mean µ and finite variance σ². As N → ∞, the standardized random variable
   (z̄ − µ) √N / σ
where z̄ = SN / N is the average, which is identical to
   (SN − Nµ) / (√N σ)
converges in distribution to a r.v. having the standardized normal distribution N(0, 1).
• This result holds regardless of the common distribution of zi.
• This theorem justifies the importance of the normal distribution, sincemany r.v. of interest are either sums or averages.
• See R script central.R.
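A simulation in the spirit of central.R, written here as an illustrative Python sketch: the summands are exponential, hence strongly non-normal, yet the standardized average behaves like N(0, 1) (the choice of distribution and of N = 200 is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 200, 100_000

# Exponential summands with mean 1 and variance 1.
Z = rng.exponential(1.0, size=(reps, N))
mu, sigma = 1.0, 1.0

# Standardized averages (zbar - mu) * sqrt(N) / sigma.
t = (Z.mean(axis=1) - mu) * np.sqrt(N) / sigma
print(t.mean(), t.std())  # approximately 0 and 1

# The fraction below 1.96 should approach the N(0, 1) value 0.975 as N grows.
print(np.mean(t < 1.96))
```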
Apprentissage automatique – p. 72/78
The chi-squared distribution
For N a positive integer, a r.v. z has a χ²N distribution if
   z = x1² + · · · + xN²
where x1, x2, . . . , xN are i.i.d. random variables N(0, 1).
• The probability distribution is a gamma distribution with parameters (N/2, 1/2).
• E[z] = N and Var[z] = 2N.
• The distribution is called “a chi-squared distribution with N degrees of freedom”.
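Both moments can be checked by building χ²N variables from their definition. An illustrative Python sketch (N = 10 matches the figures on the next slide):

```python
import numpy as np

rng = np.random.default_rng(4)
N, reps = 10, 200_000

# Sum of squares of N standard normals is chi-squared with N degrees of freedom.
z = (rng.normal(size=(reps, N)) ** 2).sum(axis=1)
print(z.mean(), z.var())  # approximately N = 10 and 2N = 20
```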
Apprentissage automatique – p. 73/78
The chi-squared distribution (II)
[Figure: χ²N density and cumulative distribution for N = 10.]
R script chisq.R.
Apprentissage automatique – p. 74/78
Student’s t-distribution
If x ∼ N(0, 1) and y ∼ χ²N are independent, then the Student’s t-distribution with N degrees of freedom is the distribution of the r.v.
   z = x / √(y/N)
We denote this with z ∼ TN.
Apprentissage automatique – p. 75/78
Student’s t-distribution
[Figure: Student density and cumulative distribution for N = 10.]
R script s_stu.R.
Apprentissage automatique – p. 76/78
Notation
• In order to clarify the distinction between random variables and their values, we will use boldface notation for a random variable (e.g. z) and normal face notation for a possibly observed value (e.g. z = 11).
• The notation Pz(z) denotes the probability that the random variable z takes the value z. The suffix indicates that the probability relates to the random variable z. This is necessary since we often discuss probabilities associated with several random variables simultaneously.
• Example: z could be the age of a student before asking, and z = 22 could be its value after the observation.
Apprentissage automatique – p. 77/78
Notation (II)
• In general terms, we will denote as the probability distribution of a random variable z any complete description of the probabilistic behavior of z.
• For example, if z is continuous, the density function p(z) or the distribution function are examples of a probability distribution.
• Given a probability distribution Fz(z), the notation
   Fz → z1, z2, . . . , zN
means that the dataset DN = {z1, z2, . . . , zN} is an i.i.d. random sample observed from the probability distribution Fz(·).
Apprentissage automatique – p. 78/78