Probabilistic Model of Sequences
Bob Durrant
School of Computer Science
University of Birmingham
(Slides: Dr Ata Kabán)
Sequence
• What is a sequence?
• Example 1: a b a c a b a b a c
• Example 2: 1 0 0 1 1 0 1 0 0 1
• Example 3: 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3
• Alphabet 1 = {a, b, c}, Alphabet 2 = {0, 1}, Alphabet 3 = {1, 2, 3, 4, 5, 6}
• Roll a six-sided die N times: you get a sequence. Roll it again: you get another sequence.
• Here is a sequence of characters. Can you see it?
Probabilistic Model
• Model = a system that simulates the sequence under consideration
• Probabilistic model = a model that produces different outcomes with different probabilities
  – It includes uncertainty
  – It can therefore simulate a whole class of sequences and assigns a probability to each individual sequence
• Could you simulate any of the sequences on the previous slide?
Random sequence model
• Back to the die example (the die can possibly be loaded)
  – The model of a roll has 6 parameters: p(1), p(2), p(3), p(4), p(5), p(6)
  – Here, p(i) is the probability of throwing i
  – To be probabilities, these must be non-negative and must sum to one
  – What is the probability of the sequence [1, 6, 3]? p(1) · p(6) · p(3)
• NOTE: in the random sequence model, the individual symbols in a sequence do not depend on each other. This is the simplest sequence model.
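As a minimal sketch of the random sequence model (the fair die used here is just an illustration), the probability of a sequence is the product of the per-symbol probabilities:

```python
# Random sequence model: symbols are independent, so the probability of a
# sequence is the product of the individual symbol probabilities.

def sequence_probability(seq, p):
    """Return p(s1) * p(s2) * ... * p(sN)."""
    prob = 1.0
    for s in seq:
        prob *= p[s]
    return prob

# A fair six-sided die: p(i) = 1/6 for every face (a loaded die would
# simply use different values for the six parameters).
fair = {i: 1 / 6 for i in range(1, 7)}
print(sequence_probability([1, 6, 3], fair))  # (1/6)**3, about 0.00463
```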
Maximum Likelihood Parameter estimation
• The parameters of a probabilistic model are typically estimated from a large set of trusted examples, called the training set.
• Example (t = tail, h = head): [t t t h t h h t]
  – Count up the frequencies: t: 5, h: 3
  – Compute probabilities: p(t) = 5/(5+3), p(h) = 3/(5+3)
  – These are the Maximum Likelihood (ML) estimates of the parameters of the coin.
  – Does it make sense?
  – What if you know the coin is fair?
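The counting recipe above is easy to code; a minimal sketch using the coin sequence from the slide:

```python
from collections import Counter

# ML estimation for the random sequence model: count each symbol's
# occurrences and divide by the total number of observations.
def ml_estimate(seq):
    counts = Counter(seq)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

# The coin example from the slide: [t t t h t h h t]
probs = ml_estimate(list("ttththht"))
print(probs)  # p(t) = 5/8 = 0.625, p(h) = 3/8 = 0.375
```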
Overfitting
• A fair coin has probabilities p(t) = 0.5, p(h) = 0.5.
• If you throw it 3 times and get [t, t, t], then the ML estimates for this sequence are p(t) = 1, p(h) = 0.
• Consequently, from this estimate, the probability of e.g. the sequence [h, t, h, t] is 0, since p(h) = 0.
• This is an example of what is called overfitting. Overfitting is the greatest enemy of Machine Learning!
• Solution 1: Get more data.
• Solution 2: Build what you already know into the model. (We will return to this during the module.)
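One common way to build prior knowledge into the estimate (an assumption here, not necessarily the method the module returns to later) is to add pseudo-counts, i.e. Laplace smoothing:

```python
from collections import Counter

def smoothed_estimate(seq, alphabet, pseudocount=1):
    """ML estimate with pseudo-counts: encodes the prior belief that no
    symbol in the alphabet is impossible, so no probability becomes 0."""
    counts = Counter(seq)
    total = len(seq) + pseudocount * len(alphabet)
    return {s: (counts[s] + pseudocount) / total for s in alphabet}

# Three tails in a row: plain ML would give p(h) = 0; smoothing does not.
print(smoothed_estimate(list("ttt"), alphabet="th"))
# {'t': 0.8, 'h': 0.2}
```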
Why is it called Maximum Likelihood?
• It can be shown that using the frequencies to compute the probabilities maximises the total probability of the observed sequences given the model (the likelihood).
• That is, P(Data | Parameters), the probability of observing the (training) data given the parameters, is maximised by setting the parameters to the observed frequencies.
Probabilities
• We have two dice, D1 and D2.
• The probability of rolling i given die D1 is called P(i | D1). This is a conditional probability.
• Pick a die at random with probability P(Dj), j = 1 or 2. The probability of picking die Dj and rolling i is called a joint probability, and is P(i, Dj) = P(Dj) P(i | Dj).
• For any events X and Y: P(X, Y) = P(X | Y) P(Y).
• If we know P(X, Y), then the so-called marginal probability P(X) can be computed as:

$$P(X) = \sum_{Y} P(X, Y)$$
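These definitions can be checked numerically. In this sketch the two dice are hypothetical (D1 fair, D2 loaded towards 6):

```python
# Conditional, joint and marginal probabilities with two dice.
p_die = {"D1": 0.5, "D2": 0.5}                        # P(Dj): which die we pick
p_roll = {
    "D1": {i: 1 / 6 for i in range(1, 7)},            # P(i | D1): conditional
    "D2": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}

def joint(i, d):
    """P(i, Dj) = P(Dj) * P(i | Dj)"""
    return p_die[d] * p_roll[d][i]

def marginal(i):
    """P(i) = sum over Dj of P(i, Dj)"""
    return sum(joint(i, d) for d in p_die)

print(marginal(6))  # 0.5 * (1/6) + 0.5 * 0.5 = 1/3
```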
• Now, we show that maximising P(Data|Parameters) for the random sequence model leads to the frequency-based computation that we did intuitively.
Notation:
  – $S = s_1, \ldots, s_L$: sequence of symbols
  – $L$: length of the sequence
  – $T$: size of the alphabet
  – $p(1), \ldots, p(T)$: parameters (probabilities)
  – $x_t$: frequency of occurrence of symbol $t$, for $t = 1, 2, \ldots, T$

The problem is to maximise:

$$P(S \mid p(1), \ldots, p(T)) = \prod_{l=1}^{L} p(s_l) = \prod_{t=1}^{T} p(t)^{x_t}$$

Equivalently, maximise the logarithm of the likelihood (the "log-likelihood"):

$$\log P(S \mid p(1), \ldots, p(T)) = \sum_{l=1}^{L} \log p(s_l) = \sum_{t=1}^{T} x_t \log p(t)$$

Remember that the $p(t)$ need to be probabilities, so $\sum_{t=1}^{T} p(t) = 1$. This constraint can be imposed by adding a Lagrangian term $\lambda \left( \sum_{t=1}^{T} p(t) - 1 \right)$.
Therefore we need to maximise:

$$\text{Obj} = \sum_{t=1}^{T} x_t \log p(t) + \lambda \left( \sum_{t=1}^{T} p(t) - 1 \right)$$

Remember, at a maximum the derivative of a function is zero:

$$\frac{\partial\, \text{Obj}}{\partial\, p(t)} = \frac{x_t}{p(t)} + \lambda = 0 \quad \Rightarrow \quad p(t) = -\frac{x_t}{\lambda}$$

Now, to compute $\lambda$, multiply both sides by $p(t)$ and add up both sides for $t = 1, \ldots, T$:

$$\sum_{t=1}^{T} x_t = -\lambda \sum_{t=1}^{T} p(t) = -\lambda$$

So,

$$p(t) = \frac{x_t}{\sum_{t'=1}^{T} x_{t'}}$$

Why did we bother? Because in more complicated models we cannot 'guess' the result.
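The derivation can also be sanity-checked numerically: for the coin sequence used earlier, the frequency-based estimates should score a log-likelihood at least as high as any alternative parameters we try. A small sketch:

```python
import math
import random
from collections import Counter

def log_likelihood(seq, p):
    """Sum of log p(s) over the symbols of the sequence."""
    return sum(math.log(p[s]) for s in seq)

seq = list("ttththht")                                # [t t t h t h h t]
counts = Counter(seq)
ml = {s: c / len(seq) for s, c in counts.items()}     # p(t)=5/8, p(h)=3/8

best = log_likelihood(seq, ml)
rng = random.Random(0)
for _ in range(1000):
    q = rng.uniform(0.01, 0.99)                       # a random alternative p(t)
    assert log_likelihood(seq, {"t": q, "h": 1 - q}) <= best + 1e-12
print("frequency estimates achieved the highest log-likelihood tried")
```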
Markov Chains
• Further examples of sequences:
  – Bio-sequences
  – Web page request sequences while browsing
• These are no longer random sequences; they have a time structure.
• How many parameters would such a model have?
• We need to make simplifying assumptions to end up with a reasonable number of parameters.
• The first-order Markov assumption: each observation depends only on the immediately previous one, not on the longer history:

$$P(s_l \mid s_1, \ldots, s_{l-1}) = P(s_l \mid s_{l-1})$$

• Markov Chain = a sequence model which makes the Markov assumption.
Markov Chains
• The probability of a Markov sequence:

$$P(S) = P(s_1)\, P(s_2 \mid s_1)\, P(s_3 \mid s_2) \cdots P(s_L \mid s_{L-1}) = P(s_1) \prod_{l=2}^{L} P(s_l \mid s_{l-1})$$

• The alphabet's symbols are also called states.
• Once the parameters are estimated from training data, the Markov chain can be used for prediction.
• Amongst others, Markov Chains are successful at predicting web browsing behaviour.
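A minimal sketch of this chain rule in code (the two-state alphabet and all probabilities here are made up):

```python
# Markov chain: P(S) = P(s1) * product over l >= 2 of P(s_l | s_{l-1}).

def markov_probability(seq, initial, transition):
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob

initial = {"a": 0.5, "b": 0.5}                 # initial state probabilities
transition = {"a": {"a": 0.9, "b": 0.1},       # each row sums to one
              "b": {"a": 0.4, "b": 0.6}}
print(markov_probability("aab", initial, transition))  # 0.5 * 0.9 * 0.1
```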
Markov Chains
• A Markov Chain is stationary if it has the same transition probabilities at any time:

$$P(s_t = i \mid s_{t-1} = j) \quad \text{(shorthand: } P(i \mid j)\text{)}$$

• We assume stationary models here.
• Then the parameters of the model consist of the transition probability matrix and the initial state probabilities.
ML parameter estimation
• We can derive how to compute the parameters of a Markov Chain from data using Maximum Likelihood, as we did for random sequences.
• The ML estimate of the transition matrix is again very intuitive:

$$P(i \mid j) = \frac{x_{ij}}{\sum_{i=1}^{T} x_{ij}}$$

where $x_{ij}$ is the number of observed transitions from state $j$ to state $i$. Remember that $\sum_{i=1}^{T} P(i \mid j) = 1$ for every $j$.
Simple example
• "If it is raining today, it will rain tomorrow with probability 0.8"; this implies the contrary has probability 0.2.
• "If it is not raining today, it will rain tomorrow with probability 0.6"; this implies the contrary has probability 0.4.
• Build the transition matrix.
• Be careful about which numbers need to sum to one and which do not. Such a matrix is called a stochastic matrix.
• Q: It rained all week, including today. What does this model predict for tomorrow? Why? What does it predict for the day after tomorrow? (*Homework)
$$\begin{pmatrix} P(\text{rain tomorrow} \mid \text{rain today}) & P(\text{no rain tomorrow} \mid \text{rain today}) \\ P(\text{rain tomorrow} \mid \text{no rain today}) & P(\text{no rain tomorrow} \mid \text{no rain today}) \end{pmatrix} = \begin{pmatrix} 0.8 & 0.2 \\ 0.6 & 0.4 \end{pmatrix}$$
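One way to check the homework question numerically is to propagate today's state distribution through the transition matrix, once for tomorrow and twice for the day after (a sketch; state 0 is rain, state 1 is no rain):

```python
# Rain model as a stochastic matrix: rows are "today", columns "tomorrow".
P = [[0.8, 0.2],   # row 0: rain today
     [0.6, 0.4]]   # row 1: no rain today

def step(dist, P):
    """One step of the chain: new[j] = sum over i of dist[i] * P[i][j]."""
    return [sum(dist[i] * P[i][j] for i in range(len(P)))
            for j in range(len(P))]

today = [1.0, 0.0]             # it rained today
tomorrow = step(today, P)      # [0.8, 0.2]
day_after = step(tomorrow, P)  # approximately [0.76, 0.24]
print(tomorrow, day_after)
```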
Examples of Web Applications

• HTTP request prediction:
  – Predict the probabilities of the next requests from the same user based on the history of requests from that client.
• Adaptive Web navigation:
  – Build a navigation agent which suggests which other links would be of interest to the user, based on the statistics of previous visits.
  – The predicted link does not strictly have to be a link present in the Web page currently being viewed.
• Tour generation:
  – Given the starting URL as input, generates a sequence of states (or URLs) using the Markov chain process.
Building Markov Models from Web Log Files
• A Web log file is a collection of records of user requests for documents on a Web site. An example record:

  177.21.3.4 - - [04/Apr/1999:00:01:11 +0100] "GET /studaffairs/ccampus.html HTTP/1.1" 200 5327 "http://www.ulst.ac.uk/studaffairs/accomm.html" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

• The transition matrix can be seen as a graph:
  – Link pair: (r: referrer, u: requested page, w: hyperlink weight)
  – Link graph: this is called the state diagram of the Markov Chain
    • a directed weighted graph
    • a hierarchy from the homepage down to multiple levels
Link Graph: an example (University of Ulster site)

[Figure: state diagram of the site's Markov Chain. Nodes are states: Start, University of Ulster, Department, Student Information, CS, Science & Arts, International Office, Library, Undergraduate, Graduate, Jobs, Register, Exit. Weighted arrows give the number of transitions between pages. Zhu et al. 2002]
Experimental Results (Sarukkai, 2000)

• Simulations:
  – 'Correct link' refers to the actual link chosen at the next step.
  – 'Depth of the correct link' is measured by counting the number of links which have a probability greater than or equal to that of the correct link.
  – Over 70% of correct links are in the top 20 scoring states.
  – Difficulties: very large state space.
Simple exercise
• Build the Markov transition matrix of the following sequence:
  [a b b a c a b c b b d e e d e d e d]

State space: {…………….}

Fill in the matrix (each row conditions on the current state; entries for transitions that never occur are 0):

$$\begin{pmatrix} P(a \mid a) & P(b \mid a) & P(c \mid a) & P(d \mid a) & P(e \mid a) \\ P(a \mid b) & P(b \mid b) & P(c \mid b) & P(d \mid b) & P(e \mid b) \\ P(a \mid c) & P(b \mid c) & P(c \mid c) & P(d \mid c) & P(e \mid c) \\ P(a \mid d) & P(b \mid d) & P(c \mid d) & P(d \mid d) & P(e \mid d) \\ P(a \mid e) & P(b \mid e) & P(c \mid e) & P(d \mid e) & P(e \mid e) \end{pmatrix}$$
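A sketch of how such a matrix can be estimated mechanically, using the ML counting rule from the earlier slide (it also recovers the state space, so treat it as a way to check your own answer):

```python
from collections import Counter, defaultdict

# ML estimate of the transition matrix:
# P(i | j) = (# transitions j -> i) / (# transitions out of j).
seq = "abbacabcbbdeededed"   # [a b b a c a b c b b d e e d e d e d]

counts = defaultdict(Counter)
for prev, cur in zip(seq, seq[1:]):
    counts[prev][cur] += 1

states = sorted(set(seq))    # the state space
P = {j: {i: counts[j][i] / sum(counts[j].values()) for i in states}
     for j in states}
for j in states:
    print(j, [round(P[j][i], 2) for i in states])
```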
Further topics
• Hidden Markov Model
  – Does not make the Markov assumption on the observed sequence
  – Instead, it assumes that the observed sequence was generated by another sequence which is unobservable (hidden), and this hidden sequence is assumed to be Markovian
  – More powerful
  – Estimation is more complicated
• Aggregate Markov model
  – Useful for clustering sub-graphs of a transition graph

$$P(s_t \mid s_{t-1}) = \sum_{k=1}^{K} P(s_t \mid k)\, P(k \mid s_{t-1})$$
HMM at an intuitive level
• Suppose that we know all the parameters of the following HMM, as shown on the state diagram (figure not reproduced in this transcript). What is the probability of observing the sequence [A, B] if the initial state is S1? The same question if the initial state is chosen randomly with equal probability?
ANSWER:
If the initial state is S1: 0.2*(0.4*0.8+0.6*0.7) = 0.148.
In the second case: 0.5*0.148+0.5*0.3*(0.3*0.7+0.7*0.8) = 0.1895.
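Since the slide's state diagram is not reproduced here, the parameters below are an assumption: one set of emission and transition probabilities consistent with the answers quoted above. With them, the forward computation reproduces both numbers:

```python
# Assumed parameters (the original diagram is not shown in this transcript);
# they are one set consistent with the answers 0.148 and 0.1895.
emission = {"S1": {"A": 0.2, "B": 0.8},
            "S2": {"A": 0.3, "B": 0.7}}
transition = {"S1": {"S1": 0.4, "S2": 0.6},
              "S2": {"S1": 0.7, "S2": 0.3}}

def forward(obs, initial):
    """Forward algorithm: sums over every hidden state path."""
    alpha = {s: initial[s] * emission[s][obs[0]] for s in initial}
    for o in obs[1:]:
        alpha = {s: sum(alpha[r] * transition[r][s] for r in alpha)
                    * emission[s][o] for s in alpha}
    return sum(alpha.values())

print(forward("AB", {"S1": 1.0, "S2": 0.0}))  # about 0.148
print(forward("AB", {"S1": 0.5, "S2": 0.5}))  # about 0.1895
```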
Conclusions
• Probabilistic Model
• Maximum Likelihood parameter estimation
• Random sequence model
• Markov chain model
---------------------------------
• Hidden Markov Model
• Aggregate Markov Model
Any questions?