markov chains as a learning tool. 2 weather: raining today40% rain tomorrow 60% no rain tomorrow not...

23
. Markov Chains as a Learning Tool

Upload: isabel-griffith

Post on 18-Dec-2015

219 views

Category:

Documents


0 download

TRANSCRIPT

.

Markov Chains as a Learning Tool

2

Weather:

• raining today 40% rain tomorrow

60% no rain tomorrow

• not raining today 20% rain tomorrow

80% no rain tomorrow

Markov ProcessSimple Example

rain no rain

0.60.4 0.8

0.2

Stochastic Finite State Machine:

3

Weather:

• raining today 40% rain tomorrow

60% no rain tomorrow

• not raining today 20% rain tomorrow

80% no rain tomorrow

Markov ProcessSimple Example

8.02.0

6.04.0P

• Stochastic matrix:Rows sum up to 1

• Double stochastic matrix:Rows and columns sum up to 1

The transition matrix:

Rain No rain

Rain

No rain

Markov Process

• Markov Property: Xt+1, the state of the system at time t+1 depends only on the state of the system at time t

X1 X2 X3 X4 X5

x | X x X x x X | X x X tttttttt 111111 PrPr

• Stationary Assumption: Transition probabilities are independent of time (t)

1Pr t t ab X b | X a p

Let Xi be the weather of day i, 1 <= i <= t. We may decide the probability of Xt+1 from Xi, 1 <= i <= t.

5

– Gambler starts with $10 (the initial state)

- At each play we have one of the following:

• Gambler wins $1 with probability p

• Gambler looses $1 with probability 1-p

– Game ends when gambler goes broke, or gains a fortune of $100

(Both 0 and 100 are absorbing states)

0 1 2 99 100

p p p p

1-p 1-p 1-p 1-pStart (10$)

Markov ProcessGambler’s Example

1-p

6

• Markov process - described by a stochastic FSM

• Markov chain - a random walk on this graph (distribution over paths)

• Edge-weights give us

• We can ask more complex questions, like

Markov Process

1Pr t t ab X b | X a p

?Pr 2 b a | X X tt

0 1 2 99 100

p p p p

1-p 1-p 1-p 1-pStart (10$)

7

• Given that a person’s last cola purchase was Coke, there is a 90% chance that his next cola purchase will also be Coke.

• If a person’s last cola purchase was Pepsi, there is an 80% chance that his next cola purchase will also be Pepsi.

coke pepsi

0.10.9 0.8

0.2

Markov ProcessCoke vs. Pepsi Example

8.02.0

1.09.0P

transition matrix:

coke pepsi

coke

pepsi

8.02.0

1.09.0P

8

Given that a person is currently a Pepsi purchaser, what is the probability that he will purchase Coke two purchases from now?

Pr[ Pepsi?Coke ] =

Pr[ PepsiCokeCoke ] + Pr[ Pepsi Pepsi Coke ] =

0.2 * 0.9 + 0.8 * 0.2 = 0.34

66.034.0

17.083.0

8.02.0

1.09.0

8.02.0

1.09.02P

Markov ProcessCoke vs. Pepsi Example (cont)

Pepsi ? ? Coke

9

Given that a person is currently a Coke purchaser, what is the probability that he will buy Pepsi at the third purchase from now?

Markov ProcessCoke vs. Pepsi Example (cont)

562.0438.0

219.0781.0

66.034.0

17.083.0

8.02.0

1.09.03P

10

•Assume each person makes one cola purchase per week

•Suppose 60% of all people now drink Coke, and 40% drink Pepsi

•What fraction of people will be drinking Coke three weeks from now?

Markov ProcessCoke vs. Pepsi Example (cont)

8.02.0

1.09.0P

562.0438.0

219.0781.03P

Pr[X3=Coke] = 0.6 * 0.781 + 0.4 * 0.438 = 0.6438

Qi - the distribution in week i

Q0= (0.6,0.4) - initial distribution

Q3= Q0 * P3 =(0.6438,0.3562)

11

Simulation:

Markov ProcessCoke vs. Pepsi Example (cont)

week - i

Pr[

Xi =

Cok

e]

2/3

31

32

31

32

8.02.0

1.09.0

stationary distribution

coke pepsi

0.10.9 0.8

0.2

How to obtain Stochastic matrix?

Solve the linear equations, e.g.,

Learn from examples, e.g., what letters follow what letters in English words: mast, tame, same, teams, team, meat, steam, stem.

12

31

32

2,21,2

2,111

31

32

xx

xx ,

How to obtain Stochastic matrix?

Counts table vs Stochastic Matrix

13

P a s t m e \0

a 0 1/7 1/7 5/7 0 0

e 4/7 0 0 1/7 0 2/7

m 1/8 1/8 0 0 3/8 3/8

s 1/5 0 3/5 0 0 1/5

t 1/7 0 0 0 4/7 2/7

@ 0 3/8 3/8 2/8 0 0

Application of Stochastic matrix

Using Stochastic Matrix to generate a random word: Generate most likely first letter For each current letter generate most likely next letter

14

A a s t m e \0

a - 1 2 7 - -

e 4 - - 5 - 7

m 1 2 - - 5 8

s 1 - 4 - - 5

t 1 - - - 5 7

@ - 3 6 8 - -

If C[r,j] > 0, let A[r,j] = C[r,1]+C[r,2]+…+C[r,j]

C

Application of Stochastic matrix Using Stochastic Matrix to generate a random word:

Generate most likely first letter: Generate a random number x between 1 and 8. If 1 <= x <= 3, the letter is ‘s’; if 4 <= x <= 6, the letter is ‘t’; otherwise, it’s ‘m’. For each current letter generate

most likely next letter: Suppose

the current letter is ‘s’ and we

generate a random number x

between 1 and 5. If x = 1, the next

letter is ‘a’; if 2 <= x <= 4, the next

letter is ‘t’; otherwise, the current

letter is an ending letter.

15

A a s t m e \0

a - 1 2 7 - -

e 4 - - 5 - 7

m 1 2 - - 5 8

s 1 - 4 - - 5

t 1 - - - 5 7

@ - 3 6 8 - -

If C[r,j] > 0, let A[r,j] = C[r,1]+C[r,2]+…+C[r,j]

Supervised vs Unsupervised

Decision tree learning is “supervised learning” as we know the correct output of each example.

Learning based on Markov chains is “unsupervised learning” as we don’t know which is the correct output of “next letter”.

16

K-Nearest Neighbor

Features All instances correspond to points in an n-

dimensional Euclidean space Classification is delayed till a new instance

arrives Classification done by comparing feature

vectors of the different points Target function may be discrete or real-valued

1-Nearest Neighbor

3-Nearest Neighbor

20

Example:Identify Animal Type

14 examples10 attributes5 types

What’s the type ofthis new animal?

K-Nearest Neighbor

An arbitrary instance is represented by (a1(x), a2(x), a3(x),.., an(x))

ai(x) denotes features Euclidean distance between two instances

d(xi, xj)=sqrt (sum for r=1 to n (ar(xi) - ar(xj))2) Continuous valued target function

mean value of the k nearest training examples

Distance-Weighted Nearest Neighbor Algorithm

Assign weights to the neighbors based on their ‘distance’ from the query point

Weight ‘may’ be inverse square of the distances

All training points may influence a particular instance

Shepard’s method

Remarks

+ Highly effective inductive inference method for noisy training data and complex target functions

+ Target function for a whole space may be described as a combination of less complex local approximations

+ Learning is very simple- Classification is time consuming (except 1NN)