
Page 1: Sequential Data Modeling - Conditional Random Fields

Sequential Data Modeling - Conditional Random Fields

Graham Neubig
Nara Institute of Science and Technology (NAIST)

Page 2: Sequential Data Modeling - Conditional Random Fields


Prediction Problems

Given x, predict y

Page 3: Sequential Data Modeling - Conditional Random Fields


Prediction Problems

Given x, predict y. For example:

● A book review (“Oh, man I love this book!” / “This book is so boring...”) → Is it positive? (yes / no): binary prediction (2 choices)

● A tweet (“On the way to the park!” / “公園に行くなう!”, roughly “Heading to the park now!”) → Its language (English / Japanese): multi-class prediction (several choices)

● A sentence (“I read a book”) → Its parts of speech (“N VBD DET NN”): structured prediction (millions of choices)

Page 4: Sequential Data Modeling - Conditional Random Fields


Logistic Regression

Page 5: Sequential Data Modeling - Conditional Random Fields


Example we will use:

● Given an introductory sentence from Wikipedia

● Predict whether the article is about a person

● This is binary classification (of course!)

Given:
“Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.”
Predict: Yes!

Given:
“Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.”
Predict: No!

Page 6: Sequential Data Modeling - Conditional Random Fields


Review: Linear Prediction Model

● Each element that helps us predict is a feature

● Each feature has a weight, positive if it indicates “yes”, and negative if it indicates “no”

● For a new example, sum the weights

● If the sum is at least 0: “yes”, otherwise: “no”

Features and weights:

contains “priest”: w = 2
contains “(<#>-<#>)”: w = 1
contains “site”: w = -3
contains “Kyoto Prefecture”: w = -1

“Kuya (903-972) was a priest born in Kyoto Prefecture.” → 2 + -1 + 1 = 2 → “yes”

Page 7: Sequential Data Modeling - Conditional Random Fields


Review: Mathematical Formulation

y = sign(w⋅φ(x)) = sign(∑_{i=1..I} w_i⋅φ_i(x))

● x: the input
● φ(x): vector of feature functions {φ_1(x), φ_2(x), …, φ_I(x)}
● w: the weight vector {w_1, w_2, …, w_I}
● y: the prediction, +1 if “yes”, -1 if “no”
● (sign(v) is +1 if v ≥ 0, -1 otherwise)
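As a concrete illustration of the two slides above, here is a minimal sketch in Python of the rule y = sign(w⋅φ(x)); the predict() helper and the feature dictionaries are illustrative assumptions, with the weights taken from the review slide:

    # A sketch of y = sign(w.phi(x)) with the weights from the review slide.
    w = {"contains priest": 2, "contains (<#>-<#>)": 1,
         "contains site": -3, "contains Kyoto Prefecture": -1}

    def predict(w, phi):
        # sign(v): +1 if v >= 0, -1 otherwise
        score = sum(w.get(name, 0) * value for name, value in phi.items())
        return 1 if score >= 0 else -1

    # "Kuya (903-972) was a priest born in Kyoto Prefecture."
    phi = {"contains priest": 1, "contains (<#>-<#>)": 1,
           "contains Kyoto Prefecture": 1}
    print(predict(w, phi))  # 2 + 1 + (-1) = 2 >= 0, so +1 ("yes")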

Page 8: Sequential Data Modeling - Conditional Random Fields


Perceptron and Probabilities

● Sometimes we want the probability P(y|x):
● Estimating confidence in predictions
● Combining with other systems

● However, the perceptron only gives us a prediction: y = sign(w⋅φ(x))

In other words:
P(y=1|x) = 1 if w⋅φ(x) ≥ 0
P(y=1|x) = 0 if w⋅φ(x) < 0

[Plot: P(y|x) as a step function of w⋅φ(x)]

Page 9: Sequential Data Modeling - Conditional Random Fields


The Logistic Function

● The logistic function is a “softened” version of the step function used in the perceptron

[Plot: the perceptron's step function vs. the logistic function, P(y|x) against w⋅φ(x)]

P(y=1|x) = e^(w⋅φ(x)) / (1 + e^(w⋅φ(x)))

● Can account for uncertainty
● Differentiable
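A quick numeric sketch of the logistic function (the helper name is illustrative; values rounded):

    import math

    def logistic(v):
        # P(y=1|x) where v = w.phi(x)
        return math.exp(v) / (1 + math.exp(v))

    print(logistic(0))   # 0.5: maximum uncertainty at the decision boundary
    print(logistic(5))   # ~0.993: confident "yes"
    print(logistic(-5))  # ~0.007: confident "no"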

Page 10: Sequential Data Modeling - Conditional Random Fields


Logistic Regression

● Train based on conditional likelihood
● Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:

w = argmax_w ∏_i P(y_i|x_i; w)

● How do we solve this?

Page 11: Sequential Data Modeling - Conditional Random Fields


Review: Perceptron Training Algorithm

    from collections import defaultdict

    w = defaultdict(float)                  # create map w
    for _ in range(I):                      # for I iterations
        for x, y in data:                   # each labeled pair; y is +1 or -1
            phi = create_features(x)
            y_pred = predict_one(w, phi)
            if y_pred != y:                 # we made a mistake
                for name, value in phi.items():
                    w[name] += y * value    # w += y * phi

● In other words:
● Try to classify each training example
● Every time we make a mistake, update the weights

Page 12: Sequential Data Modeling - Conditional Random Fields


Stochastic Gradient Descent

● Online training algorithm for probabilistic models (including logistic regression)

    create map w
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y|x)/dw

● In other words:
● For every training example, calculate the gradient (the direction that will increase the probability of y)
● Move in that direction, multiplied by learning rate α

Page 13: Sequential Data Modeling - Conditional Random Fields


Gradient of the Logistic Function

● Take the derivative of the probability:

d/dw P(y=1|x) = d/dw [e^(w⋅φ(x)) / (1 + e^(w⋅φ(x)))]
             = φ(x) e^(w⋅φ(x)) / (1 + e^(w⋅φ(x)))^2

d/dw P(y=-1|x) = d/dw [1 - e^(w⋅φ(x)) / (1 + e^(w⋅φ(x)))]
              = -φ(x) e^(w⋅φ(x)) / (1 + e^(w⋅φ(x)))^2

[Plot: dP(y|x)/d(w⋅φ(x)) against w⋅φ(x), peaking at 0]
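The derivative with respect to each weight is the scalar e^(w⋅φ(x)) / (1 + e^(w⋅φ(x)))^2 times that weight's feature value, so an SGD step can be sketched as below; grad_coeff() and sgd_step() are illustrative names, not code from the lecture:

    import math

    def grad_coeff(v):
        # e^v / (1 + e^v)^2; the sign flips for y = -1
        return math.exp(v) / (1 + math.exp(v)) ** 2

    def sgd_step(w, phi, y, alpha=1.0):
        # w += alpha * dP(y|x)/dw, following the SGD pseudocode above
        v = sum(w.get(name, 0.0) * value for name, value in phi.items())
        c = grad_coeff(v) if y == 1 else -grad_coeff(v)
        for name, value in phi.items():
            w[name] = w.get(name, 0.0) + alpha * c * value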

Page 14: Sequential Data Modeling - Conditional Random Fields


Example: Initial Update

● Set α = 1, initialize w = 0

x = “A site , located in Maizuru , Kyoto”   y = -1

w⋅φ(x) = 0
d/dw P(y=-1|x) = -e^0/(1+e^0)^2 φ(x) = -0.25 φ(x)
w ← w + (-0.25) φ(x)

Updated weights:
w_unigram“A” = -0.25
w_unigram“site” = -0.25
w_unigram“,” = -0.5 (“,” appears twice)
w_unigram“located” = -0.25
w_unigram“in” = -0.25
w_unigram“Maizuru” = -0.25
w_unigram“Kyoto” = -0.25

Page 15: Sequential Data Modeling - Conditional Random Fields


Example: Second Update

x = “Shoken , monk born in Kyoto”   y = 1

w⋅φ(x) = -1
d/dw P(y=1|x) = e^(-1)/(1+e^(-1))^2 φ(x) = 0.196 φ(x)
w ← w + 0.196 φ(x)

Updated weights:
w_unigram“A” = -0.25
w_unigram“site” = -0.25
w_unigram“,” = -0.304
w_unigram“located” = -0.25
w_unigram“in” = -0.054
w_unigram“Maizuru” = -0.25
w_unigram“Kyoto” = -0.054
w_unigram“Shoken” = 0.196
w_unigram“monk” = 0.196
w_unigram“born” = 0.196
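Both update magnitudes can be checked directly from the gradient formula, using nothing beyond the slides:

    import math
    for v in (0.0, -1.0):  # w.phi(x) before each update
        print(math.exp(v) / (1 + math.exp(v)) ** 2)
    # 0.25   -> first update (w.phi(x) = 0)
    # ~0.196 -> second update (w.phi(x) = -1)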

Page 16: Sequential Data Modeling - Conditional Random Fields


Calculating Optimal Sequences, Probabilities

Page 17: Sequential Data Modeling - Conditional Random Fields


Sequence Likelihood

● Logistic regression considered the probability P(y|x) of a single label y ∈ {-1, +1}
● What if we want to consider the probability P(Y|X) of a sequence?

X_i = “I visited Nara”
Y_i = “PRN VBD NNP”

Page 18: Sequential Data Modeling - Conditional Random Fields


Calculating Multi-class Probabilities

● Each sequence has its own feature vector
● Use weights for each feature to calculate scores

Feature vectors for the four taggings of “time flies”:

“N V”: φ_{T,<S>,N}=1, φ_{T,N,V}=1, φ_{T,V,</S>}=1, φ_{E,N,time}=1, φ_{E,V,flies}=1
“V N”: φ_{T,<S>,V}=1, φ_{T,V,N}=1, φ_{T,N,</S>}=1, φ_{E,V,time}=1, φ_{E,N,flies}=1
“N N”: φ_{T,<S>,N}=1, φ_{T,N,N}=1, φ_{T,N,</S>}=1, φ_{E,N,time}=1, φ_{E,N,flies}=1
“V V”: φ_{T,<S>,V}=1, φ_{T,V,V}=1, φ_{T,V,</S>}=1, φ_{E,V,time}=1, φ_{E,V,flies}=1

With weights w_{T,<S>,N}=1, w_{E,N,time}=1, w_{T,V,</S>}=1 (all others 0):

φ(“N V”)⋅w = 3   φ(“N N”)⋅w = 2   φ(“V V”)⋅w = 1   φ(“V N”)⋅w = 0

Page 19: Sequential Data Modeling - Conditional Random Fields


The Softmax Function

● Turn scores into probabilities by taking the exponent and normalizing (the softmax function):

P(Y|X) = e^(w⋅φ(Y,X)) / ∑_{Ỹ} e^(w⋅φ(Ỹ,X))

Taking the exponent of each score:

exp(φ(“N V”)⋅w) = 20.08   exp(φ(“N N”)⋅w) = 7.39
exp(φ(“V V”)⋅w) = 2.72    exp(φ(“V N”)⋅w) = 1.00

and normalizing:

P(N V | time flies) = .6437   P(N N | time flies) = .2369
P(V V | time flies) = .0872   P(V N | time flies) = .0320
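The four probabilities follow directly from the scores on the previous slide; a short sketch of the softmax computation (values rounded):

    import math

    scores = {"N V": 3, "N N": 2, "V V": 1, "V N": 0}  # w.phi(Y,X) per sequence
    Z = sum(math.exp(s) for s in scores.values())      # normalizer, ~31.19
    for seq, s in scores.items():
        print(seq, math.exp(s) / Z)
    # N V ~.644, N N ~.237, V V ~.087, V N ~.032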

Page 20: Sequential Data Modeling - Conditional Random Fields


Calculating Edge Features

● Like the perceptron, we can calculate features for each edge of the tagging lattice <S> → {N, V} → {N, V} → </S> over “time flies”:

<S>→N (“time”): φ_{T,<S>,N}=1, φ_{E,N,time}=1
<S>→V (“time”): φ_{T,<S>,V}=1, φ_{E,V,time}=1
N→N (“flies”): φ_{T,N,N}=1, φ_{E,N,flies}=1
N→V (“flies”): φ_{T,N,V}=1, φ_{E,V,flies}=1
V→N (“flies”): φ_{T,V,N}=1, φ_{E,N,flies}=1
V→V (“flies”): φ_{T,V,V}=1, φ_{E,V,flies}=1
N→</S>: φ_{T,N,</S>}=1
V→</S>: φ_{T,V,</S>}=1

Page 21: Sequential Data Modeling - Conditional Random Fields


Calculating Edge Probabilities

● Calculate scores for each edge, and take the exponent:

<S>→N: e^(w⋅φ) = 7.39, P = .881
<S>→V: e^(w⋅φ) = 1.00, P = .119
N→N: e^(w⋅φ) = 1.00, P = .237
N→V: e^(w⋅φ) = 1.00, P = .644
V→N: e^(w⋅φ) = 1.00, P = .032
V→V: e^(w⋅φ) = 1.00, P = .087
N→</S>: e^(w⋅φ) = 1.00, P = .269
V→</S>: e^(w⋅φ) = 2.72, P = .731

● This is now the same form as the HMM
● Can use the Viterbi algorithm
● Calculate probabilities using forward-backward
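On this toy lattice, the edge probabilities above are exactly the marginals that forward-backward would compute; with only four hypotheses we can verify a few of them by brute-force summation over whole sequences (a check, not the forward-backward algorithm itself):

    p = {"N V": .644, "N N": .237, "V V": .087, "V N": .032}
    print(p["N V"] + p["N N"])  # .881 = P(<S>→N)
    print(p["V V"] + p["V N"])  # .119 = P(<S>→V)
    print(p["N V"] + p["V V"])  # .731 = P(V→</S>)
    print(p["N N"] + p["V N"])  # .269 = P(N→</S>)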

Page 22: Sequential Data Modeling - Conditional Random Fields


Conditional Random Fields

Page 23: Sequential Data Modeling - Conditional Random Fields


Maximizing CRF Likelihood

● Want to maximize the likelihood for sequences:

w = argmax_w ∏_i P(Y_i|X_i; w)   where   P(Y|X) = e^(w⋅φ(Y,X)) / ∑_{Ỹ} e^(w⋅φ(Ỹ,X))

● For convenience, we consider the log likelihood:

log P(Y|X) = w⋅φ(Y,X) - log ∑_{Ỹ} e^(w⋅φ(Ỹ,X))

● Want to find the gradient d/dw log P(Y|X) for stochastic gradient descent

Page 24: Sequential Data Modeling - Conditional Random Fields


Deriving a CRF Gradient:

log P(Y|X) = w⋅φ(Y,X) - log ∑_{Ỹ} e^(w⋅φ(Ỹ,X))
           = w⋅φ(Y,X) - log Z

d/dw log P(Y|X) = φ(Y,X) - d/dw log ∑_{Ỹ} e^(w⋅φ(Ỹ,X))
                = φ(Y,X) - (1/Z) ∑_{Ỹ} d/dw e^(w⋅φ(Ỹ,X))
                = φ(Y,X) - ∑_{Ỹ} (e^(w⋅φ(Ỹ,X)) / Z) φ(Ỹ,X)
                = φ(Y,X) - ∑_{Ỹ} P(Ỹ|X) φ(Ỹ,X)

Page 25: Sequential Data Modeling - Conditional Random Fields


In Other Words...

● To get the gradient:

d/dw log P(Y|X) = φ(Y,X) - ∑_{Ỹ} P(Ỹ|X) φ(Ỹ,X)

we add the correct feature vector, and subtract the expectation of the features.

Page 26: Sequential Data Modeling - Conditional Random Fields


Example

Using the feature vectors for “time flies” from before, with correct answer Y = “N V” and probabilities:

P(N V | time flies) = .644   P(N N | time flies) = .237
P(V V | time flies) = .087   P(V N | time flies) = .032

the gradient for each feature is:

φ_{T,<S>,N}, φ_{E,N,time}: 1 - .644 - .237 = .119
φ_{T,N,V}: 1 - .644 = .356
φ_{T,V,</S>}, φ_{E,V,flies}: 1 - .644 - .087 = .269
φ_{T,V,N}: 0 - .032 = -.032
φ_{T,N,N}: 0 - .237 = -.237
φ_{T,V,V}: 0 - .087 = -.087
φ_{T,<S>,V}, φ_{E,V,time}: 0 - .032 - .087 = -.119
φ_{T,N,</S>}, φ_{E,N,flies}: 0 - .032 - .237 = -.269
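The whole table can be reproduced by brute-force enumeration, which is only feasible because this toy example has four hypotheses; a sketch, with the feature vectors and probabilities copied from the slides:

    from collections import defaultdict

    phi = {  # feature vectors for "time flies" (each listed feature fires once)
        "N V": ["T,<S>,N", "T,N,V", "T,V,</S>", "E,N,time", "E,V,flies"],
        "V N": ["T,<S>,V", "T,V,N", "T,N,</S>", "E,V,time", "E,N,flies"],
        "N N": ["T,<S>,N", "T,N,N", "T,N,</S>", "E,N,time", "E,N,flies"],
        "V V": ["T,<S>,V", "T,V,V", "T,V,</S>", "E,V,time", "E,V,flies"],
    }
    p = {"N V": .644, "N N": .237, "V V": .087, "V N": .032}

    gradient = defaultdict(float)
    for f in phi["N V"]:         # add the correct feature vector
        gradient[f] += 1
    for seq, prob in p.items():  # subtract the feature expectation
        for f in phi[seq]:
            gradient[f] -= prob
    print(dict(gradient))  # e.g. T,<S>,N: .119, T,N,V: .356, T,N,N: -.237, ...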

Page 27: Sequential Data Modeling - Conditional Random Fields


Combinatorial Explosion

d/dw log P(Y|X) = φ(Y,X) - ∑_{Ỹ} P(Ỹ|X) φ(Ỹ,X)

● Problem!: the number of hypotheses Ỹ in the sum is exponential: O(T^|X|), where T is the number of tags. For example, 40 tags and a 10-word sentence already give 40^10 ≈ 10^16 sequences.

Page 28: Sequential Data Modeling - Conditional Random Fields


Calculate Feature Expectations Using Edge Probabilities!

● If we know the edge probabilities, just multiply them! For the first position of “time flies”:

<S>→N (“time”): φ_{T,<S>,N}=1, φ_{E,N,time}=1; e^(w⋅φ) = 7.39, P = .881
<S>→V (“time”): φ_{T,<S>,V}=1, φ_{E,V,time}=1; e^(w⋅φ) = 1.00, P = .119

Gradient contributions:

φ_{T,<S>,N}, φ_{E,N,time}: 1 - .881 = .119
φ_{T,<S>,V}, φ_{E,V,time}: 0 - .119 = -.119

Same answer as when we explicitly expand all Ỹ:

φ_{T,<S>,N}, φ_{E,N,time}: 1 - .644 - .237 = .119
φ_{T,<S>,V}, φ_{E,V,time}: 0 - .032 - .087 = -.119

Page 29: Sequential Data Modeling - Conditional Random Fields


CRF Training Procedure

    create map w
    for I iterations
        for each labeled pair X, Y in the data
            gradient = φ(Y,X)
            calculate e^(w⋅φ(edge)) for each edge
            run the forward-backward algorithm to get P(edge)
            for each edge
                gradient -= P(edge) * φ(edge)
            w += α * gradient

● Can perform stochastic gradient descent, like logistic regression
● The only major difference is the gradient calculation
● α is the learning rate
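A sketch of the gradient computation in the inner loop, assuming the edge posteriors P(edge) have already been produced by forward-backward; the function and argument names are illustrative, not the lecture's code:

    from collections import defaultdict

    def crf_gradient(gold_features, edge_probs, edge_features):
        # gold_features: feature counts of the correct sequence Y
        # edge_probs[e]: marginal probability of edge e (forward-backward)
        # edge_features[e]: feature counts fired by edge e
        gradient = defaultdict(float)
        for f, v in gold_features.items():
            gradient[f] += v                    # gradient = phi(Y,X)
        for e, prob in edge_probs.items():
            for f, v in edge_features[e].items():
                gradient[f] -= prob * v         # gradient -= P(edge)*phi(edge)
        return gradient

    # then: for f, g in crf_gradient(...).items(): w[f] += alpha * g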

Page 30: Sequential Data Modeling - Conditional Random Fields


Learning Algorithms

Page 31: Sequential Data Modeling - Conditional Random Fields


Batch Learning

● Online learning: update after each example

Online Stochastic Gradient Descent:

    create map w
    for I iterations
        for each labeled pair x, y in the data
            w += α * dP(y|x)/dw

● Batch learning: update after all examples

Batch Stochastic Gradient Descent:

    create map w
    for I iterations
        gradient = 0
        for each labeled pair x, y in the data
            gradient += α * dP(y|x)/dw
        w += gradient

Page 32: Sequential Data Modeling - Conditional Random Fields


Batch Learning Algorithms: Newton/Quasi-Newton Methods

● Newton-Raphson method:
● Choose how far to update using the second-order derivatives (the Hessian matrix)
● Faster convergence, but |w|×|w| time and memory

● Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm:
● Guesses the second-order derivatives from first-order ones
● Probably the most widely used
● Library: http://www.chokkan.org/software/liblbfgs/

● More information: http://homes.cs.washington.edu/~galen/files/quasi-newton-notes.pdf

Page 33: Sequential Data Modeling - Conditional Random Fields


Online Learning vs. Batch Learning

● Online:
● In general, simpler mathematical derivation
● Often converges faster

● Batch:
● More stable (does not change based on the order of examples)
● Trivially parallelizable

Page 34: Sequential Data Modeling - Conditional Random Fields


Regularization

Page 35: Sequential Data Modeling - Conditional Random Fields


Cannot Distinguish Between Large and Small Classifiers

● For these examples:

-1 he saw a bird in the park
+1 he saw a robbery in the park

● Which classifier is better?

Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
Classifier 2: bird -1, robbery +1

Page 36: Sequential Data Modeling - Conditional Random Fields


Cannot Distinguish Between Large and Small Classifiers

(same examples and classifiers as the previous slide)

Probably classifier 2! It doesn't use irrelevant information.

Page 37: Sequential Data Modeling - Conditional Random Fields


Regularization

● A penalty on adding extra weights

● L2 regularization:
● Big penalty on large weights, small penalty on small weights
● High accuracy

● L1 regularization:
● Uniform increase whether large or small
● Will cause many weights to become zero → small model

[Plot: the L1 and L2 penalties as a function of the weight value]

Page 38: Sequential Data Modeling - Conditional Random Fields


Regularization in Logistic Regression/CRFs

● To do so in logistic regression/CRFs, we add the penalty to the log likelihood (for the whole corpus):

w = argmax_w (∑_i log P(Y_i|X_i; w)) - c ∑_{w∈w} w^2   (L2 regularization)

● c adjusts the strength of the regularization:
● smaller: more freedom to fit the data
● larger: less freedom to fit the data, better generalization

● L1 is also used; it is slightly more difficult to optimize
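In SGD this penalty typically shows up as an extra term in each update, since d/dw (-c w^2) = -2cw; the sketch below is one plausible realization under that assumption, not the lecture's exact procedure:

    def l2_sgd_step(w, gradient, alpha=0.1, c=0.01):
        # follow the log-likelihood gradient, then shrink each touched
        # weight by the derivative of the L2 penalty, 2*c*w
        for f, g in gradient.items():
            w[f] = w.get(f, 0.0) + alpha * (g - 2 * c * w.get(f, 0.0))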

Page 39: Sequential Data Modeling - Conditional Random Fields


Conclusion

Page 40: Sequential Data Modeling - Conditional Random Fields


Conclusion

● Logistic regression is a probabilistic classifier

● Conditional random fields are probabilistic models for structured discriminative prediction

● They can be trained using:
● Online stochastic gradient descent (like the perceptron)
● Batch learning, using a method such as L-BFGS

● Regularization can help solve problems of overfitting

Page 41: Sequential Data Modeling - Conditional Random Fields


Thank You!