
Page 1: Classical Probabilistic Models and Conditional Random Fields

Presented by Jian-Shiun Tzeng 4/24/2009

Classical Probabilistic Models and Conditional Random Fields

Roman Klinger1, Katrin Tomanek2

Algorithm Engineering Report (TR07-2-013), 2007

1. Dortmund University of Technology, Department of Computer Science
2. Language & Information Engineering (JULIE) Lab

Page 2: Classical Probabilistic Models and Conditional Random Fields

2

Abstract

• Introduction
• Naïve Bayes
• HMM
• ME
• CRF

Page 3: Classical Probabilistic Models and Conditional Random Fields

3

Introduction

• Classification is known as the assignment of a class y ∈ Y to an observation x ∈ X

• A well-known example (Russell and Norvig, 2003):
– Classification of weather
– Y = {good, bad}
– X = {Monday, Tuesday, …}

• x can be described by a set of features
• f_cloudy, f_sunny or f_rainy

f_cloudy(x) = { 1 if it is cloudy on day x; 0 otherwise }

In general, features do not necessarily have to be binary.
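As an illustration, such a binary feature can be written as a small predicate; the weather lookup table and the feature names below are toy assumptions, not taken from the slides.

# Hypothetical observations: each day maps to its weather condition.
observations = {"Monday": "cloudy", "Tuesday": "sunny", "Wednesday": "rainy"}

def f_cloudy(x):
    # Binary feature: 1 if it is cloudy on day x, 0 otherwise.
    return 1 if observations.get(x) == "cloudy" else 0

def f_sunny(x):
    return 1 if observations.get(x) == "sunny" else 0

print(f_cloudy("Monday"), f_sunny("Monday"))   # -> 1 0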

Page 4: Classical Probabilistic Models and Conditional Random Fields

4

• Modeling all dependencies in a probability distribution is typically very complex due to interdependencies between features

• The Naïve Bayes assumption of all features being conditionally independent is an approach to address this problem

• In nearly all probabilistic models such independence assumptions are made for some variables to make necessary computations manageable

Page 5: Classical Probabilistic Models and Conditional Random Fields

5

• In the structured learning scenario, multiple and typically interdependent class and observation variables are considered, which implies an even higher complexity of the probability distribution:
– In images, pixels near to each other are very likely to have a similar color or hue
– In music, succeeding notes follow special laws; they are not independent, especially when they sound simultaneously. Otherwise, music would not be pleasant to the ear
– In text, words are not an arbitrary accumulation; the order is important and grammatical constraints hold

Page 6: Classical Probabilistic Models and Conditional Random Fields

6

Example: Y = {name, city, 0}; X: set of words

Page 7: Classical Probabilistic Models and Conditional Random Fields

7

• One approach for modeling linear sequence structures, as found in natural language text, is the Hidden Markov Model (HMM)

• For the sake of complexity reduction, strong independence assumptions between the observation variables are made
– This impairs the accuracy of the model

Page 8: Classical Probabilistic Models and Conditional Random Fields

8

• Conditional Random Fields (CRFs, Lafferty et al. (2001)) were developed exactly to fill that gap:
– While CRFs make similar assumptions on the dependencies among the class variables, no assumptions on the dependencies among the observation variables need to be made

Page 9: Classical Probabilistic Models and Conditional Random Fields

9

• In natural language processing, CRFs are currently a state-of-the-art technique for many of its subtasks, including:
– basic text segmentation (Tomanek et al., 2007)
– part-of-speech tagging (Lafferty et al., 2001)
– shallow parsing (Sha and Pereira, 2003)
– the resolution of elliptical noun phrases (Buyko et al., 2007)
– named entity recognition (Settles, 2004; McDonald and Pereira, 2005; Klinger et al., 2007a,b; McDonald et al., 2004)
– gene prediction (DeCaprio et al., 2007)
– image labeling (He et al., 2004)
– object recognition (Quattoni et al., 2005)
– telematics for intrusion detection (Gupta et al., 2007)
– sensor data management (Zhang et al., 2007)

Page 10: Classical Probabilistic Models and Conditional Random Fields

10

Figure 1: Overview of probabilistic models
(figure annotation: MEMM)

Page 11: Classical Probabilistic Models and Conditional Random Fields

11

Abstract

• Introduction
• Naïve Bayes
• HMM
• ME
• CRF

Page 12: Classical Probabilistic Models and Conditional Random Fields

12

Naïve Bayes

• A conditional probability model is a probability distribution p(y | ~x) with an input vector ~x = (x_1, …, x_m), where the x_i are features and y is the class variable to be predicted. That probability can be formulated with Bayes' law:

p(y | ~x) = p(y) · p(~x | y) / p(~x)   (1)

– y: tag
– x: word
– the denominator p(~x) is not used here

Page 13: Classical Probabilistic Models and Conditional Random Fields

13

The denominator p(~x) is constant for all y.

Applying the chain rule:

p(~x) = p(x_1, …, x_m) = p(x_1) · p(x_2 | x_1) · … · p(x_m | x_1, …, x_{m-1})   (2)

p(y, ~x) = p(y) · p(x_1 | y) · p(x_2 | x_1, y) · … · p(x_m | x_1, …, x_{m-1}, y)
         = p(y) · p(x_1 | y) · ∏_{i=2}^{m} p(x_i | x_1, …, x_{i-1}, y)   (3)

Complex: probabilities are conditioned on a varying number of variables.

Page 14: Classical Probabilistic Models and Conditional Random Fields

14

• In practice, it is often assumed that all input variables are conditionally independent of each other, which is known as the Naïve Bayes assumption

This assumption, for i ≠ j, can be written as

p(x_i, x_j | y) = p(x_i | y) · p(x_j | y),   or equivalently   p(x_i | y, x_j) = p(x_i | y)

With two features this gives, for example,

p(y | x_i, x_j) ∝ p(y, x_i, x_j) = p(x_i | y, x_j) · p(y, x_j) = p(y) · p(x_i | y) · p(x_j | y)

Page 15: Classical Probabilistic Models and Conditional Random Fields

15

With three features:

p(y, x_1, x_2, x_3) = p(x_3 | y, x_1, x_2) · p(y, x_1, x_2)
                    = p(y) · p(x_1, x_2, x_3 | y)
                    = p(y) · p(x_1, x_2 | y) · p(x_3 | y)
                    = p(y) · p(x_1 | y) · p(x_2 | y) · p(x_3 | y)

Page 16: Classical Probabilistic Models and Conditional Random Fields

16

p(y | ~x) ∝ p(y, ~x) = p(y) · p(x_1 | y) · p(x_2 | x_1, y) · … · p(x_m | x_1, …, x_{m-1}, y)
                     = p(y) · p(x_1 | y) · p(x_2 | y) · … · p(x_m | y)
                     = p(y) · ∏_{i=1}^{m} p(x_i | y)   (4)

(4) is less complex than (3):

p(y, ~x) = p(y) · p(x_1 | y) · ∏_{i=2}^{m} p(x_i | x_1, …, x_{i-1}, y)   (3)

Simple: probabilities are conditioned only on y.
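A minimal sketch of classification with equation (4): the toy training pairs and the add-alpha smoothing are assumptions for illustration only.

from collections import defaultdict

# Toy training data: (feature tuple, class) pairs; the values are invented.
train = [(("cloudy", "cold"), "bad"), (("sunny", "warm"), "good"),
         (("cloudy", "warm"), "good"), (("rainy", "cold"), "bad")]

classes = {y for _, y in train}
vocab = {x for xs, _ in train for x in xs}
prior = defaultdict(int)                       # counts for p(y)
cond = defaultdict(lambda: defaultdict(int))   # counts for p(x_i | y), per position i

for xs, y in train:
    prior[y] += 1
    for i, x in enumerate(xs):
        cond[(i, y)][x] += 1

def nb_score(xs, y, alpha=1.0):
    # Unnormalized p(y) * prod_i p(x_i | y), with add-alpha smoothing (equation 4).
    score = prior[y] / len(train)
    for i, x in enumerate(xs):
        score *= (cond[(i, y)][x] + alpha) / (prior[y] + alpha * len(vocab))
    return score

print(max(classes, key=lambda y: nb_score(("cloudy", "cold"), y)))   # -> bad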

Page 17: Classical Probabilistic Models and Conditional Random Fields

17

• Dependencies between the input variables are not modeled, probably leading to an imperfect representation of the real world

• Nevertheless, the Naïve Bayes Model performs surprisingly well in many real world applications, such as email classification (Androutsopoulos et al., 2000a,b; Kiritchenko and Matwin, 2001)

Page 18: Classical Probabilistic Models and Conditional Random Fields

18

Abstract

• Introduction
• Naïve Bayes
• HMM
• ME
• CRF

Page 19: Classical Probabilistic Models and Conditional Random Fields

19

HMM

• In the Naïve Bayes Model, only a single output variable has been considered

• To predict a sequence of class variables ~y = (y_1, …, y_n) for an observation sequence ~x = (x_1, …, x_n), a simple sequence model can be formulated as a product over single Naïve Bayes Models

Page 20: Classical Probabilistic Models and Conditional Random Fields

20

• Note that, in contrast to the Naïve Bayes Model, there is only one feature at each sequence position, namely the identity of the respective observation

• Dependencies between single sequence positions are not taken into account

• Each observation xi depends only on the class variable yi at the respective sequence position

p(~y, ~x) = ∏_{i=1}^{n} p(y_i) · p(x_i | y_i)   (5)

Page 21: Classical Probabilistic Models and Conditional Random Fields

21

• It is reasonable to assume that there are dependencies between the observations at consecutive sequence positions

p(~y, ~x) = ∏_{i=1}^{n} p(y_i) · p(x_i | y_i)   (5)

Adding dependencies between consecutive class variables y_{i-1} and y_i, while keeping the assumption that each observation x_i depends only on y_i (equation 6), leads to the Hidden Markov Model:

p(~y, ~x) = ∏_{i=1}^{n} p(y_i | y_{i-1}) · p(x_i | y_i)   (7)
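A minimal sketch of this HMM factorization: the label set reuses Y = {name, city, 0} from the earlier example, the probability tables are invented, and a start distribution p(y_1) stands in for p(y_1 | y_0) at the first position.

# Toy HMM tables; all numbers are hypothetical.
start = {"name": 0.5, "city": 0.2, "0": 0.3}                          # p(y_1)
trans = {("name", "name"): 0.3, ("name", "0"): 0.6, ("name", "city"): 0.1,
         ("city", "0"): 0.8, ("city", "name"): 0.1, ("city", "city"): 0.1,
         ("0", "name"): 0.2, ("0", "city"): 0.2, ("0", "0"): 0.6}     # p(y_i | y_{i-1})
emit = {("name", "John"): 0.4, ("city", "Paris"): 0.5,
        ("0", "visited"): 0.2, ("0", "John"): 0.01, ("0", "Paris"): 0.01}  # p(x_i | y_i)

def hmm_joint(ys, xs):
    # Joint probability p(~y, ~x) = p(y_1) p(x_1|y_1) * prod_i p(y_i|y_{i-1}) p(x_i|y_i).
    p = start.get(ys[0], 0.0) * emit.get((ys[0], xs[0]), 0.0)
    for i in range(1, len(ys)):
        p *= trans.get((ys[i - 1], ys[i]), 0.0) * emit.get((ys[i], xs[i]), 0.0)
    return p

print(hmm_joint(["name", "0", "city"], ["John", "visited", "Paris"]))   # -> 0.0024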

Page 22: Classical Probabilistic Models and Conditional Random Fields

22

• Dependencies between output variables ~y are modeled. A shortcoming is the assumption of conditional independence (see equation 6) between the input variables ~x due to complexity issues

• As we will see later, CRFs address exactly this problem

Page 23: Classical Probabilistic Models and Conditional Random Fields

23

Abstract

• Introduction
• Naïve Bayes
• HMM
• ME
• CRF

Page 24: Classical Probabilistic Models and Conditional Random Fields

24

ME (Maximum Entropy)

• The previous two models are trained to maximize the joint likelihood

• In the following, the Maximum Entropy Model is discussed in more detail as it is fundamentally related to CRFs

Page 25: Classical Probabilistic Models and Conditional Random Fields

25

• The Maximum Entropy Model is a conditional probability model

• It is based on the Principle of Maximum Entropy (Jaynes, 1957) which states that if incomplete information about a probability distribution is available, the only unbiased assumption that can be made is a distribution which is as uniform as possible given the available information

Page 26: Classical Probabilistic Models and Conditional Random Fields

26

• Under this assumption, the proper probability distribution is the one which maximizes the entropy given the constraints from the training material

• For the conditional model p(y|x) the conditional entropy H(y|x) (Korn and Korn, 2000; Bishop, 2006) is applied, which is defined as

H(y | x) = − ∑_{x∈X} ∑_{y∈Y} p(x, y) · log p(y | x)   (8)
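A small sketch of equation (8), computing H(y|x) from a toy joint distribution p(x, y); the numbers below are invented and sum to 1.

import math

p_xy = {("x1", "a"): 0.2, ("x1", "b"): 0.2, ("x2", "a"): 0.5, ("x2", "b"): 0.1}
p_x = {}
for (x, _), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

# H(y|x) = -sum_{x,y} p(x,y) * log p(y|x), with p(y|x) = p(x,y) / p(x)   (eq. 8)
h = -sum(p * math.log(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)
print(h)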

Page 27: Classical Probabilistic Models and Conditional Random Fields

27

• The basic idea behind Maximum Entropy Models is to find the model p*(y|x) which on the one hand has the largest possible conditional entropy but is on the other hand still consistent with the information from the training material

• The objective function, later referred to as primal problem, is thus

p* = argmax_{p ∈ P} H(y | x)   (9)

where P is the set of models consistent with the constraints derived from the training material.

Page 28: Classical Probabilistic Models and Conditional Random Fields

28

• The training material is represented by features
• Here, these are defined as binary-valued functions which depend on both the input variable x and the class variable y. An example of such a function is

f_i(x, y) = { 1 if a particular property of x is observed together with a particular class y; 0 otherwise }   (10)

Page 29: Classical Probabilistic Models and Conditional Random Fields

29

• The expected value of each feature f_i is estimated from the empirical distribution ~p(x, y)

• The empirical distribution is obtained by simply counting how often the different values of the variables occur in the training data

~E(f_i) = ∑_{x∈X} ∑_{y∈Y} ~p(x, y) · f_i(x, y)   (11)

Page 30: Classical Probabilistic Models and Conditional Random Fields

30

• As the empirical probability for a pair (x, y) which is not contained in the training material is 0, equation (11)

~E(f_i) = ∑_{x∈X} ∑_{y∈Y} ~p(x, y) · f_i(x, y)   (11)

can be rewritten as a sum over the N training instances (x^(j), y^(j)):

~E(f_i) = 1/N · ∑_{j=1}^{N} f_i(x^(j), y^(j))   (12)

– The size of the training set is N
– ~E(f_i) can be calculated by counting how often a feature f_i is found with value 1 in the training data and dividing that number by the size N of the training set
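To make the counting in equation (12) concrete, here is a small sketch; the (word, tag) pairs and the two feature functions are toy assumptions.

# Toy training material: (word, tag) pairs.
data = [("Paris", "city"), ("John", "name"), ("visited", "0"), ("Paris", "city")]

def f_capitalized_city(x, y):
    return 1 if x[0].isupper() and y == "city" else 0

def f_is_name(x, y):
    return 1 if y == "name" else 0

# ~E(f_i) = (1/N) * sum_j f_i(x_j, y_j)   (eq. 12)
for f in (f_capitalized_city, f_is_name):
    emp = sum(f(x, y) for x, y in data) / len(data)
    print(f.__name__, emp)   # -> 0.5 and 0.25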

Page 31: Classical Probabilistic Models and Conditional Random Fields

31

• Analogously to equation 11, the expected value of a feature on the model distribution is defined as

E(f_i) = ∑_{x∈X} ∑_{y∈Y} p(x, y) · f_i(x, y)   (13)

• In contrast to equation 11 (the expected value on the empirical distribution), the model distribution is taken into account here

• Of course, p(x, y) cannot be calculated in general because the number of all possible x ∈ X can be enormous. Hence p(x, y) = p(x) · p(y | x) is approximated by

p(x, y) ≈ ~p(x) · p(y | x)   (14)

Page 32: Classical Probabilistic Models and Conditional Random Fields

32

• This is an approximation to make the calculation of E(fi) possible (see Lau et al. (1993) for a more detailed discussion). This results in

• which can (analogously to equation 12) be transformed into

E(f_i) ≈ ∑_{x∈X} ∑_{y∈Y} ~p(x) · p(y | x) · f_i(x, y)   (15)

E(f_i) ≈ 1/N · ∑_{j=1}^{N} ∑_{y∈Y} p(y | x^(j)) · f_i(x^(j), y)   (16)
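A sketch of the approximation in equation (16); the training inputs and label set are toy values, and the model p(y|x) is a hypothetical placeholder returning a uniform distribution, standing in for a real MaxEnt model.

data_x = ["Paris", "John", "visited", "Paris"]
labels = ["name", "city", "0"]

def p_y_given_x(y, x):
    return 1.0 / len(labels)        # placeholder model distribution

def f_capitalized_city(x, y):
    return 1 if x[0].isupper() and y == "city" else 0

# E(f_i) ≈ (1/N) * sum_j sum_y p(y|x_j) * f_i(x_j, y)   (eq. 16)
model_exp = sum(p_y_given_x(y, x) * f_capitalized_city(x, y)
                for x in data_x for y in labels) / len(data_x)
print(model_exp)   # 3 capitalized words * (1/3) / 4 = 0.25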

Page 33: Classical Probabilistic Models and Conditional Random Fields

33

• Equation 9 postulates that the model p*(y|x) is consistent with the evidence found in the training material

• That means, for each feature fi its expected value on the empirical distribution must be equal to its expected value on the particular model distribution, these are the first m constraints

• Another constraint is to have a proper conditional probability ensured by

E(f_i) = ~E(f_i),   i = 1, …, m   (17)

∑_{y∈Y} p(y | x) = 1   (18)

Page 34: Classical Probabilistic Models and Conditional Random Fields

34

• Finding p* (y|x) under these constraints can be formulated as a constrained optimization problem

• For each constraint a Lagrange multiplier λ_i is introduced. This leads to the following Lagrange function Λ(p, ~λ):

Λ(p, ~λ) = H(y | x) + ∑_{i=1}^{m} λ_i · (E(f_i) − ~E(f_i)) + λ_{m+1} · (∑_{y∈Y} p(y | x) − 1)   (19)

∀ x?

Page 35: Classical Probabilistic Models and Conditional Random Fields

35

• In the same manner as done for the expectation values in equation 15, H(y|x) is approximated

H(y | x) ≈ − ∑_{x∈X} ∑_{y∈Y} ~p(x) · p(y | x) · log p(y | x)   (20)

Writing X = {x_1, …, x_l} and Y = {y_1, …, y_n}, the partial derivative with respect to a single value p(y_k | x_s) is

∂H(y | x) / ∂p(y_k | x_s) = −~p(x_s) · (log p(y_k | x_s) + 1)   (21)

Page 36: Classical Probabilistic Models and Conditional Random Fields

36

With p_i = P(X = x_i), 1 ≤ i ≤ l, and p_ij = P(Y = y_j | X = x_i), 1 ≤ j ≤ n:

H(X, Y) = H(p_1·p_11, …, p_1·p_1n, …, p_l·p_l1, …, p_l·p_ln)
        = H(p_1, …, p_l) + ∑_{i=1}^{l} p_i · H(p_i1, …, p_in)
        = H(X) + H(Y | X)

Page 37: Classical Probabilistic Models and Conditional Random Fields

37

H(Y | X) = ∑_{i=1}^{l} p_i · H(Y | X = x_i)
         = − ∑_{i=1}^{l} ∑_{j=1}^{n} P(X = x_i) · P(Y = y_j | X = x_i) · log P(Y = y_j | X = x_i)
         = − ∑_{x∈X} ∑_{y∈Y} p(x) · p(y | x) · log p(y | x)

Page 38: Classical Probabilistic Models and Conditional Random Fields

38

For the constraint terms, using E(f_i) ≈ ∑_{x∈X} ∑_{y∈Y} ~p(x) · p(y | x) · f_i(x, y) and the fact that ~E(f_i) does not depend on p:

∂/∂p(y_k | x_s) [ ∑_{i=1}^{m} λ_i · (E(f_i) − ~E(f_i)) ] = ∑_{i=1}^{m} λ_i · ~p(x_s) · f_i(x_s, y_k)   (22)

Page 39: Classical Probabilistic Models and Conditional Random Fields

39

For the normalization constraint:

∂ [ ∑_{y∈Y} p(y | x_i) ] / ∂p(y_k | x_s) = { 1 when i = s; 0 otherwise }

so the derivative of the last term of equation 19 with respect to p(y_k | x_s) is λ_{m+1}.

Page 40: Classical Probabilistic Models and Conditional Random Fields

40

• The complete derivation of the Lagrange function from equation 19 is then

• Equating this term to 0 and solving by p(y|x) leads to

∂Λ(p, ~λ) / ∂p(y | x) = −~p(x) · (log p(y | x) + 1) + ∑_{i=1}^{m} λ_i · ~p(x) · f_i(x, y) + λ_{m+1}   (23)

p(y | x) = exp( ∑_{i=1}^{m} λ_i · f_i(x, y) ) · exp( λ_{m+1} / ~p(x) − 1 )   (24)

Page 41: Classical Probabilistic Models and Conditional Random Fields

41

• The second constraint in equation 18 is given as

• Substituting equation 24 into 25 results in

• Substituting equation 26 back into equation 24 results in

∑_{y∈Y} p(y | x) = 1   (25)

exp( λ_{m+1} / ~p(x) − 1 ) = 1 / ∑_{y∈Y} exp( ∑_{i=1}^{m} λ_i · f_i(x, y) )   (26)

p(y | x) = exp( ∑_{i=1}^{m} λ_i · f_i(x, y) ) / ∑_{y'∈Y} exp( ∑_{i=1}^{m} λ_i · f_i(x, y') )   (27)

Page 42: Classical Probabilistic Models and Conditional Random Fields

42

• This is the general form the model needs to have to meet the constraints. The Maximum Entropy Model can then be formulated as

• This formulation of a conditional probability distribution as a log-linear model and a product of exponentiated weighted features is discussed from another perspective in Section 3

• In Section 4, the similarity of Conditional Random Fields, which are also log-linear models, with the conceptually closely related Maximum Entropy Models becomes evident

Z_λ(x) = ∑_{y∈Y} exp( ∑_{i=1}^{m} λ_i · f_i(x, y) )   (28)

p_λ(y | x) = 1/Z_λ(x) · exp( ∑_{i=1}^{m} λ_i · f_i(x, y) )   (29)
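A minimal sketch of equations (28) and (29), computing p(y|x) as a normalized exponential of weighted features; the feature functions and weights λ_i are invented, and the training of the λ_i is not shown.

import math

labels = ["name", "city", "0"]

# Hypothetical binary features f_i(x, y) and weights lambda_i.
features = [
    lambda x, y: 1 if x[0].isupper() and y == "city" else 0,
    lambda x, y: 1 if x[0].isupper() and y == "name" else 0,
    lambda x, y: 1 if x.islower() and y == "0" else 0,
]
lambdas = [1.2, 0.8, 2.0]

def p_y_given_x(y, x):
    # p(y|x) = exp(sum_i lambda_i f_i(x,y)) / Z(x)   (eqs. 28, 29)
    score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lambdas, features)))
    z = sum(score(yy) for yy in labels)     # Z(x), eq. 28
    return score(y) / z

print({y: round(p_y_given_x(y, "Paris"), 3) for y in labels})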

Page 43: Classical Probabilistic Models and Conditional Random Fields

43
Presented by Jian-Shiun Tzeng 4/24/2009

Lagrange Multiplier Method

Prof. 張海潮, Department of Mathematics, National Taiwan University
http://episte.math.ntu.edu.tw/entries/en_lagrange_mul/index.html

Page 44: Classical Probabilistic Models and Conditional Random Fields

44

• With two variables, a necessary condition for f(x, y) to attain an extremum is

∂f/∂x = 0,  ∂f/∂y = 0   (L1)

• But if x and y are restricted from the start by another function g(x, y) = 0, Lagrange proposed to replace (L1) by the condition that the partial derivatives of

f(x, y) + λ · g(x, y)

with respect to x and y are 0, as the criterion for finding extrema of f(x, y) on g(x, y) = 0
• The λ introduced here is a number to be determined; it is called the multiplier because it is multiplied in front of g

Page 45: Classical Probabilistic Models and Conditional Random Fields

45

• First, note that we are solving for three unknowns, x, y and λ, from

∂f/∂x + λ · ∂g/∂x = 0
∂f/∂y + λ · ∂g/∂y = 0
g(x, y) = 0

• Although there are three equations, in principle they can be solved.

Page 46: Classical Probabilistic Models and Conditional Random Fields

46

• Take f(x, y) = x and g(x, y) = x² + y² − 1 as an example. When x, y are restricted to the circle x² + y² − 1 = 0, we solve the following three equations

∂/∂x [x + λ(x² + y² − 1)] = 1 + 2λx = 0
∂/∂y [x + λ(x² + y² − 1)] = 2λy = 0
x² + y² − 1 = 0

• There are two solutions: x = 1, y = 0, λ = −1/2 and x = −1, y = 0, λ = 1/2. They correspond to the right and left endpoints of the circle x² + y² − 1 = 0. Their x-coordinates are 1 and −1; one is the candidate maximum, the other the candidate minimum.
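As a quick symbolic check of this example, the three equations can be solved directly; this sketch assumes SymPy is available.

import sympy as sp

x, y, lam = sp.symbols("x y lam")
f = x
g = x**2 + y**2 - 1
F = f + lam * g                       # the modified function f + λg

# The three equations from the slide: ∂F/∂x = 0, ∂F/∂y = 0, g = 0.
solutions = sp.solve([sp.diff(F, x), sp.diff(F, y), g], [x, y, lam], dict=True)
print(solutions)                      # -> x = ±1, y = 0, λ = ∓1/2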

Page 47: Classical Probabilistic Models and Conditional Random Fields

47

• The reader may ask: why not rewrite the constraint x² + y² − 1 = 0 as x = cos θ, y = sin θ, substitute to get f(x, y) = f(cos θ, sin θ), and then set the derivative with respect to θ to 0 to solve? For the example above this certainly works, but if g(x, y) has a rather general form that cannot be satisfied by a parametrization in x and y, or if there are more variables with more constraints, the old method of substituting a parametrization usually fails (Note 1).

Page 48: Classical Probabilistic Models and Conditional Random Fields

48

Note 1
• To find the extrema of f(x, y, z) under constraints g1(x, y, z) = 0 and g2(x, y, z) = 0, the Lagrange multiplier method requires the following five equations

∂/∂x [f + λ1·g1 + λ2·g2] = 0
∂/∂y [f + λ1·g1 + λ2·g2] = 0
∂/∂z [f + λ1·g1 + λ2·g2] = 0
g1 = 0
g2 = 0

• There are five unknowns to solve for: x, y, z and λ1, λ2.

Page 49: Classical Probabilistic Models and Conditional Random Fields

49

• What is the meaning of this method? When g(x, y) = 0, we may regard y as an implicit function of x, so that g(x, y(x)) = 0 and f(x, y) becomes f(x, y(x)). Setting

d/dx f(x, y(x)) = 0    (and d/dx g(x, y(x)) = 0, since g(x, y(x)) ≡ 0)

and applying the chain rule, we get

f_x + f_y · dy/dx = 0
g_x + g_y · dy/dx = 0

• Therefore the determinant

| f_x  f_y |
| g_x  g_y | = 0

Page 50: Classical Probabilistic Models and Conditional Random Fields

50

• This means that (f_x, f_y) and (g_x, g_y) are proportional, so there is a λ with (Note 2)

f_x = λ · g_x,  f_y = λ · g_y

Page 51: Classical Probabilistic Models and Conditional Random Fields

51

Note 2
• Of course, this can also be written as f_x − λ·g_x = 0 and f_y − λ·g_y = 0. Note, however, the precondition that g_x and g_y must not both be 0; if both are 0, y and x cannot be expressed as implicit functions of each other. Consider the following example: let f(x, y) = x and g(x, y) = y² − x³. Then

∂/∂x [x + λ(y² − x³)] = 1 − 3λx² = 0
∂/∂y [x + λ(y² − x³)] = 2λy = 0
y² − x³ = 0

• It is easy to see that this system has no solution, yet f(x, y) does have an extremum: it occurs at (0, 0). However, at (0, 0) we have g_x = 0 = g_y, so the multiplier method fails.

Page 52: Classical Probabilistic Models and Conditional Random Fields

52

Abstract

• Introduction
• Naïve Bayes
• HMM
• ME
• CRF