Multiclass Classification in NLP


Page 1: Multiclass Classification in NLP

Named Entity Recognition

Label people, locations, and organizations in a sentence:
[PER Sam Houston], [born in] [LOC Virginia], [was a member of the] [ORG US Congress].

Decompose into sub-problems:
  Sam Houston, born in Virginia... (PER,LOC,ORG,?) → PER (1)
  Sam Houston, born in Virginia... (PER,LOC,ORG,?) → None (0)
  Sam Houston, born in Virginia... (PER,LOC,ORG,?) → LOC (2)

Many problems in NLP are decomposed this way. Disambiguation tasks:
POS tagging, word-sense disambiguation, verb classification, semantic-role labeling.

Page 2: Multiclass Classification in NLP

Outline

Multi-Categorical Classification Tasks
  example: Semantic Role Labeling (SRL)
Decomposition Approaches
Constraint Classification
  unifies learning of multi-categorical classifiers
Structured-Output Learning
  revisit SRL
Decomposition versus Constraint Classification

Goal: discuss multi-class and structured output from the same perspective; discuss similarities and differences.

Page 3: Multiclass Classification in NLP

Multi-Categorical Output Tasks

Multi-class Classification (y ∈ {1,...,K})
  character recognition ('6')
  document classification ('homepage')
Multi-label Classification (y ⊆ {1,...,K})
  document classification ('(homepage, facultypage)')
Category Ranking (y a ranking/permutation of {1,...,K})
  user preference ('love > like > hate')
  document classification ('homepage > facultypage > sports')
Hierarchical Classification (y ⊆ {1,...,K})
  labels cohere with a class hierarchy
  place document into index where 'soccer' is-a 'sport'

Page 4: Multiclass Classification in NLP

(more) Multi-Categorical Output Tasks

Sequential Prediction (y ∈ {1,...,K}⁺)
  e.g. POS tagging ('N V N N A'); "This is a sentence." → D V D N
  e.g. phrase identification
  Many labels: K^L for a length-L sentence

Structured Output Prediction (y ∈ C({1,...,K}⁺))
  e.g. parse tree, multi-level phrase identification
  e.g. sequential prediction
  Constrained by domain, problem, data, background knowledge, etc.

Page 5: Multiclass Classification in NLP

Semantic Role Labeling: A Structured-Output Problem

For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
   • Core arguments, e.g., Agent, Patient, or Instrument
   • Their adjuncts, e.g., Locative, Temporal, or Manner

I left my pearls to my daughter-in-law in my will.
  A0: leaver
  A1: thing left
  A2: benefactor
  AM-LOC

Page 6: Multiclass Classification in NLP

Semantic Role Labeling

I left my pearls to my daughter-in-law in my will.

Many possible valid outputs; many possible invalid outputs.

[Figure: one valid labeling A0, A1, A2, AM-LOC over the sentence]

Page 7: Multiclass Classification in NLP

Structured Output Problems

Multi-Class
  View y = 4 as (y_1,...,y_k) = (0 0 0 1 0 0 0)
  The output is restricted by "exactly one y_i = 1"
  Learn f_1(x),...,f_k(x)

Sequence Prediction
  e.g. POS tagging: x = (My name is Dave), y = (Pr, N, V, N)
  e.g. restriction: "every sentence must have a verb"

Structured Output
  Arbitrary global constraints
  Local functions do not have access to global constraints!

Goal: discuss multi-class and structured output from the same perspective; discuss similarities and differences.

Page 8: Multiclass Classification in NLP

Transform the sub-problems

Sam Houston, born in Virginia... (PER,LOC,ORG,?) → PER (1)

Transform each problem to a feature vector, e.g. for "Sam Houston, born in Virginia":
  (Bob-, JOHN-, SAM HOUSTON, HAPPY, -BORN, --BORN, ...)
  (  0 ,   0  ,      1     ,   0  ,   1  ,    1  , ...)

Transform each label to a class label: PER → 1, LOC → 2, ORG → 3, ? → 0

Input: {0,1}^d or R^d
Output: {0,1,2,3,...,k}
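As a concrete illustration, here is a minimal Python sketch of this transformation. The feature names and gazetteer checks are hypothetical stand-ins, not the features used in the lecture.

```python
# Minimal sketch: a candidate phrase in context becomes a binary feature
# vector, and each entity label becomes a class index.
FEATURES = ["is-capitalized", "comma-after", "followed-by-born",
            "in-LOC-gazetteer", "in-PER-gazetteer"]
LABELS = {"?": 0, "PER": 1, "LOC": 2, "ORG": 3}

def extract_features(phrase: str, right_context: str) -> list[int]:
    """Map one candidate phrase to a 0/1 vector indexed by FEATURES."""
    checks = [
        all(w[:1].isupper() for w in phrase.split()),  # is-capitalized
        right_context.startswith(","),                 # comma-after
        "born" in right_context,                       # followed-by-born
        phrase in {"Virginia"},                        # toy LOC gazetteer
        phrase in {"Sam Houston"},                     # toy PER gazetteer
    ]
    return [int(c) for c in checks]

x = extract_features("Sam Houston", ", born in Virginia")
y = LABELS["PER"]
print(x, y)  # [1, 1, 1, 0, 1] 1
```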

Page 9: Multiclass Classification in NLP

Solving multiclass with binary learning

Multiclass classifier: a function f : R^d → {1,2,...,k}

Decompose into binary problems

Not always possible to learn
No theoretical justification (unless the problem is easy)

Page 10: Multiclass Classification in NLP

The Real Multiclass Problem

General framework
  Extends binary algorithms
Theoretically justified
  Provably correct
  Generalizes well
Verified experimentally
Naturally extends binary classification algorithms to the multiclass setting
  e.g. linear binary separation induces linear boundaries in the multiclass setting

Page 11: Multiclass Classification in NLP

Multiclass over Linear Functions

– One versus all (OvA)

– Direct winner-take-all (D-WTA)

– All versus all (AvA)

Page 12: Multiclass Classification in NLP

WTA over linear functions

Assume examples are generated from a winner-take-all rule:
  y = argmax_i (w_i · x + t_i),  w_i, x ∈ R^n, t_i ∈ R

Note: Voronoi diagrams are WTA functions:
  argmin_i ||c_i − x|| = argmax_i (c_i · x − ||c_i||² / 2)
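The identity above is easy to verify numerically; a small sketch, assuming nothing beyond NumPy:

```python
import numpy as np

# Numeric check of the identity: nearest-centroid (Voronoi) decisions are a
# special case of winner-take-all over linear functions:
#   argmin_i ||c_i - x||  ==  argmax_i (c_i . x - ||c_i||^2 / 2)
rng = np.random.default_rng(0)
C = rng.normal(size=(5, 3))          # 5 centroids in R^3
for _ in range(1000):
    x = rng.normal(size=3)
    voronoi = np.argmin(np.linalg.norm(C - x, axis=1))
    wta = np.argmax(C @ x - 0.5 * np.sum(C**2, axis=1))
    assert voronoi == wta
print("identity holds on 1000 random points")
```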

Page 13: Multiclass Classification in NLP

Learning via One-versus-All (OvA) Assumption

Find v_r, v_b, v_g, v_y ∈ R^n such that:
  v_r · x > 0 iff y = red
  v_b · x > 0 iff y = blue
  v_g · x > 0 iff y = green
  v_y · x > 0 iff y = yellow

Classifier: f(x) = argmax_i v_i · x

[Figure: individual classifiers and the induced decision regions]

H = R^{kn}
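A minimal sketch of OvA training under this assumption, using perceptron-style updates; the learning algorithm is an illustrative choice, the slides only assume linear separators:

```python
import numpy as np

# OvA sketch: one weight vector v_i per class, each trained on
# "class i vs. the rest", combined at test time by argmax_i v_i . x.
def train_ova(X, y, k, epochs=10, lr=1.0):
    V = np.zeros((k, X.shape[1]))
    for _ in range(epochs):
        for x, label in zip(X, y):
            for i in range(k):
                target = 1 if label == i else -1   # OvA relabeling
                if target * (V[i] @ x) <= 0:       # mistake on classifier i
                    V[i] += lr * target * x        # perceptron update
    return V

def predict(V, x):
    return int(np.argmax(V @ x))   # f(x) = argmax_i v_i . x
```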

Page 14: Multiclass Classification in NLP

Learning via All-versus-All (AvA) Assumption

Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^n such that:
  v_rb · x > 0 if y = red, < 0 if y = blue
  v_rg · x > 0 if y = red, < 0 if y = green
  ... (for all pairs)

[Figure: individual classifiers and the induced decision regions]

H = R^{k(k−1)n/2} (one weight vector per pair)

How to classify?

Page 15: Multiclass Classification in NLP

Classifying with AvA

Options: tree, majority vote, tournament.

Example vote count: 1 red, 2 yellow, 2 green → ?

All are post-learning heuristics and can produce inconsistent decisions (see the sketch below).
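A sketch of the majority-vote option; the pairwise weight-vector dictionary V and the tie-breaking rule are assumptions for illustration (tree and tournament decoding reuse the same pairwise decisions):

```python
import numpy as np
from itertools import combinations

# AvA decoding by majority vote: one linear classifier v_ij per unordered
# pair (i, j); v_ij . x > 0 votes for class i, otherwise for class j.
def ava_majority_vote(V, x, k):
    """V maps each pair (i, j) with i < j to a weight vector."""
    votes = np.zeros(k, dtype=int)
    for i, j in combinations(range(k), 2):
        winner = i if V[(i, j)] @ x > 0 else j
        votes[winner] += 1
    return int(np.argmax(votes))   # ties broken by lowest class index
```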

Page 16: Multiclass Classification in NLP

Summary (1): Learning Binary Classifiers

On-line: Perceptron, Winnow
  Mistake-bounded
  Generalizes well (VC-dim)
  Works well in practice

SVM
  Well motivated: maximizes margin
  Generalizes well
  Works well in practice

Boosting, neural networks, etc.

Page 17: Multiclass Classification in NLP

From Binary to Multi-categorical

Decompose multi-categorical problems into multiple (independent) binary problems:
  Multi-class: OvA, AvA, ECOC, DT, etc.
  Multi-label: reduce to multi-class
  Category ranking: reduce to multi-class, or use regression
  Sequence prediction: reduce to multi-class; part/alphabet-based decompositions
  Structured output: learn parts of the output based on local information!!!

Page 18: Multiclass Classification in NLP

Problems with Decompositions

Learning optimizes over local metrics
  Poor global performance
  What is the metric? We don't care about the performance of the local classifiers.

Poor decomposition → poor performance
  Difficult local problems
  Irrelevant local problems

Not clear how to decompose all multi-categorical problems

Page 19: Multiclass Classification in NLP

Multi-class OvA Decomposition: a Linear Representation

Hypothesis: h(x) = argmax_i v_i · x
Decomposition: each class represented by a linear function v_i · x
Learning: one-versus-all (OvA) — for each class i, v_i · x > 0 iff i = y
General case: each class represented by a function f_i(x) > 0

Page 20: Multiclass Classification in NLP

Learning via One-versus-All (OvA) Assumption

Classifier: f(x) = argmax_i v_i · x
OvA learning: find v_i such that v_i · x > 0 iff y = i

OvA is fine only if the data is OvA-separable!
A linear classifier can represent this function:
  (Voronoi) argmin_i d(c_i, x)   (WTA) argmax_i c_i · x + d_i

Page 21: Multiclass Classification in NLP

Other Issues We Mentioned

Error-Correcting Output Codes (ECOC)
  Another (class of) decomposition
  Difficulty: how to make sure that the resulting problems are separable

We commented on the advantage of All-versus-All when working in the dual space (e.g., with kernels).

Page 22: Multiclass Classification in NLP

Example: SNoW Multiclass Classifier

Targets (each a linear threshold unit, LTU)
Features
Weighted edges (weight vectors)

SNoW only represents the targets and the weighted edges.

How do we train? How do we evaluate?

Page 23: Multiclass Classification in NLP

Winnow: Extensions

Winnow learns monotone Boolean functions.

To learn non-monotone Boolean functions:
  For each variable x, introduce x′ = ¬x
  Learn monotone functions over 2n variables

To learn functions with real-valued inputs: "Balanced Winnow"
  Two weights per variable; the effective weight is the difference
  Update rule: if sign[(w⁺ − w⁻) · x] ≠ y, then for each i:
    w⁺_i ← w⁺_i · r^(y·x_i),  w⁻_i ← w⁻_i · r^(−y·x_i)
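A sketch of one such mistake-driven update as reconstructed above; the variable names and the default promotion rate r are illustrative:

```python
import numpy as np

# One Balanced Winnow update: two positive weight vectors, effective weight
# w_plus - w_minus; on a mistake each coordinate is promoted or demoted
# multiplicatively by r ** (y * x_i).
def balanced_winnow_update(w_plus, w_minus, x, y, r=1.5):
    """x in R^n, y in {-1, +1}; updates the weights in place on a mistake."""
    if y * ((w_plus - w_minus) @ x) <= 0:   # prediction disagrees with y
        w_plus *= r ** (y * x)              # promote toward the label
        w_minus *= r ** (-y * x)            # demote away from it
    return w_plus, w_minus
```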

Page 24: Multiclass Classification in NLP

An Intuition: Balanced Winnow

In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples.

Typically, we train each node separately ("my example" / "not my example"). Instead, given an example we could say: this is more of a + example than a − example.

We compare the activation of the different target nodes (classifiers) on a given example. ("This example is more class + than class −.")

If sign[(w⁺ − w⁻) · x] ≠ y, then for each i: w⁺_i ← w⁺_i · r^(y·x_i), w⁻_i ← w⁻_i · r^(−y·x_i)

Page 25: Multiclass Classification in NLP

Constraint Classification

Can be viewed as a generalization of Balanced Winnow to the multi-class case.

Unifies multi-class, multi-label, and category ranking
Reduces learning to a single binary learning task
Captures the theoretical properties of the binary algorithm
Experimentally verified
Naturally extends Perceptron, SVM, etc.

All of this is done by representing labels as a set of constraints, or preferences, among output labels.

Page 26: Multiclass Classification in NLP

Multi-category to Constraint Classification

Multiclass:     (x, A) → (x, (A>B, A>C, A>D))
Multilabel:     (x, (A, B)) → (x, (A>C, A>D, B>C, B>D))
Label ranking:  (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))

Examples: (x, y), y ∈ S_k
  S_k: partial order over class labels {1,...,k}; defines a "preference" relation (>) for class labeling

Constraint classifier h: X → S_k (a sketch of the transformation follows below)
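A small sketch of these three mappings, using class indices instead of letter labels; the helper names are hypothetical:

```python
# Each example's label becomes a set of pairwise preferences (i > j),
# encoded here as tuples (i, j) over class indices 0..k-1.
def multiclass_constraints(label, k):
    return [(label, j) for j in range(k) if j != label]

def multilabel_constraints(labels, k):
    rest = [j for j in range(k) if j not in labels]
    return [(i, j) for i in labels for j in rest]

def ranking_constraints(ranking):
    """ranking lists classes from most to least preferred."""
    return [(ranking[t], ranking[t + 1]) for t in range(len(ranking) - 1)]

print(multiclass_constraints(0, 4))          # A=0: [(0,1), (0,2), (0,3)]
print(multilabel_constraints({0, 1}, 4))     # A,B: [(0,2), (0,3), (1,2), (1,3)]
print(ranking_constraints([4, 3, 2, 1, 0]))  # 5>4>3>2>1 as adjacent pairs
```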

Page 27: Multiclass Classification in NLP

Learning Constraint Classification: the Kesler Construction

Transform examples, e.g. label 2 with constraints 2>1, 2>3, 2>4:

  i>j  ⟺  f_i(x) − f_j(x) > 0
       ⟺  w_i · x − w_j · x > 0
       ⟺  W · X_i − W · X_j > 0
       ⟺  W · (X_i − X_j) > 0
       ⟺  W · X_ij > 0

  X_i = (0, x, 0, 0) ∈ R^{kd}
  X_j = (0, 0, 0, x) ∈ R^{kd}
  X_ij = X_i − X_j = (0, x, 0, −x)
  W = (w_1, w_2, w_3, w_4) ∈ R^{kd}

Page 28: Multiclass Classification in NLP

Kesler's Construction (1)

y = argmax_{i ∈ {r,b,g,y}} v_i · x,  v_i, x ∈ R^n

Find v_r, v_b, v_g, v_y ∈ R^n such that (for y = red):
  v_r · x > v_b · x
  v_r · x > v_g · x
  v_r · x > v_y · x

H = R^{kn}

Page 29: Multiclass Classification in NLP

Kesler's Construction (2)

Let v = (v_r, v_b, v_g, v_y) ∈ R^{kn}
Let 0^n be the n-dimensional zero vector

v_r · x > v_b · x  ⟺  v · (x, −x, 0^n, 0^n) > 0  ⟺  v · (−x, x, 0^n, 0^n) < 0
v_r · x > v_g · x  ⟺  v · (x, 0^n, −x, 0^n) > 0  ⟺  v · (−x, 0^n, x, 0^n) < 0
v_r · x > v_y · x  ⟺  v · (x, 0^n, 0^n, −x) > 0  ⟺  v · (−x, 0^n, 0^n, x) < 0

Page 30: Multiclass Classification in NLP

Kesler's Construction (3)

Let v = (v_1, ..., v_k) ∈ R^n × ... × R^n = R^{kn}

x_ij = (0^{(i−1)n}, x, 0^{(k−i)n}) − (0^{(j−1)n}, x, 0^{(k−j)n}) ∈ R^{kn}

Given (x, y) ∈ R^n × {1,...,k}, for all j ≠ y:
  add (x_yj, +1) to P⁺(x,y)
  add (−x_yj, −1) to P⁻(x,y)

P⁺(x,y) has k−1 positive examples (∈ R^{kn})
P⁻(x,y) has k−1 negative examples (∈ R^{kn})

Page 31: Multiclass Classification in NLP

Learning via Kesler's Construction

Given (x_1, y_1), ..., (x_N, y_N) ∈ R^n × {1,...,k}

Create P⁺ = ∪_i P⁺(x_i, y_i) and P⁻ = ∪_i P⁻(x_i, y_i)

Find v = (v_1, ..., v_k) ∈ R^{kn} such that v · x separates P⁺ from P⁻

Output: f(x) = argmax_i v_i · x
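A minimal sketch of this procedure with a plain perceptron as the binary learner — an illustrative choice; any linear-separation algorithm works here:

```python
import numpy as np

# Kesler's construction: embed each constraint y > j as a point in R^{kn},
# then run a binary perceptron. Training only on the positive points P+ is
# enough, since each negative point is just the mirror image -z.
def kesler_point(x, i, j, k):
    """x placed in block i minus x placed in block j, a vector in R^{kn}."""
    n = len(x)
    z = np.zeros(k * n)
    z[i * n:(i + 1) * n] = x
    z[j * n:(j + 1) * n] = -x
    return z

def train_kesler_perceptron(X, y, k, epochs=10):
    n = X.shape[1]
    v = np.zeros(k * n)
    for _ in range(epochs):
        for x, label in zip(X, y):
            for j in range(k):
                if j != label:
                    z = kesler_point(x, label, j, k)  # positive example
                    if v @ z <= 0:                    # violated constraint
                        v += z                        # perceptron update
    return v.reshape(k, n)  # rows are v_1..v_k; predict argmax_i v_i . x
```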

Page 32: Multiclass Classification in NLP

Constraint Classification

Examples: (x, y), y ∈ S_k
  S_k: partial order over class labels {1,...,k}; defines a "preference" relation (>) for class labels
  e.g. multiclass: 2>1, 2>3, 2>4, 2>5
  e.g. multilabel: 1>3, 1>4, 1>5, 2>3, 2>4, 2>5

Constraint classifier f: X → S_k
  f(x) is a partial order
  f(x) is consistent with y if (i>j) ∈ y ⇒ (i>j) ∈ f(x)

Page 33: Multiclass Classification in NLP

Implementation

Examples: (x, y), y ∈ S_k
  S_k: partial order over class labels {1,...,k}; defines a "preference" relation (>) for class labels
  e.g. multiclass: 2>1, 2>3, 2>4, 2>5

Given an example labeled 2, the activation of target 2 on it should be larger than the activations of the other targets.

SNoW implementation: conservative. Only the target node that corresponds to the correct label and the one with the highest activation are compared. If both are the same target node, no change; otherwise, promote one and demote the other (a sketch follows below).
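A sketch of this conservative update, assuming Winnow-style multiplicative promotion and demotion; the update constant alpha is illustrative:

```python
import numpy as np

# Conservative update: compare only the correct target's activation with the
# current winner; on disagreement, promote the correct target and demote the
# winner on the active features.
def conservative_update(W, x, y, alpha=1.5):
    """W: (k, n) weight matrix, x: 0/1 feature vector, y: correct class."""
    winner = int(np.argmax(W @ x))
    if winner != y:                   # same node: no change
        active = x > 0
        W[y, active] *= alpha         # promote the correct target
        W[winner, active] /= alpha    # demote the wrongly winning target
    return W
```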

Page 34: Multiclass Classification in NLP

Properties of the Construction

Can learn any argmax_i v_i · x function
Can use any algorithm to find a linear separation:
  Perceptron algorithm → ultraconservative online algorithm [Crammer & Singer 2001]
  Winnow algorithm → multiclass Winnow [Mesterharm 2000]
Defines a multiclass margin via the binary margin in R^{kd}
  → multiclass SVM [Crammer & Singer 2001]

Page 35: Multiclass Classification in NLP

Margin Generalization Bounds

Linear hypothesis space: h(x) = argsort_i v_i · x
  v_i, x ∈ R^d; argsort returns a permutation of {1,...,k}

CC margin-based bound, with margin
  γ = min_{(x,y)∈S} min_{(i>j)∈y} (v_i · x − v_j · x):

  err_D(h) ≤ O( (C/m) · ( R²/γ² + ln(1/δ) ) )

  m — number of examples
  R — max_x ||x||
  δ — confidence
  C — average number of constraints

Page 36: Multiclass Classification in NLP

VC-style Generalization Bounds

Linear hypothesis space: h(x) = argsort_i v_i · x
  v_i, x ∈ R^d; argsort returns a permutation of {1,...,k}

CC VC-based bound:

  err_D(h) ≤ err(S,h) + O( √( (kd · log(mk/d) + ln(1/δ)) / m ) )

  m — number of examples
  d — dimension of the input space
  δ — confidence
  k — number of classes

Page 37: Multiclass Classification in NLP

Beyond Multiclass Classification

Ranking
  category ranking (over classes)
  ordinal regression (over examples)
Multilabel
  x is both red and blue
Complex relationships
  x is more red than blue, but not green
Millions of classes
  sequence labeling (e.g. POS tagging) — LATER

SNoW has an implementation of Constraint Classification for the multi-class case; try comparing it with one-vs-all.

Experimental issues: when is this version of multi-class better? Several easy improvements are possible by modifying the loss function.

Page 38: Multiclass Classification in NLP

Multi-class Experiments

[Figure: experimental comparison of multi-class methods]

The picture isn't so clear for very high-dimensional problems. Why?

Page 39: Multiclass Classification in NLP

Summary

OvA:
  Learning: independent — f_i(x) > 0 iff y = i
  Evaluation: global — h(x) = argmax_i f_i(x)
Constraint Classification:
  Learning: global — find {f_i(x)} s.t. y = argmax_i f_i(x)
  Evaluation: global — h(x) = argmax_i f_i(x)

Learn + Inference:
  Learning: independent — f_i(x) > 0 iff "i is a part of y"
  Evaluation: global inference — h(x) = argmax_{y∈C(Y)} Σ f_i(x)
Inference Based Training:
  Learning: global — find {f_i(x)} s.t. y = argmax_i f_i(x)
  Evaluation: global — h(x) = argmax_i f_i(x)

Page 40: Multiclass Classification in NLP

Structured Output Learning

Abstract view: decomposition versus constraint classification
More details: inference with classifiers

Page 41: Multiclass Classification in NLP

Structured Output Learning: Semantic Role Labeling

I left my pearls to my child
  A0: leaver
  A1: thing left
  A2: benefactor

For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
   • Core arguments, e.g., Agent, Patient, or Instrument
   • Their adjuncts, e.g., Locative, Temporal, or Manner

Y: all possible ways to label the tree
C(Y): all valid ways to label the tree
argmax_{y ∈ C(Y)} g(x,y)

Page 42: Multiclass Classification in NLP

Components of Structured Output Learning

Input: X
Output: a collection of variables Y = (y_1,...,y_L) ∈ {1,...,K}^L
  Length is example-dependent
Constraints on the output, C(Y)
  e.g. non-overlapping, no repeated values...
  partition outputs into valid and invalid assignments
Representation: scoring function g(x,y)
  e.g. linear: g(x,y) = w · Φ(x,y)
Inference: h(x) = argmax_{valid y} g(x,y)

[Figure: output variables y_1, y_2, y_3 (Y) over the input sentence "I left my pearls to my child" (X)]

A brute-force sketch of these components follows below.
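The sketch enumerates all assignments, so it is exponential in L and only meant for intuition; the local scorer and the no-repeats constraint are toy stand-ins:

```python
from itertools import product

# Components of structured output: a scoring function g over complete
# assignments and inference restricted to the valid set C(Y).
def inference(x, L, K, g, is_valid):
    best, best_score = None, float("-inf")
    for y in product(range(K), repeat=L):   # all K^L assignments
        if not is_valid(y):                 # constraints C(Y)
            continue
        s = g(x, y)
        if s > best_score:
            best, best_score = y, s
    return best

# Toy example: the score decomposes over positions; no label may repeat.
def g(x, y):
    return sum(g_local(x, pos, label) for pos, label in enumerate(y))

def g_local(x, pos, label):                 # toy local scorer
    return 1.0 if (pos + label) % 2 == 0 else 0.0

no_repeats = lambda y: len(set(y)) == len(y)
print(inference(x=None, L=3, K=4, g=g, is_valid=no_repeats))  # (0, 1, 2)
```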

Page 43: Multiclass Classification in NLP

Decomposition-based Learning

Many choices for decomposition
  Depends on the problem, learning model, computational resources, etc.

Value-based decomposition: a function for each output value
  f_k(x, l), k ∈ {1,...,K}
  e.g. SRL tagging: f_A0(x, node), f_A1(x, node), ...
  OvA learning: f_k(x, node) > 0 iff k = y

Page 44: Multiclass Classification in NLP

Learning Discriminant Functions: The General Setting

  g(x,y) > g(x,y′)  ∀ y′ ∈ Y \ {y}
  w · Φ(x,y) > w · Φ(x,y′)  ∀ y′ ∈ Y \ {y}
  w · Φ(x,y,y′) = w · (Φ(x,y) − Φ(x,y′)) > 0
  P(x,y) = { Φ(x,y,y′) : y′ ∈ Y \ {y} }
  P(S) = ∪_{(x,y)∈S} P(x,y)

Learn a unary (binary) classifier over P(S): (+P(S), −P(S))

Used in many works [C02, WW00, CS01, CM03, TGK03]
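A sketch of building P(S); phi and the candidate generator are problem-specific stand-ins, assumed here to return NumPy vectors:

```python
# Each structured example (x, y) yields difference vectors
# phi(x, y) - phi(x, y') for every competing output y'; a single binary
# separator w is then learned over the union P(S).
def constraint_points(x, y, phi, candidates):
    good = phi(x, y)                       # phi returns a NumPy vector
    return [good - phi(x, y2) for y2 in candidates(x) if y2 != y]

def build_PS(S, phi, candidates):
    P = []
    for x, y in S:
        P.extend(constraint_points(x, y, phi, candidates))
    return P   # learn w with w . p > 0 for every p in P (e.g., perceptron)
```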

Page 45: Multiclass Classification in NLP

Structured Output Learning: Semantic Role Labeling

Learn a collection of "scoring" functions:
  w_A0 · Φ_A0(x,y,n), w_A1 · Φ_A1(x,y,n), ...
  score_v(x,y,n) = w_v · Φ_v(x,y,n)

Global score: g(x,y) = Σ_n score_{y_n}(x,y,n) = Σ_n w_{y_n} · Φ_{y_n}(x,y,n)

Learn locally (LO, L+I), for each label variable (node), e.g. n = A0:
  g_A0(x,y,n) = w_A0 · Φ_A0(x,y,n) > 0 iff y_n = A0

The discriminant model dictates: g(x,y) > g(x,y′) ∀ y′ ∈ C(Y); predict argmax_{y ∈ C(Y)} g(x,y)

Learn globally (IBT): g(x,y) = w · Φ(x,y)

[Figure: "I left my pearls to my child" with local scores, e.g. score_NONE(3), score_A2(13)]

Page 46: Multiclass Classification in NLP

Summary: OvA vs. Constraint Classification

Multi-class:
  OvA
    Learning: independent — f_i(x) > 0 iff y = i
    Evaluation: global — h(x) = argmax_i f_i(x)
  Constraint Classification
    Learning: global — find {f_i(x)} s.t. y = argmax_i f_i(x)
    Evaluation: global — h(x) = argmax_i f_i(x)

Structured Output:
  Learn + Inference
    Learning: independent — f_i(x) > 0 iff "i is a part of y"
    Evaluation: global inference — h(x) = Inference{f_i(x)}
    Efficient learning
  Inference Based Training
    Learning: global — find {f_i(x)} s.t. y = Inference{f_i(x)}
    Evaluation: global inference — h(x) = Inference{f_i(x)}
    Less efficient learning