Multiclass Classification in NLP
Page 1
Multiclass Classification in NLP Name/Entity Recognition
Label people, locations, and organizations in a sentence: [PER Sam Houston], [born in] [LOC Virginia], [was a member of the] [ORG US Congress].
Decompose into sub-problems:
[Sam Houston], born in Virginia... (PER,LOC,ORG,?) → PER (1)
Sam Houston, [born in] Virginia... (PER,LOC,ORG,?) → None (0)
Sam Houston, born in [Virginia]... (PER,LOC,ORG,?) → LOC (2)
Many problems in NLP are decomposed this way Disambiguation tasks
POS Tagging Word-sense disambiguation Verb Classification
Semantic-Role Labeling
Page 2
Outline
Multi-Categorical Classification Tasks example: Semantic Role Labeling (SRL)
Decomposition Approaches Constraint Classification
Unifies learning of multi-categorical classifiers Structured-Output Learning
revisit SRL Decomposition versus Constraint Classification
Goal: Discuss multi-class and structured output from the same
perspective. Discuss similarities and differences
Page 3
Multi-Categorical Output Tasks
Multi-class Classification (y ∈ {1,...,K}): character recognition (‘6’); document classification (‘homepage’)
Multi-label Classification (y ⊆ {1,...,K}): document classification (‘(homepage, facultypage)’)
Category Ranking (y ∈ π(K)): user preference (‘love > like > hate’); document classification (‘homepage > facultypage > sports’)
Hierarchical Classification (y ⊆ {1,...,K}): cohere with class hierarchy; place document into index where ‘soccer’ is-a ‘sport’
Page 4
(more) Multi-Categorical Output Tasks
Sequential Prediction (y ∈ {1,...,K}⁺)
e.g. POS tagging (‘(N V N N A)’)
“This is a sentence.” → This/D is/V a/D sentence/N
e.g. phrase identification
Many labels: K^L for a sentence of length L
Structured Output Prediction (y ∈ C({1,...,K}⁺))
e.g. parse tree, multi-level phrase identification
e.g. sequential prediction
Constrained by
domain, problem, data, background knowledge, etc...
Page 5
Semantic Role Labeling: A Structured-Output Problem
For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
• Core arguments, e.g., Agent, Patient, or Instrument
• Their adjuncts, e.g., Locative, Temporal, or Manner
I left my pearls to my daughter-in-law in my will.
A0 : leaver
A1 : thing left
A2 : beneficiary
AM-LOC
Page 6
Semantic Role Labeling
I left my pearls to my daughter-in-law in my will.
Many possible valid outputs. Many possible invalid outputs.
A0 - A1 A2 AM-LOC
Page 7
Structured Output Problems Multi-Class
View y = 4 as (y1,...,yk) = (0 0 0 1 0 0 0). The output is restricted by “exactly one yi = 1”. Learn f1(x),...,fk(x)
Sequence Prediction: e.g. POS tagging: x = (My name is Dav), y = (Pr, N, V, N); e.g. restriction: “every sentence must have a verb”
Structured Output: arbitrary global constraints. Local functions do not have access to global constraints!
Goal: Discuss multi-class and structured output from the same perspective. Discuss similarities and differences
Page 8
Transform the sub-problems
Sam Houston, born in Virginia... (PER,LOC,ORG,?) PER (1)
Transform each problem to feature vector Sam Houston, born in Virginia
Features: (Bob-, JOHN-, SAM HOUSTON, HAPPY, -BORN, --BORN, ...) → (0, 0, 1, 0, 1, 1, ...)
Transform each label to a class label: PER → 1, LOC → 2, ORG → 3, ? → 0
Input: {0,1}^d or R^d
Output: {0, 1, 2, 3, ..., k}
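The transformation above can be sketched in code. The feature templates here (current word, capitalization, neighboring words) are illustrative assumptions, not the slide's exact feature set:

```python
# Minimal sketch: map a token in context to binary features and map
# its entity tag to a class id. Feature templates are illustrative.

LABELS = {"?": 0, "PER": 1, "LOC": 2, "ORG": 3}

def features(tokens, i):
    """Binary features for the token at position i."""
    return {
        "word=" + tokens[i].lower(): 1,
        "capitalized": int(tokens[i][0].isupper()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"): 1,
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"): 1,
    }

tokens = "Sam Houston , born in Virginia".split()
x = features(tokens, 5)   # features for "Virginia"
y = LABELS["LOC"]         # class label 2
```

A real system would hash or index these feature names into a fixed-dimensional {0,1}^d vector, as the slide describes.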
Page 9
Solving multiclass with binary learning
Multiclass classifier: a function f : R^d → {1, 2, ..., k}
Decompose into binary problems
Not always possible to learn No theoretical justification (unless the problem is easy)
Page 10
The Real MultiClass Problem
General framework Extend binary algorithms Theoretically justified
Provably correct Generalizes well
Verified Experimentally
Naturally extends binary classification algorithms to the multiclass setting, e.g. linear binary separation induces linear boundaries in the multiclass setting
Page 11
Multi Class over Linear Functions
– One versus all (OvA)
– Direct winner-take-all (D-WTA)
– All versus all (AvA)
Page 12
WTA over linear functions
Assume examples generated from winner-take-all: y = argmax_i w_i·x + t_i
w_i, x ∈ R^n, t_i ∈ R
• Note: Voronoi diagrams are WTA functions: argmin_i ||c_i − x|| = argmax_i c_i·x − ||c_i||²/2
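The Voronoi/WTA equivalence can be checked numerically; the centers and test point below are arbitrary:

```python
# Numeric check that nearest-center (Voronoi) classification equals a
# winner-take-all rule with w_i = c_i and t_i = -||c_i||^2 / 2.
import numpy as np

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))   # 4 class centers in R^3
x = rng.normal(size=3)

voronoi = int(np.argmin([np.linalg.norm(c - x) for c in centers]))
wta = int(np.argmax([c @ x - (c @ c) / 2 for c in centers]))
assert voronoi == wta
```

The identity follows from expanding ||c − x||² = ||c||² − 2c·x + ||x||² and dropping the ||x||² term, which is the same for every class.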
Page 13
Learning via One-Versus-All (OvA) Assumption: find v_r, v_b, v_g, v_y ∈ R^n such that
v_r·x > 0 iff y = red; v_b·x > 0 iff y = blue; v_g·x > 0 iff y = green; v_y·x > 0 iff y = yellow
Classifier: f(x) = argmax_i v_i·x
Individual Classifiers
Decision Regions
H = R^{kn}
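A minimal OvA sketch with perceptron sub-learners; the choice of perceptron and the toy data are assumptions of this sketch, since any binary algorithm fits in the same slot:

```python
# One-versus-All sketch: train one perceptron per class on a
# class-vs-rest relabeling; predict with the argmax of the raw scores.
import numpy as np

def train_ova(X, y, k, epochs=20):
    d = X.shape[1]
    V = np.zeros((k, d))                        # one weight vector per class
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            for c in range(k):
                target = 1 if yi == c else -1   # class c vs. rest
                if target * (V[c] @ xi) <= 0:   # mistake on binary task c
                    V[c] += target * xi
    return V

def predict(V, x):
    return int(np.argmax(V @ x))
```

Note the evaluation rule is global (argmax over all classes) even though each v_c was trained independently, which is exactly the decomposition the slides critique later.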
Page 14
Learning via All-versus-All (AvA) Assumption: find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^n such that
v_rb·x > 0 if y = red, < 0 if y = blue
v_rg·x > 0 if y = red, < 0 if y = green ... (for all pairs)
Individual Classifiers
Decision Regions
H = R^{k(k−1)n/2}
How to classify?
Page 15
Classifying with AvA
Tree
1 red, 2 yellow, 2 green ?
Majority Vote
Tournament
All are post-learning heuristics and can give inconsistent results
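Of the three decoding strategies, majority vote is the simplest to sketch; the pairwise-dictionary representation below is an assumption of this sketch:

```python
# All-versus-All decoding sketch with majority vote: one binary
# classifier per pair of classes; each casts a vote, argmax of votes wins.
import numpy as np

def predict_ava(pairwise, x, k):
    """pairwise[(i, j)] is a weight vector with w.x > 0 meaning class i."""
    votes = np.zeros(k, dtype=int)
    for (i, j), w in pairwise.items():
        votes[i if w @ x > 0 else j] += 1     # loser of the pair gets no vote
    return int(np.argmax(votes))
```

Tree and tournament decoding differ only in which pairwise classifiers are consulted and in what order; all three are applied after learning, which is why they can disagree with each other.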
Page 16
Summary (1): Learning Binary Classifiers
On-Line: Perceptron, Winnow Mistake bounded Generalizes well (VC-Dim) Works well in practice
SVM Well motivated to maximize margin Generalizes well Works well in practice
Boosting, Neural Networks, etc...
Page 17
From Binary to Multi-categorical
Decompose multi-categorical problems into multiple (independent) binary problems
Multi-class: OvA, AvA, ECOC, DT, etc... Multi-label: reduce to multi-class. Category Ranking: reduce to multi-class or use regression. Sequence Prediction:
reduce to multi-class; part/alphabet-based decompositions
Structured Output: learn parts of output based on local information!!!
Page 18
Problems with Decompositions
Learning optimizes over local metrics → poor global performance
What is the metric? We don’t care about the performance of the local classifiers
Poor decomposition → poor performance: difficult local problems, irrelevant local problems
Not clear how to decompose all Multi-category problems
Page 19
Multi-class OvA Decomposition: a Linear Representation
Hypothesis: h(x) = argmax_i v_i·x
Decomposition: each class represented by a linear function v_i·x
Learning: One-versus-all (OvA): for each class i, v_i·x > 0 iff i = y
General case: each class represented by a function f_i(x) > 0
Page 20
Learning via One-Versus-All (OvA) Assumption
Classifier: f(x) = argmax_i v_i·x
Individual Classifiers
OvA Learning: find v_i such that v_i·x > 0 iff y = i
OvA is fine only if the data is OvA separable! A linear classifier can represent this function!
(Voronoi) argmin_i d(c_i, x)   (WTA) argmax_i c_i·x + d_i
Page 21
Other Issues we Mentioned
Error Correcting Output Codes Another (class of) decomposition Difficulty: how to make sure that the resulting problems are separable.
Commented on the advantage of All vs. All when working with the dual space (e.g., kernels)
Page 22
Example: SNoW Multi-class Classifier
Targets (each an LTU)
Features
Weighted edges (weight vectors)
SNoW only represents the targets and weighted edges
How do we train?
How do we evaluate?
Page 23
Winnow: Extensions
Winnow learns monotone boolean functions To learn non-monotone boolean functions:
For each variable x, introduce x’ = ¬x Learn monotone functions over 2n variables
To learn functions with real-valued inputs, use “Balanced Winnow”: 2 weights per variable; the effective weight is their difference. Update rule:
If sign[(w⁺ − w⁻)·x] ≠ y:  wi⁺ ← wi⁺ · r^(y·xi),  wi⁻ ← wi⁻ · r^(−y·xi)
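The update rule above can be sketched directly, assuming labels y ∈ {−1, +1}, a zero threshold, and a promotion factor r > 1:

```python
# Balanced Winnow sketch: two positive weight vectors; the effective
# weight is their difference. On a mistake, the update is multiplicative:
# w+_i *= r**(y*x_i)  and  w-_i *= r**(-y*x_i).
import numpy as np

def balanced_winnow(X, y, r=1.5, epochs=10):
    d = X.shape[1]
    w_pos = np.ones(d)
    w_neg = np.ones(d)
    for _ in range(epochs):
        for xi, yi in zip(X, y):                     # yi in {-1, +1}
            if yi * ((w_pos - w_neg) @ xi) <= 0:     # mistake
                w_pos *= r ** (yi * xi)
                w_neg *= r ** (-yi * xi)
    return w_pos, w_neg
```

Both weight vectors stay positive; negative effective weights arise only through the difference, which is what makes non-monotone functions learnable.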
Page 24
An Intuition: Balanced Winnow
In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples.
Typically, we train each node separately (my/not-my example). Rather, given an example we could say: this is more a + example than a − example.
We compare the activation of the different target nodes (classifiers) on a given example. (This example is more class + than class −.)
If sign[(w⁺ − w⁻)·x] ≠ y:  wi⁺ ← wi⁺ · r^(y·xi),  wi⁻ ← wi⁻ · r^(−y·xi)
Page 25
Constraint Classification
Can be viewed as a generalization of the balanced Winnow to the multi-class case
Unifies multi-class, multi-label, category-ranking Reduces learning to a single binary learning task Captures theoretical properties of binary algorithm Experimentally verified Naturally extends Perceptron, SVM, etc...
Do all of this by representing labels as a set of constraints or preferences among output labels.
Page 26
Multi-category to Constraint Classification
Multiclass: (x, A) → (x, (A>B, A>C, A>D))
Multilabel: (x, (A, B)) → (x, (A>C, A>D, B>C, B>D))
Label Ranking: (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))
Examples: (x, y), y ∈ S_k
S_k: partial order over class labels {1,...,k}; defines a “preference” relation (>) for class labeling
Constraint Classifier h: X → S_k
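The three transformations above can be sketched directly; the helper names are hypothetical:

```python
# Sketch: turn multiclass / multilabel / ranking annotations into
# pairwise preference constraints (preferred, less_preferred).

def multiclass_constraints(y, labels):
    """The single true label is preferred over every other label."""
    return [(y, j) for j in labels if j != y]

def multilabel_constraints(ys, labels):
    """Every true label is preferred over every non-label."""
    return [(i, j) for i in ys for j in labels if j not in ys]

def ranking_constraints(order):
    """A total order decomposes into adjacent pairwise preferences."""
    return [(order[t], order[t + 1]) for t in range(len(order) - 1)]
```

Each task thus reduces to the same object: a set of (i > j) constraints per example, which is what the Kesler construction below consumes.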
Page 27
Learning Constraint Classification: Kesler Construction
Transform Examples
2>1, 2>3, 2>4
i>j ⇒ fi(x) − fj(x) > 0 ⇒ wi·x − wj·x > 0 ⇒ W·Xi − W·Xj > 0 ⇒ W·(Xi − Xj) > 0 ⇒ W·Xij > 0
Xi = (0, x, 0, 0) ∈ R^{kd}
Xj = (0, 0, 0, x) ∈ R^{kd}
Xij = Xi − Xj = (0, x, 0, −x)
W = (w1, w2, w3, w4) ∈ R^{kd}
Page 28
Kesler’s Construction (1)
y = argmax_{i∈{r,b,g,y}} v_i·x;  v_i, x ∈ R^n
Find v_r, v_b, v_g, v_y ∈ R^n such that v_r·x > v_b·x
v_r·x > v_g·x
v_r·x > v_y·x
H = R^{kn}
Page 29
Kesler’s Construction (2)
Let v = (v_r, v_b, v_g, v_y) ∈ R^{kn}
Let 0^n be the n-dim zero vector
v_r·x > v_b·x ⇔ v·(x, −x, 0^n, 0^n) > 0 ⇔ v·(−x, x, 0^n, 0^n) < 0
v_r·x > v_g·x ⇔ v·(x, 0^n, −x, 0^n) > 0 ⇔ v·(−x, 0^n, x, 0^n) < 0
v_r·x > v_y·x ⇔ v·(x, 0^n, 0^n, −x) > 0 ⇔ v·(−x, 0^n, 0^n, x) < 0
Page 30
Kesler’s Construction (3)
Let v = (v_1, ..., v_k) ∈ R^n × ... × R^n = R^{kn}
x_ij = (0^{(i−1)n}, x, 0^{(k−i)n}) − (0^{(j−1)n}, x, 0^{(k−j)n}) ∈ R^{kn}
Given (x, y) ∈ R^n × {1,...,k}, for all j ≠ y:
Add to P⁺(x,y): (x_yj, 1). Add to P⁻(x,y): (−x_yj, −1)
P⁺(x,y) has k−1 positive examples (in R^{kn}); P⁻(x,y) has k−1 negative examples (in R^{kn})
Page 31
Learning via Kesler’s Construction
Given (x_1, y_1), ..., (x_N, y_N) ∈ R^n × {1,...,k}, create
P⁺ = ∪ P⁺(x_i, y_i)   P⁻ = ∪ P⁻(x_i, y_i)
Find v = (v_1, ..., v_k) ∈ R^{kn} such that v·x separates P⁺ from P⁻
Output: f(x) = argmax_i v_i·x
Page 32
Constraint Classification
Examples: (x, y), y ∈ S_k
S_k: partial order over class labels {1,...,k}; defines a “preference” relation (>) for class labels
e.g. Multiclass: 2>1, 2>3, 2>4, 2>5
e.g. Multilabel: 1>3, 1>4, 1>5, 2>3, 2>4, 2>5
Constraint Classifier f: X → S_k
f(x) is a partial order; f(x) is consistent with y if (i>j) ∈ y ⇒ (i>j) ∈ f(x)
Page 33
Implementation
Examples: (x, y), y ∈ S_k
S_k: partial order over class labels {1,...,k}; defines a “preference” relation (>) for class labels. e.g. Multiclass: 2>1, 2>3, 2>4, 2>5
Given an example that is labeled 2, the activation of target 2 on it, should be larger than the activation of the other targets.
SNoW implementation: conservative. Only two target nodes are compared: the one corresponding to the correct label and the one with the
highest activation. If both are the same target node, no change; otherwise, promote one and demote the other.
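The conservative update can be sketched as below, using a simple additive promote/demote for readability (an assumption of this sketch: SNoW's targets actually use Winnow-style multiplicative updates):

```python
# Conservative multiclass update sketch: compare only the correct
# class's target with the current highest-activation target. On a
# mistake, promote the correct class and demote the winner; all other
# weight vectors are untouched.
import numpy as np

def conservative_step(V, x, y, lr=1.0):
    winner = int(np.argmax(V @ x))
    if winner != y:          # mistake: correct target vs. winner differ
        V[y] += lr * x       # promote correct label's target
        V[winner] -= lr * x  # demote the over-activated target
    return V
```

Because at most two rows of V change per example, the update is conservative in exactly the sense the slide describes.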
Page 34
Properties of Construction
Can learn any argmax v_i·x function. Can use any algorithm to find linear separation:
Perceptron Algorithm → ultraconservative online algorithm [Crammer, Singer 2001]
Winnow Algorithm → multiclass Winnow [Mesterharm 2000]
Defines a multiclass margin by binary margin in R^{kd} → multiclass SVM [Crammer, Singer 2001]
Page 35
Margin Generalization Bounds
Linear hypothesis space: h(x) = argsort_i v_i·x
v_i, x ∈ R^d; argsort returns a permutation of {1,...,k}
CC margin-based bound: γ = min_{(x,y)∈S} min_{(i>j)∈y} v_i·x − v_j·x
err_D(h) ≤ (C/m) · (R²/γ²) · 2 ln(1/δ)
m: number of examples; R: max_x ||x||; δ: confidence; C: average number of constraints
Page 36
VC-style Generalization Bounds
Linear hypothesis space: h(x) = argsort_i v_i·x
v_i, x ∈ R^d; argsort returns a permutation of {1,...,k}
CC VC-based bound:
err_D(h) ≤ err(S,h) + √( (kd log(mk/d) − ln δ) / m )
m: number of examples; d: dimension of input space; δ: confidence; k: number of classes
Page 37
Beyond Multiclass Classification
Ranking category ranking (over classes) ordinal regression (over examples)
Multilabel x is both red and blue
Complex relationships x is more red than blue, but not green
Millions of classes sequence labeling (e.g. POS tagging) LATER
SNoW has an implementation of Constraint Classification for the Multi-Class case. Try to compare with 1-vs-all.
Experimental Issues: when is this version of multi-class better? Several easy improvements are possible via modifying the loss function.
Page 38
Multi-class Experiments
Picture isn’t so clear for very high dimensional problems. Why?
Page 39
Summary
OvA:
Learning: independent; fi(x) > 0 iff y = i
Evaluation: global; h(x) = argmax_i fi(x)
Constraint Classification:
Learning: global; find {fi(x)} s.t. y = argmax_i fi(x)
Evaluation: global; h(x) = argmax_i fi(x)
Learn + Inference:
Learning: independent; fi(x) > 0 iff “i is a part of y”
Evaluation: global inference; h(x) = argmax_{y∈C} Σ fi(x)
Inference Based Training:
Learning: global; find {fi(x)} s.t. y = argmax_{y∈C} Σ fi(x)
Evaluation: global inference; h(x) = argmax_{y∈C} Σ fi(x)
Page 40
Structured Output Learning
Abstract View: Decomposition versus Constraint Classification
More details: Inference with Classifiers
Page 41
Structured Output Learning: Semantic Role Labeling
I left my pearls to my child
A0 : leaver
A1 : thing left
A2 : beneficiary
For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
• Core arguments, e.g., Agent, Patient, or Instrument
• Their adjuncts, e.g., Locative, Temporal, or Manner
Y: all possible ways to label the tree; C(Y): all valid ways to label the tree; argmax_{y∈C(Y)} g(x,y)
Page 42
Components of Structured Output Learning
Input: X. Output: a collection of variables
Y = (y1,...,yL) ∈ {1,...,K}^L
Length is example dependent. Constraints on the output: C(Y)
e.g. non-overlapping, no repeated values...; partitions outputs into valid and invalid assignments
Representation: scoring function g(x,y), e.g. linear: g(x,y) = w·Φ(x,y)
Inference: h(x) = argmax_{valid y} g(x,y)
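For tiny K and L, the inference step h(x) = argmax over valid y can be sketched by brute-force enumeration; the no-repeated-values constraint and per-position local scores below are illustrative choices:

```python
# Brute-force inference sketch: enumerate all label sequences of
# length L, keep only those satisfying the constraint, return the
# argmax of the score. Feasible only for tiny K and L; real systems
# use dynamic programming or integer linear programming instead.
from itertools import product

def infer(L, K, score, valid):
    """argmax over y in {0..K-1}^L with valid(y) True; score(y) -> float."""
    best, best_s = None, float("-inf")
    for y in product(range(K), repeat=L):
        if valid(y):
            s = score(y)
            if s > best_s:
                best, best_s = y, s
    return best
```

With a linear score that sums local per-position scores, the constraint is what makes the problem global: the unconstrained argmax decomposes position by position, but the constrained one does not.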
(Figure: output variables y1, y2, y3 over Y; input X: “I left my pearls to my child”.)
Page 43
Decomposition-based Learning
Many choices for decomposition. Depends on problem, learning model, computation resources, etc...
Value-based decomposition: a function for each output value
f_k(x, l), k ∈ {1,...,K}
e.g. SRL tagging: f_A0(x, node), f_A1(x, node), ...
OvA learning: f_k(x, node) > 0 iff k = y
Page 44
Learning Discriminant Functions: The General Setting
g(x,y) > g(x,y’) ∀ y’ ∈ Y \ {y}
w·Φ(x,y) > w·Φ(x,y’) ∀ y’ ∈ Y \ {y}
w·Φ(x,y,y’) = w·(Φ(x,y) − Φ(x,y’)) > 0
P(x,y) = {Φ(x,y,y’)} for y’ ∈ Y \ {y}
P(S) = {P(x,y)} for (x,y) ∈ S
Learn a unary (binary) classifier over P(S): (+P(S), −P(S))
Used in many works [C02, WW00, CS01, CM03, TGK03]
Page 45
Learn a collection of “scoring” functions: w_A0·Φ_A0(x,y,n), w_A1·Φ_A1(x,y,n), ...
score_v(x,y,n) = w_v·Φ_v(x,y,n)
Global score: g(x,y) = Σ_n score_{yn}(x,y,n) = Σ_n w_{yn}·Φ_{yn}(x,y,n)
Learn locally (LO, L+I): for each label variable (node), e.g. n = A0:
g_A0(x,y,n) = w_A0·Φ_A0(x,y,n) > 0 iff y_n = A0
Discriminant model dictates: g(x,y) > g(x,y’) ∀ y’ ∈ C(Y); argmax_{y∈C(Y)} g(x,y)
Learn globally (IBT): g(x,y) = w·Φ(x,y)
Structured Output Learning: Semantic Role Labeling
(Figure: “I left my pearls to my child” with node scores, e.g. score_NONE(3), score_A2(13).)
Page 46
Summary
OvA:
Learning: independent; fi(x) > 0 iff y = i
Evaluation: global; h(x) = argmax_i fi(x)
Constraint Classification:
Learning: global; find {fi(x)} s.t. y = argmax_i fi(x)
Evaluation: global; h(x) = argmax_i fi(x)
Learn + Inference:
Learning: independent; fi(x) > 0 iff “i is a part of y”
Evaluation: global inference; h(x) = Inference{fi(x)}
Efficient learning
Inference Based Training:
Learning: global; find {fi(x)} s.t. y = Inference{fi(x)}
Evaluation: global inference; h(x) = Inference{fi(x)}
Less efficient learning
(Row labels: Multi-class; Structured Output)