Multiclass Classification in NLP
Page 1
Multiclass Classification in NLP Name/Entity Recognition
Label people, locations, and organizations in a sentence: [PER Sam Houston], [born in] [LOC Virginia], [was a member of the] [ORG US Congress].
Decompose into sub-problems:
[Sam Houston], born in Virginia... (PER,LOC,ORG,?) → PER (1)
Sam Houston, [born in] Virginia... (PER,LOC,ORG,?) → None (0)
Sam Houston, born in [Virginia]... (PER,LOC,ORG,?) → LOC (2)
Many problems in NLP are decomposed this way Disambiguation tasks
POS Tagging Word-sense disambiguation Verb Classification
Semantic-Role Labeling
Page 2
Outline
Multi-Categorical Classification Tasks example: Semantic Role Labeling (SRL)
Decomposition Approaches Constraint Classification
Unifies learning of multi-categorical classifiers Structured-Output Learning
revisit SRL Decomposition versus Constraint Classification
Goal: Discuss multi-class and structured output from the same
perspective. Discuss similarities and differences
Page 3
Multi-Categorical Output Tasks
Multi-class Classification (y ∈ {1,...,K}): character recognition (‘6’); document classification (‘homepage’)
Multi-label Classification (y ⊆ {1,...,K}): document classification (‘(homepage, facultypage)’)
Category Ranking (y ∈ π(K)): user preference (‘love > like > hate’); document classification (‘homepage > facultypage > sports’)
Hierarchical Classification (y ⊆ {1,...,K}): cohere with class hierarchy; place document into index where ‘soccer’ is-a ‘sport’
Page 4
(more) Multi-Categorical Output Tasks
Sequential Prediction (y ∈ {1,...,K}⁺)
e.g. POS tagging (‘(N V N N A)’)
“This is a sentence.” → This/D is/V a/D sentence/N
e.g. phrase identification
Many labels: K^L for a sentence of length L
Structured Output Prediction (y ∈ C({1,...,K}⁺))
e.g. parse tree, multi-level phrase identification
e.g. sequential prediction
Constrained by
domain, problem, data, background knowledge, etc...
Page 5
Semantic Role Labeling: A Structured-Output Problem
For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
• Core arguments, e.g., Agent, Patient, or Instrument
• Their adjuncts, e.g., Locative, Temporal, or Manner
I left my pearls to my daughter-in-law in my will.
A0 : leaver
A1 : thing left
A2 : beneficiary
AM-LOC
Page 6
Semantic Role Labeling
I left my pearls to my daughter-in-law in my will.
Many possible valid outputs. Many possible invalid outputs.
A0 - A1 A2 AM-LOC
Page 7
Structured Output Problems Multi-Class
View y = 4 as (y1,...,yk) = (0 0 0 1 0 0 0). The output is restricted by “exactly one yi = 1”. Learn f1(x),...,fk(x)
Sequence Prediction: e.g. POS tagging: x = (My name is Dav), y = (Pr, N, V, N); e.g. restriction: “every sentence must have a verb”
Structured Output: arbitrary global constraints. Local functions do not have access to global constraints!
Goal: Discuss multi-class and structured output from the same perspective. Discuss similarities and differences
Page 8
Transform the sub-problems
Sam Houston, born in Virginia... (PER,LOC,ORG,?) PER (1)
Transform each problem to feature vector Sam Houston, born in Virginia
Features: (Bob-, JOHN-, SAM HOUSTON, HAPPY, -BORN, --BORN, ...) → (0, 0, 1, 0, 1, 1, ...)
Transform each label to a class label: PER → 1, LOC → 2, ORG → 3, ? → 0
Input: {0,1}^d or R^d
Output: {0, 1, 2, 3, ..., k}
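The transformation above can be sketched in code. The feature templates here (current word, capitalization, neighboring words) are illustrative assumptions, not the slide's exact feature set:

```python
# Minimal sketch: map a token in context to binary features and map
# its entity tag to a class id. Feature templates are illustrative.

LABELS = {"?": 0, "PER": 1, "LOC": 2, "ORG": 3}

def features(tokens, i):
    """Binary features for the token at position i."""
    return {
        "word=" + tokens[i].lower(): 1,
        "capitalized": int(tokens[i][0].isupper()),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"): 1,
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"): 1,
    }

tokens = "Sam Houston , born in Virginia".split()
x = features(tokens, 5)   # features for "Virginia"
y = LABELS["LOC"]         # class label 2
```

A real system would hash or index these feature names into a fixed-dimensional {0,1}^d vector, as the slide describes.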
Page 9
Solving multiclass with binary learning
Multiclass classifier: a function f : R^d → {1, 2, ..., k}
Decompose into binary problems
Not always possible to learn No theoretical justification (unless the problem is easy)
Page 10
The Real MultiClass Problem
General framework Extend binary algorithms Theoretically justified
Provably correct Generalizes well
Verified Experimentally
Naturally extends binary classification algorithms to the multiclass setting, e.g. linear binary separation induces linear boundaries in the multiclass setting
Page 11
Multi Class over Linear Functions
– One versus all (OvA)
– Direct winner-take-all (D-WTA)
– All versus all (AvA)
Page 12
WTA over linear functions
Assume examples generated from winner-take-all: y = argmax_i w_i·x + t_i
w_i, x ∈ R^n, t_i ∈ R
• Note: Voronoi diagrams are WTA functions: argmin_i ||c_i − x|| = argmax_i c_i·x − ||c_i||²/2
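The Voronoi/WTA equivalence can be checked numerically; the centers and test point below are arbitrary:

```python
# Numeric check that nearest-center (Voronoi) classification equals a
# winner-take-all rule with w_i = c_i and t_i = -||c_i||^2 / 2.
import numpy as np

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))   # 4 class centers in R^3
x = rng.normal(size=3)

voronoi = int(np.argmin([np.linalg.norm(c - x) for c in centers]))
wta = int(np.argmax([c @ x - (c @ c) / 2 for c in centers]))
assert voronoi == wta
```

The identity follows from expanding ||c − x||² = ||c||² − 2c·x + ||x||² and dropping the ||x||² term, which is the same for every class.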
Page 13
Learning via One-Versus-All (OvA) Assumption: find v_r, v_b, v_g, v_y ∈ R^n such that
v_r·x > 0 iff y = red; v_b·x > 0 iff y = blue; v_g·x > 0 iff y = green; v_y·x > 0 iff y = yellow
Classifier: f(x) = argmax_i v_i·x
Individual Classifiers
Decision Regions
H = R^{kn}
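A minimal OvA sketch with perceptron sub-learners; the choice of perceptron and the toy data are assumptions of this sketch, since any binary algorithm fits in the same slot:

```python
# One-versus-All sketch: train one perceptron per class on a
# class-vs-rest relabeling; predict with the argmax of the raw scores.
import numpy as np

def train_ova(X, y, k, epochs=20):
    d = X.shape[1]
    V = np.zeros((k, d))                        # one weight vector per class
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            for c in range(k):
                target = 1 if yi == c else -1   # class c vs. rest
                if target * (V[c] @ xi) <= 0:   # mistake on binary task c
                    V[c] += target * xi
    return V

def predict(V, x):
    return int(np.argmax(V @ x))
```

Note the evaluation rule is global (argmax over all classes) even though each v_c was trained independently, which is exactly the decomposition the slides critique later.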
Page 14
Learning via All-versus-All (AvA) Assumption: find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^n such that
v_rb·x > 0 if y = red, < 0 if y = blue
v_rg·x > 0 if y = red, < 0 if y = green ... (for all pairs)
Individual Classifiers
Decision Regions
H = R^{k(k−1)n/2}
How to classify?
Page 15
Classifying with AvA
Tree
1 red, 2 yellow, 2 green ?
Majority Vote
Tournament
All are post-learning heuristics and can give inconsistent results
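Of the three decoding strategies, majority vote is the simplest to sketch; the pairwise-dictionary representation below is an assumption of this sketch:

```python
# All-versus-All decoding sketch with majority vote: one binary
# classifier per pair of classes; each casts a vote, argmax of votes wins.
import numpy as np

def predict_ava(pairwise, x, k):
    """pairwise[(i, j)] is a weight vector with w.x > 0 meaning class i."""
    votes = np.zeros(k, dtype=int)
    for (i, j), w in pairwise.items():
        votes[i if w @ x > 0 else j] += 1     # loser of the pair gets no vote
    return int(np.argmax(votes))
```

Tree and tournament decoding differ only in which pairwise classifiers are consulted and in what order; all three are applied after learning, which is why they can disagree with each other.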
Page 16
Summary (1): Learning Binary Classifiers
On-Line: Perceptron, Winnow Mistake bounded Generalizes well (VC-Dim) Works well in practice
SVM Well motivated to maximize margin Generalizes well Works well in practice
Boosting, Neural Networks, etc...
Page 17
From Binary to Multi-categorical
Decompose multi-categorical problems into multiple (independent) binary problems
Multi-class: OvA, AvA, ECOC, DT, etc... Multi-label: reduce to multi-class. Category Ranking: reduce to multi-class or use regression. Sequence Prediction:
reduce to multi-class; part/alphabet-based decompositions
Structured Output: learn parts of output based on local information!!!
Page 18
Problems with Decompositions
Learning optimizes over local metrics → poor global performance
What is the metric? We don’t care about the performance of the local classifiers
Poor decomposition → poor performance: difficult local problems, irrelevant local problems
Not clear how to decompose all Multi-category problems
Page 19
Multi-class OvA Decomposition: a Linear Representation
Hypothesis: h(x) = argmax_i v_i·x
Decomposition: each class represented by a linear function v_i·x
Learning: One-versus-all (OvA): for each class i, v_i·x > 0 iff i = y
General case: each class represented by a function f_i(x) > 0
Page 20
Learning via One-Versus-All (OvA) Assumption
Classifier: f(x) = argmax_i v_i·x
Individual Classifiers
OvA Learning: find v_i such that v_i·x > 0 iff y = i
OvA is fine only if the data is OvA separable! A linear classifier can represent this function!
(Voronoi) argmin_i d(c_i, x)   (WTA) argmax_i c_i·x + d_i
Page 21
Other Issues we Mentioned
Error Correcting Output Codes Another (class of) decomposition Difficulty: how to make sure that the resulting problems are separable.
Commented on the advantage of All vs. All when working with the dual space (e.g., kernels)
Page 22
Example: SNoW Multi-class Classifier
Targets (each an LTU)
Features
Weighted edges (weight vectors)
SNoW only represents the targets and weighted edges
How do we train?
How do we evaluate?
Page 23
Winnow: Extensions
Winnow learns monotone boolean functions To learn non-monotone boolean functions:
For each variable x, introduce x’ = ¬x Learn monotone functions over 2n variables
To learn functions with real-valued inputs, use “Balanced Winnow”: 2 weights per variable; the effective weight is their difference. Update rule:
If sign[(w⁺ − w⁻)·x] ≠ y:  wi⁺ ← wi⁺ · r^(y·xi),  wi⁻ ← wi⁻ · r^(−y·xi)
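The update rule above can be sketched directly, assuming labels y ∈ {−1, +1}, a zero threshold, and a promotion factor r > 1:

```python
# Balanced Winnow sketch: two positive weight vectors; the effective
# weight is their difference. On a mistake, the update is multiplicative:
# w+_i *= r**(y*x_i)  and  w-_i *= r**(-y*x_i).
import numpy as np

def balanced_winnow(X, y, r=1.5, epochs=10):
    d = X.shape[1]
    w_pos = np.ones(d)
    w_neg = np.ones(d)
    for _ in range(epochs):
        for xi, yi in zip(X, y):                     # yi in {-1, +1}
            if yi * ((w_pos - w_neg) @ xi) <= 0:     # mistake
                w_pos *= r ** (yi * xi)
                w_neg *= r ** (-yi * xi)
    return w_pos, w_neg
```

Both weight vectors stay positive; negative effective weights arise only through the difference, which is what makes non-monotone functions learnable.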
Page 24
An Intuition: Balanced Winnow
In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples.
Typically, we train each node separately (my/not-my example). Rather, given an example we could say: this is more a + example than a − example.
We compare the activation of the different target nodes (classifiers) on a given example. (This example is more class + than class −.)
If sign[(w⁺ − w⁻)·x] ≠ y:  wi⁺ ← wi⁺ · r^(y·xi),  wi⁻ ← wi⁻ · r^(−y·xi)
Page 25
Constraint Classification
Can be viewed as a generalization of the balanced Winnow to the multi-class case
Unifies multi-class, multi-label, category-ranking Reduces learning to a single binary learning task Captures theoretical properties of binary algorithm Experimentally verified Naturally extends Perceptron, SVM, etc...
Do all of this by representing labels as a set of constraints or preferences among output labels.
Page 26
Multi-category to Constraint Classification
Multiclass: (x, A) → (x, (A>B, A>C, A>D))
Multilabel: (x, (A, B)) → (x, (A>C, A>D, B>C, B>D))
Label Ranking: (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))
Examples: (x, y), y ∈ S_k
S_k: partial order over class labels {1,...,k}; defines a “preference” relation (>) for class labeling
Constraint Classifier h: X → S_k
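The three transformations above can be sketched directly; the helper names are hypothetical:

```python
# Sketch: turn multiclass / multilabel / ranking annotations into
# pairwise preference constraints (preferred, less_preferred).

def multiclass_constraints(y, labels):
    """The single true label is preferred over every other label."""
    return [(y, j) for j in labels if j != y]

def multilabel_constraints(ys, labels):
    """Every true label is preferred over every non-label."""
    return [(i, j) for i in ys for j in labels if j not in ys]

def ranking_constraints(order):
    """A total order decomposes into adjacent pairwise preferences."""
    return [(order[t], order[t + 1]) for t in range(len(order) - 1)]
```

Each task thus reduces to the same object: a set of (i > j) constraints per example, which is what the Kesler construction below consumes.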
Page 27
Learning Constraint Classification: Kesler Construction
Transform Examples
2>1, 2>3, 2>4
i>j ⇒ fi(x) − fj(x) > 0 ⇒ wi·x − wj·x > 0 ⇒ W·Xi − W·Xj > 0 ⇒ W·(Xi − Xj) > 0 ⇒ W·Xij > 0
Xi = (0, x, 0, 0) ∈ R^{kd}
Xj = (0, 0, 0, x) ∈ R^{kd}
Xij = Xi − Xj = (0, x, 0, −x)
W = (w1, w2, w3, w4) ∈ R^{kd}
Page 28
Kesler’s Construction (1)
y = argmax_{i∈{r,b,g,y}} v_i·x;  v_i, x ∈ R^n
Find v_r, v_b, v_g, v_y ∈ R^n such that v_r·x > v_b·x
v_r·x > v_g·x
v_r·x > v_y·x
H = R^{kn}
Page 29
Kesler’s Construction (2)
Let v = (v_r, v_b, v_g, v_y) ∈ R^{kn}
Let 0^n be the n-dim zero vector
v_r·x > v_b·x ⇔ v·(x, −x, 0^n, 0^n) > 0 ⇔ v·(−x, x, 0^n, 0^n) < 0
v_r·x > v_g·x ⇔ v·(x, 0^n, −x, 0^n) > 0 ⇔ v·(−x, 0^n, x, 0^n) < 0
v_r·x > v_y·x ⇔ v·(x, 0^n, 0^n, −x) > 0 ⇔ v·(−x, 0^n, 0^n, x) < 0
Page 30
Kesler’s Construction (3)
Let v = (v_1, ..., v_k) ∈ R^n × ... × R^n = R^{kn}
x_ij = (0^{(i−1)n}, x, 0^{(k−i)n}) − (0^{(j−1)n}, x, 0^{(k−j)n}) ∈ R^{kn}
Given (x, y) ∈ R^n × {1,...,k}, for all j ≠ y:
Add to P⁺(x,y): (x_yj, 1). Add to P⁻(x,y): (−x_yj, −1)
P⁺(x,y) has k−1 positive examples (in R^{kn}); P⁻(x,y) has k−1 negative examples (in R^{kn})
Page 31
Learning via Kesler’s Construction
Given (x_1, y_1), ..., (x_N, y_N) ∈ R^n × {1,...,k}, create
P⁺ = ∪ P⁺(x_i, y_i)   P⁻ = ∪ P⁻(x_i, y_i)
Find v = (v_1, ..., v_k) ∈ R^{kn} such that v·x separates P⁺ from P⁻
Output: f(x) = argmax_i v_i·x
Page 32
Constraint Classification
Examples: (x, y), y ∈ S_k
S_k: partial order over class labels {1,...,k}; defines a “preference” relation (>) for class labels
e.g. Multiclass: 2>1, 2>3, 2>4, 2>5
e.g. Multilabel: 1>3, 1>4, 1>5, 2>3, 2>4, 2>5
Constraint Classifier f: X → S_k
f(x) is a partial order; f(x) is consistent with y if (i>j) ∈ y ⇒ (i>j) ∈ f(x)
Page 33
Implementation
Examples: (x, y), y ∈ S_k
S_k: partial order over class labels {1,...,k}; defines a “preference” relation (>) for class labels. e.g. Multiclass: 2>1, 2>3, 2>4, 2>5
Given an example that is labeled 2, the activation of target 2 on it, should be larger than the activation of the other targets.
SNoW implementation: conservative. Only two target nodes are compared: the one corresponding to the correct label and the one with the
highest activation. If both are the same target node, no change; otherwise, promote one and demote the other.
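The conservative update can be sketched as below, using a simple additive promote/demote for readability (an assumption of this sketch: SNoW's targets actually use Winnow-style multiplicative updates):

```python
# Conservative multiclass update sketch: compare only the correct
# class's target with the current highest-activation target. On a
# mistake, promote the correct class and demote the winner; all other
# weight vectors are untouched.
import numpy as np

def conservative_step(V, x, y, lr=1.0):
    winner = int(np.argmax(V @ x))
    if winner != y:          # mistake: correct target vs. winner differ
        V[y] += lr * x       # promote correct label's target
        V[winner] -= lr * x  # demote the over-activated target
    return V
```

Because at most two rows of V change per example, the update is conservative in exactly the sense the slide describes.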
Page 34
Properties of Construction
Can learn any argmax v_i·x function. Can use any algorithm to find linear separation:
Perceptron Algorithm → ultraconservative online algorithm [Crammer, Singer 2001]
Winnow Algorithm → multiclass Winnow [Mesterharm 2000]
Defines a multiclass margin by binary margin in R^{kd} → multiclass SVM [Crammer, Singer 2001]
Page 35
Margin Generalization Bounds
Linear hypothesis space: h(x) = argsort_i v_i·x
v_i, x ∈ R^d; argsort returns a permutation of {1,...,k}
CC margin-based bound: γ = min_{(x,y)∈S} min_{(i>j)∈y} v_i·x − v_j·x
err_D(h) ≤ (C/m) · (R²/γ²) · 2 ln(1/δ)
m: number of examples; R: max_x ||x||; δ: confidence; C: average number of constraints
Page 36
VC-style Generalization Bounds
Linear hypothesis space: h(x) = argsort_i v_i·x
v_i, x ∈ R^d; argsort returns a permutation of {1,...,k}
CC VC-based bound:
err_D(h) ≤ err(S,h) + √( (kd log(mk/d) − ln δ) / m )
m: number of examples; d: dimension of input space; δ: confidence; k: number of classes
Page 37
Beyond Multiclass Classification
Ranking category ranking (over classes) ordinal regression (over examples)
Multilabel x is both red and blue
Complex relationships x is more red than blue, but not green
Millions of classes sequence labeling (e.g. POS tagging) LATER
SNoW has an implementation of Constraint Classification for the Multi-Class case. Try to compare with 1-vs-all.
Experimental Issues: when is this version of multi-class better? Several easy improvements are possible via modifying the loss function.
Page 38
Multi-class Experiments
Picture isn’t so clear for very high dimensional problems. Why?
Page 39
Summary
OvA:
Learning: independent; fi(x) > 0 iff y = i
Evaluation: global; h(x) = argmax_i fi(x)
Constraint Classification:
Learning: global; find {fi(x)} s.t. y = argmax_i fi(x)
Evaluation: global; h(x) = argmax_i fi(x)
Learn + Inference:
Learning: independent; fi(x) > 0 iff “i is a part of y”
Evaluation: global inference; h(x) = argmax_{y∈C} Σ fi(x)
Inference Based Training:
Learning: global; find {fi(x)} s.t. y = argmax_{y∈C} Σ fi(x)
Evaluation: global inference; h(x) = argmax_{y∈C} Σ fi(x)
Page 40
Structured Output Learning
Abstract View: Decomposition versus Constraint Classification
More details: Inference with Classifiers
Page 41
Structured Output Learning: Semantic Role Labeling
I left my pearls to my child
A0 : leaver
A1 : thing left
A2 : beneficiary
For each verb in a sentence:
1. Identify all constituents that fill a semantic role
2. Determine their roles
• Core arguments, e.g., Agent, Patient, or Instrument
• Their adjuncts, e.g., Locative, Temporal, or Manner
Y: all possible ways to label the tree; C(Y): all valid ways to label the tree; argmax_{y∈C(Y)} g(x,y)
Page 42
Components of Structured Output Learning
Input: X. Output: a collection of variables
Y = (y1,...,yL) ∈ {1,...,K}^L
Length is example dependent. Constraints on the output: C(Y)
e.g. non-overlapping, no repeated values...; partitions outputs into valid and invalid assignments
Representation: scoring function g(x,y), e.g. linear: g(x,y) = w·Φ(x,y)
Inference: h(x) = argmax_{valid y} g(x,y)
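For tiny K and L, the inference step h(x) = argmax over valid y can be sketched by brute-force enumeration; the no-repeated-values constraint and per-position local scores below are illustrative choices:

```python
# Brute-force inference sketch: enumerate all label sequences of
# length L, keep only those satisfying the constraint, return the
# argmax of the score. Feasible only for tiny K and L; real systems
# use dynamic programming or integer linear programming instead.
from itertools import product

def infer(L, K, score, valid):
    """argmax over y in {0..K-1}^L with valid(y) True; score(y) -> float."""
    best, best_s = None, float("-inf")
    for y in product(range(K), repeat=L):
        if valid(y):
            s = score(y)
            if s > best_s:
                best, best_s = y, s
    return best
```

With a linear score that sums local per-position scores, the constraint is what makes the problem global: the unconstrained argmax decomposes position by position, but the constrained one does not.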
(Figure: output variables y1, y2, y3 over Y; input X: “I left my pearls to my child”.)
Page 43
Decomposition-based Learning
Many choices for decomposition. Depends on problem, learning model, computation resources, etc...
Value-based decomposition: a function for each output value
f_k(x, l), k ∈ {1,...,K}
e.g. SRL tagging: f_A0(x, node), f_A1(x, node), ...
OvA learning: f_k(x, node) > 0 iff k = y
Page 44
Learning Discriminant Functions: The General Setting
g(x,y) > g(x,y’) ∀ y’ ∈ Y \ {y}
w·Φ(x,y) > w·Φ(x,y’) ∀ y’ ∈ Y \ {y}
w·Φ(x,y,y’) = w·(Φ(x,y) − Φ(x,y’)) > 0
P(x,y) = {Φ(x,y,y’)} for y’ ∈ Y \ {y}
P(S) = {P(x,y)} for (x,y) ∈ S
Learn a unary (binary) classifier over P(S): (+P(S), −P(S))
Used in many works [C02, WW00, CS01, CM03, TGK03]
Page 45
Learn a collection of “scoring” functions: w_A0·Φ_A0(x,y,n), w_A1·Φ_A1(x,y,n), ...
score_v(x,y,n) = w_v·Φ_v(x,y,n)
Global score: g(x,y) = Σ_n score_{yn}(x,y,n) = Σ_n w_{yn}·Φ_{yn}(x,y,n)
Learn locally (LO, L+I): for each label variable (node), e.g. n = A0:
g_A0(x,y,n) = w_A0·Φ_A0(x,y,n) > 0 iff y_n = A0
Discriminant model dictates: g(x,y) > g(x,y’) ∀ y’ ∈ C(Y); argmax_{y∈C(Y)} g(x,y)
Learn globally (IBT): g(x,y) = w·Φ(x,y)
Structured Output Learning: Semantic Role Labeling
(Figure: “I left my pearls to my child” with node scores, e.g. score_NONE(3), score_A2(13).)
Page 46
Summary
OvA:
Learning: independent; fi(x) > 0 iff y = i
Evaluation: global; h(x) = argmax_i fi(x)
Constraint Classification:
Learning: global; find {fi(x)} s.t. y = argmax_i fi(x)
Evaluation: global; h(x) = argmax_i fi(x)
Learn + Inference:
Learning: independent; fi(x) > 0 iff “i is a part of y”
Evaluation: global inference; h(x) = Inference{fi(x)}
Efficient learning
Inference Based Training:
Learning: global; find {fi(x)} s.t. y = Inference{fi(x)}
Evaluation: global inference; h(x) = Inference{fi(x)}
Less efficient learning
(Row labels: Multi-class; Structured Output)