efficient decomposed learning for structured prediction #icml2012
TRANSCRIPT
Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani, Dan Roth (Illinois)
Presenter: Yoh Okuno
Abstract
• Structured learning is important for NLP and CV
• The enormous output space is often intractable
• Proposal: DecL, decomposed learning
• DecL restricts the output space to a limited part
• Efficient and accurate, both in experiments and in theory
Introduction • What is structured learning?
– Predict output variables that mutually depend on each other
– Problem: enormous (exponential) output space
• Applications: NLP, CV, or Bioinformatics
– Multi-label document classification (binary) [Crammer+ 02]
– Information extraction (sequence) [Lafferty+ 01]
– Dependency parsing (tree) [Koo+ 10]
Example: Conditional Random Fields Output Space [Lafferty+ 01]
Example: Markov Random Fields Output Space [Boykov+ 98]
Related Work
• There are two major approaches:
1. Global Learning (GL): exact but slow [Tsochantaridis+ 04]
– Searches the entire output space in the learning phase
– Often implemented via ILP (Integer Linear Programming)
2. Local Learning (LL): inaccurate but fast
– Ignores the structure of the output for fast search
• DecL is exact under some assumptions, yet faster than LL
Problem Setting • Given training data: D = {(x1, y1), ..., (xm, ym)}
• Output y is represented as binary variables: y = (y1, ..., yn) ∈ {0, 1}^n
• The model is a linear combination of features
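As a concrete sketch of this setting (the feature map below is a hypothetical choice, not specified on the slides): the model scores a candidate output y by a dot product between the weight vector w and a joint feature vector Φ(x, y).

```python
import numpy as np

def joint_features(x, y):
    """Hypothetical joint feature map Phi(x, y): the outer product of
    the input features and the binary output variables, flattened."""
    return np.outer(x, y).ravel()

def score(w, x, y):
    """Linear model: f(x, y; w) = w . Phi(x, y)."""
    return float(w @ joint_features(x, y))
```

Any feature map that factors over the output variables would fit the same interface.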
Structured SVM [Tsochantaridis+ 04] • Minimize the loss function below:

l(w) = Σ_{j=1..m} [ max_{y ∈ Y} ( f(x_j, y; w) + ∆(y_j, y) ) − f(x_j, y_j; w) ]

• The hinge loss generalized to multiple dimensions; the ∆ term rewards incorrect outputs
• The regularization term is omitted for space reasons
• See [Tokunaga 2011] for more information
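A brute-force sketch of this loss, feasible only for tiny n since the max enumerates all 2^n outputs (the Hamming distance for ∆ and the `score` callback are assumptions, not fixed by the slides):

```python
import itertools

def hamming(y_gold, y):
    """Delta(y_gold, y): number of coordinates where y differs from gold."""
    return sum(a != b for a, b in zip(y_gold, y))

def structured_hinge_loss(score, xs, ys, n):
    """l(w) = sum_j [ max_y ( f(x_j, y; w) + Delta(y_j, y) ) - f(x_j, y_j; w) ].
    `score(x, y)` returns f(x, y; w); the max enumerates all of {0, 1}^n."""
    total = 0.0
    for x, y_gold in zip(xs, ys):
        best = max(score(x, y) + hamming(y_gold, y)
                   for y in itertools.product((0, 1), repeat=n))
        total += best - score(x, y_gold)
    return total
```

The exponential enumeration inside `max` is exactly the intractability that DecL's restricted neighborhood avoids.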
Figure 1: GL vs. DecL • Search a neighborhood around the gold output rather than the entire output space
DecL: Decomposed Learning
• Define a neighborhood around the gold output:

l(w) = Σ_{j=1..m} [ max_{y ∈ nbr(y_j)} ( f(x_j, y; w) + ∆(y_j, y) ) − f(x_j, y_j; w) ]

• Note: the prediction phase still needs a global search
• How can we define the neighborhood for learning?
Sub Gradient Descent for DecL
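The slide gives no detail here, but a standard subgradient step for this kind of loss looks like the sketch below (function names and the learning rate are hypothetical; `y_hat` is the loss-augmented argmax over the neighborhood):

```python
import numpy as np

def subgradient_step(w, x, y_gold, neighborhood, joint_features, delta, lr=0.1):
    """One subgradient update for the DecL loss on a single example.

    The subgradient of max_{y in nbr} ( f(x,y;w) + Delta(y_gold,y) ) - f(x,y_gold;w)
    w.r.t. w is Phi(x, y_hat) - Phi(x, y_gold), where y_hat attains the max."""
    y_hat = max(neighborhood,
                key=lambda y: float(w @ joint_features(x, y)) + delta(y_gold, y))
    g = joint_features(x, y_hat) - joint_features(x, y_gold)
    return w - lr * g
```

When `y_hat == y_gold` the subgradient is zero and the weights are unchanged, as expected for a correctly ranked example.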
DecL-k: Special Case of DecL • Restrict the output space to k dimensions
– Take all subsets of size k from the indices of y
– The other dimensions are fixed to the gold output
• Domain knowledge can be used in the general case
– Group coupled variables into the same decomposition set
– Complexity depends on the size of the decomposition
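A sketch of the DecL-k neighborhood construction described above (helper name is an assumption): for every subset of k indices, let those k binary variables range over all 2^k values while the remaining variables stay clamped to the gold output.

```python
import itertools

def decl_k_neighborhood(y_gold, k):
    """All outputs obtained by varying any k coordinates of the gold
    output; equivalently, all outputs within Hamming distance k of it."""
    nbr = set()
    n = len(y_gold)
    for idx in itertools.combinations(range(n), k):
        for vals in itertools.product((0, 1), repeat=k):
            y = list(y_gold)
            for i, v in zip(idx, vals):
                y[i] = v
            nbr.add(tuple(y))
    return nbr
```

The neighborhood has O(n^k · 2^k) elements, polynomial in n for fixed k, versus 2^n for the full space.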
Experiments on Synthetic Data • Compared DecL, LL, and GL (oracle)
• Synthetic training data:
– 10 binary outputs with random linear constraints
– 20-dimensional input, 320 training examples
• Running time in seconds:
Multi-Label Document Classification
• Dataset: Reuters corpus
• Size: 6,000 documents and 30 labels
• DecL performs as well as GL and is 6x faster
Information Extraction: Sequence Tagging • Data 1: citation recognition
– Recognize author, title, etc. from citation text
• Data 2: real-estate advertisements
– Recognize facility, roommates, etc. from the ads
Conclusion • Structured learning has a tradeoff between speed and accuracy
• Decomposed learning (DecL) splits the output space into small parts for fast inference during learning
• Fast and accurate on real-world datasets
• Theoretical guarantees of exactness under some assumptions (skipped)
References
• [Collins+ 02] Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms.
• [Lafferty+ 01] Conditional random fields: Probabilistic models for segmenting and labeling sequence data.
• [Koo+ 10] Dual decomposition for parsing with non-projective head automata.
• [Boykov+ 98] Markov random fields with efficient approximations.
• [Tsochantaridis+ 04] Support vector machine learning for interdependent and structured output spaces.
• [Crammer+ 02] On the algorithmic implementation of multiclass kernel-based vector machines.