


Predicting Structured Objects with Support Vector Machines

Thorsten Joachims, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, USA
Thomas Hofmann, Google Inc., Brandschenkestrasse 110, 8002 Zürich, Switzerland
Yisong Yue, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, USA
Chun-Nam Yu, Dept. of Computer Science, Cornell University, Ithaca, NY 14853, USA

ABSTRACT

Machine Learning today offers a broad repertoire of methods for classification and regression. But what if we need to predict complex objects like trees, orderings, or alignments? Such problems arise naturally in natural language processing, search engines, and bioinformatics. The following explores a generalization of Support Vector Machines (SVMs) for such complex prediction problems.

1. INTRODUCTION

Consider the problem of natural language parsing illustrated in Figure 1 (left). A parser takes as input a natural language sentence, and the desired output is the parse tree decomposing the sentence into its constituents. While modern machine learning methods like Boosting, Bagging, and Support Vector Machines (SVMs) (see e.g. [9]) have become the methods of choice for other problems in natural language processing (NLP) (e.g. word-sense disambiguation), parsing does not fit into the conventional framework of classification and regression. In parsing, the prediction is not a single yes/no, but a labeled tree. And breaking the tree down into a collection of yes/no predictions is far from straightforward, since it would require modeling a host of inter-dependencies between individual predictions. So, how can we take, say, an SVM and learn a rule for predicting trees?

Obviously, this question arises not only for learning to predict trees, but similarly for a variety of other structured and complex outputs. Structured output prediction is the name for such learning tasks, where one aims at learning a function h : X → Y mapping inputs x ∈ X to complex and structured outputs y ∈ Y. In NLP, structured output prediction tasks range from predicting an equivalence relation in noun-phrase co-reference resolution (see Figure 1, right), to predicting an accurate and well-formed translation in machine translation. But problems of this type are also plentiful beyond NLP. For example, consider the problem of image segmentation (i.e. predicting an equivalence relation y over a matrix of pixels x), the problem of protein structure prediction (which we will phrase as an alignment problem in Section 3.2), or the problem of web search (i.e. predicting a diversified document ranking y given a query x as explored in Section 3.1). We will see in Section 3.3 that even binary classification becomes a structured prediction task when aiming to optimize multivariate performance measures like the F1-Score or Precision@k.

In this paper we describe a generalization of SVMs, called Structural SVMs [26, 27, 14], that can be used to address a large range of structured output prediction tasks. On the one hand, structural SVMs inherit the attractive properties of regular SVMs, namely a convex training problem, flexibility in the choice of loss function, and the opportunity to learn non-linear rules via kernels. On the other hand, structural SVMs inherit the expressive power of generative models (e.g. Probabilistic Context-Free Grammars (PCFG) or Markov Random Fields (MRF)). Most importantly, however, structural SVMs are a discriminative learning method that does not require the independence assumptions made by conventional generative methods. Similar to the increase in prediction accuracy one typically sees when switching from a naive Bayes classifier to a classification SVM (e.g. [11]), structural SVMs promise such benefits also for structured prediction tasks (e.g. training PCFGs for parsing).

Structural SVMs follow and build upon a line of research on discriminative training for structured output prediction, including generalizations of Neural Nets [17], Logistic Regression [16], and other conditional likelihood methods (e.g. [18]). A particularly eye-opening paper is [16], showing that global training for structured prediction can be formulated as a convex optimization problem. Our work follows this track. More closely, however, we build upon structural Perceptrons [5] as well as methods for protein threading [19] and extend these to a large-margin formulation with an efficient training algorithm. Independent of our work, Taskar et al. [25] arrived at a similar formulation, but with more restrictive conditions on the form of the learning task.

In the following, we explain how structural SVMs work and highlight three applications in search engine ranking, protein structure prediction, and binary classification under non-standard performance measures.


Figure 1: Examples of structured output prediction tasks: predicting trees in natural language parsing (left), predicting the structure of proteins (middle), and predicting an equivalence relation over noun phrases (right).

Figure 2: Structured output prediction as a multi-class problem.

2. STRUCTURAL SVMS

How can one approach structured output prediction? On an abstract level, a structured prediction task is much like a multi-class learning task. Each possible structure y ∈ Y (e.g. parse tree) corresponds to one class (see Figure 2), and classifying a new example x amounts to predicting its correct “class”. While the following derivation of structural SVMs starts from multi-class SVMs [6], there are four key problems that need to be overcome. All of these problems arise from the huge number |Y| of classes. In parsing, for example, the number of possible parse trees is exponential in the length of the sentence. And the situation is similar for most other structured output prediction problems.

The first problem is related to finding a compact representation for large output spaces. If we allowed even just one parameter for each class, we would already have more parameters than we could ever hope to have enough training data for. Second, just making a single prediction on a new example is a computationally challenging problem, since sequentially enumerating all possible classes may be infeasible. Third, we need a more refined notion of what constitutes a prediction error. Clearly, predicting a parse tree that is almost correct should be treated differently from predicting a tree that is completely wrong. And, last but not least, we need efficient training algorithms that have a run-time complexity sub-linear in the number of classes.

In the following, we will tackle these problems one by one, starting with the formulation of the structural SVM method.

2.1 Problem 1: Structural SVM Formulation

As mentioned above, we start the derivation of the structural SVM from the multi-class SVM [6]. These multi-class SVMs use one weight vector w_y for each class y. Each input x now has a score for each class y via f(x, y) ≡ w_y · Φ(x). Here Φ(x) is a vector of binary or numeric features extracted from x. Thus, every feature will have an additively weighted influence in the modeled compatibility between inputs x and classes y. To classify x, the prediction rule h(x) then simply chooses the highest-scoring class

h(x) ≡ argmax_{y∈Y} f(x, y)    (1)

as the predicted output. This will result in the correct prediction y for input x provided the weights w = (w_1, ..., w_k) have been chosen such that the inequalities f(x, ŷ) < f(x, y) hold for all incorrect outputs ŷ ≠ y.

For a given training sample (x_1, y_1), ..., (x_n, y_n), this leads directly to a (hard-)margin formulation of the learning problem by requiring a fixed margin (=1) separation of all training examples, while using the norm of w as a regularizer:

min_w (1/2) ‖w‖²   s.t.   f(x_i, y_i) − f(x_i, ŷ) ≥ 1   (∀i, ∀ŷ ≠ y_i)    (2)

For a k-class problem, the optimization problem has a total of n(k−1) inequalities that are all linear in w, since one can expand f(x_i, y_i) − f(x_i, ŷ) = (w_{y_i} − w_ŷ) · Φ(x_i). Hence, it is a convex quadratic program.

The first challenge in using (2) for structured outputs is that, while there is generalization across inputs x, there is no generalization across outputs. This is due to having a separate weight vector w_y for each class y. Furthermore, since the number of possible outputs can become very large (or infinite), naively reducing structured output prediction to multi-class classification leads to an undesirable blow-up in the overall number of parameters.

The key idea in overcoming these problems is to extract features from input-output pairs using a so-called joint feature map Ψ(x, y) instead of Φ(x). This yields compatibility functions with contributions from combined properties of inputs and outputs. These joint features will allow us to generalize across outputs and to define meaningful scores even for outputs that were never actually observed in the training data. At the same time, since we will define compatibility functions via f(x, y) ≡ w · Ψ(x, y), the number of parameters will simply equal the number of features extracted via Ψ, which may not depend on |Y|. One can then use the formulation in (2) with the more flexible definition of f via Ψ to arrive at the following (hard-margin) optimization problem for structural SVMs.

min_w (1/2) ‖w‖²    (3)
s.t.   w · Ψ(x_i, y_i) − w · Ψ(x_i, ŷ) ≥ 1   (∀i, ∀ŷ ≠ y_i)

In words, find a weight vector w of an input-output compatibility function f that is linear in some joint feature map Ψ so that on each training example it scores the correct output higher by a fixed margin than every alternative output, while having low complexity (i.e. small norm ‖w‖). Note that the number of linear constraints is still n(|Y| − 1), but we will suggest ways to cope with this efficiently in Section 2.4.

The design of the features Ψ is problem-specific, and it is a strength of the developed methods to allow for a great deal of flexibility in how to choose it. In the parsing example above, the features extracted via Ψ may consist of counts of how many times each production rule of the underlying grammar has been applied in the derivation described by a pair (x, y). For other applications, the features can be derived from graphical models as proposed in [16, 25], but more general features can be used as well.
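As an illustration, here is a minimal sketch in Python of such a production-rule-count feature map. The tree representation and function names are assumptions of this sketch, not part of any described system; a parse tree is taken to be nested tuples of the form (label, child, child, ...).

```python
from collections import Counter

def rule_counts(tree):
    """Return a Counter mapping productions (label, child labels) to usage counts."""
    counts = Counter()
    def visit(node):
        if isinstance(node, tuple):
            label, children = node[0], node[1:]
            child_labels = tuple(c[0] if isinstance(c, tuple) else c for c in children)
            counts[(label, child_labels)] += 1
            for c in children:
                visit(c)
    visit(tree)
    return counts

# Tiny example parse of "the dog chased the cat"; the Counter plays the role of
# a sparse Psi(x, y), with one dimension per grammar production.
tree = ("S",
        ("NP", ("Det", "the"), ("N", "dog")),
        ("VP", ("V", "chased"),
               ("NP", ("Det", "the"), ("N", "cat"))))
print(rule_counts(tree))
```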

2.2 Problem 2: Efficient Prediction

Before even starting to address the efficiency of solving quadratic programs of the form (3) for large n and |Y|, we need to make sure that we can actually solve the inference problem of computing a prediction h(x). Since the number of possible outputs |Y| may grow exponentially in the length of the representation of outputs, a brute-force exhaustive search over Y may not always be feasible. In general, we require some sort of structural matching between the (given) compositional structure of the outputs y and the (designed) joint feature map Ψ.

For instance, if we can decompose Y into non-overlapping independent parts, Y = Y_1 × · · · × Y_m, and if the feature extraction Ψ respects this decomposition such that no feature combines properties across parts, then we can maximize the compatibility for each part separately and compose the full output by simple concatenation. A significantly more general case can be captured within the framework of Markov networks as explored by [16, 25]. If we represent outputs as a vector of (random) variables y = (y_1, ..., y_m), then for a fixed input x, we can think of Ψ(x, y) as sufficient statistics in a conditional exponential model of P(y|x). Maximizing f(x, y) then corresponds to finding the most probable output, y = argmax_y P(y|x). Modeling the statistical dependencies between output variables via a dependency graph, one can use efficient inference methods such as the junction tree algorithm for prediction. This assumes the clique structure of the dependency graph induced by a particular choice of Ψ is tractable (e.g. the state space of the largest cliques remains sufficiently small). Other efficient algorithms, for instance, based on dynamic programming or minimal graph cuts, may be suitable for other prediction problems.
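For the fully decomposable case mentioned at the start of the previous paragraph, the argmax factorizes completely. A minimal sketch (with illustrative names, not taken from any described system):

```python
def predict_decomposable(part_scores):
    """part_scores[j]: dict mapping each candidate value of output part j to its
    score w . Psi_j(x, y_j); with no features across parts, the overall argmax
    is just the per-part argmax, concatenated."""
    return [max(scores, key=scores.get) for scores in part_scores]

# Example: three independent parts, e.g., per-token labels with no
# dependencies between the tokens.
parts = [{"N": 1.2, "V": 0.3}, {"N": -0.5, "V": 0.9}, {"N": 0.1, "V": 0.0}]
print(predict_decomposable(parts))  # ['N', 'V', 'N']
```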

In our example of natural language parsing, the suggested feature map will lead to a compatibility function in which parse trees are scored by associating a weight with each production rule and by simply combining all weights additively. Therefore, a prediction can be computed efficiently using the CKY-Parsing algorithm (cf. [27]).

2.3 Problem 3: Inconsistent Training Data

So far we have tacitly assumed that the optimization problem in (3) has a solution, i.e. there exists a weight vector that simultaneously fulfills all margin constraints. In practice this may not be the case, either because the training data is inconsistent or because our model class is not powerful enough. If we allow for mistakes, though, we must be able to quantify the degree of mismatch between a prediction and the correct output, since usually different incorrect predictions vary in quality. This is exactly the role played by a loss function, formally ∆ : Y × Y → ℝ, where ∆(y, ŷ) is the loss (or cost) for predicting ŷ when the correct output is y. Ideally we are interested in minimizing the expected loss of a structured output classifier, yet, as is common in machine learning, one often uses the empirical or training loss

(1/n) Σ_{i=1}^n ∆(y_i, h(x_i))

as a proxy for the (inaccessible) expected loss. Like the choice of Ψ, defining ∆ is problem-specific. If we predict a set or sequence of labels, a natural choice for ∆ may be the number of incorrect labels (a.k.a. Hamming loss). In the above parsing problem, a quality measure like F1 may be the preferred way to quantify the similarity between two labeled trees in terms of common constituents (and then one may set ∆ ≡ 1 − F1).
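As a concrete sketch of these two choices (function names and data representations are illustrative):

```python
def hamming_loss(y_true, y_pred):
    """Number of positions where the predicted label sequence differs."""
    return sum(a != b for a, b in zip(y_true, y_pred))

def one_minus_f1(true_parts, pred_parts):
    """1 - F1 over sets of parts, e.g., the labeled constituents of two parse trees."""
    true_parts, pred_parts = set(true_parts), set(pred_parts)
    overlap = len(true_parts & pred_parts)
    if overlap == 0:
        return 1.0
    precision = overlap / len(pred_parts)
    recall = overlap / len(true_parts)
    return 1.0 - 2 * precision * recall / (precision + recall)
```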

Now we will utilize the idea of loss functions to refine our initial problem formulation. Several approaches have been suggested in the literature, all of which are variations of the so-called soft-margin SVM: instead of strictly enforcing constraints, one allows for violations, yet penalizes them in the overall objective function. A convenient approach is to introduce slack variables that capture the actual violation,

f(x_i, y_i) − f(x_i, ŷ) ≥ 1 − ξ_{iŷ}   (∀ŷ ≠ y_i)    (4)

so that by choosing ξ_{iŷ} > 0 a smaller (even negative) separation margin is possible. All that remains to be done then is to define a penalty for the margin violations. A popular choice is to use (1/n) Σ_{i=1}^n max_{ŷ≠y_i} {∆(y_i, ŷ) ξ_{iŷ}}, thereby penalizing violations more heavily for constraints associated with outputs ŷ that have a high loss with regard to the observed y_i. Technically, one can convert this back into a quadratic program as follows:

min_{w, ξ≥0}  (1/2) ‖w‖² + (C/n) Σ_{i=1}^n ξ_i    (5)
s.t.   w · Ψ(x_i, y_i) − w · Ψ(x_i, ŷ) ≥ 1 − ξ_i / ∆(y_i, ŷ)   (∀i, ∀ŷ ≠ y_i)

Here C > 0 is a constant that trades off constraint violations against the geometric margin (effectively given by 1/‖w‖). This has been called the slack re-scaling formulation. As an alternative, one can also re-scale the margin to upper-bound the training loss as first suggested in [25] and discussed in [27]. This leads to the following quadratic program:

min_{w, ξ≥0}  (1/2) ‖w‖² + (C/n) Σ_{i=1}^n ξ_i    (6)
s.t.   w · Ψ(x_i, y_i) − w · Ψ(x_i, ŷ) ≥ ∆(y_i, ŷ) − ξ_i   (∀i, ∀ŷ ≠ y_i)

Since it is slightly simpler, we will focus on this margin-rescaling formulation in the following.

2.4 Problem 4: Efficient Training

Last, but not least, we need a training algorithm that finds the optimal w solving the quadratic program in (6). Since there is a constraint for every incorrect label ŷ, we cannot enumerate all constraints and simply give the optimization problem to a standard QP solver.


Algorithm 1 for training structural SVMs (margin-rescaling).

1: Input: S = ((x_1, y_1), ..., (x_n, y_n)), C, ε
2: W ← ∅, w ← 0, ξ_i ← 0 for all i = 1, ..., n
3: repeat
4:   for i = 1, ..., n do
5:     ŷ ← argmax_{ŷ∈Y} {∆(y_i, ŷ) + w · Ψ(x_i, ŷ)}
6:     if w · [Ψ(x_i, y_i) − Ψ(x_i, ŷ)] < ∆(y_i, ŷ) − ξ_i − ε then
7:       W ← W ∪ {w · [Ψ(x_i, y_i) − Ψ(x_i, ŷ)] ≥ ∆(y_i, ŷ) − ξ_i}
8:       (w, ξ) ← argmin_{w, ξ≥0} (1/2) w · w + (C/n) Σ_{i=1}^n ξ_i   s.t. W
9:     end if
10:  end for
11: until W has not changed during iteration
12: return (w, ξ)

Instead, we propose to use the cutting-plane Algorithm 1 (or a similar algorithm for slack-rescaling). The key idea is to iteratively construct a working set of constraints W that is equivalent to the full set of constraints in (6) up to a specified precision ε. Starting with an empty W and w = 0, Algorithm 1 iterates through the training examples. For each example, the argmax in Line 5 finds the most violated constraint of the quadratic program in (6). If this constraint is violated by more than ε (Line 6), it is added to the working set W in Line 7 and a new w is computed by solving the quadratic program over the new W (Line 8). The algorithm stops and returns the current w if W did not change between iterations.

It is obvious that for any desired ε, the algorithm only terminates when it has found an ε-accurate solution, since it verifies in Line 5 that none of the constraints of the quadratic program in (6) is violated by more than ε. But how long does it take to terminate? It can be shown [27] that Algorithm 1 always terminates in a polynomial number of iterations that is independent of the cardinality of the output space Y. In fact, a refined version of Algorithm 1 [13, 14] always terminates after adding at most O(Cε⁻¹) constraints to W (typically |W| ≪ 1000). Note that the number of constraints is not only independent of |Y|, but also independent of the number of training examples n, which makes it an attractive training algorithm even for conventional SVMs [13].

While the number of iterations is small, the argmax in Line 5 might be expensive to compute. In general, this is true, but note that this argmax is closely related to the argmax for computing a prediction h(x). It is therefore called the “loss-augmented” inference problem, and often the prediction algorithm can be adapted to efficiently solve the loss-augmented inference problem as well.

Note that Algorithm 1 is rather generic and refers to the output structure only through a narrow interface. In particular, to apply the algorithm to a new structured prediction problem, one merely needs to supply implementations of Ψ(x, y), ∆(y, ŷ), and argmax_{ŷ∈Y} {∆(y_i, ŷ) + w · Ψ(x_i, ŷ)}. This makes the algorithm general and easily transferable to new applications. In particular, all applications discussed in the following used the general SVMstruct implementation of Algorithm 1 and supplied these three functions via an API.
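To make this interface concrete, the following is a minimal, self-contained Python sketch of the structure of Algorithm 1, not the SVMstruct implementation. The toy multi-class psi, delta, and loss-augmented argmax stand in for the three user-supplied functions, and the restricted quadratic program of Line 8 is solved here by a simple dual coordinate ascent over the working set; both are assumptions of this sketch, and any QP solver could be substituted.

```python
import numpy as np

def psi(x, y, num_classes):
    """Joint feature map for multi-class output: copy x into the block of class y."""
    out = np.zeros(num_classes * x.shape[0])
    out[y * x.shape[0]:(y + 1) * x.shape[0]] = x
    return out

def delta(y_true, y_pred):
    """0/1 loss; any structured loss could be plugged in instead."""
    return 0.0 if y_true == y_pred else 1.0

def loss_augmented_argmax(w, x, y_true, num_classes):
    """argmax over y of Delta(y_true, y) + w . Psi(x, y)  (Line 5 of Algorithm 1)."""
    scores = [delta(y_true, y) + w @ psi(x, y, num_classes) for y in range(num_classes)]
    return int(np.argmax(scores))

def solve_restricted_qp(constraints, n, C, dim, sweeps=200):
    """Approximately solve the QP of Line 8 over the working set by dual coordinate
    ascent. Each constraint is (i, dpsi, loss), encoding w . dpsi >= loss - xi_i."""
    alpha = np.zeros(len(constraints))
    w = np.zeros(dim)
    for _ in range(sweeps):
        for k, (i, dpsi, loss) in enumerate(constraints):
            norm2 = dpsi @ dpsi
            if norm2 == 0.0:
                continue
            w_rest = w - alpha[k] * dpsi          # contribution of all other alphas
            a_new = (loss - w_rest @ dpsi) / norm2
            # Respect the per-example budget sum_y alpha_{i,y} <= C / n.
            used = sum(alpha[j] for j, (i2, _, _) in enumerate(constraints)
                       if i2 == i and j != k)
            a_new = min(max(a_new, 0.0), C / n - used)
            w = w_rest + a_new * dpsi
            alpha[k] = a_new
    xi = np.zeros(n)                              # slacks implied by the constraints
    for (i, dpsi, loss) in constraints:
        xi[i] = max(xi[i], loss - w @ dpsi)
    return w, np.maximum(xi, 0.0)

def cutting_plane_train(X, Y, num_classes, C=1.0, eps=1e-3):
    """Margin-rescaling cutting-plane training following the structure of Algorithm 1."""
    n, dim = len(X), X[0].shape[0] * num_classes
    w, xi = np.zeros(dim), np.zeros(n)
    working_set = []
    changed = True
    while changed:                                # until W has not changed
        changed = False
        for i, (x, y) in enumerate(zip(X, Y)):
            y_hat = loss_augmented_argmax(w, x, y, num_classes)
            dpsi = psi(x, y, num_classes) - psi(x, y_hat, num_classes)
            loss = delta(y, y_hat)
            if w @ dpsi < loss - xi[i] - eps:     # constraint violated by more than eps
                working_set.append((i, dpsi, loss))
                w, xi = solve_restricted_qp(working_set, n, C, dim)
                changed = True
    return w

# Toy usage: three 2-D inputs, three classes.
X = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
Y = [0, 1, 2]
w = cutting_plane_train(X, Y, num_classes=3)
```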

2.5 Related Approaches

The structural SVM method is closely related to other approaches, namely Conditional Random Fields [16] and Maximum Margin Markov networks (M3N) [25]. The crucial link to probabilistic methods is to interpret the joint feature map as sufficient statistics in a conditional exponential family model. This means we define a conditional probability of outputs given inputs by

P(y|x) = (1/Z(x)) exp[w · Ψ(x, y)],    (7)

where Z(x) is a normalization constant. Basically, (7) is a generalization of the well-known logistic regression where Ψ(x, y) = yΦ(x) for y ∈ {−1, 1}. The compatibility function used in structural SVMs directly governs this conditional probability. The probabilistic semantics lends itself to statistical inference techniques like penalized (conditional) maximum likelihood, where one maximizes Σ_{i=1}^n log P(y_i|x_i) − λ‖w‖². Unlike in Algorithm 1, however, here one usually must compute the expected sufficient statistics Σ_y P(y|x) Ψ(x, y), for instance, in order to compute a gradient direction on the conditional log-likelihood. There are additional algorithmic challenges involved with learning CRFs (cf. [22]), and one of the key advantages of structural SVMs is that they allow an efficient dual formulation. This enables the use of kernels instead of explicit feature representations as well as a sparse expansion of the solution (cf. [10]). A simpler, but usually less accurate learning algorithm that can be generalized to the structured output setting is the perceptron algorithm, as first suggested in [5].
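For comparison, a sketch of that structured (Collins-style) perceptron, assuming the same toy psi joint feature map as in the cutting-plane sketch of Section 2.4; names and the multi-class setting are illustrative only.

```python
import numpy as np

def structured_perceptron(X, Y, psi, num_classes, epochs=10):
    """Predict with the current w and, on a mistake, move the weights toward
    the correct output and away from the predicted one (no margin, no regularizer)."""
    dim = psi(X[0], 0, num_classes).shape[0]
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            y_hat = int(np.argmax([w @ psi(x, c, num_classes)
                                   for c in range(num_classes)]))
            if y_hat != y:
                w += psi(x, y, num_classes) - psi(x, y_hat, num_classes)
    return w
```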

The M3N approach also takes the probabilistic view as its starting point, but instead of maximizing a likelihood-based criterion, it aims at solving (6). More specifically, M3Ns exploit the clique-based representation of P(y|x) over a dependency graph to efficiently re-parametrize the dual problem of (6), effectively replacing the margin constraints over pairs of outputs by constraints involving functions of clique configurations. This is an interesting alternative to our cutting-plane algorithm, but it has a somewhat more limited range of applicability.

3. APPLICATIONS

As outlined above, one needs to design and supply the following four functions when applying a structural SVM to a new problem:

Ψ(x, y)    (8)
∆(y, ŷ)    (9)
argmax_{ŷ∈Y} {w · Ψ(x_i, ŷ)}    (10)
argmax_{ŷ∈Y} {∆(y_i, ŷ) + w · Ψ(x_i, ŷ)}    (11)

The following three sections illustrate how this can be done for learning retrieval functions that provide diversified results, for aligning amino-acid sequences into homologous protein structures, and for optimizing non-standard performance measures in binary classification.

3.1 Optimizing Diversity in Search Engines

State of the art retrieval systems commonly use machine learning to optimize their ranking functions. While effective to some extent, conventional learning methods score each document independently and, therefore, cannot account for information diversity (e.g. not presenting multiple redundant results). Indeed, several recent studies in information retrieval emphasized the need for diversity (e.g., [32, 24]). In particular, they stressed modeling inter-document dependencies, which is fundamentally a structured prediction problem.


Figure 3: Different documents (shaded regions) cover different information (subtopics) of a query. Here, the set {D2, D3, D4} contains the 3 documents that individually cover the most information, but {D1, D3, D5} collectively covers more information.

Given a dataset of queries and documents labeled according to information diversity, we used structural SVMs to learn a general model of how to diversify [31].

What is an example (x, y) in this learning problem? For some query, let x = {x_1, ..., x_n} be the candidate documents that the system needs to rank. Our ground truth labeling for x is a set of subtopics T = {T_1, ..., T_n}, where T_i denotes the subtopics covered by document x_i. The prediction goal is to find a subset y ⊂ x of size K (e.g., K = 10 for search) maximizing subtopic coverage, and therefore maximizing the information presented to the user. We define our loss function ∆(T, y) to be the (weighted) fraction of subtopics not covered by y (more weight for popular subtopics).¹

Even if the ground truth subtopics were known, computing the best y can be difficult. Since documents overlap in subtopics (i.e., ∃i, j : T_i ∩ T_j ≠ ∅), this is a budgeted maximum coverage problem. The standard greedy method achieves a (1 − 1/e)-approximation bound [15], and typically yields good solutions. The greedy method also naturally produces a ranking of the top documents.
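A minimal sketch of this greedy selection, assuming each candidate document is represented simply as the set of items (subtopics, or later weighted words) it covers; names are illustrative.

```python
def greedy_diverse_subset(doc_coverage, k):
    """Pick k documents, each time taking the one with the largest marginal gain
    in covered items; this is the standard (1 - 1/e) greedy heuristic."""
    selected, covered = [], set()
    for _ in range(k):
        best_doc, best_gain = None, 0
        for doc, items in doc_coverage.items():
            if doc in selected:
                continue
            gain = len(items - covered)
            if gain > best_gain:
                best_doc, best_gain = doc, gain
        if best_doc is None:      # nothing adds new coverage
            break
        selected.append(best_doc)
        covered |= doc_coverage[best_doc]
    return selected               # the selection order is also a ranking

# Toy usage: documents mapped to the subtopic ids they cover.
docs = {"D1": {1, 2}, "D2": {2, 3, 4}, "D3": {4, 5}}
print(greedy_diverse_subset(docs, 2))  # ['D2', 'D1']
```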

Figure 3 depicts an abstract visualization of our prediction problem. The shaded regions represent candidate documents x of a query, and the area covered by each region is the “information” (represented as subtopics T) covered by that document. If T was known, we could use a greedy method to find a solution with high subtopic diversity. For K = 3, the optimal solution in Figure 3 is y = {D1, D3, D5}.

In general, however, the subtopics of new queries are unknown. One can think of subtopic labels as a manual partitioning of the total information of a query. We do, however, have access to an “automatic” representation of information diversity: the words contained in the documents. Intuitively, covering more (distinct) words should result in covering more subtopics. Since some words are more informative than others, a weighting scheme is required.

Following the approach of Essential Pages [24], we measure word importance primarily using relative word frequencies within x. For example, the word “computer” is generally informative, but conveys almost no information for the query “ACM” since it likely appears in most of the candidate documents. We learn a word weighting function using labeled training data, whereas [24] uses a pre-defined function.

We now present a simple example of the joint feature representation Ψ. Let φ(v, x) denote the feature vector describing the frequency of a word v amongst documents in x. For example, we can construct φ(v, x) as

φ(v, x) = ( 1[v appears in at least 15% of x],
            1[v appears in at least 35% of x],
            ... )

¹ The loss function (9) can be defined using any ground truth format, not necessarily the same format as the output space.

Figure 4: Subtopic loss comparison when retrieving 5 documents. Structural SVM performance is superior with 95% confidence (using signed rank test).

Let V(y) denote the union of words contained in the documents of the predicted subset y. We can write Ψ as

Ψ(x, y) = Σ_{v∈V(y)} φ(v, x).

Given a model vector w, the benefit of covering word v in x is w · φ(v, x), and it is realized when a document in y contains v, i.e., v ∈ V(y). Because documents overlap in words, this is also a budgeted maximum coverage problem. Thus both prediction (10) and loss-augmented inference (11) can be solved effectively via the greedy method. Despite finding an approximate most violated constraint, we can still bound the precision of our solution [8]. Practical applications require more sophisticated Ψ, and we refer to [31] for more details.
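A minimal sketch of this word-coverage feature map; the threshold values and names are illustrative only, and the features used in [31] are richer.

```python
import numpy as np

THRESHOLDS = (0.15, 0.35, 0.55, 0.75)   # illustrative frequency cut-offs

def phi(word, candidate_docs):
    """Indicator features: does `word` appear in at least t% of the candidates x?"""
    freq = sum(word in doc for doc in candidate_docs) / len(candidate_docs)
    return np.array([1.0 if freq >= t else 0.0 for t in THRESHOLDS])

def joint_feature_map(candidate_docs, selected_docs):
    """Psi(x, y) = sum of phi(v, x) over all words v covered by the subset y."""
    covered = set().union(*selected_docs) if selected_docs else set()
    return sum((phi(v, candidate_docs) for v in covered),
               np.zeros(len(THRESHOLDS)))

# Toy usage: documents as sets of words. The benefit of covering word v is
# w . phi(v, x), so the greedy selection above can score marginal gains with it.
x = [{"acm", "award"}, {"acm", "turing", "award"}, {"acm", "conference"}]
y = [x[1]]
print(joint_feature_map(x, y))
```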

Experiments. We tested the effectiveness of our method using the TREC 6-8 Interactive Track queries. Relevant documents are labeled using subtopics. For example, query 392 asked human judges to identify applications of robotics in the world today, and they identified 36 subtopics among the results, such as nanorobots and using robots for space missions. Additional details can be found in [31].

We compared against Okapi [21], Essential Pages [24], random selection, and unweighted word selection (all words have equal benefit). Okapi is a conventional retrieval function which does not account for diversity. Figure 4 shows the loss on the test set for retrieving K = 5 documents. The gains of the structural SVM over Essential Pages are significant at the 95% level, and only the structural SVM and Essential Pages outperformed random selection.

This approach can accommodate other diversity criteria, such as diversity of clicks gathered from clickthrough logs. We can also accommodate very rich feature spaces based on, e.g., anchor text, URLs, and titles.

Abstractly, we are learning a mapping between two representations of the same set covering problem. One representation defines solution quality using manually labeled subtopics. The second representation learns a word weighting function, with the goal of having the two representations agree on the best solution. This setting is very general and can be applied to other domains beyond subtopic retrieval.


3.2 Predicting Protein Alignments

Proteins are sequences of 20 different amino acids joined together by peptide bonds. They fold into various stable structures under normal cell environments. A central question in biology is to understand how proteins fold, as their structures relate to their functions.

One of the more successful methods for predicting protein structure from an amino acid sequence is homology modeling. It approximates the structure of an unknown protein by comparing it against a database of proteins with experimentally determined structures. An important intermediate step is to align the unknown protein sequence with known protein sequences in the database, and this is a difficult problem when the sequences have diverged too far (e.g., less than 20% sequence identity).

To align protein sequences accurately for homology modeling, we need to have a good substitution cost matrix as input to the Smith-Waterman algorithm (a dynamic programming algorithm). A common approach is to learn the substitution matrix from a set of known good alignments between evolutionarily related proteins.

Structural SVMs are particularly suitable for learning the substitution matrix, since they allow incorporating all available information (e.g. secondary structures, solvent accessibility, and other physiochemical properties) in a principled way. When predicting the alignment y = (y_1, y_2, ...) for two given protein sequences x = (s_a, s_b), each sequence location is described by a vector of features. The discriminative approach of structural SVMs makes it easy to incorporate these extra features without having to make unjustified independence assumptions as in generative models. As explained below, this enables learning a ‘richer’ substitution function w · Φ(x, y_i) instead of a fixed substitution matrix.

The Smith-Waterman algorithm uses a linear function for scoring alignments, which allows us to decompose the joint feature vector into a sum of feature vectors for individual alignment operations (match, insertion, or deletion):

w · Ψ(x, y) = Σ_{i=1}^{length(y)} w · Φ(x, y_i)

Φ(x, y_i) is the feature vector for the i-th alignment operation in the alignment y of the sequence pair x. Below we describe several alignment models represented by Φ, focusing on the scoring of matching two amino acids (see Figure 5 for an illustration of the features used):

(i) Simple: we only consider the substitution cost of single features, e.g., the cost of matching a position with amino acid ‘M’ with another position in a loop region.

(ii) Anova2: we consider pairwise interactions of features, e.g., the cost of matching a position with amino acid ‘M’ in an alpha helix with another position with amino acid ‘V’ in a loop region.

(iii) Tensor: we consider three-way interactions of features among amino acid, secondary structure, and solvent accessibility.

(iv) Window: on top of the three alignment models above, we add neighborhood features using a sliding window, which takes into account the hydrophobicity and secondary structure in the neighborhood of a position.

(v) Profile: on top of the alignment model Window, we add PSI-BLAST profile scores as features.

Figure 5: Aligning a new protein sequence with a known protein structure. Features at the aligned positions are used to construct substitution matrices under different alignment models.

As the loss function ∆(y, ŷ), we use the Q4-loss. It is the percentage of matched amino acids in the correct alignment y that are aligned more than 4 positions apart in the predicted alignment ŷ. The linearity of the Q4-loss allows us to use the Smith-Waterman algorithm also for solving the loss-augmented inference problem during training. We refer to [29] for more details.
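A minimal sketch of this Q4-loss, under the assumption that an alignment is represented as a dict mapping matched positions of the first sequence to their partners in the second; counting positions left unmatched in the prediction as wrong is also an assumption of the sketch.

```python
def q4_loss(true_align, pred_align, tol=4):
    """Percentage of matched pairs in the correct alignment whose predicted partner
    is missing or more than `tol` positions away."""
    if not true_align:
        return 0.0
    wrong = 0
    for pos_a, pos_b in true_align.items():
        pred_b = pred_align.get(pos_a)
        if pred_b is None or abs(pred_b - pos_b) > tol:
            wrong += 1
    return 100.0 * wrong / len(true_align)

# Toy usage: one match is off by 6 positions, so Q4-loss = 50 (Q4-score = 50).
print(q4_loss({3: 10, 7: 15}, {3: 16, 7: 14}))
```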

Experiments. We tested our algorithm with the training and validation sets developed in [20], which contain 4542 and 4587 pairwise alignments of evolutionarily related proteins with high structural similarities. For the test set we selected 4185 structures deposited in the Protein Data Bank from June 2005 to June 2006, leading to 29345 test pairs. For all sequence pairs x in the training, validation, and test sets both structures are known, and we use the CE structural alignment program [23] to generate “correct” alignments y.

The red bars in Figure 6 show the Q4-scores (i.e. 100 minus Q4-loss) of the five alignment models described above. By carefully introducing features and considering their interactions, we can build highly complex alignment models (with hundreds of thousands of parameters) with very good alignment performance. Note that a conventional substitution matrix has only 20² = 400 parameters.

SSALN [20] (blue bar) uses a generative alignment model for parameter estimation, and is trained using the same training set and features as the structural SVM algorithm. The structural SVM model Window substantially outperforms SSALN on Q4-score. Incorporating profile information makes the SVM model Profile perform even better, illustrating the benefit of being able to easily add additional features in discriminative training. The result from BLAST (green bar) is provided as a baseline.

To get a plausible upper bound for further improvements, we check to what extent the CE alignments we used as ground truth agree with the structural alignments computed by TM-Align [33]. TM-Align gives a Q4-score of 85.45 when using the CE alignments as ground truth to compare against. This provides a reasonable upper bound on the alignment accuracy we might achieve on this noisy dataset. However, it also shows significant room for further research and improvement over our current alignment models.

3.3 Binary Classification with Unbalanced Classes


Figure 6: Q4-score of various structural SVM alignment models compared to two baseline models. The structural SVM using Window or Profile features significantly outperforms the SSALN baseline. The number in brackets is the number of features of the corresponding alignment model.

Even binary classification problems can become structured output problems in some cases. Consider, for example, the case of learning a binary text classification rule with the classes “machine learning” and “other topics”. Like most text classification tasks, it is highly unbalanced, and we might only have, say, 1% machine-learning documents in our collection. In such a situation, prediction error typically becomes meaningless as a performance measure, since the default classifier that always predicts “other topics” already has a very low error rate that is hard to beat. To overcome this problem, the Information Retrieval (IR) community has designed other performance measures, like the F1-Score and the Precision/Recall Breakeven Point (PRBEP) (see e.g. [1]), that are meaningful even with unbalanced classes.

What does this mean for learning? Instead of optimizing some variant of error rate during training, which is what conventional SVMs and virtually all other learning algorithms do, it seems like a natural choice to have the learning algorithm directly optimize, for example, the F1-Score (i.e. the harmonic mean of precision and recall). This is the point where our binary classification task needs to become a structured output problem, since the F1-Score (as well as many other IR measures) is not a function of individual examples (like error rate), but a function of a set of examples. In particular, we arrive at the structured output problem of predicting an array of labels y = (y_1, ..., y_n), y_i ∈ {−1, +1}, for an array of feature vectors x = (x_1, ..., x_n), x_i ∈ ℝ^N. Each possible array of labels ŷ now has an associated F1-Score F1(ŷ, y) w.r.t. the true labeling y, and optimizing F1-Score on the training set becomes a well-defined problem.

The problem of predicting an array of binary labels y for an array of input vectors x optimizing a loss function ∆(y, ŷ) fits neatly into the structural SVM framework. Again, we only need to define Ψ(x, y) and ∆(y, ŷ) appropriately, and then find algorithms for the two argmax problems. The choice of loss function is obvious: when optimizing the F1-Score, which ranges from 0 (worst) to 1 (best), we will use

∆(y, ŷ) = 1 − F1(ŷ, y).

Table 1: Comparing a conventional SVM to a structural SVM that directly optimizes the performance measure.

Dataset     Method       F1     PRBEP   ROCArea
Reuters     struct SVM   62.0   68.2    99.1
            SVM          56.1   65.7    98.6
ArXiv       struct SVM   56.8   58.4    92.8
            SVM          49.6   57.9    92.7
Optdigits   struct SVM   92.5   92.7    99.4
            SVM          91.5   91.5    99.4
Covertype   struct SVM   73.8   72.1    94.6
            SVM          73.9   71.0    94.1

For the joint feature mapping, using

Ψ(x, y) = (1/n) Σ_{i=1}^n y_i x_i

can be shown to be a natural choice, since it makes the structural SVM equivalent to a conventional SVM if error rate is used as the loss ∆(y, ŷ). It also makes computing a prediction very efficient with

ŷ(x) = argmax_y {w · Ψ(x, y)} = (sign(w · x_1), ..., sign(w · x_n)).

Computing the loss-augmented argmax necessary for training is a bit more complicated and depends on the particular loss function used, but it can be done in time at most O(n²) for any loss function that can be computed from the contingency table of prediction errors [12]. SVMperf is an implementation of the method for various performance measures.
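The following is a hedged sketch of that O(n²) search for the case ∆ = 1 − F1, not the SVMperf code. Because the loss depends only on the contingency table, for any fixed number of true positives and false positives the score term is maximized by flipping the highest-scoring examples to +1, so it suffices to enumerate the (tp, fp) pairs; names are illustrative.

```python
import numpy as np

def f1_loss(tp, fp, fn):
    """1 - F1 computed from the contingency table."""
    return 1.0 if tp == 0 else 1.0 - 2 * tp / (2 * tp + fp + fn)

def loss_augmented_argmax_f1(scores, y_true):
    """scores[i] = w . x_i, y_true in {-1, +1}. Returns the labeling y' maximizing
    Delta(y, y') + w . Psi(x, y') with Psi(x, y) = (1/n) sum_i y_i x_i."""
    n = len(scores)
    pos = sorted([i for i in range(n) if y_true[i] == +1], key=lambda i: -scores[i])
    neg = sorted([i for i in range(n) if y_true[i] == -1], key=lambda i: -scores[i])
    # Prefix sums of scores; flipping example i from -1 to +1 adds 2*scores[i]/n.
    pos_prefix = np.concatenate(([0.0], np.cumsum([scores[i] for i in pos])))
    neg_prefix = np.concatenate(([0.0], np.cumsum([scores[i] for i in neg])))
    base = -(pos_prefix[-1] + neg_prefix[-1])        # everything labeled -1
    best_val, best = -np.inf, None
    for tp in range(len(pos) + 1):
        for fp in range(len(neg) + 1):
            fn = len(pos) - tp
            score_term = (base + 2 * pos_prefix[tp] + 2 * neg_prefix[fp]) / n
            val = f1_loss(tp, fp, fn) + score_term
            if val > best_val:
                best_val, best = val, (tp, fp)
    tp, fp = best
    y = -np.ones(n)
    y[pos[:tp]] = +1                                 # highest-scoring positives
    y[neg[:fp]] = +1                                 # highest-scoring negatives
    return y

# Toy usage: scores for four examples, the first two of which are true positives.
scores = np.array([2.0, -1.0, 0.5, -2.0])
y_true = np.array([+1, +1, -1, -1])
print(loss_augmented_argmax_f1(scores, y_true))
```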

Experiments. Table 1 shows the prediction performance of the structural SVM on four benchmark datasets, described in more detail in [12]. In particular, it compares the structural SVM that directly optimizes the measure it is evaluated on (here F1, PRBEP, and ROCArea) to a conventional SVM that accounts for unbalanced classes with a linear cost model. For both methods, parameters were selected to optimize the evaluation measure via cross validation. In most cases, the structural SVM outperforms the conventional SVM, and it never does substantially worse.

Beyond these performance gains, the structural SVM approach has an attractive simplicity to it: direct optimization of the desired performance measure instead of using proxy measures during training (e.g. different linear cost models). However, there are still several open questions before making this process fully automatic; for example, understanding the interaction between ∆ and Ψ, as recently explored in [4, 3].

4. CONCLUSIONS AND OUTLOOK

In summary, structural SVMs are flexible and efficient methods for structured output prediction with a wide range of potential applications, most of which are completely or largely unexplored. Other explored applications include hierarchical classification [2], clustering [7], and optimizing average precision [30]. Due to the universality of the cutting-plane training algorithm, only relatively small API changes are required for any new application. SVMstruct is an implementation of the algorithm with APIs in C and Python, and it is available at svmlight.joachims.org.


Beyond applications, there are several areas for research in further developing the method. One such area is training structural SVMs with kernels. While the cutting-plane method can be extended to use kernels, it becomes quite slow, and more efficient methods are needed [28]. Another area results from the fact that solving the two argmax problems exactly is intractable for many application problems. However, for many such problems there exist methods that compute approximate solutions. An interesting question is how the approximation quality affects the quality of the structural SVM solution [8].

This work was supported in part through NSF Awards IIS-0412894 and IIS-0713483, NIH Grants IS10RR020889 and GM67823, a gift from Yahoo!, and by Google.

5. REFERENCES

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, Harlow, UK, May 1999.

[2] L. Cai and T. Hofmann. Hierarchical document categorization with support vector machines. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM), 2004.

[3] S. Chakrabarti, R. Khanna, U. Sawant, and C. Battacharyya. Structured learning for non-smooth ranking losses. In ACM Conference on Knowledge Discovery and Data Mining (KDD), 2008.

[4] O. Chapelle, Q. Le, and A. Smola. Large margin optimization of ranking measures. In NIPS Workshop on Machine Learning for Web Search, 2007.

[5] M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Empirical Methods in Natural Language Processing (EMNLP), pages 1–8, 2002.

[6] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research (JMLR), 2:265–292, 2001.

[7] T. Finley and T. Joachims. Supervised clustering with support vector machines. In International Conference on Machine Learning (ICML), pages 217–224, 2005.

[8] T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In International Conference on Machine Learning (ICML), pages 304–311, 2008.

[9] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.

[10] T. Hofmann, B. Schölkopf, and A. J. Smola. Kernel methods in machine learning. Annals of Statistics, 36(3):1171–1220, 2008.

[11] T. Joachims. Learning to Classify Text Using Support Vector Machines – Methods, Theory, and Algorithms. Kluwer/Springer, 2002.

[12] T. Joachims. A support vector method for multivariate performance measures. In International Conference on Machine Learning (ICML), pages 377–384, 2005.

[13] T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 217–226, 2006.

[14] T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, to appear.

[15] S. Khuller, A. Moss, and J. Naor. The budgeted maximum coverage problem. Information Processing Letters, 70(1):39–45, 1997.

[16] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.

[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

[18] A. McCallum, D. Freitag, and F. Pereira. Maximum entropy Markov models for information extraction and segmentation. In International Conference on Machine Learning (ICML), pages 591–598, 2000.

[19] J. Meller and R. Elber. Linear programming optimization and a double statistical filter for protein threading protocols. Proteins: Structure, Function, and Genetics, 45:241–261, 2001.

[20] J. Qiu and R. Elber. SSALN: an alignment algorithm using structure-dependent substitution matrices and gap penalties learned from structurally aligned protein pairs. Proteins, 62:881–91, 2006.

[21] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of TREC-3, 1994.

[22] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 134–141, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[23] I. N. Shindyalov and P. E. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng, 11:739–747, 1998.

[24] A. Swaminathan, C. Mathew, and D. Kirovski. Essential pages. Technical Report MSR-TR-2008-015, Microsoft Research, 2008.

[25] B. Taskar, C. Guestrin, and D. Koller. Maximum-margin Markov networks. In Advances in Neural Information Processing Systems (NIPS), 2003.

[26] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning (ICML), pages 104–112, 2004.

[27] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research (JMLR), 6:1453–1484, September 2005.

[28] C.-N. J. Yu and T. Joachims. Training structural SVMs with kernels using sampled cuts. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 794–802, 2008.

[29] C.-N. J. Yu, T. Joachims, R. Elber, and J. Pillardy. Support vector training of protein alignment models. Journal of Computational Biology, 15(7):867–880, September 2008.

[30] Y. Yue, T. Finley, F. Radlinski, and T. Joachims. A support vector method for optimizing average precision. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 271–278, 2007.

[31] Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In International Conference on Machine Learning (ICML), pages 271–278, 2008.

[32] C. Zhai, W. W. Cohen, and J. Lafferty. Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. In Proceedings of the ACM Conference on Research and Development in Information Retrieval (SIGIR), 2003.

[33] Y. Zhang and J. Skolnick. TM-align: A protein structure alignment algorithm based on TM-score. Nucleic Acids Research, 33:2302–2309, 2005.