
Kernel Methods
Fast Algorithms and Real Life Applications

A Thesis

Submitted For the Degree of

Doctor of Philosophy

in the Faculty of Engineering

by

S.V.N. Vishwanathan

Department of Computer Science and Automation
Indian Institute of Science

Bangalore – 560 012

JULY 2003

To my parents

for

everything . . .

Acknowledgments

They say that once in your lifetime there comes a person who opens the doors of your mind and

makes you more aware of yourself. For me such a person has been Prof. M. Narasimha Murty.

More than being my thesis adviser, he has been a friend, philosopher and guide in the truest

sense. Every time I had a problem, personal or professional, I rushed to him for advice, and

he never ever let me down. His immense confidence in me allowed me to achieve goals much

beyond my capabilities. I attribute whatever technical sophistication that can be found in this

thesis to his inspiration, motivation and guidance.

Whatever I thought were the qualities of a world class researcher and a good human being,

I found much more than that in Dr. Alexander Smola. His energy and enthusiasm for research

and his ability to listen to my endless ramblings and most importantly his belief in me have

contributed immensely to this thesis. I must thank Miki for patiently bearing with me when

I kept Alex at his office for long hours or discussed kernel methods with him on the way to

Mysore.

A part of my work was carried out at the Research School of Information Science and

Engineering (RSISE), Australian National University (ANU). I would like to thank RSISE for

an opportunity to attend the Machine Learning Summer School - 2002 and for hospitality and

facilities extended during my visit.

I would like to thank Prof. Adimurty for being very patient with me and teaching me whatever

little Analysis and Topology that I know. I would like to thank Prof. Vittal Rao for teaching

me Linear Algebra.

Many people have contributed to my education both technical and non-technical. Here I

must mention Mr. Satyam Dwivedi. He has been a near-perfect roommate and a valuable

friend who has helped me gain insights into various aspects of life. My lab mates have helped

me in various ways both personal and technical. Chapter 7 was inspired by a discussion with


P. Viswanath.

I would like to thank the department of Computer Science and Automation for creating a

world class environment for research. Prof. Y.N. Srikant the chairman of the department has

been especially helpful in shielding me from administrivia at different stages of my Ph.D. The

office staff, especially Mrs. Meenakshi, Mrs. Lalitha and Mr. Mohan, have been very helpful during

the entire course of my stay at the Institute.

The research work reported in this thesis was supported by an Infosys fellowship, a travel

grant from Netscaler Inc., a grant from TriVium India Software Ltd. and a grant from the

Australian Research Council.

Finally, I want to thank my parents for everything. They taught me that dreams are im-

portant, however big or small they may be. They have been very supportive of my dream to

pursue a Ph.D. I consider myself immensely lucky to have parents like them. Although they

were physically far away from me, their immense faith in me always kept me going. I dedicate

this thesis to them.

S.V.N. Vishwanathan

Contents

Acknowledgments
Abstract

1 Introduction
  1.1 VC Theory - A Brief Primer
    1.1.1 The Learning Problem
    1.1.2 Traditional Approach to Learning Algorithms
    1.1.3 VC Bounds
    1.1.4 Structural Risk Minimization
  1.2 Introduction to Linear SVM’s
    1.2.1 Linear Hard-Margin Formulation
    1.2.2 Linear Soft-Margin Formulation
    1.2.3 ν-SVM Formulation
  1.3 The Kernel Trick
  1.4 Quadratic Soft-Margin Formulation
  1.5 Contributions and a Road Map of this Thesis
  1.6 Summary

2 SimpleSVM: A SVM Training Algorithm
  2.1 Notation and Background
  2.2 Related Work
  2.3 The Basic Idea
  2.4 SimpleSVM
    2.4.1 The Dual Problem
    2.4.2 Finite Time Convergence
    2.4.3 Rate of Convergence
  2.5 Updates
    2.5.1 Initialization
    2.5.2 Adding a Point
    2.5.3 Removing a Point
  2.6 Extensions
    2.6.1 Rank-Degenerate Kernels
    2.6.2 Linear Soft-margin Loss
  2.7 Experiments
    2.7.1 Experimental Setup and Datasets
    2.7.2 Discussion of the Results
  2.8 Summary and Outlook

3 Modified Cholesky Factorization
  3.1 Introduction
    3.1.1 Previous Work
    3.1.2 Notation
  3.2 Matrix Factorization
    3.2.1 Triangular Factors
    3.2.2 Factorization
    3.2.3 Uniqueness and Existence
    3.2.4 Solution of Linear System
    3.2.5 Parallelization and Implementation Issues
    3.2.6 Extensions
  3.3 An LDV Factorization
  3.4 Rank Modifications
    3.4.1 Generic Rank-1 Update
    3.4.2 Rank-1 Update Where p = Z q
    3.4.3 Removal of a Row and Column
  3.5 Applications
    3.5.1 Interior Point Methods
    3.5.2 Lazy Decomposition
    3.5.3 Support Vector Machines
  3.6 Summary

4 Kernels on Discrete Objects
  4.1 Introduction
    4.1.1 Applications of Kernels on Discrete Structures
    4.1.2 Previous Work
  4.2 Defining Kernels
    4.2.1 Haussler’s R-Convolution
    4.2.2 Exact and Inexact Matches
  4.3 Sets
    4.3.1 Implementation Strategies
  4.4 Strings
    4.4.1 Notation
    4.4.2 Various String Kernels
  4.5 Trees
    4.5.1 Notation
    4.5.2 Various Tree Kernels
    4.5.3 Coarsening Levels
  4.6 Automata
    4.6.1 Finite State Automata
    4.6.2 Pushdown Automata
  4.7 Images
  4.8 Summary

5 Fast String and Tree Kernels
  5.1 Introduction
  5.2 String Kernel Definition
  5.3 Suffix Trees
    5.3.1 Definition of a Suffix Tree
    5.3.2 The Sentinel Character
    5.3.3 Suffix Links
    5.3.4 Efficient Construction
    5.3.5 Merging Suffix Trees
  5.4 Algorithm for Calculating Matching Statistics
    5.4.1 Definition of Matching Statistics
    5.4.2 Matching Statistics Algorithm
    5.4.3 Matching Substrings
  5.5 Our Algorithm for String Kernels
    5.5.1 Our Algorithm
  5.6 Weights and Kernels
  5.7 Linear Time Prediction
  5.8 Tree Kernels
    5.8.1 Ordering Trees
    5.8.2 Coarsening
  5.9 Experimental Results
  5.10 Summary

6 Kernels and Dynamic Systems
  6.1 Introduction
  6.2 Linear Time-Invariant Systems
  6.3 Kernels On Initial Conditions
    6.3.1 Discrete Time Systems
    6.3.2 Continuous Time Systems
    6.3.3 Special Cases
  6.4 Kernels on Dynamic Systems
    6.4.1 Discrete Time Systems
    6.4.2 Continuous Time Systems
    6.4.3 Non-Homogeneous Linear Time-Invariant Systems
  6.5 Summary

7 Jigsawing: A Method to Create Virtual Examples
  7.1 Background and Notation
  7.2 Related Work
  7.3 Basic Idea
  7.4 Jigsawing
    7.4.1 Notation
    7.4.2 Algorithm
    7.4.3 Time Complexity
    7.4.4 Why does Jigsawing Work?
  7.5 Applications
  7.6 An Image Kernel
    7.6.1 Algorithm
    7.6.2 Quadratic Time Prediction
  7.7 Summary

8 Summary and Future Work
  8.1 Contributions of the Thesis
  8.2 Extensions and Future Work

A Rank One Modification
  A.1 Rank Modification of a Positive Matrix
    A.1.1 Rank One Extension
    A.1.2 Rank One Reduction
  A.2 Rank-Degenerate Kernels

Bibliography

Abstract

Support Vector Machines (SVM) have recently gained prominence in the field of machine learning

and pattern classification (Vapnik, 1995, Herbrich, 2002, Scholkopf and Smola, 2002). Classifi-

cation is achieved by finding a separating hyperplane in a feature space which can be mapped

back onto a non-linear surface in the input space. However, training a SVM involves solving a

quadratic optimization problem, which tends to be computationally intensive. Furthermore, it

can be subject to stability problems and is non-trivial to implement. This thesis proposes a fast

iterative Support Vector training algorithm which overcomes some of these problems.

Our algorithm, which we christen SimpleSVM, works mainly for the quadratic soft margin

loss (also called the ℓ2 formulation). We also sketch an extension for the linear soft-margin loss

(also called the ℓ1 formulation). SimpleSVM works by incrementally changing a candidate Sup-

port Vector set using a locally greedy approach, until the supporting hyperplane is found within

a finite number of iterations. It is derived by a simple (yet computationally crucial) modification

of the incremental SVM training algorithm of Cauwenberghs and Poggio (2001) which allows us

to perform update operations very efficiently. Constant-time methods for initialization of the

algorithm and experimental evidence for the speed of the proposed algorithm, when compared to

methods such as Sequential Minimal Optimization and the Nearest Point Algorithm are given.

We present results on a variety of real life datasets to validate our claims.

In many real life applications, especially for the ℓ2 formulation, the kernel matrix K ∈ Rn×n
can be written as

K = ZZ⊤ + Λ,

where Z ∈ Rn×m with m ≪ n and Λ ∈ Rn×n is diagonal with nonnegative entries. Hence the
matrix K − Λ is rank-degenerate. Extending the work of Fine and Scheinberg (2001) and Gill
et al. (1975) we propose an efficient factorization algorithm which can be used to find an LDL⊤


factorization of K in O(nm²) time. The modified factorization, after a rank-one update of K, can
be computed in O(m²) time. We show how the SimpleSVM algorithm can be sped up by taking

advantage of this new factorization. We also demonstrate applications of our factorization to

interior point methods. We show a close relation between the LDV factorization of a rectangular

matrix and our LDL⊤ factorization (Gill et al., 1975).

An important feature of SVM’s is that they can work with data from any input domain as

long as a suitable mapping into a Hilbert space can be found, in other words, given the input

data we should be able to compute a positive semi-definite kernel matrix of the data (Scholkopf

and Smola, 2002). In this thesis we propose kernels on a variety of discrete objects, such as

strings, trees, Finite State Automata, and Pushdown Automata. We show that our kernels

include as special cases the celebrated Pair-HMM kernels (Durbin et al., 1998, Watkins, 2000),

the spectrum kernel (Leslie et al., 2002a), convolution kernels for NLP (Collins and Duffy, 2001),

graph diffusion kernels (Kondor and Lafferty, 2002) and various other string-matching kernels.

Because of their widespread applications in bio-informatics and web document based algo-

rithms, string kernels are of special practical importance. By intelligently using the matching

statistics algorithm of Chang and Lawler (1994), we propose, perhaps, the first ever algorithm to

compute string kernels in linear time. This obviates dynamic programming with quadratic time

complexity and makes string kernels a viable alternative for the practitioner. We also propose

extensions of our string kernels to compute kernels on trees efficiently. This thesis presents a

linear time algorithm for ordered trees and a log-linear time algorithm for un-ordered trees.

In general, SVM’s require time proportional to the number of Support Vectors for prediction.

If the dataset is noisy, a large fraction of the data points become Support Vectors and thus
the time required for prediction increases. But in many applications, like search engines or web
document retrieval, the dataset is noisy, yet the speed of prediction is critical. We propose a

method for string kernels by which the prediction time can be reduced to linear in the length of

the sequence to be classified, regardless of the number of Support Vectors. We achieve this by

using a weighted version of our string kernel algorithm.

We explore the relationship between dynamic systems and kernels. We define kernels on var-

ious kinds of dynamic systems including Markov chains (both discrete and continuous), diffusion

processes on graphs and Markov chains, Finite State Automata, various linear time-invariant

systems etc. Trajectories are used to define kernels induced on initial conditions by the under-

lying dynamic system. The same idea is extended to define kernels on a dynamic system with

respect to a set of initial conditions. This framework leads to a large number of novel kernels

and also generalizes many previously proposed kernels.

Lack of adequate training data is a problem which plagues classifiers. We propose a new

method to generate virtual training samples in the case of handwritten digit data. Our method

uses the two dimensional suffix tree representation of a set of matrices to encode an exponential

number of virtual samples in linear space thus leading to an increase in classification accuracy.

This in turn, leads us naturally to a compact data dependent representation of a test pattern

which we call the description tree. We propose a new kernel for images and demonstrate a

quadratic time algorithm for computing it by using the suffix tree representation of an image.

We also describe a method to reduce the prediction time to quadratic in the size of the test

image by using techniques similar to those used for string kernels.

Chapter 1

Introduction

This chapter introduces our notation and presents a tutorial introduction to Support Vector

Machines (SVM). Various loss functions which give rise to slightly different quadratic optimiza-

tion problems are discussed. Intuitive arguments are provided to show the relation between

SVM’s and Structural Risk Minimization (SRM). We also try to explain why SVM’s perform so

well on a variety of challenging problems. The main contributions of this thesis are also briefly

discussed.

In Section 1.1 we present a brief introduction to VC theory. We also point out a few short-

comings of traditional machine learning algorithms and show how these lead naturally to the

development of SVM’s. In Section 1.2 we introduce the linearly separable SVM formulation.

We then discuss the linear hard-margin formulation and extend it to the linear soft-margin for-

mulation. Going further we sketch the ν-SVM formulation and briefly discuss the interpretation

of parameter ν. We also briefly touch upon the relation between the ν-SVM formulation and

the linear soft-margin formulation. In Section 1.3 we introduce the kernel trick and show how

it can be used to project points to a higher dimensional space where they may be linearly sep-

arable. The kernel trick can also be used to extend SVM’s to work with non-vectorial data. In

Section 1.4 we discuss the extension to the quadratic soft-margin loss formulation also known

as the `2 formulation. The main contributions of this thesis are presented in Section 1.5. We

conclude this chapter with a summary in Section 1.6.

The aim of this chapter is to sketch various ideas and provide an overview of basic concepts.

It tries to provide numerous references to published literature for further reading. As such, there

are no pre-requisites to read this chapter, although a basic familiarity with machine learning



and pattern recognition will be useful. In general we sacrifice some mathematical rigor in

order to present more intuition to the reader. Throughout this chapter we concentrate our

attention entirely on the pattern recognition problem, an excellent tutorial on the use of SVM’s

for regression can be found in Smola and Scholkopf (1998).

1.1 VC Theory - A Brief Primer

In this section we formalize the binary learning problem. We then present the traditional

approach to learning and point out some of its shortcomings. We go on to give some intuition

behind the concept of VC-dimension and show why capacity plays an important role while

designing classifiers (Vapnik, 1995).

1.1.1 The Learning Problem

In the following, we denote by {(x1, y1), . . . , (xn, yn)} ⊂ X × {±1} the set of labeled training
samples¹, where xi are drawn from some input domain X while yi ∈ {±1} denote the class labels

+1 and −1 respectively. Furthermore, let n be the total number of points and let n+ and n−

denote the number of points in class +1 and −1 respectively. We assume that the samples are all

drawn i.i.d (Independent and Identically Distributed) from an unknown probability distribution

P (x, y).

Let F be a class of functions f : X → {±1} parameterized by a set of adjustable parameters

ρ. For example, ρ could be the weights on various nodes of a neural network. The goal of

building learning machines is to choose a ρ such that we can predict well on unknown samples

drawn from the same underlying distribution P (x, y).

1.1.2 Traditional Approach to Learning Algorithms

The empirical error for the given training set is defined as

Eemp(ρ) = ∑_{i=1}^{n} c(f(xi, ρ), yi),

¹ For convenience of notation, throughout this thesis, we assume that there are no duplicates in the observations.


where f(x, ρ) is the class label predicted by the algorithm and c(., .) is some error function. The

empirical risk for a learning machine is just the measured mean error rate on the training set.

Using a 0-1 loss function it can be written as

Remp(ρ) = (1/2n) ∑_{i=1}^{n} |f(xi, ρ) − yi|.

The actual risk, which is the mean error rate over the entire distribution P(x, y), is defined as

Ractual(ρ) = ∫ (1/2) |f(x, ρ) − y| dP(x, y).

Since the underlying distribution P(x, y) is not known, it is generally not possible to compute
the actual risk.
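A minimal numpy sketch of the empirical risk under the 0-1 loss, using a made-up linear decision rule f; with predictions and labels in {±1}, the quantity (1/2n) ∑ |f(xi) − yi| is exactly the fraction of misclassified training points.

import numpy as np

def empirical_risk(f, X, y):
    # R_emp with the 0-1 loss: (1/2n) * sum_i |f(x_i) - y_i|, for labels in {-1, +1}.
    preds = np.array([f(x) for x in X])
    return np.abs(preds - y).sum() / (2 * len(y))

# Toy illustration with a hypothetical linear rule sign(<w, x> + b).
w, b = np.array([1.0, -1.0]), 0.0
f = lambda x: 1 if np.dot(w, x) + b >= 0 else -1
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 2.0]])
y = np.array([1, -1, 1, 1])
print(empirical_risk(f, X, y))   # 0.25, i.e. one of four points misclassified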

Many traditional learning algorithms concentrated their energy on the task of minimizing

the empirical risk on the training samples (Haykin, 1994). The hope was that, if the training

set was sufficiently representative of the underlying distribution, the algorithm would learn the

distribution and hence generalize to make proper predictions on unknown test samples. In other

words, what we are hoping for is that the mean of the empirical risk converges to the actual

risk as the number of training points increases to infinity (Vapnik, 1995). But, researchers soon

realized that this was not always the case. For example, consider the toy 2-d classification

problem depicted in Figure 1.1. The decision function on the left is a simple one which mis-

classifies a lot of points while that on the right is quite complex and manages to drive the

empirical risk to zero. But, we intuitively expect the function in the middle, which makes a few

errors on the training set, to generalize well on unseen data points.

As another example, consider a learning algorithm that naively remembers the class label

of every training sample presented to it. Following Burges (1998), we call such an algorithm a

memory machine. The memory machine of course has 100% accuracy on the training samples,

but, clearly cannot generalize on the test set.

If F is very rich, then, for each function f ∈ F and any test set {(x̄1, ȳ1), . . . , (x̄m, ȳm)} ⊂
X × {±1} such that {x̄1, . . . , x̄m} ∩ {x1, . . . , xn} = ∅, there exists another function f∗ such that
f(xi) = f∗(xi) ∀i ∈ {1, . . . , n}, while f(x̄i) ≠ f∗(x̄i) ∀i ∈ {1, . . . ,m}. As we are only given the

training data, we have no means of selecting which of the two functions (and hence which of

the two different sets of test label predictions) is preferable (Scholkopf and Smola, 2002). Thus,


Figure 1.1: Three different decision functions for the same classification problem. The hollow and filled circles belong to two different classes. Error points are marked with an x. (Courtesy Scholkopf and Smola (2002))

it is clear that we need some more conditions on F to make the empirical risk converge to the

actual risk. These conditions are provided by the VC bounds (Vapnik, 1995).

1.1.3 VC Bounds

Let 0 ≤ η ≤ 1 be a number. Then, Vapnik and Chervonenkis proved that, for the 0-1 loss
function, with probability 1 − η, the following bound holds (Vapnik, 1995)

Ractual(ρ) ≤ Remp(ρ) + φ(h/n, log(η)/n),    (1.1)

where

φ(h/n, log(η)/n) = √( (h(log(2n/h) + 1) − log(η/4)) / n )    (1.2)

is called the confidence term. Here h is defined to be a non-negative integer called the
Vapnik-Chervonenkis (VC) dimension. The VC-dimension of a machine measures the capacity of the
class F that the machine can implement. A finite set of h points is said to be shattered by F if
for each of the possible 2^h labellings there is an f ∈ F which correctly classifies the points. The
VC dimension is defined as the largest h such that there exists a set of h points which the class
can shatter, and ∞ if no such h exists.

For example, consider the VC dimension of the set of hyperplanes in R². There are 2³ = 8
ways of assigning 3 points to two classes. For the points shown in Figure 1.2, all 8 possibilities can
be realized using separating hyperplanes; in other words, the function class can shatter 3 points.

But, we can see that given any 4 points we cannot find hyperplanes which realize each of the


2⁴ = 16 possible labellings. Therefore, the VC dimension of the class of separating hyperplanes
in R² is 3.

Figure 1.2: VC dimension of the class of separating hyperplanes in R². (Courtesy Scholkopf and Smola (2002))
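A brute-force check of this example: the sketch below tests linear separability of a labelling via a small feasibility linear program (scipy's linprog is an assumption of this sketch, not something used in the thesis), confirms that all 2³ labellings of three points in general position are realizable, and shows that the XOR labelling of four points is not.

import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    # Feasibility check: does some (w, b) satisfy yi(<w, xi> + b) >= 1 for all i?
    n, d = X.shape
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])   # variables are [w_1..w_d, b]
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0              # 0 = a feasible optimum was found

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])          # three points in general position
print(all(separable(X3, np.array(lab))
          for lab in itertools.product([-1, 1], repeat=3)))  # True: all 2^3 labellings work

X4 = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(separable(X4, np.array([1, 1, -1, -1])))               # False: the XOR labelling fails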

Consider the memory machine that we introduced in Section 1.1.2. Clearly, this machine can

drive the empirical risk to zero, but, still does not generalize well because it has a large capacity.

This leads us to the observation that, while minimizing empirical error is important, it is equally

important to use a machine with a low capacity. In other words, given two machines with the

same empirical risk, we have higher confidence in the machine with the lower VC-dimension.

A word of caution is in order here. It is often very difficult to measure the VC-dimension of

a machine practically. As a result, it is quite difficult to calculate the VC bounds explicitly. The

bounds provided by VC theory are often very loose and may not be of practical use. Tighter

bounds are provided by the annealed VC entropy or the growth function, but, they are even

more difficult to estimate in practice. It must also be borne in mind that only an upper bound

on the actual risk is available. This does not mean that a machine with larger capacity will

always generalize poorly. What the bound says is that, given the training data, we have more

confidence in a machine which has lower capacity. In some sense Equation (1.1) is a restatement

of the celebrated principle of Occam’s razor.


1.1.4 Structural Risk Minimization

The bounds provided by VC theory can be exploited in order to do model selection. A structure

is a nested class of functions Si such that

S1 ⊆ S2 ⊆ . . . ⊆ Sn ⊆ . . .

and hence their corresponding VC-dimensions hi satisfy

h1 ≤ h2 ≤ . . . ≤ hn . . .

Now, because of the nested structure of the function classes, the empirical risk Remp decreases as
we move towards a bigger class. This is because the complexity of the function class increases

and hence it can explain the training data well. But, since the VC-dimensions are increasing

the confidence bound (φ) increases as h increases. The curves shown in Figure 1.3 depict

Equation (1.1) and the above observations pictorially.

Figure 1.3: Graphical depiction of the structural risk minimization (SRM) induction principle. (Courtesy Scholkopf and Smola (2002))


Figure 1.4: The circles and diamonds belong to two different classes. The solid line represents the maximally separating linear boundary. Points x1 and x2 are Support Vectors. (Courtesy Scholkopf and Smola (2002))

These observations suggest a principled way of selecting a class of functions. The function

class is decomposed into a nested sequence of subsets of increasing size (and thus, of increasing

capacity). The SRM principle picks a function which has small training error, and comes from

an element of the structure that has low capacity, thus minimizing a risk bound shown in

Equation (1.1). This procedure is referred to as capacity control or model selection or structural

risk minimization.
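A schematic numpy sketch of this selection rule, with made-up empirical risks and VC dimensions for a hypothetical nested structure: the chosen element minimizes the bound of Equation (1.1) rather than the empirical risk alone.

import numpy as np

def confidence_term(h, n, eta=0.05):
    # Confidence term phi(h/n, log(eta)/n) from Equation (1.2).
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

def srm_select(emp_risks, vc_dims, n, eta=0.05):
    # Pick the element of the structure minimizing R_emp + phi, as in Equation (1.1).
    bounds = [r + confidence_term(h, n, eta) for r, h in zip(emp_risks, vc_dims)]
    return int(np.argmin(bounds)), bounds

# Hypothetical nested structure S_1 ⊆ ... ⊆ S_5: richer classes fit the training
# data better (smaller R_emp) but have a larger VC dimension. Numbers are made up.
emp_risks = [0.30, 0.18, 0.10, 0.06, 0.05]
vc_dims   = [3, 10, 40, 150, 600]
best, bounds = srm_select(emp_risks, vc_dims, n=1000)
print(best, np.round(bounds, 3))   # the minimizer trades off fit against capacity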

1.2 Introduction to Linear SVM’s

For simplicity of exposition, we assume in this section that X = Rd for some d. Furthermore,

we assume that the data points belonging to different classes are linearly separable. Figure 1.4

depicts a binary classification 2-d toy problem. The balls and diamonds belong to different

classes.

Since the problem is separable, there exist many linear decision surfaces (hyperplanes) pa-

rameterized by (w, b) with w ∈ Rd and b ∈ R which can be written as fw,b = 〈w,x〉 + b = 0,

where 〈w,x〉 denotes the dot product between vectors w and x. These hyperplanes satisfy

yi(〈w,xi〉+ b) > 0 for all i ∈ {1, 2, . . . , n}. Rescaling w and b such that the point(s) closest to

the hyperplane satisfy |〈w,xi〉 + b| = 1, we obtain a canonical form (w, b) of the hyperplane,

satisfying yi(〈w,xi〉 + b) ≥ 1 for all i ∈ {1, 2, . . . , n}. Note that in this case, the margin (the


distance of the closest point to the hyperplane) equals 1/‖w‖. This can be seen by considering

two closest points x1 and x2 on opposite sides of the margin, and projecting them onto the

hyperplane normal vector w/‖w‖ (Scholkopf, 1997).
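A small numpy sketch of this rescaling, with made-up separable data and a made-up separating hyperplane: after bringing (w, b) to canonical form, the margin indeed equals 1/‖w‖.

import numpy as np

def canonical_form(w, b, X, y):
    # Rescale (w, b) so that the closest point(s) satisfy |<w, x_i> + b| = 1.
    scale = np.min(y * (X @ w + b))      # smallest functional margin, > 0 if separating
    return w / scale, b / scale

# Made-up separable data and a separating hyperplane.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w_c, b_c = canonical_form(np.array([1.0, 1.0]), 0.0, X, y)

geometric_margin = np.min(y * (X @ w_c + b_c)) / np.linalg.norm(w_c)
print(np.isclose(geometric_margin, 1.0 / np.linalg.norm(w_c)))   # True: margin = 1/||w||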

The decision surface that is intuitively appealing is the one that maximally separates the

points belonging to two different classes. In some sense, we are making the best guess given the

limited data that is available to us. The following lemma from Vapnik (1995) formalizes our

intuition.

Lemma 1 Let R be the radius of the smallest ball containing all the training samples, and let
the canonical decision function defined on the training points (denoted by fw,b) be a hyperplane
with parameters w and b. Then the set {fw,b : ‖w‖ ≤ A, A ∈ R} has VC-dimension h satisfying

h < R²A² + 1.

Thus, a large margin implies a small value for ‖w‖ and hence a small value of A, therefore

ensuring that the VC-dimension of the class fw,b is small. This can be understood geometrically

as follows: as the margin increases the number of planes, with the given margin, which can

separate the points into two classes decreases and thus the capacity of the class decreases. Thus,

the hyperplane with the largest margin of separation has the least capacity and hence is the

optimal hyperplane. The optimal hyperplane for our 2-d toy problem in Figure 1.4 is shown as

a solid line.

Recently there has been some work on data dependent machine learning where the distribu-

tion of the test samples is also taken into account while performing structural risk minimization.

We refer the reader to Cristianini and Shawe-Taylor (2000), Shawe-Taylor et al. (1998), Cannon

et al. (2002) for more details.

1.2.1 Linear Hard-Margin Formulation

The problem of maximizing the margin can be expressed as

minimize_{w,b}   (1/2)‖w‖²
subject to   yi(〈w, xi〉 + b) ≥ 1 for all i ∈ {1, 2, . . . , n}.    (1.3)


A standard technique for solving such problems is to formulate the Lagrangian and solve the

dual problem. Let α ∈ Rn be non-negative Lagrange multipliers. The dual can be written as

maximize_{α}   −(1/2) α⊤Hα + ∑i αi
subject to   ∑i αiyi = 0 and αi ≥ 0 for all i ∈ {1, 2, . . . , n}.    (1.4)

Here H ∈ Rn×n with Hij := yiyj〈xi,xj〉. Moreover, it is a basic fact from optimization theory

(Mangasarian, 1969) that the minimum of Equation (1.3) equals the maximum of Equation (1.4).

A key observation here is that the dual problem involves only the dot products of the form

〈xi,xj〉. Another interesting observation is that the αi’s are nonzero only for those data points

which satisfy the primal constraints with equality. These points are called Support Vectors to

denote the fact that their removal will change the solution of Equation (1.3). Two Support

Vectors for our simple 2-dimensional case are marked as x1 and x2 in Figure 1.4.

It is well known that the normal of the optimal separating hyperplane between the sets with yi = 1
and yi = −1 is a linear combination of points in feature space (Boser et al., 1992). Consequently,

the classification rule can be expressed in terms of dot products in feature space and we have

f(x) = 〈w, x〉 + b = ∑j αjyj〈xj, x〉 + b,    (1.5)

where αj ≥ 0 is the coefficient associated with a Support Vector xj and b is an offset. In some

sense SVM’s assign maximum weightage to boundary patterns which intuitively are the most

important patterns for discriminating between the two classes (Scholkopf and Smola, 2002).

In the case of a hard-margin SVM, all SV’s satisfy yif(xi) = 1 and for all other points we have

yif(xi) > 1. Furthermore (to account for the constant offset b) we have the condition ∑i yiαi = 0

(Vapnik and Chervonenkis, 1974). This means that if we knew all SV’s beforehand, we could

simply find the solution of the associated quadratic program by a simple matrix inversion.
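To make the last remark concrete, the following numpy sketch solves that linear system for a given, hypothetical SV set: the conditions yif(xi) = 1 on the SV's and ∑i yiαi = 0 form the bordered system below. This only illustrates the remark; it is not an SVM training algorithm.

import numpy as np

def solve_on_sv_set(K_sv, y_sv):
    # Given the kernel matrix and labels of a *known* SV set, solve the bordered system
    #   [ H   y ] [alpha]   [1]
    #   [ y⊤  0 ] [  b  ] = [0],   with Hij = yi yj k(xi, xj),
    # which enforces yi f(xi) = 1 on the SVs and sum_i yi alpha_i = 0.
    m = len(y_sv)
    H = (y_sv[:, None] * y_sv[None, :]) * K_sv
    M = np.block([[H, y_sv[:, None]], [y_sv[None, :], np.zeros((1, 1))]])
    sol = np.linalg.solve(M, np.concatenate([np.ones(m), [0.0]]))
    return sol[:m], sol[m]            # alpha on the SVs, and the offset b

# Toy check with a made-up 2x2 kernel matrix of two assumed SVs.
K = np.array([[1.0, 0.0], [0.0, 1.0]])
alpha, b = solve_on_sv_set(K, np.array([1.0, -1.0]))
print(alpha, b)                       # [1. 1.] 0.0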

1.2.2 Linear Soft-Margin Formulation

Practically observed data is frequently corrupted by noise. It is also well known that the noisy

patterns tend to occur near the boundaries (Duda et al., 2001). In such a case the data points

may not be separable by a linear hyperplane. Furthermore, we would like to ignore the noisy


points in order to improve generalization performance. If outliers are taken into account then

the margin of separation decreases and intuitively the solution does not generalize well.

We account for outliers (noisy points) by introducing non-negative slack variables ξ ∈ Rn

which penalize the outliers (Bennett and Mangasarian, 1993, Cortes and Vapnik, 1995). Let

C be a penalty factor which controls the penalty incurred by each misclassified point in the

training set. The primal problem is modified as

minimize_{w,b,ξ}   (1/2)‖w‖² + C ∑i ξi
subject to   yi(〈w, xi〉 + b) ≥ 1 − ξi for all i ∈ {1, 2, . . . , n},    (1.6)

while the dual can be written as

maximize_{α}   −(1/2) α⊤Hα + ∑i αi
subject to   ∑i αiyi = 0 and C ≥ αi ≥ 0 for all i ∈ {1, 2, . . . , n}.    (1.7)

In this case, if αi = C then we call the corresponding xi an error vector. Note that in this case

the form of the solution does not change and remains the same as shown in Equation (1.5). The

above formulation where we penalize the error points linearly is also popularly known as the

linear soft-margin loss or the ℓ1 formulation.

1.2.3 ν-SVM Formulation

In the ℓ1 formulation (Equation (1.6)), C is a constant determining the trade-off between two

conflicting goals: minimizing the training error, and maximizing the margin. Unfortunately, C is

a rather un-intuitive parameter, and we have no a priori way to select it. The ν-SVM formulation

was proposed to overcome this difficulty (Scholkopf et al., 2000). The primal problem is written

as

minimize_{w,b,ξ,ρ}   (1/2)‖w‖² − νρ + (1/m) ∑i ξi
subject to   yi(〈w, xi〉 + b) ≥ ρ − ξi for all i ∈ {1, 2, . . . , n}
and   ξi ≥ 0, ρ ≥ 0.    (1.8)


Using the technique of Lagrange multipliers the dual problem is obtained after some algebra as

maximize_{α}   −(1/2) α⊤Hα    (1.9)
subject to   ∑i αiyi = 0    (1.10)
0 ≤ αi ≤ 1/m    (1.11)
∑i αi ≥ ν    (1.12)

The following theorem from Scholkopf et al. (2000) provides an interpretation of the parameter

ν.

Theorem 2 Suppose we run ν-SVM with ν on some data with the result that ρ > 0, then

• ν is an upper bound on the fraction of margin errors.

• ν is a lower bound on the fraction of Support Vectors.

It can also be shown that under some assumptions ν asymptotically equals the fraction of

Support Vectors as well as the fraction of errors. The ν-SVM also has a surprising connection

with the ℓ1 formulation which is stated in the following theorem from Scholkopf et al. (2000).

Theorem 3 If ν-SVM classification leads to ρ > 0, then the ℓ1 classification with C set a priori

to 1/ρ, leads to the same decision function.
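A small empirical illustration of Theorem 2, using scikit-learn's NuSVC as a stand-in ν-SVM solver on synthetic Gaussian data (both are assumptions of this sketch). The support-vector count checks the lower bound directly; counting margin errors as points with yif(xi) < 1 additionally assumes libsvm's rescaling of the fitted decision function, so treat that number as approximate.

import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
n = 400
X = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
               rng.normal(+1.0, 1.0, size=(n // 2, 2))])
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

nu = 0.2
clf = NuSVC(nu=nu, kernel="linear").fit(X, y)

frac_sv = len(clf.support_) / n
frac_margin_err = np.mean(y * clf.decision_function(X) < 1.0)   # assumes margins at +-1
print(f"nu = {nu}: SV fraction = {frac_sv:.2f} (>= nu), "
      f"margin-error fraction = {frac_margin_err:.2f} (<= nu)")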

1.3 The Kernel Trick

In the previous section we assumed that the optimal decision surface was a linear hyperplane.

In real life situations this is a very restrictive assumption. But suppose we can find a non-linear
mapping φ : Rd → Rk, with d ≪ k, such that the data points are linearly separable in Rk; then
we can still use the linear SVM by replacing 〈xi,xj〉 with 〈φ(xi), φ(xj)〉 (cf. Figure 1.5).

Consider the toy example of a binary classification problem mapped into feature space shown

in Figure 1.6. We assume that the true decision boundary shown on the left is an ellipse in

input space. When mapped into feature space via the nonlinear map φ(x) = (z1, z2, z3) =
(x1², x2², √2 x1x2), the ellipse becomes a hyperplane, as shown on the right. It turns
out that functions which satisfy Mercer’s conditions are admissible as kernels,


Figure 1.5: Nonlinear Mapping into a space of functions. Here each point x is identified with a function φ(x). (Courtesy Scholkopf and Smola (2002))

Figure 1.6: Mapping an ellipse into a hyperplane. (Courtesy Scholkopf and Smola (2002))


i.e. they can be written as

k(xi,xj) = 〈φ(xi), φ(xj)〉 ,

where φ is some non-linear mapping to a higher dimensional Hilbert space. Note that this

mapping is implicit, and, at no point do we actually need to calculate the mapping function

φ. As a result, all calculations are carried out in the space in which the data points reside. In

fact, a large class of algorithms which use similarity between points can be kernelized to work in

higher dimensional space. The rather technical Mercer’s condition is expressed as the following

two lemmas (Courant and Hilbert, 1953, 1962).

Lemma 4 If k is a continuous symmetric kernel of a positive integral operator K of the form

(Kf)(y) = ∫_C k(x, y) f(x) dx    (1.13)

with

∫_{C×C} k(x, y) f(x) f(y) dx dy ≥ 0    (1.14)

for all f ∈ L2(C) where C is a compact subset of Rn, it can be expanded in a uniformly convergent

series (on C × C) in terms of eigenfunctions ψj and positive eigenvalues λj

k(x, y) = ∑_{j=1}^{NF} λj ψj(x) ψj(y),    (1.15)

where NF ≤ ∞.

Lemma 5 If k is a continuous kernel of a positive integral operator, one can construct a map-

ping φ into a space where k acts as a dot product,

〈φ(x), φ(y)〉 = k(x,y). (1.16)

We refer the reader to (Scholkopf and Smola, 2002, Chapter 2) for an excellent technical discus-

sion on Mercer’s conditions and related topics.
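A tiny numpy illustration of why the mapping never needs to be computed: for the degree-2 map φ(x) = (x1², x2², √2 x1x2) used in the ellipse example above, the feature-space dot product coincides with the input-space kernel k(x, y) = 〈x, y〉².

import numpy as np

def phi(x):
    # Explicit degree-2 feature map for x in R^2 (the map of Figure 1.6).
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2.0) * x[0] * x[1]])

def k(x, y):
    # The corresponding kernel, computed entirely in input space: k(x, y) = <x, y>^2.
    return np.dot(x, y) ** 2

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(y)), k(x, y))   # identical values: <phi(x), phi(y)> = k(x, y)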

In general, given data drawn from any domain X we can use SVM’s as long as we can find

a mapping φ : X → H, where H is any Hilbert space. Thus, the advantage of using SVM’s is

that we can work with non-vectorial data as long as the corresponding mapping to a Hilbert


space can be found. In this thesis we exhibit many such mappings and give efficient algorithms

to compute them.

1.4 Quadratic Soft-Margin Formulation

In case we penalize the error points quadratically, the objective function Equation (1.6) is

modified as

minimize_{w,b}   (1/2)‖w‖² + C ∑i ξi²
subject to   yi(〈w, xi〉 + b) ≥ 1 − ξi for all i ∈ {1, 2, . . . , n}.    (1.17)

This formulation has been shown to be equivalent to the separable linear formulation in a space

that has more dimensions than the kernel space (Cortes and Vapnik, 1995, Freund and Schapire,

1999, Keerthi et al., 1999, Cristianini and Shawe-Taylor, 2000). In other words the quadratic loss

function gives rise to a modified hard-margin SV problem, where the kernel k(x, x′) is replaced

by k(x, x′) + χδx,x′ for some χ > 0. The above formulation where we penalize the error points

quadratically is also popularly known as the quadratic soft-margin loss or the ℓ2 formulation. It
will be the main focus of the SimpleSVM algorithm, which we describe in Chapter 2.
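A minimal sketch of this kernel substitution at the level of the Gram matrix: adding χδx,x′ simply places χ on the diagonal of K. The relation between χ and C is left unspecified here; χ is a free parameter of the sketch.

import numpy as np

def gram(X, k):
    # Gram matrix K_ij = k(x_i, x_j) for a generic kernel function k.
    return np.array([[k(xi, xj) for xj in X] for xi in X])

def ridged_gram(X, k, chi):
    # Kernel of the equivalent hard-margin problem: k(x, x') + chi * delta_{x, x'}.
    return gram(X, k) + chi * np.eye(len(X))

# Toy usage with a linear kernel; chi > 0 is the (unspecified) constant from the text.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
linear = lambda a, b: float(np.dot(a, b))
print(ridged_gram(X, linear, chi=0.5))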

1.5 Contributions and a Road Map of this Thesis

In Chapter 2 we present a fast iterative Support Vector training algorithm for the quadratic soft-

margin formulation. Our algorithm, which we christen the SimpleSVM, works by incrementally

changing a candidate Support Vector set using a locally greedy approach, until the supporting

hyperplane is found within a finite number of iterations. It is derived by a simple (yet computa-

tionally crucial) modification of the incremental SVM training algorithms of Cauwenberghs and

Poggio (2001) which allows us to perform update operations very efficiently. We also indicate

methods to extend our algorithm to the linear soft margin loss formulation.

The LDL⊤ decomposition of a positive semi-definite matrix A ∈ Rn×n, where L ∈ Rn×n is
unit lower triangular and D ∈ Rn×n is diagonal, is popularly known as the Cholesky decomposition.
It is widely used in many applications because of its excellent numerical stability (Gill
et al., 1974). In general, computing the LDL⊤ decomposition of an n × n matrix requires O(n³)
computations, while updating it after a rank-one change in A requires O(n²) computations. In


many applications of SVM’s, especially for the ℓ2 formulation, the kernel matrix K ∈ Rn×n can
be written as K = ZZ⊤ + Λ, where Z ∈ Rn×m with m ≪ n and Λ is diagonal with nonnegative
entries. Hence the matrix K − Λ is rank-degenerate. In Chapter 3 we present an O(nm²) algorithm
to compute the LDL⊤ factorization of such a matrix. We also show how rank-one updates
of such a factorization can be carried out in O(mn) time. We demonstrate the application of this

factorization to speed up the SimpleSVM algorithm. We also present applications to interior

point methods.
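A small numpy sketch of the structure being exploited: for K = ZZ⊤ + Λ with Z ∈ Rn×m and m ≪ n, the matrix K − Λ has rank at most m while K itself stays positive definite. The sizes and entries below are made up.

import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 5
Z = rng.normal(size=(n, m))
Lam = np.diag(rng.uniform(0.1, 1.0, size=n))      # nonnegative diagonal

K = Z @ Z.T + Lam
print(np.linalg.matrix_rank(K - Lam), "<=", m)    # K - Lambda is rank-degenerate
print(np.all(np.linalg.eigvalsh(K) > 0))          # K itself is positive definite here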

In Chapter 4 we try to provide a general overview of R-Convolution kernels proposed by

Haussler (1999). We produce many extensions and exhibit new kernels and also show how

various previous kernels can be viewed in this framework. This chapter provides general recipes

for defining kernels on strings, trees, Finite State Automata, images, etc. We also sketch a few

fast algorithms for computing kernels on sets. Specific implementation details and algorithms

for all other kernels are relegated to later chapters.

In Chapter 5 we present algorithms for computing kernels on strings (Watkins, 2000, Haus-

sler, 1999, Leslie et al., 2002a) and trees (Collins and Duffy, 2001) in linear time in the size of

the arguments, regardless of the weighting that is associated with any of the terms. We show

how suffix trees on strings can be used to enumerate all common substrings of two given strings.

This information can then be used to compute string kernels efficiently. In order to compute

kernels on trees we exhibit an algorithm to obtain the string representation of a tree. The string

kernel ideas are then used to compute kernels on trees. We discuss an algorithm for string

kernels by which the prediction cost can be reduced to linear cost in the length of the sequence

to be classified, regardless of the number of Support Vectors.
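For contrast with the linear-time algorithms announced above, here is a naive k-spectrum kernel written directly from its definition (counting shared substrings of one fixed length); it is only a baseline, not the suffix-tree algorithm of Chapter 5, and it scales poorly with weighting schemes and substring lengths.

from collections import Counter

def spectrum_kernel(s, t, k=3):
    # Naive k-spectrum kernel: sum over all length-k substrings u of
    # (# occurrences of u in s) * (# occurrences of u in t).
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[u] * ct[u] for u in cs)

print(spectrum_kernel("abracadabra", "abracad", k=3))   # 7 shared 3-mer occurrences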

In Chapter 6 we explore the relationship between dynamical systems and kernels to define

kernels on dynamical systems with respect to a set of initial conditions and on initial conditions

with respect to an underlying dynamical system. This is achieved by comparing trajectories,

which leads to a large number of known and many novel kernels. We show, how, many previous

kernels can be viewed as special cases of our definitions and also propose many new kernels.

Using our definition we propose kernels on Markov Chains (discrete and continuous), diffusion

processes, graphs, linear time invariant systems, and Finite State Automata.

In Chapter 7 we describe a new method to generate virtual training samples in the case

of handwritten digit data. We use the two dimensional suffix tree representation of a set of


matrices to encode an exponential number of virtual samples in linear space, thus, leading to

an increase in classification accuracy. We propose a quadratic time algorithm for computing

kernels on images. Methods to reduce the prediction time to quadratic in the size of the test

image are also described.

We summarize the thesis in Chapter 8 with pointers for future research.

1.6 Summary

In this chapter we reviewed various ideas from machine learning and statistical learning theory.

We also showed why minimizing the empirical risk is not adequate for the classifier to generalize

and sketched the evolution of SVM’s from the ideas of statistical learning theory. We described

various optimization problems which arise out of different SVM formulations and discussed their

main features. We introduced the kernel trick and showed how it can help SVM’s handle non-

vectorial data. Finally, we presented the main contributions of this thesis in brief. The road

map will help the reader to locate chapters of specific interest.

Chapter 2

SimpleSVM: A SVM Training Algorithm

This chapter is devoted to a detailed description of our Support Vector Machine (SVM) train-

ing algorithm called the SimpleSVM. SimpleSVM works mainly for the quadratic soft-margin

formulation (see Section 1.4 for details). It incrementally changes a candidate Support Vector

set using a locally greedy approach, until the supporting hyperplane is found within a finite

number of iterations.

We introduce our notation in Section 2.1 and also present some background. In Section 2.2

we discuss a few SVM training algorithms and show how the SimpleSVM is related to them.

We present a high level overview of our algorithm in Section 2.3. Detailed discussion of the

convergence properties follows in Section 2.4. We show finite time convergence and explain

why even exponential convergence is likely. Subsequently, in Section 2.5 we discuss the updates

required to change the Support Vector set in greater detail and present various initialization

strategies. Extensions of our method to the `1 formulation are sketched in Section 2.6. We

also briefly mention how SimpleSVM may benefit if the kernel matrix is rank-degenerate. This

extension is discussed in more detail in Chapter 3. Experimental evidence of the performance

of SimpleSVM is given in Section 2.7, and we compare it to other state-of-the-art SVM training

algorithms. We conclude with a discussion in Section 2.8.

A few technical details concerning the factorization of matrices and their rank-one modifi-

cations are relegated to Appendix A. This is done so that only readers interested in imple-

menting the algorithm on their own will need to follow these derivations closely. Working code



and datasets for our algorithm can be found at http://www.axiom.anu.edu.au/~vishy.

This chapter requires basic knowledge of SVM’s and the quadratic soft-margin formulation.

Readers may want to review these concepts from Chapter 1. For the convenience of readers

already familiar with these concepts, and in order to make this chapter self-contained, the primal

and dual problems of the quadratic soft-margin formulation are repeated here. To understand

the dual formulation and its derivation some knowledge of optimization is helpful (but not

indispensable). A cursory knowledge of probability will help in understanding the constant time

initialization procedure.

2.1 Notation and Background

Training a SVM involves solving a quadratic optimization problem, which tends to be compu-

tationally intensive, is subject to stability problems and is non-trivial to implement. Attractive

iterative algorithms such as Sequential Minimal Optimization (SMO) by Platt (1999), the Near-

est Point Algorithm (NPA) by Keerthi et al. (2000), Lagrangian Support Vector Machines by

Mangasarian and Musicant (2001), Newton method using Kaufman-Bunch algorithm by Kauf-

man (1999) etc. have been proposed to overcome this problem. This chapter makes another

contribution in this direction.

In the following, we denote by {(x1, y1), . . . , (xn, yn)} ⊂ X ×{±1} the set of labeled training

samples, where xi are drawn from some domain X and yi ∈ {±1} denote the class labels +1 and

−1 respectively. Furthermore, let n be the total number of points and let n+ and n− denote the

number of points in class +1 and −1 respectively. With some abuse of notation we will associate

with each set A ⊆ {1, . . . , n} the corresponding set of observations S(A) := {(xi, yi)|i ∈ A} and

denote by |A| the cardinality of A, with m := |A| being the (current) number of support vectors.

Denote by k : X ×X → R, a Mercer kernel and by Φ : X → F , the corresponding feature

map, that is 〈Φ(x),Φ(x′)〉 = k(x,x′) (see Section 1.3 for more details). In this chapter we study

the quadratic soft-margin loss function discussed in Section 1.4. Consequently, in the following,

we assume that we are dealing with a hard-margin SVM where a separating hyperplane can be

found, possibly with k(x,x′)← k(x,x′) + χδx,x′ .


2.2 Related Work

DirectSVM: It has been shown that the closest pair of points from opposite classes are SV’s.

Hence, DirectSVM (Roobaert, 2000) starts off with this pair of points in the candidate

SV set. It works on the conjecture that the point which incurs the maximum error (i.e.,

minimal yif(xi)) during each iteration is a SV. This violating point is found and added to

the SV set.

In case the dimension of the space is exceeded or all the data points are used up, without

convergence, the algorithm reinitializes with the next closest pair of points from opposite

classes (Roobaert, 2000). The problem with DirectSVM is that its approach to adding a

new point to the SV set is very costly.

GeometricSVM: Vishwanathan and Murty (2002a) proposed an optimization based approach

to add new points to the candidate SV set, thus improving the scaling behavior of DirectSVM.

Unfortunately, neither DirectSVM nor GeometricSVM has a provision to backtrack, i.e.

once they decide to include a point in the candidate SV set they cannot discard it. During

each iteration, both the algorithms spend their maximum effort in finding the maximum

violator. Caching schemes can be used to alleviate this problem, but, they require a large

cache size for large datasets; besides, the scaling behavior of such caching schemes is not

well understood (Vishwanathan and Murty, 2002a).

Newton Approach: Kaufman (1999) proposed a Newton approach based on the Bunch-Kaufman

algorithm. It maintains an s × s matrix of active constraints which is updated in O(s²)

time when a constraint is added or deleted. It finds the first constraint to be violated by

finding the gradient of the function and hence computing the change in all the constraints.

The main drawback of this method is that computing the gradient is a costly operation

and the algorithm has to compute the gradient for every iteration. Both, SimpleSVM and

the Kaufman (1999) algorithm maintain an active set and update it during each iteration.

This idea is studied under the name of inertia controlling methods for general Quadratic

Programs (QP). We refer the reader to Gill et al. (1991) for a survey of such techniques.

Incremental and Decremental SVM: Cauwenberghs and Poggio (2001) proposed an incre-

mental SVM algorithm, where, at each step only one point is added to the training set. If


the added point violates the KKT conditions, one recomputes the exact SV solution of the

whole dataset seen so far.

After the addition of each point, the algorithm maintains the exact solution for the whole

dataset seen so far. Hence, after n points have been added, the algorithm finds the exact

solution for the entire training set, and thus converges in n steps (Cauwenberghs and

Poggio, 2001). Unfortunately, the condition to remain optimal at every step means that,

whenever a violating point is found, the algorithm has to test all the observations seen so

far. Such a requirement dramatically slows it down.

In particular, it means that the algorithm has to perform n′|A| kernel computations at

each step, where n′ denotes the number of observations seen so far and A is the current SV

set. This is clearly expensive. Cauwenberghs and Poggio (2001) suggest a practical on-line

variant where they introduce a δ margin and concentrate only on those points which are

within the δ margin of the boundary. But, it is clear that the results may vary by varying

the value of δ.

The way to overcome the limitations of the Cauwenberghs and Poggio (2001) and Kaufman

(1999) algorithms is to require that the new solution strictly decrease the margin of separation

and be optimal with respect to a subset of A ∪ {v}, where (xv, yv) satisfies yvf(xv) < 1. This

is a greedy approach which does not guarantee that we are making optimal progress towards the final solution; instead, we perform a small amount of work and strictly decrease the margin of separation, obtaining the final solution after a finite number of steps. In this sense one could

interpret SimpleSVM as being related to the Incremental and Decremental SVM and the Newton

method.

2.3 The Basic Idea

In spirit, our algorithm is very much related to the chunking methods developed at AT&T

Bell Laboratories (Burges and Vapnik, 1995). There, SV training is carried out by splitting an overly large training set into small chunks: train on the first chunk, keep the SV's, add the next chunk, retrain, keep the SV's, and so on, until all the points satisfy the Karush-Kuhn-Tucker (KKT) conditions (see also Cortes (1995)).


Algorithm 2.1: SimpleSVM
Input: dataset Z
Initialize: find any sufficiently close pair from opposing classes (xi+, xi−); A ← {i+, i−}; compute f and α for A
while there are xv with yvf(xv) < 1 do
    A ← A ∪ {v}
    Recompute f and α and remove non-SV's from A
end while
Output: A, {αi for i ∈ A}

Osuna et al. (1997), Joachims (1999) and Platt (1999) generalize this strategy by dropping

the requirement of optimality on all the points with nonzero αi. Instead, they fix some variables

while optimizing over the remainder, regardless of their value of αi. In particular, SMO optimizes

only over two observations at a time and computes the minimum in closed form. This strategy

has proven successful whenever the number of nonzero coefficients αi is large, that is, the dataset

is noisy, and the hypothesis-to-be-found is not too complex (Scholkopf and Smola, 2002).

The AT&T Bell Laboratories style optimization method, however, also admits another mod-

ification: add only one point to the set of SV’s at a time and compute the exact solution. If

we had to recompute the solution from scratch this would be an extremely wasteful procedure.

Instead, as we will see in Section 2.4, it is possible to perform such computations at O(m2)

cost, where m is the number of current SV’s and obtain the exact solution on the new subset of

points. Even better, if the kernel matrix is rank-degenerate of rank d, updates can be performed

at O(md) cost using a novel factorization method of Smola and Vishwanathan (2003), thereby

further reducing the computational burden (see Chapter 3 for more details on our factoriza-

tion). As one would expect, this modification will work well whenever the number of SV’s is

small relative to the size of the dataset, that is, for “clean” datasets.

While there is no guarantee that one sweep through the data set will lead to a full solution of

the optimization problem (and it almost never will, since some points may be left out which will

become SV’s at a later stage), we empirically observed that a small number of passes through

the entire dataset (typically less than 4) is sufficient for the algorithm to converge. Algorithm 2.1

gives a high-level description of the simple steps involved in SimpleSVM.
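To make this loop concrete, the following NumPy sketch mirrors Algorithm 2.1 for the hard-margin case, assuming X is an array of inputs and y ∈ {−1, +1}ⁿ. It is for illustration only and is not the implementation described in this thesis: it re-solves the KKT system of Section 2.5.2 from scratch at every step instead of performing the cheap updates discussed below, it prunes non-SV's naively instead of using Algorithm 2.2, and names such as simple_svm and solve_subproblem are placeholders.

import numpy as np

def solve_subproblem(X, y, A, kernel):
    """Solve the KKT system on the index set A exactly (cf. Equation (2.4))."""
    HA = np.array([[y[i] * y[j] * kernel(X[i], X[j]) for j in A] for i in A])
    yA = y[A].astype(float)
    k = len(A)
    M = np.zeros((k + 1, k + 1))
    M[0, 1:], M[1:, 0], M[1:, 1:] = yA, yA, HA
    sol = np.linalg.solve(M, np.concatenate(([0.0], np.ones(k))))
    return sol[0], sol[1:]                     # b, alpha restricted to A

def f(X, y, A, alpha, b, x, kernel):
    return sum(a * y[i] * kernel(X[i], x) for a, i in zip(alpha, A)) + b

def simple_svm(X, y, kernel, max_sweeps=4, tol=1e-3):
    # crude stand-in for the closest-pair initialization of Section 2.5.1;
    # assumes both classes stay represented in A throughout
    A = [int(np.flatnonzero(y == +1)[0]), int(np.flatnonzero(y == -1)[0])]
    b, alpha = solve_subproblem(X, y, A, kernel)
    for _ in range(max_sweeps):
        changed = False
        for v in range(len(y)):
            if v in A or y[v] * f(X, y, A, alpha, b, X[v], kernel) >= 1 - tol:
                continue
            A.append(v)                        # add the violator
            b, alpha = solve_subproblem(X, y, A, kernel)
            A = [i for i, a in zip(A, alpha) if a > 1e-10]   # drop non-SV's
            b, alpha = solve_subproblem(X, y, A, kernel)
            changed = True
        if not changed:
            break
    return A, alpha, b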


2.4 SimpleSVM

In this section we show that SimpleSVM finds the hard-margin solution and we analyze its

properties. We begin by studying the optimization problem dual to finding the maximum

margin. Next we show that adding a new point to A will always decrease the dual objective

function and we will use this fact to show finite time convergence. Finally, we indicate why one

may obtain linear convergence based on a coordinate descent argument.

2.4.1 The Dual Problem

In the case of hard-margin SVM the set of current Support Vectors A is also the set of active

constraints (hence, in the following, we will refer to A interchangeably). It is well known that

the dual problem to the maximum margin problem

minimize over w, b   (1/2)‖w‖²
subject to   yi (⟨w, Φ(xi)⟩ + b) ≥ 1 for all i ∈ A        (2.1)

is given by

maximize over α   −(1/2) α⊤H α + Σi αi
subject to   Σi αi yi = 0, αi ≥ 0 for all i ∈ A, and αi = 0 for all i ∉ A        (2.2)

Here H ∈ Rn×n with Hij := yiyjk(xi, xj). Moreover, it is a basic fact from optimization theory

(Mangasarian, 1969) that the minimum of Equation (2.1) equals the maximum of Equation (2.2).

Furthermore, Boser et al. (1992) showed that the value of Equation (2.1) is given by 1/(2ρ²), where ρ is the margin between the two classes with respect to the subsets chosen via A.

2.4.2 Finite Time Convergence

By construction, adding elements to A can only increase the value of Equation (2.1) (or leave

it constant), since we are shrinking the feasible set. In other words, adding a violating point can

only decrease the margin of separation. Moreover, dropping elements from A which correspond

to strictly satisfied constraints will not change the value of the optimization problem. Finally,

since by assumption the solution of Equation (2.1) exists, adding elements to A which correspond


to strictly violated constraints in Equation (2.1) is guaranteed to increase the value of the primal

objective function. We therefore have the following lemma:

Lemma 6 (Strictly Improving Updates) At every step, where SimpleSVM adds some {v}

with yvf(xv) < 1 to A, the optimal margin of separation with respect to A must decrease.

Furthermore, dropping the non-SV’s from A will not change the margin.

Now we can show finite time convergence. Key to the proof is the fact that there exists only a

finite number of sets A.

Theorem 7 (Convergence of SimpleSVM) SimpleSVM converges to the hard-margin solu-

tion in a finite number of steps.

A relaxed version of the algorithm, which only finds solutions on A ∪ {v} that are optimal with respect to A′ ⊆ A ∪ {v}, will also converge in a finite number of steps, as long as the objective function on A′ is strictly larger than the one on A.

Proof Let A be the candidate SV set at the end of an iteration (i.e., after a violating point has been added and all points with negative α's have been discarded). The KKT conditions of Equation (2.2) are both necessary and sufficient conditions for optimality. Since the solution found by SimpleSVM satisfies the KKT conditions, it is an optimal solution of Equation (2.1)

with respect to the current A. Furthermore, SimpleSVM terminates only if A contains all active

constraints in Equation (2.1) from {1, . . . , n}. This, however, is the hard-margin solution.

On the other hand, by virtue of Lemma 6, as long as SimpleSVM performs updates, the value

of Equation (2.1) with respect to the current A is strictly increasing. Thus, the algorithm cannot

cycle back to the same A. Since there exist only a finite number of sets A ⊆ {1, . . . , n}, the series of values of Equation (2.1) corresponding to the current A must converge to some value in a finite number of steps. By the above reasoning, this must be the optimal solution.

The same reasoning holds for the relaxed version which only finds solutions optimal in

A′ ⊆ A ∪ {v}, as long as the objective function is strictly increasing.

2.4.3 Rate of Convergence

By Theorem 7 we know that Algorithm 2.1 does not cycle; instead it visits the variables (of which we have only finitely many) one at a time. To show linear convergence, note that we are


performing updates which are strictly better than coordinate descent at every step (in coordinate

descent we only optimize over one variable at a time, whereas in our case we optimize over

A ∪ {v} which includes a new variable at every step). Coordinate descent, however, has linear

convergence for strictly convex functions (Fletcher, 1989).

2.5 Updates

This section contains the central details of the updates required for adding and removing points,

plus strategies for initializing the algorithm. Here we show how updates can be performed

cheaply without the need for many kernel computations.

2.5.1 Initialization

Since we want to find the optimal separating hyperplane of the overall dataset Z, a good starting

point is the pair of observations (x+, x−) from opposing sets X+, X− closest to each other

(Roobaert, 2000). Brute force search for this pair costs O(n2) kernel evaluations, which is

clearly not acceptable for the search of a good starting point. The algorithms by Bentley and

Shamos (1976) and Vaidya (1989) find the best pair in log-linear time for multi-dimensional data points.

Another approach is to use an approximate closest pair of points. Since our algorithm does not critically depend on the pair of points chosen for initialization, this approach is acceptable.

Denote by ξ := d(x+, x−) the random variable obtained by randomly choosing x+ ∈ X+ and

x− ∈ X−. Then the shortest distance between a pair x+, x− is given by the minimum of the

random variables ξ. Therefore, if we are only interested in finding a pair whose distance is, with high probability, better than the distance of most other pairs, we need only draw random pairs and pick the closest one.

In particular, one can check (Scholkopf and Smola, 2002) that roughly 59 pairs are sufficient

for a pair better than 95% of all pairs with 0.95 probability, and to be better than 99.9% of all pairs with probability 0.999 we need to draw only about 7000 pairs (in general, we need log δ / log(1 − δ) ≈ δ⁻¹ log δ observations to be better than a fraction 1 − δ of all pairs with probability 1 − δ). For

other fast algorithms for approximate closest pair queries see Gionis et al. (1999), Indyk and

Motwani (1998).
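A minimal NumPy sketch of this sampling scheme is given below; the function name approx_closest_pair and the default of 59 trials are illustrative choices rather than part of any library.

import numpy as np

def approx_closest_pair(X_pos, X_neg, n_trials=59, seed=None):
    """Draw random cross-class pairs and keep the closest one.

    With about 59 independent draws the returned pair is, with probability
    0.95, closer than 95% of all (x+, x-) pairs, as discussed above.
    """
    rng = np.random.default_rng(seed)
    best, best_dist = None, np.inf
    for _ in range(n_trials):
        i, j = rng.integers(len(X_pos)), rng.integers(len(X_neg))
        d = np.linalg.norm(X_pos[i] - X_neg[j])
        if d < best_dist:
            best, best_dist = (i, j), d
    return best, best_dist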


Once a good pair (x+, x−) has been found, we need to initialize the corresponding α+, α−

and b. This is done by solving the linear system of equations:

f(x+) = K++ α+ − K+− α− + b = 1
−f(x−) = −K−+ α+ + K−− α− − b = 1
α+ − α− = 0        (2.3)

This operation can be carried out in constant time.
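Indeed, using the constraint α+ = α−, adding the first two equations of (2.3) gives the closed form α+ = α− = 2/(K++ − 2K+− + K−−) and b = 1 − (K++ − K+−)α+. A small sketch (init_pair is a hypothetical helper; k denotes the kernel function):

def init_pair(k, x_pos, x_neg):
    """Closed-form solution of the initialization system (2.3).

    With alpha_+ = alpha_- =: a, adding the first two equations of (2.3)
    yields a = 2 / (K++ - 2 K+- + K--); the first equation then gives
    b = 1 - (K++ - K+-) * a.  The denominator is the squared feature-space
    distance between the two points, hence positive for distinct points.
    """
    kpp, kpm, kmm = k(x_pos, x_pos), k(x_pos, x_neg), k(x_neg, x_neg)
    a = 2.0 / (kpp - 2.0 * kpm + kmm)
    b = 1.0 - (kpp - kpm) * a
    return a, b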

2.5.2 Adding a Point

Now we proceed to the updates necessary for adding an observation to the set of SV’s. The

cheapest strategy is to progress linearly through the dataset. Other strategies to locate violating

points have the disadvantage of requiring a larger number of computations. For every new

observation (xv, yv) with v ∉ A, two cases may occur:

yvf(xv) ≥ 1: This point is currently correctly classified, so we need not perform any updates.

We retain A and proceed to the next point.

yvf(xv) < 1: This point will become a Support Vector, since at present it violates the margin constraint. By default we assume that A ← A ∪ {v} (we deal with pruning other points in Section 2.5.3) and that therefore all xi with i ∈ A must satisfy yif(xi) = 1 and Σ_{i ∈ A} αiyi = 0.

In matrix notation this reads as follows:

[ 0    yA⊤ ] [ b  ]   [ 0 ]
[ yA   HA  ] [ αA ] = [ e ] .        (2.4)

Here yA ∈ {−1, 1}|A| is the vector of yi corresponding to A, αA ∈ R|A|, HA ∈ R|A|×|A|

satisfies (HA)ij = yiyjk(xi, xj), and e ∈ R|A| is the vector of ones.

Consequently, adding one element to A means that we have to solve the linear system

Equation (2.4), which has been increased by one row and column, given the solution of

the smaller system.

Such rank-one modifications of linear systems are standard in numerical analysis and are

discussed in great detail in Golub and Loan (1996), Horn and Johnson (1985). In a nutshell,


the operation can be carried out in O(|A|2) time. The adaptation to the current problem

is described in Appendix A.
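While the details are in Appendix A, the flavour of such an O(|A|²) update can be conveyed by a generic bordered-system sketch based on the Schur complement. The routine below (grow_solution is a placeholder name) illustrates the cost argument only; it is not the factorization actually used by SimpleSVM.

import numpy as np

def grow_solution(Minv, c, d, rhs, r_new):
    """Append one row/column to a symmetric linear system in O(|A|^2).

    Given the inverse Minv of the old |A| x |A| matrix, the new border column c,
    the new diagonal entry d and the enlarged right-hand side (rhs, r_new),
    the enlarged solution follows from the Schur complement s = d - c^T Minv c.
    """
    u = Minv @ c                          # O(|A|^2)
    s = d - c @ u                         # scalar Schur complement
    y = Minv @ rhs
    beta = (r_new - c @ y) / s            # new last component of the solution
    x = y - beta * u                      # updated old components
    Minv_new = np.block([[Minv + np.outer(u, u) / s, -u[:, None] / s],
                         [-u[None, :] / s, np.array([[1.0 / s]])]])
    return np.concatenate([x, [beta]]), Minv_new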

2.5.3 Removing a Point

In case the linear system Equation (2.4) leads to a solution containing negative values of αi on the set A ∪ {v}, we need to remove elements from A ∪ {v}, since such points have ceased to be support vectors.

We will use an efficient variant of the “adiabatic increments” strategy from Cauwenberghs

and Poggio (2001) for our purposes. Two problems arise: which points to remove (and in which

order) to obtain a linear system for which all αi are nonnegative, and how to avoid having to check all removed points to see whether they might become SV's again (unlike in incremental SVM

learning, where such checks may contribute a significant amount of computation to the overall

cost of an update). In a nutshell, the strategy will be to remove one point at a time while

maintaining a strictly dual feasible set of variables.

We need a few auxiliary results. For the purpose of the proofs we assume that all xi with

i ∈ A ∪ {v} are linearly independent. The general proof strategy works as follows: first we show

that the infeasible set of variables arising from changing A into A ∪ {v} is the solution of a

modified optimization problem. Subsequently, we prove that adiabatic changes strictly increase

the value of ‖w‖2 while reducing the number of active constraints and rendering the resulting

solution less infeasible.

Lemma 8 (Changed Margin) Assume we have a set of coefficients αi ≥ 0 with i ∈ A ∪ {v}

and b ∈ R with Σ_{i ∈ A∪{v}} yiαi = 0 and w := Σ_{i ∈ A∪{v}} αiyixi, such that yi(⟨w, xi⟩ + b) = 1 for all i ∈ A and yv(⟨w, xv⟩ + b) = ρ. Then (w, b) is the solution of the following optimization problem:

minimize   (1/2)‖w‖²
subject to   yi(⟨w, xi⟩ + b) ≥ 1 for all i ∈ A and yv(⟨w, xv⟩ + b) ≥ ρ.        (2.5)

Proof The optimization problem Equation (2.5) is almost identical to the SVM hard-margin

classification problem, except for one modified constraint on (xv, yv). It is easy to check that the

dual optimization problem to Equation (2.5) has identical dual constraints to the SVM hard-

margin problem (only in the objective function the linear contribution of αv is changed from

1 · αv to ρ · αv).


By construction (w, b) is a feasible solution of Equation (2.5), the set of αi is dual feasible

and finally, by construction, the KKT conditions are all satisfied. From duality theory it fol-

lows (see e.g., Vanderbei (1997)) that such a set of variables constitutes an optimal solution of

Equation (2.5), which proves the claim.

Lemma 9 (Shifting) In addition to the assumptions of Lemma 8 denote by (α′, b′) with w′ := Σ_{i ∈ A∪{v}} α′i yi xi the solution of the linear system

1 = yi(⟨w′, xi⟩ + b′) for all i ∈ A ∪ {v} and 0 = Σ_{i ∈ A∪{v}} yi α′i.        (2.6)

Then (ᾱ, b̄) = (1 − λ)(α, b) + λ(α′, b′) with w̄ = (1 − λ)w + λw′ is a solution of the optimization problem Equation (2.5) with corresponding ρ̄ = (1 − λ)ρ + λ, as long as ᾱi ≥ 0 for all i ∈ A ∪ {v}.

Proof We first show that ᾱ and b̄ satisfy the conditions of Lemma 8. By construction, ᾱ is nonnegative, and it satisfies the summation constraint Σ_{i ∈ A∪{v}} yi ᾱi = 0, since it is a convex combination of α and α′, which both satisfy the constraint. Moreover, yi(⟨w̄, xi⟩ + b̄) = 1 for all i ∈ A (again, since this holds for both (α, b) and (α′, b′)). Finally, the value of ρ̄ can be found as a convex combination of ρ and ρ′ = 1.

Lemma 10 (Piecewise Optimization) We use the assumptions in Lemmas 8 and 9 for (α, b, w), (α′, b′, w′), and (ᾱ, b̄, w̄). In addition, we assume that αi > 0 for all i ∈ A and ρ < 1. Then for λ given by

λ := min( 1, min_{i : α′i < 0} αi / (αi − α′i) )        (2.7)

we have ‖w‖² < ‖w̄‖² and ρ < ρ̄ ≤ 1. Moreover, if λ < 1 we have ᾱi = 0 for some i ∈ A.

Proof By construction, λ > 0, since αj > 0 for all j ∈ A and α′j is finite (the xj are linearly independent). From Lemma 9 we know that ρ̄ = (1 − λ)ρ + λ > (1 − λ)ρ + λρ = ρ. Moreover, (ᾱ, b̄) is a solution of an optimization problem with a further restricted domain, obtained by replacing ρ with ρ̄. Since the constraints were active for ρ, this implies that ‖w‖² must increase as we


Algorithm 2.2: Removing Points from A ∪ {v}
Input: A ∪ {v}, α, b
repeat
    Compute α′, b′ by solving Equation (2.6)
    Compute λ according to Equation (2.7)
    Update α ← (1 − λ)α + λα′ and b ← (1 − λ)b + λb′
    if λ < 1 then
        Remove from A the point for which λ was attained
        Update matrices and intermediate values needed for computing the new α′, b′
    end if
until λ = 1
Output: A, α, b

restrict the domain. The conclusion that some ᾱj = 0 follows directly from the choice of λ: the coefficient vanishes for the argmin of Equation (2.7).

Putting everything together we have an algorithm to perform the removals from A ∪ {v} while

being guaranteed to obtain larger ‖w‖2 as we go (Algorithm 2.2). At every step the value of

‖w‖2 must increase, due to Lemma 10. Furthermore, at the end of the optimization process,

we will have a solution (α, b) which is a SVM solution with respect to the new set A ∪ {v}

(where A possibly was shrunk). This ensures that we make progress at every step of the main

SimpleSVM algorithm (even though the new solution may not be optimal with respect to the

original A ∪ {v}).

Technical details on how α′, b′ are best computed and how the corresponding matrices can be updated are relegated to Appendix A. It is worthwhile noting that each of the steps in

Algorithm 2.2 comes at the cost of O(|A|2) operations, which may seem rather high. However,

note that whenever we remove a point from A, all further increments will incur a smaller cost,

so removing as many elements from A as possible is highly desirable.
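The following NumPy sketch of Algorithm 2.2 illustrates the shrinking step. For brevity it re-solves system (2.6) from scratch in each pass (O(|A|³)) rather than using the O(|A|²) updates of Appendix A; kkt_solve and remove_points are illustrative names, and K denotes the precomputed kernel matrix.

import numpy as np

def kkt_solve(K, y, idx):
    """Exact solution of system (2.6) on the index set idx (alphas may be negative)."""
    yA = y[idx].astype(float)
    HA = (yA[:, None] * yA[None, :]) * K[np.ix_(idx, idx)]
    k = len(idx)
    M = np.zeros((k + 1, k + 1))
    M[0, 1:], M[1:, 0], M[1:, 1:] = yA, yA, HA
    sol = np.linalg.solve(M, np.concatenate(([0.0], np.ones(k))))
    return sol[1:], sol[0]                         # alpha', b'

def remove_points(K, y, idx, alpha, b):
    """Adiabatic removal: keep alpha dual feasible while shrinking idx."""
    idx, alpha = list(idx), np.asarray(alpha, dtype=float)
    while True:
        alpha_new, b_new = kkt_solve(K, y, idx)
        neg = np.flatnonzero(alpha_new < 0)
        if neg.size == 0:                          # lambda = 1: done
            return idx, alpha_new, b_new
        ratios = alpha[neg] / (alpha[neg] - alpha_new[neg])   # cf. Equation (2.7)
        lam = ratios.min()                         # < 1 whenever neg is non-empty
        alpha = (1 - lam) * alpha + lam * alpha_new
        b = (1 - lam) * b + lam * b_new
        drop = neg[int(np.argmin(ratios))]         # coefficient that just hit zero
        idx.pop(drop)
        alpha = np.delete(alpha, drop)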

2.6 Extensions

2.6.1 Rank-Degenerate Kernels

Regardless of the type of matrix factorization we use to compute the SV solutions on A, we will

still encounter the problem that the memory requirements scale with O(|A|2) and the overall

computation is of the order of O(|A|3 + |A|n) for the whole algorithm. This may be much better


than other methods (see Section 2.7 for details), yet we would like to take further advantage of

kernels which are rank-degenerate, that is, if k(x, x′) can be approximated on the training set X

by z(x)z(x′)⊤, where z(x) ∈ Rm with m ≪ n (in the following we assume that this approximation

is exact). See Smola and Scholkopf (2000), Fine and Scheinberg (2001), Williams and Seeger

(2000), Zhang (2001) for details on how such an approximation can be obtained efficiently. This means that the quadratic matrix to be used in the ℓ2 soft-margin algorithm can be written as

K = ZZ⊤ + Λ        (2.8)

where Zij := zj(xi) and Z ∈ Rn×m with m ≪ n, while Λ ∈ Rn×n is a diagonal matrix with

non-negative entries. Extending the work of Fine and Scheinberg (2001) recently an algorithm

was proposed by Smola and Vishwanathan (2003) which allows one to find an LDL⊤ factorization of H in O(nm²) time and which can be updated efficiently in O(m²) time. This means that the algorithm scales faster by a factor of |A|/m when using the low-rank matrix decomposition instead of the full matrix inversion. Details on the decomposition and rank-one updates are discussed in

Chapter 3.

2.6.2 Linear Soft-margin Loss

In the case of a linear soft margin the primal and dual problem are modified to allow for

classification errors in the training set. Using the same notation as above, the primal problem

is given by

minimize over w, b   (1/2)‖w‖² + C Σ_{i=1}^{m} ξi
subject to   yi (⟨w, Φ(xi)⟩ + b) ≥ 1 − ξi for all i ∈ A        (2.9)

and the dual problem is given by

maximize over α   −(1/2) α⊤H α + Σi αi
subject to   Σi αi yi = 0, C ≥ αi ≥ 0 for all i ∈ A, and αi = 0 for all i ∉ A        (2.10)

It is worthwhile to note that the dual formulation looks exactly the same as Equation (2.2)

except for the extra constraints on the αi's. As before, let v be a violating point, i.e., yvf(xv) < 1; our modified algorithm adds v to the SV set A. But in this case a point i ∈ A can fail to


become a Support Vector either because αi < 0 or αi > C. We check for both these conditions

and remove such non-Support Vectors from A. While removing these non-Support Vectors, if

αv > C, we drop v and proceed to the next violating point. Proof of convergence of this modified

algorithm and its performance on real life datasets is a topic of current research.

2.7 Experiments

Since the main goal of this chapter is to give an algorithmic improvement over existing SVM

training algorithms, we will not report generalization performance figures here (they are irrele-

vant for properly converging algorithms, since all of them minimize the same objective function).

Instead, we will compare our method with the performance of the NPA algorithm by Keerthi

et al. (2000). NPA was chosen, since its authors showed in their experiments that it is compet-

itive with or better than other methods (such as SVMLight or SMO).

In particular, we will be comparing the number of kernel evaluations performed by a Support

Vector algorithm as an effective measure of its speed. Other measures are fraught with difficulty,

since comparing different implementations, compilers, platforms, operating systems, etc., causes

a large amount of variation even between identical algorithms.

2.7.1 Experimental Setup and Datasets

All experiments were run on an 800 MHz Intel Pentium III machine with 128 MB RAM running

Linux Mandrake 8.0 (unless mentioned otherwise). The code was written in C++ as well as in

MATLAB1. We use the full kernel matrix for all our experiments.

We uniformly used a value of 0.001 for the error bound, i.e., we stop the algorithm when yif(xi) > 0.999 for all i. The NPA results are those reported in Keerthi et al. (1999). Consequently

we used the same kernel, namely a Gaussian RBF kernel with

k(x, x′) = exp(−‖x − x′‖² / (2σ²)).        (2.11)

The datasets chosen for our experiments are described in Table 2.1. The Spiral dataset was

proposed by Alexis Wieland of MITRE Corporation and it is available from the CMU Artificial

Intelligence repository. Both WPBC and the Adult datasets are available from the UCI Machine

1Code and datasets are available under the GPL at http://axiom.anu.edu.au/~vishy


Table 2.1: Datasets used for the comparison of SimpleSVM and NPA

Dataset    Size     Dimensions   σ²
Spiral     195      2            0.5
WPBC       683      9            4
Adult-1    1,605    123          10
Adult-4    4,781    123          10
Adult-7    16,100   123          10

Learning repository (Blake and Merz, 1998). We used the same values of σ2 as in Keerthi et al.

(1999) and Platt (1999) to allow for a fair comparison. Experimental results can be found in

Figures 2.1 to 2.10.

Figure 2.1: Performance comparison between SimpleSVM and NPA on the Spiral dataset.

2.7.2 Discussion of the Results

As can be seen SimpleSVM outperforms the NPA considerably on all five datasets. For instance,

on the Spiral dataset the SimpleSVM is an order of magnitude faster than the NPA. On the

Adult-4 dataset with C = 1000 the SimpleSVM algorithm is nearly 50 times faster than the

NPA.

Furthermore, unlike NPA, SimpleSVM’s runtime behavior, given by the number of kernel


Figure 2.2: Total number of times a point was ever added to the initial Support Vector set compared with the final number of Support Vectors on the Spiral dataset.

Figure 2.3: Performance comparison between SimpleSVM and NPA on the WPBC dataset.


Figure 2.4: Total number of times a point was ever added to the initial Support Vector set compared with the final number of Support Vectors on the WPBC dataset.

Figure 2.5: Performance comparison between SimpleSVM and NPA on the Adult-1 dataset.


Figure 2.6: Total number of times a point was ever added to the initial Support Vector set compared with the final number of Support Vectors on the Adult-1 dataset.

Figure 2.7: Performance comparison between SimpleSVM and NPA on the Adult-4 dataset.


Figure 2.8: Total number of times a point was ever added to the initial Support Vector set compared with the final number of Support Vectors on the Adult-4 dataset.

Figure 2.9: Performance comparison between SimpleSVM and NPA on the Adult-7 dataset.


Figure 2.10: Total number of times a point was ever added to the initial Support Vector set compared with the final number of Support Vectors on the Adult-7 dataset.

evaluations, does not critically depend on the value of C (see for instance Figures 2.1 and 2.5).

The ratio between the total number of points ever added to the initial Support Vector set

(initially the set contains two points) and the final number of Support Vectors indicates the

number of times a “wrong” Support Vector is picked or a point is recycled. As can be seen the

penalty incurred due to our greedy approach is not very significant (see for instance Figures 2.2

and 2.4).

Also note that, by construction SimpleSVM computes an exact solution on its SV set, whereas

algorithms such as NPA will only yield approximate expansions. In all but one case (in the Adult-

7 dataset) the number of Support Vectors found by our algorithm and the NPA differs by at

most 2. One possible explanation for the difference on the Adult-7 data is that the stopping

criteria employed by the two algorithms are different. Round-off errors due to differences in

the precisions used for the computations may further aggravate the problem.


2.8 Summary and Outlook

We presented a new SV training algorithm that is efficient, intuitive, and fast. It significantly

outperforms other iterative algorithms like the NPA in terms of the number of kernel computa-

tions. Moreover, it does away with the problem of overall optimality on all previously seen data

that was one of the major drawbacks of Incremental SVM, as proposed by Cauwenberghs and

Poggio (2001). But, the cost incurred due to this relaxation is that our algorithm is no longer

an incremental algorithm.

It should be noted that SimpleSVM performs particularly well whenever the datasets are

relatively “clean”, that is, whenever the number of SV’s is rather small. On noisy data, on the

other hand, methods such as SMO may be preferable to our algorithm. This is mainly due to

the fact that a matrix of the size of the kernel matrix needs to be stored in memory (256 MB of main memory suffice to store a matrix corresponding to as many as 10,000 SV's). Storage therefore

becomes a serious limitation of SimpleSVM when applied to generic dense matrices on large noisy

datasets. One possibility to address this problem is to use low-rank approximation methods

which make the problem amenable to the low-rank factorizations described in Section 2.6.1.

Due to the LDL> factorization used in finding the SV solution our algorithm is numerically

more stable than using a direct matrix inverse. This helps us deal with round off errors that can

plague other algorithms. We suspect that similar modifications could be successfully applied to

other algorithms as well.

Our algorithm can be sped up further by the use of a kernel cache. The effect of using a cache is an area for further study; based on such a study, an efficient kernel cache can be designed for our algorithm.

It can be observed that the addition of a vector to the Support Vector set is entirely reversible.

Using this property and following the derivation in Opper and Winther (2000), Cauwenberghs

and Poggio (2001) calculated the leave one out error. Similar techniques could be used in the

context of SimpleSVM, too.

Chapter 3

Modified Cholesky Factorization

This chapter presents an algorithm to compute the Cholesky decomposition of a special class

of matrices which occur frequently in machine learning and Support Vector Machine (SVM)

training. In many applications of SVM's, especially for the ℓ2 formulation, the kernel matrix K ∈ Rn×n can be written as K = ZZ⊤ + Λ, where Z ∈ Rn×m with m ≪ n and Λ is a diagonal

matrix with non-negative entries. Hence the matrix K − Λ is rank-degenerate. This chapter

presents an O(nm2) algorithm to compute the LDL> factorization of such a matrix. An O(mn)

algorithm to carry out rank-one updates of such a factorization is also discussed. Application of

the factorization to speed up the SimpleSVM algorithm and interior point methods is described.

Section 3.2 contains the main result, namely an LDL> factorization algorithm for matrices

of type ZZ> + Λ. We present implementation details along with methods for parallelizing such

a factorization. In Section 3.3 we show how a modified version of our factorization can be used

to obtain an (implicit) LDV decomposition of Z at O(m3) cost. Subsequently, in Section 3.4

we study how rank-1 modifications and row/column removal can be dealt with most efficiently

when an initial factorization has already been obtained. Two applications are presented in

Section 3.5: interior point optimization and a Support Vector classification algorithm. Lazy

methods to reduce the computational burden are also discussed. We conclude the chapter with

a discussion in Section 3.6.

This chapter requires the reader to thoroughly understand the concept of the Cholesky

decomposition of a positive semi-definite matrix. See Stewart (2000) for a survey of matrix

factorization. Understanding of concepts from linear algebra is required to appreciate the proofs

of theorems in this chapter. We highly recommend the book by Strang (1998). Knowledge of the



SimpleSVM algorithm is essential to appreciate the material discussed in Section 3.5. Readers

may want to read Chapter 2 which discusses the SimpleSVM algorithm in detail before reading

this chapter. Familiarity with interior point methods and parallel algorithms is also assumed at

many places in the chapter.

3.1 Introduction

Consider the system of equations

Ax = b (3.1)

where A ∈ Rn×n and b ∈ Rn. It is well known that x should be computed by some factorization

of A rather than by a direct computation of A−1 (Golub and Loan, 1996). If A is positive

semi-definite, it can be factored as A = LDL>, where L is a unit lower triangular matrix and

D is a diagonal matrix containing only non-negative entries. Standard methods for finding such

factorizations exist and require at least O(n3) operations (see Golub and Loan (1996), Horn and

Johnson (1985), Stoer and Bulirsch (1993) for references and details).

In many applications including Support Vector Machines (SVM) and interior point methods,

the matrix A can be written as

A = ZZ> + Λ, (3.2)

where Z ∈ Rn×m with m ≪ n and Λ is a diagonal matrix with non-negative entries. In this

chapter we present an algorithm to compute the LDL> factorization of such a matrix in O(nm2)

time. We also show how our factorization can be used to solve Equation (3.1) in O(mn) time.

3.1.1 Previous Work

Conventional wisdom to solve Equation (3.1) for the special case of Equation (3.2) is to perform

m rank-1 updates of an LDL> factorization and use this product factorization, that is

A = Lm Lm−1 · · · L1 D L1⊤ · · · Lm−1⊤ Lm⊤.        (3.3)

Here all Li are special lower triangular matrices. Such methods were suggested e.g., in Gill et al.

(1974), Goldfarb and Scheinberg (2001), Fine and Scheinberg (2001).

While a factorization of type Equation (3.3) is efficient in the sense that it exhibits the right


scaling behavior of O(m2n) operations to factorize and O(mn) operations to solve the linear

system, it has several downsides:

• It is difficult to perform rank-1 modifications on A once its factorization has been com-

puted. Such operations occur frequently when solving families of optimization problems

which differ only slightly in the number of constraints.

• Operations needed for iterative factorizations, and especially the solution of the overall

system, cannot be easily vectorized.

• A factorization in terms of products of terms L1, . . . , Lm cannot easily be parallelized.

Our proposed factorization method, which we will describe in Section 3.2, addresses all three

issues. We give a detailed account on some applications of the new factorization in Section 3.5.

An alternative approach to solving the problem is to use the Sherman-Morrison-Woodbury

(SMW) formula (Golub and Loan, 1996) via

y = [ Λ⁻¹ − Λ⁻¹Z(Z⊤Λ⁻¹Z + 1m)⁻¹Z⊤Λ⁻¹ ] x.        (3.4)

While it has been used successfully in some convex optimization problems (Ferris and Munson,

2000), such methods tend to be numerically less stable, in particular if Λ is ill conditioned, as

pointed out by Fine and Scheinberg (2001).

3.1.2 Notation

We denote matrices by capital letters, e.g., Z ∈ Rn×m, vectors by boldface characters, e.g.,

z ∈ Rm, and scalars by lowercase characters, e.g., z ∈ R. Let 1n be the unit matrix in Rn×n,

and 0 be the vector with appropriate number of zero entries. We denote by Zij the entries of

Z and, unless stated otherwise, zi will denote the ith column of Z> and zi the ith entry of z.

Also, unless stated otherwise, we assume vectors to be column-vectors, i.e., we assume that z> z

is a scalar. Moreover, i, j, k,m, n ∈ N are integers and throughout the chapter m ≤ n. The

matrices D and Λ are diagonal matrices and we will address their diagonal entries as Di and Λi

respectively. ‖A‖ denotes the 2-norm of A, considered as linear mapping.


3.2 Matrix Factorization

3.2.1 Triangular Factors

Recall that special lower triangular matrices L(z,b) ∈ Rn×n with z,b ∈ Rn can be used for

rank-1 updates of LDL> decompositions. These matrices take the following form

L(z,b) =

1

z2b1 1...

. . .

znb1 . . . znbn−1 1

(3.5)

Note that we only need n − 1 entries of z and b; however, to keep the notation simple we treat z

and b as n-vectors. It is well known (Gill et al., 1975, Goldfarb and Scheinberg, 2001) that for

matrices of the form z z> +Λ one can find a factorization

z z> +Λ = L(z,b)DL(z,b)> (3.6)

and solve the linear system (z z> +Λ)x = y in O(n) time. In the following we propose a method

to decompose ZZ> + Λ directly into

ZZ> + Λ = L(Z,B)DL(Z,B)> where Z,B ∈ Rn×m . (3.7)

Here L(Z,B) is a lower triangular matrix with the special form

L(Z, B) =
[ 1                                      ]
[ z2⊤b1      1                           ]
[  ⋮                    ⋱               ]
[ zn⊤b1     ⋯     zn⊤bn−1      1        ]

i.e., Lij = 0 if i < j, Lij = 1 if i = j, and Lij = zi⊤bj if i > j,        (3.8)

and D is a diagonal matrix with non-negative entries.


3.2.2 Factorization

Clearly (ZZ⊤ + Λ)ij = zi⊤zj + δij Λi. Straightforward algebra shows that in component notation Equation (3.7) can be written as

zi⊤zi + Λi = Di + Σ_{k=1}^{i−1} Dk (zi⊤bk)(bk⊤zi)        (3.9)

zi⊤zj = Dj zi⊤bj + Σ_{k=1}^{j−1} Dk (zi⊤bk)(bk⊤zj)   if j < i.        (3.10)

Note that we need not check the case j > i, since the expressions are symmetric. Next we define a set of auxiliary matrices Mj as

Mj := 1m − Σ_{k=1}^{j−1} Dk bk bk⊤.        (3.11)

By construction Mj+1 = Mj − Dj bj bj⊤ for j ≥ 1 and M1 = 1m. Rewriting Equations (3.9) and (3.10) such as to make the dependency on B, D explicit yields

Di = Λi + zi⊤ [ 1m − Σ_{k=1}^{i−1} Dk bk bk⊤ ] zi = Λi + zi⊤ Mi zi        (3.12)

zi⊤ (Dj bj) = zi⊤ [ 1m − Σ_{k=1}^{j−1} Dk bk bk⊤ ] zj = zi⊤ Mj zj   for all j < i.        (3.13)

This allows us to formulate a recurrence scheme, given in Algorithm 3.1, in order to obtain B and D.¹ Note that the time complexity of the algorithm is O(m²n), since it performs n iterations,

each of which involves a rank-1 update of an m×m matrix, plus a matrix-vector multiplication

of the same complexity. Furthermore, the storage requirement is O(mn) to store B, and O(m2)

for the auxiliary matrix M .

3.2.3 Uniqueness and Existence

Next we need to prove that the factorization always exists and furthermore, that it is unique.

We begin with an auxiliary result.

¹In practice it is important to perform the rank-1 update with (1/Di) t t⊤ rather than with bi t⊤, e.g., with DSYRK, in order to maintain symmetry in M.


Algorithm 3.1: Triangular Factorization
1: init M = 1m ∈ Rm×m and B = 0 ∈ Rn×m
2: for i = 1 to n do
3:   t = M zi
4:   Di = zi⊤ t + Λi
5:   if Di > 0 then bi = (1/Di) t and M = M − (1/Di) t t⊤
6: end for
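A direct NumPy transcription of Algorithm 3.1 may be helpful; ldl_low_rank and L_of are illustrative names, and the quadratic-time L_of is included only to verify the factorization on a toy example.

import numpy as np

def ldl_low_rank(Z, Lam):
    """Algorithm 3.1: factorize Z Z^T + diag(Lam) as L(Z,B) D L(Z,B)^T in O(n m^2)."""
    n, m = Z.shape
    M = np.eye(m)
    B, D = np.zeros((n, m)), np.zeros(n)
    for i in range(n):
        t = M @ Z[i]                       # O(m^2)
        D[i] = Z[i] @ t + Lam[i]
        if D[i] > 0:
            B[i] = t / D[i]
            M -= np.outer(t, t) / D[i]     # rank-1 downdate of the auxiliary matrix
    return D, B

def L_of(Z, B):
    """Materialize the unit lower triangular factor L(Z,B); for checking only."""
    n = Z.shape[0]
    L = np.eye(n)
    for i in range(n):
        for j in range(i):
            L[i, j] = Z[i] @ B[j]
    return L

# quick self-check on a random instance
rng = np.random.default_rng(0)
Z, Lam = rng.standard_normal((6, 2)), rng.random(6) + 0.1
D, B = ldl_low_rank(Z, Lam)
L = L_of(Z, B)
assert np.allclose(L @ np.diag(D) @ L.T, Z @ Z.T + np.diag(Lam))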

Lemma 11 Denote by M ∈ Rm×m a positive semi-definite matrix and let λ > 0, z ∈ Rm. Then the matrix M′ := M − (M z z⊤M⊤)/(z⊤M z + λ) is positive semi-definite. Furthermore the norm of M′ is bounded by

‖M‖ ≥ ‖M′‖ ≥ (λ / (z⊤M z + λ)) ‖M‖.        (3.14)

Proof Clearly ξ := z⊤M z + λ > 0. Let x ∈ Rm. Using ξ > 0 yields

x⊤M′ x = x⊤M x − (x⊤M z)² / (z⊤M z + λ)        (3.15)
       = (λ/ξ) x⊤M x + (1/ξ)[ (x⊤M x)(z⊤M z) − (x⊤M z)² ] ≥ (λ/ξ) x⊤M x.        (3.16)

Here the last inequality follows from the Cauchy-Schwartz inequality (it becomes a strict equal-

ity for z = x). The upper bound on ‖M ′‖ is trivial, the lower bound follows directly from

Equation (3.16) and the definition of the norm.

Corollary 12 The matrices Mi are positive semi-definite for all i ≥ 1.

Proof This holds because M1 is positive definite and because of the recursive definition of Mi. In particular,

Di = Λi + z>i Mi zi is positive, which allows us to apply Lemma 11.

Theorem 13 The factorization ZZ⊤ + Λ = L(Z,B)DL(Z,B)⊤ exists and it is unique with three exceptions:

1. For every i with Di = 0, bi may be chosen arbitrarily.

2. If D1 z1, . . . , Dn zn do not span Rm, the choice of bi is undetermined in the orthogonal

complement of the span of zi+1, . . . , zn.


3. The last row bn of B is undetermined.

Proof It is known that LDL> factorizations of positive semi-definite matrices exist and that

the choice of D and L is unique (up to columns corresponding to Di = 0). Hence we only need

to check whether some B and D satisfy the conditions imposed by Equations (3.12) and (3.13).

Clearly Di ≥ 0 for all i, since Mi are positive semi-definite matrices and Λi ≥ 0. Furthermore

the condition on Di, Equation (3.12), only depends on the index i, hence it can always be satisfied.

The conditions on bj , as given by Equation (3.13), have to hold for every i > j. This imposes

a necessary and sufficient condition on the part of Dj bj lying in the span of zj+1, . . . , zn (hence

the liberty in the orthogonal subspace if it exists), namely that Dj bj = Mj zj . For all Dj > 0

this determines bj .

On the other hand Dj = 0 implies Λj = 0 and z>j Mj zj = 0. Since Mj is a positive semi-

definite matrix, z>j Mj zj = 0 implies Mj zj = 0, which means that the condition 0 ·z>i bj = z>i 0

is trivially satisfied for any bj ∈ Rm.

Nonetheless, we are well advised to use the choice of bi as suggested in Algorithm 3.1,

since we sometimes may want to add rows to Z at a later stage, in which case the choice

bi = 1z>i Mi zi +Λi

Mi zi may become necessary and not only sufficient. We now proceed to solving

the linear system.

3.2.4 Solution of Linear System

We begin by solving L(Z,B)x = y (the treatment of L>(Z,B)x = y being completely analo-

gous). We have

yi = xi + Σ_{j=1}^{i−1} zi⊤bj xj = xi + zi⊤ Σ_{j=1}^{i−1} bj xj = xi + zi⊤ ti        (3.17)

where ti := Σ_{j=1}^{i−1} bj xj, hence t1 = 0 and ti+1 = ti + bi xi. Since ti ∈ Rm it costs only O(m)

operations for each yi, i.e., a total of O(mn) operations to compute y from L(Z,B) and x.

Details are given in Algorithm 3.2. In complete analogy we can solve L(Z,B)> x = y as

yi = xi + Σ_{j=i+1}^{n} bi⊤zj xj = xi + bi⊤ Σ_{j=i+1}^{n} zj xj = xi + bi⊤ ti        (3.18)


Algorithm 3.2: Forward Solution
1: init t = 0 ∈ Rm
2: for i = 1 to n do
3:   xi = yi − zi⊤ t
4:   t = t + xi bi
5: end for

Algorithm 3.3: Backward Solution
1: init t = 0 ∈ Rm
2: for i = n down to 1 do
3:   xi = yi − bi⊤ t
4:   t = t + xi zi
5: end for

where ti := Σ_{j=i+1}^{n} zj xj, hence tn = 0 and ti−1 = ti + zi xi. Algorithm 3.3 contains the pseudo

code. The computational complexity is identical to the forward loop.
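In code, the two solvers and their combination into a solver for (ZZ⊤ + Λ)x = y look as follows; this is again a NumPy sketch reusing the factorization returned by the hypothetical ldl_low_rank above.

import numpy as np

def forward_solve(Z, B, y):
    """Algorithm 3.2: solve L(Z,B) x = y in O(m n)."""
    n, m = Z.shape
    x, t = np.empty(n), np.zeros(m)
    for i in range(n):
        x[i] = y[i] - Z[i] @ t
        t += x[i] * B[i]
    return x

def backward_solve(Z, B, y):
    """Algorithm 3.3: solve L(Z,B)^T x = y in O(m n)."""
    n, m = Z.shape
    x, t = np.empty(n), np.zeros(m)
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - B[i] @ t
        t += x[i] * Z[i]
    return x

def solve_system(Z, B, D, y):
    """Solve (Z Z^T + Lam) x = y given D, B from Algorithm 3.1."""
    u = forward_solve(Z, B, y)             # L u = y
    return backward_solve(Z, B, u / D)     # then D w = u and L^T x = w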

3.2.5 Parallelization and Implementation Issues

Memory Access The advantage of Algorithms 3.1 - 3.3 is that at no time do they require

storage of the full Z,B matrices in memory. Even better, only one pass through the data

is needed, whereas iterative factorization algorithms require m memory accesses. This means

that data can be cached on disk (see Ferris and Munson (2000) for a similar application in

the context of a Sherman-Morrison-Woodbury formula) and only loaded into memory when

necessary. Furthermore, in Algorithm 3.1 the number of operations per zi,bi pair is O(m2) (for

the rank-1 update of Mi), whereas the amount of data to be preloaded from slower storage is

only O(m). Therefore, this method excels particularly for large m, where the slower disk access

becomes almost negligible.

Unfortunately the triangular solvers Algorithm 3.2 and 3.3 do not share this computation

vs. memory access behavior. Here the speed of the solution of the system L(Z,B)x = y will be

essentially determined by the time a sweep through Z,B takes on slower storage. However, in

many practical applications, such a sweep will be necessary only once (possibly with a matrix-

valued argument), i.e., it will not take more time to perform the solution of the system than it

takes to factorize ZZ> + Λ in the first place. In summary, the time requirements of the overall

algorithm are O(m2n) CPU time and O(mn) access time for a possibly slow storage.


Parallelization In the following we assume that we have N nodes which are all directly con-

nected to each other and that there exist no preferred connections in the computer. Clearly

the main cost in Algorithm 3.1 for large systems is the rank-1 update M ← M − bi t>, as it

requires O(m2) operations and O(m2) storage. We can solve this problem by splitting M into

N stripes of m/N rows and distributing them onto the N nodes and performing scatter/gather

operations to distribute zi and collect t. Algorithm 3.4 contains the pseudo code. For conve-

nience we denote by Ml the lth stripe of M and by tl the corresponding stripe of t, as obtained

by tl = Ml zi.

Algorithm 3.4: Parallel Triangular Factorization
1: init each node with a stripe Ml of M = 1m ∈ Rm×m
2: for i = 1 to n do
3:   master: broadcast zi, Λi to all slaves
4:   slave l: compute tl = Ml zi and λl := (zi)l⊤ tl
5:   slave l: broadcast tl and λl
6:   slaves and master: receive tk and λk, assemble t, compute Di = Λi + Σ_{k=1}^{N} λk
7:   master: if Di > 0 then bi = (1/Di) t, else t = 0
8:   master: record Di, bi to disk
9:   slave l: if Di > 0 then update stripe Ml = Ml − (bi)l t⊤
10: end for

The bottleneck of Algorithm 3.4 is the broadcast and collection of tk. Here latencies of the

communications process will become the dominant factor (we assume that on most computers

the time for broadcasting a vector is small in comparison to the time to initiate the broadcast).

The remaining steps are non blocking, since storage of bi and broadcast of zi can be performed

asynchronously. Finally, similar considerations apply to the forward and backward solution of

the linear system.

3.2.6 Extensions

Several extensions of the above factorization can be obtained without much modification of the

original algorithm. Below we present a selection of them:

Modified Metric Straightforward calculus shows that we can use Algorithm 3.1 also to obtain

an efficient LDL> factorization of matrices such as A = Λ + ZMZ>, where M is an arbitrary

positive semi-definite matrix. All that is required is to set M1 = M instead of M1 = 1.


Indefinite Diagonal Terms There is no inherent reason why Λ should be a positive

matrix. Indeed, all that is required is that Λ + ZZ> be positive semi-definite such that an

LDL> decomposition exists. However, as pointed out in Fletcher and Powell (1974) for the case

of vectorial Z, numerical stability can suffer if this decomposition leads to Mi with very small

or spurious negative eigenvalues.

3.3 An LDV Factorization

The factorization algorithm for ZZ>+Λ can also be used to obtain an LDV factorization of Z,

simply by running the original factorization with Λ = 0 and converting the obtained factorization

into an LDV factorization. Here L is a lower triangular matrix, D is diagonal, and V satisfies

V >DV = 1m. We extend a result from Gill et al. (1975).

Lemma 14 The factorization defined by Equation (3.7) satisfies LDB = Z.

Proof With the definition of L(Z,B) and using the fact that M> = M and D> = D we can

rewrite LDB = Z row-wise as

zi⊤ = Di bi⊤ + Σ_{j=1}^{i−1} zi⊤bj Dj bj⊤ = Di bi⊤ + zi⊤ Σ_{j=1}^{i−1} bj Dj bj⊤   ⟺   bi Di = Mi zi.

The latter condition, however, is identical with the implications of Equation (3.13), hence the

claim is proven.

What this means is that if we use the LDL> factorization algorithm on the matrix ZZ> +

0 we will obtain the decomposition LDB = Z. All we need to argue is that the resulting

matrices L,D,B can be easily converted into an LDV factorization. The pseudo code is given

in Algorithm 3.5. For its analysis we need an auxiliary lemma.

Lemma 15 The image of 1m − Mi+1 is contained in span{z1, . . . , zi}. Furthermore zi⊤Mi zi = 0

implies zi ∈ span{z1, . . . , zi−1}.

Proof We use induction: for the first claim recall that 1m−M1 = 0, hence the induction

assumption holds. Next note that Mi+1 = Mi − cMi zi z>i Mi with c ≥ 0. Since we know the

image of 1m − Mi, we only need to study the image of Mi zi zi⊤Mi in order to determine the image of 1m − Mi+1.


Algorithm 3.5: LDV Factorization
1: init M = 1m ∈ Rm×m, i = 1, j = 1
2: repeat
3:   t = M zj and Di = zj⊤ t
4:   if Di > 0 then
5:     vi = (1/Di) t and M = M − (1/Di) t t⊤
6:     i = i + 1
7:   end if
8:   j = j + 1
9: until i > m

Since the image of Mi zi z>i Mi is in the span of Mi zi = −(1m−Mi) zi + zi, we know that

the image of 1m−Mi+1 is contained in span{z1, . . . , zi}. This proves the first claim.

To prove the second claim recall that Mi is a positive matrix, hence z>i Mi zi = 0 implies

Mi zi = 0. The latter can be rewritten as zi = (1m−Mi) zi and therefore, by the first claim, zi

is contained in the image of 1m−Mi, which proves the second claim.

Theorem 16 Let Z ∈ Rn×m and furthermore assume that z1, . . . , zm span Rm. Then Algorithm 3.5 will terminate after m steps and we have Z = LDV , where L = L(Z, [V, 0]) (here

[V, 0] is the extension of V ∈ Rm×m to an n×m matrix by filling in zeros).

Proof Assume we found ZZ> = LDL> (this holds by construction of the factorization al-

gorithm). Since rank L = n, this requires that rank D = rank Z = m. This means that D contains

exactly m nonzero entries. Next we show that these must be the first m terms Di. By assump-

tion the first m zi are linearly independent, hence by virtue of Lemma 15 the corresponding

Di = zi⊤Mi zi are nonzero. This implies that for all i > m we must have Di = 0.

The latter, however, allows us to terminate Algorithm 3.1, since Di = 0 implies bi = 0. This

leads essentially to Algorithm 3.5 which counts the number of nonzero Di and terminates after

obtaining m of them.

Finally, we need to prove that V >DV = 1m. Here we use Lemma 14. From LDB = Z with

B = [V, 0] it follows that V = D−1L−1Z. Consequently we may rewrite V >DV as

(D−1L−1Z)>D(D−1L−1Z) = Z>(L>)−1D−1L−1Z = Z>(ZZ>)−1Z = 1m . (3.19)

The last equality holds, since, by assumption, Z has rank m.


If the first m zi are linearly dependent, the algorithm will iterate until it has found a set

of m linearly independent zi to construct the decomposition. Note that we are able to find an

implicit representation of the LDV decomposition in O(m3) time which can be much less than

the time required to visit all entries of Z.2

3.4 Rank Modifications

Now that we have presented an algorithm to factorize ZZ⊤ + Λ into LDL⊤, we study how such a factorization can be useful in dealing with modifications of the original equation into

ZZ⊤ + Λ + p p⊤ = L̄ D̄ L̄⊤        (3.20)

with corresponding B̄. To address this problem we will distinguish between three different cases: (1) p is an arbitrary vector, (2) p can be written as p = Z q, (3) we want to modify Equation (3.7) by removing the ith row and column. It will turn out that (3) can essentially be reduced to (2).

The general idea of what follows is that we will be able to compute D̄ efficiently without knowing B̄, and only subsequently compute B̄ via D̄ and an implicit formulation for L̄. Note that the methods presented in the following could easily be adapted to rank modifications of an LDV decomposition of an arbitrary matrix Z.

3.4.1 Generic Rank-1 Update

Consider Equation (3.20). Knowing that ZZ⊤ + Λ = LDL⊤, we can rewrite it as

D + a a⊤ = L⁻¹ L̄ D̄ L̄⊤ (L⁻¹)⊤        (3.21)

where a := L⁻¹ p can be computed in O(mn) time (see Algorithm 3.2). Moreover, we can find a factorization L̃ D̃ L̃⊤ = D + a a⊤ in O(n) time by using Algorithm 3.1. If L̃ = L(a, b̃) then it is clear that b̃ = D̃⁻¹ L̃⁻¹ a. We also observe that

L̄ D̄ L̄⊤ = L L̃ D̃ L̃⊤ L⊤, hence D̄ = D̃ and L̄ = L L̃.        (3.22)

This follows from the uniqueness of LDL⊤ decompositions (recall that the product of two lower triangular matrices is a lower triangular matrix). To compute B̄ we introduce Z̄ := [Z, p]. Application of Lemma 14 yields

L D B = Z and L̄ D̄ B̄ = Z̄ = [Z, p] = [LDB, p]        (3.23)

hence B̄ = D̄⁻¹ L̄⁻¹ [LDB, p] = D̃⁻¹ L̃⁻¹ [DB, a] = [D̃⁻¹ L̃⁻¹ D B, b̃]. Note that L̃⁻¹ D B can be computed in O(mn) time since L̃⁻¹ x for some x ∈ Rn can be computed in O(n) time. In summary, we obtain a generic rank-1 update by computing a = L⁻¹ p, then factorizing D + a a⊤, which yields L̃ D̃ L̃⊤, and finally computing B̄ = [D̃⁻¹ L̃⁻¹ D B, b̃].

²Of course, to obtain the explicit values of L we would have to expend O(m²n) time. However, this is often not needed, when L is only used to find pseudo-inverses.
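As a concrete illustration of these formulas, the sketch below implements the generic rank-1 update on top of the hypothetical ldl_low_rank, forward_solve and L_of helpers from the earlier sketches; a real implementation would fuse these steps rather than calling the triangular solver column by column.

import numpy as np

def rank_one_update(Z, B, D, p):
    """Given L(Z,B) D L(Z,B)^T = Z Z^T + Lam, return the factorization of
    Z Z^T + Lam + p p^T with respect to [Z, p], following Equations (3.21)-(3.23)."""
    n, m = Z.shape
    a = forward_solve(Z, B, p)                     # a = L^{-1} p
    Dt, bt = ldl_low_rank(a[:, None], D)           # D + a a^T = L(a, bt) Dt L(a, bt)^T
    bt = bt[:, 0]
    DB = D[:, None] * B
    LtinvDB = np.column_stack([forward_solve(a[:, None], bt[:, None], DB[:, j])
                               for j in range(m)])
    B_new = np.column_stack([LtinvDB / Dt[:, None], bt])
    return Dt, B_new

# self-check on a small random instance
rng = np.random.default_rng(1)
Z, Lam = rng.standard_normal((5, 2)), rng.random(5) + 0.1
p = rng.standard_normal(5)
D, B = ldl_low_rank(Z, Lam)
D2, B2 = rank_one_update(Z, B, D, p)
Z2 = np.column_stack([Z, p])
L2 = L_of(Z2, B2)
assert np.allclose(L2 @ np.diag(D2) @ L2.T, Z2 @ Z2.T + np.diag(Lam))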

3.4.2 Rank-1 Update Where p = Z q

Here we may rewrite Equation (3.20) as

ZZ⊤ + Λ + p p⊤ = Z(1m + q q⊤)Z⊤ + Λ = Z̃ Z̃⊤ + Λ        (3.24)

where Z̃ := Zχ and χ := (1m + q q⊤)^{1/2}. This means that we can apply Lemma 14 to the L̄ D̄ L̄⊤ decomposition of Equation (3.20) to obtain

L̄ D̄ B̃ = Z̃ = Zχ = LDBχ and hence B̃ = D̄⁻¹ L̄⁻¹ LDBχ.        (3.25)

Next note that L⁻¹ p = L⁻¹ Z q = DB q. This allows us to rewrite Equation (3.21) as

D + (L⁻¹ p)(L⁻¹ p)⊤ = D + DB q q⊤ B⊤ D = L⁻¹ L̄ D̄ L̄⊤ (L⁻¹)⊤.        (3.26)

Here D̃ and L̃ can be found in O(n) time. By the same reasoning as in Section 3.4.1 we obtain L̄ = L L̃ and D̄ = D̃. This leads to

B̃ = D̃⁻¹ L̃⁻¹ L⁻¹ LDBχ = D̃⁻¹ L̃⁻¹ DBχ.        (3.27)

The key point to note is that here B̃ gives us a factorization of L̄ via L(Z̃, B̃), which is not quite desirable, since it means that we would have to update not only B but also Z. However, B̃ and Z̃ only appear in the form of dot products via z̃i⊤ b̃j = zi⊤ χ b̃j. This means that we can keep Z if we multiply B̃ by χ. In summary we obtain

Bfinal = B̃ χ = D̃⁻¹ L̃⁻¹ DB (1m + q q⊤).        (3.28)

Each of the steps can be carried out in O(mn) time, which determines the complexity of the algorithm (note that we can gain a slight improvement by carrying out D̃⁻¹ L̃⁻¹ D in one step).

3.4.3 Removal of a Row and Column

Assume that we have a factorization ZZ⊤ + Λ = LDL⊤ and we would like to find a factorization of Z̄Z̄⊤ + Λ̄ efficiently, where Z̄, Λ̄ have been obtained by removing the i-th row (and column,

respectively). It is well known (see e.g., Golub and Loan (1996)) that by such a modification

only the “lower right” part of L will change. For the sake of completeness we briefly repeat the

reasoning below. We split Z,D,L,Λ into three parts as follows:

Z = [ Z1 ]    Λ = [ Λ1          ]    D = [ D1          ]    L = [ L11             ]
    [ Z2 ]        [     Λ2      ]        [     D2      ]        [ L21  L22        ]
    [ Z3 ]        [         Λ3  ]        [         D3  ]        [ L31  L32  L33   ]

Likewise we have for the reduced system

Z̄ = [ Z1 ]    Λ̄ = [ Λ1      ]    D̄ = [ D̄1      ]    L̄ = [ L̄11         ]
    [ Z3 ]        [     Λ3  ]        [     D̄3  ]        [ L̄31  L̄33    ]

Finally we decompose B, B̄ in the same fashion. Matching up terms in the reduced system leads to the equations

Z1 Z1⊤ + Λ1 = L11 D1 L11⊤ = L̄11 D̄1 L̄11⊤
Z3 Z1⊤ = L31 D1 L11⊤ = L̄31 D̄1 L̄11⊤
Z3 Z3⊤ + Λ3 = L31 D1 L31⊤ + L32 D2 L32⊤ + L33 D3 L33⊤ = L̄31 D̄1 L̄31⊤ + L̄33 D̄3 L̄33⊤.


The first two conditions of the system imply L̄11 = L11, D̄1 = D1, and L̄31 = L31. These conditions on L̄ can be satisfied by setting B̄1 = B1. Hence, we need to expend computational effort only on satisfying

L32 D2 L32⊤ + L33 D3 L33⊤ = L̄33 D̄3 L̄33⊤.        (3.29)

By the definition of L(Z, B) we know that L32 can be written as L32 = Z3 B2⊤, which leads to Z3 B2⊤ D2 B2 Z3⊤ + L33 D3 L33⊤ = L̄33 D̄3 L̄33⊤. This problem, however, is identical to the one discussed in Section 3.4.2: simply substitute q = √D2 B2⊤. This leads to the following update equations:

1. Factorize D3 + D3 B3 B2⊤ D2 B2 B3⊤ D3 = L̃ D̃ L̃⊤ and set D̄3 = D̃, as in Equation (3.26).

2. Using Equation (3.28) compute B̄3 via B̄3 = D̃⁻¹ L̃⁻¹ D3 B3 (1m + B2⊤ D2 B2).

Assuming that Z3 is an n′ × m matrix (n′ = n − i), the first step only involves O(mn′) operations to compute D3 B3 B2⊤ √D2 and O(n′) operations to compute L̃, D̃. Likewise, the second step costs only O(mn′) operations (multiplication by a diagonal matrix, product with L̃⁻¹, rank-1 update on B3).

3.5 Applications

3.5.1 Interior Point Methods

Our reasoning is similar to the one proposed in Goldfarb and Scheinberg (2001) and it presents

an alternative to Seol and Park (2002). In a nutshell the idea is the following: assume we want

to invert a matrix M which has the form M = ZZ> + C, where M,C ∈ Rn×n and Z ∈ Rn×m,

usually m ≪ n, and finally C is easily invertible. Then the (numerically less stable) Sherman-

Morrison-Woodbury approach consists of replacing

(ZZ⊤ + C)⁻¹ x by [ C⁻¹ − C⁻¹Z(Z⊤C⁻¹Z + 1m)⁻¹Z⊤C⁻¹ ] x.        (3.30)

On the other hand, if we wish to find an LDL> factorization, we first factorize C into C =

Lc Dc Lc⊤, which, by assumption, can be done cheaply. Subsequently, we factorize (Lc⁻¹Z)(Lc⁻¹Z)⊤ + Dc = Lz Dz Lz⊤ and make the replacement of

(ZZ⊤ + C)⁻¹ x by [ Lc Lz Dz Lz⊤ Lc⊤ ]⁻¹ x.        (3.31)


Such a situation may occur in several cases:

Rank Degenerate Quadratic Objective Function In a quadratic programming problem

with objective function f(x) = x>H x+ c> x, where H has only rank m, or where it can be

approximated by a low-rank matrix (Ferris and Munson, 2000, Scholkopf and Smola, 2002, Fine

and Scheinberg, 2001), interior point codes lead to the following linear system (Vanderbei, 1994):

[ −(ZZ⊤ + D)   A  ] [ x ]   [ cx ]
[     A⊤       Hy ] [ y ] = [ cy ]        (3.32)

Here H = ZZ> (or H ≈ ZZ> in case we use a low-rank approximation of H) and D is a

diagonal matrix with positive entries. Equation (3.32) is typically solved by explicit pivoting for

the upper left block, which involves computing

(ZZ> +D)−1A and (ZZ> +D)−1 cx . (3.33)

Even if Z is a triangular matrix, it is difficult to invert ZZ⊤ + D; this provides a picture-book case of where our factorization can be applied (in fact, this was the reason for our derivations).

Dense Columns Linear programming involves solving a linear system (Vanderbei, 1997) sim-

ilar to Equation (3.32):

[ −D   A ] [ x ]   [ cx ]
[ A⊤   0 ] [ y ] = [ cy ]        (3.34)

Here A represents the matrix of constraints, and D is a positive diagonal matrix. Typically

Equation (3.34) is solved by explicit pivoting for x, which leads to

x = D−1Ay−D−1cx and (A>D−1A)y = cy +A>D−1cx. (3.35)

If A is largely a sparse matrix with a few dense columns, Equation (3.35) nonetheless amounts

to solving a dense system. To avoid such problems we assume that A can be decomposed into

A = [Ad, As] (and likewise D into Dd and Ds) where Ad represents the dense columns and

As corresponds to the sparse ones (to keep the notation simple, for the purpose of the exam-

ple we assume that Ad⊤Dd⁻¹Ad has full rank). This means that we need to solve linear systems


involving As⊤Ds⁻¹As + Ad⊤Dd⁻¹Ad. Here, as in Equation (3.31), we assume that an L̄D̄L̄⊤ factorization of As⊤Ds⁻¹As can be obtained efficiently, and a subsequent application of our method to L̄⁻¹Ad⊤Dd⁻¹Ad(L̄⁻¹)⊤ + D̄ yields the factorization of A⊤D⁻¹A.

3.5.2 Lazy Decomposition

Recall the original problem Equation (3.2) of factorizing ZZ>+Λ. Frequently Λ will have large

entries (more specifically Λi ≫ ‖zi‖²). In such cases it intuitively makes sense to “ignore” the

contribution of zi and save computational time by considering only Λi. In the following we will

formalize this notion.

Lemma 17 Assume that Mi is given by Algorithm 3.1. Then

0 ≤ (a⊤Mi a − a⊤Mi+1 a) / (a⊤Mi+1 a) ≤ (zi⊤Mi zi) / Λi   for all a ∈ Rm, i ∈ N.        (3.36)

Proof The LHS of Equation (3.36) is trivial, since Mi − Mi+1 = α q q⊤ for suitable α > 0 and q ∈ Rm. To show the RHS we use

(a⊤Mi a − a⊤Mi+1 a) / (a⊤Mi+1 a) = (a⊤Mi zi)² / [ (Λi + zi⊤Mi zi) a⊤Mi a − (a⊤Mi zi)² ]        (3.37)
    ≤ (a⊤Mi zi)² / (Λi a⊤Mi a) ≤ (zi⊤Mi zi) / Λi.        (3.38)

Here the last two inequalities follow from the Cauchy-Schwartz inequality, i.e., that (a⊤M b)² ≤ (a⊤M a)(b⊤M b) for any positive matrix M.

If the changes in Mi are smaller than the error tolerance, say zi⊤Mi zi / Λi ≤ ε, we may decide not to carry out the update on Mi at all (e.g., they would be smaller than the numerical error introduced by the operation). Furthermore, note that M1 = 1m, hence M1 z = z. Now assume that we reordered the rows/columns of ZZ⊤ + Λ in such a way that the entries corresponding to ‖zi‖² ≤ εΛi occur first. Then, for all such zi, no update of Mi will be carried out. Moreover, Mi zi = zi, which again does not involve computational cost. Finally, we set bi = zi / (‖zi‖² + Λi). In a nutshell, this means that we can perform each of those steps at O(m) rather than O(m²) cost, which leads to a significant speedup if such a situation happens for a large number of zi.


The latter, however, is exactly the case during the endgame of an interior point method.

Here Λi corresponds to ci/αi, i.e., the quotient between the constraint ci(x) and the Lagrange

multiplier αi. This is bound to converge to 0 or ∞, depending on whether the constraint is active or not. Quite often in machine learning, in “easy” learning problems (Scholkopf and Smola, 2002), only a few constraints will be active, thus decreasing the effective number of operations

significantly: effectively we are dropping variables by our approach, without the need for any

heuristics to perform such operations.
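To make the bookkeeping concrete, the following Python sketch illustrates the lazy-update idea on a generic rank-one-update loop. The update rule M_{i+1} = M_i − (M_i z_i)(M_i z_i)^⊤ / D_i with D_i = Λ_i + z_i^⊤ M_i z_i is an assumption reconstructed from the proof of Lemma 17 rather than a copy of Algorithm 3.1, and all function and variable names are ours.

import numpy as np

def lazy_factorization_pass(Z, Lam, eps=1e-4):
    # One pass over the rows z_i of Z, skipping updates whose relative effect
    # z_i^T M_i z_i / Lambda_i is below eps (cf. Lemma 17).  Assumed update rule:
    #   D_i = Lambda_i + z_i^T M_i z_i,  M_{i+1} = M_i - (M_i z_i)(M_i z_i)^T / D_i.
    n, m = Z.shape
    M = np.eye(m)                      # M_1 is the identity
    b, skipped = [], 0
    for i in range(n):
        z = Z[i]
        Mz = M @ z                     # equals z as long as M is still the identity
        quad = z @ Mz                  # z_i^T M_i z_i
        D = Lam[i] + quad
        b.append(Mz / D)               # reduces to z_i / (||z_i||^2 + Lambda_i) when M_i = 1
        if quad / Lam[i] <= eps:
            skipped += 1               # change would be below tolerance: leave M untouched
            continue
        M -= np.outer(Mz, Mz) / D      # rank-one downdate of M
    return b, skipped

rng = np.random.default_rng(0)
Z = rng.standard_normal((6, 3))
Lam = np.array([1e6, 1e6, 0.5, 1e6, 0.3, 1e6])   # huge Lambda_i: those rows are effectively ignored
print(lazy_factorization_pass(Z, Lam)[1], "updates skipped")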

3.5.3 Support Vector Machines

Several optimization algorithms for Support Vector Machines can benefit from the fact that

the kernel matrix K (which plays a central role in the estimation process) can be written as

K = ZZ>+Λ ∈ Rn×n where Z ∈ Rn×m and Λ is a diagonal matrix with non-negative entries, i.e.,

K−Λ is rank-degenerate. An important problem in this context is to factorize K (Mangasarian

and Musicant, 2001, Scholkopf and Smola, 2002, Vishwanathan and Murty, 2002a).

More specifically, often one will want to factorize not only K = ZZ^\top \in \mathbb{R}^{n \times n} but also \tilde{K} = \tilde{Z}\tilde{Z}^\top, where \tilde{Z} is obtained from Z by removing or adding a row. Such operations

may occur repeatedly especially in the case of the SimpleSVM algorithm which maintains an

active set of constraints and dynamically adds or removes data points from the Support Vector

set (cf. Chapter 2 for a detailed description of SimpleSVM).

Addition Addition of a row of Z occurs in every iteration of Algorithm 3.1, hence adding yet

another row is rather trivial, provided that we have knowledge of Mi. The computational cost

involved is O(m2). If Mi is lost, e.g., after other modifications of Z, we can expect to expend

more computation on obtaining the factorization. Recall that b_i = M_i z_i / D_i and D_i = \Lambda_i + z_i^\top M_i z_i.

All we need to do is expand Mi into its definition and rearrange the bracketing, such that we

can compute Mi zi from scratch in O(mn) operations:

D_{n+1} b_{n+1} = M_{n+1} z_{n+1} = z_{n+1} - \sum_{i=1}^{n} b_i \Lambda_i (b_i^\top z_{n+1}) \qquad (3.39)

D_{n+1} = \Lambda_{n+1} + z_{n+1}^\top (M_{n+1} z_{n+1}) = \| z_{n+1} \|^2 + \Lambda_{n+1} - \sum_{i=1}^{n} \Lambda_i (b_i^\top z_{n+1})^2. \qquad (3.40)


The proposed factorization is faster than a direct method, which requires O(n^2) operations, as used in algorithms like those proposed by Cauwenberghs and Poggio (2001).
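For illustration, the small Python function below is a direct transcription of Equations (3.39) and (3.40) as reconstructed above; it assumes that the vectors b_i and the entries Λ_i from the earlier iterations have been stored, and it is a sketch of those equations rather than the implementation used in the thesis.

import numpy as np

def add_row(b_list, lam, z_new, lam_new):
    # Appends a row z_{n+1}: transcription of Equations (3.39)-(3.40).
    # b_list holds b_1, ..., b_n and lam holds Lambda_1, ..., Lambda_n; cost is O(mn).
    proj = np.array([b @ z_new for b in b_list])                         # b_i^T z_{n+1}
    Db = z_new - sum(b * l * p for b, l, p in zip(b_list, lam, proj))    # Eq. (3.39), RHS
    D_new = z_new @ z_new + lam_new - np.sum(np.asarray(lam) * proj**2)  # Eq. (3.40)
    return Db / D_new, D_new                                             # b_{n+1}, D_{n+1}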

Removal Here we can use the results from Section 3.4.3 concerning the rank-1 modification

of a matrix by removal of a row and column. Unlike the Cauwenberghs and Poggio (2001)

algorithm, which expends O(n2) effort on that, we can perform such operations now in O(mn)

time, a significant computational advantage, given the fact that often m \ll n.

3.6 Summary

We presented a fast algorithm for computing the Cholesky decomposition of a class of matrices

which occur frequently in machine learning applications. We discussed strategies for parallel

implementation. Rank one updates to the factorization were also derived. We demonstrated

the use of our factorization in speeding up the SimpleSVM algorithm as well as interior point

methods. Error analysis of the proposed factorization along the lines of Bennett (1965) is a

topic of current research.

Chapter 4

Kernels on Discrete Objects

In this chapter we provide a general overview of R-Convolution kernels proposed by Haussler

(1999). We present several extensions, exhibit new kernels, and show how various previously proposed kernels can be viewed in this framework. The aim of this chapter is to provide general recipes for defining kernels on strings, trees, Finite State Automata, images, etc. A few fast algorithms for computing kernels on sets are sketched in this chapter. Specific implementation details and fast practical algorithms for the other kernels are relegated to later chapters.

This chapter is organized as follows. In Section 4.1 we motivate the need for kernels on

discrete objects and present various applications. We go on to survey some recent literature on

kernels for strings and trees. We review some basic concepts of convolution kernels and discuss

our extensions in Section 4.2. In Section 4.3 we discuss kernels on sets, in Section 4.4 we discuss

kernels on strings, in Section 4.5 we discuss kernels on trees, in Section 4.6 we discuss kernels

on Automata, and finally in Section 4.7 we define novel kernels on images.

This chapter requires the reader to understand the notion of a kernel. Review of Section 1.3

may be helpful. This chapter is a pre-requisite for reading Chapters 5 and 6. Readers may want

to read the influential paper by Haussler (1999) to gain a deeper understanding of convolution

kernels and their extensions presented in this chapter. Knowledge of different types of Automata

may be useful in understanding material presented in Section 4.6. We refer the reader to the

authoritative text by Hopcroft and Ullman (1979) for further details.



4.1 Introduction

Many problems in machine learning require the classifier to work with a set of discrete examples.

Common examples are biological sequence analysis where data is represented as strings (Durbin

et al., 1998), Natural Language Processing (NLP) where the data is in the form of a parse tree

(Collins and Duffy, 2001), Internet connectivity where the connections are denoted by graphs

(Kondor and Lafferty, 2002). In order to apply machine learning algorithms on such discrete

data it is desirable to have a function of similarity. This can be achieved by the use of a feature

mapping of the form φ : X → HK where X is the set of discrete structures (e.g., the set of all

parse trees of a language) and HK is some Hilbert space. Furthermore, dot products in Hilbert

spaces lead to kernels

k(x, x′) = 〈φ(x), φ(x′)〉, (4.1)

where x, x′ ∈ X . It is clear that the success of kernel methods depends upon a faithful repre-

sentation of discrete data that can be computed efficiently. It should take into account both

the content as well as the inherent structural information present in the data. These notions of

similarity can also be extended to other areas like information retrieval and bio-informatics in a

natural and intuitive way.

4.1.1 Applications of Kernels on Discrete Structures

Kernels on discrete structures are very useful in a wide variety of fields.

Bio-Informatics Kernels on strings are widely used in the field of bio-informatics to compare

the similarity between two DNA sequences and for protein homology detection (Leslie

et al., 2002a,b, Jaakkola et al., 1999).

Intrusion Detection Use of the spectrum kernel for analyzing the system call traces in order

to detect intrusions can be found in Eskin et al. (2001). Other kinds of network data

including network logs can be analyzed effectively using a string kernel.

Natural Language Processing Use of kernels for defining similarity between parse trees has

been studied in Collins and Duffy (2001).

Document Retrieval on the Web Search engines have to deal with a lot of unstructured

data which is available in a wide variety of formats on the web. It is useful to have some


algorithm by which measures of similarity can be computed for such documents (Joachims,

2002, Manevitz and Yousef, 2001).

Structured Text In the field of information retrieval, string kernels are very useful in order

to define a similarity metric between text documents or web pages (Lodhi et al., 2002).

The structure of XML and HTML documents can also be utilized to define a meaningful

similarity metric (Joachims et al., 2001).

Images Comparing two images for similarity has always been a hard problem. It is compounded

by the fact that the same image at two different scales has two different representations. A

possible way to get around this problem is to use image segmentation techniques in order to

decompose the image into a tree like structure and then compare the similarities between

them. Another method is to encode all rectangular regions of an image into a compact

data structure and compare two images based on the number of rectangular regions they

share (cf. Section 4.7 for more details).

4.1.2 Previous Work

Recently convolution kernels proposed by Haussler (1999) have gained prominence in the field of

machine learning. They are obtained from other kernels by a certain sum over products which

can be viewed as a generalized convolution. The advantage of these kernels is that they can be

applied iteratively to build a kernel on more complex structures by using the kernels defined on

its components.

Pair HMM’s were first introduced for biological sequence analysis by Durbin et al. (1998).

A pair HMM is essentially a HMM that generates two symbol sequences simultaneously. Thus,

it defines a joint probability distribution over finite symbol sequences. The dynamic alignment

kernel algorithm uses a pair HMM to define a kernel (Watkins, 2000). Given a pair of strings

the dynamic alignment kernel is defined as the probability that the pair of strings was emitted

by the pair HMM.

Kernels on strings were motivated by the work of Haussler (1999) and Watkins (2000).

Explicit recursion formulas for calculation of various string kernels can be found in Herbrich

(2002). Various weighting schemes like the inverse-document-frequency (IDF) or cosine functions

which are widely used in information retrieval can also be incorporated into string kernels to

increase their relevance. A special case of the string kernel is the k-spectrum kernel defined


in Leslie et al. (2002a). Given a value of k, the kernel is defined as the count of all k length

substrings of the first string which occur in the second. They also show an O(nk) algorithm to

compute the kernel, where n is the length of the string. An extension of this idea to incorporate

mismatches can be found in Leslie et al. (2002b). The edit distance of two strings is defined

as the minimum number of insertions, deletions and substitutions that are required to convert

the first string to the second. Two strings are said to match if their edit distance is less than a

threshold. Given k and m, the kernel is defined as the count of all k length substrings of the first

string which match (within edit distance of m) substrings of the second string. They present a

suffix tree implementation for the mismatch kernel but do not give any worst case bounds.

Application of convolution kernels to Natural Language processing can be found in Collins

and Duffy (2001). They show a recursion formula to find all matching subtrees of a pair of trees.

The kernel on parse trees is defined as the count of the number of subtrees of the first tree that

match subtrees of the second tree. Their approach has a worst case bound of O(|N_1| \cdot |N_2|), where N_1 and N_2 are the sets of nodes of the first and second tree respectively.

Kernels that can capture the notion of similarity between text documents are important

in the field of information retrieval. Context kernels which also take into account the relative

position of occurrence of words can be found in Sim (2001). They use a pair dictionary of words

for evaluating the context kernel. But, they report that incorporating context in kernels did not

produce statistically significant gains in classification accuracy.

4.2 Defining Kernels

In a sense, the present chapter can be considered a continuation of (Haussler, 1999). In partic-

ular, we will introduce important special cases of the R-convolution and show how this allows

us to deal with certain data structures such as sets, strings, trees, Automata or images effi-

ciently. For this purpose we briefly review the notion of R-convolutions and how they may lead

to kernels.

4.2.1 Haussler’s R-Convolution

We begin by defining a relation

R : X \times \vec{X} \to \{\text{FALSE}, \text{TRUE}\} \quad \text{where} \quad \vec{X} := X_1 \times \ldots \times X_D \qquad (4.2)


or in short R(x, \vec{x}), where the x_i with \vec{x} = (x_1, \ldots, x_D) \in \vec{X} are considered to be the “parts” of x.

This allows us to consider the sets

R^{-1}(x) := \{\vec{x} \mid R(x, \vec{x}) = \text{TRUE}\}. \qquad (4.3)

In particular, we assume, following Haussler (1999), that |R−1(x)| is countable. However, we use

a slight extension, namely that R−1(x) may be a multi-set, i.e., it may contain elements more

than once. For instance, a text might contain the word ”dog” several times. This modification

greatly simplifies the subsequent considerations concerning exact and inexact matches. Next,

we introduce the R-convolution of the kernels k_1, \ldots, k_D with k_i : X_i \times X_i \to \mathbb{R} via

k_1 \star \ldots \star k_D(x, x') := \sum_{\vec{x} \in R^{-1}(x),\ \vec{x}' \in R^{-1}(x')} k_1(x_1, x'_1) \cdot \ldots \cdot k_D(x_D, x'_D). \qquad (4.4)

Depending on the definitions of R and ki we obtain a very rich class of kernels. In particular,

we may use recursive versions of Equation (4.4) to define kernels on objects such as trees. See

(Haussler, 1999, Section 2.2 and 4) for details and examples.
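As a deliberately naive illustration of Equation (4.4), the sketch below evaluates the R-convolution for a user-supplied decomposition function R_inv that returns the multiset of part tuples of an object; the names are ours and nothing here exploits the structure-specific speed-ups developed later.

from math import prod

def r_convolution(x, xp, R_inv, part_kernels):
    # Naive evaluation of Equation (4.4).  R_inv(x) yields the tuples
    # (x_1, ..., x_D) standing in relation R with x; part_kernels = [k_1, ..., k_D].
    total = 0.0
    for parts in R_inv(x):
        for parts_p in R_inv(xp):
            total += prod(k(a, b) for k, a, b in zip(part_kernels, parts, parts_p))
    return total

# toy decomposition of a string into (first character, remainder), D = 2
R_inv = lambda s: [(s[0], s[1:])] if s else []
delta = lambda a, b: 1.0 if a == b else 0.0
print(r_convolution("cat", "car", R_inv, [delta, delta]))   # 1 * 0 = 0.0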

4.2.2 Exact and Inexact Matches

One theme will be central to our considerations: the difference between exact and inexact

matches. We call a kernel an exactly matching kernel if

k_1 \star \ldots \star k_D(x, x') := \sum_{\vec{x} \in R^{-1}(x),\ \vec{x}' \in R^{-1}(x')} k_1(x_1, x'_1) \cdot \ldots \cdot k_D(x_D, x'_D)\, \delta_{\vec{x}, \vec{x}'}. \qquad (4.5)

Note that Equation (4.5) does not allow us to replace the double sum over R^{-1}(x) and R^{-1}(x') by a simple sum, due to the fact that \vec{x} might occur several times in R^{-1}(x) (and likewise \vec{x}'

in R−1(x′)). However, as may be apparent already now, significant computational gains can be

made in evaluating Equation (4.5) by sorting R−1(x) for each x beforehand and subsequently

comparing the sorted sets. While the specific form of sorting will depend on the kind of R we are

dealing with, it is safe to say that the central idea to efficient computation of exactly matching

kernels can be found in sorting the sets R−1(x) before evaluating Equation (4.5).
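The sorting idea can be made explicit for the simplest instance of Equation (4.5), namely k_1 = ... = k_D = δ with unit weights: both multisets R^{-1}(·) are sorted and merged with two pointers, so the cost is dominated by the sort. The helper below is a sketch under these assumptions.

def exact_match_kernel(parts_x, parts_xp):
    # Equation (4.5) with every k_d the delta kernel: count matching pairs of
    # parts by sorting both multisets and merging them.
    a, b = sorted(parts_x), sorted(parts_xp)
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            v, ci, cj = a[i], 0, 0
            while i < len(a) and a[i] == v:      # multiplicity on the left
                i, ci = i + 1, ci + 1
            while j < len(b) and b[j] == v:      # multiplicity on the right
                j, cj = j + 1, cj + 1
            total += ci * cj                     # every pairing of occurrences matches once
    return total

print(exact_match_kernel(["the", "dog", "the"], ["the", "cat"]))   # -> 2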

Concerning inexact matches, that is, the general case of Equation (4.4), it is safe to say


that in general we will have to pay a price which is O(h(|R−1(x)|) ·h(|R−1(x′)|)), where h(.) is a

function which depends on the domain of application and the type of inexact matches considered.

For example, in the case of string kernels, if we consider inexact matches due to substitution

alone they are cheaper to compute than inexact matches under a general edit distance (insertion,

deletion and substitution). Such considerations will become clear as we consider specific kernels

later in this chapter.

4.3 Sets

The most basic kernels are those where the relation R is given by R(x, \tilde{x}) := (\tilde{x} \in x), where x itself is a set. Consequently we have

k(x, x') = \sum_{\tilde{x} \in x,\ \tilde{x}' \in x'} \kappa(\tilde{x}, \tilde{x}') \qquad (4.6)

possibly with a normalization term to restrict k to unit length in feature space. Here \kappa(\tilde{x}, \tilde{x}') is a

kernel itself. Such kernels were proposed by Haussler (1999). Applications to multiple instance

problems can be found in Gartner et al. (2002). A straightforward application of an idea of

Herbster (2001) allows us to compute Equation (4.6) in linear time, for certain kernels. Details

can be found in the next section.

A special case occurs when \kappa(\tilde{x}, \tilde{x}') = \kappa(\tilde{x}, \tilde{x})\,\delta_{\tilde{x}, \tilde{x}'}, that is, if k becomes a weighted measure of agreement between the sets x and x'. For instance, the bag-of-words representation of texts (Joachims, 1998) belongs to this category. There \kappa(\tilde{x}, \tilde{x}) = 1 for all \tilde{x}. A slight modification can be found in the kernels proposed by Leopold and Kindermann (2002), where \kappa(\tilde{x}, \tilde{x}) is

a weight which depends on the discriminative power of a word (e.g., Term Frequency Inverse

Document Frequency (TFIDF), 1/f -law, frequency of occurrence, etc.). As already mentioned

by Joachims (1998), sorting x can greatly improve speed of evaluation — in the case of a bag-

of-words representation this leads to time complexity linear in the number of distinct words in

the text.

4.3.1 Implementation Strategies

Let x = \{x_1, x_2, \ldots, x_m\} and x' = \{x'_1, x'_2, \ldots, x'_n\} be two sets whose elements are drawn from

some domain X . We present fast implementation strategies (motivated by Herbster (2001)) for


the set kernel defined as follows

k(x, x') = \sum_{x_i \in x} \sum_{x'_j \in x'} K(x_i, x'_j), \qquad (4.7)

where K(., .) is a valid kernel function.

Suppose we can write K(x_i, x'_j) = \rho(x_i)\,\phi(x'_j) for some \rho(\cdot) and \phi(\cdot); then from Equation (4.7) we get

k(x, x') = \sum_{x_i \in x} \sum_{x'_j \in x'} \rho(x_i) \cdot \phi(x'_j) = \Big( \sum_{x_i \in x} \rho(x_i) \Big) \cdot \Big( \sum_{x'_j \in x'} \phi(x'_j) \Big),

which can be computed in O(m + n) time. Examples of such kernels include K(x_i, x'_j) = x_i x'_j and K(x_i, x'_j) = \exp(-\sigma(x_i - x'_j)) for X = \mathbb{R}.

Assume that X = \mathbb{R}, that the sets x and x' are in ascending order^1, and let K(x_i, x'_j) = \exp(-\sigma |x_i - x'_j|). Define l[i] := \sum_{j=1}^{i} \exp(\sigma x_j) and r[i] := \sum_{j=i+1}^{m} \exp(-\sigma x_j) for 1 \le i \le m. For x'_j \in x' define j^* = \max\{i : x_i \le x'_j\}. Now, we can compute k(x, x') as

k(x, x') = \sum_{j=1}^{n} \big( l[j^*] \exp(-\sigma x'_j) + r[j^*] \exp(\sigma x'_j) \big).

Computation of l[i] and r[i] for all i \in \{1, 2, \ldots, m\} takes O(m) time, while j^* for all j \in \{1, 2, \ldots, n\} can be computed in O(m + n) time by merging x and x'. Thus, k(x, x') can be computed in O(m + n) total time.
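The prefix and suffix sums l[·] and r[·] translate directly into code. The sketch below assumes both sets are given as sorted lists of reals; for brevity it locates j* by binary search (O(n log m)) rather than by the merge that gives the O(m + n) bound, and it checks the result against the brute-force double sum.

import bisect, math

def laplace_set_kernel(xs, xps, sigma):
    # sum_ij exp(-sigma * |x_i - x'_j|) for sorted xs and xps, via the prefix
    # sums l[i] = sum_{j<=i} exp(sigma x_j) and suffix sums r[i] = sum_{j>i} exp(-sigma x_j).
    m = len(xs)
    l = [0.0] * (m + 1)
    for i, v in enumerate(xs):
        l[i + 1] = l[i] + math.exp(sigma * v)
    r = [0.0] * (m + 1)
    for i in range(m - 1, -1, -1):
        r[i] = r[i + 1] + math.exp(-sigma * xs[i])
    total = 0.0
    for v in xps:
        jstar = bisect.bisect_right(xs, v)        # number of x_i with x_i <= x'_j
        total += l[jstar] * math.exp(-sigma * v) + r[jstar] * math.exp(sigma * v)
    return total

xs, xps, sigma = [0.1, 0.4, 2.0], [0.3, 1.5], 1.0
brute = sum(math.exp(-sigma * abs(a - b)) for a in xs for b in xps)
print(abs(laplace_set_kernel(xs, xps, sigma) - brute) < 1e-12)    # True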

4.4 Strings

Strings occur naturally in many applications including information retrieval, web-document

retrieval, search engines and bio-informatics. The strings in such applications may contain

millions of characters and hence the speed of kernel evaluation is critical. We define various

kernels on strings in this section and show their relation to many previously defined kernels.

^1 If they are not in sorted order we can sort them by expending O(m \log(m) + n \log(n)) effort.


4.4.1 Notation

Our notation for strings closely follows (Giegerich and Kurtz, 1997). Let A be a finite set which

we call the alphabet. The elements of A are characters. Let $ be a sentinel character such that

$ /∈ A. Any x ∈ Ak for k = 0, 1, 2 . . . is called a string. The empty string is denoted by ε. Some

examples of valid strings include

• Texts written in English defined over the alphabet A = {a, b, . . . , z, 0, 1, . . . , 9, ␣} (␣ denotes the blank character)

• DNA sequences defined over the alphabet A = {A,C,G, T}

• Binary sequences defined over the alphabet A = {0, 1}

A∗ represents the set of all strings defined over the alphabet A. We denote by A+ the set of all

non empty strings defined over A. It is clear that A+ = A∗ \ε.

We use s, t, u, v, w, x, y to denote strings over the alphabet A and a, b, c to denote elements

of the alphabet. |x| denotes the length of string x. Concatenation of two strings u and v is

denoted by uv while concatenation of a character a to a string u is denoted by au. Given a

string t = t1t2 . . . tn where ti ∈ A, the reverse string is defined as t−1 := tn . . . t2t1. We use the

shorthand notation x[i : j] to define a substring of x between locations i and j (both inclusive)

where 1 ≤ i, j ≤ |x|. If x = uvw for some ( possibly empty ) u, v, w, then u is called a prefix

of x while v is called a substring and w is called a suffix of x. We sometimes use the notation

s \sqsubseteq x to denote that s is a substring of x. A suffix or prefix of a string x is called nested if

it occurs elsewhere in x. Given two strings x and y, numy(x) denotes the number of times y

occurs as a substring of x.

4.4.2 Various String Kernels

It is immediately obvious that a simple frequency-of-occurrence count, such as in the bag-

of-words representation of texts discards a large amount of information inherent in a text.

Indeed, inclusion of a larger context of symbols can improve the generalization performance of

kernel methods (Sim, 2001, Lodhi et al., 2002). If we do not consider the contribution of gappy

substrings all currently used string kernels can be described in the following way:

k(x, x') = \sum_{s \sqsubseteq x,\ s' \sqsubseteq x'} \kappa(s, s') \qquad (4.8)


where s, s′, x, x′ ∈ S are strings and s v x denotes that s is a substring of x. The following

special cases are worth considering.

Bag of Symbols: Here we have

\kappa(s, s') = \begin{cases} w_s \delta_{s,s'} & \text{if } s \in A \\ 0 & \text{otherwise} \end{cases}

where ws are arbitrary nonnegative weights. In other words, only the frequency of occur-

rence of single symbols counts.

Bag of Substrings: To include context we extend the bag-of-words to a bag-of-substrings rep-

resentation. This means that we have a kernel of type

κ(s, s′) = wsδs,s′ .

Clearly the number of such substrings can be large (worst case it can be quadratic in the

length of x). This means that a naive implementation could take up to O(|x|2|x′|2) time to

compute k(x, x′), where |x| denotes the length of x. Chapter 5 will present an algorithm

which scales as O(|x| + |x′|), thus significantly improving on previous algorithms which

were of order O(|x| · |x′|) (see e.g., Herbrich (2002) for an overview). A special case of this

kernel is the length weighted kernel which uses

w_s = \lambda^{|s|}

where \lambda is a weighting factor. \lambda > 1 means that longer substring matches are favored while

λ < 1 favors smaller substring matches. Another special case takes into account substrings

of length greater than a given threshold and is given by

κ(s, s′) = wsδs,s′δ(|s| > T )

where T is the given threshold on the length of the substrings that match. An alternate

way of looking at it is to set ws = 0 if |s| < T . This may be useful in information retrieval

applications where we do not want frequently occurring connectors like a, an, the, etc. to

contribute to the kernel value.


The kernels proposed by Leslie et al. (2002a,b), Haussler (1999), Watkins (2000) are a special

case of our definitions and the algorithm we present in Chapter 5 will contain the previous

methods as a special case or be significantly faster than the originally proposed strategies.
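For reference, a brute-force version of Equation (4.8) with κ(s, s') = w_s δ_{s,s'} and a pluggable weight function is only a few lines; it is far from the linear-time algorithm of Chapter 5, but it makes the weighting schemes above concrete. The two weight functions shown mirror the length-weighted and thresholded special cases.

from collections import Counter

def naive_substring_kernel(x, xp, weight):
    # Brute-force Equation (4.8): count every common substring, weighted by weight(s).
    # Worst-case cost O(|x|^2 * |x'|^2); Chapter 5 reduces this to O(|x| + |x'|).
    def substr_counts(s):
        c = Counter()
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                c[s[i:j]] += 1
        return c
    cx, cxp = substr_counts(x), substr_counts(xp)
    return sum(cx[s] * cxp[s] * weight(s) for s in cx.keys() & cxp.keys())

lam, T = 0.5, 2
print(naive_substring_kernel("abab", "ab", lambda s: lam ** len(s)))      # w_s = lambda^{|s|}
print(naive_substring_kernel("abab", "ab", lambda s: float(len(s) > T)))  # threshold on |s|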

4.5 Trees

Trees are widely used data structures for searching and database operations. They also occur

frequently in many applications including Natural Language Processing and database searching.

We define various kernels on trees in this section and show their relation to many previously

defined kernels. We also propose a way to handle inexact matches by using coarsening levels in

trees.

4.5.1 Notation

A tree is defined as a connected graph with no cycles. The set of nodes of a tree T is denoted

by VT and the set of edges is denoted by ET . A null tree is a tree with no nodes. A node (if it

exists) is designated as the root node and all other nodes are either leaf or internal nodes. An

internal node has one or more child nodes and is called the parent of its child nodes. Each node

except the root node has exactly one parent. All children of the same node are called siblings. A

node with no children is referred to as a leaf. The degree of a node is the number of its children.

A sequence of nodes n1, n2, . . . , nk, such that ni is the parent of ni+1 for i = 1, 2, . . . , k − 1 is

called a path. The depth of a node is the length of the unique path from the root to the node.

Nodes which are at the same depth are at the same level. The height of a node in a tree is the

length of a longest path from the node to a leaf. The height of a tree is the height of its root.

Given two nodes a and d, if there is a path from node a to d then a is called an ancestor of d and

d is called a descendant of a. If a \ne d, then a is a proper ancestor and d is a proper descendant.

We define a subtree of a tree as a node in that tree together with all its descendants. A subtree

rooted at node n is denoted as Tn. If a set of nodes in the tree along with the corresponding

edges forms a tree then we define it to be a subset tree.

If every node including the internal nodes and root contain a label then the tree is called a

labeled tree. The label on a node n is denoted by nL and the set of all labels is denoted by LT .

If only the leaf nodes contain labels then the tree is called a leaf-labeled tree. We do not consider

unlabeled trees in this thesis. An ordered tree is one in which the child nodes of every node


are ordered as per the ordering defined on the node labels. It is easy to see that a Depth First

Search (DFS) on an ordered tree produces a unique sequence of node labels. In Chapter 5 we

present an algorithm to define an ordering on a leaf-labeled tree. From now on we will consider

ordered trees unless mentioned otherwise. Two ordered trees T and T ′ are equal iff VT = VT ′ ,

ET = ET ′ and the labels on the corresponding nodes match. It is clear that the subset tree of

an ordered tree is again an ordered tree.
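Since a depth-first traversal of an ordered labeled tree produces a unique label sequence, converting a tree to a string is straightforward. The sketch below uses an ad-hoc representation (a label together with a list of already-ordered children) and an ad-hoc bracketing; the actual encoding used in Chapter 5 may differ.

def tree_to_string(node):
    # Unique DFS label sequence of an ordered, labeled tree.  A node is a pair
    # (label, children); brackets keep distinct tree shapes distinguishable.
    label, children = node
    if not children:
        return label
    return label + "[" + "".join(tree_to_string(c) for c in children) + "]"

t = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", [])])])
print(tree_to_string(t))   # S[NP[DN]VP[V]]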

4.5.2 Various Tree Kernels

A kernel on trees can be described in the following natural way:

k(t, t') = \sum_{s \sqsubseteq t,\ s' \sqsubseteq t'} \kappa(s, s') \qquad (4.9)

where s, s', t, t' \in T are trees and s \sqsubseteq t denotes that s is a subset tree of t. We consider the

following special cases.

Bag of Nodes: Here we have

\kappa(s, s') = \begin{cases} w_s \delta_{s,s'} & \text{if } s \in V_T \\ 0 & \text{otherwise} \end{cases}

where ws are arbitrary nonnegative weights. In other words, only those nodes with the

same node labels contribute to the kernel. This of course discards a lot of structural

information inherent in the tree.

Bag of Subset Trees: To include structural information into our kernel we can extend the

above notion to a bag of subset trees representation to yield

κ(s, s′) = wsδs,s′ .

Clearly the number of subset trees of a tree can be exponentially large and so clever

algorithms are required to calculate this kernel efficiently. We present a few such algorithms

in Chapter 5.

Bag of Subtrees: The kernel proposed in Collins and Duffy (2001) is a special case of the


subset trees kernel where

κ(s, s′) = wsδs,s′

and s and s′ are subtrees of t. So instead of comparing parts of subtrees they compare

complete subtrees for matches.

Bag of Paths: In case we consider only the paths in a tree for matching we get

κ(s, s′) = wsδs,s′ ,

where, s and s′ are paths in t. Suffix trees on trees can be constructed in time linear in the

number of nodes (Breslauer, 1998). They can be exploited to speed up the computation

of this kernel.

Inexact Matches: Two trees are said to be close to each other if we can do a small number

of the following operations in order to get the second tree from the first (Oflazer, 1997).

• add/delete a small number of leaves to/from one of the trees (Structural mismatch)

• change the label of a small number of leaves in one of the trees (Label mismatch)

As in the case of strings this is a very difficult problem and requires dynamic programming

tools for efficient handling.

4.5.3 Coarsening Levels

Sometimes two trees have close structural similarities if we throw away a few nodes from both

the trees. To respect such structural similarities we must ignore mismatches due to presence of

non matching subtrees hanging off a node. We define a d level coarsening of an unlabeled tree

T as the tree obtained by chopping off all subtrees of height d from T . We denote this tree by

Td. By definition Td is the null tree if d is greater than the height of the tree. For example T1

is obtained by throwing away all the leaves of the tree T . In case of labeled trees we need to

define a method to propagate the labels up the tree. This propagation of labels can be highly

application specific and hence we do not discuss them here.


We now define the kernel between two trees t and t′ as

k_{\text{coarse}}(t, t') = \sum_{i,j} W_{ij} \cdot k(t_i, t'_j) \qquad (4.10)

where Wij is a down-weighting factor, and k(ti, t′j) is the tree kernel defined in Equation (4.9).

Since the class of kernels is closed under addition and scalar multiplication, kcoarse(t, t′) is a valid

kernel function. A practically useful special case of the above kernel is

k_{\text{coarse}}(t, t') = \sum_{i} W_i \cdot k(t_i, t'_i) \qquad (4.11)

where the weighting factor W_i is typically chosen to be of the form \lambda^i with \lambda \le 1.
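The sketch below implements Equation (4.11) with W_i = λ^i on the same ad-hoc tree representation as before. Coarsening is realized here by dropping every subtree of height less than d (which removes exactly the leaves for d = 1, matching the example above); labels are simply kept, since label propagation is application specific, and the bag-of-node-labels base kernel is only a stand-in for the tree kernels of Equation (4.9).

from collections import Counter

def height(node):
    label, children = node
    return 0 if not children else 1 + max(height(c) for c in children)

def coarsen(node, d):
    # d-level coarsening t_d: drop every subtree whose height is below d.
    # d = 1 removes the leaves; d greater than the height gives the null tree.
    if height(node) < d:
        return None
    label, children = node
    return (label, [c2 for c2 in (coarsen(c, d) for c in children) if c2 is not None])

def bag_of_nodes(t, tp):
    # stand-in base kernel: count matching node labels with unit weights
    def labels(n):
        out = [n[0]]
        for c in n[1]:
            out.extend(labels(c))
        return out
    cx, cy = Counter(labels(t)), Counter(labels(tp))
    return sum(cx[l] * cy[l] for l in cx.keys() & cy.keys())

def k_coarse(t, tp, lam=0.5, max_d=3):
    # Equation (4.11): sum_i lambda^i * k(t_i, t'_i) over coarsening levels
    total = 0.0
    for i in range(max_d + 1):
        ti, tpi = coarsen(t, i), coarsen(tp, i)
        if ti is None or tpi is None:
            break
        total += (lam ** i) * bag_of_nodes(ti, tpi)
    return total

t1 = ("a", [("b", [("d", [])]), ("c", [])])
t2 = ("a", [("b", [("e", [])]), ("c", [])])
print(k_coarse(t1, t2))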

4.6 Automata

Automata are powerful abstractions very frequently used in computer science. They are inti-

mately related to Hidden Markov Models as well as dynamic systems.

4.6.1 Finite State Automata

Our notation closely follows Hopcroft and Ullman (1979). Let Σ be a finite set which we call

the alphabet. The elements of Σ are characters. Any x ∈ Σk for k = 0, 1, 2 . . . is called a string.

The empty string is denoted by ε. Σ∗ represents the set of all strings defined over the alphabet

Σ. We denote by Σ+ the set of all non empty strings defined over Σ. It is clear that Σ+ = Σ∗ \ε.

A language is a set of strings of symbols from one alphabet.

Finite State Automata Finite State Automata (FSA) are mathematical models which de-

scribe a regular language. At any given point of time the system can be in one of the

finite states. The transition to the next state is determined by the current state of the automaton and the input symbol read. More formally, a FSA is denoted by a 5-tuple (Q, Σ, δ, q_0, F) where Q is

the finite set of states, Σ is a finite input alphabet, q0 ∈ Q is the initial state, F ⊆ Q is

the set of final states, and δ is the transition function mapping Q × Σ → Q (Hopcroft

and Ullman, 1979). A FSA is said to accept a string x if the sequence of state transitions

induced due to the symbols in x lead us from the start state to a final state. The language

accepted by FSA’s is called a regular language. We can extend the notion of a FSA to


include transitions on the empty input ε. It can be shown that the use of ε transitions

does not add to the expressive power of a FSA.

Nondeterministic Finite State Automata Nondeterministic Finite State Automata (NFA)

are extensions of FSA which allow zero, one or more transitions from a state on the same

input symbol. Formally we denote the NFA by a 5-tuple (Q,Σ, δ, q0, F ), where Q,Σ, q0, F

have the same meaning as they had for a FSA, but now δ is a map Q×Σ→ 2Q ( 2Q is the

set of all subsets of Q) (Hopcroft and Ullman, 1979). A NFA is said to accept a string x if

there exists at least one sequence of transitions that lead from the initial state to the final

state. It is well known that any set accepted by a NFA can also be accepted by a DFA,

in other words the class of languages accepted by NFA’s and DFA’s is the same. We can

show that the addition of ε transitions does not add to the expressive power of a NFA.

Let L be the language accepted by a given FSA or a NFA. Then for every x ∈ L there is at

least one (possibly more than one) sequence of state transitions leading from q_0 to some final state f ∈ F induced by x. We define the set of all such state transition sequences as q(x) and use the notation s \sqsubseteq q(x) to denote that

the sequence of state transitions s occurs as a sub-sequence of some element of q(x). We now

define a generic kernel using the given FSA or NFA as

k(x, x') = \sum_{s \sqsubseteq q(x),\ s' \sqsubseteq q(x')} \kappa(s, s') \qquad (4.12)

Typically some normalizing term is also added to account for the lengths of x and x'. Depending on the function \kappa(s, s') various kernels can be realized. The following cases are interesting.

Bag of States: Here we have

\kappa(s, s') = \begin{cases} w_s \delta_{s,s'} & \text{if } s \in Q \\ 0 & \text{otherwise} \end{cases}

where the w_s are arbitrary nonnegative weights. In other words, this kernel counts the common

states that occurred during the state transitions induced by x and x′.

Bag of State Sub-Sequences: To include context we use a kernel of type

\kappa(s, s') = w_s \delta_{s,s'}.


Such a kernel may be efficiently computed by using ideas described in Chapter 5. A spe-

cialization of this kernel takes into account the location of occurrence of the sub-sequences

in order to assign weights. Computing such kernels is a topic of current research.
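To make the construction concrete, the sketch below runs a deterministic FSA over two strings, records the multiset of visited states, and evaluates the bag-of-states kernel with unit weights; the transition-table format and the toy automaton are our own.

from collections import Counter

def run_dfa(delta, q0, x):
    # Sequence of states visited while reading x, starting from q0.
    # delta maps (state, symbol) -> state; a missing transition stops the run.
    states, q = [q0], q0
    for a in x:
        if (q, a) not in delta:
            break
        q = delta[(q, a)]
        states.append(q)
    return states

def bag_of_states_kernel(delta, q0, x, xp):
    # unit-weight count of states common to both runs
    cx, cy = Counter(run_dfa(delta, q0, x)), Counter(run_dfa(delta, q0, xp))
    return sum(cx[q] * cy[q] for q in cx.keys() & cy.keys())

# toy automaton over {a, b} that remembers the last symbol read
delta = {(0, 'a'): 1, (0, 'b'): 2, (1, 'a'): 1, (1, 'b'): 2, (2, 'a'): 1, (2, 'b'): 2}
print(bag_of_states_kernel(delta, 0, "abab", "ba"))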

4.6.2 Pushdown Automata

A context-free grammar (CFG) is a finite set of variables, also called non-terminals, each

of which represents a language. The languages represented by the variables are described re-

cursively in terms of each other and primitive symbols called terminals. The rules relating the

variables are called productions. More formally, a CFG is denoted by G = (V, T, P, S) where V

is a finite set of variables, T is a finite set of terminals, and S is a special variable called the

start symbol. P is a finite set of productions of the form A → α, where A is a variable and α

is a string of symbols from (V ∪ T )∗ (Hopcroft and Ullman, 1979). A string x is said to belong

to the language defined by G if there exists a finite sequence of productions in G which can be

applied to derive the string from the start symbol S. A parse tree of x is the tree representation

of the productions that are used to derive x from S. If some string x in the language defined by

G can be derived by applying more than one distinct sequence of productions (it has more than

one parse tree) then G is called ambiguous.

Pushdown Automata A Pushdown Automaton (PDA) is essentially a finite automaton endowed

with a stack which can be used to store symbols. It is a system (Q,Σ,Γ, δ, q0, Z0, F ), where

Q,Σ, q0 and F have the same meaning as for a FSA. Γ is the stack alphabet and Z0 ∈ Γ

is a particular stack symbol called the start symbol. In the case of a PDA, δ is a mapping

Q× (Σ ∪ {ε})× Γ→ Q× Γ∗. We define the language accepted by a PDA as the set of all

inputs for which some choice of moves causes the PDA to enter a final state.

It can be shown that a deterministic PDA accepts only unambiguous languages. Let L be

the language accepted by a given PDA. Then, for every x ∈ L there is a unique parse tree. Given

two strings x and x′ in L we generate their parse trees and use ideas from Section 4.5 to compute

kernels on their parse trees (Baxter et al., 1998). At first glance this idea may look naive, but it

has very powerful implications. For example, every programming language (eg., C, C++, Java,

etc.) is defined by a grammar which is accepted by some PDA (Aho et al., 1986). Hence, our

idea could be used to compute kernels between say two C programs. The advantage of this


method is that it goes beyond simple string matching and takes advantage of the semantics of

the program. This has many applications in duplicate code detection and software plagiarism

detection (Parker and Hamblen, 1989). Well structured languages like XML can be parsed by

a parser to generate a Document Object Model (DOM). This means that such languages can

be parsed by a PDA. Thus, our idea can also be used to compute kernels between two XML

documents.

4.7 Images

A lot of information is contained in the two dimensional structure of an image and hence any

meaningful kernel on images must take this into account. A generic convolution kernel on images

can be described in the following way:

k(x, y) = \sum_{s \in R(x),\ s' \in R(y)} \kappa(s, s'), \qquad (4.13)

where R(.) is the set of all connected regions in an image. It is apparent that the set of connected

regions is a very large set and hence it is extremely difficult to compute such a kernel efficiently.

If we visualize an image as a matrix, any connected region can be described using its sub-

matrices. Hence, if we restrict R(.) to be the set of all sub-matrices we still hope to capture a

lot of the inherent structural information. The advantage of this restriction is that the kernel

can be computed efficiently. We use the notation s v x to indicate that s is a sub-matrix of x

and define:

k(x, y) = \sum_{s \sqsubseteq x,\ s' \sqsubseteq y} \kappa(s, s'). \qquad (4.14)

Typically, it is very difficult to consider inexact matches in images. One possibility is to con-

sider inexact matches due to scaling. But, since such schemes involve domain knowledge and

heuristics, we restrict ourselves to only exact matches and define

κ(s, s′) = wsδs,s′ ,

to produce a valid kernel. Efficient strategies for computing this kernel can be found in Sec-

tion 7.6.
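A brute-force version of Equation (4.14) with exact matches and unit weights is shown below purely to fix the semantics: it enumerates every rectangular sub-matrix of each image, which is of course far too expensive for real images; the efficient strategy is the one deferred to Section 7.6.

from collections import Counter

def all_submatrices(img):
    # multiset of all rectangular sub-matrices of img (a tuple of row tuples)
    rows, cols = len(img), len(img[0])
    out = Counter()
    for r1 in range(rows):
        for r2 in range(r1, rows):
            for c1 in range(cols):
                for c2 in range(c1, cols):
                    out[tuple(row[c1:c2 + 1] for row in img[r1:r2 + 1])] += 1
    return out

def submatrix_kernel(x, y):
    # Equation (4.14) with kappa(s, s') = delta_{s,s'}: shared sub-matrix count
    cx, cy = all_submatrices(x), all_submatrices(y)
    return sum(cx[s] * cy[s] for s in cx.keys() & cy.keys())

print(submatrix_kernel(((0, 1), (1, 0)), ((1, 0), (0, 1))))   # -> 12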


4.8 Summary

We have shown how R-Convolution proposed by Haussler (1999) can be extended and effectively

used to define kernels on a variety of discrete structures. In particular we considered kernels

on sets, strings, trees, Automata and images in this chapter. We showed that our framework

includes as special cases many kernels proposed earlier. We presented fast implementation

strategies for a few instances of the set kernel. Fast algorithms for computing kernels on strings,

trees and images can be found in the next few chapters.

Chapter 5

Fast String and Tree Kernels

This chapter presents algorithms for computing kernels on strings (Watkins, 2000, Haussler,

1999, Leslie et al., 2002a) and trees (Collins and Duffy, 2001) in linear time in the size of the

arguments, regardless of the weighting that is associated with any of the terms. We show how

suffix trees on strings can be used to enumerate all common substrings of two given strings. This

information can then be used to compute string kernels efficiently. In order to compute kernels

on trees we exhibit an algorithm to obtain the string representation of a tree. The string kernel

ideas are then used to compute kernels on trees. We discuss an algorithm for string kernels

by which the prediction cost can be reduced to linear cost in the length of the sequence to be

classified, regardless of the number of Support Vectors.

This chapter is organized as follows. In Section 5.1 we briefly introduce the problem and

discuss the basic ideas behind our method. In Section 5.2 we introduce our notation and briefly

review the definition of string kernels (also cf. Section 4.4). We briefly review the concept of

a suffix tree in Section 5.3. Readers already familiar with suffix trees can skip this section.

In Section 5.4 we review the important matching statistics algorithm by Chang and Lawler

(1994) and show its relation to our string kernel algorithm. In Section 5.5 we present our

algorithm for string kernels, prove its correctness and discuss its time complexity. In Section 5.6

we discuss various practical weighting schemes and show how our algorithm can handle such

schemes efficiently. In Section 5.7 we discuss our linear time prediction algorithm and prove it

correctness formally. We define tree kernels in Section 5.8 and show how tree kernels can be

computed efficiently by converting trees to strings. We present experimental results in Section 5.9

and conclude with a summary and discussion in Section 5.10.



This chapter requires the reader to understand the notion of a kernel. Review of Section 1.3

may be helpful. The reader is strongly encouraged to read Chapter 4 (especially Sections 4.4 and

4.5) before reading this chapter. For the sake of completeness, a few ideas presented there will be

reviewed briefly in Sections 5.2 and 5.8. An understanding of suffix trees on strings is essential.

Readers may want to read the excellent review paper by Grossi and Italiano (1993). Other

important sources of information on suffix trees include McCreight (1976), Ukkonen (1995),

Weiner (1973), Gusfield (1997). Section 5.4 briefly reviews the matching statistics algorithm.

The paper by Chang and Lawler (1994) may help interested readers gain a deeper understanding

of the material presented in this chapter.

5.1 Introduction

String kernels of the form defined in Equation (4.8) are typically computed using dynamic programming and hence require cost quadratic in the length of the arguments (Herbrich,

2002, Collins and Duffy, 2001). In this chapter, we present an algorithm which computes string

kernels in linear time. This is a significant improvement, especially considering the fact that

string kernels are widely used in bio-informatics or web search engines where each input string

could (possibly) contain millions of entries. Note that, the method we present here is far more

general than strings and trees, and it can be applied to finite state machines, formal languages,

Automata, etc. as discussed in Chapter 4. However, for the scope of the current chapter we

limit ourselves to a fast means of computing extensions of the kernels of Watkins (2000), Collins

and Duffy (2001), Leslie et al. (2002a).

In a nutshell, our idea works as follows: assume we have a kernel of the form k(x, x') = \sum_{i \in I} \phi_i(x) \phi_i(x'), where the index set I may be large, yet the number of nonzero entries is

small in comparison to |I|. Then, an efficient way of computing k is to sort the set of nonzero

entries φ(x) and φ(x′) beforehand and count only matching non-zeros. This is similar to the

dot-product of sparse vectors in numerical mathematics. As long as the sorting is done in an

intelligent manner, the cost of computing k is linear in the sum of non-zero entries combined.

In order to use this idea for matching strings (which have a quadratically increasing number of

substrings) and trees (which can be transformed into strings) efficient sorting is realized by the

compression of the set of all substrings into a suffix tree. Moreover, dictionary keeping allows

us to use arbitrary weightings for each of the substrings and still compute the kernels in linear


time.
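The sorted-merge idea is the familiar sparse dot product. In the sketch below φ(x) is represented as a list of (index, value) pairs sorted by index, and only indices present in both maps contribute, so the cost is linear in the total number of non-zeros; the representation is our own choice.

def sparse_dot(phi_x, phi_xp):
    # dot product of two sparse feature maps, each a sorted list of (index, value)
    i = j = 0
    total = 0.0
    while i < len(phi_x) and j < len(phi_xp):
        ki, vi = phi_x[i]
        kj, vj = phi_xp[j]
        if ki == kj:
            total += vi * vj
            i, j = i + 1, j + 1
        elif ki < kj:
            i += 1
        else:
            j += 1
    return total

# e.g. substring-indexed features, sorted by substring
print(sparse_dot([("ab", 2.0), ("b", 1.0)], [("ab", 1.0), ("ba", 3.0)]))   # 2.0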

In general, SVM’s require time proportional to the number of Support Vectors for prediction.

In case the dataset is noisy a large fraction of the data points become Support Vectors and the

time required for prediction increases. But, in many applications like search engines, spam

filtering or web document retrieval, the dataset is noisy, yet, the speed of prediction is critical.

We propose a weighted version of our string kernel algorithm which handles such cases efficiently.

The key observation is that a suffix tree of a set of strings can be constructed in time linear in

the size of the strings by using an algorithm proposed by Amir et al. (1994). We now weigh the

contribution due to each substring appropriately based on the Support Vector which contains

it. By using our algorithm the prediction time is linear in the length of the sequence to be

classified, regardless of the number of Support Vectors.

5.2 String Kernel Definition

We begin by introducing some notation. Let A be a finite set which we call the alphabet. The

elements of A are characters. Let $ be a sentinel character such that $ /∈ A. Any x ∈ Ak for

k = 0, 1, 2 . . . is called a string. The empty string is denoted by ε and A∗ represents the set of

all non empty strings defined over the alphabet A.

In the following we will use s, t, u, v, w, x, y, z ∈ A∗ to denote strings and a, b, c ∈ A to denote

characters. |x| denotes the length of x, uv ∈ A∗ the concatenation of two strings u, v and au

the concatenation of a character and a string. We use x[i : j] with 1 ≤ i ≤ j ≤ |x| to denote the

substring of x between locations i and j (both inclusive). If x = uvw for some (possibly empty)

u, v, w, then u is called a prefix of x while v is called a substring (also denoted by v v x) and w

is called a suffix of x. Finally, numy(x) denotes the number of occurrences of y in x. The type

of kernels we will be studying are defined by

k(x, x') := \sum_{s \sqsubseteq x,\ s' \sqsubseteq x'} w_s \delta_{s,s'} = \sum_{s \in A^*} \text{num}_s(x)\, \text{num}_s(x')\, w_s. \qquad (5.1)

That is, we count the number of occurrences of every string s in both x and x′ and weight it

by ws, where the latter may be a weight chosen a priori or after seeing data, e.g., for inverse

document frequency counting (Leopold and Kindermann, 2002). This includes a large number

of special cases:


• Setting ws = 0 for all |s| > 1 yields the bag-of-characters kernel, counting simply single

characters.

• The bag-of-words kernel is generated by requiring s to be bounded by whitespace.

• Setting ws = 0 for all |s| > n yields limited range correlations of length n.

• The k-spectrum kernel takes into account substrings of length k (Leslie et al., 2002a). It

is achieved by setting w_s = 0 for all |s| \ne k.

• Term Frequency Inverse Document Frequency (TFIDF) weights are achieved by first cre-

ating a (compressed) list of all s including frequencies of occurrence, and subsequently

rescaling ws accordingly.

All these kernels can be computed efficiently via the construction of suffix-trees, as we will see

in the following sections.
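The k-spectrum special case (w_s = 1 for |s| = k and 0 otherwise) already admits a simple hash-map implementation, sketched below, that runs in O(k(|x| + |x'|)) time; the suffix-tree machinery of the following sections is what handles the general weighted case.

from collections import Counter

def k_spectrum_kernel(x, xp, k):
    # Equation (5.1) with w_s = 1 for |s| = k and w_s = 0 otherwise:
    # count matching occurrences of length-k substrings.
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(xp[i:i + k] for i in range(len(xp) - k + 1))
    return sum(cx[s] * cy[s] for s in cx.keys() & cy.keys())

print(k_spectrum_kernel("ababc", "abba", 2))   # ab matches twice, ba once -> 3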

5.3 Suffix Trees

The suffix tree is a compacted trie (Fredkin, 1960) that stores all suffixes of a given text string.

It has been widely used in pattern matching applications in fields as diverse as molecular biology,

data processing, text editing and interpreter design (Gusfield, 1997, Grossi and Italiano, 1993).

In this section, we formally define a suffix tree and give a general overview of its properties. We

also make a few remarks about particular properties of suffix trees which are exploited later on

by our algorithms.

5.3.1 Definition of a Suffix Tree

A suffix tree of the string x is denoted by S(x). It is defined as a rooted tree with edges and

nodes that are labeled with substrings of x. The suffix tree is a multi-way Patricia tree (Knuth,

1998a,b) that satisfies the following properties (McCreight, 1976)

1. Each node is labeled with a unique string formed by the concatenation of the edge labels

on the path from the root to the node.

2. Each internal node has at least two descendants.


3. Edges leaving any given node are labeled with non-empty strings that start with different

symbols.

The suffix tree of the string ababc$ is shown in Figure 5.1.

Figure 5.1: Suffix tree of the string ababc$ (suffix links are indicated by dotted lines).

Let nodes(S(x)) denote the set of all nodes of S(x) and root(S(x)) be the root of S(x).

For a node w, father(w) denotes its parent, T(w) denotes the subtree rooted at the node,

lvs(w) denotes the number of leaves in the subtree and path(w) := w is the path from the

root to the node. That is, we use the path w from root to node as the label of the node w.

Let u, w ∈ nodes(S(x)) be such that u = father(w) and w = ue; we define len(u, e(1)) := |e| and son(u, e(1)) := w, while first(u, e(1)) := i where e = x[i : i + |e| − 1] (since first(u, e(1)) denotes

some index in the string x where the substring u ends and the substring e begins it may not be

unique).

We denote by words(S(x)) the set of all strings w such that wu ∈ nodes(S(x)) for some

(possibly empty) string u, which means that words(S(x)) is the set of all possible substrings

of x. For every t ∈ words(S(x)) we define ceiling(t) as the node w such that w = tu and

u is the shortest (possibly empty) substring such that w ∈ nodes(S(x)). Similarly, for every

t ∈ words(S(x)) we define floor(t) as the node w such that t = wu and u is the shortest (possibly

empty) substring such that w ∈ nodes(S(x)). Given a string t and a suffix tree S(x), we can

decide if t ∈ words(S(x)) in O(|t|) time by just walking down the corresponding edges of S(x).

5.3.2 The Sentinel Character

It is clear that due to addition of the sentinel character $ to string x, it does not contain any

nested suffix (except ε). As a result each non empty suffix of x$ uniquely corresponds to a leaf in

S(x) and the number of leaves in S(x) exactly equals |x|. Furthermore, it can be shown that for

any t ∈ words(S(x)), lvs(ceiling(t)) gives us the number of occurrence of t in x (Giegerich and


Kurtz, 1997). The idea works as follows: all suffixes of x starting with t have to pass through

ceiling(t), hence we simply have to count the occurrences of the sentinel character, which can

be found only in the leaves. Given a suffix tree S(x), a simple Depth First Search (DFS) of the

tree will enable us to calculate lvs(w) for each node in S(x) in O(|x|) time and space.
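The role of lvs(ceiling(t)) is easy to see on a naive (quadratic-size) suffix trie, sketched below: every suffix of x$ ends in its own leaf, so the number of occurrences of t equals the number of leaves below the node reached by spelling out t. This is purely illustrative and is not the compacted, linear-size suffix tree built by McCreight's algorithm.

def build_suffix_trie(x):
    # naive suffix trie of x$ as nested dicts, one node per character
    x = x + "$"
    root = {}
    for i in range(len(x)):
        node = root
        for c in x[i:]:
            node = node.setdefault(c, {})
    return root

def count_leaves(node):
    # lvs(.): number of sentinel-terminated paths below a node
    return 1 if not node else sum(count_leaves(child) for child in node.values())

def num_occurrences(root, t):
    # lvs(ceiling(t)): walk down the trie along t and count leaves below
    node = root
    for c in t:
        if c not in node:
            return 0
        node = node[c]
    return count_leaves(node)

trie = build_suffix_trie("ababc")
print(num_occurrences(trie, "ab"), num_occurrences(trie, "ba"))   # 2 1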

5.3.3 Suffix Links

It is often convenient to augment the suffix tree with additional useful pointers called suffix links

(McCreight, 1976). Given a suffix tree S(x) we can define the suffix links as follows (Giegerich

and Kurtz, 1997)

Definition 18 Let aw be an internal node in S(x), and v be the longest suffix of w such that

v ∈ nodes(S(x)). An unlabeled edge aw → v is called a suffix link in S(x). A suffix link of the

form aw → w is called atomic.

Suffix links are denoted by dotted lines in Figure 5.1. It can be shown that all the suffix links

in a suffix tree are atomic (Giegerich and Kurtz, 1997, cf. proposition 2.9). For a node aw we

define shift(aw) as the node w which is reached by following its suffix link. We add suffix links to

S(x), to allow us to perform efficient string matching: suppose we found that aw is a substring

of x by parsing the suffix tree S(x). It is clear that w is also a substring of x. If we want to

locate the node corresponding to w, it would be wasteful to parse the tree again. Suffix links

can help us locate this node in constant time. The suffix tree building algorithms make use of

this property of suffix links to perform the construction in linear time.

5.3.4 Efficient Construction

It is well known that the suffix tree of a string x can be built in time linear in |x| (Weiner, 1973, McCreight, 1976, Ukkonen, 1995). The Ukkonen (1995) algorithm is on-line, i.e., it builds the suffix tree incrementally, while McCreight (1976) and Weiner (1973) are offline algorithms, i.e., they need to scan the entire string in order to build the suffix tree. For our application we

use the McCreight (1976) algorithm because we assume the input strings to be available a

priori. Furthermore, the McCreight (1976) algorithm can also build the suffix links during the

construction of the suffix tree and these suffix links play a vital role in our algorithm.


5.3.5 Merging Suffix Trees

Given a set X of strings, their suffix tree is denoted as S(X ). Given two suffix trees S(x) and S(y)

we define a merged suffix tree S′(x, y) such that nodes(S′(x, y)) = nodes(S(x))⋃

nodes(S(y))

or in other words S′(x, y) = S({x, y}). Let x and y be two strings; Algorithm 5.1 constructs

S({x, y}) in O(|x|+ |y|) time. The analysis of the algorithm is easy. Construction of S(w) takes

O(|x|+ |y|) time. The pruning of edges can be easily done in linear time with a DFS on the suffix

tree and hence the entire algorithm takes O(|x|+ |y|) time. This idea can be applied recursively

in order to construct S(X ).

A slightly more general problem is to construct the suffix tree of a dictionary of strings

x1, x2, . . . xk, which can be updated by adding or removing strings. This is achieved by using

a modification of the McCreight (1976) algorithm using ideas similar to those outlined above

by the Amir et al. (1994) algorithm. Their algorithm also handles deletion of strings from the

dictionary and hence can be used by algorithms like the SimpleSVM (cf. Chapter 2) which

maintain an active set and perform additions and deletions on the active set.

Algorithm 5.1: Merging Suffix Trees
input: x, y
output: S′(x, y)
1: w ← x#y$   {# and $ are unique sentinel characters}
2: S(w) = ConstructSuffixTree(w)
3: for all e ∈ edges(S(w)) do
4:   if e = a#y$ then
5:     e = a$
6:   end if
7: end for

5.4 Algorithm for Calculating Matching Statistics

In this section we briefly review the linear time algorithm for computing matching statistics as

described in Chang and Lawler (1994). We define the concept of matching substrings. We then

present two lemmas which characterize the common substrings between two given strings using

their matching statistics.


5.4.1 Definition of Matching Statistics

Given strings x, y with |x| = n and |y| = m, the matching statistics of x with respect to y are defined by two vectors v and c of length n. v is a vector of integers such that v_i (the i-th component of v) is the length of the longest substring of y matching a prefix of x[i : n]; for brevity we write x[i : v_i] for this longest matching prefix x[i : i + v_i − 1]. c is a vector of pointers such that c_i (the i-th component of c) is a pointer to ceiling(x[i : v_i]) in S(y). For an example see Table 5.1.

Table 5.1: Matching statistics of abba with respect to S(ababc).

String               a    b    b        a
v_i                  2    1    2        1
ceiling(x[i : v_i])  ab   b    babc$    ab

5.4.2 Matching Statistics Algorithm

Here we try to present an intuitive idea behind the matching statistics algorithm. Applications

of the algorithm to approximate string matching can be found in Chang and Lawler (1994). For

a given y, one can construct v and c corresponding to x in linear time. The key observation

is that vi+1 ≥ vi − 1, since, if x[i : vi] is a substring of y then definitely x[i + 1 : vi] is also a

substring of y. Besides this, the matching substring in y that we find, must have x[i+1 : vi] as a

prefix. The matching statistics algorithm of Chang and Lawler (1994) exploits this observation

and uses it to cleverly walk down the suffix links of S(y) in order to compute the matching

statistics in O(|x|) time.

More specifically, the algorithm works by maintaining a pointer pi = floor(x[i : vi]). It then

finds pi+1 = floor(x[i+ 1 : vi]) by first walking down the suffix link of pi and then walking down

the edges corresponding to the remaining portion of x[i+1 : vi] until it reaches floor(x[i+1 : vi]).

Now, vi+1 can be found easily by walking from pi+1 along the edges of S(y) that match the string

x[i + 1 : n], until we can go no further. The value of v_1 is found by simply walking down S(y)

to find the longest prefix of x which matches a substring of y.
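A naive quadratic-time reference implementation of the vector v makes the definition (and Table 5.1) easy to check; it probes each prefix of x[i : n] against y directly and is not the linear-time suffix-link walk described above.

def matching_statistics(x, y):
    # v[i]: length of the longest prefix of x[i:] occurring as a substring of y.
    # Naive O(|x|^2 * |y|) version, only for checking small examples.
    n = len(x)
    v = []
    for i in range(n):
        length = 0
        while i + length < n and x[i:i + length + 1] in y:
            length += 1
        v.append(length)
    return v

print(matching_statistics("abba", "ababc"))   # [2, 1, 2, 1], cf. Table 5.1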

5.4.3 Matching Substrings

Given a text string x of length n and a pattern string y of length m, the set of matching

substrings is defined as the set of all common substrings of x and y. Using v and c we can

read off the number of matching substrings in x and y. This is because the only substrings


which occur in both x and y are those which are prefixes of x[i : vi] for some i. The number

of occurrences of a substring in y can be found by lvs(ceiling(w)) (cf. Section 5.3.1). The two

lemmas below formalize this.

Lemma 19 w is a substring of x iff there is an i such that w is a prefix of x[i : n]. The number

of occurrences of w in x can be calculated by finding all such i.

Proof The proof is elementary. If w is a substring of x then w = x[i : i + |w| − 1] for some i

and hence w is a prefix of x[i : n]. Conversely every substring of the string x[i : n] (for some i)

is a substring of x and hence w is also a substring of x. It is also clear that every occurrence

of w in x satisfies the above property and hence the number of occurrences of w in x can be

calculated by counting such i which satisfy the above property.

Lemma 20 The set of matching substrings of x and y is the set of all prefixes of x[i : vi].

Proof Let w be a substring of both x and y. By above lemma there is an i such that w is a

prefix of x[i : n]. Since, vi is the length of the maximal prefix of x[i : n] which is a substring in

y, it follows that vi ≥ |w|. Hence, w must be a prefix of x[i : vi].

5.5 Our Algorithm for String Kernels

In this section we present a fast implementation strategy for the string kernels that we introduced

in Section 5.2. We make use of suffix trees that we introduced in Section 5.3 and a modified

form of the Matching Statistics algorithm that we introduced in Section 5.4.

5.5.1 Our Algorithm

A naive O(|x| \cdot |x'|) time algorithm for enumerating the matching substrings of two strings is easy to obtain. The dynamic programming approach outlined in Herbrich (2002) also requires O(|x| \cdot |x'|) time. If the weights of the various substrings are completely independent of each other, we must consider the weight contribution of each substring individually and cannot do better than the naive algorithm, i.e., we require at least O(|x| \cdot |x'|) time. In most practical applications


the weights on the substrings have some relation with each other and by exploiting this relation

we can compute the string kernel in linear time.

From the previous sections we know how to determine the set of all longest prefixes x[i : vi]

of x[i : n] in y in linear time. The following theorem uses this information to compute kernels

efficiently.

Theorem 21 Let x and y be strings and c and v be the matching statistics of x with respect

to y. Assume that floor(.), ceiling(.), father(.) and lvs(.) are computed on S(y). Furthermore

assume that

W(y, t) = \Big( \sum_{s \in \text{prefix}(z)} w_{us} \Big) - w_u \quad \text{where } u = \text{floor}(t) \text{ and } t = uz \qquad (5.2)

can be computed in constant time for any t. Then k(x, y) defined in Equation (5.1) can be

computed in O(|x|+ |y|) time as

k(x, y) = \sum_{i=1}^{|x|} \text{val}(x[i : v_i]) = \sum_{i=1}^{|x|} \big( \text{val}(\text{father}(c_i)) + \text{lvs}(\text{ceiling}(x[i : v_i])) \cdot W(y, x[i : v_i]) \big) \qquad (5.3)

where val(t) := lvs(ceiling(t)) · W(y, t) + val(floor(t)) and val(root) := 0.

Proof We first show that Equation (5.3) can indeed be computed in linear time. We know that

for S(y) the number of leaves can be computed in linear time and likewise c, v. By assumption

on W(y, t) and by exploiting the recursive nature of val(t), we can compute val(w) for all nodes w of S(y) by a simple top-down procedure in O(|y|) time.

Also, due to recursion, the second equality of Equation (5.3) holds and we may compute each

term in constant time by a simple lookup for val(father(ci)) and computation of W (y, x[i : vi]).

Since we have |x| terms, the whole procedure takes O(|x|) time, which proves the O(|x| + |y|)

time complexity.

Now, we prove that Equation (5.3) really computes the kernel. We know from Lemma 20

that the sum in Equation (5.1) can be decomposed into the sum over matches between y and

each of the prefixes of x[i : vi] (this takes care of all the substrings in x matching with y). This

reduces the problem to showing that each term in the sum of Equation (5.3) corresponds to the

contribution of all prefixes of x[i : vi].

Assume we descend down the path x[i : vi] in S(y) (e.g., for the string bab with respect


to the tree of Figure 5.1 this would correspond to (root, b, bab)), then each of the prefixes t

along the path (e.g., (’’, b, ba, bab) for the example tree) occurs exactly as many times as

lvs(ceiling(t)) does. In particular, prefixes ending on the same edge occur the same number of

times. This allows us to bracket the sums efficiently, and W (y, x) simply is the sum along an

edge, starting from the floor of x to x. Unwrapping val(x) shows that this is simply the sum

over the occurrences on the path of x, which proves our claim.

Our algorithm works by a slight modification of the matching statistics algorithm and is illus-

trated in Algorithm 5.2. The algorithm takes as input an annotated suffix tree S(y). At the end

of line 18 we know that the longest matching prefix of x[i : n] which is a substring in y is x[i : k].

By Lemma 20 we know that the only possible matching substrings of x and y are the prefixes

of x[i : k] for i = 1, . . . , |x|. Each occurrence of the string x[i : l], where l ≤ k, contributes weight w_{x[i:l]} to the kernel. But the number of occurrences of x[i : l] is simply lvs(ceiling(x[i : l])), so its total weight contribution is lvs(ceiling(x[i : l])) × w_{x[i:l]}. By the annotation of nodes of S(y)

we know that val(v) denotes the total weight contribution due to x[i : j]. Now, the number of

occurrences of x[i : j + 1], x[i : j + 2] . . . x[i : k] is simply lvs(son(v, x(j + 1))) because they lie

on the same edge and hence share the same ceiling node. Therefore, the contribution due to

x[i : j + 1], x[i : j + 2], . . . , x[i : k] is given by W (y, x[i : k]) ∗ lvs(ceiling(x[i : k])). The algorithm

sums up the contribution due to x[i : j] in line 19 and the contribution due to x[j + 1 : k] in

line 21 and repeats this for all x[i : k], i = 1, . . . , |x|. Hence, by the definition of k(x, y) in

Equation (5.1), val = k(x, y) in line 32.

5.6 Weights and Kernels

So far, our claim hinges on the fact that W (y, t) can be computed in constant time, which is far

from obvious at first glance. We now show that this is a reasonable assumption in all practical

cases.

Length Dependent Weights If the weights w_s depend only on |s| we have w_s = w_{|s|}. Define \omega_j := \sum_{i=1}^{j} w_i and compute its values beforehand up to \omega_J, where J ≥ |x| for all x. Then it

follows that

W(y, t) = \sum_{j=|\mathrm{floor}(t)|}^{|t|} w_j - w_{|\mathrm{floor}(t)|} = \omega_{|t|} - \omega_{|\mathrm{floor}(t)|} \qquad (5.4)


Algorithm 5.2: String Kernel Calculation
input: x, y, S(y)
output: k(x, y)
1: Let v ← root(S(y))
2: Let j ← 1
3: Let k ← 1
4: Let val ← 0
5: for i = 1 to |x| do
6:   while (j < k) and (j + len(v, x(j)) ≤ k) do
7:     v ← son(v, x(j))
8:     j ← j + len(v, x(j))
9:   end while
10:   if (j = k) then
11:     while son(v, x(j)) exists and (x(k) = y(first(v, x(j)) + k − j)) do
12:       k ← k + 1
13:       if (j + len(v, x(j)) = k) then
14:         v ← son(v, x(j))
15:         j ← k
16:       end if
17:     end while
18:   end if
19:   val ← val + val(v)
20:   if (j < k) then
21:     val ← val + lvs(ceiling(x[i : k])) · W(y, x[i : k])
22:   end if
23:   if v = root(S(y)) then
24:     j ← j + 1
25:     if (j = k) then
26:       k ← k + 1
27:     end if
28:   else
29:     v ← shift(v)
30:   end if
31: end for
32: return val


which can be computed in constant time. Examples of such weighting schemes are the kernels suggested by Watkins (2000), where w_i = \lambda^{-i}, Haussler (1999), where w_i = 1, and Joachims (1999), where w_i = \delta_{1i}.
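As an illustration of Equation (5.4), the following minimal Python sketch precomputes the cumulative weights ω_j and then answers W(y, t) in constant time, given only the lengths |t| and |floor(t)|. The helper names (make_W and the example weight function) are illustrative only and not part of any implementation described in this thesis.

# Minimal sketch: constant-time W(y, t) for length-dependent weights w_j,
# using the cumulative sums of Equation (5.4).
def make_W(w, J):
    """w(j) gives the weight of a substring of length j; J bounds |x|."""
    omega = [0.0] * (J + 1)
    for j in range(1, J + 1):
        omega[j] = omega[j - 1] + w(j)        # omega_j = sum_{i <= j} w(i)

    def W(len_t, len_floor_t):
        # W(y, t) = omega_{|t|} - omega_{|floor(t)|}
        return omega[len_t] - omega[len_floor_t]

    return W

# Example with exponentially decaying weights w_j = 0.75 ** j
W = make_W(lambda j: 0.75 ** j, J=1000)
print(W(5, 2))   # weight mass contributed by the prefixes of length 3, 4 and 5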

Generic Weights In case of generic weights, we have several options: recall that one often

will want to compute m2 kernels k(x, x′), given m strings x ∈ X . Hence, we could build the suffix

trees for xi beforehand and annotate each of the nodes and characters on the edges explicitly

(at super-linear cost per string), which means that later, for the dot products, we will only need

to perform table lookup of W (x, x′[i : vi]).

However, there is an even more efficient mechanism, which can even deal with dynamic

weights, depending on the relative frequency of occurrence of the substrings in X . We can build

the suffix tree S(X ) in time linear in the total length of all the strings. It can be shown that for

all x and all i, x[i : vi] will be a node in this tree (Vishwanathan and Smola, 2002). This implies

that it is sufficient to annotate each node of S(X ) with the value of W (.) in order to compute

each one of the m2 kernels. Since there are at most O(m) such nodes we need to perform only

O(m) annotations. Another advantage of this scheme is that counting leaves allows us to obtain the number of occurrences of a substring in X. This information is extremely useful when we

want to assign weights to substrings based on their frequency of occurrence.

Now make the reasonable assumption that w_s = ρ(freq(s)) · φ(|s|), that is, that w_s is a function of the length and the frequency only. Note that all the strings ending on the same edge in S(X) have the same frequency of occurrence (cf. Section 5.3.2). Hence, we can rewrite Equation (5.2) as

W(y, t) = \Big( \sum_{s \in \mathrm{prefix}(z)} w_{us} \Big) - w_u = \rho(\mathrm{freq}(t)) \sum_{i=|\mathrm{floor}(t)|+1}^{|t|} \phi(i) - w_u \qquad (5.5)

where u = floor(t) and t = uz. By pre-computing the partial sums \sum_i \phi(i) we can evaluate Equation (5.5) in

constant time. The benefit of Equation (5.5) is twofold: we can compute the weights of all the

nodes of S(X ) in time linear in the total length of strings in X . Secondly, for arbitrary x we can

compute W (y, t) in constant time, thus allowing us to compute k(xi, x′) in O(|xi|+ |x′|) time.


5.7 Linear Time Prediction

Let X s = {x1, x2, . . . , xm} be the set of Support Vectors. Recall that, for prediction in a Support

Vector Machine we need to compute

f(x) = \sum_{i=1}^{m} \alpha_i k(x_i, x),

which implies that we need to combine the contribution due to matching substrings from each

one of the Support Vectors. We first construct S(X s) in linear time by using the Amir et al.

(1994) algorithm. In S(X s), we associate weight αi with each leaf associated with a Support

Vector xi. For a node v ∈ nodes(S(X s)) we modify the definition of lvs(v) as the sum of weights

associated with the subtree rooted at node v. Given the suffix tree S(y) the matching statistics of

a string x with respect to string y can be computed in O(|x|) time (cf. Section 5.4.2). Similarly,

given the suffix tree S(X s) we can find the matching statistics of string x with respect to all

strings in X s in O(|x|) time. Now, Algorithm 5.2 can be applied unchanged to compute f(x).

As is clear, our algorithm runs in time linear in the size of x and is independent of the size of

X s. To prove its correctness we use the following generalized definition of a matching substring.

Lemma 22 w is a matching substring of X^s and x iff w \sqsubseteq x and there exists an x_i ∈ X^s such that w \sqsubseteq x_i.

We now extend Lemma 20 by incorporating the above definition.

Lemma 23 The set of matching substrings of X s and x is the set of all prefixes of x[i : vi].

Proof Let w be a matching substring of X s and x. By Lemma 19 there is an i such that w is

a prefix of x[i : n]. Since, vi is the length of the maximal prefix of x[i : n] which is a substring

in some xj ∈ X s, it follows that vi ≥ |w|. Hence, w must be a prefix of x[i : vi].

The above lemma shows that we can apply Theorem 21 and use Algorithm 5.2 to compute f(x)

as long as we can bracket the weights efficiently. Rewriting f(x) using Equation (5.1)

f(x) = \sum_{i=1}^{m} \alpha_i \sum_{s \sqsubseteq x,\, s' \sqsubseteq y} w_s \delta_{s,s'} = \sum_{s \sqsubseteq x,\, s' \sqsubseteq y} \sum_{i=1}^{m} (\alpha_i w_s) \delta_{s,s'}, \qquad (5.6)


we notice that each occurrence of a substring s instead of contributing a weight ws now con-

tributes a weight of αiws, which is taken into account by the modified definition of lvs(v), hence,

proving the correctness of our algorithm.

In the case of the SimpleSVM (cf. Chapter 2) the number of Support Vectors changes dynamically, i.e., Support Vectors are either added or deleted during each iteration. In such cases too, we can insert and delete Support Vectors from X^s in time linear in the size of the Support

Vector string by using the Amir et al. (1994) algorithm and perform prediction in time linear in

the size of the string to classify.
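For reference, a brute-force version of the prediction step can be written down directly; the sketch below (our illustrative Python, not the suffix-tree implementation described above) computes f(x) = Σ_i α_i k(x_i, x) with a naive substring-counting kernel standing in for k. It runs in time quadratic in the string lengths and is only meant for checking small cases against the linear-time scheme.

# Naive reference for SVM prediction f(x) = sum_i alpha_i * k(x_i, x).
def substring_kernel(x, y, w=lambda s: 1.0):
    """k(x, y) = sum over matching substring occurrences, weighted by w(s); naive version."""
    subs_x = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    val = 0.0
    for s in subs_x:
        occ_x = sum(x[i:i + len(s)] == s for i in range(len(x)))
        occ_y = sum(y[i:i + len(s)] == s for i in range(len(y)))
        val += w(s) * occ_x * occ_y
    return val

def predict(support_vectors, alphas, x):
    return sum(a * substring_kernel(sv, x) for sv, a in zip(support_vectors, alphas))

print(predict(["abab", "bab"], [1.0, -0.5], "ab"))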

5.8 Tree Kernels

We begin by introducing some notation. A tree is defined as a connected directed graph with no

cycles. A node with no children is referred to as a leaf. A subtree rooted at node n is denoted as

Tn. If a set of nodes in the tree along with the corresponding edges forms a tree then we define

it to be a subset tree. We use the notation T ′ |= T to indicate that T ′ is a subset tree of T . If

every node n of the tree contains a label, denoted by label(n), then the tree is called a labeled

tree. If only the leaf nodes contain labels then the tree is called a leaf-labeled tree. Kernels on

trees can be defined by defining kernels on matching subset trees. This is more general than

matching subtrees proposed in Collins and Duffy (2001). Here we have

k(T, T') = \sum_{t \models T,\, t' \models T'} w_t \delta_{t,t'}. \qquad (5.7)

That is, we count the number of occurrences of every subset tree t in both T and T ′ and weight

it by wt. Setting wt = 0 for all |t| > 1 yields the bag-of-nodes kernel, which simply counts

matching nodes. Setting w_t = 0 for all subset trees that are not complete subtrees under some

node leads us to subtree kernels. They will be the main emphasis of this chapter.

5.8.1 Ordering Trees

An ordered tree is one in which the child nodes of every node are ordered as per the ordering

defined on the node labels. Unless there is a specific inherent order on the trees we are given

(which is, e.g., the case for parse-trees), the representation of trees is not unique. For instance,

the following two unlabeled trees are equivalent and can be obtained from each other by reordering


the nodes.


Figure 5.2: Two equivalent trees

To order trees we assume that a lexicographic order is associated with the labels if they

exist. Furthermore, we assume that the additional symbols '[', ']' satisfy '[' < ']', and that ']', '[' < label(n) for all labels. We will use these symbols to define tags for each node as follows:

• For an unlabeled leaf n define tag(n) := [].

• For a labeled leaf n define tag(n) := [ label(n)].

• For an unlabeled node n with children n1, . . . , nc sort the tags of the children in lexico-

graphical order such that tag(ni) ≤ tag(nj) if i < j and define

tag(n) = [ tag(n1) tag(n2) . . . tag(nc)].

• For a labeled node perform the same operations as above and set

tag(n) = [ label(n) tag(n1) tag(n2) . . . tag(nc)].

For instance, the root nodes of both trees depicted above would be encoded as [[][[][]]]. We now

prove that the tag of the root node, indeed, is a unique identifier and that it can be constructed

in log linear time.
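The tag construction can be sketched directly. The following Python fragment is illustrative only; it represents trees as nested (label, children) pairs rather than the data structures used in this thesis, and the function name tag is our own.

# Illustrative sketch of the tag construction (not the thesis implementation).
# A tree is a pair (label, children); label may be None for unlabeled nodes.
def tag(node):
    label, children = node
    child_tags = sorted(tag(c) for c in children)   # lexicographic order of children tags
    inner = ("" if label is None else str(label)) + "".join(child_tags)
    return "[" + inner + "]"

# The two equivalent unlabeled trees of Figure 5.2 receive the same tag.
leaf = (None, [])
t1 = (None, [leaf, (None, [leaf, leaf])])
t2 = (None, [(None, [leaf, leaf]), leaf])
print(tag(t1), tag(t2), tag(t1) == tag(t2))   # identical tags; order of children is irrelevant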

Theorem 24 Denote by T a binary tree with l nodes and let λ be the maximum length of a

label. Then the following properties hold for the tag of the root node:

1. tag(root) can be computed in (λ+ 2)(l log2 l) time and linear storage in l.

2. Substrings s of tag(root) starting with '[' and ending with a balanced ']' correspond to

subtrees T ′ of T where s is the tag on T ′.


3. Arbitrary substrings s of tag(root) correspond to subset trees T ′ of T .

4. tag(root) is invariant under permutations of the leaves and allows the reconstruction of a unique element of the equivalence class (under permutation).

Proof We prove claim 1 by induction. The tag of a leaf can be constructed in constant time

by storing [, ], and a pointer to the label of the leaf (if it exists), that is in 3 operations. Next

assume that we are at node n, with children n1, n2. Let Tn contain ln nodes and Tn1 and Tn2

contain l1, l2 nodes respectively. By our induction assumption we can construct the tag for n1

and n2 in (λ + 2)(l1 log2 l1) and (λ + 2)(l2 log2 l2) time respectively. Comparing the tags of

n1 and n2 costs at most (λ + 2)min(l1, l2) operations and the tag itself can be constructed in

constant time and linear space by manipulating pointers. Without loss of generality we assume

that l1 ≤ l2. Thus, the time required to construct tag(n) (normalized by λ+ 2) is

l_1(\log_2 l_1 + 1) + l_2 \log_2 l_2 = l_1 \log_2(2 l_1) + l_2 \log_2 l_2 \le l_n \log_2 l_n. \qquad (5.8)

One way of visualizing our ordering is by imagining that we perform a DFS on the tree T and

emit a '[' followed by the label on the node, when we visit a node for the first time and a ']'

when we leave a node for the last time. It is clear that a balanced substring s of tag(root) is

emitted only when the corresponding DFS on T ′ is completed. This proves claim 2.

We can emit a substring of tag(root) only if we can perform a DFS on the corresponding

set of nodes. This implies that these nodes constitute a tree and hence by definition are subset

trees of T . This proves claim 3.

Since, leaf nodes do not have children their tag is clearly invariant under permutation. For

an internal node we perform lexicographic sorting on the tags of its children. This removes any

dependence on permutations. This proves the invariance of tag(root) under permutations of the

leaves. Concerning the reconstruction, we proceed as follows: each tag of a subtree starts with

‘[′ and ends in a balanced ‘]′, hence we can strip the first [] pair from the tag, take whatever is

left outside brackets as the label of the root node, and repeat the procedure with the balanced

[. . .] entries for the children of the root node. This will construct a tree with the same tag as

tag(root), thus proving claim 4.

An extension to trees of degree d (nodes with up to d children) is straightforward (the cost increases to d \log_2 d times the original cost), yet the proof, in particular Equation (5.8), becomes more technical without providing


additional insight, hence we omit this generalization for brevity.

Corollary 25 Kernels on trees T, T ′ can be computed via string kernels, if we use tag(T ), tag(T ′)

as strings. If we require that only balanced [. . .] substrings have nonzero weight ws then we obtain

the subtree matching kernel defined in Collins and Duffy (2001).

This reduces the problem of tree kernels to string kernels and as we have already seen string ker-

nels can be computed efficiently in linear time. Hence, the tree kernel k(T, T ′) can be computed

in O(l log(l)) time where l is the total number of nodes in T and T ′.

5.8.2 Coarsening

We now turn our attention to efficient evaluation of coarsening levels defined in Section 4.5.3.

It is clear that calculating the general tree kernel defined in Equation (4.10) is very difficult.

Instead we focus our attention on the kernel defined in Equation (4.11). Let S_T represent a string corresponding to an unlabeled tree T. To obtain S_{T_1} we simply need to delete all occurrences of [∗], where ∗ is a wildcard representing the label on a leaf node. By recursively applying this procedure d times we can obtain S_{T_d}. To calculate S_{T_d} from S_{T_{d-1}} we require O(|S_{T_{d-1}}|) time.

Given two trees T and T' we can compute the kernel defined in Equation (4.11) by using Algorithm 5.3. The while loop requires O(min(|S_T|, |S_{T'}|)) time per execution and it executes O(min(|S_T|, |S_{T'}|)) times. Thus the whole algorithm scales quadratically in the size of the trees. But if we assume that at each deletion the lengths of S_T and S_{T'} shrink to at most a fraction λ < 1 of their previous values, then the algorithm is linear in the size of the trees, and the total work done by the algorithm in this case is

\sum_i (|S_T| + |S_{T'}|) \lambda^i = \frac{1}{1-\lambda} (|S_T| + |S_{T'}|). \qquad (5.9)
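A minimal Python sketch of this coarsening loop is given below (illustrative only; it assumes some string-kernel routine is supplied, here a naive common-substring counter, and it implements the leaf deletion on the tag strings with a regular expression; the names coarse_kernel and naive_kernel are our own).

import re

def coarse_kernel(s_t, s_tp, weights, string_kernel):
    """Weighted sum of string kernels over coarsening levels (cf. Algorithm 5.3)."""
    val = 0.0
    for w_i in weights:
        if not s_t or not s_tp:
            break
        val += w_i * string_kernel(s_t, s_tp)
        # delete every current leaf tag [label]; this realizes one coarsening step
        s_t = re.sub(r"\[[^\[\]]*\]", "", s_t)
        s_tp = re.sub(r"\[[^\[\]]*\]", "", s_tp)
    return val

def naive_kernel(x, y):
    """Toy stand-in kernel: number of distinct substrings of x that also occur in y."""
    subs = {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}
    return sum(1 for s in subs if s in y)

print(coarse_kernel("[[a][[b][c]]]", "[[a][b]]", [1.0, 0.5, 0.25], naive_kernel))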

5.9 Experimental Results

The main point of this chapter is to introduce a novel means of evaluating string kernels efficiently

and to present theoretical guarantees on its time complexity. Hence, we do not concentrate our

efforts on presenting a wide variety of applications (separate papers focusing on applications

are currently under preparation). Here we present a proof of concept application for a remote


Algorithm 5.3: Calculating k_coarse(T, T')
input: S_T, S_{T'}
output: k_coarse(T, T')
1: Let val ← 0
2: while S_T ≠ null and S_{T'} ≠ null do
3:   val ← val + W_i × k(S_T, S_{T'})
4:   S_T ← delete(S_T, [∗])
5:   S_{T'} ← delete(S_{T'}, [∗])
6: end while
7: return val

homology detection problem from Jaakkola et al. (2000). Details of the experiments and data

sets are available at www.cse.ucsc.edu/research/compbio/discriminative. We use a length

weighted kernel and assign a weight λ^l to all matches of length l greater than 3. We use a

publicly available SVM software implementation (www.cs.columbia.edu/compbio/svm), which

implements a soft margin optimization algorithm.

The ROC50 score is the area under the receiver operating characteristic curve (the plot of

true positives as a function of false positives) up to the first 50 false positives. A score of 1

indicates perfect separation of positives from negatives, whereas a score of 0 indicates that none

of the top 50 sequences selected by the algorithm were positives (Gribskov and Robinson, 1996).

Since this is a proof of concept implementation no parameter tuning of the soft margin SVM

was performed. We experimented with various values of λ ∈ {0.25, 0.5, 0.75, 0.9} and report

our best results for λ = 0.75 here. The ROC50 scores are compared with the spectrum kernel

algorithm of Leslie et al. (2002a) (they report best results for k = 3) in Figure 5.3. We also

observed that our kernel outperforms the spectrum kernel on nearly every family in the dataset and for all values of λ that we tested.

It should be noted that this is the first method that allows users to specify fairly arbitrary weights for all possible lengths of matching sequences and still compute the kernel in O(|x| + |x'|) time, and to predict on new sequences in O(|x|) time once the set of Support Vectors is established.


Figure 5.3: Total number of families for which a SVM classifier exceeds a ROC50 score threshold

5.10 Summary

We have shown that string kernels need not come at a super-linear cost and that prediction can

be carried out at cost linear only in the length of the argument, thus providing optimal run-time

behavior. Furthermore the same algorithm can be applied to trees as well.

We consider coarsening levels for trees by removing some of the leaves. For trees that are not too unbalanced (we assume that the tree shrinks by at least a constant factor at each coarsening), computation of the kernel over all coarsening levels can then be carried out at cost still

linear in the overall size of the tree. The coarsening ideas can be extended to approximate string

matching. If we remove characters, this amounts to the use of wildcards.

Likewise, we can consider the strings generated by finite state machines and thereby compare

the finite state machines themselves. This leads to kernels on Automata and other dynamical

systems. More details and extensions can be found in Chapter 4 as well as in Vishwanathan and

Smola (2002).

Applications of our string kernel to areas as diverse as spam filtering, network intrusion detection, web search engines, document retrieval, bio-informatics and finite state transducers are topics of active research.

Chapter 6

Kernels and Dynamic Systems

This chapter explores the relationship between dynamic systems and kernels. Kernels on various

kinds of dynamic systems are defined, including Markov chains (both discrete and continuous), diffusion processes on graphs and Markov chains, Finite State Automata (FSA), and various linear time-invariant systems. Trajectories are used to define kernels induced on initial

conditions by the underlying dynamic system. The same idea is extended to define kernels on

a dynamic system with respect to a set of initial conditions. This framework leads to a large

number of novel kernels and also generalizes many previously proposed kernels.

In Section 6.1 we introduce and motivate the problem and present a generic model relating

kernels and dynamic systems. In Section 6.2 we review basic facts from control theory pertaining

to linear time-invariant systems (both discrete and continuous). In Section 6.3 we study kernels

defined using linear time-invariant systems and consider various special cases that arise out of our

model and their relation to previously proposed kernels. In Section 6.4 we study kernels defined

on linear time-invariant systems. We conclude with a discussion and summary in Section 6.5.

This chapter requires basic knowledge of kernels as discussed in Section 1.3. Knowledge

of elementary concepts from probability theory is assumed throughout the chapter (cf. Feller

(1950)). Section 6.2 reviews a few concepts from control theory related to linear time-invariant

systems. For a more complete review and detailed derivations see Luenberger (1979). Familiarity

with basic concepts from Markov chains, graphs, linear algebra and linear differential equations

is assumed throughout this chapter.



6.1 Introduction

Assume that some pre-defined model of the data is already available to us. For instance, this

model could encode our domain knowledge about the underlying probability distribution from

which data points are drawn (cf. Section 1.1.1). We typically want to use this domain knowledge

to define meaningful kernels. One way to do this is to compare the way in which the given model

explains the two given data points. For example, given a Hidden Markov Model (HMM), and

a point drawn from the same underlying probability distribution modeled by the HMM, there

exists one or more paths (with non zero probabilities) by which the HMM can generate that

point. Given two points, a meaningful data dependent kernel can compare similarities between

such paths.

Consider another variation of the same theme. When performing prediction based on the

initial conditions (e.g., the future performance of a stock, the similarity between musical tunes,

etc.) of a dynamic system we expect that future trajectories will be relevant for prediction.

While we may not have an exact estimate of the future behavior, we still may be able to find a

crude description of the dynamics and would like to use the latter as prior knowledge.

Likewise, given two models of data we may want to compare their similarities. This is

very useful when we want to make qualitative statements about similarities between the two

underlying data distributions. Given two dynamic systems, e.g., Markov chains, under typical

conditions, we would like to call them similar, if their time-evolution properties resemble each

other under a somewhat restricted set of conditions. For example, if the two Markov chains

converge to the same stationary distribution for a restricted set of initial distributions, which

are of interest to us, we would like to call them similar.

In the following, we assume that the initial conditions x0 of a dynamic system can be

embedded in a Hilbert space. For instance, x0 could be the position, speed, and acceleration of

a particle, the states of a Markov model, or the strings generated by an FSA.

Model Definition Assume, we are given a dynamic system, denoted by D ∈ D, which trans-

forms an initial state x0 ∈ X at time t = 0 into x(t,D, x0, ξ(t)) ∈ X , where t ∈ T (and T = N

or T = R) and ξ is some random variable (for example a noise model). Moreover, we assume


that D, T ,X are measurable. Denote by

\chi : X \times D \to X^T \qquad (6.1)

the trajectory associated with x(t,D, x0, ξ(t)), then we can define a kernel on (D,x0) via

k((D, x_0), (D', x'_0)) := \langle \chi(D, x), \chi(D', x') \rangle = \int \langle x(t, D, x_0, \xi(t)),\, x(t, D', x'_0, \xi(t)) \rangle \, d\mu(t). \qquad (6.2)

In other words, it is defined as the time-averaged correlation between χ(D,x) and χ(D′, x′),

where, we are free to choose any measure µ(t) suitable for our purposes.

While this definition is sufficiently general, it is not very informative to compare trajectories

of different dynamic systems under different initial conditions. Much rather, one would like to

focus on one aspect and restrict the other, hence we define the two kernels

k_{\mathrm{initial}}(x, x') = \int k((D, x), (D, x')) \, d\mu(D) \qquad (6.3)

k_{\mathrm{dynamic}}(D, D') = \int k((D, x), (D', x)) \, d\mu(x). \qquad (6.4)

In the following we will give explicit examples of such kernels and describe how they can be

computed efficiently.

6.2 Linear Time-Invariant Systems

We begin by introducing an important case of dynamic systems namely linear time-invariant

systems. For x(t) ∈ Rn and A ∈ Rn×n we have

x(t+1) = A x(t) \quad \text{for discrete time} \qquad (6.5)

\frac{d}{dt} x(t) = A x(t) \quad \text{for continuous time.} \qquad (6.6)

In the case of a discrete linear time-invariant system x(t) = A^t x(0), while in the case of a continuous linear time-invariant dynamic system it is well known that x(t) = \exp(At)\, x(0) (cf. Luenberger (1979)). If we consider the effect of white Gaussian noise (normally distributed


with zero mean and variance σ²) we have

x(t+1) = A x(t) + \xi_t \quad \text{for discrete time} \qquad (6.7)

\frac{d}{dt} x(t) = A x(t) + \xi_t \quad \text{for continuous time.} \qquad (6.8)

For the discrete case we can find x(t) by solving the above difference equation to obtain

x(t) = A^t x(0) + \sum_{i=0}^{t-1} A^i \xi_i, \qquad (6.9)

while in the continuous case x(t) is given by (cf. Luenberger (1979, Chapter 2))

x(t) = \exp(At)\, x(0) + \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau. \qquad (6.10)

In the case of a non-homogeneous linear time-invariant system we have

x(t+1) = A(x(t) + a) + \xi_t, \qquad (6.11)

where a is a constant and, as before, ξ is zero-mean white Gaussian noise with variance σ². Here we can find x(t) using (cf. Luenberger (1979, Chapter 2))

x(t) = A^t x_a + \bar{a} + \sum_{i=0}^{t-1} A^i \xi_i, \qquad (6.12)

where

\bar{a} := (1 - A)^{-1} a, \qquad (6.13)

and

x_a := x(0) - (1 - A)^{-1} a = x(0) - \bar{a}. \qquad (6.14)
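As a quick numerical illustration of the closed form (6.9), the NumPy sketch below (our own toy code; all variable names are illustrative) simulates the recursion (6.7) and checks it against the explicit solution. The only subtlety is the indexing of the noise terms: the noise injected i steps before time t has been propagated through A exactly i times, which matches Equation (6.9) up to relabeling the i.i.d. noise terms.

import numpy as np
from numpy.linalg import matrix_power

rng = np.random.default_rng(0)
n, t_max, sigma = 3, 10, 0.1
A = 0.5 * rng.standard_normal((n, n))
x0 = rng.standard_normal(n)
xi = sigma * rng.standard_normal((t_max, n))   # noise terms xi_0, ..., xi_{t_max-1}

# Simulate the recursion x(t+1) = A x(t) + xi_t of Eq. (6.7)
x = x0.copy()
for t in range(t_max):
    x = A @ x + xi[t]

# Closed form of Eq. (6.9): A^t x(0) plus the propagated noise terms
closed = matrix_power(A, t_max) @ x0
closed += sum(matrix_power(A, i) @ xi[t_max - 1 - i] for i in range(t_max))
print(np.allclose(x, closed))   # True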

6.3 Kernels On Initial Conditions

Often, the transformations applied to the data, prior to computing dot products, can only

be specified implicitly by specifying the algorithm which will carry out the transformations.

For instance, we might consider all the transformations that a Non-Deterministic Finite State


Automata (NFA) might apply to a string and consider the resulting set of possible outcomes as

the set of features to compute a dot product with. Likewise, we could consider a Markov model

and study the evolution of the states over time, thereby comparing the similarity between various

states. Finally, we could study a diffusion process on a (directed) graph and infer similarities.

6.3.1 Discrete Time Systems

We can define kernels on discrete linear time-invariant systems using Equation (6.5) and Equa-

tion (6.3) to get

k(x, x') = \sum_{t=0}^{\infty} c_t (A^t x)^\top W (A^t x') = x^\top \left[ \sum_{t=0}^{\infty} c_t (A^t)^\top W A^t \right] x'. \qquad (6.15)

Here the c_t are arbitrary weights and W ∈ R^{n×n} is a covariance matrix. In general, Equation (6.15) will be difficult to compute in closed form and may not be well defined for most assignments of c_t. For specific types of c_t, however, such an expression can be found efficiently. The following lemma, which we state in a somewhat more general form, gives an expansion for c_t = λ^t for a specific range of values of λ.

Lemma 26 Let ‖A‖ denote the 2-norm of A ∈ R^{n×n}. Let A, B, W ∈ R^{n×n} be such that the singular values of A and B are bounded by Λ, i.e., ‖A‖, ‖B‖ ≤ Λ. Then, for all |λ| < 1/Λ² the series

M := \sum_{t=0}^{\infty} \lambda^t (A^t)^\top W B^t \qquad (6.16)

converges and M can be computed by solving \lambda A^\top M B + W = M.

Proof To show that M is well defined we use the triangle inequality, leading to

\|M\| = \left\| \sum_{t=0}^{\infty} \lambda^t (A^t)^\top W B^t \right\| \le \sum_{t=0}^{\infty} \left\| \lambda^t (A^t)^\top W B^t \right\| \le \sum_{t=0}^{\infty} (|\lambda| \Lambda^2)^t \|W\| = \frac{\|W\|}{1 - |\lambda| \Lambda^2}.

Next, we decompose M into the first term of the sum and the remainder to obtain

M = \lambda^0 (A^0)^\top W B^0 + \sum_{t=1}^{\infty} \lambda^t (A^t)^\top W B^t = W + \lambda A^\top \left[ \sum_{t=0}^{\infty} \lambda^t (A^t)^\top W B^t \right] B = W + \lambda A^\top M B.

This concludes our proof.


Corollary 27 The kernel k(x, x'), as defined in Equation (6.15) with c_t = λ^t, can be computed as

k(x, x') = x^\top M x' \quad \text{where} \quad \lambda A^\top M A + W = M. \qquad (6.17)

Furthermore, for c_t = \delta_{t,T} we have the somewhat trivial kernel

k(x, x') = x^\top (A^T)^\top W A^T x'. \qquad (6.18)

Note that Sylvester equations of the type

A X B^\top + C X D^\top = E, \qquad (6.19)

where A, B, C, D ∈ R^{n×n}, have been widely studied in control theory. Many software packages exist for solving them in O(n³) time (Gardiner et al., 1992, Hopkins, 2002).
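For small n, the fixed-point equation of Lemma 26 can also be solved directly by vectorization, using the identity vec(A^⊤ M B) = (B^⊤ ⊗ A^⊤) vec(M) for a column-major vec. The sketch below is our own illustrative code, a naive O(n⁶) alternative to the O(n³) Sylvester solvers cited above; the function names are not part of any library.

import numpy as np

def solve_M(A, B, W, lam):
    """Solve M = W + lam * A.T @ M @ B by vectorization (naive O(n^6))."""
    n = A.shape[0]
    K = np.kron(B.T, A.T)                      # vec(A.T M B) = K vec(M), column-major vec
    vec_m = np.linalg.solve(np.eye(n * n) - lam * K, W.flatten(order="F"))
    return vec_m.reshape((n, n), order="F")

def lti_kernel(A, W, lam, x, xp):
    """k(x, x') of Corollary 27 for a discrete LTI system; requires |lam| < 1/||A||^2."""
    M = solve_M(A, A, W, lam)
    return x @ M @ xp

# Toy usage
rng = np.random.default_rng(0)
A = 0.2 * rng.standard_normal((4, 4))
W = np.eye(4)
x, xp = rng.standard_normal(4), rng.standard_normal(4)
print(lti_kernel(A, W, 0.5, x, xp))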

We now extend our model to linear time-invariant systems with zero mean white Gaussian

noise and define the kernel as

k(x, x') = E_\xi \left[ \sum_{t=0}^{\infty} c_t \Big( A^t x + \sum_{i=0}^{t-1} A^i \xi_i \Big)^\top W \Big( A^t x' + \sum_{i=0}^{t-1} A^i \xi_i \Big) \right] \qquad (6.20)

The expectation with respect to the random variable ξ means that we are considering the whole

ensemble of trajectories arising due to a certain noise model for given values of x and x′. As

before, the kernel can be computed in closed form for c_t = λ^t, for certain values of λ (given by

Lemma 26). Since ξ is Gaussian white noise, and, x and all ξi are mutually independent we can

compute the contribution of ξi to the kernel independently. We first prove an auxiliary result.

Lemma 28 Let y ∈ R^n be a random vector such that C := E(y y^\top). Then, E(y^\top W y) = \mathrm{tr}(WC) for all W ∈ R^{n×n}.

Proof Since y^\top W y is a scalar, and since expectation and trace are both linear operators and hence commute, using \mathrm{tr}(AB) = \mathrm{tr}(BA) we can write

E(y^\top W y) = E(\mathrm{tr}(y^\top W y)) = E(\mathrm{tr}(W y y^\top)) = \mathrm{tr}(W E(y y^\top)) = \mathrm{tr}(WC).


Using Lemma 28 and M as defined in Lemma 26 we prove the following slightly more general result:

E_\xi \left[ \sum_{t=0}^{\infty} \lambda^t \Big( \sum_{i=0}^{t-1} A^i \xi_i \Big)^\top W \Big( \sum_{j=0}^{t-1} B^j \xi_j \Big) \right]
= E_\xi \left[ \sum_{t=0}^{\infty} \lambda^t \sum_{i=0}^{t-1} \big(A^i \xi_i\big)^\top W \big(B^i \xi_i\big) \right]
= \sigma^2 \, \mathrm{tr} \left[ \sum_{t=1}^{\infty} \lambda^t \sum_{i=0}^{t-1} (A^i)^\top W B^i \right]
= \sigma^2 \, \mathrm{tr} \left[ \sum_{i=0}^{\infty} \sum_{t=i+1}^{\infty} \lambda^t (A^i)^\top W B^i \right]
= \sigma^2 \, \mathrm{tr} \left[ \sum_{i=0}^{\infty} \frac{\lambda^i}{1-\lambda} (A^i)^\top W B^i \right]
= \frac{\sigma^2}{1-\lambda} \, \mathrm{tr}\, M \qquad (6.21)

Now, using Equation (6.21) and Equation (6.20), and noting that

E_\xi \left[ \sum_{t=0}^{\infty} \lambda^t \Big( \sum_{i=0}^{t-1} A^i \xi_i \Big)^\top W \big(A^t x'\big) \right] = 0

and

E_\xi \left[ \sum_{t=0}^{\infty} \lambda^t \big(A^t x\big)^\top W \Big( \sum_{i=0}^{t-1} A^i \xi_i \Big) \right] = 0

yields

k(x, x') = x^\top M x' + \frac{\sigma^2}{1-\lambda} \, \mathrm{tr}\, M \quad \text{where} \quad \lambda A^\top M A + W = M. \qquad (6.22)

6.3.2 Continuous Time Systems

We now turn our attention to a continuous linear time system. It follows from Equation (6.3)

and Equation (6.6) that we can define a kernel as

k(x, x') := x^\top \left[ \int_0^{\infty} c(t) \exp(At)^\top W \exp(At) \, dt \right] x' \qquad (6.23)

where c(t) is an arbitrary continuous weight function and W is a covariance matrix. As before, the integral in Equation (6.23) can be computed only for special values of c(t). One such tractable case is c(t) = e^{λt} (for a certain range of values of λ). Again, we prove a somewhat more general result which will come in handy in the following section as well.


Lemma 29 Let A, B, W ∈ R^{n×n} be such that ‖A‖, ‖B‖ ≤ Λ. Then, for all λ < −2Λ the integral

M := \int_0^{\infty} e^{\lambda t} \exp(At)^\top W \exp(Bt) \, dt \qquad (6.24)

converges and M is the solution of the Sylvester equation \Big(A^\top + \frac{\lambda}{2}\mathbf{1}\Big) M + M \Big(B + \frac{\lambda}{2}\mathbf{1}\Big) = -W.

Proof Convergence of the integral is established via the triangle inequality, that is

\|M\| \le \int_0^{\infty} e^{\lambda t} \left\| \exp(At)^\top W \exp(Bt) \right\| dt \le \int_0^{\infty} \exp((\lambda + 2\Lambda) t) \|W\| \, dt < \infty. \qquad (6.25)

Furthermore, we have

M = \int_0^{\infty} e^{t\lambda} e^{tA^\top} W e^{tB} \, dt \qquad (6.26)

= (A^\top)^{-1} e^{t\lambda} e^{tA^\top} W e^{tB} \Big|_0^{\infty} - \int_0^{\infty} (A^\top)^{-1} e^{t\lambda} e^{tA^\top} W e^{tB} (B + \lambda \mathbf{1}) \, dt \qquad (6.27)

= -(A^\top)^{-1} W - (A^\top)^{-1} M (B + \lambda \mathbf{1}). \qquad (6.28)

Here we obtained Equation (6.27) by partial integration and Equation (6.28) by realizing that

the integrand vanishes for t → ∞ (for suitable λ) in order to make the integral convergent.

Multiplication by A^\top shows that M satisfies

A^\top M + M B + \lambda M = -W. \qquad (6.29)

Rearranging terms and using the fact that multiples of the identity matrix commute with A and B proves

the claim.

Corollary 30 The kernel k(x, x'), as defined in Equation (6.23) with c(t) = e^{λt}, can be computed as

k(x, x') = x^\top M x' \quad \text{where} \quad \Big(A + \frac{\lambda}{2}\mathbf{1}\Big)^\top M + M \Big(A + \frac{\lambda}{2}\mathbf{1}\Big) = -W. \qquad (6.30)

Furthermore, for c(t) = \delta_T(t) we have the somewhat trivial kernel

k(x, x') = x^\top \left[ \exp(AT)^\top W \exp(AT) \right] x'. \qquad (6.31)
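In practice the Sylvester equation of Corollary 30 can be handed to a standard solver. The sketch below is our own illustrative code; it relies on scipy.linalg.solve_sylvester, which solves AX + XB = Q, and the function name continuous_lti_kernel is ours.

import numpy as np
from scipy.linalg import solve_sylvester

def continuous_lti_kernel(A, W, lam, x, xp):
    """k(x, x') of Corollary 30: solve (A + lam/2 I)^T M + M (A + lam/2 I) = -W."""
    n = A.shape[0]
    A1 = A + 0.5 * lam * np.eye(n)
    M = solve_sylvester(A1.T, A1, -W)          # solves A1.T @ M + M @ A1 = -W
    return x @ M @ xp

# Toy usage; Lemma 29 requires lam < -2 * ||A||.
rng = np.random.default_rng(0)
A = 0.1 * rng.standard_normal((4, 4))
x, xp = rng.standard_normal(4), rng.standard_normal(4)
print(continuous_lti_kernel(A, np.eye(4), lam=-5.0, x=x, xp=xp))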


As before, we extend our model to linear time-invariant systems with zero mean white

Gaussian noise. Using Equation (6.10) we define the kernel as

k(x, x') = E_\xi \left[ \int_0^{\infty} c(t)\, P^\top W Q \, dt \right] \qquad (6.32)

where P := \exp(At)\, x + \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau and Q := \exp(At)\, x' + \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau.

We can compute the kernel in closed form for c(t) = \exp(λt) (for certain values of λ given by Lemma 29). Since ξ is Gaussian white noise, and x and ξ are mutually independent, we can compute the contribution of ξ to the kernel independently. Using Lemma 28 and M as defined in Lemma 29 we prove the following slightly more general result:

E_\xi \left[ \int_0^{\infty} \exp(\lambda t) \Big( \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau \Big)^\top W \Big( \int_0^t \exp(B(t-\tau))\, \xi(\tau) \, d\tau \Big) dt \right]
= E_\xi \left[ \int_0^{\infty} \exp(\lambda t) \int_0^t \big(\exp(A(t-\tau))\, \xi(\tau)\big)^\top W \exp(B(t-\tau))\, \xi(\tau) \, d\tau \, dt \right]
= \sigma^2 \, \mathrm{tr} \left[ -\int_0^{\infty} \exp(\lambda t) \int_0^t \exp(Ap)^\top W \exp(Bp) \, dp \, dt \right]
= \sigma^2 \, \mathrm{tr} \left[ -\int_0^{\infty} \Big[ \int_p^{\infty} \exp(\lambda t) \exp(Ap)^\top W \exp(Bp) \, dt \Big] dp \right]
= \frac{\sigma^2}{\lambda} \, \mathrm{tr} \left[ \int_0^{\infty} \exp(\lambda p) \exp(Ap)^\top W \exp(Bp) \, dp \right] \qquad (6.33)
= \frac{\sigma^2}{\lambda} \, \mathrm{tr}\, M. \qquad (6.34)

Here we obtained Equation (6.33) by noting that exp(λt)→ 0 as t→∞ (for values of λ defined

by Lemma 29). Now, using Equation (6.34) and Equation (6.32), and noting that

E_\xi \left[ \int_0^{\infty} \exp(\lambda t) \Big( \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau \Big)^\top W \big(\exp(At)\, x'\big) \, dt \right] = 0

and

E_\xi \left[ \int_0^{\infty} \exp(\lambda t) \big(\exp(At)\, x\big)^\top W \Big( \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau \Big) dt \right] = 0

yields

k(x, x') = x^\top M x' + \frac{\sigma^2}{\lambda} \, \mathrm{tr}\, M \quad \text{where} \quad \Big(A + \frac{\lambda}{2}\mathbf{1}\Big)^\top M + M \Big(A + \frac{\lambda}{2}\mathbf{1}\Big) = -W. \qquad (6.35)


6.3.3 Special Cases

The kernels described above appear somewhat simple-minded at first sight. After all, we are

only replacing the standard Euclidean metric by the covariance of two trajectories under a

dynamic system. Below, we show how specific instances of such kernels can lead to many new

and interesting kernels.

Discrete Time Markov Chain: In a homogeneous Discrete Time Markov Chain (DTMC),

x(t) corresponds to the state at time t and the matrix A is the matrix of state transition

probabilities, i.e., Aij = p(i|j), which is the probability of arriving in state i when originating

from state j. This means that via the function k(., .) we can compute the overlap between states,

when originating from x = ei and x′ = ej . Finally, since A is a stochastic matrix (positive entries

with row-sum 1), its eigenvalues are bounded by 1 and therefore, any discounting factor λ < 1

will lead to a well-defined kernel.

Note that if 1/λ is much smaller than the mixing time, k will mostly measure the overlap

between the initial states x, x′ and the transient distribution on the DTMC (The stationary

distribution may also play a role, albeit a reduced one, in determining the final kernel). The

quantity of interest here will be the ratio between λ and the gap between 1 and the second

largest eigenvalue of A (Graham, 1999).

In other words, we are comparing the trajectories of two (discrete time) random walks on a

Markov chain. Choosing ci = δi,T , on the other hand, means that we compare a snapshot of the

distribution over states after exactly T time steps.
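As a small numerical illustration of the DTMC case (our own toy example, not part of the thesis), the kernel between two starting states e_i and e_j of a random walk can be approximated by truncating the series of Equation (6.15), which converges geometrically for λ < 1.

import numpy as np

def dtmc_state_kernel(A, lam, i, j, w=None, t_max=200):
    """Approximate k(e_i, e_j) = e_i^T [sum_t lam^t (A^t)^T W A^t] e_j by truncation."""
    n = A.shape[0]
    W = np.eye(n) if w is None else w
    M = np.zeros((n, n))
    At = np.eye(n)
    for t in range(t_max):
        M += lam ** t * At.T @ W @ At
        At = A @ At                      # A^{t+1}
    ei, ej = np.eye(n)[i], np.eye(n)[j]
    return ei @ M @ ej

# Random walk on a ring of 5 states; A[i, j] = p(i | j).
n = 5
A = np.zeros((n, n))
for j in range(n):
    A[(j - 1) % n, j] = A[(j + 1) % n, j] = 0.5
print(dtmc_state_kernel(A, lam=0.9, i=0, j=1))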

Continuous Time Markov Chain: In a homogeneous Continuous Time Markov Chain

(CTMC), x(t) corresponds to the state at time t and the matrix A (also called the rate matrix)

denotes the differential change in the concentration of states. When the CTMC reaches a state,

it stays there for an exponentially distributed random time (called the state holding time) with

a mean that depends only on the state. Now, we can define kernels using diffusion on CTMC’s

by plugging the rate matrix A into Equation (6.30).

Diffusion on Graphs: A special case of diffusion on a CTMC is diffusion on a directed graph.

Here diffusion through each of the edges of the graph is constant (in the direction of the edge).

This means that, given a matrix representation of a graph via D (here Dij = 1 if an edge from


j to i exists), we compute the Laplacian L = D − diag(D 1) of the graph, and use the latter to define the diffusion process \frac{d}{dt} x(t) = L x(t). Note that in the special case of W = 1, the entry M_{ij} of M = \exp(Lt)^\top \exp(Lt) tells us the probability that any other state l could have been reached jointly from i and j (Kondor and Lafferty, 2002).

Undirected Graphs and Groups: Kondor and Lafferty (2002) suggested to study diffusion

on undirected graphs with W = 1. Their derivations are a special case of Equation (6.31). Note

that for undirected graphs the matrices D and L are symmetric. This has the advantage that

Equation (6.31) can be further simplified to

k(x, x') = x^\top \exp(LT)^\top W \exp(LT)\, x' = x^\top \exp(2LT)\, x'. \qquad (6.36)

Finally, if we study an even more particular example of a graph, namely the Cayley graph of

a group, we thereby have a means of imposing a kernel on the elements of a group (see also

Kondor and Lafferty (2002) for further details).
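A small sketch of the graph-diffusion construction is given below (our own illustrative code; it uses scipy.linalg.expm for the matrix exponential, and the function name is ours): build L = D − diag(D1) from an adjacency matrix and compare two initial distributions via Equation (6.36).

import numpy as np
from scipy.linalg import expm

def diffusion_kernel(adj, T, x, xp):
    """k(x, x') = x^T exp(2 L T) x' for an undirected graph with W = 1 (Eq. 6.36)."""
    L = adj - np.diag(adj.sum(axis=1))      # Laplacian L = D - diag(D 1)
    K = expm(2.0 * T * L)
    return x @ K @ xp

# Path graph on 4 vertices; compare point masses on vertices 0 and 3.
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
e0, e3 = np.eye(4)[0], np.eye(4)[3]
print(diffusion_kernel(adj, T=1.0, x=e0, xp=e3))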

Differential Equations It is well known that any linear differential equation can be transformed into a first-order linear differential equation by extending the state space (Hirsch and

Smale, 1974). This means that differential equations thereby impose a kernel on the initial

conditions. Similarity here corresponds to correlation of the state space trajectories.

6.4 Kernels on Dynamic Systems

We now extend the notion of a kernel to dynamic systems. This includes differential equations,

FSA, Markov chains, etc. A natural way of comparing the similarity between two systems,

D ∈ D and D′ ∈ D is to compare the trajectories χ(D,x(0)) and χ(D′, x(0)), given the same

starting values x(0). In other words, we will say that two systems D and D′ are close if their

time evolution is close for a restricted set of inputs. This closely resembles the idea of system

identification in the “Behavioral Framework”, as invented by Willems (1986a,b, 1987) (see also

(Weyer, 1992) for further details).

In the most general setting, we will consider a mapping of a set of initial sequences x(0) into


their full time trajectory and define a dot product on them. This leads to a kernel of the form

k(D, D') := E_\xi E_{x(0)}\, k_\chi(\chi(D, x(0)), \chi(D', x(0))). \qquad (6.37)

Here kχ is a kernel determining how the similarity between the trajectories is measured. The

expectation with respect to x(0) allows us to encode knowledge about the set of initial conditions

we are most interested in. For instance, two models may behave identically on all situations

of interest to the user, hence we should consider them identical, even though they may exhibit

completely different behavior in a set of initial conditions which never occurs in practice. The

expectation with respect to the random variable ξ means that we are considering the whole

ensemble of trajectories arising due to a certain noise model for a given value of x(0).

In the following we will give examples of such kernels. They are useful in two regards: firstly

they allow us to define a Hilbert space embedding of dynamic systems in order to estimate some

of their properties directly. Secondly, they allow us to define notions of proximity between two

dynamic systems.

6.4.1 Discrete Time Systems

Let A and B define two discrete linear time-invariant systems without noise. For the sake of

analytic tractability we use

k_\chi(\chi(A, x(0)), \chi(B, x(0))) = x(0)^\top \left[ \sum_{t=0}^{\infty} c_t (A^t)^\top W B^t \right] x(0) \qquad (6.38)

where, as before, ct denote arbitrary weights. Hence, Equation (6.37) can be used to define the

following kernel

k(A, B) = E_{x(0)} \left[ x(0)^\top \left[ \sum_{t=0}^{\infty} \lambda^t (A^t)^\top W B^t \right] x(0) \right]. \qquad (6.39)

If we consider a weighting scheme of the form c_t = λ^t, then we know from Lemma 26 that k_χ is

well defined only for certain values of λ. For such values of λ we can evaluate the above kernel

as (cf. Lemma 26)

k(A, B) = E_{x(0)}\, x(0)^\top M x(0) \quad \text{where} \quad M = W + \lambda A^\top M B. \qquad (6.40)


As before, M can be computed in O(n³) time by solving the Sylvester equation. Let C := E_{x(0)}\left[x(0)\, x(0)^\top\right] be the covariance matrix of the random variable x(0). Then, we can simplify Equation (6.40) using Lemma 28 as

k(A, B) = E_{x(0)}\big(x(0)^\top M x(0)\big) = \mathrm{tr}(MC). \qquad (6.41)
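The following sketch (our own illustrative code, reusing the naive vectorized solve of M = W + λ A^⊤ M B from the discrete case; the function names are ours) evaluates k(A, B) = tr(MC) for two system matrices under a given covariance C of the initial conditions.

import numpy as np

def solve_M(A, B, W, lam):
    """Solve M = W + lam * A.T @ M @ B via vectorization (column-major vec)."""
    n = A.shape[0]
    K = np.kron(B.T, A.T)
    vec_m = np.linalg.solve(np.eye(n * n) - lam * K, W.flatten(order="F"))
    return vec_m.reshape((n, n), order="F")

def system_kernel(A, B, W, C, lam):
    """k(A, B) = tr(M C), Eq. (6.41), with M from Eq. (6.40)."""
    M = solve_M(A, B, W, lam)
    return np.trace(M @ C)

# Toy usage: two contractive systems compared under isotropic initial conditions.
rng = np.random.default_rng(0)
A = 0.2 * rng.standard_normal((4, 4))
B = 0.2 * rng.standard_normal((4, 4))
print(system_kernel(A, B, W=np.eye(4), C=np.eye(4), lam=0.5))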

If we consider the effect of white Gaussian noise the kernel can be defined as

k_\chi(\chi(A, x(0)), \chi(B, x(0))) = E_\xi E_{x(0)} \left[ \sum_{t=0}^{\infty} c_t \Big( A^t x(0) + \sum_{i=0}^{t-1} A^i \xi_i \Big)^\top W \Big( B^t x(0) + \sum_{i=0}^{t-1} B^i \xi_i \Big) \right] \qquad (6.42)

As before, Equation (6.42) can be computed in closed form for c_t = λ^t, for certain values of λ (given by Lemma 26). Since ξ is Gaussian white noise, and x(0) and all ξ_i are mutually independent, we can compute the contribution of the ξ_i to the kernel independently. Using Lemma 28,

Equation (6.21) and Equation (6.42), and noting that

E_\xi E_{x(0)} \left[ \sum_{t=0}^{\infty} \lambda^t \Big( \sum_{i=0}^{t-1} A^i \xi_i \Big)^\top W \big(B^t x(0)\big) \right] = 0

and

E_\xi E_{x(0)} \left[ \sum_{t=0}^{\infty} \lambda^t \big(A^t x(0)\big)^\top W \Big( \sum_{i=0}^{t-1} B^i \xi_i \Big) \right] = 0

yields

k(A, B) = \mathrm{tr}(MC) + \frac{\sigma^2}{1-\lambda} \, \mathrm{tr}\, M \quad \text{where} \quad M = W + \lambda A^\top M B. \qquad (6.43)

6.4.2 Continuous Time Systems

Let A and B define two continuous linear time-invariant systems without noise. For the sake of

analytic tractability we use

k_\chi(\chi(A, x(0)), \chi(B, x(0))) = x(0)^\top \left[ \int_0^{\infty} c(t) \exp(At)^\top W \exp(Bt) \, dt \right] x(0) \qquad (6.44)


where, as before, c(t) is a continuous weighting function. Hence, Equation (6.37) and Equation (6.44) can be used to define the following kernel

k(A, B) := E_{x(0)} \left[ x(0)^\top \Big( \int_0^{\infty} c(t) \exp(At)^\top W \exp(Bt) \, dt \Big) x(0) \right]. \qquad (6.45)

If we consider a weighting scheme of the form c(t) = e^{λt}, then we know from Lemma 29 that k_χ

is well defined only for certain values of λ. For such values of λ we can evaluate the above kernel

as (cf. Lemma 29)

k(A, B) = E_{x(0)}\, x(0)^\top M x(0) \quad \text{where} \quad \Big(A + \frac{\lambda}{2}\mathbf{1}\Big)^\top M + M \Big(B + \frac{\lambda}{2}\mathbf{1}\Big) = -W. \qquad (6.46)

By a similar reasoning as above we can compute Equation (6.45) by evaluating

k(A,B) = tr (MC) (6.47)

where, C is the covariance matrix of x(0).

We extend our model to linear time-invariant systems with zero mean white Gaussian noise

and define the kernel as

k(A, B) = E_\xi E_{x(0)} \left[ \int_0^{\infty} c(t)\, P^\top W Q \, dt \right] \qquad (6.48)

where P := \exp(At)\, x(0) + \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau and Q := \exp(Bt)\, x(0) + \int_0^t \exp(B(t-\tau))\, \xi(\tau) \, d\tau. We can compute the kernel in closed form for c(t) = \exp(λt) (for certain values of λ given by Lemma 29). Since ξ is Gaussian white noise, and x(0) and ξ are mutually independent, we can compute the contribution of ξ to the kernel independently. Using Lemma 28,

Equation (6.34) and Equation (6.48), and noting that

E_\xi E_{x(0)} \left[ \int_0^{\infty} \exp(\lambda t) \Big( \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau \Big)^\top W \big(\exp(Bt)\, x(0)\big) \, dt \right] = 0

and

E_\xi E_{x(0)} \left[ \int_0^{\infty} \exp(\lambda t) \big(\exp(At)\, x(0)\big)^\top W \Big( \int_0^t \exp(A(t-\tau))\, \xi(\tau) \, d\tau \Big) dt \right] = 0


yields

k(A, B) = \mathrm{tr}(MC) + \frac{\sigma^2}{\lambda} \, \mathrm{tr}\, M \quad \text{where} \quad \Big(A + \frac{\lambda}{2}\mathbf{1}\Big)^\top M + M \Big(B + \frac{\lambda}{2}\mathbf{1}\Big) = -W. \qquad (6.49)

6.4.3 Non-Homogeneous Linear Time-Invariant Systems

In the case of non-homogeneous linear time-invariant systems we define a kernel using Equa-

tion (6.12) and Equation (6.37) as

k((A, a), (B, b)) = E_\xi E_{x(0)} \left[ \sum_{t=0}^{\infty} \lambda^t P^\top W Q \right]

where P := A^t x_a + \bar{a} + \sum_{i=0}^{t-1} A^i \xi_i and Q := B^t x_b + \bar{b} + \sum_{i=0}^{t-1} B^i \xi_i, with \bar{b} := (1 - B)^{-1} b and x_b := x(0) - \bar{b} defined analogously to Equations (6.13) and (6.14).

Since x(0) and all ξ_i are mutually independent and ξ is Gaussian white noise, we can compute the contributions of the ξ_i and of x(0) to the kernel independently. We first compute the contribution due to A^t x_a + \bar{a}.

\sum_{t=0}^{\infty} \lambda^t (A^t x_a + \bar{a})^\top W (B^t x_b + \bar{b}) = x_a^\top M x_b + x_a^\top (1 - \lambda A^\top)^{-1} W \bar{b} + \bar{a}^\top W (1 - \lambda B)^{-1} x_b + \frac{1}{1-\lambda} \bar{a}^\top W \bar{b}. \qquad (6.50)

Next, we assume that x(0) has variance Ξ and mean µ0. Using Equation (6.14) we get

E_{x(0)} \left[ x_a x_b^\top \right] = E_{x(0)} \left[ (x(0) - \bar{a})(x(0) - \bar{b})^\top \right] = \Xi + (\mu_0 - \bar{a})(\mu_0 - \bar{b})^\top. \qquad (6.51)

Putting everything together, i.e., taking expectations with respect to x(0) and plugging Equation (6.50), Equation (6.21) and Equation (6.51) into the definition of k((A, a), (B, b)) above, yields:

k((A, a), (B, b)) = \mathrm{tr}(M\Xi) + \frac{\sigma^2}{1-\lambda} \mathrm{tr}(M) + \mathrm{tr}\big((\mu_0 - \bar{a})^\top M (\mu_0 - \bar{b})\big) \qquad (6.52)

\qquad + (\mu_0 - \bar{a})^\top (1 - \lambda A^\top)^{-1} W \bar{b} + \bar{a}^\top W (1 - \lambda B)^{-1} (\mu_0 - \bar{b}) + \frac{1}{1-\lambda} \bar{a}^\top W \bar{b}.

Clearly, λ has to be chosen so that it satisfies the conditions of Lemma 26, since otherwise the kernel k((A, a), (B, b)) would not be defined, due to lack of existence of the solution of the


Sylvester matrix equation.

6.5 Summary

We proposed a framework to connect kernels and dynamic systems. We showed how specializations of this framework lead to kernels on initial conditions with respect to a given dynamic system and on a dynamic system with respect to a given set of initial conditions. We studied

the framework for linear time-invariant systems and showed that their specializations lead to

many new kernels. Our framework also generalizes many previously proposed kernels.

Chapter 7

Jigsawing: A Method to Create

Virtual Examples

This chapter describes a new method to generate virtual training samples in the case of hand-

written digit data. It uses the two dimensional suffix tree representation of a set of matrices to

encode an exponential number of virtual samples in linear space thus leading to an increase in

classification accuracy. A new kernel for images is proposed and an algorithm for computing it

in quadratic time is described. Methods to reduce the prediction time to quadratic in the size of

the test image by using techniques similar to those used for string kernels (cf. Section 5.7) are

also described. We conjecture that the time complexity can be further reduced by intelligently

using the suffix tree on matrices.

In Section 7.1 we introduce our notation and motivate the problem. In Section 7.2 we survey

algorithms for generating virtual examples and examine their relation to our method. A high

level description of our algorithm follows in Section 7.3. A detailed description of jigsawing along

with intuitive arguments to show why it works can be found in Section 7.4. We discuss some

novel applications in Section 7.5. We propose a new kernel on images in Section 7.6 and show

how it can be computed in quadratic time by using suffix trees on matrices. We also describe a

quadratic time prediction algorithm for Support Vector Machines in Section 7.6. We conclude

with a summary in Section 7.7.

This chapter requires basic knowledge of statistical learning theory as discussed in Chapter 1

(cf. Section 1.1). An understanding of suffix trees on strings and their generalization to matrices

is essential. Readers may want to review Section 5.3 and the influential papers by Giancarlo



(1995) and Cole and Hariharan (2000). Kim and Park (1999) also propose an alternate lin-

ear time construction algorithm for suffix trees on matrices. A review of convolution kernels

discussed in Section 4.2 and tree kernels discussed in Sections 4.5 and 5.8 may be helpful in

understanding parts of this chapter.

7.1 Background and Notation

It is well known that the empirical risk of a classifier can be reduced by training it on a large

number of samples (Duda et al., 2001, Vapnik, 1995). In many real life situations the dataset

size may be limited or it may be expensive to obtain new samples for training (Niyogi et al.,

1998). But, we usually have some prior domain knowledge about the data which can be used

to generate virtual training samples. Some successful attempts in this direction especially for

handwritten digit datasets include methods like bootstrapping (Hamamoto et al., 1997) and

Partitioned Pattern Classification (PPC) trees (Viswanath and Murty, 2002). The methods

proposed by Niyogi et al. (1998) are applicable for object recognition and speech recognition.

In this chapter, we propose a novel method to generate an exponential number of virtual training

samples by encoding the handwritten digit data in a two dimensional suffix tree using linear

space and linear construction time.

In the following, we denote by X = {(x_1, y_1), . . . , (x_n, y_n)} ⊂ Σ^{2^k × 2^k} × C the set of labeled training samples¹, where C is the set of class labels, Σ is a finite ordered alphabet, and we define m := 2^k. Without loss of generality we assume that $ ∉ Σ is a special symbol, ∗ ∈ Σ is a blank

character and C = {1, 2, . . . , c}. Furthermore, for i ∈ C, let X i denote the data points in class i

and let ni = | X i |. Let F (X i) denote a representation of X i, for example, it could be an array

that stores X i. A sample (x, i) is said to be virtual if it can be generated from F (X i). It is

clear that the only virtual samples generated by the array representation are (xj , i) such that

x_j ∈ X^i, while other representations may yield a richer set of virtual samples. At this point, it must be noted that not all the virtual training samples generated may be meaningful; in fact, many of them may be noisy patterns or outliers. Our ultimate goal is to use domain knowledge

to generate as many meaningful virtual samples as possible.

The suffix tree is a compacted trie that stores all suffixes of a given text string (Weiner,

¹The assumption x_i ∈ Σ^{2^k × 2^k} looks very restrictive, but is required mainly for notational convenience. One way to overcome this is to pad the data with a sufficient number of blanks.


1973, McCreight, 1976). It has been widely used for compact representation of input text and

in a wide variety of search applications (Grossi and Italiano, 1993) (cf. Section 5.3). Giancarlo

(1995) generalized the notion of a suffix tree on a string to suffix trees on arbitrary matrices

(also see Giancarlo and Grossi (1996, 1997), Giancarlo and Guaiana (1999)). We denote the

suffix tree of a matrix x by S(x). For each square sub-matrix of x, there is a corresponding path

in the suffix tree S(x). Linear time and linear space (that is, O(m²)) algorithms for constructing

such suffix trees have recently been proposed by Cole and Hariharan (2000) (see Kim and Park

(1999) for an alternate algorithm).

Let nodes(S(x)) denote the set of all nodes of S(x) and root(S(x)) be the root of S(x).

For a node w, T(w) denotes the subtree rooted at the node and lvs(w) denotes the number

of leaves in the subtree. We denote by words(S(x)) the set of all sub-matrices of x. For every

t ∈ words(S(x)) we define ceiling(t) and floor(t) exactly analogous to their definitions on suffix

trees for strings (cf. Section 5.3.1).

The suffix tree of the set X^i, denoted by S(X^i), can be built in O(m² n_i) time by merging S(x_j) for all (x_j, i) ∈ X^i. Giancarlo (1995) showed that all occurrences in X^i of y ∈ Σ^{p×p} for some p ≤ m can be found in O(p²) time. We use the notation y ⊑ X^i to denote that y occurs

as a sub-matrix of some element of X i.

7.2 Related Work

Let D be the domain from which samples are drawn and T : D^t → D be a transformation which

maps a set of t training samples into a new meaningful sample. Virtual training samples can be

generated by applying the transformation T to the elements of the training set (Niyogi et al.,

1998). Different methods differ in the way they select the transformation function T and the

value of t.

Bootstrapping is a robust and well studied method to generate virtual samples from limited

training data (see for example Efron and Tibshirani, 1994, Efron, 1982). Variants of boot-

strapping have been successfully used for a wide variety of pattern recognition tasks (Efron

and Tibshirani, 1997, Kohavi, 1995). The basic idea behind bootstrapping is to generate new

samples by combining a set of closely related samples. For example, one method is to replace

a set of spatially close points by their centroid or their medoid (Hamamoto et al., 1997). The

main advantage of bootstrapping is that it is robust and well studied. But, it requires explicit


storage of the generated samples and hence its use is generally constrained by memory and

storage considerations.

A trie is a multi-way tree structure used for storing strings over an alphabet (Fredkin, 1960).

Viswanath and Murty (2002) proposed the PPC tree which partitions the dataset vertically into

blocks and constructs a separate trie for each block. They show that the PPC tree can generate

virtual samples implicitly by combining data from various blocks. The advantage of the PPC

tree is that it uses linear space to encode a large number of virtual samples. But, the main

disadvantage is that the partitioning into blocks is highly dataset dependent (Viswanath and

Murty, 2002). Besides, it does not generalize well to take into consideration the two dimensional

nature of handwritten digit data. In handwritten digit datasets, the structure of the data plays

an important role in classification algorithms and hence a method which takes into account the

connected regions of a training sample is preferable. As of now, the PPC tree has been used

only with a k-Nearest Neighbor Classifier.

An immediately apparent application of the suffix tree on matrices is to search for all oc-

currences of a test pattern in the training set (Giancarlo, 1995). In real world applications, the

probability of finding an exact match between training samples and a test sample is negligible.

Hence, such a scheme is not practically feasible as a classification algorithm. It is clear that

we need to allow some kind of approximate matches in order to tackle this problem. Besides,

this scheme does not produce any virtual examples and hence may not be preferable when the

number of training points is limited.

7.3 Basic Idea

The basic idea of our algorithm is very simple. Given a set of patterns X i of class i, meaningful

virtual patterns (x, i) can be generated by swapping corresponding regions from various samples

in X i. This is similar to solving a jigsaw puzzle where the final solution is obtained by piecing

together various pieces of the jigsaw (Vishwanathan and Murty, 2002b). Figure 7.1 illustrates

this pictorially. This strategy generates an exponential number of virtual samples, hence, it

would be extremely wasteful to explicitly generate and store all such virtual patterns. Also,

it can be seen that all the samples generated by jigsawing may not be meaningful. Hence, we

would like to assign a weightage or confidence measure for each virtual sample.

In general, encoding all possible arbitrary shaped regions from each training sample in X i


Figure 7.1: The two samples on the top are the original toy training samples. The two samples on the bottom are obtained by swapping regions from the top two samples. In this simplistic setting jigsawing is seen to produce valid virtual samples.

is an extremely difficult problem. But, since arbitrary regions in a two dimensional matrix can

be described in terms of its sub-matrices, it is adequate if we encode all the sub-matrices of

samples in X i. We achieve this by constructing the suffix tree S(X i) in linear time using linear

space. To assign weightages, we generate what we call a description tree of the virtual pattern

with respect to the training data X i, and use it to obtain a confidence estimate.

The idea of the description tree is best explained by describing its construction for a test

sample xt with respect to X i for some i. We start off with the root node which represents the

entire pattern xt and look for an exact match in X i. If no exact match is found we subdivide

xt into sub-matrices and add nodes to the root node corresponding to these sub-matrices. Now,

each one of the sub-matrices is used for locating an exact match. We recursively subdivide the

sub-matrices and add corresponding nodes to the tree until either an exact match is found or a

preset threshold on the depth of the tree is exceeded. Weights are assigned to each node of the


tree and the sum of the weights on all nodes gives us an estimate of the similarity between x_t

and X i. A meaningful weighting scheme takes into account the depth of a node, the number of

occurrences and the corresponding location coordinates, of the region represented by the node,

in the training set.

The description tree can also be used to design a simple classifier. Given a test sample,

generate its description tree with respect to X i for all i ∈ C and assign the sample to the class

with maximal weightage on the root node. This idea is very similar to that used for designing

classifiers using generative models (Everitt, 1984, Revow et al., 1996). In its simplest form,

a classifier based on a generative model has a model for each digit. Given an image of an

unidentified digit the idea is to search for the model that is most likely to have generated that

image (Revow et al., 1996). The advantage of this method is that, besides providing a classifier it

also provides valuable information to describe the digit. This information is especially useful in

hybrid settings where other classifiers may benefit from this data. Another important advantage

of generative models is that the models can be learnt from labeled training data by using an

Expectation-Maximization-like algorithm, as proposed by Dempster et al. (1977). Use of such

techniques for description tree generation is a topic of current research.

7.4 Jigsawing

In this section we describe the jigsawing algorithm formally. We also discuss its time complexity

and try to show using various intuitive arguments that it is likely to improve classification

accuracy.

7.4.1 Notation

Given x ∈ X , the integers a, b, l such that 1 ≤ a, b ≤ m and 1 ≤ a + l, b + l ≤ m, define a

square sub-matrix x[a : a + l, b : b + l] := submatx(a, b, l). Conversely, given a sub-matrix of x

denoted by R we define coordx(R) := (a, b, l) such that R = submatx(a, b, l). Let splitx(a, b, 2l)


be the function which subdivides submatx(a, b, 2l) into four square sub-matrices given by

R1 = submatx(a, b, l)
R2 = submatx(a, b + l + 1, l)
R3 = submatx(a + l + 1, b, l)
R4 = submatx(a + l + 1, b + l + 1, l).

Let T be a threshold parameter such that T ≤ k. For the sake of illustration, we use a very simple weighting scheme. Let λ ∈ (0, 1) be a weighting factor. A leaf at depth d is assigned a weight of (λ/4)^d, while the weight of an internal node is given by the sum of the weights of all the leaves hanging off the subtree rooted at that node. We denote the description tree of a test pattern xt with respect to X i as D(xt, X i). It is a weighted tree with maximum depth T such that the weight on the root node is a function of the similarity between xt and X i. If xt ⊑ X i then D(xt, X i) consists of a single root node, which is assigned the maximum possible weight of 1.
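For concreteness, the following minimal Python sketch (an illustration, not part of the thesis) expresses this notation in 0-based indexing, assuming patterns are square numpy arrays; submat, split_region and leaf_weight are illustrative stand-ins for submatx, splitx and the (λ/4)^d leaf weight.

# A minimal sketch of the notation above in 0-based Python indexing, assuming
# patterns are square numpy arrays of even side; submat, split_region and
# leaf_weight are illustrative stand-ins for submatx, splitx and (lambda/4)^d.
import numpy as np

def submat(x, a, b, l):
    # square block of side l with top-left corner (a, b)
    return x[a:a + l, b:b + l]

def split_region(a, b, size):
    # subdivide the square region (a, b, size) into four quadrants of side size // 2
    h = size // 2
    return [(a, b, h), (a, b + h, h), (a + h, b, h), (a + h, b + h, h)]

def leaf_weight(lam, d):
    # weight assigned to a leaf created at depth d
    return (lam / 4.0) ** d

if __name__ == "__main__":
    x = np.arange(16).reshape(4, 4)
    print(submat(x, 0, 0, 2))        # top-left 2x2 block
    print(split_region(0, 0, 4))     # four 2x2 quadrants
    print(leaf_weight(0.5, 2))       # (0.5/4)^2

For odd region sizes a real implementation would let the quadrant sides differ by one; the recursion itself is unaffected.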

7.4.2 Algorithm

For each i ∈ C we construct the corresponding suffix tree S(X i) in O(| X i |) time and O(| X i |)

space by using the Cole and Hariharan (2000) algorithm. Given a square sub-matrix R of xt and a node N in D(xt, X i) at depth d (corresponding to the region R), the procedure shown in Algorithm 7.1 splits R into four equal sized square sub-matrices (other splitting schemes

may be adapted based on domain knowledge about region R that is being split). Four children

are added to N in order to represent these new regions. Each of the sub-matrices is checked for

an exact match in X i. The recursive procedure is invoked until either an exact match is found

or the threshold depth T is reached.

To construct D(xt,X i), our algorithm first looks for an exact match for xt in X i. If such a

match is not found the recursive procedure shown in Algorithm 7.1 is invoked. This is outlined

in Algorithm 7.2.


Algorithm 7.1: SplitNode
input: R, d, N, S(X i)

if d ≥ T then
    weight(N) = 0
    return
end if
Add {N1, N2, N3, N4} as children of N
for Rj ∈ splitx(coordx(R)) do
    if Rj ⊑ X i then
        weight(Nj) = (λ/4)^d
    else
        SplitNode(Rj, d + 1, Nj, S(X i))
    end if
end for

Algorithm 7.2: Description Tree Construction
input: xt, S(X i), T, λ
output: D(xt, X i)

Let D(xt, X i) ← root
if xt ⊑ X i then
    weight(root) = 1
else
    SplitNode(xt, 1, root, S(X i))
end if
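The following runnable Python sketch mirrors Algorithms 7.1 and 7.2 under two simplifying assumptions that are not part of the thesis: patterns are numpy arrays, and the suffix tree S(X i) is replaced by a naive scan over the training samples (occurs_in). The replacement changes only the constants in the running time, not the structure of the recursion; the Node class and helper names are illustrative.

# A runnable sketch of Algorithms 7.1 and 7.2, assuming images stored as numpy
# arrays. The suffix tree S(X^i) is replaced by a naive scan over all training
# samples (occurs_in); this changes the constants, not the recursion.
import numpy as np

class Node:
    def __init__(self, depth):
        self.depth = depth
        self.weight = 0.0
        self.children = []

def occurs_in(region, samples):
    # naive stand-in for a suffix-tree lookup: does `region` occur as a
    # sub-matrix of some training sample?
    r, c = region.shape
    for s in samples:
        for a in range(s.shape[0] - r + 1):
            for b in range(s.shape[1] - c + 1):
                if np.array_equal(s[a:a + r, b:b + c], region):
                    return True
    return False

def split_node(region, node, samples, T, lam):
    # Algorithm 7.1: recursively subdivide `region` into four quadrants
    if node.depth >= T or min(region.shape) < 2:
        node.weight = 0.0
        return
    h = region.shape[0] // 2
    quads = [region[:h, :h], region[:h, h:], region[h:, :h], region[h:, h:]]
    for quad in quads:
        child = Node(node.depth + 1)
        node.children.append(child)
        if occurs_in(quad, samples):
            child.weight = (lam / 4.0) ** child.depth
        else:
            split_node(quad, child, samples, T, lam)

def description_tree(xt, samples, T=4, lam=0.5):
    # Algorithm 7.2: build D(xt, X^i) and return its root
    root = Node(depth=0)
    if occurs_in(xt, samples):
        root.weight = 1.0
    else:
        split_node(xt, root, samples, T, lam)
    return root

if __name__ == "__main__":
    train = [np.eye(4, dtype=int), np.ones((4, 4), dtype=int)]
    test = np.eye(4, dtype=int)
    print(description_tree(test, train).weight)   # 1.0: exact match in the training set

Only matched leaves and cut-off nodes receive weights here, exactly as in Algorithm 7.1; the internal-node weights are filled in by the DFS discussed in Section 7.5.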

7.4.3 Time Complexity

At each level the sum of the sizes of the regions considered for matching is bounded by m × m. Hence, the work done to construct each level of D(xt, X i) is bounded by O(m^2). Since the number of levels is at most T ≤ k, the whole description tree construction takes at most O(m^2 log2(m)) time.

The cost of constructing the suffix tree S(X i) is a one time O(| X i |) cost while the construction

of the description tree D(xt,X i) is independent of the size of X i and depends only on the size

of the test pattern. This is significantly cheaper than the O(m2| X i |) cost typically incurred by

traditional classifiers like the nearest neighbor classifier (Cover and Hart, 1967).

7.4.4 Why does Jigsawing Work?

In general, handwritten digit samples preserve structural properties which means that a large

amount of information can be gleaned by studying their two dimensional structure (LeCun et al., 1995). Given two training samples xi and xj drawn from the same class i, let Rxi be a region


in xi and Rxj be the corresponding region in xj . New (possibly valid) training patterns can be

obtained by swapping Rxi and Rxj . Because of the structure preserving nature of handwritten

digit data, we expect that the new sample would be a valid (or at least close to valid) pattern

of class i. The jigsawing algorithm described above can be thought of as swapping sub-matrices

from various training samples in order to produce new samples.

In real life situations, not all samples produced by swapping various regions may be valid. For example, as we jigsaw smaller and smaller sub-matrices (by allowing T → k), the number of virtual examples we generate grows exponentially. While many of these points may be meaningful, the number of noisy or meaningless points that are produced also increases.

This can be seen as follows: if we assume that Σ = {0, 1} and allow T = k then the complement

of any x ∈ X i can be generated by jigsawing individual pixels. In this case, the depth of the

description tree generated is large and hence the similarity measure we assign to the sample

decreases indicating our reduced confidence in such a sample.

As the number of virtual training samples increases, the empirical error (Remp) of the clas-

sifier is reduced. But, due to the increase in the number of noisy samples, a more complex hypothesis class may be needed to effectively explain the training data, which requires us to use a classifier with a higher VC dimension (Vapnik, 1995). This, in turn, may increase the confidence term (φ) in Equation (1.1). As a result of these two opposing effects, the classification accuracy tends to increase for smaller values of T, but for larger values it may come down.

7.5 Applications

In this section we try to give a flavor of the various applications of the description tree construc-

tion algorithm we described above. More quantitative results can be found in Vishwanathan and

Murty (2002c). As stated before, it is straightforward to design a classifier given the description

trees D(x, X i) for all i ∈ C. The weights of all the internal nodes can be found by doing a simple Depth First Search (DFS) on D(x, X i). The weights on the root nodes can then be used to assign x to the class with maximum similarity.
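As a hedged illustration of this rule, the sketch below fills in the internal-node weights by a post-order DFS and assigns the sample to the class whose description tree has the largest root weight. The dictionary-based node layout is purely illustrative and independent of any particular tree implementation.

# A sketch of the DFS step: given a tree in which only matched leaves carry
# weights (as in Algorithm 7.1), a post-order traversal fills in every internal
# node's weight as the sum over its subtree's leaves. The dict layout
# ("weight"/"children") is illustrative, not taken from the thesis.
def subtree_weight(node):
    if not node["children"]:                       # leaf: weight set during construction
        return node["weight"]
    node["weight"] = sum(subtree_weight(c) for c in node["children"])
    return node["weight"]

def classify(root_by_class):
    # root_by_class: dict mapping class label -> root of D(x, X^i)
    scores = {label: subtree_weight(root) for label, root in root_by_class.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    leaf = lambda w: {"weight": w, "children": []}
    tree_a = {"weight": 0.0, "children": [leaf(0.125), leaf(0.0), leaf(0.125), leaf(0.0)]}
    tree_b = {"weight": 0.0, "children": [leaf(0.125), leaf(0.125), leaf(0.125), leaf(0.125)]}
    print(classify({"a": tree_a, "b": tree_b}))    # prints "b"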

The description tree is a true data dependent measure of similarity and hence can be used

for clustering applications. It can also be used to evaluate the discriminating capability of a

training set by looking at the dissimilarity measure between the trees D(x,X i) for different test


patterns. Given two test samples xi and xj, a data dependent kernel between the samples can be computed by first building their description trees D(xi, X) and D(xj, X) and then using the tree kernel ideas discussed in Sections 4.5 and 5.8.

Using a leave one out procedure, the description tree of a training pattern can be constructed

using the rest of the training samples from the same class. The depth and branching levels of this

tree can be used for dataset reduction as follows: a pattern which generates a highly branched

and deep description tree is very dissimilar to other patterns already present in the training set

and hence must be retained. On the other hand, a training pattern which produces a shallow

description tree with a large weight on the root node is very similar to already existing patterns

in the training set and hence can be pruned. Such ideas can be effectively combined with other

dataset reduction techniques like those proposed in Vishwanathan and Murty (2002d,e).

7.6 An Image Kernel

In this section we present an algorithm to compute the image kernels defined in Section 4.7 in

quadratic time. We use a variant of the idea used to compute string kernels, employing a suffix tree of a two dimensional matrix to compute regions of maximum overlap.

7.6.1 Algorithm

Recall that the image kernel was defined as

\[
k(x, y) = \sum_{s \sqsubseteq x,\; s' \sqsubseteq y} w_s \, \delta_{s,s'}. \tag{7.1}
\]
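For reference, Equation (7.1) can be evaluated by brute force. The sketch below (an illustration only, far slower than the suffix tree method described next) restricts s to square sub-matrices as in the algorithm that follows, counts matching occurrences in y, and accumulates the weights, with w_s = λ^side chosen purely for concreteness.

# A brute-force reference for Equation (7.1), assuming x and y are square numpy
# arrays and s ranges over square sub-matrices. The weight w_s = lam ** side is
# an arbitrary illustrative choice; the thesis only requires some weight per
# sub-matrix. Cost is polynomial but far above the O(m^4) suffix-tree method.
import numpy as np

def image_kernel_bruteforce(x, y, lam=0.5):
    m = x.shape[0]
    n = y.shape[0]
    k = 0.0
    for side in range(1, min(m, n) + 1):
        w = lam ** side                      # illustrative weight w_s
        for a in range(m - side + 1):
            for b in range(m - side + 1):
                s = x[a:a + side, b:b + side]
                for c in range(n - side + 1):
                    for d in range(n - side + 1):
                        if np.array_equal(s, y[c:c + side, d:d + side]):
                            k += w
    return k

if __name__ == "__main__":
    x = np.array([[1, 0], [0, 1]])
    y = np.array([[1, 0], [1, 1]])
    print(image_kernel_bruteforce(x, y))     # 8 single-pixel matches * 0.5 = 4.0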

Given images x and y we first construct the two dimensional suffix tree S(y). For each a, b ∈ {1, 2, . . . , m} we compute the largest lab such that submatx(a, b, lab) ⊑ y, that is, the largest square sub-matrix of x anchored at (a, b) which also occurs in y. Computing each lab

requires us to walk down the edges of S(y) starting from the root and hence takes at most

O(m2) time. Since there are at most m2 such locations this computation can be carried out in

O(m4) time (quadratic in the size of the input). The following lemma establishes the relationship

between lab and sub-matrices of x which also occur in y.

Lemma 31 w is a sub-matrix of x and y iff w = submatx(a, b, l) such that l ≤ lab for some

1 ≤ a, b ≤ m.


Proof The proof is elementary. Let w be a sub-matrix of x. Then, there exist a, b and l such

that w = submatx(a, b, l). Assume that w is a sub-matrix of y also. If l > lab this violates the maximality of lab, and hence l ≤ lab.
Conversely, let w = submatx(a, b, l) such that l ≤ lab. Clearly w is a sub-matrix of x. We know that submatx(a, b, lab) is a sub-matrix of y and, since w is contained in its top-left corner, w is also a sub-matrix of y.

The following key theorem is used to compute image kernels efficiently.

Theorem 32 Let x and y be two dimensional images. Assume that
\[
W(a, b, t) := \sum_{s=1}^{t-u} w_{\mathrm{submat}_x(a,b,u+s)} - w_{\mathrm{submat}_x(a,b,u)}, \tag{7.2}
\]
where submatx(a, b, u) := ceiling(submatx(a, b, t)), can be computed in constant time for any a, b and t. Then k(x, y) can be computed in O(m^4) time as
\[
k(x, y) = \sum_{a,b=1}^{m} \mathrm{val}(\mathrm{submat}_x(a, b, l_{ab})), \tag{7.3}
\]
where val(submatx(a, b, t)) := lvs(submatx(a, b, v)) · W(a, b, t) + val(submatx(a, b, u)), and we define val(root) := 0 and submatx(a, b, v) := floor(submatx(a, b, t)).

Proof We first show that Equation (7.3) can indeed be computed in quadratic time. We know

that for S(y) the number of leaves can be computed in linear time by a simple DFS while lab

for all a, b ∈ {1, 2, . . . ,m} can be computed in O(m4) time. By assumption on W (a, b, t) and by

exploiting the recursive nature of val(submatx(a, b, t)) we can compute W (.) for all the nodes of

S(y) by a simple top down procedure in O(m2) time. Now, the computation of the summation

takes O(m2) time since there are m2 terms to sum up. Thus the total time complexity is O(m4).

Now, we prove that Equation (7.3) really computes the kernel. From Lemma 31, all sub-

matrices common to x and y can be described as submatx(a, b, l) such that l ≤ lab. For a

given 1 ≤ a, b ≤ m we know that lvs(floor(submatx(a, b, l))) gives the number of occurrences of

submatx(a, b, l) in y and hence the weight contribution due to submatx(a, b, l) should be counted

lvs(floor(submatx(a, b, l))) times. But, by definition val(submatx(a, b, lab)) computes the contri-

bution due to each submatx(a, b, l) for l ≤ lab. The kernel value is finally computed by taking

into account contributions due to submatx(a, b, l) for all 1 ≤ a, b ≤ m.
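To make the role of lab and lvs(·) concrete, the following sketch (again an illustration, with naive matching in place of the suffix-tree walk) computes each lab directly and accumulates, for every anchor (a, b), the weighted occurrence counts of all matching sizes l ≤ lab; with the same illustrative weights it returns the same value as the brute-force evaluation above.

# A naive sketch of the l_ab based evaluation: largest_match plays the role of
# the suffix-tree walk (largest square at anchor (a, b) of x that also occurs
# in y), and count_in_y plays the role of lvs(.). Weights w_s = lam ** side are
# the same illustrative choice as before; both evaluations agree.
import numpy as np

def count_in_y(s, y):
    # number of occurrences of the block s inside y (the role of lvs(.))
    r = s.shape[0]
    n = y.shape[0]
    return sum(np.array_equal(s, y[c:c + r, d:d + r])
               for c in range(n - r + 1) for d in range(n - r + 1))

def largest_match(x, y, a, b):
    # largest l such that x[a:a+l, b:b+l] occurs somewhere in y (this is l_ab)
    m, n = x.shape[0], y.shape[0]
    best = 0
    for l in range(1, min(m - a, m - b, n) + 1):
        if count_in_y(x[a:a + l, b:b + l], y) > 0:
            best = l
        else:
            break   # if a size fails, every larger square anchored here fails too
    return best

def image_kernel_lab(x, y, lam=0.5):
    m = x.shape[0]
    k = 0.0
    for a in range(m):
        for b in range(m):
            lab = largest_match(x, y, a, b)
            for l in range(1, lab + 1):
                k += (lam ** l) * count_in_y(x[a:a + l, b:b + l], y)
    return k

if __name__ == "__main__":
    x = np.array([[1, 0], [0, 1]])
    y = np.array([[1, 0], [1, 1]])
    print(image_kernel_lab(x, y))    # matches the brute-force value (4.0)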


7.6.2 Quadratic Time Prediction

Recall that for prediction in a Support Vector Machine we need to compute

\[
f(x) = \sum_{i=1}^{p} \alpha_i \, k(x_i, x),
\]

where p is the number of Support Vectors. This implies that we need to combine the contribution

due to sub-matrices of each Support Vector. Let X s denote the set of all Support Vectors. We

speed up prediction by constructing S(X s) and using it to compute f(x) in one shot.

In this new suffix tree, we associate the weight αi with each leaf corresponding to a Support Vector xi. For a node v ∈ nodes(S(X s)) we modify the definition of lvs(v) to be the sum of the weights associated with the leaves of the subtree rooted at node v. Using the Giancarlo (1995) algorithm we can

compute lab for each a, b in O(m2) time. Now, Theorem 32 can be applied unchanged using this

new definition of lvs(v) in order to compute the kernel. To see the correctness of the algorithm

we first use the following lemma:

Lemma 33 w is a sub-matrix of x and some y ∈ X s iff w = submatx(a, b, l) such that l ≤ lab

for some 1 ≤ a, b ≤ m.

We omit the proof since it is exactly analogous to that of Lemma 31. This lemma shows that

the suffix tree S(X s) helps us compute the set of all common sub-matrices of x and X s in O(m4)

time. Now, it suffices to rewrite f(x) as

\[
f(x) = \sum_{i=1}^{p} \alpha_i \sum_{s \sqsubseteq x,\; s' \sqsubseteq x_i} w_s \, \delta_{s,s'}
     = \sum_{s \sqsubseteq x,\; s' \sqsubseteq x_i} \sum_{i=1}^{p} (\alpha_i w_s) \, \delta_{s,s'}, \tag{7.4}
\]

and notice that each occurrence of a sub-matrix s instead of contributing a weight ws now

contributes a weight of αiws which is taken into account by the modified definition of lvs(v).

Hence, by Theorem 32 we know that the above algorithm computes f(x).

From Theorem 32 we can see that our algorithm computes f(x) in O(m4) time (i.e. it scales

quadratically since the size of the input is m2) and is independent of the number of Support

Vectors or their size. This is a vast improvement over conventional methods which require

O(pm4) time for prediction.
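The aggregation can be sketched in the same naive setting: folding the coefficients αi into the occurrence counts (the role of the modified lvs(·)) lets a single pass over the sub-matrices of x accumulate f(x). The asymptotic independence from the number of Support Vectors, however, comes only from the suffix tree S(X s), which the naive matcher below does not provide.

# A sketch of Equation (7.4) with naive matching in place of S(X^s): for every
# square sub-matrix of the test image x, the alpha-weighted occurrence count
# sum_i alpha_i * (#occurrences in x_i) replaces the plain count lvs(.), and a
# single accumulation yields f(x). The weights lam ** side are illustrative.
import numpy as np

def weighted_count(s, support_vectors, alphas):
    # role of the modified lvs(.): alpha-weighted number of occurrences of s
    r = s.shape[0]
    total = 0.0
    for sv, alpha in zip(support_vectors, alphas):
        n = sv.shape[0]
        for c in range(n - r + 1):
            for d in range(n - r + 1):
                if np.array_equal(s, sv[c:c + r, d:d + r]):
                    total += alpha
    return total

def predict(x, support_vectors, alphas, lam=0.5):
    m = x.shape[0]
    f = 0.0
    for side in range(1, m + 1):
        w = lam ** side
        for a in range(m - side + 1):
            for b in range(m - side + 1):
                f += w * weighted_count(x[a:a + side, b:b + side],
                                        support_vectors, alphas)
    return f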


7.7 Summary

We proposed a novel method to compactly represent the 2-d structural properties of hand-

written digit data using the suffix tree on matrices. This representation leads to a true data

dependent similarity measure for a test pattern which can be computed much faster than the

distance of a new data point from each point in the training set. Many applications of the above

ideas especially in the field of pattern recognition, machine learning and handwritten digit data

recognition are current topics of research and will be reported in subsequent publications.

We also proposed a novel kernel for images which takes into account their two dimensional

structure. We described a quadratic time algorithm for its computation. We showed how suffix

trees on matrices can help us reduce the time required for prediction to quadratic in the size of

the test image, independent of the size or the number of Support Vectors. We conjecture that

the time complexity for both the image kernel computation as well as prediction can be reduced

to log-linear by intelligently using the suffix tree on matrices. This conjecture is currently being

investigated by us.

Chapter 8

Summary and Future Work

In this chapter we summarize the main contributions of this thesis and indicate

avenues for extending our ideas. Some of the extensions are the focus of our current research.

8.1 Contributions of the Thesis

The main contributions of this thesis are as follows:

• A fast and efficient SVM training algorithm for the ℓ2 (quadratic soft margin) SVM formulation.

• A modified Cholesky factorization method for a special class of matrices.

• A general recipe for defining kernels on discrete objects which includes as special cases

many previously known kernels.

• The first linear time algorithm for computing string kernels and its extension to compute

kernels on trees.

• An algorithm to perform prediction in time linear (quadratic) in the size of the input

independent of the number of Support Vectors in the case of string (image) kernels.

• A novel method to connect dynamical systems and kernels thus providing a rich framework

to define kernels.

• A method to increase the density of samples in input space by creating meaningful virtual

samples.


The main contributions are summarized in Table 8.1.

Algorithm                         Input Size   Time Complexity   Space Complexity
Modified Cholesky Decomposition   n^2          O(nm^2)           O(m^2)
String Kernel                     n            O(n)              O(n)
Unordered Tree Kernel             n            O(n log n)        O(n)
Ordered Tree Kernel               n            O(n)              O(n)
Dynamical System Kernel           n^2          O(n^3)            O(n^2)
Image Kernel                      m^2          O(m^4)            O(m^2)

Table 8.1: Summary of various algorithms

8.2 Extensions and Future Work

The following immediately apparent extensions of work reported in this thesis are possible:

• Currently, SimpleSVM has been implemented only for the ℓ2 formulation. Extensions to the ℓ1 formulation are possible. A careful study of implementation strategies and the rate of convergence in the ℓ1 formulation are topics of research.

• A careful study of the use of a kernel cache and rank-degenerate kernel matrices to speed

up SimpleSVM seems promising. Such studies will reveal the computational and storage

gains as well as limitations of the SimpleSVM algorithm.

• A parallel SVM training algorithm can be implemented by integrating our parallel modified Cholesky decomposition with SimpleSVM.

• Error analysis of our modified Cholesky decomposition to show numerical stability is cur-

rently being worked out.

• Further research effort is needed to see if we can find a generic framework to embed discrete

structures in Hilbert spaces. The key question to ask is whether R-convolutions are sufficient or whether extensions are needed.

• Extending string kernels to incorporate mismatches is a very challenging problem. The

main difficulty stems from the fact that we need a way of specifying the contributions due


to mismatched substrings. The key problem is one of computational efficiency.

• Use of suffix trees on trees can be explored to efficiently define path kernels on trees.

Another important direction to investigate is whether we can do log-linear time subset tree matching.

• Many more special cases of the dynamical system kernels need to be studied, including kernels on pair NFAs, time series, arrays of dynamical systems, etc. The computational challenge is to reduce the time complexity from O(n^3).

• More quantitative experiments can be performed using jigsawing to study its behavior on

real-life datasets.

• We conjecture that the image kernel can be computed in less than quadratic time by

intelligently using the suffix tree on matrices. Coming up with such an intelligent algorithm

is an exciting research problem.

Appendix A

Rank One Modification

A.1 Rank Modification of a Positive Matrix

Rank one updates of symmetric positive definite matrices have been extensively studied, see e.g.,

Gill et al. (1974, 1975), Golub and Loan (1996), Horn and Johnson (1985). In this section we

consider the special case of Equation (2.4), that is (dropping the subscript of A for notational

convenience and writing the RHS in a somewhat more general form)

\[
\begin{pmatrix} 0 & y^\top \\ y & H \end{pmatrix}
\begin{pmatrix} b \\ \alpha \end{pmatrix}
=
\begin{pmatrix} 0 & y^\top \\ y & LDL^\top \end{pmatrix}
\begin{pmatrix} b \\ \alpha \end{pmatrix}
=
\begin{pmatrix} d_b \\ d_\alpha \end{pmatrix}. \tag{A.1}
\]

Here α, dα ∈ Rn, b, db ∈ R and H,L,D ∈ Rn×n. Furthermore, D is diagonal and L is a unit

diagonal lower triangular matrix. Both L and D are obtained from the positive semi-definite

matrix H via LDL> = H by standard methods such as a Cholesky factorization (see Chapter 3

for more details).

For the following rank modifications we assume that in addition to L and D we know the

expressions

\[
\begin{aligned}
s_1 &:= L^{-1} d_\alpha &\quad\text{and}\quad s_2 &:= H^{-1} d_\alpha = (L^\top)^{-1} D^{-1} s_1, \\
s_3 &:= L^{-1} y        &\quad\text{and}\quad s_4 &:= H^{-1} y = (L^\top)^{-1} D^{-1} s_3.
\end{aligned} \tag{A.2}
\]

They can be used directly to compute the solution of Equation (A.1) in O(n) time via

\[
b = \left( s_3^\top D^{-1} s_3 \right)^{-1} \left( y^\top s_2 - d_b \right)
\quad\text{and}\quad
\alpha = s_2 - s_4\, b. \tag{A.3}
\]
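As a numerical illustration (not part of the thesis), the following sketch builds a random positive definite H, forms the stored quantities of Equation (A.2) from its LDL^T factors, and checks that Equation (A.3) reproduces the solution of the bordered system (A.1) obtained by a direct solve.

# A numerical sanity check (illustration only) of Equations (A.2)-(A.3):
# H is a random symmetric positive definite matrix, L and D come from its
# Cholesky factor, and the O(n) formula for (b, alpha) is compared against a
# direct solve of the full (n+1) x (n+1) system in (A.1).
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)          # symmetric positive definite
y = rng.standard_normal(n)
d_alpha = rng.standard_normal(n)
d_b = rng.standard_normal()

# LDL^T factors of H obtained from the Cholesky factor C = L sqrt(D)
C = np.linalg.cholesky(H)
D = np.diag(C) ** 2
L = C / np.diag(C)

# stored quantities of Equation (A.2)
s1 = np.linalg.solve(L, d_alpha)
s2 = np.linalg.solve(L.T, s1 / D)
s3 = np.linalg.solve(L, y)
s4 = np.linalg.solve(L.T, s3 / D)

# Equation (A.3)
b = (y @ s2 - d_b) / (s3 @ (s3 / D))
alpha = s2 - s4 * b

# direct solve of the bordered system in Equation (A.1)
K = np.block([[np.zeros((1, 1)), y[None, :]], [y[:, None], H]])
sol = np.linalg.solve(K, np.concatenate(([d_b], d_alpha)))
print(np.allclose(b, sol[0]), np.allclose(alpha, sol[1:]))   # True True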


A.1.1 Rank One Extension

After a rank one extension of H, y and dα in Equation (2.4) the augmented system of equations is given by
\[
\begin{pmatrix} 0 & \bar{y}^\top \\ \bar{y} & \bar{H} \end{pmatrix}
\begin{pmatrix} b \\ \bar{\alpha} \end{pmatrix}
=
\begin{pmatrix} 0 & \bar{y}^\top \\ \bar{y} & \bar{L}\bar{D}\bar{L}^\top \end{pmatrix}
\begin{pmatrix} b \\ \bar{\alpha} \end{pmatrix}
=
\begin{pmatrix} d_b \\ \bar{d}_\alpha \end{pmatrix}, \tag{A.4}
\]
where
\[
\bar{d}_\alpha := \begin{pmatrix} d_\alpha \\ d_\alpha' \end{pmatrix}, \quad
\bar{y} := \begin{pmatrix} y \\ y' \end{pmatrix}, \quad \text{and} \quad
\bar{H} := \begin{pmatrix} H & h \\ h^\top & h' \end{pmatrix}.
\]
Here h′, y′, dα′ ∈ R are scalars and h ∈ Rn is a column vector (primes mark the newly appended entries and bars the augmented quantities). The augmented factors L̄ and D̄ can be computed using well known rank one updates

h ∈ Rn is a column vector. L and D can be computed using well known rank one updates

(Golub and Loan, 1996), and we obtain

\[
\bar{L}\bar{D}\bar{L}^\top =
\begin{pmatrix} L & \\ l^\top & 1 \end{pmatrix}
\begin{pmatrix} D & \\ & d \end{pmatrix}
\begin{pmatrix} L^\top & l \\ & 1 \end{pmatrix}, \tag{A.5}
\]
where l = D^{-1}L^{-1} h and d = h′ − l^T D l. Since L is lower triangular and D is diagonal, the augmented factors can be computed in O(n^2) time. Finally, we can compute the augmented versions of s1, . . . , s4 as follows:
\[
\begin{aligned}
s_1 &= \begin{pmatrix} s_{11} \\ s_{12} \end{pmatrix}
    = \begin{pmatrix} s_1^{\mathrm{old}} \\ d_\alpha' - l^\top s_1^{\mathrm{old}} \end{pmatrix}
&\quad\text{and}\quad
s_2 &= \begin{pmatrix} s_2^{\mathrm{old}} - (L^\top)^{-1} l\, s_{12}/d \\ s_{12}/d \end{pmatrix}, \\
s_3 &= \begin{pmatrix} s_{31} \\ s_{32} \end{pmatrix}
    = \begin{pmatrix} s_3^{\mathrm{old}} \\ y' - l^\top s_3^{\mathrm{old}} \end{pmatrix}
&\quad\text{and}\quad
s_4 &= \begin{pmatrix} s_4^{\mathrm{old}} - (L^\top)^{-1} l\, s_{32}/d \\ s_{32}/d \end{pmatrix}.
\end{aligned} \tag{A.6}
\]

The expensive part of the above updates is the computation of (L^T)^{-1} l, which takes O(n^2) operations. However, given (L^T)^{-1} l, updating s1, . . . , s4 takes O(n) time. Hence the entire

update can be carried out in O(n2) time. Also note that analogous relations hold for rank-v

modifications (with v > 1), the only difference being that in this case vectors become matrices

and scalars become vectors or matrices in an obvious fashion.
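The extension step can be checked numerically as well; the sketch below (illustration only, with h_scalar standing for the new diagonal entry h′) verifies that the updated factors of Equation (A.5) reproduce the augmented matrix.

# An illustrative check of the rank one extension (A.5): given the LDL^T
# factors of H and a new column (h, h_scalar), the update l = D^{-1} L^{-1} h,
# d = h_scalar - l^T D l reproduces the factors of the augmented matrix.
# h_scalar is an assumed name for the new diagonal entry; sketch only.
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)                 # positive definite block
h = rng.standard_normal(n)                  # new column
h_scalar = float(h @ np.linalg.solve(H, h)) + 1.0   # keeps the augmented matrix positive definite

# LDL^T factors of H from its Cholesky factor
C = np.linalg.cholesky(H)
D = np.diag(C) ** 2
L = C / np.diag(C)

# rank one extension (Equation (A.5))
l = np.linalg.solve(L, h) / D               # l = D^{-1} L^{-1} h
d = h_scalar - l @ (D * l)                  # d = h_scalar - l^T D l

L_bar = np.block([[L, np.zeros((n, 1))], [l[None, :], np.ones((1, 1))]])
D_bar = np.append(D, d)
H_bar = np.block([[H, h[:, None]], [h[None, :], np.array([[h_scalar]])]])
print(np.allclose(L_bar @ np.diag(D_bar) @ L_bar.T, H_bar))   # True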

A.1.2 Rank One Reduction

Assume that we want to delete the jth row and column from H and reduce Equation (A.1)

correspondingly. This operation occurs when we need to remove a point from the set of Support

Vectors. The following matrix manipulations are standard in numerical analysis and we refer


the reader to Golub and Loan (1996), Gill et al. (1974, 1975), Horn and Johnson (1985) for

further details. We begin with the decomposition of H into

\[
H =
\begin{pmatrix}
H_{11} & H_{12} & H_{13} \\
H_{21} & H_{22} & H_{23} \\
H_{31} & H_{32} & H_{33}
\end{pmatrix}
= L D L^\top
\quad \text{with} \quad
L =
\begin{pmatrix}
L_{11} & & \\
L_{21} & 1 & \\
L_{31} & L_{32} & L_{33}
\end{pmatrix}
\quad \text{and} \quad
D =
\begin{pmatrix}
D_1 & & \\
& D_2 & \\
& & D_3
\end{pmatrix}.
\]
Straightforward algebra shows that the LDL^T decomposition of the reduced matrix
\[
\begin{pmatrix} H_{11} & H_{13} \\ H_{31} & H_{33} \end{pmatrix}
= \bar{L} \bar{D} \bar{L}^\top
\quad \text{with} \quad
\bar{L} = \begin{pmatrix} L_{11} & \\ L_{31} & \bar{L}_{33} \end{pmatrix}
\quad \text{and} \quad
\bar{D} = \begin{pmatrix} D_1 & \\ & \bar{D}_3 \end{pmatrix}
\]
satisfies
\[
\bar{L}_{33} \bar{D}_3 \bar{L}_{33}^\top = L_{33} D_3 L_{33}^\top + L_{32} D_2 L_{32}^\top. \tag{A.7}
\]

Equation (A.7) can be solved in O(n2) operations (where H33 ∈ Rn×n): we first reduce Equa-

tion (A.7) to the problem of factorizing the rank-1 update of a diagonal matrix via

\[
\left( L_{33}^{-1} \bar{L}_{33} \right) \bar{D}_3 \left( L_{33}^{-1} \bar{L}_{33} \right)^\top
= D_3 + \left( L_{33}^{-1} L_{32} \right) D_2 \left( L_{33}^{-1} L_{32} \right)^\top. \tag{A.8}
\]

Here L33^{-1}L32 can be computed in O(n^2) time (L32 is a vector and L33 is lower triangular).

It turns out that a factorization of the RHS of Equation (A.8) can be obtained in O(n) space

and time (see Golub and Loan (1996), Horn and Johnson (1985)) and we obtain

\[
\tilde{L} \bar{D}_3 \tilde{L}^\top = D_3 + \left( L_{33}^{-1} L_{32} \right) D_2 \left( L_{33}^{-1} L_{32} \right)^\top
\quad \text{where} \quad
\big[\tilde{L}\big]_{ij} =
\begin{cases}
0 & \text{if } i < j \\
1 & \text{if } i = j \\
\zeta_i \beta_j & \text{if } i > j
\end{cases} \tag{A.9}
\]
What remains is to compute L̄33 = L33 L̃. In the following we show that this again can be done in O(n^2) time, despite the fact that we have to multiply two lower triangular matrices. We have
\[
\big[\bar{L}_{33}\big]_{ij} = \big[L_{33}\tilde{L}\big]_{ij}
= [L_{33}]_{ij} + \sum_{l=j+1}^{i} [L_{33}]_{il}\, \zeta_l \beta_j
= [L_{33}]_{ij} + \left( \sum_{l=j+1}^{i} [L_{33}]_{il}\, \zeta_l \right) \beta_j. \tag{A.10}
\]

Note that all possible terms in the above sum can be computed in O(n^2) time simply by starting with j = i − 1 and progressing to j = 1. Here each step requires O(1) operations.
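A small numerical illustration of Equation (A.7) (not part of the thesis): deleting a row and column from a positive definite H and refactoring, the trailing blocks of the old and new LDL^T factors satisfy the stated identity.

# A numerical illustration of Equation (A.7): deleting row/column j from H and
# refactoring, the trailing blocks of the old and new LDL^T factors satisfy
# L33_bar D3_bar L33_bar^T = L33 D3 L33^T + L32 D2 L32^T. Illustration only.
import numpy as np

def ldl(H):
    C = np.linalg.cholesky(H)
    return C / np.diag(C), np.diag(C) ** 2

rng = np.random.default_rng(2)
n, j = 7, 3                                   # delete the j-th row and column
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)

L, D = ldl(H)
keep = [i for i in range(n) if i != j]
L_bar, D_bar = ldl(H[np.ix_(keep, keep)])

# trailing blocks (indices after the deleted position)
L33, D3 = L[j + 1:, j + 1:], D[j + 1:]
L32, D2 = L[j + 1:, j], D[j]
L33_bar, D3_bar = L_bar[j:, j:], D_bar[j:]

lhs = L33_bar @ np.diag(D3_bar) @ L33_bar.T
rhs = L33 @ np.diag(D3) @ L33.T + np.outer(L32, L32) * D2
print(np.allclose(lhs, rhs))                  # True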

Now we turn to the updates required for s1, . . . , s4. We can write a stored quantity si as


si = [si1, sij, si3]^T, where sij is the jth element of the vector si. The updated values are given by
\[
\begin{aligned}
s_1 &= \begin{pmatrix} s_{11} \\ s^*_{13} \end{pmatrix}
    = \begin{pmatrix} s_{11} \\ \bar{L}_{33}^{-1}\left(\mathbf{1} - L_{31} s_{11}\right) \end{pmatrix}
&\qquad
s_2 &= \begin{pmatrix} s^*_{21} \\ s^*_{23} \end{pmatrix}
    = \begin{pmatrix} D_1^{-1} (L_{11}^\top)^{-1}\left(s_{11} - D_1 L_{31}^\top s^*_{23}\right) \\ \bar{D}_3^{-1} (\bar{L}_{33}^\top)^{-1} s^*_{13} \end{pmatrix} \\
s_3 &= \begin{pmatrix} s_{31} \\ s^*_{33} \end{pmatrix}
    = \begin{pmatrix} s_{31} \\ \bar{L}_{33}^{-1}\left(y_3 - L_{31} s_{31}\right) \end{pmatrix}
&\qquad
s_4 &= \begin{pmatrix} s^*_{41} \\ s^*_{43} \end{pmatrix}
    = \begin{pmatrix} D_1^{-1} (L_{11}^\top)^{-1}\left(s_{31} - D_1 L_{31}^\top s^*_{43}\right) \\ \bar{D}_3^{-1} (\bar{L}_{33}^\top)^{-1} s^*_{33} \end{pmatrix}
\end{aligned} \tag{A.11}
\]

Since L11 and L33 are triangular matrices the stored quantities can be updated in O(n2) time.

A.2 Rank-Degenerate Kernels

In the case of rank-degenerate kernels we can apply the ideas discussed in Chapter 3 in order to compute an LDL^T decomposition of K in O(nm^2) time. Rank-one updates of such a decompo-

sition can be performed in O(mn) time. Details can be found in Section 3.4 as well as in Smola

and Vishwanathan (2003). The procedure to update the stored quantities remains unchanged

in this case.

Bibliography

A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, And Tools.

Addison-Wesley Longman Publishing Co., Inc., Reading, MA, USA, 1986.

A. Amir, M. Farach, Z. Galil, R. Giancarlo, and K. Park. Dynamic dictionary matching.

Journal of Computer and System Sciences, 49(2):208–222, October 1994.

I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using ab-

stract syntax trees. In Proceedings of International Conference on Software Maintenance,

pages 368–377, Bethesda, Maryland, USA, November 1998. IEEE Press.

J. M. Bennett. Triangular factors of modified matrices. Numerical Mathematics, 7:217–

221, 1965.

K. P. Bennett and O. L. Mangasarian. Multicategory separation via linear programming.

Optimization Methods and Software, 3:27–39, 1993.

J. L. Bentley and M. I. Shamos. Divide-and-conquer in multidimensional space. In Pro-

ceedings of the 8th annual ACM symposium on Theory of computing, pages 220–230, 1976.

C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. URL

http://www.ics.uci.edu/~mlearn/MLRepository.html.

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin

classifiers. In D. Haussler, editor, Proceedings of the Annual Conference on Computational

Learning Theory, pages 144–152, Pittsburgh, PA, July 1992. ACM Press.

D. Breslauer. The suffix tree of a tree and minimizing sequential transducers. Theoretical

Computer Science, 191((1-2)):131–144, January 1998.


C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data

Mining and Knowledge Discovery, 2(2):121–167, 1998.

C. J. C. Burges and V. Vapnik. A new method for constructing artificial neural networks.

Interim technical report, ONR contract N00014-94-c-0186, AT&T Bell Laboratories, 1995.

A. Cannon, J. M. Ettinger, D. Hush, and C. Scovel. Machine learning with data dependent

hypothesis classes. Journal of Machine Learning Research, 2:335–358, February 2002.

G. Cauwenberghs and T. Poggio. Incremental and decremental support vector machine

learning. In T. K. Leen, T. G. Dietterich, and V. Tresp, editors, Advances in Neural

Information Processing Systems 13, pages 409–415. MIT Press, 2001.

W. I. Chang and E. L. Lawler. Sublinear approximate string matching and biological

applications. Algorithmica, 12(4/5):327–344, 1994.

R. Cole and R. Hariharan. Faster suffix tree construction with missing suffix links. In Pro-

ceedings of the Thirty Second Annual Symposium on the Theory of Computing, Portland,

OR, USA, May 2000. ACM.

M. Collins and N. Duffy. Convolution kernels for natural language. In T. G. Dietterich,

S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Sys-

tems 14, Cambridge, MA, 2001. MIT Press.

C. Cortes. Prediction of Generalization Ability in Learning Machines. PhD thesis, De-

partment of Computer Science, University of Rochester, 1995.

C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.

R. Courant and D. Hilbert. Methods of Mathematical Physics, volume 1. Interscience

Publishers, Inc, New York, 1953.

R. Courant and D. Hilbert. Methods of Mathematical Physics, volume 2. Interscience

Publishers, Inc, New York, 1962.

T. M. Cover and P. E. Hart. Nearest neighbor pattern classification. IEEE Transactions

on Information Theory, 13(1):21–27, 1967.


N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cam-

bridge University Press, Cambridge, UK, 2000.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete

Data via the EM Algorithm. Journal of the Royal Statistical Society B, 39(1):1–22, 1977.

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification and Scene

Analysis. John Wiley and Sons, New York, 2001. Second edition.

R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Proba-

bilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

B. Efron. The jacknife, the bootstrap, and other resampling plans. SIAM, Philadelphia,

1982.

B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, New

York, 1994.

Bradley Efron and Robert Tibshirani. Improvements on cross-validation: the .632+ boot-

strap method. Journal of the American Statistical Association, 92:548–560, 1997.

E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. A geometric framework for

unsupervised anomaly detection: Detecting intrusions in unlabeled data. In Proceedings

of the Workshop on Data Mining for Security Applications, Philadelphia, PA, November

2001. ACM, Kluwer.

B. S. Everitt. An Introduction to Latent Variable Models. Chapman and Hall, London,

1984.

William Feller. An Introduction To Probability Theory and Its Application, volume 1. John

Wiley and Sons, New York, 1950.

M. C. Ferris and T. S. Munson. Interior point methods for massive support vector ma-

chines. Data Mining Institute Technical Report 00-05, Computer Sciences Department,

University of Wisconsin, Madison, Wisconsin, 2000.

S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations.

Journal of Machine Learning Research, 2:243–264, Dec 2001. http://www.jmlr.org.


R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, New York, 1989.

R. Fletcher and M. J. D. Powell. On the modification of LDL> factorizations. Mathematics

of Computation, 28(128):1067–1087, 1974.

E. Fredkin. Trie memory. Communications of the ACM, 3(9):490–499, September 1960.

Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm.

Machine Learning, 37(3):277–296, 1999.

J. D. Gardiner, A. L. Laub, J. J. Amato, and C. B. Moler. Solution of the Sylvester

matrix equation AXB> +CXD> = E. ACM Transactions on Mathematical Software, 18

(2):223–231, 1992.

T. Gartner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In

Proceedings of the 19th International Conference on Machine Learning ICML, 2002.

R. Giancarlo. A generalization of the suffix tree to matrices,with applications. SIAM

Journal on Computing, 24(3):520–562, 1995.

R. Giancarlo and R. Grossi. On the construction of classes of suffix trees for square

matrices: Algorithms and applications. Information and Computation, 130:151–182, 1996.

R. Giancarlo and R. Grossi. Parallel construction and query of index data structures for

pattern matching and square matrices. Journal of Complexity, 15:30–71, 1997.

R. Giancarlo and R. Guaiana. On-line construction of two dimensional suffix trees. Journal

of Complexity, 15:72–127, 1999.

R. Giegerich and S. Kurtz. From Ukkonen to McCreight and Weiner: A unifying view of

linear-time suffix tree construction. Algorithmica, 19(3):331–353, 1997.

P. E. Gill, G. H. Golub, W. Murray, and M. A. Saunders. Methods for modifying matrix

factorizations. Mathematics of Computation, 28(126):505–535, April 1974.

P. E. Gill, W. Murray, and M. A. Saunders. Methods for computing and modifying the

LDV factors of a matrix. Mathematics of Computation, 29(132):1051–1077, October 1975.


P. E. Gill, W. Murray, M. A. Saunders, and M. H. Wright. Inertia-controlling methods

for general quadratic programming. SIAM Review, 33(1), March 1991.

A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing.

In M. P. Atkinson, M. E. Orlowska, P. Valduriez, S. B. Zdonik, and M. L. Brodie, editors,

Proceedings of the 25th VLDB Conference, pages 518–529, Edinburgh, Scotland, 1999.

Morgan Kaufmann.

D. Goldfarb and K. Scheinberg. A product-form Cholesky factorization method for han-

dling dense columns in interior point methods for linear programming. Technical report,

IBM Watson Research Center, Yorktown Heights, 2001.

G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press,

Baltimore, MD, 3rd edition, 1996.

F. C. Graham. Logarithmic Sobolev techniques for random walks on graphs. In D. A.

Hejhal, J. Friedman, M. C. Gutzwiller, and A. M. Odlyzko, editors, Emerging Applications

of Number Theory, number 109 in IMA Volumes in Mathematics and its Applications,

pages 175–186. Springer, 1999. ISBN 0-387-98824-6.

M. Gribskov and N. L. Robinson. Use of receiver operating characteristic (ROC) analysis

to evaluate sequence matching. Computers and Chemistry, 20(1):25–33, 1996.

R. Grossi and G. Italiano. Suffix trees and their applications in string algorithms. In

Proceedings of the 1st South American Workshop on String Processing (WSP 1993), pages

57–76, September 1993.

D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Compu-

tational Biology. Cambridge University Press, June 1997. ISBN 0-521-58519-8.

Y. Hamamoto, S. Uchimura, and S. Tomita. A bootstrap technique for nearest neighbor

classifier design. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(1):

73–79, January 1997.

D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-

99-10, Computer Science Department, UC Santa Cruz, 1999.


S. Haykin. Neural Networks : A Comprehensive Foundation. Macmillan, New York, 1994.

R. Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2002.

M. Herbster. Learning additive models online with fast evaluating kernels. In D. P.

Helmbold and B. Williamson, editors, Proceedings of the Fourteenth Annual Conference

on Computational Learning Theory (COLT), volume 2111 of Lecture Notes in Computer

Science, pages 444–460. Springer, 2001.

M. W. Hirsch and S. Smale. Differential equations, dynamical systems, and linear algebra.

Academic Press, New York, 1974.

J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages and Com-

putation. Addison-Wesley, Reading, Massachusetts, first edition, 1979.

T. Hopkins. Remark on algorithm 705: A fortran-77 software package for solving the

Sylvester matrix equation AXB> + CXD> = E. ACM Transactions on Mathematical

Software (TOMS), 28(3):372–375, 2002.

R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, Cambridge,

1985.

P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse

of dimensionality. In Proceedings of the 30th Symposium on Theory of Computing, pages

604–613, 1998.

T. S. Jaakkola, M. Diekhans, and D. Haussler. Using the Fisher kernel method to detect

remote protein homologies. In Proceedings of the International Conference on Intelligence

Systems for Molecular Biology, pages 149–158. AAAI Press, 1999.

T. S. Jaakkola, M. Diekhans, and D. Haussler. A discriminative framework for detecting

remote protein homologies. Journal of Computational Biology, 7:95–114, 2000.

T. Joachims. Text categorization with support vector machines: Learning with many

relevant features. In Proceedings of the European Conference on Machine Learning, pages

137–142, Berlin, 1998. Springer.


T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges,

and A. J. Smola, editors, Advances in Kernel Methods—Support Vector Learning, pages

169–184, Cambridge, MA, 1999. MIT Press.

T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory,

and Algorithms. The Kluwer International Series In Engineering And Computer Science.

Kluwer Academic Publishers, Boston, May 2002. ISBN 0-7923-7679-X.

T. Joachims, N. Cristianini, and J. Shawe-Taylor. Composite kernels for hypertext cate-

gorisation. In C. Brodley and A. Danyluk, editors, Proceedings of the 18th International

Conference on Machine Learning (ICML), pages 250–257, San Francisco, US, 2001. Mor-

gan Kaufmann.

L. Kaufman. Solving the quadratic programming problem arising in support vector clas-

sification. In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel

Methods—Support Vector Learning, pages 147–168, Cambridge, MA, 1999. MIT Press.

S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. A fast iterative

nearest point algorithm for support vector machine classifier design. Technical Report

Technical Report TR-ISL-99-03, Indian Institute of Science, Bangalore, 1999. URL http:

//guppy.mpe.nus.edu.sg/~mpessk/npa_tr.ps.gz.

S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, and K. R. K. Murthy. A fast iterative

nearest point algorithm for support vector machine classifier design. IEEE Transactions

on Neural Networks, 11(1):124–136, January 2000.

D. K. Kim and K. Park. Linear-time construction of two-dimensional suffix trees. In Pro-

ceedings of the 26th International Colloquium on Automata, Languages and Programming

(ICALP), volume 1644 of LNCS, pages 463–472, Prague, Czech, July 1999. Springer.

D. E. Knuth. The Art of Computer Programming: Fundamental Algorithms, volume 1.

Addison-Wesley, Reading, Massachusetts, second edition, 1998a.

D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3.

Addison-Wesley, Reading, Massachusetts, second edition, 1998b.


Ron Kohavi. A study of cross validation and bootstrap for accuracy estimation and model

selection. In Proceedings of the International Joint Conference on Neural Networks, 1995.

R. S. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures.

In Proceedings of the ICML, 2002.

Y. LeCun, L. D. Jackel, L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A.

Muller, E. Sackinger, P. Simard, and V. Vapnik. Learning algorithms for classification: A

comparison on handwritten digit recognition. Neural Networks, pages 261–276, 1995.

E. Leopold and J. Kindermann. Text categorization with support vector machines: How

to represent text in input space? Machine Learning, 46(3):423–444, March 2002.

C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM

protein classification. In Proceedings of the Pacific Symposium on Biocomputing, pages

564–575, 2002a.

C. Leslie, E. Eskin, J. Weston, and W. S. Noble. Mismatch string kernels for SVM protein

classification. In Proceedings of Neural Information Processing Systems 2002, 2002b. in

press.

H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification

using string kernels. Journal of Machine Learning Research, 2:419–444, February 2002.

D. G. Luenberger. Introduction to Dynamic Systems: Theory, Models, and Applications.

John Wiley and Sons, Inc., New York, USA, May 1979. ISBN 0-471-02594-1.

L. M. Manevitz and M. Yousef. One-class SVMs for document classification. Journal of

Machine Learning Research, 2:139–154, December 2001.

O. L. Mangasarian. Nonlinear Programming. McGraw-Hill, New York, 1969.

O. L. Mangasarian and D. R. Musicant. Lagrangian support vector machines. Journal of

Machine Learning Research, 1:161–177, 2001. http://www.jmlr.org.

E. M. McCreight. A space-economical suffix tree construction algorithm. Journal of the

ACM, 23(2):262–272, April 1976.


P. Niyogi, F. Girosi, and T. Poggio. Incorporating prior knowledge in machine learning

by creating virtual examples. Proceedings of the IEEE, 86(11):2196–2209, November 1998.

K. Oflazer. Error-tolerant retrieval of trees. IEEE Transactions on Pattern Analysis and

Machine Intelligence, 19(12):1376–1380, December 1997.

M. Opper and O. Winther. Gaussian processes and SVM: Mean field and leave-one-out.

In A. J. Smola, P. L. Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in

Large Margin Classifiers, pages 311–326, Cambridge, MA, 2000. MIT Press.

E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector

machines. In J. Principe, L. Gile, N. Morgan, and E. Wilson, editors, Neural Networks

for Signal Processing VII—Proceedings of the 1997 IEEE Workshop, pages 276–285, New

York, 1997. IEEE.

A. Parker and J. O. Hamblen. Computer algorithms for plagiarism detection. IEEE

Transactions on Education, 32(2):94–99, May 1989.

J. Platt. Fast training of support vector machines using sequential minimal optimization.

In B. Scholkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods—

Support Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.

M. Revow, C. K.I. Williams, and G. E. Hinton. Using generative models for handwritten

digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18

(6):592–606, 1996.

D. Roobaert. DirectSVM: A simple support vector machine perceptron. In Neural Net-

works for Signal Processing X—Proceedings of the 2000 IEEE Workshop, pages 356–365,

New York, December 2000. IEEE.

B. Scholkopf. Support Vector Learning. R. Oldenbourg Verlag, Munchen, 1997. Doktorar-

beit, TU Berlin. Download: http://www.kernel-machines.org.

B. Scholkopf, A. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algo-

rithms. Neural Computation, 12:1207–1245, 2000.

B. Scholkopf and A. J. Smola. Learning with Kernels. MIT Press, 2002.


T. Seol and S. Park. Solving linear systems in interior-point methods. Computers and

Operations Research, 29:317–326, 2002.

John Shawe-Taylor, Peter L. Bartlett, Robert C. Williamson, and Martin Anthony. Struc-

tural risk minimization over data-dependent hierarchies. IEEE Transactions on Informa-

tion Theory, 44(5):1926–1940, 1998.

Kristy Sim. Context kernels for text categorization. Master’s thesis, The Australian

National University, Canberra, Australia, ACT 0200, June 2001.

A. J. Smola and B. Scholkopf. A tutorial on support vector regression. NeuroCOLT

Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK,

1998.

A. J. Smola and B. Scholkopf. Sparse greedy matrix approximation for machine learning.

In P. Langley, editor, Proceedings of the International Conference on Machine Learning,

pages 911–918, San Francisco, 2000. Morgan Kaufmann Publishers.

A. J. Smola and S. V. N. Vishwanathan. Cholesky factorization for rank-k modifications

of diagonal matrices. SIAM Journal of Matrix Analysis, 2003. in preparation.

G. W. Stewart. Decompositional approach to matrix computation. Computing in Science

and Engineering, 2(1):50–59, February 2000.

J. Stoer and R. Bulirsch. Introduction to Numerical Analysis. Springer, New York, second

edition, 1993.

G. Strang. Introduction to Linear Algebra. Wellesley-Cambridge Press, August 1998.

E. Ukkonen. On-line construction of suffix trees. Algorithmica, 14(3):249–260, 1995.

P. M. Vaidya. An O(n log(n)) algorithm for the all-nearest-neighbors problem. Discrete

and Computational Geometry, 4(2):101–115, January 1989.

R. J. Vanderbei. LOQO: An interior point code for quadratic programming. TR SOR-94-

15, Statistics and Operations Research, Princeton Univ., NJ, 1994.

R. J. Vanderbei. Linear Programming: Foundations and Extensions. Kluwer Academic,

Hingham, 1997.


V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.

V. Vapnik and A. Chervonenkis. Theory of Pattern Recognition [in Russian]. Nauka,

Moscow, 1974. (German Translation: W. Wapnik & A. Tscherwonenkis, Theorie der

Zeichenerkennung, Akademie-Verlag, Berlin, 1979).

S. V. N. Vishwanathan and M. N. Murty. Geometric SVM: A fast and intuitive SVM

algorithm. In Proceedings of the ICPR, 2002a. accepted.

S. V. N. Vishwanathan and M. N. Murty. Jigsawing: A method to generate virtual

examples in OCR data. In Proceedings of the Second International Conference on Hybrid

Intelligent Systems, 2002b. To appear.

S. V. N. Vishwanathan and M. N. Murty. Jigsawing: A method to generate virtual

examples in OCR data. Pattern Recognition, 2002c. under preparation.

S. V. N. Vishwanathan and M. N. Murty. Use of MPSVM for data set reduction. In

A. Abraham and M. Koeppen, editors, Hybrid Information Systems, Heidelberg, 2002d.

Physica Verlag.

S. V. N. Vishwanathan and M. N. Murty. Use of MPSVM for data set reduction. In

A. Abraham, L. Jain, and J. Kacprzyk, editors, Recent Advances in Intelligent Paradigms

and Applications, volume 113 of Studies in Fuzziness and Soft Computing, chapter 16.

Springer Verlag, Berlin, November 2002e.

S. V. N. Vishwanathan and A. J. Smola. Kernels on structured objects. Technical report,

Australian National University, RSISE, 2002. in preparation.

P. Viswanath and M. N. Murty. An efficient incremental mining algorithm for compact

realization of prototypes. Technical Report IISC-CSA-2002-2, CSA Department, Indian

Institute of Science, Bangalore, India, January 2002.

C. Watkins. Dynamic alignment kernels. In A. J. Smola, P. L. Bartlett, B. Scholkopf, and

D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 39–50, Cambridge,

MA, 2000. MIT Press.


P. Weiner. Linear pattern matching algorithms. In Proceedings of the IEEE 14th Annual

Symposium on Switching and Automata Theory, pages 1–11, The University of Iowa, 1973.

IEEE.

E. Weyer. System Identification in the Behavioural Framework. PhD thesis, The Norwegian

Institute of Technology, Trondheim, 1992.

J. C. Willems. From time series to linear system. I. Finite-dimensional linear time invariant

systems. Automatica J. IFAC, 22(5):561–580, 1986a.

J. C. Willems. From time series to linear system. II. Exact modelling. Automatica J.

IFAC, 22(6):675–694, 1986b.

J. C. Willems. From time series to linear system. III. Approximate modelling. Automatica

J. IFAC, 23(1):87–115, 1987.

C. K. I. Williams and M. Seeger. The effect of the input density distribution on kernel-

based classifiers. In P. Langley, editor, Proceedings of the International Conference on

Machine Learning, pages 1159–1166, San Francisco, California, 2000. Morgan Kaufmann

Publishers.

T. Zhang. Some sparse approximation bounds for regression problems. In Proc. 18th Inter-

national Conf. on Machine Learning, pages 624–631. Morgan Kaufmann, San Francisco,

CA, 2001.