rankrc: large-scale nonlinear rare class rankingtfcolema/articles... · at the data level, various...
Post on 08-Jun-2020
1 Views
Preview:
TRANSCRIPT
RankRC: Large-scale Nonlinear Rare Class Ranking1
Aditya Tayala,1,∗, Thomas F. Colemanb,1,2, Yuying Lia,12
aCheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada N2L 3G13bCombinatorics and Optimization, University of Waterloo, Waterloo, ON, Canada N2L 3G14
Abstract5
Rare class problems are common in real-world applications across a wide range of domains. Stan-6
dard classification algorithms are known to perform poorly in these cases, since they focus on7
overall classification accuracy. In addition, we have seen an explosion of data in recent years,8
resulting in many large scale rare class problems. In this paper, we consider nonlinear kernel9
based classification methods expressed as a regularized loss minimization problem. We address10
challenges associated with both rare class problems and large scale learning, by 1) optimizing area11
under curve of the receiver of operator characteristic in the training process, instead of classifica-12
tion accuracy and 2) using a rare class kernel representation to achieve an efficient time and space13
algorithm. We call our algorithm RankRC. We provide heuristic and theoretical justification for14
the rare class representation, and experimentally illustrate the effectiveness of RankRC in both test15
performance and computational complexity on several datasets.16
Keywords: rare class, large scale, nonlinear kernel, receiver operator characteristic, ranking svm17
1. Introduction18
In many classification problems samples from one class are extremely rare (the minority class),19
while the number of samples belonging to the other class are plenty (the majority class). This20
situation is known as the rare class problem. It is also referred to as an unbalanced or skewed21
class distribution problem. Rare class problems naturally arise in several application domains,22
for example, fraud detection, customer churn, intrusion detection, fault detection, credit default,23
insurance risk and medical diagnosis.24
Standard classification methods perform poorly when dealing with unbalanced data, e.g. sup-25
port vector machines (SVM) [1, 2, 3], decision trees [4, 5, 1, 6], neural networks [1], Bayesian26
networks [7], and nearest neighbor methods [4, 8]. Most classification algorithms are driven by ac-27
curacy (i.e. minimizing error). Since minority examples constitute a small proportion of the data,28
∗Corresponding authorEmail addresses: amtayal@uwaterloo.ca (Aditya Tayal), tfcoleman@uwaterloo.ca (Thomas F.
Coleman), yuying@uwaterloo.ca (Yuying Li)1All three authors acknowledge funding from the National Sciences and Engineering Research Council of Canada2This author acknowledges funding from the Ophelia Lazaridis University Research Chair. The views expressed
herein are solely from the authors.
Preprint submitted to Pattern Recognition September 11, 2013
they have little impact on accuracy or total error. Thus majority examples overshadow the minor-29
ity class, resulting in models which are heavily biased in recognizing the majority class. Also,30
errors from different classes are assumed to have the same costs, which is usually not true. In most31
problems, incorrect classification of the rare class is more expensive, for instance, diagnosing a32
malignant tumor as benign has more severe consequences than the contrary case.33
Solutions to the class imbalance problem have been proposed at both the data and algorithm34
level. At the data level, various resampling techniques are used to balance class distribution,35
including random under-sampling of majority class instances [9], over-sampling minority class36
instances with new synthetic data generation [10], and focused resampling, in which samples37
are chosen based on additional criteria [8]. Although sampling approaches have been showed38
to achieve success in some applications, they are known to have drawbacks, for instance under-39
sampling can eliminate useful information, while over-sampling can result in overfitting. At the40
algorithm level, solutions are proposed by adjusting the algorithm itself. This usually involves ad-41
justing the costs of the classes to counter the class imbalance (cost-sensitive learning) or adjusting42
the decision threshold. However, true error costs are often unknown and using an inaccurate cost43
model can lead to additional bias.44
In this paper we focus on nonlinear kernel based classification methods expressed as a regu-45
larized loss minimization problem. In recent years, we have seen an explosion of data, resulting46
in many large scale rare class problems. For example, detecting unauthorized use of a credit card47
from millions of transactions. Processing large datasets can be prohibitive for many nonlinear ker-48
nel algorithms, which scale quadratically to cubically in the number of examples and may require49
quadratic space as well.50
To address the challenges associated with rare class problems and large scale learning we51
propose the following:52
1. Instead of maximizing accuracy (minimizing error), we optimize area under curve (AUC)53
of the receiver operator characteristic. The AUC overcomes inadequacies of accuracy for54
unbalanced problems and provides a skew independent measure. It is often used as the eval-55
uation metric for unbalanced problems and therefore it is appropriate to directly optimize56
it in the training process. This results in a regularized biclass ranking problem, which is a57
special case of RankSVM with two ordinal levels [11].58
2. To solve a kernel RankSVM problem in the dual, as originally proposed in [11], requires59
O(m6) time and O(m4) space, where m is the number of data samples. Recently, Chapelle and60
Keerthi [12] proposed a primal approach to solve RankSVM, which results in O(m3) time61
and O(m2) space, for nonlinear kernels. We propose a modification to kernel RankSVM, that62
takes specific advantage of the unbalanced nature of the problem, to achieve O(mm+) time63
and O(mm+) space, where m+ is the number of rare class examples. The idea is to restrict64
the solution to a linear combination of rare class kernel functions. We call it RankRC, since65
it enforces a rare class representation. We present heuristic and theoretical justification66
for this choice. Specifically, we show RankRC is optimal with respect to RankSVM for67
skewed data when a subset of kernel functions are used. We also draw connections to the68
Nyström approximation method. Several of our results are general and can be applied to69
other regularized loss minimization problems.70
2
The rest of the paper is organized as follows. Sections 2 and 3 review the AUC measure and71
RankSVM. Section 4 develops the RankRC problem. Section 5 outlines the optimization method72
used to solve RankRC. Section 6 empirically compares RankRC with other kernel methods on73
several datasets. Finally, Section 7 concludes with summary remarks and potential extensions.74
2. ROC Curve75
Evaluation metrics play an important role in learning algorithms. They provide ways to76
assess performance as well as guide modeling. For classification problems, error rate is the77
most commonly used metric. For simplicity, we will consider the two-class case. Let D =78
{(x1,y1), (x2,y2), ..., (xm,ym)} be a set of m training examples, where xi ∈ X ⊆ Rd , yi ∈ {+1,−1}.79
Denote f (x) as the inductive hypothesis obtained by training on example set D. Then error rate is80
defined as,81
ErrorRate =1m
m∑i=1
I[ f (xi) 6= yi] , (1)
where I[p] denotes the indicator function and is equal to 1 if p is true, 0 if p is false. However, for82
highly unbalanced datasets, error rate is not appropriate since it can be biased toward the majority83
class [13, 14, 15, 16]. In this paper, we follow convention and set the minority class as positive and84
the majority class as negative. Consider a dataset that has 1 percent positive cases and 99 percent85
negative ones. A naive solution which assigns every example to be positive will obtain only 186
percent error rate. Indeed, classifiers that always predict the majority class can obtain lower error87
rates than those that predict both classes equally well. But clearly these are not useful hypotheses.88
Classification performance can be represented by a confusion matrix as in Table 1, with m+89
denoting the number of majority examples and m− the number of minority ones. The proportion90
of the two rows reflects class distribution and any performance measure that uses values from both91
rows will be sensitive to class skew.92
Predictedf (x) = +1 f (x) = −1 Total
Actualy = +1 True Positives (TP) False Positives (FP) m+
y = −1 False Negatives (FN) True Negatives (TN) m−
Table 1: Add caption
The Receiver Operating Characteristic (ROC) can be used to obtain a skew independent mea-93
sure [13, 17, 18]. Most classifiers intrinsically output a numerical score and a predicted label is94
obtained by thresholding the score. For example, a threshold of zero leads to taking the sign of the95
numerical output as the label. Each threshold value generates a confusion matrix with different96
quantities of false positives and negatives. The ROC graph is obtained by plotting the true posi-97
tive rate (number of true positives divided by m+) against the false positive rate (number of false98
3
positives divided by m−) as the threshold level is varied (see Figure 1). It depicts the trade-off99
between benefits (true positive) and costs (false positives) for different choices of the threshold.100
Thus it does not depend on a priori knowledge of the costs associated with misclassification. A101
ROC curve that dominates another provides a better solution at any cost point.102
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
False Positive Rate (costs)
Tru
e P
ositiv
e R
ate
(benefits
)
A
B
C
Figure 1: Example ROC curves. Curve A dominates B and curve B dominates C. Curve C has anAUC of 0.5 and indicates a model with no discriminative value.
To facilitate comparison, it is convenient to characterize ROC curves using a single measure.103
The area under a ROC curve (AUC) can be used for this purpose. It is the average performance of104
the model across all threshold levels and corresponds to the Wilcoxon rank statistic [19]. The AUC105
can be obtained by forming the ROC curve and using the trapezoid rule to compute area. Also,106
given the intrinsic output of a hypothesis, f (x), we can directly compute the AUC by counting107
pairwise correct rankings [20]:108
AUC =1
m+m−
∑{i:yi=+1}
∑{ j:y j=−1}
I(
f (xi)> f (x j)). (2)
Incorporating the AUC in the modeling process leads to a biclass ranking problem, as discussed109
in the following section.110
3. RankSVM111
The modeling process can usually be expressed as an optimization problem involving a loss112
function and a penalty on complexity (e.g. regularization term). For most classification problems,113
since the performance measure is error rate, it is natural to consider minimizing the empirical114
error rate (1) as the loss function. In practice, I[·] is often replaced with a convex approximation115
such as the hinge loss, logistic loss or exponential loss [21]. Specifically, using the hinge loss,116
`h(z) = max(0,1−z),with `2-regularization leads to the well known support vector machine (SVM)117
formulation [22, 23],118
4
minw∈Rd
1m
m∑i=1
`h(yiwT xi
)+λ
2‖w‖2
2 , (3)
where λ ∈ R+ is a parameter that controls complexity and the hypothesis, f (x) = wT x, is assumed119
linear in the input space X . Since SVMs try to minimize error rate, they can lead to ineffective120
class boundaries when dealing with highly skewed datasets, with resulting solutions biased toward121
the majority concept [3]. The literature contains several approaches to remedy this problem. Most122
prevalent are sampling methods and cost-sensitive learning. However, these approaches explicitly123
or implicitly fix the relative costs of misclassification. When the true costs are unknown, this can124
lead to suboptimal solutions.125
Instead of minimizing error rate, we consider optimizing AUC as a natural way to deal with126
imbalance. Indeed, if we measure performance using AUC, it is preferable to optimize this quan-127
tity directly during the training process. In the AUC formula given in (2), we replace I[·] with the128
hinge loss to obtain a convex ranking loss function. Thus we solve the following regularized loss129
minimization problem:130
minw∈Rd
1m+m−
∑{i:yi=+1}
∑{ j:y j=−1}
`h(wT xi − wT x j
)+λ
2‖w‖2
2 . (4)
Problem (4) is a special case of RankSVM proposed by Herbrich et al. [11] with two ordinal131
levels. Like SVM, RankSVM leads to a dual problem which can be expressed in terms of dot-132
products between input vectors. This allows us to obtain a non-linear function through the kernel133
trick [22], which consists of using a kernel function, k : X ×X → R, that corresponds to a feature134
map, φ : X → F ⊆ Rd′ , such that ∀u,v ∈ X , k(u,v) = φ(u)Tφ(v). Here, k directly computes the135
inner product of two vectors in a potentially high-dimensional feature space F , without the need136
to explicitly form the mapping. Consequently, we can replace all occurrences of the dot-product137
with k in the dual and work implicitly in space F .138
However, since there is a Lagrange multiplier for each constraint associated with the hinge loss,139
the dual formulation leads to a problem in m+m− = O(m2) variables. Assuming the optimization140
procedure has cubic complexity in the number of variables, the complexity of the dual method141
becomes O(m6), which is unreasonable for even medium sized datasets.142
As noted by Chapelle [24], Chapelle and Keerthi [12], we can also solve the primal problem143
in the implicit feature space due to the Representer Theorem [25, 26]. This theorem states that the144
solution of any regularized loss minimization problem in F can be expressed as a linear combina-145
tion of kernel functions evaluated at the training samples, k(xi, ·), i = 1, ...,m. Thus, the solution of146
(4) in F can be written as:147
w =m∑
i=1
βik(xi, ·) , and f (x) =m∑
i=1
βik(xi,x) . (5)
5
Substituting (5) in (4) we can express the primal problem in terms of β:148
minβ∈Rm
1m+m−
∑{i:yi=+1}
∑{ j:y j=−1}
`h
(m∑
r=1
βrk(xr,xi) −
m∑r=1
βrk(xr,x j)
)+λ
2
m∑i, j=1
βiβ jk(xi,x j) ,
or more simply,149
minβ∈Rm
1m+m−
∑{i:yi=+1}
∑{ j:y j=−1}
`h
(βT Ki −βT K j
)+λ
2βT Kβ , (6)
where K ∈ Rm×m is the kernel matrix, Ki j = k(xi,x j), and Ki denotes the ith row of K. To be able150
to solve (6) using unconstrained optimization methods such as gradient descent, we require the151
objective to be differentiable. We replace the hinge loss, `h, with an ε-smoothed differentiable
−5 0 5
0
1
2
3
4
5
6
z
ℓ(z)
Hinge Loss
Smoothed Hinge
Figure 2: The smoothed hinge is a differentiable approximation of the hinge loss. Here thesmoothed hinge is shown with ε = 0.5.
152approximation, `ε, defined as,153
`ε(z) =
(1 − ε) − z if z< 1 − 2ε14ε (1 − z)2 if 1 − 2ε≤ z< 10 if z≥ 1 ,
which transitions from linear cost to zero cost using a quadratic segment (see Figure 2) and pro-154
vides similar benefits as the hinge loss. Now we can solve (6) using standard unconstrained opti-155
mization techniques. Since there are m variables, Newton’s method would for example take O(m3)156
operations to converge.157
6
RankSVM is popular in the information retrieval community, where linear models are the norm158
[e.g. see 27]. For a linear model, with d-dimension input vectors, the complexity of RankSVM159
can be reduced to O(md + m logm) [12]. However, many rare class problems require a nonlinear160
function to achieve optimal results. But solving a nonlinear RankSVM in O(m3) time may not be161
practical for mid- to large-sized datasets. Moreover, the method requires O(m2) space to store the162
kernel matrix. We believe this complexity is, in part, the reason why nonlinear RankSVMs are not163
commonly used to solve rare class problems.164
In the next section we propose a modification to nonlinear RankSVMs that takes specific ad-165
vantage of the unbalanced nature of the problem to achieve O(mm+) time and O(mm+) space, while166
not sacrificing performance.167
4. RankRC: Ranking with Rare Class Representation168
For highly unbalanced datasets, to make SVM computational feasible for large scale problems,169
we propose a rare class (RC) based method. Specifically we propose a RC based method, which170
restricts the solution to the form171
f (x) =∑{i:yi=+1}
βik(xi,x) , (7)
so it consists only of kernel function realizations of the minority class.172
Next we present motivate form (7) by assuming specific properties of the class conditional173
distributions and kernel function. Zhu et al. [28] make use of similar assumptions, however, in174
their method they attempt to directly estimate the likelihood ratio. In contrast, we use a regularized175
loss minimization approach.176
Recall that the optimal ranking function for a classification problem is the posterior probability,177
P(y = 1|x), since it minimizes the Bayes risk for arbitrary costs. From Bayes’ Theorem, we have178
P(y = 1|x) =P(y = 1)P(x|y = 1)
P(y = 1)P(x|y = 1) + P(y = −1)P(x|y = −1). (8)
Any monotonic transformation of (8) also yields equivalent ranking capability. Dividing the nu-179
merator and denominator of (8) by P(y = −1)P(x|y = −1), we note that P(y = 1|x) is a monotonic180
transformation of the likelihood ratio, denoted as181
f (x) =P(x|y = 1)
P(x|y = −1), (9)
which is the ranking function we wish to obtain. Now, if we assume that the conditional density,182
P(x|y = 1), is a mixture of m+ identical spherical normals centered at the rare class examples, we183
can write184
7
(a)
P(x|y=1)
P(x|y=−1)
Approximatelyconstant locally
(b)
Figure 3: (a) An example of a rare class dataset. Red ‘·’s indicate negative (majority) examples andblack ‘+’s indicate positive (minority) examples. (b) The class conditional distributions showingthat P(x|y = −1) is relatively constant in local neighborhood of positive examples.
P(x|y = 1) =∑{i:yi=+1}
ai exp{||xi − x||2
σ2
},
for some constants ai. This mixture encompasses a large range of possible distributions from the185
m+ rare examples provided. If we also assume that k denotes a Gaussian kernel function with186
width σ, then we have187
P(x|y = 1) =∑{i:yi=+1}
aik(xi,x) .
Observe that in rare class problems most examples are from the majority class (y = −1) and only188
a small number are from the rare class (y = +1). Therefore it is reasonable to assume P(x|y = −1)189
is locally constant in a neighborhood around the minority class examples, see Figure 3 for an190
illustration. Let P(x|y = −1) ≈ ci for each minority example i in the neighbourhood of xi.3 Then191
(9) can be written as,192
f (x)≈∑{i:yi=+1}
aik(xi,x)ci
,
3We do not make this more precise since we are mainly interested in motivating an approximate form.
8
which corresponds to the rare class representation (7) we have chosen. If the assumptions made193
are relaxed, we may still expect the rare class representation to perform reasonably well.194
For any general regularized loss minimization problem with any loss function L : Rm→R, we195
can consider a corresponding rare class regularization problem. Assume that a penalty parameter196
λ ∈ R+ is given, a regularized loss minimization problem can be described as197
minw∈Rd′
L ( f (x1), ..., f (xm)) +λ
2‖w‖2
2 , (10)
where f : X → R is a linear hypothesis in feature space, f (x)) = wTφ(x). Here L(·) is any loss198
function including both standard SVM and ranking SVM functions, since SVM-Rank is equivalent199
to a 1-class SVM on an enlarged dataset with the set of points P = {φ(xi) −φ(x j) : yi > y j, i, j =200
1...,m}. From the Representer Therorem, a solution vector w ∈ S = span{φ(xi) : i = 1, ...,m} can201
be expressed in terms of all the given training points in the feature space.202
Using the restricted hypothesis (7), we consider the following constrained regularized ranking203
problem,204
minβ∈RR
L ( f (x1), ..., f (xm)) +λ
2βT KRRβ ,
subject to f (x) =∑i∈R
βik(xi,x) (11)
whereR⊆{1, ...,m}. The proposed ranking with a rare class representation, subsequently referred205
to as RankRC, is a special case of (11) withR = {i : yi = 1}.206
In order to see potential advantages of RankRC, we compare the full data set regularized SVM207
ranking problem208
minβ∈Rm
1m+m−
∑{i:yi=+1}
∑{ j:y j=−1}
`h(
f (xi) − f (x j))
+λ
2βT Kβ
subject to f (x) =m∑
i=1
βik(xi,x) (12)
with the subset data regularization problem (11).209
Applying Theorem 2 in [? ], we can establish the following theoretical result.210
Theorem 1. Let f ∗(x) be the optimal hypothesis of the full data set SVM-Rank problem (12) under211
the feature mapping φ : X →F and f̄ ∗(x) be the optimal hypothesis for the subset data set SVM-212
Rank problem (11) where R⊆ {1, ...,m}. Assume there exists κ > 0 such that k(x,x) ≤ κ, where213
k : X ×X → R is the kernel map associated with φ. Then the following inequality holds for all214
x ∈ X :215
9
| f ∗(x) − f̄ ∗(x)| ≤ 2κλ
∑{i:yi=+1}
I[i 6∈ R]m+
+
∑{ j:y j=−1}
I[ j 6∈ R]m−
12
, (13)
where I[p] denotes the indicator function and is equal to 1 if p is true, 0 if p is false.216
We note, the difference in hypothesis decreases for larger regularization according to O( 1λ
)217
and as we include more kernel function realizations in our representation. However, for the rank-218
ing loss, the bound decreases asymmetrically depending on whether we include a point from the219
positive or negative class. In particular, if the dataset is unbalanced with m−�m+, then 1m+� 1
m−,220
and the reduction obtained from including a positive class basis is much greater than including221
one from the negative class. Hence, for a fixed number of kernel function realizations, the bound222
is minimized by first including bases corresponding to the positive or rare class.223
5. Computational Complexity Comparison between RankRC over SVM-Rank224
Using a smooth loss function in (??), we have the following RankRC problem in m+ variables,225
minβ∈Rm+
1m+m−
∑{i:yi=+1}
∑{ j:y j=−1}
`ε
(βT Ki+ −βT K j+
)+λ
2βT K++β . (14)
Here, Ki+ denotes ith row of K with column entries corresponding to only the positive class, and226
K++ ∈ Rm+×m+ is the square submatrix of K corresponding to the positive class entries. We also227
replace `h with the smooth approximation `ε. To solve (14) we can use several approaches, which228
are discussed below.229
5.1. Linearization230
Since K++ is a positive semi-definite matrix, it has an eigen-decomposition which can be ex-231
pressed in the form, K++ = UΛUT , with U being an orthonormal matrix (i.e. UTU = I) and Λ a232
diagonal matrix containing non-negative eigenvalues of K++. Let w = Λ12UTβ, then233
β = UΛ†12 w , (15)
where Λ† denotes the pseudoinverse of Λ. We can substitute (15) in (14) to obtain the following234
problem,235
minw∈Rm+
1m+m−
∑{i:yi=+1}
∑{ j:y j=−1}
`ε
(wT Λ†
12UT Ki+ − wT Λ†
12UT K j+
)+λ
2‖w‖2
2 , (16)
10
which is a problem in linear space. That is, Problem (16) is equivalent to Problem (4) with data236
points given by xi = Λ†12UT Ki+ ∈ Rm+ , i = 1, ...,m. Therefore we can use the algorithm described237
in Chapelle and Keerthi [12] to solve the linear ranking problem in O(mm+ + m logm) = O(mm+)238
time. Including the cost of factoring K++, the total time is O(mm+ +m3+). Once we solve for optimal239
w we can use (15) to obtain β for subsequent testing purposes. Also, since we only need entries240
{Ki j : i = 1, ...,m,y j = 1}, the method only requires O(mm+) space.241
5.2. Unconstrained Optimization242
We can also directly solve (14) using standard unconstrained optimization methods. Gradient243
only methods, such as steepest descent and nonlinear conjugate gradient do not require estima-244
tion of the Hessian. Although this makes each iteration much cheaper, convergence can be slow,245
especially near the solution. In contrast Hessian based algorithms, such as Newton’s method can246
obtain quadratic convergence near the solution, but each iteration can be expensive. In Newton’s247
method, the pth iterate is updated according to248
β(p+1) = β(p)+ s ,
where the step, s, is obtained by minimizing the quadratic Taylor approximation around the current249
iterate β(p):250
mins
sT g(p)+
12
sT H(p)s , (17)
where H(p) and g(p) are the Hessian and gradient of the objective at β(p), respectively. Problem251
(17) has a closed form solution given by252
s = −
(H(p)
)−1g(p) .
Since H(p) is a m+×m+ matrix, this involves O(m3+) cost in each iteration. To avoid this, we can253
use the truncated Newton method in which H(p)s = −g(p) is solved using linear conjugate gradient.254
Here, the Hessian is not computed explicitly and the method iteratively approximates the solution255
using Hessian-vector products. Since each iteration in the linear conjugate gradient algorithm256
leads to a descent direction, we can terminate early while still improving convergence.257
A drawback of (truncated) Newton’s method is that it can be sensitive to the initial point. If the258
initial point is not chosen close enough to the solution, the method can be slow to converge, or fail259
altogether. Therefore we consider a subspace-trust-region method, which combines the benefit of a260
truncated Newton step with steepest descent. In our tests, we found that the subspace-trust-region261
method converges with significantly fewer iterations than Newton’s method.262
The idea behind the trust-region method is to solve (17) while constraining the step, s, to a263
neighborhood around the current iterate, in which the approximation is trusted:264
mins
12
sT H(p)s + sT g(p)
s.t. ||s||2 ≤∆(p) .(18)
11
The trust region radius, ∆(p), is adjusted at each iterate according to standard rules, for example265
it is decreased if the solution obtained is worse than the current iterate. Problem (18) can be266
solved accurately [e.g see 29], however, the solution uses the full eigen-decomposition of H(p).267
To avoid this computation, in the subspace-trust-region method, Problem (18) is restricted to a268
two-dimensional subspace spanned by the gradient, g(p), and an approximate Newton direction,269
s2, which can be obtained by solving H(p)s2 = −g(p) using linear conjugate gradient [30]. The idea270
behind this choice is to ensure global convergence (via steepest descent direction) and achieve271
fast local convergence (via the Newton step). Once the subspace has been computed, solving (18)272
costs O(1) time, since in the subspace the problem is only two-dimensional. The implementation273
we use is provided in Matlab’s optimization toolbox, fminunc/fmincon.274
5.2.1. Computing Gradient and Hessian-Vector Product275
We describe how we can compute the gradient and Hessian-vector product of Problem (14) ef-ficiently. Let K·+ = [Ki j]i=1,...,m,y j=−1 ∈Rm×m+ denote the rectangular submatrix of K with columnsindexed by the positive class. Consider the expanded matrix
A = [Ki+ − K j+]i:yi=1, j:y j=−1 ∈ Rm+m−×m+ ,
consisting of the differences of rows in K·+ corresponding to all pairwise preferences. In ourcomputation we do not explicitly form matrix A, rather we note that A can be expressed as a sparsematrix product:
A = DK·+,where D ∈Rm+m−×m is a sparse matrix that encodes a pairwise preference. That is, if yi > y j, then276
there exists a row r in P such that Dri = 1,Dr j = −1 and the rest of the row is zero. Let Ar denote277
the rth row of A. Then the ranking loss expression in (14) can be written as,278
∑{i:yi=+1}
∑{ j:y j=−1}
`ε
(βT Ki+ −βT K j+
)
=m+m−∑r=1
`ε
(βT Ar
)=
m+m−∑r=1
I[r ∈ L](
1 − ε−βT Ar
)+
m+m−∑r=1
I[r ∈Q]14ε
(1 −βT Ar
)2, (19)
where L = {r : βT Ar < 1 − 2ε} is the set of pairwise differences which are in the linear portion of279
`ε, and Q = {r : 1 − 2ε ≤ βT Ar < 1} is the set which fall in the quadratic part. Denote e ∈ Rm+m−280
as a vector of ones. Define eL ∈Rm+m− as a binary vector where eLr = 1 if r ∈L and eLr = 0 if r 6∈ L.281
Also define IQ ∈ Rm+m−×m+m− as a diagonal matrix, where IQrr = 1, if r ∈ Q, and IQrr = 0, if r 6∈ Q.282
Then (19) is equivalent to283
12
(eL)T
((1 − ε)e − Aβ) +14ε
(e − Aβ)T IQ (e − Aβ)
=(
eL)T
((1 − ε)e − PK·+β) +14ε
(e − PK·+β)T IQ (e − PK·+β) .
Therefore the objective function in (14) can be expressed as284
F(β) ,1
m+m−
[(eL)T
((1 − ε)e − PK·+β) +14ε
(e − PK·+β)T IQ (e − PK·+β)]
+λ
2βT K++β . (20)
We obtain the gradient by taking the derivative of (20) with respect to β:285
g ,∂F∂β
=1
m+m−
[−
(eL)T
PK·+ +12ε
PK·+IQ (PK·+β − e)]
+λK++β
=1
m+m−
[−
((eL)T
P)
K·+ +12ε
(P(
K·+(
IQP)
(K·+β))
− P(
K·+(
IQe)))]
+λK++β .
(21)
In the last expression we have used brackets to emphasize the order of operations that leads to an286
efficient implementation and avoids computing A = PK·+. It can be verified that the time required287
is O(mm+).288
We obtain the Hessian by taking the derivative of (21) with respect to β:289
H ,∂2F
∂β∂βT =1
2εm+m−
(PK·+IQPK·+
)+λK++ .
Note the Hessian requires computing A. However, for the linear conjugate gradient method we290
only require computing Hs for some vector s. In this case, we can avoid computing A by using the291
following order of operations:292
Hs =1
2εm+m−
(P(
K·+(
IQP)
(K·+s)))
+λK++s .
The time required to compute Hs is also O(mm+).293
In the subspace-trust-region method we use a maximum of 25 conjugate gradient iterations.294
We found the solution usually converges in a constant number of trust region iterations. Since295
each iteration requires O(mm+) time, the total time required by the algorithm is O(mm+). Total296
space is also O(mm+).297
Finally, we note that we can slightly improve the time required to compute the gradient and298
Hessian-vector product by first sorting the values of K·+β or K·+s. Though this does not improve299
the big-O efficiency, it does reduce the constant factor. We refer the interested reader to [12] for300
details on a method which can be adapted for the nonlinear RankRC objective (14).301
13
6. Experiments302
In this section we empirically compare RankRC to other methods on several unbalanced prob-303
lems. The following methods are compared:304
1. KNN: k-Nearest-Neighbors algorithm. The posterior probability is used as the ranking func-305
tion:306
P(y|x) =1k
∑i∈K
I[yi = 1] ,
where K is the set of k nearest neighbors in the training dataset.307
2. SVM: This is the standard nonlinear SVM [23], in which the primal problem,308
minw∈Rd
1m
m∑i=1
max(0,1 − yi(wTφ(xi) + b)
)+λ
2‖w‖2
2 ,
is solved (in the dual) to obtain the decision function, f (x) = wTφ(x)+b =∑m
i=1βik(xi,x)+b,309
with k(xi,x) = φ(xi)Tφ(x).310
3. SVM-W: Weighted SVM [23, 31] in which311
minw∈Rd
1m
m∑i=1
ωi max(0,1 − yi(wTφ(xi) + b)
)+λ
2‖w‖2
2 ,
is solved, with different weights associated with each class:312
ωi =
{m
2m+if yi = +1
m2m−
if yi = −1 .
The idea is to penalize misclassification error of minority examples more heavily in order to313
reduce the bias towards majority examples.314
4. SVM-RUS: Randomly Under Sample the majority class examples (y = −1) to match the315
number of minority examples [9]. The resulting dataset, with 2m+ points, is used to train a316
standard SVM.317
5. SVM-SMT: Uses a Synthetic Minority Oversampling TEchnique (SMOTE) [10] in which318
the rare class is over-sampled by creating new synthetic rare class samples according to319
each rare class sample and its k nearest neighbors. Each new sample is generated in the320
direction of some or all of the nearest neighbors. We oversample to match the number of321
majority examples. The resulting dataset, with 2m− points, is used to train a standard SVM.322
6. RANK-SVM: Nonlinear RankSVM problem (6).323
14
7. RANK-RND: We solve the RankSVM problem constrained to m+ randomly selected set of324
basis function, i.e. Problem (??).325
8. RANK-RC: We solve RankSVM constrained to the rare class representation, i.e. Problem326
(14).327
We use LIBSVM [32] to solve the SVM problems (2-5). LIBSVM is a popular and efficient328
implementation of the sequential minimal optimization algorithm [33]. We set cache size to 10GB329
to minimize cache misses; termination criteria and shrinking heuristics are used in their default330
settings. The ranking methods (6-8) are solved using the subspace-trust-region method as outlined331
in Section 5. Termination tolerance is set at 1e-6. For ranking methods, the memory available to332
store the kernel matrix is limited to 10GB. Experiments are performed on a Xeon E5620@2.4Ghz333
running Linux.334
All datasets are standardized to zero mean and unit variance before training. Since our fo-335
cus is on nonlinear kernels, for all SVM and ranking methods (2-8), we use a Gaussian kernel,336
k(u,v) = exp(−‖u − v‖22/σ
2) with σ2 = 1m2
∑mi, j=1 ‖xi − x j‖2
2. The penalty parameter λ is determined337
by cross-validation over values log2λ = [−20,−18, ...,8,10]. For KNN we cross-validate over338
k = [1,2, ...,dmin(100,√
m)e].339
6.1. Simulated Data340
We simulate an unbalanced dataset in the following manner. Rare class instances are sampled341
from six multivariate normal distributions with equal probability. Their centers, µi, i = 1, ...,6, are342
randomly chosen within a unit cube. The majority class is sampled from(6
2
)= 15 multivariate343
normal distributions with equal probability. Their centers are chosen along lines connecting all344
combinations of two rare class centers, i.e. tµi + (1 − t)µ j, i > j. The parameter t ∈ [0,1] is used345
to roughly control the degree of class overlap. All covariances are chosen to be spherical, σ2I.346
Finally, the imbalance ratio, ρ = m+
m , is used to determine the number of samples drawn from347
each of the class conditional distributions. An example configuration in 2-dimensional space is348
shown in Figure 4. The resulting dataset contains multiple rare-class subconcepts that vary in349
discriminative structure.350
For our experiment we generate data in 5-dimensional space with σ = 0.5. We set t = 0.9,351
0.75, and 0.6 to produce datasets with high, medium and low overlap, respectively. The imbalance352
ratio, ρ, is varied from 10% to 40% in 10% increments for each t value. Thus we have a total of353
12 datasets. For each dataset we generate 1000 training points, 1000 validation points and 10000354
testing points. Results are averaged over 10 trials.355
Table 2 shows test AUC results using different methods. KNN does not perform as well as SVM356
and ranking methods. In general, ranking methods perform better than SVM based methods when357
there is greater overlap and higher imbalance (lower ρ). RANK-RND under performs in medium358
and low overlap datasets. In comparison, RANK-RC yields statistically similar performance as359
RANK-SVM across all datasets. Overall, both RANK-RC and RANK-SVM provide the best models.360
Figure 5 compares the empirical ranking loss function,361
R̂h =1
m+m−
∑{i:yi=+1}
∑{ j:y j=−1}
`h(
f (xi) − f (x j)),
15
Ove
rlap
ρK
NN
Cla
ssifi
catio
nL
oss
Ran
king
Los
sTr
ueB
ayes
SV
MS
VM
-WS
VM
-RU
SS
VM
-SM
TR
AN
K-S
VM
RA
NK
-RN
DR
AN
K-R
C
Hig
h
10%
56.3±
0.4
57.3±
0.3
59.8±
0.3
59.1±
0.4
58.9±
0.4
61.5±
0.3
61.4±
0.3
61.4±
0.2
66.8
20%
55.5±
0.4
59.2±
0.2
61.2±
0.1
61.0±
0.4
60.9±
0.4
61.6±
0.3
61.3±
0.4
62.3±
0.3
30%
57.3±
1.0
61.1±
0.2
62.2±
0.4
62.2±
0.1
61.8±
0.3
62.3±
0.4
62.2±
0.3
62.6±
0.4
40%
59.0±
0.7
62.8±
0.2
63.2±
0.3
63.3±
0.2
62.8±
0.2
62.7±
0.3
62.5±
0.2
62.7±
0.3
Med
ium
10%
54.9±
0.6
56.8±
0.5
59.5±
0.5
58.8±
0.3
58.6±
0.9
61.4±
0.5
60.1±
0.5
61.4±
0.4
69.5
20%
54.5±
0.3
60.1±
0.4
62.1±
0.2
62.1±
0.3
62.0±
0.5
62.8±
0.4
61.5±
0.4
63.4±
0.2
30%
57.6±
0.8
63.4±
0.1
64.0±
0.3
63.3±
0.1
63.4±
0.4
64.1±
0.4
62.4±
0.5
64.6±
0.5
40%
58.0±
0.7
65.3±
0.2
65.5±
0.1
65.4±
0.1
65.3±
0.2
64.6±
0.3
63.0±
0.2
64.9±
0.3
Low
10%
55.9±
1.0
61.1±
0.6
64.0±
0.4
63.5±
0.2
63.0±
0.8
65.3±
0.5
62.7±
0.6
65.5±
0.4
74.4
20%
57.2±
0.4
64.2±
0.3
66.6±
0.2
66.2±
0.2
66.4±
0.2
67.1±
0.3
63.5±
0.4
67.3±
0.2
30%
59.9±
0.9
69.1±
0.4
69.4±
0.2
68.9±
0.1
69.2±
0.2
69.2±
0.4
65.8±
0.4
69.8±
0.4
40%
61.6±
1.5
71.1±
0.1
70.7±
0.3
70.5±
0.1
71.1±
0.1
69.8±
0.3
65.8±
0.3
69.9±
0.3
Tabl
e2:
Com
pari
son
ofte
stA
UC
resu
ltsfo
rsi
mul
ated
data
sets
with
high
(t=
0.9)
,med
ium
(t=
0.75
)an
dlo
w(t
=0.
6)ov
erla
p,ea
chw
ithρ
=10%
,20%
,30%
and
40%
min
ority
sam
ples
.M
ean
AU
Csc
ore
with
stan
dard
erro
rove
r10
tria
lsar
esh
own.
Bol
ded
scor
esin
dica
teth
ere
sult
isst
atis
tical
lyno
tdiff
eren
ttha
nth
ebe
stpe
rfor
min
gm
odel
usin
ga
two-
taile
dt-
test
with
99%
confi
denc
e.
16
Figure 4: Example configuration of simulated dataset in 2-dimensions with σ = 0.1 and t = 0.75.The red, filled in circles show the locations of the six normal components for the rare class distri-bution. The black, empty circles show the location of the 15 normal components for the majorityclass distribution, whose centers lie along the dotted lines connecting all two rare class normalcomponents.
obtained by the ranking methods on four of the training and testing sets as λ is varied. We observe362
that the difference between RANK-SVM and the restricted basis models (RANK-RND and RANK-363
RC) decreases as λ is increased. Since restricting basis functions also limits the complexity of364
the model, the test loss of RANK-RND and RANK-RC is lower than that of RANK-SVM for small365
values of λ. However, RANK-RND is unable to achieve the optimal test loss levels at moderate366
values of λ (more noticeably in Figures 5c and 5d). In contrast, RANK-RC does not forfeit any test367
performance compared to RANK-SVM, while providing additional robustness as λ is reduced.368
6.2. Real Datasets369
In this section we compare methods on several unbalanced real datasets obtained from various370
sources. Table 3 lists the datasets along with their characteristics. For each dataset, three-quarters371
of the observations are used for training and the remaining one-quarter for out-of-sample testing.372
Results are averaged over 20 stratified random splits of the data. The model parameter (λ or k) is373
tuned by running 10-fold cross-validation on the training set for each split.374
Table 4 shows the mean test AUC score with standard error for each model. Overall, RANK-375
SVM and RANK-RC yield the best performing models with statistically similar results. RANK-RND,376
on the other hand, under performs on some datasets, indicating that a random basis set is not as377
effective as the rare class basis on unbalanced problems. SVM based methods generally do not378
perform as well as ranking methods, except when there appears to be more discriminative patterns379
in the data.380
17
RANK−SVM train
RANK−SVM test
RANK−RND train
RANK−RND test
RANK−RC train
RANK−RC test
−20 −10 0 100.2
0.4
0.6
0.8
1
1.2
log2(λ)
RankingLoss,R̂
h
(a) Higher Overlap, ρ = 10%
−20 −10 0 100.2
0.4
0.6
0.8
1
1.2
log2(λ)
RankingLoss,R̂
h
(b) Higher Overlap, ρ = 40%
−20 −10 0 100.2
0.4
0.6
0.8
1
1.2
log2(λ)
RankingLoss,R̂
h
(c) Lower Overlap, ρ = 10%
−20 −10 0 100.2
0.4
0.6
0.8
1
1.2
log2(λ)
RankingLoss,R̂
h
(d) Lower Overlap, ρ = 40%
Figure 5: Comparison of empirical train and test ranking loss obtained by the ranking methods onfour of the simulated datasets as λ is varied.
18
Table 5 compares the number of support vectors used by the SVM and ranking models. RANK-381
SVM uses more support vectors than SVM based models. It can use even more support vectors than382
SVM-SMT, which is trained on an enlarged dataset almost twice the size. This suggests that training383
RANK-SVM using a working-set type algorithm, which only tracks active support vectors (e.g. as384
proposed in [24] for standard SVMs), would still run costly in time and space. In comparison,385
RANK-RND and RANK-RC use significantly fewer support vectors. Moreover, with RANK-RC, test386
performance is also not compromised.387
6.3. Intrusion Detection388
In this section we use the KDD Cup 1999 dataset [36] to evaluate a large-scale unbalanced389
problem. The objective is to detect network intrusion by distinguishing between legitimate (nor-390
mal) and illegitimate (attack) connections to a computer network. The dataset is a collection of391
simulated raw TCP dump data over a period of nine weeks on a local area network. The first seven392
weeks of data is used for training and the last two for test, providing a total of 4 898 431 training393
observations and 311 029 test cases. We processed the data to remove duplicate entries (as done in394
[37]) resulting in 1 074 975 training observations and 77 286 test cases. Each observation contains395
41 features, three of which are categorical and the rest numerical. The three categorical features396
are protocol (3 categories), service (70 categories) and flag (11 categories). We represent proto-397
col using three binary features, where each feature is an indicator for one of the three categories.398
Service and flag categories are replaced by the frequency in the training sample (i.e. probability)399
corresponding to the event of observing an attack given the category is present. Thus we obtain a400
total of 43 features. Finally, as done for all datasets, we standardize each feature to zero mean and401
unit variance.402
The attack types are grouped in four categories, DOS (Denial of Service), Probing (Surveil-403
lance, e.g. port scanning), U2R (user to root), R2L (remote to local). Table 6 shows the distribution404
of attack types in the training and test sets. Together, the U2R and R2L attacks constitute 4.0%405
of the test dataset, which is a substantial increase compared to the training set, but still a small406
fraction. Poor results have been reported in literature for identifying the U2R and R2L attacks407
[38]. In this experiment, we focus on identifying these attack types by forming a binary classifi-408
cation problem with the positive class representing either a U2R or R2L attack, and the negative409
class representing all other connection types. Thus the final training set is highly skewed with only410
0.098% positive instances.411
We train using 5%, 10%, 25%, 50%, and 75% of the training data. The remaining training412
data is used for validation. We are unable to train RANK-SVM, even with just 5% of the data413
(53 749 samples), since the kernel matrix is too large to store in memory (>10GB). Clearly, this414
is an example where a large-scale solution is necessary to solve the ranking problem. We do not415
train SVM-SMT due to the large number of samples as well. We are able to train SVM-W using up416
to 50% of the data. With more samples SVM-W does not converge, likely due to the large number417
of support vectors which do not fit in the cache.418
Figure 6a shows test AUC results obtained by different methods as training data is increased.419
We observe that SVM and SVM-RUS perform poorly. RANK-RC, RANK-RND and SVM-W produce420
better results, with RANK-RC performing the best. Figures 6b and 6c compare training time and421
number of support vectors, respectively, as training data is increased. SVM and SVM-RUS train in422
19
Name Source SubjectFeatures Samples
Original Derived (d) m m+ ρ
Abalone19 UCI Life 1N,7C 10 4177 32 0.8%
Mammograph [34] Life 6C 6 11183 260 2.3%
Ozone UCI Environment 72C 72 2536 73 2.9%
YeastME2 UCI Life 8C 8 1484 51 3.4%
Wine4 UCI Chemistry 11C 11 4898 183 3.7%
Oil [35] Environment 49C 49 937 41 4.4%
SolarM0 UCI Nature 10N 32 1389 68 4.9%
Coil KDD Business 85C 85 9822 586 6.0%
Thyroid UCI Life 21N,7C 52 3772 231 6.1%
Libras UCI Physics 90C 90 360 24 6.7%
Scene LibSVM Nature 294C 294 2407 177 7.4%
YeastML8 LibSVM Life 103C 103 2417 178 7.4%
Crime UCI Economics 122C 100 1994 150 7.5%
Vowel0 Keel Computer 10C 10 989 90 9.1%
Euthyroid UCI Life 18N,7C 42 3163 293 9.3%
Abalone7 UCI Life 1N,7C 10 4177 391 9.4%
Satellite UCI Nature 36C 36 6435 626 9.7%
Page0 Keel Computer 10C 10 5472 559 10.2%
Ecoli UCI Life 7C 7 336 35 10.4%
Contra2 Keel Life 9C 9 1473 333 22.6%
Table 3: List of datasets and their characteristics that we use to evaluate methods. Under origi-nal features, ’N’ is used to denote number of nominal features, ’C’, is used to denote number ofcontinuous features. We derive d features by converting nominal features to an indicator represen-tation and use continuous features as is. Under samples, m is the total number of observations, m+
is the number of rare class observations, and ρ = m+
m is the percentage of rare class examples.
20
Dat
aset
KN
NC
lass
ifica
tion
Los
sR
anki
ngL
oss
SV
MS
VM
-WS
VM
-RU
SS
VM
-SM
TR
AN
K-S
VM
RA
NK
-RN
DR
AN
K-R
C
Aba
lone
1955
.7±
2.2
54.9±
3.1
64.3±
1.3
74.1±
1.5
67.4±
1.1
81.0±
1.2
79.1±
1.1
81.4±
1.1
Mam
mog
raph
80.7±
0.4
88.4±
0.4
90.1±
0.7
92.8±
0.4
91.3±
0.4
93.7±
0.4
93.9±
0.4
94.4±
0.3
Ozo
ne66
.4±
2.2
85.0±
1.1
85.5±
0.7
86.4±
0.9
85.9±
0.8
89.4±
0.9
88.7±
0.8
90.1±
0.9
Yea
stM
E2
69.3±
1.7
81.8±
0.5
85.5±
0.7
87.5±
0.7
86.8±
1.1
90.8±
0.8
89.0±
0.9
89.4±
1.1
Win
e461
.9±
0.5
74.9±
0.7
71.6±
0.9
79.1±
0.8
78.9±
0.8
83.5±
0.6
79.6±
0.6
82.7±
0.7
Oil
71.6±
1.7
91.1±
0.9
88.0±
1.4
90.6±
0.8
90.6±
1.0
92.5±
0.9
89.2±
1.0
91.7±
0.8
Sola
rM0
58.9±
1.6
55.4±
0.7
63.1±
1.3
71.5±
0.8
73.1±
0.4
78.5±
0.5
77.2±
0.8
77.5±
0.8
Coi
l53
.9±
0.3
59.2±
0.8
62.9±
0.4
68.8±
0.4
67.5±
0.5
70.0±
0.4
69.8±
0.4
72.3±
0.2
Thy
roid
73.9±
0.6
94.8±
0.4
93.4±
0.5
94.8±
0.3
94.4±
0.4
95.7±
0.4
91.3±
0.5
95.7±
0.3
Lib
ras
87.4±
2.1
96.8±
0.9
96.7±
0.9
96.4±
0.8
96.8±
0.9
97.6±
0.8
95.4±
1.0
94.8±
1.1
Scen
e59
.3±
0.8
67.3±
0.8
75.4±
0.9
74.8±
0.6
74.0±
0.9
77.1±
0.7
76.4±
0.8
77.5±
0.6
Yea
stM
L8
54.5±
0.7
57.1±
0.8
59.6±
0.6
57.9±
0.5
59.7±
0.4
61.5±
0.5
60.2±
0.7
62.0±
0.5
Cri
me
71.8±
1.5
87.6±
0.7
87.3±
0.6
90.1±
0.3
90.8±
0.3
92.3±
0.3
91.2±
0.3
91.6±
0.3
Vow
el0
100.
0±0.
010
0.0±
0.0
100.
0±0.
099
.8±
0.0
100.
0±0.
010
0.0±
0.0
98.4±
0.1
100.
0±0.
0E
uthy
roid
75.8±
0.8
95.0±
0.4
95.0±
0.4
94.6±
0.4
95.0±
0.4
95.2±
0.4
90.7±
0.4
94.1±
0.4
Aba
lone
778
.2±
2.1
56.1±
3.2
76.3±
0.5
77.4±
0.3
74.4±
0.2
87.0±
0.3
86.5±
0.3
87.1±
0.3
Sate
llite
83.8±
0.3
94.8±
0.1
94.6±
0.1
94.3±
0.1
95.1±
0.1
95.3±
0.1
94.3±
0.1
95.1±
0.1
Page
090
.4±
0.4
98.4±
0.1
98.1±
0.1
98.1±
0.1
98.2±
0.1
98.6±
0.1
95.6±
0.1
98.4±
0.1
Eco
li75
.6±
2.0
94.6±
0.7
93.7±
0.7
94.1±
0.6
93.2±
0.6
94.1±
0.6
93.4±
0.9
94.5±
0.7
Con
tra2
60.5±
0.9
66.9±
0.8
70.2±
0.5
70.5±
0.6
70.6±
0.8
73.2±
0.5
72.6±
0.5
73.4±
0.4
Tabl
e4:
Com
pari
son
ofte
stA
UC
resu
ltsfo
rre
alda
tase
ts(l
iste
din
Tabl
e3)
.M
ean
AU
Csc
ore
with
stan
dard
erro
rov
er20
tria
lsar
esh
own.
Eac
htr
ialu
ses
one-
quar
terd
ata
foro
ut-o
f-sa
mpl
ete
stin
g.B
olde
dsc
ores
indi
cate
the
resu
ltis
stat
istic
ally
notd
iffer
ent
than
the
best
perf
orm
ing
mod
elus
ing
atw
o-ta
iled
t-te
stw
ith99
%co
nfide
nce.
21
Dat
aset
Cla
ssifi
catio
nL
oss
Ran
king
Los
s
SV
MS
VM
-WS
VM
-RU
SS
VM
-SM
TR
AN
K-S
VM
RA
NK
-RN
DR
AN
K-R
C
Aba
lone
1911
715
5532
2644
2979
2424
Mam
mog
raph
306
2152
119
2987
7252
195
193
Ozo
ne20
674
661
1165
995
5555
Yea
stM
E2
105
429
4159
410
6638
38W
ine4
458
2109
179
3166
3550
136
136
Oil
8431
132
340
488
3131
Sola
rM0
183
845
9011
1410
4251
51C
oil
1754
5560
704
8178
7284
435
435
Thy
roid
332
723
137
1020
2739
168
172
Lib
ras
3593
2212
419
718
18Sc
ene
603
1171
213
1566
1748
132
133
Yea
stM
L8
833
1669
257
1562
1804
133
133
Cri
me
308
631
115
867
1326
112
112
Vow
el0
3537
2545
730
6867
Eut
hyro
id38
967
317
710
0223
0321
621
9A
balo
ne7
713
1391
274
2076
3079
291
292
Sate
llite
773
1158
301
1526
4734
466
469
Page
032
257
014
592
440
1241
541
6E
coli
4768
1811
524
826
26C
ontr
a256
084
339
691
210
9624
924
9
Tabl
e5:
Ave
rage
num
ber
ofsu
ppor
tvec
tors
used
byth
eSV
Man
dra
nkin
gm
odel
sov
er20
tria
ls.
For
rank
ing
mod
els,
supp
ort
vect
ors
are
coun
ted
asth
enu
mbe
rofn
on-z
ero
coef
ficie
nts
asso
ciat
edw
ithke
rnel
func
tions
.
22
5 10 25 50 750.65
0.7
0.75
0.8
0.85
0.9
0.95
1
Percent of data used in training
AU
C
(a)
SVM
SVM−W
SVM−RUS
RANK−RND
RANK−RC
5 10 25 50 75
0
1
2
3
4
5
6
7
Percent of data used in training
Tra
inin
g T
ime
(s)
x 105
50 75
0
1
2x 10
4
(b)
5 10 25 50 75
0
1
2
3
4
5
6
7
8
Percent of data used in training
No
. o
f S
Vs
x 103
(c)
Figure 6: Comparison of (a) test AUC score, (b) training time in seconds, and (c) number ofsupport vectors, for the intrusion detection problem as percent of data used for training is increasedfrom 5% to 75%. In our experiment setup, we were unable to train RANK-SVM due to the largesize of the dataset. Also, for more than 50% of data, SVM-W did not converge after more than 72hours of training.
23
reasonable time, though they do not produce good models. On the other hand, SVM-W quickly423
becomes very expensive. RANK-RC and RANK-RND scale well, while able to produce effective424
models. RANK-RC and RANK-RND also use significantly fewer support vectors than SVM-W.425
7. Conclusion426
In this paper, we use a ranking loss function to tackle the problem of learning from unbal-427
anced datasets. Minimizing biclass ranking loss is equivalent to maximizing the AUC measure,428
which overcomes the inadequacies of accuracy, used by conventional classification algorithms.429
The resulting regularized loss minimization problem corresponds to a biclass RankSVM problem.430
We modify RankSVM to take advantage of the rare class situation by restricting the solution to a431
linear combination of rare class kernel functions (RankRC). This allows us to solve the nonlinear432
ranking problem in O(mm+) time and O(mm+) space, thus enabling us to solve problems which are433
too large for kernel RankSVM. We provided heuristic and theoretical justification for this choice434
and experimentally illustrated the effectiveness of RankRC, in both test performance and training435
time.436
Below we list a few extensions/variants one may consider using the rare class representation:437
1. Regularization: In problem (14) we can use an `1-regularizer, ‖β‖1, instead of βT K++β.438
This would lead to sparser solutions [39] and could be solved using coordinate descent439
methods [40].440
2. Loss function: We can replace the loss function with other variants of ranking loss. The441
AUC concentrates uniformly across all threshold levels. We can use weighted AUC [41] or442
the p-norm push [42] to emphasize specific portions of the AUC curve. Also, we can use443
list based ranking methods to optimize other criteria such as F1-score or Precision/Recall444
breakeven point [43]. The rare-class representation allows us to learn a nonlinear function445
for unbalanced datasets with more complex loss functions, in reasonable time and space.446
3. Stochastic Learning: For very large datasets, the m×m+ kernel submatrix may be too large447
to fit in memory. In this case, we can store K++ ∈ Rm+×m+ and cycle (randomly) through448
majority class examples updating the β ∈Rm+ vector via gradient descent using an adaptive449
learning rate [44]. Unlike standard stochastic gradient descent, in each iteration we use the450
full set of minority examples and a single (or small subset) of majority samples to perform451
the update. This should lead to faster convergence while using only O(m+m+) space.452
In summary, the rare class representation offers significant benefits to learn nonlinear models453
for large-scale rare class problems.454
References455
[1] N. Japkowicz, S. Stephen, The class imbalance problem: A systematic study, Intell. Data Anal. 6 (2002)456429–449.457
[2] B. Raskutti, A. Kowalczyk, Extreme re-balancing for svms: a case study, SIGKDD Explor. Newsl. 6 (2004)45860–69.459
[3] G. Wu, E. Y. Chang, Class-boundary alignment for imbalanced dataset learning, in: ICML, 2003, pp. 49–56.460
24
[4] G. E. A. P. A. Batista, R. C. Prati, M. C. Monard, A study of the behavior of several methods for balancing461machine learning training data, SIGKDD Explor. Newsl. 6 (2004) 20–29.462
[5] N. V. Chawla, N. Japkowicz, A. Kotcz, Editorial: special issue on learning from imbalanced data sets, SIGKDD463Explor. Newsl. 6 (2004) 1–6.464
[6] G. M. Weiss, Mining with rarity: a unifying framework, SIGKDD Explor. Newsl. 6 (2004) 7–19.465[7] K. Ezawa, M. Singh, S. W. Norton, Learning goal oriented bayesian networks for telecommunications risk466
management, in: ICML, 1996, pp. 139–147.467[8] J. Zhang, I. Mani, KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information468
Extraction, in: ICML, 2003.469[9] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: One-sided selection, in: ICML, 1997,470
pp. 179–186.471[10] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: Synthetic minority over-sampling technique,472
Journal of Artificial Intelligence Research 16 (2002) 321–357.473[11] R. Herbrich, T. Graepel, K. Obermayer, Large Margin Rank Boundaries for Ordinal Regression, MIT Press,474
2000.475[12] O. Chapelle, S. S. Keerthi, Efficient algorithms for ranking with svms, Inf. Retr. 13 (2010) 201–215.476[13] F. Provost, T. Fawcett, R. Kohavi, The case against accuracy estimation for comparing induction algorithms, in:477
In Proceedings of the Fifteenth International Conference on Machine Learning, 1997, pp. 445–453.478[14] M. A. Maloof, Learning when data sets are imbalanced and when costs are unequal and unknown, in: ICML,479
2003.480[15] Y. Sun, M. S. Kamel, A. K. C. Wong, Y. Wang, Cost-sensitive boosting for classification of imbalanced data,481
Pattern Recogn. 40 (2007) 3358–3378.482[16] H. He, E. A. Garcia, Learning from imbalanced data, IEEE Trans. on Knowl. and Data Eng. 21 (2009) 1263–483
1284.484[17] A. P. Bradley, The use of the area under the roc curve in the evaluation of machine learning algorithms, Pattern485
Recognition 30 (1997) 1145–1159.486[18] C. E. Metz, Basic principles of ROC analysis., Seminars in nuclear medicine 8 (1978) 283–298.487[19] J. A. Hanley, B. J. Mcneil, The meaning and use of the area under a receiver operating characteristic (ROC)488
curve., Radiology 143 (1982) 29–36.489[20] E. R. DeLong, D. M. DeLong, D. L. Clarke-Pearson, Comparing the Areas under Two or More Correlated490
Receiver Operating Characteristic Curves: A Nonparametric Approach, Biometrics 44 (1988) 837–845.491[21] P. L. Bartlett, M. I. Jordan, J. D. McAuliffe, Convexity, Classification, and Risk Bounds, Journal of the American492
Statistical Association 101 (2006) 138–156.493[22] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of494
the fifth annual workshop on Computational learning theory, COLT ’92, ACM, 1992, pp. 144–152.495[23] V. N. Vapnik, Statistical Learning Theory, 1st ed., Wiley, 1998.496[24] O. Chapelle, Training a support vector machine in the primal, Neural Comput. 19 (2007) 1155–1178.497[25] G. Kimeldorf, G. Wahba, A correspondence between Bayesian estimation of stochastic processes and smoothing498
by splines, Ann. Math. Statist. 41 (1970) 495–502.499[26] B. Schölkopf, R. Herbrich, A. J. Smola, A generalized representer theorem, in: Proceedings of the 14th Annual500
Conference on Computational Learning Theory and and 5th European Conference on Computational Learning501Theory, 2001, pp. 416–426.502
[27] T. Joachims, Optimizing search engines using clickthrough data, in: KDD, 2002, pp. 133–142.503[28] M. Zhu, W. Su, H. A. Chipman, LAGO: A Computationally Efficient Approach for Statistical Detection, Tech-504
nometrics 48 (2006).505[29] S. Boyd, L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.506[30] M. A. Branch, T. F. Coleman, Y. Li, A subspace, interior, and conjugate gradient method for large-scale bound-507
constrained minimization problems, SIAM J. Scientific Computing 21 (1999) 1–23.508[31] E. E. Osuna, R. Freund, F. Girosi, Support Vector Machines: Training and Applications, Technical Report, MIT,509
1997.510[32] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans-511
25
actions on Intelligent Systems and Technology 2 (2011) 27:1–27:27. Software available at512http://www.csie.ntu.edu.tw/ cjlin/libsvm.513
[33] J. C. Platt, Advances in kernel methods, MIT Press, 1999, pp. 185–208.514[34] K. Woods, C. Doss, K. Bowyer, J. Solka, C. Preibe, P. Keglmyer, Comparative evaluation of pattern recognition515
techniques for detection of microcalcifications in mammography, International Journal of Pattern Recognition516and Artificial Intelligence 7 (1993) 1417–1436.517
[35] M. Kubat, R. Holte, S. Matwin, Machine learning for the detection of oil spills in satellite radar images, Machine518Learning 30 (1998) 195–215.519
[36] KDD Cup 1999, http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, Accessed:5202013-08-31.521
[37] M. Tavallaee, E. Bagheri, W. Lu, A. A. Ghorbani, A detailed analysis of the kdd cup 99 data set, in: Pro-522ceedings of the Second IEEE international conference on Computational intelligence for security and defense523applications, CISDA’09, 2009, pp. 53–58.524
[38] M. Sabhnani, Application of machine learning algorithms to kdd intrusion detection dataset within misuse525detection context, in: ICML, 2003, pp. 209–215.526
[39] R. Tibshirani, Regression shrinkage and selection via the lasso, J. Roy. Statist. Soc. Ser. B 58 (1996) 267–288.527[40] J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent,528
Journal of Statistical Software (2009).529[41] C. G. Weng, J. Poon, A new evaluation measure for imbalanced datasets, in: Seventh Australasian Data Mining530
Conference (AusDM 2008), volume 87, 2008, pp. 27–32.531[42] C. Rudin, The p-norm push: A simple convex ranking algorithm that concentrates at the top of the list, J. Mach.532
Learn. Res. 10 (2009) 2233–2271.533[43] T. Joachims, A support vector method for multivariate performance measures, in: ICML, 2005, pp. 377–384.534[44] L. Bottou, O. Bousquet, The tradeoffs of large scale learning, in: NIPS, 2008, pp. 161–168.535
26
Training Test
Normal 812808 75.6% 47913 62.0%DOS 247266 23.0% 23568 30.5%Probing 13850 1.3% 2677 3.5%U2R 52 0.005% 215 0.278%R2L 999 0.093% 2913 3.769%
Total 1074975 100% 77286 100%
Table 6: Distribution of connection types in training and test sets for the intrusion detection prob-lem.
27
top related