New Trends in Optimization and Computational Algorithms
December 9–13, 2001

Convex Optimization in Classification Problems

Laurent El Ghaoui
Department of EECS, UC Berkeley
goal

• the connection between classification and LP, convex QP has a long history (Vapnik, Mangasarian, Bennett, etc.)
• recent progress in convex optimization: conic and semidefinite programming; geometric programming; robust optimization
• we'll outline some connections between convex optimization and classification problems

joint work with: M. Jordan, N. Cristianini, G. Lanckriet, C. Bhattacharyya
outline

▶ convex optimization
• SVMs and robust linear programming
• minimax probability machine
• kernel optimization
convex optimization

standard form:

min_x f_0(x) : f_i(x) ≤ 0, i = 1, …, m

• arises in many applications
• convexity not always recognized in practice
• can solve large classes of convex problems in polynomial time (Nesterov, Nemirovski, 1990)
conic optimization

special class of convex problems:

min_x c^T x : Ax = b, x ∈ K

where K is a cone, a direct product of the following "building blocks":

K = R^n_+                                    linear programming
K = {(y, t) ∈ R^{n+1} : t ≥ ‖y‖_2}           second-order cone programming, quadratic programming
K = {x ∈ R^{n×n} : x = x^T ⪰ 0}              semidefinite programming

fact: can solve conic problems in polynomial time (Nesterov, Nemirovski, 1990)
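as a quick numerical sketch (not part of the original slides; numpy and the helper names below are assumptions), membership in each of the three building-block cones can be checked directly from the definitions above:

```python
import numpy as np

def in_nonneg_orthant(x, tol=1e-9):
    # K = R^n_+ : every component nonnegative (linear programming)
    return bool(np.all(x >= -tol))

def in_second_order_cone(y, t, tol=1e-9):
    # K = {(y, t) : t >= ||y||_2} (second-order cone programming)
    return bool(t >= np.linalg.norm(y) - tol)

def in_psd_cone(X, tol=1e-9):
    # K = {X : X = X^T, X >= 0} : symmetric with nonnegative eigenvalues
    return bool(np.allclose(X, X.T)
                and np.min(np.linalg.eigvalsh((X + X.T) / 2)) >= -tol)

print(in_nonneg_orthant(np.array([1.0, 0.0, 2.0])))      # True
print(in_second_order_cone(np.array([3.0, 4.0]), 5.0))   # True: 5 >= ||(3,4)||_2 = 5
print(in_psd_cone(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True: eigenvalues 1 and 3
```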
conic duality

the dual of the conic problem

min_x c^T x : Ax = b, x ∈ K

is

max_y b^T y : c − A^T y ∈ K*

where

K* = {z : ⟨z, x⟩ ≥ 0 ∀x ∈ K}

is the cone dual to K

for the cones mentioned before, and direct products of them, K = K*
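a small sketch of weak duality for the self-dual LP cone K = R^n_+ (not from the slides; the random instance is an assumption): for any primal-feasible x and dual-feasible y, the duality gap c^T x − b^T y equals ⟨c − A^T y, x⟩, which is nonnegative since both factors lie in the cone:

```python
import numpy as np

rng = np.random.default_rng(0)

# primal-feasible point for min c^T x : Ax = b, x in R^n_+
n, m = 6, 3
A = rng.standard_normal((m, n))
x = rng.random(n)            # x >= 0, feasible by construction
b = A @ x                    # choose b so that Ax = b holds exactly

# dual-feasible y: need c - A^T y in K* = R^n_+ (the cone is self-dual)
y = rng.standard_normal(m)
slack = rng.random(n)        # nonnegative dual slack
c = A.T @ y + slack          # then c - A^T y = slack >= 0

# weak duality: c^T x >= b^T y, with gap <c - A^T y, x> >= 0
gap = c @ x - b @ y
print(gap >= 0)                        # True
print(np.isclose(gap, slack @ x))      # True: gap is exactly <slack, x>
```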
robust optimization

conic problem in dual form:

max_y b^T y : c − A^T y ∈ K

→ what if A is unknown-but-bounded, say A ∈ 𝒜, where 𝒜 is given?

robust counterpart:

max_y b^T y : ∀A ∈ 𝒜, c − A^T y ∈ K

• still convex, but tractability depends on 𝒜
• systematic ways to approximate (get lower bounds)
• for large classes of 𝒜, the approximation is exact
example: robust LP

linear program:

min_x c^T x : a_i^T x ≤ b, i = 1, …, m

assume the a_i's are unknown-but-bounded in ellipsoids

E_i := {a : (a − ā_i)^T Γ_i^{-1} (a − ā_i) ≤ 1}

where ā_i: center, Γ_i ⪰ 0: "shape matrix"

robust LP:

min_x c^T x : ∀a_i ∈ E_i, a_i^T x ≤ b, i = 1, …, m
robust LP: SOCP representation

the robust LP is equivalent to

min_x c^T x : ā_i^T x + ‖Γ_i^{1/2} x‖_2 ≤ b, i = 1, …, m

→ a second-order cone program!

interpretation: smooths the boundary of the feasible set
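the equivalence rests on the identity sup_{a ∈ E} a^T x = ā^T x + ‖Γ^{1/2} x‖_2; a Monte Carlo sketch (not from the slides; the random instance is an assumption) checks that the closed form upper-bounds, and is nearly attained by, samples from the ellipsoid:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
abar = rng.standard_normal(n)                 # ellipsoid center
M = rng.standard_normal((n, n))
Gamma = M @ M.T + np.eye(n)                   # shape matrix, positive definite

# symmetric square root of Gamma via eigendecomposition
w, V = np.linalg.eigh(Gamma)
G_half = V @ np.diag(np.sqrt(w)) @ V.T

x = rng.standard_normal(n)

# closed form: sup over a in the ellipsoid of a^T x
closed_form = abar @ x + np.linalg.norm(G_half @ x)

# Monte Carlo: sample a = abar + G_half u with ||u||_2 <= 1
u = rng.standard_normal((200000, n))
u /= np.maximum(np.linalg.norm(u, axis=1, keepdims=True), 1.0)
samples = (abar + u @ G_half) @ x
print(samples.max() <= closed_form + 1e-9)    # True: closed form is an upper bound
print(closed_form - samples.max() < 0.05)     # True: and it is (nearly) attained
```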
LP with Gaussian coefficients

assume a ∼ N(ā, Γ); then, for given x,

Prob{a^T x ≤ b} ≥ 1 − ε

is equivalent to:

ā^T x + κ‖Γ^{1/2} x‖_2 ≤ b

where κ = Φ^{-1}(1 − ε) and Φ is the c.d.f. of N(0, 1)

hence,
• can solve LPs with Gaussian coefficients using second-order cone programming
• the resulting SOCP is similar to the one obtained with ellipsoidal uncertainty
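a sketch of the chance-constraint equivalence (not from the slides; the instance is an assumption): choosing b so the SOC constraint holds with equality should give Prob{a^T x ≤ b} ≈ 1 − ε under the Gaussian model, which a Monte Carlo estimate confirms:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n = 3
abar = rng.standard_normal(n)
M = rng.standard_normal((n, n))
Gamma = M @ M.T + 0.1 * np.eye(n)             # covariance of a

x = rng.standard_normal(n)
eps = 0.05
kappa = NormalDist().inv_cdf(1 - eps)          # Phi^{-1}(1 - eps) ~ 1.645

# right-hand side chosen so the SOC constraint holds with equality
w, V = np.linalg.eigh(Gamma)
G_half = V @ np.diag(np.sqrt(w)) @ V.T
b = abar @ x + kappa * np.linalg.norm(G_half @ x)

# Monte Carlo check: Prob{a^T x <= b} should be about 1 - eps
a = rng.multivariate_normal(abar, Gamma, size=200000)
prob = np.mean(a @ x <= b)
print(abs(prob - (1 - eps)) < 0.01)            # True (up to sampling error)
```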
LP with random coefficients

assume a ∼ (ā, Γ), i.e. the distribution of a has mean ā and covariance matrix Γ, but is otherwise unknown

Chebyshev inequality:

Prob{a^T x ≤ b} ≥ 1 − ε

is equivalent to:

ā^T x + κ‖Γ^{1/2} x‖_2 ≤ b

where κ = √((1 − ε)/ε)

leads to SOCPs similar to the ones obtained previously
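a short comparison (not from the slides) of the two safety factors: the distribution-free Chebyshev κ = √((1 − ε)/ε) is always larger than the Gaussian κ = Φ^{-1}(1 − ε), reflecting the price of knowing only the mean and covariance:

```python
import math
from statistics import NormalDist

# kappa for the distribution-free (Chebyshev) bound vs the Gaussian bound
for eps in (0.1, 0.05, 0.01):
    k_cheb = math.sqrt((1 - eps) / eps)
    k_gauss = NormalDist().inv_cdf(1 - eps)
    # knowing only second moments costs a larger safety factor
    print(eps, round(k_gauss, 3), round(k_cheb, 3), k_cheb > k_gauss)
```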
outline

• convex optimization
▶ SVMs and robust linear programming
• minimax probability machine
• kernel optimization
SVMs: setup

given data points x_i with labels y_i = ±1, i = 1, …, N

two-class linear classification with support vectors:

min ‖a‖_2 : y_i (a^T x_i − b) ≥ 1, i = 1, …, N

• amounts to selecting one separating hyperplane among the many possible
• the problem is feasible iff there exists a separating hyperplane between the two classes
SVMs: robust optimization interpretation

interpretation: SVMs are a way to handle noise in the data points

• assume each data point is unknown-but-bounded in a sphere of radius ρ and center x_i
• find the largest ρ such that separation is still possible between the two classes of perturbed points
variations

can use other data noise models:

• hypercube uncertainty (gives rise to LP)
• ellipsoidal uncertainty (→ QP)
• probabilistic uncertainty, Gaussian or Chebyshev (→ QP)
separation with hypercube uncertainty

assume each data point is unknown-but-bounded in a hypercube C_i:

x_i ∈ C_i := {x_i + ρPu : ‖u‖_∞ ≤ 1}

where the centers x_i and the "shape matrix" P are given

robust separation: leads to the linear program

min ‖Pa‖_1 : y_i (a^T x_i − b) ≥ 1, i = 1, …, N
separation with ellipsoidal uncertainty

assume each data point is unknown-but-bounded in an ellipsoid E_i:

x_i ∈ E_i := {x_i + ρPu : ‖u‖_2 ≤ 1}

where the centers x_i and the "shape matrix" P are given

robust separation leads to the QP

min ‖Pa‖_2 : y_i (a^T x_i − b) ≥ 1, i = 1, …, N
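the norms in the two objectives come from the worst-case perturbation over each uncertainty set; a sketch (not from the slides; a symmetric P is assumed so that P^T a = Pa matches the slide notation) verifies the two dual-norm identities behind them:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n))
P = (M + M.T) / 2                 # symmetric "shape matrix", an assumption
a = rng.standard_normal(n)

# hypercube ||u||_inf <= 1: sup_u u^T (P a) = ||P a||_1, attained at u = sign(P a)
u_box = np.sign(P @ a)
print(np.isclose(u_box @ (P @ a), np.linalg.norm(P @ a, 1)))   # True

# ellipsoid ||u||_2 <= 1: sup_u u^T (P a) = ||P a||_2, attained at u = Pa/||Pa||_2
u_ball = P @ a / np.linalg.norm(P @ a)
print(np.isclose(u_ball @ (P @ a), np.linalg.norm(P @ a, 2)))  # True
```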
outline

• convex optimization
• SVMs and robust linear programming
▶ minimax probability machine
• kernel optimization
minimax probability machine

goal:
• make assumptions about the data-generating process
• do not assume Gaussian distributions
• use second-moment analysis of the two classes

let x̄_±, Γ_± be the mean and covariance matrix of class y = ±1

MPM: maximize 1 − ε such that there exists (a, b) such that

inf_{x∼(x̄_+, Γ_+)} Prob{a^T x ≤ b} ≥ 1 − ε
inf_{x∼(x̄_−, Γ_−)} Prob{a^T x ≥ b} ≥ 1 − ε
MPMs: optimization problem

→ two-sided, multivariable Chebyshev inequality:

inf_{x∼(x̄, Γ)} Prob{a^T x ≤ b} = (b − a^T x̄)_+^2 / ((b − a^T x̄)_+^2 + a^T Γ a)

the MPM leads to the second-order cone program:

min_a ‖Γ_+^{1/2} a‖_2 + ‖Γ_−^{1/2} a‖_2 : a^T (x̄_+ − x̄_−) = 1

complexity is the same as for standard SVMs
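on a toy two-class problem the SOCP above can be sketched without a conic solver (not from the slides; the data and the grid search are assumptions): the equality constraint a^T(x̄_+ − x̄_−) = 1 leaves, in 2D, a one-parameter family of classifiers, over which the objective can be minimized by brute force:

```python
import numpy as np

# toy class means and covariances (second-moment information only)
xp = np.array([2.0, 1.0]);   Gp = np.array([[1.0, 0.3], [0.3, 0.5]])
xm = np.array([-1.0, -1.0]); Gm = np.array([[0.8, -0.2], [-0.2, 1.2]])

def sqrtm(G):
    # symmetric matrix square root via eigendecomposition
    w, V = np.linalg.eigh(G)
    return V @ np.diag(np.sqrt(w)) @ V.T

Gp_h, Gm_h = sqrtm(Gp), sqrtm(Gm)
d = xp - xm
d_perp = np.array([-d[1], d[0]])

# parametrize the constraint a^T d = 1: a(t) = d/||d||^2 + t * d_perp
def objective(t):
    a = d / (d @ d) + t * d_perp
    return np.linalg.norm(Gp_h @ a) + np.linalg.norm(Gm_h @ a)

ts = np.linspace(-1, 1, 20001)            # coarse grid search, not a real solver
vals = np.array([objective(t) for t in ts])
t_star = ts[vals.argmin()]
a_star = d / (d @ d) + t_star * d_perp

print(np.isclose(a_star @ d, 1.0))        # True: normalization constraint holds
print(vals.min() <= objective(0.0))       # True: at least as good as t = 0
```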
dual problem

express the problem as an unconstrained min-max problem:

min_a max_{‖u‖_2 ≤ 1, ‖v‖_2 ≤ 1} u^T Γ_+^{1/2} a − v^T Γ_−^{1/2} a + λ(1 − a^T (x̄_+ − x̄_−))

exchange min and max, and set ρ := 1/λ:

min_{ρ,u,v} ρ : x̄_+ + Γ_+^{1/2} u = x̄_− + Γ_−^{1/2} v, ‖u‖_2 ≤ ρ, ‖v‖_2 ≤ ρ

geometric interpretation: define the two ellipsoids

E_±(ρ) := {x̄_± + Γ_±^{1/2} u : ‖u‖_2 ≤ ρ}

and find the smallest ρ for which the ellipsoids intersect
robust optimization interpretation

assume the data is generated as follows: for data with label +,

x_+ ∈ E_+(ρ) := {x̄_+ + Γ_+^{1/2} u : ‖u‖_2 ≤ ρ}

and similarly for data with label −

the MPM finds the largest ρ for which robust separation is possible

[figure: the two ellipsoids around x̄_+ and x̄_−, separated by the hyperplane a^T x − b = 0]
variations

• minimize a weighted sum of misclassification probabilities
• quadratic separation: find a quadratic set Q such that

inf_{x∼(x̄_+, Γ_+)} Prob{x ∈ Q} ≥ 1 − ε
inf_{x∼(x̄_−, Γ_−)} Prob{x ∉ Q} ≥ 1 − ε

→ leads to a semidefinite programming problem
• nonlinear classification via kernels (using plug-in estimates of the mean and covariance matrix)
outline

• convex optimization
• SVMs and robust linear programming
• minimax probability machine
▶ kernel optimization
transduction

transduction: given a labeled training set and an unlabeled test set, predict the labels

the data contains both labeled points and unlabeled points
kernel methods

main goal: separate using a nonlinear classifier

a^T φ(x) = b

where φ is a nonlinear operator

define the kernel matrix (on both labeled and unlabeled data)

K_ij = φ(x_i) φ(x_j)^T

in a transductive setting, all we need to know to predict the labels are a, b and the kernel matrix
kernel methods: idea of proof

all the linear classification methods we've seen so far are such that, at the optimum, a is in the range of the labeled data:

a = Σ_i λ_i x_i

thus, in the nonlinear case, the optimization problem depends only on the values of the kernel matrix K_ij for labeled points x_i, x_j

in a transductive setting, the prediction of labels also involves K_ij only, since for an unlabeled data point x_j,

a^T φ(x_j) = Σ_i λ_i φ(x_i)^T φ(x_j)

involves only the K_ij's
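the argument above can be sketched concretely (not from the slides; the quadratic feature map and coefficients are assumptions): with φ(x) = (x_1², √2 x_1 x_2, x_2²), whose inner products give K(x, z) = (x^T z)², the feature-space prediction a^T φ(x_j) agrees exactly with the kernel-only expression Σ_i λ_i K(x_i, x_j):

```python
import numpy as np

# explicit quadratic feature map and the kernel it induces: K(x, z) = (x^T z)^2
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 2))          # labeled points x_i
lam = rng.standard_normal(5)             # expansion coefficients lambda_i
x_test = rng.standard_normal(2)          # an unlabeled point x_j

# feature-space prediction with a = sum_i lambda_i phi(x_i)
a = sum(l * phi(x) for l, x in zip(lam, X))
pred_feature_space = a @ phi(x_test)

# kernel-only prediction: sum_i lambda_i K(x_i, x_j)
pred_kernel = sum(l * kernel(x, x_test) for l, x in zip(lam, X))
print(np.isclose(pred_feature_space, pred_kernel))   # True
```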
kernel optimization

all the previous algorithms can be "kernelized"

what is a "good" kernel?
• the kernel should be "close" to a "target" kernel
• the kernel matrix satisfies some "structure" constraints

main idea: a kernel can be described via the Gram matrix of the data points, hence is a positive semidefinite matrix

→ semidefinite programming plays a role in kernel optimization
setup

we assume we are given training and test sets

goal:
• maximize the "alignment" to a given kernel on the training set (translates into constraints on the upper-left block of the kernel matrix)
• the kernel matrix satisfies structure constraints (translates into constraints on the whole matrix, including the test set)
alignment

idea: align K to a "target kernel" C by maximizing

A(K, C) := ⟨C, K⟩ / (‖K‖_F ‖C‖_F)

where ⟨C, K⟩ = Tr(CK) is the inner product of two symmetric matrices, and ‖C‖_F = √⟨C, C⟩ is the Frobenius norm

we can impose a lower bound α on the alignment with the second-order cone constraint on K

α ‖C‖_F · ‖K‖_F ≤ ⟨C, K⟩
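a small sketch of the alignment measure (not from the slides; the random Gram matrix is an assumption): by Cauchy-Schwarz, A(K, C) is at most 1, with equality for perfect self-alignment:

```python
import numpy as np

def alignment(K, C):
    # A(K, C) = <C, K> / (||K||_F ||C||_F), with <C, K> = Tr(C K)
    inner = np.trace(C @ K)
    return inner / (np.linalg.norm(K, 'fro') * np.linalg.norm(C, 'fro'))

rng = np.random.default_rng(5)
y = rng.choice([-1.0, 1.0], size=6)
C = np.outer(y, y)                      # a typical target kernel C = y y^T
M = rng.standard_normal((6, 6))
K = M @ M.T                             # an arbitrary Gram matrix

print(np.isclose(alignment(C, C), 1.0))     # True: perfect self-alignment
print(abs(alignment(K, C)) <= 1.0 + 1e-12)  # True: Cauchy-Schwarz bound
```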
affine constraints

it is also useful to impose that the kernel lies in some affine subspace

example: assume that K is of the form

K = Σ_{i=1}^N λ_i u_i u_i^T

where the λ_i ≥ 0 are the (variable) eigenvalues, and the u_i's are the (fixed) eigenvectors
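a sketch of this parametrization (not from the slides; the reference matrix is an assumption): with fixed orthonormal u_i's and variable λ_i ≥ 0, K = Σ_i λ_i u_i u_i^T is automatically symmetric and positive semidefinite, with eigenvalues exactly the λ_i:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
# fixed orthonormal eigenvectors u_i (e.g. from a reference kernel's eigendecomposition)
M = rng.standard_normal((n, n))
_, U = np.linalg.eigh(M @ M.T)

# variable nonnegative eigenvalues lambda_i
lam = rng.random(n)

# K = sum_i lambda_i u_i u_i^T : an affine parametrization that stays PSD
K = sum(l * np.outer(U[:, i], U[:, i]) for i, l in enumerate(lam))

print(np.allclose(K, K.T))                            # True: symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-9)         # True: positive semidefinite
print(np.allclose(np.sort(np.linalg.eigvalsh(K)),
                  np.sort(lam)))                      # True: eigenvalues are the lam_i
```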
optimizing kernels: example problem

goal: find a kernel that
• has alignment (at least α) with a given matrix (e.g., C = yy^T) on the training set
• belongs to some affine set 𝒦

the problem reduces to a semidefinite programming feasibility problem:

find K such that K ∈ 𝒦, α ‖C‖_F · ‖K‖_F ≤ ⟨C, K⟩, K positive definite
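verifying a candidate K against these conditions is straightforward; a sketch (not from the slides; the checker is an assumption, and membership in the affine set 𝒦 is omitted for brevity) checks the alignment and positive semidefiniteness conditions:

```python
import numpy as np

def is_feasible(K, C, alpha, tol=1e-9):
    # alignment constraint: alpha ||C||_F ||K||_F <= <C, K>
    aligned = (alpha * np.linalg.norm(C, 'fro') * np.linalg.norm(K, 'fro')
               <= np.trace(C @ K) + tol)
    symmetric = np.allclose(K, K.T)
    psd = np.min(np.linalg.eigvalsh((K + K.T) / 2)) >= -tol
    return bool(aligned and symmetric and psd)

y = np.array([1.0, 1.0, -1.0, -1.0])
C = np.outer(y, y)                      # target kernel C = y y^T
print(is_feasible(C, C, alpha=0.9))     # True: C itself has alignment 1
print(is_feasible(-C, C, alpha=0.9))    # False: -C is neither aligned nor PSD
```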
kernel optimization: what's next?

• of course, this is not a learning method
• much to learn from duality theory
• many other constraints can be handled, e.g., margin requirements
wrap-up

• convex optimization has much to offer to, and gain from, interaction with classification
• described variations on linear classification
• many robust optimization interpretations
• all these methods can be kernelized
• kernel optimization has high potential
see also

• Learning the Kernel Matrix with Semi-Definite Programming (Lanckriet, Cristianini, Bartlett, El Ghaoui, Jordan). In preparation (2002)
• Minimax Probability Machine (Lanckriet, Bhattacharyya, El Ghaoui, Jordan). NIPS 2001