New Trends in Optimization and Computational Algorithms
December 9–13, 2001

Convex Optimization in Classification Problems

Laurent El Ghaoui
Department of EECS, UC Berkeley
goal

• the connection between classification and LP, convex QP has a long history (Vapnik, Mangasarian, Bennett, etc.)
• recent progress in convex optimization: conic and semidefinite programming; geometric programming; robust optimization
• we'll outline some connections between convex optimization and classification problems

joint work with: M. Jordan, N. Cristianini, G. Lanckriet, C. Bhattacharyya
outline

▶ convex optimization
• SVMs and robust linear programming
• minimax probability machine
• kernel optimization
convex optimization

standard form:

min_x f_0(x) : f_i(x) ≤ 0, i = 1, …, m

• arises in many applications
• convexity not always recognized in practice
• can solve large classes of convex problems in polynomial time (Nesterov, Nemirovski, 1990)
conic optimization

special class of convex problems:

min_x c^T x : Ax = b, x ∈ K

where K is a cone, a direct product of the following "building blocks":

K = R^n_+                                    linear programming
K = {(y, t) ∈ R^{n+1} : t ≥ ‖y‖_2}           second-order cone programming, quadratic programming
K = {x ∈ R^{n×n} : x = x^T ⪰ 0}              semidefinite programming

fact: can solve conic problems in polynomial time (Nesterov, Nemirovski, 1990)
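as a quick numerical sketch (not part of the original slides; numpy and the helper names below are assumptions), membership in each of the three building-block cones can be checked directly from the definitions above:

```python
import numpy as np

def in_nonneg_orthant(x, tol=1e-9):
    # K = R^n_+ : every component nonnegative (linear programming)
    return bool(np.all(x >= -tol))

def in_second_order_cone(y, t, tol=1e-9):
    # K = {(y, t) : t >= ||y||_2} (second-order cone programming)
    return bool(t >= np.linalg.norm(y) - tol)

def in_psd_cone(X, tol=1e-9):
    # K = {X : X = X^T, X >= 0} : symmetric with nonnegative eigenvalues
    return bool(np.allclose(X, X.T)
                and np.min(np.linalg.eigvalsh((X + X.T) / 2)) >= -tol)

print(in_nonneg_orthant(np.array([1.0, 0.0, 2.0])))      # True
print(in_second_order_cone(np.array([3.0, 4.0]), 5.0))   # True: 5 >= ||(3,4)||_2 = 5
print(in_psd_cone(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True: eigenvalues 1 and 3
```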
conic duality

the dual of the conic problem

min_x c^T x : Ax = b, x ∈ K

is

max_y b^T y : c − A^T y ∈ K*

where

K* = {z : ⟨z, x⟩ ≥ 0 ∀x ∈ K}

is the cone dual to K

for the cones mentioned before, and direct products of them, K = K*
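a small sketch of weak duality for the self-dual LP cone K = R^n_+ (not from the slides; the random instance is an assumption): for any primal-feasible x and dual-feasible y, the duality gap c^T x − b^T y equals ⟨c − A^T y, x⟩, which is nonnegative since both factors lie in the cone:

```python
import numpy as np

rng = np.random.default_rng(0)

# primal-feasible point for min c^T x : Ax = b, x in R^n_+
n, m = 6, 3
A = rng.standard_normal((m, n))
x = rng.random(n)            # x >= 0, feasible by construction
b = A @ x                    # choose b so that Ax = b holds exactly

# dual-feasible y: need c - A^T y in K* = R^n_+ (the cone is self-dual)
y = rng.standard_normal(m)
slack = rng.random(n)        # nonnegative dual slack
c = A.T @ y + slack          # then c - A^T y = slack >= 0

# weak duality: c^T x >= b^T y, with gap <c - A^T y, x> >= 0
gap = c @ x - b @ y
print(gap >= 0)                        # True
print(np.isclose(gap, slack @ x))      # True: gap is exactly <slack, x>
```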
robust optimization

conic problem in dual form:

max_y b^T y : c − A^T y ∈ K

→ what if A is unknown-but-bounded, say A ∈ 𝒜, where 𝒜 is given?

robust counterpart:

max_y b^T y : ∀A ∈ 𝒜, c − A^T y ∈ K

• still convex, but tractability depends on 𝒜
• systematic ways to approximate (get lower bounds)
• for large classes of 𝒜, the approximation is exact
example: robust LP

linear program:

min_x c^T x : a_i^T x ≤ b, i = 1, …, m

assume the a_i's are unknown-but-bounded in ellipsoids

E_i := {a : (a − ā_i)^T Γ_i^{-1} (a − ā_i) ≤ 1}

where ā_i: center, Γ_i ⪰ 0: "shape matrix"

robust LP:

min_x c^T x : ∀a_i ∈ E_i, a_i^T x ≤ b, i = 1, …, m
robust LP: SOCP representation

the robust LP is equivalent to

min_x c^T x : ā_i^T x + ‖Γ_i^{1/2} x‖_2 ≤ b, i = 1, …, m

→ a second-order cone program!

interpretation: smooths the boundary of the feasible set
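the equivalence rests on the identity sup_{a ∈ E} a^T x = ā^T x + ‖Γ^{1/2} x‖_2; a Monte Carlo sketch (not from the slides; the random instance is an assumption) checks that the closed form upper-bounds, and is nearly attained by, samples from the ellipsoid:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
abar = rng.standard_normal(n)                 # ellipsoid center
M = rng.standard_normal((n, n))
Gamma = M @ M.T + np.eye(n)                   # shape matrix, positive definite

# symmetric square root of Gamma via eigendecomposition
w, V = np.linalg.eigh(Gamma)
G_half = V @ np.diag(np.sqrt(w)) @ V.T

x = rng.standard_normal(n)

# closed form: sup over a in the ellipsoid of a^T x
closed_form = abar @ x + np.linalg.norm(G_half @ x)

# Monte Carlo: sample a = abar + G_half u with ||u||_2 <= 1
u = rng.standard_normal((200000, n))
u /= np.maximum(np.linalg.norm(u, axis=1, keepdims=True), 1.0)
samples = (abar + u @ G_half) @ x
print(samples.max() <= closed_form + 1e-9)    # True: closed form is an upper bound
print(closed_form - samples.max() < 0.05)     # True: and it is (nearly) attained
```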
LP with Gaussian coefficients

assume a ∼ N(ā, Γ); then, for given x,

Prob{a^T x ≤ b} ≥ 1 − ε

is equivalent to:

ā^T x + κ‖Γ^{1/2} x‖_2 ≤ b

where κ = Φ^{-1}(1 − ε) and Φ is the c.d.f. of N(0, 1)

hence,
• can solve LPs with Gaussian coefficients using second-order cone programming
• the resulting SOCP is similar to the one obtained with ellipsoidal uncertainty
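a sketch of the chance-constraint equivalence (not from the slides; the instance is an assumption): choosing b so the SOC constraint holds with equality should give Prob{a^T x ≤ b} ≈ 1 − ε under the Gaussian model, which a Monte Carlo estimate confirms:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
n = 3
abar = rng.standard_normal(n)
M = rng.standard_normal((n, n))
Gamma = M @ M.T + 0.1 * np.eye(n)             # covariance of a

x = rng.standard_normal(n)
eps = 0.05
kappa = NormalDist().inv_cdf(1 - eps)          # Phi^{-1}(1 - eps) ~ 1.645

# right-hand side chosen so the SOC constraint holds with equality
w, V = np.linalg.eigh(Gamma)
G_half = V @ np.diag(np.sqrt(w)) @ V.T
b = abar @ x + kappa * np.linalg.norm(G_half @ x)

# Monte Carlo check: Prob{a^T x <= b} should be about 1 - eps
a = rng.multivariate_normal(abar, Gamma, size=200000)
prob = np.mean(a @ x <= b)
print(abs(prob - (1 - eps)) < 0.01)            # True (up to sampling error)
```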
LP with random coefficients

assume a ∼ (ā, Γ), i.e. the distribution of a has mean ā and covariance matrix Γ, but is otherwise unknown

Chebyshev inequality:

Prob{a^T x ≤ b} ≥ 1 − ε

is equivalent to:

ā^T x + κ‖Γ^{1/2} x‖_2 ≤ b

where κ = √((1 − ε)/ε)

leads to SOCPs similar to the ones obtained previously
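a short comparison (not from the slides) of the two safety factors: the distribution-free Chebyshev κ = √((1 − ε)/ε) is always larger than the Gaussian κ = Φ^{-1}(1 − ε), reflecting the price of knowing only the mean and covariance:

```python
import math
from statistics import NormalDist

# kappa for the distribution-free (Chebyshev) bound vs the Gaussian bound
for eps in (0.1, 0.05, 0.01):
    k_cheb = math.sqrt((1 - eps) / eps)
    k_gauss = NormalDist().inv_cdf(1 - eps)
    # knowing only second moments costs a larger safety factor
    print(eps, round(k_gauss, 3), round(k_cheb, 3), k_cheb > k_gauss)
```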
outline

• convex optimization
▶ SVMs and robust linear programming
• minimax probability machine
• kernel optimization
SVMs: setup

given data points x_i with labels y_i = ±1, i = 1, …, N

two-class linear classification with support vectors:

min ‖a‖_2 : y_i (a^T x_i − b) ≥ 1, i = 1, …, N

• amounts to selecting one separating hyperplane among the many possible
• the problem is feasible iff there exists a separating hyperplane between the two classes
SVMs: robust optimization interpretation

interpretation: SVMs are a way to handle noise in the data points

• assume each data point is unknown-but-bounded in a sphere of radius ρ and center x_i
• find the largest ρ such that separation is still possible between the two classes of perturbed points
variations

can use other data noise models:

• hypercube uncertainty (gives rise to LP)
• ellipsoidal uncertainty (→ QP)
• probabilistic uncertainty, Gaussian or Chebyshev (→ QP)
separation with hypercube uncertainty

assume each data point is unknown-but-bounded in a hypercube C_i:

x_i ∈ C_i := {x_i + ρPu : ‖u‖_∞ ≤ 1}

where the centers x_i and the "shape matrix" P are given

robust separation: leads to the linear program

min ‖Pa‖_1 : y_i (a^T x_i − b) ≥ 1, i = 1, …, N
separation with ellipsoidal uncertainty

assume each data point is unknown-but-bounded in an ellipsoid E_i:

x_i ∈ E_i := {x_i + ρPu : ‖u‖_2 ≤ 1}

where the centers x_i and the "shape matrix" P are given

robust separation leads to the QP

min ‖Pa‖_2 : y_i (a^T x_i − b) ≥ 1, i = 1, …, N
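the norms in the two objectives come from the worst-case perturbation over each uncertainty set; a sketch (not from the slides; a symmetric P is assumed so that P^T a = Pa matches the slide notation) verifies the two dual-norm identities behind them:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
M = rng.standard_normal((n, n))
P = (M + M.T) / 2                 # symmetric "shape matrix", an assumption
a = rng.standard_normal(n)

# hypercube ||u||_inf <= 1: sup_u u^T (P a) = ||P a||_1, attained at u = sign(P a)
u_box = np.sign(P @ a)
print(np.isclose(u_box @ (P @ a), np.linalg.norm(P @ a, 1)))   # True

# ellipsoid ||u||_2 <= 1: sup_u u^T (P a) = ||P a||_2, attained at u = Pa/||Pa||_2
u_ball = P @ a / np.linalg.norm(P @ a)
print(np.isclose(u_ball @ (P @ a), np.linalg.norm(P @ a, 2)))  # True
```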
outline

• convex optimization
• SVMs and robust linear programming
▶ minimax probability machine
• kernel optimization
minimax probability machine

goal:
• make assumptions about the data-generating process
• do not assume Gaussian distributions
• use second-moment analysis of the two classes

let x̄_±, Γ_± be the mean and covariance matrix of class y = ±1

MPM: maximize 1 − ε such that there exists (a, b) such that

inf_{x∼(x̄_+, Γ_+)} Prob{a^T x ≤ b} ≥ 1 − ε
inf_{x∼(x̄_−, Γ_−)} Prob{a^T x ≥ b} ≥ 1 − ε
MPMs: optimization problem

→ two-sided, multivariable Chebyshev inequality:

inf_{x∼(x̄, Γ)} Prob{a^T x ≤ b} = (b − a^T x̄)_+^2 / ((b − a^T x̄)_+^2 + a^T Γ a)

the MPM leads to the second-order cone program:

min_a ‖Γ_+^{1/2} a‖_2 + ‖Γ_−^{1/2} a‖_2 : a^T (x̄_+ − x̄_−) = 1

complexity is the same as for standard SVMs
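on a toy two-class problem the SOCP above can be sketched without a conic solver (not from the slides; the data and the grid search are assumptions): the equality constraint a^T(x̄_+ − x̄_−) = 1 leaves, in 2D, a one-parameter family of classifiers, over which the objective can be minimized by brute force:

```python
import numpy as np

# toy class means and covariances (second-moment information only)
xp = np.array([2.0, 1.0]);   Gp = np.array([[1.0, 0.3], [0.3, 0.5]])
xm = np.array([-1.0, -1.0]); Gm = np.array([[0.8, -0.2], [-0.2, 1.2]])

def sqrtm(G):
    # symmetric matrix square root via eigendecomposition
    w, V = np.linalg.eigh(G)
    return V @ np.diag(np.sqrt(w)) @ V.T

Gp_h, Gm_h = sqrtm(Gp), sqrtm(Gm)
d = xp - xm
d_perp = np.array([-d[1], d[0]])

# parametrize the constraint a^T d = 1: a(t) = d/||d||^2 + t * d_perp
def objective(t):
    a = d / (d @ d) + t * d_perp
    return np.linalg.norm(Gp_h @ a) + np.linalg.norm(Gm_h @ a)

ts = np.linspace(-1, 1, 20001)            # coarse grid search, not a real solver
vals = np.array([objective(t) for t in ts])
t_star = ts[vals.argmin()]
a_star = d / (d @ d) + t_star * d_perp

print(np.isclose(a_star @ d, 1.0))        # True: normalization constraint holds
print(vals.min() <= objective(0.0))       # True: at least as good as t = 0
```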
dual problem

express the problem as an unconstrained min-max problem:

min_a max_{‖u‖_2 ≤ 1, ‖v‖_2 ≤ 1} u^T Γ_+^{1/2} a − v^T Γ_−^{1/2} a + λ(1 − a^T (x̄_+ − x̄_−))

exchange min and max, and set ρ := 1/λ:

min_{ρ,u,v} ρ : x̄_+ + Γ_+^{1/2} u = x̄_− + Γ_−^{1/2} v, ‖u‖_2 ≤ ρ, ‖v‖_2 ≤ ρ

geometric interpretation: define the two ellipsoids

E_±(ρ) := {x̄_± + Γ_±^{1/2} u : ‖u‖_2 ≤ ρ}

and find the smallest ρ for which the ellipsoids intersect
robust optimization interpretation

assume the data is generated as follows: for data with label +,

x_+ ∈ E_+(ρ) := {x̄_+ + Γ_+^{1/2} u : ‖u‖_2 ≤ ρ}

and similarly for data with label −

the MPM finds the largest ρ for which robust separation is possible

[figure: the two ellipsoids around x̄_+ and x̄_−, separated by the hyperplane a^T x − b = 0]
variations

• minimize a weighted sum of misclassification probabilities
• quadratic separation: find a quadratic set Q such that

inf_{x∼(x̄_+, Γ_+)} Prob{x ∈ Q} ≥ 1 − ε
inf_{x∼(x̄_−, Γ_−)} Prob{x ∉ Q} ≥ 1 − ε

→ leads to a semidefinite programming problem
• nonlinear classification via kernels (using plug-in estimates of the mean and covariance matrix)
outline

• convex optimization
• SVMs and robust linear programming
• minimax probability machine
▶ kernel optimization
transduction

transduction: given a labeled training set and an unlabeled test set, predict the labels

the data contains both labeled points and unlabeled points
kernel methods

main goal: separate using a nonlinear classifier

a^T φ(x) = b

where φ is a nonlinear operator

define the kernel matrix (on both labeled and unlabeled data)

K_ij = φ(x_i) φ(x_j)^T

in a transductive setting, all we need to know to predict the labels are a, b and the kernel matrix
kernel methods: idea of proof

all the linear classification methods we've seen so far are such that, at the optimum, a is in the range of the labeled data:

a = Σ_i λ_i x_i

thus, in the nonlinear case, the optimization problem depends only on the values of the kernel matrix K_ij for labeled points x_i, x_j

in a transductive setting, the prediction of labels also involves K_ij only, since for an unlabeled data point x_j,

a^T φ(x_j) = Σ_i λ_i φ(x_i)^T φ(x_j)

involves only the K_ij's
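the argument above can be sketched concretely (not from the slides; the quadratic feature map and coefficients are assumptions): with φ(x) = (x_1², √2 x_1 x_2, x_2²), whose inner products give K(x, z) = (x^T z)², the feature-space prediction a^T φ(x_j) agrees exactly with the kernel-only expression Σ_i λ_i K(x_i, x_j):

```python
import numpy as np

# explicit quadratic feature map and the kernel it induces: K(x, z) = (x^T z)^2
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def kernel(x, z):
    return (x @ z) ** 2

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 2))          # labeled points x_i
lam = rng.standard_normal(5)             # expansion coefficients lambda_i
x_test = rng.standard_normal(2)          # an unlabeled point x_j

# feature-space prediction with a = sum_i lambda_i phi(x_i)
a = sum(l * phi(x) for l, x in zip(lam, X))
pred_feature_space = a @ phi(x_test)

# kernel-only prediction: sum_i lambda_i K(x_i, x_j)
pred_kernel = sum(l * kernel(x, x_test) for l, x in zip(lam, X))
print(np.isclose(pred_feature_space, pred_kernel))   # True
```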
kernel optimization

all the previous algorithms can be "kernelized"

what is a "good" kernel?
• the kernel should be "close" to a "target" kernel
• the kernel matrix satisfies some "structure" constraints

main idea: a kernel can be described via the Gram matrix of the data points, hence is a positive semidefinite matrix

→ semidefinite programming plays a role in kernel optimization
setup

we assume we are given training and test sets

goal:
• maximize the "alignment" to a given kernel on the training set (translates into constraints on the upper-left block of the kernel matrix)
• the kernel matrix satisfies structure constraints (translates into constraints on the whole matrix, including the test set)
alignment

idea: align K to a "target kernel" C by maximizing

A(K, C) := ⟨C, K⟩ / (‖K‖_F ‖C‖_F)

where ⟨C, K⟩ = Tr(CK) is the inner product of two symmetric matrices, and ‖C‖_F = √⟨C, C⟩ is the Frobenius norm

we can impose a lower bound α on the alignment with the second-order cone constraint on K

α ‖C‖_F · ‖K‖_F ≤ ⟨C, K⟩
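a small sketch of the alignment measure (not from the slides; the random Gram matrix is an assumption): by Cauchy-Schwarz, A(K, C) is at most 1, with equality for perfect self-alignment:

```python
import numpy as np

def alignment(K, C):
    # A(K, C) = <C, K> / (||K||_F ||C||_F), with <C, K> = Tr(C K)
    inner = np.trace(C @ K)
    return inner / (np.linalg.norm(K, 'fro') * np.linalg.norm(C, 'fro'))

rng = np.random.default_rng(5)
y = rng.choice([-1.0, 1.0], size=6)
C = np.outer(y, y)                      # a typical target kernel C = y y^T
M = rng.standard_normal((6, 6))
K = M @ M.T                             # an arbitrary Gram matrix

print(np.isclose(alignment(C, C), 1.0))     # True: perfect self-alignment
print(abs(alignment(K, C)) <= 1.0 + 1e-12)  # True: Cauchy-Schwarz bound
```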
affine constraints

it is also useful to impose that the kernel lies in some affine subspace

example: assume that K is of the form

K = Σ_{i=1}^N λ_i u_i u_i^T

where the λ_i ≥ 0 are the (variable) eigenvalues, and the u_i's are the (fixed) eigenvectors
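a sketch of this parametrization (not from the slides; the reference matrix is an assumption): with fixed orthonormal u_i's and variable λ_i ≥ 0, K = Σ_i λ_i u_i u_i^T is automatically symmetric and positive semidefinite, with eigenvalues exactly the λ_i:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5
# fixed orthonormal eigenvectors u_i (e.g. from a reference kernel's eigendecomposition)
M = rng.standard_normal((n, n))
_, U = np.linalg.eigh(M @ M.T)

# variable nonnegative eigenvalues lambda_i
lam = rng.random(n)

# K = sum_i lambda_i u_i u_i^T : an affine parametrization that stays PSD
K = sum(l * np.outer(U[:, i], U[:, i]) for i, l in enumerate(lam))

print(np.allclose(K, K.T))                            # True: symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-9)         # True: positive semidefinite
print(np.allclose(np.sort(np.linalg.eigvalsh(K)),
                  np.sort(lam)))                      # True: eigenvalues are the lam_i
```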
optimizing kernels: example problem

goal: find a kernel that
• has alignment (at least α) with a given matrix (e.g., C = yy^T) on the training set
• belongs to some affine set 𝒦

the problem reduces to a semidefinite programming feasibility problem:

find K such that K ∈ 𝒦, α ‖C‖_F · ‖K‖_F ≤ ⟨C, K⟩, K positive definite
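verifying a candidate K against these conditions is straightforward; a sketch (not from the slides; the checker is an assumption, and membership in the affine set 𝒦 is omitted for brevity) checks the alignment and positive semidefiniteness conditions:

```python
import numpy as np

def is_feasible(K, C, alpha, tol=1e-9):
    # alignment constraint: alpha ||C||_F ||K||_F <= <C, K>
    aligned = (alpha * np.linalg.norm(C, 'fro') * np.linalg.norm(K, 'fro')
               <= np.trace(C @ K) + tol)
    symmetric = np.allclose(K, K.T)
    psd = np.min(np.linalg.eigvalsh((K + K.T) / 2)) >= -tol
    return bool(aligned and symmetric and psd)

y = np.array([1.0, 1.0, -1.0, -1.0])
C = np.outer(y, y)                      # target kernel C = y y^T
print(is_feasible(C, C, alpha=0.9))     # True: C itself has alignment 1
print(is_feasible(-C, C, alpha=0.9))    # False: -C is neither aligned nor PSD
```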
kernel optimization: what's next?

• of course, this is not a learning method
• much to learn from duality theory
• many other constraints can be handled, e.g., margin requirements
wrap-up

• convex optimization has much to offer to, and gain from, interaction with classification
• described variations on linear classification
• many robust optimization interpretations
• all these methods can be kernelized
• kernel optimization has high potential
see also

• Learning the Kernel Matrix with Semi-Definite Programming (Lanckriet, Cristianini, Bartlett, El Ghaoui, Jordan). In preparation (2002)
• Minimax Probability Machine (Lanckriet, Bhattacharyya, El Ghaoui, Jordan). NIPS 2001