Features, Kernels, and Similarity Functions


Page 1: Features, Kernels, and Similarity functions

Features, Kernels, and Similarity functions

Avrim Blum
Machine learning lunch, 03/05/07

Page 2: Features, Kernels, and Similarity functions

Suppose you want to…

use learning to solve some classification problem.

E.g., given a set of images, learn a rule to distinguish men from women.

The first thing you need to do is decide what you want as features.

Or, for algs like SVM and Perceptron, can use a kernel function, which provides an implicit feature space. But then what kernel to use?

Can Theory provide any help or guidance?

Page 3: Features, Kernels, and Similarity functions

Plan for this talk

Discuss a few ways theory might be of help:

- Algorithms designed to do well in large feature spaces when only a small number of features are actually useful. So you can pile a lot on when you don’t know much about the domain.

- Kernel functions. Standard theoretical view, plus a new one that may provide more guidance. Bridge between “implicit mapping” and “similarity function” views. Talk about the quality of a kernel in terms of more tangible properties. [work with Nina Balcan]

- Combining the above. Using kernels to generate explicit features.

Page 4: Features, Kernels, and Similarity functions

A classic conceptual question

How is it possible to learn anything quickly when there is so much irrelevant information around?

Must there be some hard-coded focusing mechanism, or can learning handle it?

Page 5: Features, Kernels, and Similarity functions

A classic conceptual question

Let’s try a very simple theoretical model. Have n boolean features. Labels are + or −.

1001101110  +
1100111101  +
0111010111

Assume the distinction is based on just one feature. How many prediction mistakes do you need to make before you’ve figured out which one it is?

- Can take a majority vote over all possibilities consistent with the data so far (sketched below). Each mistake crosses off at least half, so O(log n) mistakes total.

- log(n) is good: doubling n only adds 1 more mistake.

- Can’t do better (consider log(n) random strings with random labels; whp there is a consistent feature in hindsight).
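To make the mistake-bound argument concrete, here is a minimal sketch (not from the slides) of the majority-vote scheme over the n single-feature hypotheses “label = xi”; the function names, the feature count, and the random data stream are illustrative assumptions.

```python
import random

def halving_predict(candidates, x):
    """Majority vote of the hypotheses 'label = x_i' still consistent so far."""
    votes = sum(1 if x[i] == 1 else -1 for i in candidates)
    return 1 if votes >= 0 else -1

def halving_update(candidates, x, label):
    """On a mistake, cross off every candidate feature that disagrees with the label."""
    return [i for i in candidates if (x[i] == 1) == (label == 1)]

# Toy run: n = 1024 boolean features, target is a single (unknown) feature.
n, target = 1024, 37
candidates = list(range(n))
mistakes = 0
for _ in range(2000):
    x = [random.randint(0, 1) for _ in range(n)]
    label = 1 if x[target] == 1 else -1
    if halving_predict(candidates, x) != label:
        mistakes += 1
        candidates = halving_update(candidates, x, label)
print("mistakes:", mistakes)  # stays O(log n): each mistake crosses off at least half
```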

Page 6: Features, Kernels, and Similarity functions

A classic conceptual question

What about more interesting classes of functions (not just target a single feature)?

Page 7: Features, Kernels, and Similarity functions

Littlestone’s Winnow algorithm [MLJ 1988]

Motivated by the question: what if the target is an OR of r ≪ n features?

A majority-vote scheme over all ~n^r possibilities would make O(r log n) mistakes, but is totally impractical. Can you do this efficiently?

Winnow is a simple, efficient algorithm that meets this bound.

More generally, if there exists an LTF such that positives satisfy w1x1 + w2x2 + … + wnxn ≥ c and negatives satisfy w1x1 + w2x2 + … + wnxn ≤ c − γ (with W = Σi |wi|), then # mistakes = O((W/γ)² log n).

E.g., if the target is a “k of r” function, get O(r² log n). Key point: still only log dependence on n.

[Slide figure: an example x = 100101011001101011 labeled +, with features x4, x7, x10 highlighted.]

Page 8: Features, Kernels, and Similarity functions

Littlestone’s Winnow algorithm [MLJ 1988]

How does it work? Balanced version:

- Maintain weight vectors w+ and w−. Initialize all weights to 1.

- Classify based on whether w+·x or w−·x is larger. (Have x0 ≡ 1.)

- If we make a mistake on a positive x, then for each xi = 1: wi+ ← (1+ε)·wi+ and wi− ← (1−ε)·wi−. And vice-versa for a mistake on a negative x. (A sketch follows below.)

Other properties:

- Can show this approximates maxent constraints.

- In the other direction, [Ng’04] shows that maxent with L1 regularization gets Winnow-like bounds.
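Below is a minimal sketch of the balanced update just described, not the slides’ own code: the learning rate eps, the mistake-driven loop, the tie-breaking rule, and the toy OR target are all assumptions made for illustration.

```python
import numpy as np

class BalancedWinnow:
    """Balanced Winnow: weight vectors w+ and w-, multiplicative updates on mistakes."""
    def __init__(self, n, eps=0.5):
        self.w_pos = np.ones(n)   # all weights start at 1
        self.w_neg = np.ones(n)
        self.eps = eps

    def predict(self, x):
        # classify by whether w+ . x or w- . x is larger
        return 1 if self.w_pos @ x >= self.w_neg @ x else -1

    def update(self, x, label):
        if self.predict(x) == label:
            return
        active = x == 1                        # only coordinates with x_i = 1 move
        if label == 1:                         # mistake on a positive example
            self.w_pos[active] *= (1 + self.eps)
            self.w_neg[active] *= (1 - self.eps)
        else:                                  # mistake on a negative example
            self.w_pos[active] *= (1 - self.eps)
            self.w_neg[active] *= (1 + self.eps)

# Toy run: target is the OR of x_0 and x_1 over n = 100 random boolean features.
n = 100
learner = BalancedWinnow(n)
mistakes = 0
rng = np.random.default_rng(0)
for _ in range(1000):
    x = rng.integers(0, 2, size=n)
    y = 1 if (x[0] or x[1]) else -1
    if learner.predict(x) != y:
        mistakes += 1
    learner.update(x, y)
print("mistakes:", mistakes)
```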

Page 9: Features, Kernels, and Similarity functions

Practical issues

- On a batch problem, may want to cycle through the data, each time with a smaller ε.

- Can also do a margin version: update if just barely correct.

- If you want to output a likelihood, a natural choice is e^{w+·x} / [e^{w+·x} + e^{w−·x}].

- Can extend to multiclass too.

- William & Vitor have a paper with some other nice practical adjustments.

Page 10: Features, Kernels, and Similarity functions

Winnow versus Perceptron/SVM

Winnow is similar at a high level to Perceptron updates. What’s the difference?

Suppose the data is linearly separable by w·x = 0 with |w·x| ≥ γ.

- For Perceptron, mistakes/samples are bounded by O((L2(w)·L2(x)/γ)²).

- For Winnow, mistakes/samples are bounded by O((L1(w)·L∞(x)/γ)² · log n).

For boolean features, L∞(x) = 1, while L2(x) can be sqrt(n). If the target is sparse and examples are dense, Winnow is better.

- E.g., x random in {0,1}^n, f(x) = x1. Perceptron: O(n) mistakes.

If the target is dense (most features are relevant) and examples are sparse, then Perceptron wins.


Page 11: Features, Kernels, and Similarity functions

OK, now on to kernels…

Page 12: Features, Kernels, and Similarity functions

Generic problem

Given a set of images, want to learn a linear separator to distinguish men from women.

Problem: the pixel representation is no good.

One approach: pick a better set of features! But this seems ad hoc.

Instead: use a kernel! K(x,y) = φ(x)·φ(y).

- φ is an implicit, high-dimensional mapping.

- Perceptron/SVM only interact with the data through dot-products, so they can be “kernelized”.

- If the data is separable in φ-space by a large L2 margin, you don’t have to pay for it.

Page 13: Features, Kernels, and Similarity functions

Kernels

E.g., the kernel K(x,y) = (1 + x·y)^d for the case of n = 2, d = 2 corresponds to the implicit mapping:

[Slide figure: X’s and O’s that are not separable in the original (x1, x2)-space become linearly separable in the implicit (z1, z2, z3)-space.]
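As a quick sanity check (my addition, not from the slides), the explicit map that realizes (1 + x·y)² for n = 2 is six-dimensional, and we can verify numerically that the kernel equals the dot product under that map; the test vectors below are arbitrary.

```python
import numpy as np

def K(x, y, d=2):
    """Polynomial kernel (1 + x.y)^d."""
    return (1 + x @ y) ** d

def phi(x):
    """Explicit feature map whose dot product equals (1 + x.y)^2 when n = 2."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print(K(x, y), phi(x) @ phi(y))   # same value, computed two ways
```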

Page 14: Features, Kernels, and Similarity functions

Kernels

- Perceptron/SVM only interact with the data through dot-products, so they can be “kernelized”.

- If the data is separable in φ-space by a large L2 margin, you don’t have to pay for it.

- E.g., K(x,y) = (1 + x·y)^d. φ: (n-diml space) → (n^d-diml space).

- E.g., K(x,y) = e^{−(x−y)²}.

Conceptual warning: you’re not really “getting all the power of the high-dimensional space without paying for it”. The margin matters. E.g., K(x,y) = 1 if x = y, K(x,y) = 0 otherwise: this corresponds to a mapping where every example gets its own coordinate. Everything is linearly separable, but there is no generalization.

Page 15: Features, Kernels, and Similarity functions

Question: do we need the notion of an implicit space to understand what makes a kernel helpful for learning?

Page 16: Features, Kernels, and Similarity functions

Focus on batch setting

- Assume examples are drawn from some probability distribution: a distribution D over x, labeled by a target function c. Or a distribution P over labeled pairs (x, ℓ). Will call P (or (c, D)) our “learning problem”.

- Given labeled training data, want the algorithm to do well on new data.

Page 17: Features, Kernels, and Similarity functions

Something funny about the theory of kernels

On the one hand, operationally a kernel is just a similarity function: K(x,y) ∈ [−1,1], with some extra requirements. [Here I’m scaling to |φ(x)| = 1.]

And in practice, people think of a good kernel as a good measure of similarity between data points for the task at hand.

But Theory talks about margins in the implicit high-dimensional φ-space: K(x,y) = φ(x)·φ(y).

Page 18: Features, Kernels, and Similarity functions

I want to use ML to classify protein structures and I’m trying to decide on a similarity fn to use. Any help?

It should be pos. semidefinite, and should result in your data having a large margin separator in implicit high-diml space you probably can’t even calculate.

Page 19: Features, Kernels, and Similarity functions

Umm… thanks, I guess.


Page 20: Features, Kernels, and Similarity functions

Something funny about the theory of kernels

- Theory talks about margins in the implicit high-dimensional φ-space: K(x,y) = φ(x)·φ(y). Not great for intuition (do I expect this kernel or that one to work better for me?).

- Can we connect better with the idea of a good kernel being one that is a good notion of similarity for the problem at hand?

- Motivation [BBV]: if there is margin γ in φ-space, then we can pick Õ(1/γ²) random examples y1,…,yn (“landmarks”) and do the mapping x ↦ [K(x,y1),…,K(x,yn)]. Whp the data in this space will be approximately linearly separable. (A sketch of this mapping follows below.)
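Here is a minimal sketch of that landmark mapping. The helper names and the Gaussian kernel are illustrative assumptions; any kernel or similarity function K(x,y) could be plugged in.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """Illustrative kernel choice; any kernel/similarity K(x, y) works here."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def landmark_map(X, landmarks, K=gaussian_kernel):
    """Map each row x of X to the explicit feature vector [K(x, y_1), ..., K(x, y_n)]."""
    return np.array([[K(x, y) for y in landmarks] for x in X])

# Usage sketch: draw ~O(1/margin^2) landmarks (unlabeled data is fine), build
# F = landmark_map(X_train, landmarks), then train any linear separator on F.
```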

Page 21: Features, Kernels, and Similarity functions

Goal: a notion of “good similarity function” that…

1. Talks in terms of more intuitive properties (no implicit high-diml spaces, no requirement of positive-semidefiniteness, etc).

2. If K satisfies these properties for our given problem, then this has implications for learning.

3. Is broad: includes the usual notion of a “good kernel” (one that induces a large-margin separator in φ-space).

If so, then this can help with designing the K. [Recent work with Nina, with extensions by Nati Srebro]

Page 22: Features, Kernels, and Similarity functions

Proposal satisfying (1) and (2):

Say we have a learning problem P (distribution D over examples labeled by an unknown target f).

A sim fn K: (x,y) → [−1,1] is (ε,γ)-good for P if at least a 1−ε fraction of examples x satisfy:

E_{y~D}[K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y) ≠ ℓ(x)] + γ

Q: how could you use this to learn?

Page 23: Features, Kernels, and Similarity functions

How to use it

At least a 1−ε prob mass of x satisfy:
E_{y~D}[K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[K(x,y) | ℓ(y) ≠ ℓ(x)] + γ

- Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples.
- Draw a set S− of O((1/γ²) ln(1/δ²)) negative examples.
- Classify x based on which gives the better average score. (Sketch below.)

Hoeffding: for any given “good x”, the probability of error over the draw of S+, S− is at most δ².

So, there is at most a δ chance that our draw is bad on more than a δ fraction of the “good x”.

With prob ≥ 1−δ, error rate ≤ ε + δ.
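A minimal sketch of this classification rule, assuming K is any similarity function with values in [−1, 1]; the function name is mine.

```python
import numpy as np

def similarity_classify(x, S_plus, S_minus, K):
    """Predict + if x is on average more similar to S_plus than to S_minus."""
    avg_pos = np.mean([K(x, y) for y in S_plus])
    avg_neg = np.mean([K(x, y) for y in S_minus])
    return 1 if avg_pos >= avg_neg else -1
```

With |S+| and |S−| on the order stated above, the Hoeffding argument gives error at most ε + δ with probability at least 1 − δ.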

Page 24: Features, Kernels, and Similarity functions

But not broad enough

K(x,y) = x·y has a good (large-margin) separator but doesn’t satisfy the defn: half of the positives are more similar to negatives than to a typical positive.

[Slide figure: example with + and − regions 30° apart.]

Page 25: Features, Kernels, and Similarity functions

But not broad enough

Idea: this would work if we didn’t pick the y’s from the top-left. Broaden to say: OK if ∃ a large region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label (even if we don’t know R in advance).

[Slide figure: same example as on the previous slide.]

Page 26: Features, Kernels, and Similarity functions

Broader defn…

Say K: (x,y) → [−1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1−ε fraction of x satisfy:

E_{y~D}[w(y)K(x,y) | ℓ(y) = ℓ(x)] ≥ E_{y~D}[w(y)K(x,y) | ℓ(y) ≠ ℓ(x)] + γ

Can still use this for learning:

- Draw S+ = {y1,…,yn}, S− = {z1,…,zn}, with n = Õ(1/γ²).

- Use them to “triangulate” the data: x ↦ [K(x,y1),…,K(x,yn), K(x,z1),…,K(x,zn)].

- Whp, there exists a good separator in this space: w = [w(y1),…,w(yn), −w(z1),…,−w(zn)].

Page 27: Features, Kernels, and Similarity functions

Broader defn…

(Same definition and triangulation as above.) Whp, there exists a good separator in this space: w = [w(y1),…,w(yn), −w(z1),…,−w(zn)].

So, take a new set of examples, project them into this space, and run your favorite linear-separator learning algorithm.*

*Technically the bounds are better if we adjust the definition to penalize more heavily the examples that fail the inequality badly…

Page 28: Features, Kernels, and Similarity functions

Broader defn…

Algorithm:

- Draw S+ = {y1,…,yd}, S− = {z1,…,zd}, d = O((1/γ²) ln(1/δ²)). Think of these as “landmarks”.

- Use them to “triangulate” the data: x ↦ [K(x,y1),…,K(x,yd), K(x,z1),…,K(x,zd)].

- Guarantee: with prob ≥ 1−δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.

Actually, the margin is good in both the L1 and L2 senses. This particular approach requires wasting examples for use as the “landmarks”, but one could use unlabeled data for this part. (A sketch follows below.)
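A minimal sketch of this algorithm, with a plain mistake-driven perceptron standing in for “your favorite linear separator”; that choice and the helper names are assumptions, not the slides’ prescription.

```python
import numpy as np

def triangulate(X, landmarks, K):
    """Map each x to [K(x, y_1), ..., K(x, y_2d)] over the 2d drawn landmarks."""
    return np.array([[K(x, y) for y in landmarks] for x in X])

def train_perceptron(F, labels, epochs=20):
    """Simple stand-in linear separator trained on the triangulated features."""
    w = np.zeros(F.shape[1])
    for _ in range(epochs):
        for f, y in zip(F, labels):
            if y * (w @ f) <= 0:   # mistake (or zero margin): update
                w += y * f
    return w

# Usage sketch: landmarks = list(S_plus) + list(S_minus);
# w = train_perceptron(triangulate(X_train, landmarks, K), y_train);
# predict a new x via np.sign(w @ triangulate([x], landmarks, K)[0]).
```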

Page 29: Features, Kernels, and Similarity functions

Interesting property of the definition

- An (ε,γ)-good kernel [at least a 1−ε fraction of x have margin ≥ γ] is an (ε′,γ′)-good sim fn under this definition.

- But our current proofs suffer a penalty: ε′ = ε + ε_extra, γ′ = γ³·ε_extra.

- So, at a qualitative level, we can have a theory of similarity functions that doesn’t require implicit spaces.

- Nati Srebro has improved the γ³ to γ², which is tight, and extended the result to hinge loss.

Page 30: Features, Kernels, and Similarity functions

Approach we’re investigating

With Nina & Mugizi:

- Take a problem where the original features are already pretty good, plus you have a couple of reasonable similarity functions K1, K2, …

- Take some unlabeled data as landmarks, and use them to enlarge the feature space: K1(x,y1), K2(x,y1), K1(x,y2), … (see the sketch below).

- Run Winnow on the result.

- Can prove guarantees if some convex combination of the Ki is good.
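A minimal sketch of the feature-enlarging step described here; the function name, the concatenation with the original features, and the list of similarity functions are illustrative assumptions. The resulting vectors would then be fed to Winnow (or another L1-based learner).

```python
import numpy as np

def enlarged_features(x, landmarks, sims, keep_original=True):
    """Concatenate the original features of x with K_i(x, y_j) for every
    similarity function K_i in `sims` and every unlabeled landmark y_j."""
    feats = list(x) if keep_original else []
    for y in landmarks:
        for K in sims:            # e.g. sims = [K1, K2]
            feats.append(K(x, y))
    return np.array(feats)
```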

Page 31: Features, Kernels, and Similarity functions

Open questions

- This view gives some sufficient conditions for a similarity function to be useful for learning, but it doesn’t have direct implications for using the similarity function directly in an SVM, say.

- Can one define other interesting, reasonably intuitive, sufficient conditions for a similarity function to be useful for learning?