poggi analytics - distance - 1a


Buenos Aires, March 2016. Eduardo Poggi

www.umiacs.umd.edu/~mrastega/

Instance Based Learning

Distances, Introduction, k-nearest neighbor, Locally weighted regression, Radial Basis Functions, Case-Based Reasoning, Instance reduction

Distances

What if it is for some rather than for all?

Distances

Pairwise similarities between product categories (upper triangle; diagonal = 1):

              Cars  Motorcycles  Elect.  Toys  Sweets  Wheat  Chicken
Cars            1       0.8        0.5    0.2    0.1     0       0
Motorcycles             1          0.5    0.2    0.1     0       0
Elect.                             1      0.2    0.1     0       0
Toys                                      1      0.1     0       0
Sweets                                           1      0.5     0.5
Wheat                                                    1      0.7
Chicken                                                           1

Distances

The Levenshtein distance (also called edit distance, or distance between words) is the minimum number of operations required to transform one character string into another, where an operation is the insertion, deletion, or substitution of a single character.

https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python
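For reference, a compact dynamic-programming sketch of that distance in plain Python (the linked Wikibooks page shows other variants):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))                 # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                                 # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))            # 3
```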

Distances

www.sc.ehu.es/ccwgrrom/transparencias/pdf-vision-1-transparencias/capitulo-1.pdf

Distances

http://www.nidokidos.org/threads/29243-Animals-humans-face-similarity-funny-pics!!

Distances

http://lear.inrialpes.fr/people/nowak/similarity/

Distances: Product

Product taxonomy (each line is one level deeper in the tree):
Food, Cleaning, Clothing
Animal, Vegetable, Mineral
Dairy, Meat
Liquid milk, Fermented milk, Cheese, Butter
Whole yogurt, Skim yogurt
Plain yogurt, Flavored yogurt
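The slides do not spell out how a taxonomy like this is turned into a distance; a minimal sketch of one common option, counting the edges on the path between two categories in the tree, is below. The node names and parent links are a hypothetical fragment of the hierarchy above.

```python
# Hypothetical fragment of the product taxonomy: child -> parent.
PARENT = {
    "food": "product", "cleaning": "product", "clothing": "product",
    "animal": "food", "vegetable": "food", "mineral": "food",
    "dairy": "animal", "meat": "animal",
    "liquid_milk": "dairy", "fermented_milk": "dairy", "cheese": "dairy", "butter": "dairy",
    "whole_yogurt": "fermented_milk", "skim_yogurt": "fermented_milk",
}

def ancestors(node):
    """Return the list of nodes from `node` up to the root."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def tree_distance(a, b):
    """Number of edges on the path between a and b through their lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    common = set(pa) & set(pb)
    depth_a = min(pa.index(c) for c in common)   # steps from a down to the common ancestor
    depth_b = min(pb.index(c) for c in common)
    return depth_a + depth_b

print(tree_distance("cheese", "butter"))   # 2: siblings under dairy
print(tree_distance("cheese", "meat"))     # 3: path goes through animal
```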

What is IBL?

The idea is simple: the class of an instance should be similar to the class assigned to similar examples. Store all the examples; when an instance arrives for classification, find the "most similar" examples and analyze the classes assigned to them.

But: Classification can be expensive. Are all attributes equally relevant? How many "similar" examples should be used? What if the similar examples have dissimilar classes? Do all similar examples carry the same weight? How similar do the "similar" examples have to be?

K-nearest neighbor

To define how similar two examples are we need a metric. We assume all examples are points in an n-dimensional space R^n and use the Euclidean distance. Let Xi and Xj be two examples; their distance d(Xi, Xj) is defined as:

d(Xi, Xj) = ( Σk (xik - xjk)² )^(1/2)

where xik is the value of attribute k on example Xi.
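A minimal sketch of this distance in plain Python (the function name is ours):

```python
import math

def euclidean_distance(xi, xj):
    """Euclidean distance between two examples given as equal-length lists of numbers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean_distance([1.0, 2.0], [4.0, 6.0]))  # 5.0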

K-nearest neighbor for discrete classes

Figure: a new example to classify and its K = 4 nearest neighbors.

Nearest Neighbor

Four things make a memory-based learner:

1. A distance metric: Euclidean
2. How many nearby neighbors to look at? One
3. A weighting function (optional): unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor.

Voronoi Diagram

Decision surface induced by a 1-nearest neighbor. The decision surface is a combination of convex polyhedra surrounding each training example.

The Zen of Voronoi Diagrams

Figures: Voronoi decision regions for 0, 1, 3, 5 and 7 nearest neighbors.

k-Nearest Neighbour Classification Method

Key idea: keep all the training instances. Given a query example, take a vote among its k neighbours. Neighbours are determined by using a distance function.

k-Nearest Neighbour Classification Method

Figures: classification of a query point with k = 1 and with k = 4.

Probability interpretation: estimate p(y|x) as

p(y | x) = |{ (xi, yi) ∈ N(x) : yi = y }| / |N(x)|

where N(x) is the neighbourhood around x.

Sample adapted from Rong Jin’s slides
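A tiny sketch of this estimate, assuming the labels of the k neighbours of x have already been collected (the function name is ours):

```python
from collections import Counter

def neighbourhood_class_probabilities(neighbour_labels):
    """Estimate p(y|x) as the fraction of neighbours of x carrying each label y."""
    counts = Counter(neighbour_labels)
    total = len(neighbour_labels)
    return {label: count / total for label, count in counts.items()}

print(neighbourhood_class_probabilities(["+", "+", "-", "+"]))  # {'+': 0.75, '-': 0.25}
```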

k-Nearest Neighbour Classification Method

Advantages: training is really fast; can learn complex target functions.

Disadvantages: slow at query time; efficient data structures are needed to speed up the query.

How to choose k?

Use validation with the leave-one-out method.

For k = 1, 2, …, K:
  Err(k) = 0
  1. Randomly select a training data point and hide its class label.
  2. Using the remaining data and the given k, predict the class label of the left-out point.
  3. Err(k) = Err(k) + 1 if the predicted label differs from the true label.
  Repeat the procedure until all training examples have been tested.

Choose the k whose Err(k) is minimal.

Working through the procedure on the example in the slides gives Err(1) = 3, Err(2) = 2 and Err(3) = 6, so k = 2 is chosen.
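A minimal sketch of this leave-one-out selection in plain Python (the helper function and its name are ours; any k-NN classifier could be plugged in):

```python
import math
from collections import Counter

def majority_vote_knn(train, x, k):
    """Predict by majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_choose_k(train, max_k):
    """Compute Err(k) for k = 1..max_k by leaving each training point out in turn."""
    err = {}
    for k in range(1, max_k + 1):
        err[k] = sum(
            majority_vote_knn(train[:i] + train[i + 1:], x, k) != y
            for i, (x, y) in enumerate(train)
        )
    return min(err, key=err.get), err

data = [([1, 1], "+"), ([1, 2], "+"), ([2, 2], "+"), ([5, 5], "-"), ([6, 5], "-"), ([6, 6], "-")]
print(leave_one_out_choose_k(data, max_k=3))
```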

K-nearest neighbor for discrete classes

Algorithm (parameter k):

For each training example (X, C(X)), add the example to the training list. When a new example Xq arrives, assign the class:

C(Xq) = majority vote among the k nearest neighbors of Xq
C(Xq) = argmax_v Σi δ(v, C(Xi))

where δ(a, b) = 1 if a = b and 0 otherwise, and the sum runs over the k nearest neighbors Xi of Xq.
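A sketch of this voting rule written to mirror the argmax/δ formulation (plain Python, Euclidean distance; names are ours):

```python
import math

def delta(a, b):
    """δ(a, b) = 1 if a = b and 0 otherwise."""
    return 1 if a == b else 0

def knn_classify(training, x_q, k):
    """C(Xq) = argmax_v Σi δ(v, C(Xi)), the sum running over the k nearest neighbors of Xq.
    `training` is a list of (attribute_vector, class_label) pairs."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], x_q))[:k]
    classes = {c for _, c in training}
    return max(classes, key=lambda v: sum(delta(v, c_i) for _, c_i in nearest))

examples = [([0, 0], "A"), ([0, 1], "A"), ([5, 5], "B"), ([6, 5], "B")]
print(knn_classify(examples, [1, 1], k=3))   # "A"
```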

K-nearest neighbor for real-valued functions

Algorithm (parameter k):

For each training example (X, C(X)), add the example to the training list.

When a new example Xq arrives, assign the value:

C(Xq) = average value among the k nearest neighbors of Xq
C(Xq) = Σi C(Xi) / k
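The corresponding sketch for the real-valued case (names are ours):

```python
import math

def knn_regress(training, x_q, k):
    """C(Xq) = average target value of the k training examples nearest to Xq.
    `training` is a list of (attribute_vector, real_value) pairs."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], x_q))[:k]
    return sum(y for _, y in nearest) / k
```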

Distance Weighted Nearest Neighbor

It makes sense to weight the contribution of each example according to the distance to the new query example.

C(Xq) = argmax v Σi wi δ(v, C(Xi))

For example, wi = 1 / d(Xq,Xi)
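A sketch of the weighted vote with wi = 1 / d(Xq, Xi); the small eps that guards against division by zero when the query coincides with a training point is our addition:

```python
import math
from collections import defaultdict

def weighted_knn_classify(training, x_q, k, eps=1e-12):
    """Weighted majority vote: each of the k nearest neighbors votes with weight 1/d."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], x_q))[:k]
    scores = defaultdict(float)
    for x_i, c_i in nearest:
        scores[c_i] += 1.0 / (math.dist(x_i, x_q) + eps)   # wi * δ(v, C(Xi))
    return max(scores, key=scores.get)
```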

Nearest Neighbor

Four things make a memory-based learner:

1. A distance metric: Euclidean
2. How many nearby neighbors to look at? k
3. A weighting function (optional): wi = 1 / d(Xq, Xi)
4. How to fit with the local points? Just predict the same output as the nearest neighbor.

Distance Weighted Nearest Neighbor for Real-Valued Functions

For real-valued functions we average based on the weight function and normalize by the sum of all weights.

C(Xq) = Σi wi C(Xi) / Σi wi
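And the weighted, normalized average for real-valued targets (same assumptions as the previous sketch):

```python
import math

def weighted_knn_regress(training, x_q, k, eps=1e-12):
    """C(Xq) = Σi wi C(Xi) / Σi wi over the k nearest neighbors, with wi = 1/d(Xq, Xi)."""
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], x_q))[:k]
    weights = [1.0 / (math.dist(x_i, x_q) + eps) for x_i, _ in nearest]
    values = [y for _, y in nearest]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)
```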

Problems with k-nearest Neighbor

The distance between examples is based on all attributes. What if some attributes are irrelevant?

Consider the curse of dimensionality: the larger the number of irrelevant attributes, the stronger their effect on the nearest-neighbor rule.

One solution is to use weights on the attributes. This is like stretching or contracting the dimensions of the input space.

Ideally we would like to eliminate all irrelevant attributes.

Locally Weighted Regression

Let's recall some terminology:

Regression: a problem similar to classification, but the value to predict is a real number.

Residual: the difference between the true target value f and our approximation f': f(X) - f'(X).

Kernel function: the function that provides a weight to each example; K is a function of the distance between examples: K = f(d(Xi, Xq)).

Locally Weighted Regression

The method is called locally weighted regression for the following reasons:

“Locally” because the predicted value for an example Xq is based only on the vicinity or neighborhood around Xq.

“Weighted” because the contribution of each neighbor of Xq will depend on the distance between the neighbor example and Xq.

“Regression” because the value to predict will be a real number.

Locally Weighted Regression

Consider the problem of approximating a target function using a linear combination of attribute values:

f'(X) = w0 + w1 x1 + w2 x2 + … + wn xn, where X = (x1, x2, …, xn)

We want to find the coefficients that minimize the error:

E = ½ Σk [f(X) - f'(X)]²

Locally Weighted Regression

If we do this in the vicinity of an example Xq and we wish to use a kernel function, we get a form of locally weighted regression:

E(Xq) = ½ Σk [f(X) - f'(X)]² K(d(Xq, X))

where the sum now runs over the neighbours of Xq.

Locally Weighted Regression

Using gradient descent search, the update rule is defined as:

Δwj = η Σk [f(X) - f'(X)] K(d(Xq, X)) xj

where η is the learning rate and xj is the j-th attribute of example X.
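A sketch of one such gradient step in plain Python, assuming the Gaussian kernel mentioned on the next slide (the function names, the kernel width and the learning-rate default are ours):

```python
import math

def gaussian_kernel(dist, width=1.0):
    """K(d) = exp(-d² / width²), one common kernel choice."""
    return math.exp(-(dist ** 2) / (width ** 2))

def lwr_gradient_step(w, neighbours, x_q, lr=0.01, width=1.0):
    """One step of Δwj = η Σ [f(X) - f'(X)] K(d(Xq, X)) xj over the neighbours of Xq.
    Each attribute vector (including x_q) starts with a constant 1 so that w[0] plays
    the role of w0; the constant contributes nothing to the distances."""
    new_w = list(w)
    for j in range(len(w)):
        grad_j = 0.0
        for x, f_x in neighbours:                           # (attributes, target) pairs
            f_hat = sum(wi * xi for wi, xi in zip(w, x))    # f'(X)
            grad_j += (f_x - f_hat) * gaussian_kernel(math.dist(x, x_q), width) * x[j]
        new_w[j] = w[j] + lr * grad_j
    return new_w
```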

Locally Weighted Regression

Here are some commonly used weighting functions (we use a Gaussian).

Nearest Neighbor

1. A distance metric: scaled Euclidean
2. How many nearby neighbors to look at? All of them
3. A weighting function (optional): wk = exp( -D(xk, x_query)² / Kw² )
4. How to fit with the local points? First form a local linear model, then find the β that minimizes the locally weighted sum of squared residuals:

β* = argmin_β Σk wk ( yk - βᵀ xk )²
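A minimal numpy sketch of that local fit; the kernel width `kw` is a hypothetical parameter name, and the weighted least-squares problem is solved by rescaling rows with √wk:

```python
import numpy as np

def locally_weighted_fit(X, y, x_query, kw=1.0):
    """Fit a local linear model around x_query and return its prediction there.
    X: (n, d) matrix of training inputs, y: (n,) vector of targets."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])      # add a bias column
    q1 = np.concatenate([[1.0], x_query])
    d = np.linalg.norm(X - x_query, axis=1)            # distances to the query
    w = np.exp(-(d ** 2) / kw ** 2)                    # Gaussian weights wk
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(X1 * sw, y * np.sqrt(w), rcond=None)
    return q1 @ beta                                   # prediction at x_query
```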

Locally Weighted Regression

Remarks:

The literature also contains nonlinear local models.

There are many variations of locally weighted regression that use different kernel functions.

Normally a linear model is sufficiently good to approximate the local neighborhood of an example.

Instance reduction

eduardopoggi@yahoo.com.ar

eduardo-poggi

http://ar.linkedin.com/in/eduardoapoggi

https://www.facebook.com/eduardo.poggi

@eduardoapoggi

