lecture 4b similarity and dissimilarity - fudan...

26
2007-4-1 Data Mining:Tech. and Appl. 1 Lecture 4b Similarity and Dissimilarity Zhou Shuigeng April 1, 2007

Upload: others

Post on 16-Apr-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 1

Lecture 4bSimilarity and Dissimilarity

Zhou Shuigeng

April 1, 2007

Page 2: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 2

Measures of Similarity and Dissimilarity

Used by a number of data mining techniquesClustering

Anomaly detection

Nearest-neighbor classification

Page 3: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 3

Measures of Similarity and Dissimilarity

SimilarityNumerical measure of how alike two data objects are.Higher when objects are more alike.Often falls in the range [0,1]

Dissimilarity (sometimes using Distance)Numerical measure of how different two data objects areLower when objects are more alikeMinimum dissimilarity is often 0Upper limit varies

Proximity refers to a similarity or dissimilarity

Page 4: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 4

Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects

Page 5: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 5

Similarity/Dissimilarity for Data Objects

Dissimilarity (mostly for continuous data)

Euclidean distance

Minkowski distance

Mahalanobis distance

Similarity

Binary data (SMC, Jaccard, cosine, Hamming)

Continuous data (Tanimoto, correlation)

Page 6: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 6

Euclidean Distance

Euclidean Distance

Where n is the number of dimensions (attributes) and pk and

qk are, respectively, the kth attributes (components) or data

objects p and q.

Standardization is necessary, if scales differ

Page 7: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 7

Euclidean Distance

Page 8: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 8

Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance

Where r is a parameter, n is the number of dimensions (attributes) and pk and qk are, respectively, the kthattributes (components) or data objects p and q

Page 9: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 9

Minkowski Distance: Examplesr = 1: City block (Manhattan, taxicab, L1 norm) distance

A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors

r = 2: Euclidean distancer →∞. “supremum” (Lmax norm, L ∞ norm) distance.

This is the maximum difference between any component of the vectors

Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions

Page 10: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 10

Minkowski Distance

Page 11: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 11

Mahalanobis Distance

2

Page 12: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 12

Mahalanobis Distance

Page 13: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 13

Mahalanobis DistanceMahalanobis distance is a generalization of Eulidean distance. MD is degenerated to ED when the covariance matrix is an identity matrixMahalanobis distance is useful when the attributes

are correlated, have different ranges of values (different variance), and the distribution of the data is approximately Gaussian (normal)

Page 14: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 14

Common Properties of A Distance

Distances, such as the Euclidean distance, have some well known properties

where d(p, q) is the distance (dissimilarity) between points (data objects), p and q

A distance that satisfies these properties is a metric

Page 15: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 15

Common Properties of A Similarity

Similarities, also have some well known properties

where s(p, q) is the similarity between points (data objects), p and q

Page 16: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 16

Similarity between Binary VectorsCommon situation is that objects, p and q, have only binary attributesCompute similarities using the following quantities

Simple Matching and Jaccard Coefficients

Page 17: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 17

SMC vs. Jaccard: Example

Page 18: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 18

Cosine Similarity

If d1 and d2 are two document vectors, then

where · indicates vector dot product and || d || is the length of vector d

Example:

Page 19: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 19

Extended Jaccard Coefficient(Tanimoto)

Variation of Jaccard for continuous or count attributes

Reduces to Jaccard for binary attributes

Page 20: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 20

Correlation

Correlation measures the linear relationship between objectsTo compute correlation, we standardize data objects, p and q, and then take their dot product

Page 21: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 21

Visually Evaluating Correlation

Page 22: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 22

Drawback of Correlation

X = (-3, -2, -1, 0, 1, 2, 3)

Y = (9, 4, 1, 0, 1, 4, 9) Y = X2

Mean(X) = 0, Mean(Y) = 4

Correlation

= (-3)(5)+(-2)(0)+(-1)(-3)+(0)(-4)+(1)(-3)+(2)(0)+3(5)

= 0

Page 23: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 23

Bregman Divergence

Bregman divergences are loss or distortion functionsGiven a strictly convex function φ

L(x): represents a plane that is tangent to function φ at y

φφ

Page 24: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 24

Bregman Divergence

Example: squared Euclidean distance

φ(t) = t2 (convex function)

D(x,y) = x2 – y2 -2y(x-y) = (x-y) 2

Page 25: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 25

Sometimes attributes are of many different types, but an overall similarity is needed

General Approach for Combining Similarities

Page 26: Lecture 4b Similarity and Dissimilarity - Fudan Universityadmis.fudan.edu.cn/member/sgzhou/courses/data... · 2007-4-1 Data Mining:Tech. and Appl. 3 Measures of Similarity and Dissimilarity

2007-4-1 Data Mining:Tech. and Appl. 26

Using Weights to Combine Similarities

May not want to treat all attributes the same.Use weights wk which are between 0 and 1 and sum to 1