Similarity and Dissimilarity
TRANSCRIPT
Lect 09/10-08-09 1
Similarity and Dissimilarity
• Similarity: a numerical measure of the degree to which two objects are alike.
• The value of a similarity measure is high when the two objects are very similar.
• If the scale is [0, 1], then 0 indicates no similarity and 1 indicates complete similarity.
• Dissimilarity: a numerical measure of the degree to which two objects are different.
– Dissimilarity is lower for a similar pair of objects.
• Transformations:
– Suppose the similarity between two objects ranges from 1 (not at all similar) to 10 (completely similar).
– The values can be transformed into the range [0, 1] as follows:
• S' = (S - 1)/9, where S and S' are the original and transformed similarity values.
• This can be generalized as:
– S' = (s - min_s)/(max_s - min_s)
– Similarly, dissimilarity can be mapped onto the interval [0, 1]:
– D' = (d - min_d)/(max_d - min_d)
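The min-max transformations above can be sketched in a few lines of Python (the function name `rescale` is illustrative, not from the lecture):

```python
def rescale(values):
    """Min-max transformation: map a list of values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Similarities on a 1 (not at all similar) to 10 (completely similar) scale;
# for this range the formula reduces to the special case S' = (S - 1) / 9.
s = [1, 5.5, 10]
print(rescale(s))  # [0.0, 0.5, 1.0]
```

The same function serves for dissimilarities, since D' = (d - min_d)/(max_d - min_d) has the identical form.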
Similarity & Dissimilarity Between Simple Attributes
• Consider objects having a single attribute.
– One nominal attribute:
• In the context of such an attribute, similarity means whether the two objects have the same value or not.
S = 1 if the attribute values match
S = 0 if the attribute values do not match
Dissimilarity can be defined similarly, in the opposite way.
• One ordinal attribute:
– Order is important and should be taken into account.
– Suppose an attribute measures the quality of a product as follows: {poor, fair, OK, good, wonderful}
• Product P1 = wonderful, Product P2 = good, P3 = OK, and so on.
• To make this observation quantitative, map the values to successive integers, say {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}.
• Then d(P1, P2) = 4 - 3 = 1
• Or, normalized by the largest possible difference, d(P1, P2) = (4 - 3)/4 = 0.25
• A similarity for ordinal attributes can then be defined as s = 1 - d.
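A minimal sketch of these single-attribute measures, using the product-quality scale from the slide (the function names are illustrative):

```python
# Ordered quality scale from the slide, mapped to successive integers.
QUALITY = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}

def nominal_sim(a, b):
    """Nominal attribute: 1 if the values match, 0 otherwise."""
    return 1 if a == b else 0

def ordinal_dissim(a, b, scale=QUALITY):
    """Ordinal attribute: rank difference normalized by the largest gap."""
    n = len(scale) - 1
    return abs(scale[a] - scale[b]) / n

def ordinal_sim(a, b):
    """Similarity defined as s = 1 - d."""
    return 1 - ordinal_dissim(a, b)

print(ordinal_dissim("wonderful", "good"))  # (4 - 3) / 4 = 0.25
print(ordinal_sim("wonderful", "good"))     # 0.75
```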
• For interval or ratio attributes, the dissimilarity between two objects is the absolute difference of their attribute values: d = |p - q|.
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects. [Table summarizing the definitions above for nominal, ordinal, and interval/ratio attributes.]
Dissimilarity between data objects
• Distances:
• The Euclidean distance between two data objects (points) x and y is

  d(x, y) = sqrt( sum_{k=1}^{n} (x_k - y_k)^2 )

where n is the number of dimensions and x_k and y_k are the kth attributes of x and y.
Euclidean Distance
[Figure: the four points p1-p4 plotted in the plane]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrix
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
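The distance matrix can be reproduced with a short script using only the formula above (point names and coordinates taken from the slide):

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    """d(x, y) = sqrt( sum over k of (x_k - y_k)^2 )"""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# Print one row of the distance matrix per point.
for a in points:
    row = [round(euclidean(points[a], points[b]), 3) for b in points]
    print(a, row)
```

Running this reproduces the matrix, e.g. d(p1, p2) = sqrt(4 + 4) = 2.828 and d(p1, p4) = sqrt(25 + 1) = 5.099.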
Minkowski Distance
• The Minkowski distance is a generalization of the Euclidean distance:

  dist(p, q) = ( sum_{k=1}^{n} |p_k - q_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• r = 2. Euclidean distance.
• r → ∞. "Supremum" (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any components of the two vectors.
• Do not confuse r with n; all of these distances are defined for any number of dimensions.
Minkowski Distance
Distance Matrix
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
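A single function covers all three cases; the L1, L2, and L∞ entries above for the pair (p1, p4) can be checked directly (the function name is illustrative):

```python
def minkowski(x, y, r):
    """Minkowski distance: (sum_k |x_k - y_k|^r)^(1/r); r = inf gives the supremum."""
    if r == float("inf"):  # supremum (L_max) distance: largest component difference
        return max(abs(xk - yk) for xk, yk in zip(x, y))
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))             # L1: |0-5| + |2-1| = 6
print(round(minkowski(p1, p4, 2), 3))   # L2: sqrt(26) = 5.099
print(minkowski(p1, p4, float("inf")))  # L_inf: max(5, 1) = 5
```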
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
• A distance that satisfies these properties is a metric.
Example of a Non-Metric Dissimilarity
• Two sets A = {1, 2, 3, 4}, B = {2, 3, 4}
• A - B = {1}, B - A = { }
• Define the distance d between A and B as d(A, B) = size(A - B), where size is a function returning the number of elements in a set.
• This distance measure does not satisfy the positivity property (d(B, A) = 0 even though A ≠ B) and is not symmetric (d(A, B) = 1 but d(B, A) = 0).
• These properties hold if the dissimilarity measure is modified as follows:
D(A, B) = size(A - B) + size(B - A).
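The two measures can be compared directly with Python's built-in set difference (function names are illustrative):

```python
def d(a, b):
    """size(A - B): not symmetric, so this is not a metric."""
    return len(a - b)

def d_sym(a, b):
    """Modified measure: size(A - B) + size(B - A)."""
    return len(a - b) + len(b - a)

A, B = {1, 2, 3, 4}, {2, 3, 4}
print(d(A, B), d(B, A))          # 1 0  -> d(B, A) = 0 although A != B
print(d_sym(A, B), d_sym(B, A))  # 1 1  -> symmetric, and 0 only when A == B
```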
Common Properties of a Similarity
• Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.
Similarity Between Binary Vectors
• A common situation is that objects p and q have only n binary attributes.
• p and q are then binary vectors, which leads to the following four quantities.
• Compute similarities using these quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching Coefficient (a similarity coefficient):
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
SMC versus Jaccard: Example
J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)

Ex. p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
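A short Python sketch that counts the four quantities and reproduces both coefficients for the example vectors (the function name is illustrative):

```python
def smc_and_jaccard(p, q):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    # Guard against two all-zero vectors, where Jaccard is undefined.
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 1.0
    return smc, jaccard

p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc_and_jaccard(p, q))  # (0.7, 0.0)
```

The example shows why Jaccard is preferred for sparse (mostly-zero) data: the many 0-0 matches inflate SMC but do not affect J.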
Cosine Similarity
• This similarity measure is commonly used for documents.
• Documents are often represented as vectors.
• Each attribute represents the frequency of a word.
• Each document may have thousands or tens of thousands of attributes.
• Cosine similarity is one of the most common measures of document similarity.
Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.
• Example: consider two document vectors
  d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 0.3150
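The worked example can be verified with a few lines of Python (the function name is illustrative):

```python
import math

def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)"""
    dot = sum(a * b for a, b in zip(d1, d2))
    len1 = math.sqrt(sum(a * a for a in d1))
    len2 = math.sqrt(sum(b * b for b in d2))
    return dot / (len1 * len2)

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(d1, d2), 4))  # 0.315
```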
• If the cosine similarity is 1, the angle between x and y is 0° and x and y are the same except for magnitude; if it is 0, the angle is 90° and the vectors share no common attributes.

[Figure: vectors x and y separated by angle θ]
Extended Jaccard Coefficient (Tanimoto)
• Can be used for document data.
• Reduces to the Jaccard coefficient for binary attributes.
• EJ(p, q) = (p · q) / (||p||^2 + ||q||^2 - p · q)
Correlation
• Correlation measures the linear relationship between objects.
• Example (perfect correlation): the following two pairs of vectors x and y give correlations of -1 and +1.
• x = (-3, 6, 0, 3, -6) and y = (1, -2, 0, -1, 2): here y = -x/3, so the correlation is -1.
• x = (3, 6, 0, 3, 6) and y = (1, 2, 0, 1, 2): here y = x/3, so the correlation is +1.
• To compute correlation, we standardize the data objects p and q and then take their dot product:

  p'_k = (p_k - mean(p)) / std(p)
  q'_k = (q_k - mean(q)) / std(q)
  correlation(p, q) = (p' · q') / (n - 1)
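The standardize-then-dot-product recipe can be sketched with the standard library and checked against the perfect-correlation examples above (the function name is illustrative; `statistics.stdev` uses the n - 1 denominator, matching the 1/(n - 1) factor):

```python
import statistics

def correlation(p, q):
    """Standardize p and q, then take their dot product divided by n - 1."""
    n = len(p)
    mp, sp = statistics.mean(p), statistics.stdev(p)
    mq, sq = statistics.mean(q), statistics.stdev(q)
    p_std = [(pk - mp) / sp for pk in p]
    q_std = [(qk - mq) / sq for qk in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)

x = (-3, 6, 0, 3, -6); y = (1, -2, 0, -1, 2)
print(round(correlation(x, y), 6))    # -1.0 (perfect negative correlation)
x2 = (3, 6, 0, 3, 6); y2 = (1, 2, 0, 1, 2)
print(round(correlation(x2, y2), 6))  # 1.0 (perfect positive correlation)
```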
Visually Evaluating Correlation
• Scatter plots showing correlations from -1 to 1.
• It is easy to visualize the correlation between two data objects x and y by plotting pairs of corresponding attribute values.
• [Figure: a set of such scatter plots in which x and y have 30 attributes. Each circle in a plot represents one of the 30 attributes; its x-coordinate is the value of that attribute for x, and its y-coordinate is the value of the same attribute for y.]
Issues in Computing Proximity Measures
• 1. Handling cases where attributes have different scales and/or are correlated.
• 2. Computing proximity between objects that have different types of attributes (qualitative/quantitative).
• 3. Handling attributes with different weights, i.e., when not all attributes contribute equally to the proximity of objects.
Mahalanobis Distance

  mahalanobis(p, q) = (p - q) Σ^{-1} (p - q)^T

where Σ is the covariance matrix of the input data X.

• For the red points in the accompanying figure, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.
• The Mahalanobis distance is useful when attributes are correlated, have different ranges of values, and the distribution is approximately Gaussian.
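A minimal NumPy sketch of the slide's formula, using a small hypothetical data set (not the slide's figure); note that some references take the square root of this quantity:

```python
import numpy as np

def mahalanobis(p, q, X):
    """Slide's definition: (p - q) Sigma^{-1} (p - q)^T, where Sigma is the
    covariance matrix of the data X (rows = observations)."""
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(diff @ cov_inv @ diff)

# Toy data (illustrative): four points whose two attributes are
# uncorrelated, each with variance 1/3, so Sigma^{-1} = 3 * I.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
print(mahalanobis([0, 0], [1, 1], X))  # (1,1) . 3I . (1,1) ≈ 6.0
```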
Combining Similarities for heterogeneous attributes
• Sometimes attributes are of many different types, but an overall similarity is needed.
Using Weights to Combine Similarities
• Some attributes may be more important than others for calculating proximity.
• We may not want to treat all attributes the same.
– Use weights w_k that are between 0 and 1 and sum to 1; the combined similarity then becomes a weighted average of the per-attribute similarities.
• The Minkowski distance becomes:

  d(p, q) = ( sum_{k=1}^{n} w_k |p_k - q_k|^r )^(1/r)
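A sketch of the weighted Minkowski distance (the function name is illustrative; the weights are assumed to lie in [0, 1] and sum to 1, which is not checked here):

```python
def weighted_minkowski(p, q, w, r):
    """Weighted Minkowski distance: (sum_k w_k * |p_k - q_k|^r)^(1/r)."""
    return sum(wk * abs(pk - qk) ** r
               for wk, pk, qk in zip(w, p, q)) ** (1 / r)

p, q = (0, 2), (5, 1)
w = (0.5, 0.5)  # equal weights
print(weighted_minkowski(p, q, w, 2))  # sqrt(0.5*25 + 0.5*1) = sqrt(13)
```

With equal weights w_k = 1/n, the result is the ordinary Minkowski distance scaled by (1/n)^(1/r).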
Review Questions (Unit 2)
• 1. What are the issues related to data quality? Discuss.
• 2. What is the difference between noise and an outlier?
• 3. Data quality issues can be considered from an application point of view. Discuss.
• 4. Write a short note on data preprocessing.
• 5. Discuss the difference between similarity and dissimilarity.
• 6. State the three axioms that make a distance a metric.
• 7. Discuss similarity measures for binary variables.
• 8. What is SMC?
• 9. How is cosine similarity used for measuring the similarity between two document vectors?
• 10. Discuss issues related to the computation of proximity measures.
• 11. For the following vectors x and y, calculate the indicated similarity or distance measures:
  (a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean.
  (b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard, Hamming distance (for binary data the L1 distance corresponds to the Hamming distance).