Similarity and Dissimilarity
TRANSCRIPT
Lect 09/10-08-09 1
Similarity and Dissimilarity
• Similarity: a numerical measure of the degree to which two objects are alike.
• The value of a similarity measure is high when the two objects are very similar.
• If the scale is [0, 1], then 0 indicates no similarity and 1 indicates complete similarity.
• Dissimilarity: a numerical measure of the degree to which two objects are different.
– Dissimilarity is lower for a similar pair of objects.
• Transformations:
– Suppose the similarity between two objects ranges from 1 (not at all similar) to 10 (completely similar).
– The values can be transformed into the range [0, 1] as follows:
• S' = (S - 1)/9, where S and S' are the original and transformed similarity values.
• This can be generalized as:
– S' = (s - min_s)/(max_s - min_s)
– Similarly, dissimilarity can be mapped onto the interval [0, 1]:
– D' = (d - min_d)/(max_d - min_d)
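The min-max transformations above can be sketched in a few lines of Python (the function name `rescale` is illustrative, not from the lecture):

```python
def rescale(values):
    """Min-max transformation: map a list of values onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Similarities on a 1 (not at all similar) to 10 (completely similar) scale;
# for this range the formula reduces to the special case S' = (S - 1) / 9.
s = [1, 5.5, 10]
print(rescale(s))  # [0.0, 0.5, 1.0]
```

The same function serves for dissimilarities, since D' = (d - min_d)/(max_d - min_d) has the identical form.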
Similarity & Dissimilarity Between Simple Attributes
• Consider objects having a single attribute.
– One nominal attribute:
• In the context of such an attribute, similarity means whether the two objects have the same value or not.
S = 1 if the attribute values match
S = 0 if the attribute values do not match
Dissimilarity can be defined similarly, in the opposite way.
• One ordinal attribute:
– Order is important and should be taken into account.
– Suppose an attribute measures the quality of a product as follows: {poor, fair, OK, good, wonderful}
• Product P1 = wonderful, Product P2 = good, P3 = OK, and so on.
• To make this observation quantitative, map the values to successive integers, say {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}.
• Then d(P1, P2) = 4 - 3 = 1
• Or, normalized by the largest possible difference, d(P1, P2) = (4 - 3)/4 = 0.25
• A similarity for ordinal attributes can then be defined as s = 1 - d.
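A minimal sketch of these single-attribute measures, using the product-quality scale from the slide (the function names are illustrative):

```python
# Ordered quality scale from the slide, mapped to successive integers.
QUALITY = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}

def nominal_sim(a, b):
    """Nominal attribute: 1 if the values match, 0 otherwise."""
    return 1 if a == b else 0

def ordinal_dissim(a, b, scale=QUALITY):
    """Ordinal attribute: rank difference normalized by the largest gap."""
    n = len(scale) - 1
    return abs(scale[a] - scale[b]) / n

def ordinal_sim(a, b):
    """Similarity defined as s = 1 - d."""
    return 1 - ordinal_dissim(a, b)

print(ordinal_dissim("wonderful", "good"))  # (4 - 3) / 4 = 0.25
print(ordinal_sim("wonderful", "good"))     # 0.75
```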
• For interval or ratio attributes, the dissimilarity between two objects is the absolute difference of their attribute values: d = |p - q|.
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects. [Table summarizing the definitions above for nominal, ordinal, and interval/ratio attributes.]
Dissimilarity between data objects
• Distances:
• The Euclidean distance between two data objects (points) x and y is

  d(x, y) = sqrt( sum_{k=1}^{n} (x_k - y_k)^2 )

where n is the number of dimensions and x_k and y_k are the kth attributes of x and y.
Euclidean Distance
[Figure: the four points p1-p4 plotted in the plane]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrix
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
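The distance matrix can be reproduced with a short script using only the formula above (point names and coordinates taken from the slide):

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(x, y):
    """d(x, y) = sqrt( sum over k of (x_k - y_k)^2 )"""
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

# Print one row of the distance matrix per point.
for a in points:
    row = [round(euclidean(points[a], points[b]), 3) for b in points]
    print(a, row)
```

Running this reproduces the matrix, e.g. d(p1, p2) = sqrt(4 + 4) = 2.828 and d(p1, p4) = sqrt(25 + 1) = 5.099.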
Minkowski Distance
• The Minkowski distance is a generalization of the Euclidean distance:

  dist(p, q) = ( sum_{k=1}^{n} |p_k - q_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• r = 2. Euclidean distance.
• r → ∞. "Supremum" (L_max norm, L_∞ norm) distance.
– This is the maximum difference between any components of the two vectors.
• Do not confuse r with n; all of these distances are defined for any number of dimensions.
Minkowski Distance
Distance Matrix
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
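A single function covers all three cases; the L1, L2, and L∞ entries above for the pair (p1, p4) can be checked directly (the function name is illustrative):

```python
def minkowski(x, y, r):
    """Minkowski distance: (sum_k |x_k - y_k|^r)^(1/r); r = inf gives the supremum."""
    if r == float("inf"):  # supremum (L_max) distance: largest component difference
        return max(abs(xk - yk) for xk, yk in zip(x, y))
    return sum(abs(xk - yk) ** r for xk, yk in zip(x, y)) ** (1 / r)

p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))             # L1: |0-5| + |2-1| = 6
print(round(minkowski(p1, p4, 2), 3))   # L2: sqrt(26) = 5.099
print(minkowski(p1, p4, float("inf")))  # L_inf: max(5, 1) = 5
```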
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well-known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
• A distance that satisfies these properties is a metric.
Example of a Non-Metric Dissimilarity
• Two sets A = {1, 2, 3, 4}, B = {2, 3, 4}
• A - B = {1}, B - A = { }
• Define the distance d between A and B as d(A, B) = size(A - B), where size is a function returning the number of elements in a set.
• This distance measure does not satisfy the positivity property (d(B, A) = 0 even though A ≠ B) and is not symmetric (d(A, B) = 1 but d(B, A) = 0).
• These properties hold if the dissimilarity measure is modified as follows:
D(A, B) = size(A - B) + size(B - A).
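The two measures can be compared directly with Python's built-in set difference (function names are illustrative):

```python
def d(a, b):
    """size(A - B): not symmetric, so this is not a metric."""
    return len(a - b)

def d_sym(a, b):
    """Modified measure: size(A - B) + size(B - A)."""
    return len(a - b) + len(b - a)

A, B = {1, 2, 3, 4}, {2, 3, 4}
print(d(A, B), d(B, A))          # 1 0  -> d(B, A) = 0 although A != B
print(d_sym(A, B), d_sym(B, A))  # 1 1  -> symmetric, and 0 only when A == B
```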
Common Properties of a Similarity
• Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points (data objects), p and q.
Similarity Between Binary Vectors
• A common situation is that objects p and q have only n binary attributes.
• p and q are then binary vectors, which leads to the following four quantities.
• Compute similarities using these quantities:
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching Coefficient (a similarity coefficient):
SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)
SMC versus Jaccard: Example
J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)

Ex. p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
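A short Python sketch that counts the four quantities and reproduces both coefficients for the example vectors (the function name is illustrative):

```python
def smc_and_jaccard(p, q):
    """Return (SMC, Jaccard) for two equal-length binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    # Guard against two all-zero vectors, where Jaccard is undefined.
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 1.0
    return smc, jaccard

p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc_and_jaccard(p, q))  # (0.7, 0.0)
```

The example shows why Jaccard is preferred for sparse (mostly-zero) data: the many 0-0 matches inflate SMC but do not affect J.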
Cosine Similarity
• This similarity measure is commonly used for documents.
• Documents are often represented as vectors.
• Each attribute represents the frequency of a word.
• Each document may have thousands or tens of thousands of attributes.
• Cosine similarity is one of the most common measures of document similarity.
Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
where · indicates the vector dot product and ||d|| is the length of vector d.
• Example: consider two document vectors
  d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
  d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 0.3150
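The worked example can be verified with a few lines of Python (the function name is illustrative):

```python
import math

def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)"""
    dot = sum(a * b for a, b in zip(d1, d2))
    len1 = math.sqrt(sum(a * a for a in d1))
    len2 = math.sqrt(sum(b * b for b in d2))
    return dot / (len1 * len2)

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(d1, d2), 4))  # 0.315
```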
• If the cosine similarity is 1, the angle between x and y is 0° and x and y are the same except for magnitude; if it is 0, the angle is 90° and the vectors share no common attributes.

[Figure: vectors x and y separated by angle θ]
Extended Jaccard Coefficient (Tanimoto)
• Can be used for document data.
• Reduces to the Jaccard coefficient for binary attributes.
• EJ(p, q) = (p · q) / (||p||^2 + ||q||^2 - p · q)
Correlation
• Correlation measures the linear relationship between objects.
• Example (perfect correlation): the following two pairs of vectors x and y give correlations of -1 and +1.
• x = (-3, 6, 0, 3, -6) and y = (1, -2, 0, -1, 2): here y = -x/3, so the correlation is -1.
• x = (3, 6, 0, 3, 6) and y = (1, 2, 0, 1, 2): here y = x/3, so the correlation is +1.
• To compute correlation, we standardize the data objects p and q and then take their dot product:

  p'_k = (p_k - mean(p)) / std(p)
  q'_k = (q_k - mean(q)) / std(q)
  correlation(p, q) = (p' · q') / (n - 1)
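The standardize-then-dot-product recipe can be sketched with the standard library and checked against the perfect-correlation examples above (the function name is illustrative; `statistics.stdev` uses the n - 1 denominator, matching the 1/(n - 1) factor):

```python
import statistics

def correlation(p, q):
    """Standardize p and q, then take their dot product divided by n - 1."""
    n = len(p)
    mp, sp = statistics.mean(p), statistics.stdev(p)
    mq, sq = statistics.mean(q), statistics.stdev(q)
    p_std = [(pk - mp) / sp for pk in p]
    q_std = [(qk - mq) / sq for qk in q]
    return sum(a * b for a, b in zip(p_std, q_std)) / (n - 1)

x = (-3, 6, 0, 3, -6); y = (1, -2, 0, -1, 2)
print(round(correlation(x, y), 6))    # -1.0 (perfect negative correlation)
x2 = (3, 6, 0, 3, 6); y2 = (1, 2, 0, 1, 2)
print(round(correlation(x2, y2), 6))  # 1.0 (perfect positive correlation)
```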
Visually Evaluating Correlation
• Scatter plots showing correlations from -1 to 1.
• It is easy to visualize the correlation between two data objects x and y by plotting pairs of corresponding attribute values.
• [Figure: a set of such scatter plots in which x and y have 30 attributes. Each circle in a plot represents one of the 30 attributes; its x-coordinate is the value of that attribute for x, and its y-coordinate is the value of the same attribute for y.]
Issues in Computing Proximity Measures
• 1. Handling cases where attributes have different scales and/or are correlated.
• 2. Computing proximity between objects that have different types of attributes (qualitative/quantitative).
• 3. Handling attributes with different weights, i.e., when not all attributes contribute equally to the proximity of objects.
Mahalanobis Distance

  mahalanobis(p, q) = (p - q) Σ^{-1} (p - q)^T

where Σ is the covariance matrix of the input data X.

• For the red points in the accompanying figure, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.
• The Mahalanobis distance is useful when attributes are correlated, have different ranges of values, and the distribution is approximately Gaussian.
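A minimal NumPy sketch of the slide's formula, using a small hypothetical data set (not the slide's figure); note that some references take the square root of this quantity:

```python
import numpy as np

def mahalanobis(p, q, X):
    """Slide's definition: (p - q) Sigma^{-1} (p - q)^T, where Sigma is the
    covariance matrix of the data X (rows = observations)."""
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(diff @ cov_inv @ diff)

# Toy data (illustrative): four points whose two attributes are
# uncorrelated, each with variance 1/3, so Sigma^{-1} = 3 * I.
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
print(mahalanobis([0, 0], [1, 1], X))  # (1,1) . 3I . (1,1) ≈ 6.0
```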
Combining Similarities for heterogeneous attributes
• Sometimes attributes are of many different types, but an overall similarity is needed.
Using Weights to Combine Similarities
• Some attributes may be more important than others for calculating proximity.
• We may not want to treat all attributes the same.
– Use weights w_k that are between 0 and 1 and sum to 1; the combined similarity then becomes a weighted average of the per-attribute similarities.
• The Minkowski distance becomes:

  d(p, q) = ( sum_{k=1}^{n} w_k |p_k - q_k|^r )^(1/r)
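A sketch of the weighted Minkowski distance (the function name is illustrative; the weights are assumed to lie in [0, 1] and sum to 1, which is not checked here):

```python
def weighted_minkowski(p, q, w, r):
    """Weighted Minkowski distance: (sum_k w_k * |p_k - q_k|^r)^(1/r)."""
    return sum(wk * abs(pk - qk) ** r
               for wk, pk, qk in zip(w, p, q)) ** (1 / r)

p, q = (0, 2), (5, 1)
w = (0.5, 0.5)  # equal weights
print(weighted_minkowski(p, q, w, 2))  # sqrt(0.5*25 + 0.5*1) = sqrt(13)
```

With equal weights w_k = 1/n, the result is the ordinary Minkowski distance scaled by (1/n)^(1/r).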
Review Questions (Unit 2)
• 1. What are the issues related to data quality? Discuss.
• 2. What is the difference between noise and an outlier?
• 3. Data quality issues can be considered from an application point of view. Discuss.
• 4. Write a short note on data preprocessing.
• 5. Discuss the difference between similarity and dissimilarity.
• 6. State the three axioms that make a distance a metric.
• 7. Discuss similarity measures for binary variables.
• 8. What is SMC?
• 9. How is cosine similarity used for measuring the similarity between two document vectors?
• 10. Discuss issues related to the computation of proximity measures.
• 11. For the following vectors x and y, calculate the indicated similarity or distance measures:
  (a) x = (1, 1, 1, 1), y = (2, 2, 2, 2): cosine, correlation, Euclidean.
  (b) x = (0, 1, 0, 1), y = (1, 0, 1, 0): cosine, correlation, Euclidean, Jaccard, Hamming distance (for binary data the L1 distance corresponds to the Hamming distance).